Dec 6, 2011 8:20 PM
Anyone else having ongoing issues with Zenoss, or am I missing something?
Greetings!!
I'm fairly new to Zenoss and am hoping that I simply missed some small detail. I have been working with Zenoss for a good 5 months now and just can't seem to iron out the basic features. I have tried to follow instructions from forums (Zenoss and others) and the official documentation, but I usually find that the directions do not match what I see in the GUI, or do not specify whether the changes should be made from the console, the GUI, or somewhere else. I downloaded the documentation using the links in the Zenoss GUI, so it should be the correct version.
Our Ongoing Issues:
hm...not sure where to start. I'm hoping the below makes sense as I will do a brain dump...
- incorrect filesystem size reporting (found a solution on several forums, but the issue is not resolved)
- have to modify the SNMP config on every Linux server to report the proper capacity of the NIC (what are we supposed to do with our NAS devices, which do not allow custom SNMP config changes?)
- inaccurate, non-human-readable numbers for network traffic (found a solution, partially working?)
- inaccurate(?) CPU utilization
- sporadic errors with little or no meaningful details
Our setup:
OS: Ubuntu 11.04 Linux 2.6.38-8-server
Zenoss: 3.0 core
zenoss-stack: 3.1.0-0
Incorrect FileSystem Reporting:
Our filesystems report incorrect sizes. I have read about and understand the 5% variance on Linux systems. At one point the numbers were WAY off; now they seem to be only a little off, which I can live with. Once, after I cleared the "event cache" and "all heart beats", all FileSystem values were reported as zero (except for "total bytes"). A reboot solved this issue. A couple of days after that reboot, we started receiving this warning:
/Perf/Filesystem threshold of high disk usage exceeded: current value 1193927.00 Filesystem threshold exceeded: 892.6% used (-1.01 GB free)
I understand the above error (to some extent), but why does it appear all of a sudden when the filesystem has had this usage for a while?
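For context on the 5% variance: as far as I understand it, it comes from blocks that ext filesystems reserve for root by default, so "free" and "available to ordinary users" differ. A minimal sketch, run locally on a monitored Linux host, that shows the gap using the standard os.statvfs call. The function name and the "/" mount point are just illustrative choices, and this says nothing about what Zenoss itself collects over SNMP.

```python
import os

def reserved_block_report(path="/"):
    """Return block counts for the filesystem holding `path`.

    reserved_blocks is the difference between blocks free to root
    (f_bfree) and blocks free to unprivileged users (f_bavail).
    """
    st = os.statvfs(path)
    reserved = st.f_bfree - st.f_bavail
    pct = (float(reserved) / st.f_blocks * 100) if st.f_blocks else 0.0
    return {
        "total_blocks": st.f_blocks,
        "free_blocks": st.f_bfree,
        "avail_blocks": st.f_bavail,
        "reserved_blocks": reserved,
        "reserved_pct": pct,
        "block_size": st.f_frsize,
    }

info = reserved_block_report("/")
print("reserved: %d blocks (%.1f%% of total)"
      % (info["reserved_blocks"], info["reserved_pct"]))
```

On an ext filesystem with the default reservation this should print something close to 5%; on other filesystems it may well print 0.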
So far, I have modified our FileSystem settings as follows (not sure what terminology I should even use):
1. Add a new zProperty
http://xxx.xx.xx.xxx:8080/zport/dmd/manage
click: Properties (tab)
Add: Name: zFileSystemSizeOffset, Type: float, Value: 1.0 click "add"
2. Create a transform rule for FileSystem Events
http://xxx.xx.xx.xxx:8080/zport/dmd/Events/Perf/Filesystem/editEventClassTransform
import re
for f in device.os.filesystems():
    if f.name() != evt.component: continue
    # Extract the current value (used blocks) from the event message
    m = re.search(r"threshold of [^:]+: current value ([\d\.]+)", evt.message)
    if not m: continue
    usedBlocks = float(m.groups()[0])
    totalBlocks = f.totalBlocks * getattr(device, "zFileSystemSizeOffset", 1)
    p = (usedBlocks / totalBlocks) * 100
    freeAmtGB = ((totalBlocks - usedBlocks) * f.blockSize) / 1073741824
    # Make a nicer summary
    evt.summary = "Filesystem threshold exceeded: %3.1f%% used (%3.2f GB free)" % (p,freeAmtGB)
    break
3. Change Threshold
http://xxx.xx.xx.xxx:8080/zport/dmd/template#templateTree:/zport/dmd/Devices/Server/rrdTemplates/FileSystem
http://xxx.xx.xx.xxx:8080/zport/dmd/Devices/Server/rrdTemplates/FileSystem
double-click: define threshold
change value to:
+ (here.totalBlocks * here.zFileSystemSizeOffset ) * .90
4. Go back and modify the zProperty
http://xxx.xx.xx.xxx:8080/zport/dmd/itinfrastructure#devices:.zport.dmd.Devices:configuration properties
-set zFileSystemSizeOffset: 0.95
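The arithmetic in the transform and the threshold expression above can be checked outside Zenoss. The following is only a sketch under stated assumptions: the effective total of 133758 blocks and the block size of 1024 bytes are back-calculated so that the result matches the warning text; they were not read from the device. The function names are hypothetical.

```python
import re

GIB = 1073741824  # same divisor the transform uses for "GB"

def summarize(message, total_blocks, block_size, offset=1.0):
    """Rebuild the event summary the way the transform does."""
    m = re.search(r"threshold of [^:]+: current value ([\d\.]+)", message)
    if not m:
        return None
    used = float(m.group(1))
    total = total_blocks * offset          # totalBlocks * zFileSystemSizeOffset
    pct = used / total * 100
    free_gb = (total - used) * block_size / GIB
    return ("Filesystem threshold exceeded: %3.1f%% used (%3.2f GB free)"
            % (pct, free_gb))

def high_usage_threshold(total_blocks, offset, fraction=0.90):
    """(here.totalBlocks * here.zFileSystemSizeOffset) * .90"""
    return total_blocks * offset * fraction

msg = "threshold of high disk usage exceeded: current value 1193927.00"
# Assumed values: 133758 effective total blocks, 1024-byte blocks
print(summarize(msg, total_blocks=133758, block_size=1024))
print(high_usage_threshold(133758, 0.95))
```

If those assumed values are anywhere near the real ones, the "current value" is roughly nine times the total block count the transform sees. That would suggest the used and total figures are in different units or come from different model data, rather than the disk actually being full.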
Incorrect Bandwidth Reporting:
On every Linux server, add the lines below to the local snmpd configuration (snmpd.conf) so that Zenoss no longer misreads Gbps as Mbps:
override ifSpeed.1 uinteger 1000000000
override ifSpeed.2 uinteger 1000000000
Then on Zenoss:
http://xxx.xx.xx.xxx:8080/zport/dmd/Events/Perf/Interface/eventClassStatus
modify transforms:
import re
fs_id = device.prepId(evt.component)
for f in device.os.interfaces():
    if f.id != fs_id: continue
    # Extract the current value (bytes/sec) from the event message
    m = re.search(r"threshold of [^:]+: current value ([\d\.]+)", evt.message)
    if not m: continue
    # Octets/sec -> bits/sec
    currentusage = (float(m.group(1))) * 8
    p = (currentusage / f.speed) * 100
    evtKey = evt.eventKey
    # Whether Input or Output Traffic
    # if evtKey == "ifInOctets_ifInOctets|high utilization":
    if evtKey == "ifHCInOctets_ifHCInOctets|high utilization":
        evtNewKey = "Input"
    # elif evtKey == "ifOutOctets_ifOutOctets|high utilization":
    elif evtKey == "ifHCOutOctets_ifHCOutOctets|high utilization":
        evtNewKey = "Output"
    else:
        # Any other event key: leave the summary as it is
        break
    # Mbps utilization
    Usage = currentusage / 1000000
    evt.summary = "High " + evtNewKey + " Utilization: Currently (%3.2f Mbps) or %3.2f%% is being used." % (Usage, p)
    break
Modify "high utilization":
(here.speed or 1e9) / 8 * .1
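To sanity-check the interface numbers, the same conversions can be run standalone. This is a sketch, not Zenoss code: it assumes the event's "current value" is in bytes per second and the interface speed is in bits per second, and the function names and sample traffic figure are made up for illustration.

```python
def utilization(bytes_per_sec, speed_bps):
    """Return (Mbps, percent of link speed) for a bytes/sec reading."""
    bits_per_sec = bytes_per_sec * 8
    speed = speed_bps or 1e9               # same 1 Gbps fallback as the threshold
    mbps = bits_per_sec / 1000000.0
    pct = bits_per_sec / float(speed) * 100
    return mbps, pct

def high_utilization_threshold(speed_bps, fraction=0.1):
    """(here.speed or 1e9) / 8 * .1, i.e. the threshold in bytes/sec."""
    return (speed_bps or 1e9) / 8.0 * fraction

# Hypothetical reading: 12,500,000 bytes/sec on a 1 Gbps interface
mbps, pct = utilization(12500000, 1000000000)
print("High Input Utilization: Currently (%3.2f Mbps) or %3.2f%% is being used."
      % (mbps, pct))
print(high_utilization_threshold(1000000000))

# The same traffic against a speed reported as 100 Mbps
print(utilization(12500000, 100000000))
```

The last line shows why the reported speed matters so much: identical traffic goes from 10% to 100% utilization when the interface speed is reported ten times too low.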
Inaccurate, Non-human-readable CPU Readouts:
We have the same issues with the CPU readouts. I won't bother pasting the modifications for that here unless asked.
New MySQL Error:
As of yesterday, this new warning started:
"|mysql|/Status/IpService||5|IP Service mysql is down" - the only change was the installation of PostgreSQL and restarting of the SNMP daemon.
I have spent countless hours so far trying to get Zenoss to report properly, or at least in a manner that is somewhat useful. Is it normal to spend this much time customizing each newly monitored host? I can understand the filesystem issue, which is inherent to Linux systems, but what about the network bandwidth? My modifications are pretty easy to make on Linux systems, but what about our NAS and the other network devices that need to be added? Are we expected to modify each CPU entry as well? It seems a bit odd that so much customization is needed.
I read great reviews about Zenoss, and wouldn't mind getting it to work for us. I'm sure I missed something. Is anyone able to shed some light on this? Does anyone else have these issues? I would really like to continue using Zenoss instead of switching to another monitoring application.
Your insights are greatly appreciated!!
Thanks in advance.