Archived community.zenoss.org | full text search
7723 Views 5 Replies Latest reply: Dec 9, 2011 5:33 PM by jesse raider RSS
jesse raider Newbie 4 posts since
Aug 4, 2011

Dec 6, 2011 8:20 PM

Anyone else having on-going issues with Zenoss or am I missing something?

Greetings!!

 

I'm fairly new to using Zenoss, and am hoping that I have simply missed some small detail. I have been working with Zenoss for a good 5 months now and just can't seem to iron out the basic features. I have tried to follow instructions from forums (Zenoss and others) and the official documentation, but I usually find that the directions do not match what I see in the GUI, or do not specify whether the changes should be made from the console, the GUI, or somewhere else. I downloaded the documentation using the links in the Zenoss GUI, so it should be the correct version.

 

Our On-going Issues:

 

hm...not sure where to start. I'm hoping the below makes sense as I will do a brain dump...

 

- incorrect filesystem size reporting (found solution on several forums, issue not resolved)

- have to modify snmp configs on all linux servers to report the proper capacity of each NIC (what are we supposed to do with our NAS devices, which do not allow custom snmp config modifications?)

- inaccurate, non-human number reporting for network traffic (found solution, partially working?)

- inaccurate(?) CPU utilization

- sporadic errors with little or no meaningful details

 

Our setup:

OS: Ubuntu 11.04  Linux 2.6.38-8-server

Zenoss: 3.0 core

zenoss-stack: 3.1.0-0                           

 

Incorrect FileSystem Reporting:

 

Our FileSystems report incorrect sizes. I have read and understand the 5% variance on Linux systems. At one point the numbers were WAY off, now seem to be a little off which I can live with. Once after I cleared the "event cache" and "all heart beats" all FileSystem values were being reported as zero (except for "total bytes"). A reboot solved this issue. After that reboot plus a couple of days we started receiving this warning:

 

/Perf/Filesystem  threshold of high disk usage exceeded: current value 1193927.00   Filesystem threshold exceeded: 892.6% used (-1.01 GB free)

 

I understand (to some extent) the above error, but why now, all of a sudden, when the FileSystem has had this usage for a while?
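
For what it's worth, the numbers in that summary can be back-solved. This is only a rough sketch: it assumes the summary was produced by the event transform shown further down in this post (percent = used / total * 100, free = (total - used) * blockSize / 1073741824) and it assumes a 1024-byte block size, which the event itself does not state.

```python
# Back-solve the values behind "892.6% used (-1.01 GB free)".
GIB = 1073741824.0           # divisor used by the transform

used_blocks = 1193927.0      # "current value" from the event
percent_used = 892.6         # from the event summary
block_size = 1024            # ASSUMPTION: not stated anywhere in the event

# percent = used / total * 100, so total = used / (percent / 100)
total_blocks = used_blocks / (percent_used / 100.0)
free_gb = (total_blocks - used_blocks) * block_size / GIB

print("implied total blocks: %.0f" % total_blocks)
print("implied free space:   %.2f GB" % free_gb)
```

Under those assumptions the free-space figure reproduces the -1.01 GB in the event, and the implied total is only about one ninth of the used-block count. That would point at the total size Zenoss holds for this filesystem rather than at real disk usage.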

 

So far, I have modified our FileSystem settings as follows (not sure what terminology I should even use):

 

1.  Add a new zProperty

http://xxx.xx.xx.xxx:8080/zport/dmd/manage

click: Properties (tab)

Add: Name: zFileSystemSizeOffset, Type: float, Value: 1.0 click "add"

 

2. Create a transform rule for FileSystem Events

http://xxx.xx.xx.xxx:8080/zport/dmd/Events/Perf/Filesystem/editEventClassTransform

import re

# Zenoss supplies `device` and `evt` to the transform.
for f in device.os.filesystems():
    if f.name() != evt.component: continue

    # Extract the used-block count ("current value") from the event message
    m = re.search(r"threshold of [^:]+: current value ([\d\.]+)", evt.message)
    if not m: continue
    usedBlocks = float(m.group(1))
    totalBlocks = f.totalBlocks * getattr(device, "zFileSystemSizeOffset", 1)
    if not totalBlocks: continue    # nothing modeled yet; avoid dividing by zero
    p = (usedBlocks / totalBlocks) * 100
    freeAmtGB = ((totalBlocks - usedBlocks) * f.blockSize) / 1073741824

    # Make a nicer summary
    evt.summary = "Filesystem threshold exceeded: %3.1f%% used (%3.2f GB free)" % (p, freeAmtGB)
    break
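
The transform above can only run inside Zenoss, which makes it awkward to check. Below is a minimal standalone sketch of the same parsing and arithmetic so it can be tried from a plain Python prompt. The function name `summarize` and the sample filesystem values are made up for illustration; only the regular expression and the formulas come from the transform.

```python
import re

GIB = 1073741824.0

def summarize(message, total_blocks, block_size, offset=1.0):
    """Return the rewritten summary, or None if the message does not match."""
    m = re.search(r"threshold of [^:]+: current value ([\d\.]+)", message)
    if not m:
        return None
    used = float(m.group(1))
    total = total_blocks * offset          # same role as zFileSystemSizeOffset
    pct = used / total * 100
    free_gb = (total - used) * block_size / GIB
    return "Filesystem threshold exceeded: %3.1f%% used (%3.2f GB free)" % (pct, free_gb)

# Hypothetical filesystem: 1,000,000 blocks of 4096 bytes, 900,000 of them used.
msg = "threshold of high disk usage exceeded: current value 900000.00"
print(summarize(msg, total_blocks=1000000, block_size=4096, offset=0.95))
```

Feeding it a real event message together with the totalBlocks and blockSize values shown on the device's filesystem component is a quick way to see whether a strange percentage comes from the event or from the modeled size.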

 



3. Change Threshold

http://xxx.xx.xx.xxx:8080/zport/dmd/template#templateTree:/zport/dmd/Devices/Server/rrdTemplates/FileSystem
http://xxx.xx.xx.xxx:8080/zport/dmd/Devices/Server/rrdTemplates/FileSystem

double-click: define threshold

change value to:

+ (here.totalBlocks * here.zFileSystemSizeOffset ) * .90

 

4. Go back and modify the zProperty

http://xxx.xx.xx.xxx:8080/zport/dmd/itinfrastructure#devices:.zport.dmd.Devices:configuration properties

-set zFileSystemSizeOffset: 0.95
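
One side effect of steps 3 and 4 is that the two factors multiply. A small sketch of the threshold expression, using a hypothetical filesystem of 1,000,000 blocks (the function name is only for illustration):

```python
def max_used_blocks(total_blocks, offset=0.95, fraction=0.90):
    # Mirrors the threshold expression:
    #   (here.totalBlocks * here.zFileSystemSizeOffset) * .90
    return (total_blocks * offset) * fraction

total = 1000000                     # hypothetical totalBlocks
limit = max_used_blocks(total)
effective = limit / float(total)    # fraction of the raw size that triggers the event

print("threshold fires above %.0f used blocks" % limit)
print("that is %.1f%% of the raw totalBlocks" % (effective * 100))
```

So with an offset of 0.95 and a factor of .90, the event fires at roughly 85.5% of the raw block count, not at 90%.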

 

 

Incorrect Bandwidth Reporting:

 

On every Linux server, add the lines below to the local SNMP agent configuration (snmpd.conf for net-snmp) so that Zenoss stops reading Gbps interfaces as Mbps:

 

override ifSpeed.1 uinteger 1000000000
override ifSpeed.2 uinteger 1000000000
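
Since the same lines are needed on every server, they can be generated instead of typed. A minimal sketch; the helper name and the list of interface indexes are assumptions, and the indexes have to match what the agent actually reports for each host:

```python
UINT32_MAX = 2**32 - 1

def override_lines(if_indexes, bits_per_second=1000000000):
    """Build 'override ifSpeed.<index> uinteger <speed>' lines for snmpd.conf."""
    # ifSpeed is a 32-bit value, so anything larger cannot be expressed this way.
    if not 0 <= bits_per_second <= UINT32_MAX:
        raise ValueError("ifSpeed is a 32-bit value; %d does not fit" % bits_per_second)
    return ["override ifSpeed.%d uinteger %d" % (i, bits_per_second)
            for i in if_indexes]

print("\n".join(override_lines([1, 2])))
```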

Then on Zenoss:

 

http://xxx.xx.xx.xxx:8080/zport/dmd/Events/Perf/Interface/eventClassStatus

 

modify transforms:

 

import re

# Zenoss supplies `device` and `evt` to the transform.
fs_id = device.prepId(evt.component)
for f in device.os.interfaces():
    if f.id != fs_id: continue
    # Extract the current rate (octets per second) from the event message
    m = re.search(r"threshold of [^:]+: current value ([\d\.]+)", evt.message)
    if not m: continue
    if not f.speed: break    # no speed modeled; avoid dividing by zero
    currentusage = float(m.group(1)) * 8    # octets/s -> bits/s
    p = (currentusage / f.speed) * 100
    evtKey = evt.eventKey

    # Whether Input or Output Traffic
    # (for the 32-bit counters the keys start with ifInOctets_ifInOctets / ifOutOctets_ifOutOctets)
    if evtKey == "ifHCInOctets_ifHCInOctets|high utilization":
        evtNewKey = "Input"
    elif evtKey == "ifHCOutOctets_ifHCOutOctets|high utilization":
        evtNewKey = "Output"
    else:
        break    # some other event key; leave the summary untouched

    # Mbps utilization (set for both Input and Output)
    Usage = currentusage / 1000000
    evt.summary = "High " + evtNewKey + " Utilization: Currently (%3.2f Mbps) or %3.2f%% is being used." % (Usage, p)
    break
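
The arithmetic in that transform is easy to sanity-check outside Zenoss. A minimal sketch, assuming the event's "current value" is a rate in octets per second and the interface speed is in bits per second; the function name and the sample numbers are made up:

```python
def utilization(octets_per_second, speed_bps):
    """Return (Mbps, percent of link speed) for a rate given in octets/s."""
    bits_per_second = octets_per_second * 8.0
    mbps = bits_per_second / 1000000.0
    percent = bits_per_second / speed_bps * 100
    return mbps, percent

# Hypothetical reading: 12,500,000 octets/s on a 1 Gbps interface.
mbps, percent = utilization(12500000, 1000000000)
print("High Input Utilization: Currently (%3.2f Mbps) or %3.2f%% is being used."
      % (mbps, percent))
```

If the interface speed is modeled as 100 Mbps when the link is really 1 Gbps, the same reading comes out ten times too high as a percentage, which is the symptom the ifSpeed override works around.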

 

Modify "high utilization":

http://xxx.xx.xx.xxx:8080/zport/dmd/template#templateTree:/zport/dmd/Devices/rrdTemplates/ethernetCsmacd_64

(here.speed or 1e9) / 8 * .1

 

 

Inaccurate Non-human Legible CPU Read-outs:

 

We have the same issues with CPU readouts. I won't bother pasting the modifications for those here unless asked.

 

 

New MySQL Error:

 

As of yesterday, this new warning started:

 

"|mysql|/Status/IpService||5|IP Service mysql is down" - the only change was the installation of PostgreSQL and restarting of the SNMP daemon.

 

 

 

I have spent countless hours so far trying to get Zenoss to report properly, or at least in a manner that is somewhat useful. Is it normal to spend quite some time customizing each new monitored host? I can understand the filesystem issue, which is inherent to Linux systems, but what about the network bandwidth? My modifications are pretty easy to do on Linux systems, but what about our NAS and the other network devices that need to be added? Is it expected that each CPU entry has to be modified as well? It seems a bit odd that so much customization needs to be done.

 

I read great reviews about Zenoss, and wouldn't mind getting it to work for us. I'm sure I missed something. Is anyone able to shed some light on this? Does anyone else have these issues? I would really like to continue using Zenoss instead of switching to another monitoring application.

 

Your insights are greatly appreciated!!

 

Thanks in advance.

  • jmp242 ZenossMaster 4,060 posts since
    Mar 7, 2007

    Welcome to the Zenoss forums. I'm sorry you're having so many issues. It can be normal to spend some time configuring a class of hosts (I'm thinking both in the sense of Zenoss Device Classes, and in the "human" sense of a class being a Linux webserver vs a Windows Domain Controller etc). However, my experience has been that once you have a class defined as a Zenoss Device Class, adding additional hosts of that type comes down to adding the device, setting the Device Class and entering the FQDN.

     

    There are of course little issues that can crop up, or big ones, depending on your specific environment. First, I use Scientific Linux 5, a RHEL derivative. RHEL5 is what I think Zenoss is developed on, so as you get farther from that base OS, some issues occasionally show up.

     

    It can take some time to fully realize an enterprise monitoring system. No matter what one you use, there's going to be tweaking, and as far as I can tell, it's ongoing as OS patches or snmp agents update and change, or as you add new devices. It's very unlikely to ever be set it and forget it.

     

    The reason you see multiple forum posts with different solutions is that many superficially similar issues have different causes. Some people's Windows WMI access issues (for instance) are simply a firewall setting. Others are not using an admin account and need to go through the painful Microsoft permission setup. One person needed a Microsoft patch; another needed to enable NTLMv2 in Zenoss because they had higher security requirements for their Windows servers than the defaults (if I recall correctly). But each of those forum posts was usually summarized as "WMI doesn't work for me".

     

    Final background information: If you have time to learn Zenoss, you can get it set up with help from the community. If you're in a hurry, or don't really want to know the obscure details of your environment (Zenoss and the monitored devices), you may want to consider either a Community Consulting engagement or purchasing Zenoss Enterprise.

     

    Here is what I know off the top of my head; I'll get more info over the next few days to fill in the gaps:

    Filesystem Reporting. It's basically related to what net-snmp shows. There are two issues I've seen commonly reported, and one of them I've experienced myself.

    1. Net-snmp restarts, and re-orders the filesystems that it reports. To simplify, filesystem 1 and filesystem 2 can switch places on a net-snmp restart.

    2. You resize a filesystem.

    Zenoss only detects these changes on a remodel. By default it remodels every 12 hours, but zenmodeler can "zombiefy", so it's potentially a good idea to set up a cron job to restart it every day or so; the restart also triggers a remodel. If you can, also consider having either of the two events above restart zenmodeler. I can think of a few ways to do that, and can go into them if requested.

     

    CPU utilization is very very tricky. It starts with the fact that I've yet to see a consensus as to how you should even calculate it (just at the OS level). There are at least 2 sets of OIDs you could use, and you have to take the number of cores into account. It doesn't help that different OSs do it differently. I personally mostly ignore CPU utilization because of these issues - search the forums for some really long threads on all the factors here. I use load average on Linux instead, and on Windows I take the CPU use with a grain of salt. Now, this lack of care is probably local to my environment, but consider - what are you using the CPU utilization for? Alerts? Planning? It's probably important to understand what the numbers Zenoss is getting mean, and then we can customize that data to make more sense - you've already seen event transforms for munging it into human-readable form - you can also alter RPNs and graph definitions (though you'll lose historical data doing this) to change what the graphs are showing.

     

    To the MySQL error - are you running MySQL on the server? Have you re-modeled? The IP service is basically testing a MySQL port to see if it can talk to it... Is Zenoss supposed to be monitoring that IP service on this device?

     

    --

    James Pulver

    ZCA Member

    LEPP Computer Group

    Cornell University

  • jcurry ZenossMaster 1,021 posts since
    Apr 15, 2008

    Hi Jesse,

    You're right!  Sometimes Zenoss is rather confusing because it is trying to do clever stuff for you that doesn't always suit!

     

    I suspect some of your filesystem issues are to do with big (>2TB) filesystems?  By default, filesystem info is gathered using SNMP to get info from the Host Resources MIB.  The issue here is that almost all SNMP agents only support 32-bit values for filesystems, so for big filesystems the numbers wrap and you get nonsense like the big negative numbers for disk free that you are seeing.  This really isn't Zenoss's fault - it's just doing the best it can with the data provided.  I haven't found an SNMP agent that supports 64-bit values in this area, though I have heard rumours of one.
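
To illustrate the wrap: the Host Resources MIB defines hrStorageSize as a signed 32-bit integer counted in allocation units. The sketch below simply forces a larger count into 32 bits. Whether a particular agent wraps, clamps or does something else is agent-specific, so treat the wrapping behaviour and the 1024-byte unit as assumptions.

```python
def as_signed_32(value):
    """Interpret the low 32 bits of value as a signed integer."""
    value &= 0xFFFFFFFF
    return value - 2**32 if value >= 2**31 else value

unit = 1024                          # ASSUMPTION: allocation unit in bytes
real_units = 3 * 2**40 // unit       # a 3 TiB filesystem, counted in units
reported = as_signed_32(real_units)

print("real size in units:     %d" % real_units)
print("after a 32-bit wrap:    %d" % reported)
print("largest size that fits: %.2f TiB" % ((2**31 - 1) * unit / float(2**40)))
```

With 1024-byte units anything above roughly 2 TiB no longer fits, which lines up with the >2TB rule of thumb.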

     

    The usual way to circumvent the problem is to swap both your filesystem modeler and your filesystem performance template to use a zCommand-style collector, rather than an SNMP one (they are provided in the Core product).  Have you seen the thread at message/62359#62359 ?  As James says, remember that the modeler plugin (sometimes called the collector plugin) collects CONFIGURATION information and is run by zenmodeler every 12 hours (by default).  You need to get that in place first.  The performance template collects PERFORMANCE info, typically every 5 minutes, though if you use the command-based templates you have more control over the interval for performance data collection.  Don't forget that if you swap to the command-based monitoring, you will also need to set up a zCommand user and either a password or an sshkeys entry - so the setup is trickier but you then have more control.  Incidentally, I have not needed to use your clever transforms to make sense of filesystem stuff, but this may be because I can live with the 5% variance that Linux builds in.

     

    The interface utilisation I have seen with a number of clients.  I hadn't found the snmp.conf solution that you document so thanks for that - but, as you say, it requires a lot of extra agent configuration which is a big overhead.  I have resolved this on the Zenoss side by creating extra copies of the interface utilisation template and modifying the formula there.  You then need to make sure that your devices are separated into classes such that they use the right template.  Bear in mind that both the filesystem and the interface templates are COMPONENT templates (not device templates) so you do not bind them manually; they are bound automatically based on the device class.

     

    Another confusion around interface traffic, where Zenoss is clever but not well documented, is that if you are collecting interface data from a device using SNMP V1 then the ethernetCsmacd template will be automatically bound; if you use SNMP V2 then the ethernetCsmacd_64 template will be used.  This is a different template that gets values from the IF MIB rather than from the interface table of MIB-2 and it has a different threshold formula.  This is in order to automatically make use of the 64-bit values that ARE available with these SNMP values, if the newer protocol is supported.  However, it can lead to what looks like very diverse issues, especially if you don't notice which template is actually used (which you can always find out by going to a device's Interface component and using the Display dropdown to show the Template used).

     

    On your mysql error, I suspect that the PostgreSQL install means that the TCP port  for mysql has changed in some way.  Have a look at the definition of that service in Zenoss.  If possible, I try and check ports with

       telnet <box> <service port number>

     

    as a starting point.
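
If telnet is not installed, the same check can be done from Python. A minimal sketch, roughly equivalent to the telnet test above in that it only checks whether a TCP connection can be opened; the function name is made up, and 3306 is MySQL's usual default port, so confirm what the service definition in Zenoss actually uses.

```python
import socket

def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except (socket.error, socket.timeout):
        return False
    sock.close()
    return True

# Example (replace the host with the monitored box):
#   port_is_open("<box>", 3306)
```

A False result from the Zenoss server but True on the box itself would suggest a firewall or a bind-address change rather than MySQL being down.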

     

    Hang in there - it's worth it in the end.

     

    If you want more help, feel free to approach me directly.

     

    Cheers,

    Jane

  • jcurry ZenossMaster 1,021 posts since
    Apr 15, 2008

    Hi Jesse,

    To change the interface high utilization threshold, you obviously have found the template:

     

    Advanced --> Monitoring Templates --> ethernetCsmacd_64

     

    --> /Devices --> Thresholds --> high utilization

     

    Simply double-click on the high utilization threshold and change the:

    Maximum Value: (here.speed or 1e9) / 8 * .1

    to

    Maximum Value: (here.speed or 1e9) / 8 * .7               for 70%

     

    Not sure why you have .1 (10%) on the end, as I believe the out-of-the-box default is .75, i.e. 75%.  You shouldn't need to change anything else.
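
To see what those factors mean in practice, here is a small sketch of the threshold expression. The function name is only for illustration; the formula is the one from the template, and the result is the octets-per-second rate above which the event fires.

```python
def max_octets_per_second(speed_bps=None, fraction=0.75):
    # Mirrors the template expression: (here.speed or 1e9) / 8 * fraction
    # A missing or zero speed falls back to 1 Gbps, as in the template.
    return (speed_bps or 1e9) / 8 * fraction

for fraction in (0.1, 0.7, 0.75):
    limit = max_octets_per_second(1000000000, fraction)
    print("factor %.2f on 1 Gbps -> event above %.1f MB/s" % (fraction, limit / 1e6))
```

With .1 the event fires at about 12.5 MB/s on a gigabit link, which would explain a lot of "high utilization" events on an otherwise healthy interface.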

     

    Cheers,

    Jane
