Anyone else having on-going issues with Zenoss or am I missing something?

7723 Views 5 Replies Latest reply: Dec 9, 2011 5:33 PM by jesse raider

jesse raider

4 posts since
Aug 4, 2011

Dec 6, 2011 8:20 PM

Greetings!!

I'm fairly new to Zenoss and am hoping that I have simply missed some minor detail. I have been working with Zenoss for a good 5 months now and just can't seem to iron out the basic features. I have tried to follow instructions from forums (Zenoss and others) and the official documentation, but I usually find that the directions do not match what I see in the GUI, or do not specify whether a change should be made from the console or from the GUI. I downloaded the documentation using the links in the Zenoss GUI, so it should be the correct version.

Our On-going Issues:

hm...not sure where to start. I'm hoping the below makes sense as I will do a brain dump...

- incorrect filesystem size reporting (found solution on several forums, issue not resolved)

- have to modify the SNMP config on every Linux server so that the NIC capacity is reported properly (what are we supposed to do with our NAS devices, which do not allow custom SNMP config changes?)

- inaccurate, non-human-readable numbers for network traffic (found solution, partially working?)

- inaccurate(?) CPU utilization

- sporadic errors with little or no meaningful details

Our setup:

OS: Ubuntu 11.04 Linux 2.6.38-8-server

Zenoss: 3.0 core

zenoss-stack: 3.1.0-0

Incorrect FileSystem Reporting:

Our FileSystems report incorrect sizes. I have read and understand the 5% variance on Linux systems. At one point the numbers were WAY off, now seem to be a little off which I can live with. Once after I cleared the "event cache" and "all heart beats" all FileSystem values were being reported as zero (except for "total bytes"). A reboot solved this issue. After that reboot plus a couple of days we started receiving this warning:

/Perf/Filesystem threshold of high disk usage exceeded: current value 1193927.00 Filesystem threshold exceeded: 892.6% used (-1.01 GB free)

I understand (to some extent) the above warning, but why did it appear all of a sudden when the FileSystem has had this usage for a while?

So far, I have modified our FileSystem settings as follows (not sure what terminology I should even use):

1. Add a new zProperty

http://xxx.xx.xx.xxx:8080/zport/dmd/manage

click: Properties (tab)

Add: Name: zFileSystemSizeOffset, Type: float, Value: 1.0 click "add"

2. Create a transform rule for FileSystem Events

http://xxx.xx.xx.xxx:8080/zport/dmd/Events/Perf/Filesystem/editEventClassTransform

import re

for f in device.os.filesystems():
    if f.name() != evt.component: continue

    # Extract the current value (used blocks) from the event message
    m = re.search(r"threshold of [^:]+: current value ([\d\.]+)", evt.message)
    if not m: continue
    usedBlocks = float(m.groups()[0])
    # Scale the modeled size by the offset zProperty (defaults to 1)
    totalBlocks = f.totalBlocks * getattr(device, "zFileSystemSizeOffset", 1)
    if not totalBlocks: break  # nothing modeled yet; avoid dividing by zero
    p = (usedBlocks / totalBlocks) * 100
    freeAmtGB = ((totalBlocks - usedBlocks) * f.blockSize) / 1073741824

    # Make a nicer summary
    evt.summary = "Filesystem threshold exceeded: %3.1f%% used (%3.2f GB free)" % (p,freeAmtGB)
    break

3. Change Threshold

http://xxx.xx.xx.xxx:8080/zport/dmd/template#templateTree:/zport/dmd/Devices/Server/rrdTemplates/FileSystem
http://xxx.xx.xx.xxx:8080/zport/dmd/Devices/Server/rrdTemplates/FileSystem

double-click: define threshold

change value to:

(here.totalBlocks * here.zFileSystemSizeOffset) * .90

4. Go back and modify the zProperty

http://xxx.xx.xx.xxx:8080/zport/dmd/itinfrastructure#devices:.zport.dmd.Devices:configuration properties

-set zFileSystemSizeOffset: 0.95
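
The arithmetic in the transform and the threshold above can be checked outside Zenoss. The following is only a sketch: the function names and sample block counts are made up for illustration, and nothing here calls the Zenoss API. The last call uses a total of 133758 blocks of 1024 bytes, which is a value back-solved from the warning quoted earlier (892.6% used, -1.01 GB free), not something read from the server.

```python
GIB = 1073741824  # bytes per GiB, the same divisor the transform uses


def filesystem_summary(used_blocks, total_blocks, block_size, size_offset=1.0):
    """Rebuild the transform's summary line from raw block counts."""
    usable_blocks = float(total_blocks) * size_offset
    percent_used = used_blocks / usable_blocks * 100
    free_gb = (usable_blocks - used_blocks) * block_size / GIB
    return "Filesystem threshold exceeded: %3.1f%% used (%3.2f GB free)" % (
        percent_used, free_gb)


def threshold_blocks(total_blocks, size_offset=0.95, fraction=0.90):
    """Block count at which the modified threshold expression fires."""
    return float(total_blocks) * size_offset * fraction


# Hypothetical filesystem: 10,000,000 blocks of 4096 bytes, 9,000,000 in use.
print(filesystem_summary(9000000, 10000000, 4096, size_offset=0.95))
print(threshold_blocks(10000000))

# Back-solved guess at the modeled size behind the warning quoted earlier.
print(filesystem_summary(1193927, 133758, 1024))
```

If the back-solved total is roughly right, the modeled size was far smaller than the used-block count being reported, which would be consistent with a stale or mismatched model rather than a genuinely full disk.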

Incorrect Bandwidth Reporting:

On every Linux server, add the lines below to the local SNMP agent config (snmpd.conf for net-snmp) to address Zenoss misreading Gbps as Mbps:

override ifSpeed.1 uinteger 1000000000
override ifSpeed.2 uinteger 1000000000

Then on Zenoss:

http://xxx.xx.xx.xxx:8080/zport/dmd/Events/Perf/Interface/eventClassStatus

modify transforms:

import re

fs_id = device.prepId(evt.component)
for f in device.os.interfaces():
    if f.id != fs_id: continue
    # Extract the current value (bytes/sec) from the event message
    m = re.search(r"threshold of [^:]+: current value ([\d\.]+)", evt.message)
    if not m: continue
    if not f.speed: break  # unknown link speed; avoid dividing by zero
    currentusage = (float(m.group(1))) * 8  # bits/sec
    p = (currentusage / f.speed) * 100
    evtKey = evt.eventKey

    # Whether Input or Output Traffic
    # if evtKey == "ifInOctets_ifInOctets|high utilization":
    if evtKey == "ifHCInOctets_ifHCInOctets|high utilization":
        evtNewKey = "Input"
    # elif evtKey == "ifOutOctets_ifOutOctets|high utilization":
    elif evtKey == "ifHCOutOctets_ifHCOutOctets|high utilization":
        evtNewKey = "Output"
    else:
        break  # some other event key; leave the summary untouched
    # Mbps utilization (outside the elif so Input events are rewritten too)
    Usage = currentusage / 1000000
    evt.summary = "High " + evtNewKey + " Utilization: Currently (%3.2f Mbps) or %3.2f%% is being used." % (Usage, p)
    break

Modify "high utilization":

http://xxx.xx.xx.xxx:8080/zport/dmd/template#templateTree:/zport/dmd/Devices/rrdTemplates/ethernetCsmacd_64

(here.speed or 1e9) / 8 * .1

Inaccurate Non-human Legible CPU Read-outs:

We have the same issues with CPU readouts. I won't bother pasting the modifications for that here unless asked.

New MySQL Error:

As of yesterday, this new warning started:

"|mysql|/Status/IpService||5|IP Service mysql is down" - the only change was the installation of PostgreSQL and restarting of the SNMP daemon.

I have spent countless hours so far trying to get Zenoss to report properly, or at least in a manner that is somewhat useful. Is it normal to spend quite some time customizing each new monitored host? I can understand the filesystem issue, which is inherent to Linux systems, but what about the network bandwidth? My modifications are pretty easy to do on Linux systems, but what about our NAS and other network devices that need to be added? Is it expected to modify each CPU entry as well? It seems a bit odd that so much customization needs to be done.

I read great reviews about Zenoss, and wouldn't mind getting it to work for us. I'm sure I missed something. Is anyone able to shed some light on this? Does anyone else have these issues? I would really like to continue using Zenoss instead of switching to another monitoring application.

Your insights are greatly appreciated!!

Thanks in advance.

jmp242
4,060 posts since
Mar 7, 2007

1. Dec 7, 2011 8:41 AM (in response to jesse raider)
Re: Anyone else having on-going issues with Zenoss or am I missing something?

Welcome to the Zenoss forums. I'm sorry you're having so many issues. It can be normal to spend some time configuring a class of hosts (I'm thinking both in the sense of Zenoss Device Classes, and in the "human" sense of a class being a Linux webserver vs a Windows Domain Controller etc). However, my experience has been that once you have a class defined as a Zenoss Device Class, adding additional hosts of that type comes down to adding the device, setting the Device Class and entering the FQDN.

There are of course little issues that can crop up, or big ones, depending on your specific environment. First, I use the RHEL derivative Scientific Linux 5. RHEL5 is what I think Zenoss is developed on, so as you get farther from that base OS, some issues occasionally show up.

It can take some time to fully realize an enterprise monitoring system. No matter what one you use, there's going to be tweaking, and as far as I can tell, it's ongoing as OS patches or snmp agents update and change, or as you add new devices. It's very unlikely to ever be set it and forget it.

The reason you see multiple forum posts with different solutions is that many superficially similar issues have different causes. Some people's Windows WMI access issues (for instance) are simply a firewall setting. Others are not using an admin account and need to go through the painful Microsoft permission setup. One person needed a Microsoft patch, another needed to enable NTLMv2 in Zenoss because they had higher security requirements for their Windows servers than the defaults (if I recall correctly). But for each, the forum post was usually summarized as "WMI doesn't work for me".

Final background information: If you have time to learn Zenoss, you can get it set up with help from the community. If you're in a hurry, or don't really want to know the obscure details of your environment (Zenoss and the monitored devices), you may want to consider either a Community Consulting engagement or purchasing Zenoss Enterprise.

Here is what I know off the top of my head; I'll get more info over the next few days to fill in the gaps:
Filesystem Reporting. It's basically related to what net-snmp shows. There are two issues I've seen commonly reported, and one of them I've experienced myself.
1. Net-snmp restarts, and re-orders the filesystems that it reports. To simplify, filesystem 1 and filesystem 2 can switch places on a net-snmp restart.
2. You resize a filesystem.
Zenoss only detects these changes on a remodel. By default it remodels every 12 hours, but zenmodeler can "zombiefy" and so it's potentially a good idea to set up a cron job to restart it every day or so. This will also trigger a re-model. One thing to consider is if you can, have either of the two events above restart zenmodeler. I can think of a few ways to do that, and can go into them if requested.

CPU utilization is very, very tricky. It starts with the fact that I've yet to see a consensus as to how you should even calculate it (just at the OS level). There are at least 2 sets of OIDs you could use, and you have to take the number of cores into account. It doesn't help that different OSs do it differently. I personally mostly ignore CPU utilization because of the issues - search the forums for some really long threads on all the factors here. I use load average on Linux instead, and on Windows I take the CPU use with a grain of salt. Now, this lack of care is probably local to my environment, but consider - what are you using the CPU utilization for? Alerts? Planning? It's probably important to understand what the numbers Zenoss is getting mean, and then we can customize that data to make more sense - you've already seen event transforms for munging it to human readability - you can also alter RPNs and graph definitions (though you'll lose historical data doing this) to change what the graphs are showing.

To the MySQL error - are you running MySQL on the server? Have you re-modeled? The IP service is basically testing a MySQL port to see if it can talk to it... Is Zenoss supposed to be monitoring that IP service on this device?

--
James Pulver
ZCA Member
LEPP Computer Group
Cornell University


jcurry
1,021 posts since
Apr 15, 2008

2. Dec 8, 2011 4:27 AM (in response to jmp242)
Re: Anyone else having on-going issues with Zenoss or am I missing something?

Hi Jesse,
You're right! Sometimes Zenoss is rather confusing because it is trying to do clever stuff for you that doesn't always suit!

I suspect some of your filesystem issues are to do with big (>2TB) filesystems?? By default, filesystem info is gathered using SNMP to get info from the Host Resources MIB. The issue here is that almost all SNMP agents only support 32-bit values for filesystems, so for big filesystems the numbers wrap and you get nonsense like big negative numbers for disk free, as you are seeing. This really isn't Zenoss's fault - it's just doing the best it can with the data provided. I haven't found an SNMP agent that supports 64-bit values in this area, though I have heard rumours of one.
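
To make the wrap-around concrete, here is a small sketch of what happens when a block count no longer fits in a signed 32-bit value. The 1024-byte allocation unit and the helper name are assumptions for illustration; a given agent may wrap, clamp, or behave differently.

```python
INT32_MAX = 2**31 - 1  # largest value a signed 32-bit integer can hold


def as_signed_int32(value):
    """Interpret the low 32 bits of value as a signed 32-bit integer."""
    value &= 0xFFFFFFFF
    return value - 2**32 if value > INT32_MAX else value


ALLOCATION_UNIT = 1024  # assumed allocation unit size in bytes

for tib in (1, 3):
    units = tib * 2**40 // ALLOCATION_UNIT
    print("%d TiB = %d units, read back as %d" % (tib, units, as_signed_int32(units)))
```

With 1024-byte units the limit falls at 2 TiB, which matches the >2TB rule of thumb; larger allocation units push the limit higher.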

The usual way to circumvent the problem is to swap both your filesystem modeler and your filesystem performance template to use a zCommand-style collector, rather than an SNMP one (they are provided in the Core product). Have you seen the thread at message/62359#62359 ? As James says, remember that the modeler plugin (sometimes called the collector plugin) collects CONFIGURATION information and is run by zenmodeler every 12 hours (by default). You need to get that in place first. The performance template collects PERFORMANCE info, typically every 5 minutes, though if you use the command-based templates you have more control over the interval for performance data collection. Don't forget that if you swap to the command-based monitoring, you will also need to setup a zCommand user and either a password or an sshkeys entry - so the setup is trickier but you then have more control. Incidentally, I have not needed to use your clever transforms to make sense of filesystem stuff, but this may be because I can live with the 5% variance that Linux builds in.

The interface utilisation I have seen with a number of clients. I hadn't found the snmp.conf solution that you document so thanks for that - but, as you say, it requires a lot of extra agent configuration which is a big overhead. I have resolved this on the Zenoss side by creating extra copies of the interface utilisation template and modifying the formula there. You then need to make sure that your devices are separated into classes such that they use the right template. Bear in mind that both the filesystem and the interface templates are COMPONENT templates (not device templates) so you do not bind them manually; they are bound automatically based on the device class.

Another confusion around interface traffic, where Zenoss is clever but not well documented, is that if you are collecting interface data from a device using SNMP V1 then the ethernetCsmacd template will be automatically bound; if you use SNMP V2 then the ethernetCsmacd_64 template will be used. This is a different template that gets values from the IF MIB rather than from the interface table of MIB-2, and it has a different threshold formula. This is in order to automatically make use of the 64-bit values that ARE available for these counters, if the newer protocol is supported. However, it can lead to what look like very diverse issues, especially if you don't notice which template is actually used (which you can always find out by going to a device's Interface component and using the Display dropdown to show the Template used).

On your mysql error, I suspect that the PostgreSQL install means that the TCP port for mysql has changed in some way. Have a look at the definition of that service in Zenoss. If possible, I try and check ports with
telnet <box> <service port number>

as a starting point.
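
Where telnet is not installed, the same reachability check can be sketched in a few lines of Python. The function name is hypothetical and this only tests that a TCP connection can be opened; it says nothing about whether the service behind the port is healthy.

```python
import socket


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        # Refused, timed out, unreachable, or name lookup failed
        return False
    sock.close()
    return True
```

For example, `port_open("dbhost", 3306)` would probe MySQL's default port on a host called dbhost (a placeholder name). Run it from the Zenoss server, since a firewall may treat other source hosts differently.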

Hang in there - it's worth it in the end.

If you want more help, feel free to approach me directly.

Cheers,
Jane

jesse raider
4 posts since
Aug 4, 2011

3. Dec 9, 2011 2:39 PM (in response to jesse raider)
Re: Anyone else having on-going issues with Zenoss or am I missing something?

Hello,

Wow, thank you James and Jane for a) the quick responses and b) such in-depth content. The details provided have given me further insight and helped make sense of what I was experiencing.

On a quick side note, I cannot take any credit for the custom modifications I made, as I copied them from various forums (including this one).

After reading your replies and thinking it over, I think my confusion was due to two main issues. 1. Numerous configuration changes seemed to have no effect, which fits what you mentioned about zenmodeler becoming "zombiefied" (or data being cached). 2. It appears that our Zenoss monitoring stopped "working" for a few weeks, as far as we can tell. This might have something to do with the first point? Due to an unrelated issue we needed to reboot the server where Zenoss is installed. After the reboot, alerts started coming through. A good example is the issue I had with MySQL, which did turn out to be a firewall issue. Because no alerts were arriving, there were no indications of any problems. Then all of a sudden (after the reboot), we started receiving alerts (the "IP Service mysql is down" error I mentioned).

@ Jane, in terms of the filesystems size, ours is pretty small. Looks like after the reboot, the filesizes are reporting correctly now (allowing for the 5% variance). I will read through that discussion you listed as I'm sure there's some good information in there.

@ James, I would like to go ahead and configure the remodeler to be restarted routinely. Could you please post the exact steps on how to do this?

@ Jane or James, one final thing if I may. We are getting alerts for network traffic that exceeds 10%. Could you also post exact instructions on how and where we could configure the threshold to something like 70% or higher? Please find below the modifications I have made so far. I don't mind wiping all the configs and starting from scratch, so please post the steps, starting from a default installation of Zenoss, for raising the threshold to a high value of, let's say, 70%. To the best of my understanding, I set the threshold to 80%.

Our Current Setup:

Advanced --> Monitoring Templates --> ethernetCsmacd_64

--> /Devices --> Thresholds --> high utilization
( http://xxx.xx.xx.xxx:8080/zport/dmd/template#templateTree:/zport/dmd/Devices/rrdTemplates/ethernetCsmacd_64 )

Selected: ifHCInOctets_ifHCInOctets, ifHCOutOctets_ifHCOutOctets

Maximum Value: (here.speed or 1e9) / 8 * .1

Event Class: /Perf/Interface

Events --> Events Classes --> Classes --> /Perf/Interface --> Transform
( http://xxx.xx.xx.xxx:8080/zport/dmd/Events/Perf/Interface/eventClassStatus )

-------------------------------------------------------------------------------
import re

fs_id = device.prepId(evt.component)
for f in device.os.interfaces():
    if f.id != fs_id: continue
    # Extract the current value (bytes/sec) from the event message
    m = re.search(r"threshold of [^:]+: current value ([\d\.]+)", evt.message)
    if not m: continue
    if not f.speed: break  # unknown link speed; avoid dividing by zero
    currentusage = (float(m.group(1))) * 8  # bits/sec
    p = (currentusage / f.speed) * 100
    evtKey = evt.eventKey

    # Whether Input or Output Traffic
    # if evtKey == "ifInOctets_ifInOctets|high utilization":
    if evtKey == "ifHCInOctets_ifHCInOctets|high utilization":
        evtNewKey = "Input"
    # elif evtKey == "ifOutOctets_ifOutOctets|high utilization":
    elif evtKey == "ifHCOutOctets_ifHCOutOctets|high utilization":
        evtNewKey = "Output"
    else:
        break  # some other event key; leave the summary untouched
    # Mbps utilization (outside the elif so Input events are rewritten too)
    Usage = currentusage / 1000000
    evt.summary = "High " + evtNewKey + " Utilization: Currently (%3.2f Mbps) or %3.2f%% is being used." % (Usage, p)
    break
-------------------------------------------------------------------------------

Email Alert we received:

-------------------------------------------------------------------------------
[zenoss] CLEAR: vccesnas01 High Output Utilization: Currently (60.27 Mbps) or 6.03% is being used.

Event: 'High Output Utilization: Currently (111.87 Mbps) or 11.19% is being used.'
Cleared by: 'High Output Utilization: Currently (60.27 Mbps) or 6.03% is being used.'
At: 2011/12/08 05:17:38.000
Device: vccesnas01
Component: eth0
Severity: Warning
Message:
threshold of high utilization exceeded: current value 13984009.51
-------------------------------------------------------------------------------
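
The figures in this alert can be reproduced from the raw message. This is a standalone sketch, not Zenoss code: the function name is made up, and the 1 Gbps link speed is an assumption inferred from the percentage shown in the alert.

```python
import re

MESSAGE = "threshold of high utilization exceeded: current value 13984009.51"


def decode_utilization(message, speed_bps=1e9):
    """Return (Mbps, percent of link) parsed from a threshold message."""
    m = re.search(r"threshold of [^:]+: current value ([\d\.]+)", message)
    if m is None:
        return None
    bits_per_sec = float(m.group(1)) * 8  # the raw value is in bytes/sec
    return bits_per_sec / 1e6, bits_per_sec / speed_bps * 100


mbps, percent = decode_utilization(MESSAGE)
print("%.2f Mbps, %.2f%% of link" % (mbps, percent))  # 111.87 Mbps, 11.19% of link
```

So the raw "current value" is bytes per second, and the alert fired because about 11% utilization is just above the 10% maximum configured in the threshold.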

Thank you again for your efforts. I know it takes time and effort to work on people's requests. It's much appreciated.

You have been very helpful.

jcurry
1,021 posts since
Apr 15, 2008

4. Dec 9, 2011 3:35 PM (in response to jesse raider)
Re: Anyone else having on-going issues with Zenoss or am I missing something?

Hi Jesse,
To change the interface high utilization threshold, you obviously have found the template:

Advanced --> Monitoring Templates --> ethernetCsmacd_64

--> /Devices --> Thresholds --> high utilization

Simply double-click on the high utilization threshold and change the:
Maximum Value: (here.speed or 1e9) / 8 * .1
to
Maximum Value: (here.speed or 1e9) / 8 * .7 for 70%

Not sure why you have .1 (10%) on the end, as I believe the out-of-the-box default is .75, i.e. 75%. You shouldn't need to change anything else.
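
To see what the trailing factor does, the threshold expression can be evaluated by hand. This sketch mirrors the expression in plain Python; the function name is hypothetical, and the 1 Gbps speed and the sample rate of 13984009.51 bytes/sec (taken from the alert posted earlier in the thread) are used purely for illustration.

```python
def max_bytes_per_sec(speed_bps, fraction):
    """Mirror of '(here.speed or 1e9) / 8 * fraction' from the template."""
    return (speed_bps or 1e9) / 8 * fraction


OBSERVED = 13984009.51  # bytes/sec, from the alert posted earlier

for fraction in (0.10, 0.70, 0.75):
    limit = max_bytes_per_sec(1e9, fraction)
    print("factor %.2f: limit %.0f bytes/sec, would alert: %s"
          % (fraction, limit, OBSERVED > limit))
```

The division by 8 converts the interface speed from bits to bytes per second, so the factor at the end is the fraction of the link that must be in use before the event is raised.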

Cheers,
Jane

jesse raider
4 posts since
Aug 4, 2011

5. Dec 9, 2011 5:33 PM (in response to jesse raider)
Re: Anyone else having on-going issues with Zenoss or am I missing something?

Hello,

Thanks again for the prompt reply. I think I am good for now. I have adjusted the "high utilization" value as needed. I also found out how to restart zenmodeler via a cron job. I added the below script to /etc/cron.daily:

----------------------------------------------------------------------------------
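#!/bin/sh

# Restart zenmodeler (this also triggers a remodel), then note it in syslog
/usr/local/zenoss/zenoss/bin/zenmodeler restart
# A message string is used here because "logger -f FILE" would log the contents of FILE
logger -p local0.notice -t zREMODELER "zenmodeler restarted by cron.daily"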
----------------------------------------------------------------------------------

Of course the path depends on where Zenoss is installed. I also added the logger command just so that something is written to the /var/log/syslog file.

@ James, no need to reply to my request unless you feel you have a more suitable solution.

Thank you again for your help!!
