Mar 24, 2009 7:58 PM
ssCpuIdle value showing zero
zenoss@crt-monitor:~> snmpget -c snmpstring -v1 crt-nagios.blahblah.gov.au 1.3.6.1.4.1.2021.11.11.0
UCD-SNMP-MIB::ssCpuIdle.0 = INTEGER: 84
zenoss@crt-monitor:~> snmpget -c snmpstring -v1 crt-nagios.blahblah.gov.au 1.3.6.1.4.1.2021.11.11.0
UCD-SNMP-MIB::ssCpuIdle.0 = INTEGER: 0
Hello,
We are also experiencing this on RHEL 4 & 5. I originally thought it was a net-snmp bug, but it appears that the ssCpu* OIDs are deprecated and the ssCpuRaw* OIDs are what should be used now.
See RHEL BUG: https://bugzilla.redhat.com/show_bug.cgi?id=473824
Can we get this updated in Zenoss to use ssCpuRaw? If so, how? This bug is causing our development groups to wonder why our systems are 100% utilized, when actually they are not.
Please Advise,
Thanks
LJ
What you're seeing isn't a bug in net-snmp. The net-snmp docs have stated for a long time that ssCpuIdle and the other ssCpu* OIDs are deprecated and should not be used. Instead, you should be using the ssCpuRaw* OIDs (ssCpuRawIdle, ssCpuRawUser, ssCpuRawSystem, etc...). You will get much better data from these OIDs as well. Instead of getting a "rough estimate" percentage (rounded down to an integer), you can actually see how many clock ticks have been used on each, which will give you MUCH greater accuracy.
This should be considered a ZenOSS bug in the default "Server/Linux" device template, but it's easy enough to fix yourself. Just edit the Devices->Servers->Linux Template, look in the Data Sources section for the ssCpu checks. Change the SNMP OID being used to the appropriate ssCpuRaw* OID. For instance, the ssCpuIdle OID is .1.3.6.1.4.1.2021.11.11.0. This OID should be changed in your Data Sources to the OID for ssCpuRawIdle, which is .1.3.6.1.4.1.2021.11.53.0.
You can find the right OIDs by running the following commands from any Linux machine with the net-snmp utilities installed:
snmpwalk -v2c -c $COMMUNITY $HOSTNAME .1.3.6.1.4.1.2021.11
This will show you the complete list of available CPU OIDs. To get the OID as a numerical value rather than text, try this:
snmpwalk -v2c -c $COMMUNITY $HOSTNAME -O n ssCpuRawIdle
You can swap "ssCpuRawIdle" above with any of the other OIDs listed in the first command (ssCpuRawSystem, ssCpuRawUser, etc)
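As a rough aid for the data source edit described above, the lookup below pairs each deprecated percentage OID with its raw-counter counterpart. Only the ssCpuIdle and ssCpuRawIdle numbers come from this thread; the other sub-identifiers are assumptions taken from the UCD-SNMP-MIB layout and should be confirmed against your own snmpwalk output. The name `raw_oid_for` is hypothetical.

```python
# Base of UCD-SNMP-MIB::systemStats, as used in the snmpwalk command above.
SYSTEM_STATS = ".1.3.6.1.4.1.2021.11"

# Deprecated percentage OID -> raw counter OID.
# Only the ssCpuIdle -> ssCpuRawIdle pair is quoted in the thread; the other
# two rows are assumptions to verify with snmpwalk before relying on them.
RAW_REPLACEMENTS = {
    SYSTEM_STATS + ".9.0": SYSTEM_STATS + ".50.0",   # ssCpuUser   -> ssCpuRawUser
    SYSTEM_STATS + ".10.0": SYSTEM_STATS + ".52.0",  # ssCpuSystem -> ssCpuRawSystem
    SYSTEM_STATS + ".11.0": SYSTEM_STATS + ".53.0",  # ssCpuIdle   -> ssCpuRawIdle
}


def raw_oid_for(deprecated_oid):
    """Return the raw-counter OID for a deprecated ssCpu* OID.

    Accepts the OID with or without a leading dot and raises KeyError
    if the OID is not one of the deprecated percentages listed above.
    """
    key = "." + deprecated_oid.lstrip(".")
    return RAW_REPLACEMENTS[key]
```

For example, `raw_oid_for("1.3.6.1.4.1.2021.11.11.0")` gives the ssCpuRawIdle OID quoted earlier.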
One thing to remember: all ssCpu* OIDs are integer "gauge" values. The ssCpuRaw* OIDs are all "counter" values. Changing from a gauge data source to a counter data source requires some changes to your graphs.
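To illustrate the gauge/counter difference: a gauge is plotted as read, while a counter only means something as the difference between two readings divided by the time between them. The sketch below assumes the ssCpuRaw* values are 32-bit counters that wrap to zero at 2^32; `counter_rate` is a hypothetical helper, not part of Zenoss or net-snmp.

```python
COUNTER32_MODULUS = 2 ** 32  # assumed wrap point for a 32-bit SNMP counter


def counter_rate(previous, current, interval_seconds, modulus=COUNTER32_MODULUS):
    """Return ticks per second between two readings of a wrapping counter.

    The modulo keeps the delta positive when the counter wrapped between
    the two polls. A counter reset (for example a reboot) is
    indistinguishable from a wrap here and would produce a bogus spike.
    """
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    delta = (current - previous) % modulus
    return delta / interval_seconds
```

With readings of 100 and 400 taken 3 seconds apart this gives 100 ticks per second, and it gives the same answer if the counter wrapped in between (for instance from `2**32 - 50` to `250`).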
Hope this helps...
Dave,
This is very useful information. Thanks for taking the time to write it up !
Now I just need to work with Zenoss to switch to ssCpuRaw* and update our graphs to a counter rather than a gauge ( without losing historical data )
Thanks again!
Trying to convert from ssCpu gauges to ssCpuRaw counters without losing historical data will be very challenging. Perhaps I just didn't put enough time into it. I know I could have pulled the data from the old ssCpu RRD files using rrdtool, then manually added it to the new ssCpuRaw RRD files using rrdtool as well, with a bit of math in the middle. Just didn't want to put the time and effort into it, since we are still in "testing" phase with ZenOSS and our historical data was not critical.
Also, you should be polling the full set of ssCpuRaw values, which includes more data (and additional data sources) than the default ssCpu stats. They're not really compatible. The good news is that if you set up your graphs properly using the ssCpuRaw stats, you will actually get more accurate info (by the clock tick), as well as a number of new data sources.
If I were you, I'd just start over with your CPU stats. Anything you were seeing from ssCpu was inaccurate garbage anyway, as the net-snmp docs clearly state. If you are running a multi-core system, those ssCpu stats are especially useless. (i.e., 400% idle!).
I actually ended up writing a script that grabs the per-core/cpu statistics as an index. This tells me when one of the CPU cores is maxed, even if the others are seeing low usage. Averaging load across all cores obfuscates actual CPU utilization, making it pointless to monitor CPU utilization. I want to see when one of my CPUs is being thrashed by a process. I don't care if the other CPUs/cores are fine. Until *NIX kernels support balancing the CPU load of a process across multiple CPUs/cores, I don't care what total percentage of free CPU time I have across the entire system. I'm concerned about each of the CPUs/cores, and whether ANY of them are being thrashed by some poorly written java, a runaway DB query, etc. The only way to do that is measure each CPU separately. The easiest way to do that on *nix-like systems (Solaris, Linux, Mac OS X) is by using the "iostat" and "mpstat" commands (part of the sysstat package).
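The script mentioned above is not shown, so the following is only a sketch of the per-core idea using a different data source: two snapshots of the Linux /proc/stat file. It assumes the usual per-core line layout (`cpuN user nice system idle iowait irq softirq steal ...`); the function names are hypothetical.

```python
def parse_per_core(proc_stat_text):
    """Map 'cpuN' -> (busy_ticks, total_ticks) from the text of /proc/stat."""
    cores = {}
    for line in proc_stat_text.splitlines():
        fields = line.split()
        # Per-core lines are 'cpu0', 'cpu1', ...; the bare 'cpu' line is the
        # aggregate over all cores and is skipped on purpose.
        if not fields or not fields[0].startswith("cpu") or fields[0] == "cpu":
            continue
        # Use user..steal only; guest time is already counted inside user.
        ticks = [int(value) for value in fields[1:9]]
        idle = ticks[3] + (ticks[4] if len(ticks) > 4 else 0)  # idle + iowait
        total = sum(ticks)
        cores[fields[0]] = (total - idle, total)
    return cores


def busy_percent_per_core(before_text, after_text):
    """Return {'cpuN': percent busy} between two /proc/stat snapshots."""
    before = parse_per_core(before_text)
    after = parse_per_core(after_text)
    result = {}
    for name, (busy_after, total_after) in after.items():
        busy_before, total_before = before[name]
        total_delta = total_after - total_before
        busy_delta = busy_after - busy_before
        result[name] = 100.0 * busy_delta / total_delta if total_delta else 0.0
    return result
```

Alerting on `max(result.values())` rather than on the average is what exposes a single pegged core: one core at 100% and another at 5% averages out to a harmless-looking 52.5%.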
This doesn't explain why this OID was working fine for at least the past 249 days (uptime on the latest one with this issue) and stops working now. Why does it stop working on all four servers at a site at the same time?
A figure of 1942502991 ticks means nothing to me, whereas 90% idle does.
An update.
As a bit of background on the site: ESX 3.5 on an eight-core machine, with various guest VMs including these 4 Linux problem machines (just one of the places we have the issue). I also realise that this issue isn't directly caused by Zenoss; it is being caused by the system being monitored. I'm just trying to find out if anyone knows what net-snmp is talking to that is giving the zero values.
Today I restarted one of the four servers at this site and Zenoss has now cleared the CPU idle issue. Many manual snmpget runs later it is still showing a non-zero figure. This probably explains why all four Linux servers had this issue happen at the same time: they all came up at the same time when the ESX host was started. It appears the issue is related to the uptime of the guest OS. As to what 'it' is, I am stumped.
Is there an SNMP transaction count limit that needs to be reset?
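One piece of arithmetic may be relevant to the uptime observation, offered only as a possibility and not as a diagnosis: a tick counter held in a signed 32-bit integer runs out of range after 2^31 ticks. The sketch below assumes a tick rate of 100 per second (the usual Linux USER_HZ); neither that rate nor the signed 32-bit storage is confirmed anywhere in this thread.

```python
USER_HZ = 100            # assumed ticks per second; not stated in the thread
SECONDS_PER_DAY = 86400


def days_until_overflow(bits, ticks_per_second=USER_HZ):
    """Days of uptime before a counter of the given width runs out of range."""
    return (2 ** bits) / ticks_per_second / SECONDS_PER_DAY


signed_32 = days_until_overflow(31)    # signed 32-bit: 2^31 ticks
unsigned_32 = days_until_overflow(32)  # unsigned 32-bit: 2^32 ticks

print(f"signed 32-bit:   {signed_32:.2f} days")
print(f"unsigned 32-bit: {unsigned_32:.2f} days")
```

Under those assumptions the signed case comes out at roughly 248.55 days, which is close to the 249 days of uptime reported above. That proximity is suggestive but does not by itself establish the cause.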
Hummer,
If you read the above thread, Dave the Dude stated:
"The net-snmp docs have stated for a long time that ssCpuIdle and the other ssCpu* OIDs are deprecated and should not be used"
I read and understood this statement. Rather than tell me not to use it, can you tell me how to change it? Can you tell me what 'bit' is stuffed? I need a bandage on the inferior solution that was provided by Zenoss until I, or someone else, figures out how to get the same thing another way. Zenoss should have implemented it differently but it is what I have at the moment. You are not being helpful. Add something constructive please.
I have a ticket in with Zenoss right now about this issue.
I have updated my Data Sources to use the New OIDs, and updated my Graphs to be Counters, rather than Gauges, but still no dice.
I'll be sure to post what the resolution is here once I get it.
Thanks!
HummerBoy wrote:
This doesn't explain why this OID was working fine for at least the past 249 days (uptime on the latest one with this issue) and stops working now. Why does it stop working on all four servers at a site at the same time?
A figure of 1942502991 ticks means nothing to me, whereas 90% idle does.
What I'm trying to tell you is that if it ever did work, you weren't getting valid results anyway. You cannot rely on the ssCpu* OIDs to give you an accurate "percentage" of CPU usage. You need to calculate it yourself from the values returned by ssCpuRaw*.
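A minimal sketch of that calculation, assuming two polls of the same set of raw counters taken some seconds apart. The helper name and the sample figures are made up for illustration, and this is not the attached script. The counters passed in must not overlap; the UCD-SNMP-MIB notes that ssCpuRawSystem may already include the wait and kernel counters, so check your agent before summing them together.

```python
def cpu_percentages(first, second, modulus=2 ** 32):
    """Turn two snapshots of raw tick counters into percentages.

    `first` and `second` map a counter name (e.g. 'idle') to its raw value.
    Deltas are taken modulo 2^32 on the assumption that the values are
    32-bit counters that may wrap between polls.
    """
    deltas = {name: (second[name] - first[name]) % modulus for name in first}
    total = sum(deltas.values())
    if total == 0:
        raise ValueError("no ticks elapsed between the two samples")
    # Dividing by the total of all deltas keeps the result in 0..100 even on
    # multi-core hosts, where the raw counters accumulate across every core.
    return {name: 100.0 * delta / total for name, delta in deltas.items()}


# Made-up sample values, for illustration only.
first = {"user": 1000, "nice": 0, "system": 500, "idle": 8500}
second = {"user": 1040, "nice": 0, "system": 510, "idle": 9450}
print(cpu_percentages(first, second))
```

This is how a bare tick count turns back into a figure like "95% idle": the absolute counter value is meaningless on its own, but the share of the elapsed ticks that went to each state is not.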
I wrote a Nagios-compatible script for Linux devices that can be used from within ZenOSS (as a COMMAND data source) to grab the correct values and give you pretty accurate results (to the hundredth of a percent). It samples data over a configurable number of seconds and calculates from the delta, so it will take that many seconds to run (set the $sample_seconds variable to change the sample period). I've attached it to this post.
The ZenOSS Administrator's Guide explains specifically how to use scripts as a COMMAND data source. You will need to follow those instructions to set up the data sources for your graphs, but this script will give you the correct values. Also, you must have the Net-SNMP utilities and perl installed on whatever system will be running this script. Make sure to run the script once from the command line as the zenoss user to see if it works.
To run the script from Zenoss, set your Command Template to:
check_linux_cpu.pl ${here/zSnmpVer} ${here/zSnmpCommunity} ${here/manageIp}
You can run it from the command line to test your server:
check_linux_cpu.pl $SNMP_VERSION $SNMP_COMMUNITY $HOSTNAME
$SNMP_VERSION = the appropriate SNMP version (use v2c)
$SNMP_COMMUNITY = the community string you are using
$HOSTNAME = the hostname or IP address of the server you want to poll for data
It will return values in the following format:
|Count=2 TotalUsed=4.705 User=4.004 Kernel=0.701 Idle=95.295 Wait=0.000
Count = the number of CPU cores on your system
TotalUsed = the total percentage of "Used" CPU time
User = percent CPU used on "User" processes
Kernel = percent CPU used on "Kernel" tasks
Idle = percent CPU idle (total percentage "Free" CPU)
Wait = percent CPU iowait (amount of time CPU spends waiting for I/O)
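If you want to check the script's output outside Zenoss, the line format above is simple to parse. The sketch below assumes only what the sample line shows: an optional message, a `|`, then space-separated `Name=value` pairs. `parse_cpu_output` is a hypothetical name.

```python
def parse_cpu_output(line):
    """Parse '|Count=2 TotalUsed=4.705 ...' into a dict of floats."""
    # Everything after the first '|' is treated as the data section.
    _, _, perfdata = line.partition("|")
    values = {}
    for token in perfdata.split():
        name, _, raw = token.partition("=")
        values[name] = float(raw)
    return values


sample = "|Count=2 TotalUsed=4.705 User=4.004 Kernel=0.701 Idle=95.295 Wait=0.000"
stats = parse_cpu_output(sample)
print(stats["Idle"])
```

A quick sanity check on any result is that TotalUsed and Idle should add up to about 100.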
Note that it only works for Linux. If you want to check Solaris, BSD or Mac OS X machines (etc), you'd need to hack this script to specify which of the ssCpuRaw OIDs are important, and what OIDs you want to check.
Good luck!
HummerBoy wrote:
An update.
As a bit of background on the site: ESX 3.5 on an eight-core machine, with various guest VMs including these 4 Linux problem machines (just one of the places we have the issue). I also realise that this issue isn't directly caused by Zenoss; it is being caused by the system being monitored. I'm just trying to find out if anyone knows what net-snmp is talking to that is giving the zero values.
Today I restarted one of the four servers at this site and Zenoss has now cleared the CPU idle issue. Many manual snmpget runs later it is still showing a non-zero figure. This probably explains why all four Linux servers had this issue happen at the same time: they all came up at the same time when the ESX host was started. It appears the issue is related to the uptime of the guest OS. As to what 'it' is, I am stumped.
Is there an SNMP transaction count limit that needs to be reset?
VMware guests have their own problems with time-keeping. It's the same as the issues Net-SNMP has with obtaining accurate values for the ssCpu* stats. Namely, there isn't a fixed clock, so there is no way for the kernel to know how many ticks are in a clock cycle. If you play with ntpdate on a Linux VMware guest, you will find your time drift growing exponentially. This is due to the kernel not calculating the clock cycle properly. It's a known issue, and one of many reasons the Net-SNMP project abandoned the ssCpu stats a LONG time ago.
Thanks for that. Very useful information. I will go and try out what you have put up, and I will post back any results.
Yes, we have experienced the difference between the ssCpuPercent and calculating (manually with a calculator) from the ssCpuRawX figures. At the point we were calculating it, we found it to be about 2-3% out, but we assumed this would vary. At this point in time (pardon the pun) we would rather be 2% out than have nothing, so I was looking for 'something' to restart just so we could limp along until a better solution could be found. Rebooting the server makes it work again, but this is not always an option. Hopefully your solution will negate all this. Pity Zenoss couldn't have done this for us instead of monitoring an invalid figure (Enterprise monitors the same OIDs).
Yes, we have experienced the VMware clock drift issue and fully appreciate the issues with CPU cycles on a VM guest, so we forced each machine to sync to a time source provided by our comms provider.