Dec 4, 2011 11:25 PM
Delay between FS threshold breach and event
Hello all,
I have just experienced a strange problem. A filesystem on one of our Linux servers started to fill up, eventually breaching first its Error threshold @ 12:45 and then its Critical threshold @ 12:55. This can be seen in the graph for the filesystem in question:
However, we did not receive any events for the Error threshold breach. We did receive a single event for the Critical threshold breach, but it did not arrive until about 40 mins after the breach occurred (it was subsequently auto-cleared):
As the filesystem usage is graphed in Zenoss and I do not see any errors for this host in zenperfsnmp.log, I can only assume that the data was correctly collected by zenperfsnmp.
So, can anybody suggest why this delay between the threshold breach and the resulting event may have occurred? There was no unusual load on my Zenoss server at the time and I have not seen any problems like this previously.
Failing that, any suggestions as to where I might look for further clues?
Cheers,
James
Could you provide the exact configuration of each of these thresholds?
Sure Chet, thanks for your interest...
This is based on the standard Filesystem monitoring template applied to /Server, polling OID 1.3.6.1.2.1.25.2.3.1.6 to get a value for usedBlocks.
I had a requirement to be able to set custom thresholds for every filesystem on every server. Creating local template copies for this would be a mess, so I have a slightly more complex setup to determine my Critical, Error and Warning thresholds. For each server I have three Custom Properties (each shown here with example values):
cFilesystemCritical: '/|95 /boot|80 /home|95 /opt|95 /tmp|90 /var|90'
cFilesystemError: '/|90 /boot|70 /home|90 /opt|90 /tmp|80 /var|80'
cFilesystemWarning: '/|85 /boot|60 /home|95 /opt|95 /tmp|70 /var|70'
I then use a one-liner in the 'Maximum Value' field of each threshold to dynamically obtain the threshold for that filesystem, like so (for the Critical threshold):
here.totalBlocks * float(here.device().getProperty('cFilesystemCritical').strip().split(here.name())[1].split()[0].split('|')[1]) / 100
A little complicated, but it has worked fine for a long time and continues to raise timely filesystem utilisation threshold breach events across many servers on a daily basis.
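For clarity, here is the same parsing logic pulled out of the TALES expression and run as plain Python, with made-up example values (filesystem '/home', 1,000,000 total blocks), just to show what the one-liner evaluates to:

    # Standalone illustration of the one-liner's parsing logic (example values only)
    def critical_threshold(prop, fs_name, total_blocks):
        # prop looks like: '/|95 /boot|80 /home|95 /opt|95 /tmp|90 /var|90'
        pct = float(prop.strip().split(fs_name)[1].split()[0].split('|')[1])
        return total_blocks * pct / 100.0

    prop = '/|95 /boot|80 /home|95 /opt|95 /tmp|90 /var|90'
    print(critical_threshold(prop, '/home', 1000000))   # -> 950000.0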
In the case above, it seems there was a delay somewhere in the chain of:
snmp poll->snmp value returned->threshold applied->event raised
As the filesystem utilisation was graphed and there were no errors in the zenperfsnmp log, I can only assume that the snmp poll returned data in a timely fashion. This leads me to believe that the delay occurred in the processing of the obtained value. However, 40 mins between threshold breach and event raising seems very strange...
Any ideas/help would be greatly appreciated...
J.
For anybody who encounters similar issues, I believe that events were being dropped by Zenhub due to the event queue length limit being exceeded. This is discussed in previous posts:
I've added additional Zenhub workers and increased the cache size, as discussed in this post:
Seems to have solved the problem for now...
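For completeness, the worker change just amounts to a line along these lines in $ZENHOME/etc/zenhub.conf followed by a restart of zenhub (check the option name against your Zenoss version); the cache-size setting is the one described in the linked post:

    # $ZENHOME/etc/zenhub.conf -- spawn extra worker processes to service collectors
    workers 2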