Archived community.zenoss.org
32246 Views 6 Replies Latest reply: Feb 23, 2010 8:32 PM by phonegi
jonas.andersson Newbie 5 posts since
Sep 13, 2009

Feb 16, 2010 6:53 AM

Monitor our infrastructure hierarchy

I want to start this thread off by saying that my understanding of Zenoss Enterprise is very limited in many aspects, so if I say something that seems to come out of the blue, feel free to enlighten me.

 

We are monitoring some 3800 devices, mostly Windows and Linux machines but also PDUs, UPSs, and environmental devices. This is accomplished with 4 servers: one running Zenoss and 3 running collectors. We mainly monitor ping responses and disk usage, and we graph CPU utilization, running processes, free memory, swap use, and NTP on Linux. We would like to do WMI monitoring at some point, but for now we just do ping monitoring on Windows.

 

My problem is that the collectors time out, and I believe the reason is that we don't monitor the whole infrastructure hierarchy but only the servers, meaning that if a switch goes down, all the machines behind it report errors. I also believe the collection cycle chokes itself because of this. Another problem is balancing the monitored machines between collectors; what's the best practice here?

 

A few questions that could get me up to speed with the problem ahead:

* Is there a best practice Zenoss deployment guide?

* Can Zenoss be configured to have an understanding of network topology and therefore test fewer things when the connection to a switch goes down?

* What kind of resources (memory and CPU) would you allocate to the collectors and the main Zenoss server to monitor 3800 devices, based on real-world working experience?

  • Chris Krough Rank: White Belt 16 posts since
    Apr 22, 2008
    1. Feb 16, 2010 9:45 AM (in response to jonas.andersson)
    Re: Monitor our infrastructure hierarchy

    When you say that the collectors time out, what behavior are you seeing? The count of OIDs will give you a better metric for balancing collectors. You can get this number by doing the following on each collector:

     

    grep -i "OID requests" zenperfsnmp.log | tail
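
    If you want to compare collectors side by side, a rough Python sketch like the one below could pull the most recent count out of a local copy of each collector's zenperfsnmp.log. The file names and the exact wording of the log line are assumptions here, so adjust the pattern to whatever your logs actually print.

        import re

        # Hypothetical local copies of each collector's zenperfsnmp.log.
        LOGS = {
            "collector1": "collector1-zenperfsnmp.log",
            "collector2": "collector2-zenperfsnmp.log",
            "collector3": "collector3-zenperfsnmp.log",
        }

        # Assumed format: a log line containing "... <N> OID requests ...".
        PATTERN = re.compile(r"(\d+)\s+OID requests", re.IGNORECASE)

        def last_oid_count(path):
            # Return the count from the most recent matching line, or None.
            count = None
            with open(path) as handle:
                for line in handle:
                    match = PATTERN.search(line)
                    if match:
                        count = int(match.group(1))
            return count

        for name, path in sorted(LOGS.items()):
            print(name, last_oid_count(path))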

     


    Dependency checking won't help you with devices behind dead switches because Zenoss doesn't do layer 2 dependencies, but it will help you if a router goes down. You could manually set up L2 dependencies using Event Transforms, but that's not practical in a large environment. Search the documents and forums for "layer 3 dependency checking" for more information. Also take a look at docs/DOC-3238.
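
    To make that idea concrete, here is a minimal sketch of the suppression logic such a transform or event command would implement against a hand-maintained parent map. The map, device names, and event dictionary are hypothetical, and this is plain Python illustrating the logic, not the actual Zenoss transform API.

        # Hand-maintained layer 2 hierarchy: child device -> upstream switch.
        PARENT = {
            "server-aa": "switch-a",
            "server-ab": "switch-a",
            "server-ac": "switch-a",
        }

        def suppress(event, down_devices):
            # Drop this "device down" event if the device's upstream switch
            # already has an active down event of its own.
            parent = PARENT.get(event["device"])
            return parent is not None and parent in down_devices

        # Example: switch-a is already down, so the child's event is suppressed.
        active_down = set(["switch-a"])
        print(suppress({"device": "server-ab"}, active_down))   # True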

     

    Chris

  • phonegi Rank: Brown Belt 446 posts since
    Apr 15, 2009
    2. Feb 16, 2010 9:53 AM (in response to jonas.andersson)
    Re: Monitor our infrastructure hierarchy

    You have quickly isolated one of the most controversial topics regarding Zenoss: device dependency. Simply put, there is no layer 2 device dependency mechanism in Zenoss. mray (one of the developers) has added a ticket here. So we know the development team is aware of the issue but there does not appear to be a current solution.

     

    I believe your assessment is correct - your collectors are timing out if numerous devices are unreachable. The sum of the timeouts is probably greater than the cycle interval. Your use of multiple collectors is a good idea. If you are using Enterprise, I understand you can set up multiple collectors on a single system. This should alleviate some of the bottlenecks when devices become unreachable. With so many devices to be monitored, it may be necessary to use even more collectors.
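
    As a rough back-of-the-envelope illustration of how unreachable devices can blow past the cycle (the timeout, retry, and cycle values below are assumptions, and for simplicity the timeouts are treated as strictly serial):

        # Assumed values for illustration only.
        ping_timeout = 1.5      # seconds per attempt
        retries = 2             # attempts before a device is declared down
        unreachable = 200       # devices behind a dead switch
        cycle = 60.0            # seconds in the ping cycle

        worst_case = unreachable * ping_timeout * retries
        print(worst_case)              # 600.0 seconds spent on timeouts alone
        print(worst_case > cycle)      # True: the cycle cannot keep up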

     

    I don't think there is any magic formula for determining hardware requirements. They are determined by how you configure your system, not the number of devices monitored. If you are collecting 200 data points per cycle from 300 devices, you will need more hardware resources than if you are simply pinging 1000 devices. There are a number of factors to consider. You need enough bandwidth to send/receive all the data in each cycle, your hard drives need enough space and speed to store all the data you are recording each cycle, event complexity can impact the processors, etc.
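
    Putting rough numbers on the 200-datapoints-from-300-devices example above (the 5-minute cycle below is an assumption):

        devices = 300
        datapoints_per_device = 200
        cycle_seconds = 300                     # assumed 5-minute cycle

        updates_per_cycle = devices * datapoints_per_device
        updates_per_second = updates_per_cycle / float(cycle_seconds)
        print(updates_per_cycle, updates_per_second)   # 60000 per cycle, 200.0/s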

  • mlist Rank: White Belt 49 posts since
    Mar 27, 2009
    3. Feb 16, 2010 11:31 AM (in response to phonegi)
    Re: Monitor our infrastructure hierarchy

    I think that Zenoss should ABSOLUTELY improve AT LEAST layer3 dependencies.

    I used Nagios in a large environment and never had problems manually managing layer 3 dependencies using the GUI (monarch), and while I think you can survive without layer 2, layer 3 is compulsory for an enterprise product. Zenoss is actually already able to manage layer 3 dependencies automatically (by reading routing tables), but there are situations in which this is simply not possible (what if you don't have SNMP access to a remote router?).

    It is also possible to manage layer 3 dependencies manually with Zenoss, but this would require writing some code. This is the reason I opened a feature request asking for the ability to MANUALLY manage layer 3 dependencies, and I hope the Zenoss developers will take care of this request. I know there are a lot of feature requests and that resources are limited, but in this case I think Zenoss should consider this improvement.

  • mastrboy Rank: White Belt 10 posts since
    Apr 17, 2008
    4. Feb 23, 2010 6:40 PM (in response to mlist)
    Re: Monitor our infrastructure hierarchy

    Agreed, this is actually a showstopper for us; we can't have 200-300 alerts coming in when only one device is down.

     

    only "solution" we have found so far is: docs/DOC-3215

     

    And the auto-discovery in the routing table dependency model is basically no good for us, as a lot of our equipment is Cisco switches with a management VLAN on layer 2, meaning we can ping the device but the device does no routing; it only acts as a layer 2 switch/bridge.

     

    I'll quote a post from July 2009 to remind the devs that this is very much needed:

     

    Matt Ray: "This feature is high on our feature request list but we have not
    gotten to it yet because we have fairly limited development resources
    (5 developers). Much of the recent work has been targeted at improving
    stability and reliability with each release while squeezing in new
    features where we can. As seen in the "What do you want in King Crab"
    thread (http://forums.zenoss.com/viewtopic.php?t=9296), this is quite
    popular request and we are very much aware of it."

    Ref: thread/9845

  • phonegi Rank: Brown Belt 446 posts since
    Apr 15, 2009
    5. Feb 23, 2010 8:12 PM (in response to mastrboy)
    Re: Monitor our infrastructure hierarchy

    My original thought was that it would not be that hard to develop an Event Command that would consult a file containing manual layer 2 hierarchy entries. If a device higher in the hierarchy was down, all "device down" events for devices lower in the hierarchy could be ignored. However, I soon realized that the command would be worthless without being able to control the order of devices in the ping cycle. The device that is higher in the hierarchy has to be checked first; otherwise, there won't be an active event for the lower device to reference.

     

    For example, take a network switch, call it device A, and connect servers AA, AB, AC... to it. Device A goes down, but the ping cycle checks devices in this order: AB, AC, A, AA. Devices AB and AC (and possibly many more) will all send alerts as down because there is no active event indicating that device A is down. Only then is an event generated for device A, so when AA is determined to be down, no event is sent.

     

    I guess my point is that the solution must have two parts: one that organizes the ping cycle so that ping events are generated in a hierarchical manner, and another that suppresses the cascading events. Not so simple.
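
    For what it's worth, a rough sketch of those two parts might look like the code below. The parent map, device names, and is_up() check are all hypothetical, and a real implementation would have to live inside the ping daemon's scheduling rather than in a standalone script.

        from collections import deque

        # Hand-maintained hierarchy: child -> parent (None for the top level).
        PARENT = {
            "switch-a": None,
            "server-aa": "switch-a",
            "server-ab": "switch-a",
            "server-ac": "switch-a",
        }

        def ping_order(parent_map):
            # Part 1: order devices so parents are always pinged before children.
            children = {}
            roots = deque()
            for device, parent in parent_map.items():
                if parent is None:
                    roots.append(device)
                else:
                    children.setdefault(parent, []).append(device)
            order = []
            while roots:
                device = roots.popleft()
                order.append(device)
                roots.extend(children.get(device, []))
            return order

        def check(parent_map, is_up):
            # Part 2: walk the ordered list and suppress events for devices
            # whose parent is already known to be down.
            down = set()
            events = []
            for device in ping_order(parent_map):
                if is_up(device):
                    continue
                down.add(device)
                if parent_map[device] in down:
                    continue                    # cascading failure: suppress
                events.append("%s is down" % device)
            return events

        # Example: the switch is down, so only one event is raised.
        print(check(PARENT, lambda device: False))   # ['switch-a is down']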

  • phonegi Rank: Brown Belt 446 posts since
    Apr 15, 2009
    6. Feb 23, 2010 8:32 PM (in response to phonegi)
    Re: Monitor our infrastructure hierarchy

    Hmmm... alternatively, if you created an alerting rule just for ping events with a delay twice as long as the ping cycle (assuming your ping cycle has been sized to account for the time it takes to ping a significant number of devices that cannot be found), that should ensure that all ping events from the previous cycle are active. When the next cycle begins, the event command would be able to find and clear out any events for "lower" devices, leaving only the ping events for the devices at the top of the hierarchy.

     

    This still doesn't account for IP service monitoring, etc...
