Jul 22, 2010 10:28 AM
Dependency/Layer 2 - revisited
I figured since I upgraded to 3.0, I would throw these questions out there again. I have a central office with 120 remote locations, all connected by MPLS. The packets travel over hops we do not control. We monitor the router and several other devices at each facility. When the circuit or router goes down, we get flooded with alerts. I have not been able to find a reasonable way to deal with the fact that packets go over telco hops outside our control.
I have also tried, unsuccessfully, to use our naming standard to create transforms that suppress the extra alerts. Our routers are named something like TX123-rtr-ser and the other devices are named things like TX123SRV and TX123-WAP. I tried to write something that would check by name whether the router at a site was down before creating an event about something else being down. I could not make it work.
With our previous product (Whatsup), this was not automatic, but it was very simple: there was a drop-down box for dependencies where you could select the router that corresponded to a device. I have no problem setting something like this up manually. I just have to figure something out. This is the one thing left in Zenoss that is killing us. When a circuit is bouncing we get so swamped that we repeatedly overlook critical alerts, because we are on a BlackBerry and there are simply too many alerts to manage.
Thanks,
Matthew
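For what it's worth, the name-based check described above could be sketched roughly as below. This is plain Python rather than a working Zenoss transform. It assumes the site code is the leading letters plus digits of the device name (TX123 in TX123-rtr-ser) and that routers can be recognized by "-rtr" in the name. The function names and the list of currently down devices are hypothetical; a real transform would have to get the down list from Zenoss itself.

```python
import re

# Assumption: the site code is the leading letters followed by digits,
# e.g. "TX123" in "TX123-rtr-ser", "TX123SRV" and "TX123-WAP".
SITE_RE = re.compile(r'^([A-Za-z]+\d+)')


def site_code(name):
    """Return the site prefix of a device name, or None if it does not match."""
    match = SITE_RE.match(name)
    return match.group(1).upper() if match else None


def is_router(name):
    """Assumption: router names contain '-rtr' (as in TX123-rtr-ser)."""
    return '-rtr' in name.lower()


def should_suppress(device_name, down_devices):
    """True if a non-router device shares a site with a router that is down."""
    if is_router(device_name):
        return False  # always alert on the router itself
    site = site_code(device_name)
    if site is None:
        return False  # unknown naming, do not hide anything
    return any(is_router(d) and site_code(d) == site for d in down_devices)
```

With TX123-rtr-ser in the down list, should_suppress("TX123SRV", ...) is true and the router alert is the only one left for that site.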
Not even a whimper
I know I'm not the only one struggling with this....
All I can say is +100000 but I don't know how to patch it in.
--
James Pulver
Information Technology Area Supervisor
LEPP Computer Group
Cornell University
Hi Matthew,
100% agree that you are *not* alone.
/suntzu
I will add my yep to this again. And not trying to hijack but I don't see Layer 3 dependency either. On that, just to be able to manually set them would be nice.
tk
The lack of this is THE MAIN gap holding me back from being able to implement Zenoss for real in the company I work for.
As long as Zenoss doesn't have options to set up parent/child dependencies between the items in its inventory and the things on those devices, you basically cannot sort out the alarms well enough to make the monitoring complete.
Until then Zenoss will remain a thing I like but just keep testing out, because it doesn't match up with the other monitoring tools used today.
Perhaps someone can explain alert suppression?
To me, it means what I am used to from Whatsup.
I have 10 devices at a site. 1 is a router. Every other device has the
router set as a 'parent'.
If the router or circuit goes down, I only get an alert about the
router. In Zenoss, I get 10 alerts.
Also, if all devices go down, then the router comes up, but device A is
still down, I then get an alert that device A is down since its parent
is now up.
That is what I need. I have about 110 sites with MPLS circuits. They go
through telco routers I don't control, so I cannot rely on network
modeling.
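A rough sketch of that parent rule in plain Python, in case it helps the discussion. The parents mapping and the function name are hypothetical and nothing here is Zenoss API; it only shows the logic: a down device alerts unless its manually assigned parent is also down.

```python
def alerts_to_send(parents, down):
    """Return the down devices that should alert.

    parents: dict mapping device name -> parent name (None for no parent).
    down:    set of device names currently down.
    A device is suppressed while its parent is also down.
    """
    return sorted(d for d in down if parents.get(d) not in down)


# One site: a router with two devices behind it.
parents = {"router": None, "deviceA": "router", "deviceB": "router"}

# Circuit drops, everything is down: only the router alerts.
print(alerts_to_send(parents, {"router", "deviceA", "deviceB"}))

# Router is back but device A is still down: device A now alerts.
print(alerts_to_send(parents, {"deviceA"}))
```

Because only the direct parent is checked, the same rule also covers a chain of devices, as long as every device behind the failed one is itself detected as down.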
Same even if you own the whole network. It's a pain when a router close to the server goes down and you have other routers, switches, servers, etc. behind it. Tons of alerts. It would be nice to have that router alert and the rest go unknown. Some other products do that.
That being said, there are tons of great features in Zenoss. Having to get used to 3.0 but so far I like it.
tk
This is the same major problem I've met with Zenoss. And receiving hundreds of alerts can also "hide" another alert (you know what I mean: select all, Mark As Read and voilà...).
I'm not a Python developer, but I gathered some scripts and rewrote parts of them for my needs.
This is how my Zenoss is organized: I've created a Location for each of my sites. The router name always starts with "Routeur".
The idea was to check whether the router of the site is down. If it is, the other devices' events are set to "Acknowledged" (which means no alert for me).
But I faced another problem while doing this: when Zenoss sends its test pings, a non-router device can be pinged right before the location's router.
So I had to count the number of alerts (script taken from somewhere on this site, but I don't remember where; all credit goes to the author, of course).
I still have to tweak it a little more to rewrite the summary without changing the dedupfields.
Hope it helps!!
Jude
In Transform of /Status/Ping:
# Create the dedup id. This is what Zenoss normally does to an event to determine whether
# it is a duplicate (another occurrence) of an existing event. We do it in this
# transform to be able to reference the count variable, which does not come with an
# incoming event.
dedupfields = [evt.device, evt.component, evt.eventClass, evt.eventKey, evt.severity]
if not evt.eventKey:
    dedupfields += [evt.summary]
mydedupid = '|'.join(map(str, dedupfields))
# Get the event details (including count) from the existing event that is in the mysql database
em = dmd.Events.getEventManager()
em.cleanCache()
try:
    ed = em.getEventDetail(dedupid=mydedupid)
    mycount = ed.count
except:
    mycount = 0
myDevice = dmd.Devices.findDevice(evt.device)
myDevName = myDevice.getDeviceName()
myDevClass = myDevice.getDeviceClassName()
# First occurrence: acknowledge events from non-router network/ping devices
if mycount == 0:
    if (myDevName.startswith('Routeur') == 0):
        if (myDevClass.startswith('/Network') == 1) or (myDevClass.startswith('/Ping') == 1):
            evt.eventState = 1
# Repeat occurrence: look up the router in the same Location and check its ping status
if mycount > 0:
    if (myDevName.startswith('Routeur') == 0):
        myLocation = myDevice.getLocationName()
        splitLoc = myLocation.split('/')
        locPath = ''
        for path in splitLoc:
            if (path != ''):
                locPath = locPath + "['" + path + "']"
        myCommand = "myDevicesList = dmd.Locations" + locPath + ".getSubDevices()"
        exec myCommand
        for myCurrentDevice in myDevicesList:
            myDeviceName = myCurrentDevice.getDeviceName()
            if (myDeviceName.startswith('Routeur')):
                if (dmd.Devices.findDevicePingStatus(myDeviceName)>0):
                    # router is down: acknowledge the event
                    evt.eventState = 1
                    # evt.summary = myDeviceName + " is unreachable"
                else:
                    # router is up: leave the event as new
                    evt.eventState = 0
I'm not good with Python and I have not successfully implemented layer 2/3 dependency using a transform. What I did instead was hack the filter in the Alert Rule. For those who do not know: on the Alert Rule page, if you append "/manage" to the Zenoss URL, it takes you to the Zope page; click on Property, and the "where" textbox holds the MySQL filter for events. I added the following to suppress all events except those for the Zenoss local gateway if the local gateway goes down:
and (ipAddress = 'ip.of.master.device' or (ipAddress = 'ip.of.dependent.device' and (select count(*) from status where ipAddress = 'ip.of.master.device' and eventclass = '/Status/Ping' and severity = 5) = 0))
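If you need one of these per master/dependent pair, a small helper can at least generate the text so it doesn't have to be typed by hand. This is only a sketch: it reproduces the clause above exactly as written, for a single pair, and the function name is made up. Whether the resulting filter behaves as intended in the Alert Rule still has to be checked in Zenoss, and how to combine several pairs into one filter is not covered here.

```python
import ipaddress


def dependency_clause(master_ip, dependent_ip):
    """Build the 'where' fragment shown above for one master/dependent pair.

    Both arguments are validated as IP addresses so that only literal
    addresses end up inside the SQL text.
    """
    master = str(ipaddress.ip_address(master_ip))
    dependent = str(ipaddress.ip_address(dependent_ip))
    return (
        "and (ipAddress = '%s' or (ipAddress = '%s' and "
        "(select count(*) from status where ipAddress = '%s' "
        "and eventclass = '/Status/Ping' and severity = 5) = 0))"
        % (master, dependent, master)
    )


# Example addresses are placeholders, not from the original post.
print(dependency_clause("10.1.1.1", "10.1.1.20"))
```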
Not sure that one would help much in my case, l2huynh. I would need one of those for each router, so that if the router in front of a device goes down, the down router alerts and the others don't.
@jude16, where did you put that? And by location, I assume you grouped the items at each location. Say I have a string of towers (hops) to get to the Zenoss server and edge. If tower 2 goes down, tower 3 and tower 4 would be unknown or acknowledged in your script and would not alert.
Has anyone tried this: blogs/zenossblog/2008/08/26/tip-of-the-month-layer-3-dependency-checker
I have all of my layer 3 devices in Zenoss and a manual traceroute from the zenoss server shows all hops but the ./tracepath.py does not show all hops. Very weird.
In any case, if a router goes down, I get 100 messages about every IP service and server that is down. Quite annoying.
FYI, I actually think there's only one JIRA request for Layer 2. Maybe I missed something. Anyway, here it is if you want to vote for it. Maybe that will nudge things along.
I was just thinking about this and wondering about taking the following approach (code sample below):
The basic idea, as mentioned, is to suppress events caused by a network path failure. I'm wondering if that path can be deduced from the object database in the following fashion (as an event transform):
Given an event:
1) determine collector of the affected device (as another device in the dmd)
2) examine the os.routes() entries of both the device and its collector, in particular whether the "nexthops" refer to other known devices
3) determine whether any of the "nexthops" are shared between devices
4) if not, examine each list of hops for both devices and try to determine if they share common "hops"
5) if a list of shared "hops" can be determined, and any of those hops is "down", then suppress the event.
The following code is fragile, untested and best regarded as "probably a bad idea". It almost certainly won't work as intended or described above:
def findNextHops(dev):
    """
    return list of dmd devices from given device routes
    """
    hops = []
    for r in dev.os.routes():
        hop = r.nexthop()
        if hop is not None:
            if hop.device() is not None:
                if hop.device() not in hops:
                    hops.append(hop.device())
    return hops

def findCommon(d, c):
    """
    attempt to find hops in common between 2 nodes
    """
    dhops = findNextHops(d)
    chops = findNextHops(c)
    inter = set(dhops).intersection(chops)
    if len(inter) > 0:
        return inter
    # no direct overlap: try each pair of next hops in turn
    # (there is no guard against routing loops, so this can recurse without end)
    for e in dhops:
        for f in chops:
            found = findCommon(e, f)
            if found:
                return found
    return set()

try:
    # find the collector for the event device
    # ("device" is the event's device as available in the transform)
    collector = dmd.Devices.findDevice(device.getPerformanceServerName())
    common = findCommon(device, collector)
    for c in common:
        if c.getPingStatus() > 0:
            evt._action = "drop"
except Exception:
    pass
I realize there's much not taken into account, but perhaps this can be a starting point?
Joseph
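For anyone who wants to experiment with the idea outside Zenoss first, here is a minimal stand-alone sketch of steps 3 to 5. It is a variant, not the code above: instead of recursing over pairs of hops, it collects everything reachable through next hops from each side and intersects the two sets, with a visited set so routing loops cannot cause endless recursion. The next_hops dictionary, the device names and the function names are all made up for illustration; in a transform the data would have to come from os.routes().

```python
def reachable_hops(start, next_hops):
    """Return every hop reachable from start by following next_hops.

    next_hops: dict mapping a node name to a list of its next-hop names.
    The seen set makes this safe on topologies that contain loops.
    """
    seen = set()
    stack = list(next_hops.get(start, []))
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        stack.extend(next_hops.get(node, []))
    return seen


def shared_hops(device, collector, next_hops):
    """Hops that appear on both the device side and the collector side."""
    return reachable_hops(device, next_hops) & reachable_hops(collector, next_hops)


def should_drop(device, collector, next_hops, down):
    """True if any shared hop is currently down (step 5)."""
    return any(hop in down for hop in shared_hops(device, collector, next_hops))


# Hypothetical topology with a loop between core and rtrA.
topology = {
    "srv": ["rtrA"],
    "rtrA": ["core"],
    "collector": ["core"],
    "core": ["rtrA"],
}
print(should_drop("srv", "collector", topology, down={"core"}))
print(should_drop("srv", "collector", topology, down=set()))
```

Note that "shared" here only means reachable from both ends, which is looser than "on the actual path between them", so this can suppress more than it should on networks with several paths.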