Dependency/Layer 2 - revisited - Open Source Network Monitoring and Systems Management

17468 Views 14 Replies Latest reply: Dec 17, 2012 4:45 PM by j053ph4

Matthew Kitchin (public)

170 posts since
Nov 12, 2009

Jul 22, 2010 10:28 AM

Dependency/Layer 2 - revisited

I figured since I upgraded to 3.0, I would throw these questions out there again. I have a central office with 120 remote locations, all connected by MPLS. The packets travel over telco hops we do not control. We monitor the router and several other devices at each facility. When the circuit or router goes down, we get flooded with alerts, and I have not been able to find a reasonable fix given that those hops are out of our control.

I have also tried, unsuccessfully, to use our naming standard to create transforms that suppress the extra alerts. Our routers are named something like TX123-rtr-ser and the other devices are named things like TX123SRV and TX123-WAP. I tried to write something that would check by name whether the router at a site was down before creating an event about something else being down. I could not make it work.

With our previous product (Whatsup), this was not automatic, but it was very simple: a drop-down box for dependencies let you select the router that corresponded with a device. I have no problem setting something like this up manually. I just have to figure something out. This is the one thing left in Zenoss that is killing us. We get so swamped with alerts when a circuit is bouncing that we repeatedly overlook critical alerts, because we are on a BlackBerry and there are simply too many alerts to manage.
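
A minimal sketch of the name-matching idea described above, in plain Python outside Zenoss. It assumes that every device name begins with a site code made of letters followed by digits (such as TX123) and that the site's router is the device whose name contains "-rtr". Both rules are guesses from the example names in this post, and `site_code` and `router_for` are hypothetical helper names, not Zenoss functions.

```python
import re

# Assumption: a site code is leading letters followed by digits, e.g. "TX123".
SITE_RE = re.compile(r'^([A-Za-z]+\d+)')

def site_code(name):
    """Return the site code at the start of a device name, or None."""
    m = SITE_RE.match(name)
    return m.group(1).upper() if m else None

def router_for(name, all_names):
    """Return the name of the router at the same site as `name`, or None.

    Assumption: the router is the device whose name contains '-rtr'.
    """
    site = site_code(name)
    if site is None:
        return None
    for other in all_names:
        if site_code(other) == site and '-rtr' in other.lower():
            return other
    return None

names = ['TX123-rtr-ser', 'TX123SRV', 'TX123-WAP', 'TX456-rtr-ser']
print(router_for('TX123SRV', names))  # TX123-rtr-ser
```

Inside a transform, the returned name could then be looked up and its ping status checked before deciding whether to suppress the event; that part depends on the Zenoss API and is not shown here.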

Thanks,

Matthew

Matthew Kitchin (public)
170 posts since
Nov 12, 2009

1. Jul 27, 2010 11:12 AM (in response to Matthew Kitchin (public))
Re: Dependency/Layer 2 - revisited

Not even a whimper
I know I'm not the only one struggling with this....

jmp242
4,060 posts since
Mar 7, 2007

2. Jul 27, 2010 11:27 AM (in response to Matthew Kitchin (public))
Re: Dependency/Layer 2 - revisited

All I can say is +100000 but I don't know how to patch it in.

--
James Pulver
Information Technology Area Supervisor
LEPP Computer Group
Cornell University

Scott MacKenzie
18 posts since
Jul 22, 2010

3. Jul 27, 2010 11:28 AM (in response to Matthew Kitchin (public))
Re: Dependency/Layer 2 - revisited

Hi Matthew,

100% agree that you are *not* alone.

/suntzu

tlkhorses
62 posts since
Oct 30, 2008

4. Jul 27, 2010 3:22 PM (in response to Matthew Kitchin (public))
Re: Dependency/Layer 2 - revisited

I will add my yep to this again. And not trying to hijack but I don't see Layer 3 dependency either. On that, just to be able to manually set them would be nice.

tk

Gabe
94 posts since
Jun 29, 2008

5. Jul 27, 2010 6:27 PM (in response to Matthew Kitchin (public))
Re: Dependency/Layer 2 - revisited

The lack of this is THE MAIN feature gap holding me back from being able to implement Zenoss for real in the company I work for.

As long as Zenoss doesn't have options to set up parent/child dependencies between the things in its inventory and the things in the devices, you basically cannot sort out the alarms, so the monitoring is never complete.

And Zenoss will remain a thing I like but just keep testing out, because it doesn't match up with the other monitoring tools in use today.

twm1010
290 posts since
Jul 2, 2007

6. Jul 28, 2010 3:39 PM (in response to Gabe)
Re: Dependency/Layer 2 - revisited

Perhaps someone can explain alert suppression?

Matthew Kitchin (public)
170 posts since
Nov 12, 2009

7. Jul 28, 2010 3:50 PM (in response to twm1010)
Re: Dependency/Layer 2 - revisited

To me, it means what I am used to from Whatsup.
I have 10 devices at a site. 1 is a router. Every other device has the
router set as a 'parent'.
If the router or circuit goes down, I only get an alert about the
router. In Zenoss, I get 10 alerts.
Also, if all devices go down, then the router comes up, but device A is
still down, I then get an alert that device A is down since its parent
is now up.
That is what I need. I have about 110 sites with MPLS circuits. They go
through telco routers I don't control, so I cannot rely on network
modeling.
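
A minimal sketch of the parent rule described above, written as plain Python on a hand-maintained parent table rather than as Zenoss code. The function name `alerts_to_send` and the device names are hypothetical; the point is only to show the intended behaviour: a down device alerts unless one of its ancestors is also down.

```python
def alerts_to_send(down, parent_of):
    """Return the set of down devices that should raise an alert.

    down:      set of device names that currently fail their check
    parent_of: dict mapping a device name to its parent (absent = no parent)
    A device is suppressed when any ancestor is also down.
    """
    alerts = set()
    for dev in down:
        ancestor = parent_of.get(dev)
        seen = set()          # guards against an accidental loop in the table
        suppressed = False
        while ancestor is not None and ancestor not in seen:
            if ancestor in down:
                suppressed = True
                break
            seen.add(ancestor)
            ancestor = parent_of.get(ancestor)
        if not suppressed:
            alerts.add(dev)
    return alerts

parents = {'deviceA': 'router', 'deviceB': 'router'}
print(sorted(alerts_to_send({'router', 'deviceA', 'deviceB'}, parents)))  # ['router']
print(sorted(alerts_to_send({'deviceA'}, parents)))                       # ['deviceA']
```

Because the loop walks up the whole chain of parents, the same table also covers a string of routers in series, where only the one closest to the monitoring server should alert.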

tlkhorses
62 posts since
Oct 30, 2008

8. Jul 28, 2010 4:29 PM (in response to Matthew Kitchin (public))
Re: Dependency/Layer 2 - revisited

Same even if you own the whole network. It's a pain when a router close to the server goes down and you have other routers, switches, servers, etc. behind it. Tons of alerts. It would be nice to have that router alert and the rest go unknown. Some other products do that.

That being said, there are tons of great features in Zenoss. Having to get used to 3.0 but so far I like it.

tk

jude16
1 posts since
Jul 30, 2010

9. Jul 30, 2010 3:11 AM (in response to Matthew Kitchin (public))
Re: Dependency/Layer 2 - revisited

This is the same major problem I've met with Zenoss. And receiving hundreds of alerts can also "hide" another alert (you know what I mean: select all, Mark As Read, and voilà...).

I'm not a Python developer, but I gathered some scripts and rewrote parts of them for my needs.

This is how my Zenoss is organized: I've created a Location for each of my sites. The router name always starts with "Routeur".
The idea was to check whether the router of the site is down. If it is, the events for the other devices are set to "Acknowledged" (which means no alert for me).

But I faced another problem while doing this: when Zenoss sends its test pings, a non-router device can be pinged right before the location's router.
So I had to count the number of occurrences of the alert (script taken from somewhere on this site, but I don't remember where; all credit goes to the author, of course).

I still have to tweak it a little more to rewrite the summary without changing the dedupfields.

Hope it helps!!

Jude

In the transform of /Status/Ping:

# Build the dedup id. This is what Zenoss normally does to decide whether an event
# is a duplicate (another occurrence) of an existing event. We do it in this
# transform so that we can look up the count, which does not come with an
# incoming event.

dedupfields = [evt.device, evt.component, evt.eventClass, evt.eventKey, evt.severity]
if not evt.eventKey:
    dedupfields += [evt.summary]
mydedupid = '|'.join(map(str, dedupfields))

# Get the event details (including count) from the existing event in the MySQL database
em = dmd.Events.getEventManager()
em.cleanCache()
try:
    ed = em.getEventDetail(dedupid=mydedupid)
    mycount = ed.count
except Exception:
    mycount = 0

myDevice = dmd.Devices.findDevice(evt.device)
myDevName = myDevice.getDeviceName()
myDevClass = myDevice.getDeviceClassName()

# First occurrence: acknowledge events for non-router /Network and /Ping devices,
# because the router may simply not have been pinged yet
if mycount == 0:
    if not myDevName.startswith('Routeur'):
        if myDevClass.startswith('/Network') or myDevClass.startswith('/Ping'):
            evt.eventState = 1

# Repeat occurrence: find the router in the same Location and check its ping status
if mycount > 0:
    if not myDevName.startswith('Routeur'):
        # Build e.g. dmd.Locations['Site']['Building'] from the device's location path
        myLocation = myDevice.getLocationName()
        splitLoc = myLocation.split('/')
        locPath = ''
        for path in splitLoc:
            if path != '':
                locPath = locPath + "['" + path + "']"
        myCommand = "myDevicesList = dmd.Locations" + locPath + ".getSubDevices()"
        exec myCommand

        for myCurrentDevice in myDevicesList:
            myDeviceName = myCurrentDevice.getDeviceName()
            if myDeviceName.startswith('Routeur'):
                if dmd.Devices.findDevicePingStatus(myDeviceName) > 0:
                    # Router is down too: acknowledge this event
                    evt.eventState = 1
                    # evt.summary = myDeviceName + " is unreachable"
                else:
                    # Router is up: leave the event new so that it alerts
                    evt.eventState = 0

l2huynh
62 posts since
Aug 18, 2009

10. Oct 7, 2011 4:24 PM (in response to Matthew Kitchin (public))
Re: Dependency/Layer 2 - revisited

I'm not good with Python and I have not successfully implemented layer 2/3 dependency using a transform. What I did instead was hack the filter in the Alert Rule. For those who do not know: on the Alert Rule page, if you append "/manage" to the Zenoss URL, it takes you to the Zope page; click on Property, and the "where" textbox holds the MySQL filter for events. I added the following to suppress all events except those for the Zenoss local gateway if the local gateway goes down:

and (ipAddress = 'ip.of.master.device' or (ipAddress = 'ip.of.dependent.device' and (select count(*) from status where ipAddress = 'ip.of.master.device' and eventclass = '/Status/Ping' and severity = 5) = 0))
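
When there is one gateway with several devices behind it, the clause above has to be repeated by hand. A small sketch of generating that text is shown below. It only builds the string, using the same table and column names as the filter in this post; the helper name `dependency_clause` is hypothetical, the addresses are placeholder documentation addresses, and the result has not been tested against a Zenoss Alert Rule.

```python
def dependency_clause(master_ip, dependent_ips):
    """Build the extra 'where' text for one master device and its dependents.

    Inputs are assumed to be trusted IP strings; no SQL escaping is done.
    """
    # Sub-select is true when the master has no critical /Status/Ping event
    master_is_up = ("(select count(*) from status where ipAddress = '%s' "
                    "and eventclass = '/Status/Ping' and severity = 5) = 0" % master_ip)
    dependents = " or ".join("ipAddress = '%s'" % ip for ip in dependent_ips)
    return "and (ipAddress = '%s' or ((%s) and %s))" % (master_ip, dependents, master_is_up)

print(dependency_clause('192.0.2.1', ['192.0.2.10', '192.0.2.11']))
```

With a single dependent address the generated text reduces to the same condition as the hand-written filter, apart from one extra pair of parentheses.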

tlkhorses
62 posts since
Oct 30, 2008

11. Oct 7, 2011 6:13 PM (in response to l2huynh)
Re: Dependency/Layer 2 - revisited

Not sure that one would help much in my case l2huynh. In my case I need one of those for each router so that if the router in front of it goes down, the down router alerts and the others don't.

@jude16, where did you put that? And by location, I assume you grouped items at a location. Say I have a string of towers (hops) to get to the Zenoss server and edge. If tower 2 goes down, tower 3 and tower 4 would be unknown or acknowledged in your script and not alert.

OneLoveAmaru
73 posts since
May 30, 2011

12. Nov 14, 2011 11:09 AM (in response to Matthew Kitchin (public))
Re: Dependency/Layer 2 - revisited

Has anyone tried this: blogs/zenossblog/2008/08/26/tip-of-the-month-layer-3-dependency-checker

I have all of my layer 3 devices in Zenoss and a manual traceroute from the zenoss server shows all hops but the ./tracepath.py does not show all hops. Very weird.

In any case, if a router goes down, I get 100 messages about every IP service and server that is down. Quite annoying.

jeronimo
27 posts since
Dec 13, 2012

13. Dec 14, 2012 6:26 PM (in response to Matthew Kitchin (public))
Re: Dependency/Layer 2 - revisited

FYI, I actually think there's only one JIRA request for Layer 2. Maybe I missed something. Anyway, here it is if you want to vote for it. Maybe that will nudge things along.

http://jira.zenoss.com/jira/browse/ZEN-2678

j053ph4
290 posts since
Dec 19, 2008

14. Dec 17, 2012 4:45 PM (in response to Matthew Kitchin (public))
Re: Dependency/Layer 2 - revisited

I was just thinking about this and wondering about taking the following approach (code sample below):

The basic idea, as mentioned, is to suppress events caused by a network path failure. I'm wondering if that path can be deduced from the object database in the following fashion (as an event transform):

Given an event:

1) determine collector of the affected device (as another device in the dmd)
2) examine the os.routes() entries of both the device and its collector, in particular whether the "nexthops" refer to other known devices
3) determine whether any of the "nexthops" are shared between devices
4) if not, examine each list of hops for both devices and try to determine if they share common "hops"
5) if a list of shared "hops" can be determined, and any of those hops is "down", then suppress the event.
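
The five steps above can be tried out away from Zenoss on a plain dictionary of next hops. The sketch below is an illustration only: `reachable_hops`, `should_suppress`, the `next_hops` table, and the device names are all hypothetical, and the table would have to be filled from the modeled routes in a real transform.

```python
from collections import deque

def reachable_hops(start, next_hops):
    """Return every node reachable from `start` by following next hops (breadth-first)."""
    seen = set()
    queue = deque(next_hops.get(start, []))
    while queue:
        hop = queue.popleft()
        if hop in seen:
            continue  # already visited; also protects against routing loops
        seen.add(hop)
        queue.extend(next_hops.get(hop, []))
    return seen

def should_suppress(device, collector, next_hops, down):
    """True when a hop shared by the device and its collector is down."""
    shared = reachable_hops(device, next_hops) & reachable_hops(collector, next_hops)
    return any(hop in down for hop in shared)

next_hops = {
    'server1': ['site-router'],
    'site-router': ['core-router'],
    'collector': ['core-router'],
}
print(should_suppress('server1', 'collector', next_hops, {'core-router'}))  # True
print(should_suppress('server1', 'collector', next_hops, set()))            # False
```

One consequence of using only shared hops: a hop that just the device routes through (site-router in this table) is not in the shared set, so this rule alone would not suppress events when that hop is the one that failed.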

The following code is fragile, untested, and best regarded as "probably a bad idea". It almost certainly won't work as intended or as described above:

def findNextHops(dev):
    """
    Return the list of dmd devices that are next hops in the given device's routes.
    """
    hops = []
    for r in dev.os.routes():
        hop = r.nexthop()
        if hop is not None:
            hopdev = hop.device()
            if hopdev is not None and hopdev not in hops:
                hops.append(hopdev)
    return hops

def findCommon(d, c, depth=0):
    """
    Attempt to find hops in common between two nodes.
    Returns a (possibly empty) set. The depth cap of 5 is arbitrary and only
    there to stop endless recursion.
    """
    if depth > 5:
        return set()
    dhops = findNextHops(d)
    chops = findNextHops(c)
    inter = set(dhops).intersection(chops)
    if inter:
        return inter
    for e in dhops:
        for f in chops:
            found = findCommon(e, f, depth + 1)
            if found:
                return found
    return set()

try:
    # find the collector for the event device
    collector = dmd.Devices.findDevice(device.getPerformanceServerName())
    common = findCommon(device, collector)
    for c in common:
        if c.getPingStatus() > 0:
            evt._action = "drop"
except Exception:
    pass

I realize there's much not taken into account, but perhaps this can be a starting point?

Joseph
