Archived community.zenoss.org
17465 Views 14 Replies Latest reply: Dec 17, 2012 4:45 PM by j053ph4
Matthew Kitchin (public) Rank: Green Belt 170 posts since
Nov 12, 2009

Jul 22, 2010 10:28 AM

Dependency/Layer 2 - revisited

I figured since I upgraded to 3.0, I would throw these questions out there again. I have a central office with 120 remote locations, all connected by MPLS. The packets travel over telco hops we do not control. We monitor the router and several other devices at each facility. When the circuit or router goes down, we get flooded with alerts, and I have not been able to find a reasonable way to deal with it given that the intermediate hops are out of our control.

I have also tried, unsuccessfully, to use our naming standard to create transforms that suppress the extra alerts. Our routers are named something like TX123-rtr-ser and the other devices are named things like TX123SRV and TX123-WAP. I tried to write something that would check by name whether the router at a site was down before creating an event about something else being down. I could not make it work.

With our previous product (Whatsup), this was not automatic, but it was very simple: you had a drop-down box for dependencies where you could select the router that corresponded to a device. I have no problem setting something like this up manually. I just have to figure something out. This is the one thing left in Zenoss that is killing us. We get so swamped with alerts when a circuit is bouncing that we repeatedly overlook critical alerts, because we are on a BlackBerry and there are simply too many alerts to manage.
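
The name-based check described above can be pictured with a small standalone sketch. It is not a working Zenoss transform: it assumes the site code is the leading run of letters and digits (TX123), that routers are the devices with "-rtr" in their name, and that the currently down devices are handed in as a plain set. The helper names site_code, is_router and should_suppress are made up for illustration.

```python
import re

# Assumption: a site code is a run of letters followed by digits, e.g. "TX123".
SITE_RE = re.compile(r'^([A-Za-z]+\d+)')

def site_code(name):
    """Return the site prefix of a device name, or None if it does not match."""
    m = SITE_RE.match(name)
    return m.group(1).upper() if m else None

def is_router(name):
    """Assumption: routers are the devices with '-rtr' in their name."""
    return '-rtr' in name.lower()

def should_suppress(device_name, down_devices):
    """True if device_name is not a router and a router at its site is down."""
    site = site_code(device_name)
    if site is None or is_router(device_name):
        return False
    return any(is_router(d) and site_code(d) == site for d in down_devices)

# Made-up example data: one site fully down, one unrelated server down.
down = {'TX123-rtr-ser', 'TX123SRV', 'TX123-WAP', 'OK456SRV'}
for name in sorted(down):
    print(name, 'suppress' if should_suppress(name, down) else 'alert')
```

In a real transform the set of down devices would have to come from Zenoss itself; how to obtain it reliably is exactly the open question in this thread.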

Thanks,

Matthew

  • jmp242 ZenossMaster 4,060 posts since
    Mar 7, 2007
    Re: Dependency/Layer 2 - revisited

    All I can say is +100000 but I don't know how to patch it in.

     

    --

    James Pulver

    Information Technology Area Supervisor

    LEPP Computer Group

    Cornell University

  • Scott MacKenzie Rank: White Belt 18 posts since
    Jul 22, 2010
    Re: Dependency/Layer 2 - revisited

    Hi Matthew,

     

    100% agree that you are *not* alone.

     

    /suntzu

  • tlkhorses Rank: White Belt 62 posts since
    Oct 30, 2008
    Re: Dependency/Layer 2 - revisited

    I will add my yep to this again. Not trying to hijack, but I don't see Layer 3 dependency either. Just being able to set dependencies manually would be nice.

     

    tk

  • Gabe Rank: Green Belt 94 posts since
    Jun 29, 2008
    Re: Dependency/Layer 2 - revisited

    The lack of this is THE MAIN thing holding me back from being able to implement Zenoss for real in the company I work for.

    As long as Zenoss doesn't have options to set up parent/child dependencies between the items in its inventory and the things on those devices, you basically cannot sort out the alarms, and the monitoring is not complete.

    So Zenoss will remain a thing I like but just keep testing out, because it doesn't match up with other monitoring tools in use today.

  • twm1010 ZenossMaster 290 posts since
    Jul 2, 2007
    6. Jul 28, 2010 3:39 PM (in response to Gabe)
    Re: Dependency/Layer 2 - revisited

    Perhaps someone can explain alert suppression?

  • tlkhorses Rank: White Belt 62 posts since
    Oct 30, 2008
    Re: Dependency/Layer 2 - revisited

    It's the same even if you own the whole network. It's a pain when a router close to the server goes down and you have other routers, switches, servers, etc. behind it. Tons of alerts. It would be nice to have that router alert and the rest go to unknown. Some other products do that.

    That being said, there are tons of great features in Zenoss. I'm still getting used to 3.0, but so far I like it.

     

    tk

  • jude16 Newbie 1 posts since
    Jul 30, 2010
    Re: Dependency/Layer 2 - revisited

    This is the same major problem I've run into with Zenoss. And receiving hundreds of alerts can also "hide" another alert (you know what I mean... select all, Mark As Read and voilà...).

    I'm not a Python developer, but I gathered some scripts and rewrote parts of them for my needs.

    This is how my Zenoss is organized: I've created a Location for each of my sites. The router name always starts with "Routeur".

    The idea was to check whether the router of the site was down or not. If it is, the other devices are set to "Acknowledged" (which means no alert for me).

    But I faced another problem while doing this: as Zenoss sends its test pings, a non-router device could be pinged right before the location's router.

    So I had to count the number of alerts (script taken from somewhere on this site, but I don't remember where; all credit goes to the author, of course). On the first occurrence the event is acknowledged provisionally; on repeat occurrences the router's ping status decides.

     

    I have to tweak a little more to rewrite the summary without changing the dedupfields.

     

    Hope it helps!!

     

    Jude

     

    In Transform of /Status/Ping:

     

    # Create the dedup id. This is what Zenoss normally does to the event to ascertain if
    # it is a duplicate (another occurrence) of an existing event. We are doing it in this
    # transform to be able to reference the count variable, which does not come with an
    # incoming event.

     

     

     

    dedupfields = [evt.device, evt.component, evt.eventClass, evt.eventKey, evt.severity]
    if not evt.eventKey:
        dedupfields += [evt.summary]
    mydedupid = '|'.join(map(str, dedupfields))

     

    # Get the event details (including count) from the existing event in the MySQL database
    em = dmd.Events.getEventManager()
    em.cleanCache()
    try:
        ed = em.getEventDetail(dedupid=mydedupid)
        mycount = ed.count
    except Exception:
        # No existing event with this dedup id: treat this as the first occurrence
        mycount = 0

     

    myDevice = dmd.Devices.findDevice(evt.device)

    # Only non-router devices are handled; skip the checks if the device lookup found nothing
    if myDevice is not None and not myDevice.getDeviceName().startswith('Routeur'):
        myDevClass = myDevice.getDeviceClassName()

        if mycount == 0:
            # First occurrence: acknowledge, since the site router may not have been pinged yet
            if myDevClass.startswith('/Network') or myDevClass.startswith('/Ping'):
                evt.eventState = 1
        else:
            # Repeat occurrence: walk down dmd.Locations to this device's Location
            # (same traversal the exec'd string used to build, without exec)
            myLocationObj = dmd.Locations
            for path in myDevice.getLocationName().split('/'):
                if path != '':
                    myLocationObj = myLocationObj[path]

            for myCurrentDevice in myLocationObj.getSubDevices():
                myDeviceName = myCurrentDevice.getDeviceName()
                if myDeviceName.startswith('Routeur'):
                    if dmd.Devices.findDevicePingStatus(myDeviceName) > 0:
                        # Site router is down: keep the event acknowledged
                        evt.eventState = 1
                        # evt.summary = myDeviceName + " is unreachable"
                    else:
                        evt.eventState = 0

  • l2huynh Rank: White Belt 62 posts since
    Aug 18, 2009
    Re: Dependency/Layer 2 - revisited

    I'm not good with Python and I have not successfully implemented layer 2/3 dependency using a transform. The way I did it was by hacking the filter in the Alert Rule. For those who do not know: on the Alert Rule page, if you append "/manage" to the Zenoss URL, it takes you to the Zope page; click on Property, and the "where" textbox holds the MySQL filter for events. I added the following to suppress all events except those for the Zenoss local gateway if the local gateway goes down:

     

    and (ipAddress = 'ip.of.master.device' or (ipAddress = 'ip.of.dependent.device' and (select count(*) from status where ipAddress = 'ip.of.master.device' and eventclass = '/Status/Ping' and severity = 5) = 0))
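
Writing such a filter by hand for several dependent devices gets tedious, so here is a small sketch that generates a fragment in the same shape as the one quoted above. The function name dependency_clause and the example addresses are made up; the table and column names are simply copied from the filter above, and nothing here has been tried against a real Alert Rule.

```python
def dependency_clause(master_ip, dependent_ips):
    """Build one 'where' fragment shaped like the filter quoted above.

    Values are pasted in unescaped, so only feed it trusted addresses.
    """
    master_down = (
        "(select count(*) from status where ipAddress = '%s' "
        "and eventclass = '/Status/Ping' and severity = 5)" % master_ip
    )
    dependents = " or ".join("ipAddress = '%s'" % ip for ip in dependent_ips)
    return "(ipAddress = '%s' or ((%s) and %s = 0))" % (
        master_ip, dependents, master_down)

# Made-up example: one router and the two devices behind it.
clause = dependency_clause('192.0.2.1', ['192.0.2.10', '192.0.2.11'])
print("and " + clause)
```

For more than one router, the generated fragments would have to be joined with "or" inside one outer pair of parentheses; whether the Alert Rule field copes with a filter that long is an open question.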

  • tlkhorses Rank: White Belt 62 posts since
    Oct 30, 2008
    11. Oct 7, 2011 6:13 PM (in response to l2huynh)
    Re: Dependency/Layer 2 - revisited

    Not sure that one would help much in my case, l2huynh. I would need one of those for each router, so that if the router in front of it goes down, the down router alerts and the others don't.

    @jude16, where did you put that? And by location, I assume you mean you grouped items at a location. Say I have a string of towers (hops) to get to the Zenoss server and the edge. If tower 2 goes down, tower 3 and tower 4 would be unknown or acknowledged in your script and not alert.
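
The tower chain can be sketched with a manually maintained parent map, in the spirit of the per-device dependency drop-down mentioned at the top of the thread. This is a standalone illustration, not Zenoss code: the PARENT table, the device names and the functions root_cause and should_alert are all hypothetical.

```python
# Hypothetical, manually maintained map: device -> the device it depends on.
PARENT = {
    'tower4': 'tower3',
    'tower3': 'tower2',
    'tower2': 'tower1',
    'tower1': None,   # directly reachable from the monitoring server
}

def root_cause(device, down, parent=PARENT):
    """Return the furthest-upstream down device on the path from `device`
    toward the monitoring server, or None if `device` itself is not down."""
    if device not in down:
        return None
    cause = device
    seen = set()          # guards against an accidental loop in the map
    node = device
    while node is not None and node not in seen:
        seen.add(node)
        if node in down:
            cause = node
        node = parent.get(node)
    return cause

def should_alert(device, down, parent=PARENT):
    """Alert only for a down device that has no down ancestor."""
    return root_cause(device, down, parent) == device

down = {'tower2', 'tower3', 'tower4'}
for d in sorted(down):
    print(d, 'alert' if should_alert(d, down) else 'suppress')
```

With tower 2, 3 and 4 all reported down, only tower 2 alerts; the other two are attributed to it.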

  • OneLoveAmaru Rank: White Belt 73 posts since
    May 30, 2011
    Re: Dependency/Layer 2 - revisited

    Has anyone tried this: blogs/zenossblog/2008/08/26/tip-of-the-month-layer-3-dependency-checker

     

    I have all of my layer 3 devices in Zenoss and a manual traceroute from the zenoss server shows all hops but the ./tracepath.py does not show all hops. Very weird.

     

    In any case, if a router goes down, I get 100 messages about every IP service and server that is down. Quite annoying.

  • jeronimo Rank: White Belt 27 posts since
    Dec 13, 2012
    Re: Dependency/Layer 2 - revisited

    FYI, I actually think there's only one JIRA request for Layer 2.  Maybe I missed something.  Anyway, here it is if you want to vote for it.  Maybe that will nudge things along.

     

    http://jira.zenoss.com/jira/browse/ZEN-2678

  • j053ph4 Rank: Green Belt 290 posts since
    Dec 19, 2008
    Re: Dependency/Layer 2 - revisited

    I was just thinking about this and wondering about taking the following approach (code sample below):

     

    The basic idea is, as mentioned, to suppress events caused by a network path failure. I'm wondering if that path can be deduced from the object database in the following fashion (as an event transform):

     

    Given an event:

     

    1) determine collector of the affected device (as another device in the dmd)

    2) examine the os.routes() entries of both the device and its collector, in particular whether the "nexthops" refer to other known devices

    3) determine whether any of the "nexthops" are shared between devices

    4) if not, examine each list of hops for both devices and try to determine if they share common "hops"

    5) if a list of shared "hops" can be determined, and any of those hops is "down", then suppress the event.
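
Steps 2 to 5 can be tried out on plain data before touching the object database. The sketch below assumes the next-hop information has already been flattened into a dictionary; NEXT_HOPS, the node names and the three functions are hypothetical stand-ins, not Zenoss objects. A breadth-first walk with a visited set is used so that routing loops cannot cause endless recursion.

```python
from collections import deque

# Hypothetical next-hop table: node -> list of next-hop nodes.
NEXT_HOPS = {
    'server-a':  ['site-rtr'],
    'site-rtr':  ['core-rtr'],
    'collector': ['core-rtr'],
    'core-rtr':  [],
}

def reachable_hops(start, next_hops):
    """Breadth-first walk of next hops; returns every hop reachable from start."""
    seen = set()
    queue = deque(next_hops.get(start, []))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue
        seen.add(node)
        queue.extend(next_hops.get(node, []))
    return seen

def shared_hops(device, collector, next_hops):
    """Hops that appear on the next-hop walks of both device and collector."""
    return reachable_hops(device, next_hops) & reachable_hops(collector, next_hops)

def should_suppress(device, collector, next_hops, down):
    """Step 5: suppress when any shared hop is currently down."""
    return any(hop in down for hop in shared_hops(device, collector, next_hops))

print(shared_hops('server-a', 'collector', NEXT_HOPS))
print(should_suppress('server-a', 'collector', NEXT_HOPS, down={'core-rtr'}))
```

Note that under these rules a hop that is only on the device's side (site-rtr in this made-up table) is not a shared hop, so its failure would not suppress anything; that case would need the kind of per-site check discussed earlier in the thread.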

     

     

    The following code is fragile, untested and best regarded as "probably a bad idea". It almost certainly won't work as intended or as described above:

     

     

    def findNextHops(dev):
        """Return the list of dmd devices that are next hops in dev's routes."""
        hops = []
        for r in dev.os.routes():
            hop = r.nexthop()
            if hop is not None and hop.device() is not None:
                if hop.device() not in hops:
                    hops.append(hop.device())
        return hops

    def findCommon(d, c, depth=0, maxDepth=5):
        """Attempt to find hops in common between 2 nodes (empty set if none).

        maxDepth is an added guard against endless recursion on routing loops.
        """
        dhops = findNextHops(d)
        chops = findNextHops(c)
        inter = set(dhops).intersection(chops)
        if inter or depth >= maxDepth:
            return inter
        for e in dhops:
            for f in chops:
                # keep searching the other pairs instead of returning the first result
                found = findCommon(e, f, depth + 1, maxDepth)
                if found:
                    return found
        return set()

    try:
        # find the collector for the event device
        collector = dmd.Devices.findDevice(device.getPerformanceServerName())
        common = findCommon(device, collector)
        for c in common:
            if c.getPingStatus() > 0:
                evt._action = "drop"
    except Exception:
        # leave the event untouched if any lookup fails
        pass

     

     

    I realize there's much not taken into account, but perhaps this can be a starting point?

     

    Joseph
