Archived community.zenoss.org
164893 Views 24 Replies Latest reply: Feb 10, 2010 11:44 AM by ashvillerob
coyote Rank: Green Belt 127 posts since
Jan 31, 2008

Jul 28, 2009 2:16 PM

New 36G, 16CPU server

The Senior Linux admin gave me a new Zenoss server with 16 CPUs, 36G of RAM, and lots of disk space.

I have Zenoss 2.4.2 installed and am trying to monitor 1800 devices.

Zenoss uses less than 12 GB of RAM, and load is always less than 2. I have configured Zenhub with 8 workers.

The UI is slow and sometimes it times out. I have looked for a performance tuning guide but have not found one. There are lots of tuning options for each Zenoss daemon but I would really like to know what they do and how they affect each other.

Basically: what is the best resource for performance tuning Zenoss to take advantage of this server?

Thank you
  • mwcotton Rank: Brown Belt 563 posts since
    Apr 23, 2008
    1. Jul 28, 2009 5:04 PM (in response to coyote)
    RE: New 36G, 16CPU server
    What's your disk config? I have found the biggest thing that slows the server down is wait on I/O. If you've got all that extra RAM, maybe you should look at configuring a RAM drive and copying all your RRDs up there, and of course occasionally rsyncing them back to the hard disks. I bet the system would fly then.
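    A minimal sketch of that RAM-drive idea (the mount point, size, and backup path are examples only, assuming a default /opt/zenoss install):

    # /etc/fstab: keep the RRDs on tmpfs (size is a guess; the RRDs must fit)
    tmpfs  /opt/zenoss/perf  tmpfs  rw,size=4g  0 0

    # root's crontab: rsync the RAM copy back to real disk every 15 minutes
    */15 * * * * rsync -a /opt/zenoss/perf/ /var/backups/zenoss-perf/

    Note that tmpfs is lost on reboot, so the disk copy has to be rsynced back into place before Zenoss starts.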
  • mwcotton Rank: Brown Belt 563 posts since
    Apr 23, 2008
    3. Jul 28, 2009 7:43 PM (in response to coyote)
    RE: New 36G, 16CPU server
    You need to rebuild. Think about all the writes you're going to have to do to those RRDs; I would go with RAID 10, and get even more drives if possible.
    Still, the RAM disk idea would be the ultimate speed increase.
  • Andreas Trawoeger Rank: Green Belt 109 posts since
    Apr 10, 2008
    5. Jul 29, 2009 5:47 AM (in response to coyote)
    Re: New 36G, 16CPU server

    "coyote" wrote:

     

    The Senior Linux admin gave me a new Zenoss server with 16 CPUs, 36G of ram and lots of disk space.
    I have Zenoss 2.4.2 installed and am trying to monitor 1800 devices.
    Zenoss uses less that 12Gigs of ram and load is always less than 2. I have configured Zenhub with 8 workers.


    There are a couple of possibilities for tuning Zenoss performance, but you have to be a bit careful with them.

    Mount /tmp as tmpfs in /etc/fstab:

    tmpfs /tmp     tmpfs rw 0 0

    Pro: Speeds up the renderserver.
    Cons: /tmp is used by zenbackup too. So either make /tmp large enough, or change the temp-dir setting of zenbackup; otherwise your backups will start failing.

    Modify your VM settings in /etc/sysctl.conf:

    vm.vfs_cache_pressure = 0
    vm.swappiness = 20
    vm.overcommit_memory = 2
    vm.overcommit_ratio = 75
    vm.dirty_background_ratio = 1
    vm.dirty_ratio = 100
    vm.dirty_expire_centisecs = 3000
    vm.dirty_writeback_centisecs = 500

    Modify your I/O scheduling in /etc/rc.local:

    echo 0 > /sys/block/cciss!c0d0/queue/iosched/front_merges
    echo 150 > /sys/block/cciss!c0d0/queue/iosched/read_expire
    echo 1500 > /sys/block/cciss!c0d0/queue/iosched/write_expire

    Pro & Cons: Tuning VM and I/O settings is close to voodoo and depends very much on your hardware. Be prepared for a lot of trial and error before your performance gets any better.

    Increase your cache and pool size in zope.conf:

    <zodb_db temporary>
        # Temporary storage database (for sessions)
        <temporarystorage>
          name temporary storage for sessioning
        </temporarystorage>
        mount-point /temp_folder
        container-class Products.TemporaryFolder.TemporaryContainer
        cache-size 500000
        pool-size 16
    </zodb_db>

    <zodb_db main>
      mount-point /
      # ZODB cache, in number of objects
      cache-size 500000
      pool-size 16
      <zeoclient>
        server localhost:8100
        storage 1
        name zeostorage
        var $INSTANCE/var

        # ZEO client cache, in bytes
        cache-size 10000MB
        # Uncomment to have a persistent disk cache
        #client zeo1
      </zeoclient>
    </zodb_db>

    Pro: Will improve Zenoss speed.
    Cons: You have to be really careful if you want to do anything via zendmd.

    The likelihood of caching conflicts increases with your zenhub worker count and Zope pool-size. So if you have any zendmd script that does something like this:

    # Resetting IPs for all devices
    for d in dmd.Devices.getSubDevices():
        # Reset IP and commit the change
        d.setManageIp()
        commit()

    Rewrite it to:

    # Resetting IPs for all devices
    for d in dmd.Devices.getSubDevices():
        d.setManageIp()
    # Committing changes for all devices at once
    commit()

    Otherwise your changes will randomly end up in a "commit failed" error.
  • Andreas Trawoeger Rank: Green Belt 109 posts since
    Apr 10, 2008
    6. Jul 29, 2009 6:09 AM (in response to Andreas Trawoeger)
    Re: New 36G, 16CPU server

    "mwcotton" wrote:

     

    Whats your disk config? I have found the biggest thing that slows the server down is wait on i/o . If you got all that extra ram maybe you should look at configuring a ram drive and copy all your rrd's up there, of course occasionly rsync them back to the hard disks. I bet the system would fly then.


    You can do that by simply increasing your vm.dirty_expire_centisecs and vm.dirty_writeback_centisecs settings. With these settings you can tell Linux to keep filesystem modifications in RAM longer before they are written to disk.

    Done right, this will decrease the I/O load on your disks, leading to better performance. The disadvantage of this approach is that when data does get written back to disk, there is a lot of it to write, which can temporarily block other processes.

    So sometimes it's better to do the complete opposite and decrease the values. That leads to more I/O activity overall, but the load is spread better over time and doesn't block other processes as much.
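    Both directions can be tried on the fly with sysctl -w before committing anything to /etc/sysctl.conf. A sketch; the numbers are only starting points to experiment from:

    # keep modifications in RAM longer (fewer, larger flushes)
    sysctl -w vm.dirty_expire_centisecs=6000
    sysctl -w vm.dirty_writeback_centisecs=1500

    # or the opposite: flush early and often (smaller, steadier flushes)
    sysctl -w vm.dirty_expire_centisecs=500
    sysctl -w vm.dirty_writeback_centisecs=100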
  • fdeckert Rank: Green Belt 110 posts since
    Jul 2, 2008
    7. Jul 31, 2009 7:48 AM (in response to Andreas Trawoeger)
    Re: New 36G, 16CPU server
    Zenoss is written in Python, and a Python process is not multi-CPU aware.

    Having 16 cores will unfortunately not make the web interface any faster, as it runs in a single Python process (zopectl) and will not use more than one core.

    Regarding the sluggish web interface, we felt the same and tried many tweaks:

    1. Set up an Apache front end as a reverse-proxy cache (a sample config is sketched after this list)

    2. For remote users, force HTTP compression in lib/python/ZPublisher/HTTPResponse.py:
    use_HTTP_content_compression = 1

    3. Increase zope.conf params:
    zserver-threads 200
    python-check-interval 1000 (not sure this one helps)
    <zodb_db main>
    mount-point /
    cache-size 100000
    pool-size 250
    <zeoclient>
    ...
    cache-size 1000MB
    client zeo1
    </zeoclient>
    </zodb_db>

    4. Use the Python accelerator "psyco" (32-bit Python only)
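    For item 1, a minimal Apache front end could look like the sketch below. The port, server name, and module choices are assumptions (stock Zope port 8080, mod_deflate for compression, mod_cache/mod_disk_cache for the caching part), not our exact config:

    <VirtualHost *:80>
        ServerName zenoss.example.com

        # Compress text responses for remote users (mod_deflate)
        AddOutputFilterByType DEFLATE text/html text/css application/javascript

        # Cache proxied responses on disk (mod_cache + mod_disk_cache)
        CacheEnable disk /

        # Pass everything through to the Zope instance
        ProxyPass        / http://localhost:8080/
        ProxyPassReverse / http://localhost:8080/
    </VirtualHost>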

    With these 4 tweaks we got a better Zenoss experience, but it's still not light speed :-(

    We're thinking of moving to a new server with a very high-end single CPU.

    --
    Florian Deckert
    SopraGroup - France
  • spyder40 Rank: White Belt 45 posts since
    Apr 17, 2008
    8. Jul 31, 2009 9:14 AM (in response to fdeckert)
    Re: New 36G, 16CPU server
    From what I'm seeing, a good chunk of the slowness is in MySQL and the code around inserts/deletes/etc. Zenoss doesn't appear to be very efficient once the database grows. The history delete is killing us (32M rows currently), and I think it's just doing a sequential scan through the tables. A sharp DBA needs to take a look at the schema and code.

    RAID 5 is a no-no for databases; there is too much of a write penalty due to the parity writes. RAID 10 is much quicker.
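    A possible stopgap for the history delete until it is fixed: delete in bounded batches so the table is not locked for one giant pass. A sketch against the events database (the history table and lastTime column are assumptions from the 2.x schema; verify against yours, and back up first):

    # prune history older than 90 days, 10000 rows per pass; repeat until 0 rows affected
    mysql -u zenoss -p events -e \
      "DELETE FROM history WHERE lastTime < UNIX_TIMESTAMP(NOW() - INTERVAL 90 DAY) LIMIT 10000;"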
  • Chris Krough Rank: White Belt 16 posts since
    Apr 22, 2008
    9. Jul 31, 2009 10:29 AM (in response to spyder40)
    Re: New 36G, 16CPU server
    That RAID 5 is going to be a bottleneck for you. As others have said, RAID10 is a better idea for the stuff in $ZENHOME/perf/
  • jbaird Rank: Green Belt 166 posts since
    Sep 18, 2007
    10. Jul 31, 2009 1:36 PM (in response to Chris Krough)
    Re: New 36G, 16CPU server
    Yep, get rid of the RAID5. RAID10 is much better suited because of all of the writes that Zen must do to create RRDs.
  • beanfield Rank: Green Belt 161 posts since
    Apr 16, 2008
    11. Sep 3, 2009 3:17 PM (in response to jbaird)
    Re: New 36G, 16CPU server

    "coyote" wrote:

     

    I have 6 SATA drives in a RAID 5 configuration, not the best performance but it is what the admin wants.

    I can back off the monitoring and performance collection if I have too. sda8 is where Zenoss is installed.

    # iostat
    Linux 2.6.18-128.1.16.el5 (zmaster) 07/28/2009

    avg-cpu: %user %nice %system %iowait %steal %idle
    0.75 0.00 0.08 0.01 0.00 99.15

    Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn
    sda 25.28 3.24 632.88 5326012 1040210946
    sda1 0.00 0.02 0.00 25980 602
    sda2 0.00 0.00 0.00 1877 0
    sda3 0.31 0.40 5.03 652578 8261584
    sda4 0.00 0.00 0.00 12 0
    sda5 6.52 0.21 185.22 336968 304432832
    sda6 0.15 1.50 2.63 2465424 4322896
    sda7 1.08 0.00 261.88 6096 430430360
    sda8 17.21 1.11 177.97 1828235 292514760
    sda9 0.00 0.00 0.15 3619 246896
    sda10 0.00 0.00 0.00 4607 1016



    run "vmstat 1" and check the "wa" column (should be 3rd column in from the right). It should show your %i/o wait every second. It's shown above, but only for one period in time. If your i/o wait is staying relatively low (low single digits), then it's likely not a disk i/o problem on that server. Your %0.01 shown above is nothing....but the server could have just been between polling for that moment.

    Before doing too much tweaking of Zope, I'd try to clean it up and re-index. This is what I do as the zenoss user (note: the spacing is important):

    # zendmd

    # Fix deviceSearch
    brains = dmd.Devices.deviceSearch()
    for d in brains:
        try:
            bah = d.getObject()
        except Exception:
            print "Removing non-existent device from deviceSearch: " + d.getPath()
            dmd.Devices.deviceSearch.uncatalog_object(d.getPath())
    commit()
    
    # Fix componentSearch
    brains = dmd.Devices.componentSearch()
    for d in brains:
        try:
            bah = d.getObject()
        except Exception:
            print "Removing non-existent device from componentSearch: " + d.getPath()
            dmd.Devices.componentSearch.uncatalog_object(d.getPath())
    commit()
    
    dmd.Devices.reIndex()
    commit()
    
    reindex()
    commit()
    


    You may also want to follow this thread on setting up a graph for the zenhub queue http://forums.zenoss.com/viewtopic.php?t=9429&highlight=zenhub+workers

    You can try and bump your workers even more, but it sounds like you have plenty. I'm running 7 on an 8 core box with ~1200 devices.
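    For anyone looking for where the worker count lives: it goes in $ZENHOME/etc/zenhub.conf and takes effect after a zenhub restart. A sketch (the option name below is what 2.4/2.5 uses; check your version):

    # $ZENHOME/etc/zenhub.conf
    workers 7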
  • jenkinskj Rank: Green Belt 330 posts since
    Jul 30, 2009
    12. Dec 3, 2009 12:14 PM (in response to spyder40)
    Re: New 36G, 16CPU server

    We encountered issues with history delete as well. I looked through the Zenoss ticket queue and heard that a fix was slated for 2.5.2. Can someone tell me when this will be fixed? We are getting ready to roll out a production deployment of Zenoss, and I would like to install the latest stable version so I do not have to worry about upgrading after the install.

    One more item: the problem we run into with MySQL is that once the data grows, it is difficult to reclaim space even after history delete, optimize, etc. are run. We were informed that an export and import of the database is required to reclaim disk space. Has anyone run into this issue before?


  • jenkinskj Rank: Green Belt 330 posts since
    Jul 30, 2009
    13. Dec 3, 2009 1:43 PM (in response to coyote)
    Re: New 36G, 16CPU server

    I have a few questions regarding your environment. We plan to deploy at a similar size, and I am trying to leverage any lessons learned here.

     

    • Considering the number of monitors you have, how big is your MySQL database?
    • How much history do you have Zenoss set to keep (i.e. Events and ZenRRD)?
    • What is your strategy for backup and recovery (Hot Backup / Cold Backup)?
    • Do you have a sample of your my.cnf you can provide?

     

    I would appreciate any information you can provide.

     

    Thank you,

    - Ken

  • rbilder Rank: White Belt 36 posts since
    Jun 25, 2008
    14. Dec 3, 2009 9:41 PM (in response to coyote)
    Re: New 36G, 16CPU server

    It sounds like there are a couple of folks, at least, that are looking for performance/tuning ideas.  I'll try to give a run-down of what we have running, and will do my best to field follow up questions.

     

    Our "Primary" zenoss server -- runs pretty much everything -- MySql, Zope DB, UI, monitoring, modeling (nightly from cron), performance collection.

    This server is collecting perf data on about 250 devices.

    It is Ubuntu, quad 2.5 GHz, with 8 GB. It also has SAS disks, just mirrored, no parity RAID.

    All traps, etc. come directly into this server, none to the collectors.

     

    Our "Large Collector" -- collects for about 1400 devices

    Also Ubuntu, but a dual-core 3.2 GHz, also with 8 GB. This is an older box with SCSI disks.

    It also does modeling each night (I run modeler from cron--not as a daemon--once/day is enough for us)

     

    Since this thing pretty much runs collection around the clock, I went for spindles.  Like others have said, RAID 5 is really going to put a damper on things...

    I went even a step further....

    Not much actually lands in $ZENHOME/perf itself--it's on the same file system as the MySQL, Zope, code, root, boot, etc.

    I set up 8 smaller disks as (4) 2-disk bundles named:

      /perf1
      /perf2
      /perf3
      /perf4

     

    Zenoss looks in $ZENHOME/perf for the RRD directories, but I have them linked over to the /perfN disks, splitting devices somewhat equally. So, when writing/updating RRD tables, performance looks something like this (sar -d output; the columns to watch are writes/sec, average wait, and %util):

    07:56:59 PM dev104-16    541.33      0.00   4698.67      8.68    108.23    197.79      1.85    100.00
    07:56:59 PM dev104-32    478.33      0.00   4288.00      8.96    108.06    224.59      2.09    100.00
    07:56:59 PM dev104-48    571.00      0.00   4805.33      8.42    107.82    189.89      1.75    100.00
    07:56:59 PM dev104-64    719.67      0.00   6525.33      9.07    108.16    145.74      1.39    100.00
     

    The SCSI disks will get up to about 7-8k writes/sec; the SAS disks will do 10-12k writes/sec on the primary collector.
    Every interval, zenperfsnmp reports "Sent 129190 OID requests". I assume that means my server is updating 129,000 RRD tables every five minutes. The cycle time for the performance polling is anywhere from 60 secs to about 200 secs. With RRD, a "normal" cycle takes 60 secs, but when you have to roll the hourly results up, that is what takes about 200 secs. I am pretty sure I could add 300-500 smaller routers to this collector before needing to add the next one.
    We have 3 other collectors in the field (not really sure why I bother, though...) that collect 146, 73, and 21 devices. These servers also run other functions and are not dedicated to Zenoss; we gather device configurations nightly (in case of device failure) and run other network functions on these 3 servers.
    For the .conf files, we pretty much just specify the monitor and localhost values with some exceptions.
    On each collector, I run zenmodeler nightly from cron (not as a daemon) with parallel set to 4; a sample cron entry is sketched below. It speeds up modeling. There is no real logic behind the number; it's faster than 1 and seems to work fine.
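    A sketch of that cron entry, assuming a default /opt/zenoss install and that the option is spelled --parallel on your version:

    # root's crontab: model everything nightly at 02:00 as the zenoss user
    0 2 * * * su - zenoss -c "/opt/zenoss/bin/zenmodeler run --parallel 4"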
    I have a bug ticket in, so right now zenstatus has parallel at 150. But the default of 50 is fine if you are checking fewer than 50 TCP/UDP connection-status items for things like SMTP, HTTP, and the like. zenstatus, at least in 2.5.1, needs its parallel setting to be higher than the number of items you are polling. This should be temporary...
    On the main collector, I have 4 zenhub workers running, somewhat because it's a 4-CPU box, but it seems to work well with 4 collectors. When zenping starts up, it sends zenhub off to map the network topology, and really the workers are doing the bulk of the work. If I had more processors, I'd be inclined to run it at "number of collectors" + 1.
    I also have a few small tweaks to cache size and such but probably nothing worth really worrying about.
    For the MySQL DB, I'd probably run it on separate spindles, but since we don't collect that much on that server, I have not bothered.
    Also, you may want to google for "innodb_file_per_table". I wish I had done this a while ago, and I would like to recommend that the Zenoss stack include:
    [mysqld]
    innodb_file_per_table
    Since I have an existing DB, I'll need to dump the data, stop the DB, delete the big ibdata1 file, then reload. When tables are created with this option, a DB/table.ibd file is created per table. My understanding is that this would allow you to run OPTIMIZE TABLE and free up disk space if your event history storage needs decrease. I have not finished testing in our lab to confirm for myself that OPTIMIZE TABLE does indeed free up the file system space. But Zenoss does run fine with the option, and it does indeed create a file per table.
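    A sketch of that rebuild, assuming the event database is named events, that it is the only InnoDB database on the box, and that the MySQL data directory is /var/lib/mysql (take a full backup first):

    zenoss stop
    mysqldump -u root -p events > /tmp/events.sql
    mysql -u root -p -e "DROP DATABASE events;"
    service mysqld stop
    rm /var/lib/mysql/ibdata1 /var/lib/mysql/ib_logfile*
    # add innodb_file_per_table under [mysqld] in /etc/my.cnf, then:
    service mysqld start
    mysql -u root -p -e "CREATE DATABASE events;"
    mysql -u root -p events < /tmp/events.sql
    zenoss start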
    I encourage anyone who made it this far to comment. This is what I am currently running, but I am sure there is room for improvement or change. I hope this helps you with your installs. I've gotten a lot of excellent help from the group here, and I hope it helps you two get going. There are definitely folks here with more experience with Zenoss, and with more Python programming experience, than I have. Keep the questions coming; I'll try to help when I can...
    Regards,
    --Randy