Archived from community.zenoss.org
164868 Views 24 Replies Latest reply: Feb 10, 2010 11:44 AM by ashvillerob
coyote Rank: Green Belt 127 posts since
Jan 31, 2008

Jul 28, 2009 2:16 PM

New 36G, 16CPU server

The senior Linux admin gave me a new Zenoss server with 16 CPUs, 36 GB of RAM, and lots of disk space.

I have Zenoss 2.4.2 installed and am trying to monitor 1800 devices.

Zenoss uses less than 12 GB of RAM and the load is always less than 2. I have configured zenhub with 8 workers.

The UI is slow and sometimes it times out. I have looked for a performance tuning guide but have not found one. There are lots of tuning options for each Zenoss daemon, but I would really like to know what they do and how they affect each other.

Basically: what is the best resource for performance tuning Zenoss to take advantage of this server?

Thank you
  • mwcotton Rank: Brown Belt 563 posts since
    Apr 23, 2008
    1. Jul 28, 2009 5:04 PM (in response to coyote)
    RE: New 36G, 16CPU server
What's your disk config? I have found the biggest thing that slows the server down is I/O wait. If you've got all that extra RAM, maybe you should look at configuring a RAM drive and copying all your RRDs up there, and of course occasionally rsyncing them back to the hard disks. I bet the system would fly then.
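    A minimal sketch of that idea, assuming Zenoss lives in /opt/zenoss and the RRDs sit under /opt/zenoss/perf (paths and the tmpfs size are examples, not tested values):

    # Mount a RAM-backed filesystem over the perf directory (size it to fit your RRDs)
    mount -t tmpfs -o size=8g tmpfs /opt/zenoss/perf
    # Seed the RAM drive from the last on-disk copy (e.g. at boot)
    rsync -a /opt/zenoss/perf.disk/ /opt/zenoss/perf/
    # From cron, periodically flush the RAM drive back to disk so a crash loses little
    rsync -a /opt/zenoss/perf/ /opt/zenoss/perf.disk/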
  • mwcotton Rank: Brown Belt 563 posts since
    Apr 23, 2008
    3. Jul 28, 2009 7:43 PM (in response to coyote)
    RE: New 36G, 16CPU server
You need to rebuild; think about all the writes you're going to be doing to those RRDs. I would go with RAID 10, and get even more drives if possible.
Still, the RAM disk idea would be the ultimate speed increase.
  • Andreas Trawoeger Rank: Green Belt 109 posts since
    Apr 10, 2008
    5. Jul 29, 2009 5:47 AM (in response to coyote)
    Re: New 36G, 16CPU server

    "coyote" wrote:

     

    The Senior Linux admin gave me a new Zenoss server with 16 CPUs, 36G of ram and lots of disk space.
    I have Zenoss 2.4.2 installed and am trying to monitor 1800 devices.
    Zenoss uses less that 12Gigs of ram and load is always less than 2. I have configured Zenhub with 8 workers.


    There are a couple of possibilities to tune Zenoss performance, but you have to be a bit careful with them.

    Mount /tmp as tmpfs in /etc/fstab
    vm.vfs_cache_pressure = 0
    vm.swappiness = 20
    vm.overcommit_memory = 2
    vm.overcommit_ratio=75
    vm.dirty_background_ratio = 1
    vm.dirty_ratio = 100
    vm.dirty_expire_centisecs = 3000 
    vm.dirty_writeback_centisecs = 500 

    Pro: Speeds up renderserver
    Cons: /tmp is used by zenbackup too. So either make /tmp large enough, or change the temp-dir setting of zenbackup, otherwise you backups will start failing.

    Modify you VM settings in /etc/sysctl.conf
    echo 0 > /sys/block/cciss!c0d0/queue/iosched/front_merges
    echo 150 > /sys/block/cciss!c0d0/queue/iosched/read_expire
    echo 1500 > /sys/block/cciss!c0d0/queue/iosched/write_expire


    Modify your I/O scheduling in /etc/rc.local
    <zodb_db temporary>
        # Temporary storage database (for sessions)
        <temporarystorage>
          name temporary storage for sessioning
        </temporarystorage>
        mount-point /temp_folder
        container-class Products.TemporaryFolder.TemporaryContainer
        cache-size 500000
        pool-size 16 
    </zodb_db>
    
    <zodb_db main>
      mount-point /
      # ZODB cache, in number of objects
      cache-size 500000
      pool-size 16
      <zeoclient>
        server localhost:8100
        storage 1
        name zeostorage
        var $INSTANCE/var
        
        # ZEO client cache, in bytes
        cache-size 10000MB
        # Uncomment to have a persistent disk cache
        #client zeo1
      </zeoclient>
    </zodb_db>

    Pro & Cons: Tuning VM and I/O settings is close to voodoo and very much depends on your hardware. So be prepared for a lot of try & fail till your performance will get any better.

    Increase your cache and pool size in zope.conf
    # Reseting IPs for all Devices
    for d in dmd.Devices.getSubDevices():
        d.setManageIp()
    # Commiting changes for all devices
    commit() 

    Pro: Will improve Zenoss speed.
    Cons: You have to be really careful if you want to do anything via zendmd.

    The likelihood of caching conflicts increase with your zenhub worker and Zope pool-size. So if you have any zendmd script that does something like this:
    # Reseting IPs for all Devices
    for d in dmd.Devices.getSubDevices():
       # Reset IP and commit change
        d.setManageIp()
        commit() 

    Rewrite it to:
    tmpfs /tmp     tmpfs rw 0 0

    Otherwise you will end up with your changes randomly ending up in an commit failed error.
  • Andreas Trawoeger Rank: Green Belt 109 posts since
    Apr 10, 2008
    6. Jul 29, 2009 6:09 AM (in response to Andreas Trawoeger)
    Re: New 36G, 16CPU server

    "mwcotton" wrote:

     

    Whats your disk config? I have found the biggest thing that slows the server down is wait on i/o . If you got all that extra ram maybe you should look at configuring a ram drive and copy all your rrd's up there, of course occasionly rsync them back to the hard disks. I bet the system would fly then.


    You can do that by simply increasing your vm.dirty_expire_centisecs and vm.dirty_writeback_centisecs settings. With this settings you can tell Linux to keep filesystem modifications longer in RAM before they will be written to disk.

    If done right this will decrease the I/O load on your disks leading to better performance. The disadvantage with this approach is that, when data gets written back to disk. You have a lot of data that needs to be written. Which can temporarily block other processes.

    So sometimes it's better to do the complete opposite and decrease the values. That will lead to more I/O actives overall, but the load will be better spread over time and won't block other processes that much.
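    For illustration, both directions as /etc/sysctl.conf entries (the numbers are arbitrary examples, not recommendations; apply with "sysctl -p"):

    # Keep dirty pages in RAM longer: fewer, larger flushes
    vm.dirty_expire_centisecs = 6000
    vm.dirty_writeback_centisecs = 1500

    # ...or the opposite, flush early and often: smaller, smoother writes
    # vm.dirty_expire_centisecs = 1000
    # vm.dirty_writeback_centisecs = 100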
  • fdeckert Rank: Green Belt 110 posts since
    Jul 2, 2008
    7. Jul 31, 2009 7:48 AM (in response to Andreas Trawoeger)
    Re: New 36G, 16CPU server
Zenoss is written in Python, and Python is not multi-CPU aware: the interpreter's global lock keeps a single process on one core at a time.

Having 16 cores will unfortunately not make the web interface any faster, as it runs in only one Python process (zopectl) and will not use more than one core.

    Regarding the sluggish web interface, we felt the same and tried many tweaks:

1. Set up an Apache front end as a reverse-proxy cache (see the sketch after this list)

2. For remote users, force HTTP compression in lib/python/ZPublisher/HTTPResponse.py:
use_HTTP_content_compression = 1

    3. Increase zope.conf parms:
    zserver-threads 200
python-check-interval 1000 (not sure this one helps)
    <zodb_db main>
    mount-point /
    cache-size 100000
    pool-size 250
    <zeoclient>
    ...
cache-size 1000MB
    client zeo1
    </zeoclient>
    </zodb_db>

4. Use the Python accelerator "psyco" (32-bit Python only)
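    A hypothetical sketch of the Apache front end from tweak #1 (the hostname and Zope port are assumptions; needs mod_proxy and mod_deflate loaded):

    <VirtualHost *:80>
        ServerName zenoss.example.com
        # Forward everything to the Zope HTTP port (8080 on a default install)
        ProxyPass        / http://localhost:8080/
        ProxyPassReverse / http://localhost:8080/
        # Compress text responses for remote users
        AddOutputFilterByType DEFLATE text/html text/css application/javascript
    </VirtualHost>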

With these 4 tweaks we got a better Zenoss experience, but it's still not light speed :-(

We're thinking of moving to a new server with a very high-end single CPU.

    --
    Florian Deckert
    SopraGroup - France
  • spyder40 Rank: White Belt 45 posts since
    Apr 17, 2008
    8. Jul 31, 2009 9:14 AM (in response to fdeckert)
    Re: New 36G, 16CPU server
From what I'm seeing, a good chunk of the slowness is in MySQL and the code around inserts/deletes/etc. Zenoss doesn't appear to be very efficient once the database grows. The history delete is killing us (32M rows currently), and I think it's just doing a sequential scan through the tables. A sharp DBA needs to take a look at the schema and the code.

RAID 5 is a no-no for a database; there is too much of a write penalty because of the parity writes. RAID 10 is much quicker.
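    As an illustration only, pruning in batches could look something like this (it assumes the Zenoss 2.x events database has a history table with a lastTime epoch column; verify the schema and take a backup first):

    # Delete old history rows in small batches instead of one huge transaction
    mysql -u zenoss -p events -e "DELETE FROM history WHERE lastTime < UNIX_TIMESTAMP(NOW() - INTERVAL 90 DAY) LIMIT 10000;"
    # Repeat until 0 rows are affected, then defragment the table
    mysql -u zenoss -p events -e "OPTIMIZE TABLE history;"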
  • Chris Krough Rank: White Belt 16 posts since
    Apr 22, 2008
    9. Jul 31, 2009 10:29 AM (in response to spyder40)
    Re: New 36G, 16CPU server
That RAID 5 is going to be a bottleneck for you. As others have said, RAID 10 is a better idea for the stuff in $ZENHOME/perf/.
  • jbaird Rank: Green Belt 166 posts since
    Sep 18, 2007
    10. Jul 31, 2009 1:36 PM (in response to Chris Krough)
    Re: New 36G, 16CPU server
Yep, get rid of the RAID 5. RAID 10 is much better suited because of all of the writes that Zenoss must do to update RRDs.
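    If you're building the array in software rather than on a RAID controller, a minimal illustrative example with mdadm (device names, filesystem, and mount point are assumptions):

    # Build a 6-disk RAID 10 array and put the perf data on it
    mdadm --create /dev/md0 --level=10 --raid-devices=6 /dev/sd[b-g]1
    mkfs -t ext3 /dev/md0
    mount /dev/md0 /opt/zenoss/perf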
  • beanfield Rank: Green Belt 161 posts since
    Apr 16, 2008
    11. Sep 3, 2009 3:17 PM (in response to jbaird)
    Re: New 36G, 16CPU server

    "coyote" wrote:

     

I have 6 SATA drives in a RAID 5 configuration; not the best for performance, but it is what the admin wants.

I can back off the monitoring and performance collection if I have to. sda8 is where Zenoss is installed.

# iostat
Linux 2.6.18-128.1.16.el5 (zmaster)     07/28/2009

avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
           0.75   0.00     0.08     0.01    0.00  99.15

Device:    tps  Blk_read/s  Blk_wrtn/s   Blk_read    Blk_wrtn
sda      25.28        3.24      632.88    5326012  1040210946
sda1      0.00        0.02        0.00      25980         602
sda2      0.00        0.00        0.00       1877           0
sda3      0.31        0.40        5.03     652578     8261584
sda4      0.00        0.00        0.00         12           0
sda5      6.52        0.21      185.22     336968   304432832
sda6      0.15        1.50        2.63    2465424     4322896
sda7      1.08        0.00      261.88       6096   430430360
sda8     17.21        1.11      177.97    1828235   292514760
sda9      0.00        0.00        0.15       3619      246896
sda10     0.00        0.00        0.00       4607        1016



    run "vmstat 1" and check the "wa" column (should be 3rd column in from the right). It should show your %i/o wait every second. It's shown above, but only for one period in time. If your i/o wait is staying relatively low (low single digits), then it's likely not a disk i/o problem on that server. Your %0.01 shown above is nothing....but the server could have just been between polling for that moment.

Before doing too much tweaking of Zope, I'd try to clean it up and re-index first. This is what I do as the zenoss user (note: the spacing is important):

    # zendmd

    # Fix deviceSearch
    brains = dmd.Devices.deviceSearch()
    for d in brains:
        try:
            bah = d.getObject()
        except Exception:
            print "Removing non-existent device from deviceSearch: " + d.getPath()
            dmd.Devices.deviceSearch.uncatalog_object(d.getPath())
    commit()
    
    # Fix componentSearch
    brains = dmd.Devices.componentSearch()
    for d in brains:
        try:
            bah = d.getObject()
        except Exception:
            print "Removing non-existent device from componentSearch: " + d.getPath()
            dmd.Devices.componentSearch.uncatalog_object(d.getPath())
    commit()
    
    dmd.Devices.reIndex()
    commit()
    
    reindex()
    commit()
    


    You may also want to follow this thread on setting up a graph for the zenhub queue http://forums.zenoss.com/viewtopic.php?t=9429&highlight=zenhub+workers

You can try bumping your workers even more, but it sounds like you have plenty. I'm running 7 on an 8-core box with ~1200 devices.
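    For reference, the worker count is just a daemon option; on a standard install it would go in $ZENHOME/etc/zenhub.conf (double-check the option name against zenhub --help on your version):

    # number of zenhub worker processes
    workers 8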
  • jenkinskj Rank: Green Belt 330 posts since
    Jul 30, 2009
    12. Dec 3, 2009 12:14 PM (in response to spyder40)
    Re: New 36G, 16CPU server

We encountered issues with history delete not working as well. I looked through the Zenoss ticket queue and saw that a fix was slated for 2.5.2. Can someone confirm when this will be fixed? We are getting ready to roll out a production version of Zenoss, and I would like to install the latest stable version so I do not have to worry about upgrading right after the install.

One more item: the problem we run into with MySQL is that once the data grows, it is difficult to reclaim disk space even after history delete, optimize, etc. are run. We were told that an export and import of the database is required to reclaim the space. Has anyone run into this issue before?

     


  • jenkinskj Rank: Green Belt 330 posts since
    Jul 30, 2009
    13. Dec 3, 2009 1:43 PM (in response to coyote)
    Re: New 36G, 16CPU server

I have a few questions regarding your environment. We plan to deploy at a similar size to yours, and I am trying to leverage any lessons learned here.

     

    • Considering the number of monitors you have, how big is your MySQL database?
• How much history do you have Zenoss set to keep (i.e. events and RRD data)?
    • What is your strategy for backup and recovery (Hot Backup / Cold Backup)?
    • Do you have a sample of your my.cnf you can provide?

     

    I would appreciate any information you can provide.

     

    Thank you,

    - Ken

  • rbilder Rank: White Belt 36 posts since
    Jun 25, 2008
    14. Dec 3, 2009 9:41 PM (in response to coyote)
    Re: New 36G, 16CPU server

It sounds like there are a couple of folks, at least, who are looking for performance/tuning ideas. I'll try to give a run-down of what we have running, and will do my best to field follow-up questions.

     

    Our "Primary" zenoss server -- runs pretty much everything -- MySql, Zope DB, UI, monitoring, modeling (nightly from cron), performance collection.

    This server is collecting perf data on about 250 devices.

It is Ubuntu, quad 2.5 GHz, with 8 GB. It also has SAS disks, just mirrored (RAID 1), no parity RAID.

All traps, etc. come directly into this server, none to the collectors.

     

    Our "Large Collector" -- collects for about 1400 devices

    Also Ubuntu, but a Dual-Core 3.2 Ghz, also with 8GB.  This is an older box with SCSI disks.

    It also does modeling each night (I run modeler from cron--not as a daemon--once/day is enough for us)

     

Since this thing pretty much runs collection around the clock, I went for spindles. Like others have said, RAID 5 is really going to put a damper on things...

I went even a step further....

Not much goes into $ZENHOME/perf itself -- it's on the same file system as MySQL, Zope, the code, root, boot, etc.

I set up 8 smaller disks into (4) 2-disk bundles named

      /perf1

      /perf2

      /perf3

      /perf4

     

Zenoss looks in $ZENHOME/perf for the RRD directories, but I have them symlinked over to the /perfN disks, splitting devices roughly equally. So, when writing/updating RRD files, performance looks something like this:

(sar -d style columns)        tps  rd_sec/s  wr_sec/s  avgrq-sz  avgqu-sz   await  svctm   %util

07:56:59 PM dev104-16      541.33      0.00   4698.67      8.68    108.23  197.79   1.85  100.00
07:56:59 PM dev104-32      478.33      0.00   4288.00      8.96    108.06  224.59   2.09  100.00
07:56:59 PM dev104-48      571.00      0.00   4805.33      8.42    107.82  189.89   1.75  100.00
07:56:59 PM dev104-64      719.67      0.00   6525.33      9.07    108.16  145.74   1.39  100.00

     

The SCSI disks will get up to about 7-8k writes/sec; the SAS disks will do 10-12k writes/sec on the primary collector.

Every interval, zenperfsnmp reports that it "Sent 129190 OID requests". I assume that means my server is updating 129,000 RRD files every five minutes. The cycle time for the performance polling is anywhere from 60 secs to about 200 secs. With RRD, a "normal" cycle takes 60 secs, but when it has to roll the hourly results up, that is what takes about 200 secs. I am pretty sure I could add 300-500 smaller routers to this collector before needing to add the next one.

We have 3 other collectors in the field (not really sure why I bother, though...) that collect 146, 73, and 21 devices. These servers also run other functions and are not dedicated to Zenoss: we gather device configurations nightly, in case of device failure, along with other network functions on those 3 servers.

For the .conf files, we pretty much just specify the monitor and localhost values, with some exceptions.

On each collector, I run zenmodeler nightly from cron (not as a daemon), with parallel set to 4. It speeds up modeling and seems to work fine. There is no logic behind the number; it's faster than 1 and seems to be working fine.

I have a bug ticket in, so right now zenstatus has parallel at 150. But the default of 50 is fine if you are checking fewer than 50 TCP/UDP connection-status items for things like SMTP, HTTP, and the like. zenstatus, at least in 2.5.1, needs parallel set higher than the number of items you are polling. This should be temporary...

On the main collector, I have zenhub running with 4 workers. Somewhat because it's a 4-CPU box, but it seems to work well with 4 collectors. When zenping starts up, it sends zenhub off to map the network topology, and really the workers do the bulk of that work. If I had more processors, I'd be inclined to run it at "number of collectors" + 1.

I also have a few small tweaks to cache size and such, but probably nothing worth really worrying about.

For the MySQL DB, I'd probably run it on separate spindles, but since we don't collect that much on that server, I have not bothered.

Also, you may want to google for "innodb_file_per_table". I wish I had done this a while ago, and would like to recommend that the Zenoss stack include:

[mysqld]
innodb_file_per_table

Since I have an existing DB, I'll need to dump the data, stop the DB, delete the big ibdata1 file, then reload; a rough outline is sketched below. When tables are created with this option, a DB/table.ibd file is created. My understanding is that this would let you run OPTIMIZE TABLE and actually free up disk space if your event-history storage needs decrease. I have not finished testing in our lab to confirm that OPTIMIZE TABLE does indeed free up the file-system space. But Zenoss does run fine with the option, and it does indeed create a file per table.
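A rough, untested outline of that dump/reload (paths and the "events" database name assume a default Zenoss/MySQL install; dump every InnoDB database you have, since removing ibdata1 destroys them all):

    # Dump, then stop everything that writes to MySQL
    mysqldump -u zenoss -p --databases events > events.sql
    zenoss stop
    mysql -u zenoss -p -e "DROP DATABASE events;"
    /etc/init.d/mysqld stop
    # Remove the shared tablespace and its logs (keep a backup copy)
    rm /var/lib/mysql/ibdata1 /var/lib/mysql/ib_logfile*
    # Add innodb_file_per_table under [mysqld] in my.cnf, then restart and reload
    /etc/init.d/mysqld start
    mysql -u zenoss -p < events.sql
    zenoss start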
I encourage anyone who made it this far to comment. This is what I am currently running, but I'm sure there is room for improvement or change. I hope this helps you with your installs. I've gotten a lot of excellent help from the group here, and I hope it helps you two get going. There are definitely folks here with more experience with Zenoss, and more Python programming experience, than I have. Keep the questions coming; I'll try to help when I can...
    Regards,
    --Randy