Introduction
=================================================================================
This post discusses disk subsystem and process tuning options for running high-volume Zenoss installations. The information is based on 64-bit Red Hat Enterprise Linux, but should apply to most Linux distributions supported by Zenoss. It covers:
- General Zenoss Performance Bottlenecks
- Filesystem tuning for configurations using standard spindle drives.
- Filesystem tuning for configurations using solid state storage.
General Zenoss Performance Bottlenecks
=================================================================================
One of the most frequently asked questions in the #zenoss IRC channel and the Zenoss forums is 'How big should my Zenoss server be?' Unfortunately, no formula exists for calculating server size from the number of monitored devices alone. Several factors affect the load on your monitoring infrastructure. The type of resource limitation your system will hit first (block IO, CPU, or memory) depends on what you are monitoring, how long you keep the data, how frequently you collect it, the responsiveness of the monitored equipment, and the performance of the networks the monitoring traffic traverses.
Many administrators who are new to Zenoss will understandably try to scale their monitoring infrastructure based on the number of devices they intend to monitor. While this is a good start, Zenoss monitoring daemons are focused on RRD datasources, not devices. When determining hardware requirements for scaling Zenoss, consider the number of monitored datasources rather than the number of monitored devices. A server providing a small number of services may only require a dozen monitored datasources, whereas a large server or router could require thousands or even tens of thousands. Additionally, the amount of data recorded for each datasource determines the size of the resulting RRD file, which has a significant impact on block IO. Long consolidation periods, additional data (Holt-Winters calculations, for example), and the number of RRAs in an RRD file all affect the total size of the file and consequently the amount of disk IO.
For most installations the disk subsystem is the first and most significant bottleneck. Zenoss records a single datasource per RRD file, so the total number of RRD files in your system equals the total number of datasources being monitored. Each of these RRD files needs to be opened, searched, modified, and written during each polling cycle. If you have an installation monitoring 50,000 datasources with the typical polling cycle of 300s, that is roughly 170 RRD file updates per second (50,000 / 300 is about 167). Note that this is the number of files touched per second and not the number of disk transactions, which is likely much larger. The amount of data to be written and the number of disk transactions created during a polling cycle can easily exceed the write throughput and IO rate of a single drive. The way past single drive performance limitations is to use RAID storage arrays. Many users find that a high density RAID10 array provides the best balance of cost to performance for medium to large installations. Some very large installations will require dedicated high speed SAN storage or even solid state storage. Below I suggest some storage configurations for both standard spindle disks and solid state storage on large Zenoss installations.
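The arithmetic above is easy to repeat for your own datasource count. The following is a minimal sketch; the function name is made up for illustration, and the result is only the average number of RRD files touched per second, not the number of disk transactions.

```python
def rrd_updates_per_second(datasources, cycle_seconds=300):
    """Average number of RRD files touched per second.

    Assumes one RRD file per datasource and that updates are spread
    evenly over the polling cycle, which real collectors only approximate.
    """
    if cycle_seconds <= 0:
        raise ValueError("cycle_seconds must be positive")
    return datasources / float(cycle_seconds)


if __name__ == "__main__":
    # Example datasource counts; substitute the numbers for your own system.
    for count in (10000, 50000, 200000):
        rate = rrd_updates_per_second(count)
        print("%d datasources: about %.0f RRD updates per second" % (count, rate))
```

Compare the result against the random write rate your array can sustain, and remember that each file update generates more than one disk transaction.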
Filesystem tuning for configurations using standard spindle drives.
=================================================================================
* RAID level
Due to the high throughput required by RRD updates, RAID10 is typically the most appropriate RAID level for Zenoss performance data. RAID10 provides excellent performance and redundancy at the cost of storage space. RAID5 is not an appropriate RAID level for the $ZENHOME/perf partition because the parity calculations add too much overhead to every small write. It is common for administrators to build out a Zenoss server using a RAID1 or RAID5 array for the OS and related software (mount point '/') and a RAID10 array on dedicated drives for the performance data (mount point '$ZENHOME/perf'). Dedicating a RAID10 array to $ZENHOME/perf helps the operating system and executables access data without being held up by RRD file updates. RAID10 performance increases as you increase the number of drives in the array.
* Filesystem Tuning
The second consideration for standard spindle drive storage is the choice of filesystem. There are a number of high performance filesystems available in most Linux distributions. I am going to focus on EXT3 and EXT4 as they are the most common ones in use and come standard with RHEL and CentOS. ReiserFS and XFS may also be excellent choices for the $ZENHOME/perf partition, but I have not tested against them. When using striped RAID levels (0, 4, 5, 6, and 10), it is beneficial to align filesystem blocks with RAID stripes. The mkfs command accepts arguments specifying the block size, stride, and stripe width of the filesystem. The following is an example calculation of filesystem options based on common settings for a six drive RAID10 array. Whether the controller builds the array as a stripe of mirrors or a mirror of stripes, only the striped (RAID0) portion matters for alignment, and it spans three data-bearing disks. The mirrored (RAID1) portion of the array is not affected by filesystem alignment. To determine the correct filesystem arguments you can use the quick calculator here: http://busybox.net/~aldot/mkfs_stride.html or you can use the formula below.
Type of RAID: 0
# of data disks: 3 (3 on one side of the mirror, 3 on the other side)
Filesystem block size: 4k
RAID chunk size: 64k
stride = chunk size / filesystem block size
stripe-width = stride * number of data disks
stride = 64k / 4k = 16 (filesystem blocks)
stripe-width = 16 * 3 = 48 (filesystem blocks)
The appropriate filesystem creation options for the 6 disk RAID10 above are:
mkfs.ext4 -b 4096 -E stride=16,stripe-width=48 /dev/xxx
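If you would rather script the calculation than use the web calculator, the formula above can be sketched in a few lines of Python. The function names are made up for illustration. The sketch only builds the command string and never runs mkfs. It passes both extended options in a single comma-separated -E argument, which is the form the mke2fs man page documents.

```python
def ext_stripe_options(chunk_kib, block_kib, data_disks):
    """Return (stride, stripe_width), both counted in filesystem blocks.

    chunk_kib  -- RAID chunk size in KiB
    block_kib  -- filesystem block size in KiB
    data_disks -- number of disks in the striped (RAID0) part of the array
    """
    if chunk_kib % block_kib != 0:
        raise ValueError("chunk size must be a multiple of the block size")
    stride = chunk_kib // block_kib
    return stride, stride * data_disks


def mkfs_command(device, chunk_kib=64, block_kib=4, data_disks=3):
    """Build (but do not run) the mkfs.ext4 command line for the array."""
    stride, stripe_width = ext_stripe_options(chunk_kib, block_kib, data_disks)
    return "mkfs.ext4 -b %d -E stride=%d,stripe-width=%d %s" % (
        block_kib * 1024, stride, stripe_width, device)


if __name__ == "__main__":
    # Six drive RAID10 from the example: 64k chunks, 4k blocks, 3 data disks.
    print(mkfs_command("/dev/xxx"))
```

Check the chunk size your RAID controller or md array actually uses before trusting the defaults here; 64k is only the value used in the example above.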
* Kernel Elevator Tuning
The Linux kernel IO subsystem processes disk reads and writes according to scheduling algorithms known as elevators. There is an excellent description of elevators and tuning at http://www.redhat.com/docs/wp/performancetuning/iotuning/index.html. The default scheduler for Red Hat Enterprise Linux 5 is the CFQ (Completely Fair Queuing) elevator. The available schedulers are noop, anticipatory, deadline, and cfq. Either the default 'cfq' scheduler or the 'deadline' scheduler is the best choice for most Zenoss installations. As each scheduler has different advantages and disadvantages, it's best to try each in your environment. The scheduler can be specified by adding the "elevator=" option to the kernel line in grub.conf as follows:
kernel /vmlinuz-2.6.31.12-174.2.3.fc12.x86_64 ro root=/dev/mapper/vg_nergal-lv_root LANG=en_US.UTF-8 SYSFONT=latarcyrheb-sun16 KEYBOARDTYPE=pc KEYTABLE=us elevator=deadline
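After rebooting with an "elevator=" option, it is worth confirming which scheduler a disk is really using. The kernel exposes this in /sys/block/<device>/queue/scheduler, where the active scheduler is shown in square brackets. The following is a small sketch that parses that file; the function names and the device name 'sdb' are only examples.

```python
def parse_scheduler(line):
    """Parse a sysfs scheduler line such as 'noop anticipatory deadline [cfq]'.

    Returns (active, available): the name shown in brackets, or None if no
    name is bracketed, and the list of all scheduler names on the line.
    """
    active = None
    available = []
    for token in line.split():
        if token.startswith("[") and token.endswith("]"):
            token = token[1:-1]
            active = token
        available.append(token)
    return active, available


def read_scheduler(device):
    """Read the scheduler line for a block device name such as 'sdb'."""
    path = "/sys/block/%s/queue/scheduler" % device
    with open(path) as handle:
        return parse_scheduler(handle.read())


if __name__ == "__main__":
    # Parse a sample line so the sketch runs on any machine.
    active, available = parse_scheduler("noop anticipatory deadline [cfq]")
    print("active: %s, available: %s" % (active, ", ".join(available)))
```

Writing a scheduler name into the same sysfs file as root switches the elevator for that device without a reboot, which makes it easier to compare 'cfq' and 'deadline' under real load.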
* EXT Mount Options
Journaling
By default, EXT3 and EXT4 filesystems mount with the journal in 'ordered' mode. In 'ordered' mode only metadata is journaled, but file data is forced out to disk before the related metadata is committed to the journal. You can increase disk throughput by mounting the filesystem with the journal in 'writeback' mode. In 'writeback' mode that ordering is not enforced: data is written according to the normal schedulers and metadata commits are not held back waiting for data writes. When using 'writeback' mode you should see a performance improvement, but it comes with a somewhat higher data integrity risk, because after a crash recently updated files may contain stale or unwritten data. Mounting with the journal in 'writeback' mode should only be done on a system with a RAID controller that has an internal battery backup.
Timestamping
By default, the EXT filesystem mounts with the 'atime' option. 'atime' updates the inode access time each time a file is accessed. This update is typically unnecessary and creates a lot of extra writes. You can disable this with the 'noatime' mount option. The option 'nodiratime' is implied when using the 'noatime' option.
The final /etc/fstab line for the options above for a RAID10 partition at /dev/sdb1 is:
/dev/sdb1 /opt/zenoss/perf ext4 noatime,data=writeback 0 0
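A quick way to confirm that the performance partition really carries these options is to check its fstab line for them. The following is a minimal sketch; the function name is made up for illustration, and the wanted options default to the two discussed above.

```python
def missing_mount_options(fstab_line, wanted=("noatime", "data=writeback")):
    """Return the wanted mount options that are absent from an fstab line.

    An fstab line has whitespace-separated fields:
    device, mount point, filesystem type, options, dump, pass.
    """
    fields = fstab_line.split()
    if len(fields) < 4:
        raise ValueError("not a complete fstab line: %r" % fstab_line)
    options = fields[3].split(",")
    return [opt for opt in wanted if opt not in options]


if __name__ == "__main__":
    line = "/dev/sdb1 /opt/zenoss/perf ext4 noatime,data=writeback 0 0"
    missing = missing_mount_options(line)
    if missing:
        print("missing options: " + ", ".join(missing))
    else:
        print("all tuning options present")
```

Lines in /proc/mounts use the same field order, so the same check can be pointed at the live mount table instead of the configuration file.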
Filesystem tuning for configurations using solid state storage.
=================================================================================
Very large configurations may find it more cost effective to use solid state storage drives or cards. Solid state drives offer a huge improvement in read/write speeds and total transactions per second.
Moving to SSD storage significantly increases the capability of the disk subsystem, allowing you to monitor more datasources per collector, but it also moves the bottleneck to other components in the system. After moving to solid state drives I found the zenperfsnmp daemon itself to be a major bottleneck. Zenperfsnmp is restricted to a single thread on a single core. SSD storage may be capable of updating RRD files much faster than a single core running zenperfsnmp can produce updates, leaving most of the CPU cores and the drives idle while that one core works. Previously, with spindle drives in RAID10, zenperfsnmp running on a single core was able to produce RRD data faster than the disk subsystem could write it, which made block IO the bottleneck. One solution for this single core problem is to run multiple collectors on a single server. From the Zenoss GUI it appears that you have several collectors, when in fact you just have multiple copies of zenperfsnmp (or any other daemon) running in parallel on a single server. Unfortunately these individual instances of the monitoring daemons will not share a common list of monitored devices. You will need to manually distribute the devices across the collectors, which can be done fairly easily using a Python script in zendmd, or tediously in the GUI.
Example of creating multiple collectors on a single collection server:
- Create a new hub 'h0' on the master Zen server under Settings > Collectors
- Set the number of workers to 3 in h0_zenhub.py
- In the GUI, create 3 new collectors 'h0c0','h0c1','h0c2' on the server with solid state storage, with all three collectors reporting back to 'h0'
- Edit each of the collectors to have a different port for the "Render URL", edit the h0cn_zenrender.conf 'httpport' entry to reflect the correct port
- Assign these three new collectors as performance monitors for devices.
Now you should have 3 collectors (h0c0, h0c1, and h0c2) running on one of your distributed collectors (hopefully with SSDs). All three collectors will report performance data back to the three 'h0' hub workers on the master server. Each collector will spawn its own performance collection daemons, resulting in higher CPU utilization on the server they run on.
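Distributing devices across the new collectors, as mentioned earlier, can be scripted. The following is a minimal sketch of the assignment logic only: it splits a list of device ids round robin across the three collector names used above. The function name and device ids are made up for illustration, and the zendmd calls named in the comments are from memory of the Zenoss API and should be verified against your version before use.

```python
def assign_round_robin(device_ids, collectors):
    """Spread device ids evenly across collector names.

    Returns a dict mapping each collector name to its list of device ids.
    Ids are sorted first so the result is repeatable between runs.
    """
    if not collectors:
        raise ValueError("at least one collector is required")
    assignment = dict((name, []) for name in collectors)
    for index, device_id in enumerate(sorted(device_ids)):
        assignment[collectors[index % len(collectors)]].append(device_id)
    return assignment


if __name__ == "__main__":
    # Hypothetical device ids. Inside zendmd they might come from
    # dmd.Devices.getSubDevices() (assumption, verify for your version).
    devices = ["router1", "router2", "web1", "web2", "db1"]
    plan = assign_round_robin(devices, ["h0c0", "h0c1", "h0c2"])
    for collector in sorted(plan):
        # In zendmd each device could then be moved with something like
        # device.setPerformanceMonitor(collector) followed by commit()
        # (assumption, verify for your version).
        print("%s: %s" % (collector, ", ".join(plan[collector])))
```

Round robin balances the number of devices, not the number of datasources. Since datasources drive the load, you may want to spread your largest devices by hand.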
Special Considerations for Solid State Storage
Solid state drives rely on flash memory cells for storage. Flash memory cells are reliable for a limited number of writes, after which they cannot be counted on to retain data. Solid state drive controllers work around this limitation by 'write balancing' (usually called wear leveling), distributing writes across cells so that no cell wears out far ahead of the others. There are two types of cells in use, MLC (Multi-Level Cell) and SLC (Single-Level Cell). It's important to understand the performance and reliability differences between the two. Based on current numbers, MLC cells are 'reliable' for around 10,000 writes and SLC cells for around 100,000. Beyond this limit on reliable writes, you should also understand the concept of write amplification. Write amplification is an effect where the drive must physically write more data than the host requested, because flash is erased and rewritten in units much larger than a small update. This multiplies the number of cell writes and can significantly reduce the life expectancy of the drive. For an excellent technical introduction to SSD design and function, read and reread Anand's SSD article here: http://www.anandtech.com/storage/showdoc.aspx?i=3531&p=1. Be sure to have a solid understanding of these concepts before attempting to put solid state storage into production for Zenoss. RRD file storage is particularly abusive to solid state storage due to the high number of small random reads and writes needed when updating RRD files. Using the expected amount of data written per day in your environment and the write amplification factor of the drives you intend to use, calculate the true life expectancy of the SSDs before making a major purchase. Be wary of vendor predictions for solid state drive life expectancy.
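The life expectancy calculation suggested above can be roughed out as follows. This is a back-of-the-envelope sketch, not a vendor formula: it assumes ideal wear leveling (every cell absorbs the same number of writes), and the capacity, daily write volume, and write amplification factor in the example are hypothetical placeholders to replace with measurements from your own system.

```python
def ssd_life_years(capacity_gb, rated_cycles, host_writes_gb_per_day,
                   write_amplification):
    """Estimate drive life in years under ideal wear leveling.

    capacity_gb            -- usable flash capacity in GB
    rated_cycles           -- reliable writes per cell (e.g. 10000 for MLC)
    host_writes_gb_per_day -- data the host writes per day, in GB
    write_amplification    -- flash writes per host write (1.0 or higher)
    """
    if host_writes_gb_per_day <= 0 or write_amplification <= 0:
        raise ValueError("write volume and amplification must be positive")
    total_flash_writes_gb = capacity_gb * rated_cycles
    flash_writes_gb_per_day = host_writes_gb_per_day * write_amplification
    return total_flash_writes_gb / float(flash_writes_gb_per_day) / 365.0


if __name__ == "__main__":
    # Hypothetical drive and workload, comparing the two cell types.
    for label, cycles in (("MLC", 10000), ("SLC", 100000)):
        years = ssd_life_years(64, cycles, 50, 10)
        print("%s: about %.1f years" % (label, years))
```

Real drives fall short of this ideal, so treat the result as an upper bound rather than a prediction.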
Solid state storage is a fairly new technology for the server market. Consider running the SSD storage in RAID1 pairs to increase reliability. Some of the tuning options for standard spindle disks do not apply to solid state storage. Solid state arrays should be mounted with the 'noatime' fstab option to help reduce the number of writes to the cells. Some high performance solid state storage drivers may bypass the kernel schedulers entirely. Consult the documentation for your storage solution to find out whether kernel tuning is discouraged or recommended. It is not necessary to align filesystem stripes for a RAID1 array. Any RAID level other than 0 or 1, whether hardware or software controlled, may reduce the performance of solid state drives.