Greetings, I am using Zenoss Core 4.2 on CentOS 6 x64 with 16 GB of RAM and have been monitoring ~15 servers successfully for over a month. After I added more servers (~300), I started having problems adding further devices and modeling my existing ones.
- When manually modeling through the UI, I see "Zenhub has connected", but it then hangs for over a minute.
- When viewing the logs, I see "2012-11-20 13:36:48,698 WARNING zen.zensyslog: No service named 'EventService': ZenHub may be disconnected" (zenoss status shows zenhub as up, and I can netcat to localhost:8789).
- Enabling debugging on zenhub shows that the worklist grows and then holds at around 50. I do not see anything in the 'Jobs' area of the Zenoss UI.
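For reference, the netcat check mentioned above can be sketched in Python roughly as follows. The function name `port_is_open` is just an illustration, and 8789 is the port quoted above. Note that a successful connect only shows that something is listening on the port, not that the hub is actually answering requests.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) can be opened."""
    try:
        conn = socket.create_connection((host, port), timeout=timeout)
    except (socket.error, socket.timeout):
        # Refused, unreachable, or timed out: treat all of these as "not open".
        return False
    conn.close()
    return True


if __name__ == "__main__":
    # 8789 is the zenhub port from the post; adjust if yours differs.
    print(port_is_open("localhost", 8789))
```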
Is there any way to clean out the worklist? I enabled debugging in zenjobs and see the following:
2012-11-20 13:59:25,946 ERROR celery.apps.worker:
Mediator
=================================================
File "/opt/zenoss/lib/python2.7/threading.py", line 525, in __bootstrap
self.__bootstrap_inner()
File "/opt/zenoss/lib/python2.7/threading.py", line 552, in __bootstrap_inner
self.run()
File "/opt/zenoss/lib/python/celery/utils/threads.py", line 51, in run
self.body()
File "/opt/zenoss/lib/python/celery/worker/mediator.py", line 69, in body
return
File "/opt/zenoss/lib/python/celery/worker/buckets.py", line 142, in get
not_empty.wait(timeout)
File "/opt/zenoss/lib/python2.7/threading.py", line 263, in wait
_sleep(delay)
=================================================
LOCAL VARIABLES
=================================================
{'delay': 0.03471708297729492,
'endtime': 1353437965.827212,
'gotit': False,
'remaining': -9.989738464355469e-05,
'saved_state': None,
'self': <Condition(<thread.lock object at 0x5779cf0>, 1)>,
'timeout': 1.0,
'waiter': <thread.lock object at 0x57797d0>}
Thread-7
=================================================
File "/opt/zenoss/lib/python2.7/threading.py", line 525, in __bootstrap
self.__bootstrap_inner()
File "/opt/zenoss/lib/python2.7/threading.py", line 552, in __bootstrap_inner
self.run()
File "/opt/zenoss/lib/python/billiard/pool.py", line 274, in run
return self.body()
File "/opt/zenoss/lib/python/billiard/pool.py", line 499, in body
Thread-5
=================================================
File "/opt/zenoss/lib/python2.7/threading.py", line 525, in __bootstrap
self.__bootstrap_inner()
File "/opt/zenoss/lib/python2.7/threading.py", line 552, in __bootstrap_inner
self.run()
File "/opt/zenoss/lib/python/billiard/pool.py", line 274, in run
return self.body()
File "/opt/zenoss/lib/python/billiard/pool.py", line 300, in body
time.sleep(0.8)
=================================================
LOCAL VARIABLES
=================================================
{'self': <Supervisor(Thread-5, started daemon 139794026075904)>}
Thread-6
=================================================
File "/opt/zenoss/lib/python2.7/threading.py", line 525, in __bootstrap
self.__bootstrap_inner()
File "/opt/zenoss/lib/python2.7/threading.py", line 552, in __bootstrap_inner
self.run()
File "/opt/zenoss/lib/python/billiard/pool.py", line 274, in run
return self.body()
File "/opt/zenoss/lib/python/billiard/pool.py", line 319, in body
for taskseq, set_length in iter(taskqueue.get, None):
File "/opt/zenoss/lib/python2.7/Queue.py", line 168, in get
self.not_empty.wait()
File "/opt/zenoss/lib/python2.7/threading.py", line 244, in wait
waiter.acquire()
=================================================
LOCAL VARIABLES
=================================================
{'saved_state': None,
'self': <Condition(<thread.lock object at 0x5779930>, 1)>,
'timeout': None,
'waiter': <thread.lock object at 0x57797f0>}
MainThread
=================================================
File "/opt/zenoss/Products/Jobber/zenjobs.py", line 118, in <module>
zj.run()
File "/opt/zenoss/Products/Jobber/zenjobs.py", line 63, in run
return self.celery.Worker(**kwargs).run()
File "/opt/zenoss/lib/python/celery/apps/worker.py", line 140, in run
self.run_worker()
File "/opt/zenoss/lib/python/celery/apps/worker.py", line 222, in run_worker
worker.start()
File "/opt/zenoss/lib/python/celery/worker/__init__.py", line 238, in start
component.start()
File "/opt/zenoss/lib/python/celery/worker/consumer.py", line 350, in start
self.consume_messages()
File "/opt/zenoss/lib/python/celery/worker/consumer.py", line 364, in consume_messages
self.connection.drain_events(timeout=1)
File "/opt/zenoss/lib/python/kombu/connection.py", line 167, in drain_events
return self.transport.drain_events(self.connection, **kwargs)
File "/opt/zenoss/lib/python/kombu/transport/amqplib.py", line 261, in drain_events
return connection.drain_events(**kwargs)
File "/opt/zenoss/lib/python/kombu/transport/amqplib.py", line 93, in drain_events
return self.wait_multi(self.channels.values(), timeout=timeout)
File "/opt/zenoss/lib/python/kombu/transport/amqplib.py", line 99, in wait_multi
chanmap.keys(), allowed_methods, timeout=timeout)
File "/opt/zenoss/lib/python/kombu/transport/amqplib.py", line 158, in _wait_multiple
channel, method_sig, args, content = read_timeout(timeout)
File "/opt/zenoss/lib/python/kombu/transport/amqplib.py", line 131, in read_timeout
return self.method_reader.read_method()
File "/opt/zenoss/lib/python/amqplib/client_0_8/method_framing.py", line 218, in read_method
self._next_method()
File "/opt/zenoss/lib/python/amqplib/client_0_8/method_framing.py", line 133, in _next_method
frame_type, channel, payload = self.source.read_frame()
File "/opt/zenoss/lib/python/amqplib/client_0_8/transport.py", line 149, in read_frame
frame_type, channel, size = unpack('>BHI', self._read(7))
File "/opt/zenoss/lib/python/amqplib/client_0_8/transport.py", line 261, in _read
s = self.sock.recv(65536)
File "/opt/zenoss/lib/python/celery/apps/worker.py", line 309, in cry_handler
logger.error("\n" + cry())
File "/opt/zenoss/lib/python/celery/utils/__init__.py", line 145, in cry
traceback.print_stack(frame, file=out)
=================================================
LOCAL VARIABLES
=================================================
{'frame': <frame object at 0x7f243c003800>,
'main_thread': None,
'out': <StringIO.StringIO instance at 0x4e15638>,
'sep': '=================================================\n',
't': <TaskHandler(Thread-6, started daemon 139794015586048)>,
'thread': <_MainThread(MainThread, started 139794333112064)>,
'tid': 139794333112064,
'tmap': {139793927235328: <Mediator(Mediator, started daemon 139793927235328)>,
139793937725184: <ResultHandler(Thread-7, started daemon 139793937725184)>,
139794015586048: <TaskHandler(Thread-6, started daemon 139794015586048)>,
139794026075904: <Supervisor(Thread-5, started daemon 139794026075904)>,
139794333112064: <_MainThread(MainThread, started 139794333112064)>}}
Debug output in zenhub.log:
2012-11-20 13:40:51,280 INFO zen: Setting logging level to DEBUG
2012-11-20 13:40:51,281 INFO zen.zenoss.protocols.amqp: error closing publisher [Errno 4] Interrupted system call
2012-11-20 13:40:51,318 DEBUG zen.Events: =============== incoming event ===============
2012-11-20 13:40:51,318 DEBUG zen.Events: Got a localhost zenhub heartbeat event (timeout 90 sec).
2012-11-20 13:40:51,318 DEBUG zen.zenoss.protocols.amqp: Publishing with routing key zenoss.heartbeat.localhost to exchange zenoss.heartbeats
2012-11-20 13:40:51,343 INFO zen.ZenHub: Worker (2329) reports 2012-11-20 13:26:28,145 INFO zen.pbclientfactory: Initial connect timed out after 30 seconds
2012-11-20 13:40:51,343 INFO zen.ZenHub: Worker (2331) reports 2012-11-20 13:26:28,304 INFO zen.pbclientfactory: Initial connect timed out after 30 seconds
2012-11-20 13:40:54,154 DEBUG zen.hub: adding listener for localhost:EventService
2012-11-20 13:40:54,157 DEBUG zen.hub: adding listener for localhost:ZenStatusConfig
2012-11-20 13:40:54,180 DEBUG zen.ZenHub: worklist has 1 items
2012-11-20 13:40:54,180 DEBUG zen.ZenHub: get candidate workers for sendEvents...
2012-11-20 13:40:54,180 DEBUG zen.ZenHub: candidate workers are [0, 1]
2012-11-20 13:40:54,180 DEBUG zen.ZenHub: Giving sendEvents to worker 0, (localhost:Products.ZenHub.services.EventService.sendEvents)
2012-11-20 13:40:54,181 DEBUG zen.ZenHub: worklist has 1 items
2012-11-20 13:40:54,181 DEBUG zen.ZenHub: get candidate workers for getDevicePingIssues...
2012-11-20 13:40:54,181 DEBUG zen.ZenHub: candidate workers are [1]
2012-11-20 13:40:54,181 DEBUG zen.ZenHub: Giving getDevicePingIssues to worker 1, (localhost:Products.ZenHub.services.EventService.getDevicePingIssues)
2012-11-20 13:40:54,217 DEBUG zen.ZenHub: worklist has 1 items
2012-11-20 13:40:54,218 DEBUG zen.ZenHub: all workers are busy
2012-11-20 13:40:54,232 DEBUG zen.ZenHub: worker 1, work localhost:Products.ZenHub.services.EventService.getDevicePingIssues finished in 0.0501899719238
2012-11-20 13:40:54,232 DEBUG zen.ZenHub: worklist has 1 items
2012-11-20 13:40:54,232 DEBUG zen.ZenHub: get candidate workers for getConfigProperties...
2012-11-20 13:40:54,232 DEBUG zen.ZenHub: candidate workers are [1]
2012-11-20 13:40:54,232 DEBUG zen.ZenHub: Giving getConfigProperties to worker 1, (localhost:Products.ZenHub.services.ZenStatusConfig.getConfigProperties)
2012-11-20 13:40:54,238 DEBUG zen.ZenHub: worker 1, work localhost:Products.ZenHub.services.ZenStatusConfig.getConfigProperties finished in 0.00522994995117
2012-11-20 13:40:54,274 DEBUG zen.ZenHub: worklist has 1 items
2012-11-20 13:40:54,275 DEBUG zen.ZenHub: get candidate workers for getThresholdClasses...
2012-11-20 13:40:54,275 DEBUG zen.ZenHub: candidate workers are [1]
2012-11-20 13:40:54,275 DEBUG zen.ZenHub: Giving getThresholdClasses to worker 1, (localhost:Products.ZenHub.services.ZenStatusConfig.getThresholdClasses)
2012-11-20 13:40:54,283 DEBUG zen.ZenHub: worker 1, work localhost:Products.ZenHub.services.ZenStatusConfig.getThresholdClasses finished in 0.00764012336731
2012-11-20 13:40:54,285 DEBUG zen.ZenHub: worklist has 1 items
2012-11-20 13:40:54,285 DEBUG zen.ZenHub: get candidate workers for getCollectorThresholds...
2012-11-20 13:40:54,285 DEBUG zen.ZenHub: candidate workers are [1]
2012-11-20 13:40:54,285 DEBUG zen.ZenHub: Giving getCollectorThresholds to worker 1, (localhost:Products.ZenHub.services.ZenStatusConfig.getCollectorThresholds)
2012-11-20 13:40:54,333 DEBUG zen.ZenHub: worker 1, work localhost:Products.ZenHub.services.ZenStatusConfig.getCollectorThresholds finished in 0.047210931778
2012-11-20 13:40:54,336 DEBUG zen.ZenHub: worklist has 1 items
2012-11-20 13:40:54,336 DEBUG zen.ZenHub: get candidate workers for getDeviceConfigs...
2012-11-20 13:40:54,336 DEBUG zen.ZenHub: candidate workers are [1]
2012-11-20 13:40:54,336 DEBUG zen.ZenHub: Giving getDeviceConfigs to worker 1, (localhost:Products.ZenHub.services.ZenStatusConfig.getDeviceConfigs)
2012-11-20 13:40:56,124 DEBUG zen.hub: adding listener for localhost:EventService
2012-11-20 13:40:56,127 DEBUG zen.hub: adding listener for localhost:ProcessConfig
2012-11-20 13:40:56,151 DEBUG zen.ZenHub: worklist has 1 items
2012-11-20 13:40:56,151 DEBUG zen.ZenHub: all workers are busy
2012-11-20 13:40:56,151 DEBUG zen.ZenHub: worklist has 2 items
2012-11-20 13:40:56,152 DEBUG zen.ZenHub: all workers are busy
2012-11-20 13:40:56,190 DEBUG zen.ZenHub: worklist has 3 items
2012-11-20 13:40:56,190 DEBUG zen.ZenHub: all workers are busy
2012-11-20 13:40:56,283 DEBUG zen.ZenHub: worklist has 3 items
2012-11-20 13:40:56,283 DEBUG zen.ZenHub: all workers are busy
2012-11-20 13:40:56,785 DEBUG zen.ZenHub: worker 1, work localhost:Products.ZenHub.services.ZenStatusConfig.getDeviceConfigs finished in 2.44905090332
2012-11-20 13:40:56,786 DEBUG zen.ZenHub: worklist has 3 items
2012-11-20 13:40:56,786 DEBUG zen.ZenHub: get candidate workers for getDevicePingIssues...
2012-11-20 13:40:56,786 DEBUG zen.ZenHub: candidate workers are [1]
2012-11-20 13:40:56,837 DEBUG zen.ZenHub: Giving sendEvents to worker 1, (localhost:Products.ZenHub.services.EventService.sendEvents)
2012-11-20 13:40:56,837 DEBUG zen.ZenHub: all workers are busy
2012-11-20 13:40:57,980 DEBUG zen.ZenHub: worklist has 2 items
2012-11-20 13:40:57,980 DEBUG zen.ZenHub: all workers are busy
2012-11-20 13:41:01,284 DEBUG zen.ZenHub: worklist has 2 items
2012-11-20 13:41:01,285 DEBUG zen.ZenHub: all workers are busy
2012-11-20 13:41:06,285 DEBUG zen.ZenHub: worklist has 2 items
2012-11-20 13:41:06,285 DEBUG zen.ZenHub: all workers are busy
2012-11-20 13:41:11,286 DEBUG zen.ZenHub: worklist has 2 items
2012-11-20 13:41:11,286 DEBUG zen.ZenHub: all workers are busy
2012-11-20 13:41:16,287 DEBUG zen.ZenHub: worklist has 2 items
2012-11-20 13:41:16,287 DEBUG zen.ZenHub: all workers are busy
2012-11-20 13:41:21,287 DEBUG zen.ZenHub: worklist has 2 items
2012-11-20 13:41:21,287 DEBUG zen.ZenHub: all workers are busy
2012-11-20 13:41:21,319 DEBUG zen.Events: =============== incoming event ===============
2012-11-20 13:41:21,319 DEBUG zen.Events: Got a localhost zenhub heartbeat event (timeout 90 sec).
2012-11-20 13:41:21,319 DEBUG zen.zenoss.protocols.amqp: Publishing with routing key zenoss.heartbeat.localhost to exchange zenoss.heartbeats
2012-11-20 13:41:22,753 DEBUG zen.hub: adding listener for localhost:EventService
2012-11-20 13:41:22,756 DEBUG zen.hub: adding listener for localhost:EventLogConfig
2012-11-20 13:41:22,780 DEBUG zen.ZenHub: worklist has 3 items
2012-11-20 13:41:22,780 DEBUG zen.ZenHub: all workers are busy
2012-11-20 13:41:22,780 DEBUG zen.ZenHub: worklist has 4 items
2012-11-20 13:41:22,780 DEBUG zen.ZenHub: all workers are busy
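To follow how the backlog grows, the "worklist has N items" lines can be pulled out of zenhub.log with a short script. This is only a rough sketch: the regular expression assumes the exact debug line format shown above, and the names `worklist_sizes`, `summarize`, and `SAMPLE` are made up for illustration.

```python
import re

# Matches debug lines such as:
#   2012-11-20 13:40:56,190 DEBUG zen.ZenHub: worklist has 3 items
WORKLIST_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d+ DEBUG zen\.ZenHub: "
    r"worklist has (\d+) items"
)


def worklist_sizes(lines):
    """Return (timestamp, size) pairs for every 'worklist has N items' line."""
    sizes = []
    for line in lines:
        match = WORKLIST_RE.match(line)
        if match:
            sizes.append((match.group(1), int(match.group(2))))
    return sizes


def summarize(sizes):
    """Return sample count, last size, and peak size, or None if no samples."""
    if not sizes:
        return None
    peak_at, peak = max(sizes, key=lambda pair: pair[1])
    return {
        "samples": len(sizes),
        "last": sizes[-1][1],
        "peak": peak,
        "peak_at": peak_at,
    }


# A few lines copied from the log above, used as sample input.
SAMPLE = [
    "2012-11-20 13:40:54,180 DEBUG zen.ZenHub: worklist has 1 items",
    "2012-11-20 13:40:54,180 DEBUG zen.ZenHub: candidate workers are [0, 1]",
    "2012-11-20 13:40:56,190 DEBUG zen.ZenHub: worklist has 3 items",
    "2012-11-20 13:41:22,780 DEBUG zen.ZenHub: worklist has 4 items",
]

if __name__ == "__main__":
    print(summarize(worklist_sizes(SAMPLE)))
```

Against the real file, the same functions can be fed `open(path_to_zenhub_log)` in place of `SAMPLE`; a peak that keeps climbing while "all workers are busy" repeats would match the behaviour described above.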
Thanks,
/mike
EDIT: Changed the RAM threshold in the rabbitmq config and adjusted the disk space as per: message/68974#68974