Sensu performance/architecture

jawed_khelil · August 26, 2015, 1:12pm

Hi everyone,
We have one Sensu server dedicated to an OpenStack environment. The server is a VM with 8 GB of RAM and 2 CPUs.
The server handles:

3 classic checks (RAM/CPU/disk)
3 metric checks (every 20 s)

We had about 50 virtual machines. Then the number of VMs increased unexpectedly, to about 100.

After 24 hours, the server had accumulated a 4-hour delay in handling results. I can see 150K messages in the result queue.

I added a new server, but I am wondering whether I will have to add a Sensu server for every additional 50 VMs in the cloud environment.

Could you share your Sensu deployment architecture (number of servers, resources (CPU, RAM), whether the API has a dedicated server) and the size of your infrastructure?

Are there any benchmark tests or results for Sensu?

Thank you

Kyle_Anderson · August 26, 2015, 2:36pm

Scaling here is going to be a function of the handlers in use.
RabbitMQ is probably fine, but the metric checks generate 15 events per
second, and Sensu has to spawn something to deal with them.
What handlers do you have on your metric checks?
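
The figure of 15 events per second follows from the numbers in the original post, assuming the fleet is 100 VMs and every VM runs all three metric checks on the 20-second interval. A minimal sketch of that arithmetic (the function name is illustrative only):

```python
def metric_events_per_second(vms: int, checks_per_vm: int, interval_s: float) -> float:
    """Average number of metric check results published per second."""
    return vms * checks_per_vm / interval_s


if __name__ == "__main__":
    # Figures from the original post: 3 metric checks per VM, every 20 s.
    print(metric_events_per_second(50, 3, 20))   # before the growth
    print(metric_events_per_second(100, 3, 20))  # after the growth
```

The rate grows linearly with the number of VMs, so doubling the fleet doubles the number of results that need handling.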


jawed_khelil · August 26, 2015, 2:44pm

Just another note:
I noticed that when I added the second server, the result queue started to decrease. In one hour it went from 150K messages to 0, a rate of about 40 messages/s.
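
As a rough check on that rate, 150K messages drained in one hour is a little under 42 messages per second. The sketch below assumes the queue emptied steadily over exactly 3600 seconds; it gives the net drain (consumption minus new results still arriving), not the raw throughput of either server.

```python
def net_drain_rate(messages: int, seconds: float) -> float:
    """Net messages removed from the queue per second."""
    return messages / seconds


if __name__ == "__main__":
    rate = net_drain_rate(150_000, 3600)
    print(f"{rate:.1f} messages/s")
```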

But the first server is still in a bad state. It remains at 94% RAM used, with multiple defunct ruby processes.

Is this the expected behaviour? Do I have to restart it to get back to a normal state?

Cpu(s): 62.4%us, 10.6%sy, 0.0%ni, 22.9%id, 2.9%wa, 0.6%hi, 0.6%si, 0.0%st
Mem: 8189764k total, 8086500k used, 103264k free, 3140k buffers
Swap: 524284k total, 524284k used, 0k free, 20320k cached

  PID USER  PR NI  VIRT  RES  SHR S %CPU %MEM     TIME+ COMMAND
32481 sensu 20  0 8674m 7.4g 1680 R 53.9 94.8   1057:24 sensu-server
31367 sensu 20  0  124m 6292 2160 R  5.9  0.1   0:00.05 ruby
31360 sensu 20  0  124m 5808 2028 R  4.7  0.1   0:00.04 ruby
31361 sensu 20  0  124m 6292 2160 D  4.7  0.1   0:00.04 ruby
31362 sensu 20  0  124m 6100 2136 R  4.7  0.1   0:00.04 ruby
31363 sensu 20  0  124m 5700 2028 R  4.7  0.1   0:00.04 ruby
31365 sensu 20  0  124m 6288 2160 D  4.7  0.1   0:00.04 ruby
31366 sensu 20  0  124m 5808 2028 R  4.7  0.1   0:00.04 ruby
31369 sensu 20  0  124m 5908 2032 R  4.7  0.1   0:00.04 ruby
31371 sensu 20  0  124m 5636 2008 R  4.7  0.1   0:00.04 ruby
31364 sensu 20  0  124m 5708 2028 R  3.5  0.1   0:00.03 ruby
31368 sensu 20  0  124m 5708 2028 R  3.5  0.1   0:00.03 ruby
31370 sensu 20  0  124m 5700 2028 R  3.5  0.1   0:00.03 ruby
31373 sensu 20  0  124m 5708 2028 R  3.5  0.1   0:00.03 ruby
31382 sensu 20  0  124m 5700 2028 R  3.5  0.1   0:00.03 ruby
31384 sensu 20  0  124m 5724 2028 R  3.5  0.1   0:00.03 ruby


jawed_khelil · August 26, 2015, 2:48pm

@Kyle, I have two handlers:
a Flapjack handler, as an extension, from here: https://github.com/sensu/sensu-community-plugins/tree/master/extensions/handlers

an InfluxDB handler from here: https://github.com/yongtin/sensu-community-plugins/blob/dde6e484dfc09e178521ab08b508c3c8d7710d9a/handlers/metrics/influxdb-metrics.rb






Bryan_Brandau · August 26, 2015, 3:27pm

I’ll just echo what Kyle said: it really comes down to what the output of an event/metric is doing. If you are spawning a ton of handlers, the box will become overwhelmed. It’s always good to have a couple of servers processing events off the queue.

We can run hundreds of nodes with 25-30 checks per node on a single server, but when an event storm hits the entire cluster of nodes, Sensu (like any system) becomes overwhelmed. Plan your Sensu scaling around an event storm across your entire cluster.

I’ve mentioned this in the past, but we also don’t send metrics through Sensu. We use a collectd/graphite setup because we’re sending over a million metrics per minute. I wouldn’t want to scale Sensu to do that.

If you can do it safely, you can also look at incorporating an extension which would avoid the process forking. That would probably be better for your metrics.
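
To illustrate why avoiding the fork matters, the sketch below compares handling events by starting one child process per event with handling them by a function call inside an already-running process. It is a Python analogy under stated assumptions, not Sensu code: `handle_by_spawning` and `handle_in_process` are hypothetical stand-ins for a pipe handler and an extension, and the absolute timings depend entirely on the machine.

```python
import subprocess
import sys
import time


def handle_by_spawning(event: str) -> None:
    """Stand-in for a pipe handler: start a new interpreter for every event."""
    subprocess.run(
        [sys.executable, "-c", "import sys; sys.stdin.read()"],
        input=event.encode(),
        check=True,
    )


def handle_in_process(event: str, sink: list) -> None:
    """Stand-in for an extension: handle the event inside the running process."""
    sink.append(event)


def time_handler(handler, events) -> float:
    """Return the wall-clock seconds spent handling all events in turn."""
    start = time.perf_counter()
    for event in events:
        handler(event)
    return time.perf_counter() - start


if __name__ == "__main__":
    # One second's worth of results at 15 events per second.
    events = [f"cpu.load {i}" for i in range(15)]
    sink = []
    spawned = time_handler(handle_by_spawning, events)
    in_process = time_handler(lambda e: handle_in_process(e, sink), events)
    print(f"spawn per event: {spawned:.3f} s")
    print(f"in-process:      {in_process:.6f} s")
```

The in-process path only pays for a function call per event, while the spawning path pays for process creation and interpreter start-up each time.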

-Bryan


jawed_khelil · August 26, 2015, 3:39pm

OK, I am thinking about delegating the metric stuff to collectd/InfluxDB.


jawed_khelil · August 27, 2015, 8:04am

Is there a way for a Sensu handler to read from RabbitMQ in batch mode? I think that if it were possible, Sensu would fork fewer ruby processes to read metrics from RabbitMQ and send them to InfluxDB.

What is the model for spawning ruby processes in sensu-server? Is each check result in the queue handled by one ruby process spawned by Sensu?


Rob · September 10, 2015, 6:19pm

Hi Jawed,

You could also consider the following:

Configure InfluxDB with a UDP listener: https://github.com/influxdb/influxdb/blob/master/etc/config.sample.toml#L220
Modify your metric collection scripts to output InfluxDB line protocol: https://influxdb.com/docs/v0.9/write_protocols/line.html
Configure the metric checks to use a custom handler extension based on the built-in UDP handler: https://sensuapp.org/docs/0.16/extensions#handler-extensions
Configure the extension handler to reference the ‘only_check_output’ mutator: https://sensuapp.org/docs/0.16/extensions#mutator-extensions and https://github.com/sensu/sensu-extensions/blob/master/lib/sensu/extensions/mutators/only_check_output.rb

This should significantly improve the performance of your metrics routing, since no process is forked for each event that is processed.
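
The steps above can be sketched roughly as follows. This is a hedged illustration, not a tested deployment: `to_line_protocol` handles only float fields, and `udp_handler_definition` builds a plain `udp`-type Sensu handler (a simpler swap-in for the custom extension suggested above) using the `type`, `socket` and `mutator` keys as I understand them from the Sensu 0.16 handler reference. The handler name, host, port, measurement and tag names are all placeholders.

```python
import json
import time


def _escape(value: str, chars: str) -> str:
    """Backslash-escape the given delimiter characters."""
    for ch in chars:
        value = value.replace(ch, "\\" + ch)
    return value


def to_line_protocol(measurement: str, tags: dict, fields: dict, timestamp_ns: int) -> str:
    """Format one point as InfluxDB line protocol (float fields only)."""
    name = _escape(measurement, ", ")
    tag_part = "".join(
        f",{_escape(k, ',= ')}={_escape(v, ',= ')}" for k, v in sorted(tags.items())
    )
    field_part = ",".join(f"{_escape(k, ',= ')}={float(v)}" for k, v in fields.items())
    return f"{name}{tag_part} {field_part} {timestamp_ns}"


def udp_handler_definition(name: str, host: str, port: int) -> dict:
    """Build a Sensu handler definition that sends raw check output over UDP."""
    return {
        "handlers": {
            name: {
                "type": "udp",
                "socket": {"host": host, "port": port},
                "mutator": "only_check_output",
            }
        }
    }


if __name__ == "__main__":
    # What a metric check script would print as its output.
    print(to_line_protocol("cpu_load", {"host": "web01"}, {"value": 0.64}, time.time_ns()))
    # Placeholder host and port for the InfluxDB UDP listener.
    print(json.dumps(udp_handler_definition("influxdb_udp", "127.0.0.1", 8089), indent=2))
```

With a fixed timestamp of 1434055562000000000, the first call yields `cpu_load,host=web01 value=0.64 1434055562000000000`.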

Cheers,


Akshay_Kapoor · September 29, 2015, 11:10am

You mentioned that you don’t send time-series metrics through Sensu. How do you achieve that, given that the checks are subscribed in Sensu and these handlers (InfluxDB) fork a new process every time? So in your case, for time series you don’t use Sensu at all, not even for configuring which time-series data to send?


jawed_khelil · September 29, 2015, 11:50am

In our case, we use Sensu for monitoring/alerting and collectd/graphite for metrics.

