High load on monitoring servers

Mojo · May 14, 2014, 4:11pm

I’ve been building out sensu monitoring for a cluster of about 40 machines.

I have two VMs dedicated to sensu. RabbitMQ on both, redis on one, api, client, server, and dashboard on both.

I have two graphite machines, collecting metrics that are echoed to RabbitMQ by the sensu server through the graphite mutator.

I have a handful of checks and metrics, with more to come – disk, ntp, memory, load, a few application endpoints. I have an email handler for events, forwarding to gmail.

Right now my two sensu machines are running a load average around 10-20, usually about 18.

Typically my waiting processes include a few mailer handlers (eight when I glanced just now) and a few graphite mutators (four at this glance).

I’m not sure why the mailers are running and waiting. I see them there whether there’s an event to report or not.

Given that it’s a brain-dead configuration, is there something obviously wrong? The machines have 2GB memory, running at about 75% (no swap).

I can think of a few things to do:

Reduce the frequency of the checks (all at 60s right now)
Split off RabbitMQ to other machines
Split off sensu-server or sensu-api
Discover something in the code that’s driving the load (mailer? graphite-mutator?)

Any thoughts?

Best regards,
Mojo

Sean_Porter · May 14, 2014, 4:15pm

What type of handler are you using for Graphite? Is the mutator for Graphite using a command? Are metric events being sent to your mailer handler? Is the mailer handler blocked on SMTP?

Sean_Porter · May 14, 2014, 5:07pm

I do not recommend using the Graphite mutator from the community repository (https://github.com/sensu/sensu-community-plugins/blob/master/mutators/graphite.rb). Fork/exec is an expensive operation: it will make your system “lethargic” and consume the Sensu execution thread pool, leaving no room for the mailer, which would cause a backlog. I use the built-in Sensu mutator extension “only_check_output”, and produce metrics with the correct schema.

Sean.
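
A rough way to see the fork/exec cost on your own server is to time the same trivial work done in-process and done by starting a fresh Ruby interpreter per event, which is roughly what a command mutator or pipe handler costs. This is only a sketch: the method names are made up for illustration, and the "mutator" here just pulls the check output out of an event.

```ruby
require 'benchmark'
require 'json'
require 'rbconfig'

# Time `count` runs of an in-process "mutator" that only extracts the
# check output from an event (illustrative stand-in, not Sensu code).
def time_in_process(event_json, count)
  Benchmark.realtime do
    count.times { JSON.parse(event_json)['check']['output'] }
  end
end

# Time `count` runs where every event starts a new Ruby interpreter,
# which is what happens when a command is forked per event.
def time_forked(count)
  Benchmark.realtime do
    count.times { system(RbConfig.ruby, '-e', '') }
  end
end
```

Calling `time_in_process(event_json, 100)` and `time_forked(100)` with a sample event and comparing the two numbers should show how much of the load is interpreter startup rather than real work.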

Brian_Boyd · May 16, 2014, 1:59am

Also watch out for handlers forking. I’ve had issues where forking handlers drove high loads: the server runs just fine until a large number of alerts go off at once, and then it quickly becomes unresponsive because of the rate at which it’s trying to execute handlers. So when I need monitoring the most, it falls over. To resolve that I moved away from ‘pipe’ handlers and switched to ‘tcp’, converting all of the handlers to services, which was easy by leveraging the sensu libraries (socket handling, logging, config, etc). And because a TCP connection is less resource intensive than forking, I’ve since added numerous handlers (mailer, remediator, irc, elasticsearch) to the default handler set and haven’t had any issues yet.

brian
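
For anyone wondering what a handler-as-a-service looks like in outline, here is a minimal sketch using only the Ruby standard library rather than the sensu libraries mentioned above. It assumes Sensu connects, writes one event as JSON, and closes the connection; the method names and the summary format are hypothetical, and a real handler would send mail where this one prints.

```ruby
require 'json'
require 'socket'

# Build a one-line summary from a raw event (JSON text).
# Assumes the usual event layout with "client" and "check" sections.
def summarize_event(raw)
  event = JSON.parse(raw)
  client = event.fetch('client', {})['name']
  check = event.fetch('check', {})
  "#{client}/#{check['name']}: status #{check['status']}"
end

# Accept one connection, read until the sender closes it,
# and return the summary for that event.
def handle_connection(server)
  conn = server.accept
  begin
    summarize_event(conn.read)
  ensure
    conn.close
  end
end

# Long-running loop: a single resident process handles every event,
# so no interpreter is started per event.
def run_daemon(port)
  server = TCPServer.new('127.0.0.1', port)
  loop do
    begin
      puts handle_connection(server) # a real handler would notify here
    rescue JSON::ParserError => e
      warn "ignoring malformed event: #{e.message}"
    end
  end
end
```

Starting it with something like `run_daemon(4567)` (the port is an arbitrary choice) and pointing a handler of type tcp at that host and port replaces a fork per event with a socket write.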

Mojo · May 19, 2014, 6:49pm

Brian,

Yes it looks like getting rid of the piped handlers is going to be a big step.

Did you write your own tcp mailer handler? I’d love to find an example of how that’s done.

Getting rid of the graphite_mutator has made a bit of a difference. Now the mailer handler is really dragging it down. It appears that sensu is exec’ing the mailer.rb script on every handling, even though an email isn’t generated because of an “occurrences” restriction.

Mojo

Mojo · May 19, 2014, 7:16pm

I see exactly what’s going on with the mailer handler: It subclasses Sensu::Handler, which does the actual filtering based on occurrences or disabled or silenced. So if I have an old event, or bunches of them, and they’re triggered every minute, they all cause a fork and a test.

So the trick is to get this code to run inside Sensu without piping it to a new process and new ruby startup.

Or write a tcp mailer handler that runs as a daemon.

Mojo
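
The occurrences test itself is cheap once the event is already in memory, which is why running it inside a resident process helps so much. The following is a simplified sketch of that kind of filter, not the Sensu::Handler code itself; the default values and the `notify?` name are assumptions made for illustration.

```ruby
# Decide whether an event should produce a notification.
# Simplified sketch: defaults below are assumed, not taken from Sensu.
def notify?(event)
  check    = event.fetch('check', {})
  wanted   = (check['occurrences'] || 1).to_i  # failures required before alerting
  interval = (check['interval'] || 60).to_i    # seconds between check runs
  refresh  = (check['refresh'] || 1800).to_i   # seconds between repeat alerts
  seen     = event['occurrences'].to_i

  return false if seen < wanted   # not enough consecutive failures yet
  return true if seen == wanted   # threshold reached for the first time
  return true unless event['action'] == 'create' # e.g. resolve events pass

  # After the first alert, only repeat once per refresh period.
  every = interval.zero? ? 0 : refresh / interval
  every.zero? || (seen % every).zero?
end
```

With a 60s interval and a 1800s refresh, an old event that fires every minute passes this test only on every 30th occurrence, and the other 29 cost a hash lookup instead of a fork and a Ruby startup.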

Brian_Boyd · May 21, 2014, 3:36am

Here’s the TCP mailer handler I’m using. mailer-handler.rb - TCP Mailer handler for Sensu · GitHub

It’s based on the community-provided mailer.rb and some of the sensu classes like Sensu::Server. It leverages the sensu libraries for reading the config, listening on a socket, logging, etc. I wasn’t able to use the Sensu::Handler class as-is because it expects events to be read from STDIN, so while I didn’t like doing it, the best approach I could find was to copy and modify.

I’ve taken the same approach with the remediator and elasticsearch handlers, and am wrapping up the irc handler (which will have a persistent connection to avoid the login/logout).

Using this approach I’ve noticed the load on the server remains more predictable even if there is a flood of alerts.

brian

David_Mark · June 23, 2014, 4:38pm

Hi Portertech

I ran into the same problem: my CPU (2 cores) was at ~70% load and I wanted to reduce it, and I’m also using the Graphite mutator.

When you say that you use the built-in mutator extension “only_check_output”, how can I use the built-in mutator instead of the graphite one? And what do I need to add in order to send the data to Graphite?

In addition, do I need to download the latest Sensu\sensu-extension packages in order to use that?

Thanks!

Mojo · June 24, 2014, 3:10pm

David, you can use one of the metrics plugins, such as load-metrics.rb (for load averages) from the community sensu plugins.

The one thing that the (bad) graphite mutator gets you is reversing the host name in the metric labels. After ripping out the graphite mutator, I modified my local copy of the metrics plugin to give the name format I wanted by default – but you can also specify the name scheme on the command line for the plugin.
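
If you want the reversed host name without the mutator, the transformation is small enough to do wherever the metric path is built. A minimal sketch, assuming Graphite plaintext lines of the form "path value timestamp" and a fully qualified host name at the start of the path; the method name is made up for illustration:

```ruby
# Reverse the host part of each metric path, e.g.
# web01.example.com.load_avg.one becomes com.example.web01.load_avg.one.
# Only the path (first field) of each line is touched.
def reverse_host_in_metrics(output, hostname)
  reversed = hostname.split('.').reverse.join('.')
  output.each_line.map do |line|
    path, rest = line.split(' ', 2)
    next line if path.nil? || rest.nil? # leave blank or odd lines alone
    "#{path.sub(hostname, reversed)} #{rest}"
  end.join
end
```

Reversing puts the domain first, so hosts group by domain in the Graphite tree instead of each host name becoming its own top-level node.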

Then when you add the check json to your config, you give it “type”: “metric” so sensu knows it’s graphite metric format. Here’s a copy of my graphite handler json config that uses the built-in extension:

    {
      "handlers": {
        "graphite": {
          "command": null,
          "exchange": {
            "name": "metrics",
            "type": "topic",
            "passive": true
          },
          "type": "amqp",
          "severities": [
            "ok",
            "warning",
            "critical",
            "unknown"
          ],
          "mutator": "only_check_output"
        }
      }
    }

When sensu stops forking that graphite_mutator, the load drop is dramatic.

That, combined with a mailer handler daemon that doesn’t start a new process for every event, has kept my load very reasonable.

Mojo

David_Mark · June 29, 2014, 9:50am

Thanks Mojo

Now I just need to add a TCP method to the Pushover handler.

David

John_Dyer · July 3, 2014, 2:43am

It’s my understanding that @b-Boyd’s approach will no longer work in Sensu 0.13.0 because sensu/base no longer exists. Is everyone moving to extensions, or is flapjack the new hotness? Personally I am not super excited about having to add another outside component to do event handling… Does anyone else have thoughts or plans for dealing with high load in a post Sensu 0.12.0 world?

-John

Sean_Porter · July 3, 2014, 6:58am

I’m drafting an email for sensu-dev, but I plan to add occurrence filtering and silencing (API /subdue) to Sensu core.

Sean.

Mojo · July 3, 2014, 3:18pm

Sean, that will make a HUGE difference in load when there’s a stack of events that eventually get filtered. Thank you!

Mojo


John_Dyer · July 30, 2014, 6:47pm

Sean,

Did that email ever go out to the dev list? I couldn’t find it, and I am really interested in following this discussion.

-John


Stella_Tyler · September 21, 2014, 12:20pm

I think the server runs just fine until a large number of alerts go off at once; then it quickly becomes unresponsive because of the rate at which it’s trying to execute handlers. So when I need monitoring the most, it falls over. To resolve that I got away from ‘pipe’ handlers and switched to ‘tcp’: I converted all of the handlers to services, which was easy by leveraging the sensu libraries (socket handling, logging, config, etc).

···

On Monday, May 19, 2014 11:49:42 PM UTC+5, Morris Jones wrote:

Brian,

Yes it looks like getting rid of the piped handlers is going to be a big step.

Did you write your own tcp mailer handler? I’d love to find an example of how that’s done.

Getting rid of the graphite_mutator has made a bit of a difference. Now the mailer handler is really dragging it down. It appears that sensu is exec’ing the mailer.rb script on every handling, even though an email isn’t generated because of an “occurrences” restriction.

Mojo
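To illustrate the occurrences point: a long-running handler service can make this decision in-process, so no script is exec'd for events that would be dropped anyway. The sketch below is Python, not the actual mailer.rb logic. `should_notify` is a hypothetical helper, and it assumes the event carries a top-level `occurrences` count and an optional `check.occurrences` threshold.

```python
def should_notify(event, default_occurrences=1):
    """Return True when the event has repeated often enough to alert.

    Assumed event layout (hypothetical, for illustration only):
      event["occurrences"]          - how many times in a row the check failed
      event["check"]["occurrences"] - threshold configured on the check
    """
    check = event.get("check", {})
    required = int(check.get("occurrences", default_occurrences))
    seen = int(event.get("occurrences", 1))
    # Cheap comparison done in-process: no fork/exec for filtered events.
    return seen >= required
```

With a check threshold of 3, the first two failures would be dropped by this comparison alone, and only the third would go on to generate mail.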

On Thu, May 15, 2014 at 6:59 PM, Brian Boyd br...@boydduo.com wrote:

Also watch out for handlers forking. I’ve had issues where the forking of handlers has driven high loads. The server runs just fine until a large number of alerts go off at once; then it quickly becomes unresponsive because of the rate at which it’s trying to execute handlers. So when I need monitoring the most, it falls over. To resolve that I got away from ‘pipe’ handlers and switched to ‘tcp’: I converted all of the handlers to services, which was easy by leveraging the sensu libraries (socket handling, logging, config, etc). And because the TCP connection is less resource intensive than forking, I’ve since added numerous handlers (mailer, remediator, irc, elasticsearch) to the default handler set and haven’t had any issues yet.
brian
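For anyone looking for the shape of a ‘tcp’ handler service, here is a minimal Python sketch. It is not Brian's implementation, which by his description was built on the Sensu Ruby libraries. It assumes the Sensu server writes one JSON event per connection and then closes the socket. `EventServer`, `EventHandler`, and `summarize` are hypothetical names, and collecting summaries in a list stands in for real work such as sending mail.

```python
import json
import socketserver


def summarize(event):
    """Build a one-line summary from a Sensu-style event dict."""
    client = event.get("client", {}).get("name", "unknown")
    check = event.get("check", {})
    return "{}/{}: status {}".format(
        client, check.get("name", "unknown"), check.get("status")
    )


class EventHandler(socketserver.StreamRequestHandler):
    """Handle one connection: read one JSON event, then record a summary."""

    def handle(self):
        # Assumption: the sender closes the connection after writing the
        # event, so reading to EOF yields exactly one JSON document.
        raw = self.rfile.read()
        try:
            event = json.loads(raw.decode("utf-8"))
        except ValueError:
            return  # ignore malformed payloads instead of crashing the service
        self.server.received.append(summarize(event))


class EventServer(socketserver.ThreadingTCPServer):
    """Long-running service: one thread per connection, no fork/exec per event."""

    allow_reuse_address = True
    daemon_threads = True

    def __init__(self, address):
        super().__init__(address, EventHandler)
        self.received = []  # placeholder sink for handled events
```

Something like `EventServer(("127.0.0.1", 4444)).serve_forever()` would run it. The address and port are placeholders and would have to match the socket settings of the handler definition on the Sensu side.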

On Wednesday, May 14, 2014 12:07:48 PM UTC-5, portertech wrote:

I do not recommend using the Graphite mutator from the community repository (https://github.com/sensu/sensu-community-plugins/blob/master/mutators/graphite.rb). Fork/exec is an expensive operation: it will make your system “lethargic” and consume the Sensu execution thread pool, leaving no room for the mailer, which causes a backlog. I use the built-in Sensu mutator extension “only_check_output” and produce metrics with the correct schema.

Sean.
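As a rough picture of what this avoids: handing only the check output to the metrics handler is a field lookup, which needs no fork/exec when done in-process. The Python sketch below mimics that idea and is not the extension's source. `only_check_output` here is a hypothetical stand-in, `graphite_line` is an illustrative helper, and the metric path is made up. The `path value timestamp` line layout is Graphite's plaintext format.

```python
def only_check_output(event):
    """Return just the check output, dropping the rest of the event."""
    return event["check"]["output"]


def graphite_line(path, value, timestamp):
    """Format one metric as a Graphite plaintext line: 'path value timestamp'."""
    return "{} {} {}\n".format(path, value, int(timestamp))


# A metric check that already prints lines in this layout needs no rewriting
# on the server; the output can be forwarded to Graphite unchanged.
example_event = {
    "check": {"output": graphite_line("sensu.web01.load.one", 0.42, 1400083917)}
}
```

If the checks emit the right schema themselves, the server-side step reduces to this lookup, which is why the per-event script becomes unnecessary.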

On Wed, May 14, 2014 at 9:15 AM, Sean Porter porte...@gmail.com wrote:

What type of handler are you using for Graphite? Is the mutator for Graphite using a command? Are metric events being sent to your mailer handler? Is the mailer handler blocked on SMTP?

