High load on monitoring servers

I’ve been building out Sensu monitoring for a cluster of about 40 machines.

I have two VMs dedicated to Sensu: RabbitMQ on both, Redis on one, and the API, client, server, and dashboard on both.

I have two Graphite machines collecting metrics, which the Sensu server echoes to RabbitMQ through the graphite mutator.

I have a handful of checks and metrics, with more to come – disk, ntp, memory, load, a few application endpoints. I have an email handler for events, forwarding to Gmail.

Right now my two sensu machines are running a load average around 10-20, usually about 18.

Typically my waiting processes include a few mailer handlers (eight when I glanced just now) and a few graphite mutators (four at this glance).

I’m not sure why the mailers are running and waiting. I see them there whether there’s an event to report or not.

Given that it’s a brain-dead configuration, is there something obviously wrong? The machines have 2GB of memory, about 75% used (no swap).

I can think of a few things to do:

  1. Reduce the frequency of the checks (all at 60s right now)

  2. Split off RabbitMQ to other machines

  3. Split off sensu-server or sensu-api

  4. Discover something in the code that’s driving the load (mailer? graphite-mutator?)

Any thoughts?

Best regards,
Mojo

What type of handler are you using for Graphite? Is the mutator for Graphite using a command? Are metric events being sent to your mailer handler? Is the mailer handler blocked on SMTP?

Sean.

I do not recommend using the Graphite mutator from the community repository, https://github.com/sensu/sensu-community-plugins/blob/master/mutators/graphite.rb. Fork/exec is an expensive operation; it will make your system “lethargic” and consume the Sensu execution thread pool, leaving no room for the mailer and causing a backlog. I use the built-in Sensu mutator extension “only_check_output”, and produce metrics with the correct schema.

Sean.
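For reference, “the correct schema” here means the Graphite plaintext protocol that a metric check prints to standard output: one "path value timestamp" line per datapoint. The paths below are made-up examples, not anyone’s actual metrics:

```
# <metric path> <value> <unix timestamp>
web01.load.load_avg.one 0.85 1400085600
web01.load.load_avg.five 0.62 1400085600
web01.load.load_avg.fifteen 0.41 1400085600
```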

Also watch out for handlers forking. I’ve had issues where the forking of handlers has driven high loads. The server runs just fine until a large number of alerts go off at once, and then it quickly becomes unresponsive because of the rate at which it’s trying to execute handlers. So when I need monitoring the most, it falls over. To resolve that I got away from ‘pipe’ handlers and switched to ‘tcp’, converting all of the handlers to services, which was easy by leveraging the Sensu libraries (socket handling, logging, config, etc.). And because a TCP connection is less resource-intensive than forking, I’ve since added numerous handlers (mailer, remediator, irc, elasticsearch) to the default handler set and haven’t had any issues yet.

brian
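For anyone following along, a ‘tcp’ handler definition from this era of Sensu looks roughly like the sketch below. The handler name, host, and port are placeholders (not Brian’s actual configuration); they just need to point at wherever the handler daemon is listening:

```json
{
  "handlers": {
    "mailer_tcp": {
      "type": "tcp",
      "socket": {
        "host": "127.0.0.1",
        "port": 3333
      }
    }
  }
}
```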

Brian,

Yes, it looks like getting rid of the piped handlers is going to be a big step.

Did you write your own TCP mailer handler? I’d love to find an example of how that’s done.

Getting rid of the graphite_mutator has made a bit of a difference. Now the mailer handler is really dragging things down. It appears that Sensu is exec’ing the mailer.rb script for every event it handles, even when no email is generated because of an “occurrences” restriction.

Mojo

I see exactly what’s going on with the mailer handler: it subclasses Sensu::Handler, which does the actual filtering based on occurrences, disabled checks, and silencing. So if I have an old event, or bunches of them, retriggered every minute, each one causes a fork just to run that test.

So the trick is to get this code to run inside Sensu without piping it to a new process and a fresh Ruby startup.

Or to write a TCP mailer handler that runs as a daemon.

Mojo
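To make the mechanics concrete, here is a minimal sketch of a pipe handler built on the sensu-plugin gem (illustrative only, not the community mailer.rb). sensu-server forks and execs the script for every event it handles, the script reads the event JSON from STDIN, and only then do the built-in filters (silenced, disabled, occurrences) decide whether handle runs, so the fork and Ruby startup cost is paid even for events that end up filtered:

```ruby
#!/usr/bin/env ruby
# Minimal illustrative pipe handler (not the community mailer.rb).
# sensu-server forks/execs this script for every event; the event JSON
# arrives on STDIN, and Sensu::Handler's built-in filters (silenced,
# disabled, occurrences) run here, in the child process, before handle
# is ever called.
require 'sensu-handler'

class TrivialMailer < Sensu::Handler
  def handle
    # Only reached if the event survives filtering.
    puts "would email about #{@event['client']['name']}/#{@event['check']['name']}"
  end
end
```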

Here’s the TCP mailer handler I’m using: mailer-handler.rb - TCP Mailer handler for Sensu · GitHub

It’s based on the community-provided mailer.rb and some of the Sensu classes like Sensu::Server. It leverages the Sensu libraries for reading the config, listening on a socket, logging, etc. I wasn’t able to use the Sensu::Handler class as-is because it expects events to be read from STDIN, so while I didn’t like doing it, the best approach I could find was to copy and modify.

I’ve taken the same approach with the remediator and elasticsearch handlers, and I’m wrapping up the irc handler (which will keep a persistent connection to avoid the login/logout).

Using this approach I’ve noticed the load on the server remains much more predictable, even when there’s a flood of alerts.

brian
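For readers who can’t reach the gist, the general shape of such a daemon is roughly the sketch below: a plain Ruby TCPServer standing in for the Sensu socket/logging plumbing Brian mentions. The host, port, and the filtering/mailing logic are placeholders, not his actual code:

```ruby
#!/usr/bin/env ruby
# Rough sketch of a standalone TCP event sink (NOT Brian's gist).
# Point a Sensu "tcp" handler's socket at this host/port; each
# connection delivers one JSON event. The process stays resident,
# so there is no per-event fork or Ruby startup cost.
require 'socket'
require 'json'

server = TCPServer.new('127.0.0.1', 3333)
loop do
  Thread.start(server.accept) do |client|
    raw = client.read # read until the sender closes its end
    client.close
    begin
      event = JSON.parse(raw)
      # Apply your own occurrence/silence filtering here, then send
      # mail, post to IRC, index into Elasticsearch, etc.
      puts "handling #{event['client']['name']}/#{event['check']['name']}"
    rescue JSON::ParserError
      warn 'ignoring malformed event payload'
    end
  end
end
```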

Hi Portertech

I also encountered the same problem: my CPU (2 cores) was seeing ~70% load and I wanted to reduce it, and I am also using the Graphite mutator.

When you say that you use the built-in mutator extension “only_check_output”, how can I use the built-in mutator instead of the graphite one? And what do I need to add in order to send the data to Graphite?

In addition, do I need to install the latest Sensu / sensu-extension packages in order to use that?

Thanks!

David, you can use one of the metric check plugins, such as load-metrics.rb (for load averages) from the community sensu plugins.

The one thing that the (bad) graphite mutator gets you is reversing the host name for the metric labels. After ripping out the graphite mutator, I modified my local copy of the metrics plugin to give the name format I wanted by default – but you can also specify the name scheme on the command line for the plugin.

Then when you add the check JSON to your config, you give it “type”: “metric” so Sensu treats it as a metric check. Here’s a copy of my graphite handler JSON config that uses the built-in extension:

```json
{
  "handlers": {
    "graphite": {
      "command": null,
      "exchange": {
        "name": "metrics",
        "type": "topic",
        "passive": true
      },
      "type": "amqp",
      "severities": [
        "ok",
        "warning",
        "critical",
        "unknown"
      ],
      "mutator": "only_check_output"
    }
  }
}
```
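And for completeness, a matching metric check definition might look something like the sketch below. This is a hedged example, not Mojo’s actual config: the plugin path, the --scheme value, and the subscription name are placeholders.

```json
{
  "checks": {
    "load_metrics": {
      "type": "metric",
      "command": "/etc/sensu/plugins/load-metrics.rb --scheme stats.:::name:::",
      "subscribers": ["all"],
      "interval": 60,
      "handlers": ["graphite"]
    }
  }
}
```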

Once Sensu stops forking that graphite_mutator, the load drop is dramatic.

That, combined with a mailer handler daemon that doesn’t spawn a new process per event, has kept my load very reasonable.

Mojo

Thanks Mojo

Now I just need to add a TCP method to the Pushover handler.

David

It’s my understanding that @b-Boyd’s approach will no longer work in Sensu 0.13.0 because sensu/base no longer exists. Is everyone moving to extensions, or is Flapjack the new hotness? Personally I am not super excited about having to add another outside component to do event handling… Anyone else have any thoughts / plans for dealing with high load in a post-0.12.0 Sensu world?

-John

I’m drafting an email for sensu-dev, but I plan to add occurrence filtering and silencing (API /subdue) to Sensu core.

Sean.

Sean, that will make a HUGE difference in load when there’s a stack of events that eventually get filtered. Thank you!

Mojo

Sean,

Did that email ever go out to the dev list? I couldn’t find it, and I am really interested in following this discussion.

-John
