I’ve been building out Sensu monitoring for a cluster of about 40 machines.
I have two VMs dedicated to Sensu: RabbitMQ on both, Redis on one, and the api, client, server, and dashboard on both.
I have two Graphite machines collecting metrics, which are echoed to RabbitMQ by the Sensu server through the graphite mutator.
I have a handful of checks and metrics, with more to come – disk, ntp, memory, load, a few application endpoints. I have an email handler for events, forwarding to Gmail.
Right now my two Sensu machines are running a load average of around 10-20, usually about 18.
Typically the waiting processes include a few mailer handlers (eight when I glanced just now) and a few graphite mutators (four at this glance).
I’m not sure why the mailers are running and waiting; I see them there whether or not there’s an event to report.
Given that it’s a brain-dead-simple configuration, is there something obviously wrong? The machines have 2 GB of memory and are using about 75% of it (no swap).
I can think of a few things to do:
- Reduce the frequency of the checks (all at 60s right now)
- Split RabbitMQ off to other machines
- Split off sensu-server or sensu-api
- Discover something in the code that’s driving the load (mailer? graphite-mutator?)
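For reference, the first option is a one-line change in the check definition. A check of the sort described might look like this (the check name, command, and thresholds are illustrative, not from the original setup); raising "interval" from 60 to, say, 300 is the cheapest experiment:

```json
{
  "checks": {
    "check_load": {
      "command": "check-load.rb -w 4,3,2 -c 8,6,4",
      "subscribers": ["all"],
      "interval": 60,
      "handlers": ["mailer"]
    }
  }
}
```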
I do not recommend using the Graphite mutator from the community repository (https://github.com/sensu/sensu-community-plugins/blob/master/mutators/graphite.rb). Fork/exec is an expensive operation; it will make your system “lethargic” and consume the Sensu execution thread pool, leaving no room for the mailer and causing a backlog. I use the built-in Sensu mutator extension “only_check_output”, and produce metrics with the correct schema.
What type of handler are you using for Graphite? Is the mutator for Graphite using a command? Are metric events being sent to your mailer handler? Is the mailer handler blocked on SMTP?
Also watch out for handlers forking. I’ve had issues where the forking of handlers drove high load: the server runs just fine until a large number of alerts go off at once, and then it quickly becomes unresponsive because of the rate at which it’s trying to execute handlers. So it falls over exactly when I need monitoring the most. To resolve that, I got away from ‘pipe’ handlers and switched to ‘tcp’, converting all of the handlers to services, which was easy by leveraging the Sensu libraries (socket handling, logging, config, etc.). And because a TCP connection is less resource-intensive than a fork, I’ve since added numerous handlers (mailer, remediator, irc, elasticsearch) to the default handler set and haven’t had any issues yet.
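In config terms the switch Brian describes is small (the handler names, command, and port here are hypothetical): a pipe handler forks its command once per event, while a tcp handler just writes the event JSON to a socket that a resident service reads.

```json
{
  "handlers": {
    "mailer_pipe": {
      "type": "pipe",
      "command": "mailer.rb"
    },
    "mailer_tcp": {
      "type": "tcp",
      "socket": {
        "host": "127.0.0.1",
        "port": 3100
      }
    }
  }
}
```

The daemon listening on 3100 pays the Ruby startup and config-loading cost once, at boot, instead of on every event.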
brian
Yes, it looks like getting rid of the piped handlers is going to be a big step.
Did you write your own TCP mailer handler? I’d love to find an example of how that’s done.
Getting rid of the graphite_mutator has made a bit of a difference; now the mailer handler is really dragging things down. It appears that Sensu is exec’ing the mailer.rb script on every handled event, even when no email is generated because of an “occurrences” restriction.
Mojo
I see exactly what’s going on with the mailer handler: it subclasses Sensu::Handler, which does the actual filtering based on occurrences, or on the event being disabled or silenced. So if I have an old event, or bunches of them, triggered every minute, every one of them costs a fork and a test.
So the trick is to get this code to run inside Sensu without piping to a new process and paying a fresh Ruby startup each time.
Or to write a TCP mailer handler that runs as a daemon.
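A daemon of that shape is mostly plumbing. This is a minimal sketch under stated assumptions (class, port, and threshold are all illustrative, not from anyone’s actual handler): Sensu’s tcp handler type writes the event JSON to a socket, so the daemon accepts a connection, parses the event, and runs the occurrences filter as an ordinary method call in a resident process instead of a fork-and-exec per event.

```ruby
require 'socket'
require 'json'

# Hypothetical long-running replacement for a pipe-style mailer handler.
class TcpMailerDaemon
  def initialize(port: 3100, min_occurrences: 3)
    @port = port
    @min_occurrences = min_occurrences
  end

  # The same occurrences test that mailer.rb re-runs in a fresh Ruby
  # process on every event; here it is just a method call.
  def actionable?(event)
    event['occurrences'].to_i >= @min_occurrences
  end

  def handle(event)
    return unless actionable?(event)
    # SMTP delivery would go here, inside this resident process.
  end

  # Accept one JSON event per connection, as Sensu's tcp handler sends it.
  # Call serve! to run; it blocks forever.
  def serve!
    server = TCPServer.new(@port)
    loop do
      client = server.accept
      handle(JSON.parse(client.read))
      client.close
    end
  end
end
```

The filtering logic is deliberately separate from the socket loop so it can be tested without a running server.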
It’s based on the community-provided mailer.rb and some of the Sensu classes like Sensu::Server, and it leverages the Sensu libraries for reading the config, listening on a socket, logging, etc. I wasn’t able to use the Sensu::Handler class as-is because it expects events to be read from STDIN, so while I didn’t like doing it, the best approach I could find was to copy and modify.
I’ve taken the same approach with the remediator and elasticsearch handlers, and I’m wrapping up the irc handler (which will keep a persistent connection to avoid the login/logout churn).
Using this approach, I’ve noticed the load on the server remains much more predictable even when there’s a flood of alerts.
brian
I’ve run into the same problem: my CPU (2 cores) was seeing ~70% load and I wanted to reduce it, and I’m also using the Graphite mutator.
When you say you use the built-in mutator extension “only_check_output”, how do I use the built-in mutator instead of the Graphite one? And what do I need to add in order to send the data to Graphite?
In addition, do I need to install the latest Sensu/sensu-extension packages in order to use it?
Thanks!
David, you can use one of the metrics handlers, such as load-metrics.rb (for load averages) from the community Sensu plugins.
The one thing the (bad) Graphite mutator gets you is reversing the host name for the metric labels. After ripping out the Graphite mutator, I modified my local copy of the metrics plugin to produce the name format I wanted by default, but you can also specify the naming scheme on the plugin’s command line.
Then, when you add the check JSON to your config, give it “type”: “metric” so Sensu knows it’s in Graphite metric format. Here’s a copy of my Graphite handler JSON config that uses the built-in extension:
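A handler definition of that shape looks roughly like this (the hostname is a placeholder; 2003 is Graphite’s conventional plaintext-listener port). Naming “only_check_output” as the mutator strips the event wrapper so only the plugin’s Graphite-formatted lines reach the socket:

```json
{
  "handlers": {
    "graphite": {
      "type": "tcp",
      "socket": {
        "host": "graphite.example.com",
        "port": 2003
      },
      "mutator": "only_check_output"
    }
  }
}
```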
When Sensu stops forking that graphite_mutator, the load drop is dramatic.
That, combined with a mailer handler daemon that doesn’t spawn a new process per event, has kept my load very reasonable.
Mojo
Thanks Mojo.
Now I just need to add a TCP method to the Pushover handler.
David
It’s my understanding that @b-Boyd’s approach will no longer work in Sensu 0.13.0, because sensu/base no longer exists. Is everyone moving to extensions, or is Flapjack the new hotness? Personally, I’m not super excited about adding another outside component to do event handling. Does anyone else have thoughts or plans for dealing with high load in a post-0.12.0 Sensu world?
-John
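For the extensions route, an in-process handler is roughly the following shape. This is a hedged sketch, not a definitive implementation: the method names are from memory of the 0.13-era extension interface, and the base class is stubbed here so the sketch runs standalone (a real extension would inherit the class Sensu provides and be dropped into the extensions directory).

```ruby
require 'json'

# Stand-in for the base class Sensu supplies at runtime; stubbed so this
# sketch is self-contained. Do not define this in a real extension.
module Sensu
  module Extension
    class Handler; end
  end
end

class MailerExtension < Sensu::Extension::Handler
  def name
    'mailer_extension'
  end

  # Sensu hands the extension the event and a callback; the extension
  # yields output and a status code instead of forking a handler script,
  # so filtering and delivery stay inside the sensu-server process.
  def run(event_json)
    event = JSON.parse(event_json)
    if event['check']['occurrences'].to_i >= 3
      # mail delivery would happen here, in-process
      yield "mail queued for #{event['client']['name']}", 0
    else
      yield 'below occurrence threshold, skipped', 0
    end
  end
end
```

The trade-off versus a TCP handler daemon is that an extension shares the server’s event loop, so anything slow (like a blocking SMTP call) has to be handled carefully.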
On Monday, May 19, 2014 11:49:42 PM UTC+5, Morris Jones wrote:
Brian,
Yes, it looks like getting rid of the piped handlers is going to be a big step.
Did you write your own TCP mailer handler? I’d love to find an example of how that’s done.
Getting rid of the graphite_mutator has made a bit of a difference. Now the mailer handler is really dragging things down. It appears that Sensu is exec’ing the mailer.rb script for every handled event, even though no email is generated because of an “occurrences” restriction.
Mojo
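To illustrate why the “occurrences” restriction doesn’t help with load: a filter of roughly this shape runs *inside* the exec’d handler script, so the fork/exec cost has already been paid by the time the event is discarded. The interval/refresh numbers below are illustrative assumptions, not mailer.rb’s actual values.

```ruby
# Hedged sketch of an occurrences-style filter as it might appear
# inside a pipe handler. The key point: this code runs after
# sensu-server has already forked and exec'd the Ruby script, so a
# suppressed email still costs a full process launch.
def filter_repeated?(event, interval: 60, refresh: 1800)
  occurrences = event['occurrences']
  return false if occurrences == 1          # first occurrence: alert
  # after that, re-alert only once per `refresh` seconds of failures
  (occurrences - 1) % (refresh / interval) != 0
end

event = { 'occurrences' => 5 }
puts filter_repeated?(event)   # suppressed -- but the fork already happened
```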
On Thu, May 15, 2014 at 6:59 PM, Brian Boyd br...@boydduo.com wrote:
Also watch out for handlers forking. I’ve had issues where forking handlers drove loads high: the server runs just fine until a large number of alerts go off at once, and then it quickly becomes unresponsive because of the rate at which it’s trying to execute handlers. So exactly when I need monitoring the most, it falls over. To resolve that I moved away from ‘pipe’ handlers and switched to ‘tcp’: I converted all of the handlers to services, which was easy by leveraging the sensu libraries (socket handling, logging, config, etc.). And because a TCP connection is far less resource-intensive than a fork, I’ve since added numerous handlers (mailer, remediator, irc, elasticsearch) to the default handler set and haven’t had any issues yet.
brian
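For anyone looking for a starting point, a minimal version of this socket-service approach might look like the sketch below. The port number and the dispatch stub are illustrative assumptions; a real service would reuse the sensu libraries for config and logging as described above.

```ruby
require 'socket'
require 'json'

# Hedged sketch of a handler-as-a-service: a tiny TCP listener that
# accepts Sensu event JSON, so sensu-server pays only for a socket
# write instead of a fork/exec per event. The matching handler
# definition would use Sensu's built-in "tcp" handler type, e.g.:
#
#   { "handlers": { "mailer": {
#       "type": "tcp",
#       "socket": { "host": "127.0.0.1", "port": 3000 } } } }
#
def run_event_service(port: 3000, events: [])
  server = TCPServer.new('127.0.0.1', port)
  Thread.new do
    loop do
      client = server.accept
      raw = client.read                 # sensu-server writes, then closes
      events << JSON.parse(raw)         # stand-in for mailer/irc dispatch
      client.close
    end
  end
  [server, events]
end
```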
On Wednesday, May 14, 2014 12:07:48 PM UTC-5, portertech wrote:
I do not recommend using the Graphite mutator from the community repository (https://github.com/sensu/sensu-community-plugins/blob/master/mutators/graphite.rb). Fork/exec is an expensive operation; it will make your system “lethargic” and consume the Sensu execution thread pool, leaving no room for the mailer, which causes a backlog. I use the built-in Sensu mutator extension “only_check_output”, and produce metrics with the correct schema.
What type of handler are you using for Graphite? Is the mutator for Graphite using a command? Are metric events being sent to your mailer handler? Is the mailer handler blocked on SMTP?
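In other words, the metric check itself emits lines already in Graphite’s plaintext schema (“dotted.path value epoch”), so only_check_output can pass the check output through untouched with no per-event mutator process. A minimal sketch of that formatting (the metric path here is an illustrative assumption):

```ruby
# Hedged sketch: format a metric line in Graphite's plaintext schema,
# "dotted.path value unix_timestamp". If the check prints lines like
# this directly, the built-in only_check_output mutator can forward
# the output unchanged -- no fork per event.
def graphite_line(path, value, time = Time.now)
  "#{path} #{value} #{time.to_i}"
end

# e.g. inside a load metric check (the path is a made-up example):
puts graphite_line('stats.web1.load.one_minute', 0.42)
```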