Can I use Sensu for both monitoring and process recovery?


#1

Hi,

I would like to have some feedback on my idea.

There are two goals I want to achieve:

  1. Put monitoring in place to ensure critical processes are running on several AWS EC2 instances and monitor error messages in application log files.

  2. If a critical process on a particular EC2 instance is down, I want to be able to restart it automatically on that host.

The first goal should be reasonably easy to achieve with Sensu.

But how about the second one?

I think a possible solution is to use Fabric (http://www.fabfile.org/) in the Sensu server’s handlers: if a process is found to be dead, a Fabric task restarts it. (Of course, I can set a MAX RETRY to avoid endless restarts.)

Does this sound reasonable? Is there an easier way to achieve the second goal in Sensu?

Thanks, Tony


#2

When the alert comes in, have it map to a handler that restarts the process. Simple as that :wink: The handler for an event can do anything: send an email, notify a Slack/HipChat channel, or perform some other operation.

···

On Mon, Jan 12, 2015 at 11:51 PM, Anthony Kong anthony.hw.kong@gmail.com wrote:

Hi,

I would like to have some feedback on my idea.

There are two goals I want to achieve:

  1. Put monitoring in place to ensure critical processes are running on several AWS EC2 instances and monitor error messages in application log files.
  2. If a critical process on a particular EC2 instance is down, I want to be able to restart it automatically on that host.

The first goal should be reasonably easy to achieve with Sensu.

But how about the second one?

I think a possible solution is to use Fabric (http://www.fabfile.org/) in the Sensu server’s handlers: if a process is found to be dead, a Fabric task restarts it. (Of course, I can set a MAX RETRY to avoid endless restarts.)

Does this sound reasonable? Is there an easier way to achieve the second goal in Sensu?

Thanks, Tony


#3

Thanks Matt for your quick response!

Where do I install the handler? As a sensu-client plugin I presume?

Cheers, Tony

···

On Tue, Jan 13, 2015 at 3:53 PM, Matt Jones <mattjones@yieldbot.com> wrote:

When the alert comes in, have it map to a handler that restarts the
process. Simple as that :wink: The handler for an event can do anything:
send an email, notify a Slack/HipChat channel, or perform some other
operation.


#4

http://sensuapp.org/docs/0.16/adding_a_handler

Handlers live on the server. In your case, ‘type’ would be ‘pipe’, so the handler can read the event data being sent (service name, etc.) if needed, and ‘command’ could be a shell script or something as simple as ‘sudo sv restart elasticsearch’.
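
A rough Python sketch of what such a pipe handler script might look like, in case it helps. Sensu writes the event as JSON to the handler's stdin; the field names used here (`client.address`, `check.name`, `check.status`, `occurrences`) are my assumption about the event layout, so check them against a real event first. The `SERVICES` mapping and `MAX_RETRIES` cap are made up for illustration, and plain `ssh` stands in for Fabric because the handler runs on the Sensu server and not on the affected host.

```python
import json
import subprocess
import sys

MAX_RETRIES = 3  # hypothetical cap on restart attempts per incident

# Hypothetical mapping from check name to the service that check watches.
SERVICES = {"check_elasticsearch": "elasticsearch"}


def plan_restart(event, max_retries=MAX_RETRIES):
    """Return the command to run for this event, or None to do nothing."""
    check = event.get("check", {})
    service = SERVICES.get(check.get("name"))
    if service is None:
        return None  # not a check we know how to remediate
    if check.get("status") != 2:
        return None  # only act on CRITICAL (exit status 2)
    if event.get("occurrences", 1) > max_retries:
        return None  # give up and leave it to a human
    host = event.get("client", {}).get("address")  # assumed field name
    if not host:
        return None
    # The handler runs on the Sensu server, so reach the client over ssh.
    return ["ssh", host, "sudo", "sv", "restart", service]


def main(stream=sys.stdin):
    """Read one event from the stream and run the planned command, if any."""
    command = plan_restart(json.load(stream))
    if command is None:
        print("no remediation for this event")
        return 0
    return subprocess.call(command)
```

A real script would end with `sys.exit(main())`, and the handler definition's ‘command’ would point at it. Keeping the decision in `plan_restart` makes the retry cap easy to test without touching any host.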

···

On Tue, Jan 13, 2015 at 12:35 AM, Anthony Kong anthony.hw.kong@gmail.com wrote:

Thanks Matt for your quick response!

Where do I install the handler? As a sensu-client plugin I presume?

Cheers, Tony


#5

I’m just getting my feet wet with Sensu, and I want to do something very similar (auto-remediation). I found this: https://github.com/sensu/sensu-community-plugins/blob/master/handlers/remediation/sensu.rb, although I’m not 100% sure how all the pieces fit together just yet. I.e., if handlers are installed on the server, I’m not yet clear how that handler results in action being taken on the client; however, the comments in that file indicate it would happen on the client.


#6

David:

    # This plugin reads configuration from a check definition
    # and triggers appropriate remediation actions (defined as
    # other checks) via the Sensu API, when the occurrences and
    # severities reach certain values.

This will just run other checks, not perform system actions. The other checks are called using the sensu API. The OP, I believe, wants to restart a service or something to that effect. This would entail the handler kicking off a script to perform this action on the server that created the event.
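
To illustrate the logic that comment describes, here is a simplified Python sketch (the real handler is Ruby, and this is not a port of it). The shape of the `remediation` block, the check names, and the request body with `check` and `subscribers` keys are all assumptions on my part; the body is what I would expect to POST to the Sensu API's check request endpoint, but verify against the API docs for your version.

```python
import json


def select_remediations(check, occurrences):
    """Return the names of remediation checks whose conditions match.

    Assumes check["remediation"] maps a remediation check name to
    {"occurrences": [...], "severities": [...]} (simplified shape).
    """
    selected = []
    for name, rule in sorted(check.get("remediation", {}).items()):
        matches_count = occurrences in rule.get("occurrences", [])
        matches_severity = check.get("status") in rule.get("severities", [])
        if matches_count and matches_severity:
            selected.append(name)
    return selected


def build_check_request(check_name, client_name):
    """Build a JSON body asking Sensu to run one check on one client.

    Assumes the client subscribes to a channel named after itself.
    """
    return json.dumps({"check": check_name, "subscribers": [client_name]})
```

For example, a check with `"remediation": {"restart_app": {"occurrences": [1, 2], "severities": [2]}}` would select `restart_app` on the first two CRITICAL occurrences and nothing after that.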

···

On Tue, Jan 13, 2015 at 8:14 PM, David Petzel davidpetzel@gmail.com wrote:

I’m just getting my feet wet with Sensu, and I want to do something very similar (auto-remediation). I found this: https://github.com/sensu/sensu-community-plugins/blob/master/handlers/remediation/sensu.rb, although I’m not 100% sure how all the pieces fit together just yet. I.e., if handlers are installed on the server, I’m not yet clear how that handler results in action being taken on the client; however, the comments in that file indicate it would happen on the client.


#7

Sorry if I’m being dense here, but I’m not grokking the distinction. Isn’t that remediation plugin “performing system actions” by invoking the unpublished checks? Aren’t those additional “_remediation” checks just system actions, and couldn’t one of those restart the process in question?

Again, sorry if this is a dumb question. If I’m way out in left field, I’ll start a fresh thread so as not to further derail this one.

Thanks

···

On Tuesday, January 13, 2015 at 8:58:05 PM UTC-5, Matt Jones wrote:

David:

This will just run other checks, not perform system actions. The other checks are called using the sensu API. The OP, I believe, wants to restart a service or something to that effect. This would entail the handler kicking off a script to perform this action on the server that created the event.



#8

David - you’re right. The remediation handler triggers additional ‘checks’ for the box via API calls. Those ‘checks’ are just Ruby files (or any other language you want) so they can do anything you like - they just need to report a check status back to sensu-client when they’re finished. You could certainly write a check that shells out to start a process, then reports OK if it succeeded, or CRITICAL if it failed to start. This same pattern could be used for automatic disk cleanup, terminating inactive connections after reaching a certain threshold, emptying a message queue, and so on.
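
A minimal Python sketch of such a remediation check, assuming the usual Sensu exit codes (0 for OK, 2 for CRITICAL). The function name and the example command are hypothetical.

```python
import subprocess
import sys

OK, CRITICAL = 0, 2  # Sensu check exit codes


def run_remediation(command):
    """Run a remediation command and map its result to a Sensu status."""
    try:
        returncode = subprocess.call(command)
    except OSError as exc:
        # The command could not be started at all (e.g. binary not found).
        return CRITICAL, "CRITICAL: could not run %s: %s" % (command[0], exc)
    if returncode == 0:
        return OK, "OK: %s succeeded" % " ".join(command)
    return CRITICAL, "CRITICAL: %s exited with %d" % (" ".join(command), returncode)
```

The check script would then print the message and exit with the status, for example `status, message = run_remediation(["sudo", "sv", "start", "myapp"])` followed by `print(message)` and `sys.exit(status)`. Swapping the command is all it takes to reuse this for disk cleanup or similar tasks.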

Another way to handle this would be to write a custom check script that tries to directly resolve the issue before reporting results. For the process check example, write a check that does the following:

  1. Check to see if the process is running - if it is, return status OK.
  2. If the process is not running, send a shell command to start it; if it succeeds, return status WARNING.
  3. If the process could not be restarted, return CRITICAL.

Then set your occurrences for that check to 1, and tune your handlers to report differently for each state. For example, you could send an email notification for the WARNING state to let you know that a problem was automatically resolved; you could send a PagerDuty alert for the CRITICAL state telling you that the process is down and needs to be manually fixed.
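
The three steps above could be sketched in Python roughly like this. Using `pgrep -x` to detect the process is my assumption, as are the function names; the `running` and `run` parameters exist only so the logic can be exercised without a real process.

```python
import subprocess

OK, WARNING, CRITICAL = 0, 1, 2  # Sensu check exit codes


def is_running(process_name):
    """True if pgrep finds a process with exactly this name."""
    result = subprocess.call(["pgrep", "-x", process_name],
                             stdout=subprocess.DEVNULL)
    return result == 0


def self_healing_check(name, start_command, running=is_running,
                       run=subprocess.call):
    """Check a process, try one restart if it is down, and report."""
    if running(name):
        return OK, "OK: %s is running" % name
    # Step 2: the process is down, so try to start it once.
    if run(start_command) == 0:
        return WARNING, "WARNING: %s was down and has been restarted" % name
    # Step 3: the start command failed, so a human needs to look.
    return CRITICAL, "CRITICAL: %s is down and could not be restarted" % name
```

The script would print the message and exit with the status, which is what lets the handlers treat WARNING and CRITICAL differently.
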
···

On Tuesday, January 13, 2015 at 8:19:47 PM UTC-6, David Petzel wrote:

Sorry if I’m being dense here, but I’m not grokking the distinction. Isn’t that remediation plugin “performing system actions” by invoking the unpublished checks? Aren’t those additional “_remediation” checks just system actions, and couldn’t one of those restart the process in question?

Again, sorry if this is a dumb question. If I’m way out in left field, I’ll start a fresh thread so as not to further derail this one.

Thanks
