How to properly handle keepalive and AWS EC2 start/stop instances


#1

Hi,

We have some fairly large (and expensive) instances on AWS which we use only 6 to 10 hours each day to run some batch computation.
To save a bit on the bill, we are in the process of starting and stopping these instances around the times when we want the computation to be done.

These instances are monitored by Sensu and each of them runs the Sensu client, which sends a keepalive back to the server at a regular interval. If we just shut down the instances for several hours, keepalive warnings will pop up in our reporting system. How should I approach this?

Ideally, I would like to say something like: “I know this instance is supposed to run every day at midnight, so don’t produce any keepalive (or other) warnings between the moment I stop it and the next midnight”, but I’m not sure how to do that. By creating a stash via the Sensu API?
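
For illustration, the stash idea could look roughly like the sketch below. It only builds the JSON body for a `POST /stashes` call that expires at the next local midnight. The `silence/<client>` path is an assumption: it is the silencing convention that handlers based on the sensu-plugin library check, and handlers that do not check it will still alert. The client name `batch-01` and the `reason` text are hypothetical.

```python
import json
from datetime import datetime, timedelta


def seconds_until_next_midnight(now):
    """Return the whole seconds from `now` until the next local midnight."""
    next_midnight = (now + timedelta(days=1)).replace(
        hour=0, minute=0, second=0, microsecond=0)
    return int((next_midnight - now).total_seconds())


def silence_stash_payload(client_name, now):
    """Build the body for POST /stashes; the stash expires at next midnight."""
    return {
        # Assumption: handlers honor the "silence/<client>" stash convention.
        "path": "silence/" + client_name,
        "content": {"reason": "scheduled stop"},  # hypothetical free-form content
        "expire": seconds_until_next_midnight(now),
    }


if __name__ == "__main__":
    # Print the body that a stop script would send just before shutdown.
    print(json.dumps(silence_stash_payload("batch-01", datetime.now())))
```

Because the stash carries an `expire` value, nothing has to clean it up when the instance starts again at midnight.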

Do you have any other possible solution to this problem?

Thanks!

Jonathan


#2

This is similar to the logic used in an AWS decommission handler. You’ll want to check for stopped instances and remove the client from Sensu. When they come back up they will register again and everything will be happy.

For an overview of logic that I’m talking about, see here: http://www.ragedsyscoder.com/blog/2014/01/14/sensu-automated-decommission-of-clients/

-Bryan



#3

Write an init script that, upon graceful shutdown, makes an API call to pull the client from monitoring and create a stash. Then modify the keepalive handler to check for this stash before sending an alert.

The stash is used solely to prevent a race condition where a client would get added back in after it was pulled. I time it fairly close: we have the kill script kicked off at 70 and sensu-client gets shut down at 75, I believe.
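
As a rough sketch of the two API calls such a kill script makes, assuming the Sensu API is reachable on its default port 4567: the code below builds a `DELETE /clients/<name>` request and a `POST /stashes` request. The API address, the `decommission/<client>` stash path, the one-hour expiry and the function names are hypothetical choices for illustration, not the ones used in the actual scripts.

```python
import json
import urllib.parse
import urllib.request

API_URL = "http://sensu.example.com:4567"  # hypothetical Sensu API address


def build_decommission_requests(client_name, api_url=API_URL, expire=3600):
    """Build (without sending) the requests a shutdown script would issue."""
    quoted_name = urllib.parse.quote(client_name, safe="")
    # 1. Remove the client from monitoring.
    delete_client = urllib.request.Request(
        api_url + "/clients/" + quoted_name, method="DELETE")
    # 2. Record that the removal was deliberate, so the keepalive handler
    #    can ignore a late keepalive. The stash path is a hypothetical name.
    stash_body = json.dumps({
        "path": "decommission/" + client_name,
        "content": {"status": "remove"},
        "expire": expire,
    }).encode("utf-8")
    create_stash = urllib.request.Request(
        api_url + "/stashes", data=stash_body,
        headers={"Content-Type": "application/json"}, method="POST")
    return [delete_client, create_stash]


def send_all(requests, timeout=5):
    """Send each request in order; urlopen raises on an HTTP error status."""
    for request in requests:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            response.read()
```

An init script would call `send_all(build_decommission_requests(hostname))` before the Sensu client itself is stopped.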



#4

Hi Bryan,

Thanks for your answer! I had already come across your website when I was looking for an answer to this problem, but I was hoping for something less radical than removing the client from Sensu itself. This is going to happen fairly frequently (every day, for a couple of machines at first), and I was looking more for a way to tell Sensu: “it’s OK if this machine is off for the moment, don’t get all stressed about it”.

I’ll still keep an eye on your article as it may help me to program something!

Jonathan



#5

Hi Matty,

Thanks for your answer! Have you tried writing a solution where the client sends custom keepalives, like “I should be sending the next keepalive in 20 seconds”?

I’m not sure I understand the race condition you are talking about; would you care to explain?

Also, when you speak about 70 and 75, are you referring to the init script number which removes the stash and starts the Sensu client?

Thanks!

Jonathan



#6

I have not tried doing something like that.

The race condition would be something along the lines of the following, with the <number> being the init script number, which refers to the sequence it gets executed in.

SHUTDOWN ORDER:

<70> send remove host to sensu

<70> send api call to create stash saying client has been removed

<75> shutdown sensu client

The reason I have the stash is that after I send the call to remove the host <70>, the sensu-client still has time to send another keepalive. If that keepalive comes in, the client will get added back in and we would thus have an alert on a “phantom” host.

By creating the stash and modifying the keepalive handler, before an alert is sent the handler checks the stashes to see if the host has been given a “remove” status, and if so it doesn’t alert on it, since the host is being shut down gracefully.

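Here are the modified handlers: https://github.com/yieldbot/sensu-yieldbot-plugins/tree/master/handlers/other

This is what we use to send the call from the init script to the server: https://github.com/yieldbot/sensu-yieldbot-plugins/blob/master/plugins/sensu/sensu-socket-client.rb

Here is a copy of the kill script that we drop on machines: https://gist.github.com/mattyjones/79e018a65ee8d81a8d41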
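
Independently of the linked scripts, the stash check in a keepalive handler can be sketched as follows. This is a minimal illustration under assumptions, not the code from the repository above: the API address, the `decommission/<client>` stash path and the function names are hypothetical. It relies on Sensu passing the event as JSON on standard input and on `GET /stashes/<path>` answering 404 when no such stash exists.

```python
import json
import sys
import urllib.error
import urllib.request

API_URL = "http://localhost:4567"  # hypothetical Sensu API address


def stash_exists(path, api_url=API_URL, timeout=5):
    """Return True if GET /stashes/<path> succeeds, False on HTTP 404."""
    try:
        with urllib.request.urlopen(api_url + "/stashes/" + path,
                                    timeout=timeout):
            return True
    except urllib.error.HTTPError as error:
        if error.code == 404:
            return False
        raise


def should_alert(event, lookup=stash_exists):
    """Suppress keepalive alerts for clients that were removed on purpose."""
    is_keepalive = event["check"]["name"] == "keepalive"
    client_name = event["client"]["name"]
    # Hypothetical stash path written by the shutdown script.
    if is_keepalive and lookup("decommission/" + client_name):
        return False
    return True


def main():
    """Entry point for a handler script: read the event from stdin."""
    event = json.load(sys.stdin)
    if should_alert(event):
        print("ALERT: %s/%s" % (event["client"]["name"],
                                event["check"]["name"]))
```

A handler script would end with a call to `main()`. The `lookup` parameter exists only so the decision logic can be exercised without a running Sensu API.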


Matt Jones @DevopsMatt

Infrastructure Engineer - Yieldbot Inc.

Core Contributor - Sensu Plugins

Co-Organizer - Boston Infrastructure Coders

https://linkedin.com/in/mattyjones


#7

Hi Matt,

OK, I understand the race condition better now :)
I'll have a look at your scripts and see if I can do something with them. They seem to be nicely written and I can most probably learn a lot from them, thanks!

  Jonathan

On 12/13/2015 08:17 PM, Matt Jones wrote:

I have not tried doing something like that.

The race condition would be something along the lines of the following,
with the <number> being the init script number, which refers to the
sequence it gets executed in.

SHUTDOWN ORDER:

<70> send remove host to sensu
<70> send api call to create stash saying client has been removed
<75> shutdown sensu client

The reason I have the stash is that after I send the call to remove the
host <70>, the sensu-client still has time to send another keep-alive.
If that keep-alive comes in then the client will get added back in and
we would thus have an alert on a "phantom" host.

By creating the stash and modifying the keepalive handler, before an
alert is sent the handler checks the stashes to see if the host has been
given a "remove" status, and if so it doesn't alert on it, since the host
is being shut down gracefully.

Here are the modified handlers:
<https://github.com/yieldbot/sensu-yieldbot-plugins/tree/master/handlers/other>

This is what we use to send the call from the init script to the server:
<https://github.com/yieldbot/sensu-yieldbot-plugins/blob/master/plugins/sensu/sensu-socket-client.rb>

Here is a copy of the kill script that we drop on machines:
<https://gist.github.com/mattyjones/79e018a65ee8d81a8d41>
