Unexpected behavior of sensu-client in a redundant scenario


#1

Hi,

I am seeing some possibly unexpected behavior with the sensu client which may or may not be a bug and wanted to see if somebody here may be able to answer my query. If this is not the right place, I apologize in advance; please let me know what would be the right forum for posting this query and I will post it there.

I am trying to design and validate a redundant Sensu solution that will work across multiple data centers. I have already gone through the document on the Sensu website that talks about the different approaches:

https://sensuapp.org/docs/0.16/scaling_strategies

Given the project requirements and resources available, among other things, the approach I have settled on is to have two different sets of Sensu server/Rabbit/Redis instances, one in each data center (let’s call them sensu-1, rabbit-1, redis-1 and sensu-2, rabbit-2, redis-2 respectively) and have the clients point to a sensu/rabbit/redis hostname that resolves to sensu-1, rabbit-1 and redis-1 when everything is working in data center #1, and to sensu-2, rabbit-2 and redis-2 if anything goes wrong in data center #1. This is essentially an ACTIVE-PASSIVE solution, with the Sensu setup in data center #2 taking over whenever there is a problem in data center #1.

In order to test this, I am modifying the DNS entries of the sensu/rabbit/redis hostnames that the clients are using to point to the IP addresses of sensu-2, rabbit-2 and redis-2 (from those of sensu-1, rabbit-1 and redis-1, which is what they are pointing to initially). I would expect that once this change has been made, the clients would start using sensu-2, rabbit-2 and redis-2, and I should be able to see these clients as belonging to data center #2, instead of data center #1.

However, in my testing I have noticed that unless I restart the sensu-client on all the nodes being monitored, this change does not take effect. In other words, I can verify on the client system that the sensu/rabbit/redis hostnames are now resolving to sensu-2, rabbit-2 and redis-2 after making the changes to DNS, but the sensu client is still

  • Is this behavior expected or is it a bug? Do I need any additional configuration to make this work?
  • If this is expected behavior, how do I design a redundant solution that works across data centers that does not require restarting the clients in case of an outage?

Thanks in advance!

#2

DNS makes for a very poor service discovery mechanism. The majority of
clients across most languages will not periodically re-resolve hostnames.
I (personally) expect this behavior from everything except web
browsers and short-lived connections, and do not rely on DNS TTLs for
anything.
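
To illustrate the point above, here is a minimal Python sketch (not Sensu code) of a long-lived client that resolves its broker hostname only when it connects. The hostname `rabbit.example.internal`, the IP addresses, and the `fake_resolver` helper are assumptions made up for the demonstration; the real `socket.getaddrinfo` is only used as the default resolver.

```python
import socket


class LongLivedClient:
    """Mimics a daemon that resolves its broker hostname only at connect time."""

    def __init__(self, host, port, resolver=socket.getaddrinfo):
        self.host = host
        self.port = port
        self.resolver = resolver
        self.peer_ip = None

    def connect(self):
        # The hostname is resolved here and nowhere else.
        infos = self.resolver(self.host, self.port, type=socket.SOCK_STREAM)
        self.peer_ip = infos[0][4][0]
        return self.peer_ip


# Hypothetical DNS data standing in for a real zone.
records = {"rabbit.example.internal": "10.0.1.10"}


def fake_resolver(host, port, type=0):
    # Returns data in the same shape as socket.getaddrinfo.
    return [(socket.AF_INET, type, 6, "", (records[host], port))]


client = LongLivedClient("rabbit.example.internal", 5672, resolver=fake_resolver)
client.connect()

# Simulate the DNS failover to the second data center.
records["rabbit.example.internal"] = "10.0.2.10"

ip_before_restart = client.peer_ip   # still the old address
ip_after_restart = client.connect()  # a fresh connect picks up the new one
print(ip_before_restart, ip_after_restart)
```

The DNS change only becomes visible once `connect()` runs again, which is what a process restart forces.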

One way you could solve this is to set the reconnect_on_failure
setting to be "false", and use upstart or a supervisor to restart the
process when it dies. By restarting the process when things go wrong,
you give it a chance to re-resolve.
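
The "let it die and have a supervisor restart it" idea can be sketched roughly as follows. This is a toy stand-in for upstart or a real process supervisor, not something to run in production; the function name `supervise` and the `max_runs` limit are assumptions added so the example terminates.

```python
import subprocess
import sys
import time


def supervise(command, max_runs=3, delay=1.0):
    """Run command and start it again each time it exits.

    Every new process resolves its hostnames from scratch, which is
    the whole point of failing fast instead of reconnecting in-process.
    """
    exit_codes = []
    for _ in range(max_runs):
        finished = subprocess.run(command)
        exit_codes.append(finished.returncode)
        time.sleep(delay)  # small back-off before respawning
    return exit_codes


if __name__ == "__main__":
    # Stand-in child process that "fails" immediately with exit code 1.
    child = [sys.executable, "-c", "import sys; sys.exit(1)"]
    print(supervise(child, max_runs=2, delay=0.1))
```

A real supervisor would loop indefinitely and usually rate-limit respawns; the bounded loop here is only for illustration.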

Another path is to use load balancers and let them handle the routing
to healthy things. (You can still use the reconnect_on_failure trick
to get HA load balancers.)

Have you read this?
http://failshell.io/sensu/2013/05/08/high-availability-sensu/



#3

Hi Kyle,

Thanks for your response. Yes, I realize that the DNS solution is not ideal and is not my first choice either. However, I am limited by time and resources, so testing and validating the fully redundant solution will be a challenge for this project. I did read the document you referenced in your reply, but it’s probably overkill for what I am trying to achieve for now.

One quick question: where is the reconnect_on_failure setting available? I could not find any documentation on it. Can you provide a pointer/example?

Thanks!



#4

Ah, it is reconnect_on_error.

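For rabbitmq it is implemented here:
https://github.com/sensu/sensu/blob/9b64a7f8f6ca9aa23ad9667c6b41c09d630e05b1/lib/sensu/daemon.rb#L190

I've made a PR to document that option:
https://github.com/sensu/sensu-docs/pull/271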

For redis the docs are here:
https://sensuapp.org/docs/0.20/redis#anatomy-of-a-redis-definition
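
As a rough illustration of what such a configuration might look like, the Python sketch below renders a JSON fragment with `reconnect_on_error` set to false. The placement of the key under `transport` and `redis`, the hostnames, and the ports are assumptions and may differ between Sensu versions, so check the linked code and docs for the version you run. The helper name `failfast_config` is made up for this example.

```python
import json


def failfast_config(rabbit_host, redis_host):
    """Build a config fragment that asks Sensu to exit on connection errors.

    ASSUMPTION: key placement (transport.reconnect_on_error and
    redis.reconnect_on_error) is not confirmed by this thread.
    """
    return {
        "transport": {"name": "rabbitmq", "reconnect_on_error": False},
        "rabbitmq": {"host": rabbit_host, "port": 5672},
        "redis": {"host": redis_host, "port": 6379, "reconnect_on_error": False},
    }


# Hypothetical hostnames that DNS would repoint during a failover.
config = failfast_config("rabbit.example.internal", "redis.example.internal")
rendered = json.dumps(config, indent=2)
print(rendered)
```

Clients only talk to the transport, so the `redis` section would matter for the server and API processes rather than for sensu-client.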



#5

Hi Kyle,

Thanks for sending information about this option. From the code you referred to, it seems like this setting is available for the sensu client. Could you confirm that? If it is indeed for the client, this might help address something I saw in my testing.

I tried your suggestion and wrote a quick script that monitors the IP addresses to which the DNS names resolve. By default, these names resolve to sensu-1, rabbit-1 and redis-1. I then changed the DNS entries so that the names resolve to the IP addresses of sensu-2, rabbit-2 and redis-2, thus simulating a failover to the secondary Sensu setup, and had my script restart the Sensu clients when it detected that the names were now resolving to different IPs. This works, and clients now show up as belonging to data center #2. However, I still see (presumably stale) entries for the clients in data center #1 in the Uchiwa dashboard, and they stay there unless I manually delete them from the dashboard. I wonder if this behavior is because the clients are trying to reconnect on failure, and whether setting this option would fix it. I will try it and see what I find. In any case, this is a minor annoyance and I can probably live with it even if setting reconnect_on_error to false on the clients does not address it.
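
The poster's script is not shown in the thread. A minimal sketch of the approach described above might look like the following; the hostnames, the 30-second interval, and the `service sensu-client restart` command are all assumptions that depend on the environment.

```python
import socket
import subprocess
import time


def resolve_ips(host, resolver=socket.gethostbyname_ex):
    """Return the sorted IPv4 addresses the name currently resolves to."""
    _name, _aliases, addresses = resolver(host)
    return sorted(addresses)


def detect_changes(hosts, last_seen, resolver=socket.gethostbyname_ex):
    """Compare current DNS answers with last_seen and report changed hosts.

    last_seen is updated in place; the first poll only records a baseline.
    """
    changed = []
    for host in hosts:
        current = resolve_ips(host, resolver)
        if host in last_seen and last_seen[host] != current:
            changed.append(host)
        last_seen[host] = current
    return changed


def run_forever(hosts, interval=30,
                restart_command=("service", "sensu-client", "restart")):
    """Poll DNS and restart the client whenever any watched name moves.

    ASSUMPTION: restart_command is a placeholder for however the
    client is restarted on your system.
    """
    last_seen = {}
    while True:
        if detect_changes(hosts, last_seen):
            subprocess.run(list(restart_command), check=False)
        time.sleep(interval)
```

Calling `run_forever(["rabbit.example.internal"])` would start the watcher. Note that the local resolver cache and the record's TTL still bound how quickly a change is noticed.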

By the way, you also referred to the Redis documentation. Since the clients don’t talk directly to the Redis server, I would assume that I don’t need to worry about the connection between the Sensu server and Redis? Is there anything else I need to be aware of as far as Redis is concerned that might affect how things work in the redundancy setup I described above?

Thanks again for your help!

On Saturday, September 12, 2015 at 11:33:21 AM UTC-4, Kyle Anderson wrote:

Ah, it is reconnect_on_error.

For rabbitmq it is implemented here:
https://github.com/sensu/sensu/blob/9b64a7f8f6ca9aa23ad9667c6b41c09d630e05b1/lib/sensu/daemon.rb#L190

I’ve made a PR to document that option
https://github.com/sensu/sensu-docs/pull/271

For redis the docs are here:
https://sensuapp.org/docs/0.20/redis#anatomy-of-a-redis-definition



#6

> Hi Kyle,
>
> Thanks for sending information about this option. From the code you referred
> to, it seems like this setting is available for the sensu client. Could you
> confirm that? If it is indeed for the client, this might help address
> something I saw in my testing.

That function is called by the client, API, server, etc.

> I tried your suggestion and wrote a quick script that monitors the IP
> addresses to which the DNS names resolve. By default, these names resolve
> to sensu-1, rabbit-1 and redis-1. I then changed the DNS entries for these
> names and had them resolve to IP addresses of sensu-2, rabbit-2 and redis-2,
> thus simulating a failover to the secondary Sensu setup, and had my script
> restart the Sensu clients when it detected that DNS entries were now
> resolving to different IPs. This works, and clients now show up as belonging
> to datacenter #2. However, I still see (presumably stale) entries for the
> clients in datacenter #1 in the uchiwa dashboard, and they stay there unless
> I manually delete them from the dashboard. I wonder if this behavior is
> because the clients are trying to reconnect on failure and setting this
> option would fix it. I will try it and see what I find. In any case, this is
> a minor annoyance and I can probably live with it even if setting
> reconnect_on_error on the clients to false does not address it.

This is expected behavior. Clients that are failing keepalives will
stay there until they reconnect to that original sensu server
(technically rabbitmq) or "something" removes them.

The same thing would happen if you just shut down a client: it alerts
for keepalive and shows up red in the dashboard.
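
One candidate for that "something" is a small script that removes the stale client through the Sensu API's `DELETE /clients/:client` endpoint. The sketch below only builds and (optionally) sends the request; the API host `sensu-1.example.internal`, port 4567, and the client name are assumptions, and any authentication your API requires is left out.

```python
import urllib.parse
import urllib.request


def build_delete_client_request(api_base, client_name):
    """Build a DELETE request for one client entry on the Sensu API."""
    quoted = urllib.parse.quote(client_name, safe="")
    url = f"{api_base.rstrip('/')}/clients/{quoted}"
    return urllib.request.Request(url, method="DELETE")


def delete_client(api_base, client_name, timeout=10):
    """Send the request and return the HTTP status code."""
    request = build_delete_client_request(api_base, client_name)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


# Hypothetical API endpoint in the data center that holds the stale entry.
request = build_delete_client_request(
    "http://sensu-1.example.internal:4567", "web-01"
)
print(request.get_method(), request.full_url)
```

Running `delete_client(...)` against the old data center after a failover would clear the red keepalive entries without clicking through the dashboard.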

> By the way, you also referred to the Redis documentation...since the clients
> don't talk directly to the redis server, I would assume that I don't need to
> worry about the connection between Sensu server and Redis? Is there anything
> else I need to be aware of as far as Redis is concerned that might affect
> how things work in the redundancy setup I described above?

Yes. The same thing that can happen to clients resolving an IP for
rabbitmq can happen to the server too, when connecting to redis. If
the IP for your redis server changes during a failover, *something*
must restart your processes to force them to re-resolve, if you are
using DNS for discovery. (Or let them "fail fast" and make upstart
respawn them, or whatever.)
