I am seeing some possibly unexpected behavior with the sensu client which may or may not be a bug and wanted to see if somebody here may be able to answer my query. If this is not the right place, I apologize in advance; please let me know what would be the right forum for posting this query and I will post it there.
I am trying to design and validate a redundant Sensu solution that will work across multiple data centers. I have already gone through the document on the Sensu website that talks about the different approaches.
Given the project requirements and resources available, among other things, the approach I have settled on is to have two different sets of Sensu server/Rabbit/Redis instances, one in each data center (let’s call them sensu-1, rabbit-1, redis-1 and sensu-2, rabbit-2, redis-2 respectively) and have the clients point to a sensu/rabbit/redis hostname that resolves to sensu-1, rabbit-1 and redis-1 when everything is working in data center #1, and to sensu-2, rabbit-2 and redis-2 if anything goes wrong in data center #1. This is essentially an ACTIVE-PASSIVE solution, with the Sensu setup in data center #2 taking over whenever there is a problem in data center #1.
To test this, I am changing the DNS entries for the sensu/rabbit/redis hostnames that the clients use so that they point to the IP addresses of sensu-2, rabbit-2 and redis-2 instead of those of sensu-1, rabbit-1 and redis-1, which is where they point initially. I would expect that once this change has been made, the clients would start using sensu-2, rabbit-2 and redis-2, and that I would see these clients as belonging to data center #2 instead of data center #1.
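For anyone who wants to reproduce the DNS side of this test, here is a minimal Python sketch of the kind of check that can be run on a client node before and after the DNS change. It is not part of Sensu; the function names `resolve_ips` and `dns_moved` are made up for illustration, and the default port 5672 assumes RabbitMQ is listening on its standard AMQP port.

```python
import socket


def resolve_ips(hostname, port=5672):
    """Return the set of IP addresses the hostname currently resolves to.

    The port only matters for getaddrinfo's bookkeeping; 5672 is assumed
    here because it is RabbitMQ's default AMQP port.
    """
    infos = socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_TCP)
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the IP address for both IPv4 and IPv6.
    return {info[4][0] for info in infos}


def dns_moved(before, after):
    """True if none of the previously resolved addresses is still returned."""
    return bool(before) and before.isdisjoint(after)
```

Calling `resolve_ips` with the shared rabbit hostname once before and once after the DNS update, and passing both results to `dns_moved`, shows whether a fresh lookup on that node now returns the data center #2 address. It only reports what a new lookup returns; it says nothing about which address an already-open connection is using.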
However, in my testing I have noticed that this change does not take effect unless I restart the sensu-client on all the nodes being monitored. In other words, after making the changes to DNS I can verify on the client system that the sensu/rabbit/redis hostnames now resolve to sensu-2, rabbit-2 and redis-2, but the sensu client is still using the instances in data center #1.
- Is this behavior expected or is it a bug? Do I need any additional configuration to make this work?
- If this is expected behavior, how do I design a redundant solution that works across data centers and does not require restarting the clients in case of an outage?
Thanks in advance!