Sensu Go disaster recovery setup

My current Sensu infrastructure (on AWS) looks like this:

  • 3 backends
  • A 3-node external etcd cluster
  • The backends are behind an Elastic Load Balancer

I take etcd snapshots regularly, and I would like to launch a new 3-node etcd cluster from one of these backups to test disaster recovery. Is there a way to test and validate the cluster after restoration? I think connecting a backend to it might cause false or duplicate keepalive (or other) alarms, since the restored data already contains the config and metric data.
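For reference, a rough sketch of how the restore commands for a fresh test cluster could be generated, one per member. The member names, peer URLs, cluster token, and data directory are all hypothetical placeholders. It assumes `etcdctl snapshot restore` is available (newer etcd releases move this to `etcdutl`, and older ones may need `ETCDCTL_API=3`). A new cluster token is used so the restored members stay separate from the production cluster.

```python
import shlex


def restore_commands(snapshot, members, token, data_dir="/var/lib/etcd-restore"):
    """Build one `etcdctl snapshot restore` command line per new member.

    snapshot: path to the snapshot file (hypothetical path in the example)
    members:  {member_name: peer_url} for the NEW test cluster (hypothetical)
    token:    a fresh cluster token, different from production
    data_dir: where each member writes its restored data (an assumption)
    """
    # Every member must be told about the full membership of the new cluster.
    initial_cluster = ",".join(f"{name}={url}" for name, url in members.items())
    commands = []
    for name, peer_url in members.items():
        args = [
            "etcdctl", "snapshot", "restore", snapshot,
            "--name", name,
            "--initial-cluster", initial_cluster,
            "--initial-cluster-token", token,
            "--initial-advertise-peer-urls", peer_url,
            "--data-dir", data_dir,
        ]
        commands.append(shlex.join(args))
    return commands


if __name__ == "__main__":
    # Hypothetical three-node test cluster; run each line on its own node.
    test_members = {
        "dr-etcd-1": "http://10.0.0.1:2380",
        "dr-etcd-2": "http://10.0.0.2:2380",
        "dr-etcd-3": "http://10.0.0.3:2380",
    }
    for cmd in restore_commands("snapshot.db", test_members, "dr-test-cluster"):
        print(cmd)
```

Each printed command is meant to be run on the matching node before starting etcd there with the same name, peer URL, and data directory.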

I just want to make sure I understand. Your concern is that if you take an etcd cluster snapshot, restore it, and then point a sensu-backend at the restored etcd cluster, the first thing the sensu-backend will do is try to handle stale state and issue bogus alerts and the like.

So would it be acceptable to test only the integrity of the restored etcd snapshot, by connecting to it as an etcd client and validating specific etcd key/values?
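As a minimal sketch of that kind of client-side check, the script below compares two key dumps without involving a sensu-backend at all. It assumes each dump was produced with something like `etcdctl get <prefix> --prefix -w json` against the source and the restored cluster, and that Sensu's keys live under a prefix such as `/sensu.io/` (an assumption worth confirming against your own cluster). The file names and function names are hypothetical.

```python
import base64
import json
import sys


def decode_kvs(etcdctl_json):
    """Parse `etcdctl get ... -w json` output into {key: value_bytes}.

    etcdctl base64-encodes keys and values in its JSON output, and may
    omit "kvs" (or "value") entirely when there is nothing to report.
    """
    payload = json.loads(etcdctl_json)
    result = {}
    for kv in payload.get("kvs", []):
        key = base64.b64decode(kv["key"]).decode("utf-8", "replace")
        result[key] = base64.b64decode(kv.get("value", ""))
    return result


def diff_keyspaces(source, restored):
    """Report keys that are missing, extra, or changed in the restored dump."""
    shared = source.keys() & restored.keys()
    return {
        "missing": sorted(source.keys() - restored.keys()),
        "extra": sorted(restored.keys() - source.keys()),
        "changed": sorted(k for k in shared if source[k] != restored[k]),
    }


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage (hypothetical file names): python compare.py source.json restored.json
    with open(sys.argv[1]) as src, open(sys.argv[2]) as dst:
        report = diff_keyspaces(decode_kvs(src.read()), decode_kvs(dst.read()))
    for kind, keys in report.items():
        print(f"{kind}: {len(keys)}")
        for key in keys:
            print(f"  {key}")
```

Because a live cluster keeps writing keepalive and event data after the snapshot is taken, some differences are expected; the comparison is probably most useful when limited to configuration resources such as checks and handlers.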

Or does the sensu-backend need a special operational mode that disables certain aspects of its operation, so that pipeline elements like handlers don't fire?

This is actually a very good question: what is the expected/best/reasonable behavior for a backend cluster that has been offline for a period of time, long enough for all TTLs defined in the system to have been breached?