Best practices: update sensu backend with zero downtime

The PR has just been merged. Should work now.
Not sure about calling it prod-ready - but I can assure you that we use it in prod :wink:

@raulgs that's great news. Would it be possible for you to share your Helm config for etcd? We're using Ansible to deploy things, but it should look familiar to whatever you use to deploy …

- name: etcd | environment variables config map
  kubernetes.core.k8s:    # module name assumed; the nesting was lost in the paste
    state: present
    definition:
      apiVersion: v1
      kind: ConfigMap
      metadata:
        name: etcd-env-vars
        namespace: "{{ etcd_namespace }}"
      data:
        ETCD_AUTO_COMPACTION_MODE: "revision"
        ETCD_ENABLE_V2: "true"
    kubeconfig: "{{ kube_config_path }}"

- name: etcd | deploy helm chart
  kubernetes.core.helm:    # module name assumed; key nesting reconstructed from the bitnami chart's values layout
    name: etcd
    chart_ref: bitnami/etcd
    chart_version: 4.8.2
    release_namespace: "{{ etcd_namespace }}"
    values:
      image:
        tag: "3.4.9"
        debug: true
      envVarsConfigMap: "etcd-env-vars"
      auth:
        rbac:
          enabled: true
          allowNoneAuthentication: false
        peer:
          secureTransport: true
          enableAuthentication: true
          existingSecret: "etcd-peer-certs"
        client:
          secureTransport: true
          enableAuthentication: true
          existingSecret: "etcd-client-certs"
      statefulset:
        replicaCount: 3
    update_repo_cache: true
    kubeconfig: "{{ kube_config_path }}"

… so I would just need to add etcd.initialClusterState: “existing”? Or did you also modify your script?
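(If I read the bitnami chart's values layout right, that would presumably go into the values file as a fragment like this - just my guess at the key path:)

```yaml
etcd:
  initialClusterState: "existing"
```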

I will publish the agent, backend, and etcd stuff in a public repo by the end of the week.

However, to let you know in advance:
We use GitLab Runner CI to deploy the different applications.
etcd, backend, and agent are steps of the pipeline - something like this:

- helm upgrade --install --wait -f etcd/values/${STAGE}.yaml etcd bitnami/etcd
- helm upgrade --install --wait -f backend/values/${STAGE}.yaml backend backend/.
- helm upgrade --install --wait -f agent/values/${STAGE}.yaml agent agent/.
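Sketched as a .gitlab-ci.yml job, those three steps could look roughly like this (job name, stage, and the repo-add lines are made up here, not my real config):

```yaml
deploy:
  stage: deploy
  script:
    - helm repo add bitnami https://charts.bitnami.com/bitnami
    - helm repo update
    - helm upgrade --install --wait -f etcd/values/${STAGE}.yaml etcd bitnami/etcd
    - helm upgrade --install --wait -f backend/values/${STAGE}.yaml backend backend/.
    - helm upgrade --install --wait -f agent/values/${STAGE}.yaml agent agent/.
```

The --wait flag makes each step block until the release is rolled out, so etcd is up before the backend deploys.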

@seizste here you go


Have you ever tried this mechanism with Sensu Go 6.x? If I use the same chart and only change the Sensu container version, the individual Sensu backend containers don't seem to be able to communicate with each other …

Yes, it works fine for me. I am not sure whether the backends are talking to each other, but so far I have not seen any issues. Also, while updating, when one of the 2 pods gets terminated, the service is still available.

… my current problem is that if I do an upgrade or a fresh deployment, the init container runs successfully, but the main Sensu Go 6.x container fails to start because the sensu-backend process can't connect to itself …

From what I read in the release notes, they changed the hostname for the container from localhost to its “real” name. I'm not sure if that needs additional command-line options, or if I'm missing something in my certificates' subject alternative names. @raulgs do you have all the short hostnames of the Sensu containers added as SANs in your certificates? Or what do you have configured for etcd-initial-advertise-peer-urls?

Hi @seizste, what exact error is the container logging when it fails to start?

I am not sure if you have seen it already, but I have shared the helm charts that I have created here:

Regarding the cert, I used the following commands to generate it:

#Generate CA
echo '{"CN":"Sensu INFS CA","key":{"algo":"rsa","size":4096}}' | cfssl gencert -initca - | cfssljson -bare ca -
echo '{"signing":{"default":{"expiry":"876000h","usages":["signing","key encipherment","client auth"]},"profiles":{"backend":{"usages":["signing","key encipherment","server auth"],"expiry":"876000h"},"agent":{"usages":["signing","key encipherment","client auth"],"expiry":"876000h"}}}}' > ca-config.json

#Generate Cert
export ADDRESS=localhost,*.sensu,*.sensu.sensu-system,*.sensu.sensu-system.svc,*.sensu-system,*.sensu-system.svc
export NAME=backend
echo '{"CN":"'$NAME'","hosts":[""],"key":{"algo":"rsa","size":4096}}' | cfssl gencert -config=ca-config.json -profile="backend" -ca=ca.pem -ca-key=ca-key.pem -hostname="$ADDRESS" - | cfssljson -bare $NAME

So as you can see, I used wildcards instead of fixed host names. That made it easier to replace them without having to update the cert every time.
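One caveat with the wildcard list: a TLS wildcard only matches a single DNS label, so a bare short name like sensu-backend-0 is not covered by any of the patterns above. A small Python sketch of the matching rule (the SAN list is copied from the ADDRESS variable; the hostnames are just examples):

```python
# SANs from the ADDRESS variable used with cfssl above
SANS = ["localhost", "*.sensu", "*.sensu.sensu-system",
        "*.sensu.sensu-system.svc", "*.sensu-system", "*.sensu-system.svc"]

def covered(hostname: str) -> bool:
    """True if hostname matches one of the SAN patterns, label by label.

    Mirrors the TLS rule that "*" stands in for exactly one DNS label.
    """
    host_labels = hostname.split(".")
    for pattern in SANS:
        pattern_labels = pattern.split(".")
        if len(pattern_labels) != len(host_labels):
            continue  # a wildcard never spans more or fewer labels
        if all(p in ("*", h) for p, h in zip(pattern_labels, host_labels)):
            return True
    return False

print(covered("sensu-backend-0.sensu"))  # matches *.sensu: covered
print(covered("sensu-backend-0"))        # bare pod name: NOT covered
```

So if anything dials the plain pod name instead of a service DNS name, the cert will not validate against these SANs.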

Hope this helps

… we're also using wildcards, but we're currently not using all the combinations and localhost as you do.
The error we currently have is “waiting for sensu-backend process, trying to connect to :2379 …” … I'll try to recreate the certificates and see if that helps.

Still getting the following error message in the pod logs …

== waiting for sensu-backend-0:2379 to become available before running backend-init…

… but somehow the backend is active, and if I run “sensuctl cluster health” it shows me that all nodes are healthy … this is really weird. @raulgs did you check whether you also see these log messages?

Sorry for my late response.
I was seeing the log entry as well.
It is caused by a badly written entrypoint script - I have opened an issue about it.

In the meantime you can use a workaround to get rid of it.
I have added it to my repo

Thanks for sharing :heart: