Best practices: update sensu backend with zero downtime

@jspaleta Which versions of etcd are supported? Is 3.4.9 supported, or only the newest 3.3.x?

To use Sensu with an external etcd cluster, you must have etcd 3.3.2 or newer. To stand up an external etcd cluster, follow etcd’s clustering guide using the same store configuration.

via: https://docs.sensu.io/sensu-go/latest/guides/clustering/#use-an-external-etcd-cluster

@raulgs what version are you using in your k8s deployment? I’m also trying to get Sensu up and running with an external etcd db but have failed so far … were you successful? It would be great to have a look at your deployment config.

we’re using etcd 3.4.9 deployed using helm chart https://hub.helm.sh/charts/bitnami/etcd
and sensu 5.20.2

sensu-backend start \
  --log-level=debug \
  --cache-dir=/var/cache/sensu/sensu-backend \
  --state-dir=/var/lib/sensu \
  --etcd-client-urls=http://sensu-etcd-0.sensu-etcd-headless.sensu-etcd.svc.cluster.local:2379,http://sensu-etcd-1.sensu-etcd-headless.sensu-etcd.svc.cluster.local:2379,http://sensu-etcd-2.sensu-etcd-headless.sensu-etcd.svc.cluster.local:2379 \
  --no-embed-etcd

… but pods fail after a few seconds …

{"component":"sensu-enterprise","error":"context deadline exceeded","level":"fatal","msg":"error executing sensu-backend","time":"2020-06-05T13:54:32Z"}

@seizste I am using the same bitnami chart.
I have enabled TLS between Backend and etcd now.

You can use the etcd service that will be created using the etcd chart.
These are my parameters. Remove the ca, key and cert lines if you don’t want to encrypt the connection to the etcd.

          "sensu-backend", "start",
          "--log-level=debug",
          "--no-embed-etcd",
          "--etcd-trusted-ca-file=/var/lib/sensu/etcd-certs/ca.crt",
          "--etcd-cert-file=/var/lib/sensu/etcd-certs/cert.pem",
          "--etcd-key-file=/var/lib/sensu/etcd-certs/key.pem",
          "--etcd-client-urls", "https://etcd.sensu-system.svc:2379",

I will share the whole setup next week. First I want to migrate my DEV stage to ensure that everything works properly.

The Sensu Go docs indicate that etcd 3.3.2 or newer is required for an external etcd cluster.
https://docs.sensu.io/sensu-go/latest/guides/clustering/#use-an-external-etcd-cluster

Hi @raulgs,

FYI, I’m working with @seizste on setting up Sensu on a Kubernetes cluster with external etcd.

We finally managed to deploy a TLS-enabled etcd cluster with 3 members and we are currently trying to set up Sensu to use this cluster. By starting Sensu with the following command…

sensu-backend start \
  --log-level=debug \
  --no-embed-etcd \
  --etcd-client-urls=https://etcd-release:2379 \
  --etcd-cert-file=/var/lib/sensu/etcd/cert.pem \
  --etcd-key-file=/var/lib/sensu/etcd/key.pem \
  --etcd-trusted-ca-file=/var/lib/sensu/etcd/ca.pem

… we can see the Sensu processes starting but, at some point, we see some weird error messages in the log related to some “passthrough” for etcd:

waiting for backend to become available before running backend-init...
{"component":"backend","level":"info","msg":"dialing etcd server","time":"2020-06-05T22:34:56Z"}
{"component":"backend","level":"debug","msg":"Registering backend...","time":"2020-06-05T22:34:56Z"}
{"component":"backend","level":"debug","msg":"Done registering backend.","time":"2020-06-05T22:34:56Z"}
{"component":"backend","entity":{"entity_class":"backend","system":{"hostname":"tmp-sensu-65dd85cf5d-wvv7n","os":"linux","platform":"alpine","platform_family":"alpine","platform_version":"3.8.5","network":{"interfaces":[{"name":"lo"
...
running backend init...
{"component":"metricsd","level":"info","msg":"metricsd running","time":"2020-06-05T22:34:58Z"}
...
{"level":"warn","ts":"2020-06-05T22:35:02.741Z","caller":"clientv3/retry_interceptor.go:61","msg":"retrying of unary invoker failed","target":"passthrough:///http://localhost:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest connection error: connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused\""}
{"component":"sensu-enterprise","error":"context deadline exceeded","level":"fatal","msg":"error executing sensu-backend","time":"2020-06-05T22:35:02Z"}
...

Interestingly, even with these messages, Sensu is listening to all the ports it should:

/ # netstat -an | grep LISTEN
tcp        0      0 127.0.0.1:6060          0.0.0.0:*               LISTEN
tcp        0      0 :::8080                 :::*                    LISTEN
tcp        0      0 :::8081                 :::*                    LISTEN
tcp        0      0 :::3000                 :::*                    LISTEN

My question: are you seeing these messages as well in your setup? Or are we missing some etcd parameters?

Thanks in advance for your answer!

Hi @foofoo_2, according to your logs the etcd client URL has not been passed / set correctly.
This is visible in the warning log line: Sensu is trying to connect to localhost:2379 instead of your etcd service.
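If it helps, here is a small sketch for pulling the dialed endpoint out of the backend logs. The function name `etcd_targets` is made up for this example, the sample line is a shortened copy of the warning posted above, and it assumes the etcd client keeps logging a `"target"` field in this JSON format.

```shell
# Print the distinct "target" values found in etcd client log lines on stdin.
etcd_targets() {
  sed -n 's/.*"target":"\([^"]*\)".*/\1/p' | sort -u
}

# Shortened copy of the warning from the post above, used as sample input.
sample='{"level":"warn","msg":"retrying of unary invoker failed","target":"passthrough:///http://localhost:2379","attempt":0}'

printf '%s\n' "$sample" | etcd_targets
```

On a live pod the same filter could be fed from `kubectl logs`. If the target is localhost although you passed a remote URL, some process in the container was started without the etcd client flags.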

Is there perhaps a typo in the URL that you used in the config?
Using the bitnami helm chart I get the following 2 services built:

service/etcd                            ClusterIP   10.101.35.151   <none>        2379/TCP,2380/TCP
service/etcd-headless                   ClusterIP   None            <none>        2379/TCP,2380/TCP

Where is etcd-release coming from?
Also double check whether etcd and sensu-backend are in the same namespace. If they are not, you will have to add the namespace to the URL as well.
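For the cross-namespace case, the URL follows the usual Kubernetes service DNS pattern. A minimal sketch, assuming the cluster uses the default DNS domain `cluster.local`; the helper name `etcd_client_url` is hypothetical, and the service and namespace names are just the ones that appear in this thread:

```shell
# Build the client URL for an etcd Service that lives in another namespace.
# Assumption: the cluster DNS domain is the default "cluster.local".
etcd_client_url() {
  service="$1"
  namespace="$2"
  port="${3:-2379}"   # etcd client port, 2379 unless overridden
  printf 'https://%s.%s.svc.cluster.local:%s\n' "$service" "$namespace" "$port"
}

etcd_client_url etcd sensu-etcd
```

The result is what you would pass to `--etcd-client-urls`. Inside the same namespace the short service name is enough.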

Hope this helps

Hi @raulgs,

Thank you very much for your answer.

As you can see in the first two log entries I posted in my previous message, registration to etcd seems to work as expected:

waiting for backend to become available before running backend-init...
{"component":"backend","level":"info","msg":"dialing etcd server","time":"2020-06-05T22:34:56Z"}
{"component":"backend","level":"debug","msg":"Registering backend...","time":"2020-06-05T22:34:56Z"}
{"component":"backend","level":"debug","msg":"Done registering backend.","time":"2020-06-05T22:34:56Z"}

etcd-release is an etcd deployment done via helm chart in the same namespace as the Sensu one:

NAME                    TYPE        CLUSTER-IP    EXTERNAL-IP   PORT(S)             AGE
etcd-release            ClusterIP   10.0.91.112   <none>        2379/TCP,2380/TCP   2d8h
etcd-release-headless   ClusterIP   None          <none>        2379/TCP,2380/TCP   2d8h

And it is reachable under the URL configured in the startup command line:

# curl -v https://etcd-release:2379
*   Trying 10.0.91.112:2379...
* TCP_NODELAY set
* Connected to etcd-release (10.0.91.112) port 2379 (#0)
...

I was just wondering whether you also had these warnings/errors in your logs and to what extent they are relevant.

I see, so you are able to log in to Sensu and everything is working?
I am not getting these errors in my logs.

init container logs:

{"component":"backend.seeds","level":"info","msg":"seeding etcd store with intial data","time":"2020-06-08T05:54:22Z"}

backend container logs:

{"component":"backend","level":"info","msg":"dialing etcd server","time":"2020-06-08T05:54:23Z"}
{"component":"backend","level":"debug","msg":"Registering backend...","time":"2020-06-08T05:54:23Z"}
{"component":"backend","level":"debug","msg":"Done registering backend.","time":"2020-06-08T05:54:23Z"}
{"component":"backend","entity":{"entity_class":"backend","system":{"hostname":"sensu-backend-0","os":"linux","platform":"alpine","platform_family":"alpine","platform_version":"3.11.2","network":{"interfaces":[{"name":"lo","addresses":["127.0.0.1/8"]},{"name":"tunl0","addresses":null},{"name":"eth0","mac":"ae:bf:a7:1e:33:a0","addresses":["10.101.13.245/32"]}]},"arch":"amd64","libc_type":"musl","vm_system":"docker","vm_role":"guest","cloud_provider":"","processes":null},"subscriptions":null,"last_seen":0,"deregister":false,"deregistration":{},"metadata":{"name":"sensu-backend-0"},"sensu_agent_version":""},"level":"info","msg":"backend entity information","time":"2020-06-08T05:54:23Z"}
{"cache":"/var/cache/sensu/sensu-backend","component":"asset-manager","level":"debug","msg":"initializing cache directory","time":"2020-06-08T05:54:23Z"}
{"cache":"/var/cache/sensu/sensu-backend","component":"asset-manager","level":"debug","msg":"done initializing cache directory","time":"2020-06-08T05:54:23Z"}
{"component":"agentd","level":"info","msg":"starting agentd on address: [::]:8081","time":"2020-06-08T05:54:23Z"}
{"component":"store","key":"/sensu.io/checks/","level":"debug","msg":"starting a watcher","time":"2020-06-08T05:54:23Z"}
{"component":"apid","level":"info","msg":"starting apid on address: [::]:8080","time":"2020-06-08T05:54:23Z"}
{"component":"tessend","level":"info","msg":"tessen is opted in, enabling tessen.. thank you so much for your support 💚","opt-out":false,"time":"2020-06-08T05:54:23Z"}

Do these errors only appear during the initialization?

Thanks for your answer @raulgs. We will now try to deploy etcd (with TLS enabled) and Sensu on our main cluster and see if these messages are only visible during initialization. We will post the results of our tests afterwards.

(Until now, I only tested on one of my dev clusters)

Hi @raulgs,

So, quick update regarding our setup: after having successfully set up a TLS-enabled etcd, we tried to start the Sensu backend using the following command:

sensu-backend start \
  --log-level=debug \
  --etcd-client-urls=https://etcd.sensu-etcd.svc.cluster.local:2379 \
  --etcd-cert-file=/etc/sensu/tls/cert.pem \
  --etcd-key-file=/etc/sensu/tls/key.pem \
  --etcd-trusted-ca-file=/etc/sensu/tls/ca.pem

We could then successfully reach the web UI, but unfortunately we were not able to log in using the credentials provided as environment variables. To fix that, we had to manually run the init command as follows:

sensu-backend init \
  --etcd-client-urls=https://etcd.sensu-etcd.svc.cluster.local:2379 \
  --etcd-cert-file=/etc/sensu/tls/cert.pem \
  --etcd-key-file=/etc/sensu/tls/key.pem \
  --etcd-trusted-ca-file=/etc/sensu/tls/ca.pem

Which produced the following output:

{"component":"backend.seeds","level":"info","msg":"seeding etcd store with intial data","time":"2020-06-08T16:23:30Z"}

After that, re-issuing the start command provided above allowed us to successfully log in to the web UI using the configured credentials.
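Put together, the sequence above is "init with the etcd flags, then start with the same etcd flags". A rough sketch of that as a script, using only the flags from the two commands above. `SENSU_BIN` and `init_then_start` are hypothetical names added so the sequence can be dry-run with a stub. Note that this version stops on any non-zero exit code from init, which may be stricter than you want on repeated runs.

```shell
# etcd flags shared by `init` and `start`, copied from the commands above.
ETCD_FLAGS="--etcd-client-urls=https://etcd.sensu-etcd.svc.cluster.local:2379 \
--etcd-cert-file=/etc/sensu/tls/cert.pem \
--etcd-key-file=/etc/sensu/tls/key.pem \
--etcd-trusted-ca-file=/etc/sensu/tls/ca.pem"

# Hypothetical override: set SENSU_BIN=echo to print the commands instead of running them.
SENSU_BIN="${SENSU_BIN:-sensu-backend}"

init_then_start() {
  # $ETCD_FLAGS is left unquoted on purpose so each flag becomes its own argument.
  $SENSU_BIN init $ETCD_FLAGS || return $?
  $SENSU_BIN start --log-level=debug $ETCD_FLAGS
}
```

Calling `init_then_start` with `SENSU_BIN=echo` prints both commands, which is a quick way to confirm that init and start point at the same etcd endpoint.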

Does this process sound correct to you or are we missing something here?

Hi @foofoo_2

Great to hear that TLS to etcd is working for you now.
Regarding the init: I am using an init container that handles that for me.
I have attached the manifest:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: sensu-backend
  namespace: sensu-system
spec:
  selector:
    matchLabels:
      app: sensu
  serviceName: sensu
  replicas: 2
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        app: sensu
    spec:
      volumes:
      - name: etcd-certs
        secret:
          defaultMode: 420
          optional: false
          secretName: etcd-client-certs
      - name: sensu-backend-certs
        secret:
          defaultMode: 420
          optional: false
          secretName: backend-certs
      initContainers:
      - name: sensu-backend-init
        image: sensu/sensu:5.20.2
        command:
        - /bin/sh
        - -c
        - |
          sensu-backend init \
          --etcd-trusted-ca-file=/var/lib/sensu/etcd-certs/ca.crt \
          --etcd-cert-file=/var/lib/sensu/etcd-certs/cert.pem \
          --etcd-key-file=/var/lib/sensu/etcd-certs/key.pem \
          --etcd-client-urls=https://etcd.sensu-system.svc:2379
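          # Fail the init container only when init exits with code 1;
          # any other exit code is treated as success.
          rc=$?
          [ "$rc" -eq 1 ] && exit 1
          exit 0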
        env:
        - name: SENSU_BACKEND_CLUSTER_ADMIN_USERNAME
          valueFrom:
            secretKeyRef:
              name: backend-secret
              key: user
        - name: SENSU_BACKEND_CLUSTER_ADMIN_PASSWORD
          valueFrom:
            secretKeyRef:
              name: backend-secret
              key: password
        volumeMounts:
        - name: etcd-certs
          mountPath: /var/lib/sensu/etcd-certs
      containers:
      - name: sensu-backend
        image: sensu/sensu:5.20.2
        command: [
          "sensu-backend", "start",
          "--log-level=debug",
          "--no-embed-etcd",
          "--etcd-trusted-ca-file=/var/lib/sensu/etcd-certs/ca.crt",
          "--etcd-cert-file=/var/lib/sensu/etcd-certs/cert.pem",
          "--etcd-key-file=/var/lib/sensu/etcd-certs/key.pem",
          "--etcd-client-urls=https://etcd.sensu-system.svc:2379",
          "--cert-file=/var/lib/sensu/certs/backend.pem",
          "--key-file=/var/lib/sensu/certs/backend-key.pem",
          "--trusted-ca-file=/var/lib/sensu/certs/ca.pem",
          "--insecure-skip-tls-verify=true"
        ]
        readinessProbe:
          exec:
            command: ["/usr/bin/nc","-z","127.0.0.1","8080"]
          failureThreshold: 3
          initialDelaySeconds: 5
          periodSeconds: 5
        env:
        - name: POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: SENSU_BACKEND_CLUSTER_ADMIN_USERNAME
          valueFrom:
            secretKeyRef:
              name: backend-secret
              key: user
        - name: SENSU_BACKEND_CLUSTER_ADMIN_PASSWORD
          valueFrom:
            secretKeyRef:
              name: backend-secret
              key: password
        ports:
        - protocol: TCP
          containerPort: 8080
        - protocol: TCP
          containerPort: 8081
        - protocol: TCP
          containerPort: 3000
        volumeMounts:
        - name: etcd-certs
          mountPath: /var/lib/sensu/etcd-certs
        - name: sensu-backend-certs
          mountPath: /var/lib/sensu/certs
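
The exit-code check at the end of the init container script is the part that lets the pod restart without the init step blocking it. Here is the same rule as a standalone sketch in plain POSIX sh. The wrapper name `run_init` is made up for this example, and "only exit code 1 is fatal" is simply the rule from the manifest, not something taken from the Sensu docs.

```shell
# Run the given command and fail only when it exits with code 1.
# Any other exit code is reported as success, as in the init container above.
run_init() {
  rc=0
  "$@" || rc=$?
  if [ "$rc" -eq 1 ]; then
    echo "init failed with exit code 1" >&2
    return 1
  fi
  return 0
}
```

In the init container this would wrap the real call, for example `run_init sensu-backend init` followed by the etcd flags from the manifest.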

Hello @raulgs,

Thanks for sharing your configuration, we adapted it to our environment and initialization works like a charm!

Cheers

Happy to hear it works for you :+1:

@raulgs … we’re facing a new issue now and wanted to know if you have experienced the same: after about 24 hours our etcd cluster fails because one member suddenly disconnects, and after we delete the pod it fails to join the cluster again …

==> Bash debug is on
==> Detected data from previous deployments...
https://etcd-0.etcd-headless.sensu-etcd.svc.cluster.local:2379 is healthy: successfully committed proposal: took = 13.220009ms
https://etcd-2.etcd-headless.sensu-etcd.svc.cluster.local:2379 is healthy: successfully committed proposal: took = 12.087714ms
grep: /bitnami/etcd/member_removal.log: No such file or directory
==> Updating member in existing cluster...
Error: bad member ID arg (strconv.ParseUint: parsing "": invalid syntax), expecting ID in Hex

Any idea what this could be? Btw, this usually happens when a Kubernetes worker node is rebooted …
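
The `parsing ""` part of the error suggests that the member update was called with an empty member ID. I have not checked how the chart's script obtains the ID, so the following is only a sketch of one way to look it up and guard against the empty case. It assumes the default comma-separated output of `etcdctl member list`; the function name `member_id` and the IDs in the sample are made up.

```shell
# Print the ID of the member named "$1" from `etcdctl member list` output on stdin.
# Assumes the default "simple" format: ID, status, name, peer URLs, client URLs, ...
member_id() {
  awk -F', ' -v name="$1" '$3 == name { print $1 }'
}

# Illustrative sample only: the IDs are invented, the host names follow the log above.
sample='8211f1d0f64f3269, started, etcd-0, https://etcd-0.etcd-headless.sensu-etcd.svc.cluster.local:2380, https://etcd-0.etcd-headless.sensu-etcd.svc.cluster.local:2379, false
91bc3c398fb3c146, started, etcd-2, https://etcd-2.etcd-headless.sensu-etcd.svc.cluster.local:2380, https://etcd-2.etcd-headless.sensu-etcd.svc.cluster.local:2379, false'

id=$(printf '%s\n' "$sample" | member_id etcd-1)
if [ -z "$id" ]; then
  # An empty ID is what produces "bad member ID arg" further down the line.
  echo "no member ID found for etcd-1, skipping member update" >&2
fi
```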

Not really; my etcd clusters have been running properly so far. It may be better to take that question to bitnami or the etcd developers.

@seizste have you been able to figure out the root cause and also to fix the issue?

Hello @raulgs,

I found an issue on their GitHub site that could be the source of the problem, but I haven’t had the time to verify the fix yet … https://github.com/bitnami/charts/issues/1908 … looks like an issue with the pre-stop hook script …

Ah yeah, you are right… I have raised a pull request. The etcd initial cluster state needs to be changed to existing. That should fix the issue :wink:

@raulgs … saw your PR, does this mean we only need to change …

ETCD_INITIAL_CLUSTER_STATE: new >>> existing

… and would this even work on an already deployed cluster, to get it production ready?