Sensu Go 6.0.0 upgrade issue

Hi,

I’m following the instructions, but I’m having issues with step 3 upgrading sensu-backend.

? Do you really want to upgrade your Sensu 5.x database to 6.x? This operation cannot be undone; make sure you back up your database! Yes

{"component":"store","level":"warning","msg":"migrating etcd database to a new version","time":"2020-08-12T10:55:46+02:00"}

{"component":"store","database_version":1,"level":"error","msg":"error upgrading database","time":"2020-08-12T10:55:46+02:00"}

{"component":"sensu-enterprise","error":"the namespace production does not exist","level":"fatal","msg":"error executing sensu-backend","time":"2020-08-12T10:55:46+02:00"}

I don’t have a namespace called production, so I’m not sure where this is coming from.

sensuctl namespace list

Name
──────
dev
prod

Sensu version before upgrade: 5.21.0
OS: CentOS Linux release 7.8.2003 (Core)

Hi @TakeTwo

I have a feeling there are some unexpected keys in etcd that refer to a production namespace, maybe one that you previously deleted?

Just to verify that, could you try to install etcdctl on the backend machine and list all keys with production in their path (you might have to adjust some flags if you configured etcd with TLS authentication):

etcdctl get /sensu.io --prefix --keys-only | grep default

If some keys are returned, you might have to manually delete them, using something like this:

etcdctl del /sensu.io/path/to/key

@TakeTwo
Just a quick command fix up… grep for production instead of default

etcdctl get /sensu.io --prefix --keys-only | grep production
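A plain grep for production also matches keys where the word only appears inside an entity or check name. As a small sketch, the filter can be anchored to a whole path segment. The helper name keys_in_namespace and the sample keys are made up for illustration; they are not part of etcdctl or Sensu.

```shell
# Hypothetical helper: keep only keys that contain NS as a complete path segment.
# Reads key names on stdin, one per line, in the form printed by
# `etcdctl get /sensu.io --prefix --keys-only`.
keys_in_namespace() {
  ns="$1"
  # Assumes the namespace name contains no regex metacharacters.
  grep -E "/${ns}(/|\$)"
}

# Example with invented sample keys instead of a live etcd:
printf '%s\n' \
  /sensu.io/entities/production/web01 \
  /sensu.io/checks/dev/production-smoke-test \
  /sensu.io/namespaces/production |
  keys_in_namespace production
```

With a live cluster, the same function would sit at the end of the etcdctl pipeline in place of the bare grep.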

So a little bit more on this. I’m not sure how to get into this situation. I just deleted a namespace, made sure there were still resource keys in the namespace and then did the upgrade. I wasn’t able to reproduce the error starting from 5.21.1.

Here’s what I did step by step.

  1. ensure namespace action_CICD exists
  2. populate role, rolebindings and a check in the namespace
etcdctl get /sensu.io --prefix --keys-only | grep action_CICD
/sensu.io/api/internal/metricsd/v1/metrics/action_CICD/entity_gauges
/sensu.io/api/internal/metricsd/v1/metrics/action_CICD/event_gauges
/sensu.io/api/internal/metricsd/v1/metrics/action_CICD/keepalive_gauges
/sensu.io/checks/action_CICD/test
/sensu.io/namespaces/action_CICD
/sensu.io/rbac/rolebindings/action_CICD/namespace-admins
/sensu.io/rbac/rolebindings/action_CICD/namespace-operators
/sensu.io/rbac/roles/action_CICD/namespace-admin
/sensu.io/rbac/roles/action_CICD/namespace-operator
  3. delete the namespace
  4. resource related keys are still in place
/sensu.io/api/internal/metricsd/v1/metrics/action_CICD/entity_gauges
/sensu.io/api/internal/metricsd/v1/metrics/action_CICD/event_gauges
/sensu.io/api/internal/metricsd/v1/metrics/action_CICD/keepalive_gauges
/sensu.io/checks/action_CICD/test
/sensu.io/rbac/rolebindings/action_CICD/namespace-admins
/sensu.io/rbac/rolebindings/action_CICD/namespace-operators
/sensu.io/rbac/roles/action_CICD/namespace-admin
/sensu.io/rbac/roles/action_CICD/namespace-operator
  5. did the upgrade, no problem
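
Comparing the two key listings in the steps above shows what the namespace deletion actually removed. A quick sketch using comm, with the key lists copied from those steps and arbitrary temp files:

```shell
# Save the two key listings from the steps above to temp files.
before="$(mktemp)"
after="$(mktemp)"

cat > "$before" <<'EOF'
/sensu.io/api/internal/metricsd/v1/metrics/action_CICD/entity_gauges
/sensu.io/api/internal/metricsd/v1/metrics/action_CICD/event_gauges
/sensu.io/api/internal/metricsd/v1/metrics/action_CICD/keepalive_gauges
/sensu.io/checks/action_CICD/test
/sensu.io/namespaces/action_CICD
/sensu.io/rbac/rolebindings/action_CICD/namespace-admins
/sensu.io/rbac/rolebindings/action_CICD/namespace-operators
/sensu.io/rbac/roles/action_CICD/namespace-admin
/sensu.io/rbac/roles/action_CICD/namespace-operator
EOF

cat > "$after" <<'EOF'
/sensu.io/api/internal/metricsd/v1/metrics/action_CICD/entity_gauges
/sensu.io/api/internal/metricsd/v1/metrics/action_CICD/event_gauges
/sensu.io/api/internal/metricsd/v1/metrics/action_CICD/keepalive_gauges
/sensu.io/checks/action_CICD/test
/sensu.io/rbac/rolebindings/action_CICD/namespace-admins
/sensu.io/rbac/rolebindings/action_CICD/namespace-operators
/sensu.io/rbac/roles/action_CICD/namespace-admin
/sensu.io/rbac/roles/action_CICD/namespace-operator
EOF

# comm needs sorted input; sort in place under the C locale for a stable order.
LC_ALL=C sort -o "$before" "$before"
LC_ALL=C sort -o "$after" "$after"

# -23 keeps column 1 only: keys present before the deletion but gone afterwards.
removed="$(LC_ALL=C comm -23 "$before" "$after")"
echo "$removed"

rm -f "$before" "$after"
```

In this test only the namespace key itself disappears; the resources stored under the namespace stay behind.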

@TakeTwo, it would definitely be useful to see that key output so we can determine whether a specific resource key is causing the problem for you. My quick test of just deleting a namespace didn’t cause a problem for me, so it’s definitely something subtle.

In the meantime, if you do see production namespaced resources in your etcdctl key output, it might be easier to add the namespace back, do the upgrade, and then delete the namespace again after the upgrade. Though I really want to see which etcd key names are referring to production.

Thanks for the suggestions @palourde and @jspaleta.

Unfortunately I haven’t had much luck with this; it may be a bit beyond my technical capabilities to troubleshoot.

etcdctl -ca-file /etc/sensu/tls/xxx.pem --cert-file /etc/sensu/tls/xxx.pem --key-file /etc/sensu/tls/xxx-key.pem get /sensu.io --prefix --keys-only | grep production

flag provided but not defined: -prefix

I’ve tried swapping between API 2 and 3 without any difference.

Removing these flags gives the following error:

Error: client: etcd cluster is unavailable or misconfigured; error #0: dial tcp 127.0.0.1:4001: connect: connection refused

; error #1: EOF

error #0: dial tcp 127.0.0.1:4001: connect: connection refused

error #1: EOF

Adding --endpoints "https://localhost:2379" seems to allow connectivity, but I’m not sure this is the endpoint to use for this purpose. This gives the following error:

Error: 100: Key not found (/sensu.io) [53]

Could you try to run the same first command, but prefixed with ETCDCTL_API=3 and with the v3 spellings of the TLS flags, so something like:

ETCDCTL_API=3 etcdctl --cacert /etc/sensu/tls/xxx.pem --cert /etc/sensu/tls/xxx.pem --key /etc/sensu/tls/xxx-key.pem get /sensu.io --prefix --keys-only | grep production
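
For context on the errors above: etcdctl builds up to 3.3 default to the v2 API, where get has no --prefix flag and the TLS flags are spelled --ca-file, --cert-file and --key-file. The v2 keyspace is also separate from the v3 keyspace that Sensu writes to, which would explain the "Key not found (/sensu.io)" response. A sketch of a v3 setup through environment variables, assuming the endpoint mentioned in this thread and placeholder certificate paths:

```shell
# Select the v3 API; etcdctl 3.3 and earlier fall back to v2 without this.
export ETCDCTL_API=3

# etcdctl v3 reads ETCDCTL_<FLAG> variables as defaults for the matching flags.
export ETCDCTL_ENDPOINTS="https://localhost:2379"   # endpoint taken from this thread
export ETCDCTL_CACERT="/etc/sensu/tls/xxx.pem"      # placeholder path
export ETCDCTL_CERT="/etc/sensu/tls/xxx.pem"        # placeholder path
export ETCDCTL_KEY="/etc/sensu/tls/xxx-key.pem"     # placeholder path

# With the variables set, the listing command stays short:
# etcdctl get /sensu.io --prefix --keys-only | grep production
```

Set each option either as a variable or as a flag, not both, since some etcdctl versions refuse to run when a flag and its environment variable are both present.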

Thanks @palourde, that did the trick.

ETCDCTL_API=3 etcdctl --cacert=/etc/sensu/tls/ca.pem --cert=/etc/sensu/tls/xxx.pem --key=/etc/sensu/tls/xxx.pem get /sensu.io --prefix --keys-only | grep production
/sensu.io/api/internal/metricsd/v1/metrics/production/entity_gauges
/sensu.io/api/internal/metricsd/v1/metrics/production/event_gauges
/sensu.io/api/internal/metricsd/v1/metrics/production/keepalive_gauges
/sensu.io/entities/production/xxx
/sensu.io/events/production/xxx/keepalive
/sensu.io/switchsets/lease/keepalived/production/xxx 

(Certificates and server name replaced with xxx)

I can hold off on deleting the keys if there’s some other debugging you’d like me to do first.

Glad to hear this @TakeTwo!

Feel free to delete those keys; you will need to delete them one by one using a command similar to this:

ETCDCTL_API=3 etcdctl --cacert=/etc/sensu/tls/ca.pem --cert=/etc/sensu/tls/xxx.pem --key=/etc/sensu/tls/xxx.pem del /sensu.io/path/to/key
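
Deleting key by key can also be scripted. A cautious sketch, assuming the TLS and endpoint settings are already supplied (for example through ETCDCTL_* environment variables) and that the key list was reviewed first. The names delete_keys and DRY_RUN are illustrative and not part of etcdctl.

```shell
# Hypothetical helper: read key names on stdin and delete each one.
# DRY_RUN=1 (the default here) only prints what would be run.
delete_keys() {
  while IFS= read -r key; do
    # --keys-only output contains blank separator lines; skip them.
    [ -n "$key" ] || continue
    if [ "${DRY_RUN:-1}" = "1" ]; then
      echo "would run: etcdctl del $key"
    else
      # Redirect stdin so etcdctl cannot consume the remaining key list.
      ETCDCTL_API=3 etcdctl del "$key" </dev/null
    fi
  done
}

# Dry run against two of the keys from the listing above:
printf '%s\n' \
  /sensu.io/entities/production/xxx \
  '' \
  /sensu.io/events/production/xxx/keepalive |
  delete_keys
```

Setting DRY_RUN=0 performs the real deletions, which is best done only after the database backup that the upgrade prompt asks for.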

Do you remember which version you originally installed on this cluster? I believe some older Sensu Go versions may have allowed you to delete a namespace that wasn’t empty, but I think that was fixed a couple of releases ago.