Sensu Go 6.4 three node cluster won't start

Hi,

I’m trying to setup a three node Sensu Go cluster using Sensu Go 6.4 on Red Hat Enterprise Linux 8.3 servers. I have no firewall between the nodes (same subnet).
I have a working “test-cluster” using Sensu Go 6.3 and I have pretty much replicated the settings in backend.yml to my 6.4 setup. However it seems that backend.yml was updated between 6.3 → 6.4 and there are some new settings.

On each node, Sensu is started using systemctl with the following ExecStart:

ExecStart=/usr/sbin/sensu-backend start -c /opt/app/sensu/backend.yml

backend.yml (the only thing that differs on the other two nodes are hostname/IP-address and path to certificates and key):

---
# Sensu backend configuration
cache-dir: "/opt/app/sensu/sensu-backend"
config-file: "/opt/app/sensu/backend.yml"
state-dir: "/opt/app/sensu/sensu-backend"
log-level: "debug" #available log levels: panic, fatal, error, warn, info, debug, trace

##
# backend configuration
##
#labels:
#  example_key: "example value"
#annotations:
#  example/key: "example value"
#assets-burst-limit: 100
#assets-rate-limit: 1.39
#debug: false
#deregistration-handler: "example_handler"
#require-fips: false
#require-openssl: false
#eventd-buffer-size: 100
#eventd-workers: 100
#keepalived-buffer-size: 100
#keepalived-workers: 100
#pipelined-buffer-size: 100
#pipelined-workers: 100


##
# api configuration
##
api-listen-address: "[::]:8080" #listen on all IPv4 and IPv6 addresses
#api-request-limit: 512000
api-url: "https://server1.domain.net:8080"


##
# tls configuration
##
agent-host: "[::]" #listen on all IPv4 and IPv6 addresses
agent-port: 8081
cert-file: "/opt/app/sensu/tls/server1.domain.net.pem"
key-file: "/opt/app/sensu/tls/server1.domain.net.key"
trusted-ca-file: "/opt/app/sensu/tls/ca_chain.pem"
#agent-auth-cert-file: /path/to/tls/backend-1.pem
#agent-auth-crl-urls: http://localhost/CARoot.crl
#agent-auth-key-file: /path/to/tls/backend-1-key.pem
#agent-auth-trusted-ca-file: /path/to/tls/ca.pem
#agent-burst-limit: null
#agent-rate-limit: null
#insecure-skip-tls-verify: false
#jwt-private-key-file: /path/to/key/private.pem
#jwt-public-key-file: /path/to/key/public.pem
dashboard-cert-file: "/opt/app/sensu/tls/server1.domain.net.pem"
dashboard-host: "[::]"
dashboard-key-file: "/opt/app/sensu/tls/server1.domain.net.key"
dashboard-port: 3000


##
# etcd datastore configuration
##
etcd-advertise-client-urls:
  - https://10.0.0.1:2379
etcd-cert-file: "/opt/app/sensu/tls/server1.domain.net.pem"
#etcd-cipher-suites:
#  - TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384
#  - TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
#  - TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256
#  - TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256
#  - TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305
#  - TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305
#etcd-client-cert-auth: false
#etcd-client-urls:
#  - https://10.0.0.1:2379
#  - https://10.1.0.1:2379
#etcd-discovery:
#  - https://discovery.etcd.io/3e86b59982e49066c5d813af1c2e2579cbf573de
#etcd-discovery-srv:
#  - example.org
etcd-initial-advertise-peer-urls:
  - https://10.0.0.1:2380
#  - https://10.1.0.1:2380
etcd-initial-cluster: "server1=https://10.0.0.1:2380,server2=https://10.0.0.2:2380,server3=https://10.0.0.3:2380"
etcd-initial-cluster-state: "new"
etcd-initial-cluster-token: "verysecretkey"
etcd-key-file: "/opt/app/sensu/tls/server1.domain.net.key"
etcd-listen-client-urls:
  - https://10.0.0.1:2379
#  - https://10.1.0.1:2379
etcd-listen-peer-urls:
  - https://10.0.0.1:2380
#  - https://10.1.0.1:2380
etcd-name: "server1"
etcd-peer-cert-file: "/opt/app/sensu/tls/server1.domain.net.pem"
#etcd-peer-client-cert-auth: false
etcd-peer-key-file: "/opt/app/sensu/tls/server1.domain.net.key"
etcd-peer-trusted-ca-file: "/opt/app/sensu/tls/ca_chain.pem"
etcd-trusted-ca-file: "/opt/app/sensu/tls/ca_chain.pem"
#no-embed-etcd: false
#etcd-election-timeout: 1000
#etcd-heartbeat-interval: 100
#etcd-max-request-bytes: 1572864
#etcd-quota-backend-bytes: 4294967296

When I start all cluster nodes, the following error is written to the log about once per second:

Jul 01 08:19:02 server1.domain.net sensu-backend[2328690]: {"component":"etcd","level":"debug","caller":"v3rpc/lease.go:118","msg":"failed to receive lease keepalive request from gRPC stream","error":"rpc error: code = Canceled desc = context canceled","time":"2021-07-01T08:19:02+02:00"}

When I try to initialize I get the following error:

sudo -E sensu-backend init --config-file /opt/app/sensu/backend.yml --cluster-admin-password password --cluster-admin-username admin
{"component":"cmd","level":"info","msg":"attempting to connect to etcd server: https://10.0.0.1:2379","time":"2021-07-01T08:19:18+02:00"}
{"component":"cmd","level":"error","msg":"error connecting to etcd endpoint: context deadline exceeded","time":"2021-07-01T08:19:23+02:00"}
{"component":"sensu-enterprise","error":"no etcd endpoints are available or cluster is unhealthy","level":"fatal","msg":"error executing sensu-backend","time":"2021-07-01T08:19:23+02:00"}

Any ideas?
Thanks!

Edit: I didn’t provide enough info, attaching links to logs from server1 and server2 (server3 logs in separate post since new users can’t post more than 2 links). Real server names and IP-addresses have been replaced, just in case something weird related to this is spotted in config and/or log files.

server1.log
server2.log

Best regards,
Jim

server3.log

Hi,

I’m seeing a lot of TLS handshake error messages in those logs, so that makes me suspicious that what are seeing is a change in the golang certificate handling behavior that was introduced in golang 1.15. Ref: https://golang.google.cn/doc/go1.15#commonname

Quoting golang 1.15 release notes:

X.509 CommonName deprecation

The deprecated, legacy behavior of treating the CommonName field on X.509 certificates as a host name when no Subject Alternative Names are present is now disabled by default. It can be temporarily re-enabled by adding the value x509ignoreCN=0 to the GODEBUG environment variable.

Note that if the CommonName is an invalid host name, it’s always ignored, regardless of GODEBUG settings. Invalid names include those with any characters other than letters, digits, hyphens and underscores, and those with empty labels or trailing dots.

This was actually a change that can be tracked back to an RFC issued in 2000
https://datatracker.ietf.org/doc/html/rfc2818
Quoting the RFC:

Common Name field in the Subject field of the certificate MUST be used. Although
the use of the Common Name is existing practice, it is deprecated and
Certification Authorities are encouraged to use the dNSName instead.`

Hi,

Thanks for replying!

The TLS handshake errors are caused by a F5 load balancer which is querying my nodes on port 3000 to see if they’re alive, I should’ve pointed this out in my original post. I have the same setup in my other Sensu test-cluster, where the cluster is working fine, and I’ve got the same TLS handshake errors there.

My certificates contains all host names and IP-addresses used in the Subject Alternative Name extension:

hmm,
I’m at a loss at the moment. The tls commonname deprecation is the only gotcha I’m aware.

Just to be clear this is a brand new cluster bring up with embedded etcd correctly? Not reusing an existing etcd data store directory… completely fresh bring up?

Is the bring up being managed by puppet or something similar?

Yes, brand new cluster. Nothing is re-used.

Is the operational 6.3 sensu go clusters also running on rhel 8.x?

Could this be something specific to rhel 8 environment like a change to selinux labeling policy compared to rhel 7. Anything in the os audit logs that would indicate an selinux filesystem denial to the sensu user?

Both the working 6.3 cluster and the non-working 6.4 cluster are built on the exact same platform.

Non-working 6.4 cluster:

Working 6.3 cluster:

I did not post the same pic twice, promise! :slight_smile:

The strange thing is, at least to me, it looks like the cluster “works” initially when it is started.
One of the nodes is elected leader, which tells me they can talk to each other?

Installed Sensu Go 6.3 on my “problematic” servers, and it works just fine. I don’t know if there’s some unfortunate combination of something on my “problematic” servers + Sensu Go 6.4 or what is causing my issues. However I’m kinda stuck at troubleshooting this on my own.

Hey,
Still trying to reliably reproduce your problem. I know its been a while.

I’m still suspicious that its a nuanced problem with golang deprecated x509 common name support.
Can you try one thing for me…If you re-enable the golang deprecated cert behavior using the environment variable:

GODEBUG=x509ignoreCN=0

You’ll need to add this to all the sensu-backend’s running environments. This will re-enable the common-name support golang 1.15 deprecated.

If making this change gets your 6.4 cluster working with the same config your 6.3 cluster is using we’ll can be pretty sure its something subtle in your cert’s SAN relative to how the sensu-backend is trying to talk to the sensu provided embedded etcd. Something we may need to do a better job documenting.

Hi Jef,

Thanks for the persistence! :slight_smile:
Unfortunately I went for 6.3 in my setup and I can’t get a 6.4 version up and running at the moment.

Best regards,
Jim

Hi Jef,

I’m back, attempting to install Sensu Go Backend 6.6.0-5502 (on new RHEL 8 servers).

I have added the environment variable to systemctl config (i.e. systemctl edit sensu-backend):
Environment=“GODEBUG=x509ignoreCN=0”

However, still seeing the same error:

Nov 29 14:29:50 server1.domain.net sensu-backend[452433]: {“component”:“etcd”,“level”:“debug”,“caller”:“v3rpc/lease.go:118”,“msg”:“failed to receive lease keepalive request from gRPC stream”,“error”:“rpc error: code = Canceled desc = context canceled”,“time”:“2021-11-29T14:29:50+01:00”}

Do you have any more ideas?
Thanks!

Best regards,
Jim