Intermittent connection refused from sensu-backend

I have been using Sensu Go in development for the past 2 months and have been running into problems with the sensu-backend web UI intermittently failing ("Connection Refused"). It has happened twice in the past 4 days.

I was on version 6.1.1-3555 when the problem started happening, so I downgraded to 6.0.0-3003 on a fresh VM, and the problem is still happening.

I am really puzzled because the logs show nothing out of the ordinary, and then around the time of the problem they show it losing access to handlers:
{"component":"pipelined","handler":"mailer","level":"info","msg":"handler does not exist, will be ignored","namespace":"default","time":"2020-10-22T11:59:41-07:00"}
Seconds later the handlers are found fine again, but the web UI is no longer accepting connections. There is no error in the log.

To recap: sensu-backend runs for days without issue and then suddenly stops working. The process is still running and checks are still occurring, but the web UI refuses connections.
The only fix at this point is to restart sensu-backend. I have had to run a watch script to keep it running.
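
For what it's worth, the watch script is nothing fancy, roughly along these lines and run from cron every minute (the scheme, port, and unit name here are assumptions from my setup; adjust to yours):

#!/bin/sh
# probe the web UI and restart the backend if it refuses connections
# (3000 is the Sensu Go web UI default port; scheme/port/unit name may differ per setup)
if ! curl -skf -o /dev/null https://127.0.0.1:3000/; then
    systemctl restart sensu-backend
fi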

Any help is appreciated. My VM:
sensu version 6.0.0-3003
linux centos 7.8.2003
amd64
glibc

plugins installed:
[ "sensu/sensu-email-handler",
"sensu/sensu-ruby-runtime",
"sensu/monitoring-plugins",
"sensu/sensu-pagerduty-handler",
"fgouteroux/sensu-go-graphite-handler",
"sensu-plugins/sensu-plugins-memory-checks",
"sensu-plugins/sensu-plugins-filesystem-checks",
"sensu-plugins/sensu-plugins-disk-checks",
"sensu-plugins/sensu-plugins-process-checks",
"sensu-plugins/sensu-plugins-network-checks",
"sensu-plugins/sensu-plugins-cpu-checks",
"sensu-plugins/sensu-plugins-mysql",
"nixwiz/sensu-go-fatigue-check-filter" ]

Any help is so appreciated. I don’t know where to start because the logs are not showing me much.

Thanks!

Hi @scottymarshall

Thanks for reporting this issue. Would you be able to share with us the sensu-backend logs from around the time the web UI becomes unresponsive?

Also, it might be useful to increment the log verbosity the next time this happens, as described here: https://docs.sensu.io/sensu-go/latest/operations/maintain-sensu/troubleshoot/#increment-log-level-verbosity.
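
In short, and assuming the backend is running as the packaged service so the process is named sensu-backend, that procedure comes down to sending SIGUSR1 to the backend process, which should bump the log level one step without a restart:

# assumes the default process name from the packages
kill -s SIGUSR1 $(pidof sensu-backend)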

Thanks

Of course. Here is a copy of the logs; the log level was set to info at the time:
https://jmhpublic.s3-us-west-1.amazonaws.com/linux/sensu-backend.log.1.gz

Again, the timestamp of the problem was 2020-10-22T11:59.

I also just turned on debug, so if it does happen again, there will be a lot more to see.
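
For reference, making debug persistent across restarts is just a one-line change in the backend configuration (assuming the packaged default path), followed by a restart of sensu-backend:

# /etc/sensu/backend.yml (excerpt)
log-level: debug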

I just upgraded to 6.1.2, and the release notes make me optimistic that the problem might be corrected. Still, I have debug on and will monitor for the next week.

OK… bummer. I upgraded to 6.1.2-3565.x86_64 and 3 days later I ran into the problem. :(

The good news is I had debug on:
https://jmhpublic.s3-us-west-1.amazonaws.com/linux/sensu-backend.log.11012020.gz
The web UI went down sometime close to Nov 01 18:40:17.

The bad news is I still don't see much. Right around the time the web UI went down, I see these messages in the logs:
Nov 1 18:40:24 monitor01-poc sensu-backend: {"component":"pipelined","handler":"keepalive","level":"info","msg":"handler does not exist, will be ignored","namespace":"default","time":"2020-11-01T18:40:24-08:00"}

They go away after a minute. But I see no errors.

I will note that compaction occurred a few minutes before the failure:
Nov 1 18:36:52 monitor01-poc sensu-backend: {"component":"etcd","level":"warning","msg":"Starting auto-compaction at revision 6225453 (retention: 2 revisions)","pkg":"compactor","time":"2020-11-01T18:36:52-08:00"}
Nov 1 18:36:52 monitor01-poc sensu-backend: {"component":"etcd","level":"info","msg":"store.index: compact 6225453","pkg":"mvcc","time":"2020-11-01T18:36:52-08:00"}
Nov 1 18:36:52 monitor01-poc sensu-backend: {"component":"etcd","level":"warning","msg":"Finished auto-compaction at revision 6225453","pkg":"compactor","time":"2020-11-01T18:36:52-08:00"}
Nov 1 18:36:52 monitor01-poc sensu-backend: {"component":"etcd","level":"info","msg":"finished scheduled compaction at 6225453 (took 496.498µs)","pkg":"mvcc","time":"2020-11-01T18:36:52-08:00"}

Any help is much appreciated.

I'm having the API port randomly go down as well on 6.1.2. I was hoping this dot release would take care of it, but the issue appears to still be present.

It just started again this morning. It seems like once it starts, it happens more often: I rebuilt the server on the 29th and had no problems until today, when it suddenly happened about six times.

The last time was Nov 5 at 10:46. Again, nothing at all odd in the logs. I am completely at a loss.
https://jmhpublic.s3-us-west-1.amazonaws.com/linux/sensu-backend.log.20201105.gz

I want to move this to production, but its stability is frightening me.

Here’s another tidbit: This is what I get when I check the health:
curl -k https://127.0.0.1:8083/health

{"Alarms":null,"ClusterHealth":[{"MemberID":9882886658148554927,"MemberIDHex":"8927110dc66458af","Name":"default","Err":"context deadline exceeded","Healthy":false}],"Header":{"cluster_id":4255616304056076734,"member_id":9882886658148554927,"raft_term":46}}

I think I found the problem, and it is a bad graphite plugin. I was having restart problems every hour, and I happened to see the same plugin having problems around the time the web UI failed.

So I removed the plugin and the site became stable immediately. Still, I need to wait a week to be sure the problem is corrected.

I did update the graphite plugin to this one, which seems to be actively maintained:

@miralgj, I wonder if your issue is a plugin, too.

I did not find the problem. :( It started again this morning. I changed a few checks that ran every minute to run every few minutes or on 70-second intervals, hoping the problem is just the system getting hit hard every 60 seconds, but I am still experimenting…

@palourde, is there a way I can give you a stack trace to review when the system breaks?
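
For example, since sensu-backend is a Go program, my understanding is that sending it SIGQUIT should make the Go runtime print all goroutine stacks to stderr before the process exits (assuming the backend does not override the default signal handling and that systemd is capturing stderr in the journal):

# caution: this terminates the process after dumping the stacks
kill -QUIT $(pidof sensu-backend)
journalctl -u sensu-backend --no-pager | tail -n 300

Would something like that be useful, or is there a better way to capture it?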

So I think it might be the graphite-handler, or one of the metrics I use with the handler. It was happening intermittently all day yesterday, so I tried removing the graphite-handler altogether, and the problem seems to have subsided.

After 24 hours without any issues, I have now re-installed the graphite handler, BUT I have not subscribed any checks to it.

The plan is to turn on a single metric each day until I see the problem again. When I do, hopefully I can narrow it down to a specific check.

Thanks for keeping us apprised of the diagnosis.
Have you been able to narrow down the problem further?

Thanks for asking, @jspaleta. I think I have the problem corrected. The issue looks like missing timeouts: once I set timeouts for all of my checks, the problem went away. I have not seen it for two weeks.

From what I was reading, checks have no timeout by default. My guess is that some of my checks were never timing out and were just piling up on the Sensu Go backend as they awaited a response.
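
For anyone who hits the same thing, the fix was just adding a timeout attribute (in seconds) to every check. A rough sketch of what one of my check definitions looks like now; the check name, command, and values are only examples:

cat > check-disk.yml <<EOF
---
type: CheckConfig
api_version: core/v2
metadata:
  name: check-disk                             # example name
  namespace: default
spec:
  command: check-disk-usage.rb -w 80 -c 90     # example command
  interval: 70
  timeout: 30                                  # stop the check if it runs longer than 30 seconds
  subscriptions:
    - system
  handlers:
    - mailer
  publish: true
EOF
sensuctl create --file check-disk.yml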

1 Like