Failed to write response #4109

luke · December 11, 2020, 1:48am

I cannot open the web UI.
Logs:
Dec 01 18:16:46 WXJD-PSC-P9F1-MCORE-PM-OS01-BCOPS-01 sensu-backend[60816]: {"component":"apid.routers","error":"write tcp 10.174.68.6:32605-\u003e172.20.152.110:27864: i/o timeout","level":"error","msg":"failed to write response","time":"2020-12-01T18:16:46+08:00"}
Dec 01 18:17:10 WXJD-PSC-P9F1-MCORE-PM-OS01-BCOPS-01 sensu-backend[60816]: {"component":"apid.routers","error":"write tcp 10.174.68.6:32605-\u003e172.20.152.110:64297: i/o timeout","level":"error","msg":"failed to write response","time":"2020-12-01T18:17:10+08:00"}
Dec 01 18:17:35 WXJD-PSC-P9F1-MCORE-PM-OS01-BCOPS-01 sensu-backend[60816]: {"component":"apid.routers","error":"write tcp 10.174.68.6:32605-\u003e172.20.152.110:65077: i/o timeout","level":"error","msg":"failed to write response","time":"2020-12-01T18:17:35+08:00"}
Dec 01 18:17:58 WXJD-PSC-P9F1-MCORE-PM-OS01-BCOPS-01 sensu-backend[60816]: {"component":"apid.routers","error":"write tcp 10.174.68.6:32605-\u003e172.20.152.110:41533: i/o timeout","level":"error","msg":"failed to write response","time":"2020-12-01T18:17:58+08:00"}
Dec 01 18:18:24 WXJD-PSC-P9F1-MCORE-PM-OS01-BCOPS-01 sensu-backend[60816]: {"component":"apid.routers","error":"write tcp 10.174.68.6:32605-\u003e172.20.152.110:28274: i/o timeout","level":"error","msg":"failed to write response","time":"2020-12-01T18:18:24+08:00"}
Dec 01 18:18:48 WXJD-PSC-P9F1-MCORE-PM-OS01-BCOPS-01 sensu-backend[60816]: {"component":"apid.routers","error":"write tcp 10.174.68.6:32605-\u003e172.20.152.110:46913: i/o timeout","level":"error","msg":"failed to write response","time":"2020-12-01T18:18:48+08:00"}
Dec 01 18:19:13 WXJD-PSC-P9F1-MCORE-PM-OS01-BCOPS-01 sensu-backend[60816]: {"component":"apid.routers","error":"write tcp 10.174.68.6:32605-\u003e172.20.152.110:21830: i/o timeout","level":"error","msg":"failed to write response","time":"2020-12-01T18:19:13+08:00"}
Dec 01 18:19:30 WXJD-PSC-P9F1-MCORE-PM-OS01-BCOPS-01 sensu-backend[60816]: {"component":"apid.routers","error":"write tcp 10.174.68.6:32605-\u003e172.20.152.110:25408: i/o timeout","level":"error","msg":"failed to write response","time":"2020-12-01T18:19:30+08:00"}
Dec 01 18:19:46 WXJD-PSC-P9F1-MCORE-PM-OS01-BCOPS-01 sensu-backend[60816]: {"component":"apid.routers","error":"write tcp 10.174.68.6:32605-\u003e172.20.152.110:31197: i/o timeout","level":"error","msg":"failed to write response","time":"2020-12-01T18:19:46+08:00"}

When the event size is 150,000, we can see the event dashboard.
But when the event size is larger than 200,000, the API does not respond.

jspaleta · December 11, 2020, 1:56am

Hey,
So when you say event size is “150,000” versus “200,000”, what are you referring to here… the number of events in the namespace or the size of a particular event’s payload? Or something else?

luke · December 11, 2020, 2:02am

It is the number of events in the namespace.

jspaleta · December 11, 2020, 2:47am

that helps thanks…
200,000 events is a lot. That’s equivalent to 2,000 checks running on each of 100 agents,
or 20 checks on each of 10,000 agents.
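
To make that arithmetic concrete, here is a minimal sketch. It assumes one stored event per entity/check pair, and `event_count` is just an illustrative helper name, not anything from Sensu itself:

```python
def event_count(checks_per_agent: int, agents: int) -> int:
    """Rough event total, assuming one current event per (agent, check) pair."""
    return checks_per_agent * agents


# The two layouts mentioned above land on the same total.
print(event_count(2000, 100))
print(event_count(20, 10_000))
```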

So you’ve probably hit a resource constraint in the design right now.

The web-ui is asking the backend to sort/order 200,000 events and with etcd as the event store this means the backend is pulling them all into memory and doing that filter/sort before handing it back to the web-ui. Your backend might be paging memory to swap and causing the backend to slow down, or just taking a long time to do that sort. If you monitor the host running the backend you are talking to, you might see the web-ui causing memory or cpu spikes.
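
If you want a quick way to check for swapping while the web-ui is loading, something like the sketch below could work on a Linux host. It only reads `/proc/meminfo`; the helper names (`parse_meminfo`, `swap_used_kib`) are made up for this example, and a proper monitoring check would be a better long-term answer:

```python
def parse_meminfo(text: str) -> dict:
    """Parse /proc/meminfo-style text into {field_name: value_in_kB}."""
    values = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        # Keep only lines that look like "Name:   12345 kB".
        if parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def swap_used_kib(meminfo: dict) -> int:
    """Swap in use = SwapTotal - SwapFree (0 if the fields are missing)."""
    return meminfo.get("SwapTotal", 0) - meminfo.get("SwapFree", 0)


if __name__ == "__main__":
    try:
        with open("/proc/meminfo") as fh:
            info = parse_meminfo(fh.read())
        print("swap used (kB):", swap_used_kib(info))
        print("available (kB):", info.get("MemAvailable"))
    except OSError:
        # Not a Linux host, or /proc is not readable.
        print("could not read /proc/meminfo")
```

Run it a few times while the events page is loading. If the swap figure climbs, memory pressure is a likely suspect.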

Options:

Increase the memory on the backend system and see if that helps… you should be able to monitor memory usage and see if it’s swapping. If it is, adding memory might help.
Enable the PostgreSQL event store, which is available in the official binaries. PostgreSQL might be more memory/CPU efficient and allow you to scale up to 200,000 events in a namespace.
Use Sensu namespaces to bucket events into smaller groups. Instead of keeping all 200,000 events in a single namespace, break them into 5 or more namespaces.
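
For the namespace option, one way to split entities evenly is to hash the entity name and pick a bucket from that. This is only a sketch: the namespace names (`ports-0` … `ports-4`) and host names are invented for the example, and you would still need to create the namespaces and resources in Sensu yourself:

```python
import hashlib
from collections import Counter


def namespace_for(entity_name: str, namespaces: list) -> str:
    """Pick a namespace deterministically from a stable hash of the entity name."""
    digest = hashlib.sha256(entity_name.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(namespaces)
    return namespaces[index]


# Hypothetical names, purely for illustration.
namespaces = [f"ports-{i}" for i in range(5)]
entities = [f"host-{i:04d}" for i in range(6000)]

counts = Counter(namespace_for(name, namespaces) for name in entities)
for ns in namespaces:
    print(ns, counts[ns])
```

Because the hash is stable, an entity always lands in the same namespace, so re-running the assignment later does not shuffle hosts around.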

But yes, 200,000 events in a single namespace is definitely pushing it.

There’s also a benchmarking repository with tools that you can use to help you test configurations.

Hope this helps,

luke · December 11, 2020, 4:39am

Thanks very much, I will test that.

We have 256 GB of memory. Would setting etcd-max-request-bytes to 10M help?
We monitor about 6,000 entities with 800 checks.
Most checks use proxy_requests to monitor ports, and we add the check name to the entity subscriptions.
For example:

{
  "command": "check-ports.rb -p 1321 -H {{ .annotations.address }}",
  "handlers": [
    "alarm"
  ],
  "high_flap_threshold": 0,
  "interval": 30,
  "low_flap_threshold": 0,
  "publish": true,
  "runtime_assets": [
    "sensu-ruby-runtime"
  ],
  "subscriptions": [
    "proxy"
  ],
  "proxy_entity_name": "",
  "check_hooks": null,
  "stdin": false,
  "subdue": null,
  "ttl": 0,
  "timeout": 0,
  "proxy_requests": {
    "entity_attributes": [
      "entity.subscriptions.indexOf('07d973e0f64d4c60b2b870c2122f7a2f') >= 0"
    ],
    "splay": false,
    "splay_coverage": 90
  },
  "round_robin": true,
  "output_metric_format": "",
  "output_metric_handlers": null,
  "env_vars": null,
  "metadata": {
    "name": "07d973e0f64d4c60b2b870c2122f7a2f",
    "namespace": "default",
    "labels": {
      "sensu.io/managed_by": "sensuctl",
      "group": "BC-EC"
    },
    "annotations": {
      "AlarmImpact": "",
      "AlarmAdvice": "",
      "snmpTrapOID": "07d973e0f64d4c60b2b870c2122f7a2e",
      "description": "BC-EC_port_1321",
      "AlarmType": "BC-EC",
      "AlarmTitle": "端口down",
      "AlarmLevel": "0"
    },
    "created_by": "admin"
  },
  "secrets": null
}

Can you give me some advice on how to monitor these?

jspaleta · December 11, 2020, 9:59pm

uhm… replicate checks in multiple namespaces and run an agent per namespace… then partition your proxy hosts across those namespaces.

It’s possible to run multiple agents side by side on the same host… as long as you configure them to use different state directories and ensure they aren’t all trying to bind to the same ports for things like the agent event socket and statsd.
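
As a rough sketch of what "different state directories and ports" could look like, the snippet below generates one set of settings per namespace. The option names (`cache-dir`, `api-port`, `socket-port`, `statsd-metrics-port`) and default ports are how I remember the agent configuration, so double check them against the agent reference for your version. The namespace names, agent names and directory layout are invented for the example:

```python
def agent_settings(namespaces, base_dir="/var/cache/sensu"):
    """Build one settings dict per namespace with non-overlapping dirs and ports."""
    settings = []
    for offset, ns in enumerate(namespaces):
        settings.append({
            "name": f"proxy-agent-{ns}",            # hypothetical agent name
            "namespace": ns,
            "cache-dir": f"{base_dir}/sensu-agent-{ns}",
            "api-port": 3031 + offset * 10,          # assumed default 3031
            "socket-port": 3030 + offset * 10,       # assumed default 3030
            "statsd-metrics-port": 8125 + offset,    # assumed default 8125
        })
    return settings


for agent in agent_settings(["ports-0", "ports-1", "ports-2"]):
    print(agent)
```

Each dict could then be written out as its own agent config file and started as a separate service.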

luke · December 12, 2020, 7:26am

Thanks very much. Setting up multiple namespaces is a better way to monitor a large cluster.

luke · December 21, 2020, 9:15am

I have now deployed Sensu 3.21. Can I deploy etcd 3.4 with it?
etcd 3.4 is said to have better performance.

jspaleta · December 22, 2020, 11:34pm

Sensu 3.21?

I am a little concerned now. I’m not sure what you are referring to. Current release is Sensu 6.2
