Failed to write response #4109

Cannot open the web UI.
Logs:
Dec 01 18:16:46 WXJD-PSC-P9F1-MCORE-PM-OS01-BCOPS-01 sensu-backend[60816]: {"component":"apid.routers","error":"write tcp 10.174.68.6:32605-\u003e172.20.152.110:27864: i/o timeout","level":"error","msg":"failed to write response","time":"2020-12-01T18:16:46+08:00"}
Dec 01 18:17:10 WXJD-PSC-P9F1-MCORE-PM-OS01-BCOPS-01 sensu-backend[60816]: {"component":"apid.routers","error":"write tcp 10.174.68.6:32605-\u003e172.20.152.110:64297: i/o timeout","level":"error","msg":"failed to write response","time":"2020-12-01T18:17:10+08:00"}
Dec 01 18:17:35 WXJD-PSC-P9F1-MCORE-PM-OS01-BCOPS-01 sensu-backend[60816]: {"component":"apid.routers","error":"write tcp 10.174.68.6:32605-\u003e172.20.152.110:65077: i/o timeout","level":"error","msg":"failed to write response","time":"2020-12-01T18:17:35+08:00"}
Dec 01 18:17:58 WXJD-PSC-P9F1-MCORE-PM-OS01-BCOPS-01 sensu-backend[60816]: {"component":"apid.routers","error":"write tcp 10.174.68.6:32605-\u003e172.20.152.110:41533: i/o timeout","level":"error","msg":"failed to write response","time":"2020-12-01T18:17:58+08:00"}
Dec 01 18:18:24 WXJD-PSC-P9F1-MCORE-PM-OS01-BCOPS-01 sensu-backend[60816]: {"component":"apid.routers","error":"write tcp 10.174.68.6:32605-\u003e172.20.152.110:28274: i/o timeout","level":"error","msg":"failed to write response","time":"2020-12-01T18:18:24+08:00"}
Dec 01 18:18:48 WXJD-PSC-P9F1-MCORE-PM-OS01-BCOPS-01 sensu-backend[60816]: {"component":"apid.routers","error":"write tcp 10.174.68.6:32605-\u003e172.20.152.110:46913: i/o timeout","level":"error","msg":"failed to write response","time":"2020-12-01T18:18:48+08:00"}
Dec 01 18:19:13 WXJD-PSC-P9F1-MCORE-PM-OS01-BCOPS-01 sensu-backend[60816]: {"component":"apid.routers","error":"write tcp 10.174.68.6:32605-\u003e172.20.152.110:21830: i/o timeout","level":"error","msg":"failed to write response","time":"2020-12-01T18:19:13+08:00"}
Dec 01 18:19:30 WXJD-PSC-P9F1-MCORE-PM-OS01-BCOPS-01 sensu-backend[60816]: {"component":"apid.routers","error":"write tcp 10.174.68.6:32605-\u003e172.20.152.110:25408: i/o timeout","level":"error","msg":"failed to write response","time":"2020-12-01T18:19:30+08:00"}
Dec 01 18:19:46 WXJD-PSC-P9F1-MCORE-PM-OS01-BCOPS-01 sensu-backend[60816]: {"component":"apid.routers","error":"write tcp 10.174.68.6:32605-\u003e172.20.152.110:31197: i/o timeout","level":"error","msg":"failed to write response","time":"2020-12-01T18:19:46+08:00"}

If the event size is 150,000, we can see the events dashboard.
But when the event size is larger than 200,000, the API returns no response.

Hey,
So when you say the event size is “150,000” versus “200,000”, what are you referring to here: the number of events in the namespace, or the size of a particular event’s payload? Or something else?

It is the number of events in the namespace.

That helps, thanks…
200,000 events is a lot. That’s equivalent to 2,000 checks running on each of 100 agents,
or 20 checks on each of 10,000 agents.
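
The arithmetic behind those two figures can be sketched as below. It assumes the store keeps one current event per entity/check pair, so the event count is simply a product; the function name is made up for illustration.

```python
def expected_event_count(agents: int, checks_per_agent: int) -> int:
    """Estimate events in a namespace, assuming one event per entity/check pair."""
    return agents * checks_per_agent


# The two combinations mentioned above both land on the same total.
print(expected_event_count(100, 2000))    # 200000
print(expected_event_count(10_000, 20))   # 200000
```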

So you’ve probably hit a resource constraint in the design right now.

The web-ui is asking the backend to sort/order 200,000 events and with etcd as the event store this means the backend is pulling them all into memory and doing that filter/sort before handing it back to the web-ui. Your backend might be paging memory to swap and causing the backend to slow down, or just taking a long time to do that sort. If you monitor the host running the backend you are talking to, you might see the web-ui causing memory or cpu spikes.
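
One rough way to check for swapping on a Linux backend host is to compare `SwapTotal` and `SwapFree` from `/proc/meminfo` while the web-ui request is in flight. The sketch below only parses text in that format; the helper names and the sample values are invented for illustration, not taken from a real host.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a {field: kibibytes} dict."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Name: value" pairs
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def swap_used_kib(fields):
    """Swap in use, in KiB (0 if the host has no swap configured)."""
    return fields.get("SwapTotal", 0) - fields.get("SwapFree", 0)


# Made-up sample in the /proc/meminfo layout.
SAMPLE = """\
MemTotal:       263856000 kB
MemAvailable:   120000000 kB
SwapTotal:        8388604 kB
SwapFree:         8000000 kB
"""

print(swap_used_kib(parse_meminfo(SAMPLE)))  # 388604
```

On the real host, pass the contents of `/proc/meminfo` instead of `SAMPLE` and sample it repeatedly while loading the events page; a growing value would suggest the backend is being pushed into swap.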

Options:

  1. Up the memory on the backend system and see if that helps… You should be able to monitor memory usage and see if it’s swapping. If it is, upping the memory might help.
  2. Enable the PostgreSQL event store in the official binaries. PostgreSQL might be more memory/CPU efficient and allow you to scale up to 200,000 events in a namespace.
  3. Try to use Sensu namespaces so you can bucket events into smaller groups. So instead of trying to get all 200,000 events in a single namespace, break it into 5 or more namespaces.
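
For option 3, one possible way to bucket entities is a stable hash of the entity name, so that a given entity always lands in the same namespace. This is only a sketch: the namespace names (`ops-0` to `ops-4`) and the entity names are hypothetical, and any other stable rule (by site, by team, by service) would work just as well.

```python
import hashlib


def namespace_for(entity_name, namespaces):
    """Pick a namespace for an entity using a stable hash of its name."""
    digest = hashlib.sha256(entity_name.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(namespaces)
    return namespaces[index]


def partition(entities, namespaces):
    """Group entity names by the namespace they hash into."""
    buckets = {ns: [] for ns in namespaces}
    for name in entities:
        buckets[namespace_for(name, namespaces)].append(name)
    return buckets


# Hypothetical names, for illustration only.
NAMESPACES = ["ops-%d" % i for i in range(5)]
ENTITIES = ["host-%04d" % i for i in range(1000)]

for ns, members in partition(ENTITIES, NAMESPACES).items():
    print(ns, len(members))
```

Using `hashlib` rather than Python's built-in `hash()` keeps the assignment the same across runs and machines, which matters if the partitioning is recomputed later.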

Even so, 200,000 events in a single namespace is definitely pushing it.

There’s also a benchmarking repository with tools that you can use to help you test configurations.

Hope this helps,

Thanks very much, I will test that.

  1. We have 256 GB of memory. Would setting etcd-max-request-bytes to 10 MB help?

  2. We monitor about 6,000 entities with 800 checks.
    Most checks use proxy_requests to monitor ports, and we add the check name to each entity's subscriptions.
    For example:

{
  "command": "check-ports.rb -p 1321 -H {{ .annotations.address }}",
  "handlers": [
    "alarm"
  ],
  "high_flap_threshold": 0,
  "interval": 30,
  "low_flap_threshold": 0,
  "publish": true,
  "runtime_assets": [
    "sensu-ruby-runtime"
  ],
  "subscriptions": [
    "proxy"
  ],
  "proxy_entity_name": "",
  "check_hooks": null,
  "stdin": false,
  "subdue": null,
  "ttl": 0,
  "timeout": 0,
  "proxy_requests": {
    "entity_attributes": [
      "entity.subscriptions.indexOf('07d973e0f64d4c60b2b870c2122f7a2f') >= 0"
    ],
    "splay": false,
    "splay_coverage": 90
  },
  "round_robin": true,
  "output_metric_format": "",
  "output_metric_handlers": null,
  "env_vars": null,
  "metadata": {
    "name": "07d973e0f64d4c60b2b870c2122f7a2f",
    "namespace": "default",
    "labels": {
      "sensu.io/managed_by": "sensuctl",
      "group": "BC-EC"
    },
    "annotations": {
      "AlarmImpact": "",
      "AlarmAdvice": "",
      "snmpTrapOID": "07d973e0f64d4c60b2b870c2122f7a2e",
      "description": "BC-EC_port_1321",
      "AlarmType": "BC-EC",
      "AlarmTitle": "端口down",
      "AlarmLevel": "0"
    },
    "created_by": "admin"
  },
  "secrets": null
}

Can you give me some advice on how to monitor these?

Uhm… replicate the checks in multiple namespaces and run an agent per namespace… then partition your proxy hosts across those namespaces.

It’s possible to run multiple agents side by side on the same host… as long as you configure them to use different state directories and ensure they aren’t all trying to bind to the same ports for things like the agent event socket and statsd.
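
As a rough sketch of what "different state directories and ports" could look like, the snippet below generates one settings dict per namespace with non-overlapping values. The key names (`cache-dir`, `api-port`, `socket-port`, `statsd-metrics-port`) and the default port numbers are assumptions based on the sensu-agent options as commonly documented; check them against the agent reference for your version. The namespace names, agent names, and port spacing are hypothetical.

```python
def agent_settings(namespaces, base_dir="/var/cache/sensu"):
    """Build per-namespace agent settings with distinct directories and ports."""
    settings = []
    for i, ns in enumerate(namespaces):
        settings.append({
            "name": "proxy-agent-%s" % ns,             # hypothetical agent name
            "namespace": ns,
            "cache-dir": "%s/sensu-agent-%s" % (base_dir, ns),
            "socket-port": 3030 + 10 * i,              # assumed default: 3030
            "api-port": 3031 + 10 * i,                 # assumed default: 3031
            "statsd-metrics-port": 8125 + i,           # assumed default: 8125
        })
    return settings


def ports_are_unique(settings):
    """True if no two agents share any port."""
    ports = []
    for s in settings:
        ports += [s["socket-port"], s["api-port"], s["statsd-metrics-port"]]
    return len(ports) == len(set(ports))


for s in agent_settings(["ops-0", "ops-1", "ops-2"]):
    print(s)
```

Each dict could then be written out as its own agent config file, with one agent process started per file.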

Thanks very much. Setting up multiple namespaces is a better idea for monitoring a large cluster.

I am currently running Sensu 3.21. Can I deploy etcd 3.4 with it?
etcd 3.4 is said to have better performance.

Sensu 3.21?

I am a little concerned now. I’m not sure what you are referring to. The current release is Sensu 6.2.