Cannot get correct CPU/memory metrics when check scripts run concurrently

Toshiya_Kawasaki · June 29, 2016, 4:58am

The results of the CPU and memory checks become incorrect when the check scripts run concurrently, because the scripts themselves cause high CPU usage.
I want to collect the “correct” CPU and memory metrics.

How can I avoid this problem?

I have attached a screenshot of the top command output.

Smith_Joel_HEALTH_AN · June 29, 2016, 2:19pm

This is unfortunately a side effect of the way the check is written: it reads the CPU counters, waits a second, reads them again, and reports the utilisation over that interval. However, if all checks (or just an expensive one) run at the same time (in response to the client receiving check requests through its RabbitMQ subscriptions), the CPU used by the checks themselves falls inside that one-second window, and you will get biased results.
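A minimal Python sketch of that two-sample approach, assuming a Linux host where the aggregate counters come from the first line of /proc/stat. This is an illustration, not the actual check-cpu.rb code, and the function names are hypothetical:

```python
import time


def parse_cpu_line(line):
    """Return (idle, total) jiffies from the aggregate 'cpu' line of /proc/stat."""
    fields = line.split()
    if fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    # Fields: user nice system idle iowait irq softirq steal [guest guest_nice].
    # guest time is already included in user/nice, so only the first 8 are summed.
    values = [int(v) for v in fields[1:9]]
    idle = values[3] + values[4]  # assumption: treat iowait as idle time
    return idle, sum(values)


def utilisation(before, after):
    """Busy percentage between two (idle, total) samples."""
    idle_delta = after[0] - before[0]
    total_delta = after[1] - before[1]
    if total_delta <= 0:
        return 0.0
    return 100.0 * (total_delta - idle_delta) / total_delta


def read_cpu_line(path="/proc/stat"):
    with open(path) as f:
        return f.readline()  # the first line is the aggregate "cpu" line


def sample_cpu_percent(interval=1.0):
    """Read the counters, sleep `interval` seconds, read again (Linux only)."""
    before = parse_cpu_line(read_cpu_line())
    time.sleep(interval)
    after = parse_cpu_line(read_cpu_line())
    return utilisation(before, after)
```

Anything else that runs during the `time.sleep(interval)` window, including other check scripts started at the same moment, is counted as busy time, which is the bias described above. A longer interval dilutes a short spike but does not remove it.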

One thing you can do is to put in a sleep parameter to sample over a longer period:

check-cpu.rb --sleep 2

This samples over 2 seconds. Since this is just a check, I am less concerned about precise values than about whether there is a problem. Combine this with a minimum number of occurrences before alerting, and I think it is sufficient for an alert.
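To show why an occurrences threshold absorbs the occasional skewed sample, here is a hypothetical Python helper (`OccurrenceFilter` is an invented name, not part of Sensu). In Sensu Classic this is normally configured through the check's `occurrences` attribute and the handler's filtering rather than written by hand, and the real filter also handles refresh intervals, which this simplified sketch ignores:

```python
class OccurrenceFilter:
    """Hypothetical helper: alert only after N consecutive non-OK results."""

    def __init__(self, occurrences=3):
        self.occurrences = occurrences
        self.streak = 0  # consecutive non-OK results seen so far

    def observe(self, status):
        """Record a check exit status (0 = OK, non-zero = problem); return True to alert."""
        if status == 0:
            self.streak = 0
            return False
        self.streak += 1
        return self.streak >= self.occurrences
```

A single critical result caused by concurrent checks resets as soon as the next run comes back OK, so only a sustained problem reaches the handler.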

However, the metrics issue is more of a problem, particularly if you use one of the checks like metric-cpu-mpstat.rb, which also samples the CPU for a second and extrapolates from that. I raised an issue about it:

https://github.com/sensu-plugins/sensu-plugins-cpu-checks/issues/11

If you use metric-cpu.rb, it just reads the CPU counters once and logs them into Graphite (or another time series database). On the next run it logs the counter values again, and the difference gives the true amount of CPU used over the intervening minute. However, you have to calculate the derivative yourself in the query to turn these counter values into a CPU percentage (I do all of this in the Grafana graphs).
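The derivative step amounts to dividing each field's counter delta by the sum of all deltas between two consecutive samples. A minimal Python sketch, assuming each sample is a dict of cumulative jiffy counts keyed by field name (the function name and dict layout are hypothetical, not the output format of metric-cpu.rb):

```python
def cpu_percent_from_counters(prev, curr):
    """Per-field CPU percentages from two samples of cumulative counters.

    prev, curr: dicts mapping a field name (e.g. "user", "system", "idle")
    to a cumulative jiffy count taken at consecutive check runs.
    """
    deltas = {k: curr[k] - prev[k] for k in curr if k in prev}
    if any(d < 0 for d in deltas.values()):
        # Counters only go backwards after a reset such as a reboot.
        raise ValueError("counter reset detected")
    total = sum(deltas.values())
    if total == 0:
        return {k: 0.0 for k in deltas}
    return {k: 100.0 * d / total for k, d in deltas.items()}
```

The dicts should contain every field the collector emits (user, nice, system, idle, iowait, irq, softirq, steal); otherwise the percentages are relative to a partial total. In Graphite the same calculation is usually expressed with functions such as `nonNegativeDerivative` and `asPercent`.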

If you really want your CPU checks to be accurate, I think you would need to query Graphite for the true values, work out the percentage, and then alert on that. I’m not sure whether the existing sensu-plugins-graphite scripts would do this directly or would need tweaking.
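A rough sketch of such a check in Python, using Graphite's render API with `format=json`, which returns a list of series, each with `datapoints` as `[value, timestamp]` pairs (`null` where no data exists). The Graphite URL, the target expression, and the 80/90 thresholds are assumptions, and this is not one of the sensu-plugins-graphite scripts:

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def render_url(base_url, target, window="-10min"):
    """Build a Graphite render API URL that returns JSON."""
    query = urlencode({"target": target, "from": window, "format": "json"})
    return f"{base_url.rstrip('/')}/render?{query}"


def fetch(url, timeout=10):
    with urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def latest_value(render_json):
    """Most recent non-null datapoint of the first series, or None."""
    series = json.loads(render_json)
    if not series:
        return None
    for value, _timestamp in reversed(series[0]["datapoints"]):
        if value is not None:
            return value
    return None


def check_status(value, warn=80.0, crit=90.0):
    """Map a CPU percentage to Sensu exit codes: 0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN."""
    if value is None:
        return 3
    if value >= crit:
        return 2
    if value >= warn:
        return 1
    return 0
```

Usage would look like `check_status(latest_value(fetch(render_url("http://graphite.example:8080", target))))`, where `target` is a hypothetical Graphite expression that already yields a CPU percentage built from the stored counters. Because the percentage comes from counters collected a minute apart, a burst of concurrent checks at collection time no longer dominates the result.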

Cheers,

Joel


Toshiya_Kawasaki · July 6, 2016, 9:40am

Hi, Joel!

Thanks for your reply!

I’ll give it a shot.

Toshiya


Related topics

Topic	Category	Replies	Views	Activity
order of checks may impact CPU usage metrics collected by sensu-client	Sensu Classic (EOL)	0	515	November 6, 2014
set threshold values for metric output	Sensu Classic (EOL)	9	800	March 16, 2016
check on metric threshold, generate alert	Sensu Classic (EOL)	3	490	February 12, 2016
High CPU utilization issue after creating a large number of checks (25+)	Sensu Go (sensu-go, assets, checks)	5	576	May 20, 2023
graphite-api for sensu	Sensu Classic (EOL)	1	479	March 24, 2016