Cannot get correct CPU/memory metrics when check scripts run concurrently


#1

The results of the CPU and memory checks are incorrect when the check scripts run concurrently, because the scripts themselves use a lot of CPU.
I want to collect the “correct” CPU and memory metrics.

How can I avoid this problem?

I have attached a screenshot of the top command.


#2

This is unfortunately a side effect of the way the check is written: it reads the CPU counter, waits a second, reads it again, and reports the utilisation over that interval. However, if all checks (or just one expensive check) run at the same time (in response to the client picking up check requests from its RabbitMQ subscriptions), then you will get biased results.
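To make the sampling approach concrete, here is a minimal Python sketch of "read the counter, wait, read again". It is not the plugin's actual Ruby code; the function names are hypothetical, and it assumes a Linux `/proc/stat` whose first line lists user, nice, system, idle, iowait, irq, softirq and steal, with only the idle field treated as idle time.

```python
import time


def read_cpu_counters(path="/proc/stat"):
    """Return the aggregate CPU counters (in jiffies) from the first line of /proc/stat."""
    with open(path) as f:
        fields = f.readline().split()
    # fields[0] is the label "cpu". Keep user..steal only, because guest time
    # is already included in user time.
    return [int(x) for x in fields[1:9]]


def utilisation(before, after):
    """Percentage of non-idle time between two counter snapshots."""
    deltas = [b - a for a, b in zip(before, after)]
    total = sum(deltas)
    if total <= 0:
        return 0.0
    idle = deltas[3]  # the fourth counter is idle time (assumption stated above)
    return 100.0 * (total - idle) / total


def sample(sleep=1.0):
    """Take two snapshots `sleep` seconds apart and return the busy percentage."""
    before = read_cpu_counters()
    time.sleep(sleep)
    after = read_cpu_counters()
    return utilisation(before, after)


if __name__ == "__main__":
    print(f"{sample():.1f}% busy over 1 s")
```

Anything else that is busy during that short window, including other check scripts started at the same moment, is counted in the result, which is where the bias comes from.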

One thing you can do is to put in a sleep parameter to sample over a longer period:

    check-cpu.rb --sleep 2

which will sample over 2 seconds. Since this is just a check, I am less bothered about precise values and more about whether there is a problem. Tie this in to a number of occurrences, and I think it is sufficient for an alert.
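The "number of occurrences" idea can be sketched like this. It is only an illustration of the logic, not Sensu's own handler code; `should_alert` and the `history` list of exit statuses are hypothetical names.

```python
def should_alert(history, occurrences=3):
    """Return True only if the last `occurrences` check results were all non-OK.

    `history` is a list of exit statuses, oldest first (0 means OK).
    """
    if len(history) < occurrences:
        return False
    return all(status != 0 for status in history[-occurrences:])
```

A single biased sample (one non-zero status surrounded by zeros) never triggers an alert under this rule; only a sustained run of failures does.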

However, the metrics issue is more of a problem, particularly if you use one of the metric scripts like metric-cpu-mpstat.rb, which also samples the CPU for a second and extrapolates from that. I raised an issue about it:

https://github.com/sensu-plugins/sensu-plugins-cpu-checks/issues/11

If you use metric-cpu.rb, it just reads the CPU counter once and logs that into Graphite (or another time series database). On the next run it logs the counter value again, and the difference between the two values gives the true amount of CPU used over the intervening minute. However, you have to calculate the derivative yourself in the Graphite query (or the equivalent in another time series database) to turn these counter values into a CPU percentage (I do all of this in the Grafana graphs).
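As a rough illustration of the derivative step, the Python sketch below turns successive raw counter samples into a percentage per interval. The tuple layout and the function name are assumptions made for the example, not the format metric-cpu.rb emits; the handling of counter resets mirrors what a function such as Graphite's `nonNegativeDerivative` does.

```python
def cpu_percent_series(samples):
    """Convert cumulative counters into per-interval CPU percentages.

    `samples` is a list of (timestamp, busy_jiffies, total_jiffies) tuples,
    oldest first. Returns a list of (timestamp, percent) for each interval.
    """
    out = []
    for (_t0, busy0, total0), (t1, busy1, total1) in zip(samples, samples[1:]):
        d_busy = busy1 - busy0
        d_total = total1 - total0
        if d_busy < 0 or d_total <= 0:
            # Counter went backwards (for example after a reboot): skip the
            # interval rather than report a negative rate.
            continue
        out.append((t1, 100.0 * d_busy / d_total))
    return out


# Three samples taken a minute apart (made-up numbers).
samples = [(0, 1000, 10000), (60, 1600, 16000), (120, 4600, 22000)]
print(cpu_percent_series(samples))
```

Because each value covers the whole interval between two samples, a burst of activity from the check scripts themselves is averaged over a minute instead of dominating a one-second window.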

If you really want your CPU checks to be accurate, I think you would need to query Graphite for the true values, work out the percentage, and then alert on that. I’m not sure whether the existing sensu-plugins-graphite scripts can do this directly or would need tweaking.
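A minimal sketch of what such a check could look like in Python, assuming the JSON has already been fetched from Graphite's render API with `format=json` and that the queried target already yields a CPU percentage. The function name and the thresholds are hypothetical; this is not taken from sensu-plugins-graphite.

```python
import json


def check_from_render(payload, warn=80.0, crit=90.0):
    """Map a Graphite render response (JSON text) to a check status and message.

    The response is a list of series, each with "datapoints" given as
    [value, timestamp] pairs; missing values are null.
    Statuses follow the usual convention: 0 OK, 1 warning, 2 critical, 3 unknown.
    """
    series = json.loads(payload)
    values = [v for s in series for v, _ts in s["datapoints"] if v is not None]
    if not values:
        return 3, "UNKNOWN: no datapoints"
    avg = sum(values) / len(values)
    if avg >= crit:
        return 2, f"CRITICAL: cpu {avg:.1f}%"
    if avg >= warn:
        return 1, f"WARNING: cpu {avg:.1f}%"
    return 0, f"OK: cpu {avg:.1f}%"
```

Averaging over the returned window is one possible choice; alerting on the most recent datapoint or on the maximum would be equally easy to swap in.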

Cheers,

Joel



#3

Hi, Joel!

Thanks for your reply!

I’ll give it a shot :)

Toshiya
