Sensu and influxdb scaling strategies

#1

Hi all,

I am interested in hearing about different options/mechanisms available to allow a deployment using Sensu and influxdb for storing metrics data to scale well, and also if our particular situation warrants additional resources, further customizations etc. First, here is some background about our use of Sensu:

  1. We are using Sensu to monitor around 80 production servers, each of which runs about 15-20 checks. The Sensu server, RabbitMQ and Redis components each run on a separate dedicated machine (VM). The VM hosting the Sensu server has 4 cores and 8GB of RAM. When we first deployed the system, the focus was on getting it up and running and ensuring that it could provide timely notifications for any conditions of interest. For the checks, we took what met our requirements from the Sensu open source code base, and wrote some of our own checks to cover any gaps. We had planned an InfluxDB deployment for the future, but for the initial deployment we just took a few of the standard Sensu metrics Ruby scripts available on GitHub (CPU, RAM, disk, vmstat, interface stats) and used them to gather data from all the client nodes once every 10 minutes, logging this data to text files on the server. This is obviously not very useful, other than showing us that the Sensu server VM can handle this load without a problem. So far, this system has run relatively well, and we have not had any significant issues to deal with, except perhaps on the UI and usability side (that’s a subject for another thread, though).

  2. For the next phase, we wanted to use InfluxDB for storing metrics data and Grafana for providing dashboards of interest. We found that the raw metrics data provided by the stock Ruby scripts was not very useful for visualization, so we modified some of the metrics checks and also wrote our own handler (in Python) to log the data to InfluxDB. We also shortened the interval at which we gather the metrics data from once every 10 minutes (which was not very useful to begin with) to once every 60 seconds. We created some dashboards based on this data in Grafana and tested this out in a lab setup of about 30 nodes, and everything seemed to work fine. So far, so good.

  3. However, when we deployed these changes in the production environment, we found that the CPU utilization on the Sensu server was fairly high (>70%) for sustained periods of time, and that the server was not able to keep up with the metrics data being generated by the clients. We found that the bulk of the CPU resources were being used up in writing the data to InfluxDB, and that there was a significant delay before this data showed up in the database after the server had received it from the clients. We also found that this delayed the processing of check data from clients; for instance, email alerts were not being generated promptly by the server when a client reported an error condition from a Sensu check that it ran. Based on these observations, we reverted the metrics collection interval from the new value of 60 seconds (1 minute) back to 600 seconds (10 minutes), and this addressed all the issues I described above. This makes it clear that the problem is that the Sensu server was not able to process all the metrics data that was being sent to it by the clients.

At this point, I would like to understand what our options are in order to get this to work properly. A few things come to mind:

  1. Add more CPU cores to the Sensu master VM, or maybe add another Sensu server on a different machine, to give the Sensu server the processing power it needs to handle the extra processing of metrics data. This may be the easiest thing to do in the short term, but it does not really address any of the fundamental issues we have seen here, and is at best a short term measure. Before I make this recommendation to my team, I would like to evaluate what our options are first.

  2. From what I have been able to read so far, it appears that the poor performance of influxdb writes is a relatively well-known cause for scaling problems. Is this correct? If so, what can be done to fix this? Are there any examples of how to address this if this is indeed the issue?

  3. Is there some data available about scaling using influxdb and Sensu? How can I determine if the existing resources allocated for the Sensu server VM are sufficient? Also, how can I size my server resources to ensure that they are adequate to handle the data being generated by a specific number of clients?

  4. Ideally, I would like to implement a solution that does not require a lot of new development on our side to address this issue. One of our design goals is to be flexible and allow for more data to be collected in future if necessary. Is this possible using existing tools?

I would be very interested in hearing what the folks here have to say about this. I am sure this topic has come up before, so I apologize if I have missed something; I have not had a chance to go through related messages on this issue.

Thanks in advance for any help!

-Manu

#2
  1. I’m glad you are asking before throwing cores at the solution.

  2. Is the performance problem really influxdb? Or is it the sensu infrastructure + handler stuff?

  3. I think this is too hard to figure out without experimenting, but in general you multiply clients * metric count * frequency to get a metrics/second figure, and then size influxdb to handle that rate.

  4. Depending on how much you consider existing, yes.
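
As a rough worked example of the formula in point 3, here is the arithmetic in Python (the language of the handler in this thread). Only the 80 clients and the 600 s and 60 s intervals come from the thread; the 200 points per client per run is a made-up figure, so plug in your own measured count.

```python
def points_per_second(clients: int, points_per_client: int, interval_seconds: float) -> float:
    """Average number of metric points the pipeline has to absorb each second."""
    return clients * points_per_client / interval_seconds


CLIENTS = 80             # from the thread
POINTS_PER_CLIENT = 200  # assumption: total points one client emits per interval

before = points_per_second(CLIENTS, POINTS_PER_CLIENT, 600)  # 10-minute interval
after = points_per_second(CLIENTS, POINTS_PER_CLIENT, 60)    # 1-minute interval

print(f"{before:.1f} -> {after:.1f} points/s")  # prints "26.7 -> 266.7 points/s"
```

Whatever the real per-client count is, going from 600 s to 60 s multiplies the rate by ten, and the number of handler invocations on the server grows by the same factor.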

The main bottleneck for Sensu (for metrics and otherwise) is the handler execution.

Longer term, for high-volume metrics you will need to move to a different execution model and use an extension. Here is an influxdb one:

https://github.com/seegno/sensu-influxdb-extension

Read up on what extensions are to be sure to understand the difference:

https://sensuapp.org/docs/0.25/reference/extensions.html#what-is-a-sensu-extension

But it means you will have to give up your own python handler :frowning:

For light metrics collection I think normal handlers are fine.

For medium metrics collection you have to move to extensions and drop the process overhead.

For really hard-core metrics collection I think you have to give up, bypass the sensu infrastructure, and run a direct collector (like collectd) => influxdb.

···

On Fri, Jul 8, 2016 at 11:29 AM, manupathak@gmail.com wrote:


#3

Hi Kyle,

Thanks for the quick response. Some responses inline:

> 1. I’m glad you are asking before throwing cores at the solution.

We don’t want to do that, but out of curiosity, how much would doubling the CPU resources help? Any experience with that?

> 2. Is the performance really influxdb? Or is it the sensu infrastructure + handler stuff?

We don’t have conclusive data yet, but I think the problem is not influxdb, but how the Sensu infrastructure + handler is writing to influxdb. From the Sensu server logs, it appeared that there were many calls to the influxdb handler being made, and that may have been the reason why the server was not able to keep up (either that or the fact that the writes were taking a long time). My knowledge of influxdb is limited, but I believe that it scales pretty well, so it is likely that the bottleneck is on the Sensu infrastructure side, and not on the influxdb side. Does that make sense?

> 3. I think this question is a little too hard to just figure out non-experimentally, but in general it is going to be clients * metric count * frequency to come up with a metrics/second and then sizing influxdb to handle that capacity.

I understand that. I was wondering if anybody had any data they could share about what works and what doesn’t in terms of actual numbers.

> 4. Depending on how much you consider existing, yes.
>
> The main bottleneck for Sensu (for metrics and otherwise) is the handler execution.

Could you elaborate what you mean by “depending on how much you consider existing”? But thanks for confirming. This matches my understanding.

> Longer term, for high volume metric you will need to move to a different execution model and use an extension. Here is an influxdb one:
>
> https://github.com/seegno/sensu-influxdb-extension
>
> Read up on what extensions are to be sure to understand the difference:
>
> https://sensuapp.org/docs/0.25/reference/extensions.html#what-is-a-sensu-extension
>
> But it means you will have to give up your own python handler :frowning:

Interesting. Thanks for the pointer. A couple of questions about this:

  1. Can extensions only be written in Ruby or is it possible to write our own extensions in Python as well? I would imagine it should be possible to write our own Python extension, or is that not an option? Of course, this does mean we would need to understand the internals of Sensu a lot better than we do right now.

  2. Assuming we cannot use the default handler for some reason, what are our options to improve scalability? I read something about batching influx writes and also staggering check execution to allow the writes to be distributed more evenly. Is that possible? And does batching influx writes help? Are there any other things we could do which would help with this?

> For light metrics collection I think normal handlers are fine.
>
> For medium metrics collection you have to move to extensions and drop the process overhead.
>
> For really hard-core metrics distribution I think you have to give up and bypass the sensu infrastructure and do direct collector (like collectd) => influxdb.

Thanks for the recommendation. We are still trying to figure out what our future direction should be, but given that we will be adding a lot more metrics in the future, I suspect we may eventually have to do something similar to the third option above (a direct collector writing to influxdb).

Thanks,

-Manu

···

On Saturday, July 9, 2016 at 1:14:59 PM UTC-4, Kyle Anderson wrote:


#4


> Could you elaborate what you mean by "depending on how much you consider existing"? But thanks for confirming. This matches my understanding.

If you consider the sensu installation (rabbitmq, etc) as existing, then yeah, the extension route is easy. You would have to give up your python handler stuff, though.


> Interesting. Thanks for the pointer. A couple of questions about this:
>
> 1. Can extensions only be written in Ruby or is it possible to write our own extensions in Python as well? I would imagine it should be possible to write our own Python extension, or is that not an option? Of course, this does mean we would need to understand the internals of Sensu a lot better than we do right now.

Nope, ruby only. An extension runs inside the sensu-server process.

> 2. Assuming we cannot use the default handler for some reason, what are our options to improve scalability? I read something about batching influx writes and also staggering check execution to allow the writes to be distributed more evenly. Is that possible? And does batching influx writes help? Are there any other things we could do which would help with this?

Hmmm. Seems like batching influxdb writes should help, but I don't know how you would do that.

Sensu does stagger checks:

But I'm pretty sure this only works on standalone checks?
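
On the batching question, here is a minimal sketch of what batching could look like in Python. It assumes an InfluxDB 1.x style HTTP `/write` endpoint that accepts newline-separated line-protocol points; `BatchWriter`, `post_lines` and the thresholds are made-up names and values, not part of Sensu or of any InfluxDB client. A pipe handler is forked once per event, so a buffer like this only pays off inside a long-running process, or when one event's output already carries many points.

```python
import time
import urllib.parse
import urllib.request
from typing import Callable, List


def post_lines(base_url: str, database: str, body: str) -> None:
    """POST one newline-separated batch to the InfluxDB 1.x /write endpoint."""
    query = urllib.parse.urlencode({"db": database, "precision": "s"})
    request = urllib.request.Request(
        f"{base_url}/write?{query}", data=body.encode("utf-8"), method="POST"
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        response.read()


class BatchWriter:
    """Buffer line-protocol points and hand them to `send` in batches."""

    def __init__(self, send: Callable[[str], None], max_points: int = 500,
                 max_age_seconds: float = 5.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._send = send
        self._max_points = max_points    # hypothetical threshold, tune by testing
        self._max_age = max_age_seconds  # hypothetical threshold, tune by testing
        self._clock = clock
        self._lines: List[str] = []
        self._oldest = 0.0

    def add(self, line: str) -> None:
        """Queue one point; flush when the buffer is full or its oldest point is too old."""
        if not self._lines:
            self._oldest = self._clock()
        self._lines.append(line)
        too_many = len(self._lines) >= self._max_points
        too_old = self._clock() - self._oldest >= self._max_age
        if too_many or too_old:
            self.flush()

    def flush(self) -> None:
        """Send everything buffered so far as a single request body."""
        if not self._lines:
            return
        body = "\n".join(self._lines)
        self._lines = []  # points are lost if send() raises; acceptable for a sketch
        self._send(body)
```

Usage would be something like `BatchWriter(lambda body: post_lines("http://localhost:8086", "sensu", body))`, where the URL and database name are placeholders. The age check only runs when a point is added, so a real service would also call `flush()` on a timer and at shutdown.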

···

On Wed, Jul 13, 2016 at 12:19 PM, <manupathak@gmail.com> wrote:


#5

another option you can consider is using a tcp handler: writing a server that receives events from sensu and transduces them into influxdb. that would avoid all the forking of pipe handlers but let you write in whatever language you choose, unlike an extension.

that said. i’d vote for a good extension. :slight_smile:
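
a rough sketch of the tcp handler idea in python, under some assumptions: that the sensu tcp handler opens a connection, writes one JSON event and closes it, and that the metric checks print graphite plaintext lines (`path value timestamp`). the `host` tag, the `value` field, and the names `event_to_lines`, `EventHandler` and `serve` are made up for illustration.

```python
import json
import math
import socketserver
from typing import Callable, List


def _escape_measurement(name: str) -> str:
    """Escape commas and spaces, which are special in a measurement name."""
    return name.replace(",", "\\,").replace(" ", "\\ ")


def _escape_tag(value: str) -> str:
    """Escape commas, spaces and equals signs, which are special in tag values."""
    return value.replace(",", "\\,").replace(" ", "\\ ").replace("=", "\\=")


def event_to_lines(event: dict) -> List[str]:
    """Turn the graphite-style output of one event into line-protocol points."""
    host = _escape_tag(event["client"]["name"])
    lines = []
    for row in event["check"]["output"].splitlines():
        parts = row.split()
        if len(parts) != 3:
            continue  # not a "path value timestamp" line
        path, value, timestamp = parts
        try:
            number = float(value)
            seconds = int(timestamp)
        except ValueError:
            continue
        if not math.isfinite(number):
            continue  # skip NaN and infinities rather than send them on
        lines.append(f"{_escape_measurement(path)},host={host} value={number} {seconds}")
    return lines


class EventHandler(socketserver.StreamRequestHandler):
    """Read one JSON event per connection and pass its points to the server's sink."""

    def handle(self) -> None:
        raw = self.rfile.read()  # assumption: the sender closes the socket when done
        try:
            lines = event_to_lines(json.loads(raw.decode("utf-8")))
        except (ValueError, KeyError, TypeError, AttributeError):
            return  # malformed event; a real service would log this
        self.server.sink(lines)


def serve(host: str, port: int, sink: Callable[[List[str]], None]) -> None:
    """Run the receiver until interrupted; `sink` gets each event's points."""
    with socketserver.ThreadingTCPServer((host, port), EventHandler) as server:
        server.sink = sink
        server.serve_forever()
```

`serve("0.0.0.0", 9000, print)` would just print the points (the port is a placeholder); in practice the sink would buffer them and write to influxdb in batches. one long-running process replaces a fork per event, which is the point of this approach.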

···

On Wednesday, July 13, 2016 at 6:13:06 PM UTC-7, Kyle Anderson wrote:

On Wed, Jul 13, 2016 at 12:19 PM, manup...@gmail.com wrote:

  1. Can extensions only be written in Ruby or is it possible to write our own extensions in Python as well? I would imagine it should be possible to write our own Python extension, or is that not an option? Of course, this does mean we would need to understand the internals of Sensu a lot better than we do right now.

Nope, ruby only. It runs in the sensu-server process.