I am interested in hearing about different options/mechanisms available to allow a deployment using Sensu and influxdb for storing metrics data to scale well, and also if our particular situation warrants additional resources, further customizations etc. First, here is some background about our use of Sensu:
We are using Sensu to monitor around 80 production servers, each of which is running about 15-20 checks. The Sensu server, rabbitmq and redis components each run on their own dedicated machine (VM). The VM hosting the Sensu server has 4 cores and 8GB of RAM. When we first deployed the system, the focus was on getting it up and running and ensuring that it could provide timely notifications for any conditions of interest. For the checks, we took what met our requirements from the Sensu open source code base, and wrote some of our own checks to cover any gaps. We had also planned an influxdb deployment in the future, but for the initial deployment, we just took a few (CPU, RAM, disk, vmstat, interface stats) of the standard Sensu metrics Ruby scripts available on GitHub and used them to gather data from all the client nodes once every 10 minutes, just logging this data to text files on the server. This is obviously not very useful, other than showing us that the Sensu server VM can handle this load without a problem. So far, this system has run relatively well, and we have not had any significant issues to deal with, except perhaps on the UI and usability side (that’s a subject for another thread, though).
For the next phase, we wanted to use influxdb for storing metrics data and use Grafana for providing dashboards of interest. We found that the raw metrics data being provided by the stock Ruby scripts was not very useful for visualization, so we modified some of the metrics checks and also wrote our own handler (in Python) to log the data to influxdb. We also shortened the interval at which we were gathering the metrics data from once every 10 minutes (which was not very useful to begin with) to once every 60 seconds (one minute). We created some dashboards based on this data in Grafana and tested this out in a lab setup of about 30 nodes, and everything seemed to work fine. So far, so good.
However, when we deployed these changes in the production environment, we found that the CPU utilization on the Sensu server was fairly high (>70%) for sustained periods of time, and that the server was not able to keep up with the metrics data that was being generated by the clients. We found that the bulk of the CPU resources were being used up in writing the data to influxdb, and that there was a significant delay before this data showed up in the database after the server had received it from the clients. We also found that this delayed the processing of check data from clients; for instance, email alerts were not being generated promptly by the server when a client reported an error condition from a Sensu check that it ran. Based on these observations, we reverted the metrics collection interval from the new value of 60 seconds (1 minute) back to 600 seconds (10 minutes), and this addressed all the issues I described above. This suggests that, at the shorter interval, the Sensu server was not able to process all the metrics data that was being sent to it by the clients.
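To put rough numbers on the change in load, here is a back-of-envelope sketch. It assumes that all five metrics checks listed earlier run on every one of the 80 clients and that each check result is handled individually on the server; both are simplifying assumptions, not measurements.

```python
# Back-of-envelope rate of metric check results reaching the Sensu server.
CLIENTS = 80        # production servers (from the description above)
METRIC_CHECKS = 5   # CPU, RAM, disk, vmstat, interface stats (assumed: all run everywhere)


def results_per_second(interval_s: int) -> float:
    """Metric check results arriving at the server per second."""
    return CLIENTS * METRIC_CHECKS / interval_s


old = results_per_second(600)  # original 10-minute interval
new = results_per_second(60)   # new 1-minute interval

print(f"600 s interval: {old:.2f} results/s")
print(f"60 s interval: {new:.2f} results/s ({new / old:.0f}x)")
```

So the interval change alone is a tenfold increase in metric results, on top of the regular (non-metric) checks the server was already handling.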
At this point, I would like to understand what our options are in order to get this to work properly. A few things come to mind:
Add more CPU cores to the Sensu master VM, or maybe add another Sensu server on a different machine, to give Sensu the processing power it needs to handle the extra metrics data. This may be the easiest thing to do in the short term, but it does not really address any of the fundamental issues we have seen here, and is at best a stopgap. Before I make this recommendation to my team, I would like to evaluate what our other options are.
From what I have been able to read so far, it appears that poor performance of influxdb writes is a relatively well-known cause of scaling problems. Is this correct? If so, what can be done to fix this? Are there any examples of how to address this if this is indeed the issue?
Is there some data available about scaling using influxdb and Sensu? How can I determine if the existing resources allocated for the Sensu server VM are sufficient? Also, how can I size my server resources to ensure that they are adequate to handle the data being generated by a specific number of clients?
Ideally, I would like to implement a solution that does not require a lot of new development on our side to address this issue. One of our design goals is to be flexible and allow for more data to be collected in future if necessary. Is this possible using existing tools?
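On the influxdb write question above, one approach I have seen mentioned is batching many points into a single write instead of writing them one at a time. The sketch below is only meant to make that idea concrete. It assumes the checks emit Graphite plaintext lines ("path value timestamp") and that a `host` tag is enough to identify the client; the function names are hypothetical and this is not our actual handler.

```python
import math


def _escape_tag(text: str) -> str:
    """Escape characters that are special in line-protocol tag values."""
    return text.replace(",", r"\,").replace("=", r"\=").replace(" ", r"\ ")


def graphite_to_line_protocol(output: str, host: str) -> str:
    """Turn Graphite plaintext output into one line-protocol payload.

    Each valid input line becomes "<path>,host=<host> value=<v> <ts>",
    with timestamps left in seconds. Malformed lines are skipped.
    """
    points = []
    for raw in output.splitlines():
        parts = raw.split()
        if len(parts) != 3:
            continue  # blank or malformed line
        path, value, ts = parts
        try:
            number = float(value)
            seconds = int(ts)
        except ValueError:
            continue  # non-numeric value or timestamp
        if not math.isfinite(number):
            continue  # NaN and infinity are not valid field values
        measurement = path.replace(",", r"\,")
        points.append(
            f"{measurement},host={_escape_tag(host)} value={number} {seconds}"
        )
    # One newline-joined body can be sent as a single write request.
    return "\n".join(points)


sample = "web1.cpu.user 12.5 1400000000\nweb1.cpu.system 3 1400000060\n"
print(graphite_to_line_protocol(sample, "web1"))
```

Since the timestamps here are in seconds, the write request would need to declare second precision. Whether batching like this actually helps in our case depends on where the CPU time is really going, which I have not profiled.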
I would be very interested in hearing what the folks here have to say about this. I am sure this topic has come up before, so I apologize if I have missed something; I have not had a chance to go through related messages on this issue.
Thanks in advance for any help!