We generally avoid alerting from log contents - we alert from direct checks (Sensu and Sensu-routed Monit checks, where the latter are used because monit is doing fast local checks and process control), from metrics thresholds, and from a variety of other checks both on and off-net (external checks of our service apis and “ping” apis that we require all internally built applications to have).
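For concreteness, a direct check of that kind might look like the following Sensu (0.x-era) check definition. This is a sketch, not our actual config: the check name, plugin flags, and subscriber list are illustrative, and `check-procs.rb` is one of the community process-check plugins.

```json
{
  "checks": {
    "nginx_process": {
      "command": "check-procs.rb -p nginx",
      "subscribers": ["webservers"],
      "interval": 60,
      "handlers": ["default"]
    }
  }
}
```

The point is that the check runs directly against the process, independent of the logging pipeline.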
I have a number of issues with using logs as a primary alert source:
- Log volume is very high - many tens of GB per day for each environment. Parsing all of that text - even JSON-formatted events - is relatively costly.
- Logs are relatively "brittle" - application developers change log formats all the time, and it's difficult to track those changes, which results in false positives or, even worse, false negatives.
- There are a lot of moving parts in the log infrastructure, and slowness or interruptions in the log pipeline would leave us blind to alerts. This is a simple economic trade-off: we could make logs more reliable, but there are costs to doing so, and we'd rather invest in application performance and resilience.
I prefer to have explicitly different "streams" for metrics, logs, alerts, and health checks. They can and do all feed each other, but each is distinct and each is optimized for its purpose rather than trying to be one-size-fits-all.
Where some prefer a "single source of truth", I prefer to have a "second opinion" - when an alert fires, we turn to our metrics and logs to understand what is happening. Indeed, when our log volume drops unexpectedly, we alert. This is why we love tools like Sensu - its toolchain/pipeline approach makes it easy to integrate with our other systems.
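That "alert when log volume drops" check can be sketched as a small Nagios/Sensu-style check script. The thresholds and the events-per-minute input are hypothetical, not numbers from our environment; the only real convention here is the exit-code contract (0 = OK, 1 = WARNING, 2 = CRITICAL) that Sensu inherits from Nagios.

```python
#!/usr/bin/env python
"""Hypothetical check: alert when log throughput drops too low.

Thresholds are illustrative; in practice the events-per-minute figure
would come from a metrics store, not a hard-coded value.
"""

WARN_EPM = 500   # warn below 500 events/minute (illustrative)
CRIT_EPM = 100   # critical below 100 events/minute (illustrative)

def check_log_volume(events_per_minute):
    """Return (exit_code, message) per the Nagios/Sensu convention."""
    if events_per_minute < CRIT_EPM:
        return 2, "CRITICAL: log volume %d events/min" % events_per_minute
    if events_per_minute < WARN_EPM:
        return 1, "WARNING: log volume %d events/min" % events_per_minute
    return 0, "OK: log volume %d events/min" % events_per_minute

if __name__ == "__main__":
    # Demo with a healthy volume; a real check would exit with the code.
    code, message = check_log_volume(1000)
    print(message)
```

A check like this treats the log pipeline itself as a monitored component, which is exactly the "second opinion" idea: the logs don't generate the alert, their absence does.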
Sometimes, we don’t have a choice - if we’re using off-the-shelf software, we sometimes can’t find a better way to alert than to scrape logs. In that case, we do it at the source (in the “log shipper” in our Logstash logging infra).
In fact, the only case I can think of in our current stack where we were doing that was fixed in a later release of the component, and we’re now alerting off of collectd stats wired into it - 500 errors in its HTTP logs, in particular.
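Counting 500s out of an HTTP access log with collectd is typically done with the `tail` plugin. A minimal sketch, assuming an nginx-style access log (the path and regex are illustrative, not our actual config):

```
LoadPlugin tail
<Plugin "tail">
  <File "/var/log/nginx/access.log">
    Instance "nginx"
    <Match>
      # Count lines whose status code field is 500
      Regex "\" 500 "
      DSType "CounterInc"
      Type "counter"
      Instance "http_500"
    </Match>
  </File>
</Plugin>
```

The resulting counter can then feed a normal metrics-threshold alert, keeping the alerting path in the metrics stream rather than the log pipeline.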
On March 22, 2015 at 1:38:17 PM, Thomas Güttler (firstname.lastname@example.org) wrote:
On Saturday, March 21, 2015 at 17:39:53 UTC+1, Reinhardt Quelle wrote:
One of the better ways we’ve found to monitor log files (which I agree is something to be avoided when possible) is to use monit: http://mmonit.com/monit/documentation/monit.html#FILE-CONTENT-TESTING
Since we have all of our monit events routing through Sensu, the net result is that we get what I think you may be looking for.
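A monit content test of the kind referenced above might look like this (a minimal sketch; the file path and match pattern are illustrative):

```
check file app_log with path /var/log/app/app.log
  if match "FATAL" then alert
```

When the pattern matches, monit raises an event, and with monit events routed through Sensu that event becomes a normal Sensu alert.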
I understand if someone says “monitoring logs with sensu is not a good solution”.
But I don’t understand your first sentence: do you avoid checking logs at all?