System Visibility, Splunk, and the ELK Stack
Principal Software Engineer
Wednesday, January 13, 2016
It was another busy holiday season here at Bronto Engineering. We set new records on Black Friday, only to break them three days later on Cyber Monday. All the while, we closely monitored our systems, measuring and recording observations that will inform our planning process for the coming year. If history is any indication, today's record volume will soon be just another average day. We need to be ready.

None of this would be possible without exhaustive visibility into our production systems. If we can't see it, we can't measure it. If we can't measure it, we can't pinpoint potential problem areas or determine the efficacy of our solutions. To maintain visibility into our production systems, we collect everything, and I mean everything, that might be of interest. System-level metrics, network metrics, application metrics, log files of all kinds ... we want them all, and we want them at the tips of our fingers. We believe it is better to over-collect and risk capturing unnecessary data points than to find ourselves missing important information when we need it most. Storage is cheaper than downtime.

To accomplish this, we use a suite of technologies that are probably familiar to you, including Dropwizard Metrics, StatSite, Graphite, Grafana, Cacti, and Splunk, and we're always on the lookout for more tools to throw into our toolkit.

Of course, collecting exhaustive forensics about a scaled-out system can become a scaling challenge of its own. One area where we've been feeling some pain is with Splunk. Splunk has served us well over the years, but its licenses are sold based on the maximum amount of uncompressed data that can be indexed per day.
Go over that limit five times within a 30-day period, and Splunk will disallow searches on your data, rendering it inaccessible.* With our significant seasonal load variations, we end up paying for a peak multiplier well over our typical daily usage, which forces us to presuppose which data is "important" enough to include in the index. For this reason, we are working to remove Splunk from our infrastructure.

Its replacement will be the ELK (Elasticsearch, Logstash, Kibana) stack, which is quickly gaining a reputation as an excellent, open-source alternative to Splunk. The system is highly capable, the documentation is clear and consistent, and the user base is friendly and active. Elastic offers subscription-based support for teams that need some help, as well as licensed plugins that address point solutions. For engineering groups with a DIY mindset that don't mind digging into new technologies, it's a great fit. More importantly, it allows us to scale up our deployment based on the data we want to make available, rather than leaving things out or wondering whether we're going to run over our license quota on a busy day.

Where are we in this transition? So far, we've completed the infrastructure rollout for the ELK pipeline and started migrating data sources from Splunk to ELK. We've already moved some of our highest-volume sources (e.g., HAProxy access logs) with great success. Our relatively modest cluster can rip through billions of events in seconds, providing fast access to the data we need.

Take a look at ELK. I think you'll like it ... a lot.

*An earlier version of this post stated that Splunk disables indexing after passing the daily ingestion threshold. That statement has been corrected to say that it disables search after passing that mark five times within 30 days.
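If you're curious what migrating a source like HAProxy access logs can look like, here is a minimal Logstash pipeline sketch. The file path, host, and index name are illustrative placeholders, not our production configuration; the `HAPROXYHTTP` grok pattern, however, ships with Logstash and parses HTTP-mode access lines into structured fields.

```
# Minimal sketch of a Logstash pipeline for HAProxy access logs.
# Paths, hosts, and index names are placeholders for illustration.
input {
  file {
    path => "/var/log/haproxy.log"
  }
}

filter {
  # HAPROXYHTTP is a grok pattern bundled with Logstash; it extracts
  # fields such as client_ip, http_verb, http_status_code, and
  # time_duration from an HTTP-mode access line.
  grok {
    match => { "message" => "%{HAPROXYHTTP}" }
  }
  # Use HAProxy's accept date as the event timestamp.
  date {
    match => [ "accept_date", "dd/MMM/yyyy:HH:mm:ss.SSS" ]
  }
}

output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "haproxy-%{+YYYY.MM.dd}"
  }
}
```

Writing to date-stamped indices like this makes retention a matter of dropping old indices, so capacity planning is driven by the data you choose to keep rather than by a license quota.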