Splunk

It’s been slightly over a year now since Points.com rolled out Splunk across our systems. For those not in the know, Splunk is basically a datastore for unstructured data – it’s generally described as centralized logging, and it will certainly fill that role, but it’s capable of a lot more. This post is an overview of the things I wish I had planned better beforehand.

Server Setup

Splunk is actually incredibly flexible when it comes to the server setup. I’ve run the server as a lightweight VM in several places, tested clustering, installed it on bare-metal systems and everything in between.

We eventually found that in our environment the majority of searches are I/O bound rather than CPU bound. This drove me to settle on a bare-metal server with 8 cores, 32GB of memory and about 16 terabytes of usable storage. Splunk itself currently stores slightly more than half a terabyte of data on this server; however, we have been adding more and more data to our indexers and anticipate lots of growth. A large part of Splunk’s value is never being required to purge old data unless you choose to, which also makes the high storage capacity worthwhile.

Agent Configuration, Deployment

Normally these would be two subjects; however, if you’re at all sane you’ll be using some kind of configuration management (C.M.) tool that should merge these processes for you (or at least centralize them). The two primary C.M. systems right now are Puppet and Chef. At Points we’ve been using Puppet for a couple years and it’s worked great; we periodically compare the two and have gone with Puppet each time. Eventually I’ll write a post or two on Puppet as well.

Anyway, Puppet in our case manages the distribution of the Splunk forwarders. We use the lightweight “Universal Forwarders” deployed to every machine we have (physical and virtual). To accomplish this, we have our own internal package repositories for Ubuntu and Red Hat Linux that contain the Splunk universal forwarder package. A simple Puppet rule ensures that each server has the package present.
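
The Puppet side really can be that small – roughly a package resource plus a service resource. A minimal sketch (the package name matches how Splunk ships the universal forwarder; the service name assumes the init script that “splunk enable boot-start” installs):

# Ensure the universal forwarder package from our internal repo is installed
package { 'splunkforwarder':
  ensure => installed,
}

# Keep the forwarder running; 'splunk' is the init script name created by boot-start
service { 'splunk':
  ensure  => running,
  enable  => true,
  require => Package['splunkforwarder'],
}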

The second piece, which is slightly more complex, is the agent configuration. There are only two files we distribute for this – inputs.conf and outputs.conf. The outputs.conf file is static and points to the Splunk server(s) that collect the log data. The inputs.conf file is what actually defines where the forwarder gathers data from.
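
For reference, a minimal outputs.conf looks something like this (the indexer hostname is a placeholder; 9997 is just the conventional receiving port):

[tcpout]
defaultGroup = primary_indexers

[tcpout:primary_indexers]
server = splunk01.example.com:9997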

Our inputs.conf file is stored as an ERB template in our Puppet repository. An ERB template is just a text file that allows you to embed Ruby code or variables in it. This lets us use Puppet to configure a set of rules regarding which files are monitored and which index they’re stored in (as well as what sourcetypes they’re filed under).

For Points systems we have a number of indexes configured in Splunk. We use logic in Puppet to segregate our application test environments into their own index (so that we can prune that data more quickly than production data), as well as tag everything we can with consistent sourcetypes.
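
To give a feel for it, here is a stripped-down sketch of such a template, where the target index depends on the environment a node belongs to. The variable names (@app_environment in particular) and paths are illustrative, not our actual template:

<%# inputs.conf.erb - illustrative sketch only %>
[default]
host = <%= @fqdn %>

[monitor:///logs/epoch/epoch-debug.log*]
disabled = false
sourcetype = epoch-debug
<% if @app_environment == 'production' -%>
index = applications
<% else -%>
index = testenvironments
<% end -%>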

Authorization

This one is simple: give everyone access to Splunk. Connect it to your identity management system (Active Directory for 99% of you) and allow all users the basic functionality of Splunk. Set up a group for Splunk administrators, and allow only them to do things like delete data.
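
For the curious, the Active Directory hookup lives in authentication.conf. A heavily trimmed sketch – every hostname, DN and group name below is made up:

[authentication]
authType = LDAP
authSettings = corp_ad

[corp_ad]
host = dc01.example.com
port = 389
bindDN = CN=splunk-bind,OU=Service Accounts,DC=example,DC=com
userBaseDN = OU=Users,DC=example,DC=com
userNameAttribute = sAMAccountName
groupBaseDN = OU=Groups,DC=example,DC=com
groupNameAttribute = cn
groupMemberAttribute = member

# Map AD groups to Splunk roles
[roleMap_corp_ad]
admin = Splunk Admins
user = Domain Users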

You probably have lots of data that you don’t want everyone to see though, which brings us to…

Access Control and Index Configuration

These two go together for one main reason – the quickest and most effective way of enforcing access control in Splunk is with per-index constraints. Additionally, data aging policies are set on each index individually. At Points I’ve got the following primary indexes in Splunk:

  • applications
  • firewalls
  • operations
  • testenvironments
  • weblogs
  • windows
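
Each of the indexes above is defined in indexes.conf on the indexer, which is also where the data aging policy lives. A minimal sketch for two of them, with illustrative paths and retention values:

[applications]
homePath   = $SPLUNK_DB/applications/db
coldPath   = $SPLUNK_DB/applications/colddb
thawedPath = $SPLUNK_DB/applications/thaweddb
# keep production application data for roughly two years (value in seconds)
frozenTimePeriodInSecs = 63072000

[testenvironments]
homePath   = $SPLUNK_DB/testenvironments/db
coldPath   = $SPLUNK_DB/testenvironments/colddb
thawedPath = $SPLUNK_DB/testenvironments/thaweddb
# test data can roll off after about 30 days
frozenTimePeriodInSecs = 2592000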

Everything a developer would be interested in is contained in the “applications” index. All our application logs and container logs (such as Tomcat, WebLogic, JBoss, Apache) are logged here. I’ve segmented this though, so that the “testenvironments” index has the exact same data but for a separate group of systems. That way all of the production data is in “applications”, and test data can be searched, indexed and purged in a completely separate fashion.

Inside “weblogs” are all web access logs. Apache and an IIS server or two log their traffic here.

The “operations” index contains data that is of most interest to my team – system logs, performance metrics, that kind of thing. Everyone has access to this as there’s no sensitive information here.

The other two indexes – “firewalls” and “windows” – do contain sensitive information, and only Splunk administrators can search and use these indexes. This is enforced through standard Active Directory groups that are mapped in Splunk to capabilities on these indexes.
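
On the Splunk side that mapping lands in authorize.conf. A sketch with illustrative role names and index lists – regular users simply never get the sensitive indexes listed in their role:

[role_splunk_admin]
importRoles = admin
srchIndexesAllowed = *
srchIndexesDefault = applications

[role_splunk_user]
importRoles = user
srchIndexesAllowed = applications;operations;testenvironments;weblogs
srchIndexesDefault = applications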

At first, of course, we just dumped everything into the default “main” index; you can definitely do that, but it will most likely bite you later when you either want to restrict access to some data or want to clean up your Splunk database.

Adding Metadata with Sourcetypes

To get the most out of Splunk it’s very important to define sourcetypes for your data at the start of the deployment. You don’t ever want Splunk to auto-create or auto-determine a sourcetype, so be as explicit as possible in your configuration. This will allow you to later define extractions based on that sourcetype.

I’ll give you an example of where the sourcetype is crucial. One of our primary applications here logs output to a file named “epoch-debug.log”. The file is rotated by WebLogic based on filesize, so in the directory we may have ten or more files with the rotation count appended to the name. When we started using Splunk, we simply pointed inputs.conf at this directory like this:

[monitor:///logs/epoch/epoch-debug.log*]
disabled = false
followTail = 0
index = applications

By itself this was fine at first. It let us do queries against that log with this syntax:

source="/logs/epoch/*"

There are two problems that come with this later. Once we started extracting values from the log at search time, we lacked a clean way to attach those extractions to that specific log file. This is where we configure Splunk to recognize that all of the files in that monitor stanza are one sourcetype:

[monitor:///logs/epoch/epoch-debug.log*]
disabled = false
followTail = 0
index = applications
sourcetype = epoch-debug

With that simple addition, we can now define extractions that pull information from the log data and apply them to that application’s output in any environment.
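
To show the shape of those extractions: search-time field extractions can be attached to the sourcetype in props.conf. The field names and regexes below are hypothetical examples, not our actual extractions:

[epoch-debug]
# pull the log level out of the start of each event into a "log_level" field
EXTRACT-loglevel = ^\S+ \S+ (?<log_level>[A-Z]+)\b
# pull a transaction identifier into a "transaction_id" field
EXTRACT-txn = transaction=(?<transaction_id>\S+)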

The second problem was when we had test environments with multiple installs of the application – suddenly, the prefix of “/logs/epoch” wasn’t enough. With a sourcetype both of these problems go away, and you can perform this search:

sourcetype="epoch-debug"

Tagging data sources with sourcetypes also overrides any automatic sourcetype detection that Splunk may want to do. If the sample size when Splunk first sees a log file is small, it may save the data for that file under the wrong sourcetype. For example, a Postfix log may be stored as a Sendmail log. By being explicit in your sourcetypes, you eliminate this possibility and keep the data much cleaner.

Build Common “Event Types”

Event types essentially let you save a search. One example I’ve used is mail. Sendmail has a ton of information in its logs, but most people generally just want to see mail deliveries. You can take this query:

source="/var/log/maillog" stat=*

And save it as a new event type, “email-sent”. Then you can use a short form to search for that specific event type:

eventtype="email-sent"
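
Under the hood (or if you prefer to manage them with your C.M. tool), an event type is just a stanza in eventtypes.conf:

[email-sent]
search = source="/var/log/maillog" stat=*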

You can put any criteria in the search. Another thing we use event types for is to search a test environment as a whole. For example, I have this query saved as “test-ft5”:

index="testenvironments" source="/logs/*ft5/*" OR source="/TomcatFT/*ft5/logs/*"

This searches all systems that have logs for the FT5 applications (FT stands for functional testing). This allows faster troubleshooting in the test environments.

Training

Arguably this is the most important thing here. Splunk is very powerful, and also very complex. A decent training session (with burritos) will go a long way to showing people what can be done with Splunk.

I recommend that you come up with lots of example searches specific to your business – show people numbers that are hard to get elsewhere (such as from reporting systems), and demonstrate how to use Splunk to solve problems and answer questions.
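
A couple of the kinds of starter searches I mean, using the index and sourcetype layout from this post (the “status” and “log_level” fields are assumptions – the latter is the hypothetical extraction from earlier):

index=weblogs sourcetype=access_combined | timechart span=1h count by status

index=applications sourcetype=epoch-debug log_level=ERROR | stats count by host | sort - count

The first charts web traffic per hour broken down by HTTP status code; the second shows which hosts are producing the most application errors.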

Conclusion

That’s pretty much it for now. Splunk has been quite a valuable tool at Points.com, and that value will only increase with the amount of data we put into it. It has replaced numerous tools that we’ve bought or built, and has streamlined several manual processes for us. I feel spoiled having it now – it would be very hard, or at least very boring, to work with the same amount of data without Splunk in the future.
