Splunk for dotCMS

Published on June 14, 2010 in Administration, Linux

Introduction

Every once in a great while, you discover a tool that you didn’t know you couldn’t live without. Such was the case for us with an application bearing the incredibly descriptive name of Splunk. What is Splunk? One of its primary uses is as a log analysis tool. But wait. Before you yawn and turn the page, please think about this: I’ve been in IT for almost 30 years. I can name fewer than 10 really game-changing system utilities I’ve run across in all that time. Splunk is one of them.

The best way to really understand Splunk’s usefulness is to try it. Installation for use on a single machine is a snap. Uninstalling is also clean, complete, quick, and easy. If the logs you’ll be monitoring with Splunk grow by less than 500 MB per day combined, you can use the free version of Splunk. If you’re producing more log data than that, you’ll be able to use the free version for 30 days. After that, you’ll have to purchase Splunk.
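If you're not sure which side of the 500 MB line you fall on, a rough check like the one below can help. It is only a sketch: it adds up the size of every file touched in the last 24 hours, which overstates the true daily growth for files that also hold older data, so treat the result as an upper bound. The LOG_DIR default is an assumption, and the function names are made up for this example.

```shell
#!/bin/sh
# Rough upper bound on one day's log volume.
# LOG_DIR is an assumption; point it at the directory you plan to index.
LOG_DIR="${LOG_DIR:-/usr/local/dotcms/logs}"
LIMIT_MB=500

# Sum the disk usage (in KB) of files modified in the last 24 hours.
# Prints 0 if the directory is missing or nothing was modified.
recent_kb() {
    find "$1" -type f -mtime -1 -exec du -k {} + 2>/dev/null |
        awk '{ total += $1 } END { print total + 0 }'
}

# Succeeds (exit status 0) when the given KB figure is below the limit.
under_limit() {
    [ "$1" -lt $((LIMIT_MB * 1024)) ]
}

kb=$(recent_kb "$LOG_DIR")
if under_limit "$kb"; then
    echo "About $((kb / 1024)) MB touched in the last day: under the free limit"
else
    echo "About $((kb / 1024)) MB touched in the last day: over the free limit"
fi
```

If the number comes back comfortably under 500, the free version should do. If it's close, watch it for a few days before deciding.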

One of Splunk’s very powerful features is its ability to simultaneously index logs and other files from many servers, in real time, and consolidate all the data into one fully indexed master repository. You can do this in one of two ways: mount the remote file systems on the Splunk box using NFS, or install Splunk in its “data collector” mode on the other servers and have those instances push data to the master Splunk box over the network using Splunk’s built-in TCP/IP protocol designed for doing just that. Data collection from remote servers is an exercise we’ll leave for the reader. The Splunk website sports a well-maintained wiki and excellent documentation. I recommend both, whether you install Splunk on a single server or on every box in your data center (which is just about what we’ve done). For the remainder of this section, we’ll limit the discussion to installing Splunk on a single server (the dotCMS server) and absorbing only data about and produced by dotCMS.

With that in mind, let’s run through a quick “HowTo” on installing Splunk on your Linux-based dotCMS server:

Initial Installation

  1. Fire up your favorite browser and visit http://www.splunk.com . Somewhere on the home page, you should find a green “Free Download” button. Click that, select the appropriate flavor of Splunk, fill out the registration information, and download. Look carefully at the download page when you finally get there. One of the options at the top says “Got WGET?” If you click that, you’ll see a properly formatted wget command that you can copy and paste into a PuTTY session on your server to download Splunk. That’s handy if you, like many folks, don’t run a GUI desktop on your dotCMS server.

     You can get either an RPM version or a “manual install” TGZ. If you’re on a system that supports RPMs and you select the RPM version, know these two things: first, Splunk will install in /opt/splunk and, second, if you have a busy site and want to keep a significant amount of data, the Splunk directory can eventually occupy several gigabytes of disk space. If you’ve partitioned your disk with dotCMS in mind, /opt may not have enough room for Splunk’s data and index files to grow. If you’re not sure how your disk is partitioned, enter the “df” command to see the layout and the free space available. If you’ve got one big partition with all your directories in it (except perhaps /boot), then it doesn’t make any difference where Splunk lives. In that case, get the RPM. If you want Splunk to live somewhere else, download the TGZ. Either way, Splunk is very well behaved. Everything for Splunk lives in the “Splunk home” directory. It doesn’t scatter crap all over the place, not even in /etc.
  2. Once Splunk is downloaded, where you place it depends on what you downloaded.  If you got the RPM, you can stick it anywhere – like /tmp or /root.  If you got the TGZ, place it in the directory that you want to contain the Splunk home directory.  If you want Splunk to live in /usr/local/splunk then put the TGZ in /usr/local for example.
  3. To install the RPM version, enter: rpm -Uvh big-long-nasty-splunk-file-name.rpm (substituting the name of the file you actually downloaded). When that finishes, Splunk is installed in /opt/splunk and ready to run. To install the TGZ version, make sure it’s in the directory you want, then enter: tar -xvzf big-long-nasty-splunk-file-name.tgz (again with your real file name). When tar is done extracting the archive, Splunk is installed in the ./splunk directory and ready to run.
  4. If you have a firewall on the server, open tcp port 8000 for incoming traffic.  If you have a GUI desktop on the server and will only use Splunk from “localhost” then you can skip this step.  Obviously, if you have no firewall, you can also skip this step.
  5. Fire up Splunk.  If you installed the RPM, open a terminal window and enter: /opt/splunk/bin/splunk start.  If you used the TGZ, then fix up the command above to be something like /your/path/splunk/bin/splunk start .  Either way, the last part should always be … /splunk/bin/splunk start (note the space before “start.”).
  6. Use the space bar to page through the license agreement and enter “y” to accept it.
  7. Press enter (take the defaults) at any other prompts.
  8. Splunk fires itself up and begins to index its sample dataset.  This will take a very short time.  There’s no need to wait for that to finish.
  9. To make Splunk start up automatically with the server, use your favorite text editor and open: /etc/rc.d/rc.local .  This is the Linux equivalent of Windows’ “startup” folder or the registry “run” key.  At the bottom of this text file, enter the command to start Splunk – the same one you used in step 5 above.
  10. You’re finished working in the terminal window.  Everything else is done with your favorite web browser.
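For the copy-and-paste crowd, steps 3, 5, and 9 boil down to the commands sketched below. The script only prints the commands so you can look them over before running anything as root. PACKAGE is a placeholder for whatever file you downloaded, SPLUNK_HOME defaults to the RPM location, and the two function names are invented for this example.

```shell
#!/bin/sh
# Prints the install and start commands from steps 3, 5 and 9.
# Nothing here installs or starts anything; it only echoes the commands.
# PACKAGE is a placeholder for the file you actually downloaded.
PACKAGE="${PACKAGE:-big-long-nasty-splunk-file-name.rpm}"
# /opt/splunk is where the RPM lands; for the TGZ, use the directory
# you extracted into, followed by /splunk.
SPLUNK_HOME="${SPLUNK_HOME:-/opt/splunk}"

# Step 3: pick the install command from the package's file extension.
install_cmd() {
    case "$1" in
        *.rpm) echo "rpm -Uvh $1" ;;
        *.tgz) echo "tar -xvzf $1" ;;
        *)     echo "unrecognized package: $1" >&2; return 1 ;;
    esac
}

# Step 5: the start command for a given Splunk home directory.
# Step 9: this same line goes at the bottom of /etc/rc.d/rc.local.
start_cmd() {
    echo "$1/bin/splunk start"
}

install_cmd "$PACKAGE"
start_cmd "$SPLUNK_HOME"
```

Remember that the tar command extracts into the current directory, so cd to the parent directory you chose in step 2 before running it.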

Post Install Configuration

Configuring Splunk to consume and index the dotCMS logs on the local machine involves only two simple steps:

  • Tell Splunk to index everything in the dotCMS logs directory
  • Set a reasonable limit on the total amount of disk space Splunk may consume

That’s it.

Let’s begin.  Fire up the Splunk web console with a URL similar to the following:  http://ip-addr-or-dns-name-of-server:8000/

You’ll be presented with a log-in screen.  The default user name is “admin” and the default password is “changeme”.  Don’t worry about actually following the “change me” suggestion unless you intend to purchase Splunk.  After 30 days, the limitations of the free version kick in, and the log-in screen will go away.  In free Splunk, anyone who can access the URL may drive Splunk.  If you need to limit access to the free version, you’ll need to do so with some external mechanism – like firewall rules that only permit access to TCP port 8000 from a specific IP address or subnet.  If you’re not an IPTables (firewall) guru, there are several free applications out there to help.  If you don’t have such an animal in your repertoire, one of my favorites is WebMin.  Snag it at www.webmin.com .  Firewall management is only a tiny portion of what WebMin can do for the less-than-expert (or just lazy) Linux administrator.  You’ll find the firewall setup applet under Networking in WebMin’s menu structure.
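If you'd rather do it by hand, the firewall restriction described above amounts to two iptables rules: accept port 8000 from the addresses you trust, then drop everything else headed for that port. The sketch below prints the rules rather than applying them. The 192.168.1.0/24 subnet is a made-up example and splunk_rules is a name invented here. It also assumes no earlier rule in your INPUT chain already accepts the traffic, since iptables stops at the first match.

```shell
#!/bin/sh
# Print (not apply) iptables rules that allow the Splunk web port
# from one trusted source only.
# 192.168.1.0/24 is a made-up example subnet; substitute your own.
SPLUNK_PORT=8000

splunk_rules() {
    allowed="$1"
    # Order matters: the ACCEPT must come before the catch-all DROP.
    echo "iptables -A INPUT -p tcp -s $allowed --dport $SPLUNK_PORT -j ACCEPT"
    echo "iptables -A INPUT -p tcp --dport $SPLUNK_PORT -j DROP"
}

splunk_rules "192.168.1.0/24"
```

Rules added this way don't survive a reboot unless you save them with whatever mechanism your distribution provides, which is one more reason a tool like WebMin is handy.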

Here’s the log-in screen:

Login Screen

After logging in, a first time user will be prompted to add data to Splunk. You want to do that now. If you miss the initial “add data” prompt, fear not. Look for a link called “Manager.” It’s rather inconspicuous – located in the top right corner of the web console. Find it and click on it.
You’ll see the administration page appear:

Administrative Page

Under “System configurations” – the second column on the Manager screen – find and click on “Indexes.” The following will appear:

Indexes

You’ll see that the default maximum for all the listed indices is 500,000 MB or half a terabyte. If you feel, as I did, that this default is just a bit excessive, you’ll want to pick some maximums that represent the amount of space you want to allocate to Splunk. The only really important index you’ll want to worry about is the one called “main.” The screen shot above shows the state of all the indices on a dotCMS server that’s been running Splunk for about a week and a half.

To change an index’s space maximum, click on the index name in the leftmost column:

Index Settings

Until you’ve done some reading and understand what the hot, warm, and cold buckets are, leave the second value alone. Enter a reasonable value in the first blank under “Max size (MB) of entire index” and click “Save.”

It will require a restart for the new index limits to take effect. Locate the “Click here to restart…” link in the upper-right corner of the screen and click upon it:

Restart Notification

Then select “restart Splunk” on the screen that follows:

Server Controls

Wait for Splunk to restart and find your way back to the Manager screen. Now that we have set reasonable limits on the index, let’s go add some data. Again under system configurations – the second column of links – find and click on “Data inputs.” You’ll see the following:

Data Inputs

On the “Files & Directories” line, click on “Add new” in the Actions column on the right. There are really only a couple of items to enter or verify here. Most importantly, you need to tell Splunk the path to the dotCMS log directory. When you’re done, the form should resemble the one below:

Files & Directories

In the “Full path on server” text box, enter: /usr/local/dotcms/logs (or the equivalent, if you didn’t install dotCMS in /usr/local). Next, make sure the host name is filled in correctly; if not, fill it in. Take the defaults for everything else. Scroll to the bottom of the page and click “Save.”

Save Button
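If you don't have a browser handy, the same input can be added from a terminal with Splunk's command-line tool. The sketch below assumes the RPM location for Splunk and the default dotCMS log path. Both are assumptions you should adjust, and monitor_cmd is a helper name made up for this example. The script only runs the command when it finds both the Splunk binary and the log directory; otherwise it just shows what it would have run.

```shell
#!/bin/sh
# Add the dotCMS log directory as a monitored input from the command line.
# Both paths are assumptions; adjust them to match your install.
SPLUNK_HOME="${SPLUNK_HOME:-/opt/splunk}"
DOTCMS_LOGS="${DOTCMS_LOGS:-/usr/local/dotcms/logs}"

# Build the command as text so it can be reviewed first.
monitor_cmd() {
    echo "$1/bin/splunk add monitor $2"
}

if [ -x "$SPLUNK_HOME/bin/splunk" ] && [ -d "$DOTCMS_LOGS" ]; then
    # Splunk may ask for the admin user name and password here.
    "$SPLUNK_HOME/bin/splunk" add monitor "$DOTCMS_LOGS"
else
    echo "Would run: $(monitor_cmd "$SPLUNK_HOME" "$DOTCMS_LOGS")"
fi
```

Either route ends up in the same place: Splunk watches the directory and indexes whatever shows up in it.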

Splunk will immediately go nuts sucking up and indexing everything it can find in the new input path. You can monitor the system load by opening a terminal session and running the “top” command. Wait a while for Splunk to settle down. When it does, open up your browser and log in to Splunk, choosing “Search” from the Launcher screen. You’ll see Splunk’s search page. This is where you’ll spend most of your time when using Splunk. How skilled you become at constructing queries on this search page will determine how useful Splunk is to you. If you’ve done any programming at all, the conditional expressions in Splunk will look somewhat familiar; even if the syntax isn’t, the concepts should be. You’re basically building a conditional using ANDs, ORs, and NOTs, along with parentheses where necessary, to zero in on the data you’re looking for. The sample data in Splunk is used in the Getting Started tutorial, which I highly recommend. To whet your appetite, let’s do one real-world example.

Say I’m curious to see how many “404” errors I’ve been getting. Let’s start out with a simple query telling Splunk we only want to see lines containing the string “404”:

Initial Search View

Right off the bat, I see that I need to get rid of all the 404 lines having to do with favicon.ico GET requests. Let’s filter them out by modifying our query. We’ll just add “AND NOT favicon.ico” to the end of what we already have (404):

Filtered Search Results

That’s a little better, but I’d like to also get rid of the robots.txt stuff and, for now, the /global/js… stuff:

Further Refined Results

Now we’re getting somewhere. I’ve filtered out the noise and am beginning to see some useful data. This kind of try-narrow, try-narrow iteration is precisely what makes Splunk so darn useful. I can zero in on a problem in just a few attempts and then look for patterns that might explain why the problem is happening.
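If you think in shell pipelines, the filtering above is roughly what you'd get by chaining grep commands: keep the lines that mention 404, then throw away the ones that mention favicon.ico, robots.txt, or /global/js. The sketch below is only an analogy. Splunk searches an index rather than scanning files, and its matching rules are not identical to grep's. The sample lines are invented for illustration and are not real dotCMS output, and filter_404 is a name made up here.

```shell
#!/bin/sh
# grep analogue of the search: keep lines containing 404, then drop the noise.
# -F treats each pattern as a plain string; -v inverts the match.
filter_404() {
    grep -F '404' |
        grep -F -v 'favicon.ico' |
        grep -F -v 'robots.txt' |
        grep -F -v '/global/js'
}

# Invented sample lines, only to show the filter at work.
printf '%s\n' \
    'GET /favicon.ico 404' \
    'GET /robots.txt 404' \
    'GET /global/js/menu.js 404' \
    'GET /products/old-page.html 404' \
    'GET /index.html 200' |
    filter_404
```

Only the /products/old-page.html line survives. Note that a bare 404 matches any line containing those three characters, not just a 404 status code, which is exactly why a few rounds of narrowing are usually needed.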

If you construct one of these that you’re especially happy with, you can use the “save search” link on the top-right to give the query a name and save it for future use. You can also construct a query, tell Splunk to run it on a schedule, and e-mail you the results.

The possibilities are virtually endless. We’ve not made more than a slight blemish, much less a real scratch, on the surface of Splunk’s capabilities. The more familiar you get with Splunk, the more you’ll find it can do for you. I’ll bet you a cup of coffee that in the first half hour of using Splunk, you’ll find out something about your dotCMS instance you didn’t know before.

Again, don’t forget the “Getting Started” tutorial on the Launcher screen, watch the video demonstration, and check out the Wiki and other documentation available at www.splunk.com . If you’re really stuck, you can even post a question on the forums at the Splunk web site.

One of Splunk’s best banner-ad tag lines is “Needle. Haystack. Found.” It’s true. Happy Splunking!
