Enter Nutch, Stage Left

Published on December 29, 2010 in Back End, Front End

Spindle sucked. Say it with me. Spindle sucked. As far as search goes, sure, it was better than nothing. The problem was that the Spindle backbone wasn’t very robust as search algorithms go, it had no UI, and the results you could get were pretty limited. Not to mention that development on the Spindle project seems to have ceased. You could always tie into Google Custom Search with the $googleminiapi viewtool, which was perfectly fine, but that requires you to be set up in Google, and it takes some control out of your hands.

Nutch was introduced as the solution. Nutch is one of many Apache projects implemented within dotCMS. What makes Nutch powerful is that it is built on top of Lucene and Solr, includes a crawler engine, and can crawl web pages and documents alike. On top of all that, dotCMS also implemented a portlet to run it from (technically, Spindle had one, but you had to call its URI manually, and all it did was instantiate Spindle). In short, you end up with a much nicer and far more flexible tool if you need to add search to your site.

The Site Search Portlet

The Site Search portlet is available under the CMS Admin tab. You can enable or disable it per host, set up the crawl schedule (using a cron expression), and control exclusions. It’s powerful enough to be useful, but simple enough not to be intimidating. Some people experienced in search might find it a bit limiting, but that’s when you can switch to using Google or an appliance. It strikes a nice balance: more powerful than the default search in something like WordPress, without taking hours to figure out.
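To give a feel for the crawl schedule, here are a few sample expressions. This is only a sketch: it assumes the portlet accepts Quartz-style cron expressions (six fields, with seconds first), which is worth verifying against your dotCMS version before relying on it.

```text
0 0 2 * * ?      every day at 2:00 AM
0 0/30 * * * ?   every 30 minutes
0 0 3 ? * SUN    every Sunday at 3:00 AM
```

A nightly crawl is usually a reasonable starting point; tighten the interval only if your content changes faster than that.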

You interact with Nutch via the $sitesearch viewtool, which has just a single method to worry about: $sitesearch.search($query, $sortBy, $startNum, $numRows, $request). This returns a DotSearchResults object that exposes a number of fields you can read (assuming we’ve stored the viewtool results in a Velocity variable called $results):

  • $results.responseType
  • $results.query
  • $results.lang
  • $results.reverse
  • $results.start
  • $results.rows
  • $results.end
  • $results.totalHits
  • $results.hits
  • $results.details
  • $results.summaries
  • $results.withSummary
  • $results.misspellings

Most of these are details about the search itself, such as the total number of hits found and alternative spelling suggestions. When you display the results, you’ll mainly be working with the .details and .summaries fields. For example:

#* PERFORM THE SEARCH FOR THE QUERY, STARTING
AT THE BEGINNING, GETTING 10 RESULTS *#
#set($results = $sitesearch.search($request.getParameter('query'), null, 0, 10, $request))
## PULL OUT THE DETAIL SETS AND RESULT SUMMARIES
#set($details = $results.details)
#set($summaries = $results.summaries)
## SKIP THE LOOP WHEN NOTHING MATCHED: THE RANGE [0..-1] WOULD COUNT DOWN AND STILL RUN
#if($results.totalHits > 0)
<ol>
#foreach ($i in [0..$math.sub($results.end,1)])
    <li>
        <strong>$!{details.get($i).getValue('title')}</strong><br />
        $!{summaries.get($i).toHtml(true)}<br />
        <small><a href="$!{details.get($i).getValue('url')}">$!{details.get($i).getValue('url')}</a></small>
    </li>
#end
</ol>
#else
<p>No results found.</p>
#end
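
The same method can page through results by varying $startNum. What follows is a minimal sketch, not a confirmed dotCMS recipe: the `start` request parameter is a hypothetical name chosen for illustration, `$math` is assumed to be the standard Velocity MathTool already used above, and `$results.totalHits` is assumed to be numeric.

```velocity
## ASSUMPTION: 'start' IS A HYPOTHETICAL REQUEST PARAMETER HOLDING THE OFFSET
#set($perPage = 10)
#set($start = 0)
#if($request.getParameter('start'))
    #set($start = $math.toInteger($request.getParameter('start')))
#end
#set($query = $request.getParameter('query'))
#set($results = $sitesearch.search($query, null, $start, $perPage, $request))
<p>$!{results.totalHits} results found</p>
## LINK TO THE NEXT PAGE ONLY IF THERE ARE HITS BEYOND THIS ONE
#set($next = $math.add($start, $perPage))
#if($next < $results.totalHits)
    ## NOTE: URL-ENCODE THE QUERY BEFORE ECHOING IT INTO A LINK ON A REAL SITE
    <a href="?query=$!{query}&amp;start=$next">Next</a>
#end
```

Keeping the page size in one variable means the search call and the link arithmetic cannot drift apart. How the .details and .summaries lists are indexed when $startNum is greater than zero is not covered here, so check that before reusing the display loop on later pages.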

Overall, implementing Nutch isn’t particularly harder or easier than using Spindle was, but the level of control and the improved results are what make the difference. It also provides a very nice foundation that we can hope will be used to develop an even more robust feature set down the road. Below, you can catch Dean’s walkthrough of the tool as well, where he shows how to enable and set up Nutch.

* NOTE: Windows users will need to install Cygwin in order to use Nutch on Windows-based machines and servers, because Nutch natively relies on Linux tools.


Photo Credit: morgan frederick (Attribution-NonCommercial-ShareAlike, some rights reserved).

One Response to “Enter Nutch, Stage Left”

  1. tss says:

    Nice article on adding site search. Another lucene based free solution to consider for the CMS is SearchBlox. Comes with file system, web and rss crawlers as also provides web based control to the admin. Also includes an API if you need total flexibility.
