Searching on top of Lucene Core: Solr vs. Elasticsearch

El Niño
4 min read · May 2, 2017

For a recent project we were looking for a new, up-to-date backend search engine for one of our customers' websites. It is a relatively small site, available in Dutch and French, with around 5000 pages in total. Because the site is that small, we do not need a distributed cluster; a single instance is enough.

Previous setup

The previous (and very outdated) setup consisted of:

  • Apache Solr 3.6.0
  • Apache Nutch 1.6
  • Hadoop 1.0

The custom Nutch indexer, built by an employee who left long ago, is actually pretty good: next to the webpages it also indexes videos and images. Unfortunately, we only had the `.jar`s at our disposal, so the null-pointer exceptions it occasionally threw could not be fixed without access to the source code.

Comparison

Thanks to various blogs and websites, there is plenty of information about the similarities and differences between Solr and Elasticsearch. The only difficult part was deciding between the two for our setup.

The advantages of Elasticsearch:

  • Easy setup
  • Better for analytics
  • Query DSL
  • More ‘natural’ REST API
  • Multi-tenancy

The advantages of Solr:

  • Well documented
  • Better and more specific configuration
  • Faceted search
  • Full-text search (and text-oriented)
  • Big ecosystem

Because we only handle a small amount of text, and thus have no need for multi-tenancy or the query DSL, while benefiting fully from faceted and full-text search, Solr is the clear winner.

Implementation

Okay, so we have a candidate for the new system: Nutch to crawl our site, Solr to index it, and the REST API to query it!

With the new managed schema configuration in Solr it was easy to get up and running. I made a small PHP script that reads the sitemaps from the site and puts all the URLs in a file. Next, I configured Nutch to crawl these URLs, and only URLs from this website. Nutch also needed some configuration to speed up the crawling, because in the default configuration it fetches from a given host only once every 5 seconds. To speed this up, I set `fetcher.threads.per.host` to a maximum of twenty threads per host; going higher than that caused a lot of timeouts and made the crawl much less effective. Using additional plugins (`index-metadata` and `parse-metatags`), some extra meta tags were scraped (for `language_id` and `site_id`, used internally).
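As a sketch, the relevant `nutch-site.xml` overrides could look like the following. The delay value and the exact `plugin.includes` list are illustrative, and property names can differ between Nutch releases (newer 1.x versions use `fetcher.threads.per.queue` rather than `fetcher.threads.per.host`):

```xml
<!-- Illustrative nutch-site.xml overrides; values are assumptions,
     not the exact production configuration. -->
<configuration>
  <!-- The default politeness delay is 5 seconds per host;
       lower it since we are crawling our own site. -->
  <property>
    <name>fetcher.server.delay</name>
    <value>0.5</value>
  </property>
  <property>
    <name>fetcher.threads.per.host</name>
    <value>20</value>
  </property>
  <!-- Enable the metadata plugins next to the defaults
       (list abbreviated for illustration). -->
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|metadata)|indexer-solr</value>
  </property>
</configuration>
```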

After configuring Nutch to save all relevant information to Solr, it was time to tell Solr which data we were most likely to fetch, and to configure the boosts so that only relevant results pop up.
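As an illustration of such boosting (the field names and boost values here are hypothetical, not our production numbers), an eDisMax request handler lets you weight fields so that, say, a title match outranks a body match:

```xml
<!-- Hypothetical boost configuration: title counts 5x,
     keywords 3x, body text 1x. -->
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <str name="qf">title^5 keywords^3 text_nl</str>
  </lst>
</requestHandler>
```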
The meta keywords needed some additional tuning. Because they arrive as a comma-delimited string, we needed Solr to index them as separate terms. This field was also going to be a `termVector`; the field definition was as follows (XML):

<field name="keywords" type="delimited" termVectors="true" indexed="true" stored="true"/>

<fieldType name="delimited" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.PatternTokenizerFactory" pattern=",\s*"/>
  </analyzer>
</fieldType>

This allowed us to let Nutch index it as a comma-separated string, while still returning relevant keywords.
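The effect of the `,\s*` tokenizer pattern can be sketched in a few lines of Python; `re.split` with the same pattern mirrors what the `PatternTokenizerFactory` does to the raw string:

```python
import re

def tokenize_keywords(raw: str) -> list[str]:
    """Split a comma-delimited keyword string the way a
    PatternTokenizerFactory with pattern ",\\s*" would."""
    return [token for token in re.split(r",\s*", raw) if token]

print(tokenize_keywords("solr, lucene,  search engine,nutch"))
# → ['solr', 'lucene', 'search engine', 'nutch']
```

Each keyword becomes its own token, so a query for `lucene` matches a page whose keywords field was indexed from the single string above.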

Searching through multiple fields for an autosuggest feature

To search across multiple fields, we copy the data of several fields into a single field that we can then easily search through. There is no need for a complicated setup; see the example below. This little snippet declares a new field that is indexed but not stored, because we never need to retrieve this data, only query it. Three `copyField` instructions copy the content of each `source` field into the `dest` field to achieve the desired result.

<field name="suggest" type="textSpell" indexed="true" stored="false" multiValued="true"/>
<copyField source="text_nl" dest="suggest" />
<copyField source="title" dest="suggest" />
<copyField source="keywords" dest="suggest" />

We can now define a `searchComponent` that enables us to search in this field.

<searchComponent class="solr.SpellCheckComponent" name="suggest">
  <lst name="spellchecker">
    <str name="field">suggest</str>
    ...
  </lst>
</searchComponent>

To access this search component through the REST API, we also need to define a request handler that leverages the searchComponent we just built. This request handler listens on the `/suggest` URL.

<requestHandler class="org.apache.solr.handler.component.SearchHandler" name="/suggest">
  <lst name="defaults">
    <str name="spellcheck">true</str>
    <str name="spellcheck.dictionary">suggest</str>
    ...
  </lst>
  <arr name="components">
    <str>suggest</str>
  </arr>
</requestHandler>
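On the client side, a suggest request is then a plain GET against that handler. A minimal sketch of building such a URL with the Python standard library (the host, core name, and parameter values are illustrative):

```python
from urllib.parse import urlencode

def build_suggest_url(base: str, term: str, count: int = 5) -> str:
    """Build a query URL for a /suggest request handler.
    `base` is the core URL, e.g. http://localhost:8983/solr/core1
    (hypothetical); parameter names follow the spellcheck component."""
    params = {"q": term, "spellcheck.count": count, "wt": "json"}
    return f"{base}/suggest?{urlencode(params)}"

print(build_suggest_url("http://localhost:8983/solr/core1", "lucen"))
# → http://localhost:8983/solr/core1/suggest?q=lucen&spellcheck.count=5&wt=json
```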

Dutch difficulties with stemming

Because Dutch has a lot of irregular verbs, we need a language-specific filter chain to correctly stem the words in our `text_nl` field. This gives us better results when searching for text in that field.

<fieldType name="text_nl" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory"
            format="snowball"
            words="lang/stopwords_nl.txt"
            ignoreCase="true"/>
    <filter class="solr.StemmerOverrideFilterFactory"
            dictionary="lang/stemdict_nl.txt"
            ignoreCase="false"/>
    <filter class="solr.SnowballPorterFilterFactory"
            language="Kp"/>
  </analyzer>
</fieldType>
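Conceptually, this chain checks the override dictionary first and only falls back to the Snowball stemmer when the word is not listed. A toy Python sketch of that precedence (the dictionary entry and the fallback suffix rule are made up for illustration, not real Kraaij-Pohlmann behavior):

```python
def stem(word: str, overrides: dict[str, str]) -> str:
    """Mimic StemmerOverrideFilterFactory + fallback stemmer:
    an explicit dictionary entry always wins; otherwise apply a
    toy suffix rule standing in for the real stemmer."""
    if word in overrides:
        return overrides[word]
    # Toy fallback: strip a common Dutch infinitive suffix.
    if word.endswith("en"):
        return word[:-2]
    return word

# The irregular past tense "liep" (of "lopen", to walk) is mapped
# by the override dictionary, so the fallback rule never sees it.
overrides = {"liep": "loop"}
print(stem("liep", overrides))  # → loop
print(stem("huis", overrides))  # → huis
```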

Conclusion

With this setup it is now relatively easy to get back a list of terms using the TermsComponent, faceted keywords using facets, autosuggestions, plain search results, and everything else Solr is good at.

New setup: Apache Solr 6.4.0 & Apache Nutch 1.12

To read more about the Kraaij-Pohlmann algorithm (and how it differs for the Dutch language), read this excellent blog.

By Victor Lap. Victor is a programmer at El Niño.


http://www.elnino.tech. Digital Development Agency building tailor made solutions, ensuring success by making it measurable.