Introduction

One of the key requirements for the Preference Terms Dictionary (nee Common Terms Registry) is that we allow users to search using their own language and help them find common terms.

As outlined in my previous blog on our combined record structure, we use Lucene (specifically couchdb-lucene) to add full-text searching of our records.

Recently, in demonstrating the search in a meeting, I noticed some problems in executing seemingly simple searches, and did a bit of investigating. I wanted to share what I learned.

It turns out that with the great power of Lucene comes great responsibility. Namely, to take best advantage of Lucene's power, you need to understand and configure its analyzers.

Our requirements

As you can see in our API docs, the Preference Terms Dictionary provides a powerful search, that includes the ability to:

  1. Search the full text of all records and find terms and aliases that match anywhere in their definition, term label, uniqueId, et cetera.
  2. Filter and order search results using structured field data (for example, only displaying records with the right status, or sorting by uniqueId).

The promise of (and problems with) CouchDB-Lucene

CouchdB-lucene integration seems like it should help greatly with this. It provides high performance full-text searching from within a couch view (proxied through a separate server running as a java process).

Initially, I put a high priority on "stemming" words that might appear in the definition. "Display" is a good example. You might have definitions like:

  1. "stop displaying on-screen feedback"
  2. "change the number of columns displayed"
  3. "speak all text that appears on the primary display"

In each case, you would want "display" to match all three. For this, we use one of the "stemming" analyzers included with Lucene, namely the "porter" analyzer. This worked well enough, as it would match all three variations. However, it caused two problems:

First, uniqueIds like "8DotComputerBrailleTable" and "org.gnome.packagekit.ignored-dbus-requests" were broken down into their component parts and stripped of "blocked" terms like "8". This made it difficult to precisely match a specific uniqueId.

Second, the query itself was parsed using the analyzer, which would truncate a search for "computer" to "comput". This is fine when comparing apples to apples, i.e. if the definition contains "displayed", and that becomes "displai" in the index, then it doesn't matter if the query is also searching for "displai" instead of "display". They'll still match.

The problem comes when you're trying to search for something like a uniqueId that contains a className. The className will be broken up into individual words at every period, and then "stop words" will be stripped. Both "8DotComputerBraille" and "6DotComputerBraille" will be converted to "dot computer braille", which makes it impossible to search for one but not the other.

So how did this get fixed?

Different types of data, different approaches

To recap, we had to balance two concerns: We needed to prevent lucene from mangling values like uniqueIds in both our indexes and our queries. We also wanted stemming for definitions and other free text.

There is no single Lucene analyzer that will do this well. Instead, you need to use the "perfield" wrapper and specify which analyzer to use for each field. Here's what our analyzer setting in couchdb-lucene finally ended up looking like:

perfield:
    {
        default: "porter",
        uniqueId: "keyword",
        aliasOf: "keyword",
        translationOf: "keyword"
    }

The default analyzer is "porter" so that we're matching "stems". Plural and singular work interchangeably. Past and present tense work as well. This is perfect for the default search field, in which we put every piece of searchable information.

For the fields we want to be treated literally, we use the "keyword" analyzer. The beauty of this is that the analyzer is applied per field, even for the query itself.

Here's a sample search URL from my local instance that illustrates how nicely this works:

http://localhost:5984/_fti/local/tr/_design/lucene/by_content?q=display+termLabel:display+uniqueId:display

That's the word "display" three times in the same query, once without a field prefix, once with the "termLabel" field prefix, and once with the "uniqueId" prefix. Couchdb-lucene helpfully shows us what the final parsed query looks like, and the results are instructive:

"q": "default:displai termLabel:displai uniqueId:display",

Both the unqualified term and the term that is prefixed with the "termLabel" field are transformed so that all stemmed variations are correctly matched. For the "uniqueId" field, which needs to be a literal, the value is left alone.

So we end up with exactly what we want, which is to have an inclusive search that stems out and matches variation on your starting language, but which can be cleanly filtered using exact field values.


Comments

comments powered by Disqus