The Common Tools Registry is All About Relationships

One of the requirements for the Common Terms Registry is that we store records in a flat format. For this, CouchDb is perfect.

However, another core requirement is that we clearly represent the relationship between standard terms and variations on those terms (aliases and translations, for example). This is not something CouchDb is good at handling.

There is a blog article by Christopher Lenz that goes over a few strategies to handle relationships in CouchDb. We took that as our starting point, tried a few approaches in depth, and ended up using a combination of CouchDb, Lucene, and Express to meet all of our requirements.

Our Requirements

So, what are our requirements as regards relationships? We need to be able to:

  1. Make it easy to find a standard term and its variations all at once.
  2. Ensure that the relationship is valid, i.e. that the "parent" record actually exists.
  3. Preserve the relationship when either the parent or the child is updated.

I'll start with the second two, which are a bit simpler to describe.

Referential Integrity

In SQL, this would be simply a matter of defining a foreign key constraint. Once that's done, it would not be possible to either create a link to a non-existent record or to delete a record that other records refer to.

Couchdb helpfully provides a mechanism to validate records as they are added, updated or deleted. Couchdb does not, however, provide a means to check for the existence of a related record, which means that we can't provide either of the safety checks that a foreign key constraint provides.

Our solution to this is to have Express handle the adding, editing, and deleting of records. Express can perform the necessary checks when manipulating both child and parent records.

Preserving the Relationship

Our initial approach was to store the list of child records in the parent record (see "cmlenz" blog for details). This makes it easy to search for standard terms based on words mentioned in a variation on that term. However, it introduces serious drawbacks when editing both parent and child records.

When editing a child record, if you change the link to the parent record, you need to regenerate the child data for both the old and the new parent record. This means you may have to make as many as three updates for an edit.

When editing a parent record, you have to make sure to preserve the current child data. This is certainly possible, but introduces unnecessary complexity. If you allow changes to the key used to link records, you also need to trigger an update for all child records, which is a much more open ended set of cascading updates.

To solve both problems, we will rely on Express to help us disentangle the data as it is stored in Couch from the format we deliver via our REST interfaces. That's covered in detail below, to summarize as it relates to preserving the relationship, the key is that only the child record actually stores a link to the parent record, and that link is validated when the record is added or updated.

Enabling reasonable searching using a combined record format

If we are looking for settings related to "color", we also want to see settings related to "colour". More importantly, we want to know which is the standard term, and which is the alias. We will not have a single field that has only the word "color" or "colour" in it, so we need full text searching of a range of fields (label, description, etc.).

The most obvious choice for full text searching is CouchDb Lucene, which brings the power of Lucene to Couch. Lucene indexes the raw records directly with hints from a particular type of design document. This means that it will return search results that are a mix of parent and child records, ordered by how well they match the search terms provided by the user, or whatever sort terms the user provides.

Diagram of record format presented by Lucene.

We need to convert this to a list of standard terms that include their child records. To do this, we use Express to:

  1. Execute the requested search.
  2. Retrieve the underlying record data for the search results.
  3. Knit together the mixed list of parent and child records into a unique list of parent records.

The output will look something like this:

Diagram of final record format as presented by Express.

Sorting

Because only Express has access to the final format, it has to handle sorting and paging.

To take advantage of the sorting in Lucene, we apply two simple rules as we evolve the raw search results:

  1. If a child record appears in the search results, we display the parent record in its place.
  2. If a parent record (or its children) appear in multiple places in the search results, we only display the first match.

Paging

That still leaves us with the problem of paging results. Lucene doesn't know about the final order or record count, so it can't handle the paging. Again, we handle this in Express. In the proof of concept work done so far, we generate the full list of results every time and return only the requested slice of that set. For the real implementation, we will cache the search results using a library like node-cache. As we have two lookups, I would propose two caches:

  1. For individual searches, we would cache the search results returned by Lucene under the full query string (or hashed equivalent). Those search results would only contain a list of matching uniqueIds for the parent record.
  2. We would cache parent records by uniqueIds.

With that approach, we can check for cache misses for just the paged set of results and make a single call to look up and cache all of the misses.

This should limit the amount of memory used by the caches and ensure that only one instance of each parent record is stored in the cache at a time.

Conclusion

Why am I writing this down (besides the fact that my shoulder angel told me to)?

Because we want feedback on whether this is a sane and complete approach before we go much further with it. Please comment if you have ideas about how to do what we're trying to do.


Comments

comments powered by Disqus