Date Thu 12 June 2014 Tags Couchdb

Introduction

I shared this with our mailing list a while back, but wanted to put out a more complete and formal document for further review.

For the Common Terms Registry (and presumably for the Unified Listing), we will need a fuller change history than Couchdb (our back-end data store) provides.

Our requirements

We need:

  1. To preserve the change history even if the database is compacted and cleaned up.
  2. To preserve a record of the author of each change.
  3. To allow versions to be submitted as proposed changes and later approved by moderators.
  4. To allow a comment on what was changed in each version and why.
  5. To provide the ability to leave a new comment (or respond to an existing comment) without making an additional change.

Versions as attachments

To satisfy these requirements, I propose extending the "version as attachment" method outlined in "jchris" blog.

In that approach, whenever there is a change, the old content of the document is added as an attachment. The attachment is just a JSON file, which makes reading and writing simple.

This approach has a few key strengths:

  1. It absolutely and clearly associates versions with a single document.
  2. It reuses a core feature of CouchDB (attachments) rather than introducing an entirely new data structure.
  3. It only introduces complexity when it's needed. Depending on what you're trying to do, you can:
    • Retrieve a single record without any attachments (the default)
    • Retrieve a record with a list of attachments, and then retrieve each attachment as desired.
    • Retrieve a record and all attachments in a single call.

There is a limitation to this approach, in that when using lists and views, you only have access to the list of attachments, meaning that you cannot retrieve the full version history of all documents in a single request. This is not a huge problem in that:

  1. CouchDB itself only allows you to retrieve the version history for a single document (we'd be in the same boat with CouchDB, in other words).
  2. We are most interested in presenting the current published version of each document in list views, and in presenting the full version history when viewing or editing a single document.

How I propose to implement versions

So here are the specifics I propose to glue this all together and meet our requirements.

We need a way to keep track of version numbers. We can't use CouchDB's version structures (_rev and revpos) to keep track of versions. The _rev value is overwritten as soon as we add an attachment. We could retrieve the current record and change history, and save the previous version to the history, but we could never store the current record's _rev value, as we wouldn't know it up front. We could choose to store just past versions in the history, but that would mean that we can't really keep track of unpublished drafts using the same mechanism.

The solution is to assign and reuse our own version numbers for documents. Because we know what the current version is when making updates, we can store the current version in the change history. This is slightly redundant for records in which the published version is also the latest version, but is key in adding support for unpublished changes. If the current document has a version number of 6 and there is a version 7 attached, then version 7 is by definition an unpublished draft.

For convenience, we store the document version twice in the history. Once in the document itself, and once in the name of the attachment, as in versions/NUMBER. Having the version in the list of attachments allows us to see where we are in history without loading the version history. It also allows us to figure out what the next version number in the series should be without loading the version history. It also lends itself to easy natural ordering of edits.

Edits only at the end

As a reminder, the document itself represents the latest published version. There may be later unpublished versions, but the proposed changes are chained.

If the last published version is version 6, and someone proposes a change, that becomes version 7. If someone else wishes to make an update, they are in essence starting with version 7 when creating version 8.

This makes it fairly clean to publish a proposed change. If we like version 8, that content replaces the current document. Given that both version 6 (the old version) and version 8 (the new version) are already in the change history, that history does not need to be updated.

If, on the other hand, version 8 is rejected, and we want to approve version 7, we use the same mechanism as we would for any "revert" operation. We copy the content of version 7 to version 9, update the version history to include that content, and add a comment (see below) indicating that the content was cloned from version 7.

So what about comments?

As mentioned in our requirements, we also need to support a back and forth dialog between contributors and moderators. I would propose a similar approach as we used with versions.

Comments should be stored as attachments, and named with respect to the version they are associated with (for example: versions/1/comments/1).

Because the list of attachments returned with the original CouchDB document is a flat structure that includes path information, we can extract the full list of comments as well as their position in the hierarchy. In that way we can trivially do things like:

  1. Associate comments with an individual revision, for example when paging through the change history.
  2. Also keep the full comment history as a separate object.
  3. Clearly indicate that comments that were relevant to an older version.
  4. Filter comments displayed to those entered against the last version or versions.

We may not choose to build a UI that relies on all of those features, but the simple naming convention gives us all of those options and more.

Putting it all together

So, if you're still reading, what will this all look like?

Here is a single sample record with two versions and two comment threads, as it would be presented in CouchDB, with the list of attachments:

{
   "_id": "0069b020043147f58eeda6d72c9845f4",
   "_rev": "60-8e50635398074f9eff7f7875a01fad6f",
   "type": "GENERAL",
   "uniqueId": "soundActive",
   "defaultValue": "TRUE",
   "definition": "Undefined...",
   "status": "active",
   "source": "gpii",
   "version": "2",
   "updated": "2014-05-22T10:01:33.655Z",
   "_attachments": {
       "versions/2/comments/1": {
           "content_type": "application/json",
           "revpos": 58,
           "digest": "md5-zHlXptYHwLkBg9yCDhmTCg==",
           "length": 242,
           "stub": true
       },
       "versions/2": {
           "content_type": "application/json",
           "revpos": 60,
           "digest": "md5-zHlXptYHwLkBg9yCDhmTCg==",
           "length": 242,
           "stub": true
       },
       "versions/1/comments/1": {
           "content_type": "application/json",
           "revpos": 58,
           "digest": "md5-zHlXptYHwLkBg9yCDhmTCg==",
           "length": 242,
           "stub": true
       },
       "versions/1": {
           "content_type": "application/json",
           "revpos": 59,
           "digest": "md5-zHlXptYHwLkBg9yCDhmTCg==",
           "length": 242,
           "stub": true
       }
   }
}

If the ?attachments=true parameter is passed, the results will also include MIME-encoded attachment content, so we can get everything we care about in a single request. We'd decode the attachments using a library like multipart-parser or node-formidable, from that point it's just JSON data.

Using the above information and the attachment content, here is the same record as I would propose to display it via our API (which would include query parameters to show/hide both versions and comments):

{
    "type": "GENERAL",
    "uniqueId": "soundActive",
    "defaultValue": "TRUE",
    "definition": "Undefined...",
    "status": "active",
    "source": "gpii",
    "version": "2",
    "author": "aatkins",
    "updated": "2014-05-22T10:01:33.655Z",
    "versions": [
        {
            "type": "GENERAL",
            "uniqueId": "soundActive",
            "defaultValue": "TRUE",
            "definition": "Undefined...",
            "status": "unreviewed",
            "source": "gpii",
            "version": "1",
            "author": "aatkins",
            "updated": "2014-05-22T10:01:33.655Z",
            "comments": [
                {
                    "comment": "Submitted new unreviewed record.",
                    "version": "1",
                    "author": "aatkins",
                    "updated": "2014-05-22T10:01:33.655Z"
                }
            ]
        },
        {
            "type": "GENERAL",
            "uniqueId": "soundActive",
            "defaultValue": "TRUE",
            "definition": "Undefined...",
            "status": "active",
            "source": "gpii",
            "version": "2",
            "author": "aatkins",
            "updated": "2014-05-23T10:01:33.655Z",
            "comments": [
                {
                    "comment": "Published record after review.",
                    "version": "2",
                    "author": "aatkins",
                    "updated": "2014-05-23T10:01:33.655Z"
                }
            ]
        }
    ],
    "comments": [
        {
            "comment": "Submitted new unreviewed record.",
            "version": "1",
            "author": "aatkins",
            "updated": "2014-05-22T10:01:33.655Z"
        },
        {
            "comment": "Published record after review.",
            "version": "2",
            "author": "aatkins",
            "updated": "2014-05-23T10:01:33.655Z"
        }
    ]
}

For those of you familiar with the CTR record structure, the "version" and "author" fields are managed entirely by the system, and are not required to look up or add records.

Conclusion

I hope that gives you a good sense of how I'm proposing to handle versions (and comments on versions). Please review and comment.

Update

For more information, check out the discussion on the mailing list.


Comments

comments powered by Disqus