Sooner or later, NoSQL == inconsistent data

Ever since I took the plunge and moved over to Mongo for Exquisite Haiku, I’ve been waiting to see if (when?) I would get bitten by NoSQL’s ability to accommodate inconsistent data. In the world of SQL, the data schema is the deepest and most fundamental unit of structure in the application, a sacrosanct blueprint that comprehensively defines the arrangement of data that is allowed to enter or exit the system. Any attempt to insert or query data that doesn’t conform to the schema will be outright rejected, and, if the exception isn’t handled, the code will fail.
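
For example (a sketch – the table and column names here are my own):

    -- Inserting into a column the schema doesn't define is rejected
    -- outright by MySQL:
    INSERT INTO records (id, favorite_color) VALUES (1, 'teal');
    -- ERROR 1054 (42S22): Unknown column 'favorite_color' in 'field list'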

In Neatline, the data record schema looks like this:
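
(An abridged sketch – the actual create statement is longer, and the columns beyond id, title, and the foreign keys are illustrative.)

    CREATE TABLE neatline_data_records (
        id INT(10) UNSIGNED NOT NULL AUTO_INCREMENT,
        exhibit_id INT(10) UNSIGNED NULL,
        item_id INT(10) UNSIGNED NULL,
        title MEDIUMTEXT NULL,
        description MEDIUMTEXT NULL,
        start_date TINYTEXT NULL,
        end_date TINYTEXT NULL,
        geocoverage MEDIUMTEXT NULL,
        PRIMARY KEY (id)
    ) ENGINE=InnoDB DEFAULT CHARSET=utf8;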

As an experiment, I randomly deleted one of the lines in the create statement to see what would happen to the test suite. It would be easy here to totally bomb out the whole application by getting rid of something that’s fundamental to the relational structure (like one of the foreign keys, or even the id column on the data records). Instead, though, I just got rid of the “title” field, which is hardly central to the functionality of the application – it’s a little bucket for plaintext data, but nothing depends on it, in the sense that it’s never used as a carrier pigeon for any kind of important information through the system.

Right now, blog-post-inspired-mucking-around aside, the server-side Neatline test suite consists of 214 tests that make a total of 1185 assertions, all of which pass. When I took out the title field in the schema declaration, the test suite was only able to make 626 assertions (about half), and recorded 87 errors during the run. In other words, the application is completely decimated. From the standpoint of a user, essentially nothing at all works.

Of course, the advantage to all of this is that you can be completely certain that any data that does, in fact, make it into the system will be well-formed and reliable – when you query for a collection of data records in Neatline, you can be confident that every single one of the records will have the exact structure that’s defined in the original schema.

NoSQL stores, by contrast, place no (native) restrictions on the structure of incoming and outgoing data, and no such assurance of structural consistency exists. Mongo, for instance, is essentially in the business of just persisting to disk little clumps of data that take the structural form of JSON strings. These strings, or “documents,” can be clumped together arbitrarily into “collections,” the hazy-to-the-point-of-not-even-being-analogous analog of a relational “table.” For example, you could have a Mongo collection called “records” and push in these two documents:
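
(The particular documents here are an illustrative sketch.)

    // Two documents with completely different shapes, pushed into the
    // same collection from the Mongo shell:
    db.records.insert({ name: "David Foster Wallace", living: false });
    db.records.insert({ title: "Infinite Jest", year: 1996 });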

Mongo, beautifully or horrifyingly depending on your perspective, couldn’t care less. You can even execute queries on the inconsistent data. I could issue a db.records.find({ living: false }), and Mongo will walk the collection, test each document to see (a) whether it has a living attribute, and (b) if so, whether its value is false, returning the document if both checks pass.
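
Continuing the sketch above (output abridged), only the first document comes back – the second simply fails check (a) and gets skipped:

    > db.records.find({ living: false })
    { "_id" : ObjectId("..."), "name" : "David Foster Wallace", "living" : false }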

From an application developer’s standpoint, this is all both fantastic and scary. Fantastic because the process of making evolutionary updates to the “schema” is much more straightforward. Just update the application code – no need to write a migration. Scary, though, because there’s a virtual certainty that inconsistent data will, in fact, be introduced into the system given a large and complex enough application, and especially when more than one developer is working on application code that touches on the same collection of data.

When I started working on Exquisite Haiku, I was curious to see how long it would take before some combination of application and testing code resulted in the presence of schematically inconsistent data in Mongo. Really, Exquisite Haiku is a comparatively low-risk application as far as this matter is concerned – it’s a studiously simple codebase written by a single person. And yet, yesterday, as I was browsing through a test suite for a collection of custom form validators that I wrote back in February, I realized that there were dozens of tests working on mocked user documents that had as many as three extraneous fields, all of which had long since been popped off of the User schema declaration. Originally, when I was writing the application with the idea that it could be used either as an “installable” application, like WordPress, or as a public-facing service with administrator registration, I had a couple of extra fields on the User model to track the different access levels:
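
(A sketch of the old declaration, in the Mongoose syntax of the era – the exact field options are illustrative.)

    // Earlier User schema, with the extra access-control fields:
    var mongoose = require('mongoose');
    var Schema = mongoose.Schema;

    var UserSchema = new Schema({
      username:  { type: String, required: true, unique: true },
      password:  { type: String, required: true },
      email:     { type: String, required: true },
      admin:     { type: Boolean, required: true, default: false },
      superUser: { type: Boolean, required: true, default: false },
      active:    { type: Boolean, required: true, default: true }
    });

    mongoose.model('User', UserSchema);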

Once I abandoned the idea of accommodating both use patterns (mainly because I realized that it’s totally impossible for a centralized server to handle more than one to two concurrently running poems…) I stripped all of the abstractions off of the model that were designed to handle the public-facing administrator registrations (active, superUser, etc.), and just went with a single boolean field called “admin” to keep track of whether the user is one of the “owners” of the installation or a publicly-registered poem participant. So, it changed to this:
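
(Again a sketch, with the same caveats.)

    // Pared-down User schema, with a single access-control flag:
    var UserSchema = new Schema({
      username: { type: String, required: true, unique: true },
      password: { type: String, required: true },
      admin:    { type: Boolean, required: true, default: false }
    });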

But, littered copiously throughout one of the older test suites, were user document stubs still carrying the attributes that had been removed from the canonical schema declaration (email, superUser, and active):
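
(A representative stub – the literal fixtures vary, but this is the shape.)

    // Stale fixture: email, superUser, and active no longer exist on
    // the schema, but the document saves without complaint:
    var User = mongoose.model('User');

    var user = new User({
      username:  'david',
      password:  'password',
      email:     'david@test.org',
      admin:     false,
      superUser: true,
      active:    true
    });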

These saved just fine, and the tests were passing silently. And, in fact, there was no actual problem here – the extra fields are really, truly extraneous in this case, and the application code isn’t affected by the presence of additional attributes on the documents. But the fact that it was non-breaking this time seems to be largely a matter of luck. And if I can get turned around in a small, surgical, two-month-old application of which I’m the sole developer, I can only imagine how hard it would be to keep things tidy and normalized in a large, sprawling, long-duration, multi-developer application.