“Things done right” is a series of blog entries focusing on design and implementation components, both legacy and modern, in the Domino NoSQL data store and engine.
“Data structures” is a course taught at institutions teaching programming and engineering everywhere in the world. It’s necessary because students will soon discover work done (often LONG) before their time that has placed data in memory and on disk in ways that vary widely by usage and data contours. It’s not always done well; sometimes a forced fit causes gross dysfunction and de-optimization.
In the sphere of NoSQL document store databases (even before there were NoSQL databases), denormalized, unstructured data is the rule. Those two traits (denormalized and unstructured) are coupled intentionally, because document data is a bad fit for relational tables, and forcing it into their assumptions has made many a document-store project fail or perform badly against one’s favorite SQL engine.
The nature of document data is related mush. It is semi-structured for sure, with apparitions of relational constructs. And some stores have in fact taken strongly typed and length-restricted data into storage areas separate from their less structured content.
It has always been Domino’s premise that data should start typeless and formless. You add form and type to it as you see fit. And you can change your mind or leave things mushy. I’ve known relational mavens who would have allergic reactions to that premise, citing all kinds of problems you’ll have which are really all kinds of problems they have.
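The typeless-and-formless premise can be sketched in a few lines of Python. This is purely an illustration of the idea, not Domino’s API; the `create_note` helper and the `Form` item name are assumptions made for the example.

```python
# Illustrative sketch only -- not Domino's API. A "note" starts as a bag
# of items with no declared schema; form and type are layered on later.

notes = []

def create_note(**items):
    """Store whatever items the caller supplies; no schema is enforced."""
    note = dict(items)
    notes.append(note)
    return note

# Data starts formless...
n = create_note(Subject="Q3 report", Body="Numbers look good.")

# ...and a "form" (a named bundle of expectations) can be applied later,
# or changed again, without rewriting the data already stored.
n["Form"] = "Memo"
n["Priority"] = 2  # a typed item added after the fact
```

The point of the sketch is that typing is a decision the application layers on over time, not a constraint the store demands up front.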
I’ve known database structures that group all values for a given column/field on a single set of concentrated pages for quick aggregation and mathematical operations. I’ve known database structures that sort all rows/documents by a given sort key to optimize ordered retrieval. But in those cases, and others, try getting the data out efficiently in any other order or for any other purpose.
What Domino did right from the outset was to create summary data, which solves two main problems:
- Separate frequently accessed data from less frequently accessed data
- Order the data the only way that universally makes sense – creation time
A sub-attribute of the first point is that frequently accessed data is generally small (yes, others will argue strenuously that this is a brutish/arrogant assumption – the fact is that it’s true for something north of 95% of cases). This allows all larger data to spill into other stores or dedicated sets of pages on disk, with anchor pointers in the summary data that carry useful metadata about those data elements (attachments, rich text, images, other objects).
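The summary/spill split described above can be sketched as follows. This is a toy model, not Domino’s on-disk format: the 256-byte threshold, the `object_store` dictionary, and the `$object` anchor shape are all assumptions made for illustration.

```python
# Illustrative sketch (not Domino's on-disk format): small, frequently
# accessed items stay in the summary record; anything large spills into a
# separate object store, leaving behind an anchor carrying useful metadata.

SUMMARY_LIMIT = 256  # toy per-item size threshold (assumption)

object_store = {}    # stand-in for dedicated pages holding large objects
next_object_id = 0

def store_document(items):
    """Split a document into a small summary record plus spilled objects."""
    global next_object_id
    summary = {}
    for name, value in items.items():
        data = value.encode() if isinstance(value, str) else value
        if len(data) <= SUMMARY_LIMIT:
            summary[name] = value  # small item lives inline in the summary
        else:
            object_store[next_object_id] = data
            # Anchor pointer: metadata stays hot in summary, bulk lives elsewhere.
            summary[name] = {"$object": next_object_id, "length": len(data)}
            next_object_id += 1
    return summary

doc = store_document({"Subject": "Hello", "Body": "x" * 10_000})
# doc["Subject"] is inline; doc["Body"] is an anchor into object_store.
```

Scans that only need the hot fields never touch the bulky pages, which is exactly why keeping the summary record small pays off.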
For the second, I call it a tautology that the hot data – the most often updated and referenced data – is the recent data. E-mail is radically so – the average e-mail message has a meaningful half-life of days. And who updates an e-mail document? But even systems boasting the highest transaction rates see a fading rate of access to a given row/document as time goes on. Therefore, ordering by creation time is the intuitive and efficient way to store data. In Domino’s case, the ascending NoteID provides an approximate clustering of data that optimizes all kinds of operations. Although NoteIDs can differ across database replicas, they remain the order in which document summaries are stored and serve as the default ascending key for view indexing.
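The effect of creation-ordered IDs can be sketched briefly. The integers here are toy stand-ins, not real Domino NoteIDs; the point is only that when IDs ascend with creation time, an ID range scan is also a recency scan, so hot (recent) documents cluster together.

```python
# Illustrative sketch: ascending IDs assigned at creation time mean the
# tail of the ID space is the recent data -- no secondary timestamp index
# is needed to find it. Toy integers, not real Domino NoteIDs.

import itertools

_note_ids = itertools.count(1)  # monotonically increasing allocator
store = {}                      # ID -> document, clustered by creation order

def create(doc):
    note_id = next(_note_ids)
    store[note_id] = doc
    return note_id

for i in range(100):
    create({"seq": i})

# "Recent" documents are simply a range scan over the highest IDs:
recent = [store[nid] for nid in sorted(store) if nid > 95]
```

Because recent documents occupy adjacent IDs, they tend to land on adjacent pages on disk, which is the approximate clustering the post describes.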
While parts of Domino insist on data residing in the summary area (view data, for example), that restriction is a very sensible one, all things considered.
Now, of course things change. Data only changes in size one way – it gets bigger. That’s why the summary area has grown to 16MB in V10. But the assumptions behind the original design were spot on. It’s a thing done right.