Cool new Free software

Posted Dec 21, 2012 15:51 UTC (Fri) by Wol (subscriber, #4433)
In reply to: Cool new Free software by Wol
Parent article: Status.net service to phase out, replaced by pump.io

Following up to myself, let's take a quick look at that invoice, with ten items.

Let's start with the invoice and delivery addresses. Are they attributes of the invoice, stored in the invoice record, or relations to a WORM table of locations? As far as Pick is concerned, it doesn't care: it can store either a foreign key or the location itself. Okay, the same is true of relational, but depending on how the relational engine physically stores the data, that choice may have an impact later on.
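
Roughly, the two options look like this, sketched as Python dicts (the attribute names and key format are invented for illustration, not real Pick syntax):

    # Option 1: store the address text directly in the invoice record.
    invoice_with_inline_address = {"INVOICE.ADDR": "1 High Street, Anytown"}

    # Option 2: store a key into a hypothetical write-once LOCATIONS file.
    invoice_with_address_key = {"INVOICE.ADDR": "LOC*1001"}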

Now the line items. Are they an attribute of the invoice, an attribute of some ledger, or an entity in their own right? I'm inclined to make them entities in their own right, not knowing enough about accounting off the top of my head to make the best call. I *could* make them an attribute of the invoice.

Now to save it all. Assuming the addresses all exist on file, that's one write for the invoice record and ten writes for the ten line items (if the line items were invoice attributes, I could have written the lot in just ONE write). Eleven atomic writes, wrapped in a transaction.
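
As a rough sketch of that save, in Python structures standing in for Pick records - the record layout, attribute names and key format are all made up for illustration, and I'm assuming the addresses are stored by key:

    # The invoice record holds the address keys and an ORDERED list of
    # line-item keys; no separate sort field or index is needed.
    invoice = {
        "@ID": "INV1234",
        "INVOICE.ADDR": "LOC*1001",
        "DELIVERY.ADDR": "LOC*1002",
        "LINE.ITEMS": [f"INV1234*{n}" for n in range(1, 11)],
    }
    line_items = {f"INV1234*{n}": {"QTY": 1, "PRICE": 9.99} for n in range(1, 11)}

    # One write for the invoice record, ten for the line items: eleven
    # atomic writes, wrapped in a transaction.  Had the line items been
    # attributes of the invoice, it would have been a single write.
    writes = 1 + len(line_items)   # == 11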

In relational, however, I have to add a row to the invoice table, ten rows to the line item table, and update the line-item index on invoice. That's over and above the fact that I have to muddle data and metadata in the line item table - inventing an arbitrary sequence field I can sort on to return the line items in the correct order (in Pick, I simply store a *list* of line-item keys in the invoice record). So relational has the extra overhead of more "data" to store, and (unless it's willing to take a massive hit on reading) the overhead of updating a whole bunch of indexes. The same eleven writes of data (with no option to reduce it to one), plus a bunch of index updates.
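
The relational shape, sketched the same way (again with made-up table and column names), showing the extra ordering field and the index that has to be maintained:

    invoices = {1234: {"invoice_addr_id": 1001, "delivery_addr_id": 1002}}
    line_items = [
        # line_no exists purely to get the items back in order
        {"invoice_id": 1234, "line_no": n, "qty": 1, "price": 9.99}
        for n in range(1, 11)
    ]
    # Extra structure that must be updated on every insert.
    index_on_invoice_id = {1234: list(range(len(line_items)))}

    writes = 1 + len(line_items)   # the same eleven row writes...
    writes += 1                    # ...plus at least one index update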

Now, let's assume we come back a week later and want to print off the invoice. I'll ignore how we obtain the invoice number. In Pick, we have ONE read for the invoice record, TWO reads for the addresses, and TEN reads for the line items. By the way, a read here is defined as a SINGLE disk seek. Statistics tell me the engine is going to make about one mistake, so I need to make 14 seeks.

In relational, however, I guess I need to read the invoice table index to find out where the invoice is. That's two seeks minimum. Then I need to read the two addresses - another four seeks. Then the index on the line item table, followed by the line items: eleven seeks if the row locations are stored in that index, or twenty-one if they aren't. I make that 17 *minimum*, probably a lot more.
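
Tallying that up, taking my seek counts above at face value:

    pick_reads = 1 + 2 + 10           # invoice + two addresses + ten line items
    pick_seeks = pick_reads + 1       # allow one hashing miss -> 14

    # invoice index + row, two addresses (two seeks each),
    # line-item index + ten rows (best case)
    rel_seeks = 2 + (2 * 2) + (1 + 10)
    print(pick_seeks, rel_seeks)      # 14 vs 17, before any extra index levels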

Remember I said Pick optimises retrieving data from disk?

What if I made a mistake and stored line items as an invoice attribute when I shouldn't? I end up with the equivalent of the relational line item table, clustered by invoice number. Given that relational has to guess how best to cluster data, chances are my arrangement is just as good :-)

At the end of the day, as soon as we start arguing performance, I have a MASSIVE advantage over you. The relational model explicitly forbids you from knowing the internal structure of the database, so that the engine can optimise it as it sees fit. As an application programmer, I know *exactly* how Pick is storing its data at the disk level. There's a reason why Pick doesn't have a query optimiser - it's a fairly trivial exercise in logic to prove that disk access is already so efficient (roughly 97%) that any attempt to optimise it would cost more than it saves.

Pick enforces primary keys, and the primary key lets Pick calculate the location of any item on disk. The Pick data structure pretty much forces logically tightly-coupled attributes to be physically tightly coupled on disk. And the ability to store a LIST of foreign keys in a single atomic record eliminates the need for many indices (and, because it's a LIST, eliminates the need to muddle data and metadata).

In Pick's worst-case scenario (provided the data has been normalised), it degrades to a weakly optimised relational scenario (the enforcement of primary keys provides some indexing). In Pick's typical scenario, any half-way complex query is going to leave relational in the dust. That P90 query I mentioned? I bet those Oracle consultants were adding indexes up the wazoo to try and improve performance. The Pick query was probably thrown together in five minutes, and because it was pretty much solely hunting down pre-known primary keys, it could go straight to the data it wanted without needing to search for it on disk.

If you want to know how Pick finds data so fast, see http://en.wikipedia.org/wiki/Linear_hashing - given a known primary key, it takes on average 1.05 disk requests to find what you're looking for!
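
For the curious, here's a toy Python sketch of the linear-hashing address calculation as described on that Wikipedia page - an illustration of the idea, not Pick's actual file layout:

    def bucket_for(key: str, level: int, split: int, n0: int = 4) -> int:
        """Compute the bucket for a key under linear hashing."""
        h = hash(key)
        b = h % (n0 * 2 ** level)            # hash with the current round's modulus
        if b < split:                        # bucket already split this round:
            b = h % (n0 * 2 ** (level + 1))  #   re-hash with the doubled modulus
        return b

    # Given the primary key, the bucket address is *computed*, so a lookup is
    # one seek unless the bucket has overflowed - hence the ~1.05 average.
    print(bucket_for("INV1234", level=2, split=3))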

Cheers,
Wol

