
Cool new Free software

Posted Dec 20, 2012 22:16 UTC (Thu) by Wol (subscriber, #4433)
In reply to: Cool new Free software by dlang
Parent article: Status.net service to phase out, replaced by pump.io

Falls apart how?

EXACTLY THE SAME is true (or not true) of a relational database. If my Pick engine can't assume that a write has succeeded, nor can your relational engine. If your relational engine can force the issue, so can my Pick engine.

ACID has nothing to do with relational. A relational app has to deal with the database returning a status of "unable to save data" - so does a Pick app.

A relational database has to make sure that when it tells the app that the data has been saved to disk, it really has. A Pick database, likewise.

And you completely miss my major point about speed. Let's say I am writing that invoice. First of all, when I read the client info, chances are it is a *single* disk access. In a relational database it is probably several reads scattered across multiple rows and tables. Then when I write the invoice, it is a SINGLE write to the invoice table. Done! Then I need to update the sales ledger and the client outstanding. I've already read the client info - that's cached - ONE write to the client table and the client outstanding is done. One read and write to the sales ledger and that is done.

Obviously those three writes need to be wrapped in a transaction to enforce ACID, but that would be true of relational. The difference between Pick and relational is that in relational that could easily be twenty or thirty writes. That's a LOT of i/o. And if your ACID is doing a sync, to ensure it's all flushed to disk, that's a LOT of overhead my Pick database doesn't have. Now add in all your integrity checks - that make sure your invoice has a matching delivery address, invoice address, line items, etc. That aren't needed in Pick because the invoice is one atom, not plastered across multiple tables ...
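The write-count difference described above can be sketched as follows. This is a hedged illustration using two hypothetical in-memory stores (not a real Pick or SQL API): the document store saves the whole invoice in one write, while the row store does one write per inserted row, before any index maintenance.

```python
class DocumentStore:
    """Pick-style: one logical record per write, keyed by primary key."""
    def __init__(self):
        self.records = {}
        self.writes = 0

    def write(self, key, record):
        self.records[key] = record
        self.writes += 1


class RowStore:
    """Relational-style: one write per inserted row."""
    def __init__(self):
        self.rows = []
        self.writes = 0

    def insert(self, table, row):
        self.rows.append((table, row))
        self.writes += 1


invoice = {
    "id": "INV-1001",
    "client": "C-42",
    "lines": [{"sku": f"SKU-{n}", "qty": 1} for n in range(10)],
}

doc = DocumentStore()
doc.write(invoice["id"], invoice)          # whole invoice, one write

rel = RowStore()
rel.insert("invoice", {"id": invoice["id"], "client": invoice["client"]})
for line in invoice["lines"]:              # one row, hence one write, per line item
    rel.insert("line_item", {"invoice": invoice["id"], **line})

print(doc.writes, rel.writes)              # 1 vs 11
```

Both sides would still wrap their writes in a transaction; the point is only how many writes cross the database/OS boundary.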

Where Pick scores massively, as I keep trying to hammer home, is at the (*expensive*!!!) interface between the database engine and the disk, there is FAR LESS TRAFFIC. So Pick is a lot faster. As the Pick FAQ put it - "other databases optimise the easy task of finding data in ram. Pick optimises the difficult task of getting it into ram in the first place". Where relational typically passes *many* writes to the OS, Pick passes a single ATOMIC write.

To drift the topic a bit, do you know why Linux tends to panic rather than fix things? Because, like a true engineer, Linus assumes that - when things go wrong - the code to handle any problems is likely to be buggy and untested, so it's far better to fail noisily and let the user handle the mess, than try to fix it automatically and fail. Yes, that's a judgement call, but a good one. Relational tends to try to enforce perfection. Getting that last ten percent can be VERY expensive. More than expensive enough to be totally counter-productive.

If I were writing a new Pick app, my main concern would be the integrity of the objects - such as the invoice and the information provided by the customer - making sure they were saved. And that's a single write to disk.

Stitching the customer order into the accounts and shipping can come later. And if it screws up I can use transactions to roll back, or run an integrity check to complete the failed commit. (In actual fact, I used exactly this technique - for a different reason - in an accounting app I wrote many moons ago.)

At the end of the day, Murphy WILL strike. And if you rely on ACID in your database to save you it won't. If, instead, you rely on phased, replayable commits, you're laughing. My database will be vulnerable to a crash while the rep is on the phone to the customer. Once the rep has hit "save order" and that has succeeded, all I'd be afraid of is losing the disk - the order can be committed to fulfilment and accounts as and when.
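The "phased, replayable commit" idea above can be sketched like this. All the names here are hypothetical, not a real Pick API: phase 1 is the single write the rep waits on; phase 2 stitches the order into downstream bookkeeping and is idempotent, so an integrity check can simply re-run it after a crash to complete a failed commit.

```python
orders = {}   # primary store: order id -> record
ledger = []   # downstream bookkeeping

def save_order(order_id, data):
    # Phase 1: the only write the rep waits on.
    orders[order_id] = {"data": data, "posted": False}

def post_pending():
    # Phase 2: replayable. Safe to run again after a crash, because it
    # only touches orders not yet marked as posted.
    for order_id, rec in orders.items():
        if not rec["posted"]:
            ledger.append(order_id)
            rec["posted"] = True

save_order("ORD-1", {"total": 99})
post_pending()      # stitches the order into the ledger
post_pending()      # idempotent: a second run posts nothing extra
print(ledger)       # ['ORD-1']
```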

WHEN Murphy strikes and ACID fails, how easily could you run an integrity check on your database and fix it? In Pick, it's both EASY and FAST. Pretty much all the systems I've worked with do it as a matter of course as an overnight job.

Cheers,
Wol



Cool new Free software

Posted Dec 20, 2012 22:39 UTC (Thu) by Wol (subscriber, #4433) [Link]

Following up to myself.

ACID sits at the interface between the database and the OS.

A transaction that is seen by the app as atomic is very UNlikely to be passed from a relational database to the OS as an atomic write.

That same transaction is far more likely to be passed to the OS as an atomic write by Pick.

Far less complicated. Far less overhead. Far easier for the user or programmer to understand.
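The point about handing the transaction to the OS as one write can be sketched as follows. Note the caveat in the comments: a single write() is one syscall with one buffer, not by itself a durability guarantee, which is why the sketch still calls fsync.

```python
import json
import os
import tempfile

invoice = {"id": "INV-1001", "lines": [{"sku": "A", "qty": 2}]}

fd, path = tempfile.mkstemp()
try:
    # The whole logical record goes to the OS as a single buffer in a
    # single write call, rather than as many small scattered writes.
    payload = json.dumps(invoice).encode()
    os.write(fd, payload)
    os.fsync(fd)        # still needed to force the data to disk
finally:
    os.close(fd)

with open(path) as f:
    print(json.load(f)["id"])   # INV-1001
os.remove(path)
```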

Cheers,
Wol

Cool new Free software

Posted Dec 21, 2012 1:43 UTC (Fri) by dlang (guest, #313) [Link] (5 responses)

> If your relational engine can force the issue, so can my Pick engine.

you are claiming that you are so much better than the relational databases because you don't do all the hard work that they do to be safe in the face of multiple writes.

you can't then turn around and say "if they do it, I can do it too"

you could, but then you lose what you are claiming is such a great advantage.

Cool new Free software

Posted Dec 21, 2012 12:09 UTC (Fri) by Wol (subscriber, #4433) [Link] (4 responses)

No I am NOT.

What I am claiming is that where a relational database HAS to do MULTIPLE writes, a Pick database usually only has to do ONE!

Who cares if the overhead PER WRITE is the same, if I'm doing half the writes, and that overhead is expensive, I'm going to trounce you for speed! Chances are, I'm doing a lot LESS than half the writes. That's the whole point of NFNF. (And as I keep saying, ACID has nothing to do with relational, and everything to do with reality, so Pick doesn't have to do it the same way as relational. It can if it wants, no reason why not.)

(I also forgot to mention, because Pick is primary-key-driven, data retrieval usually involves a direct key access, not a search via an index for a row - more savings on disk access!)

And chances are I'm doing a heck of a lot less i/o, because I have far less duplicate data all over the place, and I'm storing it much more compactly too. I was involved in porting an app from Pick to SQL-Server, so I've got direct comparisons to hand - dumping the data from the Pick datastore and loading into SQL-Server, the resulting SQL-Server database was MANY times the size of the Pick one. Four, five times? Maybe more. Oh, and I'm including in the Pick datastore all the data overhead we didn't transfer. And the Pick datastore by default (we didn't change it) runs at 80% full. I can't give you figures for SQL overhead because I don't know it.

Cheers,
Wol

Cool new Free software

Posted Dec 21, 2012 12:51 UTC (Fri) by pboddie (guest, #50784) [Link] (2 responses)

As I recall from having brushed up against UniData a few years ago, this class of database system works well for limited-depth hierarchical data because of the various things you've mentioned, but that doesn't necessarily mean that all of the advantages apply to other kinds of database structure.

I think it's also pertinent to mention that PostgreSQL has been able to deal with things like multivalued columns for a long time and in an arguably more sane fashion than, say, UniData in various respects, such as in the storage representation which, as I recall with UniData, involved various "top-bit-set" characters as field boundaries that probably made internationalisation a pain.

Certainly, this class of system works well for certain kinds of systems and there's undoubtedly a lot of people still sticking with them, as well as a lot who tried to migrate from them in a bad way, either causing lots of teething troubles and organisational stress with the new system or reinforcing prejudices about the capabilities of "those new-fangled client-server RDBMSs". That the migration exercises probably involved Oracle or some other proprietary solution, where only a single vendor can ease the pain, probably won't have helped.

It's telling that UniData and UniVerse ended up ultimately with IBM after an acquisitions cascade that involved lots of slightly bigger fish eating other fish. I think it was Ardent acquiring the Uni-products, being acquired by Informix, being acquired by IBM. Unlike HP who would have killed the product line in order to add a few cents to the dividend for that quarter, IBM probably see the long-term value in those remaining customers.

Cool new Free software

Posted Dec 21, 2012 13:23 UTC (Fri) by Wol (subscriber, #4433) [Link] (1 responses)

I think UniData is a bit of a red-headed stepchild in the Pick world. Certainly the reports I've seen seem to say that under the hood it doesn't do it like the other db's in the family.

Yes, I suspect internationalisation may be a bit of a pain, but it has been done. I haven't used it, but I haven't used internationalisation on linux either (I guess it's there, but I'm not conscious of it).

Limited depth hierarchies? In reality, how often do you blow off the end of Pick's ability? The tools aren't necessarily that good, but it handles between five and seven levels "out of the box". How many entities have attributes nested that deep?

You're right about Ardent acquiring the Uni products, but in reality, Ardent took over Informix. Yes, I know Informix the company bought out Ardent, but six months later the Informix board was gone, replaced entirely by Ardent people. The rumour is that IBM bought Informix for the Informix database, only to discover that the company's primary product by then was U2.

And as you say about Postgres, I don't know anything about it but I understood it could handle multivalued columns and that sort of thing. If you're going to be strictly relational, however, that's not allowed :-) Postgres is moving away from a pure relational db to a more NFNF model. Pick was there first ... :-)

Cheers,
Wol

Informix and IBM

Posted Dec 21, 2012 18:21 UTC (Fri) by markhb (guest, #1003) [Link]

> You're right about Ardent acquiring the Uni products, but in reality, Ardent took over Informix. Yes, I know Informix the company bought out Ardent, but six months later the Informix board was gone, replaced entirely by Ardent people. The rumour is that IBM bought Informix for the Informix database, only to discover that the company's primary product by then was U2.
Another rumor I heard, from a consultant who knew a lot of people in IBM, was that when they bought Informix they did, in fact, plan to merge the IDS (or Universal Server) tech into DB2, only to find that the Informix stuff was so far ahead of where DB2 was that they couldn't make it happen.

Cool new Free software

Posted Dec 21, 2012 15:51 UTC (Fri) by Wol (subscriber, #4433) [Link]

Following up to myself, let's take a quick look at that invoice, with ten items.

Let's start with the invoice and delivery addresses. Are they attributes of the invoice, stored in the invoice record, or relations to a WORM table of locations? As far as Pick is concerned, it doesn't care, it can store a foreign key or the location itself. Okay, the same is true of relational, but depending on how relational physically stores the data, it may have an impact later on.

Now the line items. Are they an attribute of the invoice, an attribute of some ledger, or an entity in their own right? I'm inclined to make them entities in their own right, not knowing enough about accounting off the top of my head to make the best call. I *could* make them an attribute of the invoice.

Now to save it all. Assuming the addresses all exist on file, that's one write for the invoice record and ten writes for the ten line items (if the line items were invoice attributes, I could have written the lot in just ONE write). Eleven atomic writes, wrapped in a transaction.

In relational, however, I have to add a row to the invoice table. Ten rows to the line item table. And update the line-item index on invoice. That's over and above the fact that I have to muddle data and metadata in the line item table - creating some random field I can sort on to return the line items in the correct order (in Pick, I simply store a *list* of line-item keys in the invoice record). So relational has the extra overhead of more "data" to store, and (unless it's willing to incur a massive hit on reading) the overhead of updating a whole bunch of indexes. The same eleven writes of data (with no option to reduce it to one) plus a bunch of indexes.
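The "LIST of line-item keys in the invoice record" point above can be illustrated with plain Python dicts (a hedged sketch, not a real Pick API): the list itself carries the ordering, so no extra sequence column and no index on the line-item table are needed to recover the line items in order.

```python
# Line items as entities in their own right, keyed by primary key.
line_items = {
    "LI-3": {"sku": "WIDGET", "qty": 2},
    "LI-1": {"sku": "GADGET", "qty": 1},
    "LI-2": {"sku": "SPROCKET", "qty": 5},
}

invoice = {
    "id": "INV-1001",
    # The list preserves order; a relational schema would need a
    # sequence column plus an index on invoice id to do the same.
    "line_item_keys": ["LI-3", "LI-1", "LI-2"],
}

# Retrieval: direct key lookups in stored order - no search, no sort.
lines = [line_items[k] for k in invoice["line_item_keys"]]
print([l["sku"] for l in lines])   # ['WIDGET', 'GADGET', 'SPROCKET']
```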

Now, let's assume we come back a week later and want to print off the invoice. I'll ignore how we obtain the invoice number. In Pick, we have ONE read for the invoice record, TWO reads for the addresses, and TEN reads for the line items. By the way, a read is defined as a SINGLE disk seek instruction. Statistics tell me the engine is going to make one mistake, so I need to make 14 seeks.

In relational, however, I guess I need to read the invoice table index to find out where to find the invoice. That's two seeks minimum. Then I need to read the two addresses. Another four seeks. Then the index on the line item table followed by the line items. That's eleven seeks, assuming the location is stored in that index or twenty-one if it isn't. I make that 17 *minimum*, probably a lot more.
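Spelling out the arithmetic above: the Pick-side figures come straight from the text (one invoice read, two address reads, ten line-item reads, plus one expected hash miss), and the relational-side figure is the text's own minimum estimate, counting each index lookup as an extra seek.

```python
# Pick: direct key access, each read is a single seek.
pick_seeks = 1 + 2 + 10 + 1        # invoice + addresses + line items + one hash miss

# Relational minimum: invoice index + row (2), two addresses (4),
# line-item index + ten rows (11), assuming locations live in the index.
relational_min_seeks = 2 + 4 + 11

print(pick_seeks, relational_min_seeks)   # 14 17
```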

Remember I said Pick optimises retrieving data from disk?

What if I made a mistake and stored line items as an invoice attribute when I shouldn't? I end up with the equivalent of the relational line item table, clustered by invoice number. Given that relational has to guess how best to cluster data, chances are my arrangement is just as good :-)

At the end of the day, as soon as we start arguing performance, I have a MASSIVE advantage over you. The relational model explicitly forbids you knowing the internal structure of the database, so that the engine can optimise it as best it sees fit. As an application programmer, I know *exactly* how Pick is storing its data at the disk level. There's a reason why Pick doesn't have a query optimiser - it's a fairly trivial exercise in logic to prove that disk access is so efficient (approx 97%) that any attempt to optimise it will cost more than it saves. Pick enforces primary keys. The primary key enables Pick to calculate the location of any item on disk. The Pick data structure pretty much enforces logically tightly coupled attributes to be physically tightly coupled on disk. The ability to store a LIST of foreign keys in a single atomic record eliminates the need for many indices (and because it's a LIST eliminates the need to muddle data and metadata).

In Pick's worst-case scenario (provided the data has been normalised), it degrades to a weakly optimised relational scenario. (The enforcement of primary keys provides some indexing.) In Pick's typical scenario, any half-way complex query is going to leave relational in the dust. That P90 query I mentioned? I bet those Oracle consultants were adding indexes up the wazoo to try and improve performance. The Pick query was probably thrown together in five minutes, and because it was pretty much solely hunting down pre-known primary keys, could go straight to the data it wanted without needing to search for it on disk.

If you want to know how Pick finds data so fast - http://en.wikipedia.org/wiki/Linear_hashing - given a known primary key, it takes on average 1.05 requests to disk to find what you're looking for!
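For the curious, here is a minimal linear-hashing sketch in Python (assumed parameters and split policy, not Pick's actual implementation). It shows the addressing rule from the page above: buckets split one at a time in round-robin order, a key's bucket is found with at most two hash computations, and lookup typically probes a single bucket.

```python
class LinearHashTable:
    def __init__(self, initial_buckets=4, max_per_bucket=4):
        self.n0 = initial_buckets        # bucket count at level 0
        self.level = 0                   # current doubling round
        self.split = 0                   # next bucket to split
        self.cap = max_per_bucket
        self.buckets = [[] for _ in range(initial_buckets)]

    def _addr(self, key):
        h = hash(key)
        b = h % (self.n0 * 2 ** self.level)
        if b < self.split:               # bucket already split this round:
            b = h % (self.n0 * 2 ** (self.level + 1))
        return b

    def insert(self, key, value):
        b = self._addr(key)
        self.buckets[b].append((key, value))
        if len(self.buckets[b]) > self.cap:
            self._split_one()            # split the bucket at the split pointer

    def _split_one(self):
        old = self.split
        self.buckets.append([])          # new bucket at index old + n0 * 2**level
        self.split += 1
        if self.split == self.n0 * 2 ** self.level:
            self.level += 1              # round complete: table has doubled
            self.split = 0
        keep, moved = [], []
        for key, value in self.buckets[old]:
            (keep if self._addr(key) == old else moved).append((key, value))
        self.buckets[old] = keep
        self.buckets[-1] = moved

    def get(self, key):
        for k, v in self.buckets[self._addr(key)]:
            if k == key:
                return v
        return None


table = LinearHashTable()
for i in range(50):
    table.insert(f"INV-{i}", i)
print(table.get("INV-7"))   # 7
```

Because the address is computed directly from the key, a lookup needs no index traversal at all, which is the source of the "1.05 requests to disk" average the text cites.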

Cheers,
Wol


Copyright © 2025, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds