Cool new Free software

Posted Dec 21, 2012 12:09 UTC (Fri) by Wol (guest, #4433)
In reply to: Cool new Free software by dlang
Parent article: Status.net service to phase out, replaced by pump.io

No I am NOT.

What I am claiming is that where a relational database HAS to do MULTIPLE writes, a Pick database usually only has to do ONE!

Who cares if the overhead PER WRITE is the same? If I'm doing half the writes, and that overhead is expensive, I'm going to trounce you for speed! Chances are, I'm doing a lot LESS than half the writes. That's the whole point of NFNF. (And as I keep saying, ACID has nothing to do with relational, and everything to do with reality, so Pick doesn't have to do it the same way as relational. It can if it wants, no reason why not.)

(I also forgot to mention, because Pick is primary-key-driven, data retrieval usually involves a direct key access, not a search via an index for a row - more savings on disk access!)

And chances are I'm doing a heck of a lot less i/o, because I have far less duplicate data all over the place, and I'm storing it much more compactly too. I was involved in porting an app from Pick to SQL-Server, so I've got direct comparisons to hand - dumping the data from the Pick datastore and loading into SQL-Server, the resulting SQL-Server database was MANY times the size of the Pick one. Four, five times? Maybe more. Oh, and I'm including in the Pick datastore all the data overhead we didn't transfer. And the Pick datastore by default (we didn't change it) runs at 80% full. I can't give you figures for SQL overhead because I don't know it.

Cheers,
Wol


Cool new Free software

Posted Dec 21, 2012 12:51 UTC (Fri) by pboddie (subscriber, #50784)

As I recall from having brushed up against UniData a few years ago, this class of database system works well for limited-depth hierarchical data because of the various things you've mentioned, but that doesn't necessarily mean that all of the advantages apply to other kinds of database structure.

I think it's also pertinent to mention that PostgreSQL has been able to deal with things like multivalued columns for a long time and in an arguably more sane fashion than, say, UniData in various respects, such as in the storage representation which, as I recall with UniData, involved various "top-bit-set" characters as field boundaries that probably made internationalisation a pain.
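To make the storage point concrete, here is a minimal sketch of delimiter-based multivalue encoding. It assumes the conventional Pick-family mark characters (254 for the field mark, 253 for the value mark); the function names are invented for illustration, and this is not a claim about UniData's actual on-disk format.

```python
# Hypothetical sketch: a multivalue record stored as one delimited byte string.
FIELD_MARK = b"\xfe"  # assumed: conventional Pick field/attribute mark (char 254)
VALUE_MARK = b"\xfd"  # assumed: conventional Pick value mark (char 253)

def encode_record(fields):
    """fields is a list of fields; each field is a list of byte-string values."""
    return FIELD_MARK.join(VALUE_MARK.join(values) for values in fields)

def decode_record(raw):
    """Split the stored bytes back into fields and their values."""
    return [field.split(VALUE_MARK) for field in raw.split(FIELD_MARK)]

# Plain ASCII data round-trips without trouble.
record = [[b"INV1001"], [b"L1", b"L2", b"L3"]]
assert decode_record(encode_record(record)) == record

# In Latin-1 the letter "þ" is byte 0xFE, the same byte as the field mark,
# so a value containing it is split in the wrong place on the way back.
clash = [["þorn".encode("latin-1")], [b"L1"]]
assert decode_record(encode_record(clash)) != clash
```

The second assertion is the internationalisation problem in miniature: any 8-bit encoding that uses the high byte values for real characters can collide with the delimiters.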

Certainly, this class of system works well for certain kinds of systems and there are undoubtedly a lot of people still sticking with them, as well as a lot who tried to migrate from them in a bad way, either causing lots of teething troubles and organisational stress with the new system or reinforcing prejudices about the capabilities of "those new-fangled client-server RDBMSs". That the migration exercises probably involved Oracle or some other proprietary solution, where only a single vendor can ease the pain, probably won't have helped.

It's telling that UniData and UniVerse ended up ultimately with IBM after an acquisitions cascade that involved lots of slightly bigger fish eating other fish. I think it was Ardent acquiring the Uni-products, being acquired by Informix, being acquired by IBM. Unlike HP who would have killed the product line in order to add a few cents to the dividend for that quarter, IBM probably see the long-term value in those remaining customers.

Cool new Free software

Posted Dec 21, 2012 13:23 UTC (Fri) by Wol (guest, #4433)

I think UniData is a bit of a red-headed stepchild in the Pick world. Certainly the reports I've seen seem to say that under the hood it doesn't do it like the other db's in the family.

Yes, I suspect internationalisation may be a bit of a pain, but it has been done. I haven't used it, but I haven't used internationalisation on Linux either (I guess it's there, but I'm not conscious of it).

Limited depth hierarchies? In reality, how often do you blow off the end of Pick's ability? The tools aren't necessarily that good, but it handles between five and seven levels "out of the box". How many entities have attributes nested that deep?

You're right about Ardent acquiring the Uni products, but in reality, Ardent took over Informix. Yes, I know Informix the company bought out Ardent, but six months later the Informix board was gone, replaced entirely by Ardent people. The rumour is that IBM bought Informix for the Informix database, only to discover that the company's primary product by then was U2.

And as you say about Postgres, I don't know anything about it but I understood it could handle multivalue columns sort of thing. If you're going to be strictly relational, however, that's not allowed :-) Postgres is moving away from a pure relational db to a more NFNF model. Pick was there first ... :-)

Cheers,
Wol

Informix and IBM

Posted Dec 21, 2012 18:21 UTC (Fri) by markhb (guest, #1003)

> You're right about Ardent acquiring the Uni products, but in reality, Ardent took over Informix. Yes, I know Informix the company bought out Ardent, but six months later the Informix board was gone, replaced entirely by Ardent people. The rumour is that IBM bought Informix for the Informix database, only to discover that the company's primary product by then was U2.
Another rumor I heard, from a consultant who knew a lot of people in IBM, was that when they bought Informix they did, in fact, plan to merge the IDS (or Universal Server) tech into DB2, only to find that the Informix stuff was so far ahead of where DB2 was that they couldn't make it happen.

Cool new Free software

Posted Dec 21, 2012 15:51 UTC (Fri) by Wol (guest, #4433)

Following up to myself, let's take a quick look at that invoice, with ten items.

Let's start with the invoice and delivery addresses. Are they attributes of the invoice, stored in the invoice record, or relations to a WORM table of locations? As far as Pick is concerned, it doesn't care, it can store a foreign key or the location itself. Okay, the same is true of relational, but depending on how relational physically stores the data, it may have an impact later on.

Now the line items. Are they an attribute of the invoice, an attribute of some ledger, or an entity in their own right? I'm inclined to make them entities in their own right, not knowing enough about accounting off the top of my head to make the best call. I *could* make them an attribute of the invoice.

Now to save it all. Assuming the addresses all exist on file, that's one write for the invoice record and ten writes for the ten line items (if the line items were invoice attributes, I could have written the lot in just ONE write). Eleven atomic writes, wrapped in a transaction.

In relational, however, I have to add a row to the invoice table. Ten rows to the line item table. And update the line-item index on invoice. That's over and above the fact that I have to muddle data and metadata in the line item table - creating some random field I can sort on to return the line items in the correct order (in Pick, I simply store a *list* of line-item keys in the invoice record). So relational has the extra overhead of more "data" to store, and (unless it's willing to incur a massive hit on reading) the overhead of updating a whole bunch of indexes. The same eleven writes of data (with no option to reduce it to one) plus a bunch of indexes.
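The write counts above can be tallied in a short sketch. This is only an illustration of the arithmetic in the post: the record layout, the key format, and the `index_updates` parameter are assumptions, not a description of any real engine.

```python
# Illustrative only: names and the index_updates parameter are assumptions.

# Pick-style invoice record: the ordered list of line-item keys lives in the
# record itself, so no separate sort column is needed to recover line order.
invoice = {
    "id": "INV1001",
    "line_item_keys": [f"INV1001*{n}" for n in range(1, 11)],
}

def pick_writes(line_items, nested):
    """One write if line items are invoice attributes, else one per record."""
    return 1 if nested else 1 + line_items

def relational_writes(line_items, index_updates=1):
    """One invoice row, one row per line item, plus the assumed index updates."""
    return 1 + line_items + index_updates

n = len(invoice["line_item_keys"])
print(pick_writes(n, nested=False))  # 11
print(pick_writes(n, nested=True))   # 1
print(relational_writes(n))          # 12
```

With a single index update assumed, the relational count is twelve; every further index on either table adds to it.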

Now, let's assume we come back a week later and want to print off the invoice. I'll ignore how we obtain the invoice number. In Pick, we have ONE read for the invoice record, TWO reads for the addresses, and TEN reads for the line items: thirteen reads. By the way, a read is defined as a SINGLE disk seek instruction. Statistics tell me the engine is going to make one mistake, so I need to make 14 seeks.

In relational, however, I guess I need to read the invoice table index to find out where to find the invoice. That's two seeks minimum. Then I need to read the two addresses. Another four seeks. Then the index on the line item table followed by the line items. That's eleven seeks, assuming the location is stored in that index or twenty-one if it isn't. I make that 17 *minimum*, probably a lot more.
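The seek arithmetic for both cases can be written out as follows. The function names and parameters are invented for this sketch, and the per-step costs are simply the ones assumed in the post (one seek per Pick read, an index seek plus a row seek for relational).

```python
# Illustrative seek counts under the post's assumptions; not measurements.

def pick_seeks(addresses=2, line_items=10, misses=1):
    """One seek per record read by primary key, plus the expected misses."""
    invoice = 1
    return invoice + addresses + line_items + misses

def relational_seeks(addresses=2, line_items=10, location_in_index=True):
    """Index seek plus row seek for each lookup, as assumed in the post."""
    invoice = 2                # index, then row
    addr = 2 * addresses       # index, then row, for each address
    if location_in_index:
        lines = 1 + line_items         # one index read, then each row
    else:
        lines = 1 + 2 * line_items     # extra seek to locate each row
    return invoice + addr + lines

print(pick_seeks())                                 # 14
print(relational_seeks())                           # 17
print(relational_seeks(location_in_index=False))    # 27
```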

Remember I said Pick optimises retrieving data from disk?

What if I made a mistake and stored line items as an invoice attribute when I shouldn't? I end up with the equivalent of the relational line item table, clustered by invoice number. Given that relational has to guess how best to cluster data, chances are my arrangement is just as good :-)

At the end of the day, as soon as we start arguing performance, I have a MASSIVE advantage over you. The relational model explicitly forbids you knowing the internal structure of the database, so that the engine can optimise it as best it sees fit. As an application programmer, I know *exactly* how Pick is storing its data at the disk level. There's a reason why Pick doesn't have a query optimiser - it's a fairly trivial exercise in logic to prove that disk access is so efficient (approx 97%) that any attempt to optimise it will cost more than it saves. Pick enforces primary keys. The primary key enables Pick to calculate the location of any item on disk. The Pick data structure pretty much enforces logically tightly coupled attributes to be physically tightly coupled on disk. The ability to store a LIST of foreign keys in a single atomic record eliminates the need for many indices (and because it's a LIST eliminates the need to muddle data and metadata).

In Pick's worst-case scenario (provided the data has been normalised), it degrades to a weakly optimised relational scenario. (The enforcement of primary keys provides some indexing.) In Pick's typical scenario, any half-way complex query is going to leave relational in the dust. That P90 query I mentioned? I bet those Oracle consultants were adding indexes up the wazoo to try and improve performance. The Pick query was probably thrown together in five minutes, and because it was pretty much solely hunting down pre-known primary keys, could go straight to the data it wanted without needing to search for it on disk.

If you want to know how Pick finds data so fast - http://en.wikipedia.org/wiki/Linear_hashing - given a known primary key, it takes on average 1.05 requests to disk to find what you're looking for!
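For readers who don't want to follow the link, here is a toy in-memory sketch of linear hashing, showing how a bucket address is computed directly from the key with no index lookup. The class name, the bucket sizes, and the use of CRC32 as the hash function are all assumptions made for illustration; this is not a claim about how any particular Pick implementation lays out its files.

```python
import zlib

class LinearHashFile:
    """Toy linear hash table; each list in self.buckets stands in for a disk group."""

    def __init__(self, initial_buckets=4, bucket_capacity=4):
        self.n0 = initial_buckets
        self.level = 0            # number of completed doubling rounds
        self.split = 0            # next bucket due to be split
        self.capacity = bucket_capacity
        self.buckets = [[] for _ in range(initial_buckets)]

    def _hash(self, key):
        return zlib.crc32(key.encode("utf-8"))

    def address(self, key):
        """Compute the bucket number for a key from the key alone."""
        h = self._hash(key)
        b = h % (self.n0 * 2 ** self.level)
        if b < self.split:        # that bucket was already split this round
            b = h % (self.n0 * 2 ** (self.level + 1))
        return b

    def put(self, key, value):
        bucket = self.buckets[self.address(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)
                return
        bucket.append((key, value))
        if len(bucket) > self.capacity:   # overflow triggers one split
            self._split_next()

    def get(self, key):
        for k, v in self.buckets[self.address(key)]:
            if k == key:
                return v
        raise KeyError(key)

    def _split_next(self):
        """Split the bucket at the split pointer, growing the file by one bucket."""
        victim = self.split
        entries = self.buckets[victim]
        self.buckets[victim] = []
        self.buckets.append([])
        self.split += 1
        if self.split == self.n0 * 2 ** self.level:
            self.level += 1       # round complete: table has doubled
            self.split = 0
        for k, v in entries:      # rehash only the split bucket's entries
            self.buckets[self.address(k)].append((k, v))

table = LinearHashFile()
for i in range(1000):
    table.put(f"INV{i}", i)
print(table.get("INV42"))  # 42
```

The file grows one bucket at a time, so no insert ever has to rehash the whole table, and a lookup goes to one computed bucket, with extra work only when that bucket has overflowed.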

Cheers,
Wol

Copyright © 2013, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds