Why not just dump relational?
Posted Jul 25, 2008 12:59 UTC (Fri) by pboddie (guest, #50784)
In reply to: Why not just dump relational? by Wol
Parent article: Drizzle: a lighter MySQL
> Post-relational and MultiValue are synonymous.
Buzzwords aplenty from the Pick camp, I see.
> Oh - and I said why you would throw away the query optimiser - the cost isn't worth it.
You then go on to reveal very little to substantiate this claim, other than claiming that just throwing everything into a single table, because it's multivalued, will make everything super-efficient. Whether this is useful either for larger data sets or for not-strictly-hierarchical data is questionable.
I had the "fortune" to use UniData for a while. The query language was badly defined and the mechanisms for editing the data were archaic: "top bit set" characters for "column delimiters", anyone? That many such products have been vacuumed up by IBM and are being milked for support revenues is hardly an accident of history, in my opinion.
Posted Jul 25, 2008 15:00 UTC (Fri) by Wol (subscriber, #4433)
Why not just dump relational?
The point of a query optimiser is to speed things up. And I'm not saying "throw everything
into a single table". I liken MV to C (and relational to Pascal). C is very loose and lets you
shoot yourself in the foot (as does MV). Pascal, done properly, is so rigid it's a horrible
straitjacket. If you can program C with the rigours of Pascal then you get the advantages of
both worlds.
Anyway, back to query optimisation. If I want to select a row based on an attribute value,
then MV versus relational is irrelevant: I have to scan the entire table (or use an index,
which is effectively selecting row(s) by key from an index table). So selecting by value is
equally expensive in MV and relational.
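That trade-off is easy to see in SQLite's query plans: with no index, a value predicate forces a full scan of the table; add an index and the same predicate becomes a keyed search. A minimal sketch (the table and column names here are made up):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, attr TEXT)")

# Without an index on attr, a value predicate must scan the whole table.
plan = con.execute("EXPLAIN QUERY PLAN SELECT * FROM t WHERE attr = 'x'").fetchall()
print(plan[0][3])   # a "SCAN" of t

# With an index, the same predicate becomes a keyed lookup into the index.
con.execute("CREATE INDEX t_attr ON t (attr)")
plan = con.execute("EXPLAIN QUERY PLAN SELECT * FROM t WHERE attr = 'x'").fetchall()
print(plan[0][3])   # a "SEARCH" using t_attr
```

The exact plan wording varies between SQLite versions, but the scan-versus-search distinction is the point.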
Selecting by key is very fast in MV - as I said, accessing a record by key requires on average
1.05 table accesses. If your optimiser costs you more than 5%, throw it away! In MV also (with
a properly designed database :-), accessing the record will retrieve off the disk every bit of
information that shares the same simple primary key.
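That "1.05 accesses" figure comes from Pick-style hashed files: the key hashes straight to a bucket, and the whole multivalued record lives in that bucket, so one read returns everything about the entity. A toy model of the idea (all names invented here, not any real MV API):

```python
# Toy model of an MV hashed file: the key hashes to a bucket, and the
# bucket holds the complete record -- every multivalued attribute --
# so a single bucket access retrieves everything about the entity.
class HashedFile:
    def __init__(self, modulo=13):
        self.buckets = [[] for _ in range(modulo)]

    def _bucket(self, key):
        return self.buckets[hash(key) % len(self.buckets)]

    def write(self, key, record):
        b = self._bucket(key)
        b[:] = [(k, r) for k, r in b if k != key] + [(key, record)]

    def read(self, key):
        # One hash, one bucket read; only rare bucket overflow costs more,
        # which is where the "on average 1.05" comes from.
        for k, record in self._bucket(key):
            if k == key:
                return record
        return None

inv = HashedFile()
# One record per entity; the order lines are a multivalued attribute.
inv.write("ORD-1001", {"customer": "Smith", "lines": [("widget", 4), ("bolt", 100)]})
print(inv.read("ORD-1001")["lines"])
```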
In a relational database, there is no guarantee that your data has a primary key. If you
access an attribute that can have multiple values (ie a one-to-many relationship with the
primary table), there's no guarantee that related values will be retrieved at the same time.
Etc etc. The thing is, in a properly designed MV database, a single table access will retrieve
all the information about a single instance of an entity. And a relational optimiser really
scores when retrieving information about a single instance of an entity from across multiple
tables. Problem is, it's hard to score against an adversary that *consistently* gets it right
"first time every time". If you're reading ONE record from a table of two hundred records, a
totally unoptimised relational database will require an average of 100 table accesses, making
a 10000% cost of your optimiser acceptable (of course, I would be very surprised to find such
a database actually exists in the real world :-)
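Spelling out the arithmetic in that example (numbers taken from the paragraph above):

```python
# Reading ONE record from a 200-row table by an unoptimised full scan:
# on average you examine half the rows before finding it.
rows = 200
unoptimised_accesses = rows / 2   # 100 accesses on average
mv_accesses = 1.05                # hashed-key lookup, per the comment

# An optimiser that cuts 100 accesses down to ~1 can itself cost up to
# roughly 100x (the "10000%") of the optimised query and still break even.
break_even_overhead = unoptimised_accesses / mv_accesses
print(round(break_even_overhead, 1))  # -> 95.2
```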
I used INFORMATION and now use UniVerse - I think the query language is similar. It's very
English-like - indeed one variant is still called ENGLISH, I believe. Yes it takes some
getting used to. But SQL ... that's AWFUL. Yes, it's logical. But it gets horribly nested,
convoluted and heaven-knows-what. Give me the Pick query language any day :-)
What made you think Pick was hierarchical? I'll give you that - again in a properly designed
MV database - each entity definition should be hierarchical. But the links between entities
should be relational. Which gives you all the *speed* of a hierarchical database, and all the
*flexibility* of a relational database. Yes - relational is flexible - but it pays for it in
speed.
Properly designed, MV has all the speed of hierarchical. But you can choose ANY entity as the
peak of your hierarchy, giving it all the flexibility of relational. And - apart from when you
do a "find me this" query - all your data access is based on primary keys that normally hit
"first time, every time". That's blindingly fast compared to a paradigm that says "always
SELECT FROM". Yes I know relational doesn't do that under the covers, but I'd rather a model
that explicitly guarantees me instant response than one that guarantees me nothing but
typically does pretty well - I've come across enough horror stories about upgrades to the
optimisation engine "breaking" carefully crafted code and turning a racehorse into a snail.
Cheers,
Wol
Posted Jul 26, 2008 2:04 UTC (Sat) by alankila (guest, #47141)
Why not just dump relational?
> Yes I know relational doesn't do that under the covers, but I'd rather a
> model that explicitly guarantees me instant response than one that
> guarantees me nothing but typically does pretty well - I've come across
> enough horror stories about upgrades to the optimisation engine
> "breaking" carefully crafted code and turning a racehorse into a snail.
Let's just say that the flip side of the optimizer is that you do not have to think about how
to get the data yourself. You declare how the things are related, and using SQL describe what
the result set is, and leave it to the database to work out how to satisfy that query.
It should be clear that this argument requires that we have something more complicated than
"select * from table where id = :x" in mind. As an example, I once wrote a query that produced
a report of top 20 products for all vendors monthly, sorting the vendors alphabetically, the
products by their rank (determined by number of items sold), and also reported the number of
those items sold monthly. The database was a star schema of facts, one row describing an item
x sold at time z, another describing the relationship from item to vendor.
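A query of that shape can be sketched with SQLite and window functions (the schema and data here are invented for illustration, and I've cut the top 20 down to a top 2 to keep the example small; the original was presumably a different database entirely):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE vendors (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE items   (id INTEGER PRIMARY KEY, name TEXT,
                      vendor_id INTEGER REFERENCES vendors(id));
CREATE TABLE sales   (item_id INTEGER REFERENCES items(id),
                      month TEXT, qty INTEGER);   -- the fact table

INSERT INTO vendors VALUES (1, 'Acme'), (2, 'Zenith');
INSERT INTO items VALUES (1, 'bolt', 1), (2, 'nut', 1),
                         (3, 'washer', 1), (4, 'gear', 2);
INSERT INTO sales VALUES (1, '2008-06', 5), (1, '2008-06', 3),
                         (2, '2008-06', 4), (3, '2008-06', 10),
                         (4, '2008-06', 7);
""")

# Vendors alphabetically, products ranked by units sold -- one
# declarative query; the planner decides how to satisfy it.
rows = con.execute("""
    WITH totals AS (
        SELECT v.name AS vendor, i.name AS item, SUM(s.qty) AS sold
        FROM sales s
        JOIN items i   ON s.item_id = i.id
        JOIN vendors v ON i.vendor_id = v.id
        WHERE s.month = '2008-06'
        GROUP BY v.name, i.name
    ),
    ranked AS (
        SELECT vendor, item, sold,
               RANK() OVER (PARTITION BY vendor ORDER BY sold DESC) AS rnk
        FROM totals
    )
    SELECT vendor, item, sold FROM ranked
    WHERE rnk <= 2
    ORDER BY vendor, rnk
""").fetchall()
print(rows)  # -> [('Acme', 'washer', 10), ('Acme', 'bolt', 8), ('Zenith', 'gear', 7)]
```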
So the optimizer works out which end of the query looks like least work, where a temporary
index might be needed, what rows can be read off purely from index, if any of the subresults
are available in materialized query tables, etc.
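One of those decisions is visible even in SQLite: give it a covering index and the planner will answer the query "purely from index", never touching the table. A small sketch (the fact table here is invented):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE facts (item INTEGER, month TEXT, qty INTEGER)")
con.execute("CREATE INDEX facts_cover ON facts (month, item, qty)")

# Every column the query touches is in the index, so the planner can
# satisfy it from the index alone ("COVERING INDEX" in the plan).
plan = con.execute(
    "EXPLAIN QUERY PLAN SELECT item, SUM(qty) FROM facts "
    "WHERE month = '2008-06' GROUP BY item"
).fetchall()
print(plan[0][3])
```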
To me, it was faster to write the SQL than work out the individual queries and the code in the
host programming language (Perl, in my case) to calculate that same statistic. In principle,
this way is the most efficient as well: once the database knows what I want, it is in a
position to do the least work required to satisfy that query.