
Why not just dump relational?

Posted Jul 24, 2008 15:28 UTC (Thu) by alankila (guest, #47141)
In reply to: Why not just dump relational? by Wol
Parent article: Drizzle: a lighter MySQL

MaVerick's last release was in 2006, and its 0.0.x-style version number indicates it is not
ready for use. It seems safe to say that it's dead.

The OpenQM website has very little information on how exactly you would go about programming
with it. That product's website is more a sales device than a developer resource.

The term "post-relational" might be synonymous with "multivalue". And multivalue simply means,
in relational terms, that a row can hold more than one value per column. Even postgres supports
this, so should postgres now be considered post-relational?

UniVerse database is billed as "relational" database on the IBM site. Not post-relational.

In other words, I am unable to get a handle on what exactly you are talking about. Why exactly
is it a good idea to throw away a query optimizer and how is it even possible that you
wouldn't need one?



Why not just dump relational?

Posted Jul 25, 2008 11:40 UTC (Fri) by Wol (subscriber, #4433)

Ummm - I didn't know that MaVerick hasn't done much lately! I need to check, but there
were people using it, and you might find that sourceforge is more up-to-date (or you might
not). I know it was maintained for a long time after the official release.

Post-relational and MultiValue are synonymous. But no, it doesn't mean "multiple values per
row", it means "multiple values per CELL". So, to take an infamous example from
comp.databases.pick, if I have a CARS table, in the COLOUR field I can store *two* values (or
more) if I want - say my car is "black and green", I can store both "black" and "green" in the
same column, all in the one row that is my car in the table.
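The CARS example above can be sketched in ordinary code (this is a rough illustration, not a real MV database; the table, field, and key names are invented):

```python
# Sketch of a MultiValue record: any field ("attribute") may itself hold
# a list of values, so the COLOUR cell of one CARS record can hold both
# "black" and "green" at once, all in the one row.
cars = {
    "CAR42": {                        # record, keyed by primary key
        "MAKE": "Lotus",              # single-valued field (invented data)
        "COLOUR": ["black", "green"]  # multi-valued field: two values, one cell
    }
}

record = cars["CAR42"]        # one direct retrieval by primary key
print(record["COLOUR"])       # ['black', 'green']
```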

UniVerse is probably billed as relational as a marketing tool ...

Oh - and I said why you would throw away the query optimiser - the cost isn't worth it.
Bearing in mind - taking my car as an example - that all the information about my car is
stored in one "row" (or RECORD as it's called in the MV world), what is there to optimise?
Once I know the primary key, I can retrieve everything in one hit. Oh - and that information
will include the primary key(s) of the owner(s) so I can then do a direct retrieval on them -
and so on down the chain. It's a pretty safe statistical bet that if I access one piece of
information about a car, I will want other information about the same car. It's less likely I
will want information about other cars (and which car would that be?). And even less likely
that I will want information about entities related to the car - the cost of optimising that
is bound to be more than just retrieving the foreign key from the car record and doing a
direct access.

Think about it. In real life, the relational optimiser will optimise access to either (a) data
that MultiValue would store in the same record, or (b) data that might be wanted but
statistically isn't worth prefetching.

And the filing system is (usually) a self-balancing hash so (even on an unmaintained system!)
a direct retrieval from a known primary key takes, on average, about 1.05 table accesses to
get the data.

NB Think XML. The data storage format inside a MultiValue database is pretty much a mirror of
XML. There are some differences (MultiValue pre-dates Relational!) but the basic philosophy is
identical and the mapping is trivial. So if you want to store XML in a database, try using
MultiValue. It may take YOU a little effort to get it to map cleanly, but it won't give the
mvdbms any problems at all.

Cheers,
Wol

Why not just dump relational?

Posted Jul 25, 2008 12:11 UTC (Fri) by epa (subscriber, #39769)

Indeed, multi-value databases in this sense predate the relational model.  You might as well
talk of 'pre-relational' database systems as 'post-relational'.

Why not just dump relational?

Posted Jul 25, 2008 12:59 UTC (Fri) by pboddie (guest, #50784)

Post-relational and MultiValue are synonymous.

Buzzwords aplenty from the Pick camp, I see.

Oh - and I said why you would throw away the query optimiser - the cost isn't worth it.

You then go on to reveal very little to substantiate this claim, other than claiming that just throwing everything into a single table, because it's multivalued, will make everything super-efficient. Whether this is useful either for larger data sets or for not-strictly-hierarchical data is questionable.

I had the "fortune" to use UniData for a while. The query language was badly defined and the mechanisms for editing the data were archaic: "top bit set" characters for "column delimiters", anyone? That many such products have been vacuumed up by IBM and are being milked for support revenues is hardly an accident of history, in my opinion.

Why not just dump relational?

Posted Jul 25, 2008 15:00 UTC (Fri) by Wol (subscriber, #4433)

The point of a query optimiser is to speed things up. And I'm not saying "throw everything
into a single table". I liken MV to C (and relational to Pascal). C is very loose and lets you
shoot yourself in the foot (as does MV). Pascal, done properly, is so rigid it's a horrible
straitjacket. If you can program C with the rigours of Pascal then you get the advantages of
both worlds.

Anyway, back to query optimisation. If I want to select a row based on an attribute value,
then MV or relational is irrelevant, I have to scan the entire table (or use an index, which
is effectively selecting row(s) by key from an index table). So, selecting by value is equal
in MV and relational, it's expensive.

Selecting by key is very fast in MV - as I said, accessing a record by key requires on average
1.05 table accesses. If your optimiser costs you more than 5%, throw it away! In MV also (with
a properly designed database :-), accessing the record will retrieve off the disk every bit of
information that shares the same simple primary key.

In a relational database, there is no guarantee that your data has a primary key. If you
access an attribute that can have multiple values (ie a one-to-many relationship with the
primary table), there's no guarantee that related values will be retrieved at the same time.
Etc etc. The thing is, in a properly designed MV database, a single table access will retrieve
all the information about a single instance of an entity. And a relational optimiser really
scores when retrieving information about a single instance of an entity from across multiple
tables. Problem is, it's hard to score against an adversary that *consistently* gets it right
"first time every time". If you're reading ONE record from a table of two hundred records, a
totally unoptimised relational database will require an average of 100 table accesses, making
a 10000% cost of your optimiser acceptable (of course, I would be very surprised to find such
a database actually exists in the real world :-)

I used INFORMATION and now use UniVerse - I think the query language is similar. It's very
English-like - indeed one variant is still called ENGLISH, I believe. Yes it takes some
getting used to. But SQL ... that's AWFUL. Yes, it's logical. But it gets horribly nested,
convoluted and heaven-knows-what. Give me the Pick query language any day :-)

What made you think Pick was hierarchical? I'll give you that - again in a properly designed
MV database - each entity definition should be hierarchical. But the links between entities
should be relational. Which gives you all the *speed* of a hierarchical database, and all the
*flexibility* of a relational database. Yes - relational is flexible - but it pays for it in
speed.

Properly designed, MV has all the speed of hierarchical. But you can choose ANY entity as the
peak of your hierarchy, giving it all the flexibility of relational. And - apart from when you
do a "find me this" query - all your data access is based on primary keys that normally hit
"first time, every time". That's blindingly fast compared to a paradigm that says "always
SELECT FROM". Yes I know relational doesn't do that under the covers, but I'd rather a model
that explicitly guarantees me instant response than one that guarantees me nothing but
typically does pretty well - I've come across enough horror stories about upgrades to the
optimisation engine "breaking" carefully crafted code and turning a racehorse into a snail.

Cheers,
Wol

Why not just dump relational?

Posted Jul 26, 2008 2:04 UTC (Sat) by alankila (guest, #47141)

> Yes I know relational doesn't do that under the covers, but I'd rather a
> model that explicitly guarantees me instant response than one that 
> guarantees me nothing but typically does pretty well - I've come across 
> enough horror stories about upgrades to the optimisation engine 
> "breaking" carefully crafted code and turning a racehorse into a snail.

Let's just say that the flip side of the optimizer is that you do not have to think about how
to get the data yourself. You declare how the things are related, and using SQL describe what
the result set is, and leave it to the database to work out how to satisfy that query.

It should be clear that this argument requires that we have something more complicated than
"select * from table where id = :x" in mind. As an example, I once wrote a query that produced
a report of top 20 products for all vendors monthly, sorting the vendors alphabetically, the
products by their rank (determined by number of items sold), and also reported the number of
those items sold monthly. The database was a star schema of facts, one row describing an item
x sold at time z, another describing the relationship from item to vendor.

So the optimizer works out which end of the query looks like least work, where a temporary
index might be needed, what rows can be read off purely from index, if any of the subresults
are available in materialized query tables, etc.

To me, it was faster to write the SQL than work out the individual queries and the code in the
host programming language (Perl, in my case) to calculate that same statistic. In principle,
this way is the most efficient as well, as when the database knows what I want, then it is in
the position to do the least work required to satisfy that query.
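A report of that general shape can be sketched with a window function, leaving the access path to the database. This is a hedged illustration using SQLite as a stand-in (the original poster's schema is unknown; the tables, columns, and data here are invented, and window functions require SQLite 3.25 or later):

```python
import sqlite3

# Tiny stand-in for a star schema: one table relates items to vendors,
# another records individual sales.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE item_vendor (item TEXT, vendor TEXT);
    CREATE TABLE sale (item TEXT, month TEXT);
""")
db.executemany("INSERT INTO item_vendor VALUES (?, ?)",
               [("widget", "Acme"), ("gadget", "Acme"), ("doohickey", "Zenith")])
db.executemany("INSERT INTO sale VALUES (?, ?)",
               [("widget", "2008-07")] * 3 + [("gadget", "2008-07")] * 5
               + [("doohickey", "2008-07")] * 2)

# One declarative query: rank items within each vendor by units sold,
# keep the top N (here 2), vendors alphabetical, items by rank.
rows = db.execute("""
    SELECT vendor, item, sold FROM (
        SELECT v.vendor, v.item, COUNT(*) AS sold,
               RANK() OVER (PARTITION BY v.vendor
                            ORDER BY COUNT(*) DESC) AS rnk
        FROM sale s JOIN item_vendor v ON v.item = s.item
        GROUP BY v.vendor, v.item
    ) WHERE rnk <= 2
    ORDER BY vendor, sold DESC
""").fetchall()
print(rows)
# [('Acme', 'gadget', 5), ('Acme', 'widget', 3), ('Zenith', 'doohickey', 2)]
```

The query states only what the result set is; the optimizer decides join order and whether to scan or use indexes.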

Why not just dump relational?

Posted Jul 26, 2008 1:31 UTC (Sat) by alankila (guest, #47141)

> Post-relational and MultiValue are synonymous. But no, it doesn't mean
> "multiple values per row", it means "multiple values per CELL".

My bad. I was talking about this thing, with postgres's Array types.

http://www.postgresql.org/docs/8.0/interactive/arrays.html

So, now we have established postgres as either pre- or post-relational, or possibly both...
;-)

Your talk about the query optimizer's unworthiness is unconvincing to me. Not all projections
are about fetching a single entity. Some queries are about fetching aggregates such as
statistics of all the rows in the database. It is not useless to have these queries run well,
too.

It would seem advantageous to be able to restrict the amount of data fetched about an entity to
the absolute minimum. For instance, in many databases you can build an internally maintained
"clone" of a table by indexing a set of its columns. When a query concerning that table
requesting only those columns is executed, chances are that the optimizer chooses to read the
data from the index instead of the table proper.

Why not just dump relational?

Posted Jul 27, 2008 3:30 UTC (Sun) by jlokier (guest, #52227)

Multiple values per cell sounds like Amazon's "cloud" database, SimpleDB.
Actually the rest of your description sounds like SimpleDB too.

It strikes me as a good idea: easy to understand, and a good fit for a lot of web app
requirements.

But not all of the common ones.  Things like "show me all items matching criteria X and a
summary of their details", or the same but "50 per page sorted by Foo" - and show me how many
pages that comes to - are quite common requirements.

If that comes to 10000 items, it's going to be really slow doing it by simply scanning a
key->multi-value database and filtering/sorting in the app, and especially if you have to do
it every time they click "next page".  Whereas a SQL using window functions, and a backend
table/index which stores cumulative values as well as direct values, can do the same queries
rather fast.  (Not that all SQL databases do such things.)
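The "50 per page, plus how many pages" pattern can be done in one query with a window function: the full match count rides along with each page of rows. A hedged sketch, again with SQLite as a stand-in (schema and data invented; windows need SQLite 3.25+):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE item (id INTEGER PRIMARY KEY, foo TEXT)")
db.executemany("INSERT INTO item (foo) VALUES (?)",
               [(f"item-{i:03d}",) for i in range(12)])

page_size, page = 5, 2          # fetch the second page of five
rows = db.execute("""
    SELECT id, foo, COUNT(*) OVER () AS total   -- total computed before LIMIT
    FROM item
    ORDER BY foo
    LIMIT ? OFFSET ?
""", (page_size, (page - 1) * page_size)).fetchall()

total = rows[0][2]
pages = -(-total // page_size)  # ceiling division for "page X of Y"
print(len(rows), total, pages)  # 5 12 3
```

One round trip gives both the page contents and the figure for the "page 2 of 3" footer, instead of scanning and counting in the app.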

OTOH for something like storing cookie state, user preferences etc., key->value type of
database is fine and nice and simple.

> Once I know the primary key, I can retrieve everything in one hit. Oh -
> and that information will include the primary key(s) of the owner(s) so
> I can then do a direct retrieval on them - and so on down the chain.

Round-trip latency for each retrieval in the chain?

I've done something similar with a web page built out of data stored in a very simple
name->value database.  The database round trip was about 1ms (over a network).

But the latencies added up, in our case to about 30s per page (yes, thousands of round trips),
due to chasing dependencies like this in the app: "fetch all items in range A..B (to make a
table of results in a web page).  [get answer].  for each item, using its "current mode"
property just retrieved, fetch info about current mode C."

With SQL you typically fold dependency chains into the single query so the DB engine chases
them while it reads the file, merging and reordering block reads to minimise seeks, and all
the answers come in bulk with a single network round trip.  There are other methods, some of
them better, and different programming methods, but none of them really simple.

> And the filing system is (usually) a self-balancing hash so (even on an
> unmaintained system!)  a direct retrieval from a known primary key
> takes, on average, about 1.05 table accesses to get the data.

If you just do a direct access to a single primary key with a SQL database, if it has a
half-decent query implementation, it will access the data in similar steps and won't spend any
time optimising such a simple query.

A hash isn't always best, even though it takes only 1.05 table accesses.  If you want to
retrieve 1000 answers per second, and your table is bigger than RAM, that will not work with
hash and a magnetic disk...  You'll be lucky to get 100 answers per second.

But if the 1000 answers per second have some locality properties, then it can work with other
structures which maintain locality, like trees.

Locality is noticeably important in Google's BigTable documentation as one way to improve
performance.

Even in a RAM based database, this can make a performance difference, due to difference in
speed between L1 cache, L2 cache and main RAM.

And decent SQL databases have balanced hashes as an option anyway :-)

> all the information about my car is stored in one "row" (or RECORD as
> it's called in the MV world), what is there to optimise?

The reason column-oriented databases have a niche in which they perform better than
row-oriented databases, is that sometimes it's a performance _downgrade_ to keep all the data
related to "my car" in the same place on a disk :-)

This is even true when you have big disks - scanning many records to calculate aggregate
values is still faster if you can read less.  And for single-value lookups, you can cache
more.

Summary: I agree with you that relational isn't always a good model.  I think XML-style is a
better fit for a lot of data.  But don't throw out the SQL model of giving the database
information about the structure of your queries and letting it batch and sort them together.
Round trip times and not parallelising queries adds up to be a big deal with queries which
involve a lot of data, even if it's just intermediate data (as in the "50 items per page"
example).


Copyright © 2025, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds