
Drizzle: a lighter MySQL

MySQL founder Michael Widenius announces the launch of the Drizzle project. "Drizzle is a smaller, slimmer and (hopefully) faster version of MySQL; Features that the broad Drizzle community does not want or need are now removed or in the process of being removed (This includes stored procedures, views, triggers, grants, some non-pluggable storage engines and more)." It also, apparently, is intended to be developed in a more community-oriented manner, "A bit like Fedora does to RedHat."


Drizzle: a lighter MySQL

Posted Jul 23, 2008 15:05 UTC (Wed) by zeridon (guest, #46234) [Link] (3 responses)

For most of the code I don't feel sorry, but the official site at least states that the query
cache will be removed. And that is going to hurt (at least for me). Some of my applications
depend heavily on its performance.

There also stands the question of compatibility with MySQL, and whether it could be a drop-in
replacement for the original MySQL (of course, only if your code does not depend on the removed features).

Drizzle: a lighter MySQL

Posted Jul 23, 2008 15:22 UTC (Wed) by elanthis (guest, #6227) [Link] (1 response)

The query cache can in many cases just be moved client-side.  Very few web applications need
anything more.
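For simple read-mostly workloads, a client-side cache can be a small memoization layer in the application. A minimal sketch (table, query, and cache policy all invented for illustration; a real cache needs expiry and invalidation on writes):

```python
import sqlite3

# Hypothetical client-side query cache: memoize result sets keyed by
# (SQL text, parameters).  Invalidation/expiry is left to the application.
_cache = {}

def cached_query(conn, sql, params=()):
    key = (sql, params)
    if key not in _cache:
        _cache[key] = conn.execute(sql, params).fetchall()
    return _cache[key]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?)", [(1, "a"), (2, "b")])

rows = cached_query(conn, "SELECT name FROM t WHERE id = ?", (1,))
again = cached_query(conn, "SELECT name FROM t WHERE id = ?", (1,))  # no DB hit
```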

I'm more sorry to see triggers and stored procedures go.  The reason most web applications
don't use them (IMO) is because MySQL's implementation of them is essentially worthless.  I've
been dying to make use of them, but MySQL just sucks too much to even bother.  Ah well.

Also, from what I understand, there aren't separate users anymore.  Which, if correct, means
Drizzle is only useful in cases where you can install a separate Drizzle instance for each
application/site.  Hopefully I just misunderstood what the removal of ACLs meant (that is,
hopefully they added in or plan to add in a simple user->database access control mechanism).

So far as MySQL compatibility... it's not.  They're reimplementing the client-side library,
the protocol, and a lot of language support is changing.  At the very least, you'd need to use
a new connector (not a huge issue for most developers, I suppose, as they'd just use a wrapper
like MDB2, DBI, etc.) and probably update your schema files.

Drizzle: a lighter MySQL

Posted Jul 23, 2008 16:58 UTC (Wed) by iabervon (subscriber, #722) [Link]

I think this is targeted at people with only one application or site. I'm thinking of people
who get complete control of a VM on a colo server to run their one thing. The extent to which
such people would even have multiple DB users is to have a limited-privilege user for the
application to prevent it from damaging the database structure.

Drizzle: a lighter MySQL

Posted Jul 23, 2008 15:43 UTC (Wed) by xorbe (guest, #3165) [Link]

"That said, Drizzle is not here to replace the normal MySQL server; Drizzle targets a limited
but important market and will thus help us enhance the MySQL-based offerings. Think of
Drizzle as the microkernel server around which other offerings/features can be developed."

MySQL is your continued solution for now... seems obvious reading the article.

Drizzle: a lighter MySQL

Posted Jul 23, 2008 15:11 UTC (Wed) by Sutoka (guest, #43890) [Link] (8 responses)

So I guess Drizzle is going to just be MySQL-lite but without being as -lite as SQLite? At the
same time, it's also meant to drive MySQL's development, so maybe calling it MySQLcore would
be a better way to think about it? :P

Hopefully this'll help to drive innovation on the DBMS world on *nix. It'd be interesting to
see it also move closer to the desktop environment worlds as well (SQLite is used by lots of
desktop applications, IIRC Apple uses PostgreSQL extensively on OSX), it'd probably help to
clean up that massive amount of little text files floating around that no one understands
anyways.

Drizzle: a lighter MySQL

Posted Jul 23, 2008 15:17 UTC (Wed) by louie (guest, #3285) [Link] (6 responses)

But those little text files require no thought on the part of the user: no daemon to run,
configure, and *secure*, so many fewer worries. So I certainly prefer apps I can install on my
server that use SQLite.

Drizzle: a lighter MySQL

Posted Jul 24, 2008 1:47 UTC (Thu) by elanthis (guest, #6227) [Link] (5 responses)

SQLite is hardly more secure.  It gives the app complete access to the entire database.  With
most "real" SQL servers, I can lock down an application to only be able to do what it needs to
do.  I can disable DROP, ALTER, and CREATE commands for the user, limit certain tables to
read-only (or write-only), etc.  With SQLite, you have none of that control, and the entire
database can be corrupted by a bad or hacked application.

I really can't think of any situation where letting an application directly read and write
data files is going to offer any security advantage over an application that has to talk
through a controlled protocol to access and update data.  Maybe if the data access were
significantly simpler than the protocol, but that certainly isn't the case for anything like a
SQL database file.

Don't get me wrong, SQLite is damn nice for a lot of reasons.  I just wouldn't list "security"
as one of them.

Drizzle: a lighter MySQL

Posted Jul 24, 2008 3:40 UTC (Thu) by Sutoka (guest, #43890) [Link] (4 responses)

I believe by 'secure' he was referring to the point that it isn't running as a daemon that
could potentially be exposed to other systems on the network (sane defaults by the distros
could mitigate the risk some).

Also, if there's one central PostgreSQL/MySQL server running on the system for all the desktop
users, an exploit in the server would expose other users' data, which wouldn't be possible if
each application simply used an SQLite database stored under ~.

In my OP I was thinking more about something a little higher than SQLite, without being as
high as the traditional DBMSs. If the 'server' was per-user, and connected to the user's other
applications by something that is already supposed to be trusted (like DBus's session bus?),
it could help ensure the SQLite db isn't lost as easily (say, keeping a copy around from the
end of the last session as well as a transaction log?).

Then again, what do I know? I (obviously :P) have little experience with these database
systems, and none from a development PoV.

Drizzle: a lighter MySQL

Posted Jul 24, 2008 5:25 UTC (Thu) by louie (guest, #3285) [Link] (3 responses)

Yeah, that's what I meant. If my application is compromised, or my file system is compromised,
then my data is hosed, whether it is in SQLite or mysql (or any other storage technique that I
know of). Agreed that sqlite doesn't add any layers of security there. The mysql daemon (any
database server, really) adds another attack vector, which as we all know is often exploited.
So given the lack of a daemon I think it is reasonable to say that sqlite is a more secure
solution.

Drizzle: a lighter MySQL

Posted Jul 24, 2008 7:46 UTC (Thu) by jamesh (guest, #1159) [Link] (2 responses)

There are varying levels at which an application can be compromised.  An attacker may work out
a way to run arbitrary SQL but not arbitrary code (an SQL injection attack).  In this case,
limiting what SQL can be executed will reduce the impact of the vulnerability even if the
database files are owned by the user running the application.
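The distinction matters in practice: with parameter binding, injected SQL is treated as a plain value, while naive string interpolation executes it. A minimal illustration (table and data invented; SQLite used only because it is easy to run):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, secret TEXT)")
conn.execute("INSERT INTO users VALUES ('alice', 's3cret')")

evil = "x' OR '1'='1"  # classic injection payload

# Naive string interpolation: the payload becomes part of the SQL, and
# the OR clause matches every row.
leaky = conn.execute(
    "SELECT secret FROM users WHERE name = '%s'" % evil).fetchall()

# Parameter binding: the payload is compared as a literal string and
# matches nothing.
safe = conn.execute(
    "SELECT secret FROM users WHERE name = ?", (evil,)).fetchall()
```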

Drizzle: a lighter MySQL

Posted Jul 24, 2008 16:30 UTC (Thu) by louie (guest, #3285) [Link] (1 response)

Unless you're djb, your software inevitably accrues features and picks up security problems as
a result. So I would never rely on 'my software is more secure because it does less.'

Drizzle: a lighter MySQL

Posted Jul 25, 2008 4:13 UTC (Fri) by jamesh (guest, #1159) [Link]

It isn't so much doing less as compartmentalising the code.  Programmers will make mistakes,
so doesn't it make sense to limit the damage that can occur when such a mistake is made?

If you have an SQL injection vulnerability, why does it have to be a data loss problem (if the
attacker can issue DROP TABLE) when it could just be information disclosure (and even that can
be limited).

And if we ignore the security aspect, restricting what an application can do can help pick up
programming errors.  If you have a log analysis application, it might only need to read from a
set of tables and not write to any tables.  Giving it only those permissions makes it obvious
if those expectations aren't met.
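Per-table GRANTs need a server, but the same fail-fast idea can be approximated even with a file database. A sketch using SQLite's read-only open mode (paths and schema invented): the analysis job physically cannot write, so a stray DELETE surfaces as an error instead of quietly changing the data.

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "logs.db")

# Some other process owns the data read-write.
rw = sqlite3.connect(path)
rw.execute("CREATE TABLE hits (url TEXT)")
rw.execute("INSERT INTO hits VALUES ('/index.html')")
rw.commit()
rw.close()

# The log-analysis job opens the same file read-only; any accidental
# write fails loudly.
ro = sqlite3.connect("file:%s?mode=ro" % path, uri=True)
count = ro.execute("SELECT count(*) FROM hits").fetchone()[0]
try:
    ro.execute("DELETE FROM hits")
    write_blocked = False
except sqlite3.OperationalError:
    write_blocked = True
```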

Drizzle: a lighter MySQL

Posted Jul 23, 2008 16:28 UTC (Wed) by nevyn (guest, #33129) [Link]

It's not even that; for instance SQLite has VIEWs (pretty much required if you do anything
normal-form-ish, IMO).

My hope is that they are experimenting with something that is useful for certain situations
but isn't SQL (ie. they'll change the QL to be something else too). Making another hacky SQL
DB doesn't seem worthwhile.

Drizzle: a lighter MySQL

Posted Jul 23, 2008 15:22 UTC (Wed) by tjc (guest, #137) [Link] (1 response)

Features that the broad Drizzle community does not want or need are now removed or in the process of being removed [snip]
Hopefully this is the beginning of a trend!

Drizzle: a lighter MySQL

Posted Jul 23, 2008 16:20 UTC (Wed) by rfunk (subscriber, #4054) [Link]

Yes, if they keep up this trend, before long they will have reverted to 
MySQL 1.0!

podcast segment on Drizzle

Posted Jul 23, 2008 17:22 UTC (Wed) by mduregon (guest, #3792) [Link]

http://www.linuxworld.com/podcasts/linux/2008/071808-linu...

'''
A lighter, simpler MySQL: Brian Aker

Brian Aker, a principal engineer for MySQL at Sun, explains the Drizzle project: a new,
stripped-down derivative of MySQL that relies on best-of-breed internal libraries, eliminates
support for obsolete platforms, and has already attracted contributors from outside Sun. 
'''

Drizzle: a lighter MySQL

Posted Jul 23, 2008 22:06 UTC (Wed) by flewellyn (subscriber, #5047) [Link] (2 responses)

So...they want to remove the things that make an RDBMS worthy of the name?

What next, are they going to compromise on proper transaction support with ACID compliance?

Drizzle: a lighter MySQL

Posted Jul 24, 2008 5:52 UTC (Thu) by gdt (subscriber, #6284) [Link] (1 response)

A RDBMS is hardly an optimal fit for web-based applications. RDBMSs are designed for small transactions from random sources with strong integrity needs. The basic assumption -- that all applications are like 1970s green screen banking systems -- isn't true of database-backed web sites. MySQL has benefited from this; for a long time it offered adequate integrity for DB-based web sites but inadequate integrity for traditional online transaction processing. There is plenty of scope for experimentation with database designs to find better fits to web applications, data warehousing, and image storage.

There's also plenty of scope for differing emphases. The traditional RDBMS minimises worst-case latency. This isn't the same as best average performance. The traditional RDBMS is disk-I/O oriented and designed with an eye to a reasonable use of disk space. But RAM is now available in bulk quantities, SSDs give an alternative to complex journaling designs, and disks are so massive that most past trade-offs of space vs performance are now wrong, while at the same time disk speed hasn't improved much at all.

There's a nice article on these points by some RDBMS pioneers at ACM Queue.

Drizzle: a lighter MySQL

Posted Jul 25, 2008 0:44 UTC (Fri) by flewellyn (subscriber, #5047) [Link]

A RDBMS is hardly an optimal fit for web-based applications.

REALLY? Because, y'know, I develop web-based applications for a living, and we use an RDBMS (PostgreSQL) as our data backend. And, the funny thing is, it all works quite nicely. It's fast, powerful, and flexible. And my company's applications (web-based GIS and image analysis) are quite a long ways from 1970s green-screen banking systems.

Interestingly, in the article you linked, Stonebraker didn't talk much about web-based applications. He was talking more about data warehouses and real-time stream processing applications. Most web applications that I've seen don't look anything like those.

Why not just dump relational?

Posted Jul 23, 2008 22:45 UTC (Wed) by Wol (subscriber, #4433) [Link] (13 responses)

After all, it's just maths. There's no evidence that it works in the real world (indeed
there's a lot of evidence it doesn't!)

And before you start flaming me, where is the MATHEMATICAL proof that Newton got his mechanics
wrong? If you look at Newton's maths, it's easy to prove that it's CORRECT. It's just that it
doesn't describe the real world as it is. It's easy to prove that relational theory is
mathematically CORRECT - it's just that it doesn't describe the real world.

Newton assumed that mass was constant. Now we know that it's interchangeable with energy -
e=mc^2. Codd and Date declared that data is conveniently two-dimensional - experience tells us
it isn't!

Try using a post-relational database, such as OpenQM, or MaVerick-dbms. They're a much better
fit to the real world.

Oh - and to the person who complained that Drizzle was dropping the query optimiser -
post-relational databases don't have query optimisers. Why? Because it can be (easily) proven
that the cost of implementing an optimiser is much greater than the maximum gain that could be
obtained from one.

My favourite war story is the Oracle consultants who proudly announced that, after six months
of hard work, their Oracle query was faster than the post-relational database it was
replacing. That is, their pride lasted for all of a couple of seconds - just long enough
for the "dinosaur" in charge of the post-relational database to point out that Oracle was
running on a twin-Xeon-800. It was a mere 10% faster than UniVerse (the post-relational
database) running on a Pentium 90.

And while the originator of the story didn't say, I'm prepared to bet the post-relational
query had been thrown together in 10 seconds with no attempt at optimisation.

Cheers,
Wol

Why not just dump relational?

Posted Jul 24, 2008 7:12 UTC (Thu) by jamesh (guest, #1159) [Link] (1 response)

While Newton's model might not have matched up with reality at extremes, it provides a useful
model for a large set of problems.  There are certainly cases where it will give the wrong
answers, but that doesn't invalidate its usefulness in other areas.

Similarly there are many problems that can usefully be expressed in the relational model,
which accounts for the success of SQL databases.  This isn't to say that it is a perfect fit
for all problems (I think I've heard the UniVerse example before), but it does have its uses.

Why not just dump relational?

Posted Jul 26, 2008 0:08 UTC (Sat) by Wol (subscriber, #4433) [Link]

I'd agree with you.

And I actually use relational theory to design my MultiValue databases. But just as Newton's
theories don't work on the very small, or very large, scale, so I find relational theory
doesn't work on the large scale (namely modelling any half-way complex system!). It's great
for modelling attributes of a single entity, and also the links between entities, but it fails
badly when imposed on a complete system.

"All data comes in rows and columns" - that's rule 1 I think. NO IT DOESN'T.

"All data is equal" - not sure which rule that is, but it immediately contradicts itself by
dividing data into relationships, keys (primary and otherwise), and attributes.

I could go on ... suffice it to say, I consider relational theory, as implemented in most
products, to be in serious breach of Einstein's maxim - "make things as simple as possible BUT
NO SIMPLER". The relational database has been simplified too far - to the extent that all the
supporting stuff around it has to be over-complex in order to compensate.

Cheers,
Wol

Why not just dump relational?

Posted Jul 24, 2008 9:17 UTC (Thu) by cortana (subscriber, #24596) [Link]

So where can we learn more about this magical world of post-relational databases?

Are there any free software implementations to play with?

Why not just dump relational?

Posted Jul 24, 2008 9:49 UTC (Thu) by epa (subscriber, #39769) [Link] (1 response)

Are there any free 'post-relational' database systems?  Is there a standard query language, as
SQL is for relational DBs?

Why not just dump relational?

Posted Jul 26, 2008 0:00 UTC (Sat) by Wol (subscriber, #4433) [Link]

I mentioned MaVerick and OpenQM. And if you want "free" rather than "Free", there's always
UniVerse, UniData, jBase, and probably more.

For download links to UniVerse and UniData, go to www.u2ug.org. For jBase a web search should
find it.

As for the query language, SQL works, but you're probably better off using the tools provided
with the database. And while the MV query language may not have just one name, the different
dialects are about as similar as the different dialects of SQL.

Cheers,
Wol

Why not just dump relational?

Posted Jul 24, 2008 15:28 UTC (Thu) by alankila (guest, #47141) [Link] (7 responses)

MaVerick had its last release in 2006, and its 0.0.x-style version number indicates it isn't
ready for use. It seems safe to say that it's dead.

The OpenQM website has very little information on how exactly you would go about programming
against it. That site is more a sales device than a developer resource.

The term "post-relational" might be synonymous with "multivalue". And multivalue simply means,
in relational terms, that a row holds more than one value per row. Even postgres supports
this---should postgres now be considered post-relational?

UniVerse database is billed as "relational" database on the IBM site. Not post-relational.

In other words, I am unable to get a handle on what exactly you are talking about. Why exactly
is it a good idea to throw away a query optimizer and how is it even possible that you
wouldn't need one?

Why not just dump relational?

Posted Jul 25, 2008 11:40 UTC (Fri) by Wol (subscriber, #4433) [Link] (6 responses)

Ummm - I didn't know that MaVerick doesn't seem to have done much! I need to check, but there
were people using it, and you might find that SourceForge is more up to date (or you might
not). I know it was maintained for a long time after the official release.

Post-relational and MultiValue are synonymous. But no, it doesn't mean "multiple values per
row", it means "multiple values per CELL". So, to take an infamous example from
comp.databases.pick, if I have a CARS table, in the COLOUR field I can store *two* values (or
more) if I want - say my car is "black and green", I can store both "black" and "green" in the
same column, all in the one row that is my car in the table.

UniVerse is probably billed as relational as a marketing tool ...

Oh - and I said why you would throw away the query optimiser - the cost isn't worth it.
Bearing in mind - taking my car as an example - that all the information about my car is
stored in one "row" (or RECORD as it's called in the MV world), what is there to optimise?
Once I know the primary key, I can retrieve everything in one hit. Oh - and that information
will include the primary key(s) of the owner(s) so I can then do a direct retrieval on them -
and so on down the chain. It's a pretty safe statistical bet that if I access one piece of
information about a car, I will want other information about the same car. It's less likely I
will want information about other cars (and which car would that be?). And even less likely
that I will want information about entities related to the car - the cost of optimising that
is bound to be more than just retrieving the foreign key from the car record and doing a
direct access.
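The record-per-entity layout described above can be sketched in plain Python (all keys and field names invented): one lookup by primary key returns the whole record, and the embedded foreign keys support direct follow-on retrievals with no join step.

```python
# Hypothetical MultiValue-style store: each record lives under one key,
# and a "cell" may itself hold several values (the CARS/COLOUR example).
cars = {
    "CAR*1": {
        "MAKE": "Lotus",
        "COLOUR": ["black", "green"],  # two values in one cell
        "OWNER_KEYS": ["OWNER*7"],     # owner keys stored in the record
    },
}
owners = {"OWNER*7": {"NAME": "Wol"}}

# One direct retrieval by primary key gets everything about the car...
record = cars["CAR*1"]
# ...and the embedded keys give direct access down the chain.
owner_names = [owners[k]["NAME"] for k in record["OWNER_KEYS"]]
```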

Think about it. In real life, the relational optimiser will optimise access to either (a) data
that MultiValue would store in the same record, or (b) data that might be wanted but
statistically isn't worth prefetching.

And the filing system is (usually) a self-balancing hash so (even on an unmaintained system!)
a direct retrieval from a known primary key takes, on average, about 1.05 table accesses to
get the data.

NB Think XML. The data storage format inside a MultiValue database is pretty much a mirror of
XML. There are some differences (MultiValue pre-dates Relational!) but the basic philosophy is
identical and the mapping is trivial. So if you want to store XML in a database, try using
MultiValue. It may take YOU a little effort to get it to map cleanly, but it won't give the
mvdbms any problems at all.

Cheers,
Wol

Why not just dump relational?

Posted Jul 25, 2008 12:11 UTC (Fri) by epa (subscriber, #39769) [Link]

Indeed, multi-value databases in this sense predate the relational model.  You might as well
talk of 'pre-relational' database systems as 'post-relational'.

Why not just dump relational?

Posted Jul 25, 2008 12:59 UTC (Fri) by pboddie (guest, #50784) [Link] (2 responses)

Post-relational and MultiValue are synonymous.

Buzzwords aplenty from the Pick camp, I see.

Oh - and I said why you would throw away the query optimiser - the cost isn't worth it.

You then go on to reveal very little to substantiate this claim, other than claiming that just throwing everything into a single table, because it's multivalued, will make everything super-efficient. Whether this is useful either for larger data sets or for not-strictly-hierarchical data is questionable.

I had the "fortune" to use UniData for a while. The query language was badly defined and the mechanisms for editing the data were archaic: "top bit set" characters for "column delimiters", anyone? That many such products have been vacuumed up by IBM and are being milked for support revenues is hardly an accident of history, in my opinion.

Why not just dump relational?

Posted Jul 25, 2008 15:00 UTC (Fri) by Wol (subscriber, #4433) [Link] (1 response)

The point of a query optimiser is to speed things up. And I'm not saying "throw everything
into a single table". I liken MV to C (and relational to Pascal). C is very loose and lets you
shoot yourself in the foot (as does MV). Pascal, done properly, is so rigid it's a horrible
straitjacket. If you can program C with the rigours of Pascal then you get the advantages of
both worlds.

Anyway, back to query optimisation. If I want to select a row based on an attribute value,
then MV or relational is irrelevant, I have to scan the entire table (or use an index, which
is effectively selecting row(s) by key from an index table). So, selecting by value is equal
in MV and relational, it's expensive.

Selecting by key is very fast in MV - as I said, accessing a record by key requires on average
1.05 table accesses. If your optimiser costs you more than 5%, throw it away! In MV also (with
a properly designed database :-), accessing the record will retrieve off the disk every bit of
information that shares the same simple primary key.

In a relational database, there is no guarantee that your data has a primary key. If you
access an attribute that can have multiple values (ie a one-to-many relationship with the
primary table), there's no guarantee that related values will be retrieved at the same time.
Etc etc. The thing is, in a properly designed MV database, a single table access will retrieve
all the information about a single instance of an entity. And a relational optimiser really
scores when retrieving information about a single instance of an entity from across multiple
tables. Problem is, it's hard to score against an adversary that *consistently* gets it right
"first time every time". If you're reading ONE record from a table of two hundred records, a
totally unoptimised relational database will require an average of 100 table accesses, making
a 10000% cost of your optimiser acceptable (of course, I would be very surprised to find such
a database actually exists in the real world :-)

I used INFORMATION and now use UniVerse - I think the query language is similar. It's very
English-like - indeed one variant is still called ENGLISH, I believe. Yes it takes some
getting used to. But SQL ... that's AWFUL. Yes, it's logical. But it gets horribly nested,
convoluted and heaven-knows-what. Give me the Pick query language any day :-)

What made you think Pick was hierarchical? I'll give you that - again in a properly designed
MV database - each entity definition should be hierarchical. But the links between entities
should be relational. Which gives you all the *speed* of a hierarchical database, and all the
*flexibility* of a relational database. Yes - relational is flexible - but it pays for it in
speed.

Properly designed, MV has all the speed of hierarchical. But you can choose ANY entity as the
peak of your hierarchy, giving it all the flexibility of relational. And - apart from when you
do a "find me this" query - all your data access is based on primary keys that normally hit
"first time, every time". That's blindingly fast compared to a paradigm that says "always
SELECT FROM". Yes I know relational doesn't do that under the covers, but I'd rather a model
that explicitly guarantees me instant response than one that guarantees me nothing but
typically does pretty well - I've come across enough horror stories about upgrades to the
optimisation engine "breaking" carefully crafted code and turning a racehorse into a snail.

Cheers,
Wol

Why not just dump relational?

Posted Jul 26, 2008 2:04 UTC (Sat) by alankila (guest, #47141) [Link]

> Yes I know relational doesn't do that under the covers, but I'd rather a
> model that explicitly guarantees me instant response than one that 
> guarantees me nothing but typically does pretty well - I've come across 
> enough horror stories about upgrades to the optimisation engine 
> "breaking" carefully crafted code and turning a racehorse into a snail.

Let's just say that the flip side of the optimizer is that you do not have to think about how
to get the data yourself. You declare how the things are related, and using SQL describe what
the result set is, and leave it to the database to work out how to satisfy that query.

It should be clear that this argument requires that we have something more complicated than
"select * from table where id = :x" in mind. As an example, I once wrote a query that produced
a report of top 20 products for all vendors monthly, sorting the vendors alphabetically, the
products by their rank (determined by number of items sold), and also reported the number of
those items sold monthly. The database was a star schema of facts, one row describing an item
x sold at time z, another describing the relationship from item to vendor.

So the optimizer works out which end of the query looks like least work, where a temporary
index might be needed, what rows can be read off purely from index, if any of the subresults
are available in materialized query tables, etc.

To me, it was faster to write the SQL than work out the individual queries and the code in the
host programming language (Perl, in my case) to calculate that same statistic. In principle,
this way is the most efficient as well, as when the database knows what I want, then it is in
the position to do the least work required to satisfy that query.
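A much-reduced version of that report shows the shape of such a declarative query. The sketch below (schema and data invented) computes the top two products per vendor by units sold; it needs a SQLite with window-function support (3.25 or later), and the engine, not the caller, decides the access path:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (vendor TEXT, product TEXT, qty INTEGER);
    INSERT INTO sales VALUES
        ('acme', 'anvil', 5), ('acme', 'rocket', 9),
        ('acme', 'glue', 1), ('zenith', 'tv', 4);
""")

# Top 2 products per vendor: rank within each vendor by units sold,
# then keep the leading ranks; vendors come out alphabetically.
top = conn.execute("""
    SELECT vendor, product, sold FROM (
        SELECT vendor, product, SUM(qty) AS sold,
               ROW_NUMBER() OVER (PARTITION BY vendor
                                  ORDER BY SUM(qty) DESC) AS rn
        FROM sales GROUP BY vendor, product)
    WHERE rn <= 2
    ORDER BY vendor, sold DESC
""").fetchall()
```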

Why not just dump relational?

Posted Jul 26, 2008 1:31 UTC (Sat) by alankila (guest, #47141) [Link]

> Post-relational and MultiValue are synonymous. But no, it doesn't mean
> "multiple values per row", it means "multiple values per CELL".

My bad. I was talking about this thing, with postgres's Array types.

http://www.postgresql.org/docs/8.0/interactive/arrays.html

So, now we have established postgres as either pre- or post-relational, or possibly both...
;-)

Your talk about the query optimizer's unworthiness is unconvincing to me. Not all projections
are about fetching a single entity. Some queries are about fetching aggregates such as
statistics of all the rows in the database. It is not useless to have these queries run well,
too.

It would seem advantageous to be able to restrict the amount of data fetched about an entity to
the absolute minimum. For instance, in many databases you can build an internally maintained
"clone" of a table by indexing a set of its columns. When a query concerning that table
requesting only those columns is executed, chances are that the optimizer chooses to read the
data from the index instead of the table proper.

Why not just dump relational?

Posted Jul 27, 2008 3:30 UTC (Sun) by jlokier (guest, #52227) [Link]

Multiple values per cell sounds like Amazon's "cloud" database, SimpleDB.
Actually the rest of your description sounds like SimpleDB too.

It strikes me as a good idea: easy to understand, and a good fit for a lot of web app
requirements.

But not all of the common ones.  Things like "show me all items matching criteria X and a
summary of their details", or the same but "50 per page sorted by Foo" - and show me how many
pages that comes to - are quite common requirements.

If that comes to 10000 items, it's going to be really slow doing it by simply scanning a
key->multi-value database and filtering/sorting in the app, and especially if you have to do
it every time they click "next page".  Whereas SQL using window functions, and a backend
table/index which stores cumulative values as well as direct values, can do the same queries
rather fast.  (Not that all SQL databases do such things.)
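The "page plus total page count" case, for example, can be a single round trip when the engine supports window functions (SQLite 3.25+ shown; table invented): `COUNT(*) OVER ()` is computed over the full result set before LIMIT trims it to one page.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (foo INTEGER)")
conn.executemany("INSERT INTO items VALUES (?)", [(i,) for i in range(105)])

PAGE = 50
# One query returns the requested page *and* the total match count.
rows = conn.execute("""
    SELECT foo, COUNT(*) OVER () AS total
    FROM items ORDER BY foo LIMIT ? OFFSET ?
""", (PAGE, 0)).fetchall()

total = rows[0][1]
pages = -(-total // PAGE)  # ceiling division: 105 rows -> 3 pages
```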

OTOH for something like storing cookie state, user preferences etc., key->value type of
database is fine and nice and simple.

> Once I know the primary key, I can retrieve everything in one hit. Oh -
> and that information will include the primary key(s) of the owner(s) so
> I can then do a direct retrieval on them - and so on down the chain.

Round trip latency for each of retrieval in the chain?

I've done something similar with a web page built out of data stored in a very simple
name->value database.  The database round trip was about 1ms (over a network).

But the latencies added up, in our case to about 30s per page (yes, thousands of round trips),
due to chasing dependencies like this in the app: "fetch all items in range A..B (to make a
table of results in a web page).  [get answer].  for each item, using its "current mode"
property just retrieved, fetch info about current mode C."

With SQL you typically fold dependency chains into the single query so the DB engine chases
them while it reads the file, merging and reordering block reads to minimise seeks, and all
the answers come in bulk with a single network round trip.  There are other methods, some of
them better, and different programming methods, but none of them really simple.
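The difference between chasing the chain in the app and folding it into one query is the classic N+1 pattern versus a join, sketched below (schema invented; both produce the same rows, but the first costs one round trip per item):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE items (id INTEGER PRIMARY KEY, mode_id INTEGER);
    CREATE TABLE modes (id INTEGER PRIMARY KEY, label TEXT);
    INSERT INTO modes VALUES (1, 'on'), (2, 'off');
    INSERT INTO items VALUES (10, 1), (11, 2), (12, 1);
""")

# N+1 pattern: one query for the items, then one round trip per item to
# fetch its current mode.
items = conn.execute("SELECT id, mode_id FROM items ORDER BY id").fetchall()
n_plus_1 = [(i, conn.execute("SELECT label FROM modes WHERE id = ?",
                             (m,)).fetchone()[0]) for i, m in items]

# Folded into a single query: the join chases the dependency inside the
# engine, in one round trip.
joined = conn.execute("""
    SELECT items.id, modes.label
    FROM items JOIN modes ON modes.id = items.mode_id
    ORDER BY items.id
""").fetchall()
```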

> And the filing system is (usually) a self-balancing hash so (even on an
> unmaintained system!)  a direct retrieval from a known primary key
> takes, on average, about 1.05 table accesses to get the data.

If you just do a direct access to a single primary key with a SQL database, if it has a
half-decent query implementation, it will access the data in similar steps and won't spend any
time optimising such a simple query.

A hash isn't always best, even though it takes only 1.05 table accesses.  If you want to
retrieve 1000 answers per second, and your table is bigger than RAM, that will not work with
hash and a magnetic disk...  You'll be lucky to get 100 answers per second.

But if the 1000 answers per second have some locality properties, then it can work with other
structures which maintain locality, like trees.

Locality is noticeably important in Google's BigTable documentation as one way to improve
performance.

Even in a RAM based database, this can make a performance difference, due to difference in
speed between L1 cache, L2 cache and main RAM.

And decent SQL databases have balanced hashes as an option anyway :-)

> all the information about my car is stored in one "row" (or RECORD as
> it's called in the MV world), what is there to optimise?

The reason column-oriented databases have a niche in which they perform better than
row-oriented databases, is that sometimes it's a performance _downgrade_ to keep all the data
related to "my car" in the same place on a disk :-)

This is even true when you have big disks - scanning many records to calculate aggregate
values is still faster if you can read less.  And for single-value lookups, you can cache
more.

Summary: I agree with you that relational isn't always a good model.  I think XML-style is a
better fit for a lot of data.  But don't throw out the SQL model of giving the database
information about the structure of your queries, and it batching and sorting them together.
Round trip times and not parallelising queries adds up to be a big deal with queries which
involve a lot of data, even if it's just intermediate data (as in the "50 items per page"
example).

Drizzle: a lighter MySQL

Posted Jul 24, 2008 0:44 UTC (Thu) by bk (guest, #25617) [Link]

Wasn't MySQL originally intended to be a 'lite' version of an RDBMS (i.e., essentially an SQL
frontend to a Berkeley DB data store)?

Why not just avoid the whole mess and use PostgreSQL?


Copyright © 2008, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds