
Is this SQL databases or No-SQL?

Posted Mar 27, 2014 13:58 UTC (Thu) by intgr (subscriber, #39733)
In reply to: Is this SQL databases or No-SQL? by Wol
Parent article: A discussion between database and kernel developers

> As for magic characters, what does PostgreSQL use to separate its columns?

It doesn't. Using magic characters usually necessitates escaping and de-escaping, which costs performance, and finding column N in a tuple means scanning byte-by-byte through every preceding value. PostgreSQL avoids both problems: fixed-length columns have a defined size, so their positions are known, and variable-length values are prefixed with their length.
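
(A minimal sketch of that layout, with made-up names and plain 4-byte length prefixes - PostgreSQL's real varlena encoding uses 1- or 4-byte headers plus alignment padding, so this only illustrates the idea:)

#include <stdint.h>
#include <string.h>

/* Return a pointer to field n's bytes (counting from 0) and store its
   length in *len_out, assuming each field is a 4-byte length prefix
   followed by its body. */
const uint8_t *find_field(const uint8_t *tuple, int n, uint32_t *len_out)
{
    uint32_t len;
    for (int i = 0; i < n; i++) {
        memcpy(&len, tuple, sizeof len);   /* read this field's length */
        tuple += sizeof len + len;         /* skip prefix and body */
    }
    memcpy(len_out, tuple, sizeof *len_out);
    return tuple + sizeof *len_out;        /* start of field n's bytes */
}

Note that no byte of a field's body is ever inspected: each field is skipped in one step, whatever bytes it contains, so no escaping is needed.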

> it's just that Pick exposes its internals - and guarantees the API - which enables programmers to take advantage of an understanding of how the engine actually works

If I understand correctly, it seems that these delimiters are exposed to the coder, which invites misuse: are you sanitizing these values correctly in the input data? A failure may easily expose security bugs. How do you escape UTF-8 data, which can contain byte values 250-255? And it also paints the database into a corner such that it cannot change the storage format easily.

SQL has widely-known escaping problems too, but these days they're passed as parameters instead. I don't know how rigorous the use of abstractions is in Pick communities.

> The alternative is things like Oracle where, after spending months empirically tuning your queries, they change the query optimisation engine and the query suddenly suffers an order of magnitude slowdown.

You make lots of sweeping claims about performance. It's hard to argue without understanding the details of Pick, but from what I have heard I'm not impressed.

I haven't used Oracle in particular much, but based on my experience with PostgreSQL, I think query optimizers are a godsend. Instead of having to spell out exactly how to retrieve the data, I just describe what I want and the database figures out the best way to execute it. If the data distributions in the database change, queries adapt automatically, rather than requiring coders to go back and revisit old queries.

While regressions and outliers do occur, they're a small minority, and debugging them isn't difficult with some experience. If you have to spend "months" tuning queries, then it seems you're trying to outsmart the optimizer and force it to do what you want, rather than helping it do its job and make good decisions.

And for every regression, there are tons of queries that get faster with each release, because new optimization techniques are implemented all the time.



Is this SQL databases or No-SQL?

Posted Mar 27, 2014 14:39 UTC (Thu) by mpr22 (subscriber, #60784)

> How do you escape UTF-8 data, which can contain byte values 250-255?

Byte streams representing text encoded using ISO 8859-[n], KOI8-R, or JIS X 0208 can contain those byte values. But unless and until the relevant standardization bodies retract their decision to formally restrict the Universal Character Set to the range [0, 0x10ffff], the presence of byte values in the range 245-255 in a byte stream means that whatever you have, it most assuredly is not UTF-8-encoded text.
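
(A byte-level check makes this concrete. Since UTF-8 encodes at most U+10FFFF, the bytes 0xF5-0xFF can never appear anywhere in well-formed UTF-8, and neither can 0xC0/0xC1, which could only start overlong encodings:)

/* True if byte b can occur somewhere in well-formed UTF-8. */
int possible_utf8_byte(unsigned char b)
{
    return b != 0xC0 && b != 0xC1 && b < 0xF5;
}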

Is this SQL databases or No-SQL?

Posted Mar 28, 2014 1:03 UTC (Fri) by Wol (subscriber, #4433)

> If I understand correctly, it seems that these delimiters are exposed to the coder, which invites misuse: are you sanitizing these values correctly in the input data?

Bang on - there are various safeguards in place that help prevent misuse, but at the end of the day it's down to the programmer to get it right.

> I haven't used Oracle much in particular, but I think query optimizers are a godsend based on my experience with PostgreSQL.

Let's look at what a relational optimiser has to do. Assuming crude unoptimised data access, EVERY row you look for requires searching half the table on average!

Okay, if I'm only searching for a single row in Pick, it's quite likely I'll have to do the same. BUT!!!

Let's take the example of an invoice. Have you keyed it on the invoice number? I quite likely have. If I have, my mvDBMS can tell the OS where on the disk the data I am looking for is, with a 97% chance of being right. "Open file X, seek to position Y, read that block". If your invoice table is big (let's say 1Mb) you've got to read 1/2Mb to find it!

I've now got ALL the invoice data. Chances are (in fact it's pretty certain) your invoice data is plastered over several tables, so you are going to get further clobbered. Then what happens if you want to access the related customer details? Yet again, you probably have the data plastered over several tables and have to read half of each table to retrieve the data. My invoice record will contain the company key, and once again, I will be able to tell the OS exactly where to find it with a 97% chance of being right.

In other words, a raw unoptimised RDBMS data access is incredibly inefficient, and it gets worse as the database gets bigger. An efficiency of 1% is probably very good! So an optimiser doesn't have to be very good in order to earn its keep.

But for Pick? Let's take my worst case - assume I haven't keyed on invoice no. If I don't have an index I'll have to scan the entire table - so my worst case is equal to yours (if I'm being stupid!) But all I've got to do is declare an index on the invoice no ... and it takes one disk read (97% probability) to convert from invoice no to invoice key, and one more to retrieve the invoice.

(Okay, you can index your primary invoice table on invoice no too, but then you have to index all your subtables too, with all the overhead that involves.)

So. Your SQL optimiser can achieve a doubling of performance by increasing efficiency by a fraction of a percent.

My Pick optimiser (if I had one) can achieve a maximum of 1/3 of one percent increase in performance before it hits the unbreachable barrier of perfect efficiency. It's just not worth spending any effort searching for those sort of gains.

Oh - and there's another thing to take home from this. When making my claims about RDBMSs, I'm mostly speculating. You see, I'm *not* *allowed* to know what's going on under the covers, so I'm forced to guess. Because I know pretty much exactly what's going on under the covers of Pick, I can use logic to reason about it, and PROVE what's going on. I've got figures to work with. (Okay, the figures I have are the inverse of 97%. In a properly balanced Pick file, it takes on average 1.05 "get this block from disk" commands to score a direct hit on the target data. Oh, and rebalancing the file after a write (if that's necessary) incurs about 3 more writes. That's not that much overhead.)

And because every Pick record MUST have a key, most of the time I'm not searching for data, but targeting known records for which I have the key. (Plus I have far fewer tables to search to retrieve the same data.)

Cheers,
Wol

Is this SQL databases or No-SQL?

Posted Mar 28, 2014 8:31 UTC (Fri) by cladisch (✭ supporter ✭, #50193)

> If I don't have an index I'll have to scan the entire table - so my worst case is equal to yours (if I'm being stupid!) But all I've got to do is declare an index on the invoice no ... and it takes one disk read (97% probability) to convert from invoice no to invoice key, and one more to retrieve the invoice.

Most relational databases (even SQLite) can store table rows ordered by their primary key (typically called "index-organized" or "clustered" tables), so no separate index lookup would be required.

> When making my claims about RDBMSs, I'm mostly speculating. You see, I'm *not* *allowed* to know what's going on under the covers, so I'm forced to guess.

Nobody forbids you from finding out what's going on under the covers. Most relational databases document low-level details (or even their file format) to allow manual optimizations.

Is this SQL databases or No-SQL?

Posted Mar 28, 2014 10:28 UTC (Fri) by intgr (subscriber, #39733)

> Have you keyed it on the invoice number? I quite likely have. If I have, my mvDBMS can tell the OS where on the disk the data I am looking for is, with a 97% chance of being right. "Open file X, seek to position Y, read that block". If your invoice table is big (let's say 1Mb) you've got to read 1/2Mb to find it!

What the? How does Pick know where in the file to look for that row? I assume that by "keying" you mean what RDBMSes call indices. Isn't it dishonest to compare indexed access in Pick against non-indexed access in an RDBMS? No reasonable person creates a table without indices when they expect row lookups to be fast.

> I've now got ALL the invoice data. Chances are (in fact it's pretty certain) your invoice data is plastered over several tables, so you are going to get further clobbered. Then what happens if you want to access the related customer details? Yet again, you probably have the data plastered over several tables and have to read half of each table to retrieve the data.

This is true, but you're only talking about OLTP performance, with the assumption that every access to an invoice needs to read all the invoice lines and vice versa. It's true that a system like Pick will be far more efficient than RDBMSes in this scenario, but in all the systems I work on, OLTP queries are not the bottleneck (unless you do crazy things like not indexing your data properly). It's the more complex queries that kill you: find all customers that have item X on their invoice; give me a breakdown of the costs of all invoice lines by item type. That's where query optimizers really shine. All evidence I've seen suggests that Pick deals terribly with these.

Even in development, I frequently whip up multi-line queries to look for anomalies in the data. It often just blows me away how these queries complete in seconds, despite churning through gigabytes of data and without any advance planning to make them fast.

In other words: you're worrying about optimizing something that's already more than fast enough. I worry about optimizing the things that are hard to optimize.

----

Also, PostgreSQL has a powerful type system, including composite types, arrays, mapping types, JSON, etc. You *can* organize data like you do in Pick if you want to. But there are real advantages to normalizing (with exceptions of course).

> When making my claims about RDBMSs, I'm mostly speculating. You see, I'm *not* *allowed* to know what's going on under the covers, so I'm forced to guess

That may be true for Oracle, but the page layout of PostgreSQL is documented. http://www.postgresql.org/docs/current/static/storage-pag... The difference from Pick is that the storage layout is not part of the API, but just an implementation detail. If the developers need to change it in a future version, for performance or other reasons, they can.

Is this SQL databases or No-SQL?

Posted Mar 28, 2014 19:33 UTC (Fri) by Wol (subscriber, #4433)

> What the? How does Pick know the position in the file, where to look for that row? I assume that by "keying" you mean what RDBMSes call indices. Isn't it dishonest to compare indexed access in Pick against non-indexed access in an RDBMS? No reasonable person creates a table without indices when they expect row lookups to be fast.

Urmmm. No.

What I mean by "keying" is sort-of what an RDBMS would call a primary key. In Pick, a primary key is mandatory. And that key contains all the information required for Pick to go straight to the place on disk where that row is stored. Pick files use linear hashing, so basically it's a modern hashed file. And unlike in a traditional hashed file, it only requires rewriting about three disk blocks to change the modulo. So it's FAST.

(So basically, the difference between a key and an index may be minor but it's crucial. An index tells you the key, while a key directly tells you where the data is. And again, Pick gives you rope. Screw up your primary keys and you screw up the hashing, which cripples performance. But in modern implementations that is HARD.)
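
(For the curious, a minimal sketch of the key-to-block computation under linear hashing - the names and hash function here are made up for illustration, and real Pick implementations differ in detail. With base modulo m and split pointer p, buckets below p have already been split to the doubled modulo:)

/* Map a record key to the block holding it in a linear-hashed file. */
unsigned bucket_for_key(const char *key, unsigned m, unsigned p)
{
    unsigned h = 5381;                 /* djb2 hash, purely illustrative */
    while (*key)
        h = h * 33 + (unsigned char)*key++;

    unsigned b = h % m;                /* address at the current level */
    if (b < p)                         /* already split at this level, */
        b = h % (2 * m);               /* so rehash at the next one */
    return b;                          /* block number to read */
}

Growing the file splits one bucket at a time (advance p and rehash that one bucket's records), which is why changing the modulo only touches a few blocks.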

> This is true, but you're only talking about OLTP performance, with the assumption that every access to an invoice needs to read all the invoice lines and vice versa. It's true that a system like Pick will be far more efficient than RDBMSes in this scenario, but in all the systems I work on, OLTP queries are not the bottleneck (unless you do crazy things like not indexing your data properly). It's the more complex queries that kill you: find all customers that have item X on their invoice; give me a breakdown of the costs of all invoice lines by item type. That's where query optimizers really shine. All evidence I've seen suggests that Pick deals terribly with these.

Erm - no. Assuming I've understood what you want, it's a sequence of commands in Pick (and no, a Pick SELECT and a SQL SELECT are not equivalent).

SELECT INVOICES WHERE ITEM-TYPE EQ "X". If I've got an index on X, that's a single request to disk. That gives me a list of all invoices.
SELECT INVOICES SAVING CUST-ID. Okay, I have to go down my list of invoices but, assuming they're scattered randomly on disk, that's 105 disk seeks per 100 invoices.
Okay, I can't remember the syntax for the next command, but it's something like
SELECT INVOICES WITH CUST-ID IN CURRENT SELECTLIST. If we have an index on CUST-ID, once again that's 105 seeks per 100 customers.

I've now got a list of invoices to report on ...

LIST INVOICES BY.EXP ITEM-TYPE TOTAL PRICE

Note that absolutely *nowhere* have I had to search for any information whatsoever. The user provided the initial "X", and every single data access from then on has been keyed - with a near perfect disk hit rate. I've had to throw away about 5% of my disk seeks because they failed to hit their target.

There are two places you might score on me. Sorting and collating the final report is one. But if you can't fit all the invoice details in ram I think I'm going to fly past you. I don't know the internals of the sort, but I suspect the systems I'm used to scan through each invoice in turn, creating an (invoice, line, price) tuple, and then sort that. They will then scan through that sorted list building up the report as they go. So they have to read each invoice at least twice, and once more for each extra relevant line.
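
(For what it's worth, a toy version of that suspected sort-then-scan pass, with made-up record shapes; a real implementation would have to sort on disk once the tuples outgrow RAM:)

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct line { char item_type[8]; double price; };

static int by_type(const void *a, const void *b)
{
    return strcmp(((const struct line *)a)->item_type,
                  ((const struct line *)b)->item_type);
}

/* Sort the extracted tuples by item type, then total each group
   in a single pass over the sorted list. */
void report(struct line *lines, size_t n)
{
    qsort(lines, n, sizeof *lines, by_type);
    for (size_t i = 0; i < n; ) {
        double total = 0;
        size_t j = i;
        while (j < n && strcmp(lines[j].item_type, lines[i].item_type) == 0)
            total += lines[j++].price;
        printf("%-8s %12.2f\n", lines[i].item_type, total);
        i = j;
    }
}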

But if it all fits in ram, the OS disk caching will help me, just as much as being able to store the entire report in ram will help you.

The other place you will score is indices. Your optimiser will spot it needs to scan INVOICES multiple times, and it will build indices on the fly if they don't exist. Pick assumes the user knows what s/he is doing, and gives you the rope to hang yourself.

Plus, you say "that's only relevant for OLTP" - erm no there too. Okay, it's only Monte Carlo probability, but in pretty much ALL scenarios the cost of retrieving that extra data is minimal, while the chance of it being required is relatively high. So much so that it's well worth it. If the system has been designed by a "practitioner in the art" (in this case an accountant), they will have subconsciously skewed the definition of an object such that this is pretty much inevitable.

As an aside, it was mentioned way back in the thread that PostgreSQL often requires gigabytes of temporary storage. I just cannot CONCEIVE of Pick needing that much, even if handling those sort of quantities of data!

> But there are real advantages to normalizing (with exceptions of course).

I couldn't agree more !!! But not in storing the data in First Normal Form. In fact, if a Pick developer did not normalise the data ON PAPER before designing his Pick schema, I would consider him negligent!

> Even in development, I frequently whip up multi-line queries to look for anomalies in the data. It often just blows me away how these queries complete in seconds, despite churning through gigabytes of data and without any advance planning to make them fast.

> In other words: you're worrying about optimizing something that's already more than fast enough. I worry about optimizing the things that are hard to optimize.

To quote from the Pick FAQ - "SQL optimises the easy task of finding things in memory, Pick optimises the hard task of getting them from disk into memory".

And to repeat my Xeon 800 war story - this Oracle system was designed to replace the old Pick system. It took Oracle Consultants some SIX MONTHS of hard work before it could run that query faster than the Pentium 90 system it was replacing ...

> That may be true for Oracle, but the page layout of PostgreSQL is documented. http://www.postgresql.org/docs/current/static/storage-pag... The difference from Pick is that the storage layout is not part of the API, but just an implementation detail. If the developers need to change it in a future version, for performance or other reasons, they can.

My point is that you're not supposed to know that information, not that you can't have insider knowledge. And if the developers do need to change it, you have no recourse when all your finely tuned racing queries suddenly become bogged down in treacle. It's not their fault if your system is suddenly unusably slow and you have six months' work (or more) retuning the system to get response times back to acceptable. Okay, that's unlikely to happen, but Oracle customers complain about it often enough. It just can't happen with Pick.

Cheers,
Wol

Is this SQL databases or No-SQL?

Posted Mar 29, 2014 0:11 UTC (Sat) by Wol (subscriber, #4433)

Replying to myself, let's expand a little on the following ...

> but in pretty much ALL scenarios the cost of retrieving that extra data is minimal, while the chance of it being required is relatively high.

Let's assume we've got 1000 invoices, and we're only reporting on 100 of them - 1 in ten. Let's also assume that we've got 10 invoices per block.

Using schoolboy howler statistics: we want 1 invoice in 10, and there are 10 invoices per block, so we are going to have to read every block and use only 1 of the invoices in it. So we will get absolutely no benefit in the form of "accidentally" reading in another invoice row. The only possible benefit we will get is if we want multiple columns in the same row. (My howler may have simplified the maths, but it doesn't affect the truth of the supposition :-)
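
(Doing the statistics properly doesn't change the conclusion. If k = 100 of N = 1000 rows are wanted, scattered uniformly, and each block holds b = 10 rows, then the chance that a given block escapes untouched is

P(block untouched) = C(N-b, k) / C(N, k) ≈ (1 - k/N)^b = 0.9^10 ≈ 0.35

so about 65 of the 100 blocks get read to fetch the 100 rows - very nearly one block per row, as the howler predicts.)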

The smaller the proportion of rows you want to access becomes, relative to the table size, the starker this effect becomes. So, by storing all the invoice data in a single row, a Pick database will benefit from this effect far more than an RDBMS, and the larger the database becomes, the more significant this benefit becomes. Because in everyday use it is rare to want to scan most of the database - you will only be interested in a few rows.

Cheers,
Wol

Is this SQL databases or No-SQL?

Posted Apr 10, 2014 10:02 UTC (Thu) by nix (subscriber, #2304)

> The other place you will score is indices. Your optimiser will spot it needs to scan INVOICES multiple times, and it will build indices on the fly if they don't exist.
You need to use some actual databases, rather than just reading Codd and assuming everyone implemented what he suggested. While doing that (incrementally, no less) would be really cool, nobody ever does that and I don't know of any database system that implements it. Indices are explicitly-constructed entities.

Yes, this does mean that in a lot of databases you end up with a rule that on particular large tables all queries must use an index, and often the database can even enforce that rule. But... you can have more than one index, allowing for more forms of lookup than just primary-keyed, and you can still have totally ad-hoc queries on smaller tables.

As far as I can see, Pick throws all that flexibility away. I can easily see myself having to rethink half an application because I want to add one query, since oops now I need to change what the primary key is! This seems very far from ideal.

(Disclaimer: I work for Oracle now, but in the period of my life when I did things with databases, I didn't.)

Is this SQL databases or No-SQL?

Posted Apr 10, 2014 20:14 UTC (Thu) by cladisch (✭ supporter ✭, #50193)

> > […] it will build indices on the fly if they don't exist.

> I don't know of any database system that implements it.

As shown in this example, SQLite is happy to create a temporary index if it is estimated to be faster overall.
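
(The linked example isn't reproduced here, but a minimal demonstration through the sqlite3 C API looks like this - the table names are made up, and the exact plan text varies by SQLite version; recent ones print something like "SEARCH t2 USING AUTOMATIC COVERING INDEX (b=?)":)

#include <stdio.h>
#include <sqlite3.h>

/* Print every column of each query-plan row. */
static int print_row(void *arg, int ncols, char **vals, char **names)
{
    (void)arg;
    for (int i = 0; i < ncols; i++)
        printf("%s=%s ", names[i], vals[i] ? vals[i] : "NULL");
    printf("\n");
    return 0;
}

int main(void)
{
    sqlite3 *db;
    if (sqlite3_open(":memory:", &db) != SQLITE_OK)
        return 1;
    sqlite3_exec(db, "CREATE TABLE t1(a); CREATE TABLE t2(b);",
                 NULL, NULL, NULL);
    /* Neither join column is indexed; the planner may build a
       transient ("automatic") index and report it in the plan. */
    sqlite3_exec(db,
                 "EXPLAIN QUERY PLAN "
                 "SELECT * FROM t1 JOIN t2 ON t1.a = t2.b;",
                 print_row, NULL, NULL);
    sqlite3_close(db);
    return 0;
}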

Is this SQL databases or No-SQL?

Posted Apr 15, 2014 14:45 UTC (Tue) by nix (subscriber, #2304)

Neat!

I did some digging in response: other, bigger RDBMSes (including PostgreSQL) will also create temporary indexes in temporary tablespace if needed. They're not visible to the database user, though, and are thrown away after the query terminates.

Is this SQL databases or No-SQL?

Posted Apr 15, 2014 15:27 UTC (Tue) by intgr (subscriber, #39733)

> bigger RDBMSes (including PostgreSQL) will also create temporary indexes

I think you are misunderstanding. The "temporary indexes" you speak of probably refer to indexes explicitly created on CREATE TEMPORARY TABLE tables by the client. I guess you were reading this: http://www.postgresql.org/docs/current/static/runtime-con...

PostgreSQL can create temporary in-memory hash tables, perform temporary sorts (such as for merge joins or ORDER BY), and materialize intermediate results mid-query. Sorts and materialized results can spill to disk when there isn't enough memory to hold them, but they aren't called "temporary indexes" since they don't resemble actual index data structures.
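
(A toy sketch of such a transient hash table, as used by a hash join, with assumed integer keys; the real executor also partitions its input into batches and spills them to disk when the table outgrows work_mem:)

#include <stdio.h>
#include <stdlib.h>

#define NBUCKETS 1024

struct entry { int key, val; struct entry *next; };
static struct entry *buckets[NBUCKETS];

/* Build phase: hash every row of the smaller input once. */
static void build(int key, int val)
{
    struct entry *e = malloc(sizeof *e);
    e->key = key;
    e->val = val;
    e->next = buckets[(unsigned)key % NBUCKETS];
    buckets[(unsigned)key % NBUCKETS] = e;
}

/* Probe phase: for each row of the larger input, look up matches. */
static void probe(int key)
{
    for (struct entry *e = buckets[(unsigned)key % NBUCKETS]; e; e = e->next)
        if (e->key == key)
            printf("join match: key=%d val=%d\n", key, e->val);
}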

Is this SQL databases or No-SQL?

Posted Apr 20, 2014 18:27 UTC (Sun) by nix (subscriber, #2304)

Yes, but those appear to be what wol is referring to as 'indexes' (they index the data, after all). They're not SQL-level entities, or user-visible, of course.

Is this SQL databases or No-SQL?

Posted Mar 28, 2014 1:21 UTC (Fri) by Wol (subscriber, #4433)

>> As for magic characters, what does PostgreSQL use to separate its columns?

> It doesn't. Using magic characters usually necessitates escaping and de-escaping, which costs performance, and finding column N in a tuple means scanning byte-by-byte through every preceding value. PostgreSQL avoids both problems: fixed-length columns have a defined size, so their positions are known, and variable-length values are prefixed with their length.

Then how does PostgreSQL know where the data is stored? How does it know how to find column N of row R? You haven't told me what PostgreSQL actually does. Reading between the lines, it seems that every column is stored separately, so reading a row is horribly inefficient ... but I really don't have a clue ...

And yes, Pick does have to search all preceding columns to find column N (the first time; it can be optimised fairly easily by caching). It seems to work very efficiently :-)

> And it also paints the database into a corner such that it cannot change the storage format easily.

So how does PostgreSQL do it? As an application guy, I'm not supposed to know. Certainly for older RDBMSs, adding a new column was a painful process. It's never been a problem for Pick. Or is that not what you meant?

Because internally to the database, a row is just one big string. And it's worked great for 45 years. And it is the same technique as used by SQLite, too, so I understand.

Cheers,
Wol

Is this SQL databases or No-SQL?

Posted Mar 28, 2014 11:55 UTC (Fri) by intgr (subscriber, #39733)

You make all these sweeping claims about relational databases. But how much experience do you actually have with them?

> You haven't told me what PostgreSQL actually does. Reading between the lines, it seems that every column is stored separately

I tried, but failed. No, all the fields in a single row are stored together. The difference from Pick is that there are no field terminators. Instead, variable-length fields are prefixed with their length. If it's not the column you're looking for, the database reads the field's length value and skips that number of bytes to find the next column. Fixed-length fields like integer/float/date/etc. do not need to be prefixed with their length, since their size is known from their data type.

> > And it also paints the database into a corner such that it cannot change the storage format easily.
> Or is that not what you meant?

What I meant is, if the database software exposes its on-disk format directly to the clients, then it cannot easily change the format in future versions. They may want to change the format for efficiency or other considerations. Of course, as always, the exposure has its advantages too, such as fewer transformations needed before sending data to the client.

> Certainly for older RDBMSs, adding a new column was a painful process

Adding a column that defaults to NULL (i.e. no value) is quick in PostgreSQL and in many other databases too.

> Because internally to the database, a row is just one big string.

You can call any piece of data "just a string". That doesn't give you any insight about how it's accessed.

> And it is the same technique as used by SQLite, too, so I understand

SQLite uses length-prefixed fields too, not terminator bytes.

Is this SQL databases or No-SQL?

Posted Mar 28, 2014 23:52 UTC (Fri) by Wol (subscriber, #4433)

> You make all these sweeping claims about relational databases. But how much experience do you actually have with them?

I've used them. I wasn't impressed. Where I worked, management decided to migrate from Pick to SQL-Server. Okay, my knowledge of SQL might not be good, but when it takes me maybe seconds to run a dump from my NF2 table, and it then takes SQL-Server hours to run the script to create the matching SQL tables ... when I do a "dir" (it was on Windows) to find out how much space my Pick tables were taking up, and find that the SQL data store was nigh on an order of magnitude larger ... and we had consultants who could easily have helped me speed things up, but I got the impression they were happy with that sort of speed - WTF!

And I'm interested in the theory of relational systems. You know I've repeatedly said "you're not supposed to know how the RDBMS stores data"? And it's been repeatedly pointed out that we know how PostgreSQL does it? But that requirement is spec'd in the definition of a generic RDBMS. That's what really annoys me - the spec says "users aren't allowed to know how the RDBMS does it", to permit the RDBMS developers to find a faster way. As an *engineer*, I can prove that Pick does it the fastest way possible, so why on earth would I want to allow space for the discovery of a faster way!?!?!?

Likewise, First Normal Form requires that everything is *forced* into sets. The poor business analyst is forced to use a sledgehammer to smash square lists and rhomboid bags into round sets! As an *engineer* I can prove this overcomplicates things, and means that an RDBMS is irreparably inefficient.

> I tried, but failed. No, all the fields in a single row are stored together. The difference from Pick is that there are no field terminators. Instead, variable-length fields are prefixed with their length. If it's not the column you're looking for, the database reads the field's length value and skips that number of bytes to find the next column. Fixed-length fields like integer/float/date/etc. do not need to be prefixed with their length, since their size is known from their data type.

Ah. But in practice, Pick is just as fast. When looking for column N:

/* Skip n column markers, one byte at a time, to reach column n. */
const char *find_column(const char *rec, int n, char marker)
{
    while (n > 0 && *rec != '\0') {
        if (*rec == marker)
            n--;
        rec++;
    }
    return rec;   /* first byte of column n */
}

Okay, that's a sketch, but in practice it's just as fast as:

/* Walk the column types, summing fixed widths, to find column n's offset. */
enum coltype { COL_INTEGER, COL_DATE /* ... one per fixed-width type */ };

size_t column_offset(const enum coltype *types, int n)
{
    size_t pos = 0;
    for (int col = 0; col < n; col++) {
        switch (types[col]) {
        case COL_INTEGER: pos += 4; break;
        case COL_DATE:    pos += 4; break;   /* width varies by engine */
        }
    }
    return pos;
}

(or maybe even faster! How long does that switch take to run for each column?)

Again, Pick gives you plenty of rope to hang yourself with - if you're daft enough to declare column 1000 (and I've known people do it) you're going to shoot yourself in the foot :-)

> Adding a column that defaults to NULL (i.e. no value) is quick PostgreSQL and many other databases too.

:-) Which is why I restricted my comment to older RDBMSs. I know newer ones have cracked it. Chances are, however, that adding a field whose default value requires updating every single row is pretty quick in Pick too. If you're sensible about it (and it's easy to force in Pick), it would probably be a case of "read block 1, write block 1, read block 2, write block 2, ..., read block n, write block n". How can you do it faster than that?

> You can call any piece of data "just a string". That doesn't give you any insight about how it's accessed.

But as far as the Pick data store (the DBMS itself) is concerned, why should the DBMS access the data at all? As far as the DBMS itself is concerned, the row is just a blob.

Part of the trouble, I think, is that an RDBMS internally contains several layers. There's the data store. Then there's the indexing and metadata management layer. Then there's the table management layer. Then there's the view management layer. And only at this point do we have the First Normal Form API through which all access to the database is supposed to occur. With Pick, so much more of this is exposed to the application (and, imho, the RDBMS FNF API is like a lady's corset that forces data into an unnatural shape to get through it :-)

The only reason a Pick mvDBMS has to concern itself with the layout of the data is in the index and metadata management layer.

> SQLite uses length-prefixed fields too, not terminator bytes.

Ah. But apparently (so I've heard) they're all strings. Not that I've ever seen it, but I have heard that one of its failure modes is putting text data into numeric fields ... That was what I meant - that apparently all data in SQLite is stored as text.

Cheers,
Wol

Is this SQL databases or No-SQL?

Posted Apr 9, 2014 22:05 UTC (Wed) by kleptog (subscriber, #1183)

Your comments about Pick are interesting and I'm trying to follow you, but comments like this:

> I know newer ones have cracked it. Chances are, however, adding a field that has a default value that needs updating every single row is pretty quick in Pick. If you're sensible about it (and it's easy to force in Pick), it would probably be a case of "read block 1, write block 1, read block 2, write block 2, ..., read block n, write block n". How can you do it faster than that?

make me just go "wow!". Obviously adding a column doesn't require rewriting the table: the database simply records that a column has been added, and whenever it reads a row that is missing that column, it fills in the default value.
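
(In sketch form, with made-up structures: a row written before the ALTER simply stores fewer columns than the current schema, and reads of the missing trailing columns are served from the recorded default:)

/* Rows remember how many columns they were written with; the schema
   may have grown since, so later columns fall back to the default. */
struct row    { int ncols; const char **vals; };
struct schema { int ncols; const char **defaults; };

const char *read_col(const struct row *r, const struct schema *s, int n)
{
    if (n < r->ncols)
        return r->vals[n];    /* value stored in the row itself */
    return s->defaults[n];    /* column added after the row was written */
}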

You seem to be quick to tar all RDBMSes with a single brush. No modern database system implements strict SQL. Indexes are not part of SQL, think about that. Yet every RDBMS has them. You keep talking about C&D's 1st rule, but no database actually requires that. PostgreSQL has supported arrays since the beginning and key/value fields for a long time. Recent releases allow you to store and index JSON documents. Which seems to me to give you all the performance of Pick with the flexibility of an RDBMS.

RDBMS builders are engineers too.

But finally, the bit that worries me most about Pick is that the entire pickwiki doesn't mention transactions at all. One of the fundamental features of databases, and it's not even mentioned. You mention Pick uses hashes, but transaction-safe hashes have historically been very difficult, so the combination makes me think Pick doesn't do transactions at all.

And don't say "but when you store complete documents you don't need transactions". In the real world you need to be able to read a document, make changes to it and write it back knowing no-one else changed it in the meantime. You need to be able to perform complex queries without worrying that you're getting an inconsistent state. Systems that don't need transactions tend to be trivial systems.

Hopefully I'm wrong though...

Is this SQL databases or No-SQL?

Posted Apr 10, 2014 9:56 UTC (Thu) by nix (subscriber, #2304)

Well, sometimes what he says is true. I did spend months -- actually, most of a year -- tuning a large database-backed Oracle system for better performance. About 20% of that was DML and even some DDL rewriting: perhaps twenty out of a million-plus queries in this system had bad performance. This was mostly down to places where the programmer had tried to be 'clever', Pick-style, and was denying the optimizer a chance to work (e.g. one place was repeatedly reissuing similar queries twelve times over rather than using an intermediate table and a suitable join to do all the work at once). Rewriting such queries to let the optimizer chew at them fixed about 80% of the speed problems we were seeing, speeding up the system by a factor of perhaps a hundred to a thousand under the relevant problematic workloads.

(The rest of the sloth? Maybe I could have fixed it with query rewriting, but with no low-hanging fruit it would have meant rewriting hundreds or thousands of queries. I chose to add application-side caches instead. All the speedup, none of the pain.)

