
gcobol: a native COBOL compiler

Posted Apr 8, 2022 12:38 UTC (Fri) by farnz (subscriber, #17727)
In reply to: gcobol: a native COBOL compiler by Wol
Parent article: gcobol: a native COBOL compiler

Yes, because I can find the right place to read from the tables faster than you can find the right place to read in your file. We had a worked example in another thread, where you indicated a total of 10 reads to output the right data; in the relational example you were trying to beat, 5 reads were enough for PostgreSQL, and 6 reads were needed by MySQL, because MySQL doesn't have a useful search index for one of the cases.

And, of course, I can store multiple tables in a single file in the relational world, and thus only have to read one record from one file, too.

Recall your basic database history: we start with record-oriented workloads, where each entry is a single record with one value per column. We then add indexes that allow you to go quickly from a value to the matching records for a column. There are two directions the theory takes from here (a sketch contrasting the two follows the list):

  1. Multivalue, where a single column can now hold multiple values in a single record. This makes the records bigger and the indexes more complex (and thus more expensive to read).
  2. Relational, where the database builds the output record from a join across multiple tables (a set-theory relationship between the tables).
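
To make the contrast concrete, here's a rough Python sketch (the order data and all names are invented for illustration) of the same information held as one multivalue record versus two relational tables rebuilt by a join:

    # A toy illustration (invented data): the same order held as one
    # multivalue record vs. two relational tables.

    # Multivalue: one record, the ITEMS column holds several values.
    mv_record = {
        "ORDER_ID": "S123",
        "CUSTOMER": "ACME",
        "ITEMS": ["widget", "sprocket", "flange"],  # many values, one column
    }

    # Relational: the list is split into a child table, rebuilt by a join.
    orders = {"S123": {"customer": "ACME"}}
    order_items = [
        ("S123", "widget"),
        ("S123", "sprocket"),
        ("S123", "flange"),
    ]

    # "Join": collect the child rows whose foreign key matches.
    items_for_s123 = [item for (oid, item) in order_items if oid == "S123"]
    assert items_for_s123 == mv_record["ITEMS"]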

For complex reasons (see a decent textbook, such as Christopher Date's, which I've already recommended), if you store your data in tables that are in 4NF or higher, then a mechanical translation of a multivalue insert or read (into a sequence of relational inserts, or a single relational join) will never involve more disk accesses than the corresponding operation in a multivalue database. The extra tables and indexes that relational has to handle impose a cost, but so does multivalue's handling of multiple values in a single column; once you're at 4NF or above, relational's cost is equal to or less than multivalue's.
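
As a hedged illustration of what 4NF buys (my own invented schema, not an example taken from Date): two independent multivalued facts about a person, split so that each insert touches exactly one small table:

    # Invented example of a 4NF decomposition. One person with two
    # independent multivalued facts: phone numbers and email addresses.

    # Un-normalized multivalue record: both lists live in one record.
    person = {"ID": 7,
              "PHONES": ["555-0100", "555-0101"],
              "EMAILS": ["a@x", "b@x"]}

    # 4NF: each independent multivalued fact gets its own table, so
    # adding one phone is a one-row insert touching one table and index.
    person_phone = [(7, "555-0100"), (7, "555-0101")]
    person_email = [(7, "a@x"), (7, "b@x")]

    # Adding a phone in 4NF: one row, one table.
    person_phone.append((7, "555-0102"))

    # The multivalue equivalent rewrites the PHONES list inside the
    # record, and any index over PHONES must be updated for that record.
    person["PHONES"].append("555-0102")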



gcobol: a native COBOL compiler

Posted Apr 8, 2022 22:16 UTC (Fri) by Wol (subscriber, #4433)

> Yes, because I can find the right place to read from the tables faster than you can find the right place to read in your file. We had a worked example in another thread, where you indicated a total of 10 reads to output the right data, while in the relational example you were trying to beat, 5 reads was enough for PostgreSQL, and 6 reads were needed by MySQL because MySQL doesn't have useful search indexes for one of the cases.

That's a new claim ... My files are hashed: if I know the primary key, I know exactly where to look with almost no effort whatsoever. And the whole point of Pick is that it's all primary-key based.
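
Roughly, that lookup goes like this (a toy Python sketch; the bucket count, frame size, and hash function are all invented for illustration):

    # Rough sketch of a hashed-file lookup. Knowing the primary key,
    # the bucket is computed, not searched for.

    NUM_BUCKETS = 101      # a Pick-style file has a fixed modulo
    BUCKET_SIZE = 4096     # one bucket == one disk frame, say 4 KiB

    def bucket_offset(key: str) -> int:
        """Byte offset of the bucket that must hold this key."""
        h = sum(key.encode())  # toy hash; real systems use better ones
        return (h % NUM_BUCKETS) * BUCKET_SIZE

    # One computation, then (ideally) one disk read at that offset:
    offset = bucket_offset("S123")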

I think I remember the example with 10 reads: is that because the example contained 10 table rows? I don't remember you saying PostgreSQL needed only 5 reads. One for the order details and four for the 10 rows? How does PostgreSQL retrieve ten distinct rows in only four disk blocks? The answer is obvious: they happen to cluster. But is that natural clustering because the table is small, or is PostgreSQL intelligent enough to recognise they are related and cluster them together? Do you know? Because Pick will benefit just as much from natural clustering. But if I stored those ten rows as a sub-table in the order details, I have intelligent clustering, and the number of reads specifically to retrieve those ten rows is ZERO: they come along with the order details.
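
Putting some back-of-envelope arithmetic on the read counts being argued about (all numbers invented for illustration):

    # Rough read-count arithmetic; every number here is made up.
    BLOCK = 4096        # bytes per disk block
    row = 100           # bytes per relational line-item row
    order_hdr = 300     # bytes of order-header data
    n_items = 10

    # Relational: header read, then the item rows, however they cluster.
    rows_per_block = BLOCK // row                       # 40 rows per block
    item_reads_clustered = -(-n_items // rows_per_block)  # ceil -> 1 block
    item_reads_scattered = n_items                      # worst case: 1 each

    # Multivalue, items embedded in the order record: one record read
    # delivers the header and all ten items together.
    record_bytes = order_hdr + n_items * row            # 1300 bytes
    embedded_reads = -(-record_bytes // BLOCK)          # 1 block for all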

(I've carefully avoided claiming, or taking into account, anything where I think outside influences benefit or hinder both databases equally: for example, I talk about calls into the OS, but I assume the OS treats us both alike ...)

> Multivalue, where a single column can now have multiple values in a single record. This makes the records bigger, and the indexes more complex (thus more expensive to read)

????????????????

Single-value index: value -> primary key
Multi-value index: value -> primary key
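
A minimal sketch of the point, with invented structures: both index shapes map value to primary key, and a multi-valued column simply contributes one entry per value:

    # Both index shapes map value -> primary key (toy dict-based sketch).

    # Single-value column: one index entry per record.
    single_idx = {"ACME": ["S123"]}

    # Multi-value column: still value -> key; a record with three ITEMS
    # just contributes three entries.
    multi_idx = {}
    for value in ["widget", "sprocket", "flange"]:
        multi_idx.setdefault(value, []).append("S123")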

Either you don't have a clue what multi-value is, or your definition of multi-value is completely different from mine.

Likewise, your definition of a hashed file seems to be completely different from mine: the whole point of hashed files is that, if you know the key, you can compute the location of the data you're looking for in no time flat. How does PostgreSQL find the location of the rows FASTER than a hash compute? Do you know, or are you just blindly believing EXPLAIN without understanding how it does it? Because I would LOVE to know how it does it.
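
(For comparison, the standard relational answer is a B-tree descent: PostgreSQL's default index type is a B-tree, read one block per level, and with the upper levels cached in RAM the disk cost often ends up comparable to a hash compute. A rough sketch, with the fanout invented:)

    import math

    # Hash file: one modulo computation locates the bucket; ~1 block read.
    def hash_reads(_n_records: int) -> int:
        return 1

    # B-tree: one block read per level; depth grows as log(n), base fanout.
    def btree_reads(n_records: int, fanout: int = 256) -> int:
        return max(1, math.ceil(math.log(n_records, fanout)))

    # With the upper levels cached (the usual case), the B-tree often
    # costs about one actual disk read too, which is how EXPLAIN can
    # report so few block accesses.
    print(hash_reads(10**6), btree_reads(10**6))   # -> 1 3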

I'll give you that Pick records are far bigger than relational rows. But the statistical probability that you WANT that extra data is high, which means the information-to-noise ratio PER BLOCK READ is also high. What's the probability that you want the extra rows dragged in with each relational read? Low? So the information-to-noise ratio decays much faster with relational as database size grows, precisely because each "unit of data" (your row to my record) is much smaller. Bigger records mean less noise in the retrieved blocks.
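
Putting toy numbers on the signal-per-block argument (everything here is invented for illustration):

    # Toy numbers for "useful bytes per block read"; all invented.
    BLOCK = 4096
    wanted = 1300        # bytes the application actually asked for

    # Multivalue: the wanted bytes sit together in one record.
    mv_blocks = -(-wanted // BLOCK)             # 1 block
    mv_signal = wanted / (mv_blocks * BLOCK)    # ~32% of read bytes useful

    # Relational: the same bytes spread over rows in, say, 6 blocks,
    # each block mostly filled with unrelated rows.
    rel_blocks = 6
    rel_signal = wanted / (rel_blocks * BLOCK)  # ~5% of read bytes useful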

Cheers,
Wol

