gcobol: a native COBOL compiler
Posted Mar 19, 2022 11:13 UTC (Sat) by Wol (subscriber, #4433)
In reply to: gcobol: a native COBOL compiler by Cyberax
Parent article: gcobol: a native COBOL compiler
> Because you're mixing logic and data. Ideally you would like to keep the data separate, in a normalized form.
Ideally you want to keep logic and data separate, true. And in Pick there's nothing to stop you (apart from your own incompetence). The difference is that Relational *tries* to stop you, while Pick gives you plenty of rope to hang yourself.
But I ALSO want to keep data and metadata separate (I define metadata as anything that can be logically derived from the data, so ordinality - first, second, last, etc - is NOT data).
The act of normalising to First Normal Form muddles data and metadata such that it is impossible to keep them separate.
> > Or did you mean "study databases that NEED a hundred-line query to do what my database can do in one line"?
> I'm sorry, but so far I haven't seen you provide such an example. Moreover, I'm actually 100% certain that any modern RDBMS can emulate all Pick's features easily.
Well, nobody's provided me with a hundred-line SQL query to emulate. But my personal experience as a Pick programmer was that all my queries were two or three lines at most. If in all my decades of programming Pick I've never met a query like that, yet in six months of programming SQL pretty much every query I've ever worked with is 20, 30, 50 lines long, then I think the *evidence* is on my side.
As for emulation, it's noticeable I'm far more into the Science / Reality side of things than the Maths / Theory side. Maths can prove anything it likes; if the Universe begs to differ, sorry, I'm on the side of the Universe. Going back to the topic of the article :-) let me pose a question - given hardware of EQUAL ability, which is FASTER: to run your IBM 370 COBOL executable on an IBM 370, or on an IBM 370 machine-code emulator? OF COURSE running it on the emulator will be a lot slower.
And I'm not going to argue with your statement that Relational can emulate Pick. It's maths, of course it can be done. But did you see my "Proof by Physics" that - actually - it CAN'T be done EASILY?
My Pick abstraction is a mathematical Proper Superset of your First Normal Form. If I provide a layer (I hesitate to call it an emulation layer) that THROWS AWAY my primary-key, ordinality META-data, I'm not *emulating* FNF, I'm *providing* FNF. And I've been forced to increase my entropy to do so. So the second law of thermodynamics says that if you want to emulate Pick it requires Work, and lots of it, because in order to reduce the entropy back to the Pick system's natural level you have to expend work. (Notice that, in the Physics sense, I did ABSOLUTELY NO WORK AT ALL to give you your Relational model.)
Likewise when I threw my invoice model at you. Let me give you that challenge again. You have an invoice with two addresses and ten line items. What is the *minimum* number of data accesses (in SQL terms rows, in Pick terms RECORDS, in data theory terms I'll call it molecules rather than atoms - atoms bound together such that removing one will destroy the information) required?
Note that you can't be clever and say "I'll group all my line items for the same invoice together on invoice number", because they're part of the General Ledger, and that will want to use different access logic.
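To pin the challenge down, here's a minimal 1NF sketch of that invoice (table and column names invented) - one row per molecule, so 1 invoice + 2 addresses + 10 line items = 13 rows to fetch:

-- hypothetical 1NF layout of the invoice challenge
CREATE TABLE address   (id INT PRIMARY KEY, street TEXT, town TEXT);
CREATE TABLE invoice   (id INT PRIMARY KEY,
                        billing_addr  INT REFERENCES address(id),
                        shipping_addr INT REFERENCES address(id));
CREATE TABLE line_item (invoice_id INT REFERENCES invoice(id),
                        line_no INT, book TEXT, quantity INT, price NUMERIC,
                        PRIMARY KEY (invoice_id, line_no));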
What's the speed limit of the data Universe? Can you conceive of ANY POSSIBLE THEORY WHATSOEVER that would allow you to access those 13 molecules of data in fewer than 13 accesses? Can you see any way of re-organising them into fewer than thirteen molecules (and yes, I did fix a bug to reduce it to 11 :-))? Because I can do a Gedanken experiment that proves that with Pick I need (on average) between 13 and 14 accesses, and I can do a real experiment to test that. Because Relational says "don't worry about the man behind the curtain", you can't even do a Gedanken experiment.
I'll state my case very simply - THE SECOND LAW OF THERMODYNAMICS says Relational is - MUST BE - slower and more complicated than Pick in real life.
> Sure, SQL syntax might be a bit weird, but modern RDBMS allow writing extensions in friendlier languages like Python or even C#.
And Pick has its own version of BASIC since before I started programming four decades ago. And it's a very friendly language. :-)
Cheers,
Wol
Posted Mar 19, 2022 12:09 UTC (Sat)
by mpr22 (subscriber, #60784)
[Link] (1 responses)
This is good.
But so much of the SQL I write is not "fetch this record".
It's "for the client's customers whose renewal dates are in a certain range, find me all of their payments by a certain method, in a certain status, which were created, modified, or finalized in a certain timestamp range".
Posted Mar 19, 2022 17:22 UTC (Sat)
by Wol (subscriber, #4433)
[Link]
Great. A challenge.
Firstly, I'm going to assume processing costs are much less than disk access costs so I'm going to ignore them. They're lost in the noise. And I'm going to optimise my Pick data layout, but I think the rules I apply will be pretty much the same rules you use. Accessing the same record/row multiple times will be counted as one disk access. I'll only call it out as a win for Pick if I can show that one record access for me equals multiple row accesses for you ...
And I'll assume, as it appears to be a subscription model, that I have a list of all those payments stored in the client record, the physical implementation of which will be an index into the ledger, on client id.
And bear in mind my SELECT command does not return a record set, it returns a list of IDs for me to process further. Which means I can save the list and retrieve it again later for further processing, without the cost of having to do a new SELECT. That doesn't apply here, though.
Okay, let's go. My first two commands could easily be combined into one, but it's easier to explain as two. I'm assuming RENEWAL.DATE has been predefined as an index.
SELECT CUSTOMERS WITH RENEWAL.DATE GE date1 AND WITH RENEWAL.DATE LE date2
This would be a complete index scan. Curiously enough, the more customers there are, the more likely the key and value are to get separated which will increase the efficiency of the scan.
SELECT CUSTOMERS SAVING LEDGER_ID LIST.REQUIRED
The LIST.REQUIRED keyword will cause the command to terminate if the previous SELECT didn't return a list of customers. Otherwise you'll get a mess reminiscent of screwing up your join type in relational :-)
But this command will now read the CUSTOMER index on the LEDGER file. For each customer renewing, it will take one disk access to retrieve a list of all the ledger entries of interest. LEDGER_ID is defined on the CUSTOMER file as a calculated field which does that retrieval.
SELECT LEDGER WITH METHOD EQ this AND WITH STATUS EQ that LIST.REQUIRED
I'll actually stop here because I think "timestamp range" is a SQL construct? It splits a huge table into a whole bunch of smaller separate tables which share the same table definition? Pick has a similar construct called "distributed files".
I'll assume it's not feasible to index METHOD and STATUS because the set of possible values is small and the number of records is huge, but if I could, it would be a simple "read two values from the index, intersect all three lists of keys". My worst case is having to go through the list of keys, read every item, and select by comparing the actual values. That's one disk read per ledger id.
LIST LEDGER LIST.REQUIRED
If I haven't already pulled all the records into RAM, this pulls all of the ledger records in, so I match the fact that SQL actually creates a record set. Without this, I've merely created a list of all the records I want, and I'm cheating :-)
As for the timestamp stuff, if I've understood this correctly, Pick requires me to store this information in the record key, so I can select the key list and select the records of interest with an in-memory scan of the key list.
So!
As I understand relational, you can't even do that analysis of how much work your query is going to involve? You can't ask "the wizard behind the curtain"?
And let's forget about the database entirely. Use information theory. Is it even possible to retrieve that information with less work? As I said in a previous comment, for every "request for data" Pick makes to the disk, it has an approx 5% miss rate.
Cheers,
Wol
Posted Apr 6, 2022 22:59 UTC (Wed)
by nix (subscriber, #2304)
[Link]
This really really does sound like absolute crankery. We know the relevant natural law in this area where computers and information processing are concerned. It's the Bekenstein bound, and we are so *so* far from it that there is literally no way it can possibly apply to anything you have ever done (not unless you're using a black hole at absolute zero as a computer and you hadn't mentioned it).
Posted Apr 6, 2022 23:14 UTC (Wed)
by Cyberax (✭ supporter ✭, #52523)
[Link] (10 responses)
No, it doesn't. All the Pick features (multi-dimensional arrays and computed views) are available in stock PostgreSQL and they are implemented in a similar fashion. There's literally nothing in Pick that can't be done better, faster and more flexibly in Postgres.
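For instance, a quick sketch in stock PostgreSQL (schema invented) of a multivalued attribute and a computed "virtual field":

-- a Pick-style multivalued attribute as array columns
CREATE TABLE orders (
    order_no   int PRIMARY KEY,
    cust_name  text,
    books      text[],
    quantities int[],
    prices     numeric[]
);
-- a computed view, playing the role of a Pick virtual field
CREATE VIEW order_totals AS
SELECT order_no,
       (SELECT sum(q * p)
          FROM unnest(quantities, prices) AS t(q, p)) AS total_cost
  FROM orders;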
I'm pretty sure this holds for other large databases like MSSQL and Oracle.
Posted Apr 7, 2022 9:19 UTC (Thu)
by farnz (subscriber, #17727)
[Link] (9 responses)
It's also false, now that I've investigated the theory in depth. Chris Date has shown in some of his textbooks on database systems that, assuming that you get the data into 4NF or higher (for which I've not yet found evidence that you can automate it - DDL translation appears not to be solved here), two things hold true:
- the translation of multivalue DML/DQL into SQL can be done mechanically; and
- the resulting SQL needs no more disk accesses, on average, than the multivalue original.
So, unless your use case is trivial, a mechanical translation of your DML/DQL to SQL is going to save you work in the long run - you do a small amount up-front to translate to SQL via a compiler, but your SQL database can then do fewer disk accesses on average than Pick does for the same results.
The fun result comes in with domain-key normal form (DKNF); it is possible to represent certain data models in a relational database in DKNF form without duplication, where an MV database (which is equivalent to 4NF) has to duplicate the data and rely on external processes keeping the duplicates in sync on edit.
Posted Apr 7, 2022 12:07 UTC (Thu)
by Wol (subscriber, #4433)
[Link] (8 responses)
How on earth does a SQL database do fewer disk accesses?
Each disk blob is associated with a primary key - which belongs to what I would call an entity - and contains a 3NF/4NF/whatever representation of ALL my attributes. Those attributes include the primary key(s) of other relevant entities, for example my wife, or my car.
Let's assume you want to look me up. The FILE is keyed on NI-number. You run the query on my name and yes, Pick needs an index scan to convert name to key. Then, by requesting JUST ONE disk block (assuming all this data fits in said block) you have my address, my wife's key, all my phone numbers, a list of my employer(s), a list of all my car(s), etc etc.
It takes just one more request for one disk block, and you have the same information about my wife.
(That's assuming I haven't chosen to split any bulky data, eg photo, into a separate FILE)
If this data is stored in FNF in your relational database there is just NO WAY you can be that efficient. If you're relying on AI in your database, are you sure it's going to get it right? With Pick, one disk request for one disk block gets ALL the attributes associated with any entity's primary key (unless it's deliberately been configured otherwise).
And if you want the data on my three cars, you've now already got ALL THREE primary keys - that's three more requests to disk for three more blobs. Try doing THAT with a 1NF data store!
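To make that concrete, here's roughly what such a record might look like on disk (the key, field order and contents are all invented; ^ stands for an attribute mark, ] for a value mark between multivalues):

AB123456C: 16 Acacia Avenue^London^01234 567890]07700 900123^AB654321C^EMP0042^CAR001]CAR002]CAR003

One hashed read on the key AB123456C fetches that whole blob; the keys CAR001, CAR002 and CAR003 are then three more single-block reads.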
> The fun result comes in with domain-key normal form (DKNF); it is possible to represent certain data models in a relational database in DKNF form without duplication, where an MV database (which is equivalent to 4NF) has to duplicate the data and rely on external processes keeping the duplicates in sync on edit.
Looking at Wikipedia's definition of DKNF, it's clear you don't understand. I would probably store someone's status in a virtual field.
https://en.wikipedia.org/wiki/Domain-key_normal_form
IF WEALTH LT 1000000 THEN "" ELSE IF WEALTH LT 1000000000 THEN "Millionaire" ELSE "Billionaire"
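In SQL terms, a CASE expression in a view does the same job (table and column names invented):

-- the same derived "status", computed rather than stored
SELECT name,
       CASE WHEN wealth < 1000000    THEN ''
            WHEN wealth < 1000000000 THEN 'Millionaire'
            ELSE 'Billionaire'
       END AS status
  FROM wealthy_person;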
It's easy to do in many ways. But try doing THIS without storing duplicate data ... going back to the bookstore example (and apologies for the formatting), and bearing in mind I'm defining line-item as an *attribute* of the order, not as an entity in its own right ...
ORDER_NO  CUST_NAME         BOOK                                 PRICE  QUANTITY  TOTAL.PRICE  TOTAL.COST
1234      "Liza Doolittle"  "The Importance of Being Earnest"    £5     4         £20          £34
                            "Swallows and Amazons"               £2     2         £4
                            "War and Peace"                      £10    1         £10
Only the first five columns actually exist as data (the last two are calculated, or virtual, columns). And that is ALL the data that is stored - there is only ONE copy of ORDER_NO or CUST_NAME.
Put that into 1NF and you've instantly got three extra copies of ORDER_NO, as an absolute minimum. Yes, you could put BOOK, QUANTITY and PRICE into arrays inside cells inside your order table, but does SQL know how to handle that without some horrendously complicated expression? And if you split it into a separate table, the line order becomes "meaningless" and has to be preserved explicitly - what extra crud in the query do you need to handle that?
And if I suddenly decide to treat line item as an entity in its own right I could re-organise the data, or I could just treat it as the sub-table it currently is, and access it like any other table.
Cheers,
Wol
Posted Apr 7, 2022 14:50 UTC (Thu)
by farnz (subscriber, #17727)
[Link] (7 responses)
You really need to read something like An Introduction to Database Systems by Chris Date to make your arguments.
In the case you've described, both systems indeed do the same number of accesses. However, you're doing point lookups - lookups on a single entity. Where relational wins is when I want to do a more complex query - e.g. how, given the database layout you have shown, do I get all people that have been paid by a given list of employers without rearranging the data as stored? In a multivalue database, I have to do more work than in 4NF.
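In SQL, against a 4NF payments table (names invented), that query is a one-liner:

-- everyone paid by any employer in the given list
SELECT DISTINCT person_id
  FROM payments
 WHERE employer_id IN (:employer_list);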
And I understand perfectly - Wikipedia's definition of DKNF is incomplete as compared to the one you'll find in a textbook on database design, and you're not actually putting your data into DKNF. Your description of putting it into 1NF is correct, but that's not how you are supposed to use relational - at a minimum, you need BCNF for optimal results, and there are cases where BCNF is insufficient and you need 4NF or 5NF.
To convince me, you need to demonstrate why Mr Date's cited proof in the textbook given above that relational with 4NF is equivalent to or better than multivalue is wrong. So far, all you've been able to show is that you don't understand relational to the level of a student who's taken a course in database systems, which is why you're coming across as a crank - of course multivalue beats relational if you hobble relational by insisting on only going to 1NF, just as relational would beat multivalue if you insisted that you cannot have more than one value for a single attribute (hence hobbling multivalue by taking away the thing it does well).
Posted Apr 8, 2022 0:36 UTC (Fri)
by Wol (subscriber, #4433)
[Link] (6 responses)
Why on earth would I want to re-organise the database? I don't remember any mention of an employee database (my favourite example is invoices and I did that bookstore one). If I have an employer database I would have all the employer data in one FILE, all the people data in another. What would I want to move, and where would I want to move it? This shows you clearly have no clue about the Pick data structure, sorry.
And I don't know what you mean by "where relational wins is when I want to do a more complex query". So far, "more complex" seems to me to mean "using a join", which simply shows you have no clue WHY JOINS ARE UNNECESSARY in Pick. The whole point of Pick, is EVERYTHING is a point lookup on a single entity, apart from the initial search. And if you happen to know the initial primary key you don't even need that search. That's why it's so fast.
Let's take your example. I guess you have a table "payments". I'll have a FILE "payments". Chances are my FILE will be in 1NF, because I think all the attributes are single-valued: a random GUID as primary key, and at least two attributes, "payer" and "payee", which are the primary keys of the "employer" and the "person".
SELECT PAYMENTS WITH EMPLOYER_ID EQ {list of employers} SAVING UNIQUE PERSON_ID
LIST PERSON
Is that REALLY more complicated than your SQL?
If the PAYMENTS file has an index on EMPLOYER_ID I haven't at any point done a table scan - EVERYTHING has been a point lookup. Even if I was mad enough to store the payments in the person table, all I need is an index on EMPLOYER_ID. If I stored the payments in the employer file I think it would be even easier.
> To convince me, you need to demonstrate why Mr Date's cited proof in the textbook given above that relational with 4NF is equivalent to or better than multivalue is wrong. So far, all you've been able to show is that you don't understand relational to the level of a student who's taken a course in database systems, which is why you're coming across as a crank - of course multivalue beats relational if you hobble relational by insisting on only going to 1NF, just as relational would beat multivalue if you insisted that you cannot have more than one value for a single attribute (hence hobbling multivalue by taking away the thing it does well).
So you point me to a £150 text book, and tell me Wikipedia is wrong ...
Oh - and at no point whatsoever have you provided any evidence that you understand what's going on "under the bonnet" of your favourite relational system. If the data is not stored on disk in 1NF, how is it stored? I'll freely admit I don't know, but I get the impression you don't either.
And I'm left with the impression that pretty much any half-decent EAR analysis makes all your normal forms just fall out in the wash. If I store all of an entity's attributes together in a blob, with a DICTionary that makes breaking it down into 1NF easy (think an XML DTD), then you can build all your normal forms on top of it. But you cannot beat Pick for actually getting that data off of backing storage into the database.
I'm left with the impression that you know a lot about relational theory, but your knowledge stops when it hits the database. And it also stops with Relational, without understanding the advantages and disadvantages of other databases and theories. Whereas I know far more about what - Pick at least - actually DOES with that data to stash it away and retrieve it. Relational explicitly hides how it stores the data "in case we find a better way". Pick says "no better way is possible", and I'm afraid I agree with Pick.
Cheers,
Wol
Posted Apr 8, 2022 8:13 UTC (Fri)
by farnz (subscriber, #17727)
[Link] (5 responses)
I'm not telling you Wikipedia is wrong - I'm telling you that, as an encyclopedia, it's abbreviated and incomplete compared to two chapters of one of the definitive texts on database theory.
And data is not stored in 1NF in a relational database - it's stored in the tables that you've defined. If you model the data in 1NF, then those tables are in 1NF; if you model it in 6NF, then those tables are in 6NF. See the PostgreSQL documentation on storage for how PostgreSQL stores that data on disk. In 4NF, where the proof of beating relational exists, I'd have multiple tables to match your Pick file.
At this point, I'm done. You're refusing to read the definitive text on databases, which covers multivalue as well as relational, you're ignoring 50 years of well-studied theory, and you're making assertions about how relational works that are simply dishonest and false.
Posted Apr 8, 2022 8:41 UTC (Fri)
by mpr22 (subscriber, #60784)
[Link] (1 responses)
Awkward typo there :)
Posted Apr 8, 2022 9:46 UTC (Fri)
by farnz (subscriber, #17727)
[Link]
You're right - I meant to write "proof of relational beating multivalue". Still annoyed by this thread, which Wol has chosen to ignore, in which I showed that my mental model of PostgreSQL was accurate, and it beat Pick on Wol's chosen metric.
But, of course, Wol chose to dismiss it on the basis that having made a prediction, I validated it against reality, and my prediction turned out to be accurate - it's only "scientific" if it fits his prejudices, not if it can be verified against what the database engine does.
Posted Apr 8, 2022 11:53 UTC (Fri)
by Wol (subscriber, #4433)
[Link] (2 responses)
I'm REALLY confused now ...
I have just one FILE.
The documentation tells me PostgreSQL stores data, in rows, in tables.
So you can read MULTIPLE rows, from MULTIPLE tables, faster than I can read ONE record from one file? Truly?
Cheers,
Wol
Posted Apr 8, 2022 12:38 UTC (Fri)
by farnz (subscriber, #17727)
[Link] (1 responses)
Yes, because I can find the right place to read from the tables faster than you can find the right place to read in your file. We had a worked example in another thread, where you indicated a total of 10 reads to output the right data, while in the relational example you were trying to beat, 5 reads was enough for PostgreSQL, and 6 reads were needed by MySQL because MySQL doesn't have useful search indexes for one of the cases.
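(For what it's worth, a sketch of how one measures that sort of thing in PostgreSQL - query and table invented:

EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM order_lines WHERE order_no = 1234;

The BUFFERS option makes each node of the plan report how many blocks were found in cache and how many had to be read from disk.)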
And, of course, I can store multiple tables in a single file under the relational world, and thus just have to read one record from one file, too.
Recall your basic database history: we start with record-oriented workflows, where each entry is a single record, one value per column. We then add indexes that allow you to go from value to matching records for a column quickly. There are two directions that theory takes from here:
- Multivalue, where a single column can now have multiple values in a single record. This makes the records bigger, and the indexes more complex (thus more expensive to read).
- Relational, where instead of multiple values in a single column, you have multiple tables, linked by keys.
For complex reasons (please find a decent textbook in a library, like Christopher Date's one that I've already recommended), if you store your data in tables that are in 4NF or higher, then a mechanical translation of a multivalue insert or read to a sequence of relational inserts or a single join in relational will never involve more disk access than the corresponding change in a multivalue database. The extra tables and indexes that relational has to handle impose a cost, but so does multivalue's handling of multiple values for a single column, and once you're in 4NF or above, relational's cost is equal to or less than multivalue's cost.
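As a sketch of that mechanical translation (4NF table and column names invented), one multivalue order record becomes one parent insert plus one child insert per value:

-- the single Pick record for order 1234 ...
INSERT INTO orders (order_no, cust_name)
VALUES (1234, 'Liza Doolittle');
-- ... and its three multivalued line items, one row each
INSERT INTO order_lines (order_no, book, quantity, price)
VALUES (1234, 'The Importance of Being Earnest', 4, 5.00),
       (1234, 'Swallows and Amazons',            2, 2.00),
       (1234, 'War and Peace',                   1, 10.00);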
Posted Apr 8, 2022 22:16 UTC (Fri)
by Wol (subscriber, #4433)
[Link]
That's a new claim ... My files are hashed - if I know the primary key I know exactly where to look with almost no effort whatsoever. And the whole point of Pick is that it's all primary-key based.
I think I remember the example about 10 reads - is that because the example contained 10 table rows? I don't remember you saying PostgreSQL only needed 5 reads. One for the order details and four for the 10 rows? How does PostgreSQL retrieve ten distinct rows in only four disk blocks? The answer is obvious: they happen to cluster. But is that natural clustering because the table is small, or is PostgreSQL intelligent enough to recognise they are related and cluster them together? Do you know? Because Pick will benefit just as much from natural clustering. But if I stored those ten rows as a sub-table in the order details, I have intelligent clustering, and the number of reads specifically to retrieve those ten rows is ZERO - they come along with the order details.
(I've carefully avoided taking into account anything where I think outside influences benefit or hinder both databases equally - for instance I talk about calls into the OS, but I assume the OS treats us both alike ...)
> Multivalue, where a single column can now have multiple values in a single record. This makes the records bigger, and the indexes more complex (thus more expensive to read)
????????????????
Single-value index: value -> primary key
Multi-value index: value -> primary key
Either you don't have a clue what multi-value is, or your definition of multi-value is completely different from mine.
Likewise your definition of hashed file seems to be completely different from mine - the whole point of hashed files is that if you know the key, you can compute the location of the data you are looking for in no time flat. How does PostgreSQL find the location of the rows FASTER than a hash-compute? Do you know, or are you just blindly believing EXPLAIN without understanding how it does it? Because I would LOVE to know how it does it.
I'll give you that Pick records are far bigger than relational rows. But the statistical probability that you WANT that extra data is high. Which means the information/noise ratio PER BLOCK READ is also high. What's the probability that you want the extra rows dragged in with each relational read? Low? So the information/noise ratio decays much faster with Relational as database size grows, precisely because each "unit of data" - your row to my record - is much smaller. Bigger records means less noise in the retrieved blocks.
Cheers,
Wol