Python sets, frozensets, and literals

Posted Jan 22, 2022 18:35 UTC (Sat) by Wol (subscriber, #4433)
In reply to: Python sets, frozensets, and literals by NYKevin
Parent article: Python sets, frozensets, and literals

> > Okay, the line items may be a set of line items, but the order on the invoice is both RANDOM and PRESERVED.

> But what does that actually accomplish? I would much rather have invoice line items sorted (by name, by serial number, or by some other reasonable thing such as total or individual price) than in "natural" order. It's the same reason ls with no arguments automatically sorts a directory's contents - unsorted natural order is less useful to most people most of the time, so you have to explicitly request it.

It keeps the auditors happy? :-)

And as for sorting, how is the computer suposed to know what the correct order is? Do you sort upper and lower case together or separate? I still haven't worked out why ls sorts differently on my rebuilt gentoo system from how the original it replaced did. Does McAdam sort before or after Macadamia? (If name sorting is on, McAdam sorts before Macadamia in alphabetic order ... :-)

> > In SQL I have to create some field, fill it with random - MEANINGLESS - values who's only value is in its sort order, and if the list is mutable I need to add all sorts of extra logic to manage that field.

> No, you can just make an array, if you really want an array. All modern SQL databases have support for non-normalized columns like that. See for example https://www.postgresql.org/docs/14/arrays.html

Yup. I know modern SQL supports arrays, but I have yet to find any in production (yes my experience of production SQL databases is near nil). The problem is, arrays are second class citizens.

Oh, by the way, isn't the very definition of your SQL table a RANDOM but PRESERVED LIST of column names? :-)

> This is probably going to scale more poorly than the foreign key setup which you describe, so it's not necessarily a Good Idea... but nobody ever said life was going to be fair. You can use the database in the manner it was designed, or you can use the database in the manner which best fits your exact problem space, but it's sometimes difficult to do both at once.

I'd say the trouble with relational is that it's ALWAYS difficult :-)

The trouble with Pick is it gives you too much rope ...

To take my example of an invoice, is a line item an attribute of the invoice, or an instance of a general ledger entry? Because if it's an attribute of the invoice you grab a bunch of columns and turn it into a 2-dimensional array. Or if it's an instance in its own right, you store a list of foreign keys in the invoice. Pick theory says "Your Choice!". SQL very much influences the designer to split them off into their own table (probably the general ledger table).

That said, whichever way you jump with Pick, like I quoted the FAQ, Pick can probably find the data you're looking for on disk, retrieve, and normalise it, faster than SQL can find it ... :-)

And like I said, Pick can throw away the order and give you a relational set much easier han SQL can recreate a list that isn't stored that way in the relational database.

I'm hoping there'll soon be an anouncement of an industrial-grade GPL2 Pick database ... YAYYYY! Still needs a bit more work on it, mostly trying to make it easier for people whos' minds have been ruined by Relational, to grok. :-)

Problem is, they'll probably say our minds have been ruined by DataBASIC :-)

Cheers,
Wol

Python sets, frozensets, and literals

Posted Jan 22, 2022 19:42 UTC (Sat) by NYKevin (subscriber, #129325) [Link] (6 responses)

> Yup. I know modern SQL supports arrays, but I have yet to find any in production (yes my experience of production SQL databases is near nil).

As someone who actually does work with production databases from time to time: You see non-normalized stuff. It's not *typical*, but it does exist and is used by real systems for serious purposes.

> Oh, by the way, isn't the very definition of your SQL table a RANDOM but PRESERVED LIST of column names? :-)

This is SQL, not Lisp. Besides, column order is irrelevant unless you do something like SELECT * (don't do that!).

> To take my example of an invoice, is a line item an attribute of the invoice, or an instance of a general ledger entry?

I would do it like this:

CREATE TABLE products(
id integer PRIMARY KEY,
name varchar(1024) NOT NULL,
unit_price numeric(11, 2) NOT NULL CHECK unit_price > 0, -- If your products cost more than $1B, increase precision as needed, or use money type instead of numeric.
inventory_count integer NOT NULL CHECK inventory_count >= 0,
-- put more per-product columns here...
);

CREATE TABLE invoices(
id integer PRIMARY KEY,
when_issued timestamp NOT NULL,
-- possibly other relevant timestamps here.
);

CREATE TABLE invoice_items(
-- or ledger_entries, if you're doing a more general accounting system.
-- But a general accounting system should be based on credits and debits, so you probably get a different table structure from that.
id integer PRIMARY KEY,
invoice_id integer NOT NULL REFERENCES invoices(id),
product_id integer NOT NULL REFERENCES products(id),
quantity integer NOT NULL CHECK quantity > 0,
);

CREATE VIEW invoice_subtotals AS
SELECT itm.invoice_id AS invoice_id, itm.product_id AS product_id, itm.quantity * prd.unit_price AS subtotal
FROM invoice_items itm
JOIN products prd ON itm.product_id = prd.id;

CREATE VIEW invoice_grand_totals AS
SELECT SUM(subtotal)
FROM invoice_subtotals
GROUP_BY invoice_id;

A modern SQL engine will automatically infer that it needs to create indexes for all of the PRIMARY KEY columns, and do so without my asking, so those lookups are all O(log n). Similarly, it will keep track of how big the join tables get, and automatically choose the most efficient type of join to do (nested loop vs. indexed join vs. hash join) for each particular query by estimating the number of rows which that query will hit. From the perspective of "developer time is more expensive than CPU time," this is a Very Good Thing (at least until it breaks and you have to manually hint it, anyway).

Python sets, frozensets, and literals

Posted Jan 23, 2022 0:05 UTC (Sun) by Wol (subscriber, #4433) [Link] (5 responses)

Okay, doing roughly the same as you ...

ED DICT PRODUCTS

...@ID primary key
...NAME
...UNIT_PRICE
...INVENTORY_COUNT
...
...

ED DICT INVOICES

...@ID primary key
...DATE
...ADDRESS_TYPE as linked array ADDRESS
...ADDRESS_LINE_1 as linked array ADDRESS
...ADDRESS_LINE_2 as linked array ADDRESS
...ADDRESS_TOWN as linked array ADDRESS
...PRODUCT_ID as linked array LINE_ITEM
...PRODUCT_QUANTITY as linked array LINE_ITEM
...PRODUCT_NAME as TRANSLATE( PRODUCT_ID, PRODUCTS, NAME) as linked array LINE_ITEM
...PRODUCT_PRICE as TRANSLATE( PRODUCT_ID, PRODUCTS, UNIT_PRICE) as linked array LINE_ITEM
...LINE_PRICE as PRODUCT_QUANTITY * PRODUCT_PRICE as linked array LINE_ITEM

I didn't declare datatypes, because everything is a string. In the basic system (I did say Pick gives you plenty of rope to hang yourself :-) you reply on the application to enforce data types, but pretty much every system nowadays enables you to enforce it at the database level.

So I now have two tables to your three (oh, and I've thrown in an address table too, for delivery/invoice/any other address you care for).

You've got two views, I've got none. Although I'll grant you I need a LIST (what SQL calls a SELECT) to get the equivalent of your view.

LIST INVOICES @ID TOTAL LINE_PRICE GRAND_TOTAL LINE_PRICE

Gives me a list of all invoices, with the invoice total, and sums all invoices to give me the grand total.

Note that I haven't got ANY indices - I don't need any. Foreign keys and hashed files, all data accesses apart from the first are O(1). The only "index" I need (or use) is the list of keys to INVOICES.

I don't have any joins - note I have a TRANSLATE function in my INVOICES dictionary which takes as arguments the table, foreign key, and column name. And again with foreign keys and hashed files, this is an O(1) data retrieval. Okay, I'm making two access to the same foreign record, but I can rely (pretty safely) on disk caching saving me, or any modern implementation can just cache the foreign record directly.

Oh - and my SQL SELECT implementation that runs on top of Pick will see ADDRESS and LINE_ITEM as two tables it can access as normal.

So. I don't even need to create inferred indices in order to get O(1) !!! on my lookups. I don't have any join tables to worry about. So I think I have just seriously kicked your butt on speed !!! And as for developer time being more valuable than cpu time, I think having just two tables, PRODUCT and INVOICE, is a lot simpler to understand than needing a third LINE_ITEM table, along with your two surplus views.

Sorry, game, set and match I think!

Cheers,
Wol

Python sets, frozensets, and literals

Posted Jan 23, 2022 10:19 UTC (Sun) by NYKevin (subscriber, #129325) [Link] (4 responses)

> I didn't declare datatypes, because everything is a string. In the basic system (I did say Pick gives you plenty of rope to hang yourself :-) you reply on the application to enforce data types, but pretty much every system nowadays enables you to enforce it at the database level.

Stringly typed is a recipe for disaster.

> Note that I haven't got ANY indices - I don't need any. Foreign keys and hashed files, all data accesses apart from the first are O(1). The only "index" I need (or use) is the list of keys to INVOICES.

O(1) is only fast if the constant factor is low. That's why SQL keeps track of whether the table is big or small. If it has very few rows, SQL will do a nested loop, with a nice linear cacheable access pattern. If the data is huge, SQL can and will do a O(1) hash join.

I'm sure you *can* do that in Pick, but in SQL, you don't have to think about it. It's automatic.

> So I think I have just seriously kicked your butt on speed !!!

Really? You built a benchmark and got meaningful numbers from both systems *that* quickly? Or is this purely hypothetical "speed?"

> And as for developer time being more valuable than cpu time, I think having just two tables, PRODUCT and INVOICE, is a lot simpler to understand than needing a third LINE_ITEM table, along with your two surplus views.

The whole point of the views is that you can ignore the underlying tables if you don't need them.

Python sets, frozensets, and literals

Posted Jan 23, 2022 11:30 UTC (Sun) by Wol (subscriber, #4433) [Link] (3 responses)

> > I didn't declare datatypes, because everything is a string. In the basic system (I did say Pick gives you plenty of rope to hang yourself :-) you reply on the application to enforce data types, but pretty much every system nowadays enables you to enforce it at the database level.

> Stringly typed is a recipe for disaster.

I don't TOTALLY agree. Yes I really miss strong typing some of the time. Equally, I don't miss trying to keep track of FLOAT, INT16, INT64, BIGNUM, etc. FFS just give me a number ! (And be able to multiple by ten just by string-concatenating a zero is a nifty feature sometimes - hey did I talk about rope :-)

> > Note that I haven't got ANY indices - I don't need any. Foreign keys and hashed files, all data accesses apart from the first are O(1). The only "index" I need (or use) is the list of keys to INVOICES.

> O(1) is only fast if the constant factor is low. That's why SQL keeps track of whether the table is big or small. If it has very few rows, SQL will do a nested loop, with a nice linear cacheable access pattern. If the data is huge, SQL can and will do a O(1) hash join.

If the constant factor is low ... a SINGLE disk seek on a cold data access is low, I think? That's pretty much the accepted time ...

> I'm sure you *can* do that in Pick, but in SQL, you don't have to think about it. It's automatic.

I don't think that approach would make sense in Pick, or rather, it melds the two. Because it's a hash database, it'll do a straight O(1) access to the bucket, then a lookup inside it. Depends how many rows there are in the bucket.

> > So I think I have just seriously kicked your butt on speed !!!

> Really? You built a benchmark and got meaningful numbers from both systems *that* quickly? Or is this purely hypothetical "speed?"

Sorry, yes, it's a gedanken experiment. But, on a cold database, I can count the minimum necessary disk accesses. And experience says that figure is DAMN close. I'm also pretty certain information theory would say it's impossible to improve on ... Relational theory says you're not allowed to, but you can work out the minimum. And if relational relies heavily on hot hash indexes, which I think it does, me assuming that the Pick database is stone cold says a lot ...

Real world experience bears this out - find pretty much any Pick -> Relational migration story, and it will be full of stories about how the Pick system ran rings round its replacement. My favourite is the consultants who proudly announced - after six months of hard work - that some important query now ran faster on the new system than the old. Unfortunately for them, the guy responsible for the old system overheard and said "What? You're PROUD that your Twin Xeon 800 is only just faster than an old Pentium 90!?"

> > And as for developer time being more valuable than cpu time, I think having just two tables, PRODUCT and INVOICE, is a lot simpler to understand than needing a third LINE_ITEM table, along with your two surplus views.

> The whole point of the views is that you can ignore the underlying tables if you don't need them.

Agreed. But if you do need them they bring a whole new level of complexity ... An invoice is a real world object. My brain can grasp the concept of an invoice. If I need to do ANYTHING with an invoice, it's ALL in INVOICES table (and therein lies another of Pick's problems - define "what is an object"). If I suddenly need a table called "address", how many different address tables are there scattered about the relational schema? Probably quite a few? In Pick, if it's an invoice address the invoice table will tell me. Again, and this is personal experience, working with Pick I DON'T NEED this big schema on the wall telling me what tables go where. If I understand the real world problem, the Pick schema is a pretty close approximation, and if I need to drill down, ALL THE INFORMATION IS IN THE TABLES. The whole point of Pick is that you should NEVER EVER have to search for data that's related to your query. Okay, "find me all customers that ..." is a search, but "get me details of all their invoices in the last 6 months" is not. That second half of the query should be an order one drill down, not a join.

(And I'll just throw in my experience as a newbie SQL programmer, oh my god is it complex and error prone. What Pick does in one line, with all the logic in the table dictionary, SQL takes multiple, multiple line, selects - I could write a hundred line query easily for what Pick will do in one simple command line!)

Cheers,
Wol

Python sets, frozensets, and literals

Posted Jan 25, 2022 3:09 UTC (Tue) by NYKevin (subscriber, #129325) [Link] (2 responses)

I can write code like this, long after my database is already deployed and populated, and a migration would be expensive:

WITH recent_invoices AS (
SELECT *
FROM invoices
WHERE age(when_issued) < '30 days' -- your syntax may vary here.
)
SELECT p.id, p.name, p.inventory_count - COALESCE(SUM(itm.quantity), 0) AS forecasted_surplus
FROM products p
LEFT JOIN (
invoice_items itm JOIN recent_invoices inv ON itm.invoice_id = inv.id
) ON itm.product_id = p.id
GROUP BY p.id, p.name, p.inventory_count;

I probably also need to run a CREATE INDEX on the invoices(when_issued) column, but that's not too expensive to run once (it's basically a full table scan, with writes). This should (on most reasonable systems) create a B-tree index which can keep track of the order of invoices (i.e. it doesn't just stick them into hash buckets). Then I can get monthly forecasts of my inventory based on last month's demand.

Note that we cannot assume that recently-inserted invoices are necessarily issued more recently than older invoices, because new invoices may need to be backdated, so we can't simply loop over the full table backwards and stop when we start hitting old invoices, because the table is not necessarily in the right order. We have to use an index, or do a full table scan going back through (potentially) years of ancient invoices. Also note that adding this query does not require significant changes to our existing schema - we run one very simple CREATE INDEX command, do some load testing just in case, and that's it, we're ready to roll this out to prod.

Finally, while this may look complicated, I banged it out in about an hour, and frankly it's not that hard to understand for people who use SQL regularly. Quick translation:

* WITH name AS SELECT ... turns another query into a temporary table ("common table expression"). Here all it does is pull the "within the past month" logic out so that the FROM part of the main query does not get too hard to read, but you can also use it if you're going to do the same sub-select more than once.
* The join in parentheses needs to happen first, so that we throw away any invoice_items which don't match one of the existing invoices. We could do this as a second CTE, but that's uglier IMHO.
* LEFT JOIN is an abbreviation of LEFT OUTER JOIN, and has the effect of manufacturing an all-NULL invoice and invoice_item for any products which were not purchased in the past month.
* COALESCE() turns the NULL back into a zero.

With your solution... well, I'm not sure exactly how Pick works, but I tend to imagine you will at least need to introduce a products table so that you know how many are in inventory, and then do an outer join against the invoices table.

Python sets, frozensets, and literals

Posted Jan 25, 2022 3:24 UTC (Tue) by NYKevin (subscriber, #129325) [Link]

> you will at least need to introduce a products table

Actually, your solution already has that information.

Python sets, frozensets, and literals

Posted Jan 25, 2022 8:57 UTC (Tue) by Wol (subscriber, #4433) [Link]

Okay, so have I got this right? You're creating a report on everything sold in the last month?

Like you, I need to create an index on DATE, so evens there. I'd rather put a date in the query, but adding another column to the invoice is easy

ED DICT INVOICES
... create AGE as TODAY - INVOICE.DATE

I think Pick would struggle with me creating an index on a dynamically created field ... :-)

Then it's something as simple as

LIST INVOICES WITH INVOICE.DATE > 25/12/2022 ITEM.NAME ITEM.QUANTITY GRAND.TOTAL ITEM.QUANTITY BY ITEM.NAME.

Okay I'd need to make sure I got the syntax right, but it cost me more time to understand what you were doing than to come up with my first cut at the Pick equivalent. And now I see you're also comparing against current inventory ...

LIST INVOICES WITH INVOICE.DATE > 25/12/2022 ITEM.CODE ITEM.NAME ITEM.QUANTITY GRAND.TOTAL ITEM.QUANTITY BY ITEM.NAME EVAL(TRANSLATE PRODUCTS ITEM.CODE STOCK) AS STOCK EVAL (STOCK - ITEM.QUANTITY) AS SURPLUS

There's probably a bit more to it than that, I've never had to write that sort of query before, but that's about it. Nested Pick queries are more likely to be written as procedures than command lines. (Note the translate instead of a join...)

Cheers,
Wol