Python sets, frozensets, and literals

Posted Jan 23, 2022 0:05 UTC (Sun) by Wol (subscriber, #4433)
In reply to: Python sets, frozensets, and literals by NYKevin
Parent article: Python sets, frozensets, and literals

Okay, doing roughly the same as you ...

ED DICT PRODUCTS

...@ID primary key
...NAME
...UNIT_PRICE
...INVENTORY_COUNT
...
...

ED DICT INVOICES

...@ID primary key
...DATE
...ADDRESS_TYPE as linked array ADDRESS
...ADDRESS_LINE_1 as linked array ADDRESS
...ADDRESS_LINE_2 as linked array ADDRESS
...ADDRESS_TOWN as linked array ADDRESS
...PRODUCT_ID as linked array LINE_ITEM
...PRODUCT_QUANTITY as linked array LINE_ITEM
...PRODUCT_NAME as TRANSLATE( PRODUCT_ID, PRODUCTS, NAME) as linked array LINE_ITEM
...PRODUCT_PRICE as TRANSLATE( PRODUCT_ID, PRODUCTS, UNIT_PRICE) as linked array LINE_ITEM
...LINE_PRICE as PRODUCT_QUANTITY * PRODUCT_PRICE as linked array LINE_ITEM

I didn't declare datatypes, because everything is a string. In the basic system (I did say Pick gives you plenty of rope to hang yourself :-) you reply on the application to enforce data types, but pretty much every system nowadays enables you to enforce it at the database level.

So I now have two tables to your three (oh, and I've thrown in an address table too, for delivery/invoice/any other address you care for).

You've got two views, I've got none. Although I'll grant you I need a LIST (what SQL calls a SELECT) to get the equivalent of your view.

LIST INVOICES @ID TOTAL LINE_PRICE GRAND_TOTAL LINE_PRICE

Gives me a list of all invoices, with the invoice total, and sums all invoices to give me the grand total.

Note that I haven't got ANY indices - I don't need any. Foreign keys and hashed files, all data accesses apart from the first are O(1). The only "index" I need (or use) is the list of keys to INVOICES.

I don't have any joins - note I have a TRANSLATE function in my INVOICES dictionary which takes as arguments the table, foreign key, and column name. And again with foreign keys and hashed files, this is an O(1) data retrieval. Okay, I'm making two access to the same foreign record, but I can rely (pretty safely) on disk caching saving me, or any modern implementation can just cache the foreign record directly.

Oh - and my SQL SELECT implementation that runs on top of Pick will see ADDRESS and LINE_ITEM as two tables it can access as normal.

So. I don't even need to create inferred indices in order to get O(1) !!! on my lookups. I don't have any join tables to worry about. So I think I have just seriously kicked your butt on speed !!! And as for developer time being more valuable than cpu time, I think having just two tables, PRODUCT and INVOICE, is a lot simpler to understand than needing a third LINE_ITEM table, along with your two surplus views.

Sorry, game, set and match I think!

Cheers,
Wol

Python sets, frozensets, and literals

Posted Jan 23, 2022 10:19 UTC (Sun) by NYKevin (subscriber, #129325) [Link] (4 responses)

> I didn't declare datatypes, because everything is a string. In the basic system (I did say Pick gives you plenty of rope to hang yourself :-) you reply on the application to enforce data types, but pretty much every system nowadays enables you to enforce it at the database level.

Stringly typed is a recipe for disaster.

> Note that I haven't got ANY indices - I don't need any. Foreign keys and hashed files, all data accesses apart from the first are O(1). The only "index" I need (or use) is the list of keys to INVOICES.

O(1) is only fast if the constant factor is low. That's why SQL keeps track of whether the table is big or small. If it has very few rows, SQL will do a nested loop, with a nice linear cacheable access pattern. If the data is huge, SQL can and will do a O(1) hash join.

I'm sure you *can* do that in Pick, but in SQL, you don't have to think about it. It's automatic.

> So I think I have just seriously kicked your butt on speed !!!

Really? You built a benchmark and got meaningful numbers from both systems *that* quickly? Or is this purely hypothetical "speed?"

> And as for developer time being more valuable than cpu time, I think having just two tables, PRODUCT and INVOICE, is a lot simpler to understand than needing a third LINE_ITEM table, along with your two surplus views.

The whole point of the views is that you can ignore the underlying tables if you don't need them.

Python sets, frozensets, and literals

Posted Jan 23, 2022 11:30 UTC (Sun) by Wol (subscriber, #4433) [Link] (3 responses)

> > I didn't declare datatypes, because everything is a string. In the basic system (I did say Pick gives you plenty of rope to hang yourself :-) you reply on the application to enforce data types, but pretty much every system nowadays enables you to enforce it at the database level.

> Stringly typed is a recipe for disaster.

I don't TOTALLY agree. Yes I really miss strong typing some of the time. Equally, I don't miss trying to keep track of FLOAT, INT16, INT64, BIGNUM, etc. FFS just give me a number ! (And be able to multiple by ten just by string-concatenating a zero is a nifty feature sometimes - hey did I talk about rope :-)

> > Note that I haven't got ANY indices - I don't need any. Foreign keys and hashed files, all data accesses apart from the first are O(1). The only "index" I need (or use) is the list of keys to INVOICES.

> O(1) is only fast if the constant factor is low. That's why SQL keeps track of whether the table is big or small. If it has very few rows, SQL will do a nested loop, with a nice linear cacheable access pattern. If the data is huge, SQL can and will do a O(1) hash join.

If the constant factor is low ... a SINGLE disk seek on a cold data access is low, I think? That's pretty much the accepted time ...

> I'm sure you *can* do that in Pick, but in SQL, you don't have to think about it. It's automatic.

I don't think that approach would make sense in Pick, or rather, it melds the two. Because it's a hash database, it'll do a straight O(1) access to the bucket, then a lookup inside it. Depends how many rows there are in the bucket.

> > So I think I have just seriously kicked your butt on speed !!!

> Really? You built a benchmark and got meaningful numbers from both systems *that* quickly? Or is this purely hypothetical "speed?"

Sorry, yes, it's a gedanken experiment. But, on a cold database, I can count the minimum necessary disk accesses. And experience says that figure is DAMN close. I'm also pretty certain information theory would say it's impossible to improve on ... Relational theory says you're not allowed to, but you can work out the minimum. And if relational relies heavily on hot hash indexes, which I think it does, me assuming that the Pick database is stone cold says a lot ...

Real world experience bears this out - find pretty much any Pick -> Relational migration story, and it will be full of stories about how the Pick system ran rings round its replacement. My favourite is the consultants who proudly announced - after six months of hard work - that some important query now ran faster on the new system than the old. Unfortunately for them, the guy responsible for the old system overheard and said "What? You're PROUD that your Twin Xeon 800 is only just faster than an old Pentium 90!?"

> > And as for developer time being more valuable than cpu time, I think having just two tables, PRODUCT and INVOICE, is a lot simpler to understand than needing a third LINE_ITEM table, along with your two surplus views.

> The whole point of the views is that you can ignore the underlying tables if you don't need them.

Agreed. But if you do need them they bring a whole new level of complexity ... An invoice is a real world object. My brain can grasp the concept of an invoice. If I need to do ANYTHING with an invoice, it's ALL in INVOICES table (and therein lies another of Pick's problems - define "what is an object"). If I suddenly need a table called "address", how many different address tables are there scattered about the relational schema? Probably quite a few? In Pick, if it's an invoice address the invoice table will tell me. Again, and this is personal experience, working with Pick I DON'T NEED this big schema on the wall telling me what tables go where. If I understand the real world problem, the Pick schema is a pretty close approximation, and if I need to drill down, ALL THE INFORMATION IS IN THE TABLES. The whole point of Pick is that you should NEVER EVER have to search for data that's related to your query. Okay, "find me all customers that ..." is a search, but "get me details of all their invoices in the last 6 months" is not. That second half of the query should be an order one drill down, not a join.

(And I'll just throw in my experience as a newbie SQL programmer, oh my god is it complex and error prone. What Pick does in one line, with all the logic in the table dictionary, SQL takes multiple, multiple line, selects - I could write a hundred line query easily for what Pick will do in one simple command line!)

Cheers,
Wol

Python sets, frozensets, and literals

Posted Jan 25, 2022 3:09 UTC (Tue) by NYKevin (subscriber, #129325) [Link] (2 responses)

I can write code like this, long after my database is already deployed and populated, and a migration would be expensive:

WITH recent_invoices AS (
SELECT *
FROM invoices
WHERE age(when_issued) < '30 days' -- your syntax may vary here.
)
SELECT p.id, p.name, p.inventory_count - COALESCE(SUM(itm.quantity), 0) AS forecasted_surplus
FROM products p
LEFT JOIN (
invoice_items itm JOIN recent_invoices inv ON itm.invoice_id = inv.id
) ON itm.product_id = p.id
GROUP BY p.id, p.name, p.inventory_count;

I probably also need to run a CREATE INDEX on the invoices(when_issued) column, but that's not too expensive to run once (it's basically a full table scan, with writes). This should (on most reasonable systems) create a B-tree index which can keep track of the order of invoices (i.e. it doesn't just stick them into hash buckets). Then I can get monthly forecasts of my inventory based on last month's demand.

Note that we cannot assume that recently-inserted invoices are necessarily issued more recently than older invoices, because new invoices may need to be backdated, so we can't simply loop over the full table backwards and stop when we start hitting old invoices, because the table is not necessarily in the right order. We have to use an index, or do a full table scan going back through (potentially) years of ancient invoices. Also note that adding this query does not require significant changes to our existing schema - we run one very simple CREATE INDEX command, do some load testing just in case, and that's it, we're ready to roll this out to prod.

Finally, while this may look complicated, I banged it out in about an hour, and frankly it's not that hard to understand for people who use SQL regularly. Quick translation:

* WITH name AS SELECT ... turns another query into a temporary table ("common table expression"). Here all it does is pull the "within the past month" logic out so that the FROM part of the main query does not get too hard to read, but you can also use it if you're going to do the same sub-select more than once.
* The join in parentheses needs to happen first, so that we throw away any invoice_items which don't match one of the existing invoices. We could do this as a second CTE, but that's uglier IMHO.
* LEFT JOIN is an abbreviation of LEFT OUTER JOIN, and has the effect of manufacturing an all-NULL invoice and invoice_item for any products which were not purchased in the past month.
* COALESCE() turns the NULL back into a zero.

With your solution... well, I'm not sure exactly how Pick works, but I tend to imagine you will at least need to introduce a products table so that you know how many are in inventory, and then do an outer join against the invoices table.

Python sets, frozensets, and literals

Posted Jan 25, 2022 3:24 UTC (Tue) by NYKevin (subscriber, #129325) [Link]

> you will at least need to introduce a products table

Actually, your solution already has that information.

Python sets, frozensets, and literals

Posted Jan 25, 2022 8:57 UTC (Tue) by Wol (subscriber, #4433) [Link]

Okay, so have I got this right? You're creating a report on everything sold in the last month?

Like you, I need to create an index on DATE, so evens there. I'd rather put a date in the query, but adding another column to the invoice is easy

ED DICT INVOICES
... create AGE as TODAY - INVOICE.DATE

I think Pick would struggle with me creating an index on a dynamically created field ... :-)

Then it's something as simple as

LIST INVOICES WITH INVOICE.DATE > 25/12/2022 ITEM.NAME ITEM.QUANTITY GRAND.TOTAL ITEM.QUANTITY BY ITEM.NAME.

Okay I'd need to make sure I got the syntax right, but it cost me more time to understand what you were doing than to come up with my first cut at the Pick equivalent. And now I see you're also comparing against current inventory ...

LIST INVOICES WITH INVOICE.DATE > 25/12/2022 ITEM.CODE ITEM.NAME ITEM.QUANTITY GRAND.TOTAL ITEM.QUANTITY BY ITEM.NAME EVAL(TRANSLATE PRODUCTS ITEM.CODE STOCK) AS STOCK EVAL (STOCK - ITEM.QUANTITY) AS SURPLUS

There's probably a bit more to it than that, I've never had to write that sort of query before, but that's about it. Nested Pick queries are more likely to be written as procedures than command lines. (Note the translate instead of a join...)

Cheers,
Wol