Python sets, frozensets, and literals
A Python "frozenset" is simply a set object that is immutable—the objects it contains are determined at initialization time and cannot be changed thereafter. Like sets, frozensets are built into the language, but unlike most of the other standard Python types, there is no way to create a literal frozenset object. Changing that, by providing a mechanism to do so, was the topic of a recent discussion on the python-ideas mailing list.
Dictionaries, lists, tuples, and sets, which are called collections in Python, can all be created "on the fly" in the language using a variety of brackets:
>>> a_dict = { 'a' : 42 } >>> a_set = { 'a', 42 } >>> a_list = [ 'a', 42 ] >>> a_tuple = ( 'a', 42 ) >>> print(a_dict, a_set, a_list, a_tuple) {'a': 42} {42, 'a'} ['a', 42] ('a', 42)The tuple is the only immutable type in there, as the rest can be changed by various means. In Python terms, that means the tuple is the only hashable object; it can be used in places where a hashable is required, which includes dictionary keys and set members. Both of those mechanisms require a stable, unchanging value, which in turn requires an immutable object.
As with mathematical sets, Python sets only contain a single element of a given value; adding the same value multiple times does not change the set. Continuing on from the example above:
>>> a_set.add('b') >>> a_set.add('a') >>> a_set {'b', 42, 'a'}Meanwhile, a frozenset can only be created using the frozenset() constructor:
>>> an_fset = frozenset(a_set) >>> an_fset frozenset({'b', 42, 'a'}) >>> an_fset.add('c') Traceback (most recent call last): File "<stdin>", line 1, in <module> AttributeError: 'frozenset' object has no attribute 'add'As can be seen, no new elements can be added to the immutable frozenset; some of the operations that are defined for sets are not available for frozensets. One implication is that, since set members must be hashable, sets containing sets must actually contain frozensets:
>>> new_set = { a_set } Traceback (most recent call last): File "<stdin>", line 1, in <module> TypeError: unhashable type: 'set' >>> new_set = { an_fset } >>> new_set {frozenset({'b', 42, 'a'})}And, as expected, creating a set of two identical frozensets only contains one element:
>>> fset2 = frozenset(a_set) >>> new_set = { fset2, an_fset } >>> new_set {frozenset({'b', 42, 'a'})}
Literal frozensets?
It is against that backdrop that Steven D'Aprano posted a query about adding a way to create frozensets on the fly in some fashion, like with the other built-in collection types. His jumping off point was an enhancement request that noted the most recent Python compiler already takes a shortcut when building literal sets of constant values; it creates them as frozensets, rather than as mutable sets. D'Aprano used the dis module to disassemble and display the bytecode for a simple Python statement:
CPython already has some neat optimizations in place to use frozensets instead of sets, for example:

    >>> import dis
    >>> dis.dis("x in {1, 2, 3}")
      1           0 LOAD_NAME                0 (x)
                  2 LOAD_CONST               0 (frozenset({1, 2, 3}))
                  4 CONTAINS_OP              0
                  6 RETURN_VALUE

and the compiler can build frozensets of literals as a constant.
The set literal "{1, 2, 3}" is built as a constant frozenset
(since it cannot be changed) by the compiler and the bytecode loads it
directly. But, as he demonstrated
with another example, when the
programmer wants a frozenset of constant values themselves, the compiler
creates a frozenset constant, turns it into a set, and calls
frozenset() at run time to create the frozenset it already had:
"So to get the frozenset we want, we start with the frozenset we
want,
and make an unnecessary copy the long way o_O
".
It should be noted that his second example, and the one shown in the enhancement request, only occur in the under-development Python 3.11 version of the language. The example shown above, though, works in Python 3.8 (and likely earlier versions), so the compiler clearly has the ability needed to create frozenset constants. But all sets obviously cannot just be switched to frozensets. Beyond that, since the frozenset() call is made at run time, there is the possibility that the function has been shadowed or monkey-patched to do something subtly (or not-so-subtly) different. D'Aprano had a suggestion:
It seems to me that all of the machinery to make this work already exists. The compiler already knows how to create frozensets at compile-time, avoiding the need to lookup and call the frozenset() builtin. All we need is syntax for a frozenset display.

How does this work for you?

    f{1, 2, 3}
As might be guessed, the "spelling" of the syntax was not pleasing to some. Chris Angelico objected that it would look too much like other constructs that have a different interpretation:
While it's tempting, it does create an awkward distinction.

    f(1, 2, 3)  # look up f, call it with parameters
    f[1, 2, 3]  # look up f, subscript it with parameters
    f{1, 2, 3}  # construct a frozenset

And that means it's going to be a bug magnet.
He suggested using angle brackets instead (e.g. <1, 2, 3>), if the parser could be made to handle it. D'Aprano thought that the switch to the PEG parser might enable using angle brackets, but he was not in favor of doing so:
Reading this makes my eyes bleed:

    >>> <1, 2, 3> < <1, 2, 3, 4>
    True
D'Aprano said that Python's f-strings provide another example of how "f" can be used in a potentially confusing way; beyond that, "r" can be used to prefix raw strings, but can also be used as a function or array name. "I don't think that f{} will be any more of a bug magnet than f"" and r"" already are." Inevitably, other suggestions were made, including Rob Cliffe's for "fs{1, 2, 3}" to avoid the possible ambiguity of simply using "f" and Greg Ewing's joke suggestion of using the Unicode snowflake ("❄{1, 2, 3}").
Matthew Barnett (MRAB) suggested double curly brackets, which would open up another possibility:

How about doubling-up the braces:

    {{1, 2, 3}}

and for frozen dicts:

    {{1: 'one', 2: 'two', 3: 'three'}}

if needed?
At least currently, there is no built-in frozendict; it was rejected when proposed back in 2012. But either of those two expressions currently raises an exception, because sets and dictionaries are not hashable, which means that syntax could potentially be used. As Barnett pointed out, though, nested frozensets might lead to curly-brace overload.
Oscar Benjamin wondered whether frozenset comprehensions (analogous to list comprehensions) should be supported. A set comprehension like the following:

    >>> {x**2 for x in range(10)}
    {0, 1, 64, 4, 36, 9, 16, 49, 81, 25}

could be turned into the frozenset equivalent; should that work, Benjamin asked? In the abstract, at least, D'Aprano thought it should, since it "would be an obvious extension of the syntax". There could be technical hurdles that "might make it less attractive", however.
Advantages
There were questions about the advantages of adding some kind of literal frozenset syntax. In his initial message, D'Aprano said that the idea had come up before, most recently in 2018, where it was suggested for consistency's sake. Inada Naoki recognized the inconsistency of not having a way to specify a literal frozenset, but was only lukewarm on adding syntax for it unless improvements to existing code could be demonstrated. Christopher Barker also thought that consistency was behind the current effort, but D'Aprano said that there were other reasons for it:
CPython already has all the machinery needed to create constant frozensets of constants at compile time. It already implicitly optimizes some expressions involving tuples and sets into frozensets, but such optimizations are fragile and refactoring your code can remove them. Ironically, that same optimization makes the explicit creation of a frozenset needlessly inefficient.
He noted some uses of frozenset in the standard library that might benefit from the new syntax, but he did not think the change would be earth-shattering by any means:
The benefit is not huge. This is not list comprehensions or decorator syntax, which revolutionized the way we write Python, it is an incremental improvement. If the compiler didn't already have the machinery in place for building compile-time constant frozensets, this might not be worth the effort.
D'Aprano also described some of the problems that can arise with the current state of affairs. Creating a frozenset is dependent on the frozenset() function, unlike, say, creating a list: "[1, 2, 3] is guaranteed to return a genuine list, even if the name 'list' is deleted, shadowed or replaced". In addition, there are current optimizations that are done, but that are somewhat fragile:
If you are writing if x in ("this", "that", "another", "more") then you probably should be using a frozenset literal, since membership testing in sets is faster than linear search of a tuple.

I think that the CPython peephole optimizer actually replaces that tuple with a frozenset, which is cool, but you can defeat that optimization and go back to slow linear search by refactoring the code and giving the targets a name:

    targets = ("this", "that", "another", "more")
    if x in targets:
        ...
In the end, this "feature" would not be a big change, either in CPython, itself, or for the Python ecosystem, but it would remove a small wart that might be worth addressing. Consistency and avoiding needless work when creating a literal frozenset both seem like good reasons to consider making the change. Whether a Python Enhancement Proposal (PEP) emerges remains to be seen. If it does, no major opposition arises, and the inevitable bikeshed-o-rama over its spelling ever converges, it just might appear in an upcoming Python—perhaps even Python 3.11 in October.
Index entries for this article:
Python: Enhancements
Python: Sets
Posted Jan 19, 2022 0:33 UTC (Wed)
by Cyberax (✭ supporter ✭, #52523)
[Link] (38 responses)
And rename the language to Pyerl while you're at it.
Posted Jan 19, 2022 2:03 UTC (Wed)
by NYKevin (subscriber, #129325)
[Link] (37 responses)
Posted Jan 19, 2022 8:44 UTC (Wed)
by Wol (subscriber, #4433)
[Link] (36 responses)
Trying to put a list into a set requires converting metadata to data - A STUPID IDEA.
If your primary layout method is a list, you can get a set by defining it, at the storage level, as a list that is sorted and duplicate-free.
The CRITICAL point is that to get from a list to a set, you THROW AWAY information. To get from a set to a list, you need EXTRA INFORMATION. Anything, therefore, that only provides sets as a fundamental storage mechanism is pushing extra, unnecessary, work onto the layers above.
That's why Relational is such a pig to work with. Most data comes as lists, and the data management layer can't handle it!
Cheers,
Posted Jan 19, 2022 9:54 UTC (Wed)
by jmaa (guest, #128856)
[Link] (18 responses)
Posted Jan 19, 2022 11:33 UTC (Wed)
by Wol (subscriber, #4433)
[Link] (17 responses)
The complaint is that all too often, lists are not available when they are the most appropriate data form. Quite often you're appending to a list, which is a (relatively) quick operation. Often sets are small, so the cost of a sorted list is low, and equally as importantly the COGNITIVE cost to the programmer is non-existent! The database can be told it's a set and manage it! You can't tell an RDBMS "this is a list, manage it".
Working with SQL, managing lists is a complex multi-line query. In Pick (which is sets of lists) it's a one-line statement that looks like English ... Now I'm using SQL extensively, I really miss that simplicity.
Cheers,
Posted Jan 19, 2022 17:22 UTC (Wed)
by ballombe (subscriber, #9523)
[Link] (16 responses)

Just reading the key is already O(log(n)). Information theory shows that set membership cannot be done in less than Omega(log(n)).
Posted Jan 19, 2022 19:53 UTC (Wed)
by Wol (subscriber, #4433)
[Link]
I know managing the key has a cost but, given a known key, the cost for a Pick database (and Python and Berkeley/Sleepycat use the same technique) to access the data truly is order one. And the cost to add a new key is order one, too!
Which is why Pick kicks seven bells out of Relational for speed, because it doesn't hide its internals, and it's simple maths to prove you just cannot retrieve/store data any faster. From the FAQ, "SQL optimises the easy task of finding data in memory. Pick optimises the hard task of getting it there in the first place".
(If you don't know, there are two techniques that are pretty much the same, Linear, and Dynamic hashing. Pick I know uses Dynamic, I'm not sure which Python and Berkeley use.)
Cheers,
Posted Jan 19, 2022 20:14 UTC (Wed)
by NYKevin (subscriber, #129325)
[Link] (14 responses)
Posted Jan 19, 2022 21:23 UTC (Wed)
by Wol (subscriber, #4433)
[Link]
Cheers,
Posted Jan 20, 2022 1:16 UTC (Thu)
by excors (subscriber, #95769)
[Link] (12 responses)
On the other hand, formal definitions of O(f(n)) are about the behaviour of a function as n tends to infinity. If your elements have fixed or bounded size then the maximum n supported by the algorithm is finite, so the behaviour as n tends to infinity is undefined, and the big-O notation is mathematically meaningless.
You have to pretend that your algorithm is running on a computer where certain values (at least the ones you're counting with n) can have infinite range, but can still be processed in a constant amount of time. Usually that's a good enough approximation for practical purposes, because we're interested in the behaviour when n is large enough that the actual performance is very close to the theoretical asymptotic bound but n is still well below 2^32 or 2^64 so it can be processed in constant time. I think that's kind of a fudge though, and it can lead to confusion and/or misleading results if you're not careful about what things you're falsely treating as infinite.
In this case it sounds like ballombe is considering keys to be sequences of bytes (or equivalent), and counting the cost of processing each byte. I wouldn't say that's wrong - it's a better approximation of how a computer would implement an unbounded set. It's just not the traditional way of analysing set/map-like data structures, where we pretend keys have infinite range but constant processing cost. The traditional way seems reasonable for a hash-based implementation, because the keys used by the algorithm will be a constant size (the size of the hash) but the table can still be large enough that we get close to the asymptotic performance in practice; but you'll obviously need a different analysis if your implementation doesn't use hashes and does lots of strcmp(key1, key2) instead, because then the lengths of keys may be significant. So the context and assumptions are important (but, from what I've seen, frequently ignored) when choosing how to apply or interpret the big-O notation.
Posted Jan 20, 2022 1:33 UTC (Thu)
by NYKevin (subscriber, #129325)
[Link] (8 responses)
n has nothing to do with the sizes of individual elements. n is the number of elements. If you want a factor for element size, you simply *must* use a different variable to represent it (in the same way that graph-based algorithms must be separately parameterized in terms of E and V - those are not the same variable and cannot be magically recombined into a single "size" value).
> You have to pretend that your algorithm is running on a computer where certain values (at least the ones you're counting with n) can have infinite range, but can still be processed in a constant amount of time.
Why? Plenty of people use int32 or float64 keys all the time, and those *are* processed in a constant amount of time because they are fixed-width!
Sure, if you happen to be working with strings you might want to make this assumption, but it is not generally required. You could just as easily assume that all strings are forcibly interned and pre-hashed at creation time, and then hashing really is O(1) because it's just a pointer deref to get the precomputed hash value.
Regardless, nobody in this entire thread ever explicitly stipulated that the keys were strings.
> In this case it sounds like ballombe is considering keys to be sequences of bytes (or equivalent), and counting the cost of processing each byte.
No, they said O(log(n)). They are considering n to represent the value (not size) of an arbitrary-precision integer (which presumably has to be positive), and taking the logarithm to get its bit width. This is technically applicable to Python, if your keys happen to be ints, but as mentioned above, it would be possible to pre-hash ints at creation time since they are immutable (Python doesn't currently do this as far as I can tell, but it would be a straightforward modification of the existing code).
The problem with this argument is that a list (or set) is not an arbitrary-precision integer, it is a list or set. It does not *have* a single integral value for you to take the logarithm of in the first place, because it may contain many integers, or none, or may even contain keys which are not integers at all. So talking about log(n) in that context is meaningless unless you specify how you are computing n in the first place.
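A rough sketch of that pre-hashing idea (hypothetical code, just to show the shape of the approach, not how CPython implements it): an immutable wrapper can compute its hash once at construction, so later hashing is a constant-time attribute read regardless of the size of the wrapped data:

    class PrehashedKey:
        """Immutable wrapper that computes its hash once, up front."""
        __slots__ = ('value', '_hash')

        def __init__(self, value):
            object.__setattr__(self, 'value', value)
            object.__setattr__(self, '_hash', hash(value))  # pay the O(size) cost once

        def __setattr__(self, name, val):
            raise AttributeError('PrehashedKey is immutable')

        def __hash__(self):
            return self._hash          # O(1) from here on

        def __eq__(self, other):
            return isinstance(other, PrehashedKey) and self.value == other.value

    key = PrehashedKey('x' * 1_000_000)
    table = {key: 'found'}
    print(table[PrehashedKey('x' * 1_000_000)])   # 'found'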
Posted Jan 20, 2022 1:47 UTC (Thu)
by Wol (subscriber, #4433)
[Link]
And what is a string? JUST A BIG INTEGER!
The only datatype Pick has is string, but that's just what the user sees. Internally, I'm pretty certain it just treats the key as a large number, runs something like md5 over it, and then applies its hash function.
And then, given the hash, the time taken to access the target is constant. And much larger than computing the hash, in terms of total time that's just noise!
Cheers,
Posted Jan 20, 2022 4:29 UTC (Thu)
by foom (subscriber, #14868)
[Link] (3 responses)
As such, if you say your keys are 32-bit integers, the number of elements, "n", in your map cannot possibly exceed 2**32. Thus, performance "as n goes to infinity" is meaningless, unless you make the approximation that your 32-bit integer keys can actually represent an infinite number of distinct values.
Posted Jan 20, 2022 12:34 UTC (Thu)
by anselm (subscriber, #2796)
[Link] (2 responses)
> Thus, performance "as n goes to infinity" is meaningless

2³² may not technically be “infinity”, but if you're looking at O(1) vs. O(n), in the real world n=2³² is pretty likely to make itself noticeable, performance-wise, somehow.
Posted Jan 20, 2022 21:20 UTC (Thu)
by ballombe (subscriber, #9523)
[Link] (1 responses)

Just do not use the O notation when you do not mean it!
Posted Jan 20, 2022 22:50 UTC (Thu)
by anselm (subscriber, #2796)
[Link]
That's what the “in real life” bit was about. In real life, even with big-but-certainly-not-infinite n, O(n log n) heap sort usually performs better than O(n²) bubble sort.
Posted Jan 20, 2022 12:12 UTC (Thu)
by excors (subscriber, #95769)
[Link] (2 responses)
But if your elements are e.g. 32-bit integers then you can't have more than 2^32 distinct elements, so the behaviour of your algorithm as n tends to infinity is undefined. For the big-O notation (which is based on n tending to infinity) to be valid, you have to pretend your elements are integers with infinite range (but can still be processed in constant time) so there's no upper bound on n. That's usually a reasonable thing to pretend if you want the theory to closely match a practical implementation, but not always, so O(1) and O(log(n)) are both valid answers for set membership depending on what assumptions you make.
Posted Jan 21, 2022 15:53 UTC (Fri)
by jwarnica (subscriber, #27492)
[Link] (1 responses)
O(n) might be mathematically defined as "tending to infinity", but unless the analysis determines that weird things happen at n>2^32+1 then it's the same thing. Off hand I can't think of any of the handful of the standard selection of simple lines that are used where things get weird past some useful point off to the right.
As others have mentioned, one must not consider n-->infinity considerations, but n->expectations. As expectations will always be less than infinity. And memory size of any computer you will ever run something on will also be less than infinity.
Posted Jan 21, 2022 17:29 UTC (Fri)
by nybble41 (subscriber, #55106)
[Link]
The general approximation for treating finite computers as equivalent to Turing machines is that while any given computer may have finite memory you can always add more if an algorithm requires it for some particular input. Even if that might require multiple levels of indirection due to e.g. the address space limits of the CPU. Of course, for physical computers at some point you run into the physical limits of computation (information density limited by thermodynamics; Bremermann's limit on computational speed), but this is with respect to an idealized model of computation where e.g. your computer architecture (ALU and currently installed memory) may be finite but there is no predetermined restriction on how long you're willing to wait for the result.
Also, it doesn't really matter that your memory is limited when analyzing *time* complexity, provided the algorithm has constant *space* complexity and can stream the input and output.
Posted Jan 20, 2022 17:14 UTC (Thu)
by Wol (subscriber, #4433)
[Link] (2 responses)
Are they :-)
Why can't you have a formal definition of "O(f(n)) where 0<= n < 2^32" ?
Ahhh - except that you would say that the conditional clause breaks the definition of "O" and renders everything meaningless...
Cheers,
Posted Jan 20, 2022 17:49 UTC (Thu)
by excors (subscriber, #95769)
[Link] (1 responses)
The formal definition is usually something like "g(n) = O(f(n)) iff there exists c,k such that |g(n)| <= c f(n) for all n >= k", i.e. f(n) is an upper bound for g(n) for large n (ignoring constant factors). If you change that definition to add "n < 2^32", then everything becomes trivially O(1) because you can just set c = g(2^32) or similar to get a constant upper bound for any function.
I don't know how you'd define it in a way that doesn't become trivial and useless, but doesn't become so complicated that it stops being a helpful approximation and is no better than just counting the exact number of computational steps.
Posted Jan 21, 2022 3:38 UTC (Fri)
by NYKevin (subscriber, #129325)
[Link]
If you're willing to turn all your limits inside-out, you can probably do this in the 2-adic integers. But then you have to deal with a variety of oddities, such as the fact that the sequence of successive powers of two converges to zero under the 2-adic metric (which is certainly not the case in the regular integers!) - informally, you can think of this as indicating that very large powers of two can only be distinguished from zero using a correspondingly large (wide) integer type, and in the limit, it becomes impossible to distinguish them altogether, and so the 2-adic metric says the limit of the sequence is zero.
Posted Jan 19, 2022 10:34 UTC (Wed)
by anselm (subscriber, #2796)
[Link]
It depends on what you want to do. Sometimes an ordered sequence with insertions and deletions is what you want (which is what a list will give you), sometimes O(1) membership checks are what you want (which is what a set will give you, if properly implemented). Certainly in Python there's no clear advantage of lists over sets to a point where you would want to abolish one and keep the other (or vice versa); they both have their place.
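A quick way to see the difference in practice (an illustrative sketch using only the standard library): membership tests against a list scale with its length, while set lookups stay roughly constant:

    import timeit

    data_list = list(range(100_000))
    data_set = set(data_list)

    # -1 is not present, so the list test scans every element; the set test is one hash lookup.
    print(timeit.timeit('-1 in data_list', globals=globals(), number=100))
    print(timeit.timeit('-1 in data_set', globals=globals(), number=100))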
Posted Jan 19, 2022 12:28 UTC (Wed)
by k3ninho (subscriber, #50375)
[Link]
I'm trying to make an incomplete and buggy implementation of Common LISP here! I'm not afraid of lists, I'm just doing it ...well not entirely wrong but it sure looks incomplete and buggy.

K3n.
Posted Jan 20, 2022 14:22 UTC (Thu)
by jschrod (subscriber, #1646)
[Link] (14 responses)
> Why is Comp Sci so afraid of lists?

Is it? That's news to me, and I majored in CS.
Lists and sets are two different abstractions, to be used when one needs them. Both are equally important.
> That's why Relational is such a pig to work with. Most data comes as lists,
Really?
In most systems that I have designed or where I was involved in implementation, data comes as sets.
So we seem to have two different experiences. I don't think one can make a general statement from such empirical datapoints.
> and the data management layer can't handle it!
ORDER BY and indices are your friend.
Posted Jan 20, 2022 16:19 UTC (Thu)
by Wol (subscriber, #4433)
[Link] (13 responses)
ORDER BY what exactly? What data?
Let's take my favourite example of line items in an invoice. Okay, the line items may be a set of line items, but the order on the invoice is both RANDOM and PRESERVED.
And said order is not DATA, but METADATA. So in Pick I just store a list of line item foreign keys in the invoice record. (At which point I no longer need a select on an index to retrieve said line items, so that speeds my database up, too :-)
In SQL I have to create some field, fill it with random - MEANINGLESS - values who's only value is in its sort order, and if the list is mutable I need to add all sorts of extra logic to manage that field.
In Pick, it's just a list and I insert a new value where I need it. At no point am I forced to create imaginary fields with contents that have no intrinsic stand-alone meaning. (Okay, if I want to create meaningless uuids I can, but I don't have to!). Mixing values that have real intrinsic meaning, and others that have no instrinsic meaning whatsoever, is a recipe for un-necessary complexity. As Einstein said, "make things as simple as possible BUT NO SIMPLER". Storing a list in database that can only store sets is a MASSIVE breach of that principle!
Cheers,
Posted Jan 22, 2022 7:05 UTC (Sat)
by NYKevin (subscriber, #129325)
[Link] (12 responses)
But what does that actually accomplish? I would much rather have invoice line items sorted (by name, by serial number, or by some other reasonable thing such as total or individual price) than in "natural" order. It's the same reason ls with no arguments automatically sorts a directory's contents - unsorted natural order is less useful to most people most of the time, so you have to explicitly request it.
> In SQL I have to create some field, fill it with random - MEANINGLESS - values who's only value is in its sort order, and if the list is mutable I need to add all sorts of extra logic to manage that field.
No, you can just make an array, if you really want an array. All modern SQL databases have support for non-normalized columns like that. See for example https://www.postgresql.org/docs/14/arrays.html
This is probably going to scale more poorly than the foreign key setup which you describe, so it's not necessarily a Good Idea... but nobody ever said life was going to be fair. You can use the database in the manner it was designed, or you can use the database in the manner which best fits your exact problem space, but it's sometimes difficult to do both at once.
Posted Jan 22, 2022 18:35 UTC (Sat)
by Wol (subscriber, #4433)
[Link] (7 responses)
> But what does that actually accomplish? I would much rather have invoice line items sorted (by name, by serial number, or by some other reasonable thing such as total or individual price) than in "natural" order. It's the same reason ls with no arguments automatically sorts a directory's contents - unsorted natural order is less useful to most people most of the time, so you have to explicitly request it.
It keeps the auditors happy? :-)
And as for sorting, how is the computer supposed to know what the correct order is? Do you sort upper and lower case together or separately? I still haven't worked out why ls sorts differently on my rebuilt gentoo system from how the original it replaced did. Does McAdam sort before or after Macadamia? (If name sorting is on, McAdam sorts before Macadamia in alphabetic order ... :-)
> > In SQL I have to create some field, fill it with random - MEANINGLESS - values who's only value is in its sort order, and if the list is mutable I need to add all sorts of extra logic to manage that field.
> No, you can just make an array, if you really want an array. All modern SQL databases have support for non-normalized columns like that. See for example https://www.postgresql.org/docs/14/arrays.html
Yup. I know modern SQL supports arrays, but I have yet to find any in production (yes my experience of production SQL databases is near nil). The problem is, arrays are second class citizens.
Oh, by the way, isn't the very definition of your SQL table a RANDOM but PRESERVED LIST of column names? :-)
> This is probably going to scale more poorly than the foreign key setup which you describe, so it's not necessarily a Good Idea... but nobody ever said life was going to be fair. You can use the database in the manner it was designed, or you can use the database in the manner which best fits your exact problem space, but it's sometimes difficult to do both at once.
I'd say the trouble with relational is that it's ALWAYS difficult :-)
The trouble with Pick is it gives you too much rope ...
To take my example of an invoice, is a line item an attribute of the invoice, or an instance of a general ledger entry? Because if it's an attribute of the invoice you grab a bunch of columns and turn it into a 2-dimensional array. Or if it's an instance in its own right, you store a list of foreign keys in the invoice. Pick theory says "Your Choice!". SQL very much influences the designer to split them off into their own table (probably the general ledger table).
That said, whichever way you jump with Pick, like I quoted the FAQ, Pick can probably find the data you're looking for on disk, retrieve, and normalise it, faster than SQL can find it ... :-)
And like I said, Pick can throw away the order and give you a relational set much easier than SQL can recreate a list that isn't stored that way in the relational database.

I'm hoping there'll soon be an announcement of an industrial-grade GPL2 Pick database ... YAYYYY! Still needs a bit more work on it, mostly trying to make it easier for people whose minds have been ruined by Relational, to grok. :-)
Problem is, they'll probably say our minds have been ruined by DataBASIC :-)
Cheers,
Posted Jan 22, 2022 19:42 UTC (Sat)
by NYKevin (subscriber, #129325)
[Link] (6 responses)
As someone who actually does work with production databases from time to time: You see non-normalized stuff. It's not *typical*, but it does exist and is used by real systems for serious purposes.
> Oh, by the way, isn't the very definition of your SQL table a RANDOM but PRESERVED LIST of column names? :-)
This is SQL, not Lisp. Besides, column order is irrelevant unless you do something like SELECT * (don't do that!).
> To take my example of an invoice, is a line item an attribute of the invoice, or an instance of a general ledger entry?
I would do it like this:
CREATE TABLE products(
    id integer PRIMARY KEY,
    name varchar(1024) NOT NULL,
    unit_price numeric(11, 2) NOT NULL CHECK (unit_price > 0), -- If your products cost more than $1B, increase precision as needed, or use money type instead of numeric.
    inventory_count integer NOT NULL CHECK (inventory_count >= 0),
    -- put more per-product columns here...
);

CREATE TABLE invoices(
    id integer PRIMARY KEY,
    when_issued timestamp NOT NULL,
    -- possibly other relevant timestamps here.
);

CREATE TABLE invoice_items(
    -- or ledger_entries, if you're doing a more general accounting system.
    -- But a general accounting system should be based on credits and debits, so you probably get a different table structure from that.
    id integer PRIMARY KEY,
    invoice_id integer NOT NULL REFERENCES invoices(id),
    product_id integer NOT NULL REFERENCES products(id),
    quantity integer NOT NULL CHECK (quantity > 0)
);

CREATE VIEW invoice_subtotals AS
    SELECT itm.invoice_id AS invoice_id, itm.product_id AS product_id, itm.quantity * prd.unit_price AS subtotal
    FROM invoice_items itm
    JOIN products prd ON itm.product_id = prd.id;

CREATE VIEW invoice_grand_totals AS
    SELECT SUM(subtotal)
    FROM invoice_subtotals
    GROUP BY invoice_id;
A modern SQL engine will automatically infer that it needs to create indexes for all of the PRIMARY KEY columns, and do so without my asking, so those lookups are all O(log n). Similarly, it will keep track of how big the join tables get, and automatically choose the most efficient type of join to do (nested loop vs. indexed join vs. hash join) for each particular query by estimating the number of rows which that query will hit. From the perspective of "developer time is more expensive than CPU time," this is a Very Good Thing (at least until it breaks and you have to manually hint it, anyway).
Posted Jan 23, 2022 0:05 UTC (Sun)
by Wol (subscriber, #4433)
[Link] (5 responses)
ED DICT PRODUCTS
...@ID primary key
...NAME
...UNIT_PRICE
...INVENTORY_COUNT
...

ED DICT INVOICES
...@ID primary key
...DATE
...ADDRESS_TYPE as linked array ADDRESS
...ADDRESS_LINE_1 as linked array ADDRESS
...ADDRESS_LINE_2 as linked array ADDRESS
...ADDRESS_TOWN as linked array ADDRESS
...PRODUCT_ID as linked array LINE_ITEM
...PRODUCT_QUANTITY as linked array LINE_ITEM
...PRODUCT_NAME as TRANSLATE( PRODUCT_ID, PRODUCTS, NAME) as linked array LINE_ITEM
...PRODUCT_PRICE as TRANSLATE( PRODUCT_ID, PRODUCTS, UNIT_PRICE) as linked array LINE_ITEM
...LINE_PRICE as PRODUCT_QUANTITY * PRODUCT_PRICE as linked array LINE_ITEM
...
I didn't declare datatypes, because everything is a string. In the basic system (I did say Pick gives you plenty of rope to hang yourself :-) you rely on the application to enforce data types, but pretty much every system nowadays enables you to enforce it at the database level.
So I now have two tables to your three (oh, and I've thrown in an address table too, for delivery/invoice/any other address you care for).
You've got two views, I've got none. Although I'll grant you I need a LIST (what SQL calls a SELECT) to get the equivalent of your view.
LIST INVOICES @ID TOTAL LINE_PRICE GRAND_TOTAL LINE_PRICE
Gives me a list of all invoices, with the invoice total, and sums all invoices to give me the grand total.
Note that I haven't got ANY indices - I don't need any. Foreign keys and hashed files, all data accesses apart from the first are O(1). The only "index" I need (or use) is the list of keys to INVOICES.
I don't have any joins - note I have a TRANSLATE function in my INVOICES dictionary which takes as arguments the table, foreign key, and column name. And again with foreign keys and hashed files, this is an O(1) data retrieval. Okay, I'm making two access to the same foreign record, but I can rely (pretty safely) on disk caching saving me, or any modern implementation can just cache the foreign record directly.
Oh - and my SQL SELECT implementation that runs on top of Pick will see ADDRESS and LINE_ITEM as two tables it can access as normal.
So. I don't even need to create inferred indices in order to get O(1) !!! on my lookups. I don't have any join tables to worry about. So I think I have just seriously kicked your butt on speed !!! And as for developer time being more valuable than cpu time, I think having just two tables, PRODUCT and INVOICE, is a lot simpler to understand than needing a third LINE_ITEM table, along with your two surplus views.
Sorry, game, set and match I think!
Cheers,
Posted Jan 23, 2022 10:19 UTC (Sun)
by NYKevin (subscriber, #129325)
[Link] (4 responses)
Stringly typed is a recipe for disaster.
> Note that I haven't got ANY indices - I don't need any. Foreign keys and hashed files, all data accesses apart from the first are O(1). The only "index" I need (or use) is the list of keys to INVOICES.
O(1) is only fast if the constant factor is low. That's why SQL keeps track of whether the table is big or small. If it has very few rows, SQL will do a nested loop, with a nice linear cacheable access pattern. If the data is huge, SQL can and will do a O(1) hash join.
I'm sure you *can* do that in Pick, but in SQL, you don't have to think about it. It's automatic.
> So I think I have just seriously kicked your butt on speed !!!
Really? You built a benchmark and got meaningful numbers from both systems *that* quickly? Or is this purely hypothetical "speed?"
> And as for developer time being more valuable than cpu time, I think having just two tables, PRODUCT and INVOICE, is a lot simpler to understand than needing a third LINE_ITEM table, along with your two surplus views.
The whole point of the views is that you can ignore the underlying tables if you don't need them.
Posted Jan 23, 2022 11:30 UTC (Sun)
by Wol (subscriber, #4433)
[Link] (3 responses)
> Stringly typed is a recipe for disaster.
I don't TOTALLY agree. Yes I really miss strong typing some of the time. Equally, I don't miss trying to keep track of FLOAT, INT16, INT64, BIGNUM, etc. FFS just give me a number ! (And be able to multiple by ten just by string-concatenating a zero is a nifty feature sometimes - hey did I talk about rope :-)
> > Note that I haven't got ANY indices - I don't need any. Foreign keys and hashed files, all data accesses apart from the first are O(1). The only "index" I need (or use) is the list of keys to INVOICES.
> O(1) is only fast if the constant factor is low. That's why SQL keeps track of whether the table is big or small. If it has very few rows, SQL will do a nested loop, with a nice linear cacheable access pattern. If the data is huge, SQL can and will do a O(1) hash join.
If the constant factor is low ... a SINGLE disk seek on a cold data access is low, I think? That's pretty much the accepted time ...
> I'm sure you *can* do that in Pick, but in SQL, you don't have to think about it. It's automatic.
I don't think that approach would make sense in Pick, or rather, it melds the two. Because it's a hash database, it'll do a straight O(1) access to the bucket, then a lookup inside it. Depends how many rows there are in the bucket.
> > So I think I have just seriously kicked your butt on speed !!!
> Really? You built a benchmark and got meaningful numbers from both systems *that* quickly? Or is this purely hypothetical "speed?"
Sorry, yes, it's a gedanken experiment. But, on a cold database, I can count the minimum necessary disk accesses. And experience says that figure is DAMN close. I'm also pretty certain information theory would say it's impossible to improve on ... Relational theory says you're not allowed to, but you can work out the minimum. And if relational relies heavily on hot hash indexes, which I think it does, me assuming that the Pick database is stone cold says a lot ...
Real world experience bears this out - find pretty much any Pick -> Relational migration story, and it will be full of stories about how the Pick system ran rings round its replacement. My favourite is the consultants who proudly announced - after six months of hard work - that some important query now ran faster on the new system than the old. Unfortunately for them, the guy responsible for the old system overheard and said "What? You're PROUD that your Twin Xeon 800 is only just faster than an old Pentium 90!?"
> > And as for developer time being more valuable than cpu time, I think having just two tables, PRODUCT and INVOICE, is a lot simpler to understand than needing a third LINE_ITEM table, along with your two surplus views.
> The whole point of the views is that you can ignore the underlying tables if you don't need them.
Agreed. But if you do need them they bring a whole new level of complexity ... An invoice is a real world object. My brain can grasp the concept of an invoice. If I need to do ANYTHING with an invoice, it's ALL in INVOICES table (and therein lies another of Pick's problems - define "what is an object"). If I suddenly need a table called "address", how many different address tables are there scattered about the relational schema? Probably quite a few? In Pick, if it's an invoice address the invoice table will tell me. Again, and this is personal experience, working with Pick I DON'T NEED this big schema on the wall telling me what tables go where. If I understand the real world problem, the Pick schema is a pretty close approximation, and if I need to drill down, ALL THE INFORMATION IS IN THE TABLES. The whole point of Pick is that you should NEVER EVER have to search for data that's related to your query. Okay, "find me all customers that ..." is a search, but "get me details of all their invoices in the last 6 months" is not. That second half of the query should be an order one drill down, not a join.
(And I'll just throw in my experience as a newbie SQL programmer, oh my god is it complex and error prone. What Pick does in one line, with all the logic in the table dictionary, SQL takes multiple, multiple line, selects - I could write a hundred line query easily for what Pick will do in one simple command line!)
Cheers,
Posted Jan 25, 2022 3:09 UTC (Tue)
by NYKevin (subscriber, #129325)
[Link] (2 responses)
WITH recent_invoices AS (
    SELECT *
    FROM invoices
    WHERE age(when_issued) < '30 days' -- your syntax may vary here.
)
SELECT p.id, p.name, p.inventory_count - COALESCE(SUM(itm.quantity), 0) AS forecasted_surplus
FROM products p
LEFT JOIN (
    invoice_items itm JOIN recent_invoices inv ON itm.invoice_id = inv.id
) ON itm.product_id = p.id
GROUP BY p.id, p.name, p.inventory_count;
I probably also need to run a CREATE INDEX on the invoices(when_issued) column, but that's not too expensive to run once (it's basically a full table scan, with writes). This should (on most reasonable systems) create a B-tree index which can keep track of the order of invoices (i.e. it doesn't just stick them into hash buckets). Then I can get monthly forecasts of my inventory based on last month's demand.
Note that we cannot assume that recently-inserted invoices are necessarily issued more recently than older invoices, because new invoices may need to be backdated, so we can't simply loop over the full table backwards and stop when we start hitting old invoices, because the table is not necessarily in the right order. We have to use an index, or do a full table scan going back through (potentially) years of ancient invoices. Also note that adding this query does not require significant changes to our existing schema - we run one very simple CREATE INDEX command, do some load testing just in case, and that's it, we're ready to roll this out to prod.
Finally, while this may look complicated, I banged it out in about an hour, and frankly it's not that hard to understand for people who use SQL regularly. Quick translation:
* WITH name AS SELECT ... turns another query into a temporary table ("common table expression"). Here all it does is pull the "within the past month" logic out so that the FROM part of the main query does not get too hard to read, but you can also use it if you're going to do the same sub-select more than once.
* The join in parentheses needs to happen first, so that we throw away any invoice_items which don't match one of the existing invoices. We could do this as a second CTE, but that's uglier IMHO.
* LEFT JOIN is an abbreviation of LEFT OUTER JOIN, and has the effect of manufacturing an all-NULL invoice and invoice_item for any products which were not purchased in the past month.
* COALESCE() turns the NULL back into a zero.
With your solution... well, I'm not sure exactly how Pick works, but I tend to imagine you will at least need to introduce a products table so that you know how many are in inventory, and then do an outer join against the invoices table.
Posted Jan 25, 2022 3:24 UTC (Tue)
by NYKevin (subscriber, #129325)
[Link]
Actually, your solution already has that information.
Posted Jan 25, 2022 8:57 UTC (Tue)
by Wol (subscriber, #4433)
[Link]
Like you, I need to create an index on DATE, so evens there. I'd rather put a date in the query, but adding another column to the invoice is easy
ED DICT INVOICES
... create AGE as TODAY - INVOICE.DATE
I think Pick would struggle with me creating an index on a dynamically created field ... :-)
Then it's something as simple as
LIST INVOICES WITH INVOICE.DATE > 25/12/2022 ITEM.NAME ITEM.QUANTITY GRAND.TOTAL ITEM.QUANTITY BY ITEM.NAME.
Okay I'd need to make sure I got the syntax right, but it cost me more time to understand what you were doing than to come up with my first cut at the Pick equivalent. And now I see you're also comparing against current inventory ...
LIST INVOICES WITH INVOICE.DATE > 25/12/2022 ITEM.CODE ITEM.NAME ITEM.QUANTITY GRAND.TOTAL ITEM.QUANTITY BY ITEM.NAME EVAL(TRANSLATE PRODUCTS ITEM.CODE STOCK) AS STOCK EVAL (STOCK - ITEM.QUANTITY) AS SURPLUS
There's probably a bit more to it than that, I've never had to write that sort of query before, but that's about it. Nested Pick queries are more likely to be written as procedures than command lines. (Note the translate instead of a join...)
Cheers,
Posted Jan 24, 2022 9:42 UTC (Mon)
by james (subscriber, #1325)
[Link] (3 responses)
Posted Jan 24, 2022 19:58 UTC (Mon)
by Wol (subscriber, #4433)
[Link] (2 responses)
But no, data doesn't come in rows and columns. It's an unstructured mess that comes from the user. And any structure that you derive from it is METAdata.
Pick strongly separates such metadata from the data - in fact I would go as far as to say mixing the two together in the same data structure is a *major* design blunder. Yet relational analysis absolutely demands that you mingle the two so the database can't tell the difference. WTF??? Imho if the database can't tell the difference between external data and internal deduction something is well off.
The poor relational business analyst is given the nightmare task of bashing square pegs into round holes.
(One nice thing about Pick is that it does, however, use the exact same structure to store metadata and data. Each level of the tree is metadata for the level below it. I'm pretty sure relational actually does the same under the covers, but it's much more hidden and obscure.)
Think xml and dtd, where the data is "unstructured" and the definition is separate. Pick beat xml to it by about 30 years...
Cheers,
Posted Feb 6, 2022 23:26 UTC (Sun)
by flussence (guest, #85566)
[Link] (1 responses)
I... don't think that's the job of the average RDBMS to begin with, ostensibly at least? It stores data in a shape computers would prefer, and it has assertions to stop out-of-range values being inserted so readers can assume it's a source of truth without reverifying everything (unless you're unfortunate enough to have mysql), but everything else - the heavy lifting of getting user input, pulling it apart into that shape, reassembling it for the user afterwards - is *supposed to* be done in a real programming environment outside the DB.
I'll admit though that's often not how it goes in the real world, and I've seen far too many “clever” things done in relational stuff just because it was possible. A lot of people learned carpentry with the PHP hammer, and SQL servers nowadays are full of power tools without guidance on appropriate uses for them. I guess I agree with you about the bashing square pegs into round holes.
Posted Feb 7, 2022 0:25 UTC (Mon)
by Wol (subscriber, #4433)
[Link]
There are plenty of other databases out there, object, tree, multi-value, etc etc.
Relational won because Oracle invested in marketing. Pick lost because Dick Pick invested in litigation.
But anything relational can do I can do with Pick. Faster, less hardware, less cpu power, ... AND I HAVE THE MATHS TO PROVE IT.
Einstein said "make things as simple as possible BUT NO SIMPLER". I would argue, very forcefully, that Relational Databases are TOO simple, with the result that the application layer on top has to be more complex, and the net result is that the system AS A WHOLE is much more complicated than it need be.
I'm quite happy to agree with you that it's not the job of a RELATIONAL database to manage the metadata. The problem is, it requires the Business Analyst to MUDDLE UP DATA AND METADATA, which then makes it hell for the programmer.
Because Pick DOES manage data and metadata SEPARATELY, thats why we can hold a schema in our head that would blow the mind of a relational guy. Because we're not actually holding the entire schema, we only need to hold the bit that's relevant to our problem (the relational guy doesn't know what bit IS relevant!). If I'm looking at INVOICE, then everything I need is in the DICTionary (at least it should be, if the previous programmer did his job properly!)
Cheers,
Posted Jan 19, 2022 6:24 UTC (Wed)
by dambacher (subscriber, #1710)
[Link] (1 responses)
So what use do I have from frozensets -
Are they faster on some accessing functions, e.g. "in"?
Do they use less memory?
Even the given example of indirect usage via an "a in {b,c}" expression was never on my list.
Posted Jan 19, 2022 7:04 UTC (Wed)
by LtWorf (subscriber, #124958)
[Link]
Sets automatically remove duplicates and have operations such as intersection and subtraction. In general when you use the "in" operator and don't care about ordering, you should always be using a set.
Frozensets are not mutable so, can be used as default parameter to functions, can be used inside other (frozen)set.
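A few of those properties in code (a small illustrative sketch):

    >>> a = {1, 2, 2, 3}                # duplicates collapse
    >>> a
    {1, 2, 3}
    >>> a & {2, 3, 4}                   # intersection
    {2, 3}
    >>> a - {3}                         # subtraction (difference)
    {1, 2}
    >>> groups = { frozenset({1, 2}), frozenset({2, 3}) }   # frozensets can nest in sets
    >>> frozenset({2, 1}) in groups
    True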
Posted Jan 19, 2022 6:54 UTC (Wed)
by sub2LWN (subscriber, #134200)
[Link] (1 responses)
Sets could be frozen by default unless they are surrounded by 🌞 ("Sun with Face") to thaw the set. If you want all of the sets within a given scope to be thawed, just bracket it with 🌞🌞🌞, similar to the "triple double quote" string syntax. This could be called a heatwave, among other things.
Posted Jan 19, 2022 12:36 UTC (Wed)
by k3ninho (subscriber, #50375)
[Link]
...or extend current practice with a one-way freezing method (though an unfreezing pattern would emerge and adoption into the standard library of something akin to deepcopy-into-new should also spit out a warning to stderr that you're doing a performance anti-pattern).

Current practice is what it is from history. I think that declarative-paradigm languages and immutability go together to form pragmatic functional programming, maybe over time the performance benefits of having everything hashable and sorted would put pressure towards an approach that's mutable-in-the-exception.
K3n.
Posted Jan 19, 2022 18:35 UTC (Wed)
by JoeBuck (subscriber, #2330)
[Link] (4 responses)
Posted Jan 20, 2022 5:48 UTC (Thu)
by wtanksleyjr (subscriber, #74601)
[Link] (3 responses)
Posted Jan 20, 2022 11:13 UTC (Thu)
by quietbritishjim (subscriber, #114117)
[Link] (2 responses)
By itself that isn't a problem. If you create a set, put it somewhere it needs to be frozen (key of a dict, inside another set, etc.), and try to modify it then you would get an error that it's frozen. Well, that's OK, the developer gets an error that they can't modify it because it's being used in a context where it needs to be frozen.
The real problem is that you can have multiple references to the same object. So I put my set into a variable X, then also put it into a context where it needs to be frozen. If you copy the set into a frozenset then changes to X won't be reflected in the new location - very confusing. If the set object itself magically morphed into a frozenset (which I think is what the parent comment meant), then modifications via X now fail. This is particularly a problem if this is "someone else's" set. So a set gets passed as a parameter to a function, then that function accidentally turns it into a frozenset by using it in some other data structure. The caller might get an error in a substantially later part of the code, and it will be hard to find the function call that froze the set.
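A small sketch of that aliasing problem; today the copy has to be made explicitly with frozenset(), but the confusion would be the same if it happened implicitly:

    >>> x = {1, 2}
    >>> frozen_copy = frozenset(x)   # what an implicit copy-on-freeze would amount to
    >>> x.add(3)
    >>> x
    {1, 2, 3}
    >>> frozen_copy                  # the earlier copy does not see the change
    frozenset({1, 2})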
Posted Jan 20, 2022 18:48 UTC (Thu)
by JoeBuck (subscriber, #2330)
[Link] (1 responses)
But it probably isn't a good idea, because it could be unexpectedly expensive: people expect hashes to be fast, and if we're always doing copy-and-insert, copy-and-hash it would be costly.
Posted Jan 20, 2022 23:08 UTC (Thu)
by quietbritishjim (subscriber, #114117)
[Link]
> If you copy the set into a frozenset then changes to X won't be reflected in the new location - very confusing.
I think the fact that changes to the original set won't be reflected in the frozenset made from it earlier is even more serious than the cost of the copy.
Thus, performance "as n goes to infinity" is meaningless