Python sets, frozensets, and literals
Python sets, frozensets, and literals
Posted Jan 20, 2022 1:33 UTC (Thu) by NYKevin (subscriber, #129325)
In reply to: Python sets, frozensets, and literals by excors
Parent article: Python sets, frozensets, and literals
n has nothing to do with the sizes of individual elements. n is the number of elements. If you want a factor for element size, you simply *must* use a different variable to represent it (in the same way that graph-based algorithms must be separately parameterized in terms of E and V - those are not the same variable and cannot be magically recombined into a single "size" value).
> You have to pretend that your algorithm is running on a computer where certain values (at least the ones you're counting with n) can have infinite range, but can still be processed in a constant amount of time.
Why? Plenty of people use int32 or float64 keys all the time, and those *are* processed in a constant amount of time because they are fixed-width!
Sure, if you happen to be working with strings you might want to make this assumption, but it is not generally required. You could just as easily assume that all strings are forcibly interned and pre-hashed at creation time, and then hashing really is O(1) because it's just a pointer deref to get the precomputed hash value.
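To illustrate, here is a minimal sketch of that "pre-hash at creation" idea (the wrapper class and names are hypothetical, not anything Python provides): pay the hashing cost once at construction, and every later hash() call becomes a constant-time attribute read.

    class PreHashedKey:
        """Hypothetical wrapper: hash the underlying key exactly once,
        at construction, so hash() is an O(1) field read afterwards."""
        __slots__ = ("key", "_hash")

        def __init__(self, key):
            self.key = key
            self._hash = hash(key)  # pay the O(len(key)) cost here, once

        def __hash__(self):
            return self._hash

        def __eq__(self, other):
            return isinstance(other, PreHashedKey) and self.key == other.key

    s = {PreHashedKey("some long string"), PreHashedKey("another one")}
    print(PreHashedKey("some long string") in s)  # True

(Equality checks on a hash collision still have to compare the underlying keys, of course. And as it happens, CPython already caches a str's hash after the first hash() call, so repeated lookups with the same string object effectively get this for free.)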
Regardless, nobody in this entire thread ever explicitly stipulated that the keys were strings.
> In this case it sounds like ballombe is considering keys to be sequences of bytes (or equivalent), and counting the cost of processing each byte.
No, they said O(log(n)). They are considering n to represent the value (not size) of an arbitrary-precision integer (which presumably has to be positive), and taking the logarithm to get its bit width. This is technically applicable to Python, if your keys happen to be ints, but as mentioned above, it would be possible to pre-hash ints at creation time since they are immutable (Python doesn't currently do this as far as I can tell, but it would be a straightforward modification of the existing code).
The problem with this argument is that a list (or set) is not an arbitrary-precision integer, it is a list or set. It does not *have* a single integral value for you to take the logarithm of in the first place, because it may contain many integers, or none, or may even contain keys which are not integers at all. So talking about log(n) in that context is meaningless unless you specify how you are computing n in the first place.
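As a quick (and unscientific) check that int hashing in current CPython really does scale with bit width rather than being O(1):

    import timeit

    # CPython hashes a big int by reducing its internal digits modulo
    # 2**61 - 1, so hashing time grows with the integer's bit width.
    for bits in (1_000, 100_000, 10_000_000):
        n = 1 << bits
        t = timeit.timeit(lambda: hash(n), number=1000)
        print(f"{bits:>10} bits: {t:.4f}s per 1000 hashes")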
Python sets, frozensets, and literals
Posted Jan 20, 2022 1:47 UTC (Thu)
by Wol (subscriber, #4433)
[Link]
And what is a string? JUST A BIG INTEGER!
The only datatype Pick has is string, but that's just what the user sees. Internally, I'm pretty certain it just treats the key as a large number, runs something like md5 over it, and then applies its hash function.
And then, given the hash, the time taken to access the target is constant. And much larger than computing the hash, in terms of total time that's just noise!
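Something like this toy sketch, in Python terms (the details of Pick's internals are my guess above, so treat this as the shape of the idea, not a description of Pick):

    import hashlib

    def bucket_for(key: str, table_size: int) -> int:
        # Treat the key's bytes as one big number, digest it, then
        # reduce the digest to a bucket index in the hash table.
        digest = hashlib.md5(key.encode("utf-8")).digest()
        return int.from_bytes(digest, "big") % table_size

    print(bucket_for("CUSTOMER*1234", 1009))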
Cheers,
Wol
Python sets, frozensets, and literals
Posted Jan 20, 2022 4:29 UTC (Thu)
by foom (subscriber, #14868)
[Link] (3 responses)
As such, if you say your keys are 32-bit integers, the number of elements, "n", in your map cannot possibly exceed 2**32. Thus, performance "as n goes to infinity" is meaningless, unless you make the approximation that your 32-bit integer keys can actually represent an infinite number of distinct values.
Python sets, frozensets, and literals
Posted Jan 20, 2022 12:34 UTC (Thu)
by anselm (subscriber, #2796)
[Link] (2 responses)
> Thus, performance "as n goes to infinity" is meaningless

2³² may not technically be “infinity”, but if you're looking at O(1) vs. O(n), in the real world n=2³² is pretty likely to make itself noticeable, performance-wise, somehow.
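Even well below n=2³² it's easy to see in Python, where set membership is (amortized) O(1) and list membership is an O(n) scan; a rough sketch:

    import timeit

    n = 1_000_000
    as_list = list(range(n))
    as_set = set(as_list)
    missing = -1  # worst case for the list: scans all n elements

    print("list:", timeit.timeit(lambda: missing in as_list, number=10))
    print("set: ", timeit.timeit(lambda: missing in as_set, number=10))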
Python sets, frozensets, and literals
Posted Jan 20, 2022 21:20 UTC (Thu)
by ballombe (subscriber, #9523)
[Link] (1 responses)

Just do not use the O notation when you do not mean it!
Python sets, frozensets, and literals
Posted Jan 20, 2022 22:50 UTC (Thu)
by anselm (subscriber, #2796)
[Link]
That's what the “in real life” bit was about. In real life, even with big-but-certainly-not-infinite n, O(n log n) heap sort usually performs better than O(n²) bubble sort.
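A small self-contained demonstration, with n kept modest so the bubble sort finishes in reasonable time:

    import heapq
    import random
    import timeit

    def bubble_sort(a):
        a = a[:]  # classic O(n^2) bubble sort, on a copy
        for i in range(len(a)):
            for j in range(len(a) - 1 - i):
                if a[j] > a[j + 1]:
                    a[j], a[j + 1] = a[j + 1], a[j]
        return a

    def heap_sort(a):
        heap = a[:]  # O(n log n) heap sort built on heapq
        heapq.heapify(heap)
        return [heapq.heappop(heap) for _ in range(len(heap))]

    data = [random.random() for _ in range(2000)]
    print("bubble:", timeit.timeit(lambda: bubble_sort(data), number=1))
    print("heap:  ", timeit.timeit(lambda: heap_sort(data), number=1))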
Python sets, frozensets, and literals
Posted Jan 20, 2022 12:12 UTC (Thu)
by excors (subscriber, #95769)
[Link] (2 responses)
But if your elements are e.g. 32-bit integers then you can't have more than 2^32 distinct elements, so the behaviour of your algorithm as n tends to infinity is undefined. For the big-O notation (which is based on n tending to infinity) to be valid, you have to pretend your elements are integers with infinite range (but can still be processed in constant time) so there's no upper bound on n. That's usually a reasonable thing to pretend if you want the theory to closely match a practical implementation, but not always, so O(1) and O(log(n)) are both valid answers for set membership depending on what assumptions you make.
Python sets, frozensets, and literals
Posted Jan 21, 2022 15:53 UTC (Fri)
by jwarnica (subscriber, #27492)
[Link] (1 responses)
O(n) might be mathematically defined in terms of n tending to infinity, but unless the analysis determines that weird things happen at n > 2^32+1, it's the same thing. Offhand, I can't think of any of the handful of standard curves used in such analysis where things get weird past some useful point off to the right.
As others have mentioned, one should consider not n -> infinity but n -> expectations, as expectations will always be less than infinity. And the memory size of any computer you will ever run something on will also be less than infinity.
Python sets, frozensets, and literals
Posted Jan 21, 2022 17:29 UTC (Fri)
by nybble41 (subscriber, #55106)
[Link]
The general approximation for treating finite computers as equivalent to Turing machines is that, while any given computer may have finite memory, you can always add more if an algorithm requires it for some particular input, even if that might require multiple levels of indirection due to e.g. the address-space limits of the CPU. Of course, for physical computers at some point you run into the physical limits of computation (information density limited by thermodynamics; Bremermann's limit on computational speed), but this is with respect to an idealized model of computation where e.g. your computer architecture (ALU and currently installed memory) may be finite but there is no predetermined restriction on how long you're willing to wait for the result.
Also, it doesn't really matter that your memory is limited when analyzing *time* complexity, provided the algorithm has constant *space* complexity and can stream the input and output.
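For instance, a running mean over a stream: time is O(n) in the input length, but space stays O(1) no matter how long the stream is (a minimal sketch):

    import sys

    def running_mean(lines):
        count, mean = 0, 0.0
        for line in lines:
            count += 1
            mean += (float(line) - mean) / count  # constant-space update
            yield mean

    # Streams one number per line from stdin, emitting the running mean;
    # memory use is independent of how many lines arrive.
    for m in running_mean(sys.stdin):
        print(m)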