Python sets, frozensets, and literals

Posted Jan 19, 2022 20:14 UTC (Wed) by NYKevin (subscriber, #129325)
In reply to: Python sets, frozensets, and literals by ballombe
Parent article: Python sets, frozensets, and literals

You are misinterpreting O(n) and O(1). In this context, n is the size (number of elements) of the list or set. It is assumed that all elements are the same size (e.g. fixed-width integers or pointers to opaque objects), or that the size of elements is de facto bounded above (e.g. "in practice, none of my strings will ever be more than 1 gigabyte in length"). Under that assumption, O(1) (or Θ(1)) is the correct way to describe a hash table lookup.

Python sets, frozensets, and literals

Posted Jan 19, 2022 21:23 UTC (Wed) by Wol (subscriber, #4433) [Link]

Yup. O(n) means the time required changes in line with n. O(1) means the time required changes in line with one. Or to put it another way, it's constant (give or take).

Cheers,
Wol

Python sets, frozensets, and literals

Posted Jan 20, 2022 1:16 UTC (Thu) by excors (subscriber, #95769) [Link] (12 responses)

> You are misinterpreting O(n) and O(1). In this context, n is the size (number of elements) of the list or set. It is assumed that all elements are the same size (e.g. fixed-width integers or pointers to opaque objects), or that the size of elements is de facto bounded above

On the other hand, formal definitions of O(f(n)) are about the behaviour of a function as n tends to infinity. If your elements have fixed or bounded size then the maximum n supported by the algorithm is finite, so the behaviour as n tends to infinity is undefined, and the big-O notation is mathematically meaningless.

You have to pretend that your algorithm is running on a computer where certain values (at least the ones you're counting with n) can have infinite range, but can still be processed in a constant amount of time. Usually that's a good enough approximation for practical purposes, because we're interested in the behaviour when n is large enough that the actual performance is very close to the theoretical asymptotic bound but n is still well below 2^32 or 2^64 so it can be processed in constant time. I think that's kind of a fudge though, and it can lead to confusion and/or misleading results if you're not careful about what things you're falsely treating as infinite.

In this case it sounds like ballombe is considering keys to be sequences of bytes (or equivalent), and counting the cost of processing each byte. I wouldn't say that's wrong - it's a better approximation of how a computer would implement an unbounded set. It's just not the traditional way of analysing set/map-like data structures, where we pretend keys have infinite range but constant processing cost. The traditional way seems reasonable for a hash-based implementation, because the keys used by the algorithm will be a constant size (the size of the hash) but the table can still be large enough that we get close to the asymptotic performance in practice; but you'll obviously need a different analysis if your implementation doesn't use hashes and does lots of strcmp(key1, key2) instead, because then the lengths of keys may be significant. So the context and assumptions are important (but, from what I've seen, frequently ignored) when choosing how to apply or interpret the big-O notation.

Python sets, frozensets, and literals

Posted Jan 20, 2022 1:33 UTC (Thu) by NYKevin (subscriber, #129325) [Link] (8 responses)

> If your elements have fixed or bounded size then the maximum n supported by the algorithm is finite, so the behaviour as n tends to infinity is undefined, and the big-O notation is mathematically meaningless.

n has nothing to do with the sizes of individual elements. n is the number of elements. If you want a factor for element size, you simply *must* use a different variable to represent it (in the same way that graph-based algorithms must be separately parameterized in terms of E and V - those are not the same variable and cannot be magically recombined into a single "size" value).

> You have to pretend that your algorithm is running on a computer where certain values (at least the ones you're counting with n) can have infinite range, but can still be processed in a constant amount of time.

Why? Plenty of people use int32 or float64 keys all the time, and those *are* processed in a constant amount of time because they are fixed-width!

Sure, if you happen to be working with strings you might want to make this assumption, but it is not generally required. You could just as easily assume that all strings are forcibly interned and pre-hashed at creation time, and then hashing really is O(1) because it's just a pointer deref to get the precomputed hash value.

Regardless, nobody in this entire thread ever explicitly stipulated that the keys were strings.

> In this case it sounds like ballombe is considering keys to be sequences of bytes (or equivalent), and counting the cost of processing each byte.

No, they said O(log(n)). They are considering n to represent the value (not size) of an arbitrary-precision integer (which presumably has to be positive), and taking the logarithm to get its bit width. This is technically applicable to Python, if your keys happen to be ints, but as mentioned above, it would be possible to pre-hash ints at creation time since they are immutable (Python doesn't currently do this as far as I can tell, but it would be a straightforward modification of the existing code).

The problem with this argument is that a list (or set) is not an arbitrary-precision integer, it is a list or set. It does not *have* a single integral value for you to take the logarithm of in the first place, because it may contain many integers, or none, or may even contain keys which are not integers at all. So talking about log(n) in that context is meaningless unless you specify how you are computing n in the first place.

Python sets, frozensets, and literals

Posted Jan 20, 2022 1:47 UTC (Thu) by Wol (subscriber, #4433) [Link]

> Sure, if you happen to be working with strings you might want to make this assumption, but it is not generally required. You could just as easily assume that all strings are forcibly interned and pre-hashed at creation time, and then hashing really is O(1) because it's just a pointer deref to get the precomputed hash value.

And what is a string? JUST A BIG INTEGER!

The only datatype Pick has is string, but that's just what the user sees. Internally, I'm pretty certain it just treats the key as a large number, runs something like md5 over it, and then applies its hash function.

And then, given the hash, the time taken to access the target is constant. And much larger than computing the hash, in terms of total time that's just noise!

Cheers,
Wol

Python sets, frozensets, and literals

Posted Jan 20, 2022 4:29 UTC (Thu) by foom (subscriber, #14868) [Link] (3 responses)

You cannot have a map of "n" elements, unless the size of the keys are at least log(n), since keys must be unique.

As such, if you say your keys are 32-bit integers, the number of elements, "n", in your map cannot possibly exceed 2**32. Thus, performance "as n goes to infinity" is meaningless, unless you make the approximation that your 32-bit integer keys can actually represent an infinite number of distinct values.

Python sets, frozensets, and literals

Posted Jan 20, 2022 12:34 UTC (Thu) by anselm (subscriber, #2796) [Link] (2 responses)

Thus, performance "as n goes to infinity" is meaningless

2³² may not technically be “infinity”, but if you're looking at O(1) vs. O(n), in the real world n=2³² is pretty likely to make itself noticeable, performance-wise, somehow.

Python sets, frozensets, and literals

Posted Jan 20, 2022 21:20 UTC (Thu) by ballombe (subscriber, #9523) [Link] (1 responses)

Not if the implied constant in the first O is 2^32 and 1 in the second O...
Just do not use the O notation when you do not mean it!

Python sets, frozensets, and literals

Posted Jan 20, 2022 22:50 UTC (Thu) by anselm (subscriber, #2796) [Link]

That's what the “in real life” bit was about. In real life, even with big-but-certainly-not-infinite n, O(n log n) heap sort usually performs better than O(n²) bubble sort.

Python sets, frozensets, and literals

Posted Jan 20, 2022 12:12 UTC (Thu) by excors (subscriber, #95769) [Link] (2 responses)

> n has nothing to do with the sizes of individual elements. n is the number of elements.

But if your elements are e.g. 32-bit integers then you can't have more than 2^32 distinct elements, so the behaviour of your algorithm as n tends to infinity is undefined. For the big-O notation (which is based on n tending to infinity) to be valid, you have to pretend your elements are integers with infinite range (but can still be processed in constant time) so there's no upper bound on n. That's usually a reasonable thing to pretend if you want the theory to closely match a practical implementation, but not always, so O(1) and O(log(n)) are both valid answers for set membership depending on what assumptions you make.

Python sets, frozensets, and literals

Posted Jan 21, 2022 15:53 UTC (Fri) by jwarnica (subscriber, #27492) [Link] (1 responses)

If we go from base principles that computers are finite state automaton then your argument about 2^32 applies to everything in reality.

O(n) might be mathematically defined as "tending to infinity", but unless the analysis determines that weird things happen at n>2^32+1 then its the same thing. Off hand can't think of any of the handful of the standard selection of simple lines that are used where things get weird past some useful point off to the right.

As others have mentioned, one must not consider n-->infinity considerations, but n->expectations. As expectations will always be less than infinity. And memory size of any computer you will ever run something on will also be less than infinity.

Python sets, frozensets, and literals

Posted Jan 21, 2022 17:29 UTC (Fri) by nybble41 (subscriber, #55106) [Link]

> And memory size of any computer you will ever run something on will also be less than infinity.

The general approximation for treating finite computers as equivalent to Turing machines is that while any given computer may have finite memory you can always add more if an algorithm requires it for some particular input. Even if that might require multiple levels of indirection due to e.g. the address space limits of the CPU. Of course, for physical computers at some point you run into the physical limits of computation (information density limited by thermodynamics; Bremermann's limit on computational speed), but this is with respect to an idealized model of computation where e.g. your computer architecture (ALU and currently installed memory) may be finite but there is no predetermined restriction on how long you're willing to wait for the result.

Also, it doesn't really matter that your memory is limited when analyzing *time* complexity, provided the algorithm has constant *space* complexity and can stream the input and output.

Python sets, frozensets, and literals

Posted Jan 20, 2022 17:14 UTC (Thu) by Wol (subscriber, #4433) [Link] (2 responses)

> On the other hand, formal definitions of O(f(n)) are about the behaviour of a function as n tends to infinity. If your elements have fixed or bounded size then the maximum n supported by the algorithm is finite, so the behaviour as n tends to infinity is undefined, and the big-O notation is mathematically meaningless.

Are they :-)

Why can't you have a formal definition of "O(f(n)) where 0<= n < 2^32" ?

Ahhh - except that you would say that the conditional clause breaks the definition of "O" and renders everything meaningless...

Cheers,
Wol

Python sets, frozensets, and literals

Posted Jan 20, 2022 17:49 UTC (Thu) by excors (subscriber, #95769) [Link] (1 responses)

> Why can't you have a formal definition of "O(f(n)) where 0<= n < 2^32" ?

The formal definition is usually something like "g(n) = O(f(n)) iff there exists c,k such that |g(n)| <= c f(n) for all n >= k", i.e. f(n) is an upper bound for g(n) for large n (ignoring constant factors). If you change that definition to add "n < 2^32", then everything becomes trivially O(1) because you can just set c = g(2^32) or similar to get a constant upper bound for any function.

I don't know how you'd define it in a way that doesn't become trivial and useless, but doesn't become so complicated that it stops being a helpful approximation and is no better than just counting the exact number of computational steps.

Python sets, frozensets, and literals

Posted Jan 21, 2022 3:38 UTC (Fri) by NYKevin (subscriber, #129325) [Link]

> I don't know how you'd define it in a way that doesn't become trivial and useless, but doesn't become so complicated that it stops being a helpful approximation and is no better than just counting the exact number of computational steps.

If you're willing to turn all your limits inside-out, you can probably do this in the 2-adic integers. But then you have to deal with a variety of oddities, such as the fact that the sequence of successive powers of two converges to zero under the 2-adic metric (which is certainly not the case in the regular integers!) - informally, you can think of this as indicating that very large powers of two can only be distinguished from zero using a correspondingly large (wide) integer type, and in the limit, it becomes impossible to distinguish them altogether, and so the 2-adic metric says the limit of the sequence is zero.