
One million ought to be enough for anybody

By Jake Edge
December 17, 2019

Programming languages generally have limits—explicit or implicit—on various aspects of their operation. Things like the maximum length of an identifier or the range of values that a variable can store are fairly obvious examples, but there are others, many of which are unspecified by the language designers and come about from various implementations of the language. That ambiguity has consequences, so nailing down a wide variety of limits in Python is the target of an ongoing discussion on the python-dev mailing list.

Mark Shannon posted a proposal "to impose a limit of one million on various aspects of Python programs, such as the lines of code per module". One million may seem like an arbitrary number (and it is), but part of his thinking is that it is also memorable, so that programmers don't need to consult a reference when wondering about a language-imposed limit. In addition, though, certain values stored by the Python virtual machine (e.g. line numbers) are 32-bit values, which obviously imposes its own limit—but one that may be wasting space for the vast majority of Python programs that never even get close. Beyond that, overflowing those 32-bit values could lead to security and other types of problems.

As Shannon pointed out, a range of -1,000,000 to 1,000,000 fits in 21 bits, so three of those values can be packed into a 64-bit word. "Memory access is usually a limiting factor in the performance of modern CPUs. Better packing of data structures enhances locality and reduces memory [bandwidth], at a modest increase in ALU usage (for shifting and masking)." He suggested that the data structures for stack-frame objects, code objects, and objects themselves could benefit from packing in that fashion. "There is also the potential for a more efficient instruction format, speeding up interpreter dispatch."
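
A rough Python sketch of the packing Shannon describes may help; the real work would happen in CPython's C internals, and the helper names here are purely illustrative. Values limited to ±1,000,000 need only 21 bits each once biased to be non-negative, so three of them can share one 64-bit word at the cost of a shift and a mask per access:

    MASK21 = (1 << 21) - 1   # 21-bit field mask
    BIAS = 1 << 20           # offset so values in [-1_000_000, 1_000_000] become non-negative

    def pack3(a, b, c):
        """Pack three values in [-1_000_000, 1_000_000] into one 64-bit integer."""
        for v in (a, b, c):
            assert -1_000_000 <= v <= 1_000_000
        return (a + BIAS) | (b + BIAS) << 21 | (c + BIAS) << 42

    def unpack3(word):
        """Recover the three packed values: shift, mask, remove the bias."""
        return ((word & MASK21) - BIAS,
                (word >> 21 & MASK21) - BIAS,
                (word >> 42 & MASK21) - BIAS)

    assert unpack3(pack3(-1_000_000, 0, 999_999)) == (-1_000_000, 0, 999_999)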

He proposed using the limit for seven different aspects of Python programs (several of them are illustrated in the sketch after the list):

  • The number of source code lines in a module.
  • The number of bytecode instructions in a code object.
  • The sum of local variables and stack usage for a code object.
  • The number of distinct names in a code object.
  • The number of constants in a code object.
  • The number of classes in a running interpreter.
  • The number of live coroutines in a running interpreter.
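
Four of these quantities can already be read off an ordinary CPython code object, which gives a feel for what the limits would apply to. A minimal sketch, using only standard code-object attributes and the dis module (for hand-written code the numbers are nowhere near one million):

    import dis

    def example(x, y):
        z = x + y
        return [z, "hello", 42]

    code = example.__code__
    print(len(list(dis.get_instructions(code))))       # bytecode instructions
    print(len(code.co_varnames) + code.co_stacksize)   # local variables plus stack usage
    print(len(code.co_names) + len(code.co_varnames))  # names referenced (globals/attributes plus locals)
    print(len(code.co_consts))                         # constants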

He also addressed the "Isn't this '640K ought to be enough for anybody' again?" question, which is something that immediately comes to mind when arbitrary limits are proposed. He noted that the Java virtual machine (JVM) limits many program elements to 65535 (which fits in 16 bits); that can be constricting, but mostly for program generators rather than hand-written code. The limit he is proposing for Python is a good deal larger than that and he believes it would not prove to be a real impediment to human-generated code. "While it is possible that generated code could exceed the limit, it is easy for a code generator to modify its output to conform."

Reactions

He provided short justifications for some of the limits, but many of those who commented in the thread were concerned with how much benefit they would actually provide. Chris Angelico asked whether Shannon had done any measurements to see how large an effect there would be. Steven D'Aprano agreed with the premise of getting memory and security benefits, but thought it was "a bit much to expect" Python developers to take the anticipated speed increase on faith. Steve Dower thought the overall idea was not unreasonable, though he had some concerns as well:

Picking an arbitrary limit less than 2**32 is certainly safer for many reasons, and very unlikely to impact real usage. We already have some real limits well below 10**6 (such as if/else depth and recursion limits).

That said, I don't really want to impact edge-case usage, and I'm all too familiar with other examples of arbitrary limits (no file system would need a path longer than 260 characters, right? :o) ).

Dower is referring to the Windows MAX_PATH value, which restricts path names in the Win32 API to a maximum of 260 characters. He thought several of the proposed limits did seem reasonable, though the "lines per module" limit "feels the most arbitrary". Comments and blank lines would certainly count against the limit, which gave him pause. The "classes in a running interpreter" limit is a bit worrisome, he said, but there may be ways to deal with programs that go beyond it while still getting any gains it brings: "The benefits seem worthwhile here even without the rest of the PEP."

But Rhodri James thought that limits like those proposed will eventually become a problem for some; he also objected to the idea of packing the counts into smaller bit widths due to the inefficiency of masking and shifting on each access. Gregory P. Smith was, in general, in favor of limits, but was concerned that code generation would run afoul of them; he noted that the JVM limits have been a big problem in the Android world. Others in the thread also pointed to the JVM limits as being problematic.

Guido van Rossum posted some thoughts on the idea. He wondered a bit about the problem being solved and expressed some skepticism that, for example, representing line numbers in 20 bits rather than 32 is really going to be much of an efficiency gain. He was concerned about "existing (accidental) limits that caused problems for generated code". But he also pointed out that the existing CPython parser is limited to 100 levels of nested parentheses (and likely nested indent levels as well) and there have been no complaints that he has heard.

A real-world example of a file with one million lines was noted by Oscar Benjamin. It is a test file for SymPy that crashed the Python 3.6 interpreter (though that was fixed in 3.7.1). The actual test is only around 3,000 (long) lines, but it gets rewritten by the pytest framework into a file with more than one million lines. Benjamin also pointed out that a limit of one million bytecode instructions is even more restrictive.

A distinction should be made between limits for the language itself and for those of the CPython implementation, Jim J. Jewett said. He noted that "there is great value in documenting the limits that CPython in particular currently chooses to enforce", but that making the limits too high for the language itself would, for example, potentially leave out implementations like MicroPython.

There may well be value in changing the limits supported by CPython (or at least CPython in default mode), or its bytecode format, but those should be phrased as clearly a CPython implementation PEP (or bytecode PEP) rather than a language change PEP.

PEP 611

Shannon took in the feedback and reflected it in PEP 611 ("The one million limit"). That led to another thread, where there were more calls for some kind of benchmarking for the changes in order to reasonably evaluate them. Barry Warsaw also said that "there is a lack of clarity as to whether you are proposing a Python-the-language limit or a CPython-the-implementation limit". If the limits are for the language "then the PEP needs to state that, and feedback from other implementation developers should be requested".

A few days later, Shannon asked that commenters be more precise about their concerns. There are costs associated with either choice, but simply stating a preference for a higher limit is not entirely helpful feedback:

Merely saying that you would like a larger limit is pointless. If there were no cost to arbitrarily large limits, then I wouldn't have proposed the PEP in the first place.

Bear in mind that the costs of higher limits are paid by everyone, but the benefits are gained by few.

Angelico again asked for numbers, but Shannon said that the performance increase is difficult to quantify: "Given there is an infinite number of potential optimizations that it would enable, it is a bit hard to put a number on it :)". As might be expected, that hand waving was not entirely popular. But part of the problem may be that Shannon sees the potential optimizations partly as just a byproduct of the other advantages the limits would bring; as Nathaniel Smith put it:

Mark, possibly you want to re-frame the PEP to be more like "this is good for correctness and enabling robust reasoning about the interpreter, which has a variety of benefits (and possibly speed will be one of them eventually)"? My impression is that you see speedups as a secondary motivation, while other people are getting the impression that speedups are the entire motivation, so one way or the other the text is confusing people.

Justification is certainly needed for making changes of this sort, Shannon said, but he thought that the current limits (or lack thereof) came about due to "historical accident and/or implementation details", which would seemingly allow a weaker justification. Van Rossum took issue with that assertion:

Whoa. The lack of limits in the status quo (no limits on various things except indirectly, through available memory) is most definitely the result of an intentional decision. "No arbitrary limits" was part of Python's initial design philosophy. We didn't always succeed (parse tree depth and call recursion depth come to mind) but that was definitely the philosophy. [...]

You have an extreme need to justify why we should change now. "An infinite number of potential optimizations" does not cut it.

Shannon replied that it may be part of the philosophy of the language, "but in reality Python has lots of limits." For example, he pointed out that having more than 2^31 instructions in a code object will crash CPython currently; that is a bug that could be fixed, but those kinds of problems can be hard to test for and find.

Explicit limits are much easier to test. Does code outside the limit fail in the expected fashion and code just under the limit work correctly?

What I want, is to allow more efficient use of resources without inconveniently low or unspecified limits. There will always be some limits on finite machines. If they aren't specified, they still exist, we just don't know what they are or how they will manifest themselves.
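
A hypothetical sketch of the kind of boundary test Shannon has in mind; LINE_LIMIT and the failure behavior are assumptions (current CPython enforces no such limit, so the second compile() simply succeeds today, if slowly), but with an explicit, documented limit both sides of the boundary can be probed directly:

    LINE_LIMIT = 1_000_000  # hypothetical; no such limit exists today

    def module_source(n_lines):
        # A syntactically trivial module with exactly n_lines lines.
        return "\n".join(f"x{i} = {i}" for i in range(n_lines))

    # Just under the limit: must compile cleanly.
    compile(module_source(LINE_LIMIT - 1), "<under>", "exec")

    # Just over the limit: under PEP 611 this would be expected to fail with a
    # well-defined error (perhaps a SyntaxError) rather than crash or misbehave.
    try:
        compile(module_source(LINE_LIMIT + 1), "<over>", "exec")
    except SyntaxError:
        pass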

The limits on the number of classes and on the number of coroutines were specifically raised as problematic by Van Rossum. Changing the object header, which is one thing that a limit on classes would allow, is a change to the C API, so he thinks it should be debated separately. A coroutine is "just another Python object and has no operating resources associated with it", so he did not understand why they were being specifically targeted. Others agreed about coroutines and offered up suggestions of applications that might have a need for more than one million. Those objections led Shannon to drop coroutines from the PEP as "the justification for limiting coroutines is probably the weakest".

Steering council thoughts

The Python steering council weighed in (perhaps in one of its last official actions, as a new council was elected on December 16) on the PEP. Warsaw said that it had been discussed at the meeting on December 10. The council suggested that the PEP be broken up into two parts, one that applies to all implementations of the language and would provide ways to determine the limits at runtime, and another that is implementation-specific for CPython with limits for that implementation. In addition: "We encourage the PEP authors and proponents to gather actual performance data that can be used to help us evaluate whether this PEP is a good idea or not."
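
Python already exposes a few limits at runtime, so it is easy to imagine what the council's language-level piece might look like; everything attached to sys.implementation below is purely hypothetical:

    import sys

    # Limits that can be queried today:
    print(sys.getrecursionlimit())   # call-recursion limit (adjustable)
    print(sys.maxsize)               # largest Py_ssize_t value, e.g. the maximum container length

    # A hypothetical future API in the spirit of the council's suggestion;
    # none of these names exist in any Python implementation today:
    #
    #   limits = sys.implementation.limits     # read-only mapping
    #   limits.get("lines_per_module")         # 1_000_000 on CPython, or None
    #   limits.get("bytecode_instructions")    # per code object
    #
    # Code generators could then adapt their output instead of guessing.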

Shannon is still skeptical that using "a handful of optimizations" to judge the PEP is the right approach. It is impossible to put numbers on all of the infinite possible optimizations, but it is also "impossible to perfectly predict what limits might restrict possible future applications". However, he did agree that coming up with example optimizations and applications that would suffer from the limitations would be useful.

In yet another thread, Shannon asked for the feedback, with specifics, to continue. He also wondered if his idea of choosing a single number as a limit, mostly as a memory aid, was important. Several thought that it was not reasonable to pick a single number for a variety of reasons. As D'Aprano put it, no one will need to memorize the limits since they should be rarely hit anyway and there should be a way to get them at runtime. Beyond that, there are already limits on things like nested parentheses, recursion depth, and so on that are not now and are not going to be one million.

Paul Moore agreed that a single limit value was not important, though he was in favor of choosing round numbers for any limits, rather than something based on the implementation details. He also may have summed up how most are thinking about the idea:

[...] my view is that I'm against any limits being imposed without demonstrated benefits. I don't care *how much* benefit, although I would expect the impact of the limit to be kept in line with the level of benefit. In practical terms, that means I see this proposal as backwards. I'd prefer it if the proposal were defined in terms of "here's a benefit we can achieve if we impose such-and-such a limit".

That's where things stand at this point. It would seem there is interest on the part of the steering council, given that it took up the PEP in its early days, though the council's interest may not entirely align with Shannon's. Ill-defined limits, with unclear semantics on what happens if they are exceeded, would seem to have little benefit for anyone, however. Some tightening of that specification for CPython and additional APIs for the language as a whole would be a great step. It might allow those interested to tweak the values in an experimental fork of CPython to test some optimization possibilities as well.



One million ought to be enough for anybody

Posted Dec 18, 2019 0:30 UTC (Wed) by Kamilion (guest, #42576) [Link] (8 responses)

>For example, he pointed out that having more than 2^31 instructions in a code object will crash CPython currently; that is a bug that could be fixed

What? Found a bug and didn't fix it or add an assert/RuntimeException?

And now you're proposing to add more arbitrary limits you extracted from the heated gasses within your anus?

Are you *TRYING* to chase data scientists away from python or something?

I got on board with python a long time ago because it lacked explicit limits. Because not everyone uses python like you do.
EVE Online embeds stackless python in their graphical game client, for instance. And happens to count some player-mineable resources required for crafting in the trillions.

Need more ram? Get a Gen-Z based rack "supercomputer" with 10 2Us and 24 256GB Gen-Z DRAM modules each, for 240*256=61440GB/61TB of cache-coherent DRAM across twenty 16-core/32-thread epyc 7302s.
Not including the NGFF DRAM modules, each of the 2Us would only cost around 3 grand. And that's with RDMA over PCIe.

I fully expect a 64bit python3.8 runtime to be able to deal with terabytes of DRAM already. It would be weird if it didn't, considering the amount of machine learning and data science using pandas, numpy, and sympy out there.

I mean, wouldn't this proposal, as described, unintentionally limit pandas to a million dataframe objects or less just due to the sum of local variables and stack usage?

Look, there's even a reasonable compromise.

Come up with a runtime limits system that an application can override by request with a soft and hard limit.
If you do that, then most of the existing hobby programmers will never ask the runtime for a limit-break.
It might still end up causing problems to enterprise programmers like the hassle that is sysctl limits for Oracle Enterprise Database installs.
(Ever wonder why Oracle reaaaaally wants Oracle Unbreakable Linux to be deployed instead of RHEL? Yeah, they've dinked with the RHEL default values for stuff like sysctls so their own software installs without complaint.)

"fs.file-max = 6815744" as seen in:
https://docs.oracle.com/en/database/oracle/oracle-databas...

So at the very least, Oracle seems to want to be able to open six point eight million files concurrently, a far cry from the current default limit of 84090.

You have no idea how happy I am that the sysctl mechanism FINALLY went away in kernel 5.5.
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/...
(Mainly because, as an ex-oracle dba, It would really please me to see this cause oracle massive logistics/update problems)

One million ought to be enough for anybody

Posted Dec 18, 2019 6:03 UTC (Wed) by zdzichu (guest, #17118) [Link] (2 responses)

"sysctl mechanism" didn't went away. Never used syscall was removed, but /proc/sys is still here and will be always.
I also don't understand why you would be happy?

One million ought to be enough for anybody

Posted Dec 18, 2019 17:34 UTC (Wed) by clugstj (subscriber, #4020) [Link]

"I also don't understand why you would be happy?"

Having worked, tangentially, with Oracle, I can understand why he would be happy.

One million ought to be enough for anybody

Posted Dec 20, 2019 0:22 UTC (Fri) by Kamilion (guest, #42576) [Link]

Ah, I was under the impression the sysctl(8) binary still used the syscall. Wasn't aware it's been poking procfs for quite a while.
A quick googling resolved my ignorance on that matter.

As for my happiness; I think that oracle's one of the biggest "friendly-looking" threats to FOSS in the technical sector currently. They've pretty much systematically destroyed every Sun asset's reputation. Basically just try to sueball or intimidate anyone/everyone they can. Sad thing is, it's worked very well for their bank accounts. Schadenfreude it may be, but I'll cheer on any black-eye to their reputation I can.

One million ought to be enough for anybody

Posted Dec 18, 2019 16:54 UTC (Wed) by Rudd-O (guest, #61155) [Link]

These limits don't gore your ox. Move along.

^^ This is not sarcastic. They really do not gore your ox.

One million ought to be enough for anybody

Posted Dec 18, 2019 22:15 UTC (Wed) by sjj (guest, #2020) [Link] (1 responses)

There's nothing special about a database or other software needing larger fs.file-max. Many do, sometimes it's even documented, and systems get tuned accordingly. Not seeing the relevance here.

Incidentally do you think ranting about heated gases from somebody's anus makes your argument stronger in a technical discussion?

One million ought to be enough for anybody

Posted Dec 20, 2019 0:07 UTC (Fri) by Kamilion (guest, #42576) [Link]

Nothing special about needing limits raised for certain types of applications.
It's a tuneable. The Royal We, The Users, prefer probing and plug-and-play, and linux as a whole has been known to make it a goal to be moving towards this state and trying to reduce 'pointless' tunables. A hard limit / define will just make it harder for distributors to get things right for what they perceive as the general case. Alternatively, it makes the enterprise space much more difficult to maintain due to the explosion in combinatorial complexity.

Would you have preferred if I used the phrase "rabbit out of a hat", "full of hot air", "bullshit", or the physically accurate but crude way I decided to word "pulled a million out of his ass"?
It's not like my opinion really has any weight on the matter at all, it was more directed at the absurdity of the spherical-ness of the turd-polishing, so strength of argument has no real bearing.

One million ought to be enough for anybody

Posted Dec 18, 2019 22:42 UTC (Wed) by flussence (guest, #85566) [Link]

> Ever wonder why Oracle reaaaaally wants Oracle Unbreakable Linux to be deployed instead of RHEL?

Subtly broken software that only rears its ugly head once there's sufficient sunk cost to keep end users locked in sells more support contracts. If Oracle's developers were halfway as dedicated as their lawyers they wouldn't need that EULA clause prohibiting publishing of benchmarks.

One million ought to be enough for anybody

Posted Dec 21, 2019 2:43 UTC (Sat) by jwarnica (subscriber, #27492) [Link]

Not that I'm a Python guy in the least, but if the conversation was about promises (a minimum requirement) rather than limits (an arbitrary upper bound) it might be a lot more productive, and exactly as useful almost always.

You can still optimize to the promise.

One million ought to be enough for anybody

Posted Dec 18, 2019 0:41 UTC (Wed) by areilly (subscriber, #87829) [Link] (2 responses)

As an occasional Python user, the idea of getting excited about the range of line numbers seems absurd, since they are only ever used as a diagnostic anyway. The code itself is just text, limited by the file size limits of the system, for the most part. To make an argument that you might want to pack three of them (line numbers) into a 64-bit word is a step beyond absurd, IMO. I doubt that there is any context or use-case where that even begins to be a win. Python's integers are already bignums. If you're concerned about implementation details being constrained, then back them with bignums too. I agree with Guido that lacking inherent, explicit limitations wherever possible is a key part of Python's (good) design.

One million ought to be enough for anybody

Posted Dec 18, 2019 1:19 UTC (Wed) by andresfreund (subscriber, #69562) [Link] (1 responses)

> To make an argument that you might want to pack three of them (line numbers) into a 64-bit word is a step beyond absurd, IMO.

I assume this would be less about packing three line numbers together, and more about decreasing memory usage of structs that contain multiple 22bit quantities. I'm not that convinced of this in the case of line numbers (imo it makes more sense to store such accessed data separately anyway), but condensing different fields where l1/l2 residency is a big performance factor, is often a good idea.

> If you're concerned about implementation details being constrained, then back them with bignums too

If it's not already done, it should be comparatively easy to avoid most of the overhead for the common case of smaller integer, given the generic nature of datums in Python.

That's harder for lower level details. If you e.g. want a fixed width bytecode instruction format, everyone would suffer if it's required that jumps can jump arbitrarily far - even though the restriction can be worked around by the bytecode generator.

One million ought to be enough for anybody

Posted Dec 25, 2019 13:30 UTC (Wed) by dvdeug (guest, #10998) [Link]

Infinite jumps are easy to implement for one extra bytecode. Any normal code can use the normal jump, and any code that needs ridiculous jumps can call the super jump to jump to a place given on the top of the stack or in a register.

One million ought to be enough for anybody

Posted Dec 18, 2019 2:58 UTC (Wed) by KaiRo (subscriber, #1987) [Link]

Sorry to say it like that, but it sounds almost ridiculous to me to talk about any limits in this day and age without showing numbers of how close current use cases are to those proposed limits, as well as actual reasons why those limits make sense (other than being a memorable number). In a mature project, I would expect solid data for both what this would break and what it would win before it can be seriously considered.

One million ought to be enough for anybody

Posted Dec 18, 2019 3:12 UTC (Wed) by sb (subscriber, #191) [Link]

I may be missing something because I found that discussion utterly surreal in its apparent seriousness. Bit packing a handful of arbitrary limits, how much of a performance improvement should be expected from placing said limits, the "infinite number of potential optimizations this would enable" -- all of this in a bytecode interpreter that allocates objects and makes indirect function calls and takes a global lock everywhere.

One million ought to be enough for anybody

Posted Dec 18, 2019 8:09 UTC (Wed) by eru (subscriber, #2753) [Link] (3 responses)

It is the end of the year 2019. Most computers have 64-bit CPUs and several gigabytes of RAM. In this context, limiting something like the number of source lines or bytecodes in a program to at most a million is absurd. If you absolutely must place a limit on these things, in order to specify the language tightly and allow testing extreme cases, the smallest number that makes any kind of sense is 2^31-1, ensuring the implementer is allowed to use a signed 32-bit integer for the quantity.

One million ought to be enough for anybody

Posted Dec 20, 2019 1:58 UTC (Fri) by flussence (guest, #85566) [Link] (2 responses)

Arguing over a million line limit in Python is a farce. For a language designed to be easy to read above all else, the limit should be much *lower*; anyone writing scripts tens of thousands of lines long has completely missed the point of Python. 2¹⁵ lines per file is more than enough.

I'm here looking at the worst offender on my system - an ad-hoc reimplementation of half of rsync/cpio/rcs in portage that “only” takes 5790 lines. One of those lines starts with 11 tabs. Is that really the sort of coding style people are advocating for here?

Anyone that insists on working with inscrutable, monolithic balls of mud that push the system into swap has two good alternatives here: take a time machine 25 years into the past, or go work with the Javascript ecosystem. The rest of us would like our computers to last more than 3 months.

One million ought to be enough for anybody

Posted Dec 20, 2019 16:13 UTC (Fri) by mina86 (guest, #68442) [Link]

Not all Python source files are hand-written.

Readability needs to be enforced by code review. The interpreter is unable to enforce it. You can play golf in any language.

One million ought to be enough for anybody

Posted Dec 21, 2019 20:05 UTC (Sat) by anselm (subscriber, #2796) [Link]

As the saying goes, there are only three numbers that make sense in software engineering - zero, one, and infinity (or as many as will fit into available memory). Arbitrary limits, whether they are “one million” or “2¹⁵”, are only guaranteed to become a nuisance at some point.

One million ought to be enough for anybody

Posted Dec 19, 2019 13:52 UTC (Thu) by Karellen (subscriber, #67644) [Link]

> Bear in mind that the costs of higher limits are paid by everyone, but the benefits are gained by few.

Given that Python is a language with multiple implementations, this seems like a strange point of view.

If a limit of 1 million lines of code is put in place, does that mean that all Python implementations must be capable of running 1MLOC programs? What does that mean for MicroPython? And what if someone wanted to create a Python implementation that specialised in working with massive dictionaries (e.g. creating snapshot dictionaries of filesystem directories, on systems with filesystems that support more than a million files per directory) but was otherwise a conforming Python implementation?

Maybe it doesn't matter if those implementations are required to be described as "Python-like" rather than "Python", but putting limits on the language itself, rather than noting the limits of a particular implementation, does seem rather... well, limiting, to someone who is used to languages like C where you have rules like "int has to be at least as big as short, and long has to be at least as big as int" that have lasted 40 years, in part because of the freedom that implementations have to grow as needs change.

One million ought to be enough for anybody

Posted Dec 19, 2019 14:50 UTC (Thu) by NRArnot (subscriber, #3033) [Link]

"The number of classes in a running interpreter."

This rang a warning bell for me, depending on how one's dynamically declared classes are garbage-collected. Python's 3-argument form of type creates a new class. One can have code which constructs and returns a class, which then gets instantiated. Hopefully when the last instance gets garbage-collected and the class itself has gone out of scope, the implied "class number 123456" (slot?) becomes vacant and can be claimed by the next dynamically created class. But if for some reason outside a programmer's control, deep in Python's innards, this did not happen, it's a show-stopper waiting to happen for people who do neat things with dynamic classes and metaclasses.
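
The pattern being described is just the three-argument form of type(); make_record_class() below is a hypothetical factory, and the open question is whether classes it creates and later drops would free their "slots" under the proposed limit:

    def make_record_class(name, fields):
        # type(name, bases, namespace) builds a brand-new class object at
        # runtime, so a long-running program can mint classes on demand.
        return type(name, (object,), {"__slots__": tuple(fields)})

    Point = make_record_class("Point", ["x", "y"])
    p = Point()
    p.x, p.y = 1, 2
    del p, Point   # once collected, does this class still count against the limit?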

One million ought to be enough for anybody

Posted Dec 20, 2019 7:51 UTC (Fri) by dvdeug (guest, #10998) [Link]

"For example, he pointed out that having more than 2^31 instructions in a code object will crash CPython currently; that is a bug that could be fixed, but those kinds of problems can be hard to test for and find."

I don't see how those types of problems are hard to test for and find; stuff an unreasonably large program in and see where things fail.

I don't really see the goal. Other languages, like C and Ada, require compilers to support certain minimal limits (and usually issue warnings or errors if it exceeds the implementation dependent limits). CPython and other implementations of Python should have known limits, but not set arbitrarily, and with known costs to expanding them (i.e. changing a 32-bit value to 64-bit.) If squeezing 21 bit values into 64-bit blocks, then let's have numbers on that.

One million ought to be enough for anybody

Posted Dec 20, 2019 16:16 UTC (Fri) by mina86 (guest, #68442) [Link]

I could be convinced of a limit on the number of lines of code, but I’m rather sceptical about a limit on the number of byte-code instructions, since those are not under the programmer’s control and how byte-code is generated can change between versions of the interpreter.

One million ought to be enough for anybody

Posted Dec 23, 2019 15:41 UTC (Mon) by anton (subscriber, #25547) [Link]

The reaction from the Python community to the unsubstantiated "infinite number of potential optimizations" is markedly different to that of many (though not all) in the C community to similar unsubstantiated claims. The question is whether this is due to the Python community being more down-to-earth or due to the obviously absurd nature of this proposal.

