
Leading items

Welcome to the LWN.net Weekly Edition for December 19, 2019

This edition contains the following feature content:

  • Fedora and optical media testing: should bugs found when installing from optical media still block Fedora releases?
  • One million ought to be enough for anybody: a proposal to impose explicit limits on various aspects of Python programs.
  • Buffered I/O without page-cache thrashing: the proposed RWF_UNCACHED flag offers many of the benefits of direct I/O with less bother.
  • Explicit pinning of user-space pages: new infrastructure for tracking pages that have been pinned for DMA.
  • A year-end wrap-up from LWN: a look back at our January predictions and the year that was.

This week's edition also includes these inner pages:

  • Brief items: Brief news items from throughout the community.
  • Announcements: Newsletters, conferences, security updates, patches, and more.

This is the final LWN.net Weekly Edition for 2019; as usual, we will be taking the last week of the year off to recuperate and get ready for 2020. The Weekly Edition will return on January 2; we wish all of our readers a great holiday season and a happy new year.

Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.

Comments (none posted)

Fedora and optical media testing

By Jake Edge
December 18, 2019

Once upon a time, Linux was installed from a stack of floppy disks—thankfully cassette tape "drives" were long in the past at that point—but floppies were superseded by optical media, first CDs and then DVDs. These days, those options are starting to fade away in most new computer systems; just as it is now rather hard to find a floppy-based Linux installer, not to mention the media and drives themselves, someday optical media installation will disappear as well. For Fedora, that day has not truly arrived, though a somewhat confusingly presented proposal on the Fedora devel mailing list is, to a limited extent, a step in that direction.

In truth, of course, there is not a ton of difference between images meant to boot from USB devices and those built for optical media. A USB device these days can hold more than optical media can, but that is generally not the gating factor; typical Linux distributions can generally fit on a DVD. "Burning" optical media is a time-consuming process that may also consume the media itself, however. For a distribution quality assurance (QA) team, testing that type of media takes time that might be better spent on other things; it also requires testers to have the proper equipment, while USB ports are ubiquitous.

That is the backdrop to the proposal posted by Fedora program manager Ben Cotton to drop the "release blocking" categorization for bugs found installing from optical media in Fedora 32 (and beyond, presumably).

This means we'd stop blocking on bugs found during the installation of Fedora from optical media (like CDs and DVDs). This doesn't mean that installation from optical media would stop working, just that the Fedora Release wouldn't be blocked on any issues that can pop up in Fedora installation using this method. Installation from USB devices will remain blocking.

Adam Williamson tried to clarify the proposal (which also has a page on the Fedora wiki, as with other system-wide change proposals) somewhat. He noted that the images for optical media would still be created—bugs in that process would continue to be release blockers—but testing them on physical optical media would not be necessary. He pointed to an earlier discussion on the Fedora test mailing list for added context.

Problems with booting from optical media are rare, according to Chris Murphy, though there was one reported on the day the proposal was posted; "But before that, I think it's been a little while. Couple years?" But several of those posting in the thread were adamant that problems with optical media should continue to be release blockers. There was, however, some confusion about the current practice; for the last few years, only two of the many images created for Fedora have actually been tested using physical optical media.

Murphy pointed out that the bugs that would have affected booting from optical media were actually found using the virtual CD device in a virtual machine. "No coasters were made as part of these tests!" He thought that it might be reasonable to keep the test of one specific image on optical media (instead of the current two) as a release blocker if it was still deemed necessary to do that kind of testing. But Kevin Kofler was surprised to hear that bugs found when testing the KDE Live image from optical media were not currently release-blocking:

Nowhere was it agreed back then, nor ever since, that this should only apply to some types of physical (or virtual) media and not others. I hereby request that the KDE Live x86_64 be made release-blocking for optical media again, unless explicitly agreed otherwise with the KDE SIG.

As Williamson said, however, that change was made back in January 2017 after discussion on both the test and devel lists. The idea then was to reduce the testing burden:

Note, this isn't exactly a case of 'Workstation is more important than KDE'. Rather, the idea was that we can be more efficient in testing by only testing *one* live image, because all the live images are built exactly the same way, so if one boots there's really absolutely no reason all the others shouldn't boot too. Similarly for traditional installer images. So, we picked the Everything netinst as the 'representative' for traditional installer images, and the Workstation live as the 'representative' for live images; the idea is that if we just test those two, it 99.99999% proves all the others will also boot.

As he also noted further down in the lengthy thread, virtual optical devices are tested automatically multiple times per day. He suggested that if there are lots of Fedora users who rely on optical media, they should perhaps pitch in to help testing those images during the Fedora release cycle. There was also some disagreement in the thread about the prevalence of optical drives in the systems of Fedora users, but some strong opinions that supporting that type of media is a requirement for many users.

The likelihood is that even if this change were adopted, optical media would still work just fine, at least according to what the QA folks in the thread were saying. In addition, John M. Harris Jr. has volunteered to test optical media regularly, which should help to ensure there aren't any problems, regardless of whether those problems would block a release or not. In part, though, the complaints about the change stem from an overall concern that Fedora is too quick to leave some things behind.

For example, Harris (and others) complained about dropping x86 and Python 2 support in the distribution, linking those decisions to the proposal. But, as Williamson pointed out in a lengthy message, that willingness to move on flows from the "four foundations" that underlie the philosophy of the project.

What's the alternative? We never drop Python 2 support in order to keep software that is clearly becoming increasingly out of date in a distribution which has "First" as one of its core principles? This is just another angle on "it is almost never the case that, when Fedora stops caring about something, it's a thing that absolutely nobody and nothing wants". There has to be a cut-off. There's probably *someone* out there who still has a Python 1 interpreter installed. And libc 5. On a 386SX. Should Fedora still work on it?

Harris also pointed to Debian as an example of how Python 2 could continue to be supported—by continuing to carry the older version of Python:

This doesn't change the fact that many Python scripts *cannot run on Python 3*. Debian is not a museum piece either, and yet they don't just kill the old version. The two versions can, and do, work when both installed in parallel. This isn't relevant to this thread either, but several packages were simply dropped from Fedora because the upstream didn't have a "path forward", as it was put in the FESCo [Fedora Engineering Steering Committee] ticket, to Python 3.

But Debian and Fedora have different missions and goals, Williamson said; both are important pieces of the Linux distribution ecosystem:

BTW, there is another point here which you may not appreciate: Fedora and Debian aren't really in competition. Fedora does not see its job as being to Conquer The World and have everyone run Fedora. Fedora is targeted at particular purposes and particular audiences. If a given feature isn't actually driving Fedora's mission forward in any way, it's reasonable to consider not having it any more, or at least not making it a core part of the distribution and subject to blocking requirements and so on. [...]

To put it another way...Debian and Fedora have different purposes and different goals. Us dropping Python 2 earlier than Debian [does] is *things working the right way*. We (arguably) do more than Debian to drive the adoption and stabilization of new technologies - new stuff tends to show up in Fedora earlier than it shows up in Debian. Debian (arguably) does more than we do to provide long-term support for older software and support for alternate architectures. This is a *good* thing. It's an ecosystem that helps everyone.

In addition, Debian is dropping Python 2 support, just at a slower pace, as Neal Gompa said.

Debian definitely does not want to make another release with Python 2 in the distribution. Ubuntu has already decided to filter out all Python 2 packages from Ubuntu 20.04 LTS, so it's not going to be there either.

And you know what? This was all made possible by Fedora's work over the last several releases to port lots of software to Python 3, aggressively migrate to Python 3 by default, and now finally dropping Python 2 stuff over the last three releases.

The discussion is something of a reprise of a clash we have seen before in various guises (and for multiple distributions and other projects): stability versus the "latest and greatest". Fedora is clearly on record as tilting well toward the latest-and-greatest end of that spectrum, but that doesn't stop folks from pushing back on that at times. For optical media, it would seem that Harris is planning to dedicate two machines to testing release candidates and other nightly builds with optical media. Community help on that testing is welcome, Kamil Paral said, but there is still value in deciding whether any bugs found should block the release:

We claim that the importance of optical media has diminished and it's now below the threshold for granting it a release-blocking status. That's not really affected by whom executes the tests. I'm really interested to know what the community and FESCo thinks about this, and that's why this proposal is useful. If we keep blocking on optical media, we'll need to keep verifying it, and of course any community help will be very appreciated. If we stop blocking on it, we'll be able to run the test just optionally when we have spare time, and of course any community help will still be very appreciated.

Williamson disagreed slightly; he thought that having community members who reliably show up to help test a particular feature is an indicator that there are others who actually care about it. That might be a good reason not to switch the release-blocker status for optical-media boot problems—or at least not yet. As the thread wound down, Paral summarized the discussion; while a few clarifications were made in response, it serves as a good overall wrap-up of the thread.

It will be up to FESCo to decide whether to proceed with this change or not; an active community testing effort would seem likely to impact its decision. Even if those bugs are determined to not be release blockers, it is pretty likely that nothing will change for users of that type of media—if bugs are found, they will be fixed. But there will come a time when even that will change and optical media will be remembered fondly (or not so fondly) as a media type used by dinosaurs; one might guess that Fedora will be one of the early pioneers when that time comes.

Comments (22 posted)

One million ought to be enough for anybody

By Jake Edge
December 17, 2019

Programming languages generally have limits—explicit or implicit—on various aspects of their operation. Things like the maximum length of an identifier or the range of values that a variable can store are fairly obvious examples, but there are others, many of which are unspecified by the language designers and come about from various implementations of the language. That ambiguity has consequences, so nailing down a wide variety of limits in Python is the target of an ongoing discussion on the python-dev mailing list.

Mark Shannon posted a proposal "to impose a limit of one million on various aspects of Python programs, such as the lines of code per module". One million may seem like an arbitrary number (and it is), but part of his thinking is that it is also memorable, so that programmers don't need to consult a reference when wondering about a language-imposed limit. In addition, though, certain values stored by the Python virtual machine (e.g. line numbers) are 32-bit values, which obviously imposes its own limit—but one that may be wasting space for the vast majority of Python programs that never even get close. Beyond that, overflowing those 32-bit values could lead to security and other types of problems.

As Shannon pointed out, a range of -1,000,000 to 1,000,000 could fit in 21 bits and that three of those values could be packed into a 64-bit word. "Memory access is usually a limiting factor in the performance of modern CPUs. Better packing of data structures enhances locality and reduces memory [bandwidth], at a modest increase in ALU usage (for shifting and masking)." He suggested that the data structures for stack-frame objects, code objects, and objects themselves could benefit from packing in that fashion. "There is also the potential for a more efficient instruction format, speeding up interpreter dispatch."
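To make the packing argument concrete, here is a small C sketch; it is not code from CPython or the proposal (the names and field layout are invented for illustration), but it shows three signed values in the ±1,000,000 range being packed into the 21-bit fields of a single 64-bit word and recovered with shifts and masks:

    #include <assert.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Illustration only: values in [-1000000, 1000000] fit in a signed
     * 21-bit field, so three of them fit in one 64-bit word. */
    #define FIELD_BITS 21
    #define FIELD_MASK ((1ULL << FIELD_BITS) - 1)
    #define LIMIT      1000000

    static uint64_t pack3(int32_t a, int32_t b, int32_t c)
    {
        assert(a >= -LIMIT && a <= LIMIT);
        assert(b >= -LIMIT && b <= LIMIT);
        assert(c >= -LIMIT && c <= LIMIT);

        /* Truncate each value to 21 bits of two's complement and shift
         * it into its slot. */
        return ((uint64_t)a & FIELD_MASK) |
               (((uint64_t)b & FIELD_MASK) << FIELD_BITS) |
               (((uint64_t)c & FIELD_MASK) << (2 * FIELD_BITS));
    }

    static int32_t unpack(uint64_t word, int slot)
    {
        uint32_t raw = (word >> (slot * FIELD_BITS)) & FIELD_MASK;

        /* Sign-extend the 21-bit field back to a full 32-bit integer. */
        if (raw & (1U << (FIELD_BITS - 1)))
            raw |= ~(uint32_t)FIELD_MASK;
        return (int32_t)raw;
    }

    int main(void)
    {
        uint64_t word = pack3(-42, 999999, 123456);

        printf("%d %d %d\n", unpack(word, 0), unpack(word, 1), unpack(word, 2));
        return 0;
    }

The shifting, masking, and sign-extension seen here are the kind of "modest increase in ALU usage" that the proposal accepts in exchange for denser data structures.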

He proposed using the limit for seven different aspects of Python programs:

  • The number of source code lines in a module.
  • The number of bytecode instructions in a code object.
  • The sum of local variables and stack usage for a code object.
  • The number of distinct names in a code object.
  • The number of constants in a code object.
  • The number of classes in a running interpreter.
  • The number of live coroutines in a running interpreter.

He also addressed the "Isn't this '640K ought to be enough for anybody' again?" question, which is something that immediately comes to mind when arbitrary limits are proposed. He noted that the Java virtual machine (JVM) limits many program elements to 65535 (which fits in 16 bits); that can be constricting, but mostly for program generators rather than hand-written code. The limit he is proposing for Python is a good deal larger than that and he believes it would not prove to be a real impediment to human-generated code. "While it is possible that generated code could exceed the limit, it is easy for a code generator to modify its output to conform."

Reactions

He provided short justifications for some of the limits, but many of those who commented in the thread were concerned with how much benefit they would actually provide. Chris Angelico asked whether Shannon had done any measurements to see how large of an effect there would be. Steven D'Aprano agreed with the premise of getting memory and security benefits, but thought it was "a bit much to expect" Python developers to take the anticipated speed increase on faith. Steve Dower thought the overall idea was not unreasonable, though he had some concerns as well:

Picking an arbitrary limit less than 2**32 is certainly safer for many reasons, and very unlikely to impact real usage. We already have some real limits well below 10**6 (such as if/else depth and recursion limits).

That said, I don't really want to impact edge-case usage, and I'm all too familiar with other examples of arbitrary limits (no file system would need a path longer than 260 characters, right? :o) ).

Dower is referring to the Windows MAX_PATH value, which restricts path names in the Win32 API to a maximum of 260 characters. He thought several of the proposed limits did seem reasonable, though the "lines per module" limit "feels the most arbitrary". Comments and blank lines would certainly count against the limit, which gave him pause. The "classes in a running interpreter" limit is a bit worrisome, he said, but there may be ways to deal with programs that go beyond it while still getting any gains it brings: "The benefits seem worthwhile here even without the rest of the PEP."

But Rhodri James thought that limits like those proposed will eventually become a problem for some; he also objected to the idea of packing the counts into smaller bit widths due to the inefficiency of masking and shifting on each access. Gregory P. Smith was, in general, in favor of limits, but was concerned that code generation would run afoul of them; he noted that the JVM limits have been a big problem in the Android world. Others in the thread also pointed to the JVM limits as being problematic.

Guido van Rossum posted some thoughts on the idea. He wondered a bit about the problem being solved and expressed some skepticism that, for example, representing line numbers in 20 bits rather than 32 is really going to be much of an efficiency gain. He was concerned about "existing (accidental) limits that caused problems for generated code". But he also pointed out that the existing CPython parser is limited to 100 levels of nested parentheses (and likely nested indent levels as well) and there have been no complaints that he has heard.

A real-world example of a file with one million lines was noted by Oscar Benjamin. It is a test file for SymPy that crashed the Python 3.6 interpreter (though that was fixed in 3.7.1). The actual test file is only around 3,000 (long) lines, but it gets rewritten by the pytest framework into a file with more than one million lines. Benjamin also pointed out that a limit of one million bytecode instructions is even more restrictive.

A distinction should be made between limits for the language itself and for those of the CPython implementation, Jim J. Jewett said. He noted that "there is great value in documenting the limits that CPython in particular currently chooses to enforce", but that making the limits too high for the language itself would, for example, potentially leave out implementations like MicroPython.

There may well be value in changing the limits supported by CPython (or at least CPython in default mode), or its bytecode format, but those should be phrased as clearly a CPython implementation PEP (or bytecode PEP) rather than a language change PEP.

PEP 611

Shannon took in the feedback and reflected it in a PEP 611 ("The one million limit"). That led to another thread, where there were more calls for some kind of benchmarking for the changes in order to reasonably evaluate them. Barry Warsaw also said that "there is a lack of clarity as to whether you are proposing a Python-the-language limit or a CPython-the-implementation limit". If the limits are for the language "then the PEP needs to state that, and feedback from other implementation developers should be requested".

A few days later, Shannon asked that commenters be more precise about their concerns. There are costs associated with either choice, but simply stating a preference for a higher limit is not entirely helpful feedback:

Merely saying that you would like a larger limit is pointless. If there were no cost to arbitrarily large limits, then I wouldn't have proposed the PEP in the first place.

Bear in mind that the costs of higher limits are paid by everyone, but the benefits are gained by few.

Angelico again asked for numbers, but Shannon said that the performance increase is difficult to quantify: "Given there is an infinite number of potential optimizations that it would enable, it is a bit hard to put a number on it :)". As might be expected, that hand waving was not entirely popular. But part of the problem may be that Shannon sees the potential optimizations partly as just a byproduct of the other advantages the limits would bring; as Nathaniel Smith put it:

Mark, possibly you want to re-frame the PEP to be more like "this is good for correctness and enabling robust reasoning about the interpreter, which has a variety of benefits (and possibly speed will be one of them eventually)"? My impression is that you see speedups as a secondary motivation, while other people are getting the impression that speedups are the entire motivation, so one way or the other the text is confusing people.

Justification is certainly needed for making changes of this sort, Shannon said, but thought that the current limits (or lack thereof) came about due to "historical accident and/or implementation details", which would seemingly allow a weaker justification. Van Rossum took issue with that assertion:

Whoa. The lack of limits in the status quo (no limits on various things except indirectly, through available memory) is most definitely the result of an intentional decision. "No arbitrary limits" was part of Python's initial design philosophy. We didn't always succeed (parse tree depth and call recursion depth come to mind) but that was definitely the philosophy. [...]

You have an extreme need to justify why we should change now. "An infinite number of potential optimizations" does not cut it.

Shannon replied that it may be part of the philosophy of the language, "but in reality Python has lots of limits." For example, he pointed out that having more than 2³¹ instructions in a code object will crash CPython currently; that is a bug that could be fixed, but those kinds of problems can be hard to test for and find.

Explicit limits are much easier to test. Does code outside the limit fail in the expected fashion and code just under the limit work correctly?

What I want, is to allow more efficient use of resources without inconveniently low or unspecified limits. There will always be some limits on finite machines. If they aren't specified, they still exist, we just don't know what they are or how they will manifest themselves.

The limits on the number of classes and on the number of coroutines were specifically raised as problematic by Van Rossum. Changing the object header, which is one thing that a limit on classes would allow, is a change to the C API, so he thinks it should be debated separately. A coroutine is "just another Python object and has no operating resources associated with it", so he did not understand why they were being specifically targeted. Others agreed about coroutines and offered up suggestions of applications that might have a need for more than one million. Those objections led Shannon to drop coroutines from the PEP as "the justification for limiting coroutines is probably the weakest".

Steering council thoughts

The Python steering council weighed in (perhaps in one of its last official actions, as a new council was elected on December 16) on the PEP. Warsaw said that it had been discussed at the meeting on December 10. The council suggested that the PEP be broken up into two parts, one that applies to all implementations of the language and would provide ways to determine the limits at runtime, and another that is implementation-specific for CPython with limits for that implementation. In addition: "We encourage the PEP authors and proponents to gather actual performance data that can be used to help us evaluate whether this PEP is a good idea or not."

Shannon is still skeptical that using "a handful of optimizations" to judge the PEP is the right approach. It is impossible to put numbers on all of the infinite possible optimizations, but it is also "impossible to perfectly predict what limits might restrict possible future applications". However, he did agree that coming up with example optimizations and applications that would suffer from the limitations would be useful.

In yet another thread, Shannon asked for the feedback, with specifics, to continue. He also wondered if his idea of choosing a single number as a limit, mostly as a memory aid, was important. Several thought that it was not reasonable to pick a single number for a variety of reasons. As D'Aprano put it, no one will need to memorize the limits since they should be rarely hit anyway and there should be a way to get them at runtime. Beyond that, there are already limits on things like nested parentheses, recursion depth, and so on that are not now and are not going to be one million.

Paul Moore agreed that a single limit value was not important, though he was in favor of choosing round numbers for any limits, rather than something based on the implementation details. He also may have summed up how most are thinking about the idea:

[...] my view is that I'm against any limits being imposed without demonstrated benefits. I don't care *how much* benefit, although I would expect the impact of the limit to be kept in line with the level of benefit. In practical terms, that means I see this proposal as backwards. I'd prefer it if the proposal were defined in terms of "here's a benefit we can achieve if we impose such-and-such a limit".

That's where things stand at this point. It would seem there is interest on the part of the steering council, given that it took up the PEP in its early days, though the council's interest may not entirely align with Shannon's. Ill-defined limits, with unclear semantics on what happens if they are exceeded, would seem to have little benefit for anyone, however. Some tightening of that specification for CPython and additional APIs for the language as a whole would be a great step. It might allow those interested to tweak the values in an experimental fork of CPython to test some optimization possibilities as well.

Comments (23 posted)

Buffered I/O without page-cache thrashing

By Jonathan Corbet
December 12, 2019
Linux offers two modes for file I/O: buffered and direct. Buffered I/O passes through the kernel's page cache; it is relatively easy to use and can yield significant performance benefits for data that is accessed multiple times. Direct I/O, instead, goes straight between a user-space buffer and the storage device. It can be much faster for situations where caching by the operating system isn't necessary, but it is complex to use and contains traps for the unwary. Now, it seems, Jens Axboe has come up with a way to get many of the benefits of direct I/O with a lot less bother.

Direct I/O can give better performance than buffered I/O in a couple of ways. One of those is simply avoiding the cost of copying the data between user space and the page cache; that cost can be significant, but in many cases it is not the biggest problem. The real issue may be the effect of buffered I/O on the page cache.

A process that performs large amounts of buffered I/O spread out over one or more large (relative to available memory) files will quickly fill the page cache (and thus memory) with cached file data. If the process in question does not access those pages after performing I/O, there is no benefit to keeping the data in memory, but it's there anyway. To be able to allocate memory for other uses, the kernel will have to reclaim some pages from somewhere. That can be expensive for the system as a whole, even if "somewhere" is the data associated with this I/O activity.

The memory-management subsystem tries to do the right thing in this situation. Pages added to the cache via buffered I/O go onto the inactive list; unless they are accessed a second time in the near future, they will be the first pages to be kicked back out. But there is still a fair amount of overhead associated with implementing this behavior; Axboe ran a simple test and described the results this way:

The test case is pretty basic, random reads over a dataset that's 10x the size of RAM. Performance starts out fine, and then the page cache fills up and we hit a throughput cliff. CPU usage of the IO threads go up, and we have kswapd spending 100% of a core trying to keep up.

This kind of problem can be avoided by switching to direct I/O, but that brings challenges and problems of its own. Axboe has concluded that there may be a third way that can provide the best of both worlds.

That third way is a new flag, RWF_UNCACHED, which is provided to the preadv2() and pwritev2() system calls. If present, this flag changes the requested I/O operation in two ways, depending on whether the affected file pages are currently in the page cache or not. When the data is present in the page cache, the operation proceeds as if the RWF_UNCACHED flag were not present; data is copied to or from the pages in the cache. If the pages are absent, instead, they will be added to the page cache, but only for the duration of the operation; those pages will be removed from the page cache once the operation completes.
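From user space, an uncached read would look like an ordinary preadv2() call with one extra flag. The sketch below is illustrative only: RWF_UNCACHED is not defined by mainline kernel headers, so the value given here is a placeholder standing in for whatever Axboe's patch set defines, and the call will simply fail on kernels without the patches.

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/uio.h>
    #include <unistd.h>

    /* Placeholder value; the real one is whatever the patch set assigns. */
    #ifndef RWF_UNCACHED
    #define RWF_UNCACHED 0x00000040
    #endif

    int main(int argc, char **argv)
    {
        static char buf[64 * 1024];
        struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };
        ssize_t n;
        int fd;

        if (argc < 2) {
            fprintf(stderr, "usage: %s <file>\n", argv[0]);
            return 1;
        }
        fd = open(argv[1], O_RDONLY);
        if (fd < 0) {
            perror("open");
            return 1;
        }

        /*
         * With the patches applied, this read uses the page cache if the
         * data is already resident, but any pages instantiated just for
         * this operation are dropped again once it completes.
         */
        n = preadv2(fd, &iov, 1, 0, RWF_UNCACHED);
        if (n < 0)
            perror("preadv2");
        else
            printf("read %zd bytes\n", n);

        close(fd);
        return 0;
    }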

The result, in other words, is buffered I/O that does not change the state of the page cache; whatever was present there before will still be there afterward, but nothing new will be added. I/O performed in this way will gain most of the benefits of buffered I/O, including ease of use and access to any data that is already cached, but without filling memory with unneeded cached data. The result, Axboe says, is readily observable:

With this, I can do 100% smooth buffered reads or writes without pushing the kernel to the state where kswapd is sweating bullets. In fact it doesn't even register.

This new flag thus seems like a significant improvement for a variety of workloads. In particular, workloads where it is known that the data will only be used once, or where the application performs its own caching in user space, may well benefit from running with the RWF_UNCACHED flag.

The implementation of this new behavior is not complicated; the entire patch set (which also adds support to io_uring) involves just over 200 lines of code. Of course, as Dave Chinner pointed out, there is something missing: all of the testing infrastructure needed to ensure that RWF_UNCACHED behaves as expected and does not corrupt data. Chinner also noted some performance issues in the write implementation, suggesting that an entire I/O operation should be flushed out at a time rather than the page-at-a-time approach taken in the original patch set. Axboe has already reworked the code to address that problem; the boring work of writing tests and precisely documenting semantics will follow at some future point.

If RWF_UNCACHED proves to work as well in real-world workloads as it has in these early tests, it may eventually be seen as one of those things that somebody should have thought of many years ago. Things often turn out this way. Solving the problem isn't hard; the hard part is figuring out which problem needs to be solved in the first place. That, and writing tests and documentation, of course.

Comments (28 posted)

Explicit pinning of user-space pages

By Jonathan Corbet
December 13, 2019
The saga of get_user_pages() — and the problems it causes within the kernel — has been extensively chronicled here; see the LWN kernel index for the full series. In short, get_user_pages() is used to pin user-space pages in memory for some sort of manipulation outside of the owning process(es); that manipulation can sometimes surprise other parts of the kernel that think they have exclusive rights to the pages in question. This patch series from John Hubbard does not solve all of the problems, but it does create some infrastructure that may make a solution easier to come by.

To simplify the situation somewhat, the problems with get_user_pages() come about in two ways. One of those happens when the kernel thinks that the contents of a page will not change, but some peripheral device writes new data there. The other arises with memory that is located on persistent-memory devices managed by a filesystem; pinning pages into memory deprives the filesystem of the ability to make layout changes involving those pages. The latter problem has been "solved" for now by disallowing long-lasting page pins on persistent-memory devices, but there are use cases calling for creating just that kind of pin, so better solutions are being sought.

Part of the problem comes down to the fact that get_user_pages() does not perform any sort of special tracking of the pages it pins into RAM. It does increment the reference count for each page, preventing it from being evicted from memory, but pages that have been pinned in this way are indistinguishable from pages that have acquired references in any of a vast number of other ways. So, while one can ask whether a page has references, it is not possible for kernel code to ask whether a page has been pinned for purposes like DMA I/O.

Hubbard's patch set addresses the tracking part of the problem; it starts by introducing some new internal functions as alternatives to get_user_pages() and its variants:

    long pin_user_pages(unsigned long start, unsigned long nr_pages,
                        unsigned int gup_flags, struct page **pages,
                        struct vm_area_struct **vmas);
    long pin_user_pages_remote(struct task_struct *tsk, struct mm_struct *mm,
                               unsigned long start, unsigned long nr_pages,
                               unsigned int gup_flags, struct page **pages,
                               struct vm_area_struct **vmas, int *locked);
    int pin_user_pages_fast(unsigned long start, int nr_pages,
                            unsigned int gup_flags, struct page **pages);

From the caller's perspective, these new functions behave just like the get_user_pages() versions. Switching callers over is just a matter of changing the name of the function called. Pages pinned in this way must be released with the new unpin_user_page() and unpin_user_pages() functions; these are a replacement for put_user_page(), which was introduced by Hubbard earlier in 2019.

The question of how a developer should choose between get_user_pages() and pin_user_pages() is somewhat addressed in the documentation update found in this patch. In short, if pages are being pinned for access to the data contained within those pages, pin_user_pages() should be used. For cases where the intent is to manipulate the page structures corresponding to the pages rather than the data within them, get_user_pages() is the correct interface.
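For a driver that pins user memory so a device can write data into it, the conversion amounts to renaming the calls. The following kernel-style sketch is not taken from any real driver; the helper names are invented, error handling is minimal, and the _fast variant is used for simplicity. It just shows the pairing of pin and unpin described above:

    /*
     * Sketch only: a driver-like helper that pins a user buffer whose
     * data will be accessed by a device, then releases it later.
     */
    #include <linux/mm.h>
    #include <linux/slab.h>

    struct user_buffer {
            struct page **pages;
            int nr_pages;
    };

    static int map_user_buffer(struct user_buffer *ub, unsigned long uaddr,
                               int nr_pages)
    {
            int pinned;

            ub->pages = kcalloc(nr_pages, sizeof(*ub->pages), GFP_KERNEL);
            if (!ub->pages)
                    return -ENOMEM;

            /* The device will access the data itself, so pin_user_pages*()
             * is the right interface rather than get_user_pages*(). */
            pinned = pin_user_pages_fast(uaddr, nr_pages, FOLL_WRITE, ub->pages);
            if (pinned < 0) {
                    kfree(ub->pages);
                    return pinned;
            }
            ub->nr_pages = pinned;
            return 0;
    }

    static void unmap_user_buffer(struct user_buffer *ub)
    {
            /* Pages pinned with pin_*() must be released with unpin_*(). */
            unpin_user_pages(ub->pages, ub->nr_pages);
            kfree(ub->pages);
            ub->pages = NULL;
            ub->nr_pages = 0;
    }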

The new functions inform the kernel about the intent of the caller, but there is still the question of how pinned pages should be tracked. Some sort of reference count is required, since a given page might be pinned multiple times and must remain pinned until the last user has called unpin_user_pages(). The logical place for this reference count is in struct page, but there is a little problem: that structure is tightly packed with the information stored there now, and increasing its size is not an option.

The solution that was chosen is to overload the page reference count. A call to get_user_pages() will increase that count by one, pinning it in place. A call to pin_user_pages(), instead, will increase the reference count by GUP_PIN_COUNTING_BIAS, which is defined in patch 23 of the series as 1024. Kernel code can now check whether a page has been pinned in this way by calling page_dma_pinned(), which simply needs to check whether the reference count for the page in question is at least 1024.
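In concept the check is straightforward. The sketch below is not the kernel's actual implementation (which must also cope with compound pages and concurrent reference-count updates), but it captures the biased-counting scheme described above; the sketch_ function names are invented for illustration:

    /*
     * Simplified sketch of the biased reference-count scheme; the real
     * kernel code must also handle compound pages and racing updates.
     */
    #include <linux/page_ref.h>

    #ifndef GUP_PIN_COUNTING_BIAS
    #define GUP_PIN_COUNTING_BIAS 1024
    #endif

    static inline void sketch_pin_page(struct page *page)
    {
            /* pin_user_pages(): each pin adds the full bias to the count. */
            page_ref_add(page, GUP_PIN_COUNTING_BIAS);
    }

    static inline void sketch_unpin_page(struct page *page)
    {
            /* unpin_user_page(): drop the bias added by the pin. */
            page_ref_sub(page, GUP_PIN_COUNTING_BIAS);
    }

    static inline bool sketch_page_dma_pinned(struct page *page)
    {
            /*
             * Ordinary references add only 1 each, so a count of at least
             * 1024 almost certainly means a DMA pin is outstanding; a page
             * with 1024+ ordinary references is a harmless false positive.
             */
            return page_ref_count(page) >= GUP_PIN_COUNTING_BIAS;
    }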

Using the reference count in this way does cause a few little quirks. Should a page acquire 1024 or more ordinary references, it will now appear to be pinned for DMA. This behavior is acknowledged in the patch set, but is not seen as a problem; false positives created in this way should not adversely affect the behavior of the system. A potentially more serious issue has to do with the fact that the reference count only has 31 bits of space; with a bias of 1024 (2¹⁰), that means that only 21 bits are available for counting pins. That might be considered to be enough for most uses, but pinning a compound page causes the head page to be pinned once for each of the tail pages. A 1GB compound page contains 262,144 (256K) 4KB pages, so such a page could only be pinned eight times before the reference count overflows.

The solution to that problem, Hubbard says, is to teach get_user_pages() (and all the variants) about huge pages so that they can be managed with a single reference count. He notes that "some work is required" to implement this behavior, though, so it might not happen right away; it is certainly not a part of this patch set which, at 25 individual patches, is already large enough.

There is one other little detail that isn't part of this set: how the kernel should actually respond to pages that have been pinned in this way. Or, as Hubbard puts it: "What to do in response to encountering such a page is left to later patchsets". One possibility can be found in the layout lease proposal from Ira Weiny, which would provide a mechanism by which long-term-pinned pages could be unpinned when the need arises. There is not yet a lot of agreement on how such a mechanism should work, though, so a full solution to the get_user_pages() problem is still a somewhat distant prospect. Expect it to be a topic for more heated discussion at the 2020 Linux Storage, Filesystem, and Memory-Management Summit.

Meanwhile, though, the kernel may have at least gained a mechanism by which pinned pages can be recognized and tracked, which is a small step in the right direction. These patches have been through a number of revisions and look poised to enter Andrew Morton's -mm tree sometime in the near future. That would make their merging for the 5.6 kernel a relatively likely prospect.

Comments (12 posted)

A year-end wrap-up from LWN

By Jonathan Corbet
December 18, 2019
2019 is coming to a close. It has been another busy year with a lot going on in the Linux and free-software communities. Here at LWN, we have a longstanding tradition of looking back at the predictions made in January to see just how badly we did; it's not good to go against tradition no matter how embarrassing the results might be, so we might as well get right into it.

Visionary?

The 50th anniversary of Unix happened just as predicted; your editor is looking like a true visionary so far. That prediction also suggested that we might see "interesting work in alternative operating-system models" this year. Whether the development of systems like Fuchsia or seL4 qualifies is a matter of perspective. One could also observe, as Toke Høiland-Jørgensen recently did, that "the Linux kernel continues its march towards becoming a BPF runtime-powered microkernel" and conclude that the most viable alternative to the Unix model is developing right under our noses.

The prediction that there would be more hardware vulnerabilities was no less obvious back in January. Holes like MDS, SWAPGS, and TSX async abort duly put in an appearance. It seems unlikely that we are done at this point. A minor consolation might be found in the fact that, by most accounts, communications between the kernel community and hardware vendors regarding these vulnerabilities have improved as predicted.

Did kernel development become more formalized, as we thought might happen in January? Certainly there have been discussions around workflow issues and Change IDs that would point in that direction, as does the increased emphasis on automated testing. One might argue that the kernel community grows up far too slowly, but things do change over time. The suggestion that projects would continue to transition away from the patches-over-email model ties into this as well; even the kernel community is talking about it, though any such change still seems distant at the end of 2019.

Issues with the supportability of BPF APIs did arise as predicted, but the statement that "more kernel APIs will be created for BPF programs rather than exported as traditional system calls" has not been fully borne out. That doesn't mean that we aren't seeing interesting APIs being created for BPF; for example, it may soon be possible to write TCP congestion-control algorithms as BPF programs.

Did somebody try to test out the kernel's code-of-conduct as predicted? As of November 30, there had been no code-of-conduct events in the last three months, and only minor events before. That prediction, happily, has not worked out. Thus far, it seems that the code of conduct may actually have succeeded in making the kernel community a nicer place without the need for any serious enforcement efforts.

Whether we are seeing an increase in differentiation between distributions as predicted is unclear. There is clearly a growing divide between those that support systemd and those that do not; the Debian project is trying to decide where it fits on that divide as this is written. Fedora is working to prepare for the future with initiatives like Silverblue and Modularity. All distributors are trying to figure out how they fit in with the increasing popularity of language-specific package repositories — a concern that probably feels more pressing than differentiation from other distributions.

Your editor predicted that there would be more high-profile acquisitions of Linux companies in 2019; inspired by the purchase of Red Hat by IBM, the article also suggested that Canonical might finally be sold. Not looking quite so visionary now.

The Python community did indeed complete its transition to a post-Guido governance model, and recently held its second steering-council election. Is the Python 3 transition a memory as predicted? Perhaps so; certainly there are fewer discussions on the topic than there once were. For many, though, it remains a relatively vivid and unpleasant memory.

There are definitely groups out there trying to come up with new licensing models, as predicted. In January it seemed like these efforts would mostly be driven by companies trying to monetize their projects, but the emphasis appears to have shifted to attempts to drive other agendas. Thus we saw the Twente License that requires observance of human rights, the Cryptographic Autonomy License with its prohibition against locking up user data, the Vaccine License making rights available only to the vaccinated, and the CasperLabs Open Source License, which adds all kinds of complexity for unclear reasons. As a general rule, these are not truly open-source licenses and they are thus not going far, regardless of whether one agrees with their objectives.

The crypto wars have not yet returned as predicted, but there are a number of chilly breezes suggesting that the right to use strong encryption may yet come under serious threat.

The web-browser monopoly may not have gotten much worse over the past year, but that is mostly because there isn't room for things to get much worse. Chrome dominates the market, with other browsers relegated to single-digit-percentage usage shares. That causes site developers to not care about making sites work with anything but Chrome (and maybe Safari), forcing even dedicated users of other browsers to launch Chrome to make specific sites work. To those of us who lived through the period of Internet Explorer dominance, much of this looks discouragingly familiar.

The final prediction worried that free software has increasingly become a way for companies to develop software efficiently while depriving competitors of license revenues. Certainly there are companies that see free software that way. That is not an entirely bad thing; a lot of new software and a lot of development jobs result from companies working from this viewpoint. But there is more to free software than that, and many members of our community continue to work toward a vision of a world that is more free and more secure. We can only try to support those efforts, and that vision, as well as we can.

Events not foreseen

The other side of evaluating predictions is looking at what was missed. One obvious omission was the leadership transition at the Free Software Foundation as the result of Richard Stallman being forced out. In retrospect, it seems clear that a change had to happen at some point, but saying when it might occur is always hard ahead of the actual event.

On the kernel front, the many-year effort to get lockdown capability into the kernel finally came to fruition; that is another event that had to happen sometime, but your editor didn't expect it in 2019. Even more unpredictable was the pidfd API, which seemingly came into existence, fully formed, after the beginning of the year, though rumblings were certainly evident before then.

Another surprise was the openSUSE project's decision to separate from SUSE and form an independent foundation. This move is being made partly to make it easier for openSUSE to seek support from multiple sources, but the project is still likely to be dependent on SUSE for some time — perhaps indefinitely. The process of negotiating the project's future relationship — and the use of the openSUSE name, which the project elected to retain — with its former corporate owner is likely to be complex. One can only wish openSUSE well as it charts its course going forward.

Closing another year

In 2019, the LWN crew produced 50 Weekly Editions containing 266 feature articles and 56 articles from 16 guest authors. We reported from 26 conferences hosted on four continents — something that was made possible by our ongoing travel support from the Linux Foundation. It has been a satisfying year, but it's fair to say that we are ready for a break.

We cannot sign off, though, without acknowledging the force that keeps this whole operation going: you, our readers. Advertising revenue hit a new low this year but, thanks to you, we have not been overly dependent on advertising for many years. Subscriber numbers are down a bit since last year, which can mostly be attributed to some of the people who came in for the Meltdown/Spectre coverage opting not to renew. With a slightly longer perspective, it is clear that our base of support remains solid, and for that we are extremely grateful. We wouldn't be here without you.

On that note, we'll sign off for the year; we look forward to resuming next year with the January 2 Weekly Edition. As always, the lights will not go completely out between now and then; be sure to check in for the occasional article and update. Meanwhile, we wish all of our readers a great holiday season.

Comments (38 posted)

Page editor: Jonathan Corbet


Copyright © 2019, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds