Leading items
Welcome to the LWN.net Weekly Edition for May 25, 2017
This week's edition contains a fair amount of content from the 2017 Python Language Summit, an event that LWN has been privileged to attend for three years now. Beyond that, we have some solid kernel reporting and more:
- Python 3.6.x, 3.7.0, and beyond: what is the status of current and future Python releases?
- New CPython workflow issues: the CPython community has recently moved its development activity to GitHub. The architect of that move talked about the current status and issues that have yet to be resolved.
- The state of bugs.python.org: a brief discussion of the Python bug tracker.
- Progress on the Gilectomy: Python famously serializes access to the interpreter in multi-threaded programs; what is the status of the work to fix that problem?
- The trouble with SMC-R: a new network protocol was added to the kernel without input from a community of developers that feels it should have been consulted. How did that come to be, and what can be done about it?
- Revisiting "too small to fail": the recurring issue of whether the kernel's memory-management subsystem should allow small allocation requests to fail.
- Containers as kernel objects: an attempt to give the kernel a formal concept of what a "container" is runs into opposition.
- System monitoring with osquery: an interesting tool for obtaining information about a system using SQL queries.
This week's edition also includes these inner pages:
- Brief items: Brief news items from throughout the community.
- Announcements: Newsletters, conferences, security updates, patches, and more.
May 29 is the Memorial Day holiday in the US, so next week's edition will be published one day later than usual on June 2.
Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.
The 2017 Python Language Summit
The Python Language Summit is an annual invitational gathering of Python core developers along with others who need to work out larger issues within the development community. LWN has been lucky enough to sit in on the summit for the past two years and has coverage from this year's edition as well. Roughly 40 developers attended this year's summit. As with previous years, Larry Hastings and Barry Warsaw organized the summit, but there was a wardrobe addition this year: beyond the usual fez, each wore a "[LB]arry" shirt.
A group photo of the attendees is below; I took the picture using Kushal Das's camera.
Here are the sessions covered from the summit:
- Python 3.6.x, 3.7.0, and beyond: Release manager Ned Deily updated attendees on the various releases and where they stand.
- New CPython workflow issues: Brett Cannon discussed the new CPython development workflow since the switch to GitHub for hosting the repository.
- The state of bugs.python.org: A brief presentation from Maciej Szulik on some work that has been done on bugs.python.org recently, as well as some plans for the future.
- Progress on the Gilectomy: Larry Hastings's ambitious plan to remove the global interpreter lock (GIL) from CPython continues; Hastings updated attendees on where things stand.
- Keeping Python competitive: Victor Stinner is looking for ways to make Python faster by a factor of two so that it compares favorably to its language competitors.
- Trio and the future of asynchronous execution in Python: Nathaniel Smith has been working on the Trio asynchronous library and wanted to discuss what he has learned in that process.
- Python ssl module update: ssl module co-maintainer Christian Heimes gave a roundup of changes to the module along with some idea of the future plans for it.
- Classes and types in the Python typing module: Mark Shannon wanted to discuss the wisdom of making types out of classes in typing.
- Status of mypy and type checking: Jukka Lehtosalo updated attendees on type checking and the mypy static type checker.
- Lightning talks: Half a dozen talks on various subjects including MicroPython, beta releases, Python as a security vulnerability, and Jython.
Python 3.6.x, 3.7.0, and beyond
Ned Deily, release manager for the Python 3.6 and 3.7 series, opened up the 2017 edition of the Python Language Summit with a look at the release process and where things stand. It was an "abbreviated update" to his talk at last year's summit, he said. He looked to the future for 3.6 and 3.7, but also looked a bit beyond those two.
After a brief review of the Python development life cycle, he noted that 3.7 is in the feature development phase, while 3.6 is now in the maintenance phase. 3.6 was released in December 2016 and the first maintenance release, 3.6.1, was made in March; the next will be made in about a month as he is trying to do maintenance releases on a roughly three-month cadence.
The 3.5 branch is getting close to moving from maintenance mode to "security fixes only" mode. 3.5 release manager Larry Hastings will make that call; Hastings said there would be one more bug-fix release (3.5.4) before that happens. He will announce that it is the last ahead of time and is targeting June or July for that release, unless "something serious, like a zero-day" shows up before then. The 2.7 branch is still in maintenance mode and will be until 2020, when it will be retired, at least from the perspective of the core team, Deily said.
3.6 will remain in bug-fix mode until sometime late in 2018, some time after 3.7 is released. That means 3.6 will get bugs fixed, regressions addressed, and documentation updated, but no new features will be added unless there is a "serious compelling reason" to do so.
In September 2016, development on the 3.7 "branch" (it is really the master branch) began. It will last until January 2018. There are few restrictions on what goes into master over that time, though developers should "try not to break anything too badly". During the 3.6 cycle, new development started earlier than in previous cycles, overlapping the beta and release-candidate period; that seemed to work well and will be done again for 3.7, Deily said. Feature development ends at the release of the first beta (at which point it will pick up for 3.8).
3.7 Alpha releases will start in September and run through January. By the end of that, all new features should be in and "hopefully documented". The beta phase will run from January to May and only bug and regression fixes will be allowed during that time. At the end of the beta phase, the code is frozen and no more changes are allowed. The idea is that the first release candidate is the same as the final release; it is hoped that there are no problems that necessitate an rc2. The current plan is that 3.7 will be released on June 15, 2018.
The 3.8 work will begin in January. If the usual one-and-a-half year cycle between releases holds, 3.8 will be released in early 2020.
The key things that developers should focus on now are 3.6 bug fixes and documentation, 3.7 features and documenting them, and completing the migration to GitHub and the new development workflow. He reminded everyone that all of the changes for 3.7 should be fully complete by around the time of next year's PyCon (which will be held in May in Cleveland, Ohio).
Deily noted that 3.6 was a "great release" and thanked all of the contributors. One of the factors that made it such a good release was that a development sprint was held for it that was "incredibly productive". But, that sprint was organized at the last minute, so it wasn't possible to include everyone, he said. It was also right at the end of the development cycle, which caused the code cutoff to be pushed back a week or two to incorporate the changes; if there will be a sprint for 3.7, he suggested that it be done earlier in the cycle.
Guido van Rossum wanted to discuss what the 2.7 end-of-life date should be. He had said it would be supported until 2020, but never really pinned down a date in that year and now "people are freaking out for some reason". As he remembered, the plan was to set a date when it got closer—apparently, now is that time. In the end, Van Rossum settled on PyCon 2020 for the end of maintenance for 2.7; it is simply a "symbolic thing", since nothing actually happens on that date, as one attendee noted.
Since there was some mention of 3.8 in the session, Brett Cannon asked if that meant that Deily would be taking on the release manager role for 3.8 as well. Deily was quick to say that was definitely not the case. A release manager is needed, but not right away; if someone was chosen by January 2018, or even PyCon 2018, that would probably be just fine. He suggested that those interested talk it over with one of the other release managers to get an idea of what was needed.
[I would like to thank the Linux Foundation for travel assistance to Portland for the summit.]
New CPython workflow issues
As part of a discussion in 2014 about where to host some of the Python repositories, Brett Cannon was delegated the task of determining where they should end up. In early 2016, he decided that Python's code and other repositories (e.g. PEPs) should land at GitHub; at last year's language summit, he gave an overview of where things stood with a few repositories that had made the conversion. Since that time, the CPython repository has made the switch and he wanted to discuss some of the workflow issues surrounding that move at this year's summit.
He started by introducing himself as the "reason we are on GitHub"; "I'm sorry or I hope you're happy", he said with a chuckle. He wanted to focus the discussion on what's different in the workflow and what has been added (or will be) to try to make core developers more productive using the new workflow.
A bot to check whether a contributor has signed the Python contributor agreement (also known as the "CLA") was first up. Pull requests at GitHub are checked to ensure that the contributor has signed the form by "The Knights Who Say 'Ni!'", which labels the request based on the CLA status. It runs asynchronously and is not specifically tied to GitHub, Cannon said, so a switch to GitLab or elsewhere could be made some day if desired. There is also an "easter egg" in the bot regarding shrubbery, he noted.
It still takes one business day to process a newly signed CLA, so a contributor who has a pull request rejected because of the label is asked to wait that long and try again. Alternatively, any core developer can remove the CLA label, which will cause the bot to check again. There is an ongoing question about those who do not wish to agree to GitHub's terms of service and thus do not have a GitHub account. Some code contributed by a developer in that position was part of a pull request from another contributor, who credited the originator; the originator has not signed the CLA, however, so it is all in something of a "legal quagmire" at this point, Cannon said.
Another new bot is bedevere, which is meant to check pull requests for various kinds of problems. It currently checks to see if there is a reference to a bugs.python.org issue number in the title of the pull request and complains (with a reference to the Python Developer's Guide) if there is not. The issue numbers are now being placed into namespaces, in case the bug tracker changes down the road. So bedevere expects to find "bpo-NNNN" in the title (or "trivial").
The intent is to add more checks to bedevere to ensure the proper labels are present and that all of the status checks have passed for the pull request. Something that is not always being done for pull requests is to have reviewers approve changes made in response to their comments. Clicking to approve the changes is not a huge thing, he said, but it is handy and does get recorded by GitHub. In addition, references to previous pull requests should be done using the "GH-NNNN" namespace, which will still automatically link to the previous request, as the current "#NNNN" usage does. Checks for those kinds of things may be added as well.
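The title check could be sketched as a simple regular expression, something like the following hypothetical approximation (the actual pattern bedevere uses may well differ):

```python
import re

# Hypothetical sketch of the kind of title check described above;
# the real bot's exact pattern and behavior may differ.
TITLE_RE = re.compile(r"\bbpo-\d+\b|\btrivial\b", re.IGNORECASE)

def title_ok(title: str) -> bool:
    """True if a PR title references a bpo issue or is marked trivial."""
    return bool(TITLE_RE.search(title))
```

A title like "bpo-29763: Remove temporary files" would pass such a check, while "Fix typo in docs" would draw a complaint from the bot.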
Cannon then moved into the process to backport a pull request to an earlier version of Python. There are two ways to do it, he said: manually or the new, easier way. The manual way involves doing a "git cherry-pick" and then creating a pull request. The title of that request should have the branch target in it (e.g. "[3.6]") and the pull request ID from the original should be left in place. The "needs backport to X.Y" label should be removed from the pull request that was cherry-picked from.
The better way, though, is to use the cherry_picker tool that was recently written by Mariatta Wijaya. It automates most of the cherry-picking process. Developers just need to click one green button to create the pull request, then go remove the backport label from the original. The hope is that the label removal will be automated soon as well.
Plans
The goal of the GitHub switch has always been to make the workflow better, Cannon said, not just comparable to what was there before. The switch is now just three months old, which "feels like forever for me", so there are still some things to work through to make the workflow better than it was.
One area that has been problematic for some time is the maintenance of the Misc/NEWS file that contains small blurbs about changes made for a release. It is often the source of trivial, but annoying, merge conflicts, especially for backports. Moving forward, Larry Hastings has developed a blurb tool that will walk developers through the process of creating NEWS entries. Those entries will be stored as standalone files, with subdirectories for different sections of the file, and blurb can create the NEWS file on demand from this hierarchy.
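The idea of building the NEWS file on demand from per-change entry files can be sketched in Python; this is a loose approximation, assuming a hypothetical layout of one .rst file per change under a subdirectory for each section (the real blurb tool's layout, naming, and formatting differ in detail):

```python
from pathlib import Path

# Rough sketch (not the real blurb tool): each change gets its own
# entry file, grouped into per-section subdirectories, and the NEWS
# text is assembled from those files on demand.
def build_news(root: str) -> str:
    sections = []
    for section_dir in sorted(Path(root).iterdir()):
        if not section_dir.is_dir():
            continue
        entries = [p.read_text().strip()
                   for p in sorted(section_dir.glob("*.rst"))]
        if entries:
            title = section_dir.name
            body = "\n".join(f"- {e}" for e in entries)
            sections.append(f"{title}\n{'-' * len(title)}\n\n{body}")
    return "\n\n".join(sections)
```

Because each change lives in its own file, two backported changes no longer touch the same lines of a shared NEWS file, which is what eliminates the merge conflicts.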
There is a need for a bot to automate the backporting of pull requests based on the labels they contain. It could do the cherry-picking and create the pull request for each branch. Currently, he has four people who want to help work on the problem, so he needs to get them all working on the same code base. A more explicit link in pull requests to the bugs.python.org issue is desired as well. The current plan is to modify bedevere to add the link in the body of the pull request. He thinks that adding a message to each pull request for the link might get too noisy.
There has been some thought of automating the creation of the Misc/ACKS file, which lists all of the contributors to Python. He is not sure if anyone really cares about automating that, nor if it is particularly important, but it could perhaps be done. It could lead to the creation of something similar to thanks.rust-lang.org, he said. Grabbing the Git committer information from pull requests to go into ACKS would be a start.
Another thing that is needed is a way for reviewers to know when they should re-review a pull request. He suggested that bedevere could be changed to add a "[WIP]" (work in progress) tag to the request, which the contributor could remove when they are ready for review; they could also leave a comment that mentions the reviewer. But Guido van Rossum suggested handling it the way Phabricator, which he uses at Dropbox, does: developers do not push their changes until they are ready for review. Others pointed out that this may not work, since code has to be pushed for the automated tests to run, so the developer may not know whether the code is ready until after pushing. Cannon said that he had just begun to think about the problem, but it is clear that something is needed.
Another area that may need attention is a way to close stale pull requests. It is not a big problem yet, but may become one, he said. There could be a bot that measures the number of days since the last change and automatically closes those that pass some threshold; the same could be done with those lacking a CLA signature. That code shouldn't really even be viewed until the CLA has been signed.
All are welcome to participate in the core-workflow project, Cannon said. There is an issue tracker on GitHub and a core-workflow mailing list that those interested should investigate.
Ned Deily asked if anyone had worked out a good strategy for dealing with GitHub email. As release manager, he would like to be able to see everything that is going on, which he used to be able to do, but is rather daunting for GitHub. There is someone working on an email bot, Cannon said, which may help. Barry Warsaw suggested sending all of the email to a mailing list that could be archived and put on Gmane, which Deily said would work as long as the threading was handled correctly.
Cannon ended the discussion by saying that he hopes to get to a point where the workflow is no longer the reason that core developers don't have the time to review pull requests. If folks simply don't have the time, that's one thing, but it should not be the workflow that holds them back. The CPython project currently has a delta of four additional pull requests every day and it would be great if the workflow improvements helped that number decline.
[I would like to thank the Linux Foundation for travel assistance to Portland for the summit.]
The state of bugs.python.org
In a brief session at the 2017 Python Language Summit, Maciej Szulik gave an update on the state and plans for bugs.python.org (bpo). It is the Roundup-based bug tracker for Python; moving to GitHub has not changed that. He described the work that two Google Summer of Code (GSoC) students have done to improve the bug tracker.
GSoC student Shiyao Ma worked on Docker-izing the bugs.python.org environment in 2015. That allows developers to run their own version of the bug tracker locally to test changes. Szulik showed a fairly simple Docker command to run the bug tracker in a local container. Unfortunately, it is actually not quite that simple at this point as there are a number of other pieces that need to be installed and configured first. He would like to see that be resolved and for a Docker image to be made available on Docker Hub.
In 2016, Ashish Shah did a GSoC project to help integrate bugs.python.org with GitHub in support of Python's switch to that site. Pull requests can be linked to bugs at bpo using the "bpo-NNNN" tag in various contexts. The tag can appear in the title, the initial pull request description, or in the comments on the request. It is currently limited to ten bugs per pull request, though that is a fairly arbitrary limit. Beyond that, bugs can be closed from merges using a simple regular expression ("close[sd]?|closing bpo-NNNN") on the text of the merge.
For the future, Szulik plans to personally work on a few different things. Right now, Google OAuth integration with bpo is "partly broken" and he will be fixing that. Similarly, he would like to add GitHub OAuth integration so that those with a GitHub account can directly work with the bugs on bpo. Lastly, there is a need to be able to escalate bugs directly to vendors (e.g. Linux or Python distributors) if they are in some way vendor-specific. There is currently discussion about how that all would work, but it would be a useful addition to bpo, he said.
[I would like to thank the Linux Foundation for travel assistance to Portland for the summit.]
Progress on the Gilectomy
At the 2016 Python Language Summit, Larry Hastings introduced Gilectomy, his project to remove the global interpreter lock (GIL) from CPython. The GIL serializes access to the Python interpreter, so it severely limits the performance of multi-threaded Python programs. At the 2017 summit, Hastings was back to update attendees on the progress he has made and where Gilectomy is headed.
He started out by stating his goal for the project. He wants to be able to run existing multi-threaded Python programs on multiple cores. He wants to break as little of the existing C API as possible. And he will have achieved his goal if those programs run faster than they do with CPython and the GIL—as measured by wall time. To that end, he has done work in four areas over the last year.
He noted that "benchmarks are impossible" by putting up a slide that showed the different CPU frequencies that he collected from his system. The Intel Xeon system he is using is constantly adjusting how fast the cores run for power and heat considerations, which makes it difficult to get reliable numbers. An attendee suggested he look into CPU frequency pinning on Linux.
Atomicity and reference counts
CPU cores have a little bus that runs between them, which is used for atomic updates among other things, he said. The reference counts used by Gilectomy's garbage collection are frequently updated with atomic increment and decrement instructions, which causes a performance bottleneck because of the inter-core traffic needed to ensure cache consistency.
So he looked for another mechanism to maintain the reference counts without all of that overhead. He consulted The Garbage Collection Handbook, which had a section on "buffered reference counting". The idea is to push all of the reference count updating to its own thread, which is the only entity that can look at or change the reference counts. Threads write their reference count changes to a log that the commit thread reads and reflects those changes to the counts.
That works, but there is contention for the log between the threads. So he added a log per thread, but that means there is an ordering problem between operations on the same reference count. It turns out that three of the four possible orderings can be swapped without affecting the outcome, but an increment followed by a decrement needs to be done in order. If the decrement is processed first, it could reduce the count to zero, which might result in the object being garbage collected even though there should still be a valid reference.
He solved that with separate increment and decrement logs. The decrement log can only be processed after all of the increments. This implementation of buffered reference counting has been in Gilectomy since October and is now working well. He did some work on the Py_INCREF() and Py_DECREF() macros that are used all over the CPython code; the intent was to cache the thread-local storage (TLS) pointer and reuse it over multiple calls, rather than looking it up for each.
Buffered reference counts have a weakness: they cannot provide realtime reference counts. It could be as long as a second or two before the reference count actually has the right value. That's fine for most code in Gilectomy, because that code cannot look at the counts directly.
But there are places that need realtime reference counts, the weakref module in particular. Weak references do not increment the reference count but can be used to reference an object (e.g. for a cache) until it is garbage collected because it has no strong references. Hastings tried to use a separate reference count to support weakref, but isn't sure that will work. Mark Shannon may have convinced him that resurrecting objects in __del__() methods will not work under that scheme; it may be a fundamental limitation that might kill Gilectomy, Hastings said.
More performance
Along the way, he came to the conclusion that the object allocation routines in obmalloc.c were too slow. The object allocation scheme has different classes for different sizes of objects, so he added per-class locks. When that was insufficient, he added two kinds of locks: a "fast" lock for when an object exists on the free list and a "heavy" lock when the allocation routines need to go back to the pool for more memory. He also added per-thread, per-class free lists. As part of that work, he added a fair amount of statistics-gathering code but went to some lengths to ensure that it had no performance impact when it was disabled.
There are lots of places where things are being pulled out of TLS; profiling the code showed 370 million calls to get the TLS pointer over a seven-to-eight-second run of his benchmark. To minimize that, he has added parameters to pass the TLS pointer down into the guts of the interpreter.
An attendee asked if it made sense to do that for the CPython mainline, but Hastings pointed out those calls come from what he has added; CPython with a GIL does not have that performance degradation. Another attendee thought it should only require one assembly instruction to get the TLS pointer and that there is a GCC extension to use that. Hastings said that he tried that, but could not get it to work; he would be happy to have help as it should be possible to make it faster.
The benchmark that he always uses is a "really bad recursive Fibonacci". He showed graphs of how various versions of Gilectomy fare versus CPython. Gilectomy is getting better, but is still well shy of CPython speed in terms of CPU time. But that is not what he is shooting for; when looking at wall time, the latest incarnation of Gilectomy is getting quite close to CPython's graph line. The "next breakthrough" may show Gilectomy as faster than CPython, he said.
Next breakthrough
He has some ideas for ways to get that next breakthrough. For one, he could go to a fully per-thread object-allocation scheme. Thomas Wouters suggested looking at Thread-Caching Malloc (TCMalloc), but Hastings was a bit skeptical. The small-block allocator in Python is well tuned for the language, he said. But Wouters said that tests have been done and TCMalloc is no worse than Python's existing allocator, but has better fragmentation performance and is multi-threaded friendly. Hastings concluded that it was "worth considering" TCMalloc going forward.
He is thinking that storing the reference count separate from the object might be an improvement performance-wise. Changing object locking might also improve things, since most objects never leave the thread they are created in. Objects could be "pre-locked" to the thread they are created in and a mechanism for threads to register their interest in other threads' objects might make sense.
The handbook he consulted for buffered reference counting otherwise says little about reference counting; it is mostly focused on tracing garbage collection. So one thought he has had is to do a "crazy rewrite" of the Python garbage collector. That would be a major pain and break the C API, but he has ideas on how to fix that as well.
Guido van Rossum thought that working on a GIL-less Python and C API would be much easier in PyPy (which has no GIL), rather than CPython. Hastings said that he thought having a multi-threaded Python would be easier to do using CPython. Much of the breakage in the C API simply comes from adding multi-threading into the mix at all; if you want multi-core performance, those things are going to have to be fixed no matter what.
But Van Rossum is concerned that all of the C-based Python extensions will be broken in Gilectomy. Hastings thinks that overstates things and has some ideas on how to make things better. Someone had suggested only allowing one thread into a C extension at a time (so, a limited GIL, in effect), which might help.
The adoption of PyPy "has not been swift", Hastings said; he thinks that since CPython is the reference implementation of Python, it will be the winner. He does not know how far he can take Gilectomy, but he is sticking with it; he asked Van Rossum to "let me know if you switch to PyPy". But Van Rossum said that he is happy with CPython as it is. On the other hand, Wouters pointed out one good reason to stick with experimenting with CPython; since the implementation is similar to what the core developers are already knowledgeable about, they will be able to offer thoughts and suggestions.
Hastings also gave a talk about Gilectomy status a few days later at PyCon; a YouTube video is available for those interested.
[I would like to thank the Linux Foundation for travel assistance to Portland for the summit.]
The trouble with SMC-R
Among the many features merged for the 4.11 kernel was the "shared memory communications over RDMA" (SMC-R) protocol from IBM. SMC-R is a high-speed data-center communications protocol that is claimed to be much more efficient than basic TCP sockets. As it turns out, though, the merging of this code was a surprise — and an unpleasant one at that — to a relevant segment of the kernel development community. This issue and the difficulties in resolving it are an indicator of how the increasingly fast-paced kernel development community can go off track.
The patch set that was eventually merged (via the networking tree) for 4.11 claims a decrease in CPU consumption of up to 60% over basic TCP sockets. The protocol is designed in such a way that existing TCP applications can be made to use it simply by linking them against a special library — no code changes required. On the other hand, it requires bypassing much of the network stack (including firewalls, monitoring, and traffic control) and shorting out the code that tries to keep the networking layer from creating too much memory pressure. In many settings, that may be a price that users are willing to pay.
The problem, as raised by Christoph Hellwig on May 1, is that this RDMA-based protocol was merged without any input from the RDMA development community; it was never posted to the linux-rdma mailing list. Once the RDMA developers took a look at it, they found a number of things to dislike. SMC-R adds a new API, rather than using the existing RDMA APIs, for example. It has no support for IPv6, and the fact that it defines its own AF_SMC address family makes it unclear how an application could ever specify whether it wanted IPv6 or not. (It's worth noting that missing IPv6 support has blocked other protocol implementations in the past). There is also a significant security issue with SMC-R, in that it opens read/write access to all of memory from a remote system.
The RDMA developers, being less than pleased with all of this and feeling that they should have been consulted prior to the merging of SMC-R, want to do something about it. But what can actually be done is not entirely clear at this point. Hellwig posted a patch marking the subsystem as "broken" and adding a strong warning about the security issue, but that patch has not yet been merged and probably never will be in that form.
Networking maintainer David Miller responded that Hellwig was being "overbearing" by trying to mark SMC-R as being broken, and added that there is no possibility of changing the API before it develops users: "The API is out there already so we are out of luck, and neither you nor I nor anyone else can 'stop' this from happening". SMC-R, in other words, is a fait accompli that cannot be removed at this point.
RDMA maintainer Doug Ledford disagreed, noting that 4.11 has only been out since the end of April and has almost certainly not appeared in distributions yet. The "standard" that defines this protocol (RFC 7609) is, he pointed out, just an informational posting from IBM without actual standard status. There is nothing, he said, that prevents recalling SMC-R at this time. For now, Miller has applied a version of Hellwig's patch that removes the "broken" marker but keeps the security warning. Ledford thinks, though, that the option of marking SMC-R broken (or moving it to staging) should still be on the table.
Ledford, along with others, also complained loudly that this subsystem was merged without having ever been brought to the attention of the RDMA mailing list. Miller fired back that he had explicitly tried to slow the progress of this patch set in the hope that it would get some substantive reviews, but "I can't push back on people with silly coding style and small semantic issues forever". He complained that evidently nobody from the RDMA community is following the netdev mailing list, which is where the patches were posted.
The discussion went around a bit on whether Miller should have asked the SMC-R submitters to copy their patches to the linux-rdma list as well, without any real agreement being reached.
The reason that there are no RDMA developers on netdev, despite the obvious overlap between RDMA and networking, is an old story: the traffic on netdev (150-200 messages per day) has reached a level where the RDMA developers feel they simply cannot keep up with it. Developers used to say the same thing about linux-kernel, before everybody simply gave up on it altogether. As the community grows and the patch volume increases, this type of process-scalability issue will move downward through the subsystem hierarchy. Developers stop keeping up with relevant discussions because they cannot read all that email and still have time to actually get some development done.
Ledford proposed a solution of sorts for the problem of email volume: split netdev into separate lists for core networking, Ethernet drivers, and "netdev-packet". Ironically, that is likely to make the sort of communication issue that led to this discussion worse; as the development community segregates itself into increasingly specialized lists, communication across the community as a whole will be reduced. In a small town, everybody knows what everybody else is up to; that is not true in a large city. The kernel project resembles an increasingly large city in this regard.
This fracturing of the kernel community has been evident for at least two decades; it is likely to present significant scalability issues if the kernel project continues to grow. For the time being, the SMC-R issue appears to be headed toward a resolution, with the RDMA developers seeing a path by which the problems in the protocol and its implementation can be addressed. But this will certainly not be the last time that the development community is tripped up as a result of developers not being able to keep up with what their colleagues are doing.
Revisiting "too small to fail"
Back in 2014, the revelation that the kernel's memory-management subsystem would not allow relatively small allocation requests to fail created a bit of a stir. The discussion has settled down since then, but the "too small to fail" rule still clearly creates a certain amount of confusion in the kernel community, as is evidenced by a recent discussion inspired by the 4.12 merge window. It would appear that the rule remains in effect, but developers are asked to act as if it did not.
At the start of the 2014 discussion, memory-management developer Michal Hocko described the "unwritten rule" that small allocations never fail. "Small" is determined by the kernel's PAGE_ALLOC_COSTLY_ORDER constant, which is generally set to three; that puts the threshold at eight pages (an order-3 allocation), or 32KB on systems with 4KB pages. Almost all memory allocations in the kernel are smaller than that (much effort has gone into keeping most of them no larger than a single page), so the end result is that memory allocation attempts almost never fail.
That created some unhappiness for a couple of reasons. One is that kernel developers have been told since the beginning that any memory allocation can fail, so they have been carefully writing failure-recovery paths that will never be used. This policy can also lead the kernel to do unpleasant things — such as summoning the dreaded out-of-memory killer — rather than fail a request, even if the requesting code is prepared to deal gracefully with an allocation failure. Proposals to change this policy have always foundered on the fear that enabling allocation failures would expose bugs throughout the kernel. The bulk of that failure-recovery code may have never been executed — or it may not exist at all. So the "too small to fail" behavior remains in place.
Trond Myklebust's NFS client fixes pull request included a line item reading: "Cleanup and removal of some memory failure paths now that GFP_NOFS is guaranteed to never fail". The description was inaccurate: the code in question is using a mempool, which pre-allocates memory and, if used properly, can indeed guarantee that allocation failures will not occur. But it was enough to prompt Nikolay Borisov to ask whether success was truly guaranteed. If so, there would be an opportunity to clean up a lot of unneeded error-handling code throughout the kernel. Hocko replied that, while "small allocations never fail _practically_", the behavior is in no way guaranteed and that removing checks for allocation failures is "just wrong".
Myklebust was not entirely pleased with that response; he asked for a clear statement that small allocation requests can fail. He did not get one.
The status quo — telling developers to be prepared for allocation failures while not actually failing allocation requests — is less than pleasing for many involved in these discussions. In many parts of the kernel, error handling makes up a large portion of the total amount of code. This code can be tricky to write and even trickier to test; it can be frustrating to be asked to do this work to prepare for a situation that is not ever going to happen.
The memory-management developers cannot just change this behavior, though. There can be little doubt that, in a kernel with thousands of never-executed, never-tested error-handling paths, some of those paths will contain bugs. Auditing the kernel and validating all of those paths would not be a small task, to put it mildly; it may not be feasible at all. What can be done is to validate and fix the code one piece at a time. This is how the big kernel lock (BKL) was finally removed in 2011. That job proceeded by getting rid of the BKL dependencies in one small bit of code at a time until, eventually, nothing needed it anymore. It took many years, but it got the job done.
In the case of memory-allocation failures, validating code will not always be easy. The fault injection framework can be used to force allocation errors, though, which can help in the testing of recovery paths. For code that is deemed to be properly prepared, the no-fail behavior can be turned off in any given allocation request by adding the __GFP_NORETRY flag; this has been done for roughly 100 allocation calls in the 4.12-rc1 kernel. Whether that flag will spread to larger parts of the kernel remains to be seen; as with the BKL removal, it will probably require the help of a group of developers who are willing to put a lot of time into the task.
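As a sketch of what opting out looks like (kernel-style code, not buildable outside the kernel tree; the surrounding function and variables are implied):

```c
/*
 * __GFP_NORETRY turns off the allocator's implicit retry loop, so a
 * small allocation really can fail here rather than invoking the
 * OOM killer; the caller handles the failure itself.
 */
buf = kmalloc(size, GFP_KERNEL | __GFP_NORETRY);
if (!buf)
	return -ENOMEM;	/* this failure path actually gets exercised */
```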
The kernel community makes internal API changes on a regular basis; most of the time, it is a simple matter of a bunch of editing work or a Coccinelle script. But subtle semantic changes are harder, and eliminating the too-small-to-fail behavior certainly qualifies as that kind of change. The longer it remains, the more entrenched it is likely to become, but there are no signs that it will be able to change anytime soon.
Containers as kernel objects
The kernel has, over the years, gained comprehensive support for containers; that, in turn, has helped to drive the rapid growth of a number of containerization systems. Interestingly, though, the kernel itself has no concept of what a container is; it just provides a number of facilities that can be used in the creation of containers in user space. David Howells is trying to change that state of affairs with a patch set adding containers as a first-class kernel object, but the idea is proving to be a hard sell in the kernel community.

Containers can be thought of as a form of lightweight virtualization. Processes running within a container have the illusion of running on an independent system but, in truth, many containers can be running simultaneously on the same host kernel. The container illusion is created using namespaces, giving each container its own view of the network, the filesystem, and more, and control groups, which isolate containers from each other and control resource usage. Security modules or seccomp can be used to further restrict what a container can do. The result is a mechanism that, like so many things in Linux, offers a great deal of flexibility at the cost of a fair amount of complexity. Setting up a container in a way that ensures it will stay contained is not a trivial task and, as we'll see, the lack of a container primitive also complicates things on the kernel side.
Adding a container object
Howells's patch creates (or modifies) a set of system calls to make it possible for user space to manipulate containers. It all starts with:
int container_create(const char *name, unsigned int flags);
This new system call creates a container with the given name. The flags mainly specify which namespaces from the caller should be replaced by new namespaces in the created container. For example, specifying CONTAINER_NEW_USER_NS will cause the container to be created with a new user namespace. The return value is a file descriptor that can be used to refer to the container. There are a couple of flags that indicate whether the container should be destroyed when the file descriptor is closed, and whether the descriptor should be closed if the calling process calls exec().
The container starts out empty, with no processes running within it; if it is created with a new mount namespace, there are no filesystems mounted inside it either. Two new system calls (fsopen() and fsmount(), added in a separate patch set) can be used to add filesystems to the container. The "at" versions of the file system calls (openat(), for example) can take a container file descriptor as the starting point, easing the creation of files inside the container. It is possible to open a socket within the container with:
int container_socket(int container_fd, int domain, int type, int protocol);
The main purpose of container_socket() appears to be to make it easy to use netlink sockets to configure the container's networking from the outside. It can help an orchestration system avoid the need to run a process inside the container to do this configuration.
When it comes time to start things running inside the container, a call can be made to:
pid_t fork_into_container(int container_fd);
The new process created by this call will be the init process inside the given container, and will run inside the container's namespaces. It can only be called once for any given container.
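Putting the proposed calls together, the setup sequence might look something like the following sketch. These system calls exist only in Howells's patch set, so this cannot be built against any released kernel; error handling is omitted and the container name and helper program are invented for illustration.

```c
/* Create a container with (among other things) a fresh user namespace. */
int cfd = container_create("my-container", CONTAINER_NEW_USER_NS);

/* Configure the container's networking from outside, over netlink. */
int nl = container_socket(cfd, AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);

/* Start the container's init process; this works only once per container. */
pid_t pid = fork_into_container(cfd);
if (pid == 0) {
	/* Now running as init inside the container's namespaces. */
	execl("/sbin/init", "init", (char *)NULL);
}
```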
There are a number of things that, Howells said, could still be added to this mechanism. They include the ability to set a container's namespaces directly, support for the management of a container's control groups, the ability to suspend and restart a container, and more. But it is not clear that this work will progress far in its current form.
A poor match?
A number of developers expressed concerns about this proposal, mostly focused on two issues: the proposed container object is not seen as a good match for how containers are actually used now, and it is seen as the wrong solution to a specific problem. On the first issue, the flexibility of the current mechanisms is seen by many as an advantage, one that they would rather not lose; Jessica Frazelle made that point with reference to the runtime specification from the Open Containers Initiative. James Bottomley was more direct, arguing that the proposed object is a poor match for the systems in use today.
He pointed out, in particular, an apparent mismatch between the proposed container object and the concepts of containers and "pods" implemented in Kubernetes. Some namespaces are specific to a container, while others are shared across a pod, blurring the boundaries somewhat. The kernel container object, he added, "isn't something the orchestration systems would find usable".
Eric Biederman took an even stronger position, rejecting the patch outright.
Unlike the others, he is not so deeply concerned with what existing orchestration systems do; his worries have to do with the exposing of the container object to user space at all. That is where the second issue comes up.
Upcalls
To a great extent, it appears that the motivation behind this patch set isn't to make the management of containers easier for user-space code. Instead, it is trying to solve a nagging problem that has become increasingly irritating for kernel developers: how to make kernel "upcalls" work properly in a containerized environment.
As a general rule, the kernel, as the lowest level of the system, tries to be self-sufficient in all things. There really is nobody else to rely on to get things done, after all. There are times, though, when the kernel has to ask user space for help. That is typically done with a call to call_usermodehelper(), an internal function that will create a user-space process and run a specific program to get something done — "calling up" to user space, in other words.
There are a number of call_usermodehelper() call sites in the kernel. Some of the tasks it is used for include:
- The core-dump code can use it to invoke a program to do something useful with the dumped data.
- The NFSv4 client can call a helper program to perform DNS resolution.
- The module loader can invoke a helper to perform demand-loading of modules.
- The kernel's key-management code will call to user space when a key is needed to perform a specific function — to mount an encrypted filesystem, for example.
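In kernel code, such an upcall is a short sequence. The call_usermodehelper() interface and the UMH_WAIT_PROC mode are real, but the helper path and arguments below are invented for illustration (a kernel-style sketch, not buildable on its own):

```c
static char *argv[] = { "/sbin/example-helper", "some-event", NULL };
static char *envp[] = { "HOME=/", "PATH=/sbin:/usr/sbin:/bin:/usr/bin", NULL };

/* Spawn a user-space process running the helper and wait for it to exit. */
ret = call_usermodehelper(argv[0], argv, envp, UMH_WAIT_PROC);
```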
Once upon a time, when life was simple, these upcalls would create a process running as root that could run the requested program. Now, however, the action that provoked the upcall in the first place may well have come from inside a container, and it may well be that the upcall should run within that container as well. At least, it should run inside that container's particular mix of namespaces. But, since the kernel has no concept of a container, it has no way to know which container to run any particular upcall within. A kernel upcall that is run in the wrong namespace might do the wrong thing — or allow a process to escape its container.
Adding a container concept to the kernel is one way to fix this problem. But this particular patch has raised the questions of whether a container object is the best solution to the upcall problem and, if it is, whether that object needs to be exposed to user space at all. The kernel might be able to keep track of the proper namespaces to use for specific upcalls without creating a bunch of new infrastructure or exposing a new API that would have to be maintained forever. Biederman suggested one possible scheme that could be used to track namespaces for the key-management upcalls, for example.
Another possible approach, proposed by Colin Walters, is to drop the upcall model entirely. Instead, a protocol could be created to report events to a user-space daemon that could act on them in the proper context. That kind of change has been made in the past; device-related events were once handled via upcalls, but now they are communicated directly to the udev (or systemd-udevd) process instead. But, as Jeff Layton pointed out, that model only works in some settings. In others, it just leads to a proliferation of daemon processes that clutter up the system and can create reliability issues. So the events model isn't necessarily a replacement for all kernel upcalls.
This discussion is young as of this writing, and may yet progress in unexpected directions. From the early indications, it seems relatively unlikely that a container object visible to user space will be added to the kernel anytime soon. If, perhaps, some future attempt creates a container concept that is useful to existing orchestration systems, that could change. Meanwhile, we may well see an attempt to improve the kernel's internal ability to determine the proper namespace for any given upcall. Either way, the inherent complexity of the container problem seems likely to be with us for a long time.
System monitoring with osquery
Your operating system generates a lot of run-time data and statistics that are useful for monitoring system security and performance. How you get this information depends on the operating system you're running: it could be a report in a fancy GUI, data obtained via a specialized API, or, in the case of Linux and /proc, simply text values read from the filesystem. Now imagine being able to get this data via an SQL query, with the output returned as a database table or JSON object. That is exactly what osquery lets you do on Linux, macOS, and Windows.
Osquery is an open-source project created by Facebook and hosted on GitHub; the company released it to the world under a 3-clause BSD license in 2014. It initially supported only Linux and macOS, but a Windows port (with somewhat lesser capabilities) was released in 2016, giving osquery a unified SQL-based query interface across quite different operating systems. A data center running multiple operating systems can thus query the state of its infrastructure through a single interface, simplifying data collection for DevOps teams. The osquery development team wanted to create a fast, reliable, and easy-to-use instrumentation tool that does not require a lot of low-level programming to retrieve information.
System information is presented in the form of database tables; it is dynamically generated at query time rather than retrieved from storage. Naturally, different operating systems have slightly different information to present to the user. However, the osquery developers have strived for "feature parity" between systems; there is a set of common tables available for every supported platform, along with a set of operating-system-specific tables. Querying the tables is done using a subset of SQL that follows the SQLite syntax.
Using osquery
Packages for osquery are available for CentOS and Ubuntu. I installed the provided package for Ubuntu 16.04; it required no tweaking to get up and running immediately with the default configuration. There are two ways to use osquery: interactively or as an operating-system service. The interactive program, osqueryi, is a command interpreter similar to the interactive prompt of SQLite; it is useful for exploring the running system. The available tables can be found in the documentation.
I ran a few sample queries to try out osquery in interactive mode. For example, getting a list of logged-in users, with their TTY, login time, the PID of the login process, and the host they are logging in from:
osquery> SELECT * from logged_in_users;
+-----------+----------+-------+------------------+------------+------+
| type | user | tty | host | time | pid |
+-----------+----------+-------+------------------+------------+------+
| boot_time | reboot | ~ | 4.8.0-52-generic | 1495211035 | 0 |
| runlevel | runlevel | ~ | 4.8.0-52-generic | 1495211045 | 53 |
| login | LOGIN | tty1 | | 1495211045 | 792 |
| user | hussein | pts/8 | 10.0.2.2 | 1495211067 | 1134 |
+-----------+----------+-------+------------------+------------+------+
The table lists my username as well as several pseudo-users, such as boot_time and runlevel.
Starting osqueryi with the --json flag gives JSON output:
osquery> SELECT version from os_version;
[
{"version":"16.04.2 LTS (Xenial Xerus)"}
]
Querying basic system information is fun, but this tool truly shines when pulling data from several different tables to make inferences about the system. For example, if I were a system administrator, I might want to check for processes with uid 0 (privileged processes) opening network connections to the outside world. The appearance of such connections is symptomatic of fishy behaviour that might indicate unauthorized system use. To do this, we can perform an SQL join of some tables to search for any process with uid 0 that has an open socket. To test this, I first needed a root-owned process with an open socket, so I used sudo to open a telnet connection to google.com's port 80. Then I ran this query:
osquery> SELECT DISTINCT processes.uid, process_open_sockets.pid,
process_open_sockets.remote_address,
process_open_sockets.local_port,
process_open_sockets.remote_port
FROM process_open_sockets INNER JOIN processes
WHERE processes.pid=process_open_sockets.pid
AND processes.uid=0
AND process_open_sockets.remote_address <> ""
AND process_open_sockets.remote_address <> "0.0.0.0"
AND process_open_sockets.remote_address <> "10.0.2.2"
AND process_open_sockets.remote_address <> "::";
+-----+------+----------------+------------+-------------+
| uid | pid | remote_address | local_port | remote_port |
+-----+------+----------------+------------+-------------+
| 0 | 2641 | 64.233.188.99 | 36600 | 80 |
+-----+------+----------------+------------+-------------+
We do an inner join of the processes and process_open_sockets tables, using the uid information from the processes table to filter the results and disregarding local addresses. As the table reveals, there is a process with PID 2641 connected to 64.233.188.99 (one of Google's public-facing IPs) on port 80. I needed to run osqueryi using sudo because, otherwise, it cannot display information that a regular user does not have permission to view. While it is possible to run SQL commands besides SELECT, commands like UPDATE, INSERT, and DELETE don't do anything on the standard tables, as they are all, understandably, read-only. It is also possible to pipe queries into osqueryi and get the results back via stdout, which makes it usable in shell scripts.
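Since osquery's SQL is a subset of SQLite's, the logic of a query like this can be tried out against SQLite directly. Here is a minimal sketch using Python's sqlite3 module, with made-up rows standing in for live system state (the real osquery tables are generated at query time, not stored in a database):

```python
import sqlite3

# Build miniature stand-ins for two osquery tables; the rows are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE processes (pid INTEGER, uid INTEGER);
CREATE TABLE process_open_sockets (
    pid INTEGER, remote_address TEXT, local_port INTEGER, remote_port INTEGER);
INSERT INTO processes VALUES (2641, 0), (1134, 1000);
INSERT INTO process_open_sockets VALUES
    (2641, '64.233.188.99', 36600, 80),
    (1134, '10.0.2.2', 22, 51514);
""")

# The same join-and-filter shape as the osqueryi query above: root-owned
# processes (uid 0) with sockets open to non-local remote addresses.
rows = conn.execute("""
SELECT DISTINCT processes.uid, process_open_sockets.pid,
       process_open_sockets.remote_address
FROM process_open_sockets INNER JOIN processes
WHERE processes.pid = process_open_sockets.pid
  AND processes.uid = 0
  AND process_open_sockets.remote_address NOT IN ('', '0.0.0.0', '10.0.2.2', '::')
""").fetchall()
print(rows)
```

Only the root-owned outbound connection survives the filters; the ordinary user's socket is joined away.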
Osquery can also run as a service called osqueryd, a monitoring daemon that allows scheduling system queries or generating queries based on events. Query results are logged, either to the filesystem or, via a plugin, to a service such as Amazon's AWS. Configuring the daemon involves writing a JSON-formatted configuration file listing SQL queries and the intervals at which they should run, much like a crontab. A scheduled query produces a log of the chosen system information at discrete intervals, so it is useful for seeing how the system state changes over time.
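For illustration, a hypothetical osqueryd configuration fragment scheduling the root-socket query from earlier every five minutes might look like this (the query name and interval are invented; consult the osquery configuration documentation for the full format):

```json
{
  "schedule": {
    "root_remote_sockets": {
      "query": "SELECT DISTINCT processes.uid, process_open_sockets.pid, process_open_sockets.remote_address FROM process_open_sockets INNER JOIN processes WHERE processes.pid = process_open_sockets.pid AND processes.uid = 0;",
      "interval": 300
    }
  }
}
```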
Logging, security, and monitoring
Osquery is useful for cross-platform monitoring of system infrastructure. The above example, detecting root-owned processes that open network connections, could be configured as a periodic query using osqueryd; any suspicious activity would then be logged automatically.
Additionally, an osquery feature called file integrity monitoring tracks filesystem changes on Linux and macOS systems; it is implemented using inotify on Linux and FSEvents on macOS. Any files that a user wants to track need to be specified in the configuration file, and the status of those files can be read from a special table called file_events. Logs can be generated when a file is either accessed or changed. When a file-change event happens, the MD5, SHA-1, and SHA-256 hashes of the affected file are recalculated and logged.
Finally, basic auditing is also available on Linux and (via a kernel extension) on macOS; the process_events table records process-creation details. On Linux, a socket_events table that stores reports from the bind() and connect() system calls is available as well.
There are two types of logs generated by osquery: status logs and results logs. Status logs contain the execution information of osquery itself; they are created using the Glog framework. Results logs are the output of queries; they come in two varieties: snapshot and differential. A snapshot logs the entire output of the query, which can result in large logs as entire tables are written to disk at the specified intervals. The differential log is a more space-economical format that records only which rows were added, removed, or changed.
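The distinction between the two formats can be sketched in a few lines of Python; the rows here are invented, and this only illustrates the idea, not osquery's actual implementation:

```python
# Two successive "query results", as sets of (name, port) rows; invented data.
previous = {("sshd", 22), ("cupsd", 631)}
current = {("sshd", 22), ("nc", 4444)}

# Snapshot logging: write the full result every interval.
snapshot = sorted(current)

# Differential logging: record only what changed since the last run.
diff = {
    "added": sorted(current - previous),
    "removed": sorted(previous - current),
}
print(snapshot)
print(diff)
```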
By default, osquery logs to the filesystem, but additional log-aggregation options are available via plugins. Osquery version 1.7.4 and later can log directly to Amazon Web Services' Kinesis Streams and Kinesis Firehose, cloud services that can capture streaming data for storage and analysis.
Custom tables, extensions, and plugins
Custom tables can be added to the osquery source code; those tables can be filled with data from any source. There are two parts to this: defining the table structure, then creating a custom C++ class that obtains the data to populate it. The table structure is defined by a schema written in Python, from which the necessary C/C++ implementation of the table is auto-generated. The class that populates the table can gather its data from system calls, from values read from /proc, or by generating it dynamically. To get an idea of what this involves, there is example code in the documentation that creates a custom table holding the current time, derived from the time() system call.
Osquery's functionality can also be built upon with plugins and extensions. Extensions are separate processes built and linked together with the osquery core code; they communicate with the main osquery process (either osqueryd or osqueryi) via a Thrift API. Extensions are useful for creating plugins for things such as logging and configuration.
Conclusion
Osquery is a unique tool for system monitoring. While it is useful on its own, one can imagine it as a basis for creating more complex monitoring systems with sophisticated malware and anomaly detection, where rules can be easily created by simply crafting an appropriate query, or even dynamic monitoring tools that create conditional monitoring rules based on real-time events. SQL is a powerful and expressive query language, and its use for log analysis gives a flexible tool to system administrators and DevOps teams.
Page editor: Jonathan Corbet
