
LWN.net Weekly Edition for June 13, 2019

Welcome to the LWN.net Weekly Edition for June 13, 2019

This edition contains the following feature content:

This week's edition also includes these inner pages:

  • Brief items: Brief news items from throughout the community.
  • Announcements: Newsletters, conferences, security updates, patches, and more.

Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.

Python and "dead" batteries

By Jake Edge
June 12, 2019

Python is, famously, a "batteries included" language; it comes with a rich standard library right out of the box, which makes for a highly useful starting point for everyone. But that does have some downsides as well. The standard library modules are largely maintained by the CPython core developers, which adds to their duties; the modules themselves are subject to the CPython release schedule, which may be suboptimal. For those reasons and others, there have been thoughts about retiring some of the older modules; it is a topic that has come up several times over the last year or so.

It probably had been discussed even earlier, but a session at the 2018 Python Language Summit (PLS) is the starting point this time around. At that time, Christian Heimes listed a few modules that he thought should be considered for removal; he said he was working on a PEP to that end. PEP 594 ("Removing dead batteries from the standard library") surfaced in May with a much longer list of potentially dead batteries. There was also a session at this year's PLS, where Amber Brown advocated moving toward a much smaller standard library, arguing that including modules in the standard library stifles their growth. Some at PLS seemed to be receptive to Brown's ideas, at least to some extent, though Guido van Rossum was apparently not pleased with her presentation and "stormed from the room".

PEP 594

After PLS, Heimes posted the first draft of PEP 594 to the python-dev mailing list. It is a much more ambitious list than the one he made back in 2018; there are 31 modules listed, but four of those are actually modules that were once proposed for retirement that the PEP now recommends keeping. The modules are scattered throughout the standard library, but: "the majority of modules are for old data formats or old APIs. Some others are rarely useful and have better replacements on PyPI [the Python Package Index]."

The PEP lists alternatives from PyPI for most of the modules, along with a justification for their removal (or, in those few cases, their retention). In addition, the deprecation schedule being proposed is described; the modules agreed upon will be documented as "deprecated" in the upcoming 3.8 release and may issue a PendingDeprecationWarning. In 3.9, the modules will start issuing a DeprecationWarning and in 3.10 they will be removed along with their tests and documentation. Given the Python support window, the modules will still be supported by the core team until the end of life for Python 3.9, which is estimated to occur in 2026.

The modules listed to be removed in the first draft were:

Type            Modules
Data encoding   binhex, uu, and xdrlib
Multimedia      aifc, audioop, colorsys, chunk, imghdr, ossaudiodev, sndhdr, and sunau
Networking      asynchat, asyncore, cgi, cgitb, smtpd, and nntplib
OS interface    crypt, macpath, nis, and spwd
Miscellaneous   fileinput, formatter, imp, msilib, and pipes

As might be guessed, the PEP posting set off a bit of a storm, both of suggestions for other modules to consider for removal and of concerns about some of the targeted modules. In particular, Andrew Svetlov suggested that socketserver might be a good candidate for removal, but it is used by http.server and others, so Heimes said he decided against it. But that led Glenn Linderman to suggest removing http.server as well. He pointed out that it lacked a lot of functionality (e.g. HTTPS support) and that the PEP suggests removing the cgi module, further reducing its utility.

Van Rossum and others saw the value in http.server, though. It is used by other tools and modules, for one thing, but it is also useful as a quick and dirty local HTTP server—remembering how to configure and run a full web framework is too heavyweight for that kind of task. Linderman thought that the Bottle web framework might make a reasonable alternative for a simple, local server, but it turns out that Bottle uses http.server as well.

Several argued against removing specific modules. One of the more controversial choices that Heimes made was nntplib, which provides client-side code for accessing Network News Transfer Protocol (NNTP) services. Antoine Pitrou raised an objection to removing nntplib among others he thought were dubious choices (cgitb for generating tracebacks for web pages and crypt for interfacing to the crypt() one-way hash function). "NNTP is still quite used (often through GMane, but probably not only) so I'd question the removal of nntplib."

André Malo agreed with Pitrou about nntplib; he wondered how much of a maintenance burden it would be given how old the protocol is. But Victor Stinner pointed out that the "maintenance burden is real even if it's not visible"; for example, there are a number of sporadic test failures from nntplib in the continuous integration (CI) system and "nobody managed to come with a fix.. in 6 years". Beyond that, the administrator of the server used in some of those tests has asked about support for the NNTP compression extension, so there are still features that nntplib may need.

Giampaolo Rodolà thought that if nntplib was on the chopping block, telnetlib should probably join it, though he was not actually advocating that:

Overall, I think the bar for a module removal should be set very high, especially for “standard” things such as these network protocols, that despite being old are not likely to change. That means that also the maintenance burden for python-dev will be low or close to none after all.

Heimes replied that he had missed telnetlib (it has since been added), but that nntplib does have a high maintenance burden because it has no maintainer, outstanding bugs, and missing features. Rodolà also argued against removing crypt and spwd (which provides access to the shadow password file). He noted that the reasons behind removing those two were security related, which makes handling their removal different from handling the others on the list; since it may be useful to be able to work with passwords on Unix systems, having something available to do so out of the box would be good. But Heimes said that those two modules have some serious security problems that make them "very dangerous batteries".

The nature of the PEP as an omnibus including many different modules means that objections should be handled differently, Van Rossum said. The PEP will eventually either need to be accepted or rejected as a whole, which means that any particular modules eliciting complaints should probably simply be dropped off the list:

In order to get a consensus to pass the PEP, it may be necessary to compromise. IOW I would recommend removing modules from the PEP that bring up strong opposition, *even* if you yourself feel strongly that those modules should be removed.

The vast majority of modules on the list hasn't elicited any kind of feedback at all -- those are clearly safe to remove (many people are probably, like myself, hard-pressed to remember what they do). I'm not saying drop anything from the list that elicits any pushback, but once the debate has gone back and forth twice, it may be a hint that a module still has fans.

Not dead

Several commenters objected to the name of the PEP, arguing that "dead" was not an accurate description of the state of many of the modules. Stinner said: "A module is never 'dead', there are always users, even if there are less than 5 of them." He was generally in favor of the overall plan, even though he had multiple concerns and questions. Steven D'Aprano was also unhappy with the name; the batteries are working, he said, they are just unloved. But he was worried about users who will not find it easy to add the batteries back into their Python after they get removed.

Many Python users don't have the privilege of being able to install arbitrary, unvetted packages from PyPI. They get to use only packages from approved vendors, including the stdlib, what they write themselves, and nothing else. Please don't dismiss this part of the Python community just because they don't typically hang around in the same forums we do.

The current thinking seems to be that many of the modules that get removed will move over to PyPI in some form; it is possible they could even use their existing names because PyPI disallows module-name collisions with the standard library, so those names should still be unclaimed. Or, at least, the "useful" modules will move. How that will work, exactly, and how to make it easy for users affected by the module removal to fix it, are still up in the air, though it has been discussed in a thread on the Python Discourse forum. The only firm position that the PEP takes is that the core developers would stop maintaining the modules once they are removed (and the relevant Python versions reach their end of life). While moving modules to PyPI and providing some path for users to start picking them up from there is attractive, it has some possible downsides too; there are concerns that doing so could lead to another event-stream incident, in which malicious code was slipped into a popular npm package after its maintenance was handed over to a new volunteer.

Meanwhile, Barry Warsaw was a bit worried that the PEP didn't go far enough toward solving the longstanding tension between various goals for the standard library:

We have two competing pressures, one to provide a rich standard library with lots of useful features that come right out of the box. Let's not underestimate the value that this has for our users, and the contribution such a stdlib has made to making Python as popular as it is.

But it's also true that lots of the stdlib don't get the love they need to stay relevant, and a curated path to keeping only the most useful and modern libraries. I wonder how much the long development cycle and relatively big overhead for contributing to stdlib maintenance causes a big part of our headaches with the stdlib. Current stdlib development processes also incur burden for alternative implementations.

We've had many ideas over the years, such as stripping the CPython repo stdlib to its bare minimum and providing some way of *distributing* a sumo tarball. But none have made it far enough to be adopted. I don't have any new bright ideas for how to make this work, but I think finding a holistic approach to these competing pressures is in the best long term interest of Python.

Heimes posted a second draft of the PEP on python-dev, then moved the discussion to a Discourse thread, perhaps in the interests of involving those who are not inclined toward the mailing list. For the most part, the responses were similar, mostly pleas to keep certain modules, though it is clear that some are not really aware of the maintenance burden that is borne by the core developers in keeping things going for the standard library. It is also clear from the whole discussion that there are multiple places where the core developers simply have not been able to keep up with the maintenance duties.

While moving some set of modules out of the standard library will certainly help alleviate the maintenance headaches for those modules, it will not magically grow new maintainers for them. It is possible that it might cause some interested parties to step up to fix, maintain, and, even, develop new features for some of the modules, especially if the tie to the CPython release schedule and development process has been holding some back from those chores.

In the end, it is a tricky balancing act to provide enough "batteries included" that Python is useful out of the box without overburdening the core maintainers—or constraining potentially better solutions. One of the complaints about the standard library is that it locks users into a particular approach that may be suboptimal. For example, by most accounts the Requests package provides a much more rational approach to using HTTP from Python, but users often opt for the standard library "equivalents" because they come with Python. Moving these standard libraries to PyPI might allow other modules to rise to the occasion.

The flip side, of course, is that environments that are limited in their choices of third-party packages (e.g. due to policy or internet connectivity) will have a starting point with fewer features. For the most part, the modules under consideration here are unlikely to be a real barrier for users who need them once they are no longer shipped with Python. Some programs will stop working unexpectedly but, with luck, a path toward minimizing even that will be found.

Renaming openSUSE

By Jonathan Corbet
June 6, 2019

In mid-May, LWN reported on the discussions in the openSUSE project over whether a separation from SUSE would be a good move. It would appear that this issue has been resolved and that openSUSE will be setting up a foundation as its new home independent of the SUSE corporation. But now the community has been overtaken by a new, related discussion that demonstrates a characteristic of free-software projects: the hardest issues are usually related to naming.

Creating a foundation

At the 2019 openSUSE Conference, the openSUSE board discussed governance options at length. There will evidently be an official statement on its conclusions in the near future, but that has not been posted as of this writing. It would appear, though, that the board chose a foundation structure over the other options. A German registered association (e. V.) would have been easier to set up than a foundation, but an association has weaker restrictions so it could potentially shift its focus away from the openSUSE mission. Joining another umbrella group seemingly lacked appeal from the beginning, as did the option of doing nothing and leaving things as they are now.

The stated purpose of the foundation is to make it easier for openSUSE to accept donations and manage its own finances — things that are hard for the project to do now. The foundation structure, in particular, allows the project to enshrine its core objectives (such as support for free software) into the DNA of the organization, making it hard to divert the foundation toward some other goal. A foundation also allows openSUSE to retain its current governing board and membership structure.

In the absence of an official statement from the board, details on the decision and the reasoning behind it can be had by watching this YouTube video of a question-and-answer session with the board at the openSUSE Conference.

One motivation for the change that wasn't highlighted in the board session, but which was an undercurrent in the discussions leading up to it, is a desire for more independence from SUSE in general driven by concerns about what the company might do in the future. Such worries are not entirely irrational, even though by all accounts SUSE management is fully supportive of openSUSE now. A company's attitude can change quickly even in the absence of external events like a change of ownership. If SUSE were to be sold yet again, the new owners could take a rather dimmer view of the openSUSE project.

Time for a new name?

Such worries seem to be a key driver of the next possible change for the project: as initially proposed by Stasiek Michalski, the newly independent openSUSE project might well change its name, its logo, or both. It goes without saying, though, that there is no consensus behind any such change at this early stage.

The primary motivation for a name change is, as described by openSUSE board chair Richard Brown, trademarks. Since "openSUSE" contains "SUSE", the company will have to retain a significant amount of control over what the foundation can do with its own name, which "makes such things rather complicated". He later added:

Even without a legal entity, openSUSE already operates with significant constraints around the use of its name, which you can see in our Trademark Policy and the examples I gave in my post.

If openSUSE keeps its current name, I would be absolutely shocked if we manage the form the Foundation under the name "openSUSE" without significant additional restrictions atop of the status quo.

One other consequence of the current trademark situation, Brown said, is that the openSUSE board spends a significant amount of its time dealing with trademark issues, to the detriment of the rest of the project. In the future, he said, trademark restrictions could limit how the project could market itself, "and Marketing is an area which I think everyone would say we should be expanding upon, not limiting ourselves". For these reasons, Brown is in favor of picking a new name as the new foundation is created.

Others agreed, and supplied some additional reasons; Alberto Planas Domingue, for example, argued that a new name would allow the project to cast off what he sees as its reputation as a "traditional distribution" and highlight the interesting new technology that it is built around now. Jim Henderson added that there is a fair amount of confusion among users about the distinction between the SUSE Linux Enterprise and openSUSE distributions; a name change could help to clear that up.

Unsurprisingly, others feel that a name change would be a bad idea. Board member Simon Lees, for example, pointed out that SUSE has given the board "quite some guarantee" that the project would be able to use the openSUSE name for as long as it needs to. Should the relationship with the company deteriorate, the project will have time to consider a name change, and the additional press that would result from such a situation would be helpful in establishing the new name.

Others agreed with that position and added to it. Sarah Julia Kriesch argued that openSUSE is a well-known name that should not be discarded without a reason. Ancor Gonzalez Sosa said that a name change now would give the impression of a bad breakup with SUSE, which is not the case. Michal Kubecek worried that a renamed openSUSE would become "Yet Another Linux Distribution"; the Fedora project, he said, suffered from its name change. Marcus Meissner said that a name change would cause the distribution to lose many of its users: "The brand is the most important part on keeping the distribution alive. Throwing it away means throwing the distribution away, sorry." Robert Schweikert said that the project lacks the funding to make a name change stick, and said that such a change is unnecessary if the primary objective of the foundation really is to make financial matters easier to deal with.

No consensus

The discussion has been remarkably civil for what is an inherently controversial topic — openSUSE seems to be made up of a lot of pleasant and respectful people. Hopefully that atmosphere will sustain itself as the discussion drags out and eventually comes to a vote. There is no sign of an emerging consensus at this point; a decision might eventually have to be made without one. However it happens, it is likely to take time, and the project is likely to continue to use the openSUSE name for years even if a name change happens. As Brown put it:

If we change the name, I'd expect the "<insert name here> Foundation" would have some agreement with SUSE to continue using the openSUSE Marks for the purposes of a smooth transition. This is the kind of decision that may take weeks or months to agree upon, but could take years to fully implement.

For what it's worth, changing the logo seems to be rather less controversial — though some community members are adamant that the green color should be retained. A separate discussion, along with a possible replacement, can be found on this issue-tracker page.

One interesting aspect of this discussion is that at no point has anybody suggested a replacement name; if the people behind the proposal have one in mind, they are keeping it to themselves for now. That would be wise; a decision like this is hard enough without the additional complication of picking a new name at the same time. Should it come to a name change, though, expect another thread as the community works out what the new name should be.

Paying (some) Debian developers

By Jake Edge
June 12, 2019

In an offshoot of the Debian discussion we looked at last week, the Debian project has been discussing the idea of paying developers to work on the distribution. There is some history behind the idea, going back to the controversial Dunc-Tank initiative in 2006, but some think attitudes toward funding developers may have changed—or that a new approach might be better accepted. While it is playing out with regard to Debian right now, it is a topic that other projects have struggled with along the way—and surely will again.

The discussion on the debian-devel mailing list that we covered, about possibly recommending dh for building packages, headed off on a tangent about "difficult packaging practices" that might be preventing new people from contributing. From there, Andreas Tille brought up the longstanding idea of creating some kind of Debian equivalent to the Ubuntu personal package archives (PPAs). Raphaël Hertzog suggested that it might be worth using some of the money in the Debian bank account to fund the development of such a feature. That began the debate.

Inevitably, the Dunc-Tank episode was raised, but Hertzog thinks that attitudes may have changed over the 13 years since then:

There are things to learn from this failed experiment (such as "don't let the DPL decide alone who gets paid") but there are also many reasons to believe that we are no longer in the same situation. At that time, the number of persons working on open source as part of their paid work was rather low and the jealousy aspect was likely more problematic than it would be today.

He noted that some developers are being paid now to work on the Debian Long Term Support (LTS) project under the auspices of Freexian, which is a company that he founded to provide Debian services to companies. Based on that, "with appropriate rules, the social impact of the use of money is acceptable", he said. So he thought it might be time to create some kind of framework within Debian to fund projects that are important to the project but are not progressing via volunteer efforts.

But others were not so sure that the reaction to the LTS initiative supports as sweeping a conclusion as Hertzog drew. Holger Levsen said that the LTS project is set up quite differently than the Dunc-Tank was; in particular, all of the money that goes to developers on the LTS project is handled completely outside of the Debian project. That makes a big difference, he said. Paul Wise also thought that there was a much wider range of reactions to the LTS project than "acceptance" as Hertzog said. Wise had some concerns about how Freexian operated the project, and its relationship with Debian, but naturally Hertzog disagreed with many of them. Debian project leader (DPL) Sam Hartman thought that many of the concerns Wise raised could be handled in a straightforward manner on an as-needed basis.

At that point, Hartman moved the discussion to the debian-project mailing list ("where it belongs") and asked for a volunteer to help "guide a discussion of the Money issues Martin [Michlmayr] brought up in his campaign". Michlmayr was another DPL candidate in the election that Hartman won; Michlmayr raised the issue in his platform, some of which we covered back in March.

Hartman asked for a rundown on the Dunc-Tank initiative in order to try to understand its ramifications. Though Ian Jackson was not thrilled about rehashing old history, he did provide a summary of how it played out; essentially, it was controversial because it tried to straddle the line between being a Debian project and being one that was outside of the project's purview. Levsen concurred with Jackson's recounting.

In his message moving the thread, Hartman mentioned two separate $300,000 donations to the project in the last year, one of which was earmarked for hardware upgrades as part of the project infrastructure that is overseen by the Debian system administration (DSA) team. That upgrade might overrun the earmark as it may require as much as $400,000. But that still leaves up to $200,000 that could potentially be used for funding something.

Adrian Bunk is looking for more detailed financial information: "it is impossible to start the budgetary discussion you are asking for without the status quo of the Debian finances as a basis". Hartman agreed that there was a need for that kind of report, but someone would need to volunteer to do it. Meanwhile, though, he thinks the high-level discussion about paying people to work on Debian could still be had.

But Bunk thinks that figuring out how to fund any work needs to proceed before getting to the other stuff, because he considers that to be the most controversial part. Russ Allbery disagreed, though; he thinks the most controversial piece is likely to be which projects and developers to fund (or not fund).

We're deciding as a project that some people's work is valuable enough to pay for and (by omission if nothing else) other people's work is not, and for all the good intentions that we have going in, there are so many ways for this to go poorly.

If we're only hiring people from *outside* the project, not each other, maybe that avoids the worst of the problems, but it's still an odd dynamic. For example, it creates a perverse incentive for someone to resign from the project so that they can be paid for the work they're currently doing as a volunteer.

There are also considerations regarding having to "fire" a developer who does not perform or encountering quality problems in the code that they produce, he said. For an effort like LTS, where the project does not hold the purse strings, that all works much more smoothly; someone could be "fired" from Debian LTS work and still continue working on Debian as a whole. Beyond that, LTS is focused in an area that Debian had decided not to pursue on a purely volunteer basis. Allbery continued:

Maybe we can find more things like LTS that are pure incrementals over what the project is currently doing, but I'm pretty worried about the social dynamic of paying people to do core project work that others are currently doing for free.

Ximin Luo noted that there are already some who are paid to work on Debian full time, though not by the project; he wondered if it made sense to broaden that:

Wouldn't it be better to additionally have some other people be paid full-time to work on Debian under a democratic mandate (our voting system) rather than under corporate orders? At the very least, it would be a good social experiment to gain insight from - something like that hasn't not been done much in the world before.

Turning it over to a "democratic mandate" has a certain appeal at a philosophical level, Allbery said, but there are still a lot of problems to work through, including the inequalities in the cost of living in various places—and the expected compensation levels that engenders. In addition, money makes things complicated: "Money ranks right up there with politics and religion as likely to cause the most drama, the most hard feelings, and the most misunderstandings."

Luo acknowledged that there are problems to be worked out, but "progress isn't made by worrying about all the things that could possibly go wrong". Figuring out a way to organize a large-scale work effort via democratic principles "would have lots of benefits far beyond this project". There are risks, he said, but maybe a 25-year-old project that has stayed roughly static in terms of developer numbers for the last ten years might be the right place to inject some risk.

While Allbery does not think developer funding is the right approach for Debian, he does admit that he might be wrong about that—and it is a worthy problem to solve, as Luo pointed out. Allbery worries that the Debian community is fragile, but the larger issue is a potent one:

Funding free software development is an enormous problem right now that desperately needs options other than controlling sponsorship by for-profit companies with all the baggage that carries.

Ondřej Surý suggested looking at the way the Internet Engineering Task Force (IETF) handles funding (in particular, RFC 4071) as a potential model for Debian. Hartman said he had worked with the IETF along the way; he described some of the kinds of positions that are funded by that organization and mapped them to Debian roles. He agreed with many of Allbery's concerns but thought that fixed tasks that are not part of the core of what Debian does (e.g. revamping the web site) might make good candidates.

Bunk said that he believes the right kind of work for the project to fund would fall under the category of work that most will not do unless they are paid. "My personal experience with real-life self-organizing projects is that the hardest part is usually finding volunteers who clean the toilets daily." Allbery agreed that janitorial work might make a reasonable choice for tasks to fund, and alleviate many of his concerns, but he also elaborated some on his earlier point about corporate influence on Debian's direction:

Also, the point is well-taken that "voting with time and energy" is not particularly "pure" in Debian already, since various corporations vote with their money to fund people to do various things they care about. So this is already complicated and is not a pure volunteer endeavor, to be sure. That said, my impression -- on the basis of no actual research, so maybe it's wrong -- is that Debian is driven much less by corporate priorities than a lot of large free software projects. Certainly less than the Linux kernel, to take an obvious example.

No real conclusions were reached, though there was common ground here and there. In Debian terms, perhaps that really means that "further discussion" is required. The issues are certainly complex and there are, mostly, only guesses as to the impact any particular path might have. If there were a clear and obvious target project, with a well-defined scope, it might be possible to see Debian funding it (somehow through the general resolution process) but, it seems, nothing of that sort has (yet) been proposed.

Bunk's point about financial transparency is also salient; it would be hard for some to vote for any kind of funding proposal without having a clear view of what its impact on the project's cash reserves would be. But how to fund free-software projects is an open question these days, as Allbery pointed out. Should Debian find a way forward, it would certainly be a boon for the rest of the community as well.

Generalized events notification and security policies

By Jonathan Corbet
June 11, 2019

Interfaces for the reporting of events to user space from the kernel have been a recurring topic on the kernel mailing lists for almost as long as the kernel has existed; LWN covered one 15 years ago, for example. Numerous special-purpose event-reporting APIs exist, but there are none that are designed to be a single place to obtain any type of event. David Howells is the latest to attempt to change that situation with a new notification interface that, naturally, uses a ring buffer to transfer events to user space without the need to make system calls. The API itself (which hasn't changed greatly since it was posted in 2018) is not hugely controversial, but the associated security model has inspired a few heated discussions.

/dev/watch_queue

Howells's mechanism is implemented as a special device, likely to be called /dev/watch_queue. Applications start by opening that device for write access; they then need to configure the size of the ring buffer. That is done with the IOC_WATCH_QUEUE_SET_SIZE ioctl() command, passing the desired size in pages. In the current patch set, the size must be a power of two and no greater than 16. Once that is done, mmap() is used to map the ring buffer into the application's address space.
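The size constraint described above can be checked in user space before the ioctl() is issued. What follows is only a minimal sketch: the helper names are hypothetical, and it assumes that the mapping length is simply the page count multiplied by the page size, which the article does not state explicitly.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_RING_PAGES 16u   /* upper limit in the current patch set */

/* Returns 1 if nr_pages is a power of two and no greater than 16. */
static int ring_size_ok(unsigned int nr_pages)
{
    return nr_pages != 0 && nr_pages <= MAX_RING_PAGES &&
           (nr_pages & (nr_pages - 1)) == 0;
}

/* Length in bytes to hand to mmap() for a ring of nr_pages pages
 * (assumption: the ring is mapped as exactly nr_pages whole pages). */
static size_t ring_map_length(unsigned int nr_pages, size_t page_size)
{
    assert(ring_size_ok(nr_pages));
    return (size_t)nr_pages * page_size;
}
```

An application would presumably pass a validated page count to IOC_WATCH_QUEUE_SET_SIZE and then use the computed length in its mmap() call; neither call is shown because the ioctl() constant exists only in the patch set.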

The ring buffer itself is divided into eight-byte slots. Entries for specific events can occupy more than one slot; the first slot always contains this structure:

    struct watch_notification {
	__u32	type:24;
	__u32	subtype:8;
	__u32	info;
    };

The type and subtype fields tell the application what type of event has occurred; the attachment of a USB device would be reported as WATCH_TYPE_USB_NOTIFY and NOTIFY_USB_DEVICE_ADD. The special WATCH_TYPE_META type and WATCH_META_SKIP_NOTIFICATION subtype indicate an entry that should simply be skipped (there are a couple of uses for such entries that will be described below). The info field contains flags to report ring overruns or events dropped due to a lack of memory, for example; it also contains a subfield describing the number of slots occupied by this particular event.

The ring buffer itself looks like this:

    struct watch_queue_buffer {
	union {
	    struct {
		struct watch_notification watch; /* WATCH_TYPE_META */
		__u32		head;		/* Ring head index */
		__u32		tail;		/* Ring tail index */
		__u32		mask;		/* Ring index mask */
	    } meta;
	    struct watch_notification slots[0];
	};
    };

The use of a union implies that the meta structure, containing the head and tail pointers for the ring, overlays the first three slots. The special watch structure embedded in there is marked to be skipped, so user-space code will simply pass over the header information without the need to do anything special.

The buffer is empty if head and tail are equal. The kernel will insert the next event at the slot pointed to by head and increment head by the number of slots used by that event. Events will not be split at the end of the ring; if there are not enough slots left to hold the full event, the ring will be padded with skip events and the new event inserted at the beginning. User space should consume events starting at tail, and increment tail accordingly when each event is dealt with. As is always the case for data structures like this, appropriate memory barriers should be used when working with the ring indexes.
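The consumer side of that protocol might look something like the sketch below. It is a user-space simulation rather than code from the patch set, and several details are assumptions made for illustration: the DEMO_ constants are placeholders for the real type, subtype, and slot-count definitions, the indexes are treated as free-running values that are masked to find a slot, the header is assumed to report a length of three slots, and C11 atomics stand in for whatever barriers a real consumer would use on the mapped __u32 fields.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

#define RING_SLOTS      16u    /* illustrative size; must be a power of two */
#define HEADER_SLOTS    3u     /* the 20-byte header overlays three 8-byte slots */
#define DEMO_INFO_SLOTS 0x3fu  /* hypothetical mask for the slot count in info */
#define DEMO_TYPE_META  0u     /* placeholder for WATCH_TYPE_META */
#define DEMO_META_SKIP  0u     /* placeholder for WATCH_META_SKIP_NOTIFICATION */

struct watch_notification {
    uint32_t type:24;
    uint32_t subtype:8;
    uint32_t info;
};

struct ring {
    union {
        struct {
            struct watch_notification watch; /* skip record covering the header */
            _Atomic uint32_t head;           /* advanced by the producer */
            _Atomic uint32_t tail;           /* advanced by the consumer */
            uint32_t mask;
        } meta;
        struct watch_notification slots[RING_SLOTS];
    };
};

/* Set up an empty ring whose header is marked as a record to be skipped. */
static void ring_init(struct ring *r)
{
    memset(r, 0, sizeof(*r));
    r->meta.watch.type = DEMO_TYPE_META;
    r->meta.watch.subtype = DEMO_META_SKIP;
    r->meta.watch.info = HEADER_SLOTS;
    r->meta.mask = RING_SLOTS - 1;
    atomic_init(&r->meta.head, HEADER_SLOTS);
    atomic_init(&r->meta.tail, HEADER_SLOTS);
}

/* Copy the first slot of each pending non-skip record into out[] (up to
 * max entries) and return how many were copied. */
static unsigned int ring_consume(struct ring *r,
                                 struct watch_notification *out,
                                 unsigned int max)
{
    unsigned int n = 0;
    /* Acquire pairs with the producer's release when it publishes head. */
    uint32_t head = atomic_load_explicit(&r->meta.head, memory_order_acquire);
    uint32_t tail = atomic_load_explicit(&r->meta.tail, memory_order_relaxed);

    while ((int32_t)(head - tail) > 0 && n < max) {
        const struct watch_notification *rec = &r->slots[tail & r->meta.mask];
        uint32_t nslots = rec->info & DEMO_INFO_SLOTS;

        assert(nslots != 0);  /* a zero-length record would never advance */
        if (rec->type != DEMO_TYPE_META || rec->subtype != DEMO_META_SKIP)
            out[n++] = *rec;
        tail += nslots;
        /* Release so the slots are seen as free only after they were read. */
        atomic_store_explicit(&r->meta.tail, tail, memory_order_release);
    }
    return n;
}
```

Because padding entries and the header itself are ordinary skip records, the loop needs no special case for wrapping around the end of the ring; it just adds each record's slot count to tail and masks the result.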

The application can call poll() to wait for events if need be.

Selecting events

The other piece of the puzzle is telling the kernel which events are of interest to begin with. Each subsystem provides its own way of requesting that events be delivered into a specific buffer. The patch set implements a number of event sources:

  • Events involving keys can be requested with the KEYCTL_WATCH_KEY command to the keyctl() system call.
  • Filesystem mount and unmount events can be had with a call to the new watch_mount() system call.
  • Events on specific filesystems (deemed "superblock events") are requested with the new watch_sb() system call.
  • Yet another new system call, watch_devices(), allows for the requesting of events related to hardware. The patch set adds support for events from the block and USB subsystems.

Finally, by default all events of the requested type(s) will be delivered into the ring buffer. The application might well only be interested in a small subset of those events. To avoid passing data that is not useful, there is a filtering mechanism built around this structure:

    struct watch_notification_type_filter {
	__u32	type;
	__u32	info_filter;
	__u32	info_mask;
	__u32	subtype_filter[8];
    };

The application puts the type of the event of interest into type. subtype_filter is a bitmask that can be used to limit which event subtypes are delivered; the application sets the bit corresponding to each desired subtype. For more complex filtering, the info_filter and info_mask fields can be used. Any given event will be delivered if:

    (event.info & info_mask) == info_filter

In other words, info_mask indicates which parts of the info field are of interest, and info_filter holds the values that should be found in those parts.
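As a rough illustration, the test that the kernel applies to each event could be modeled in user space as follows. The helper names are hypothetical, and the mapping of subtype N to bit N % 32 of word N / 32 in subtype_filter is an assumption; the article says only that one bit corresponds to each subtype.

```c
#include <assert.h>
#include <stdint.h>

struct watch_notification {
    uint32_t type:24;
    uint32_t subtype:8;
    uint32_t info;
};

struct watch_notification_type_filter {
    uint32_t type;
    uint32_t info_filter;
    uint32_t info_mask;
    uint32_t subtype_filter[8];   /* 8 x 32 = 256 bits, one per subtype */
};

/* Mark one subtype as wanted (assumed bit layout, see above). */
static void filter_allow_subtype(struct watch_notification_type_filter *f,
                                 unsigned int subtype)
{
    assert(subtype < 256);
    f->subtype_filter[subtype / 32] |= UINT32_C(1) << (subtype % 32);
}

/* Returns 1 if the event passes the filter, 0 otherwise. */
static int filter_matches(const struct watch_notification_type_filter *f,
                          const struct watch_notification *n)
{
    uint32_t bit = UINT32_C(1) << (n->subtype % 32);

    if (n->type != f->type)
        return 0;
    if (!(f->subtype_filter[n->subtype / 32] & bit))
        return 0;
    return (n->info & f->info_mask) == f->info_filter;
}
```

With info_mask set to 0xf00 and info_filter set to 0x100, for example, an event whose info field is 0x1a5 would pass, while one with 0x2a5 would not.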

The application can package up as many filters as it needs into this structure:

    struct watch_notification_filter {
	__u32	nr_filters;
	__u32	__reserved;
	struct watch_notification_type_filter filters[];
    };

The result is then passed to the kernel with the IOC_WATCH_QUEUE_SET_FILTER ioctl() command.
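Since filters[] is a flexible array member, the application has to allocate the header and the filter entries as a single block. One possible way to do that is sketched here; filter_set_new() is a hypothetical helper, not part of the proposed API, and the structures are copied from above with uint32_t standing in for __u32.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

struct watch_notification_type_filter {
    uint32_t type;
    uint32_t info_filter;
    uint32_t info_mask;
    uint32_t subtype_filter[8];
};

struct watch_notification_filter {
    uint32_t nr_filters;
    uint32_t __reserved;
    struct watch_notification_type_filter filters[];
};

/* Allocate a filter set holding copies of the n entries in src.
 * Returns NULL on allocation failure; the caller frees the result. */
static struct watch_notification_filter *
filter_set_new(const struct watch_notification_type_filter *src, uint32_t n)
{
    struct watch_notification_filter *fs;

    assert(src != NULL || n == 0);
    /* calloc() zeroes the reserved field along with everything else. */
    fs = calloc(1, sizeof(*fs) + (size_t)n * sizeof(fs->filters[0]));
    if (!fs)
        return NULL;
    fs->nr_filters = n;
    if (n)
        memcpy(fs->filters, src, (size_t)n * sizeof(fs->filters[0]));
    return fs;
}
```

The returned pointer is what would be handed to the IOC_WATCH_QUEUE_SET_FILTER ioctl(); that call is left out here because the constant is only defined by the patch set.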

A lot more details about the notification mechanism, including the kernel-side API, can be found in the document at the beginning of this patch.

Security

Naturally, information out of the kernel could be sensitive and should not be given to any process that might request it. In an earlier (May 28) version of the patch set, events related to keys would only be delivered if the recipient has "View" access to the key involved. Information on mount events was unrestricted; superblock events were also unrestricted for any filesystem that was actually visible to the calling process. Generic device events were not a part of that patch set; block-subsystem events were supported as a distinct type and were unrestricted. This policy was seen as being overly loose in a number of ways, one of which was surprising to many of the participants in the discussion.

Consider mount events in particular, and whether process B should be able to see events generated when process A mounts or unmounts a filesystem. One might argue that B should be privileged, or should at least have enough access to watch what A is doing in general. Casey Schaufler, though, argued the reverse: for B to see an event generated by A, it is A that should have sufficient privilege to send signals to B:

If process A sends a signal (writes information) to process B the kernel checks that either process A has the same UID as process B or that process A has privilege to override that policy. Process B is passive in this access control decision, while process A is active. In the event delivery case, process A does something (e.g. modifies a keyring) that generates an event, which is then sent to process B's event buffer. Again, A is active and B is passive. Process A must have write access (defined by some policy) to process B's event buffer.

Any other policy, he said, would open covert channels between the processes and would be difficult to specify and model in general. To others, though, this policy seemed backward and surprising; most others were also less worried about covert channels than Schaufler is. The discussion circled around a few versions of the patch set with no seeming resolution (though Howells did attempt to implement the policy Schaufler was asking for); at one point Andy Lutomirski called it "a giant design error".

One seemingly counterintuitive example perhaps led to a better understanding between the participants, though. SELinux maintainer Stephen Smalley pointed out that, if two processes are both able to map a file, they can communicate via that file, so restricting notifications about activity on that file does not increase security. Schaufler replied with an example (/dev/null) where this is not the case, saying that many such examples exist. Lutomirski then agreed that notifications between unrelated processes should not be allowed for a file like /dev/null. That opens the door for a renewed discussion on the security policies around notifications.

This understanding has not, yet, led to a full agreement about what those policies should be. It would not be surprising if a full consensus were to take a while to emerge; this is a complex new API with new security implications for every subsystem that submits events to it. One generally wants to have the security story figured out before something like this is released in a mainline kernel. So this work may or may not find its way into 5.3, but it does appear to have a reasonable chance of avoiding the fate of many other generalized event-notification mechanisms and going upstream eventually.

Detecting and handling split locks

June 7, 2019

This article was contributed by Marta Rybczyńska

The Intel architecture allows misaligned memory access in situations where other architectures (such as ARM or RISC-V) do not. One such situation is atomic operations on memory that is split across two cache lines. This feature is largely unknown, and its impact is even less well known. It turns out that the performance and security impact can be significant, breaking realtime applications or allowing a rogue application to slow the system as a whole. Recently, Fenghua Yu has been working on detecting and fixing these issues in the split-lock patch set, which is currently on its eighth revision.

From misaligned memory accesses to split locks

Misaligned memory access occurs when the processor accesses memory at an address that is not aligned to the type of the operand, such as an eight-byte operation that accesses a four-byte-aligned variable. Reading four bytes from address 0x1008 is aligned, for example, while the same operation from 0x1006 is not. Misaligned accesses can cause varying behavior on different architectures, including correct and performant operation, exceptions that stop the processor, or incorrect results.
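The alignment rule can be expressed as a simple check on the address. This is a small sketch with a hypothetical helper name; it assumes that the operand size is a power of two, as it is for the usual integer types.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if an access of the given size at addr is naturally aligned. */
static int is_aligned(uintptr_t addr, size_t size)
{
    assert(size != 0 && (size & (size - 1)) == 0);  /* power of two only */
    return (addr & (size - 1)) == 0;
}
```

By this test, a four-byte read at 0x1008 is aligned, a four-byte read at 0x1006 is not, and an eight-byte access at 0x1004 (which is only four-byte aligned) is not either.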

Misaligned accesses may incur a performance penalty even if the processor transparently handles them. For example, a misaligned access may be split by the CPU into two separate memory operations. Another possibility is the processor generating an exception that is silently handled by the kernel. Portable and high-performance applications should avoid misaligned accesses; the kernel's code guidelines state that developers should assume natural alignment requirements on all platforms.

A special type of a misaligned access is one that crosses two cache lines, possibly causing the processor to have to fetch multiple lines before performing the operation. Things get more complicated when an atomic operation is being performed and the processor must ensure that the data involved is seen consistently and correctly while the operation is executed. Intel platforms support atomic accesses that are split across two cache lines; such an operation is called a "split lock".

With a split lock, the value needs to be kept coherent between different CPUs, which means assuring that the two cache lines change together. As this is an uncommon operation, the hardware design needs to take a special path; as a result, split locks may have important consequences as described in the cover letter of Yu's patch set. Intel's choice was to lock the whole memory bus to solve the coherency problem; the processor locks the bus for the duration of the operation, meaning that no other CPUs or devices can access it. The split lock blocks not only the CPU performing the access, but also all others in the system. Configuring the bus-locking protocol itself also adds significant overhead to the system as a whole.

On the other hand, if the atomic operation operand fits into a single cache line, the processor will use a less expensive cache lock. This all means that developers can increase performance and avoid split locks simply by aligning their variables correctly.
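To make the distinction concrete, here is a sketch of how one might test whether an operand would straddle two cache lines, along with a structure layout that invites the problem. The 64-byte line size is an assumption (it is common on current x86 processors but not guaranteed), the structure and helper names are hypothetical, and the layout comparison assumes that each structure starts on a cache-line boundary. The packed layout relies on the #pragma pack extension supported by GCC and Clang.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_SIZE 64u   /* assumption: typical x86 cache-line size */

/* Returns 1 if [addr, addr + size) spans more than one cache line. */
static int crosses_cache_line(uintptr_t addr, size_t size)
{
    assert(size != 0);
    return addr / CACHE_LINE_SIZE != (addr + size - 1) / CACHE_LINE_SIZE;
}

/* Packing removes padding, leaving counter at offset 62: an atomic
 * operation on it would touch bytes 62..65 and thus two cache lines. */
#pragma pack(push, 1)
struct packed_stats {
    uint8_t  tag[62];
    uint32_t counter;
};
#pragma pack(pop)

/* With natural alignment the compiler pads counter out to offset 64,
 * so it sits entirely within the second cache line. */
struct natural_stats {
    uint8_t  tag[62];
    uint32_t counter;
};
```

Note that a misaligned access is not necessarily a split lock: a four-byte operand at 0x1006 is misaligned but stays within one 64-byte line, while one at 0x103e crosses into the next line.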

Split lock consequences

Yu explained the use cases that motivated this work: hard realtime, cloud computing, and avoiding a security hole. The most important one seems to be related to systems that run hard realtime applications on some cores and normal-priority processes on other cores. Split locks may cause the hard realtime requirements to be broken, as the bus locking caused by split locks executed by the regular code blocks memory accesses by the realtime code. Yu noted that, until now, such complex realtime applications could not be supported for exactly this reason:

To date the designers have been unable to deploy these solutions as they have no way to prevent the "untrusted" user code from generating split lock and bus lock to block the hard real time code to access memory during bus locking.

In the cloud case, one user process from a guest system may block other cores from accessing memory and cause performance degradation across the whole system. In a similar way, malicious code may try to slow down the system deliberately in a denial-of-service attack.

Solutions

Intel processors, starting with the upcoming Tremont generation, will be able to generate an exception (called "Alignment Check" or #AC) when a split lock is detected. Earlier processors support only an event counter for debugging purposes (exposed in perf as sq_misc.split_lock), but do not allow immediate action from the system. Yu's work is based on this new capability.

The correct response to split locks, including what to do when they are detected while the system firmware is running, was the subject of some discussion during the review of the earlier version of the patch set. The implementation in the current version concentrates on detection of the problem.

If a split-lock event happens in the kernel itself, it issues a warning and disables the detection on the current CPU. After the warning, the faulty instruction will execute and the system will continue — whether the system should go on or panic was one of the main topics of discussion. The rationale is that a split lock in the kernel is a bug and should be fixed, but the bug is not so severe that the kernel should be made to panic.

The situation is different for user processes, which will be sent a fatal (by default) SIGBUS signal. The issue will need to be fixed before that program can run successfully. Something similar happens when a split lock is created by the system's firmware: the system will simply hang at that point. The developers decided on this handling because they were afraid that otherwise the firmware would never be fixed.

Split-lock detection is enabled by default when it is supported by the hardware. However, system administrators have control over the feature: they can use a new kernel parameter (nosplit_lock_detect) at boot time to disable it. There is also a sysfs interface to disable it at runtime at /sys/devices/system/cpu/split_lock_detect.

The patch set also includes support for KVM; it emulates the register in guests, exposing the property. The host system will have the feature enabled by default. What to do with the guests was discussed on multiple occasions; the agreed-on solution that will show up in the next iteration is to enable it in guests when the host kernel has it enabled. It means that if the host kernel has split-lock detection enabled and a guest triggers the exception, it will be stopped. On the other hand, if the host kernel has it disabled, the guest may choose to enable and use it, but is not required to.

Further work

The work has been through multiple iterations at this point and has received regular comments from the kernel developers, including Thomas Gleixner and Ingo Molnar. It still has some issues pending, but at the current pace it should show up in the mainline kernel before too long.

BPF for security—and chaos—in Kubernetes

June 10, 2019

This article was contributed by Sean Kerner


BPF is probably familiar to many LWN readers, though it's likely not yet quite as well known in the Kubernetes community — but that could soon change. At KubeCon + CloudNativeCon Europe 2019 there were multiple sessions with BPF in the title where developers talked about how BPF can be used to help with Kubernetes security, monitoring, and even chaos engineering testing. We will look at two of those talks that were led by engineers closely aligned with the open-source Cilium project, which is all about bringing BPF to Kubernetes container environments. Thomas Graf, who contributes to BPF development in the Linux kernel, led a session on transparent chaos testing with Envoy, Cilium, and BPF, while his counterpart Dan Wendlandt, who is well known in the OpenStack community for helping to start the Neutron networking project, spoke about using the kernel's BPF capabilities to add visibility and security in a Kubernetes-aware manner.

The Cilium GitHub project page defines the technology as a way to provide network security that understands the APIs used between microservices by using BPF and eXpress Data Path (XDP). Cilium also makes use of BPF programs to extract data from running containers for visibility purposes.

Chaos engineering

A core architectural capability of Kubernetes is that it is resilient and, when properly configured, can recover from failure by bringing up new pods as needed. Configuration optimization for resilience can often benefit from a type of testing known as chaos testing. That form of testing comes from chaos engineering; Graf showed a definition of the term in his slides. It is the discipline of experimenting on a software system in production in order to build confidence in the system's capability to withstand turbulent and unexpected conditions. "In a nutshell it means you're introducing chaos into your infrastructure to better understand failure modes," he said.

In particular, Graf identified fault injection as a subset of chaos testing that can be beneficial. He explained that, as the name implies, intentional faults are injected into software where a fault is not occurring normally. Fault injection can be used to simulate an outage or a service failure; it can also be used to delay an application response.

In order to do fault injection, Graf said there is a need to have some form of man-in-the-middle function to get control of communications between services. There is also a need for that function to inject the fault in only a certain percentage of the operations, rather than simply having a service failing all the time. Lastly, there is also a need for visibility to make sure that the fault injection works and that an observed failure was in fact the result of chaos testing, rather than the application crashing on its own.
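The "certain percentage" requirement amounts to a per-request random decision. The following sketch shows the general idea only; it is not how Envoy or Cilium implement fault injection, and the structure, function names, and the xorshift generator are all choices made for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-policy state: a PRNG seed and the share of requests
 * (0 to 100) that should receive an injected fault. */
struct fault_injector {
    uint32_t state;        /* xorshift32 state; must be nonzero */
    unsigned int percent;
};

/* Marsaglia's xorshift32, used here only to keep the sketch deterministic. */
static uint32_t next_random(uint32_t *state)
{
    uint32_t x = *state;

    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    return x;
}

/* Returns 1 if this request should have a fault injected. */
static int should_inject(struct fault_injector *fi)
{
    assert(fi->percent <= 100 && fi->state != 0);
    return next_random(&fi->state) % 100 < fi->percent;
}
```

A man-in-the-middle proxy would consult something like should_inject() for each intercepted request and, when it returns true, delay the response or return an error instead of forwarding the request; recording each decision would provide the visibility Graf mentioned.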

Graf's chaos engineering testing stack includes using the Envoy proxy, which is an open-source service and edge-proxy project, hosted by the Cloud Native Computing Foundation (CNCF); it has Go language extensions for chaos testing with fault injections. The final piece of the stack is Cilium, which is based on BPF and implements Kubernetes network policies and identity-based enforcement. "Cilium will bring up Envoy, run it, and then, using BPF magically, or really it's not magic it's networking but it will look like it's magic, redirect all traffic subject to the fault-injection policy," he said.

The fault-injection code testing approach demonstrated by Graf is publicly available in a GitHub repository, with fault-injection testing options for delayed response and service failure conditions.

Securing Kubernetes with BPF

Wendlandt used his session to explain how BPF and Cilium can be used to improve the security of applications running in a Kubernetes cluster. He explained that, from a security standpoint, Kubernetes needs to be thought of in a different way than a general-purpose operating system. With microservices running in a Kubernetes cluster, he said it is possible to better reduce and control the attack surfaces as each specific microservice is only supposed to perform a discrete set of operations.

Wendlandt said that he is focused on runtime security attacks, those that happen as the application is running. "Attacks happen when a malicious actor finds an alternative execution path or data flow that's different than what the application team intended, but it's still permitted by the infrastructure," he said. Security controls should prevent these invalid execution and data flows, though Wendlandt cautioned that it's easier said than done, since the security controls should not limit legitimate actions and use. "Ultimately what I think is so cool about BPF and the potential of BPF for Kubernetes security is that it's not just about being more secure, it's about being more secure in a way that's very low friction and doesn't get in the way of your application developers," he said.

He explained how BPF works and what it does, noting that it can be used to provide a functions-as-a-service capability for kernel events; based on a particular event, some function can be triggered. Each of the triggered functions runs within the BPF sandbox, which isolates the function to limit unauthorized actions. With BPF, Wendlandt said that it's possible to dynamically put in hooks that enable Kubernetes and microservices intelligence.

BPF maps were described by Wendlandt as efficient data structures that persist across function invocation. In his view, BPF maps can also be thought of as arrays and hash tables. He added that BPF programs can read and write to BPF maps and that they persist beyond the lifetime of an individual execution of a BPF program. For further reference on persistent BPF objects, Wendlandt suggested an LWN article detailing what it's all about. "BPF maps can both be written by the BPF programs that are running in the kernel but also by BPF-aware tools that are running in user space," he said.

Beyond iptables

Iptables provides firewall rules based on IP addresses, which can work well for externally facing resources like web servers, though iptables is not an ideal approach for Kubernetes. Wendlandt said that in Kubernetes, pods come and go; there can be thousands of services running in a cluster. "Iptables was not really designed for that type of scale, it was fundamentally designed around IPs and ports and, if you know anything about Kubernetes networking, IPs are very ephemeral, they're coming and going and they're not meaningful for long-term identity."

What matters for identity in Kubernetes is labels. He added that with BPF and Cilium there is an understanding of Kubernetes identity that is built into the kernel processing layer both from an enforcement perspective and from a visibility perspective. "We do a lot to elevate the notion of identity from an IP address to things that are more meaningful in a cloud-native world," Wendlandt said. "Cilium is fully transparent to the apps, there are no side cars or anything, that's part of the power of BPF right? The traffic is just going through the kernel, we can grab it with BPF and do whatever we need to it without requiring the application to even be aware that we're doing that."

During his talk, he used an example of a server-side request forgery that could compromise an entire Kubernetes cluster. A compromise could come from any number of different attack vectors; even the smallest misconfiguration could represent a risk. "There's no amount of code that's too small to potentially have a security relevant bug in it," he said. "You just have to assume that at some point, some app in your environment is going to be compromised and you have to decide how you will contain that blast radius."

With BPF, it's possible to better understand what an application is doing, and constrain it to the exact space it requires, so that attackers don't have extra degrees of freedom to go down paths they shouldn't. Wendlandt said that the nature of Kubernetes microservices makes it possible to lock down access in a way that isn't as easily possible for any generic application running on Linux. While the term microservices is often just used as a buzzword, he said that it has a specific meaning for Kubernetes. "The micro in microservices means you're not running a single container with lots of stuff in it, rather you're trying to strip it down to do some basic unit of work, and it's actually a very sane and clean environment," he said.

He added that, for debugging microservices, it's possible to tell the difference between an individual coming in with the kubectl exec command and making a network call and the main process making the network call. As such, in Kubernetes it's possible to give the different access methods their own permissions. On the services side of microservices, Wendlandt said that IPs are not meaningful, rather an understanding of what services a given pod has access to has more value for security and policy. He emphasized that, in Kubernetes, identity is tied to the service API being offered via Kubernetes labels and not IP addresses. That's where Cilium comes into play, providing visibility into gRPC, HTTP, and service API requests.

Identifying and stopping runtime attacks

Limiting the risk of runtime attacks starts with having an understanding of what the expected behavior is for a given microservice. With that understanding, Wendlandt said that it's possible to monitor for deviations that are outside of the normal operating parameters. "We can take the measuring and monitoring information on how the application is supposed to behave, using information extracted via BPF, and use it to enforce security policies to limit a pod to only perform those actions."

Wendlandt demonstrated how using Cilium, along with tools from the BCC tools project, makes it possible to understand when a given system call runs (such as connect(), fork(), or exec()) and then extract additional metadata that can be fed into Cilium for making enforcement decisions.

He explained that everything in a Kubernetes pod is in a single network namespace, with one IP address, so a traditional firewall like iptables can't tell the difference between different events coming from a given pod, because it all comes with the same IP address. "With BPF, we can pull in operating system granularity, so that we can lock each one of the bits of code execution down to exactly the minimum workflow that they need and nothing else."

YouTube videos of the Chaos Testing and eBPF Security sessions are available.

Page editor: Jonathan Corbet

Copyright © 2019, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds