
Leading items

Welcome to the LWN.net Weekly Edition for March 28, 2019

This edition contains the following feature content:

  • The Debian project leader election: a look at the five candidates for Debian project leader and their positions.
  • The state of the OSU Open Source Lab: how OSU's Open Source Lab hosts FOSS projects and mentors students.
  • The congestion-notification conflict: the L4S and SCE proposals both want the ECT(1) codepoint.
  • Building header files into the kernel: making kernel headers available to module and BPF builds via /proc.
  • Whither WireGuard?: the path toward getting WireGuard and its Zinc cryptographic library upstream.
  • Case-insensitive ext4: optional case-folding support for ext4 directories.

This week's edition also includes these inner pages:

  • Brief items: Brief news items from throughout the community.
  • Announcements: Newsletters, conferences, security updates, patches, and more.

Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.

Comments (none posted)

The Debian project leader election

By Jake Edge
March 27, 2019

While a few weeks back it looked like there might be a complete lack of Debian project leader (DPL) candidates, that situation has changed. After a one-week delay, five Debian developers have nominated themselves. We are now about halfway through the campaign phase; platforms have been posted and questions have been asked and answered. It seems a good time to have a look at the candidates and their positions.

The five candidates are Joerg Jaspert, Jonathan Carter, Sam Hartman, Martin Michlmayr, and Simon Richter. Platforms for four of the candidates can be found here along with their rebuttals to the other platforms. Simon Richter has not provided a platform or participated in the debian-vote mailing list since his nomination mail on March 17. It is not clear what that means and there was no response to an email query about his plans. The other four candidates provided detailed platforms that outlined their experience in the Debian project and their vision for its future.

Joerg Jaspert

Jaspert has been involved with the project since 2002 and has been a member of the FTP Masters team and Debian Account Managers (DAM) team for over a decade. He has lots of other roles in the project as well. He would like to see the project engage with those who are choosing other distributions as well as those who are choosing Debian derivatives. The intent would be to "see if we can enhance Debian to provide the features, while balancing it with our current users".

He sees the role of the DPL as one of enabling others in the project to get their work done; that work specifically includes more than just development and packaging:

Debian does not only consist of packaging and similar technical activities, we have a lot of other contributions that are equally important. We need to encourage more people who aren't well versed in the technical work to contribute in various non-technical ways. That can range from writing documentation, translations, help with the website, design work, help users with their problems or representing Debian at events. Or organizing such events, like Bug Squashing Parties or local "miniconf" gatherings. All of that is as important as the packagers work.

Jonathan Carter

Carter has been in and around Debian since the early 2000s, with a detour into Ubuntu for a while. He got back involved with Debian by way of DebConf, eventually helping to organize one in his native Cape Town, South Africa in 2016. He became a Debian developer in 2017 and is part of the DebConf committee. "I now actively maintain over 60 packages and have recently joined the debian-live project to help improve the quality of our live images."

His platform is lengthy, with a lot of detailed bullet points on his goals and eight separate steps he would like to take to make those goals a reality. For example:

Implement a 100 papercuts campaign: Create a project to identify the 100 most annoying small problems with Debian. These would typically be items that can be solved with a day's worth of work. This could be a technical bug or a problem in the Debian project itself. The aim would be to fix these 100 bugs by the time Buster (Debian 10) is released. A new cycle can also be started as part of an iterative process that's based on time and/or remaining bug count.

Sam Hartman

Hartman has been part of Debian since 2000. He initially started by packaging Kerberos, but soon ran into problems because of the US export restrictions. He helped navigate the legal problems with the cryptographic packages, which eventually allowed that code to move into the main Debian repository.

His focus seems to be on smoothing things out within the Debian community through mediating disputes and trying to help the project make decisions on contentious topics. Overall, the goal is to keep Debian fun:

Lucas Nussbaum wrote an excellent summary of the DPL responsibilities. Of these, I think the most important is keeping Debian fun. We want people to enjoy contributing to Debian so that they prioritize it in their busy schedules. We want to make it easy for people to do work: processes and interactions should be streamlined. When people have concerns or things don't work out, we want to listen to them and consider what they want to say. We want Debian to be welcoming of new contributors.

Martin Michlmayr

Michlmayr has been involved with Debian since 2000 and served as DPL from 2003 to 2005. He started his platform by confessing that he had considered retiring from Debian over the last few years because the project "just doesn't seem all that exciting anymore". But the project and community are too important to him to let go. He suspects that he is not alone in feeling that way and he would like to find ways to change that.

He pointed to a blog post by Michael Stapelberg that highlights some of the problems that Michlmayr also sees in the Debian project.

The open source world has fundamentally changed in the last 5-10 years in many ways. Yet, if you look at Debian, we mostly operate the same way we did 20 years ago. Debian used to be a pioneer, a true leader. Package managers, automatic upgrades, and packages builds on 10+ architectures — they were all novel, true innovations at the time. The only significant innovation I can think of that came out of Debian in recent years are reproducible builds. Reproducible builds solve an important problem and the idea has spread beyond Debian to the whole FOSS world.

I hate when large companies talk about being "nimble" or similar business buzz words. But looking at Debian, I finally understand what they mean — the project has evolved in a way that makes change difficult. We have failed to [adapt] to the new environment we find ourselves in and we're struggling to keep up with an ever-faster changing world.

Questions and answers

The above is meant to simply give a bit of a taste of the ideas that the candidates are running on; those who will be voting or are otherwise interested will find them worth reading in full. Around the time the platforms were first posted, the usual Q&A got started in the debian-vote mailing list. One of the first was about Stapelberg's blog post; Andreas Tille asked a two-part question based on it. The first regarded the legendary leeway that Debian package maintainers have over their packages, while the second asked about the future of collaborative packaging efforts like the Salsa project.

Jaspert's positive response to the second part spawned a sizable sub-thread discussing ways to make packaging more collaborative and, crucially, to have a more standardized packaging methodology for Debian. The other candidates were also in favor of at least discussing it (Hartman), documenting the existing workflows (Carter), and recommending that packagers use Salsa, or perhaps even "go further than that" (Michlmayr).

Most seemed in favor of moving away from the "wild west" of packaging, where maintainers can do whatever they like so long as it follows the policy manual. It is not clear how far anyone would want to take that (and, on their own, a DPL can't really effect that kind of change). Of the four, only Carter expressed reservations about the blog post, as he found it lacking in solid arguments for stepping away from the project.

In his platform, Michlmayr called for more effort toward funding Debian projects, possibly even using some of the project's funds. Hartman asked his fellow candidate about that, and about a recent statement "about potentially turning the DPL into a paid position, acknowledging that would be controversial". Hartman agreed that a lack of funding was holding Debian back to some extent, but he worried about a repeat of the controversial Dunc-Tank experiment:

It seems like having the Debian Project and DPL working to get more paid developers might run into some of the same issues. In particular there might be a perception that there would be two classes of developers and that volunteers would be frustrated/disappointed they were not getting paid.

Michlmayr acknowledged the problems that Hartman described, but he thinks things may have changed since 2006, when the Dunc-Tank experiment was run; Debian needs to at least consider the possibility of funding projects and developers, he said. That is not the only way to fund more Debian developers, of course, and Michlmayr would like to involve more companies so that there are lots of opportunities for Debian developers to be paid for what they do. He is also concerned that the Dunc-Tank example perpetuates the avoidance of considering paying for staff:

BTW, Debian is already paying Outreachy students to work on Debian. The only controversy here is around diversity, not about paying people. Debian could offer more grants from Debian money to test the waters to find out what people are comfortable with.

Finally, I see one risk: we keep repeating that something is controversial even though we're not sure it's *still* controversial. By repeating this myth, we're keeping it alive. The world has changed. Debian has to finds ways to adapt.

There are, of course, other questions being asked and answered. For example, the platforms of Hartman and Michlmayr have enough parallels and agreement that two separate questions have been posed about the two potentially teaming up. Other questions concern whether Debian should truly aim to be the "universal operating system", if free software and/or Debian are inherently political, and the communication platforms used by the project. Undoubtedly, more will be posted over the next week and a half or so.

In the end, the candidates clearly have the best interests of the project in mind, though they may have different approaches. Jaspert seems like the most "stay the course" candidate, perhaps, while Michlmayr is probably the candidate most interested in rocking the boat—or at least considering rocking it. The DPL's power is extremely limited, so any real changes will require convincing a substantial portion of the members—starting with convincing them to vote in your favor for the DPL position. The voting period runs from April 7–20. It will be interesting to see who wins, but even more interesting to see what happens after that.

Comments (27 posted)

The state of the OSU Open Source Lab

By Jake Edge
March 26, 2019

SCALE

The Oregon State University Open Source Lab (OSU OSL) has been a longtime hosting site for a wide variety of free and open-source software (FOSS) projects. At SCALE 17x, OSL director Lance Albertson gave an overview of what the lab does, some of its history, and its role in mentoring undergraduates at OSU. There are a lot of facets to the lab and its work, most of which fly under the radar, which is why Albertson came to Pasadena, CA to fill attendees in.

Background

OSL acts as a FOSS hosting company, providing free or low-cost hosting to a variety of projects. It offers colocation or virtual machines (VMs) in a private cloud. It can also provide access to a wide array of different CPU architectures. Beyond that, the lab is a distribution and mirroring site for multiple projects.

Something that is not as well known, he said, is that OSL mentors undergraduates, which allows them to gain real-world experience with production systems. A number of alumni from the lab, including the founders of CoreOS, have moved into high-profile jobs in the industry. The lab has one full-time employee, Albertson, and typically six to ten undergraduates.

[Lance Albertson]

It was started in 2003 by Scott Kveton and Jason McKerr, who worked for the OSU information services department. "The cloud did not exist" back then, so the lab offered colocation hosting for FOSS projects. Three early projects were Gentoo, Debian, and Freenode. In those early days, the existence of the lab spread by word of mouth among the projects, eventually attracting kernel.org, the Apache Software Foundation, Drupal, and the Linux Foundation.

Its initial funding came from OSU, based on the cost savings the university realized by switching to FOSS. Google and RealNetworks were early sponsors. OSL moved to the college of engineering in 2013. Its ongoing funding model is to get corporate donations; IBM, Google, and Facebook are big donors. It also has hosting contracts with the Linux Foundation, Drupal, and the Open Source Robotics Foundation. Other companies donate hardware or bandwidth and there are individual donors as well. At this point, OSL gets no direct funding from OSU or the state of Oregon, which makes fundraising a yearly challenge.

The role of the lab is to be a neutral hosting facility and to foster relationships between FOSS projects and companies, Albertson said. It provides a stable, physical home for core FOSS projects that is flexible to the needs of each project. It gives access to less-common hardware and CPU architectures, including OpenPOWER and, soon, RISC-V, along with compute and storage resources, such as software mirroring and continuous integration and deployment (CI/CD). The lab also helps projects with their systems engineering needs and helps train the next generation of open-source leaders.

He put up a list of new projects (which can be seen in his PDF slides or the YouTube video of the talk), which showed around a dozen new projects for general hosting and double that for OpenPOWER hosting. The list of current projects was an eye chart over two slides, totaling up to almost 200 projects. That list does not include subprojects and several of the listed projects (e.g. the Linux Foundation and Apache Software Foundation) have lots of subprojects.

Students

Many of the alumni from the lab have landed in prominent positions in the industry, including at CoreOS (as mentioned earlier), the Linux Foundation, Microsoft, Amazon, Apple, Tesla, Red Hat, and more. Students at the lab interact with open-source projects on a daily basis; the lab runs a help desk that is staffed by the students, so they handle requests via email, IRC, Slack, and so on. OSL is a "Chef shop", he said, so students spend a lot of time creating new cookbooks and maintaining existing ones.

One of the more important pieces is that students get hands-on experience with hardware at the lab. It is relatively difficult to get that kind of experience these days, since many companies are using public clouds instead of their own data centers. The students get to learn about the quirks of installing (and retiring) real hardware, which is valuable. In addition, students handle all of the support tickets for OSL for a week on a rotating basis. This ensures they get wide experience with all of the different systems in the lab.

The hiring process consists of an open-book quiz, with basic questions about Linux; there are simple Bash and Chef exercises as well. After that is an in-person interview with both technical and non-technical questions. Applicants are not expected to necessarily know the answers; how they think about the problem will help assess their problem-solving abilities.

Once a student has been hired, there is an onboarding process that includes a walkthrough guide and some Chef training. Students start working with test cookbooks and work up to changes to the production cookbooks, which involve pull requests with review from more senior students. After two to three months, students get added into the support-ticket rotation.

One of the challenges is that "summer is coming". That means some students will graduate, some will get internships or other opportunities, and some can work full time at the lab. Students with internships may come back to work for another year or two at OSL, or they may get part-time remote work with the company where they interned. It is something that he has to juggle each year; summer is when some of the larger projects get tackled, since the students can work full time when they are not taking classes.

Platforms

He then went into all of the various platforms that OSL handles. The current and new systems in the lab are running CentOS 6 or 7 for servers and Debian 8 ("jessie") or 9 ("stretch") for staff workstations. They are all managed by multi-platform Chef cookbooks, which have both unit and integration tests. There is also a pile of legacy systems that are CentOS 6 or Gentoo Linux managed by CFEngine.

The lab does not have a hardware budget, he said; instead it relies on in-kind donations. In 2012, Intel donated a bunch of servers that had been hosting MeeGo, for example. EMC donated hardware in 2016, as did Facebook, while Hudson Trading donated pallets of 10Gb switches in 2018. The lab has a wish list, which includes 1U/2U compute and storage nodes, large (>3TB) SATA hard drives and SSDs, 40Gb end-row switches, and 1Gb top-of-rack switches.

The core services that OSL provides to FOSS projects include mailing lists, email forwarding, DNS, and web application hosting. It also provides systems engineering consulting to the projects. Projects can either have managed or unmanaged hosting. The managed projects have systems that are kept up to date, with services configured and managed by OSL students, all via Chef. On the other end, unmanaged projects get a VM and have to manage all aspects of the system; the lab only requires an account with sudo privileges for troubleshooting and emergencies.

For software mirroring, there is a three-server cluster, with hosts in Corvallis at OSU, Chicago, and New York. Round-robin DNS spreads the load across the cluster, which handles an average of 1.7Gbps across the three nodes. It can store up to 15TB, of which 12TB are currently being used for more than 100 projects. The hardware is overpowered for what is needed, but came from a donation by IBM: POWER8 systems with 256GB of RAM.

There are more than 300 colocated hosts for various projects; certain projects (e.g. Gentoo, Linux Foundation, Apache Software Foundation) have their own project racks. These are all in a data center that is shared with OSU. Of the 70 racks in that data center, OSL uses 32, though not all of them are full.

The lab has recently "dived into Ceph", Albertson said. It has built two storage clusters using Ceph: a five-node cluster for OpenPOWER OpenStack and an eight-node cluster for x86 OpenStack and other OSL services. The primary use for both is as block storage for OpenStack. There are future plans to investigate object storage for Ceph.

OSL runs two different cloud platforms: OpenStack and Ganeti. The primary platform is Ganeti, which came out of Google, but the company has moved away from it. It is stable, easy to maintain, and came about before OpenStack even existed; OSL has been using it since 2009. But Ganeti has no public API, so it is poor at providing self-service.

OpenStack, on the other hand, has gotten really stable but it is difficult to maintain, he said. OSL started deploying it in 2013. OpenStack does have a public API and is "really good for self-service". It started as a PowerPC little-endian (i.e. ppc64le) cluster, but there is now an x86 cluster available as well. It is used by multiple projects including the Linux Foundation, GCC, and the GNOME project.

The main production cluster runs Ganeti and provides VMs for multiple projects. There are also project clusters that are managed by the lab, such as for the Python Software Foundation and CiviCRM.

Collaboration with IBM

OSL has collaborated with IBM on a variety of projects over the last ten years. One way or another, OSL has had some hardware available for Power ISA builds. That really started to pick up after the release of POWER8, which resulted in an OpenStack cluster of five POWER8 systems (with around 225 VMs) and three POWER9 systems (with around 22 VMs). Over 100 projects are using the cluster; many of the ppc64 and ppc64le binaries that attendees likely use were built on this cluster, he said. That hardware has either been donated or loaned by IBM. There are also bare-metal POWER machines for the GCC compile farm, Debian, and FreeBSD.

There is a collaboration between OSL and the OSU Center for Genome Research and Biocomputing (CGRB) to provide GPU access on OpenPOWER systems to FOSS projects. The hardware is managed by CGRB and projects can access it via Son of Grid Engine, which is a kind of batch scheduler for high-performance computing.

OSL has created a Jenkins portal that will allow projects to submit CI jobs to the GPU hardware. Albertson is working on using OpenStack Zun to provide a container facility for shell access to the GPUs. It is not feasible to share GPUs via VMs due to limitations with PCI passthrough. Right now, he is using older GPUs provided by CGRB, but if the project is successful, IBM will consider making some more advanced hardware available.

Beyond that, there are two LPARs on an IBM zSeries (s390x) system at Marist College in New York that can be used by projects. OSL has a Jenkins portal and Docker images can be submitted to be run. There is some AIX hardware that OSL is hosting, but does not manage. Select projects are given access to those for building and testing on AIX.

Recent work

The lab manages 130 or so hosts with Chef. Over the last year, OSL moved to Chef 13 and will start to migrate to Chef 14 soon. There is an ongoing move from CFEngine to Chef that started in 2013; he believes it will finally be done this year, as a lot of progress was made in 2018.

There is a compile farm based on Open Compute Project (OCP) hardware that was donated by Facebook. There are 90 compute nodes, each with 140GB of RAM, 3TB of disk, and a 10Gb NIC. The GCC compile farm was the first project to start using these nodes, but now there are multiple projects using them, including Debian and Fedora RISC-V builds using QEMU. 59 of the 90 nodes have been allocated at this point.

That hardware actually sat idle for a few years because there were logistical problems installing the seven-foot-tall racks, which have special power requirements. The OSU data center is on the second floor, but the elevators were not tall enough to transport the racks. That was solved when OSL moved into a new building (for other reasons) and repurposed a ground-floor room as a mini data center for the OCP racks. He is still seeking around $150K in donations for cooling upgrades.

Albertson closed his talk with a laundry list of goals for the coming year. That included upgrades of various tools like OpenStack, Ceph, and Chef. Beyond that, there is a need to start replacing the aging OSL core network. CentOS 8 is on the horizon, so OSL needs to get ready to migrate systems to that, perhaps eliminating CentOS 6 along the way. An OpenStack cluster of Arm systems may be in the offing as well; an Arm startup has approached the lab to replicate what it has done with IBM, but with Arm systems.

The talk provided a detailed look at the innards of a longtime FOSS ecosystem player. OSL provides a lot of compute—and human—power throughout the FOSS world, but it is rarely seen or heard from. Albertson helped bridge that gap; one can only hope that OSL continues to prosper for a long time to come.

[I would like to thank LWN's travel sponsor, the Linux Foundation, for travel assistance to Pasadena for SCALE.]

Comments (3 posted)

The congestion-notification conflict

By Jonathan Corbet
March 22, 2019
Most of the time, the dreary work of writing protocol standards at organizations like the IETF and beyond happens in the background, with most of us being blissfully unaware of what is happening. Recently, though, a disagreement over protocols for congestion notification and latency reduction has come to a head in a somewhat messy conflict. The outcome of this discussion may well affect how well the Internet of the future works — and whether Linux systems can remain first-class citizens of that net.

Network congestion is a fact of life; when it occurs, the only useful response is to get senders of traffic to slow down. Many governments place traffic signals on the on-ramps to major highways in congestion-prone areas in an attempt to limit traffic entering and to keep things flowing. Network traffic can benefit from similar controls, but the placement of traffic signals at every entry point to the net is impractical. So network protocols must rely on other types of signals to learn when they should reduce their transmission rate.

Protocols like TCP, unfortunately, were not designed with such signals in mind, so congestion-control algorithms have been built to use the one signal that is always reliably delivered: dropped packets. But dropping packets on the floor can make things worse: it forces the data to be transmitted again, it introduces delays, and, by the time a packet is dropped, congestion is already occurring. It would be better to inform senders of congestion in a less heavy-handed manner, before that congestion becomes a problem.

ECN and its discontents

The explicit congestion notification (ECN) protocol, standardized in 2001, was an attempt to improve the situation by informing senders of congestion without dropping packets. ECN repurposed two bits in the IP header; for reasons that will become clear below, it is worth looking at how those bits are interpreted:

    00   Transport is not ECN-capable
    01   ECN-capable transport: ECT(1)
    10   ECN-capable transport: ECT(0)
    11   Congestion experienced (CE)

The ECT(0) and ECT(1) values have the same meaning; they indicate that ECN is supported at least somewhere on the route between two endpoints. In practice all implementations use ECT(0); attempts to give a separate meaning to ECT(1) have not gained traction. ECN famously broke the Internet because many routers would drop packets with either of those bits set; that delayed its adoption for years.
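
To make the two-bit field concrete, here is a small Python sketch (illustrative only, not taken from any implementation) that decodes the ECN field from an IPv4 TOS or IPv6 traffic-class byte; the low two bits carry the ECN codepoint, while the upper six bits hold the DSCP value.

    # Illustrative sketch: decode the ECN codepoint from a TOS/traffic-class
    # byte. The low two bits carry ECN; the upper six bits are the DSCP value.
    ECN_CODEPOINTS = {
        0b00: "Not-ECT: transport is not ECN-capable",
        0b01: "ECT(1): ECN-capable transport",
        0b10: "ECT(0): ECN-capable transport",
        0b11: "CE: congestion experienced",
    }

    def decode_ecn(tos_byte):
        """Return a description of the ECN codepoint in the low two bits."""
        return ECN_CODEPOINTS[tos_byte & 0b11]

    for tos in (0x00, 0x02, 0x03):
        print(f"0x{tos:02x}: {decode_ecn(tos)}")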

ECN has improved the situation, but not enough; it suffers from a couple of significant problems. One is that a "congestion experienced" signal still arrives too late; congestion is already happening and the router is pleading for help. It is also still a heavy hammer; the RFC requires that a congestion-experienced signal be treated as if a packet had been dropped, so congestion-control algorithms respond by severely reducing their transmission rates, then working back up. That can reduce the throughput of a connection (and increase its latency) more than is needed to resolve the problem.

As networks get faster, the demands for lower latencies grow, and as bufferbloat-reduction efforts reduce the amount of queue space available on routers, congestion control needs to become a bit more nuanced. There is widespread agreement on that point. How that nuance should happen is a matter of rather less agreement.

L4S

One attempt at improving congestion control has been developed, mostly slowly and mostly in private, by various industry players; it is called Low Latency, Low Loss, Scalable Throughput, or L4S. The core idea seems to be to replace ECN with a more flexible signal built into a higher-level protocol; data center TCP (DCTCP) is one example. DCTCP acknowledgment packets can include information on how much queuing space is available; senders can use the information to keep the queue full without overflowing it. Linux has supported DCTCP since the 4.1 release.
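
The published DCTCP algorithm (RFC 8257) has the sender track what fraction of its packets were marked and scale its window reduction accordingly; the following toy Python model sketches that response. It is not the kernel's implementation, and the names and defaults are only illustrative.

    class DctcpLikeSender:
        """Toy model of the DCTCP response from RFC 8257, not kernel code:
        the window is cut in proportion to the fraction of marked packets."""

        def __init__(self, cwnd=100.0, gain=1 / 16):
            self.cwnd = cwnd      # congestion window, in packets
            self.alpha = 0.0      # running estimate of the marked fraction
            self.gain = gain      # EWMA gain 'g' from the RFC

        def on_window_of_acks(self, acked, marked):
            frac = marked / acked if acked else 0.0
            self.alpha = (1 - self.gain) * self.alpha + self.gain * frac
            if marked:
                # Classic TCP would halve cwnd here; DCTCP cuts it by
                # alpha/2, so light marking causes only a small reduction.
                self.cwnd *= 1 - self.alpha / 2
            else:
                self.cwnd += 1    # additive increase, one packet per RTT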

The problem with something like DCTCP is that it must work with the active queue-management algorithms running on all of the routers between the two endpoints. Those algorithms see all traffic passing through the router, not just the DCTCP traffic. The proponents of L4S seem to want a sort of privileged treatment for suitably clueful protocols so that they can get low-latency treatment through the router without having to contend with what the L4S draft terms "classic TCP".

To bring that about, L4S redefines the ECT(1) value described above to indicate "this packet is using better congestion notification". Routers would then create two separate queues; a fast one for the L4S traffic, and a slower one for the "classic" traffic. That differentiation can, on its own, raise some eyebrows, but the queue-management algorithm needs to be evaluated as a whole to see what its broader effects, including on fairness, would really be.
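
In outline, the classification step is as simple as looking at that codepoint; here is a minimal sketch based only on the description above (the real DUALPI algorithm layers coupled marking and a PI controller on top of this split):

    def choose_queue(ecn_bits):
        """Minimal sketch of the L4S dual-queue split: packets marked
        ECT(1) go to the low-latency queue, everything else is 'classic'.
        This ignores the many details of the actual DUALPI algorithm."""
        ECT_1 = 0b01
        return "low-latency queue" if ecn_bits == ECT_1 else "classic queue"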

DCTCP is not seen as being entirely safe for use outside of protected environments like data centers. For wider deployment, the intent has been to create a new TCP congestion-control algorithm called "TCP Prague". The L4S portion would then be implemented with a queue-management algorithm called "DUALPI", so named because it maintains the two independent queues described above. Both of these modules had been vaporware until recently: a repository with TCP Prague showed up on March 12 (no attempt has yet been made to submit it for the mainline), and DUALPI was posted to the netdev list on March 11.

SCE

The alternative, pushed by longtime bufferbloat fighters Jonathan Morton and Dave Täht, along with UDP creator David Reed, is called some congestion experienced, or SCE. It is a rather simpler proposal, intended to provide a "congestion is imminent" signal that, once again, is less heavy-handed; it places a higher priority on compatibility with existing TCP congestion-control implementations, though.

SCE also makes use of the ECT(1) value to encode the "some congestion" signal. The full congestion-experienced value would retain its current meaning, with protocols expected to treat it as being equivalent to a dropped packet. The SCE signal, instead, should be interpreted this way:

New SCE-aware receivers and transport protocols SHOULD interpret the SCE codepoint as an indication of mild congestion, and respond accordingly by applying send rates intermediate between those resulting from a continuous sequence of ECT codepoints, and those resulting from a CE codepoint. The ratio of ECT and SCE codepoints received indicates the relative severity of such congestion, such that 100% SCE is very close to the threshold of CE marking, 100% ECT indicates that the bottleneck link may not be fully utilised, and a 1:1 balance of ECT and SCE codepoints indicates that the present send rate is a good match to the bottleneck link.
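
One way to read that paragraph is as a simple ratio; the sketch below is an interpretation of the quoted text, not code from the SCE proposal:

    def sce_congestion_level(ect_count, sce_count):
        """Interpretation of the quoted SCE semantics, not the proposal's
        code: 0.0 suggests the bottleneck may be underused, 0.5 suggests
        the send rate matches the bottleneck, and values near 1.0 approach
        the point where a CE mark (treated like a drop) would be applied."""
        total = ect_count + sce_count
        return sce_count / total if total else 0.0

    # A 1:1 mix of ECT and SCE marks means the current rate is about right.
    assert sce_congestion_level(50, 50) == 0.5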

The code implementing SCE is also quite new; it showed up in the fq_codel_fast repository on March 14. It's worth noting that this proposal does not give intermediate routers a way of knowing whether either endpoint is capable of responding to SCE signals or not. There is, perhaps, an implicit assumption that, once SCE is supported by Linux and FreeBSD, it will quickly become omnipresent.

Which is better?

These two proposals are clearly incompatible with each other; each places its own interpretation on the ECT(1) value and would be confused by the other. The SCE side argues that its use of that value is fully compatible with existing deployments, while the L4S proposal turns it over to private use by suitably anointed protocols that are not compatible with existing congestion-control algorithms. L4S proponents argue that the dual-queue architecture is necessary to achieve their latency objectives; SCE seems more focused on fixing the endpoints.

It looks like a fairly typical battle between a protocol pushed by the largest Internet service providers, and one with a rather more grass-roots origin. There is, however, another important thing to know about L4S: Alcatel-Lucent claims a patent on the dual-queue algorithm. The company has generously offered to make that patent available under "fair, reasonable, and non-discriminatory" terms; such terms are, of course, highly discriminatory against free software implementations. They make it impossible to merge the affected code into a GPL-licensed kernel.

As is the case with many patents, the quality of this one is not universally recognized. Bob Briscoe, one of the developers of L4S, claims loudly that there is prior art for the claims in the Alcatel-Lucent patent and that it should never have been issued. The patent unfortunately exists, though; as long as Alcatel-Lucent continues to claim it, the code cannot become part of the Linux kernel. If L4S becomes the IETF-anointed standard, and if industry adoption follows, Linux could find itself out in the cold.

The disagreement over these protocols reflects a difference in approach between developers and their associated industries. It is subject to all of the usual technical and political maneuverings; the process could be unpleasant to watch as it plays out. One could argue that the Linux community could happily let it play out and simply merge the winner; one might also argue that SCE better matches the values that have shaped our network stack in general. The assertion of that patent, though, raises the stakes considerably; it would not be good for Linux to find itself unable to play with other high-performance network stacks. As long as the patent remains, the technical choice is easy.

Comments (31 posted)

Building header files into the kernel

By Jonathan Corbet
March 21, 2019
Kernel developers learn, one way or another, to be careful about memory use; any memory taken by the kernel is not available for use by the actual applications that people keep the computer around to run. So it is unsurprising that eyebrows went up when Joel Fernandes proposed building the source for all of the kernel's header files into the kernel itself, at a cost of nearly 4MB of unswappable, kernel-space memory. The discussion is ongoing, but it has already highlighted some pain points felt by Android developers in particular.

Fernandes first posted this work in January; version 5 was posted on March 20. As part of the build process, it gathers up all of the kernel's headers (the ".h" files) and a few other artifacts into a compressed tar file; that file is then built into a kernel module. If that module is loaded into the running kernel, the tar file containing the headers can be read from /proc/kheaders.tgz. This is, thus, a way of allowing applications to access the header files that were used to build whatever kernel is running at the moment.
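
Assuming the module from the patch set has been built and loaded (e.g. with "modprobe kheaders"), pulling the headers out is then just a matter of reading that file; a small sketch using Python's tarfile module:

    import shutil
    import tarfile

    # Sketch only: copy the archive exposed by the (assumed loaded) kheaders
    # module out of /proc and unpack it where a module or BPF build can
    # point its include path.
    SRC = "/proc/kheaders.tgz"
    COPY = "/tmp/kheaders.tgz"

    shutil.copyfile(SRC, COPY)
    with tarfile.open(COPY, mode="r:gz") as tar:
        print(f"{len(tar.getnames())} files in the archive")
        tar.extractall(path="/tmp/kheaders")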

The purpose of this mechanism is to make those header files available in situations where they are otherwise unavailable. In particular, developers building kernel modules need access to this information, as do those who are building BPF programs to analyze a system's behavior. In some systems, notably Android-based devices, those header files are almost certainly not easily available. Fernandes has tried other solutions to this problem, such as BPFd, in the past, but all have fallen short. Providing headers with the kernel itself is the solution he has settled on.

Some of the initial reviews were less than entirely favorable; Christoph Hellwig described it as "a pretty horrible idea and waste of kernel memory" while Alexey Dobriyan said that it was "gross". H. Peter Anvin also questioned the memory use and suggested that the data should, at a minimum, be stored in a swappable filesystem. Numerous others chimed in as well, describing the work as a "hack" and saying that, rather than building the tar file into a kernel module, it would be far more straightforward to just place that file in the module directory where it could be read directly. At the same time, a number of other developers have indicated that this feature would be useful; Daniel Colascione even asked whether it could be expanded to hold all of the kernel source.

Nobody seems to disagree with the overall objective of this work. There are times when the kernel headers are needed for development, but those headers tend to be absent on systems like Android. The disagreement is over the idea of building those headers into the kernel itself. This opposition is easy enough to understand; the kernel itself does not need that information to function, so there would have to be a strong reason indeed to sacrifice that much system memory to hold it in kernel space.

There are indeed reasons for doing so, many of which seem to come down to how Android systems are built rather than something more technical. It would be nice if Android simply had a "kernel headers" package but, as Fernandes explained, that is not really practical:

In the Android ecosystem, the Android teams only provide a "userspace system image" which goes on the system partition of the flash. This image is not GPL and doesn't contain anything GPL. It is thus not possible to put kernel headers on the system image and I already had many discussions on the subject with the teams, it is something that is just not done. Now for kernel modules, there's another image called the "vendor image" which is flashed onto the vendor partition, this is where kernel modules go. This vendor image is not provided by Google for non-Pixel devices. So we have no control over what goes there.

The seeming aversion to putting anything GPL-licensed into the system image rubs some developers the wrong way, but it is consistent with the GPL avoidance practiced in most of the Android system. There is another reason why putting the kernel headers there is not a complete solution, though: developers will often cross-build a kernel and ship it to a device for direct booting with the fastboot command. Any headers stored on the device itself will not match that new kernel, so they are useless at best. If the headers are built into the kernel itself, though, they will transfer to the device with that kernel and always be correct.

Even for kernels shipped with devices, though, the "store the headers in the filesystem" solution is problematic. As Fernandes noted, the Android project does not have much control over what vendors put onto their devices or where it goes, so it would be difficult (if not impossible) to mandate the presence of the kernel headers in any sort of standard location. Android can, though, mandate that specific kernel configuration options must be set; with this patch merged, vendors could be made to ship the headers for their kernels in a place where they could always be found. Even if vendors tend to hide their kernel modules in strange places (and they are vendors, so of course they do), the user space code on any given device knows how to find and load them.

In other words, building this information into the kernel is, among other things, a technical solution to the social problem of getting vendors to provide that information in a consistent way. Sometimes such solutions can be what is needed. As Colascione put it: "here's the bottom line: without this work, doing certain kinds of system tracing is a nightmare, and with this patch, it Just Works". Or, as Karim Yaghmour described it:

That, in my view, is a big part of the problem Joel's patch solves: in a system whose functionality requires multiple *independent* parties to work together, I can still get the necessary kernel headers for user-space tools to properly operate regardless of which part of the system is being substituted or replaced.

Proponents argue that, since the information is built into a kernel module, it can be configured out (or simply not loaded) when it is not needed. Anvin worried, though, that mechanisms like this tend to grow into a mandatory role over time.

One associated question is whether providing kernel header files is the best way to provide the needed information to user space. Steve Rostedt said that he would rather have a table describing the kernel's structures, including the offset, size, and type of each field. That is the information that is actually needed much of the time, and it could be more compact than the source code is. Colascione sympathized with the desire for a cleaner format, but argued that it would be better to go with what works now: "Think of the headers as encoding this information and more and the C compiler as a magical decoder ring". Header files also include macros, constant definitions, and other information needed to build BPF programs.

The discussion has gone on at length, provoked anew by each new posting of the patch set. It does not appear to have changed a lot of minds on either side of the debate. Sooner or later, presumably as the 5.2 merge window approaches, somebody (most likely Andrew Morton) will have to make a decision. Given the evident advantages from this patch set, it seems likely that Android kernels may ship it regardless, so it may be mostly a matter of whether the mainline follows suit.

Comments (47 posted)

Whither WireGuard?

By Jonathan Corbet
March 25, 2019
It has been just over one full year since the WireGuard virtual private network implementation was reviewed here. WireGuard has advanced in a number of ways since that article was written; it has gained many happy users, has been endorsed by Linus Torvalds, and is now supported by tools like NetworkManager. There is one notable thing that has not happened, though: WireGuard has not yet been merged into the mainline kernel. After a period of silence, WireGuard is back, and it would appear that the long process of getting upstream is nearly done.

A new version of the WireGuard patches was posted on March 22. WireGuard itself is not particularly controversial; few people have raised complaints about its design or implementation. The sticking point is the "Zinc" cryptographic library that WireGuard uses. Zinc was born out of frustration with the kernel's current cryptographic layer, which is seen by many as being far too difficult to use. Zinc is, in essence, an entirely new cryptographic layer that sits alongside the current code, duplicating a lot of functionality within the kernel but providing an easier interface for cryptographic tasks.

There are a few complaints that have been heard about Zinc. One of those revolves around the fact that Zinc isn't just a new API for accessing cryptographic algorithms; it also includes its own implementation of those algorithms, duplicating functionality that the kernel already has. WireGuard author Jason Donenfeld defends these new implementations, probably correctly, as having been subjected to a higher level of cryptographic review. Kernel developers strongly dislike this kind of duplication, though; they will argue that, if the new implementation of a specific algorithm is better, it should simply replace the existing one rather than duplicating it. That way, there is only one version to maintain, and all users will be able to take advantage of whatever benefits it offers.

The duplicated algorithms have been a sticking point for some time, but it would appear that a solution is in the works. Crypto maintainer Herbert Xu has posted a version of Zinc that introduces the new API, but which uses the existing algorithm implementations rather than Donenfeld's new ones. That makes the API available for users like WireGuard while removing the new algorithm implementations from the discussion for now. Those implementations can, in the future, be evaluated on their own merits and merged, one at a time, when a consensus emerges that they are better.

Past discussions might lead one to expect that Donenfeld would resist this move, but this time around he responded: "I think we're slightly closer to being same page". He plans to make some changes to Xu's version of Zinc, but the version he intends to post will still use existing, in-kernel algorithms where they are available. Assuming that everybody likes the result, one of the major long-term roadblocks to the merging of WireGuard will have been overcome.

Duplication of cryptographic functions is not the only complaint about Zinc, though; others were expressed by Ard Biesheuvel, whose criticisms have done a fair amount to impede Zinc in the past — but those criticisms have also resulted in numerous improvements to the code. Biesheuvel described Zinc as a "layering violation", and complained that it is unable to use the asynchronous algorithm implementations in the kernel. That is by design: Zinc explicitly only supports synchronous implementations (where the caller waits until each operation is done). Asynchronous implementations (which run in parallel, often on an external accelerator, while the caller does something else) are seen as too complex and providing too little benefit.

Biesheuvel disagrees with that view of asynchronous operations, and fears that, in the future, somebody will have to bolt asynchronous support onto Zinc. He would much rather see development effort going into fixing the deficiencies in the existing cryptographic API. He is not alone in this view, but others disagree, including Torvalds, who declared himself to be strongly in the Zinc camp:

And honestly, I'm 1000% with Jason on this. The crypto/ model is hard to use, inefficient, and completely pointless when you know what your cipher or hash algorithm is, and your CPU just does it well directly.

He went on to say that "none of the async accelerator code has ever been worth anything on real hardware and on any sane and real loads"; see his message for the details on his reasoning. If asynchronous crypto accelerators lack value in the real world, then it makes some sense to introduce an API that effectively ignores them. Naturally, this view of asynchronous crypto devices is not universally shared, or support for them would not exist in the kernel. See, in particular, this message from Pascal Van Leeuwen for a rebuttal of some of Torvalds's criticisms. But it does seem clear that asynchronous crypto is not particularly useful to a wide variety of use cases.

If the view expressed by Torvalds (and implicitly by Xu) wins out, and if the next posting of Zinc adequately addresses the concerns regarding duplicated algorithms, then Zinc's path into the mainline will start to look relatively clear. Unless some new problems arise with WireGuard (which seems unlikely, since even those who are opposed to Zinc tend to be supportive of WireGuard), it should be set to be merged as soon as Zinc gets in. That should bring a happy ending to the longish story of getting WireGuard into the mainline, conceivably as soon as the 5.2 development cycle.

Comments (13 posted)

Case-insensitive ext4

By Jake Edge
March 27, 2019

Handling file names in a case-insensitive way for Linux filesystems has been an ongoing discussion topic for many years. It is a (dubious) feature of filesystems for other operating systems (e.g. Android, Windows, macOS), but Linux has limited support for it. Over the last year or more, Gabriel Krisman Bertazi has been working on the problem for ext4, but it is a messy one to solve. He recently posted his latest patch set, which reflects some changes made at the behest of Linus Torvalds.

At the 2018 Linux Plumbers Conference (LPC), Krisman presented his plan for allowing ext4 filesystems to be case-insensitive. That plan would have enhanced the kernel's Native Language Support (NLS) subsystem to better support multi-byte encodings and expand the case-folding to handle UTF-8. NLS exists to handle filesystems, such as FAT, that support file names with different encodings, which are specified at mount time. Krisman posted his patch set to make those changes in December shortly after LPC, but Torvalds objected to the whole idea:

Why do people want to do this? We know it's a crazy and stupid thing to do. And we know that, exactly because people have done it, and it has always been a mistake.

He went on to list a number of different problems that can arise with case-insensitivity—many of which have occurred along the way. He asked for use cases: "I really want to know what is driving this insanity, and what the actual use-case is." But he made it pretty clear that he was—at a minimum—skeptical.

The old DOS/Mac people thought case insensitivity was a "helpful" idea, and that was understandable - but wrong - even back in the 80's. They are still living with the end result of that horrendously bad decision decades later. They've _tried_ to fix their bad decisions, and have never been able to (except, apparently, in iOS where somebody finally had a glimmer of a clue).

Theodore Y. Ts'o, who has been working with Krisman on this effort, had apparently brought the patch set to Torvalds's attention in a private email that Torvalds quotes. Another reply also didn't make it into the thread, but in that message (which Torvalds also quotes) Ts'o noted that there was no plan to support encodings other than UTF-8 (and ASCII), which would be set on a per-filesystem basis. Case-insensitivity would be set on a per-directory basis. Given that, Torvalds was adamant that the NLS code was the wrong place to make these changes:

Either you have a horrible fundamental design mistake that has different per-filesystem locales, or you don't.

If you don't, you shouldn't be touching any of the nls code.

Whatever unicode tables you use for case folding shouldn't be in the nls code.

Ts'o suggested moving the Unicode handling code to fs/unicode rather than changing the NLS code. He also described the current state of play with regard to case-sensitivity in filesystems for macOS and Windows, as well as for network filesystems like Samba and NFS. Over time, Ts'o said, the inconsistencies in handling file names between different filesystems have mostly been eliminated. In January, Krisman posted version 5 of his patch set, which reflects the switch to the fs/unicode directory.

The patch set also makes a more substantial change in that it switches normalization methods. There are multiple ways to create the "same" string in Unicode, which is known as "equivalence". Two different sets of code points that appear the same to a user, but not to the filesystem, would be confusing, so there are normalization mechanisms to allow comparisons that take equivalence into account. Ts'o described the confusion that can result:

In the bad old-days, MacOS X's HFS+ was not normalization-preserving. So it would force filenames to NFD form --- so if the user tried to create a file named Å, and passed in the Unicode string U+212B to creat(2), HFS+ would store it as U+0041,U+030A and that is what readdir(2) would return. Apple has effectively admitted this was a mistake, and their new APFS doesn't do this any more.

Now, both file systems basically say, "we don't care whether you pass in U+212B or U+0041,U+030A; on the screen it looks identical, Å, so we will treat it as the same filename; but readdir(2) will return what you gave us."

The new patch set switched from NFKD to NFD, which in normalization lingo means a switch from "compatibility" to "canonical" decomposition:

The main change presented here is a proposal to migrate the normalization method from NFKD to NFD. After our discussions, and reviewing other operating systems and languages aspects, I am more convinced that canonical decomposition is more viable solution than compatibility decomposition, because it doesn't ignore eliminate any semantic meaning, like the definitive case of superscript numbers. NFD is also the documented method used by HFS+ and APFS, so there is precedent. Notice however, that as far as my research goes, APFS doesn't completely [follow] NFD, and in some cases, like <compat> flags, it actually does NFKD, but not in others (<fraction>), where it applies the canonical form. We take a more consistent approach and always do plain NFD.
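
The difference is easy to demonstrate with Python's unicodedata module: both forms decompose U+212B canonically to U+0041 U+030A, but only the compatibility form (NFKD) rewrites characters like superscript digits, the case Krisman cites. This is purely an illustration of the normalization forms, not the kernel code.

    import unicodedata

    angstrom = "\u212b"          # ANGSTROM SIGN, displays as Å
    superscript_two = "\u00b2"   # SUPERSCRIPT TWO, displays as ²

    # Canonical decomposition: both forms map U+212B to LATIN CAPITAL
    # LETTER A (U+0041) plus COMBINING RING ABOVE (U+030A).
    assert unicodedata.normalize("NFD", angstrom) == "\u0041\u030a"
    assert unicodedata.normalize("NFKD", angstrom) == "\u0041\u030a"

    # Compatibility decomposition also rewrites superscripts and the like,
    # discarding semantic distinctions that NFD preserves.
    assert unicodedata.normalize("NFD", superscript_two) == "\u00b2"
    assert unicodedata.normalize("NFKD", superscript_two) == "2"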

As those quotes indicate, normalization is a messy business. In fact, the whole problem of case handling is a horrific mess, as Torvalds (and others) noted. But there are use cases, mostly involving interoperability with other operating systems. In addition, user-space implementations, with a variety of shortcomings, exist for both Android (to support /sdcard) and Samba—those could perhaps be replaced with an in-kernel solution.

That posting did not generate all that many comments, though there was a question from Pali Rohár about the normalization change. He was concerned that NFD would be incompatible with various other Linux user-space tools. But Krisman explained that the patch set implements name-preserving semantics and that NFD is only used internally for comparison.

Handling invalid UTF-8 byte sequences also came up. There are effectively two possible ways to handle the problem, Krisman said: the filesystem can either reject any file name that is not valid UTF-8 (and fix any that are found on the disk) or simply treat an invalid UTF-8 file name as it is treated today, with no case-folding or normalization. Both are implemented and a given filesystem's behavior can be configured with a feature flag; the default is to treat such names as opaque byte sequences, as they are currently.
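
A sketch of those two behaviors (in Python rather than the kernel's C, and purely as an illustration of the policy, not the actual patch code):

    def handle_name(name_bytes, strict):
        """Illustration of the two policies described above: 'strict'
        rejects names that are not valid UTF-8; otherwise the name is kept
        as an opaque byte string, with no case-folding or normalization."""
        try:
            name_bytes.decode("utf-8")
            return name_bytes            # valid: eligible for case-folding
        except UnicodeDecodeError:
            return None if strict else name_bytes

    assert handle_name("Å".encode("utf-8"), strict=True) is not None
    assert handle_name(b"\xff\xfebroken", strict=True) is None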

On March 18, Krisman posted version 6, with few changes from the previous version. He is trying to flush out any opposition to the normalization change (or anything else in the patch set), presumably in the hopes of getting it upstream soon. So far, there has only been a question from Randy Dunlap about the impact on ext3 filesystems, which are handled by the ext4 code. Ts'o noted that "strictly speaking, there is no such thing as an 'ext3 file system'" these days. Filesystems handled by the ext4 code are defined by the feature bits they have set; if you create a filesystem using "-t ext3" and do not override any of the options, though, it will not have any of the new features enabled, thus it will be unaffected by them.

In order to use the feature, the filesystem will need to be created with encoding-awareness information stored in the superblock. On an encoding-aware ext4 filesystem, case-insensitivity can be enabled on an empty directory (and its children) by setting an inode attribute. That can be done using the EXT4_CASEFOLD_FL ioctl() command, though eventually the chattr command would presumably be updated to add support for the case-folding flag. It should be noted that case-folding and ext4 encryption cannot be used concurrently for the same directory, though Krisman is planning to remove that restriction down the road.
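
For the curious, here is a speculative Python sketch of what flipping that flag could look like; the ioctl request numbers below are the x86-64 encodings of the generic FS_IOC_GETFLAGS/FS_IOC_SETFLAGS interface, and the 0x40000000 flag value is an assumption taken from the patch series rather than a settled kernel ABI.

    import fcntl
    import os
    import struct

    # Speculative sketch, not code from the patch set. The ioctl request
    # numbers are the x86-64 values of FS_IOC_GETFLAGS/FS_IOC_SETFLAGS; the
    # flag value is assumed from the EXT4_CASEFOLD_FL definition in the
    # series and could change before merging.
    FS_IOC_GETFLAGS = 0x80086601
    FS_IOC_SETFLAGS = 0x40086602
    EXT4_CASEFOLD_FL = 0x40000000

    def enable_casefold(directory):
        """Set the case-folding attribute on an (empty) directory."""
        fd = os.open(directory, os.O_RDONLY | os.O_DIRECTORY)
        try:
            buf = fcntl.ioctl(fd, FS_IOC_GETFLAGS, struct.pack("l", 0))
            (flags,) = struct.unpack("l", buf)
            fcntl.ioctl(fd, FS_IOC_SETFLAGS,
                        struct.pack("l", flags | EXT4_CASEFOLD_FL))
        finally:
            os.close(fd)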

Both encoding-awareness and case-insensitivity are fairly large changes to the traditional handling of file names. Unix file names have always been sequences of arbitrary byte values (excluding NUL and "/") that are not interpreted in any way. If these changes are adopted, some ext4 filesystems will substantially change the semantics of various filesystem operations; file creation and renaming, for example, will no longer operate the way they do today.

However, case-insensitivity is a feature that has been a long time coming and we may see it in the mainline before long. At this point, though, it has only run the gauntlet of the filesystem mailing lists; when it gets posted to linux-kernel, there may be others with opinions—or outright objections. If not, though, Linux 5.3 or 5.4 might just have a feature that has been on some people's wish lists for a decade or two.


Comments (58 posted)

Page editor: Jonathan Corbet


Copyright © 2019, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds