
LWN.net Weekly Edition for February 6, 2020

Welcome to the LWN.net Weekly Edition for February 6, 2020

This edition contains the following feature content:

  • Browsers, web sites, and user tracking: a Google proposal to rework the User-Agent header leads to a wider discussion of tracking by the company.
  • A new hash algorithm for Git: the long effort to move Git past SHA‑1 is slowly coming to fruition.
  • The 5.6 merge window opens: the first set of features merged for the 5.6 kernel.
  • Accelerating netfilter with hardware offload, part 2: how offloaded packet filtering works in the netfilter subsystem and how administrators can use it.
  • Postponing some feature removals in Python 3.9: giving projects a bit more time to drop Python 2.7 support.

This week's edition also includes these inner pages:

  • Brief items: Brief news items from throughout the community.
  • Announcements: Newsletters, conferences, security updates, patches, and more.

Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.

Comments (none posted)

Browsers, web sites, and user tracking

By Jake Edge
February 5, 2020

Browser tracking across different sites is certainly a major privacy concern and one that is more acute when the boundaries between sites and browsers blur—or disappear altogether. That seems to be the underlying tension in a "discussion" of an only tangentially related proposal being made by Google to the W3C Technical Architecture Group (TAG). The proposal would change the handling of the User-Agent headers sent by browsers, but the discussion turned to the unrelated X-Client-Data header that Chrome sends to Google-owned sites. The connection is that in both cases some feel that the web-search giant is misusing its position to the detriment of its users and its competitors in the web ecosystem.

The original review request was made on TAG's GitHub repository by Chrome developer Yoav Weiss. The change being proposed is an attempt to reduce the abuse of the User-Agent header, both as a way to fingerprint individual users and as a mechanism to limit which browsers can view a site. The "User-Agent Client Hints" draft describes a new mechanism for clients to identify themselves to web servers. If adopted, it would get rid of User-Agent entirely and replace it with the Sec-CH-UA header that would, at least initially, provide much less information for web servers to (ab)use. Servers could request more detailed information about the browser and platform in their responses, which the browser could then choose to provide. The name of the header puts it in the Sec- namespace, which means that it cannot be changed by JavaScript running on the page; the rest of the name abbreviates "client hints" and "user agent".
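
Roughly, the negotiation would look like the exchange sketched below. This is illustrative only; the header names shown follow later revisions of the specification and current browser behavior, and the draft under review differed in some details.

    # By default, the browser volunteers only a coarse brand and major version:
    GET / HTTP/1.1
    Sec-CH-UA: "Chromium";v="80"

    # A server that wants more detail opts in via its response:
    HTTP/1.1 200 OK
    Accept-CH: Sec-CH-UA-Platform, Sec-CH-UA-Arch

    # Subsequent requests to that origin then carry the extra hints:
    GET /app HTTP/1.1
    Sec-CH-UA: "Chromium";v="80"
    Sec-CH-UA-Platform: "Linux"
    Sec-CH-UA-Arch: "x86"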

There is lots more to it than that, but that is the high-level gist. The review request gathered a few comments, several of which were skeptical of the need for the feature as well as concerned about its impact—especially on niche browsers or those that are just starting to get a foothold in the marketplace. There is, at minimum, a perception that Google may be trying to use its dominance in browsing, ad serving, web search, and other web services to undercut existing or future competitors.

A comment on the review request raised the issue of the X-Client-Data header; a user named "kiwibrowser" (presumably Arnaud Granal, who is the founder of the Chromium-based Kiwi Browser) asked Weiss about the header. "Did you consider removing the installation and Google-specific tracking headers (x-client-data) that Google Chrome is sending to Google properties ?" A pointer to that comment was subsequently posted on Hacker News, which set off a sizable stream of comments.

According to a Google privacy white paper linked in both places, X-Client-Data is meant to facilitate A/B testing in the browser and on various Google sites, but the concern is that it helps Google associate an ID with a browser. As kiwibrowser put it: "it doesn't make sense to anonymise user-agent if you have such backdoor"

How unique the X-Client-Data value actually is was one subject discussed in a sub-thread on Hacker News. In the best case (or worst case from the perspective of those trying to identify and track users), the white paper says that values from zero to 7999 are chosen for a seed, which results in 13 bits of entropy (8,000 possible values is just under 2^13). But that is only true if usage statistics and crash reports are turned off, which is not the default; the entropy for the default installation is not specified, but it is sure to be higher, which means that it is more likely to be unique to a specific user's browser. Beyond that, more information will be used: "Experiments may be further limited by country (determined by your IP address), operating system, Chrome version and other parameters."

The active "variations" can be seen in the "chrome://version" screen in Chrome, but it is difficult to get any real sense for how they are used—and what they might be revealing. In addition, the X-Client-Data header is only sent to a select group of Google sites, which can be seen in the Chromium source code. That list contains various ad services, such as DoubleClick, which probably helps fuel the impression that the header can be used for tracking. Some were also concerned that only sending the header to Google sites is intended to provide a competitive advantage; any tracking ability that it provides is not shared with other sites.

These days, lots of browsers are based on Chromium, as are frameworks like Electron. Several Hacker News posters checked some of these derivative browsers and reported that they did not send X-Client-Data. An Electron maintainer said that the JavaScript framework has also dropped that code. The ungoogled-chromium repository was also mentioned as a fork that removes the code to send the header among other de-Googling changes.

But the truth of the matter is that people using Google's services (and browser) open themselves up to plenty of easy ways to be tracked. While the white paper tries to present the tracking in a positive, or at least neutral, light, it is clearly the case that the search giant has an enormous trove of data that it can use for ad targeting—or nearly anything else it chooses. Even switching away from Chrome entirely does not change much in terms of tracking for those signing into the Google mothership. And that data can be shared with other entities that wish to track the browsing habits of individual users, for advertising or even more malodorous activities. The Google privacy policy talks about sharing its data with its partners and allowing them to collect browser information, but it is not entirely clear how much protection that actually provides.

Browser fingerprinting is nothing new, of course, nor is the role played by Google and other online services. So, to a certain extent, the threads were used as an opportunity to vent about Google: its dominant position, its privacy practices, and how it can or does use that position to lock out competitors and potential future competitors. Topics like the wording of the white paper and the Google privacy policy, the collection of personally identifiable information (PII) on the web and how that relates to the EU General Data Protection Regulation (GDPR), the Sec-CH-UA proposal, as well as other real or perceived misdeeds by various players in the web ecosystem all came up along the way.

It is not at all clear that those kinds of concerns are actually reaching the audience that needs to hear them—if they are, they are apparently not particularly compelling to the vast majority of users. There may well be good reasons for questioning the reasoning behind moving away from User-Agent—and Google's motivation to make that change—but a W3C TAG review request hardly seems like the right forum to make a wider point about the behavior of Chrome. A Hacker News thread is not a place where these kinds of activities are going to be curtailed either; it is going to be up to users to recognize that there are problems to be solved and to try to find ways to do so.

That said, the web ecosystem has become a privacy wasteland, so it is not all that surprising that technically savvy folks get up in arms and vent from time to time. But those of a technical bent already have most of the tools they need to combat these problems for themselves; simply giving up the convenience of the various services provided by motherships of all stripes would largely obviate this particular problem. Other tools can be used to minimize tracking while still trying to extract some value from the online resources of the web.

But "we" (including myself as an (over)user of Google services) often choose not to stop using those services and may not protect our privacy using the other tools all that much either—our family members and neighbors without those technical skills effectively lack the choice. There is plenty of lip service to the idea that people can opt out of certain kinds of tracking and such but, without understanding the underlying problems and consequences, few will—and few do. In the end, it is a societal problem that will require some kind of collective action, either via governmental action (e.g. the GDPR) or through changes to people's attitudes and behavior—or both.

Comments (15 posted)

A new hash algorithm for Git

By Jonathan Corbet
February 3, 2020
The Git source-code management system is famously built on the SHA‑1 hashing algorithm, which has become an increasingly weak foundation over the years. SHA‑1 is now considered to be broken and, despite the fact that it does not yet seem to be so broken that it could be used to compromise Git repositories, users are increasingly worried about its security. The good news is that work on moving Git past SHA‑1 has been underway for some time, and is slowly coming to fruition; there is a version of the code that can be looked at now.

How Git works, simplified

To understand why SHA‑1 matters to Git, it helps to have an idea of how the underlying Git database works. What follows is an oversimplified view of how Git manages objects that can be skipped by readers who are already familiar with this material.

Git is often described as being built on a content-addressable filesystem — one where you can look up an object if you know that object's contents. That may not seem particularly useful, but there's more than one way to "know" those contents. In particular, you can substitute a cryptographic hash for the contents themselves; that hash is rather easier to work with and has some other useful properties.

Git stores a number of object types, using SHA‑1 hashes to identify them. So, for example, the SHA‑1 hash of drivers/block/floppy.c in a 5.6-merge-window kernel, as calculated by Git, is 485865fd0412e40d041e861506bb3ac11a3a91e3. Conceptually, at least, Git will store that version of floppy.c in a file, using that hash as its name; early versions of Git actually did that. If somebody makes a change to floppy.c, even just removing an extra space from the end of a line, the result will have a completely different SHA‑1 hash and will be stored under a different name.
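
Incidentally, the hash is not computed over the bare file contents; Git prepends a short header giving the object type and size before hashing. Readers with a kernel tree handy can reproduce the calculation themselves; here is a quick sketch (the value shown is the one quoted above and will differ for any other version of the file):

    $ git hash-object drivers/block/floppy.c
    485865fd0412e40d041e861506bb3ac11a3a91e3

    $ printf 'blob %s\0' "$(wc -c < drivers/block/floppy.c)" | \
          cat - drivers/block/floppy.c | sha1sum
    485865fd0412e40d041e861506bb3ac11a3a91e3  -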

A Git repository is thus full of objects (often called "blobs") with SHA‑1 names; since a new one is created for each revision of a file, they tend to proliferate. Your editor's kernel repository currently contains 8,647,655 objects. But blobs are not the only type of object stored in a Git repository.

An individual file object holds a particular set of contents, but it has no information about where that file appears in the repository hierarchy. If floppy.c is moved to drivers/staging someday, its hash will remain the same, so its representation in the Git object database will not change. Keeping track of how files are organized into a directory hierarchy is the job of a "tree" object. Any given tree object can be thought of as a collection of blobs (each identified by its SHA‑1 hash, of course) associated with their location in the directory tree. As one might expect, a tree object has an SHA‑1 hash of its own that is used to store it in the repository.

Finally, a "commit" object records the state of the repository at a particular point in time. A commit contains some metadata (committer, date, etc.) along with the SHA‑1 hash of a tree object reflecting the current state of the repository. With that information, Git can check out the repository at a given commit, reproducing the state of the files in the repository at that point. Importantly, a commit also contains the hash of the previous commit (or multiple commits in the case of a merge); it thus records not just the state of the repository, but the previous state, making it possible to determine exactly what changed.

Commits, too, have SHA‑1 hashes, and the hash of the previous commit (or commits) is included in that calculation. If two chains of development end up with the same file contents, the resulting commits will still have different hashes. Thus, unlike some other source-code management systems, Git does not (conceptually, at least) record "deltas" from one revision to the next. The repository instead forms a sort of blockchain, with each block containing the state of the repository at a given commit.

Why hash security matters

The compromise of kernel.org in 2011 created a fair amount of concern about the security of the kernel source repository. If an attacker were able to put a backdoor into the kernel code, the result could be the eventual compromise of vast numbers of deployed systems. Malicious code placed into the kernel's build system could be run behind any number of corporate and government firewalls. It was not a pleasant scenario but, thanks to the use of Git, it was also not a particularly likely one.

Let us imagine that some attacker has gained control of kernel.org and wants to place some evil code into floppy.c — something unspeakable like a change that replaces random sectors with segments from Rick Astley videos, say. Somehow this change would have to be incorporated into the repository so that it would be included in subsequent pulls. But the change to floppy.c changes its SHA‑1 hash; that, in turn, will change every tree object containing the evil floppy.c and every commit that includes it as well. The head commit for the repository would certainly change, as would older ones if the attacker tried to make the change appear to have happened in the distant past.

Somewhere out there is certainly some developer who actually memorizes SHA‑1 hashes and would immediately notice a change like that. The rest of us probably would not, but Git will. The distributed nature of Git means that there are many copies of the repository out there; as soon as a developer tries to pull from or push to the corrupted repository, the operation will fail due to the mismatched hashes between the two repositories and the corruption will come to light.

Repository integrity is also protected by signed tags, which include the hash for a specific commit and a cryptographic signature. The chain of hashes leading up to a given tag cannot be changed without invalidating the tag itself. The use of signed tags is not universal in the kernel community (and rare to nonexistent in many other projects), but mainline kernel releases are signed that way. When one sees Linus Torvalds's signature on a tag, one knows that the repository is in the state he intended when the tag was applied.

All of this depends on the strength of the hash used, though. If our attacker is able to modify floppy.c in such a way that its SHA‑1 hash does not change, that modification could well go undetected. That is why the news of SHA‑1 hash collisions creates concern; if SHA‑1 cannot be trusted to detect hostile changes, then it is no longer assuring the integrity of the repository.

The world has not ended yet, fortunately. It is still reasonably expensive to create any sort of SHA‑1 hash collision at all. Creating any new version of floppy.c with the same hash would be hard. An attacker would not just have to do that, though; this new version would have to contain the desired hostile code, still function as a working floppy driver, and not look like an obfuscated C code contest entry (at least not more than it already does). Creating such a beast is probably still unfeasible. But the writing is clearly on the wall; the time when SHA‑1 is too weak for Git is rapidly approaching.

Moving to a stronger hash

Back in the early days of Git, Torvalds was unconcerned about the possibility of SHA‑1 being broken; as a result, he never designed in the ability to switch to a different hash; SHA‑1 is fundamental to how Git operates. As of 2017, the Git code was full of declarations like:

    unsigned char sha1[20];

In other words, the type of the hash was deeply wired into the code, and it was assumed that hashes would fit into a 20-byte array.

At that time, Git developer brian m. carlson was already working to separate the Git core from the specific hash being used; indeed, he had been working on it since 2014. It was unclear what hash might eventually replace SHA‑1, but it was possible to create an abstract type for object hashes that would hide that detail. At this point, that work is done and merged.
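
Concretely, hashes are now carried around in an abstract structure rather than in bare 20-byte arrays; simplified, the current definition looks something like this, with the array sized for the largest supported hash:

    #define GIT_MAX_RAWSZ 32    /* large enough for SHA-256 */

    struct object_id {
        unsigned char hash[GIT_MAX_RAWSZ];
    };

Code that needs to know how many of those bytes are meaningful consults the hash algorithm configured for the repository rather than assuming SHA‑1's 20 bytes.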

The decision on a replacement hash algorithm was made in 2018. A number of possibilities were considered, but the Git community settled on SHA‑256 as the next-generation Git hash. The commit enshrining that choice cites its relatively long history, wide support, and good performance. The community has also decided on (and mostly implemented) a transition plan that is well documented; most of what follows is shamelessly cribbed from that file.

With the hash algorithm abstracted out of the core Git code, the transition is, on the surface, relatively easy. A new version of Git can be made with a different hash algorithm, along with a tool that will convert a repository from the old hash to the new. With a simple command like:

    git convert-repo --to-hash=sha-256 --frobnicate-blobs --climb-subtrees \
        --liability-waiver=none --use-shovels --carbon-offsets

a user can leave SHA‑1 behind (note that the specific command-line options may differ). There is only one problem with this plan, though: most Git repositories do not operate in a vacuum. This sort of flag-day conversion might work for a tiny project, but it's not going to work well for a project like the kernel. So Git needs to be able to work with both SHA‑1 and SHA‑256 hashes for the foreseeable future. There are a number of implications to this requirement that make themselves felt throughout the system.

One of the transition design goals is that SHA‑256 repositories should be able to interoperate with SHA‑1 repositories managed by older versions of Git. If kernel.org updates to the new format, developers running older versions should still be able to pull from (and push to) that site. That will only happen if Git continues to track the SHA‑1 hashes for each object indefinitely.

For blobs, this tracking will happen through the maintenance of a set of translation tables; given a hash generated with one algorithm, Git will be able to look up the corresponding hash from the other. Needless to say, this lookup will only succeed for objects that are actually in the repository. These translation tables will be maintained in the "pack files" that hold most objects in a contemporary Git repository. There will be a separate table for "loose objects" that are stored as separate files rather than in packs; the cost of lookups in that table is seen as being high enough that measures need to be taken to minimize the number of loose objects in any given repository.

The handling of other object types is a bit more complicated. An SHA‑1 tree object, for example, must contain SHA‑1 hashes for the objects in the tree. So if such a tree object is requested, Git will have to locate the SHA‑256 version, then translate all the object hashes contained within it before returning it. Similar translations will be required for commits. Signed tags will contain both hashes.

With this machinery in place, Git installations will be interoperable during the transition. Eventually, all users will have upgraded to SHA‑256-capable versions of Git, at which point repository owners could begin turning off the SHA‑1 capability and removing the translation tables. The transition will, at that point, be complete.

Some inconvenient details

There are likely to be some glitches along the way, naturally. One of them is a simple human-factors problem: when a user supplies a hash value, should it be interpreted as SHA‑1 or SHA‑256? In some cases, it's unambiguous; SHA‑1 hashes are 160 bits wide, so a 256-bit hash must be SHA‑256, for example. But a shorter hash could be either, since hashes can be (and often are) abbreviated. The transition document describes a multi-phase process during which the interpretation of hash values would change, but most users are unlikely to go through that process.

There is, of course, a way to unambiguously give a hash value in the new Git code, and they can even be mixed on the command line; this example comes from the transition document:

    git --output-format=sha1 log abac87a^{sha1}..f787cac^{sha256}

For a Git user interface this is relatively straightforward and concise, but one can still imagine that users might tire of it relatively quickly. The obvious solution to this sort of bracket fatigue is to fully transition a project to SHA‑256 as quickly as possible.

There is another issue out there, though: there are a lot of SHA‑1 hash values in the wild. The kernel repository currently contains over 40,000 commits with a Fixes: tag; each one of those includes an SHA‑1 hash. These hash values also can be found in bug-tracker histories, release announcements, vulnerability disclosures, and more. In a repository without SHA‑1 compatibility, all of those hashes will become meaningless. To address this issue, one can imagine that the Git developers may eventually add a mode where translations for old SHA‑1 hashes remain in the repository, but no SHA‑1 hashes for new objects are added.

Current state

Much of the work to implement the SHA‑256 transition has been done, but it remains in a relatively unstable state and most of it is not even being actively tested yet. In mid-January, carlson posted the first part of this transition code, which clearly only solves part of the problem:

First, it contains the pieces necessary to set up repositories and write _but not read_ extensions.objectFormat. In other words, you can create a SHA‑256 repository, but will be unable to read it.

The value of write-only repositories is generally agreed to be relatively low; not even SCCS was so limited. Carlson's purpose in posting the code at this stage is to try to reveal any core issues that will be harder to change as the work progresses. Developers who are interested in where Git is going may well want to take a close look at this code; converting their working repositories over is not recommended, though.

As it turns out, carlson's work goes well beyond what has been put out for testing now; he will post it when he is ready, but really curious people can see it now in his GitHub repository. This work is unlikely to land on the systems of most Git users for some time yet, but it is good to know that it is getting close to ready. The Git developers (carlson in particular) have quietly been working on this project for years; we will all benefit from it.

Comments (71 posted)

The 5.6 merge window opens

By Jonathan Corbet
January 30, 2020
As of this writing, 4,726 non-merge changesets have been pulled into the mainline repository for the 5.6 development cycle. That is a relatively slow start by contemporary kernel standards, but it still is enough to bring a number of new features, some of which have been pending for years, into the mainline. Read on for a summary of the changes pulled in the early part of the 5.6 merge window.

Architecture-specific

  • The Arm E0PD feature is now supported; it provides the security benefits of kernel page-table isolation without the associated cost.
  • The Armv8.5 RNG instruction, which provides access to a hardware random-number generator, is now supported; it is used to initialize the kernel's random-number generator.

Core kernel

  • Realtime tasks running on heterogeneous (big.LITTLE) systems can now set the uclamp_min parameter (introduced with scheduler utilization clamping patches in 5.3) to ensure that they are scheduled on a CPU that is powerful enough.
  • Time namespaces have finally been merged. The primary use case for this feature is to ensure that clocks behave rationally when a container is migrated from one host to another, but other uses will surely arise. Some more information can be found in this commit.
  • There is a new boot-time parameter (managed_irq) that causes the kernel to attempt to prevent managed interrupts from disturbing isolated CPUs; see this commit for more information.
  • The BPF dispatcher and batched BPF map operations, both of which were described in this article, have been merged.
  • BPF global functions are a part of the effort to support BPF "libraries" within the kernel. The next step is dynamic program extensions, which allow the loading of global functions — and the replacement of existing global functions while they are in use.
  • The new CPU idle-injection device can cool an overheating CPU by forcing it to go idle for short periods of time; see this documentation patch for more information.
  • The openat2() system call has been added. It includes a number of new flags to restrict pathname resolution; see this commit for documentation and the brief sketch after this list for an example.
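
As an illustration of the new resolution restrictions, here is a minimal (and hypothetical) user-space sketch; there was no glibc wrapper for openat2() when 5.6 was released, so the raw system call is used, and it requires kernel headers new enough to provide SYS_openat2 and <linux/openat2.h>:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/syscall.h>
    #include <unistd.h>
    #include <linux/openat2.h>

    int main(void)
    {
        struct open_how how;
        int fd;

        memset(&how, 0, sizeof(how));
        how.flags = O_RDONLY;
        /* the resolved path may not escape the starting directory */
        how.resolve = RESOLVE_BENEATH;

        fd = syscall(SYS_openat2, AT_FDCWD, "etc/passwd", &how, sizeof(how));
        if (fd < 0)
            perror("openat2");
        else
            printf("opened fd %d\n", fd);
        return 0;
    }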

Filesystems and block I/O

  • The Btrfs filesystem has a new "asynchronous discard" mode enabled with the discard=async mount option. This rigorously undocumented feature creates a list of no-longer-used blocks that can be given to the storage device's "discard" operation at some future time, rather than discarding them immediately. That helps to prevent discard operations from delaying transactions, improves the chances of reusing blocks before needing to discard them, and allows larger blocks to be discarded in a single operation. Some more information can be found in this patch-series cover letter.

Hardware support

  • GPIO and pin control: SiFive GPIO controllers, Xylon LogiCVC GPIO controllers, Qualcomm WCD9340/WCD9341 GPIO controllers, and NXP IMX8MP pin controllers.
  • Hardware monitoring: Maxim MAX31730 temperature sensors, Maxim MAX20730, MAX20734, and MAX20743 regulators, Infineon XDPE122 VR controllers, Analog Devices ADM1177 power monitors, Allwinner sun8i thermal sensors, and Broadcom AVS RO thermal sensors. Also: it is now possible to query sensors in ATA drives (temperature in particular) via sysfs; see this commit for details.
  • Industrial I/O: Analog Devices AD7091R5 analog-to-digital converters, Linear Technology LTC2496 analog-to-digital converters, Bosch BMA400 3-axis accelerometers, and All Sensors DLHL60D and DLHL60G pressure sensors.
  • Miscellaneous: Intel Uncore frequency controllers, TI K3 UDMA controllers and ring accelerator modules, HiSilicon DMA Engines, HiSilicon SPI-NOR flash controllers, ROHM BD71828 power regulators, Monolithic MPQ7920 power-management ICs, NXP i.MX8M DDR controllers, Microchip PIT64B clocks, Qualcomm MSM8916 interconnect buses, NXP i.MX INTMUX interrupt multiplexers, and AMD secure processors with trusted execution environment support.
  • Network: Broadcom BCM84881 PHYs, Qualcomm Atheros AR9331 Ethernet switches, Qualcomm 802.11ax chipsets, ZHAW InES PTP time stamp generators, and Marvell OcteonTX2 interfaces.
  • Sound: the ALSA subsystem has seen some significant changes to avoid the year-2038 apocalypse; that includes some extensions to the user-space API. This commit describes the most significant changes. Support was also added for Qualcomm WCD9340/WCD9341 codecs, Qualcomm WSA8810/WSA8815 Class-D amplifiers, Realtek RT700, RT711, RT715, and RT1308 codecs, Ingenic JZ4770 codecs, and Mediatek MT6660 speaker amplifiers.
  • USB: The Thunderbolt specification has morphed into USB4; the kernel configuration options for Thunderbolt have been renamed accordingly. Support for MediaTek MUSB controllers and Intel EMMC PHYs has been added.

Memory management

  • There is a new control-group controller to manage hugetlb usage; see this commit for more information.

Networking

  • At long last, the WireGuard virtual private network implementation has been merged into the mainline. Linus Torvalds was evidently happy with this development.
  • The "enhanced transmission selection scheduler" queuing discipline has been added. This nearly undocumented module does have a bit of help text: "The Enhanced Transmission Selection scheduler is a classful queuing discipline that merges functionality of PRIO and DRR qdiscs in one scheduler. ETS makes it easy to configure a set of strict and bandwidth-sharing bands to implement the transmission selection described in 802.1Qaz." Some more general information can be found on the IEEE 802.1Qaz page.
  • There is a long-running effort to switch ethtool from its ioctl() interface to netlink. Much of the ground work was merged for 5.6; see this merge commit and this document for more information.
  • The process of upstreaming the multipath TCP patches has begun, with a number of the prerequisite patches being merged. Multipath TCP will not be supported in 5.6, but it's getting closer.
  • The new BPF_PROG_TYPE_STRUCT_OPS BPF program type allows a BPF program to fill in where a function pointer would otherwise be used in the kernel; this feature was introduced with this commit. The first use of this feature is to allow the writing of TCP congestion-control algorithms in BPF; this commit adds a DCTCP implementation as an example.
  • The Flow Queue PIE packet scheduler, which is aimed at addressing bufferbloat problems, has been added. Cable modems appear to be a use case of interest for FQ-PIE.

Security-related

  • The ability to disable the SELinux security module at run time has been deprecated with an eye toward removing it in a future release. This feature is still used by Fedora and RHEL, but has been left behind by most other distributions. The preferred way to disable SELinux is with the selinux=0 command-line parameter.

    Interestingly, the deprecation plan for this feature involves making it "increasingly painful" to enable by inserting a boot-time delay that grows longer with each release.

Internal kernel changes

  • ioremap_nocache() and devm_ioremap_nocache() have long been redundant, since plain ioremap() already provides uncached mappings. These functions have now been removed; over 300 files have been touched to convert all remaining callers.

By the normal schedule, the 5.6 merge window should stay open until February 9, with the final 5.6 release happening at the end of March or the beginning of April. Stay tuned for our second-half summary, to be published just after the 5.6-rc1 release is made.

Comments (14 posted)

Accelerating netfilter with hardware offload, part 2

January 31, 2020

This article was contributed by Marta Rybczyńska

As network interfaces get faster, the amount of CPU time available to process each packet becomes correspondingly smaller. The good news is that many tasks, including packet filtering, can be offloaded to the hardware itself. The bad news is that the Linux kernel required quite a bit of work to be able to take advantage of that capability. The first article in this series provided an overview of how hardware-based packet filtering can work and the support for this feature that already existed in the kernel. This series now concludes with a detailed look at how offloaded packet filtering works in the netfilter subsystem and how administrators can make use of it.

The offload capability was added by a patch set from Pablo Neira Ayuso, merged in the kernel 5.3 release and updated thereafter. The goal of the patch set was to add support for offloading a subset of the netfilter rules in a typical configuration, thus bypassing the kernel's generic packet-handling code for packets filtered by the offloaded rules. It is not currently possible to offload all of the rules, as that would require additional support from the underlying hardware and in the netfilter code. The use case and some of the internals are mentioned in Neira's slides [PDF] from the 2019 Linux Plumbers Conference.

Background work

The bulk of the patch set is the refactoring needed to allow the netfilter offload mechanism to reuse infrastructure that was previously tied directly to the traffic-control (tc) subsystem. The refactoring effort was able to take advantage of an existing driver callback. Some modules that were previously used only by the tc subsystem have become more generic.

The first new subsystem, the "flow block" infrastructure, was introduced in 2017 to allow the sharing of filtering rules and to optimize the use of ternary content-addressable memory (TCAM) entries. It allows a set of rules to be shared by two (or more) network interfaces, which reduces the hardware resources needed by rule offloading; this is because the network cards with multiple physical interfaces usually share the TCAM entries between those interfaces. This optimization, in the case of switches, allows the administrator to define common blocks of filtering rules that can be assigned to multiple interfaces. When a shared block is in place, any changes will apply to all interfaces that the block is assigned to. The netfilter offload patch extends the use of flow blocks beyond the tc subsystem, making it available for all subsystems that need to offload packet-filtering tasks.

A flow block is, at its core, a list of driver callbacks invoked when the rules programmed into the hardware are changed. There is usually one entry per device (for typical network cards); in the case of switches there is one callback for all the interfaces in the switch. For a configuration with two network interfaces that share the same rules, the flow-block list contains two callbacks (one for each interface). The flow-block infrastructure does not limit the number of filtering rules.

Another important part of the patch set modifies a callback provided by network card device drivers. Those callbacks are kept in the struct net_device_ops structure. The netfilter offload patch set reuses the ndo_setup_tc() callback, which was initially added to configure schedulers, classifiers, and actions for the tc subsystem; it has the following prototype:

    int (*ndo_setup_tc)(struct net_device *dev, enum tc_setup_type type,
                        void *type_data);

It takes the network device dev, the type of the configuration to apply (defined in enum tc_setup_type), and an opaque data value. The enum defines different action types; netfilter does not define its own type, instead using the one defined by the flower classifier (TC_SETUP_CLSFLOWER). This is expected to change in the future, when drivers start supporting tc and netfilter offloading at the same time.

Finally, the flow-rule API was introduced in February 2019 (there is a longer cover letter in version 6 of the flow-rule patch set). It implements an intermediate representation for the flow-filtering rules, allowing the separation of the driver-specific implementation from the details of the subsystem calling it. In particular, it enabled a single code path to be used by drivers to support access-control-list offloads configured by either ethtool or the flower classifier.

In the flow-rule API, each flow_rule object represents a filtering rule. It consists of the match condition of the rule (struct flow_match) and the actions to be performed (struct flow_action). In the netfilter code, each flow_rule represents a rule to be offloaded to the hardware; it is kept in the flow-block list. When netfilter offloads a rule to hardware, it iterates over the callback list in the flow block, invoking each callback and passing in the rules, so that they can be handled by the driver.
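
To give a rough idea of what a driver does with such a rule, the entirely hypothetical sketch below translates a flow_rule matching an IPv4 destination address and a destination port, with a "drop" action, into hardware state; the foo_hw_*() helpers and struct foo_priv stand in for device-specific code:

    static int foo_parse_rule(struct foo_priv *priv, struct flow_rule *rule)
    {
        struct flow_action_entry *act;
        int i;

        if (flow_rule_match_key(rule, FLOW_DISSECTOR_KEY_IPV4_ADDRS)) {
            struct flow_match_ipv4_addrs match;

            flow_rule_match_ipv4_addrs(rule, &match);
            /* match.key holds the values; match.mask says which bits matter */
            foo_hw_set_dst_ip(priv, match.key->dst, match.mask->dst);
        }

        if (flow_rule_match_key(rule, FLOW_DISSECTOR_KEY_PORTS)) {
            struct flow_match_ports match;

            flow_rule_match_ports(rule, &match);
            foo_hw_set_dst_port(priv, match.key->dst, match.mask->dst);
        }

        flow_action_for_each(i, act, &rule->action) {
            switch (act->id) {
            case FLOW_ACTION_DROP:
                foo_hw_set_drop(priv);
                break;
            default:
                return -EOPNOTSUPP;    /* action cannot be offloaded */
            }
        }
        return 0;
    }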

Driver API changes

As the tc-specific code was made more generic, several types and definitions were renamed or reorganized. A new type, flow_block_command, was added to define the commands for the driver's flow-block setup function; its two values, formerly TC_BLOCK_BIND and TC_BLOCK_UNBIND, are now FLOW_BLOCK_BIND and FLOW_BLOCK_UNBIND, respectively. Those allow the kernel to bind and unbind a flow block to an interface. In the same way, flow_block_binder_type, which defines the type of the offload (ingress for input and egress for output), has seen its members renamed from TCF_BLOCK_BINDER_TYPE_* to FLOW_BLOCK_BINDER_TYPE_*.

The existing drivers were all setting up tc offloading in a very similar way, so Neira added a helper function that can be used by all of them:

    int flow_block_cb_setup_simple(struct flow_block_offload *f,
                                   struct list_head *driver_block_list,
                                   flow_setup_cb_t *cb, void *cb_ident,
                                   void *cb_priv, bool ingress_only);

where f is the offload context, driver_block_list is the list of flow blocks for the specific driver, cb is the driver's flow-block callback (of type flow_setup_cb_t, described below), cb_ident is the identification of the context, cb_priv is the context to be passed to cb (in most cases cb_ident and cb_priv are identical), and ingress_only is true if the offload should be set up for the ingress (receive) side only (which was the case for all drivers up to 5.4; in 5.5 the cxgb4 driver gained support for both directions). flow_block_cb_setup_simple() registers one callback per network device, which is exactly what most drivers need.

Each driver is expected to keep a list of flow blocks with their callbacks: that is the driver_block_list argument of flow_block_cb_setup_simple(). This list is necessary if the driver needs more than one callback, for example one for the ingress and the other for the egress rules.

The callback implemented by the drivers, of type flow_setup_cb_t, has the following definition:

    typedef int flow_setup_cb_t(enum tc_setup_type type, void *type_data,
                                void *cb_priv);

Its implementation in the driver sets up the hardware filtering using the provided configuration. The argument type defines the classifier to use, type_data is the data specific to the classifier (and is usually a pointer to a flow_rule structure) and cb_priv is the callback private data.
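
Putting these pieces together, a simple single-port NIC driver might wire things up roughly as shown below. Everything prefixed with foo_ is hypothetical, foo_handle_flower() stands in for the driver's rule-translation code, and error handling is trimmed:

    /* the list of flow blocks registered by this driver */
    static LIST_HEAD(foo_block_cb_list);

    /* the flow_setup_cb_t callback, invoked when offloaded rules change */
    static int foo_block_cb(enum tc_setup_type type, void *type_data,
                            void *cb_priv)
    {
        struct foo_priv *priv = cb_priv;

        switch (type) {
        case TC_SETUP_CLSFLOWER:
            /* type_data describes the classifier rule to add or remove */
            return foo_handle_flower(priv, type_data);
        default:
            return -EOPNOTSUPP;
        }
    }

    /* the driver's ndo_setup_tc() implementation */
    static int foo_setup_tc(struct net_device *dev, enum tc_setup_type type,
                            void *type_data)
    {
        struct foo_priv *priv = netdev_priv(dev);

        switch (type) {
        case TC_SETUP_BLOCK:
            return flow_block_cb_setup_simple(type_data, &foo_block_cb_list,
                                              foo_block_cb, priv, priv, true);
        default:
            return -EOPNOTSUPP;
        }
    }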

If the driver needs to go beyond the functionality of flow_block_cb_setup_simple() (which usually means it is part of a switch), it needs to use the part of the API that allocates the flow blocks directly. These blocks are allocated and freed by two helpers: flow_block_cb_alloc() and flow_block_cb_free() with the following prototypes:

    struct flow_block_cb *flow_block_cb_alloc(flow_setup_cb_t *cb,
                                              void *cb_ident, void *cb_priv,
                                              void (*release)(void *cb_priv));
    void flow_block_cb_free(struct flow_block_cb *block_cb);

The callbacks are defined by the drivers and passed to netfilter by the flow-block infrastructure. Netfilter maintains the list of callbacks that are attached to each given rule.

Each of the flow blocks contains a list of driver offload callbacks. The drivers can add and remove themselves from the list contained in the flow-block list using flow_block_cb_add() and flow_block_cb_remove() with the following prototypes:

    void flow_block_cb_add(struct flow_block_cb *block_cb,
                           struct flow_block_offload *offload);
    void flow_block_cb_remove(struct flow_block_cb *block_cb,
                              struct flow_block_offload *offload);

The driver can look up a specific callback using flow_block_cb_lookup(), which is defined as follows:

    struct flow_block_cb *flow_block_cb_lookup(struct flow_block *block,
                                               flow_setup_cb_t *cb, void *cb_ident);

This function searches for the flow-block callbacks on the list in the block context; if both the cb callback and the cb_ident value match, it returns the associated flow-block callback structure. It is used by switch drivers to check if a given callback is already installed (again, switches use one callback for all of their interfaces). The setup of the first interface allocates and registers the callback when flow_block_cb_lookup() returns NULL. Subsequently, other interfaces get a non-NULL return and reuse the callback in place, only increasing the reference count (see below). When unregistering a callback, flow_block_cb_lookup() also returns non-NULL if other users exist and the driver just decrements the reference count.

The operations for the flow-block reference counts are flow_block_cb_incref() and flow_block_cb_decref(); they are defined as follows:

    void flow_block_cb_incref(struct flow_block_cb *block_cb);
    unsigned int flow_block_cb_decref(struct flow_block_cb *block_cb);

The value returned by flow_block_cb_decref() is the reference count remaining after the operation.

Another function, flow_block_cb_priv(), allows the driver to access its private data. It has the following, simple, prototype:

    void *flow_block_cb_priv(struct flow_block_cb *block_cb);

Finally, drivers can use flow_block_cb_is_busy() to check whether a callback is already in use (added to the lists and active). The function has the following prototype:

    bool flow_block_cb_is_busy(flow_setup_cb_t *cb, void *cb_ident,
                               struct list_head *driver_block_list);

It returns true if it finds an entry with both cb and cb_ident on driver_block_list. It is used in the code that sets up offloads, in order to avoid setting up tc and netfilter callbacks at the same time. This check is expected to be removed from drivers whose hardware is able to support both at the same time, once that support gets implemented.
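
Tying this lower-level API together, the bind path for a switch driver that shares one callback across all of its ports might look roughly like the hypothetical sketch below (the foo_* names are made up, and the unbind path and most error handling are omitted):

    static int foo_block_bind(struct foo_port *port, struct flow_block_offload *f)
    {
        struct flow_block_cb *block_cb;

        /* has another port on this switch already registered the callback? */
        block_cb = flow_block_cb_lookup(f->block, foo_block_cb, port->parent_sw);
        if (block_cb) {
            flow_block_cb_incref(block_cb);
            return 0;
        }

        /* avoid mixing tc and netfilter offloads on the same callback */
        if (flow_block_cb_is_busy(foo_block_cb, port->parent_sw,
                                  &foo_block_cb_list))
            return -EBUSY;

        block_cb = flow_block_cb_alloc(foo_block_cb, port->parent_sw,
                                       port->parent_sw, NULL);
        if (IS_ERR(block_cb))
            return PTR_ERR(block_cb);

        flow_block_cb_add(block_cb, f);
        list_add_tail(&block_cb->driver_list, &foo_block_cb_list);
        return 0;
    }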

The internals of the traffic classifier were modified to apply the filtering stored in the flow-block API; this is done in a new function tcf_block_setup().

Callback list

The drivers set up the flow-block object (flow_block_cb) and add their callbacks to their list. Each driver then passes this list to the core networking code, which does the registration (in tc and netfilter) and calls the driver callback to do the actual hardware setup. This callback uses the classifier-specific data it receives in the parameters, including the type of the operation (for example to add or remove an offload).

Edward Cree asked why there is a single list per driver, and not per device, for example:

Pablo, can you explain (because this commit message doesn't) why these per-driver lists are needed, and what the information/state is that has module (rather than, say, netdevice) scope?

The drivers only supported a single flow block, Neira explained, and the idea was to extend that support to one for each subsystem (ethtool, tc, and so on). There are two reasons for the per-driver lists: the first is that current drivers can only support one subsystem; the other, once that restriction is lifted, is that sharing would require the same configuration across all of the subsystems. That would mean, for example, that the same configuration would be required for both eth0 and eth1 for tc, and there would also have to be a shared configuration for netfilter. Neira assumes this is almost never going to be the case.

The netfilter offload itself

The last patch in the series introduces the hardware offloading of netfilter itself. Currently the support is basic and only handles the ingress chain. The rule must perform an exact match on the five elements identifying the flow: the protocol, the source and destination addresses, and the source and destination ports.

An example of the offload is given in the series:

    table netdev filter {
        chain ingress {
            type filter hook ingress device eth0 priority 0; flags offload;
            ip daddr 192.168.0.10 tcp dport 22 drop
        }
    }

It drops all TCP packets to the destination address 192.168.0.10, port 22 (typically used by SSH). The only difference from the non-offloaded rules is the addition of the flags offload option.

Since the control of offloading is given to the administrator, there might be misconfigurations. For example, when the offload flag is set for a rule that cannot be offloaded, the error code will be EOPNOTSUPP. If the driver cannot handle the command, for example when the TCAM is full, the result will be a driver-specific error code.

The interface gives a lot of power to the system administrator, but also makes them responsible for figuring out which rules will benefit the most from offloading. It seems that knowledge of the system configuration and the traffic it handles will be necessary to derive the most benefit from this new feature. At the time of writing this article, no benchmark or best-practices documents are available. It also remains to be seen where the limitations of the offload feature will be — for example, how easy it will be to diagnose failures in the user configuration coming from the driver callbacks.

Summary

The netfilter classification offloading feature allows the activation of hardware offloading, which can provide important performance gains for certain use cases. This work resulted in useful refactoring of existing code blocks and opens a way for other offloading users. However, drivers need to be modified to take full advantage of this capability and the API itself is quite complex with a number of levels of callbacks. The administrators gain a powerful tool, but it will be up to them to use it correctly. There is definitely more work to be done in this area.

[The author would like to thank Pablo Neira Ayuso for helpful comments]

Comments (2 posted)

Postponing some feature removals in Python 3.9

By Jake Edge
February 4, 2020

Python 2 was officially "retired" on the last day of 2019, so no bugs will be fixed or changes made in that version of the language, at least by the core developers—distributions and others will continue for some time to come. But there are lots of Python projects that still support Python 2.7 and may not be ready for an immediate clean break. Some changes that were made for the upcoming Python 3.9 release (which is currently scheduled for October) are causing headaches because support for long-deprecated 2.7-compatibility features is being dropped. That led to a discussion on the python-dev mailing list about postponing those changes to give a bit more time to projects that want to drop Python 2.7 support soon, but not immediately.

There will actually be one final release of Python 2, Python 2.7.18, in April. It is something of a celebratory release that will be made in conjunction with PyCon. There were some fixes that accumulated in the branch between the 2.7.17 release in October and the end of the year, so those fixes will be flushed and the branch retired. Other than the release itself, no other changes will be allowed for that branch in 2020.

Compatibility

Fedora has recently started testing the Python 3.9 alpha releases in preparation for shipping 3.9 with Fedora 33, which is due around the end of the year. Victor Stinner and Miro Hrončok reported that a lot of packages will not build with the new version because of the removal of long-deprecated Python 2.7 compatibility features. When that was posted on January 23, there were more than 150 packages broken by these changes, so they suggested that a handful of the changes causing the most problems be reverted.

Miro and me consider that Python 3.9 is pushing too much pressure on projects maintainers to either abandon Python 2.7 right now (need to update the CI, the documentation, warn users, etc.), or to introduce a *new* compatibility layer to support Python 3.9: layer which would be dropped as soon as Python 2.7 support will be dropped (soon-ish).

As described in the message, many existing Python packages have code to handle differences between Python 2.7 and Python 3.x, but the changes currently in 3.9 might require another layer of compatibility code—one that may be short-lived. Instead of requiring that all of these other developers make those changes immediately, they said that it makes more sense for the core language to maintain the compatibility for one more release:

While it's certainly tempting to have "more pure" code in the standard library, maintaining the compatibility shims for one more release isn't really that big of a maintenance burden, especially when comparing with dozens (hundreds?) of third party libraries essentially maintaining their own.

One of the examples provided is for the Sequence abstract base class (ABC) that can be used to create types similar to lists and tuples. In Python 2.7, it lives in the top level of the collections module, but for Python 3.3, the ABCs moved into collections.abc, though the top-level aliases were maintained. Those aliases are slated to be removed for 3.9, which will mean adding code like the following for projects that are still supporting the older version:

    try:
        from collections.abc import Sequence
    except ImportError:
        # Python 2.7 doesn't have collections.abc
        from collections import Sequence

If the developers of the third-party packages end up dropping support for 2.7 over the next year or so, then they will have needlessly added code like that when they could simply have put in the .abc wherever it was needed. While fixes are in progress for many of the projects, it will take time to get them out there, so Python 3.9 support could lag, which is probably not what the core developers want. Toward the bottom of the message, they list five incompatible changes (from the list in the in-progress Python 3.9 release notes); delaying those would eliminate most of the problems.
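
Projects that want to find these spots before 3.9 arrives can ask the interpreter to display (or to turn into errors) the deprecation warnings that are otherwise easy to miss; a rough sketch, noting that the exact warning text varies from release to release:

    $ python3.8 -W always::DeprecationWarning -c 'from collections import Sequence'

    # or make the test suite fail loudly on any deprecated usage:
    $ python3.8 -W error::DeprecationWarning -m unittest discover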

Reaction

The reaction was somewhat mixed. Steering council member Barry Warsaw was in favor of postponing those changes. Since Python has recently moved to a one-year cadence (down from 18 months), that shortened the time frame for projects to make these kinds of changes, he said. "And if it helps with the migration off of Python 2.7, then +1 from me."

Others were concerned that doing so would simply add another year to a longstanding plan to get rid of the compatibility cruft once 2.7 reached end-of-life. Ivan Levkivskyi thought that it was important to stick to the schedule that was established, but he also was not sure that delaying a year is truly beneficial: "For example, importing ABCs directly from collections was deprecated 8 years ago, what would 1 extra year change?" Eric V. Smith was also unsure about adding another year:

I think the concern is that with removing so many deprecated features, we're effectively telling libraries that if they want to support 3.9, they'll have stop supporting 2.7. And many library authors aren't willing to do that yet. Will they be willing to in another year? I can't say.

But Hrončok saw things differently:

The concern is not that they don't want to drop 2.7 support, but that it is a nontrivial task to actually do and we cannot expect them to do it within the first couple weeks of 2020. While at the same time, we want them to support 3.9 since the early development versions in order to be able to detect regressions early in the dev cycle.

Given that, Smith said that he was not opposed to a postponement as a one-time thing to help those projects in the interim. He was guessing that the deprecation warnings from Python 3 had been ignored in order to continue supporting 2.7, so a transition period is needed. It should be noted that deprecation warnings were not as prominent prior to the Python 3.7 release in 2018. Council member Brett Cannon concurred with postponing the deprecations:

I'm also okay with a one-time delay in removals that are problematic for code trying to get off of Python 2.7 this year and might not quite cut it before 2021 hits. I'm sure some people will be caught off-guard once 3.9b1 comes out and they realize that they now have to start caring about deprecation warnings again. So I'm okay letting 3.9 be the release where we loudly declare, "deprecation warnings matter again! Keep your code warnings-free going forward as things will start being removed in 3.10".

Paul Moore was concerned that it will just give projects more time to procrastinate: "I think that far more people will see this as yet another delay before 2.7 dies, and treat it as one more reason to do nothing". He was not opposed to the specific changes suggested, however. But Serhiy Storchaka thought that postponing would just lead to problems when it came time to put together 3.10, 3.11, and beyond. Also, the deprecations will help surface code that is no longer maintained:

I consider breaking unmaintained code is an additional benefit of removing deprecated features. For example pycrypto was unmaintained and insecure for 6 years, but has 4 million downloads per month. It will not work in 3.9 because of removing time.clock. Other example is nose, unmaintained for 5 years, and superseded by nose2.

Guido van Rossum cautioned against that approach, though he admitted to making similar statements in the past:

I now think core Python should not be so judgmental. We've broken enough code for a lifetime with the Python 2 transition. Let's be *much* more conservative when we remove things from Python 3.

Overall, the sense is that the specific reversions are reasonable, even if there are some concerns about delaying them for another release—thus sending the wrong signal to some projects. Stinner is also a steering council member, so it would seem there is a majority on that body in favor of pushing them back to 3.10, which will presumably come out in late 2021. It's hard to see things pushing out further than that, though, so projects should definitely take the opportunity to work out their strategy with regard to 2.7 compatibility.

Though Python 2.7 is "dead" in some sense, there are differing ideas of what that means in practice. There will certainly still be an enormous amount of Python 2 code running at the end of 2020; in fact, there will undoubtedly still be a lot of that code running at the end of the 2020s, but presumably much less at that point. The core development team has long been ready to leave that legacy behind, but smoothing the path for those who are not quite ready to come to terms with the demise of the Norwegian Blue ("beautiful plumage") is seemingly for the best.

Comments (23 posted)

Page editor: Jonathan Corbet

Inside this week's LWN.net Weekly Edition

  • Briefs: CoreOS Container Linux EOL; glibc 2.31; Lars Kurth and Scott Rifenbark RIP; Quotes; ...
  • Announcements: Newsletters; conferences; security updates; kernel patches; ...
Next page: Brief items>>

Copyright © 2020, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds