
Leading items

Welcome to the LWN.net Weekly Edition for November 17, 2022

This edition contains the following feature content:

  • Networking and high-frequency trading: what high-frequency trading needs from the kernel's networking stack, as described at Netdev 0x16.
  • Class action against GitHub Copilot: a look at the class-action lawsuit filed over Copilot's use of free software.
  • Git evolve: tracking changes to changes: a proposed Git feature for tracking the history of in-progress commits.
  • Block-device snapshots with blksnap: a new block-level snapshot mechanism aimed at backup systems.
  • Scaling the KVM community: how the KVM project can scale its development, maintenance, and validation efforts.

Note: November 24 is the Thanksgiving holiday in the US; as has become traditional, LWN will not be publishing a weekly edition that week so that we can devote our full attention to eating. We'll be back on December 1.

This week's edition also includes these inner pages:

  • Brief items: Brief news items from throughout the community.
  • Announcements: Newsletters, conferences, security updates, patches, and more.

Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.

Comments (none posted)

Networking and high-frequency trading

By Jake Edge
November 16, 2022

Netdev

The high-frequency-trading (HFT) industry is rather tight-lipped about what it does and how it does it, but PJ Waskiewicz of Jump Trading came to the Netdev 0x16 conference to try to demystify some of that, especially with respect to its use of networking. He wanted to contrast the needs of HFT with those of traditional networking as it is used outside of the HFT space. He also had some thoughts on what the Linux kernel could do to help address those needs, so that HFT companies could move away from some of the custom code that is currently being developed and maintained by multiple firms in the industry.

Secrets

Waskiewicz began by highlighting the secretive nature of the industry; it is sometimes amusing, but also rather frustrating, how little is known about HFT. For example, there is almost nothing on his company's web site beyond some office-location information, which can make recruiting difficult. It is well known that HFT companies do trading—stocks, options, securities, etc.—but the "how" is the secretive part: how do the companies decide which trades to execute and how do those trades actually get executed? That is the secret sauce that HFT companies do not want to share with anyone—especially their HFT competitors.

He said that a Wikipedia definition describes HFT as "algorithmic-based trading"—data is analyzed in various ways in order to decide whether or not to execute a trade. Those trades are then made in "very very high volumes". While preparing for the talk, he found a study that attributes 60% of all of the trades in all of the exchanges worldwide to HFT firms. All of that volume is coming from algorithmic trading.

The trading strategies that determine what to trade and how to trade it are based on quantitative analysis. So HFT firms have teams of people looking at large amounts of data, extracting signals from the data, and using it to build models. The goal is to predict the future and then to automatically execute trades to take advantage of those predictions. The strategy algorithms can be implemented in either software or hardware; it is no secret, he said, that HFT firms use custom hardware to help accelerate their operations.

HFT firms store massive amounts of data that their teams can use to create models; market data going back ten years or more amounts to petabytes of storage. Moving that data around efficiently is important, but the primary concern for HFT networking is predictable latency; Waskiewicz had already mentioned that network jitter can cause HFT firms to lose a lot of money, and he would return to that idea several times in the talk. Unexpected latency can change the timing of queries and actions so that the strategy no longer does what it was designed to do.

Communication between HFT companies and the exchanges is subject to various differences between the exchanges and the protocols they use. Exchanges such as Nasdaq, Eurex, and KRX each publish their own specifications for the protocols that can be used to do electronic trading. The specifications cover packet formats, how to query information, incoming and outgoing packet rates, and so on; each exchange has its own nuances—and quirks. The exchanges generally run on standard 10Gbps Ethernet; there is no movement toward 100Gbps Ethernet that he knows of, though there is some talk about 25Gbps.

Inside the HFT firm, there is a need for high-performance computing (HPC) facilities to do the quantitative analysis. Those HPC environments require grid networks with lots of distributed CPU horsepower and storage, he said. Remote DMA (RDMA) is used on these internal networks, but predictable latency is still the main concern.

Latency

Out of the box, the kernel networking stack has poor performance with regard to latency, he said, though it can be improved with some tuning. Techniques like pinning workloads to particular CPUs, keeping NUMA locality in mind, and using interrupt affinity are generally well-known for reducing packet jitter. CPU isolation is perhaps a lesser-known feature that the HFT world uses to reduce the jitter even further; CPUs are isolated from the rest of the system and workloads are pinned to them in order to reduce or eliminate the jitter.
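
As a rough illustration of the pinning technique (this is not code from the talk), the sketch below pins the calling thread to a single CPU with sched_setaffinity(); the choice of CPU 3 is an arbitrary assumption, and the interrupt-affinity side is normally configured separately by writing a CPU mask to /proc/irq/<n>/smp_affinity.

    /* Minimal sketch: pin the current thread to one CPU to reduce jitter.
     * Assumes CPU 3 has been set aside for this workload; error handling
     * is abbreviated. */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        cpu_set_t set;

        CPU_ZERO(&set);
        CPU_SET(3, &set);               /* run only on CPU 3 */

        if (sched_setaffinity(0, sizeof(set), &set) != 0) {
            perror("sched_setaffinity");
            return EXIT_FAILURE;
        }

        /* The latency-sensitive work would run here; the affinity of the
         * NIC interrupt feeding this thread is set separately, typically
         * by writing a mask to /proc/irq/<irq>/smp_affinity. */
        return EXIT_SUCCESS;
    }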

He put up some graphs from a simple benchmark that he did to show the effects of these techniques; it used netperf to measure request-response latency on a 10Gbps Ethernet network with a switch. As expected, the numbers generally got better for the minimum and mean latencies as each technique was applied; the values were reported as an average of ten runs of the benchmark. The unoptimized values were a minimum of 51.6µs and a mean of 68.7µs, which he said "wasn't terrible" though the maximum latencies were 250-600µs, thus "a bit of a mess".

He then showed the results for pinning the CPU with no interrupt-affinity change, which showed a small improvement (50.1/67.6). When he added interrupt affinity into the mix, so that the cache locality came into play, there was a more noticeable boost (45.4/53.1); "we're starting to get to this point of less jitter, which is the important part". He expected that isolating the CPU would make things better still, but was surprised to see the numbers get worse (47.8/61.1). He thought about that and realized that the interrupt was interfering, so he ran the benchmark without interrupts by putting the driver into polling mode. That was more in line with expectations with a 41.9µs minimum and a 56.3µs average latency.

But that was "a very synthetic benchmark", where he could mold the system and the application specifically to his needs; it does not really match the real world of traditional networking at all. In that world, there are other workloads that also need to be run so things cannot be statically partitioned as he was doing; but in the HFT networking environment, none of that matters. The synthetic-benchmark environment is what is used; a system where "everything is perfectly lined up is actually how things get deployed" for HFT.

Options

So he wondered if he could use express data path (XDP or, commonly, AF_XDP) as a way to improve things further. "Because if we throw eBPF at any problem that'll just fix it, right?", he said to some scattered laughter. While XDP is "not here yet" for HFT, he thinks it could be the right path someday and has some ideas on how to get there.

XDP allows for kernel bypass without actually bypassing the kernel, Waskiewicz said, which is really compelling. His vision is that the "hot path" data that is extremely latency-sensitive could be identified by the application and those packets would go directly to it, while the other traffic would continue to be handled by the kernel networking stack. "That's like the best of both worlds as far as I am concerned." There are some limitations that need to be dealt with (or worked around). The receive side must be done with polling since interrupts introduce jitter by their very nature. As far as he is aware, transmitting data requires making a system call and the context switches for system calls introduce jitter as well.
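
As a hedged sketch of what that hot path could look like with today's AF_XDP support, the code below creates an AF_XDP socket bound to one hardware queue and busy-polls its receive ring; the interface name ("eth0"), queue number, and buffer sizes are arbitrary assumptions, and fill-ring setup and error handling are omitted for brevity. It is meant only to illustrate the model Waskiewicz described, not his firm's code.

    /* Sketch: an AF_XDP socket whose receive path is serviced by polling,
     * so that latency-critical packets reach the application without
     * interrupts. Interface and queue are assumptions; errors ignored. */
    #include <stdlib.h>
    #include <sys/mman.h>
    #include <bpf/xsk.h>          /* or <xdp/xsk.h> with libxdp */

    #define NUM_FRAMES 4096
    #define FRAME_SIZE XSK_UMEM__DEFAULT_FRAME_SIZE

    int main(void)
    {
        struct xsk_umem *umem;
        struct xsk_socket *xsk;
        struct xsk_ring_prod fill, tx;
        struct xsk_ring_cons comp, rx;
        void *buf;

        /* Packet buffers shared between the kernel and the application. */
        buf = mmap(NULL, (size_t)NUM_FRAMES * FRAME_SIZE,
                   PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        xsk_umem__create(&umem, buf, (__u64)NUM_FRAMES * FRAME_SIZE,
                         &fill, &comp, NULL);

        /* Bind the socket to one hardware queue on the "hot" interface;
         * the fill ring would also need to be populated with frame
         * addresses before packets can arrive (omitted here). */
        xsk_socket__create(&xsk, "eth0", 0, umem, &rx, &tx, NULL);

        for (;;) {
            __u32 idx;
            __u32 rcvd = xsk_ring_cons__peek(&rx, 64, &idx);

            /* Process rcvd descriptors starting at idx, then release. */
            if (rcvd)
                xsk_ring_cons__release(&rx, rcvd);
        }
    }

Transmitting through such a socket still involves a system call to tell the kernel there is work to do, which is the context-switch cost he referred to.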

CPU isolation works well, he said, until it does not. His firm uses the isolcpus boot parameter to choose the set of isolated CPUs, but there are still some "random" inter-processor interrupts (IPIs) that occur, which are "fairly infuriating". At this year's Linux Plumbers Conference, there was a microconference on CPU isolation where the problems he has been seeing were discussed.
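
For reference, that isolation is requested on the kernel command line; a minimal sketch (the CPU list is an arbitrary assumption) might look like:

    isolcpus=2-11

Latency-critical threads are then pinned onto those CPUs explicitly; in practice, related options such as nohz_full are often combined with isolcpus to remove further sources of jitter from the isolated CPUs.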

In some configurations, simply connecting to a CPU-isolated system using SSH will cause a cascade of events that eventually results in "TLB [translation lookaside buffer] shootdowns issued to every core on the system". Those IPIs cause jitter, but refilling the TLB causes jitter as well. He is trying to carve out some time to address that problem. Another CPU-isolation problem that he encountered was that an IPI is sent to all processors when someone executes "cat /proc/cpuinfo"; the system does that to get the operating frequency of each core. This was particularly a problem for systems that ran some kind of telemetry application that would check those values frequently. The bug has now been fixed upstream in work that his company did in conjunction with Red Hat, he said.

HPC side

As noted, the connection to the exchanges requires standard Ethernet, but the internal HPC grid, where there can be tens to hundreds of thousands of CPUs, can be (and is) rather more exotic. It turns out that HPC in HFT has been something of a niche market for RDMA. Quantitative analysis requires moving lots of data around and operating on it in parallel throughout the network.

Predictable latency is also important on the analysis side of the network, and RDMA works well for that, but there are some things that he thinks could be done better. For one, io_uring is showing great promise; it has expanded well beyond its genesis as a replacement for the libaio asynchronous I/O library. It is no longer only for I/O, as Josh Triplett's io_uring_spawn talk at LPC shows, Waskiewicz said. It would be interesting to see if the networking-hardware ring buffers could be used directly as buffers for io_uring operations so that data could get to and from the hardware using that mechanism; that would allow the HPC side to use something "much more kernel-standardized that would be able to replace RDMA".
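
As a loose sketch of that direction, the hypothetical helper below (not anything from the talk) uses liburing to submit an asynchronous receive on an existing socket and wait for its completion; mapping NIC ring buffers directly into io_uring, as Waskiewicz speculated about, is not something current kernels or this code do.

    /* Sketch: receiving from a socket through io_uring (liburing). This
     * shows only the submission/completion flow on an already-connected
     * socket; error handling is abbreviated. */
    #include <liburing.h>
    #include <stddef.h>

    #define QUEUE_DEPTH 64

    int receive_one(int sockfd, void *buf, size_t len)
    {
        struct io_uring ring;
        struct io_uring_sqe *sqe;
        struct io_uring_cqe *cqe;
        int ret;

        io_uring_queue_init(QUEUE_DEPTH, &ring, 0);

        /* Queue an asynchronous recv() on the socket. */
        sqe = io_uring_get_sqe(&ring);
        io_uring_prep_recv(sqe, sockfd, buf, len, 0);
        io_uring_submit(&ring);

        /* Wait for the completion; a latency-sensitive application would
         * instead poll for completions (or use IORING_SETUP_SQPOLL) to
         * avoid blocking in the kernel. */
        io_uring_wait_cqe(&ring, &cqe);
        ret = cqe->res;
        io_uring_cqe_seen(&ring, cqe);

        io_uring_queue_exit(&ring);
        return ret;
    }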

But, for now at least, RDMA is the king in HPC networks; no mention of RDMA is complete without mention of RDMA over Converged Ethernet (RoCE), though, he said. InfiniBand, which is the fabric used for these RDMA networks, is expensive, but that is not necessarily a problem in the HFT world, as the industry is willing to spend money that allows it to make more money. InfiniBand is something of a niche technology, though, which makes it hard to find technical people who can manage the network and keep it up and running.

RoCE (along with iWARP) allows the use of Ethernet equipment and management skills, but it comes with challenges of its own. Converged networks still have jitter problems because they are not a dedicated fabric, which leads to a need for additional equipment and configuration to reduce that jitter.

He said that he had already planned to talk about Homa for HPC before John Ousterhout's keynote the previous day (which we covered: part 1 and part 2). Waskiewicz sees the remote-procedure-call-based approach of Homa as being similar to the RDMA Verbs API. Having Homa available both in user space and the kernel would allow for more flexibility. Being able to use standard Ethernet equipment throughout the network would be worthwhile from a maintenance and cost standpoint as well.

He wondered whether there are other possibilities that the HFT industry should be looking at. Any such option must eliminate jitter, as he had mentioned multiple times, but it must also offer low latency. If the latency is 100µs, even without any jitter, it still cannot be used for HFT because "it will lose every time".

Waskiewicz was running out of time at that point so he quickly reiterated his main points from the talk. Not surprisingly, jitter was the centerpiece; it is important to ensure that the algorithms can get the predictable latency that they need because the exchanges themselves can be damaged "when algorithmic trading goes haywire". It is not hard to find instances where this kind of trading has caused problems that made exchanges hit their circuit-breakers—or worse. He is pleased to see that there are various efforts to attack the jitter problem underway at this point.

Comments (48 posted)

Class action against GitHub Copilot

By Jonathan Corbet
November 10, 2022
The GitHub Copilot offering claims to assist software developers through the application of machine-learning techniques. Since its inception, Copilot has been followed by controversies, mostly based on the extensive use of free software to train the machine-learning engine. The announcement of a class-action lawsuit against Copilot was thus unsurprising. The lawsuit raises all of the expected licensing questions and more; while some in our community have welcomed this attack against Copilot, it is not clear that this action will lead to good results.

Readers outside of the US may not be entirely familiar with the concept of a class-action lawsuit as practiced here. It is a way to seek compensation for a wrong perpetrated against a large number of people without clogging the courts with separate suits from each. The plaintiffs are grouped into a "class", with a small number of "lead plaintiffs" and the inevitable lawyers to represent the class as a whole. Should such a suit prevail, it will typically result in some sort of compensation to be paid to anybody who can demonstrate that they are a member of the class.

Class-action lawsuits have been used to, for example, get compensation for victims of asbestos exposure; they can be used to address massive malfeasance involving a lot of people. In recent decades, though, the class-action lawsuit seems to have become mostly a vehicle for extorting money from a business for the enrichment of lawyers. It is not an uncommon experience in the US to receive a mailing stating that the recipient may be a member of a class in a suit they have never heard of and that, by documenting their status, they can receive a $5 coupon in compensation for the harm that was done to them.

Compensation for the lawyers involved, instead, tends to run into the millions of dollars. Not all class-action lawsuits are abusive in this way, but it happens often enough that it has become second nature to look at a new class-action with a jaundiced eye.

The complaint

The complaint was filed on behalf of two unnamed lead plaintiffs against GitHub, Microsoft, and a multitude of companies associated with OpenAI (which is partially owned by Microsoft and participated in the development of Copilot). It explains at great length how Copilot has been trained on free software, and that it can be made to emit clearly recognizable fragments of that software without any of the associated attribution or licensing information. A few examples are given, showing where the emitted software came from, with some asides on the (poor) quality of the resulting code.

Distribution of any software must, of course, be done in compliance with the licenses under which that software is released. Even the most permissive of free-software licenses do not normally allow the removal of copyright or attribution information. Thus, the complaint argues, the distribution of software by Copilot, which does not include this information, is in violation of that software's licenses and is not, as GitHub seems to claim, a case of fair use. Whether fair use applies to Copilot may well be one of the key turning points in this case.

The members of the class of people who have allegedly been harmed by this activity are defined as:

All persons or entities domiciled in the United States that, (1) owned an interest in at least one US copyright in any work; (2) offered that work under one of GitHub’s Suggested Licenses; and (3) stored Licensed Materials in any public GitHub repositories at any time between January 1, 2015 and the present (the “Class Period”).

It is, as would be expected, a US-focused effort; if there is harm against copyright owners elsewhere in the world, it will have to be addressed in different courts. This wording would seem to exclude developers who have never themselves placed code on GitHub, but whose code has been put there by others — a frequent occurrence.

The list of charges against the defendants is impressive in its length and scope:

  • Violation of the Digital Millennium Copyright Act, brought about by the removal of copyright information from the code spit out by Copilot.
  • Breach of contract: the violation of the free-software licenses themselves. The failure to live up to the terms of a license is normally seen as a copyright violation rather than a contract issue, but the plaintiffs have thrown in the contract allegation as well.
  • Tortious interference in a contractual relationship; this is essentially a claim that GitHub is using free software to compete against its creators and has thus done them harm.
  • Fraud: GitHub users, it is claimed, were induced to put their software on GitHub by the promises made in GitHub's terms of service, which are said to be violated by the distribution of that software through Copilot.
  • False designation of origin — not saying where the software Copilot "creates" actually comes from.
  • Unjust enrichment: profiting by removing licensing information from free software.
  • Unfair competition: essentially a restatement of many of the other charges in a different light.
  • Breach of contract (again): the contracts in question this time are GitHub's terms of service and privacy policy.
  • Violation of the California Consumer Privacy Act: a claim that the plaintiffs' personal identifying information has been used and disclosed by GitHub. Exactly which information has been abused in this way is not entirely clear.
  • Negligent handling of personal data: another claim related to the disclosure of personal information.
  • Conspiracy: because there are multiple companies involved, their having worked together on Copilot is said to be a conspiracy.

So what is this lawsuit asking in compensation for all of these wrongs? It starts with a request for an injunction to force Copilot to include the relevant licensing and attribution information with the code it emits. From there, the requests go straight to money, with attorney's fees being at the top of the list. After that, there are nine separate requests for both statutory and punitive damages. And just in case anybody thinks that the lawyers are thinking too small:

Plaintiffs estimate that statutory damages for Defendants’ direct violations of DMCA Section 1202 alone will exceed $9,000,000,000. That figure represents minimum statutory damages ($2,500) incurred three times for each of the 1.2 million Copilot users Microsoft reported in June 2022.

It seems fair to say that a lot of damage is being alleged here.

Some thoughts

The vacuuming of a massive amount of free software into the proprietary Copilot system has created a fair amount of discomfort in the community. It does, in a way, seem like a violation of the spirit of what we are trying to do. Whether it is a violation of the licenses involved is not immediately obvious, though. Human programmers will be influenced by the code they have seen through their lives and may well re-create, unintentionally, something they have seen before. Perhaps an AI-based system should be forgiven for doing the same.

Additionally, there could be an argument to be made that the code emitted by Copilot doesn't reach the point of copyright violation. The complaint spends a lot of time on the ability to reproduce in Copilot, using the right prompt, a JavaScript function called isEven() — which does exactly what one might expect — from a Creative-Commons-licensed textbook. It is not clear that a slow and buggy implementation of isEven() contains enough creative expression to merit copyright protection, though.

That said, there are almost certainly ways to get more complex — and useful — output from Copilot that might be considered to be a copyright violation. There are a lot of interesting questions that need to be answered regarding the intersection of copyright and machine-learning systems that go far beyond free software. Systems that produce images or prose, for example, may be subject to many of the same concerns. It would be good for everybody involved if some sort of consensus could emerge on how copyright should apply to such systems.

A class-action lawsuit is probably not the place to build that consensus. Lawsuits are risky affairs at best, and the chances of nonsensical or actively harmful rulings from any given court are not small. Judges tend to be smart people, but that does not mean that they are equipped to understand the issues at hand here. This suit could end up doing harm to the cause of free software overall.

The request for massive damages raises its own red flags. As the Software Freedom Conservancy noted in its response to the lawsuit, a core component of the ethical enforcement of free-software licenses is to avoid the pursuit of financial gain. The purpose of an enforcement action should be to obtain compliance with the licenses, not to generate multi-billion-dollar payouts. But such a payout appears to be an explicit goal of this action. Should it succeed, there can be no doubt that many more lawyers will quickly jump into that fray. That, in turn, could scare many people (and companies) away from free software entirely.

Bear in mind that most of these suits end up being settled before going to court. Often, that settlement involves a payment from the defendant without any admission of wrongdoing; the company is simply paying to make the suit go away. Should that happen here, the result will be a demonstration that money can be extracted from companies in this way without any sort of resolution of the underlying issues — perhaps a worst-case scenario.

Copilot does raise some interesting copyright-related questions, and it may well be, in the end, a violation of our licenses. Machine-learning systems do not appear to be going away anytime soon, so it will be necessary to come to some conclusions about how those systems interact with existing legal structures. Perhaps this class-action suit will be a step in that direction, but it is hard to be optimistic that anything helpful will come of it. Perhaps, at least, GitHub users will receive a coupon they can use to buy a new mouse or something.

Comments (166 posted)

Git evolve: tracking changes to changes

By Jonathan Corbet
November 11, 2022
The Git source-code management system exists to track changes to a set of files; the stream of commits in a Git repository reflects the change history of those files. What is seen in Git, though, is the final form of those commits; the changes that the patches themselves went through on their way toward acceptance are not shown there. That history can have value, especially while changes are still under consideration. The proposed git evolve subcommand is a recognition that changes themselves go through changes and that this process might benefit from tooling support.

Some patches are applied to a project's repository soon after being written, but others take more work. Consider, for example, support for stackable security modules, which has been through (at least) 38 revisions over many years. If and when this work lands in the Linux kernel mainline, it will bear little resemblance to what was initially posted years ago. Each revision will have undergone changes that rippled through much of the 39-part patch set. Git can support iteration on a series like that, but it can be a bit awkward, leading many developers to use other tools (such as Quilt) to manage in-progress work.

Commits, meta-commits, and changes

The proposed evolve functionality for Git adds a new level of tracking for "meta-commits", which can be thought of as versions of a specific commit. The meta-commits describing the history of a given commit are stored in a special branch that is called a "change". The documentation tosses the "meta-commit" and "change" terms around almost as if they were interchangeable and, for the most part, they can be thought of as the same. Meta-commits simply hold the history of a change — the evolution that a given commit has gone through over time.

Consider an extended example: if a developer does some work and commits it, the result will be the new commit itself (identified by its hash — we'll call it A in this case) and a meta-commit stored in a new change branch with a name like metas/mc1 (the naming of changes is a subject of its own, with the obligatory hook so that users can add scripts to generate their own names). The result is a structure that, given the limits of your editor's diagramming skills, can be represented as:

[The first commit]

Here we see the new commit A on the trunk branch in the local repository; the change branch metas/mc1 contains a meta-commit with reference to the hash of that commit.

Now imagine that this commit, like many, is not perfect in its initial form; it will need to be improved. If, later on, this developer uses a command like git commit --amend to change this commit, Git will update the metas/mc1 change to refer to the hash of the amended commit (B here), but also to note that this commit "obsoletes" commit A:

[The first commit, amended]

The old commit A will remain in the repository and can be consulted if, later, somebody wants to see what changed between A and B.

If our developer adds another commit C (without --amend) to the same branch, the result will be another change branch, call it metas/mc2, referring to this new commit:

[A second commit]

A large patch series will thus have a number of active change branches, one for each commit in the series. Notably, the mechanism described above ensures that the change name for each commit in the series remains stable, even as the commits themselves are changed. The first commit in the series is always described by metas/mc1, even as that commit itself evolves over time and its hash changes. There is a set of commands to list the known changes, and a simple git reset or git checkout command can be used to reset the branch to a given change.

Now suppose that commit B needs further changes; our developer has inexplicably forgotten to use reverse Christmas-tree ordering for their variable declarations and has been called out on it. They can use git reset to go back to that commit — the one described by metas/mc1 — to fix this unacceptable state of affairs. A bit of editing and a new git commit with --amend will yield a new commit D, and metas/mc1 will be updated to reflect the fact that commit B has been obsoleted.

[Amending the first commit]

The first commit in the series has been updated, but now our second commit in the series (C), the one described by metas/mc2, still has the old commit B as its parent, so the sequence has been split. If the developer now runs git evolve, though, all changes that were based on metas/mc1 (in any version) will be rebased, recreating the full change history.

[After git evolve]

The commit formerly known as C has been rebased on top of D, restoring the full patch series. The git evolve command can also be used to update a set of changes to a new base in the repository — rebasing all of the changes to reflect changes merged elsewhere.

More than rebase

Thus, git evolve can be used somewhat like git rebase, but there are some differences. Perhaps most significant is that commits can be modified in various places in the stream, then all evolved together at the end. A developer can, for example, make changes to patches 3, 7, and 9 of a 12-part series, each isolated from the other, then use git evolve to stitch the sequence back together at some future time.

Another difference is that the change history might not be strictly linear. As a simple example, imagine a repository with a single commit; the developer could amend that commit to create a new change, like the commit B shown above. If, then, our developer uses git reset to get back to the pre-amend commit (A) and amends it again, there will now be two changes, each of which obsoletes commit A. The documentation calls this "divergence"; the change history for a patch series can contain any number of divergences and changes built upon them. A divergence could be caused by trying alternative fixes for a problem, for example.

Git will be able to track that divergence indefinitely, but there will come a point when things need to be resolved. For example, if the developer runs git evolve, Git will need to know how to resolve the divergence so that it can rebase the rest of the series. The usual resolution at that point is to do a merge of the diverging changes, but it is also possible to simply pick one side.

Since changes are Git branches in their own right, they can be pushed and pulled between repositories. So developers can share the current state of their work — and how it got to that state — with other developers or with some sort of change-tracking system. Anybody who can access a change can review the various versions of the patch and see the direction in which the work is heading.

Finally, changes are ephemeral, in that they are really only relevant until the work they describe is finalized and committed to a trunk branch. At that point, the change is presumably perfect and the story of how it got to its current state is no longer of interest. So, whenever a git evolve command sees that the commit described by a change has been merged, it will automatically delete that change. A developer's set of active changes will thus normally reflect the work that is actually in progress at any given time.

An evolving story

The above description was mostly taken from this document describing the proposed feature. The document is thorough and detailed, but a bit challenging. Your editor only had to read it a dozen times or so, though, to get a superficial understanding of what is going on.

The git evolve patches are not new; indeed, they have been through a fair amount of evolution themselves. An initial design for the feature was posted by Stefan Xenos in late 2018, and the first implementation patches came out in January 2019. The most recent version of these patches, as of this writing, was posted by Christophe Poucet in early October. There has been interest in the patches over the years, but the complexity of the feature also arguably makes it hard for others to properly review.

As a result, it is still not clear whether git evolve will find its way into the Git mainline or not. There are some clear use cases for git evolve, and each version of the patch set has evoked active discussion. Whether the benefits of the feature justify the added complexity will be something for the Git maintainers to evaluate, though. If the evolve functionality can clear the bar, it could enhance Git with features that developers currently must seek in other tools. Some more complexity evolved into Git here might thus simplify life overall.

Comments (35 posted)

Block-device snapshots with blksnap

By Jonathan Corbet
November 14, 2022
As a general rule, one need not have worked in the technology industry for long before the value of good data backups becomes clear. Creating a backup that is truly good, though, can be a challenge if the filesystem in question is actively being changed while the backup process runs. Over the years, various ways of addressing this problem have been developed, ranging from simply shutting down the system while backups run to a variety of snapshotting mechanisms. The kernel may be about to get another approach to snapshots should the blksnap patch set from Sergei Shtepa find its way into the mainline.

The blksnap patches are rigorously undocumented, so much of what follows comes from reverse-engineering the code. Blksnap performs snapshotting at the block-device level, meaning that it is entirely transparent to any filesystems that may be stored on the devices in question. It is able to create snapshots of a set of multiple block devices, so it should be suitable for RAID arrays and such. The targeted use case appears to be automated backup systems; the snapshots that blksnap creates are described as "non-persistent" and are meant to be discarded once a real backup has been made.

Since blksnap works at the block level, it must be given space to store snapshots that is separate from the devices being snapshotted. Specifically, there are ioctl() operations to assign ranges of sectors on a separate device for the storage of "difference blocks" and to change those assignments over time. There is a notification mechanism whereby a user-space process can be told when a given difference area is running low on space so that it can assign more blocks to that area.

The algorithm used by blksnap is simple enough: once a snapshot has been created for a set of block devices (using another ioctl() operation), blksnap will intercept every block-write operation to those devices. If a given block is being written to for the first time after the snapshot was taken, the previous contents of that block will be copied to the difference area, and a note will be made that the block has been changed since the snapshot was created. Once that is done, the write operation can continue normally. The block devices thus always reflect the most recent writes, while the difference area contains the older data needed to recreate the state of those devices at the time the snapshot was created.

In order to be able to intercept writes to the block devices, Shtepa has had to add a new "device filter" mechanism to the block layer. A filter can be attached to a specific device that will be called prior to the execution of each operation on that device, with the BIO structure representing that operation as a parameter. If the filter function returns false, the operation will not be executed. An earlier version of the patch set provided the ability to attach multiple filters to a block device at different "altitudes", but that was removed since there are no other uses for filters currently.

Blksnap uses the filter function to catch writes to the snapshotted device(s). When a write is found, the operation is put on hold while the original contents of the blocks to be written are copied to the difference area; once that is complete, the write is submitted normally.

Interestingly, nothing in the patch set describes how one might gain access to a snapshot once it has been created. A look at the ioctl() interface shows a couple of possibilities, though. One is an operation to obtain the list of changed blocks associated with a snapshot, which might be useful for certain types of incremental backups. But blksnap also creates a new, read-only device for each snapshot taken. Reading a block from that device causes blksnap to consult its map of changed blocks; if the block in question has been changed, it is read from the difference area. Otherwise, it can be read from the original block device. The major and minor numbers of the snapshot devices can be obtained with another ioctl() operation; there is also an undocumented sysfs file that apparently can be consulted.
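
The overall copy-on-write scheme is easy to model in user space. The sketch below is purely illustrative and shares no code or interfaces with blksnap: it treats a "device" as an array of blocks, saves a block's original contents to a difference area on the first write after a snapshot, and resolves snapshot reads from the difference area when the block has changed.

    /* Illustrative user-space model of blksnap-style copy-on-write
     * snapshots; not actual blksnap code or its ioctl() interface. */
    #include <stdbool.h>
    #include <stdio.h>
    #include <string.h>

    #define NBLOCKS    8
    #define BLOCK_SIZE 16

    static char device[NBLOCKS][BLOCK_SIZE];   /* the "live" block device */
    static char diff[NBLOCKS][BLOCK_SIZE];     /* the difference area     */
    static bool changed[NBLOCKS];              /* modified since snapshot */

    /* Write path: on the first write after the snapshot was taken, copy
     * the old contents to the difference area before overwriting them. */
    static void write_block(int blk, const char data[BLOCK_SIZE])
    {
        if (!changed[blk]) {
            memcpy(diff[blk], device[blk], BLOCK_SIZE);
            changed[blk] = true;
        }
        memcpy(device[blk], data, BLOCK_SIZE);
    }

    /* Snapshot read path: changed blocks come from the difference area,
     * unchanged blocks from the original device. */
    static const char *snapshot_read(int blk)
    {
        return changed[blk] ? diff[blk] : device[blk];
    }

    int main(void)
    {
        char newdata[BLOCK_SIZE] = "updated";

        strcpy(device[0], "original");
        memset(changed, 0, sizeof(changed));   /* "take" the snapshot */

        write_block(0, newdata);

        printf("live device: %s\n", device[0]);        /* "updated"  */
        printf("snapshot:    %s\n", snapshot_read(0)); /* "original" */
        return 0;
    }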

The kernel does not lack for the ability to make snapshots now, so one might logically wonder why blksnap is needed. It clearly differs from the snapshot feature offered by filesystems like Btrfs, since blksnap operates at the block-device level. Among other things, blksnap can be used with filesystems that do not, themselves, have a snapshot feature. Btrfs snapshots are stored on the same block device as the filesystem itself, meaning that the two can compete for space, and the space used by snapshots could prevent the writing of data to the live filesystem. Since blksnap stores its snapshot data on a separate device, that data won't get in the way of ongoing operations. If the difference area runs out of space, the snapshot will be corrupted, but the device being snapshotted will be unaffected.

An existing alternative at the block level is the device mapper snapshot target. The functionality provided by blksnap is, in many ways, similar to that of the device mapper; both work by intercepting writes and copying the old data to a separate device. Blksnap can be used without needing to set up the device mapper for the devices to be snapshotted, though. It also claims to have more flexible management of its difference area, especially when multiple devices are being snapshotted together.

These differences appear to be interesting enough that nobody has, so far, questioned whether blksnap is a useful addition to the kernel. The patch set (despite being marked "v1") is on its second revision, having seen a number of fixes from its first posting in July. With luck, the next revision will incorporate some documentation; then perhaps it will be nearing readiness for inclusion into the mainline.

Comments (32 posted)

Scaling the KVM community

November 15, 2022

This article was contributed by Paolo Bonzini


KVM Forum

The scalability of Linus Torvalds was a recurring theme during Linux's early years; these days maintainer struggles are a recognized problem within open-source communities in general. It is thus not surprising that Sean Christopherson gave a talk at Open Source Summit Europe (and KVM Forum) with the title "Scaling KVM and its community". The talk mostly focused on KVM for the x86 architecture—the largest and most mature KVM architecture—which Christopherson co-maintains. But it was not a technical talk: most of the content can be applied to other KVM architectures, or even other Linux subsystems, so that they can avoid making the same kinds of mistakes.

The problem

The KVM hypervisor is not small: it covers six architectures, each of which often supports various "flavors" of virtualization. It consists of about 150,000 lines of code, plus 90,000 lines of tests split between the kernel selftests and the kvm-unit-tests project. Every year, over 150 contributors add over a thousand commits. These numbers are expected to grow, and Christopherson identified three areas that need to scale to accommodate KVM's growth:

  • Development: KVM needs to support new architectures as they appear, new features of existing architectures, and new virtualization use cases.
  • Maintenance: a broader set of architectures, features, and use cases means more patches to review and more code to maintain.
  • Validation: a growing amount of code means more things that can break.

It is important to note that scaling development and maintenance does not necessarily require more developers and more maintainers. The existing developers and maintainers can scale better if they can be more efficient, for example if fewer bugs are introduced. In fact, adding more developers and maintainers without improving the validation efforts will make development and maintenance slower and more expensive. It is worth noting that, whenever Christopherson referred to maintainers throughout the talk, he also included in the group people that are listed in the MAINTAINERS file as "reviewers", as well as those who are regularly triaging and fixing bugs.

In order to improve the situation, he said, the first step is to recognize the current problems in the community; even agreeing on the existence of problems would be a successful result of his presentation. In order to identify the problems, he took inspiration from the metrics used for computer networks and proposed four characteristics that can make KVM a successful community: low latency, high efficiency, better monitoring, and more durability.

Metrics

Low latency means less time between writing the code and having it reviewed and merged. High latency is perhaps the biggest problem in KVM x86, which is due to the time spent waiting for reviews and the number of "pings" that have to be sent to get the maintainers' attention. This leads to developer frustration; it also impacts the schedule of downstream consumers of KVM, making it harder for them to stay close to the Linux mainline and to work upstream first.

Over the last decade, the number of people maintaining the code base has been roughly flat, but the number of contributors has doubled. A common approach to increasing the number of maintainers is to define sub-components, but according to Christopherson the KVM code for each architecture is too small for this to be feasible. Experience also shows that splitting subsystems too finely leads to duplicated code, as was the case when code to support Intel and AMD processors was developed more independently. He thinks it is unlikely that the number of maintainers will grow in the near future.

The number of non-merge commits to KVM over the last two years has nearly doubled. This is mostly thanks to Google (his employer) moving aggressively to an upstream-first approach and merging patches that had been in its internal fork for years. However, it is unlikely that the activity will go back to the pre-2020 levels once all of those features are merged. Plenty of other changes are in the pipeline, and those that have been dropped due to lack of review resources could come back, for example address space isolation or virtual-machine introspection. This puts a high demand on developers and maintainers to do their work efficiently.

Developer efficiency is badly affected by uncertainty. In order to improve efficiency, both developers and maintainers need to know what they can expect of each other. If developers do not know which tests they need to run, and on which architectures, they will send more flawed patches; without clear rules for which branch to develop against, they will have to respin them more frequently in case of conflicts. On the other hand, maintainers need to be clear on the state of accepted patches. The time between a maintainer accepting a patch and the corresponding commit showing up in a public tree should be short, Christopherson said, because that is another source of developer uncertainty.

Efficiency not only means less time spent on the implementation of a new feature, but also less code to write in order to keep feature parity across architectures. RISC-V is the newest architecture to gain a KVM port, and it has the opportunity to generalize code written for other architectures (for example, hypervisor page-table management) instead of duplicating it. If this happens, the number of architecture-independent patches will grow in the future, after having remained almost flat for more than 10 years; it will thus reduce the number of architecture-dependent patches as well.

Monitoring is required to ensure that KVM does not acquire new bugs, and that those that slip through are fixed quickly. KVM's wealth of test code is a great asset for the project, but it does not help if the tests are not run; in that case, the same bug may be encountered, debugged, and even fixed independently by multiple users and developers.

Looking at KVM x86 commits, about 20-25% of the commits have a "Fixes" tag attached to the commit message. That percentage has started to grow at the same time that the number of overall commits accelerated. This is likely not a bad thing: "Fixes" tags have become more prevalent in general in kernel development over the last few years, and the increased development pace included fixing a lot of old bugs.

Still, "upstream KVM is woefully behind on the continuous-integration train", Christopherson said. Most of the continuous testing of KVM comes from bots that test the kernel at large, and the timing of the tests is not consistent enough for it to be relied upon as a measure of KVM's health. Testing could be done at a minimum on the public trees, but it could be done at a finer grain for each patch submission before it makes to an official queue.

Durability is just another name for stability: a stable hypervisor means that developer and maintainer time is not spent constantly fixing KVM, especially not fixing bugs that were introduced several years ago and stayed unnoticed for a long time.

Improving durability is not necessarily a simple matter, as it may entail a change in the developers' mindset. In the past, KVM has often adopted a "good enough" approach, but does close only count in horseshoes or does it also count in KVM? This approach worked when KVM had relatively few features, but eventually these shortcuts even started to interact in unexpected ways with each other or with changes made to guest operating systems. In fact, on x86 there is an API to disable "quirks", which are processor behaviors that are present in KVM virtual machines but not on actual hardware.

Even if a shortcut solution keeps working, the community tends to forget the details due to attrition. Of all the people who have contributed 20+ commits to KVM for at least three years, only five are still active. Good-enough code imposes a long-term penalty on the stability of a project. Unless the community diligently documents these shortcuts, knowledge of them will be lost and will have to be relearned the hard way—not just what KVM does, but why it behaves that way.

How to improve?

Having enumerated the issues that can jeopardize the well-being of the KVM development community, Christopherson proceeded to describe five ways to fix these problems. Many of these improvements are not limited to KVM, and could be applied more generally to other Linux subsystems or even other open-source projects.

The first is to document everything that developers need to work efficiently. This had been a constant theme during the first part of the talk: patch lifecycle, testing requirements for developers, expectations around flaky tests, and deviations from architectural specifications had already been mentioned as things that should be explicitly written down. He also proposed documenting the key dates within the Linux release cycle where patches will be reviewed and accepted, so that maintainers do not feel obliged to squeeze in patches at the last minute and break things. Removing uncertainty removes friction by avoiding the perception that maintainers are ignoring certain developers or features.

The second aspect of the solution is testing. Diligent testing on the part of the developers can be a major contributor to maintainer efficiency: many issues with submitted patches are caught when the maintainers run kvm-unit-tests or the KVM self tests because the developers had never done so before submission. New features should come with associated tests, which has been enforced more and more strictly over the past few years; but rather than write minimal tests to appease the maintainers, developers should use tests to find bugs themselves, using brute force when possible and even introducing bugs purposely to verify that the tests catch them. Tests can also serve as documentation and point out edge cases in the code.

It is important that the tests are easy to write and do not require too much boilerplate. While developers are unlikely to ever be excited about writing tests, the framework should not get in the way. This is especially true of the KVM self-tests framework, which allows describing test scenarios precisely but can also be daunting to approach.

The third point is sharing. More effort has to be put into sharing code across architectures. Common problems should be solved once, by consolidating code instead of duplicating it. That can only be done if maintainers are familiar with multiple architectures and suggest sharing the code from the beginning. Christopherson proposed that I, as the overall KVM maintainer, reduce my own x86 responsibilities and focus on this issue, so that the next generation of maintainers can be trained to operate on a code base where cross-architecture work is the norm. This transition was in fact already in progress at the time of the talk.

The fourth improvement is automation of both integration and developer testing in order to catch bugs as soon as possible. This reduces the "downtime" of KVM, the amount of time where the top of the development tree is broken and no new changes can be merged. While rare, these events happen and make it harder for contributors to do their work.

After the talk, Christopherson was asked how to get developers to follow his suggestions and automate their work. His answer was that the community does not only include kernel developers, and there are people that are knowledgeable about setting up automation. They can do so in a way that can help everyone in the community. He also remarked on how reinventing the test-automation wheel seems to be a "rite of passage" for Linux developers; he suggested instead that experienced developers try to share methods and scripts for testing, or handy Git aliases, even though they are unlikely to ever be part of the upstream kernel sources.

His final suggestion is to adapt to the changing circumstances. Sharing a focus on durability of the code and the community means a different approach to development of new features. Christopherson stressed that, as a rule, new features should be implemented according to the hardware specifications and without making assumptions about the guest's behavior. While there will always be exceptions due to features that are not easily (or not at all) virtualizable, following the hardware specifications relieves the developers from having to document said assumptions. Despite all the flaws that processor manuals have, they are way ahead of KVM documentation, which can then be restricted to the differences between real hardware and virtual machines.

On top of this, developers should speak up when an issue arises and propose improvements to the process and documentation so that it does not happen again. Even if their proposal is not accepted, they will at least get an explanation as to why things are the way they are.

Conclusions

KVM's growth in size and complexity has certainly been challenging to the community and the maintainers; Google's upstreaming of the previously-internal KVM patches helped make this clear to Christopherson. Addressing the challenges described in the talk will make it easier for KVM developers to add support for new technologies and new use cases of virtualization—and to make them available to private users and cloud-computing providers worldwide.

Comments (2 posted)

Page editor: Jonathan Corbet


Copyright © 2022, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds