LWN.net Weekly Edition for September 5, 2019
Welcome to the LWN.net Weekly Edition for September 5, 2019
This edition contains the following feature content:
- Maintaining the kernel's web of trust: the public key servers aren't working anymore, so the kernel community takes web-of-trust management into its own hands.
- Bias and ethical issues in machine-learning models: a pair of conference sessions on machine-learning bias.
- Kernel runtime security instrumentation: a proposed Linux security module for attack detection and response.
- Change IDs for kernel patches: an attempt to improve the connection between Git commits and the discussions that lead up to them.
- Examining exFAT: the legal obstacles to merging the exFAT filesystem module may have gone away, but technical and procedural issues remain.
- CHAOSS project bringing order to open-source metrics: how does one objectively measure the health of a project?
This week's edition also includes these inner pages:
- Brief items: Brief news items from throughout the community.
- Announcements: Newsletters, conferences, security updates, patches, and more.
Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.
Maintaining the kernel's web of trust
A typical kernel development cycle involves pulling patches from over 100 repositories into the mainline. Any of those pulls could conceivably bring with it malicious code, leaving the kernel (and its users) open to compromise. The kernel's web of trust helps maintainers to ensure that pull requests are legitimate, but that web has become difficult to maintain in the wake of the recent attacks on key servers and other problems. So now the kernel community is taking management of its web of trust into its own hands.
Some history
As recently as 2011, there was no mechanism in place to verify the provenance of pull requests sent to kernel maintainers. If an emailed request looked legitimate, and the proposed code changes appeared to make sense, then the requested pull would generally be performed. That degree of openness makes for a low-friction development experience, but it also leaves the project open to at least a couple types of attacks. Email is easy to forge; an attacker could easily create an email that appeared to be from a known maintainer, but which requested a pull from a malicious repository.
The risk grows greater if an attacker somehow finds a way to modify a maintainer's repository (on kernel.org or elsewhere); then the malicious code would be coming from a trusted location. The chances of a forged pull request from a legitimate (but compromised) repository being acted on are discouragingly high.
The compromise of kernel.org in 2011 focused minds on this problem. By all accounts, the attackers had no idea of the importance of the machine they had taken over, so they did not even try to tamper with any of the repositories kept there. But they could have done such a thing. Git can help developers detect and recover from such attacks, but only to an extent. What the community really needs is a way to know that a specific branch or tag proposed for pulling was actually created by the maintainer for the relevant subsystem.
One action that was taken was to transform kernel.org from a machine managed by a small number of kernel developers in their spare time into a carefully thought-out system run by full-time administrators supported by the Linux Foundation. The provision of shell accounts to hundreds of kernel developers was belatedly understood to be something other than the best of ideas, so that is no longer done. No system is immune, but kernel.org has become a much harder target than before, so repositories stored there should be relatively safe.
The other thing that was done, though, was the establishment of a web of trust based on public-key encryption with GnuPG. When a subsystem maintainer readies a branch for pushing to a higher-level maintainer, they should apply a signed tag to the topmost commit. The receiving maintainer can then verify the signature and be sure that the series of commits they are pulling is what the maintainer had in mind. As can be seen from this article, not all maintainers are using signed tags, but their use has been growing over time. Adoption has been slowed a bit because Linus Torvalds does not require signed tags for pulls from kernel.org repositories.
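In Git terms the workflow is simple; the tag name, repository URL, and remote below are invented for illustration, but the commands themselves are the standard ones:

```sh
# Subsystem maintainer: sign a tag at the tip of the branch to be pulled
git tag -s subsys-updates-5.4 -m "subsys updates for 5.4"
git push origin subsys-updates-5.4

# Upstream maintainer: fetch the tag, check its signature, then merge it
git fetch https://git.example.org/subsys.git tag subsys-updates-5.4
git verify-tag subsys-updates-5.4
git merge subsys-updates-5.4
```

Of course, git verify-tag can only check a signature against keys that are already in the puller's GnuPG keyring.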
A signed tag by itself does not mean much; the other half of the problem is that the pulling maintainer must be able to verify that the key used to sign that tag actually belongs to the developer it claims to. That is where the web of trust comes in. If Torvalds can be convinced that a given key belongs to a specific subsystem maintainer, he can sign that key; other maintainers can then trust that the key is as advertised (as long as they trust Torvalds, anyway). In the kernel community, all roads lead to Torvalds, but he does not have to personally sign every maintainer's key; as long as there is a path of trusted signatures leading to him, a key will be trusted within the community.
The kernel's web of trust was bootstrapped in a painful key-signing session at the 2011 Kernel Summit; thereafter, new developers have had to convince others to sign their keys at conferences or other gatherings. Until recently, PGP key servers were used to hold keys and any signatures attached to them. A given maintainer's key could be easily fetched and, if the signature chain checked out, trusted. The attacks on the signature mechanism, including the attachment of thousands of bogus signatures to public keys, have taken the key servers out of the picture, though, leaving the community without a way to maintain its web of trust.
pgpkeys.git
Konstantin Ryabitsev, the lead administrator for kernel.org, has stepped into this void. After investigating a number of key-server alternatives, he concluded that none of them were fit for the purpose; the code is unmaintained and there is little interest in the development of web-of-trust systems in general at this point. So the alternatives are to give up on the web of trust as well or to come up with a new solution. Dropping the web of trust is not an appealing option:
So Ryabitsev has created a new Git repository to hold keys for kernel developers. It has been populated with keys used in pull requests in the past, along with the signatures on those keys. But, to avoid signature attacks, only signatures made with other keys stored in the repository are retained; that is sufficient to build the web of trust while eliminating the results of any signature spamming that might have taken place. There is also a set of SVG files showing how each key fits into the web of trust; to take a random example:
There is also, naturally, a way to make updates to this repository. Keys can be sent to an email address; after verification, they will be added to (or updated in) the repository. Finally, there is a script that can be used to automatically load all of the keys in this repository into one's personal GnuPG keyring. By running this script periodically, any developer can keep a copy of the entire kernel web of trust at hand.
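The helper script itself is not described here, but as a minimal sketch of what it automates (the repository URL, directory layout, and keyring file name are all assumptions made for illustration), a developer could import the keys by hand along these lines:

```sh
# Clone the key repository (URL assumed for illustration)
git clone https://git.kernel.org/pub/scm/docs/kernel/pgpkeys.git
cd pgpkeys

# Import every ASCII-armored key into a dedicated keyring, leaving the
# default keyring untouched (the keys/*.asc layout is an assumption)
gpg --no-default-keyring --keyring ~/.gnupg/kernel-wot.kbx --import keys/*.asc
```

Re-running the import periodically keeps the local copy of the web of trust current.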
To some, this may seem like a rearguard action aimed at propping up the web-of-trust concept when that idea is generally falling out of favor. It may well be true that the web of trust, as originally conceived with PGP many years ago, cannot scale to the Internet as a whole. It can still work, though, for relatively small communities, as the Debian project (for example) has shown for years. A Git repository full of keys will not solve the world's authentication problems, but it may well prove sufficient for the task of keeping kernel pull requests secure.
Bias and ethical issues in machine-learning models
The success stories that have gathered around data analytics drive broader adoption of the newest artificial-intelligence-based techniques—but risks come along with these techniques. The large numbers of freshly anointed data scientists piling into industry and the sensitivity of the areas given over to machine-learning models—hiring, loans, even sentencing for crime—mean there is a danger of misapplied models, which is earning the attention of the public. Two sessions at the recent MinneBOS 2019 conference focused on maintaining ethics and addressing bias in machine-learning applications.
To define a few terms: modern analytics increasingly uses machine learning, currently the most popular form of the field broadly known as artificial intelligence (AI). In machine learning, an algorithm is run repeatedly to create and refine a model, which is then tested against new data.
MinneBOS was sponsored by the Twin Cities organization Minne Analytics; the two sessions were: "The Ethics of Analytics" by Bill Franks and "Minding the Gap: Understanding and Mitigating Bias in AI" by Jackie Anderson. (Full disclosure: Franks works on books for O'Reilly Media, which also employs the author of this article.) Both presenters pointed out that bias can sneak into machine learning at many places, and both laid out some ways to address the risks. There were interesting overlaps between the recommendations of Franks, who organized his talk around stages, and of Anderson, who organized her talk around sources of bias.
When we talk about "bias" we normally think of it in the everyday sense of discrimination on the basis of race, gender, income, or some other social category. This focus on social discrimination is reinforced by articles in the popular press. But in math and science, bias is a technical term referring to improper data handling or choice of inputs. And indeed, the risks in AI go further than protected categories such as race and gender. Bias leads to wrong results, plain and simple. Whether bias leads to social discrimination or just to lost business opportunities and wasted money, organizations must be alert and adopt ways to avoid it.
Franks based his talk on the claim that ethics are intuitive and inherently slippery. As a simple example, consider the commandment "Thou shalt not kill." Although almost everybody around the world would acknowledge the concept, few would agree on exactly when you could violate the rule. Another example would be the famous "right to be forgotten", now mandated by the GDPR in the European Union. Sometimes, the need to keep data for legal conformance or law enforcement overrides the right to be forgotten. At other times, Franks said, it's just infeasible—how can you maintain a relationship with a health care provider or insurance company if you want them to forget something about you?
Franks offered five stages in machine learning at which to look for and correct bias. I'll describe them here, noting where Anderson's three sources of bias match up.
Modeling targets and business problems
This stage is where the two companies mentioned earlier parted ways on the question of offering customer rewards. As another example, Franks cited the privacy issues tied up in the notorious case where Target sent pregnancy-related offers to a 17-year-old who was trying to hide her pregnancy from her father. This public relations disaster highlights the differences between laws, ethics, and plain good sense about business goals. Legally, Target was perfectly entitled to send pregnancy-related offers. Although there's a difference between personal medical information and routine retail sales like potato chips, Target was probably within reasonable ethical bounds in sending the pregnancy information. Where it fell short was in considering what customers or the general public would find acceptable.
This stage in Franks's taxonomy aligns with Anderson's first source of bias, "Defining the problem". Anderson mentioned, as an example, college analytics about what potential students to target for promotional materials. (Recruitment costs per student start around $2,300 and can go over $6,000.) If one assumes that the students getting promotions are more likely to apply, matriculate, graduate, and ultimately benefit the institution (certainly that's the reasoning behind sending out the materials), avoiding bias is important for both business and social reasons. Anderson said that, before running analytics, the college has to ask what it's trying to find out: Who is most likely to accept, who is most likely to graduate, who will give the largest donations later in life, who will meet diversity goals, etc.
Modeling input data
Here the core problem is that nearly every algorithm predicts future performance based on existing data, so it reproduces whatever bias was used in the past. Anderson calls this a form of circular reasoning that excludes important new candidates, whether for a college, a retail business, or a criminal justice probe. Franks pointed out that any model can be invalidated by a change in the environment or the goal.
Franks turned here, as Anderson had, to an example of predicting college admissions. If your model was trained at a time when 80% of your students were liberal arts majors, but you have changed strategies so that 80% of your students are now business majors, your old model will fail. Another example Franks offered of a changed environment pertains to criminal sentencing. U.S. society has recently changed its idea of how to punish criminals, so "fair sentencing" software based on the old ways of sentencing has become invalid.
Two of Anderson's sources of bias align with this stage: "Selecting data" and "Cleaning/preparing the data". Anderson cited a problem in preparing data, involving a retail firm that was trying to determine which customers were most profitable. When cleaning the data, the firm excluded small purchases because it assumed it should aim for customers who purchased expensive items. Later it found out that it had totally missed a large and loyal clientele that spends quite a lot—but in small amounts.
Often, she said, you have to look at results to uncover hidden sources of bias. In one case she cited, Amazon's HR department designed an algorithm to help find the best programmers among applicants. The department started without any checking for bias, which naturally led to a model that discriminated in favor of men because they are currently most common in the field. After looking at results and realizing that the model was biased against women, the department created another model after changing obvious gender markers such as pronouns and names. But the model was still biased. It turns out that word choices in job applications are gender-specific too.
Modeling transparency and monitoring
Franks endorsed the idea of transparency in models, commonly called "explainability": model users have to know what is triggering a decision. In one example, an image-recognition program was trained to distinguish huskies from wolves. It seemed to perform extremely well until researchers delved into a failed case. They discovered that the program didn't look at the animal at all. If there was snow, it assumed the animal was a wolf. This was due to the kinds of images it was shown; the wolves were outdoors and the huskies were mostly indoors.
The strategy used by these researchers to determine the exact features that led to a decision falls under the term Local Interpretable Model-Agnostic Explanations (LIME). In the case of image processing, a typical LIME approach is to focus on several different parts of the image and run them through the model to see which parts of the image truly predict the result. This process is model-agnostic.
Some fields have legal requirements for transparency. For instance, when a bank denies credit, it has to list the precise criteria that led to the decision. There are no such regulations in health care, but transparency is necessary here too. Most clinicians would be very nervous making a diagnosis based on a black-box model.
Franks suggested that LIME be a factor in government approval for the use of analytics models. If a robust LIME procedure shows that the algorithm is looking at relevant features, the model should have a stronger chance of being approved. Franks also brought up the importance of emergency brakes on algorithms. In one example, a programmed trading algorithm wiped out one company's stock price for a short time. The programmers should have built in a check for such sudden or bizarre behavior, and made the program shut itself down, just as a factory has a button any employee can push to stop the assembly line.
Modeling usage
At this stage, Franks said, you must understand whether the context in which you're using a model makes it fair. Often the researcher spends 10% of the time creating a reasonably good algorithm and 90% error-proofing it.
He complained that the public is overly spooked by exceptional events in the use of new analytics. Every time an autonomous vehicle is involved in an accident, there are bans and public calls to abandon the technology. The real question for Franks is: over a certain interval (say, 100,000 miles driven), how many accidents were caused by human drivers versus self-driving cars?
Defining policies
Franks said organizations should define clear policies regarding the use of machine learning and publish them. The famous case where Apple refused to unlock an accused criminal's cell phone had the side benefit of prompting Apple to state its views on privacy publicly. Apple may have gained some customers and lost others, but now everyone can judge the company on the basis of this policy.
Recommendations
Franks's concluding recommendations were to create an ethics review board consisting of people from different relevant disciplines, like research institutions' Institutional Review Boards (IRBs), to write out policies, and to deal firmly with violations. He called on the analytics community as a whole to take ownership of bias.
Anderson's main recommendation for fixing bias was to ensure team diversity. You have to experience life a certain way in order to understand how people with that life behave—and how they're potentially excluded. She also advised companies to collaborate, overcoming the barriers they set up out of worries over trade secrets. Many organizations don't even share the information they have across internal company borders. They should join forums and collaborative initiatives to recognize bias and improve diversity. Organizations focused on this include the Algorithmic Justice League (AJL). There are also toolkits such as IBM's AI Fairness 360.
The talks by Franks and Anderson showed that, about a decade into the new epoch of machine learning, researchers and practitioners are aware of bias and are designing practices that try to correct it. One remaining question is how much we as a society can depend on the competency and goodwill of the organization designing or using the model. Where can regulation fit in? And how much responsibility lies on the researchers who designed the model, versus the user who applies it in real life, or even the regulators who approve the model's use? Hopefully, as we learn the efficacy of practices that correct bias we can also answer the question of how to make sure they are used.
Kernel runtime security instrumentation
Finding ways to make it easier and faster to mitigate an ongoing attack against a Linux system at runtime is part of the motivation behind the kernel runtime security instrumentation (KRSI) project. Its developer, KP Singh, gave a presentation about the project at the 2019 Linux Security Summit North America (LSS-NA), which was held in late August in San Diego. A prototype of KRSI is implemented as a Linux security module (LSM) that allows eBPF programs to be attached to the kernel's security hooks.
Singh began by laying out the motivation for KRSI. When looking at the security of a system, there are two sides to the coin: signals and mitigations. The signals are events that might, but do not always, indicate some kind of malicious activity is taking place; the mitigations are what is done to thwart the malicious activity once it has been detected. The two "go hand in hand", he said.
For example, the audit subsystem can provide signals of activity that might be malicious. If you have a program that determines that the activity actually is problematic, then you might want it to update the policy for an LSM to restrict or prevent that behavior. Audit may also need to be configured to log the events in question. He would like to see a unified mechanism for specifying both the signals and mitigations so that the two work better together. That is what KRSI is meant to provide.
He gave a few examples of different types of signals. For one, a process that executes and then deletes its executable might well be malicious. A kernel module that loads and then hides itself is also suspect. A process that executes with suspicious environment variables (e.g. LD_PRELOAD) might indicate something has gone awry as well.
On the mitigation side, an administrator might want to prevent mounting USB drives on a server, perhaps after a certain point during the startup. There could be dynamic whitelists or blacklists of various sorts: of kernel modules that can be loaded, of known-vulnerable binaries that should be prevented from executing, or of vulnerable core libraries that binaries should be blocked from loading in order to ensure that updates get done. Adding any of these signals or mitigations requires reconfiguration of various parts of the kernel, which takes time and/or operator intervention. He wondered if there was a way to make it easy to add them in a unified way.
eBPF + LSM
He has created a new eBPF program type that can be used by the KRSI LSM. There is a set of eBPF helpers that provide a "unified policy API" for signals and mitigations. They are security-focused helpers that can be built up to create the behavior required.
![KP Singh](https://static.lwn.net/images/2019/lssna-singh-sm.jpg)
Singh is frequently asked why he chose to use an LSM, rather than other options. Security behaviors map better to LSMs, he said, than to things like seccomp filters, which are based on system call interception. Various security-relevant behaviors can be accomplished via multiple system calls, so it would be easy to miss one or more, whereas the LSM hooks intercept the behaviors of interest. He also hopes this work will benefit the overall LSM ecosystem, he said.
He talked with some security engineers about their needs and one mentioned logging LD_PRELOAD values on process execution. The way that could be done with KRSI would be to add a BPF program to the bprm_check_security() LSM hook that gets executed when a process is run. So KRSI registers a function for that hook, which gets called along with any other LSMs' hooks for bprm_check_security(). When the KRSI hook is run, it calls out to the BPF program, which will communicate to user space (e.g. a daemon that makes decisions to add further restrictions) via an output buffer.
The intent is that the helpers are "precise and granular". Unlike the BPF tracing API, they will not have general access to internal kernel data structures. His slides [PDF] had bpf_probe_read() in a circle with a slash through it as an indication of what he was trying to avoid. The idea is to maintain backward compatibility by not tying the helpers to the internals of a given kernel.
He then went through various alternatives for implementing this scheme and described the problems he saw with them. To start with, why not use audit? One problem is that the mitigations have to be handled separately. But there is also a fair amount of performance overhead when adding more things to be audited; he would back that up with some numbers later in the presentation. Also, audit messages have rigid formatting that must be parsed, which might delay how quickly a daemon could react.
Seccomp with BPF was up next. As he said earlier, security behaviors map more directly into LSM hooks than to system-call interception. He is also concerned about time-of-check-to-time-of-use (TOCTTOU) races when handling the system call parameters from user space, though he said he is not sure that problem actually exists.
Using kernel probes (kprobes) and eBPF was another possibility. It is a "very flexible" solution, but it depends on the layout of internal kernel data structures. That makes deployment hard as things need to be recompiled for each kernel that is targeted. In addition, kprobes is not a stable API; functions can be added and removed from the kernel, which may necessitate changes.
The final alternative was the Landlock LSM. It is geared toward providing a security sandbox for unprivileged processes, Singh said. KRSI, on the other hand, is focused on detecting and reacting to security-relevant behaviors. While Landlock is meant to be used by unprivileged processes, KRSI requires CAP_SYS_ADMIN to do its job.
Case study
He then described a case study: auditing the environment variables set when executing programs on a system. It sounds like something that should be easy to do, but it turns out not to be. For one thing, there can be up to 32 pages of environment variables, which he found surprising.
He looked at two different designs for an eBPF helper, one that would return all of the environment variables or one that just returned the variable of interest. The latter has less overhead, so it might be better, especially if there is a small set of variables to be audited. But either of those helpers could end up sleeping because of a page fault, which is something that eBPF programs are not allowed to do.
Singh did some rough performance testing in order to ensure that KRSI was not completely unworkable, but the actual numbers need to be taken with a few grains of salt, he said. He ran a no-op binary 100 times and compared the average execution time (over N iterations of the test) of that on a few different systems: a kernel with audit configured out, a kernel with audit but no audit rules, one where audit was used to record execve() calls, and one where KRSI recorded the value of LD_PRELOAD. The first two were measured at a bit over 500µs (518 and 522), while the audit test with rules came in at 663µs (with a much wider distribution of values than any of the other tests). The rudimentary KRSI test clocked in at 543µs, which gave him reason to continue on; had it been a lot higher, he would have shelved the whole idea.
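Singh's actual harness was not shown; purely as an illustration of the shape of such a test (the file names are made up, and GNU date is assumed for nanosecond timestamps), one run of it might look like:

```sh
# Build a do-nothing binary, run it 100 times, and report the mean
# wall-clock time per execution in microseconds
printf 'int main(void) { return 0; }\n' > noop.c
cc -o noop noop.c

start=$(date +%s%N)
for i in $(seq 100); do ./noop; done
end=$(date +%s%N)
echo "mean per execution: $(( (end - start) / 100 / 1000 ))µs"
```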
There are plenty of things that are up for discussion, he said. Right now, KRSI uses the perf ring buffer to communicate with user space; it is fast and eBPF already has a helper to access it. But that ring buffer is a per-CPU buffer, so it uses more memory than required, especially for systems with a lot of CPUs. There is already talk of allowing eBPF programs to sleep, which would simplify KRSI and allow it to use less memory. Right now, the LSM hook needs to pin the memory for use by the eBPF program. He is hopeful that discussions in the BPF microconference at the Linux Plumbers Conference will make some progress on that.
As part of the Q&A, Landlock developer Mickaël Salaün spoke up to suggest working together. He went through the same thinking about alternative kernel facilities that Singh presented and believes that Landlock would integrate well with KRSI. Singh said that he was not fully up-to-speed on Landlock but was amenable to joining forces if the two are headed toward the same goals.
[I would like to thank LWN's travel sponsor, the Linux Foundation, for funding to travel to San Diego for LSS-NA.]
Change IDs for kernel patches
For all its faults, email has long proved to be an effective communication mechanism for kernel development. Similarly, Git is an effective tool for source-code management. But there is no real connection between the two, meaning that there is no straightforward way to connect a Git commit with the email discussions that led to its acceptance. Once a patch enters a repository, it transitions into a new form of existence and leaves its past life behind. Doug Anderson recently went to the ksummit-discuss list with a proposal to add Gerrit-style change IDs as a way of connecting the two lives of a kernel patch; the end result may not be quite what he was asking for.

The Gerrit code-review system needs to be able to track multiple versions of the same patch; to do so, it adds a Change-Id tag to the patches themselves:
Change-Id: I6a007dfe91ee1077a437963cf26d91370fdd9556
The tag is automatically added to the first version of a new patch; developers are expected to retain that tag when posting subsequent versions so that Gerrit can associate the new and old versions. These tags are useful for Gerrit, but they have never been welcome in the kernel community; Anderson posted his missive in the hope of changing that attitude and getting the community to allow (or actively encourage) the use of change IDs in patches:
The problem Anderson describes is real enough; your editor, who spends a lot of time digging up old versions of patch postings to work out how a patch has evolved over time, can attest to that. Guenter Roeck complained that he has to "use a combination of subject analysis and patch content analysis using fuzzy text / string comparison, combined with an analysis of the patch description" to determine whether a given patch has been merged. There seems to be little doubt that the community as a whole would appreciate a better way to associate a patch's history over time and its final resting place in the kernel repository. That is about where the agreement stops, though.
Linus Torvalds was quick to reject the idea of putting a bare change ID into patch changelogs, citing the same reasoning that has kept those IDs out thus far: they are really only useful to whoever put that ID into the changelog in the first place. Gerrit change IDs are useful to people who know which Gerrit instance is tracking the patch in question and who actually have access to that instance. For everybody else, it's just a number that adds extra noise to the changelog; as Torvalds put it: "A 'change ID' that I can't use to look anything up with is completely pointless and should be removed from kernel history". That assertion also implies, of course, that an ID that can be looked up by third parties might have some value.
One way to make a Gerrit change ID useful, he suggested, would be to turn it into a publicly accessible web link; then anybody could follow the link, see whatever other information exists, and track the history of the patch. Olof Johansson disliked that idea, saying that the Gerrit server could be shut down, making the link useless. Ted Ts'o responded that such a fate could befall any web link, including others (such as bugzilla links) that are accepted in changelogs now.
There may be other ways to solve this problem, though. The idea that Torvalds liked the best — and which seems to have the widest support across the community — is to use the unique ID that is already associated with a patch posting, which is the message ID created by the poster's email client:
Ta-daa - you have a "uuid" that is useful to others, and that describes the whole series unambiguously.
There are a few ways this ID could be presented, but the most popular way is to create a "Link:" tag containing a link to the posting of the patch in a public mailing-list archive server (generally lore.kernel.org in recent times). This is not a new practice; it appears to have first been used for this patch applied by H. Peter Anvin in 2011. Use of this tag is not universal, but it is growing; the number of patches in recent kernels carrying Link: tags is:
    Release    Link: tags    Percent
    4.18          1,413       10.6%
    4.19          1,944       13.8%
    4.20          1,609       11.6%
    5.0           1,778       13.9%
    5.1           1,908       14.6%
    5.2           2,295       16.4%
    5.3           2,614       18.4%
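A Link: tag is simply a trailer in the commit message pointing into the lore.kernel.org archive. As a rough sketch of how one might be added while applying an emailed patch (the mailbox file name and message ID here are invented, and this is not any particular maintainer's actual script):

```sh
# Pull the Message-ID out of the emailed patch, apply the patch, then
# append a Link: trailer to the resulting commit's message
msgid=$(grep -i -m1 '^Message-ID:' patch.mbox | sed -e 's/.*<//' -e 's/>.*//')
git am patch.mbox
git log -1 --format=%B \
    | git interpret-trailers --trailer "Link: https://lore.kernel.org/r/$msgid" \
    | git commit --amend -F -
```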
Creation of this tag is relatively easy; it can be entirely automated at the point where a patch is applied to a Git repository. But it doesn't solve the entire problem; it can associate a commit with the final posting of a patch on a mailing list, but it cannot help to find previous versions of a patch. Generally, the discussion of the last version of a patch is boring since there is usually a consensus at that point that it should be applied. It's the discussion of the previous versions that will have caused changes to be made and which can explain some of the decisions that were made. But kernel developers are remarkably and inexplicably poor at placing the message ID of the final version of a patch into the previous versions; doing so would, after all, require knowing the future.
The most commonly suggested solution to that problem is not fully automatic. Developers like Thomas Gleixner and Christian Brauner argued in favor of adding a link to previous versions of a patch when posting an updated version. Gleixner called for a link to the cover letter of the prior version, while Brauner puts links to all previous versions. Either way, an interested developer can follow the links backward to see how a patch series has changed, along with the discussions that led to those changes.
A convention like that would provide most or all of what developers like Anderson are asking for. It would, however, require that developers do some work to insert those links, and not everybody is convinced that this will ever happen. Dmitry Torokhov said that he could not be bothered:
Developers, he said, simply would not do the extra work to add links to previous postings to their cover letters. Anderson also asserted that "the adoption rate will be near to zero". Such concerns have merit; it is hard to force a community of thousands of developers to do more work for every patch they submit. But without their cooperation, this idea will not go far.
The answer, naturally, is to provide tools that make the right thing happen with a minimum of extra work. Gleixner described the setup he uses with Quilt, but it seems unlikely that all developers will find it useful for their purposes. Joel Fernandes described a tool that he is considering writing that might be more generally useful. Greg Kroah-Hartman described it as overly complex, though, and suggested simply posting patches as a reply to previous versions, but others pointed out that not all mailers would make that entirely easy to do.
Ts'o more-or-less ended the discussion by saying that it was time for interested developers to go off and implement a prototype of the tools they have in mind for this task. Then the code could be evaluated to see how it actually works. "Trying to pick something before people who actually have to use it day to day have had a chance to try it in real life is how CIO's end up picking Lotus Notes". That is where things stand now; the next step will come about when somebody comes forward with a tool that might provide a better solution to the problem. Until then, we'll have to continue to use fuzzy string comparisons and other tricks to track the history of patches in the repository.
Examining exFAT
Linux kernel developers like to get support for new features — such as filesystem types — merged quickly. In the case of the exFAT filesystem, that didn't happen; exFAT was created by Microsoft in 2006 for use in larger flash-storage cards, but there has never been support in the kernel for this filesystem. Microsoft's recent announcement that it wanted to get exFAT support into the mainline kernel would appear to have removed the largest obstacle to Linux exFAT support. But, as is so often the case, it seems that some challenges remain.

For years, the Linux community mostly ignored exFAT; it was a proprietary format overshadowed by an unpleasant patent cloud. A Linux driver existed, though, and was shipped as a proprietary module on various Android devices. In 2013, the code for this driver escaped into the wild and was posted to a GitHub repository. But that code was never actually released under a free license and the patent issues remained, so no serious effort to upstream it into the mainline kernel was ever made.
The situation stayed this way for some years. Even Microsoft's decision to join the Open Invention Network (OIN) in 2018 did not change the situation; exFAT, being outside the OIN Linux System Definition, was not covered by any new patent grants. Some people pointed this out at the time, but it didn't raise a lot of concern. Most people, it seemed, had simply forgotten about exFAT, which has a relatively limited deployment overall.
In July of this year, though, Valdis Klētnieks posted that he had "beaten into shape" the exFAT code and wondered how it might be upstreamed. The ensuing discussion made it clear that the patent issues were still a show-stopper for inclusion; that discussion also included a couple of pointed suggestions to the Microsoft employees on the list that perhaps they could help to change that situation. By all appearances, that prod started an internal discussion that ended with Microsoft agreeing to the addition of exFAT to the mainline kernel.
One never really knows what is going on in large companies. The exclusion of exFAT from Microsoft's commitment to OIN looked like a deliberate, old-time Microsoftian act, but it now looks likely that opening up exFAT is just one of those things that nobody thought about until it was brought to their attention.
Greg Kroah-Hartman wasted no time in taking Klētnieks's code and proposing it for addition to the staging tree for further work, of which it is said to need a fair amount. That drew an equally quick objection from Christoph Hellwig, who said it would be better to "just review the damn thing and get it into the proper tree". He is unhappy about how filesystems have been handled in the staging tree in the past, and mentioned the handling of the EROFS filesystem as a particular sore point. That sparked a whole subthread on the remaining concerns about EROFS that has little to do with exFAT.
The code quality of the exFAT implementation is of concern generally; that is the kind of thing that can be improved over time in the staging tree. But there are a couple of deeper issues that could yet prove to be a sticking point for exFAT. One is a complaint from Pali Rohár that the posted specification is incomplete. In particular, he said, the "TexFAT" extension is not documented. As Klētnieks pointed out, this extension seems to be used only by Windows CE, so it may prove to be a feature that the rest of the world can do without.
The bigger concern, perhaps, is that this filesystem module should not exist at all, so tweaking it will not help the situation. As Hellwig put it:
The right course, he said, is to just add the necessary support to the kernel's existing VFAT filesystem. Kroah-Hartman replied that he had tried to do that once "a few years ago" and concluded that it wouldn't work. But, he said, it may well be easier now that the specification has been posted. If exFAT support were to be reimplemented entirely, perhaps as part of the existing VFAT code, the staging version could simply be deleted once it outlived its usefulness. It would not be the first time such a thing had happened.
Hellwig is clearly not convinced that things will play out that way, but his concerns may not be enough to keep the exFAT code out of the staging tree. Whether that step is taken or not, though, there is clearly some work to be done before exFAT truly becomes a part of the mainline kernel. But, then, after thirteen years out in the cold, there is probably no point in being in a hurry to get full support upstream now.
CHAOSS project bringing order to open-source metrics
Providing meaningful metrics for open-source projects has long been a challenge, as simply measuring downloads, commits, or GitHub stars typically doesn't say much about the health or diversity of a project. It's a challenge the Linux Foundation's Community Health Analytics Open Source Software (CHAOSS) project is looking to help solve. At the 2019 Open Source Summit North America (OSSNA), Matt Germonprez, one of the founding members of CHAOSS, outlined what the group is currently doing and why its initial efforts didn't work out as expected.
Germonprez is an Associate Professor at the University of Nebraska at Omaha and helped to start CHAOSS, which was first announced at the 2017 OSSNA held in Los Angeles. When CHAOSS got started, he said, there was no bar as to what the project was interested in. "We developed a long list of metrics, they were really unfiltered and uncategorized, so it wasn't doing a lot of good for people," Germonprez admitted.
Learning from initial mistakes
![Matt Germonprez](https://static.lwn.net/images/2019/ossna-goggins-sm.jpg)
A number of lessons were learned by the CHAOSS project team members after the first year that have guided the project in the years since. Among the somewhat obvious lessons learned was that just collecting metrics related to open-source development and dumping them into one bucket is an approach that doesn't work, he said.
One area where there is a lot of interest in metrics is diversity and inclusion in open-source projects. Germonprez said that type of data isn't included in digital trace data, which is data that can be derived from a Git repository or even an email list. Rather, diversity and inclusion data is something that requires a researcher to go out and ask questions in order to get the required answers. "So one of the things we started to realize is that some metrics are easy to get, and some are more challenging to get, but that shouldn't preclude you from wanting to get those metrics," Germonprez said.
Perhaps even more confusing for the CHAOSS project was the realization that different people had different interpretations of what certain terms meant. For example, how different people defined "code commits" varied and, in general, he said that the concept of code contribution was understood in a number of ways. The CHAOSS project leaders came to realize that there was a need to standardize the way the project talked about metrics to make sure the terms are clearly articulated.
Fundamentally, though, the big takeaway from the initial CHAOSS metrics efforts was that there is more to learn; that listening to the community and asking for feedback, rather than just collecting metrics, is the right path forward. "We're a community that doesn't have all the answers, we really don't," Germonprez commented. "I think maybe some people thought we did and we were going to make this project and just provide software that you could push a button and say, green, it's all healthy. But that's not going to happen. So we spend a lot of our time listening."
CHAOSS in 2019
To help bring order to the (ahem) chaos of collecting metrics, CHAOSS now has five working groups, each of which represents an attempt to think about metrics in a more categorical way. The groups are: Diversity and Inclusion, which looks at participation; Evolution, which looks at how projects change over time; Risk, which is focused on metrics pertaining to risk factors when using open-source software; the Value working group, which looks at metrics for determining economic value; and, finally, the Common metrics working group, which combines the metrics from the others in different ways. As he put it: "Common is a working group that looks at metrics that may have kind of a cross-cutting interest in a variety of different working groups. So for example, Common is looking at organizational affiliation and that may be a metric that you care to look at with respect to Risk or Evolution."
On August 6 the project released the first version of its metrics in a 105-page document [PDF]. Germonprez explained that the rationale for publishing the metrics document was to help make open-source metrics consumable and deployable. The overall goal is to help understand what the pain points are for open-source projects and provide the metrics that represent the information that a project needs to be able to make decisions. "These are the first metrics that we're putting forward, to try to provide better transparency and actionability inside of your organization's projects," Germonprez said.
After the experience of the first few years of CHAOSS, he became convinced that most projects had little or no understanding of their own metrics, with no real indication of the project's health. The CHAOSS metrics are an attempt to move a project from having zero metrics to a starting point where it can figure out what's needed. "When we started we were just collecting metrics for metrics sake and we realized that was actually completely backwards," Germonprez admitted. "We weren't really taking any time to understand the goals and the questions and other metrics that address issues."
Each working group within CHAOSS has focus areas and defined goals. For example, under Diversity and Inclusion one focus area is on governance, with the goal of identifying how diverse and inclusive governance is for a given project. One of the metrics being used for that goal is to look at the code of conduct for the project and identify how it can be used to support diversity and inclusion. "This is not about necessarily doing software contributions, it's not necessarily about helping people do deployments out in the field, it's really just us saying, these are the goals that we're trying to achieve, these are the questions to address those goals, and these are the metrics that we'd like to see to address those goals," Germonprez said.
While the focus of Germonprez's talk was about the new metrics release, CHAOSS does in fact have several software projects as part of its portfolio. Grimoire Lab provides software development analytic capabilities to help collect and visually display data. The Augur project is a rapid prototyping tool for metrics. "It's one thing to come up with a metric. It's another thing to deploy the metric," Germonprez said.
It is somewhat ironic that an effort with the name CHAOSS really is all about bringing order to the myriad variables and data points that make up any open-source effort. The metrics effort is still a work in progress, but it does serve to lay the groundwork to help organizations and project developers think about what metrics are and how they can be used to help support larger goals. It will be interesting to see in the years ahead how the metrics project continues to mature and, perhaps more importantly, how, and if, projects find ways to gain full value from them.
Page editor: Jonathan Corbet
Inside this week's LWN.net Weekly Edition
- Briefs: iOS exploits; Linux LTS Spectre fix failure; Android 10; Rename Perl 6?; Quotes; ...
- Announcements: Newsletters; conferences; security updates; kernel patches; ...