
Leading items

Welcome to the LWN.net Weekly Edition for March 14, 2024

This edition contains the following feature content:

This week's edition also includes these inner pages:

  • Brief items: Brief news items from throughout the community.
  • Announcements: Newsletters, conferences, security updates, patches, and more.

Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.

Comments (none posted)

Questions about machine-learning models for Fedora

By Joe Brockmeier
March 13, 2024

Kaitlyn Abdo of Fedora's AI/ML SIG opened an issue with the Fedora Engineering Steering Committee (FESCo) recently that carried a few tricky questions about packaging machine-learning (ML) models for Fedora. Specifically, the SIG is looking for guidance on whether pre-trained weights for PyTorch constitute code or content. And, if the models are released under a license approved by the Open Source Initiative (OSI), does it matter what data the models were trained on? The issue was quickly tossed over to Fedora's legal mailing list and sparked an interesting discussion about how to handle these items, and a temporary path forward.

Defining terms

AI/ML is becoming, or at least feels, nearly omnipresent these days. However, the terminology may be confusing to those who are outside of that bubble, so Tim Flink helpfully supplied some definitions of terms when he carried the discussion from the FESCo ticket to Fedora legal.

To sum up Flink's definitions: an artificial neural network (ANN) is structured data "consisting of neurons (nodes containing some value) organized into layers with various connections between the neurons". A model describes a specific ANN: how its layers are configured, how the data is structured, and the learning algorithms that will be used when the model is trained on data to assign values (weights) to its nodes.

A model by itself is not of much use until it has been trained on data and has had weights assigned to it to provide the "exact value of how the connections affect flow through the network". As an example, one might want to use PyTorch's torchvision library, which offers models for tasks like image classification and object detection, to determine if there's a cat in a picture. One of the torchvision models might be suitable for this, but it would need to be trained on data (pictures with and without cats) before it would be able to do its job.

Users could do the training themselves but, as Flink noted, doing the training to create the weights "is a very expensive and time consuming process". So projects like PyTorch provide pre-trained weights—that is, models that are populated with training data—so that users can make use of the models without having to do the training, or even having access to the data used to train the model. For example, the PyTorch project offers torchvision models with and without weights.

Herein lies the perceived problem: while the weights and models may be offered under a license that is approved for Fedora, there's uncertainty about the data used to do the training. Copyright holders have been suing over the use of their data in training models, and the AI/ML SIG has hesitated to package weights due to those lawsuits and the questions about the input data used—even if the model is under a license accepted by Fedora.

In his first message in the thread, Flink asked whether weights are normal non-code content, or if they "require special handling" even if the upstream offers the models under a license "acceptable for non-code content in Fedora". Do packagers have any responsibility for reviewing the training data, in other words, or is the upstream's choice of license sufficient to make that call? So far, the SIG has erred on the side of caution and refrained from packaging weights, but that has its own drawbacks.

Supplying software like PyTorch without weights is a lot like supplying a spellchecker without the dictionary it needs as a word list. It would be easier for users if they didn't have to seek out the weights separately from installing the software—but the SIG wanted public guidance from FESCo (and then Fedora legal) before proceeding.

Are AI/ML weights special?

Richard Fontana, a member of Red Hat's legal team who advises Fedora on legal issues, wrote that he had recommended bumping the questions to FESCo because he thought FESCo might decide weights were "analogous to object code" that users must be able to modify. Fontana alluded to discussions being led by the OSI over the definition of "open-source AI" and said there is "definitely some sentiment among participants in that effort" that training data must be open because "this is necessary for users to exercise rights of modification." It would seem, however, that FESCo is satisfied with the idea that weights are content unless Fedora legal decides otherwise.

Fontana said that he was "struggling to see a justification" not to consider weights as content. However, he indicated that Fedora should be cautious, at least initially, and take weights "on a case-by-case basis" until Fedora has more experience with this type of content. Fontana said there may be circumstances when Fedora might not want to package the weights "given what is disclosed, or not disclosed, about how a model was trained". That was, he said, "unlikely, but not impossible".

In a followup message, Fontana elaborated on this and said that he wants to do further review "for any specific pre-trained weights that will actually be included in Fedora packages, for some initial period", since it would be "highly impractical" to expect packagers or package reviewers to do this type of review. Fontana said that if there are technical issues, such as "if there ought to be some standards around packaging of upstream pre-trained weights" he would not be able to give guidance "beyond my initial suggestion to raise this topic with FESCo which seems to have been unsuccessful".

FESCo member Neal Gompa replied:

With my FESCo hat on, the main question to answer is how we classify and identify them for package reviews, which is largely a Fedora Legal question. Personally, it's basically content to me, we do probably need some explicit documentation of this for the guidance that the AI/ML SIG can use to write packaging guidelines for FPC [Fedora Packaging Committee] to review.

What about build-time downloads?

Flink had also asked about scenarios where software packaged for Fedora might download weights upon first use. He used torchvision as the example of a library that is already packaged for Fedora as python-torchvision with the weights removed. If a user calls one of the models when using the torchvision library, such as the vit_b_16 model, then the weights are downloaded from a third-party site on first use. Flink noted that some of the weights are under licenses acceptable as content for Fedora, some are under Creative Commons licenses that are not acceptable for content packaged for Fedora, and others have no explicit license at all.

Those scenarios, Gompa suggested, were "in the same bucket" as Python's pip, Ruby's gem, and other software with package-manager functionality that downloads software from sources outside of Fedora's repositories. Fontana agreed, but Flink wrote that "the capabilities do overlap but in my opinion, the intended uses are different and that may be worth noting". He also explained that downloads happen "transparently to the user with no warning outside of a log message when the weights are first downloaded". This is a bit different than, say, installing Python software with pip because users have to explicitly run a pip install command to download software. Flink said he was not arguing against having this functionality, but wanted to be sure it was explained correctly.
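The transparent first-use download pattern Flink described might be sketched as follows. This is an illustration only, not torchvision's actual code; the names fetch_from_hub() and load_weights() are invented for the example.

```python
import os
import tempfile

def fetch_from_hub(name):
    # Stand-in for the real network download (hypothetical).
    return b"\x00" * 16

downloads = []

def load_weights(name, cache_dir):
    """Return cached weights, downloading transparently on first use."""
    path = os.path.join(cache_dir, name)
    if not os.path.exists(path):
        # The only indication the user gets is a log message.
        print(f"downloading weights for {name} ...")
        downloads.append(name)
        with open(path, "wb") as f:
            f.write(fetch_from_hub(name))
    with open(path, "rb") as f:
        return f.read()

with tempfile.TemporaryDirectory() as cache:
    load_weights("vit_b_16.bin", cache)
    load_weights("vit_b_16.bin", cache)   # served from cache, no download
```

The second call finds the file in the cache and never touches the network, which is exactly why a user might not notice the initial download at all.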

Fontana replied that he felt Flink was raising a more general issue that was not specific to pre-trained models. He noted that any Fedora package could download things with no warning to the user, and that he was unaware of any Fedora technical or packaging guidelines that address that issue. At this point, Gompa seemed impatient with the discussion, noting that lots of Fedora packages have similar functionality, and suggested that packagers "tweak pytorch to require configuration or make a prompt when it triggers the first time or something" if the packagers were concerned about that functionality.

Next steps

To date, no packages including pre-trained weights have been submitted for review. I followed up with Flink by email, and he confirmed plans to package weights for torchvision so that Fedora packages could be used alongside upstream PyTorch vision-related tutorials. It will be interesting to see what the case-by-case review looks like and what packages, if any, may be held back from Fedora due to concerns over the training data used.

Comments (20 posted)

Insecurity and Python pickles

By Daroc Alden
March 12, 2024

Serialization is the process of transforming Python objects into a sequence of bytes which can be used to recreate a copy of the object later — or on another machine. pickle is Python's native serialization module. It can store complex Python objects, making it an appealing prospect for moving data without having to write custom serialization code. For example, pickle is an integral component of several file formats used for machine learning. However, using pickle to deserialize untrusted files is a major security risk, because doing so can invoke arbitrary Python functions. Consequently, the machine-learning community is working to address the security issues caused by widespread use of pickle.
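Used on data that a program produced itself, pickle is exactly the convenience it was designed to be; a minimal round trip looks like this:

```python
import pickle

# Any picklable Python object round-trips through a byte string.
config = {"model": "example-net", "layers": [64, 128, 256], "pretrained": True}

data = pickle.dumps(config)      # serialize to bytes
restored = pickle.loads(data)    # deserialize a copy

assert restored == config
assert restored is not config    # a distinct copy, not the same object
```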

It has long been clear that pickle can be insecure. LWN covered a PyCon talk ten years ago that described the problems with the format, and the pickle documentation contains the following:

Warning:

The pickle module is not secure. Only unpickle data you trust.

It is possible to construct malicious pickle data which will execute arbitrary code during unpickling. Never unpickle data that could have come from an untrusted source, or that could have been tampered with.

That warning might give the impression that the creation of malicious pickles is difficult, or relies on exploiting flaws in the pickle module, but executing arbitrary code is actually a core part of pickle's design. pickle has supported running Python functions as part of deserializing a stored structure since 1997.

Objects are serialized by pickle as a list of opcodes, which will be executed by a custom virtual machine in order to deserialize the object. The pickle virtual machine is highly restricted, with no ability to execute conditionals or loops. However, it does have the ability to import Python modules and call Python functions, in order to support serializing classes.

When writing a Python class, the programmer can define a __reduce__() method that gives pickle the information it needs to store an instance of that class. __reduce__() returns a tuple of information needed to save and restore the object; the first element of the tuple is a callable — a function or a class object — that will be called in order to reconstitute the object. Only the names of callable objects are stored in the pickle, which is why pickle doesn't support serializing anonymous functions.
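A minimal sketch of that hook (the Temperature class here is invented for illustration):

```python
import pickle

class Temperature:
    """A class with a custom pickling protocol via __reduce__()."""
    def __init__(self, celsius):
        self.celsius = celsius

    def __reduce__(self):
        # First element: a callable (here, the class itself), stored by name.
        # Second element: the arguments to call it with on unpickling.
        return (Temperature, (self.celsius,))

t = pickle.loads(pickle.dumps(Temperature(21.5)))
assert t.celsius == 21.5

# Only the *name* of the callable ends up in the byte stream:
assert b"Temperature" in pickle.dumps(Temperature(21.5))
```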

The ability to customize the pickling of a class is the secret to pickle's ability to store such a wide variety of Python objects. For objects without special requirements, the default object.__reduce__() method — that just stores the object's instance variables — usually suffices. For objects that have more complicated requirements, having a hook available to customize pickle's behavior allows for the programmer to completely control how the object is serialized.

Limiting pickle to not support unnamed callable objects is a deliberate design choice with two advantages: allowing code upgrades, and decreasing the size of pickled objects, both before and after deserialization. The fact that pickle loads classes by name allows a programmer to serialize an object with a custom class, edit their program, and deserialize the object with the new semantics. This also ensures that unpickled objects don't come with an extra copy of their classes (and all the objects that those reference, etc.), which significantly reduces the amount of memory required to store many small unpickled objects.
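The load-by-name behavior is easy to demonstrate: an object serialized under one version of a class picks up whatever definition is current at load time. A small sketch (Greeter is an invented example class):

```python
import pickle

class Greeter:
    def greet(self):
        return "hello"

blob = pickle.dumps(Greeter())

# "Upgrade" the class after the object was serialized.
Greeter.greet = lambda self: "bonjour"

# Unpickling looks the class up by name, so the restored object
# uses the new method rather than a stale saved copy.
assert pickle.loads(blob).greet() == "bonjour"
```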

pickle does support restricting which named callables can be accessed during unpickling, but finding a set of functions to allow without introducing the potential to run arbitrary code can be surprisingly difficult. Python is a highly dynamic language, and Python code is often not written with the security of unpickling in mind — both because security is not a goal of the pickle module, and because programmers often don't need to think about pickling at all.
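The mechanism for restricting named callables is to override Unpickler.find_class(), as described in the pickle documentation's "Restricting Globals" section. A minimal allow-list sketch (the two permitted builtins here are arbitrary choices for the example):

```python
import builtins
import io
import pickle

class SafeUnpickler(pickle.Unpickler):
    """Only allow a short, explicit list of callables to be loaded."""
    ALLOWED = {("builtins", "set"), ("builtins", "frozenset")}

    def find_class(self, module, name):
        if (module, name) in self.ALLOWED:
            return getattr(builtins, name)
        raise pickle.UnpicklingError(f"forbidden global: {module}.{name}")

def safe_loads(data):
    return SafeUnpickler(io.BytesIO(data)).load()

# Allowed types round-trip normally.
assert safe_loads(pickle.dumps({1, 2, 3})) == {1, 2, 3}

# Anything else is rejected before it can be called.
blocked = False
try:
    safe_loads(pickle.dumps(1 + 2j))   # complex is not on the list
except pickle.UnpicklingError:
    blocked = True
assert blocked
```

The hard part, as the text notes, is not the mechanism but choosing an allow list that cannot be chained into arbitrary code execution.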

A malicious pickle

The pickle documentation gives this example of a malicious pickle:

    import pickle
    pickle.loads(b"cos\nsystem\n(S'echo hello world'\ntR.")

This pickle imports the os.system() function, and then calls it with "echo hello world" as an argument. This particular example is not terribly malicious; malware using this technique in the real world usually executes Python code to set up a reverse shell, or download and execute the next stage of the malware. The builtin pickletools module shows how this byte stream is interpreted as instructions for the pickle machine:
        0: c    GLOBAL     'os system'
       11: (    MARK
       12: S        STRING     'echo hello world'
       32: t        TUPLE      (MARK at 11)
       33: R    REDUCE
       34: .    STOP

GLOBAL is the instruction used to import functions and classes. REDUCE calls a function with the given arguments.
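That listing can be reproduced with pickletools directly, without running any of the opcodes:

```python
import io
import pickletools

payload = b"cos\nsystem\n(S'echo hello world'\ntR."

# Disassemble the byte stream; nothing is imported or called.
out = io.StringIO()
pickletools.dis(payload, out=out)
listing = out.getvalue()

assert "GLOBAL" in listing and "REDUCE" in listing
```

Static disassembly of this kind is also the basic approach taken by scanning tools that look for suspicious pickles.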

Widespread use

Because pickle is so convenient, it is used in many different applications. Programs that use pickle to send data to themselves — such as programs that use multiprocessing — mostly have little to worry about on the security front. But it is common, especially in the world of machine learning, to use pickle to share data between programs developed by different people.

There are several directories of machine-learning models, such as Hugging Face, PyTorch Hub, or TensorFlow Hub, that allow users to share the weights of pre-trained models. Since Python is a popular language for machine learning, many of these models are shared in the form of either raw pickle files, or other formats that have pickled components.

Security researchers have found models on the platforms that embed malware that is delivered via unpickling. Security company Trail of Bits recently announced an update to its LGPL-licensed tool — fickling — for detecting these kinds of payloads. Fickling disassembles pickle byte streams without executing them to produce a report about suspicious characteristics. It can also recognize polyglots — files that appear to use one file format, but can be interpreted as pickles by other software.

The machine-learning community is certainly aware of these problems. The fact that loading a model is insecure is noted in PyTorch's documentation. Hugging Face, EleutherAI, and Stability AI collaborated to design a new format — called safetensors — for securely sharing machine-learning models. Safetensors files use a JSON header to describe the contained data: the shape of each layer of the model, the numeric format used for the weights, etc. After the header, a safetensors file includes a flat byte-buffer containing the packed weights. Safetensors files can only store model weights without any associated code, making it a much simpler format. The safetensors file format has been audited (also by Trail of Bits), suggesting that it might prove to be a secure alternative.
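The container layout is simple enough to sketch with the standard library alone. The following is an illustration loosely modeled on the published safetensors layout (an eight-byte little-endian header length, a JSON header, then one flat byte buffer of packed weights); it is not the official implementation:

```python
import json
import struct

def write_safetensors_like(tensors):
    """Pack named lists of floats into a safetensors-style container."""
    header, buf, offset = {}, b"", 0
    for name, values in tensors.items():
        raw = struct.pack(f"<{len(values)}f", *values)   # packed float32
        header[name] = {"dtype": "F32", "shape": [len(values)],
                        "data_offsets": [offset, offset + len(raw)]}
        buf += raw
        offset += len(raw)
    hjson = json.dumps(header).encode()
    # 8-byte little-endian header size, JSON header, then the byte buffer.
    return struct.pack("<Q", len(hjson)) + hjson + buf

def read_safetensors_like(data):
    (hlen,) = struct.unpack_from("<Q", data)
    header = json.loads(data[8:8 + hlen])
    body = data[8 + hlen:]
    out = {}
    for name, meta in header.items():
        start, end = meta["data_offsets"]
        count = (end - start) // 4
        out[name] = list(struct.unpack(f"<{count}f", body[start:end]))
    return out

blob = write_safetensors_like({"layer0.weight": [0.5, -1.25, 2.0]})
assert read_safetensors_like(blob)["layer0.weight"] == [0.5, -1.25, 2.0]
```

Because the reader only parses JSON and slices bytes, there is no opcode machine and nothing to execute — which is the point of the format.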

Even with safetensors becoming the default format for saving models in several libraries, there are still many older pickle-based models in regular use. As with any transition to a new technology, it seems likely that there will be a long tail of pickle-based models.

Hugging Face has started including security warnings on files that contain pickle data, but this information is only visible if users click through to view the files associated with a model, not if they only look at the "model card". Other sources of machine-learning models, such as PyTorch Hub and TensorFlow Hub, merely host pointers to weights stored elsewhere, and therefore do not do even that small check.

pickle's compatibility with many kinds of Python objects and its presence in the standard library make it an attractive choice for developers wishing to quickly share Python objects between programs. Using pickle within a single application can be a good way to simplify communication of complex objects. Despite this, using pickle outside of its specific use case is dangerously insecure. Once pickle has made its way into an ecosystem, it can be difficult to remove, since any alternative will have a hard time providing the same flexibility and ease of use.

Comments (36 posted)

Untangling the Open Collectives

By Joe Brockmeier
March 8, 2024

Name collisions aren't just a problem for software development—organizations, projects, and software that have the same or similar names can cause serious confusion. That was certainly the case on February 28 when the Open Collective Foundation (OCF) began to notify its hosted projects that it would be shutting down by the end of 2024. The announcement surprised projects hosted with OCF, as one might expect. It also worried and confused users of the Open Collective software platform from Open Collective, Inc. (OCI), as well as organizations hosted by the Open Source Collective (OSC) and Open Collective Europe (OC Europe). There is enough confusion about the names, relationships between the organizations, and impact on projects like Flatpak, Homebrew, and htop hosted by OCF, that a deeper look is warranted.

What's in a name?

The Open Collective story starts with Open Collective, Inc. a US-based for-profit company. It was founded in 2015 by Xavier Damman, who was joined by Pia Mancini and Aseem Sood in 2016. Sood left in 2018, Damman is now listed as an advisor on the team page, and Mancini is the current CEO.

OCI develops and supports the Open Collective platform that allows groups to take donations, manage money, provide a public page of transactions, publish updates, host events, and more. Groups, called "collectives" in OCI parlance, might be local mutual-aid projects, political organizations, user groups, open-source projects, or some other type of grassroots organization that seeks to raise money for its cause.

The Open Collective platform is OCI's Software-as-a-Service offering, providing all the tools a group needs to take donations, manage funds, and provide a public accounting of its budget. OCI promises that it will never sell its groups' data or lock them in, and it provides the software for the hosted platform under the MIT license on GitHub. In theory, this means anyone can stand up their own version of the Open Collective platform; in practice, it is unclear that anyone has done so, or that there is any real effort toward supporting self-hosting. A setup guide for developers is available here. The developer documentation contains a section on deployment, but it is specific to the Open Collective deployment on Heroku. Deploying the software to other platforms to become independent of OCI seems to be an exercise left to the reader.

The Open Collective hosted platform makes it simple for groups to take donations without having to stand up their own payment infrastructure. But it does leave a large gap that needs to be filled: namely, having a legal entity with a bank account that can receive money and deal with one of life's inevitabilities—taxes. So OCI's founders helped create OCF, OSC, and OC Europe as fiscal hosts. The fiscal host is the legal entity that holds groups' funds and is responsible for compliance, accounting, paying expenses, and filing taxes. The goal is for the fiscal host to take on all of those responsibilities so that hosted groups can focus on their mission, whatever it may be.

OCF, the host that is shutting down, is a 501(c)(3) non-profit for collectives in the US. The 501(c)(3) designation matters for a few reasons: it means that the organization is recognized as tax-exempt in the US and donations can be tax-deductible. The 501(c)(3) status does limit the types of activities the organization can conduct (for example, it places restrictions on political lobbying) and its activities should not benefit private interests. It also, in this case, means there are restrictions for OCF on transferring money to other organizations as it dissolves.

The OCF charter included, but was not limited to, open-source software projects. OCF also hosts mutual-aid groups, civic technology organizations, arts and culture groups, and more. In fact, only 57 out of 544 groups under OCF's umbrella are categorized as open-source. These include OpenMined, Homebrew, Flatpak, and many regional meetup and user groups.

OSC is a 501(c)(6) non-profit, a designation used for trade organizations. Donations to a 501(c)(6) are not tax-deductible, but the organization is tax-exempt. It is still a non-profit, but it can engage in activities that benefit for-profit businesses. The Linux Foundation is a good example of this: it serves the interests of the larger Linux community, and that includes many for-profit entities. OSC serves collectives around the world, and is exclusively focused on the open-source software ecosystem: software projects, meetup groups, events, advocacy efforts, and research related to open source. OSC's web page states that open-source projects "with at least 100 stars on GitHub and at least two contributors" are likely to be immediately approved for acceptance.

OC Europe, a non-profit registered in Belgium, is similar to OCF in its mission. It has been certified 501(c)(3) equivalent by NGOsource, so donations to OC Europe from the US can be considered tax-deductible. It hosts initiatives focused on "the sustainability of the social and solidarity economy as well as open source technologies". It lists open-source projects like F-Droid, EndeavourOS, Manjaro ARM, and a number of other related projects or efforts. It also hosts collectives with little to no connection to open-source software.

Note that there are two more fiscal hosts listed with Open Collective naming. Open Collective Brussels was absorbed by OC Europe in 2022. Open Collective NZ was founded by Alanna Irving in 2020. Irving also held a role as executive director of OCF until October 2023. Open Collective NZ has not released a statement on its blog about the shutdown, nor any other update since 2022.

The Open Collective platform lists more than 1,600 fiscal hosts, such as Women Who Code and All For Climate, but those are more easily distinguished from the hosts and platforms sharing some variant of the "Open Collective" naming.

The OCF surprise shutdown

The organizations with Open Collective in their name share more than a name: at least one person has been active in a leadership role across multiple organizations. Mancini, in addition to being co-founder and current CEO of OCI, was listed as part of the OCF board of directors on its website through September 2023, and is still listed as part of the OC Europe team.

Despite Mancini's recent tenure on OCF's board, news of the shutdown was met with surprise in statements from OCI, OSC, and OC Europe. It was definitely a surprise to the hosted collectives. Before OCF posted anything publicly, it sent out a notice to hosted groups, which was shared by Daniel Lange of the htop project, stating that its board of directors had made the decision to dissolve OCF by December 31, 2024. Several comments on Lange's post indicated disappointment and surprise with the shutdown, and Lange wrote "this all seems rushed and badly coordinated (if at all)".

According to the notice sent to the groups, the OCF experienced rapid growth after being founded by OCI due to the increased demand following the COVID-19 pandemic "without taking the time to establish the appropriate systems and infrastructure to sustain that growth." It also stated that the OCF business model "is not sustainable with the number of complex services we have offered and the fees we pay to the Open Collective Inc. tech platform". In an update on March 6, OCF program officer Mike Strode said that OCF "spent several months in late 2023-early 2024 exploring alternative options to address the numerous challenges in the organization". He also said that the decision was announced "as soon as possible from a legal standpoint."

The OCI's statement said the decision to dissolve and its announcement "came as a surprise to us", and that OCI was "still grappling with the realities of this decision, which has sent subsequent shock waves through the Open Collective ecosystem and community". The statement also emphasized that all of the entities are separate, and thus OCF made its decision "completely independently". OCI said that OCF had "made us aware of increased technological needs, to which we have been responding with new features and other forms of support", but despite those efforts, OCF hasn't found a path forward.

OSC's executive director Lauren Gardner wrote that OSC is "extremely sad to hear the news and are still processing it", but reassured its collectives that it is not affected by OCF's shutdown. Jean-François De Hertogh, co-founder and CEO of OC Europe, wrote that OC Europe is independent of OCF and "mobilizing to provide solutions" for collectives that might want to work with OC Europe.

In retrospect, there were external signs that OCF might be in trouble before announcing it intended to dissolve. In March 2023 Irving wrote about delays in processing expenses due to "incredible growth in recent times" but noted that OCF was hiring an admin to help with processing expenses. OCF also stopped issuing new virtual credit cards to groups to be used for spending funds. Irving said this was because OCF needed to review its "related processes and policies" about the cards.

In September 2023, OCF paused new applications through February 2024. The blog post said that it was working to accommodate "exponential growth" it had experienced, and was pausing new applications "to move with intentionality to serve our current and future collectives better". On January 30 Strode posted an update in response to a question about reopening applications: "We do not plan to reopen our application process and I encourage you to pursue other options for fiscal sponsorship."

OCF may have experienced "exponential growth" in the number of collectives it took on, but overall donations seem to have taken a dip in 2023. According to its public page, it took in about $2 million less in 2023 than in 2022. The chart shows OCF's collectives raised $23,189,253 for 2022, and $21,267,753 for 2023, for the "all categories" tag. Note that the amounts listed on the Open Collective platform and the tax filings differ significantly. According to the public tax filing for 2022, OCF took in more than $27 million after raising $8.8 million in its 2021 tax year. This may be due to money donated outside the platform to collectives hosted by OCF, say by check. The public filing doesn't have obvious red flags like excessive officer compensation that might explain the sudden shutdown. The spend on OCI for 2022 was $641,452, a little more than 2% of what OCF took in for 2022, which does not seem like enough to make OCF's business model "unsustainable".

The bulk of that money is earmarked for the collectives hosted by OCF, so its budget is substantially smaller than what it takes in. Its host-fee structure is tiered by type of donation and amount raised; OCF would take up to 8% of donations. What OCF spends its money on is supposed to be "completely transparent" via the platform, but the actual implementation leaves much to be desired. For the past year, nearly $1.2 million in expenses is listed as "no tag", which is more opaque than one might hope. If OCF was expecting revenue to continue to grow in 2023, it may have increased expenses in ways that proved awkward when donations declined instead. It will be interesting to see the public filings for 2023 and 2024 when they are available.

To the lifeboats

OCF's explanation for the shutdown was short on facts, but it has provided a detailed FAQ about the shutdown process. Organizations hosted with OCF have until March 15 to receive new funds from donors. After that, all the donation methods (e.g. ACH, Stripe, and PayPal) will be shut down for collectives and recurring donations canceled. For collectives with employees, the FAQ states that they must be laid off by June 30 "at the latest". Organizations can continue to spend money or transfer money to a qualifying host until September 30.

Because of OCF's 501(c)(3) status, it can only transfer funds to another host that also has qualified as a 501(c)(3). Collectives will need to provide proof of this status along with a list of grants and "a signed letter from the grantor for their release to the new entity". Funds that aren't spent by September 30 will be moved to "another qualified fiscal sponsor to support their legal or financial compliance infrastructure". The exact fiscal sponsor (or sponsors) that might receive unspent funds is not specified.

Shane Curcuru, currently a member of the Apache Software Foundation's board of directors, has posted an explainer about the situation along with a few suggestions of alternative 501(c)(3) hosts for open-source projects. Curcuru includes the Software Freedom Conservancy, Software in the Public Interest, HackClub, and others as alternatives to OCF. OCI has also shared resources in its announcement, and promised to provide a list of fiscal hosts on the Open Collective platform along with "tools and features to help with the transition".

Harder than it looks

It is disappointing, but not entirely surprising, that OCF has failed. Funding open-source projects, or public-good groups in general, is far from a solved problem. Depending on the kindness of strangers for funding is tricky business, especially in the unprecedented times we're all living through. As OCF winds down, one hopes it will share more detail on what went awry and what it might have done differently. It will be interesting to see where the collectives it will no longer host disperse to, and what types of models prove more successful in the long run.

Comments (14 posted)

Better linked-list traversal in BPF

By Jonathan Corbet
March 8, 2024

Before loading a BPF program, the kernel must verify that the program is safe to run; among other things, that verification includes ensuring that the program will terminate within a bounded time. That requirement has long made writing loops in BPF a challenging task. The situation has improved over the years for some types of loops, but others — including linked-list traversal — are still awkward in BPF programs. A new set of BPF primitives aims to make life easier for this use case through the installation of what can be seen as a sort of circuit breaker.

Even relatively simple loops can be hard for the verifier to handle. To the human eye, a loop like this looks safe:

    for (i = 1; i < 10; i++)
    	do_something(i);

It can be hard, though, for the verifier (which is dealing with lower-level code for the BPF virtual machine) to know that nothing will reset the value of the iteration variable in the loop; without that assurance, it cannot verify that the loop will terminate as expected. Over the years, a number of helpers have been added to make this kind of iteration easier; they include the bpf_loop() function and generic iterators. This sort of bounded iteration is now relatively easy to do in BPF programs.

If one is iterating through a linked list, though, there is no loop variable that can bound the number of times the loop will run. There is no way for the verifier to know about the properties of a list that a program would like to traverse. If the list is circular, traversal could go forever. That prospect makes the verifier grumpy, forcing developers to engage in workarounds that make them even grumpier. When Alexei Starovoitov recently proposed a solution to this problem, he provided an example of the code needed (in current kernels) to go through a list stored in a BPF arena:

    for (struct bpf_iter_num ___it __attribute__((aligned(8),
                                                  cleanup(bpf_iter_num_destroy))),
		* ___tmp = (bpf_iter_num_new(&___it, 0, (1000000)),
                    	pos = list_entry_safe((head)->first,
                                              typeof(*(pos)), member),
	                (void)bpf_iter_num_destroy,
		     	(void *)0);
	bpf_iter_num_next(&___it) && pos &&
            ({ ___tmp = (void *)pos->member.next; 1; });
        pos = list_entry_safe((void __arena *)___tmp, typeof(*(pos)), member))

Briefly, this construct creates a new generic iterator (the bpf_iter_num_new() call) set for a maximum of 1,000,000 iterations. The bpf_iter_num_next() call increments that iterator and forces an exit from the loop if it goes too high. The iterator is never expected to reach anything close to the maximum value; it exists only to reassure the verifier that something will force the loop to end at some point.

One might fairly conclude that this code is not pleasant to write — and even less pleasant to try to understand. But, as Starovoitov put it: "Unfortunately every 'for' in normal C code needs an equivalent monster macro". He initially proposed a solution (a function called bpf_can_loop()), but the shape of that solution changed fairly quickly.

As of the v6 patch set, the first step is to create a bit of infrastructure in the form of a new BPF instruction called may_goto. This instruction has some interesting semantics. If the kernel sees a may_goto instruction in a code block, it will automatically reserve space for an iteration count on the stack. Each execution of may_goto increments that count and compares it to a kernel-defined maximum; if that maximum is exceeded, a goto will be executed to a point just far enough ahead to insert another goto.

This instruction is used to create a macro called cond_break that turns into BPF code like this:

                 may_goto l_break;
                 goto l_continue;
    l_break:     break;
    l_continue:  ;

In words: the macro normally uses may_goto to cause (by way of a bit of a goto dance) a break to be executed when the loop count is exceeded. This macro could, in turn, be used in this sort of loop:

    for (ptr = first_item; ptr; ptr = ptr->next)
    {
        do_something_with(ptr);
	cond_break;
    }

The presence of cond_break (which uses may_goto) in the loop causes stack space to be set aside for an iteration count; the maximum is set to BPF_MAX_LOOPS, which is defined as 8*1024*1024 in current kernels. Each execution of cond_break checks the iteration count and forces an exit from the loop if the maximum is exceeded.

Should that forced exit ever happen, chances are good that something is going wrong. Either some sort of out-of-control loop has been created, or the list to process is too long and the traversal will not be completed as expected. But, again, in real programs, exceeding the loop count is not expected to ever happen. It exists only as a sort of circuit breaker to reassure the verifier that the loop is safe to run. Or, as Starovoitov put it:

In other words "cond_break" is a contract between the verifier and the program. The verifier allows the program to loop assuming it's behaving well, but reserves the right to terminate it. So [a] bpf author can assume that cond_break is a nop if their program is well formed.

The promise of the BPF verifier — that it would be able to guarantee that BPF programs cannot harm the kernel — was always going to be hard to achieve without imposing significant limitations on developers. Much of the work on BPF over the years has been aimed at lifting some of those limitations, which have only become more onerous as the complexity of BPF programs has increased. As awkward as the new features may seem, they are less so than what came before.

Still, there is room for improvement. Starovoitov said that relying on loop counts was not the best approach, and that "the actual limit of BPF_MAX_LOOPS is a random number"; he suggested that the kernel may eventually implement a watchdog timer to simply interrupt programs that run for too long. That might remove some of the awkwardness, but would have some interesting implications; BPF programs are not written with the idea that they could be interrupted at an arbitrary point. Addressing that could take a while; in the meantime, there is cond_break. There do not seem to be objections to the changes, and the patch set has been merged into the bpf-next repository, so cond_break seems likely to show up in the mainline during the 6.9 merge window.

Comments (18 posted)

A new filesystem for pidfds

By Jonathan Corbet
March 13, 2024
The pidfd abstraction is a Linux-specific way of referring to processes that avoids the race conditions inherent in Unix process ID numbers. Since a pidfd is a file descriptor, it needs a filesystem to implement the usual operations performed on files. As the use of pidfds has grown, they have stressed the limits of the simple filesystem that was created for them. Christian Brauner has created a new filesystem for pidfds that seems likely to debut in the 6.9 kernel, but it ran into a little bump along the way, demonstrating that things you cannot see can still hurt you.

In this case, the pidfd filesystem is indeed invisible; it cannot be mounted and accessed like most other filesystems. A pidfd is created with a system call like pidfd_open() or clone3(), so there is no need for a visible filesystem. (One could imagine such a filesystem as a way of showing all of the existing processes in the system, but /proc already exists for that purpose). Since there was no need to implement many of the usual filesystem operations, pidfds were implemented using anon_inode_getfile(), a helper that creates file descriptors for simple, virtual filesystems. Over time, though, this filesystem has proved to be a bit too simple, leading to Brauner's pidfdfs proposal as a replacement.

So what was the problem with the anonymous-inode approach? Brauner provides a list of capabilities added by pidfdfs in the changelog to this patch. It allows system calls like statx() to be used on a pidfd, for example, and that, in turn, allows for direct comparison of two pidfds to see whether they refer to the same process. While not implemented yet, pidfdfs will enable functionality like automatically killing a process when the last pidfd referring to it is closed. The initial version of the series also used dentry_open() to set up the "file" behind the pidfd; that brought the opening of the pidfd under the control of Linux security modules and made the user-space file-notification system calls work with them as well.

The patch series subsequently had to evolve considerably, though. Linus Torvalds was not entirely happy with how it had been implemented, even though much of that implementation was borrowed from the existing namespace filesystem in the kernel. Some significant reworking followed, resulting in a cleaner implementation that Torvalds described as "quite nice".

That was not the end of the story, though. Nathan Chancellor reported that, with pidfdfs in the kernel, many services on his system failed at boot time; Heiko Carstens ran into similar problems. It turns out that, while users may or may not appreciate the robustness of race-free process management, they are, without exception, unimpressed by a system that lacks functional networking. So Brauner had to go looking for an explanation.

It seems, though, that he already knew where to look when "something fails for completely inexplicable reasons": the SELinux security module. As noted above, one of the advantages of the new filesystem is that it exposed pidfd operations to security modules, which is something that the policy maintainers had requested. The downside is that it exposed those operations to security modules, one of which promptly set about denying them.

There was, as Brauner later described, a bit of a cascade of failures here. SELinux started seeing events on a new type of file descriptor that it had no policy for; following fairly normal security practice, it responded by denying everything, causing attempts to work with pidfds to fail. The dbus-broker process, on seeing these failures, decided to just throw up its virtual hands and let the system collapse into a smoldering heap. This is somewhat ironic given that, as Brauner pointed out, that process has a PID-using fallback path that it uses on kernels that do not support pidfds at all, but it didn't use that path here. So, to truly fix this problem, there needs to be both an SELinux policy update and a D-Bus fix; patches for both have already been prepared and submitted.

Even then, though, there was the little problem that some systems may get a new kernel before the above fixes arrive. The same users who have proved so strangely intolerant of broken networking are likely to also be slow to accept the idea that networking will only come back once their user-space code has been fixed and updated. Beyond that, Torvalds didn't like the idea that the internal filesystem change somehow caused the resulting descriptors to behave differently in user space, and requested that something better be done.

After a bit of discussion, Brauner found a solution. Rather than call dentry_open(), the new filesystem sets up the new file descriptor directly, using lower-level operations, and without invoking the problematic security hook. The people in charge of security modules still want to be able to intervene in pidfd creation, of course; that will be accommodated by adding a new security hook for that case. Once SELinux (or any other security module) is ready to make decisions about pidfds, it can use the new hook; until then, things will work as they did before. Torvalds liked this approach: "This is how new features go in: they act like the old ones, but have expanded capabilities that you can expose for people who want to use them".

With those changes, it would appear that the roadblocks to the addition of pidfdfs have been overcome. The code is in linux-next now, and will probably find its way to the mainline for the 6.9 release. Most users will, if all goes according to plan, never notice that anything has changed.

Comments (13 posted)

Development statistics for 6.8

By Jonathan Corbet
March 11, 2024
The 6.8 kernel was released on March 10 after a typical, nine-week development cycle. Over this time, 1,938 developers contributed 14,405 non-merge changesets, making 6.8 into a slower cycle than 6.7 (but busier than 6.6), with the lowest number of developers participating since the 6.5 release. Still, there was a lot going on during this cycle; read on for some of the details.

Of the developers contributing to 6.8, 245 appeared for the first time. The most active developers in this cycle were:

Most active 6.8 developers

By changesets:

  Uwe Kleine-König            368  (2.6%)
  Kent Overstreet             317  (2.2%)
  Lucas De Marchi             189  (1.3%)
  Krzysztof Kozlowski         182  (1.3%)
  Dmitry Baryshkov            148  (1.0%)
  Matt Roper                  135  (0.9%)
  Andy Shevchenko             133  (0.9%)
  Andrii Nakryiko             129  (0.9%)
  Matthew Brost               115  (0.8%)
  Matthew Wilcox              113  (0.8%)
  David Howells               108  (0.7%)
  Arnd Bergmann               104  (0.7%)
  Matthew Auld                102  (0.7%)
  Randy Dunlap                102  (0.7%)
  Jakub Kicinski               94  (0.7%)
  Neil Armstrong               90  (0.6%)
  Alexander Viro               90  (0.6%)
  Thomas Zimmermann            83  (0.6%)
  Christoph Hellwig            80  (0.6%)
  Konrad Dybcio                79  (0.5%)

By changed lines:

  Arnd Bergmann             59205  (7.3%)
  Matthew Brost             46142  (5.7%)
  Jakub Kicinski            37553  (4.6%)
  Sarah Walker              29771  (3.7%)
  Neil Armstrong            21336  (2.6%)
  Rajendra Nayak            16235  (2.0%)
  Thomas Zimmermann         14881  (1.8%)
  Andrii Nakryiko           12938  (1.6%)
  Kent Overstreet           12617  (1.6%)
  Darrick J. Wong           12403  (1.5%)
  David Howells             10224  (1.3%)
  Nas Chung                 10207  (1.3%)
  Ping-Ke Shih               8007  (1.0%)
  Shinas Rasheed             8006  (1.0%)
  Dmitry Safonov             7938  (1.0%)
  Lucas De Marchi            7324  (0.9%)
  Vlastimil Babka            5377  (0.7%)
  Peter Griffin              5263  (0.7%)
  Donald Robson              4911  (0.6%)
  Dmitry Baryshkov           4873  (0.6%)

In the changesets column, Uwe Kleine-König once again ends up on top, mostly for ongoing work refactoring platform drivers. Kent Overstreet is not far behind, though, as he works to stabilize bcachefs (and also did a bit of include-file rationalization). Lucas De Marchi worked on the new Intel Xe graphics driver, Krzysztof Kozlowski worked mostly with devicetree files, and Dmitry Baryshkov worked extensively with Qualcomm drivers.

Arnd Bergmann, as usual, worked all over the kernel tree; he landed at the top of the "changed lines" column by removing a number of old and unloved WiFi drivers. Matthew Brost did a lot of work with the Xe driver. Jakub Kicinski removed a bunch of machine-generated, netlink-related code, Sarah Walker added the PowerVR/IMG GPU driver, and Neil Armstrong added a number of Qualcomm clock-controller drivers.

The top testers and reviewers this time around were:

Test and review credits in 6.8

Tested-by:

  Daniel Wheeler              151  (14.0%)
  Pucha Himasekhar Reddy       51   (4.7%)
  Hyeonggon Yoo                31   (2.9%)
  Fuad Tabba                   23   (2.1%)
  Arnaldo Carvalho de Melo     23   (2.1%)
  Philipp Hortmann             23   (2.1%)
  David Rientjes               21   (1.9%)
  Andrew Halaney               19   (1.8%)
  Hans de Goede                16   (1.5%)
  Jeremi Piotrowski            16   (1.5%)
  Randy Dunlap                 15   (1.4%)
  Neil Armstrong               14   (1.3%)

Reviewed-by:

  Krzysztof Kozlowski         226   (2.4%)
  Matt Roper                  214   (2.3%)
  Simon Horman                210   (2.2%)
  Konrad Dybcio               204   (2.1%)
  Christoph Hellwig           195   (2.1%)
  Matthew Brost               189   (2.0%)
  Rodrigo Vivi                176   (1.9%)
  Lucas De Marchi             167   (1.8%)
  Dmitry Baryshkov            121   (1.3%)
  AngeloGioacchino Del Regno  115   (1.2%)
  Hans de Goede               107   (1.1%)
  Jan Kara                     98   (1.0%)

As usual, Daniel Wheeler tests many of the driver patches coming out of AMD, while Pucha Himasekhar Reddy performs a similar function within Intel. Hyeonggon Yoo, instead, has made a habit of testing memory-management patches coming from a number of developers. On the review side, Krzysztof Kozlowski reviewed large numbers of devicetree patches; Matt Roper's review was focused mostly on Xe patches. Konrad Dybcio also reviewed devicetree patches, Simon Horman worked in the networking subsystem, and Christoph Hellwig looked at a lot of block-layer patches.

Looking at Signed-off-by tags applied by developers other than the author of a patch reveals who handled the patch after it was posted; it shows who the first-level maintainers are. In 6.8, the pattern of non-author signoffs was a bit different than usual:

Non-author signoffs in 6.8

Individuals:

  Rodrigo Vivi                955   (7.1%)
  Greg Kroah-Hartman          797   (6.0%)
  Andrew Morton               601   (4.5%)
  Jakub Kicinski              565   (4.2%)
  David S. Miller             537   (4.0%)
  Alex Deucher                500   (3.7%)
  Mark Brown                  485   (3.6%)
  Bjorn Andersson             443   (3.3%)
  Alexei Starovoitov          279   (2.1%)
  Hans Verkuil                260   (1.9%)
  Kalle Valo                  214   (1.6%)
  Arnaldo Carvalho de Melo    177   (1.3%)
  Martin K. Petersen          164   (1.2%)
  Paolo Abeni                 153   (1.1%)
  Takashi Iwai                149   (1.1%)
  Herbert Xu                  147   (1.1%)
  Shawn Guo                   131   (1.0%)
  Palmer Dabbelt              130   (1.0%)
  Mauro Carvalho Chehab       128   (1.0%)
  Vinod Koul                  127   (0.9%)

Employers:

  Intel                      2260  (16.9%)
  Red Hat                    1434  (10.7%)
  Linaro                     1337  (10.0%)
  Google                     1326   (9.9%)
  Meta                       1048   (7.8%)
  Linux Foundation            860   (6.4%)
  Qualcomm                    740   (5.5%)
  AMD                         653   (4.9%)
  SUSE                        409   (3.1%)
  (Unknown)                   377   (2.8%)
  (None)                      278   (2.1%)
  Cisco                       260   (1.9%)
  NVIDIA                      249   (1.9%)
  Oracle                      203   (1.5%)
  Microsoft                   200   (1.5%)
  Huawei Technologies         171   (1.3%)
  IBM                         150   (1.1%)
  Collabora                   147   (1.1%)
  Rivos                       132   (1.0%)
  (Consultant)                124   (0.9%)

Rodrigo Vivi is not a name that comes quickly to mind when thinking about kernel maintainers (even for those of us who think about such things). In what is getting to be a common theme, the reason for his presence at the top of this list is that he is the maintainer who manages patches for the new Xe driver. Other than that, the busiest maintainers are the usual crowd that one would expect to see on that list. The Xe work also put Intel at the top of the signoffs list — though the Xe patches account for less than half of the total handled by Intel maintainers.

As has been the case for years, over half of the patches going into the kernel pass through the hands of developers working for just five companies.

Speaking of companies, 219 companies were identified as supporting work on the 6.8 kernel; the most active of those were:

Most active 6.8 employers

By changesets:

  Intel                      2527  (17.5%)
  (Unknown)                  1087   (7.5%)
  Linaro                     1084   (7.5%)
  Google                      878   (6.1%)
  Red Hat                     871   (6.0%)
  (None)                      757   (5.3%)
  AMD                         657   (4.6%)
  Pengutronix                 416   (2.9%)
  SUSE                        372   (2.6%)
  Meta                        368   (2.6%)
  Oracle                      346   (2.4%)
  NVIDIA                      266   (1.8%)
  Qualcomm                    261   (1.8%)
  Huawei Technologies         237   (1.6%)
  IBM                         224   (1.6%)
  Collabora                   167   (1.2%)
  Broadcom                    142   (1.0%)
  Arm                         141   (1.0%)
  Bootlin                     135   (0.9%)
  Renesas Electronics         132   (0.9%)

By lines changed:

  Intel                    151009  (18.7%)
  Linaro                   115647  (14.3%)
  Meta                      59065   (7.3%)
  (Unknown)                 52084   (6.4%)
  Red Hat                   43378   (5.4%)
  Imagination Technologies  34692   (4.3%)
  Qualcomm                  30115   (3.7%)
  SUSE                      27574   (3.4%)
  (None)                    22259   (2.8%)
  Google                    22067   (2.7%)
  AMD                       21853   (2.7%)
  Oracle                    19462   (2.4%)
  Realtek                   12587   (1.6%)
  Marvell                   10869   (1.3%)
  Bootlin                    8978   (1.1%)
  MediaTek                   8449   (1.0%)
  NVIDIA                     8163   (1.0%)
  Arista Networks            7955   (1.0%)
  Ideas on Board             7429   (0.9%)
  ST Microelectronics        7057   (0.9%)

Intel dominates the by-changesets list — and would be at the top even without the Xe contribution. The 6.7 kernel showed a spike in contributions from unaffiliated developers; that number has reverted to something close to its long-term mean in 6.8, though. Otherwise, these numbers are about the same as they usually are.

Fixes

Commits fixing a bug should contain a Fixes tag identifying the commit that introduced the bug; that practice helps in the understanding of the problem and informs the backporting effort for the stable releases. In 6.8, 2,582 commits contained a total of 2,732 Fixes tags identifying 2,292 commits in 90 releases. Of those tags, 533 identified other 6.8 commits, and thus do not refer to bugs that made it into a released kernel.
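
A Fixes tag names the abbreviated hash and subject line of the offending commit; a typical tag (with an invented hash and subject, for illustration) looks like:

    Fixes: 1a2b3c4d5e6f ("subsystem: fix NULL dereference in some_function()")

The stable maintainers' tooling uses these tags to decide which stable series need a given fix backported.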

The distribution of the remaining tags is shown in the following table. The "Fixed" column indicates the number of commits in the named release that were fixed by commits in 6.8, while "By" gives the number of commits in 6.8 fixing that release.

Releases fixed in 6.8

  Release    Fixed    By
  v6.7         241   282
  v6.6         137   164
  v6.5         125   146
  v6.4          84   103
  v6.3          83    91
  v6.2          74    82
  v6.1          50    51
  v6.0          70    73
  v5.19         53    54
  v5.18         45    45
  v5.17         27    28
  v5.16         43    43
  v5.15         32    35
  v5.14         22    26
  v5.13         33    35
  v5.12         27    31
  v5.11         30    35
  v5.10         23    26
  v5.9          25    27
  v5.8          26    29
  v5.7          26    30
  v5.6          31    34
  v5.5          16    16
  v5.4          15    25
  v5.3          25    25
  v5.2          11    11
  v5.1          12    19
  v5.0          11    12
  v4.20         25    30
  v4.19         18    21
  v4.18         13    13
  v4.17         11    10
  v4.16         14    16
  v4.15         11    11
  v4.14          9    10
  v4.13          5     5
  v4.12          7     8
  v4.11         11    11
  v4.10         14    14
  v4.9           7     7
  v4.8          14    14
  v4.7           8     9
  v4.6           7     7
  v4.5           4     4
  v4.4           6     6
  v4.3           9     9
  v4.2           9    10
  v4.1           6     7
  v4.0           1     1
  v3.19          4     9
  v3.18          9     9
  v3.17          7     7
  v3.16         13    13
  v3.15          6     6
  v3.14          4     5
  v3.13          3     6
  v3.12          6     6
  v3.11          6     6
  v3.10         11    16
  v3.9           4     4
  v3.8           4     4
  v3.7           4     4
  v3.6           2     2
  v3.5           4     4
  v3.4           3     3
  v3.3           5     6
  v3.2           5     5
  v3.1           3     3
  v3.0           4     4
  v2.6.39        6     6
  v2.6.38        4     6
  v2.6.37        3     3
  v2.6.36        2     2
  v2.6.35        2     2
  v2.6.34        4     4
  v2.6.33        2     2
  v2.6.31        1     1
  v2.6.30        3     3
  v2.6.29        2     3
  v2.6.28        2     2
  v2.6.27        2     2
  v2.6.26        2     2
  v2.6.25        2     2
  v2.6.23        1     1
  v2.6.22        3     5
  v2.6.18        1     1
  v2.6.17        2     2
  v2.6.13        1     1
  v2.6.12        1    23

Thus, 6.8 contained 23 commits with Fixes tags identifying a single commit in 2.6.12 that needed a lot of fixing; that is, of course, the initial commit made at the beginning of the Git era. It has been almost 19 years, but we're still fixing bugs that went in prior to the adoption of Git.

The pattern shown above is typical for a kernel release; while a lot of the bugs fixed were introduced within the last year, there are also vast numbers of bugs that have lurked in the kernel for far longer.

In conclusion

As a final note: the Xe driver, first merged for 6.8, figures strongly in the statistics for this cycle; it is worth looking just a bit more at this work to see what is involved in adding a new graphics driver to the kernel. The Xe driver accounted for 1,041 changesets in this development cycle. Those commits were contributed by 70 developers, 66 of whom work for Intel (with a few still sticking to their Habana Labs email addresses). Their work added about 60,000 lines of code to the kernel.

Once upon a time, such a code contribution would have been huge news; in 2024, it draws little attention outside of the community that is interested in graphics drivers. Such is the nature of contemporary kernel development, where the addition of major new code components is a routine event. As of this writing, there are over 11,600 changesets waiting in linux-next for the 6.9 merge window to open, so it seems that the flow will not be stopping soon; keep your eyes on LWN to see what those commits bring.

Comments (2 posted)

Vale: enforcing style guidelines for text

March 7, 2024

This article was contributed by Koen Vervloesem

While programmers are used to having tools to check their code for stylistic problems, writers often limit automatic checks of their texts to spelling and, sometimes, grammar, because there are not many options for further checking. Vale, an open-source, command-line tool that enforces editorial-style guidelines, is one such option and would make a useful addition to their toolbox. The recent release of Vale 3.0 warrants a look at this versatile tool, which assists writers by identifying common errors and helping them maintain a consistent voice in their prose.

Vale is the creation of Joseph Kato, who published the initial version in 2018. He introduced it as "a command-line tool that brings code-like linting to prose". In the context of programming, linting means analyzing source code to flag common errors, suspicious constructs, and stylistic mistakes. The program that does this analysis, known as a linter, typically adheres to a style guide, such as PEP 8 for Python code. Kato's program provides writers with a similar tool. However, it doesn't aim to serve writers who use word processors like LibreOffice Writer. Instead, Vale supports documents composed in plain-text markup languages such as Markdown, reStructuredText, AsciiDoc, or HTML. Consequently, it aligns more closely with the needs of documentation writers and technical writers.

Vale is written in the Go programming language, is cross-platform, and its code is published on GitHub under the MIT license. Precompiled binaries are available for Linux, macOS, and Windows. A few Linux distributions have packaged Vale and offer it through their repositories. There are also third-party packages enabling Vale installation from PyPI or from npm.

Configuring Vale

Vale doesn't check anything by itself. Instead, it offers a framework for creating and enforcing custom style rules. Therefore, Vale must be configured before being used. Its web site conveniently provides a configuration generator, which asks for a base style and supplementary styles. For linting texts for a web site, it also asks which static-site generator is being used (for now, only Hugo is supported), so that the special codes used by the static-site generator aren't flagged as errors. After making these choices, the generated configuration can be copied from the web page and pasted into a .vale.ini file. Vale looks for this file in the directory where the command is executed, or in the default location given by the output of the vale ls-dirs command.

The generated configuration file looks something like this:

    StylesPath = styles

    MinAlertLevel = suggestion

    Packages = Microsoft, Readability, Hugo

    [*]
    BasedOnStyles = Vale, Microsoft, Readability

The various styles are offered as packages on Vale's Package Hub. For example, the Microsoft package enforces the Microsoft Writing Style Guide for documents related to computer technologies, while the Readability package implements some popular metrics for readability, based on computations on the number of syllables, words, and sentences in the text. The Hugo package adds support for shortcodes to the Markdown files being used with Hugo, thereby ensuring Vale doesn't mistakenly mark the shortcode names and their parameters as spelling errors. Vale also incorporates a built-in style named Vale that implements spell-checking.

The vale sync command will cause the packages specified in the configuration file to be downloaded and saved in the directory defined in the configuration's StylesPath key, making them ready to be used by Vale. After this, using Vale is just a matter of running the vale command with one or more text files as arguments. It checks all those files for violations of the rules in the enabled packages. Vale then outputs an overview of all suggestions, warnings, and errors (the three importance levels a rule can have) for each individual file, showing the line and column on which the violation occurs, the name of the violated rule, and a rule-specific message. Vale is aware of the syntax of several markup languages. Thus, it knows to ignore code blocks and inline code because they don't contain natural language.

Customizing

Using one of the existing packages that implement a well-known style guide is an easy way to start improving the quality of a text with Vale. However, there will inevitably come a time when customizations are needed. The easiest way to do this is by adding a custom vocabulary, which is a directory (<StylesPath>/config/vocabularies/<name>/) with two plain-text files, each containing a word, phrase, or regular expression on individual lines. Entries in accept.txt are accepted as valid words (thus Vale's built-in spell checker ignores them), while entries in reject.txt are flagged as errors. The vocabulary is defined in Vale's configuration file with Vocab = <name>.
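
As a concrete (and hypothetical) illustration, a vocabulary named "Project" would live in <StylesPath>/config/vocabularies/Project/ and be enabled with Vocab = Project; its accept.txt could simply list project-specific terms, one per line:

    pidfd
    bcachefs
    Theengs

Entries in the sibling reject.txt follow the same one-per-line format.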

For handling tasks more complicated than simply accepting or rejecting phrases, Vale allows users to create their own rules. Each rule is defined in its own YAML file, and a collection of rules is organized in a directory under Vale's styles path. Such a collection of rules is called a style; it can be published as a package and referred to in other people's Vale configuration files. Vale supports nearly a dozen extension points for creating rules, including flagging the existence of specific phrases, suggesting the substitution of a word or phrase with an alternative, and spell-checking based on Hunspell-compatible custom dictionaries.
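
As a sketch of what such a rule looks like (the wording and phrase list here are invented for illustration), an existence rule that flags unwanted phrases might read:

    extends: existence
    message: "Avoid '%s' in project documentation."
    level: error
    ignorecase: true
    tokens:
      - leverage
      - learning curve

A substitution rule follows the same pattern, using "extends: substitution" and a "swap" map of phrases and their preferred replacements.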

Integrating Vale with other tools

Running Vale on the command line is straightforward, but the program can also be integrated with other tools. Installing the Asynchronous Lint Engine (ALE) plugin for Vim allows the editor to show E (error), W (warning), or I (suggestion) in the margins for violations discovered by Vale, and the flagged words are highlighted. In Vim's normal mode, Vale's message for the current line or word under the cursor is shown under the status line. Both of those can be seen in the following screen shot:

[Vale in Vim]

There's also the flymake-vale plugin for Emacs, which I have not tested because I'm a Vim user; note, though, that the plugin has not been updated for over a year.

Writers working with a docs-as-code approach naturally wish for their text to be automatically checked on committing files to a repository. For Git, this can be done with a pre-commit hook that runs Vale locally on any file added to a commit. Vale can also be run in a repository's continuous-integration (CI) pipeline, for example by using the official GitHub Action for Vale. Note that it may be wise to adjust the MinAlertLevel key in Vale's configuration file to "warning" or even "error" to avoid blocking the release of a project's documentation when a minor stylistic issue, such as an instance of passive voice, is identified.

Working with Vale

In the five years that it has existed, Vale has gained popularity; some high-profile projects and companies are using it for their documentation. This includes Grafana Labs, GitLab, Angular, Fedora, and Red Hat. I have been using Vale for a few years now, both for articles and documentation. For articles, I prefer to use the Microsoft style, although I disable some of its more pedantic rules. Rules can be individually disabled with a line like "Microsoft.Contractions = NO" in Vale's configuration file, where Microsoft is the name of the style and Contractions is the name of the rule within the style.

For my LWN articles, which I write in Markdown, I created a custom style based on the feedback I have received on my drafts. For example, I created an existence rule that flags phrases such as "leverage" and "learning curve" as errors. I also added a substitution rule that suggests corrections to common errors and LWN's preferences, such as "command-line tool" instead of "command line tool" and "WiFi" instead of "Wi-Fi". I constantly update these rules based on feedback on my drafts, aiming to save both LWN's editors and myself valuable time.

Recently I also introduced Vale to the documentation of Theengs Gateway, one of the projects I contribute to. I used the Google style for this, again excluding some of its more nitpicky rules, and added some project names and specific terms to a custom vocabulary. I also added a pre-commit hook running Vale on all Markdown files. Additionally, I made sure to run this same pre-commit hook in the project's GitHub Actions workflow.

Given that I set Vale's minimum alert level to "suggestion", the workflow fails on a pull request for even the slightest style violation. This is still manageable at the moment, as Theengs Gateway only has a handful of core contributors. However, in a larger project, raising the minimum alert level to "error" would be prudent. Not everyone installs the pre-commit hooks, so having a pull request fail due to the use of passive voice or an unexplained acronym can be a frustrating experience for new contributors. Fortunately, individual rules can be disabled not only globally in Vale's configuration file, but also within specific parts of a document, using comments like the following:

    <!-- vale Google.Passive = NO -->

Conclusion

Vale makes for a nice addition to the toolbox of any writer using a supported markup language. It makes it possible to deliver texts with a consistent style and quality, without having to memorize a style guide. It's especially suited for documentation. Vale is easy to set up and to integrate with text editors, pre-commit hooks, and continuous-integration pipelines. That makes it an essential tool in any docs-as-code workflow.

Comments (14 posted)

Page editor: Daroc Alden
Next page: Brief items>>


Copyright © 2024, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds