Leading items
Welcome to the LWN.net Weekly Edition for April 13, 2023
This edition contains the following feature content:
- The early days of Linux: Lars Wirzenius looks back at the genesis of the Linux kernel.
- Searching for an elusive orchid pollinator: using free software and hardware to figure out what is pollinating Helmet Orchids.
- Seeking an acceptable unaccepted memory policy: some types of secure guest enclaves require memory to be explicitly accepted before use; supporting that feature leads to an interesting backward-compatibility problem.
- The shrinking role of semaphores: a look back at the kernel's first mutual-exclusion primitive.
- Standardizing BPF: the BPF virtual machine is growing up; the time has come to create a standard describing BPF and how it works.
- Python 3.12: error messages, perf support, and more: looking forward to the next major Python release.
This week's edition also includes these inner pages:
- Brief items: Brief news items from throughout the community.
- Announcements: Newsletters, conferences, security updates, patches, and more.
Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.
The early days of Linux
My name is Lars Wirzenius, and I was there when Linux started. Linux is now a global success, but its beginnings were rather more humble. These are my memories of the earliest days of Linux, its creation, and the start of its path to where it is today.
I started my computer science studies at the University of Helsinki in the fall of 1988, and met Linus Torvalds, who was the other new Swedish speaking student in computer science that year. Toward the end of that first year, we had gotten access to a Unix server, and I accidentally found Usenet, the discussion system, by mistyping rm as rn, the Usenet reader. I told Linus about it and we spent way too much time exploring this.
After the first year, we both went away to do the mandatory military service, though in different places. We returned to our university studies in the fall of 1990, and both took the course on C and Unix programming, which included a fair bit of theory of the Unix kernel architecture as well. This led to us reading about other operating system kernels, such as QNX and Plan 9. Linus and I discussed with some enthusiasm how an operating system should be built correctly. We had all the overconfidence of 20-year-old second-year university students. Everyone is better off that this wasn't recorded for posterity.
In January 1991, Linus bought his first PC from a local shop that assembled computers from parts. The PC had a 386 CPU, which was relatively fancy at that time, because Linus wanted to explore multitasking. Also, since he came from a Sinclair QL with a 32-bit Motorola 68008 CPU, he wanted a 32-bit CPU, and did not want to step down to a 16-bit one, so a 286 was not an option. Linus's first PC had a whopping 4 megabytes of RAM and a hard drive.
He got a copy of the game Prince of Persia, which occupied most of his spare time for the next couple of months. He later also bought a copy of MINIX, because after using Unix at the university, he wanted something like that at home as well.
As and Bs
After finishing the game, Linus started learning Intel assembly language. One day he showed me a program that did multitasking. One task or thread would write a stream of the letter "A" on the screen, the other "B"; the context switches were visually obvious when the stream of As became Bs. This was the first version of what would later become known as the Linux kernel.
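A userspace approximation of that program is easy to write today. To be clear, this is not the original, which was standalone 386 assembly with its own timer-driven task switching; it is only a rough analogue using POSIX threads that reproduces the visible effect, with illustrative names throughout:

```c
#include <pthread.h>

/* Two threads, each writing a run of its own letter into a shared
 * buffer. Context switches show up as the points where the stream of
 * As turns into Bs and back. Linus's original was bare-metal 386 code
 * doing its own task switching, not pthreads; this sketch only mimics
 * the observable behavior. */

#define RUNS 1000
static char stream[2 * RUNS];
static int pos;                 /* shared write position */

static void *writer(void *arg)
{
    char letter = *(char *)arg;

    for (int i = 0; i < RUNS; i++) {
        /* atomically claim the next slot so the two writers never
         * overwrite each other */
        int at = __atomic_fetch_add(&pos, 1, __ATOMIC_RELAXED);
        stream[at] = letter;
    }
    return NULL;
}

static void run_demo(void)
{
    pthread_t a, b;
    char la = 'A', lb = 'B';

    pthread_create(&a, NULL, writer, &la);
    pthread_create(&b, NULL, writer, &lb);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
}
```

Compile with `-pthread`; dumping `stream` after `run_demo()` returns shows the alternating runs of As and Bs.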
Linus would later expand the program, and write most of it in C. During this time, late spring of 1991, I wrote an implementation of the C sprintf() function for him, as he hadn't yet learned how to write functions with variable argument lists. I wanted to spare him the pain of having a different function for every type of value to write out. The core of this code is still in the kernel, as snprintf().
As time went on, Linus made his fledgling kernel better and kept implementing new things. After a while, he had drivers for the keyboard and the serial port, emulation of VT100 terminal escape sequences for the screen, and could use it to dial via a modem to the university to read Usenet from home. Science fiction! One day, Linus accidentally attempted to use his hard drive to dial the university, resulting in his master boot sector starting with "ATDT" and the university modem-pool phone number. After recovering from this, he implemented file permissions in his kernel.
In August 1991, Linus mentioned his new kernel in public for the first time, in the comp.os.minix newsgroup. This included the phrase "I'm doing a (free) operating system (just a hobby, won't be big and professional like gnu)". Such humility.
The system was initially called Freax. A few weeks later, Linus asked Ari Lemmke, one of the administrators of ftp.funet.fi, to do an upload of the first tar archive. Ari chose the name Linux. The initial version still contains the original name embedded in one of the source files.
During this time, people were interested in trying out this new thing, so Linus needed to provide an installation method and instructions. Since he only had one PC, he came to visit to install it on mine. Since his computer had been used to develop Linux, which had simply grown on top of his Minix installation, it had never actually been installed before. Thus, mine was the first PC where Linux was ever installed. While this was happening, I was taking a nap, and I recommend this method of installing Linux: napping, while Linus does the hard work.
The first releases of Linux used a license that forbade commercial use. Some of the early contributors suggested a change to a free-software license. In the fall of 1991, Richard Stallman visited Finland and I took Linus to a talk given by Stallman. This, the pressure from contributors, and my nagging eventually convinced Linus to choose the GNU GPL license instead, in early 1992.
Over the Christmas break, Linus implemented virtual memory in Linux. This made Linux a much more practical operating system on cheap machines with little memory.
1992
The year 1992 started with the famous debate with Andrew Tanenbaum, who is a university professor and the author of MINIX. He had some opinions about Linux and its architecture. Linus had opinions on MINIX. The debate has been described as a flame war, but was actually rather civil in hindsight.
More importantly for the future success of Linux was that the X11 system was ported to it, making 1992 the year of the Linux desktop.
I had chosen to contribute on the community side, rather than to the kernel directly, and helped answer questions, write documentation, and such. I also ran a short-lived newsletter about Linux, which is mainly interesting for publishing the first ever interview with Linus. The newsletter was effectively replaced by the comp.os.linux.announce newsgroup.
The first Linux distribution was also started in 1992: Softlanding Linux System or SLS. The next year, SLS morphed into Slackware, which inspired Ian Murdock to start Debian in 1993, in order to explore a more community-based development structure. A few other distributions would follow in the years to come.
In 1993, both Linus and I got hired as teaching assistants at the university. We got to share an office. That room had a PC, which Linus took over, and used for Linux development. I was happy with a DEC terminal for Usenet access.
One day, Linus was bored and the PC at work felt slow. He spent the day rewriting the Linux kernel command-line parser in assembly language, for speed. (That was, of course, quite pointless, and the parser would later be rewritten again in C, for portability. Its speed does not matter.) A couple of years later, he spent days playing Quake, ostensibly to stress-test kernel memory management, although that was with a newer PC. Much fun was had in that room, and there were no pranks whatsoever. None at all.
At some point, Linux gained support for Ethernet and TCP/IP. That meant one could read Usenet without having to use a modem. Alas, early Linux networking code was occasionally a little rough, having been written from scratch. At one point, Linux would send some broken packets that took down all of the Sun machines on the network. As it was difficult to get the Sun kernel fixed, Linux was banned from the university network until its bug was fixed. Not having Usenet access from one's desk is a great motivator.
1.0
In the spring of 1994 we felt that Linux was done. Finished. Nothing more to add. One could use Linux to compile itself, to read Usenet, and run many copies of the xeyes program at once. We decided to release version 1.0 and arranged a release event. The Finnish computer press was invited, and a TV station even sent a crew. Most of the event consisted of ceremonially compiling Linux 1.0 in the background, while Linus and others spoke about what Linux was and what it was good for. Linus explained that commercial Unix for a PC was so expensive that it was easier to write your own.
In 1995 Linus and I did a software engineering course at the university, which mostly consisted of a large practical project. This was built on top of Linux, of course. I insisted that a version-control system be used. I had witnessed students in earlier courses do the shouting kind of version control: the students shared a source tree over NFS and shouted "I'm editing this file" when they were changing something. This did not seem like an effective method to me, so I insisted on CVS, which I'd just learned about. This experience is why Linus dislikes CVS and for years refused to use any version control beyond uploading tar balls to FTP sites.
That year was also when Linux was first ported to a new architecture by Linus. He'd been given a DEC Alpha machine. I would later get the machine to use as a terminal for reading Usenet. Other people ported Linux to other architectures, but that did not result in me getting any more machines to read Usenet on.
In 1997 Linus graduated and moved to the US to take a job at Transmeta. I took a job at a different university in the Helsinki area.
In the following years, many things happened. It turned out that there were still a few missing features from Linux, so people worked on those. The term "open source" was coined and IBM invested a ton of money in Linux development. Netscape published a version of its web browser as open source. Skipping a few details and many years, open source basically took over the world. LWN was started and covered much of this history on a week-by-week basis.
In 1991, Linus wrote that Linux "won't be big and professional like gnu".
In 2023, Linux is running on every continent, on every ocean, on billions of devices, in orbit, and on Mars. Not bad for what started as two threads, writing streams of As and Bs on the screen.
Searching for an elusive orchid pollinator
Orchids are, of course, flowers, and flowers generally need pollinators in order to reproduce. A seemingly offhand comment about the unknown nature of the pollinator(s) for a species of orchid in Western Australia has led Paul Hamilton to undertake a multi-year citizen-science project to try to fill that hole. He came to Everything Open 2023 to give a report on the progress of the search.
Helmet orchids
Hamilton lives in Busselton, Western Australia, which is near the southwest tip of the country. There are 117 native terrestrial orchid species in his area, some of which he displayed on a slide that can be seen in the video of his talk. A photo of the slide is below, on the left, as well. There are various kinds of pollinators for these orchids, mostly insects, including ants and wasps. The orchids in his area do not have powdery, granular pollen like many other flowers, but instead have pollinia, which are waxy blobs of pollen that stick to the backs of insects that enter the flower.
![Western Australia orchids [WA orchids]](https://static.lwn.net/images/2023/eo-wa-orchids-sm.png)
He had a picture of a thynnid wasp with two pollinia on it, as well as another of a wasp on his finger. These native wasps are fairly small, around 20mm long, and do not sting. They pick up the pollinia at one flower and drop them off at some other, thus helping to propagate the species.
"Quite a few years ago now", he was on a walk with Professor Mark Brundrett and they were discussing pollinators and why they were important. Brundrett pointed out that the pollinator for the helmet orchid is unknown. That got Hamilton thinking "how hard can it be?"; he set out to try to find the pollinator of this type of orchid, which is not rare in his area and also lives all along the east coast of the country.
Ten years earlier, he had been shown a large, slowly rotting, downed tree with lots of helmet orchids on it, so he decided to try to find that log to use as the focal point of his search. Over the next few years, when he was in the mood for a walk, he would go back to that area to search for the log. He crisscrossed the area, logging his GPS tracks, but it turned out that his memory was bad—the log was actually a little ways outside of his search area.
![Helmet orchid [Helmet orchid]](https://static.lwn.net/images/2023/eo-helm-orc-sm.png)
In the meantime, though, he found an "even more spectacular" log with helmet orchids; it was elevated "like a bench", which made it easy to photograph. There were around 60 flowers of one species (Corybas recurvus) all growing together in the moss layer on the rotting wood. He showed a close-up of the orchid, which has a roughly 20mm-wide leaf and a 10mm-wide flower; the location inside the flower where the pollinia are is around 5mm wide, "whatever goes in is small", Hamilton said.
The flowers come out in July and August (Australian winter) "in a dark, wet forest". It may not sound all that daunting, but if you go there at that time of year, you will be attacked by hordes of mosquitoes. There is a second helmet species (Corybas despectans) in that same area that he is also targeting, which is even smaller than C. recurvus. The two orchids bloom for two or three weeks in succession, with a bit of overlap between them.
Traps
He and Brundrett constructed some traps, one of which did actually catch some insects. It was raised up a few centimeters above the flower, but none of the insects they captured had the pollinia on them. Next up was using a sticky paint-on substance on transparent flexible plastic (from overhead transparencies) for trapping insects; most of what they trapped using this first generation of sticky traps were fungus gnats that "look like mosquitoes, but they're a little bit smaller".
![Paul Hamilton [Paul Hamilton]](https://static.lwn.net/images/2023/eo-hamilton-sm.png)
The next generation traps were laser cut from sturdy, clear plastic, each with grid marks for reference purposes along with a place for the date and site name. Each was around twice the area of an index card; they were sticky-painted and placed in the field. At lunch, after he and Brundrett first put out the traps, they realized they were going to have a real problem storing the traps. Some "quick thinking" led to a 3D printer design for a storage box for the traps; he now has many of these boxes, from multiple years, all filled with the traps and their stuck insects.
They did have some success with this technique. He showed pictures of three fungus gnats, each with differently shaped pollinia on their backs; "Fantastic!" But there is a problem: even though the traps were positioned just above the orchids, there is no way of knowing that those pollinia came from the orchids; "maybe they came from over there or maybe from a kilometer away".
At that time, DNA testing was around $2,000 per sample. Doing a test was discussed, but, as a citizen-science project, that was too steep for citizen Hamilton—his wife was even less keen on the idea, he said with a chuckle. The collected pollen is still available, though, so if prices for DNA tests fall low enough, perhaps that testing can be done.
Instead, he went around and collected samples from helmet orchids as well as other orchids in the area. The pollinia are quite small; he showed pictures of them on the ends of toothpicks. But trying to work with individual pollen grains would be far more difficult. However, the pollinia for the two helmet orchids were around 1mm wide and had no real discernible shape. "Time to go high tech", he said.
Photography
He was in a "Raspberry Pi phase" a year or so later and thought he could use that device to help in the search. He put up a block diagram of the Raspberry-Pi-based camera system that he built. It uses an infrared camera to continuously take pictures of a target flower. It is powered by two "very heavy, unfortunately" sealed lead-acid (SLA) batteries using a power distribution board that allows him to swap batteries in the field without shutting down. He also designed a companion board with a realtime clock and a small display for power monitoring; the board can also drive an infrared floodlight at night.
He started out taking full-resolution photos, but "it was taking too long to take a photograph, relatively speaking"; he reduced the resolution to 1920x1080, which allowed him to take a photo every 0.8 seconds. The "3.2 million photographs" from the title of his talk referred to what he gathered in a normal month-long season: a bit over 100K photos per day for roughly 30 days. Over a season, five different flowers were monitored in succession.
Much of the journey of building the camera "trap" is documented in a series of posts on the South West Makers forum. The designs and code are also available in Hamilton's GitLab repository.
He would go daily after work to the camera site in order to swap batteries, so it was dark—and generally raining at that time of year. The electronics and batteries were housed in a plastic bin, which he tried to shield with his poncho when he switched batteries, but eventually water got into the housing and corroded some of the contacts on his hand-assembled board. That led him to create a printed circuit board for the project, which has worked out well; it can be used for other projects too.
He did some more 3D printing for various parts, such as a camera sun/water shield and flexible battery terminals. He also needed a housing for the camera that would keep out the water except for a small hole over the lens. He prototyped it with laser-cut MDF, then created the final housing from acrylic. The camera was focused at 2m, which was way too far for his needs, so he carefully removed the glue that kept the lens from changing focus. Then he could adjust the focus through the hole in the housing using a 3D-printed tool that he created, which allowed him to focus at around 6.5cm.
The plastic tub eventually got waterproof connectors for USB, HDMI, and power, which allows him to connect to the sealed box in the field. The housing is just externally connected to the camera and light when it is in use. The camera and light are positioned near the flower and the "rig" is covered in a waterproof camouflage poncho. People do walk through the area with their dogs at times, so the camouflage is to try to disguise the $300-400 worth of equipment.
The camera generates around 100GB of data each day. He showed photos from both day and night, which are washed-out looking due to the nature of infrared light. Next up, he needed to figure out what to do with all of this data that he was gathering.
Software
"OpenCV to the rescue." He said that there are some fantastic OpenCV tutorials available from Adrian Rosebrock, which Hamilton used to come up to speed on doing motion detection. There are some tricks, though, like figuring out how to ignore things moving in the wind. He identified an area of interest for OpenCV and had it look in that area for motion with certain characteristics. For example, if the item moving was too large, it could be a cockroach or other larger insect and thus would not be of interest.
When OpenCV found a match, it would copy the photo to a folder. That would reduce the number of photos he had to look at greatly; instead of the 100K photos generated, he would only need to look at 500-2000 or so. Every day, he would swap out a 128GB USB flash drive when he swapped batteries.
The basic pipeline consisted of copying the data from the USB drive to his home network-attached storage (NAS) server. Then a script was run to use the data in the photo log file to annotate the date, time, and frame number directly into each image. It also creates an 800x600 version of the file, which helps make OpenCV run faster; those images are turned into a video using FFmpeg. OpenCV was run on the video and would copy the files of interest to another directory for him to examine.
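The size filtering at the heart of the motion detection can be illustrated with a toy frame-differencing routine. This is not Hamilton's OpenCV code; it is a plain-C sketch of the underlying idea, with illustrative names and thresholds: count the pixels that changed noticeably between two grayscale frames, and report motion only when the changed area falls in an interesting size range.

```c
#include <stdlib.h>

/* Toy version of the motion filter: difference two grayscale frames,
 * count the pixels that changed by more than diff_threshold, and
 * report motion only when the changed area is neither too small
 * (noise) nor too large (a big insect, or foliage moving in the
 * wind). The real pipeline uses OpenCV contours; everything here is
 * illustrative only. */
static int motion_detected(const unsigned char *prev,
                           const unsigned char *cur,
                           int npixels, int diff_threshold,
                           int min_changed, int max_changed)
{
    int changed = 0;

    for (int i = 0; i < npixels; i++) {
        int d = abs((int)cur[i] - (int)prev[i]);
        if (d > diff_threshold)
            changed++;
    }
    return changed >= min_changed && changed <= max_changed;
}
```

Frames flagged this way would then be copied aside for human review, which is how 100K photos a day shrink to a few hundred.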
He showed some photos of the results, with various insects in and around the flowers. Over 1.5 seasons, he has detected two bush flies entering the flower; one stayed for seven seconds and the other for 36, which he thought was a surprising amount of time. There have been seven brown ant visits, for five or six seconds each, and two fungus gnats for 13 and 20 seconds. At the outset, it was thought that fungus gnats were the mystery pollinator, "so it is interesting to see so many ants going down inside there". He has yet to capture the "smoking gun" of an insect entering without pollinia and leaving with one or more. He plans to try again this year.
When you visit the site, "you see nothing, it's just barren", Hamilton said, which makes you think the site is kind of boring. But the camera captures that, roughly five minutes after the human leaves, all sorts of life emerges. "It is fascinating" to see the insects, worms, spiders, ticks, mites of various sorts, and so on. He showed some pictures of those as well; the spiders are particularly impressive, he said, as they look like something out of a horror movie in the video.
He has more data to analyze, including temperature readings from the site. It might be interesting to correlate insect activity with temperature, for example. As he was putting together the talk, he realized that gathering rainfall data might also be interesting, so he may look into adding a way to do that at some point. Another tidbit is that he had little trouble making the housing waterproof, but making it ant-proof was a real problem; it turns out that ants love a dry place and can get around the seal on the plastic bin, almost no matter how much effort you put into sealing it further.
In 2017, he was invited to give a talk about the project at the Maker Faire Shenzhen. He and his wife also set up a booth, where some 50,000 people walked by over the three days of the event. The orchid-pollinator project led him to Shenzhen, which also led to another interesting project—with the inevitable add-on maker projects thrown in as well.
His wife wanted to have her own activity at the booth, so she brought fabric squares and pens that attendees could write or draw on; over 650 of those, with stories, poems, pictures, and so on, were collected. "As one does", he then created a scanner to photograph each of the squares, which were processed with OpenCV for edge detection and auto-rotation; they plan to put them on the web somewhere so that others can see them too. The squares themselves were incorporated into a quilt over the next two years, which they donated to the Maker Faire organization at the 2019 event.
Over the years of the project, which has gone in fits and starts at times due to various glitches, life happenings, and so on, he has gathered some 2.7TB of data, which he is happy to share—though his data plan is not up to a direct transfer, he said with a chuckle. At this point, it sounds like he has most of the "bugs" worked out of the process; one hopes that a smoking-gun photo is in the cards for the coming season. If so, it seems likely that Hamilton will find another way to combine electronics, 3D printing, and FOSS, perhaps with a different orchid—or a completely different citizen-science objective.
[Thanks to LWN subscribers for supporting my travel to Melbourne for Everything Open.]
Seeking an acceptable unaccepted memory policy
Operating systems have traditionally used all of the memory that the hardware provides to them. The advent of virtualization and confidential computing is changing this picture somewhat; the system can now be more picky about which memory it will use. Patches to add support for explicit memory acceptance when running under AMD's Secure Encrypted Virtualization and Secure Nested Paging (SEV-SNP), though, have run into some turbulence over how to handle a backward-compatibility issue.
Normally, when the operating-system kernel boots, it discovers the available memory and happily sets itself up to use that memory. Version 2.9 of the UEFI specification, though, added the concept of unaccepted memory; when this mechanism is in use, a system (normally a virtualized guest) will be launched with its memory in an unaccepted state. That system will not be able to make use of the memory provided until that memory has been explicitly accepted. On such systems, the bootloader will typically pre-accept enough memory to allow the guest kernel to boot; that kernel must take responsibility for accepting the rest before using it.
Documentation on the motivation for this feature is scarce, but there would appear to be a couple of reasons for the addition of this new complication:
- Secure guest environments like SEV-SNP and Intel's TDX can protect their memory contents from the host and other guests through encryption, reverse-mapping tables, and more. Setting that protection up takes some time, though, slowing the boot process considerably. An explicit acceptance step allows the operating system to spread the initialization of memory over time. If memory is only accepted in chunks as it is needed, the system will boot into a running state more quickly. The patches adding unaccepted-memory support from Kirill Shutemov take advantage of this by deferring acceptance of memory until it is needed.
- Explicit acceptance can help to defend a secure guest from a malicious hypervisor that might try to play games with the guest's memory behind the scenes. Should the hypervisor try to sneak a new page into a guest's address space, that new memory will not have been accepted by the guest and an attempt to access it will generate a fault.
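The deferred-acceptance idea amounts to simple bookkeeping: a bitmap records which units of memory have already been accepted, and the acceptance operation runs only on first touch. A minimal sketch follows; the unit size, names, and structure are illustrative assumptions, the accept operation is stubbed out (it is an architecture-specific TDX or SEV-SNP call), and the actual patches are considerably more involved:

```c
/* Sketch of deferred-acceptance bookkeeping: track which 2MB units of
 * memory have been accepted in a bitmap, and accept a unit only the
 * first time an allocation touches it. accept_unit() stands in for
 * the architecture-specific accept operation; all names and sizes
 * here are hypothetical. */

#define NUNITS 512                       /* say, 1GB of 2MB units */
#define BITS_PER_WORD (8 * sizeof(unsigned long))

static unsigned long accepted[NUNITS / BITS_PER_WORD];
static int accept_calls;                 /* counts real accept operations */

static void accept_unit(unsigned int unit)
{
    /* ...architecture-specific call to accept this unit of memory... */
    accept_calls++;
}

static void ensure_accepted(unsigned int unit)
{
    unsigned long bit = 1UL << (unit % BITS_PER_WORD);
    unsigned long *word = &accepted[unit / BITS_PER_WORD];

    if (!(*word & bit)) {                /* first touch: accept it now */
        accept_unit(unit);
        *word |= bit;
    }
}
```

Spreading the accept calls out this way is what lets the guest reach a running state quickly instead of paying the full initialization cost at boot.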
Each vendor's secure environment has its own way of managing the acceptance process, so some of the code that implements acceptance must necessarily be specific to one subarchitecture. Shutemov's patches add support for Intel's TDX, but support for AMD's SEV-SNP comes from a separate patch set from Tom Lendacky.
It turns out that SEV-SNP support has to handle a problem that TDX does not: existing users. The kernel has been able to work with SEV-SNP since the 5.19 release, so there are already systems using SEV-SNP in the wild. But current kernels, while they understand SEV-SNP, do not have support for memory acceptance, or even the concept that memory must be explicitly accepted. If such a kernel is booted on a system where some of the memory has not been accepted, it will be unable to use that memory and may fail badly trying.
That is not the security experience that SEV-SNP was created to provide. To avoid such an outcome, Lendacky's series includes a patch from Dionna Glaze adding a special UEFI protocol to provide compatibility for older systems. Specifically, when running on AMD hardware, a booting system must invoke the new UEFI protocol prior to the call to ExitBootServices() that transfers full control away from the firmware. If the call to the new protocol is not made, the firmware will pre-accept all of the memory provided to the system before handing control to the operating system.
This mechanism lets kernels that are capable of handling unaccepted memory inform the firmware of that fact while avoiding problems for kernels that lack that ability. The plan is that this will be a temporary measure, only needed until users can be expected to have newer kernels:
This protocol will be removed after the end of life of the first LTS that includes it, in order to give firmware implementations an expiration date for it. When the protocol is removed, firmware will strictly infer that a SEV-SNP VM is running an OS that supports the unaccepted memory type.
When an earlier version of this patch was posted in January, Shutemov objected, calling the feature "a bad idea". He added: "This patch adds complexity, breaks what works and the only upside will turn into a dead weight soon". X86 maintainer Dave Hansen agreed, worrying that it would never be possible to remove support for this interface once it had been added.
Shutemov reiterated his opposition in response to the most recent patch set, but this time Hansen indicated that he had changed his mind:
The fact is that we have upstream kernels out there with SEV-SNP support that don't know anything about unaccepted memory. They're either relegated to using the pre-accepted memory (4GB??) or _some_ entity needs to accept the memory. That entity obviously can't be the kernel unless we backport unaccepted memory support.
He would like to pretend that the problem doesn't exist, he continued, but "my powers of self-delusion do have their limits".
Shutemov was unswayed, though, suggesting that the hypervisor could load a special firmware that pre-accepts all of the memory when launching a system that lacks that support; how the hypervisor would know that about any specific guest is not entirely clear. Ard Biesheuvel disagreed, arguing that letting the kernel make its own capabilities known to the firmware is the most straightforward solution to a problem that cannot be ignored. When Shutemov said that this protocol would fail in cases where the bootloader calls ExitBootServices() prior to starting the kernel, Biesheuvel answered that it was "a theoretical concern" that will not show up in real-world use.
After holding a finger up to the wind, your editor's guess is that this feature will eventually be accepted into the mainline. A more interesting question, perhaps, is when it will be removed. Biesheuvel said that, over time, firmware will stop supporting this protocol, and it will be possible to remove that support from the kernel as well. Hansen is unconvinced, though; he notes that users run old kernels for a long time, so the support for them will also need to stay for a long time. Or, as he said earlier in the discussion: "Yeah, the only real expiration date for an ABI is 'never'. I don't believe for a second that we'll ever be able to remove the interface."
There is nothing unusual about this situation; whenever maintaining compatibility is a concern, software will fill up with little hacks like this. That is part of the cost of keeping things working. In this case, the cost appears small enough to be acceptable. Existing SEV-SNP users will, once this work is merged, be able to run their virtual machines on systems where memory must be explicitly accepted prior to use.
The shrinking role of semaphores
The kernel's handling of concurrency has changed a lot over the years. In 2023, a kernel developer's toolkit includes tools like completions, highly optimized mutexes, and a variety of lockless algorithms. But, once upon a time, concurrency control came down to the use of simple semaphores; a discussion on a small change to the semaphore API shows just how much the role of semaphores has changed over the course of the kernel's history.
At its core, a semaphore is an integer counter used to control access to a resource. Code needing access must first decrement the counter — but only if the counter's value is greater than zero; otherwise it must wait for the value to increase. Releasing the semaphore is a matter of incrementing the counter. In the Linux kernel implementation, acquisition of a semaphore happens with a call to down() (or one of a few variants); if the semaphore is unavailable, down() will wait until some other thread releases it. The release operation, unsurprisingly, is called up(). In the classic literature, as defined by Edsger Dijkstra, those operations are called P() and V() instead.
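The counter semantics can be shown in a few lines of C. This is only a model of the counting logic, not the kernel's implementation: real kernel semaphores protect the count with a spinlock and put the caller to sleep in down() when the semaphore is unavailable, so the sketch below is written in the non-blocking down_trylock() style instead.

```c
/* A model of the counter logic behind down() and up(), stripped of
 * all locking and sleeping: down() may only proceed when the count is
 * positive, and up() releases by incrementing. Since this sketch has
 * no way to sleep, the acquire side is written as a trylock that
 * reports contention instead of blocking. */
struct semaphore {
    int count;          /* number of threads still allowed in */
};

static int down_trylock(struct semaphore *sem)
{
    if (sem->count > 0) {
        sem->count--;   /* acquired */
        return 0;
    }
    return 1;           /* contended: a real down() would sleep here */
}

static void up(struct semaphore *sem)
{
    sem->count++;       /* release; a real up() would wake a sleeper */
}
```

Initializing the count to one gives a binary semaphore, the mutual-exclusion configuration that most early kernel users (like the SCSI code described below) relied on.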
The 0.01 kernel release in 1991 did not have semaphores — or much of any other concurrency-control mechanism, in truth. In the beginning, the kernel only ran on uniprocessor systems and, like most Unix systems at that time, the kernel had exclusive access to the CPU for as long as it chose to run. A process running in the kernel would not be preempted and would continue to execute until it explicitly blocked on an event or returned to user space, so data races were rarely a problem. The one exception was hardware interrupts; to prevent unwanted concurrency from interrupts, the code was liberally sprinkled with cli() and sti() calls to block (and unblock) interrupts where needed.
In May 1992, the 0.96 release brought a number of significant changes, including some initial "networking" support; it enabled Unix-domain sockets using a Linux-specific socketcall() system call. Perhaps most significant in this release, though, was the addition of support for SCSI devices; good SCSI support would be a key factor during the early adoption phase of Linux. The SCSI subsystem brought with it the first mention of semaphores in the kernel, buried deep down within the driver layer. Like many that would follow, SCSI semaphores were binary semaphores, meaning that their initial value was set to one, allowing only a single thread to have access to the resource (a SCSI host controller) that it managed. The 0.99.10 release in June 1993 brought a reimplemented network layer and support for System V semaphores in user space, but still no general support for semaphores in the kernel.
The addition of semaphores
The first implementation of general-purpose semaphores for the kernel itself showed up in the 0.99.15c release in February 1994. The initial user was the virtual filesystem layer, which added a semaphore to the inode structure; no other users had been added by the 1.0 release one month later. The 2.0 release (June 1996) saw a slow growth in the number of semaphores, as well as the addition of the infamous big kernel lock (BKL), which was not a semaphore.
That was the beginning of SMP support and, even then, kernel code ran under the BKL by default, so most kernel code was limited in the amount of concurrency it had to deal with. In essence, the BKL existed so that kernel code could run under the same assumption of exclusive access to the CPU that had been wired deeply into the code since the beginning; it only allowed one CPU to be running in the kernel at any given time. So disabling interrupts was still by far the most common concurrency-control mechanism in use in the kernel.
By the 2.2 release (January 1999), there were 71 struct semaphore declarations in the kernel; by 2.4.0 (January 2001) that number had grown to 138, and by 2.6.0 (December 2003) it was 332. The 2.6.14 release, in October 2005, had 483 semaphore declarations. By this time, disabling interrupts was falling out of favor as a way to control concurrency — the cost on system performance as a whole was simply too high — and the big kernel lock had become a scalability problem in its own right.
Meanwhile, the first spinlock infrastructure was added in the 2.1.23 development kernel, though it was not really used until a spinlock was added to the scheduler in 2.1.30. Unlike a semaphore, a spinlock is a pure mutual-exclusion primitive, without a semaphore's count. It also is a non-sleeping lock; code waiting for a spinlock would simply "spin" in a tight loop until the lock became available. Until this addition, semaphores had been the only generalized mutual-exclusion mechanism supported by the kernel.
Spinlocks were better than semaphores for many situations, but they came with the restriction that code holding a spinlock is not allowed to sleep; that meant that there was still a need for a semaphore-like structure. Around the end of 2005, though, developers started thinking that a better solution might exist for the binary-semaphore case — which was how most semaphores were used. An initial "mutex" implementation turned out to perform worse than semaphores did but, as happened frequently in that era, Ingo Molnar showed up with a faster alternative within days. Mutexes were soon added to the kernel as an alternative to semaphores, and the process of converting semaphores to mutexes began.
A slow transition
When mutexes were introduced, developers worried that they would force a flag-day change where all binary semaphores would be changed over to the new type. But mutexes were added alongside the old type, allowing the two to coexist and code to be converted at leisure. As a result, unsurprisingly, there are still over 100 semaphores declared in the kernel, the bulk of which appear to be binary semaphores. It is hard to find patches that add new semaphores, though; the most recent would appear to be this driver patch in August 2022. Most kernel developers, it seems, have no reason to think about semaphores much of the time.
Modules maintainer Luis Chamberlain has recently been working on a problem where the arrival of a large number of requests to load modules in a short time can create difficulties for the memory-management subsystem. After some discussion, he posted a proposal for a mechanism that would simply limit the number of module-load operations that can be underway at any given time. Linus Torvalds quickly answered, reminding Chamberlain that semaphores ("a *classic* concurrency limiter") could be used for that purpose. The patch has since been reworked along those lines.
As part of the associated discussion, though, Peter Zijlstra noted that the DEFINE_SEMAPHORE() macro, which declares and initializes a static semaphore, sets the initial value to one, creating a binary semaphore by default. Since, as he said, "binary semaphores are a special case", it would have been better to have DEFINE_SEMAPHORE() take an additional argument to specify what the initial value should be. Torvalds agreed that this change would make sense: "So let's just make it clear that the only reason to use semaphores these days is for counting semaphores, and just make DEFINE_SEMAPHORE() take the number." Semaphores, he said, are now "almost entirely a legacy thing". Zijlstra has since posted a patch to that effect.
This minor change to the semaphore API is not likely to affect too many developers. There is still, though, the open question of the dozens of binary semaphores still in use. There would be value in converting them over to mutexes; the performance would be better, and the resulting code would look more familiar to current developers. As Sergey Senozhatsky pointed out, though, it is not possible to mechanically convert those users without taking a close look. There is, for example, a binary semaphore that persists in the printk() code because mutex_unlock() cannot be called from interrupt context, while up() can.
It just goes to show that in the kernel, as elsewhere, old code can persist for a long time. The use of binary semaphores was arguably outmoded in 2006, but many uses remain and it took until 2023 to change the initializer to not create a binary semaphore by default. Kernel developers may come and go, but kernel code, at least sometimes, can stay around for a lot longer.
Standardizing BPF
The extended BPF (eBPF) virtual machine allows programs to be loaded into and executed within the kernel — and, increasingly, other environments. As the use of BPF grows, so does interest in defining what the BPF virtual machine actually is. In an effort to ensure a consistent and fair environment for defining what constitutes the official BPF language and run-time environment, and to encourage NVMe vendors to support BPF offloading, a recent effort has been undertaken to standardize BPF.
BPF programs are written in C, and compiled into BPF bytecode. Like other bytecode instruction sets, BPF programs are platform-independent and just-in-time (JIT) compiled. For a long time, "platform-independent" for BPF simply meant the ability to run BPF programs on multiple different architectures on Linux. That definition has expanded in recent years, with Microsoft implementing a version of BPF for Windows, and network-interface vendors, such as Netronome, providing the ability to offload BPF networking programs. NVMe vendors are also looking into supporting offloading functionality to BPF for storage devices with a new framework called eXpress Resubmission Path (XRP), though this effort is currently stalled due to BPF not being standardized.
What's in scope for standardization?
BPF is not simply an instruction set, but rather a combination of an instruction set and a run-time environment. The latter must, at a minimum, include the necessary logic to execute the BPF program; either through an interpreter, or by JIT-compiling the program directly into native instructions. Additionally, it may include features such as static verification of the program, performing type checking using information provided via BTF, built-in data structures, and more. While the BPF instruction set architecture (ISA) is in scope for standardization, it's less clear which other aspects of BPF are appropriate.
This question was posed by Christoph Hellwig in his 2022 LSFMM/BPF presentation. While all of the participants in the discussion agreed that standardizing the ISA is the highest priority, there was also discussion about whether to standardize certain run-time semantics, such as what happens when a program divides by zero. In the discussion, Alexei Starovoitov explained that, initially, BPF would simply exit the program if a divide by zero was encountered. After realizing that abruptly exiting a program can be dangerous (it may, for example, have needed to clean up some state), the semantics were changed to instead simply return zero and produce no exception, matching the behavior of aarch64.
The discussion concluded with a general agreement that the first order of business was to fully document and decide on a versioning system for the ISA.
In a follow-up email, Hellwig suggested that the ISA should be versioned according to Clang CPU levels — the "processor version" used by Clang when compiling BPF programs. Starovoitov pointed out in response that multiple instructions have, in the past, been added to the BPF ISA without a bump in the Clang CPU level, so the CPU levels weren't a clean match with the BPF ISA versions. Starovoitov suggested a number of other approaches, such as versioning with an upstream kernel commit hash, or simply declaring the current ISA as 1.0 and bumping it for every new instruction. Hellwig was unenthusiastic about the idea of using kernel commit hashes, but was amenable to the idea of considering the current ISA as version 1.0.
Following these discussions, the BPF ISA documentation has been improved significantly, with all of the current instructions being fully documented. The documentation page lists the instruction set as v1.0, so it would seem that Starovoitov's idea of treating the current ISA as v1.0 was chosen as the way forward.
Yet, while the current ISA is fully documented, there are still new instructions being added that will presumably be included in the official v1.0 BPF ISA. Yonghong Song proposed a set of six such instructions to be included in the new -mcpu=v4 Clang CPU level. These will surely not be the last instructions added to the ISA, but for now they appear to be the last instructions that will be added to v1.0.
Choosing a standards organization
In addition to finalizing the ISA and deciding what else is in scope for standardization, there is another important question to resolve before standardization can begin in earnest: with which organization will the standard be ratified?
The natural choice is the eBPF Foundation, which was founded as a subsidiary of the Linux Foundation in December 2021; the foundation is responsible for managing both the finances and the technical direction of the BPF project. For technical matters, the foundation has a steering committee composed of engineers from various companies throughout the tech industry. Were BPF to be standardized through the eBPF Foundation, the steering committee would presumably be the responsible party.
Standardizing through the eBPF Foundation would likely be the most straightforward option, incurring the smallest amount of latency in achieving consensus. It does, however, have a major drawback: the eBPF Foundation has never published a standard. This, on its own, isn't necessarily a hard blocker for publishing (every organization had to publish their first standard at some point), but it does mean that the eBPF Foundation would have to go through the standardization process without the benefit of prior experience. In this regard, while the bureaucracy of a more recognized organization could be considered a pain point, it could also be considered a feature if that organization's processes and experience help to ensure that the standard is well considered and of the highest quality. On the other hand, some members of the steering committee, such as Dave Thaler, have experience from working with other standards organizations such as the Internet Engineering Task Force (IETF).
One alternative to publishing with the eBPF Foundation is publishing directly through ISO, the international standards organization that is home to, among others, the C programming language standard that we all know and love. Standardizing with ISO would likely guarantee the strongest possible worldwide consensus, as it is an international standards body with a rigorous and widely reviewed ratification process. For that reason, it is also likely to be the most difficult and time-consuming option. While I am by no means an expert in the domain of standardizing with ISO, it appears that, in order to even consider standardizing BPF with ISO, the standard would first have to be brought before the American National Standards Institute (ANSI, the ISO member body representing the US), which would then propose the idea to the larger international ISO community. Ratifying the BPF standard with ISO may be a desirable long-term goal, but seems unlikely to be the approach taken for the initial standardization effort.
IETF discussions
Yet another alternative is standardizing with the IETF, which is best known for creating the standards that comprise the "Internet Protocol Suite", more commonly known as TCP/IP, though it also publishes standards for non-networking topics such as file formats. The IETF is also an international standards body, though its process for standardization is less onerous than the ISO. As such, it may represent an ideal middle ground between the eBPF Foundation and the ISO.
Discussions have been ongoing between members of the BPF and IETF communities, including on an IETF mailing list, following a BPF standardization meeting at the IETF 115 conference in 2022 as to whether IETF is an appropriate venue for BPF standardization. The topic was recently revisited at IETF 116 in 2023. Despite there being some vocal opponents among the attendees, the overall consensus in favor of standardizing BPF was apparently quite strong relative to the norm for IETF. Jari Arkko, an IETF Area Director (AD), posted this summary on the IETF BPF mailing list:
The chairs asked if the room felt the problem was well defined and scoped. The meeting was almost unanimous that it was. Same for recommending to start the work. The community seems to want the work to go ahead by a larger level of consensus than we're normally used to in IETF BOFs.
Yet, while IETF as an organization seems enthusiastic to move forward (pending some legal matters as discussed below), Arkko also pointed out that more work needs to be done in terms of formally defining what is in scope for standardization:
I do have one concern however. I think the meeting discussed the issues only in the abstract, and spent almost no time discussing the actual list of work items. There's a draft list of work items in the charter (https://datatracker.ietf.org/doc/charter-ietf-bpf/), and the room hums seemed to say that the charter is acceptable. However, to what extent has this been discussed on list or somewhere else? I personally thought some items were quite clearly feasible while I wasn't so sure of others.
Arkko certainly has a point. As discussed above, the scope of standardization for BPF could be broad. If BPF proceeds with the IETF, achieving consensus on what will be in scope for the first publication of the standard will certainly be one of the major work items.
Now that some sort of consensus has been achieved in the IETF community, it seems likely that BPF standardization will proceed through the IETF. Before work can formally begin in earnest, however, there are still a few legal matters to finalize. The co-chairs of the IETF 116 BPF BoF meeting, Suresh Krishnan and Lorenzo Colitti, mentioned that the IETF legal counsel was still doing due diligence on some questions related to licensing and copyright. Though these legal matters are expected to be resolved without issue, final approval has yet to be given. Assuming there are no legal hiccups, the next step would be to formally create an IETF working group, which would likely be co-chaired by Krishnan and myself.
Worth noting as well is that BPF is not the first major subsystem in the kernel that is undergoing a standardization effort. Virtio was first standardized through the Organization for the Advancement of Structured Information Standards (OASIS) back in March 2016, and there are lessons to be learned from that effort. For instance, Rusty Russell, who led this effort, also made it a point to shop around for different standards organizations. According to the LWN article linked above, he was warned that "some standards groups exist primarily to slow things down", which did not suit his goals of finding an organization that was "interested in the creation of useful standards without a lot of unnecessary hoops to jump through." The BPF community will have similar goals of its own and, so far, it seems that it is following virtio's example of putting in the legwork to find an organization whose processes match the project's needs.
It will be interesting to see where the standardization effort goes from here. Some interested parties, such as the aforementioned NVMe vendors, seem to be deferring substantial investment until BPF is fully standardized. There is thus a significant incentive for the effort to proceed. In the meantime, we can at least enjoy the steady stream of high-quality BPF documentation inspired by this standardization effort.
Python 3.12: error messages, perf support, and more
Python 3.12 approaches. While the full feature set of the final release—slated for October 2023—is still not completely known, by now we have a good sense for what it will offer. It picks up where Python 3.11 left off, improving error messages and performance. These changes are accompanied by a smattering of smaller changes, though Linux users will likely make use of one in particular: support for the perf profiler.
Error messages
Last year, Python 3.11 added a major quality-of-life improvement to the interpreter's traceback reporting. Noting that existing tracebacks left ambiguity as to where an error occurred within a particular line, PEP 657 ("Include Fine Grained Error Locations in Tracebacks") introduced intra-line bytecode-to-column-offset mappings that help users spot problems more quickly. For instance, the carets underneath ['c'] highlight its role in the TypeError in this example from the PEP:
    Traceback (most recent call last):
      File "test.py", line 2, in <module>
        x['a']['b']['c']['d'] = 1
        ~~~~~~~~~~~^^^^^
    TypeError: 'NoneType' object is not subscriptable
In that case, x['a']['b'] has evaluated to None, which is not subscriptable, but earlier versions of Python did not give any clue as to which of the four possibilities caused the error. This change was made possible by work done in 3.9 to switch to a new parser based on a parsing expression grammar (PEG). That switch has also enabled other error message enhancements in recent releases.
Python 3.12 continues to aid in debugging with several new error message improvements. The most straightforward change affects import statements. For example, programmers might try the following:
    import foo from bar

That may be more idiomatic in English, but it is incorrect Python syntax; it will now result in:
SyntaxError: Did you mean to use 'from ... import ...' instead?
The rest of the improvements focus on correcting and expanding the set of suggestions made when users reference attributes, modules, and module symbols that don't exist. Python 3.10 added a suggestion to some NameError messages when the undefined variable was sufficiently similar in name to an actual one; there was an analogous improvement for AttributeError. However, a common mistake (for the beginning and experienced Python practitioner alike) is to omit the self. in front of an instance attribute in a class's method. In 3.10, instance attributes were not candidates for suggestions; in fact, the suggestions might erroneously include a variable deemed similar in name. Consider the following (somewhat silly) example:
    class Clown:
        def __init__(self):
            self.cool = 1

        def hijinks(self):
            schools = ["Sunnydale High", "Dillon High School"]
            return [2 * cool for school in schools]
Running Clown().hijinks() produces an exception:
NameError: name 'cool' is not defined. Did you mean: 'school'?
But it is evident that the user meant to write self.cool. Python 3.12 fixes the problem by superseding other recommendations with the instance attribute if one exists. Note that this only applies in the case of an exact match between the name of the attribute and the name in the code.
Another new suggestion also only applies in the case of an exact match. If a module from the standard library is used without an import, the interpreter will now make a suggestion:
NameError: Did you forget to import <module>?
Finally, 3.12 extends 3.10's NameError and AttributeError suggestions to the realm of the ImportError. For example:
    from bar import foox

There is no foox in bar, and the programmer misspelled foo, so the interpreter will now suggest:
    Did you mean: 'foo'?

In order to accomplish this, bar must be imported and then examined for near matches, so the performance implications of this change were hotly debated. As Erlend Aasland pointed out, "A lot of stdlib modules try to import their C implementation first, else fall back to a Python implementation", so adding the typo-matching overhead in such a case could be problematic. However, Pablo Galindo Salgado eventually discovered "a way to implement this without penalty to control-flow failed imports", so an ImportError that is caught by a try block will not cause the module to be examined for suggestions. Galindo Salgado is the author of all four of the error-message changes and is also the release manager for 3.10 and 3.11.
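The import-fallback pattern that Aasland refers to looks like this; the C-accelerator module name below is made up for illustration:

```python
# The common stdlib pattern: try the C implementation first, fall back to
# pure Python if it is unavailable. The module name _hypothetical_c_impl
# is invented for this sketch and does not exist, so the fallback runs.
try:
    from _hypothetical_c_impl import fast_func as func
except ImportError:
    def func(x):
        return x * 2   # pure-Python fallback

print(func(21))  # → 42
```

Because this ImportError is caught by the try block, Python 3.12 skips the near-match search for it, so the fallback path pays no new cost.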
Optimizations
Python 3.11 was perhaps most notable for its performance enhancements. According to the release notes, CPython 3.11 outperformed CPython 3.10 by 25% on the pyperformance suite, and expected speedups in the wild were between 10% and 60%.
Python 3.12 continues this trend, though not as dramatically. As part of the Faster CPython project that he leads, Guido van Rossum suggested using the Binary Optimization and Layout Tool (BOLT); he noted its usage in Pyston, which is a CPython fork focused on performance. BOLT, which originated at Meta and is now part of the LLVM compiler project, strives to improve performance by optimizing an application's code layout based on an execution profile generated by a tool like perf.
The first attempt to use BOLT yielded a 1% improvement on a set of benchmarks, which Van Rossum deemed insufficient to justify its complexity. But a subsequent try suggested that adhering to BOLT best practices might improve performance, and that the earlier measurements may in part be an artifact of the suite in question, which may not sufficiently stress the instruction cache. As a result, 3.12 includes experimental support for BOLT, with estimated improvements of up to 5% on certain tasks.
A more narrowly focused performance improvement in Python 3.12 affects the re regular-expression module in the standard library; in particular, the re.sub() and re.subn() functions were enhanced. Serhiy Storchaka was able to improve the performance of those functions by up to 2-3x by moving more of the algorithm into C; "re.sub() is relatively slow, because for every match it calls a Python code". While the scope of this change is narrower, it will be welcomed by anyone trying to run a substitution-heavy workload.
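For reference, this is what the two sped-up functions do; re.subn() behaves like re.sub() but also reports how many substitutions were made:

```python
import re

# re.sub() replaces every match; re.subn() does the same but returns a
# (new_string, substitution_count) tuple.
text = "cats and dogs and cats"

print(re.sub(r"cats", "birds", text))  # → birds and dogs and birds
print(re.subn(r"and", "&", text))      # → ('cats & dogs & cats', 2)
```

Code that performs many such substitutions in a loop stands to benefit most from the 3.12 change.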
Overall, according to the Faster CPython benchmarks for the latest release, Python 3.12 outperforms 3.11 by about 1.04x. The release documentation doesn't clarify which specific changes are to be credited for this improvement, however. The announcement of the latest releases of 3.12 does include this tidbit: "The internal representation of integers has changed in preparation for performance enhancements." It seems that further speedups are afoot. Python supports arbitrary-length integers, and handling small integers more efficiently would seem to be the target of this effort.
perf
The Linux perf tool was introduced in 2009 and allows users to profile the performance of their code at the hardware level. Unfortunately, to date, perf has only been able to reference Python functions implemented in C—any Python calls are simply shown as _PyEval_EvalFrameDefault. The documentation has an example of the perf report without the new feature; a portion of that shows:
    ...
    --54.65%--PyEval_EvalCode
        _PyEval_EvalFrameDefault
        PyObject_Vectorcall
        _PyEval_Vector
        _PyEval_EvalFrameDefault
        PyObject_Vectorcall
        _PyEval_Vector
        _PyEval_EvalFrameDefault
    ...
With Python 3.12, however, developers can see the names of the Python functions that get called. The perf support can be enabled at run time using the PYTHONPERFSUPPORT environment variable or with an interpreter flag: -X perf. The support can also be enabled and disabled in the code:
    import sys

    sys.activate_stack_trampoline("perf")
    do_profiled_stuff()
    sys.deactivate_stack_trampoline()
    non_profiled_stuff()

Use of any of these approaches adds references to the relevant Python functions in the report generated by perf:
    ...
    --53.26%--PyEval_EvalCode
        py::<module>:/src/script.py
        _PyEval_EvalFrameDefault
        PyObject_Vectorcall
        _PyEval_Vector
        py::baz:/src/script.py
        _PyEval_EvalFrameDefault
        PyObject_Vectorcall
        _PyEval_Vector
        py::bar:/src/script.py
        _PyEval_EvalFrameDefault
    ...
Under the hood, this change uses a clever trick implemented by other run-time environments. As Galindo Salgado explained, perf has a means for "mapping the JIT-ed areas to a string that identifies them", and so with "a very simple JIT compiler", mentions of _PyEval_EvalFrameDefault can be augmented with Python function names.
It's worth noting that not all Linux users will have access to this new tool out of the gate:
Support for the perf profiler is only currently available for Linux on selected architectures. Check the output of the configure build step or check the output of python -m sysconfig | grep HAVE_PERF_TRAMPOLINE to see if your system is supported.
The two currently supported platforms are Linux on 64-bit x86 and Arm processors. So developers on those systems will be able to use perf and, with luck, the number of supported architectures will grow.
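The same check can also be made from within Python using the sysconfig module; this sketch simply reports whether the current interpreter was built with the trampoline:

```python
import sysconfig

# HAVE_PERF_TRAMPOLINE is a build-time configuration variable; on
# unsupported platforms (or older Python versions) it is 0 or absent.
flag = sysconfig.get_config_var("HAVE_PERF_TRAMPOLINE")

print("supported" if flag else "not supported")
```

This is handy in scripts that want to fall back to another profiler when the perf trampoline is unavailable.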
Notable removals
Python 3.12 will remove a handful of deprecated standard-library modules. Probably the most notable of these is the distutils module, which has been superseded by setuptools, which is not part of the standard library. The Python Packaging Authority (PyPA) provides setuptools, which has been deemed better suited for package-installation management, though the landscape remains somewhat contentious. According to PEP 632 ("Deprecate distutils module"), "Setuptools has recently integrated a complete copy of distutils and is no longer dependent on the standard library". This removal completes the deprecation that began in 3.10, though distutils has been on its way out for rather longer than that.
Other removals include the smtpd module (replaced by aiosmtpd), a few deprecated features from unittest, and the wstr and wstr_length Unicode APIs. For a full enumeration of removed features, see the "Deprecated" section of "What's New in Python 3.12".
And more
While we've covered some of the more interesting features of the Python 3.12 release here, we've omitted some others—for instance, the new Unstable C API. What's more, 3.12 has only just completed its alpha stage with the alpha 7 release on April 4. The first beta is expected to follow on May 8, which is also the feature freeze for the release. Consult the full release schedule for more details. In the meantime, it is a good time to tinker with 3.12, explore what it has to offer, and report back on any bugs found.
Page editor: Jonathan Corbet