Leading items
Welcome to the LWN.net Weekly Edition for May 27, 2021
This edition contains the following feature content:
- Turmoil at the freenode IRC network: an apparently hostile takeover at freenode causes many community projects to flee.
- Why RISC-V doesn't (yet) support KVM: the RISC-V KVM implementation appears to be ready, but has encountered roadblocks getting into the mainline kernel.
- Control-flow integrity in 5.13: a new security feature in the next kernel release.
- Multi-generational LRU: the next generation: this complex memory-management patch set continues to evolve.
- Julia 1.6 addresses latency issues: changes made to the Julia runtime to reduce the "time to first plot" problem.
This week's edition also includes these inner pages:
- Brief items: Brief news items from throughout the community.
- Announcements: Newsletters, conferences, security updates, patches, and more.
Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.
Turmoil at the freenode IRC network
Internet Relay Chat (IRC) is a longstanding protocol—or series of protocols—for creating online, text-based chat rooms. While many of the "channels" (as chat rooms are usually called) are highly useful to a wide variety of projects and organizations, including much of the free-software world, IRC seems to have a community that suffers from more than its fair share of disagreements, hostile forks, vitriol, and other types of divisiveness. It is perhaps no huge surprise, then, that the IRC world is currently undergoing another of its periodic upheavals. The largest IRC network, freenode, is embroiled in a messy dispute that has led to the mass resignation of many of its volunteer staff, the founding of a competitor network (run by the former staff), and its abandonment by multiple high-profile projects.
As one digs into the details, though, one finds a number of conflicting accounts of what happened. "The freenode resignation FAQ" gathers a bunch of information in one place and makes for a good starting point. It was written by Sven Slootweg ("joepie91"), who is not directly involved: "I am not a Freenode staffer, and this document only contains information from public channels." He does, however, note that his sympathies lie with the freenode staff. This Ars Technica article provides, perhaps, a somewhat more balanced view of the mess.
Selling freenode
This episode has its roots in 2016, when Christel Dahlskjaer ("christel"), the project lead at freenode at the time, formed a corporation called Freenode Limited. That company was sold in 2017 to Andrew Lee ("rasengan"), founder of VPN service provider Private Internet Access (PIA). Ostensibly, at least, Freenode Ltd. was formed in order to have an entity that could sign contracts and such for the freenode #live conference; the new owner was not supposed to have any operational control over the IRC network and the staff of volunteers, and freenode users, were assured of that.
At the time of the sale, Dahlskjaer posted a note about "PIA and freenode joining forces". It painted a picture of a harmonious alliance that would not affect the operational side of freenode at all:
Freenode will continue to operate as a not-for-profit entity under the same management, with the same principles, but PIA's involvement going forwards will provide us with opportunities and resources that we could previously only dream of.
At the time, some were leery of the sale and of PIA's involvement in freenode. In that same post, Dahlskjaer announced that she had taken a full-time job at PIA in a dual role as the head of freenode and the director of sponsorship and events for PIA, which probably also contributed to the level of concern. But for several years, nothing changed for the IRC network—it kept running as it had been without interference from its new owner—so those concerns faded into the background.
Changes
That all changed in February 2021 when the logo for another company co-founded by Lee, Shells, appeared on the freenode front page. According to the recent resignation letter of freenode staff member Aaron Jones ("amdj"), the acknowledgments page is normally used for that kind of thing. He said that adding the logo/link for Shells on the site was disruptive:
This was not discussed among staff at the time. This caused some disruption and considerable confusion among staff for months afterward, as to how to handle user inquiries about it, of which we received many. Ultimately, christel chose to resign rather than explain the situation to us.
A new project lead, Tom Wesley ("tomaw"), was elected by the rest of the freenode staff in April. A blog post in mid-April announced the organizational changes, pointing to a page describing the people and their roles; the blog post has since been removed and the "people" page now reflects the organization post-upheaval. According to Jones, Lee "insisted that we remove" the blog post, without any discussion.
As might be guessed, Lee has a rather different take on things. On May 12, he posted a "Letter to freenode" that featured a "timeline" of events (without much in the way of dates), starting from Shells sponsoring freenode for $3,000 per month, which resulted in the logo/link addition to the web site. According to Lee, that led, in turn, to attacks that caused Dahlskjaer to resign because she was "unable to deal with the persistent harassment". The latter is disputed by Jones and others.
From there, Lee recounts a power struggle that he had with Wesley for control of freenode, which, at least in part, revolved around the freenode domain-name registrations. Another part of the dispute seems to center around a project for a new IRC server that some at freenode were working on along with staff members from the Open and Free Technology Community (OFTC), which is another IRC network. Freenode and OFTC staff started work on Solanum in September 2020, according to Jones, after the Charybdis IRCd, which had been planned as the next IRC server for freenode, was forked "because of unrelated drama".
But late in April, the test network "which was being used to ready Solanum in preparation for the network's migration to it" was unexpectedly (and silently) shut down. That was apparently done by Wesley, though he has been conspicuously quiet throughout all of the recent uproar. Jones believes legal action (Lee did mention getting a lawyer in his letter) is behind it:
Attempts to discover a reason why were met with silence. Tom [Wesley] has not been able to talk about it, but I strongly suspect that Andrew [Lee] is behind it, and given Tom's silence on the subject, I suspect Andrew used the threat of legal force to ensure it. I have also heard rumours that a gag order was sent to a large collection of OFTC staff [...]
Resignations
Since then, things have spiraled further out of control. On May 19, the staff resignations started being posted; there are 11 separate staff members who have stepped down at this point. Links for the resignations can be found in the FAQ; each has its own account of the last few months (or more), which are, perhaps unsurprisingly, generally consistent with what Jones posted.
For his part, Lee posted a note on the freenode site on the same day that outlined the situation from his perspective. He made it clear that he is in control of the network, since he has been the "owner of freenode since 2017"; his hope is for "freenode to continue to be what it has been". He also noted that the community is paramount: "Freenode is freenode, and it exists for the sake of the FOSS movement."
Libera.Chat
The FOSS community has another option now, however. Many of the resignation letters pointed to the new Libera.Chat IRC network, which was introduced on that same fateful day. Libera.Chat came about due to the freenode mess and it is staffed by former freenode volunteers. It would seem that the founders of Libera.Chat are keen to prevent a repeat of the freenode situation:
Our legal home is a non-profit association in Sweden, with all our staff holding equal stakes, and we will never accept corporate control.
The first week or so for Libera.Chat has gone rather well, it would seem. As of May 25, the numbers for the new network are rather eye-opening:
In these first few days, we’ve been able to reach 16,500 simultaneous connections and 20,000 registered accounts. We entered the global top 10 within days, and are the fastest-growing IRC network ever. This growth is not without challenges: We have managed to finalise registrations of 250+ projects, and we’re getting more and more all the time. Our backlog is as large as our finished registrations!
As shown in the IRC top 100 networks on netsplit.de, Libera.Chat has already slightly overtaken OFTC, though it is still well below the numbers for freenode (which is not tracked in the top 100, though it does appear in the top 10 graph). In addition, many different projects have moved or are in the process of moving from freenode to Libera.Chat:
We’re proud to host projects across the entire spectrum of use cases, from games, to programming languages, to Linux distributions, and even the world’s knowledge. Some of the projects we are now supporting include Ubuntu, PostgreSQL, Wikimedia, Wikipedia, and friends, and of course the IRCv3 Working Group.
Over at freenode
Lee posted a rather strident note on May 25; it boasted that "the plan to destroy freenode has failed", though the evidence for the existence of such a plan seems fairly thin. Meanwhile, he claims that there has been an enormous outpouring of good wishes—"tens of thousands of messages of love and support"—from the "silent majority" of the FOSS community that is sticking with freenode.
He suggested that there are only a few projects leaving freenode and that they are doing their users a disservice by forcing them to migrate elsewhere. The projects came to freenode for its users, he said, not the users for the projects. He also claims that he has "hundreds of reports from project leads" who are being threatened with "being canceled" (whatever that might mean in this context) if they do not move their channel(s) away from freenode; he called for a halt to that practice.
One suspects that if things were as rosy as Lee paints them, there would be no need for a posting of this sort. In addition, there would be no need to try to stop channels from publishing their plans to change servers, but there are reports of just that kind of activity on freenode. It would seem that mentioning "Libera" in the topic of any channel on freenode is enough to have the channel taken over by the freenode staff in order to, apparently, keep it running there under new management. It seems that forcing users to go elsewhere for official communication channels is wrong, but keeping them at freenode under false pretenses is ... not.
Wrapping up
This whole saga highlights something that Lee may not have realized when he "bought" freenode. An IRC network is more than just the servers on which it runs. And, for freenode, those servers are being donated by various entities, so they could not have been part of any purchase in any case. Effectively, Lee bought the domain names, web site, and name recognition of freenode. The latter has been severely sullied at this point, at least in the eyes of a large and seemingly growing list of projects and people. It will be interesting to see where freenode (and Libera.Chat) go from here.
There is something of a bottomless pit of links to web pages and IRC logs that crops up as one starts to dig into this affair—and IRC history in general. Even just limited to the current freenode kerfluffle, there are allegations of bribery (with money or revenge), a sad tale of a would-be moderator of the dispute disassociating themselves from both "sides" and, in fact, from IRC development altogether, connections to the Korean "royal family" and to the failed Mt. Gox Bitcoin exchange, and more. To an outsider like myself, it is all truly weird—bordering on surreal at times. Those interested should strap in, follow links, do searches on search engines or sites like Hacker News, and make up their own minds.
At its core, the struggle here is about control. The former freenode staff thought they were in control of the network and had policies in place to ensure they were involved in any changes going forward. When they were disabused of that notion, they decided to go elsewhere to regain that control. On the flipside, Lee decided (or was forced, depending on which account you believe) to assume control over freenode, apparently believing that either the staff would acquiesce or that freenode could weather the storm. We are now watching the latter play out in realtime.
One thing seems clear, however: projects should be able to decide where their communication channels live. Even if one disagrees with the reasons for a switch away from a network, the choice of the project should be honored. Confusing users about where to go for help or discussion does not in any way help the cause of less FOSS fragmentation—quite the reverse, in fact.
Why RISC-V doesn't (yet) support KVM
The RISC-V CPU architecture has been gaining prominence for some years; its relatively open nature makes it an attractive platform on which a number of companies have built products. Linux supports RISC-V well, but there is one gaping hole: there is no support for virtualization with KVM, despite the fact that a high-quality implementation exists. A recent attempt to add that support is shining some light on a part of the ecosystem that, it seems, does not work quite as well as one would like.

Linux supports a number of virtualization mechanisms, but KVM is generally seen as the native solution. It provides a standard interface across systems, but much of KVM is necessarily architecture-specific, since the mechanisms for supporting virtualization vary from one processor to the next. Thus, architectures that support KVM generally have a kvm directory nestled in with the rest of the architecture-specific code.
Given that, some eyebrows were raised when Anup Patel's patch series adding RISC-V KVM support deposited the architecture-specific code into the staging directory instead. Staging is normally used for device drivers that do not meet the kernel's standards for code quality; if all goes well they are improved and eventually "graduate" out of the staging directory. It is not usually a place for architecture support. So staging maintainer Greg Kroah-Hartman was quick to ask why things were being done that way.
The answer comes down to the patch-acceptance policy for RISC-V code, found in Documentation/riscv/patch-acceptance.rst, which reads:
We'll only accept patches for new modules or extensions if the specifications for those modules or extensions are listed as being "Frozen" or "Ratified" by the RISC-V Foundation. (Developers may, of course, maintain their own Linux kernel trees that contain code for any draft extensions that they wish.)
Virtualization for RISC-V is described by the hypervisor extension specification. As Patel explained, that extension has not yet been approved:
The KVM RISC-V patches have been sitting on the lists for almost 2 years now. The requirements for freezing RISC-V H-extension (hypervisor extension) keeps changing and we are not clear when it will be frozen. In fact, quite a few people have already implemented RISC-V H-extension in hardware as well and KVM RISC-V works on real HW as well. Rationale of moving KVM RISC-V to drivers/staging is to continue KVM RISC-V development without breaking the Linux RISC-V patch acceptance policy until RISC-V H-extension is frozen.
It is fair to say that Kroah-Hartman was not impressed; he stated that circumventing policies in other parts of the kernel tree is not the purpose of the staging directory, so the RISC-V KVM code would not be accepted there. Paolo Bonzini, the overall maintainer for KVM, added that "the RISC-V acceptance policy as is just doesn't work", as demonstrated by the fact that developers are having to work around it. This is especially unfortunate here, he said, because the RISC-V KVM implementation "is also a very good example of how to do things right". He went on to suggest that there may be some players out there who benefit from the slowing down of patches on their way into the kernel.
Kroah-Hartman responded that slowing down the merging of useful code is "horrible" and that the job of the kernel community is to make hardware work; a policy that prevents the merging of good code that adds support for existing hardware does not make sense. He asked the RISC-V maintainers to explain this policy; they have yet to answer that question. Back in April, though, RISC-V maintainer Palmer Dabbelt acknowledged that the acceptance policy is not producing the desired results:
My goal with the RISC-V stuff has always been getting us to a place where we have real shipping products running a software stack that is as close as possible to the upstream codebases. I see that as the only way to get the software stack to a point where it can be sustainably maintained. The "only frozen extensions" policy was meant to help this by steering vendors towards a common base we could support, but in practice it's just not working out.
He added that the policy could change, but only when there is a new policy for patch acceptance in place that everybody agrees to. Co-maintainer Paul Walmsley, though, placed the blame with RISC-V International's specification process and said that any fixes should be applied there instead.
It is not that hard to understand why the RISC-V maintainers might not want to attempt to support every nonstandard CPU that is out there. The open nature of RISC-V makes it relatively easy for anybody to create their own variant, and supporting them all could become unworkable. The counterpoint is, as Kroah-Hartman said, that the kernel's job is to run on the hardware that is out there. Blocking support for shipping systems can only have the effect of pushing those systems toward vendor-supplied kernels with a lot of out-of-tree code — an unfortunate outcome for what is supposed to be an open architecture.
When the subject is support for a feature as fundamental as virtualization, the question becomes even more urgent. Hopefully this episode will lead to a rethinking of the patch-acceptance policies for the RISC-V architecture. Failing that, Kroah-Hartman has signaled his willingness to allow that support into staging if there is truly no alternative. So Linux seems likely to gain KVM support for RISC-V in the relatively near future, even if it's necessary to circumvent the architecture's maintainers to do it.
Control-flow integrity in 5.13
Among the many changes merged for the 5.13 kernel is support for the LLVM control-flow integrity (CFI) mechanism. CFI defends against exploits by ensuring that indirect function calls have not been redirected by an attacker. Quite a bit of work was needed to make this feature work well for the kernel, but the result appears to be production-ready and able to defend Linux systems from a range of attacks.
Protecting indirect function calls
The kernel depends heavily on indirect function calls — calls where the destination address is not known at compile time. Device drivers, filesystems, and other kernel subsystems interface with the generic, core code by providing functions to be called to carry out specific actions. When the time comes to, for example, open a file (which may be a special file corresponding to a device), the core kernel will make an indirect call to the appropriate open() function defined in the file_operations structure for the file in question. Indirect function calls allow for a clean separation between generic and low-level code.
This mechanism is flexible and performs well, but it also makes those indirect calls into an attractive target for attackers. If an indirect call can be redirected to an attacker-chosen location, there are few limits to the disorder that can result. Changes over the years have made it hard for attackers to inject their own code into the kernel, but if they can force execution to an arbitrary location, that matters little. Note that an exploit need not redirect a call to the beginning of another function; it can jump to any arbitrary point within the kernel image. There is no shortage of useful targets for a corrupted indirect function call.
CFI attempts to block this sort of exploit by restricting indirect calls to locations that are plausible targets. In this case, "plausible" means that the call goes to the beginning of an actual function, and that the target function has the same prototype as the caller was expecting. That is not a perfect test; there may be functions with the same prototype that will perform some sort of useful action for an attacker. But the result is still a massive reduction in the set of available targets, which will often be enough.
This check is often called "forward-edge CFI", since it protects calls to functions. The corresponding "backward-edge" protection ensures that return addresses on the stack have not been tampered with. The patches merged for 5.13 are focused on the forward-edge problem.
LLVM CFI in Linux
Specifically, this CFI implementation works by examining the full kernel image at link time; for this reason, link-time optimization must also be enabled to use it. Whenever a location is found where the address of a function is taken, LLVM makes a note of the function and its prototype. It then injects a set of "jump tables" into the built kernel, one for each encountered function prototype. So, for example, the open() function mentioned above is defined as:
int (*open) (struct inode *inode, struct file *file);
There are many functions in the kernel matching this prototype that have their address taken to stuff into a file_operations structure somewhere. LLVM will collect them all into a single jump table, which is essentially a list of the addresses of these functions.
The next step is to change all of the places where that function's address is taken, and replace the address with the corresponding location in the jump table. So an assignment like:
func_ptr = my_open_function;
will result in assigning an address within the jump table to func_ptr.
Finally, whenever an indirect function call is made, control goes to a special function called __cfi_check(); this function receives, along with the target address, the address of the jump table matching the prototype of the called function. It will verify that the target address is, indeed, an address within the expected jump table, extract the real function address from the table, and jump to that address. If the target address is not within the jump table, the default action is to assume that an attack is in progress and immediately panic the system; a permissive mode, selectable at configuration time, simply logs the error instead.
Kernel-specific quirks
That severe response may be justified, but it would be awfully annoying if there were situations where the kernel makes an indirect call to a function that doesn't exactly match the prototype of the pointer being used. So, naturally, the kernel did exactly that. In pre-5.13 kernels, list_sort() was declared as:
void list_sort(void *priv, struct list_head *head,
int (*cmp)(void *priv, struct list_head *a, struct list_head *b))
The comparison function cmp() is passed in by the caller and is invoked, via an indirect call, to compare items in the list. Inside list_sort(), though, one sees this line:
a = merge(priv, (cmp_func)cmp, b, a);
The cmp_func() type to which the function pointer is cast looks almost like the prototype of cmp(), except that the two list_head pointers have the const attribute. That is enough to change the prototype of the function and, at run time, to cause a CFI failure. The fix that was adopted was to propagate the const attribute to the callers of list_sort() so that the cast of the function pointer became unnecessary. That, however, required changing callers in 40 different files across the kernel source.
Another interesting quirk comes from the fact that the jump tables are built at link time. That works for the monolithic kernel, but loadable modules are linked separately. CFI in loadable modules works, but each module gets its own jump tables. Remember that function pointers are replaced by pointers into the jump tables; since modules have different jump tables, they will get different pointers as well. In other words, the values of two pointers to the same function will differ if one of them is in a loadable module.
For the most part, things will work anyway; calls to those two different pointers will end up in the same place. But consider this line in __queue_delayed_work():
WARN_ON_ONCE(timer->function != delayed_work_timer_fn);
This test was added to the 3.7 kernel in 2012 as a way to "detect delayed_work users which diddle with the internal timer"; nearly nine years later one assumes that they have all been found, but the test remains. If CFI is in use, though, the address for delayed_work_timer_fn() as seen from a loadable module will not be the same as the address seen from the core kernel; that will cause the test to fail. There are a couple of places in the kernel with tests like this; they have been "fixed" by simply disabling the test when CFI is configured in.
Various other things needed to be fixed as well, including making provisions for parts of the code that absolutely must have a direct pointer to a function rather than to the jump table. CFI in the kernel only works for the arm64 architecture in 5.13; support for x86 is in the works but is not yet ready to be enabled. There doesn't seem to be much in the way of data regarding the performance impact of this feature, but the LLVM page describing CFI says that its cost is "less than 1%".
CFI looks like a new feature that could have some scary, sharp edges. It is worth noting, though, that Kees Cook, when he sent the pull request asking that the patches (which were written by Sami Tolvanen) be merged, said that CFI "has happily lived in Android kernels for almost 3 years". It is, in other words, already widely deployed in the real world and probably doesn't have many surprises left to offer — except, perhaps, for attackers, who will find that many of their exploits no longer work.
Multi-generational LRU: the next generation
The multi-generational LRU patch set is a significant reworking of the kernel's memory-management subsystem that promises better performance for a number of workloads; it was covered here in April. Since then, two new versions of that work have been released by developer Yu Zhao, with version 3 being posted on May 20. Some significant changes have been made since the original post, so another look is in order.

As a quick refresher: current kernels maintain two least-recently-used (LRU) lists to track pages of memory, called the "active" and "inactive" lists. The former contains pages thought to be in active use, while the latter holds pages that are thought to be unused and available to be reclaimed for other uses; a fair amount of effort goes into deciding when to move pages between the two lists. The multi-generational LRU generalizes that concept into multiple generations, allowing pages to be in a state between "likely to be active" and "likely to be unused". Pages move from older to newer generations when they are accessed; when memory is needed pages are reclaimed from the oldest generation. Generations age over time, with new generations being created as the oldest ones are fully reclaimed.
That summary oversimplifies a lot of things; see the above-linked article for a more detailed description.
Multi-tier, multi-generational LRU
Perhaps the largest change since the first posting of this work is the concept of "tiers", which are used to subdivide the generations of pages which, in turn, facilitates better decisions about which pages to reclaim, especially on systems where a lot of buffered I/O is taking place. Specifically, tiers are a way of sorting the pages in a generation by the frequency of accesses — but only accesses made by way of file descriptors. When a page first enters a generation, it normally goes into tier 0. If some process accesses that page via a file descriptor, the page's usage count goes up and it will move to tier 1. Further accesses will push the page into higher tiers; the actual tier number is the base-2 log of the usage count.
Before looking at how these tiers are used, it is worth asking why they are managed this way — why are only file-descriptor-based accesses counted? One possible reason is never mentioned in the patch set or discussion, but seems plausible: accesses via file-descriptor will happen as the result of a system call and are relatively easy and cheap to count. Direct accesses to memory by the CPU are more costly to track and cannot reasonably be monitored with the same resolution.
The other reason, though, is that this mechanism enables some changes to how the aging of pages brought in via I/O is done. In current kernels, a page that is brought into memory as the result of, say, a read() call will initially be added to the inactive list. This makes sense because that page will often never be used again. Should there be another access to the page, though, it will be made active and the kernel will try to avoid reclaiming it. This mechanism works better than its predecessors, but it is still possible for processes doing a lot of I/O to flush useful pages out of the active list, hurting the performance of the system.
Doing better involves making use of the existing shadow-page tracking in the kernel. When pages are reclaimed for another use, the kernel remembers, for a while, what those pages contained and when the old contents were pushed out. If one of those pages is accessed again in the near future, requiring it to be brought back in from secondary storage, the kernel will notice this "refault", which is a signal that actively used pages are being reclaimed. As a general rule, refaults indicate thrashing, which is not a good thing. The kernel can respond to excessive refaulting by, for example, making the active list larger.
The multi-generational LRU work tweaks the shadow entries to record which tier a page was in when it was reclaimed. If the page is refaulted, it can be restored to its prior tier, but the refault can also be counted in association with that tier. That allows the computation of the refault rate for each tier — what percentage of pages being reclaimed from that tier are being subsequently refaulted back into memory? It seems evident that refaults on pages in higher tiers — those which are being accessed more frequently — would be worth avoiding in general.
This refault information is used by comparing the refault rates of the higher tiers against that of tier 0, which contains, remember, pages that are accessed directly by the CPU and pages that have not been accessed at all. If the higher tiers have a refault rate that is higher than the tier 0 rate, then pages in those tiers are moved to a younger generation and thus protected (for a while) from reclaim. That has the effect of focusing reclaim on the types of pages that are seeing fewer refaults.
The other piece of the puzzle is that the memory-management code no longer automatically promotes pages on the second file-descriptor-based access, as is done in current kernels. Instead, pages resulting from I/O stay in the oldest generation unless they have been moved, as the result of usage, into a tier that is refaulting at a higher rate than directly accessed pages. That, as Zhao explained in this lengthy message, has the effect of preventing these pages from forcing out directly accessed pages that are more heavily used. That should give better performance on systems doing a lot of buffered I/O; this remark from Jens Axboe suggests that it does indeed help.
Another change from the first version is the addition of a user-space knob to force the eviction of one or more generations. The purpose of this feature appears to be to allow job controllers to make some space available for incoming work; this documentation patch contains a little more information.
Multiple patch generations
The multi-generational LRU work remains promising, and it has garnered a fair amount of interest. Its path into the mainline kernel still looks long and difficult, though. Johannes Weiner raised a concern that was mentioned in the first article as well: the multi-generational LRU, as implemented now, sits alongside the existing memory-management code as a separate option, essentially giving the kernel two page-reclaim mechanisms. That will always be a hard sell for reasons described by Weiner:
It would be impossible to maintain both, focus development and testing resources, and provide a reasonably stable experience with both systems tugging at a complicated shared code base.
So the new code would have to replace the existing system, which is a tall order. It would have to be shown to perform better (or, at least, not worse) for just about any workload, at a level of confidence that would motivate the replacement of code that has "billions of hours of production testing and tuning". The only way to do this is to merge the changes as a series of small, evolutionary steps; the multi-generational LRU patch set would have to be broken up into a sequence of changes, each small enough that the memory-management developers can be confident it is safe to merge.
Over the years, the kernel has absorbed massive changes this way, but it is not a fast or easy process. Weiner suggested a couple of areas that could be focused on as a way of beginning the task of getting parts of this work upstream and making the rest easier to consider. If this advice is followed, some progress toward merging the multi-generational LRU could be made in the relatively near future. But this functionality as a whole is likely to have to go through numerous generations of revisions before it all makes it into a mainline kernel.
Julia 1.6 addresses latency issues
On March 24, version 1.6.0 of the Julia programming language was released; it is probably the most significant release since 1.0 came out in 2018. The new release significantly reduces the "time to first plot", a common source of dissatisfaction for newcomers to the language, by parallelizing pre-compilation, downloading packages more efficiently, and reducing the frequency of just-in-time re-compilations at run time.
The detailed list of new features, added functions, improvements to existing functions, and so on can be found in the release notes. The focus of this article will be the changes that affect all users of Julia, rather than those that only apply to certain packages or usage patterns.
The perennial complaint
Despite Julia's success as a vehicle for high-performance computing, there has been one persistent complaint on the part of those who try it out for the first time. The complaint is so common that it is referred to by a standard tag in discussions about Julia performance: the "time to first plot" problem. It refers to the fact that, while typing julia at the terminal results in the read-eval-print loop (REPL) prompt appearing instantly, one then has to endure several minutes of loading and pre-compiling before plotting something for the first time. Although subsequent plots are speedy, this is a pain point for users familiar with one or more of Python, Octave, Matlab, gnuplot, etc., where the first plot appears without a noticeable delay.
The reasons behind the wait for the first plot are, perhaps paradoxically, also the reasons for Julia's ability to achieve the execution speed of C and Fortran while also providing the interactive experience of an interpreted language. But Julia does its compilation step as part of an interactive session, which is where the wait comes in.
It is worth noting that the membership of the "Petaflop" club, those languages in which a program has sustained one petaflop (10^15 floating-point operations per second) on a real-world problem, consists of Fortran, C, C++, and Julia. The precocious Julia was admitted to this group before it had reached version 1.0, with an astronomical calculation using over a million threads on this Cray XC40. All of the languages in this club, except Julia, are ahead-of-time (AOT) compiled; they achieve their performance as the result of efficient machine code created by optimizing compilers, a process that can take significant time.
Julia is different. The journey from source code to machine instructions has multiple stages. The first stage (after downloading), which happens when a package is installed, is pre-compilation, where parsed and serialized representations of some functions and data structures are stored on disk. This is what can cause delays of several minutes when installing or upgrading a package, and adds to the time to first plot.
The other major stage that can lead to a noticeable lag is the just-in-time (JIT) compilation phase. The Julia JIT compiler is different from the typical sort found in, for example, LuaJIT. Those compilers trace the execution of code and create highly optimized versions of time-consuming functions or loops. The Julia version is sometimes called a "just-barely-ahead-of-time" compiler. Rather than trace execution, it performs a static analysis of code. This happens at run time and can cause pauses in execution when, for example, functions are invoked for the first time.
The choice of this type of JIT compiler is a consequence of Julia's type system and its organization around multiple dispatch. Each generic function in the source can have hundreds of methods associated with it. The compiler does not generate machine code for every possible method, but only for those actually used, which is generally not known until run time. This is why there is a delay the first time plot() is invoked in the REPL, but the second invocation is delay-free. It is also why, after importing other packages, invoking plot() sometimes incurs additional latency: the newly loaded code can invalidate previously compiled methods, requiring methods not needed before to be compiled.
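A toy model of this compile-on-first-use behavior may help. The following is Python, purely for illustration; the cache keyed on argument types stands in for Julia's per-method-specialization code generation, and the "compilation" here is just storing a callable.

```python
compiled = {}        # (function name, argument types) -> specialized code
compile_count = 0    # how many times we paid the "compilation" cost

def call(name, body, *args):
    global compile_count
    key = (name, tuple(type(a) for a in args))
    if key not in compiled:
        # First call with this combination of argument types: pay the
        # compilation cost once.  A real JIT would emit machine code.
        compile_count += 1
        compiled[key] = body
    return compiled[key](*args)

call("plot", lambda x: f"plot of {x}", [1, 2, 3])   # new types: compiles
call("plot", lambda x: f"plot of {x}", [4, 5, 6])   # cached: no delay
call("plot", lambda x: f"plot of {x}", "data.csv")  # new types: compiles
```

This is also why loading new packages can reintroduce latency: in the real system, newly defined methods can invalidate cached specializations, forcing entries out of the cache.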
For those used to compiled languages, such as Fortran, this is not a cause for concern; we are used to having to compile our programs before running them. But since Julia has been widely described as something like a better Python, many people curious enough to look into it have only had experience with interpreted languages. Julia offers the user a REPL like any other interpreted language, then expects them to wait a minute after they ask it to do something. For many, this is jarring; thus, the perennial complaint.
Timing Julia 1.6
The new version of Julia makes good progress toward eliminating some of the latency that led to these complaints. It significantly reduces pre-compilation time, is faster at downloading package files, and reduces the frequency of run-time JIT compilation delays. Due to the factors discussed above, latency in Julia can never be completely eliminated, but the current version creates a noticeably snappier interactive experience. Pre-compilation times in 1.6 are reduced mainly through the use of all available cores to carry out concurrent compilation of modules. Therefore, the more cores available, the greater the reduction in pre-compilation latency (to a point, of course).
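The parallel pre-compilation scheme can be sketched as follows. This Python sketch is a simplified invention (the package names and the wave-by-wave batching are made up): packages with no unmet dependencies are compiled concurrently, and dependents wait for the wave before them.

```python
from concurrent.futures import ThreadPoolExecutor

deps = {                      # package -> packages it depends on (invented)
    "RecipesBase": [],
    "Colors": [],
    "GR": ["Colors"],
    "Plots": ["RecipesBase", "GR"],
}

def batches(deps):
    """Group packages into waves whose members are mutually independent."""
    done, order = set(), []
    while len(done) < len(deps):
        wave = [p for p in deps if p not in done and set(deps[p]) <= done]
        order.append(wave)
        done.update(wave)
    return order

compiled = []
def precompile(pkg):
    compiled.append(pkg)      # stand-in for the real pre-compilation work

with ThreadPoolExecutor() as pool:
    for wave in batches(deps):
        # Every package in a wave pre-compiles concurrently; the next
        # wave starts only after the whole wave has finished.
        list(pool.map(precompile, wave))
```

The final, heavily depended-upon package (Plots, in this sketch) necessarily runs alone, which matches the observation below that the last stage of a real installation stays on one core.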
The 1.6 release was followed five weeks later by a patch release tagged v1.6.1. This minor release adds no new features, but it fixes a few bugs and applies several optimizations, along with some subtle enhancements, for example in the display of stack traces.
Below are the results of some timing trials comparing version 1.6.1 with 1.5.0 and with 1.0.5, which is the current long-term support (LTS) release, from September 2019. The next LTS release may be one of the 1.6 series, or it may be deferred to the 1.7 series; this decision will be made later. The machine on which I carried out these experiments has an Intel Core i5 processor, with two physical and four logical cores, and 4GB of RAM. The table below compares the timings for the three Julia versions. I measured all times with a stopwatch, as this was the most reliable way to get actual wall-clock times, which are the times that affect the user experience.
    Julia version | install + plot | download        | pre-compile | start + plot | Plots version
    1.0.5         | 417s           | 103s (5.1 MB/s) | 315s        | 30.0s        | 1.4.3
    1.5.0         | 599s           | 230s (3.1 MB/s) | 367s        | 38.5s        | 1.15.0
    1.6.1         | 306s           | 123s (5.1 MB/s) | 170s        | 19.2s        | 1.15.0
The first column shows the Julia version. The second column, "install + plot", shows the total time from entering the command:
Pkg.add("Plots"); using Plots; plot(rand(100))
until seeing the plot. It was timed on a fresh install of the given Julia version, with no other packages installed. [Update: As pointed out by a reader, Pkg must be imported before the command above will work; doing so is nearly instantaneous and was not included in the timings.]
The next two columns separate the times into downloading and pre-compiling times. The numbers in these columns do not always add exactly to the numbers in the second column, for several reasons. I had to depend on the display in the REPL to discern when different phases were in progress; the way this information is displayed varies greatly among the three versions in the table. In addition, there are delays during which it is not clear what is happening; the REPL indicates neither downloading nor pre-compiling. These delays are presumably JIT passes, and account for much of the 13-second discrepancy in the last row.
Download time measurements are necessarily subject to variations in my network conditions at the time, but I minimized this by averaging two installs for each version, interleaving the downloads among versions, and conducting all the timings within a span of three hours. I wiped out my ~/.julia directory before each of these timing runs, so each version started with a blank slate.
The "start + plot" column shows the amount of time it takes for subsequent plots to appear; after restarting the REPL, the following command is timed:
using Plots; plot(rand(100))
Once the first plot has been performed in a given REPL session,
additional simple plots will be nearly instantaneous.
The other times that are shown are for actions that do not
occur frequently; packages are installed once, and not again until a new
version is desired. The "start + plot" times, in contrast, affect the
startup time of any script that uses Plots, and influence the
responsiveness of interactive work when the REPL needs to be restarted
repeatedly. As the table shows, this time is cut about in half in the new
version.
Version differences
The times across different Julia versions cannot be compared meaningfully without taking a few other facts into account. As seen in the final column, the version of the package fetched with the add command is different for version 1.0.5. This earlier version of Plots is simpler and smaller. Fetching Plots downloads 523MB for Julia 1.0.5, 716MB for 1.5.0, and 629MB for the current version. Although the version of the Plots package is the same for the two recent Julia versions, the size of the payload is significantly different; while Plots is the same, the versions of some packages in its dependency graph may have changed. The more recent versions of Julia also download a larger number of smaller files, which may account for the observed faster average download speed of 1.0.5 compared with 1.5.0.
The significant improvement in download speed over version 1.5.0 is presumably due to an improved system for file transfers. In previous versions, when fetching a package with the add command, the Julia packaging system would search for a binary installed on the system that was capable of downloading files from URLs: curl, wget, etc. This was fragile, inconsistent, and slow: each download launched a new process that had to connect and negotiate TLS anew.
The new system uses libcurl for all downloads. It works through Downloads.jl, a new package in the standard library that exposes two functions, one for downloading files and the other for making HTTP requests. Now all of the files that are part of a package installation, which can be a large number for a complex package, are downloaded in-process and can reuse connections.
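A back-of-the-envelope model shows why connection reuse matters here. The numbers below are invented for illustration: the old scheme paid a connect-plus-TLS cost per file, while the new scheme pays it roughly once per session.

```python
def per_process_time(n_files, connect=0.3, transfer=0.1):
    # Old scheme: a fresh downloader process per file, each paying the
    # full connect + TLS handshake cost (all times in seconds, invented).
    return n_files * (connect + transfer)

def reused_connection_time(n_files, connect=0.3, transfer=0.1):
    # New scheme: one in-process libcurl session; the handshake cost is
    # paid once, and every file pays only its transfer time.
    return connect + n_files * transfer

# For 100 files under this model: 40.0s versus 10.3s.
```

The gap grows with the number of files, which fits the observation that complex packages with many small files benefited the most.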
Watching the CPU meters during the compilation phase confirms that the new version uses all available cores, and that previous versions use one core only. It also shows that the concurrent compilation in the new version seems to be confined to simultaneous compilation of different packages; when there was one package left, Plots itself, the work to complete that job stayed on one core. This suggests a possible opportunity for further speedups, by concurrently compiling functions within packages.
The use of concurrent compilation means that the speedup here will depend on the number of cores available and also, to some extent, on how the package is organized. Here is a report of compiling and loading the DifferentialEquations package, which I explored in a recent article, that shows version 1.6 taking about one minute to compile and load, while version 1.5 took eight minutes.
The reduction in the "start + plot" times can be attributed to improvements in the scheme for method invalidation that reduce the frequency of JIT recompilations. These improvements make package loading with the using keyword faster, and reduce latency overall.
Sysimages
One may wonder: if packages need to be pre-compiled, how is it possible for the REPL to start up instantly? After all, when that prompt appears, a slew of functions are already available, including the large and complex REPL system itself, along with its subsystems for documentation, package management, and more.
This is possible because all of that, along with other commonly used functions, is already compiled and included in a "sysimage": the binary that is loaded when you type julia. For a while now, the ability to create custom sysimages has been exposed to the user. So someone who routinely uses the Plots package along with, say, the Statistics package, could create a custom sysimage that already has these things pre-compiled. It will start up instantly, with all of the plotting and statistics functions ready to use.
A drawback to using sysimages is that the package versions compiled into them are, of course, frozen. Upgrading any of the ingredients of a sysimage means manually creating a new one. For those who want to learn how to do this, the PackageCompiler.jl package contains all of the functions required for creating sysimages; it comes with a manual complete with practical examples. A sysimage is not only useful for eliminating latency; it is the way to distribute Julia executable programs to those without the Julia runtime installed.
Final words
Since my last article here about Julia six months ago, the language has continued to increase its footprint in the world of high-performance and scientific computing. It has also undergone significant development, both in the language itself and in the ecosystem of packages for numerical, mathematical, and scientific applications.
Installation of Julia is simple. The latest releases can be found on the download page, as tarfiles for various operating systems and architectures. Simply download the appropriate file and unpack it, then create a link to the Julia binary somewhere in the executable path. After I expanded the tarfile for 1.6.1, I found that the installation occupied 393MB. Julia is also commonly found in the package managers of Linux distributions, but the versions there may be older than desired, depending on the distribution; for example, the version packaged with Debian 10 is 1.0.3, but Arch Linux is completely up to date.
Julia appears to have become established as one of a small handful of languages that are serious contenders for large-scale number-crunching projects. I'd like to highlight two interesting recent research initiatives with Julia at the core.
CliMA, the Climate Modeling Alliance, consists of researchers from Caltech, MIT, the Naval Postgraduate School, and NASA's Jet Propulsion Laboratory, who have joined to create an Earth-system simulation integrating data from multiple sources, including satellites and ground sensors. All of the code will be open source; simulation results and predictions will also be available to the public.
The system is based on a collection of Julia packages that are developed on GitHub. These range from the general-purpose fluid dynamics solver Oceananigans that I played with in this article, to VizCLIMA, which is a specialized tool for visualizing the results of CliMA simulations. One exciting thing about this type of open-source science is that anyone can check out current versions of any of these packages and calculate with them, as well as contribute improvements and bug fixes.
Julia Computing, a company established to create products and offer consulting to support the use of the language, has teamed with quantum-computing startup QuEra on a DARPA contract to apply machine learning to microelectronic system design. This project uses Julia code to train a reduced model of an electronic circuit that can be, potentially, several orders of magnitude faster to run than a detailed simulation, while remaining accurate. QuEra's goal is to acquire tools to design the control electronics for their quantum computers, a task for which they have found existing circuit-simulation technology inadequate.
The success of Julia in the scientific-computing sphere is an important development at the intersection of science, engineering, and free software (in the free speech sense). Until the advent of Julia, the only programming language with comparable influence and ubiquity in the science world was Fortran.
While there are capable free compilers for Fortran, large-scale computations using the language are usually performed using proprietary compilers created by chip manufacturers such as Intel. This is because these compilers squeeze the best performance out of particular CPUs. In contrast, the LLVM compiler used by Julia is a free-software project. Also, there is a strong tradition within the community of scientists using Julia to expose the code used in research to scrutiny, usually on GitHub. These are healthy developments for science, where, in the interests of transparency and reproducibility, ideally there should be no black boxes.
With this new release, Julia is easier than ever to get started with for those interested in exploring whether it might be suitable for their work or their play. On the timescale defined by Fortran, it's still a young language. And as Fortran has continuously evolved since the 1950s in response to the needs of its users, Julia will undoubtedly evolve as well, in directions that are unpredictable. It already occupies a unique position as a language that is both friendly to use as an interactive calculator and capable of running the most demanding number-crunching applications. Even for those not involved with scientific computing, I recommend looking into Julia as an example of interesting language and system design.
Page editor: Jonathan Corbet
