
LWN.net Weekly Edition for May 18, 2023

Welcome to the LWN.net Weekly Edition for May 18, 2023

This edition contains the following feature content:

This week's edition also includes these inner pages:

  • Brief items: Brief news items from throughout the community.
  • Announcements: Newsletters, conferences, security updates, patches, and more.

Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.

Comments (1 posted)

Democratizing AI with open-source language models

May 17, 2023

This article was contributed by Koen Vervloesem

When OpenAI made its chatbot ChatGPT available to the public in November 2022, it immediately became a hit. However, despite the company's name, the underlying algorithm isn't open. Furthermore, ChatGPT users require a connection to OpenAI's cloud service and face usage restrictions. In the meantime, several open-source or freely available alternatives have emerged, with some even able to run on consumer hardware. Although they can't match ChatGPT's performance yet, rapid advancements are occurring in this field, to the extent that some people at the companies developing these artificial intelligence (AI) models have begun to worry.

ChatGPT is presented as a bot that users interact with, which generates human-like text based on their input. OpenAI has fine-tuned it specifically for engaging in conversations and providing contextually appropriate responses. It's capable of handling a variety of tasks, such as generating content or ideas, translating languages, answering questions, and even providing suggestions for code in various programming languages. ChatGPT also responds to follow-up questions, challenges incorrect premises, and rejects inappropriate requests. How does this work? Under the hood, ChatGPT uses a neural network trained on vast amounts of text to generate new content based on the input it receives. Think of it as an advanced form of autocomplete suggestions.

A neural network is a learning algorithm inspired by how our brains function. It consists of a large number of nodes, known as neurons, that receive input from other neurons and perform a mathematical function to calculate their output, which then goes to other neurons. Each input has a weight attached to it that determines how much that value contributes to the result. The neural network's architecture (how the layers of neurons are connected) and its weights determine its functionality.

The network starts with random weights; therefore, when it receives text as input (we're glossing over some details, such as how text is encoded in numbers), its output is also random. As a result, the network has to be trained using training data: an input text with a corresponding output text. Each time the training data input enters the network, its output is compared with the training data's corresponding output. The weights are then adjusted to decrease the difference between the predicted and correct output. In this way, the network undergoes a learning process until it becomes proficient at predicting text.
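
To make the idea concrete, here is a toy Python sketch (not taken from any real model) of that weight-adjustment loop, using a single "neuron" with one weight; real language models apply the same principle across billions of weights:

    # Toy illustration: one "neuron" with a single weight, trained to turn
    # the input 2.0 into the target output 6.0 by repeatedly nudging the
    # weight in the direction that reduces the error.
    weight = 0.5          # start from an arbitrary (random) weight
    x, target = 2.0, 6.0  # one training example: input and desired output

    for step in range(200):
        prediction = weight * x       # the neuron's output
        error = prediction - target   # how far off it is
        gradient = error * x          # direction in which to adjust the weight
        weight -= 0.01 * gradient     # small step toward a better weight

    print(weight)  # converges toward 3.0, so that weight * x is close to target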

Such a large neural network capable of generating human-like text is called a large language model (LLM). It typically involves billions to several hundred billion weights, also known as parameters. ChatGPT is based on some large language models developed by OpenAI, with names like GPT-3.5 or (the most recent version) GPT-4. GPT stands for Generative Pre-trained Transformer and is a specific type of large language model, introduced by OpenAI in 2018, and based on the Transformer architecture invented by Google in 2017. Since GPT-3.5, OpenAI hasn't disclosed the size of its models; GPT-3 (released in May 2020) had 175 billion parameters and was trained on 570GB of text.

When large language models are trained on a broad range of data, as is the case with GPT-3.5 and GPT-4, they are also known as foundational models. Their broad training makes them adaptable to various tasks. A foundational model can be fine-tuned by training the model (or part of it) with new data for a specific task or a specific subject-matter domain. This is what OpenAI has done with ChatGPT: it has fine-tuned its foundational GPT models with conversations in which humans played both the user and the AI role. The result is a model specifically fine-tuned to follow a user's instructions and provide human-like responses.

BLOOM

Companies developing large language models lack incentives to open-source their models and the code to run them, since training the models requires significant computing power and financial investment. To make the development of these models sustainable, companies need to be able to build a profitable business around them. OpenAI aims to do this by offering the paid ChatGPT Plus subscription and its pay-per-use API access.

Last year, the situation changed with BLOOM (BigScience Large Open-science Open-access Multilingual Language Model), which is freely available. This large language model was the result of a global collaboration involving over a thousand scientists from more than 250 institutions participating as volunteers in the BigScience collective. The project was started by Hugging Face, a company that provides a machine-learning platform, with contributions from NVIDIA, Microsoft, and the French research institution CNRS.

Development of BLOOM occurred entirely in public. The model was trained for 3.5 months on the Jean Zay supercomputer in Paris, utilizing 384 NVIDIA A100 Tensor Core GPUs each with 80GB of RAM. The data set comprised 1.6TB of text (341 billion words) in 46 human languages and 13 programming languages, and the model had 176 billion parameters, which is comparable to GPT-3.

BigScience developed a new license to publish BLOOM: the Responsible AI License (RAIL). Its main purpose is to minimize risks arising from irresponsible use of the model. According to the license, users are not allowed to use the model to generate false information with the intent to harm others; nor may they neglect to disclose that the generated text was machine-generated. RAIL is not an open-source license and does not appear on the list of OSI-approved licenses. Developers can use the BLOOM model with Hugging Face's Apache-2-licensed transformers library in their own code, subject to the terms of RAIL.
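
As a rough sketch of what that use can look like, the following assumes the transformers library (and PyTorch) is installed and loads one of the smaller BLOOM checkpoints, bigscience/bloom-560m, to keep the download manageable; the prompt and generation settings are arbitrary examples:

    # Minimal sketch: load a small BLOOM checkpoint and generate some text.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    name = "bigscience/bloom-560m"   # a small BLOOM variant; the full model is far larger
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name)

    inputs = tokenizer("The BLOOM model was trained on", return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=30)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))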

Smaller large language models

A downside of BLOOM is that it's still too large for convenient local use. In principle, anyone could download the 330GB model, but using it would require substantial hardware. BigScience also released smaller versions of the model, and this trend of smaller models was continued by others. In February, Meta announced a language model called LLaMA, which is available in versions with sizes of seven billion, 13 billion, 33 billion, and 65 billion parameters. According to the developers, the 13B version performs as well as OpenAI's GPT-3, while being a factor of ten smaller. And unlike GPT-3, which requires multiple A100 GPUs to operate, LLaMA-13B needs only one GPU to achieve the same performance.

Meta trained LLaMA on publicly available data sets, such as Wikipedia and Common Crawl. The code to run LLaMA is GPLv3-licensed, but to obtain the full weights of the model, users were required to fill out a form and agree to a "non-commercial bespoke license". Moreover, Meta proved to be quite selective in granting access. But within a week, the weights were leaked on BitTorrent, and LLaMA kickstarted the development of a lot of derivatives. Stanford University introduced Alpaca 7B, based on the LLaMA model with seven billion parameters and supplemented with instructions based on OpenAI's text-davinci-003 model of the GPT-3.5 family. Both the data set and the model were released under the CC BY-NC 4.0 license and thus do not permit commercial use. One reason for this is that OpenAI's terms of use disallow the development of models that compete with OpenAI.

Subsequently, the Large Model Systems Organization, an open research group, published Vicuna, a LLaMA-based model fine-tuned on 70,000 user conversations with ChatGPT. This was accomplished using ShareGPT, which is a browser extension for Google Chrome that is designed to easily share ChatGPT conversations. According to the researchers, Vicuna achieves 90% of ChatGPT's quality and outperforms LLaMA and Alpaca in 90% of cases. Both the code (Apache 2 license) and the weights (13 billion parameters subject to LLaMA's license) have been made public. However, since Vicuna is based on LLaMA and on output from ChatGPT, commercial use is not allowed.

This issue prompted US software company Databricks to develop an open-source large language model suitable for commercial use: Dolly 2.0. It is based on EleutherAI's Pythia model with 12 billion parameters and was trained on the Pile text data set, then fine-tuned on 15,000 instructions with answers. To achieve this, the company engaged more than 5,000 of its employees. Dolly is trained on open-ended questions, closed questions, extracting factual information from texts, summarizing texts, brainstorming, classification, and creative writing tasks, all of it in English only. The 23.8GB dolly-v2-12b model can be downloaded from Databricks' page on Hugging Face. The model uses the MIT license, while the databricks-dolly-15k data set is published under the CC BY-SA 3.0 license.

Following this, Stability AI, the creator of the Stable Diffusion open-source model for generating images, published its own family of large language models: StableLM, under a CC BY-SA 4.0 license. Additionally, MosaicML introduced its MPT-7B family of open-source commercially usable large language models (some of them Apache 2 licensed). Another interesting development is BigCode, a project kickstarted by ServiceNow Research and Hugging Face to develop large language models for completing and writing code from other code and natural language descriptions. Their first model, StarCoder, has been trained on permissively licensed data from GitHub and uses the OpenRAIL license, an updated version of the Responsible AI License.

Crowdsourcing open data sets

With new open-source (or freely available) language models emerging regularly (many of which can be found in the awesome-totally-open-chatgpt repository), various organizations have started considering ways to streamline the development of data sets for training models. One such organization is LAION (Large-scale Artificial Intelligence Open Network), a non-profit research organization aiming to democratize AI. With Open Assistant, it plans to develop large language models capable of running on consumer hardware.

Open Assistant is still under development, and currently focuses mainly on collecting data sets with the help of users. The project already boasts a data set of 600,000 interactions, contributed by 13,000 volunteers. Everyone can lend a hand in this endeavor, as explained in the documentation. For example, users are tasked with grading an answer provided by another person, based on parameters such as quality or politeness. Another task involves offering an answer in the role of a chatbot to a user's request. Volunteers may also be asked to select the best response from two possible answers. Tasks are available not only in English, but in many other languages as well. The researchers intend to train language models using the data set generated by these volunteer tasks. It's worth noting that OpenAI has a similar approach with ChatGPT: the company pays (low-wage) contractors to assist in training its language model and to help identify toxic content.

Running language models on consumer hardware

Running large language models with tens to hundreds of billions of parameters on consumer hardware is not feasible. However, with the trend of smaller language models, initiated by LLaMA, operating ChatGPT-like software on a PC becomes possible. An important project for running these models is Georgi Gerganov's MIT-licensed llama.cpp. It enables users to run LLaMA, Alpaca, and other LLaMA-based models locally on their computers. It runs entirely on the CPU, which is made possible by applying 4-bit quantization to the models: the precision of the weights is reduced to four bits, cutting both memory consumption and computational complexity. Llama.cpp supports Linux, macOS, and Windows. Users can chat with the model via a command-line interface.
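
The principle behind that quantization is simple to sketch; llama.cpp's real formats (such as Q4_0) work on fixed-size blocks of weights and are more refined, but the toy Python below shows the basic idea of mapping floating-point weights onto a handful of integer levels plus a shared scale factor:

    # Toy 4-bit quantization: map a block of float weights onto small
    # integers in the signed 4-bit range (-8..7) plus one scale per block.
    weights = [0.12, -0.07, 0.33, -0.41, 0.05, 0.29, -0.18, 0.02]

    scale = max(abs(w) for w in weights) / 7        # one float shared by the block
    quantized = [round(w / scale) for w in weights] # small integers, 4 bits each
    restored = [q * scale for q in quantized]       # approximate original weights

    for w, r in zip(weights, restored):
        print(f"{w:+.3f} -> {r:+.3f}")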

The model has to be downloaded first, for example from Hugging Face's model hub, and then converted into a specific format. Instructions to do this are provided in llama.cpp's README file. As a test, I downloaded the 8.14GB model ggml-vic13b-q4_0.bin, which was already in the correct format for llama.cpp. The program loads the model fully in RAM, so it's crucial to ensure the computer has sufficient free RAM.

After compiling llama.cpp, running the model in interactive mode is as simple as:

    $ ./main -m models/ggml-vic13b-q4_0.bin -i

On my somewhat dated laptop with a four-core Intel Core i7-10510U 1.8GHz CPU and 24GB of RAM, this took 20 seconds to start up. After posing the question "What is the most common way of transportation in Amsterdam?", the Vicuna model began to generate its response word by word, taking 15 minutes to complete the task. The resulting answer was vague but not incorrect, resembling a text from a travel guide:

Amsterdam, the capital city of the Netherlands, is known for its picturesque canals and bustling bike culture. It's no surprise that the most common way to get around Amsterdam is by bicycle. However, there are also plenty of other transportation options available in the city, including buses, trams, trains, and even boats. Whether you prefer two wheels or four, there's a mode of transportation in Amsterdam that will suit your needs.

Various projects have emerged that aim to make using open-source language models more akin to a ChatGPT-like experience. For example, Nathan Sarrazin has been working on Serge, a web-based interface for llama.cpp. It runs in a Docker container, allowing users to download a model from a list, choose its settings, and then initiate a chat. Another similar undertaking is oobabooga's text-generation-webui, which supports both CPUs and GPUs for running models. Using GPUs can significantly improve the performance.

Is open-source AI gaining an edge?

In early May, a document penned by Google's Luke Sernau was leaked: "We have no moat, and neither does OpenAI". The author contends that the open-source community has been rapidly catching up with commercial efforts: "They are doing things with $100 and 13B params that we struggle with at $10M and 540B. And they are doing so in weeks, not months." Sernau also clearly recognizes the advantages of the open-source development model. After detailing the numerous innovations that have occurred within a month of LLaMA's weights being leaked, he notes that anyone can tinker: "Many of the new ideas are from ordinary people." The barrier to entry to contribute to these open-source large language models is just "one person, an evening, and a beefy laptop".

Sernau continues with lessons for Google (and OpenAI), focusing on LoRA (low-rank adaptation), a technique that accelerates fine-tuning of models and is already used by a lot of open-source projects in this domain. Thanks to LoRA, almost anyone with an idea can generate a fine-tuned model in under a day and for around $100. "At that pace, it doesn't take long before the cumulative effect of all of these fine-tunings overcomes starting off at a size disadvantage", he said, adding that focusing on maintaining some of the largest models on the planet actually puts Google at a disadvantage. At the end, he even made the case for opening Google's large language models. He perceives Meta as the clear winner in all of this, because most open-source innovation is happening on top of its LLaMA model architecture.
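
The core idea of LoRA is easy to illustrate: the fine-tuned layer computes with W + BA, where the frozen pretrained matrix W is left untouched and only two small low-rank factors A and B are trained. A sketch with NumPy, using made-up dimensions:

    # LoRA in miniature: the effective weights are W + B @ A, where A and B
    # have rank r.  Only A and B are updated during fine-tuning.
    import numpy as np

    d, r = 4096, 8                      # layer width and LoRA rank (example values)
    W = np.random.randn(d, d)           # frozen pretrained weights
    A = np.random.randn(r, d) * 0.01    # trainable low-rank factor
    B = np.zeros((d, r))                # starts at zero: no change initially

    W_effective = W + B @ A             # what the fine-tuned layer uses

    print("full fine-tune parameters:", W.size)          # about 16.8 million
    print("LoRA parameters:         ", A.size + B.size)  # about 65 thousand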

If Sernau is right, it means that language models could become a commodity, fueled by the innovative nature and fast-paced development model of the open-source AI community. This would enable researchers, non-profit organizations, and small businesses to access AI capabilities without depending on cloud-based services or expensive subscription fees. Looking back at the numerous language models that have been published during the last few months, we can wonder how long it will take before we have some usable, completely open-source, AI assistants to help us with our daily tasks.

Comments (66 posted)

Faster CPython at PyCon, part two

By Jake Edge
May 12, 2023

PyCon

In part one of the tale, Brandt Bucher looked specifically at the CPython optimizations that went into Python 3.11 as part of the Faster CPython project. More of that work will be appearing in future Python versions, but on day two of PyCon 2023 in Salt Lake City, Utah, Mark Shannon provided an overall picture of CPython optimizations, including efforts made over the last decade or more, with an eye toward the other areas that have been optimized, such as the memory layout for the internal C data structures of the interpreter. He also described some additional optimization techniques that will be used in Python 3.12 and beyond.

Background

Shannon said that he had been thinking about the ideas for speeding up CPython for quite some time; he showed a picture of him giving a presentation at EuroPython 2011 on the subject. He has been researching virtual machines and performance improvements for them since 2006 or so. He wanted to think more in terms of time spent, rather than speed, however. If you wanted to achieve a 5x speedup, that is an 80% reduction in the execution time.

[Mark Shannon]

In order to achieve these performance increases, it is important to consider the performance of the entire runtime; if you are able to speed up 90% of a program by nine times, but at the cost of slowing down the remaining 10% nine times as well, there is no difference in the execution time. Even making 80% of the program 10x faster at the cost of a 3x slowdown for the remainder only reduces the execution time to 68% of what it originally was.
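
Those figures are just weighted sums of the execution times of the two parts, as a quick check shows:

    # New execution time, as a fraction of the original, when a fraction
    # "part" of the program is sped up by "speedup" and the rest is slowed
    # down by "rest_slowdown".
    def new_time(part, speedup, rest_slowdown):
        return part / speedup + (1 - part) * rest_slowdown

    print(new_time(0.9, 9, 9))    # 1.0  -- no net change
    print(new_time(0.8, 10, 3))   # 0.68 -- 68% of the original time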

People focus on the just-in-time (JIT) compiler for the V8 JavaScript engine, but that is only part of what allows that engine to be so fast. It has "incredibly sophisticated garbage collection", optimized object layouts, and "all sorts of clever things to speed up many aspects of what it has to do". For example, it does its garbage collection incrementally every 1/60th of a second, so as not to disturb animations. "Yes it has a just-in-time compiler, but there's many other parts to it".

There are some guiding principles that the Faster CPython project is following in order to improve the performance of the language. The first is that "nothing is faster than nothing"; the "best way to make something faster is to not do it at all". The project bends the rule a bit by, say, only doing something once ahead of time, rather than doing it over and over as the program executes.

Another principle involves speculation, which is making guesses about future behavior based on past behavior. A CPU's hardware branch predictor does the same kind of thing; it speculates on which branch will be taken based on what has come before (though, of course, we know now that hardware speculation comes with some dangers). The interpreter can take advantage of speculation; if the previous 99 times it added two things together they were always integers, it is pretty likely they will be integers the 100th time as well.

Memory layout

Efficient data structures are another important part of what the project is working on; by that he means "how we lay stuff out in memory". The goal is to have "more compact and more efficient data structures", which will require fewer memory reads, so that more of the data the program needs lives in the caches. To start with, he wanted to talk about reductions in the size of the Python object, which has mostly already been done at this point. He gave an example do-nothing class:

    class C:
        def __init__(self, a, b, c, d):
            self.a = a
            ...

If you look at an instance, it has a simple, obvious instance dictionary:

    >>> C(1, 2, 3, 4).__dict__
    {'a': 1, 'b': 2, 'c': 3, 'd': 4}

Back in the "olden days" of Python 2.7—"maybe some of you are lucky enough and young enough to not remember 2.7, but most of us do"—even up through Python 3.2, the Python objects used for representing an instance were complicated, weighing in at 352 bytes (on a 64-bit machine). The object itself is relatively small, but it points to two other objects: a reference to the class object (i.e. C) and another for the instance __dict__. The class reference is shared by all of the instances; for 1000 instances, the price of that object is amortized, so he was ignoring that. Data that can be shared between instances can be similarly ignored, thus this sharing is desirable.

But the __dict__ is specific to each instance and contains a hash table with keys and their hashes that are identical for each instance, which is redundant. So in Python 3.3, the keys and hashes were moved into a shared structure, which reduced the size to 208 bytes per instance. The values were still stored in a table with space for additional keys, but that went away in Python 3.6 with the addition of compact dictionaries, which had the side effect of causing dictionaries to maintain their insertion order. The compact dictionaries dropped the size to 192 bytes.

There were still some small inefficiencies in the object header because there were three word-sized garbage-collection header fields, which meant another word was added for alignment purposes. In Python 3.8 one of those garbage-collection fields was removed, so the alignment padding could be as well. That reduced the cost of each instance to 160 bytes, "which is already less than half of where we started".

But, in truth, the dictionary object itself is actually redundant. Nearly all of the information that the object has can be gotten from elsewhere or is not needed. It has a class reference, but that is already known: it is a dict. The keys can be accessed from the shared C class object and the table of values can be moved into the instance object itself. So that stuff was eliminated in Python 3.11, reducing the size per instance to 112 bytes.

Python 3.12 will rearrange things a bit to get rid of another padding word; it also shares the reference to the values or __dict__ by using a tag in the low-order bit. The __dict__ is only used if more attributes are added to the instance than the four initial ones. That results in 96 bytes per instance. There are some more things that could be done to perhaps get the size down to 80 bytes in the future, but he is not sure when that will happen (maybe 3.14 or 3.15).

So, from Python 2.7/3.2 to the hypothetical future in a few years, the size of an instance of this object has dropped from 352 to 80 bytes, while the number of memory accesses needed to access a value dropped from five to two. That is still roughly twice as much work (and memory) as Java or C++ need, but it was five times as much work and memory at the start. There is still a price for the dynamism that Python provides, but to him (and he hopes the audience agrees) it has been reduced to a "reasonable price to pay".

Interpreter speedups

He switched over to looking at what has been done on speeding up the interpreter over the years as an introduction to what is coming on that front in the future. Unlike reducing object sizes, not much work has gone into interpreter speedups until quite recently. In 3.7, method calls were optimized so that the common obj.method() pattern did not require creating a temporary callable object for the method (after the attribute lookup) before calling it. In addition, the values for global names started to be cached in Python 3.8, so instead of looking up, say, int() in the Python builtins every time it was needed, the cache could be consulted; global variables were treated similarly. Looking up builtins was somewhat costly since it required checking the module dictionary first to see if the name had been shadowed; now the code checks to see if the module dictionary has changed and short-circuits both lookups if it has not.

The PEP 659 ("Specializing Adaptive Interpreter") work went into Python 3.11; it is focused on optimizing single bytecode operations. But he would not be covering that since Bucher had given his talk the previous day. In fact, Shannon suggested that people watching his talk online pause it to go watch Bucher's talk; since Bucher had done much the same thing in his talk, it made for a bit of mutually recursive fun that the two had obviously worked out in advance.

The future work will be focused on optimizing larger regions of code; "obviously one bytecode is as small a region as we can possibly optimize", Shannon said, but optimizers "like bigger chunks of code" because it gives them more flexibility and opportunities for improving things. Some of this work would likely appear in 3.13, but he was not sure how much of it would.

He used a simple function add() that just adds its two arguments; it is a somewhat silly example, but larger examples do not fit on slides, he said. If a particular use of the function needs optimization, because it is done frequently, the bytecode for add() can effectively be inlined into a use of it. But, because of Python's dynamic nature, there must be a check to determine if the function has changed since the inlining was done; if so, the original path needs to be taken. Then, the specialization mechanism (which Bucher covered) can be used to check that both operands are integers (assuming the profiling has observed that is what is normally seen here) and perform the operation as a "considerably faster" integer-addition bytecode.
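
The specialization part of this can be observed with the standard dis module; the exact opcode names differ between CPython versions, and on 3.11 or later dis.dis() accepts an adaptive flag that shows the specialized instructions the interpreter substitutes once a function has run often enough (a sketch, not taken from Shannon's slides):

    import dis

    def add(a, b):
        return a + b

    # Warm the function up so the specializing interpreter (Python 3.11+)
    # has seen plenty of integer additions.
    for i in range(1000):
        add(i, 1)

    dis.dis(add)                    # the generic bytecode
    # On Python 3.11 or later, the following shows the adaptive form, in
    # which the generic addition is replaced by an integer-specific one:
    # dis.dis(add, adaptive=True)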

That specialization enables a more powerful optimization, partial evaluation, which is a huge area of research that he said he could only scratch the surface of in the talk. The idea is to evaluate things ahead of time so that they do not have to be recomputed each time. His add() example had optimized the following use of the function:

    a = add(b, 1)

But there are parts of even the optimized version that can be removed based on some analysis of what is actually required to produce the correct result. The first inlined and specialized version of that statement required 13 bytecode operations, some of which are rather expensive.

Doing a kind of virtual execution of that code, and tracking what is needed in order to produce the correct result, reduced that to five bytecode instructions. It effectively only needs to check that b is an integer, then does the integer addition of the two values and stores the result. "What I've clearly shown here is that with a suitably contrived example you can prove anything," he said with a grin, but "this is a real thing" that can be used for Python. When the video becomes available, which should hopefully be soon, it will be worth watching that part for those interested in understanding how that analysis works. [Update: Video link]

Combining

The optimization techniques that he has been talking about can be combined to apply to different problem areas for Python's execution speed, he said. As described, the bytecode interpreter benefits from partial evaluation, which depends on specialization. Once the bytecode sequences have been optimized with those techniques, it will be worth looking at converting them directly to machine code via a JIT compiler. Meanwhile, the cost of Python's dynamic features can be drastically reduced using specialization for the places where that dynamic nature is not being used.

The better memory layout for objects helps with Python's memory-management performance, which can also be augmented with partial evaluation. Another technique is to "unbox" numeric types so that they are no longer handled as Python objects and are simply used as regular numeric values.

While much of Python's garbage collection uses reference counting, that is not sufficient for dealing with cyclic references from objects that are no longer being used. Python has a cyclic garbage collector, but it can be a performance problem; that area can be improved with better memory layout. It may also make sense to do the cyclic collection on an incremental basis, so that different parts of the heap are handled by successive runs, reducing the amount of time spent in any given invocation of it.

The C extensions are another area that needs attention; specialization and unboxing will help reduce the overhead in moving between the two languages. A new C API could help with that as well.

So there are various aspects of running Python programs that need to be addressed and multiple techniques to do so, but there is "quite a lot of synergy here". The techniques help each other and build on each other. "Python is getting faster and we expect it to keep getting faster". The upshot is to upgrade to the latest Python, he concluded, to save energy—and money.

After the applause, he did put in his normal plea for benchmarks; the team has a standard set it uses to guide its work. "If your workloads are not represented in that set, your workloads are not necessarily getting any faster". He had time for a question, which was about the full-program memory savings from the reductions in the object size, but Shannon answered with a common refrain of his: "it depends". The savings seen will be workload-dependent but also dependent on how much Python data is being handled; in truth, he said, the layout optimizations were mostly done for the purposes of performance improvement, with memory savings as a nice added benefit.

[I would like to thank LWN subscribers for supporting my travel to Salt Lake City for PyCon.]

Comments (6 posted)

1½ Topics: realtime throttling and user-space adaptive spinning

By Jonathan Corbet
May 13, 2023

OSSNA
The Linux CPU scheduler will let realtime tasks hog the CPU to the exclusion of everything else — except when it doesn't. At the 2023 Open Source Summit North America, Joel Fernandes covered the problems with the kernel's realtime throttling mechanism and a couple of potential solutions. As a bonus, since the room was unscheduled for the following slot, attendees were treated to a spontaneous session on adaptive spinning in user space run by André Almeida.

Realtime throttling

Fernandes began with a quick overview of the scheduling classes supported by the kernel. The class known as the completely fair scheduler (CFS) is the normal, default scheduler; it treats all tasks equally and shares the CPU between them. The realtime scheduler implements two scheduling classes (FIFO and round-robin) at a higher priority; at any given time, the highest-priority realtime task will be given access to the CPU. Or, at least, that is true if there are no tasks running in the deadline class. These tasks do not have priorities; instead, they tell the system how much CPU time they need and how soon they must get it. The deadline scheduler will then select tasks in a way that ensures that they all meet their deadlines.

[Joel Fernandes]

The Chrome OS system is, Fernandes said, "scheduler-heavy". The browser that forms the user interface for the system runs as a whole set of cooperating processes, each of which consists of multiple threads. The browser process handles overall user interaction, render processes fill pages with content, and the "viz process" handles the graphics. Numerous threads are involved when something happens. Fernandes traced a "mouse down" event through the browser, outlining about ten thread transitions — and that was if the event did not result in any display changes. If something delays any one of those threads from running, the result will be increased input latency.

Looking at the system, Fernandes found that the interrupt-handling thread that first receives the event runs at realtime priority; everything else was in the CFS scheduling class with a mix of priorities. He wondered if simply moving everything to the realtime class would help the situation; an experiment showed a 32% decrease in mouse-event latency, which would seem to indicate that the change is worthwhile. The only problem is that some of the threads involved can, for example, run JavaScript programs and stay on the CPU for a long time. If the user loads a page containing a cryptocurrency-mining program, they are likely to be unhappy if it runs at realtime priority; nothing else will be able to run at all.

The scheduler developers tried to address this kind of problem many years ago through a mechanism known as realtime throttling, which limits realtime tasks to 95% of the available CPU time. Once that limit has been hit, the realtime task(s) will be set aside for a little while to allow CFS tasks to run, hopefully giving a system administrator a chance to deal with a runaway realtime system. The problem with realtime throttling, Fernandes said, is that it is "horrible and broken". Specifically, that last 5% of the available CPU time will never be given to realtime tasks, even if the alternative is for the CPU to go idle. That wastes CPU time at a time when a realtime task would like to be running.
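
That 95% limit is exposed through two sysctl knobs, and moving a task into the realtime class is a single (privileged) system call; the following Python sketch is purely illustrative and is not taken from the talk:

    import os

    # The throttling knobs: by default realtime tasks get 950000us out of
    # every 1000000us period, i.e. 95% of the CPU.  A runtime of -1 means
    # that throttling is disabled entirely.
    with open("/proc/sys/kernel/sched_rt_period_us") as f:
        period = int(f.read())
    with open("/proc/sys/kernel/sched_rt_runtime_us") as f:
        runtime = int(f.read())
    if runtime < 0:
        print("realtime throttling is disabled")
    else:
        print(f"realtime tasks may use {100 * runtime / period:.0f}% of each period")

    # Moving the current process into the realtime (FIFO) class requires
    # CAP_SYS_NICE or root, so expect PermissionError otherwise.
    try:
        os.sched_setscheduler(0, os.SCHED_FIFO, os.sched_param(10))
    except PermissionError:
        print("not privileged enough to switch to SCHED_FIFO")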

Within the kernel, the throttling mechanism works by simply removing the realtime run queues from the scheduling hierarchy, making them invisible to the task-selection code, until a suitable amount of time passes. So the way to improve the situation seemed clear enough: simply keep the realtime run queue in place, but skip over it if the system is meant to be throttling — unless the CPU would go idle. An implementation was put together, but it did not survive its encounter with the scheduling community at the recently concluded OSPM Summit.

As Fernandes put it, the scheduler developers are uninterested in fixes to the realtime-throttling code. They would rather do away with it completely, but they do not want to lose the ability to recover a runaway system. The alternative approach that emerged from the OSPM discussion was to implement throttling by temporarily boosting the CFS scheduler into the deadline class, putting it above realtime tasks in the hierarchy. This idea, Fernandes said, "might work", and it would be somewhat less complex.

The specific changes required are, in large part, the result of contributions from Peter Zijlstra and Juri Lelli. The plan is to create a fake deadline task (called fair_server) that would be given 5% of the available CPU time. The first CFS task to wake on a given CPU will (within the kernel) start the deadline server task, while the last task to sleep will stop it. This task will run CFS tasks as usual, but within the deadline class, which will automatically limit the running time to 5% of the available time.

There are some details to deal with, such as ensuring that the fair_server task runs at the end of the throttling period rather than the beginning. But it appears that there is a solution for realtime throttling in sight; stay tuned.

Adaptive spinning in user space

Mutexes and similar mutual-exclusion primitives can be found in both the kernel and in user space. The kernel has a distinct advantage, though, in that it can employ adaptive spinning in its locks. If a mutex is found to be unavailable when needed, the losing thread can simply block until the mutex is freed again. However, better performance will usually be had if, before blocking, the thread simply spins for a while waiting for the lock to be freed. Often, that will happen quickly, the lock can change hands immediately, and a fair amount of overhead is avoided.

[André Almeida]

This technique only works, though, if the thread holding the lock is known to be executing on another CPU. If that thread is blocked waiting for something else, the wait for the lock to become free could go on for a long time. Even worse, if that thread is waiting for the CPU that is spinning on the lock, that spinning will actively prevent the lock from being freed. So kernel code will only spin in this way after verifying that the lock holder is actively running somewhere else.
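
The spin-then-block control flow itself is easy to sketch in user space; the missing piece, discussed next, is the owner-is-running check, which appears only as a comment in this toy Python version using an ordinary threading lock:

    import threading

    SPIN_LIMIT = 100          # arbitrary; a real implementation would tune this

    def adaptive_acquire(lock):
        # Optimistic phase: spin briefly in the hope that the holder,
        # running on another CPU, releases the lock almost immediately.
        for _ in range(SPIN_LIMIT):
            if lock.acquire(blocking=False):
                return
            # A kernel mutex would stop spinning here unless the owner is
            # known to be running on another CPU -- the check that user
            # space currently has no cheap way to make.
        # Pessimistic phase: give up and block until the lock is free.
        lock.acquire()

    lock = threading.Lock()
    adaptive_acquire(lock)    # uncontended case: acquired on the first spin
    lock.release()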

User space is unable to make that determination, though, so it is unable to safely spin on a contended lock. Almeida would like to improve that situation. Getting there, he said, requires making two separate pieces of information available to the thread that is trying to acquire a futex (the user-space mutex implementation in Linux): which thread currently holds the futex, and whether that thread is currently running somewhere in the system.

The current owner of the futex is actually a difficult thing for the kernel to provide: it simply does not know the answer. The futex code has been carefully crafted to avoid involving the kernel at all when the lock is uncontended. To get around this, Almeida suggests creating a convention where the ID of the owning thread is stored in the lock when it is acquired. Any other thread wanting to acquire the lock could immediately see which thread currently owns it.

There is also currently no way to ask the kernel whether a given thread is running or not. The information available in /proc only provides process-level information. Almeida asked whether it would be possible to add a system call to query the running status of a thread, but Steve Rostedt pointed out that a system call would defeat the purpose. The cost of entering the kernel to get that information would far outweigh the gain from spinning on the futex. Any solution to this problem has to work purely in user space, he said.

He continued by saying that there might just be a solution. The recently merged user trace events mechanism allows the kernel to inform user space when a specific tracepoint has been enabled by setting a bit in shared memory. Something similar could perhaps be set up to indicate which threads are running at any given time. An alternative, suggested by your editor (who had just conceived the idea and recognizes that it probably lacks merit) might be to hook into the restartable sequences virtual CPU ID feature, which is designed to efficiently communicate information about which CPU each thread is running on at any given time.

Almeida left the session with some thoughts on how to further pursue this idea. He will have his work cut out for him, though; the desire to implement adaptive spinning futexes has come and gone before. Darren Hart made an attempt at an implementation in 2010, and Waiman Long tried in 2016, but neither approach was merged. Maybe the third spin will be the charm for this challenge.

Comments (8 posted)

The 2023 LSFMM+BPF Summit

By Jonathan Corbet
May 11, 2023

LSFMM+BPF
The 2023 Linux Storage, Filesystem, Memory-Management, and BPF Summit was held May 8 to 10 in Vancouver, British Columbia, Canada. This year, a slightly enlarged crowd (150 developers) met to discuss ongoing core-kernel development projects, solve problems, and chart a course for the next year. LWN was, of course, there as well.

[MM Room]

Videos of the talks are now available on YouTube.

Plenary sessions

A small subset of the sessions involved the whole group (minus the BPF contingent, which mostly met independently). These sessions included:

Combined storage and filesystem sessions

Combined filesystem and BPF sessions

Memory-management sessions

Filesystem sessions

[ Thanks to LWN subscribers for supporting our travel to Vancouver to attend this event. ]

Comments (none posted)

A storage standards update at LSFMM+BPF

By Jonathan Corbet
May 11, 2023

LSFMM+BPF
Storage technology may seem like a slow-moving area, but there is, instead, a lot of development activity happening there. An early session at the 2023 Linux Storage, Filesystem, Memory-management and BPF Summit, led by Martin Petersen and Vincent Haché, updated the assembled group on the latest changes to the storage landscape, with an emphasis on the Compute Express Link (CXL) 3.0 specification.

Linux storage-stack projects

Petersen started with a quick overview of a number of areas of interest to the storage community, the first of which is "flexible data placement". That, he said, is "the new cloud-vendor favorite way" to address write-amplification issues; there is a new favorite every year, he added. Flexible data placement allows the kernel to tell storage devices which blocks belong together so that they can be updated in a single operation. That should make the device's garbage-collection process easier.

[Martin Petersen]

Copy offload is a perennial subject in the storage world. The SCSI standard provides many ways of offloading copy operations, but NVMe has, until now, only been able to offload copy operations within a single NVMe namespace. Work is now happening on enabling cross-namespace copy offloading, which complicates the situation in a number of ways. One big challenge is simply figuring out whether two NVMe devices are able to communicate with each other. The SCSI stack makes that determination ahead of time so that it knows not to attempt to offload an operation if the devices involved are unable to talk to each other; NVMe will gain a similar capability.

A related area is computational storage — an NVMe namespace without any actual storage associated with it. Instead, these devices can offload operations like compression and encryption, working directly with data stored in other namespaces.

"Hinting" is telling storage devices more about the data that they are holding; again, it is intended to allow the devices to make better decisions about data placement. There are a lot of devices that can benefit from hinting, but developers have spent years trying to get it to work properly. That work will continue, he hinted.

Some types of devices can support atomic block-write operations, meaning that either the entire operation succeeds, or none of it does. The kernel would like to make it possible for user space to use that capability where it exists, but it's a complicated task. The need to find some sort of common ground between the SCSI and NVMe implementations of atomic writes makes it even more so.

Checksums are often used to detect (and possibly correct) the corruption of stored data, but the 16-bit checksums that have long been in use were designed for a world of 512-byte blocks; they are too small for the larger blocks used now. Both SCSI and NVMe have added 32-bit and 64-bit checksum formats that are being deployed, though an audience member commented that it is not being pushed hard in the SCSI world. This feature, Petersen said, is most useful in cloud-storage environments, where data corruption is more common.

Finally, Petersen mentioned NVMe live migration. If a virtual machine that is running with an NVMe-like device is migrated to a new physical host, the device needs to migrate with it. There are currently efforts afoot to define a standards-based approach for that type of migration.

CXL 3.0

Haché then took over to present some highlights from the new CXL 3.0 standard. The specification, he said, is 1,100 pages in length; mercifully, he did not plan to cover the whole thing during the session. Memory, he said, is traditionally a static resource physically attached to the CPU, but we are heading into a world where it is dynamically pooled instead. The kernel will have to adapt to this world.

[Vincent Haché]

The 3.0 standard has added a long list of features designed to improve scalability. These include the ability to create fabrics of CXL devices, and increased memory pooling and sharing capabilities. A lot of work has been done at the transaction layer to ensure coherency when multiple systems are accessing the same memory; this is especially needed for peer-to-peer operations, where devices can access CXL memory directly without going through a host system.

The CXL 2.0 specification added "multiple logical device" support, where each function supported by a CXL device would be attached to a single host computer. This mechanism works, but it requires a sophisticated switch to implement, and that switch increases memory-access latency considerably. To address this problem, the 3.0 specification added a "multi-headed device" mode that adds more host ports and moves the switch into the controller. This mode, Haché said, does not scale as well, but it does improve latency.

On top of this mechanism is the "dynamic capacity device" (DCD) layer, which controls the allocation of CXL resources to hosts. Before DCD, changing a host's memory allocation was a disruptive act that required remapping the host's physical address space. In practice, that required rebooting the host. This kind of change was "better than popping out a DIMM", he said, but was still painful. With DCD, instead, the maximum capacity of each CXL resource is exposed to the host from the beginning, and DCD tells the host what its actual memory allocation is. CXL memory is divided into blocks, then grouped into extents that can be allocated to hosts. The extent lists are small and easy to update with the hosts.

The tree-like hierarchy of CXL controllers described in the 2.0 specification turned out to limit scalability, so version 3.0 has added a way to organize CXL resources into fabrics, with switches densely interconnected with each other. There is a provision for fabric-attached memory that is not part of any specific host; accelerators can also sit on the fabric and access resources directly. A CXL fabric, Haché said, cannot provide the same bandwidth that Ethernet can, so it will not be possible to put an entire data center on one fabric. But it will be possible to create large memory pools, with on the order of 1,000 hosts and 1,000 memory devices, and share that memory across the hosts with around 100ns latency.

Haché suggested that attendees download the specification and read it for themselves; it can be had from computeexpresslink.org after agreeing to a lengthy set of terms and conditions.

David Howells said that CXL starts to sound a lot like InfiniBand, and asked how the kernel was expected to actually use CXL resources. Haché answered that, on most systems, CXL memory would just show up as if it were local memory, contained within a CPU-less NUMA node. The kernel should not need to do anything special with it other than realizing that it will be a bit slower, much like persistent memory. Some sort of tiering approach is likely to be necessary at some point.
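
That CPU-less node can be spotted with the normal NUMA interfaces in sysfs; the following sketch is illustrative only and assumes the usual /sys/devices/system/node layout, flagging any memory-only node of the kind that CXL or persistent memory would create:

    import glob
    import os

    # A NUMA node with an empty cpulist but a nonzero MemTotal is
    # memory-only, which is how CXL (or persistent) memory typically
    # appears to the kernel.
    for node in sorted(glob.glob("/sys/devices/system/node/node[0-9]*")):
        with open(os.path.join(node, "cpulist")) as f:
            cpus = f.read().strip()
        with open(os.path.join(node, "meminfo")) as f:
            mem_total_kb = int(f.readline().split()[3])
        kind = "memory-only (possibly CXL)" if not cpus else "normal"
        print(f"{os.path.basename(node)}: cpus=[{cpus or 'none'}] "
              f"mem={mem_total_kb // 1024} MB ({kind})")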

Dan Williams added that CXL has been defined so that the host does not really even need to know what it is. The BIOS can set everything up before the system boots. CXL memory, he said, is best used when it is normal and uninteresting. But it can be used in more interesting ways when the need arises.

Matthew Wilcox complained that nobody was talking about contention. What happens when there are 1,000 hosts all accessing the same cacheline in CXL memory? That could lead to multi-second delays for cache-line access, which could "break CPUs"; the prospect scares him. Haché pointed out that memory-pooling devices can have quality-of-service capabilities that allow access limits to be set on a per-host basis; that could prevent the worst problems. Wilcox responded that it can often be hard to tell highly active users from malicious users. Another audience member said that, for now, the most common use case will be cloud computing, where memory will be allocated to a single host and contention issues will not arise. There will be high-performance computing applications in the future that want to more fully use CXL's capabilities, though.

James Bottomley, at the close of the session, asked what the "killer app" for CXL would be. Haché responded that memory capacity and bandwidth are a big problem for data centers currently. There is only so much memory that can be physically connected to a CPU before it runs out of DIMM slots; there are efforts to increase memory densities, but they are expected to double the price of memory. In many data centers now, memory alone represents 50-60% of the cost of a server, and that percentage will only go up. CXL offers the alternative of connecting terabytes of DDR4 memory and making the resulting capacity available to a bunch of servers. There are other interesting use cases, but RAM costs are the biggest motivating factor currently, he said.

Bottomley responded that this was the overcommit issue again; vendors are trying to sell the same memory to multiple customers on the assumption that they won't all use it all simultaneously. CXL is a good way of not getting caught at that game, he said. Haché refused to comment on that point.

The session ended there, but the changes discussed here were to reappear many times throughout the conference, where they would be discussed in greater detail.

Comments (2 posted)

Peer-to-peer DMA

By Jake Edge
May 16, 2023

LSFMM+BPF

In a plenary session on the first day of the 2023 Linux Storage, Filesystem, Memory-Management and BPF Summit, Stephen Bates led a discussion about peer-to-peer DMA (P2PDMA). The idea is to remove the host system's participation in a transfer of data from one PCIe-connected device to another. The feature was originally aimed at NVMe SSDs so that data could simply be copied directly to and from the storage device without needing to move it to system memory and then from there to somewhere else.

Background

The idea goes back to 2012 or so, Bates said, when he and Logan Gunthorpe (who did "most of the real work") were working on NVMe SSDs, RDMA, and NVMe over fabrics (before it was a standard, he thought). Some customers suggested that being able to DMA directly between devices would be useful. With devices that exposed some memory (which would be called a "controller memory buffer" or CMB today) they got the precursor to P2PDMA working. There are some user-space implementations of the feature, including for SPDK and NVIDIA's GPUDirect Storage, which allows copies directly between NVMe namespaces and GPUs.

[Stephen Bates]

Traditional DMA has some downsides when moving data between two PCIe devices, such as an NVMe SSD and an RDMA network card. All of the DMA operations come into the system memory from one of the devices, then have to be copied out of the system memory to the other device, which doubles the amount of memory-channel bandwidth required. If user-space applications are also trying to access the RAM on the same physical DIMM as the DMA operation, there can be various quality-of-service problems as well.

P2PDMA avoids those problems, but comes with a number of challenges, he said. The original P2PDMA implementation for Linux was in-kernel-only; there were some hacks that allowed access from user space, but they were never merged into the mainline. More recently, though, the 6.2 kernel has support for user-space access to P2PDMA, at least in some circumstances. P2PDMA is available in the NVMe driver but only devices that have a CMB can be a DMA source or destination. NVMe devices are the only systems currently supported as DMA masters as well.

Bates is unsure whether Arm64 is a fully supported architecture currently, as there is "some weird problem" that Gunthorpe is working through, but x86 is fully supported. An IOMMU plays a big role for P2PDMA because it needs to translate physical and virtual addresses of various sorts between the different systems; "believe me, DMAing to the wrong place is never a good thing". The IOMMU can also play a safeguard role to ensure that errant DMA operations are not actually performed.

Currently, there is work on allowlists and blocklists for devices that do and do not work correctly, but the situation is generally improving. Perhaps because of the GPUDirect efforts, support for P2PDMA in CPUs and PCIe devices seems to be getting better. He pointed to his p2pmem-test repository for the user-space component that can be used to test the feature in a virtual machine (VM). As far as he knows, no other PCIe drivers beyond the NVMe driver implement P2PDMA, at least so far.

Future

Most NVMe drives are block devices that are accessed via logical block addresses (LBAs), but there are devices with object-storage capabilities as well. There is also a computational storage interface coming soon (which was the topic of the next session) for doing computation (e.g. compression) on data that is present on a device. NVMe namespaces for byte-addressable storage are coming as well; those are not "load-store interfaces", which would be accessible from the CPU via load and store instructions as with RAM, but are instead storage interfaces available at byte granularity. Supporting P2PDMA for the NVMe persistent memory region (PMR), which is load-store accessible and backed by some kind of persistent data (e.g. battery backed-up RAM), is a possibility on the horizon, though he has not heard of any NVMe PMR drives in development. PMR devices could perhaps overlap the use cases of CXL, he said.

Better VM and IOMMU support is in the works. PCIe has various mechanisms for handling and caching memory-address translations, which could be used to improve P2PDMA. Adding more features to QEMU (e.g. SR-IOV) is important because it is difficult to debug problems using real hardware. Architecture support is also important; there may still be problems with Arm64 support, but there are other important architectures, like RISC-V, that need to have P2PDMA support added.

CXL had been prominently featured in the previous session, so Bates said he wanted to dig into it a bit. P2PDMA came about in a world where CXL did not exist, but now that it does, he thinks there are an interesting set of use cases for P2PDMA in a CXL world. Electrically and physically, CXL is the same as PCIe, which means that both types of devices can plug into the same bus slots. They are different at the data link layer, but work has been done on CXL.io, which translates PCIe to CXL.

That means that an NVMe drive that has support for CXL flow-control units (flits) can be plugged into a CXL port and can then be used as a storage device via the NVMe driver on the host. He and a colleague had modeled that using QEMU the previous week, which may be the first time it had ever been done. He believes it worked but more testing is needed.

Prior to CXL 3.0, doing P2PDMA directly between CXL memory and an NVMe SSD was not really possible because of cache-coherency issues. CXL 3.0 added a way for CXL to tell the CPUs that it was about to do DMA for a particular region of physical memory and ask the CPUs to update the CXL memory from their caches. The unordered I/O (UIO) feature added that ability, which can be used to move large chunks of data from or to storage devices at hardware speeds without affecting the CPU or its memory interconnects. Instead of a storage device, an NVMe network device could be used to move data directly out of CXL memory to the network.

Bates said that peer-to-peer transfers of this sort are becoming more and more popular, though many people are not using P2PDMA to accomplish them. That popularity will likely translate to more users of P2PDMA over time, however. At that point, LSFMM+BPF organizer Josef Bacik pointed out that time had expired on the slot, so the memory-management folks needed to head off to their next session, while the storage and filesystem developers continued the discussion.

David Howells asked if Bates had spoken with graphics developers about P2PDMA since it seems like they might be interested in using it to move, say, textures from storage to a GPU. Bates said that he had been focusing on cloud and enterprise kinds of use cases, so he had not contacted graphics developers. The large AI clusters are using peer-to-peer transfers to GPUs, though typically via the GPUDirect mechanism.

The NVMe community has been defining new types of namespaces lately, Bates said. The LBA namespace is currently used 99% of the time, but there are others coming as he had noted earlier. All of those namespace types and command sets can be used over both PCIe and CXL, but they can also be used over fabrics with RDMA or TCP/IP. Something that is not yet in the standard, but he hopes is coming, is providing a way to present an NVMe namespace (or a sub-region of it) as a byte-addressable, load-store region that P2PDMA can then take advantage of.

There was a digression on what was meant by load-store versus DMA for these kinds of operations. Bates said that for accessing data on a device, DMA means that some kind of descriptor is sent to a data-mover that would simply move the data as specified, whereas load-store means that a CPU is involved in doing a series of load and store operations. So there would be a new NVMe command requesting that a region be exposed as a CMB, a PMR, or "something new that we haven't invented yet"; the CPU (or some other device, such as a DMA data-mover) can then do load-store accesses on the region.

One use case that he described would be having an extremely hot (i.e. frequently accessed) huge file on an NVMe drive, but wanting to be able to access it directly with loads and stores. A few simple NVMe commands could prepare this data to be byte-accessible, which could then be mapped into the application's address space using mmap(); it would be like having the file in memory without the possibility of page faults when accessing it.
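
As a hypothetical sketch of the host side of that use case: the device node name and the way the region gets exposed are assumptions here, since the NVMe commands involved have not been standardized yet.

    /* Hypothetical sketch: map a byte-addressable NVMe region into user space.
     * The device node name and the mechanism for exposing the region are
     * assumptions, pending standardization of the NVMe commands involved. */
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("/dev/nvme0-lsr0", O_RDWR);   /* hypothetical node */
        size_t len = 1UL << 30;                     /* a 1GB hot file, say */

        if (fd < 0) {
            perror("open");
            return 1;
        }

        uint8_t *data = mmap(NULL, len, PROT_READ | PROT_WRITE,
                             MAP_SHARED, fd, 0);
        if (data == MAP_FAILED) {
            perror("mmap");
            return 1;
        }

        /* Ordinary loads and stores now reach the device-backed region; there
         * is no separate backing file that has to be faulted in from storage. */
        data[0] ^= 1;

        munmap(data, len);
        close(fd);
        return 0;
    }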

Comments (10 posted)

The state of the page in 2023

By Jonathan Corbet
May 17, 2023

LSFMM+BPF
The conversion of the kernel's memory-management subsystem over to folios was never going to be done in a day. At a plenary session at the start of the second day of the 2023 Linux Storage, Filesystem, Memory-Management and BPF Summit, Matthew Wilcox discussed the current state and future direction of this work. Quite a lot of progress has been made — and a lot of work remains to be done.

There is no single best page size for the management of memory in a Linux system, Wilcox began. On some benchmarks, using 64KB pages produces significantly better results, but others do better with 4KB base pages. In general, though, managing a system with 4KB pages is inefficient at best; at that size, the kernel must scan through millions of page structures to provide basic management functions. Beyond that, the page structure is badly overloaded and difficult to understand. If it needs to grow for one page type, it must grow for all, meaning in practice that it simply cannot grow, because somebody will always push back on it.

[Matthew Wilcox] To address this problem, Wilcox and others are trying to split struct page into a set of more specialized structures. Eventually struct page itself will shrink to a single pointer, where the least-significant bits are used to indicate what type of usage-specific structure is pointed to. The work to move slab-allocator information out of struct page has already been completed. There are plans (in varying states of completion) to make similar changes for pages representing page tables, compressed swap storage, poisoned pages, folios, free memory, device memory, and more.
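
A minimal sketch of that tagged-pointer idea follows; the names and bit layout are illustrative only, not the kernel's actual (or final) definitions.

    /* Illustrative sketch of the tagged-pointer idea (not the kernel's real
     * definitions): struct page shrinks to one word whose low bits say which
     * usage-specific descriptor it points to. */
    enum memdesc_type {              /* hypothetical type tags */
        MEMDESC_FOLIO  = 1,
        MEMDESC_SLAB   = 2,
        MEMDESC_PTDESC = 3,          /* page-table page */
        MEMDESC_MASK   = 0xf,
    };

    struct page {                    /* the eventual, shrunken form */
        unsigned long memdesc;       /* descriptor pointer, tagged in bits 0-3 */
    };

    static inline unsigned long memdesc_type(const struct page *page)
    {
        return page->memdesc & MEMDESC_MASK;
    }

    static inline void *memdesc_ptr(const struct page *page)
    {
        return (void *)(page->memdesc & ~(unsigned long)MEMDESC_MASK);
    }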

When will this work be done? Wilcox waved his hands and said "two years" to general laughter. There have been 1,087 commits in the mainline so far that mention folios. The page cache has been fully converted, as have the slab allocators. The tail-page portion of compound pages has been moved to folios, allowing the removal of another section from struct page. The address_space_operations structure has been converted — except for three functions that will soon be deleted entirely.

There are three filesystems (XFS, AFS, and EROFS) that have been fully converted, as have the iomap and netfs infrastructure layers. A number of other filesystems, including ext4, NFS, and tmpfs, can use single-page folios now. The get_user_pages() family of functions uses folios internally, though its API is still based on struct page. Much of the internal memory-management code has been converted. One might be tempted to think that this work is nearly done, but there is still a lot of code outside of the memory-management layer that uses struct page and will need to be converted.

Every conversion that is done makes the kernel a little smaller, Wilcox said, due to the simplifying assumption that there are no pointers to tail pages. Over time, this shrinkage adds up.

There are plenty of topics to discuss for the future, he said. One is the conversion of the buffer-head layer, which is in progress (and which was the subject of the next session). Folios will make it easier to support large filesystem block sizes. The get_user_pages() interfaces need to be redesigned, and there are more filesystem conversions to do. A big task is enabling multi-page anonymous-memory folios. Most of the work done so far has been with file-backed pages, but anonymous memory is also important.

One change that is worth thinking about, he said, is reclaiming the __GFP_COMP allocation flag. This flag requests the creation of a compound page (as opposed to a simple higher-order page); that results in the addition of a bunch of metadata to the tail pages. This is useful for developers working on kernel-hardening projects, who can use it to determine if a copy operation is overrunning the underlying allocation. Those developers would like the kernel to always create compound pages; the page allocator, Wilcox suggested, could simply do that by default, dropping non-compound higher-order allocations and the __GFP_COMP flag entirely.
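
As a kernel-internal fragment showing how the distinction looks today: the helper names below are invented for illustration, but alloc_pages() and __GFP_COMP are the real interfaces; under the suggested change, the second variant would effectively cease to exist.

    #include <linux/gfp.h>

    static struct page *grab_compound(unsigned int order)
    {
        /* Compound page: the tail pages carry metadata pointing back at the
         * head, which is what lets hardening checks find the allocation size. */
        return alloc_pages(GFP_KERNEL | __GFP_COMP, order);
    }

    static struct page *grab_plain(unsigned int order)
    {
        /* Plain higher-order allocation: 2^order contiguous pages with no
         * compound metadata in the tails. */
        return alloc_pages(GFP_KERNEL, order);
    }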

He mentioned some pitfalls that developers working on folio conversions should be aware of. Some folio functions have different semantics than the page-oriented functions they replace; the return values may be different, for example. These changes have been carefully thought about, he said, and result in better interfaces overall, but they are something to be aware of when working in this area.

Multi-page folios can also cause surprises for code that is not expecting them. He mentioned filesystems that check for the end of a file by calculating whether an offset lands within a given page; now they must be aware that it could happen anywhere within a multi-page folio. Anytime a developer encounters a definition involving the string PAGE (PAGE_SIZE, for example), it is time to be careful. And so on.

There are also, he concluded, a few misconceptions about folios that are worth clearing up. One of those is that there can be only one lock per folio; he confessed that he doesn't quite understand why there is confusion here. There was always just one lock per compound page as well. The page lock is not highly contended; whenever it looks like a page lock is being contended, it is more likely to be an indication of threads waiting for I/O to complete.

Some developers seem to think that dirtiness can only be tracked at the folio level. It is still entirely possible to track smaller chunks within a folio, though; that is up to the filesystem and how it handles its memory. The idea that page poisoning affects a whole folio is also incorrect; that is a per-page status.

As the session wound down, David Hildenbrand said that, while folios are good, there is still often a need to allocate memory in 4KB chunks. Page-table pages, he said, would waste a lot of memory if allocated in larger sizes. What is really needed is the ability to allocate in a range of sizes, depending on how the memory will be used. Wilcox closed the session by saying that is exactly the outcome that the developers are working toward.

Comments (14 posted)

Reconsidering the direct-map fragmentation problem

By Jonathan Corbet
May 11, 2023

LSFMM+BPF
Mike Rapoport has put a considerable amount of effort into solving the problem of direct-map fragmentation over the years; this has resulted in proposals like __GFP_UNMAPPED and a session at the 2022 Linux Storage, Filesystem, Memory-Management, and BPF Summit. Rapoport returned at the 2023 Summit to revisit this issue, but he started with a somewhat surprising spoiler.

The kernel's direct map makes all of the system's physical memory accessible in the kernel's virtual address space (on 64-bit systems, at least). As a performance optimization, huge pages are used to construct this mapping; by keeping the kernel's use of the translation lookaside buffer (TLB) down, using huge pages for the direct map will speed memory access in general. When the permissions for specific pages in the direct map must be changed (to hide memory from the kernel or to remove write permission from executable code, for example), though, those huge pages must be split into smaller "base" pages, hurting system performance. This type of direct-map fragmentation is thus worth working hard to avoid.

[Mike Rapoport] Or, at least, that is what everybody has assumed for years. Rapoport started his 2023 session with the statement that he is no longer convinced that the kernel needs to make any changes to its memory allocators to avoid direct-map fragmentation. It is, he said, no longer an issue that the kernel community needs to be concerned about. "End of the talk".

The talk didn't actually end there, though. Instead, he reviewed the causes of direct-map fragmentation, including the allocation of memory for executable code (such as a loadable module or a BPF program) or for secret-memory features. Proposed future changes, such as a version of vmalloc() with memory permissions or the use of supervisor protection keys (PKS) for page tables, may also require splitting huge pages and, as a result, create fragmentation in the direct map.

The __GFP_UNMAPPED patches tried to reduce this problem by setting aside an area for allocations that should be removed from the direct map. Once those were working, he went looking for numbers to justify the change. He ran a whole series of benchmarks to show how the reduced fragmentation made the system run faster, but got results that he described as "peculiar". The results (which he has made available for the curious) showed improvements that were, at best, far smaller than the noise in the measurements. There was little sign that any of the changes would translate into performance improvements for real workloads.

Vlastimil Babka pointed out that all of the performance tests were done on AMD CPUs and wondered whether Intel systems would behave differently; Rapoport answered that he has run the tests on Intel as well with similar results. Michal Hocko asked whether Rapoport had run the tests using only base pages for the direct map — the fully fragmented case, in other words; that test, too, has been run and showed a "single-digit" degradation in performance.

The conclusion from all of this, Rapoport continued, was that direct-map fragmentation just does not matter — for data access, at least. Using huge-page mappings does still appear to make a difference for memory containing the kernel code, so allocator changes should focus on code allocations — improving the layout of allocations for loadable modules, for example, or allowing vmalloc() to allocate huge pages for code. But, for kernel-data allocations, direct-map fragmentation simply appears to not be worth worrying about.

James Bottomley said that these results might show that, on current CPUs, the TLB doesn't work the way developers think it does. Perhaps improved speculative execution is reducing the cost of TLB misses; much of the memory-management subsystem might be built for a world that no longer exists. Rapoport answered that most of the TLB is occupied by user space in any case, so the kernel will almost certainly incur a TLB miss on each entry regardless of the state of the direct map. Trying to prevent that miss with large-page mappings may not be doing any good.

Direct-map fragmentation concerns have led to Rapoport's secret-memory feature being disabled by default. Having concluded that those concerns are not actually concerning, he is now proposing to enable the feature in all kernels, making memfd_secret() available by default. As the session ran out of time, Babka worried that fragmentation could still increase the kernel's memory usage by requiring more page tables, but Rapoport answered that the cost was not enough to worry about. Secret memory is controlled by the system resource limits, so there is only so much damage that a malicious user can do.

The proof is likely to be when the configuration change is proposed; if users can demonstrate real performance regressions, direct-map fragmentation may have to be reconsidered yet again.

Comments (3 posted)

Memory-management changes for CXL

By Jonathan Corbet
May 12, 2023

LSFMM+BPF
Kyungsan Kim began his talk at the 2023 Linux Storage, Filesystem, Memory-Management and BPF Summit with a claim that the Compute Express Link (CXL) technology is leading to fundamental changes in computer architecture. The kernel will have to respond with changes of its own, including in its memory-management layer. Drawing on some experience gained at Samsung, Kim had a few suggestions on the form those changes should take — suggestions that ran into some disagreement from other memory-management developers.

Requirements

CXL, he said, is creating real-world use cases for memory tiering, an architecture that separates memory with different performance characteristics and attempts to place each page in an appropriate tier. A classic example is placing "hot" (frequently accessed) pages in a fast, near tier, while cold pages could be placed more remotely — in slower memory attached via CXL, for example. Whether any given page is hot is, in the end, determined by its user, while the positioning is determined by the provider. His work has focused on the provider side, and has led to a set of requirements and proposals.

[Kyungsan Kim] The first of those requirements is that users should use NUMA node IDs to access CXL-hosted memory. That corresponds to how Linux has implemented access to such memory to date — by placing it into a separate CPU-less NUMA node. Those node IDs themselves do not reflect the distance of any given memory; a separate API is required for that.

Kim's second requirement is that the system should prevent unwanted migration of pages between nodes. As an example, he described a system running zswap, which "swaps" pages by compressing them and storing the result in RAM. A zswap configuration could reclaim pages from remote CXL memory and store the compressed result in fast, local memory, essentially promoting those pages. Michal Hocko responded that the solution to that specific problem was to fix zswap to preserve locality when compressing pages. David Hildenbrand added that, if pages are being put into zswap, they have already been determined to be cold, so it would be better to just put zswap-compressed pages in slow storage to begin with.

Kim then presented his first proposal, which was to provide an explicit API to allocate different types of memory. In user space, this API would use the existing mmap(), mbind(), and set_mempolicy() system calls. So, for example, mmap() would gain a couple of new flags: MMAP_NORMAL to request a mapping in "normal", local memory, and MMAP_EXMEM to map into CXL-stored, remote memory. There would be similar new flags for the NUMA memory-policy system calls. Within the kernel, the new GFP_NORMAL and GFP_EXMEM flags would be used to explicitly request a given memory type.
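
If the proposal were adopted as described, a user-space allocation from CXL memory might look something like the sketch below; MMAP_EXMEM does not exist in any mainline kernel, and its numeric value here is a placeholder.

    #define _GNU_SOURCE
    #include <stddef.h>
    #include <sys/mman.h>

    #ifndef MMAP_EXMEM
    #define MMAP_EXMEM 0x800000   /* placeholder bit; a real value would only
                                     be assigned if the flag were ever merged */
    #endif

    static void *alloc_from_cxl(size_t len)
    {
        /* Explicitly ask for CXL-attached ("extended") memory, per the proposal. */
        return mmap(NULL, len, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS | MMAP_EXMEM, -1, 0);
    }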

Notably, the defaults are different between the user-space and kernel interfaces. User-space allocations could be placed in CXL memory unless the application explicitly specifies otherwise, but the kernel will only get that memory if it asks for it. Allowing unmovable kernel allocations to land in CXL memory could lead to problems if an attempt is made to unplug that memory in the future.

Unplugging was the subject of Kim's next requirement: that the ability to unplug CXL memory must be maintained. Avoiding kernel-space allocations in CXL-attached memory is one step in that direction, but it does not avoid all problems. User-space memory, too, can end up being non-movable (and thus block unplugging) in some situations, he said, citing pages that are pinned for DMA as an example. Matthew Wilcox objected that, when user-space pages located in a movable zone are pinned, they are automatically migrated out of that zone first, so that problem should not actually exist. Hocko added that CXL memory should always be configured outside of ZONE_NORMAL to ensure that kernel allocations will not be placed there; it would not do to put kernel data structures in memory with potentially unbounded latency.

Kim started into the fourth requirement, which was to reduce the number of CXL nodes that are visible to user space. Managing all of those nodes can become unwieldy. It would be better to, somehow, aggregate CXL resources into a single node. There was another requirement to be presented, but at this point the session had run over time, and the discussion came to an end.

Zoning out

The subject was revisited in another session late in the conference, though Kim's final requirement was never presented. Meanwhile, it seems that some of what was presented resulted in some hallway disagreements; in response, Dan Williams stood in front of the room and proclaimed his intent to mediate between Kim and his colleagues on one side, and the memory-management community on the other. The specific sticking point was Kim's proposal to add a new mmap() flag as described above, and especially to add a new memory zone (which wasn't covered in the earlier session, but would be needed to implement the flag). Neither of those things is done lightly, he said, but it can happen; similar changes were made for persistent memory. Williams said that he had "risked his career" to get MAP_SYNC into the kernel.

[Dan Williams] First, though, Williams wanted to talk about the problem to be solved, and that involved a quick overview of how CXL-attached memory looks to a system. On these systems, the host's physical address space is divided into "windows"; this division is set up in the ACPI tables and does not change over the life of the system. Each window is a place where a CXL host bridge can be mapped; it is possible to map resources from more than one host bridge into a single window.

The Linux kernel does "the simple thing" with host bridges for now. Attached memory is organized into performance classes, then a NUMA node is created for each class. Memory from multiple devices can be mapped into the same node if they all have similar performance.

When it comes to using this memory, the primary need is to be able to bind processes to memory of one or more specific performance classes. At times, it may also be necessary to create bindings that avoid a given performance class that might, for example, be too slow for most use. Processes would only be given memory from that class if they explicitly request it. The kernel also needs to avoid most CXL memory for its own data structures. There will also certainly be a need to migrate a process from one performance class to another at times.

It is currently difficult to bind processes to a specific class of memory using just the NUMA API. The problem comes about when a new node comes online; running applications will not know that they are able to bind to that new node, and so will not make use of it. The solution, it was suggested in the room, is to bind to all possible nodes, not just those that exist at the time.
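
A sketch of that suggestion, using libnuma, appears below: set every bit in a nodemask sized for all possible nodes, then bind to it. Whether a given kernel accepts possible-but-offline nodes in such a mask is exactly the kind of detail the discussion turned on, so treat this as an illustration of the idea rather than a recipe.

    #include <numa.h>

    static void bind_to_all_possible_nodes(void)
    {
        /* numa_allocate_nodemask() sizes the mask to the kernel's possible-node
         * count, not just the nodes that happen to be online right now. */
        struct bitmask *all = numa_allocate_nodemask();

        numa_bitmask_setall(all);
        numa_set_membind(all);        /* MPOL_BIND over the whole mask */
        numa_free_nodemask(all);
    }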

Kim returned to the front of the room at this point to make the case for a different approach. Internally, the kernel divides memory into zones with different characteristics. ZONE_NORMAL usually contains most memory, while ZONE_MOVABLE is meant to be restricted to allocations that can be moved at need, for example. Kim said that putting CXL-attached memory into its own zone might be a better way to reflect its performance characteristics than using NUMA nodes. Allocation policies are generally applied at the zone level, he said, so it is a more natural fit.

Hocko disagreed, saying that zones are an internal memory-management concept and should not be exposed outside of the core code. The right place for CXL memory will be ZONE_MOVABLE most of the time. But Kim said that the existing zones are not well suited to CXL memory. ZONE_NORMAL is not pluggable, he said, while ZONE_MOVABLE does not allow pinning, which should be allowed for at least some CXL memory. ZONE_DEVICE, which is for memory hosted on attached devices, does not allow page allocation.

Kim would, thus, like to add ZONE_EXMEM to handle the peculiarities of CXL memory, which go beyond variable performance characteristics. For example, there are other performance-related concerns; CXL memory is subject to link negotiation and quality-of-service throttling. There can be errors caused by connection problems; these tend not to arise with normal RAM. CXL memory has sharing and permission options that are not present with normal memory; it can also implement some types of asynchronous operations ("sanitize", for example) that the host need not wait for.

Williams answered that asynchronous operations can be managed with normal memory as well, and that regular memory is certainly not immune to errors. Zones would just add complexity, he said, without enabling anything new. Kim, though, insisted that a zone-based solution would require fewer code changes to implement.

Wilcox argued that exposing zones more widely is an idea that scares people; zones should be hidden within the memory-management subsystem. Nodes are sufficient, he added, to do what needs to be done. Hocko worried that a single zone would not suffice if that approach were taken; he wondered how many zones would be needed in the end. Williams added that the memory-management developers have been working on integrating high-bandwidth memory, which is also a whole new class of memory, but nobody has thought about adding a new zone for it.

It was the end of the third day of the conference, and everybody was tired. As thoughts turned toward beer, Williams summarized the current state of the conversation. He has not heard, he said, that anybody (other than Kim) is convinced that nodes are not sufficient for CXL memory, but that the community could still be convinced otherwise. Doing so, though, would require a well-defined use case that cannot be handled with NUMA nodes. So the response to adding a new zone is "no" today, but could be changed in response to a compelling example of why that zone is needed. That, he said, is how MAP_SYNC came about — and it took a two-year discussion to get there. The plan to add a new zone for CXL-attached memory will have to clear a similar bar.

Comments (2 posted)

The future of memory tiering

By Jonathan Corbet
May 12, 2023

LSFMM+BPF
Memory tiering is the practice of dividing physical memory into separate levels according to its performance characteristics, then allocating that memory in a (hopefully) optimal manner for the workload the system is running. The subject came up repeatedly during the 2023 Linux Storage, Filesystem, Memory-Management and BPF Summit. One session, led by David Rientjes, focused directly on tiering and how it might be better supported by the Linux kernel.

Tiering was often mentioned in the context of CXL memory but, Rientjes began, it is not just about CXL. Instead, tiering presents a number of "generic use cases for hardware innovation". There are a lot of ways of thinking about tiering and what is covered by the term. The management of NUMA systems, where some memory is closer to a given CPU (and thus faster to access) than the rest, is a classic example. Swapping could be seen as a form of tiering, as can non-volatile memory or high-bandwidth memory. And, of course, mechanisms like CXL memory expansion and memory pooling. It is, he said, leading to "a golden age of optimized page placement".

[David Rientjes] The discussion briefly digressed into whether swapping really qualifies as tiering. In the end, the consensus seemed to be that, to be a memory tier, a location must be byte-addressable by the CPU. So swapping is off the list.

Michal Hocko said that there are two dimensions to the tiering problem. One is the user-space APIs to be provided by the kernel; somehow user space has to be given the control it needs over memory placement. The relevant question here is whether the existing NUMA API is sufficient, or whether something else is needed. The other aspect of the problem, though, is the kernel implementation, which should handle tiering well enough that user space does not actually need to care about it most of the time.

Rientjes responded that the NUMA API has been a part of the kernel for around 20 years. Whether it is suitable for the tiering use case depends on the answers to a number of questions, including whether it can properly describe and control all of the types of tiering that are coming. Slower expansion memory is the case that is cited most often, but there are others, including memory stored on coprocessors, network interfaces, and GPUs. He wondered what kinds of incremental changes to the NUMA API would be needed; the one-dimensional concept of NUMA distance may not be enough to properly describe the differences between tiers. The group should also, he said, consider what the minimal level of kernel support should be, and which optimizations should be left to user space.

One problem, Dan Williams said, is that vendors (and their devices) often lie to the kernel about their capabilities. Getting to the truth of the matter is not a problem that can just be punted to user space. There need to be ways for user space to indicate its intent, which can then be translated by the kernel into actual placement decisions.

Matthew Wilcox said that systems will have multiple levels of tiering; the kernel will have to decide how to move pages up and down through those tiers. Specifically, should movement be done one step at a time, or might it be better to just relegate memory directly to the slowest tier (or to swap) if it is known not to be needed for a while? And if multi-tier movement is the right thing to do, how does the kernel figure out when it is warranted? After a bit of inconclusive discussion, Hocko repeated that, while it would be nice to push decisions like that to user space, the kernel has the responsibility to do the right thing as much as possible.

Rientjes had a number of other questions to discuss, but the time allotted to the session was running out. The biggest problem for memory tiering still appears to be page promotion; it is not particularly hard to tell when pages are not in use and should be moved to slower memory, but it is rather more difficult to determine when a page has become hot and should be migrated to faster storage. There are a number of options being explored by CPU vendors to help in this area; the kernel is going to have to find a generic way to support these forthcoming architecture-specific features.

A few other questions had to be skipped over. One of these was what the interface for the configuration of memory devices as tiered nodes should look like. User space will want to influence tiering policies, but that interface has yet to be designed as well. Probably some sort of integration with control groups will be necessary. The list of questions went on from there, but they will have to be discussed some other time.

Comments (6 posted)

Live migration of virtual machines over CXL

By Jonathan Corbet
May 15, 2023

LSFMM+BPF
Virtual-machine hosting can be a fickle business; once a virtual machine has been placed on a physical host, there may arise a desire to move it to a different host. The problem with migrating virtual machines, though, is that there is a period during which the machine is not running; that can be disruptive even if it is brief. At the 2023 Linux Storage, Filesystem, Memory-Management and BPF Summit, Dragan Stancevic, presenting remotely, showed how CXL shared memory can be used to migrate virtual machines with no offline time.

Traditional migration, Stancevic began, is a four-step process. As much of the virtual machine as possible is pre-copied to its new home while that machine continues to run, after which the virtual machine is quiesced — stopped so that its memory no longer changes. A second copy operation is done to update anything that may have changed during the pre-copy phase; then, finally, the moved machine is relaunched in its new home. Stancevic's goal is to create a "nil-migration" scheme that takes much of the work — and the need to quiesce the machine being moved — out of the picture.

[Dragan Stancevic] Specifically, this scenario is meant to work in situations where both physical hosts have access to the same pool of CXL shared memory. In such a setting, migrate_pages() can be used to move the virtual machine's pages to the shared-memory pool without disturbing the operation of the machine itself; at worst, its memory accesses slow down slightly. Once the memory migration is complete, the virtual machine can be quickly handed over to the new host, which also has access to that memory; the machine should be able to begin executing there almost immediately. The goal is to make virtual-machine migration as fast as task switching on a single host — an action that could happen transparently between that machine's normal time slices. Eventually, the new host could migrate the virtual machine's memory into its own directly attached RAM.
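
As an illustration of the idea (not of the nil-migration code itself, which may well use kernel-internal interfaces), the sketch below uses libnuma's wrapper around the migrate_pages() system call to push a guest's pages onto an assumed CXL node; the node numbers are placeholders.

    #include <numa.h>

    static int push_guest_to_cxl(int qemu_pid, int local_node, int cxl_node)
    {
        struct bitmask *from = numa_allocate_nodemask();
        struct bitmask *to   = numa_allocate_nodemask();
        int ret;

        numa_bitmask_setbit(from, local_node);   /* where the guest lives now */
        numa_bitmask_setbit(to, cxl_node);       /* the shared CXL memory node */

        /* Wraps the migrate_pages() system call for the given process. */
        ret = numa_migrate_pages(qemu_pid, from, to);

        numa_free_nodemask(from);
        numa_free_nodemask(to);
        return ret;
    }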

This work, he said, is still in an early stage. It does, however, have a mailing list and a web site at nil-migration.org.

David Hildenbrand asked about pass-through devices — devices on the host computer that are made directly available to a virtual machine. Those, he said, cannot be migrated through CXL memory, or in any other way, for that matter. Stancevic agreed that such configurations simply would not work. Dan Williams asked whether migration through CXL memory was really necessary or if, instead, virtual machines could just live in CXL memory all the time. In Stancevic's use case, CXL shared-memory pools are only used for virtual-machine migration, but other configurations are possible.

Another audience member asked whether it would be possible to do a pre-copy phase over the network first, and only use CXL for any remaining pages just before the move takes place. Stancevic answered that it could work, but would defeat the purpose of keeping the virtual machine running at all times.

Yet another attendee pointed out that CXL memory may not be mapped at the same location on each physical host, and wondered how the nil-migration scheme handles that. The answer is that, so far, this scheme has only been tested with QEMU-emulated hardware (CXL 3.0 hardware won't be available for a while yet), and it is easy to make the mappings match in that environment. This will be a problem when real hardware arrives, though, and a solution has not yet been worked out.

The final question, from Williams, was whether the nil-migration system would need new APIs to identify the available CXL devices. Stancevic answered that having CXL resources show up as NUMA nodes is the best solution, but that it would be good to have some metadata show up in sysfs to help with figuring out the paths between the hosts.

Comments (23 posted)

Memory overcommit in containerized environments

By Jonathan Corbet
May 15, 2023

LSFMM+BPF
Overcommitting memory is a longstanding tradition in the Linux world (and beyond); it is rare that an application uses all of the memory allocated to it, so overcommitting can help to improve overall memory utilization. In situations where memory has been overcommitted, though, it may be necessary to respond quickly to ensure that applications have the memory they actually need, even when those needs change. At the 2023 Linux Storage, Filesystem, Memory-Management and BPF Summit, T.J. Alumbaugh (in the room) and Yuanchu Xie (remotely) presented a new mechanism intended to help hosts provide containerized guests with the memory resources they need.

Xie started by pointing out that, while containers are most often seen as a server-side technology, both server and client applications are often run in containerized environments. Those two types of workloads can vary in their execution, though; server applications tend to run constantly and predictably, while clients can be more bursty as they respond to user interactions. For server applications, the focus tends to be on reliability, while clients aim for responsiveness. Proper management of overcommitted memory is important in both cases.

Providing the memory resources that a containerized application needs — and no more — requires understanding what that application's working set is at any given time. The working set can be seen as a sort of histogram, where pages of memory are placed in bins according to some metric, usually the time of last access or some estimate of coldness. These bins can take the form of generations in the multi-generational LRU or the traditional active/inactive-list mechanism used by the kernel for years. Indeed, sometimes the classification of pages can even be done in user space.

One way of controlling the memory available to a container is a balloon device, which can allocate memory within the container (thus "inflating") and return that memory to the host if a container's memory needs to shrink. The balloon can be deflated to give a container more memory. The work under discussion is aimed at collecting working-set data and providing it to the host by way of the balloon device. The host can then use this information to respond to changes in memory use.

[T.J. Alumbaugh] Alumbaugh took over to talk about the notification mechanism. In short, the balloon driver within the container will be informed (by whatever mechanism is employed to monitor memory use) when a new working-set report is available, via a shrinker-like callback interface. That information can then be passed up to the host, which will use it to implement its resource-management policies. Actions taken in response to working-set reports can include setting control-group limits, or changing the balloon size in specific virtual machines.

Patches implementing this mechanism have been posted to the mailing lists, he said, and an associated QEMU patch set will be posted soon. Google's crosvm virtual-machine monitor has already gained support for working-set reports transmitted in this way, and there have been discussions on adding it to the Virtio specification as well.

The only response to the presentation was a comment from David Hildenbrand, who expressed his dislike for balloon drivers as a way of controlling memory resources. Without care, balloon inflation can create out-of-memory situations, which is rarely the desired result. It is better, he said, to use free-page reporting to the host, which can respond by telling guests to adjust their working-set sizes. That provides guests with the flexibility they need to avoid out-of-memory problems. The core idea is the same as what had been presented, he said, but the mechanism is different.

Comments (none posted)

User-space control of memory management

By Jonathan Corbet
May 15, 2023

LSFMM+BPF
In a remotely presented, memory-management-track session at the 2023 Linux Storage, Filesystem, Memory-Management and BPF Summit, Frank van der Linden pointed out that the line dividing resources controlled by the kernel from those managed by user space has moved back and forth over the years. He is currently interested in making it possible for user space to take more control over the management of memory resources. A proposal was discussed in general terms, but it will require some real scrutiny on its way toward the mainline, if it ever gets there.

Van der Linden noted that, in recent years, resource control has been moving back toward user space in a number of areas. Networking is a prominent example of this shift. There has also been an increasing demand for interfaces to assist orchestrators with their work; this shows, for example, in the proposal to add a pidfd_set_mempolicy() system call. He has been doing some prototyping in this area, building on Google's extensive experience in pushing resource control to user space.

[Frank van der Linden] There are a number of existing mechanisms that can be used to influence memory management from user space, he said. These include a set of sysctl knobs, control-group limits, control knobs for proactive reclaim, and the madvise() system call. madvise() has gained a lot of options over the years, he said, suggesting that perhaps there is a need for a more generic solution. There is also the NUMA API and, for applications that want to get deeply involved in memory management, userfaultfd(), which actually diverts page-fault handling into user space. His focus was on madvise() and set_mempolicy(), with the idea of getting the most performance out of increasingly complex computing environments.

The core idea behind his work is to create a structure that can be used to provide memory-management hints and control to the kernel. It should be easily accessible, especially to BPF programs loaded from user space, which would make decisions based on input received from user space via BPF maps. He is not looking for a way to replace madvise() and set_mempolicy(), but he would like a way to combine them in a way that BPF programs could use. Also needed is a way to tag a virtual memory area (VMA) with an opaque (to the kernel) value, preferably in a way that avoids collisions when more than one user wants to tag a given VMA.

In the prototype work done so far, he has developed a hint structure for anonymous VMAs. The multi-generational LRU has been modified to use this structure to make memory accesses by some processes count more than accesses by others. An access by a favored process could immediately promote a page to the youngest generation, for example, while accesses by others could result in slower (or, indeed, no) promotion. This mechanism would be, in essence, "a nice value for memory access". It was easy to implement, he said, though he did stipulate that its practical value is "questionable".

A more useful feature, perhaps, is the ability to attach "compressibility hints" to pages. Google, like others, uses zswap to compress unused pages to save memory. Some pages do not compress well, though, so attempts to store them in zswap are just a waste of CPU time. An application that, for example, is storing already-compressed data in memory can provide a hint telling the kernel not to bother trying to compress it again.

Van der Linden closed his presentation by asking the group whether the idea seemed like a direction worth pursuing.

The first question came from Suren Baghdasaryan, who asked why BPF had been chosen as the way to access these hint structures. Was that choice made for ease of prototyping, or is a BPF-based interface the intended solution in the end? Van der Linden answered that BPF was the most flexible way to begin this work, but it isn't necessarily the final result. BPF does make things easy, though; it performs well and, he said, having hooks for BPF is a good thing in general.

Michal Hocko worried about exposing too many internal memory-management details that would create future ABI problems; a mechanism like this could be hard to maintain over the long term, he said. This concern, in turn, could make it hard to get the work upstream in the first place. Van der Linden responded that, regardless of the mechanism chosen, maintaining ABI compatibility could be challenging; he asked if anybody could suggest a better way to avoid such problems. Hocko said that a well-defined user interface might be better, but the bar for acceptance will be high regardless. Defining the functionality as narrowly as possible would help.

Matthew Wilcox said that this feature would have been more interesting if it had been proposed prior to the merging of process_madvise(), which has a similar intended functionality. Any new approach will have to demonstrate added value beyond what process_madvise() provides. Van der Linden concluded the session by saying that he would continue the discussion by posting examples to the mailing list.

Comments (2 posted)

A 2023 DAMON update

By Jonathan Corbet
May 16, 2023

LSFMM+BPF
DAMON is a framework that allows user space to influence and control the kernel's memory-management operations. It first entered the kernel with the 5.15 release, and has been gaining capabilities ever since. At the 2023 Linux Storage, Filesystem, Memory-Management and BPF Summit, DAMON author Seongjae Park provided an overview of the current status of DAMON development and where it can be expected to go in the near future.

At its core, Park began, DAMON provides access-frequency monitoring of memory in both the physical and virtual address spaces. It includes a number of methods to reduce the overhead of that monitoring. DAMOS is an add-on that allows this access information to be turned into memory-management decisions. It works by identifying regions of memory that match a specified access pattern, then applying policies to that region. DAMOS includes a quota mechanism that limits its own CPU overhead and can prioritize certain types of regions. Park has added various APIs over time, enhancing the user-space API and adding actions for control of memory-reclaim operations.

[SeongJae Park] A number of features have been added in the last year. DAMON gained online tuning in 5.19, making it easier to tweak its operating parameters in a running system. LRU-list manipulation arrived in 6.0 (numbered 5.20 during part of its development cycle); it works by detecting hot regions of memory and marking the pages found therein as active. This mechanism, Park said, can reduce memory-pressure stalls by 20% when properly fine-tuned. "DAMOS tried regions" showed up in 6.2; it allows for inspection of memory regions in detail and supersedes an older API based on tracepoints. Tried regions are mostly useful for debugging DAMOS policies, he said. DAMOS filters, merged for 6.3, allow policies to limit actions to given types of pages — only anonymous pages, for example, or those belonging to a specific control group.

Looking forward, Park is planning to add a feature for the monitoring of read-only and write-only access patterns. This could be used, for example, to identify processes that should be migrated to different types of backing store or those that would benefit from kernel samepage merging. He has posted a patch set implementing this feature in the past, but it was blocked due to unrelated memory-management problems; perhaps that situation has been resolved by now, he said. David Hildenbrand pointed out that write-only tracking would not be possible with pages that are pinned in memory, so that data could be incomplete; Park answered that, since DAMON uses sampling to limit overhead, it can't track every single page in any case and pinning doesn't make things any worse.

Park's top priority, he said, might be "feedback-based quota auto tuning". Getting the most out of DAMON requires fine-tuning it to just the right level of aggressiveness; that can be tricky, and the optimal tuning may change over time. A poorly tuned implementation may impose too much overhead on the system, or it may not produce the desired results. To make life easier for administrators, Park intends to add a feedback mechanism telling DAMON about how effective it is being and adjusting parameters automatically. Someday, hopefully, it can be entirely self-tuning.

There are a number of other features in the works as well. A mechanism to control reclaim via the virtio-balloon driver is planned, for example; that could provide better memory allocation on overcommitted hosts.

Park would like to add an in-tree user-space tool for control of the system; this tool currently lives out-of-tree in a GitHub repository. Moving it in-tree would help users to understand how to manage the DAMON ABI, he said, and might also help developers avoid breaking that ABI. Dan Williams answered that moving a tool in-tree doesn't increase the flexibility of the ABI, and might well create more problems than it solves. It is easier to tie tooling to specific kernel versions when that tooling lives in the kernel tree, but doing so breaks out-of-tree users, which is exactly the kind of ABI break that must be avoided. Andrew Morton said that he would need to see a strong argument before he would accept an in-tree tool like that.

Michal Hocko commented that he finds DAMON hard to understand; it is a complex memory-management system that lives outside of the kernel's memory-management subsystem. He would appreciate some good documentation that pulls it all together and shows how a use case might work. An article on "a certain Internet site" might also be beneficial, he said.

Park had more future plans to discuss, but ran out of time in which to cover them. He has posted his slides for those who are interested in looking further.

Comments (2 posted)

High-granularity mappings for huge pages

By Jonathan Corbet
May 17, 2023

LSFMM+BPF
The use of huge pages can make memory management more efficient in a number of ways, but it can also impose costs in the form of internal fragmentation and I/O amplification. At the 2023 Linux Storage, Filesystem, Memory-Management and BPF Summit, James Houghton ran a session on a scheme to get the best of both worlds: using huge pages while maintaining base-page mappings within them.

The problem to be solved, Houghton began, is that performance requirements argue for the use of huge pages to back virtual machines. So far, so good; the hugetlbfs system can be used to implement such a policy. But providers also want to be able to migrate virtual machines using userfaultfd(). Typical migration schemes will pre-copy the machine to be moved, then use userfaultfd() to catch any post-copy writes and re-copy the changed pages before committing the moved machine to the new host. But, if huge pages are in use, the only information available is that an entire (huge) page has been touched. It would be nice to just copy the 4KB base page that the process changed, but there is no way to know where within a 1GB huge page that change was made.
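
For context, the userfaultfd() side of such a scheme looks roughly like the sketch below, which registers a (possibly hugetlbfs-backed) region so that accesses to not-yet-copied pages are reported to a user-space handler; error handling and the fault-serving loop are omitted.

    #include <fcntl.h>
    #include <linux/userfaultfd.h>
    #include <sys/ioctl.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    static int watch_region(void *base, unsigned long len)
    {
        int uffd = syscall(SYS_userfaultfd, O_CLOEXEC | O_NONBLOCK);

        struct uffdio_api api = { .api = UFFD_API };
        ioctl(uffd, UFFDIO_API, &api);

        struct uffdio_register reg = {
            .range = { .start = (unsigned long)base, .len = len },
            .mode  = UFFDIO_REGISTER_MODE_MISSING,
        };
        ioctl(uffd, UFFDIO_REGISTER, &reg);

        /* A handler thread would now read struct uffd_msg events from uffd
         * and respond with UFFDIO_COPY as pages arrive from the source host. */
        return uffd;
    }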

[James Houghton] To address this problem, he posted a patch set in February implementing high-granularity mapping for 1GB hugetlbfs pages. This series allows huge pages to be mapped at the PTE (base-page) level without splitting them. For now, this work only supports x86 systems, but there is an arm64 version written that has not yet been posted to the lists.

There are some challenges in implementing a mechanism like this, he said, starting with the user-space ABI to control it. The approach taken was to implement an MADV_SPLIT operation for madvise() — an approach that David Hildenbrand described as "confusing". This ABI seems likely to change in the future if this work proceeds. There are some implementation challenges as well, Houghton said; keeping track of the page-table entries for high-granularity-mapped pages requires access to the page middle directory (PMD) level of the page-table hierarchy, but the internal APIs do not always pass that in. A number of other internal changes are needed to implement this feature as well, making the implementation of high-granularity mappings into a 46-part patch set.

Michal Hocko suggested that, perhaps, the real problem is that this use case should not be using hugetlbfs as its backing store. Instead, a different way of accessing huge pages that does not bring hugetlbfs's baggage should be found, he said. If the desire is to keep the huge pages from being reclaimed, mlock() can be used. Houghton answered that hugetlbfs can guarantee that huge pages will be available, which is a necessary feature; trying to solve the problem without hugetlbfs would also require copying many of its quirks into the core memory-management code. Hocko said that, perhaps, the right approach would be to create a driver to obtain huge pages from the CMA pool; Houghton thought that might work.

Another audience member asked whether the transparent hugepage mechanism, which already supports high-granularity mappings, could be enhanced to provide guaranteed allocations for this use case. Another attendee said that the memory-management subsystem should, eventually at least, be prepared to handle folios of any size, so the need for special-purpose code for huge pages should go away. If this work could get the kernel closer to that point, that would be a good thing.

Houghton felt, though, that implementing the needed functionality in the core memory-management code would be a lot of work, while containing it in a separate implementation simplifies the task considerably. That implementation shows the core changes that would eventually be needed, and is thus a step in the right direction. It is not clear that the rest of the room was in agreement with this position, though; this feature may not be headed for the mainline in the near future.

Comments (none posted)

Computational storage

By Jake Edge
May 17, 2023

LSFMM+BPF

A new development in the NVMe world was the subject of a combined storage and filesystem session led by Stephen Bates at the 2023 Linux Storage, Filesystem, Memory-Management and BPF Summit. Computational storage namespaces will allow NVMe devices to offer various types of computation—anything from simple compression through complex queries and data manipulations—that can be performed on the data stored on the device.

Bates began the session by noting that he had seen a video of a talk recently where a professor had calculated that about two-thirds of the energy used in "high-scale distributed systems" is spent on data movement. So, large Hadoop clusters or AI-training frameworks are using an enormous amount of energy just to move data around, without actually doing any real work ("not actually changing any bits"). That is done because we generally move our data from storage into memory in order to operate on it, which is sort of inherent in the von Neumann architecture. The storage systems are typically somewhat removed from the processor and, even if they are attached via, say, the PCIe bus, they still take a lot of power.

The idea behind computational storage, he said, is to see if parts of the computation can be pushed out to the storage layer. NVMe is more than just a good storage protocol, it is "an awesome protocol for telling something to do something". That request can be sent via a variety of mechanisms (e.g. direct attached, over fabric) and it can all be done efficiently because of the NVMe design. A command is sent as a 64-byte submission queue entry (SQE), which will result in a 16-byte completion queue entry (CQE) at some later point that gives the completion status of the command. NVMe is "super parallel" with large numbers of queues that can be split up among the cores in an SMP system. The queues can be "incredibly deep" and do not need to be processed in order. In addition, those 64-byte SQEs do not have to request a storage operation; "tell me the temperature in Idaho" could be a valid NVMe command, he said.
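
The two structures have roughly the shape shown below; this is an abbreviated paraphrase of the NVMe base specification, not a verbatim copy of its field layout.

    #include <stdint.h>

    struct nvme_sqe {                /* 64-byte submission queue entry */
        uint8_t  opcode;             /* what to do: read, write, ... or something
                                        that is not a storage operation at all */
        uint8_t  flags;
        uint16_t command_id;         /* matched against the completion */
        uint32_t nsid;               /* namespace the command applies to */
        uint32_t cdw2, cdw3;
        uint64_t metadata;
        uint64_t prp1, prp2;         /* data pointers */
        uint32_t cdw10, cdw11, cdw12, cdw13, cdw14, cdw15;  /* command-specific */
    };

    struct nvme_cqe {                /* 16-byte completion queue entry */
        uint32_t result;             /* command-specific result */
        uint32_t reserved;
        uint16_t sq_head;
        uint16_t sq_id;
        uint16_t command_id;         /* which SQE this completes */
        uint16_t status;             /* success or an error code, plus phase bit */
    };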

An NVMe controller may not just be for a single SSD, but could be some rack-sized storage system with lots of processing power available to it. Two new namespace types are being defined; one is called "subsystem local memory", which is byte-addressable in order to facilitate the computational processing on the local device. It can be populated from a traditional, logical block address (LBA) namespace or from an object-storage namespace; the subsystem local memory becomes the physical address space for the computation.

The second new namespace is for computational programs, which can perform certain actions on the data in the local memory. Those actions could be vendor-specific operations, simple things like compression, the execution of eBPF programs, and more. In theory, you could provide a Docker image to the NVMe device and it could run on the local data, he said. In some measurements that he and his colleagues have done, they have seen order-of-magnitude reductions in energy usage by pushing computations out to the storage devices.

The challenge is in figuring out how to break up the application's computation to take advantage of this. There are various efforts from Samsung and within the Storage Networking Industry Association (SNIA) to define a user-space API for computational storage. There are plans for a library that would let applications discover devices and their computational capabilities; it would provide frameworks for pushing programs in various languages out to the devices. There would also be interfaces for executing the programs and gathering the results.

Bates said that it is unlikely that support for this goes directly into the kernel. He noted that storage-track organizer Javier González and Samsung have been working on adding infrastructure to io_uring to allow passing arbitrary NVMe commands from user space, though there are some security concerns that need to be worked out. That way the kernel would not need to learn about these new commands; it would simply send the 64-byte messages and return the 16-byte replies. That ability will be useful even beyond the computational-storage realm as it will allow experimenting with new NVMe commands. As new commands get proven, they could then move into the kernel proper.
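
In rough terms, that passthrough path looks like the sketch below; it assumes a ring created with IORING_SETUP_SQE128, an open NVMe generic character device (e.g. /dev/ng0n1), and that unused SQE fields are zeroed, with setup and completion handling omitted.

    #include <liburing.h>
    #include <linux/nvme_ioctl.h>
    #include <string.h>

    static int submit_nvme_cmd(struct io_uring *ring, int ng_fd,
                               const struct nvme_uring_cmd *cmd)
    {
        struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

        if (!sqe)
            return -1;

        sqe->opcode = IORING_OP_URING_CMD;   /* driver-interpreted command */
        sqe->fd     = ng_fd;                 /* the NVMe character device */
        sqe->cmd_op = NVME_URING_CMD_IO;     /* pass an NVMe I/O command through */
        /* The NVMe command descriptor rides in the extended SQE area. */
        memcpy(sqe->cmd, cmd, sizeof(*cmd));

        return io_uring_submit(ring);
    }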

When Bates started looking at this problem, he was focused on SSDs with computational storage, but these days he thinks that storage arrays with computational-storage facilities are the more compelling target. He also sees the object-storage namespace as being important for doing computations; the mapping from files to LBAs in filesystems is complicated, so avoiding that completely makes sense. If a filesystem is being used, the library for computational storage would need to be able to turn a file name into a list of LBAs to operate on. González said that some of the work on ublk may help with parts of that.

For storage arrays, the cores controlling the arrays may only be fully utilized when the array is writing at its maximum rate, Bates said. At other times, there may be excess capacity and power availability in an array for doing other kinds of work. Some of his customers have workloads that are highly time-dependent, so there are largely idle periods where computation could be done on the storage array, which is much more efficient than moving the data to host systems for performing the computation.

Comments (6 posted)

FUSE passthrough for file I/O

By Jake Edge
May 17, 2023

LSFMM+BPF

There are some filesystems that use the Filesystem in Userspace (FUSE) framework but only to provide a different view of an underlying filesystem, such as different file metadata, a changed directory hierarchy, or other changes of that sort. The read-only filtered filesystem, which simply filters the view of which files are available, is one example; the file data could come directly from the underlying filesystem, but currently needs to traverse the FUSE user-space server process. Finding a way to bypass the server, so that the file I/O operations go directly from the application to the underlying filesystem would be beneficial. In a filesystem session at the 2023 Linux Storage, Filesystem, Memory-Management and BPF Summit, Miklos Szeredi wanted to explore different options for adding such a mechanism, which was referred to as a "FUSE passthrough"—though "bypass" might be a better alternative.

[Miklos Szeredi]

The mechanism needs to establish a file mapping, so that the file descriptor used by the application connects to the file on the underlying filesystem, in order to bypass the FUSE server. There is a question of what the granularity of the file mapping should be, Szeredi said. It could simply be the whole file, or perhaps blocks or byte ranges. There is also a question about what is used to reference the underlying file; an open file descriptor passed in a FUSE message would work, but there is a security concern regarding that. One way around that restriction would be to create an ioctl() command to establish the mapping.
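
A purely illustrative sketch of the ioctl()-based approach follows: the ioctl number and structure below are hypothetical, not an existing kernel interface. The idea is that, when the FUSE server handles an open request, it tells the kernel which underlying file should back the application's file descriptor, so that subsequent reads and writes bypass the server.

    #include <fcntl.h>
    #include <sys/ioctl.h>

    struct fuse_passthrough_map {           /* hypothetical structure */
        unsigned long unique;               /* FUSE request being answered */
        int           backing_fd;           /* open fd on the underlying file */
    };

    /* Hypothetical ioctl command, for illustration only. */
    #define FUSE_IOC_PASSTHROUGH_MAP  _IOW('F', 0x80, struct fuse_passthrough_map)

    static int map_backing_file(int fuse_dev_fd, unsigned long unique,
                                const char *backing_path)
    {
        struct fuse_passthrough_map map = {
            .unique     = unique,
            .backing_fd = open(backing_path, O_RDWR),
        };

        if (map.backing_fd < 0)
            return -1;
        return ioctl(fuse_dev_fd, FUSE_IOC_PASSTHROUGH_MAP, &map);
    }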

Filesystem-track organizer Amir Goldstein wondered why the ioctl() was needed. An attendee said that there were problems because programs can be tricked into doing a write() to the FUSE daemon using some, perhaps privileged, file descriptor, but that it is much harder to trick a program into doing a particular ioctl() command. Christian Brauner said that the seccomp notifier API uses ioctl() commands for the same reason.

There was some discussion around why the problem being solved here was not more widespread, without reaching much of a clear conclusion; adopting the ioctl() mechanism seems prudent, at least for now. This email from Jann Horn, which Szeredi referenced when he suggested the topic, may shed further light on the problem. This was followed by some ... rather hard to follow ... discussion of a grab bag of different things that needed to be worked out, including the lifetime of the mapping and whether different user namespaces would create complications. "Namespaces are horrible", David Howells said.

There are several potential solutions for ways to bypass the FUSE server for reads and writes so that those can go directly to the underlying filesystem. The most recent of those solutions is fuse-bpf, which has a wider scope but could perhaps provide the needed functionality. Its developer, Daniel Rosenberg, was on hand to describe how that filesystem might fit into the picture. Another fuse-bpf session was held on the last day of LSFMM+BPF, as a combined filesystem and BPF session, coverage of which will be coming in due course.

One goal of the fuse-bpf effort is to be as easy to use as FUSE is, Rosenberg said. There is a set of calls that is "mirroring what the FUSE user-space calls would be doing". There are two hooks available for adding BPF filtering both before and after the filesystem operation is performed. The pre-filter allows changing some of the input parameters to the operation, while the post-filter can change the output data and error code.

Howells asked if fuse-bpf could be tricked to run arbitrary BPF programs, perhaps even from remote sources. Rosenberg said that the BPF programs have to be registered with FUSE ahead of time. "This is no more dangerous than any other BPF", an attendee said, to general laughter.

There was some discussion of how fuse-bpf could be used for passthrough, but the read and write paths for that are not yet fully implemented, Rosenberg said. Beyond the BPF filters, there are also regular FUSE filters that can be applied; those might be used to prototype a BPF filter, to filter on more arguments than the BPF filters currently support, or to perform some operation that the BPF verifier will reject. With a grin, he asked if there were "any questions about this thing that I have not fully explained until Wednesday", referring to his upcoming talk. It was agreed that the ordering of the sessions was a tad unfortunate, but that a more cohesive overview of fuse-bpf would be forthcoming.

Comments (1 posted)

Page editor: Jonathan Corbet


Copyright © 2023, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds