Getting Lustre upstream
The Lustre filesystem has a long history, some of which intersects with Linux. It was added to the staging tree in 2013, but was bounced out of staging in 2018, due to a lack of progress and a development model that was incompatible with the kernel's. Lustre may be working its way back into the kernel, though. In a filesystem-track session at the 2025 Linux Storage, Filesystem, Memory Management, and BPF Summit (LSFMM+BPF), Timothy Day and James Simmons led a discussion on how to get Lustre into the mainline.
Day began with an overview of Lustre, which is a "high-performance parallel filesystem". It is typically used by systems with lots of GPUs that need to be constantly fed with data (e.g. AI workloads) and for checkpointing high-performance-computing (HPC) workloads. A file is split up into multiple chunks that are stored on different servers. Both the client and server implementations run in the kernel, similar to NFS. For the past ten or more years, the wire and disk formats have been "pretty stable" with "very little change"; Lustre has good interoperability between different versions, unlike in the distant past where both server and client needed to be on the same version.
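For illustration only, here is a sketch of how a simple round-robin striping scheme maps a file offset to a storage target and an offset within that target's object; the names and the fixed-stripe assumption are made up for this example and are not Lustre's actual layout code (real Lustre layouts can be far more elaborate):

```c
/* Simplified, hypothetical round-robin striping; not Lustre's layout code. */
#include <stdint.h>
#include <stdio.h>

struct stripe_layout {
	uint64_t stripe_size;   /* bytes per stripe chunk, e.g. 1 MiB */
	uint32_t stripe_count;  /* number of storage targets used */
};

/* Which storage target (0..stripe_count-1) holds this file offset? */
static uint32_t stripe_index(const struct stripe_layout *l, uint64_t offset)
{
	return (offset / l->stripe_size) % l->stripe_count;
}

/* Offset within that target's object for the same file offset. */
static uint64_t stripe_offset(const struct stripe_layout *l, uint64_t offset)
{
	uint64_t chunk = offset / l->stripe_size;

	return (chunk / l->stripe_count) * l->stripe_size
	       + offset % l->stripe_size;
}

int main(void)
{
	struct stripe_layout l = { .stripe_size = 1 << 20, .stripe_count = 4 };
	uint64_t off = 5ULL << 20;  /* 5MiB into the file */

	printf("offset %llu -> target %u, object offset %llu\n",
	       (unsigned long long)off, stripe_index(&l, off),
	       (unsigned long long)stripe_offset(&l, off));
	return 0;
}
```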
![Timothy Day](https://static.lwn.net/images/2025/lsfmb-day-sm.png)
The upstreaming project has been going on for a long time at this point, he said. A fork of the client was added to the staging tree and resided there for around five years before "it got ejected, essentially due to insufficient progress". It was a "bad fit" for the kernel, since most developers worked on the out-of-tree version, rather than what was in staging.
But "the dream of actually getting upstream still continued
". There
have been more than 1000 patches aimed at getting the code ready for the
kernel since it got ejected; around 600 of those were from Neil Brown and 200
came from Simmons. Roughly 1/3 of the patches that have gone into the
out-of-tree repository since the staging removal
have been related to the upstream goal, Simmons said.
Day said that the biggest question is how the project can move from its out-of-tree development model to one that is based around the upstream kernel repository. The current state is "a giant filesystem, #ifdef-ed to hell and back to get it working with a bunch of kernel versions".
The next stage, which is currently being worked on and is slated to
complete in the next year or so, is to split the compatibility code out of
the core filesystem code; the goal is to eventually have two separate trees
for those pieces. The core filesystem tree would go into the kernel tree,
while the compatibility code, which is meant to support customers on
older kernels, would continue to
live in a Lustre repository.
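As a rough sketch of what such a split implies (the header and function names here are hypothetical, not taken from Lustre), the version-dependent glue can be confined to a compatibility header kept in the Lustre repository, so the core code that goes upstream stays free of kernel-version #ifdefs:

```c
/* lustre_compat.h -- hypothetical compatibility shim kept in the
 * out-of-tree repository; the core filesystem code would call the
 * wrapper instead of sprinkling version checks everywhere. */
#ifndef LUSTRE_COMPAT_H
#define LUSTRE_COMPAT_H

#include <linux/version.h>
#include <linux/fs.h>

#if LINUX_VERSION_CODE < KERNEL_VERSION(5, 12, 0)
/* Older kernels lack some helper; provide a local fallback.
 * (Purely illustrative -- not a real missing symbol.) */
static inline void lustre_compat_example(struct inode *inode)
{
	/* backport shim for older kernels goes here */
}
#else
static inline void lustre_compat_example(struct inode *inode)
{
	/* nothing to do on current kernels */
}
#endif

#endif /* LUSTRE_COMPAT_H */
```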
Another area that needs attention is changes to the development process to better mesh with kernel development. The Lustre project does not use mailing lists; it uses a Gerrit instance instead. "We have got to figure out how to adapt." Simmons said that there are some developers who are totally Gerrit-oriented and some who could live with a mailing list; "we have to figure out how to please both audiences".
Amir Goldstein said that the only real requirement is that the project post the patches once to the mailing list before merging; there is no obligation to do patch review on the list. Simmons said that he and Brown have maintained a Git tree since Lustre was removed from staging; it is kept in sync and updated to newer kernels. All of the patches are posted to the lustre-devel mailing list and to a Patchwork instance, so all of the history is open and available for comments or criticism, he said.
![James Simmons](https://static.lwn.net/images/2025/lsfmb-simmons-sm.png)
Josef Bacik asked about what went wrong when Lustre was in the staging tree. From his perspective, Lustre has been around for a long time and there are no indications that it might be abandoned, so why did it not make the jump into the mainline fs/ directory? Simmons said that the project normally has a few different features being worked on at any given time, but Greg Kroah-Hartman, who runs the staging tree, did not want any patches that were not cleanups. So the staging version fell further and further behind the out-of-tree code. Bacik said that made sense to him; "that's like setting you up to fail".
Christian Brauner said that he would like to come up with a "more streamlined model" for merging new filesystems, where the filesystem community makes a collective decision on whether the merge should happen. The community has recently been "badly burned by 'anybody can just send a filesystem for inclusion' and then it's upstream and then we have to deal with all of the fallout". As a VFS maintainer, he does not want to be the one making the decision, but merging a new filesystem "puts a burden on the whole community", so it should be a joint decision.
Bacik reiterated that no one was concerned that Lustre developers were going to disappear, but that there are other concerns. It is important that Lustre is using folios everywhere, for example, and is using "all of the modern things"; that sounds a little silly coming from him, since Btrfs is still only halfway there, he said. Simmons said that the Lustre developers completely agree; there is someone working on the folio conversion currently. At the summit, he and Day have been talking with David Howells about using his netfs library.
Jeff Layton asked if the plan was to merge both the client and the server. Simmons said that most people are just asking for the client and that is a slower-moving code base. The client is "a couple-hundred patches per month, the server is three to four times the volume of patches", which makes it harder to keep up with the kernel. Layton said: "baby steps are good", though Day noted that it is harder to test Lustre without having server code in the kernel.
The reason he has been pushing for just merging the client, Ted Ts'o said, is that the server is "pretty incestuous with ext4". The server requires a bunch of ext4 symbols and there is "a need to figure out how to deal with that". That impacts ext4 development, but other changes, such as the plan to rewrite the jbd2 journaling layer to not require buffer heads, may also be complicated by the inclusion of the Lustre server, he said.
Simmons asked about posting patches to the linux-fsdevel mailing list before Lustre is upstream so that the kernel developers can start to get familiar with the code. Bacik said that made sense, but that he would not really dig into the guts of Lustre; he is more interested in the interfaces being used and whether there will be maintenance problems for the kernel filesystem community in the Lustre code. Goldstein suggested setting up a Lustre-specific mailing list, but Simmons noted that lustre-devel already exists and is being archived; Brauner suggested getting it added to lore.kernel.org.
The intent is that, when Linus Torvalds receives a pull request for a new filesystem, he can see that the code has been publicly posted prior to that, Goldstein said. It will also help if the Git development tree has a mirror on git.kernel.org, Ts'o said. Bacik said that he thought it was probably a lost cause to try to preserve all of the existing Git history as part of the merge, though it is up to Torvalds; instead, he suggested creating a git.kernel.org archive tree that people can consult for the history prior to the version that gets merged.
Given that Lustre targets high performance, Ts'o said, it will be important to support large folios. Simmons said that someone was working on that, and that it is important to the project; the plan is to get folio support, then to add large folios. Matthew Wilcox said that was fine, as long as the page-oriented APIs were getting converted. Many of those APIs are slowly going away, so the Lustre developers will want to ensure the filesystem is converted ahead of those removals.
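To make the folio discussion concrete, here is a minimal sketch (not Lustre code) of a folio-based read path and the large-folio opt-in; the my_fs_* names are invented for illustration, while read_folio, folio_size(), folio_pos(), and mapping_set_large_folios() are existing kernel interfaces:

```c
/* Hypothetical filesystem glue (the my_fs_* names are invented) showing
 * the folio-era interfaces referred to in the discussion. */
#include <linux/fs.h>
#include <linux/pagemap.h>
#include <linux/printk.h>

static int my_fs_read_folio(struct file *file, struct folio *folio)
{
	/* folio_size() can exceed PAGE_SIZE once the mapping allows
	 * large folios, so the fill path must not assume 4KB pages. */
	size_t len = folio_size(folio);
	loff_t pos = folio_pos(folio);

	pr_debug("filling %zu bytes at offset %lld\n", len, (long long)pos);
	/* ... issue network RPCs to fill the range [pos, pos + len) ... */

	folio_mark_uptodate(folio);
	folio_unlock(folio);
	return 0;
}

static const struct address_space_operations my_fs_aops = {
	.read_folio	= my_fs_read_folio,
	/* .readahead, .writepages, ... */
};

/* Inode setup: opt the mapping in to large folios so the page cache
 * may hand the filesystem multi-page folios. */
static void my_fs_init_mapping(struct inode *inode)
{
	inode->i_mapping->a_ops = &my_fs_aops;
	mapping_set_large_folios(inode->i_mapping);
}
```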
| Index entries for this article | |
| --- | --- |
| Kernel | Filesystems/Lustre |
| Conference | Storage, Filesystem, Memory-Management and BPF Summit/2025 |
Posted Jun 18, 2025 14:36 UTC (Wed)
by koverstreet (✭ supporter ✭, #4296)
[Link] (2 responses)
Was there any elaboration on what fallout he was referring to?
Posted Jun 18, 2025 15:10 UTC (Wed)
by hailfinger (subscriber, #76962)
[Link] (1 responses)
Posted Jun 18, 2025 15:23 UTC (Wed)
by koverstreet (✭ supporter ✭, #4296)
[Link]
And then of course there was the bcachefs merger, which has had its ups and downs :) I didn't make it this year so I couldn't take part, but I'd generally agree with Christian that there should be more of a process.
My brief take - the kernel community can be very "missing the forest for the trees", lots of focus on individual patches or individual rules, while at times it seems to me the high level discussions about design or goals are a bit absent.
The bit about Greg having a "cleanups only" rule for staging - that seems entirely typical, there should've at some point been a discussion about whether that makes sense; you simply can't expect an entire community to bend to a rule like that. I've also heard the staging merger didn't have the involvement of most of the Lustre community, so - bad communication all around, perhaps. Good communication is hard, such is life :)
Posted Jun 18, 2025 16:58 UTC (Wed)
by akkornel (subscriber, #75292)
[Link]
Upstreaming Lustre was also talked about in the recent LUG 2025 conference, which was held on April 1st & 2nd at Stanford University. The session recording is available to watch, and all the slides and talks are available.
Posted Jun 18, 2025 18:30 UTC (Wed)
by geofft (subscriber, #59789)
[Link] (9 responses)
Posted Jun 18, 2025 19:07 UTC (Wed)
by tim-day-387 (subscriber, #171751)
[Link] (7 responses)
As for writing a Lustre FUSE driver, one big issue is networking: Lustre has a preexisting wire protocol that would prevent us from using something like libfabric. All of the user space networking would have to be re-implemented. Also, both the client and server exist in the kernel - so both would need to be ported to user space.
Posted Jun 19, 2025 2:03 UTC (Thu)
by geofft (subscriber, #59789)
[Link] (2 responses)
But overall, yeah, that all makes sense. (On the existing-code front, I sort of wish porting were easier - I feel like I saw at some point some vaguely-kernel-API-compatible thing for userspace filesystems. But certainly if that doesn't exist in a production-ready high-performance state then that's not helpful to you.)
Posted Jun 19, 2025 3:11 UTC (Thu)
by tim-day-387 (subscriber, #171751)
[Link]
I'm thinking in terms of opportunity cost. Always more features that can be added to Lustre. :)
> I have to imagine a good chunk of Lustre's users are people who are conservative about kernel versions in various ways, and offering them something that is as good as you can get in kernelspace, but lets you run a newer Lustre version against an older kernel version, lets you patch Lustre without pushing out a kernel patch to the entire fleet, etc., feels like it ought to be a big advantage.
We would be limited if older kernels don't have the FUSE performance enhancements of the latest kernels. And without a kernel module, we'd have less flexibility to fix that. Even CentOS 7.9 still has some traction in HPC (although it's close to going away) - so some of the kernels are fairly ancient.
> But overall, yeah, that all makes sense. (On the existing-code front, I sort of wish porting were easier - I feel like I saw at some point some vaguely-kernel-API-compatible thing for userspace filesystems. But certainly if that doesn't exist in a production-ready high-performance state then that's not helpful to you.)
There were some efforts to reuse existing filesystem kernel drivers to mount untrusted disk images in userspace via FUSE. I think the goal was to create a path to deprecate abandoned disk filesystems in the kernel. I can't find any links to this right now....
For Lustre (or NFS or Ceph kernel driver), we'd still have the network problem with a solution like this.
Posted Jun 19, 2025 3:17 UTC (Thu)
by pabs (subscriber, #43278)
[Link]
LKL?
Posted Jun 19, 2025 14:40 UTC (Thu)
by csamuel (✭ supporter ✭, #2624)
[Link]
Yeah I was going to mention the Lnet part of Lustre as another potential barrier to fuse - especially as some systems have other kernel consumers of that (like HPE/Cray's DVS filesystem projection layer). There's also the existence of Lnet routers which might not mount Lustre but instead shuffle Lustre traffic between different interconnects (our old Cray XC used them to go between the Aries fabric on the inside and the IB fabric where the Lustre storage lived using RDMA on both sides).
> that would prevent us from using something like libfabric.
HPE has kfilnd for lnet for their Slingshot fabric and I had thought that the kfabric part in that was some reference to libfabric given HPE/Cray's use in other contexts but reading Chris Horn's slides from 2023 I can see I was mistaken, it's something altogether different (insert Picard facepalm emoji here). https://www.opensfs.org/wp-content/uploads/LUG2023-kfilnd...
Posted Jun 24, 2025 14:18 UTC (Tue)
by cpitrat (subscriber, #116459)
[Link] (2 responses)
This sounds like a very "Lustre-sided" vision though. I'm external to this whole world, but I can easily imagine that, from the perspective of a generic Linux filesystem developer, maintainer, or even user, a Lustre on FUSE could be much more attractive: maintainers wouldn't have to worry about what happens if nobody maintains Lustre anymore, everybody would benefit from performance improvements in FUSE that may be driven by Lustre's usage, Lustre users would get all the benefits of a user-space implementation, etc ...
Posted Jun 24, 2025 15:14 UTC (Tue)
by Wol (subscriber, #4433)
[Link] (1 responses)
Which are?
I believe the article lists the DISbenefits of a FUSE implementation, which currently includes a severe performance hit, made worse by the current use cases. The projected benefits aren't worth the candle.
Something to bear in mind, with ALL suggestions of improvements like that, is do the *current* users see what you're offering as a benefit, or a hindrance. If it's the latter, feel free to have a crack at it, but don't expect them to help you ...
Always remember, selling something based on its value to YOU is very good at pissing other people off if you don't LISTEN. It's very good at turning me into an anti-customer.
Cheers,
Wol
Posted Jun 24, 2025 18:26 UTC (Tue)
by cpitrat (subscriber, #116459)
[Link]
Which is kind of my point. Again, I'm external to this and I might be wrong, but my read of the situation is that people working on Lustre want it in the kernel because they see the benefit of it (less work, better user experience), but from the kernel point of view, it's a high risk of ending up having to maintain a filesystem with users but no dedicated maintainer.
Posted Jun 19, 2025 10:59 UTC (Thu)
by Cyberax (✭ supporter ✭, #52523)
[Link]
FUSE is getting faster, but it's still several times slower than the in-kernel filesystems.
Posted Jun 18, 2025 20:19 UTC (Wed)
by error27 (subscriber, #8346)
[Link] (2 responses)
My impression was that the Lustre devs viewed all the random newbie changes as too risky to push to customers so that's why the kernel.org version was never the real version that anyone used. I reviewed the git log for Lustre and there were two newbie patches which did end up causing trouble. We've improved our processes since then. There were some oldbie patches which also caused problems.
It really doesn't work when the kernel.org version is not the real upstream. The diff only gets larger and riskier for customers over time.
Posted Jun 18, 2025 21:04 UTC (Wed)
by tim-day-387 (subscriber, #171751)
[Link] (1 responses)
Staging (in my view) was sort of a chicken-and-egg problem. Customers won't use unofficial clients and have an affinity for older kernels. And vendors won't support a client (or kernel version) that customers don't use. But if the official Lustre client is just the latest kernel.org client backported to old kernels, that sidesteps the issue. Of course, plenty of people will use the client natively on the latest kernels. But we'll still have a path for supporting old distro kernels as well.
[1] https://wiki.lustre.org/Lustre_Upstreaming_to_Linux_Kernel
[2] https://github.com/LINBIT/drbd/tree/master
Posted Jun 18, 2025 21:12 UTC (Wed)
by johill (subscriber, #25196)
[Link]
Posted Jun 19, 2025 2:17 UTC (Thu)
by ejr (subscriber, #51652)
[Link] (4 responses)
SSDs arrived for the metadata issues, but are they now enough when the storage itself is NVMe? I have a suspicion that some reactions are out of date. I do believe mine are. Ceph has similar issues but is "built around" them. How out-dated are my Lustre views (eight-ten years)?
Posted Jun 19, 2025 5:17 UTC (Thu)
by sthiell (guest, #177937)
[Link]
Today, Lustre is quite stable and extremely performant on many aspects. Still not that easy to manage: you will need to invest quite some time on the userland/tooling part or bet on a vendor to do that for you. But just watch "Lustre DNE3 - Busting the small files myth" from LUG 2025 at https://wiki.lustre.org/Lustre_User_Group_2025 on how Lustre is used by finance folks, they have filesystems with 25B inodes and 200k/s getattr per MDT (they are not even using Infiniband, with it they could perhaps double that number) and a lot of MDTs – a rare share of info in this talk.
Posted Jun 19, 2025 15:01 UTC (Thu)
by csamuel (✭ supporter ✭, #2624)
[Link] (2 responses)
That's not something I've run into (yet) with Lustre, but I know people do sometimes complain about "ls" being slow for which there can be a number of reasons:
* anything that requires the size of a file means going out and querying all the OSTs where that file resides to see how much data is there
* if you had a file striped over many OSTs then that means needing to talk to many OSTs
* if you have a directory with a lot of files in it those may well reside on many OSTs, see above
* if you have a directory with a lot of large files in it which are all striped across many OSTs then multiply the above two pain points together
* in older Linux versions "ls -F" and "ls -C" for various indications of the type of object would call stat() which would request everything about a file, including the file size
Thankfully newer distros have an ls which uses statx() and so can say whether or not it cares about the size of the file so a straight "ls" might be much better these days (in the old days \ls or unsetting whatever distro provided aliases were present might be necessary to make it liveable).
Of course "ls -l" does care about that. I did have a vague memory that maybe Lustre caches some of this now, but I cannot for the life of me find a reference to it. :-(
Posted Jun 19, 2025 18:31 UTC (Thu)
by tim-day-387 (subscriber, #171751)
[Link] (1 responses)
You're probably thinking of Lazy Size-on-Metadata [1] - the feature that caches file size on MDS. LSoM isn't automatic, so preexisting tools like `ls` won't use it. There are special Lustre tools to fetch that data. LSoM became less important once statx() gained more adoption [2] and we could avoid all those OSS RPCs you mentioned (at least, when we don't care about size).
Lustre has a couple other features that accelerate stat() heavy commands like `ls` - DNE (i.e. multiple metadata servers) and statahead (i.e. prefetching stat() data in bulk). So those can help quite a bit as well.
[1] https://doc.lustre.org/lustre_manual.xhtml#lsom
[2] https://jira.whamcloud.com/browse/LU-11554
Posted Jun 25, 2025 13:41 UTC (Wed)
by csamuel (✭ supporter ✭, #2624)
[Link]
> You're probably thinking of Lazy Size-on-Metadata [1] - the feature that caches file size on MDS. LSoM isn't automatic, so preexisting tools like `ls` won't use it. There are special Lustre tools to fetch that data. LSoM became less important once statx() gained more adoption [2] and we could avoid all those OSS RPCs you mentioned (at least, when we don't care about size).
Ah right, that sounds likely, thank you!
> Lustre has a couple other features that accelerate stat() heavy commands like `ls` - DNE (i.e. multiple metadata servers) and statahead (i.e. prefetching stat() data in bulk). So those can help quite a bit as well.
Yeah, we've got 15 metadata servers in our production Lustre filesystem (about 1 per 24 OSS's) which helps a great deal.
I'm not much of a Lustre person so I'd missed statahead, thanks for that! Looks like 2.16 has a handy statahead improvement too with LU-14139 landed, we're still on 2.15 from our vendor.
All the best!
Chris
Posted Jun 19, 2025 3:05 UTC (Thu)
by pabs (subscriber, #43278)
[Link]
https://web.archive.org/web/20201128141136/https://lore.k...
Posted Jun 20, 2025 7:05 UTC (Fri)
by taladar (subscriber, #68407)
[Link] (1 responses)
I feel this hard requirement to use mailing lists instead of the kind of tools literally every modern software project uses is going to be a huge barrier for the kernel to get new contributors in the coming years as old developers retire and/or die of old age.
Posted Jun 20, 2025 7:27 UTC (Fri)
by neilbrown (subscriber, #359)
[Link]
Whatever tooling the kernel community decided to standardize on, there would be 5 other perfectly usable (but somewhat annoying) tools that someone would want us to use. The network effect doesn't require the best in any sense, only that we can all use the same interaction mechanism.
Posted Jun 21, 2025 21:37 UTC (Sat)
by neilbrown (subscriber, #359)
[Link] (3 responses)
By far the biggest problem in upstreaming is the required change to work practices of the many people working on lustre. It is currently all one big project including kernel code, tools code, and documentation. It will have to be multiple projects just like other filesystems. e.g. e2fsprogs and kernel-ext4/jbd2. When splitting lustre up the process will be much smoother if the lines are clean.
Separating the utils and doco out from the kernel code should be easy enough - there is plenty of precedent. Separating client from server is a very different question. There is a lot of shared code and it isn't cleanly marked. There are well over 200 "#ifdef HAVE_SERVER_SUPPORT" and quite a bit of server code that isn't marked at all. There are better lines along which the kernel code can be split.
Lustre has an "OSD" abstraction - Object Storage Device. There is an "osd-zfs" implementation which supports zfs. This isn't going upstream because zfs isn't upstream so it will have to maintained separately anyway. There is "osd-ldiskfs" which contains a lightly edited version of ext4. This is the part that concerns Ted Ts'o. It cannot go upstream in a hurry. But there is work on an in-memory osd. https://jira.whamcloud.com/browse/LU-17995 This isn't ready yet, but could well be by the time lustre might start being submitted. Having this in-kernel and the other osds in separate projects could work well.
The other line that it makes sense to split the project on is the "LND" interface - Lustre Network Device. socklnd which uses TCP sockets would naturally go upstream. o2iblnd which supports infiniband might go as well though it might make sense to delay that. I don't know much about the other lnds but keeping them out at least while the dust settles could certainly make sense.
So there will HAVE to be some kernel code which is maintained out-of-tree - osd-zfs at the very least. But keeping that to a minimum would be best. It isn't clear to me that splitting out the "server" from the "client" to make it a separate project really benefits anyone....
Posted Jun 21, 2025 22:43 UTC (Sat)
by tim-day-387 (subscriber, #171751)
[Link] (2 responses)
So I think we can revisit the decision. We don't really have to decide until we start posting patches on the list. The work we'd need to do in the interim is pretty much the same either way.
> But there is work on an in-memory osd. https://jira.whamcloud.com/browse/LU-17995 This isn't ready yet, but could well be by the time lustre might start being submitted.
This work will definitely be done before upstreaming.
> The other line that it makes sense to split the project on is the "LND" interface - Lustre Network Device. socklnd which uses TCP sockets would naturally go upstream. o2iblnd which supports infiniband might go as well though it might make sense to delay that. I don't know much about the other lnds but keeping them out at least while the dust settles could certainly make sense.
One issue with o2iblnd is GPUDirect Storage support. The current implementation relies on the out-of-tree GDS driver which depends on the out-of-tree display driver [2]. But NVIDIA now supports in-tree P2PDMA [3][4]. So we should be able to work around this now. But it'll require some refactoring of o2iblnd. I'm not sure if all of this work will be done before upstreaming.
[1] https://wiki.lustre.org/images/f/f9/LUG2025-Lustre_Upstre...
[2] https://github.com/NVIDIA/gds-nvidia-fs
[3] https://developer.nvidia.com/gpudirect-storage
[4] https://www.kernel.org/doc/html/latest/driver-api/pci/p2p...
Posted Jun 22, 2025 23:17 UTC (Sun)
by neilbrown (subscriber, #359)
[Link] (1 responses)
thanks for the extra context.
You say:
> Almost everyone was in favor of doing the client first i.e. start small.
and I have to wonder how many of those "Almost everyone" were actually considering contributing to the effort.
As we all know, Linux is not a democracy. Decisions are made by those who do the work - whether development, review, testing, documentation, maintenance etc. Decisions aren't made by votes at conferences or complaints in comment threads like this :-)
As you seem to agree, I don't think there is anything "small" about starting by splitting the client out from the server - it would be a large and risky effort with minimal real gain. And I don't think "Start small" is in any way a useful sort of goal. We need to "start well" which we have done, and we need to continue coherently.
We started (as you know) by aligning the Lustre code base with Linux style - white space, comments, indentation. By using Linux support code in place of alternates: wait_event() macros, rhashtables, extent-tree, ringbuffer for tracing etc. By ensuring we support IPv6 because the networking maintainers require that (and the community wanted it but was never quite motivated enough until we realised it would be a firm block to merging).
We are continuing by looking at splitting out the utils/doc into a separate package (I think you were working on that??) and talking internally about alternate processes to get people used to the idea.
So when we eventually submit some patches I don't think we should accept anyone saying that this is "starting", whether small or otherwise. It is an important step in a process, but just one of several important steps. And the code that we land should look like what we want to have in the kernel - not some trimmed-down, un-testable "lite" version. I think one of the problems with the first upstreaming was that it was client only.
So as one of those people who has done a lot of work in this direction, and may well do some more, I vote that, as part of continuing well, we land client and server together.
Posted Jun 23, 2025 16:53 UTC (Mon)
by tim-day-387 (subscriber, #171751)
[Link]
That's probably the biggest problem with splitting the client and server. No one wants to work on it. :)
> We are continuing by looking at splitting out the utils/doc into a separate package (I think you were working on that??) and talking internally about alternate processes to get people used to the idea.
Yeah, I've been trying to move things to their final resting places. Split up user space / kernel code and split up core filesystem / backward compatibility code. Ideally, the core filesystem code (under lnet/ and lustre/) would be a verbatim copy of what we eventually upstream. So I've been slowly working towards that.
> So as one of those people who has done a lot of work in this direction, and may well do some more, I vote that, as part of continuing well, we land client and server together.
I agree. I think it's a matter of convincing everyone (in and outside the Lustre community) that upstreaming both components together is the most likely path to success. We'd need to highlight the practical challenges of doing the split - I don't think I emphasized that enough at LSF.