
Getting Lustre upstream

By Jake Edge
June 18, 2025

LSFMM+BPF

The Lustre filesystem has a long history, some of which intersects with Linux. It was added to the staging tree in 2013, but was bounced out of staging in 2018, due to a lack of progress and a development model that was incompatible with the kernel's. Lustre may be working its way back into the kernel, though. In a filesystem-track session at the 2025 Linux Storage, Filesystem, Memory Management, and BPF Summit (LSFMM+BPF), Timothy Day and James Simmons led a discussion on how to get Lustre into the mainline.

Day began with an overview of Lustre, which is a "high-performance parallel filesystem". It is typically used by systems with lots of GPUs that need to be constantly fed with data (e.g. AI workloads) and for checkpointing high-performance-computing (HPC) workloads. A file is split up into multiple chunks that are stored on different servers. Both the client and server implementations run in the kernel, similar to NFS. For the past ten or more years, the wire and disk formats have been "pretty stable" with "very little change"; Lustre has good interoperability between different versions, unlike in the distant past where both server and client needed to be on the same version.

[Timothy Day]

The upstreaming project has been going on for a long time at this point, he said. A fork of the client was added to the staging tree and resided there for around five years before "it got ejected, essentially due to insufficient progress". It was a "bad fit" for the kernel, since most developers worked on the out-of-tree version, rather than what was in staging.

But "the dream of actually getting upstream still continued". There have been more than 1000 patches aimed at getting the code ready for the kernel since it got ejected; around 600 of those were from Neil Brown and 200 came from Simmons. Roughly 1/3 of the patches that have gone into the out-of-tree repository since the staging removal have been related to the upstream goal, Simmons said.

Day said that the biggest question is how the project can move from its out-of-tree development model to one that is based around the upstream kernel repository. The current state is "a giant filesystem, #ifdef-ed to hell and back to get it working with a bunch of kernel versions". The next stage, which is currently being worked on and is slated to complete in the next year or so, is to split the compatibility code out of the core filesystem code; the goal is to eventually have two separate trees for those pieces. The core filesystem tree would go into the kernel tree, while the compatibility code, which is meant to support customers on older kernels, would continue to live in a Lustre repository.
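
As a rough illustration of where that split is headed (the macro and helper names here are hypothetical, not Lustre's actual ones), the idea is that the core filesystem code calls a single helper unconditionally, while the kernel-version check lives in a compatibility shim that stays in the out-of-tree repository:

    /* lustre_compat.h -- hypothetical sketch of a compatibility shim */
    #include <linux/version.h>
    #include <linux/fs.h>

    /*
     * The core filesystem code always calls ll_compat_mtime(); the version
     * check lives here, in the compatibility layer, rather than being
     * #ifdef-ed throughout the filesystem itself.
     */
    #if LINUX_VERSION_CODE >= KERNEL_VERSION(6, 7, 0)
    static inline struct timespec64 ll_compat_mtime(struct inode *inode)
    {
            return inode_get_mtime(inode);
    }
    #else
    static inline struct timespec64 ll_compat_mtime(struct inode *inode)
    {
            /* older kernels exposed the timestamp field directly */
            return inode->i_mtime;
    }
    #endif

The tree proposed for the kernel would then contain only code written against current APIs, with shims like this staying behind in the Lustre repository that supports older kernels.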

Another area that needs attention is changes to the development process to better mesh with kernel development. The Lustre project does not use mailing lists; it uses a Gerrit instance instead. "We have got to figure out how to adapt." Simmons said that there are some developers who are totally Gerrit-oriented and some who could live with a mailing list; "we have to figure out how to please both audiences".

Amir Goldstein said that the only real requirement is that the project post the patches once to the mailing list before merging; there is no obligation to do patch review on the list. Simmons said that he and Brown have maintained a Git tree since Lustre was removed from staging; it is kept in sync and updated to newer kernels. All of the patches are posted to the lustre-devel mailing list and to a Patchwork instance, so all of the history is open and available for comments or criticism, he said.

[James Simmons]

Josef Bacik asked about what went wrong when Lustre was in the staging tree. From his perspective, Lustre has been around for a long time and there are no indications that it might be abandoned, so why did it not make the jump into the mainline fs/ directory? Simmons said that the project normally has a few different features being worked on at any given time, but Greg Kroah-Hartman, who runs the staging tree, did not want any patches that were not cleanups. So the staging version fell further and further behind the out-of-tree code. Bacik said that made sense to him; "that's like setting you up to fail".

Christian Brauner said that he would like to come up with a "more streamlined model" for merging new filesystems, where the filesystem community makes a collective decision on whether the merge should happen. The community has recently been "badly burned by 'anybody can just send a filesystem for inclusion' and then it's upstream and then we have to deal with all of the fallout". As a VFS maintainer, he does not want to be the one making the decision, but merging a new filesystem "puts a burden on the whole community" so it should be a joint decision.

Bacik reiterated that no one was concerned that Lustre developers were going to disappear, but that there are other concerns. It is important that Lustre is using folios everywhere, for example, and is using "all of the modern things"; that sounds a little silly coming from him, since Btrfs is still only halfway there, he said. Simmons said that the Lustre developers completely agree; there is someone working on the folio conversion currently. At the summit, he and Day have been talking with David Howells about using his netfs library.

Jeff Layton asked if the plan was to merge both the client and the server. Simmons said that most people are just asking for the client and that is a slower-moving code base. The client is "a couple-hundred patches per month, the server is three to four times the volume of patches", which makes it harder to keep up with the kernel. Layton said: "baby steps are good", though Day noted that it is harder to test Lustre without having server code in the kernel.

The reason he has been pushing for just merging the client, Ted Ts'o said, is because the server is "pretty incestuous with ext4". The server requires a bunch of ext4 symbols and there is "a need to figure out how to deal with that". That impacts ext4 development, but other changes, such as the plan to rewrite the jbd2 journaling layer to not require buffer heads, may also be complicated by the inclusion of the Lustre server, he said.

Simmons asked about posting patches to the linux-fsdevel mailing list before Lustre is upstream so that the kernel developers can start to get familiar with the code. Bacik said that made sense, but that he would not really dig into the guts of Lustre; he is more interested in the interfaces being used and whether there will be maintenance problems for the kernel filesystem community in the Lustre code. Goldstein suggested setting up a Lustre-specific mailing list, but Simmons noted that lustre-devel already exists and is being archived; Brauner suggested getting it added to lore.kernel.org.

The intent is that, when Linus Torvalds receives a pull request for a new filesystem, he can see that the code has been publicly posted beforehand, Goldstein said. It will also help if the Git development tree has a mirror on git.kernel.org, Ts'o said. Bacik said that he thought it was probably a lost cause to try to preserve all of the existing Git history as part of the merge, though it is up to Torvalds; instead, he suggested creating a git.kernel.org archive tree that people can consult for the history prior to the version that gets merged.

Given that Lustre targets high performance, Ts'o said, it will be important to support large folios. Simmons said that someone was working on that, and that it is important to the project; the plan is to get folio support, then to add large folios. Matthew Wilcox said that was fine, as long as the page-oriented APIs were getting converted. Many of those APIs are slowly going away, so the Lustre developers will want to ensure the filesystem is converted ahead of those removals.
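
To give a sense of what that conversion involves at the VFS boundary (a generic sketch, not Lustre's actual code), the old page-based read hook in struct address_space_operations was replaced by a folio-based one in the 5.19 cycle; once a filesystem advertises large-folio support, a single call can cover far more than one page:

    #include <linux/fs.h>
    #include <linux/pagemap.h>

    /*
     * Legacy form, shown for comparison only; the ->readpage hook no longer
     * exists in current kernels:
     *
     *     static int example_readpage(struct file *file, struct page *page)
     *     {
     *             ... fill the single page ...
     *             SetPageUptodate(page);
     *             unlock_page(page);
     *             return 0;
     *     }
     */

    /* Folio-based replacement; folio_size() may be much larger than PAGE_SIZE */
    static int example_read_folio(struct file *file, struct folio *folio)
    {
            /* ... fill folio_size(folio) bytes starting at folio_pos(folio) ... */
            folio_mark_uptodate(folio);
            folio_unlock(folio);
            return 0;
    }

    static const struct address_space_operations example_aops = {
            .read_folio     = example_read_folio,
    };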


Index entries for this article
Kernel: Filesystems/Lustre
Conference: Storage, Filesystem, Memory-Management and BPF Summit/2025



Fallout?

Posted Jun 18, 2025 14:36 UTC (Wed) by koverstreet (✭ supporter ✭, #4296) [Link] (2 responses)

> Christian Brauner said that he would like to come up with a "more streamlined model" for merging new filesystems, where the filesystem community makes a collective decision on whether the merge should happen. The community has recently been "badly burned by 'anybody can just send a filesystem for inclusion' and then it's upstream and then we have to deal with all of the fallout"

Was there any elaboration on what fallout he was referring to?

Fallout?

Posted Jun 18, 2025 15:10 UTC (Wed) by hailfinger (subscriber, #76962) [Link] (1 responses)

IMHO one of the more prominent examples would be ReiserFS, which was well-maintained by Namesys (and some people from SUSE and the community) in the beginning, but then Namesys's development resources were shifted to Reiser4, Hans Reiser went to jail, and Namesys stopped existing. SUSE went on to focus on other filesystems (Btrfs). Subsequently, the community had to take care of fixing anything that came up, and various kernel developers have complained about the constant burden of updating ReiserFS for MM/VFS changes.

Fallout?

Posted Jun 18, 2025 15:23 UTC (Wed) by koverstreet (✭ supporter ✭, #4296) [Link]

Well, more recently than that was ntfs, which I've also heard a lot of complaints about - I think the main one being that everyone got flooded with syzbot bugs, and syzbot wasn't yet able to file them with the proper subsystem so everyone had to sift through them.

And then of course there was the bcachefs merger, which has had its ups and downs :) I didn't make it this year so I couldn't take part, but I'd generally agree with Christian that there should be more of a process.

My brief take - the kernel community can be very "missing the forest for the trees", lots of focus on individual patches or individual rules, while at times it seems to me the high level discussions about design or goals are a bit absent.

The bit about Greg having a "cleanups only" rule for staging - that seems entirely typical, there should've at some point been a discussion about whether that makes sense; you simply can't expect an entire community to bend to a rule like that. I've also heard the staging merger didn't have the involvement of most of the Lustre community, so - bad communication all around, perhaps. Good communication is hard, such is life :)

This was also covered in LUG 2025

Posted Jun 18, 2025 16:58 UTC (Wed) by akkornel (subscriber, #75292) [Link]

Upstreaming Lustre was also discussed at the recent LUG 2025 conference, which was held on April 1st & 2nd at Stanford University. The session recording is available to watch, as are all of the slides and talks.

Lustre FUSE performance

Posted Jun 18, 2025 18:30 UTC (Wed) by geofft (subscriber, #59789) [Link] (9 responses)

I'm sure people have thought about this thoroughly which is why it's not discussed, but I'm curious - what are the issues with using Lustre via FUSE? My impression is there's been a bunch of stuff recently to improve FUSE performance including passthrough (https://docs.kernel.org/next/filesystems/fuse-passthrough...) and io_uring (https://docs.kernel.org/next/filesystems/fuse-io-uring.html). Are those good enough for performance to be acceptably close to a kernel filesystem? Is there kernel-side work that can be done on FUSE to make Lustre over FUSE acceptable, as an alternative to adding a filesystem to the kernel?

Lustre FUSE performance

Posted Jun 18, 2025 19:07 UTC (Wed) by tim-day-387 (subscriber, #171751) [Link] (7 responses)

Lustre already has an existing kernel driver that predates FUSE. I think the investment to create a FUSE driver would be huge with not much benefit. At best, we would get feature parity. At worst, we'd get diminished performance - which especially hurts Lustre, since we aim to fully saturate the available hardware bandwidth. So there isn't much payoff to justify the investment. There's a much better ROI on upstreaming the existing kernel driver, IMHO.

As for writing a Lustre FUSE driver, one big issue is networking: Lustre has a preexisting wire protocol that would prevent us from using something like libfabric. All of the user space networking would have to be re-implemented. Also, both the client and server exist in the kernel - so both would need to be ported to user space.

Lustre FUSE performance

Posted Jun 19, 2025 2:03 UTC (Thu) by geofft (subscriber, #59789) [Link] (2 responses)

Is feature parity not compelling? I have to imagine a good chunk of Lustre's users are people who are conservative about kernel versions in various ways, and offering them something that is as good as you can get in kernelspace, but lets you run a newer Lustre version against an older kernel version, lets you patch Lustre without pushing out a kernel patch to the entire fleet, etc., feels like it ought to be a big advantage.

But overall, yeah, that all makes sense. (On the existing-code front, I sort of wish porting were easier - I feel like I saw at some point some vaguely-kernel-API-compatible thing for userspace filesystems. But certainly if that doesn't exist in a production-ready high-performance state then that's not helpful to you.)

Lustre FUSE performance

Posted Jun 19, 2025 3:11 UTC (Thu) by tim-day-387 (subscriber, #171751) [Link]

> Is feature parity not compelling?

I'm thinking in terms of opportunity cost. Always more features that can be added to Lustre. :)

> I have to imagine a good chunk of Lustre's users are people who are conservative about kernel versions in various ways, and offering them something that is as good as you can get in kernelspace, but lets you run a newer Lustre version against an older kernel version, lets you patch Lustre without pushing out a kernel patch to the entire fleet, etc., feels like it ought to be a big advantage.

We would be limited if older kernels don't have the FUSE performance enhancements of the latest kernels. And without a kernel module, we'd have less flexibility to fix that. Even CentOS 7.9 still has some traction in HPC (although it's close to going away) - so some of the kernels are fairly ancient.

> But overall, yeah, that all makes sense. (On the existing-code front, I sort of wish porting were easier - I feel like I saw at some point some vaguely-kernel-API-compatible thing for userspace filesystems. But certainly if that doesn't exist in a production-ready high-performance state then that's not helpful to you.)

There were some efforts to reuse existing filesystem kernel drivers to mount untrusted disk images in userspace via FUSE. I think the goal was to create a path to deprecate abandoned disk filesystems in the kernel. I can't find any links to this right now....

For Lustre (or NFS or Ceph kernel driver), we'd still have the network problem with a solution like this.

LKL?

Posted Jun 19, 2025 3:17 UTC (Thu) by pabs (subscriber, #43278) [Link]

You might be thinking of the Linux Kernel Library (LKL) project?

https://lkl.github.io/
https://github.com/lkl

Lustre FUSE performance

Posted Jun 19, 2025 14:40 UTC (Thu) by csamuel (✭ supporter ✭, #2624) [Link]

> As for writing a Lustre FUSE driver, one big issue is networking: Lustre has a preexisting wire protocol

Yeah I was going to mention the Lnet part of Lustre as another potential barrier to fuse - especially as some systems have other kernel consumers of that (like HPE/Cray's DVS filesystem projection layer). There's also the existence of Lnet routers which might not mount Lustre but instead shuffle Lustre traffic between different interconnects (our old Cray XC used them to go between the Aries fabric on the inside and the IB fabric where the Lustre storage lived using RDMA on both sides).

> that would prevent us from using something like libfabric.

HPE has kfilnd for LNet on their Slingshot fabric and I had thought that the kfabric part in that was some reference to libfabric given HPE/Cray's use of it in other contexts, but reading Chris Horn's slides from 2023 I can see I was mistaken; it's something altogether different (insert Picard facepalm emoji here). https://www.opensfs.org/wp-content/uploads/LUG2023-kfilnd...

Lustre FUSE performance

Posted Jun 24, 2025 14:18 UTC (Tue) by cpitrat (subscriber, #116459) [Link] (2 responses)

> There's a much better ROI on upstreaming the existing kernel driver, IMHO.

This sounds like a very "Lustre-sided" vision though. I'm external to this whole world, but I can easily imagine that, from the perspective of a generic Linux filesystem developer, maintainer, or even user, Lustre on FUSE could be much more attractive: maintainers wouldn't have to worry about what happens if nobody maintains Lustre anymore, everybody would benefit from performance improvements in FUSE that may be driven by Lustre's usage, Lustre users would get all the benefits of a user-space implementation, etc ...

Lustre FUSE performance

Posted Jun 24, 2025 15:14 UTC (Tue) by Wol (subscriber, #4433) [Link] (1 responses)

> Lustre users would benefit from all the benefits of a user space implementation, etc ...

Which are?

I believe the article lists the DISbenefits of a FUSE implementation, which currently includes a severe performance hit, made worse by the current use cases. The projected benefits aren't worth the candle.

Something to bear in mind, with ALL suggestions of improvements like that, is do the *current* users see what you're offering as a benefit, or a hindrance. If it's the latter, feel free to have a crack at it, but don't expect them to help you ...

Always remember, selling something based on its value to YOU is very good at pissing other people off if you don't LISTEN. It's very good at turning me into an anti-customer.

Cheers,
Wol

Lustre FUSE performance

Posted Jun 24, 2025 18:26 UTC (Tue) by cpitrat (subscriber, #116459) [Link]

> Always remember, selling something based on its value to YOU is very good at pissing other people off if you don't LISTEN.

Which is kind of my point. Again, I'm external to this and I might be wrong, but my read of the situation is that the people working on Lustre want it in the kernel because they see the benefit of it (less work, better user experience), but from the kernel point of view, there's a high risk of ending up having to maintain a filesystem with users but no dedicated maintainer.

Lustre FUSE performance

Posted Jun 19, 2025 10:59 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link]

Performance. Lustre is used by the HPC guys that routinely stream gigabytes per second of data.

FUSE is getting faster, but it's still several times slower than the in-kernel filesystems.

My experience in staging

Posted Jun 18, 2025 20:19 UTC (Wed) by error27 (subscriber, #8346) [Link] (2 responses)

I hesitate to rehash fights which happened almost a decade ago... But I don't remember the "no new features" rule as being the issue. That rule is mainly to push people to focus on getting their code out of staging, and I can't speak for Greg, but it's probably something which can be negotiated. There was probably bad communication if that was really the problem.

My impression was that the Lustre devs viewed all the random newbie changes as too risky to push to customers so that's why the kernel.org version was never the real version that anyone used. I reviewed the git log for Lustre and there were two newbie patches which did end up causing trouble. We've improved our processes since then. There were some oldbie patches which also caused problems.

It really doesn't work when the kernel.org version is not the real upstream. The diff only gets larger and riskier for customers over time.

My experience in staging

Posted Jun 18, 2025 21:04 UTC (Wed) by tim-day-387 (subscriber, #171751) [Link] (1 responses)

We plan on making the kernel.org version the canonical upstream. We have a lot of work planned to make that possible [1]. We need to rework how we do kernel compatibility to be more upstream-friendly. In DRBD [2], for example, the kernel modules apply and compile natively on an up-to-date kernel. They handle older kernels by mutating the modules with Coccinelle scripts. Lustre needs to have a similar process. We need a separation between the core filesystem code and the compatibility layer. The wiki covers the current thinking - I'm going to update it with some of the recent progress. We're hoping to have all of the major pieces in place before the next LSF/MM.
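
As a sketch of what that model looks like in practice (the function name is hypothetical; this is neither DRBD's nor Lustre's actual code), the tree that goes upstream uses only current kernel APIs, and the out-of-tree packaging step rewrites those calls for older kernels instead of carrying #ifdefs in the source:

    #include <linux/highmem.h>
    #include <linux/string.h>

    /* Upstream source: written against current kernel APIs only. */
    static void example_fill_page(struct page *page, const void *src, size_t len)
    {
            void *addr = kmap_local_page(page);

            memcpy(addr, src, len);
            kunmap_local(addr);
    }

    /*
     * What a Coccinelle-driven backport step might emit for a kernel older
     * than 5.11, which predates kmap_local_page():
     *
     *     void *addr = kmap_atomic(page);
     *     memcpy(addr, src, len);
     *     kunmap_atomic(addr);
     */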

Staging (in my view) was sort of a chicken-and-egg problem. Customers won't use unofficial clients and have an affinity for older kernels. And vendors won't support a client (or kernel version) that customers don't use. But if the official Lustre client is just the latest kernel.org client backported to old kernels, that sidesteps the issue. Of course, plenty of people will use the client natively on the latest kernels. But we'll still have a path for supporting old distro kernels as well.

[1] https://wiki.lustre.org/Lustre_Upstreaming_to_Linux_Kernel
[2] https://github.com/LINBIT/drbd/tree/master

My experience in staging

Posted Jun 18, 2025 21:12 UTC (Wed) by johill (subscriber, #25196) [Link]

We have - but it's probably not all that relevant to filesystems - some of those mechanics also at https://backports.docs.kernel.org/

I suspect some reticence is historical.

Posted Jun 19, 2025 2:17 UTC (Thu) by ejr (subscriber, #51652) [Link] (4 responses)

Once upon a time, 'ls -l' on a lustre file system slowed down everything across the cluster / supercomputer. That is not good PR for an OS still somewhat struggling for acceptance.

SSDs arrived for the metadata issues, but are they now enough when the storage itself is NVMe? I have a suspicion that some reactions are out of date. I do believe mine are. Ceph has similar issues but is "built around" them. How outdated are my Lustre views (eight to ten years)?

I suspect some reticence is historical.

Posted Jun 19, 2025 5:17 UTC (Thu) by sthiell (guest, #177937) [Link]

Yep, been there but your views are quite outdated, I would say. :) Lustre development has been very active these last 10 years, led by Andreas Dilger, and development efforts have recently been pushed by the AI hype.. sorry, AI needs... :D
Today, Lustre is quite stable and extremely performant on many aspects. Still not that easy to manage: you will need to invest quite some time on the userland/tooling part or bet on a vendor to do that for you. But just watch "Lustre DNE3 - Busting the small files myth" from LUG 2025 at https://wiki.lustre.org/Lustre_User_Group_2025 on how Lustre is used by finance folks, they have filesystems with 25B inodes and 200k/s getattr per MDT (they are not even using Infiniband, with it they could perhaps double that number) and a lot of MDTs – a rare share of info in this talk.

I suspect some reticence is historical.

Posted Jun 19, 2025 15:01 UTC (Thu) by csamuel (✭ supporter ✭, #2624) [Link] (2 responses)

> Once upon a time, 'ls -l' on a lustre file system slowed down everything across the cluster / supercomputer.

That's not something I've run into (yet) with Lustre, but I know people do sometimes complain about "ls" being slow for which there can be a number of reasons:

* anything that requires the size of a file means going out and querying all the OSTs where that file resides to see how much data is there
* if you had a file striped over many OSTs then that means needing to talk to many OSTs
* if you have a directory with a lot of files in it those may well reside on many OSTs, see above
* if you have a directory with a lot of large files in it which are all striped across many OSTs then multiply the above two pain points together
* in older Linux versions "ls -F" and "ls -C" for various indications of the type of object would call stat() which would request everything about a file, including the file size

Thankfully, newer distros have an ls that uses statx() and so can say whether or not it cares about the size of the file, so a straight "ls" might be much better these days (in the old days, \ls or unsetting whatever distro-provided aliases were present might be necessary to make it liveable).
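
As a simplified illustration of why the statx() mask matters on a network filesystem (illustrative user-space code, not anything ls actually contains), a caller that omits STATX_SIZE lets the filesystem answer from metadata alone instead of querying every OST holding the file:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/stat.h>

    int main(int argc, char **argv)
    {
            struct statx stx;

            if (argc < 2)
                    return 1;

            /* Ask only for type and mode; STATX_SIZE is deliberately omitted. */
            if (statx(AT_FDCWD, argv[1], AT_SYMLINK_NOFOLLOW,
                      STATX_TYPE | STATX_MODE, &stx) != 0) {
                    perror("statx");
                    return 1;
            }

            printf("mode: %o, size %s\n", (unsigned int)stx.stx_mode,
                   (stx.stx_mask & STATX_SIZE) ? "returned anyway" : "not requested");
            return 0;
    }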

Of course "ls -l" does care about that. I did have a vague memory that may Lustre caches some of this now, but I cannot for the life of me find a reference to it. :-(

I suspect some reticence is historical.

Posted Jun 19, 2025 18:31 UTC (Thu) by tim-day-387 (subscriber, #171751) [Link] (1 responses)

You're probably thinking of Lazy Size-on-Metadata [1] - the feature that caches file size on MDS. LSoM isn't automatic, so preexisting tools like `ls` won't use it. There are special Lustre tools to fetch that data. LSoM became less important once statx() gained more adoption [2] and we could avoid all those OSS RPCs you mentioned (at least, when we don't care about size).

Lustre has a couple other features that accelerate stat() heavy commands like `ls` - DNE (i.e. multiple metadata servers) and statahead (i.e. prefetching stat() data in bulk). So those can help quite a bit as well.

[1] https://doc.lustre.org/lustre_manual.xhtml#lsom
[2] https://jira.whamcloud.com/browse/LU-11554

I suspect some reticence is historical.

Posted Jun 25, 2025 13:41 UTC (Wed) by csamuel (✭ supporter ✭, #2624) [Link]

Hi Tim,

> You're probably thinking of Lazy Size-on-Metadata [1] - the feature that caches file size on MDS. LSoM isn't automatic, so preexisting tools like `ls` won't use it. There are special Lustre tools to fetch that data. LSoM became less important once statx() gained more adoption [2] and we could avoid all those OSS RPCs you mentioned (at least, when we don't care about size).

Ah right, that sounds likely, thank you!

> Lustre has a couple other features that accelerate stat() heavy commands like `ls` - DNE (i.e. multiple metadata servers) and statahead (i.e. prefetching stat() data in bulk). So those can help quite a bit as well.

Yeah, we've got 15 metadata servers in our production Lustre filesystem (about 1 per 24 OSS's) which helps a great deal.

I'm not much of a Lustre person so I'd missed statahead, thanks for that! Looks like 2.16 has a handy statahead improvement too with LU-14139 landed, we're still on 2.15 from our vendor.

All the best!
Chris

lustre-devel

Posted Jun 19, 2025 3:05 UTC (Thu) by pabs (subscriber, #43278) [Link]

Looks like lustre-devel is on lore.k.o since 2020:

https://web.archive.org/web/20201128141136/https://lore.k...

Mailing lists as a barrier to kernel development?

Posted Jun 20, 2025 7:05 UTC (Fri) by taladar (subscriber, #68407) [Link] (1 responses)

> Another area that needs attention is changes to the development process to better mesh with kernel development. The Lustre project does not use mailing lists; it uses a Gerrit instance instead. "We have got to figure out how to adapt." Simmons said that there are some developers who are totally Gerrit-oriented and some who could live with a mailing list; "we have to figure out how to please both audiences".

I feel this hard requirement to use mailing lists instead of the kind of tools literally every modern software project uses is going to be a huge barrier for the kernel to get new contributors in the coming years as old developers retire and/or die of old age.

Mailing lists as a barrier to kernel development?

Posted Jun 20, 2025 7:27 UTC (Fri) by neilbrown (subscriber, #359) [Link]

I don't understand why this might be a problem. Software engineers are tool users by nature. None of the tools we use are perfect, but equally none are hard to learn. The sane choice is to use the tool that achieves the goal. By all means complain about the tool while using it (if you see nothing to complain about you likely aren't trying) but do what is needed to get the job done.

Whatever tooling the kernel community decided to standardize on, there would be 5 other perfectly usable (but somewhat annoying) tools that someone would want us to use. The network effect doesn't require the best in any sense, only that we can all use the same interaction mechanism.

Better to upstream client and server together

Posted Jun 21, 2025 21:37 UTC (Sat) by neilbrown (subscriber, #359) [Link] (3 responses)

I'm somewhat disappointed at the intention to only upstream the client, not the server.

By far the biggest problem in upstreaming is the required change to work practices of the many people working on lustre. It is currently all one big project including kernel code, tools code, and documentation. It will have to be multiple projects just like other filesystems. e.g. e2fsprogs and kernel-ext4/jbd2. When splitting lustre up the process will be much smoother if the lines are clean.

Separating the utils and doco out from the kernel code should be easy enough - there is plenty of precedent. Separating client from server is a very different question. There is a lot of shared code and it isn't cleanly marked. There are well over 200 "#ifdef HAVE_SERVER_SUPPORT" and quite a bit of server code that isn't marked at all. There are better lines along which the kernel code can be split.

Lustre has an "OSD" abstraction - Object Storage Device. There is an "osd-zfs" implementation which supports zfs. This isn't going upstream because zfs isn't upstream so it will have to maintained separately anyway. There is "osd-ldiskfs" which contains a lightly edited version of ext4. This is the part that concerns Ted Ts'o. It cannot go upstream in a hurry. But there is work on an in-memory osd. https://jira.whamcloud.com/browse/LU-17995 This isn't ready yet, but could well be by the time lustre might start being submitted. Having this in-kernel and the other osds in separate projects could work well.

The other line that it makes sense to split the project on is the "LND" interface - Lustre Network Device. socklnd which uses TCP sockets would naturally go upstream. o2iblnd which supports infiniband might go as well though it might make sense to delay that. I don't know much about the other lnds but keeping them out at least while the dust settles could certainly make sense.

So there will HAVE to be some kernel code which is maintained out-of-tree - osd-zfs at the very least. But keeping that to a minimum would be best. It isn't clear to me that splitting out the "server" from the "client" to make it a separate project really benefits anyone....

Better to upstream client and server together

Posted Jun 21, 2025 22:43 UTC (Sat) by tim-day-387 (subscriber, #171751) [Link] (2 responses)

We discussed whether to do client-only or client/server [1]. Almost everyone was in favor of doing the client first i.e. start small. Personally, I also think we ought to do both at the same time. Having both the client and server upstream would make testing a lot easier. And (as you mention) there isn't really any clean split between the client and server. Honestly, I'm not sure it's even feasible to split the out-of-tree repo along a client/server boundary. There isn't a clearly defined API and it'd be a huge amount of work without very much payoff. Upstreaming the client and an in-memory server would be simpler in a lot of ways.

So I think we can revisit the decision. We don't really have to decide until we start posting patches on the list. The work we'd need to do in the interim is pretty much the same either way.

> But there is work on an in-memory osd. https://jira.whamcloud.com/browse/LU-17995 This isn't ready yet, but could well be by the time lustre might start being submitted.

This work will definitely be done before upstreaming.

> The other line that it makes sense to split the project on is the "LND" interface - Lustre Network Device. socklnd which uses TCP sockets would naturally go upstream. o2iblnd which supports infiniband might go as well though it might make sense to delay that. I don't know much about the other lnds but keeping them out at least while the dust settles could certainly make sense.

One issue with o2iblnd is GPUDirect Storage support. The current implementation relies on the out-of-tree GDS driver which depends on the out-of-tree display driver [2]. But NVIDIA now supports in-tree P2PDMA [3][4]. So we should be able to work around this now. But it'll require some refactoring of o2iblnd. I'm not sure if all of this work will be done before upstreaming.

[1] https://wiki.lustre.org/images/f/f9/LUG2025-Lustre_Upstre...
[2] https://github.com/NVIDIA/gds-nvidia-fs
[3] https://developer.nvidia.com/gpudirect-storage
[4] https://www.kernel.org/doc/html/latest/driver-api/pci/p2p...

Better to upstream client and server together

Posted Jun 22, 2025 23:17 UTC (Sun) by neilbrown (subscriber, #359) [Link] (1 responses)

Hi Tim,
thanks for the extra context.
You say:
> Almost everyone was in favor of doing the client first i.e. start small.
and I have to wonder how many of those "Almost everyone" were actually considering contributing to the effort.
As we all know, Linux is not a democracy. Decisions are made by those who do the work - whether development, review, testing, documentation, maintenance etc. Decisions aren't made by votes at conferences or complaints in comment threads like this :-)

As you seem to agree, I don't think there is anything "small" about starting by splitting the client out from the server - it would be a large and risky effort with minimal real gain. And I don't think "Start small" is in any way a useful sort of goal. We need to "start well" which we have done, and we need to continue coherently.

We started (as you know) by aligning the Lustre code base with Linux style - white space, comments, indentation. By using Linux support code in place of alternates: wait_event() macros, rhashtables, extent-tree, ringbuffer for tracing etc. By ensuring we support IPv6 because the networking maintainers require that (and the community wanted it but was never quite motivated enough until we realised it would be a firm block to merging).

We are continuing by looking at splitting out the utils/doc into a separate package (I think you were working on that??) and talking internally about alternate processes to get people used to the idea.

So when we eventually submit some patches I don't think we should accept anyone saying that this is "starting", whether small or otherwise. It is an important step in a process, but just one of several important steps. And the code that we land should look like what we want to have in the kernel - not some trimmed-down, un-testable "lite" version. I think one of the problems with the first upstreaming was that it was client only.

So as one of those people who has done a lot of work in this direction, and may well do some more, I vote that, as part of continuing well, we land client and server together.

Better to upstream client and server together

Posted Jun 23, 2025 16:53 UTC (Mon) by tim-day-387 (subscriber, #171751) [Link]

> As we all know, Linux is not a democracy. Decisions are made by those who do the work - whether development, review, testing, documentation, maintenance etc. Decisions aren't made by votes at conferences or complaints in comment threads like this :-)

That's probably the biggest problem with splitting the client and server. No one wants to work on it. :)

> We are continuing by looking at splitting out the utils/doc into a separate package (I think you were working on that??) and talking internally about alternate processes to get people used to the idea.

Yeah, I've been trying to move things to their final resting places. Split up user space / kernel code and split up core filesystem / backward compatibility code. Ideally, the core filesystem code (under lnet/ and lustre/) would be a verbatim copy of what we eventually upstream. So I've been slowly working towards that.

> So as one of those people who has done a lot of work in this direction, and may well do some more, I vote that, as part of continuing well, we land client and server together.

I agree. I think it's a matter of convincing everyone (in and outside the Lustre community) that upstreaming both components together is the most likely path to success. We'd need to highlight the practical challenges of doing the split - I don't think I emphasized that enough at LSF.


Copyright © 2025, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds