Gathering multiple system parameters in a single call
Running a command like lsof, which lists the open files on the system along with information about the process that has each file open, takes a lot of system calls, mostly to read a small amount of information from many /proc files. Providing a new interface to collect those calls together into a single (or, at least, fewer) system calls is the target of Miklos Szeredi's getvalues() RFC patch that was posted on March 22. While the proposal does not look like it is going far, at least in its current form, it did spark some discussion of the need—or lack thereof—for a way to reduce this kind of overhead, as well as to explore some alternative ways to get there via code that already exists in the kernel.
getvalues()
In his post, Szeredi highlighted the performance problem: "Calling open/read/close for many small files is inefficient". Running lsof on his desktop resulted in around 60,000 calls to read small amounts of data from /proc files; "90% of those are 128 bytes or less". But another problem that getvalues() tries to address is the fragmentation of the interfaces for gathering system information on Linux:
For files we have basic stat, statx, extended attributes, file attributes (for which there are two overlapping ioctl interfaces). For mounts and superblocks we have stat*fs as well as /proc/$PID/{mountinfo,mountstats}. The latter also has the problem on not allowing queries on a specific mount.
His proposed solution is a system call with the following prototype, which uses a new structure type:
    struct name_val {
        const char *name;        /* in */
        struct iovec value_in;   /* in */
        struct iovec value_out;  /* out */
        uint32_t error;          /* out */
        uint32_t reserved;
    };

    int getvalues(int dfd, const char *path, struct name_val *vec,
                  size_t num, unsigned int flags);
It will look up an object (which he calls $ORIGIN) using dfd and path, as with openat(); flags is used to modify the path-based lookup. vec is an array of num entries for the parameters of interest. getvalues() will return the number of values filled in or an error.
The name field in struct name_val is where most of the action is. It consists of a string in a kind of new micro-language that describes the value of interest, using prefixes to identify different types of information. From the post:
    mnt                     - list of mount parameters
    mnt:mountpoint          - the mountpoint of the mount of $ORIGIN
    mntns                   - list of mount ID's reachable from the current root
    mntns:21:parentid       - parent ID of the mount with ID of 21
    xattr:security.selinux  - the security.selinux extended attribute
    data:foo/bar            - the data contained in file $ORIGIN/foo/bar
The prefix can be omitted if it is the same as that of the previous entry in vec, so a "mnt:mountpoint" followed by a ":parentid" would imply the "mnt" prefix on the latter. value_in provides a buffer to hold the value retrieved; passing a NULL for iov_base in the struct iovec will reuse the previous entry's buffer. That allows a single buffer to be used for multiple retrieved values with getvalues() stepping through the buffer as needed. value_out will hold the address of where the value was stored, which is useful for shared buffers, and its length. If an error occurs, its code will be stored in error.
It is a fairly straightforward interface, though it does add yet another (simple) parser into the kernel. Szeredi also posted a sample program that shows how it can be used.
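To make the shape of the interface concrete, here is a minimal, purely illustrative sketch (not Szeredi's sample program) that asks for two values in one call. The syscall wrapper and its number are hypothetical, since getvalues() exists only as an out-of-tree RFC patch; the structure layout simply follows the prototype shown above.

    /* Illustrative only: getvalues() is an unmerged RFC, so the syscall
     * number below is a placeholder; the structure follows the prototype
     * quoted earlier in the article. */
    #include <stdio.h>
    #include <stdint.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/uio.h>
    #include <sys/syscall.h>

    struct name_val {
        const char *name;        /* in */
        struct iovec value_in;   /* in */
        struct iovec value_out;  /* out */
        uint32_t error;          /* out */
        uint32_t reserved;
    };

    #define __NR_getvalues 451   /* hypothetical: no number has been assigned */

    static int getvalues(int dfd, const char *path, struct name_val *vec,
                         size_t num, unsigned int flags)
    {
        return syscall(__NR_getvalues, dfd, path, vec, num, flags);
    }

    int main(void)
    {
        char buf[4096];   /* one shared buffer stepped through by the kernel */
        struct name_val vec[2] = {
            /* mountpoint of the mount holding $ORIGIN (/etc/passwd here) */
            { .name = "mnt:mountpoint", .value_in = { buf, sizeof(buf) } },
            /* NULL iov_base: reuse the previous entry's buffer */
            { .name = "xattr:security.selinux", .value_in = { NULL, 0 } },
        };

        int n = getvalues(AT_FDCWD, "/etc/passwd", vec, 2, 0);
        if (n < 0)
            return 1;

        for (int i = 0; i < n; i++) {
            if (vec[i].error)
                fprintf(stderr, "%s: error %u\n", vec[i].name, vec[i].error);
            else
                printf("%s = %.*s\n", vec[i].name,
                       (int)vec[i].value_out.iov_len,
                       (char *)vec[i].value_out.iov_base);
        }
        return 0;
    }

The per-entry error field is what lets one failed lookup (a missing xattr, say) leave the rest of the batch intact.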
Reaction
Casey Schaufler pointed out that the open/read/close problem could be addressed, without all of the rest of the generality, with an openandread() system call or similar. He also had some questions and comments about the interface, some of its shortcuts, and its behavior in the presence of errors.
Greg Kroah-Hartman noted that he had posted a proposal for a readfile() system call that would address the overhead problem as well. It was the subject of an LWN article just over two years ago. But it turns out that he found little real-world performance improvement using readfile(), which is part of why it was never merged. "Do you have anything real that can use this that shows a speedup?"
Bernd Schubert thought that network filesystems could benefit, because operations could be batched up rather than sent individually over the wire. He said that because there is no readfile() (or its equivalent) available, network filesystem protocols are not adding combined operations for open/read/close. But J. Bruce Fields said that NFSv4 already has compound operations, "so you can do OPEN+READ+CLOSE in a single round trip". So far, at least, the NFS client does not actually use it, but the protocol support is there.
While Christian Brauner was in favor of better ways to query filesystem information, he was concerned about the ease-of-use for getvalues():
I would really like if we had interfaces that are really easy to use from userspace comparable to statx for example. I know having this generic as possible was the goal but I'm just a bit uneasy with such interfaces. They become cumbersome to use in userspace.[...] Would it be really that bad if we added multiple syscalls for different types of info? For example, querying mount information could reasonably be a more focussed separate system call allowing to retrieve detailed mount propagation info, flags, idmappings and so on. Prior approaches to solve this in a completely generic way have gotten us not very far too so I'm a bit worried about this aspect too.
But Szeredi thinks that the generality of the interface is important for the future. A system call like statx() could perhaps be added for filesystem information (e.g. statfsx()), but that only works for data that can be represented in a flat structure. Hierarchical data has to be represented in some other way. He would like to see some kind of unified interface to gather information from multiple different sources in the kernel, both textual and binary, that uses hierarchical namespaces (a la file paths) for data that does not have a flat structure—rather than a collection of ad hoc interfaces that get added over time.
Kroah-Hartman pointed to two different mechanisms that might be used, starting with the KVM binary_stats.c interface, "which tried to create a 'generic' api, but ended up just making something to work for KVM as they got tired of people ignoring their more intrusive patch sets".
But Szeredi said that the KVM mechanism would not be easily used for things like extended attributes (xattrs) that do not have a fixed size. Kroah-Hartman followed that up with a suggestion to look at varlink as a possible protocol for transferring the data.
Ted Ts'o was not sure what problem getvalues() was truly solving. He noted that an lsof on his laptop did not take an inordinate amount of time, so the performance argument does not really make sense to him. As for ease-of-use, he suggested adding user-space libraries that gather up the data from various sources "to make life easier for application programmers". He had other concerns as well:
Each new system call, especially with all of the parsing that this one is going to use, is going to be an additional attack surface, and an additional new system call that we have to maintain --- and for the first 7-10 years, userspace programs are going to have to use the existing open/read/close interface since enterprise kernels stick [around] for a L-O-N-G time, so any kind of ease-of-use argument isn't really going to help application programs until RHEL 10 becomes obsolete.
If the open/read/close problem is real for some filesystems (e.g. network or FUSE), Christoph Hellwig said, a better way to address it would be with an io_uring operation. "And even on that I need to be sold first." The readfile() article linked above also has a section on a mechanism to support that use case with io_uring.
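As a rough illustration of that approach, here is a sketch that opens, reads, and closes a single /proc file in one io_uring submission, using linked SQEs and a fixed file slot so that the read and close can refer to the file opened earlier in the same batch. It assumes a reasonably recent kernel and liburing with support for direct (fixed-file) opens, and is meant only to show the shape of the idea, not a tuned or vetted implementation.

    /* Sketch: open + read + close of one /proc file in a single
     * io_uring submission, using a fixed file slot (needs a recent
     * kernel and liburing with direct-open support). */
    #include <liburing.h>
    #include <fcntl.h>
    #include <stdio.h>

    #define SLOT 0   /* fixed-file table index shared by all three SQEs */

    int main(void)
    {
        struct io_uring ring;
        struct io_uring_sqe *sqe;
        struct io_uring_cqe *cqe;
        char buf[256];

        if (io_uring_queue_init(8, &ring, 0) < 0)
            return 1;
        /* Reserve one slot in the fixed-file table for the direct open. */
        io_uring_register_files_sparse(&ring, 1);

        /* 1: open /proc/self/stat into fixed slot 0, linked to the next SQE */
        sqe = io_uring_get_sqe(&ring);
        io_uring_prep_openat_direct(sqe, AT_FDCWD, "/proc/self/stat",
                                    O_RDONLY, 0, SLOT);
        io_uring_sqe_set_flags(sqe, IOSQE_IO_LINK);

        /* 2: read from the fixed slot, linked to the close */
        sqe = io_uring_get_sqe(&ring);
        io_uring_prep_read(sqe, SLOT, buf, sizeof(buf), 0);
        io_uring_sqe_set_flags(sqe, IOSQE_FIXED_FILE | IOSQE_IO_LINK);

        /* 3: close the fixed slot */
        sqe = io_uring_get_sqe(&ring);
        io_uring_prep_close_direct(sqe, SLOT);

        io_uring_submit(&ring);

        /* Reap the three completions; the read's res is its byte count. */
        for (int i = 0; i < 3; i++) {
            if (io_uring_wait_cqe(&ring, &cqe) < 0)
                break;
            printf("op %d: res %d\n", i, cqe->res);
            io_uring_cqe_seen(&ring, cqe);
        }
        io_uring_queue_exit(&ring);
        return 0;
    }

A tool like lsof would presumably queue many such chains at once; that batching is where any win over individual open()/read()/close() calls would have to come from.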
Linus Torvalds was skeptical of the whole concept. Coalescing the open/read/close cycle has been shown to make little difference from a performance standpoint, and he did not think that the more general query interface was particularly compelling either:
With the "open-and-read" thing, the wins aren't that enormous.And getvalues() isn't even that. It's literally a [specialty] interface for a very special thing. Yes, I'm sure it avoids several system calls. Yes, I'm sure it avoids parsing strings etc. But I really don't think this is something we want to do unless people can show enormous and real-world examples of where it makes such a huge difference that we absolutely have to do it.
Virtual xattrs?
Dave Chinner pointed out that the XFS filesystem has a somewhat similar ioctl() command (XFS_IOC_ATTRMULTI_BY_HANDLE) that is used to dump and restore extended attributes in batches. He suggested that idea could be further extended:
I've said in the past when discussing things like statx() that maybe everything should be addressable via the xattr namespace and set/queried via xattr names regardless of how the filesystem stores the data. The VFS/filesystem simply translates the name to the storage location of the information. It might be held in xattrs, but it could just be a flag bit in an inode field.

Then we just get named xattrs in batches from an open fd.
He said that the values that Szeredi envisions being available via getvalues() could simply be mapped into an xattr namespace and retrieved using "a new, cleaner version of xattr batch APIs that have been around for 20-odd years already". Schaufler cautioned that there is a "significant and vocal set of people who dislike xattrs passionately", but if that problem could be solved, Chinner's approach had a lot going for it. "You could even provide getvalues() on top of it."
Szeredi seemed amenable to the idea, though he wondered about information from elsewhere in the system. Amir Goldstein said that there is already a precedent for "virtual xattrs" in the CIFS filesystem, so that idea could be extended to mount information and statistics of various kinds: "I don't see a problem with querying attributes of a mount/sb the same way as long as the namespace is clear about what is the object that is being queried (e.g. getxattr(path, "fsinfo.sbiostats.rchar",...)."
Chinner also noted that using the xattr interface would provide "a symmetrical API for -changing- values". Instead of using some other mechanism (e.g. configfs), system parameters could be changed with a setxattr() call. "That retains the simplicity of proc and sysfs attributes in that you can change them just by writing a new value to the file...."
The discussion more or less wound down after that. The xattrs-based idea seemed reasonably popular and much of the infrastructure to use it is already present in the kernel in various forms. So, while getvalues() itself seemingly does not have a path toward merging, the idea behind it could perhaps be preserved in a somewhat different form. So far, patches for that have not appeared, but perhaps that is something we will see before too long.
Index entries for this article: Kernel/System calls
Posted Apr 7, 2022 1:42 UTC (Thu) by calumapplepie (guest, #143655):
htop -d 10, which gives me a nice, reliable view of what my system is doing, consumes 12-15% CPU. htop -d 1, which gives a great picture of what my computer is up to that moment, goes up to 50-60% CPU. 12-15% may not sound like a lot, but it's the highest or second-highest consumer most of the time. This is on a system with two web browsers with well over 100 tabs open, as well as several other apps (some in Python) running nearly constantly. Having my performance monitoring tool eat up more CPU than Firefox's 117 tabs in 3 windows is a little bit sad.
> He noted that an lsof on his laptop did not take an inordinate amount of time, so the performance argument does not really make sense to him.
I should be clear that the machine I am writing this on (and running these tests on) is running in battery mode: cpu frequencies are locked at 800 mHz, everything in "extreme power save mode", etc. However, despite running this command thrice to ensure everything was nice and warm in the caches, I still got these results:
time lsof > /dev/null

real    0m29.248s
user    0m16.245s
sys     0m12.842s
This might not be 'inordinate', but it's certainly significant. Even if we assume that every last moment of the 12 seconds spent on system calls is vitally needed and cannot be optimized by this new call, what exactly is being done during the 16 seconds in userspace? How much of that time amounts to "parse what the kernel gave us"?
This isn't really a matter of user-facing performance for me: rather, it's a question of power efficiency. If I leave htop running in the background for a while, that shouldn't harm my battery life substantially. Performance monitoring tools in userspace are very common: think of how many there are running in the world right now. If we make all of them twice as fast, we will save megawatts of power.

Using the standard "oh, but tools won't use it at first" argument seems pretty silly to me: tools can detect the availability of system calls, and further, that argument applies to every possible bit of functionality. Why aren't we hearing that objection used to argue that the performance gains of, say, io_uring are irrelevant?

As for attack surface, this interface is very easy to fuzz, and doesn't need to be sophisticated or heavily optimized. Compared to what it's replacing (a confused mix of interfaces developed over decades), I think it's very probable that the reduction in attack surface in 10 years, when distros can start configuring out the IOCTLs, will far exceed what this interface adds.
Posted Apr 7, 2022 10:37 UTC (Thu) by Kamiccolo (subscriber, #95159):
real    0m12.425s
user    0m1.496s
sys     0m1.544s
oh well...
Posted Apr 8, 2022 23:42 UTC (Fri) by HenrikH (subscriber, #31152):
henrik@Sineya:~$ time lsof > /dev/null

real    0m8,932s
user    0m1,899s
sys     0m1,497s

henrik@Sineya:~$ time lsof -n > /dev/null

real    0m3,276s
user    0m1,834s
sys     0m1,387s
Posted Apr 7, 2022 14:21 UTC (Thu) by Hello71 (subscriber, #103412):
have you checked that this actually reduces battery usage? it has been the goal of Linux schedulers for quite some time to use the fastest reasonably-efficient speed available to complete a given task, then drop to a deep power-saving level. "off" uses so much less power than "800 mHz" (sic) that it saves energy significantly.
Posted Apr 7, 2022 21:01 UTC (Thu) by calumapplepie (guest, #143655):
I should note that I think the schedulers could use some fine-tuning on this, actually: I understand there is a latency (and accompanying power consumption) in going from (say) C10 to C0, so it'd make sense to completely stop using some cores, but that isn't what happens, based on what I see in PowerTOP: every core sees some use. Of course, I don't actually know what the most efficient thing is: just an observation from my machine.
Posted Apr 15, 2022 14:55 UTC (Fri) by mrugiero (guest, #153040):
Considering it already needs OS specific calls, that's certainly some low-hanging fruit. Even if it didn't, the odd one out in terms of having open/read/close as first class citizens is Windows, and I'm not even sure lsof runs there. I think that's a good catch :)
Posted Apr 23, 2022 3:05 UTC (Sat) by cody_schafer (guest, #85326):
The "virtual file with 1 plain text value" model only really works very well for folks writing shell scripts, and breaks down otherwise. There are also repeated issues in sysfs where it's very difficult to decide what type of "thing" some directory represents because of how it lays out attributes.
It's clear that some nested key-value store is useful as a operating system API, and nested key-value store _is_ what the filesystem model is supposed to be. So it feels like improving the system call interface here to reduce overhead could be useful. Though perhaps some rethinking of the posix filesystem API also could make sense (transactions?)
Alternates might include leaning towards an RPC style interface, but that seems like we'd be reimplementing open/close/read/write on top of a stream like protocol. This _might_ make the efficiency better given syscall overhead though.
Seems this was merged a long time ago
perf stat lsof on my (quite active) workstation here:
Performance counter stats for 'lsof':
24.013,25 msec task-clock # 0,308 CPUs utilized
2.569.794 context-switches # 107,016 K/sec
1.474 cpu-migrations # 61,383 /sec
87.247 page-faults # 3,633 K/sec
106.298.004.008 cycles # 4,427 GHz (74,83%)
136.488.994.248 instructions # 1,28 insn per cycle (75,31%)
32.366.388.743 branches # 1,348 G/sec (74,98%)
257.065.097 branch-misses # 0,79% of all branches (74,88%)
78,066150910 seconds time elapsed
9,366655000 seconds user
19,249239000 seconds sys
fwiw, lsof outputs 1072596 lines.
So, if there were some more performant way to do this, I'd definitely look forward to it.
I get figures in the same ballpark when not redirecting similarly-sized output, despite the terminal being in limited scrolling mode with 1K lines of scrollback; the running time is further nearly doubled or so when serializing the whole terminal scrollback to an SSD.
The run time for lsof is less bad when piping the output to `wc -l`, but still not blazing fast, and shows more sys time than usr time here as well:
perf stat lsof | wc -l
lsof: WARNING: can't stat() fuse.portal file system /run/user/1000/doc
Output information may be incomplete.
Performance counter stats for 'lsof':
16 993,59 msec task-clock # 0,947 CPUs utilized
44 362 context-switches # 2,611 K/sec
9 cpu-migrations # 0,530 /sec
103 592 page-faults # 6,096 K/sec
62 601 134 626 cycles # 3,684 GHz
117 648 197 174 instructions # 1,88 insn per cycle
31 922 541 769 branches # 1,879 G/sec
139 279 693 branch-misses # 0,44% of all branches
17,953006430 seconds time elapsed
6,730828000 seconds user
10,358977000 seconds sys
1243951
I don't have perf installed on my workstation, but system time needed for lsof is even longer:
time /bin/bash -c 'lsof | wc -l'
521342
real 0m25.253s
user 0m10.128s
sys 0m21.455s
This workstation has 2 Intel Xeon E5-2603 v4 CPUs, and 128 GiB memory.