Gathering multiple system parameters in a single call
Running a command like lsof, which lists the open files on the system along with information about the process that has each file open, takes a lot of system calls, mostly to read a small amount of information from many /proc files. Providing a new interface to collect those calls together into a single (or, at least, fewer) system calls is the target of Miklos Szeredi's getvalues() RFC patch that was posted on March 22. While the proposal does not look like it is going far, at least in its current form, it did spark some discussion of the need—or lack thereof—for a way to reduce this kind of overhead, as well as to explore some alternative ways to get there via code that already exists in the kernel.
getvalues()
In his post, Szeredi highlighted the performance problem: "Calling open/read/close for many small files is inefficient". Running lsof on his desktop resulted in around 60,000 calls to read small amounts of data from /proc files; "90% of those are 128 bytes or less". But another problem that getvalues() tries to address is the fragmentation of the interfaces for gathering system information on Linux:
For files we have basic stat, statx, extended attributes, file attributes (for which there are two overlapping ioctl interfaces). For mounts and superblocks we have stat*fs as well as /proc/$PID/{mountinfo,mountstats}. The latter also has the problem on not allowing queries on a specific mount.
His proposed solution is a system call with the following prototype, which uses a new structure type:
    struct name_val {
        const char *name;        /* in */
        struct iovec value_in;   /* in */
        struct iovec value_out;  /* out */
        uint32_t error;          /* out */
        uint32_t reserved;
    };

    int getvalues(int dfd, const char *path, struct name_val *vec,
                  size_t num, unsigned int flags);
It will look up an object (which he calls $ORIGIN) using dfd and path, as with openat(); flags is used to modify the path-based lookup. vec is an array of num entries for the parameters of interest. getvalues() will return the number of values filled in or an error.
The name field in struct name_val is where most of the action is. It consists of a string in a kind of new micro-language that describes the value of interest, using prefixes to identify different types of information. From the post:
    mnt                     - list of mount parameters
    mnt:mountpoint          - the mountpoint of the mount of $ORIGIN
    mntns                   - list of mount ID's reachable from the current root
    mntns:21:parentid       - parent ID of the mount with ID of 21
    xattr:security.selinux  - the security.selinux extended attribute
    data:foo/bar            - the data contained in file $ORIGIN/foo/bar
The prefix can be omitted if it is the same as that of the previous entry in vec, so a "mnt:mountpoint" followed by a ":parentid" would imply the "mnt" prefix on the latter. value_in provides a buffer to hold the value retrieved; passing a NULL for iov_base in the struct iovec will reuse the previous entry's buffer. That allows a single buffer to be used for multiple retrieved values with getvalues() stepping through the buffer as needed. value_out will hold the address of where the value was stored, which is useful for shared buffers, and its length. If an error occurs, its code will be stored in error.
It is a fairly straightforward interface, though it does add yet another (simple) parser into the kernel. Szeredi also posted a sample program that shows how it can be used.
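To make the shape of the interface concrete, here is a minimal, purely illustrative sketch (not Szeredi's sample program) that asks for two values in one call. The syscall wrapper and its number are hypothetical, since getvalues() exists only as an out-of-tree RFC patch; the structure layout simply follows the prototype shown above.

    /* Illustrative only: getvalues() is an unmerged RFC, so the syscall
     * number below is a placeholder; the structure follows the prototype
     * quoted earlier in the article. */
    #include <stdio.h>
    #include <stdint.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/uio.h>
    #include <sys/syscall.h>

    struct name_val {
        const char *name;        /* in */
        struct iovec value_in;   /* in */
        struct iovec value_out;  /* out */
        uint32_t error;          /* out */
        uint32_t reserved;
    };

    #define __NR_getvalues 451   /* hypothetical: no number has been assigned */

    static int getvalues(int dfd, const char *path, struct name_val *vec,
                         size_t num, unsigned int flags)
    {
        return syscall(__NR_getvalues, dfd, path, vec, num, flags);
    }

    int main(void)
    {
        char buf[4096];   /* one shared buffer stepped through by the kernel */
        struct name_val vec[2] = {
            /* mountpoint of the mount holding $ORIGIN (/etc/passwd here) */
            { .name = "mnt:mountpoint", .value_in = { buf, sizeof(buf) } },
            /* NULL iov_base: reuse the previous entry's buffer */
            { .name = "xattr:security.selinux", .value_in = { NULL, 0 } },
        };

        int n = getvalues(AT_FDCWD, "/etc/passwd", vec, 2, 0);
        if (n < 0)
            return 1;

        for (int i = 0; i < n; i++) {
            if (vec[i].error)
                fprintf(stderr, "%s: error %u\n", vec[i].name, vec[i].error);
            else
                printf("%s = %.*s\n", vec[i].name,
                       (int)vec[i].value_out.iov_len,
                       (char *)vec[i].value_out.iov_base);
        }
        return 0;
    }

The per-entry error field is what lets one failed lookup (a missing xattr, say) leave the rest of the batch intact.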
Reaction
Casey Schaufler pointed out that the open/read/close problem could be addressed, without all of the rest of the generality, with an openandread() system call or similar. He also had some questions and comments about the interface, some of its shortcuts, and its behavior in the presence of errors.
Greg Kroah-Hartman noted that he had posted a proposal for a readfile() system call that would address the overhead problem as well. It was the subject of an LWN article just over two years ago. But it turns out that he found little real-world performance improvement using readfile(), which is part of why it was never merged. "Do you have anything real that can use this that shows a speedup?"
Bernd Schubert thought that network filesystems could benefit, because operations could be batched up rather than sent individually over the wire. He said that because there is no readfile() (or its equivalent) available, network filesystem protocols are not adding combined operations for open/read/close. But J. Bruce Fields said that NFSv4 already has compound operations, "so you can do OPEN+READ+CLOSE in a single round trip". So far, at least, the NFS client does not actually use it, but the protocol support is there.
While Christian Brauner was in favor of better ways to query filesystem information, he was concerned about the ease-of-use for getvalues():
I would really like if we had interfaces that are really easy to use from userspace comparable to statx for example. I know having this generic as possible was the goal but I'm just a bit uneasy with such interfaces. They become cumbersome to use in userspace.[...] Would it be really that bad if we added multiple syscalls for different types of info? For example, querying mount information could reasonably be a more focussed separate system call allowing to retrieve detailed mount propagation info, flags, idmappings and so on. Prior approaches to solve this in a completely generic way have gotten us not very far too so I'm a bit worried about this aspect too.
But Szeredi thinks that the generality of the interface is important for the future. A system call like statx() could perhaps be added for filesystem information (e.g. statfsx()), but that only works for data that can be represented in a flat structure. Hierarchical data has to be represented in some other way. He would like to see some kind of unified interface to gather information from multiple different sources in the kernel, both textual and binary, that uses hierarchical namespaces (a la file paths) for data that does not have a flat structure—rather than a collection of ad hoc interfaces that get added over time.
Kroah-Hartman pointed to two different mechanisms that might be used, starting with the KVM binary_stats.c interface, "which tried to create a 'generic' api, but ended up just making something to work for KVM as they got tired of people ignoring their more intrusive patch sets".
But Szeredi said that the KVM mechanism would not be easily used for things like extended attributes (xattrs) that do not have a fixed size. Kroah-Hartman followed that up with a suggestion to look at varlink as a possible protocol for transferring the data.
Ted Ts'o was not sure what problem getvalues() was truly solving. He noted that an lsof on his laptop did not take an inordinate amount of time, so the performance argument does not really make sense to him. As for ease-of-use, he suggested adding user-space libraries that gather up the data from various sources "to make life easier for application programmers". He had other concerns as well:
Each new system call, especially with all of the parsing that this one is going to use, is going to be an additional attack surface, and an additional new system call that we have to maintain --- and for the first 7-10 years, userspace programs are going to have to use the existing open/read/close interface since enterprise kernels stick [around] for a L-O-N-G time, so any kind of ease-of-use argument isn't really going to help application programs until RHEL 10 becomes obsolete.
If the open/read/close problem is real for some filesystems (e.g. network or FUSE), Christoph Hellwig said, a better way to address it would be with an io_uring operation. "And even on that I need to be sold first." The readfile() article linked above also has a section on a mechanism to support that use case with io_uring.
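As a rough illustration of that approach, here is a sketch that opens, reads, and closes a single /proc file in one io_uring submission, using linked SQEs and a fixed file slot so that the read and close can refer to the file opened earlier in the same batch. It assumes a reasonably recent kernel and liburing with support for direct (fixed-file) opens, and is meant only to show the shape of the idea, not a tuned or vetted implementation.

    /* Sketch: open + read + close of one /proc file in a single
     * io_uring submission, using a fixed file slot (needs a recent
     * kernel and liburing with direct-open support). */
    #include <liburing.h>
    #include <fcntl.h>
    #include <stdio.h>

    #define SLOT 0   /* fixed-file table index shared by all three SQEs */

    int main(void)
    {
        struct io_uring ring;
        struct io_uring_sqe *sqe;
        struct io_uring_cqe *cqe;
        char buf[256];

        if (io_uring_queue_init(8, &ring, 0) < 0)
            return 1;
        /* Reserve one slot in the fixed-file table for the direct open. */
        io_uring_register_files_sparse(&ring, 1);

        /* 1: open /proc/self/stat into fixed slot 0, linked to the next SQE */
        sqe = io_uring_get_sqe(&ring);
        io_uring_prep_openat_direct(sqe, AT_FDCWD, "/proc/self/stat",
                                    O_RDONLY, 0, SLOT);
        io_uring_sqe_set_flags(sqe, IOSQE_IO_LINK);

        /* 2: read from the fixed slot, linked to the close */
        sqe = io_uring_get_sqe(&ring);
        io_uring_prep_read(sqe, SLOT, buf, sizeof(buf), 0);
        io_uring_sqe_set_flags(sqe, IOSQE_FIXED_FILE | IOSQE_IO_LINK);

        /* 3: close the fixed slot */
        sqe = io_uring_get_sqe(&ring);
        io_uring_prep_close_direct(sqe, SLOT);

        io_uring_submit(&ring);

        /* Reap the three completions; the read's res is its byte count. */
        for (int i = 0; i < 3; i++) {
            if (io_uring_wait_cqe(&ring, &cqe) < 0)
                break;
            printf("op %d: res %d\n", i, cqe->res);
            io_uring_cqe_seen(&ring, cqe);
        }
        io_uring_queue_exit(&ring);
        return 0;
    }

A tool like lsof would presumably queue many such chains at once; that batching is where any win over individual open()/read()/close() calls would have to come from.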
Linus Torvalds was skeptical of the whole concept. Coalescing the open/read/close cycle has been shown to make little difference from a performance standpoint, and he did not think that the more general query interface was particularly compelling either:
With the "open-and-read" thing, the wins aren't that enormous.And getvalues() isn't even that. It's literally a [specialty] interface for a very special thing. Yes, I'm sure it avoids several system calls. Yes, I'm sure it avoids parsing strings etc. But I really don't think this is something we want to do unless people can show enormous and real-world examples of where it makes such a huge difference that we absolutely have to do it.
Virtual xattrs?
Dave Chinner pointed out that the XFS filesystem has a somewhat similar ioctl() command (XFS_IOC_ATTRMULTI_BY_HANDLE) that is used to dump and restore extended attributes in batches. He suggested that idea could be further extended:
I've said in the past when discussing things like statx() that maybe everything should be addressable via the xattr namespace and set/queried via xattr names regardless of how the filesystem stores the data. The VFS/filesystem simply translates the name to the storage location of the information. It might be held in xattrs, but it could just be a flag bit in an inode field.

Then we just get named xattrs in batches from an open fd.
He said that the values that Szeredi envisions being available via getvalues() could simply be mapped into an xattr namespace and retrieved using "a new, cleaner version of xattr batch APIs that have been around for 20-odd years already". Schaufler cautioned that there is a "significant and vocal set of people who dislike xattrs passionately", but if that problem could be solved, Chinner's approach had a lot going for it. "You could even provide getvalues() on top of it."
Szeredi seemed amenable to the idea, though he wondered about information from elsewhere in the system. Amir Goldstein said that there is already a precedent for "virtual xattrs" in the CIFS filesystem, so that idea could be extended to mount information and statistics of various kinds: "I don't see a problem with querying attributes of a mount/sb the same way as long as the namespace is clear about what is the object that is being queried (e.g. getxattr(path, "fsinfo.sbiostats.rchar",...)."
Chinner also noted that using the xattr interface would provide "a symmetrical API for -changing- values". Instead of using some other mechanism (e.g. configfs), system parameters could be changed with a setxattr() call. "That retains the simplicity of proc and sysfs attributes in that you can change them just by writing a new value to the file...."
The discussion more or less wound down after that. The xattrs-based idea seemed reasonably popular and much of the infrastructure to use it is already present in the kernel in various forms. So, while getvalues() itself seemingly does not have a path toward merging, the idea behind it could perhaps be preserved in a somewhat different form. So far, patches for that have not appeared, but perhaps that is something we will see before too long.
Index entries for this article: Kernel/System calls
Posted Apr 7, 2022 1:42 UTC (Thu) by calumapplepie (guest, #143655):
htop -d 10, which gives me a nice, reliable view of what my system is doing, consumes 12-15% CPU. htop -d 1, which gives a great picture of what my computer is up to that moment, goes up to 50-60% CPU. 12-15% may not sound like a lot, but it's the highest or second-highest consumer most of the time. This is on a system with two web browsers with well over 100 tabs open, as well as several other apps (some in Python) running nearly constantly. Having my performance monitoring tool eat up more CPU than Firefox's 117 tabs in 3 windows is a little bit sad.
> He noted that an lsof on his laptop did not take an inordinate amount of time, so the performance argument does not really make sense to him.
I should be clear that the machine I am writing this on (and running these tests on) is running in battery mode: cpu frequencies are locked at 800 mHz, everything in "extreme power save mode", etc. However, despite running this command thrice to ensure everything was nice and warm in the caches, I still got these results:
time lsof > /dev/null

real    0m29.248s
user    0m16.245s
sys     0m12.842s
This might not be 'inordinate', but it's certainly significant. Even if we assume that every last moment of the 12 seconds spent on system calls is vitally needed and cannot be optimized by this new call, what exactly is being done during the 16 seconds in userspace? How much of that time amounts to "parse what the kernel gave us"?
This isn't really a matter of user-facing performance for me: rather, it's a question of power efficiency. If I leave htop running in the background for a while, that shouldn't harm my battery life substantially. Performance monitoring tools in userspace are very common: think of how many there are running in the world right now. If we make all of them twice as fast, we will save megawatts of power.

Using the standard "oh, but tools won't use it at first" argument seems pretty silly to me: tools can detect the availability of system calls, and further, that argument applies to every possible bit of functionality. Why aren't we hearing that objection used to argue that the performance gains of, say, io_uring are irrelevant?

As for attack surface, this interface is very easy to fuzz, and doesn't need to be sophisticated or heavily optimized. Compared to what it's replacing (a confused mix of interfaces developed over decades), I think it's very probable that the reduction in attack surface in 10 years, when distros can start configuring out the IOCTLs, will far exceed what this interface adds.
Posted Apr 7, 2022 10:37 UTC (Thu) by Kamiccolo (subscriber, #95159):
real    0m12.425s
user    0m1.496s
sys     0m1.544s
oh well...
Posted Apr 8, 2022 23:42 UTC (Fri) by HenrikH (subscriber, #31152):
henrik@Sineya:~$ time lsof > /dev/null

real    0m8,932s
user    0m1,899s
sys     0m1,497s

henrik@Sineya:~$ time lsof -n > /dev/null

real    0m3,276s
user    0m1,834s
sys     0m1,387s
Posted Apr 7, 2022 14:21 UTC (Thu) by Hello71 (subscriber, #103412):
have you checked that this actually reduces battery usage? it has been the goal of Linux schedulers for quite some time to use the fastest reasonably-efficient speed available to complete a given task, then drop to a deep power-saving level. "off" uses so much less power than "800 mHz" (sic) that it saves energy significantly.
Posted Apr 7, 2022 21:01 UTC (Thu) by calumapplepie (guest, #143655):
I should note that I think the schedulers could use some fine-tuning on this, actually: I understand there is a latency (and accompanying power consumption) in going from (say) C10 to C0, so it'd make sense to completely stop using some cores, but that isn't what happens, based on what I see in PowerTOP: every core sees some use. Of course, I don't actually know what the most efficient thing is: just an observation from my machine.
Posted Apr 15, 2022 14:55 UTC (Fri) by mrugiero (guest, #153040):
Considering it already needs OS specific calls, that's certainly some low-hanging fruit. Even if it didn't, the odd one out in terms of having open/read/close as first class citizens is Windows, and I'm not even sure lsof runs there. I think that's a good catch :)
Posted Apr 23, 2022 3:05 UTC (Sat) by cody_schafer (guest, #85326):
The "virtual file with 1 plain text value" model only really works very well for folks writing shell scripts, and breaks down otherwise. There are also repeated issues in sysfs where it's very difficult to decide what type of "thing" some directory represents because of how it lays out attributes.
It's clear that some nested key-value store is useful as a operating system API, and nested key-value store _is_ what the filesystem model is supposed to be. So it feels like improving the system call interface here to reduce overhead could be useful. Though perhaps some rethinking of the posix filesystem API also could make sense (transactions?)
Alternates might include leaning towards an RPC style interface, but that seems like we'd be reimplementing open/close/read/write on top of a stream like protocol. This _might_ make the efficiency better given syscall overhead though.
Seems this was merged a long time ago
perf stat lsof on my (quite active) workstation here:
Performance counter stats for 'lsof':
24.013,25 msec task-clock # 0,308 CPUs utilized
2.569.794 context-switches # 107,016 K/sec
1.474 cpu-migrations # 61,383 /sec
87.247 page-faults # 3,633 K/sec
106.298.004.008 cycles # 4,427 GHz (74,83%)
136.488.994.248 instructions # 1,28 insn per cycle (75,31%)
32.366.388.743 branches # 1,348 G/sec (74,98%)
257.065.097 branch-misses # 0,79% of all branches (74,88%)
78,066150910 seconds time elapsed
9,366655000 seconds user
19,249239000 seconds sys
fwiw, lsof outputs 1072596 lines.
So, if there were some more performant way to do this, I'd definitely look forward to it.
I get figures in the same ballpark when not redirecting similarly-sized output, despite the terminal being in limited scrolling mode with 1K lines of scrollback; the running time is further nearly doubled or so when serializing the whole terminal scrollback to an SSD.
The run time for lsof is less bad when piping the output to `wc -l`, but still not blazing fast, and shows more sys time than usr time here as well:
perf stat lsof | wc -l
lsof: WARNING: can't stat() fuse.portal file system /run/user/1000/doc
Output information may be incomplete.
Performance counter stats for 'lsof':
16 993,59 msec task-clock # 0,947 CPUs utilized
44 362 context-switches # 2,611 K/sec
9 cpu-migrations # 0,530 /sec
103 592 page-faults # 6,096 K/sec
62 601 134 626 cycles # 3,684 GHz
117 648 197 174 instructions # 1,88 insn per cycle
31 922 541 769 branches # 1,879 G/sec
139 279 693 branch-misses # 0,44% of all branches
17,953006430 seconds time elapsed
6,730828000 seconds user
10,358977000 seconds sys
1243951
I don't have perf installed on my workstation, but system time needed for lsof is even longer:
time /bin/bash -c 'lsof | wc -l'
521342
real 0m25.253s
user 0m10.128s
sys 0m21.455s
This workstation has 2 Intel Xeon E5-2603 v4 CPUs, and 128 GiB memory.