|
|
Log in / Subscribe / Register

Improving fget() performance

By Jake Edge
May 6, 2019

LSFMM

The performance of the fget() function in the kernel was the topic of a discussion led by Dave Watson at the 2019 Linux Storage, Filesystem, and Memory-Management Summit (LSFMM). fget() is used to take a reference to a file (i.e. bump a reference count), based on its file descriptor, and to return the struct file pointer for it; references are dropped with fput(). Some recent profiling at Watson's employer, Facebook, found the function to be taking a sizable portion of the CPU time for some applications, so he wanted to talk about some of the things he has tried to make that situation better.

Watson found that fget() and fput() were taking up to nearly 3% of the CPU for some services that Facebook runs. The worst was 2.9% for an internal service, but the Memcached distributed key-value store was at 2% and mcrouter, which is used to scale Memcached, was at 1.7%. Various other services ranged from 0.3% to 0.9%.

His first thought was that taking up so much CPU simply to manage the reference count is excessive. But he noticed that the services that seemed most affected were networking services, so he guessed that something in the network stack might be causing this. He focused on Memcached and found that 72.5% of the CPU used by fget() and fput() was coming from the recvfrom() system call. The other two system calls that used significant amounts of CPU time were epoll_pwait() at 11.5% and sendmsg() at 11%.

[Dave Watson]

He noted that Memcached is a call-response service; it receives a request for a key's value and sends it back. So sendmsg() is being called as often as recvfrom() but is contributing much less to the problem. His suspicion is that the receive path is taking a bunch of cache misses, but that the send comes relatively soon after it, so the cache will have fewer misses.

He then annotated fget() and found that cache misses on the file-descriptor table and on dereferencing the struct file pointer were taking up much of the CPU time, as does the atomic increment for the reference count. So he tried two different ways to reduce that overhead.

The first was to delay the reclamation of the struct file pointer by not incrementing the reference count in fget() (and not decrementing it in fput()) for some system calls that use a single file descriptor (e.g. recvfrom(), sendmsg(), various epoll*() calls, etc.). The calls are not likely to block, but if they do, the behavior reverts to the existing path. He worked up a proof of concept for this idea, but the results were underwhelming, so he does not recommend going down that path.

His second attempt tried to get at the cache misses that were causing much of the CPU use by creating a cache of struct file pointers on a per-task basis. Multiple file pointers in the file-descriptor table are sharing the same cache line, when any of those get changed, even by an unrelated thread, there is cache-line bouncing. For Facebook, the file descriptors stay with the same thread once the "accepter" thread hands them off to a processing thread, so the cache eliminates the processor cache misses and the performance loss that went along with them. It is not a fancy cache as, once again, he just wanted to see if it helped. There is a complication, however, as it is not clear how to flush the cache, he said; if you want close() to work, though, you need to flush the cache. He ended up adding a field in struct file that pointed at the cache entry so that close() could do the right thing.

Overall, his proof-of-concept seems to work well. Most of the overhead from cache misses for the file-descriptor table are gone; that gives a roughly 2x performance increase. He has not thought about handling the cache misses for the struct file pointer dereferencing, but that could be next.

Jan Kara wondered if the file-descriptor table bouncing could be handled by allocating file descriptors in a way that causes separate threads to use different parts of the table. Some applications may depend on sequential file-descriptor allocation, however, which may not mesh well. It would potentially stop the cache-line bouncing of the table, though. Matthew Wilcox suggested that the scheme could be prototyped using dup2().

Wilcox also suggested that moving to a process-based, rather than thread-based, model for these services would be another way to avoid some of the problems that Facebook is experiencing. Watson said that would essentially be impossible. The idea of a per-thread file-descriptor range is worth experimenting with, however.


Index entries for this article
KernelMemory management/Scalability
ConferenceStorage, Filesystem, and Memory-Management Summit/2019


to post comments

Improving fget() performance

Posted May 6, 2019 21:36 UTC (Mon) by roc (subscriber, #30627) [Link] (9 responses)

A new open() (or accept4() etc) flag could opt out of the sequential file descriptor allocation regime to allow a more performant allocation scheme.

Improving fget() performance

Posted May 6, 2019 22:27 UTC (Mon) by djwatson24 (guest, #74976) [Link] (5 responses)

One complication is that often one thread accepts, but then is load balanced to another thread for most of its lifetime. So it may be necessary to somehow specify or change the fd range at a time later than accept() or open().

Improving fget() performance

Posted May 6, 2019 23:15 UTC (Mon) by Cyberax (✭ supporter ✭, #52523) [Link]

Do a dup2() call?

Improving fget() performance

Posted May 7, 2019 6:40 UTC (Tue) by smurf (subscriber, #17840) [Link] (3 responses)

I'd suggest that the only tasks where sequential allocation matters is when you close fd 0…2 right before dup()ing, and even that is highly unreliable in a threaded environment (thread 1 closes stdout, thread 2 calls accept(), thread 1 dup()s … oops – should have used dup2() …).

Thus, sequential allocation doesn't need a flag or its absence, it only needs to check that the refcount of the fd table is ==1.

Improving fget() performance

Posted May 7, 2019 17:24 UTC (Tue) by willy (subscriber, #9762) [Link] (2 responses)

It is mandated POSIX behaviour though, so we usually like to have a way for applications to opt-in to non-POSIX behaviour, be it a prctl, an fcntl or something else.

Let's see what the dup2() experiment buys us, then we can discuss how to implement it (if it is a win)

Improving fget() performance

Posted May 7, 2019 19:55 UTC (Tue) by abatters (✭ supporter ✭, #6932) [Link] (1 responses)

Instead of using non-sequential fds, how about indexing the table differently by shuffling the bits of the fd around before using it as an index, so that sequential fds don't share a cacheline?

Improving fget() performance

Posted May 7, 2019 20:11 UTC (Tue) by willy (subscriber, #9762) [Link]

Nice idea. Unfortunately, the problem is not that fds are assigned to threads in some kind of round-robin order. They're assigned to threads in a semi-random order in an attempt to keep worker threads equally busy. Adding in a shuffle isn't going to help.

Improving fget() performance

Posted May 6, 2019 23:50 UTC (Mon) by neilbrown (subscriber, #359) [Link] (1 responses)

> A new open() (or accept4() etc) flag could opt out of the sequential file descriptor allocation regime to allow a more performant allocation scheme.

FYI: https://lwn.net/Articles/236843/

Improving fget() performance

Posted May 7, 2019 4:21 UTC (Tue) by roc (subscriber, #30627) [Link]

The objection there about flags not being available on relevant syscalls seems to have become obsolete.

Improving fget() performance

Posted May 7, 2019 16:55 UTC (Tue) by rweikusat2 (subscriber, #117920) [Link]

As usual, there's no need to break working code. Threads of control may run in a shared address space without sharing a file descriptor table. Granted, this means explicit file descriptor passing would have to be used to bounce jobs among threads for the joy of delay but as people who do that obviously don't have any performance requirements (the c10k page is really old by now), this shouldn't matter.

Improving fget() performance

Posted May 26, 2019 9:58 UTC (Sun) by daurnimator (guest, #92358) [Link]

With io_uring, you can skip the fget call by using a 'fixed' reference. There should be a patch coming soon to add sendmsg()+recvmsg() support to io_uring.


Copyright © 2019, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds