Being nicer to executable pages

By Jonathan Corbet
May 19, 2009
In an ideal world, our computers would have enough memory to run all of the applications we need. In the real world, our systems are loaded with contemporary desktop environments, office suites, and more. So, even with the large amounts of memory being shipped on modern systems, there still never quite seems to be enough. Memory gets paged out to make room for new demands, and performance suffers. Some help may be on the way in the form of a new patch by Wu Fengguang which has the potential to make things better, should it ever be merged.

The kernel maintains two least-recently-used (LRU) lists for pages owned by user processes. One of these lists holds pages which are backed up by files - they are the page cache; the other list holds anonymous pages which are backed up by the swap device, assuming one exists. When the kernel needs to free up memory, it will do its best to push out pages which are backed up by files first. Those pages are much more likely to be unmodified, and I/O to them tends to be faster. So, with luck, a system which evicts file-backed pages first will perform better.
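
Those lists are directly visible on recent kernels: since the LRU split in 2.6.28, /proc/meminfo carries separate "Active(anon)", "Inactive(anon)", "Active(file)", and "Inactive(file)" counters. A trivial program to pull them out (just an illustration, not part of the patch under discussion):

    /* Print the per-list LRU counters from /proc/meminfo. Assumes a
     * kernel with the split LRU (2.6.28 or later), where the
     * "(anon)" and "(file)" suffixes appear on those four lines. */
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        FILE *f = fopen("/proc/meminfo", "r");
        char line[128];

        if (!f) {
            perror("/proc/meminfo");
            return 1;
        }
        while (fgets(line, sizeof(line), f))
            if (strstr(line, "(anon)") || strstr(line, "(file)"))
                fputs(line, stdout);  /* e.g. "Active(file):  123456 kB" */
        fclose(f);
        return 0;
    }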

It may be possible to do things better, though. Certain kinds of activities - copying a large file, for example - can quickly fill memory with file-backed pages. As the kernel works to recover those pages, it stands a good chance of pushing out other file-backed pages which are likely to be more useful. In particular, pages containing executable code are relatively likely to be wanted in the near future. If the kernel pages out the C library, for example, chances are good that running processes will cause it to be paged back in quickly. The loss of needed executable pages is part of why operations involving large amounts of file data can make the system seem sluggish for a while afterward.

Wu's patch tries to improve the situation through a fairly simple change: when the page reclaim scanning code hits a file-backed, executable page which has the "referenced" bit set, it simply clears the bit and moves on. So executable pages get an extra trip through the LRU list; that will happen repeatedly for as long as somebody is making use of the page. If all goes well, pages running useful code will stay in RAM, while those holding less useful file data will get pushed out first. It should lead to a more responsive system.
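
In code form the change is small; a simplified sketch of the extra test in the active-list scanner (not the actual patch, and with the helper names and surrounding context only approximated) looks something like this:

    /*
     * Sketch of the idea, loosely modeled on the active-list scan in
     * mm/vmscan.c of this era; not the patch itself.  page_referenced()
     * clears the referenced bits as it checks them.
     */
    if (page_referenced(page, 0, sc->mem_cgroup, &vm_flags)) {
        /*
         * A referenced, file-backed, executable mapping goes straight
         * back onto the active list: one more full trip around the
         * LRU before it can be considered for reclaim again.
         */
        if ((vm_flags & VM_EXEC) && page_is_file_cache(page)) {
            list_add(&page->lru, &l_active);
            continue;
        }
    }
    /* Everything else moves toward the inactive list as before. */
    list_add(&page->lru, &l_inactive);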

The code seems to be in a relatively finished state at this point. So one might well ask whether it will be merged in the near future. That is never a straightforward question with memory management code, though. This patch may well make it into the mainline, but it will have to get over some hurdles in the process. The first of those hurdles is a simple question from Andrew Morton:

Now. How do we know that this patch improves Linux?

Claims like "it feels more responsive" are notoriously hard to quantify. But, without some sort of reasonably objective way to see what benefit is offered by this patch, the kernel developers are going to be reluctant to make changes to low-level memory management heuristics. The fear of regressions is always there as well; nobody wants to learn about some large database workload which gets slower after a patch like this goes in. In summary: knowing whether this kind of patch really makes the situation better is not as easy as one might wish.

The second problem is that this change would make it possible for a sneaky application to keep its data around by mapping its files with the "executable" bit set. The answer to this objection is easier: an application which seeks unfair advantage by playing games can already do so. Since anonymous pages receive preferable treatment already, the sneaky application could obtain a similar effect on current kernels by allocating memory and reading in the full file contents. Sites which are truly worried about this sort of abuse can (1) use the memory controller to put a lid on memory use, and/or (2) use SELinux to prevent applications from mapping file-backed pages with execute permission enabled.
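
The "game" in question requires nothing more exotic than mapping ordinary file data with execute permission; a minimal sketch (the helper is hypothetical, and error handling is thin):

    /* Map a data file with PROT_EXEC so that, under the proposed
     * heuristic, its pages are treated like executable text.
     * Hypothetical helper for illustration only. */
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    void *map_as_text(const char *path, size_t *len)
    {
        struct stat st;
        void *p;
        int fd = open(path, O_RDONLY);

        if (fd < 0)
            return NULL;
        if (fstat(fd, &st) < 0) {
            close(fd);
            return NULL;
        }
        *len = st.st_size;
        /* PROT_EXEC is the whole point: the mapping gets VM_EXEC set. */
        p = mmap(NULL, *len, PROT_READ | PROT_EXEC, MAP_PRIVATE, fd, 0);
        close(fd);
        return p == MAP_FAILED ? NULL : p;
    }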

Finally, Alan Cox has wondered whether this kind of heuristic-tweaking is the right approach in the first place:

I still think the focus is on the wrong thing. We shouldn't be trying to micro-optimise page replacement guesswork - we should be macro-optimising the resulting I/O performance. My disks each do 50MBytes/second and even with the Gnome developers finest creations that ought to be enough if the rest of the system was working properly.

Alan is referring to some apparent performance problems with the memory management and block I/O subsystems which crept in a few years ago. Some of these issues have been addressed for 2.6.30, but others remain unidentified and unresolved so far.

Wu's patch will not change that, of course. But it may still make life a little better for desktop Linux users. It is sufficiently simple and well contained that, in the absence of clear performance regressions for other workloads, it will probably find its way into the mainline sooner or later.


Being nicer to executable pages

Posted May 21, 2009 3:33 UTC (Thu) by jmspeex (subscriber, #51639) [Link]

This is actually something I don't understand. When I do a "cp huge_file somewhere_else", why does the kernel insist on caching all that data while evicting more useful pages? Isn't there a way to realise that "oh, I keep reading from that file descriptor, but I never seem to be re-reading anything I already read"?

splice and reflink

Posted May 21, 2009 3:56 UTC (Thu) by xoddam (subscriber, #2322) [Link]

There's splice, and for copies within a file system (coming soon to a 4th-generation fs near you) reflink.

I'm not entirely sure why cp doesn't use these where applicable. Probably it will, one day.
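
For the curious: at least one end of a splice() call has to be a pipe, so a splice()-based copy bounces the data through one. A rough sketch of what such a copy might look like (not what any real cp does, and with error handling mostly glossed over):

    /* Copy in_fd to out_fd via splice(), bouncing through a pipe so
     * the data never passes through a user-space buffer. Sketch only. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <unistd.h>

    int splice_copy(int in_fd, int out_fd)
    {
        int pfd[2];
        ssize_t n;

        if (pipe(pfd) < 0)
            return -1;
        while ((n = splice(in_fd, NULL, pfd[1], NULL, 65536,
                           SPLICE_F_MOVE)) > 0) {
            /* Drain everything just pushed into the pipe. */
            while (n > 0) {
                ssize_t m = splice(pfd[0], NULL, out_fd, NULL, n,
                                   SPLICE_F_MOVE);
                if (m <= 0)
                    goto out;
                n -= m;
            }
        }
    out:
        close(pfd[0]);
        close(pfd[1]);
        return n == 0 ? 0 : -1;
    }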

splice and reflink

Posted May 21, 2009 5:03 UTC (Thu) by dlang (✭ supporter ✭, #313) [Link]

it's also possible for the application to use madvise to tell the kernel that it doesn't intend to re-use the data.

the problem is that cp is a generic tool; it doesn't know whether you intend to use the data again or not.

it's common to copy files and then do other things with them; as a result, it frequently is the right thing to keep that data in ram.

Better measure before tweaking...

Posted May 21, 2009 18:29 UTC (Thu) by vonbrand (subscriber, #4458) [Link]

Again, one would need to measure what happens... but I'd believe it isn't that common to look immediately at the recently copied file. At least I normally don't.

In any case, there are war stories floating around about "optimizing" something that wasn't used at all, or only very rarely. Ditto for "optimizations" that made things worse.

Better measure before tweaking...

Posted May 21, 2009 18:39 UTC (Thu) by dlang (✭ supporter ✭, #313) [Link]

remember that the kernel doesn't know you are copying the data. it only knows that the app ('cp') is writing a lot of data.

it is _very_ common for one app to write a lot of data and then have that data used immediately.

you are asking for the kernel to notice that this one app is reading data and writing the _same_ data, and then to drop it from memory.

what if the process is changing the data as it writes it, should it be kept around or not?

if an app just reads the data, should you assume that the data will be used again soon or not?

historically it has worked pretty well for the system to assume that if one app is interested in the data (reading or writing it), other apps are fairly likely to want that data again soon.

for every case that someone can raise showing that it isn't needed soon, other cases can be pointed out where it is needed soon. this is why they are reluctant to change the model.

Cache swept by huge file copy

Posted May 23, 2009 23:21 UTC (Sat) by giraffedata (subscriber, #1954) [Link]

Only the most primitive caching systems today allow the cache to be swept out by a large sequential read or write. And they don't use hints from the 'cp' level to prevent it.

Though I don't keep up with Linux virtual page replacement policy, I presume it still does a version of second chance, where a page must be accessed once more after it is added to the cache to reach the "active" state where it can compete with other active pages for use of memory. Until then, it's only competing with other stuff recently added to the cache. So the only problem would be if new active data is coming in at the same time as this big file read/write, where Linux would never get the chance to notice that the new data is active.

So there are more sophisticated, highly successful page replacement policies that are used in other OSes and block storage systems, among other places. One I know keeps track of pages recently evicted so that it can detect when a page is frequently used without having to keep it in memory for a long time on a trial basis. Others simply detect sequential accesses and when it's clear that the whole sequential stream won't fit in memory, give up on caching stuff from that stream.

With caching papers coming out constantly, I've always wondered why Linux is so unsophisticated in that area.
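
For reference, the second-chance ("clock") policy mentioned above fits in a few lines. A toy user-space sketch, nothing like the kernel's real code:

    /* Toy second-chance ("clock") replacement: a referenced frame has
     * its bit cleared and is skipped once; an unreferenced frame is
     * evicted. Purely illustrative. */
    #include <stdbool.h>
    #include <stddef.h>

    struct frame {
        int  page;        /* which page currently occupies this frame */
        bool referenced;  /* set on access, cleared by the clock hand */
    };

    /* Return the index of the frame to evict, advancing the hand. */
    size_t clock_evict(struct frame *frames, size_t nframes, size_t *hand)
    {
        for (;;) {
            struct frame *f = &frames[*hand];
            size_t victim = *hand;

            *hand = (*hand + 1) % nframes;
            if (!f->referenced)
                return victim;        /* its second chance is used up */
            f->referenced = false;    /* give it one more pass */
        }
    }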

splice and reflink

Posted May 21, 2009 21:29 UTC (Thu) by kleptog (subscriber, #1183) [Link]

madvise only works on memory pages, of course. I've been looking for a way to keep a program that produces lots of data from polluting the kernel cache.

Basically, if I have a process producing say 20MB/s of data, is there a way to tell the kernel not to keep it after it's been written out? There's posix_fadvise, but it seems geared toward pages that have been read in, not pages that have been written out.

splice and reflink

Posted May 21, 2009 21:50 UTC (Thu) by dlang (✭ supporter ✭, #313) [Link]

open the file, then use posix_fadvise() on it.
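
On the write side that presumably means something like the following; POSIX_FADV_DONTNEED only drops clean pages, hence the fdatasync() first. A sketch with a hypothetical helper and minimal error handling:

    /* Write a buffer, push it to disk, then hint that the cached copy
     * will not be needed. posix_fadvise() returns 0 or an error number.
     * Illustrative sketch only. */
    #define _XOPEN_SOURCE 600
    #include <fcntl.h>
    #include <unistd.h>

    int write_and_drop(int fd, const void *buf, size_t len, off_t off)
    {
        if (pwrite(fd, buf, len, off) != (ssize_t)len)
            return -1;
        if (fdatasync(fd) < 0)     /* make the just-written pages clean */
            return -1;
        return posix_fadvise(fd, off, len, POSIX_FADV_DONTNEED);
    }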

Being nicer to executable pages

Posted May 21, 2009 10:22 UTC (Thu) by epa (subscriber, #39769) [Link]

When an RDBMS needs to scan through a table, if the table is bigger than available memory then it will not cache any of the pages read. After all, by the time you got to the end of the table the beginning would have been expired out of the cache anyway, so the next sequential scan would not be any faster.

It would be possible for the kernel to look at the size of the file opened, and if the file is big relative to available memory, and reading starts at the beginning and moves sequentially forward, then decide not to cache it. How much of this should be magical heuristics in the kernel, and how much should be hints given to the kernel by the 'cp' program, is a matter of taste.
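
Were the hints to come from the copying program rather than from a kernel heuristic, they would amount to a couple of posix_fadvise() calls on the input side; roughly along these lines (illustrative only, not necessarily what any real cp does):

    /* Read-side hints a cp-like tool could give: declare the access
     * pattern sequential up front, and drop each chunk from the cache
     * once it has been consumed. Illustrative sketch. */
    #define _XOPEN_SOURCE 600
    #include <fcntl.h>
    #include <unistd.h>

    void hint_sequential(int fd)
    {
        /* Bigger readahead, no expectation of random re-reads. */
        posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
    }

    void drop_consumed(int fd, off_t start, off_t len)
    {
        /* Already-read (clean) page-cache pages for this range can go. */
        posix_fadvise(fd, start, len, POSIX_FADV_DONTNEED);
    }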

Being nicer to executable pages

Posted May 21, 2009 11:33 UTC (Thu) by jmspeex (subscriber, #51639) [Link]

Well, technically it's not just cp. There's all the system tools that run with cron and scan the whole filesystem (backup, indexing, ...). They all completely trash the interactivity of the system by not only taking up all the disk bandwidth, but also ejecting all the applications' pages. I think it may be too much to ask for all these applications to be fixed, so the kernel may be where a fix should be attempted.

Being nicer to executable pages

Posted May 22, 2009 7:24 UTC (Fri) by xoddam (subscriber, #2322) [Link]

For general-purpose utilities something like 'nice' or 'ulimit', specified in a script and operating on a process group, would perhaps be more appropriate than asking the kernel to guess what pages not to bother caching.

Being nicer to executable pages

Posted May 21, 2009 14:49 UTC (Thu) by nix (subscriber, #2304) [Link]

Applications have other even simpler ways to receive preferable treatment for their file-backed pages: e.g. just keep referencing each page frequently. (Garbage collectors tend to do this already, admittedly generally for anonymous pages, but that isn't a requirement.)

And Alan's disks can do 50MB/s, but can they do that if the incoming workload is heavily seeky, as is often the case for major faults on the text pages of binaries and on swap? I doubt they can manage more than 1-5MB/s in that situation. Even prefaulting neighbouring pages isn't going to help too much there.

Being nicer to executable pages

Posted May 23, 2009 23:07 UTC (Sat) by giraffedata (subscriber, #1954) [Link]

Alan's disks can do 50MB/s, but can they do that if the incoming workload is heavily seeky, as is often the case for major faults on the text pages of binaries and on swap? I doubt they can manage more than 1-5MB/s in that situation.

I believe that was Alan's point. He's saying there's more to be gained by getting rid of seekiness than by getting rid of reads of executable files.

I don't know if he had any particular approach in mind, but there are lots. For one thing, on a heavily loaded system, which is the kind on which we care most about performance, seek time disappears. The head can move over a cylinder way faster than the cylinder can turn under the head, and the only time we have long seeks is when there isn't enough work to choose from.

Being nicer to executable pages

Posted May 24, 2009 11:29 UTC (Sun) by nix (subscriber, #2304) [Link]

You overestimate current drives. We can't hand a drive dozens of things and say 'give me these back in any order you please'; the most we can do is give it a lot of stuff at once and ask it to hand them back *in order*.

So under heavy load (especially heavy memory pressure or write pressure), seek time comes to dominate :(

Being nicer to executable pages

Posted May 24, 2009 14:55 UTC (Sun) by farnz (guest, #17727) [Link]

You're definitely behind the times - both SATA NCQ drives and SCSI TCQ drives can handle commands out of order. PATA drives, SATA drives without NCQ, and SCSI drives without TCQ can't do this; it's also rare to find USB drives that do this.

So, with a modern laptop drive, I can just shunt 31 commands at it, and let it handle them in the order that's most sensible for the drive. In practice, most drives I've encountered appear to have an internal elevator to minimise seeking; combine this with the deep Linux elevator keeping the drive fed with commands, and my laptop drive is very successfully maintaining a high throughput despite a rather seeky pattern.

Being nicer to executable pages

Posted May 24, 2009 17:58 UTC (Sun) by giraffedata (subscriber, #1954) [Link]

Command queuing in the disk drive (NCQ/TCQ) isn't really an essential part of eliminating seek time. Its main purpose is to eliminate the time you wait to generate a command, get it to the drive, and have the drive interpret it, which it does by allowing you to stream the commands in.

Linux's ability to queue hundreds of I/Os and send them to the drive in block number order is where most of the seek time elimination happens. Even on modern drives, the seek time between two consecutively numbered blocks is usually negligible.

But even so, we're just talking about what Linux already does -- the point is that it can conceivably do even more to make the disk see sequential block number I/O, and it might be a more profitable investment than trying to make it see less I/O by caching executable pages longer.

Being nicer to executable pages

Posted May 25, 2009 15:39 UTC (Mon) by nix (subscriber, #2304) [Link]

SCSI can, sure, but I'd heard that SATA NCQ pretty much couldn't, that it was notably less capable than TCQ. Obviously I heard wrong :)

Being nicer to executable pages

Posted May 21, 2009 15:38 UTC (Thu) by MisterIO (subscriber, #36192) [Link]

In response to Alan Cox's objection: why not both? And by the way, if the system becomes so much more responsive that the user can actually notice it, the optimization probably isn't all that micro.

Copyright © 2009, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds