From: Rik van Riel <riel-AT-redhat.com>
To: John Stultz <john.stultz-AT-linaro.org>
Subject: Re: [PATCH] [RFC] fadvise: Add _VOLATILE,_ISVOLATILE, and _NONVOLATILE
Date: Tue, 22 Nov 2011 04:37:36 -0500
Cc: LKML <linux-kernel-AT-vger.kernel.org>,
    Robert Love <rlove-AT-google.com>,
    Christoph Hellwig <hch-AT-infradead.org>,
    Andrew Morton <akpm-AT-linux-foundation.org>,
    Hugh Dickins <hughd-AT-google.com>, Mel Gorman <mel-AT-csn.ul.ie>,
    Dave Hansen <dave-AT-linux.vnet.ibm.com>,
    Eric Anholt <eric-AT-anholt.net>,
    Jesse Barnes <jbarnes-AT-virtuousgeek.org>,
    Johannes Weiner <jweiner-AT-redhat.com>,
    Jon Masters <jcm-AT-redhat.com>
On 11/21/2011 10:33 PM, John Stultz wrote:
> This patch provides new fadvise flags that can be used to mark
> file pages as volatile, which will allow them to be discarded if the
> kernel wants to reclaim memory.
> This is useful for userspace to allocate things like caches, and lets
> the kernel destructively (but safely) reclaim them when there's memory
> pressure.
> Right now, we can simply throw away pages if they are clean (backed
> by a current on-disk copy). That only happens for anonymous/tmpfs/shmfs
> pages when they're swapped out. This patch lets userspace select
> dirty pages which can be simply thrown away instead of writing them
> to disk first. See mm/shmem.c for this bit of code. It's
> different from FADV_DONTNEED since the pages are not immediately
> discarded; they are only discarded under pressure.
I've got a few questions:
1) How do you tell userspace some of its data got discarded?
2) How do you prevent the situation where every
volatile object gets a few pages discarded, making
them all unusable?
(better to throw away an entire object at once)
3) Isn't it too slow for something like Firefox to
create a new tmpfs object for every single throw-away object?
Johannes, Jon and I have looked at an alternative way to
allow the kernel and userspace to cooperate in throwing
out cached data. This alternative way does not touch
the alloc/free fast path at all, but does require some
cooperation at "shrink cache" time.
The idea is quite simple:
1) Every program that we are interested in already has
some kind of main loop where it polls on file descriptors.
It is easy for such programs to add an additional file,
which would be a device or sysfs file that wakes up the
program from its poll/select loop when memory is getting
full to the point that userspace needs to shrink its caches.
The kernel can be smart here and wake up just one process
at a time, targeting specific NUMA nodes or cgroups. Such
kernel smarts do not require additional userspace changes.
2) When userspace gets such a "please shrink your caches"
event, it can do various things. A program like Firefox
could throw away several cached objects, e.g. uncompressed
images or entire pre-rendered tabs, while a JVM can shrink
its heap size and a database could shrink its internal
caches.
3) After doing that, they could all call the same glibc
function that walks across program-internal free memory
and calls MADV_FREE on all free regions that span
multiple pages, which gives the pages back to the kernel,
without needing to move VMA boundaries. This is relatively
lightweight and allows for the nuking of pages right in
the middle of a heap VMA.
4) In some GUI libraries, like gtk/glib, we could open the
memory pressure device node (or sysfs file) by default,
hooking it up to the glibc function from (3) by default,
which would give all gtk/glib programs the ability to
give free()d memory back to the kernel on request, without
needing to even modify the program.
Program modification would only be needed in order to
free cached objects, etc. The modification of programs
running under those libraries would consist of overriding
the "shrink caches" hook with their own function, which
first does program-specific stuff and then calls the
default hook to take care of the glibc side.
We considered the same approach you are proposing as well, but
we did not come up with satisfactory answers to the questions I
asked above, which is why we came up with this scheme.
Unfortunately we have not gotten around to implementing it yet,
but I'd be happy to work on it with you guys if you are
interested.