From: Neil Brown <neilb-AT-suse.de>
To: Theodore Tso <tytso-AT-mit.edu>
Subject: Re: Linux 2.6.29
Date: Thu, 26 Mar 2009 13:50:10 +1100
Cc: Linus Torvalds <torvalds-AT-linux-foundation.org>,
 David Rees <drees76-AT-gmail.com>, Jesper Krogh <jesper-AT-krogh.cc>,
 Linux Kernel Mailing List <linux-kernel-AT-vger.kernel.org>
On Wednesday March 25, firstname.lastname@example.org wrote:
> On Wed, Mar 25, 2009 at 11:40:28AM -0700, Linus Torvalds wrote:
> > On Wed, 25 Mar 2009, Theodore Tso wrote:
> > > I'm beginning to think that using a "ratio" may be the wrong way to
> > > go. We probably need to add an optional dirty_max_megabytes field
> > > where we start pushing dirty blocks out when the number of dirty
> > > blocks exceeds either the dirty_ratio or the dirty_max_megabytes,
> > > which ever comes first.
> > We have that. Except it's called "dirty_bytes" and
> > "dirty_background_bytes", and it defaults to zero (off).
> > The problem being that unlike the ratio, there's no sane default value
> > that you can at least argue is not _entirely_ pointless.
> Well, if the maximum time that someone wants to wait for an fsync() to
> return is one second, and the RAID array can write 100MB/sec, then
> setting a value of 100MB makes a certain amount of sense. Yes, this
> doesn't take seek overheads into account, and it may be that we're not
> writing things out in an optimal order, as Alan has pointed out. But
> 100MB is a much lower number than 5% of 32GB (1.6GB). It would be
> better if these numbers were accounted on a per-filesystem basis
> instead of as a global threshold, but for people who are complaining
> about huge latencies, it is at least a partial workaround that they
> can use today.
We do a lot of dirty accounting on a per-backing_device basis. This
was added to stop slow devices from sucking up too much of the "40%
dirty" space. The allowable dirty space is now shared among all
devices in rough proportion to how quickly they write data out.
My memory of how it works isn't perfect, but we count write-out
completions both globally and per-bdi, and maintain for each device
the fraction of recent completions that it accounts for. That device
then gets a share of the available dirty space based on that
fraction. The counts decay somehow, so that the fraction represents
recent activity rather than all history.
It shouldn't be too hard to add some concept of total time to this.
If we track the number of write-outs per unit time and use that together
with a "target time for fsync" to scale the 'dirty_bytes' number, we
might be able to auto-tune the amount of dirty space to fit the speeds
of the drives.
We would probably start with each device having a very low "max dirty"
number which would cause writeouts to start soon. Once the device
demonstrates that it can do n-per-second (or whatever) the VM would
allow the "max dirty" number to drift upwards. I'm not sure how best
to get it to move downwards if the device slows down (or the kernel
over-estimated). Maybe it should regularly decay so that the device
keeps having to "prove" itself.
We would still leave the "dirty_ratio" as an upper-limit because we
don't want all of memory to be dirty (and 40% still sounds about
right). But we would now have a time-based value that sets a more
realistic limit when there is enough memory to keep the devices busy
for multiple minutes.
Sorry, no code yet. But I think the idea is sound.