
The Linux Storage and Filesystem Summit, day 1

Posted Aug 9, 2010 5:20 UTC (Mon) by neilbrown (subscriber, #359)
Parent article: The 2010 Linux Storage and Filesystem Summit, day 1

> but there does not seem to be a better method on offer at the moment.

As far as I can see, the main reason for setting dirty_ratio below about 50% is to limit the time it takes for "sync" to complete (and fsync on ext3 data=ordered filesystems); as you go above 50%, direct reclaim will trigger significantly more often and slow memory allocation down a lot.

So the tunable should be "how long is sync allowed to take". Then you need an estimate of the throughput of each bdi, and don't allow any bdi to gather more dirty memory than that estimate multiplied by the tunable.

Of course this is much more easily said than done - getting a credible estimate in an efficient manner is non-trivial. You can only really measure throughput during intensive write-out, and that probably happens mostly once dirty_ratio is reached, which is a bit late to be setting dirty_ratio.

I suspect some adaptive thing could be done - the first sync might be too slow, but long term it would sort itself out.
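
Purely as an illustration of that idea (the structures and the 1/8 weighting below are invented for the sketch, not anything in the kernel), the per-bdi limit and an adaptive throughput estimate might look something like:

    /*
     * Hypothetical sketch only: a per-bdi dirty limit derived from an
     * estimated writeback throughput and a "maximum sync time" tunable,
     * with the estimate refined adaptively during write-out.
     */
    #include <stdint.h>

    struct bdi_estimate {
        uint64_t bytes_per_sec;   /* current throughput estimate */
        uint64_t max_sync_secs;   /* tunable: how long sync may take */
    };

    /* Upper bound on dirty memory for this bdi: throughput times allowed time. */
    static uint64_t bdi_dirty_limit(const struct bdi_estimate *e)
    {
        return e->bytes_per_sec * e->max_sync_secs;
    }

    /*
     * Fold an observed write-out burst into the estimate with an exponentially
     * weighted moving average: the first sync may be too slow, but the limit
     * converges over time.
     */
    static void bdi_update_estimate(struct bdi_estimate *e,
                                    uint64_t bytes_written, uint64_t elapsed_secs)
    {
        uint64_t observed;

        if (elapsed_secs == 0)
            return;
        observed = bytes_written / elapsed_secs;
        /* weight the new observation at 1/8, keep 7/8 of the old estimate */
        e->bytes_per_sec = (e->bytes_per_sec * 7 + observed) / 8;
    }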



The Linux Storage and Filesystem Summit, day 1

Posted Aug 9, 2010 8:24 UTC (Mon) by koverstreet (✭ supporter ✭, #4296)

My thought when I read that is that what we really want is some statistics on dirty data - probably indexed by the length of the sequential area it belongs to (also tracking the average sequential size) - plus some heuristics for how likely dirty data is to be redirtied, if someone can come up with useful ones.

The idea being that if you're, say, copying iso files, there's no point in queuing up a gigabyte's worth - but Berkeley DB doing random IO should be allowed to use more memory. Especially if you maintained those statistics per process, you'd be in good shape to do that.

Having never looked at the writeback code I've no idea what it does already, but once you're keeping track of sequential chunks of dirty data it seems to me it'd be a great idea to write them out roughly in order of sequential size - writing out the isos you're copying before your Berkeley DB.
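
As a rough sketch of that ordering idea (the structures and the qsort-based priority below are made up for illustration, not a description of what the writeback code does):

    /*
     * Track dirty ranges together with the length of the sequential run they
     * belong to, and issue write-out for the longest runs first.
     */
    #include <stdlib.h>
    #include <stdint.h>

    struct dirty_run {
        uint64_t offset;   /* start of the sequential dirty range */
        uint64_t length;   /* bytes in the run */
    };

    /* Sort comparator: longer runs first. */
    static int by_length_desc(const void *a, const void *b)
    {
        const struct dirty_run *x = a, *y = b;

        if (x->length == y->length)
            return 0;
        return x->length < y->length ? 1 : -1;
    }

    /* Order pending runs so the large sequential chunks (the iso copy) go out
     * before the scattered pages (the Berkeley DB). */
    static void order_for_writeback(struct dirty_run *runs, size_t n)
    {
        qsort(runs, n, sizeof(*runs), by_length_desc);
    }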

The Linux Storage and Filesystem Summit, day 1

Posted Aug 9, 2010 8:56 UTC (Mon) by james_ (guest, #55070) (3 responses)

As an example of a case where setting the ratio low is advantageous:

We were testing a NAS system recently. Our tests use 54 systems writing to the NAS server. The default value of /proc/sys/vm/dirty_ratio was 40. We saw very bad performance when we applied a large write load to the system. The vendor's technical support noted that the writes were arriving at the NAS out of order, and that because we had a large number of writes we were defeating the NAS's cache, forcing the out-of-order writes to become read-modify-write cycles. By dropping the value to, for example, 2, we saw the NAS system perform well.

The Linux Storage and Filesystem Summit, day 1

Posted Aug 9, 2010 9:31 UTC (Mon) by neilbrown (subscriber, #359) (2 responses)

I completely agree - the case where I have seen the need for a low ratio was also when the writes were going out via NFS. The machine had, I think, 32Gig of RAM, so 40% was a lot. Even 1%, the smallest non-zero setting, took longer to flush than we really wanted.

Problems with out-of-order writes are an interesting twist on that!

The Linux Storage and Filesystem Summit, day 1

Posted Aug 15, 2010 17:52 UTC (Sun) by kleptog (subscriber, #1183) (1 response)

I have another case where the default settings don't work well. A process that is somewhat realtime produces 30MB/s of data that it writes to disk. Under the default settings the kernel will wait 30 seconds before writing anything (900MB), and if it doesn't get the data out quickly enough the process gets stuck because it has used up all of the 20% of memory allowed for its dirty data.

The solution is to have the kernel check the amount of data waiting much more often (every second rather than every 5 seconds) and to drastically reduce the amount of dirty memory that is allowed to accumulate before writeback happens.

Without this the kernel suddenly realises it has more than a gigabyte of data to write back (20% of 8GB = 1.6GB) and manages to starve other processes while trying to get it out, whereas if it just writes back small amounts continuously in the background everything goes smoothly. 1% works well, since that's an amount the storage subsystem can handle quickly.

Pity it's a global setting, though; other processes would probably work better with a higher writeback threshold, but you can't pick and choose.
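
For concreteness, the kind of tuning described in these comments is done through the vm sysctls in /proc; the particular values below are only illustrative (wake the flusher every second, start background writeback at 1% of memory, cap foreground dirtying at 2% as in the NAS example above), not recommendations:

    /* A small helper that applies the tuning by writing the vm sysctls. */
    #include <stdio.h>

    static int write_sysctl(const char *path, const char *value)
    {
        FILE *f = fopen(path, "w");

        if (!f) {
            perror(path);
            return -1;
        }
        fprintf(f, "%s\n", value);
        return fclose(f);
    }

    int main(void)
    {
        /* check for writeback work every second instead of every five */
        write_sysctl("/proc/sys/vm/dirty_writeback_centisecs", "100");
        /* start background writeback at 1% of memory */
        write_sysctl("/proc/sys/vm/dirty_background_ratio", "1");
        /* cap foreground dirtying at 2%, as in the NAS example above */
        write_sysctl("/proc/sys/vm/dirty_ratio", "2");
        return 0;
    }

Writes to /proc/sys take effect immediately but do not persist across a reboot; the same settings can be put in /etc/sysctl.conf.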

The Linux Storage and Filesystem Summit, day 1

Posted Aug 19, 2010 13:04 UTC (Thu) by cypherpunks (guest, #1288)

I can't help thinking that the solution looks a lot like a PID controller. That is, page write speed is defined by the sum of three terms: one proportional to the number of excess dirty pages, one proportional to the integral of that excess, and one proportional to its derivative.

The latter is the "feed-forward" term, and helps respond quickly to sudden changes. If the rate of page dirtying increases sharply, the rate of writeback should likewise take a sudden jump.
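
A minimal sketch of what such a controller could look like, with made-up names and gains:

    /*
     * PID sketch: the output is a writeback rate driven by the excess of
     * dirty pages over the setpoint.
     */
    struct pid_state {
        double kp, ki, kd;   /* proportional, integral, derivative gains */
        double integral;     /* accumulated error */
        double prev_error;   /* error at the previous sample */
    };

    /* Compute a writeback rate (pages/sec) from the current dirty-page excess,
     * sampled every dt seconds. */
    static double writeback_rate(struct pid_state *s, double excess_dirty, double dt)
    {
        double derivative = (excess_dirty - s->prev_error) / dt;

        s->integral += excess_dirty * dt;
        s->prev_error = excess_dirty;

        return s->kp * excess_dirty
             + s->ki * s->integral
             + s->kd * derivative;
    }

The derivative term is the one that reacts to a sudden jump in the dirtying rate, while the integral term removes any steady-state offset.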

