The Linux Storage and Filesystem Summit, day 1
Posted Aug 9, 2010 5:20 UTC (Mon) by neilbrown (subscriber, #359)
Parent article: The 2010 Linux Storage and Filesystem Summit, day 1
As far as I can see, the main reason for setting dirty_ratio below about 50% is to limit the time it takes for "sync" to complete (and fsync on ext3 data=ordered filesystems); once you go above 50%, direct reclaim triggers significantly more often and slows memory allocation down a lot.
So the tunable should be "how long is sync allowed to take". Then you need an estimate of the throughput of each bdi, and don't allow any bdi to gather more dirty memory than that estimate multiplied by the tunable.
Of course this is much more easily said than done - getting a credible estimate in an efficient manner is non-trivial. You can only really measure throughput during intensive write-out, and that probably happens mostly once dirty_ratio is reached, which is a bit late to be setting dirty_ratio.
I suspect some adaptive thing could be done - the first sync might be too slow, but long term it would sort itself out.
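A minimal userspace sketch of that idea, assuming invented names (bdi_stats, update_throughput, bdi_dirty_limit) and a simple exponentially-weighted throughput estimate rather than whatever the kernel would actually use:

    #include <stdio.h>

    struct bdi_stats {
        double est_throughput;   /* bytes/sec, smoothed estimate */
        unsigned long dirty;     /* dirty memory backed by this BDI */
    };

    /* Update the throughput estimate from one completed writeout burst. */
    static void update_throughput(struct bdi_stats *bdi,
                                  unsigned long bytes_written, double secs)
    {
        double sample = bytes_written / secs;
        /* Smooth heavily: one slow sync doesn't wreck the limit,
         * but over time the estimate converges on the device. */
        bdi->est_throughput = 0.875 * bdi->est_throughput + 0.125 * sample;
    }

    /* "How long is sync allowed to take" becomes a per-BDI byte limit. */
    static unsigned long bdi_dirty_limit(const struct bdi_stats *bdi,
                                         double max_sync_secs)
    {
        return (unsigned long)(bdi->est_throughput * max_sync_secs);
    }

    int main(void)
    {
        struct bdi_stats disk = { .est_throughput = 100e6 }; /* assume 100 MB/s */
        update_throughput(&disk, 80 * 1000 * 1000, 1.0);     /* one measured burst */
        printf("allow ~%lu MB dirty for a 5s sync budget\n",
               bdi_dirty_limit(&disk, 5.0) / (1000 * 1000));
        return 0;
    }

The heavy smoothing is the "adaptive thing": the first sync after a bad estimate may overshoot the budget, but repeated write-out bursts pull the estimate toward the device's real throughput.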
Posted Aug 9, 2010 8:24 UTC (Mon)
by koverstreet (✭ supporter ✭, #4296)
[Link]
The idea being that if you're, say, copying ISO files there's no point in queuing up a gigabyte's worth - but bdb doing random IO should be allowed to use more memory. Especially if you maintained those statistics per process, you'd be in good shape to do that.
Having never looked at the writeback code I've no idea what it does already, but once you're keeping track of sequential chunks of dirty data it seems to me it'd be a great idea to write them out roughly in order of sequential size - writing out the ISOs you're copying before your Berkeley DB.
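A toy illustration of that ordering, with an invented dirty_extent structure standing in for whatever the writeback code actually tracks:

    #include <stdio.h>
    #include <stdlib.h>

    struct dirty_extent {
        unsigned long start;   /* first dirty page in the run */
        unsigned long npages;  /* length of the contiguous dirty run */
    };

    /* Largest sequential run first: the ISO copy's big extents get
     * flushed before Berkeley DB's scattered single pages. */
    static int by_size_desc(const void *a, const void *b)
    {
        const struct dirty_extent *x = a, *y = b;
        if (x->npages != y->npages)
            return (y->npages > x->npages) ? 1 : -1;
        return 0;
    }

    int main(void)
    {
        struct dirty_extent extents[] = {
            { .start = 4096, .npages = 1 },      /* random database page */
            { .start = 0,    .npages = 262144 }, /* ~1GB sequential copy */
            { .start = 9000, .npages = 2 },
        };
        size_t n = sizeof(extents) / sizeof(extents[0]);

        qsort(extents, n, sizeof(extents[0]), by_size_desc);

        for (size_t i = 0; i < n; i++)
            printf("write out %lu pages at %lu\n",
                   extents[i].npages, extents[i].start);
        return 0;
    }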
Posted Aug 9, 2010 8:56 UTC (Mon)
by james_ (guest, #55070)
[Link] (3 responses)
We were testing a NAS system recently. Our tests use 54 systems writing to the NAS server. The default value of /proc/sys/vm/dirty_ratio was 40. We saw very bad performance when we applied a large write to the system. The vendor's technical support noted that the writes were reaching the NAS out of order, and that because we had a large number of writes we were defeating the NAS's cache, forcing the out-of-order writes to become read-modify-write cycles. By dropping the value to, for example, 2, we saw the NAS system perform well.
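For reference, that workaround amounts to lowering vm.dirty_ratio, e.g. with "sysctl -w vm.dirty_ratio=2" or, as a minimal C sketch (run as root; the value 2 is just the one mentioned above, not a general recommendation):

    #include <stdio.h>

    int main(void)
    {
        FILE *f = fopen("/proc/sys/vm/dirty_ratio", "w");
        if (!f) {
            perror("dirty_ratio");
            return 1;
        }
        fprintf(f, "2\n");   /* was 40 by default on the system described */
        fclose(f);
        return 0;
    }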
Posted Aug 9, 2010 9:31 UTC (Mon)
by neilbrown (subscriber, #359)
[Link] (2 responses)
Problems with out-of-order writes are an interesting twist on that!
Posted Aug 15, 2010 17:52 UTC (Sun)
by kleptog (subscriber, #1183)
[Link] (1 response)
The solution is to have the kernel check the amount of data waiting much more often (every second rather than every 5 seconds) and to drastically reduce the amount of dirty memory that is allowed to accumulate before writeback starts.
Without this, the kernel suddenly realises it has more than a gigabyte of data to write back (20% of 8GB = 1.6GB) and manages to starve other processes while trying to get it out. Whereas if it just writes back small amounts continuously in the background, everything goes smoothly. 1% works well, since that's an amount the storage subsystem can handle quickly.
Pity it's a global setting though; other processes would probably work better with a higher writeback threshold, but you can't pick and choose.
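A toy model of that behaviour (not kernel code; the memory sizes, the per-wakeup chunk, and the accounting stubs are invented for illustration):

    #include <stdio.h>
    #include <unistd.h>

    #define TOTAL_MEM  (8ULL << 30)         /* assume an 8GB machine */
    #define THRESHOLD  (TOTAL_MEM / 100)    /* start writeback at 1%, not 20% */
    #define CHUNK      (100ULL << 20)       /* roughly what the disk absorbs per second */

    /* Toy stand-ins for real dirty-page accounting and block I/O. */
    static unsigned long long dirty = 512ULL << 20;   /* pretend 512MB is dirty */

    static unsigned long long dirty_bytes(void) { return dirty; }

    static void write_back(unsigned long long bytes)
    {
        printf("writing back %llu MB\n", bytes >> 20);
        dirty -= bytes;
    }

    int main(void)
    {
        /* Check every second and trickle writeback out as soon as the
         * dirty total crosses a small threshold, instead of letting
         * 1.6GB pile up and flushing it in one starvation-inducing burst. */
        while (dirty_bytes() > THRESHOLD) {
            unsigned long long excess = dirty_bytes() - THRESHOLD;
            write_back(excess > CHUNK ? CHUNK : excess);
            sleep(1);                       /* rather than every 5 seconds */
        }
        return 0;
    }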
Posted Aug 19, 2010 13:04 UTC (Thu)
by cypherpunks (guest, #1288)
[Link]
The latter is the "feed-forward" term, and helps respond quickly to sudden changes. If the rate of page dirtying increases sharply, the rate of writeback should likewise take a sudden jump.
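A generic sketch of such a controller, combining a proportional term on the dirty level with the feed-forward term on the dirtying rate; the structure, names, and gain values here are invented for illustration and are not the kernel's algorithm:

    #include <stdio.h>

    struct wb_control {
        double setpoint;   /* target amount of dirty memory (bytes) */
        double kp;         /* proportional gain on the dirty-level error */
        double kf;         /* feed-forward gain on the dirtying rate */
    };

    /* Writeback rate = proportional correction + feed-forward term.
     * If pages are suddenly being dirtied much faster, the second term
     * bumps the writeback rate immediately, before the error grows. */
    static double writeback_rate(const struct wb_control *c,
                                 double dirty, double dirtying_rate)
    {
        double error = dirty - c->setpoint;
        double rate = c->kp * error + c->kf * dirtying_rate;
        return rate > 0 ? rate : 0;
    }

    int main(void)
    {
        struct wb_control c = { .setpoint = 256e6, .kp = 0.5, .kf = 1.0 };
        printf("steady dirtying: %.0f MB/s\n",
               writeback_rate(&c, 260e6, 50e6) / 1e6);
        printf("sudden burst:    %.0f MB/s\n",
               writeback_rate(&c, 260e6, 400e6) / 1e6);
        return 0;
    }

With only the proportional term, the writeback rate would not rise until the dirty level had already drifted well above the setpoint; the feed-forward term reacts as soon as the dirtying rate jumps.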