
Dirty pages, faster writing and fsync


Posted Mar 26, 2014 14:46 UTC (Wed) by jhhaller (subscriber, #56103)
Parent article: PostgreSQL pain points

The problem is with dirty buffer caches, which can grow quite large compared with the storage I/O capacity, and with the limited mechanisms for flushing dirty pages to disk. While the amount of dirty pages which can exist without being flushed has been reduced recently, it's still quite large. For a simple example, copy a huge file across a network while watching the disk activity lights. Even with a copy running at 1 Gbps, the disk shows no activity for several seconds, then is solid on for several seconds, and the pattern repeats. While I understand that for some use cases, such as tmp files, it is desirable not to write files which will shortly be deleted, there are cases where the data does need to be persisted. For the persistent cases, it would be desirable to start writing dirty pages as they fill, as the lowest-priority I/O. Then, when fsync is called, there should be few dirty blocks left to write. Perhaps an F_MINIMIZEDIRTYBLOCKS fcntl option is in order.



Dirty pages, faster writing and fsync

Posted Mar 26, 2014 17:59 UTC (Wed) by roblucid (subscriber, #48964) [Link]

There can be large advantages to deferring writes: you can, for instance, allocate a decent number of blocks contiguously. Leaving I/O bandwidth unused tends to be a 'win' even for files that do eventually reach the disk platters, thanks to the reduction in disk seeks.

One past problem with fsync() has been that it synced not only the file in question but all dirty blocks for the whole filesystem, which is likely NOT what an RDBMS wants, yet was the most feasible implementation for the call. If the FS implementations can't fix that behaviour, how is any other file attribute going to help the situation?

Dirty pages, faster writing and fsync

Posted Mar 26, 2014 18:45 UTC (Wed) by dlang (subscriber, #313) [Link]

> One past problem with fsync() has been that it synced not only the file in question but all dirty blocks for the whole filesystem, which is likely NOT what an RDBMS wants, yet was the most feasible implementation for the call. If the FS implementations can't fix that behaviour, how is any other file attribute going to help the situation?

As I understand it, this is an ext3 bug, the other filesystems don't behave this way.

Dirty pages, faster writing and fsync

Posted Apr 4, 2014 7:17 UTC (Fri) by dbrower (guest, #96396) [Link]

Even if fsync is limited to the single file on the FD, it won't guarantee ordering of writes, or that only the blocks of interest are written.

A better approach might be O_DIRECT and async i/o.

If some things work better with non-O_DIRECT I/O, then the calling code isn't doing a very good job of planning its I/O and managing the buffer cache. A typical case for this might be a full-table scan where read-ahead in the FS page cache is a win; the solution is for the thing doing the scan to make bigger reads over more pages.

For what it's worth, these very same problems have been endemic in all UNIX databases for a very long time. Ingres was fighting these same things in the '80s. It's what led to O_DIRECT existing at all, and O_DATASYNC, but the latter never worked as well as people had hoped.

Dirty pages, faster writing and fsync

Posted Apr 4, 2014 8:32 UTC (Fri) by dlang (subscriber, #313) [Link]

> Even if fsync is limited to the single file on the FD, it won't guarantee ordering of writes, or that only the blocks of interest are written.

If you can't do

write(); fsync(); write()

and be guaranteed that all the blocks of the first write will be on disk before any of the blocks of the second write, then fsync is broken.

What is wrong with ext3 is that when you do the fsync(), not only the dirty blocks for this FD but ALL dirty blocks are written out before the fsync returns. And if other processes continue to dirty blocks while fsync is running, those get written out as well.

for other filesystems (including ext2 and ext4), only the blocks for the one FD are forced to be written before fsync returns.

Dirty pages, faster writing and fsync

Posted Apr 4, 2014 11:15 UTC (Fri) by etienne (guest, #25256) [Link]

> for other filesystems (including ext2 and ext4), only the blocks for the one FD are forced to be written before fsync returns.

The problem is not really fsync()ing the content of the file: you also want the metadata of that file to be on disk (so that you access the right data blocks after a crash), and that is a can of worms: other files can share the same disk block to store their own metadata, and those files may already have modified their own metadata...
Then you add the problem that the data content of a block can be modified between the request to write it to disk and it being physically written (done through DMA, not completely under CPU control), and you do not want to copy too many pages for performance reasons.

Dirty pages, faster writing and fsync

Posted Apr 4, 2014 19:00 UTC (Fri) by dlang (subscriber, #313) [Link]

That would only extend the data to be synced to include the directory data. But on ext3, fsync() isn't finished until it writes out all dirty data for all files in all directories.

Dirty pages, faster writing and fsync

Posted Apr 7, 2014 11:13 UTC (Mon) by etienne (guest, #25256) [Link]

You want to sync the directory data, but also the parent directory recursively (because the directory data may have moved on disk, perhaps because it is now bigger).
If you sync a directory's data, you had better sync its children too, to avoid invalid metadata on disk and a non-bootable system after a crash.
Then, for the general case, you may want to sync the directory data of any other link to this file.
So you want the "fsync algorithm" to know whether it is syncing data or metadata.
You could also change the meaning of fsync() to sync only into the filesystem journal, assuming the journal will be replayed before that file is read after a crash (hope you did not fsync("vmlinuz"); no bootloader will replay the journal at that point).

Dirty pages, faster writing and fsync

Posted Apr 7, 2014 19:01 UTC (Mon) by dlang (subscriber, #313) [Link]

we can argue over the details all week, I'm not an expert here.

But the fact is that every filesystem except ext3 is able to do a fsync without syncing all pending data on the filesystem, and this has been acknowledged as a significant problem by the ext developers. They made sure that ext4 did not suffer the same problem.

Dirty pages, faster writing and fsync

Posted Mar 26, 2014 20:08 UTC (Wed) by jhhaller (subscriber, #56103) [Link]

I agree that coalescing writes has advantages, but when a single fd has written 2-3GB to the buffer cache and it's only memory pressure forcing those pages to be written, we are long past the point where write coalescing is useful. In the case of PostgreSQL, the pages should be written relatively quickly, as fsync is coming, and there is no point waiting for it. Linux doesn't behave well during either an explicit fsync or an implicit one caused by memory pressure. As another example, mythtv calls fsync once per second while recording TV, just to avoid having huge file operations delayed when memory pressure causes an implicit sync operation.

Having too many contiguous blocks written at once is another problem, as the drive will be busy writing all those blocks for as long as it takes. Once the write exceeds the amount in one cylinder (not that we can really tell the geometry), it has to seek anyway, and might as well let something else use the disk. Otherwise we have the storage equivalent of bufferbloat, where high priority writes get backed up behind a huge transfer.

Dirty pages, faster writing and fsync

Posted Mar 27, 2014 10:45 UTC (Thu) by dgm (subscriber, #49227) [Link]

As with bufferbloat, one approach could be to measure the buffer cache in terms of the time it takes to write it back, instead of bytes and megabytes.

As others have suggested, having multiple queues mapped somehow to ionice levels could be of help too.

Dirty pages, faster writing and fsync

Posted Mar 27, 2014 9:16 UTC (Thu) by iq-0 (subscriber, #36655) [Link]

The trick is performing I/O in the background without saturating the I/O queue (you really want it to stay in the background). Efficiency only becomes an issue when there is contention.
So sure, you write some stuff that didn't have to be written to disk. And sure, it could be done more efficiently later on. But the benefit is that it's no longer dirty in memory, and memory is our scarce resource in this scenario (as long as I/O is uncontended; once it is contended, I/O becomes the scarce resource).


Copyright © 2017, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds