Dirty pages, faster writing and fsync
Posted Mar 26, 2014 14:46 UTC (Wed) by jhhaller (guest, #56103)Parent article: PostgreSQL pain points
Posted Mar 26, 2014 17:59 UTC (Wed)
by roblucid (guest, #48964)
[Link] (10 responses)
One past problem with fsync() has been that it does NOT sync only the file in question, but all dirty blocks for the whole filesystem, which is likely NOT what an RDBMS wants, yet is the most feasible implementation of the call. If the FS implementations can't fix that behaviour, how is any other file attribute going to help the situation?
Posted Mar 26, 2014 18:45 UTC (Wed)
by dlang (guest, #313)
[Link] (6 responses)
As I understand it, this is an ext3 bug; the other filesystems don't behave this way.
Posted Apr 4, 2014 7:17 UTC (Fri)
by dbrower (guest, #96396)
[Link] (5 responses)
A better approach might be O_DIRECT and async i/o.
If some things work better with non-O_DIRECT i/o, then the calling code isn't doing a very good job of planning its i/o and managing the buffer cache. A typical case for this might be a full-table scan where read-ahead in the FS page cache is a win; the solution is for the thing doing the scan to make bigger reads over more pages.
For what it's worth, these very same problems have been endemic in all UNIX databases for a very long time. Ingres was fighting these same things in the 80s. It's what led to O_DIRECT existing at all, and O_DSYNC, but the latter never worked as well as people had hoped.
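A rough sketch of the "bigger reads" idea, in Python for brevity. The names scan_file and CHUNK are made up for illustration. It assumes Linux, where os.O_DIRECT exists, and falls back to ordinary buffered reads if the filesystem refuses the flag at open time. O_DIRECT needs an aligned buffer, so an anonymous mmap (page-aligned) is used as the read buffer.

```python
import mmap
import os
import tempfile

CHUNK = 1 << 20  # hypothetical read size: 1 MiB, a multiple of the page size


def scan_file(path, chunk=CHUNK):
    """Read a whole file in large reads and return the number of bytes read."""
    direct = getattr(os, "O_DIRECT", 0)  # 0 where the platform lacks O_DIRECT
    try:
        fd = os.open(path, os.O_RDONLY | direct)
    except OSError:
        fd = os.open(path, os.O_RDONLY)  # filesystem refused O_DIRECT
    buf = mmap.mmap(-1, chunk)  # anonymous mapping, page-aligned
    total = 0
    try:
        size = os.fstat(fd).st_size
        while total < size:
            n = os.readv(fd, [buf])  # one large read into the aligned buffer
            if n == 0:  # file shrank underneath us
                break
            total += n  # a real scan would process buf[:n] here
    finally:
        buf.close()
        os.close(fd)
    return total
```

The chunk size is the knob here: with the page cache bypassed there is no kernel read-ahead, so the scanner has to ask for enough data per call itself.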
Posted Apr 4, 2014 8:32 UTC (Fri)
by dlang (guest, #313)
[Link] (4 responses)
If you can't do
write(); fsync(); write()
and be guaranteed that all the blocks of the first write will be on disk before any of the blocks of the second write, then fsync is broken.
What is wrong with ext3 is that when you do the fsync(), not only are the dirty blocks for this FD written out, ALL dirty blocks are written out before the fsync returns. And if other processes continue to dirty blocks while fsync is running, they get written out as well.
For other filesystems (including ext2 and ext4), only the blocks for the one FD are forced to be written before fsync returns.
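For reference, the write(); fsync(); write() pattern above looks roughly like this through Python's os module, which wraps the same system calls. The function name append_in_order and the file name in the usage are made up for illustration.

```python
import os
import tempfile


def write_all(fd, data):
    """Loop until every byte is handed to the kernel (write may be partial)."""
    view = memoryview(data)
    while view:
        view = view[os.write(fd, view):]


def append_in_order(path, first, second):
    """Append `first`, wait for it to reach stable storage, then append `second`."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        write_all(fd, first)
        os.fsync(fd)  # returns only once the kernel reports `first` as written
        write_all(fd, second)
        os.fsync(fd)
    finally:
        os.close(fd)
```

The application only sees the ordering guarantee. How many other dirty blocks get dragged along by each fsync() is decided by the filesystem, which is the ext3 complaint.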
Posted Apr 4, 2014 11:15 UTC (Fri)
by etienne (guest, #25256)
[Link] (3 responses)
The problem is not really fsync()ing the content of the file; you also want the metadata of that file to be on disk (so that you access the right data blocks after a crash), and that is a can of worms: other files can share the same disk block to store their own metadata, and those files may already have modified their own metadata...
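Seen from the application side, a common way to ask for both the file and its directory entry to be on disk is to fsync() the file and then fsync() the parent directory. A minimal sketch, assuming Linux, where a directory can be opened read-only and passed to fsync(); create_durably is a made-up name.

```python
import os
import tempfile


def create_durably(path, data):
    """Create `path` with `data`, then sync the file and its parent directory."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:
            view = view[os.write(fd, view):]  # handle partial writes
        os.fsync(fd)  # file data plus the file's own metadata
    finally:
        os.close(fd)
    parent = os.path.dirname(os.path.abspath(path))
    dirfd = os.open(parent, os.O_RDONLY)
    try:
        os.fsync(dirfd)  # the directory entry that names the new file
    finally:
        os.close(dirfd)
```

This says nothing about what else shares those metadata blocks on disk; that part stays the filesystem's problem.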
Posted Apr 4, 2014 19:00 UTC (Fri)
by dlang (guest, #313)
[Link] (2 responses)
Posted Apr 7, 2014 11:13 UTC (Mon)
by etienne (guest, #25256)
[Link] (1 responses)
Posted Apr 7, 2014 19:01 UTC (Mon)
by dlang (guest, #313)
[Link]
But the fact is that every filesystem except ext3 is able to do a fsync without syncing all pending data on the filesystem, and this has been acknowledged as a significant problem by the ext developers. They made sure that ext4 did not suffer the same problem.
Posted Mar 26, 2014 20:08 UTC (Wed)
by jhhaller (guest, #56103)
[Link] (1 responses)
Having too many contiguous blocks written at once is another problem, as the drive will be busy writing all those blocks for as long as it takes. Once the write exceeds the amount in one cylinder (not that we can really tell the geometry), it has to seek anyway, and might as well let something else use the disk. Otherwise we have the storage equivalent of bufferbloat, where high priority writes get backed up behind a huge transfer.
Posted Mar 27, 2014 10:45 UTC (Thu)
by dgm (subscriber, #49227)
[Link]
As others have suggested, having multiple queues mapped somehow to ionice levels could be of help too.
Posted Mar 27, 2014 9:16 UTC (Thu)
by iq-0 (subscriber, #36655)
[Link]
Then you add the problem that you can modify the data content of a block between the request to write it to disk and it being physically written to disk (done through DMA, not completely under CPU control), and you do not want to copy too many pages, for performance reasons.
If you sync a directory's data, you had better sync its children too, so that you do not end up with invalid metadata on disk and a non-bootable system after a crash.
Then, for the general case, you may want to sync the directory data of any other link to this file.
So you want the "fsync algorithm" to know if it is syncing data or metadata.
You can also change the meaning of fsync() to sync only in the filesystem journal, assuming you will replay the journal before trying to read that file after a crash (hope you did not fsync("vmlinuz"), no bootloader will replay the journal at that time).
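At the system call level, the application can already say part of this: fdatasync() asks only for the data and the metadata needed to read it back (such as the file size), while fsync() also covers the rest of the inode, timestamps included. A small sketch of choosing between the two; the function names are made up and the preallocated-file layout is an assumption.

```python
import os
import tempfile


def overwrite_in_place(fd, offset, data):
    """Overwrite bytes inside an already-allocated region of the file."""
    os.pwrite(fd, data, offset)
    os.fdatasync(fd)  # data only; non-essential metadata may be flushed later


def append_record(fd, data):
    """Append to the file and sync data and all inode metadata."""
    os.lseek(fd, 0, os.SEEK_END)
    os.write(fd, data)
    os.fsync(fd)
```

How much work each call actually saves depends on the filesystem and its journal mode.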
So sure, you write some stuff that didn't have to be written to disk, and sure, it could be done more efficiently later on. But the benefit is that it's no longer dirty in memory, and memory is our scarce resource in this scenario (as long as I/O is uncontended; once it is contended, I/O becomes the scarce resource).
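The comment above is about kernel writeback, but an application can apply the same trade-off itself by flushing every so many bytes, so that the amount of its own data sitting dirty in memory stays bounded. This is an application-level analogue, not what the kernel does; the function name and the flush_every threshold are arbitrary choices for illustration.

```python
import os
import tempfile


def write_with_bounded_dirty(fd, chunks, flush_every=8 << 20):
    """Write chunks, flushing whenever about flush_every bytes are unflushed.

    Returns the number of flushes issued.
    """
    dirty = 0
    flushes = 0
    for chunk in chunks:
        view = memoryview(chunk)
        while view:
            view = view[os.write(fd, view):]  # handle partial writes
        dirty += len(chunk)
        if dirty >= flush_every:
            os.fdatasync(fd)  # trade some extra I/O now for fewer dirty pages
            dirty = 0
            flushes += 1
    if dirty:
        os.fdatasync(fd)  # flush the tail
        flushes += 1
    return flushes
```

A smaller flush_every keeps less dirty data in memory at the cost of more, smaller writes, which is the same trade-off described above.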