LWN: Comments on "PostgreSQL pain points" https://lwn.net/Articles/591723/ This is a special feed containing comments posted to the individual LWN article titled "PostgreSQL pain points". en-us Sun, 05 Oct 2025 10:10:08 +0000 Sun, 05 Oct 2025 10:10:08 +0000 https://www.rssboard.org/rss-specification lwn@lwn.net virtualization https://lwn.net/Articles/594664/ https://lwn.net/Articles/594664/ MortenSickel <div class="FormattedComment"> Then I think you have not looked well enough into it. As was mentioned earlier, rsync of the database files in combination with the wal logs gives you a simple backup that is immediately usable. At my earlier job, we were heavy users of postgres and used that as our main backup system. (It also makes setting up replication a snap)<br> So, no raw partitions, please - unless rsync and other file management tools get patched to read them... :-P<br> <p> On the other hand, for any database of a certain size and importance, you probably want to have a separate partition for the database files, so it could be possible to advise using a certain file system with certain parameters to get optimal performance.<br> </div> Mon, 14 Apr 2014 07:41:46 +0000 PostgreSQL pain points https://lwn.net/Articles/594148/ https://lwn.net/Articles/594148/ bcopeland <div class="FormattedComment"> <font class="QuotedText">&gt;Also good guesses for semi-automatic advising could work in most cases for most programs by looking at the flags/modes with which a file is opened in fopen or even open as well as the pattern of IO operations on the file like UNIX 25-40 years ago.</font><br> <p> There was a paper a while ago where some researchers looked at the filenames for such hints and got quite good results (e.g. an .mpg is mostly read-only and streamed, a .sqlite file may have seeky writes, and so on.)
Cute hack.<br> </div> Wed, 09 Apr 2014 21:04:51 +0000 PostgreSQL pain points https://lwn.net/Articles/593920/ https://lwn.net/Articles/593920/ walex <p>The funny thing about this report is that the Postgres researcher had made almost the same type of complaints about UNIX and Ingres in 1981, 33 years ago:</p> <pre>%A M. R. Stonebraker %T Operating system support for database management %J CACM %V 24 %D JUL 1981 %P 412-418 %K data base</pre> <p>Another funny detail is that adaptive advising about access patterns (adaptive read-ahead, adaptive write-behind) was part of the original UNIX design on the PDP-11, around 35-40 years ago.</p> <p>Also good guesses for semi-automatic advising could work in most cases for most programs by looking at the flags/modes with which a file is opened in <tt>fopen</tt> or even <tt>open</tt> as well as the pattern of IO operations on the file, like UNIX 25-40 years ago.</p> Tue, 08 Apr 2014 20:14:20 +0000 Dirty pages, faster writing and fsync https://lwn.net/Articles/593660/ https://lwn.net/Articles/593660/ dlang <div class="FormattedComment"> we can argue over the details all week, I'm not an expert here.<br> <p> But the fact is that every filesystem except ext3 is able to do an fsync without syncing all pending data on the filesystem, and this has been acknowledged as a significant problem by the ext developers.
They made sure that ext4 did not suffer the same problem.<br> </div> Mon, 07 Apr 2014 19:01:03 +0000 PostgreSQL pain points https://lwn.net/Articles/593569/ https://lwn.net/Articles/593569/ nix <div class="FormattedComment"> He also missed the point -- that these are problems which *cannot* be solved by the PostgreSQL developers alone; not without implementing what amounts to an operating system inside the database (as some other big database vendors do, and they all hate it).<br> </div> Mon, 07 Apr 2014 13:22:21 +0000 Dirty pages, faster writing and fsync https://lwn.net/Articles/593562/ https://lwn.net/Articles/593562/ etienne <div class="FormattedComment"> You want to sync the directory data, but also the parent directory recursively (because the directory data may have moved on disk, maybe because it is now bigger).<br> If you sync a directory's data, you had better sync its children too, so as not to have invalid metadata on disk and a non-bootable system after a crash.<br> Then, for the general case, you may want to sync the directory data of any other link to this file.<br> So you want the "fsync algorithm" to know whether it is syncing data or metadata.<br> You can also change the meaning of fsync() to sync only into the filesystem journal, assuming you will replay the journal before trying to read that file after a crash (hope you did not fsync("vmlinuz"): no bootloader will replay the journal at that time).<br> </div> Mon, 07 Apr 2014 11:13:17 +0000 PostgreSQL pain points https://lwn.net/Articles/593335/ https://lwn.net/Articles/593335/ intgr <div class="FormattedComment"> Thanks, clearly you know better than the people developing the DBMSes in question.<br> <p> </div> Sat, 05 Apr 2014 12:26:50 +0000 PostgreSQL pain points https://lwn.net/Articles/593229/ https://lwn.net/Articles/593229/ XTF <div class="FormattedComment"> A DBMS either already manages most of these things or doesn't need them.
<br> </div> Fri, 04 Apr 2014 22:22:11 +0000 Dirty pages, faster writing and fsync https://lwn.net/Articles/593172/ https://lwn.net/Articles/593172/ dlang <div class="FormattedComment"> that would only extend the data to be synced to the directory data. But on ext3, fsync() isn't finished until it writes out all dirty data for all files in all directories.<br> </div> Fri, 04 Apr 2014 19:00:26 +0000 PostgreSQL pain points https://lwn.net/Articles/593054/ https://lwn.net/Articles/593054/ intgr <div class="FormattedComment"> Because going with the "raw partition" approach amounts to duplicating lots of effort that already goes into filesystems and VFS: block allocation, free space tracking, fragmentation avoidance, adaptive readahead, durability guarantees in various storage stacks, TRIM on SSDs, page cache. I'm sure I forgot some. And particularly doing all that in a scalable fashion. For instance, PostgreSQL's own shared_buffers has some scalability limits that have not been addressed; the ability to use Linux's page cache greatly improves its performance on large machines.<br> <p> <font class="QuotedText">&gt; How many files does Postgres use anyway?</font><br> <p> Postgres uses the file system more extensively (I think) than most databases; here's a short overview of the different kinds of files: <a href="http://www.postgresql.org/docs/current/static/storage-file-layout.html">http://www.postgresql.org/docs/current/static/storage-fil...</a><br> <p> </div> Fri, 04 Apr 2014 12:12:42 +0000 Dirty pages, faster writing and fsync https://lwn.net/Articles/593055/ https://lwn.net/Articles/593055/ etienne <div class="FormattedComment"> <font class="QuotedText">&gt; for other filesystems (including ext2 and ext4), only the blocks for the one FD are forced to be written before fsync returns.</font><br> <p> The problem is not really fsync()ing the content of the file; you also want the metadata of that file to be on disk (so that you access the right data blocks after a crash)
- and it is a can of worms: other files can share the same disk block to store their own metadata, and those files may already have modified their own metadata...<br> Then you add the problem that you can modify the data content of a block between the request to write it to disk and it being physically written to disk (done through DMA, not completely under CPU control), and you do not want to copy too many pages for performance reasons.<br> </div> Fri, 04 Apr 2014 11:15:49 +0000 PostgreSQL pain points https://lwn.net/Articles/593053/ https://lwn.net/Articles/593053/ XTF <div class="FormattedComment"> Why would you need to reimplement a FS? How many files does Postgres use anyway? <br> <p> </div> Fri, 04 Apr 2014 10:52:27 +0000 Asynchronous commit https://lwn.net/Articles/593052/ https://lwn.net/Articles/593052/ XTF <div class="FormattedComment"> Kernel (FS) devs appear to be allergic to such a sound solution. I've requested it before (O_ATOMIC), they said it was too hard to do.<br> </div> Fri, 04 Apr 2014 10:49:30 +0000 Dirty pages, faster writing and fsync https://lwn.net/Articles/593048/ https://lwn.net/Articles/593048/ dlang <div class="FormattedComment"> <font class="QuotedText">&gt; Even if fsync is limited to the single file on the FD, it won't guarantee ordering of writes, or that only the blocks of interest are written.</font><br> <p> If you can't do<br> <p> write(); fsync(); write()<br> <p> and be guaranteed that all the blocks of the first write will be on disk before any of the blocks of the second write, then fsync is broken.<br> <p> What is wrong with ext3 is that when you do the fsync(), not only are the dirty blocks for this FD written out, ALL dirty blocks are written out before the fsync returns.
And if other processes continue to dirty blocks while fsync is running, they get written out as well.<br> <p> For other filesystems (including ext2 and ext4), only the blocks for the one FD are forced to be written before fsync returns.<br> </div> Fri, 04 Apr 2014 08:32:18 +0000 Dirty pages, faster writing and fsync https://lwn.net/Articles/593046/ https://lwn.net/Articles/593046/ dbrower <div class="FormattedComment"> Even if fsync is limited to the single file on the FD, it won't guarantee ordering of writes, or that only the blocks of interest are written.<br> <p> A better approach might be O_DIRECT and async i/o.<br> <p> If some things work better with non-O_DIRECT i/o, then the calling code isn't doing a very good job of planning its i/o and managing the buffer cache. A typical case for this might be a full-table-scan where read-ahead in the FS page cache is a win; the solution is for the thing doing the scan to make bigger reads over more pages.<br> <p> For what it's worth, these very same problems have been endemic in all UNIX databases for a very long time. Ingres was fighting these same things in the 80's. It's what led to O_DIRECT existing at all, and O_DATASYNC, but the latter never worked as well as people had hoped.<br> <p> <p> </div> Fri, 04 Apr 2014 07:17:47 +0000 PostgreSQL pain points https://lwn.net/Articles/593045/ https://lwn.net/Articles/593045/ palmer_eldritch <div class="FormattedComment"> I think you missed the joke here...<br> </div> Fri, 04 Apr 2014 06:10:19 +0000 PostgreSQL pain points https://lwn.net/Articles/593017/ https://lwn.net/Articles/593017/ rodgerd <div class="FormattedComment"> And instead you get the problem of your data being quietly corrupted.<br> </div> Thu, 03 Apr 2014 20:22:29 +0000 PostgreSQL pain points https://lwn.net/Articles/592994/ https://lwn.net/Articles/592994/ mathstuf <div class="FormattedComment"> MySQL avoids the problem by just dropping your data instead ;) .
You prefer fast over correctness, right?<br> </div> Thu, 03 Apr 2014 18:32:34 +0000 PostgreSQL pain points https://lwn.net/Articles/592991/ https://lwn.net/Articles/592991/ dlang <div class="FormattedComment"> you act as if MySQL doesn't run into the same problems. I think you just don't understand the details well enough.<br> </div> Thu, 03 Apr 2014 18:31:26 +0000 PostgreSQL pain points https://lwn.net/Articles/592964/ https://lwn.net/Articles/592964/ olgeni <div class="FormattedComment"> Use FreeBSD and PostgreSQL. Problems solved, and you also get a proper database for free.<br> </div> Thu, 03 Apr 2014 14:24:30 +0000 PostgreSQL pain points https://lwn.net/Articles/592903/ https://lwn.net/Articles/592903/ Pawlerson <div class="FormattedComment"> Use Linux and MySQL. Problems solved. If PostgreSQL has problems its developers should solve them. If they care..<br> </div> Thu, 03 Apr 2014 09:54:56 +0000 PostgreSQL pain points https://lwn.net/Articles/592482/ https://lwn.net/Articles/592482/ kleptog <div class="FormattedComment"> That might be an idea, except at the moment we're still at the API phase: what information needs to be communicated and how? Stuff mentioned in the article like: order dependencies between blocks, indicating that the kernel cache is out of date, etc. Whether it's eventually implemented in the VFS or as ioctl()s on a fancy new filesystem is really beside the point.<br> <p> I wonder if finding a way to expose parts of the JBD (Journalling Block Device) to user space might help with any of this.<br> </div> Sat, 29 Mar 2014 13:56:53 +0000 virtualization https://lwn.net/Articles/592474/ https://lwn.net/Articles/592474/ fandingo <div class="FormattedComment"> <font class="QuotedText">&gt; How is the raw partition approach worse here?</font><br> <p> I'm not sure a database should be implementing operations necessary for ATA TRIM. 
<br> </div> Sat, 29 Mar 2014 03:37:07 +0000 virtualization https://lwn.net/Articles/592458/ https://lwn.net/Articles/592458/ kleptog <div class="FormattedComment"> <font class="QuotedText">&gt; &gt; &gt; * Clobbering all other IO-using software on the same machine.</font><br> <p> <font class="QuotedText">&gt; &gt; For the second, as a userspace program you don't have a good view of what the rest of the system is doing, hence you might be interfering with other processes. The kernel has the overview.</font><br> <p> <font class="QuotedText">&gt; How is the raw partition approach worse here? I would intuitively think it makes things better: less sharing.</font><br> <p> I think it depends on what your goals are. If your goal is to make the absolutely fastest database server possible, then you'd probably want to use raw access on a system with nothing else running.<br> <p> If your goal is to make a database server that is broadly useful and runs efficiently on a wide variety of systems, then asking the kernel to do its job is the better idea.<br> <p> PostgreSQL tends to the latter. The gains you can get from raw access are simply not worth the effort and would make PostgreSQL much harder to deploy. A database server that only works well when it's got the machine to itself is a PITA in many situations.<br> </div> Fri, 28 Mar 2014 22:24:05 +0000 PostgreSQL pain points https://lwn.net/Articles/592455/ https://lwn.net/Articles/592455/ bronson <div class="FormattedComment"> ... and a Postgres dev retorted that FreeBSD didn't exhibit any of these problems.<br> </div> Fri, 28 Mar 2014 21:25:39 +0000 PostgreSQL pain points https://lwn.net/Articles/592374/ https://lwn.net/Articles/592374/ marcH <div class="FormattedComment"> Yet there is a *variety* of filesystems developed, even just on Linux.
How come?<br> <p> Wild, poorly educated guess: in an ideal world, shouldn't databases be hosted on a trimmed-down "semi-filesystem" which has only management features and none of the duplicated performance stuff that gets in the way? It could be called, say, LLVM++ for instance?<br> <p> <p> </div> Fri, 28 Mar 2014 08:31:05 +0000 PostgreSQL pain points https://lwn.net/Articles/592360/ https://lwn.net/Articles/592360/ rodgerd <div class="FormattedComment"> Not to mention it's a huge pain in the arse as a sysadmin. My experience is predominantly with Oracle's raw filesystem (ASM), and the lack of standard tooling to manage the filesystem... it's just not a good thing. To the point where our DBAs will avoid ASM unless they're building a RAC.<br> <p> Probably the most annoying experience was discovering that ASM doesn't do any kind of sanity check when starting filesystems: after a SAN change, the SAN operator wired some LUNs back to the wrong boxes. With an AAAABAAAA on one server and BBBBABBBB on another, LVM would have simply refused to start the volume group. ASM started, and then Oracle would silently coredump every time it tried to work with data on the misplaced disk. Such are the horrors of re-inventing decades of volume management and filesystems.<br> </div> Fri, 28 Mar 2014 01:02:18 +0000 sync() starves other reader/writers https://lwn.net/Articles/592333/ https://lwn.net/Articles/592333/ dlang <div class="FormattedComment"> given that your transaction is likely to affect multiple database pages, #2 isn't viable, you can never guarantee that all or none of the transaction will be visible after a crash.
That's why the databases do #1<br> </div> Thu, 27 Mar 2014 20:05:15 +0000 sync() starves other reader/writers https://lwn.net/Articles/592329/ https://lwn.net/Articles/592329/ seanyoung <div class="FormattedComment"> Indeed, there are two solutions to this:<br> <p> 1) write the journal before you do anything (requires writing the operations twice)<br> 2) Before modifying a database page, ensure that any scheduled i/o has completed. If not, either copy the page or move on to other pending work.<br> <p> So ideally you want completion information at page level for non-fsync writes.<br> </div> Thu, 27 Mar 2014 19:44:19 +0000 virtualization https://lwn.net/Articles/592327/ https://lwn.net/Articles/592327/ marcH <div class="FormattedComment"> <font class="QuotedText">&gt; The combination gives you a backup</font><br> <p> Heh, that was missing.<br> <p> I am still not convinced that rsync is the ultimate database backup tool. As much as I love rsync, it surely does not have the patented exclusivity of incremental copying/backup techniques.<br> <p> </div> Thu, 27 Mar 2014 19:36:14 +0000 virtualization https://lwn.net/Articles/592326/ https://lwn.net/Articles/592326/ marcH <div class="FormattedComment"> <font class="QuotedText">&gt; &gt; * Clobbering all other IO-using software on the same machine.</font><br> <p> <font class="QuotedText">&gt; For the second, as a userspace program you don't have a good view of what the rest of the system is doing, hence you might be interfering with other processes. The kernel has the overview.</font><br> <p> How is the raw partition approach worse here?
I would intuitively think it makes things better: less sharing.<br> <p> Anyway: any database of serious size runs on dedicated or practically dedicated hardware, doesn't it?<br> <p> </div> Thu, 27 Mar 2014 19:32:06 +0000 virtualization https://lwn.net/Articles/592314/ https://lwn.net/Articles/592314/ kleptog <div class="FormattedComment"> <font class="QuotedText">&gt; &gt; * Can no longer use standard tools like "rsync" with database files.</font><br> <p> <font class="QuotedText">&gt; Well, you can't use that on a live database anyway, so this point looks moot. Unless maybe you rely on a filesystem with snapshotting which is... not far from duplicating a database feature! Same pattern again.</font><br> <p> You can. It's useful for both backup and replication. Basically you can use rsync to quickly update your backup image. And then you take a copy of the WAL logs. The combination gives you a backup. If you have a snapshotting filesystem you can indeed achieve similar effects.<br> <p> <font class="QuotedText">&gt; &gt; * Can't keep up with hardware advances in a timely fashion.</font><br> <font class="QuotedText">&gt; &gt; * Clobbering all other IO-using software on the same machine.</font><br> <p> <font class="QuotedText">&gt; Sorry I don't get these two. Care to elaborate?</font><br> <p> For the first, consider the effects the rise of SSD is having on the Linux VFS. That would need to be replicated in the database. For the second, as a userspace program you don't have a good view of what the rest of the system is doing, hence you might be interfering with other processes.
The kernel has the overview.<br> <p> It's a feature that a database doesn't assume it's the only program on a machine.<br> </div> Thu, 27 Mar 2014 18:58:08 +0000 sync() starves other reader/writers https://lwn.net/Articles/592302/ https://lwn.net/Articles/592302/ dlang <div class="FormattedComment"> Plus you want to make sure none of the database files get written ahead of the corresponding WAL data<br> </div> Thu, 27 Mar 2014 18:06:02 +0000 PostgreSQL pain points https://lwn.net/Articles/592284/ https://lwn.net/Articles/592284/ Shewmaker Regarding double buffering, mapping files from hundreds of processes being slow, and madvise--LLNL has developed a new mmap implementation (loadable as a module) that is optimized for data-intensive applications. They also saw slowdowns for many processes using the standard mmap, but theirs scales up to hundreds of processes. Papers and code are available. This is going into production on their clusters. <p> See <a href="https://computation-rnd.llnl.gov/perma/di-mmap.php">DI-MMAP</a>. Thu, 27 Mar 2014 16:48:22 +0000 Asynchronous commit https://lwn.net/Articles/592280/ https://lwn.net/Articles/592280/ zblaxell <div class="FormattedComment"> I thought I covered that as undesirable option #3 ("put the filesystem in a database server that implements asynchronous commit"). ;)<br> <p> The RDBMS state of the art has progressed over the last 20 years, providing a wide variety of reasonable data integrity and performance tradeoffs. It would be nice if filesystems could catch up. 
fsync() is a sledgehammer.<br> </div> Thu, 27 Mar 2014 16:36:07 +0000 Dirty page caching not mostly harmless https://lwn.net/Articles/592248/ https://lwn.net/Articles/592248/ zblaxell <div class="FormattedComment"> <font class="QuotedText">&gt;&gt; if a process is continuing to make more dirty pages faster than they can be cleaned, such writes can last indefinitely.</font><br> <font class="QuotedText">&gt; well, that's true no matter what your limit is.</font><br> <p> All the more reason to have a plan for this when it happens. Throttle the process dirtying pages so it can't get too far ahead of the block device. Preempt large writes with higher priority I/O (or even equal priority) from other processes.<br> <p> This can already be done half a dozen ways, but ideally the block layer would default to a more reasonable behavior by itself.<br> </div> Thu, 27 Mar 2014 16:25:41 +0000 Dirty page caching not mostly harmless https://lwn.net/Articles/592249/ https://lwn.net/Articles/592249/ zblaxell <div class="FormattedComment"> <font class="QuotedText">&gt; It used to be that the spinning rust was so slow compared to the rest of the system that it was worth just about any computational effort to optimize it</font><br> <p> Computational effort is one thing. RAM cost is another, and it frequently dominates.<br> <p> Linux has worse problems with slow devices than with fast ones. SSDs are raising the high end, but at the same time enormous USB sticks and slow "green" rust drives are lowering the low end. It's annoying to have your fast SSD be idle 10% of the time because the filesystem can't keep up with it, but it's much more annoying to have most of your RAM be unusable for hours because someone plugged the slowest USB stick money can buy into your server, then tried to copy a few dozen GB of data onto it.<br> <p> The solution is the same at both extremes--limit the number of dirty pages and stop delaying writes so much. 
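[Editor's note: the throttling described above can already be approximated with the kernel's writeback sysctls; a sketch of a configuration fragment, with illustrative values rather than recommendations:]

```
# /etc/sysctl.d/99-writeback.conf (values illustrative, not recommendations)
vm.dirty_bytes = 67108864             # hard limit: writers block at 64 MiB dirty
vm.dirty_background_bytes = 16777216  # background writeback starts at 16 MiB
vm.dirty_expire_centisecs = 500       # pages considered old after 5 s
vm.dirty_writeback_centisecs = 100    # flusher wakes every 1 s
```

Byte-based limits are still global, though: they cannot give a slow USB stick a smaller budget than a fast SSD, which is the per-device behavior being asked for here.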
It's the spinning rust in the middle of the range that still needs big dirty page caches and big write delays to form them--and even then, there are pretty low limits to the useful data size for optimization. Fast SSDs don't need optimization and slow "green" drives are so slow that the first page write alone will provide plenty of time for an optimizable set of dirty pages to accumulate.<br> </div> Thu, 27 Mar 2014 16:25:29 +0000 PostgreSQL pain points https://lwn.net/Articles/592247/ https://lwn.net/Articles/592247/ smitty_one_each <div class="FormattedComment"> "All went swimmingly until a kernel dev jokingly suggested mySQL as a work-around. . ."<br> </div> Thu, 27 Mar 2014 15:00:10 +0000 sync() starves other reader/writers https://lwn.net/Articles/592197/ https://lwn.net/Articles/592197/ seanyoung <div class="FormattedComment"> Thank you for that.<br> <p> The problem postgresql has is that you want the wal/journal to be written asap (so with fsync). You want the database files to be written WITHOUT fsync, but you do want to know when they complete/are persisted, so you can safely discard old wal/journal files, for example.<br> <p> So what I was trying to avoid was calling fsync(). sync_file_range() just ends up calling fsync in the file system driver; it is no different, as you have done in your code.<br> <p> I'm not sure there is a method for this, although the aio functions seem to provide an API for this.<br> </div> Thu, 27 Mar 2014 10:52:54 +0000 Dirty pages, faster writing and fsync https://lwn.net/Articles/592196/ https://lwn.net/Articles/592196/ dgm <div class="FormattedComment"> As with bufferbloat, one approach could be to measure the buffer cache in terms of the time it takes to write it back, instead of bytes and megabytes.<br> <p> As others have suggested, having multiple queues mapped somehow to ionice levels could be of help too.<br> </div> Thu, 27 Mar 2014 10:45:20 +0000