LWN: Comments on "PostgreSQL pain points" https://lwn.net/Articles/591723/ This is a special feed containing comments posted to the individual LWN article titled "PostgreSQL pain points". en-us Sun, 05 Oct 2025 10:10:08 +0000 Sun, 05 Oct 2025 10:10:08 +0000 https://www.rssboard.org/rss-specification lwn@lwn.net virtualization https://lwn.net/Articles/594664/ https://lwn.net/Articles/594664/ MortenSickel <div class="FormattedComment"> Then I think you have not looked well enough into it. As was mentioned earlier, rsync of the database files in combination with the wal logs gives you a simple backup that is immediately usable. At my earlier job, we were heavy users of postgres and used that as our main backup system. (It also makes setting up replication a snap)<br> So, no raw partitions, please - unless rsync and other file management tools get patched to read them... :-P<br> <p> On the other hand, for any database of a certain size and importance, you probably want to have a separate partition for the database files, so it could be possible to advise using a certain file system with certain parameters to get optimal performance.<br> </div> Mon, 14 Apr 2014 07:41:46 +0000 PostgreSQL pain points https://lwn.net/Articles/594148/ https://lwn.net/Articles/594148/ bcopeland <div class="FormattedComment"> <font class="QuotedText">&gt;Also good guesses for semi-automatic advising could work in most cases for most programs by looking at the flags/modes with which a file is opened in fopen or even open as well as the pattern of IO operations on the file like UNIX 25-40 years ago.</font><br> <p> There was a paper a while ago where some researchers looked at the filenames for such hints and got quite good results (e.g. an .mpg is mostly read-only and streamed, a .sqlite file may have seeky writes, and so on.)
Cute hack.<br> </div> Wed, 09 Apr 2014 21:04:51 +0000 PostgreSQL pain points https://lwn.net/Articles/593920/ https://lwn.net/Articles/593920/ walex <p>The funny thing about this report is that the Postgres researcher had made almost the same type of complaints about UNIX and Ingres in 1981, 33 years ago:</p> <pre>%A M. R. Stonebraker %T Operating system support for database management %J CACM %V 24 %D JUL 1981 %P 412-418 %K data base</pre> <p>Another funny detail is that adaptive advising about access patterns (adaptive read-ahead, adaptive write-behind) was part of the original UNIX design on the PDP-11, around 35-40 years ago.</p> <p>Also good guesses for semi-automatic advising could work in most cases for most programs by looking at the flags/modes with which a file is opened in <tt>fopen</tt> or even <tt>open</tt> as well as the pattern of IO operations on the file, like UNIX 25-40 years ago.</p> Tue, 08 Apr 2014 20:14:20 +0000 Dirty pages, faster writing and fsync https://lwn.net/Articles/593660/ https://lwn.net/Articles/593660/ dlang <div class="FormattedComment"> we can argue over the details all week, I'm not an expert here.<br> <p> But the fact is that every filesystem except ext3 is able to do an fsync without syncing all pending data on the filesystem, and this has been acknowledged as a significant problem by the ext developers.
They made sure that ext4 did not suffer the same problem.<br> </div> Mon, 07 Apr 2014 19:01:03 +0000 PostgreSQL pain points https://lwn.net/Articles/593569/ https://lwn.net/Articles/593569/ nix <div class="FormattedComment"> He also missed the point -- that these are problems which *cannot* be solved by the PostgreSQL developers alone; not without implementing what amounts to an operating system inside the database (as some other big database vendors do, and they all hate it).<br> </div> Mon, 07 Apr 2014 13:22:21 +0000 Dirty pages, faster writing and fsync https://lwn.net/Articles/593562/ https://lwn.net/Articles/593562/ etienne <div class="FormattedComment"> You want to sync the directory data, but also the parent directory recursively (because the directory data may have moved on disk, maybe because it is now bigger).<br> If you sync a directory's data, you had better sync its children too, so as not to have invalid metadata on disk and a non-bootable system after a crash.<br> Then, for the general case, you may want to sync the directory data of any other link to this file.<br> So you want the "fsync algorithm" to know whether it is syncing data or metadata.<br> You can also change the meaning of fsync() to sync only into the filesystem journal, assuming you will replay the journal before trying to read that file after a crash (hope you did not fsync("vmlinuz"): no bootloader will replay the journal at that time).<br> </div> Mon, 07 Apr 2014 11:13:17 +0000 PostgreSQL pain points https://lwn.net/Articles/593335/ https://lwn.net/Articles/593335/ intgr <div class="FormattedComment"> Thanks, clearly you know better than the people developing the DBMSes in question.<br> <p> </div> Sat, 05 Apr 2014 12:26:50 +0000 PostgreSQL pain points https://lwn.net/Articles/593229/ https://lwn.net/Articles/593229/ XTF <div class="FormattedComment"> A DBMS either already manages most of these things or doesn't need them.
<br> </div> Fri, 04 Apr 2014 22:22:11 +0000 Dirty pages, faster writing and fsync https://lwn.net/Articles/593172/ https://lwn.net/Articles/593172/ dlang <div class="FormattedComment"> that would only extend the data to be synced to the directory data. But on ext3, fsync() isn't finished until it writes out all dirty data for all files in all directories.<br> </div> Fri, 04 Apr 2014 19:00:26 +0000 PostgreSQL pain points https://lwn.net/Articles/593054/ https://lwn.net/Articles/593054/ intgr <div class="FormattedComment"> Because going with the "raw partition" approach amounts to duplicating lots of effort that already goes into filesystems and VFS: block allocation, free space tracking, fragmentation avoidance, adaptive readahead, durability guarantees in various storage stacks, TRIM on SSDs, page cache. I'm sure I forgot some. And particularly doing all that in a scalable fashion. For instance, PostgreSQL's own shared_buffers has some scalability limits that have not been addressed; the ability to use Linux's page cache greatly improves its performance on large machines.<br> <p> <font class="QuotedText">&gt; How many files does Postgres use anyway?</font><br> <p> Postgres uses the file system more extensively (I think) than most databases; here's a short overview of the different kinds of files: <a href="http://www.postgresql.org/docs/current/static/storage-file-layout.html">http://www.postgresql.org/docs/current/static/storage-fil...</a><br> <p> </div> Fri, 04 Apr 2014 12:12:42 +0000 Dirty pages, faster writing and fsync https://lwn.net/Articles/593055/ https://lwn.net/Articles/593055/ etienne <div class="FormattedComment"> <font class="QuotedText">&gt; for other filesystems (including ext2 and ext4), only the blocks for the one FD are forced to be written before fsync returns.</font><br> <p> The problem is not really fsync()ing the content of the file; you also want the metadata of that file to be on disk (so that you access the right data blocks after a crash)
- and it is a can of worms: other files can share the same disk block to store their own metadata, and those files may already have modified their own metadata...<br> Then you add the problem that you can modify the data content of a block between the request to write it to disk and it being physically written to disk (done through DMA, not completely under CPU control), and you do not want to copy too many pages for performance reasons.<br> </div> Fri, 04 Apr 2014 11:15:49 +0000 PostgreSQL pain points https://lwn.net/Articles/593053/ https://lwn.net/Articles/593053/ XTF <div class="FormattedComment"> Why would you need to reimplement a FS? How many files does Postgres use anyway? <br> <p> </div> Fri, 04 Apr 2014 10:52:27 +0000 Asynchronous commit https://lwn.net/Articles/593052/ https://lwn.net/Articles/593052/ XTF <div class="FormattedComment"> Kernel (FS) devs appear to be allergic to such a sound solution. I've requested it before (O_ATOMIC), they said it was too hard to do.<br> </div> Fri, 04 Apr 2014 10:49:30 +0000 Dirty pages, faster writing and fsync https://lwn.net/Articles/593048/ https://lwn.net/Articles/593048/ dlang <div class="FormattedComment"> <font class="QuotedText">&gt; Even if fsync is limited to the single file on the FD, it won't guarantee ordering of writes, or that only the blocks of interest are written.</font><br> <p> If you can't do<br> <p> write(); fsync(); write()<br> <p> and be guaranteed that all the blocks of the first write will be on disk before any of the blocks of the second write, then fsync is broken.<br> <p> What is wrong with ext3 is that when you do the fsync(), not only are the dirty blocks for this FD written out, ALL dirty blocks are written out before the fsync returns.
And if other processes continue to dirty blocks while fsync is running, they get written out as well.<br> <p> For other filesystems (including ext2 and ext4), only the blocks for the one FD are forced to be written before fsync returns.<br> </div> Fri, 04 Apr 2014 08:32:18 +0000 Dirty pages, faster writing and fsync https://lwn.net/Articles/593046/ https://lwn.net/Articles/593046/ dbrower <div class="FormattedComment"> Even if fsync is limited to the single file on the FD, it won't guarantee ordering of writes, or that only the blocks of interest are written.<br> <p> A better approach might be O_DIRECT and async i/o.<br> <p> If some things work better with non-O_DIRECT i/o, then the calling code isn't doing a very good job of planning its i/o and managing the buffer cache. A typical case for this might be a full-table-scan where read-ahead in the FS page cache is a win; the solution is for the thing doing the scan to make bigger reads over more pages.<br> <p> For what it's worth, these very same problems have been endemic in all UNIX databases for a very long time. Ingres was fighting these same things in the 80's. It's what led to O_DIRECT existing at all, and O_DATASYNC, but the latter never worked as well as people had hoped.<br> <p> <p> </div> Fri, 04 Apr 2014 07:17:47 +0000 PostgreSQL pain points https://lwn.net/Articles/593045/ https://lwn.net/Articles/593045/ palmer_eldritch <div class="FormattedComment"> I think you missed the joke here...<br> </div> Fri, 04 Apr 2014 06:10:19 +0000 PostgreSQL pain points https://lwn.net/Articles/593017/ https://lwn.net/Articles/593017/ rodgerd <div class="FormattedComment"> And instead you get the problem of your data being quietly corrupted.<br> </div> Thu, 03 Apr 2014 20:22:29 +0000 PostgreSQL pain points https://lwn.net/Articles/592994/ https://lwn.net/Articles/592994/ mathstuf <div class="FormattedComment"> MySQL avoids the problem by just dropping your data instead ;) .
You prefer fast over correctness, right?<br> </div> Thu, 03 Apr 2014 18:32:34 +0000 PostgreSQL pain points https://lwn.net/Articles/592991/ https://lwn.net/Articles/592991/ dlang <div class="FormattedComment"> you act as if MySQL doesn't run into the same problems. I think you just don't understand the details well enough.<br> </div> Thu, 03 Apr 2014 18:31:26 +0000 PostgreSQL pain points https://lwn.net/Articles/592964/ https://lwn.net/Articles/592964/ olgeni <div class="FormattedComment"> Use FreeBSD and PostgreSQL. Problems solved, and you also get a proper database for free.<br> </div> Thu, 03 Apr 2014 14:24:30 +0000 PostgreSQL pain points https://lwn.net/Articles/592903/ https://lwn.net/Articles/592903/ Pawlerson <div class="FormattedComment"> Use Linux and MySQL. Problems solved. If PostgreSQL has problems its developers should solve them. If they care..<br> </div> Thu, 03 Apr 2014 09:54:56 +0000 PostgreSQL pain points https://lwn.net/Articles/592482/ https://lwn.net/Articles/592482/ kleptog <div class="FormattedComment"> That might be an idea, except at the moment we're still at the API phase: what information needs to be communicated and how? Stuff mentioned in the article like: order dependencies between blocks, indicating that the kernel cache is out of date, etc. Whether it's eventually implemented in the VFS or as ioctl()s on a fancy new filesystem is really beside the point.<br> <p> I wonder if finding a way to expose parts of the JBD (Journalling Block Device) to user space might help with any of this.<br> </div> Sat, 29 Mar 2014 13:56:53 +0000 virtualization https://lwn.net/Articles/592474/ https://lwn.net/Articles/592474/ fandingo <div class="FormattedComment"> <font class="QuotedText">&gt; How is the raw partition approach worse here?</font><br> <p> I'm not sure a database should be implementing operations necessary for ATA TRIM. 
<br> </div> Sat, 29 Mar 2014 03:37:07 +0000 virtualization https://lwn.net/Articles/592458/ https://lwn.net/Articles/592458/ kleptog <div class="FormattedComment"> <font class="QuotedText">&gt; &gt; &gt; * Clobbering all other IO-using software on the same machine.</font><br> <p> <font class="QuotedText">&gt; &gt; For the second, as a userspace program you don't have a good view of what the rest of the system is doing, hence you might be interfering with other processes. The kernel has the overview.</font><br> <p> <font class="QuotedText">&gt; How is the raw partition approach worse here? I would intuitively think it makes things better: less sharing.</font><br> <p> I think it depends on what your goals are. If your goal is to make the absolutely fastest database server possible, then you'd probably want to use raw access on a system with nothing else running.<br> <p> If your goal is to make a database server that is broadly useful and runs efficiently on a wide variety of systems, then asking the kernel to do its job is the better idea.<br> <p> PostgreSQL tends to the latter. The gains you can get from raw access are simply not worth the effort and would make PostgreSQL much harder to deploy. A database server that only works well when it's got the machine to itself is a PITA in many situations.<br> </div> Fri, 28 Mar 2014 22:24:05 +0000 PostgreSQL pain points https://lwn.net/Articles/592455/ https://lwn.net/Articles/592455/ bronson <div class="FormattedComment"> ... and a Postgres dev retorted that FreeBSD didn't exhibit any of these problems.<br> </div> Fri, 28 Mar 2014 21:25:39 +0000 PostgreSQL pain points https://lwn.net/Articles/592374/ https://lwn.net/Articles/592374/ marcH <div class="FormattedComment"> Yet there is a *variety* of filesystems developed, even just on Linux.
How come?<br> <p> Wild, poorly educated guess: in an ideal world, shouldn't databases be hosted on a trimmed-down "semi-filesystem" which has only management features and none of the duplicated performance stuff that gets in the way? It could be called, say, LLVM++ for instance?<br> <p> <p> </div> Fri, 28 Mar 2014 08:31:05 +0000 PostgreSQL pain points https://lwn.net/Articles/592360/ https://lwn.net/Articles/592360/ rodgerd <div class="FormattedComment"> Not to mention it's a huge pain in the arse as a sysadmin. My experience is predominantly with Oracle's raw filesystem (ASM), and the lack of standard tooling to manage the filesystem... it's just not a good thing. To the point where our DBAs will avoid ASM unless they're building a RAC.<br> <p> Probably the most annoying experience was discovering that ASM doesn't do any kind of sanity check when starting filesystems: after a SAN change, the SAN operator wired some LUNs back to the wrong boxes. With an AAAABAAAA on one server and BBBBABBBB on another, LVM would have simply refused to start the volume group. ASM started, and then Oracle would silently coredump every time it tried to work with data on the misplaced disk. Such are the horrors of re-inventing decades of volume management and filesystems.<br> </div> Fri, 28 Mar 2014 01:02:18 +0000 sync() starves other reader/writers https://lwn.net/Articles/592333/ https://lwn.net/Articles/592333/ dlang <div class="FormattedComment"> given that your transaction is likely to affect multiple database pages, #2 isn't viable, you can never guarantee that all or none of the transaction will be visible after a crash.
That's why the databases do #1<br> </div> Thu, 27 Mar 2014 20:05:15 +0000 sync() starves other reader/writers https://lwn.net/Articles/592329/ https://lwn.net/Articles/592329/ seanyoung <div class="FormattedComment"> Indeed, there are two solutions to this:<br> <p> 1) write the journal before you do anything (requires writing the operations twice)<br> 2) Before modifying a database page, ensure that any scheduled i/o has completed. If not, either copy the page or move on to other pending work.<br> <p> So ideally you want completion information at page level for non-fsync writes.<br> </div> Thu, 27 Mar 2014 19:44:19 +0000 virtualization https://lwn.net/Articles/592327/ https://lwn.net/Articles/592327/ marcH <div class="FormattedComment"> <font class="QuotedText">&gt; The combination gives you a backup</font><br> <p> Heh, that was missing.<br> <p> I am still not convinced that rsync is the ultimate database backup tool. As much as I love rsync, it surely does not have the patented exclusivity of incremental copying/backup techniques.<br> <p> </div> Thu, 27 Mar 2014 19:36:14 +0000 virtualization https://lwn.net/Articles/592326/ https://lwn.net/Articles/592326/ marcH <div class="FormattedComment"> <font class="QuotedText">&gt; &gt; * Clobbering all other IO-using software on the same machine.</font><br> <p> <font class="QuotedText">&gt; For the second, as a userspace program you don't have a good view of what the rest of the system is doing, hence you might be interfering with other processes. The kernel has the overview.</font><br> <p> How is the raw partition approach worse here?
I would intuitively think it makes things better: less sharing.<br> <p> Anyway: any database of serious size runs on dedicated or practically dedicated hardware, doesn't it?<br> <p> </div> Thu, 27 Mar 2014 19:32:06 +0000 virtualization https://lwn.net/Articles/592314/ https://lwn.net/Articles/592314/ kleptog <div class="FormattedComment"> <font class="QuotedText">&gt; &gt; * Can no longer use standard tools like "rsync" with database files.</font><br> <p> <font class="QuotedText">&gt; Well, you can't use that on a live database anyway, so this point looks moot. Unless maybe you rely on a filesystem with snapshotting which is... not far from duplicating a database feature! Same pattern again.</font><br> <p> You can. It's useful for both backup and replication. Basically you can use rsync to quickly update your backup image. And then you take a copy of the WAL logs. The combination gives you a backup. If you have a snapshotting filesystem you can indeed achieve similar effects.<br> <p> <font class="QuotedText">&gt; &gt; * Can't keep up with hardware advances in a timely fashion.</font><br> <font class="QuotedText">&gt; &gt; * Clobbering all other IO-using software on the same machine.</font><br> <p> <font class="QuotedText">&gt; Sorry I don't get these two. Care to elaborate?</font><br> <p> For the first, consider the effects the rise of SSD is having on the Linux VFS. That would need to be replicated in the database. For the second, as a userspace program you don't have a good view of what the rest of the system is doing, hence you might be interfering with other processes.
The kernel has the overview.<br> <p> It's a feature that a database doesn't assume it's the only program on a machine.<br> </div> Thu, 27 Mar 2014 18:58:08 +0000 sync() starves other reader/writers https://lwn.net/Articles/592302/ https://lwn.net/Articles/592302/ dlang <div class="FormattedComment"> Plus you want to make sure none of the database files get written ahead of the corresponding WAL data<br> </div> Thu, 27 Mar 2014 18:06:02 +0000 PostgreSQL pain points https://lwn.net/Articles/592284/ https://lwn.net/Articles/592284/ Shewmaker Regarding double buffering, mapping files from hundreds of processes being slow, and madvise--LLNL has developed a new mmap implementation (loadable as a module) that is optimized for data-intensive applications. They also saw slowdowns for many processes using the standard mmap, but theirs scales up to hundreds of processes. Papers and code are available. This is going into production on their clusters. <p> See <a href="https://computation-rnd.llnl.gov/perma/di-mmap.php">DI-MMAP</a>. Thu, 27 Mar 2014 16:48:22 +0000 Asynchronous commit https://lwn.net/Articles/592280/ https://lwn.net/Articles/592280/ zblaxell <div class="FormattedComment"> I thought I covered that as undesirable option #3 ("put the filesystem in a database server that implements asynchronous commit"). ;)<br> <p> The RDBMS state of the art has progressed over the last 20 years, providing a wide variety of reasonable data integrity and performance tradeoffs. It would be nice if filesystems could catch up. 
fsync() is a sledgehammer.<br> </div> Thu, 27 Mar 2014 16:36:07 +0000 Dirty page caching not mostly harmless https://lwn.net/Articles/592248/ https://lwn.net/Articles/592248/ zblaxell <div class="FormattedComment"> <font class="QuotedText">&gt;&gt; if a process is continuing to make more dirty pages faster than they can be cleaned, such writes can last indefinitely.</font><br> <font class="QuotedText">&gt; well, that's true no matter what your limit is.</font><br> <p> All the more reason to have a plan for this when it happens. Throttle the process dirtying pages so it can't get too far ahead of the block device. Preempt large writes with higher priority I/O (or even equal priority) from other processes.<br> <p> This can already be done half a dozen ways, but ideally the block layer would default to a more reasonable behavior by itself.<br> </div> Thu, 27 Mar 2014 16:25:41 +0000 Dirty page caching not mostly harmless https://lwn.net/Articles/592249/ https://lwn.net/Articles/592249/ zblaxell <div class="FormattedComment"> <font class="QuotedText">&gt; It used to be that the spinning rust was so slow compared to the rest of the system that it was worth just about any computational effort to optimize it</font><br> <p> Computational effort is one thing. RAM cost is another, and it frequently dominates.<br> <p> Linux has worse problems with slow devices than with fast ones. SSDs are raising the high end, but at the same time enormous USB sticks and slow "green" rust drives are lowering the low end. It's annoying to have your fast SSD be idle 10% of the time because the filesystem can't keep up with it, but it's much more annoying to have most of your RAM be unusable for hours because someone plugged the slowest USB stick money can buy into your server, then tried to copy a few dozen GB of data onto it.<br> <p> The solution is the same at both extremes--limit the number of dirty pages and stop delaying writes so much. 
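[Editor's note: the throttling described above can already be approximated with the kernel's writeback sysctls; a sketch of a configuration fragment, with illustrative values rather than recommendations:]

```
# /etc/sysctl.d/99-writeback.conf (values illustrative, not recommendations)
vm.dirty_bytes = 67108864             # hard limit: writers block at 64 MiB dirty
vm.dirty_background_bytes = 16777216  # background writeback starts at 16 MiB
vm.dirty_expire_centisecs = 500       # pages considered old after 5 s
vm.dirty_writeback_centisecs = 100    # flusher wakes every 1 s
```

Byte-based limits are still global, though: they cannot give a slow USB stick a smaller budget than a fast SSD, which is the per-device behavior being asked for here.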
It's the spinning rust in the middle of the range that still needs big dirty page caches and big write delays to form them--and even then, there are pretty low limits to the useful data size for optimization. Fast SSDs don't need optimization and slow "green" drives are so slow that the first page write alone will provide plenty of time for an optimizable set of dirty pages to accumulate.<br> </div> Thu, 27 Mar 2014 16:25:29 +0000 PostgreSQL pain points https://lwn.net/Articles/592247/ https://lwn.net/Articles/592247/ smitty_one_each <div class="FormattedComment"> "All went swimmingly until a kernel dev jokingly suggested mySQL as a work-around. . ."<br> </div> Thu, 27 Mar 2014 15:00:10 +0000 sync() starves other reader/writers https://lwn.net/Articles/592197/ https://lwn.net/Articles/592197/ seanyoung <div class="FormattedComment"> Thank you for that.<br> <p> The problem postgresql has is that you want the wal/journal to be written asap (so with fsync). You want the database files to be written WITHOUT fsync, but you do want to know when they complete/are persisted, so you can safely discard old wal/journal files, for example.<br> <p> So what I was trying to avoid was calling fsync(). sync_file_range() just ends up calling fsync in the file system driver; it is no different, as you have done in your code.<br> <p> I'm not sure there is a method for this, although the aio functions seem to provide an API for this.<br> </div> Thu, 27 Mar 2014 10:52:54 +0000 Dirty pages, faster writing and fsync https://lwn.net/Articles/592196/ https://lwn.net/Articles/592196/ dgm <div class="FormattedComment"> As with bufferbloat, one approach could be to measure the buffer cache in terms of the time it takes to write it back, instead of bytes and megabytes.<br> <p> As others have suggested, having multiple queues mapped somehow to ionice levels could be of help too.<br> </div> Thu, 27 Mar 2014 10:45:20 +0000