Better guidance for database developers [LWN.net]

Better guidance for database developers

Posted Sep 25, 2019 11:20 UTC (Wed) by fwiesweg (guest, #116364) [Link]

Well, large ones maybe, but definitely not sqlite on Android phones, unless Google adds a "change the partition layout" app permission allowing random apps to brick the whole device ;)

Better guidance for database developers

Posted Sep 25, 2019 13:32 UTC (Wed) by ringerc (subscriber, #3071) [Link] (9 responses)

Yes, they can. That's what Oracle does/did at various points in time, with various deployment models.

It works, but it has major costs: the DBMS must duplicate a large chunk of OS functionality, which is extremely wasteful. Skills and knowledge of people who know the OS I/O systems are not very transferable to tuning and working with the DBMS's I/O systems because they're parallel implementations. If the OS fixes a bug, the DBMS must fix it separately. The DBMS must find ways to share with and interoperate with the OS sometimes, which can introduce even more complexity.

So we should just bypass the kernel I/O stack. Well, why not just bypass the pesky scheduler, device drivers, etc too and write our own kernel? PostgresOS! We could write our own UEFI firmware and CPU microcode too, and maybe some HBA firmware...

OK, so that's hyperbolic. But why is it that the solution to I/O problems with the kernel is to bypass the kernel? If I wanted to override all kernel CPU scheduling you'd probably call me crazy, but it's if anything less extreme than replacing the I/O stack.

To me, if I can expect to rely on the kernel doing sensible things when I mmap() something, schedule processes reasonably, enforce memory protection, etc, I should be able to expect it to do sane things for I/O too.

Better guidance for database developers

Posted Sep 25, 2019 14:23 UTC (Wed) by epa (subscriber, #39769) [Link] (8 responses)

The kernel provides a POSIX interface (with a few extra frills). As noted in the article, POSIX doesn't really provide any guarantees about persistence of data in the event of a crash. If you have strong requirements for that, it makes sense to avoid the POSIX file system interface and use something else. One day that might be a next-generation file system API which lets you robustly (and simply) guarantee consistent data on disk while getting good performance. Until then, bypassing the file system altogether seems like the only way.

Similarly, POSIX doesn't provide an API for hard real-time; neither does stock Linux. So applications with hard real-time requirements bypass the kernel CPU scheduling and use something else -- often a separate real-time kernel which sits underneath Linux.

Better guidance for database developers

Posted Sep 25, 2019 15:15 UTC (Wed) by rweikusat2 (subscriber, #117920) [Link] (6 responses)

> The kernel provides a POSIX interface (with a few extra frills). As noted in the article, POSIX doesn't really provide any
> guarantees about persistence of data in the event of a crash. If you have strong requirements for that, it makes sense
> to avoid the POSIX file system interface and use something else.

"Holy non-sequitur, Batman!" Nobody uses 'POSIX', hence, there's no reason to avoid using something which happens to be 'in POSIX' just because something else is not. It all boils down to properties of implementations of some interface which happens to be 'in POSIX'. There's also a fundamental misunderstanding about the nature of 'a technical standard' in here: These don't and cannot 'guarantee' anything as a standard has no control over something which claims to be an implementation of it. The standard demands that conforming implementations shall have certain properties.

Leaving this aside, the statement is also wrong, cf

The fsync() function is intended to force a physical write of data from the buffer cache, and to assure that after a system crash or other failure that all data up to the time of the fsync() call is recorded on the disk. Since the concepts of "buffer cache", "system crash", "physical write", and "non-volatile storage" are not defined here, the wording has to be more abstract.

https://pubs.opengroup.org/onlinepubs/9699919799/function...

This is an optional feature which implementations may or may not implement but it's certainly 'in POSIX'.

Better guidance for database developers

Posted Sep 25, 2019 15:23 UTC (Wed) by epa (subscriber, #39769) [Link] (5 responses)

Well, sure. Instead of saying "POSIX allows any character except NUL and / to appear in a filename" we should all, to be strictly correct, say "the POSIX standard demands that a conforming implementation allow any character...". Instead of "POSIX doesn't provide a video streaming API" we should say "there is no requirement, in the POSIX standard, that a conforming implementation implement an API for video streaming". And so on and so on. Surely we all understand what is meant by the shorter form?

Yes, fsync() exists and is part of POSIX, and guarantees a physical write (when using a conforming implementation). But if fsync() were enough and its semantics were clearly understood by everyone, surely this LWN article would not exist? I thought the whole point was that the API provided by the Linux kernel (which is loosely speaking a superset of POSIX) doesn't provide the interfaces a database system developer would like to use -- or at least it's not understood by everyone how to use them.

Better guidance for database developers

Posted Sep 25, 2019 21:09 UTC (Wed) by rweikusat2 (subscriber, #117920) [Link] (4 responses)

> Well, sure. Instead of saying "POSIX allows any character except NUL and / to appear in a filename" we should all, to be strictly
> correct, say "the POSIX standard demands that a conforming implementation allow any character...".

[...]

> And so on and so on. Surely we all understand what is meant by the shorter form?

The important distinction here is that a standard is a requirements specification and, as such, it doesn't and cannot 'guarantee' anything. Implementations aiming to conform to the specification might guarantee something (or not) but that's up to the implementation.

The notion that "the API is all wrong" would seem to be a preconceived opinion of some people (and to which degree this is nothing but "Microsoft does it differently" in disguise is anybody's guess) but that's not what I think this article was about. It was about deficiencies of the Linux implementation of an API, especially about the lack of consistency wrt different file systems and about insufficient documentation. Eg,

| For example, if you create a file, write to it, and then call fsync() on it, do you also have to open its directory and fsync() that in
| order to be sure that the file is persistent in the directory? Is that even filesystem-specific?
|
| Kernel filesystem developer Jan Kara said that POSIX mandates the directory fsync() for persistence.

But this is just plain wrong. *If* an implementation supports POSIX synchronized I/O (something Linux doesn't claim to support, only aims to support in some way here and there), then "All I/O operations shall be completed as defined for synchronized I/O file integrity completion." upon fsync and "synchronized I/O file integrity completion" is defined as

| Identical to a synchronized I/O data integrity completion with the addition that all file attributes relative to the I/O operation
| (including access time, modification time, status change time) are successfully transferred prior to returning to the calling
| process.

with "I/O data integrity completion" being defined as "all data and all metadata necessary to retrieve this data has been written". IOW, a problem here is that Linux doesn't implement the POSIX API but some essentially random subset of that here and another there, depending on whatever the responsible maintainer had for breakfast a fortnight ago.

Better guidance for database developers

Posted Sep 25, 2019 21:44 UTC (Wed) by nybble41 (subscriber, #55106) [Link] (3 responses)

You seem to be arguing that POSIX compliance requires fsync() on a file to imply an fsync() on the parent directory, and potentially all other ancestor directories up to the root of the filesystem. Or possibly *multiple* parent directories and their ancestors in the case of hard links. Do you have any examples of POSIX-style operating systems which make such guarantees?

Personally I'd say that the Linux implementation is perfectly compliant. The fsync() call ensures that the data and metadata for the target file (i.e., inode) are written to the backing device. After reset and recovery any process with a reference to the file will read the data which was present at the time of the fsync() call (unless it was overwritten later). This is enough to satisfy the requirements. In order to get such a reference, however, you need directory entries to associate a path with that inode. Those directory entries are not part of the file, and the creation of a directory entry is not an I/O operation on the file, so an fsync() call on the file itself does not guarantee anything about the directory. For that you need to fsync() the directory.
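
For illustration, the create, write, fsync() sequence discussed here, including the separate fsync() on the directory, might look like the following minimal Python sketch. The helper name durable_create() and the file name are made up for the example; the os calls are thin wrappers around open(2), write(2) and fsync(2). It assumes Linux, where a directory can be opened read-only and passed to fsync().

```python
import os
import tempfile


def durable_create(path, data):
    """Create `path` holding `data`, then fsync the file and its parent directory."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o644)
    try:
        view = memoryview(data)
        while view:                       # os.write() may write fewer bytes than asked
            written = os.write(fd, view)
            view = view[written:]
        os.fsync(fd)                      # file data and inode metadata
    finally:
        os.close(fd)

    parent = os.path.dirname(os.path.abspath(path))
    dir_fd = os.open(parent, os.O_RDONLY | os.O_DIRECTORY)
    try:
        os.fsync(dir_fd)                  # the directory entry that names the file
    finally:
        os.close(dir_fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "record.dat")   # hypothetical file name
        durable_create(target, b"hello")
        with open(target, "rb") as f:
            print(f.read())
```

Any OSError from either fsync() is deliberately left to propagate; what an application should do after a failed fsync() is a separate question.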

Better guidance for database developers

Posted Sep 25, 2019 22:12 UTC (Wed) by rweikusat2 (subscriber, #117920) [Link] (2 responses)

As I quoted in an earlier post:

The fsync() function is intended to force a physical write of data from the buffer cache, and to assure that after a system crash or other failure that all data up to the time of the fsync() call is recorded on the disk.

You're correct insofar as this doesn't explicitly demand that the data which was recorded can ever be retrieved again after such an event, IOW, that an implementation which effectively causes it to be lost is perfectly compliant :-). But that's sort of a moot point as any "synchronous I/O capability" is optional, IOW, loss of data due to write-behind caching of directory operations is just a "quality" of (certain) Linux implementations of this facility. I'm - however - pretty convinced that the idea was that the data can be retrieved after a sudden "cache catastrophe" and not that it just sits on the disk as magnetic ornament. In any case, POSIX certainly doesn't "mandate" this "feature".

Better guidance for database developers

Posted Sep 26, 2019 20:29 UTC (Thu) by nybble41 (subscriber, #55106) [Link] (1 responses)

> I'm - however - pretty convinced that the idea what that the data can be retrieved after a sudden "cache catastrophe" and not that it just sits on the disk as magnetic ornament.

Even if you mandated that fsync() == sync() so that *all* filesystem data was written to disk before fsync() returns it still wouldn't guarantee that there is actually a directory entry pointing to that file. For example, it could have been unlinked by another process, in which case the data on disk really would be nothing more than a "magnetic ornament".

Let's say process A creates a file with path "/a/file", writes some data to it, and calls fsync(). While this is going on, another process hard-links "/a/file" to "/b/file" and then unlinks "/a/file" prior to the fsync() call. Would you expect the fsync() call to synchronize both directories, or just the second directory?
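
The scenario can be replayed in a single process to see what the file descriptor still refers to. This is only a sketch: the "other process" is simulated inline, the directory names a and b live under a temporary directory rather than at the root, and relink_before_fsync() is a name invented for the example.

```python
import os
import tempfile


def relink_before_fsync(root):
    """Replay the scenario; return (link count at fsync time, data read via the new name)."""
    dir_a = os.path.join(root, "a")
    dir_b = os.path.join(root, "b")
    os.mkdir(dir_a)
    os.mkdir(dir_b)
    old_name = os.path.join(dir_a, "file")
    new_name = os.path.join(dir_b, "file")

    fd = os.open(old_name, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o644)
    try:
        os.write(fd, b"payload")
        # Stand-in for the other process: move the name from a/ to b/.
        os.link(old_name, new_name)
        os.unlink(old_name)
        os.fsync(fd)                      # the descriptor still refers to the same inode
        nlink = os.fstat(fd).st_nlink
    finally:
        os.close(fd)

    with open(new_name, "rb") as f:
        return nlink, f.read()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(relink_before_fsync(tmp))   # prints (1, b'payload')
```

The descriptor carries no record of which directories currently name the inode, which is the crux of the question: fsync(fd) alone has no obvious directory to synchronize.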

Better guidance for database developers

Posted Sep 26, 2019 20:55 UTC (Thu) by rweikusat2 (subscriber, #117920) [Link]

I'm sorry but you're just dancing around the issue. UNIX(*) file systems used to do directory modifications synchronously in order to guarantee (to the point this was possible) file system integrity in case of a sudden loss of cache contents. And that's what the people who wrote the POSIX text had in mind: A situation where there's file data in the filesystem but no directory entry pointing to it cannot occur. Hence, ensuring that all file data and metadata is written, as per definition of fsync, is sufficient to guarantee that the file won't be lost.

The Linux ext2 file system introduced write-behind caching of directory operations in order to improve performance at the expense of reliability in situations deemed to be rare. Because of this, depending on the filesystem being used, fsync on a file descriptor is not sufficient to make a file crash-proof on Linux: An application would need to determine the path to the root file system, walk that down while fsyncing every directory and then call fsync on the file descriptor. This is obviously not a requirement applications will realistically meet in practice.

Possibly 'hostile' activities of other processes (as in "Let's say ...") are of no concern here because that's not a situation fsync is supposed to handle.
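
To make the "walk down from the root" procedure concrete, here is a rough Python sketch of what an application would have to do. Both function names are invented for the example, and it assumes Linux semantics for fsync() on a directory descriptor. Errors on ancestor directories are collected and returned rather than hidden, since directories near the root may not be readable by the caller.

```python
import os
import tempfile


def ancestor_dirs(path):
    """Return the directories containing `path`, ordered from the root downwards."""
    current = os.path.dirname(os.path.realpath(path))
    chain = [current]
    while True:
        parent = os.path.dirname(current)
        if parent == current:             # reached the root
            break
        chain.append(parent)
        current = parent
    chain.reverse()
    return chain


def fsync_with_ancestors(path):
    """fsync each ancestor directory, then the file; return [(directory, error or None)]."""
    results = []
    for directory in ancestor_dirs(path):
        try:
            dir_fd = os.open(directory, os.O_RDONLY | os.O_DIRECTORY)
            try:
                os.fsync(dir_fd)
            finally:
                os.close(dir_fd)
            results.append((directory, None))
        except OSError as exc:            # reported to the caller, not suppressed
            results.append((directory, exc))
    fd = os.open(path, os.O_RDONLY)
    try:
        os.fsync(fd)                      # finally, the file itself
    finally:
        os.close(fd)
    return results


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "data")          # hypothetical file name
        with open(target, "wb") as f:
            f.write(b"x")
        for directory, error in fsync_with_ancestors(target):
            print(directory, "ok" if error is None else error)
```

Even this leaves out hard links and concurrent renames, which rather supports the point that applications are unlikely to do it in practice.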

Better guidance for database developers

Posted Sep 25, 2019 15:26 UTC (Wed) by hkario (subscriber, #94864) [Link]

It's more that the kernel provides a "POSIX-like" interface. Yes, it's compatible with POSIX, but POSIX is the lowest common denominator; it doesn't describe what Linux can do or which interfaces it provides.

Or, to put it another way: POSIX doesn't require the error handling of the APIs to be underspecified.

Better guidance for database developers

Posted Sep 25, 2019 22:21 UTC (Wed) by neilbrown (subscriber, #359) [Link] (3 responses)

> Couldn't a large database installation work with raw disk partitions,

Raw disk partitions would be a bit clumsy, but using O_DIRECT access is quite close to raw partition access.

You would need to create the file safely - sync the directory and pre-allocate the address space of the file and make sure that was safely on disk. But then with a raw partition you would need to have a reliable way to create the partition safely and be sure the partition details were safely in non-volatile storage.

Whichever way you cut it, you need reliable guarantees about how things work.
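
As a sketch of the "create the file safely" step, assuming Linux and Python's os.posix_fallocate() wrapper: the function name create_preallocated() and the 64 KiB size are arbitrary choices for the example, not anything a particular database does.

```python
import os
import tempfile


def create_preallocated(path, size):
    """Create `path`, reserve `size` bytes, sync file and directory; return the file size."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT | os.O_EXCL, 0o600)
    try:
        os.posix_fallocate(fd, 0, size)   # reserve space up front rather than leave a sparse file
        os.fsync(fd)                      # persist the allocation and the inode
        actual = os.fstat(fd).st_size
    finally:
        os.close(fd)

    parent = os.path.dirname(os.path.abspath(path))
    dir_fd = os.open(parent, os.O_RDONLY | os.O_DIRECTORY)
    try:
        os.fsync(dir_fd)                  # persist the new directory entry
    finally:
        os.close(dir_fd)
    return actual


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(create_preallocated(os.path.join(tmp, "tablespace"), 64 * 1024))
```

Whether the preallocated blocks are guaranteed to stay allocated and stable across later writes is again filesystem-specific, which is the kind of guarantee being asked for.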

Better guidance for database developers

Posted Sep 26, 2019 4:33 UTC (Thu) by dezgeg (subscriber, #92243) [Link] (2 responses)

I thought there are some filesystems that may silently have O_DIRECT I/O fall back to buffered I/O under some circumstances?

Better guidance for database developers

Posted Sep 26, 2019 9:24 UTC (Thu) by metan (subscriber, #74107) [Link] (1 responses)

As far as I can tell that happens only when you pass unaligned buffers to the read()/write() syscalls. In that case some filesystems report errors and some fall back to page-cache-backed I/O. But as long as you align your buffers correctly it should not happen.
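
A minimal sketch of an aligned O_DIRECT write from Python, assuming Linux. An anonymous mmap is used only because it is a convenient way to get a page-aligned buffer; the 4096-byte block size is an assumption, since the real alignment requirement depends on the device and filesystem. The names BLOCK, _write_block() and direct_write() are invented for the example, and the fallback to a buffered write here is an explicit application-level choice, not the kernel behaviour being discussed.

```python
import mmap
import os
import tempfile

BLOCK = 4096  # assumed alignment; the real requirement depends on device and filesystem


def _write_block(path, buf, extra_flags):
    """Write the whole buffer at offset 0 and fsync it."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | extra_flags, 0o600)
    try:
        os.write(fd, buf)                 # address, length and offset are all aligned
        os.fsync(fd)                      # O_DIRECT by itself does not imply durability
    finally:
        os.close(fd)


def direct_write(path, payload):
    """Write one zero-padded block, with O_DIRECT if accepted; return the mode used."""
    if len(payload) > BLOCK:
        raise ValueError("payload larger than one block")
    buf = mmap.mmap(-1, BLOCK)            # anonymous mappings are page-aligned
    try:
        buf[:len(payload)] = payload      # the rest of the block stays zero-filled
        try:
            _write_block(path, buf, os.O_DIRECT)
            return "direct"
        except OSError:
            # Some filesystems reject O_DIRECT outright; a real application
            # would inspect errno instead of retrying on any error.
            _write_block(path, buf, 0)
            return "buffered"
    finally:
        buf.close()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(direct_write(os.path.join(tmp, "block.bin"), b"hello"))
```

Note that a result of "direct" only means the kernel accepted the O_DIRECT open and write; as far as I know it does not prove that the page cache was actually bypassed.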

Fallback depends on more than alignment

Posted Sep 26, 2019 18:01 UTC (Thu) by sitsofe (guest, #104576) [Link]

You can silently fall back to buffered I/O even though you set the O_DIRECT "hint", simply because of the filesystem in use, the filesystem's current options, or because you're doing allocating writes on a certain filesystem, etc. See https://stackoverflow.com/questions/34572559/asynchronous... (point 2 and the references) for some background.

Better guidance for database developers

Posted Sep 26, 2019 8:49 UTC (Thu) by liam (guest, #84133) [Link]

Ceph uses bluestore which, iirc, interfaces directly with the block layer.
A small hitch might be that bluestore uses an (internal) rocksdb for handling the metadata, thus requiring them to reimplement exactly enough of the filesystem interface to support rocks.