Better guidance for database developers
At the inaugural Databases microconference at the 2019 Linux Plumbers Conference (LPC), two developers who work on rather different database systems had similar complaints about developing for Linux. Richard Hipp, creator of the SQLite database, and Andres Freund from the PostgreSQL project both lamented the lack of definitive documentation on how to best use the kernel's I/O interfaces, especially for corner cases. Both of the sessions, along with others in the microconference, pointed to a strong need for more interaction between user-space and kernel developers.
SQLite
Hipp went first, with a brief introduction to SQLite (which he pronounced "ess-kyew-ell lite"), its ubiquity, and how it is different from other database systems, the gist of which can be seen in his slides. He also created a briefing paper since he had far more material to discuss than there was time in the session. In many ways, his title, "What SQLite Devs Wish Linux Filesystem Devs Knew About SQLite", summed up the presentation nicely.
![Richard Hipp [Richard Hipp]](https://static.lwn.net/images/2019/lpc-hipp-sm.jpg)
He noted that SQLite is "in everything and it is everywhere"; he pointed to my Sony camera (see photo) and said that he didn't know if it had SQLite in it, but that it probably did. It is in cars, phones, televisions, and more. There are more than 200 SQLite databases in each of the 2.5 billion Android phones, and each of those phones does more than 5GB of SQLite I/O per day. "It's a lot of I/O."
Unlike other database systems, SQLite is effectively a library that gets embedded into applications; there is no separate server thread or process. Most databases are designed to run in data centers, but SQLite is designed to run at the edge of the network. It uses a single file to hold the entire database, though there can be journal files to support atomic commits. There is no configuration file for SQLite, so it must discover the capabilities of the underlying system at runtime.
Multiple processes can all be accessing the database at the same time in an uncoordinated fashion, he said. There are three mechanisms to provide atomicity, consistency, isolation, durability (ACID) guarantees for the database. The most universal is a rollback journal, which is also the slowest. A write-ahead log, which is faster, can be used if it is known that the database file does not reside on a network filesystem. There is no good way to determine if that is true, however.
An attendee asked about using statfs() to determine the type of the filesystem. Hipp said that could be done, but then SQLite would have to maintain a list of which types are network filesystems or not. Since SQLite is often statically linked with applications, there would be no way to update that list, he said.
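What such a check might look like is sketched below; the handful of magic numbers shown here is exactly the hand-maintained list Hipp does not want to carry, and it is illustrative rather than complete.

```c
#include <stdbool.h>
#include <sys/vfs.h>      /* statfs() */
#include <linux/magic.h>  /* filesystem magic numbers */

/* Sketch of the statfs()-based approach: classify a path as "network
 * filesystem" by its magic number.  Keeping this list current is the
 * maintenance burden Hipp objects to, especially for statically linked
 * applications. */
static bool probably_network_fs(const char *path)
{
    struct statfs sf;

    if (statfs(path, &sf) != 0)
        return false;            /* a real caller would handle the error */

    switch (sf.f_type) {
    case NFS_SUPER_MAGIC:
    case SMB_SUPER_MAGIC:
    case CODA_SUPER_MAGIC:
        return true;
    default:
        return false;
    }
}
```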
SQLite can also use the atomic write capabilities of F2FS. It is a Linux-only solution, but it is "really fast", he said. He has heard reports that reformatting the filesystem on an old Android phone from ext4 to F2FS will make the handset "seem like a perky new phone". There is a clunky, ioctl()-based interface to detect the feature and to use it; it would be nice to have a more generic way to query for this kind of information.
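As an illustration of how clunky the current approach is, the feature probe looks roughly like the following. The constants follow the F2FS driver's definitions, which applications such as SQLite carry as private copies; treat them as a sketch rather than a reference.

```c
#include <stdbool.h>
#include <sys/ioctl.h>
#include <linux/ioctl.h>
#include <linux/types.h>

/* F2FS feature-query ioctl, reproduced here because applications cannot
 * always rely on a kernel header exporting it; values are illustrative. */
#ifndef F2FS_IOCTL_MAGIC
#define F2FS_IOCTL_MAGIC          0xf5
#define F2FS_IOC_GET_FEATURES     _IOR(F2FS_IOCTL_MAGIC, 12, __u32)
#define F2FS_FEATURE_ATOMIC_WRITE 0x0004
#endif

/* Ask the filesystem under fd whether it supports atomic writes.  Any
 * error (including "this is not F2FS at all") is treated as "no". */
static bool f2fs_atomic_write_supported(int fd)
{
    __u32 features = 0;

    if (ioctl(fd, F2FS_IOC_GET_FEATURES, &features) != 0)
        return false;

    return (features & F2FS_FEATURE_ATOMIC_WRITE) != 0;
}
```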
He raised a few specific items in the session, but said there were many more in the paper. The first was a reliable way to query for filesystem attributes. For example, if you create a file, write to it, and then call fsync() on it, do you also have to open its directory and fsync() that in order to be sure that the file is persistent in the directory? Is that even filesystem-specific?
Kernel filesystem developer Jan Kara said that POSIX mandates the directory fsync() for persistence. Generally, filesystem developers are not willing to guarantee anything more relaxed than that because it ties their hands. As it turns out, ext4 effectively does the directory fsync() under the covers, so it is not truly necessary, at least for now. Doing the directory fsync() anyway, as SQLite does, should not be expensive if there is no concurrent activity, Kara said. That is exactly the kind of information he needed, Hipp said: authoritative information from people who know.
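In code, the conservative sequence being discussed looks something like this minimal sketch, with error handling abbreviated for brevity:

```c
#include <fcntl.h>
#include <unistd.h>

/* The "belt and braces" version: fsync() the new file, then fsync() the
 * directory that contains it so the directory entry is durable as well. */
static int create_file_durably(const char *dirpath, const char *filepath,
                               const void *buf, size_t len)
{
    int fd = open(filepath, O_WRONLY | O_CREAT | O_EXCL, 0644);
    if (fd < 0)
        return -1;
    if (write(fd, buf, len) != (ssize_t)len || fsync(fd) != 0) {
        close(fd);
        return -1;
    }
    close(fd);

    int dirfd = open(dirpath, O_RDONLY | O_DIRECTORY);
    if (dirfd < 0)
        return -1;
    int ret = fsync(dirfd);      /* make the new directory entry durable */
    close(dirfd);
    return ret;
}
```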
He also wondered if it made sense for SQLite to tell the filesystem about unused chunks of the file. They will be written, and will matter, at some point in the future, but until then they are effectively holes: allocated space whose contents SQLite does not care about. While there was some thought that filesystems could use that information to send TRIM commands to SSDs, the overall belief was that it probably was not worth it. Kara said that unless the holes were gigabytes in size, it did not make sense to bother with it.
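For what it is worth, if such a hint were passed along today, the closest existing mechanism is probably fallocate() hole punching, which tells the filesystem that a range's contents no longer matter while leaving the file size alone. A hypothetical helper, as a sketch only:

```c
#define _GNU_SOURCE        /* fallocate() */
#include <fcntl.h>
#include <linux/falloc.h>  /* FALLOC_FL_* flags */

/* Punch out a range whose contents the database no longer cares about.
 * FALLOC_FL_KEEP_SIZE preserves the file's apparent length. */
static int drop_unused_range(int fd, off_t offset, off_t length)
{
    return fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                     offset, length);
}
```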
PostgreSQL
Freund launched right into a complaint that durability handling for PostgreSQL is difficult; every Linux filesystem has different behavior. Most system calls do not document what happens when an error is returned. He specifically mentioned error returns from fsync(), which were only reliably reported starting with Linux 4.13; there is still no documentation on what those errors mean and whether the operation can be sensibly retried.
![Andres Freund [Andres Freund]](https://static.lwn.net/images/2019/lpc-freund-sm.jpg)
Kara essentially agreed. He noted that the standards define what should happen in the normal case; POSIX does not try to define any durability guarantees. "I share your pain", he said.
Freund continued by describing another documentation flaw: durability operations like sync_file_range() come with big warnings ("This system call is extremely dangerous and should not be used in portable programs ...") that tend to steer application developers away from them. But when the application developers run into performance problems in various cases, they get pointed to sync_file_range(). Kara said the warning is there because there are no durability guarantees provided by sync_file_range(); some filesystems will durably store the range, but others will not. Freund wondered how applications are supposed to actually use the function without having to read kernel code.
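The usage that developers typically get pointed at looks roughly like the sketch below; note that even the "wait" flags only say that writeback of the range was submitted and completed, not that the data has reached stable storage.

```c
#define _GNU_SOURCE
#include <fcntl.h>

/* Kick off (and wait for) writeback of a range that was just written.
 * On most filesystems this says nothing about durability; it is a
 * throughput/latency tool, not a substitute for fsync(). */
static int start_range_writeback(int fd, off_t offset, off_t nbytes)
{
    return sync_file_range(fd, offset, nbytes,
                           SYNC_FILE_RANGE_WAIT_BEFORE |
                           SYNC_FILE_RANGE_WRITE |
                           SYNC_FILE_RANGE_WAIT_AFTER);
}
```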
In addition, the error behavior is different depending on the filesystem, block device, and kernel version. Depending on the filesystem, you will either get the old or new contents of a page after an I/O error; for NFS you may not see the error at all until the file is closed. There needs to be some documentation of what applications need to check and what they can expect; you can't "complain about people writing crappy code if that's all the guidance that they have".
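A concrete consequence of the NFS behavior he mentioned: the error from a failed background write may only be reported by close(), so code that ignores close()'s return value never sees it. A small sketch of the defensive pattern:

```c
#include <stdio.h>
#include <unistd.h>

/* Both calls must be checked: a writeback failure can surface at fsync(),
 * but on NFS it may only show up when the file is closed. */
static int finish_writes(int fd)
{
    int ret = 0;

    if (fsync(fd) != 0) {
        perror("fsync");
        ret = -1;
    }
    if (close(fd) != 0) {
        perror("close");     /* e.g. a deferred NFS write error */
        ret = -1;
    }
    return ret;
}
```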
An attendee asked: "how deep do you want to go?" The filesystem developers are constrained by the block layer developers who are constrained by the devices themselves. The differences between various types of storage devices are going to make guarantees difficult, they said.
Freund said that he was looking for consistency of a different type. For example, right now on a thin-provisioned block device, you can get an ENOSPC error from random system calls that do not document that return. He would be fine with filesystem and block layer errors that were consistent, but does not think Linux needs to hide or try to paper over device errors of various sorts.
In the case of failures, the behavior needs to be documented, he said. If fsync() fails and gets retried, what happens? Does the original sync operation get tried again or does the new data get thrown away? The latter is kind of what happens now in some cases, he said. Application developers have to find and read threads on the Linux kernel mailing list to figure that out.
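One answer PostgreSQL ultimately adopted in the wake of those mailing-list threads was to stop retrying altogether: since the kernel may already have dropped the dirty pages after a failed fsync(), the server treats the failure as fatal and recovers from its write-ahead log on restart. The function below is an illustrative sketch of that stance, not PostgreSQL's actual code.

```c
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Treat the first fsync() failure as unrecoverable.  Retrying is unsafe
 * because a later fsync() can report success even though the dirty data
 * was thrown away after the first failure (the case described above);
 * recovering from the log is the only trustworthy path. */
static void checkpoint_fsync_or_die(int fd, const char *path)
{
    if (fsync(fd) != 0) {
        fprintf(stderr, "fsync of %s failed: %m\n", path);
        abort();
    }
}
```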
Beyond just documenting the failure semantics, there is a need for documentation of the right way to do certain things in a safe manner. Right now that is a guessing game or something that requires talking to kernel developers over beer. The latter is nice, he said, but does not work well remotely.
Continuing in that vein, he said that there is no documentation on how to achieve durability for data. Whatever application developers do, though, some kernel developer will complain about it. If there is a performance concern, a kernel developer will say that the application is doing too much, but if the concern is data loss, then someone will complain that it is not doing enough. "Opinions will contradict each other wildly."
Another example is renaming a file atomically; what is required to ensure that it is on disk? According to some filesystem developers, it requires an fsync() of the existing file and the directory containing it, followed by the rename(), and then an fsync() of the new file and of its containing directory. There was some back and forth about whether some of those steps were actually needed, but Tomas Vondra said that PostgreSQL had settled on that sequence after extensive testing; that is what finally made the data-loss problems disappear.
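Spelled out as code, the sequence Vondra described looks roughly like the sketch below, assuming for brevity that both names live in the same directory. Whether every step is strictly required on every filesystem is exactly the open question from the session.

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* fsync() the temporary file and its directory, rename() it into place,
 * then fsync() the file and the directory again so the new name is on
 * stable storage too. */
static int durable_replace(const char *dirpath, const char *tmpname,
                           const char *finalname)
{
    int ret = -1;
    int fd = open(tmpname, O_WRONLY);
    int dirfd = open(dirpath, O_RDONLY | O_DIRECTORY);

    if (fd < 0 || dirfd < 0)
        goto out;
    if (fsync(fd) != 0 || fsync(dirfd) != 0)     /* existing file + directory */
        goto out;
    if (rename(tmpname, finalname) != 0)
        goto out;
    if (fsync(fd) != 0 || fsync(dirfd) != 0)     /* renamed file + directory */
        goto out;
    ret = 0;
out:
    if (fd >= 0)
        close(fd);
    if (dirfd >= 0)
        close(dirfd);
    return ret;
}
```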
Kara agreed that documentation of the sort Freund is looking for is lacking. He suggested coming up with concrete questions to post to the linux-fsdevel mailing list. The responses can be summarized into a file for the kernel documentation directory. Kara said that the atomic rename() situation is "kind of sad" and suggested that might be a good question to bring to the list.
An attendee asked if Freund was looking for the lowest common denominator because the filesystems are going to have different answers for some things. Freund said that would be fine; if there are major performance implications, it might make sense to have some filesystem-specific code. In answer to another question, Freund said that he was looking for information on what errors it would make sense to retry—which have a chance of actually succeeding if you do so?
From these two sessions and some others in the microconference, it is clear that database developers (and likely other user-space application developers) need to find ways to collaborate more with the kernel developers—and vice versa. The microconference is a great start, but more discussion on the mailing lists and over beer is needed, as is the creation of better documentation. Guidance on how to perform certain operations safely, especially with regard to file data and metadata consistency, seems to be a great starting place.
[I would like to thank LWN's travel sponsor, the Linux Foundation, for travel assistance to Lisbon for LPC.]
| Index entries for this article | |
|---|---|
| Conference | Linux Plumbers Conference/2019 |
Posted Sep 24, 2019 18:01 UTC (Tue)
by darwi (subscriber, #131202)
[Link] (2 responses)
Posted Sep 24, 2019 18:42 UTC (Tue)
by k8to (guest, #15413)
[Link] (1 responses)
Posted Sep 24, 2019 18:52 UTC (Tue)
by k8to (guest, #15413)
[Link]
It's certainly still used in some roll-your-own situations. Mostly I've seen FreeBSD when I've seen it at all.
My expectation is that the platforms people care about are going to be Linux and, to a lesser extent, Windows for the foreseeable future. All the shops I've worked for in the past 14 years have their software running on OSX but it's never fully supported. It just "sort of works" as demo-ware.
Projects like SQLite and Postgres tend to care about making their software work reliably on an incredible number of platforms though. Postgres finally dropped support for VAX not that long ago. I expect they achieve these goals on all the BSDs.
Posted Sep 25, 2019 6:14 UTC (Wed)
by geoffhill (subscriber, #92577)
[Link] (21 responses)
But at the end of the day, if db developers claim consistency of data on my disk, I hold them responsible. If Linux doesn't provide the syscalls to do that, the database developers and users who care about on-disk consistency should look elsewhere for a system that provides the guarantees they seek.
It doesn't matter if you are the foremost expert in your database design. If you cannot understand the abstractions you are building upon (perhaps because they are not clear and rock-solid abstractions!), you cannot claim to understand your own product.
Posted Sep 25, 2019 7:36 UTC (Wed)
by weberm (guest, #131630)
[Link] (2 responses)
Say for example you have a HD controller that lies about flushing data to the disk, because for some benchmark figure there's a cache in there that's considered "part of the disk" and that's where the flush goes to. But if you flush just enough data, the buffer overflows and your actual data is getting written to the disk. Are database developers now supposed to follow each transaction with just enough data (different per disk / controller) so that this internal cache overflows? Why should they have to be bothered with this? Because some user uses questionable hardware?
This multiplies across the stack that your software is built on, and extends to hardware, even the CPU you're running on. No single sane person can claim to fully understand the whole stack, hardware and software. This is completely unrealistic. So you go to the experts for the various layers and communicate your expectations and necessities. There's no other way than to collaborate.
Posted Sep 25, 2019 13:25 UTC (Wed)
by ringerc (subscriber, #3071)
[Link] (1 responses)
Abruptly lost power? Oh well. Hope you didn't need that data written consistently and in order.
But for marketing and benchmark results reasons, they'd report to the OS that they were doing write-through even when they were really write-back caching.
How's a database supposed to defend against that?
Is it supposed to protect your data from somebody pouring coffee into the host's disk array too?
Posted Sep 25, 2019 14:16 UTC (Wed)
by NightMonkey (subscriber, #23051)
[Link]
Yes, for your sake, it better. I will pour my coffee into your database's disk array AGAIN if you keep leaving it in my bedroom, all 16 loud fans blowing full speed, ringere. I don't care how many nines you've promised, or how much fault tolerance you claim, or how "important" your data is. Hot chocolate, too, if you do it in the winter. So, step off or get burned. Make sure your transactions are atomic, check your backups, and get this monster OUT of here!
Worst roommate EVER, you are.
P.S. I agree with you. ;)
Posted Sep 25, 2019 9:13 UTC (Wed)
by epa (subscriber, #39769)
[Link] (16 responses)
Posted Sep 25, 2019 11:20 UTC (Wed)
by fwiesweg (guest, #116364)
[Link]
Posted Sep 25, 2019 13:32 UTC (Wed)
by ringerc (subscriber, #3071)
[Link] (9 responses)
It works, but it has major costs: the DBMS must duplicate a large chunk of OS functionality, which is extremely wasteful. Skills and knowledge of people who know the OS I/O systems are not very transferable to tuning and working with the DBMS's I/O systems because they're parallel implementations. If the OS fixes a bug, the DBMS must fix it separately. The DBMS must find ways to share with and interoperate with the OS sometimes, which can introduce even more complexity.
So we should just bypass the kernel I/O stack. Well, why not just bypass the pesky scheduler, device drivers, etc too and write our own kernel? PostgresOS! We could write our own UEFI firmware and CPU microcode too, and maybe some HBA firmware...
OK, so that's hyperbolic. But why is it that the solution to I/O problems with the kernel is to bypass the kernel? If I wanted to override all kernel CPU scheduling you'd probably call me crazy, but it's, if anything, less extreme than replacing the I/O stack.
To me, if I can expect to rely on the kernel doing sensible things when I mmap() something, schedule processes reasonably, enforce memory protection, etc, I should be able to expect it to do sane things for I/O too.
Posted Sep 25, 2019 14:23 UTC (Wed)
by epa (subscriber, #39769)
[Link] (8 responses)
Similarly, POSIX doesn't provide an API for hard real-time; neither does stock Linux. So applications with hard real-time requirements bypass the kernel CPU scheduling and use something else -- often a separate real-time kernel which sits underneath Linux.
Posted Sep 25, 2019 15:15 UTC (Wed)
by rweikusat2 (subscriber, #117920)
[Link] (6 responses)
"Holy non-sequitur, Batman!" Nobody uses 'POSIX', hence, there's no reason to avoid using something which happens to be 'in POSIX' just because something else is not. It all boils down to properties of implementations of some interface which happens to be 'in POSIX'. There's also a fundamental misunderstanding about the nature of 'a technical standard' in here: These don't and cannot 'guarantee' anything as a standard has no control over something which claims to be an implementation of it. The standard demands that conforming implenentation shall have certain properties.
Leaving this aside, the statement is also wrong, cf
The fsync() function is intended to force a physical write of data from the buffer cache, and to assure that after a system crash or other failure that all data up to the time of the fsync() call is recorded on the disk. Since the concepts of "buffer cache", "system crash", "physical write", and "non-volatile storage" are not defined here, the wording has to be more abstract.
https://pubs.opengroup.org/onlinepubs/9699919799/function...
This is an optional feature which implementations may or may not implement but it's certainly 'in POSIX'.
Posted Sep 25, 2019 15:23 UTC (Wed)
by epa (subscriber, #39769)
[Link] (5 responses)
Yes, fsync() exists and is part of POSIX, and guarantees a physical write (when using a conforming implementation). But if fsync() were enough and its semantics were clearly understood by everyone, surely this LWN article would not exist? I thought the whole point was that the API provided by the Linux kernel (which is loosely speaking a superset of POSIX) doesn't provide the interfaces a database system developer would like to use -- or at least it's not understood by everyone how to use them.
Posted Sep 25, 2019 21:09 UTC (Wed)
by rweikusat2 (subscriber, #117920)
[Link] (4 responses)
[...]
> And so on and so on. Surely we all understand what is meant by the shorter form?
The important distinction here is that a standard is a requirements specification and, as such, it doesn't and cannot 'guarantee' anything. Implementations aiming to conform to the specification might guarantee something (or not) but that's up to the implementation.
The notion that "the API is all wrong" would seem to be a preconceived opinion of some people (and to which degree this is nothing but "Microsoft does it differently" in disguise is anybody's guess) but that's not what I think this article was about. It was about deficiencies of the Linux implementation of an API, especially about the lack of consistency with respect to different file systems and about insufficient documentation. E.g.,
| For example, if you create a file, write to it, and then call fsync() on it, do you also have to open its directory and fsync() that in order to be sure that the file is persistent in the directory? Is that even filesystem-specific?
|
| Kernel filesystem developer Jan Kara said that POSIX mandates the directory fsync() for persistence.
But this is just plain wrong. *If* an implementation supports POSIX synchronized I/O (something Linux doesn't claim to support, only aims to support in some way here and there), then "All I/O operations shall be completed as defined for synchronized I/O file integrity completion." upon fsync and "synchronized I/O file integrity completion" is defined as
| Identical to a synchronized I/O data integrity completion with the addition that all file attributes relative to the I/O operation (including access time, modification time, status change time) are successfully transferred prior to returning to the calling process.
with "I/O data integrity completion" being defined as "all data and all metadata necessary to retrieve this data has been written". IOW, a problem here is that Linux doesn't implement the POSIX API but some essentially random subset of that here and another there, depending on whatever the responsible maintainer had for breakfast a fortnight ago.
Posted Sep 25, 2019 21:44 UTC (Wed)
by nybble41 (subscriber, #55106)
[Link] (3 responses)
Personally I'd say that the Linux implementation is perfectly compliant. The fsync() call ensures that the data and metadata for the target file (i.e., inode) is written to the backing device. After reset and recovery any process with a reference to the file will read the data which was present at the time of the fsync() call (unless it was overwritten later). This is enough to satisfy the requirements. In order to get such a reference, however, you need directory entries to associate a path with that inode. Those directory entries are not part of the file, and the creation of a directory entry is not an I/O operation on the file, so an fsync() call on the file itself does not guarantee anything about the directory. For that you need to fsync() the directory.
Posted Sep 25, 2019 22:12 UTC (Wed)
by rweikusat2 (subscriber, #117920)
[Link] (2 responses)
The fsync() function is intended to force a physical write of data from the buffer cache, and to assure that after a system crash or other failure that all data up to the time of the fsync() call is recorded on the disk.
You're correct insofar as this doesn't explicitly demand that the data which was recorded can ever be retrieved again after such an event, IOW, that an implementation which effectively causes it to be lost is perfectly compliant :-). But that's sort of a moot point as any "synchronous I/O capability" is optional, IOW, loss of data due to write-behind caching of directory operations is just a "quality" of (certain) Linux implementations of this facility. I'm - however - pretty convinced that the idea was that the data can be retrieved after a sudden "cache catastrophe" and not that it just sits on the disk as a magnetic ornament. In any case, POSIX certainly doesn't "mandate" this "feature".
Posted Sep 26, 2019 20:29 UTC (Thu)
by nybble41 (subscriber, #55106)
[Link] (1 responses)
Even if you mandated that fsync() == sync() so that *all* filesystem data was written to disk before fsync() returns it still wouldn't guarantee that there is actually a directory entry pointing to that file. For example, it could have been unlinked by another process, in which case the data on disk really would be nothing more than a "magnetic ornament".
Let's say process A creates a file with path "/a/file", writes some data to it, and calls fsync(). While this is going on, another process hard-links "/a/file" to "/b/file" and then unlinks "/a/file" prior to the fsync() call. Would you expect the fsync() call to synchronize both directories, or just the second directory?
Posted Sep 26, 2019 20:55 UTC (Thu)
by rweikusat2 (subscriber, #117920)
[Link]
The Linux ext2 file system introduced write-behind caching of directory operations in order to improve performance at the expense of reliability in situations deemed to be rare. Because of this, depending on the filesystem being used, fsync on a file descriptor is not sufficient to make a file crash-proof on Linux: an application would need to determine the path to the root file system, walk that down while fsyncing every directory and then call fsync on the file descriptor. This is obviously not a requirement applications will realistically meet in practice.
Possibly 'hostile' activities of other processes (as in "Let's say ...") are of no concern here because that's not a situation fsync is supposed to handle.
Posted Sep 25, 2019 15:26 UTC (Wed)
by hkario (subscriber, #94864)
[Link]
or to put it another way: POSIX doesn't require the error handling of the APIs to be underspecified
Posted Sep 25, 2019 22:21 UTC (Wed)
by neilbrown (subscriber, #359)
[Link] (3 responses)
Raw disk partitions would be a bit clumsy, but using O_DIRECT access is quite close to raw partition access.
You would need to create the file safely - sync the directory and pre-allocate the address space of the file and make sure that was safely on disk. But then with a raw partition you would need to have a reliable way to create the partition safely and be sure the partition details were safely in non-volatile storage.
Whichever way you cut it, you need reliable guarantees about how things work.
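A bare-bones sketch of what that approach looks like in practice, with the 4096-byte alignment standing in for whatever the device actually requires:

```c
#define _GNU_SOURCE            /* O_DIRECT */
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    /* Create and preallocate the file up front, then write while
     * bypassing the page cache.  O_DIRECT requires aligned buffers,
     * offsets, and lengths; 4096 bytes is an assumption about the device. */
    int fd = open("dbfile", O_RDWR | O_CREAT | O_DIRECT, 0644);
    if (fd < 0)
        return 1;
    if (posix_fallocate(fd, 0, 1 << 20) != 0)    /* reserve 1MB */
        return 1;

    void *buf;
    if (posix_memalign(&buf, 4096, 4096) != 0)
        return 1;
    /* ... fill buf with a block of data ... */
    if (pwrite(fd, buf, 4096, 0) != 4096)
        return 1;

    /* An fsync() is still typically needed to cover file metadata and to
     * flush volatile device caches. */
    return fsync(fd) == 0 ? 0 : 1;
}
```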
Posted Sep 26, 2019 4:33 UTC (Thu)
by dezgeg (subscriber, #92243)
[Link] (2 responses)
Posted Sep 26, 2019 9:24 UTC (Thu)
by metan (subscriber, #74107)
[Link] (1 responses)
Posted Sep 26, 2019 18:01 UTC (Thu)
by sitsofe (guest, #104576)
[Link]
Posted Sep 26, 2019 8:49 UTC (Thu)
by liam (guest, #84133)
[Link]
Posted Sep 25, 2019 13:17 UTC (Wed)
by ringerc (subscriber, #3071)
[Link]
In reality, it's turtles all the way down.
Can you confidently state that the UEFI firmware hijacking control of the disk to do some I/O to a UEFI hidden/reserved partition won't affect durability?
What if the SATA firmware on (random made-up example) some Western Digital Silver SSD drives responds prematurely to a flush request if it immediately follows a TRIM command? Should they know that, special case that, handle that?
Because if so, I promise you there is only one possible outcome: "We make no guarantees about the durability of your data, good luck with that."
In reality that's *always* the case. It's all about confidence levels, testing, and experience. We can never prove that we cannot lose your data. We can only say, confidently, that we've done a rather comprehensive job of plugging all the routes we can find by which we might lose it. If that's not good enough, you'd better go back to pen & paper because there's no way in the universe that any one person is going to understand everything and all possible interactions. Not with CPU microcode, UEFI, BMCs, PCIe inter-device communication, IOMMUs, VT-x and VT-IO, hypervisors, firmware on SSDs and HDDs, ACPI, firmware on I/O controllers, the kernel core, kernel device drivers, bus power states, device power states, processor power states, power management interacting with everything, etc etc etc.
Can you list every microprocessor on your laptop that can interact with your RAM, PCI-e bus, or USB HCI? I guarantee you can't.
Posted Sep 25, 2019 14:31 UTC (Wed)
by martin.langhoff (subscriber, #61417)
[Link]
Many of the proposed answers -- i.e., the sane way to rename() is x, y, z -- could/should be encoded in a battery of tests that supports fault injection.
Posted Sep 26, 2019 15:27 UTC (Thu)
by psoberoi (subscriber, #45666)
[Link] (2 responses)
https://danluu.com/deconstruct-files/
Even if you don't think it's simple - read that article. It's a great explanation of how hard it is to do persistence reliably.
Posted Sep 26, 2019 17:20 UTC (Thu)
by rweikusat2 (subscriber, #117920)
[Link]
Moving forward along these lines, fsync is not "a barrier and a flush operation", it's a forced writeback of a part of the page cache. Obviously, updates to the page cache after an fsync won't end up being written prior to the writeback forced by the fsync because - duh! - that has already happened. It's not because there's some kind of "reordering" fsync prevents.
Posted Sep 26, 2019 21:47 UTC (Thu)
by jhhaller (guest, #56103)
[Link]
The trade-offs are between performance, cost, and durability. It's impossible to get high performance with low cost and high durability.
Posted Sep 28, 2019 6:14 UTC (Sat)
by marcH (subscriber, #57642)
[Link]
Yes - assuming anyone knows in the first place. Error handling in general is barely ever designed and never tested. Are filesystems somewhat better? How much error injection can be found in filesystem test suites?
> Application developers have to find and read threads on the Linux kernel mailing list to figure that out.