Improved block-layer error handling
Block-layer error codes
One problem with existing reporting mechanisms is that they are based on standard Unix error codes, but those codes were never designed to handle the wide variety of things that can go wrong with block I/O. As a result, almost any type of error ends up being reported back to the higher levels of the block layer (and user space) as EIO (I/O error) with no further detail available. That makes it hard to determine, at both the filesystem and user-space levels, what the correct response to the error should be.
Christoph Hellwig is working to change that situation by adding a dedicated set of error codes to be used within the block layer. This patch set adds a new blk_status_t type to describe block-level errors. The specific error codes added thus far correspond mostly to the existing Unix codes. So BLK_STS_TIMEOUT, indicating an operation timeout, maps to ETIMEDOUT, while BLK_STS_NEXUS, describing a problem connecting to a remote storage device, becomes EBADE ("invalid exchange"). There is, according to Hellwig, "some low hanging fruit" that can be improved with additional error codes, but those codes are not added as part of this patch set.
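As a rough illustration of the approach (a userspace model rather than the patch itself; the numeric values here are invented, and in the kernel blk_status_t is a sparse-annotated 8-bit type), the new codes and their translation back to Unix errors look something like this:

    #include <errno.h>

    /* A dedicated block-status type whose values are translated back
     * to classic Unix codes only at the boundary with older interfaces. */
    typedef unsigned char blk_status_t;

    #define BLK_STS_OK        0
    #define BLK_STS_TIMEOUT   1     /* operation timed out */
    #define BLK_STS_NEXUS     2     /* problem reaching the storage device */

    /* Translate for callers that still expect a negative errno value. */
    static int blk_status_to_errno(blk_status_t status)
    {
        switch (status) {
        case BLK_STS_OK:      return 0;
        case BLK_STS_TIMEOUT: return -ETIMEDOUT;
        case BLK_STS_NEXUS:   return -EBADE;  /* "invalid exchange" */
        default:              return -EIO;    /* no better mapping */
        }
    }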
The new errors can be generated at the lowest levels of the kernel's block drivers, and will be propagated to the point that filesystem code sees them in the results of its block I/O requests. To get there, the bi_error field in struct bio, which contained a Unix error code, has been renamed to bi_status. In-tree filesystems have been changed to use the new field, but they do not yet act on the additional information that may be available there.
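A filesystem's I/O completion callback might consume the renamed field along these lines (a sketch: bio->bi_status and bio_put() are from the kernel API, while the my_fs_* helpers are hypothetical):

    /* Sketch of a bio completion handler using the renamed field. */
    static void my_fs_end_io(struct bio *bio)
    {
        if (bio->bi_status == BLK_STS_TIMEOUT)
            my_fs_note_retryable(bio);      /* a retry might succeed */
        else if (bio->bi_status)            /* formerly bio->bi_error */
            my_fs_fail_request(bio);        /* hard failure */
        else
            my_fs_complete(bio);
        bio_put(bio);
    }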
This is, in other words, relatively early infrastructural work that makes it possible for the block layer to produce better error information. Actually making use of that infrastructure will have to wait until this work is accepted and headed toward the mainline.
Reporting writeback errors
One particular challenge for block I/O error reporting is that many I/O requests are not the direct result of a user-space operation. Most file data is buffered through the kernel's page cache, and there can be a significant delay between when an application writes data into the cache and when a writeback operation flushes that data to persistent storage. If something goes wrong during writeback, it can be hard to report that error back to user space since the operation that caused that writeback in the first place will have long since completed. The kernel makes an attempt to save the error and report it on a subsequent system call, but it is easy for that information to be lost with the result that the application is unaware that it has lost data.
Jeff Layton's writeback-error reporting patches are an attempt to improve this situation. He adds a mechanism that is based on the idea that applications that care about their data will occasionally call fsync() to ensure that said data has made it to persistent storage. Current kernels might report a writeback error on an fsync() call, but there are a number of ways in which that can fail to happen. With the new mechanism in place, any application that holds an open file descriptor will reliably get an error return on the first fsync() call that is made after a writeback error occurs.
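From user space, the pattern the patches assume is the familiar one (a minimal, self-contained sketch):

    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    /* Write data and then check fsync(): with the new mechanism, the
     * first fsync() after a writeback failure is guaranteed to return
     * an error on every file descriptor held open on the file. */
    int write_safely(const char *path, const char *buf, size_t len)
    {
        int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);

        if (fd < 0)
            return -1;
        if (write(fd, buf, len) != (ssize_t)len ||
            fsync(fd) < 0) {                /* writeback may have failed */
            fprintf(stderr, "%s: %s\n", path, strerror(errno));
            close(fd);
            return -1;
        }
        return close(fd);                   /* close() can fail too */
    }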
To get there, the patch set creates a new type (errseq_t) for the reporting of writeback errors. It is a 32-bit value with two separate fields: an error code (of the standard Unix variety) and a sequence counter. That counter tracks the number of times that an error has been reported in that particular errseq_t value; kernel code can remember the counter value of the last error reported to user space. If the counter increases on a future check, a new error has been encountered.
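In rough terms, errseq_t behaves like this (a userspace model: the field widths are invented, and the real implementation also records whether a value has been seen):

    #include <stdint.h>

    typedef uint32_t errseq_t;

    #define ERRSEQ_ERRNO_MASK 0x00000fffu   /* low bits: the error code */
    #define ERRSEQ_CTR_SHIFT  12            /* high bits: sequence counter */

    /* Record a new error: store the code and advance the counter. */
    static void errseq_set(errseq_t *eseq, int err)
    {
        uint32_t ctr = (*eseq >> ERRSEQ_CTR_SHIFT) + 1;

        *eseq = (ctr << ERRSEQ_CTR_SHIFT) |
                ((uint32_t)(-err) & ERRSEQ_ERRNO_MASK);
    }

    /* Has an error been recorded since @since was sampled?  Returns the
     * negative error code, or zero if nothing new has happened. */
    static int errseq_check(errseq_t eseq, errseq_t since)
    {
        if ((eseq >> ERRSEQ_CTR_SHIFT) == (since >> ERRSEQ_CTR_SHIFT))
            return 0;
        return -(int)(eseq & ERRSEQ_ERRNO_MASK);
    }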
The errseq_t variables are added to the address_space structure, which controls the mapping between pages in the page cache and those in persistent storage. The writeback process uses this structure to determine where dirty pages should be written to, so it is a logical place to store error information. Meanwhile, any open file descriptor referring to a given file will include a pointer to that address_space structure, so this errseq_t value is visible (within the kernel) to all processes accessing the file. Each open file (tracked by struct file) gains a new f_wb_err field to remember the sequence number of the last reported error.
Storing that value in the file structure has an important benefit: it makes it possible to report a writeback error exactly once to every process that calls fsync() on that file, regardless of when they make that call. In current kernels, only the first caller after an error occurs has a chance of seeing that error information. It would arguably be better to report the error only to the process that actually wrote the data that experienced the error, but tracking things at that level would be cumbersome and slow. By informing all processes, this mechanism ensures that the right process will get the news.
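Put together, the fsync()-time check reduces to something like this sketch (locking omitted; check_wb_err() is an illustrative name, and errseq_check() is the model shown above):

    /* Report a writeback error recorded in the mapping since this
     * struct file last looked, then remember the new sequence so the
     * same error is not reported on this descriptor twice. */
    static int check_wb_err(struct file *file)
    {
        errseq_t cur = file->f_mapping->wb_err;
        int err = errseq_check(cur, file->f_wb_err);

        if (err)
            file->f_wb_err = cur;   /* this fd has now seen the error */
        return err;
    }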
The final step is to get the low-level filesystem code to use the new reporting mechanism when something goes wrong. Rather than convert all filesystems at once, Layton chose to add a new filesystem-type flag (FS_WB_ERRSEQ) that can be set for filesystems that understand the new scheme. Code at the virtual filesystem layer can then react accordingly depending on whether the filesystem has been converted or not. The intent is to remove this flag and the associated mechanism once all in-tree filesystems have made the change.
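At the virtual filesystem level, the transitional check might then branch along these lines (a sketch; check_wb_err() is the illustration from above, and the legacy branch stands in for the existing AS_EIO/AS_ENOSPC flag handling):

    static int report_wb_error(struct file *file)
    {
        if (file_inode(file)->i_sb->s_type->fs_flags & FS_WB_ERRSEQ)
            return check_wb_err(file);      /* new errseq_t scheme */
        return filemap_check_errors(file->f_mapping);  /* legacy flags */
    }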
The ideas behind this patch set were discussed at the 2017 Linux Storage, Filesystem, and Memory-Management Summit in March; the patches themselves have been through five public revisions since then. There is a reasonable chance that they are approaching a sort of final state where they can be considered for merging in an upcoming development cycle. The result will not be perfect writeback error reporting, but it should be significantly better than what the kernel offers now.
Index entries for this article
Kernel: Block layer/Error handling
Posted Jun 2, 2017 17:49 UTC (Fri)
by abatters (✭ supporter ✭, #6932)
[Link] (9 responses)
Posted Jun 2, 2017 17:53 UTC (Fri)
by corbet (editor, #1)
[Link]
An error will be returned only if the application calls fsync() on a file descriptor for a file that has experienced errors. Multiple drives are not an issue; errors should not propagate beyond the affected file even on a single drive.
Posted Jun 2, 2017 17:54 UTC (Fri)
by jlayton (subscriber, #31672)
[Link] (7 responses)
Posted Jun 3, 2017 6:17 UTC (Sat)
by pbonzini (subscriber, #60935)
[Link] (6 responses)
Posted Jun 3, 2017 9:53 UTC (Sat)
by jlayton (subscriber, #31672)
[Link] (5 responses)
Ultimately, an fsync syscall returns whatever the filesystem's fsync operation returns, so if the filesystem wants to check for O_DIRECT and always return 0 without flushing, then it can do so today.
Now, that said...one wonders why an application would call fsync on an O_DIRECT fd?
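The dispatch being described is essentially this (a simplified rendering of the VFS code):

    /* fsync() funnels into the filesystem's own method, which is free
     * to special-case O_DIRECT descriptors however it likes. */
    int vfs_fsync_range(struct file *file, loff_t start, loff_t end,
                        int datasync)
    {
        if (!file->f_op->fsync)
            return -EINVAL;
        return file->f_op->fsync(file, start, end, datasync);
    }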
Posted Jun 4, 2017 4:02 UTC (Sun)
by neilbrown (subscriber, #359)
[Link] (4 responses)
To ensure that the metadata is safe? I think you need O_SYNC|O_DIRECT if you want to not use fsync at all.
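Concretely (a minimal sketch; on Linux, O_DIRECT requires _GNU_SOURCE and suitably aligned I/O buffers):

    #define _GNU_SOURCE
    #include <fcntl.h>

    /* O_DIRECT bypasses the page cache; O_SYNC makes each write wait
     * until the data, and the metadata needed to retrieve it, are on
     * stable storage, so no separate fsync() is needed. */
    int open_sync_direct(const char *path)
    {
        return open(path, O_WRONLY | O_DIRECT | O_SYNC);
    }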
Posted Jun 4, 2017 19:32 UTC (Sun)
by pbonzini (subscriber, #60935)
[Link] (3 responses)
Posted Jun 5, 2017 11:44 UTC (Mon)
by jlayton (subscriber, #31672)
[Link] (2 responses)
I don't quite see why you'd want to avoid reporting errors on an O_DIRECT fd in either case, though. In both cases, it's possible that data previously written via that O_DIRECT file descriptor didn't make it to disk, so wouldn't you want to inform the application?
The big change here is that reporting those errors on the O_DIRECT fd won't prevent someone else from seeing those errors via another fd. So, I don't quite see why it'd be desirable to avoid reporting it on the O_DIRECT one.
Posted Jun 5, 2017 11:55 UTC (Mon)
by pbonzini (subscriber, #60935)
[Link] (1 responses)
I certainly would. :) However, I'm worried about the application using O_DIRECT seeing errors that happened while accessing the file via another fd.
In fact, if I understand correctly, those errors could have happened even before the O_DIRECT file descriptor was opened, if they had never been reported to userspace.
Posted Jun 5, 2017 16:15 UTC (Mon)
by jlayton (subscriber, #31672)
[Link]
How mixed buffered and direct I/O are handled is not really addressed (or changed for that matter) in this set. Yes, you will quite likely see an error on an O_DIRECT fsync, but it's quite likely that you'll see that today anyway. Most filesystems make no distinction about whether you opened the fd with O_DIRECT or not. They flush the pagecache and inode anyway just like they would with a buffered fd.
The flip side of this (and the scarier problem) is that with the current code, it's likely that that fsync on the O_DIRECT fd would end up clearing the error such that a later fsync on the buffered fd wouldn't ever see it. That problem at least should be addressed with these changes.
Posted Jun 2, 2017 18:59 UTC (Fri)
by jlayton (subscriber, #31672)
[Link]
Posted Jun 2, 2017 23:55 UTC (Fri)
by jhoblitt (subscriber, #77733)
[Link] (1 responses)
Posted Jun 3, 2017 9:38 UTC (Sat)
by jlayton (subscriber, #31672)
[Link]
fsync is called on a file descriptor, which is ultimately an open file on some sort of filesystem. When there is an error, the filesystem is responsible for marking the mapping for the inode with an error (sometimes this is handled by lower-layer common code, like the buffer.c routines). When fsync is called, the filesystem should check for an error that has occurred since we last checked via the file, report it if there was one, and advance the file's errseq_t to the current value.
Note that the way errors get recorded is not terribly different from what we do today. The big difference is in how we report errors at fsync time. Most of the changes to filesystems are in fsync here, though I am going through various parts of the kernel and trying to make sure that we're recording errors properly when they occur.
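In code, the per-filesystem fsync duty described here might look like this sketch (my_fs_sync_inode() is hypothetical; check_wb_err() is the errseq-style check sketched earlier, and filemap_fdatawrite_range() is the existing helper that kicks off writeback):

    static int my_fs_fsync(struct file *file, loff_t start, loff_t end,
                           int datasync)
    {
        int err, err2;

        /* Push any dirty pages in the range out to storage. */
        err = filemap_fdatawrite_range(file->f_mapping, start, end);

        /* Flush the inode's metadata as well. */
        err2 = my_fs_sync_inode(file_inode(file));
        if (!err)
            err = err2;

        /* Report any writeback error recorded since this file last
         * checked, and advance its errseq_t so it is seen only once. */
        err2 = check_wb_err(file);
        return err ? err : err2;
    }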
Posted Jun 3, 2017 0:36 UTC (Sat)
by Richard_J_Neill (subscriber, #23093)
[Link] (12 responses)
Posted Jun 3, 2017 4:18 UTC (Sat)
by k8to (guest, #15413)
[Link]
It's unclear to me whether the kernel should also log each such failure. It might be so noisy as to cause more breakage. I would want the system to log when this situation is close to occurring, and when it has occurred, in some throttled way, which suggests monitoring logic. Should that be implemented in-kernel or in userland?
Posted Jun 3, 2017 7:44 UTC (Sat)
by MarcB (guest, #101804)
[Link] (6 responses)
It is just another resource exhaustion that user space has to deal with - and perhaps even is dealing with, so nothing is actually wrong.
It can, and should, also be monitored just like free disk space.
Also, this used to be much more common in the past, when many filesystems allowed much fewer inodes by default. So, perhaps some administrators simply have forgotten (or never learned) that inode exhaustion is a real thing.
And diagnosing this - once you are aware that it can happen - is not harder than diagnosing "out of space" (in practice it is even easier, as it is unlikely that large numbers of inodes are held by deleted but still-open files).
Posted Jun 3, 2017 23:16 UTC (Sat)
by Richard_J_Neill (subscriber, #23093)
[Link] (5 responses)
We are used to the abstraction of a storage being "somewhere you can fill up with data"; the very existence of inodes should be no more the concern of the average programmer/sysadmin than the specifics of which pointer has which address... it should be "the computer's" problem, not "the operator's problem". If the computer is going to break that rule, and do so rarely, but catastrophically, the least it can do is to fail "noisily".
Also, while the sysadmin can add extra monitoring and debugging, surely the point of a reliable system is to minimise the chance of human error.
Anyway... in these days of LVM and resizeable volumes, why shouldn't the filesystem be able to automatically notice that it has lots of space but too few inodes, and automatically create more inodes as needed?
Posted Jun 4, 2017 1:39 UTC (Sun)
by rossmohax (guest, #71829)
[Link] (1 responses)
Posted Jun 4, 2017 5:09 UTC (Sun)
by matthias (subscriber, #94967)
[Link]
We once had the following problem after growing a filesystem. The standard at that time was to use only 32-bit inode numbers. After growing the filesystem, the 32-bit inode numbers were all in the already-filled lower part of the filesystem.(*) Thus no new inodes could be created. It took a while to find that one, with only the meaningful message "No space left on device." to go on. Luckily it was a 64-bit system, so we could just switch to 64-bit inode numbers. The other solution would have been to recreate the filesystem, not the quickest solution with a 56 TB filesystem.
That said, the circumstances under which XFS runs out of inodes are very rare, so it would be very important to have meaningful error messages, to notice that one of these rare circumstances just happened.
(*) On fs creation, XFS usually chooses the layout such that all possible inodes have 32-bit numbers. After growing, this condition was not satisfied any more, as this choice cannot be changed. On 32-bit systems, one would need to set this manually at fs creation time if one wants to have the possibility of growing the filesystem.
Posted Jun 4, 2017 14:15 UTC (Sun)
by MarcB (guest, #101804)
[Link] (2 responses)
If the software is some kind of cache, discarding the files that are least relevant is a proper course of action for both kinds of ENOSPC.
If the software is some kind of archival system, moving the oldest files to the next tier of storage will also help in both cases.
If the software can't freely discard or move data, all it can do is scream for help anyway.
Also, an ENOSPC due to lack of inodes will usually happen on open() while an ENOSPC due to lack of disk space will usually happen on write() or similar. So applications could already translate this to proper error messages. It is common that the same error code has different meanings for different syscalls, and developers should know this.
Of course, ideally filesystems would solve this problem completely. In fact, some do: btrfs has an upper limit of 2^64 inodes, as does XFS or ZFS (might be 2^48). btrfs is fully dynamic, i.e. each btrfs that is large enough to hold the inode information can in fact contain 2^64 inodes. XFS is dynamic enough in practice (make sure to use "inode64", though; otherwise inodes can only be stored in the lowest 1 TB, and that space can run out if it is also used for file data - been there, done that). Even NTFS allows 2^32 and is also fully dynamic.
The ext family is the big exception. Theoretically, the limit is also 2^32, but it cannot allocate space for inodes dynamically, and thus uses much lower limits by default. Otherwise, each inode would consume 256 bytes, even if unused.
Remember that the possible error codes for syscalls were defined by POSIX, so simply adding an EOUTOFINODES would be non-compliant and could easily do more harm than good, because in practice ENOSPC is a good fit for "out of inodes" and software might actually expect it to cover both cases.
Posted Jun 5, 2017 11:55 UTC (Mon)
by nix (subscriber, #2304)
[Link] (1 responses)
It might well do more harm than good, but the first part of your statement is just wrong. POSIX.1 2008 states (and all previous versions have similar wording):
The ERRORS section on each reference page specifies which error conditions shall be detected by all implementations ("shall fail") and which may be optionally detected by an implementation ("may fail"). If no error condition is detected, the action requested shall be successful. If an error condition is detected, the action requested may have been partially performed, unless otherwise stated.
Implementations may support additional errors not included in this list, may generate errors included in this list under circumstances other than those described here, or may contain extensions or limitations that prevent some errors from occurring.
Implementations may generate error numbers listed here under circumstances other than those described, if and only if all those error conditions can always be treated identically to the error conditions as described in this volume of POSIX.1-2008. Implementations shall not generate a different error number from one required by this volume of POSIX.1-2008 for an error condition described in this volume of POSIX.1-2008, but may generate additional errors unless explicitly disallowed for a particular function.
So adding more errors is not only not noncompliant, it is both explicitly permitted and very common.
Posted Jun 5, 2017 16:15 UTC (Mon)
by nybble41 (subscriber, #55106)
[Link]
Yes, for *new* error conditions not specified by POSIX. However:
> Implementations shall not generate a different error number from one required by this volume of POSIX.1-2008 for an error condition described in this volume of POSIX.1-2008, ...
The error list for the open() and openat() system calls specifies ENOSPC as follows:
> [ENOSPC]
> The directory or file system that would contain the new file cannot be expanded, the file does not exist, and O_CREAT is specified.
So if "the filesystem ... cannot be expanded" is read to include the "out of inodes" condition (a reasonable interpretation IMHO) then POSIX requires open() to return ENOSPC for this condition, and not some other error code.
Posted Jun 3, 2017 8:31 UTC (Sat)
by matthias (subscriber, #94967)
[Link] (2 responses)
I would much prefer error reporting by exceptions. The type of the exception more or less corresponds to the error numbers and can be used by the program to determine how to react, but there is a string attached that can be passed up the call chain, which has meaningful information for the user. This way the program still gets the information contained in ENOSPC (actually, most programs are fine reacting to running out of space and running out of inodes in the same way), but the user who sees the error message knows instantly where to search for the problem.
Adding type inheritance to the exceptions additionally allows the program to select how fine-grained the error information should be. Some programs are fine seeing an I/O exception. Others want to differentiate whether the error is running out of resources or a real problem, and some might want to know the difference between running out of space and running out of inodes.
Posted Jun 4, 2017 1:42 UTC (Sun)
by rossmohax (guest, #71829)
[Link]
Posted Jun 4, 2017 3:39 UTC (Sun)
by Cyberax (✭ supporter ✭, #52523)
[Link]
Posted Jun 3, 2017 10:32 UTC (Sat)
by itvirta (guest, #49997)
[Link]
It's very much the same as running out of disk space, which isn't that uncommon with some logging getting out of hand either. Both can be checked with `df`.
Also, there's the possibility of distributing unrelated data on separate file systems, or using quotas to protect the rest of the system from an application getting out of hand.
Posted Jun 3, 2017 12:27 UTC (Sat)
by nix (subscriber, #2304)
[Link] (12 responses)
> He adds a mechanism that is based on the idea that applications that care about their data will occasionally call fsync() to ensure that said data has made it to persistent storage.
I keep hearing this, but the problem is that not only is this not true, you don't want it to be true and would probably refuse to use any system on which it was true, because its performance would be appalling. Obviously, yes, text editors should be (and are) very careful about fsync()ing your six hours of work now you finally remembered to save it -- but let's pick on another favourite test load of kernel hackers, compiling a kernel. It would be bad if a chunk of data was omitted from the middle of an object file, right? So clearly the assembler "cares about" its data in this sense. But, equally, an assembler that called fsync() on its output would be the subject of copious vile swearing: you don't want your massive 64-way compile to be fsync()ing all over the place, not even in a filesystem better-behaved than ext3 (where fsync sometimes == sync()). You want any sync to happen at the end, after everything is linked, and you're probably happy if nothing syncs at all much of the time (for test compiles, if the power goes out, you'll just rebuild). However, that doesn't mean you're happy if an I/O error replaces crucial hunks of the kernel with \0!
This is just the first example that springs to mind. There are probably many more. One thing that's become clear to me as I classify everything on my machines into 'I care about this, RAID-6 and bcache it' and 'I don't care about this, chuck it on an unjournalled RAID-0' is that not only is there currently no way for applications to indicate what is important in this sense, but there is also *no way for most of them to know at all*. Whether a given file write is important is a property of what the user plans to do with the file later.
(Another kernel-compile-variety case: I do a lot of quick checks of enterprise kernels, with all their modules. Each modules_install writes about 3-4GiB of module data out to /lib/modules/$whatever/kernel. Obviously that's an important write, right? If it goes wrong the machine probably won't boot! Only it's not: 90% of those modules are never referenced again, and the whole lot is going onto a loopback filesystem on that RAID-0 array because I'm actually only going to use it once, for testing, then throw it away. There is no way the assembler, the linker, install(1), or the kernel makefile could know that; but without knowing it, the system might, in my case, decide to cache all 3GiB on an SSD, or journal it all through the RAID journal, or fsync() each file individually, or something. And, of course, in most cases even the users don't bother to make this sort of determination, or don't have the knowledge to, even though they're the only ones who could.)
I do not see an easy way out of this. :(
Posted Jun 3, 2017 13:57 UTC (Sat)
by gdt (subscriber, #6284)
[Link] (3 responses)
The rules for "correct" use of fsync() by applications' programmers are already not useful. If the program wasn't started interactively then it's best not to call fsync(), as a few thousand fsync() calls in a short time leads to substantial jitter. How you can tell if a program is being run interactively is no longer straightforward (is that HTTP POST from a person or a API). So there is a risk v benefit balance in programmer's minds when using fsync() for common file I/O, with a strong tendency towards "no" -- partly because of advocacy from kernel programmers, but also because fsync() historically works less well than suggested by the man page (eg, not clear to me if wear-leveling SSDs work for the case where fsync() is immediately followed by power-loss). It's worse for library authors, as they have no idea of the significance of the data, and so if to implement the notion that "applications that care about their data will occasionally call fsync()". You might argue that databases should use fsync(). You'll recall that Firefox had an issue with adding unwelcome latency by storing bookmarks in a SQLite database which issued fsync(). For these reasons even if they "care" for the data a programmer might well choose not to call fsync() but simply close() the file and let the system proceed without added latency. On the plus side, applications' programmers already accept some asynchronicity in read(), write() and close() error reporting and perhaps this could be further extended.
Posted Jun 3, 2017 14:38 UTC (Sat)
by nix (subscriber, #2304)
[Link] (2 responses)
> not clear to me if wear-leveling SSDs work for the case where fsync() is immediately followed by power-loss
I believe that the only SSD that currently even tries to reliably handle power loss without at least the possibility of massive data loss, corruption, or outright device failure is Intel's fairly costly datacentre parts. So, 'no'. :(
Posted Jun 3, 2017 15:23 UTC (Sat)
by jhoblitt (subscriber, #77733)
[Link] (1 responses)
Posted Jun 3, 2017 22:23 UTC (Sat)
by nix (subscriber, #2304)
[Link]
Posted Jun 4, 2017 4:26 UTC (Sun)
by neilbrown (subscriber, #359)
[Link] (2 responses)
On the one hand you have applications that need to know that the data they have written is "safe". They need to know this so that they can tell someone. Maybe the editor tells the user "the file has been saved". Maybe the email system tells its peer "I have that email now, you can discard your copy". Maybe the database store is telling the database journal "that information is safe". In each of these cases you need fsync() because you need to tell someone that the data is safe.
The C compiler or assembler doesn't need to tell anyone. But the linker does, as you say, want to know that the file it is loading is the same as the file that the assembler wrote out. It doesn't care if the data was safe or not. It is perfectly acceptable for the linker to say "sorry, data was corrupt" (as long as it doesn't do it too often). What is not so acceptable is for the linker to provide a complete binary which contains corruption.
In the first case you want data safety - I know I can read it back if I want to. In the second you want data integrity - I know that this data is (or isn't) the data that was written.
I don't believe the OS has any role in providing integrity, beyond best-effort to save and return data as faithfully as possible. If an application really cares, the application needs to add a checksum or crypto-hash or whatever. git does this. backups do this. gzip does this. I'm sure that if the cost/benefit analysis suggested that the C compiler should do this, then it would be easy enough to add.
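For an application that does care, the checksum idea is simple enough (a sketch; a Fletcher-style sum keeps it short, where a real application would use CRC32 or a cryptographic hash):

    #include <stddef.h>
    #include <stdint.h>

    /* Store this alongside the data when writing; recompute and compare
     * when reading back.  A mismatch means the file is not what was
     * written, even though every read() succeeded. */
    static uint32_t fletcher_sum(const uint8_t *data, size_t len)
    {
        uint32_t a = 0, b = 0;

        while (len--) {
            a = (a + *data++) % 65535;
            b = (b + a) % 65535;
        }
        return (b << 16) | a;
    }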
Posted Jun 5, 2017 12:05 UTC (Mon)
by nix (subscriber, #2304)
[Link] (1 responses)
Of course, POSIX provides no way for applications to say 'hey, fs, I want integrity from this, thank you', so the filesystem has to do whatever checksumming it can itself, so that applications don't all need to reimplement it. This might make sense: it seems like something that could probably be a per-filesystem attribute, or at least a whole-directory-tree attribute or something. Of course, POSIX also provides no way to say 'hey, fs, this file was written but failed integrity checks': -EIO is, ah, likely to be misinterpreted by essentially everything. So while it would be nice to have app-level integrity checking, I doubt we can get there from here: we do need to do it invisibly, below the visible surface of the system.
Posted Jun 5, 2017 18:51 UTC (Mon)
by zblaxell (subscriber, #26385)
[Link]
> POSIX provides no way for applications to say 'hey, fs, I want integrity from this, thank you'
Nor does it need one. POSIX should assume integrity by default unless applications say the opposite. One way applications can do that is by not checking any system call return values.
> POSIX also provides no way to say 'hey, fs, this file was written but failed integrity checks'
I don't think any changes to POSIX are required. We already have most of this in existing filesystems, just not in most existing filesystems.
In cases like compiles, where the writing application has completely disappeared before the block writes even start, there's no process to notify about the failure at the time the failure is detected. fsync() return behavior is irrelevant to this case--*every* system call, even _exit, returns before *any* error information is available. We want compiles to be fast, so we don't want to change this. A different solution is required. Note that reporting errors through fsync() is not wrong--it's just not applicable to this case.
For compiles we want to get the block-level error information passed from one producing process to another consuming process when the processes communicate through a filesystem. So let's do exactly that: if a block write fails, the filesystem should update its metadata to say "these data blocks were not written successfully and contain garbage now." Future reads of affected logical offsets of affected inodes should return EIO until the data is replaced by new (successful) writes, or the affected blocks are removed from the file by truncate, or the file is entirely deleted. If the filesystem metadata update fails too, move up the hierarchy (block > inode > subvol > planet > constellation > whatever) until the entire FS is set readonly and/or marked with errors for the next fsck to clean up by brute force.
Note that this scheme is different from block checksums. The behavior is similar, but block checksums are used to detect read errors (successful write followed by provably incorrect read), not write errors (where the write itself fails and the disk contents are unknown, possibly correct, possibly incorrect with hash collision). Checksums would not be an appropriate way to implement this. The existing prealloc mechanism in many filesystems could be extended to return EIO instead of zero data on reads. Prealloc already has most of the desired behavior wrt block allocation and overwrites.
> EIO is, ah, likely to be misinterpreted by essentially everything
I'm not sure how EIO could be misinterpreted in this context. The application is asking for data, and the filesystem is saying "you can't have that data because an IO-related failure occurred," so what part of EIO is misinterpreted exactly? What application (other than a general disk-health monitoring application, which could get detailed block error semantics through a dedicated kernel notification mechanism instead) would care about lower-level details, and which details would it use?
Also note EIO already happens in most filesystems, so we're not talking theoretically here. Most applications (even linkers), if they notice errors at all (**), notice EIO and do something sensible when they see it (*). This produces much, much more predictable results than just throwing random disk garbage into applications and betting they'll notice.
(*) interesting side note: linkers don't read all of the data in their input files, and will happily ignore EIO if it only occurs in the areas of files they don't read. Maybe not the best example case for a "data integrity" discussion. ;)
(**) for many years, GCC's as and ld didn't even notice ENOSPC, and would silently produce garbage binaries when the disk was full (maybe these would be detected by the linker later on...maybe not). Arguably we should also mark inodes with a persistent error bit if there is an ENOSPC while writing to them, but that *is* a major change which will surprise ordinary POSIX applications.
Posted Aug 5, 2020 15:39 UTC (Wed)
by pskocik (guest, #130865)
[Link] (4 responses)
Posted Aug 7, 2020 11:14 UTC (Fri)
by flussence (guest, #85566)
[Link] (2 responses)
When the last process in the group exits, it syncs any remaining dirty buffers touched by the process tree - in this example they'd be build artifacts, but they could be overly-paranoid software that fsyncs far too much (apt-get used to, foldingathome is awful on rotational disks), or just data that's low value to begin with (downloaded Docker containers? nosql databases?)
And once we have that in place and people using it, extending it to use filesystem-native transactions (wherever they exist) seems like an obvious next move. :-)
Posted Aug 7, 2020 20:02 UTC (Fri)
by Wol (subscriber, #4433)
[Link] (1 responses)
Why are nosql databases low value? Actually, nosql databases usually have a far higher signal/noise ratio - I converted a database from nosql to sql, and I think the size of the db DOUBLED.
Not only do nosql databases contain much more data per megabyte than relational, but they tend to be much faster - it's an old story but I remember stories about a company converting from UniVerse to (sn)Oracle, and it took SIX MONTHS for the consultants to get a Snoracle query (running on a twin Xeon) to outperform the old system running on a Pentium 90.
Or the "request for bids" put out by some University Astronomy department, that wanted a system to process 100K tpm. snOracle had to "cheat" to meet the target - delayed indexing, a couple of other tricks - while Cache had no trouble hitting 250K tpm.
RDBMSs are fundamentally inefficient, due to limitations in the relational model itself ... (just try and *store* a list in an rdbms).
Cheers,
Wol
Posted Aug 13, 2020 11:08 UTC (Thu)
by flussence (guest, #85566)
[Link]
Posted Sep 14, 2020 17:20 UTC (Mon)
by nix (subscriber, #2304)
[Link]