From: Theodore Ts'o <tytso-AT-mit.edu>
To: Kenichi Okuyama <okuyamak-AT-dd.iij4u.or.jp>
Subject: Re: [Ext2-devel] Re: [RFD] FS behavior (I/O failure) in kernel summit
Date: Wed, 15 Jun 2005 10:01:05 -0400
Cc: Hans Reiser <reiser-AT-namesys.com>,
    Andreas Dilger <adilger-AT-clusterfs.com>,
    Reiserfs developers mail-list <Reiserfs-Dev-AT-namesys.com>

On Tue, Jun 14, 2005 at 11:46:36AM +0900, Kenichi Okuyama wrote:
> I agree that kernel can not directly influence user.
> But, application may have better chance.
> Think about case of editor (vi, emacs, almost any text editors are ok ).
> If you try to save file, and recieve no error, user will believe they
> have been written on disk they believe to be existing.
> Even log yells for error, user will not notice.
> If editor recieve error, then user can know something is wrong. Though
> he is still wondering, if he recieve the message
> like "Input Output Error: may be HW error?", he definitely will start
> from looking at cable.

Part of the problem is that we are limited by the constraints of the
POSIX specification for error handling. For example, we don't have a
way of telling the application, "the reason why the filesystem was
remounted-read-only was in reaction to an I/O error that appears to be
caused by the multiple CRC checksum errors reported by the SCSI
controller". We can only return EIO or EROFS. And while the write()
which causes an I/O error that remounts the filesystem read/only can
(and probably does) return EIO, any subsequent writes will return
EROFS, and changing this would be hard, hackish, and probably wouldn't

Also, there is not necessarily one right answer to how to respond to an
underlying I/O error in the filesystem. So for ext2/3 filesystems, it
is configurable. In case of an underlying error detected in the
filesystem metadata, the filesystem can be set to either (a) panic and
force a reboot, so that hopefully fsck can resolve the issue, (b)
remount the filesystem read/only, to prevent further damage, or (c)
continue and do nothing (the don't worry, be happy approach).
Different users will want different approaches, and so trying to
standardize what applications will see at the user level doesn't seem
like the right approach, since we want to allow system administrators
some flexibility about how they wish to configure their systems.
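
Whichever policy is configured, a program such as an editor still sees
only an errno value. A minimal C sketch of the most it can do with EIO
and EROFS follows; the helper names write_all() and
describe_write_error() are hypothetical and not taken from any library,
and the message wording is only an assumption about what might help a
user:

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: map the errno from a failed write to the only
 * hint the application can pass on to its user. */
static const char *describe_write_error(int err)
{
    switch (err) {
    case EIO:
        return "I/O error: data may not be on disk; check hardware and the kernel log";
    case EROFS:
        return "filesystem is read-only (possibly remounted after an earlier error)";
    default:
        return strerror(err);
    }
}

/* Hypothetical helper: write the whole buffer, retrying short writes
 * and EINTR. Returns 0 on success or the errno value on failure. */
static int write_all(int fd, const char *buf, size_t len)
{
    while (len > 0) {
        ssize_t n = write(fd, buf, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return errno;
        }
        buf += n;
        len -= (size_t)n;
    }
    /* fsync() so that a deferred write-back error is reported here
     * rather than silently lost. */
    if (fsync(fd) < 0)
        return errno;
    return 0;
}
```

An editor would call write_all() when saving and show
describe_write_error() on a non-zero return. Note that nothing in the
return value says why the filesystem went read-only, which is exactly
the missing context.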

(For example, in an embedded system or a system with higher
levels of redundancy, the right answer might be to panic and either
reboot or halt; continuing and possibly returning wrong answers
might be completely unacceptable, and it may be that once the
system goes down hard, the adjacent backup blade can pick up

So instead of trying to standardize the existing error returns, which
are the way they are and are probably not worth the effort to
standardize, since they don't return enough context to the application
anyway, I would suggest that the better thing to do is to design a new
mechanism for reporting block device errors and problems at the
filesystem level via some kind of notification mechanism (pick your
choice of hotplug, dbus, or netlink; dbus may make the most sense,
since multiple applications may want to subscribe to such
notifications), so that applications can take corrective action as
necessary.

This is a better approach, since it is far more flexible and returns much
more information to the user. For example, in a desktop environment,
the desktop can pop up a dialog warning the user of a block device
failure or filesystem corruption, without having to modify every
single application. In the case of an embedded system, the
notification can trigger an appropriate failover or recovery process.
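
To illustrate the shape of such a mechanism, here is a small in-process
C sketch. It is only a stand-in for a real transport such as dbus or
netlink, and every name in it (fs_event, fs_event_bus,
fs_bus_subscribe, fs_bus_publish and the two handlers) is hypothetical;
no such kernel or library interface is implied. The point is that the
event carries context an errno cannot, and that several subscribers can
react differently to the same event:

```c
#include <stdio.h>

/* Hypothetical event kinds; not an existing kernel or library API. */
enum fs_event_kind { FS_EV_IO_ERROR, FS_EV_CORRUPTION, FS_EV_REMOUNT_RO };

struct fs_event {
    enum fs_event_kind kind;
    const char *device;  /* which block device or filesystem */
    const char *detail;  /* free-form context that errno cannot carry */
};

typedef void (*fs_event_handler)(const struct fs_event *ev, void *ctx);

#define MAX_SUBSCRIBERS 8

struct fs_event_bus {
    fs_event_handler handlers[MAX_SUBSCRIBERS];
    void *contexts[MAX_SUBSCRIBERS];
    int count;
};

/* Register a handler; returns 0 on success, -1 if the table is full. */
static int fs_bus_subscribe(struct fs_event_bus *bus,
                            fs_event_handler h, void *ctx)
{
    if (bus->count >= MAX_SUBSCRIBERS)
        return -1;
    bus->handlers[bus->count] = h;
    bus->contexts[bus->count] = ctx;
    bus->count++;
    return 0;
}

/* Deliver one event to every subscriber, in subscription order. */
static void fs_bus_publish(const struct fs_event_bus *bus,
                           const struct fs_event *ev)
{
    for (int i = 0; i < bus->count; i++)
        bus->handlers[i](ev, bus->contexts[i]);
}

/* Desktop-style subscriber: build the text for a warning dialog.
 * ctx must point to a buffer of at least 256 bytes. */
static void desktop_warn(const struct fs_event *ev, void *ctx)
{
    snprintf((char *)ctx, 256, "Problem on %s: %s", ev->device, ev->detail);
}

/* Embedded-style subscriber: request failover on error or corruption.
 * ctx must point to an int flag. */
static void embedded_failover(const struct fs_event *ev, void *ctx)
{
    if (ev->kind == FS_EV_IO_ERROR || ev->kind == FS_EV_CORRUPTION)
        *(int *)ctx = 1;
}
```

A desktop session would subscribe something like desktop_warn, while
an embedded system would subscribe something like embedded_failover;
neither requires changes to ordinary applications.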