November 16, 2010
This article was contributed by Neil Brown
In the second installment of this series, we documented two designs that
were found to be imperfect and have largely (though not completely)
been fixed through ongoing development. Though there was some
evidence that the result was not as elegant as we might have achieved
had the original mistakes not been made, it appears that the current
design is at least adequate and on a path towards being good.
However, there are some design mistakes that are not so easily
corrected. Sometimes a design is of such a character that fixing it
is never going to produce something usable. In such cases it can be
argued that the best way forward is to stop using the old design and
to create something completely different that meets the same need.
In this episode we will explore two designs in Unix which have seen
multiple attempts at fixes but for which it isn't clear that the result
is even heading towards "good". In one case a significant change in
approach has produced a design which is both simpler and more functional
than the original. In the other case, we are still waiting for a
suitable replacement to emerge.
After exploring these two "unfixable designs" we will try to address the
question of how to distinguish an unfixable design from a poor design
which can, as we saw last time, be fixed.
Unix signals
Our first unfixable design involves the delivery of
signals to processes. In particular it is the registration of a
function as a "signal handler" which gets called asynchronously when
the signal is delivered.
That this design was in some way broken is clear from the fact that
the developers at UCB (The University of California at Berkeley, home
of BSD Unix) found the need to introduce the sigvec() system
call, along with a few other calls, to allow individual signals to be
temporarily blocked. They also changed the semantics of some system
calls so that they would restart rather than abort if a signal arrived
while the system call was active.
It seems there were two particular problems that these changes tried
to address.
Firstly there is the question of when to re-arm a signal handler. In
the original Unix design a signal handler was one-shot - it would only
respond the first time a signal arrived. If you wanted to catch a
subsequent signal you would need to make the signal handler explicitly
re-enable itself. This can lead to races: if a signal is delivered
before the signal handler has been re-enabled, it can be lost forever.
Closing these races involved creating a facility for keeping the
signal handler always available, and blocking new deliveries while the
signal was being processed.
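To make the race concrete, here is a minimal sketch assuming the
original one-shot (System V style) semantics; the handler and its
message are purely illustrative:

    /*
     * A sketch of the re-arming race under the original one-shot
     * (System V style) semantics.
     */
    #include <signal.h>
    #include <unistd.h>

    static void on_int(int sig)
    {
        /*
         * Delivery reset the disposition to SIG_DFL, so a second
         * SIGINT arriving before the next line runs takes the default
         * action (here, killing the process) instead of being caught.
         */
        signal(SIGINT, on_int);          /* re-arm: too late to be safe */
        write(1, "caught SIGINT\n", 14);
    }

    int main(void)
    {
        signal(SIGINT, on_int);
        for (;;)
            pause();                     /* wait for signals forever */
    }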
The other problem involves exactly what to do if a signal arrives
while a system call is active. Options include waiting for the system
call to complete, aborting it completely, allowing it to return
partial results, or allowing it to restart after the signal has been
handled. Each of these can be the right answer in different contexts;
sigvec() tried to provide more control so the programmer
could choose between them.
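In modern terms the programmer's two main options look something like
the following sketch, which uses sigaction() (the successor to
sigvec()) and illustrative function names: either retry explicitly
when a call fails with EINTR, or ask the kernel to restart such calls
with SA_RESTART.

    /*
     * Two common answers to an interrupted system call: retry on EINTR,
     * or request automatic restarting with SA_RESTART.
     */
    #include <errno.h>
    #include <signal.h>
    #include <string.h>
    #include <unistd.h>

    static void handler(int sig) { /* ... */ }

    ssize_t read_retrying(int fd, void *buf, size_t len)
    {
        ssize_t n;
        do {
            n = read(fd, buf, len);
        } while (n < 0 && errno == EINTR);   /* retry after a signal */
        return n;
    }

    void install_restarting_handler(void)
    {
        struct sigaction sa;

        memset(&sa, 0, sizeof(sa));
        sa.sa_handler = handler;
        sa.sa_flags = SA_RESTART;            /* kernel restarts slow calls */
        sigemptyset(&sa.sa_mask);
        sigaction(SIGUSR1, &sa, NULL);
    }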
Even these changes, however, were not enough to make signals really
usable, so the developers of System V (at AT&T) found the need for
a sigaction() call which adds some extra flags to control the fine
details of signal delivery. This call also allows a signal handler to be
passed a "siginfo_t" data structure with information about the
cause of the signal, such as the UID of the process which sent the
signal.
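A minimal sketch of that interface, with an illustrative choice of
signal and a printf() that a real handler would avoid (it is not
async-signal-safe):

    /*
     * sigaction() with SA_SIGINFO: the handler receives a siginfo_t
     * describing the sender of the signal.
     */
    #include <signal.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    static void info_handler(int sig, siginfo_t *si, void *ucontext)
    {
        printf("signal %d from pid %ld (uid %ld)\n",
               sig, (long)si->si_pid, (long)si->si_uid);
    }

    int main(void)
    {
        struct sigaction sa;

        memset(&sa, 0, sizeof(sa));
        sa.sa_sigaction = info_handler;      /* three-argument handler */
        sa.sa_flags = SA_SIGINFO;
        sigemptyset(&sa.sa_mask);
        sigaction(SIGUSR1, &sa, NULL);

        for (;;)
            pause();
    }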
As these changes, particularly those from UCB, were focused on
providing "reliable" signal delivery, one might expect that at least
the reliability issues would be resolved. Not so it seems. The
select() system call (and related poll()) did not
play well with signals so pselect() and ppoll() had
to be invented and eventually implemented. The interested reader is
encouraged to explore their history.
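The essence of pselect() is that it installs a different signal mask
atomically for the duration of the wait, closing the window between
testing a flag and going to sleep; a sketch, with an illustrative flag
and signal:

    /*
     * SIGINT stays blocked except during the wait itself, which
     * pselect() performs with an atomically-installed (empty) mask.
     */
    #include <signal.h>
    #include <sys/select.h>

    extern volatile sig_atomic_t got_signal;    /* set by a signal handler */

    int wait_for_fd_or_signal(int fd)
    {
        sigset_t blocked, empty;
        fd_set rfds;

        sigemptyset(&empty);
        sigemptyset(&blocked);
        sigaddset(&blocked, SIGINT);
        sigprocmask(SIG_BLOCK, &blocked, NULL); /* SIGINT blocked from here */

        FD_ZERO(&rfds);
        FD_SET(fd, &rfds);
        if (got_signal)                 /* the handler cannot race this test */
            return -1;
        return pselect(fd + 1, &rfds, NULL, NULL, NULL, &empty);
    }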
Along with these semantic "enhancements" to signal delivery, both
teams of developers chose to define more signals generated by
different events. Though signal delivery was already problematic
before these were added, it is likely that these new demands stretched
the design towards breaking point.
An interesting example is SIGCHLD and SIGCLD, which are sent when a child
exits or is otherwise ready for the parent to wait() for it.
The difference between these two (apart from the letter "H" and
different originating team) is that SIGCHLD is delivered once per event
(as is the case with other signals) while SIGCLD would be delivered
constantly (unless blocked) while any child is ready to be waited for.
In the language of hardware interrupts, SIGCHLD is edge triggered while
SIGCLD is level triggered. The choice of a level-triggered signal
might have been an alternate attempt to try to improve reliability.
Adding SIGCLD was more than just defining a new number and sending
the signal at the right time. Two of the new flags added for
sigaction() are specifically for tuning the details of
handling this signal. This is extra complexity that signals didn't
need and which arguably did not belong there.
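The two flags in question are presumably SA_NOCLDSTOP and
SA_NOCLDWAIT; a sketch of typical child handling shows how they are
used, and why a reaping loop is needed even then (function names are
illustrative):

    /*
     * SA_NOCLDSTOP suppresses notification when a child merely stops;
     * SA_NOCLDWAIT goes further and avoids zombies altogether.
     */
    #include <signal.h>
    #include <string.h>
    #include <sys/wait.h>

    static void reap(int sig)
    {
        /*
         * SIGCHLD is not queued, so one delivery may stand for several
         * exited children: reap in a loop.
         */
        while (waitpid(-1, NULL, WNOHANG) > 0)
            ;
    }

    void setup_child_handling(void)
    {
        struct sigaction sa;

        memset(&sa, 0, sizeof(sa));
        sa.sa_handler = reap;
        sa.sa_flags = SA_NOCLDSTOP;          /* exits only, not stops */
        sigemptyset(&sa.sa_mask);
        sigaction(SIGCHLD, &sa, NULL);
    }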
In more recent years the collection of signal types has been extended
to include "realtime" signals. These signals are user-defined
signals (like SIGUSR1 and SIGUSR2) which are only delivered if
explicitly requested in some way. They have two particular
properties.
Firstly, realtime signals are queued so the handler in the target
process is called exactly as many times as the signal was sent. This
contrasts with regular signals which simply set a flag on delivery.
If a process has a given (regular) signal blocked and the signal is sent several
times, then, when the process unblocks the signal, it will still only
see a single delivery event. With realtime signals it will see
several. This is a nice idea, but introduced new reliability issues
as the depth of the queue was limited, so signals could still be lost.
Secondly (and this property requires the first), a realtime signal can carry
a small datum, typically a number or a pointer. This can be sent
explicitly with sigqueue() or less directly with, e.g.,
timer_create().
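A sketch of both sides of such an exchange, compressed into one
process for brevity (in practice the sender and receiver would
normally be separate; the value 42, the use of SIGRTMIN, and the
printf() are illustrative only):

    /*
     * A realtime signal carrying a small datum: the sender attaches a
     * value with sigqueue() and the handler reads it from si_value.
     */
    #include <signal.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    static void rt_handler(int sig, siginfo_t *si, void *ucontext)
    {
        /* Each queued signal produces one call here, with its own datum. */
        printf("got signal %d carrying %d\n", sig, si->si_value.sival_int);
    }

    void demo(void)
    {
        struct sigaction sa;
        union sigval v;

        memset(&sa, 0, sizeof(sa));
        sa.sa_sigaction = rt_handler;
        sa.sa_flags = SA_SIGINFO;
        sigemptyset(&sa.sa_mask);
        sigaction(SIGRTMIN, &sa, NULL);

        v.sival_int = 42;
        sigqueue(getpid(), SIGRTMIN, v);     /* queued, not merged */
    }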
It could be thought that this addition of more signals for more events is
a good example of the "full exploitation" pattern that was discussed
at the start of this series. However, when adding new signal types
requires significant changes to the original design, it could equally
seem that the original design wasn't really strong enough to be so
fully exploited.
As can be seen from this retrospective, though the original signal
design was quite simple and elegant, it was fatally flawed. The need
to re-arm signals made them hard to use reliably, the exact semantics of
interrupting a system call were hard to get right, and developers
repeatedly needed to significantly extend the design to make it work
with new types of signals.
The most recent step in the saga of signals is the signalfd()
system call which was introduced to Linux in 2007 for 2.6.22. This
system call extends "everything has a file descriptor" to work for
signals too. Using this new type of descriptor returned by
signalfd(), events that would normally be
handled asynchronously via signal handlers can now be handled
synchronously just like all I/O events. This approach makes many of the
traditional difficulties with signals disappear. Queuing becomes
natural so re-arming becomes a non-issue. Interaction with system
calls ceases to be interesting and an obvious way is provided for extra
data to be carried with a signal. Rather than trying to fix a
problematic asynchronous delivery mechanism, signalfd()
replaces it with a synchronous mechanism that is much easier to work
with and which integrates well with other aspects of the Unix design -
particularly the universality of file descriptors.
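A minimal sketch of this synchronous style, with an illustrative
choice of signals and minimal error handling:

    /*
     * Handling SIGINT and SIGTERM synchronously through signalfd().
     */
    #include <signal.h>
    #include <stdio.h>
    #include <sys/signalfd.h>
    #include <unistd.h>

    int main(void)
    {
        sigset_t mask;
        struct signalfd_siginfo si;
        int sfd;

        sigemptyset(&mask);
        sigaddset(&mask, SIGINT);
        sigaddset(&mask, SIGTERM);
        /* Block normal delivery; the signals will be read from the fd. */
        sigprocmask(SIG_BLOCK, &mask, NULL);

        sfd = signalfd(-1, &mask, 0);
        if (sfd < 0)
            return 1;

        /* The descriptor works with select()/poll()/epoll like any
         * other; a plain blocking read() is enough here. */
        while (read(sfd, &si, sizeof(si)) == (ssize_t)sizeof(si)) {
            printf("signal %u from pid %u\n",
                   (unsigned)si.ssi_signo, (unsigned)si.ssi_pid);
            if (si.ssi_signo == SIGTERM)
                break;
        }
        return 0;
    }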
It is a fun, though probably pointless, exercise to imagine what the
result might have been had this approach been taken to signals when
problems were first observed. Instead of adding new signal types, we
might have added new file descriptor types, and the set of signals
actually in use could have diminished rather than grown. Realtime
signals might instead be a general and useful form of interprocess
communication based on file descriptors.
It should be noted that there are some signals which
signalfd() cannot be used for. These include SIGSEGV, SIGILL,
and other signals that are generated because the process tried to do
something impossible. Just queueing these signals to be processed
later cannot work; the only alternatives are switching control to a
signal handler or aborting the process. These cases are handled
perfectly by the original signal design. They cannot occur while a
system call is active (system calls return EFAULT rather than raising a
signal) and issues with when to re-arm the signal handler are also less
relevant.
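For completeness, here is a sketch of that remaining good use of an
asynchronous handler; reporting a message and exiting is just one
reasonable policy, and the names are illustrative:

    /*
     * Catching SIGSEGV to report something useful before giving up.
     */
    #include <signal.h>
    #include <string.h>
    #include <unistd.h>

    static void segv_handler(int sig, siginfo_t *si, void *ucontext)
    {
        static const char msg[] = "fatal: invalid memory access\n";

        /* write() and _exit() are async-signal-safe; si->si_addr holds
         * the faulting address and could go into a real crash report. */
        write(2, msg, sizeof(msg) - 1);
        _exit(1);
    }

    void install_segv_handler(void)
    {
        struct sigaction sa;

        memset(&sa, 0, sizeof(sa));
        sa.sa_sigaction = segv_handler;
        sa.sa_flags = SA_SIGINFO;
        sigemptyset(&sa.sa_mask);
        sigaction(SIGSEGV, &sa, NULL);
    }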
So while signal handlers are perfectly workable for some of the early
use cases (e.g. SIGSEGV) it seems that they were pushed beyond their
competence very early, thus producing a broken design for which there
have been repeated attempts at repair. While it may now be possible to
write code that handles signal delivery reliably, it is still very easy
to get it wrong. The replacement that we find in signalfd()
promises to make event handling significantly easier and so more
reliable.
The Unix permission model
Our second example of an unfixable design which is best replaced is
the owner/permission model for controlling access to files.
A well known quote attributed to H. L. Mencken is "there is always a
well-known solution to every human problem - neat, plausible, and
wrong." This is equally true of computing problems, and the Unix
permissions model could be just such a solution.
The initial idea is deceptively simple: six bytes per file gives simple
and broad access control. When designing an operating system to fit in 32 kilobytes
of RAM (or less), such simplicity is very appealing, and thinking about
how it might one day be extended is not a high priority, which is
understandable though unfortunate.
The main problem with this permission model is that it is both too
simple and too broad.
The breadth of the model is seen in the fact that every file stores its
own owner, group owner, and permission bits. Thus every file can have
distinct ownership or access permissions. This is much more flexibility
than is needed. In most cases, all the files in a given directory, or
even a whole directory tree, have the same ownership and much the same
permissions. This fact was leveraged by the Andrew filesystem which
only stores ownership and permissions on a per-directory basis, with
little real loss of functionality.
When this only costs six bytes per file it might seem a small price to pay
for the flexibility. However once more than 65,536 different owners are
wanted, or more permission bits and more groups are needed, storing this information begins
to become a real cost. But the bigger cost is in usability.
While a computer may be able to easily remember six bytes per file, a
human cannot easily remember why various different settings might have
been assigned, and so is very likely to create sets of permission
settings which are inconsistent, inappropriate, and hence not
particularly secure. Your author has memories from University days of
often seeing home directories given "0777" permissions (granting
everyone full access) simply because a student wanted to share one file with a
friend, but didn't understand the security model.
The excessive simplicity of the Unix permission model is seen in the
fixed, small number of permission bits, and, particularly, that there is
only one "group" that can have privileged access. Another maxim from
computer engineering, attributed to Alan Kay, is that "Simple things
should be simple, complex things should be possible." The Unix
permission model makes most use cases quite simple but once the need
exceeds that common set of cases, further refinement becomes
impossible. The simple is certainly simple, but the complex is truly
impossible.
It is here that we start to see real efforts to try to "fix" the model.
The original design gave each process a "user" and a "group"
corresponding to the "owner" and "group owner" in each file, and they
were used to determine access. The "only one group" limit is limiting
on both sides; the Unix developers at UCB saw that, for the process
side at least, this limit was easy to extend. They allowed a process to have a
list of groups for checking filesystem access against. (Unfortunately
this list originally had a firm upper limit of 16, and that limit made
its way into the NFS protocol, where it was hard to change and is still
biting us today.)
Changing the per-file side of this limit is harder as that requires
changing the way data is encoded in a filesystem to allow multiple
groups per file. As each group would also need its own set of
permission bits, a file would need a list of group and permission-bit
pairs; these became known, quite reasonably, as "access control lists" or
ACLs. The POSIX standardization effort made a couple of attempts to
create a standard for ACLs, but never got past draft stage. Some Unix
implementations have implemented these drafts, but they have not been
widely successful.
The NFSv4 working group (under the IETF umbrella) were tasked with
creating a network filesystem which, among other goals, would provide
interoperability between POSIX and WIN32 systems. As part of this
effort they developed yet another standard for ACLs which aimed to
support the access model of WIN32 while still being usable on POSIX.
Whether this will be more successful remains to be seen, but it seems
to have a reasonable amount of momentum with an active project trying
to integrate it into Linux (under the banner of "richacls") and
various Linux filesystems.
One consequence of using ACLs is that the per-file storage space needed
to store the permission information is not only larger than six bytes, it
is not of a fixed length. This is, in general, more challenging than any
fixed size. Those filesystems which implement these ACLs do so using
"extended attributes" and most impose some limit on the size of these
- each filesystem choosing a different limit. Hopefully most ACLs that
are actually used will fit within all these arbitrary limits.
Some filesystems - ext3 at least - attempt to notice when multiple
files have the same extended attributes and just store a single copy of those
attributes, rather than one copy for each file. This goes some way to
reduce the space cost (and access-time cost) of larger ACLs that can
be (but often aren't) unique per file, but does nothing to address the
usability concerns mentioned earlier.
In that context, it is worth quoting Jeremy Allison, one of the main
developers of Samba, and so with quite a bit of experience with ACLs
from WIN32 systems and related interoperability issues.
He
writes: "But Windows ACLs are a nightmare beyond human
comprehension :-). In the 'too complex to be usable' camp."
It is worth reading the context and follow up to get a proper picture,
and remembering that richacls, like NFSv4 ACLs, are largely based on
WIN32 ACLs.
Unfortunately it is not possible to present any real example of
replacing rather than fixing the Unix permission model. One contender
might be that part of "SELinux" that deals with file access. This
doesn't really aim to replace regular permissions but rather tries to
enhance them with mandatory access controls. SELinux follows much the
same model as Unix permissions, associating a security context with
every file of interest, and does nothing to address the usability
issues.
There are however two partial approaches that might provide some
perspective.
One partial approach began to appear in Seventh Edition Unix with the
chroot() system call. It
appears
that chroot() wasn't originally created for access control but rather to
have a separate namespace in which to create a clean filesystem for
distribution. However it has since been used to provide some level of
access control, particularly for anonymous FTP servers. This is done by
simply hiding all the files that the FTP server shouldn't access.
Anything that cannot be named cannot be accessed.
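A sketch of that pattern as such a server might use it (the directory
and uid are illustrative, and error handling is minimal):

    /*
     * chroot()-style confinement: anything outside the new root has no
     * name and so cannot be accessed.
     */
    #include <stdlib.h>
    #include <sys/types.h>
    #include <unistd.h>

    void confine_to(const char *dir, uid_t unprivileged_uid)
    {
        if (chroot(dir) != 0 || chdir("/") != 0)
            exit(1);                    /* must not continue unconfined */
        /* Drop root privileges afterwards; a root process can escape a
         * chroot jail fairly easily. */
        if (setuid(unprivileged_uid) != 0)
            exit(1);
    }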
This concept has been enhanced in Linux with the possibility for each
process not just to have its own filesystem root, but also to have a
private set of mount points with which to build a completely
customized namespace. Further it is possible for a given filesystem
to be mounted read-write in one namespace and read-only in another
namespace, and, obviously, not at all in a third.
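A Linux-specific sketch of that combination, using unshare() and a
read-only bind mount (the path is illustrative; on systems where
mounts are shared by default, the namespace would also need to be
marked private with MS_REC | MS_PRIVATE on "/"):

    /*
     * Give this process a private set of mounts, then make one tree
     * appear read-only within that namespace only.
     */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <sys/mount.h>

    int restrict_view(void)
    {
        if (unshare(CLONE_NEWNS) != 0)       /* private mount namespace */
            return -1;
        /* Bind the tree onto itself, then remount the binding read-only;
         * other namespaces continue to see it read-write. */
        if (mount("/srv/data", "/srv/data", NULL, MS_BIND, NULL) != 0)
            return -1;
        if (mount("/srv/data", "/srv/data", NULL,
                  MS_REMOUNT | MS_BIND | MS_RDONLY, NULL) != 0)
            return -1;
        return 0;
    }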
This functionality is suggestive of a very different approach to
controlling access permissions. Rather than access control being
per-file, it allows it to be per-mount. This leads to the location of
a file being a very significant part of determining how it can be
accessed. Though this removes some flexibility, it seems to be a
concept that human experience better prepares us to understand. If we
want to keep a paper document private we might put it in a locked
drawer. If we want to make it publicly readable, we distribute copies.
If we want it to be writable by anyone in our team, we pin it to the
notice board in the tea room.
This approach is clearly less flexible than the Unix model as the
control of permissions is less fine grained, but it could well make up
for that in being easier to understand. Certainly by itself it would
not form a complete replacement, but it does appear to be functionality
that is growing - though it is too early yet to tell if it will need to
grow beyond its strength. One encouraging observation is that it is
based on one of those particular Unix strengths observed in our first
pattern, that of "a hierarchical namespace", which would thus be
exploited more fully.
A different partial approach can be seen in the access controls used
by the Apache web server. These are encoded in a domain-specific
language and stored in centralized files or in ".htaccess" files near the
files that are being
controlled. This method of access control has a number of real
strengths that would be a challenge to encode into anything based on
the Unix permission model:
-
The permission model is hierarchical, matching the filesystem
model. Thus controls can be set at whichever point makes most sense,
and can be easily reviewed in their entirety. When the controls
set at higher levels are not allowed to be relaxed at lower levels it
becomes easy to implement mandatory access controls.
-
The identity of the actor requesting access can be arbitrary,
rather than just from the set of identities that are known to the
kernel. Apache allows control based on source IP address or
username plus password; with plug-in modules, almost anything
else that is available can be used.
-
Access can be provided indirectly through a CGI program. Thus,
rather than trying to second-guess all possible access
restrictions that might be desirable and define permission bits
for them in a new ACL, the model can allow any arbitrary action to
be controlled by writing a suitable script to mediate that access.
It should be fairly obvious that this model would not be an easy fit
with kernel-based access checking and, in any case, would have a higher
performance cost than a simpler model. As such it would not be suitable
to apply universally. However it could be that such a model would be
suitable for that small percentage of needs that do not fit in a simple
namespace based approach. There the cost might be a reasonable price
for the flexibility.
While an alternate approach such as these might be appealing, it would
face a much bigger barrier to introduction than signalfd() did.
signalfd() could be added as a simple alternative to signal
handlers. Programs could continue to use the old model with no loss,
while new programs could make use of the new functionality. With
permission models, it is not so easy to have two schemes running in
parallel. People who make serious use of ACLs will probably already
have a bunch of ACLs carefully tuned to their needs, and enabling an
alternate parallel access mechanism is very likely to break something.
So this is the sort of thing that would best be trialed in a new
installation rather than imposed on an existing user-base.
Discerning the pattern
If we are to have a convincing pattern of "unfixable designs" it must
be possible to distinguish them from fixable designs such as those
that we found last time. In both cases, each individual fix appears
to be a good idea addressing a real problem without
obviously introducing more problems. In some cases this series
of small steps leads to a good result; in others these steps only help you
get past the small problems enough to be able to see the bigger
problem.
We could use mathematical terminology to note that a local maximum can
be very different from a global maximum. Or, using mountain-climbing
terminology, it is hard to tell a true summit from a false summit,
which just gives you a better view of the mountain.
In each case the missing piece is a large scale perspective. If we can
see the big picture we can more easily decide if a particular path will
lead anywhere useful or if it is best to head back to base and start
again.
Trying to move this discussion back to the realm of software engineering, it is
clear that we can only head off unfixable designs if we can find a
position that can give us a clear and broad perspective. We need to be
able to look beyond the immediate problem, to see the big picture and
be willing to tackle it. The only known source of perspective we have
for engineering is experience, and few of us have enough experience to
see clearly into the multiple facets and the multiple levels of
abstraction that are needed to make right decisions. Whether we look
for such experience by consulting elders, by researching multiple
related efforts, or by finding documented patterns that encapsulate the
experience of others, it is vitally important to leverage any
experience that is available rather than run the risk of simply adding
bandaids to an unfixable design.
So there is no easy way to distinguish an unfixable design from a
fixable one. It requires leveraging the broad perspective that is
only available through experience.
Having seen the difficulty of identifying unfixable designs early we can
look forward to the final part of this series, where we will explore a
pernicious pattern in problematic design. While unfixable designs give a
hint of deeper problems by appearing to need fixing, these next designs do
not even provide that hint. The hints that there is a deeper problem must
be found elsewhere.
Exercises
-
Though we found that signal handlers had been pushed well beyond their
competence, we also found at least one area (i.e. SIGSEGV) where they
were still the right tool for the job. Determine if there are other
use cases that avoid the observed problems, and so provide a balanced
assessment of where signal handlers are effective, and where they are
unfixable.
-
Research problems with "/tmp", attempts to fix them, any
unresolved issues, and any known attempts to replace rather than fix
this design.
-
Describe an aspect of the IP protocol suite that fits the pattern
of an "Unfixable design".
-
It has been suggested that dnotify, inotify, and fanotify are all broken.
Research and describe the problems and provide an alternate design that
avoids all of those issues.
- Explore the possibility of using fanotify to implement an "apache-like"
access control scheme with decisions made in user-space. Identify
enhancements required to fanotify for this to be practical.
Next article
Ghosts of Unix past, part 4: High-maintenance
designs