April 17, 2013
This article was contributed by Neil Brown
Rust,
the new programming language being
developed by the Mozilla project,
has a number of interesting features. One that stands out is the
focus on safety. There are clear attempts
to increase the range of errors that the compiler can detect and
prevent, and thereby reduce the number of errors that end up in
production code.
There are two general ways that a language design can help increase
safety: it can make it easier to write good code and it can make it
harder to write bad code. The former approach has seen much success
over the years as structured programming, richer type systems, and
improved encapsulation have made it easier for a programmer to express
their goals using high-level concepts. The latter approach has not
met with quite such wide success, largely because it requires removing
functionality that easily leads to bad code. The problem there, of
course, is that potentially dangerous functionality can also be
extremely useful. The classic example is the goto statement,
which has long been known to be an easy source of errors, but which
most major procedural languages still include as they would quickly hit
difficulties without it (though Rust, like Python, has avoided adding
goto
so far).
Nevertheless it is this second approach that Rust appears to focus on.
A number of features that are present in languages like C, and some
that are still present in more safety-conscious languages like Java,
are disallowed or seriously restricted in Rust. To compensate, Rust
endeavors to be a sufficiently rich language that those features are
not necessary. One way to look at this focus is that it endeavors to
produce compile-time errors where other languages would generate
runtime errors (or core dumps). This not only reduces costs at
runtime, but should decrease the number of support calls from
customers.
A large part of this richness is embodied in the type system. Rust
has many of the same sorts of types that other languages do
(structures, arrays, pointers, etc.), but they are often, as we shall
see, somewhat narrower or more restricted than is common. Coupled
with this are various ways to broaden a type, in a controlled way, so
that its shape fits more closely to what the programmer really wants
to say.
And for those cases where Rust turns out not to be sufficiently rich,
there is a loophole.
Rigidly defined areas of doubt and uncertainty
In a move that strongly echoes the "System" module in Modula-2 (which
provides access to so-called "low-level" functionality), Rust allows
functions and code sections to be declared as "unsafe". Many of the
restrictions that serve to make Rust a safer language do not apply
inside "unsafe" code.
This could be viewed as an admission of failure to provide a truly
safe language; that view, however, would be overly simplistic. In the
first place, Gödel's
incompleteness theorem seems to suggest that
any programming language which is rich enough to say all that we
might want to say is also so rich that it cannot possibly be
considered to be "safe". But on a more practical level, a moment's
thought will show that code inside the language compiler itself is
deeply "unsafe" in the context of the program that it generates. A
bug there could easily cause incorrect behavior without any hope of a
compiler being able to detect it. So complete safety is impossible.
Given that the reality of "unsafe" code is unavoidable, it is not
unreasonable to allow some of it to appear in the target program, or
in libraries written in the target language, instead of all of it
being in the compiler. This choice to have the language be safe
by default allows most code to be safe while providing clearly
defined areas where unchecked code is allowed. This, at the very
least, makes it easier to identify which areas of code need particular
attention in code reviews.
If we look at the
Rust compiler,
which itself is written in Rust, we
find that of the 111 source code files (i.e. the src/librustc
directory tree, which excludes the library and runtime support), 33 of
them contain the word "unsafe". Nearly 30% of the files potentially
containing unsafe code may seem like a lot, even for a compiler, but
the remaining 70% being guaranteed not to contain certain classes of
errors certainly sounds like a good thing.
Safer defaults
As we turn to look at the particular ways in which Rust encourages
safe code, the first is not so much a restriction as a smaller scale
version of the choice to default to increased safety.
While many languages have keywords like "const" (C, Go) or "final"
(Java) to indicate that a name, once bound, will always refer to the
same value, Rust takes the reverse approach. Bindings are immutable
by default and require the keyword "mut" to make them mutable, thus:
let pi = 3.1415926535;
will never allow the value of
pi to change while:
let mut radius = 1.496e11;
will allow
radius to be altered later. The choice of
mut, which
encourages the vowel to sound short, in contrast to the long vowel in
"mutable", may be unfortunate, but encouraging the programmer to think
before allowing something to change is probably very wise.
For some data structures, the "mutable" setting is inherited through
references and structures to member elements, so that if an immutable
reference is held to certain objects, the compiler will ensure that the
object as a whole will remain unchanged.
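A minimal sketch of this inherited mutability (Rust's syntax has changed since 0.6, so this and later sketches use present-day spelling; Point and bump_x are hypothetical names):

```rust
struct Point { x: i64, y: i64 }

// Mutability is a property of the binding: when a parameter or
// binding is declared `mut`, the fields of the struct become
// writable through that binding too.
fn bump_x(mut p: Point) -> Point {
    p.x += 1; // allowed because the parameter is declared `mut`
    p
}

fn main() {
    let origin = Point { x: 0, y: 0 };
    // origin.x = 1; // compile-time error: `origin` is immutable
    let moved = bump_x(origin);
    assert_eq!(moved.x, 1);
}
```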
Pointers are never NULL
Pointer errors are a common class of errors in code written in C and
similar languages, and many successors have taken steps to control these
by limiting or forbidding pointer arithmetic. Rust goes a step
further and forbids NULL (or nil) pointers as well, so that any attempt
to dereference a pointer is guaranteed to find a valid object. The
only time a variable or field does not contain a valid pointer is
when the compiler knows that it does not, in which case any attempted
dereference results in a compile-time error.
There are of course many cases where "valid pointer or NULL" is a very useful type, such as in a linked list or binary tree data structure. For these cases, the Rust core library provides the "Option" parameterized type.
If T is any type, then Option<T> is a variant
record (sometimes called a "tagged union" but called an "enum" in Rust),
which contains a value of type T associated with the tag "Some", or
it contains just the tag
"None". Rust provides a "match" statement, which is a generalization of "case" or "typecase" from other languages, that can be used to test the tag and access the contained value if it is present.
Option is defined as:
pub enum Option<T> {
None,
Some(T),
}
so:
struct element {
value: int,
next: Option<~element>,
}
could be used to make a linked list of integers. The tilde in front of
element indicates a pointer to the structure, similar to an asterisk
in C. There are different types of
pointers which will be discussed later — tilde (
~) introduces an
"owned" pointer while the at sign (
@) introduces a "managed"
pointer.
Thus nullable pointers are quite possible in Rust, but the default is the safer option of a pointer that can never be NULL, and when a nullable pointer is used, it must always be explicitly tested before that use. For example:
match next {
Some(e) => io::println(fmt!("%d", e.value)),
None => io::println("No value here"),
}
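A linked list of this shape can be walked safely with match; a sketch in present-day syntax, where Box is the current spelling of the owned pointer ~ (Element and sum are illustrative names):

```rust
struct Element {
    value: i64,
    next: Option<Box<Element>>,
}

// Sum a linked list: every `next` pointer must be tested for None
// before the element behind it can be touched.
fn sum(list: &Option<Box<Element>>) -> i64 {
    match list {
        Some(e) => e.value + sum(&e.next),
        None => 0,
    }
}

fn main() {
    let list = Some(Box::new(Element {
        value: 1,
        next: Some(Box::new(Element { value: 2, next: None })),
    }));
    assert_eq!(sum(&list), 3);
}
```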
In contrast to this, array (or, as Rust calls them, "vector") accesses
do not require the compiler to know that the index is in range before
the element is accessed. It would presumably be possible to use the type
system to ensure that indexes are in range in many cases, and to require
explicit tests for cases where the type system cannot be sure — similar
to the approach taken for pointers.
Thus, while it is impossible to get a runtime error for a NULL pointer
dereference, it is quite possible to get a runtime error for an array
bounds error. It is not clear whether this is a deliberate omission, or
whether it is something that might be changed later.
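The contrast can be seen directly in a small sketch (present-day syntax; the `get` method gives an Option-based checked access alongside plain indexing):

```rust
fn main() {
    let v = [10, 20, 30];

    // Checked access: `get` returns an Option, so an out-of-range
    // index must be handled explicitly, as with pointers.
    match v.get(5) {
        Some(x) => println!("found {}", x),
        None => println!("index out of range"),
    }

    // Direct indexing compiles even when the index may be out of
    // range; `v[5]` here would fail only at runtime.
    let i = 2;
    assert_eq!(v[i], 30);
}
```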
Parameterized Generics
Unsurprisingly, Rust does not permit "void" pointers or down casts
(casts from a more general type, like void, to a more specific type).
The use case that most commonly justifies these — the writing of
generic data structures — is supported using parameterized types and
parameterized functions. These are quite similar to Generic Classes
in Java and Templates in C++. Generics in Rust have a simple syntax;
any individual non-scalar type, and any function or method, can have type parameters.
A toy example might be:
struct Pair<t1,t2> {
first: t1,
second: t2,
}
fn first<t1:Copy,t2:Copy>(p: Pair<t1,t2>) -> t1 {
return copy p.first;
}
We can then declare a variable:
let foo = Pair { first:1, second:'2'};
noting that the type of
foo is not given explicitly, but instead is
inferred from the initial value. The declaration could read:
let foo:Pair<int, char> = Pair { first:1, second:'2'};
but that would be needlessly verbose.
We can then call
first(foo);
and the compiler will know that this returns an
int.
In fact it will return a copy of an
int, not a reference to one.
Rust allows all type declarations (struct, array, enum), functions, and
traits (somewhat like "interfaces" or "virtual classes") to be
parameterized by types. This should mean that the compiler always knows
enough about the type of every value to avoid the need for down casts.
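The toy Pair example survives in today's Rust with only cosmetic changes; a sketch, with the Clone trait standing in for the 0.6-era copy operation:

```rust
struct Pair<T1, T2> {
    first: T1,
    second: T2,
}

// The Clone bound replaces 0.6's `copy`: the function returns a
// copy of the first field, not a reference to it.
fn first<T1: Clone, T2>(p: &Pair<T1, T2>) -> T1 {
    p.first.clone()
}

fn main() {
    // The type Pair<i64, char> is inferred from the initializer.
    let foo = Pair { first: 1i64, second: '2' };
    assert_eq!(first(&foo), 1);
}
```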
Keep one's memories in order
The final safety measure from Rust that we will examine relates to memory
allocation and particularly how that interacts with concurrency.
Like many modern languages, Rust does not allow the programmer to
explicitly release (or "free" or "dispose" of) allocated memory.
Rather, the language and runtime support take care of that. Two
mechanisms are provided which complement the inevitable "stack"
allocation of local variables.
Both of these mechanisms explicitly relate to the multiprocessing
model that is based on "tasks", which are somewhat like "goroutines"
in Go, or
Green threads.
They are lightweight and can be cooperatively scheduled within one
operating system thread, or across a small pool of such threads.
The first mechanism provides "managed" allocations which are subject
to garbage collection. These will be destroyed (if a destructor is
defined) and freed sometime after the last active reference is
dropped. Managed allocations are allocated on a per-task heap and
they can only ever be accessed by code running in that task. When a
task exits, all allocations in that task's heap are guaranteed to
have been freed. The current implementation uses reference counting
to manage these allocations, though there appears to be an intention to
change to a mark/sweep scheme later.
The second mechanism involves using the type system to ensure that
certain values only ever have one (strong) reference. When that
single reference is dropped, the object is freed. This is somewhat
reminiscent of the
"Once upon a type" optimization described by David
Turner et al. for
functional languages. However, that system automatically inferred
which values only ever had a single reference, whereas Rust requires
that this "once" or "owned" (as Rust calls it) annotation be explicit.
These single-reference allocations are made on a common heap and can
be passed between tasks. As only one reference is allowed, it is not
possible for two different tasks to both access one of these owned
objects, so concurrent access, and the races that so often accompany
it, are impossible.
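In practice this shows up as move semantics: handing an owned value to another task transfers it, and the original binding can no longer be used. A sketch using present-day Rust threads in place of 0.6-era tasks:

```rust
use std::thread;

fn main() {
    let data = vec![1i64, 2, 3]; // an owned allocation

    // `move` transfers ownership of `data` into the new task; after
    // this point the spawning task can no longer touch it, so no
    // concurrent access is possible.
    let handle = thread::spawn(move || data.iter().sum::<i64>());

    let total = handle.join().unwrap();
    assert_eq!(total, 6);
    // println!("{:?}", data); // compile-time error: value was moved
}
```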
May I borrow your reference?
Owned values can have extra references provided they are "borrowed"
references. Borrowed references are quite weak and can only be held
in the context of some specific strong reference. As soon as the
primary reference is dropped or goes out of scope, any references
borrowed from it become invalid and it is a compile-time error if they
are used at all.
It is often convenient to pass borrowed references to functions and,
more significantly, to return these references from functions. This
is particularly useful as each of owned, managed, and on-stack references
can equally be borrowed, so passing a borrowed reference can work
independently of the style of allocation.
The passing of borrowed references is easily managed by annotating the
type of the function parameter to say that the reference is a borrowed
reference. The compiler will then ensure nothing is done with it that
is not safe.
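A sketch of such an annotation (present-day syntax; the ampersand marks the parameter as a borrowed reference, so the function may read the value but cannot keep or free it):

```rust
// `total` borrows the slice: the caller retains ownership, and the
// compiler checks that the borrow does not outlive the call.
fn total(values: &[i64]) -> i64 {
    values.iter().sum()
}

fn main() {
    let owned = vec![4, 5, 6];   // owned allocation
    let on_stack = [1i64, 2, 3]; // stack allocation
    // The same borrowed-reference parameter serves both.
    assert_eq!(total(&owned), 15);
    assert_eq!(total(&on_stack), 6);
}
```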
Returning borrowed references is a little trickier as the compiler
needs to know not only that the returned reference is borrowed, but
also needs to know something about the lifetime of the reference.
Otherwise it cannot ensure that the returned reference isn't used
after the primary reference has been dropped.
This lifetime information is passed around using the same mechanism as
type-parameters for functions. A function can be declared with one or
more lifetime parameters (introduced by a single quote: '), and the
various arguments and return values can then be tagged with these
lifetimes. Rust uses type inference on the arguments to determine the
actual values of these parameters and thus the lifetime of the
returned value. For example, the function choose_one declared as:
fn choose_one<'r>(first: &'r Foo, second: &'r Foo) -> &'r Foo
has one lifetime parameter (
r) which applies to both
arguments (
first and
second), and to the result. The
ampersand (
&) here indicates a borrowed reference.
When it is called, the lifetime parameter is implicit, not explicit:
baz = choose_one(bar, bat);
The compiler knows that the lifetime of
r must be compatible with the
lifetimes of
bar and
bat so it effectively computes the
intersection of those. The result is associated with
baz as its
lifetime. Any attempt to use it beyond its lifetime will cause a
compile-time error.
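The choose_one declaration compiles as written in today's Rust; a complete sketch (with a hypothetical Foo type) makes the borrow-and-return pattern concrete:

```rust
struct Foo { weight: i64 }

// The result borrows from whichever argument is returned, so it
// shares the lifetime 'r common to both arguments.
fn choose_one<'r>(first: &'r Foo, second: &'r Foo) -> &'r Foo {
    if first.weight >= second.weight { first } else { second }
}

fn main() {
    let bar = Foo { weight: 10 };
    let bat = Foo { weight: 7 };
    let baz = choose_one(&bar, &bat);
    assert_eq!(baz.weight, 10);
    // `baz` may not outlive `bar` and `bat`; the compiler rejects
    // any use of it after either of them is dropped.
}
```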
Put another way, Rust uses the lifetime parameter information to infer
the lifetime of the function's result, based on the function's
arguments. This is similar to the inference of the type of foo in
our Pair example earlier. A more thorough explanation of these lifetimes can
be found in the "Borrowed Pointers Tutorial" referenced below.
This extra annotation may feel a little clumsy, but it seems a
relatively small price to pay for the benefit of being able to give
the compiler enough information that it can verify all references are
valid.
Time for that loophole
Allowing objects to be allocated on the stack, in a per-task heap, or
in a common heap provided there is only ever one (strong) reference
ensures that there will be no errors due to races with multiple
tasks accessing the same data, but the price seems a little high to
pay — it seems to mean that there can be no sharing of data between
tasks at all.
The key to solving this conundrum is the "unsafe" code regions that
were mentioned earlier. Unsafe code can allocate "raw" memory which
has no particular affiliation or management, and can cast "raw"
pointers into any other sort of pointer.
This allows, for example, the creation of a "queue" from which two
owned references could be created, one of a type that allows
references to be added to the queue, one which allows references to be
removed. The implementation of this queue would require a modest
amount of "unsafe" code, but any client of the queue could be entirely
safe. The end points can be passed around from task to task, stored
in some data structure, and used to move other owned
references between tasks. When the last reference is dropped
the "queue" implementation will be told and can respond appropriately.
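In today's standard library this role is played by channels, which descend from the pipes described here; a sketch of two tasks sharing data only through owned end points:

```rust
use std::sync::mpsc::channel;
use std::thread;

fn main() {
    // `tx` and `rx` are the two owned end points of the queue: one
    // allows values to be added, the other allows them to be removed.
    let (tx, rx) = channel();

    // The sending end point is moved into another task.
    let sender = thread::spawn(move || {
        for n in 0..5 {
            tx.send(n).unwrap(); // ownership of each value moves too
        }
    });

    // The iterator ends when the sending end point is dropped.
    let received: Vec<i32> = rx.iter().collect();
    sender.join().unwrap();
    assert_eq!(received, vec![0, 1, 2, 3, 4]);
}
```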
The standard library for Rust implements just such a data structure, which it calls a "pipe". Other more complex data structures could
clearly be implemented to allow greater levels of sharing, while
making sure the interface is composed only of owned and managed
references, and thus is safe from unplanned concurrent access and from
dangling pointer errors.
Is safety worth the cost?
Without writing a large body of code in Rust it is impossible to get a
good feel for how effective the various safety features of the
language really are. Having a general ideological preference for
strong typing, I expect I would enjoy the extra expressiveness and
would find the extra syntax it requires to be a small price to pay.
The greatest irritation would probably come from finding the need
to write too much (or even any) unsafe code, or from having to give up
all safety guarantees when wanting to relax only one.
Of course, if that led to practical ideas for making the language able
to make richer assertions about various data, or richer assertions
about the level of safety, and could thus enlarge the scope for safe
code and decrease the need for unsafe code, then the irritation
might well result in some useful pearls.
After all, Rust is only at release "0.6" and is not considered to be
"finished" yet.
References
Rust tutorial
Borrowed pointers tutorial
Rust Reference Manual
Blog post about Rust memory management
Another blog post about Rust