June 8, 2013
This article was contributed by Neil Brown
The designers of a new programming language are probably most interested in
the big features — the things that just couldn't be done with whichever
language they are trying to escape from. So they are probably
thinking of the type system, the data model, the concurrency support,
the approach to polymorphism, or whatever it is that they feel will
affect the expressiveness of the language in the way they want.
There is a good chance they will also have a pet peeve about syntax,
whether it relates to the exact meaning of the humble semicolon, or
some abhorrent feature such as the C conditional expression which (they feel)
should never be allowed to see the light of day again.
However, designing a language requires more than just addressing the
things you care about. It requires making a wide range of decisions
concerning various sorts of abstractions, and making sure the choices
all fit together into a coherent, and hopefully consistent, whole.
One might hope that, with over half a century of language development
behind us, there would be some established norms which can be simply
taken as "best practice" without further concern. While this is
true to an extent, there appears to be plenty of room for languages to
diverge even on apparently simple concepts.
Having begun an exploration of the relatively new languages Rust and Go
and, in particular, having two languages to provide illuminating
contrasts, it seems apropos to examine some of those language features
that we might think should be uncontroversial to see just how
uniform they have, or have not, become.
Comments
When first coming to C from Pascal, the usage of braces can be a bit of
a surprise. While Pascal sees them as one option for enclosing
comments, C sees them as a means of grouping statements. This harsh
conflict between the languages is bound to cause confusion, or at
least a little friction, when moving from one language to the next,
but fortunately appears to be a thing of the past.
One last vestige of this sort of confusion can be seen in the
configuration files for BIND, the Berkeley Internet Name Domain server.
In the BIND configuration files semicolons are used as statement
terminators while in the database files they introduce comments.
When not hampered by standards conformance as these database files
are, many languages have settled on C-style block comments:
/* This is a comment */
and C++-style one-line comments:
// This line has a comment
these having won over from the other Pascal option of:
(* similar but different block comments *)
and Ada's:
-- again a similar yet different single line comment.
The other popular alternative is to start comments with a "#"
character, which is a style championed by the C-shell and Bourne shell, and
consequently used by many scripting languages.
Thankfully the idea of starting a comment with "COMMENT" and ending
with "TNEMMOC" never really took off and may be entirely apocryphal.
Both Rust and Go have embraced these trends, though not as fully as
BIND configuration files and other languages like Crack which allow
all three (/* */, //, #). Rust and Go only support the C
and C++ styles.
Go doesn't use the "#" character at all, allowing it only inside comments
and string constants, so it is available as a comment character for a
future revision, or maybe for something else.
Rust has another use for "#" which is slightly reminiscent of its use by the
preprocessor in C. The construct:
#[attribute....]
attaches arbitrary metadata to nearby parts of the program which can
enable or disable compiler warnings, guide conditional compilation,
specify a license, or any of various other things.
Identifiers
Identifiers are even more standard than comments. Any combination of
letters, digits, and the underscore that does not start with a digit is
usually acceptable as an identifier providing it hasn't already been
claimed as a reserved word (like if or while).
With the increasing awareness of languages and writing systems other than
English, UTF-8 is more broadly supported in programming languages these
days. This extends the range of
characters that can go into an identifier, though different languages
extend it differently.
Unicode defines a category for every character, and Go simply extends
the definition given above to allow "Unicode letter" (which has 5
sub-categories: uppercase, lowercase, titlecase, modifier, and other) and
"Unicode decimal digit" (which is one of 3 sub-categories of "Number",
the others being "Number,letter" and "Number,other") to be combined
with the underscore.
The Go FAQ suggests this definition may be extended depending on how
standardization efforts progress.
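As a rough illustration of that rule, here is a minimal sketch in Go that tests a name against "letters, decimal digits, and underscore, not starting with a digit" using the standard unicode package. The function name isGoStyleIdentifier is made up for this example, and the sketch deliberately ignores reserved words, so it should be read as an approximation of the rule rather than as the compiler's own check.

```go
package main

import (
	"fmt"
	"unicode"
)

// isGoStyleIdentifier (a hypothetical helper) reports whether s is built only
// from Unicode letters, Unicode decimal digits, and underscores, and does not
// start with a digit. Reserved words such as "if" are not checked.
func isGoStyleIdentifier(s string) bool {
	if s == "" {
		return false
	}
	for i, r := range s {
		switch {
		case r == '_' || unicode.IsLetter(r):
			// letters and the underscore are allowed anywhere
		case unicode.IsDigit(r) && i > 0:
			// decimal digits are allowed after the first character
		default:
			return false
		}
	}
	return true
}

func main() {
	for _, name := range []string{"count", "_tmp1", "größe", "変数", "9lives", "a-b"} {
		fmt.Printf("%q: %v\n", name, isGoStyleIdentifier(name))
	}
}
```

Only the last two names are rejected: one starts with a digit and the other contains a character that is neither a letter nor a digit.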
Rust gives a hint of what these efforts may look like by
delegating the task of determining valid identifiers to the Unicode
standard.
The Unicode Standard Annex #31
defines two character classes, "ID_Start" and "ID_Continue", that can be
used to form identifiers in a standard way. The Annex offers these as
a resource, rather than imposing them as a standard, and acknowledges
that particular use cases may extend them in various ways.
It particularly highlights that some languages like to allow
identifiers to start with an underscore, which ID_Start does not
contain. The particular rule used by Rust is to allow an identifier to
start with an ASCII letter, underscore, or any ID_Start, and to be
continued with ASCII letters, ASCII digits, underscores, or Unicode
ID_Continue characters.
Allowing Unicode can introduce interesting issues if case is
significant, as Unicode supports three cases (upper, lower, and title)
and also supports characters without case. Most programming languages
very sensibly have no understanding of case and treat two characters
of different case as different characters, with no attempt to fold case
or have a canonical representation. Go however does pay some
attention to case.
In Go, identifiers where the first character is an uppercase letter
are treated differently in terms of visibility between packages. A
name defined in one package is only exported to other packages if it
starts with an uppercase letter. This suggests that writing systems
without case, such as Chinese, cannot be used to name exported
identifiers without some sort of non-Chinese uppercase prefix.
The Go FAQ acknowledges this weakness but shows a strong reluctance to
give up the significance of case in exports.
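The export rule can be observed with Go's standard go/ast package, whose IsExported function tests whether a name starts with an uppercase letter. The sketch below wraps it in a helper (isExportedName, a name invented for this example) and tries it on a few names, including a caseless one with and without an uppercase prefix.

```go
package main

import (
	"fmt"
	"go/ast"
)

// isExportedName (a hypothetical wrapper) reports whether Go would export
// the given identifier, i.e. whether it starts with an uppercase letter.
func isExportedName(name string) bool {
	return ast.IsExported(name)
}

func main() {
	for _, name := range []string{"Reader", "reader", "日本語", "X日本語"} {
		fmt.Printf("%s exported: %v\n", name, isExportedName(name))
	}
}
```

The caseless name is only reported as exported once the non-Chinese uppercase prefix has been added.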
Numbers
Numbers don't face any new issues with Unicode though possibly that is
just due to continued English parochialism, as Unicode does contain a
complete set of Roman numerals as well as those from more current numeral
systems. So you might think that numbers would be fairly well
standardized by now. To a large extent they are, but there still
seems to be wiggle room.
Numbers can be integers or, with a decimal point or exponent suffix
(e.g. "1.0e10"), floating point. Integers can be expressed in decimal, octal
with a leading "0", or hexadecimal with a leading "0x".
In C99 and D, floating point numbers can also be hexadecimal. The
exponent suffix must then have a "p" rather than "e" and gives a power of
two expressed in decimal. This allows precise specification of floating
point numbers without any risk of conversion errors.
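To make the notation concrete, the following sketch works two such constants out by hand in Go, using math.Ldexp to scale a mantissa by a power of two. The helper name hexFloatValue and the two sample constants are chosen purely for illustration; the hexadecimal mantissas have been converted to decimal in the comments.

```go
package main

import (
	"fmt"
	"math"
)

// hexFloatValue (a hypothetical helper) returns mantissa * 2^exp, which is
// how the "p" exponent of a hexadecimal floating point constant is read.
func hexFloatValue(mantissa float64, exp int) float64 {
	return math.Ldexp(mantissa, exp)
}

func main() {
	// 0x1.8p1: hexadecimal 1.8 is 1 + 8/16 = 1.5, scaled by 2^1.
	fmt.Println(hexFloatValue(1.5, 1)) // prints 3
	// 0xAp-2: hexadecimal A is 10, scaled by 2^-2.
	fmt.Println(hexFloatValue(10, -2)) // prints 2.5
}
```

Because both the mantissa and the scale factor are exact in binary, no rounding is involved at any step.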
D, and GCC as an extension to C, also allow a "0b" prefix on integers to
indicate a binary representation (e.g. "0b101010") and D allows underscores
to be sprinkled through numbers to improve readability, so 1_000_000_000 is
clearly the same value as 1e9.
Neither Rust nor Go has included hexadecimal floats. While Rust
has included binary integers and the underscore spacing character, Go
has left these out.
Another subtlety is that while C, D, Go, and many other languages allow a
floating point number to start with a period (e.g. ".314159e1"), Rust does
not. All numbers in Rust must start with a digit. There does not
appear to be any syntactic ambiguity that would arise if a leading
period were permitted, so this is presumably due to personal preference
or accident.
In the language Virgil-III this choice is much clearer. Virgil has a
fairly rich "tuple" concept which provides a useful shorthand for a
list of values. Members of a tuple can be accessed with a syntax
similar to structure field references, only with a number rather than
a name. So in:
var x:(int, int) = (3, 4);
var w:int = x.1;
The variable "w" is assigned the value "4" as it is element one of the
tuple "x".
Supporting this syntax while also allowing ".1" to be a floating point
number would require the tokenizer to know when to report two tokens
("dot" and "int") and when it is just one ("float"). While possible, this
would be clumsy.
Many fractional numbers (e.g. 0.75) will start with a zero even in languages
which allow a leading period (.75). Unlike the case with integers,
the leading zero does not mean these numbers are interpreted in base eight.
For 0.75 this is unlikely to
cause confusion. For 0777.0 it might. Best practice for programmers
would be to avoid the unnecessary digit in these cases and it would be
nice if the language required that.
As well as prefixes, many languages allow suffixes on numbers with a
couple of different meanings. Those few languages which have
"complex" as a built-in type need a syntax for specifying "imaginary"
constants. Go, like D, uses an "i" suffix. Python uses "j".
Spreadsheets like LibreOffice localc or Microsoft Excel allow
either "i" or "j". It is a pity more languages don't take that approach.
Rust doesn't support native complex numbers, so it doesn't need to choose.
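For Go, the "i" suffix looks like this in practice. The sketch below is only an illustration; the helper name magnitude is invented here, while real, imag, and cmplx.Abs are standard.

```go
package main

import (
	"fmt"
	"math/cmplx"
)

// magnitude (a hypothetical helper) returns the absolute value of z.
func magnitude(z complex128) float64 {
	return cmplx.Abs(z)
}

func main() {
	z := 3 + 4i // the "i" suffix marks the imaginary part
	fmt.Println(real(z), imag(z), magnitude(z)) // prints 3 4 5
}
```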
The other meaning of a suffix is to indicate the "size" of the value -
how many bytes are expected to be used to store it. C and D allow
u, l, ll, or f for unsigned, long, long long, and float, with a few
combinations permitted.
Rust allows u, u8, u16, u32, u64, i8, i16, i32, i64, f32, and f64 which
cover much the same set of sizes, but are more explicit. Perhaps fortunately, i
is not a permitted suffix, so there is room to add imaginary numbers in
the future if that turned out to be useful.
Go takes a completely different approach to the sizing of constants.
The language specification
talks about "untyped" constants though this seems to be some strange
usage of the word "untyped" that I wasn't previously aware of. There
are in fact "untyped integer" constants, "untyped floating point"
constants, and even "untyped boolean" constants, which seem like they
are untyped types. A more accurate term might be "unsized constants with
unnamed types" though that is a little cumbersome.
These "untyped" constants have two particular properties. They are
calculated using high precision with overflow forbidden, and
they can be transparently converted to a different type provided that the exact
value can be represented in the target type. So "1e15" is an untyped
floating point constant which can be used where an int64 is
expected, but not where an int32 is expected, as it requires
50 bits to store as an integer.
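A small sketch of both properties follows. The constant names big, huge, and four and the helper asInt64 are invented for this example. The commented-out line is the int32 case described above, which the compiler is expected to reject.

```go
package main

import "fmt"

const (
	big  = 1e15       // untyped floating point constant
	huge = 1 << 100   // untyped integer constant, too large for any sized integer type
	four = huge >> 98 // still computed exactly, giving 4
)

// asInt64 (a hypothetical helper) uses the untyped floating point constant
// where an int64 is expected; the conversion is implicit and exact.
func asInt64() int64 {
	var n int64 = big
	return n
}

func main() {
	fmt.Println(asInt64(), four) // prints 1000000000000000 4
	// var m int32 = big // rejected at compile time: the value does not fit in an int32
}
```

The intermediate value huge never needs to fit in a machine integer, because it only appears inside a constant expression.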
The specification states that "Constant expressions are always
evaluated exactly"; however, some edge cases are to be expected:
print((1 + 1/1e130)-1, "\n")
print(1/1e130, "\n")
results in:
+9.016581e-131
+1.000000e-130
so there does seem to be some limit to precision.
Maintaining high precision and forbidding overflow means that there
really is no need for size suffixes.
Strings
Everyone knows that strings are enclosed in single or double quotes.
Or maybe backquotes (`) or triple quotes ('''). And that while they
used to contain ASCII characters, UTF-8 is preferred these days.
Except when it isn't, and UTF-16 or UTF-32 are needed.
Both Rust and Go, like C and others, use single quotes for characters
and double quotes for strings, both with the standard set of escape
sequences (though Rust inexplicably excludes \b, \v, \a, and \f). This
set includes \uXXXX and \UXXXXXXXX so that all
Unicode code-points can be expressed using pure ASCII program text.
Go chooses to refer to character constants as "Runes" and provides the
built-in type "rune" to store them. In C and related
languages "char" is used both for ASCII characters and 8-bit values.
It appears that the Go developers wanted a clean break with that and
do not provide a char type at all. rune
(presumably more aesthetic than wchar) stores (32-bit) Unicode
characters while byte or uint8 store 8-bit values.
Rust keeps the name char for 32-bit Unicode characters and
introduces u8 for 8-bit values.
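On the Go side, the split between rune and byte can be seen in a few lines. This is only a sketch; byteAndRuneCounts is a name made up for the example, and the accented character is written as a \u escape so that the source stays pure ASCII.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// byteAndRuneCounts (a hypothetical helper) returns the number of bytes and
// the number of Unicode code points in s.
func byteAndRuneCounts(s string) (int, int) {
	return len(s), utf8.RuneCountInString(s)
}

func main() {
	var r rune = '\u00e9' // a character constant, stored as a 32-bit rune
	var b byte = 'e'      // byte is another name for uint8
	fmt.Println(r, b)     // prints 233 101

	nBytes, nRunes := byteAndRuneCounts("h\u00e9llo")
	fmt.Println(nBytes, nRunes) // prints 6 5
}
```

The string holds five characters but six bytes, since the accented letter takes two bytes in UTF-8.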
The modern trend seems to be to disallow literal newlines inside
quoted strings, so that missing quote characters can be quickly
detected by the compiler or interpreter. Go follows this trend and, like
D, uses the back quote (rather than the Python triple-quote) to surround "raw"
strings in which escapes are not recognized and newlines are
permitted. Rust bucks the trend by allowing literal newlines in strings
and does not provide for uninterpreted strings at all.
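A short sketch of Go's two string forms: the same text written once as a raw string and once as an interpreted string. The helper name rawAndInterpreted and the sample text are invented for this example.

```go
package main

import "fmt"

// rawAndInterpreted (a hypothetical helper) returns the same text written
// as a raw string and as an interpreted string.
func rawAndInterpreted() (string, string) {
	// Inside back quotes, backslashes are ordinary characters and the
	// line break is part of the string.
	raw := `C:\temp\new
second line`
	// Inside double quotes, each backslash must be escaped and the
	// newline must be written as \n.
	interpreted := "C:\\temp\\new\nsecond line"
	return raw, interpreted
}

func main() {
	raw, interpreted := rawAndInterpreted()
	fmt.Println(raw == interpreted) // prints true
}
```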
Both Rust and Go assume UTF-8. They do not support the prefixes of C
(U"this is a string of 32bit characters")
or the suffixes of D
("another string of 32bit chars"d),
to declare a string to be a multibyte string.
Semicolons and expressions
The phrase "missing semicolon" still brings back memories from first-year computer science and learning Pascal. It was a running joke that
whenever the lecturer asked "What does this code fragment do?"
someone would call out "missing semicolon", and they were right more
often than you would think.
In Pascal, a semicolon separates statements while in C it terminates
some statements — if, for, while,
switch and compound statements do not require a semicolon.
Neither rule is particularly difficult to get used to,
but both often require semicolons at the end of lines that can look
unnecessary.
Go follows Pascal in that semicolons separate statements — every pair
of statements must be separated. A semicolon is not needed before the
"}" at the end of a block, though it is permitted there.
Go also follows the pattern seen in Python and JavaScript where the
semicolon is sometimes assumed at the end of a line (when a newline
character is seen). The details of this "sometimes" are quite
different between languages.
In Go, the insertion of semicolons happens during
"lexical analysis",
which is the step of language processing that breaks the stream of
characters into a stream of tokens (i.e. a tokenizer). If a newline is
detected on a non-empty line and the last token on the line was one of:
- an identifier,
- one of the keywords break, continue, fallthrough, or return,
- a numeric, rune, or string literal, or
- one of ++, --, ), ], or }
then a semicolon is inserted at the location of the newline.
This imposes some style choices on the programmer such that:
if some_test
{
    some_statement
}
is not legal (the open brace must go on the same line as the
condition), and:
a = c
    + d
    + e
is not legal — the operation (+) must go at the end of the
first line, not the start of the second.
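Go's own tokenizer is available as the standard go/scanner package, so the effect of the rule can be observed directly. The sketch below (the function name insertedSemicolons and the file name "sketch.go" are made up for this example) counts the semicolons the scanner adds at line ends for the two layouts of the same expression.

```go
package main

import (
	"fmt"
	"go/scanner"
	"go/token"
)

// insertedSemicolons (a hypothetical helper) counts the semicolons that the
// Go scanner adds at line ends while tokenizing src.
func insertedSemicolons(src string) int {
	fset := token.NewFileSet()
	file := fset.AddFile("sketch.go", fset.Base(), len(src))
	var s scanner.Scanner
	s.Init(file, []byte(src), nil, 0) // mode 0: skip comments, insert semicolons
	count := 0
	for {
		_, tok, lit := s.Scan()
		if tok == token.EOF {
			return count
		}
		// An automatically inserted semicolon is reported with the literal "\n".
		if tok == token.SEMICOLON && lit == "\n" {
			count++
		}
	}
}

func main() {
	// Operators at the start of the line: every line ends in an identifier.
	fmt.Println(insertedSemicolons("a = c\n+ d\n+ e\n")) // prints 3
	// Operators at the end of the line: only the last line ends the statement.
	fmt.Println(insertedSemicolons("a = c +\nd +\ne\n")) // prints 1
}
```

In the first layout the assignment is cut off after "c", which is why the operator has to be moved to the end of the line.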
In contrast to this, JavaScript waits until the "parsing" step of language
processing when the sequence of tokens is gathered into syntactic units
(statements, expressions, etc.) following a context free grammar.
JavaScript will insert a semicolon, provided that semicolon would
serve to terminate a non-empty statement, if:
- it finds a newline in a location where the grammar forbids a
  newline, such as after the word "break" or before the postfix
  operator "++";
- it finds a "}" or End-of-file that is not expected by the grammar; or
- it finds any token that is not expected, which was separated from
  the previous token by at least one newline.
This often works well but brings its own share of style choices
including the interesting suggestion
to sometimes use a semicolon to start a statement.
While both of these approaches are workable, neither really seems
ideal. They both force style choices which are rather arbitrary and
seem designed to make life easy for the compiler rather than for the
programmer.
Rust takes a very different approach to semicolons than Go or JavaScript
or many other languages. Rather than making them less important and
often unnecessary, Rust makes them more important and gives them a
significant semantic meaning.
One use involves the attributes mentioned earlier. When followed by a
semicolon:
#[some_attribute];
the attribute applies to the entity (e.g. the function or module) that
the attribute appears within. When not followed by a semicolon, the
attribute applies to the entity that follows it. A missing
semicolon could certainly make a big difference here.
The primary use of semicolons in Rust is much like that in C — they
are used to terminate expressions by turning the expressions into
statements, discarding any result. The effect is really quite
different from C because of a related difference: many things that C
considers to be statements, Rust considers to be expressions. A
simple example is the if expression.
a = if b == c { 4 } else { 5 };
Here the if expression returns either "4" or "5", which is stored in
"a".
A block, enclosed in braces ({ }), typically includes a
sequence of expressions with semicolons separating them. If the last
expression is also followed by a semicolon, then the block-expression
as a whole does not have a value — that last semicolon discards the
final value. If the last expression is not followed by a semicolon,
then the value of the block is the value of the last expression.
If this completely summed up the use of semicolons it would produce
some undesirable requirements.
if condition {
    expression1;
} else {
    expression2;
}
expression3;
This would not be permitted as there is no semicolon to discard the
value of the if expression before expression3. Having a semicolon
after the last closing brace would be ugly, and that if expression
doesn't actually return a value anyway (both internal expressions are
terminated with a semicolon) so the language does not require the ugly
semicolon and the above is valid Rust code.
If the internal expression did return a value, for example if the
internal semicolons were missing, then a semicolon would be
required before expression3.
Following this line of reasoning leads to an interesting result.
if condition {
    function1()
} else {
    function2()
}
expression3;
Is this code correct or is there a missing semicolon? To know the
answer you need to know the types of the functions. If they do not
return a value, then the code is correct. If they do, a semicolon is
needed, either one at the end of the whole "if" expression, or one
after each function call. So in Rust, we need to evaluate the types
of expressions before we can be sure of correct semicolon usage in
every case.
Now the above is probably just a silly example, and no one would ever
write code like that, at least not deliberately. But the rules do
seem to add an unnecessary complexity to the language, and the task of
programming is complex enough as it is — adding more complexity through subtle
language rules is not likely to help.
Possibly a bigger problem is that any tool that wishes to accurately
analyze the syntax of a program needs to perform a complete type
analysis. It is a known problem that the correct parsing of C
code requires you to know which identifiers are typedefs and which are
not. Rust isn't quite that bad as missing type information wouldn't lead to an
incorrect parse, but at the very least it is a potential source of confusion.
Return
A final example of divergence on the little issues, though perhaps not
quite so little as the others, can be found in returning values from
functions using a return statement. Both Rust and
Go support the traditional return and both allow multiple values to
be returned: Go by simply allowing a list of return types, Rust
through the "tuple" type which allows easy anonymous structures.
Each language has its own variation on this theme.
If we look at the half million return statements in the Linux
kernel, nearly 35,000 of them return a variable called "ret",
"retval", "retn", or similar, and a further 20,000 return "err",
"error", or similar. This totals more than 10% of total usage of
return in the kernel.
This suggests that there is often a need to declare a variable to hold
the intended result of a function, rather than to just return a result
as soon as it is known.
Go acknowledges this need by allowing the signature of a function to
give names to the return values as well as the parameter values:
func open(filename string, flags int) (fd int, err int)
Here the (hypothetical) open() function returns two integers
named fd (the file descriptor) and err.
This provides useful documentation of the meaning of the return values
(assuming programmers can be more creative than "retval")
and also declares variables with the given names. These can be set
whenever convenient in the code of the function and a simple:
return
with no expressions listed will use the values in those variables.
Go requires that this return be
present, even if it lists no values and is at the end of the function,
which seems a little unnecessary, but isn't too burdensome.
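A sketch of how the body of such a function might look is shown below. The hypothetical open() from the text is reused with the same signature, but it opens nothing: the descriptor and error numbers are arbitrary values invented for the example.

```go
package main

import "fmt"

// open is a stand-in for the hypothetical function in the text; it opens
// nothing and hands out made-up descriptor and error numbers.
func open(filename string, flags int) (fd int, err int) {
	fd = -1 // named results are ordinary variables, set whenever convenient
	if filename == "" {
		err = 2 // arbitrary non-zero error code for this sketch
		return  // returns the current values of fd and err
	}
	fd = 3 // arbitrary descriptor number for this sketch
	return // still required, even at the end of the function
}

func main() {
	fd, err := open("data.txt", 0)
	fmt.Println(fd, err) // prints 3 0
	fd, err = open("", 0)
	fmt.Println(fd, err) // prints -1 2
}
```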
There is evidence
that some Go developers are not completely comfortable with this feature,
though it isn't clear whether the feature itself is a problem, or rather the
interplay with other features of Go.
Rust's variation on this theme we have already glimpsed with the
observation that Rust has "expressions" in preference to
"statements". The whole body of a function can be viewed as an
expression and, provided it doesn't end with a semicolon, the value
produced by that expression is the value returned from the function.
The word return is not needed at all, though it is available and an
explicit return expression within the function body will cause an
early return with the given value.
Conclusion
There are many other little details, but this survey provides a good
sampling of the many decisions that a language designer needs to make
even after they have made the important decisions that shape the
general utility of the language.
There certainly are standards that are appearing and broadly being
adhered to, such as for comments and identifiers, but it is a little
disappointing that there is still such variability concerning the
available representations of numbers and strings.
The story of semicolons and statement separation is clearly not a
story we've heard the end of yet. While it is good to see language
designers exploring the options, none of the approaches explored above
seem entirely satisfactory. The recognition of a line-break as being
distinct from other kinds of white space seems to be a clear recognition that
the two dimensional appearance of the code has relevance for parsing
it. It is therefore a little surprising that we don't see the line
indent playing a bigger role in interpretation of code. The
particular rules used by Python may not be to everyone's liking, but
the principle of making use of this very obvious aspect of a program
seems sound.
We cannot expect ever to converge on a single language that suits
every programmer and every task, but the more uniformity we can find
on the little details, the easier it will be for programmers to move
from language to language and maximize their productivity.