GNU Awk (Gawk) is one of those workhorse utilities that usually doesn't make news. The 4.0.0 release, however, deserves a look. Announced on June 30th, the latest iteration of Gawk brings the first Gawk debugger, a sandbox mode for running less trusted scripts, revised internals, a number of changes to regular expressions, and IPv6 compatibility.
Gawk is one implementation of Awk. Named for the last names of its
inventors (Alfred Aho, Peter Weinberger, and Brian Kernighan), Awk is a
scripting language that's standard across UNIX platforms. By
standard I mean that Awk has been part of a standard UNIX (or UNIX-like)
system since the late 1970s, as well as part of the POSIX
specification from the Open Group. Even though Gawk is one of many
Awks, it stands out as being one of the most widely used. What is it used
for? Gawk is best known for data extraction and reporting, though it's also been
used for writing IRC bots and a YouTube downloader, and for AI programming.
New in 4.0.0
Gawk 4.0.0 is a fairly hefty update from the 3.1.8 release. According to
the release announcement, Gawk not only packs a number of new features,
bugfixes, and some updates to comply with POSIX 2008, it also has "revamped internals."
To find out what was revamped, and why, I asked Gawk maintainer Arnold
Robbins by email. It turns out that the revamp has a lengthy
history. Robbins says that "some years ago" John Haque took on
rewriting Gawk's internals using a byte-code-style engine and
implementing a debugger in the process. Unfortunately, that work wasn't
integrated and Haque moved on. In early 2010, Robbins started trying to bring Haque's code up to date. According to Robbins, the rewrite doesn't provide a huge performance boost, but it does bring one major feature:
The performance is about the same (or slightly better) than the original internals, and I have not yet found a case where it's worse. But the really big gain, and why I wanted to have the change, is that gawk now provides an awk-level debugger (similar to GDB).
Right now, dgawk, the debugger executable,
is usable, but still limited. It doesn't say what an error is; it
only reports "syntax error" when there's a problem. The
debugger also works only on a program started from its command line;
it cannot be attached to a running Awk program. It is unlikely that
the Gawk developers will focus on adding that functionality, since the Gawk
manual notes that limiting debugging to programs started from within dgawk "seems reasonable for a language which is used mainly for quickly executing, short programs."
Gawk's regular expressions have undergone some changes in
4.0.0. Interval expressions are now part of the default syntax for Gawk,
and no longer require the --re-interval or --posix
options. Interval expressions, where one or two numbers inside
braces (such as {n} or {n,m}) tell Gawk to match the preceding
regular expression exactly n times or from n through m times,
were not part of the original Awk specification. Also,
\s and \S have been added to match any white
space character, or any character that is not white space (respectively).
While Gawk tries to be POSIX-compliant, it does have features above and
beyond POSIX — and 4.0.0 introduces a few more. Gawk now supports two
new patterns, BEGINFILE and ENDFILE, that can be
used to perform actions before reading a file and after
(respectively). These are similar to BEGIN/END rules, but are
applied before and after reading individual files (since Gawk may process
two or more files while running any given script). For example, Gawk programs can now test to see if a file is readable before trying to process it. In prior versions of Gawk, this was not possible — so a script would fail with a fatal error if a file passed to Gawk was not readable.
Gawk has long had the ability to work over a network connection. With the 4.0.0 release, Gawk supports IPv6 using the /inet6 special file, or /inet4 to force IPv4.
The Internet is awash with Awk/Gawk scripts that users might want to run, but worry that the scripts will do more than what's advertised. To address this, Gawk 4.0.0 has a sandbox option (--sandbox), which restricts Gawk to operating on the input data that's been specified. It does this by disabling Gawk's system() function, input redirection using getline, and output redirection using the print and printf statements.
However, Robbins cautions against being overconfident in the security it
provides:
It was contributed by a user who felt a need for it,
IIRC, for use in Web CGI scripts where you don't want someone to send in
malicious data that can trick the script into writing in your filesystem.
It makes a certain amount of sense to have an option like that. It is most
definitely *not* intended to make any promises of security.
The sandbox mode is not on by default, says Robbins, because it would break
"an untold number of existing awk scripts." In short, this
option may be useful, but the features disabled by the sandbox option may
not be the only way a malicious script could harm a user's system.
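The effect of the option on system() can be seen with a small experiment. This is a sketch that assumes gawk 4.0.0 or later. The variable names are invented, and it demonstrates only one of the disabled features, not any security guarantee.

```shell
# Without --sandbox, system() runs the command and its output is captured.
plain=$(gawk 'BEGIN { system("echo ran") }')
echo "$plain"

# With --sandbox, the same program is rejected with a fatal error,
# so gawk exits with a non-zero status.
if gawk --sandbox 'BEGIN { system("echo ran") }' 2>/dev/null; then
    sandboxed="allowed"
else
    sandboxed="blocked"
fi
echo "$sandboxed"
```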
The 4.0.0 release is the end of the line for some options and several
old and unsupported operating systems. The redundant --compat, --copyleft,
and --usage options are gone. The option for raw sockets has been removed,
as it was not implemented. If you're still on Amiga, BeOS, Cray, NeXT,
SunOS 3.x, MIPS RiscOS, or a handful of others, Gawk 3.1.8 is the final
supported release. That the Gawk team has dropped those platforms is no
surprise — that they've been carried so long past their expiration
date is. It would be challenging indeed to find new proprietary software
that's supported on BeOS or Amiga.
With the Gawk 4.0.0 release out of the way, Robbins says that the "big ticket" items for upcoming releases are to merge Gawk's three executables (gawk, pgawk for profiling, and dgawk for debugging) into one to reduce the installation footprint. He also says that Haque "has some other plans related to performance, but that's about all I can say about it in public." Robbins also says that there are plans to merge in some of the XMLgawk extensions. (XMLgawk is an extension of Gawk that has an XML parsing library based on the Expat XML parser.)
Robbins also has a few ideas listed in the Gawk roadmap on his site, which include support for multiple-precision floating-point (MPFR) so gawk can use arbitrary-precision numbers. He notes that it will be a "big job" and has yet to decide whether MPFR support would be on by default. Gawk is released as it's ready, so there are no dates specified as to when the features can be expected.
The Gawk team is not large, but it's got a healthy set of core contributors. Robbins says that Gawk has six people who maintain ports to different systems, one who handles testing on "a zillion different Unix systems," one contributor who helped out with documentation, and "various other people such as the xmlgawk developers, and several people from different GNU/Linux distributions." Naturally, this also includes Robbins and Haque.
Though Awk is not a particularly "sexy" language these days, it's still
a go-to for system administrators and developers. It's good to see that the
GNU Project is not only maintaining Gawk, but adding interesting new
features that help keep it relevant.