A look at Gawk 4.0.0
GNU Awk (Gawk) is one of those workhorse utilities that usually doesn't make news. The 4.0.0 release, however, deserves a look. Announced on June 30th, the latest iteration of Gawk brings the first Gawk debugger, a sandbox mode for running less trusted scripts, revised internals, a number of changes to regular expressions, and IPv6 compatibility.
Gawk is one implementation of Awk. Named for the last names of its inventors (Alfred Aho, Peter Weinberger, and Brian Kernighan), Awk is a scripting language that's standard across UNIX platforms — and by standard I mean that Awk has been part of a standard UNIX (or UNIX-like) system since the beginning, as well as part of the POSIX specification from the Open Group. Even though Gawk is one of many Awks, it stands out as being one of the most widely used. What is it used for? Gawk is best-known for data extraction and reporting, though it's been used to write IRC bots, a YouTube downloader, and for AI programming.
New in 4.0.0
Gawk 4.0.0 is a fairly hefty update from the 3.1.8 release. According to the release announcement, Gawk not only packs a number of new features, bugfixes, and some updates to comply with POSIX 2008, it also has "revamped internals".
To find out what was revamped, and why, I asked Gawk maintainer Arnold Robbins by email. It turns out that the revamp has a lengthy history. Robbins says that "some years ago" John Haque took on rewriting Gawk's internals using a byte-code-style engine, and implementing a debugger in the process. Unfortunately, that work wasn't integrated and Haque moved on. In early 2010, Robbins started trying to bring Haque's code up to date. According to Robbins, the rewrite doesn't provide a huge performance boost, but it does bring a major useful feature:
The performance is about the same (or slightly better) than the original internals, and I have not yet found a case where it's worse. But the really big gain, and why I wanted to have the change, is that gawk now provides an awk-level debugger (similar to GDB).
Right now, dgawk is usable, but still limited. It doesn't report what an error is, but will only report "syntax error" when there's a problem. The debugger will also only work when running a program on the command line; it cannot be attached to a running Awk program. It is unlikely that the Gawk developers will focus on adding that functionality, since the Gawk manual notes that limiting debugging to programs started from within dgawk "seems reasonable for a language which is used mainly for quickly executing, short programs".
Gawk's regular expressions have undergone some changes in 4.0.0. Interval expressions are now part of the default syntax for Gawk, and no longer require the --re-interval (-W re-interval) or --posix options. Interval expressions, where one or two numbers inside braces (such as {n} or {n,m}) tell Gawk to match a regular expression n or n through m times, were not part of the original Awk specification. Also, \s and \S have been added to match any white space character, or any character that is not white space (respectively).
While Gawk tries to be POSIX-compliant, it does have features above and beyond POSIX, and 4.0.0 introduces a few more. Gawk now supports two new patterns, BEGINFILE and ENDFILE, that can be used to perform actions before reading a file and after (respectively). These are similar to BEGIN/END rules, but are applied before and after reading individual files (since Gawk may process two or more files while running any given script). For example, Gawk programs can now test to see if a file is readable before trying to process it. In prior versions of Gawk, this was not possible, so a script would fail with a fatal error if a file passed to Gawk was not readable.
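The readability check might look something like the following sketch. It relies on Gawk setting ERRNO inside a BEGINFILE rule when a file cannot be opened, and on nextfile to skip that file. The count_lines wrapper and its output format are invented for the example.

```shell
# Print a per-file line count, skipping files that cannot be opened
# instead of dying with a fatal error.
count_lines() {
    gawk '
    BEGINFILE {
        # ERRNO is non-empty here if the file could not be opened.
        if (ERRNO != "") {
            print "skipping " FILENAME ": " ERRNO > "/dev/stderr"
            nextfile
        }
    }
    # Runs after the last record of each file that was actually read.
    ENDFILE { print FILENAME ": " FNR " lines" }
    ' "$@"
}
```

Calling count_lines with a mix of readable and unreadable files should report the unreadable ones on standard error and carry on with the rest.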
Gawk has long had the ability to work over a network connection. With the 4.0.0 release, Gawk supports IPv6 using the /inet6 special file, or /inet4 to force IPv4.
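A minimal sketch of how the special files are used, assuming the usual Gawk layout of /net-type/protocol/local-port/remote-host/remote-port. The helper names inet_file and http_status6 are hypothetical, and the second one needs a reachable IPv6 web server to do anything useful.

```shell
# Build a Gawk network special file name for a TCP client connection.
# $1 = inet4 or inet6, $2 = remote host, $3 = remote port.
# Local port 0 lets the system pick one.
inet_file() {
    printf '/%s/tcp/0/%s/%s\n' "$1" "$2" "$3"
}

# Send a minimal HTTP request over IPv6 and print the first response line.
http_status6() {
    gawk -v svc="$(inet_file inet6 "$1" 80)" -v host="$1" '
    BEGIN {
        # |& opens a two-way connection to the special file.
        printf "HEAD / HTTP/1.0\r\nHost: %s\r\n\r\n", host |& svc
        if ((svc |& getline line) > 0)
            print line
        close(svc)
    }'
}
```

Swapping inet6 for inet4 in the call to inet_file would force the same request over IPv4.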
The Internet is awash with Awk/Gawk scripts that users might want to run, but worry that the scripts will do more than what's advertised. To address this, Gawk 4.0.0 has a sandbox option (--sandbox), which restricts Gawk to operating on the input data that's been specified. It does this by disabling Gawk's system() function, input redirection using getline, and output redirection using the print and printf statements.
However, Robbins cautions against being overconfident in the security it would convey.
The sandbox mode is not on by default, says Robbins, because it would break "an untold number of existing awk scripts". In short, this option may be useful, but the features disabled by the sandbox option may not be the only way a malicious script could harm a user's system.
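To see the restriction in practice, here is a small sketch that wraps the --sandbox option. The run_untrusted name is invented for this example, and, as noted above, it should not be treated as a complete security boundary.

```shell
# Run an awk program with Gawk's sandbox enabled.
# usage: run_untrusted 'program text' [input files...]
run_untrusted() {
    prog=$1
    shift
    gawk --sandbox "$prog" "$@"
}
```

Ordinary processing such as run_untrusted '{ print $1 + $2 }' still works on the given input, while a program that calls system() or redirects print output to a file should make Gawk exit with a fatal error.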
The 4.0.0 release is the end of the line for some options and several old and unsupported operating systems. The redundant --compat, --copyleft, and --usage options are gone. The option for raw sockets has been removed, as it was not implemented. If you're still on Amiga, BeOS, Cray, NeXT, SunOS 3.x, MIPS RiscOS, or a handful of others, Gawk 3.1.8 is the final supported release. That the Gawk team has dropped those platforms is no surprise — that they've been carried so long past their expiration date is. It would be challenging indeed to find new proprietary software that's supported on BeOS or Amiga.
With the Gawk 4.0.0 release out of the way, Robbins says that the "big ticket" items for upcoming releases are to merge Gawk's three executables (gawk, pgawk for profiling, and dgawk for debugging) into one to reduce the installation footprint. He also says that Haque "has some other plans related to performance, but that's about all I can say about it in public". Robbins also says that there are plans to merge in some of the XMLgawk extensions. (XMLgawk is an extension of Gawk that has an XML parsing library based on the Expat XML parser.)
Robbins also has a few ideas listed in the Gawk roadmap on his site, which include support for multiple-precision floating-point (MPFR) so gawk can use arbitrary-precision numbers. He notes that it will be a "big job" and has yet to decide whether MPFR support would be on by default. Gawk is released as it's ready, so there are no dates specified as to when the features can be expected.
The Gawk team is not large, but it's got a healthy set of core contributors. Robbins says that Gawk has six people who maintain ports to different systems, one who handles testing on "a zillion different Unix systems", one contributor who helped out with documentation, and "various other people such as the xmlgawk developers, and several people from different GNU/Linux distributions". Naturally, this also includes Robbins and Haque.
Though Awk is not a particularly "sexy" language these days, it's still a go-to for system administrators and developers. It's good to see that the GNU Project is not only maintaining Gawk, but adding interesting new features that help keep it relevant.
Index entries for this article | |
---|---|
GuestArticles | Brockmeier, Joe |
Posted Jul 8, 2011 1:00 UTC (Fri) by Cyberax (✭ supporter ✭, #52523)
Posted Jul 8, 2011 6:56 UTC (Fri) by janneke (guest, #15012)
Posted Jul 8, 2011 12:29 UTC (Fri) by nix (subscriber, #2304)
Posted Jul 8, 2011 14:20 UTC (Fri) by Cyberax (✭ supporter ✭, #52523)
/me runs away screaming.
Posted Jul 8, 2011 16:57 UTC (Fri) by nix (subscriber, #2304)
(but, seriously, pgawk was damn useful in making something that size work at sane speeds!)
Posted Jul 8, 2011 15:24 UTC (Fri) by HelloWorld (guest, #56129)
Posted Jul 8, 2011 17:06 UTC (Fri) by oak (guest, #2786)
For example Perl & Python are huge, their size is in MBs whereas configuring awk[1] to Busybox increases Busybox size only by some KBs. Sometimes size matters. Sometimes Perl or Python aren't otherwise available.
[1] Busybox AWK isn't 100% POSIX compliant, but the few POSIX features it's lacking are marginal (I've only bumped into one, some rarely used printf format feature which I don't anymore remember).
Posted Jul 8, 2011 19:51 UTC (Fri) by SiB (subscriber, #4048)
plot "<awk -v I=7 -f RSH2.awk --s '/^E/ && $F1<1000*mV && $F2<1000*mV && $BH<20*mV && $DH>15*mV{print $F1/mV, $F2/mV}' radpf-2011-06-27-muons-horizontal-3.E | ./hist.py -s 0.5 -S 0.5" with image
Posted Jul 8, 2011 21:13 UTC (Fri) by felixfix (subscriber, #242)
Posted Jul 8, 2011 21:28 UTC (Fri) by nix (subscriber, #2304)
(In any case, I'd look out: from the look of that comment, SiB could be firing ionized plasma from an underground death ray any time now.[1])
[1] sure, they *say* it's a particle accelerator...
Posted Jul 11, 2011 1:51 UTC (Mon) by zlynx (guest, #2285)
It all depends on which direction you point the thing.
Posted Jul 11, 2011 15:35 UTC (Mon) by rsidd (subscriber, #2582)
Posted Jul 11, 2011 15:47 UTC (Mon) by paulj (subscriber, #341)
Posted Jul 11, 2011 13:59 UTC (Mon) by stevem (subscriber, #1512)
Posted Jul 8, 2011 2:01 UTC (Fri) by mhw (guest, #13931)
Posted Jul 14, 2011 8:07 UTC (Thu) by Duncan (guest, #6647)
Virtuoso-server is a backend required by soprano when it in turn is supporting nepomuk, a kde4 base "semantic desktop" technology.
https://bugs.gentoo.org/show_bug.cgi?id=374143
https://sourceforge.net/tracker/?func=detail&aid=3358...
As a result, I had to downgrade back to gawk 3.1.8, for the time being.
Duncan (Yes, running KDE, KDE 4.6.95 aka 4.7-rc2 ATM, to be precise, on Gentoo).
Any AWK script that requires debugger should probably be rewritten in Python or Perl.
Of course, if you inherit such a script, you may need a debugger
to be able to do that.