A look at Gawk 4.0.0

July 7, 2011

This article was contributed by Joe 'Zonker' Brockmeier.

GNU Awk (Gawk) is one of those workhorse utilities that usually doesn't make news. The 4.0.0 release, however, deserves a look. Announced on June 30th, the latest iteration of Gawk brings the first Gawk debugger, a sandbox mode for running less trusted scripts, revised internals, a number of changes to regular expressions, and IPv6 compatibility.

Gawk is one implementation of Awk. Named for the last names of its inventors (Alfred Aho, Peter Weinberger, and Brian Kernighan), Awk is a scripting language that's standard across UNIX platforms. By standard I mean that Awk has been part of a standard UNIX (or UNIX-like) system since the late 1970s, as well as part of the POSIX specification from the Open Group. Even though Gawk is one of many Awks, it stands out as being one of the most widely used. What is it used for? Gawk is best known for data extraction and reporting, though it has also been used to write IRC bots and a YouTube downloader, and for AI programming.

New in 4.0.0

Gawk 4.0.0 is a fairly hefty update from the 3.1.8 release. According to the release announcement, Gawk not only packs a number of new features, bugfixes, and some updates to comply with POSIX 2008, it also has "revamped internals".

To find out what was revamped, and why, I asked Gawk maintainer Arnold Robbins by email. It turns out that the revamp has a lengthy history. Robbins says that "some years ago" John Haque took on rewriting Gawk's internals using a byte-code-style engine — and to implement a debugger in the process. Unfortunately, that work wasn't integrated and Haque moved on. In early 2010, Robbins started trying to bring Haque's code up to date. According to Robbins, the rewrite doesn't provide a huge performance boost, but it does bring a major useful feature:

The performance is about the same (or slightly better) than the original internals, and I have not yet found a case where it's worse. But the really big gain, and why I wanted to have the change, is that gawk now provides an awk-level debugger (similar to GDB).

Right now, the debugger (shipped as a separate executable, dgawk) is usable but still limited. It doesn't explain what an error is; it reports only "syntax error" when there's a problem. The debugger also works only with a program started from within it on the command line; it cannot be attached to a running Awk program. It is unlikely that the Gawk developers will focus on adding that functionality, since the Gawk manual notes that limiting debugging to programs started from within dgawk "seems reasonable for a language which is used mainly for quickly executing, short programs".

Gawk's regular expressions have undergone some changes in 4.0.0. Interval expressions are now part of the default syntax for Gawk, and no longer require the --re-interval or --posix options. Interval expressions, in which one or two numbers inside braces (such as {n} or {n,m}) tell Gawk to match the preceding regular expression exactly n times or between n and m times, were not part of the original Awk specification. Also, \s and \S have been added to match any white space character and any character that is not white space, respectively.
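The matching semantics of interval expressions and of \s and \S can be sketched outside Gawk. Python's re module happens to use the same notation, so the following is only an illustration of what the operators mean, not Gawk code; the names exactly_three, two_to_four, matches, and split_on_whitespace are hypothetical.

```python
import re

# {n}: the preceding item exactly n times.
exactly_three = re.compile(r"^a{3}$")

# {n,m}: the preceding item at least n and at most m times.
two_to_four = re.compile(r"^[0-9]{2,4}$")


def matches(pattern, text):
    """Return True if the compiled pattern matches somewhere in text."""
    return pattern.search(text) is not None


def split_on_whitespace(line):
    """Collect runs of non-white-space characters (\\S+); the gaps are \\s."""
    return re.findall(r"\S+", line)


if __name__ == "__main__":
    print(matches(exactly_three, "aaa"))      # three repetitions
    print(matches(two_to_four, "12345"))      # five digits, outside {2,4}
    print(split_on_whitespace("one  two\tthree"))
```

Anchoring the patterns with ^ and $ makes the repetition count apply to the whole string; without anchors, a{3} would also match inside a longer run of a's.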

While Gawk tries to be POSIX-compliant, it does have features above and beyond POSIX, and 4.0.0 introduces a few more. Gawk now supports two new special patterns, BEGINFILE and ENDFILE, whose actions run just before a file is read and just after it has been read, respectively. These are similar to BEGIN/END rules, but are applied before and after each individual file (since Gawk may process two or more files while running any given script). For example, Gawk programs can now test whether a file is readable before trying to process it. In prior versions of Gawk this was not possible, so a script would fail with a fatal error if a file passed to Gawk was not readable.
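The control flow that per-file rules make possible can be sketched as follows. This is a Python analogy under stated assumptions, not Gawk's actual mechanism: the function name process_files and the line-counting task are hypothetical, chosen only to show a check before each file, a loop over its records, and an action once the file is finished.

```python
import os
import sys


def process_files(paths):
    """Count lines per file, skipping unreadable files instead of aborting."""
    counts = {}
    for path in paths:
        # Roughly the role of a BEGINFILE rule: look at the file before reading it.
        if not (os.path.isfile(path) and os.access(path, os.R_OK)):
            print(f"skipping unreadable file: {path}", file=sys.stderr)
            continue

        lines = 0
        with open(path) as handle:
            for _ in handle:  # ordinary per-record rules would run here
                lines += 1

        # Roughly the role of an ENDFILE rule: act once this file is done.
        counts[path] = lines
    return counts


if __name__ == "__main__":
    for name, total in process_files(sys.argv[1:]).items():
        print(f"{name}: {total} lines")
```

The point of the sketch is that an unreadable file produces a warning and is skipped, while the remaining files are still processed.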

Gawk has long had the ability to work over a network connection. With the 4.0.0 release, Gawk supports IPv6 using the /inet6 special file, or /inet4 to force IPv4.

The Internet is awash with Awk/Gawk scripts that users might want to run while worrying that the scripts will do more than what's advertised. To address this, Gawk 4.0.0 has a sandbox option (--sandbox), which restricts Gawk to operating on the input data that's been specified. It does this by disabling Gawk's system() function, input redirection using getline, and output redirection using the print and printf statements.

However, Robbins cautions against being overconfident in the security it provides:

It was contributed by a user who felt a need for it, IIRC, for use in Web CGI scripts where you don't want someone to send in malicious data that can trick the script into writing in your filesystem. It makes a certain amount of sense to have an option like that. It is most definitely *not* intended to make any promises of security.

The sandbox mode is not on by default, says Robbins, because it would break "an untold number of existing awk scripts". In short, this option may be useful, but the features disabled by the sandbox option may not be the only way a malicious script could harm a user's system.

The 4.0.0 release is the end of the line for some options and several old and unsupported operating systems. The redundant --compat, --copyleft, and --usage options are gone. The option for raw sockets has been removed, as it was not implemented. If you're still on Amiga, BeOS, Cray, NeXT, SunOS 3.x, MIPS RiscOS, or a handful of others, Gawk 3.1.8 is the final supported release. That the Gawk team has dropped those platforms is no surprise — that they've been carried so long past their expiration date is. It would be challenging indeed to find new proprietary software that's supported on BeOS or Amiga.

With the Gawk 4.0.0 release out of the way, Robbins says that the "big ticket" items for upcoming releases are to merge Gawk's three executables (gawk, pgawk for profiling, and dgawk for debugging) into one to reduce the installation footprint. He also says that Haque "has some other plans related to performance, but that's about all I can say about it in public". Robbins also says that there are plans to merge in some of the XMLgawk extensions. (XMLgawk is an extension of Gawk that has an XML parsing library based on the Expat XML parser.)

Robbins also has a few ideas listed in the Gawk roadmap on his site, which includes support for multiple-precision floating-point (MPFR) so gawk can use infinite precision numbers. He notes that it will be a "big job" and has yet to decide whether MPFR support would be on by default. Gawk is released as it's ready, so there are no dates specified as to when the features can be expected.

The Gawk team is not large, but it's got a healthy set of core contributors. Robbins says that Gawk has six people who maintain ports to different systems, one who handles testing on "a zillion different Unix systems", one contributor who helped out with documentation, and "various other people such as the xmlgawk developers, and several people from different GNU/Linux distributions". Naturally, this also includes Robbins and Haque.

Though Awk is not a particularly "sexy" language these days, it's still a go-to for system administrators and developers. It's good to see that the GNU Project is not only maintaining Gawk, but adding interesting new features that help keep it relevant.



A look at Gawk 4.0.0

Posted Jul 8, 2011 1:00 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link] (13 responses)

Any AWK script that requires debugger should probably be rewritten in Python or Perl.

A look at Gawk 4.0.0

Posted Jul 8, 2011 6:56 UTC (Fri) by janneke (guest, #15012) [Link] (3 responses)

"Any AWK script that requires debugger should probably be rewritten in Python or Perl."
Of course, if you inherit such a script, you may need a debugger to be able to do that.

A look at Gawk 4.0.0

Posted Jul 8, 2011 12:29 UTC (Fri) by nix (subscriber, #2304) [Link] (2 responses)

Also, if there's one obscure bug in a 50K-long awk script being used for critical stuff, fixing that one bug is likely to be ever so much less disruptive than rewriting the whole thing in another language.

A look at Gawk 4.0.0

Posted Jul 8, 2011 14:20 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link] (1 responses)

50k lines long AWK script???

/me runs away screaming.

A look at Gawk 4.0.0

Posted Jul 8, 2011 16:57 UTC (Fri) by nix (subscriber, #2304) [Link]

It grew, slowly. Its growth will continue until it consumes the Earth. There is no escape.

(but, seriously, pgawk was damn useful in making something that size work at sane speeds!)

A look at Gawk 4.0.0

Posted Jul 8, 2011 15:24 UTC (Fri) by HelloWorld (guest, #56129) [Link] (8 responses)

s/that requires debugger//

A look at Gawk 4.0.0

Posted Jul 8, 2011 17:06 UTC (Fri) by oak (guest, #2786) [Link] (6 responses)

Awk & awk scripts are still useful.

For example Perl & Python are huge, their size is in MBs whereas configuring awk[1] to Busybox increases Busybox size only by some KBs. Sometimes size matters. Sometimes Perl or Python aren't otherwise available.

[1] Busybox AWK isn't 100% POSIX compliant, but the few POSIX features it's lacking are marginal (I've only bumped into one, some rarely used printf format feature which I no longer remember).

A look at Gawk 4.0.0

Posted Jul 8, 2011 19:51 UTC (Fri) by SiB (subscriber, #4048) [Link] (5 responses)

I use gawk almost daily when analysing data files with gnuplot. Like this :-)

plot "<awk -v I=7 -f RSH2.awk --s '/^E/ && $F1<1000*mV && $F2<1000*mV && $BH<20*mV && $DH>15*mV{print $F1/mV, $F2/mV}' radpf-2011-06-27-muons-horizontal-3.E | ./hist.py -s 0.5 -S 0.5" with image

Oh Noes!

Posted Jul 8, 2011 21:13 UTC (Fri) by felixfix (subscriber, #242) [Link] (2 responses)

You are providing fodder for those who want Jon to start deleting comments.

Oh Noes!

Posted Jul 8, 2011 21:28 UTC (Fri) by nix (subscriber, #2304) [Link] (1 responses)

It's no uglier than MUMPS code.

(In any case, I'd look out: from the look of that comment, SiB could be firing ionized plasma from an underground death ray any time now.[1])

[1] sure, they *say* it's a particle accelerator...

Oh Noes!

Posted Jul 11, 2011 1:51 UTC (Mon) by zlynx (guest, #2285) [Link]

Particle Accelerator or Death Ray.

It all depends on which direction you point the thing.

A look at Gawk 4.0.0

Posted Jul 11, 2011 15:35 UTC (Mon) by rsidd (subscriber, #2582) [Link] (1 responses)

Ah, that's where perl gets its line noise from.

A look at Gawk 4.0.0

Posted Jul 11, 2011 15:47 UTC (Mon) by paulj (subscriber, #341) [Link]

AWK is nowhere near as bad as Perl, imo. There are no variable context operators that mean you have to stick funny chars in front of every variable reference (vars are referenced just by name - $ is just a field dereference operator that acts on vars). The ugliness here is primarily just from its inherent nature as being a very long conditional to control a print statement, and squashing it all on one line.

A look at Gawk 4.0.0

Posted Jul 11, 2011 13:59 UTC (Mon) by stevem (subscriber, #1512) [Link]

Yet more language trolling. Nothing to see here, move on...

A look at Gawk 4.0.0

Posted Jul 8, 2011 2:01 UTC (Fri) by mhw (guest, #13931) [Link]

MPFR supports arbitrary precision, not "infinite precision".

A look at Gawk 4.0.0

Posted Jul 14, 2011 8:07 UTC (Thu) by Duncan (guest, #6647) [Link]

FWIW, gawk 4.0 is incompatible with the virtuoso-server (thru 6.1.3 at least) build process, triggering bad C code as generated from SQL files using gawk.

Virtuoso-server is a backend required by soprano when it in turn is supporting nepomuk, a kde4 base "semantic desktop" technology.

https://bugs.gentoo.org/show_bug.cgi?id=374143

https://sourceforge.net/tracker/?func=detail&aid=3358...

As a result, I had to downgrade back to gawk 3.1.8, for the time being.

Duncan (Yes, running KDE, KDE 4.6.95 aka 4.7-rc2 ATM, to be precise, on Gentoo).


Copyright © 2011, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds