Development

A look at Gawk 4.0.0

July 7, 2011

This article was contributed by Joe 'Zonker' Brockmeier.

GNU Awk (Gawk) is one of those workhorse utilities that usually doesn't make news. The 4.0.0 release, however, deserves a look. Announced on June 30th, the latest iteration of Gawk brings the first Gawk debugger, a sandbox mode for running less trusted scripts, revised internals, a number of changes to regular expressions, and IPv6 compatibility.

Gawk is one implementation of Awk. Named for the last names of its inventors (Alfred Aho, Peter Weinberger, and Brian Kernighan), Awk is a scripting language that's standard across UNIX platforms. By standard I mean that Awk has been part of a standard UNIX (or UNIX-like) system since the late 1970s, and that it is part of the POSIX specification from the Open Group. Even though Gawk is one of many Awks, it stands out as one of the most widely used. What is it used for? Gawk is best known for data extraction and reporting, though it has also been used to write IRC bots and a YouTube downloader, and for AI programming.

New in 4.0.0

Gawk 4.0.0 is a fairly hefty update from the 3.1.8 release. According to the release announcement, Gawk not only packs a number of new features, bugfixes, and some updates to comply with POSIX 2008, it also has "revamped internals."

To find out what was revamped, and why, I asked Gawk maintainer Arnold Robbins by email. It turns out that the revamp has a lengthy history. Robbins says that "some years ago" John Haque took on rewriting Gawk's internals around a byte-code-style engine and implementing a debugger in the process. Unfortunately, that work wasn't integrated and Haque moved on. In early 2010, Robbins started trying to bring Haque's code up to date. According to Robbins, the rewrite doesn't provide a huge performance boost, but it does bring a major useful feature:

The performance is about the same (or slightly better) than the original internals, and I have not yet found a case where it's worse. But the really big gain, and why I wanted to have the change, is that gawk now provides an awk-level debugger (similar to GDB).

Right now the debugger, installed as dgawk, is usable but still limited. It does not explain what is wrong with input it rejects; it reports only "syntax error" when there's a problem. It also works only on programs started from within dgawk on the command line and cannot be attached to a running Awk program. It is unlikely that the Gawk developers will focus on adding that functionality, since the Gawk manual notes that limiting debugging to programs started from within dgawk "seems reasonable for a language which is used mainly for quickly executing, short programs."

Gawk's regular expressions have undergone some changes in 4.0.0. Interval expressions are now part of the default syntax for Gawk, and no longer require the --re-interval or --posix options. In an interval expression, one or two numbers inside braces (such as {n} or {n,m}) tell Gawk to match the preceding regular expression exactly n times, or anywhere from n through m times; the construct was not part of the original Awk specification. Also, \s and \S have been added to match any white space character and any character that is not white space, respectively.

While Gawk tries to be POSIX-compliant, it does have features above and beyond POSIX — and 4.0.0 introduces a few more. Gawk now supports two new patterns, BEGINFILE and ENDFILE, that can be used to perform actions before reading a file and after (respectively). These are similar to BEGIN/END rules, but are applied before and after reading individual files (since Gawk may process two or more files while running any given script). For example, Gawk programs can now test to see if a file is readable before trying to process it. In prior versions of Gawk, this was not possible — so a script would fail with a fatal error if a file passed to Gawk was not readable.
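The readability check might look something like the following sketch. It assumes gawk 4.0.0 or later and relies on gawk setting ERRNO inside a BEGINFILE rule when a file cannot be opened; the file names and the line-counting task are invented for the example.

```shell
# Scratch directory with two readable files; missing.txt is never created.
tmp=$(mktemp -d)
printf 'one\ntwo\n' > "$tmp/a.txt"
printf 'three\n'    > "$tmp/b.txt"

(cd "$tmp" && gawk '
BEGINFILE {
    # A non-empty ERRNO here means gawk could not open FILENAME.
    if (ERRNO != "") {
        print "skipping " FILENAME ": " ERRNO > "/dev/stderr"
        nextfile          # move on instead of dying with a fatal error
    }
    count = 0             # reset the per-file counter
}
{ count++ }
ENDFILE { print FILENAME ": " count " line(s)" }
' a.txt missing.txt b.txt)

rm -rf "$tmp"
```

Standard output should show a count for a.txt and b.txt only, with the complaint about missing.txt going to standard error. The ENDFILE rule is not run for a file skipped from BEGINFILE, so no bogus count is printed for it.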

Gawk has long had the ability to work over a network connection. With the 4.0.0 release, Gawk supports IPv6 using the /inet6 special file, or /inet4 to force IPv4.

The Internet is awash with Awk/Gawk scripts that users might want to run, but users may worry that the scripts will do more than what's advertised. To address this, Gawk 4.0.0 has a sandbox option (--sandbox), which restricts Gawk to operating on the input data that's been specified. It does this by disabling Gawk's system() function, input redirection using getline, and output redirection using the print and printf statements.

However, Robbins cautions against being overconfident in the security it provides:

It was contributed by a user who felt a need for it, IIRC, for use in Web CGI scripts where you don't want someone to send in malicious data that can trick the script into writing in your filesystem. It makes a certain amount of sense to have an option like that. It is most definitely *not* intended to make any promises of security.

The sandbox mode is not on by default, says Robbins, because it would break "an untold number of existing awk scripts." In short, this option may be useful, but the features disabled by the sandbox option may not be the only way a malicious script could harm a user's system.
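To make the sandbox restrictions concrete, here is a small sketch, assuming gawk 4.0.0 or later on the PATH; the scratch directory and file names are hypothetical. It runs the same output redirection with and without --sandbox.

```shell
tmp=$(mktemp -d)

# Normal mode: the script is free to write wherever it likes.
gawk -v out="$tmp/normal.txt" 'BEGIN { print "written" > out }'
ls "$tmp"                       # normal.txt

# Sandbox mode: the same redirection is a fatal error and no file appears.
gawk --sandbox -v out="$tmp/sandboxed.txt" 'BEGIN { print "written" > out }' \
    || echo "gawk exited with status $?"
ls "$tmp"                       # still only normal.txt

# Reading the supplied input and printing to standard output still works.
printf 'alpha beta\n' | gawk --sandbox '{ print $2 }'

rm -rf "$tmp"
```

This only demonstrates the redirection check; as Robbins notes, it should not be read as evidence that a sandboxed script is harmless.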

The 4.0.0 release is the end of the line for some options and several old and unsupported operating systems. The redundant --compat, --copyleft, and --usage options are gone. The option for raw sockets has been removed, as it was not implemented. If you're still on Amiga, BeOS, Cray, NeXT, SunOS 3.x, MIPS RiscOS, or a handful of others, Gawk 3.1.8 is the final supported release. That the Gawk team has dropped those platforms is no surprise — that they've been carried so long past their expiration date is. It would be challenging indeed to find new proprietary software that's supported on BeOS or Amiga.

With the Gawk 4.0.0 release out of the way, Robbins says that the "big ticket" items for upcoming releases include merging Gawk's three executables (gawk, pgawk for profiling, and dgawk for debugging) into one to reduce the installation footprint. He also says that Haque "has some other plans related to performance, but that's about all I can say about it in public." There are also plans to merge in some of the XMLgawk extensions. (XMLgawk is an extension of Gawk that has an XML parsing library based on the Expat XML parser.)

Robbins also has a few ideas listed in the Gawk roadmap on his site, which include support for multiple-precision floating-point (MPFR) so gawk can use arbitrary-precision numbers. He notes that it will be a "big job" and has yet to decide whether MPFR support would be on by default. Gawk is released as it's ready, so there are no dates specified as to when the features can be expected.

The Gawk team is not large, but it's got a healthy set of core contributors. Robbins says that Gawk has six people who maintain ports to different systems, one who handles testing on "a zillion different Unix systems," one contributor who helped out with documentation, and "various other people such as the xmlgawk developers, and several people from different GNU/Linux distributions." Naturally, this also includes Robbins and Haque.

Though Awk is not a particularly "sexy" language these days, it's still a go-to for system administrators and developers. It's good to see that the GNU Project is not only maintaining Gawk, but adding interesting new features that help keep it relevant.

Brief items

Quotes of the week

If we take the version number away and update people silently, the users don't have to think and won't think about any of it.
-- Asa Dotzler

<fellow> Maybe Perl isn't given to over-magic line-noisy crap. I hear there's even a new version. What'd it get us?

<japh> ~~, for smart matching, with 27-way recursive runtime dispatch by operand type!

<fellow> ...

-- Ricardo Signes

CERN's Open Hardware License v1.1

CERN has announced the release of version 1.1 of its Open Hardware License. "'For us, the drive towards open hardware was largely motivated by well-intentioned envy of our colleagues who develop Linux device-drivers,' said Javier Serrano, an engineer at CERN's Beams Department and the founder of the OHR. 'They are part of a very large community of designers who share their knowledge and time in order to come up with the best possible operating system. We felt that there was no intrinsic reason why hardware development should be any different.'" CERN also maintains the Open Hardware Repository as a collecting point for free hardware designs.

Mercurial 1.9 released

Version 1.9 of the Mercurial distributed source code management system is out. New features include a functional file set matching language, a new command server mode, and more; see the release notes for details.

notmuch 0.6 released

After a long hiatus, the notmuch mail indexing project has put out the 0.6 release. New features include folder-based search, PGP/MIME support, some new automatic tags, a number of performance improvements, an initial set of Go bindings, and a lot more. (LWN looked at notmuch in March 2010.)

Newsletters and articles

Development newsletters from the last week

Systemd for Developers II

Lennart Poettering has written a second installment in his "systemd for developers" series. "This time we'll focus on adding socket activation support to real-life software, more specifically the CUPS printing server. Most current Linux desktops run CUPS by default these days, since printing is so basic that it's a must have, and must just work when the user needs it. However, most desktop CUPS installations probably don't actually see more than a handful of print jobs each month... That all together makes CUPS a perfect candidate for lazy activation: instead of starting it unconditionally at boot we just start it on-demand, when it is needed. That way we can save resources, at boot and at runtime."

Zeuthen: Writing a C library, part 3

The third part of David Zeuthen's guide to writing low-level libraries looks at modularity, error handling, and object-oriented design. "Even with a library doing proper parameter validation (to catch programmer errors early on), if you pass garbage to a function you usually end up with undefined behavior and undefined behavior can mean anything including formatting your hard disk or evaporating all booze in a five-mile radius (oh noz). That's why some libraries simply calls abort() instead of carrying on pretending nothing happened."

Zeuthen: Writing a C library, parts 4 and 5

David Zeuthen continues to crank out updates to his "Writing a C library" series faster than we can point to them; part 4 (helpers, daemons, and testing) and part 5 (API design, documentation, and versioning) are now out. "A C library is, almost by definition, something that offers an API that is used in applications. Often an API can't be changed in incompatible ways (it can, however, be extended) so it is usually important to get right the first time because if you don't, you and your users will have to live with your mistakes for a long time."

Not much in new Thunderbird 5, but roadmap looks promising (ars technica)

The Thunderbird mail client gets a review of its version 5 release over at ars technica. "In addition to moving to the Gecko 5 engine, Thunderbird also brings some other improvements. Thunderbird 5 has gained Firefox's slick new tab-hosted add-on management user interface. Startup time has noticeably improved in the new version, allowing the user to start working with the application sooner after startup."

FabFi: An open source wireless network built with trash (opensource.com)

Opensource.com has a report on FabFi, which is an effort to build low-cost wireless network infrastructure that can operate independently from governments and wireless data companies. "And the main components can be built out of trash. Some boards, wires, plastic tubs, and cans can build you a FabFi node. The design of the node purposefully uses things that are widely available wherever the project takes place. Users in Afghanistan discovered that instead of requiring specialty made reflectors, they could use the metal from USAID vegetable oil cans because it turns out to be the right malleability and size for these reflectors."

Interview with Lennart Poettering (LinuxFR.org)

LinuxFR talks with Lennart Poettering about his work on Avahi, PulseAudio, and systemd. "You should never forget that in the whole industry there are about 3.5 people paid full-time for doing generic maintainance work of the Linux audio stack (which I consider consisting primarily of ALSA and PulseAudio and a few things around it). With this little manpower I can only say that what has been achieved is pretty good. While we still can't fully match competing audio stacks like CoreAudio, we are a lot closer than we ever were. I do hope that the folks who kept constantly complaining would be a lot more appreciative if they understood that."

PiTiVi Video Editor Now Kitten-Friendly (Linux.com)

Linux.com has posted a review of PiTiVi 0.14. "PiTiVi is a GStreamer-based non-linear video editor (NLE) developed by members of the GStreamer project itself. That means it is often the first project to showcase new features, and last month's new release is no exception. The major new feature is support for audio and video filter 'effects' but there are usability and speed improvements worth examining, too."

Paley: Why are the Freedoms guaranteed for Free Software not guaranteed for Free Culture?

Here is a "rantifesto" from Nina Paley, who is frustrated that the freedoms guaranteed by free software licenses aren't always present in other types of works. "Cultural works released by the Free Software Foundation come with 'No Derivatives' restrictions... The problem with this is that it is dead wrong. You do not know what purposes your works might serve others. You do not know how works might be found 'practical' by others. To claim to understand the limits of 'utility' of cultural works betrays an irrational bias toward software and against all other creative work. It is anti-Art, valuing software above the rest of culture. It says coders alone are entitled to Freedom, but everyone else can suck it. Use of -ND restrictions is an unjustifiable infringement on the freedom of others." (Thanks to Davide Del Vento).

Brazilian government signs up to develop OpenOffice and LibreOffice (The H)

The H reports on an announcement at the FISL conference. "The Brazilian government has signed a letter of intent to work with both The Document Foundation and the Apache OpenOffice.org community to develop the Office Suite platforms maintained by both communities. The letter asserts that the ODF standard is already a guarantee of interoperability within the government. As Brazil is one of the biggest users of both LibreOffice and OpenOffice with an estimated million public computers running the free/open source office suites, the [government] aims to make the national contribution to the projects more effective."

Page editor: Jonathan Corbet

Copyright © 2011, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds