GNU Awk (Gawk) is one of those workhorse utilities that usually doesn't make news. The 4.0.0 release, however, deserves a look. Announced on June 30th, the latest iteration of Gawk brings the first Gawk debugger, a sandbox mode for running less trusted scripts, revised internals, a number of changes to regular expressions, and IPv6 compatibility.
Gawk is one implementation of Awk. Named for the last names of its
inventors (Alfred Aho, Peter Weinberger, and Brian Kernighan), Awk is a
scripting language that's standard across UNIX platforms — and by
standard I mean that Awk has been part of a standard UNIX (or UNIX-like)
system since the beginning, as well as part of the POSIX
specification from the Open Group. Even though Gawk is one of many
Awks, it stands out as being one of the most widely used. What is it used
for? Gawk is best known for data extraction and reporting, though it has
also been used to write IRC bots and a YouTube downloader, and for AI
programming.
New in 4.0.0
Gawk 4.0.0 is a fairly hefty update from the 3.1.8 release. According to
the release announcement, Gawk not only packs a number of new features,
bugfixes, and some updates to comply with POSIX 2008, it also has "revamped internals."
To find out what was revamped, and why, I asked Gawk maintainer Arnold
Robbins by email. It turns out that the revamp has a lengthy
history. Robbins says that "some years ago" John Haque took on
rewriting Gawk's internals using a byte-code-style engine, and
implementing a debugger in the process. Unfortunately, that work wasn't
integrated and Haque moved on. In early 2010, Robbins started trying to bring Haque's code up to date. According to Robbins, the rewrite doesn't provide a huge performance boost, but it does bring a major useful feature:
"The performance is about the same (or slightly better) than the original internals, and I have not yet found a case where it's worse. But the really big gain, and why I wanted to have the change, is that gawk now provides an awk-level debugger (similar to GDB)."
Right now, dgawk (the debugger executable) is usable, but still limited.
When there's a problem, it reports only "syntax error" rather than saying
what the error is. The debugger also works only on programs started from
within it on the command line; it cannot be attached to a running Awk
program. It is unlikely that the Gawk developers will focus on adding that
functionality, since the Gawk manual notes that limiting debugging to
programs started from within dgawk "seems reasonable for a language which is used mainly for quickly executing, short programs."
Gawk's regular expressions have undergone some changes in
4.0.0. Interval expressions are now part of the default syntax for Gawk,
and no longer require the --re-interval or --posix
options. Interval expressions, where one or two numbers inside
braces (such as {n} or {n,m}) tell Gawk to match a
regular expression n times or n through m times,
were not part of the original Awk specification. Also,
\s and \S have been added to match any white
space character and any character that is not white space, respectively.
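As a minimal sketch of the syntax described above, assuming Gawk 4.0.0 or later is installed as gawk, the shell helpers below wrap two one-line programs. The function names and the patterns are illustrative choices, not taken from the Gawk documentation.

```shell
# Hypothetical helper: print input lines that consist of a 3- to 5-digit
# number, using an interval expression with no extra options.
match_short_numbers() {
    gawk '/^[0-9]{3,5}$/'
}

# Hypothetical helper: print input lines made of exactly two
# non-whitespace runs separated by whitespace, using \S and \s.
match_two_words() {
    gawk '/^\S+\s+\S+$/'
}
```

For instance, `printf '12\n123\n' | match_short_numbers` prints only 123, since 12 is too short to satisfy {3,5}.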
While Gawk tries to be POSIX-compliant, it does have features above and
beyond POSIX — and 4.0.0 introduces a few more. Gawk now supports two
new patterns, BEGINFILE and ENDFILE, that can be
used to perform actions before reading a file and after
(respectively). These are similar to BEGIN/END rules, but are
applied before and after reading individual files (since Gawk may process
two or more files while running any given script). For example, Gawk programs can now test to see if a file is readable before trying to process it. In prior versions of Gawk, this was not possible — so a script would fail with a fatal error if a file passed to Gawk was not readable.
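The readability test can be sketched as follows, assuming Gawk 4.0.0 or later. As the Gawk manual describes it, ERRNO is non-empty inside a BEGINFILE rule when the file could not be opened, and calling nextfile there skips the file instead of triggering the fatal error. The helper name and the per-file line count are illustrative.

```shell
# Hypothetical helper: print a line count for each readable file and
# skip unreadable ones with a warning on standard error.
count_lines_per_file() {
    gawk '
        BEGINFILE {
            # ERRNO is non-empty when gawk could not open FILENAME.
            if (ERRNO != "") {
                print "skipping " FILENAME ": " ERRNO > "/dev/stderr"
                nextfile
            }
            lines = 0
        }
        { lines++ }
        ENDFILE { print FILENAME ": " lines }
    ' "$@"
}
```

Called as `count_lines_per_file a.txt missing.txt b.txt`, this should report counts for the two readable files and a warning for the missing one, rather than stopping at the first unreadable file.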
Gawk has long had the ability to work over a network connection. With the 4.0.0 release, Gawk supports IPv6 using the /inet6 special file, or /inet4 to force IPv4.
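A rough sketch of the IPv6 special file in use, assuming Gawk 4.0.0 or later and working IPv6 connectivity. Gawk's network special files take the form /net-type/protocol/local-port/remote-host/remote-port and are read and written with the |& coprocess operator. The helper name, the choice of port 80, and the HTTP request are assumptions made for illustration.

```shell
# Hypothetical helper: send an HTTP HEAD request over IPv6 and print
# the reply. $1 is the remote host name.
http_head_v6() {
    gawk -v host="$1" '
        BEGIN {
            # Local port 0 lets the system pick one; 80 is the remote port.
            service = "/inet6/tcp/0/" host "/80"
            printf "HEAD / HTTP/1.0\r\nHost: %s\r\n\r\n", host |& service
            while ((service |& getline line) > 0)
                print line
            close(service)
        }
    '
}
```

Replacing /inet6 with /inet4 in the service string would force the same request over IPv4.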
The Internet is awash with Awk/Gawk scripts that users might want to run, while worrying that the scripts will do more than what's advertised. To address this, Gawk 4.0.0 has a sandbox option (--sandbox), which restricts Gawk to operating on the input data that's been specified. It does this by disabling Gawk's system() function, input redirection using getline, and output redirection using the print and printf statements.
However, Robbins cautions against being overconfident in the security it
provides:
"It was contributed by a user who felt a need for it,
IIRC, for use in Web CGI scripts where you don't want someone to send in
malicious data that can trick the script into writing in your filesystem.
It makes a certain amount of sense to have an option like that. It is most
definitely *not* intended to make any promises of security."
The sandbox mode is not on by default, says Robbins, because it would break
"an untold number of existing awk scripts." In short, this
option may be useful, but the features disabled by the sandbox option may
not be the only way a malicious script could harm a user's system.
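As a small illustration of the option, assuming Gawk 4.0.0 or later, the wrapper below runs a script file with the sandbox enabled. The function name is hypothetical, and, in line with Robbins's caution, this is a convenience and not a security boundary.

```shell
# Hypothetical wrapper: run a downloaded Awk script with --sandbox.
# $1 is the script file; any remaining arguments are input files.
run_untrusted() {
    script=$1
    shift
    gawk --sandbox -f "$script" "$@"
}
```

A script that only transforms its input, such as `{ print toupper($0) }`, runs normally under this wrapper. One that calls system() or redirects print to a file should stop with a fatal error and a non-zero exit status.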
The 4.0.0 release is the end of the line for some options and several
old and unsupported operating systems. The redundant --compat, --copyleft,
and --usage options are gone. The option for raw sockets has been removed,
as it was not implemented. If you're still on Amiga, BeOS, Cray, NeXT,
SunOS 3.x, MIPS RiscOS, or a handful of others, Gawk 3.1.8 is the final
supported release. That the Gawk team has dropped those platforms is no
surprise — that they've been carried so long past their expiration
date is. It would be challenging indeed to find new proprietary software
that's supported on BeOS or Amiga.
With the Gawk 4.0.0 release out of the way, Robbins says that the "big ticket" items for upcoming releases are to merge Gawk's three executables (gawk, pgawk for profiling, and dgawk for debugging) into one to reduce the installation footprint. He also says that Haque "has some other plans related to performance, but that's about all I can say about it in public." Robbins also says that there are plans to merge in some of the XMLgawk extensions. (XMLgawk is an extension of Gawk that has an XML parsing library based on the Expat XML parser.)
Robbins also has a few ideas listed in the Gawk roadmap on his site, including support for multiple-precision floating-point arithmetic (via the MPFR library) so that gawk can use arbitrary-precision numbers. He notes that it will be a "big job" and has yet to decide whether MPFR support would be on by default. Gawk is released as it's ready, so there are no dates specified as to when the features can be expected.
The Gawk team is not large, but it's got a healthy set of core contributors. Robbins says that Gawk has six people who maintain ports to different systems, one who handles testing on "a zillion different Unix systems," one contributor who helped out with documentation, and "various other people such as the xmlgawk developers, and several people from different GNU/Linux distributions." Naturally, this also includes Robbins and Haque.
Though Awk is not a particularly "sexy" language these days, it's still
a go-to for system administrators and developers. It's good to see that the
GNU Project is not only maintaining Gawk, but adding interesting new
features that help keep it relevant.
Brief items
If we take the version number away and update people silently, the
users don't have to think and won't think about any of it.
-- Asa Dotzler
<fellow> Maybe Perl isn't given to over-magic line-noisy crap. I hear there's
even a new version. What'd it get us?
<japh> ~~, for smart matching, with 27-way recursive runtime dispatch by
operand type!
<fellow> ...
-- Ricardo Signes
![[OHR logo]](/images/2011/ohr.png)
CERN has announced the release of version 1.1 of its Open Hardware License.
"'For us, the drive towards open hardware was largely motivated by
well-intentioned envy of our colleagues who develop Linux device-drivers,'
said Javier Serrano, an engineer at CERN's Beams Department and the founder
of the OHR. 'They are part of a very large community of designers who share
their knowledge and time in order to come up with the best possible
operating system. We felt that there was no intrinsic reason why hardware
development should be any different.'" CERN also maintains the
Open Hardware Repository as a collecting point for free
hardware designs.
Version 1.9 of the Mercurial
distributed source code management system is out. New features include a
functional file set matching language, a new command server mode, and more;
see the release notes for details.
After a long hiatus, the notmuch mail indexing project has put out the 0.6
release. New features include folder-based search, PGP/MIME support, some
new automatic tags, a number of performance improvements, an initial set of
Go bindings, and a lot more. (LWN looked at notmuch in March 2010.)
Newsletters and articles
Lennart Poettering has written a second
installment in his "systemd for developers" series. "This time
we'll focus on adding socket activation support to real-life software, more
specifically the CUPS printing server. Most current Linux desktops run CUPS
by default these days, since printing is so basic that it's a must have,
and must just work when the user needs it. However, most desktop CUPS
installations probably don't actually see more than a handful of print jobs
each month... That all together makes CUPS a perfect
candidate for lazy activation: instead of starting it unconditionally at
boot we just start it on-demand, when it is needed. That way we can save
resources, at boot and at runtime."
The third
part of David Zeuthen's guide to writing low-level libraries looks at
modularity, error handling, and object-oriented design. "Even with a
library doing proper parameter validation (to catch programmer errors early
on), if you pass garbage to a function you usually end up with undefined
behavior and undefined behavior can mean anything including formatting your
hard disk or evaporating all booze in a five-mile radius (oh noz). That's
why some libraries simply calls abort() instead of carrying on pretending
nothing happened."
David Zeuthen continues to crank out updates to his "Writing a C library"
series faster than we can point to them;
part 4 (helpers, daemons, and testing) and
part 5 (API design, documentation, and versioning) are now out. "A C
library is, almost by definition, something that offers an API that is used
in applications. Often an API can't be changed in incompatible ways (it
can, however, be extended) so it is usually important to get right the
first time because if you don't, you and your users will have to live with
your mistakes for a long time."
The Thunderbird mail client gets a
review of its version 5 release over at ars technica. "In addition to moving to the Gecko 5 engine, Thunderbird also brings some other improvements. Thunderbird 5 has gained Firefox's slick new tab-hosted add-on management user interface. Startup time has noticeably improved in the new version, allowing the user to start working with the application sooner after startup."
Opensource.com has a report on
FabFi, which is an effort to build low-cost wireless network infrastructure that can operate independently from governments and wireless data companies. "And the main components can be built out of trash. Some boards, wires, plastic tubs, and cans can build you a FabFi node. The design of the node purposefully uses things that are widely available wherever the project takes place. Users in Afghanistan discovered that instead of requiring specialty made reflectors, they could use the metal from USAID vegetable oil cans because it turns out to be the right malleability and size for these reflectors."
LinuxFR talks
with Lennart Poettering about his work on Avahi, PulseAudio, and systemd. "You should never forget that in the whole industry there are about 3.5 people paid full-time for doing generic maintainance work of the Linux audio stack (which I consider consisting primarily of ALSA and PulseAudio and a few things around it). With this little manpower I can only say that what has been achieved is pretty good. While we still can't fully match competing audio stacks like CoreAudio, we are a lot closer than we ever were. I do hope that the folks who kept constantly complaining would be a lot more appreciative if they understood that."
Linux.com has posted a
review of PiTiVi 0.14. "PiTiVi is a GStreamer-based non-linear
video editor (NLE) developed by members of the GStreamer project
itself. That means it is often the first project to showcase new features,
and last month's new release is no exception. The major new feature is
support for audio and video filter 'effects' but there are usability and
speed improvements worth examining, too."
Here is a
"rantifesto" from Nina Paley, who is frustrated that the freedoms
guaranteed by free software licenses aren't always present in other types
of works. "Cultural works released by the Free Software Foundation
come with 'No Derivatives' restrictions... The problem with this is that
it is dead wrong. You do not know what purposes your works might serve
others. You do not know how works might be found 'practical' by others. To
claim to understand the limits of 'utility' of cultural works betrays an
irrational bias toward software and against all other creative work. It is
anti-Art, valuing software above the rest of culture. It says coders alone
are entitled to Freedom, but everyone else can suck it. Use of -ND
restrictions is an unjustifiable infringement on the freedom of
others." (Thanks to Davide Del Vento.)
The H reports on an announcement at the
FISL conference. "The Brazilian government has signed a letter of intent to work with both The Document Foundation and the Apache OpenOffice.org community to develop the Office Suite platforms maintained by both communities. The letter asserts that the ODF standard is already a guarantee of interoperability within the government. As Brazil is one of the biggest users of both LibreOffice and OpenOffice with an estimated million public computers running the free/open source office suites, the [government] aims to make the national contribution to the projects more effective."
Page editor: Jonathan Corbet