LWN: Comments on "The state of the AWK"
https://lwn.net/Articles/820829/
This is a special feed containing comments posted to the individual LWN article titled "The state of the AWK".

The state of the AWK (RobertX, Fri, 20 Nov 2020 18:42:04 +0000) https://lwn.net/Articles/837955/
I don't know why, but I really like how AWK 5.X does namespacing.

Surprisingly relevant? (nenad_noveljic, Thu, 19 Nov 2020 13:29:05 +0000) https://lwn.net/Articles/837728/
Forking is an expensive OS call. It might not cause a problem when used occasionally on the command line, but it will consume substantial kernel CPU if done on a large scale.

Surprisingly relevant? (motiejus, Thu, 19 Nov 2020 11:57:29 +0000) https://lwn.net/Articles/837723/
Recently I used awk to filter a few hundred gigabytes of LIDAR data, clipping it to a bounding box I was interested in:

    #!/usr/bin/awk -f
    BEGIN { FS = "," }
    $1 > ymin && $1 < ymax && $2 > xmin && $2 < xmax { print $2 "," $1 "," $3 }

I just ran this again on a data subset (100M data points, 2.7GB uncompressed) just to have numbers for this comment. My 8-core laptop did the whole operation in 29 seconds:
1. each file: unzip to memory.
2. each file: run through the program above for the bounding box.
3. each file: sort.
4. all files: merge sort.
5. all files: compress.

Combined with GNU Make, `sort` and `sort -m`, I can't imagine a more powerful combination of tools for this simple "big data"(?) task.

No, awk is not dead, and spending half an hour[1] is enough to be able to use it for life. :)

[1]: https://ferd.ca/awk-in-20-minutes.html

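A sketch of how the per-file steps above might be wired together in a shell; the tile names, bounds, and the assumption of zip-compressed CSV input are illustrative, the awk filter is the one from the comment, and the GNU Make driver the commenter mentions is not shown:

    # per tile: decompress to stdout, clip to the bounding box, sort
    unzip -p tile_0001.zip \
      | awk -F, -v ymin=54.6 -v ymax=54.7 -v xmin=25.2 -v xmax=25.3 \
          '$1 > ymin && $1 < ymax && $2 > xmin && $2 < xmax { print $2 "," $1 "," $3 }' \
      | sort > tile_0001.filtered

    # all tiles: merge the already-sorted results and compress
    sort -m tile_*.filtered | gzip > clipped.csv.gz
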
mawk in Ubuntu 20.04 (ThomasDickey, Sun, 19 Jul 2020 19:12:06 +0000) https://lwn.net/Articles/826449/
1.3.4 was released in 2009 - see the changelog:

https://invisible-island.net/mawk/CHANGES.html#t20091220

(1.3.5 is a different matter)

The state of the AWK (benhoyt, Tue, 02 Jun 2020 19:12:55 +0000) https://lwn.net/Articles/822056/
Another article (from very recently - May 2020) about someone using AWK to process "big data". Very interesting read: https://ketancmaheshwari.github.io/posts/2020/05/24/SMC18-Data-Challenge-4.html ... the author's tl;dr is "Awk crunches massive data; a High Performance Computing (HPC) script calls hundreds of Awk concurrently. Fast and scalable in-memory solution on a fat machine."

The state of the AWK (amnonbc, Tue, 02 Jun 2020 12:41:32 +0000) https://lwn.net/Articles/822001/
Awk is a masterpiece of minimalism.
Only two features - regular expressions and associative arrays.

A simple tool that does one thing well.
And a clear, expressive, and readable syntax.

A classic of language design, and a pleasure to use!

mawk in Ubuntu 20.04 (mirabilos, Fri, 29 May 2020 21:32:48 +0000) https://lwn.net/Articles/821828/
This was actually done by Boyuan Yang, Debian Developer, and *buntu just copied it.

Oh, and it's a pre-release snapshot; 1.3.4 is apparently not quite released yet. Or Tom Dickey doesn't want to publish a formal release because he just continued development but isn't the formal developer.

But good/sad to see he's picking up another project… he also develops ncurses, cdk, xterm and lynx, and except for cdk I use them daily… here's to hoping fixes there won't be less frequent now ☻

Surprisingly relevant? (jezuch, Tue, 26 May 2020 11:12:40 +0000) https://lwn.net/Articles/821393/
A counter-point from me would be that I took an almost immediate dislike to awk, because I felt that a full-blown imperative language is overkill in a context which asks for a more declarative approach... But I generally favor declarative and functional over imperative wherever that's practical.

The state of the AWK (SiB, Mon, 25 May 2020 19:44:34 +0000) https://lwn.net/Articles/821365/
Apart from one-liners at the shell prompt, I use awk almost daily with gnuplot, to extract the numbers to plot or fit from data files. Awk program files have gawk's -i option on the shebang line. It's still one-liners on the gnuplot command line, but with a comfortable set of constants and functions predefined.

Surprisingly relevant? (anton, Mon, 25 May 2020 15:39:41 +0000) https://lwn.net/Articles/821348/
> If all of the above are truly inadequate to the task at hand, I'm probably better off writing a Python script anyway.

The nice thing about awk is that when you find that your non-awk shell-scripting tools miss a feature, you don't need to rewrite the whole shell script in Python.

As for whether you are better off: in the 1990s a colleague wrote a tool in Perl that I then inherited. I eventually rewrote it as a shell script (including parts in awk), and the result turned out to have half the lines of the Perl variant.

> What does bother me is when people write awk "{ print $1 }" instead of cut -f1. I find the latter more readable.

I am one of these people (except that I use single quotes). Using cut may well be more readable, but then I know how to write it in awk without looking up a man page; that's because I use awk every few days, but not that many of these uses could be replaced with cut.

Surprisingly relevant? (madscientist, Sun, 24 May 2020 17:12:00 +0000) https://lwn.net/Articles/821272/
You're right, grep behaves the same; my bad! I was thinking of some other tools like wc, which have different output when given a filename versus reading from stdin.

This can be useful in scripting to avoid the complexity of stripping off the unwanted filename.

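For illustration, the wc difference being described (the file name and line count here are made up):

    $ wc -l access.log
    1042 access.log        # with a filename argument, wc echoes the name
    $ wc -l < access.log
    1042                   # reading stdin, there is no name to strip off
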
Surprisingly relevant? (Wol, Sun, 24 May 2020 14:02:50 +0000) https://lwn.net/Articles/821270/
Umm...

On first thoughts my reaction was "aren't < and > just different syntaxes for pipes?".

My second thought now is that "no, they aren't actually pipes, they're shell built-ins".

So yeah, you're right. They're pretty much identical in effect (and concept), but different in implementation and impact on the system. There's more than one way to do it... :-)

Cheers,
Wol

Surprisingly relevant? (mpr22, Sun, 24 May 2020 13:50:40 +0000) https://lwn.net/Articles/821269/
In what way does the output of "grep pattern singlefile.txt" differ from the output of "cat singlefile.txt | grep pattern"?

Surprisingly relevant? (madscientist, Sun, 24 May 2020 13:18:32 +0000) https://lwn.net/Articles/821268/
???

There are no pipes in Jandar's suggested alternative.

This feels more like StackOverflow than LWN, but the issue is that "grep foo somefile" gives different output than "cat somefile | grep foo", and if you want the latter behavior while still avoiding UUoC, you should be using "grep foo < somefile" instead.

Surprisingly relevant? (NYKevin, Sat, 23 May 2020 21:51:46 +0000) https://lwn.net/Articles/821261/
Well, personally, I like to think that I'm adhering to the Unix philosophy (each binary does one thing, and does it well, whereas awk seems to want to do "reading and modifying text" well, whatever that's supposed to encompass), but this will quickly degenerate into a flamewar.

Surprisingly relevant? (Wol, Sat, 23 May 2020 20:32:08 +0000) https://lwn.net/Articles/821258/
You're missing the point - the point is to GET RID of pipes.

Every extra pipe is an extra trip round the setup/teardown busywork loop - which, if you pre-allocate memory, could actually be a big problem even if you think you have plenty.

Cheers,
Wol

Surprisingly relevant? (Jandar, Sat, 23 May 2020 19:53:25 +0000) https://lwn.net/Articles/821257/
If you wish to retain the idea of a pipe: input -> command -> output, you could write

<somefile grep somepattern >output

The position of a redirection doesn't matter, only the order, if there are dependencies.

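A small illustration of the point about redirection placement: the shell peels redirections off wherever they appear among the words, so all three of the following invocations are equivalent (the names are made up):

    <somefile grep somepattern >output
    grep <somefile somepattern >output
    grep somepattern <somefile >output
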
Surprisingly relevant? (unixbhaskar, Sat, 23 May 2020 12:22:51 +0000) https://lwn.net/Articles/821249/
Agreed! Many, many moons ago someone wise corrected me; like everyone else in the wild, I was doing this:

cat somefile | grep somepattern

and the correction was:

grep somepattern somefile --> which is essentially what you said: one less invocation of calls.

:)

Surprisingly relevant? (unixbhaskar, Sat, 23 May 2020 12:18:42 +0000) https://lwn.net/Articles/821248/
Agreed. And it echoes the essence of UNIX: do one thing and do it well. It's a damn good tool to know.
:)

The state of the AWK (unixbhaskar, Sat, 23 May 2020 12:16:38 +0000) https://lwn.net/Articles/821247/
What a heck of a tool! Wonderful... I stumble into it and use it every now and then. It's kinda a daily part and parcel of life. It does the job for me, and I am happy with it.

The state of the AWK (tedd, Sat, 23 May 2020 08:44:06 +0000) https://lwn.net/Articles/821243/
Whenever I think of awk I always remember this: http://kmkeen.com/awk-music/

Note that the script may need to be edited to run on modern systems - I don't know when this particular post was published.

Surprisingly relevant? (ptman, Fri, 22 May 2020 08:58:43 +0000) https://lwn.net/Articles/821173/
I'll just reach for this handy POSIX AWK instead.

Surprisingly relevant? (NYKevin, Thu, 21 May 2020 17:11:34 +0000) https://lwn.net/Articles/821133/
> That cut only accepts a single character delimiter (rather than a set like IFS in shell, or a regular expression like FS in AWK[1]), and that it can't span adjacent delimiters (like shell and AWK), is a nearly fatal flaw.

No it isn't. You're meant to call tr(1) with appropriate arguments, and pipe the result into cut. If you do that, then neither of those limitations matters.

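A sketch of the tr-plus-cut combination being described, using ps output as an example input; the field number and the choice of ps are illustrative, not from the comment:

    # squeeze runs of spaces down to one, then take the second column (the PID)
    ps aux | tr -s ' ' | cut -d ' ' -f 2
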
The state of the AWK (PaulMcKenney, Thu, 21 May 2020 16:33:51 +0000) https://lwn.net/Articles/821126/
Thank you for this! I feel much better about continuing to use awk. :-)

The state of the AWK (smitty_one_each, Thu, 21 May 2020 15:57:04 +0000) https://lwn.net/Articles/821122/
Let awk and jq be the Batman and Robin of the command line.

Surprisingly relevant? (pgdx, Thu, 21 May 2020 10:19:32 +0000) https://lwn.net/Articles/821056/
But this doesn't do the same thing as the Python and awk programs do.

First, it sorts all lines, which is not according to spec.

Second, it doesn't print the duplicate hash as "headers" on a line by itself.

The state of the AWK (xtifr, Thu, 21 May 2020 03:22:04 +0000) https://lwn.net/Articles/821036/
Awk was one of the main things that first attracted me to Unix, several geological ages ago! The idea of a simple language specifically designed for creating filters on-the-fly was a complete revelation!

I no longer overuse and abuse awk the way I did when I was young, but I still use it now and then. Often just typing on the command line!

One of my favorite tricks is running accumulators for multiple types of things:

    ps aux | awk '{count[$1]++} END { for (u in count) { print u, ": ", count[u] } }'

tells me how many processes each user has. And yes, that's the sort of thing I can *and do* just type in when I want to know something like that. :)

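A variation on the same accumulator idiom, counting processes and also summing resident memory per user; it assumes the usual "ps aux" column layout, where RSS is field 6 and is typically reported in kilobytes:

    ps aux | awk 'NR > 1 { procs[$1]++; rss[$1] += $6 }
                  END { for (u in procs) print u, procs[u], rss[u] " kB" }'
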
Surprisingly relevant? (xtifr, Thu, 21 May 2020 03:02:26 +0000) https://lwn.net/Articles/821033/
What I hate is to see people flailing around with a bunch of overspecialized and slow tools like cut, paste, comm, grep, sed, and so on, to do--poorly--what a trivial amount of awk would do cleanly and well.

The state of the AWK (wahern, Wed, 20 May 2020 23:57:23 +0000) https://lwn.net/Articles/821030/
Alas, Red Hat/CentOS/Fedora no longer install Perl by default. There was a glorious period where Perl was more common than Bash, after the BSDs and commercial Unices adopted Perl, but before the dark years of "shell scripting" becoming synonymous with "Bash scripting". If you had reason to venture away from POSIX utilities for system management tasks, Perl was the obvious and perfectly reasonable choice--it was and remains the better AWK.

I think AWK is seeing a resurgence precisely because Perl isn't as ubiquitous as it once was. You can't depend on Python being installed, either, and even if you could, it still sucks for short, shell-style programming. Which is why, as Python has displaced Perl, there's more demand for AWK to fill the remaining gap.

I agree that XML and JSON have altered the landscape, but XML and JSON don't fit streaming paradigms very well. Even when something like jq is available, I usually find the regular shell utilities to be far more convenient, and AFAICT so do most others. It's always been the case that for highly structured data you ended up using more sophisticated programming languages anyhow. The reason why the Unix shell and shell programming have persisted for so long is precisely because the "one language to rule them all" and "one record format to rule them all" approaches never sufficed nearly enough to displace ad hoc text munging tools. The very nature of the problem domain--gluing together disparate, uncooperative tools and data--contradicts the idea that there could ever be a simple, unified solution.

Surprisingly relevant? (wahern, Wed, 20 May 2020 23:27:43 +0000) https://lwn.net/Articles/821028/
That cut only accepts a single character delimiter (rather than a set like IFS in shell, or a regular expression like FS in AWK[1]), and that it can't span adjacent delimiters (like shell and AWK), is a nearly fatal flaw. I have half a mind to submit a proposal to POSIX to add a new option, but there's no such extension in any implementation of cut that I've seen. Pre-existing practice isn't a hard requirement, especially for the upcoming revision, but I feel like the fact it doesn't exist constitutes proof that cut is a lost cause and should be left alone.

[1] There are special semantics for single-character FS; semantics that mimic shell word splitting.

Surprisingly relevant? (NYKevin, Wed, 20 May 2020 20:29:42 +0000) https://lwn.net/Articles/821004/
> where spawning a bunch of processes to do what awk can do by itself is acceptable to you.

Why do I care? I have gigabytes of RAM, and it's not like I'm going to run out of PIDs from a five-command pipeline. Besides, the kernel should be COWing glibc etc., so it's not even all that much overhead to begin with. If you're using something like Toybox/Busybox/whatever-other-box-is-popular-these-days, then you can literally COW the entire executable.

Surprisingly relevant? (NYKevin, Wed, 20 May 2020 20:19:32 +0000) https://lwn.net/Articles/821001/
In this context, we're talking about the fixed costs of setting up and tearing down O(1) extra processes (vs. setting up and tearing down exactly one awk process). A reasonable pipeline will scale to millions of lines of text very easily, because the per-process overhead just isn't that big compared to the actual work being done.

On the other hand, if you're doing a "while read; do ...; done" style thingy, then yes, it will be awful and slow. But I try to avoid that most of the time.

Surprisingly relevant? (NYKevin, Wed, 20 May 2020 20:05:58 +0000) https://lwn.net/Articles/820995/
In my experience, replacing dot with [^x] (for some suitable x) is often Good Enough. This is certainly true when parsing something like a path name into its constituent components. True non-greedy matching is more powerful than that, of course, but eventually you may want to reach for a Real Parser (TM).

(Strictly speaking, it is not correct to claim that non-greedy matching is required for parsing arbitrary regular languages. Formally, a regular language can be parsed entirely in terms of literal characters, parentheses, alternation, and the Kleene star, plus anchoring if you assume that regexes are not implicitly anchored. But this might require a very long and unwieldy regex in practice, so a lack of non-greedy matching is certainly a valid complaint.)

Alternatively, I suppose you could use ex(1) noninteractively (with -c {command} or +{command}).

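A small example of the [^x] substitute for non-greedy matching, here in sed; /etc/passwd is just a convenient colon-delimited input:

    # ".*:" would greedily eat everything up to the *last* colon;
    # "[^:]*" stops at the first one, so this prints only the login names
    sed 's/^\([^:]*\):.*/\1/' /etc/passwd
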
Surprisingly relevant? (jafd, Wed, 20 May 2020 20:01:08 +0000) https://lwn.net/Articles/820998/
I was using it because a fat book about administering Red Hat Linux (from when Red Hat Linux 6 was a newfangled thing) gave it in a useful example. It went downhill from there.

Also, back when the whole Unicode mess was a very on-and-off experience, where some tools would work and some would go bonkers, cut failed me a couple of times and awk was solid. And so it went.

Surprisingly relevant? (geert, Wed, 20 May 2020 18:49:35 +0000) https://lwn.net/Articles/820987/
For small amounts of data, the tool usually doesn't matter at all.

A long time ago, a colleague came to me for help doing search and replace in a very large file. His editor of choice was "xedit", and the search-and-replace operation seemed to hang, or at least took ages. I opened his file in "vi", which performed the same operation in the blink of an eye. Didn't even have to resort to sed.

Lesson learned: "xedit" was written as a sample program showing how to use the X11 Athena Widgets; it was never meant to be a production-level editor.

The state of the AWK (jthill, Wed, 20 May 2020 18:28:59 +0000) https://lwn.net/Articles/820980/
If communicating with the less dedicated is in the mix, perl loses its luster. Awk is much, much more approachable. If I'm trying to explain to someone who won't be doing a lot of scripting how to munge text and flat files, awk is by far the best option if sed isn't easier. They'll likely be able to extend what they've learned because the marginal costs are low, and if awk is out of steam they're likely to need some guidance for other reasons.

json and xml are big hammers; far too often people swing them for little jobs.

Surprisingly relevant? (scientes, Wed, 20 May 2020 16:30:58 +0000) https://lwn.net/Articles/820967/
sed is not a horrible idea, but whenever I use it I run into the fact that it cannot parse arbitrary regular languages because of the lack of non-greedy matching (i.e. a decent regex implementation).

Surprisingly relevant? (edeloget, Wed, 20 May 2020 13:56:53 +0000) https://lwn.net/Articles/820921/
Given that awk is heavily used in many scripts in the embedded world (mostly through busybox awk), it's definitely not going to disappear any time soon. It may disappear one day, but not before shell scripts die (which would mean we had access to a better kind of shell).

And that's good. awk is a solid program with tons of possible use cases.

Surprisingly relevant? (Paf, Wed, 20 May 2020 13:33:15 +0000) https://lwn.net/Articles/820920/
A good chunk of the time this doesn't matter, since it's just processing small amounts of data. On occasion, when working with large log files, I've had occasion to need to figure out efficiencies like this... but I don't do serious "permanent data pipeline" stuff in awk anyway.

I do think about efficiency - for ad-hoc data processing, I start with "how fast can I do this without compromising the actual performance I need", then work in from there if something's slow.

mawk in Ubuntu 20.04 (hmh, Wed, 20 May 2020 13:03:21 +0000) https://lwn.net/Articles/820917/
That change came into Ubuntu through Debian:

https://tracker.debian.org/pkg/mawk