
Surprisingly relevant?

Posted May 19, 2020 23:00 UTC (Tue) by warrax (subscriber, #103205)
Parent article: The state of the AWK

Nah, mate.

It's dead. Dead as a dodo. It's just become so embedded in weird and unnatural places that it can't be killed with a single shot.

It has absolutely no relevance to the modern world in terms of engineering, innovation, ... anything really. Let's just let it die in peace.

It was great for its time, but it's time to let go.

RIP. (And I mean that with respect. My actual first professional/paying job was writing a bit of AWK to process some weird billing format thing into a thing $OTHER_SYSTEM could use, so I appreciate it for what it was... but.)



Surprisingly relevant?

Posted May 19, 2020 23:17 UTC (Tue) by benhoyt (subscriber, #138463) [Link] (32 responses)

I appreciate the "mate" as I'm from down under :-), but as Mark Twain said, "the reports of AWK's death have been greatly exaggerated". I personally still use it on a regular basis (maybe once every couple of weeks) to pull a field out of a log file, sum up a column in a CSV file, etc. Half a line of AWK code is better than 5 lines of Python (or 30 lines of Go/C/Java) for ad-hoc jobs like this.
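For example, the sort of throwaway one-liners I mean (the file names and field numbers here are made up, of course):

# sum the third column of a comma-separated file
awk -F, '{ sum += $3 } END { print sum }' billing.csv

# pull the second field out of a whitespace-separated log
awk '{ print $2 }' access.log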

And I know many other developers use it too ... there's still a lot of text around to be processed (and long may it continue, I say). Additionally, there are the "big data" use cases I linked to in the article, where developers found it faster and simpler than heavier distributed computing tools. See also: https://yourdatafitsinram.net/

Surprisingly relevant?

Posted May 20, 2020 1:50 UTC (Wed) by NYKevin (subscriber, #129325) [Link] (31 responses)

Although I don't *object* to awk's continued existence, I've never had cause to learn it, mostly because all the simple stuff can be done with an appropriate combination of cut(1), paste(1), tee(1), comm(1), grep(1), sed(1), and so on, together with a few basic shell constructs like process substitution. If all of the above are truly inadequate to the task at hand, I'm probably better off writing a Python script anyway.

What does bother me is when people write awk "{ print $1 }" instead of cut -f1. I find the latter more readable.

Surprisingly relevant?

Posted May 20, 2020 4:08 UTC (Wed) by marduk (subscriber, #3831) [Link] (3 responses)

I rarely see just "{ print $1 }"; or rather, most of the time when I see that someone has turned to awk, it's because their problem has evolved into something more complicated than "{ print $1 }".

You could resort to some combination of sed, cut, and so on in trivial cases, if spawning a bunch of processes to do what awk can do by itself is acceptable to you.
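A toy illustration of that evolution (the log layout and file name are invented): a job that starts as "{ print $1 }" almost immediately grows a condition, at which point cut is out of its depth:

awk '$3 == "ERROR" { print $1 }' app.log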

Surprisingly relevant?

Posted May 20, 2020 20:01 UTC (Wed) by jafd (subscriber, #129642) [Link]

I was using it because a fat book about administering Red Hat Linux (from when Red Hat Linux 6 was a newfangled thing) gave a useful example of it. It went downhill from there.

Also, back when Unicode support was a very on-and-off experience, with some tools working and some going bonkers, cut failed me a couple of times while awk was solid. And so it went.

Surprisingly relevant?

Posted May 20, 2020 20:29 UTC (Wed) by NYKevin (subscriber, #129325) [Link] (1 responses)

> where spawning a bunch of processes to do what awk can do by itself is acceptable to you.

Why do I care? I have gigabytes of RAM and it's not like I'm going to run out of PIDs from a five-command pipeline. Besides, the kernel should be COWing glibc etc. so it's not even all that much overhead to begin with. If you're using something like Toybox/Busybox/whatever-other-box-is-popular-these-days, then you can literally COW the entire executable.

Surprisingly relevant?

Posted Nov 19, 2020 13:29 UTC (Thu) by nenad_noveljic (guest, #143180) [Link]

Forking is an expensive OS call. It might not cause a problem when used occasionally on the command line, but it will consume substantial kernel CPU time if done on a large scale.

Surprisingly relevant?

Posted May 20, 2020 5:45 UTC (Wed) by cyphar (subscriber, #110703) [Link] (3 responses)

Depends on your definition of "simple". While I do make use of all of the tools you've mentioned, awk has carved out its own niche in that pantheon. It lets you do a handful of things that you would ordinarily need to reach for a "real" programming language for, such as aggregation or small programs built around maps. Yes, you could implement these things in Python fairly easily, but with two downsides:

  1. You can't just write the script in your shell; you need to open a text editor. Though this is mostly a Python problem, caused by its use of significant whitespace for block structure. Awk uses braces and semicolons, so this isn't an issue, and I would wager that most awk scripts are written directly in a shell.
  2. Most "real" languages require a bunch of boilerplate (such as looping over input, or explicitly converting strings to other types). For "real" programming languages it makes sense to require that kind of boilerplate -- but awk lets you elide all of it, because its only purpose is to execute programs over records and fields. For a quick-and-dirty language like awk, not needing any boilerplate is a godsend.

Compare the following programs, which take the output of running sha256sum over a directory tree and find any files with matching hashes. The awk version is verbatim a program I wrote a week ago (note that I actually wrote it as a single line on my command line, but I've put it in a file for easier comparison).

% cat prog.py
collisions = {}
for line in iter(input, ""):
  hash, *_ = line.split() # really should be re.split but that would be too mean to Python
  if hash not in collisions:
    collisions[hash] = []
  collisions[hash].append(line)

for hash, lines in collisions.items():
  if len(lines) > 1:
    print(hash)
    for line in lines:
      print(line)

% cat prog.awk
{
  files[$1][length(files[$1])] = $0
}

END {
  for (hash in files) {
    if (length(files[hash]) > 1) {
      print hash;
      for (idx in files[hash]) {
        print " " files[hash][idx];
      }
    }
  }
}

What is the first thing you notice? All of the boilerplate in Python about iter(input, "") and splitting the line is already done for you in awk. The actual logic of the program is implemented in a single statement in awk, with the rest of the program just printing out the calculation. And that is one of the reasons why I reach for awk much more often than I reach for Python when I have a relatively-simple source of data to parse -- I just have to type a lot less.

> What does bother me is when people write awk "{ print $1 }" instead of cut -f1. I find the latter more readable.

The problem is that cut splits on a single literal character (TAB by default, or whatever single character you pass with -d), while awk splits fields using a regular expression (by default, runs of spaces and tabs). Many standard Unix programs output data such that cut's field handling is simply not usable. You could clean up the data with sed, but then you're working around the fact that cut isn't doing its job properly. I sometimes feel that cut would be a better program if it were implemented as an awk script.
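A concrete illustration (using ps purely as a convenient example of space-padded columns; the exact widths vary by system):

ps aux | cut -d' ' -f2      # mostly empty fields, because every padding space counts as a separator
ps aux | awk '{ print $2 }' # the PID column, as intended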

Surprisingly relevant?

Posted May 20, 2020 9:03 UTC (Wed) by mineo (guest, #126773) [Link]

Note that, with the appropriate imports, you can't reduce the line count of your Python example, but you can make it a bit more straightforward:
from collections import defaultdict
from fileinput import input

collisions = defaultdict(list)
for line in input():
  hash, *_ = line.split() # really should be re.split but that would be too mean to Python
  collisions[hash].append(line.strip())

for hash, lines in collisions.items():
  if len(lines) > 1:
    print(hash)
    for line in lines:
      print(line)

Surprisingly relevant?

Posted May 20, 2020 10:39 UTC (Wed) by mgedmin (subscriber, #34497) [Link] (1 responses)

TBH I find both the AWK and Python versions to be awkward.

sort sha256sums.txt | uniq -w64 --all-repeated=separate

Surprisingly relevant?

Posted May 21, 2020 10:19 UTC (Thu) by pgdx (guest, #119243) [Link]

But this doesn't do the same thing as the Python and awk programs do.

First, it sorts all lines, which is not according to spec.

Second, it doesn't print each duplicated hash as a "header" on a line by itself.

Surprisingly relevant?

Posted May 20, 2020 6:43 UTC (Wed) by dumain (subscriber, #82016) [Link] (3 responses)

'{print $1}' may be less readable than cut -f1, but it also deals better with less structured data: variable-width whitespace separators, leading whitespace, etc. Plus you can set FS to a regexp if needed.
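For example, something like this handles fields separated by commas with arbitrary surrounding whitespace (made-up file name):

awk -F' *, *' '{ print $2 }' data.csv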

Surprisingly relevant?

Posted May 20, 2020 23:27 UTC (Wed) by wahern (subscriber, #37304) [Link] (2 responses)

That cut only accepts a single character delimiter (rather than a set like IFS in shell, or a regular expression like FS in AWK[1]), and that it can't span adjacent delimiters (like shell and AWK), is a nearly fatal flaw. I have half a mind to submit a proposal to POSIX to add a new option, but there's no such extension in any implementation of cut that I've seen. Pre-existing practice isn't a hard requirement, especially for the upcoming revision, but I feel like the fact it doesn't exist constitutes proof that cut is a lost cause and should be left alone.

[1] There are special semantics for single-character FS; semantics that mimic shell word splitting.

Surprisingly relevant?

Posted May 21, 2020 17:11 UTC (Thu) by NYKevin (subscriber, #129325) [Link] (1 responses)

> That cut only accepts a single character delimiter (rather than a set like IFS in shell, or a regular expression like FS in AWK[1]), and that it can't span adjacent delimiters (like shell and AWK), is a nearly fatal flaw.

No it isn't. You're meant to call tr(1) with appropriate arguments, and pipe the result into cut. If you do that, then neither of those limitations matter.
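In other words, something along these lines (again using ps as a stand-in for ragged, space-padded output):

ps aux | tr -s ' ' | cut -d' ' -f2   # squeeze runs of spaces, then cut on the single remaining space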

Surprisingly relevant?

Posted May 22, 2020 8:58 UTC (Fri) by ptman (subscriber, #57271) [Link]

I'll just reach for this handy POSIX AWK instead

Surprisingly relevant?

Posted May 20, 2020 6:46 UTC (Wed) by Wol (subscriber, #4433) [Link] (10 responses)

> with an appropriate combination of cut(1), paste(1), tee(1), comm(1), grep(1), sed(1), and so on,

At which point, while you may not care, you are making the computer do 10 times as much work. Invoking a program is expensive. That's the complaint against bash scripts - every time they call out to a command they are setting up an environment, tearing it down, and generally doing loads of busywork.

If you can do all that with a single call to awk, you've probably reduced the overheads by 99%, if not more!
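As a sketch of the kind of collapse being described (hypothetical log file; note that the awk version prints its counts in arbitrary order rather than sorted):

grep ERROR app.log | cut -d' ' -f1 | sort | uniq -c                    # four processes
awk '/ERROR/ { n[$1]++ } END { for (k in n) print n[k], k }' app.log   # one process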

(Still, modern programmers don't seem to understand the meaning of the word "efficient")

Cheers,
Wol

Surprisingly relevant?

Posted May 20, 2020 13:33 UTC (Wed) by Paf (subscriber, #91811) [Link] (2 responses)

A good chunk of the time this doesn't matter, since it's just processing small amounts of data. When working with large log files I have occasionally needed to figure out efficiencies like this... but I don't do serious "permanent data pipeline" stuff in awk anyway.

I do think about efficiency - for ad-hoc data processing, I start with "how fast can I do this without compromising the actual performance I need?", then work from there if something's slow.

Surprisingly relevant?

Posted May 20, 2020 18:49 UTC (Wed) by geert (subscriber, #98403) [Link] (1 responses)

For small amounts of data, the tool usually doesn't matter at all.

A long time ago, a colleague came to me for help doing search and replace in a very large file. His editor of choice was "xedit", and the search and replace operation seemed to hang, or at least took ages. I opened his file in "vi", which performed the same operation in the blink of an eye. I didn't even have to resort to sed.

Lesson learned: "xedit" was written as a sample program for showing how to use the X11 Athena Widgets, it was never meant to be a production-level editor.

Surprisingly relevant?

Posted May 20, 2020 20:19 UTC (Wed) by NYKevin (subscriber, #129325) [Link]

In this context, we're talking about the fixed costs of setting up and tearing down O(1) extra processes (vs. setting up and tearing down exactly one awk process). A reasonable pipeline will scale to millions of lines of text very easily, because the per-process overhead just isn't that big compared to the actual work being done.

On the other hand, if you're doing a while read; do ...; done style thingy, then yes, it will be awful and slow. But I try to avoid that most of the time.
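For the record, the pattern being warned against looks something like this (made-up file; it forks a new cut, plus a subshell for the pipe, for every single line):

while read -r line; do
  echo "$line" | cut -f1
done < data.txt

whereas cut -f1 < data.txt pays the process setup cost exactly once.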

Surprisingly relevant?

Posted May 23, 2020 12:22 UTC (Sat) by unixbhaskar (guest, #44758) [Link] (6 responses)

Agreed! Many moons ago someone wise corrected me; like everyone else in the wild, I was doing this:

cat somefile | grep somepattern

and the correction was ..

grep somepattern somefile --> which is essentially what you said: one less process invocation.

:)

Surprisingly relevant?

Posted May 23, 2020 19:53 UTC (Sat) by Jandar (subscriber, #85683) [Link] (5 responses)

If you wish to retain the idea of a pipe: input -> command -> output, you could write

<somefile grep somepattern >output

The position of a redirection within the command doesn't matter; only the order matters, and only if there are dependencies between redirections.

Surprisingly relevant?

Posted May 23, 2020 20:32 UTC (Sat) by Wol (subscriber, #4433) [Link] (4 responses)

You're missing the point - the point is to GET RID of pipes.

Every extra pipe is an extra trip around the setup/teardown busywork loop - which, if you pre-allocate memory, could actually be a big problem even if you think you have plenty.

Cheers,
Wol

Surprisingly relevant?

Posted May 24, 2020 13:18 UTC (Sun) by madscientist (subscriber, #16861) [Link] (3 responses)

???

There are no pipes in Jandar's suggested alternative.

This feels more like Stack Overflow than LWN, but the issue is that grep foo somefile gives different output than cat somefile | grep foo, and if you want the latter behavior while still avoiding a UUoC, you should be using grep foo < somefile instead.

Surprisingly relevant?

Posted May 24, 2020 13:50 UTC (Sun) by mpr22 (subscriber, #60784) [Link] (1 responses)

In what way does the output of grep pattern singlefile.txt differ from the output of cat singlefile.txt | grep pattern?

Surprisingly relevant?

Posted May 24, 2020 17:12 UTC (Sun) by madscientist (subscriber, #16861) [Link]

You're right, grep behaves the same; my bad! I was thinking of some other tools like wc which have different output when given a filename versus reading from stdin.

This can be useful in scripting to avoid the complexity of stripping off the unwanted filename.
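For instance (hypothetical file name):

wc -l somefile      # prints the count followed by the file name
wc -l < somefile    # prints the count alone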

Surprisingly relevant?

Posted May 24, 2020 14:02 UTC (Sun) by Wol (subscriber, #4433) [Link]

Umm...

On first thought, my reaction was "aren't < and > just a different syntax for pipes?".

My second thought is that no, they aren't actually pipes; the redirections are handled by the shell itself.

So yeah you're right. They're pretty much identical in effect (and concept), but different in implementation and impact on the system. There's more than one way to do it ... :-)

Cheers,
Wol

Surprisingly relevant?

Posted May 20, 2020 12:00 UTC (Wed) by neilbrown (subscriber, #359) [Link] (1 responses)

> What does bother me is when people write awk "{ print $1 }" instead of cut -f1. I find the latter more readable.

I'm a proud user of 'awk "{print $1}"' - maybe I don't care about readability, only write-ability.
A small extension to that (e.g., ignore lines starting '#') is easy within the same tool. A small extension to "cut -f1" requires a different tool.
Awk seems to me to be a good answer to the requirement "Simple things should be simple, complex things should be possible".
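For instance (hypothetical file name), that extension is just one extra pattern in the same tool:

awk '!/^#/ { print $1 }' somefile.conf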

Surprisingly relevant?

Posted May 23, 2020 12:18 UTC (Sat) by unixbhaskar (guest, #44758) [Link]

Agreed. And it echoes the essence of UNIX: do one thing and do it well. It's a damn good tool to know. :)

Surprisingly relevant?

Posted May 20, 2020 16:30 UTC (Wed) by scientes (guest, #83068) [Link] (1 responses)

sed is not a horrible idea, but whenever I use it I run into the fact that it cannot parse arbitrary regular languages because of the lack of non-greedy matching (i.e. a decent regex implementation).

Surprisingly relevant?

Posted May 20, 2020 20:05 UTC (Wed) by NYKevin (subscriber, #129325) [Link]

In my experience, replacing dot with [^x] (for some suitable x) is often Good Enough. This is certainly true when parsing something like a path name into its constituent components. True non-greedy matching is more powerful than that, of course, but eventually you may want to reach for a Real Parser (TM).
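A minimal sketch of that trick, pulling the first component out of an absolute path without any non-greedy operator:

echo /usr/local/bin | sed 's|^/\([^/]*\)/.*|\1|'    # prints "usr"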

(Strictly speaking, it is not correct to claim that non-greedy matching is required for parsing arbitrary regular languages. Formally, a regular language can be parsed entirely in terms of literal characters, parentheses, alternation, and the Kleene star, plus anchoring if you assume that regexes are not implicitly anchored. But this might require a very long and unwieldy regex in practice, so a lack of non-greedy matching is certainly a valid complaint.)

Alternatively, I suppose you could use ex(1) noninteractively (with -c {command} or +{command}).

Surprisingly relevant?

Posted May 21, 2020 3:02 UTC (Thu) by xtifr (guest, #143) [Link] (2 responses)

What I hate is to see people flailing around with a bunch of overspecialized and slow tools like cut, paste, comm, grep, sed, and so on, to do--poorly--what a trivial amount of awk would do cleanly and well.

Surprisingly relevant?

Posted May 23, 2020 21:51 UTC (Sat) by NYKevin (subscriber, #129325) [Link]

Well, personally, I like to think that I'm adhering to the Unix philosophy (each binary does one thing, and does it well, whereas awk seems to want to do "reading and modifying text" well, whatever that's supposed to encompass), but this will quickly degenerate into a flamewar.

Surprisingly relevant?

Posted May 26, 2020 11:12 UTC (Tue) by jezuch (subscriber, #52988) [Link]

A counter-point from me would be that I took an almost immediate dislike to awk because I felt that a full-blown imperative language is overkill in a context which asks for a more declarative approach... But I generally favor declarative and functional over imperative wherever that's practical.

Surprisingly relevant?

Posted May 25, 2020 15:39 UTC (Mon) by anton (subscriber, #25547) [Link]

> If all of the above are truly inadequate to the task at hand, I'm probably better off writing a Python script anyway.

The nice thing about awk is that when you find that your non-awk shell-scripting tools miss a feature, you don't need to rewrite the whole shell script in Python.

As for whether you are better off: in the 1990s a colleague wrote a tool in Perl that I then inherited. I eventually rewrote it as a shell script (including parts in awk), and the result turned out to have half the lines of the Perl variant.

> What does bother me is when people write awk "{ print $1 }" instead of cut -f1. I find the latter more readable.

I am one of those people (except that I use single quotes). Using cut may well be more readable, but I know how to write it in awk without looking up a man page; that's because I use awk every few days, and not that many of those uses could be replaced with cut.

Surprisingly relevant?

Posted May 20, 2020 3:01 UTC (Wed) by ncm (guest, #165) [Link] (1 responses)

I think of awk as a sort of proto-Perl, more approachable and lighter-weight. It neatly fills the gap between Bash and Python. Quite a lot faster than Python, it does more in one to five lines.

I have never written a 500-line awk script, and probably won't, but for the one-liner that blows up, it has enough headroom for the extra load.

Surprisingly relevant?

Posted May 20, 2020 5:15 UTC (Wed) by areilly (subscriber, #87829) [Link]

I agree. Indeed (regarding it being a proto-Perl): Perl was described at the time of its introduction as a derivative of awk more suitable for heavy lifting and complicated multi-component programming. I never did learn Perl, because in my own work I just didn't run into problems that a couple of lines of awk couldn't handle, or for which some other language was clearly called for because actual design and engineering were involved. I still use awk one-liners daily. They fit particularly nicely into shell scripts, in-place, too.

There's a really nice video of a lecture by Brian Kernighan about awk, here: https://youtu.be/Sg4U4r_AgJU

It's not perfect. Fairly easy to stub your toe on some of the function syntax. I think that a more modern design would lean towards more "functional" functions, for example.

I do think that the addition of namespaces is an indication that some people are "doing it wrong"... :-)

Surprisingly relevant?

Posted May 20, 2020 3:36 UTC (Wed) by felixfix (subscriber, #242) [Link] (1 responses)

AWK won't die by proclamation, only when there are no users left.

Surprisingly relevant?

Posted May 20, 2020 13:56 UTC (Wed) by edeloget (subscriber, #88392) [Link]

Given the fact that awk is heavily used in many scripts in the embedded world (mostly through busybox awk), it's definitely not going to disappear any time soon. It may disappear one day, but not before shell scripts die (which would mean we have access to a better kind of shell).

And that's good. awk is a solid program with tons of possible use cases.

Surprisingly relevant?

Posted Nov 19, 2020 11:57 UTC (Thu) by motiejus (subscriber, #92837) [Link]

Recently I used awk to filter a few hundred gigabytes of LIDAR data, clipping it to the bounding boxes I was interested in:

#!/usr/bin/awk -f
BEGIN { FS = "," }
$1 > ymin && $1 < ymax && $2 > xmin && $2 < xmax {print $2 "," $1 "," $3}

I just re-ran this on a data subset (100M data points, 2.7GB uncompressed) to have numbers for this comment. My 8-core laptop did the whole operation in 29 seconds:
1. each file: unzip to memory.
2. each file: run through the program above for the bounding box.
3. each file: sort.
4. all files: merge sort.
5. all files: compress.

Combined with GNU Make, `sort` and `sort -m`, I can't imagine a more powerful combination of tools for this simple "big data"(?) task.
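For the curious, a rough sketch of how the per-file steps can be wired together by hand (the tile names, the clip.awk file name, and the bounding-box values are all made up; the real runs were driven by GNU Make):

unzip -p tile_0425.zip |
  awk -v ymin=54.6 -v ymax=54.8 -v xmin=25.1 -v xmax=25.3 -f clip.awk |
  sort > tile_0425.csv
sort -m tile_*.csv | gzip > clipped.csv.gz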

No, awk is not dead, and spending half an hour[1] is enough to be able to use it for life. :)

[1]: https://ferd.ca/awk-in-20-minutes.html

