The state of the AWK
AWK is a text-processing language with a history spanning more than 40 years. It has a POSIX standard, several conforming implementations, and is still surprisingly relevant in 2020 — both for simple text processing tasks and for wrangling "big data". The recent release of GNU Awk 5.1 seems like a good reason to survey the AWK landscape, see what GNU Awk has been up to, and look at where AWK is being used these days.
The language was created at Bell Labs in 1977. Its name comes from the initials of the original authors: Alfred Aho, Peter Weinberger, and Brian Kernighan. A Unix tool to the core, AWK is designed to do one thing well: to filter and transform lines of text. It's commonly used to parse fields from log files, transform output from other tools, and count occurrences of words and fields. Aho summarized AWK's functionality succinctly:
AWK reads the input a line at a time. A line is scanned for each pattern in the program, and for each pattern that matches, the associated action is executed.
AWK programs are often one-liners executed directly from the command line. For example, to calculate the average response time of GET requests from some hypothetical web server log, you might type:
$ awk '/GET/ { total += $6; n++ } END { print total/n }' server.log
0.0186667
This means: for all lines matching the regular expression /GET/, add up the response time (the sixth field or $6) and count the line; at the end, print out the arithmetic mean of the response times.
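To make the field numbering concrete, a hypothetical log line for this example might look like the following (my invention; the article doesn't show the actual log format):

2020-05-18 03:35:24 GET /about/ 200 0.015

With the default whitespace field splitting, $3 is the request method and $6 is the response time in seconds.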
The various AWK versions
There are three main versions of AWK in use today, and all of them conform to the POSIX standard (closely enough, at least, for the vast majority of use cases). The first is classic awk, the version of AWK described by Aho, Weinberger, and Kernighan in their book The AWK Programming Language. It's sometimes called "new AWK" (nawk) or "one true AWK", and it's now hosted on GitHub. This is the version pre-installed on many BSD-based systems, including macOS (though the version that comes with macOS is out of date, and worth upgrading).
The second is GNU Awk (gawk), which is by far the most featureful and actively maintained version. Gawk is usually pre-installed on Linux systems and is often the default awk. It is easy to install on macOS using Homebrew and Windows binaries are available as well. Arnold Robbins has been the primary maintainer of gawk since 1994, and continues to shepherd the language (he has also contributed many fixes to the classic awk version). Gawk has many features not present in awk or the POSIX standard, including new functions, networking facilities, a C extension API, a profiler and debugger, and most recently, namespaces.
The third common version is mawk, written by Michael Brennan. It is the default awk on Ubuntu and Debian Linux, and is still the fastest version of AWK, with a bytecode compiler and a more memory-efficient value representation. (Gawk has also had a bytecode compiler since 4.0, so it's now much closer to mawk's speed.)
If you want to use AWK for one-liners and basic text processing, any of the above are fine variants. If you're thinking of using it for a larger script or program, Gawk's features make it the sensible choice.
There are also several other implementations of AWK with varying levels of maturity and maintenance, notably the size-optimized BusyBox version used in embedded Linux environments, a Java rewrite with runtime access to Java language features, and my own GoAWK, a POSIX-compliant version written in Go. The three main AWKs and the BusyBox version are all written in C.
Gawk changes since 4.0
It's been almost 10 years since LWN covered the release of gawk 4.0. It would be tempting to say "much has changed since 2011", but the truth is that things move relatively slowly in the AWK world. I'll describe the notable features since 4.0 here, but for more details you can read the full 4.x and 5.x changelogs. Gawk 5.1.0 came out just over a month ago on April 14.
The biggest user-facing feature is the introduction of namespaces in 5.0. Most modern languages have some concept of namespaces to make it easier to ship large projects and libraries without name clashes. Gawk 5.0 adds namespaces in a backward-compatible way, allowing developers to create libraries, such as this toy math library:
# area.awk
@namespace "area"

BEGIN {
    pi = 3.14159  # namespaced "constant"
}

function circle(radius) {
    return pi*radius*radius
}
To refer to variables or functions in the library, use the namespace::name syntax, similar to C++:
$ gawk -f area.awk -e 'BEGIN { print area::pi, area::circle(10) }'
3.14159 314.159
Robbins believes that AWK's lack of namespaces is one of the key reasons it hasn't caught on as a larger-scale programming language and that this feature in gawk 5.0 may help resolve that. The other major issue Robbins believes is holding AWK back is the lack of a good C extension interface. Gawk's dynamic extension interface was completely revamped in 4.1; it now has a defined API and allows wrapping existing C and C++ libraries so they can be easily called from AWK.
The following code snippet from the example C-code wrapper in the user manual populates an AWK array (a string-keyed hash table) with a filename and values from a stat() system call:
/* empty out the array */
clear_array(array);

/* fill in the array */
array_set(array, "name", make_const_string(name, strlen(name), &tmp));
array_set_numeric(array, "dev", sbuf->st_dev);
array_set_numeric(array, "ino", sbuf->st_ino);
array_set_numeric(array, "mode", sbuf->st_mode);
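On the AWK side, an extension built this way is loaded with @load and then called like a built-in function. With the filefuncs sample extension that ships with gawk, that looks roughly like this (a sketch; the array keys follow the stat() wrapper above):

@load "filefuncs"

BEGIN {
    # stat() fills the array and returns a negative value on failure
    if (stat("/etc/passwd", info) < 0) {
        print "stat failed" > "/dev/stderr"
        exit 1
    }
    print info["name"], info["ino"], info["mode"]
}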
Another change in the 4.2 release (and continued in 5.0) was an overhauled source code pretty-printer. Gawk's pretty-printer enables its use as a standardized AWK code formatter, similar to Go's go fmt tool and Python's Black formatter. For example, to pretty-print the area.awk file from above:
$ gawk --pretty-print -f area.awk

which results in the following output:
@namespace "area" BEGIN { pi = 3.14159 # namespaced "constant" } function circle(radius) { return (pi * radius * radius) }
You may question the tool's choices: why does "BEGIN {" not have a line break before the "{" when the function does? (It turns out AWK syntax doesn't allow that.) Why two blank lines before the function and parentheses around the return expression? But at least it's consistent and may help avoid code-style debates.
Gawk allows a limited amount of runtime type inspection, and extended that with the addition of the typeof() function in 4.2. typeof() returns a string constant like "string", "number", or "array" depending on the input type. This kind of introspection is important for code that recursively walks every item of a nested array, for example (which is something that POSIX AWK can't do).
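Here is a minimal sketch (mine, not from the manual) of such a recursive walk over one of gawk's arrays of arrays, using typeof() to decide when to recurse:

# walk.awk: recursively print every scalar in a possibly nested array.
# "key" is declared as an extra parameter to make it a local variable.
function walk(arr, prefix,    key) {
    for (key in arr) {
        if (typeof(arr[key]) == "array")
            walk(arr[key], prefix "[" key "]")
        else
            print prefix "[" key "] =", arr[key]
    }
}

BEGIN {
    a["x"][1] = "first"
    a["x"][2] = "second"
    a["y"] = 42
    walk(a, "a")
}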
With 4.2, gawk also supports regular expression constants as a first-class data type using the syntax @/foo/. Previously you could not store a regular expression constant in a variable; typeof(@/foo/) returns the string "regexp". In terms of performance, gawk 4.2 brings a significant improvement on Linux systems by using fwrite_unlocked() when it's available. As gawk is single-threaded, it can use the non-locking stdio functions, giving a 7-18% increase in raw output speed — for example gawk '{ print }' on a large file.
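A minimal example of those typed regexp constants in action (my sketch, requiring gawk 4.2 or later):

BEGIN {
    re = @/^GET /          # a regexp constant stored in a variable
    print typeof(re)       # prints "regexp"
    if ("GET /about/" ~ re)
        print "matches"
}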
The GNU Awk User's Guide has always been a thorough reference, but it was substantially updated in 4.1 and again in the 5.x releases, including new examples, summary sections, and exercises, along with some major copy editing.
Last (and also least), a subtle change in 4.0 that I found amusing was the reverted handling of backslash in sub() and gsub(). Robbins writes:
The default handling of backslash in sub() and gsub() has been reverted to the behavior of 3.1. It was silly to think I could break compatibility that way, even for standards compliance.
The sub and gsub functions are core regular expression substitution functions, and even a small "fix" to the complicated handling of backslash broke people's code.
Robbins may have had a small slip in judgment with the original change, but it's obvious he takes backward compatibility seriously. Especially for a popular tool like gawk, sometimes it is better to keep violating the specification than to change how something has always worked.
Is AWK still relevant?
Asking if AWK is still relevant is a bit like asking if air is still relevant: you may not see it, but it's all around you. Many Linux administrators and DevOps engineers use it to transform data or diagnose issues via log files. A version of AWK is installed on almost all Unix-based machines. In addition to ad-hoc usage, many large open-source projects use AWK somewhere in their build or documentation tooling. To name just a few examples: the Linux kernel uses it in the x86 tooling to check and reformat objdump files, Neovim uses it to generate documentation, and FFmpeg uses it for building and testing.
AWK build scripts are surprisingly hard to kill, even when people want to: in 2018 LWN wrote about GCC contributors wanting to replace AWK with Python in the scripts that generate its option-parsing code. There was some support for this proposal at the time, but apparently no one volunteered to do the actual porting, and the AWK scripts live on.
Robbins argues in his 2018 paper for the use of AWK (specifically gawk) as a "systems programming language", in this context meaning a language for writing larger tools and programs. He outlines the reasons he thinks it has not caught on, but Kernighan is "not 100% convinced" that the lack of an extension mechanism is the main reason AWK isn't widely used for larger programs. He suggested that it might be due to the lack of built-in support for access to system calls and the like. But none of that has stopped several people from building larger tools: Robbins' own TexiWeb Jr. literate programming tool (1300 lines of AWK), Werner Stoop's d.awk tool that generates documentation from Markdown comments in source code (800 lines), and Translate Shell, a 6000-line AWK tool that provides a fairly powerful command-line interface to cloud-based translation APIs.
Several developers in the last few years have written about using AWK in their "big data" toolkit as a much simpler (and sometimes faster) tool than heavy distributed computing systems such as Spark and Hadoop. Nick Strayer wrote about using AWK and R to parse 25 terabytes of data across multiple cores. Other big data examples are the tantalizingly-titled article by Adam Drake, "Command-line Tools can be 235x Faster than your Hadoop Cluster", and Brendan O'Connor's "Don’t MAWK AWK – the fastest and most elegant big data munging language!"
Between ad-hoc text munging, build tooling, "systems programming", and big data processing — not to mention text-mode first person shooters — it seems that AWK is alive and well in 2020.
[Thanks to Arnold Robbins for reviewing a draft of this article.]
Index entries for this article
GuestArticles: Hoyt, Ben
Posted May 19, 2020 22:42 UTC (Tue) by dmoulding (subscriber, #95171)

Posted May 20, 2020 11:47 UTC (Wed) by kleptog (subscriber, #1183)
I personally found awk to be just a little confusing past the simple cases and simply replacing awk with perl -lne made it do what I want. I guess I just knew perl better than awk.
Posted May 29, 2020 21:32 UTC (Fri) by mirabilos (subscriber, #84359)
Oh and it’s a pre-release snapshot; 1.3.4 is apparently not quite released yet. Or Tom Dickey doesn’t want to publish a formal release because he just continued development but isn’t the formal developer.
But good/sad to see he’s picking up another project… he also develops ncurses, cdk, xterm and lynx, and except cdk I use them daily… here’s to hoping fixes there won’t be less frequent now ☻
Posted Jul 19, 2020 19:12 UTC (Sun) by ThomasDickey (guest, #140258)
https://invisible-island.net/mawk/CHANGES.html#t20091220
(1.3.5 is a different matter)
Posted May 19, 2020 23:00 UTC (Tue) by warrax (subscriber, #103205)
It's dead. Dead as a dodo. It's just become so embedded in weird and unnatural places that it can't be killed with a single shot.
It has absolutely no relevance to the modern world in terms of engineering, innovation, ... anything really. Let's just let it die in peace.
It was great for its time, but it's time to let go.
RIP. (And I mean that with respect. My actual first professional/paying job was writing a bit of AWK to process some weird billing format thing into a thing $OTHER_SYSTEM could use, so I appreciate it for what it was... but.)
Posted May 19, 2020 23:17 UTC (Tue) by benhoyt (subscriber, #138463)
And I know many other developers use it too ... there's still a lot of text around to be processed (and I say, may it long continue). Additionally, there's the "big data" use cases I linked to in the article, where developers found it faster and simpler than heavier distributed computing tools. See also: https://yourdatafitsinram.net/
Posted May 20, 2020 1:50 UTC (Wed) by NYKevin (subscriber, #129325)
What does bother me is when people write awk "{ print $1 }" instead of cut -f1. I find the latter more readable.
Posted May 20, 2020 4:08 UTC (Wed) by marduk (subscriber, #3831)
You could resort to some kind of permutation of sed, cut, etc. in trivial cases, where spawning a bunch of processes to do what awk can do by itself is acceptable to you.
Posted May 20, 2020 20:01 UTC (Wed) by jafd (subscriber, #129642)
Also, back when the whole Unicode mess was a very on-and-off experience, where some tools would work and some would go bonkers, cut failed me a couple of times while awk was solid. And so it went.
Posted May 20, 2020 20:29 UTC (Wed) by NYKevin (subscriber, #129325)
Why do I care? I have gigabytes of RAM and it's not like I'm going to run out of PIDs from a five-command pipeline. Besides, the kernel should be COWing glibc etc. so it's not even all that much overhead to begin with. If you're using something like Toybox/Busybox/whatever-other-box-is-popular-these-days, then you can literally COW the entire executable.
Posted Nov 19, 2020 13:29 UTC (Thu) by nenad_noveljic (guest, #143180)

Posted May 20, 2020 5:45 UTC (Wed) by cyphar (subscriber, #110703)
Depends on your definition of "simple". While I do make use of all of the tools you've mentioned, awk has carved out its own niche in that pantheon. It allows you to do a handful of things that you would ordinarily need to reach for a "real" programming language for, such as aggregation or basic programs that make use of maps. Yes, you could implement these things in Python fairly easily, but with two downsides.

Compare the following programs, which take the output of sha256sum of a directory tree and find any files which have matching hashes. The one written in awk is verbatim a program I wrote a week ago (note that I actually wrote it in a single line on my command line, but I put it in a file for an easier comparison).

% cat prog.py
collisions = {}
for line in iter(input, ""):
    hash, *_ = line.split() # really should be re.split but that would be too mean to Python
    if hash not in collisions:
        collisions[hash] = []
    collisions[hash].append(line)
for hash, lines in collisions.items():
    if len(lines) > 1:
        print(hash)
        for line in lines:
            print(line)

% cat prog.awk
{
    files[$1][length(files[$1])] = $0
}
END {
    for (hash in files) {
        if (length(files[hash]) > 1) {
            print hash;
            for (idx in files[hash]) {
                print " " files[hash][idx];
            }
        }
    }
}

What is the first thing you notice? All of the boilerplate in Python about iter(input, "") and splitting the line is already done for you in awk. The actual logic of the program is implemented in a single statement in awk, with the rest of the program just printing out the calculation. And that is one of the reasons why I reach for awk much more often than I reach for Python when I have a relatively-simple source of data to parse -- I just have to type a lot less.

> What does bother me is when people write awk "{ print $1 }" instead of cut -f1. I find the latter more readable.

The problem is that cut splits on the literal " " (U+0020) or whatever other literal you specify, while awk splits fields using a regular expression (which by default is /\s+/). Many standard Unix programs output data such that cut's field handling is simply not usable. You could clean up the data with sed, but now you're working around the fact that cut isn't doing its job correctly. I sometimes feel that cut would be a better program if it were implemented as an awk script.
Posted May 20, 2020 9:03 UTC (Wed) by mineo (guest, #126773)

Note that, with the appropriate imports, you can't reduce the line count of your Python example, but you can make it a bit more straightforward:

from collections import defaultdict
from fileinput import input

collisions = defaultdict(list)
for line in input():
    hash, *_ = line.split() # really should be re.split but that would be too mean to Python
    collisions[hash].append(line.strip())
for hash, lines in collisions.items():
    if len(lines) > 1:
        print(hash)
        for line in lines:
            print(line)
Posted May 20, 2020 10:39 UTC (Wed) by mgedmin (subscriber, #34497)
sort sha256sums.txt | uniq -w64 --all-repeated=separate
Posted May 21, 2020 10:19 UTC (Thu) by pgdx (guest, #119243)
First, it sorts all lines, which is not according to spec.
Second, it doesn't print the duplicate hash as "headers" on a line by itself.
Posted May 20, 2020 6:43 UTC (Wed) by dumain (subscriber, #82016)

Posted May 20, 2020 23:27 UTC (Wed) by wahern (subscriber, #37304)
[1] There are special semantics for single-character FS; semantics that mimic shell word splitting.
Posted May 21, 2020 17:11 UTC (Thu) by NYKevin (subscriber, #129325)
No it isn't. You're meant to call tr(1) with appropriate arguments, and pipe the result into cut. If you do that, then neither of those limitations matter.
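For example, a minimal sketch of that approach (assuming space-separated input like the article's server.log):

tr -s ' ' '\t' < server.log | cut -f6

tr -s squeezes each run of spaces down to a single tab, which is cut's default field delimiter.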
Posted May 22, 2020 8:58 UTC (Fri) by ptman (subscriber, #57271)

Posted May 20, 2020 6:46 UTC (Wed) by Wol (subscriber, #4433)
At which point, while you may not care, you are making the computer do 10 times as much work. Invoking a program is expensive. That's the complaint against bash scripts - every time they call out to a command they are setting up an environment, tearing it down, and generally doing loads of busywork.
If you can do all that with a single call to awk, you've probably reduced the overheads by 99%, if not more!
(Still, modern programmers don't seem to understand the meaning of the word "efficient")
Cheers,
Wol
Posted May 20, 2020 13:33 UTC (Wed) by Paf (guest, #91811)
I do think about efficiency - for ad-hoc data processing, I start with “how fast can I do this without compromising the actual performance I need”, then work in from there if something’s slow.
Posted May 20, 2020 18:49 UTC (Wed) by geert (subscriber, #98403)
A long time ago, a colleague came to me for help doing search and replace in a very large file. His editor of choice was "xedit", and the search-and-replace operation seemed to hang, or at least took ages. I opened his file in "vi", which performed the same operation in the blink of an eye. Didn't even have to resort to sed.
Lesson learned: "xedit" was written as a sample program for showing how to use the X11 Athena Widgets, it was never meant to be a production-level editor.
Posted May 20, 2020 20:19 UTC (Wed) by NYKevin (subscriber, #129325)
On the other hand, if you're doing a while read; do ...; done style thingy, then yes, it will be awful and slow. But I try to avoid that most of the time.
Posted May 23, 2020 12:22 UTC (Sat) by unixbhaskar (guest, #44758)
cat somefile | grep somepattern

and the correction was:

grep somepattern somefile --> this is essentially the point you made: one less process invocation.

:)
Posted May 23, 2020 19:53 UTC (Sat) by Jandar (subscriber, #85683)
<somefile grep somepattern >output

The position of a redirection doesn't matter; only the order matters if there are dependencies.
Posted May 23, 2020 20:32 UTC (Sat) by Wol (subscriber, #4433)
Every extra pipe is an extra trip round the setup/teardown busywork loop - which if you pre-allocate memory could actually be a big problem even if you think you have plenty.
Cheers,
Wol
Posted May 24, 2020 13:18 UTC (Sun) by madscientist (subscriber, #16861)
There are no pipes in Jandar's suggested alternative.
This feels more like StackOverflow than LWN, but the issue is that grep foo somefile gives different output than cat somefile | grep foo and if you want the latter behavior while still avoiding UUoC, you should be using grep foo < somefile instead.
Posted May 24, 2020 13:50 UTC (Sun) by mpr22 (subscriber, #60784)

Posted May 24, 2020 17:12 UTC (Sun) by madscientist (subscriber, #16861)

You're right, grep behaves the same; my bad! I was thinking of some other tools like wc which have different output when given a filename versus reading from stdin.
This can be useful in scripting to avoid the complexity of stripping off the unwanted filename.
Posted May 24, 2020 14:02 UTC (Sun) by Wol (subscriber, #4433)
On first thoughts my reaction was "aren't < and > just different syntaxes for pipes?".
My second thought now is that "no they aren't actually pipes, they're shell built-ins".
So yeah you're right. They're pretty much identical in effect (and concept), but different in implementation and impact on the system. There's more than one way to do it ... :-)
Cheers,
Wol
Posted May 20, 2020 12:00 UTC (Wed) by neilbrown (subscriber, #359)
I'm a proud user of 'awk "{print $1}"' - maybe I don't care about readability, only write-ability.

A small extension to that (e.g., ignore lines starting '#') is easy within the same tool. A small extension to "cut -f1" requires a different tool.

Awk seems to me to be a good answer to the requirement "Simple things should be simple, complex things should be possible".
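For example, a minimal sketch of such an extension (hypothetical file name):

awk '!/^#/ { print $1 }' input.txt    # print the first field, skipping comment lines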
Posted May 23, 2020 12:18 UTC (Sat) by unixbhaskar (guest, #44758)

Posted May 20, 2020 16:30 UTC (Wed) by scientes (guest, #83068)

Posted May 20, 2020 20:05 UTC (Wed) by NYKevin (subscriber, #129325)
(Strictly speaking, it is not correct to claim that non-greedy matching is required for parsing arbitrary regular languages. Formally, a regular language can be parsed entirely in terms of literal characters, parentheses, alternation, and the Kleene star, plus anchoring if you assume that regexes are not implicitly anchored. But this might require a very long and unwieldy regex in practice, so a lack of non-greedy matching is certainly a valid complaint.)
Alternatively, I suppose you could use ex(1) noninteractively (with -c {command} or +{command}).
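For example, a sketch of that kind of rewriting (my illustration): the non-greedy match "everything up to the first closing quote" can be expressed with a negated character class instead, using POSIX awk's match(), which sets RSTART and RLENGTH:

$ echo 'say "foo" then "bar"' | awk 'match($0, /"[^"]*"/) { print substr($0, RSTART, RLENGTH) }'
"foo"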
Posted May 21, 2020 3:02 UTC (Thu) by xtifr (guest, #143)

Posted May 23, 2020 21:51 UTC (Sat) by NYKevin (subscriber, #129325)

Posted May 26, 2020 11:12 UTC (Tue) by jezuch (subscriber, #52988)

Posted May 25, 2020 15:39 UTC (Mon) by anton (subscriber, #25547)

> If all of the above are truly inadequate to the task at hand, I'm probably better off writing a Python script anyway.

The nice thing about awk is that when you find that your non-awk shell-scripting tools miss a feature, you don't need to rewrite the whole shell script in python.

> What does bother me is when people write awk "{ print $1 }" instead of cut -f1. I find the latter more readable.

I am one of these people (except that I use single quotes). Using cut may well be more readable, but then I know how to write it in awk without looking up a man page; that's because I use awk every few days, but not that many of these uses could be replaced with cut.
As for whether you are better off, in the 1990s a colleague wrote a tool in Perl that I then inherited. I eventually rewrote it as a shell script (including parts in awk), and the result turned out to have half the lines of the Perl variant.
Posted May 20, 2020 3:01 UTC (Wed) by ncm (guest, #165)
I have never written a 500-line awk script, and probably won't, but for the one-liner that blows up, it has enough headroom for the extra load.
Posted May 20, 2020 5:15 UTC (Wed) by areilly (subscriber, #87829)
There's a really nice video of a lecture by Brian Kernighan about awk, here: https://youtu.be/Sg4U4r_AgJU
It's not perfect. Fairly easy to stub your toe on some of the function syntax. I think that a more modern design would lean towards more "functional" functions, for example.
I do think that the addition of namespaces is an indication that some people are "doing it wrong"... :-)
Posted May 20, 2020 3:36 UTC (Wed) by felixfix (subscriber, #242)

Posted May 20, 2020 13:56 UTC (Wed) by edeloget (subscriber, #88392)
And that's good. awk is a solid program with tons of possible use cases.
Posted Nov 19, 2020 11:57 UTC (Thu) by motiejus (subscriber, #92837)
#!/usr/bin/awk -f
BEGIN { FS = "," }
$1 > ymin && $1 < ymax && $2 > xmin && $2 < xmax {print $2 "," $1 "," $3}

I just ran this again on a data sub-set (100M of data points, 2.7GB uncompressed) just to have data for this comment. My 8-core laptop did the whole operation in 29 seconds:

1. each file: unzip to memory.
2. each file: run through the program above for the bounding box.
3. each file: sort.
4. all files: merge sort.
5. all files: compress.

Combined with GNU Make, `sort` and `sort -m`, I can't imagine a more powerful combination of tools for this simple "big data"(?) task.
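For example, a minimal sketch of the sort and merge steps (my illustration; file names are invented):

sort chunk1.csv > chunk1.sorted          # step 3, once per file
sort -m chunk*.sorted | gzip > out.gz    # steps 4 and 5: merge pre-sorted files, then compress

sort -m merges inputs that are already sorted without re-sorting them, which is what makes the per-file sorts worthwhile.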
No, awk is not dead, and spending a half-hour[1] is enough to use it for life. :)
Posted May 20, 2020 10:55 UTC (Wed) by NAR (subscriber, #1313)
I think awk started to lose its relevance when data started to be structured differently (XML, JSON, etc.) rather than as sequences of lines. I don't remember the last time I wrote an awk script; for one-liners perl suffices (with the -e and -n options) - and perl can be used to build bigger programs, so what's the point in keeping up with AWK? Similarly, I just realized how odd it is to write HTML tags into this comment form when all the other forms I use require Markdown - I almost automatically started to type it before I realized where I am :-)
Posted May 20, 2020 18:28 UTC (Wed) by jthill (subscriber, #56558)
json and xml are big hammers, far too often people swing them for little jobs.
Posted May 20, 2020 23:57 UTC (Wed) by wahern (subscriber, #37304)
I think AWK is seeing a resurgence precisely because Perl isn't as ubiquitous as it once was. You can't depend on Python being installed, either, and even if you could it still sucks for short, shell-style programming. Which is why as Python has displaced Perl, there's more demand for AWK to fill the remaining gap.
I agree that XML and JSON have altered the landscape, but XML and JSON don't fit streaming paradigms very well. Even when something like jq is available, I usually find the regular shell utilities to be far more convenient, and AFAICT so do most others. It's always been the case that for highly structured data you ended up using more sophisticated programming languages, anyhow. The reason why the Unix shell and shell programming have persisted for so long is precisely because the "one language to rule them all" and "one record format to rule them all" approaches never sufficed nearly enough to displace ad hoc text munging tools. The very nature of the problem domain--gluing together disparate, uncooperative tools and data--contradicts the idea that there could ever be a simple, unified solution.
Posted May 21, 2020 15:57 UTC (Thu) by smitty_one_each (subscriber, #28989)

Posted May 21, 2020 3:22 UTC (Thu) by xtifr (guest, #143)
Awk was one of the main things that first attracted me to Unix, several geological ages ago! The idea of a simple language specifically designed for creating filters on-the-fly was a complete revelation!
I no longer overuse and abuse awk the way I did when I was young, but I still use it now and then. Often just typing on the command line!
One of my favorite tricks is running accumulators for multiple types of things:
ps aux|awk '{count[$1]++} END { for (u in count) { print u, ": ", count[u]}}'
tells me how many processes each user has. And yes, that's the sort of thing I can and do just type in when I want to know something like that. :)
Posted May 21, 2020 16:33 UTC (Thu) by PaulMcKenney (✭ supporter ✭, #9624)

Posted May 23, 2020 12:16 UTC (Sat) by unixbhaskar (guest, #44758)

Posted May 23, 2020 8:44 UTC (Sat) by tedd (subscriber, #74183)
Note that the script may need to be edited to run on modern systems - I don't know when this particular post was published.
Posted May 25, 2020 19:44 UTC (Mon) by SiB (subscriber, #4048)

Posted Jun 2, 2020 12:41 UTC (Tue) by amnonbc (guest, #106638)
A simple tool that does one thing well.
Only two features - regular expressions and associative arrays.
And a clear, expressive and readable syntax.
A classic of language design, and a pleasure to use!
Posted Jun 2, 2020 19:12 UTC (Tue) by benhoyt (subscriber, #138463)

Posted Nov 20, 2020 18:42 UTC (Fri) by RobertX (guest, #138591)
I just noticed that as of Ubuntu 20.04, the included mawk is no longer from 1996:
root@ubuntu2004:~# mawk -W version
mawk 1.3.4 20200120
Copyright 2008-2019,2020, Thomas E. Dickey
Copyright 1991-1996,2014, Michael D. Brennan
As recently as Ubuntu 18.04, this is what I get:
root@ubuntu18:~# mawk -W version
mawk 1.3.3 Nov 1996, Copyright (C) Michael D. Brennan
Yes, that's right, in 2018 it said it's mawk 1.3.3 from 1996. For some perspective, the gawk 5.0 release date is further from November 1996 than mawk 1.3.3's release was from the original AWK from 1977.
Congratulations to the Ubuntu team for finally bringing mawk into the new millennium! (Can I still say it's new, 20 years in?)