The state of the AWK
AWK is a text-processing language with a history spanning more than 40 years. It has a POSIX standard, several conforming implementations, and is still surprisingly relevant in 2020 — both for simple text processing tasks and for wrangling "big data". The recent release of GNU Awk 5.1 seems like a good reason to survey the AWK landscape, see what GNU Awk has been up to, and look at where AWK is being used these days.
The language was created at Bell Labs in 1977. Its name comes from the initials of the original authors: Alfred Aho, Peter Weinberger, and Brian Kernighan. A Unix tool to the core, AWK is designed to do one thing well: to filter and transform lines of text. It's commonly used to parse fields from log files, transform output from other tools, and count occurrences of words and fields. Aho summarized AWK's functionality succinctly:
AWK reads the input a line at a time. A line is scanned for each pattern in the program, and for each pattern that matches, the associated action is executed.
AWK programs are often one-liners executed directly from the command line. For example, to calculate the average response time of GET requests from some hypothetical web server log, you might type:
$ awk '/GET/ { total += $6; n++ } END { print total/n }' server.log
0.0186667
This means: for all lines matching the regular expression /GET/, add up the response time (the sixth field or $6) and count the line; at the end, print out the arithmetic mean of the response times.
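To make the field numbering concrete, a hypothetical log line for this example might look like the following (my invention; the article doesn't show the actual log format):

2020-05-18 03:35:24 GET /about/ 200 0.015

With the default whitespace field splitting, $3 is the request method and $6 is the response time in seconds.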
The various AWK versions
There are three main versions of AWK in use today, and all of them conform to the POSIX standard (closely enough, at least, for the vast majority of use cases). The first is classic awk, the version of AWK described by Aho, Weinberger, and Kernighan in their book The AWK Programming Language. It's sometimes called "new AWK" (nawk) or "one true AWK", and it's now hosted on GitHub. This is the version pre-installed on many BSD-based systems, including macOS (though the version that comes with macOS is out of date, and worth upgrading).
The second is GNU Awk (gawk), which is by far the most featureful and actively maintained version. Gawk is usually pre-installed on Linux systems and is often the default awk. It is easy to install on macOS using Homebrew and Windows binaries are available as well. Arnold Robbins has been the primary maintainer of gawk since 1994, and continues to shepherd the language (he has also contributed many fixes to the classic awk version). Gawk has many features not present in awk or the POSIX standard, including new functions, networking facilities, a C extension API, a profiler and debugger, and most recently, namespaces.
The third common version is mawk, written by Michael Brennan. It is the default awk on Ubuntu and Debian Linux, and is still the fastest version of AWK, with a bytecode compiler and a more memory-efficient value representation. (Gawk has also had a bytecode compiler since 4.0, so it's now much closer to mawk's speed.)
If you want to use AWK for one-liners and basic text processing, any of the above are fine variants. If you're thinking of using it for a larger script or program, Gawk's features make it the sensible choice.
There are also several other implementations of AWK with varying levels of maturity and maintenance, notably the size-optimized BusyBox version used in embedded Linux environments, a Java rewrite with runtime access to Java language features, and my own GoAWK, a POSIX-compliant version written in Go. The three main AWKs and the BusyBox version are all written in C.
Gawk changes since 4.0
It's been almost 10 years since LWN covered the release of gawk 4.0. It would be tempting to say "much has changed since 2011", but the truth is that things move relatively slowly in the AWK world. I'll describe the notable features since 4.0 here, but for more details you can read the full 4.x and 5.x changelogs. Gawk 5.1.0 came out just over a month ago on April 14.
The biggest user-facing feature is the introduction of namespaces in 5.0. Most modern languages have some concept of namespaces to make it easier to ship large projects and libraries without name clashes. Gawk 5.0 adds namespaces in a backward-compatible way, allowing developers to create libraries, such as this toy math library:
# area.awk
@namespace "area"

BEGIN {
    pi = 3.14159  # namespaced "constant"
}

function circle(radius) {
    return pi*radius*radius
}
To refer to variables or functions in the library, use the namespace::name syntax, similar to C++:
$ gawk -f area.awk -e 'BEGIN { print area::pi, area::circle(10) }'
3.14159 314.159
Robbins believes that AWK's lack of namespaces is one of the key reasons it hasn't caught on as a larger-scale programming language and that this feature in gawk 5.0 may help resolve that. The other major issue Robbins believes is holding AWK back is the lack of a good C extension interface. Gawk's dynamic extension interface was completely revamped in 4.1; it now has a defined API and allows wrapping existing C and C++ libraries so they can be easily called from AWK.
The following code snippet from the example C-code wrapper in the user manual populates an AWK array (a string-keyed hash table) with a filename and values from a stat() system call:
/* empty out the array */
clear_array(array);

/* fill in the array */
array_set(array, "name", make_const_string(name, strlen(name), &tmp));
array_set_numeric(array, "dev", sbuf->st_dev);
array_set_numeric(array, "ino", sbuf->st_ino);
array_set_numeric(array, "mode", sbuf->st_mode);
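On the AWK side, an extension built this way is loaded with @load and then called like a built-in function. With the filefuncs sample extension that ships with gawk, that looks roughly like this (a sketch; the array keys follow the stat() wrapper above):

@load "filefuncs"

BEGIN {
    # stat() fills the array and returns a negative value on failure
    if (stat("/etc/passwd", info) < 0) {
        print "stat failed" > "/dev/stderr"
        exit 1
    }
    print info["name"], info["ino"], info["mode"]
}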
Another change in the 4.2 release (and continued in 5.0) was an overhauled source code pretty-printer. Gawk's pretty-printer enables its use as a standardized AWK code formatter, similar to Go's go fmt tool and Python's Black formatter. For example, to pretty-print the area.awk file from above:
$ gawk --pretty-print -f area.awk

which results in the following output:
@namespace "area" BEGIN { pi = 3.14159 # namespaced "constant" } function circle(radius) { return (pi * radius * radius) }
You may question the tool's choices: why does "BEGIN {" not have a line break before the "{" when the function does? (It turns out AWK syntax doesn't allow that.) Why two blank lines before the function and parentheses around the return expression? But at least it's consistent and may help avoid code-style debates.
Gawk allows a limited amount of runtime type inspection, and extended that with the addition of the typeof() function in 4.2. typeof() returns a string constant like "string", "number", or "array" depending on the input type. This kind of introspection is important for code that recursively walks every item of a nested array, for example (which is something that POSIX AWK can't do).
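Here is a minimal sketch (mine, not from the manual) of such a recursive walk over one of gawk's arrays of arrays, using typeof() to decide when to recurse:

# walk.awk: recursively print every scalar in a possibly nested array.
# "key" is declared as an extra parameter to make it a local variable.
function walk(arr, prefix,    key) {
    for (key in arr) {
        if (typeof(arr[key]) == "array")
            walk(arr[key], prefix "[" key "]")
        else
            print prefix "[" key "] =", arr[key]
    }
}

BEGIN {
    a["x"][1] = "first"
    a["x"][2] = "second"
    a["y"] = 42
    walk(a, "a")
}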
With 4.2, gawk also supports regular expression constants as a first-class data type using the syntax @/foo/. Previously you could not store a regular expression constant in a variable; typeof(@/foo/) returns the string "regexp". In terms of performance, gawk 4.2 brings a significant improvement on Linux systems by using fwrite_unlocked() when it's available. As gawk is single-threaded, it can use the non-locking stdio functions, giving a 7-18% increase in raw output speed — for example gawk '{ print }' on a large file.
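A minimal example of those typed regexp constants in action (my sketch, requiring gawk 4.2 or later):

BEGIN {
    re = @/^GET /          # a regexp constant stored in a variable
    print typeof(re)       # prints "regexp"
    if ("GET /about/" ~ re)
        print "matches"
}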
The GNU Awk User's Guide has always been a thorough reference, but it was substantially updated in 4.1 and again in the 5.x releases, including new examples, summary sections, and exercises, along with some major copy editing.
Last (and also least), a subtle change in 4.0 that I found amusing was the reverted handling of backslash in sub() and gsub(). Robbins writes:
The default handling of backslash in sub() and gsub() has been reverted to the behavior of 3.1. It was silly to think I could break compatibility that way, even for standards compliance.
The sub and gsub functions are core regular expression substitution functions, and even a small "fix" to the complicated handling of backslash broke people's code.
Robbins may have had a small slip in judgment with the original change, but it's obvious he takes backward compatibility seriously. Especially for a popular tool like gawk, sometimes it is better to keep violating the specification than to change how something has always worked.
Is AWK still relevant?
Asking if AWK is still relevant is a bit like asking if air is still relevant: you may not see it, but it's all around you. Many Linux administrators and DevOps engineers use it to transform data or diagnose issues via log files. A version of AWK is installed on almost all Unix-based machines. In addition to ad-hoc usage, many large open-source projects use AWK somewhere in their build or documentation tooling. To name just a few examples: the Linux kernel uses it in the x86 tooling to check and reformat objdump files, Neovim uses it to generate documentation, and FFmpeg uses it for building and testing.
AWK build scripts are surprisingly hard to kill, even when people want to: in 2018 LWN wrote about GCC contributors wanting to replace AWK with Python in the scripts that generate its option-parsing code. There was some support for this proposal at the time, but apparently no one volunteered to do the actual porting, and the AWK scripts live on.
Robbins argues in his 2018 paper for the use of AWK (specifically gawk) as a "systems programming language", in this context meaning a language for writing larger tools and programs. He outlines the reasons he thinks it has not caught on, but Kernighan is "not 100% convinced" that the lack of an extension mechanism is the main reason AWK isn't widely used for larger programs. He suggested that it might be due to the lack of built-in support for access to system calls and the like. But none of that has stopped several people from building larger tools: Robbins' own TexiWeb Jr. literate programming tool (1300 lines of AWK), Werner Stoop's d.awk tool that generates documentation from Markdown comments in source code (800 lines), and Translate Shell, a 6000-line AWK tool that provides a fairly powerful command-line interface to cloud-based translation APIs.
Several developers in the last few years have written about using AWK in their "big data" toolkit as a much simpler (and sometimes faster) tool than heavy distributed computing systems such as Spark and Hadoop. Nick Strayer wrote about using AWK and R to parse 25 terabytes of data across multiple cores. Other big data examples are the tantalizingly-titled article by Adam Drake, "Command-line Tools can be 235x Faster than your Hadoop Cluster", and Brendan O'Connor's "Don’t MAWK AWK – the fastest and most elegant big data munging language!"
Between ad-hoc text munging, build tooling, "systems programming", and big data processing — not to mention text-mode first person shooters — it seems that AWK is alive and well in 2020.
[Thanks to Arnold Robbins for reviewing a draft of this article.]
Index entries for this article
GuestArticles: Hoyt, Ben
Posted May 19, 2020 22:42 UTC (Tue) by dmoulding (subscriber, #95171)

Posted May 20, 2020 11:47 UTC (Wed) by kleptog (subscriber, #1183)
I personally found awk to be just a little confusing past the simple cases and simply replacing awk with perl -lne made it do what I want. I guess I just knew perl better than awk.
Posted May 29, 2020 21:32 UTC (Fri) by mirabilos (subscriber, #84359)
Oh and it’s a pre-release snapshot; 1.3.4 is apparently not quite released yet. Or Tom Dickey doesn’t want to publish a formal release because he just continued development but isn’t the formal developer.
But good/sad to see he’s picking up another project… he also develops ncurses, cdk, xterm and lynx, and except cdk I use them daily… here’s to hoping fixes there won’t be less frequent now ☻
Posted Jul 19, 2020 19:12 UTC (Sun) by ThomasDickey (guest, #140258)
https://invisible-island.net/mawk/CHANGES.html#t20091220
(1.3.5 is a different matter)
Posted May 19, 2020 23:00 UTC (Tue) by warrax (subscriber, #103205)
It's dead. Dead as a dodo. It's just become so embedded in weird and unnatural places that it can't be killed with a single shot.
It has absolutely no relevance to the modern world in terms of engineering, innovation, ... anything really. Let's just let it die in peace.
It was great for its time, but it's time to let go.
RIP. (And I mean that with respect. My actual first professional/paying job was writing a bit of AWK to process some weird billing format thing into a thing $OTHER_SYSTEM could use, so I appreciate it for what it was... but.)
Posted May 19, 2020 23:17 UTC (Tue) by benhoyt (subscriber, #138463)
And I know many other developers use it too ... there's still a lot of text around to be processed (and I say, may it long continue). Additionally, there's the "big data" use cases I linked to in the article, where developers found it faster and simpler than heavier distributed computing tools. See also: https://yourdatafitsinram.net/
Posted May 20, 2020 1:50 UTC (Wed) by NYKevin (subscriber, #129325)
What does bother me is when people write awk "{ print $1 }" instead of cut -f1. I find the latter more readable.
Posted May 20, 2020 4:08 UTC (Wed) by marduk (subscriber, #3831)
You could resort to some kind of permutation of sed, cut, etc. in trivial cases, where spawning a bunch of processes to do what awk can do by itself is acceptable to you.
Posted May 20, 2020 20:01 UTC (Wed) by jafd (subscriber, #129642)
Also, back when the whole Unicode mess was a very on-and-off experience, where some tools would work and some would go bonkers, cut failed me a couple of times while awk was solid. And so it went.
Posted May 20, 2020 20:29 UTC (Wed) by NYKevin (subscriber, #129325)
Why do I care? I have gigabytes of RAM and it's not like I'm going to run out of PIDs from a five-command pipeline. Besides, the kernel should be COWing glibc etc. so it's not even all that much overhead to begin with. If you're using something like Toybox/Busybox/whatever-other-box-is-popular-these-days, then you can literally COW the entire executable.
Posted Nov 19, 2020 13:29 UTC (Thu) by nenad_noveljic (guest, #143180)

Posted May 20, 2020 5:45 UTC (Wed) by cyphar (subscriber, #110703)
Depends on your definition of "simple". While I do make use of all of the tools you've mentioned, awk has carved out its own niche in that pantheon. It allows you to do a handful of things that you would ordinarily need to reach for a "real" programming language for, such as aggregation or basic programs that make use of maps. Yes, you could implement these things in Python fairly easily, but with two downsides.

Compare the following programs, which take the output of sha256sum of a directory tree and find any files which have matching hashes. The one written in awk is verbatim a program I wrote a week ago (note that I actually wrote it in a single line on my command line, but I put it in a file for an easier comparison).

% cat prog.py
collisions = {}
for line in iter(input, ""):
    hash, *_ = line.split() # really should be re.split but that would be too mean to Python
    if hash not in collisions:
        collisions[hash] = []
    collisions[hash].append(line)
for hash, lines in collisions.items():
    if len(lines) > 1:
        print(hash)
        for line in lines:
            print(line)

% cat prog.awk
{
    files[$1][length(files[$1])] = $0
}
END {
    for (hash in files) {
        if (length(files[hash]) > 1) {
            print hash;
            for (idx in files[hash]) {
                print " " files[hash][idx];
            }
        }
    }
}

What is the first thing you notice? All of the boilerplate in Python about iter(input, "") and splitting the line is already done for you in awk. The actual logic of the program is implemented in a single statement in awk, with the rest of the program just printing out the calculation. And that is one of the reasons why I reach for awk much more often than I reach for Python when I have a relatively-simple source of data to parse -- I just have to type a lot less.

> What does bother me is when people write awk "{ print $1 }" instead of cut -f1. I find the latter more readable.

The problem is that cut splits on the literal " " (U+0020) or whatever other literal you specify, while awk splits fields using a regular expression (which by default is /\s+/). Many standard Unix programs output data such that cut's field handling is simply not usable. You could clean up the data with sed, but now you're working around the fact that cut isn't doing its job correctly. I sometimes feel that cut would be a better program if it were implemented as an awk script.
Posted May 20, 2020 9:03 UTC (Wed) by mineo (guest, #126773)

Note that, with the appropriate imports, you can't reduce the line count of your Python example, but you can make it a bit more straightforward:

from collections import defaultdict
from fileinput import input

collisions = defaultdict(list)
for line in input():
    hash, *_ = line.split() # really should be re.split but that would be too mean to Python
    collisions[hash].append(line.strip())
for hash, lines in collisions.items():
    if len(lines) > 1:
        print(hash)
        for line in lines:
            print(line)
Posted May 20, 2020 10:39 UTC (Wed) by mgedmin (subscriber, #34497)
sort sha256sums.txt | uniq -w64 --all-repeated=separate
Posted May 21, 2020 10:19 UTC (Thu) by pgdx (guest, #119243)
First, it sorts all lines, which is not according to spec.
Second, it doesn't print the duplicate hash as "headers" on a line by itself.
Posted May 20, 2020 6:43 UTC (Wed) by dumain (subscriber, #82016)

Posted May 20, 2020 23:27 UTC (Wed) by wahern (subscriber, #37304)
[1] There are special semantics for single-character FS; semantics that mimic shell word splitting.
Posted May 21, 2020 17:11 UTC (Thu) by NYKevin (subscriber, #129325)
No it isn't. You're meant to call tr(1) with appropriate arguments, and pipe the result into cut. If you do that, then neither of those limitations matter.
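For example, a minimal sketch of that approach (assuming space-separated input like the article's server.log):

tr -s ' ' '\t' < server.log | cut -f6

tr -s squeezes each run of spaces down to a single tab, which is cut's default field delimiter.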
Posted May 22, 2020 8:58 UTC (Fri) by ptman (subscriber, #57271)

Posted May 20, 2020 6:46 UTC (Wed) by Wol (subscriber, #4433)
At which point, while you may not care, you are making the computer do 10 times as much work. Invoking a program is expensive. That's the complaint against bash scripts - every time they call out to a command they are setting up an environment, tearing it down, and generally doing loads of busywork.
If you can do all that with a single call to awk, you've probably reduced the overheads by 99%, if not more!
(Still, modern programmers don't seem to understand the meaning of the word "efficient")
Cheers,
Wol
Posted May 20, 2020 13:33 UTC (Wed) by Paf (guest, #91811)
I do think about efficiency - for ad-hoc data processing, I start with “how fast can I do this without compromising the actual performance I need”, then work in from there if something’s slow.
Posted May 20, 2020 18:49 UTC (Wed) by geert (subscriber, #98403)
A long time ago, a colleague came to me for help doing search and replace in a very large file. His editor of choice was "xedit", and the search-and-replace operation seemed to hang, or at least took ages. I opened his file in "vi", which performed the same operation in the blink of an eye. Didn't even have to resort to sed.
Lesson learned: "xedit" was written as a sample program for showing how to use the X11 Athena Widgets, it was never meant to be a production-level editor.
Posted May 20, 2020 20:19 UTC (Wed) by NYKevin (subscriber, #129325)
On the other hand, if you're doing a while read; do ...; done style thingy, then yes, it will be awful and slow. But I try to avoid that most of the time.
Posted May 23, 2020 12:22 UTC (Sat) by unixbhaskar (guest, #44758)
cat somefile | grep somepattern

and the correction was:

grep somepattern somefile --> this is essentially the point you made: one less process invocation.

:)
Posted May 23, 2020 19:53 UTC (Sat) by Jandar (subscriber, #85683)
<somefile grep somepattern >output

The position of a redirection doesn't matter; only the order matters if there are dependencies.
Posted May 23, 2020 20:32 UTC (Sat) by Wol (subscriber, #4433)
Every extra pipe is an extra trip round the setup/teardown busywork loop - which if you pre-allocate memory could actually be a big problem even if you think you have plenty.
Cheers,
Wol
Posted May 24, 2020 13:18 UTC (Sun) by madscientist (subscriber, #16861)
There are no pipes in Jandar's suggested alternative.
This feels more like StackOverflow than LWN, but the issue is that grep foo somefile gives different output than cat somefile | grep foo and if you want the latter behavior while still avoiding UUoC, you should be using grep foo < somefile instead.
Posted May 24, 2020 13:50 UTC (Sun) by mpr22 (subscriber, #60784)

Posted May 24, 2020 17:12 UTC (Sun) by madscientist (subscriber, #16861)

You're right, grep behaves the same; my bad! I was thinking of some other tools like wc which have different output when given a filename versus reading from stdin.
This can be useful in scripting to avoid the complexity of stripping off the unwanted filename.
Posted May 24, 2020 14:02 UTC (Sun) by Wol (subscriber, #4433)
On first thoughts my reaction was "aren't < and > just different syntaxes for pipes?".
My second thought now is that "no they aren't actually pipes, they're shell built-ins".
So yeah you're right. They're pretty much identical in effect (and concept), but different in implementation and impact on the system. There's more than one way to do it ... :-)
Cheers,
Wol
Posted May 20, 2020 12:00 UTC (Wed) by neilbrown (subscriber, #359)
I'm a proud user of 'awk "{print $1}"' - maybe I don't care about readability, only write-ability.

A small extension to that (e.g., ignore lines starting '#') is easy within the same tool. A small extension to "cut -f1" requires a different tool.

Awk seems to me to be a good answer to the requirement "Simple things should be simple, complex things should be possible".
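For example, a minimal sketch of such an extension (hypothetical file name):

awk '!/^#/ { print $1 }' input.txt    # print the first field, skipping comment lines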
Posted May 23, 2020 12:18 UTC (Sat) by unixbhaskar (guest, #44758)

Posted May 20, 2020 16:30 UTC (Wed) by scientes (guest, #83068)

Posted May 20, 2020 20:05 UTC (Wed) by NYKevin (subscriber, #129325)
(Strictly speaking, it is not correct to claim that non-greedy matching is required for parsing arbitrary regular languages. Formally, a regular language can be parsed entirely in terms of literal characters, parentheses, alternation, and the Kleene star, plus anchoring if you assume that regexes are not implicitly anchored. But this might require a very long and unwieldy regex in practice, so a lack of non-greedy matching is certainly a valid complaint.)
Alternatively, I suppose you could use ex(1) noninteractively (with -c {command} or +{command}).
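For example, a sketch of that kind of rewriting (my illustration): the non-greedy match "everything up to the first closing quote" can be expressed with a negated character class instead, using POSIX awk's match(), which sets RSTART and RLENGTH:

$ echo 'say "foo" then "bar"' | awk 'match($0, /"[^"]*"/) { print substr($0, RSTART, RLENGTH) }'
"foo"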
Posted May 21, 2020 3:02 UTC (Thu) by xtifr (guest, #143)

Posted May 23, 2020 21:51 UTC (Sat) by NYKevin (subscriber, #129325)

Posted May 26, 2020 11:12 UTC (Tue) by jezuch (subscriber, #52988)

Posted May 25, 2020 15:39 UTC (Mon) by anton (subscriber, #25547)

> If all of the above are truly inadequate to the task at hand, I'm probably better off writing a Python script anyway.

The nice thing about awk is that when you find that your non-awk shell-scripting tools miss a feature, you don't need to rewrite the whole shell script in python.

> What does bother me is when people write awk "{ print $1 }" instead of cut -f1. I find the latter more readable.

I am one of these people (except that I use single quotes). Using cut may well be more readable, but then I know how to write it in awk without looking up a man page; that's because I use awk every few days, but not that many of these uses could be replaced with cut.
As for whether you are better off, in the 1990s a colleague wrote a tool in Perl that I then inherited. I eventually rewrote it as a shell script (including parts in awk), and the result turned out to have half the lines of the Perl variant.
Posted May 20, 2020 3:01 UTC (Wed) by ncm (guest, #165)
I have never written a 500-line awk script, and probably won't, but for the one-liner that blows up, it has enough headroom for the extra load.
Posted May 20, 2020 5:15 UTC (Wed) by areilly (subscriber, #87829)
There's a really nice video of a lecture by Brian Kernighan about awk, here: https://youtu.be/Sg4U4r_AgJU
It's not perfect. Fairly easy to stub your toe on some of the function syntax. I think that a more modern design would lean towards more "functional" functions, for example.
I do think that the addition of namespaces is an indication that some people are "doing it wrong"... :-)
Posted May 20, 2020 3:36 UTC (Wed) by felixfix (subscriber, #242)

Posted May 20, 2020 13:56 UTC (Wed) by edeloget (subscriber, #88392)
And that's good. awk is a solid program with tons of possible use cases.
Posted Nov 19, 2020 11:57 UTC (Thu) by motiejus (subscriber, #92837)
#!/usr/bin/awk -f
BEGIN { FS = "," }
$1 > ymin && $1 < ymax && $2 > xmin && $2 < xmax {print $2 "," $1 "," $3}

I just ran this again on a data sub-set (100M of data points, 2.7GB uncompressed) just to have data for this comment. My 8-core laptop did the whole operation in 29 seconds:

1. each file: unzip to memory.
2. each file: run through the program above for the bounding box.
3. each file: sort.
4. all files: merge sort.
5. all files: compress.

Combined with GNU Make, `sort` and `sort -m`, I can't imagine a more powerful combination of tools for this simple "big data"(?) task.
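For example, a minimal sketch of the sort and merge steps (my illustration; file names are invented):

sort chunk1.csv > chunk1.sorted          # step 3, once per file
sort -m chunk*.sorted | gzip > out.gz    # steps 4 and 5: merge pre-sorted files, then compress

sort -m merges inputs that are already sorted without re-sorting them, which is what makes the per-file sorts worthwhile.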
No, awk is not dead, and spending a half-hour[1] is enough to use it for life. :)
Posted May 20, 2020 10:55 UTC (Wed) by NAR (subscriber, #1313)
I think awk started to lose its relevance when data started to be structured differently (XML, JSON, etc.) rather than as sequences of lines. I don't remember the last time I wrote an awk script; for one-liners perl suffices (with the -e and -n options) - and perl can be used to build bigger programs, so what's the point in keeping up with AWK? Similarly, I just realized how odd it is to write HTML tags into this comment form when all the other forms I use require Markdown - I almost automatically started to type it before I realized where I am :-)
Posted May 20, 2020 18:28 UTC (Wed) by jthill (subscriber, #56558)
json and xml are big hammers, far too often people swing them for little jobs.
Posted May 20, 2020 23:57 UTC (Wed) by wahern (subscriber, #37304)
I think AWK is seeing a resurgence precisely because Perl isn't as ubiquitous as it once was. You can't depend on Python being installed, either, and even if you could it still sucks for short, shell-style programming. Which is why as Python has displaced Perl, there's more demand for AWK to fill the remaining gap.
I agree that XML and JSON have altered the landscape, but XML and JSON don't fit streaming paradigms very well. Even when something like jq is available, I usually find the regular shell utilities to be far more convenient, and AFAICT so do most others. It's always been the case that for highly structured data you ended up using more sophisticated programming languages, anyhow. The reason why the Unix shell and shell programming have persisted for so long is precisely because the "one language to rule them all" and "one record format to rule them all" approaches never sufficed nearly enough to displace ad hoc text munging tools. The very nature of the problem domain--gluing together disparate, uncooperative tools and data--contradicts the idea that there could ever be a simple, unified solution.
Posted May 21, 2020 15:57 UTC (Thu) by smitty_one_each (subscriber, #28989)

Posted May 21, 2020 3:22 UTC (Thu) by xtifr (guest, #143)
Awk was one of the main things that first attracted me to Unix, several geological ages ago! The idea of a simple language specifically designed for creating filters on-the-fly was a complete revelation!
I no longer overuse and abuse awk the way I did when I was young, but I still use it now and then. Often just typing on the command line!
One of my favorite tricks is running accumulators for multiple types of things:
ps aux|awk '{count[$1]++} END { for (u in count) { print u, ": ", count[u]}}'
tells me how many processes each user has. And yes, that's the sort of thing I can and do just type in when I want to know something like that. :)
Posted May 21, 2020 16:33 UTC (Thu) by PaulMcKenney (✭ supporter ✭, #9624)

Posted May 23, 2020 12:16 UTC (Sat) by unixbhaskar (guest, #44758)

Posted May 23, 2020 8:44 UTC (Sat) by tedd (subscriber, #74183)
Note that the script may need to be edited to run on modern systems - I don't know when this particular post was published.
Posted May 25, 2020 19:44 UTC (Mon) by SiB (subscriber, #4048)

Posted Jun 2, 2020 12:41 UTC (Tue) by amnonbc (guest, #106638)
A simple tool that does one thing well.
Only two features - regular expressions and associative arrays.
And a clear, expressive and readable syntax.
A classic of language design, and a pleasure to use!
Posted Jun 2, 2020 19:12 UTC (Tue) by benhoyt (subscriber, #138463)

Posted Nov 20, 2020 18:42 UTC (Fri) by RobertX (guest, #138591)
I just noticed that as of Ubuntu 20.04, the included mawk is no longer from 1996:
root@ubuntu2004:~# mawk -W version
mawk 1.3.4 20200120
Copyright 2008-2019,2020, Thomas E. Dickey
Copyright 1991-1996,2014, Michael D. Brennan
As recently as Ubuntu 18.04, this is what I get:
root@ubuntu18:~# mawk -W version
mawk 1.3.3 Nov 1996, Copyright (C) Michael D. Brennan
Yes, that's right, in 2018 it said it's mawk 1.3.3 from 1996. For some perspective, the gawk 5.0 release date is further from November 1996 than mawk 1.3.3's release was from the original AWK from 1977.
Congratulations to the Ubuntu team for finally bringing mawk into the new millennium! (Can I still say it's new, 20 years in?)