Surprisingly relevant?
Posted May 19, 2020 23:00 UTC (Tue) by warrax (subscriber, #103205)
Parent article: The state of the AWK
It's dead. Dead as a dodo. It's just become so embedded in weird and unnatural places that it can't be killed with a single shot.
It has absolutely no relevance to the modern world in terms of engineering, innovation, ... anything really. Let's just let it die in peace.
It was great for its time, but it's time to let go.
RIP. (And I mean that with respect. My actual first professional/paying job was writing a bit of AWK to process some weird billing format thing into a thing $OTHER_SYSTEM could use, so I appreciate it for what it was... but.)
Posted May 19, 2020 23:17 UTC (Tue)
by benhoyt (subscriber, #138463)
[Link] (32 responses)
And I know many other developers use it too ... there's still a lot of text around to be processed (and I say, may it long continue). Additionally, there are the "big data" use cases I linked to in the article, where developers found it faster and simpler than heavier distributed computing tools. See also: https://yourdatafitsinram.net/
Posted May 20, 2020 1:50 UTC (Wed)
by NYKevin (subscriber, #129325)
[Link] (31 responses)
What does bother me is when people write awk "{ print $1 }" instead of cut -f1. I find the latter more readable.
Posted May 20, 2020 4:08 UTC (Wed)
by marduk (subscriber, #3831)
[Link] (3 responses)
You could resort to some kind of permutation of set, cut, in trivial cases where spawning a bunch of processes to do what awk can do by itself is acceptable to you.
Posted May 20, 2020 20:01 UTC (Wed)
by jafd (subscriber, #129642)
[Link]
Also, back when Unicode support was still very hit-and-miss (some tools would work and some would go bonkers), cut failed me a couple of times while awk was solid. And so it went.
Posted May 20, 2020 20:29 UTC (Wed)
by NYKevin (subscriber, #129325)
[Link] (1 responses)
Why do I care? I have gigabytes of RAM and it's not like I'm going to run out of PIDs from a five-command pipeline. Besides, the kernel should be COWing glibc etc. so it's not even all that much overhead to begin with. If you're using something like Toybox/Busybox/whatever-other-box-is-popular-these-days, then you can literally COW the entire executable.
Posted Nov 19, 2020 13:29 UTC (Thu)
by nenad_noveljic (guest, #143180)
[Link]
Posted May 20, 2020 5:45 UTC (Wed)
by cyphar (subscriber, #110703)
[Link] (3 responses)
Depends on your definition of "simple". While I do make use of all of the tools you've mentioned, awk has carved out its own niche in that pantheon. It allows you to do a handful of things that you would ordinarily need to reach for a "real" programming language for, such as aggregation or basic programs that make use of maps. Yes, you could implement these things in Python fairly easily, but with two downsides:

Compare the following programs, which take the output of sha256sum of a directory tree and find any files which have matching hashes. The one written in awk is verbatim a program I wrote a week ago (note that I actually wrote it in a single line on my command-line, but I put it in a file for an easier comparison).

% cat prog.py
collisions = {}
for line in iter(input, ""):
    hash, *_ = line.split() # really should be re.split but that would be too mean to Python
    if hash not in collisions:
        collisions[hash] = []
    collisions[hash].append(line)
for hash, lines in collisions.items():
    if len(lines) > 1:
        print(hash)
        for line in lines:
            print(line)

% cat prog.awk
{
    files[$1][length(files[$1])] = $0
}
END {
    for (hash in files) {
        if (length(files[hash]) > 1) {
            print hash;
            for (idx in files[hash]) {
                print " " files[hash][idx];
            }
        }
    }
}

What is the first thing you notice? All of the boilerplate in Python about iter(input, "") and splitting the line is already done for you in awk. The actual logic of the program is implemented in a single statement in awk, with the rest of the program just printing out the calculation. And that is one of the reasons why I reach for awk much more often than I reach for Python when I have a relatively-simple source of data to parse -- I just have to type a lot less.

> What does bother me is when people write awk "{ print $1 }" instead of cut -f1. I find the latter more readable.

The problem is that cut splits on the literal " " (U+0020) or whatever other literal you specify, while awk splits fields using a regular expression (which by default is /\s+/). Many standard Unix programs output data such that cut's field handling is simply not usable. You could clean up the data with sed, but now you're working around the fact that cut isn't doing its job correctly. I sometimes feel that cut would be a better program if it were implemented as an awk script.
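To make the field-splitting difference concrete, here is a small made-up example: with a run of blanks in the input, cut treats every delimiter character as a field boundary, while awk collapses the run.

printf 'alpha   beta\n' | cut -d' ' -f2   # prints an empty field: the "second" field sits between two spaces
printf 'alpha   beta\n' | awk '{print $2}'   # prints "beta": a run of blanks counts as one separator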
Posted May 20, 2020 9:03 UTC (Wed)
by mineo (guest, #126773)
[Link]
Note that, with the appropriate imports, you cannot reduce the line count of your Python example, but you can make it a bit more straightforward:

from collections import defaultdict
from fileinput import input

collisions = defaultdict(list)
for line in input():
    hash, *_ = line.split() # really should be re.split but that would be too mean to Python
    collisions[hash].append(line.strip())
for hash, lines in collisions.items():
    if len(lines) > 1:
        print(hash)
        for line in lines:
            print(line)
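For reference, these scripts read the sha256sum listing on standard input (the fileinput-based variant also accepts a file name argument); the find invocation below is just one of several ways to produce that listing:

find . -type f -exec sha256sum {} + | awk -f prog.awk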
Posted May 20, 2020 10:39 UTC (Wed)
by mgedmin (subscriber, #34497)
[Link] (1 responses)
sort sha256sums.txt | uniq -w64 --all-repeated=separate
Posted May 21, 2020 10:19 UTC (Thu)
by pgdx (guest, #119243)
[Link]
First, it sorts all lines, which is not according to spec.
Second, it doesn't print the duplicate hash as a "header" on a line by itself.
Posted May 20, 2020 6:43 UTC (Wed)
by dumain (subscriber, #82016)
[Link] (3 responses)
Posted May 20, 2020 23:27 UTC (Wed)
by wahern (subscriber, #37304)
[Link] (2 responses)
[1] There are special semantics for single-character FS; semantics that mimic shell word splitting.
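A small illustration of that special case (the inputs here are invented): with the awks I know of, the default FS of a single space behaves like shell word splitting, while any other single-character FS is taken literally, so empty fields are preserved.

printf '  a  b\n' | awk '{print NF}'   # 2: leading and repeated blanks are skipped
printf '::a::b\n' | awk -F: '{print NF}'   # 5: every ":" delimits a field, empty ones included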
Posted May 21, 2020 17:11 UTC (Thu)
by NYKevin (subscriber, #129325)
[Link] (1 responses)
No it isn't. You're meant to call tr(1) with appropriate arguments, and pipe the result into cut. If you do that, then neither of those limitations matters.
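For example (the file name is invented), squeezing runs of spaces into single tabs first makes cut's fixed delimiter workable:

tr -s ' ' '\t' < listing.txt | cut -f1

(One remaining wrinkle is leading whitespace, which still produces an initial empty field that awk would have skipped.)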
Posted May 22, 2020 8:58 UTC (Fri)
by ptman (subscriber, #57271)
[Link]
Posted May 20, 2020 6:46 UTC (Wed)
by Wol (subscriber, #4433)
[Link] (10 responses)
At which point, while you may not care, you are making the computer do 10 times as much work. Invoking a program is expensive. That's the complaint against bash scripts - every time they call out to a command they are setting up an environment, tearing it down, and generally doing loads of busywork.
If you can do all that with a single call to awk, you've probably reduced the overheads by 99%, if not more!
(Still, modern programmers don't seem to understand the meaning of the word "efficient")
Cheers,
Wol
Posted May 20, 2020 13:33 UTC (Wed)
by Paf (subscriber, #91811)
[Link] (2 responses)
I do think about efficiency - for ad-hoc data processing, I start with “how fast can I do this without compromising the actual performance I need”, then work in from there if something’s slow.
Posted May 20, 2020 18:49 UTC (Wed)
by geert (subscriber, #98403)
[Link] (1 responses)
A long time ago, a colleague came to me for help doing search and replace in a very large file. His editor of choice was "xedit", and the search and replace operation seemed to hang, or at least took ages. I opened his file in "vi", which performed the same operation in the blink of an eye. Didn't even have to resort to sed.
Lesson learned: "xedit" was written as a sample program for showing how to use the X11 Athena Widgets, it was never meant to be a production-level editor.
Posted May 20, 2020 20:19 UTC (Wed)
by NYKevin (subscriber, #129325)
[Link]
On the other hand, if you're doing a while read; do ...; done style thingy, then yes, it will be awful and slow. But I try to avoid that most of the time.
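For example (the file name is invented), the per-line style forks a pipeline for every input line, whereas the awk version is a single process:

while read -r line; do echo "$line" | cut -d' ' -f1; done < hosts.txt
awk '{print $1}' hosts.txt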
Posted May 23, 2020 12:22 UTC (Sat)
by unixbhaskar (guest, #44758)
[Link] (6 responses)
cat somefile | grep somepattern
and the correction was ..
grep somepattern somefile --> this is essentially what you said: one less process invocation.
:)
Posted May 23, 2020 19:53 UTC (Sat)
by Jandar (subscriber, #85683)
[Link] (5 responses)
<somefile grep somepattern >output
The position of a redirection within the command doesn't matter; only the relative order of the redirections matters, and only when they depend on each other.
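For example, the first two commands below are equivalent (the names are invented); the last two differ only because 2>&1 depends on where stdout points at that moment:

grep somepattern <somefile >output
>output <somefile grep somepattern
grep somepattern <somefile >output 2>&1   # stderr follows stdout into the file
grep somepattern <somefile 2>&1 >output   # stderr goes to the original stdout, not the file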
Posted May 23, 2020 20:32 UTC (Sat)
by Wol (subscriber, #4433)
[Link] (4 responses)
Every extra pipe is an extra trip round the setup/teardown busywork loop - which if you pre-allocate memory could actually be a big problem even if you think you have plenty.
Cheers,
Wol
Posted May 24, 2020 13:18 UTC (Sun)
by madscientist (subscriber, #16861)
[Link] (3 responses)
There are no pipes in Jandar's suggested alternative.
This feels more like StackOverflow than LWN, but the issue is that grep foo somefile gives different output than cat somefile | grep foo, and if you want the latter behavior while still avoiding UUoC, you should be using grep foo < somefile instead.
Posted May 24, 2020 13:50 UTC (Sun)
by mpr22 (subscriber, #60784)
[Link] (1 responses)
???
Posted May 24, 2020 17:12 UTC (Sun)
by madscientist (subscriber, #16861)
[Link]
You're right, grep behaves the same; my bad! I was thinking of some other tools, like wc, which have different output when given a filename versus reading from stdin.
This can be useful in scripting to avoid the complexity of stripping off the unwanted filename.
Posted May 24, 2020 14:02 UTC (Sun)
by Wol (subscriber, #4433)
[Link]
On first thoughts my reaction was "aren't < and > just different syntaxes for pipes?".
My second thought now is that "no they aren't actually pipes, they're shell built-ins".
So yeah you're right. They're pretty much identical in effect (and concept), but different in implementation and impact on the system. There's more than one way to do it ... :-)
Cheers,
Wol
Posted May 20, 2020 12:00 UTC (Wed)
by neilbrown (subscriber, #359)
[Link] (1 responses)
I'm a proud user of 'awk "{print $1}"' - maybe I don't care about readability, only write-ability.
A small extension to that (e.g., ignore lines starting '#') is easy within the same tool. A small extension to "cut -f1" requires a different tool.
Awk seems to me to be a good answer to the requirement "Simple things should be simple, complex things should be possible".
Posted May 23, 2020 12:18 UTC (Sat)
by unixbhaskar (guest, #44758)
[Link]
Posted May 20, 2020 16:30 UTC (Wed)
by scientes (guest, #83068)
[Link] (1 responses)
Posted May 20, 2020 20:05 UTC (Wed)
by NYKevin (subscriber, #129325)
[Link]
(Strictly speaking, it is not correct to claim that non-greedy matching is required for parsing arbitrary regular languages. Formally, a regular language can be parsed entirely in terms of literal characters, parentheses, alternation, and the Kleene star, plus anchoring if you assume that regexes are not implicitly anchored. But this might require a very long and unwieldy regex in practice, so a lack of non-greedy matching is certainly a valid complaint.)
Alternatively, I suppose you could use ex(1) noninteractively (with -c {command} or +{command}).
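As an illustration of the workaround, "everything up to the first comma" does not need a non-greedy .*? at all; a negated character class does the job with plain greedy matching (shown here with sed, but the same regex works elsewhere):

echo 'one,two,three' | sed -E 's/^(.*),.*/\1/'   # greedy: prints "one,two"
echo 'one,two,three' | sed -E 's/^([^,]*),.*/\1/'   # negated class: prints "one"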
Posted May 21, 2020 3:02 UTC (Thu)
by xtifr (guest, #143)
[Link] (2 responses)
Posted May 23, 2020 21:51 UTC (Sat)
by NYKevin (subscriber, #129325)
[Link]
Posted May 26, 2020 11:12 UTC (Tue)
by jezuch (subscriber, #52988)
[Link]
Posted May 25, 2020 15:39 UTC (Mon)
by anton (subscriber, #25547)
[Link]
> If all of the above are truly inadequate to the task at hand, I'm probably better off writing a Python script anyway.
The nice thing about awk is that when you find that your non-awk shell-scripting tools miss a feature, you don't need to rewrite the whole shell script in Python.
As for whether you are better off: in the 1990s a colleague wrote a tool in Perl that I then inherited. I eventually rewrote it as a shell script (including parts in awk), and the result turned out to have half the lines of the Perl variant.
> What does bother me is when people write awk "{ print $1 }" instead of cut -f1. I find the latter more readable.
I am one of these people (except that I use single quotes). Using cut may well be more readable, but then I know how to write it in awk without looking up a man page; that's because I use awk every few days, but not that many of these uses could be replaced with cut.
Posted May 20, 2020 3:01 UTC (Wed)
by ncm (guest, #165)
[Link] (1 responses)
I have never written a 500-line awk script, and probably won't, but for the one-liner that blows up, it has enough headroom for the extra load.
Posted May 20, 2020 5:15 UTC (Wed)
by areilly (subscriber, #87829)
[Link]
There's a really nice video of a lecture by Brian Kernighan about awk, here: https://youtu.be/Sg4U4r_AgJU
It's not perfect. Fairly easy to stub your toe on some of the function syntax. I think that a more modern design would lean towards more "functional" functions, for example.
I do think that the addition of namespaces is an indication that some people are "doing it wrong"... :-)
Posted May 20, 2020 3:36 UTC (Wed)
by felixfix (subscriber, #242)
[Link] (1 responses)
Posted May 20, 2020 13:56 UTC (Wed)
by edeloget (subscriber, #88392)
[Link]
And that's good. awk is a solid program with tons of possible use cases.
Posted Nov 19, 2020 11:57 UTC (Thu)
by motiejus (subscriber, #92837)
[Link]
#!/usr/bin/awk -f
BEGIN { FS = "," }
$1 > ymin && $1 < ymax && $2 > xmin && $2 < xmax {print $2 "," $1 "," $3}
I just ran this again on a data sub-set (100M of data points, 2.7GB uncompressed) just to have data for this comment. My 8-core laptop did the whole operation in 29 seconds:
1. each file: unzip to memory.
2. each file: run through the program above for the bounding box.
3. each file: sort.
4. all files: merge sort.
5. all files: compress.
Combined with GNU Make, `sort` and `sort -m`, I can't imagine a more powerful combination of tools for this simple "big data"(?) task.
No, awk is not dead, and spending half an hour[1] is enough to use it for life. :)
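A rough sketch of the shape of that pipeline in plain shell (the paths, the bounding-box numbers, and the name bbox.awk, standing in for the program above, are all invented; in practice GNU Make with -j would run the per-file steps in parallel rather than this sequential loop):

for f in points/*.csv.gz; do
    zcat "$f" | awk -v ymin=54 -v ymax=56 -v xmin=23 -v xmax=26 -f bbox.awk | sort > "${f%.csv.gz}.filtered"
done
sort -m points/*.filtered | gzip > merged.csv.gz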
