fix uniq -c
Posted Feb 17, 2025 9:30 UTC (Mon) by stijn (subscriber, #570) In reply to: fix uniq -c by dskoll
Parent article: Rewriting essential Linux packages in Rust
The thought had occurred to me. Implicit in my point here is that this is a fudge, easily fixed by adding a new option to uniq that does the right thing. Shell programming with unix pipes can be an elegant and very concise way to mutate data in a functional way. Having to include fudges like the above (at the expense of a process) grates and creates an impression of crummy (shell) programming that should be completely unnecessary.
Posted Feb 17, 2025 11:18 UTC (Mon)
by mbunkus (subscriber, #87248)
[Link] (4 responses)
• humans: we need data formatted so that it is visually very clear where columns start & end. We also prefer to be able to determine at a glance when a number is bigger than another number. This means that columns must be aligned in the first place in order to satisfy the first requirement, and for numbers right-aligning satisfies the second. We (humans) might even profit from table borders.
You cannot satisfy all those requirements with a single format. Therefore I consider your argument to be completely wrong. The default output for uniq is to be easily readable by humans. That's a design choice. It's not a bug.
[1] Examples with bash:
[mosu@velvet ~]$ printf "moo\nmoo\ncow\n" | uniq -c | awk '{ sum += $1 } END { print sum }'
Posted Feb 17, 2025 18:46 UTC (Mon)
by stijn (subscriber, #570)
[Link] (2 responses)
- current default behaviour of uniq -c is poor for composing.
With this, we can have both a 'visually clear' format and a suitable-for-composing format. For compatibility the current format is of course the default in that scenario.
> For example, if you want to process it further via pipes then awk & bash don't care at all about the right-aligned numbers[1], whereas other programs might.
I work a lot with dataframes, which are essentially mysql tables in tab-separated format with column headers, or equivalently a single table in a spreadsheet, or the things you might want to read with Python pandas or in R. Tab-separated is preferred, as I've never encountered a need to escape embedded tab characters. In this wider ecosystem there is no automatic white-space scrubbing of data and there is a requirement that tables are well-formatted. Programs such as comm, join, datamash, shuf and a fair few more can be very handy in summarising, QC'ing or (even) manipulating this data. Hence I clamour for the ability (not necessarily as default) to have all tuple/table type data formatted as tab-separated tables, with or without column names. This should go well with unix composability of processes.
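As a minimal sketch of the tab-separated output asked for here, the padding can be stripped after the fact at the cost of one extra process. The function name `uniq_c_tsv` is hypothetical, not an existing uniq option, and the sketch assumes the usual `uniq -c` layout of optional leading spaces, a count, one space, then the line.

```shell
# Hypothetical helper (not a uniq option): turn `uniq -c` output into
# strict two-column "count<TAB>line" records.
uniq_c_tsv() {
    uniq -c | awk '{
        count = $1                    # field splitting ignores the leading padding
        sub(/^ *[0-9]+ /, "")         # drop the padded count and one separator space
        printf "%s\t%s\n", count, $0  # the rest of the line is kept verbatim
    }'
}

# Usage: prints "2<TAB>moo" and "1<TAB>cow"
printf 'moo\nmoo\ncow\n' | uniq_c_tsv
```

Only the first separator is removed, so any spaces inside the counted line itself survive unchanged.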
Posted Feb 17, 2025 19:18 UTC (Mon)
by mbunkus (subscriber, #87248)
[Link] (1 responses)
> It is quite puzzling that Richard Stallman let this program loose on the world as it violates usual Unix well-behavedness of textual interfaces.
And to that my argument was that first and foremost `uniq -c` was most likely designed to be easy to read by humans. By that metric it is very much well-behaved & doesn't violate anything. Furthermore, even with it being designed to be human-readable its output is actually useable as-is by a lot of other traditional Unix programs, making it arguably even less of a "wrong" that has to be "righted" (your choice of words, again from your first post). Apart from awk & bash which I mentioned earlier, "sort -n" works fine as well.
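A small sketch of the `sort -n` case, assuming nothing beyond standard `sort` and `uniq`: numeric sort skips leading blanks, so the padded counts order correctly without any cleanup. The function name `most_frequent_first` is made up for the example.

```shell
# Hypothetical helper: count distinct lines and list the most frequent first.
# `sort -n` ignores the leading blanks that `uniq -c` emits, so no sed is needed.
most_frequent_first() {
    sort | uniq -c | sort -rn
}

# Usage: moo (3) is listed before cow (2), which is listed before yak (1)
printf 'moo\ncow\nmoo\nmoo\ncow\nyak\n' | most_frequent_first
```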
Posted Feb 17, 2025 20:15 UTC (Mon)
by stijn (subscriber, #570)
[Link]
Posted Feb 18, 2025 8:11 UTC (Tue)
by mathstuf (subscriber, #69389)
[Link]
> • humans: we need data formatted so that it is visually very clear where columns start & end. We also prefer to be able to determine at a glance when a number is bigger than another number. This means that columns must be aligned in the first place in order to satisfy the first requirement, and for numbers right-aligning satisfies the second. We (humans) might even profit from table borders.
Then why is it a static number of columns wide? I know to make it actually the right width, the entire output needs to be known so that you can't output anything until the whole thing is read, but if human viewing is most important, why not buffer and Do It Right™? Either the output is small and quick enough to not really matter or it is so large that…what human is really going to be looking at it directly anyways?
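To illustrate what buffering would look like, here is a sketch that holds all of the `uniq -c` output in memory and pads the counts to the width of the largest one. This is not how uniq behaves; `uniq_c_fitted` is a hypothetical wrapper, and it deliberately accepts the cost named above: nothing is printed until the input ends.

```shell
# Hypothetical wrapper: right-align counts to the widest count actually seen.
uniq_c_fitted() {
    uniq -c | awk '{
        count[NR] = $1                      # remember the count
        sub(/^ *[0-9]+ /, "")               # strip the fixed-width padding
        line[NR] = $0                       # remember the line itself
        if (length(count[NR]) > width)      # track the widest count so far
            width = length(count[NR])
    }
    END {
        fmt = "%" width "s %s\n"            # e.g. "%2s %s\n" for counts up to 99
        for (i = 1; i <= NR; i++)
            printf fmt, count[i], line[i]
    }'
}

# Usage: twelve "moo" lines and one "cow" line print as "12 moo" and " 1 cow"
{ for i in 1 2 3 4 5 6 7 8 9 10 11 12; do echo moo; done; echo cow; } | uniq_c_fitted
```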
> • other programs: here it depends on what the other program is & what it expects as input. For example, if you want to process it further via pipes then awk & bash don't care at all about the right-aligned numbers[1], whereas other programs might. If your goal isn't pipe-processing but e.g. copy-pasting into spreadsheets, then CSV-formatted data might be much better (though that would make processing in awk/bash much harder)
Sure, awk and bash do separator coalescing. But `cut` doesn't, so one needs to `sed` before `cut`, but not before `bash` and `awk`. Great. Yet another paper cut to remember in shell scripts. Given that `bash` and `awk` also handle the single-separator format that `cut` would understand, and that the benefit to human readers is questionable…it really seems like an unnecessary quirk of a tool's output.
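The `sed`-before-`cut` workaround can be sketched as follows. The function name `count_column` is hypothetical. With the padded output, `cut -d ' ' -f 1` on its own sees an empty first field, because every leading space counts as a delimiter.

```shell
# Hypothetical helper: extract the count column from `uniq -c` with cut.
# The sed step removes the leading padding so that field 1 is the count.
count_column() {
    uniq -c | sed -e 's/^ *//' | cut -d ' ' -f 1
}

# Usage: prints 2, then 1
printf 'moo\nmoo\ncow\n' | count_column
```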
• other programs: here it depends on what the other program is & what it expects as input. For example, if you want to process it further via pipes then awk & bash don't care at all about the right-aligned numbers[1], whereas other programs might. If your goal isn't pipe-processing but e.g. copy-pasting into spreadsheets, then CSV-formatted data might be much better (though that would make processing in awk/bash much harder).
3
[mosu@velvet ~]$ printf "moo\nmoo\ncow\n" | uniq -c | ( while read line ; do set - $line ; echo $1 ; done )
2
1
[mosu@velvet ~]$
- let's add an option so that we can have the behaviour that I like.
I let a pointless/misdirected rhetorical flourish get the better of me again. I understand the history, and (recasting it from wrong/right) I view it as beneficial to strive towards making outputs (tuples/tables) composable, preferably by default, or at least to add options that do so without assumptions about white-space padding. The table format (mysql dump, spreadsheet, dataframe) is pervasive and super useful. In the right setup shell pipelines can do amazing things with them. Working in those types of table environments has made the various stripes of unix white-space padding and stripping jar with me.