fix uniq -c

Posted Feb 17, 2025 9:30 UTC (Mon) by stijn (subscriber, #570)
In reply to: fix uniq -c by dskoll
Parent article: Rewriting essential Linux packages in Rust

> ... | uniq -c | sed -e 's/^ *//'

The thought had occurred to me. Implicit in my point here is that this is a fudge, easily fixed by adding a new option to uniq that does the right thing. Shell programming with unix pipes can be an elegant and very concise way to mutate data in a functional way. Having to include fudges like the above (at the expense of a process) grates and creates an impression of crummy (shell) programming that should be completely unnecessary.

fix uniq -c

Posted Feb 17, 2025 11:18 UTC (Mon) by mbunkus (subscriber, #87248)

I fail to see how this is a problem, let alone a bug. The most important thing here is that the output of such programs can be meant for widely different audiences, each with their own peculiarities:

• humans: we need data formatted so that it is visually very clear where columns start & end. We also prefer to be able to determine at a glance when a number is bigger than another number. This means that columns must be aligned in the first place in order to satisfy the first requirement, and for numbers right-aligning satisfies the second. We (humans) might even profit from table borders.
• other programs: here it depends on what the other program is & what it expects as input. For example, if you want to process it further via pipes then awk & bash don't care at all about the right-aligned numbers[1], whereas other programs might. If your goal isn't pipe-processing but e.g. copy-pasting into spreadsheets, then CSV-formatted data might be much better (though that would make processing in awk/bash much harder).

You cannot satisfy all those requirements with a single format. Therefore I consider your argument to be completely wrong. The default output for uniq is to be easily readable by humans. That's a design choice. It's not a bug.

[1] Examples with bash:

[mosu@velvet ~]$ printf "moo\nmoo\ncow\n" | uniq -c | awk '{ sum += $1 } END { print sum }'
3
[mosu@velvet ~]$ printf "moo\nmoo\ncow\n" | uniq -c | ( while read line ; do set - $line ; echo $1 ; done )
2
1
[mosu@velvet ~]$

fix uniq -c

Posted Feb 17, 2025 18:46 UTC (Mon) by stijn (subscriber, #570)

I chose a very poor title ('fix uniq -c') for what I wrote, implying the presence of a bug. What I meant was

- current default behaviour of uniq -c is poor for composing.
- let's add an option so that we can have the behaviour that I like.

With this, we can have both a 'visually clear' format and a suitable-for-composing format. For compatibility the current format is of course the default in that scenario.

> For example, if you want to process it further via pipes then awk & bash don't care at all about the right-aligned numbers[1], whereas other programs might.

I work a lot with dataframes, which are essentially mysql tables in tab-separated format with column headers, or equivalently a single table in a spreadsheet, or the things you might want to read with Python pandas or in R. Tab-separated is preferred, as I've never encountered a need to escape embedded tab characters. In this wider ecosystem there is no automatic white-space scrubbing of data and there is a requirement that tables are well-formatted. Programs such as comm, join, datamash, shuf and a fair few more can be very handy in summarising, QC'ing or (even) manipulating this data. Hence I clamour for the ability (not necessarily as default) to have all tuple/table type data formatted as tab-separated tables, with or without column names. This should go well with unix composability of processes.
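To illustrate the kind of output being asked for, the tab-separated form can be approximated today with a small wrapper. This is only a sketch: `count_tsv` is a hypothetical helper name, not an existing uniq option, and it assumes the count is followed by exactly one space, which is how POSIX describes the `uniq -c` output format.

```shell
# Hypothetical helper (not part of uniq): emit count<TAB>line.
count_tsv() {
    tab=$(printf '\t')
    # Drop the leading padding, then replace only the single space
    # after the count, so that spaces inside the data are preserved.
    uniq -c | sed "s/^ *\([0-9][0-9]*\) /\1${tab}/"
}

# Example: the second value contains a space and survives intact.
printf 'moo\nmoo\nbrown cow\n' | count_tsv
```

The result is a well-formed two-column table that `cut -f`, `join -t` and similar tools can consume without any white-space scrubbing, at the cost of the extra process discussed earlier in the thread.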

fix uniq -c

Posted Feb 17, 2025 19:18 UTC (Mon) by mbunkus (subscriber, #87248)

Alright, I think I understand where you're coming from a lot better now. It wasn't just your title, though; in your first comment you wrote:

> It is quite puzzling that Richard Stallman let this program loose on the world as it violates usual Unix well-behavedness of textual interfaces.

And to that my argument was that first and foremost `uniq -c` was most likely designed to be easy to read by humans. By that metric it is very much well-behaved & doesn't violate anything. Furthermore, even with it being designed to be human-readable its output is actually useable as-is by a lot of other traditional Unix programs, making it arguably even less of a "wrong" that has to be "righted" (your choice of words, again from your first post). Apart from awk & bash which I mentioned earlier, "sort -n" works fine as well.
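A small sketch of the `sort -n` case, in the same style as the earlier examples. The input data is made up for illustration; the only behaviour relied on is that numeric sort skips leading blanks and that awk splits fields on runs of blanks.

```shell
# Count adjacent duplicates, then rank by count, highest first.
# The padded counts need no clean-up before `sort -rn` or awk.
top=$(printf 'cow\nmoo\nmoo\nmoo\nape\nape\n' | uniq -c | sort -rn | awk '{ print $2 }')
printf '%s\n' "$top"
```

This prints the values in descending order of frequency (moo, ape, cow), without a `sed` step in between.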

fix uniq -c

Posted Feb 17, 2025 20:15 UTC (Mon) by stijn (subscriber, #570)

I let a pointless/misdirected rhetorical flourish get the better of me again. I understand the history, and (recasting it from wrong/right) I view it as beneficial to strive towards making outputs (tuples/tables) composable, preferably by default, or at least to add options that do so without assumptions about white-space padding. The table format (mysql dump, spreadsheet, dataframe) is pervasive and super useful. In the right setup shell pipelines can do amazing things with them. Working in those types of table environments has made the various stripes of unix white-space padding and stripping jar with me.

fix uniq -c

Posted Feb 18, 2025 8:11 UTC (Tue) by mathstuf (subscriber, #69389)

I still think the current behavior is bad.

> • humans: we need data formatted so that it is visually very clear where columns start & end. We also prefer to be able to determine at a glance when a number is bigger than another number. This means that columns must be aligned in the first place in order to satisfy the first requirement, and for numbers right-aligning satisfies the second. We (humans) might even profit from table borders.

Then why is it a static number of columns wide? I know to make it actually the right width, the entire output needs to be known so that you can't output anything until the whole thing is read, but if human viewing is most important, why not buffer and Do It Right™? Either the output is small and quick enough to not really matter or it is so large that…what human is really going to be looking at it directly anyways?
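A rough sketch of the buffering idea, done as a wrapper around uniq rather than inside it. `uniq_c_fit` is a hypothetical name, and the approach (hold every row in awk arrays, then print once the widest count is known) is just one assumed way to do it; it trades streaming output for memory proportional to the number of distinct runs.

```shell
# Hypothetical wrapper: right-align counts to the widest count seen.
uniq_c_fit() {
    uniq -c | awk '
        {
            count[NR] = $1                  # first field is the count
            line = $0
            sub(/^ *[0-9]+ /, "", line)     # drop padding, count and one space
            text[NR] = line
            if (length($1) > width) width = length($1)
        }
        END {
            # Build the format only after all rows have been read.
            fmt = "%" width "s %s\n"
            for (i = 1; i <= NR; i++) printf fmt, count[i], text[i]
        }'
}

# Example: twelve "a" lines followed by one "b" line.
{ for i in 1 2 3 4 5 6 7 8 9 10 11 12; do echo a; done; echo b; } | uniq_c_fit
```

With this input the count column is exactly two characters wide, so the rows come out as `12 a` and ` 1 b` instead of being padded to a fixed width.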

> • other programs: here it depends on what the other program is & what it expects as input. For example, if you want to process it further via pipes then awk & bash don't care at all about the right-aligned numbers[1], whereas other programs might. If your goal isn't pipe-processing but e.g. copy-pasting into spreadsheets, then CSV-formatted data might be much better (though that would make processing in awk/bash much harder)

Sure, awk and bash do separator coalescing. But `cut` doesn't, so one needs to `sed` before `cut`, but not before `bash` and `awk`. Great. Yet another paper cut to remember in shell scripts. Given that `bash` and `awk` do support the mechanism that `cut` would understand and that the usefulness for humans is questionable…it really seems like an unnecessary quirk of a tool's output.
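A minimal sketch of that difference, using the same toy input as the earlier examples. The variable names are only for illustration, and the raw `cut` result depends on how much padding the local `uniq` emits, so it is printed but not relied upon.

```shell
input() { printf 'moo\nmoo\ncow\n'; }

# awk splits on runs of blanks, so leading padding is invisible to it.
with_awk=$(input | uniq -c | awk '{ print $1 }')

# cut treats every single space as a delimiter: with a padded count,
# field 1 is the empty string in front of the first space.
with_cut_raw=$(input | uniq -c | cut -d ' ' -f 1)

# Stripping the padding first makes cut see the count as field 1.
with_cut_stripped=$(input | uniq -c | sed 's/^ *//' | cut -d ' ' -f 1)

printf 'awk:          [%s]\n' "$with_awk"
printf 'cut (raw):    [%s]\n' "$with_cut_raw"
printf 'cut (sed\047d):  [%s]\n' "$with_cut_stripped"
```

Only the `cut` pipeline needs the extra `sed` step to agree with awk, which is the paper cut in question.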