Reasons for speedup?

Posted Feb 12, 2025 19:49 UTC (Wed) by excors (subscriber, #95769)
In reply to: Reasons for speedup? by jreiser
Parent article: Rewriting essential Linux packages in Rust

In my crude testing, uutils is the same speed as GNU coreutils with LANG=C, but coreutils with LANG=C.UTF-8 (the default in my environment) is several times slower.

With LANG=C.UTF-8, coreutils spends most of its time in strcoll_l, and it sorts by what I presume is some Unicode collation algorithm.

As far as I can see, uutils has no locale support. It aborts if the input is not valid UTF-8 ("sort: invalid utf-8 sequence of 1 bytes from index 0"). It simply sorts by byte values (equivalent to sorting by codepoint), regardless of LANG.

So in this case it's only faster because it doesn't implement Unicode collation.
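To illustrate the "byte values are equivalent to codepoints" point above, here is a minimal Python sketch. It is not uutils code; the word list is made up for the demonstration. It relies only on the fact that UTF-8 preserves code point order when the encoded bytes are compared, and that Python compares `str` values by code point.

```python
# Hypothetical sample data, chosen to mix ASCII, Latin-1 range, CJK and astral characters.
words = ["zebra", "Apple", "éclair", "apple", " lead", "日本", "𝄞 clef"]

# Sort by the raw UTF-8 bytes, as a locale-unaware sort would.
by_bytes = sorted(words, key=lambda s: s.encode("utf-8"))

# Python compares str values code point by code point.
by_codepoint = sorted(words)

# UTF-8 is designed so that these two orders agree.
assert by_bytes == by_codepoint

for w in by_bytes:
    print(repr(w))
```

Note that in this order the leading space and the capital letter both matter: " lead" sorts first and "Apple" comes before "apple", which is the behaviour one gets from LANG=C.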

Reasons for speedup?

Posted Feb 12, 2025 20:08 UTC (Wed) by excors (subscriber, #95769)

(I should probably add that in practice I never want `sort` to do Unicode collation. On systems which default to e.g. LANG=en_US.UTF-8 I find it actively annoying: it ignores capitalisation and leading whitespace, and I definitely don't want that, so I have to remember to add LANG=C to every invocation. If I wanted to do proper Unicode-aware text processing, I'd do it in a real programming language where I can configure it fully, not in locale-dependent shell scripts, so I don't mind that uutils is missing that feature. On the other hand, I do sometimes want to run `sort | uniq -c | sort -n` over non-UTF-8 files (e.g. looking for common patterns in binary data), so I do mind that this isn't supported.)
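For what it's worth, the `sort | uniq -c | sort -n` pattern over arbitrary bytes can be roughly approximated in Python as below. This is only a sketch under assumptions: `count_lines` and the sample data are invented for illustration, lines are split on `\n` only, and ties are broken by byte value, which is not necessarily how `sort -n` orders equal counts.

```python
from collections import Counter

def count_lines(data: bytes) -> list[tuple[int, bytes]]:
    """Return (count, line) pairs in ascending count order, treating lines as raw bytes."""
    lines = data.split(b"\n")
    if lines and lines[-1] == b"":
        lines.pop()  # a trailing newline does not start another line
    counts = Counter(lines)
    # Sorting the (count, line) tuples orders by count, then by byte value.
    return sorted((n, line) for line, n in counts.items())

# Hypothetical input containing bytes that are not valid UTF-8.
sample = b"\xff\xfe\n\x00abc\n\xff\xfe\nabc\n\xff\xfe\n"
result = count_lines(sample)

for n, line in result:
    print(n, line)
```

Because nothing here decodes the input, invalid UTF-8 is not an error; the lines are just opaque byte strings.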

Reasons for speedup?

Posted Feb 18, 2025 3:10 UTC (Tue) by ehiggs (subscriber, #90713)

> (I should probably add, in practice I never want `sort` to do Unicode collation. On systems which default to e.g. LANG=en_US.UTF-8 I find it actively annoying

Regardless, it should do Unicode canonicalization, or it will mis-sort depending on how different runes are composed. This is fine for diacritic-free languages like English, but as soon as you get some diacritics, LANG=C's naïve handling of text breaks, in my experience.
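A small Python sketch of the composition problem described above. The strings are made up for the demonstration, and NFC is just one possible choice of canonical form, not a statement about what any `sort` implementation does.

```python
import unicodedata

composed = "\u00e9t\u00e9"        # "été" using the precomposed U+00E9
decomposed = "e\u0301te\u0301"    # "été" using "e" + combining acute U+0301

# The two strings render identically but are different code point sequences.
assert composed != decomposed

words = [composed, "zoo", decomposed]

# A plain code point (or byte) sort puts "zoo" between the two spellings.
naive = sorted(words)

# Normalizing to NFC first makes both spellings identical, so they sort together.
normalized = sorted(unicodedata.normalize("NFC", w) for w in words)

print([w.encode("utf-8") for w in naive])
print([w.encode("utf-8") for w in normalized])
```

Without normalization, a following `uniq` would also count the two spellings as different lines.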