A report from the documentation maintainer

Posted Nov 2, 2016 22:22 UTC (Wed) by nybble41 (subscriber, #55106)
In reply to: A report from the documentation maintainer by bronson
Parent article: A report from the documentation maintainer

Obviously there are pros and cons to each collation sequence, and ideally the shell would be configurable to use either ABCabc or AaBbCc. What I have yet to see is any good reason why the collation sequence should change just because you're working in Unicode rather than ASCII.
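To make the two orderings concrete, here is a minimal Python sketch. The sample names are made up for illustration, and this is not how any particular shell implements collation: plain string comparison stands in for the ABCabc order, and a case-folding sort key stands in for AaBbCc.

```python
# Hypothetical single-letter file names, chosen only to show the two orders.
names = ["b", "C", "a", "B", "c", "A"]

# "ABCabc": plain code point order (byte order for ASCII) puts uppercase first.
uppercase_first = sorted(names)

# "AaBbCc": compare case-insensitively, then break ties with uppercase first.
interleaved = sorted(names, key=lambda s: (s.lower(), s))

print("".join(uppercase_first))  # ABCabc
print("".join(interleaved))      # AaBbCc
```

Both results are deterministic; the only question is which key the tool picks by default.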


A report from the documentation maintainer

Posted Nov 2, 2016 23:25 UTC (Wed) by mstone_ (subscriber, #66309) [Link] (11 responses)

It doesn't change with unicode vs ASCII, it changes with natural sorting vs byte sorting. You can set a locale which is not unicode but which does reflect long established rules of dictionary sorting. ASCII byte sorting is a bug. It leads to behavior which is simply incorrect according to various languages' conventions for ordering text. It was introduced because it was feasible on the computers of the 1960s, not because it was correct. In recognition of the fact that some people wrote code which was bug-dependent, the C locale was introduced to lock everything to 1980. Some people cling to that as the pinnacle of human achievement and become increasingly angry as the world moves on and fewer and fewer people see things the same way. (Literally: their program output is not the same.) Be that as it may, most of the world (especially the vast majority of people who couldn't write their own name with plain ASCII and the shortcut byte-ordering it facilitated) isn't going to look back.

A report from the documentation maintainer

Posted Nov 3, 2016 15:45 UTC (Thu) by nybble41 (subscriber, #55106) [Link] (10 responses)

> It doesn't change with unicode vs ASCII, it changes with natural sorting vs byte sorting.

Exactly as I've been saying. And yet, enabling Unicode *character encoding* in your environment somehow implies to the shell that you want so-called "natural" sorting rather than traditional byte sorting, unless you take further steps to override LC_COLLATE. These are unrelated topics and there is no good reason for a change in character encoding to affect the sort order, or vice-versa.

There is nothing "unnatural" about sorting uppercase characters first to match the traditional behavior of the shell. This is no different from the common option to sort directories before regular files. For that matter there is nothing "unnatural" about sorting filenames as the byte strings they actually are, regardless of encoding, rather than by their locale-specific decoding into text. Only the byte-strings are guaranteed to be uniform system-wide; the text presented to the user depends on the current environment.

> especially the vast majority of people who couldn't write their own name with plain ASCII and the shortcut byte-ordering it facilitated

Again, separate issues. Nothing about byte-order sorting precludes the use of Unicode character encoding in filenames.

Anyway, this is getting rather far off-topic. The subject was glob patterns and the dubious wisdom of matching characters not specified by the user. If the user writes the glob [A-Z] that should be read as "the uppercase letters from A to Z", not "all letters except lowercase z" (using a range of letters in "natural" sort order) or "all letters" (using case-insensitive comparison). A user who went to the trouble of writing [A-Z] in uppercase obviously didn't intend to match [a-z]. Sort order is, for the most part, a matter of preference and not correctness; redefining the filenames which match a given glob pattern carries a much larger risk of data-loss.
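The difference between the two readings of [A-Z] can be sketched in Python. This is an illustration only: `DICT_ORDER` is a hypothetical "A a B b ... Z z" collation sequence written out by hand, not the table of any real locale, and the helper names are invented for the example.

```python
import string

# Hypothetical dictionary-style collation sequence: A a B b ... Z z
DICT_ORDER = "".join(
    upper + lower
    for upper, lower in zip(string.ascii_uppercase, string.ascii_lowercase)
)

def in_range_codepoint(ch, lo, hi):
    """Range test using code point order, as in the C locale."""
    return lo <= ch <= hi

def in_range_dict(ch, lo, hi):
    """Range test using positions in the hypothetical dictionary order."""
    rank = DICT_ORDER.index
    return rank(lo) <= rank(ch) <= rank(hi)

letters = string.ascii_letters
codepoint_matches = [c for c in letters if in_range_codepoint(c, "A", "Z")]
dict_matches = [c for c in letters if in_range_dict(c, "A", "Z")]

print(len(codepoint_matches))                     # 26
print(sorted(set(letters) - set(dict_matches)))   # ['z']
```

Under the dictionary-style sequence the same pattern matches 51 of the 52 ASCII letters, which is the "all letters except lowercase z" case.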

As for the earlier objection that only a programmer would view things this way... the shell is a programming language! Anyone writing commands at the shell prompt is a programmer. If you want an interface designed for non-programmers, there are any number of GUI file managers and program launchers available.

A report from the documentation maintainer

Posted Nov 3, 2016 15:56 UTC (Thu) by farnz (subscriber, #17727) [Link] (3 responses)

But "the uppercase letters from A to Z" is locale-dependent. For a Pole, that includes Ó. For a Swede, it does not.

Basically, the moment you go beyond the user specifying all characters that they're interested in, you're making locale assumptions; whether that be because you're guessing at case, or because you're looking at a range and the user expects the range to include the characters they would include in that range.
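A rough Python sketch of that point, with the two alphabets written out by hand as an assumption for illustration. Real locale collation tables are more elaborate (accent handling, tie-breaking levels), so this only shows that "between A and Z" depends on which alphabet you ask.

```python
# Alphabets written out by hand for illustration; real locale collation
# tables are more elaborate than a flat list of letters.
POLISH = "AĄBCĆDEĘFGHIJKLŁMNŃOÓPRSŚTUWYZŹŻ"
SWEDISH = "ABCDEFGHIJKLMNOPQRSTUVWXYZÅÄÖ"

def in_alphabet_range(ch, lo, hi, alphabet):
    """True if ch is a letter of `alphabet` lying between lo and hi inclusive."""
    if ch not in alphabet:
        return False
    return alphabet.index(lo) <= alphabet.index(ch) <= alphabet.index(hi)

print(in_alphabet_range("Ó", "A", "Z", POLISH))   # True
print(in_alphabet_range("Ó", "A", "Z", SWEDISH))  # False
print(in_alphabet_range("Ż", "A", "Z", POLISH))   # False: Ż comes after Z
```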

A report from the documentation maintainer

Posted Nov 4, 2016 0:09 UTC (Fri) by lsl (guest, #86508) [Link] (2 responses)

There's always Unicode code point order, where A-Z is well-defined without respect to locale. UTF-8 has the nice property of sorting that way with a simple strcmp-based comparator.
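A quick way to check that property, sketched in Python with made-up sample strings. Python compares `str` values code point by code point, so sorting on the UTF-8 encoding should give the same result.

```python
# Sample strings mixing ASCII, accented Latin, CJK, and a non-BMP emoji.
words = ["Zebra", "apple", "Ólafur", "éclair", "日本", "😀", "Noël"]

# Python compares str values code point by code point.
by_code_point = sorted(words)

# Comparing the UTF-8 encodings byte by byte is what a strcmp-style
# comparator does (for strings without embedded NUL bytes).
by_utf8_bytes = sorted(words, key=lambda s: s.encode("utf-8"))

print(by_code_point == by_utf8_bytes)  # True
```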

Sure, it isn't the order you'd see in a good old telephone book, but at least it's simple and predictable.

A report from the documentation maintainer

Posted Nov 4, 2016 9:55 UTC (Fri) by farnz (subscriber, #17727) [Link] (1 responses)

That's the problem, though - there are several non-ambiguous but arbitrary orderings (Polish alphabet, Swedish alphabet, English alphabet, French alphabet, Unicode code point etc). The machine can't (definitionally, as they conflict) give you all reasonable orderings at once; historically (1970s and 1980s) we handled this by saying that the ordering used in English is the One True Ordering, and anyone who thinks in another language can learn the One True Language. Modern machines can do better (and should, IMO).

A report from the documentation maintainer

Posted Nov 5, 2016 1:19 UTC (Sat) by lsl (guest, #86508) [Link]

> Modern machines can do better (and should, IMO).

I'm not convinced. Software is buggy and crappy enough already even without supporting a thousand different ways to sort a directory listing. The machines can do better, but programmers apparently can't.

A report from the documentation maintainer

Posted Nov 3, 2016 16:22 UTC (Thu) by mstone_ (subscriber, #66309) [Link] (5 responses)

You keep saying things that don't make sense. Using unicode character encoding doesn't imply anything. Yes they are unrelated topics, but you keep conflating them for some reason. The implication comes not from the character encoding, but from the user specifying the language that they want to use, including that language's sorting rules. What you seem to want is for the natural language sorting rules to be arbitrarily ignored (always?) because you want to make things easier for the programmer in one particular case. (Similarly, some people want program output to be exactly as it was in 1980 rather than translated for the user because it makes parsing the output easier--another reason we have the C locale.) I guess you haven't had users complain when things are byte order sorted and they want to see things sorted in a way that makes sense for humans instead. (This is quite a common complaint, and lazy programmers often brush it off as unimportant even though it's quite important to the users.)

At any rate, the shell isn't a programming language, it's a user interface. Yes, it tries to be both, and that's why it's both a lousy language and a lousy interface. It has knobs you can fiddle to make it do whatever it is you want. The fact that you have to fiddle with a bunch of knobs is part of why it's a lousy language.

If you don't like this, it's more productive to choose a different language than to rail against reality. Specifically you need something with better defined semantics for pattern matching, probably something with the knobs in the API call itself rather than a bunch of environment variables which alter the semantics in unexpected and non-obvious ways. Another option is to just force the locale to C at the top of your script and explain to any users that you just can't be bothered to care about their language preference.

A report from the documentation maintainer

Posted Nov 3, 2016 19:07 UTC (Thu) by nybble41 (subscriber, #55106) [Link] (4 responses)

> The implication comes not from the character encoding, but from the user specifying the language that they want to use, including that language's sorting rules.

The instructions provided for setting the character encoding, and the defaults for systems configured to use UTF-8, always seem to include the language as well as the character encoding. This bundling of language and encoding and collation and other preferences into a single global setting is the root of the problem. The same environment variable controls both, unless you set yet another variable to selectively override the sorting rules. It doesn't make sense to use the same settings everywhere, and pattern-matching in the shell programming language is one of those areas where a dependency on the current locale makes no sense.

The fact that the user wants to see messages in English (or whatever language) and use UTF-8 for character encoding should not be taken to imply that they want to change the way the shell expands glob patterns.

As I said before, collation order for user-visible output is not really the point. Sometimes the correctness of a script does depend on the sort order internally (such as *.d directories), but typically the files involved are defined to start with ASCII digits and case issues consequently do not apply (given reasonable locale definitions). Personally I would be content with an option to sort by a file type, case, filename triplet which otherwise followed the locale-specific rules. Absent that option, byte-order sorting with LC_COLLATE=C gives the desired behavior in 99.9% of the cases I am ever likely to encounter. YMMV.

> At any rate, the shell isn't a programming language, it's a user interface.

Shell is a user interface in the same sense as any programming language: a user interface designed for use by programmers. It's ridiculous to claim that it isn't a programming language, since a significant fraction of the programs on most Unix systems are written in it. By percentage of commands executed it's a programming language first, with interactive use as a distant second.

> If you don't like this, it's more productive to choose a different language than to rail against reality. Specifically you need something with better defined semantics for pattern matching, probably something with the knobs in the API call itself rather than a bunch of environment variables which alter the semantics in unexpected and non-obvious ways.

What do you think this thread was about? This "better language" you suggest is shell, as it existed before glob patterns became locale-aware and thus context-dependent and dangerously unpredictable. Now the only safe way to write a Bash shell script with glob patterns is by setting the globasciiranges option within the script. For system scripts which can't be fixed the only option is to force LC_COLLATE=C.

A report from the documentation maintainer

Posted Nov 3, 2016 20:05 UTC (Thu) by mstone_ (subscriber, #66309) [Link] (3 responses)

> The instructions provided for setting the character encoding, and the defaults for systems configured to use UTF-8, always seem to include the language as well as the character encoding.

You continue to go on about character encoding, which has nothing to do with this. I don't fully understand why you keep bringing it up. If you byte sort purely English language text using UTF-8 encoding it's exactly the same as byte sorting purely English text using ASCII encoding. But real people don't want their output byte sorted, they want it sorted in a way that's sensible to them.
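The claim about ASCII-only text is easy to check with a small Python sketch (the word list is an arbitrary example): for text that fits in ASCII, the ASCII and UTF-8 encodings are byte-for-byte identical, so byte sorting cannot differ.

```python
# Arbitrary ASCII-only sample words.
words = ["banana", "Apple", "cherry", "apple", "Banana"]

# ASCII-only text encodes to the same bytes in ASCII and in UTF-8.
same_bytes = all(w.encode("ascii") == w.encode("utf-8") for w in words)

ascii_sorted = sorted(words, key=lambda s: s.encode("ascii"))
utf8_sorted = sorted(words, key=lambda s: s.encode("utf-8"))

print(same_bytes)                   # True
print(ascii_sorted == utf8_sorted)  # True
print(utf8_sorted)  # ['Apple', 'Banana', 'apple', 'banana', 'cherry']
```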

> The fact that the user wants to see messages in English (or whatever language) and use UTF-8 for character encoding should not be taken to imply that they want to change the way the shell expands glob patterns.

Clearly it should, that's the reality. You want a different reality than the one which we inhabit. It's also a good thing that people who use languages other than English can use character classes which include characters from their language, which they couldn't do in the 1980s. And if you still want 1980s semantics, tada, they're still there by turning a knob.

> Shell is a user interface in the same sense as any programming language: a user interface designed for use by programmers. It's ridiculous to claim that it isn't a programming language, since a significant fraction of the programs on most Unix systems are written in it. By percentage of commands executed it's a programming language first, with interactive use as a distant second.

Most system commands are now written in other, better languages. The majority of people using a unix command line today interact via the shell but don't program using it. That's for the best. In my bin & sbin right now I have:
1236 compiled
238 shell
163 perl
83 bash
71 python
11 ruby
1 tcl
1 php

Looking at the shell scripts they're mostly either trivial, or at least 15 years old. 100 of them are less than 50 lines. A couple are cheats, only using the shell to launch some other interpreter. As far as I can tell from a quick glance, none of them use the critical "remove-files-with-a-glob-character-class" capability you're so stuck on. The longest are 1-2k LOC and date from the early to mid 90s. So yeah, some people program in it, mostly trivial scripts which avoid dealing with corner cases, and mostly as a legacy capability. I stand by my assertion that the majority of users' interaction with the shell is as an interface with some automation capabilities. I suspect that was always true, and that the number of people who copied around cool snippets of .profile was much higher than the number of people who actually created them (that's certainly my recollection of the CS lab).
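For anyone wanting to run a similar tally, here is one possible approach sketched in Python. It is not the method used for the counts above, which the comment does not describe. The function names `classify` and `tally` are invented for this example, and labelling every file without a `#!` line as "compiled" is a simplifying assumption.

```python
import os
from collections import Counter

def classify(path):
    """Return the interpreter named on the shebang line, or 'compiled' if none."""
    with open(path, "rb") as f:
        first = f.readline(256)
    if not first.startswith(b"#!"):
        return "compiled"  # assumption: no shebang means a binary
    parts = first[2:].split()
    if not parts:
        return "unknown"
    name = os.path.basename(parts[0]).decode("ascii", "replace")
    # "#!/usr/bin/env python3" names the real interpreter in the next word.
    if name == "env" and len(parts) > 1:
        name = os.path.basename(parts[1]).decode("ascii", "replace")
    return name

def tally(directories):
    """Count files per interpreter across the given directories."""
    counts = Counter()
    for directory in directories:
        if not os.path.isdir(directory):
            continue
        for entry in os.scandir(directory):
            if entry.is_file():
                try:
                    counts[classify(entry.path)] += 1
                except OSError:
                    pass  # unreadable file: skip it
    return counts

if __name__ == "__main__":
    # The directory list is an assumption; adjust for the system at hand.
    for name, count in tally(["/bin", "/sbin", "/usr/bin", "/usr/sbin"]).most_common():
        print(count, name)
```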

> What do you think this thread was about? This "better language" you suggest is shell, as it existed before glob patterns became locale-aware and thus context-dependent and dangerously unpredictable.

I have no idea what this thread is about at this point. You keep bringing up character encodings for some reason, and asserting that the shell is something other than what it actually is. I started the thread with the assertion that upper casing filenames to make them come first is something that's pretty irrelevant to anyone except a few wild-eyed people who haven't kept up with the reality of modern systems and then you proceeded to demonstrate my point. Yeah, some tricks that worked 30 years ago don't work anymore (like uppercasing filenames to make them come first). New things have come along that most people are pretty happy about (I'm personally happy taking advantage of the fact that I can run interpreters and libraries that aren't constrained by 1970s or 80s era hardware limitations). You can either continue to complain that things changed or acknowledge that today's user has a different experience from one 30 years ago and do things in a way that makes sense in a modern context.

A report from the documentation maintainer

Posted Nov 3, 2016 22:37 UTC (Thu) by bronson (subscriber, #4806) [Link]

> and do things in a way that makes sense in a modern context.

Where "modern context" refers to the centuries-old practice of case-insensitive collation. :)

Did case-sensitive collation even exist before punchcards?

A report from the documentation maintainer

Posted Nov 14, 2016 18:27 UTC (Mon) by Wol (subscriber, #4433) [Link] (1 responses)

> If you byte sort purely English language text using UTF-8 encoding it's exactly the same as byte sorting purely English text using ASCII encoding.

Except that if you try that with the purely ENGLISH word "Noël", you can't sort that in ASCII :-)

Cheers,
Wol

A report from the documentation maintainer

Posted Nov 14, 2016 20:23 UTC (Mon) by bronson (subscriber, #4806) [Link]

You can absolutely byte sort it.

If you're saying that's inadequate for most people then, agreed, but that's been stated on this thread a few times already...
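Both halves of this exchange can be shown in a few lines of Python. The comparison words "Noel" and "Nozzle" are added purely for illustration: "Noël" has no ASCII encoding, yet its UTF-8 bytes sort without trouble, just not where a dictionary would put the word.

```python
words = ["Nozzle", "Noël", "Noel"]

# "Noël" cannot be represented in ASCII at all.
try:
    "Noël".encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Byte sorting the UTF-8 encoding works, but "ë" encodes to bytes 0xC3 0xAB,
# which compare greater than any ASCII letter, so "Noël" lands after "Nozzle".
byte_sorted = sorted(words, key=lambda s: s.encode("utf-8"))

print(ascii_ok)     # False
print(byte_sorted)  # ['Noel', 'Nozzle', 'Noël']
```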


Copyright © 2026, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds