A report from the documentation maintainer

Posted Nov 2, 2016 21:28 UTC (Wed) by farnz (subscriber, #17727)
In reply to: A report from the documentation maintainer by nybble41
Parent article: A report from the documentation maintainer

It's not too much to ask, but it's a hard problem. To take a couple of examples, using == for "insensitive comparison":

  • Should ß == SS? In a German filename, the answer is "yes", because ß is just a different way to write ss; in an English filename, however, it's a symbol, not a letter.
  • Should a == ä? In an English or Afrikaans filename, yes; the diacritic is a pronunciation guide only. In a Swedish filename, no, because ä is a different letter to a.
  • Should i == I? In most Latin alphabet languages, yes; lowercase i is dotted, uppercase I is not. In Turkish, however, i's uppercase form is İ, not I.
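To make the three cases concrete, here is a minimal sketch of what a locale-unaware implementation does with them. Python is used purely for illustration (nobody in this thread mentions it); its built-in `str` methods apply Unicode's default case mappings and know nothing about the language of the text.

```python
# Python's str methods use Unicode's default, locale-independent case mappings.
sharp_s_upper = "\u00df".upper()             # ß uppercases to the two letters "SS"
a_umlaut_folded = "\u00e4".casefold()        # ä is left alone: folding never strips diacritics
dotted_i_upper = "i".upper()                 # plain "I", never the Turkish İ (U+0130)
capital_dotted_i_lower = "\u0130".lower()    # İ becomes "i" plus a combining dot (U+0307)

print(sharp_s_upper, a_umlaut_folded == "a", dotted_i_upper, len(capital_dotted_i_lower))
```

None of these answers changes with the user's locale, which is exactly why they are right for some languages and wrong for others.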

Unicode makes a decent stab at a solution (section 5.18), but then explicitly calls out Lithuanian and Turkic languages as cases where the default algorithm will not include something that users expect it to include; further, the Unicode solution is based on the principle that it's better for the algorithm to match things it shouldn't than it is for it to miss things it should match. Thus, an English user will be surprised that the glob S* matches ß, but that's better than a German user being surprised when s* does not match ß. Similarly, a Swede is going to be surprised when na* matches nävi, specifically so that an English user isn't surprised when na* doesn't match nävi.
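The "match too much rather than too little" trade-off can be sketched as follows. This is a hypothetical loose matcher, not the algorithm from the Unicode standard: `loose_key` and `loose_glob` are names invented for this example, and stripping combining marks is an extra assumption layered on top of plain case folding.

```python
import fnmatch
import unicodedata


def loose_key(text: str) -> str:
    """Case-fold, decompose, and drop combining marks (hypothetical helper)."""
    decomposed = unicodedata.normalize("NFD", text.casefold())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def loose_glob(name: str, pattern: str) -> bool:
    """Glob match that errs on the side of matching too much."""
    return fnmatch.fnmatchcase(loose_key(name), loose_key(pattern))


# "\u00df" is ß and "n\u00e4vi" is nävi, written as escapes to avoid encoding ambiguity.
print(loose_glob("\u00df", "S*"))                # the surprise for an English user
print(loose_glob("n\u00e4vi", "na*"))            # the surprise for a Swedish user
print(fnmatch.fnmatchcase("n\u00e4vi", "na*"))   # strict matching misses it instead
```

The strict match in the last line is the conservative behaviour; which of the two is "correct" depends on who is typing the pattern.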



A report from the documentation maintainer

Posted Nov 2, 2016 22:44 UTC (Wed) by nybble41 (subscriber, #55106) [Link] (1 responses)

> ... the Unicode solution is based on the principle that it's better for the algorithm to match things it shouldn't, than it is for it to miss things it should match.

That's debatable, and it really depends on what the glob pattern is being used for. When deleting files, for example, it would generally be better to match conservatively so that you don't remove files which the user didn't expect to match. In most cases it's much easier to clean up any files which were missed than it is to restore ones which were unexpectedly removed. The same would apply to any operation which modified files in place (e.g. perl -i).

Personally, I just set LC_COLLATE=C and compare all strings bytewise--which does not preclude the use of Unicode filenames. I find this far less surprising than any of the locale-specific options, and think it would be a safer default, especially for scripts. Interactively, absent some indication of the user's intent, perhaps the shell should evaluate the glob pattern both ways and generate an error if the results do not agree.
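As a rough illustration of what bytewise comparison means for Unicode filenames, the sketch below sorts a few names by their UTF-8 bytes. It is written in Python as an approximation of the idea, not a reproduction of what any particular shell does under LC_COLLATE=C; `bytewise_key` and the sample names are made up for the example.

```python
def bytewise_key(name: str) -> bytes:
    """Sort key that compares names by their raw UTF-8 bytes (hypothetical helper)."""
    return name.encode("utf-8")


# "\u00e4rger" is ärger, written as an escape to avoid encoding ambiguity.
names = ["apple", "Zebra", "\u00e4rger", "Apple"]
ordered = sorted(names, key=bytewise_key)
print(ordered)
```

All ASCII uppercase letters sort before all ASCII lowercase letters, and anything non-ASCII sorts after both. That is unfriendly to a human reader, but the result is the same in every locale.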

A report from the documentation maintainer

Posted Nov 2, 2016 23:13 UTC (Wed) by mstone_ (subscriber, #66309) [Link]

The solution here is simply not to blindly delete files using glob patterns. If you need this for some reason you'd darn well better set your locale to C, but you should probably just come up with a safer solution.

A report from the documentation maintainer

Posted Nov 2, 2016 22:54 UTC (Wed) by nybble41 (subscriber, #55106) [Link] (1 responses)

> It's not too much to ask, but it's a hard problem.

As an addendum to my previous comment, the examples given here support the argument that *case-insensitive* comparison presents difficulties above and beyond case-sensitive comparison of Unicode strings. This would make traditional case-sensitive glob matching and sorting easier to implement than the current case-insensitive behavior. "Should ß == SS?" No, the first is lowercase and the second is uppercase. (Of course, you still have to deal with the question of whether ß == ss.)

A report from the documentation maintainer

Posted Nov 3, 2016 9:35 UTC (Thu) by farnz (subscriber, #17727) [Link]

The other trouble is that any human-friendly comparison function presents difficulties for the machine - too much of what's "correct" depends on your cultural norms. For example, I've known native Arabic speakers who simply could not get their heads around the idea that English has 5 different short vowel sounds, and that the difference between "nit" and "net" is significant (because the short vowel modifies the preceding consonant) - to their eyes, both were reasonable ways to write what in Arabic would be nt (as Arabic writing does not include short vowel sounds by default).

The brains of the squishy bag of meat in control of the computer are weird places - while I don't think the difficulty is insurmountable, I don't think it's a trivial problem to solve, either.

A report from the documentation maintainer

Posted Nov 3, 2016 22:12 UTC (Thu) by tao (subscriber, #17563) [Link]

Swedish standardised sorting actually treats V = W, which annoys the hell out of me, seeing as my last name starts with a W :)

A report from the documentation maintainer

Posted Nov 4, 2016 10:10 UTC (Fri) by tdz (subscriber, #58733) [Link] (14 responses)

> Should ß == SS? In a German filename, the answer is "yes", because ß is just a different way to
> write ss;

Actually no, because the result would be pronounced differently, or even be a different word. Double-s is just the workaround for scripts that don't have 'ß'.

A report from the documentation maintainer

Posted Nov 4, 2016 10:14 UTC (Fri) by farnz (subscriber, #17727) [Link] (13 responses)

So what is the correct capitalisation of "groß"? I can't find a capital eszet in normal use, and the beginner's guide I'm consulting says that "groß" would capitalise as "GROSS". For that to hold true for a case-insensitive comparison, the filename "groß ding" has to compare equal to the filename "GROSS DING", otherwise the comparison is not fully case-insensitive; and on the case-insensitive locale-aware OSes I've tried, that indeed holds true.

A report from the documentation maintainer

Posted Nov 4, 2016 10:25 UTC (Fri) by tao (subscriber, #17563) [Link] (4 responses)

You mean "ẞ" U+1E9E LATIN CAPITAL LETTER SHARP S?

A report from the documentation maintainer

Posted Nov 4, 2016 11:23 UTC (Fri) by idrys (subscriber, #4347) [Link]

I have not seen much (if anything at all) of the capital ß in the wild - it's mostly capitalized as the previous poster noted.

(With all the fun of 'in Maßen' (i.e. not too much) vs. 'in Massen' (a lot) - although the reforms inflicted upon German writing in the last years mean that most people are confused anyway...)

A report from the documentation maintainer

Posted Nov 4, 2016 11:42 UTC (Fri) by farnz (subscriber, #17727) [Link] (2 responses)

It exists, yes, but all the references I can find (as a beginner in the language, not a native speaker nor resident in a German-speaking country) tell me that the use of U+1E9E is for things like advertising, where you're doing the "shouting caps" emphasis trick - thus, where an English advert might say "BIG..." the German equivalent advert would use "GROẞ" to indicate that the caps are not "real" capitals, they're shouting.

In normal writing, though, everything seems to use SS as the capital form of ß.

A report from the documentation maintainer

Posted Nov 7, 2016 18:16 UTC (Mon) by JanC_ (guest, #34940) [Link] (1 responses)

That's at least partially also because “ẞ” isn't on German typewriters, (traditional) computer keyboards, etc.

A report from the documentation maintainer

Posted Nov 8, 2016 9:40 UTC (Tue) by anselm (subscriber, #2796) [Link]

The codepoint for the uppercase “ß” was added to Unicode as a sort of precaution and to make life easier for people implementing case conversion routines. In Germany, in spite of this, uppercase “ß” isn't actually being used in practice, and in fact there is no official agreement or rule as to what the glyph should even look like (although people keep throwing around “ẞ” as if that was some sort of gospel). It would be reasonable to render U+1E9E as “SS” except for the ambiguity with “in Maßen/Massen”, or as “ß” (i.e., exactly like the lowercase “ß”) except for the bad typography of having a lowercase letter in the middle of a bunch of uppercase ones.

Generally, according to the new German orthography rules we're now supposed to write “ß” after a long vowel (or diphthong) and “ss” after a short vowel. This is not the worst part of the orthography reform but the uppercase-“ß” issue remains largely unresolved.

A report from the documentation maintainer

Posted Nov 7, 2016 9:11 UTC (Mon) by tdz (subscriber, #58733) [Link] (7 responses)

Hi,

as the other commenter points out, there exists a symbol and a code to display it. But even though I'm a native speaker, this discussion is actually the first place I've ever seen it. :D

You can use double-s in place of ß in capitalization and that's what people usually do. I've also seen ß not being capitalized; GROß in your example. ß won't ever appear at the beginning of words, so there is no 'natural' use case for capital-ß. I'm not sure if many people are aware that it exists.

A report from the documentation maintainer

Posted Nov 7, 2016 10:08 UTC (Mon) by farnz (subscriber, #17727) [Link] (6 responses)

Aha, someone who can actually answer the underlying question for me!

Would you expect a case-insensitive equality operator to have "groß" == "gross" == "GROSS" == "GROß" == "GROẞ" (which the case-insensitive OS I've played with chooses to do in a German locale)?

Put differently, would you expect that if you searched for "groß" in a text document, you would not find matches for "GROSS" but would for "GROß"? Equally, if you searched a text document for "GROSS", would you expect to see matches for "groß", or only for "gross" in a case-insensitive search?

A report from the documentation maintainer

Posted Nov 7, 2016 10:27 UTC (Mon) by johill (subscriber, #25196) [Link] (4 responses)

I'm in the same situation as tdz, being a native German speaker and never having seen the ẞ (upper-case) before. I actually appreciate if ss ends up being equivalent to ß in all cases, for multiple reasons:
  1. sometimes I don't have German keyboard settings available immediately, making it awkward to enter ß
  2. a document may use old or new orthography, so words like "Fluss" (river; this is the currently correct spelling) may be spelled as "Fluß" (old spelling)
  3. when spelled in headings/etc., "SS" will frequently be used to replace "ß"
So I'd argue that treating things as in your example ("groß" == "gross" == "GROSS" == "GROß" == "GROẞ") is helpful.
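For what it's worth, the chain of equalities in that example is what Unicode's default full case folding produces. A minimal sketch, using Python's `str.casefold` as one implementation of that folding (the choice of Python is an assumption for illustration, not something the thread relies on):

```python
# The five spellings from the example; \u00df is ß and \u1e9e is the capital ẞ.
spellings = ["gro\u00df", "gross", "GROSS", "GRO\u00df", "GRO\u1e9e"]
folded = {s.casefold() for s in spellings}
print(folded)

# The same folding also merges two different words.
same_word = "Ma\u00dfen".casefold() == "Massen".casefold()
print(same_word)
```

All five spellings collapse to a single folded form, which covers the keyboard, orthography, and heading cases listed above. The cost is that "Maßen" and "Massen" collapse to one form as well.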

A report from the documentation maintainer

Posted Nov 7, 2016 11:20 UTC (Mon) by idrys (subscriber, #4347) [Link] (3 responses)

(native speaker here as well)

While this matching helps with words that can simply be written in two ways, I'd be rather surprised to get a match for a different word (like in Maßen vs. in Massen). And I think the new orthography is, for the most part, horrible (it emphasized writing over reading while neglecting that you know what you're writing but your reader doesn't). But adherents of the old orthography will die out over time anyway :/

A report from the documentation maintainer

Posted Nov 7, 2016 13:45 UTC (Mon) by farnz (subscriber, #17727) [Link] (2 responses)

Hmm. Are you saying that, when doing a case-insensitive match, you'd really want the computer to be aware of the intended dictionary word? So that a search for "groß" matches all of "GROSS", "groß" and "gross" (leaving your human intelligence to determine which ones are "good" matches), while a search for "maßen" should match "Maßen" but not "MASSEN" or "Massen", because "Maßen" and "Massen" are different words? Or is there an underlying rule that I'm not seeing (something like "Maßen" should match "MASSEN" as an all-caps Maßen and "maßen" as missing the initial capital, but not "Massen" or "massen" because the casing rules let you see that ss was deliberate, not the result of round-tripping through upper case back to lower case)?

A report from the documentation maintainer

Posted Nov 7, 2016 14:30 UTC (Mon) by idrys (subscriber, #4347) [Link] (1 responses)

I'd prefer to not match eszet vs. double-s at all, generally. I understand and to a degree follow the reasoning, but I think it would cause more confusion than not. (And your example neatly illustrates this; too much side-knowledge required.)

I _could_ imagine an exception for eszet vs. upper-case double-s, but I'd be surprised if 'grep -i maßen' would find MASSEN as well... (And what about 'SZ' as a capitalization for 'ß'? It is now extremely uncommon, but I've seen this in documents up to the mid-20th century.)

[As an aside, old documents are sometimes inconsistent for eszet vs. double-s in people's names as well, as they sometimes capitalized names and sometimes not, so this is not a new issue. We are not 100% sure what the family name on my mother's side is for that reason. Oh well...]

A report from the documentation maintainer

Posted Nov 7, 2016 16:52 UTC (Mon) by mathstuf (subscriber, #69389) [Link]

FYI, there also exist ligature codepoints like `ﬁ` (U+FB01) that would need to be split apart when uppercased.
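A small sketch of that splitting, again using Python's default Unicode case mappings purely as an illustration:

```python
import unicodedata

ligature = "\ufb01"                      # LATIN SMALL LIGATURE FI, one code point
upper = ligature.upper()                 # becomes two separate letters
folded = ligature.casefold()             # case folding splits it as well
compat = unicodedata.normalize("NFKC", ligature)  # so does compatibility normalization

print(len(ligature), upper, folded, compat)
```

As with ß, the string changes length under case conversion, so any code that assumes one character in gives one character out will be wrong here too.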

A report from the documentation maintainer

Posted Nov 10, 2016 14:39 UTC (Thu) by tdz (subscriber, #58733) [Link]

These are all different words, so they should probably not compare equal by default. Having the option of treating ß and ss the same could be useful, though. OTOH I never had this problem in practice.

In English, people sometimes (frequently?) confuse "its" and "it's". Treating them the same in text searches seems a comparable use case.

I thought about your question about ß in capital-letter advertising messages, but I can't remember having seen that anywhere. I could imagine that advertisers avoid using ß and ss in capital letters, because it doesn't look good either way.

