Git v2.24.1 and others

Posted Dec 11, 2019 17:14 UTC (Wed) by nix (subscriber, #2304)
In reply to: Git v2.24.1 and others by Vorpal
Parent article: Git v2.24.1 and others

And, of course, Unicode lookalikes are still possible:

nix@loom 1179 ~/oracle/tmp/luci% ls -l foo*bar
-rw-rw-r-- 1 nix nix 0 Dec 11 17:13 foo⧸bar

(That's U+29F8 BIG SOLIDUS.)
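A minimal Python sketch of why such a name slips past a naive check for the path separator. The helper names (`describe`, `non_ascii_chars`) are made up for illustration, and flagging every non-ASCII character is only one possible policy, not something the thread proposes.

```python
import unicodedata

# The lookalike name from the listing above: U+29F8, not an ASCII '/'.
LOOKALIKE = "foo\u29f8bar"


def describe(ch: str) -> str:
    """Return the code point and Unicode name of a single character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}"


def non_ascii_chars(name: str) -> list[str]:
    """List every character in `name` that lies outside ASCII."""
    return [describe(ch) for ch in name if ord(ch) > 0x7F]


# A check for the real separator finds nothing, because none is present.
print("/" in LOOKALIKE)
# Listing the non-ASCII characters exposes the lookalike.
print(non_ascii_chars(LOOKALIKE))
```

The name is a perfectly ordinary single path component as far as the kernel is concerned; only the rendering is misleading.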

Git v2.24.1 and others

Posted Dec 12, 2019 13:30 UTC (Thu) by epa (subscriber, #39769)

Another plug for Fixing Unix/Linux/POSIX Filenames, for those who have not yet read it.

Git v2.24.1 and others

Posted Dec 14, 2019 23:37 UTC (Sat) by adobriyan (subscriber, #30858)

The fix is to not use shell scripts for anything serious.

David Wheeler's logic is like this:
most shell scripts are buggy because shell authors make it easy to make mistakes, despite knowing perfectly well that Unix allows whitespace in filenames; therefore the OS kernel should accommodate shell users.

We've seen this pattern before:
the shell can't do system calls, therefore the OS kernel should interface in text, which is inferior in nearly every way.

Real programming languages (say Python) don't have the problem with whitespace (subprocess.call()).

Maybe it is the Unix shells that should be fixed?

Even if whitespace and other characters are banned, where will they stop? Unicode is big, there are 8 newlines.
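To make the subprocess.call() remark concrete, here is a small sketch, assuming Python 3.7 or later. It uses subprocess.run(), the newer sibling of subprocess.call(); the file name and the one-line child program are invented for the example. Because the arguments are passed as a list, no shell ever splits them on whitespace.

```python
import subprocess
import sys

# Hypothetical file name containing spaces and an embedded newline.
awkward_name = "my file\nwith  odd whitespace.txt"

# The child process just writes back the argument list it received.
child_code = "import sys; sys.stdout.write(repr(sys.argv[1:]))"

# Passing a list (and no shell=True) hands each element to the child
# as exactly one argument, whatever characters it contains.
result = subprocess.run(
    [sys.executable, "-c", child_code, awkward_name],
    capture_output=True,
    text=True,
    check=True,
)
received = result.stdout
print(received)
```

The child sees a single argument identical to the original string, which is the behaviour an unquoted `$file` in a shell script fails to provide.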

Git v2.24.1 and others

Posted Dec 15, 2019 0:11 UTC (Sun) by Cyberax (✭ supporter ✭, #52523)

> Unicode is big, there are 8 newlines.
Unicode? What Unicode? Unix file names need not be Unicode in any encoding. They can be arbitrary binary garbage.

Getting filenames to be valid UTF-8 would be an awesome improvement over the status quo.
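As a rough illustration of the "arbitrary binary garbage" point, the sketch below checks whether raw name bytes (such as those returned by os.listdir() when given a bytes path) decode as UTF-8. The function name `is_valid_utf8` and the sample names are made up for the example; it says nothing about how a filesystem would actually enforce such a rule.

```python
def is_valid_utf8(raw_name: bytes) -> bool:
    """Return True if the raw file name bytes decode as strict UTF-8."""
    try:
        raw_name.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Sample names, invented for illustration.
names = [
    b"plain.txt",                     # ASCII is a subset of UTF-8
    "foo\u29f8bar".encode("utf-8"),   # the BIG SOLIDUS lookalike
    b"caf\xe9.txt",                   # Latin-1 encoded, not UTF-8
    b"\xff\xfe",                      # bytes that never occur in UTF-8
]

results = {name: is_valid_utf8(name) for name in names}
for name, ok in results.items():
    print(name, ok)
```

Note that the lookalike name passes: requiring valid UTF-8 removes undecodable names but does nothing about confusable characters.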

Git v2.24.1 and others

Posted Dec 15, 2019 11:43 UTC (Sun) by adobriyan (subscriber, #30858)

UTF-8 implies Unicode. People doing Unicode say that the UTF-8 part is trivial and the hard part is at the glyph/grapheme level.

Git v2.24.1 and others

Posted Dec 15, 2019 11:45 UTC (Sun) by Cyberax (✭ supporter ✭, #52523)

Yet the filesystems fail even the trivial part..

Git v2.24.1 and others

Posted Dec 16, 2019 0:13 UTC (Mon) by zlynx (guest, #2285)

I am glad we have arbitrary binary garbage.

Otherwise Linux / Unix would have ended up like the others in love with "the future" and we'd be stuck with UCS-2 circa 2002. Which is neither big enough for every character, nor space efficient.

Won't it be fun in the year 2100 when users have to create wild WTF-8 hacks to work around the encoding limitations hard-coded into their virtual storage backend?
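The two complaints about UCS-2 above (not big enough, not space efficient) can be checked with a few lines of Python. This is only a sketch: the sample name is invented, and UTF-16 stands in for the 16-bit layout because Python has no UCS-2 codec.

```python
# Hypothetical ASCII-only file name.
ascii_name = "README.txt"

# A character outside the Basic Multilingual Plane (U+1F5FC).
tower = "\U0001F5FC"

# Space efficiency: 16-bit units double the size of ASCII text.
utf8_len = len(ascii_name.encode("utf-8"))
utf16_len = len(ascii_name.encode("utf-16-le"))

# Range: UCS-2 has one 16-bit unit per character, so it tops out at U+FFFF.
fits_in_ucs2 = ord(tower) <= 0xFFFF

# UTF-16 needs a surrogate pair (two 16-bit units) for the same character;
# UTF-8 simply uses a four-byte sequence.
tower_utf16_units = len(tower.encode("utf-16-le")) // 2
tower_utf8_bytes = len(tower.encode("utf-8"))

print(utf8_len, utf16_len, fits_in_ucs2, tower_utf16_units, tower_utf8_bytes)
```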

Git v2.24.1 and others

Posted Dec 17, 2019 22:02 UTC (Tue) by flussence (guest, #85566)

Somehow I doubt the human race will fill up the other 90% of UTF-8 codepoints in less than 0.1% of the time it took to invent the first 10%.

Git v2.24.1 and others

Posted Dec 18, 2019 12:06 UTC (Wed) by NAR (subscriber, #1313)

Aren't emojis using up UTF-8 codepoints? I can very well imagine the human race filling up the UTF-8 codepoints, unfortunately...

Git v2.24.1 and others

Posted Dec 20, 2019 0:33 UTC (Fri) by flussence (guest, #85566)

The bulk of existing emojis came from a set already used among Japanese phone carriers around the turn of the century. That's why U+1F5FC is labelled the Tokyo (not Eiffel) Tower, and most of that block is similarly laden with cultural artifacts most people wouldn't be familiar with.

We're actually running out of things to add to Unicode. New emoji proposals are in short supply and most of the recent additions have been ancient scripts and increasingly obscure precomposed CJK glyphs. Maybe of more relevance to people reading this, Unicode 13 is adding characters from ancient computer systems (Spectrum, Teletext, C64 and the like): https://www.unicode.org/charts/PDF/Unicode-13.0/

Git v2.24.1 and others

Posted Dec 23, 2019 15:13 UTC (Mon) by geert (subscriber, #98403)

Looks like we're still lacking a few characters to run a C64 emulator in a terminal window? ;-)

Git v2.24.1 and others

Posted Dec 17, 2019 22:26 UTC (Tue) by Cyberax (✭ supporter ✭, #52523)

UCS-2 would have been an improvement over the binary garbage. And anyway, UTF-8 is now universal for actual languages and there's plenty of space for new ones.

Pretty much the only case where UTF-8 won't be enough is if Earth joins the Galactic Federation with FTL communications. But in this case I think that migration off UTF-8 would be a good problem to have.

Git v2.24.1 and others

Posted Dec 18, 2019 12:24 UTC (Wed) by jezuch (subscriber, #52988)

With things like this, there usually comes a time when you can say that the problem is basically solved. It was not in 2002, but it pretty much is now. Just be glad that Unicode took upon itself the very thankless task of developing a universal standard for text encoding, and that you didn't have to do that yourself. It's now done. We can finally use it to clean up the royal mess of all the previous attempts.

It's like with date handling: do not *ever* write your own date handling library; now that even Java has decent support for it in its standard library, you don't have to be stupid like this anymore ;) And one standard is enough. We've got it solved, let's move on already.