Working with UTF-8 in the kernel
Posted Apr 4, 2019 13:00 UTC (Thu) by foom (subscriber, #14868)
In reply to: Working with UTF-8 in the kernel by marcH
Parent article: Working with UTF-8 in the kernel
Search for $upcase -- the name of the 128KB pseudo-file stored on every NTFS filesystem. You can also look at the NTFS filesystem driver for Linux.
This file contains the corresponding uppercased character (2 bytes) for each one of the 65536 Unicode characters. When Windows wants to compare filenames, it simply maps each character of each string through this table to make an uppercase string, and then does the comparison.
When you reformat a drive, Windows writes the newest mapping to the file, and that partition will use the same mapping for as long as you keep it.
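To make the table lookup described above concrete, here is a rough sketch in Python. It is not the real $upcase contents or the actual NTFS code: the table is built from Python's own `str.upper()` as a stand-in, and the names `build_upcase_table`, `to_code_units` and `names_equal` are made up for illustration.

```python
import struct

def build_upcase_table():
    """Hypothetical stand-in for the on-disk table: one 16-bit entry
    per 16-bit code unit (65536 entries, 128KB when serialized)."""
    table = []
    for cu in range(0x10000):
        up = chr(cu).upper()
        # Keep only one-to-one mappings that still fit in 16 bits.
        # Anything else (e.g. 'ß' -> 'SS') maps to itself.
        if len(up) == 1 and ord(up) < 0x10000:
            table.append(ord(up))
        else:
            table.append(cu)
    return table

def to_code_units(name):
    """Split a name into UTF-16 code units, as stored in NTFS names."""
    data = name.encode("utf-16-le", "surrogatepass")
    return struct.unpack("<%dH" % (len(data) // 2), data)

def names_equal(a, b, table):
    """Compare two names after mapping every code unit through the table."""
    return ([table[cu] for cu in to_code_units(a)] ==
            [table[cu] for cu in to_code_units(b)])

UPCASE = build_upcase_table()
```

With this stand-in table, `names_equal("README.txt", "readme.TXT", UPCASE)` is true, while "straße" and "STRASSE" stay distinct, because a per-code-unit table cannot express a one-to-many mapping. The same goes for characters outside the first 65536: their surrogate halves just map to themselves.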
And, yes, I am quite aware that everyone who knows anything about Unicode is crying out in distress at the utter WRONGness of what I said above...
But of course, the secret is that users aren't really the ones who care about case-insensitive comparisons... They are using GUI file pickers and other higher-level tools where the filesystem's case behavior doesn't matter.
Note that the primary use cases given for Linux (Samba exports, Android emulating a FAT filesystem on top of ext4) are all about *software* expectations, not humans: software that was written with hardcoded filenames of the wrong case. That's why NTFS's braindead case folding is not really a problem.
Which does rather call into question whether implementing "correct" normalization and case folding in Linux even has a point... Doing that won't make it any more compatible with the legacy software...
