bytes vs. characters
Posted Apr 23, 2015 20:33 UTC (Thu) by lsl (subscriber, #86508)
In reply to: bytes vs. characters by marcH
Parent article: Report from the Python Language Summit
Posted Apr 23, 2015 20:47 UTC (Thu)
by marcH (subscriber, #57642)
How else is any invalid encoding displayed?
Posted Apr 23, 2015 22:07 UTC (Thu)
by dlang (guest, #313)
the spec allows them to be a string of bytes (excluding null and /), no encoding is required.
Posted Apr 23, 2015 22:30 UTC (Thu)
by marcH (subscriber, #57642)
They are invalid if you decide that they are supposed to be in some encoding and some filename uses any *other* encoding. Then garbage gets displayed: for real.
https://en.wikipedia.org/wiki/Mojibake (search page for "garbage")
It's less rare with removable media or ID3 tags.
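To make the "garbage" concrete, here is a minimal Python sketch of what happens when bytes written in one encoding get displayed as another. The helper name `mojibake` and the sample string are made up for illustration; only `str.encode` and `bytes.decode` come from the standard library.

```python
def mojibake(text, written_as="utf-8", read_as="latin-1"):
    """Encode text in one encoding, then decode the same bytes as another.

    errors="replace" stands in for what many display tools do with
    bytes they cannot decode: substitute U+FFFD.
    """
    raw = text.encode(written_as)
    return raw.decode(read_as, errors="replace")


# "caf\u00e9" is "cafe" with an acute accent on the last letter.
utf8_shown_as_latin1 = mojibake("caf\u00e9")
latin1_shown_as_utf8 = mojibake("caf\u00e9", written_as="latin-1", read_as="utf-8")
```

The first result is the classic two-character garbage in place of the accented letter; the second ends in the U+FFFD replacement character because a lone 0xE9 byte is not valid UTF-8.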
> the spec allows them to be a string of bytes (excluding null and /), no encoding is required.
As far as filenames are concerned, you meant: *the lack of* a spec.
http://www.dwheeler.com/essays/fixing-unix-linux-filename... (search page for... "garbage")
> no encoding is required.
Which command do you typically use instead of "ls"? hexdump?
Posted Apr 24, 2015 6:57 UTC (Fri)
by mbunkus (subscriber, #87248)
But a reliable tool, especially one running on filesystems where nearly anything goes in file names (including newlines and no discernible encoding at all), should be able to handle such files, too. This goes double for tools where the developer doesn't control the input. Backup software is the prime example.
How often do such files occur? You'd be surprised… Email clients are still broken and annotate file names with the wrong character set, which results in broken file names when saving. ZIP files don't have any encoding information at all, so unpacking one with a file name containing non-ASCII characters often results in ISO-encoded file names on my UTF-8 system. And so on.
Therefore treating file names as anything other than a sequence of bytes is, in general, a really bad idea. Only force encodings in places where you need that encoding; displaying the file name is the prime example. If you store it in a database, use binary column formats (or, if you must use hex, then use some kind of escaping mechanism like URL-encoding a UTF-8 representation). UTF-8 representations have their own problems regarding file names: think of normalization forms and the fun you have with Macs and non-Macs.
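As a rough sketch of the escaping idea, assuming the storage layer can only hold text: percent-encode the raw bytes on the way in and reverse it on the way out. `escape_name` and `unescape_name` are hypothetical helper names; `quote_from_bytes` and `unquote_to_bytes` are the actual `urllib.parse` functions.

```python
from urllib.parse import quote_from_bytes, unquote_to_bytes


def escape_name(raw):
    """Turn arbitrary file name bytes into a plain ASCII string.

    safe="" means that only unreserved characters (letters, digits
    and a few punctuation marks) are left unescaped.
    """
    return quote_from_bytes(raw, safe="")


def unescape_name(escaped):
    """Recover the exact original bytes from an escaped name."""
    return unquote_to_bytes(escaped)


# A name with a Latin-1 byte, a space and a newline: not valid UTF-8.
awkward = b"caf\xe9 \n.txt"
stored = escape_name(awkward)
```

The point is the round trip: `unescape_name(stored)` gives back the same bytes, without ever deciding which encoding the name was "supposed" to be in.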
Treating file names correctly is hard enough. Forcing them into any kind of encoding only makes it worse.
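For the normalization problem mentioned above, a small illustration: the same visible name can be two different byte sequences even when both are valid UTF-8. The sample strings and the helper `same_visible_name` are invented for this sketch; `unicodedata.normalize` is the standard library call.

```python
import unicodedata

# Precomposed form (NFC): one code point for the accented letter.
nfc_name = "caf\u00e9.txt"
# Decomposed form (NFD): plain "e" followed by a combining acute accent.
nfd_name = "cafe\u0301.txt"


def same_visible_name(a, b):
    """Compare two names after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# The UTF-8 bytes differ even though both names render identically.
nfc_bytes = nfc_name.encode("utf-8")
nfd_bytes = nfd_name.encode("utf-8")
```

A byte-wise comparison treats these as two different files, which is correct for a backup tool; comparing normalized forms is only appropriate where you have decided that matching by appearance is what you want.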
Posted Apr 24, 2015 17:22 UTC (Fri)
by marcH (subscriber, #57642)
Thanks a lot, this clarifies.
So the core issue seems to be: the filename being the only file handle. Lose the name and you lose the file. I agree it shouldn't be like this. For instance you can have an iterator that returns some opaque FileObject that does not really care about the name. Does Python have this?
Posted Apr 25, 2015 8:19 UTC (Sat)
by peter-b (subscriber, #66996)
Yes. os.listdir(x), where x is bytes, returns the raw filenames as bytes.
https://docs.python.org/3.4/library/os.html?highlight=lis...
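A minimal sketch of that behaviour, assuming a Linux-style filesystem that accepts arbitrary bytes in names (some filesystems, for example on Macs, may refuse the file name used here). The helpers `raw_names`, `display_name` and `demo` are made up for illustration.

```python
import os
import tempfile


def raw_names(directory):
    """List a directory with a bytes path, so names come back as bytes."""
    return sorted(os.listdir(os.fsencode(directory)))


def display_name(raw):
    """Decode for display only; undecodable bytes become \\xNN escapes."""
    return raw.decode("utf-8", errors="backslashreplace")


def demo():
    """Create a file whose name is not valid UTF-8 and list it back."""
    with tempfile.TemporaryDirectory() as tmp:
        name = b"caf\xe9.txt"  # Latin-1 encoded, invalid as UTF-8
        with open(os.path.join(os.fsencode(tmp), name), "wb"):
            pass
        return raw_names(tmp)
```

The bytes returned by `raw_names` can be passed straight back to `open` or `os.stat`, so the file stays reachable even though its name cannot be decoded; `display_name` is only for showing it to a human.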
https://en.wikipedia.org/wiki/ID3 (search page for "mojibake")