bytes vs. characters

Posted Apr 23, 2015 20:33 UTC (Thu) by lsl (subscriber, #86508)
In reply to: bytes vs. characters by marcH
Parent article: Report from the Python Language Summit

Why would they be garbage?

bytes vs. characters

Posted Apr 23, 2015 20:47 UTC (Thu) by marcH (subscriber, #57642)

> Why would they be garbage?

How else is any invalid encoding displayed?

bytes vs. characters

Posted Apr 23, 2015 22:07 UTC (Thu) by dlang (guest, #313)

they are only invalid if you decide ahead of time that they are supposed to be UTF8 strings.

the spec allows them to be a string of bytes (excluding null and /), no encoding is required.
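
A minimal sketch of that rule in Python, for illustration only. The helper name is_valid_component is hypothetical, and the check deliberately ignores length limits such as NAME_MAX and the special entries "." and "..":

```python
def is_valid_component(name: bytes) -> bool:
    """Check a single path component against the rule quoted above:
    any non-empty run of bytes is allowed, except NUL and '/'.

    Assumption: length limits (NAME_MAX) and per-filesystem
    restrictions are out of scope for this sketch.
    """
    return len(name) > 0 and b"\x00" not in name and b"/" not in name


# Bytes that are not valid UTF-8, plus a newline: still an acceptable name.
odd_but_legal = is_valid_component(b"caf\xe9\n")
# A slash separates components, so it cannot appear inside one.
has_slash = is_valid_component(b"a/b")
```

Note that nothing in the check mentions an encoding: the name is judged purely as bytes.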

bytes vs. characters

Posted Apr 23, 2015 22:30 UTC (Thu) by marcH (subscriber, #57642)

> they are only invalid if you decide ahead of time that they are supposed to be UTF8 strings.

They are invalid if you decide that they are supposed to be in some encoding and some filename uses any *other* encoding. Then garbage gets displayed: for real.

https://en.wikipedia.org/wiki/Mojibake (search page for "garbage")

It's less rare with removable media or ID3 tags:
https://en.wikipedia.org/wiki/ID3 (search page for "mojibake")
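
To make the "garbage" concrete, here is a small Python sketch of both failure modes. The file name "café.txt" is just an illustrative example, not taken from any real system:

```python
# The same file name, encoded two different ways.
name = "café.txt"
utf8_bytes = name.encode("utf-8")      # b'caf\xc3\xa9.txt'
latin1_bytes = name.encode("latin-1")  # b'caf\xe9.txt'

# UTF-8 bytes read as Latin-1: decoding succeeds, but the result is mojibake.
mojibake = utf8_bytes.decode("latin-1")

# Latin-1 bytes read as UTF-8: strict decoding fails outright...
try:
    latin1_bytes.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# ...and lenient decoding substitutes U+FFFD, the replacement character.
replaced = latin1_bytes.decode("utf-8", errors="replace")
```

Either way the bytes on disk are untouched; only the interpretation chosen for display goes wrong.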

> the spec allows them to be a string of bytes (excluding null and /), no encoding is required.

As far as filenames are concerned, you meant: *the lack of* a spec.

http://www.dwheeler.com/essays/fixing-unix-linux-filename... (search page for... "garbage")

> no encoding is required.

Which command do you typically use instead of "ls"? hexdump?

bytes vs. characters

Posted Apr 24, 2015 6:57 UTC (Fri) by mbunkus (subscriber, #87248)

It's not about the display of broken information. That's the easy part.

But a reliable tool, especially one running on filesystems where nearly anything goes in a file name (including newlines and no discernible encoding at all), should be able to handle such files, too. This goes double for tools where the developer doesn't control the input. Backup software is the prime example.

How often do such files occur? You'd be surprised… Email clients are still broken and label file names with the wrong character set, resulting in broken file names when saving. ZIP files often carry no usable encoding information, so unpacking one with a file name containing non-ASCII characters often results in ISO 8859 encoded file names on my UTF-8 system. And so on.

Therefore treating file names as anything other than a sequence of bytes is, in general, a really bad idea. Only force encodings in places where you need that encoding; displaying the file name being the prime example. If you store it in a database use binary column formats (or if you must use hex then use some kind of escaping mechanism like URL-encoding a UTF-8 representation). UTF-8 representations have their own problems regarding file names; think of normalization forms and the fun you have moving files between Macs and non-Macs.

Treating file names correctly is hard enough. Forcing them into any kind of encoding only makes it worse.
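
As a rough Python sketch of "bytes everywhere, encoding only for display": the file name below is a made-up example, and percent-encoding the raw bytes directly (rather than a UTF-8 representation) is one possible variation on the escaping idea, not the only way to do it.

```python
from urllib.parse import quote_from_bytes, unquote_to_bytes

# Hypothetical name as read from disk: Latin-1 bytes plus a newline,
# so it is not valid UTF-8.
raw = b"r\xe9sum\xe9\n.txt"

# Storage in a text-only field: percent-encode the bytes into plain ASCII.
stored = quote_from_bytes(raw, safe="")
restored = unquote_to_bytes(stored)  # lossless: the original bytes come back

# Display only: decode leniently and make the newline visible.
shown = raw.decode("utf-8", errors="replace").replace("\n", "\\n")

# If a str is unavoidable, surrogateescape keeps the undecodable bytes
# recoverable instead of replacing them.
as_str = raw.decode("utf-8", errors="surrogateescape")
back = as_str.encode("utf-8", errors="surrogateescape")
```

The point of the split is that `shown` is allowed to be lossy because it is never used to open the file, while `stored` and `as_str` must round-trip exactly.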

bytes vs. characters

Posted Apr 24, 2015 17:22 UTC (Fri) by marcH (subscriber, #57642)

> Only force encodings in places where you need that encoding; displaying the file name being the prime example.

Thanks a lot, this clarifies.

So the core issue seems to be: the filename being the only file handle. Lose the name and you lose the file. I agree it shouldn't be like this. For instance you can have an iterator that returns some opaque FileObject that does not really care about the name. Does Python have this?

bytes vs. characters

Posted Apr 25, 2015 8:19 UTC (Sat) by peter-b (subscriber, #66996)

> So the core issue seems to be: the filename being the only file handle. Lose the name and you lose the file. I agree it shouldn't be like this. For instance you can have an iterator that returns some opaque FileObject that does not really care about the name. Does Python have this?

Yes. os.listdir(x), where x is bytes, returns the raw filenames as bytes.

https://docs.python.org/3.4/library/os.html?highlight=lis...
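
A short sketch of how that might be used. The helper names iter_raw_entries and display_name are hypothetical; only os.listdir and os.path.join are real library calls here:

```python
import os


def iter_raw_entries(directory=b"."):
    """Yield (raw_name, full_path) pairs as bytes, never decoded."""
    for raw_name in os.listdir(directory):  # bytes in, bytes out
        yield raw_name, os.path.join(directory, raw_name)


def display_name(raw_name):
    """Decode for display only; undecodable bytes show up as U+FFFD."""
    return raw_name.decode("utf-8", errors="replace")
```

The bytes path can be passed straight to open() or os.stat(), so the file stays reachable even when display_name() has to show a replacement character.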