bytes vs. characters

Posted Apr 17, 2015 7:15 UTC (Fri) by zyga (subscriber, #81533)
In reply to: bytes vs. characters by Cyberax
Parent article: Report from the Python Language Summit

You can decode("UTF-8", "ignore") or something else to "coerce" it to some form of text though I really do value the sanity of that. Just fix your data sources. Even if you use some 3rd party library it's not going to make any of that "json"-like thing work with other libraries (I assume that other customers/APIs need to read it).

HTTP headers are a perfect example of binary data. Handling them as unicode text is broken IMHO. You can just use byte processing for everything there and Python 3.4, AFAIR, fixed some last gripes about lack of formatting support for edge cases like that.

bytes vs. characters

Posted Apr 17, 2015 7:22 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link]

> You can decode("UTF-8", "ignore") or something else to "coerce" it to some form of text though I really do value the sanity of that.
I think it will still be broken. There's a workaround that simply stores binary bytes in the lower byte of UCS-4 codepoints and it sorta works.

I'd love to fix these data sources, but they're out of my control. The vendor knows about it and they plan to base64 binary data in the future, but for now I have to work with what I have.

> You can just use byte processing for everything there and Python 3.4
Not exactly. Most of the built-in library can be used with byte sequences, but third-party libraries are often too careless.

I've fixed tons of code like this:

>def blah(p):
> if fail_to_do_something(p):
> raise SomeException(u"Failed to frobnicate %s!" % p)

It mostly works as is, but occasionally it doesn't.