Of bytes and encoded strings
Posted Jan 24, 2014 11:55 UTC (Fri) by tialaramex (subscriber, #21167)
In places where decoding errors can't give you an Exception (and those are few in Python, but common in, say, C), the modern standard is clear about how to handle things, and the choice it made is quite safe. You get a U+FFFD replacement character for every unrecognisable or incomplete UTF-8 sequence, which won't match any plausible token separator, so everything should be fine.
If an API is supposed to accept text, or you only bothered to implement text correctly and left binary data as a TODO, then decoding the input as UTF-8 and getting either an exception or a stream of U+FFFD when you were wrong is far from the most stupid thing you could do. Taking some arbitrary binary data and trying to tokenize it as if it were ASCII is _way_ more dangerous.
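As a rough illustration of the replacement behaviour described above, here is a minimal Python 3 sketch. The helper name decode_lossy and the sample bytes are made up for the example; only bytes.decode with errors="replace" is standard.

```python
def decode_lossy(data: bytes) -> str:
    """Decode as UTF-8, substituting U+FFFD for bytes that cannot be decoded."""
    return data.decode("utf-8", errors="replace")


# Hypothetical input: 0xff can never appear in valid UTF-8.
raw = b"key\xffvalue\tnext"

text = decode_lossy(raw)

# The replacement character is not a tab, so it cannot be mistaken
# for the field separator when the decoded text is tokenized.
fields = text.split("\t")

# Strict decoding (the default) raises instead of substituting.
try:
    raw.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True
```

The bad byte ends up inside the first field rather than creating or hiding a separator, which is the property the argument relies on.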
On the security side of things I don't have enough Python expertise to do this, but I recommend that someone who does should:
Write small programs demonstrating common mistakes in pseudo-text handling of binary data: expecting a separator when there isn't one, concatenating stuff blindly, assuming inputs obey some particular syntax though you've no such guarantee, that sort of thing. Write them once, as best you can, using Python 3's existing string vs binary distinction, and note where mistakes are caught or what their consequences are. Then do the same with the proposed new functionality. My guess from what I've seen above is that you'll get a bunch more places where minor programming mistakes introduce silent, hard-to-find, hard-to-understand bugs in a Python module or program with the "easier" new behaviour.
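A first step in that direction might look like the sketch below. It only covers the existing Python 3 behaviour, not the proposed functionality, and the function names (join_fields, parse_pair, outcome) are invented for the example.

```python
def join_fields(name, value):
    # Blind concatenation: nothing checks that both arguments are bytes.
    return name + b": " + value


def parse_pair(line: bytes):
    # Assumes a tab separator is present, with no guarantee that it is.
    key, value = line.split(b"\t", 1)
    return key, value


def outcome(func, *args):
    """Run func and report either its result or the exception type raised."""
    try:
        return func(*args)
    except (TypeError, ValueError) as exc:
        return "caught " + type(exc).__name__


joined_ok = outcome(join_fields, b"Subject", b"hello")
joined_mixed = outcome(join_fields, b"Subject", "hello")  # a str slipped in
parsed_ok = outcome(parse_pair, b"key\tvalue")
parsed_missing = outcome(parse_pair, b"no separator here")
```

In this sketch both mistakes surface immediately as exceptions; the open question in the comment is how many of them would instead pass silently under the "easier" behaviour.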
Now, maybe that's OK. I don't know, but I think it's very easy to walk into this thinking you've just made it easier for people to get their jobs done, when what you've actually done is lost an opportunity to fix or at least mitigate grievous mistakes in their code.
Posted Jan 25, 2014 13:40 UTC (Sat) by kleptog (subscriber, #1183)
You're talking as if the phrase "split a string on '\t'" is somehow ambiguous. If you rewrote that to say "split a bytestring on the byte value 9" is it somehow different? Even though the operations are identical?
Assuming ASCII is so normal that people don't even write down the assumption anymore. In every encoding still widely used, the two are identical. Not even SJIS is crazy enough to break it. I suppose for completeness we should specify a byte as 8 bits, but hardly anyone questions that anymore.
If you're decoding a chunk of text whose encoding is not determined a priori, then you need to parse that text to determine the correct encoding. If, as part of that, the system allows you to treat the string as if it were ASCII, that really doesn't seem like a problem to me.
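To make the claim concrete, here is a small sketch comparing "split the text on a tab" with "split the bytes on byte value 9" for three encodings. The sample strings and the helper name split_then_decode are assumptions chosen for the example, and three encodings are of course not a proof for every encoding in use.

```python
# Hypothetical sample strings, one per encoding, each containing a tab.
samples = {
    "utf-8": "na\u00efve\t\u65e5\u672c\u8a9e",
    "latin-1": "na\u00efve\tcaf\u00e9",
    "shift_jis": "abc\t\u65e5\u672c\u8a9e",
}


def split_then_decode(text: str, encoding: str):
    """Encode, split the raw bytes on byte value 9, then decode each piece."""
    raw = text.encode(encoding)
    return [piece.decode(encoding) for piece in raw.split(b"\t")]


# For each encoding, check whether splitting the bytes gives the same
# fields as splitting the decoded text.
results = {
    enc: split_then_decode(text, enc) == text.split("\t")
    for enc, text in samples.items()
}
```

For these samples the two operations agree, because none of the multi-byte sequences involved contains the byte 0x09.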
Posted Jan 25, 2014 14:23 UTC (Sat) by vonbrand (guest, #4458)
See, the "everything is ASCII" assumption is exactly what collided with reality when non-English languages became common enough in computing to have to be really considered. Entertaining to watch, if you could stand on the sidelines...
Posted Jan 26, 2014 17:48 UTC (Sun) by kleptog (subscriber, #1183)
Posted Jan 25, 2014 21:35 UTC (Sat) by HelloWorld (guest, #56129)
Posted Jan 26, 2014 18:00 UTC (Sun) by kleptog (subscriber, #1183)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa7 in position 0: ordinal not in range(128)
What the Python bytearray type provides are operations on an unspecified 8-bit, ASCII-compatible encoding, which is what you need when parsing emails, network protocols, etc.
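The command that produced the traceback above is not shown, so the following is only a sketch of the general point, using made-up header data: bytes (and bytearray) can be split and partitioned on ASCII delimiters without ever picking an encoding, even when the payload is not valid ASCII.

```python
# Hypothetical tab-separated data with non-ASCII bytes in unknown encodings.
header = b"X-Note: caf\xe9\tSubject: \xa7 5\tTo: someone"

# Split on the tab byte; no codec is involved.
fields = header.split(b"\t")

# Pull out the part before ": " in each field, still as bytes.
names = [field.partition(b": ")[0] for field in fields]

# Decoding the same data as ASCII fails on the first high byte.
try:
    fields[1].decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

# bytearray offers the same split operation.
mutable_fields = bytearray(header).split(b"\t")
```

The structure is recovered from the ASCII delimiters alone, and the question of what the high bytes mean can be deferred until the encoding is actually known.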
Posted Jan 26, 2014 20:03 UTC (Sun) by HelloWorld (guest, #56129)
> Now, this is Python 3.2 so maybe it changed in later versions. But it seems ridiculous to me that you'd have to choose an 8-bit encoding at random for a byte-string (Latin-9 is the usual choice, but that breaks sometimes) just to split it on a damn tab character.
I don't think there's anything wrong with having a split method on the bytes class. If a function makes sense in the normal list class, it probably makes sense on byte arrays too. On the other hand, if a function doesn't make sense on the list class, it probably shouldn't be in the bytes class either. Methods like format don't make any sense for general lists, so the bytes class shouldn't have them either.
I also think that most of the time one shouldn't be writing parsing code at all. There are many types of libraries and tools to help with that nowadays (parser combinators, iteratees, parser generators and much more).