Of bytes and encoded strings

Posted Jan 24, 2014 8:38 UTC (Fri) by hummassa (subscriber, #307)
In reply to: Of bytes and encoded strings by Cyberax
Parent article: Of bytes and encoded strings

> So lots of string functions simply won't work.

This is actually a good thing. Imagine how many security bugs are hidden in the "wrong encoding assumption for a part of a binary buffer" antipattern.

And it's not like the porting is extra-painful. Find out the correct encoding for some part of a bytes object, extract, decode, and voila... a string.
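A minimal sketch of that extract-and-decode step. The record layout here is hypothetical (a 2-byte big-endian length, then that many bytes of UTF-8 text, then an opaque payload), and `read_name` is an illustrative name, not part of any library:

```python
import struct

def read_name(record):
    """Return (decoded name, raw payload) from a length-prefixed record."""
    # Assumed layout: 2-byte big-endian length, then that many bytes of UTF-8.
    (length,) = struct.unpack_from(">H", record, 0)
    name_bytes = record[2:2 + length]
    payload = record[2 + length:]
    # Decode only the slice known to be text; the payload stays bytes.
    return name_bytes.decode("utf-8"), payload

# "café" is 5 bytes in UTF-8; the trailing bytes are arbitrary binary data.
record = struct.pack(">H", 5) + "café".encode("utf-8") + b"\x00\xff\x10"
name, payload = read_name(record)
```

Only the slice with a known encoding ever becomes a `str`; everything else remains a bytes object.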



Of bytes and encoded strings

Posted Jan 24, 2014 8:56 UTC (Fri) by Cyberax (✭ supporter ✭, #52523)

Except that simple stuff like "tokenize string on \t separator" becomes complicated. And simplistic "fixes" like assuming that strings are in utf-8 are only going to make it _worse_.

Of bytes and encoded strings

Posted Jan 24, 2014 11:55 UTC (Fri) by tialaramex (subscriber, #21167)

I don't think that's true. Expecting UTF-8 and being wrong about it should get you an Exception, which is a lot safer than entering some unknown state. Now, in Python there's a good chance the Exception blows up your whole program (because nobody checks exceptions), but at least that's a clean death.

In places where decoding errors can't give you an Exception (those are few in Python, but they're common in, say, C), the modern standard is clear about how to handle things, and the choice it made is quite safe. You get a U+FFFD replacement character for every unrecognisable or incomplete UTF-8 sequence, which won't match any plausible token separator, so everything should be fine.
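Both behaviours can be seen in Python 3 with the standard `errors` argument to `bytes.decode`. This is only a sketch: the sample data and the helper name `decode_strict` are made up for illustration.

```python
# 0xa7 on its own is not a valid UTF-8 sequence.
data = b"key\tval\xa7ue"

def decode_strict(raw):
    """Return the decoded text, or None if raw is not valid UTF-8."""
    try:
        return raw.decode("utf-8")  # errors="strict" is the default
    except UnicodeDecodeError:
        return None

strict_result = decode_strict(data)

# errors="replace" substitutes U+FFFD instead of raising.
lenient = data.decode("utf-8", errors="replace")
fields = lenient.split("\t")
```

The strict path fails loudly, while the lenient path still tokenizes on the tab because U+FFFD never compares equal to the separator.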

If an API is supposed to accept text, or you just only bothered to correctly implement text and left binary data as a TODO, decoding it from UTF-8 and either getting an exception or a stream of U+FFFD if you were wrong is far from the most stupid thing you could do. Taking some arbitrary binary data and trying to tokenize that as if it's ASCII is _way_ more dangerous.

On the security side of things I don't have enough Python expertise to do this, but I recommend that someone who does should:

Write small programs demonstrating common mistakes in pseudo-text handling for binary data: expecting a separator when there isn't one, concatenating stuff blindly, assuming inputs obey some particular syntax though you've no such guarantee, that sort of thing. Write them once as best you can using Python 3's existing string vs binary distinction and note where mistakes are caught or what their consequences are. Then do the same with the proposed new functionality. My guess from what I've seen above is that you'll get a bunch more places where minor programming mistakes introduce silent, hard-to-find, hard-to-understand bugs in a Python module or program with the "easier" new behaviour.

Now, maybe that's OK. I don't know, but I think it's very easy to walk into this thinking you've just made it easier for people to get their jobs done, when what you've actually done is lost an opportunity to fix or at least mitigate grievous mistakes in their code.

Of bytes and encoded strings

Posted Jan 25, 2014 13:40 UTC (Sat) by kleptog (subscriber, #1183)

> Taking some arbitrary binary data and trying to tokenize that as if it's ASCII is _way_ more dangerous.

You're talking as if the phrase "split a string on '\t'" is somehow ambiguous. If you rewrote that to say "split a bytestring on the byte value 9" is it somehow different? Even though the operations are identical?

Assuming ASCII is so normal that people don't even write down the assumption anymore. In every encoding still widely used the above are identical. Not even SJIS is crazy enough to break it. I suppose for completeness we should specify a byte as 8 bits, but hardly anyone questions that anymore.
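As a rough illustration of that claim, the sketch below splits the raw bytes on byte value 9 and only then decodes each piece. The sample text and the three encodings are arbitrary choices for the example, not an exhaustive test of every encoding:

```python
text = "名前\t値"
expected = text.split("\t")

results = {}
for enc in ("utf-8", "shift_jis", "euc_jp"):
    raw = text.encode(enc)
    # Split on the byte value 9 first, then decode each piece.
    parts = raw.split(bytes([9]))
    results[enc] = [part.decode(enc) for part in parts]
```

In each of these encodings the byte 0x09 occurs only as the tab itself, so splitting before decoding gives the same fields as decoding first.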

If you're decoding a chunk of text whose encoding is not determined a priori, then you need to parse that text to determine the correct encoding. If as part of that the system allows you to treat the string as if it was ASCII that really doesn't seem like a problem to me.

Of bytes and encoded strings

Posted Jan 25, 2014 14:23 UTC (Sat) by vonbrand (guest, #4458)

See, the "everything is ASCII" assumption is exactly what collided with reality when non-English languages became common enough in computing to have to be really considered. Entertaining to watch, if you could stand on the sidelines...

Of bytes and encoded strings

Posted Jan 26, 2014 17:48 UTC (Sun) by kleptog (subscriber, #1183)

Umm, I never said everything is ASCII. I said every common encoding in use is ASCII compatible to such an extent that splitting on a tab is going to always work fine.

Of bytes and encoded strings

Posted Jan 25, 2014 21:35 UTC (Sat) by HelloWorld (guest, #56129)

> You're talking as if the phrase "split a string on '\t'" is somehow ambiguous. If you rewrote that to say "split a bytestring on the byte value 9" is it somehow different?

And you're talking as if adding “.encode('ascii')” here and there is a huge problem. Python 2 has shown that the muddling between bytes and strings is a mess, thus they fixed it in Python 3. Why unlearn this lesson now?

Of bytes and encoded strings

Posted Jan 26, 2014 18:00 UTC (Sun) by kleptog (subscriber, #1183)

I think "here and there" is a bit of an understatement. And besides, it doesn't work:

>>> b'\xa7'.decode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa7 in position 0: ordinal not in range(128)

Now, this is Python 3.2 so maybe it changed in later versions. But it seems ridiculous to me that you'd have to choose an 8-bit encoding at random for a byte-string (Latin-9 is the usual choice, but that breaks sometimes) just to split it on a damn tab character.

What the Python bytearray type provides are operations on an unspecified 8-bit ASCII compatible encoding, which is what you need when parsing emails, network protocols, etc.
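A small sketch of that kind of parsing, using only methods that Python 3 bytes objects already have (`split`, `partition`, `strip`, `lower`). The header block is invented for the example and is not taken from a real message:

```python
# Made-up header block; 0xe9 is a non-ASCII byte in some unspecified 8-bit encoding.
raw_header = b"Subject: caf\xe9 menu\r\nX-Field: a\tb\tc\r\n"

headers = {}
for line in raw_header.split(b"\r\n"):
    if not line:
        continue  # skip the empty piece after the final CRLF
    name, _, value = line.partition(b":")
    # lower() and strip() only touch ASCII letters and ASCII whitespace.
    headers[name.strip().lower()] = value.strip()

tab_fields = headers[b"x-field"].split(b"\t")
```

No encoding is ever chosen: the non-ASCII byte passes through untouched, and the structure is found purely from ASCII delimiters.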

Of bytes and encoded strings

Posted Jan 26, 2014 20:03 UTC (Sun) by HelloWorld (guest, #56129)

> Now, this is Python 3.2 so maybe it changed in later versions.

No, why would this ever change? ASCII is a 7-bit character set, thus every byte >= 128 can't be decoded.

> Now, this is Python 3.2 so maybe it changed in later versions. But it seems ridiculous to me that you'd have to choose an 8-bit encoding at random for a byte-string (Latin-9 is the usual choice, but that breaks sometimes) just to split it on a damn tab character.

I don't think there's anything wrong with having a split method on the bytes class. If a function makes sense in the normal list class, it probably makes sense on byte arrays too. Otoh, if a function doesn't make sense on the list class, it probably shouldn't be in the bytes class either. Methods like format don't make any sense for general lists, so the bytes class shouldn't have them either.

I also think that most of the time one shouldn't be writing parsing code at all. There are many types of libraries and tools to help with that nowadays (parser combinators, iteratees, parser generators and much more).


Copyright © 2017, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds