
Of bytes and encoded strings


Posted Jan 25, 2014 13:40 UTC (Sat) by kleptog (subscriber, #1183)
In reply to: Of bytes and encoded strings by tialaramex
Parent article: Of bytes and encoded strings

> Taking some arbitrary binary data and trying to tokenize that as if it's ASCII is _way_ more dangerous.

You're talking as if the phrase "split a string on '\t'" is somehow ambiguous. If you rewrote that to say "split a bytestring on the byte value 9" is it somehow different? Even though the operations are identical?

Assuming ASCII is so normal that people don't even write down the assumption anymore. In every encoding still widely used the above are identical. Not even SJIS is crazy enough to break it. I suppose for completeness we should specify a byte as 8 bits, but hardly anyone questions that anymore.
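To make the claim concrete, here is a minimal sketch that checks it for a handful of ASCII-compatible encodings. The sample texts and the list of encodings are assumptions picked for illustration, not something taken from the discussion; the check is simply that splitting the encoded bytes on byte value 9 yields the same fields as splitting the text on '\t'.

```python
# Illustrative samples; the texts and the encoding list are assumptions
# chosen for this sketch, not taken from the discussion.
SAMPLES = {
    "utf-8": "naïve\t日本語\tend",
    "latin-1": "naïve\tcafé\tend",
    "shift_jis": "日本語\tテスト\tend",
    "euc_jp": "日本語\tテスト\tend",
}


def split_matches(text, encoding):
    """Return True if byte-level and text-level tab splits agree."""
    raw = text.encode(encoding)
    byte_fields = raw.split(bytes([9]))  # byte value 9 is b"\t"
    return [f.decode(encoding) for f in byte_fields] == text.split("\t")


results = {enc: split_matches(text, enc) for enc, text in SAMPLES.items()}
print(results)
```

The sketch covers only ASCII-compatible encodings, which is the case being argued here.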

If you're decoding a chunk of text whose encoding is not determined a priori, then you need to parse that text to determine the correct encoding. If, as part of that, the system allows you to treat the string as if it were ASCII, that really doesn't seem like a problem to me.



Of bytes and encoded strings

Posted Jan 25, 2014 14:23 UTC (Sat) by vonbrand (guest, #4458)

See, the "everything is ASCII" assumption is exactly what collided with reality when non-English languages became common enough in computing to have to be really considered. Entertaining to watch, if you could stand on the sidelines...

Of bytes and encoded strings

Posted Jan 26, 2014 17:48 UTC (Sun) by kleptog (subscriber, #1183)

Umm, I never said everything is ASCII. I said every common encoding in use is ASCII compatible to such an extent that splitting on a tab is going to always work fine.

Of bytes and encoded strings

Posted Jan 25, 2014 21:35 UTC (Sat) by HelloWorld (guest, #56129)

> You're talking as if the phrase "split a string on '\t'" is somehow ambiguous. If you rewrote that to say "split a bytestring on the byte value 9" is it somehow different?
And you're talking as if adding “.encode('ascii')” here and there is a huge problem. Python 2 has shown that the muddling between bytes and strings is a mess, thus they fixed it in Python 3. Why unlearn this lesson now?

Of bytes and encoded strings

Posted Jan 26, 2014 18:00 UTC (Sun) by kleptog (subscriber, #1183)

I think "here and there" is a bit of an understatement. And besides, it doesn't work:
>>> b'\xa7'.decode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa7 in position 0: ordinal not in range(128)
Now, this is Python 3.2 so maybe it changed in later versions. But it seems ridiculous to me that you'd have to choose an 8-bit encoding at random for a byte-string (Latin-9 is the usual choice, but that breaks sometimes) just to split it on a damn tab character.

What the Python bytearray type provides are operations on an unspecified 8-bit ASCII compatible encoding, which is what you need when parsing emails, network protocols, etc.
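As a small sketch of what that looks like in practice: the record below is made up for illustration, and contains the same 0xa7 byte that the ASCII codec rejected above. Both bytes and bytearray in Python 3 have a split method that works directly on the raw data, so no encoding has to be chosen just to find the tabs.

```python
# A made-up record containing a byte (0xa7) that is not valid ASCII.
record = b"id\t\xa7 4.2\tend"

# Split on the tab byte directly; no codec is involved at any point.
fields = record.split(b"\t")

# The mutable bytearray type offers the same operation.
mutable_fields = bytearray(record).split(b"\t")

print(fields)
print(mutable_fields)
```

Each field stays a byte string, so the decision about how (or whether) to decode it can be made later, per field.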

Of bytes and encoded strings

Posted Jan 26, 2014 20:03 UTC (Sun) by HelloWorld (guest, #56129)

> Now, this is Python 3.2 so maybe it changed in later versions.
No, why would this ever change? ASCII is a 7-bit character set, thus every byte >= 128 can't be decoded.
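A short sketch of that boundary, using only the standard codecs. The helper name ascii_decodable is hypothetical, introduced here for illustration. The last lines show two standard ways to get past the error when the bytes have to become a str anyway: picking a real 8-bit encoding such as Latin-1, or the "surrogateescape" error handler, which keeps the undecodable byte recoverable.

```python
def ascii_decodable(data):
    """Return True if the ASCII codec accepts these bytes (hypothetical helper)."""
    try:
        data.decode("ascii")
        return True
    except UnicodeDecodeError:
        return False


# Every single byte below 128 decodes; every byte from 128 up does not.
low_ok = all(ascii_decodable(bytes([i])) for i in range(128))
high_ok = any(ascii_decodable(bytes([i])) for i in range(128, 256))

# Latin-1 assigns a character to all 256 byte values, so it never fails.
as_latin1 = b"\xa7".decode("latin-1")

# surrogateescape smuggles the raw byte through as a lone surrogate,
# and the same handler restores it when encoding back.
escaped = b"\xa7".decode("ascii", errors="surrogateescape")
restored = escaped.encode("ascii", errors="surrogateescape")

print(low_ok, high_ok, as_latin1, restored)
```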

> But it seems ridiculous to me that you'd have to choose an 8-bit encoding at random for a byte-string (Latin-9 is the usual choice, but that breaks sometimes) just to split it on a damn tab character.
I don't think there's anything wrong with having a split method on the bytes class. If a function makes sense in the normal list class, it probably makes sense on byte arrays too. Otoh, if a function doesn't make sense on the list class, it probably shouldn't be in the bytes class either. Methods like format don't make any sense for general lists, so the bytes class shouldn't have them either.

I also think that most of the time one shouldn't be writing parsing code at all. There are many types of libraries and tools to help with that nowadays (parser combinators, iteratees, parser generators and much more).


Copyright © 2017, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds