Of bytes and encoded strings

Of bytes and encoded strings

Posted Jan 25, 2014 21:35 UTC (Sat) by HelloWorld (guest, #56129)
In reply to: Of bytes and encoded strings by kleptog
Parent article: Of bytes and encoded strings

> You're talking as if the phrase "split a string on '\t'" is somehow ambiguous. If you rewrote that to say "split a bytestring on the byte value 9" is it somehow different?
And you're talking as if adding “.encode('ascii')” here and there is a huge problem. Python 2 showed that muddling bytes and strings together is a mess, so they fixed it in Python 3. Why unlearn this lesson now?



Of bytes and encoded strings

Posted Jan 26, 2014 18:00 UTC (Sun) by kleptog (subscriber, #1183)

I think "here and there" is a bit of an understatement. And besides, it doesn't work:
>>> b'\xa7'.decode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa7 in position 0: ordinal not in range(128)
Now, this is Python 3.2 so maybe it changed in later versions. But it seems ridiculous to me that you'd have to choose an 8-bit encoding at random for a byte-string (Latin-9 is the usual choice, but that breaks sometimes) just to split it on a damn tab character.

What the Python bytearray type provides is a set of operations on an unspecified 8-bit, ASCII-compatible encoding, which is what you need when parsing emails, network protocols, etc.
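To make the point above concrete, here is a minimal sketch of splitting on a tab without choosing any encoding. The `record` value is made up for illustration; the only calls used are the `split` methods of `bytes` and `bytearray` with a bytes separator.

```python
# A made-up record with a non-ASCII byte (0xa7) in one field.
# No encoding is assumed anywhere below.
record = b"name\tvalue \xa7 1\tend"

# bytes.split takes a bytes separator, so there is no decode step.
fields = record.split(b"\t")
print(fields)

# bytearray has the same method and returns bytearray pieces.
mutable_fields = bytearray(record).split(b"\t")
print(mutable_fields)
```

The 0xa7 byte passes through untouched because the split only ever compares byte values against the separator.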

Of bytes and encoded strings

Posted Jan 26, 2014 20:03 UTC (Sun) by HelloWorld (guest, #56129)

> Now, this is Python 3.2 so maybe it changed in later versions.
No, why would this ever change? ASCII is a 7-bit character set, so no byte >= 128 can be decoded as ASCII.
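A quick way to check the 7-bit claim is to try every possible byte value. This is only a sketch; `decodes_as_ascii` is a helper name invented here, not anything from the standard library.

```python
def decodes_as_ascii(value):
    """Return True if the single byte `value` decodes with the ascii codec."""
    try:
        bytes([value]).decode("ascii")
        return True
    except UnicodeDecodeError:
        return False

# Collect every byte value (0..255) that the ascii codec accepts.
decodable = [v for v in range(256) if decodes_as_ascii(v)]
print(decodable[0], decodable[-1], len(decodable))  # 0 127 128
```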

> But it seems ridiculous to me that you'd have to choose an 8-bit encoding at random for a byte-string (Latin-9 is the usual choice, but that breaks sometimes) just to split it on a damn tab character.
I don't think there's anything wrong with having a split method on the bytes class. If a function makes sense on the normal list class, it probably makes sense on byte arrays too. On the other hand, if a function doesn't make sense on the list class, it probably shouldn't be in the bytes class either. Methods like format don't make any sense for general lists, so the bytes class shouldn't have them.
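As a rough illustration of the "bytes is like a list" view, the sketch below treats a bytes object as a plain sequence of integers. The `data` value is made up, and the final check only reflects the Python 3 versions I am aware of, where neither `bytes` nor `list` has a `format` method.

```python
data = b"a\tb"

# Iterating or indexing a bytes object yields integers, like a list of ints.
as_list = list(data)
print(as_list)            # [97, 9, 98]
print(data[1], len(data))  # 9 3

# Slicing and membership tests behave like they do on other sequences.
print(data[:1], 9 in data)

# Neither type offers a format method (str does).
print(hasattr(bytes, "format"), hasattr(list, "format"), hasattr(str, "format"))
```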

I also think that most of the time one shouldn't be writing parsing code at all. There are many types of libraries and tools to help with that nowadays (parser combinators, iteratees, parser generators and much more).


Copyright © 2017, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds