Python 3, ASCII, and UTF-8
Posted Dec 20, 2017 15:59 UTC (Wed) by nybble41 (subscriber, #55106)
In reply to: Python 3, ASCII, and UTF-8 by Cyberax
Parent article: Python 3, ASCII, and UTF-8
Exactly. This is the sort of thing the runtime should provide a library for—given a byte array, is it valid UTF-8? And of course, if you want to filter for specific Unicode codepoints then you'll have to decode the string anyway.
Not that Python's Unicode strings would necessarily protect against this scenario in the first place. In surrogateescape mode the incomplete codepoints would be passed through unchanged, just as if you'd used byte arrays. But how often does one see code concatenating fields together without a known (and typically ASCII) delimiter anyway? No matter how it was implemented, the transformation would lose information.
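The surrogateescape behavior described here is easy to demonstrate: each undecodable byte becomes a lone surrogate codepoint, and re-encoding with the same handler recovers the original bytes exactly. A minimal sketch (the field contents are made up):

```python
# surrogateescape maps each undecodable byte to a lone surrogate in
# U+DC80..U+DCFF, so decode/encode is a lossless round-trip.
raw = b'field1\xfffield2'           # \xff is not valid UTF-8

s = raw.decode('utf-8', errors='surrogateescape')
assert s == 'field1\udcfffield2'    # the bad byte survives as U+DCFF

roundtrip = s.encode('utf-8', errors='surrogateescape')
assert roundtrip == raw             # original bytes recovered exactly
```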
Posted Dec 20, 2017 17:17 UTC (Wed)
by brouhaha (subscriber, #1698)
Posted Dec 20, 2017 17:49 UTC (Wed)
by Cyberax (✭ supporter ✭, #52523)
Posted Dec 20, 2017 18:36 UTC (Wed)
by vstinner (subscriber, #42675)
To be clear: Python doesn't force anyone to use Unicode.
All OS functions accept bytes, and Python commonly provides two flavors of the same API: one for Unicode, one for bytes.
Examples: urllib accepts URLs as str and bytes; os.listdir(str)->str and os.listdir(bytes)->bytes; os.environ:str and os.environb:bytes; os.getcwd():str and os.getcwdb():bytes; open(str) and open(bytes); etc.
sys.stdin.buffer.read() returns bytes which can be written back into stdout using sys.stdout.buffer.write().
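The dual-flavor APIs listed above can be checked directly; a minimal sketch showing that the result type mirrors the argument type:

```python
import os

# str flavor and bytes flavor of the same call.
cwd_s = os.getcwd()     # returns str
cwd_b = os.getcwdb()    # returns bytes
assert isinstance(cwd_s, str) and isinstance(cwd_b, bytes)

# os.listdir echoes the type of its argument in its results.
assert all(isinstance(name, str) for name in os.listdir('.'))
assert all(isinstance(name, bytes) for name in os.listdir(b'.'))

# Raw byte I/O on the standard streams looks like:
#   data = sys.stdin.buffer.read()
#   sys.stdout.buffer.write(data)
```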
PEP 538 and PEP 540 try to make life easier for programmers who care about Unicode and want to use it for good reasons. Again, for Unix-like tools (imagine a Python "grep"), stdin and stdout are configured to be able to "pass bytes through". So it also makes life easier for programmers who want to process data as "bytes" (even if technically it's Unicode; Unicode is easier to manipulate in Python 3).
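The "pass bytes through" idea for a Python "grep" can be sketched in a few lines. This is my illustration, not code from the PEPs: lines are decoded with surrogateescape, matched as str, and re-encoded, so undecodable bytes in non-matching positions are echoed back byte-for-byte:

```python
def grep(pattern, stream):
    """Yield input lines (bytes) containing pattern, reproducing the
    original bytes exactly -- surrogateescape makes the decode/encode
    round-trip lossless."""
    for line in stream:
        s = line.decode('utf-8', errors='surrogateescape')
        if pattern in s:
            yield s.encode('utf-8', errors='surrogateescape')

# Typical use as a Unix filter:
#   for out in grep('needle', sys.stdin.buffer):
#       sys.stdout.buffer.write(out)

# Undecodable bytes pass through untouched:
lines = [b'match \xff here\n', b'no hit\n']
assert list(grep('match', lines)) == [b'match \xff here\n']
```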
Posted Dec 20, 2017 18:55 UTC (Wed)
by Cyberax (✭ supporter ✭, #52523)
> PEP 538 and PEP 540 try to make life easier for programmers who care about Unicode and want to use it for good reasons.
Posted Dec 20, 2017 20:42 UTC (Wed)
by nas (subscriber, #17)
> No they don't try to make it easier.
Your trolling on this site is reaching new heights. Now you're actually claiming that the people working on developing these PEPs aren't even intending to make Python better? What devious motivation do you think they have, then? I mean, come on.
You like the string model of Python 2.7 better. Okay, you can have it. As someone who writes internationalized web applications, Python 3.6 works vastly better for me.
Posted Dec 20, 2017 20:46 UTC (Wed)
by Cyberax (✭ supporter ✭, #52523)
> You like the string model of Python 2.7 better. Okay, you can have it. As someone who writes internationalized web applications, Python 3.6 works vastly better for me.
Posted Dec 21, 2017 11:56 UTC (Thu)
by jwilk (subscriber, #63328)
Here's a familiar Python developer arguing against adding it: https://bugs.python.org/issue8776#msg217416
Posted Dec 21, 2017 12:00 UTC (Thu)
by vstinner (subscriber, #42675)
sys.argvb wasn't added because it's hard to keep two separate lists in sync. Harder than os.environ and os.environb, which are mappings.
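The distinction is that os.environ and os.environb are two live views of one underlying mapping, so they cannot drift apart, whereas sys.argv is a plain list and a parallel bytes list would have to be kept synchronized by hand. A quick sketch (POSIX only; os.environb doesn't exist on Windows, and DEMO_VAR is a made-up variable):

```python
import os

# Writes through either view are visible through the other.
os.environ['DEMO_VAR'] = 'one'
assert os.environb[b'DEMO_VAR'] == b'one'

os.environb[b'DEMO_VAR'] = b'two'       # update via the bytes view
assert os.environ['DEMO_VAR'] == 'two'  # seen via the str view
```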
Posted Dec 21, 2017 17:53 UTC (Thu)
by kjp (guest, #39639)
Posted Dec 23, 2017 16:35 UTC (Sat)
by mathstuf (subscriber, #69389)
Posted Dec 20, 2017 21:22 UTC (Wed)
by nybble41 (subscriber, #55106)
That is a partial solution, which works as long as you don't mind dealing with exceptions or care about the performance impact of actually decoding a potentially large byte string into a Unicode string. It would be nice to have a function which just scanned the array in place and returned the result, without allocating memory or throwing exceptions.
In any case, I was not trying to say that Python 3 does not provide a function like this, just that it's something that does belong in the standard library.
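For illustration, a validator along those lines can be written as a plain scan over the bytes, with no decoded string built and no exceptions raised. This is a pure-Python sketch (so illustrative rather than fast; a real version would live in C), with ranges following the Unicode standard's table of well-formed UTF-8 sequences:

```python
def utf8_valid(buf: bytes) -> bool:
    """Return True if buf is valid UTF-8, scanning in place."""
    i, n = 0, len(buf)
    while i < n:
        b = buf[i]
        if b < 0x80:                       # ASCII
            i += 1
            continue
        if 0xC2 <= b <= 0xDF:              # 2-byte sequence
            need, lo, hi = 1, 0x80, 0xBF
        elif b == 0xE0:
            need, lo, hi = 2, 0xA0, 0xBF   # reject overlong 3-byte forms
        elif b == 0xED:
            need, lo, hi = 2, 0x80, 0x9F   # reject surrogates
        elif 0xE1 <= b <= 0xEF:
            need, lo, hi = 2, 0x80, 0xBF
        elif b == 0xF0:
            need, lo, hi = 3, 0x90, 0xBF   # reject overlong 4-byte forms
        elif 0xF1 <= b <= 0xF3:
            need, lo, hi = 3, 0x80, 0xBF
        elif b == 0xF4:
            need, lo, hi = 3, 0x80, 0x8F   # cap at U+10FFFF
        else:
            return False                   # 0x80-0xC1, 0xF5-0xFF
        if i + need >= n:
            return False                   # truncated sequence
        if not lo <= buf[i + 1] <= hi:
            return False                   # bad first continuation byte
        for j in range(i + 2, i + need + 1):
            if not 0x80 <= buf[j] <= 0xBF:
                return False               # bad later continuation byte
        i += need + 1
    return True
```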
> Do some people think that having distinct data types for array-of-arbitrary-bytes and string-of-characters is a bad thing?
Having separate types is good. However, making the more complex and restrictive string-of-characters type the default or customary form when array-of-arbitrary-bytes would be sufficient is a mistake. Duplicating all the APIs (a byte version and a Unicode version) is yet another mistake, and basing return types on the types of unrelated parameters makes it even worse. (Why assume the file _contents_ are in UTF-8 just because a Unicode string was used for the filename?) Putting aside the minority of APIs which inherently relate to Unicode, the rest should only accept and return byte arrays, leaving the conversions up to the user _if they are needed_.
Arguably, filenames should be a third type, neither byte arrays nor Unicode strings. On some platforms (e.g. Linux) they are byte arrays with a variety of encodings, most commonly UTF-8 but not restricted to valid UTF-8 sequences. On others (e.g. Windows) they are a restricted subset of Unicode (UCS-2). Some platforms (macOS) apply transformations to the strings, for normalization or case-insensitivity, so equality as bytes or as Unicode codepoints may not be the same as equality as filenames. Handling all of this portably is a difficult problem, and the solutions which are most suitable for filenames are unlikely to be applicable to strings in general.
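For what it's worth, Python's own compromise for the filename problem is os.fsdecode/os.fsencode, which convert between the bytes and str views using the filesystem encoding plus the surrogateescape handler, so arbitrary POSIX filenames survive a round-trip through str. A sketch (POSIX only; Windows uses different filesystem-encoding rules, and the filename is made up):

```python
import os

name_b = b'report-\xff.txt'   # not valid UTF-8, yet a legal Linux filename
name_s = os.fsdecode(name_b)  # the stray byte becomes a lone surrogate
assert os.fsencode(name_s) == name_b  # lossless round-trip
```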
Python 3 provides that. It's the decode method applied to the array of bytes. It will return a string if the bytes are a valid UTF-8 sequence, and raise an exception if not.
#!/usr/bin/env python3
import sys

def is_valid_unicode(b):
    """Return True if the byte string b is a valid UTF-8 sequence."""
    try:
        b.decode('utf-8')
    except UnicodeDecodeError:
        return False
    return True

# Treat each command-line argument as a hexadecimal byte value.
b = bytes([int(x, 16) for x in sys.argv[1:]])
print(is_valid_unicode(b))
Example:
$ ./validutf8.py ce bc e0 b8 99 f0 90 8e b7 e2 a1 8d 0a
True
$ ./validutf8.py 2d 66 5b 1a f7 53 e3 f6 fd 47 a2 07 fc
False
I'm confused by aspects of this discussion. Do some people think that having distinct data types for array-of-arbitrary-bytes and string-of-characters is a bad thing?
Yes, it is bad if the language defaults to "string-of-codepoints" under the hood and goes through a lot of contortions to maintain this pretense.
You mean: "Python3 provides enough hoops through which you can jump to achieve parity with Python2".
No they don't try to make it easier.
People who designed register_globals for PHP were also trying to make life easier for other developers.
You assume that I haven't written i18n-ed applications? How exactly is Py3 better for it?