Python 3, ASCII, and UTF-8

Posted Dec 20, 2017 15:59 UTC (Wed) by nybble41 (subscriber, #55106)
In reply to: Python 3, ASCII, and UTF-8 by Cyberax
Parent article: Python 3, ASCII, and UTF-8

> You can do UTF-8 validation for it.

Exactly. This is the sort of thing the runtime should provide a library for—given a byte array, is it valid UTF-8? And of course, if you want to filter for specific Unicode codepoints then you'll have to decode the string anyway.

Not that Python's Unicode strings would necessarily protect against this scenario in the first place. In surrogateescape mode the incomplete codepoints would be passed through unchanged, just as if you'd used byte arrays. But how often does one see code concatenating fields together without a known (and typically ASCII) delimiter anyway? No matter how it was implemented the transformation would lose information.
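A minimal sketch of that pass-through, assuming the two fields are decoded separately with errors="surrogateescape". The byte values are illustrative (the two halves of a UTF-8 encoded U+03BC), not taken from any real data.

```python
# Two "fields" that each hold half of a UTF-8 encoded U+03BC (bytes CE BC).
first, second = b"\xce", b"\xbc"

# Strict decoding rejects the truncated sequence outright.
try:
    first.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# surrogateescape maps each undecodable byte to a lone surrogate
# in the range U+DC80..U+DCFF instead of raising.
s1 = first.decode("utf-8", "surrogateescape")
s2 = second.decode("utf-8", "surrogateescape")

# Concatenating and re-encoding restores the original bytes, which
# now happen to form one valid sequence.
joined = (s1 + s2).encode("utf-8", "surrogateescape")

print(strict_ok)                       # False
print(ascii(s1))                       # '\udcce'
print(joined)                          # b'\xce\xbc'
print(ascii(joined.decode("utf-8")))   # '\u03bc'
```

So the str version of the concatenation behaves just like the bytes version here: nothing is rejected, and the two fragments fuse into a character neither field contained.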

Python 3, ASCII, and UTF-8

Posted Dec 20, 2017 17:17 UTC (Wed) by brouhaha (subscriber, #1698) [Link] (10 responses)

Python 3 provides that. It's the decode method of the bytes object. It returns a string if the bytes are a valid UTF-8 sequence, and raises UnicodeDecodeError if not.

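#!/usr/bin/env python3
import sys

def is_valid_unicode(b):
    """Return True if the bytes object b is well-formed UTF-8."""
    try:
        b.decode('utf-8')
    except UnicodeDecodeError:
        # Catch only decoding failures, not unrelated errors.
        return False
    return True

# Each command-line argument is one byte written in hexadecimal.
b = bytes([int(x, 16) for x in sys.argv[1:]])
print(is_valid_unicode(b))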

Example:

$ ./validutf8.py ce bc e0 b8 99 f0 90 8e b7 e2 a1 8d 0a
True
$ ./validutf8.py 2d 66 5b 1a f7 53 e3 f6 fd 47 a2 07 fc
False

I'm confused by aspects of this discussion. Do some people think that having distinct data types for array-of-arbitrary-bytes and string-of-characters is a bad thing?

Python 3, ASCII, and UTF-8

Posted Dec 20, 2017 17:49 UTC (Wed) by Cyberax (✭ supporter ✭, #52523) [Link] (8 responses)

> I'm confused by aspects of this discussion. Do some people think that having distinct data types for array-of-arbitrary-bytes and string-of-characters is a bad thing?
Yes it is bad, if the language defaults to "string-of-codepoints" under the hood and goes through a lot of contortions to maintain this pretense.

Python 3, ASCII, and UTF-8

Posted Dec 20, 2017 18:36 UTC (Wed) by vstinner (subscriber, #42675) [Link] (7 responses)

I don't know where I should put my comment in this long thread.

To be clear: Python doesn't force anyone to use Unicode.

All OS functions accept bytes, and Python commonly provides two flavors of the same API: one for Unicode, one for bytes.

Examples: urllib accepts URL as str and bytes, os.listdir(str)->str and os.listdir(bytes)->bytes, os.environ:str and os.environb:bytes, os.getcwd():str and os.getcwdb():bytes, open(str) and open(bytes), etc.
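A small sketch of the two flavors, using only the os functions named above. It runs against the current directory, so the actual names listed will vary; only the types are the point.

```python
import os

# A str argument yields str results; a bytes argument yields bytes results.
names_str = os.listdir(".")
names_bytes = os.listdir(b".")

# Separate functions for the two flavors of the working directory.
cwd_str = os.getcwd()
cwd_bytes = os.getcwdb()

print(all(isinstance(n, str) for n in names_str))
print(all(isinstance(n, bytes) for n in names_bytes))
print(type(cwd_str).__name__, type(cwd_bytes).__name__)

# os.fsencode converts the str view of a name to the bytes view.
cwd_reencoded = os.fsencode(cwd_str)
print(isinstance(cwd_reencoded, bytes))
```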

sys.stdin.buffer.read() returns bytes which can be written back into stdout using sys.stdout.buffer.write().

PEP 538 and PEP 540 try to make life easier for programmers who care about Unicode and want to use Unicode for good reasons. Again, for Unix-like tools (imagine a Python "grep"), stdin and stdout are configured to be able to "pass through bytes". So they also make life easier for programmers who want to process data as "bytes" (even if technically it's Unicode, Unicode is easier to manipulate in Python 3).
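For the pure-bytes route, a rough sketch of such a "grep" might look like this. The function name grep_bytes and the sample data are made up for illustration; in a real filter the two streams would be sys.stdin.buffer and sys.stdout.buffer.

```python
import io

def grep_bytes(pattern, src, dst):
    """Copy every line of src that contains pattern to dst.

    pattern is bytes; src and dst are binary streams. Nothing is decoded,
    so invalid UTF-8 in the input passes through untouched.
    Returns the number of matching lines.
    """
    matched = 0
    for line in src:
        if pattern in line:
            dst.write(line)
            matched += 1
    return matched

# In-memory stand-ins for stdin and stdout; the second line is not valid UTF-8.
src = io.BytesIO(b"caf\xc3\xa9 latte\nbad \xff\xfe bytes latte\nplain tea\n")
dst = io.BytesIO()

count = grep_bytes(b"latte", src, dst)
print(count)           # 2
print(dst.getvalue())  # b'caf\xc3\xa9 latte\nbad \xff\xfe bytes latte\n'
```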

Python 3, ASCII, and UTF-8

Posted Dec 20, 2017 18:55 UTC (Wed) by Cyberax (✭ supporter ✭, #52523) [Link] (2 responses)

> To be clear: Python doesn't force anyone to use Unicode.
You mean: "Python3 provides enough hoops through which you can jump to achieve parity with Python2".

> PEP 538 and PEP 540 try to make life easier for programmers who care about Unicode and want to use Unicode for good reasons.
No they don't try to make it easier.

Python 3, ASCII, and UTF-8

Posted Dec 20, 2017 20:42 UTC (Wed) by nas (subscriber, #17) [Link] (1 responses)

>> PEP 538 and PEP 540 try to make life easier for programmers who care about Unicode and want to use Unicode for good reasons.

> No they don't try to make it easier.

Your trolling on this site is reaching new heights. Now you are actually claiming that the people working on developing these PEPs are not even intending to make Python better? What devious motivation do you think they have then? I mean, come on.

You like the string model of Python 2.7 better. Okay, you can have it. As someone who writes internationalized web applications, Python 3.6 works vastly better for me.

Python 3, ASCII, and UTF-8

Posted Dec 20, 2017 20:46 UTC (Wed) by Cyberax (✭ supporter ✭, #52523) [Link]

> Your trolling on this site is reaching new heights. Now you are actually claiming that the people working on developing these PEPs are not even intending to make Python better? What devious motivation do you think they have then? I mean, come on.
People who designed register_globals for PHP were also trying to make life easier for other developers.

> You like the string model of Python 2.7 better. Okay, you can have it. As someone who writes internationalized web applications, Python 3.6 works vastly better for me.
You assume that I haven't written i18n-ed applications? How exactly is Py3 better for it?

Python 3, ASCII, and UTF-8

Posted Dec 21, 2017 11:56 UTC (Thu) by jwilk (subscriber, #63328) [Link] (1 responses)

There's no byte equivalent for sys.argv.

Here's a familiar Python developer arguing against adding it: https://bugs.python.org/issue8776#msg217416

Python 3, ASCII, and UTF-8

Posted Dec 21, 2017 12:00 UTC (Thu) by vstinner (subscriber, #42675) [Link]

"argvb can be computed in one line: list(map(os.fsencode, sys.argv))."

sys.argvb wasn't added because it's hard to keep two separate lists in sync. That is harder than for os.environ and os.environb, which are mappings.
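A quick sketch of that one-liner, plus a round trip showing why it recovers the original bytes. This assumes a POSIX system, where the filesystem error handler is surrogateescape; the sample name is made up.

```python
import os
import sys

# The one-liner quoted above: re-encode each argument with the
# filesystem encoding and its error handler.
argvb = list(map(os.fsencode, sys.argv))

# Round trip for a name that is not valid UTF-8 (illustrative bytes).
raw = b"report-\xff.txt"
as_str = os.fsdecode(raw)       # undecodable byte becomes a lone surrogate
back = os.fsencode(as_str)      # and is turned back into the same byte

print(ascii(as_str))
print(back == raw)
```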

Python 3, ASCII, and UTF-8

Posted Dec 21, 2017 17:53 UTC (Thu) by kjp (guest, #39639) [Link]

But unicode doesn't mean unicode with python3! What is this surrogate escaping nonsense coming back from system APIs? What happened to "errors shouldn't pass silently"?
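To make the complaint concrete, here is a sketch of where the error actually shows up. The helper has_escaped_bytes is hypothetical, not a standard library function, and the filename bytes are illustrative.

```python
def has_escaped_bytes(s):
    """Return True if s contains lone surrogates left by surrogateescape."""
    return any(0xDC80 <= ord(ch) <= 0xDCFF for ch in s)

# Illustrative filename bytes that are not valid UTF-8 (Latin-1 e-acute).
name = b"caf\xe9.txt".decode("utf-8", "surrogateescape")

print(ascii(name))                    # 'caf\udce9.txt'
print(has_escaped_bytes(name))        # True
print(has_escaped_bytes("cafe.txt"))  # False

# Decoding passed silently; the failure only appears on a strict encode.
try:
    name.encode("utf-8")
    failed = False
except UnicodeEncodeError:
    failed = True
print(failed)                         # True
```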

Python 3, ASCII, and UTF-8

Posted Dec 23, 2017 16:35 UTC (Sat) by mathstuf (subscriber, #69389) [Link]

The problems I've seen are usually with parsing JSON or XML files and the like. Python2 is fine, but Python3 all of a sudden starts complaining that things aren't ASCII when it encounters UTF-8. The thing is that JSON is defined as UTF-8, so I don't know why Python ignores that, and our XML has the encoding declared at the top of the file.
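One plausible cause (an assumption; nothing above says how the files were opened) is open() in text mode falling back to the locale's encoding, which is ASCII in a C/POSIX locale. A sketch of two ways to sidestep that, using a temporary file and made-up data:

```python
import json
import os
import tempfile

doc = {"name": "Zo\u00eb", "city": "K\u00f8benhavn"}

# Write a UTF-8 encoded JSON file containing raw (unescaped) non-ASCII text.
with tempfile.NamedTemporaryFile("wb", suffix=".json", delete=False) as f:
    f.write(json.dumps(doc, ensure_ascii=False).encode("utf-8"))
    path = f.name

try:
    # Option 1: name the encoding instead of relying on the locale default.
    with open(path, encoding="utf-8") as f:
        from_text = json.load(f)

    # Option 2: read bytes and let json detect the encoding itself
    # (json.loads accepts bytes since Python 3.6).
    with open(path, "rb") as f:
        from_bytes = json.loads(f.read())
finally:
    os.remove(path)

print(from_text == doc, from_bytes == doc)  # True True
```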

Python 3, ASCII, and UTF-8

Posted Dec 20, 2017 21:22 UTC (Wed) by nybble41 (subscriber, #55106) [Link]

> Python 3 provides that. It's the method decode applied to the array of bytes.

That is a partial solution which works so long as you don't mind dealing with exceptions and don't care about the performance impact of actually decoding a potentially large byte string into a Unicode string. It would be nice to have a function which just scanned the array in place and returned the result without allocating memory or throwing exceptions.
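As a sketch of the interface I have in mind (not of the performance: in pure Python this will be slower than decode()), something like the following scans the bytes and returns a bool without building a str. The name is_valid_utf8 is hypothetical, and the byte ranges follow the well-formed UTF-8 table in the Unicode standard.

```python
def is_valid_utf8(data):
    """Return True if data (bytes, bytearray or memoryview) is well-formed UTF-8.

    Scans in place: no str is built and no exception is raised.
    """
    i = 0
    n = len(data)
    while i < n:
        b0 = data[i]
        if b0 < 0x80:                      # ASCII
            i += 1
            continue
        # Pick the number of continuation bytes and the allowed range of
        # the first one; the ranges exclude overlong forms, surrogates
        # and code points above U+10FFFF.
        if 0xC2 <= b0 <= 0xDF:
            need, lo, hi = 1, 0x80, 0xBF
        elif b0 == 0xE0:
            need, lo, hi = 2, 0xA0, 0xBF
        elif 0xE1 <= b0 <= 0xEC or 0xEE <= b0 <= 0xEF:
            need, lo, hi = 2, 0x80, 0xBF
        elif b0 == 0xED:
            need, lo, hi = 2, 0x80, 0x9F
        elif b0 == 0xF0:
            need, lo, hi = 3, 0x90, 0xBF
        elif 0xF1 <= b0 <= 0xF3:
            need, lo, hi = 3, 0x80, 0xBF
        elif b0 == 0xF4:
            need, lo, hi = 3, 0x80, 0x8F
        else:                              # 80..C1 and F5..FF never start a sequence
            return False
        if i + need >= n:                  # sequence truncated at end of input
            return False
        if not lo <= data[i + 1] <= hi:
            return False
        for j in range(i + 2, i + need + 1):
            if not 0x80 <= data[j] <= 0xBF:
                return False
        i += need + 1
    return True

# The two inputs from the example earlier in the thread.
good = bytes.fromhex("ce bc e0 b8 99 f0 90 8e b7 e2 a1 8d 0a")
bad = bytes.fromhex("2d 66 5b 1a f7 53 e3 f6 fd 47 a2 07 fc")
print(is_valid_utf8(good))  # True
print(is_valid_utf8(bad))   # False
```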

In any case I was not trying to say that Python 3 does not provide a function like this, just that it's something that does belong in the standard library.

> Do some people think that having distinct data types for array-of-arbitrary-bytes and string-of-characters is a bad thing?

Having separate types is good. However, making the more complex and restrictive string-of-characters type the default or customary form when array-of-arbitrary-bytes would be sufficient is a mistake. Duplicating all the APIs (a byte version and a Unicode version) is yet another mistake, and basing return types on the types of unrelated parameters makes it even worse. (Why assume the file _contents_ are in UTF-8 just because a Unicode string was used for the filename?) Putting aside the minority of APIs which inherently relate to Unicode, the rest should only accept and return byte arrays, leaving the conversions up to the user _if they are needed_.
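For open() specifically, a small sketch suggests the type of the contents follows the mode argument rather than the type of the filename. The temporary file and its payload are made up for illustration, and latin-1 is chosen only because it can decode any byte.

```python
import os
import tempfile

payload = b"\xff\xfe not UTF-8 \x80"

with tempfile.NamedTemporaryFile("wb", delete=False) as f:
    f.write(payload)
    path_str = f.name                  # filename as str
path_bytes = os.fsencode(path_str)     # the same filename as bytes

try:
    # str filename + binary mode: contents come back as bytes, undecoded.
    with open(path_str, "rb") as f:
        data_a = f.read()
    # bytes filename + text mode: contents are decoded to str.
    with open(path_bytes, "r", encoding="latin-1") as f:
        data_b = f.read()
finally:
    os.remove(path_str)

print(type(data_a).__name__, data_a == payload)  # bytes True
print(type(data_b).__name__)                     # str
```

So at least here, leaving the conversion to the user is a matter of passing "rb".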

Arguably, filenames should be a third type, neither byte arrays nor Unicode strings. On some platforms (e.g. Linux) they are byte arrays with a variety of encodings, most commonly UTF-8 but not restricted to valid UTF-8 sequences. On others (e.g. Windows) they are a restricted subset of Unicode (UCS-2). Some platforms (macOS) apply transformations to the strings, for normalization or case-insensitivity, so equality as bytes or as Unicode codepoints may not be the same as equality as filenames. Handling all of this portably is a difficult problem, and the solutions which are most suitable for filenames are unlikely to be applicable to strings in general.
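A rough illustration of the equality problem. The comparison function same_name is hypothetical: real filesystems apply their own normalization and case rules, which this does not try to reproduce.

```python
import unicodedata

# The same visible name in two Unicode normalization forms.
composed = "caf\u00e9.txt"      # U+00E9, precomposed e-acute
decomposed = "cafe\u0301.txt"   # 'e' followed by U+0301 combining acute

# Neither codepoint equality nor byte equality sees them as the same name.
print(composed == decomposed)                                  # False
print(composed.encode("utf-8") == decomposed.encode("utf-8"))  # False

def same_name(a, b):
    """Hypothetical comparison: equal after NFC normalization and case folding."""
    def key(s):
        return unicodedata.normalize("NFC", s).casefold()
    return key(a) == key(b)

print(same_name(composed, decomposed))        # True
print(same_name("README.TXT", "readme.txt"))  # True
```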