PEP 460 reboot

From:  Guido van Rossum <guido-AT-python.org>
To:  Python-Dev <python-dev-AT-python.org>
Subject:  PEP 460 reboot
Date:  Sun, 12 Jan 2014 15:55:23 -0800
Message-ID:  <CAP7+vJLu+jP5zO0UcDpOHwxXayyD89C0Up0MUtq-c9Q-OoiWAA@mail.gmail.com>

There's a lot of discussion about PEP 460 and I haven't read it all.
Maybe you all have already reached the same conclusion that I have. In
that case I apologize (but the PEP should be updated). Here's my
contribution:

PEP 460 itself currently rejects support for %d, AFAIK on the basis
that bytes aren't necessarily ASCII. I think that's a misunderstanding
of the intention of the bytes type.

The key reason for introducing a separate bytes type in Python 3 is to
avoid *mixing* bytes and text. This aims to avoid the classic Python 2
Unicode failure, where str+unicode fails or succeeds based on whether
str contains non-ASCII characters or not, which means it is easy to
miss in testing. Properly written code in Python 3 will fail based on
the *type* of the objects, not based on their contents. Content-based
failures are still possible, but they occur in typical "boundary"
operations such as encode/decode.
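
As a minimal sketch of the distinction above, assuming Python 3: mixing bytes and text fails on type whatever the contents are, while content-based failures show up only at a boundary such as decode(). The helper name try_concat is made up for this illustration.

```python
def try_concat(data: bytes, text: str) -> str:
    """Attempt to mix bytes and text and report what happens."""
    try:
        data + text  # Python 3 refuses this regardless of the contents
    except TypeError:
        return "TypeError"
    return "ok"


# ASCII-only and non-ASCII contents fail the same way: by type.
ascii_result = try_concat(b"abc", "def")
non_ascii_result = try_concat(b"\xff\xfe", "d\u00e9f")

# A content-based failure still exists, but only at the boundary.
try:
    b"\xff".decode("ascii")
    boundary_result = "ok"
except UnicodeDecodeError:
    boundary_result = "UnicodeDecodeError"
```

Because the failure depends on the types alone, a test that uses only ASCII data still catches the bug.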

But this does not mean the bytes type isn't allowed to have a
noticeable bias in favor of encodings that are ASCII supersets, even
if not all bytes objects contain such data (e.g. image data,
compressed data, binary network packets, and so on).

IMO it's totally fine and consistent if b'%d' % 42 returns b'42' and
also for b'{}'.format(42) to return b'42'. There are numerous places
where bytes are already assumed to use an ASCII superset:

- byte literals: b'abc' (it's a syntax error to have a non-ASCII character here)
- the upper() and lower() methods modify the ASCII letter positions
- int(b'42') == 42, float(b'3.14') == 3.14
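
The three items above can be checked directly. This is only an illustrative sketch; the variable names are arbitrary, and compile() is used so the rejected literal does not break the example file itself.

```python
# Case methods change only the ASCII letter positions; byte 0xE9 is left alone.
upper = b"abc\xe9".upper()

# Numeric parsing accepts ASCII digits stored in bytes.
as_int = int(b"42")
as_float = float(b"3.14")

# A non-ASCII character inside a bytes literal is rejected at compile time.
try:
    compile("b'\u00e9'", "<example>", "eval")
    literal_accepted = True
except SyntaxError:
    literal_accepted = False
```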

I looked through the example code I recently wrote for asyncio (which
uses bytes for all data read or written). There are several places
where I have to make a clumsy detour via text strings because I need
to include an ASCII-encoded decimal integer (e.g. the Content-Length
header) or a hex-encoded one (e.g. for Transfer-Encoding: chunked).
Those detours aren't needed for parsing because int() accepts bytes
just fine.
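
A sketch of the kind of detour meant here, without bytes formatting. The function names content_length_header and chunk are hypothetical and are not taken from the asyncio examples.

```python
def content_length_header(body: bytes) -> bytes:
    # Detour: render the integer as text, then encode it back to bytes.
    return b"Content-Length: " + str(len(body)).encode("ascii") + b"\r\n"


def chunk(data: bytes) -> bytes:
    # Chunked transfer coding: hex size, CRLF, data, CRLF.
    size = format(len(data), "x").encode("ascii")
    return size + b"\r\n" + data + b"\r\n"


header = content_length_header(b"hello world")
framed = chunk(b"x" * 26)

# Parsing needs no detour: int() accepts bytes, with or without a base.
parsed_length = int(header.split(b":")[1].strip())
parsed_size = int(framed.split(b"\r\n")[0], 16)
```

The asymmetry is the point: producing the digits requires a trip through str, while reading them back does not.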

I also note that the behavior of the re module is perfect: if the
pattern is bytes, it can only match bytes and the extracted data is
bytes, and ditto for text -- so it supports both types but doesn't
allow mixing them. The urllib module does this too -- at considerable
cost in its implementation, but it's the right thing, because there
really are good cases to be made for treating URLs as text as well as
for treating them as bytes (as with filenames, command line arguments,
and environment variables).
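
A short sketch of the "both types, no mixing" behavior, using re.search and urllib.parse.urlsplit; the sample header and URL are made up.

```python
import re
from urllib.parse import urlsplit

# A bytes pattern on bytes data yields bytes; the same goes for text.
bytes_match = re.search(rb"\d+", b"Content-Length: 42").group()
text_match = re.search(r"\d+", "Content-Length: 42").group()

# Mixing a bytes pattern with text data is refused by type.
try:
    re.search(rb"\d+", "Content-Length: 42")
    mixing = "allowed"
except TypeError:
    mixing = "TypeError"

# urllib.parse follows the same rule: bytes in, bytes out; text in, text out.
bytes_host = urlsplit(b"http://example.com/path").netloc
text_host = urlsplit("http://example.com/path").netloc
```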

I'm sad that the json module in Python 3 doesn't support bytes at all,
but at least it is consistent -- it always produces text (restricted to
ASCII characters by default). The same applies to the http module, which IIUC
adheres to the standard by treating headers as Latin-1.
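
For the json part, a small sketch of what that consistency looks like with json.dumps under its default settings; the sample strings are arbitrary.

```python
import json

# The result is always str, and non-ASCII characters are escaped by default.
encoded = json.dumps("caf\u00e9")

# Bytes values are not accepted for serialization.
try:
    json.dumps(b"cafe")
    bytes_result = "ok"
except TypeError:
    bytes_result = "TypeError"
```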

-- 
--Guido van Rossum (python.org/~guido)

