
Python 2.8?

Posted Jan 20, 2017 9:40 UTC (Fri) by ssokolow (guest, #94568)
In reply to: Python 2.8? by Cyberax
Parent article: Python 2.8?

I'd probably rewrite it in Rust or Python anyway... it'd just be nice to have a known-to-be-effective example to learn from.



Python 2.8?

Posted Jan 23, 2017 5:52 UTC (Mon) by adobriyan (subscriber, #30858) [Link] (7 responses)

> rewrite it in Rust

Ironically, Rust assumes everything is valid UTF-8 including Unix filenames.

Python 2.8?

Posted Jan 23, 2017 7:03 UTC (Mon) by liw (subscriber, #6379) [Link] (4 responses)

Rust doesn't have a concept of binary data? That would be so weird I have trouble believing it. It would mean Rust can't handle, say, a JPEG.

Python 2.8?

Posted Jan 23, 2017 11:44 UTC (Mon) by ssokolow (guest, #94568) [Link] (3 responses)

Rust has three types of "strings". From least to most "binary", they are:
  1. std::string::String
    • Guaranteed to be valid UTF-8 but uses a Vec<u8> internally.
    • May contain NULL bytes.
    • Codepoint-oriented APIs
    • Can be iterated byte-wise or codepoint-wise. Grapheme clusters or words available via the unicode-segmentation crate.
  2. std::ffi::OsString
    • Can represent any byte/codepoint sequence returned by OS-native APIs.
    • Can represent NULL bytes even if the OS-native form can't
    • Designed to be producible from String without conversion cost.
      • On POSIX platforms, in-memory representation is identical to String.
      • On Windows, uses a UTF-8 superset called WTF-8 which matches the relaxed well-formedness rules of the Windows UTF-16 APIs.
  3. Vec<u8>
    • Exactly what it looks like. A vector of unsigned 8-bit integers
    • May contain NULL bytes
    • Serves the same role as Python's "bytestrings".
    • Usable as a "string" type because Vec has "sequence of items" versions of most APIs you'd think of as being for string manipulation.
See also: std::path::{Path,PathBuf} (OsString internally)
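A minimal sketch of the three types listed above, assuming nothing beyond the standard library. The function names are made up for illustration and are not from the original post.

```rust
use std::ffi::OsString;

/// Builds one value of each "string" type from the same starting text.
pub fn three_string_types() -> (String, OsString, Vec<u8>) {
    // String: always valid UTF-8, and a NUL byte is perfectly legal in it.
    let s = String::from("caf\u{e9}\0");

    // OsString: can be built directly from a String.
    let os = OsString::from(s.clone());

    // Vec<u8>: plain bytes with no encoding guarantee at all.
    let mut bytes = s.clone().into_bytes();
    bytes.push(0xFF); // 0xFF never occurs in valid UTF-8

    (s, os, bytes)
}

/// Iterates a string byte-wise and codepoint-wise and returns both counts.
pub fn byte_and_char_counts(s: &str) -> (usize, usize) {
    (s.bytes().count(), s.chars().count())
}
```

Here the String holds 6 bytes but 5 codepoints, and trying to turn the Vec<u8> back into a String with String::from_utf8 fails because of the trailing 0xFF byte.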

Python 2.8?

Posted Jan 23, 2017 12:20 UTC (Mon) by mathstuf (subscriber, #69389) [Link] (2 responses)

Um, I know it was mentioned below, but I want the record straight on this subthread too.

> On POSIX platforms, In-memory representation is identical to String.

No, it can represent arbitrary bytes.

> On Windows, uses a UTF-8 superset called WTF-8 which matches the relaxed well-formedness rules of the Windows UTF-16 APIs.

It actually uses WTF-16 internally, a superset of UTF-16.

Right from the docs at the OsString link you gave:

On Unix systems, strings are often arbitrary sequences of non-zero bytes, in many cases interpreted as UTF-8.

On Windows, strings are often arbitrary sequences of non-zero 16-bit values, interpreted as UTF-16 when it is valid to do so.

In Rust, strings are always valid UTF-8, but may contain zeros.

Python 2.8?

Posted Jan 23, 2017 12:51 UTC (Mon) by ssokolow (guest, #94568) [Link] (1 responses)

> No, [OsString] can represent arbitrary bytes.

I never said it couldn't. On POSIX platforms, String and OsString both rely on an internal Vec<u8> to store the raw bytes... String just enforces extra invariants and presents a different set of methods.

That's what I meant when I said they had the same in-memory representation. You can even use the to_str method to get an &str pointing at the contents of the OsString as long as it's valid UTF-8.
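A rough sketch of that point, with hypothetical function names. The pointer comparison reflects how the current standard library behaves (the String-to-OsString conversion is documented as not allocating or copying), but the shared buffer should be read as an observation rather than a guarantee.

```rust
use std::ffi::OsString;

/// Returns true if the &str borrowed from the OsString points at the
/// same buffer the original String owned (i.e. nothing was copied).
pub fn same_buffer_after_conversion(s: String) -> bool {
    let before = s.as_ptr();
    let os = OsString::from(s); // takes over the String's buffer
    match os.to_str() {
        Some(view) => view.as_ptr() == before,
        None => false, // cannot happen here: the input was a valid String
    }
}

/// Moves a String into an OsString and back again.
/// into_string() hands back Err(original) if the contents are not valid Unicode.
pub fn round_trip(s: String) -> Result<String, OsString> {
    OsString::from(s).into_string()
}
```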

> It actually uses WTF-16 internally, a superset of UTF-16.

I looked at the up-to-date source to the standard library before I made my post.

OsString uses an inner type called sys::os_str::Buf to actually store the string and, on Windows, that's an API wrapper around sys_common::wtf8::Wtf8Buf which, again, uses a Vec<u8> internally.

(The key detail is that, when Windows UTF-16 is invalid UTF-8, it's not because it encodes codepoints that UTF-8 can't represent, it's that it violates higher-level rules about surrogates never occurring in isolation or out of order. After the transformation process is complete, UTF-16 and UTF-8 are simply two different ways to map 21-bit numbers into sequences of bytes, so it's trivial for WTF-8 to represent an arbitrary sequence of numbers 21 bits or narrower just by omitting some of the rules checks that a UTF-8 codec enforces.)
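To make the parenthetical concrete, here is a sketch of a "UTF-8 without the surrogate check" encoder. It is not the standard library's Wtf8Buf code, and the function name is hypothetical; it only illustrates that the UTF-8 bit layout has room for surrogate values once the well-formedness check is dropped. (Real WTF-8 additionally requires that a properly paired surrogate be stored as the single supplementary codepoint it stands for.)

```rust
/// Encodes any value up to 0x10FFFF using the UTF-8 bit layout,
/// deliberately skipping the "no surrogates" rule a strict encoder enforces.
pub fn encode_generalized_utf8(cp: u32) -> Option<Vec<u8>> {
    // Every value passed to `b` below already fits in 8 bits.
    let b = |v: u32| v as u8;
    match cp {
        0..=0x7F => Some(vec![b(cp)]),
        0x80..=0x7FF => Some(vec![b(0xC0 | (cp >> 6)), b(0x80 | (cp & 0x3F))]),
        // This range includes the surrogates 0xD800..=0xDFFF.
        0x800..=0xFFFF => Some(vec![
            b(0xE0 | (cp >> 12)),
            b(0x80 | ((cp >> 6) & 0x3F)),
            b(0x80 | (cp & 0x3F)),
        ]),
        0x10000..=0x10FFFF => Some(vec![
            b(0xF0 | (cp >> 18)),
            b(0x80 | ((cp >> 12) & 0x3F)),
            b(0x80 | ((cp >> 6) & 0x3F)),
            b(0x80 | (cp & 0x3F)),
        ]),
        _ => None, // wider than 21 bits' worth of Unicode
    }
}
```

For ordinary codepoints the output matches real UTF-8. For a lone surrogate such as 0xD800 it produces the three bytes ED A0 80, which std::str::from_utf8 rejects.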

> Right from the docs at the OsString link you gave

I actually started out with that exact quote but edited it out for brevity.

Python 2.8?

Posted Jan 23, 2017 20:34 UTC (Mon) by mathstuf (subscriber, #69389) [Link]

Ah, yeah about the "backing store"; I shouldn't comment too early in the morning.

The WTF-8 versus WTF-16 thing is news to me. Maybe it was communicated poorly when I first heard about it or it has changed since then.

Thanks for the clarifications.

Python 2.8?

Posted Jan 23, 2017 11:12 UTC (Mon) by Jonno (subscriber, #49613) [Link] (1 responses)

> Ironically, Rust assumes everything is valid UTF-8 including Unix filenames.

Rust does not. While Rust Strings are always UTF-8, Rust OsStrings are not. They are "arbitrary sequences of non-zero bytes" (on Unix) or "arbitrary sequences of non-zero 16-bit values" (on Windows).

Directory listings use OsStrings, not Strings, for filename components, and File::open() will accept anything from which Rust knows how to build a path, including both Strings *and* OsStrings.

There are convenience methods to convert an OsString to a String (which will fail if the OsString does not contain valid Unicode), as well as to convert a String to an OsString (which will fail if the String contains any "U+0000 NULL" characters), but there is no requirement that you use them.

In fact, in most circumstances you should not. Keep the OsString for path manipulations, and if you need a pretty UTF-8 string to show the user, use the heavier OsString::to_string_lossy() method to get a string with any invalid Unicode sequences replaced with "U+FFFD REPLACEMENT CHARACTER".
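A small sketch of that advice, assuming a Unix or Windows target. The helper names and the sample file name are invented for illustration.

```rust
use std::borrow::Cow;
use std::ffi::{OsStr, OsString};

/// Produces a printable name; invalid sequences become U+FFFD.
/// Borrows when the name is already valid Unicode, allocates otherwise.
pub fn display_name(name: &OsStr) -> Cow<'_, str> {
    name.to_string_lossy()
}

/// A file name that is legal for the OS but is not valid Unicode.
#[cfg(unix)]
pub fn non_unicode_name() -> OsString {
    use std::os::unix::ffi::OsStringExt;
    // 0xFF can never appear in valid UTF-8.
    OsString::from_vec(vec![b'f', 0xFF, b'o'])
}

/// Same idea on Windows: 0xD800 is an unpaired surrogate.
#[cfg(windows)]
pub fn non_unicode_name() -> OsString {
    use std::os::windows::ffi::OsStringExt;
    OsString::from_wide(&[0x66, 0xD800, 0x6F])
}
```

The OsString itself stays untouched and can still be passed to File::open(); only the copy made for display has the offending unit replaced.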

Python 2.8?

Posted Jan 23, 2017 12:22 UTC (Mon) by ssokolow (guest, #94568) [Link]

Actually, OsString is a superset of String and whatever the OS offers. It'll carry NULL characters just fine.

Here's a Rust Playground link demonstrating that.
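The contents of that playground are not reproduced here. As a stand-in, this is a sketch of what such a demonstration might look like, with hypothetical function names. The CString check is an addition to show where an embedded NUL does become a problem: at the boundary with C APIs.

```rust
use std::ffi::{CString, OsString};

/// Builds an OsString from a String that contains an embedded NUL.
/// The conversion is infallible, so there is nothing to unwrap.
pub fn os_string_with_nul() -> OsString {
    OsString::from(String::from("before\0after"))
}

/// CString is the type that rejects interior NULs, not OsString.
pub fn fits_in_c_string(s: &str) -> bool {
    CString::new(s).is_ok()
}
```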

