
DeVault: Announcing the Hare programming language


Posted May 2, 2022 18:43 UTC (Mon) by bartoc (guest, #124262)
In reply to: DeVault: Announcing the Hare programming language by excors
Parent article: DeVault: Announcing the Hare programming language

C++ has char8_t, which is utf-8 or UB and also pointers to it don't alias char*s (or anything else). This means constructing them from a char* that you already know is valid utf-8 is... a memcpy. They are a complete and utter mess. C has such a type too, and it's just as big of a mess there too.

Python’s PEP 383 (the surrogateescape scheme) is actually a pretty cool scheme, but is a bit “nuts”.



DeVault: Announcing the Hare programming language

Posted May 3, 2022 10:41 UTC (Tue) by peter-b (guest, #66996) [Link] (2 responses)

> C++ has char8_t, which is utf-8 or UB and also pointers to it don't alias char*s (or anything else).

To my enormous displeasure, it isn't UB for a char8_t* to point to a string buffer that doesn't contain UTF-8.

> They are a complete and utter mess.

I can absolutely confirm this.

DeVault: Announcing the Hare programming language

Posted May 3, 2022 11:07 UTC (Tue) by mathstuf (subscriber, #69389) [Link] (1 responses)

> it isn't UB for a char8_t* to point to a string buffer that doesn't contain UTF-8.

One reason that comes to my mind is that it would need to eschew `u8ptr++` because if it currently points at a code point encoded with multiple bytes, incrementing one byte would make it UB, no? Or would `++` inspect the byte encoded length of the current code point and jump an appropriate amount? That certainly seems novel too.

DeVault: Announcing the Hare programming language

Posted May 5, 2022 0:50 UTC (Thu) by tialaramex (subscriber, #21167) [Link]

Because both of the increment operators in C++ can be overloaded for a type, this isn't difficult to do, if C++ wanted to do it.

But I don't expect C++ to actually do that lifting for the same reason it still has both these silly operators in the first place. Backward compatibility trumps all other considerations.

Rust's str actually provides both iterations. If you want the underlying *bytes* you can have those; of course, individual bytes are just UTF-8 code units and don't necessarily mean anything on their own. But if you want "characters" (Rust's char) you can iterate over those, and under the hood it is indeed moving forward the appropriate number of bytes each time.
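As a rough illustration of the two kinds of iteration discussed here, the following C++ sketch wraps a byte pointer in a small cursor type whose overloaded `++` steps over a whole encoded code point, next to an ordinary byte count. The names `utf8_sequence_length`, `CodePointCursor`, `count_code_points` and `count_bytes` are made up for this example, the input is assumed to be valid NUL-terminated UTF-8, and `unsigned char` is used instead of `char8_t` so that it builds before C++20. It is not how any standard library or Rust implements this.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: length in bytes of the UTF-8 sequence that starts with
// `lead`. Assumes well-formed input; anything unexpected is treated as 1 byte.
inline std::size_t utf8_sequence_length(unsigned char lead) {
    if (lead < 0x80) return 1;            // 0xxxxxxx: ASCII
    if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 1;                             // continuation or invalid lead byte
}

// Hypothetical cursor over valid, NUL-terminated UTF-8.
struct CodePointCursor {
    const unsigned char* p;

    // Pre-increment skips one whole code point rather than one byte.
    CodePointCursor& operator++() {
        assert(*p != 0 && "must not advance past the terminator");
        p += utf8_sequence_length(*p);
        return *this;
    }

    bool at_end() const { return *p == 0; }
};

// Counts code points by repeatedly applying the overloaded ++.
inline std::size_t count_code_points(const char* s) {
    CodePointCursor c{reinterpret_cast<const unsigned char*>(s)};
    std::size_t n = 0;
    while (!c.at_end()) {
        ++c;
        ++n;
    }
    return n;
}

// Counts code units (bytes) with a plain index, for comparison.
inline std::size_t count_bytes(const char* s) {
    std::size_t n = 0;
    while (s[n] != '\0') ++n;
    return n;
}
```

For a string such as `"h\xC3\xA9llo"` the two counts differ, because the two-byte sequence is one step for the cursor and two steps for the byte loop.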

DeVault: Announcing the Hare programming language

Posted May 3, 2022 15:02 UTC (Tue) by NYKevin (subscriber, #129325) [Link] (5 responses)

> C++ has char8_t, which is utf-8 or UB and also pointers to it don't alias char*s (or anything else). This means constructing them from a char* that you already know is valid utf-8 is... a memcpy. They are a complete and utter mess. C has such a type too, and it's just as big of a mess there too.

char* can alias anything, so I would be very surprised if they made a special exception for char8_t... are you absolutely sure that it really is UB to alias them?

DeVault: Announcing the Hare programming language

Posted May 3, 2022 19:39 UTC (Tue) by foom (subscriber, #14868) [Link] (4 responses)

You are allowed to access any object as bytes via a `char*`, but the opposite -- accessing a char-array as some other type -- is (surprisingly!!) not obviously allowed. See e.g. this discussion, https://stackoverflow.com/questions/12612488/aliasing-t-w...

DeVault: Announcing the Hare programming language

Posted May 4, 2022 1:37 UTC (Wed) by NYKevin (subscriber, #129325) [Link] (3 responses)

Of course that's not allowed. Then there would effectively be no aliasing rule at all, because "X can be aliased as Y" is generally understood to be a transitive relationship.[1] If you can alias anything to char* and then alias char* back to anything else, then you can alias anything to anything else and all bets are off.

Anyway, any halfway-decent compiler will identify the memcpy as redundant and optimize it out, so you should just bite the bullet and call memcpy.

[1]: It has to be transitive. The alternative would be "I have a Y*, but I don't know whether it can be aliased as a Z*, because someone else gave it to me, and I don't know whether it was originally a Y* or originally an X* that they aliased to a Y*."
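A minimal sketch of the memcpy route, for bytes the caller already knows to be valid UTF-8. The helper name `to_utf8_units` and the alias `utf8_unit` are invented for this example; the alias falls back to `unsigned char` when the compiler does not provide `char8_t`. Nothing here validates the input, and whether the copy is optimized away depends on the compiler and the surrounding code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

#if defined(__cpp_char8_t)
using utf8_unit = char8_t;        // C++20 and later
#else
using utf8_unit = unsigned char;  // stand-in so the sketch builds pre-C++20
#endif

static_assert(sizeof(utf8_unit) == 1, "one code unit must be one byte");

// Hypothetical helper: copies `len` bytes of already-validated UTF-8 into
// storage whose element type really is utf8_unit. No validation is performed.
inline std::vector<utf8_unit> to_utf8_units(const char* data, std::size_t len) {
    assert(data != nullptr || len == 0);
    std::vector<utf8_unit> out(len);
    if (len != 0) {
        // memcpy copies the object representation byte for byte, so no
        // pointer to utf8_unit ever refers to the original char storage.
        std::memcpy(out.data(), data, len);
    }
    return out;
}
```

The result owns its storage, so the original char buffer can be freed or reused afterwards.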

DeVault: Announcing the Hare programming language

Posted May 4, 2022 9:09 UTC (Wed) by excors (subscriber, #95769) [Link] (2 responses)

I don't think it's meaningful to talk about transitivity of aliasing, because aliasing is not specified as a relationship between two things of the same kind. It's a relationship between the type of an object, and the type of the pointer being used to access it. (In particular it's not a relationship between two pointers, so it can't be transitively extended to a third pointer.)

If I understand correctly, C++20 says the type of an object is determined by its definition, or by (placement) new, or by assignment to a union, etc, or objects can be "implicitly created" by malloc()/memcpy()/etc and the compiler will act as if it magically determined the type of those objects in order to avoid undefined behaviour when they are subsequently used.

You can access an object of any type T through a pointer to char. You can't access an object of type char through a pointer to T, unless T is char (or similar). That's the easy part.
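The permitted direction can be sketched as follows: the bytes of an arbitrary object are read through a pointer to `unsigned char`. The function name `byte_sum` is hypothetical and exists only for this illustration. The reverse direction (reading a char array through a `T*`) is deliberately not shown.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: adds up the bytes of an object's representation.
// Reading through `const unsigned char*` is allowed for an object of any type.
template <typename T>
unsigned byte_sum(const T& value) {
    const unsigned char* bytes = reinterpret_cast<const unsigned char*>(&value);
    unsigned sum = 0;
    for (std::size_t i = 0; i < sizeof(T); ++i) {
        sum += bytes[i];  // each read goes through the character-type pointer
    }
    return sum;
}
```

Summing makes the result independent of byte order, so the same value comes out on little-endian and big-endian machines.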

I think the tricky part is: When does an object have type char? C++20 says "An operation that begins the lifetime of an array of char, unsigned char, or std::byte implicitly creates objects within the region of storage occupied by the array". I think that means "char buf[256];" and "char *buf = new char[256];" are implicitly creating objects of an as-yet-unknown type and size. Regardless of the type, you can access it through char*. If you subsequently access it through char8_t*, that means the type of the implicitly created objects must have been char8_t (to avoid undefined behaviour here), so accessing through char8_t* is also fine.

That means there should be no need to memcpy into an array that was explicitly declared as char8_t; you can just cast the char pointer, because it's really a pointer to char8_t objects even though you declared it and allocated it and used it as char.

(But as is often the case with C++, I'm only about 60% confident in that interpretation.)

DeVault: Announcing the Hare programming language

Posted May 5, 2022 0:32 UTC (Thu) by foom (subscriber, #14868) [Link] (1 responses)

"Technically", I think it might only be defined in C++20 and later, via the change in <https://wg21.link/p0593> -- though, to the best of my knowledge, no compiler made any changes to become compliant with these changes.

DeVault: Announcing the Hare programming language

Posted May 6, 2022 0:38 UTC (Fri) by khim (subscriber, #9252) [Link]

Without that change it's not really possible to use mmap (or the Windows equivalent), so it was just a matter of fixing the standard. None of the compilers ever broke any code which does these things.

The only thing which may trigger UB there is unaligned access (and yes, it may even happen with x86, since the compiler can do autovectorization and use SSE instructions).
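One common way to sidestep the alignment problem is to copy the bytes into a properly aligned local instead of dereferencing a cast pointer. The sketch below assumes a hypothetical helper named `load_u32`; it illustrates the general idiom and is not code from any of the projects mentioned in this thread.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical helper: reads a 32-bit value that may start at any byte
// offset. memcpy has no alignment requirement on its source, unlike a
// dereference of a std::uint32_t* that points at a misaligned address.
inline std::uint32_t load_u32(const unsigned char* p) {
    assert(p != nullptr);
    std::uint32_t v;
    std::memcpy(&v, p, sizeof v);  // v itself is correctly aligned
    return v;
}
```

The value is returned in the machine's native byte order, so callers that parse a fixed wire format still need to handle endianness themselves.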


Copyright © 2026, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds