|
|
Log in / Subscribe / Register

DeVault: Announcing the Hare programming language

DeVault: Announcing the Hare programming language

Posted May 10, 2022 16:39 UTC (Tue) by farnz (subscriber, #17727)
In reply to: DeVault: Announcing the Hare programming language by khim
Parent article: DeVault: Announcing the Hare programming language

When you say "if pointers are equal, then they are completely equivalent", are you talking at a single point in time, or over the course of execution of the program?

Given, for example, the following program, it is a fault to assume that ptr1 and ptr2 are equivalent throughout the runtime, because ptr1 is invalidated by the call to release_page:


handle page_handle = get_zeroed_page();
int test;
int *ptr2;
int *ptr1 = get_ptr_to_handle(page_handle);
*ptr1 = -1; // legitimate - ptr1 points to a page, which is larger than an int in this case and correctly aligned etc.
test = *ptr1; // makes test -1
release_page(page_handle);
page_handle = get_zeroed_page();
ptr2 = get_ptr_to_handle(page_handle); // ptr2 could have the same numeric value as ptr1.
if (ptr2 == ptr1 && *ptr1 == test) {
    puts("-1 == 0");
} else {
    puts("-1 != 0");
}
release_page(page_handle);

This is the sort of code that you need to be clear about; C89's language leaves it unclear whether it's legitimate to assume that *ptr1 == test, even though the only assignments in the program are to *ptr1 (setting it to -1) and test. The thing that hurts here is that even if, in bitwise terms including hidden CHERI bits etc, ptr1 == ptr2, it's possible for the underlying machine to change state over time, and any definition of "completely equivalent" has to take that into account.

One way to handle that is to say that even though the volatile keyword does not appear anywhere in that code snippet, you give dereferencing a pointer volatile-like semantics (basically asserting that locations pointed to can change outside the changes done by the C program), and say that each time it's dereferenced, it could be referring to a new location in physical memory. In that case, this program cannot print "-1 == 0", because it has to dereference ptr1 to determine that.

Another is to follow the letter of the C89 spec, which says that the only things that can change in the abstract machine's view of the world other than via a C statement are things marked volatile. In that case, this program is allowed to print "-1 == 0" or "-1 != 0" depending on whether ptr1 == ptr2, because the implementation "knows" that it is the only thing that can assign a value to *ptr1, and thus it "knows" that because no-one has assigned through *ptr1 since it read the value to get test it is tautologically true that *ptr1 == test.

Both are valid readings of this source under the rules set by C89, because C89 states explicitly that the only thing expected to change outside of explicit changes done by C code are things marked as volatile. But in this case, the get_zeroed_page and release_page pair change the machine state in a fairly dramatic way, but in a way that's not visible to C code - changing PTEs, for example.

And that's the fundamental issue with rewinding to C89 rules - C89 implies very strongly that the only interface between things running on the "abstract C89 machine" and the real hardware are things that are marked as volatile in the C abstract machine. In practice, nobody has bothered being that neat, and we accept that there's a whole world of murky, underdefined behaviour where the real hardware changes things that affect the behaviour of the C abstract machine, but it happens that C compilers have not yet exploited that.

Note, too, that I wasn't talking about optimization in either case - I'm simply looking at the semantics of the C abstract machine as defined in C89, and noting that they're not powerful enough to model a change that affects the abstract machine but happens outside it. I find it very tough, within the C89 language, to find anything that motivates the position that *ptr1 != test given that ptr2 == ptr1 and *ptr2 != test - it's instead explicitly undefined behaviour.


to post comments

DeVault: Announcing the Hare programming language

Posted May 10, 2022 17:00 UTC (Tue) by khim (subscriber, #9252) [Link] (2 responses)

> Both are valid readings of this source under the rules set by C89, because C89 states explicitly that the only thing expected to change outside of explicit changes done by C code are things marked as volatile.

No. Calling fread and fwrite can certainly change things, too.

> But in this case, the get_zeroed_page and release_page pair change the machine state in a fairly dramatic way, but in a way that's not visible to C code - changing PTEs, for example.

Yes and no. Change is dramatic, sure. But it's most definitely visible to C code.

By necessity such things have to either be implemented with volatile or by calling system routine (which must be added to the list of functions like fread and fwrite as system extension, or else you couldn't use them them).

Place where you pass your pointer to the invlpg would be place where compiler would know that object may suddenly change value.

> In practice, nobody has bothered being that neat, and we accept that there's a whole world of murky, underdefined behaviour where the real hardware changes things that affect the behaviour of the C abstract machine, but it happens that C compilers have not yet exploited that.

In practice people who are doing these things have to use volatile at some point in the kernel, or else it just wouldn't work. Thus I don't see what you are trying to prove.

The fact that real OSes have to expand list of “special” functions which may do crazy things? It's obvious. In practice your functions are called mmap and munmap and they should be treated by compiler similarly to read and write: compiler either have to know what they are doing or it should assume they may touch and change any object they can legitimately refer given their arguments.

> I find it very tough, within the C89 language, to find anything that motivates the position that *ptr1 != test given that ptr2 == ptr1 and *ptr2 != test - it's instead explicitly undefined behaviour.

No. You couldn't do things like change to PTEs in a fully portable C89 program. It's pointless to talk about such programs since they don't exist.

The only way to do it is via asm and/or call to system routine which, by necessity, needs extensions to C89 standard to be usable. In both cases everything is fully defined.

DeVault: Announcing the Hare programming language

Posted May 10, 2022 18:10 UTC (Tue) by farnz (subscriber, #17727) [Link] (1 responses)

fread and fwrite are poor examples, because they are C code defined in terms of the state change they make to the abstract machine, and with a QoI requirement that the same state change happens to the real machine. Indeed, everything that's defined in C89 has its impact on the abstract machine fully defined by the spec; the only get-out is that volatile marks something where all reads and writes through it must be visible in the real machine in program order.

But note that this is a very minimal promise; the only thing happening in the real machine that I can reason about in C89 is the program order of accesses to volatiles. Nothing else that happens in the abstract machine is guaranteed to be visible outside it - everything else is left to the implementation's discretion.

And no, the state change is not visible inside the C89 abstract machine; if I write through a volatile pointer to a PTE, the implementation must ensure that my write happens in the real machine as well as the abstract machine, but it does not have to assume that anything has changed in the abstract machine. That, in turn, means that it may not know that ptr1 now has changed in the "real" machine, because it's not volatile and thus changes in the real machine are irrelevant.

And I absolutely can change a PTE without assembly or a system routine, using plain C code; all I need is something that gives me the address of the PTE I want to change. Now, depending on the architecture, that almost certainly is not enough to guarantee an instant change - e.g. on x86, the processor can use old or new value of the PTE until the TLB is flushed, and I can force a TLB flush with invlpg to get deterministic behaviour - but I can bring the program into a non-deterministic state without calling into an assembly block or running a system routine, as long as I have the address of a PTE.

And there's no "list of system routines" in C89; the behaviour of fread, fwrite and other such functions is fully defined in the abstract machine by the spec, with a QoI requirement to have their behaviour be reflected in the "real" machine. By invoking the idea of a "list of system routines", you're extending the language beyond C89.

You're making the same mistake a lot of people make, of assuming that the behaviour of compilers in the early 1990s and earlier reflected the specification at the time, and wasn't just a consequence of limited compiler technology. If compilers really did implement C89 to the letter of the specification, then much of what makes C useful wouldn't be possible; provenance is not something that's new, but rather an attempt to let people do all the tricks like bit-stuffing into aligned pointers (which is technically UB in C89) while still allowing the compiler to reason about the meaning of your code in a way compatible with the C89 specification.

DeVault: Announcing the Hare programming language

Posted May 10, 2022 19:01 UTC (Tue) by khim (subscriber, #9252) [Link]

> By invoking the idea of a "list of system routines", you're extending the language beyond C89.

Which is the only way to have PTEs in C code.

> And I absolutely can change a PTE without assembly or a system routine, using plain C code; all I need is something that gives me the address of the PTE I want to change.

If you have such an address then you have to extend the language somehow. Or, alternatively, don't touch it.

> By invoking the idea of a "list of system routines", you're extending the language beyond C89.

Of course. Because it's impossible to write C89 program which changes the PTEs, such a concept just couldn't exist in it. You have to extend the language to cover that usecase.

> If compilers really did implement C89 to the letter of the specification

…then such compilers would have been as popular as ISO 7185. Means: no one would have cared about these and no one would have mentioned their existence.

> If compilers really did implement C89 to the letter of the specification, then much of what makes C useful wouldn't be possible

Yes. But some programs would still be possible. Programs which do tricks with pointers would work just fine, programs which touch PTEs wouldn't.

> provenance is not something that's new, but rather an attempt to let people do all the tricks like bit-stuffing into aligned pointers (which is technically UB in C89)

Citation needed. Badly. Because, I would repeat once more, in C89 (not in C99 and newer) the notion of “pointers which have the same bitpattern yet different” doesn't exist. If you add a few bits to the pointer converted to integer and then clear these same bits you would get the exact same pointer — guaranteed. The fact that these bits are lowest bits of converted integer is implementation-specific thing, you can imagine a case where these would live as top bits, e.g. So yet, that requires some implementation-specific extension. But pretty mild and limited.

Yes, provenance is an attempt to deal with the idea of C99+ that some pointers may be equal to others yet, somehow, still different — but that's not allowed in C89. If two pointers are equal then they are either both null pointers, or both point to the same object, end of story.

Sure, it makes some optimizations hard and/or impossible. So what? This just means that you cannot do such optimizations in C89 mode. Offer -fno-provenance option, enable it for -std=c89, done.

DeVault: Announcing the Hare programming language

Posted May 10, 2022 17:26 UTC (Tue) by Vipketsh (guest, #134480) [Link] (5 responses)

You are playing a nice slight-of-hand here. Your code clearly shows function calls, and without knowing what's in those called functions, one has no choice but to assume that *ptr1 has changed (within the C abstract machine), and thus you can never get to the output of "-1 == 0". If, on the other hand, you do see the internals of those functions you will see the modification of some memory that may alias with ptr1 and then because of that you can not get the "-1 == 0" output.

The only way I can see your reasoning working is if you are somehow allowed to assume that function calls are an elaborate way of saying "nop".

DeVault: Announcing the Hare programming language

Posted May 10, 2022 18:36 UTC (Tue) by farnz (subscriber, #17727) [Link] (4 responses)

If I promise that the function calls are just naming what code does, but it's real behaviour is poking global volatile pointers, and those functions are implemented in pure C89, there's no difference in behaviour. Given the following C definitions of get_zeroed_page, get_ptr_to_handle and release_page, you still have the non-determinism, albeit I've introduced a platform dependency:


const size_t PAGE_SIZE;
struct pt_entry {
    volatile char *v_addr;
    volatile char *p_addr;
}
volatile struct pt_entry *pte; // Initialized by outside code, with suitable guarantees on v_addr for the compiler implementation and on p_addr
int *page_location = pte->v_addr;

void *get_zeroed_page() {
   pte->p_addr += PAGE_SIZE;
   memset(pte->v_addr, 0, PAGE_SIZE);
   return pte->v_addr;
}

void release_page(void *handle) {
  assert(handle == pte->v_addr);
  pte->p_addr -= PAGE_SIZE;
}

void *get_ptr_to_handle(void* handle) {
  assert(handle == pte->v_addr);
  return page_location;
}

This has semantics on the real machine, because of the use of volatile - the writes to *pte are guaranteed to occur in program order. But the compiler does not have any way to know that volatile int *v_addr ends up with the same value between two separate calls to get_ptr_to_handle but points to different memory.

Also, I'd note that C89 does not have language asserting what you claim - it actually says quite the opposite, that the compiler does not have to assume that *ptr1 has changed within the C abstract machine, since ptr1 is not volatile. It's just that early implementations made that assumption because to do otherwise would require them to analyse not just the function at hand, but also other functions, to determine if *ptr1 could change. Like khim, you're picking up on a limitation of 1980s and early 1990s compilers, and assuming it's part of the language as defined, and not merely an implementation detail.

DeVault: Announcing the Hare programming language

Posted May 10, 2022 19:22 UTC (Tue) by Vipketsh (guest, #134480) [Link] (2 responses)

I still don't see it. You have that memset here:

> void *get_zeroed_page()
> [...]
> memset(pte->v_addr, 0, PAGE_SIZE);

If you don't have to assume that this writes over data pointed to by some other pointer it means that your aliasing rules say that no two pointers alias. Or put another way, for all practical purposes, having two pointers pointing to the same thing is unworkable. By some reading of C89 that may be the conclusion, but quite clearly that was never the intent and exactly no one expects things to work that way (including compiler writers, oddly enough).

> compiler does not have to assume that *ptr1 has changed within the C abstract machine

You mean across a function call ? That quite simply means that exactly no data could ever be shared by any two functions (in different compile units). Again, this would make the language completely unworkable and be counter what anyone expects.

> [...] you're picking up on a limitation of 1980s and early 1990s compilers, and assuming it's part of the language as defined, and not merely an implementation detail.

No. The language is defined, first and foremost, by what existing programs expect. If the standard allows interpretations and compilers to do things counter to what a majority of these programs expect, it is the standard that is broken and not the majority of all programs. I firmly believe that the job of a standard is to document existing behaviour and not to be a tool to change all programmes out there.

p.s.: I find it fascinating that instead of arguing about actual behaviour the C standard keeps coming up as if it where a bible handed down by some higher power and everything in it is completely infallible. Then the conclusion is that "See? It all sucks, so use Rust" because Rust is so excruciatingly well specified that, last I checked, it has no specification at all.

DeVault: Announcing the Hare programming language

Posted May 10, 2022 20:10 UTC (Tue) by khim (subscriber, #9252) [Link]

> Then the conclusion is that "See? It all sucks, so use Rust" because Rust is so excruciatingly well specified that, last I checked, it has no specification at all.

Rust hasn't needed any specs because till very recently there was just one compiler. Today we have 1.5: LLVM-based rustc and GCC-based rustc. One more is in development, thus I assume formal specs would be higher on list of priorities now.

This being said IMNSHO it's better to not have specs rather than have ones which are silently violated by actual compilers. At least when there are no specs you know that discussions between compiler developers and language users have to happen, when you have one which is ignored…

DeVault: Announcing the Hare programming language

Posted May 10, 2022 23:48 UTC (Tue) by tialaramex (subscriber, #21167) [Link]

Rust doesn't have anything similar to ISO/IEC 14882:2020, a large written document which is the product of a lot of work but is of limited practical value since it doesn't describe anything that actually exists today.

However, Rust does extensively document what is promised (and what is not) about the Rust language and its standard library, and especially the safe subset which Rust programmers should (and most do) spend the majority of their time working with.

For example, all that ISO document has to say about what happens if I've got two byte-sized signed integers which may happen to have the value 100 in them and I add them together is that this is "Undefined Behaviour" and offers no suggestions as to what to do about that besides try to ensure it never happens. In Rust the "no specification" tells us that this will panic in debug mode, but, if it doesn't panic (because I'm not in debug mode and I didn't enable this behaviour in release builds) it will wrap, to -56. I don't know about you, but I feel like "Absolutely anything might happen" is less specific than "The answer is exactly -56".

Rust also provides plenty of alternatives, including checked_add() unchecked_add() wrapping_add() saturating_add() and overflowing_add() depending on what you actually mean to happen for overflow, as well as the type wrappers Saturating and Wrapping which are useful here (e.g. Saturating<i16> is probably the correct type for a 16-bit signed integer used to represent CD-style PCM audio samples)

DeVault: Announcing the Hare programming language

Posted May 10, 2022 23:48 UTC (Tue) by nybble41 (subscriber, #55106) [Link]

The access to volatile memory at pte->v_addr through the non-volatile pointer ptr1 is UB because according to [C89]:

> If an attempt is made to refer to an object defined with a volatile-qualified type through use of an lvalue with non-volatile-qualified type, the behavior is undefined.[57]

> [57] This applies to those objects that behave as if they were defined with qualified types, even if they are never actually defined as objects in the program (such as an object at a memory-mapped input/output address).

Objects in the page at pte->v_addr behave as if they were defined as volatile objects because the content changes in ways not described by the C abstract machine when pte->p_addr is updated. The same applies to passing a pointer to volatile object(s) to memset(), which takes a non-volatile pointer.

The initializer for page_location (pte->v_addr) is also not a constant, but I assume this is just pseudo-code for the value being set by some initialization function not shown here.

[C89] http://port70.net/~nsz/c/c89/c89-draft.html


Copyright © 2026, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds