Aug 11, 2017 • Internship, RustPermalink

Types as Contracts: Implementation and Evaluation

Some weeks ago, I described Types as Contracts as an approach for how to go about defining Rust’s aliasing-related undefined behavior. One key property of this approach is that it is executable, which means we can actually have a program (in this case, miri) tell us whether some particular Rust test case exhibits undefined behavior or not. I have since then spent most of my time completing and refining this implementation, and running it on miri’s test suite to exercise various bits of the standard library and see whether they are actually following the rules I have been suggesting.

This post is my report on what I found, and how the approach itself changed in response to these findings. It also explains how you can run miri yourself and have it check your code for undefined behavior according to this model. Finally, I discuss some of the things that I would like to look at going forward.

1 Rule Violations in libstd and miri’s Test Suite

Unsurprisingly, some of miri’s test cases started to fail once I properly implemented the rules I wanted to have checked. Some of these turned out to be compiler bugs, but others are actually examples of code violating the proposed rules. In the following, I describe the patterns that I found. I assume that you are familiar with Types as Contracts as introduced in my previous post.

1.1 Ownership Passing Via Raw Pointers

I already mentioned this in the original post: mem::swap triggers a lock violation in the strictest interpretation of the model.

// Defined elsewhere: unsafe fn ptr::swap_nonoverlapping<T>(x: *mut T, y: *mut T) { ... }

pub fn swap<T>(x: &mut T, y: &mut T) {
    Validate(Acquire, [x, y]);
    let x = x as *mut T;
    let y = y as *mut T;
    unsafe {
        Validate(Release, [x, y]);
        ptr::swap_nonoverlapping(x, y, 1);
    }
}

The problem is that ptr::swap_nonoverlapping takes a raw pointer, so swap sees no reason to release its write locks on the memory covered by x and y. That makes it UB for ptr::swap_nonoverlapping to access that memory.

Raw pointers sneakily transferring ownership cannot just lead to functions accessing memory that the model does not allow them to access, it can also lead to overlapping mutable references. We can see this in [T]::split_at_mut:

fn split_at_mut(&mut self, mid: usize) -> (&mut [T], &mut [T]) {
    let len = self.len();
    let ptr = self.as_mut_ptr();

    unsafe {
        assert!(mid <= len);

        (from_raw_parts_mut(ptr, mid),
         from_raw_parts_mut(ptr.offset(mid as isize), len - mid))
    }
}

The problem is that when from_raw_parts_mut(ptr, mid) returns a &mut [T], that overlaps with self, so we have two aliasing mutable references in scope. This makes acquire-validation of the return value of from_raw_parts_mut(ptr, mid) fail.

Another example of this pattern is AtomicBool::get_mut:

pub fn get_mut(&mut self) -> &mut bool {
    unsafe { &mut *(self.v.get() as *mut bool) }
}

self.v.get() returns a *mut u8, which then gets turned into a &mut bool. However, this again overlaps with self.

This list is not exhaustive by any means; there are problems of this sort pretty much whenever raw pointers are used in non-trivial ways.

1.2 Uninitialized memory

Any place that calls mem::uninitialized instantly has validation blow up because the returned T is, obviously, not actually a valid T – it is, after all, completely uninitialized. However, it turns out that mem::uninitialized is causing other problems as well (e.g. around uninhabited types), and better APIs for handling uninitialized data have already been proposed.

1.3 Arc::drop

This is the most interesting case I found. The Arc destructor ends up calling Arc::drop_slow, which is implemented as follows:

unsafe fn drop_slow(&mut self) {
    let ptr = self.ptr.as_ptr();

    // Destroy the data at this time, even though we may not free the box
    // allocation itself (there may still be weak pointers lying around).
    ptr::drop_in_place(&mut self.ptr.as_mut().data);

    if self.inner().weak.fetch_sub(1, Release) == 1 {
        atomic::fence(Acquire);
        Heap.dealloc(ptr as *mut u8, Layout::for_value(&*ptr))
    }
}

fn inner(&self) -> &ArcInner<T> {
    // This unsafety is ok because while this arc is alive we're guaranteed
    // that the inner pointer is valid. Furthermore, we know that the
    // `ArcInner` structure itself is `Sync` because the inner data is
    // `Sync` as well, so we're ok loaning out an immutable pointer to these
    // contents.
    unsafe { self.ptr.as_ref() }
}

When drop_slow is executed, it first uses drop_in_place to drop the user data in the Arc. Next, it calls inner, and this is where things go wrong. The reason validation fails is that when inner calls self.ptr.as_ref(), that returns an &ArcInner<T>. This return value is validated, like all return values are. However, remember that we already dropped the data field of the ArcInner! We don’t actually have a fully valid &ArcInner<T>; only the weak and strong fields may still be used. Nevertheless, the type of inner says that there is a valid shared reference to ArcInner being passed around, making this code UB.

Notice that even if inner did not care about the types, validation would still fail when drop_slow goes on calling the safe function Layout::for_value and passes it the invalid &ArcInner<T>.

To fix this, we have to make sure that the type &ArcInner is not used after data is dropped. The following would work, for example:

unsafe fn drop_slow(&mut self) {
    let ptr = self.ptr.as_ptr();

    let layout = Layout::for_value(&*ptr);

    // Destroy the data at this time, even though we may not free the box
    // allocation itself (there may still be weak pointers lying around).
    ptr::drop_in_place(&mut (*ptr).data);
    // We must not use the type &(mut) ArcInner from now on, because that
    // type is actually no longer valid.

    if (*ptr).weak.fetch_sub(1, Release) == 1 {
        atomic::fence(Acquire);
        Heap.dealloc(ptr as *mut u8, layout)
    }
}

2 Changes to the Model

There were a whole bunch of other test failures, but they all turned out to be bugs in the implementations or even the model, rather than legit violations of the “Types as Contracts” idea. In particular, it turned out that suspensions were actually not modeling borrowing of mutable data properly.

There were also some changes affecting the rules of where validation calls are emitted in some corner cases, but there weren’t really any interesting things to learn there. At some point I will write down the exact rules in full detail; until then it is really the general approach that matters much more than such corner cases.

2.1 Suspension of Write Locks

In the original proposal, on every borrow y = &[mut] x;, the write locks on x would be suspended. This means that the locks get released, but will be re-acquired (or recovered) automatically when the lifetime of the borrow ends. First of all, this actually fails to detect some cases that should be considered contract violations, like passing an &mut bool to a function and then having that function write something invalid (i.e., something that is not a boolean) into the reference. To fix that, I added a new table to the global state of the program that would record suspended lvalues and their type (rather than suspended memory regions). When the lifetime of the suspension ends, we perform a full re-acquire of the given lvalue at the given type, acquiring all the write locks again and also checking other invariants, like whether bool values are valid.

However, it turned out that this model was still too naive. I failed to account for the fact that the borrow checker actually accepts programs like the following:

fn foo(s: (i32, i32)) {
    Validate(Acquire, [s]);
    let x;
    {
        Validate(Suspend('y), [s.1]);
        let y = &/*'y*/ s.1;
        Validate(Acquire, [*y]);

        Validate(Suspend('x), [s]);
        x = &/*'x*/ s;
        Validate(Acquire, [*x]);
    EndRegion('y); }
EndRegion('x); }

Let us step through this code while considering what happens with the locks. Initially, we acquire s, so the memory occupied by the pair will be write-locked by foo. When the borrow of s.1 happens, we release the write lock and suspend s.1 at type i32, making sure we re-acquire the write lock when the lifetime 'y of the borrow. Nothing interesting happens when *y gets acquired. Next, the entire s gets borrowed, so we will try to release all the write locks on it during suspension – but at this point, we don’t even have a write lock of s.1! Furthermore, due to this second borrow (that has a longer lifetime than the first borrow), we actually do not want to re-acquire the write lock on s.1 when 'y ends, we will want to wait until 'x ends and keep the read lock until then.

What we really want to do is count how many shared borrows happened of s.1 and s.2, respectively, and only acquire the write lock again once that count reaches 0. This is complicated a lot once RefCell enters the game: The fact that we take a shared borrow does not mean that there won’t also be mutable borrows again for the same region of memory. This is demonstrated by the following example (with comments explaining what happens to the locks; adding all the Validate calls just makes this code unreadable):

fn test2(r: &mut RefCell<i32>) {
    let x = &*r; // releasing write lock, first suspension recorded
    let mut x_ref = x.borrow_mut();
    let x_inner : &mut i32 = &mut *x_ref;
    let x_inner_shr = &*x_inner; // releasing inner write lock, recording suspension
    let y = &*r; // second suspension for the outer write lock
    let x_inner_shr2 = &*x_inner; // 2nd suspension for inner write lock
    /* ... */
}

At the end of this function, we actually have four aliasing shared borrows covering the content of the RefCell: x, y, x_inner_shr and x_inner_shr2. There are two mutable references, both of which currently suspended (x_inner and r).

Once both x_inner_shr and x_inner_shr2 have expired (i.e, their lifetime has ended), we will want to recover the write lock on x_inner. Then x_inner expires, releasing the write lock. Finally, once both x and y expire, r gets recovered and we again acquire the write lock.

To make that happen, we need to be able to track multiple suspended write locks for a location. In implementing that, an interesting question came up: What identifies a write lock? I ended up using the lifetime of the lock as the identifier, but that doesn’t actually work. Notice that in the above example, both x_inner and r have the same lifetime, even though they are clearly corresponding to distinct write locks. So, I will have to go back to the drawing board and figure out a different way to identify write locks. I have some ideas, but I am not very happy with them.

2.2 Unsafe Code

As already indicated in the original post, the way I suggest to handle the problems with raw pointers described in §1.1 is by relaxing the validation performed around unsafe code. Specifically, the code that emits MIR validation statements now detects if a function is unsafe, contains an unsafe block, or (in the case of closures) is defined inside an unsafe block/function. In that case, we still perform validation of the arguments passed to us, but we immediately release all locks again. The same happens for data returned to us from other functions. No further validation is happening inside the unsafe function. In other words, we enforce the type invariants when data passes between functions, but we do not enforce any invariant within such functions.

I am aware that this part of the proposal is particularly controversial. I do not know of an alternative that actually works with the raw pointers, and that performs validation purely based on the types of variables as they are passed around (rather than, say, also taking some form of provenance or “derived-from” relation into account). In discussions, two possible solutions to this problems became apparent:

Both of these options are certainly worth exploring.

3 Testing Your Own Code

The implementation of the model as described above (well, with the open problems mentioned in §2.1) actually landed upstream in rustc and miri, so you can play around with this yourself! To do so, you need to install Rust nightly and miri. Lucky enough, rustup makes this really easy. (Unfortunately, I have very little Windows experience. If you manage to get this all running on Windows, I’m happy to link to your instructions for how to do that.) On Linux and macOS, you can use the following commands to install everything you need, and run test.rs in miri:

rustup install nightly
git clone https://github.com/solson/miri.git && cd miri
cargo +nightly install xargo # xargo is needed to build a libstd that miri can work with
xargo/build.sh # build that libstd
cargo +nightly build --release
MIRI_SYSROOT=~/.xargo/HOST cargo +nightly run --release --bin miri -- -Zmir-emit-validate=1 test.rs

The flag -Zmir-emit-validate=1 turns on generation of the MIR validation statements. If you set it to 2, full validation will performed even in unsafe functions. Notice however that the flag only affects the code generated when miri actually runs; the script in xargo/build.sh controls validation for code in libstd.

If you find anything odd and things don’t work as expected, please report a bug. Notice however that miri currently supports virtually no interaction with the surrounding operating system, like file access or spawning threads.

4 Future Work

As usual, there are so many things that could be done next.

miri’s test suite is rather small, but still large enough to keep me busy, so I have not yet looked at the much more extensive test suite that comes with rustc – but that certainly something I want to do eventually. I am sure this will uncover bugs in corner-cases of my implementation, and it may also uncover new interesting cases of rules being violated. Speaking of the implementation, the problem around write locks described in §2.1 is not the only issue; another problem I still have to solve is to properly acquire and release locks on padding bytes and enum discriminants.

While §1.1 is “fixed” by exempting unsafe code from the full range of checks, and the trouble with uninitialized memory (§1.2) will hopefully be solved as a side-effect of changing the way uninitialized memory is handled, this still leaves Arc::drop (§1.3) as an open problem. Notice that relaxing validation around unsafe code does not help here; the invalid data actually flows between functions, and so far the plan for unsafe code is to still enforce type validity when function boundaries are crossed. This comes down to a judgment call: We could relax validation sufficiently (e.g., relaxing validation in entire modules when they do certain unsafe things, and not validating private fields of structs that we cannot access) to make this code legal, or we could declare that code to actually be UB. The concrete problem in Arc can be fixed fairly easily as I have shown; the question is whether this kind of code is something that programmers actually expect to be legal – and if they do, whether we can do anything about that (by providing documentation, or tools, or whatever else is necessary). Of course, it may also turn out that this shows that the entire approach is not actually feasible.

Interestingly, the discussion of my previous post introducing Types as Contracts was to a large extend not even about this model, but about the underlying model of what memory and pointers are, and how to write a checker that actually faithfully models what LLVM does. After all, we want miri to also detect UB related to e.g. illegal pointer arithmetic. These issues are far from settled, and I do have some interesting ideas here that I want so share with you.

Finally, there is of course the discussion around how to handle raw pointers; I already talked in §2.2 about some of the options that could be explored here.

Unfortunately, today is the last day of my internship, so I will no longer be able to work on all of this full-time like I did the last three months. Still, I certainly intend to stay involved. This problem is way too interesting to just let it go :)

As always, please comment with your thoughts on the topic. I am particularly curious about what kind of test cases you are throwing at miri, and how it is doing!

Update: I added a proposal for how to fix the Arc problem.

Posted on Ralf's Ramblings on Aug 11, 2017.
Comments? Drop me a mail or leave a note in the forum!