I discovered a gotcha with criterion::black_box

Setting the scene

Before we begin, let’s set the scene. I’m working on a potential optimisation for an application: there’s a type that is used in a lot of places (it stores domain names, but that’s not important). Currently it stores its data as a String, but I know that this type can only ever hold strings of length <= 253. The obvious optimisation, therefore, is to move from a heap-based dynamic String container to a stack-based solution using a fixed-size array.

This may seem like an obvious and simple optimisation; after all, everyone knows the stack is always faster… right? Even so, I created some micro-benchmarks using the criterion crate, just in case.
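Criterion handles warmup, outlier detection, and statistics for you; as a rough std-only illustration of what such a micro-benchmark measures (this is a sketch, not the actual criterion code), the core loop boils down to:

```rust
use std::hint::black_box; // std's black_box, stable since Rust 1.66
use std::time::{Duration, Instant};

// Minimal sketch of a micro-benchmark loop; criterion layers warmup,
// outlier detection and statistics on top of this basic idea.
fn bench<F: FnMut()>(name: &str, iters: u32, mut f: F) -> Duration {
    let start = Instant::now();
    for _ in 0..iters {
        f();
    }
    let per_iter = start.elapsed() / iters;
    println!("{name}: {per_iter:?} per iteration");
    per_iter
}

fn main() {
    bench("heap", 1_000_000, || {
        // black_box stops the compiler optimising the allocation away.
        black_box(String::from("example.com"));
    });
}
```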

The two forms I would be testing looked like this:

use std::mem::MaybeUninit;

const MAX_LEN: usize = 253;

struct HeapBacked {
    inner: String,
}

struct StackBacked {
    inner: [MaybeUninit<u8>; MAX_LEN],
    length: u8,
}

In case you are wondering why MaybeUninit is used rather than plain u8: it lets us skip initialising all 253 bytes, which costs a few nanoseconds according to the benchmarks 🤷‍♀️.

So far, so good. There are also, of course, the construction functions:

fn is_input_valid(input: &str) -> bool { /* ... */ }

impl HeapBacked {
    pub fn new(input: &str) -> Option<Self> {
        if is_input_valid(input) {
            Some(Self { inner: input.to_owned() })
        } else {
            None
        }
    }
}
impl StackBacked {
    pub fn new(input: &str) -> Option<Self> {
        if is_input_valid(input) {
            // Safety: An uninitialised `[MaybeUninit<u8>; N]` is itself a valid value, since `MaybeUninit` requires no initialisation
            let mut inner: [MaybeUninit<u8>; MAX_LEN] = unsafe { MaybeUninit::uninit().assume_init() };
            let length = input.len();
            // Safety: The cast is sound because `MaybeUninit<u8>` is guaranteed to have the same layout as `u8`, and `is_input_valid` ensures `length <= MAX_LEN`
            unsafe { std::ptr::copy_nonoverlapping(input.as_ptr(), inner.as_mut_ptr() as *mut u8, length); }
            Some(Self { inner, length: length as u8 })
        } else {
            None
        }
    }
}
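To make the round trip concrete, here is a self-contained version of the stack-backed type with a hypothetical as_str accessor (not part of the original post) and a stand-in validity check; the real is_input_valid is the regex discussed later:

```rust
use std::mem::MaybeUninit;

const MAX_LEN: usize = 253;

pub struct StackBacked {
    inner: [MaybeUninit<u8>; MAX_LEN],
    length: u8,
}

// Hypothetical stand-in for the post's real (regex-based) validity check.
fn is_input_valid(input: &str) -> bool {
    !input.is_empty() && input.len() <= MAX_LEN
}

impl StackBacked {
    pub fn new(input: &str) -> Option<Self> {
        if is_input_valid(input) {
            // Safety: an uninitialised array of MaybeUninit is itself valid.
            let mut inner: [MaybeUninit<u8>; MAX_LEN] =
                unsafe { MaybeUninit::uninit().assume_init() };
            let length = input.len();
            // Safety: MaybeUninit<u8> has the same layout as u8, and
            // is_input_valid guarantees length <= MAX_LEN.
            unsafe {
                std::ptr::copy_nonoverlapping(input.as_ptr(), inner.as_mut_ptr() as *mut u8, length);
            }
            Some(Self { inner, length: length as u8 })
        } else {
            None
        }
    }

    // Hypothetical accessor: only the first `length` bytes are ever read,
    // so the uninitialised tail is never touched.
    pub fn as_str(&self) -> &str {
        unsafe {
            let bytes =
                std::slice::from_raw_parts(self.inner.as_ptr() as *const u8, self.length as usize);
            // Safety: the bytes were copied from a valid &str in `new`.
            std::str::from_utf8_unchecked(bytes)
        }
    }
}

fn main() {
    let name = StackBacked::new("example.com").unwrap();
    println!("{}", name.as_str());
}
```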

The start of our troubles

Benchmark results:

stack: 80ns
heap: 45ns

As you can see, these results are a little unexpected: the stack version is almost 2x slower than the heap version.

To try to understand why, I added another benchmark, “stack unchecked”, which skips the validity check. This one came in at ~30ns. OK, here is the speedup I was expecting!

The question still remained, however: why was the checked version so much slower?

Theory time

I came up with a couple of theories to try and work out what on earth was happening.

1. Maybe it’s the branch?

Perhaps the branch was messing up the predictor in a way that didn’t happen with the heap version? Nope: perf said that while there were consistently more branches in the checked version (as expected), there were a similar number of misses (i.e. the predictor got it right most of the time).

What I did notice at this point is that “stalled-cycles-backend” (cycles spent waiting on memory or execution resources) is very high (>75%) for both stack versions, while the heap version is about average for my system (~20%). This would seem to indicate an issue with fetching memory; perhaps something about the stack version is really bad for the CPU cache?

2. OK, perhaps it’s being evicted from the cache

The stack version is a lot bigger than the heap version (254 bytes vs 24), so maybe the issue is cache pressure: running the check evicts the value from the cache.

Nope: perf stat once again proved me wrong. Cache misses for the heap and stack versions are similar, with the heap version consistently having just a *smidgen* more (because of the extra indirection, if I had to guess).

Black box

It turns out the issue was in criterion::black_box, the function used both by me and by criterion internally to stop the compiler from optimising away reads, or from removing values entirely because they are never actually used. Criterion implements this function in two different ways: the first (and default) uses a volatile pointer read; the second (enabled by the real_blackbox feature flag) uses the Rust built-in test::black_box.

The heap version was almost completely unaffected by the fake black box, but the stack version took a pretty major performance hit.
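Why would a volatile read hurt the stack version so much more? `read_volatile` must copy the entire value, and the compiler is not allowed to elide that copy. A minimal sketch of the effect, using a renamed copy of criterion’s default implementation (the sizes assume a 64-bit target):

```rust
// A copy of criterion's stable-compatible black_box, renamed so it can
// be poked at in isolation.
fn fake_black_box<T>(dummy: T) -> T {
    unsafe {
        // read_volatile copies the whole T, byte for byte, and the copy
        // cannot be optimised out. For a ~254-byte stack value that is a
        // full memcpy on every call; for a String it only copies the
        // 24-byte (ptr, len, capacity) header.
        let ret = std::ptr::read_volatile(&dummy);
        std::mem::forget(dummy);
        ret
    }
}

fn main() {
    let big = [7u8; 254]; // roughly the size of StackBacked
    let copy = fake_black_box(big);
    assert_eq!(copy, big);
    // String's stack footprint on a 64-bit target is just three words:
    assert_eq!(std::mem::size_of::<String>(), 24);
}
```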

If I switched on real_blackbox, I got the results I was initially expecting: the stack version outperformed the heap version.
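Enabling the flag is a one-line change in Cargo.toml (the version number here is illustrative; depending on your criterion release, the feature may also require a nightly toolchain, since test::black_box is unstable):

```toml
[dev-dependencies]
criterion = { version = "0.5", features = ["real_blackbox"] }
```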

In retrospect, I should’ve RTFM’d at the start:

/// A function that is opaque to the optimizer, used to prevent the compiler from
/// optimizing away computations in a benchmark.
///
/// This variant is stable-compatible, but it may cause some performance overhead
/// or fail to prevent code from being eliminated.
#[cfg(not(feature = "real_blackbox"))]
pub fn black_box<T>(dummy: T) -> T {
    unsafe {
        let ret = std::ptr::read_volatile(&dummy);
        std::mem::forget(dummy);
        ret
    }
}

To be fair(ish) to myself, I did actually read that comment… I just ignored it. When I migrated my code to its own repository to make a minimal reproducible example, I discovered that the performance of the stack version was suddenly better than the heap version. Eventually I realised that this feature flag was the only difference (and my soul left my body).

The real performance bottleneck

Even with the stack version now performing as expected, it still wasn’t at the level I was hoping for. It was faster, sure, but only by a small amount: 10ns at most. The unchecked version was significantly faster, at a constant ~8ns versus ~30-300ns for the checked version on 15-200 character inputs.

Why was this? It turns out that checking the string was valid for the datatype’s representation was absolutely obliterating the cost of the heap allocation 🤦‍♀️. In retrospect, this should have been obvious from the moment the unchecked version outperformed the checked version by such a large margin; my only defence is that the slowdown from black_box was in the same ballpark as the slowdown from the validation, which is just bad luck.

So there we have it: the real villain all along was a rather complex regex. There’s probably a moral about regexes here, but I’m not volunteering to rewrite:

^((([0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5]\.){3}[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])|(\[([0-9A-Fa-f:.]){2,45}\])|([0-9A-Za-z-.]){1,255})(:[0-9]{1,5})?$

as custom code (it may very well end up slower anyway; 30ns is quite impressive given the length of the input and the complexity of the expression).
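For a sense of what that rewrite would involve, here is a hypothetical hand-rolled check for just the trailing `(:[0-9]{1,5})?` fragment of the regex (none of this code is from the original project):

```rust
// Hypothetical hand-rolled equivalent of the regex fragment
// `(:[0-9]{1,5})?`: an optional colon followed by one to five digits.
// Returns the input with any valid port suffix stripped, or None if a
// suffix is present but malformed. (A real validator would also need to
// avoid treating the colons inside an IPv6 literal as a port.)
fn strip_valid_port(input: &str) -> Option<&str> {
    match input.rfind(':') {
        None => Some(input),
        Some(i) => {
            let port = &input[i + 1..];
            if (1..=5).contains(&port.len()) && port.bytes().all(|b| b.is_ascii_digit()) {
                Some(&input[..i])
            } else {
                None
            }
        }
    }
}

fn main() {
    assert_eq!(strip_valid_port("example.com:8080"), Some("example.com"));
    assert_eq!(strip_valid_port("example.com"), Some("example.com"));
    assert_eq!(strip_valid_port("example.com:notaport"), None);
}
```

And that is just one of the easier pieces; multiplied across the IPv4, IPv6, and hostname alternatives, the regex starts to look like good value.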

Lesson learned

Overall, I learned two lessons from this whole mess:

  1. Benchmarking is really important, and you should use it to pick what to optimise, not the other way round (oops).
  2. Compiler intrinsics for things like black_box are essential; trying to fake them can seriously skew your results (oh, and of course: RTFM for your benchmarking library).