I discover a gotcha with criterion::black_box
Setting the scene
Before we begin, let’s set the scene. I’m working on a potential optimisation for an application, and there’s a type that is used in a lot of places (it stores domain names, but that’s not important). Currently it stores its data as a String, but I know that this type can only ever hold strings of length <= 253. The obvious optimisation, therefore, is to move from a heap-based dynamic String container to a stack-based solution using a fixed-size array.
This may seem like an obvious and simple optimisation; I mean, everyone knows the stack is always faster… right? Even so, I created some micro-benchmarks using the criterion crate, just in case.
The two forms I would be testing looked like this:
const MAX_LEN: usize = 253;

struct HeapBacked {
    inner: String,
}

struct StackBacked {
    inner: [MaybeUninit<u8>; MAX_LEN],
    length: u8,
}
In case you are wondering why MaybeUninit<u8> is used rather than plain u8: it allows us to skip initialising all 253 bytes, which costs a few extra nanoseconds according to the benchmarks 🤷♀️.
So far, so good. There are also, of course, the construction functions:
fn is_input_valid(input: &str) -> bool { /* ... */ }

impl HeapBacked {
    pub fn new(input: &str) -> Option<Self> {
        if is_input_valid(input) {
            Some(Self { inner: input.to_owned() })
        } else {
            None
        }
    }
}

impl StackBacked {
    pub fn new(input: &str) -> Option<Self> {
        if is_input_valid(input) {
            // Safety: We are explicitly allowed to do this by the API of MaybeUninit
            let mut inner: [MaybeUninit<u8>; MAX_LEN] =
                unsafe { MaybeUninit::uninit().assume_init() };
            let length = input.len();
            // Safety: This `as` cast is safe because `MaybeUninit<T>` is guaranteed to
            // have the same layout as `T`, and `is_input_valid` ensures `length <= MAX_LEN`
            unsafe {
                std::ptr::copy_nonoverlapping(input.as_ptr(), inner.as_mut_ptr() as *mut u8, length);
            }
            Some(Self { inner, length: length as u8 })
        } else {
            None
        }
    }
}
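The post never shows how the stack-backed value is read back out, so for completeness here is a sketch of an `as_str` accessor (my addition, not part of the original code); to keep it self-contained it uses a plain length check in place of `is_input_valid`:

```rust
use std::mem::MaybeUninit;

const MAX_LEN: usize = 253;

struct StackBacked {
    inner: [MaybeUninit<u8>; MAX_LEN],
    length: u8,
}

impl StackBacked {
    pub fn new(input: &str) -> Option<Self> {
        if input.len() <= MAX_LEN {
            // Safety: an array of `MaybeUninit` may be left uninitialised
            let mut inner: [MaybeUninit<u8>; MAX_LEN] =
                unsafe { MaybeUninit::uninit().assume_init() };
            // Safety: `MaybeUninit<u8>` has the same layout as `u8`,
            // and we just checked `input.len() <= MAX_LEN`
            unsafe {
                std::ptr::copy_nonoverlapping(
                    input.as_ptr(),
                    inner.as_mut_ptr() as *mut u8,
                    input.len(),
                );
            }
            Some(Self { inner, length: input.len() as u8 })
        } else {
            None
        }
    }

    pub fn as_str(&self) -> &str {
        // Safety: `new` initialised the first `length` bytes by copying
        // from a valid &str, so they are initialised, valid UTF-8
        unsafe {
            let bytes = std::slice::from_raw_parts(
                self.inner.as_ptr() as *const u8,
                self.length as usize,
            );
            std::str::from_utf8_unchecked(bytes)
        }
    }
}
```

The invariant that makes `as_str` sound (first `length` bytes initialised and valid UTF-8) is established by the constructor and never broken afterwards, which is the usual pattern for this kind of unsafe-backed type.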
The start of our troubles
Benchmark results:
stack: 80ns
heap: 45ns
As you can see, these results are a little unexpected: the stack version is slower, almost 2x slower.
To try to understand why, I added another benchmark, “stack unchecked”, which skips the validation step. This one gave a time of ~30ns. OK, here is the speedup I was expecting!
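The “stack unchecked” benchmark implies an unchecked constructor that the post doesn’t show. Here is my guess at its shape (hypothetical; the real code may differ): the same copy, minus the validation, with the safety contract pushed onto the caller:

```rust
use std::mem::MaybeUninit;

const MAX_LEN: usize = 253;

struct StackBacked {
    inner: [MaybeUninit<u8>; MAX_LEN],
    length: u8,
}

impl StackBacked {
    /// # Safety
    /// The caller must guarantee that `input` is valid for this type,
    /// in particular that `input.len() <= MAX_LEN`.
    pub unsafe fn new_unchecked(input: &str) -> Self {
        let mut inner: [MaybeUninit<u8>; MAX_LEN] = MaybeUninit::uninit().assume_init();
        std::ptr::copy_nonoverlapping(input.as_ptr(), inner.as_mut_ptr() as *mut u8, input.len());
        Self { inner, length: input.len() as u8 }
    }
}
```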
The question still remained, however: why was the checked version so much slower?
Theory time
I came up with a couple of theories to try and work out what on earth was happening.
1. Maybe it’s the branch?
Perhaps the branch was messing up the predictor in a way that didn’t happen with the heap version? Nope: perf said that while there were consistently more branches in the checked version (as expected), there were a similar number of misses (i.e. the predictor got it right most of the time).
What I do notice at this point is that “stalled-cycles-backend” (cycles spent waiting for memory or instructions) is very high (>75%) for both stack versions, while the heap version is about average for my system (~20%). This would seem to indicate an issue with fetching memory; perhaps something about the stack version is really bad for the CPU cache?
2. OK, perhaps it’s being evicted from the cache
The stack version is a lot bigger than the heap version (254 bytes vs 24), so maybe the issue is the cache getting pressured when the check is run, evicting it. Nope: perf stat once again proves me wrong; cache misses for the heap and stack versions are similar, with the heap version consistently having just a *smidgen* more (because of the extra indirection, if I had to guess).
Black box
It turns out the issue was in criterion::black_box, aka the function used both by me and by criterion internally to stop the compiler optimising away reads, or removing variables entirely because they are never actually used. criterion implements this function in two different ways: the first (and default) uses a volatile pointer read; the second (enabled by the real_blackbox feature flag) uses the Rust built-in test::black_box.
The heap version was almost completely unaffected by the fake black box, but the stack version took a pretty major performance hit.
If I switched on real_blackbox then I got the results I was initially expecting: the stack version outperformed the heap version.
In retrospect, I should’ve RTFM’d at the start:
/// A function that is opaque to the optimizer, used to prevent the compiler from
/// optimizing away computations in a benchmark.
///
/// This variant is stable-compatible, but it may cause some performance overhead
/// or fail to prevent code from being eliminated.
#[cfg(not(feature = "real_blackbox"))]
pub fn black_box<T>(dummy: T) -> T {
    unsafe {
        let ret = std::ptr::read_volatile(&dummy);
        std::mem::forget(dummy);
        ret
    }
}
To be fair(ish) to myself, I did actually read that comment… I just ignored it. When I migrated my code to its own repository, in order to make a minimal reproducible example, I discovered that the performance of the stack version was suddenly better than the heap version. Eventually I realised that this feature flag was the only difference (and my soul left my body).
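Pulling that volatile-read version out into a standalone replica (my own reconstruction, copied from the snippet above) makes the cost mechanism visible: `read_volatile` forces a byte-for-byte copy of the whole value through memory, so its overhead scales with the size of the type. That is cheap for a 24-byte String header, and much less cheap for a ~254-byte stack-backed value.

```rust
// A standalone replica of criterion's stable `black_box`, taken from the
// snippet above, so its overhead can be reasoned about in isolation.
fn fake_black_box<T>(dummy: T) -> T {
    unsafe {
        // The volatile read copies all of `T` through memory; the cost grows
        // with `size_of::<T>()`, which is why a 254-byte stack value suffers
        // far more than a 24-byte String header.
        let ret = std::ptr::read_volatile(&dummy);
        std::mem::forget(dummy);
        ret
    }
}
```

Semantically it just returns its argument unchanged; the damage is purely in the generated code.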
The real performance bottleneck
Even with the stack version now performing as expected, it still wasn’t at the level I was hoping for. It was faster, sure, but only by a small amount: 10ns at most. The unchecked version was significantly faster: a roughly constant ~8ns, versus ~30-300ns for the checked version on 15-200 character inputs.
Why was this? Well, it turns out that checking that the string was valid for the datatype’s representation was absolutely obliterating the cost of the heap allocation 🤦♀️. In retrospect, this should have been obvious to me from the moment the unchecked version outperformed the checked version by such a large margin; my only defence is that the slowdown from black_box was in the same ballpark as the slowdown from the validation, which is just bad luck.
So there we have it: the real villain all along was a rather complex regex. There’s probably a moral about regexes here, but I’m not volunteering to rewrite:
^((([0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5]\.){3}[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])|(\[([0-9A-Fa-f:.]){2,45}\])|([0-9A-Za-z-.]){1,255})(:[0-9]{1,5})?$
as custom code (it may very well end up slower anyway; 30ns is quite impressive given the length of the input and the complexity of the expression).
Lessons learned
Overall, I learned two lessons from this whole mess:
- Benchmarking is really important, and you should use it to pick what to optimise, not the other way around (oops).
- Compiler intrinsics for things like black_box are essential, and trying to fake them can cause serious issues with your results (oh, and of course: RTFM of your benchmarking lib).
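As a footnote to that second lesson: since Rust 1.66, essentially the same intrinsic that real_blackbox wraps is available on stable as std::hint::black_box, so there is much less need for the volatile-read fallback these days. A minimal sketch of the intended usage (the function names here are my own):

```rust
use std::hint::black_box;

fn sum(v: &[u64]) -> u64 {
    v.iter().sum()
}

// In a benchmark loop, wrap both the input and the output: black_box on the
// input stops the compiler constant-folding the work away, and black_box on
// the result stops it deleting the computation as unused.
fn bench_iteration(data: &[u64]) -> u64 {
    black_box(sum(black_box(data)))
}
```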