Crossbeam's bounded MPMC is genius (and here's why)


Introduction

I was writing some multithreaded C++ code for a uni project recently (well, recently when this article was drafted; it’s a year ago now, I have ADHD okay), and I was really pining for some MPMC channels. My years of writing Rust have made me reach for these concurrency primitives almost instinctively, and going back to a language without them really hurt. What’s worse, I couldn’t find any good lightweight libraries that provided a nice interface similar to Rust’s crossbeam library (which is also the basis for the standard library’s mpmc module).

So I ended up writing my own, calling it transbeam, and reading crossbeam’s source code to find out what clever things it was doing. It turns out there were quite a few.

I will only be looking at the bounded MPMC channel in this article; though the unbounded one also does some clever things, like block allocation (it’s a linked list of blocks), the core principles are the same.

Okay, but what is an MPMC?

MPMC stands for “Multi Producer Multi Consumer”; it is a data structure for facilitating communication between multiple threads. Multi producer means multiple threads can put data into it simultaneously, and multi consumer means multiple threads can pull data out simultaneously. If you’re familiar with Go, its channels are MPMC channels.

Bounded MPMC structures are naturally expressed as ring buffers, with “channels” just being an abstraction on top that places restrictions on whether a given object is a “reader” or a “writer” of the channel.

Trying some implementations

This section covers a few naive implementations of MPMCs to see why they fail. If you’re already familiar with this kind of thing you can probably skip it.

Two pointers

What if we just had a read pointer and a write pointer, like a traditional single-threaded ring buffer?

Unfortunately this falls over quickly in an MPMC context. Consider:

read: 0
write: 0
--- do a write
write: 1
--- write thread gets slept after incrementing but before writing
--- Our read sees that write is one ahead so reads out of the slot
read: 1
--- now write wakes up and kaboom, race condition (or it doesn't wake and we read uninitialised/stale memory)

You might think we can fix this by simply incrementing the write head after the write completes, but, because it is multi producer, this would cause a race as multiple writers write to the same slot.

How about three pointers

OK, so we need a guard pointer that prevents the read head from advancing while still allowing multiple writes to happen at once. This guard pointer indicates when a slot is actively being written: the idea is that we increment the write head before we start writing, but only increment the guard after we finish.

read: 0
write: 0
end: 0
--- do a write
write: 1
--- write thread is slept but read sees that end is the same so it doesn't do anything

--- now write finishes
end: 1
--- and read can continue safely

This works, but we have to be careful how we increment end. Consider:

read: 0
write: 0
end: 0
--- two threads start writing at once A and B
write: 1 -- A claims slot 0
write: 2 -- B claims slot 1
--- now A is slept and B completes before it, then increments `end` (or sets it to its write)
end: 1
--- uh oh read comes along, A is still slept,
--- read sees that end is incremented aaand we're back to the two-pointer version

Therefore we need to compare-and-swap end, ensuring its value is write - 1 before updating it (i.e. each writer waits for all writers before it to finish before it can finish). This is not ideal, and causes a lot of contention in our supposedly lock-free data structure; however, it is a common approach in the wild, as it is simple and the most obvious fix for the problems with the two-pointer version.
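
As a rough illustration, the publish step might look like this in C++ (a minimal sketch; the names end_, publish and myTicket are mine, not from any particular library):

#include <atomic>
#include <cstddef>

std::atomic<std::size_t> end_{0};  // the guard pointer

// Called once a writer has finished filling the slot it claimed.
// `myTicket` is the value of `write` this writer saw when claiming its
// slot, so `end_` must reach `myTicket` before we may advance it.
void publish(std::size_t myTicket) {
    std::size_t expected = myTicket;
    // Spin until every writer that claimed an earlier slot has published.
    while (!end_.compare_exchange_weak(expected, myTicket + 1,
                                       std::memory_order_release,
                                       std::memory_order_relaxed)) {
        expected = myTicket;  // a failed CAS overwrites `expected`; reset it
    }
}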

How crossbeam mitigates these issues

Let’s take a look at how crossbeam handles this. The core innovation is to store metadata for each slot alongside the data; readers and writers use this metadata to determine which state the slot is currently in:

  • currently being read
  • currently being written
  • can be written (may or may not have been read before)
  • previously written, can be read

This bit-packed encoding is very useful because it allows the state to be updated or read with a single atomic operation, which is both faster and simpler to implement.

Basics

The metadata is bit packed as the combination of a “lap” and a flag. The read/write heads store the metadata in the top 2 bits of their value, and each slot stores the metadata in the bottom 2 bits of an atomic u8.

Laps track when a head wraps around. E.g. if we are reading slot 1 and the write head wraps around to slot 0, we need to know that the writer is still ahead of the reader, even though the indexes would suggest otherwise.

The flag differs depending on whether it is slot metadata or head metadata. For slots it marks data that has been written by the write head; for heads it marks when a channel is shutting down. The slot usage is the one to worry about, as the head one is a special case that we don’t discuss here.

In summary:

  • We can read from a slot IF: the written flag is set and the lap matches the active head’s lap
  • We can write to the slot IF: the written flag is NOT set and the lap matches the active head’s lap
  • When we write to a slot WE: set the written flag
  • When we read from a slot WE: unset the written flag and flip the lap bit
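
To make the packing concrete, here is a small C++ sketch of these rules. The helper names and exact bit positions are my own choices (a single lap bit, as in the simplified diagrams below), not crossbeam’s actual layout:

#include <cstdint>

// Slot stamp layout assumed here: bit 0 = written flag, bit 1 = lap.
constexpr std::uint8_t WRITTEN_BIT = 0b01;
constexpr std::uint8_t LAP_BIT     = 0b10;

// `headLap` is the active head's lap, already positioned at LAP_BIT.
constexpr bool can_read(std::uint8_t stamp, std::uint8_t headLap) {
    return (stamp & WRITTEN_BIT) != 0 && (stamp & LAP_BIT) == headLap;
}

constexpr bool can_write(std::uint8_t stamp, std::uint8_t headLap) {
    return (stamp & WRITTEN_BIT) == 0 && (stamp & LAP_BIT) == headLap;
}

// What a writer stores when it finishes: same lap, written bit set.
constexpr std::uint8_t after_write(std::uint8_t stamp) {
    return stamp | WRITTEN_BIT;
}

// What a reader stores when it finishes: written bit cleared, lap flipped.
constexpr std::uint8_t after_read(std::uint8_t stamp) {
    return static_cast<std::uint8_t>((stamp & ~WRITTEN_BIT) ^ LAP_BIT);
}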

Working through it

Let’s look at a simplified version of this packing which only considers 5 bits for the head values (the real one is a full usize).

Consider this initial state:

stamps: [
    [0 0]
    [0 0]
    [0 0]
    [0 0]
]
read: 0 0 00
write: 0 0 00

Let’s attempt a read operation. We load the read head (“0 0 00”) and extract the index 00 and lap 0. Now we compare it to the stamp of the slot at index 0, which is “0 0”. We need the written bit to be set (i.e. “0 1”), but it isn’t. This means we have run into the back of our last read, suggesting there are no items left. We still check the write head, however: if its index is not equal to ours, then a writer is currently in the process of writing to some slot, and we should wait for it to complete (this ensures that even if the write thread gets slept, reads ordered chronologically after it still succeed). If it is equal, we simply fail, as you cannot pull from an empty channel.
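
In code, that fail path might look something like this (a sketch with assumed names: read_, write_ and stamps_ are fields, INDEX_MASK masks off the index bits, and stamp_ready_for stands in for the stamp check above; the head comparison uses the full value, lap included, which is the conservative form of the index check):

// Sketch of a read attempt hitting a not-yet-written slot.
const auto rd    = read_.load(std::memory_order_relaxed);
const auto index = rd & INDEX_MASK;
const auto stamp = stamps_[index].load(std::memory_order_acquire);

if (!stamp_ready_for(stamp, rd)) {
    // Keep the stamp load above from being reordered after the head load.
    std::atomic_thread_fence(std::memory_order_seq_cst);
    if (write_.load(std::memory_order_relaxed) == rd) {
        return false;  // heads coincide: the channel is truly empty
    }
    // A writer has claimed a slot but hasn't stamped it yet: back off
    // and retry the read rather than reporting "empty".
}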

OK, so our read at position 0 is failing like it should. Now let’s see what a write does:

We load the write head (“0 0 00”) and compare its metadata (“0 0”) to the stamp on slot 0. It matches, so we can write to this slot, but we have to be careful not to race with any reads: if we update the stamp on this slot before we have fully written to it, then a read may try to access the data before we finish. However, we also need to avoid races with other writers, so we do have to update something. Therefore:

  1. Increment the write head (to “0 0 01”) using compare-and-swap, ensuring we are the undisputed writer for this slot
  2. Write the data into the slot non-atomically
  3. Update the stamp to say we’re done by setting the written bit (0 1)

This is why the read path has that special back off: the read thread may observe the update to the write head before the update to the stamp (e.g. because the writer got slept by the OS).
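
Putting the three steps together, the write path might look roughly like this, reusing the earlier snippet’s helpers. This is a sketch, not crossbeam’s actual code: lap_of(head) (extract a head’s lap, positioned at LAP_BIT) and next(head) (advance the index, flipping the lap on wrap-around) are hypothetical helpers, and write_, stamps_ and slots_ are assumed fields:

// Sketch of the write path. T is the element type.
template <typename T>
bool try_write(T value) {
    auto wr = write_.load(std::memory_order_relaxed);
    for (;;) {
        const auto index = wr & INDEX_MASK;
        const auto stamp = stamps_[index].load(std::memory_order_acquire);
        if (!can_write(stamp, lap_of(wr))) {
            return false;  // full / in-progress handling elided (last section)
        }
        // 1. Claim the slot; a failed CAS means another writer took it,
        //    and `wr` has been reloaded, so we simply try again.
        if (!write_.compare_exchange_weak(wr, next(wr),
                                          std::memory_order_seq_cst,
                                          std::memory_order_relaxed)) {
            continue;
        }
        // 2. We are the undisputed writer: a plain non-atomic write is safe.
        slots_[index] = std::move(value);
        // 3. Publish: a release store of the written bit makes the data
        //    visible to whichever reader observes the new stamp.
        stamps_[index].store(after_write(stamp), std::memory_order_release);
        return true;
    }
}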

The current state is:

stamps: [
    [0 1]
    [0 0]
    [0 0]
    [0 0]
]
read: 0 0 00
write: 0 0 01

Now if we try to read again we find the stamp has the written bit set, permitting us to read the data. Again, however, we have to be careful not to race: there may be other readers who have also seen this stamp, and they may try to read the data at the same time. Thus the correct ordering is:

  1. Increment the read head (to 0 0 01) using compare-and-swap
  2. Read the data out of the slot
  3. Unset the written bit to notify writers that they are allowed to write to this slot, and flip the lap bit

The stamp in slot 0 will now be 1 0.

Importantly, we write the stamp only after fully reading: if we wrote the stamp too early we could race with a writer while reading the data out.
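
For completeness, here is the mirror-image read path under the same assumptions as the write sketch above:

// Sketch of the read path, reusing the same hypothetical helpers.
template <typename T>
bool try_read(T& out) {
    auto rd = read_.load(std::memory_order_relaxed);
    for (;;) {
        const auto index = rd & INDEX_MASK;
        const auto stamp = stamps_[index].load(std::memory_order_acquire);
        if (!can_read(stamp, lap_of(rd))) {
            return false;  // empty / in-progress handling elided (see above)
        }
        // 1. Claim the slot so no other reader can consume it; a failed
        //    CAS reloads `rd` and we go around again.
        if (!read_.compare_exchange_weak(rd, next(rd),
                                         std::memory_order_seq_cst,
                                         std::memory_order_relaxed)) {
            continue;
        }
        // 2. Move the data out while we exclusively own the slot.
        out = std::move(slots_[index]);
        // 3. Hand the slot back to writers: one release store that clears
        //    the written bit and flips the lap.
        stamps_[index].store(after_read(stamp), std::memory_order_release);
        return true;
    }
}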

Writing to a full channel

Let’s look at one last case: trying to push additional elements to a full channel (i.e. the writer bumping into the read head). Let’s say we’ve done a bunch of reads and writes, and our structure now looks like this:

stamps: [
    [1 0]
    [0 1]
    [0 1]
    [0 1]
]
read: 0 0 01
write: 1 0 00

In this state we have pushed 4 elements to the buffer and read 1 element; the write head has gone through the following state changes:

0 0 00
0 0 01
0 0 10
0 0 11
1 0 00

Now we try and push another element, causing a write to slot 0. This is fine, and gives us:

stamps: [
    [1 1]
    [0 1]
    [0 1]
    [0 1]
]
read: 0 0 01
write: 1 0 01

OK, now we try another write, to slot 1. Immediately we notice the stamp isn’t valid: to write this slot we need a stamp of the form “1 0” (i.e. the correct lap and the written bit unset), but we have “0 1”. It does, however, match the expected stamp one lap behind, which means this slot must have been written on the previous lap and hasn’t yet been read. Now we do the mirror of what the read thread does when it runs into an in-progress write: we put a fence down to prevent our load of the stamp from time-travelling, and then we load the read head:

// Stop the stamp load above from being reordered after the head load below.
std::atomic_thread_fence(std::memory_order_seq_cst);
// Relaxed is fine here: the fence provides the ordering we need.
const auto rd = read_.load(std::memory_order_relaxed);

In this case it is behind by one lap, so we know that no reads are currently happening, and we just abort and tell the user they have filled the buffer. In the case where the read head was, say, 1 0 00 (which could happen if many readers arrive suddenly), we know there is an active reader for this slot, so we back off and retry once it has finished.
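
As a sketch, the decision after that load might look like this, where wr is the write-head value we loaded earlier and ONE_LAP is a hypothetical constant equal to the increment that flips a head’s lap bit (ignoring the shutdown flag):

// Hypothetical continuation of the fragment above (names assumed).
if (rd + ONE_LAP == wr) {
    // The read head is exactly one lap behind us: every slot still
    // holds unread data, so the channel is genuinely full.
    return false;
}
// Otherwise a reader has claimed this slot but hasn't updated its stamp
// yet; back off and retry until the stamp changes.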

Conclusion

In conclusion, multi-threaded data structures are really complex, and crossbeam does a lot of smart stuff to make them performant. Lock-free ring buffers are a core data structure in multi-threaded contexts, but a naive implementation has a lot of performance and correctness pitfalls. You should probably not roll your own for production code, and you should definitely do further reading if you want to anyway.

The stamp system used by crossbeam allows multiple readers and writers to run at once, with the only limit on their number being the buffer size. The tradeoff is a little more memory usage compared to the three-pointer version, along with a much more complex implementation. The results, however, are worth it.

Further Reading

The book Rust Atomics and Locks by Mara Bos is a great read if you are interested in atomics in Rust or C++. It covers both the low-level assembly details and the higher-level atomic orderings, and shows how they fit together.