Async Rust Mutexes vs. Lock-Free

by Daniel Boros

Sep 26

Dear rust-dd readers…

…we’ve been a bit swamped lately, but we’re back! And today we’re poking at a topic that has probably given every single one of you at least one mini-stroke while writing code in this beautiful language — and, just to spice it up, in an async context, with a performance twist. We’re going to chase down the mysterious Mutexes from a perf perspective, and we’ll pit them against a lock-free approach too. Actually… make that two lock-free variants for completeness. 😇

Why Mutex at all?

Mutex = Mutual Exclusion: at most one thread writes/reads at a time; others wait. Why is this good? Because the thread gets exclusive access to mutate shared data. “But why not RwLock?” — fair question.
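To see that exclusive access in action, here’s a minimal std-only sketch (the function name `locked_sum` and the thread/iteration counts are made up for illustration):

```rust
use std::sync::{Arc, Mutex};
use std::thread;

// N threads each increment a shared counter; the Mutex guarantees
// every increment is observed, so the result is exact.
fn locked_sum(n_threads: usize, iters: u64) -> u64 {
    let counter = Arc::new(Mutex::new(0u64));
    let mut handles = Vec::new();

    for _ in 0..n_threads {
        let c = Arc::clone(&counter);
        handles.push(thread::spawn(move || {
            for _ in 0..iters {
                // Only one thread holds the guard at a time;
                // the guard drops (unlocks) at the end of this statement.
                *c.lock().unwrap() += 1;
            }
        }));
    }
    for h in handles {
        h.join().unwrap();
    }
    let total = *counter.lock().unwrap();
    total
}

fn main() {
    assert_eq!(locked_sum(4, 1_000), 4_000);
}
```

Without the lock, those increments would be a data race and the compiler would (rightly) refuse to share the `u64` mutably across threads.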

In Rust, Mutex and RwLock wrappers ultimately sit on OS primitives (pthread_mutex_t and pthread_rwlock_t on Unix-like systems). A Mutex lock/unlock can often be as cheap as a single atomic + a possible slow path; an RwLock must track reader/writer counts and orchestrate wakeups, so it tends to have more coordination overhead. In async contexts, this can be compounded by task wakeups (e.g., tokio::RwLock maintains wait queues and wakers). That said, for read-heavy workloads, RwLock can absolutely be the right call — you’ll often see that recommendation.
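For the read-heavy case, a small std-only sketch of the `RwLock` pattern (the helper `read_mostly` is hypothetical): many readers hold the lock concurrently, and a writer waits for exclusive access.

```rust
use std::sync::{Arc, RwLock};
use std::thread;

// Spawns `n_readers` that each take a shared read guard,
// then performs one exclusive write; returns the final value.
fn read_mostly(n_readers: usize) -> String {
    let data = Arc::new(RwLock::new(String::from("v1")));
    let mut handles = Vec::new();

    for _ in 0..n_readers {
        let d = Arc::clone(&data);
        // Read guards can be held by many threads at once.
        handles.push(thread::spawn(move || d.read().unwrap().len()));
    }
    for h in handles {
        h.join().unwrap();
    }

    // The writer gets exclusive access once all readers are done.
    *data.write().unwrap() = String::from("v2");
    let result = data.read().unwrap().clone();
    result
}

fn main() {
    assert_eq!(read_mostly(8), "v2");
}
```

The extra bookkeeping that makes concurrent readers possible is exactly the coordination overhead described above — worth paying only when reads dominate.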

The contenders: parking_lot, tokio, and std Mutexes

Both parking_lot::Mutex and std::sync::Mutex expose sync APIs. If you want sheer speed for short critical sections without await, parking_lot::Mutex is usually a great choice. It’s famously lean — its fast path fits in a single word of state (not “one byte” 😉), while std::Mutex is typically larger because it integrates with OS facilities and debuggability, which brings a bit more overhead. Meanwhile, tokio::sync::Mutex was designed not to be the fastest raw lock, but to be a fully async-aware, non-blocking mutex that plays nicely with the runtime and is Send + Sync. For tiny critical sections, parking_lot often still wins on raw throughput thanks to its tuned fast path.

“But the code looks the same!” (Not quite.)

If you peek at the conceptual fast-path logic, both do the same thing: “try to flip a small piece of state atomically; if that fails, go to a slow path”. For illustration:

// parking_lot::Mutex (concept)
fn lock(&self) {
    if self.state.compare_exchange_weak(UNLOCKED, LOCKED, Acquire, Relaxed).is_err() {
        self.lock_slow(None);
    }
    // deadlock tracking, etc.
}

// std::sync::Mutex (concept)
fn lock(&self) {
    if self.state.compare_exchange_weak(UNLOCKED, LOCKED, Acquire, Relaxed).is_err() {
        self.lock_slow(None); // may park via futex on Linux, etc.
    }
}

Where you’ll feel a difference is often in the unlock/slow-path behavior and wakeup strategy:

// parking_lot::Mutex (concept)
fn unlock(&self) {
    if self.state.compare_exchange(LOCKED, UNLOCKED, Release, Relaxed).is_ok() {
        return; // nobody waiting
    }
    self.unlock_slow(false); // custom park/unpark logic tuned for fast handoff
}

// std::sync::Mutex (concept)
fn unlock(&self) {
    // Uses OS integration (e.g., futex on Linux) to wake waiters.
    // Typically wakes one waiter; fairness/contended behavior differs.
    self.wake();
}

Note: The above is illustrative. std::sync::Mutex doesn’t implement lock_api::RawMutex publicly, and each platform/runtime has its own details. The takeaway is that parking_lot optimizes the fast path and parking/unparking strategy aggressively, while std integrates with OS primitives and fairness/debugging guarantees.

Our five implementations

We’ll run the same workload five ways: std::Mutex, tokio::Mutex, parking_lot::Mutex, plus a basic lock-free counter and a batched lock-free counter.

// Assumed imports/aliases for the snippets below:
use std::sync::{Arc, Mutex as StdMutex};
use std::sync::atomic::{AtomicU64, Ordering};
use parking_lot::Mutex as ParkingLotMutex;
use tokio::sync::Mutex as TokioMutex;
use tokio::task::JoinSet;

pub async fn run_std_mutex(n_tasks: usize, iters_per_task: usize) -> u64 {
    let counter = Arc::new(StdMutex::new(0u64));
    let mut set = JoinSet::new();

    for _ in 0..n_tasks {
        let c = Arc::clone(&counter);
        set.spawn(async move {
            for _ in 0..iters_per_task {
                let mut guard = c.lock().unwrap();
                *guard += 1;
            }
        });
    }

    while set.join_next().await.is_some() {}
    *counter.lock().unwrap()
}

pub async fn run_tokio_mutex(n_tasks: usize, iters_per_task: usize) -> u64 {
    let counter = Arc::new(TokioMutex::new(0u64));
    let mut set = JoinSet::new();

    for _ in 0..n_tasks {
        let c = Arc::clone(&counter);
        set.spawn(async move {
            for _ in 0..iters_per_task {
                let mut guard = c.lock().await;
                *guard += 1;
            }
        });
    }

    while set.join_next().await.is_some() {}
    *counter.lock().await
}

pub async fn run_parking_lot_mutex(n_tasks: usize, iters_per_task: usize) -> u64 {
    let counter = Arc::new(ParkingLotMutex::new(0u64));
    let mut set = JoinSet::new();

    for _ in 0..n_tasks {
        let c = Arc::clone(&counter);
        set.spawn(async move {
            for _ in 0..iters_per_task {
                let mut guard = c.lock();
                *guard += 1;
            }
        });
    }

    while set.join_next().await.is_some() {}
    *counter.lock()
}

pub async fn run_atomic(n_tasks: usize, iters_per_task: usize) -> u64 {
    let counter = Arc::new(AtomicU64::new(0));
    let mut set = JoinSet::new();

    for _ in 0..n_tasks {
        let c = Arc::clone(&counter);
        set.spawn(async move {
            for _ in 0..iters_per_task {
                c.fetch_add(1, Ordering::Relaxed);
            }
        });
    }

    while set.join_next().await.is_some() {}
    counter.load(Ordering::Relaxed)
}

pub async fn run_atomic_batched(n_tasks: usize, iters_per_task: usize, batch: usize) -> u64 {
    let global = Arc::new(AtomicU64::new(0));
    let mut set = JoinSet::new();

    for _ in 0..n_tasks {
        let g = global.clone();
        set.spawn(async move {
            let mut local = 0u64;
            for i in 0..iters_per_task {
                local += 1;
                if i % batch == batch - 1 {
                    g.fetch_add(local, Ordering::Relaxed);
                    local = 0;
                }
            }
            if local != 0 {
                g.fetch_add(local, Ordering::Relaxed);
            }
        });
    }

    while set.join_next().await.is_some() {}
    global.load(Ordering::Relaxed)
}

“Lock-free is magic, right?” — well…

To me (still happily exploring the deeper rabbit holes), lock-free always feels a bit magical — surely it’s the top-performance endgame, right? …and yet, not always. As we just saw, a single hot AtomicU64 can thrash the cache line across cores (MESI ping-pong), and the raw “no lock” approach can underperform a well-tuned mutex under heavy contention. Add in things like retries (CAS loops), fences, memory model costs, and you can absolutely lose.
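To make the “retries (CAS loops)” point concrete, here’s a hypothetical lock-free operation — a saturating increment, which can’t be expressed as a single fetch_add — built on a CAS retry loop. Under contention, every failed compare_exchange_weak is wasted work that a mutex wouldn’t have done:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Increments `counter` but never past `max`.
fn saturating_incr(counter: &AtomicU64, max: u64) -> u64 {
    let mut cur = counter.load(Ordering::Relaxed);
    loop {
        let next = cur.saturating_add(1).min(max);
        match counter.compare_exchange_weak(cur, next, Ordering::Relaxed, Ordering::Relaxed) {
            Ok(_) => return next,
            // Another thread won the race: retry with the fresh value.
            Err(actual) => cur = actual,
        }
    }
}

fn main() {
    let c = AtomicU64::new(9);
    assert_eq!(saturating_incr(&c, 10), 10);
    assert_eq!(saturating_incr(&c, 10), 10); // already saturated
}
```

Each retry re-reads the (possibly remote) cache line, which is exactly the coherence traffic that makes naive lock-free code lose under heavy contention.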

But! If you shard smartly and aggregate at the end, even in this toy example you can hit ridiculous speeds. Which brings us to the numbers.
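One std-only way to “shard smartly”: give each thread its own cache-line-padded counter and sum at the end (the 64-byte alignment is an assumption about the target CPU’s cache line, and `sharded_count` is a made-up helper):

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

// Pad each shard so two counters never share a cache line.
#[repr(align(64))]
struct Shard(AtomicU64);

fn sharded_count(n_threads: usize, iters: u64) -> u64 {
    let shards: Arc<Vec<Shard>> =
        Arc::new((0..n_threads).map(|_| Shard(AtomicU64::new(0))).collect());

    let mut handles = Vec::new();
    for t in 0..n_threads {
        let s = Arc::clone(&shards);
        handles.push(thread::spawn(move || {
            for _ in 0..iters {
                // Each thread touches only its own line: no coherence ping-pong.
                s[t].0.fetch_add(1, Ordering::Relaxed);
            }
        }));
    }
    for h in handles {
        h.join().unwrap();
    }
    // Aggregate once at the end.
    shards.iter().map(|s| s.0.load(Ordering::Relaxed)).sum()
}

fn main() {
    assert_eq!(sharded_count(4, 100_000), 400_000);
}
```

This is the same idea as the batched variant above, just spatial (per-thread shards) instead of temporal (flush every N increments).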

Benchmark results (this run)

  • StdMutex: 191.31683 ms
  • TokioMutex: 1107.0491 ms
  • ParkingLotMutex: 183.7413 ms
  • Atomic: 547.4252 ms
  • AtomicBatched: 1.621083 ms 🤯

A few fun observations:

  • Somewhat surprisingly, parking_lot wasn’t always faster than std. That’s plausible:
    • std has seen improvements over time,
    • compilers/runtime scheduling evolve,
    • microbench noise and hardware topology matter.
  • Differences become more visible with more complex data structures, where lock-free approaches get tricky (or outright impractical). In those cases, techniques like snapshot-swap (e.g., ArcSwap) can be cheaper and easier than rolling your own complex lock-free machinery.

  • And yes, batched atomics crush the simple atomic counter because they massively reduce coherence traffic.

When should you care?

Truth be told, most folks don’t need this level of tuning on most days — even in the Rust community. Rust is already blazingly fast compared to most ecosystems; people love to juxtapose it with C for a reason. But it’s still worth pushing the envelope, learning new tricks, and keeping the edge sharp — especially if we want Rust to keep winning in domains like game engines, renderers, HFT, and other performance-critical software.

TL;DR

  • Short, await-free critical sections: prefer parking_lot::Mutex (often faster) — or, even better, avoid shared mutation altogether.
  • You must await inside the critical section: use tokio::Mutex (or refactor so the await happens outside the lock).
  • Single hot atomic under high contention? Expect cache ping-pong and disappointing scaling.
  • Lock-free done right: shard and/or batch to reduce coherence traffic.
  • Don’t forget pragmatic tools like ArcSwap for low-overhead snapshotting of complex shared data.