Lock-Free Counters and When They're Enough
How x86-64 atomic instructions can replace mutexes for simple shared counters, and where the pattern breaks down. · 5 min read
When shared mutable state is a single monotonic counter, x86-64 hardware provides a simpler option than mutexes: LOCK XADD. This is an atomic read-modify-write instruction that completes in roughly 5-10 nanoseconds, requires no kernel involvement, and guarantees correctness without any locking primitives. For the right workload it replaces a mutex, a condition variable, and a futex all at once.
How It Works
- LOCK prefix: tells the CPU to lock the cache line for the duration of the instruction. Other cores see either the old value or the new one, never a partial update.
- XADD: exchanges and adds in a single operation. The old value goes into the source register, the sum goes into memory. Both the increment and the read happen atomically.
- Aligned reads: on x86-64, a naturally aligned 64-bit load is atomic by specification. A consumer thread can read the counter at any time without synchronization and will always see a valid value.
The producer increments with LOCK XADD. The consumer reads with a plain MOV. Neither thread blocks.
Where It Fits
- Progress tracking: a work loop increments a counter; a render thread reads it periodically. The counter only goes up, so stale reads are harmless and the display catches up on the next poll.
- Statistics collection: request counters, byte counters, error counters. Readers tolerate slightly stale values because they’re sampling anyway.
- Reference counting: increment and decrement are both atomic. The last decrementer (the one whose XADD returns 1 for a decrement) handles cleanup.
Where It Breaks Down
- Multiple fields that must be consistent: if you need to update a counter and a timestamp together, a single atomic instruction covers one of them. Protecting both requires either a wider atomic (CMPXCHG16B for 128 bits) or a real lock.
- Ordering guarantees beyond the counter:
LOCK XADDis a full memory barrier on x86-64, but relying on that for ordering other memory accesses ties your correctness to one platform’s memory model. - Contention at scale: the LOCK prefix serialises access to a cache line. Two cores contending is fine. Thirty-two cores hammering the same counter will bottleneck on cache coherence traffic.
- Non-monotonic updates: if two threads both read and conditionally write, you need compare-and-swap (
LOCK CMPXCHG), not add. Atomic increment only covers the case where every update moves in one direction.
The Practical Threshold
A mutex acquire-release cycle on Linux costs roughly 25-50 nanoseconds uncontended, plus the risk of blocking if another thread holds it. LOCK XADD costs 5-10 nanoseconds and never blocks. For a counter hit millions of times per second in a hot loop, that difference compounds. For a counter hit a few hundred times per second, it doesn’t matter. Pick based on your actual update frequency.