Spinlock Replacement Slows High‑Throughput Circuit Simulator

Summary

During the development of a high-throughput circuit simulator, a performance regression occurred when transitioning from a condition variable (condvar) signaling mechanism to a lock-free spinlock architecture. While the initial implementation was bottlenecked by OS scheduler overhead (200 kHz), the lock-free approach collapsed performance by orders of magnitude when scaling to multiple cores, despite showing incredible single-threaded speeds. The investigation revealed that the system was suffering from cache line contention and resource starvation caused by tight spinning on shared atomic flags.

Root Cause

The failure stemmed from two primary architectural flaws in the lock-free implementation:

CPU Starvation via Tight Spinning: The worker threads were executing a while loop that performed constant atomic polls on jobs[nproc].wait. Because these threads were running on the same physical cores/sockets as the main thread, they consumed 100% of the execution pipeline cycles, preventing the main thread from ever getting enough instruction throughput to actually distribute the work.
Cache Line Bouncing (Coherency Traffic): The use of usf_atmflagclr and usf_atmflagtry on shared memory locations forced the CPU to constantly invalidate cache lines across all cores. This MESI protocol traffic created a massive bottleneck on the memory bus, effectively turning a multi-core execution into a serialized, high-latency communication struggle.

Why This Happens in Real Systems

In high-performance computing, there is a common misconception that “waiting without blocking” is always faster than “waiting with a syscall.” In reality, modern OS schedulers are highly optimized. When you use a condvar, the kernel puts the thread into a WAITING state, freeing the hardware execution units for other tasks. When you replace this with a busy-wait loop, you aren’t just “waiting”; you are actively competing for:

Instruction Issue Slots: The spinning thread is using the CPU’s front-end to fetch and decode the same loop instructions repeatedly.
Memory Bandwidth: Constant atomic operations require the core to assert ownership of a cache line, forcing every other core to stall their local caches.
Thermal/Power Budgets: Tight loops trigger higher clock frequencies and heat, which can lead to thermal throttling of the entire package, slowing down the very thread trying to do the productive work.

Real-World Impact

Negative Scaling: Adding more workers actually decreased total throughput, a phenomenon known as negative scaling.
Latency Spikes: The time required to complete a single simulation step jumped from microseconds to milliseconds.
Unpredictable Jitter: Because the performance depended on which thread “won” the race for the cache line, the simulation timing became non-deterministic.

Example or Code

// The problematic worker loop
static usf_compatibility_int startworker(void *n) {
    u64 nproc = (u64) n;
    for (;;) {
        // CRITICAL ERROR: This tight loop starves the main thread 
        // and creates massive cache coherency traffic.
        while (usf_atmflagtry(&jobs[nproc].wait, MEMORDER_ACQ_REL)); 

        switch (jobs[nproc].flag) {
            case UPDATE: update(nproc); break;
            case PROPAGATE: propagate(nproc); break;
            case TERMINATE: return 0;
        }
        jobs[nproc].flag = WAIT;
        usf_atmsubi(&running_, 1, MEMORDER_ACQ_REL);
    }
}

How Senior Engineers Fix It

A senior engineer approaches this by implementing adaptive waiting strategies that balance latency and throughput:

Exponential Backoff: Instead of a pure while(atomic), use a loop that spins a few times, then executes a pause instruction (on x86), and eventually yields to the OS using sched_yield() or a short sleep if the wait persists.
The PAUSE Instruction: Inserting the _mm_pause() intrinsic in the spin loop signals the CPU that it is in a spin-wait, reducing power consumption and preventing the “pipeline flush” penalty when the lock is finally acquired.
Cache Line Padding: Ensure that each Slice or job struct is padded to the size of a cache line (typically 64 bytes) to prevent False Sharing, where writing to one thread’s flag inadvertently invalidates the cache of a neighboring thread.
Hybrid Signaling: Use a mechanism where threads spin for a very brief period (to catch low-latency tasks) but fallback to a futex or condvar if the work doesn’t arrive immediately.

Why Juniors Miss It

Juniors often fall into the “Micro-benchmark Trap”:

Focusing on Instruction Count: They see that a spinlock has fewer instructions than a syscall and assume it must be faster. They fail to account for the system-wide cost of cache invalidation.
Single-Core Bias: They often test logic on a single core where contention is impossible, leading to a false sense of security regarding the “speed” of the lock-free logic.
Ignoring the Hardware: They treat the CPU as an abstract machine that executes instructions perfectly, rather than a complex hierarchy of caches, interconnects, and schedulers that must coordinate to make those instructions run.