How to experiment with cache coherence (MESI) and cache eviction across cores using shared memory?

Incident Report: Uncontrolled Cache Thrashing During MESI Protocol Experiment

Summary

A cache-coherence experiment caused severe service degradation due to uncontrolled cache thrashing and false sharing in a shared-memory region. The experiment pinned processes to different cores and measured memory latency via rdtsc, but inadvertently triggered L1-cache saturation and core-to-core coherence stalls lasting up to 150 ms, affecting co-hosted services.

Root Cause

The performance degradation was caused by:

  • Unrestrained cache thrashing due to uncontrolled access patterns
  • False sharing via concurrent access to the same cache line(s)
  • Missing isolation mechanisms for pinned core experiments
  • Lack of guard rails around cache-flush instructions (clflush)

Why This Happens in Real Systems

Cache coherence/eviction pitfalls manifest in production due to:

  • Undisciplined shared-memory access patterns causing coherence storms
  • Unbounded thrashing loops in performance-critical code paths
  • Rigid core pinning amplifying NUMA/cross-core latency penalties
  • Assumptions about cache-line states without hardware verification
  • Uncontrolled cache pressure from greedy eviction patterns

Real-World Impact

The experiment triggered cascading effects:

  • 30x latency spikes in adjacent tenant workloads (P99 from 5ms → 150ms)
  • CPU steal time surges on victim cores due to MESI state transitions
  • L3 cache saturation (97% utilization) starving other processes
  • False sharing penalties costing ~150 cycles per coherence ping-pong
  • RDMA packet loss from PCIe saturation during cache-line flushes

Example or Code

// Flawed experiment snippet causing thrashing
volatile uint64_t* shared_data = mmap(...); // Map shared memory

// Core 1: intentional L1 eviction via forced misses
void thrash_cache(volatile uint64_t* target) {
    uint64_t buffer[512 * 1024 / sizeof(uint64_t)]; // 512 KiB eviction buffer
    for (size_t i = 0; i < sizeof(buffer) / sizeof(uint64_t); i += 64 / sizeof(uint64_t)) {
        buffer[i]++; // Touch one word per 64-byte cache line
    }
    // BUG: no serializing barrier (lfence/cpuid) before rdtsc
    uint64_t t0 = __rdtsc();
    __atomic_store_n(target, 0, __ATOMIC_RELAXED); // Uncontrolled write to the shared line
    uint64_t delta = __rdtsc() - t0;
    (void)delta; // BUG: measurement never validated
    // BUG: also runs on a shared core without isolation
}

How Senior Engineers Fix It

10 mitigation strategies:

  1. Isolate experiments with cset shield cores or dedicate entire NUMA nodes
  2. Use controlled thrashing via bounded stride patterns (e.g., N*cache_line_size ± jitter)
  3. Prevent false sharing with struct { uint64_t value; uint8_t padding[56]; } aligned;
  4. Verify cache-state transitions with Intel PCM instead of local timing only
  5. Replace raw clflush with bounded cache-pressure sequences:
    for (int i = a; i < b; i += CACHELINE) cold_buffer[i] = 0;

  6. Validate core pinning via sched_getcpu() audits
  7.