How to experiment with cache coherence (MESI) and cache eviction across cores using shared memory?

Incident Report: Uncontrolled Cache Thrashing During MESI Protocol Experiment

Summary

A cache-coherence experiment caused severe service degradation due to uncontrolled cache thrashing and false sharing in a shared-memory region. The experiment pinned processes to different cores and measured memory latency via rdtsc, but inadvertently triggered L1-cache saturation and core-to-core coherence stalls lasting up to 150 ms, affecting co-hosted services.

Root Cause

The performance degradation was caused by:

  • Unrestrained cache thrashing due to uncontrolled access patterns
  • False sharing via concurrent access to the same cache line(s)
  • Missing isolation mechanisms for pinned core experiments
  • Lack of guard rails around cache-flush instructions (clflush)

Why This Happens in Real Systems

Cache coherence/eviction pitfalls manifest in production due to:

  • Undisciplined shared-memory access patterns causing coherence storms
  • Unbounded thrashing loops in performance-critical code paths
  • Rigid core pinning amplifying NUMA/cross-core latency penalties
  • Assumptions about cache-line states without hardware verification
  • Uncontrolled cache pressure from greedy eviction patterns

Real-World Impact

The experiment triggered cascading effects:

  • 30x latency spikes in adjacent tenant workloads (P99 from 5ms → 150ms)
  • CPU steal time surges on victim cores due to MESI state transitions
  • L3 cache saturation (97% utilization) starving other processes
  • False sharing penalties costing ~150 cycles per coherence ping-pong
  • RDMA packet loss from PCIe saturation during cache-line flushes

Example or Code

// Flawed experiment snippet causing thrashing
volatile uint64_t* shared_data = mmap(...); // Map shared memory

// Core 1: intentional L1 eviction via forced misses
void thrash_cache(volatile uint64_t* target) {
    uint64_t buffer[512 * 1024 / sizeof(uint64_t)]; // 512 KiB eviction buffer
    for (size_t i = 0; i < sizeof(buffer) / sizeof(uint64_t); i += 64 / sizeof(uint64_t)) {
        buffer[i]++; // Touch one word per 64-byte cache line
    }
    // BUG: no serializing barrier (lfence/cpuid) before rdtsc
    uint64_t t0 = __rdtsc();
    __atomic_store_n(target, 0, __ATOMIC_RELAXED); // Uncontrolled write to the shared line
    uint64_t delta = __rdtsc() - t0;
    (void)delta; // BUG: measurement never validated
    // BUG: also runs on a shared core without isolation
}

How Senior Engineers Fix It

10 mitigation strategies:

  1. Isolate experiments with cset shield cores or dedicate entire NUMA nodes
  2. Use controlled thrashing via bounded stride patterns (e.g., N*cache_line_size ± jitter)
  3. Prevent false sharing with struct { uint64_t value; uint8_t padding[56]; } aligned;
  4. Verify cache-state transitions with Intel PCM instead of local timing only
  5. Replace raw clflush with bounded cache-pressure sequences:
    for (int i = a; i < b; i += CACHELINE) cold_buffer[i] = 0;

  6. Validate core pinning via sched_getcpu() audits
  7.