Incident Report: Uncontrolled Cache Thrashing During MESI Protocol Experiment
Summary
A cache-coherence experiment caused severe service degradation due to uncontrolled cache thrashing and false sharing in a shared memory region. The experiment pinned processes to different cores and measured memory latency via rdtsc, but inadvertently triggered L1-cache saturation and core-to-core coherence stalls lasting up to 150 ms, affecting co-hosted services.
Root Cause
The performance degradation was caused by:
- Unrestrained cache thrashing driven by unbounded access patterns
- False sharing via concurrent writes to the same cache line(s)
- Missing isolation mechanisms for core-pinned experiments
- Lack of guard rails around cache-flush instructions (clflush)
Why This Happens in Real Systems
Cache coherence/eviction pitfalls manifest in production due to:
- Undisciplined shared-memory access patterns causing coherence storms
- Unbounded thrashing loops in performance-critical code paths
- Rigid core pinning amplifying NUMA/cross-core latency penalties
- Assumptions about cache-line states without hardware verification
- Uncontrolled cache pressure from greedy eviction patterns
Real-World Impact
The experiment triggered cascading effects:
- 30x latency spikes in adjacent tenant workloads (P99 from 5ms → 150ms)
- CPU steal time surges on victim cores due to MESI state transitions
- L3 cache saturation (97% utilization) starving other processes
- False sharing penalties costing ~150 cycles per coherence ping-pong
- RDMA packet loss from PCIe saturation during cache-line flushes
Example
// Flawed experiment snippet causing thrashing
volatile uint64_t* shared_data = mmap(...); // Map shared memory

// Core 1: intentional L1 eviction via forced misses
void thrash_cache(volatile uint64_t* target) {
    uint64_t buffer[512 * 1024];              // 4 MiB buffer, far larger than L1
    for (size_t i = 0; i < sizeof(buffer) / 64; i++) {
        buffer[i * 8]++;                      // touch every 64-byte cache line (8 x uint64_t)
    }
    // BUG: no memory fence here, so the timed store can be reordered
    uint64_t t0 = __rdtsc();
    atomic_store_explicit(target, 0, memory_order_relaxed); // uncontrolled access
    uint64_t delta = __rdtsc() - t0;
    // BUG: also runs on a shared core without isolation
}
How Senior Engineers Fix It
Mitigation strategies:
- Isolate experiments with cset shield cores, or dedicate entire NUMA nodes
- Use controlled thrashing via bounded stride patterns (e.g., N*cache_line_size ± jitter)
- Prevent false sharing with padded structs: struct { uint64_t value; uint8_t padding[56]; } aligned;
- Verify cache-state transitions with Intel PCM instead of local timing only
- Replace raw clflush with bounded cache-pressure sequences: for (int i = a; i < b; i += CACHELINE) cold_buffer[i] = 0;
- Validate core pinning via sched_getcpu() audits