How to debug/mitigate RocksDB WAL sync stalls (P99.9 latency) on NVMe when disk I/O is not saturated?

Summary

We observed extreme P99.9 latency spikes (up to 500 ms) in a high-performance C++ application using RocksDB v8.5.x. The spikes occurred specifically during Write-Ahead Log (WAL) sync operations. Although the underlying NVMe RAID-0 array reported only 15% utilization (bandwidth and IOPS), application threads were stalling in rocksdb::WritableFileWriter::Sync. This gap between hardware utilization and application-level latency points to a bottleneck in the synchronization or I/O submission path rather than a capacity issue.

Root Cause

The root cause is serialization of I/O requests and kernel-level flush stalls, caused by the interaction between synchronous writes and the NVMe controller’s internal management.

Synchronous Write Blocking: When sync=true is set in WriteOptions, each write returns only after the WAL has been synced, which means waiting for the device to acknowledge that the data is on non-volatile media.
The Fsync/Sync Latency Spike: Even at low utilization, every fsync or fdatasync call makes the filesystem commit the metadata it needs and sends a flush command to the device. fdatasync skips metadata that is not required to read the data back, but a change in file size, which a WAL append normally causes, still has to be committed.
Controller-Level Garbage Collection: Modern NVMe drives (like the Samsung 980 Pro) perform internal background housekeeping and garbage collection (GC). A single sync can stall if it arrives while the controller is busy managing its internal NAND mapping tables.
The “Empty Pipe” Fallacy: Low utilization (15%) only says that little data is moving; it says nothing about how long each individual operation takes. The system is not throughput-bound; it is latency-bound by the round-trip time of the sync command.
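The latency-bound claim can be checked with simple arithmetic. The sketch below is illustrative only: the 1 ms sync round trip, 4 KiB write size, and 5 GB/s device bandwidth are assumed numbers, not measurements from this incident, and the function names are made up for the example.

```cpp
#include <cassert>
#include <cstdio>

// Upper bound on writes per second for ONE writer that waits for each
// sync to finish before issuing the next write.
double max_serial_writes_per_sec(double sync_latency_ms) {
    assert(sync_latency_ms > 0.0);
    return 1000.0 / sync_latency_ms;
}

// Fraction of the device's bandwidth that such a writer can consume.
double bandwidth_fraction(double writes_per_sec, double bytes_per_write,
                          double device_bytes_per_sec) {
    assert(device_bytes_per_sec > 0.0);
    return writes_per_sec * bytes_per_write / device_bytes_per_sec;
}

// Prints the bound for the assumed numbers described above.
void print_latency_bound_example() {
    const double sync_ms = 1.0;         // assumed round trip of one sync
    const double write_bytes = 4096.0;  // assumed size of one WAL append
    const double device_bps = 5e9;      // assumed device bandwidth (5 GB/s)
    const double wps = max_serial_writes_per_sec(sync_ms);
    const double frac = bandwidth_fraction(wps, write_bytes, device_bps);
    std::printf("%.0f writes/s, %.4f%% of device bandwidth\n",
                wps, frac * 100.0);
}
```

With these assumptions a single serialized writer tops out at 1,000 writes per second, which is under 0.1% of the device's bandwidth. The disk looks idle while the writer is fully limited by sync latency.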

Why This Happens in Real Systems

In production environments, hardware latency is rarely constant.

Non-Deterministic Hardware Latency: SSDs are not purely deterministic. Internal operations like wear leveling and bad block management occur asynchronously, causing sudden spikes in response time for synchronous commands.
The Cost of Durability: High durability guarantees (sync=true) force the writing thread to block until the physical hardware confirms the write. The CPU can run other threads in the meantime, but the request on that thread makes no progress.
Filesystem Metadata Contention: Even with noatime, the ext4 filesystem must ensure metadata consistency during a sync. A sync may have to wait for a journal commit, so other metadata activity on the same filesystem can add to the delay.
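One way to check whether the stalls come from the storage stack rather than from RocksDB is to time fdatasync directly on the same filesystem. The following is a minimal probe, assuming Linux/POSIX; the function name and the file path are hypothetical, and it measures only the sync call, not the preceding write.

```cpp
#include <cstddef>
#include <chrono>
#include <string>
#include <vector>

#include <fcntl.h>
#include <unistd.h>

// Appends `block_bytes` and calls fdatasync() `iterations` times.
// Returns the latency of each successful sync in microseconds; the vector
// is shorter than `iterations` (possibly empty) if an I/O call fails.
std::vector<double> probe_fdatasync_us(const std::string& path,
                                       std::size_t iterations,
                                       std::size_t block_bytes) {
    std::vector<double> latencies;
    const int fd = ::open(path.c_str(), O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) return latencies;

    const std::vector<char> block(block_bytes, 'x');
    for (std::size_t i = 0; i < iterations; ++i) {
        // Append one block, then time only the sync that makes it durable.
        if (::write(fd, block.data(), block.size()) !=
            static_cast<ssize_t>(block.size())) {
            break;
        }
        const auto start = std::chrono::steady_clock::now();
        if (::fdatasync(fd) != 0) break;
        const auto stop = std::chrono::steady_clock::now();
        latencies.push_back(
            std::chrono::duration<double, std::micro>(stop - start).count());
    }
    ::close(fd);
    ::unlink(path.c_str());  // remove the probe file
    return latencies;
}
```

Running this against a file on the WAL volume (for example tens of thousands of 4 KiB iterations) and looking at the maximum and upper percentiles shows whether multi-millisecond syncs occur without RocksDB in the picture.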

Real-World Impact

Application Throughput Collapse: While average latency stays low, the P99.9 spikes cause upstream timeouts in microservices.
Cascading Failures: If a database thread stalls for 500ms, it can cause connection pools to saturate, leading to a thread starvation death spiral across the entire service mesh.
Misleading Monitoring: Standard metrics (utilization %) show a “healthy” disk even though the storage layer is the problem (a false negative), leading SREs to hunt for bugs in the application logic rather than in the storage layer.
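The gap between the average and the tail is easy to reproduce with synthetic numbers. The sketch below uses made-up latencies (9,980 writes at 0.2 ms and 20 stalls at 500 ms) and a nearest-rank percentile; none of these values are measurements from the incident.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Nearest-rank percentile. `permille` is in [1, 1000], so 999 means P99.9.
double percentile_permille(std::vector<double> samples, std::size_t permille) {
    assert(!samples.empty() && permille >= 1 && permille <= 1000);
    std::sort(samples.begin(), samples.end());
    // rank = ceil(n * permille / 1000), computed with integers.
    const std::size_t rank = (samples.size() * permille + 999) / 1000;
    return samples[rank - 1];
}

double mean(const std::vector<double>& samples) {
    assert(!samples.empty());
    return std::accumulate(samples.begin(), samples.end(), 0.0) /
           static_cast<double>(samples.size());
}

// Synthetic latencies in milliseconds: 9,980 fast writes and 20 stalls.
std::vector<double> synthetic_latencies_ms() {
    std::vector<double> v(9980, 0.2);
    v.insert(v.end(), 20, 500.0);
    return v;
}
```

For this data the mean is about 1.2 ms and both P50 and P99 are 0.2 ms, while P99.9 is 500 ms. A dashboard that tracks only averages or P99 would show nothing wrong.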

Example or Code

#include <rocksdb/db.h>
#include <rocksdb/write_batch.h>

// This pattern is the source of the P99.9 spikes: every call syncs the WAL.
void critical_write_path(rocksdb::WriteBatch& batch, rocksdb::DB* db) {
    rocksdb::WriteOptions write_options;
    // sync = true makes Write() return only after the WAL has been synced,
    // so the calling thread blocks until the NVMe controller confirms
    // the data is on physical media.
    write_options.sync = true;

    rocksdb::Status s = db->Write(write_options, &batch);
    if (!s.ok()) {
        // Handle error, e.g. log s.ToString() and fail the request.
    }
}

How Senior Engineers Fix It

Senior engineers move away from synchronous blocking toward grouping and batching to amortize the cost of the sync.

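Implement WAL Group Commit: Instead of syncing every single write, batch multiple writes together and perform one sync for the entire group. RocksDB already groups writers that call Write() concurrently so that one WAL sync covers the group; the application can help further by combining several logical operations into one WriteBatch before a sync=true write. This turns many high-latency small syncs into fewer, larger ones.
Use wal_bytes_per_sync: This DBOptions setting asks the OS to write back WAL data incrementally in the background as it accumulates, so that less dirty data is left for the final sync. It smooths I/O but is not a durability guarantee and does not replace sync=true. If the application can tolerate losing the most recent writes after a crash, the alternative is to write with sync=false and sync the WAL periodically (for example with DB::SyncWAL() or DB::FlushWAL(true)).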
Optimize Filesystem Mounts: Ensure the filesystem is tuned for the workload (e.g., confirm that noatime is active, or use data=writeback on ext4 if the application can tolerate files containing stale data after a crash).
Hardware-Level Tuning: Move to enterprise-grade NVMe drives with Power Loss Protection (PLP). Consumer drives generally lack PLP capacitors, so a flush has to reach NAND before it is acknowledged. With PLP, the drive can acknowledge a write as “persistent” as soon as it is in capacitor-backed DRAM, which significantly reduces sync latency.
Asynchronous I/O (io_uring): For advanced implementations, moving the I/O path to Linux io_uring lets application threads submit a sync and continue with other work instead of blocking in a syscall. The sync itself takes just as long, so this improves thread utilization rather than the durability latency of an individual write.
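The group commit idea from the list above can be sketched independently of RocksDB. In the example below, GroupCommitter, DemoResult, and run_group_commit_demo are hypothetical names, and the flush callback is a stand-in for "write the batch and sync once" (in a RocksDB application that could be a single sync=true Write() of a combined WriteBatch). Error handling is omitted: a flush callback that throws would leave waiters blocked.

```cpp
#include <chrono>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical helper (not a RocksDB API): many threads call commit(), and
// one of them flushes the whole pending group with a single durable flush.
class GroupCommitter {
public:
    using Batch = std::vector<std::string>;

    explicit GroupCommitter(std::function<void(const Batch&)> durable_flush)
        : flush_(std::move(durable_flush)) {}

    // Returns only after a flush that included `record` has completed.
    void commit(std::string record) {
        std::unique_lock<std::mutex> lock(mu_);
        pending_.push_back(std::move(record));
        const std::size_t my_group = open_group_;
        while (durable_group_ < my_group) {
            if (flushing_) {
                cv_.wait(lock);  // a leader is syncing; wait for it to finish
                continue;
            }
            // No flush in flight: become the leader for the open group.
            flushing_ = true;
            Batch batch;
            batch.swap(pending_);
            ++open_group_;  // later arrivals join the next group
            lock.unlock();
            flush_(batch);  // one sync covers every record in the batch
            lock.lock();
            flushing_ = false;
            durable_group_ = my_group;
            cv_.notify_all();
        }
    }

private:
    std::function<void(const Batch&)> flush_;
    std::mutex mu_;
    std::condition_variable cv_;
    Batch pending_;
    std::size_t open_group_ = 1;     // group currently accepting records
    std::size_t durable_group_ = 0;  // highest group already flushed
    bool flushing_ = false;
};

struct DemoResult {
    std::size_t records = 0;  // records seen by the flush callback
    std::size_t flushes = 0;  // number of flush calls
};

// Runs `writers` threads that each commit one record through a fake flush
// that sleeps 5 ms to imitate a slow sync.
DemoResult run_group_commit_demo(std::size_t writers) {
    DemoResult result;
    GroupCommitter committer([&result](const GroupCommitter::Batch& batch) {
        std::this_thread::sleep_for(std::chrono::milliseconds(5));
        result.records += batch.size();
        ++result.flushes;
    });
    std::vector<std::thread> threads;
    for (std::size_t i = 0; i < writers; ++i) {
        threads.emplace_back([&committer, i] {
            committer.commit("record-" + std::to_string(i));
        });
    }
    for (auto& t : threads) t.join();
    return result;
}
```

Calling run_group_commit_demo(8) always reports 8 records; the number of flushes depends on thread timing but is never more than 8 and is typically much lower, because writers that arrive while a flush is in flight share the next one. The trade-off is that an individual write may wait for the previous group's sync before its own group is flushed.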

Why Juniors Miss It

Focus on Throughput vs. Latency: Juniors often look at MB/s or IOPS metrics. They see “15% utilization” and conclude “the disk is fine,” failing to realize that low utilization does not bound tail latency.
Misunderstanding sync=true: They treat sync=true as a “set and forget” safety feature without measuring what a hardware round trip costs on the application’s critical path.
Ignoring Hardware Internals: They assume hardware is a “black box” that performs consistently, rather than a complex system with its own internal scheduling and garbage collection cycles.