AWS RDS PostgreSQL – WriteLatency spikes every 3 hours

Summary

A PostgreSQL 14.17 instance on AWS RDS experienced strictly periodic WriteLatency spikes every 3 hours (12:10 PM, 3:10 PM, 6:10 PM, and so on). The issue began around December 8, 2025. Despite the latency spikes, WriteIOPS and WriteThroughput remained flat, while DiskQueueDepth correlated 1:1 with the spikes.

The root cause was identified as a checkpoint storm triggered by an AWS maintenance event. An unannounced minor version upgrade or underlying infrastructure patch likely reset the max_wal_size configuration or altered the checkpoint tuning parameters. The result was a time-based checkpoint firing exactly every 3 hours (instead of the default checkpoint_timeout of 5 minutes), which briefly saturated the volume's write capacity and queue, causing latency spikes without any corresponding increase in logical write throughput.

Root Cause

The issue stems from a misconfigured Checkpoint interval following an AWS infrastructure change.

  • Periodic Checkpoint Storms: The pg_stat_bgwriter data shows checkpoints_timed (268,058) vastly outweighing checkpoints_req (16). This indicates time-based checkpoints are the primary mechanism, not demand-based writes.
  • Unusual Interval: The 3-hour periodicity (12:10 PM, 3:10 PM, 6:10 PM) is non-standard both for PostgreSQL (the checkpoint_timeout default is 5 minutes) and for AWS RDS defaults. This suggests a configuration parameter (checkpoint_timeout or max_wal_size) was modified or reset by the maintenance event to produce a 3-hour window.
  • IOPS Saturation: While the provisioned 3,000 IOPS (gp3) is ample on average, a checkpoint flushes a large burst of dirty buffers to disk all at once. That burst momentarily exceeds what the EBS volume can service, backing up the device queue; hence the 1:1 correlation between DiskQueueDepth and WriteLatency.
  • BGWriter Inefficiency: The pg_stat_bgwriter shows buffers_clean = 0. The background writer is completely idle. This forces the checkpointer process to handle all dirty buffer writes during the checkpoint event. This creates a massive, instantaneous I/O spike rather than a smoothed-out load distributed over time by the background writer.
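The arithmetic behind the burst can be sketched with illustrative numbers (the dirty-buffer count and write window below are assumptions, not measurements from this instance; only the 3,000 IOPS figure comes from the incident):

```python
# Illustrative sketch: why flushing 3 hours of accumulated dirty buffers
# at once saturates a volume that looks idle on average. All figures
# except PROVISIONED_IOPS are assumed for illustration.

DIRTY_BUFFERS = 400_000            # assumed dirty 8 KiB pages accumulated in 3 h (~3.1 GiB)
CHECKPOINT_SPREAD_SEC = 60         # assumed effective write window of the checkpoint
PROVISIONED_IOPS = 3000            # gp3 baseline from the incident

# IOPS demanded by the checkpoint burst (one write per dirty page, worst case)
burst_iops = DIRTY_BUFFERS / CHECKPOINT_SPREAD_SEC          # ~6,667 IOPS

# The same work smoothed over the full 3-hour interval by the background writer
smoothed_iops = DIRTY_BUFFERS / (3 * 3600)                  # ~37 IOPS

print(f"burst demand:    {burst_iops:,.0f} IOPS (vs {PROVISIONED_IOPS} provisioned)")
print(f"smoothed demand: {smoothed_iops:,.1f} IOPS")
```

The point is the ratio, not the exact numbers: identical write volume either doubles the volume's capacity for a minute or disappears into the baseline when spread out.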

Why This Happens in Real Systems

In real-world systems, latency spikes without throughput increases are almost always burst saturation events.

  1. AWS Maintenance & Configuration Drift: AWS RDS occasionally performs rolling updates or maintenance. While minor, these events can sometimes reset custom parameter groups to default engine parameters or subtly alter the underlying storage performance characteristics.
  2. The “Idle DB” Paradox: PostgreSQL optimizes for data safety. If a database is idle or has low write traffic, the checkpointer consolidates writes to minimize fsync calls. However, if the checkpoint_timeout is set too high (e.g., 3 hours), the amount of data to flush becomes massive.
  3. Storage Burst Balance: gp2 volumes have a baseline IOPS and a burst-credit bucket, and a large, infrequent checkpoint can drain that bucket instantly. gp3 volumes have no burst bucket, but the same failure mode applies at the provisioned baseline: even if average IOPS is low, the peak IOPS during the flush hits the volume's hard limit, causing latency to skyrocket.
  4. Resource Contention: During a heavy checkpoint, the checkpointer process competes heavily for I/O bandwidth. If the application tries to write simultaneously, those writes are blocked behind the checkpoint I/O, causing the observed latency spike.
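The 1:1 DiskQueueDepth/WriteLatency correlation falls out of basic queueing math (Little's law). A minimal sketch, assuming a volume that completes writes at its 3,000 IOPS limit during the burst:

```python
# Minimal queueing sketch: once the volume is saturated, the wait a new
# write experiences is roughly the queue depth divided by the service
# rate. Queue depths below are illustrative assumptions.

SERVICE_RATE_IOPS = 3000           # assumed write completions/sec at the volume's limit

def write_latency_ms(queue_depth: int) -> float:
    """Approximate time a new write waits behind the queue, in ms."""
    return queue_depth / SERVICE_RATE_IOPS * 1000

print(write_latency_ms(2))     # near-idle queue: ~0.7 ms
print(write_latency_ms(60))    # checkpoint burst queues up: ~20 ms
```

This is why the two CloudWatch metrics move together: latency is queue depth divided by a (temporarily fixed) service rate.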

Real-World Impact

  • Application Timeouts: Latency spikes can cause application-level timeouts, particularly for synchronous writes or transactions requiring immediate persistence.
  • Replication Lag: If replication is enabled (e.g., read replicas), heavy checkpoints can induce temporary lag: full-page writes immediately after a checkpoint inflate WAL volume, and the checkpoint I/O competes with WAL flushing and the WAL sender.
  • User Experience: For user-facing applications, these spikes manifest as “hiccups”—requests taking 2-5 seconds instead of milliseconds—at strictly predictable intervals.
  • Monitoring Noise: The strict periodicity creates false positives in monitoring alerts, causing alert fatigue for the operations team.
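Until the root cause is fixed, the strict periodicity can at least be used against the alert noise. A hedged sketch (the 5-minute window and routing decision are assumptions, not part of the incident):

```python
# Sketch for the "monitoring noise" point: if spikes are known to land
# every 3 hours at :10 past the hour, alerts inside a short window around
# those times can be routed to a low-urgency channel instead of paging.

from datetime import datetime, timedelta

SPIKE_MINUTE = 10          # observed spikes at 12:10, 15:10, 18:10, ...
PERIOD_HOURS = 3
WINDOW = timedelta(minutes=5)   # assumed tolerance around each predicted spike

def in_expected_spike_window(ts: datetime) -> bool:
    """True if ts falls within WINDOW of a predicted checkpoint spike."""
    anchor = ts.replace(hour=(ts.hour // PERIOD_HOURS) * PERIOD_HOURS,
                        minute=SPIKE_MINUTE, second=0, microsecond=0)
    return any(abs(ts - (anchor + timedelta(hours=h))) <= WINDOW
               for h in (-PERIOD_HOURS, 0, PERIOD_HOURS))

print(in_expected_spike_window(datetime(2025, 12, 9, 15, 12)))  # True
print(in_expected_spike_window(datetime(2025, 12, 9, 14, 0)))   # False
```

Suppression is a stopgap, not a fix; it only buys quiet while the checkpoint tuning is corrected.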

Example or Code

The following SQL queries help diagnose whether a checkpoint storm is occurring and whether the background writer is being utilized. Run them during a latency spike, or immediately after, to analyze the pg_stat_bgwriter view.

-- 1. Check current checkpoint configuration
-- Look for 'checkpoint_timeout', 'checkpoint_completion_target',
-- 'max_wal_size', and 'min_wal_size'
SELECT name, setting, unit, context 
FROM pg_settings 
WHERE name LIKE '%checkpoint%'
   OR name IN ('max_wal_size', 'min_wal_size');

-- 2. Analyze BGWriter stats (pre-17 layout; in PostgreSQL 17+ the
-- checkpoint columns moved to pg_stat_checkpointer)
-- High 'buffers_checkpoint' with zero 'buffers_clean' indicates
-- the BGWriter is idle or too slow, forcing the Checkpointer to spike.
SELECT 
    checkpoints_timed,
    checkpoints_req,
    round(checkpoint_write_time / 1000) AS checkpoint_write_time_sec,
    round(checkpoint_sync_time / 1000) AS checkpoint_sync_time_sec,
    buffers_checkpoint,
    buffers_clean,
    maxwritten_clean,
    buffers_backend,
    (now() - stats_reset)
        / nullif(checkpoints_timed + checkpoints_req, 0) AS avg_checkpoint_interval
FROM pg_stat_bgwriter;

-- 3. Monitor the checkpointer process during a spike
-- The checkpointer has no 'query' text in pg_stat_activity;
-- match it by backend_type instead
SELECT pid, backend_type, wait_event_type, wait_event, state 
FROM pg_stat_activity 
WHERE backend_type = 'checkpointer'
   OR wait_event_type = 'IO';

How Senior Engineers Fix It

Senior engineers approach this by smoothing out the I/O burst and tuning the background writer.

  1. Tune the Background Writer: The immediate fix is to prevent the checkpointer from doing all the work. We enable and tune the background writer to clean buffers gradually.

    • Action: Update the RDS Parameter Group.
    • Settings: Decrease bgwriter_delay (e.g., from the 200 ms default to 50 ms) and increase bgwriter_lru_maxpages (e.g., from the default 100 to 1000) so the BG writer flushes dirty buffers proactively between checkpoints; bgwriter_lru_multiplier can also be raised.
  2. Adjust Checkpoint Frequency: We need to reduce the checkpoint interval to prevent the accumulation of too much data.

    • Action: Lower checkpoint_timeout (e.g., to 15 minutes) and/or lower max_wal_size, and confirm checkpoint_completion_target is still at its 0.9 default so each checkpoint's writes are spread across most of the interval.
    • Result: This creates more frequent, smaller checkpoints rather than one massive spike every 3 hours.
  3. Verify AWS Parameter Group: Inspect the RDS Parameter Group to ensure shared_preload_libraries and other critical settings haven’t been reset by the maintenance event. Ensure the custom settings are actually applied and “in-sync”.

  4. Storage Optimization: If the burst balance is draining, consider upgrading to GP3 with higher baseline IOPS or migrating to IO1/IO2 for consistent performance, though tuning the database usually solves the issue without cost increases.
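A quick sanity check on step 1: the background writer's flush ceiling is bounded by bgwriter_lru_maxpages pages per round and one round per bgwriter_delay. The settings below mirror the suggested tuning; the assumed dirty-page rate is illustrative:

```python
# Back-of-the-envelope check that the tuned background writer can keep up
# with dirty-page production. The bgwriter wakes every bgwriter_delay ms
# and writes at most bgwriter_lru_maxpages pages per round.

def bgwriter_max_pages_per_sec(delay_ms: int, lru_maxpages: int) -> float:
    return 1000 / delay_ms * lru_maxpages

default_rate = bgwriter_max_pages_per_sec(200, 100)    # PostgreSQL defaults: 500 pages/s
tuned_rate   = bgwriter_max_pages_per_sec(50, 1000)    # suggested tuning: 20,000 pages/s

ASSUMED_DIRTY_PAGES_PER_SEC = 2000   # assumed steady dirty-page production

print(f"default ceiling: {default_rate:,.0f} pages/s")
print(f"tuned ceiling:   {tuned_rate:,.0f} pages/s "
      f"(covers assumed load of {ASSUMED_DIRTY_PAGES_PER_SEC}/s)")
```

If the measured dirty-page rate sits above the default 500 pages/s ceiling, the checkpointer inevitably inherits the backlog, which is exactly the spike pattern observed here.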

Why Juniors Miss It

Junior engineers often misdiagnose this issue because the symptoms contradict standard capacity planning logic.

  • Misinterpreting Averages: They see low average IOPS and assume the storage is fine. They miss the burst peak visible only in high-resolution metrics (1-minute granularity).
  • Focusing on Application Code: Since the spikes are periodic and not load-driven, juniors often hunt for a cron job or scheduled task in the application layer (which doesn’t exist).
  • Ignoring pg_stat_bgwriter: The pg_stat_bgwriter view is often overlooked. Juniors see high checkpoints_timed and assume it’s normal behavior, failing to recognize that buffers_clean = 0 is a critical red flag indicating a lack of background maintenance.
  • AWS Defaults Assumption: They assume AWS RDS defaults are optimal. They don’t realize that an RDS upgrade might revert a custom checkpoint_timeout (e.g., 15min) back to the PostgreSQL default (5min) or, in this case, a strange 3-hour interval caused by a specific patch or configuration merge.
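The "misinterpreting averages" trap above is easy to demonstrate with made-up per-second samples (all values illustrative):

```python
# Sketch of how a 1-minute average hides a 10-second checkpoint burst:
# 50 quiet seconds at 100 IOPS, then a 10-second burst at 6,000 IOPS.
per_second_iops = [100] * 50 + [6000] * 10

avg_iops  = sum(per_second_iops) / len(per_second_iops)   # what a 1-min chart shows
peak_iops = max(per_second_iops)                          # what the disk actually saw

print(f"1-min average: {avg_iops:,.0f} IOPS")   # ~1,083: looks fine vs 3,000 provisioned
print(f"peak:          {peak_iops:,} IOPS")     # 6,000: saturates the volume
```

The dashboard average sits comfortably under the provisioned limit while the instantaneous peak blows through it, which is why sub-minute granularity (or Enhanced Monitoring) matters for this class of incident.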