Summary
A PostgreSQL 14.17 instance on AWS RDS experienced strictly periodic WriteLatency spikes every 3 hours (12:10 PM, 3:10 PM, 6:10 PM). The issue began around December 8, 2025. Despite the latency spikes, WriteIOPS and WriteThroughput remained flat, while DiskQueueDepth correlated 1:1 with the latency spikes.
The root cause was identified as a Checkpoint Storm triggered by an AWS maintenance event. An unannounced minor version upgrade or underlying infrastructure patch likely reset the max_wal_size configuration or altered the checkpoint tuning parameters. This resulted in a time-based checkpoint occurring exactly every 3 hours (as opposed to the default 5 minutes), which exhausted the provisioned IOPS burst balance or saturated the disk controller temporarily, causing the latency spikes without a corresponding increase in logical write throughput.
Root Cause
The issue stems from a misconfigured Checkpoint interval following an AWS infrastructure change.
- Periodic Checkpoint Storms: The `pg_stat_bgwriter` data shows `checkpoints_timed` (268,058) vastly outweighing `checkpoints_req` (16). This indicates time-based checkpoints are the primary mechanism, not demand-based writes.
- Unusual Interval: The 3-hour periodicity (12:10 PM, 3:10 PM, 6:10 PM) is non-standard for PostgreSQL (the default `checkpoint_timeout` is 5 minutes) and non-standard for AWS RDS defaults. This suggests `checkpoint_timeout` or `max_wal_size` was modified or reset by the maintenance event to produce a 3-hour window.
- IOPS Saturation: While the provisioned IOPS (3,000 on GP3) is not exhausted over time, a checkpoint writes a large burst of dirty buffers to disk immediately. This burst exceeds the volume's burst capacity or saturates the EBS queue depth, producing the 1:1 correlation between `DiskQueueDepth` and `WriteLatency`.
- BGWriter Inefficiency: `pg_stat_bgwriter` shows `buffers_clean = 0`; the background writer is completely idle. This forces the checkpointer process to handle all dirty-buffer writes during the checkpoint event, creating a massive, instantaneous I/O spike rather than a load smoothed out over time by the background writer.
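The effective checkpoint cadence can be confirmed directly from the same view (a diagnostic sketch; `stats_reset` marks when the counters last started accumulating):

```sql
-- Average interval between checkpoints since the stats were last reset.
-- A value near 3 hours would confirm the time-based checkpoint hypothesis.
SELECT
    (now() - stats_reset)
      / NULLIF(checkpoints_timed + checkpoints_req, 0) AS avg_checkpoint_interval,
    stats_reset
FROM pg_stat_bgwriter;
```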
Why This Happens in Real Systems
In real-world systems, latency spikes without throughput increases are almost always burst saturation events.
- AWS Maintenance & Configuration Drift: AWS RDS occasionally performs rolling updates or maintenance. While minor, these events can sometimes reset custom parameter groups to default engine parameters or subtly alter the underlying storage performance characteristics.
- The “Idle DB” Paradox: PostgreSQL optimizes for data safety. If a database is idle or has low write traffic, the checkpointer consolidates writes to minimize fsync calls. However, if `checkpoint_timeout` is set too high (e.g., 3 hours), the amount of data to flush becomes massive.
- Storage Burst Balance: GP3 volumes have a baseline IOPS and a burst bucket. A large, infrequent checkpoint drains this bucket instantly. Even if the average IOPS is low, the peak IOPS during the spike hits the volume's hard limit, causing latency to skyrocket.
- Resource Contention: During a heavy checkpoint, the checkpointer process competes heavily for I/O bandwidth. If the application tries to write simultaneously, those writes are blocked behind the checkpoint I/O, causing the observed latency spike.
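The size of each burst can be estimated from the checkpointer statistics (an illustrative calculation; it assumes the page size reported by `block_size`, typically 8 kB):

```sql
-- Average data flushed per checkpoint: buffers written by the checkpointer,
-- divided by the checkpoint count, times the page size.
SELECT
    pg_size_pretty(
        (buffers_checkpoint::numeric
           / NULLIF(checkpoints_timed + checkpoints_req, 0))
        * current_setting('block_size')::bigint
    ) AS avg_flush_per_checkpoint
FROM pg_stat_bgwriter;
```

Dividing that figure by the volume's baseline IOPS × 8 kB gives a rough lower bound on how long the burst saturates the disk.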
Real-World Impact
- Application Timeouts: Latency spikes can cause application-level timeouts, particularly for synchronous writes or transactions requiring immediate persistence.
- Replication Lag: If replication is enabled (e.g., Read Replicas), heavy checkpoints can induce temporary lag: the I/O burst competes with WAL shipping, and the full-page writes that follow each checkpoint inflate WAL volume.
- User Experience: For user-facing applications, these spikes manifest as “hiccups”—requests taking 2-5 seconds instead of milliseconds—at strictly predictable intervals.
- Monitoring Noise: The strict periodicity creates false positives in monitoring alerts, causing alert fatigue for the operations team.
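On the primary, checkpoint-induced replication lag can be observed directly (the lag columns are available since PostgreSQL 10):

```sql
-- Per-replica lag; watch replay_lag climb during the spike windows.
SELECT application_name, state, sync_state,
       write_lag, flush_lag, replay_lag
FROM pg_stat_replication;
```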
Example or Code
The following SQL queries help diagnose whether a checkpoint storm is occurring and whether the background writer is being utilized. Run these during the latency spikes, or immediately after, to analyze the `pg_stat_bgwriter` view.
```sql
-- 1. Check current checkpoint configuration
-- Look for 'checkpoint_timeout' and 'max_wal_size'
SELECT name, setting, unit, context
FROM pg_settings
WHERE name LIKE '%checkpoint%' OR name = 'max_wal_size';

-- 2. Analyze BGWriter stats
-- High 'buffers_checkpoint' with zero 'buffers_clean' indicates
-- the BGWriter is disabled or too slow, forcing the Checkpointer to spike.
SELECT
    checkpoints_timed,
    checkpoints_req,
    checkpoint_write_time / 1000 AS checkpoint_write_time_sec,
    checkpoint_sync_time / 1000 AS checkpoint_sync_time_sec,
    buffers_checkpoint,
    buffers_clean,
    buffers_backend
FROM pg_stat_bgwriter;

-- 3. Monitor the checkpointer process during a spike
-- The checkpointer never has a 'query', so match on backend_type
-- and check for IO-class wait events.
SELECT pid, backend_type, wait_event_type, wait_event, state
FROM pg_stat_activity
WHERE backend_type = 'checkpointer' OR wait_event_type = 'IO';
```
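It also helps to enable checkpoint logging so each completed checkpoint reports buffers written and write/sync durations in the server log (on RDS this must be set via the parameter group; the `ALTER SYSTEM` form below applies to self-managed PostgreSQL):

```sql
-- Log one line per checkpoint with buffers written and write/sync time.
-- On RDS, set log_checkpoints in the DB parameter group instead.
ALTER SYSTEM SET log_checkpoints = on;
SELECT pg_reload_conf();
```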
How Senior Engineers Fix It
Senior engineers approach this by smoothing out the I/O burst and tuning the background writer.
- Tune the Background Writer: The immediate fix is to prevent the checkpointer from doing all the work. Enable and tune the background writer to clean buffers gradually.
  - Action: Update the RDS Parameter Group.
  - Settings: Lower `bgwriter_delay` (e.g., from the default 200ms to 100ms) and raise `bgwriter_lru_maxpages` (e.g., from the default 100 to 1000) so the background writer flushes dirty buffers proactively between checkpoints.
- Adjust Checkpoint Frequency: Reduce the checkpoint interval to prevent too much dirty data from accumulating.
  - Action: Lower `checkpoint_timeout` (e.g., to 15 minutes) and/or lower `max_wal_size`.
  - Result: More frequent, smaller checkpoints rather than one massive spike every 3 hours.
- Verify AWS Parameter Group: Inspect the RDS Parameter Group to ensure `shared_preload_libraries` and other critical settings haven't been reset by the maintenance event. Confirm the custom settings are actually applied and show as “in-sync”.
- Storage Optimization: If the burst balance is draining, consider GP3 with higher baseline IOPS or migrating to io1/io2 for consistent performance, though tuning the database usually solves the issue without cost increases.
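Putting the tuning together, a starting point might look like the following (values are illustrative, not universal recommendations; on RDS, apply them through the DB parameter group rather than `ALTER SYSTEM`):

```sql
-- Smaller, more frequent checkpoints, spread across the interval
ALTER SYSTEM SET checkpoint_timeout = '15min';
-- 0.9 is already the PG 14 default; confirm it hasn't drifted
ALTER SYSTEM SET checkpoint_completion_target = 0.9;
-- A more proactive background writer
ALTER SYSTEM SET bgwriter_delay = '100ms';
ALTER SYSTEM SET bgwriter_lru_maxpages = 1000;
SELECT pg_reload_conf();
```

After applying, re-run the `pg_stat_bgwriter` query above and verify `buffers_clean` begins to climb between checkpoints.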
Why Juniors Miss It
Junior engineers often misdiagnose this issue because the symptoms contradict standard capacity planning logic.
- Misinterpreting Averages: They see low average IOPS and assume the storage is fine. They miss the burst peak visible only in high-resolution metrics (1-minute granularity).
- Focusing on Application Code: Since the spikes are periodic and not load-driven, juniors often hunt for a cron job or scheduled task in the application layer (which doesn’t exist).
- Ignoring `pg_stat_bgwriter`: The `pg_stat_bgwriter` view is often overlooked. Juniors see high `checkpoints_timed` and assume it's normal behavior, failing to recognize that `buffers_clean = 0` is a critical red flag indicating a lack of background maintenance.
- AWS Defaults Assumption: They assume AWS RDS defaults are optimal. They don't realize that an RDS upgrade might revert a custom `checkpoint_timeout` (e.g., 15min) back to the PostgreSQL default (5min) or, as in this case, introduce a strange 3-hour interval via a specific patch or configuration merge.
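A quick way to audit for configuration drift after a maintenance event is to list every setting that differs from the engine's built-in default (a diagnostic sketch; compare the output against your intended parameter group):

```sql
-- Settings not at their compiled-in defaults, and where they came from.
-- Unexpected sources, or missing custom values, indicate drift.
SELECT name, setting, boot_val, source
FROM pg_settings
WHERE source NOT IN ('default', 'override')
ORDER BY name;
```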