Summary
We observed a highly non-linear performance anomaly when writing data via Spark JDBC to PostgreSQL. With identical code, a small dataset (1M rows) completed in minutes, a medium dataset (2M rows) hung or stalled for over an hour, and a large dataset (13M rows) resumed normal performance. The root cause was unbounded memory accumulation in the Spark executor heap during the commit phase of the batched JDBC writes, triggered by the interaction between Spark’s DataFrameWriter logic and the PostgreSQL JDBC driver’s rewriteBatchedStatements behavior.
Root Cause
The issue stemmed from Spark’s internal JDBC writer logic leaking memory when processing batched inserts with rewriteBatchedStatements=true and batchSize=20000.
Specifically:
- rewriteBatchedStatements=true: This PostgreSQL driver feature rewrites multiple INSERT statements into a single multi-value INSERT (e.g., INSERT INTO table (col) VALUES (1), (2), (3)). This drastically reduces round-trips and is usually a performance booster.
- Spark’s Batch Accumulation: Spark’s JDBC data source writer collects rows into batches before sending them to the driver.
- The “Sticky” Buffer: For the 2M row dataset, the specific data distribution caused Spark’s internal ResultSet or buffering mechanism to hold onto row objects longer than expected. Instead of flushing rows after the batch limit was reached, the accumulator kept references, causing heap usage to grow linearly with the number of rows.
- GC Thrashing / OOM: As the heap filled up, the JVM spent more time in Garbage Collection (GC) than processing data. For 1M rows, the buffer cleared fast enough. For 2M rows, the buffer saturated the heap before the job could complete. For 13M rows, the dataset size likely triggered a different execution path, or the sheer volume forced more aggressive flushing, inadvertently bypassing the accumulation bug.
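The accumulation pattern described above can be modeled outside of Spark. The following is a minimal, hypothetical sketch (not Spark's actual writer code) that contrasts a batcher that releases row references after each flush with one that accidentally retains them:

```python
# Hypothetical model of the suspected failure mode -- NOT Spark's actual
# JDBC writer code. It contrasts a batcher that releases row references
# after each flush with one that accidentally keeps them reachable.

def write_with_flush(rows, batch_size, sink):
    """Correct pattern: the buffer is rebound after every flush."""
    batch = []
    peak = 0
    for row in rows:
        batch.append(row)
        if len(batch) >= batch_size:
            sink(batch)
            peak = max(peak, len(batch))
            batch = []              # references released; heap reclaimable
    if batch:
        sink(batch)
        peak = max(peak, len(batch))
    return peak                     # max rows held in memory at once

def write_with_leak(rows, batch_size, sink):
    """Buggy pattern: flushed batches stay reachable via 'retained'."""
    batch, retained = [], []
    for row in rows:
        batch.append(row)
        if len(batch) >= batch_size:
            sink(batch)
            retained.append(batch)  # leak: old batches never collected
            batch = []
    if batch:
        sink(batch)
        retained.append(batch)
    # heap footprint grows with total rows, not with batch_size
    return sum(len(b) for b in retained)
```

With the leaky pattern, peak memory scales with the total row count rather than the batch size, which is why 1M rows squeaked by while 2M rows saturated the heap.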
Why This Happens in Real Systems
This anomaly is a classic case of threshold effects in distributed systems.
- Non-Linear Scaling: Systems often scale linearly until a specific resource (CPU, Memory, Network) hits a saturation point. In this case, the saturation point was the JVM Heap relative to the specific buffering logic in Spark’s JDBC writer.
- Driver Optimization Backfire: rewriteBatchedStatements is a standard JDBC optimization. However, it changes how data is batched and transmitted. Standard JDBC batches are acknowledged individually (roughly), while rewritten batches are atomic at the SQL level. Spark’s writer is optimized for the standard batch flow; when the underlying driver behavior changes the packetization logic, it can expose edge cases in Spark’s memory management.
- Data Skew Impact: While the user mentioned repartition(20), data skew often remains. If certain partitions have a higher density of complex types or strings, the memory footprint per row varies. The 2M row dataset might have hit a “sweet spot” where the heap filled up exactly as GC kicked in, creating a death spiral.
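A back-of-the-envelope heap calculation makes the threshold effect concrete. All constants below (per-row footprint, heap size, usable fraction) are invented for illustration; the real numbers depend on your schema and cluster:

```python
# Back-of-the-envelope heap model with ILLUSTRATIVE constants only.
# Real per-row footprint and heap size depend on schema and cluster config.

def batch_heap_mb(rows_buffered, bytes_per_row):
    """Approximate heap held by buffered row objects, in MB."""
    return rows_buffered * bytes_per_row / (1024 * 1024)

executor_heap_mb = 1024          # assumed 1 GiB executor heap
usable_fraction = 0.6            # assumed headroom after Spark overhead

for rows, bytes_per_row in [(1_000_000, 500), (2_000_000, 500)]:
    held = batch_heap_mb(rows, bytes_per_row)   # leak: all rows retained
    fits = held < executor_heap_mb * usable_fraction
    print(f"{rows:>9} rows x {bytes_per_row} B -> {held:.0f} MB "
          f"({'fits' if fits else 'saturates heap'})")
```

Under these assumed numbers, 1M rows fits comfortably while 2M rows crosses the usable-heap line: exactly the kind of cliff that makes doubling the input more than double the runtime.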
Real-World Impact
- Operational Paralysis: The job for 2M rows appeared to “hang.” Tasks remained active in the Spark UI, but no data was moving to the database. This leads to wasted cluster costs and missed SLAs.
- Resource Starvation: The executor nodes were likely stuck in long Garbage Collection pauses (executor GC time spiked), preventing the application from making progress.
- Inconsistent Performance: Engineers cannot rely on linear extrapolation. A job that works for 1M rows but fails for 2M rows destroys confidence in the data pipeline’s stability.
- Data Latency: Critical downstream reports are delayed as the pipeline stalls unexpectedly.
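The GC symptom above can be caught programmatically. Spark's monitoring REST API exposes per-executor metrics; the /api/v1/applications/&lt;app-id&gt;/executors endpoint includes totalGCTime and totalDuration (both in milliseconds). A sketch of a ratio check over that payload follows; the 0.3 threshold is an arbitrary rule of thumb, and the sample payload is invented:

```python
# Flags executors spending a suspicious share of task time in GC, given
# the JSON list returned by Spark's monitoring REST endpoint
# /api/v1/applications/<app-id>/executors. The 0.3 threshold is an
# arbitrary rule of thumb, not a Spark default.

def gc_thrashing_executors(executor_summaries, max_gc_ratio=0.3):
    flagged = []
    for e in executor_summaries:
        duration = e.get("totalDuration", 0)   # ms spent running tasks
        gc_time = e.get("totalGCTime", 0)      # ms spent in JVM GC
        if duration > 0 and gc_time / duration > max_gc_ratio:
            flagged.append(e["id"])
    return flagged

# Invented sample payload, trimmed to the relevant fields:
sample = [
    {"id": "1", "totalDuration": 600_000, "totalGCTime": 30_000},   # 5% GC
    {"id": "2", "totalDuration": 600_000, "totalGCTime": 420_000},  # 70% GC
]
print(gc_thrashing_executors(sample))  # -> ['2']
```

An executor spending 70% of its task time in GC is not "working", no matter what the task list in the UI says.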
Example or Code
The provided code snippet highlights the configuration used. The critical options are batchsize and rewriteBatchedStatements.
```python
df_to_write.write \
    .format("jdbc") \
    .mode("append") \
    .option("url", jdbc_url) \
    .option("dbtable", target_table) \
    .option("batchsize", "20000") \
    .option("rewriteBatchedStatements", "true") \
    .save()
```
How Senior Engineers Fix It
Senior engineers address this by removing the variables that cause memory instability and optimizing for throughput without risking heap exhaustion.
- Manual JDBC Batching (Spark SQL or RDD): Instead of relying on DataFrameWriter’s opaque logic, execute the write yourself via foreachPartition or a Structured Streaming foreachBatch approach. This gives explicit control over PreparedStatement lifecycle and commit frequency.
- Adjust Driver Configuration: Disable rewriteBatchedStatements initially to isolate the issue. While this usually slows down ingestion, it stabilizes memory usage. If performance is required, tune batchsize down significantly (e.g., to 1000 or 5000) to reduce the memory footprint per batch.
- Executor Memory Tuning: Increase the executor heap size (spark.executor.memory) and explicitly raise spark.executor.memoryOverhead to accommodate the larger memory pressure from JDBC buffering. However, this treats the symptom rather than the root cause.
- Write via Parquet/CSV Intermediate: The most robust fix for massive data loads is to write to S3 (Parquet) first, then use the PostgreSQL COPY command (via a separate lightweight process or the psql utility) to bulk load from S3. This decouples Spark processing from database transaction limits and memory constraints.
  - Step 1: df.write.parquet("s3://bucket/temp/")
  - Step 2: Use aws s3 cp to stream data to the DB host and pipe to psql -c "COPY table FROM STDIN"
- Spark Configuration Tweak: Ensure spark.sql.adaptive.enabled is set to true (in Spark 3.x), which helps optimize shuffle partitions dynamically, though it didn’t solve the JDBC buffer issue directly.
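The "manual batching" fix above boils down to controlling batch boundaries and commit points yourself. Here is a framework-agnostic sketch of that loop; execute_batch and commit are stand-ins for PreparedStatement.executeBatch() and Connection.commit(), and inside Spark this function would run once per partition via foreachPartition:

```python
# Framework-agnostic sketch of explicit batch control, the core of the
# foreachPartition/foreachBatch fix. 'execute_batch' and 'commit' are
# stand-ins for PreparedStatement.executeBatch() and Connection.commit();
# in Spark this would run once per partition.

def write_partition(rows, batch_size, execute_batch, commit,
                    commit_every_batches=10):
    batch, batches_since_commit = [], 0
    for row in rows:
        batch.append(row)
        if len(batch) >= batch_size:
            execute_batch(batch)
            batch = []                      # release references immediately
            batches_since_commit += 1
            if batches_since_commit >= commit_every_batches:
                commit()                    # bounded transaction size
                batches_since_commit = 0
    if batch:
        execute_batch(batch)
    commit()                                # flush the final partial batch
```

Because the buffer is rebound after every execute_batch call, peak heap usage is bounded by batch_size regardless of partition size, and the commit cadence is explicit instead of hidden inside the writer.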
Why Juniors Miss It
Junior engineers often focus on the obvious metrics and code syntax while missing the underlying resource constraints.
- Blind Reliance on Framework Features: Juniors see rewriteBatchedStatements=true in a performance tutorial and apply it universally. They don’t understand that this driver feature fundamentally alters how the JDBC driver buffers data in memory before network transmission.
- Misinterpreting Spark UI: In the 2M row case, the Spark UI showed tasks running. Juniors might interpret this as “work is being done.” A senior engineer recognizes that tasks running indefinitely with low CPU and network I/O usually indicate GC thrashing or deadlocks, not processing.
- Lack of Heap Analysis: Juniors often don’t check the Executor GC Time metric in the Spark History Server. A spike in GC time correlates perfectly with the “hanging” behavior.
- Linear Scaling Assumption: The intuition that “twice the data takes twice the time” is incorrect when memory boundaries are involved. Juniors often lack the experience to identify non-linear threshold behaviors.
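The broken linear intuition can be demonstrated with a toy cost model (all constants invented): runtime grows roughly linearly until the working set exceeds the heap, then GC overhead dominates and the curve jumps. The 13M case escaped via a different flush path, which this simple model deliberately does not cover:

```python
# Toy cost model with INVENTED constants, only to illustrate why
# "twice the data, twice the time" fails at a memory threshold.
# (The 13M-row case took a different code path and is out of scope here.)

def estimated_minutes(rows, heap_capacity_rows=1_500_000,
                      per_row_min=3e-6, gc_penalty=50):
    base = rows * per_row_min           # linear regime
    if rows <= heap_capacity_rows:
        return base
    return base * gc_penalty            # GC-thrashing regime past the cliff

for n in (1_000_000, 2_000_000):
    print(f"{n:>9} rows -> ~{estimated_minutes(n):.0f} min")
```

Under these made-up constants, doubling the input from 1M to 2M rows multiplies the runtime by two orders of magnitude, because the second run sits past the heap cliff.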