Flattener Freeze: Medallion Architectures JSON Bottlenecks Demystified

Summary

The architecture proposed involves a Medallion Architecture where unstructured JSON files are ingested from on-prem to S3 via AWS Glue, then loaded into Amazon Aurora using a JSONB column in a Bronze layer. The strategy relies on flattening this JSON data into relational structures for the Gold layer to facilitate querying. While this approach provides high flexibility for evolving schemas, it introduces significant performance bottlenecks, storage overhead, and compute costs when scaling to massive data volumes.

Root Cause

The core issue stems from a mismatch between storage patterns and query patterns:

Late Binding Latency: Relying on the Gold layer to flatten JSONB on-the-fly or via heavy ETL processes creates a massive compute tax during every transformation cycle.
Schema-on-Read Overhead: Aurora (PostgreSQL/MySQL compatible) must parse the binary JSON structure at runtime, which is significantly slower than reading native columnar or relational types.
Write Amplification: Ingesting massive volumes into a relational engine (Aurora) as “blobs” of JSON can lead to significant IOPS pressure and bloated WAL (Write Ahead Log) sizes.

Why This Happens in Real Systems

In production environments, teams often prioritize developer velocity over system efficiency:

Schema Evolution Anxiety: Engineers fear that strict relational schemas in the Bronze layer will cause pipelines to break whenever an upstream source adds a new field.
The “Dump Everything” Fallacy: There is a common misconception that storing data in its raw format (JSONB) is a “safety net,” when in reality, it often becomes a data swamp that is too expensive to query.
Tooling Bias: Using AWS Glue for everything is often the path of least resistance, even when a specialized distributed processing engine (like Spark or EMR) is more appropriate for heavy JSON flattening.

Real-World Impact

Degraded Query Performance: As the JSONB column grows, index maintenance becomes heavy, and sequential scans for specific keys become prohibitively slow.
Increased AWS Costs: High CPU utilization during the “Flattening” phase of the Medallion pipeline leads to much higher Aurora instance sizing requirements.
Data Integrity Erosion: Without a schema enforced at the Bronze/Silver transition, “garbage in” results in “garbage out,” making the Gold layer unreliable for downstream BI tools.

Example or Code (if necessary and relevant)

-- The inefficient way: Querying massive JSONB blobs directly
SELECT 
    data->>'user_id' AS user_id,
    data->>'event_type' AS event_type,
    (data->>'timestamp')::timestamp AS event_time
FROM bronze_json_table
WHERE data @> '{"event_type": "purchase"}';

-- The optimized way: Materialized/Flattened Gold Layer
CREATE TABLE gold_events (
    user_id VARCHAR(50),
    event_type VARCHAR(50),
    event_time TIMESTAMP,
    ingest_id UUID
);

CREATE INDEX idx_event_type ON gold_events(event_type);

How Senior Engineers Fix It

A senior engineer would move away from a “Relational-Only” ingestion mindset and implement a Hybrid Storage Strategy:

Schema Enforcement at Silver: Use AWS Glue/Spark to perform Schema Inference and Evolution during the transition from Bronze to Silver. Instead of storing JSONB in Silver, extract the top 80% of most-queried fields into native relational columns.
Decouple Storage and Compute: Keep the raw JSON in S3 (Data Lake) for long-term archival and “replayability,” but only load structured, high-value data into Aurora (Data Warehouse/Serving Layer).
Partitioning Strategy: Implement strict partitioning on the S3 landing zone (e.g., s3://bucket/year/month/day/) and use Partition Projection to ensure Glue jobs don’t scan the entire bucket.
Parquet/Avro over JSON: Transition the intermediate layer (Silver) from JSON to Apache Parquet. Parquet’s columnar format allows for much faster flattening and significantly reduces I/O.

Why Juniors Miss It

Focus on “How” instead of “How Much”: Juniors focus on the technical possibility of storing JSONB in Aurora, whereas seniors focus on the scaling characteristics of that choice.
Ignoring I/O Costs: Juniors often treat storage as infinite and cheap, overlooking the fact that I/O operations and CPU cycles spent parsing JSON are the primary drivers of cloud bills.
Over-reliance on Relational Engines: Juniors try to turn an RDBMS (Aurora) into a Data Lake, failing to recognize that relational engines are optimized for structured, predictable workloads, not massive-scale unstructured ingestion.