Summary
A data engineer reported a discrepancy in row counts between a source relational database and a Databricks/Spark SQL environment. Despite running an identical SELECT COUNT(*) query with no filters, the source returned 86,220 rows while Databricks returned only 86,100 rows. This type of silent data loss during the ETL/ELT pipeline is a critical failure that compromises data integrity and downstream analytics.
Root Cause
The mismatch is rarely caused by the query itself, but rather by the data ingestion and processing layer. The most likely culprits are:
- Incomplete Data Ingestion: The ETL job responsible for moving data from the source to the Delta Lake/Parquet files failed to capture the final batch of records due to a checkpointing error or a truncated stream.
- Duplicate Handling during Ingestion: If the ingestion pipeline uses
MERGEorDISTINCTlogic to prevent duplicates, it may have incorrectly identified 120 legitimate, unique rows as duplicates based on a flawed primary key definition or a null-handling error. - Data Type Mismatches/Parsing Failures: During the conversion from a relational schema to Spark’s schema, rows containing malformed data (e.g., an invalid timestamp format or an out-of-range decimal) may have been dropped or relegated to a “bad records” path without triggering a hard failure.
- Concurrency and Read Consistency: The source database might have had uncommitted transactions or “dirty reads” occurring during the extraction window, meaning the 120 rows existed in the source but were not yet visible to the extraction tool.
Why This Happens in Real Systems
In production, data is not static. Systems are highly distributed and subject to eventual consistency and schema evolution.
- Distributed Nature of Spark: Spark processes data in parallel partitions. If one partition fails to write due to a spilled memory error or a node preemption and the job is configured to “continue on error,” the missing data stays missing.
- The “Silent Failure” Paradigm: Most modern data pipelines are designed for high availability. Instead of crashing the entire pipeline when a single row is corrupted, systems are often configured to skip the row, leading to gradual data drift.
- Schema Enforcement vs. Schema Evolution: When moving from a rigid SQL schema to a schema-on-read environment like Databricks, the way NULL values or special characters are handled can lead to rows being filtered out by the underlying file readers (Parquet/Avro).
Real-World Impact
- Financial Discrepancies: If these 120 rows represent transactions, the company is reporting incorrect revenue or balances.
- Broken Downstream Models: Machine Learning models trained on incomplete datasets suffer from training-serving skew, leading to biased or inaccurate predictions.
- Loss of Trust: Once stakeholders notice a discrepancy between the “Source of Truth” and the “Data Warehouse,” the credibility of the entire data platform is undermined.
Example or Code (if necessary and relevant)
To debug this, a senior engineer would compare the presence of specific IDs using a “Minus” or “Except” logic after staging the data.
-- Run this in the environment where you have access to both sets (if possible)
-- or compare specific ranges of IDs to find the gap.
SELECT id FROM source_table
EXCEPT
SELECT id FROM databricks_table;
-- Check for corrupted records in Spark by enabling bad record path
-- This is a configuration setting used during reading
spark.read.format("csv")
.option("badRecordsPath", "/mnt/errors/bad_records")
.load("/mnt/data/source_files")
How Senior Engineers Fix It
A senior engineer does not just “re-run the query.” They implement observability and validation patterns:
- Implement Data Auditing: Add a “Control Total” step in the pipeline. For every batch processed, the pipeline must compare the
COUNT(*)of the source extract against theCOUNT(*)of the written Delta table. If they don’t match, the pipeline must fail loudly. - Idempotent Ingestion: Use
MERGE INTOpatterns with strictly defined business keys rather than relying on simple appends to ensure that re-running a job doesn’t create duplicates but also doesn’t miss rows. - Schema Validation Layers: Use tools like Great Expectations or Delta Live Tables (DLT) Expectations to define constraints. For example,
EXPECT (id IS NOT NULL) ON VIOLATION FAIL UPDATE. - Check the Ingestion Logs: Inspect the Spark UI to see if any tasks failed or if there was a significant amount of “skipped” data during the shuffle/write phase.
Why Juniors Miss It
- Focus on Syntax, Not Flow: Juniors often assume that if the
SELECTstatement is syntactically correct, the result must be correct. They treat the database as a static entity rather than a moving part of a larger data lifecycle. - Trusting the Tooling: They often assume that “Spark handles everything automatically,” forgetting that Spark is a distributed engine that requires explicit handling of edge cases, nulls, and partial failures.
- Lack of Defensive Programming: Juniors tend to write pipelines that “work when things go right” but fail to build “guardrails” for when things go wrong. They miss the importance of data reconciliation.