Summary
During a routine data transformation pipeline in Google BigQuery, we encountered a data integrity incident where a critical production report failed due to unexpected NULL values propagating through a concatenated string column. The incident highlighted a fundamental difference between the CONCAT() function and the + (or ||) operator regarding null-safety handling.
The core issue was the assumption that all concatenation methods behave identically when encountering a single NULL value in a row.
Root Cause
The failure was caused by the non-null-safe nature of the standard concatenation operator when applied to nullable columns.
- The
CONCAT()Function: In BigQuery,CONCAT(str1, str2, ...)is designed to be null-resilient. If one of the arguments isNULL, the function ignores it and returns the concatenation of the remaining non-null strings. - The
+or||Operator: The standard concatenation operator (often represented by||in SQL standard, or+in T-SQL) follows three-valued logic. If any single operand in the expression isNULL, the entire result becomesNULL.
In our pipeline, a single NULL in a middle_name column caused the entire full_name field to vanish, leading to downstream join failures.
Why This Happens in Real Systems
In distributed data warehouses like BigQuery or Snowflake, data is rarely “clean.”
- Schema Evolution: New columns are added that allow
NULLvalues, breaking existing logic that assumed data completeness. - Upstream ETL Failures: A failure in a source system might result in missing values being ingested into a staging table.
- Implicit Assumptions: Engineers often write logic assuming “if the ID exists, the name must also exist,” failing to account for the asymmetry of data completeness.
Real-World Impact
- Data Loss: Critical identifiers (like
user_id+region_code) becameNULL, causing records to be dropped duringINNER JOINoperations. - Broken Downstream BI: Dashboard filters that relied on concatenated strings failed to show any data for specific segments.
- Silent Failures: The query did not throw a syntax error; it simply produced mathematically correct but logically useless results, making it difficult to detect via standard error monitoring.
Example or Code
WITH sample_data AS (
SELECT 'John' AS first_name, NULL AS middle_name, 'Doe' AS last_name
)
SELECT
first_name,
middle_name,
last_name,
-- This returns 'JohnDoe' (Resilient)
CONCAT(first_name, middle_name, last_name) AS concat_function_result,
-- This returns NULL (Fragile)
first_name || middle_name || last_name AS operator_result
FROM sample_data
How Senior Engineers Fix It
Senior engineers do not just “fix the bug”; they harden the logic against future data drift.
- Explicit Null Handling: Always wrap nullable columns in
IFNULL(column, '')orCOALESCE(column, '')when using operators to ensure deterministic output. - Preferring Resilient Functions: In BigQuery specifically, we mandate the use of
CONCAT()for string building to minimize the surface area for null-propagation bugs. - Unit Testing Data Logic: Implement data quality checks (using tools like dbt tests) that specifically check for unexpected
NULLcounts in concatenated columns. - Defensive Programming: Treating every column as potentially
NULLby default, regardless of the current data state.
Why Juniors Miss It
- Mental Model Mismatch: Juniors often view concatenation as a simple string joining task rather than a logical operation subject to null-propagation rules.
- Focus on “Happy Path”: Most learning resources use “clean” datasets where every column is populated, reinforcing the habit of using the simplest syntax (
||) without considering edge cases. - Lack of Observability Experience: Juniors may see a
NULLresult as “just another value” rather than recognizing it as a signal of a broken pipeline.