BigQuery CONCAT vs || for NULL-Safe String Concatenation

Summary

During a routine data transformation pipeline in Google BigQuery, we encountered a data integrity incident where a critical production report failed due to unexpected NULL values propagating through a concatenated string column. The incident highlighted a fundamental difference between the CONCAT() function and the + (or ||) operator regarding null-safety handling.

The core issue was the assumption that all concatenation methods behave identically when encountering a single NULL value in a row.

Root Cause

The failure was caused by the non-null-safe nature of the standard concatenation operator when applied to nullable columns.

  • The CONCAT() Function: In BigQuery, CONCAT(str1, str2, ...) is designed to be null-resilient. If one of the arguments is NULL, the function ignores it and returns the concatenation of the remaining non-null strings.
  • The + or || Operator: The standard concatenation operator (often represented by || in SQL standard, or + in T-SQL) follows three-valued logic. If any single operand in the expression is NULL, the entire result becomes NULL.

In our pipeline, a single NULL in a middle_name column caused the entire full_name field to vanish, leading to downstream join failures.

Why This Happens in Real Systems

In distributed data warehouses like BigQuery or Snowflake, data is rarely “clean.”

  • Schema Evolution: New columns are added that allow NULL values, breaking existing logic that assumed data completeness.
  • Upstream ETL Failures: A failure in a source system might result in missing values being ingested into a staging table.
  • Implicit Assumptions: Engineers often write logic assuming “if the ID exists, the name must also exist,” failing to account for the asymmetry of data completeness.

Real-World Impact

  • Data Loss: Critical identifiers (like user_id + region_code) became NULL, causing records to be dropped during INNER JOIN operations.
  • Broken Downstream BI: Dashboard filters that relied on concatenated strings failed to show any data for specific segments.
  • Silent Failures: The query did not throw a syntax error; it simply produced mathematically correct but logically useless results, making it difficult to detect via standard error monitoring.

Example or Code

WITH sample_data AS (
  SELECT 'John' AS first_name, NULL AS middle_name, 'Doe' AS last_name
)
SELECT
  first_name,
  middle_name,
  last_name,
  -- This returns 'JohnDoe' (Resilient)
  CONCAT(first_name, middle_name, last_name) AS concat_function_result,
  -- This returns NULL (Fragile)
  first_name || middle_name || last_name AS operator_result
FROM sample_data

How Senior Engineers Fix It

Senior engineers do not just “fix the bug”; they harden the logic against future data drift.

  • Explicit Null Handling: Always wrap nullable columns in IFNULL(column, '') or COALESCE(column, '') when using operators to ensure deterministic output.
  • Preferring Resilient Functions: In BigQuery specifically, we mandate the use of CONCAT() for string building to minimize the surface area for null-propagation bugs.
  • Unit Testing Data Logic: Implement data quality checks (using tools like dbt tests) that specifically check for unexpected NULL counts in concatenated columns.
  • Defensive Programming: Treating every column as potentially NULL by default, regardless of the current data state.

Why Juniors Miss It

  • Mental Model Mismatch: Juniors often view concatenation as a simple string joining task rather than a logical operation subject to null-propagation rules.
  • Focus on “Happy Path”: Most learning resources use “clean” datasets where every column is populated, reinforcing the habit of using the simplest syntax (||) without considering edge cases.
  • Lack of Observability Experience: Juniors may see a NULL result as “just another value” rather than recognizing it as a signal of a broken pipeline.

Leave a Comment