ODPS / MaxCompute SQL: What would be the correct joining query when some records return null?

Summary

This incident centers on incomplete JOIN results in an ODPS / MaxCompute SQL pipeline where a main transaction table fails to match all rows to a reference table. The root cause is that the true join key is embedded inside a JSON field, and the surrounding data (user IDs, amounts, dates, merchant IDs) is not uniquely reliable. As a result, a naïve equality-based join leaves some records unmatched.

Root Cause

The failure stems from a combination of structural and data-quality issues:

  • The real transaction identifier (merchantTransId / invNo) is buried inside a JSON string, sometimes escaped or double-encoded.
  • User identifiers require normalization (e.g., 63-xxxx0xxxx), and mismatches cause join failures.
  • Amounts differ in scale (centavos vs. whole currency units).
  • Dates differ in format (string vs. numeric vs. timestamp).
  • JOIN keys are not unique, so user + amount + date is insufficient.
  • ODPS JOIN restrictions require equality-only conditions, forcing pre-processing into subqueries.

Why This Happens in Real Systems

These mismatches are extremely common in production data pipelines:

  • Heterogeneous ingestion sources (XLSX, CSV, API dumps) produce inconsistent formats.
  • Reference systems evolve independently, adding JSON blobs instead of structured columns.
  • Legacy identifiers (merchant IDs, invoice numbers) are reused or inconsistently formatted.
  • Normalization rules are undocumented, so teams guess at transformations.
  • JOINs depend on business semantics, not just relational structure.

Real-World Impact

When the join fails, downstream systems experience:

  • Missing biz_order_id, causing reconciliation gaps.
  • Incorrect matches, leading to financial misattribution.
  • Manual reprocessing, increasing operational load.
  • Delayed reporting, especially for settlement and audit pipelines.
  • Data quality alerts, often with no clear root cause.

Example or Code (if necessary and relevant)

Below is a valid pattern for ODPS / MaxCompute that extracts the JSON key in a subquery, normalizes identifiers, and performs a deterministic join.

SELECT
    a.*,
    b.biz_order_id
FROM (
    SELECT
        txn_date,
        txn_time,
        sender_account_number,
        transaction_amount,
        invoice_number,
        RIGHT(invoice_number, 6) AS inv_suffix
    FROM main_table
) a
LEFT JOIN (
    SELECT
        batch_date,
        biz_order_id,
        pay_money_amount / 100 AS amount_norm,
        REGEXP_EXTRACT(extend_info, '"merchantTransId":"([0-9]+)"', 1) AS merchant_trans_id,
        RIGHT(REGEXP_EXTRACT(extend_info, '"merchantTransId":"([0-9]+)"', 1), 6) AS trans_suffix,
        REGEXP_REPLACE(login_id, '^63-', '0') AS user_norm
    FROM ref_table
) b
ON a.sender_account_number = b.user_norm
AND a.transaction_amount = b.amount_norm
AND a.inv_suffix = b.trans_suffix
AND a.txn_date = b.batch_date;

How Senior Engineers Fix It

Experienced engineers apply a consistent set of practices:

  • Extract and normalize all join keys in subqueries, never inline.
  • Use suffix-based matching when upstream systems pad or alter identifiers.
  • Normalize user identifiers with deterministic rules.
  • Normalize amounts to a common unit before joining.
  • Validate JSON extraction with sampling queries to detect escaped or malformed rows.
  • Add defensive filters (e.g., date windows, platform IDs) to reduce false positives.
  • Create a canonical “matching key” column in both tables to simplify future joins.

Why Juniors Miss It

This class of issue is subtle and easy to overlook:

  • They assume visible columns are the true keys, not hidden JSON fields.
  • They trust merchant IDs or user IDs without checking normalization rules.
  • They expect amount + date to be unique, which is rarely true in payments data.
  • They underestimate JSON inconsistencies (escaping, padding, missing fields).
  • They try to fix mismatches with window functions, which hides symptoms but not causes.
  • They don’t yet recognize that data pipelines require semantic joins, not just relational ones.

This is the kind of debugging that teaches you how production data really behaves—messy, inconsistent, and full of hidden keys that only reveal themselves after careful investigation.

Leave a Comment