## Summary
A production incident occurred when queries for **viewed-but-not-purchased** products returned incorrect results. The root cause was flawed JOIN logic and incorrect filtering conditions in SQL queries against the `user_behavior` dataset. Specifically:
- **Missing correlation** between user behavior sessions and purchases
- **Incorrect date handling** for purchase verification windows
- **Failure to account for multi-product orders**
## Root Cause
The SQL logic failed due to three key issues:
- **Misinterpreted temporal scope**: Used simple date equality checks instead of range comparisons for purchase verification
- **Incorrect exclusion logic**: Failed to correlate `product_id`s between view events and order items
- **Missing NULL handling**: Overlooked `related_order_code` NULL constraints in `user_behavior`
Key query flaws causing data leakage:
- Used `LEFT JOIN ... WHERE right_table.column IS NULL` without ensuring one-to-one row mapping
- Did not handle multiple orders per user-product-date combination
- Used direct date comparisons ignoring subsequent purchasing windows
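
To make the flaws above concrete, here is a hypothetical sketch of the kind of query they describe. It is not the incident's actual query, which this write-up does not show, and it assumes `order_items` carries `user_id`, `product_id`, `order_time`, and `order_id`, as the corrected query later in this document does.

```sql
-- Hypothetical flawed pattern, for illustration only.
SELECT
    v.user_id,
    v.product_id
FROM user_behavior v
LEFT JOIN order_items oi
    ON oi.user_id = v.user_id
    -- Flaw 1: no "oi.product_id = v.product_id", so any purchase by the user
    --         on that day hides every product they viewed.
    -- Flaw 2: date equality ignores purchases made on a later day.
    AND DATE(oi.order_time) = DATE(v.behavior_time)
WHERE v.behavior_type = 'view'
    AND oi.order_id IS NULL;
```

Under these assumptions the result is wrong in both directions: views are dropped when the user bought something unrelated that day, and views are reported as unpurchased when the purchase happened a day later.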
## Why This Happens in Real Systems
This class of errors commonly occurs due to:
- **Complex event correlation**: Tracking user actions across normalized tables requires precise session linking
- **Temporal ambiguity**: Business rules like "purchase within 7 days" need range predicates anchored to each view event, which are easy to get wrong in SQL
- **Schema limitations**:
  - Separate `order` table storing multiple products per order
  - Behavioral events logged with sparse foreign keys (NULL `related_order_code` for non-purchases)
- **Production data anomalies**:
  - Users viewing the same product multiple times in one day
  - Orders containing duplicate products
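
The two data anomalies listed above can be measured before writing any join logic. The queries below are a minimal sketch, assuming `order_items` holds one row per order line with `order_id` and `product_id` columns; adjust the names to the real schema.

```sql
-- Users who viewed the same product more than once on the same day.
SELECT
    user_id,
    product_id,
    DATE(behavior_time) AS view_day,
    COUNT(*) AS view_count
FROM user_behavior
WHERE behavior_type = 'view'
GROUP BY user_id, product_id, DATE(behavior_time)
HAVING COUNT(*) > 1;

-- Orders that list the same product on more than one line.
SELECT
    order_id,
    product_id,
    COUNT(*) AS line_count
FROM order_items
GROUP BY order_id, product_id
HAVING COUNT(*) > 1;
```

If either query returns rows, any join between views and order lines can fan out, so the final result needs de-duplication or an existence check rather than a plain join.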
## Real-World Impact
**Critical business impacts included**:
- Marketing teams received inflated product view metrics
- Campaign targeting proved ineffective
- User retention analysis showed false negative signals
- Financial impact: ~$150K in misallocated campaign budget
Data integrity impacts:
- Reported product view-to-purchase ratios became inaccurate
- Erroneous cannibalization analysis for product recommendations
- Incorrect "abandoned cart" metrics affecting 12% of daily active users
## Example or Code
```sql
-- Corrected query for daily analysis
SELECT
    v.user_id,
    v.product_id,
    p.product_name
FROM user_behavior v
JOIN product p ON v.product_id = p.product_id
LEFT JOIN order_items oi
    ON oi.user_id = v.user_id
    AND oi.product_id = v.product_id
    AND oi.order_time BETWEEN v.behavior_time AND DATE_ADD(v.behavior_time, INTERVAL 1 DAY)
WHERE v.behavior_type = 'view'
    -- Half-open range so the filter still works if behavior_time stores a time of day
    AND v.behavior_time >= '2026-01-01'
    AND v.behavior_time < '2026-01-02'
    AND oi.order_id IS NULL;  -- keep only views with no matching purchase

-- Weekly window variant
SELECT
    v.user_id,
    v.product_id,
    p.product_name,
    v.behavior_time AS view_date
FROM user_behavior v
JOIN product p ON v.product_id = p.product_id
LEFT JOIN order_items oi
    ON oi.user_id = v.user_id
    AND oi.product_id = v.product_id
    AND oi.order_time BETWEEN v.behavior_time AND DATE_ADD(v.behavior_time, INTERVAL 7 DAY)
WHERE v.behavior_type = 'view'
    -- Covers 2026-01-01 through the end of 2026-01-07
    AND v.behavior_time >= '2026-01-01'
    AND v.behavior_time < '2026-01-08'
    AND oi.order_id IS NULL;  -- keep only views with no matching purchase
```
## How Senior Engineers Fix It
Corrective actions implemented:
- Replaced equality checks with date range comparisons for purchase verification
- Implemented explicit event-sequence validation:
  - Used `EXISTS` with correlated subqueries for purchase checks
  - Created materialized view for user-product view sessions
- Added dataset-specific guard rails:
  - Validated behavior-log integrity: `SELECT COUNT(*) FROM user_behavior WHERE behavior_type = 'view' AND product_id IS NULL;` must return 0
- Introduced sliding window analysis in data pipelines:
  - Pre-computed purchase status flags for all viewed products
  - Implemented versioned product-view snapshots
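
The `EXISTS`-based purchase check mentioned above could look roughly like the following. This is a sketch under the same schema assumptions as the corrected queries (`order_items` with `user_id`, `product_id`, and `order_time`), not the team's actual implementation.

```sql
-- One row per user-product pair that has at least one view in the week
-- with no purchase of that product in the 7 days after the view.
SELECT DISTINCT
    v.user_id,
    v.product_id
FROM user_behavior v
WHERE v.behavior_type = 'view'
    AND v.behavior_time >= '2026-01-01'
    AND v.behavior_time < '2026-01-08'
    AND NOT EXISTS (
        SELECT 1
        FROM order_items oi
        WHERE oi.user_id = v.user_id
            AND oi.product_id = v.product_id
            AND oi.order_time >= v.behavior_time
            AND oi.order_time < DATE_ADD(v.behavior_time, INTERVAL 7 DAY)
    );
```

Because `NOT EXISTS` only asks whether a matching order line exists, duplicate order lines cannot multiply rows, and `DISTINCT` collapses repeated views of the same product.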
Structural improvements:
- Added `event_uid` UUIDs for all user actions to enable exact upstream joins
- Implemented dbt models with CI checks for temporal logic
- Configured anomaly detection on view-to-purchase conversion rates
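
As an illustration of the input such anomaly detection might consume, the query below computes a daily view-to-purchase conversion rate with a 7-day purchase window. The column names follow the corrected queries above and are assumptions about the schema; the alerting threshold and tooling are not specified in this write-up and are left out.

```sql
-- Daily conversion rate: share of views followed by a purchase of the
-- same product by the same user within 7 days.
SELECT
    view_day,
    COUNT(*) AS views,
    SUM(purchased) AS converted_views,
    SUM(purchased) / COUNT(*) AS conversion_rate
FROM (
    SELECT
        DATE(v.behavior_time) AS view_day,
        CASE WHEN EXISTS (
            SELECT 1
            FROM order_items oi
            WHERE oi.user_id = v.user_id
                AND oi.product_id = v.product_id
                AND oi.order_time >= v.behavior_time
                AND oi.order_time < DATE_ADD(v.behavior_time, INTERVAL 7 DAY)
        ) THEN 1 ELSE 0 END AS purchased
    FROM user_behavior v
    WHERE v.behavior_type = 'view'
) flagged
GROUP BY view_day
ORDER BY view_day;
```

A sudden drop in `conversion_rate` for recent days is expected while the 7-day window is still open, so a monitor built on this would likely compare only days whose window has fully elapsed.
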
## Why Juniors Miss It
Common junior engineer pitfalls:
- Misunderstanding sparse data: Assuming implicit relationship between `user_