Summary
This postmortem documents a failed attempt to build a “Recommended for You” movie feature with machine‑learning techniques before establishing a stable baseline, clear evaluation metrics, or a reproducible experimentation workflow. Once deployed behind a Node.js API, the system produced inconsistent recommendations, poor ranking quality, and unpredictable behavior.
Root Cause
The primary root cause was starting with complex ML techniques before validating simple baselines. Additional contributing factors included:
- Lack of a popularity or heuristic baseline to compare against
- No clear offline evaluation metrics (precision@k, recall@k, MAP, NDCG)
- Mixing content-based and collaborative filtering signals without normalization
- Deploying models without A/B testing or monitoring
- Insufficient understanding of data sparsity and cold-start behavior
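One of the contributing factors above, mixing content-based and collaborative signals without normalization, lets whichever signal has the larger numeric range dominate the blend (e.g. cosine similarities in [0, 1] versus predicted ratings in [1, 5]). A minimal sketch of min-max normalization before blending; the score arrays and the 0.5/0.5 weighting are illustrative assumptions, not values from the incident:

```python
import numpy as np

def min_max(scores: np.ndarray) -> np.ndarray:
    """Rescale scores to [0, 1] so signals on different scales are comparable."""
    lo, hi = scores.min(), scores.max()
    if hi == lo:
        # All scores equal: return zeros to avoid division by zero
        return np.zeros_like(scores, dtype=float)
    return (scores - lo) / (hi - lo)

# Illustrative scores for the same four candidate movies
content_scores = np.array([0.2, 0.9, 0.5, 0.7])  # cosine similarity, already in [0, 1]
cf_scores = np.array([3.1, 4.8, 2.0, 4.1])       # predicted ratings on a 1-5 scale

# Blend only after both signals share a scale; equal weights are an assumption
blended = 0.5 * min_max(content_scores) + 0.5 * min_max(cf_scores)
ranking = np.argsort(-blended)  # candidate indices, best first
print(ranking)
```

Without the `min_max` step, the raw 1–5 ratings would swamp the similarity scores and the "blend" would effectively be collaborative filtering alone.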
Why This Happens in Real Systems
Real production systems often fail for reasons that have nothing to do with the algorithm itself:
- Engineers jump straight to “fancy ML” instead of validating simple models
- Data pipelines drift, causing training-serving skew
- User interaction data is noisy, biased, or incomplete
- Cold-start users and items break collaborative filtering approaches
- Teams underestimate evaluation complexity, especially ranking metrics
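The cold-start failure mode above has a standard mitigation: when a user has no interaction history, serve a popularity fallback instead of calling the collaborative-filtering model at all. A minimal sketch; the item IDs, user histories, and `recommend` function are hypothetical illustrations:

```python
# Hypothetical data: precomputed popularity ranking and per-user histories
popular_items = ["m3", "m1", "m7"]
user_history = {"alice": ["m1", "m2"]}

def recommend(user_id: str, k: int = 3) -> list[str]:
    history = user_history.get(user_id, [])
    if not history:
        # Cold-start user: no signal for collaborative filtering,
        # so fall back to globally popular items
        return popular_items[:k]
    # Warm user: a real system would call the CF model here;
    # this sketch just drops already-seen items from the popularity list
    return [m for m in popular_items if m not in history][:k]

print(recommend("alice"))  # warm user
print(recommend("bob"))    # cold-start user, gets the popularity fallback
```

The key design point is that the fallback path is explicit and testable, rather than letting the CF model silently produce garbage for users it has never seen.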
Real-World Impact
The absence of a reliable baseline and metrics caused:
- Low-quality recommendations, reducing user trust
- Inconsistent ranking behavior across environments
- Increased latency due to unnecessary model complexity
- Difficult debugging, since no baseline existed for comparison
- Wasted compute and engineering time
Example
Below is a minimal Python example of a popularity baseline, which should be the first experiment before any ML model:

```python
import pandas as pd

# Expected columns: user_id, movie_id, rating
df = pd.read_csv("interactions.csv")

# Rank movies by mean rating; note that a bare mean favors movies with
# few ratings, so a production baseline would also weight by rating count
popularity = (
    df.groupby("movie_id")["rating"]
    .mean()
    .sort_values(ascending=False)
)

print(popularity.head(10))
```
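A natural next step is personalized popularity: serve the same global ranking, but drop items the user has already rated. The sketch below builds a tiny in-memory DataFrame in place of `interactions.csv`; the data and the `personalized_popularity` helper are illustrative assumptions:

```python
import pandas as pd

# Tiny in-memory stand-in for interactions.csv (illustrative data)
df = pd.DataFrame({
    "user_id":  ["u1", "u1", "u2", "u2", "u3"],
    "movie_id": ["m1", "m2", "m1", "m3", "m2"],
    "rating":   [5, 3, 4, 5, 4],
})

# Global popularity baseline, as above
popularity = df.groupby("movie_id")["rating"].mean().sort_values(ascending=False)

def personalized_popularity(user_id: str, k: int = 10) -> list[str]:
    """Global popularity ranking minus the items this user has already rated."""
    seen = set(df.loc[df["user_id"] == user_id, "movie_id"])
    return [m for m in popularity.index if m not in seen][:k]

print(personalized_popularity("u1"))
```

This costs a single filter on top of the global ranking, yet already removes the most visible failure of raw popularity: recommending movies the user has just watched.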
How Senior Engineers Fix It
Experienced engineers stabilize the system by:
- Establishing simple baselines first:
  - Global popularity
  - Personalized popularity
  - Content-based similarity
- Defining clear ranking metrics:
  - precision@k
  - recall@k
  - NDCG
- Running offline evaluation before deployment
- Adding A/B testing for real-world validation
- Implementing feature stores to avoid training-serving skew
- Monitoring model drift, data quality, and latency
- Deploying models behind versioned APIs with rollback capability
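The ranking metrics listed above take only a few lines to implement, which is part of why skipping them is inexcusable. A sketch of precision@k and recall@k for a single user's held-out data; the recommended list and relevant set are assumed example values:

```python
def precision_at_k(recommended: list, relevant: set, k: int) -> float:
    """Fraction of the top-k recommendations that are relevant."""
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / k

def recall_at_k(recommended: list, relevant: set, k: int) -> float:
    """Fraction of all relevant items recovered in the top-k."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / len(relevant)

# Illustrative held-out evaluation for one user
recommended = ["m1", "m4", "m2", "m9", "m5"]  # model's ranked list (assumed)
relevant = {"m2", "m5", "m8"}                 # items the user actually liked (assumed)

print(precision_at_k(recommended, relevant, k=5))  # 2 of 5 hits -> 0.4
print(recall_at_k(recommended, relevant, k=5))     # 2 of 3 relevant items found
```

In offline evaluation these per-user scores are averaged over all held-out users, and the baseline's numbers become the bar every ML model must beat before deployment.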
Why Juniors Miss It
Junior engineers often overlook these issues because:
- They assume ML = complex models, skipping foundational baselines
- They underestimate the importance of evaluation metrics
- They focus on model code, not data quality
- They lack experience with production constraints like latency, monitoring, and drift
- They don’t realize that simple models often outperform early ML attempts
This postmortem highlights why disciplined baselines, metrics, and evaluation practices are essential before deploying any recommendation system.