Postmortem: Building a “Recommended for You” Movie Feature with ML Before the Basics

Summary

This incident documents a failed attempt to build a “Recommended for You” movie feature using machine‑learning techniques without first establishing a stable baseline, clear evaluation metrics, or a reproducible experimentation workflow. The system produced inconsistent recommendations, poor ranking quality, and unpredictable behavior once deployed behind a Node.js API.

Root Cause

The primary root cause was starting with complex ML techniques before validating simple baselines. Additional contributing factors included:

  • Lack of a popularity or heuristic baseline to compare against
  • No clear offline evaluation metrics (precision@k, recall@k, MAP, NDCG)
  • Mixing content-based and collaborative filtering signals without normalization
  • Deploying models without A/B testing or monitoring
  • Insufficient understanding of data sparsity and cold-start behavior
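To make the missing-metrics point concrete, here is a minimal sketch of precision@k and recall@k. The item IDs and relevance sets below are illustrative, not from the incident data:

```python
def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommendations the user actually engaged with."""
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    return hits / k

def recall_at_k(recommended, relevant, k):
    """Fraction of the user's relevant items captured in the top-k."""
    if not relevant:
        return 0.0
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in set(relevant))
    return hits / len(relevant)

# Hypothetical example: 5 recommendations, user liked 3 items
recs = ["m1", "m2", "m3", "m4", "m5"]
liked = {"m2", "m5", "m9"}
print(precision_at_k(recs, liked, 5))  # 2 hits out of 5 -> 0.4
print(recall_at_k(recs, liked, 5))     # 2 of 3 relevant items found
```

Metrics this simple are enough to compare a heuristic baseline against any candidate model offline.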

Why This Happens in Real Systems

Real production systems often fail for reasons that have nothing to do with the algorithm itself:

  • Engineers jump straight to “fancy ML” instead of validating simple models
  • Data pipelines drift, causing training-serving skew
  • User interaction data is noisy, biased, or incomplete
  • Cold-start users and items break collaborative filtering approaches
  • Teams underestimate evaluation complexity, especially ranking metrics
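The cold-start failure mode above has a simple mitigation: route users with too little history to the popularity baseline instead of collaborative filtering. This is a sketch, not the incident's actual serving code; the threshold and the stubbed data are assumptions:

```python
MIN_INTERACTIONS = 5  # assumed threshold; tune per dataset

def recommend(user_id, interactions, cf_recs, popular, k=10):
    """Use collaborative-filtering output only for users with enough history."""
    history = interactions.get(user_id, [])
    if len(history) < MIN_INTERACTIONS:
        # Cold-start user: fall back to the popularity baseline
        return popular[:k]
    return cf_recs.get(user_id, popular)[:k]

# Hypothetical data: alice is cold-start, bob has enough history
interactions = {"alice": ["m1", "m2"], "bob": ["m1", "m2", "m3", "m4", "m5", "m6"]}
cf_recs = {"bob": ["m9", "m8", "m7"]}
popular = ["m1", "m2", "m3"]
print(recommend("alice", interactions, cf_recs, popular, k=2))  # ['m1', 'm2']
print(recommend("bob", interactions, cf_recs, popular, k=2))    # ['m9', 'm8']
```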

Real-World Impact

The absence of a reliable baseline and metrics caused:

  • Low-quality recommendations, reducing user trust
  • Inconsistent ranking behavior across environments
  • Increased latency due to unnecessary model complexity
  • Difficult debugging, since no baseline existed for comparison
  • Wasted compute and engineering time

Example: A Popularity Baseline

Below is a minimal Python example of a popularity baseline, which should always be the first experiment before ML models:

import pandas as pd

# Each row is one interaction: user_id, movie_id, rating
df = pd.read_csv("interactions.csv")

# Rank movies by mean rating: the simplest "recommend what's popular" baseline
popularity = (
    df.groupby("movie_id")["rating"]
      .mean()
      .sort_values(ascending=False)
)

# Top 10 movies, recommended to every user
print(popularity.head(10))
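One known weakness of a raw mean: a movie with a single 5-star rating tops the list. A common refinement is to shrink each movie's mean toward the global mean. The inline data and the prior weight m below are illustrative assumptions:

```python
import pandas as pd

# Same schema as the baseline: user_id, movie_id, rating (toy data for illustration)
df = pd.DataFrame({
    "user_id":  [1, 2, 3, 1, 2],
    "movie_id": ["m1", "m1", "m1", "m2", "m2"],
    "rating":   [4, 5, 4, 5, 5],
})

m = 5  # prior weight: an assumed hyperparameter to tune
global_mean = df["rating"].mean()

# Bayesian-average score: movies with few ratings stay near the global mean
stats = df.groupby("movie_id")["rating"].agg(["mean", "count"])
stats["score"] = (stats["count"] * stats["mean"] + m * global_mean) / (stats["count"] + m)

print(stats.sort_values("score", ascending=False))
```

This keeps the baseline simple while removing its most obvious ranking artifact.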

How Senior Engineers Fix It

Experienced engineers stabilize the system by:

  • Establishing simple baselines first:
    • Global popularity
    • Personalized popularity
    • Content-based similarity
  • Defining clear ranking metrics:
    • precision@k
    • recall@k
    • NDCG
  • Running offline evaluation before deployment
  • Adding A/B testing for real-world validation
  • Implementing feature stores to avoid training-serving skew
  • Monitoring model drift, data quality, and latency
  • Deploying models behind versioned APIs with rollback capability
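Of the ranking metrics listed above, NDCG is the least obvious to implement, so here is a minimal sketch; the graded relevance values are illustrative:

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain: graded relevance discounted by log2 of rank."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    """DCG normalized by the ideal ordering (IDCG); 1.0 means a perfect ranking."""
    idcg = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / idcg if idcg > 0 else 0.0

# Relevance of the list as served (e.g. 3 = loved, 0 = ignored)
served = [3, 0, 2, 0, 1]
print(round(ndcg_at_k(served, 5), 3))
```

A perfectly ordered list scores 1.0, so NDCG gives an interpretable ceiling that precision@k lacks.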

Why Juniors Miss It

Junior engineers often overlook these issues because:

  • They assume ML = complex models, skipping foundational baselines
  • They underestimate the importance of evaluation metrics
  • They focus on model code, not data quality
  • They lack experience with production constraints like latency, monitoring, and drift
  • They don’t realize that simple models often outperform early ML attempts

This postmortem highlights why disciplined baselines, metrics, and evaluation practices are essential before deploying any recommendation system.
