Postmortem: Building a “Recommended for You” Movie Feature with ML Before the Basics

Summary

This incident documents a failed attempt to build a “Recommended for You” movie feature using machine‑learning techniques without first establishing a stable baseline, clear evaluation metrics, or a reproducible experimentation workflow. The system produced inconsistent recommendations, poor ranking quality, and unpredictable behavior once deployed behind a Node.js API.

Root Cause

The primary root cause was starting with complex ML techniques before validating simple baselines. Additional contributing factors included:

  • Lack of a popularity or heuristic baseline to compare against
  • No clear offline evaluation metrics (precision@k, recall@k, MAP, NDCG)
  • Mixing content-based and collaborative filtering signals without normalization
  • Deploying models without A/B testing or monitoring
  • Insufficient understanding of data sparsity and cold-start behavior
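To make the missing-metrics point concrete, here is a minimal sketch of precision@k and recall@k. The item IDs and relevance sets below are illustrative, not from the incident data:

```python
def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommendations the user actually engaged with."""
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    return hits / k

def recall_at_k(recommended, relevant, k):
    """Fraction of the user's relevant items captured in the top-k."""
    if not relevant:
        return 0.0
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in set(relevant))
    return hits / len(relevant)

# Hypothetical example: 5 recommendations, user liked 3 items
recs = ["m1", "m2", "m3", "m4", "m5"]
liked = {"m2", "m5", "m9"}
print(precision_at_k(recs, liked, 5))  # 2 hits out of 5 -> 0.4
print(recall_at_k(recs, liked, 5))     # 2 of 3 relevant items found
```

Metrics this simple are enough to compare a heuristic baseline against any candidate model offline.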

Why This Happens in Real Systems

Real production systems often fail for reasons that have nothing to do with the algorithm itself:

  • Engineers jump straight to “fancy ML” instead of validating simple models
  • Data pipelines drift, causing training-serving skew
  • User interaction data is noisy, biased, or incomplete
  • Cold-start users and items break collaborative filtering approaches
  • Teams underestimate evaluation complexity, especially ranking metrics
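The cold-start failure mode above has a simple mitigation: route users with too little history to the popularity baseline instead of collaborative filtering. This is a sketch, not the incident's actual serving code; the threshold and the stubbed data are assumptions:

```python
MIN_INTERACTIONS = 5  # assumed threshold; tune per dataset

def recommend(user_id, interactions, cf_recs, popular, k=10):
    """Use collaborative-filtering output only for users with enough history."""
    history = interactions.get(user_id, [])
    if len(history) < MIN_INTERACTIONS:
        # Cold-start user: fall back to the popularity baseline
        return popular[:k]
    return cf_recs.get(user_id, popular)[:k]

# Hypothetical data: alice is cold-start, bob has enough history
interactions = {"alice": ["m1", "m2"], "bob": ["m1", "m2", "m3", "m4", "m5", "m6"]}
cf_recs = {"bob": ["m9", "m8", "m7"]}
popular = ["m1", "m2", "m3"]
print(recommend("alice", interactions, cf_recs, popular, k=2))  # ['m1', 'm2']
print(recommend("bob", interactions, cf_recs, popular, k=2))    # ['m9', 'm8']
```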

Real-World Impact

The absence of a reliable baseline and metrics caused:

  • Low-quality recommendations, reducing user trust
  • Inconsistent ranking behavior across environments
  • Increased latency due to unnecessary model complexity
  • Difficult debugging, since no baseline existed for comparison
  • Wasted compute and engineering time

Example: A Popularity Baseline

Below is a minimal Python example of a popularity baseline, which should always be the first experiment before ML models:

import pandas as pd

# Each row is one interaction: user_id, movie_id, rating
df = pd.read_csv("interactions.csv")

# Rank movies by mean rating: the simplest "recommend what's popular" baseline
popularity = (
    df.groupby("movie_id")["rating"]
      .mean()
      .sort_values(ascending=False)
)

# Top 10 movies, recommended to every user
print(popularity.head(10))
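One known weakness of a raw mean: a movie with a single 5-star rating tops the list. A common refinement is to shrink each movie's mean toward the global mean. The inline data and the prior weight m below are illustrative assumptions:

```python
import pandas as pd

# Same schema as the baseline: user_id, movie_id, rating (toy data for illustration)
df = pd.DataFrame({
    "user_id":  [1, 2, 3, 1, 2],
    "movie_id": ["m1", "m1", "m1", "m2", "m2"],
    "rating":   [4, 5, 4, 5, 5],
})

m = 5  # prior weight: an assumed hyperparameter to tune
global_mean = df["rating"].mean()

# Bayesian-average score: movies with few ratings stay near the global mean
stats = df.groupby("movie_id")["rating"].agg(["mean", "count"])
stats["score"] = (stats["count"] * stats["mean"] + m * global_mean) / (stats["count"] + m)

print(stats.sort_values("score", ascending=False))
```

This keeps the baseline simple while removing its most obvious ranking artifact.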

How Senior Engineers Fix It

Experienced engineers stabilize the system by:

  • Establishing simple baselines first:
    • Global popularity
    • Personalized popularity
    • Content-based similarity
  • Defining clear ranking metrics:
    • precision@k
    • recall@k
    • NDCG
  • Running offline evaluation before deployment
  • Adding A/B testing for real-world validation
  • Implementing feature stores to avoid training-serving skew
  • Monitoring model drift, data quality, and latency
  • Deploying models behind versioned APIs with rollback capability
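Of the ranking metrics listed above, NDCG is the least obvious to implement, so here is a minimal sketch; the graded relevance values are illustrative:

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain: graded relevance discounted by log2 of rank."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    """DCG normalized by the ideal ordering (IDCG); 1.0 means a perfect ranking."""
    idcg = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / idcg if idcg > 0 else 0.0

# Relevance of the list as served (e.g. 3 = loved, 0 = ignored)
served = [3, 0, 2, 0, 1]
print(round(ndcg_at_k(served, 5), 3))
```

A perfectly ordered list scores 1.0, so NDCG gives an interpretable ceiling that precision@k lacks.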

Why Juniors Miss It

Junior engineers often overlook these issues because:

  • They assume ML = complex models, skipping foundational baselines
  • They underestimate the importance of evaluation metrics
  • They focus on model code, not data quality
  • They lack experience with production constraints like latency, monitoring, and drift
  • They don’t realize that simple models often outperform early ML attempts

This postmortem highlights why disciplined baselines, metrics, and evaluation practices are essential before deploying any recommendation system.
