Summary
We encountered a critical performance degradation during the rollout of the new For You Page (FYP) feature. The initial implementation attempted to solve the “filter bubble” problem (where users only see the newest content) by introducing randomness into the query. While functionally correct, the transition from ORDER BY created_at to ORDER BY RAND() caused database CPU utilization to spike to 100%, leading to connection exhaustion and service downtime for all users attempting to access their feeds.
Root Cause
The failure was caused by the algorithmic complexity of the ORDER BY RAND() function in MySQL when applied to large datasets.
- Full Table Scan: MySQL cannot use an index to satisfy
ORDER BY RAND(). It must perform a full scan of the joined result set. - Temporary Table Creation: The engine must assign a random number to every single row that matches the
WHEREclause. - File Sort: After assigning random values, the database performs a massive sort operation in memory or on disk (depending on the
sort_buffer_size). - Complexity O(N): As the
poststable grew to 10 million rows, the cost of calculating a random value and sorting the entire set became mathematically unsustainable for a real-time request.
Why This Happens in Real Systems
In development environments with small datasets (e.g., 1,000 rows), ORDER BY RAND() feels instantaneous. This creates a false sense of security.
In production-scale systems:
- Data Volume Scaling: Performance degrades linearly (or worse) as the table grows.
- Concurrency Multiplication: If 100 users request their FYP simultaneously, the database is forced to perform 100 full table scans and 100 massive sorts concurrently, leading to resource contention.
- Join Amplification: Because the query involves a
JOINbetweenuser_interestsandposts, the temporary result set can be even larger than the base tables, compounding the memory pressure.
Real-World Impact
- Increased Latency: P99 latency for the feed endpoint jumped from 10s.
- Database Saturation: High CPU and I/O wait times prevented other critical services (like authentication or payments) from accessing the database.
- User Churn: Users experienced “infinite loading” screens, leading to immediate drop-offs in engagement.
- Cascading Failure: The slow queries held database connections open longer, eventually hitting
max_connectionsand crashing the application layer.
Example or Code
To achieve high-performance randomness, we replaced the expensive sort with a Primary Key Offset approach.
SELECT p.id, p.content, p.region, p.created_at
FROM posts p
JOIN user_interests ui ON p.region = ui.interested_region
WHERE ui.user_id = 1
AND p.id >= (SELECT FLOOR( MAX(id) * RAND()) FROM posts)
ORDER BY p.created_at DESC
LIMIT 20;
How Senior Engineers Fix It
Senior engineers solve this by moving away from “Global Randomness” toward “Bounded Randomness” or “Application-Layer Shuffling.”
- The Offset Method: Instead of sorting the whole table, pick a random starting point using the Primary Key and grab the next N rows. This leverages the index.
- Pre-computed Randomness: Use a “seed” stored in the user’s session. Generate a random seed in the application code and use it in the query to ensure the “random” order is consistent for a single page load but changes on refresh.
- Two-Step Fetching:
- Query for a list of
post_idsthat match the user’s interests (using a lightweight index-only scan). - Shuffle the resulting IDs in the Application Layer (e.g., Python or Go).
- Perform a single
SELECT ... WHERE id IN (...)to fetch the actual content.
- Query for a list of
- Decoupling with Caching: Use a specialized engine like Redis to store a pre-computed “candidate pool” of post IDs for a user’s interests, then sample from that pool.
Why Juniors Miss It
- Focus on Correctness over Scalability: Juniors often focus on whether the query returns the right data, ignoring how the database finds that data.
- Lack of Execution Plan Awareness: They rarely run
EXPLAINon their queries to see if a “filesort” or “full table scan” is occurring. - Small-Scale Testing Bias: They test against local SQLite or small MySQL containers, failing to simulate the data volume and concurrency of a live environment.
- Treating DB as a Black Box: They treat the database as a magic function that just “works,” rather than a finite resource with specific computational costs for different operations.