Why ORDER BY RAND() kills MySQL performance and how to replace it

Summary

We encountered a critical performance degradation during the rollout of the new For You Page (FYP) feature. The initial implementation attempted to solve the “filter bubble” problem (where users only see the newest content) by introducing randomness into the query. While functionally correct, the transition from ORDER BY created_at to ORDER BY RAND() caused database CPU utilization to spike to 100%, leading to connection exhaustion and service downtime for all users attempting to access their feeds.

Root Cause

The failure was caused by the algorithmic complexity of the ORDER BY RAND() function in MySQL when applied to large datasets.

Full Table Scan: MySQL cannot use an index to satisfy ORDER BY RAND(). It must perform a full scan of the joined result set.
Temporary Table Creation: The engine must assign a random number to every single row that matches the WHERE clause.
File Sort: After assigning random values, the database performs a massive sort operation in memory or on disk (depending on the sort_buffer_size).
Complexity O(N): As the posts table grew to 10 million rows, the cost of calculating a random value and sorting the entire set became mathematically unsustainable for a real-time request.

Why This Happens in Real Systems

In development environments with small datasets (e.g., 1,000 rows), ORDER BY RAND() feels instantaneous. This creates a false sense of security.

In production-scale systems:

Data Volume Scaling: Performance degrades linearly (or worse) as the table grows.
Concurrency Multiplication: If 100 users request their FYP simultaneously, the database is forced to perform 100 full table scans and 100 massive sorts concurrently, leading to resource contention.
Join Amplification: Because the query involves a JOIN between user_interests and posts, the temporary result set can be even larger than the base tables, compounding the memory pressure.

Real-World Impact

Increased Latency: P99 latency for the feed endpoint jumped from 10s.
Database Saturation: High CPU and I/O wait times prevented other critical services (like authentication or payments) from accessing the database.
User Churn: Users experienced “infinite loading” screens, leading to immediate drop-offs in engagement.
Cascading Failure: The slow queries held database connections open longer, eventually hitting max_connections and crashing the application layer.

Example or Code

To achieve high-performance randomness, we replaced the expensive sort with a Primary Key Offset approach.

SELECT p.id, p.content, p.region, p.created_at
FROM posts p
JOIN user_interests ui ON p.region = ui.interested_region
WHERE ui.user_id = 1
  AND p.id >= (SELECT FLOOR( MAX(id) * RAND()) FROM posts)
ORDER BY p.created_at DESC
LIMIT 20;

How Senior Engineers Fix It

Senior engineers solve this by moving away from “Global Randomness” toward “Bounded Randomness” or “Application-Layer Shuffling.”

The Offset Method: Instead of sorting the whole table, pick a random starting point using the Primary Key and grab the next N rows. This leverages the index.
Pre-computed Randomness: Use a “seed” stored in the user’s session. Generate a random seed in the application code and use it in the query to ensure the “random” order is consistent for a single page load but changes on refresh.
Two-Step Fetching:
1. Query for a list of post_ids that match the user’s interests (using a lightweight index-only scan).
2. Shuffle the resulting IDs in the Application Layer (e.g., Python or Go).
3. Perform a single SELECT ... WHERE id IN (...) to fetch the actual content.
Decoupling with Caching: Use a specialized engine like Redis to store a pre-computed “candidate pool” of post IDs for a user’s interests, then sample from that pool.

Why Juniors Miss It

Focus on Correctness over Scalability: Juniors often focus on whether the query returns the right data, ignoring how the database finds that data.
Lack of Execution Plan Awareness: They rarely run EXPLAIN on their queries to see if a “filesort” or “full table scan” is occurring.
Small-Scale Testing Bias: They test against local SQLite or small MySQL containers, failing to simulate the data volume and concurrency of a live environment.
Treating DB as a Black Box: They treat the database as a magic function that just “works,” rather than a finite resource with specific computational costs for different operations.