Preventing Database Cascading Failures in Auth Services

Summary

During a high-traffic event, our authentication service experienced a cascading failure due to a database schema designed for high-level entity modeling rather than high-velocity security telemetry. While the system successfully identified brute force attempts, the write-amplification caused by logging every failed attempt into a heavily indexed relational table led to connection pool exhaustion and a complete service outage.

Root Cause

The primary failure was an architectural mismatch between transactional data (user accounts) and telemetry data (login attempts).

  • Over-indexing: The login_attempts table had multiple secondary indexes to support real-time monitoring. Every failed attempt triggered an expensive B-Tree update.
  • Relational Contention: Using a single PostgreSQL instance for both the source of truth (user credentials) and the attack logs (high-frequency writes) created lock contention.
  • Synchronous Writes: The application logic followed a “write-then-verify” pattern, meaning the user’s login latency was directly tied to the disk I/O speed of the logging table.

Why This Happens in Real Systems

In production, engineers often fall into the trap of Uniform Data Treatment. We tend to treat all data as if it has the same lifecycle and importance.

  • Lifecycle Mismatch: User profiles are updated once every few months; brute force logs are updated hundreds of times per second during an attack.
  • Scaling Bottlenecks: Relational databases are optimized for ACID compliance and complex joins, not for the high-throughput, append-only nature of security event streaming.
  • The “Feature Creep” Trap: Adding “real-time monitoring” often leads to adding more indexes, which inadvertently kills write performance exactly when you need it most (during an active attack).

Real-World Impact

  • Denial of Service (DoS): The attacker didn’t need to guess a password; they only needed to trigger enough failed attempts to exhaust the database connection pool.
  • Increased Latency: Legitimate users experienced 10s+ timeouts because the database was busy re-indexing the attack logs.
  • Observability Blindness: Because the database was overwhelmed, our monitoring tools (which also relied on the DB) failed to report the spike in time to alert.

Example or Code

-- THE ANTI-PATTERN: High-contention relational table
CREATE TABLE login_attempts (
    id UUID PRIMARY KEY,
    user_id UUID REFERENCES users(id),
    ip_address INET,
    attempted_at TIMESTAMP WITH TIME ZONE,
    success BOOLEAN,
    user_agent TEXT
);

-- The killer: Too many indexes on a high-velocity table
CREATE INDEX idx_attempts_user_id ON login_attempts(user_id);
CREATE INDEX idx_attempts_ip ON login_attempts(ip_address);
CREATE INDEX idx_attempts_time ON login_attempts(attempted_at);

How Senior Engineers Fix It

Senior engineers decouple State from Events.

  • Polyglot Persistence: Use a relational database (PostgreSQL) for User Identity and a high-throughput, time-series, or Key-Value store (Redis or ClickHouse) for Security Telemetry.
  • Write-Behind Pattern: Instead of writing logs synchronously, push login events to a Message Queue (Kafka or RabbitMQ). A consumer then processes these logs asynchronously.
  • Probabilistic Data Structures: For rate limiting, use Redis Bloom Filters or Fixed-window counters to track IP hits in memory, rather than querying a disk-based database.
  • TTL (Time To Live): Implement strict data retention policies. Brute force logs are only useful for immediate mitigation and short-term forensics; they shouldn’t live in your primary DB forever.

Why Juniors Miss It

  • Focus on Normalization: Juniors are taught to normalize data to avoid redundancy, which often leads to deep relational webs that are slow to update.
  • Single-Tool Mentality: There is a tendency to believe “PostgreSQL can do anything,” ignoring the specific strengths of specialized engines like Redis or InfluxDB.
  • Ignoring Throughput: Most students test against a single user or a small script. They fail to simulate the concurrency and write-load that defines a real-world distributed denial-of-service (DDoS) or brute force scenario.

Leave a Comment