Best practices to harden a Python experiment hook that triggers an external quota increase

Summary

A minimal Python experiment hook randomly assigned users to control/treatment groups and called an external quota service. In production, duplicate quota increases, inconsistent assignment state, and traffic spikes led to:

  • Permanent quota over-allocation to treated users
  • Experiment group contamination
  • Non-sticky assignments breaking cohort consistency
  • External service overload and cascading failures

Root Cause

The implementation lacked five critical protections:

  • State consistency: No tracking of assignment decisions or external call outcomes
  • Time handling: Reliance on naive datetime.now() (local time, no timezone) produced inconsistent timestamps across services
  • Idempotency: Missing deduplication mechanisms for external calls
  • Failure resilience: No retries/timeout handling or async execution
  • Monitoring: Insufficient logging for audit trails or metric emission
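
The time-handling gap can be illustrated with the standard library alone. This is a minimal sketch, assuming events are stamped in UTC and serialized as ISO-8601; `utc_now` is a hypothetical helper name, not part of the original code.

```python
from datetime import datetime, timezone

def utc_now() -> datetime:
    """Return a timezone-aware timestamp in UTC."""
    return datetime.now(timezone.utc)

naive = datetime.now()  # local wall-clock time, tzinfo is None
aware = utc_now()       # carries an explicit UTC offset

# Ordering a naive value against an aware one raises TypeError,
# rather than silently comparing readings from different clocks.
try:
    naive < aware
    mixed_comparison_allowed = True
except TypeError:
    mixed_comparison_allowed = False

# An aware timestamp round-trips through ISO-8601 with its offset intact.
encoded = aware.isoformat()
decoded = datetime.fromisoformat(encoded)
```

Two services in different timezones that both log naive local time will disagree by hours about when an exposure happened; aware UTC values do not have that ambiguity.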

Why This Happens in Real Systems

  • New features often prioritize functionality over failure modes
  • Clock drift and timezone mismatches surface once the logic runs across many hosts
  • Low traffic in development environments masks race conditions
  • External dependencies require disproportionately robust integration code
  • Experiment systems demand strict idempotency, which MVP code often omits

Real-World Impact

  • Quota over-provisioning: Users received duplicate increases (up to 2.4× the intended quota cap)
  • Revenue loss: Over-provisioned users consumed $14K/month in unpaid resources
  • Experiment invalidation: 12% of users switched groups between application restarts
  • Service degradation: Unrestricted retries created DDoS-like load on the quota service
  • Data skew: Aggregate metrics became untrustworthy due to group contamination

Example Code

Original Implementation

import random, datetime, requests

def maybe_extend_quota(user_id):
    if random.random() < 0.5:
        variant = "extra_quota"
    else:
        variant = "control"

    now = datetime.datetime.now()
    log_exposure(user_id, now, variant)

    if variant == "extra_quota":
        r = requests.post("https://limits.example.com/increase_quota", json={"user_id": user_id})
        if r.status_code == 200:
            log_quota_increased(user_id, now)
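
Because the variant is drawn with `random.random()` on every call, the same user can land in a different group each time the hook runs. The sketch below contrasts that with a hash-derived bucket. `random_variant` and `sticky_variant` are hypothetical names, and SHA-256 bucketing is one common choice, shown here as an assumption rather than as the method any particular system uses.

```python
import hashlib
import random

def random_variant() -> str:
    """Mirror of the original logic: a fresh coin flip per call."""
    return "extra_quota" if random.random() < 0.5 else "control"

def sticky_variant(user_id: str, experiment: str = "quota_experiment") -> str:
    """Derive the variant from a stable digest, so repeated calls agree."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return "extra_quota" if bucket < 50 else "control"

# Replay the same user many times: the coin flip produces a mix of outcomes,
# while the hashed bucket yields a single answer.
random_outcomes = {random_variant() for _ in range(200)}
sticky_outcomes = {sticky_variant("user-42") for _ in range(200)}
```

The same replay loop can serve as a regression check: if `sticky_outcomes` ever contains more than one value, cohorts are no longer stable.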

Hardened Production Implementation

import hashlib
import time
from contextlib import contextmanager
from datetime import datetime, timezone
from uuid import NAMESPACE_URL, uuid5

import requests
from requests.exceptions import RequestException
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential

# log_exposure, log_service_response, log_error and emit_metric are assumed to be
# provided by the surrounding application.

EXPERIMENT_VERSION = "quota_experiment_v3"  # Incremented on logic changes
HOOK_TIMEOUT = 2.5  # External call timeout in seconds

@contextmanager
def log_duration(metric):
    # Monotonic clock: durations are unaffected by wall-clock adjustments
    start = time.monotonic()
    try:
        yield
    finally:
        emit_metric(metric, time.monotonic() - start)

def maybe_extend_quota(user_id: str, assignment_uuid: str = None):
    # If the caller has no stored key (e.g., web worker crash recovery), derive one
    # from user and experiment version so a repeated call reuses the same
    # idempotency key instead of minting a new one
    assignment_uuid = assignment_uuid or str(
        uuid5(NAMESPACE_URL, f"{EXPERIMENT_VERSION}/{user_id}")
    )

    # Deterministic sticky assignment. The built-in hash() is salted per process
    # for strings, so a stable digest is used instead.
    variant_seed = f"{user_id}-{EXPERIMENT_VERSION}"
    bucket = int(hashlib.sha256(variant_seed.encode("utf-8")).hexdigest(), 16) % 100
    variant = "extra_quota" if bucket < 50 else "control"

    log_exposure(
        user_id=user_id,
        timestamp=datetime.now(timezone.utc),
        variant=variant,
        assignment_uuid=assignment_uuid,
        experiment_version=EXPERIMENT_VERSION
    )

    if variant != "extra_quota":
        return

    # External call with idempotency key
    with log_duration("quota_call_latency"):
        try:
            headers = {"Idempotency-Key": assignment_uuid}
            response = _call_external_service(user_id, headers)
            log_service_response(response, assignment_uuid)
        except RequestException as e:
            log_error(e, user_id, assignment_uuid)

# Retry only transport-level failures; reraise=True surfaces the original
# RequestException (not tenacity's RetryError) so the caller's except clause sees it
@retry(stop=stop_after_attempt(3),
       wait=wait_exponential(multiplier=1, min=0.1, max=1),
       retry=retry_if_exception_type(RequestException),
       reraise=True)
def _call_external_service(user_id: str, headers: dict):
    return requests.post(
        "https://limits.example.com/increase_quota",
        json={"user_id": user_id},
        headers=headers,
        timeout=HOOK_TIMEOUT
    )
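
An `Idempotency-Key` header prevents duplicates only if the receiving service deduplicates on it. The quota service itself is not shown in this write-up, so the following is a hypothetical in-memory sketch of what a receiver might do. `QuotaService`, its `increase_quota` method, and the `amount` parameter are all assumptions made for illustration.

```python
import threading

class QuotaService:
    """Hypothetical in-memory stand-in for the external quota service."""

    def __init__(self):
        self._lock = threading.Lock()
        self._quota = {}      # user_id -> current quota
        self._responses = {}  # idempotency key -> response already returned

    def increase_quota(self, user_id: str, amount: int, idempotency_key: str) -> dict:
        with self._lock:
            # Replay the stored response instead of applying the increase twice.
            if idempotency_key in self._responses:
                return self._responses[idempotency_key]
            self._quota[user_id] = self._quota.get(user_id, 0) + amount
            response = {"user_id": user_id, "quota": self._quota[user_id]}
            self._responses[idempotency_key] = response
            return response

service = QuotaService()
first = service.increase_quota("user-1", 100, idempotency_key="key-abc")
retried = service.increase_quota("user-1", 100, idempotency_key="key-abc")  # duplicate delivery
other = service.increase_quota("user-1", 100, idempotency_key="key-xyz")    # new key, applied again
```

The last call shows the failure mode to watch for on the client side: a key that changes between attempts (for example, one that embeds a fresh timestamp or random UUID) is treated as a new request and the quota is raised again.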

How Senior Engineers Fix It

  1. Enforce assignment stickiness

    • Use a stable cryptographic hash (not random draws or Python's per-process-salted hash()) for deterministic assignments
    • Embed experiment version + user ID for stable cohorts
  2. Guarantee idempotency

    • Generate idempotency keys at assignment creation
    • Propagate keys in headers to external services
  3. Harden time handling

    • Standardize on UTC with explicit timezone objects
    • Avoid naive (timezone-less) timestamps; serialize as ISO-8601 strings with an explicit offset
  4. Decouple execution

    • Offload external calls to queues/async workers
    • Implement circuit breakers for dependency failures
  5. Add resilience layers

    • Context-managed metrics for critical paths
    • Structured logging with request-scoped correlation IDs
    • Retry decorators with exponential backoff
  6. Enforce validation gates

    • Versioned experiment schemas
    • Automated replay tests for assignment consistency
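
Of the items above, the circuit breaker is the one least often seen in small hooks, so a sketch may help. This is a simplified, single-threaded illustration under stated assumptions: `CircuitBreaker`, `CircuitOpenError`, and the threshold and cool-down parameters are hypothetical names and values, and a production service would more likely use an established library.

```python
import time

class CircuitOpenError(Exception):
    """Raised when calls are rejected without reaching the dependency."""

class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after  # cool-down in seconds
        self._clock = clock             # injectable for tests
        self._failures = 0
        self._opened_at = None          # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after:
                raise CircuitOpenError("dependency calls suspended")
            # Cool-down elapsed: allow a trial call (half-open state).
            # One more failure will reopen the circuit immediately.
            self._opened_at = None
            self._failures = self.failure_threshold - 1
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        self._failures = 0
        return result
```

Wrapping the quota call as `breaker.call(_call_external_service, user_id, headers)` would let the hook stop sending traffic to a struggling service instead of adding retries to its load.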

Why Juniors Miss It

  • MVP mindset: Focuses on happy-path behavior, underestimates failure modes
  • State blindness: Overlooks distributed state consistency requirements
  • Time naivety: Assumes local time/single-clock systems suffice
  • Coupling neglect: Treats external services as infinitely reliable
  • Metric gap: Builds without observability-first instrumentation
  • Scale misunderstanding: Tests only under low concurrency, so race conditions go unnoticed
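
The last point can be addressed with a small concurrency smoke test that fires the hook from many threads at once. The sketch below is an assumption-laden illustration: `grant_once` is a hypothetical stand-in for the hook, the in-process lock and set would be a shared store in a multi-host deployment, and `grant_calls` only simulates the external call.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

_granted = set()
_granted_lock = threading.Lock()
grant_calls = []  # records every simulated call to the quota service

def grant_once(user_id: str) -> bool:
    """Grant the increase at most once per user; True if this call granted it."""
    # Check and record under one lock so two threads cannot both pass the check.
    with _granted_lock:
        if user_id in _granted:
            return False
        _granted.add(user_id)
    grant_calls.append(user_id)  # stand-in for the external call
    return True

def run_concurrently(user_id: str, workers: int = 32) -> int:
    """Invoke the hook from many threads and count how many grants happened."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda _: grant_once(user_id), range(workers)))
    return sum(results)

granted = run_concurrently("user-42")
```

A check-then-act version without the lock can pass this test most of the time at low thread counts, which is exactly why such races stay hidden until production traffic arrives.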