Best practices to harden a Python experiment hook that triggers an external quota increase
Summary
A minimal Python experiment hook randomly assigned users to control/treatment groups and called an external quota service. Production incidents occurred where duplicate quota increases, inconsistent assignment state, and traffic spikes caused:
- Permanent quota over-allocation to treated users
- Experiment group contamination
- Non-sticky assignments breaking cohort consistency
- External service overload and cascading failures
Root Cause
The implementation lacked five critical protections:
- State consistency: No tracking of assignment decisions or external call outcomes
- Time handling: Reliance on datetime.now() introduced assignment drifts across services
- Idempotency: Missing deduplication mechanisms for external calls
- Failure resilience: No retries/timeout handling or async execution
- Monitoring: Insufficient logging for audit trails or metric emission
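The time-handling item is easiest to see with a small example. The sketch below assumes two hosts in different local zones (UTC-8 and UTC+1, chosen only for illustration) that each record the same instant without timezone information, which is what a bare datetime.now() produces. The helper name utc_now is hypothetical.

```python
from datetime import datetime, timedelta, timezone


def utc_now() -> datetime:
    """Return the current time as a timezone-aware UTC datetime."""
    return datetime.now(timezone.utc)


# One instant, as two hosts with different local zones would record it
# if each dropped the timezone (the effect of a bare datetime.now()).
instant = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
host_a = instant.astimezone(timezone(timedelta(hours=-8))).replace(tzinfo=None)
host_b = instant.astimezone(timezone(timedelta(hours=1))).replace(tzinfo=None)

# Apparent gap between two records of the *same* instant.
naive_gap = host_b - host_a
print(naive_gap)  # 9:00:00
```

With aware UTC values from utc_now(), both hosts would log the same instant, so exposure and quota events can be ordered across services.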
Why This Happens in Real Systems
- New features often prioritize functionality over failure modes
- Time/clock drift emerges at scale once several hosts and services record timestamps
- Development environments mask race conditions because traffic volumes are low
- External dependencies require disproportionately robust integration code
- Experiment systems demand strict idempotency, which MVP code tends to omit
Real-World Impact
- Quota over-provisioning: Users received duplicate increases (up to 2.4× intended quota cap)
- Revenue loss: Over-provisioned users consumed $14K/month in unpaid resources
- Experiment invalidation: 12% of users switched groups between application restarts
- Service degradation: Unrestricted retries created DDoS effects on quota service
- Data skew: Aggregate metrics became untrustworthy due to group contamination
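Group contamination of the kind listed above can usually be measured directly from exposure logs. The following is a minimal sketch, assuming exposures are available as (user_id, variant) pairs; the function names and the sample data are hypothetical and not taken from the incident.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def contaminated_users(exposures: List[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Map each user seen in more than one variant to the variants observed."""
    seen: Dict[str, Set[str]] = defaultdict(set)
    for user_id, variant in exposures:
        seen[user_id].add(variant)
    return {user: variants for user, variants in seen.items() if len(variants) > 1}


def contamination_rate(exposures: List[Tuple[str, str]]) -> float:
    """Share of distinct users that were exposed to more than one variant."""
    users = {user_id for user_id, _ in exposures}
    if not users:
        return 0.0
    return len(contaminated_users(exposures)) / len(users)


# Illustrative log: user "u2" switched groups between two exposures.
sample_log = [
    ("u1", "control"),
    ("u2", "control"),
    ("u2", "extra_quota"),
    ("u3", "extra_quota"),
    ("u4", "control"),
    ("u4", "control"),
]
```

A check like this could run as a scheduled job and alert when the rate is above zero, since sticky assignment should never produce a user with two variants.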
Example Code
Original Implementation
import datetime
import random

import requests

def maybe_extend_quota(user_id):
    # Random assignment: not sticky, so a user can change groups on every call
    if random.random() < 0.5:
        variant = "extra_quota"
    else:
        variant = "control"
    now = datetime.datetime.now()  # naive local time, no timezone attached
    log_exposure(user_id, now, variant)
    if variant == "extra_quota":
        # No timeout, no retry policy, no idempotency key
        r = requests.post("https://limits.example.com/increase_quota", json={"user_id": user_id})
        if r.status_code == 200:
            log_quota_increased(user_id, now)
Hardened Production Implementation
import hashlib
import time
from contextlib import contextmanager
from datetime import datetime, timezone
from typing import Optional
from uuid import uuid4

import requests
from requests.exceptions import RequestException
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential

# log_exposure, log_service_response, log_error and emit_metric are the
# application's own logging/metrics helpers, defined elsewhere.
EXPERIMENT_VERSION = "quota_experiment_v3"  # Incremented on logic changes
HOOK_TIMEOUT = 2.5  # External call timeout in seconds

@contextmanager
def log_duration(metric):
    start = time.monotonic()  # monotonic clock: unaffected by wall-clock adjustments
    try:
        yield
    finally:
        emit_metric(metric, time.monotonic() - start)

def maybe_extend_quota(user_id: str, assignment_uuid: Optional[str] = None):
    # Regenerate assignment only if missing (e.g., web worker crash recovery).
    # Callers should persist the value and pass it back on re-invocation so
    # the idempotency key stays the same for the same assignment.
    assignment_uuid = assignment_uuid or f"{datetime.now(timezone.utc).isoformat()}-{uuid4()}"

    # Deterministic sticky assignment. SHA-256 is stable across processes,
    # unlike the built-in hash(), which is salted per interpreter run for strings.
    variant_seed = f"{user_id}-{EXPERIMENT_VERSION}"
    bucket = int(hashlib.sha256(variant_seed.encode("utf-8")).hexdigest(), 16) % 100
    variant = "extra_quota" if bucket < 50 else "control"

    log_exposure(
        user_id=user_id,
        timestamp=datetime.now(timezone.utc),
        variant=variant,
        assignment_uuid=assignment_uuid,
        experiment_version=EXPERIMENT_VERSION,
    )
    if variant != "extra_quota":
        return

    # External call with idempotency key
    with log_duration("quota_call_latency"):
        try:
            headers = {"Idempotency-Key": assignment_uuid}
            response = _call_external_service(user_id, headers)
            log_service_response(response, assignment_uuid)
        except RequestException as e:
            log_error(e, user_id, assignment_uuid)

@retry(
    retry=retry_if_exception_type(RequestException),
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=0.1, max=1),
    reraise=True,  # surface the original RequestException, not tenacity's RetryError
)
def _call_external_service(user_id: str, headers: dict):
    response = requests.post(
        "https://limits.example.com/increase_quota",
        json={"user_id": user_id},
        headers=headers,
        timeout=HOOK_TIMEOUT,
    )
    response.raise_for_status()  # treat 4xx/5xx responses as failures
    return response
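The hardened hook above bounds retries per call, but it still contacts the quota service on every treated request, even while that service is down. One common complement is a circuit breaker. The sketch below is a simplified, single-threaded illustration rather than a production implementation; the class names, thresholds, and the injectable clock are all assumptions made for the example.

```python
import time
from typing import Any, Callable, Optional


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 3,
        reset_after: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._failure_threshold = failure_threshold
        self._reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, func: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_after:
                # Still cooling down: fail fast without touching the dependency.
                raise CircuitOpenError("quota service circuit is open")
            # Cool-down elapsed: allow one trial call (half-open state).
            # A single failure on this trial re-opens the circuit.
            self._opened_at = None
            self._failures = self._failure_threshold - 1
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._failures >= self._failure_threshold:
                self._opened_at = self._clock()
            raise
        self._failures = 0  # any success closes the circuit fully
        return result
```

In the hook, the external call could then be routed through something like breaker.call(_call_external_service, user_id, headers), with CircuitOpenError logged and treated as a skipped (not failed) quota increase. Sharing one breaker across threads would additionally need a lock.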
How Senior Engineers Fix It
- Enforce assignment stickiness
  - Use cryptographic hashing (not randomness) for deterministic assignments
  - Embed experiment version + user ID for stable cohorts
- Guarantee idempotency
  - Generate idempotency keys at assignment creation
  - Propagate keys in headers to external services
- Harden time handling
  - Standardize on UTC with explicit timezone objects
  - Avoid floating timestamps; use ISO-8601 strings
- Decouple execution
  - Offload external calls to queues/async workers
  - Implement circuit breakers for dependency failures
- Add resilience layers
  - Context-managed metrics for critical paths
  - Structured logging with request-scoped correlation IDs
  - Retry decorators with exponential backoff
- Enforce validation gates
  - Versioned experiment schemas
  - Automated replay tests for assignment consistency
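An automated replay test for assignment consistency might look like the sketch below. It assumes recorded assignments are available as a mapping from user ID to variant; assign_variant and replay_mismatches are hypothetical names, and the SHA-256 bucketing is one possible choice of stable hash, not necessarily the one used in the incident.

```python
import hashlib
from typing import Dict, List

EXPERIMENT_VERSION = "quota_experiment_v3"


def assign_variant(user_id: str, experiment_version: str = EXPERIMENT_VERSION) -> str:
    """Deterministically map a user to a variant using a stable hash."""
    seed = f"{user_id}-{experiment_version}".encode("utf-8")
    bucket = int(hashlib.sha256(seed).hexdigest(), 16) % 100
    return "extra_quota" if bucket < 50 else "control"


def replay_mismatches(recorded: Dict[str, str]) -> List[str]:
    """Return user IDs whose recomputed variant differs from the recorded one."""
    return sorted(
        user_id
        for user_id, variant in recorded.items()
        if assign_variant(user_id) != variant
    )


# Replay: assignments recorded earlier must be reproduced exactly.
recorded = {f"user-{i}": assign_variant(f"user-{i}") for i in range(1000)}
assert replay_mismatches(recorded) == []
```

Because the result depends only on the user ID and the experiment version, the same check can run in CI against a snapshot of production assignments and should fail if the assignment logic changes without a version bump.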
Why Juniors Miss It
- MVP mindset: Focuses on happy-path behavior, underestimates failure modes
- State blindness: Overlooks distributed state consistency requirements
- Time naivety: Assumes local time/single-clock systems suffice
- Coupling neglect: Treats external services as infinitely reliable
- Metric gap: Builds without observability-first instrumentation
- Scale misunderstanding: Tests only under low concurrency, so race conditions go unnoticed
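A concurrency test of the idempotency guarantee does not need real traffic. The sketch below uses a hypothetical in-memory stand-in for the quota service (InMemoryQuotaService, with an assumed base quota of 100 and an increment of 50) and fires many duplicate requests carrying the same idempotency key. It illustrates the check-and-apply step being done under a lock; it says nothing about how the real service is implemented.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List


class InMemoryQuotaService:
    """Hypothetical stand-in that applies each idempotency key at most once."""

    def __init__(self, base_quota: int = 100) -> None:
        self._lock = threading.Lock()
        self._base_quota = base_quota
        self._quota: Dict[str, int] = {}
        self._results_by_key: Dict[str, int] = {}

    def increase_quota(self, user_id: str, amount: int, idempotency_key: str) -> int:
        # The lookup and the update happen under one lock, so two concurrent
        # requests with the same key cannot both apply the increase.
        with self._lock:
            if idempotency_key in self._results_by_key:
                return self._results_by_key[idempotency_key]
            new_quota = self._quota.get(user_id, self._base_quota) + amount
            self._quota[user_id] = new_quota
            self._results_by_key[idempotency_key] = new_quota
            return new_quota


def send_duplicates(service: InMemoryQuotaService, user_id: str, key: str,
                    workers: int = 50) -> List[int]:
    """Send the same request concurrently and collect every response."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [
            pool.submit(service.increase_quota, user_id, 50, key)
            for _ in range(workers)
        ]
        return [future.result() for future in futures]
```

If every response reports the same quota, the increase was applied once regardless of how many duplicates arrived; a second, different key for the same user should raise the quota again.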