Understanding increase() drops on counters in Prometheus

Summary

The user observed unexpected behavior when using the increase() function in PromQL on a counter metric (bytes_out_total). Instead of a continuously rising line reflecting total throughput, the graph showed sharp drops following a counter reset. The core misunderstanding lies in what increase() returns: the growth of the counter within a sliding time window, adjusted for resets and extrapolated to the window boundaries, not a running total.

Root Cause

The issue stems from a fundamental misunderstanding of how increase() processes data points over a specific time window:

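Counter Resets: A counter is a cumulative metric that only goes up. When a service restarts, the counter resets to zero. Prometheus treats any decrease between two consecutive samples as a reset and assumes the counter restarted from zero.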
Extrapolation: The increase() function is not a simple subtraction of the first and last value. Scrapes rarely line up with the exact start and end of the query window, so Prometheus takes the reset-adjusted growth between the first and last samples inside the window and extrapolates it to cover the full range. This is why increase() can return non-integer values even for a counter that only ever grows by whole numbers.
Window Misalignment: If you use a range like [24h], Prometheus evaluates every sample in that 24-hour window. A reset inside the window is detected and compensated for, but any increments that happened between the last scrape before the restart and the restart itself were never recorded and cannot be recovered. With irregular scrapes or few samples in the window, the extrapolated result can differ noticeably from the true value.
The “Drop” Illusion: The user expected a cumulative total, but increase() calculates the delta within a window that slides forward with every point on the graph. When a burst of traffic ages out of the window, or when a reset discards increments that were never scraped, the value falls and the graph visually dips, even though the underlying counter never decreased apart from the reset itself.
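The reset handling described above can be sketched in a few lines of Python. This is a simplified model rather than Prometheus code: it assumes hypothetical bytes_out_total samples scraped every 15 seconds and deliberately leaves out boundary extrapolation, so that only the effect of the reset is visible.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, counter value)


def naive_delta(samples: List[Sample]) -> float:
    """Last value minus first value, ignoring resets."""
    return samples[-1][1] - samples[0][1]


def reset_aware_increase(samples: List[Sample]) -> float:
    """Sum the growth between consecutive samples, treating any decrease as a reset.

    After a reset the counter is assumed to have restarted from zero, so the
    whole post-reset value counts as new growth. No extrapolation is applied.
    """
    total = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        total += curr - prev if curr >= prev else curr
    return total


# Hypothetical samples; the process restarts between t=30 and t=45.
samples = [(0, 1000.0), (15, 1200.0), (30, 1500.0), (45, 100.0), (60, 400.0)]

print(naive_delta(samples))           # -600.0: the reset looks like a drop
print(reset_aware_increase(samples))  # 900.0: the reset is compensated for
```

Anything the counter accumulated between the scrape at t=30 and the restart is absent from both results, because it was never scraped.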

Why This Happens in Real Systems

In production environments, counters are volatile due to:

Pod/Container Restarts: In Kubernetes, a deployment update or a crash causes the container to restart, resetting all internal metrics to zero.
Deployment Cycles: CI/CD pipelines frequently push new code, causing periodic, scheduled resets across the fleet.
Memory Pressure: An OOM (Out of Memory) kill will reset the process, wiping the current counter state.
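Sampling Gaps: Network jitter or high CPU load can cause missed scrapes. Increments that occur between the last successful scrape and a reset are lost for good, and if enough scrapes are missed that the restarted counter has already climbed past its pre-reset value, Prometheus cannot detect the reset at all and increase() undercounts badly.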
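A small simulation makes the effect of scrape gaps around a reset concrete. The numbers are assumptions chosen for illustration: a process emitting a steady 100 bytes per second, a restart at t=40s, and a nominal 15-second scrape interval. The increase is computed with reset detection only, without extrapolation.

```python
from typing import List


def counter_value(t: int, restart_at: int, rate: int) -> int:
    """Counter growing by `rate` per second that restarts from zero at `restart_at`."""
    return rate * t if t < restart_at else rate * (t - restart_at)


def reset_aware_increase(values: List[int]) -> int:
    """Sum growth between consecutive samples; any decrease counts as a reset."""
    total = 0
    for prev, curr in zip(values, values[1:]):
        total += curr - prev if curr >= prev else curr
    return total


RATE = 100               # bytes per second (hypothetical)
RESTART_AT = 40          # the process is killed and restarted at t=40s
true_total = RATE * 90   # bytes actually sent during the 90 seconds

all_scrapes = [0, 15, 30, 45, 60, 75, 90]
gappy_scrapes = [0, 15, 30, 75, 90]  # the scrapes at t=45 and t=60 were missed

observed_full = reset_aware_increase(
    [counter_value(t, RESTART_AT, RATE) for t in all_scrapes]
)
observed_gappy = reset_aware_increase(
    [counter_value(t, RESTART_AT, RATE) for t in gappy_scrapes]
)

print(true_total, observed_full, observed_gappy)  # 9000 8000 5000
```

With every scrape present, only the 1000 bytes sent between t=30 and the restart are lost. With the two missed scrapes, the counter has already reached 3500 by t=75, which is above the last pre-reset sample of 3000, so the reset goes unnoticed and the shortfall grows to 4000 bytes.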

Real-World Impact

False Alerts: SRE teams may receive false positive alerts for “low throughput” when, in reality, the metric simply reset.
Inaccurate Billing: If metrics are used for usage-based billing (e.g., GB of data transferred), a miscalculated increase() leads to revenue loss or overcharging customers.
Bad Capacity Planning: Engineers making scaling decisions based on “total increase” will see underestimated trends, leading to under-provisioning of resources.

Example or Code

To visualize throughput, engineers should use rate() for per-second averages or increase() for the total growth over a window. Both functions handle counter resets in the same way; in either case, choose a window that comfortably spans several scrape intervals.

# To see the per-second rate of bytes out (most stable for alerting)
rate(bytes_out_total[5m])

# To see the total increase over a 24h period (useful for dashboards)
increase(bytes_out_total[24h])
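To see why a windowed increase() can fall even when the counter itself never decreases, the following Python sketch evaluates a sliding window over hypothetical samples. It is a simplified model: boundary extrapolation is omitted and samples on both window edges are included, which may differ from the exact boundary semantics of a given Prometheus version.

```python
from typing import List, Tuple

Sample = Tuple[int, float]  # (timestamp in seconds, counter value)


def window_increase(samples: List[Sample], end: int, window: int) -> float:
    """Reset-aware growth over samples with end - window <= t <= end."""
    values = [v for t, v in samples if end - window <= t <= end]
    total = 0.0
    for prev, curr in zip(values, values[1:]):
        total += curr - prev if curr >= prev else curr
    return total


# Hypothetical counter scraped every 60s: a 1000-byte burst arrives before
# t=120, followed by a steady trickle of 10 bytes per scrape.
samples = [
    (0, 0.0), (60, 0.0), (120, 1000.0), (180, 1010.0),
    (240, 1020.0), (300, 1030.0), (360, 1040.0),
]

WINDOW = 180  # seconds, standing in for the [24h] range above
for end in (180, 240, 300, 360):
    inc = window_increase(samples, end, WINDOW)
    # Dividing by the window length gives the rate()-style per-second average.
    print(end, inc, inc / WINDOW)
```

The windowed value is 1020 at t=240 and 30 at t=300: the burst has slid out of the window, so the graph dips although the counter kept rising.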

How Senior Engineers Fix It

Senior engineers don’t just fix the query; they fix the observability strategy:

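Prefer rate() for Alerting and Recording Rules: increase() is rate() multiplied by the number of seconds in the range, so neither is more accurate than the other. A per-second rate, however, stays comparable when the window changes, which makes thresholds easier to reason about. Keep increase() for human-readable totals on dashboards.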
Shorten the Scrape Interval: If resets are causing large estimation errors, scrape more frequently. More samples mean fewer increments are lost before a reset and a smaller share of the window has to be extrapolated.
Check Counter Implementation: Ensure the application is using a standard Prometheus client library that correctly implements monotonic counters.
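Visual Smoothing: When building Grafana dashboards, use irate() for high-resolution views of volatile, fast-moving counters and rate() for long-term trends. irate() looks only at the last two samples in the window, so short spikes can appear or vanish depending on the graph step, which is why it is a poor fit for alerting.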
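The difference between a window average and an “instant” rate can be illustrated with a short Python sketch. The functions below are simplified stand-ins for rate() and irate(), assuming hypothetical samples and no boundary extrapolation; they are not the Prometheus implementations.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, counter value)


def step(prev: float, curr: float) -> float:
    """Growth between two samples; a decrease is treated as a reset to zero."""
    return curr - prev if curr >= prev else curr


def avg_rate(samples: List[Sample]) -> float:
    """Per-second average over all samples in the window (rate()-like)."""
    total = sum(step(a[1], b[1]) for a, b in zip(samples, samples[1:]))
    return total / (samples[-1][0] - samples[0][0])


def instant_rate(samples: List[Sample]) -> float:
    """Per-second rate from the last two samples only (irate()-like)."""
    (t0, v0), (t1, v1) = samples[-2], samples[-1]
    return step(v0, v1) / (t1 - t0)


# Hypothetical window: steady 10 bytes/s, then a burst in the last interval.
samples = [(0, 0.0), (15, 150.0), (30, 300.0), (45, 450.0), (60, 3450.0)]

print(avg_rate(samples))      # 57.5: the burst is averaged over the window
print(instant_rate(samples))  # 200.0: only the last interval is considered
```

The averaged value changes gradually as the window moves, while the two-sample value reacts immediately and forgets the burst as soon as the next sample arrives.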

Why Juniors Miss It

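Mathematical Intuition vs. Implementation: Juniors assume increase(metric[24h]) is simply last_value - first_value. They fail to realize that Prometheus first corrects for counter resets and then extrapolates the result to the edges of the window. No regression is involved; only the first and last samples in the window, plus any resets between them, determine the result.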
Ignoring the “Counter” Nature: They treat counters like Gauges. A Gauge can go up and down; a Counter’s only “downward” movement is a reset, which requires special handling.
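Window Selection: Juniors often pick arbitrary windows (like [1h]) without considering the scrape interval. A window needs at least two samples to return anything at all, and a common rule of thumb is to make it at least four times the scrape interval so that a single missed scrape does not leave a gap. The fewer samples the window holds, the larger the share of the result that comes from extrapolation.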
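The interaction between window size, scrape interval, and extrapolation can be sketched as follows. This is a simplified approximation loosely modelled on the boundary extrapolation Prometheus applies, under stated assumptions: hypothetical samples every 15 seconds, extrapolation all the way to a window edge only when the nearest sample is within roughly 1.1 average sample gaps of it, and otherwise by half a gap. The real implementation has additional guards, such as not extrapolating a counter below zero.

```python
from typing import List, Optional, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, counter value)


def extrapolated_increase(
    samples: List[Sample], start: float, end: float
) -> Optional[float]:
    """Approximate increase() over [start, end]; None with fewer than two samples."""
    pts = [(t, v) for t, v in samples if start <= t <= end]
    if len(pts) < 2:
        return None  # not enough data in the window to compute anything

    # Reset-aware growth between the first and last sample in the window.
    raw = 0.0
    for (_, prev), (_, curr) in zip(pts, pts[1:]):
        raw += curr - prev if curr >= prev else curr

    first_t, last_t = pts[0][0], pts[-1][0]
    sampled = last_t - first_t
    avg_gap = sampled / (len(pts) - 1)
    threshold = avg_gap * 1.1

    # Extend the sampled interval towards each window edge.
    to_start, to_end = first_t - start, end - last_t
    covered = sampled
    covered += to_start if to_start < threshold else avg_gap / 2
    covered += to_end if to_end < threshold else avg_gap / 2
    return raw * covered / sampled


# Hypothetical counter scraped every 15s, growing by 10 per scrape.
samples = [(5, 100.0), (20, 110.0), (35, 120.0), (50, 130.0)]

print(extrapolated_increase(samples, 0, 60))   # 40.0: raw growth of 30, scaled to 60s
print(extrapolated_increase(samples, 30, 60))  # 20.0: two samples, half is extrapolated
print(extrapolated_increase(samples, 40, 60))  # None: only one sample in the window
```

In the 30-second window, half of the reported value comes from extrapolation, and in the 20-second window there is no result at all, which is why the window should be sized relative to the scrape interval rather than picked arbitrarily.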