How can I get historical pod start/end times and CPU/memory requests from Prometheus (OpenShift/kube-state-metrics)?

Summary

This postmortem addresses a common operational gap: reconstructing historical pod lifecycles (start/end times) and resource requests in Kubernetes/OpenShift clusters using Prometheus and kube-state-metrics (KSM). The core issue is that while KSM provides real-time snapshots, historical queries for ephemeral resources require specific PromQL patterns and an understanding of metric cardinality lifecycles. The primary failure mode is attempting to query metrics like kube_pod_start_time as if they were cumulative counters rather than stateful gauges that vanish upon pod deletion. Reliable historical reconstruction requires leveraging metric timestamps (via timestamp()) and correlating multiple metric streams (start time, status phase, and request gauges) rather than relying on a single series.

Root Cause

The root cause of confusion in this domain is the ephemeral nature of Kubernetes pod resources combined with the pull-based model of Prometheus. Unlike API audit logs, which record discrete events, Prometheus scrapes the state of the cluster at intervals.

  • Metric Disappearance: kube_pod_start_time and kube_pod_container_resource_requests are Gauges. Once a pod is deleted, the metric series cease to exist. There is no “end time” metric; the end time is defined by the absence of data or the change in status phase.
  • Series Identity and Relabeling: PromQL queries often fail historically because they rely on labels (like pod) that might change if the deployment strategy changes or if pods are recreated with random suffixes. Without anchoring to a stable identifier (like uid or specific owner references), correlating a pod’s start time to its later resource usage becomes unreliable.
  • PromQL Function Misapplication: Users often attempt to use increase() or rate() on stateful gauges. kube_pod_start_time is a timestamp; mathematically, increase(kube_pod_start_time) is nonsensical. To capture a “past” value, one must query the value at that specific timestamp or look back at historical recording rules.

Why This Happens in Real Systems

In production environments, we often need to perform capacity planning or audit compliance after the fact. This scenario frequently occurs when:

  • Cost Allocation: Teams need to bill back resources used by transient jobs or pods that have already exited.
  • Incident Investigation: Engineers need to correlate a crash (pod failure) with a spike in CPU usage just before the termination.
  • StatefulSet/DaemonSet Updates: When rolling updates occur, the old pods vanish, but their resource footprint is needed for comparison against the new versions.

The friction arises because kube-state-metrics exposes only the cluster’s current state; Prometheus records the history of that state within its retention window. If you query kube_pod_start_time at the current instant, you only see currently running pods. Once a pod is deleted, its series stops, so the historical data still exists in the TSDB but only appears if you evaluate the query at a past timestamp. If you query at now(), the data is effectively “invisible.”

Real-World Impact

  • Inaccurate Cost Analysis: Without correct historical querying, engineering teams cannot accurately attribute cloud compute costs to specific microservices that spun up and tore down within a billing cycle.
  • Missed OOMKills: If a pod crashes and is restarted immediately, standard queries often miss the memory pressure events of the previous incarnation because the pod label (with its unique random suffix) is lost.
  • Performance Regression Blind Spots: Comparing the resource footprint of “V1” vs “V2” of an application is difficult if you cannot reconstruct the requests and limits of the terminated V1 pods.

Example or Code

Below are the PromQL patterns that address each requirement.

1. Pod Start Time (Historical)

You cannot rely on kube_pod_start_time alone because it disappears. Instead, use the metric’s timestamp. If you want to know when a pod started at a historical moment t, you query the metric at time t.

Query: Get the start time of a specific pod at a point in history.

# Returns the start time (Unix timestamp) of the pod at the query timestamp
kube_pod_start_time{namespace="my-ns", pod="my-pod-abc123"}

Note: To find historical pods, you must query Prometheus with a specific historical timestamp (using the time parameter in the API), or use a recording rule that snapshots start times.
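A minimal sketch of issuing such a historical instant query through the Prometheus HTTP API (/api/v1/query with the time parameter) in Python. The base URL and helper names are illustrative, not part of any library:

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_instant_query_url(base_url: str, promql: str, at_unix: int) -> str:
    """Build a /api/v1/query URL evaluated at a historical instant."""
    params = urlencode({"query": promql, "time": at_unix})
    return f"{base_url}/api/v1/query?{params}"


def query_at(base_url: str, promql: str, at_unix: int) -> list:
    """Run the instant query and return the result vector (needs a live server)."""
    with urlopen(build_instant_query_url(base_url, promql, at_unix)) as resp:
        return json.load(resp)["data"]["result"]


# Example: what was this pod's start time at a past instant?
url = build_instant_query_url(
    "http://prometheus.example:9090",  # hypothetical Prometheus endpoint
    'kube_pod_start_time{namespace="my-ns", pod="my-pod-abc123"}',
    1700000000,  # the historical Unix timestamp to evaluate at
)
```

The key point is the time parameter: the same PromQL expression returns the deleted pod’s data when evaluated at a timestamp while the pod was alive.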

2. Pod End Time (Inferred)

There is no kube_pod_end_time. We infer it using the kube_pod_status_phase metric. A pod has finished when its phase becomes Succeeded or Failed, and it is gone entirely once the object is deleted and its series stop.

Query: Detect the end time of a pod (when it enters Succeeded or Failed).

# This gauge is 1 while the pod is in the given phase.
# Watch for the step at which 'Succeeded' (or 'Failed') rises to 1,
# or at which 'Running' drops away.
kube_pod_status_phase{phase="Succeeded", namespace="my-ns", pod="my-pod-abc123"}
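To turn that idea into a concrete timestamp, scan the query_range samples for the phase series and take the first step at which the gauge becomes 1. A sketch with synthetic data (the helper name is illustrative; the API returns sample values as strings, which the code mirrors):

```python
from typing import Optional


def first_transition_to_one(samples: list) -> Optional[float]:
    """Return the timestamp of the first sample where the phase gauge is 1.

    `samples` mimics the `values` field of a query_range matrix result:
    a list of [unix_ts, value_as_string] pairs.
    """
    for ts, value in samples:
        if float(value) == 1.0:
            return ts
    return None


# Synthetic samples for kube_pod_status_phase{phase="Succeeded"}:
samples = [
    [1700000000, "0"],
    [1700000030, "0"],
    [1700000060, "1"],  # pod entered Succeeded at this step
    [1700000090, "1"],
]
end_time = first_transition_to_one(samples)  # 1700000060
```

The resolution of the inferred end time is bounded by the scrape/step interval, so treat it as approximate.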

Standard Pattern for “Last Seen”:
To find when a pod last existed (using kube_pod_info which exists as long as the pod object exists):

# timestamp() returns the Unix timestamp of each sample.
# group() collapses the series into a single existence indicator (1 if present).
# max_over_time over a subquery captures the last step at which the series
# existed within the window (here: the last 24h, at 1m resolution).
max_over_time(timestamp(group(kube_pod_info{namespace="my-ns", pod="my-pod-abc123"}))[24h:1m])

3. CPU/Memory Requests (Historical)

kube_pod_container_resource_requests is a gauge that vanishes when the pod is deleted. To get the request value for a past pod, you must query at that specific past time. Alternatively, to view all pods that existed in the last hour (including deleted ones), run the expression as a range query (the query_range API endpoint), so each evaluation step resolves the pod label against the data that existed at that step.

Query: Get the sum of CPU requests for a specific pod at a historical timestamp t.

# Run as an instant query with the API's `time` parameter set to the
# historical instant of interest (e.g., one hour ago)
kube_pod_container_resource_requests{resource="cpu", unit="core", namespace="my-ns", pod="my-pod-abc123"}

To reconstruct a timeline of all pods that ran in the last hour, run the same expression as a range query (query_range) over that window:

# Evaluated as a range query over the last hour: pods that existed at a
# given step appear at that step; deleted pods simply drop out of later steps.
sum by (namespace, pod) (
  kube_pod_container_resource_requests{resource="cpu", unit="core"}
)
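From the query_range response, each pod’s first and last sample timestamps bound its lifetime within the window, and the sample value gives its request. A sketch over a synthetic matrix result (the field names mirror the Prometheus HTTP API response shape; the helper name is illustrative):

```python
def pod_lifetimes(matrix: list) -> dict:
    """Map each pod to (first_seen, last_seen, request) from a query_range matrix.

    `matrix` mimics data.result from the API:
    [{"metric": {...labels...}, "values": [[unix_ts, "value"], ...]}, ...]
    """
    out = {}
    for series in matrix:
        pod = series["metric"]["pod"]
        values = series["values"]
        first_ts, last_ts = values[0][0], values[-1][0]
        out[pod] = (first_ts, last_ts, float(values[-1][1]))
    return out


# Two pods: one alive through the window, one deleted mid-window.
matrix = [
    {"metric": {"namespace": "my-ns", "pod": "web-abc"},
     "values": [[1700000000, "0.5"], [1700000060, "0.5"], [1700000120, "0.5"]]},
    {"metric": {"namespace": "my-ns", "pod": "job-xyz"},
     "values": [[1700000000, "1"], [1700000060, "1"]]},  # series stops: pod deleted
]
lifetimes = pod_lifetimes(matrix)
# lifetimes["job-xyz"] == (1700000000, 1700000060, 1.0)
```

The “last seen” value is again only as precise as the query step, and a pod that started before the window or outlived it will have its lifetime clipped to the window edges.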

How Senior Engineers Fix It

Senior engineers do not rely on real-time queries for historical data. They implement Recording Rules or Long-term Retention strategies.

  1. Recording Rules for Critical Metadata: Create recording rules that capture the start time and resource requests immediately upon pod creation, storing them in a long-lived metric.
    • Rule Example: record: pod_lifecycle_start_time queries kube_pod_start_time and persists the value.
  2. Correlating with kube_pod_info: This metric provides static labels such as created_by_kind and created_by_name. Senior engineers use these to group resources by owner (e.g., Deployment or StatefulSet) rather than relying on the ephemeral pod label, allowing them to aggregate data across pod restarts.
  3. Use of group() Function: To detect “end times,” they use group(kube_pod_info) over a range. The transition from 1 to 0 (series disappearance) indicates termination.
  4. Thanos or Long-Term Storage: For true historical reconstruction (months back), they export these specific metrics to Thanos or another long-term store, ensuring that raw pod metrics are kept for the required retention period. Default Prometheus retention is short (15 days out of the box), and high-cardinality pod-level series are often the first to be dropped by relabeling or allow-list rules when operators trim scrape volume.
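The recording-rule idea in step 1 can be sketched as a Prometheus rule group. The group, rule, and recorded metric names below are illustrative conventions, not a standard:

```yaml
groups:
  - name: pod-lifecycle
    interval: 1m
    rules:
      # Snapshot each pod's start time under a recorded series name,
      # so downstream stores (e.g., Thanos via remote write) can keep it
      # without carrying every raw KSM label.
      - record: pod_lifecycle:start_time
        expr: kube_pod_start_time
      # Persist per-pod CPU requests for later cost attribution.
      - record: pod_lifecycle:cpu_requests
        expr: sum by (namespace, pod) (kube_pod_container_resource_requests{resource="cpu"})
```

Note that recording rules do not extend local retention by themselves; their value here is producing stable, low-label series that are cheap to ship to long-term storage.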

Why Juniors Miss It

Junior engineers often treat Prometheus metrics like SQL tables with permanent rows.

  • Lack of Understanding of Metric Types: They often try to mathematically manipulate Gauges as if they are Counters. They might try to calculate rate(kube_pod_start_time[5m]), which yields nonsense because a pod start time is a discrete snapshot, not a rate of change.
  • Assumption of Persistence: They assume that because a query kube_pod_start_time{pod="xyz"} returns nothing today, the data never existed. They fail to grasp that they must query the database as it was at the time of the event.
  • Ignoring Label Volatility: Juniors often write queries based solely on the pod label (e.g., my-app-56489d-12345). When that pod dies, the query breaks. They miss the strategy of grouping by deployment or using uid to track the specific lifecycle of a single artifact, regardless of restarts.