WCET of Cortex‑M7 loops with cache and branch prediction

Summary

The goal is to estimate the Worst-Case Execution Time (WCET) of a bare-metal loop running on an ARM Cortex-M7 microcontroller. Even with no interrupts and no multitasking, the core's state-dependent hardware, specifically the Instruction Cache (I-Cache) and the dynamic branch predictor, makes traditional cycle counting inaccurate. The problem is compounded because commercial static analysis tools such as AbsInt aiT lack comprehensive support for the Cortex-M7 microarchitecture, so engineers have to bridge the gap between theoretical upper bounds and empirical measurements themselves.

Root Cause

The difficulty in achieving a precise WCET on high-performance microcontrollers stems from micro-architectural state variability:

Instruction Cache (I-Cache) Misses: A loop may execute quickly when the instructions are cached, but the “worst case” occurs when the cache is cold or when an instruction fetch triggers a line eviction, forcing a stall while fetching from slower Flash memory.
Branch Prediction Uncertainty: Static prediction behaves the same way on every run, but the dynamic branch predictor depends on branch history. A single misprediction flushes the pipeline, adding several cycles of latency that are hard to model mathematically without a cycle-accurate simulator.
Memory Latency and Contention: Even in single-threaded environments, the latency of the AXI/AHB bus matrix and the wait states of the internal Flash memory make fetch timing depend on the current state of the memory controller, so the same instruction can take a different number of cycles on different runs.
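
The three effects above can be folded into a deliberately pessimistic back-of-the-envelope bound. The following is a minimal sketch, not a substitute for static analysis: it assumes every branch mispredicts on every iteration, that the cache starts cold, and that the loop fits in the I-Cache so each line is refilled only once. The struct, the function name, and all parameters are hypothetical; real penalty values would have to come from the device documentation and the configured Flash wait states.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical timing parameters for one loop. */
typedef struct {
    uint32_t body_cycles;        /* loop body cycles when every fetch hits */
    uint32_t branches_per_iter;  /* branches executed per iteration */
    uint32_t mispredict_penalty; /* cycles lost per pipeline flush */
    uint32_t cache_lines;        /* I-Cache lines occupied by the loop code */
    uint32_t miss_penalty;       /* cycles to refill one line from Flash */
} loop_timing_t;

/* Pessimistic bound: every branch mispredicts on every iteration, and every
 * cache line misses exactly once (cold cache, no eviction inside the loop). */
uint64_t pessimistic_loop_bound(const loop_timing_t *t, uint32_t iterations)
{
    assert(t != NULL);
    uint64_t per_iter = (uint64_t)t->body_cycles
                      + (uint64_t)t->branches_per_iter * t->mispredict_penalty;
    uint64_t cold_fill = (uint64_t)t->cache_lines * t->miss_penalty;
    return (uint64_t)iterations * per_iter + cold_fill;
}
```

With illustrative values of a 10-cycle body, one branch per iteration, an 8-cycle flush penalty, and four lines at 20 cycles each, 100 iterations are bounded by 100 × 18 + 80 = 1880 cycles. If the loop does not fit in the cache, the single-refill assumption no longer holds and the bound is not safe.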

Why This Happens in Real Systems

In modern embedded systems, we have moved away from simple “deterministic” cores (like Cortex-M0) toward superscalar, deeply pipelined architectures (like Cortex-M7) to achieve higher throughput.

Performance vs. Determinism Trade-off: Features like caches and branch predictors are designed to improve average-case performance, which is diametrically opposed to the requirements of worst-case predictability.
Abstraction Leaks: High-level C code abstracts away the underlying hardware, but at the timing level, the hardware’s internal state (the “micro-state”) becomes the most critical variable.
Complexity of Silicon: Modern SoC designs are too complex for simple manual calculation; the interaction between the pipeline, the cache controller, and the flash controller creates a state space that is too large to exhaustively test.

Real-World Impact

Failure to estimate WCET accurately in hard real-time systems can lead to failures that range from degraded performance to physical damage:

Deadline Misses: In control loops (e.g., motor control or flight stabilization), a single overrun of the control period can lead to physical instability.
Jitter: Variable execution times introduce timing noise, which can degrade the performance of digital signal processing (DSP) algorithms.
Heisenbugs: Timing-related bugs may only appear under specific thermal conditions or memory states, making them nearly impossible to reproduce in a debugger.
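
To make the deadline criterion concrete, the sketch below converts a control period into a cycle budget and reports the remaining slack. The function names are hypothetical, and the 400 MHz clock and 50 µs period used in the note that follows are illustrative values, not figures from a specific system.

```c
#include <assert.h>
#include <stdint.h>

/* Cycles available in one control period: period [us] * core clock [MHz]. */
uint64_t cycle_budget(uint32_t period_us, uint32_t cpu_mhz)
{
    return (uint64_t)period_us * cpu_mhz;
}

/* Slack in cycles. A negative result means the deadline is missed. */
int64_t deadline_slack(uint64_t budget_cycles, uint64_t wcet_cycles)
{
    assert(budget_cycles <= (uint64_t)INT64_MAX);
    assert(wcet_cycles <= (uint64_t)INT64_MAX);
    return (int64_t)budget_cycles - (int64_t)wcet_cycles;
}
```

A 50 µs period at 400 MHz gives a budget of 20,000 cycles. A WCET of 18,500 cycles leaves 1,500 cycles of slack, while 20,001 cycles already misses the deadline by one cycle.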

Example or Code

To begin empirical estimation, we use the Data Watchpoint and Trace (DWT) unit inside the Cortex-M7 to get cycle-accurate measurements.

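#include "stm32h7xx.h"

// Placeholder: the original loop bound and body are not recoverable.
#define LOOP_ITERATIONS 1000

uint32_t measure_loop_cycles(void) {
    uint32_t start_cycles;
    uint32_t end_cycles;

    // Enable the DWT cycle counter (trace must be enabled first)
    CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;
    DWT->CYCCNT = 0;
    DWT->CTRL |= DWT_CTRL_CYCCNTENA_Msk;

    // Start measurement
    start_cycles = DWT->CYCCNT;

    // THE TARGET LOOP
    for (int i = 0; i < LOOP_ITERATIONS; i++) {
        __NOP();  // stand-in for the loop body under test
    }

    // Stop measurement
    end_cycles = DWT->CYCCNT;

    // Unsigned subtraction stays correct across one 32-bit counter wrap
    return end_cycles - start_cycles;
}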
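
Two details of the DWT measurement are easy to get wrong: counter wraparound and the cost of the instrumentation itself. The helpers below are a small sketch with hypothetical names. They rely only on the fact that CYCCNT is a 32-bit counter, so unsigned subtraction is modulo 2^32 and remains correct across a single wrap. The overhead value is assumed to be measured separately, for example from two back-to-back counter reads with nothing in between.

```c
#include <assert.h>
#include <stdint.h>

/* Elapsed cycles between two 32-bit counter snapshots. Unsigned arithmetic
 * wraps modulo 2^32, so this is correct as long as the counter wrapped at
 * most once between the two reads. */
static inline uint32_t cycles_elapsed(uint32_t start, uint32_t end)
{
    return end - start;
}

/* Remove the separately measured instrumentation overhead from a raw
 * measurement. The overhead can never exceed the raw value. */
static inline uint32_t cycles_net(uint32_t raw, uint32_t overhead)
{
    assert(raw >= overhead);
    return raw - overhead;
}
```

At clock rates of a few hundred MHz a 32-bit cycle counter wraps within seconds, so longer measurements need an additional overflow count. For a WCET budget, leaving the overhead in the result is the conservative choice; subtracting it is mainly useful when comparing code variants.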

How Senior Engineers Fix It

Senior engineers do not rely on a single “magic number.” They use a hybrid approach combining empirical data with mathematical safety margins:

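High-Water Mark Profiling: Run the target loop millions of times under varied conditions (different temperatures, different memory states, cold and warm caches) using the DWT cycle counter to find the maximum observed cycle count. This maximum is only a lower bound on the true WCET, because a worse case may simply never have been triggered.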
Instruction-Level Tracing: Use an ETM (Embedded Trace Macrocell) via a high-speed probe (like a Segger J-Trace) to capture a full instruction trace. This allows you to see exactly where pipeline stalls and cache misses occurred.
Conservative Modeling: If a tool like aiT is unavailable, we apply a Safety Factor (SF). If the measured worst-case is $T_{max}$, the system is designed for $T_{design} = T_{max} \times SF$ (where $SF$ is typically 1.2 to 1.5).
Determinism by Design: If WCET is too volatile, a senior engineer may disable the I-Cache or move critical code to ITCM (Instruction Tightly Coupled Memory), which provides single-cycle, deterministic access regardless of the cache state.
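
The high-water mark and safety factor steps can be sketched as a pair of small helpers. All names here are hypothetical, and the safety factor is passed in percent (120 means 1.2) purely to keep the arithmetic in integers. On the target, each recorded sample would come from the DWT cycle counter, and the caller would force the state of interest before each run, for instance by invalidating the I-Cache with the CMSIS-Core function SCB_InvalidateICache().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Running high-water mark over many measured runs. */
typedef struct {
    uint32_t worst_cycles;  /* largest sample recorded so far (T_max) */
    uint32_t samples;       /* number of runs recorded */
} hwm_t;

/* Fold one measured cycle count into the high-water mark. */
void hwm_record(hwm_t *h, uint32_t cycles)
{
    assert(h != NULL);
    if (cycles > h->worst_cycles) {
        h->worst_cycles = cycles;
    }
    h->samples++;
}

/* T_design = T_max * SF, with SF given in percent. The division rounds up
 * so the design budget is never understated. */
uint64_t design_budget(uint32_t t_max, uint32_t sf_percent)
{
    assert(sf_percent >= 100u);
    return ((uint64_t)t_max * sf_percent + 99u) / 100u;
}
```

For example, if the recorded samples are 950, 1210, and 980 cycles, the high-water mark is 1210, and a safety factor of 1.2 gives a design budget of 1452 cycles. The sample values are made up for illustration.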

Why Juniors Miss It

Junior engineers often fall into these traps:

The “Average Case” Fallacy: They run a loop 10 times, take the average, and assume that is the execution time. They fail to realize that the average is useless for real-time guarantees.
Ignoring the Hardware: They assume that C code translates directly to a fixed number of cycles, ignoring that a single if statement can trigger a branch misprediction penalty.
Over-reliance on Oscilloscopes: They try to measure timing using GPIO toggling and an oscilloscope. While useful, this method has instrumentation overhead that can itself alter the very timing they are trying to measure.
Neglecting Memory Hierarchy: They treat “Memory” as a single entity, failing to account for the massive latency difference between TCM, Cache, and Flash.
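
The "average case" trap is easy to demonstrate with a few lines of code. The sketch below, using hypothetical names, summarizes a set of cycle measurements by mean and by maximum so the two can be compared side by side.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t mean;  /* arithmetic mean, rounded down */
    uint32_t max;   /* worst observed sample */
} timing_summary_t;

/* Compute mean and maximum of n cycle-count samples (n must be > 0). */
timing_summary_t summarize(const uint32_t *samples, size_t n)
{
    assert(samples != NULL && n > 0);
    uint64_t sum = 0;
    uint32_t max = 0;
    for (size_t i = 0; i < n; i++) {
        sum += samples[i];
        if (samples[i] > max) {
            max = samples[i];
        }
    }
    timing_summary_t s = { (uint32_t)(sum / n), max };
    return s;
}
```

With made-up data of nine runs at 100 cycles and one run at 400 cycles (say, a cold-cache iteration), the mean is 130 cycles while the maximum is 400. A budget sized from the mean would be overrun by the one run that matters.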