WCET of Cortex‑M7 loops with cache and branch prediction

Summary

The goal is to estimate the Worst-Case Execution Time (WCET) of a bare-metal loop running on an ARM Cortex-M7 microcontroller. Even without interrupts or multitasking, the pipeline's timing depends on hardware state, specifically the Instruction Cache (I-Cache) and the dynamic branch predictor, so traditional cycle counting is inaccurate. The challenge is that commercial static analysis tools like AbsInt aiT lack comprehensive support for the Cortex-M7 microarchitecture, which forces engineers to bridge the gap between theoretical upper bounds and empirical measurements.

Root Cause

The difficulty in achieving a precise WCET on high-performance microcontrollers stems from micro-architectural state variability:

  • Instruction Cache (I-Cache) Misses: A loop may execute quickly when the instructions are cached, but the “worst case” occurs when the cache is cold or when an instruction fetch triggers a line eviction, forcing a stall while fetching from slower Flash memory.
  • Branch Prediction Uncertainty: Static prediction follows fixed rules, but the dynamic branch predictor relies on execution history. A single misprediction causes a pipeline flush, adding several cycles of latency that are difficult to model mathematically without a cycle-accurate simulator.
  • Memory Latency and Contention: Even in single-threaded environments, the latency of the AXI/AHB bus matrix and the wait states of the internal Flash memory create non-deterministic timing based on the state of the memory controller.
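
One way to reason about these three sources together is a simplified additive bound. This is only a sketch: the symbols are illustrative rather than taken from any Cortex-M7 datasheet, and the model assumes the penalties add up independently, which a superscalar pipeline does not guarantee.

$$
T_{WCET} \le T_{base} + N_{miss} \times P_{miss} + N_{mispred} \times P_{flush} + T_{bus}
$$

Here $T_{base}$ is the cycle count when every fetch hits and every branch is predicted correctly, $N_{miss}$ and $N_{mispred}$ are the maximum numbers of I-Cache misses and branch mispredictions on the path, $P_{miss}$ and $P_{flush}$ are the worst-case penalties for each, and $T_{bus}$ covers any remaining bus and Flash wait-state delay. Each term on the right still has to be bounded by measurement or analysis.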

Why This Happens in Real Systems

In modern embedded systems, we have moved away from simple “deterministic” cores (like Cortex-M0) toward superscalar, deeply pipelined architectures (like Cortex-M7) to achieve higher throughput.

  • Performance vs. Determinism Trade-off: Features like caches and branch predictors are designed to improve average-case performance, which is diametrically opposed to the requirements of worst-case predictability.
  • Abstraction Leaks: High-level C code abstracts away the underlying hardware, but at the timing level, the hardware’s internal state (the “micro-state”) becomes the most critical variable.
  • Complexity of Silicon: Modern SoC designs are too complex for simple manual calculation; the interaction between the pipeline, the cache controller, and the flash controller creates a state space that is too large to exhaustively test.

Real-World Impact

Failure to accurately estimate WCET in hard real-time systems can lead to serious failures:

  • Deadline Misses: In control loops (e.g., motor control or flight stabilization), a single overrun of the control period can lead to physical instability.
  • Jitter: Variable execution times introduce timing noise, which can degrade the performance of digital signal processing (DSP) algorithms.
  • Heisenbugs: Timing-related bugs may only appear under specific thermal conditions or memory states, making them nearly impossible to reproduce in a debugger.
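
To make the deadline-miss case concrete, the cycle budget of a control loop follows directly from the core clock and the loop rate. The sketch below assumes hypothetical helper names (`cycle_budget`, `slack_cycles`); the clock and loop frequencies are parameters you would take from your own system.

```c
#include <stdint.h>

/* Cycles available in one control period (rounded down).
   core_hz and loop_hz are assumed inputs; loop_hz must be non-zero. */
static inline uint32_t cycle_budget(uint32_t core_hz, uint32_t loop_hz) {
    return core_hz / loop_hz;
}

/* Positive result: headroom in cycles.
   Negative result: the deadline is missed by that many cycles. */
static inline int64_t slack_cycles(uint32_t budget, uint32_t wcet_cycles) {
    return (int64_t)budget - (int64_t)wcet_cycles;
}
```

For example, a 480 MHz core running a 20 kHz loop has 24000 cycles per period, so a worst case of 24500 cycles overruns the period by 500 cycles.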

Example or Code

To begin empirical estimation, we use the Data Watchpoint and Trace (DWT) unit inside the Cortex-M7 to get cycle-accurate measurements.

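#include <stdint.h>
#include "stm32h7xx.h"

// Placeholder: the original loop bound and body were lost from this listing.
#define LOOP_ITERATIONS 1000u

uint32_t measure_loop_cycles(void) {
    uint32_t start_cycles;
    uint32_t end_cycles;

    // Enable DWT Cycle Counter
    // (some Cortex-M7 parts require unlocking the DWT via DWT->LAR first)
    CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;
    DWT->CYCCNT = 0;
    DWT->CTRL |= DWT_CTRL_CYCCNTENA_Msk;

    // Start measurement
    start_cycles = DWT->CYCCNT;

    // THE TARGET LOOP
    for (uint32_t i = 0; i < LOOP_ITERATIONS; i++) {
        __NOP();  // stand-in for the code under test
    }

    // Stop measurement
    end_cycles = DWT->CYCCNT;

    // Return the result so the compiler cannot discard the measurement
    return end_cycles - start_cycles;
}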
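
The subtraction of two CYCCNT snapshots deserves a closer look, because the counter is only 32 bits wide. The helpers below are a minimal sketch with hypothetical names (`cycles_elapsed`, `cycles_to_ns`); `core_hz` is an assumed parameter that you would set to your actual core clock (for example `SystemCoreClock` in CMSIS projects).

```c
#include <stdint.h>

/* Elapsed cycles between two CYCCNT snapshots. Unsigned arithmetic wraps
   modulo 2^32, so the result is valid as long as fewer than 2^32 cycles
   elapsed between the two reads. */
static inline uint32_t cycles_elapsed(uint32_t start, uint32_t end) {
    return end - start;
}

/* Convert a cycle count to nanoseconds (rounded down).
   core_hz is the core clock in Hz and must be non-zero. */
static inline uint64_t cycles_to_ns(uint32_t cycles, uint32_t core_hz) {
    return ((uint64_t)cycles * 1000000000ull) / core_hz;
}
```

Assuming a 480 MHz core clock, the counter wraps roughly every 8.9 seconds, so a single measurement window should stay well below that.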

How Senior Engineers Fix It

Senior engineers do not rely on a single “magic number.” They use a hybrid approach combining empirical data with mathematical safety margins:

  • High-Water Mark Profiling: Run the target loop millions of times under varied conditions (different temperatures, different memory states) and use the DWT cycle counter to record the maximum observed cycle count.
  • Instruction-Level Tracing: Use an ETM (Embedded Trace Macrocell) via a high-speed probe (like a Segger J-Trace) to capture a full instruction trace. This allows you to see exactly where pipeline stalls and cache misses occurred.
  • Conservative Modeling: If a tool like aiT is unavailable, we apply a Safety Factor (SF). If the measured worst-case is $T_{max}$, the system is designed for $T_{design} = T_{max} \times SF$ (where $SF$ is typically 1.2 to 1.5).
  • Determinism by Design: If WCET is too volatile, a senior engineer may disable the I-Cache or move critical code to ITCM (Instruction Tightly Coupled Memory), which provides single-cycle, deterministic access regardless of the cache state.
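
The high-water mark and the safety factor can be combined in a few lines. This is a sketch under stated assumptions, not a definitive implementation: `hwm_t`, `hwm_update` and `design_budget` are hypothetical names, and the safety factor is passed as an integer percentage (120 for 1.2) to avoid floating point on the target.

```c
#include <stdint.h>

/* Running high-water mark over repeated measurements (hypothetical helper). */
typedef struct {
    uint32_t max_cycles;  /* largest sample seen so far */
    uint32_t samples;     /* number of samples recorded */
} hwm_t;

/* Record one measured cycle count. */
static inline void hwm_update(hwm_t *h, uint32_t cycles) {
    if (cycles > h->max_cycles) {
        h->max_cycles = cycles;
    }
    h->samples++;
}

/* T_design = T_max * SF, with SF in percent (120 means 1.2).
   Rounds up so the budget is never understated. */
static inline uint32_t design_budget(uint32_t t_max, uint32_t sf_percent) {
    uint64_t scaled = (uint64_t)t_max * sf_percent;
    return (uint32_t)((scaled + 99u) / 100u);
}
```

On the target, each run would feed its measured cycle delta into `hwm_update`, and `design_budget(h.max_cycles, 120)` would give the figure to design against. The observed maximum is still only a lower bound on the true WCET, which is why the margin is applied.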

Why Juniors Miss It

Junior engineers often fall into these traps:

  • The “Average Case” Fallacy: They run a loop 10 times, take the average, and assume that is the execution time. They fail to realize that the average is useless for real-time guarantees.
  • Ignoring the Hardware: They assume that C code translates directly to a fixed number of cycles, ignoring that a single if statement can trigger a branch misprediction penalty.
  • Over-reliance on Oscilloscopes: They try to measure timing using GPIO toggling and an oscilloscope. While useful, this method has instrumentation overhead that can itself alter the very timing they are trying to measure.
  • Neglecting Memory Hierarchy: They treat “Memory” as a single entity, failing to account for the massive latency difference between TCM, Cache, and Flash.
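
The “Average Case” fallacy is easy to see with numbers. The sketch below uses a hypothetical helper, `summarize`, that reports both the mean and the maximum of a set of cycle samples; the sample values in the note after it are invented for illustration.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t mean;  /* arithmetic mean, rounded down */
    uint32_t max;   /* largest sample */
} timing_summary_t;

/* Mean and maximum of n cycle samples. Returns zeros when n is 0. */
static inline timing_summary_t summarize(const uint32_t *samples, size_t n) {
    timing_summary_t result = {0u, 0u};
    uint64_t sum = 0;
    if (n == 0) {
        return result;
    }
    for (size_t i = 0; i < n; i++) {
        sum += samples[i];
        if (samples[i] > result.max) {
            result.max = samples[i];
        }
    }
    result.mean = (uint32_t)(sum / n);
    return result;
}
```

With nine runs of 100 cycles and one cold-cache run of 1000 cycles, the mean is 190 while the maximum is 1000. A budget sized from the mean would be exceeded more than five times over by the slow run.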
