Summary
During a deep-dive debugging session on an STM32H7 (Cortex-M7) bare-metal application, a developer noticed unexpected behavior in the generated assembly for a simple wrapper function. Specifically, the compiler pushed and popped register R3, even though the AAPCS (Procedure Call Standard for the ARM Architecture) defines R3 as a caller-saved (scratch) register. The developer assumed that because my_function does not use R3 itself, the compiler should omit the push/pop pair and save the cycles.
Root Cause
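The root cause is not a compiler bug. It follows from how the compiler's prologue/epilogue generation interacts with the ABI (Application Binary Interface): the AAPCS says which registers a callee must preserve and requires the stack pointer to be 8-byte aligned at public interfaces, but it does not forbid a function from pushing additional registers.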
- Register Pressure Awareness: my_function doesn’t reference R3 in the C source, but registers are allocated in the compiler’s backend after the source has been lowered. Which registers appear in the prologue is decided by the generated instruction stream, not by anything visible in the C code.
- Instruction Reordering: At -O1 the compiler schedules instructions to reduce pipeline stalls, but scheduling alone does not explain the push. The more likely reason R3 shows up in the prologue/epilogue is stack alignment: pushing LR by itself moves SP by 4 bytes, and adding one more register brings the adjustment to 8.
- The Calling Convention Constraint: The AAPCS does not make the callee responsible for preserving R3. However, the compiler is free to push any register it likes, whether as alignment padding or as a scratch slot, as long as the callee-saved registers hold their original values on return and SP is correctly aligned when the bl executes.
Why This Happens in Real Systems
In complex embedded systems, “simple” functions rarely exist in isolation.
- Inlining and Interprocedural Analysis: If the compiler can see the body of my_sub_function or the code surrounding my_function, it may inline or otherwise optimize across the call, which changes what the prologue has to save. Note that a value that must survive the bl (Branch with Link) instruction is normally kept in a callee-saved register (R4-R11), which then has to be pushed; R3 is not preserved across a call, so pushing it in the prologue cannot serve that purpose.
- Stack Alignment: The AAPCS requires SP to be 8-byte aligned at every public interface, which in practice means at every call to an externally visible function. Doubleword accesses such as double-precision floating point or LDRD/STRD rely on that alignment. Pushing an extra register like R3 is a “cheap” way for the compiler to satisfy the requirement without inserting an explicit sub sp, #4 and a matching add sp, #4.
- Optimization Levels: At -O1, the compiler is “cautiously efficient.” It balances code size and speed, and several transformations that would reshape this frame are only enabled at higher levels. In GCC, for example, sibling-call (tail-call) optimization is enabled from -O2, at which point a pure wrapper may shrink to a single branch with no push at all.
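The alignment arithmetic behind that extra register can be sketched in a few lines of C. This is an illustrative host-side calculation, not compiler code: the helper names are hypothetical, and it assumes SP is already 8-byte aligned on entry and that the function allocates no other stack space.

```c
#include <stdint.h>

/* Bytes occupied by pushing reg_count 32-bit core registers. */
static uint32_t push_bytes(uint32_t reg_count)
{
    return reg_count * 4u;
}

/* Extra registers needed so the push is a multiple of 8 bytes.
   Assumes SP was 8-byte aligned before the push. */
static uint32_t padding_regs(uint32_t reg_count)
{
    return (push_bytes(reg_count) % 8u) / 4u;
}

/* Total stack bytes consumed once the padding register is included. */
static uint32_t padded_push_bytes(uint32_t reg_count)
{
    return push_bytes(reg_count + padding_regs(reg_count));
}
```

For a wrapper that only needs to save LR, reg_count is 1, so one padding register is added and the frame grows from 4 to 8 bytes. A function that already saves an even number of registers needs no padding.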
Real-World Impact
- Increased Stack Usage: In deeply nested interrupt service routines (ISRs) or highly recursive functions, unnecessary pushes consume SRAM, which is often at a premium in STM32 microcontrollers.
- Latency Spikes: Every additional register in a push or pop adds one more store or load per call. In hard real-time systems (e.g., motor control or high-speed signal processing), these “ghost” cycles can push a task past its deadline if the function sits on a hot path.
- Debugging Confusion: Engineers looking at assembly code may misinterpret the state of the processor, assuming a register contains “garbage” when it was actually preserved by the compiler, or assuming a pushed register carries a meaningful value when its slot is only padding.
Example or Code
my_function:
    push {r3, lr}        @ Save LR; R3 is pushed too, most likely as padding so the frame is 8 bytes
    bl   my_sub_function @ Branch with Link to the sub-function (SP is 8-byte aligned here)
    pop  {r3, pc}        @ Drop the padding slot into R3 and return by loading the saved LR into PC
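The original C source is not shown, so the following is only a guess at its shape: a pure wrapper whose sole job is to call another function. The body of my_sub_function is a hypothetical stand-in, and the noinline attribute (a GCC/Clang extension) is there only so the call stays visible when both functions share a translation unit.

```c
/* Hypothetical stand-in state so the call has an observable effect. */
static int sub_calls = 0;

/* Stand-in for the real callee; noinline keeps the bl in the output. */
__attribute__((noinline)) void my_sub_function(void)
{
    sub_calls++;
}

/* Pure wrapper: it must save LR because the bl overwrites it. */
void my_function(void)
{
    my_sub_function();
}
```

Built with GCC for Thumb at -O1, a wrapper of this shape typically yields a push {r3, lr} / bl / pop {r3, pc} sequence like the listing above. At -O2, where GCC enables sibling-call optimization, it will often collapse to a plain branch to my_sub_function.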
How Senior Engineers Fix It
A senior engineer does not fight the compiler; they provide the compiler with better information.
- Use static inline: If my_function is a simple wrapper, marking it static inline in the header file allows the compiler to eliminate the function call entirely, removing the prologue/epilogue overhead.
- Analyze with -fopt-info: Use compiler flags to see why the compiler is making specific decisions (e.g., why it decided to inline or why it failed to optimize a loop). These reports cover optimization passes rather than prologue layout, so for stack questions pair them with the generated assembly or with GCC’s -fstack-usage output.
- Trust the ABI: If the code is correct, a senior engineer accepts that the compiler has a “global” view of the machine state that the human developer lacks.
- Profile, Don’t Guess: Instead of looking at a single assembly snippet, use a Hardware Trace (ETM/ITM) or a logic analyzer to verify if the stack operations are actually causing timing violations.
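The static inline suggestion can be sketched as below. The signatures and the caller function are hypothetical, since the source does not show what my_function takes or returns. Both definitions would live in a shared header so that every caller can see their bodies.

```c
#include <stdint.h>

/* Hypothetical callee, visible to the compiler at every call site. */
static inline uint32_t my_sub_function(uint32_t x)
{
    return x + 1u;
}

/* The wrapper itself; with the body visible it can be folded into callers. */
static inline uint32_t my_function(uint32_t x)
{
    return my_sub_function(x);
}

/* Example call site: once inlined, no bl and no wrapper frame remain. */
uint32_t caller(uint32_t x)
{
    return my_function(x) * 2u;
}
```

Keep in mind that inline is a hint: at low optimization levels the compiler may still emit a call, so it is worth checking the disassembly after making the change.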
Why Juniors Miss It
- Micro-Focus: Juniors often look at a single function in a vacuum, whereas compilers look at the Control Flow Graph (CFG) of the entire translation unit.
- Literal Interpretation of Docs: A junior reads “R3 is a scratch register” and assumes “The compiler cannot save R3.” They fail to realize that the ABI defines what a function must do, not what the compiler is allowed to do for optimization.
- Ignoring Alignment: They often overlook the AAPCS requirement that the stack be 8-byte aligned at function call boundaries, assuming all stack operations are purely about data preservation.