Does float16 Round After Each Operation or Only at the End? A Code Investigation

Summary

During a high-precision training run for a deep learning model, we observed unexpected divergence in gradient calculations. The investigation revealed a fundamental misunderstanding of IEEE 754 floating-point arithmetic behavior in hardware. The core question was whether a composite expression like (x1 + x2) * x3 - x4 is evaluated with infinite precision and rounded once at the end, or if intermediate rounding occurs after every discrete operation. Our findings confirm that in standard hardware execution, rounding occurs after every single arithmetic operation.

Root Cause

The root cause is the nature of Register-Level Execution. CPUs and GPUs do not possess “infinite precision” buffers for standard arithmetic instructions.

Discrete Operations: Every operator (+, -, *, /) triggers a discrete instruction in the ALU (Arithmetic Logic Unit).
Immediate Quantization: Once an operation is completed, the result must be stored back into a register or memory. If the result exceeds the mantissa (significand) bits of the target format (in this case, float16), the value is rounded immediately to fit.
Error Accumulation: Each rounding step introduces a quantization error. In a long chain of operations, these errors compound, leading to significant drift from the theoretical mathematical result.

Why This Happens in Real Systems

In production environments, we prioritize throughput and hardware alignment over theoretical mathematical exactness.

Hardware Constraints: Implementing “infinite precision” intermediate steps would require massive silicon area and increase latency exponentially.
IEEE 754 Compliance: Most hardware is strictly designed to follow the IEEE 754 standard, which mandates that the result of an operation be the nearest representable number to the infinitely precise result.
SIMD Optimizations: Modern vector processors (like NVIDIA Tensor Cores) are optimized to perform fast, low-precision math. They are hard-wired to perform the operation and immediately truncate/round to maintain the speed required for deep learning.

Real-World Impact

Gradient Vanishing/Exploding: In deep networks, small rounding errors in float16 can compound across hundreds of layers, causing weights to become NaN or zero.
Non-Deterministic Training: Small differences in how a compiler reorders operations (e.g., changing (a + b) + c to a + (b + c)) can lead to different floating-point results, making debugging nearly impossible.
Numerical Instability: Critical algorithms, such as Softmax or Layer Normalization, can fail if the intermediate sums exceed the narrow dynamic range of float16.

Example or Code

import numpy as np

def investigate_rounding():
    # We choose values that create a precise sum in float32 
    # but cause rounding errors in float16
    x1 = np.float16(1.0)
    x2 = np.float16(1.0)
    x3 = np.float16(1.0)
    x4 = np.float16(0.0001) # Very small value relative to the others

    # Method 1: Emulate intermediate rounding (Standard Hardware Behavior)
    # (x1 + x2) -> rounded to float16, then * x3 -> rounded, then - x4 -> rounded
    step1 = x1 + x2
    step2 = step1 * x3
    result_hardware = step2 - x4

    # Method 2: Emulate infinite precision (Mathematical Ideal)
    # Calculate everything in float64 and then round only once at the end
    x1_64 = np.float64(x1)
    x2_64 = np.float64(x2)
    x3_64 = np.float64(x3)
    x4_64 = np.float64(x4)

    result_ideal = np.float16((x1_64 + x2_64) * x3_64 - x4_64)

    print(f"Hardware Result (Intermediate Rounding): {result_hardware}")
    print(f"Ideal Result (Single Rounding):         {result_ideal}")
    print(f"Difference:                             {np.float64(result_hardware) - np.float64(result_ideal)}")

if __name__ == "__main__":
    investigate_rounding()

How Senior Engineers Fix It

Mixed Precision Training: We use float16 for the heavy lifting (convolutions/matrix multiplications) but keep the Master Weights and Loss Scaling in float32 to prevent precision loss.
Kahan Summation: When calculating large sums, we implement algorithms like Kahan Summation to track the lost low-order bits in a separate compensation variable.
Operation Reordering: We structure mathematical expressions to add smaller numbers together before adding them to large numbers, minimizing the impact of absorption.
Epsilon Injection: We add small constant values (e.g., 1e-7) to denominators to prevent division-by-zero caused by underflow.

Why Juniors Miss It

Mathematical Idealism: Juniors often treat code as pure math. They assume (a + b) * c is identical to a*c + b*c, forgetting that floating-point math is not associative.
Abstraction Over-reliance: They assume that if a library (like PyTorch or NumPy) says it supports float16, it will be “accurate enough,” without realizing that the error budget is consumed rapidly.
Lack of Hardware Awareness: They treat the computer as a logical machine rather than a physical device with limited bit-width and fixed-point logic constraints.