Summary
During a high-precision training run for a deep learning model, we observed unexpected divergence in gradient calculations. The investigation revealed a fundamental misunderstanding of IEEE 754 floating-point arithmetic behavior in hardware. The core question was whether a composite expression like (x1 + x2) * x3 - x4 is evaluated with infinite precision and rounded once at the end, or if intermediate rounding occurs after every discrete operation. Our findings confirm that in standard hardware execution, rounding occurs after every single arithmetic operation.
Root Cause
The root cause is the nature of Register-Level Execution. CPUs and GPUs do not possess “infinite precision” buffers for standard arithmetic instructions.
- Discrete Operations: Every operator (
+,-,*,/) triggers a discrete instruction in the ALU (Arithmetic Logic Unit). - Immediate Quantization: Once an operation is completed, the result must be stored back into a register or memory. If the result exceeds the mantissa (significand) bits of the target format (in this case,
float16), the value is rounded immediately to fit. - Error Accumulation: Each rounding step introduces a quantization error. In a long chain of operations, these errors compound, leading to significant drift from the theoretical mathematical result.
Why This Happens in Real Systems
In production environments, we prioritize throughput and hardware alignment over theoretical mathematical exactness.
- Hardware Constraints: Implementing “infinite precision” intermediate steps would require massive silicon area and increase latency exponentially.
- IEEE 754 Compliance: Most hardware is strictly designed to follow the IEEE 754 standard, which mandates that the result of an operation be the nearest representable number to the infinitely precise result.
- SIMD Optimizations: Modern vector processors (like NVIDIA Tensor Cores) are optimized to perform fast, low-precision math. They are hard-wired to perform the operation and immediately truncate/round to maintain the speed required for deep learning.
Real-World Impact
- Gradient Vanishing/Exploding: In deep networks, small rounding errors in
float16can compound across hundreds of layers, causing weights to becomeNaNor zero. - Non-Deterministic Training: Small differences in how a compiler reorders operations (e.g., changing
(a + b) + ctoa + (b + c)) can lead to different floating-point results, making debugging nearly impossible. - Numerical Instability: Critical algorithms, such as Softmax or Layer Normalization, can fail if the intermediate sums exceed the narrow dynamic range of
float16.
Example or Code
import numpy as np
def investigate_rounding():
# We choose values that create a precise sum in float32
# but cause rounding errors in float16
x1 = np.float16(1.0)
x2 = np.float16(1.0)
x3 = np.float16(1.0)
x4 = np.float16(0.0001) # Very small value relative to the others
# Method 1: Emulate intermediate rounding (Standard Hardware Behavior)
# (x1 + x2) -> rounded to float16, then * x3 -> rounded, then - x4 -> rounded
step1 = x1 + x2
step2 = step1 * x3
result_hardware = step2 - x4
# Method 2: Emulate infinite precision (Mathematical Ideal)
# Calculate everything in float64 and then round only once at the end
x1_64 = np.float64(x1)
x2_64 = np.float64(x2)
x3_64 = np.float64(x3)
x4_64 = np.float64(x4)
result_ideal = np.float16((x1_64 + x2_64) * x3_64 - x4_64)
print(f"Hardware Result (Intermediate Rounding): {result_hardware}")
print(f"Ideal Result (Single Rounding): {result_ideal}")
print(f"Difference: {np.float64(result_hardware) - np.float64(result_ideal)}")
if __name__ == "__main__":
investigate_rounding()
How Senior Engineers Fix It
- Mixed Precision Training: We use
float16for the heavy lifting (convolutions/matrix multiplications) but keep the Master Weights and Loss Scaling infloat32to prevent precision loss. - Kahan Summation: When calculating large sums, we implement algorithms like Kahan Summation to track the lost low-order bits in a separate compensation variable.
- Operation Reordering: We structure mathematical expressions to add smaller numbers together before adding them to large numbers, minimizing the impact of absorption.
- Epsilon Injection: We add small constant values (e.g.,
1e-7) to denominators to prevent division-by-zero caused by underflow.
Why Juniors Miss It
- Mathematical Idealism: Juniors often treat code as pure math. They assume
(a + b) * cis identical toa*c + b*c, forgetting that floating-point math is not associative. - Abstraction Over-reliance: They assume that if a library (like PyTorch or NumPy) says it supports
float16, it will be “accurate enough,” without realizing that the error budget is consumed rapidly. - Lack of Hardware Awareness: They treat the computer as a logical machine rather than a physical device with limited bit-width and fixed-point logic constraints.