BCH-ECC algorithm

Summary

The core issue is a mismatch between the error locator polynomial roots and the actual byte/bit positions in the message buffer. The BCH decoder correctly identifies that errors occurred and how many, but the final step of applying corrections maps the calculated syndrome roots to the wrong indices in memory. This results in catastrophic bit flips in adjacent bytes rather than repairing the intended corrupted bits. The root cause is almost certainly a sign error in the logarithm-to-position mapping or an off-by-one indexing error in the Galois Field arithmetic lookup tables.

Root Cause

BCH decoding involves three main stages: syndrome calculation, finding the error locator polynomial (using algorithms like Berlekamp-Massey or Euclid), and finding the roots of that polynomial (using Chien Search).

The failure is happening in the final step. The Chien Search iterates through the possible error locations (usually represented as powers of the primitive element $\alpha$). When a root is found, the code converts that power (exponent) back into an array index (a byte offset and bit mask).

The specific failure modes causing “adjacent byte flipping” are:

Index Calculation Error: The formula index = (N - 1) - (root_position) might be inverted, causing the correction to apply to the mirror location in the buffer.
Log/Exp Table Misalignment: If the Galois Field log and antilog tables are not generated starting at index 0, or if the primitive polynomial degree is mismatched, the calculated “locators” will point to the wrong memory addresses.
Bitmask Misalignment: Correcting a bit at position i within a byte requires 1 << i. If the bit position is calculated modulo 8 incorrectly (e.g., using root % 8 instead of (root % 8) with specific endian handling), the mask shifts left or right, flipping the neighbor bits.

Why This Happens in Real Systems

This is a classic “Happy Path” vs. “Boundary Condition” failure. In error correction coding:

Abstract vs. Physical: The math operates in a Galois Field (abstract numbers), but the data lives in a physical buffer (bytes/words). The bridge between them is fragile.
Lack of Unit Tests: Engineers often test the mathematical integrity (does the polynomial math work?) but fail to test the serialization/deserialization layer (do the calculated roots map exactly to the uint8_t array indices?).
Off-by-One Nuance: BCH algorithms are often defined with indexing starting at 1 or 0 (e.g., $\alpha^0$ or $\alpha^1$). A C implementation often uses 0-based indexing. If the math assumes 1-based and the code applies 0-based subtraction without adjustment, the correction drifts by exactly one byte or bit.

Real-World Impact

Data Corruption: Instead of healing data, the decoder actively destroys it. If this is used in a file system or storage layer, file headers will be corrupted, rendering the data unreadable.
Cascade Failure: If the corrected data is fed into a subsequent stage (e.g., decrypted or parsed), the adjacent byte flips often cause syntax errors or massive cryptographic failures (avalanche effect).
Infinite Loops: In some cases, if the correction “unfixes” the syndromes, the decoder might try to run again on the corrupted data, finding new “errors” and never stabilizing.

Example or Code

The following Python snippet demonstrates the correct logic for mapping a Chien Search root back to a byte index and bit mask. This logic is often where the C implementation fails.

def apply_correction(message_buffer, roots, n_length):
    """
    Applies corrections to the message buffer based on found roots.
    roots: list of integers representing alpha^i powers found as roots.
    """
    # Iterate over the found error locators
    for root_power in roots:
        # 1. Map the polynomial root to a bit position
        # In standard BCH, if the root is alpha^i, the error is at bit (n_length - 1 - i)
        # Note: This depends heavily on the specific convention used in the generator!
        bit_index = n_length - 1 - root_power

        # 2. Calculate Byte Index
        byte_index = bit_index // 8

        # 3. Calculate Bit Mask
        # We must ensure we are operating on the correct bit inside the byte
        bit_offset = bit_index % 8
        mask = 1 << bit_offset

        # 4. Apply Correction (Flip the bit)
        if 0 <= byte_index < len(message_buffer):
            message_buffer[byte_index] ^= mask
        else:
            print(f"Error: Calculated index {byte_index} out of bounds!")

    return message_buffer

How Senior Engineers Fix It

To fix this robustly, a senior engineer does not just guess the indexing; they instrument the code to prove the logic:

Visualize the Syndrome Roots: Print the values of the roots found by the Chien Search. Do they look like 1, 2, 3 (linear) or 127, 126, 125 (reversed)? This tells you the direction of the search.
Inject Known Errors: Create a unit test where you manually flip bit 0 of byte 0. Run the decoder. If the decoder flips bit 7 of byte 1, you have a mathematical inversion.
Verify Primitive Polynomial: Ensure the primitive polynomial used to generate the GF tables matches exactly what the encoder used. A slight difference here shifts all calculated locators.
Isolate the Math: Separate the Chien Search loop from the memory modification. Log the calculated byte_index and bit_mask before applying the XOR.

Why Juniors Miss It

Assuming the Math Library is Right: Juniors often trust that if the syndrome calculation is correct, the rest of the algorithm follows. They miss that the mapping from math to memory is manual code.
Ignoring Bit Endianness: In C, bit indexing (LSB 0 vs MSB 0) is a common source of confusion. Flipping “bit 0” in a byte might mean the high bit or the low bit depending on the hardware and the definition of the algorithm.
Testing Only “Max Errors”: They test cases with the maximum number of correctable errors but forget to test single-bit errors at the very start and very end of the buffer. These boundary tests would immediately reveal if the index calculation is length - 1 - root or just root.