Summary
This incident report describes a neural network that stops learning after the first epoch, with accuracy stuck at ~9.8%, which is effectively random guessing on MNIST's ten classes. The core issue stems from pairing MSE with tanh, incorrect gradient flow, and weight‑update logic that silently zeroes out updates.
Root Cause
The failure to learn is caused by a combination of issues:
- Using MSE with tanh for classification, which produces extremely small gradients for most outputs
- Incorrect backpropagation math, especially mixing activated and non‑activated values
- Resetting weight update buffers incorrectly, causing updates to be overwritten
- Using float128 with numpy, which silently degrades performance and can break operations
- Target outputs of +1/–1, which sit in tanh's saturated regime, so gradients vanish exactly where the error is largest
- Batch update logic that multiplies accumulated gradients by the learning rate twice
The result is vanishing gradients and no effective parameter updates.
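The tanh + MSE failure mode can be seen numerically. A minimal sketch (NumPy, with illustrative values, not taken from the actual code) showing that a maximally wrong but saturated output still produces a near‑zero gradient:

```python
import numpy as np

# Gradient of MSE w.r.t. a pre-activation z feeding a tanh output:
#   dL/dz = (tanh(z) - target) * (1 - tanh(z)**2)
def mse_tanh_grad(z, target):
    a = np.tanh(z)
    return (a - target) * (1.0 - a ** 2)

# A confidently wrong unit: output saturated near -1 while the target is +1.
# The error is maximal, yet the gradient is ~ -0.00036 -- nearly zero.
print(mse_tanh_grad(-5.0, 1.0))
```

By contrast, with softmax + cross‑entropy the gradient with respect to each logit is simply `probability − target`, which stays large whenever the prediction is badly wrong.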
Why This Happens in Real Systems
Even experienced engineers run into this class of bug because:
- Activation + loss mismatch is a classic silent failure
- Gradient buffers accidentally reused or overwritten is common in custom frameworks
- Saturated activations (tanh/sigmoid) produce gradients near zero
- Batch update logic is easy to get subtly wrong
- Hand‑rolled backprop is extremely error‑prone
These issues rarely throw exceptions—they simply produce flat accuracy curves.
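To illustrate the "subtly wrong batch update" case, here is a hypothetical sketch of the double learning‑rate scaling bug (all names and values are illustrative, not taken from the actual code):

```python
learning_rate = 0.1
weight = 1.0
per_example_grads = [0.5, 0.3, 0.2]

# BUG: gradients are scaled by the learning rate while accumulating...
update = sum(learning_rate * g for g in per_example_grads)

# ...and again at apply time, so the effective step is lr**2 * sum(grads):
# a 0.01 step instead of the intended 0.1, with no error raised anywhere.
weight -= learning_rate * update
```

Training still "runs", the loss still moves slightly, and nothing throws, which is exactly why this class of bug produces flat accuracy curves instead of crashes.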
Real-World Impact
When this happens in production ML systems, the consequences include:
- Models that appear to train but never improve
- Wasted compute time
- Misleading metrics that hide underlying math errors
- Teams debugging symptoms instead of root causes
- Silent model failures that pass CI but fail in deployment
This is one of the most expensive classes of ML bugs because it looks like “normal training.”
Example
Below is a minimal example of the correct loss/activation pairing for MNIST classification:
```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(784, 128),
    nn.ReLU(),
    nn.Linear(128, 10),
)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
```
This avoids tanh saturation, avoids MSE, and uses the numerically stable softmax cross‑entropy built into CrossEntropyLoss, which expects raw logits rather than softmax outputs.
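For completeness, one training step with a model, loss, and optimizer set up as above (the batch here is dummy data; real code would load MNIST):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 10))
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(32, 784)          # 32 flattened 28x28 images (dummy batch)
y = torch.randint(0, 10, (32,))   # integer class labels, NOT one-hot vectors

optimizer.zero_grad()             # clear buffers; skipping this accumulates grads
loss = loss_fn(model(x), y)       # CrossEntropyLoss takes raw logits
loss.backward()
optimizer.step()
```

Note that CrossEntropyLoss takes integer class indices directly; there is no need for one‑hot or ±1 target encodings.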
How Senior Engineers Fix It
Experienced engineers approach this systematically:
- Switch to softmax + cross‑entropy, the correct loss for classification
- Replace tanh with ReLU, eliminating saturation
- Verify gradient flow layer by layer
- Check that weight update buffers are zeroed correctly
- Ensure batch updates are not scaled twice
- Unit‑test backprop on tiny networks to confirm gradients match numerical approximations
- Use standard frameworks (PyTorch, TensorFlow) unless custom backprop is absolutely required
The key is to eliminate entire classes of failure, not chase symptoms.
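The numerical gradient check mentioned above can be sketched for a one‑weight "network" (the function and values are illustrative): compare the analytic gradient against a central‑difference approximation and require them to agree to several decimal places.

```python
import numpy as np

# Scalar model: prediction = tanh(w * x), loss = half squared error.
def loss(w, x, t):
    return 0.5 * (np.tanh(w * x) - t) ** 2

# Hand-derived analytic gradient dL/dw.
def analytic_grad(w, x, t):
    a = np.tanh(w * x)
    return (a - t) * (1.0 - a ** 2) * x

# Central-difference approximation of dL/dw.
def numeric_grad(w, x, t, eps=1e-5):
    return (loss(w + eps, x, t) - loss(w - eps, x, t)) / (2 * eps)

w, x, t = 0.7, 1.3, 1.0
assert abs(analytic_grad(w, x, t) - numeric_grad(w, x, t)) < 1e-7
```

Running this check on every layer of a hand‑rolled backprop implementation would have caught the mixed activated/non‑activated values immediately.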
Why Juniors Miss It
Less experienced engineers often overlook this because:
- They assume any activation + any loss will work
- They trust that “if the code runs, the math must be correct”
- They focus on debugging the training loop instead of the gradient math
- They don’t yet recognize the classic symptom: accuracy stuck at random chance
- They underestimate how easily gradients vanish with tanh + MSE
- They rarely test with numerical gradient checking
This is a rite‑of‑passage bug in machine learning engineering—everyone hits it once.