How to Fix Poor Invoice OCR Accuracy with OpenCV and Tesseract

Summary

Accurate OCR on invoices requires careful image preprocessing, proper scaling, and targeted Tesseract configuration. The original pipeline yields gibberish because of low contrast and missing steps.

Root Cause- Low image resolution before OCR

Insufficient contrast leading to fragmented characters
Incorrect adaptive threshold values
Missing denoising steps
Generic OCR config not tuned for receipts

Why This Happens in Real Systems

Images are often captured with smartphones at varying distances
Lighting conditions cause shadows and glare
PDFs are rasterized at non‑optimal DPI, losing detail

Real-World Impact

Mis‑extracted amounts can cause billing errors
Wrong dates lead to incorrect financial reporting
Repeated manual correction increases operational cost
False positives trigger downstream validation failures

Example or Code (if necessary and relevant)

Here is a refined preprocessing pipeline that senior engineers typically apply:

import cv2
import pytesseract
import numpy as np

img = cv2.imread('elecBill.jpg')
# 1️⃣ Upscale for better DPI
img = cv2.resize(img, None, fx=2.0, fy=2.0, interpolation=cv2.INTER_CUBIC)

# 2️⃣ Convert to grayscale
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# 3️⃣ Apply CLAHE for local contrast
clahe = cv2.createCLAHE(clipLimit=3.0, tileGridSize=(8, 8))
gray = clahe.apply(gray)

# 4️⃣ Denoise with bilateral filter
gray = cv2.bilateralFilter(gray, d=9, sigmaColor=75, sigmaSpace=75)

# 5️⃣ Adaptive threshold (tuned parameters)
thresh = cv2.adaptiveThreshold(
    gray, 255,
    cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
    cv2.THRESH_BINARY,
    blockSize=31,
    C=10
)

# 6️⃣ Optional morphological clean‑upkernel = cv2.getStructuringElement(cv2.MORPH_RECT, (2, 2))
thresh = cv2.morphologyEx(thresh, cv2.MORPH_CLOSE, kernel)

# 7️⃣ OCR with receipt‑specific config
custom_config = r'--oem 3 --psm 6 -c tessedit_char_whitelist=0123456789.,$%/'
text = pytesseract.image_to_string(thresh, config=custom_config)

How Senior Engineers Fix It

Scale images to at least 300 DPI equivalent before OCR
Enhance contrast using CLAHE or histogram equalization
Apply bilateral filtering to preserve edges while reducing noise
Tune adaptive threshold blockSize and C based on character size
Add morphological closing to reconnect broken strokes
Use receipt‑specific Tesseract config: --psm 6 and whitelist digits/symbols
Validate extracted fields with regex or rule‑based checks
Log confidence scores and fallback to manual review when low

Why Juniors Miss It- Copy‑paste default code without adjusting parameters

Overlook scaling; treat 1× images as sufficient
Ignore locale‑specific OCR settings for receipts
Fail to experiment with preprocessing pipelines
Lack understanding of how noise impacts character segmentation