How to Fix Poor Invoice OCR Accuracy with OpenCV and Tesseract

Summary

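Accurate OCR on invoices depends on image preprocessing, adequate scaling, and a Tesseract configuration targeted at the fields being extracted. The original pipeline yields gibberish because the input is low‑resolution and low‑contrast, and because it skips denoising and relies on untuned threshold and OCR settings.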

Root Cause

  • Low image resolution before OCR
  • Insufficient contrast leading to fragmented characters
  • Incorrect adaptive threshold values
  • Missing denoising steps
  • Generic OCR config not tuned for receipts

Why This Happens in Real Systems

  • Images are often captured with smartphones at varying distances
  • Lighting conditions cause shadows and glare
  • PDFs are rasterized at non‑optimal DPI, losing detail
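
When the source DPI is known, the required upscaling is simple arithmetic. The following is a minimal sketch, assuming a 300 DPI target and a known source DPI (for example, from the PDF rasterizer settings). The helper name `upscale_factor` and the `max_factor` cap are illustrative assumptions, not part of OpenCV.

```python
def upscale_factor(source_dpi: float, target_dpi: float = 300.0,
                   max_factor: float = 4.0) -> float:
    """Return the resize factor needed to reach target_dpi.

    Never downscales (minimum 1.0) and caps the factor at max_factor,
    an assumed limit to avoid very large interpolated images.
    """
    if source_dpi <= 0:
        raise ValueError("source_dpi must be positive")
    return min(max(target_dpi / source_dpi, 1.0), max_factor)


# A page rasterized at 150 DPI needs a 2x resize to reach 300 DPI.
factor = upscale_factor(150)
```

The result can be passed as both `fx` and `fy` to `cv2.resize`. Smartphone photos carry no reliable DPI, so for those the factor would have to be estimated another way, for instance from the measured text height.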

Real-World Impact

  • Mis‑extracted amounts can cause billing errors
  • Wrong dates lead to incorrect financial reporting
  • Repeated manual correction increases operational cost
  • False positives trigger downstream validation failures

Example or Code

Here is a refined preprocessing pipeline that senior engineers typically apply:

import cv2
import pytesseract
import numpy as np

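img = cv2.imread('elecBill.jpg')
# cv2.imread returns None instead of raising when the file cannot be read
if img is None:
    raise FileNotFoundError('Could not read elecBill.jpg')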
# 1️⃣ Upscale for better DPI
img = cv2.resize(img, None, fx=2.0, fy=2.0, interpolation=cv2.INTER_CUBIC)

# 2️⃣ Convert to grayscale
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# 3️⃣ Apply CLAHE for local contrast
clahe = cv2.createCLAHE(clipLimit=3.0, tileGridSize=(8, 8))
gray = clahe.apply(gray)

# 4️⃣ Denoise with bilateral filter
gray = cv2.bilateralFilter(gray, d=9, sigmaColor=75, sigmaSpace=75)

# 5️⃣ Adaptive threshold (tuned parameters)
thresh = cv2.adaptiveThreshold(
    gray, 255,
    cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
    cv2.THRESH_BINARY,
    blockSize=31,
    C=10
)

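# 6️⃣ Optional morphological clean‑up
# OpenCV morphology treats white as foreground, and the text is black after
# THRESH_BINARY, so close on the inverted image to reconnect broken strokes.
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (2, 2))
inverted = cv2.bitwise_not(thresh)
inverted = cv2.morphologyEx(inverted, cv2.MORPH_CLOSE, kernel)
thresh = cv2.bitwise_not(inverted)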

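# 7️⃣ OCR with receipt‑specific config
# The whitelist limits output to digits and symbols, which suits amounts and
# numeric dates; run a separate pass without it for vendor names and other text.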
custom_config = r'--oem 3 --psm 6 -c tessedit_char_whitelist=0123456789.,$%/'
text = pytesseract.image_to_string(thresh, config=custom_config)
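
The `blockSize` passed to `cv2.adaptiveThreshold` must be odd and greater than 1, and it generally works best when the window is somewhat larger than a single character. The sketch below shows one way to derive it from an estimated character height. The helper name `odd_block_size` and the 1.5 multiplier are assumptions for illustration, a starting point to tune rather than a rule.

```python
def odd_block_size(char_height_px: int, multiplier: float = 1.5,
                   minimum: int = 3) -> int:
    """Return an odd window size of roughly multiplier x character height.

    The 1.5 multiplier is an assumed heuristic; adjust it per document type.
    """
    size = max(int(round(char_height_px * multiplier)), minimum)
    # adaptiveThreshold rejects even sizes, so bump even values up by one
    return size if size % 2 == 1 else size + 1


block_size = odd_block_size(20)
```

For characters about 20 pixels tall, this yields 31. The constant `C` still needs to be tuned empirically on sample images.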

How Senior Engineers Fix It

  • Scale images to at least 300 DPI equivalent before OCR
  • Enhance contrast using CLAHE or histogram equalization
  • Apply bilateral filtering to preserve edges while reducing noise
  • Tune adaptive threshold blockSize and C based on character size
  • Add morphological closing to reconnect broken strokes
  • Use receipt‑specific Tesseract config: --psm 6 (treat the image as a single uniform block of text) and a whitelist of digits/symbols for numeric fields
  • Validate extracted fields with regex or rule‑based checks
  • Log confidence scores and fall back to manual review when they are low
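
The last two points can be sketched as follows. This is a minimal example, assuming amounts formatted like `$1,234.56` and an OCR result shaped like the dictionary returned by `pytesseract.image_to_data(..., output_type=pytesseract.Output.DICT)`, which has parallel `text` and `conf` lists. The function names, the amount pattern, and the threshold of 60 are illustrative assumptions.

```python
import re
from decimal import Decimal

# Optional "$", digits with or without thousands separators, optional cents
AMOUNT_RE = re.compile(r"^\$?(?:\d{1,3}(?:,\d{3})+|\d+)(?:\.\d{2})?$")


def parse_amount(token):
    """Return a Decimal for a well-formed amount, or None if it fails the rule."""
    token = token.strip()
    if not AMOUNT_RE.match(token):
        return None
    return Decimal(token.replace("$", "").replace(",", ""))


def mean_confidence(ocr_data):
    """Average confidence over recognized words.

    Entries with empty text or a confidence of -1 are layout rows,
    not words, so they are skipped.
    """
    confs = [
        float(conf)
        for conf, text in zip(ocr_data["conf"], ocr_data["text"])
        if text.strip() and float(conf) >= 0
    ]
    return sum(confs) / len(confs) if confs else 0.0


def needs_manual_review(ocr_data, amount_token, min_conf=60.0):
    """Flag a document when the amount is malformed or confidence is low."""
    return (parse_amount(amount_token) is None
            or mean_confidence(ocr_data) < min_conf)


# Hand-written sample in the assumed shape, not real OCR output
sample = {"text": ["", "Total", "$1,234.56"], "conf": [-1, 91, 88]}
flagged = needs_manual_review(sample, "$1,234.56")
```

A token such as `12.5O` (letter O in place of zero) fails the pattern and is routed to review instead of being stored as an amount.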

Why Juniors Miss It

  • Copy‑paste default code without adjusting parameters
  • Overlook scaling; treat 1× images as sufficient
  • Ignore locale‑specific OCR settings for receipts, such as the Tesseract language and the expected currency and date formats
  • Fail to experiment with preprocessing pipelines
  • Lack understanding of how noise impacts character segmentation
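
Experimenting with the pipeline can be as simple as a small grid search. The sketch below assumes a caller-supplied `score` function, hypothetical here, that would run the preprocessing and OCR on sample images with the given parameters and return a quality measure such as mean confidence. The candidate values are placeholders.

```python
from itertools import product


def best_threshold_params(score, block_sizes=(15, 31, 51), c_values=(5, 10, 15)):
    """Return the (blockSize, C) pair with the highest score.

    `score(block_size, c)` is a hypothetical callable supplied by the caller.
    """
    return max(product(block_sizes, c_values), key=lambda params: score(*params))


# Stand-in scorer for illustration only; a real one would run OCR on samples.
def fake_score(block_size, c):
    return -abs(block_size - 31) - abs(c - 10)


best = best_threshold_params(fake_score)
```

Evaluating on a handful of representative invoices, rather than a single image, keeps the chosen parameters from overfitting to one capture.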
