Summary
Accurate OCR on invoices requires careful image preprocessing, proper scaling, and targeted Tesseract configuration. The original pipeline yields gibberish because the input is low-resolution and low-contrast, and the pipeline skips denoising and uses poorly tuned threshold values.
Root Cause
- Low image resolution before OCR
- Insufficient contrast leading to fragmented characters
- Incorrect adaptive threshold values
- Missing denoising steps
- Generic OCR config not tuned for receipts
Why This Happens in Real Systems
- Images are often captured with smartphones at varying distances
- Lighting conditions cause shadows and glare
- PDFs are rasterized at non‑optimal DPI, losing detail
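A rough way to reason about the DPI point is to derive the upscale factor from the image's pixel width and the physical page width instead of hard-coding it. The sketch below assumes the page width in inches is known; the function names and the 300 DPI default are illustrative choices, not part of the original pipeline.

```python
def effective_dpi(width_px: int, page_width_in: float) -> float:
    """Approximate DPI of a scan or photo from its pixel width and the physical page width."""
    if width_px <= 0 or page_width_in <= 0:
        raise ValueError("width_px and page_width_in must be positive")
    return width_px / page_width_in


def scale_for_target_dpi(width_px: int, page_width_in: float,
                         target_dpi: float = 300.0) -> float:
    """Return the resize factor needed to reach target_dpi, never below 1.0 (no downscaling)."""
    current = effective_dpi(width_px, page_width_in)
    return max(1.0, target_dpi / current)
```

For example, an 8.5-inch-wide page captured at 1275 pixels wide is about 150 DPI, so the factor comes out as 2.0 and could be passed as both fx and fy to cv2.resize.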
Real-World Impact
- Mis‑extracted amounts can cause billing errors
- Wrong dates lead to incorrect financial reporting
- Repeated manual correction increases operational cost
- False positives trigger downstream validation failures
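Amount and date errors of this kind can often be caught before they reach billing or reporting. The following is a minimal sketch of rule-based field checks; the amount pattern (optional dollar sign, optional thousands separators, exactly two decimals if any) and the MM/DD/YYYY date format are assumptions that would need adjusting to the actual documents.

```python
import re
from datetime import date, datetime
from decimal import Decimal
from typing import Optional

# Assumed amount shape: "$1,234.50", "1234.50", or "1234".
AMOUNT_RE = re.compile(r"\$?(\d{1,3}(?:,\d{3})+|\d+)(\.\d{2})?")


def parse_amount(raw: str) -> Optional[Decimal]:
    """Return the amount as a Decimal, or None if the OCR text is not a clean amount."""
    match = AMOUNT_RE.fullmatch(raw.strip())
    if match is None:
        return None
    whole = match.group(1).replace(",", "")
    cents = match.group(2) or ""
    return Decimal(whole + cents)


def parse_date(raw: str, fmt: str = "%m/%d/%Y") -> Optional[date]:
    """Return a date, or None if the text is not a valid calendar date in the assumed format."""
    try:
        return datetime.strptime(raw.strip(), fmt).date()
    except ValueError:
        return None
```

A value such as "1O.00" (letter O misread for zero) or "13/45/2023" returns None, which gives downstream code a clear signal to route the document to review rather than post a wrong figure.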
Example or Code
Here is a refined preprocessing pipeline that senior engineers typically apply:
import cv2
import pytesseract
import numpy as np
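img = cv2.imread('elecBill.jpg')
if img is None:  # cv2.imread returns None instead of raising when the file cannot be read
    raise FileNotFoundError('elecBill.jpg could not be read')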
# 1️⃣ Upscale for better DPI
img = cv2.resize(img, None, fx=2.0, fy=2.0, interpolation=cv2.INTER_CUBIC)
# 2️⃣ Convert to grayscale
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
# 3️⃣ Apply CLAHE for local contrast
clahe = cv2.createCLAHE(clipLimit=3.0, tileGridSize=(8, 8))
gray = clahe.apply(gray)
# 4️⃣ Denoise with bilateral filter
gray = cv2.bilateralFilter(gray, d=9, sigmaColor=75, sigmaSpace=75)
# 5️⃣ Adaptive threshold (tuned parameters)
thresh = cv2.adaptiveThreshold(
gray, 255,
cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
cv2.THRESH_BINARY,
blockSize=31,
C=10
)
# 6️⃣ Optional morphological clean‑up
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (2, 2))
thresh = cv2.morphologyEx(thresh, cv2.MORPH_CLOSE, kernel)
# 7️⃣ OCR with receipt‑specific config
custom_config = r'--oem 3 --psm 6 -c tessedit_char_whitelist=0123456789.,$%/'
text = pytesseract.image_to_string(thresh, config=custom_config)
How Senior Engineers Fix It
- Scale images to at least 300 DPI equivalent before OCR
- Enhance contrast using CLAHE or histogram equalization
- Apply bilateral filtering to preserve edges while reducing noise
- Tune adaptive threshold blockSize and C based on character size
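- Add a morphological step to reconnect broken strokes; on dark text over a white background, closing only removes dark specks, so use opening (or THRESH_BINARY_INV followed by closing) to rejoin strokes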
- Use receipt‑specific Tesseract config: --psm 6 and a whitelist of digits/symbols
- Validate extracted fields with regex or rule‑based checks
- Log confidence scores and fall back to manual review when they are low
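The confidence-based fallback in the last item can be sketched as a small filter over per-word results. This assumes the input is a dictionary shaped like the one returned by pytesseract.image_to_data(thresh, config=custom_config, output_type=pytesseract.Output.DICT), with parallel "text" and "conf" lists; the function name and the cutoff of 60 are illustrative assumptions, not recommended values.

```python
from typing import Dict, List, Tuple


def split_by_confidence(data: Dict[str, list],
                        min_conf: float = 60.0) -> Tuple[List[str], List[str]]:
    """Split recognized words into (accepted, needs_review) by per-word confidence."""
    accepted: List[str] = []
    needs_review: List[str] = []
    for text, conf in zip(data["text"], data["conf"]):
        word = str(text).strip()
        score = float(conf)  # conf may arrive as a string or a number
        if not word or score < 0:
            continue  # empty text or conf of -1 marks layout rows, not words
        if score >= min_conf:
            accepted.append(word)
        else:
            needs_review.append(word)
    return accepted, needs_review
```

If needs_review is non-empty for a field that matters, such as the total, the document can be logged and queued for manual review instead of being accepted automatically.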
Why Juniors Miss It
- Copy‑paste default code without adjusting parameters
- Overlook scaling; treat 1× images as sufficient
- Ignore locale‑specific OCR settings for receipts
- Fail to experiment with preprocessing pipelines
- Lack understanding of how noise impacts character segmentation
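Experimenting with parameters is easier when the candidates are generated systematically. The sketch below builds a small grid of (blockSize, C) pairs for cv2.adaptiveThreshold from an estimated character height; the idea of scaling blockSize from character height, the multipliers, and the C values are rule-of-thumb assumptions. The only hard constraint used is that blockSize must be an odd integer of at least 3.

```python
from itertools import product
from typing import Iterator, Sequence, Tuple


def to_valid_block_size(value: float) -> int:
    """Round to an integer and bump even values up by one, so the result is odd and >= 3."""
    size = max(3, int(round(value)))
    return size if size % 2 == 1 else size + 1


def threshold_grid(char_height_px: float,
                   multipliers: Sequence[float] = (1.0, 1.5, 2.0),
                   c_values: Sequence[int] = (5, 10, 15)) -> Iterator[Tuple[int, int]]:
    """Yield candidate (blockSize, C) pairs derived from the estimated character height."""
    for multiplier, c in product(multipliers, c_values):
        yield to_valid_block_size(char_height_px * multiplier), c
```

Each pair can be fed to the thresholding step and the resulting OCR output compared against a few hand-labelled documents to pick the setting that works best for that image source.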