Recovering Lost Page Numbers from Legal HTML and PDF in Python

How can I reliably recover and preserve page numbers from legal-document HTML/PDF text in Python at scale?

Summary

The core challenge is detecting and recovering lost page markers during legal-document processing. Inconsistent page numbers in rendered HTML, source HTML, or PDF text disrupt downstream workflows. The goal is to deterministically recover these markers without OCR unless absolutely necessary.

Root Cause

Page number loss occurs due to:

Renderer-specific stripping: Tools like browsers or libraries may omit or misplace <span class="page-number"> tags during rendering.
PDF text formatting: Markers like *12 in PDF text lack semantic tags, making them fragile to parse.
HTML markup inconsistencies: Sources may omit or misattribute page-number elements.

Why This Happens in Real Systems

In production pipelines, the loss typically traces back to one of the following:

Renderer variability: Different renderers (e.g., browsers vs. headless tools) handle <span> elements differently.
PDF extraction quirks: Libraries like PyMuPDF or pdfplumber may ignore non-text annotations, leaving only free-text markers like *12.
HTML sanitization: Aggressive sanitization removes non-standard page-number markup.
Algorithmic gaps: Edge cases where markers are present but hidden (e.g., white text on a white background) are easy to overlook.
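To make the free-text case concrete, here is a minimal sketch that pulls star-pagination markers out of text that has already been extracted from a PDF, using only the standard library. The name find_star_markers and the marker shape are assumptions for illustration: a marker is taken to be an asterisk directly followed by digits (such as *12) that is not glued to a preceding word character. Real corpora may need a different pattern.

```python
import re

# Assumption: a marker is an asterisk immediately followed by digits ("*12"),
# not preceded by a word character and not followed by more word characters.
STAR_MARKER = re.compile(r"(?<!\w)\*(\d+)\b")


def find_star_markers(text):
    """Return (page_number, character_offset) pairs in document order."""
    return [(int(m.group(1)), m.start()) for m in STAR_MARKER.finditer(text)]


# "3*4" and "*5th" are deliberately rejected by the pattern above.
sample = "the court held *12 that 3*4 is not a marker, nor is *5th. *13 Next"
print(find_star_markers(sample))
```

Keeping the character offsets alongside the page numbers is a design choice: later steps can use them to align each marker with the surrounding text.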

Real-World Impact

Left unaddressed, these failures show up as:

Data corruption: Missing page numbers break structured JSON output.
Manual rework: Teams spend hours debugging alignment issues.
OCR overload: Fallback to OCR introduces latency and cost.
Inconsistent UX: Rendered HTML loses traceable pagination cues.

Example or Code

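import re

from bs4 import BeautifulSoup

# Star-pagination marker such as "*12": an asterisk directly followed by digits,
# not attached to a preceding word character.
STAR_MARKER = re.compile(r'(?<!\w)\*(\d+)\b')

def recover_page_number(source_html, rendered_html, pdf_text):
    # Check the source HTML first, then the rendered HTML, for semantic markers
    for html in (source_html, rendered_html):
        soup = BeautifulSoup(html or '', 'lxml')
        page_span = soup.find('span', class_='page-number')
        if page_span:
            return page_span.get_text(strip=True).replace('*', '')

    # Fall back to inline star markers in the PDF text
    match = STAR_MARKER.search(pdf_text or '')
    if match:
        return match.group(1)

    # No marker in any source: return None so the caller can decide on OCR
    return None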

This function prioritizes semantic HTML markers first, then free-text PDF patterns.

How Senior Engineers Fix It

Senior engineers use:

Marker anchoring: Link page numbers to invariant document structure (e.g., linking *12 to a specific header).
Cross-format validation: Compare page numbers across source HTML, rendered HTML, and PDF text.
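Regex precision: Use a pattern such as r'(?<!\w)\*(\d+)\b' to avoid false positives in PDF text. A leading \b would only match when a word character precedes the asterisk, so it misses markers that follow whitespace.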
Fallback logic: Only resort to OCR if no markers exist in any source.
Semantic preservation: Reinsert markers into rendered HTML without disrupting footnotes or paragraphs.
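The cross-format validation step above can be sketched roughly as follows. This is an illustration under stated assumptions, not a definitive implementation: the names PageSpanCollector, pages_from_html, and cross_validate are hypothetical, HTML markers are assumed to live in span elements with the class page-number, and PDF markers are assumed to look like *12. It uses only the standard library so it runs without extra dependencies.

```python
import re
from html.parser import HTMLParser

# Assumed marker shape in PDF text: "*12".
STAR_MARKER = re.compile(r"(?<!\w)\*(\d+)\b")


class PageSpanCollector(HTMLParser):
    """Collect the digits found inside <span class="page-number"> elements."""

    def __init__(self):
        super().__init__()
        self.pages = []
        self._depth = 0  # greater than 0 while inside a page-number span

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        classes = (dict(attrs).get("class") or "").split()
        # Track nested spans so the matching end tag closes the right one.
        if self._depth or "page-number" in classes:
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "span" and self._depth:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth:
            self.pages.extend(int(d) for d in re.findall(r"\d+", data))


def pages_from_html(html):
    collector = PageSpanCollector()
    collector.feed(html)
    collector.close()
    return collector.pages


def cross_validate(source_html, rendered_html, pdf_text):
    """Report, per source, which page numbers seen elsewhere are missing."""
    found = {
        "source_html": pages_from_html(source_html),
        "rendered_html": pages_from_html(rendered_html),
        "pdf_text": [int(n) for n in STAR_MARKER.findall(pdf_text)],
    }
    all_pages = set().union(*found.values())
    return {name: sorted(all_pages - set(pages)) for name, pages in found.items()}


source = '<p>Text <span class="page-number">*12</span> more <span class="page-number">*13</span></p>'
rendered = '<p>Text more <span class="page-number">*13</span></p>'
pdf = "Text *12 more *13"
report = cross_validate(source, rendered, pdf)
print(report)  # {'source_html': [], 'rendered_html': [12], 'pdf_text': []}
```

A non-empty list for any source indicates a marker that survived elsewhere and can be recovered without OCR. OCR only becomes a candidate when every source comes back empty.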

Why Juniors Miss It

Juniors often fail due to:

Over-reliance on OCR: Jumping to OCR without exhausting marker recovery strategies.
Ignoring PDF footprints: Missing *12 markers in PDF text because of an overly strict regex or unsuitable parser settings.
Assumptions about HTML: Assuming all page numbers are in <span class="page-number">.
Poor error handling: Not validating recovered markers against multiple sources.
Static parsing: Relying on a single hardcoded regex instead of aligning markers dynamically against the surrounding text.
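One way to picture the dynamic alignment mentioned above is to anchor each marker on the words that precede it in the PDF text, then look for the same words in the rendered text. The sketch below is hypothetical and deliberately simple: the function name reinsert_markers, the anchor_words parameter, and the *12 marker shape are all assumptions, and it works on plain text rather than on an HTML tree.

```python
import re

# Assumed marker shape in PDF text: "*12".
STAR_MARKER = re.compile(r"(?<!\w)\*(\d+)\b")


def reinsert_markers(pdf_text, rendered_text, anchor_words=3):
    """Copy "*N" markers from pdf_text into rendered_text after matching anchors.

    Returns the updated text and the page numbers that could not be placed.
    """
    result = rendered_text
    search_from = 0  # only search forward, so markers stay in document order
    unplaced = []
    for match in STAR_MARKER.finditer(pdf_text):
        # Anchor: the last few words before the marker, ignoring earlier markers.
        preceding = STAR_MARKER.sub(" ", pdf_text[:match.start()])
        words = preceding.split()[-anchor_words:]
        if not words:
            unplaced.append(int(match.group(1)))
            continue
        # Allow any run of whitespace between anchor words in the rendered text.
        pattern = re.compile(r"\s+".join(re.escape(w) for w in words))
        hit = pattern.search(result, search_from)
        if hit is None:
            unplaced.append(int(match.group(1)))
            continue
        marker = " " + match.group(0)
        result = result[:hit.end()] + marker + result[hit.end():]
        search_from = hit.end() + len(marker)
    return result, unplaced


pdf = "The court held *12 that the statute applies. On appeal *13 the ruling stood."
rendered = "The court held that the statute applies. On appeal the ruling stood."
restored, unplaced = reinsert_markers(pdf, rendered)
print(restored)
print(unplaced)
```

Short anchors can match a repeated phrase in the wrong place, so the unplaced list and a comparison against other sources remain useful as a check before trusting the result.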