How to parse informal Persian number words into an integer currency value (Rial)

Summary

A production incident occurred in a Persian‑language financial application where informal Persian number words were parsed incorrectly, producing wrong integer Rial values. The failure stemmed from ambiguous colloquial forms, mixed units (تومان/ریال), and inconsistent tokenization rules. The system behaved unpredictably when users combined spoken‑style numerals, misspellings, and magnitude units in the same sentence.

Root Cause

The core failure was a non‑deterministic parsing pipeline that assumed clean, dictionary‑matched tokens. The real input stream violated those assumptions.

Key causes:

Unnormalized colloquial forms such as یه، پونصد، پونزده، هتصد، هراز were not mapped to canonical forms.
Magnitude units (هزار، میلیون، میلیارد، تریلیون) were applied inconsistently because the parser lacked a hierarchical numeric model.
Mixed currencies (تومان + ریال) were treated as independent values instead of a single convertible expression.
Greedy tokenization split or merged words incorrectly (e.g., یکهزارو or ملیونپانصد).
No fallback grammar, causing the system to misinterpret sequences like “یکو پونصد میلیون”.

Why This Happens in Real Systems

Real‑world NLP pipelines fail on informal Persian numerals because:

Human input is not standardized; users type the way they speak.
Persian number morphology is highly flexible, allowing concatenation, omission of spaces, and phonetic spellings.
Currency expressions are compositional, requiring multi‑stage reasoning rather than simple extraction.
Magnitude stacking (e.g., “نه میلیارد و ششصد و سی میلیون…”) requires a tree‑based numeric model, not regex.
Legacy code paths often rely on brittle heuristics that collapse under noisy input.

Real-World Impact

Incorrect parsing of monetary values leads to:

Financial miscalculations in transactions and invoices.
Incorrect reporting in accounting dashboards.
User distrust when displayed values differ from expectations.
Silent data corruption when wrong Rial values propagate downstream.
Operational escalations when support teams must manually correct amounts.

Example or Code (if necessary and relevant)

A minimal example of a robust approach uses normalization → tokenization → numeric composition:

def parse_persian_money(text):
    normalized = normalize_persian(text)
    tokens = tokenize(normalized)
    value = compose_numeric(tokens)
    return value

How Senior Engineers Fix It

Senior engineers stabilize the system by introducing deterministic, layered parsing:

Normalization layer
Map all colloquial forms to canonical equivalents (e.g., پونصد → پانصد).
Dictionary‑driven tokenization
Use a curated lexicon of number words, magnitude units, and currency units.
Hierarchical numeric grammar
Build a composition engine that understands:
- base numbers
- magnitude multipliers
- nested structures (e.g., میلیون + هزار + صد)
Currency resolution rules
Convert تومان → ریال only once, at the end, using ×10.
Fuzzy matching
Handle misspellings using edit‑distance or phonetic similarity.
Extensive test corpus
Validate against thousands of real user examples like those in the incident report.
Idempotent parsing
Ensure repeated parsing yields the same numeric result.

Why Juniors Miss It

Junior engineers often overlook:

The linguistic complexity of Persian numerals and colloquial speech.
Magnitude stacking rules, assuming linear addition instead of hierarchical multiplication.
Normalization as a mandatory step, not an optional enhancement.
Ambiguity in mixed currency expressions, especially when both تومان and ریال appear.
The need for a grammar, relying instead on regex‑only solutions.
Edge cases like concatenated words, missing spaces, or phonetic spellings.

They typically underestimate how messy real user input is and how quickly heuristic solutions break under production load.

If you want, I can generate a complete production‑ready parsing algorithm or a full Python implementation that handles all examples in your dataset.