Technical SEO: Fixing Search Recall in Production Services

Summary

A production search service failed to return relevant results because the Full-Text Search (FTS) engine was unable to match queries where users entered variations of technical specifications (e.g., “16GB” vs “16 GB”). The system relied on standard tokenization, which treats “16GB” as a single unique token, failing to match the decomposed “16” and “GB” tokens present in the indexed documents. To bypass this, the engineer implemented a Regex-based normalization layer within the SQL query to force whitespace between numbers and units, effectively “fixing” the data at query time.

Root Cause

The fundamental issue is a Tokenization Mismatch between the indexed data and the user query.

Standard Tokenization Behavior: PostgreSQL’s default text search parser identifies alphanumeric strings like 16GB as a single word or numword token.
Lexeme Divergence: If the document contains 16 GB (two tokens) and the user searches for 16GB (one token), the intersection of the two lexeme sets is empty.
Lack of Semantic Normalization: The standard FTS configuration lacks a rule to treat “number+unit” as a composite entity that can be decomposed into constituent parts.

Why This Happens in Real Systems

In high-scale production environments, data is rarely “clean.” This problem arises due to:

Unstructured User Input: Users do not follow strict formatting rules; they vary whitespace, casing, and punctuation based on speed and habit.
Heterogeneous Data Sources: Data ingested from various scrapers, APIs, or manual entries often contains inconsistent formatting for the same semantic entity (e.g., measurements, SKU numbers, or model versions).
Complexity of Language Rules: Standard parsers are designed for linguistic rules (stemming, stop words), not for the specific domain-specific syntax of technical specifications.

Real-World Impact

Reduced Search Recall: Users cannot find products or documents even when they are clearly present, leading to “No Results Found” errors.
Degraded User Experience: The perceived “intelligence” of the search engine drops, often resulting in increased support tickets or user churn.
Performance Degradation: Using Regex in a SELECT clause to normalize data on the fly prevents the database from utilizing standard GIN (Generalized Inverted Index) optimizations effectively, as the transformation must be computed for every row during the scan.

Example or Code

The following demonstrates the mismatch and the “quick fix” using regex normalization:

-- The problematic mismatch
SELECT to_tsvector('portuguese', 'Memória 16GB') @@ plainto_tsquery('portuguese', '16 GB'); 
-- Result: false

-- The regex-based workaround (Normalization at query time)
SELECT 
    source, 
    query, 
    to_tsvector(source) @@ plainto_tsquery(query)
FROM (
    SELECT 
        regexp_replace(regexp_replace(source, '(\d+)\s*(\D+)', '\1 \2', 1, 0), '(\D+)-(\d+)', '\1 \2', 1, 0) AS source,
        regexp_replace(regexp_replace(query, '(\d+)\s*(\D+)', '\1 \2', 1, 0), '(\D+)-(\d+)', '\1 \2', 1, 0) AS query
    FROM (
        VALUES ('Memória 16GB', '16 gb')
    ) AS t(source, query)
) AS s;
-- Result: true

How Senior Engineers Fix It

A senior engineer avoids “fixing it in the query” and instead focuses on Data Normalization at Ingestion or Custom Parser Logic.

Normalization at Ingest (ETL): Instead of running regex during the search, run the regex during the Write/Update phase. Store a “searchable” version of the text in a dedicated column that is already normalized (e.g., always 16 GB).
Custom Text Search Configuration: Create a specialized configuration that uses specific dictionaries or mappings to handle unit-number combinations.
Pre-processing Pipelines: Use an application-layer service (e.g., in Python or Go) to clean and standardize technical tokens before they ever reach the database.
N-Gram Tokenization: Implement Trigram indexes (pg_trgm) which can match partial substrings, making the system more resilient to whitespace variations without requiring complex regex.

Why Juniors Miss It

Focus on the “Correct” Query: Juniors often focus on writing a query that “works” for the current input, rather than architecting a system that handles the data lifecycle.
Ignoring Computational Complexity: They may not realize that applying regexp_replace to every row in a million-row table turns an indexed $O(\log n)$ lookup into a massive $O(n)$ sequential scan.
Treating Symptoms, Not Diseases: Using regex to solve a tokenization issue is treating the symptom (the mismatch). The disease is the inconsistent data representation within the index.