Summary
During the development of an automated NL-to-SQL pipeline, we encountered a cascade of failures involving schema ambiguity, LLM reasoning loops, and SQL syntax invalidity. Despite using a sophisticated graph-based retrieval system with NetworkX and semantic embeddings, the system suffered from path explosion, contextual inaccuracy (selecting valid but logically incorrect join paths), and a failure in the self-healing loop where the LLM repeatedly hallucinated the same erroneous SQL. This postmortem analyzes the architectural friction between graph theory, semantic search, and LLM reasoning.
Root Cause
The failure stems from three distinct layers of the system:
- Semantic Dilution in Schema Linking: Using only vector embeddings for table selection ignores the relational topology. High cosine similarity between a query term and a table name does not guarantee that the table is part of the necessary logical join path.
- Context-Agnostic Path Scoring: The current scoring mechanism (similarity + path length + nullability) is purely mathematical/topological. It lacks domain-specific constraints, allowing the system to favor a “valid” path (e.g., through a drug log) over the “correct” business path (e.g., through prescription details) because both are structurally sound.
- Zero-Information Feedback Loops: The self-healing loop failed because the error messages sent back to the LLM lacked execution context and schema constraints. When an LLM receives a syntax error but no instruction to change its structural approach, it often performs “token-level repairs” (fixing a comma) rather than “architectural repairs” (changing the join logic).
- Identifier Collision: The system failed to account for PostgreSQL reserved keywords used as identifiers, leading to immediate syntax errors that the LLM attempted to “fix” without understanding the need for double-quoting identifiers.
Why This Happens in Real Systems
In production-grade RAG (Retrieval-Augmented Generation) and Text-to-SQL systems, these issues occur due to:
- The Curse of Dimensionality in Schema Space: As the number of tables grows, the “semantic distance” between unrelated tables decreases, leading to schema noise.
- Graph vs. Logic Gap: A graph knows how tables can connect, but it doesn’t know how they should connect. Business logic is non-topological.
- LLM Inertia: LLMs are probabilistic. If the prompt does not explicitly force a divergent reasoning path during a failure, the model will gravitate toward the highest-probability (but incorrect) token sequence it previously generated.
Real-World Impact
- Data Integrity Risks: If the system silently selects the wrong join path (e.g., joining via a log instead of a transaction), it returns mathematically correct but logically false data, which is more dangerous than an error.
- Increased Latency and Cost: Failed self-healing loops result in multiple API calls to expensive models (like Llama 3.3 70B) that yield zero utility, increasing token spend and user wait times.
- System Unreliability: The inability to handle reserved words makes the system brittle and unusable on standard enterprise databases where terms like
user,order, orgroupare common.
Example or Code
# The incorrect way: Sending only the error message
def failed_retry_loop(error_msg, previous_sql, prompt):
# This often leads to the LLM repeating the same mistake
return llm.generate(f"The SQL failed: {error_msg}. Fix this: {previous_sql}")
# The Senior Engineer way: Providing structural context and quoting rules
def robust_retry_loop(error_msg, previous_sql, schema_context, rules):
prompt = f"""
SYSTEM ERROR: {error_msg}
PREVIOUS ATTEMPT: {previous_sql}
CRITICAL INSTRUCTIONS:
1. ALL table and column names must be wrapped in double quotes (e.g., "user") to avoid reserved word conflicts.
2. Analyze the error: Is it a syntax error or a schema mismatch?
3. If the error is 'relation does not exist', check the schema context provided.
4. DO NOT repeat the same join structure if it caused a logic error.
SCHEMA CONTEXT:
{schema_context}
RULES:
{rules}
"""
return llm.generate(prompt)
How Senior Engineers Fix It
To move this system from a prototype to production, we implement the following strategies:
- Identifier Sanitization (The “Double-Quote” Rule): Instead of renaming tables, we enforce a PostgreSQL quoting policy. The prompt must explicitly instruct the LLM to wrap all identifiers in double quotes (
"table_name") to neutralize reserved word conflicts. - Multi-Stage Schema Linking:
- Stage 1: Keyword/Entity extraction to identify “Anchor Tables.”
- Stage 2: Expand the subgraph by traversing 1-2 hops from anchors.
- Stage 3: Use the expanded subgraph (not the whole schema) as the prompt context to reduce noise.
- Contextual Path Disambiguation: Move beyond pure similarity. We implement “Golden Path” Metadata. We tag specific join paths in the
information_schemaor a sidecar metadata file withis_primary_business_path: true. The scorer then applies a massive multiplier to these verified paths. - Advanced Self-Healing (Chain-of-Thought Debugging): Instead of asking the LLM to “fix the SQL,” we ask it to:
- Explain why the previous SQL failed.
- Identify the specific column or table that caused the error.
- Propose a new join strategy.
- Generate the final SQL.
Why Juniors Miss It
- Over-reliance on Embeddings: Juniors often assume Vector Search = Intelligence. They fail to realize that embeddings capture “aboutness” but not “relational necessity.”
- Treating LLMs as Deterministic Functions: Juniors expect a “fix” to work if they just repeat the request. They miss the need for prompt-based forcing functions to break the LLM out of its local minima.
- Ignoring SQL Dialect Nuances: A junior might focus on the Python logic while ignoring how PostgreSQL handles identifiers and reserved words, treating SQL as a generic string rather than a strict language.
- Neglecting the “Golden Path”: They try to solve ambiguity with complex math (weights/scoring) rather than acknowledging that business logic is a human requirement that must be explicitly injected into the data model.