Apache Tika failing to read content from nested content control

Summary

We investigated an issue where Apache Tika fails to extract text from a nested Structured Document Tag (SDT), also known as a content control, in a .docx file. In document.xml, one w:sdt sits inside another w:sdt’s w:sdtContent. Tika extracts the text of the outer SDT and of non-nested elements correctly, but omits the text inside the inner SDT (e.g., “LC Chat 04”). The most likely explanation is that Tika’s OOXMLParser, walking the document in order, handles the inner SDT’s properties (w:sdtPr) but does not recurse into its w:sdtContent in this nested configuration.

Root Cause

The root cause appears to be incomplete handling of nested SDT structures in Tika’s OOXMLParser logic. Specifically:

Recursion Flaw: The parser likely visits the outer w:sdt node, extracts its w:sdtContent, and processes that. When it then meets the inner w:sdt within that content, it either does not trigger the extraction routine for the child node or handles the child’s w:sdtPr (properties) and skips its w:sdtContent.
Deviation from Expected Behavior: A compliant OOXML parser is expected to unwrap w:sdtContent layers recursively, at any depth. The missing text indicates that the recursion stops, or returns null/empty for the nested node, before text extraction completes.
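The suspected failure mode can be reproduced outside Tika. The Python sketch below is not Tika's code. It contrasts a hypothetical walker that treats a nested SDT as opaque (text_shallow) with one that unwraps every w:sdtContent layer (text_recursive). The function names and the sample fragment are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}
SDT = f"{{{W}}}sdt"
T = f"{{{W}}}t"

# Hypothetical fragment: an SDT nested inside another SDT's sdtContent.
SAMPLE = f"""
<w:body xmlns:w="{W}">
  <w:sdt>
    <w:sdtPr><w:alias w:val="outer"/></w:sdtPr>
    <w:sdtContent>
      <w:p>
        <w:r><w:t>Outer text </w:t></w:r>
        <w:sdt>
          <w:sdtPr><w:alias w:val="inner"/></w:sdtPr>
          <w:sdtContent>
            <w:r><w:t>LC Chat 04</w:t></w:r>
          </w:sdtContent>
        </w:sdt>
      </w:p>
    </w:sdtContent>
  </w:sdt>
</w:body>
"""


def text_shallow(node, inside_sdt=False):
    """Flawed walker: unwraps one SDT level, then skips any nested w:sdt."""
    parts = []
    for child in node:
        if child.tag == SDT:
            if inside_sdt:
                continue  # the flaw: nested SDT is treated as opaque
            content = child.find("w:sdtContent", NS)
            if content is not None:
                parts += text_shallow(content, inside_sdt=True)
        elif child.tag == T:
            parts.append(child.text or "")
        else:
            parts += text_shallow(child, inside_sdt)
    return parts


def text_recursive(node):
    """Walker that unwraps w:sdtContent at any depth and ignores w:sdtPr."""
    parts = []
    for child in node:
        if child.tag == SDT:
            content = child.find("w:sdtContent", NS)
            if content is not None:
                parts += text_recursive(content)
        elif child.tag == T:
            parts.append(child.text or "")
        else:
            parts += text_recursive(child)
    return parts


root = ET.fromstring(SAMPLE)
print(repr("".join(text_shallow(root))))    # 'Outer text '
print(repr("".join(text_recursive(root))))  # 'Outer text LC Chat 04'
```

The only difference between the two walkers is whether the SDT branch is taken again once the walk is already inside an SDT, which is the behavior the root-cause analysis points at.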

Why This Happens in Real Systems

This is a common edge case in document processing engines:

Tooling Variance: Documents generated programmatically, for example with docx4j, often nest content controls for “smart documents” or complex templates. Such nesting is permitted by the format, but documents authored by hand in Microsoft Word more often keep controls flat or side by side, so parsers see the nested form less frequently.
Parser Assumptions: Parsers often assume that an SDT is a leaf node (containing text or a single block) or handle only one level of nesting. When one placeholder is inserted inside another placeholder’s text stream, the resulting hierarchy can be misread by a generic parser as a property block rather than a content stream.

Real-World Impact

Data Loss: In automated contract generation or data extraction workflows, users will lose critical data (e.g., “LC Chat 04”) hidden inside the nested tags.
Broken Logic: This creates a discrepancy between the visual text of the Word document and the extracted text, leading to failed validation checks in document management systems.
Maintenance Overhead: Until the parser is fixed, engineering teams may have to maintain custom parsers or pre-processors that strip SDT tags, which adds complexity to the pipeline.

Example or Code

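Tika’s parsing code is not reproduced here; the structure that triggers the failure is in the input file. The problematic hierarchy in document.xml has this shape (elided parts shown as ...):

<w:sdt>
  <w:sdtPr> ... </w:sdtPr>
  <w:sdtContent>
    ...
    <w:sdt>
      <w:sdtPr> ... </w:sdtPr>
      <w:sdtContent>
        ... <w:t>LC Chat 04</w:t> ...
      </w:sdtContent>
    </w:sdt>
  </w:sdtContent>
</w:sdt>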
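To confirm that a given file actually contains this hierarchy, document.xml can be inspected directly, independent of Tika. The following is a minimal sketch using only the Python standard library. The function name nested_sdt_texts and the file name sample.docx are hypothetical, and the sketch looks only at word/document.xml, not at headers, footers, or other parts.

```python
import zipfile
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
SDT = f"{{{W}}}sdt"
T = f"{{{W}}}t"


def nested_sdt_texts(docx_file):
    """Return (depth, text) for every w:sdt that sits inside another w:sdt.

    docx_file may be a path or a binary file-like object.
    Depth 2 means "one SDT inside another", depth 3 one level deeper, etc.
    """
    with zipfile.ZipFile(docx_file) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))

    found = []

    def walk(node, depth):
        for child in node:
            if child.tag == SDT:
                if depth >= 1:
                    # Concatenate all w:t descendants of the nested SDT.
                    text = "".join(t.text or "" for t in child.iter(T))
                    found.append((depth + 1, text))
                walk(child, depth + 1)
            else:
                walk(child, depth)

    walk(root, 0)
    return found


if __name__ == "__main__":
    # "sample.docx" is a placeholder path for the document under investigation.
    for depth, text in nested_sdt_texts("sample.docx"):
        print(depth, repr(text))
```

Any string reported here that is absent from Tika's output is a candidate for the nested-SDT omission described above.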

How Senior Engineers Fix It

Senior engineers implement robust recursion to handle nested structures. If modifying the parser (or patching Tika locally), the fix involves ensuring the text handler processes w:sdt nodes regardless of depth:

Deep Recursion: Update the content handler to explicitly check for w:sdt children inside w:sdtContent. It must not stop at the first level.
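Pre-processing (Workaround): Write a pre-processor (for example with Apache POI or a plain XML transform) that removes the w:sdt wrappers but keeps the content of each w:sdtContent, including its w:t text nodes, before the file reaches Tika. This normalizes the document.
Configuration Check: Rule out configuration before patching. Confirm the Tika version in use and check whether parser options change the result, for example the alternative SAX-based .docx extractor enabled through OfficeParserConfig.setUseSAXDocxExtractor. Most likely, though, this is a structural bug rather than a configuration issue.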
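The tree transformation behind the pre-processing workaround can be sketched as follows. This is an illustration in Python's standard library, not a POI-based implementation. The function name unwrap_sdts and the sample fragment are assumptions. Note that unwrapping discards everything in w:sdtPr (aliases, tags, data bindings), and that writing the result back into a .docx with ElementTree may rewrite namespace prefixes, so a production pre-processor would need more care than shown here.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
SDT = f"{{{W}}}sdt"
SDT_CONTENT = f"{{{W}}}sdtContent"
T = f"{{{W}}}t"


def unwrap_sdts(parent):
    """Replace each w:sdt under parent with the children of its w:sdtContent."""
    i = 0
    while i < len(parent):
        child = parent[i]
        if child.tag == SDT:
            content = child.find(SDT_CONTENT)
            replacement = list(content) if content is not None else []
            # Splice the content in place of the wrapper. The index is not
            # advanced, so a nested w:sdt that was just lifted is unwrapped too.
            parent[i:i + 1] = replacement
        else:
            unwrap_sdts(child)
            i += 1


# Hypothetical fragment with one SDT nested inside another.
FRAGMENT = (
    f'<w:body xmlns:w="{W}"><w:sdt><w:sdtPr/><w:sdtContent><w:p>'
    '<w:r><w:t>Outer text </w:t></w:r>'
    '<w:sdt><w:sdtPr/><w:sdtContent><w:r><w:t>LC Chat 04</w:t></w:r>'
    '</w:sdtContent></w:sdt></w:p></w:sdtContent></w:sdt></w:body>'
)

root = ET.fromstring(FRAGMENT)
unwrap_sdts(root)
flattened_text = "".join(t.text or "" for t in root.iter(T))
print(repr(flattened_text))  # 'Outer text LC Chat 04'
```

After the call, no w:sdt elements remain in the tree and the runs appear in their original document order, which is the normalized form a parser that mishandles nesting can still read.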

Why Juniors Miss It

Linear Thinking: Juniors often test with “happy path” documents where content controls are used individually, not nested deeply inside one another.
Visual vs. Code: They rely on how Word renders the document, which shows the merged text correctly, and do not inspect the underlying XML, where the nesting that trips up the parser is visible.
Library Trust: There is a tendency to assume that standard libraries like Apache Tika handle all valid OOXML structures perfectly, overlooking that valid XML doesn’t always guarantee valid parser logic for every edge case.