Apache Tika failing to read content from nested content control
Summary
We investigated an issue where Apache Tika fails to extract text from a nested Structured Document Tag (SDT), also known as a content control, in a .docx file. In the document XML, one SDT sits inside another SDT’s sdtContent element. While Tika correctly extracts text from the outer and non-nested elements, the text …
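To make the nesting concrete, here is a minimal sketch in Python (not Tika's own code) that walks a WordprocessingML fragment and records how many SDTs enclose each text run. The `SAMPLE` fragment and the function name `text_runs_with_depth` are hypothetical, made up for illustration; only the `w:sdt`, `w:sdtContent`, `w:p`, `w:r`, and `w:t` element names and the namespace URI come from the WordprocessingML format.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by word/document.xml
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
SDT = f"{{{W}}}sdt"
T = f"{{{W}}}t"

# Hypothetical fragment: an SDT nested inside another SDT's sdtContent.
SAMPLE = f"""
<w:body xmlns:w="{W}">
  <w:sdt>
    <w:sdtContent>
      <w:p><w:r><w:t>outer text</w:t></w:r></w:p>
      <w:sdt>
        <w:sdtContent>
          <w:p><w:r><w:t>nested text</w:t></w:r></w:p>
        </w:sdtContent>
      </w:sdt>
    </w:sdtContent>
  </w:sdt>
</w:body>
""".strip()


def text_runs_with_depth(xml_source):
    """Return (sdt_depth, text) for every w:t element, in document order."""
    root = ET.fromstring(xml_source)
    runs = []

    def walk(node, depth):
        for child in node:
            if child.tag == SDT:
                # Entering a content control: one more level of nesting.
                walk(child, depth + 1)
            elif child.tag == T:
                runs.append((depth, child.text or ""))
            else:
                walk(child, depth)

    walk(root, 0)
    return runs


if __name__ == "__main__":
    for depth, text in text_runs_with_depth(SAMPLE):
        print(depth, text)
```

Text at depth 2 or more sits inside a nested content control, which is the case described above. To inspect a real file, the same function could be given the bytes returned by `zipfile.ZipFile(path).read("word/document.xml")`, so its output can be compared against what Tika extracts.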