AI and Marketing Content
The Microsoft Revolution
About 20 years ago there was a major shift in document processing that had profound repercussions for AI, years before the current AI wave began. With Office 2007, Microsoft changed its default document formats. Until then, Microsoft had preferred closed, proprietary formats that locked customers' content into silos of information. But during the 2000s the rise of OpenOffice.org (now Apache OpenOffice) and its OpenDocument format, along with governmental demands for interoperability, put pressure on Microsoft to introduce Office Open XML (OOXML). Of course there were added benefits too, since competition alone rarely encourages change. According to Microsoft, files in the new format were up to 75% smaller, recovered better from corruption and were more secure.
For those willing to adapt, it also opened up options for creating and reading content programmatically, because the files are essentially ZIP archives of XML. This meant that documents, spreadsheets and presentations were equally accessible to humans and to code.
And this is the game-changer that is key to AI. To be ingested programmatically, content needs to be readable by code. This openness makes the content useful, rather than leaving it in a silo that is of no use outside the system containing it and irrelevant to the outside world.
This is what python-docx and tools that use it, like docling, leverage for ingesting documents for RAG or passing directly to LLMs.
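To make the "zip files of content" point concrete, here is a minimal sketch using only the Python standard library. It builds a toy archive in memory and reads it back, so the XML body and the helper names (build_package, read_paragraphs) are illustrative assumptions. A real .docx contains more parts, such as [Content_Types].xml and relationship files, and libraries like python-docx handle many details this sketch ignores.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used inside word/document.xml
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Hypothetical, heavily stripped-down document body, for illustration only.
DOCUMENT_XML = f"""<?xml version="1.0" encoding="UTF-8"?>
<w:document xmlns:w="{W}">
  <w:body>
    <w:p>
      <w:pPr><w:pStyle w:val="Heading1"/></w:pPr>
      <w:r><w:t>Quarterly Results</w:t></w:r>
    </w:p>
    <w:p>
      <w:r><w:t>Revenue grew in all regions.</w:t></w:r>
    </w:p>
  </w:body>
</w:document>"""


def build_package() -> bytes:
    """Write a toy .docx-like ZIP archive into memory (not a complete .docx)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("word/document.xml", DOCUMENT_XML)
    return buffer.getvalue()


def read_paragraphs(package: bytes) -> list[tuple[str, str]]:
    """Return (style, text) for each paragraph in word/document.xml."""
    with zipfile.ZipFile(io.BytesIO(package)) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for p in root.iter(f"{{{W}}}p"):
        # The heading is marked by a named style, not by font size or position.
        style = p.find(f"{{{W}}}pPr/{{{W}}}pStyle")
        name = style.get(f"{{{W}}}val") if style is not None else "Normal"
        text = "".join(t.text or "" for t in p.iter(f"{{{W}}}t"))
        paragraphs.append((name, text))
    return paragraphs


paragraphs = read_paragraphs(build_package())
for style, text in paragraphs:
    print(f"{style}: {text}")
```

The point of the sketch is that the heading arrives labelled as a heading, so an ingestion pipeline does not have to guess the structure from appearance.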
But what about other file types?
The PDF Problem
Adobe also embraced openness, relinquishing control of PDF to the International Organization for Standardization (ISO), which published the format as ISO 32000. But there is a big difference compared to the Microsoft formats: PDF is designed for printing. It is a digital printing format, not a data structure. Whereas a heading in a .docx file is a semantic structure, a heading in a PDF is better thought of as an instruction to go to a specific location on the page, use a specific font size and draw a set of characters. PDFs can store text out of reading order, tables are a collection of lines and floating text, and custom font encodings and ligatures can distort the extracted text.
The result is a format designed for humans to read on screen or in print, but one with inherent barriers to digital extraction.
A famous example is the "vegetative electron microscopy" blunder, a classic case of a layout-induced error caused by text being read across columns rather than down them. In a 1959 paper in "Bacteriological Reviews", the word "vegetative" in one column was incorrectly merged with "electron microscopy" in the next, creating a nonsensical technical term. This "tortured phrase" was subsequently ingested into large-scale training datasets, leading AI models to treat it as a legitimate concept. It then crept into over twenty peer-reviewed papers as authors relied on AI-generated text.
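The cross-column merge can be reproduced with a small simulation. This is a sketch under stated assumptions: the Fragment class, the coordinates and the sample sentences are all made up for illustration (they are not the wording of the 1959 paper), and real parsers use far more elaborate layout analysis than a single column split.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fragment:
    """A run of text placed at a page position (origin at top-left)."""
    x: float
    y: float
    text: str


# Hypothetical two-column page: left column at x=50, right column at x=300.
PAGE = [
    Fragment(50, 100, "spores and the vegetative"),
    Fragment(50, 120, "cells were then examined."),
    Fragment(300, 100, "electron microscopy of"),
    Fragment(300, 120, "thin sections showed this."),
]


def read_across(fragments):
    """Naive order: top to bottom, then left to right, ignoring columns."""
    ordered = sorted(fragments, key=lambda f: (f.y, f.x))
    return " ".join(f.text for f in ordered)


def read_by_column(fragments, column_split):
    """Column-aware order: finish the left column before starting the right."""
    ordered = sorted(fragments, key=lambda f: (f.x >= column_split, f.y, f.x))
    return " ".join(f.text for f in ordered)


naive = read_across(PAGE)
aware = read_by_column(PAGE, column_split=175)

print(naive)
print(aware)
```

Sorting by vertical position alone stitches the end of a left-column line onto the start of the right-column line beside it, which is how an unrelated pair of phrases can fuse into one term. The column-aware version needs to know where the columns are, and the PDF itself does not record that.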
Partial Solution
There are various general-purpose Python libraries for reading PDFs, like pypdf, pdfplumber, and PyMuPDF. But the problem is that none of them reads every PDF reliably. Even specialised libraries like docling, Marker, and Unstructured fail on some elements. docling, for example, reads headings and tables well, but fails with ligatures.
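One partial workaround for the ligature case is Unicode normalisation after extraction. The sketch below assumes the parser returned the ligatures as Unicode presentation forms such as U+FB01; the sample string and the expand_ligatures helper are hypothetical. It will not help when a custom font encoding maps the glyph to the wrong character or to none at all.

```python
import unicodedata

# Text as a parser might return it when the PDF maps ligature glyphs to
# Unicode presentation forms: U+FB01 (fi), U+FB02 (fl), U+FB00 (ff).
extracted = "The \ufb01rst work\ufb02ow is e\ufb00ective."


def expand_ligatures(text: str) -> str:
    """Fold compatibility characters, including Latin ligatures, to plain letters."""
    return unicodedata.normalize("NFKC", text)


cleaned = expand_ligatures(extracted)
print(cleaned)  # The first workflow is effective.
```

NFKC also folds other compatibility characters, such as superscript digits and non-breaking spaces, so it is worth checking that this is acceptable for the content being ingested.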
Why Marketing, and Why They Need to Change
Why do PDFs have ligatures? Typically they appear in content generated by marketing departments, whose priority is creating content that looks good to humans. But by making content unusable to AI, they're making it less relevant to humans, because AI is increasingly becoming the bridge to digital content.
There are technical approaches that can help, like PDF/UA, which requires the PDF to carry semantic structure tags. But this only helps at the point of creating the PDF, may not solve every problem (image readability, for example), and only works if the export to PDF function does not fail the Matterhorn Protocol.
A simpler solution may be for marketing departments to focus not solely on good-looking content with complex fonts chosen for human readers, but more on dual-purpose content that works for both people and machines. It will be interesting to see how things change.
But change they must.