Where I got stuck this week

RAG over a messy SharePoint got complicated fast. A short log of what went sideways.

The Author
Azure · AI · Infrastructure

This is a short one — just a log of what went sideways this week so I don’t forget it.

The problem

We’re trying to build a RAG pipeline over our SharePoint content. The content is messy: old Word docs, PDFs with scanned pages, some pages that haven’t been touched since 2019.

The chunking strategy that works fine on clean markdown absolutely falls over on a 60-page Word doc with track changes left in.

What broke

Three things, roughly in order of how painful they were:

  1. Track changes bloat: The extracted text included every revision, not just the accepted text. The chunks were nonsense.
  2. Table extraction: Tables in Word docs come out as tab-separated chaos. The model couldn’t make sense of them.
  3. Headings as context: We were chunking by paragraph, so headings ended up either alone in their own chunk or attached to the wrong content.
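For the track changes item, here is a minimal sketch of reading a .docx as it would look with every change accepted. It uses only the standard library and relies on how WordprocessingML marks revisions: deleted runs sit inside `w:del` and moved-away runs inside `w:moveFrom`, so skipping those subtrees leaves the accepted wording. The function names are made up for illustration, and this is not what our pipeline ended up using. It also ignores edge cases such as deleted paragraph marks and text boxes.

```python
import zipfile
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
# Elements whose content disappears once all changes are accepted.
REMOVED = {f"{{{W}}}del", f"{{{W}}}moveFrom"}


def _visible_text(node):
    """Yield text under node, skipping deleted and moved-away content."""
    for child in node:
        if child.tag in REMOVED:
            continue
        if child.tag == f"{{{W}}}t":
            yield child.text or ""
        elif child.tag == f"{{{W}}}tab":
            yield "\t"
        else:
            yield from _visible_text(child)


def accepted_paragraphs(document_xml):
    """Return non-empty paragraph texts with all tracked changes accepted."""
    root = ET.fromstring(document_xml)
    paragraphs = []
    for p in root.iter(f"{{{W}}}p"):
        text = "".join(_visible_text(p))
        if text.strip():
            paragraphs.append(text)
    return paragraphs


def read_docx(path):
    """Open a .docx (a zip archive) and extract its main document part."""
    with zipfile.ZipFile(path) as zf:
        return accepted_paragraphs(zf.read("word/document.xml"))
```

Calling `read_docx("some-file.docx")` returns one string per paragraph, which can then go into whatever chunker is already in place.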

What we tried

We switched from raw text extraction to the Azure AI Document Intelligence service for the Word docs. It handles tables properly and strips revision markup. It’s slower, but the output is actually usable.
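Once tables arrive as structured cells instead of tab-separated text, one option is to render them as Markdown before chunking, so the model sees rows and columns. The sketch below assumes each cell is a plain dict with `row_index`, `column_index` and `content`, loosely modeled on how layout extractors describe table cells. That shape and the function name are assumptions for illustration, not the service's actual response objects. It also treats row 0 as the header and ignores merged cells.

```python
def table_to_markdown(cells, row_count, column_count):
    """Render a grid of table cells as a Markdown table.

    `cells` is a list of dicts with row_index, column_index and content
    (an assumed shape for this sketch). Row 0 is treated as the header.
    """
    grid = [["" for _ in range(column_count)] for _ in range(row_count)]
    for cell in cells:
        # Collapse line breaks and escape pipes so each cell stays on one row.
        text = " ".join(cell["content"].split()).replace("|", "\\|")
        grid[cell["row_index"]][cell["column_index"]] = text

    lines = []
    for i, row in enumerate(grid):
        lines.append("| " + " | ".join(row) + " |")
        if i == 0:
            # Separator row that Markdown requires after the header.
            lines.append("|" + " --- |" * column_count)
    return "\n".join(lines)
```

Keeping a whole table inside a single chunk, rather than letting the chunker split it mid-row, is probably worth the extra tokens.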

We still haven’t solved the heading context problem cleanly. That’s next week’s problem.
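One direction to try, sketched here but not tested against the real content: keep a running trail of the headings in scope and attach it to every body paragraph, so no chunk loses its section. The input shape is an assumption. It expects a list of `(level, text)` pairs where level 0 means body text and 1 and up are heading levels, which would have to come from the extraction step.

```python
def chunk_with_headings(blocks):
    """Attach the current heading trail to each body paragraph.

    `blocks` is a list of (level, text) pairs: level >= 1 for a heading,
    0 for body text (an assumed input shape for this sketch).
    """
    trail = []   # (level, heading) pairs currently in scope, outermost first
    chunks = []
    for level, text in blocks:
        if level > 0:
            # A new heading closes any heading at the same or a deeper level.
            while trail and trail[-1][0] >= level:
                trail.pop()
            trail.append((level, text))
        else:
            context = " > ".join(heading for _, heading in trail)
            chunks.append({"context": context, "text": text})
    return chunks
```

Each chunk then carries something like `Backups > Retention` alongside its text, which can be prepended before embedding. Headings never become chunks on their own, which covers the "heading alone" case from the list above.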