PDF Corpus Scaling Incident
Date: 2026-03-29
Severity: Quality regression (eval pass rate dropped from 99.7% to 98.7%)
Resolution: Retroactive boilerplate cleanup + PDF extraction pipeline fix
Lesson: RAG systems that handle mixed content types (HTML + PDF) need per-format boilerplate detection
The pipeline fix has been operational for ~6 weeks across multiple nightly auto-ingest cycles (corpus: 5,841 completed documents). No brochure-cover boilerplate chunks have reappeared since the fix. The retroactive deletion (449 chunks) and the preventive _extract_pdf_content change shipped together, and the regression has not recurred.
Timeline
| Time | Event |
|---|---|
| March 27 | Eval baseline: 99.7% (298/299), ~4,400 chunks from HTML pages |
| March 28 | Ingested 573 PDF brochures from ZOL Novation → 6,485 new chunks |
| March 29 | Eval regression: 98.7% (295/299), 3 new failures (4 total) |
| March 29 | Root cause identified: brochure boilerplate flooding retrieval |
| March 29 | Retroactive fix: 449 boilerplate chunks deleted, PDF pipeline patched |
The Problem
After ingesting 573 PDF brochures, the index grew from ~4,400 to ~10,900 chunks, with 59.6% of all chunks now coming from brochures. The boilerplate detection system (7 layers, documented in Content Deduplication) was designed for HTML content and worked well for web pages, but it had a critical blind spot: PDF-extracted markdown bypassed all text-level boilerplate filtering.
Every ZOL brochure has a standardized cover page:
### [Brochure Title]
**ZOL GENK** **ZOL MAAS EN KEMPEN**
Campus Sint-Jan Campus Sint-Barbara
Synaps Park 1 Bessemerstraat 478
B 3600 Genk B 3620 Lanaken
Medisch Centrum André Dumont
Stalenstraat 2a
B 3600 Genk
This identical block appeared in 449 chunks across 441 brochures, each with a slightly different title prepended (making content hashes unique, defeating the cross-document dedup).
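The effect of the prepended title on hash-based dedup can be reproduced in a few lines. The sketch below is illustrative only: the SHA-256 over whitespace-normalized text is an assumed hashing scheme (the source does not say which hash the cross-document layer uses), and the two brochure titles are hypothetical.

```python
import hashlib

# Shortened copy of the shared cover block from the example above.
COVER_BLOCK = (
    "**ZOL GENK** **ZOL MAAS EN KEMPEN**\n"
    "Campus Sint-Jan Campus Sint-Barbara\n"
    "Synaps Park 1 Bessemerstraat 478"
)


def content_hash(text: str) -> str:
    """Hash whitespace-normalized text (assumed scheme, not the pipeline's actual one)."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def cover_chunk(title: str) -> str:
    """Build a cover chunk: a heading line followed by the shared block."""
    return f"### {title}\n{COVER_BLOCK}"


def drop_heading(chunk: str) -> str:
    """Remove a leading markdown heading line, if there is one."""
    first_line, _, rest = chunk.partition("\n")
    return rest if first_line.startswith("#") else chunk


# Hypothetical brochure titles, for illustration only.
chunk_a = cover_chunk("Chemotherapie")
chunk_b = cover_chunk("Knieprothese")

# Whole-chunk hashes differ because of the title; the bodies are identical.
full_hashes_differ = content_hash(chunk_a) != content_hash(chunk_b)
body_hashes_match = content_hash(drop_heading(chunk_a)) == content_hash(drop_heading(chunk_b))
print(full_hashes_differ, body_hashes_match)
```

Hashing the chunk body without its heading would have collapsed these chunks into one, which is one reason an exact-hash layer alone is a weak defense against templated content.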
Impact on Retrieval Quality
Before: Balanced Index
Web pages (HTML): ~4,400 chunks (100% of index)
- Department pages, doctor profiles, conditions, treatments
- Well-targeted content, each page about one topic
After: Brochure-Dominated Index
Brochures (PDF): 6,485 chunks (59.6% of index)
- 449 pure boilerplate (campus address blocks)
- 51 chunks about chemotherapy from a single brochure (br0297)
- Dense, procedural content competing with navigational pages
Web pages (HTML): 4,394 chunks (40.4% of index)
- Same quality content, now outnumbered 1.5:1
Concrete Failure Examples
GQ-102: "Waar kan ik terecht voor chemotherapie bij borstkanker?" ("Where can I go for chemotherapy for breast cancer?")
Expected: Borstcentrum department page (1 chunk, 482 chars)
Retrieved: Chemotherapy brochure br0297 (51 chunks mentioning "chemotherapie")
The brochure's 51 chunks about chemotherapy procedures dominated semantic retrieval, pushing the authoritative Borstcentrum department page — which directly answers "where can I go?" — below the relevance threshold. The LLM generated an answer from brochure content (treatment locations, room numbers) instead of the department overview.
GQ-093: "Zijn er dokters die zowel op Sint-Jan als op André Dumont werken?" ("Are there doctors who work at both Sint-Jan and André Dumont?")
Expected: Doctor-campus relationship data
Retrieved: 449 brochure cover pages mentioning both campus names as address boilerplate
The retrieval system found hundreds of high-similarity matches for "Sint-Jan" + "André Dumont" — but they were all brochure address blocks, not actual doctor-campus relationship data. The signal was completely drowned in noise.
Why Existing Deduplication Didn't Catch This
The system has 7 deduplication layers (see Content Deduplication):
| Layer | Why It Didn't Help |
|---|---|
| 1. URL dedup | Each brochure has a unique URL |
| 2. Title dedup | Each brochure has a unique title |
| 3. Canonical URL | PDFs don't have <link rel="canonical"> |
| 4. Boilerplate CSS | CSS selectors don't apply to PDFs (no DOM) |
| 5. Cross-doc hash | Each cover page has a different title prepended, making hashes unique |
| 6. Language filter | Content is in Dutch (correct language) |
| 7. Re-ingestion | Each brochure was new content |
The architectural gap: Text-level boilerplate patterns (strip_text_boilerplate) were applied only to HTML-extracted markdown in _extract_html_from_response. PDF extraction in _pdf_extract_in_process returned raw markdown directly, bypassing all text filtering.
Resolution
Retroactive Fix (Immediate)
Direct SQL deletion of 449 boilerplate chunks from the pilot database:
DELETE FROM app.document_chunks
WHERE content_length < 1000
AND content ILIKE '%ZOL GENK%'
AND content ILIKE '%Campus Sint-Jan%'
AND (content ILIKE '%B 3600%' OR content ILIKE '%Synaps Park%');
This removed the boilerplate without requiring re-ingestion of 573 PDFs.
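Before running a destructive DELETE like this one, it can help to preview which chunks the predicate would match. The following is a rough Python mirror of the WHERE clause, not part of the original fix. It assumes content_length corresponds to the character length of content, and it emulates ILIKE with a case-insensitive substring test.

```python
def is_cover_boilerplate(content: str, max_length: int = 1000) -> bool:
    """Approximate the DELETE predicate for a dry run over exported chunks."""
    text = content.casefold()  # emulate ILIKE's case-insensitivity
    return (
        len(content) < max_length  # assumed equivalent of content_length < 1000
        and "zol genk" in text
        and "campus sint-jan" in text
        and ("b 3600" in text or "synaps park" in text)
    )


# Hypothetical sample chunks for illustration.
cover = (
    "### Chemotherapie\n"
    "**ZOL GENK** **ZOL MAAS EN KEMPEN**\n"
    "Campus Sint-Jan Campus Sint-Barbara\n"
    "Synaps Park 1 Bessemerstraat 478\n"
    "B 3600 Genk B 3620 Lanaken"
)
department_page = "Het Borstcentrum bevindt zich op Campus Sint-Jan."
long_chunk = cover + "\n" + "x" * 1000

matches = [is_cover_boilerplate(c) for c in (cover, department_page, long_chunk)]
print(matches)
```

The length cap is what keeps the predicate from deleting long chunks that merely quote the address block alongside real content.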
Pipeline Fix (Preventive)
- _extract_pdf_content now applies boilerplate stripping: the same strip_text_boilerplate() from site_config runs on PDF output, not just HTML
- New _strip_pdf_cover_boilerplate() function: regex patterns specifically targeting ZOL brochure cover blocks (campus address patterns with postal codes)
- DB-configurable: hospital-specific PDF boilerplate patterns can be added via the Pipeline Config UI, same as HTML patterns
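A minimal sketch of what a cover-block stripper along these lines might look like is shown here. The function name mirrors the one mentioned above, but the patterns and the line-by-line approach are assumptions for illustration; the actual patterns are stored in the pipeline configuration and are not reproduced in this report.

```python
import re

# Assumed full-line patterns derived from the cover block shown earlier.
COVER_LINE_PATTERNS = [
    re.compile(r"\*\*ZOL GENK\*\*\s+\*\*ZOL MAAS EN KEMPEN\*\*"),
    re.compile(r"Campus Sint-Jan\s+Campus Sint-Barbara"),
    re.compile(r"Synaps Park 1\s+Bessemerstraat 478"),
    re.compile(r"Medisch Centrum André Dumont"),
    re.compile(r"Stalenstraat 2a"),
    # One or two "B <postal code> <town>" entries on a line.
    re.compile(r"B \d{4} \w+(?:\s+B \d{4} \w+)?"),
]


def strip_pdf_cover_boilerplate(markdown: str) -> str:
    """Drop lines that fully match a known cover-page pattern."""
    kept = [
        line
        for line in markdown.splitlines()
        if not any(p.fullmatch(line.strip()) for p in COVER_LINE_PATTERNS)
    ]
    return "\n".join(kept).strip()


sample = (
    "### Chemotherapie\n"
    "**ZOL GENK** **ZOL MAAS EN KEMPEN**\n"
    "Campus Sint-Jan Campus Sint-Barbara\n"
    "Synaps Park 1 Bessemerstraat 478\n"
    "B 3600 Genk B 3620 Lanaken\n"
    "Medisch Centrum André Dumont\n"
    "Stalenstraat 2a\n"
    "B 3600 Genk\n"
    "\n"
    "Uw behandeling start op de dagkliniek."
)
cleaned = strip_pdf_cover_boilerplate(sample)
print(cleaned)
```

Matching whole lines rather than substrings keeps body sentences that mention a campus name intact. A chunk that consists only of a title plus the cover block shrinks to the title alone, so a minimum-length check after stripping would still be needed to drop it.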
Measured Result
| Metric | Before Fix | After Fix |
|---|---|---|
| Total chunks | 10,879 | 10,430 |
| Brochure chunks | 6,485 (59.6%) | 6,036 (57.9%) |
| Boilerplate chunks | 449 | 0 |
| Campus address pollution | 449 retrieval-polluting chunks | Eliminated |
Lessons Learned
1. Content Type Parity
Every content extraction path must have the same quality gates. The HTML pipeline had 7 dedup layers; the PDF pipeline had 0 text-level filtering. When we scaled from HTML-only to mixed HTML+PDF, the untested path became the dominant content source.
Principle: If you add a new content type (PDF, DOCX, email), test the full quality pipeline against it before bulk ingestion.
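One way to make this principle checkable is a parity test that pushes the same boilerplate sample through every extraction path and reports which formats let it through. The sketch below uses toy extractors and a toy filter. All function names are hypothetical stand-ins, not the pipeline's real functions.

```python
from typing import Callable, Dict, List

MARKER = "ZOL GENK"
SAMPLE = "### Titel\n**ZOL GENK** **ZOL MAAS EN KEMPEN**\nEchte inhoud."


def toy_text_filter(markdown: str) -> str:
    """Toy stand-in for a shared text filter: drops lines containing the marker."""
    return "\n".join(line for line in markdown.splitlines() if MARKER not in line)


def extract_html(raw: str) -> str:
    return toy_text_filter(raw)  # HTML path: filtered


def extract_pdf_before_fix(raw: str) -> str:
    return raw  # pre-fix PDF path: raw markdown returned as-is


def extract_pdf_after_fix(raw: str) -> str:
    return toy_text_filter(raw)  # post-fix PDF path: same filter as HTML


def formats_leaking(extractors: Dict[str, Callable[[str], str]], sample: str, marker: str) -> List[str]:
    """Return the formats whose output still contains the boilerplate marker."""
    return sorted(name for name, fn in extractors.items() if marker in fn(sample))


before = formats_leaking({"html": extract_html, "pdf": extract_pdf_before_fix}, SAMPLE, MARKER)
after = formats_leaking({"html": extract_html, "pdf": extract_pdf_after_fix}, SAMPLE, MARKER)
print(before, after)
```

Registering every new format in the same dictionary means a missing quality gate shows up as a failing test rather than as an eval regression after bulk ingestion.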
2. Corpus Composition Matters More Than Corpus Size
Going from ~4,400 to ~10,900 chunks did not in itself cause the regression; the 59.6% brochure share did. A balanced corpus of the same size would likely have performed fine. The problem was that one content type (brochures) overwhelmed the index with dense, procedural content that competed with navigational content for the same queries.
Principle: Monitor content type ratios in the index. If any single category exceeds 50%, investigate whether it's diluting retrieval for other categories.
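A composition check of this kind is cheap to compute from per-type chunk counts. The sketch below uses the counts reported in this incident. The category names and the helper functions are hypothetical, and the 50% threshold is the one suggested above.

```python
from typing import Dict, List


def content_type_shares(chunk_counts: Dict[str, int]) -> Dict[str, float]:
    """Return each content type's share of the index as a fraction."""
    total = sum(chunk_counts.values())
    if total == 0:
        return {}
    return {name: count / total for name, count in chunk_counts.items()}


def over_threshold(shares: Dict[str, float], threshold: float = 0.5) -> List[str]:
    """List content types whose share exceeds the threshold."""
    return sorted(name for name, share in shares.items() if share > threshold)


# Chunk counts from this incident, before and after the boilerplate cleanup.
before_fix = content_type_shares({"brochure_pdf": 6485, "web_html": 4394})
after_fix = content_type_shares({"brochure_pdf": 6036, "web_html": 4394})

print(round(before_fix["brochure_pdf"] * 100, 1))
print(round(after_fix["brochure_pdf"] * 100, 1))
print(over_threshold(after_fix))
```

Brochures still exceed the 50% threshold after the cleanup, so a check like this would keep flagging the index for review even with the boilerplate gone.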
3. Boilerplate Is Format-Specific
HTML boilerplate (nav menus, cookie banners, footers) looks nothing like PDF boilerplate (cover pages, address blocks, legal disclaimers). A boilerplate detection system designed for one format won't catch the other. Each format needs its own patterns.
Principle: Boilerplate detection must be format-aware, not one-size-fits-all.
4. Test at Production Scale Before Declaring Victory
The system passed 99.7% on 299 questions with HTML-only content. The regression only appeared after ingesting real-world PDFs. Academic evaluation on clean data doesn't prove production readiness; you need to test with the messy, heterogeneous content that production systems actually handle.
Principle: Run the full eval suite after every major content ingestion, not just after code changes.
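A post-ingestion gate can be as simple as comparing the new pass rate against the stored baseline. The sketch below uses the pass counts from the timeline. The function names and the tolerance parameter are assumptions, not part of the existing eval tooling.

```python
def pass_rate(passed: int, total: int) -> float:
    """Fraction of golden questions that passed."""
    if total <= 0:
        raise ValueError("total must be positive")
    return passed / total


def regression_detected(baseline: float, current: float, tolerance: float = 0.0) -> bool:
    """True when the pass rate dropped by more than the allowed tolerance."""
    return (baseline - current) > tolerance


baseline = pass_rate(298, 299)  # March 27, HTML-only index
current = pass_rate(295, 299)   # March 29, after brochure ingestion

print(round(baseline * 100, 1), round(current * 100, 1))
print(regression_detected(baseline, current, tolerance=0.005))
```

With a tolerance of half a percentage point, the one-point drop seen in this incident would have tripped the gate on the first eval run after ingestion.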
Related
- Content Deduplication — The 7-layer dedup system (now extended to PDFs)
- Document Ingestion Pipeline — Subprocess-isolated PDF extraction
- Evaluation Overview — Golden question methodology and eval reports