PDF Corpus Scaling Incident

Date: 2026-03-29
Severity: Quality regression (eval pass rate dropped from 99.7% to 98.7%)
Resolution: Retroactive boilerplate cleanup + PDF extraction pipeline fix
Lesson: RAG systems that handle mixed content types (HTML + PDF) need per-format boilerplate detection

Status (May 2026): stable

The pipeline fix has been in production for about six weeks, spanning multiple nightly auto-ingest cycles (the corpus now holds 5,841 completed documents). No brochure-cover boilerplate chunks have reappeared since the fix. Both the retroactive deletion (449 chunks) and the preventive _extract_pdf_content change are in place, and the regression has not recurred.


Timeline

| Time | Event |
|---|---|
| March 27 | Eval baseline: 99.7% (298/299), ~3,500 chunks from HTML pages |
| March 28 | Ingested 573 PDF brochures from ZOL Novation → 6,485 new chunks |
| March 29 | Eval regression: 98.7% (295/299), 4 failures in total (3 more than the baseline) |
| March 29 | Root cause identified: brochure boilerplate flooding retrieval |
| March 29 | Retroactive fix: 449 boilerplate chunks deleted, PDF pipeline patched |

The Problem

After ingesting 573 PDF brochures, the index grew from ~4,400 to ~10,900 chunks, with 59.6% of all chunks now coming from brochures. The boilerplate detection system (7 layers, documented in Content Deduplication) was designed for HTML content and handled web pages well. It had a critical blind spot, however: PDF-extracted markdown bypassed all text-level boilerplate filtering.

Every ZOL brochure has a standardized cover page:

### [Brochure Title]

**ZOL GENK** **ZOL MAAS EN KEMPEN**

Campus Sint-Jan Campus Sint-Barbara
Synaps Park 1 Bessemerstraat 478
B 3600 Genk B 3620 Lanaken

Medisch Centrum André Dumont
Stalenstraat 2a
B 3600 Genk

This identical block appeared in 449 chunks across 441 brochures, each with a slightly different title prepended (making content hashes unique, defeating the cross-document dedup).
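The effect of the prepended title on exact-match hashing can be illustrated with a minimal sketch. The hash function (SHA-256), the abbreviated cover block, the brochure titles, and the helper names are all assumptions made for illustration; the source does not show how the cross-document hash is computed.

```python
import hashlib

# Abbreviated stand-in for the shared cover block (illustrative only).
COVER_BLOCK = (
    "**ZOL GENK** **ZOL MAAS EN KEMPEN**\n\n"
    "Campus Sint-Jan Campus Sint-Barbara\n"
)


def content_hash(text: str) -> str:
    """Exact-match content hash. SHA-256 is an assumption, not the documented choice."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def make_cover_chunk(title: str) -> str:
    """Build a cover chunk the way the extractor appears to: title heading, then the block."""
    return f"### {title}\n\n{COVER_BLOCK}"


def strip_title(chunk: str) -> str:
    """Drop a leading markdown heading line so only the shared block remains."""
    lines = chunk.split("\n")
    if lines and lines[0].startswith("#"):
        lines = lines[1:]
    return "\n".join(lines).strip()


# Two hypothetical brochure titles in front of the same block.
chunk_a = make_cover_chunk("Chemotherapie")
chunk_b = make_cover_chunk("Knieprothese")

# The titles differ, so the full-chunk hashes differ and exact dedup sees two unique chunks.
hashes_differ = content_hash(chunk_a) != content_hash(chunk_b)

# Once the title line is removed, the remaining text is byte-identical.
hashes_match_after_strip = (
    content_hash(strip_title(chunk_a)) == content_hash(strip_title(chunk_b))
)
```

Any exact-hash dedup behaves this way: a single differing line is enough to make two otherwise identical chunks look unrelated.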

Impact on Retrieval Quality

Before: Balanced Index

Web pages (HTML): ~4,400 chunks (100% of index)
- Department pages, doctor profiles, conditions, treatments
- Well-targeted content, each page about one topic

After: Brochure-Dominated Index

Brochures (PDF): 6,485 chunks (59.6% of index)
- 449 pure boilerplate (campus address blocks)
- 51 chunks about chemotherapy from a single brochure (br0297)
- Dense, procedural content competing with navigational pages

Web pages (HTML): 4,394 chunks (40.4% of index)
- Same quality content, now outnumbered 1.5:1

Concrete Failure Examples

GQ-102: "Waar kan ik terecht voor chemotherapie bij borstkanker?"

Expected: Borstcentrum department page (1 chunk, 482 chars)
Retrieved: Chemotherapy brochure br0297 (51 chunks mentioning "chemotherapie")

The brochure's 51 chunks about chemotherapy procedures dominated semantic retrieval, pushing the authoritative Borstcentrum department page — which directly answers "where can I go?" — below the relevance threshold. The LLM generated an answer from brochure content (treatment locations, room numbers) instead of the department overview.

GQ-093: "Zijn er dokters die zowel op Sint-Jan als op André Dumont werken?"

Expected: Doctor-campus relationship data
Retrieved: 449 brochure cover pages mentioning both campus names as address boilerplate

The retrieval system found hundreds of high-similarity matches for "Sint-Jan" + "André Dumont" — but they were all brochure address blocks, not actual doctor-campus relationship data. The signal was completely drowned in noise.

Why Existing Deduplication Didn't Catch This

The system has 7 deduplication layers (see Content Deduplication):

| Layer | Why It Didn't Help |
|---|---|
| 1. URL dedup | Each brochure has a unique URL |
| 2. Title dedup | Each brochure has a unique title |
| 3. Canonical URL | PDFs don't have <link rel="canonical"> |
| 4. Boilerplate CSS | CSS selectors don't apply to PDFs (no DOM) |
| 5. Cross-doc hash | Each cover page has a different title prepended, making hashes unique |
| 6. Language filter | Content is in Dutch (correct language) |
| 7. Re-ingestion | Each brochure was new content |

The architectural gap: Text-level boilerplate patterns (strip_text_boilerplate) were applied only to HTML-extracted markdown in _extract_html_from_response. PDF extraction in _pdf_extract_in_process returned raw markdown directly, bypassing all text filtering.

Resolution

Retroactive Fix (Immediate)

Direct SQL deletion of 449 boilerplate chunks from the pilot database:

DELETE FROM app.document_chunks
WHERE content_length < 1000
AND content ILIKE '%ZOL GENK%'
AND content ILIKE '%Campus Sint-Jan%'
AND (content ILIKE '%B 3600%' OR content ILIKE '%Synaps Park%');

This removed the boilerplate without requiring re-ingestion of 573 PDFs.
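Before running a destructive statement like the one above, the same predicate can be mirrored in application code and tried against a handful of sample chunks. The sketch below is one way to do that. It assumes that content_length equals the character length of content, and the function name is hypothetical.

```python
def matches_cover_predicate(content: str) -> bool:
    """Mirror the DELETE's WHERE clause: case-insensitive substring checks plus a length cap.

    Assumption: the content_length column equals len(content).
    """
    text = content.lower()  # ILIKE is case-insensitive
    return (
        len(content) < 1000
        and "zol genk" in text
        and "campus sint-jan" in text
        and ("b 3600" in text or "synaps park" in text)
    )


# Sample inputs: a cover chunk, the same text padded past the length cap,
# and a genuine content sentence that only mentions a campus.
cover_chunk = (
    "### Titel\n\n**ZOL GENK** **ZOL MAAS EN KEMPEN**\n\n"
    "Campus Sint-Jan Campus Sint-Barbara\n"
    "Synaps Park 1 Bessemerstraat 478\n"
    "B 3600 Genk B 3620 Lanaken\n"
)
long_chunk = cover_chunk + "x" * 1000
content_chunk = "Het Borstcentrum bevindt zich op Campus Sint-Jan."

would_delete = [matches_cover_predicate(c) for c in (cover_chunk, long_chunk, content_chunk)]
```

The length cap is what protects longer chunks that merely contain an address. Only short chunks that are mostly address block satisfy every condition.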

Pipeline Fix (Preventive)

  1. _extract_pdf_content now applies boilerplate stripping — the same strip_text_boilerplate() from site_config runs on PDF output, not just HTML
  2. New _strip_pdf_cover_boilerplate() function — regex patterns specifically targeting ZOL brochure cover blocks (campus address patterns with postal codes)
  3. DB-configurable — hospital-specific PDF boilerplate patterns can be added via the Pipeline Config UI, same as HTML patterns
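The source does not show the actual patterns in _strip_pdf_cover_boilerplate(), so the following is only a sketch of what line-level cover stripping could look like for the cover block shown earlier. The function name, the pattern list, and the street names hard-coded in it are assumptions derived from that one sample.

```python
import re

# Hypothetical whole-line patterns for the sample cover block; not the real pattern set.
_STREETS = r"(?:Synaps Park|Bessemerstraat|Stalenstraat)"
_COVER_LINE_PATTERNS = [
    # Campus header row, e.g. "**ZOL GENK** **ZOL MAAS EN KEMPEN**"
    re.compile(r"^\*\*ZOL [A-Z ]+\*\*(?:\s+\*\*ZOL [A-Z ]+\*\*)*$"),
    # Campus names, one or two per line
    re.compile(r"^Campus Sint-\w+(?:\s+Campus Sint-\w+)*$"),
    # Postal code + town, e.g. "B 3600 Genk B 3620 Lanaken"
    re.compile(r"^B \d{4} \w+(?:\s+B \d{4} \w+)*$"),
    # Street + house number, one or two per line
    re.compile(rf"^{_STREETS} \d+\w?(?:\s+{_STREETS} \d+\w?)*$"),
    re.compile(r"^Medisch Centrum André Dumont$"),
]


def strip_pdf_cover_boilerplate(markdown: str) -> str:
    """Remove lines that consist entirely of cover-page address boilerplate."""
    kept = [
        line
        for line in markdown.splitlines()
        if not any(p.match(line.strip()) for p in _COVER_LINE_PATTERNS)
    ]
    # Collapse the blank-line runs left behind by removed lines.
    return re.sub(r"\n{3,}", "\n\n", "\n".join(kept)).strip()


sample = (
    "### Chemotherapie\n\n"
    "**ZOL GENK** **ZOL MAAS EN KEMPEN**\n\n"
    "Campus Sint-Jan Campus Sint-Barbara\n"
    "Synaps Park 1 Bessemerstraat 478\n"
    "B 3600 Genk B 3620 Lanaken\n\n"
    "Medisch Centrum André Dumont\n"
    "Stalenstraat 2a\n"
    "B 3600 Genk\n\n"
    "## Wat is chemotherapie?\n\n"
    "U kan terecht op Campus Sint-Jan.\n"
)
cleaned = strip_pdf_cover_boilerplate(sample)
```

Each pattern is anchored to a whole line, so a prose sentence that mentions a campus survives. A chunk that was only the cover page is reduced to its title, which a minimum-length check downstream could then drop.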

Measured Result

| Metric | Before Fix | After Fix |
|---|---|---|
| Total chunks | 10,879 | 10,430 |
| Brochure chunks | 6,485 (59.6%) | 6,036 (57.9%) |
| Boilerplate chunks | 449 | 0 |
| Campus address pollution | 449 retrieval-polluting chunks | Eliminated |
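The figures in this table can be cross-checked with a few lines of arithmetic. The variable names are illustrative and the inputs are the numbers reported above.

```python
total_before = 10_879
brochure_before = 6_485
boilerplate_removed = 449

# All 449 deleted chunks were brochure chunks, so both totals shrink by the same amount.
total_after = total_before - boilerplate_removed        # 10,430
brochure_after = brochure_before - boilerplate_removed  # 6,036
html_chunks = total_before - brochure_before            # 4,394, unchanged by the fix

share_before = round(100 * brochure_before / total_before, 1)  # 59.6
share_after = round(100 * brochure_after / total_after, 1)     # 57.9
```

The deletion lowered the brochure share by only 1.7 percentage points. The fix removed the boilerplate, and the index remains brochure-heavy.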

Lessons Learned

1. Content Type Parity

Every content extraction path must have the same quality gates. The HTML pipeline had 7 dedup layers; the PDF pipeline had no text-level filtering at all. When we scaled from HTML-only to mixed HTML+PDF, the untested path became the dominant content source.

Principle: If you add a new content type (PDF, DOCX, email), test the full quality pipeline against it before bulk ingestion.
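One way to make parity structural rather than a matter of discipline is to route every extractor through a single post-processing gate. The sketch below assumes a simple registry. The extractor functions are placeholders, and the strip_text_boilerplate shown here is a stand-in with one made-up pattern, not the real function from site_config.

```python
from typing import Callable, Dict


def strip_text_boilerplate(markdown: str) -> str:
    """Stand-in for the shared text filter; the single pattern here is made up."""
    return "\n".join(
        line for line in markdown.splitlines()
        if line.strip().lower() != "cookie settings"
    )


def extract_html(raw: str) -> str:
    return raw  # placeholder for the HTML-to-markdown step


def extract_pdf(raw: str) -> str:
    return raw  # placeholder for the PDF-to-markdown step


# Hypothetical registry: adding a format means adding an entry here.
EXTRACTORS: Dict[str, Callable[[str], str]] = {
    "html": extract_html,
    "pdf": extract_pdf,
}


def extract(content_type: str, raw: str) -> str:
    """Run the format-specific extractor, then the shared gate for every format."""
    markdown = EXTRACTORS[content_type](raw)
    return strip_text_boilerplate(markdown)


html_out = extract("html", "Cookie settings\nInhoud")
pdf_out = extract("pdf", "Cookie settings\nInhoud")
```

With this shape, a newly registered format cannot bypass the text filter, because the filter is not the extractor's responsibility.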

2. Corpus Composition Matters More Than Corpus Size

Going from 4,400 to 10,900 chunks didn't cause the regression. The 59.6% brochure composition did. A balanced corpus with 10,900 chunks would likely have performed fine. The problem was that one content type (brochures) overwhelmed the index with dense, procedural content that competed with navigational content for the same queries.

Principle: Monitor content type ratios in the index. If any single category exceeds 50%, investigate whether it's diluting retrieval for other categories.
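A composition check along these lines can be quite small. The sketch below assumes each chunk carries a content-type label; the function name and label values are hypothetical, and the 50% threshold comes from the principle above.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def composition_report(
    chunk_types: Iterable[str], threshold: float = 0.5
) -> Tuple[Dict[str, float], List[str]]:
    """Return each content type's share of the index and the types above the threshold."""
    counts = Counter(chunk_types)
    total = sum(counts.values())
    shares = {kind: n / total for kind, n in counts.items()} if total else {}
    flagged = sorted(kind for kind, share in shares.items() if share > threshold)
    return shares, flagged


# Chunk counts from this incident, before the cleanup.
labels = ["pdf"] * 6485 + ["html"] * 4394
shares, flagged = composition_report(labels)
```

Run after each ingest, a check like this would have flagged the brochure share the day the PDFs landed, before the eval suite was rerun.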

3. Boilerplate Is Format-Specific

HTML boilerplate (nav menus, cookie banners, footers) looks nothing like PDF boilerplate (cover pages, address blocks, legal disclaimers). A boilerplate detection system designed for one format won't catch the other. Each format needs its own patterns.

Principle: Boilerplate detection must be format-aware, not one-size-fits-all.

4. Test at Production Scale Before Declaring Victory

The system passed 99.7% on 302 questions with HTML-only content. The regression only appeared after ingesting real-world PDFs. Academic evaluation on clean data doesn't prove production readiness — you need to test with the messy, heterogeneous content that production systems actually handle.

Principle: Run the full eval suite after every major content ingestion, not just after code changes.
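A post-ingestion gate for this can be as simple as comparing the new pass rate with the stored baseline. The sketch below is a minimal version with a hypothetical function name and tolerance parameter; running the eval suite itself is out of scope here.

```python
def passes_eval_gate(
    passed: int, total: int, baseline_rate: float, tolerance: float = 0.0
) -> bool:
    """Return True if the pass rate has not dropped below baseline minus tolerance."""
    if total <= 0:
        raise ValueError("total must be positive")
    return passed / total >= baseline_rate - tolerance


baseline = 298 / 299  # March 27 baseline from the timeline

gate_before_ingest = passes_eval_gate(298, 299, baseline)
gate_after_ingest = passes_eval_gate(295, 299, baseline)  # March 29 result
```

A small non-zero tolerance may be appropriate if individual questions are known to be flaky, but that is a judgment call the source does not address.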