PDF Corpus Scaling Incident

Date: 2026-03-29
Severity: Quality regression (eval pass rate dropped from 99.7% to 98.7%)
Resolution: Retroactive boilerplate cleanup + PDF extraction pipeline fix
Lesson: RAG systems that handle mixed content types (HTML + PDF) need per-format boilerplate detection

Status (May 2026): stable

The pipeline fix has been in production for ~6 weeks across multiple nightly auto-ingest cycles (corpus: 5,841 completed documents). No boilerplate chunks from brochure covers have reappeared since the fix. The retroactive deletion (449 chunks) and the preventive _extract_pdf_content change shipped together, and the regression has not recurred.

Timeline

Time	Event
March 27	Eval baseline: 99.7% (298/299), ~3,500 chunks from HTML pages
March 28	Ingested 573 PDF brochures from ZOL Novation → 6,485 new chunks
March 29	Eval regression: 98.7% (295/299) — 4 new failures
March 29	Root cause identified: brochure boilerplate flooding retrieval
March 29	Retroactive fix: 449 boilerplate chunks deleted, PDF pipeline patched

The Problem

After ingesting 573 PDF brochures, the index grew from ~4,400 to ~10,900 chunks, with 59.6% of all chunks now coming from brochures. The boilerplate detection system (7 layers, documented in Content Deduplication) was designed for HTML content and worked well for web pages. But it had a critical blind spot: PDF-extracted markdown bypassed all text-level boilerplate filtering.

Every ZOL brochure has a standardized cover page:

### [Brochure Title]

**ZOL GENK** **ZOL MAAS EN KEMPEN**

Campus Sint-Jan          Campus Sint-Barbara
Synaps Park 1            Bessemerstraat 478
B 3600 Genk              B 3620 Lanaken

Medisch Centrum André Dumont
Stalenstraat 2a
B 3600 Genk

This identical block appeared in 449 chunks across 441 brochures, each with a slightly different title prepended (making content hashes unique, defeating the cross-document dedup).
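
The effect of the prepended title on exact-match hashing can be illustrated with a small sketch. It assumes the cross-document layer hashes the full chunk text with SHA-256 (the source does not name the hash function), and the two brochure titles are hypothetical placeholders.

```python
import hashlib
import re

# Shared cover block, abbreviated from the example above.
COVER_BLOCK = (
    "**ZOL GENK** **ZOL MAAS EN KEMPEN**\n\n"
    "Campus Sint-Jan          Campus Sint-Barbara\n"
    "Synaps Park 1            Bessemerstraat 478\n"
    "B 3600 Genk              B 3620 Lanaken\n"
)


def content_hash(text: str) -> str:
    """Exact-match hash (assumption: SHA-256 over the full chunk text)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def strip_leading_heading(text: str) -> str:
    """Drop one leading markdown heading line, i.e. the per-brochure title."""
    return re.sub(r"\A#{1,6} [^\n]*\n+", "", text)


# Hypothetical titles; only the first line differs between the two chunks.
cover_a = "### Chemotherapie\n\n" + COVER_BLOCK
cover_b = "### Knieprothese\n\n" + COVER_BLOCK

# The differing title makes the full-chunk hashes differ.
hashes_differ = content_hash(cover_a) != content_hash(cover_b)

# Once the title is removed, both chunks hash to the same value.
hashes_match_after_strip = (
    content_hash(strip_leading_heading(cover_a))
    == content_hash(strip_leading_heading(cover_b))
)
```

A single differing line is enough to change an exact-match hash, which is why a hash over the whole chunk cannot see that the remaining text is identical.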

Impact on Retrieval Quality

Before: Balanced Index

Web pages (HTML):  ~4,400 chunks (100% of index)
  - Department pages, doctor profiles, conditions, treatments
  - Well-targeted content, each page about one topic

After: Brochure-Dominated Index

Brochures (PDF):   6,485 chunks (59.6% of index)
  - 449 pure boilerplate (campus address blocks)
  - 51 chunks about chemotherapy from a single brochure (br0297)
  - Dense, procedural content competing with navigational pages

Web pages (HTML):  4,394 chunks (40.4% of index)
  - Same quality content, now outnumbered 1.5:1

Concrete Failure Examples

GQ-102: "Waar kan ik terecht voor chemotherapie bij borstkanker?"

Expected: Borstcentrum department page (1 chunk, 482 chars)
Retrieved: Chemotherapy brochure br0297 (51 chunks mentioning "chemotherapie")

The brochure's 51 chunks about chemotherapy procedures dominated semantic retrieval, pushing the authoritative Borstcentrum department page — which directly answers "where can I go?" — below the relevance threshold. The LLM generated an answer from brochure content (treatment locations, room numbers) instead of the department overview.

GQ-093: "Zijn er dokters die zowel op Sint-Jan als op André Dumont werken?"

Expected: Doctor-campus relationship data
Retrieved: 449 brochure cover pages mentioning both campus names as address boilerplate

The retrieval system found hundreds of high-similarity matches for "Sint-Jan" + "André Dumont" — but they were all brochure address blocks, not actual doctor-campus relationship data. The signal was completely drowned in noise.

Why Existing Deduplication Didn't Catch This

The system has 7 deduplication layers (see Content Deduplication):

Layer	Why It Didn't Help
1. URL dedup	Each brochure has a unique URL
2. Title dedup	Each brochure has a unique title
3. Canonical URL	PDFs don't have `<link rel="canonical">`
4. Boilerplate CSS	CSS selectors don't apply to PDFs (no DOM)
5. Cross-doc hash	Each cover page has a different title prepended, making hashes unique
6. Language filter	Content is in Dutch (correct language)
7. Re-ingestion	Each brochure was new content

The architectural gap: Text-level boilerplate patterns (strip_text_boilerplate) were applied only to HTML-extracted markdown in _extract_html_from_response. PDF extraction in _pdf_extract_in_process returned raw markdown directly, bypassing all text filtering.

Resolution

Retroactive Fix (Immediate)

Direct SQL deletion of 449 boilerplate chunks from the pilot database:

DELETE FROM app.document_chunks
WHERE content_length < 1000
  AND content ILIKE '%ZOL GENK%'
  AND content ILIKE '%Campus Sint-Jan%'
  AND (content ILIKE '%B 3600%' OR content ILIKE '%Synaps Park%');

This removed the boilerplate without requiring re-ingestion of 573 PDFs.
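
Before running a deletion like this, it can help to mirror the WHERE clause in code and check it against sample chunks. The sketch below is one way to do that. It assumes content_length equals the character length of content and approximates ILIKE with a lowercase substring test. The function name and the sample strings are hypothetical.

```python
def matches_cover_filter(content: str) -> bool:
    """Python mirror of the SQL predicate above, for testing on sample chunks."""
    text = content.lower()  # approximates ILIKE's case-insensitive match
    return (
        len(content) < 1000  # assumption: content_length is the character count
        and "zol genk" in text
        and "campus sint-jan" in text
        and ("b 3600" in text or "synaps park" in text)
    )


# Hypothetical samples: a short cover chunk and a long page with the same strings.
short_cover = (
    "### Chemotherapie\n\n**ZOL GENK** **ZOL MAAS EN KEMPEN**\n\n"
    "Campus Sint-Jan\nSynaps Park 1\nB 3600 Genk\n"
)
long_page = short_cover + "Inhoudelijke tekst. " * 60  # over 1000 characters

cover_is_matched = matches_cover_filter(short_cover)
long_page_is_matched = matches_cover_filter(long_page)
```

The length cap is what keeps longer chunks that merely mention the campus address out of the deletion.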

Pipeline Fix (Preventive)

- _extract_pdf_content now applies boilerplate stripping: the same strip_text_boilerplate() from site_config runs on PDF output, not just HTML
- New _strip_pdf_cover_boilerplate() function: regex patterns specifically targeting ZOL brochure cover blocks (campus address patterns with postal codes)
- DB-configurable: hospital-specific PDF boilerplate patterns can be added via the Pipeline Config UI, same as HTML patterns
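
A minimal sketch of what line-level cover stripping could look like follows. It is not the actual _strip_pdf_cover_boilerplate() implementation: the patterns are assumptions derived from the cover example shown earlier, and the function name and sample text are hypothetical.

```python
import re

# Hypothetical line patterns, anchored at line start so that body sentences
# which merely mention a campus are left alone.
COVER_LINE_PATTERNS = [
    re.compile(r"^\*\*ZOL GENK\*\*"),                 # hospital name banner
    re.compile(r"^Campus Sint-"),                     # campus name row
    re.compile(r"^(Synaps Park|Stalenstraat)\s+\d"),  # street address rows
    re.compile(r"^Medisch Centrum André Dumont$"),    # third location
    re.compile(r"^B\s?\d{4}\s+\w+"),                  # postal code + town
]


def strip_pdf_cover_boilerplate_sketch(markdown: str) -> str:
    """Remove cover-page address lines from PDF-extracted markdown."""
    kept = []
    for line in markdown.splitlines():
        if any(p.search(line.strip()) for p in COVER_LINE_PATTERNS):
            continue  # drop the boilerplate line
        kept.append(line)
    # Collapse the blank-line runs left behind by removed lines.
    return re.sub(r"\n{3,}", "\n\n", "\n".join(kept)).strip()


SAMPLE_COVER = (
    "### Chemotherapie\n\n"
    "**ZOL GENK** **ZOL MAAS EN KEMPEN**\n\n"
    "Campus Sint-Jan          Campus Sint-Barbara\n"
    "Synaps Park 1            Bessemerstraat 478\n"
    "B 3600 Genk              B 3620 Lanaken\n\n"
    "Medisch Centrum André Dumont\n"
    "Stalenstraat 2a\n"
    "B 3600 Genk\n"
)

stripped_cover = strip_pdf_cover_boilerplate_sketch(SAMPLE_COVER)
```

With these patterns only the title survives from a cover page, so a caller could skip chunks that reduce to a bare heading. One limitation of this sketch: a body line that happens to start with "Campus Sint-" would also be dropped, so real patterns would likely need to be stricter or limited to the first page.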

Measured Result

Metric	Before Fix	After Fix
Total chunks	10,879	10,430
Brochure chunks	6,485 (59.6%)	6,036 (57.9%)
Boilerplate chunks	449	0
Campus address pollution	449 retrieval-polluting chunks	Eliminated

Lessons Learned

1. Content Type Parity

Every content extraction path must have the same quality gates. The HTML pipeline had 7 dedup layers; the PDF pipeline had 0 text-level filtering. When we scaled from HTML-only to mixed HTML+PDF, the untested path became the dominant content source.

Principle: If you add a new content type (PDF, DOCX, email), test the full quality pipeline against it before bulk ingestion.

2. Corpus Composition Matters More Than Corpus Size

Going from 4,400 to 10,900 chunks didn't, by itself, cause the regression. The 59.6% brochure composition did. A balanced corpus with 10,900 chunks would likely have performed fine. The problem was that one content type (brochures) overwhelmed the index with dense, procedural content that competed with navigational content for the same queries.

Principle: Monitor content type ratios in the index. If any single category exceeds 50%, investigate whether it's diluting retrieval for other categories.
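
As a rough illustration, a composition check of this kind could be sketched as follows. The source-type labels and function names are hypothetical. The chunk counts are the ones reported for the index after brochure ingestion.

```python
from collections import Counter


def composition_shares(chunk_sources):
    """Return each source type's share of the index as a fraction of all chunks."""
    counts = Counter(chunk_sources)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {source: n / total for source, n in counts.items()}


def dominant_sources(shares, threshold=0.5):
    """List source types whose share exceeds the threshold (50% by default)."""
    return [source for source, share in shares.items() if share > threshold]


# Chunk counts from the incident: 6,485 brochure chunks and 4,394 web chunks.
sources = ["pdf_brochure"] * 6485 + ["html_page"] * 4394
shares = composition_shares(sources)
flagged = dominant_sources(shares)
```

On these numbers the brochure share comes out at about 59.6%, so the brochure category would be flagged for investigation.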

3. Boilerplate Is Format-Specific

HTML boilerplate (nav menus, cookie banners, footers) looks nothing like PDF boilerplate (cover pages, address blocks, legal disclaimers). A boilerplate detection system designed for one format won't catch the other. Each format needs its own patterns.

Principle: Boilerplate detection must be format-aware, not one-size-fits-all.

4. Test at Production Scale Before Declaring Victory

The system passed 99.7% on 299 questions with HTML-only content. The regression only appeared after ingesting real-world PDFs. Academic evaluation on clean data doesn't prove production readiness: you need to test with the messy, heterogeneous content that production systems actually handle.

Principle: Run the full eval suite after every major content ingestion, not just after code changes.
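
One way to make this automatic is a small gate that compares the post-ingestion pass rate against the stored baseline. The sketch below assumes a tolerance of 0.5 percentage points, which is an illustrative value rather than one taken from the incident. The function names are hypothetical.

```python
def pass_rate(passed: int, total: int) -> float:
    """Fraction of golden questions that passed."""
    if total <= 0:
        raise ValueError("total must be positive")
    return passed / total


def has_regressed(baseline: float, current: float, tolerance: float = 0.005) -> bool:
    """True when the pass rate fell by more than the tolerance (assumed value)."""
    return (baseline - current) > tolerance


# Figures from the timeline: 298/299 before ingestion, 295/299 after.
baseline = pass_rate(298, 299)
after_ingest = pass_rate(295, 299)
regressed = has_regressed(baseline, after_ingest)
```

With the incident's numbers the drop is about one percentage point, so a gate like this would have flagged the ingestion.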

Related

- Content Deduplication: The 7-layer dedup system (now extended to PDFs)
- Document Ingestion Pipeline: Subprocess-isolated PDF extraction
- Evaluation Overview: Golden question methodology and eval reports
