Content Deduplication
Hospital websites frequently contain duplicate or near-duplicate content: the same brochure accessible via multiple URL paths, navigation menus repeated on every page, footer text appearing in every document, and content published in multiple languages. Without deduplication, the RAG pipeline would retrieve redundant chunks that waste context window tokens and confuse the LLM.
The ZOL system implements seven deduplication layers, each targeting a different type of redundancy:
Layer 1: URL-Level Deduplication
When: During crawl discovery (Stage 2)
The crawler maintains a crawled_urls table with a unique constraint on (url, tenant_id). When a URL is discovered multiple times for the same tenant (e.g., from different sitemap entries or internal links), only the first occurrence creates a row. Subsequent discoveries update metadata but don't create duplicate entries.
-- Duplicate URLs are caught at INSERT time
INSERT INTO app.crawled_urls (url, tenant_id, status, ...)
VALUES (:url, :tenant_id, :status, ...)
ON CONFLICT (url, tenant_id) DO UPDATE SET
    last_crawled_at = now(),
    http_status = EXCLUDED.http_status;
This prevents the same page from being ingested twice when the URL is identical, but does not catch cases where different URLs serve the same content (handled by Layers 2 and 3).
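The upsert behaviour can be tried locally with a minimal sketch. It assumes an in-memory SQLite table as a stand-in for app.crawled_urls (the production statement above targets PostgreSQL), and the column set, the record_discovery helper, and the example.org URL are illustrative, not taken from the ZOL codebase.

```python
import sqlite3

# In-memory stand-in for app.crawled_urls; the column set is an assumption.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE crawled_urls (
        url TEXT NOT NULL,
        tenant_id TEXT NOT NULL,
        status TEXT NOT NULL,
        http_status INTEGER,
        last_crawled_at TEXT,
        UNIQUE (url, tenant_id)
    )
    """
)

UPSERT = """
    INSERT INTO crawled_urls (url, tenant_id, status, http_status, last_crawled_at)
    VALUES (?, ?, 'pending', ?, CURRENT_TIMESTAMP)
    ON CONFLICT (url, tenant_id) DO UPDATE SET
        last_crawled_at = CURRENT_TIMESTAMP,
        http_status = excluded.http_status
"""


def record_discovery(url: str, tenant_id: str, http_status: int) -> None:
    """Insert a newly discovered URL, or refresh metadata if it is already known."""
    conn.execute(UPSERT, (url, tenant_id, http_status))


# The same URL discovered twice for one tenant, once for another tenant.
record_discovery("https://www.example.org/brochure", "zol", 200)
record_discovery("https://www.example.org/brochure", "zol", 304)
record_discovery("https://www.example.org/brochure", "other", 200)

rows = conn.execute(
    "SELECT tenant_id, http_status FROM crawled_urls ORDER BY tenant_id"
).fetchall()
```

After the three calls the table holds two rows, one per tenant, and the "zol" row carries the HTTP status of the most recent discovery.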
Layer 2: Title-Level Deduplication
When: During ingestion (before and after content extraction)
The ingestion service performs two title-based dedup checks:
Pre-flight check — Before fetching content, if the crawled URL already has a title (from the sitemap or a previous crawl), the system checks for an existing completed document with the same title:
# Pre-flight: skip if a completed document with this title already exists
existing = await db.execute(
    select(Document).where(
        Document.title == crawled_url.title,
        Document.tenant_id == tenant_id,
        Document.status == "completed",
    ).limit(1)
)
existing_doc = existing.scalars().first()
if existing_doc:
    crawled_url.document_id = existing_doc.id  # Link URL to existing document
    return  # Skip fetching: saves the HTTP request and LLM costs
Post-flight check — After extracting content (when the page title is definitively known), a second title check catches cases where the pre-flight title was absent or different from the extracted title.
This catches cases where the same content is published under different URLs but with identical page titles — common in Drupal CMS where content is accessible via both /friendly-path and /node/1234.
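How the two checks interact can be sketched with an in-memory document list. This is an illustration under assumptions: the Doc dataclass, find_title_duplicate, and ingest are hypothetical names, and the real service queries the database rather than a Python list.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Doc:
    id: int
    title: str
    tenant_id: str
    status: str


def find_title_duplicate(docs: List[Doc], title: Optional[str], tenant_id: str) -> Optional[Doc]:
    """Return a completed document of the same tenant with an identical title, if any."""
    if not title:
        return None
    for doc in docs:
        if doc.title == title and doc.tenant_id == tenant_id and doc.status == "completed":
            return doc
    return None


def ingest(docs: List[Doc], tenant_id: str, known_title: Optional[str],
           fetch_title: Callable[[], str]) -> Tuple[str, int]:
    # Pre-flight: uses the title known before fetching (sitemap or previous crawl).
    duplicate = find_title_duplicate(docs, known_title, tenant_id)
    if duplicate:
        return ("skipped_preflight", duplicate.id)

    # Fetch and extract; only now is the page title definitively known.
    extracted_title = fetch_title()

    # Post-flight: catches pages whose pre-flight title was absent or different.
    duplicate = find_title_duplicate(docs, extracted_title, tenant_id)
    if duplicate:
        return ("skipped_postflight", duplicate.id)

    new_doc = Doc(len(docs) + 1, extracted_title, tenant_id, "completed")
    docs.append(new_doc)
    return ("ingested", new_doc.id)
```

In this sketch the pre-flight path never calls fetch_title, which mirrors the cost saving described above: no HTTP request and no LLM work for a page that is already covered.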
Layer 3: Canonical URL Detection
When: During content extraction (after fetching HTML)
This layer addresses a fundamental CMS problem: Drupal (and other CMSs) frequently expose the same content at multiple URL paths. For example, ZOL's "Huishoudelijk reglement" page was accessible at both /huishoudelijk-reglement and /werking/huishoudelijk-reglement — identical content, different URLs, different page titles.
How it works:
- During HTML extraction, the system reads the <link rel="canonical"> tag from the page's <head>
- If the canonical URL differs from the crawled URL, this indicates the page is an alias
- The system checks whether a document already exists for the canonical URL
- If found, the current page is skipped and linked to the existing document
# Extract canonical URL from HTML <head>
canonical_tag = soup.find("link", rel="canonical")
canonical_url = canonical_tag.get("href") if canonical_tag else None

# During post-flight dedup: check if canonical URL already ingested
if canonical_url and canonical_url != source_url:
    existing = await db.execute(
        text("""SELECT id FROM app.documents
                WHERE metadata->>'source_url' = :canon_url
                AND status = 'completed'"""),
        {"canon_url": canonical_url},
    )
    if existing.fetchone():
        # This page is a URL alias: skip and link to canonical document
        return
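The alias check itself can be exercised without a database. The following is a minimal sketch that uses only the standard library's html.parser in place of the HTML parser shown above; CanonicalParser and canonical_alias_target are hypothetical names, and resolving relative hrefs and ignoring a trailing slash are assumptions that may not match the production behaviour.

```python
from html.parser import HTMLParser
from typing import Optional
from urllib.parse import urljoin


class CanonicalParser(HTMLParser):
    """Records the href of the first <link rel="canonical"> tag it sees."""

    def __init__(self) -> None:
        super().__init__()
        self.canonical_href: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical_href is not None:
            return
        attr_map = dict(attrs)
        # rel may hold several space-separated tokens.
        rel_tokens = (attr_map.get("rel") or "").lower().split()
        if "canonical" in rel_tokens and attr_map.get("href"):
            self.canonical_href = attr_map["href"]


def canonical_alias_target(html: str, source_url: str) -> Optional[str]:
    """Return the canonical URL if the page is an alias of another URL, else None."""
    parser = CanonicalParser()
    parser.feed(html)
    if parser.canonical_href is None:
        return None
    # Assumption: relative canonical hrefs are resolved against the crawled URL.
    canonical = urljoin(source_url, parser.canonical_href)
    if canonical.rstrip("/") == source_url.rstrip("/"):
        return None  # The page is its own canonical version.
    return canonical
```

A non-None result is the URL to look up in the documents table; None means the page should be ingested normally.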
Impact: In the ZOL corpus, canonical URL detection identified 73 fully-duplicated document pairs (146 documents serving identical content at different URL paths), eliminating 184 redundant chunks.
Root cause analysis of the 73 duplicates:
| Pattern | Count | Example |
|---|---|---|
| Different clean URL paths (Drupal aliases) | 66 | /huishoudelijk-reglement vs /werking/huishoudelijk-reglement |
| /node/ ID vs clean URL alias | 7 | /node/2091 vs /arts-anders-dr-helena-van-kerrebroeck |
Layer 4: Boilerplate Stripping
When: During content extraction (CSS selectors) and after chunking (text patterns)
Hospital website pages contain repeated elements: navigation menus, footer links, cookie notices, sidebar widgets, breadcrumbs. These appear in every page's extracted text and would create near-identical chunks across all documents.
The boilerplate filter uses a four-tier pattern system:
Tier 1: Generic CSS selectors (34 selectors)
These work for any hospital website, targeting common structural elements:
GENERIC_BOILERPLATE_SELECTORS = [
    "nav", "header", "footer", "aside",      # Structural HTML5 elements
    ".breadcrumb", ".breadcrumbs",           # Navigation trails
    ".cookie-banner", ".cookie-notice",      # GDPR consent
    ".sidebar", "#sidebar", ".widget",       # Non-content sidebars
    ".social-share", ".newsletter-signup",   # Promotional chrome
    # ... 34 total selectors
]
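To illustrate what selector-based stripping does to a page, here is a minimal standard-library sketch. It is deliberately limited: it understands only bare tag names, .class, and #id selectors, and it assumes well-formed HTML in which every non-void element is closed. The production extractor presumably relies on a full CSS-capable parser; BoilerplateStripper and strip_boilerplate are hypothetical names.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag and must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


def matches(selector: str, tag: str, attrs: dict) -> bool:
    """Match a simple selector: tag name, .class, or #id."""
    if selector.startswith("."):
        return selector[1:] in (attrs.get("class") or "").split()
    if selector.startswith("#"):
        return attrs.get("id") == selector[1:]
    return tag == selector


class BoilerplateStripper(HTMLParser):
    def __init__(self, selectors):
        super().__init__()
        self.selectors = selectors
        self.skip_depth = 0      # > 0 while inside a matched element
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth += 1  # nested element inside boilerplate
        elif any(matches(s, tag, dict(attrs)) for s in self.selectors):
            self.skip_depth = 1   # entering a boilerplate element

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth and data.strip():
            self.text_parts.append(data.strip())


def strip_boilerplate(html: str, selectors) -> str:
    parser = BoilerplateStripper(selectors)
    parser.feed(html)
    parser.close()
    return "\n".join(parser.text_parts)
```

Text inside any matched element, including its nested children, is dropped, and only the remaining content text is returned.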
Tier 2: DB-configurable CSS selectors (per hospital)
Site-specific selectors are stored in the hospital's pipeline configuration (hospitals.config.pipeline), not in code. This makes the system fully hospital-agnostic:
{
"pipeline": {
"boilerplate_css_selectors": [
".block-zol-header",
".block-zol-footer",
".block-tb-megamenu",
".region-sidebar",
".region-navigation"
]
}
}
At runtime, get_site_config_with_db_overrides() merges generic + DB-configured selectors into a single list:
from copy import deepcopy

# Deep copy to avoid mutating shared config
config = deepcopy(base_config)
config.boilerplate_selectors = list(set(
    config.boilerplate_selectors + custom_css_from_db
))
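One property of the set-based merge above is that the resulting selector order is not deterministic. If a stable order matters, for example for readable logs or reproducible tests, an order-preserving variant is possible. This is a sketch of an alternative, not the implementation of get_site_config_with_db_overrides(); SiteConfig and merge_selectors are hypothetical names.

```python
from copy import deepcopy
from dataclasses import dataclass, field
from typing import Iterable, List


@dataclass
class SiteConfig:
    # Hypothetical stand-in for the real site config object.
    boilerplate_selectors: List[str] = field(default_factory=list)


def merge_selectors(base_config: SiteConfig, custom_css_from_db: Iterable[str]) -> SiteConfig:
    """Merge generic and DB-configured selectors, dropping duplicates but keeping order."""
    config = deepcopy(base_config)  # never mutate the shared base config
    combined = config.boilerplate_selectors + list(custom_css_from_db)
    # dict.fromkeys keeps the first occurrence of each key in insertion order.
    config.boilerplate_selectors = list(dict.fromkeys(combined))
    return config
```

Generic selectors stay first, DB-configured selectors follow, and the shared base configuration is left untouched.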
Tier 3: Text-level regex patterns
Some boilerplate survives CSS selector removal because it lives inside the main content area. For ZOL, this includes the "Moeilijk leesbaar?" accessibility widget, mega-menu navigation text that appears within <main>, and repeated doctor listing links. These are handled by regex patterns applied to the extracted markdown:
# Text patterns applied after markdown conversion
boilerplate_text_patterns = [
r"Moeilijk leesbaar\??\s*(\[.*?leesbaarheid.*?\](\(.*?\))?\s*)?",
r"-?\s*ZOL-artsen\s+Overzicht\s+ZOL-artsen.*?(?=\n\n|\Z)",
# ... 16 total patterns for ZOL
]
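A minimal sketch of applying such patterns to extracted markdown follows. The two patterns are copied from the list above; the function name strip_text_boilerplate, the absence of regex flags, and the blank-line cleanup step are assumptions, and the sample text is invented for illustration.

```python
import re

BOILERPLATE_TEXT_PATTERNS = [
    r"Moeilijk leesbaar\??\s*(\[.*?leesbaarheid.*?\](\(.*?\))?\s*)?",
    r"-?\s*ZOL-artsen\s+Overzicht\s+ZOL-artsen.*?(?=\n\n|\Z)",
]


def strip_text_boilerplate(markdown: str, patterns=BOILERPLATE_TEXT_PATTERNS) -> str:
    """Remove every match of every pattern, then tidy leftover blank lines."""
    for pattern in patterns:
        markdown = re.sub(pattern, "", markdown)
    # Collapse runs of 3+ newlines left behind by removed blocks.
    return re.sub(r"\n{3,}", "\n\n", markdown).strip()


sample = (
    "Moeilijk leesbaar? [Verhoog de leesbaarheid](/leesbaarheid)\n\n"
    "- ZOL-artsen Overzicht ZOL-artsen Dr. A Dr. B\n\n"
    "Bezoekuren zijn van 14u tot 20u."
)
cleaned = strip_text_boilerplate(sample)
```

On this sample only the final sentence about visiting hours survives; the accessibility widget and the doctor listing line are removed.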
Configuring for a new hospital
- Create hospital in Platform Management — zero boilerplate config needed
- Run initial crawl — generic selectors (Tier 1) handle common patterns automatically
- Review chunks — if hospital-specific chrome appears, add CSS selectors via the API:
PUT /api/v1/hospitals/{hospital_id}/config
{
"config": {
"pipeline": {
"boilerplate_css_selectors": [".custom-header", ".custom-nav"]
}
}
}
- Re-ingest affected URLs — custom patterns now active
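The API call in step 3 could be prepared from a script along these lines. The sketch only builds the request and does not send it; the base URL and hospital ID are placeholders, authentication is omitted because the source does not describe it, and build_selector_update is a hypothetical helper.

```python
import json
import urllib.request

BASE_URL = "https://platform.example.org"  # placeholder, not the real host


def build_selector_update(hospital_id: str, selectors) -> urllib.request.Request:
    """Build the PUT request that stores hospital-specific CSS selectors."""
    body = {
        "config": {
            "pipeline": {
                "boilerplate_css_selectors": list(selectors),
            }
        }
    }
    return urllib.request.Request(
        url=f"{BASE_URL}/api/v1/hospitals/{hospital_id}/config",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


request = build_selector_update("123", [".custom-header", ".custom-nav"])
# Sending would be: urllib.request.urlopen(request), once host and auth are known.
```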
Tier 4: PDF-specific boilerplate patterns
PDF brochures have their own boilerplate that doesn't exist in HTML: cover pages with hospital addresses, phone numbers, and campus blocks. These are invisible to CSS selectors (no DOM) and often unique enough to defeat hash-based dedup (each cover has a different title).
PDF patterns are stored separately in the hospital's pipeline config (pdf_boilerplate_patterns) and applied during PDF extraction via strip_pdf_boilerplate(). This was added after the PDF Corpus Scaling Incident where 449 brochure cover pages polluted the retrieval index.
{
"pipeline": {
"pdf_boilerplate_patterns": [
"\\*{0,2}ZOL GENK\\*{0,2}\\s+\\*{0,2}ZOL MAAS EN KEMPEN\\*{0,2}[\\s\\S]*?B\\s+36\\d{2}",
"Campus Sint-Jan\\s+Synaps Park 1\\s+B 3600 Genk"
]
}
}
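A possible shape for strip_pdf_boilerplate() is sketched below. The function name and the two patterns come from this page; the signature, the absence of regex flags, and the whitespace cleanup are assumptions, and the sample page text is invented.

```python
import json
import re

# Same JSON as above; a raw string keeps the JSON backslash escapes intact.
PIPELINE_CONFIG_JSON = r'''
{
  "pipeline": {
    "pdf_boilerplate_patterns": [
      "\\*{0,2}ZOL GENK\\*{0,2}\\s+\\*{0,2}ZOL MAAS EN KEMPEN\\*{0,2}[\\s\\S]*?B\\s+36\\d{2}",
      "Campus Sint-Jan\\s+Synaps Park 1\\s+B 3600 Genk"
    ]
  }
}
'''


def strip_pdf_boilerplate(text: str, patterns) -> str:
    """Remove configured cover-page patterns from extracted PDF text."""
    for pattern in patterns:
        text = re.sub(pattern, "", text)
    # Collapse blank-line runs left behind by removed blocks.
    return re.sub(r"\n{3,}", "\n\n", text).strip()


patterns = json.loads(PIPELINE_CONFIG_JSON)["pipeline"]["pdf_boilerplate_patterns"]
page = "Campus Sint-Jan\nSynaps Park 1\nB 3600 Genk\n\nUw opname in het ziekenhuis"
cleaned = strip_pdf_boilerplate(page, patterns)
```

Because the patterns live in JSON, each regex backslash is written twice in the config and becomes a single backslash after json.loads.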
Current ZOL configuration
| Type | Count | Examples |
|---|---|---|
| Generic CSS selectors | 34 | nav, footer, .cookie-banner, .sidebar |
| DB-configured CSS selectors | 5 | .block-zol-header, .block-tb-megamenu, .region-sidebar |
| Text regex patterns (HTML+PDF) | 20 | "Moeilijk leesbaar?" widget, mega-menu nav, contact blocks |
| PDF boilerplate patterns | 2 | Campus address cover blocks, standalone address lines |
Layer 5: Cross-Document Content Hash Deduplication
When: During chunk storage (after embedding generation)
This layer detects identical text appearing across different documents — for example, a standard disclaimer paragraph in every brochure, templated introductions shared across exercise program pages, or hospital policy text copied across admission information pages.
How it works:
- Hash computation: Each chunk's text content is SHA-256 hashed
- Cross-document lookup: A single SQL query checks how many other completed documents contain chunks with the same hash
- Threshold filtering: Chunks whose hash already exists in at least N other completed documents are skipped, where N is the configured threshold (BOILERPLATE_DEDUP_MIN_DOCUMENTS)
-- Single round-trip: find hashes that already exist in completed documents
SELECT dc.content_hash, COUNT(DISTINCT dc.document_id) AS doc_count
FROM app.document_chunks dc
JOIN app.documents d ON d.id = dc.document_id
WHERE dc.content_hash = ANY(:hashes)
AND dc.document_id != CAST(:current_doc AS UUID)
AND d.status = 'completed'
GROUP BY dc.content_hash
HAVING COUNT(DISTINCT dc.document_id) >= :threshold
Key design decisions:
- Threshold = 1 (any chunk in another completed document is skipped). This was lowered from the initial value of 3 after analysis showed that most cross-doc duplicates occur between exactly 2 documents, not 3+
- Completed-only filter (d.status = 'completed') ensures that re-processing a document works correctly: when a document is re-processed, its old chunks are deleted first, so the new chunks won't false-match against themselves
- Intra-document dedup runs separately with a simple seen_hashes set, catching duplicate chunks within the same document (e.g., repeated sections in a long brochure)
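The combination of intra-document and cross-document checks can be sketched in memory. Here a dictionary from hash to document IDs stands in for the SQL lookup; content_hash, filter_chunks, and the index structure are hypothetical, and the sketch assumes the hash is taken over the raw chunk text without normalization.

```python
import hashlib
from typing import Dict, List, Set


def content_hash(text: str) -> str:
    """SHA-256 hex digest of the chunk text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def filter_chunks(chunks: List[str], current_doc: str,
                  index: Dict[str, Set[str]], threshold: int = 1) -> List[str]:
    """Keep chunks that are new within this document and across completed documents.

    index maps a content hash to the IDs of completed documents containing it.
    """
    kept: List[str] = []
    seen_hashes: Set[str] = set()
    for text in chunks:
        digest = content_hash(text)
        if digest in seen_hashes:
            continue  # intra-document duplicate
        seen_hashes.add(digest)
        other_docs = index.get(digest, set()) - {current_doc}
        if len(other_docs) >= threshold:
            continue  # already present in enough other completed documents
        kept.append(text)
    return kept
```

With threshold=1 a chunk is dropped as soon as one other completed document contains it; with threshold=3 only text shared by three or more other documents is dropped, which corresponds to the "boilerplate only" setting.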
Partial-overlap analysis (ZOL corpus):
| Content type | Shared chunks | Source |
|---|---|---|
| BeweegSaam exercise templates | ~20 | Templated intro/guidelines across 15+ exercise variants |
| Hospital policy paragraphs | ~15 | Standard admission/visitor rules shared across campus pages |
| Training program curricula | ~10 | Yearly editions sharing core curriculum content |
| Department sidebar navigation | ~15 | Cross-linked department lists |
Layer 6: Language Filtering
When: During chunk storage
Hospital content is primarily in Dutch, but extracted text sometimes contains fragments in other languages (French/German headers, English technical terms, untranslated CMS boilerplate). These fragments reduce retrieval quality for Dutch queries.
The language filter uses the Lingua library for language detection:
- For each chunk with 30+ characters of content, detect the primary language
- If the detected language is NOT in the expected set (Dutch by default), skip the chunk
- Short chunks (under 30 characters) bypass the filter — they're typically headings or labels
if lang_detector and expected_lang_codes:
    detected = lang_detector.detect_language_of(content)
    if detected and detected.iso_code_639_1.name.lower() not in expected_lang_codes:
        language_skipped += 1
        continue
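The gating logic around the detector, including the 30-character bypass, can be sketched independently of Lingua. In this illustration the detector is injected as a plain callable returning an ISO 639-1 code, and toy_detect is a crude keyword stand-in used only so the example runs without third-party packages; it is not how Lingua detects languages.

```python
from typing import Callable, FrozenSet, List, Optional, Tuple

MIN_CHARS_FOR_DETECTION = 30  # shorter chunks bypass the filter


def filter_by_language(
    chunks: List[str],
    detect: Callable[[str], Optional[str]],
    expected: FrozenSet[str] = frozenset({"nl"}),
) -> Tuple[List[str], int]:
    """Return (kept chunks, number skipped for being in an unexpected language)."""
    kept: List[str] = []
    language_skipped = 0
    for content in chunks:
        if len(content.strip()) >= MIN_CHARS_FOR_DETECTION:
            code = detect(content)
            if code and code not in expected:
                language_skipped += 1
                continue
        kept.append(content)
    return kept, language_skipped


def toy_detect(text: str) -> Optional[str]:
    """Stand-in detector: flags text containing the French article 'les'."""
    return "fr" if " les " in f" {text.lower()} " else "nl"
```

Chunks for which the detector returns nothing are kept, matching the `if detected and ...` guard in the snippet above.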
Configuration: INGESTION_LANGUAGE_FILTER_ENABLED=true, INGESTION_EXPECTED_LANGUAGES=nl
Layer 7: Re-Ingestion Protection
When: When a page is re-crawled or updated
When a URL is re-ingested (e.g., after content update), the system:
- Checks if the content hash has changed since last ingestion
- If unchanged, skips processing entirely (saves LLM costs for contextual embeddings)
- If changed, deletes old chunks and re-processes from scratch
# Content hash comparison: skip if unchanged
if record.content_hash == new_content_hash:
    logger.info("Content unchanged for %s, skipping re-ingestion", url)
    return
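The full skip-or-replace decision can be sketched as follows. The sketch assumes the content hash is a SHA-256 digest of the extracted text, which this page does not confirm for document-level hashes; IngestRecord, reingest, and the chunker argument are hypothetical.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class IngestRecord:
    # Hypothetical stand-in for the stored state of one ingested URL.
    content_hash: str = ""
    chunks: List[str] = field(default_factory=list)


def reingest(record: IngestRecord, new_content: str,
             chunker: Callable[[str], List[str]]) -> bool:
    """Re-process a page only if its content changed. Returns True if processed."""
    new_hash = hashlib.sha256(new_content.encode("utf-8")).hexdigest()
    if record.content_hash == new_hash:
        return False  # unchanged: skip chunking, embedding and LLM work
    # Changed: old chunks are replaced wholesale, never patched.
    record.chunks = chunker(new_content)
    record.content_hash = new_hash
    return True
```

Replacing all chunks on change, rather than diffing, keeps the stored chunks consistent with exactly one version of the page.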
Measured Impact
Analysis of the ZOL corpus (March 2026) quantified the impact of each deduplication layer:
| Layer | What It Prevents | Savings | Method |
|---|---|---|---|
| 1. URL dedup | Same URL ingested twice | ~200 URLs | DB unique constraint |
| 2. Title dedup | CMS path aliases (same title) | ~50 documents | Pre/post-flight SQL check |
| 3. Canonical URL | Drupal URL aliases (different titles) | 184 chunks (73 documents) | <link rel="canonical"> extraction |
| 4. Boilerplate stripping | Nav/footer/menu chunks | ~500 chunks | CSS selectors + regex |
| 5. Cross-doc hash dedup | Repeated paragraphs across pages | ~60 chunks | SHA-256 hash matching |
| 6. Language filtering | Non-Dutch fragments | ~30 chunks | Lingua language detection |
| 7. Re-ingestion protection | Unchanged content on re-crawl | Varies | Content hash comparison |
Before dedup: The raw ZOL corpus would produce ~5,400 chunks. After all layers: 4,408 clean, unique chunks. That is a ~18% reduction in chunk count ((5,400 − 4,408) / 5,400), removing redundant content that would otherwise pollute retrieval results and waste LLM context tokens.
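The reduction figure follows directly from the two chunk counts, as this short check shows. The ~5,400 starting value is itself an estimate, so the percentage is approximate.

```python
raw_chunks = 5400     # estimated chunk count without deduplication
clean_chunks = 4408   # chunk count after all seven layers

removed = raw_chunks - clean_chunks
reduction_pct = 100 * removed / raw_chunks

print(f"{removed} chunks removed ({reduction_pct:.1f}% of the raw corpus)")
```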
The 5,400 → 4,408 figures are from the March 2026 HTML-only corpus and remain the cleanest "before vs after" comparison for the dedup layers. Following the PDF Corpus Scaling Incident and the nightly auto-ingest going live on the pilot, the production corpus is 5,841 completed documents / ~10,430 chunks (May 2026) — composed of HTML pages plus 573 PDF brochures with PDF-specific boilerplate stripping (Tier 4) operational. The dedup-layer mechanics are unchanged; only the absolute counts have grown.
Configuration Reference
| Setting | Default | Purpose |
|---|---|---|
| BOILERPLATE_DEDUP_MIN_DOCUMENTS | 1 | Cross-doc dedup threshold (1 = exact dedup, 3+ = boilerplate only) |
| INGESTION_LANGUAGE_FILTER_ENABLED | true | Enable chunk-level language filtering |
| INGESTION_EXPECTED_LANGUAGES | nl | Comma-separated expected language codes |
| Hospital pipeline config | Per hospital | boilerplate_css_selectors, boilerplate_text_patterns, and pdf_boilerplate_patterns |
Related
- Document Ingestion Pipeline — Full ingestion architecture
- Prompt Engineering — How the LLM handles remaining duplicates in context
- Taxonomy Extraction Pipeline — Entity-level deduplication