
Content Deduplication

Hospital websites frequently contain duplicate or near-duplicate content: the same brochure accessible via multiple URL paths, navigation menus repeated on every page, footer text appearing in every document, and content published in multiple languages. Without deduplication, the RAG pipeline would retrieve redundant chunks that waste context window tokens and confuse the LLM.

The ZOL system implements seven deduplication layers, each targeting a different type of redundancy:

Layer 1: URL-Level Deduplication

When: During crawl discovery (Stage 2)

The crawler maintains a crawled_urls table with a unique constraint on (url, tenant_id). When a URL is discovered multiple times (e.g., from different sitemap entries or internal links), only one row is kept per tenant. Subsequent discoveries update that row's metadata but don't create duplicate entries.

-- Duplicate URLs are caught at INSERT time
INSERT INTO app.crawled_urls (url, tenant_id, status, ...)
VALUES (...)
ON CONFLICT (url, tenant_id) DO UPDATE SET
    last_crawled_at = now(),
    http_status = EXCLUDED.http_status;

This prevents the same page from being ingested twice when the URL is identical, but does not catch cases where different URLs serve the same content (handled by Layers 2 and 3).
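The upsert behaviour can be tried locally with a minimal sketch. It uses SQLite's in-memory database instead of the production PostgreSQL schema, and the simplified table layout and the discoveries counter column are assumptions added for illustration only.

```python
import sqlite3


def record_discovery(conn: sqlite3.Connection, url: str, tenant_id: str, http_status: int) -> None:
    """Insert a crawled URL, or refresh its metadata if the row already exists."""
    conn.execute(
        """
        INSERT INTO crawled_urls (url, tenant_id, http_status, discoveries)
        VALUES (?, ?, ?, 1)
        ON CONFLICT (url, tenant_id) DO UPDATE SET
            http_status = excluded.http_status,
            discoveries = discoveries + 1
        """,
        (url, tenant_id, http_status),
    )


conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE crawled_urls (
        url TEXT NOT NULL,
        tenant_id TEXT NOT NULL,
        http_status INTEGER,
        discoveries INTEGER NOT NULL,  -- hypothetical column, illustration only
        UNIQUE (url, tenant_id)
    )
    """
)

# The same URL is discovered three times (sitemap, internal link, re-crawl).
for status in (200, 200, 304):
    record_discovery(conn, "https://example.org/brochure", "zol", status)

rows = conn.execute("SELECT url, http_status, discoveries FROM crawled_urls").fetchall()
```

After the three discoveries the table still holds a single row for the URL, carrying the most recent HTTP status.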

Layer 2: Title-Level Deduplication

When: During ingestion (before and after content extraction)

The ingestion service performs two title-based dedup checks:

Pre-flight check — Before fetching content, if the crawled URL already has a title (from the sitemap or a previous crawl), the system checks for an existing completed document with the same title:

# Pre-flight: skip if a completed document with this title already exists
existing = await db.execute(
    select(Document).where(
        Document.title == crawled_url.title,
        Document.tenant_id == tenant_id,
        Document.status == "completed",
    ).limit(1)
)
existing_doc = existing.scalars().first()
if existing_doc:
    crawled_url.document_id = existing_doc.id  # Link URL to existing document
    return  # Skip fetching: saves the HTTP request and LLM costs

Post-flight check — After extracting content (when the page title is definitively known), a second title check catches cases where the pre-flight title was absent or different from the extracted title.

This catches cases where the same content is published under different URLs but with identical page titles — common in Drupal CMS where content is accessible via both /friendly-path and /node/1234.
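The interplay of the two checks can be sketched with an in-memory document list. This is a simplified illustration, not the ingestion service itself: the Doc class, the ingest function, and the returned action labels are hypothetical names, and the exact-match title comparison mirrors the pre-flight query above.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Doc:
    id: int
    title: str
    tenant_id: str
    status: str


def find_title_duplicate(docs: List[Doc], title: Optional[str], tenant_id: str) -> Optional[Doc]:
    """Return a completed document of the same tenant with an identical title, if any."""
    if not title:
        return None
    for doc in docs:
        if doc.title == title and doc.tenant_id == tenant_id and doc.status == "completed":
            return doc
    return None


def ingest(
    docs: List[Doc],
    known_title: Optional[str],
    fetch_title: Callable[[], str],
    tenant_id: str,
) -> Tuple[str, int]:
    """Run the pre-flight check, fetch only if needed, then run the post-flight check."""
    duplicate = find_title_duplicate(docs, known_title, tenant_id)
    if duplicate:
        return ("skipped_preflight", duplicate.id)  # no fetch happened

    extracted_title = fetch_title()  # stands in for fetching and extracting the page
    duplicate = find_title_duplicate(docs, extracted_title, tenant_id)
    if duplicate:
        return ("skipped_postflight", duplicate.id)

    new_doc = Doc(len(docs) + 1, extracted_title, tenant_id, "completed")
    docs.append(new_doc)
    return ("ingested", new_doc.id)


docs: List[Doc] = []
first = ingest(docs, None, lambda: "Huishoudelijk reglement", "zol")
second = ingest(docs, None, lambda: "Huishoudelijk reglement", "zol")
```

Here the second call has no title before fetching, so only the post-flight check can link it to the document created by the first call.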

Layer 3: Canonical URL Detection

When: During content extraction (after fetching HTML)

This layer addresses a fundamental CMS problem: Drupal (and other CMSs) frequently expose the same content at multiple URL paths. For example, ZOL's "Huishoudelijk reglement" page was accessible at both /huishoudelijk-reglement and /werking/huishoudelijk-reglement — identical content, different URLs, different page titles.

How it works:

  1. During HTML extraction, the system reads the <link rel="canonical"> tag from the page's <head>
  2. If the canonical URL differs from the crawled URL, this indicates the page is an alias
  3. The system checks whether a document already exists for the canonical URL
  4. If found, the current page is skipped and linked to the existing document

# Extract canonical URL from HTML <head>
canonical_url = None
canonical_tag = soup.find("link", rel="canonical")
if canonical_tag:
    canonical_url = canonical_tag.get("href")

# During post-flight dedup: check if canonical URL already ingested
if canonical_url and canonical_url != source_url:
    existing = await db.execute(
        text("""SELECT id FROM app.documents
                WHERE metadata->>'source_url' = :canon_url
                  AND status = 'completed'"""),
        {"canon_url": canonical_url},
    )
    if existing.fetchone():
        # This page is a URL alias: skip and link to canonical document
        return
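For experimenting with the extraction step in isolation, the following sketch reads the canonical link with Python's standard-library HTML parser rather than the soup object used above. Resolving relative href values against the crawled URL and ignoring trailing slashes when comparing are assumptions of this sketch, and the example.org URLs are placeholders.

```python
from html.parser import HTMLParser
from typing import Optional
from urllib.parse import urljoin


class _CanonicalParser(HTMLParser):
    """Collects the href of the first <link rel="canonical"> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.href: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.href is not None:
            return
        attr = dict(attrs)
        rel_tokens = (attr.get("rel") or "").lower().split()
        if "canonical" in rel_tokens and attr.get("href"):
            self.href = attr["href"]


def extract_canonical_url(html: str, source_url: str) -> Optional[str]:
    """Return the absolute canonical URL declared by the page, or None."""
    parser = _CanonicalParser()
    parser.feed(html)
    if parser.href is None:
        return None
    # Assumption: relative canonical hrefs are resolved against the crawled URL.
    return urljoin(source_url, parser.href.strip())


def is_alias(html: str, source_url: str) -> bool:
    """True when the page declares a canonical URL different from the crawled one."""
    canonical = extract_canonical_url(html, source_url)
    # Assumption: a trailing slash alone does not make two URLs different.
    return canonical is not None and canonical.rstrip("/") != source_url.rstrip("/")


page = '<html><head><link rel="canonical" href="/huishoudelijk-reglement"></head></html>'
alias_url = "https://www.example.org/werking/huishoudelijk-reglement"
canonical = extract_canonical_url(page, alias_url)
```

A page for which is_alias returns True would then go through the existing-document lookup shown above.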

Impact: In the ZOL corpus, canonical URL detection identified 73 fully-duplicated document pairs (146 documents serving identical content at different URL paths), eliminating 184 redundant chunks.

Root cause analysis of the 73 duplicates:

| Pattern | Count | Example |
|---|---|---|
| Different clean URL paths (Drupal aliases) | 66 | /huishoudelijk-reglement vs /werking/huishoudelijk-reglement |
| /node/ ID vs clean URL alias | 7 | /node/2091 vs /arts-anders-dr-helena-van-kerrebroeck |

Layer 4: Boilerplate Stripping

When: During content extraction (CSS selectors) and after chunking (text patterns)

Hospital website pages contain repeated elements: navigation menus, footer links, cookie notices, sidebar widgets, breadcrumbs. These appear in every page's extracted text and would create near-identical chunks across all documents.

The boilerplate filter uses a tiered pattern system. Tiers 1 to 3 cover HTML pages; Tier 4 covers PDF brochures:

Tier 1: Generic CSS selectors (34 selectors)

These work for any hospital website, targeting common structural elements:

GENERIC_BOILERPLATE_SELECTORS = [
    "nav", "header", "footer", "aside",      # Structural HTML5 elements
    ".breadcrumb", ".breadcrumbs",           # Navigation trails
    ".cookie-banner", ".cookie-notice",      # GDPR consent
    ".sidebar", "#sidebar", ".widget",       # Non-content sidebars
    ".social-share", ".newsletter-signup",   # Promotional chrome
    # ... 34 total selectors
]

Tier 2: DB-configurable CSS selectors (per hospital)

Site-specific selectors are stored in the hospital's pipeline configuration (hospitals.config.pipeline), not in code. This makes the system fully hospital-agnostic:

{
  "pipeline": {
    "boilerplate_css_selectors": [
      ".block-zol-header",
      ".block-zol-footer",
      ".block-tb-megamenu",
      ".region-sidebar",
      ".region-navigation"
    ]
  }
}

At runtime, get_site_config_with_db_overrides() merges generic + DB-configured selectors into a single list:

# Deep copy to avoid mutating shared config
config = deepcopy(base_config)
config.boilerplate_selectors = list(set(
    config.boilerplate_selectors + custom_css_from_db
))
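Because list(set(...)) returns the selectors in arbitrary order, a variant that keeps a stable order can make logs and diffs easier to compare. The sketch below is an alternative under that assumption, not the actual body of get_site_config_with_db_overrides(); the SiteConfig class and merge_selectors name are hypothetical.

```python
from copy import deepcopy
from dataclasses import dataclass, field
from typing import List


@dataclass
class SiteConfig:
    boilerplate_selectors: List[str] = field(default_factory=list)


def merge_selectors(base_config: SiteConfig, custom_css_from_db: List[str]) -> SiteConfig:
    """Merge generic and DB-configured selectors without mutating the shared base config."""
    config = deepcopy(base_config)
    cleaned = [s.strip() for s in custom_css_from_db if s and s.strip()]
    # dict.fromkeys drops duplicates while preserving first-seen order.
    config.boilerplate_selectors = list(dict.fromkeys(config.boilerplate_selectors + cleaned))
    return config


base = SiteConfig(["nav", "footer"])
merged = merge_selectors(base, [".block-zol-header", "nav", "  "])
```

Generic selectors stay first, site-specific ones follow in the order they were configured, and the shared base configuration is left untouched.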

Tier 3: Text-level regex patterns

Some boilerplate survives CSS selector removal because it lives inside the main content area. For ZOL, this includes the "Moeilijk leesbaar?" accessibility widget, mega-menu navigation text that appears within <main>, and repeated doctor listing links. These are handled by regex patterns applied to the extracted markdown:

# Text patterns applied after markdown conversion
boilerplate_text_patterns = [
    r"Moeilijk leesbaar\??\s*(\[.*?leesbaarheid.*?\](\(.*?\))?\s*)?",
    r"-?\s*ZOL-artsen\s+Overzicht\s+ZOL-artsen.*?(?=\n\n|\Z)",
    # ... 16 total patterns for ZOL
]
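A minimal sketch of how such patterns might be applied to extracted markdown follows. The strip_text_boilerplate function, the blank-line cleanup, and the sample page text (including the link target) are assumptions made for illustration; only the two regular expressions come from the list above.

```python
import re
from typing import List, Pattern

BOILERPLATE_TEXT_PATTERNS: List[Pattern[str]] = [
    re.compile(r"Moeilijk leesbaar\??\s*(\[.*?leesbaarheid.*?\](\(.*?\))?\s*)?"),
    re.compile(r"-?\s*ZOL-artsen\s+Overzicht\s+ZOL-artsen.*?(?=\n\n|\Z)"),
]


def strip_text_boilerplate(markdown: str, patterns: List[Pattern[str]]) -> str:
    """Remove every pattern match, then collapse the blank lines left behind."""
    for pattern in patterns:
        markdown = pattern.sub("", markdown)
    return re.sub(r"\n{3,}", "\n\n", markdown).strip()


# Hypothetical extracted page: widget text, real content, then a doctor-listing link.
sample = (
    "Moeilijk leesbaar? [Verbeter de leesbaarheid](/toegankelijkheid)\n\n"
    "# Bezoekuren\n\n"
    "Bezoek is welkom van 14u tot 20u.\n\n"
    "- ZOL-artsen Overzicht ZOL-artsen Dr. A Dr. B"
)
cleaned = strip_text_boilerplate(sample, BOILERPLATE_TEXT_PATTERNS)
```

Only the heading and the visiting-hours sentence remain in cleaned; the accessibility widget and the doctor listing are removed.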

Configuring for a new hospital

  1. Create hospital in Platform Management — zero boilerplate config needed
  2. Run initial crawl — generic selectors (Tier 1) handle common patterns automatically
  3. Review chunks — if hospital-specific chrome appears, add CSS selectors via the API:
     PUT /api/v1/hospitals/{hospital_id}/config
     {
       "config": {
         "pipeline": {
           "boilerplate_css_selectors": [".custom-header", ".custom-nav"]
         }
       }
     }

  4. Re-ingest affected URLs so the custom selectors take effect

Tier 4: PDF-specific boilerplate patterns

PDF brochures have their own boilerplate that doesn't exist in HTML: cover pages with hospital addresses, phone numbers, and campus blocks. These are invisible to CSS selectors (no DOM) and often unique enough to defeat hash-based dedup (each cover has a different title).

PDF patterns are stored separately in the hospital's pipeline config (pdf_boilerplate_patterns) and applied during PDF extraction via strip_pdf_boilerplate(). This was added after the PDF Corpus Scaling Incident where 449 brochure cover pages polluted the retrieval index.

{
  "pipeline": {
    "pdf_boilerplate_patterns": [
      "\\*{0,2}ZOL GENK\\*{0,2}\\s+\\*{0,2}ZOL MAAS EN KEMPEN\\*{0,2}[\\s\\S]*?B\\s+36\\d{2}",
      "Campus Sint-Jan\\s+Synaps Park 1\\s+B 3600 Genk"
    ]
  }
}
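Note that the patterns are double-escaped because they live in JSON: the stored text \\s becomes the regex token \s once the config is parsed. The sketch below shows one way the config could be loaded and applied. It is not the real strip_pdf_boilerplate() implementation; the function names and the brochure text are hypothetical, and only the address pattern is taken from the config above.

```python
import json
import re
from typing import List, Pattern

# Raw string so the JSON escapes reach json.loads unchanged.
PIPELINE_CONFIG_JSON = r'''
{
  "pipeline": {
    "pdf_boilerplate_patterns": [
      "Campus Sint-Jan\\s+Synaps Park 1\\s+B 3600 Genk"
    ]
  }
}
'''


def load_pdf_patterns(config_json: str) -> List[Pattern[str]]:
    """Parse the pipeline config and compile its PDF boilerplate patterns."""
    config = json.loads(config_json)
    raw_patterns = config.get("pipeline", {}).get("pdf_boilerplate_patterns", [])
    return [re.compile(p) for p in raw_patterns]


def strip_pdf_boilerplate_sketch(text: str, patterns: List[Pattern[str]]) -> str:
    """Remove cover-page boilerplate from extracted PDF text."""
    for pattern in patterns:
        text = pattern.sub("", text)
    return re.sub(r"\n{3,}", "\n\n", text).strip()


patterns = load_pdf_patterns(PIPELINE_CONFIG_JSON)
# Hypothetical extracted cover page followed by the brochure title.
cover = "Campus Sint-Jan\nSynaps Park 1\nB 3600 Genk\n\nUw opname in het ziekenhuis"
body = strip_pdf_boilerplate_sketch(cover, patterns)
```

The address block spans three lines in the extracted text, which is why the pattern joins its parts with \s+ instead of literal spaces.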

Current ZOL configuration

| Type | Count | Examples |
|---|---|---|
| Generic CSS selectors | 34 | nav, footer, .cookie-banner, .sidebar |
| DB-configured CSS selectors | 5 | .block-zol-header, .block-tb-megamenu, .region-sidebar |
| Text regex patterns (HTML+PDF) | 20 | "Moeilijk leesbaar?" widget, mega-menu nav, contact blocks |
| PDF boilerplate patterns | 2 | Campus address cover blocks, standalone address lines |

Layer 5: Cross-Document Content Hash Deduplication

When: During chunk storage (after embedding generation)

This layer detects identical text appearing across different documents — for example, a standard disclaimer paragraph in every brochure, templated introductions shared across exercise program pages, or hospital policy text copied across admission information pages.

How it works:

  1. Hash computation: Each chunk's text content is SHA-256 hashed
  2. Cross-document lookup: A single SQL query checks how many other completed documents contain chunks with the same hash
  3. Threshold filtering: Chunks whose hash already exists in N+ other completed documents are skipped

-- Single round-trip: find hashes that already exist in completed documents
SELECT dc.content_hash, COUNT(DISTINCT dc.document_id) AS doc_count
FROM app.document_chunks dc
JOIN app.documents d ON d.id = dc.document_id
WHERE dc.content_hash = ANY(:hashes)
  AND dc.document_id != CAST(:current_doc AS UUID)
  AND d.status = 'completed'
GROUP BY dc.content_hash
HAVING COUNT(DISTINCT dc.document_id) >= :threshold
Key design decisions:

  • Threshold = 1 (any chunk in another completed document is skipped). This was lowered from the initial value of 3 after analysis showed that most cross-doc duplicates occur between exactly 2 documents, not 3+
  • Completed-only filter (d.status = 'completed') ensures that re-processing a document works correctly — when a document is re-processed, its old chunks are deleted first, so the new chunks won't false-match against themselves
  • Intra-document dedup runs separately with a simple seen_hashes set, catching duplicate chunks within the same document (e.g., repeated sections in a long brochure)
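The combination of intra-document and cross-document checks can be sketched in memory as follows. The dictionary index stands in for the SQL lookup, hashing the raw chunk text without normalization is an assumption, and select_chunks_to_store is a hypothetical name.

```python
import hashlib
from typing import Dict, List, Set


def content_hash(text: str) -> str:
    """SHA-256 of the chunk text (assumption: no normalization before hashing)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def select_chunks_to_store(
    chunks: List[str],
    current_doc_id: str,
    completed_index: Dict[str, Set[str]],
    threshold: int = 1,
) -> List[str]:
    """Keep chunks that are new within this document and rare enough across others.

    completed_index maps a content hash to the ids of completed documents that
    already contain a chunk with that hash.
    """
    seen_hashes: Set[str] = set()
    kept: List[str] = []
    for text in chunks:
        digest = content_hash(text)
        if digest in seen_hashes:  # intra-document duplicate
            continue
        seen_hashes.add(digest)
        other_docs = completed_index.get(digest, set()) - {current_doc_id}
        if len(other_docs) >= threshold:  # cross-document duplicate
            continue
        kept.append(text)
    return kept


index = {content_hash("Disclaimer"): {"doc-a"}}
strict = select_chunks_to_store(["Intro", "Disclaimer", "Intro", "Details"], "doc-b", index)
lenient = select_chunks_to_store(["Intro", "Disclaimer", "Intro", "Details"], "doc-b", index, threshold=3)
```

With the default threshold of 1 the shared disclaimer is dropped; with a threshold of 3 it is kept because only one other document contains it. The repeated "Intro" chunk is removed in both cases by the seen_hashes set.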

Partial-overlap analysis (ZOL corpus):

| Content type | Shared chunks | Source |
|---|---|---|
| BeweegSaam exercise templates | ~20 | Templated intro/guidelines across 15+ exercise variants |
| Hospital policy paragraphs | ~15 | Standard admission/visitor rules shared across campus pages |
| Training program curricula | ~10 | Yearly editions sharing core curriculum content |
| Department sidebar navigation | ~15 | Cross-linked department lists |

Layer 6: Language Filtering

When: During chunk storage

Hospital content is primarily in Dutch, but extracted text sometimes contains fragments in other languages (French/German headers, English technical terms, untranslated CMS boilerplate). These fragments reduce retrieval quality for Dutch queries.

The language filter uses the Lingua library for language detection:

  1. For each chunk with 30+ characters of content, detect the primary language
  2. If the detected language is NOT in the expected set (Dutch by default), skip the chunk
  3. Short chunks (under 30 characters) bypass the filter; they're typically headings or labels

if lang_detector and expected_lang_codes:
    detected = lang_detector.detect_language_of(content)
    if detected and detected.iso_code_639_1.name.lower() not in expected_lang_codes:
        language_skipped += 1
        continue
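The three steps, including the short-chunk bypass that the fragment above does not show, can be sketched with a pluggable detector. The function names and the stub detector are hypothetical; in practice the detect callable could wrap a Lingua detector and return the lower-cased ISO 639-1 code.

```python
from typing import Callable, List, Optional, Set


def parse_expected_languages(raw: str) -> Set[str]:
    """Turn a comma-separated setting such as 'nl' or 'nl, fr' into a set of codes."""
    return {code.strip().lower() for code in raw.split(",") if code.strip()}


def filter_chunks_by_language(
    chunks: List[str],
    detect: Callable[[str], Optional[str]],
    expected_lang_codes: Set[str],
    min_chars: int = 30,
) -> List[str]:
    """Drop chunks whose detected language is not expected; short chunks always pass."""
    kept: List[str] = []
    for content in chunks:
        if len(content.strip()) < min_chars:
            kept.append(content)  # headings and labels bypass detection
            continue
        code = detect(content)
        if code is not None and code.lower() not in expected_lang_codes:
            continue  # confidently detected as an unexpected language
        kept.append(content)
    return kept


def stub_detect(text: str) -> Optional[str]:
    """Stand-in detector for this sketch only; not a real language model."""
    return "fr" if "hôpital" in text else "nl"


expected = parse_expected_languages("nl")
chunks = [
    "Contact",
    "Bienvenue à l'hôpital, voici les heures de visite.",
    "Welkom in het ziekenhuis, dit zijn de bezoekuren.",
]
kept = filter_chunks_by_language(chunks, stub_detect, expected)
```

Chunks for which the detector returns no language are kept, matching the "if detected and ..." guard in the fragment above.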

Configuration: INGESTION_LANGUAGE_FILTER_ENABLED=true, INGESTION_EXPECTED_LANGUAGES=nl

Layer 7: Re-Ingestion Protection

When: When a page is re-crawled or updated

When a URL is re-ingested (e.g., after content update), the system:

  1. Checks if the content hash has changed since last ingestion
  2. If unchanged, skips processing entirely (saves LLM costs for contextual embeddings)
  3. If changed, deletes old chunks and re-processes from scratch

# Content hash comparison: skip if unchanged
if record.content_hash == new_content_hash:
    logger.info("Content unchanged for %s, skipping re-ingestion", url)
    return
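The full skip-or-replace flow can be sketched with a dictionary standing in for the database. Hashing the extracted text with SHA-256, the store layout, and the reingest and split_paragraphs names are assumptions of this illustration.

```python
import hashlib
from typing import Callable, Dict, List


def content_hash(text: str) -> str:
    """Hash of the extracted content (assumption: SHA-256 over the UTF-8 text)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def reingest(
    store: Dict[str, dict],
    url: str,
    content: str,
    process: Callable[[str], List[str]],
) -> str:
    """Skip unchanged content; otherwise drop the old chunks and process again."""
    new_hash = content_hash(content)
    record = store.get(url)
    if record is not None and record["content_hash"] == new_hash:
        return "skipped"  # nothing changed, no chunking or embedding cost
    store.pop(url, None)  # delete old chunks before re-processing
    store[url] = {"content_hash": new_hash, "chunks": process(content)}
    return "processed"


calls: List[str] = []


def split_paragraphs(content: str) -> List[str]:
    """Stand-in for the expensive chunking and embedding step."""
    calls.append(content)
    return [p for p in content.split("\n\n") if p.strip()]


store: Dict[str, dict] = {}
url = "https://example.org/opname"
first = reingest(store, url, "Deel 1\n\nDeel 2", split_paragraphs)
second = reingest(store, url, "Deel 1\n\nDeel 2", split_paragraphs)
third = reingest(store, url, "Deel 1\n\nDeel 2 (herzien)", split_paragraphs)
```

The second call returns early without invoking the processing step, which is where the cost saving comes from.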

Measured Impact

Analysis of the ZOL corpus (March 2026) quantified the impact of each deduplication layer:

| Layer | What It Prevents | Savings | Method |
|---|---|---|---|
| 1. URL dedup | Same URL ingested twice | ~200 URLs | DB unique constraint |
| 2. Title dedup | CMS path aliases (same title) | ~50 documents | Pre/post-flight SQL check |
| 3. Canonical URL | Drupal URL aliases (different titles) | 184 chunks (73 documents) | <link rel="canonical"> extraction |
| 4. Boilerplate stripping | Nav/footer/menu chunks | ~500 chunks | CSS selectors + regex |
| 5. Cross-doc hash dedup | Repeated paragraphs across pages | ~60 chunks | SHA-256 hash matching |
| 6. Language filtering | Non-Dutch fragments | ~30 chunks | Lingua language detection |
| 7. Re-ingestion protection | Unchanged content on re-crawl | Varies | Content hash comparison |

Before dedup: The raw ZOL corpus would produce ~5,400 chunks. After all layers: 4,408 clean, unique chunks. That is roughly an 18% reduction in chunk count, removing redundant content that would otherwise pollute retrieval results and waste LLM context tokens.
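The reduction figure follows directly from the two chunk counts. Since the "before" count is itself approximate, the percentage should be read as an estimate.

```python
chunks_before = 5400  # approximate raw chunk count quoted above
chunks_after = 4408   # chunks remaining after all seven layers

removed = chunks_before - chunks_after
reduction_pct = 100 * removed / chunks_before
```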

Production corpus has grown

The 5,400 → 4,408 figures are from the March 2026 HTML-only corpus and remain the cleanest "before vs after" comparison for the dedup layers. Following the PDF Corpus Scaling Incident and the nightly auto-ingest going live on the pilot, the production corpus is 5,841 completed documents / ~10,430 chunks (May 2026) — composed of HTML pages plus 573 PDF brochures with PDF-specific boilerplate stripping (Tier 4) operational. The dedup-layer mechanics are unchanged; only the absolute counts have grown.

Configuration Reference

| Setting | Default | Purpose |
|---|---|---|
| BOILERPLATE_DEDUP_MIN_DOCUMENTS | 1 | Cross-doc dedup threshold (1 = exact dedup, 3+ = boilerplate only) |
| INGESTION_LANGUAGE_FILTER_ENABLED | true | Enable chunk-level language filtering |
| INGESTION_EXPECTED_LANGUAGES | nl | Comma-separated expected language codes |
| Hospital pipeline config | Per hospital | boilerplate_css_selectors and boilerplate_text_patterns |