Content Deduplication

Hospital websites frequently contain duplicate or near-duplicate content: the same brochure accessible via multiple URL paths, navigation menus repeated on every page, footer text appearing in every document, and content published in multiple languages. Without deduplication, the RAG pipeline would retrieve redundant chunks that waste context window tokens and confuse the LLM.

The ZOL system implements seven deduplication layers, each targeting a different type of redundancy:

Layer 1: URL-Level Deduplication

When: During crawl discovery (Stage 2)

The crawler maintains a crawled_urls table with a unique constraint on (url, tenant_id). When a URL is discovered multiple times for the same tenant (e.g., from different sitemap entries or internal links), only the first discovery creates a row. Subsequent discoveries update metadata but don't create duplicate entries.

-- Duplicate URLs are caught at INSERT time
INSERT INTO app.crawled_urls (url, tenant_id, status, ...)
VALUES (:url, :tenant_id, :status, ...)
ON CONFLICT (url, tenant_id) DO UPDATE SET
    last_crawled_at = now(),
    http_status = EXCLUDED.http_status;

This prevents the same page from being ingested twice when the URL is identical, but does not catch cases where different URLs serve the same content (handled by Layers 2 and 3).
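The upsert behaviour can be tried locally with a minimal sketch. It assumes SQLite as a stand-in for PostgreSQL and a simplified, hypothetical table; the `discoveries` counter is not part of the real schema and is only there to make the second discovery visible.

```python
import sqlite3

# Hypothetical, simplified stand-in for app.crawled_urls (SQLite, not PostgreSQL).
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE crawled_urls (
        url         TEXT NOT NULL,
        tenant_id   TEXT NOT NULL,
        http_status INTEGER,
        discoveries INTEGER NOT NULL DEFAULT 1,  -- illustrative only
        UNIQUE (url, tenant_id)
    )
    """
)


def record_discovery(url: str, tenant_id: str, http_status: int) -> None:
    """Insert the URL once; later discoveries only refresh its metadata."""
    conn.execute(
        """
        INSERT INTO crawled_urls (url, tenant_id, http_status)
        VALUES (?, ?, ?)
        ON CONFLICT (url, tenant_id) DO UPDATE SET
            http_status = excluded.http_status,
            discoveries = discoveries + 1
        """,
        (url, tenant_id, http_status),
    )


record_discovery("https://example.org/brochure", "tenant-a", 200)  # found via sitemap
record_discovery("https://example.org/brochure", "tenant-a", 304)  # found via internal link

rows = conn.execute(
    "SELECT url, http_status, discoveries FROM crawled_urls"
).fetchall()
print(rows)  # a single row, updated by the second discovery
```

The table still holds one row after two discoveries, which is the property Layer 1 relies on.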

Layer 2: Title-Level Deduplication

When: During ingestion (before and after content extraction)

The ingestion service performs two title-based dedup checks:

Pre-flight check — Before fetching content, if the crawled URL already has a title (from the sitemap or a previous crawl), the system checks for an existing completed document with the same title:

# Pre-flight: skip if a completed document with this title already exists
existing = await db.execute(
    select(Document).where(
        Document.title == crawled_url.title,
        Document.tenant_id == tenant_id,
        Document.status == "completed",
    ).limit(1)
)
existing_doc = existing.scalars().first()
if existing_doc:
    crawled_url.document_id = existing_doc.id  # Link URL to existing document
    return  # Skip fetching: saves the HTTP request and LLM costs

Post-flight check — After extracting content (when the page title is definitively known), a second title check catches cases where the pre-flight title was absent or different from the extracted title.

This catches cases where the same content is published under different URLs but with identical page titles — common in Drupal CMS where content is accessible via both /friendly-path and /node/1234.
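To illustrate why two checks are needed, here is a small in-memory sketch. The `Doc` dataclass and `find_title_duplicate` helper are hypothetical stand-ins for the SQL lookup, not the service's actual code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Doc:  # hypothetical, minimal view of a stored document
    id: int
    tenant_id: str
    title: str
    status: str


def find_title_duplicate(
    docs: list[Doc], tenant_id: str, title: Optional[str]
) -> Optional[Doc]:
    """Return a completed document of the same tenant with an identical title."""
    if not title:
        return None  # no title known yet, so nothing can match
    for doc in docs:
        if doc.tenant_id == tenant_id and doc.status == "completed" and doc.title == title:
            return doc
    return None


docs = [Doc(1, "zol", "Huishoudelijk reglement", "completed")]

# Pre-flight: the sitemap supplied no title, so the page has to be fetched.
pre = find_title_duplicate(docs, "zol", None)

# Post-flight: the extracted title is now known and matches document 1.
post = find_title_duplicate(docs, "zol", "Huishoudelijk reglement")
```

In this example the pre-flight check cannot fire, and the duplicate is only caught once the title has been extracted.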

Layer 3: Canonical URL Detection

When: During content extraction (after fetching HTML)

This layer addresses a fundamental CMS problem: Drupal (and other CMSs) frequently expose the same content at multiple URL paths. For example, ZOL's "Huishoudelijk reglement" page was accessible at both /huishoudelijk-reglement and /werking/huishoudelijk-reglement — identical content, different URLs, different page titles.

How it works:

During HTML extraction, the system reads the <link rel="canonical"> tag from the page's <head>
If the canonical URL differs from the crawled URL, this indicates the page is an alias
The system checks whether a document already exists for the canonical URL
If found, the current page is skipped and linked to the existing document

# Extract canonical URL from HTML <head>
canonical_url = None  # stays None when the page declares no canonical link
canonical_tag = soup.find("link", rel="canonical")
if canonical_tag:
    canonical_url = canonical_tag.get("href")

# During post-flight dedup: check if canonical URL already ingested
if canonical_url and canonical_url != source_url:
    existing = await db.execute(
        text("""SELECT id FROM app.documents
                WHERE metadata->>'source_url' = :canon_url
                  AND status = 'completed'"""),
        {"canon_url": canonical_url},
    )
    if existing.fetchone():
        # This page is a URL alias: skip and link to the canonical document
        return
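The alias test itself can be exercised without a database. The following sketch uses only the standard library instead of BeautifulSoup; `CanonicalFinder` and `canonical_alias_target` are hypothetical names, and resolving relative `href` values with `urljoin` is an assumption that the snippet above does not make.

```python
from html.parser import HTMLParser
from typing import Optional
from urllib.parse import urljoin


class CanonicalFinder(HTMLParser):
    """Collects the href of the first <link rel="canonical"> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.canonical: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attr = dict(attrs)
        rel_tokens = (attr.get("rel") or "").lower().split()
        if "canonical" in rel_tokens and attr.get("href"):
            self.canonical = attr["href"]


def canonical_alias_target(html: str, source_url: str) -> Optional[str]:
    """Return the canonical URL if the page declares itself an alias, else None."""
    finder = CanonicalFinder()
    finder.feed(html)
    if finder.canonical is None:
        return None
    canonical = urljoin(source_url, finder.canonical)  # assumption: resolve relative hrefs
    return canonical if canonical != source_url else None


page = (
    '<html><head><link rel="canonical" '
    'href="/arts-anders-dr-helena-van-kerrebroeck"></head><body></body></html>'
)
alias_of = canonical_alias_target(page, "https://example.org/node/2091")
not_alias = canonical_alias_target(
    page, "https://example.org/arts-anders-dr-helena-van-kerrebroeck"
)
```

A non-`None` result is the signal to look up an existing document for that URL before ingesting the page.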

Impact: In the ZOL corpus, canonical URL detection identified 73 fully-duplicated document pairs (146 documents serving identical content at different URL paths), eliminating 184 redundant chunks.

Root cause analysis of the 73 duplicates:

Pattern	Count	Example
Different clean URL paths (Drupal aliases)	66	`/huishoudelijk-reglement` vs `/werking/huishoudelijk-reglement`
`/node/` ID vs clean URL alias	7	`/node/2091` vs `/arts-anders-dr-helena-van-kerrebroeck`

Layer 4: Boilerplate Stripping

When: During content extraction (CSS selectors) and after chunking (text patterns)

Hospital website pages contain repeated elements: navigation menus, footer links, cookie notices, sidebar widgets, breadcrumbs. These appear in every page's extracted text and would create near-identical chunks across all documents.

The boilerplate filter uses a four-tier pattern system:

Tier 1: Generic CSS selectors (34 selectors)

These work for any hospital website, targeting common structural elements:

GENERIC_BOILERPLATE_SELECTORS = [
    "nav", "header", "footer", "aside",        # Structural HTML5 elements
    ".breadcrumb", ".breadcrumbs",              # Navigation trails
    ".cookie-banner", ".cookie-notice",         # GDPR consent
    ".sidebar", "#sidebar", ".widget",          # Non-content sidebars
    ".social-share", ".newsletter-signup",      # Promotional chrome
    # ... 34 total selectors
]

Tier 2: DB-configurable CSS selectors (per hospital)

Site-specific selectors are stored in the hospital's pipeline configuration (hospitals.config.pipeline), not in code. This makes the system fully hospital-agnostic:

{
  "pipeline": {
    "boilerplate_css_selectors": [
      ".block-zol-header",
      ".block-zol-footer",
      ".block-tb-megamenu",
      ".region-sidebar",
      ".region-navigation"
    ]
  }
}

At runtime, get_site_config_with_db_overrides() merges generic + DB-configured selectors into a single list:

# Deep copy to avoid mutating shared config
config = deepcopy(base_config)
config.boilerplate_selectors = list(set(
    config.boilerplate_selectors + custom_css_from_db
))
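The merge above deduplicates with `list(set(...))`, which does not guarantee the order of the resulting selectors. If a stable order is wanted, for example for reproducible logs, one possible variant is sketched below. `SiteConfig` and `merge_selectors` are hypothetical stand-ins, not the real `get_site_config_with_db_overrides()`.

```python
from copy import deepcopy
from dataclasses import dataclass, field


@dataclass
class SiteConfig:  # hypothetical stand-in for the real config object
    boilerplate_selectors: list[str] = field(default_factory=list)


def merge_selectors(base_config: SiteConfig, custom_css_from_db: list[str]) -> SiteConfig:
    """Merge generic and DB-configured selectors without mutating the base config."""
    config = deepcopy(base_config)
    # dict.fromkeys drops duplicates while keeping first-seen order
    config.boilerplate_selectors = list(
        dict.fromkeys(config.boilerplate_selectors + custom_css_from_db)
    )
    return config


base = SiteConfig(["nav", "footer", ".sidebar"])
merged = merge_selectors(base, [".block-zol-header", "footer"])
```

The shared base config is left untouched, and the duplicate `footer` entry appears only once in the merged list.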

Tier 3: Text-level regex patterns

Some boilerplate survives CSS selector removal because it lives inside the main content area. For ZOL, this includes the "Moeilijk leesbaar?" accessibility widget, mega-menu navigation text that appears within <main>, and repeated doctor listing links. These are handled by regex patterns applied to the extracted markdown:

# Text patterns applied after markdown conversion
boilerplate_text_patterns = [
    r"Moeilijk leesbaar\??\s*(\[.*?leesbaarheid.*?\](\(.*?\))?\s*)?",
    r"-?\s*ZOL-artsen\s+Overzicht\s+ZOL-artsen.*?(?=\n\n|\Z)",
    # ... 16 total patterns for ZOL
]
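How such patterns might be applied is sketched below. The helper name `strip_text_boilerplate`, the `re.DOTALL` flag, and the blank-line cleanup are assumptions; the source does not state which flags or post-processing the real filter uses. The sample markdown is invented for illustration.

```python
import re

# The two patterns shown above; the remaining ZOL patterns are not reproduced here.
boilerplate_text_patterns = [
    r"Moeilijk leesbaar\??\s*(\[.*?leesbaarheid.*?\](\(.*?\))?\s*)?",
    r"-?\s*ZOL-artsen\s+Overzicht\s+ZOL-artsen.*?(?=\n\n|\Z)",
]


def strip_text_boilerplate(markdown: str, patterns: list[str]) -> str:
    """Remove every pattern match, then collapse the blank lines left behind."""
    for pattern in patterns:
        # Assumption: DOTALL, so a match may span the lines of a link list
        markdown = re.sub(pattern, "", markdown, flags=re.DOTALL)
    return re.sub(r"\n{3,}", "\n\n", markdown).strip()


sample = (
    "Moeilijk leesbaar? [Verbeter de leesbaarheid](/leesbaarheid)\n\n"
    "# Bezoekuren\n\n"
    "Dagelijks van 14u tot 20u.\n\n"
    "- ZOL-artsen Overzicht ZOL-artsen\n[Dr. A](/a)\n[Dr. B](/b)"
)
cleaned = strip_text_boilerplate(sample, boilerplate_text_patterns)
```

Only the heading and the visiting-hours sentence remain in `cleaned`; the accessibility widget and the doctor listing are removed.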

Configuring for a new hospital

Create hospital in Platform Management — zero boilerplate config needed
Run initial crawl — generic selectors (Tier 1) handle common patterns automatically
Review chunks — if hospital-specific chrome appears, add CSS selectors via the API:

PUT /api/v1/hospitals/{hospital_id}/config
{
  "config": {
    "pipeline": {
      "boilerplate_css_selectors": [".custom-header", ".custom-nav"]
    }
  }
}

Re-ingest affected URLs — custom patterns now active
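The configuration call from the steps above could be scripted roughly as follows. The base URL and hospital ID are placeholders, authentication is omitted because the source does not describe it, and the request is only built, not sent.

```python
import json
import urllib.request


def build_config_update(
    base_url: str, hospital_id: str, selectors: list[str]
) -> urllib.request.Request:
    """Build the PUT request that stores hospital-specific CSS selectors."""
    body = {"config": {"pipeline": {"boilerplate_css_selectors": selectors}}}
    return urllib.request.Request(
        url=f"{base_url}/api/v1/hospitals/{hospital_id}/config",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


# Placeholder host and ID; replace with real values before sending.
request = build_config_update(
    "https://rag.example.org", "hospital-123", [".custom-header", ".custom-nav"]
)
# urllib.request.urlopen(request) would send it once credentials are added.
```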

Tier 4: PDF-specific boilerplate patterns

PDF brochures have their own boilerplate that doesn't exist in HTML: cover pages with hospital addresses, phone numbers, and campus blocks. These are invisible to CSS selectors (no DOM) and often unique enough to defeat hash-based dedup (each cover has a different title).

PDF patterns are stored separately in the hospital's pipeline config (pdf_boilerplate_patterns) and applied during PDF extraction via strip_pdf_boilerplate(). This was added after the PDF Corpus Scaling Incident where 449 brochure cover pages polluted the retrieval index.

{
  "pipeline": {
    "pdf_boilerplate_patterns": [
      "\\*{0,2}ZOL GENK\\*{0,2}\\s+\\*{0,2}ZOL MAAS EN KEMPEN\\*{0,2}[\\s\\S]*?B\\s+36\\d{2}",
      "Campus Sint-Jan\\s+Synaps Park 1\\s+B 3600 Genk"
    ]
  }
}
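A minimal sketch of how these patterns might be applied is shown below. The function name `strip_pdf_boilerplate` comes from the text, but its body here is an assumption, and the sample cover text is invented for illustration.

```python
import json
import re

# The JSON fragment shown above, kept as raw text so the escaping is identical.
config_json = r"""
{
  "pipeline": {
    "pdf_boilerplate_patterns": [
      "\\*{0,2}ZOL GENK\\*{0,2}\\s+\\*{0,2}ZOL MAAS EN KEMPEN\\*{0,2}[\\s\\S]*?B\\s+36\\d{2}",
      "Campus Sint-Jan\\s+Synaps Park 1\\s+B 3600 Genk"
    ]
  }
}
"""


def strip_pdf_boilerplate(text: str, patterns: list[str]) -> str:
    """Assumed behaviour: delete each match, then tidy up leftover blank lines."""
    for pattern in patterns:
        text = re.sub(pattern, "", text)
    return re.sub(r"\n{3,}", "\n\n", text).strip()


patterns = json.loads(config_json)["pipeline"]["pdf_boilerplate_patterns"]

# Invented sample: a cover block, one content sentence, and a standalone address line.
extracted = (
    "**ZOL GENK** **ZOL MAAS EN KEMPEN**\nSynaps Park 1\nB 3600\n\n"
    "Welkom op de afdeling.\n\n"
    "Campus Sint-Jan Synaps Park 1 B 3600 Genk"
)
cleaned = strip_pdf_boilerplate(extracted, patterns)
```

Note that the JSON escaping (`\\s`, `\\d`) becomes ordinary regex escapes once the config is parsed.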

Current ZOL configuration

Type	Count	Examples
Generic CSS selectors	34	`nav`, `footer`, `.cookie-banner`, `.sidebar`
DB-configured CSS selectors	5	`.block-zol-header`, `.block-tb-megamenu`, `.region-sidebar`
Text regex patterns (HTML+PDF)	20	"Moeilijk leesbaar?" widget, mega-menu nav, contact blocks
PDF boilerplate patterns	2	Campus address cover blocks, standalone address lines

Layer 5: Cross-Document Content Hash Deduplication

When: During chunk storage (after embedding generation)

This layer detects identical text appearing across different documents — for example, a standard disclaimer paragraph in every brochure, templated introductions shared across exercise program pages, or hospital policy text copied across admission information pages.

How it works:

Hash computation: Each chunk's text content is SHA-256 hashed
Cross-document lookup: A single SQL query checks how many other completed documents contain chunks with the same hash
Threshold filtering: Chunks whose hash already exists in N+ other completed documents are skipped

-- Single round-trip: find hashes that already exist in completed documents
SELECT dc.content_hash, COUNT(DISTINCT dc.document_id) AS doc_count
FROM app.document_chunks dc
JOIN app.documents d ON d.id = dc.document_id
WHERE dc.content_hash = ANY(:hashes)
  AND dc.document_id != CAST(:current_doc AS UUID)
  AND d.status = 'completed'
GROUP BY dc.content_hash
HAVING COUNT(DISTINCT dc.document_id) >= :threshold

Key design decisions:

Threshold = 1: a chunk is skipped as soon as its hash appears in at least one other completed document. This was lowered from the initial value of 3 after analysis showed that most cross-document duplicates occur between exactly 2 documents, not 3 or more
Completed-only filter (d.status = 'completed') ensures that re-processing a document works correctly — when a document is re-processed, its old chunks are deleted first, so the new chunks won't false-match against themselves
Intra-document dedup runs separately with a simple seen_hashes set, catching duplicate chunks within the same document (e.g., repeated sections in a long brochure)
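The interaction between the threshold, the intra-document `seen_hashes` set, and the cross-document lookup can be sketched in memory. The `index` dictionary below is a hypothetical replacement for the SQL query, and hashing the raw chunk text without normalization is an assumption.

```python
import hashlib


def content_hash(text: str) -> str:
    """SHA-256 hex digest of the chunk text (assumption: no normalization)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def select_chunks_to_store(
    chunks: list[str],
    current_doc: str,
    index: dict[str, set[str]],
    threshold: int = 1,
) -> list[str]:
    """index maps a content hash to the completed document ids that contain it."""
    seen_hashes: set[str] = set()
    kept = []
    for chunk in chunks:
        h = content_hash(chunk)
        if h in seen_hashes:
            continue  # intra-document duplicate
        seen_hashes.add(h)
        other_docs = index.get(h, set()) - {current_doc}
        if len(other_docs) >= threshold:
            continue  # already present in enough other completed documents
        kept.append(chunk)
    return kept


index = {content_hash("Standard disclaimer."): {"doc-1"}}
chunks = ["Standard disclaimer.", "Unique A", "Unique A", "Unique B"]

strict = select_chunks_to_store(chunks, "doc-2", index, threshold=1)
lenient = select_chunks_to_store(chunks, "doc-2", index, threshold=3)
```

With a threshold of 1 the shared disclaimer is dropped; with the earlier value of 3 it would be kept, because it appears in only one other document.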

Partial-overlap analysis (ZOL corpus):

Content type	Shared chunks	Source
BeweegSaam exercise templates	~20	Templated intro/guidelines across 15+ exercise variants
Hospital policy paragraphs	~15	Standard admission/visitor rules shared across campus pages
Training program curricula	~10	Yearly editions sharing core curriculum content
Department sidebar navigation	~15	Cross-linked department lists

Layer 6: Language Filtering

When: During chunk storage

Hospital content is primarily in Dutch, but extracted text sometimes contains fragments in other languages (French/German headers, English technical terms, untranslated CMS boilerplate). These fragments reduce retrieval quality for Dutch queries.

The language filter uses the Lingua library for language detection:

For each chunk with 30+ characters of content, detect the primary language
If the detected language is NOT in the expected set (Dutch by default), skip the chunk
Short chunks (under 30 characters) bypass the filter — they're typically headings or labels

if lang_detector and expected_lang_codes:
    detected = lang_detector.detect_language_of(content)
    if detected and detected.iso_code_639_1.name.lower() not in expected_lang_codes:
        language_skipped += 1
        continue

Configuration: INGESTION_LANGUAGE_FILTER_ENABLED=true, INGESTION_EXPECTED_LANGUAGES=nl
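The filter rules can be sketched independently of Lingua by passing the detector in as a function. Everything below is illustrative: `filter_by_language` is a hypothetical helper and `toy_detect` is a deliberately naive stand-in for a real language detector.

```python
from typing import Callable, Optional

MIN_CHARS = 30  # shorter chunks bypass the filter


def filter_by_language(
    chunks: list[str],
    detect: Callable[[str], Optional[str]],
    expected: frozenset = frozenset({"nl"}),
) -> tuple[list[str], int]:
    """Keep chunks in an expected language; return (kept chunks, skipped count)."""
    kept, skipped = [], 0
    for content in chunks:
        if len(content) >= MIN_CHARS:
            code = detect(content)
            if code and code not in expected:
                skipped += 1
                continue
        kept.append(content)
    return kept, skipped


def toy_detect(text: str) -> Optional[str]:
    """Naive stand-in: treats any text containing the word 'the' as English."""
    return "en" if " the " in f" {text.lower()} " else "nl"


chunks = [
    "Opening hours of the ward",  # English, but under 30 characters
    "De bezoekuren zijn dagelijks van 14 tot 20 uur.",
    "Please contact the reception desk for more information.",
]
kept, skipped = filter_by_language(chunks, toy_detect)
```

In production the `detect` argument would wrap the Lingua call shown above; the short English label is kept because it never reaches the detector.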

Layer 7: Re-Ingestion Protection

When: When a page is re-crawled or updated

When a URL is re-ingested (e.g., after content update), the system:

Checks if the content hash has changed since last ingestion
If unchanged, skips processing entirely (saves LLM costs for contextual embeddings)
If changed, deletes old chunks and re-processes from scratch

# Content hash comparison — skip if unchanged
if record.content_hash == new_content_hash:
    logger.info("Content unchanged for %s, skipping re-ingestion", url)
    return
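The three outcomes described above (first ingestion, unchanged, changed) can be sketched as a small decision function. `reingest_action` is a hypothetical helper, and SHA-256 over the full page content is an assumption; the source only states SHA-256 for chunk-level hashes.

```python
import hashlib
from typing import Optional


def reingest_action(stored_hash: Optional[str], new_content: str) -> tuple[str, str]:
    """Return (action, new_hash) where action is 'ingest', 'skip', or 'replace'."""
    new_hash = hashlib.sha256(new_content.encode("utf-8")).hexdigest()
    if stored_hash is None:
        return "ingest", new_hash   # never ingested before
    if stored_hash == new_hash:
        return "skip", new_hash     # unchanged: no re-chunking or embedding
    return "replace", new_hash      # changed: delete old chunks, re-process


first_action, h1 = reingest_action(None, "Bezoekuren: 14u tot 20u")
second_action, _ = reingest_action(h1, "Bezoekuren: 14u tot 20u")
third_action, _ = reingest_action(h1, "Bezoekuren: 15u tot 20u")
```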

Measured Impact

Analysis of the ZOL corpus (March 2026) quantified the impact of each deduplication layer:

Layer	What It Prevents	Savings	Method
1. URL dedup	Same URL ingested twice	~200 URLs	DB unique constraint
2. Title dedup	CMS path aliases (same title)	~50 documents	Pre/post-flight SQL check
3. Canonical URL	Drupal URL aliases (different titles)	184 chunks (73 documents)	`<link rel="canonical">` extraction
4. Boilerplate stripping	Nav/footer/menu chunks	~500 chunks	CSS selectors + regex
5. Cross-doc hash dedup	Repeated paragraphs across pages	~60 chunks	SHA-256 hash matching
6. Language filtering	Non-Dutch fragments	~30 chunks	Lingua language detection
7. Re-ingestion protection	Unchanged content on re-crawl	Varies	Content hash comparison

Before dedup: The raw ZOL corpus would produce ~5,400 chunks. After all layers: 4,408 clean, unique chunks. That is roughly an 18% reduction in chunk count, removing redundant content that would otherwise pollute retrieval results and waste LLM context tokens.
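The reduction quoted above follows directly from the two chunk counts. A short check, keeping in mind that the "before" figure is itself an approximation:

```python
before_chunks = 5_400  # approximate raw chunk count quoted in the text
after_chunks = 4_408   # chunk count after all layers

removed = before_chunks - after_chunks
reduction_pct = 100 * removed / before_chunks

print(f"{removed} chunks removed ({reduction_pct:.1f}% of the raw corpus)")
```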

Production corpus has grown

The 5,400 → 4,408 figures are from the March 2026 HTML-only corpus and remain the cleanest "before vs after" comparison for the dedup layers. Following the PDF Corpus Scaling Incident and the nightly auto-ingest going live on the pilot, the production corpus is 5,841 completed documents / ~10,430 chunks (May 2026) — composed of HTML pages plus 573 PDF brochures with PDF-specific boilerplate stripping (Tier 4) operational. The dedup-layer mechanics are unchanged; only the absolute counts have grown.

Configuration Reference

Setting	Default	Purpose
`BOILERPLATE_DEDUP_MIN_DOCUMENTS`	1	Cross-doc dedup threshold (1 = exact dedup, 3+ = boilerplate only)
`INGESTION_LANGUAGE_FILTER_ENABLED`	true	Enable chunk-level language filtering
`INGESTION_EXPECTED_LANGUAGES`	nl	Comma-separated expected language codes
Hospital pipeline config	Per hospital	`boilerplate_css_selectors`, `boilerplate_text_patterns`, and `pdf_boilerplate_patterns`

Related

Document Ingestion Pipeline: Full ingestion architecture
Prompt Engineering: How the LLM handles remaining duplicates in context
Taxonomy Extraction Pipeline: Entity-level deduplication
