Entity Extraction

Two Extraction Paths

This document describes the content ingestion extraction path (regex-based pattern matching during document processing). For the SP-4 Entity Resolution Pipeline that uses LLM extraction on hub pages, see Entity Resolution & Deduplication.

Populating the knowledge graph (Hogan et al., 2021) requires extracting structured entities from unstructured Dutch medical text. Named entity recognition (NER) is a foundational NLP task (Nadeau & Sekine, 2007), and choosing regex-based extraction over LLM-based extraction trades flexibility for determinism. The ZOL system deliberately makes that trade: it extracts with regex and accepts lower flexibility in exchange for deterministic results.

The Challenge

Hospital content arrives as unstructured text -- brochures, web pages, doctor profiles. Within this text, structured entities are embedded naturally:

"Dr. Van den Berg is orthopedisch chirurg bij de afdeling Orthopedie op campus Sint-Jan. Hij is gespecialiseerd in sportgeneeskunde en voert onder andere knieprotheses uit."

From this single paragraph, the system must extract:

  • Doctor: Van den Berg
  • Department: Orthopedie
  • Campus: Sint-Jan
  • Specialty: Sportgeneeskunde
  • Treatment: Knieprothese

LLM vs. Regex Extraction

Trade-off Analysis

Criterion | LLM Extraction | Regex Extraction
Flexibility | High -- handles novel patterns | Low -- only known patterns
Cost | ~$0.001 per page | Zero
Latency | ~2 seconds per page | ~5 milliseconds per page
Determinism | Non-deterministic | Fully deterministic
Dutch language | Good with modern models | Excellent (patterns designed for Dutch)
Accuracy on known patterns | ~90-95% | ~98%+
Accuracy on novel patterns | ~80-90% | 0% (fails silently)
Testability | Difficult (non-deterministic) | Trivial (deterministic)

The Decision: Regex for ZOL

The ZOL system uses regex-based extraction for a specific reason: the medical entity patterns in Dutch hospital content are well-defined and predictable. Doctor names follow "Dr." or "dr." prefixes. Department names come from a known vocabulary. Campus names are a fixed set of four. Treatment and condition names follow established Dutch medical terminology.

Design Rationale

In a domain where entities follow predictable linguistic patterns, regex extraction delivers higher precision, zero cost, deterministic behavior, and trivially testable results. LLM extraction's flexibility advantage does not materialize because the entity patterns are already known.

Extraction Patterns

Doctor Extraction

Dutch medical professionals are typically referenced with honorific prefixes:

  • Pattern: Dr. or dr. followed by a capitalized surname
  • Variants: Dr. Van den Berg, dr. Peeters, Professor dr. Janssen
  • Edge cases: Multi-part surnames common in Dutch/Flemish (Van, De, Den, Van den)
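
The pattern above can be sketched as a single regular expression. This is a minimal illustration, not the production pattern: the name DOCTOR_RE, the helper extract_doctors, and the exact particle list are assumptions chosen to cover the variants listed.

```python
import re

# Hypothetical pattern: optional "Professor"/"Prof." title, then "Dr."/"dr.",
# then zero or more Dutch/Flemish particles, then a capitalized surname.
DOCTOR_RE = re.compile(
    r"\b(?:Prof(?:essor|\.)?\s+)?[Dd]r\.\s+"
    r"(?P<surname>(?:(?:[Vv]an|[Dd]e[nr]?)\s+)*[A-Z][\w'-]+)"
)


def extract_doctors(text: str) -> list[str]:
    """Return the surnames (including particles) of all doctors mentioned."""
    return [m.group("surname") for m in DOCTOR_RE.finditer(text)]


sample = "Dr. Van den Berg werkt samen met dr. Peeters en Professor dr. Janssen."
print(extract_doctors(sample))
```

Keeping the particles inside the named group means "Van den Berg" is captured as one surname rather than being cut off at "Van".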

Department Extraction

Hospital departments follow Dutch medical naming conventions:

  • Pattern: Known vocabulary of ~55 department names
  • Matching: Case-insensitive with fuzzy matching (SequenceMatcher / Ratcliff-Obershelp similarity) for variant spellings
  • Examples: Orthopedie, Cardiologie, Intensieve Zorgen, Materniteit

Fuzzy Matching

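Fuzzy matching via Python's difflib.SequenceMatcher (Ratcliff/Obershelp algorithm, cutoff=0.8) handles common spelling variations and misspellings: "Orthopaedie" matches "Orthopedie" with a ratio of about 0.95. Short abbreviations are a different case: "Cardio" scores only about 0.71 against "Cardiologie", which is below the 0.8 cutoff, so abbreviations like this need an explicit alias (or a lower cutoff) to resolve. Fuzzy matching is particularly important for handling OCR artifacts from PDF extraction.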
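
A minimal sketch of this vocabulary lookup, using difflib.get_close_matches (which ranks candidates by SequenceMatcher ratio) with the 0.8 cutoff. DEPARTMENTS holds only the four example names from this page, and match_department is a hypothetical helper name.

```python
from difflib import get_close_matches
from typing import Optional

# Illustrative subset of the department vocabulary.
DEPARTMENTS = ["Orthopedie", "Cardiologie", "Intensieve Zorgen", "Materniteit"]


def match_department(candidate: str, cutoff: float = 0.8) -> Optional[str]:
    """Case-insensitive fuzzy lookup; returns the canonical name or None."""
    lowered = {name.lower(): name for name in DEPARTMENTS}
    hits = get_close_matches(candidate.lower(), list(lowered), n=1, cutoff=cutoff)
    return lowered[hits[0]] if hits else None


print(match_department("Orthopaedie"))
print(match_department("Parking"))
```

Because the ratio is computed over the whole strings, inputs that are much shorter than the canonical name score lower than near-complete misspellings, so the cutoff is worth checking against real variants before relying on it.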

Campus Extraction

Campus names are drawn from a fixed set of four:

Campus | Variants Matched
Sint-Jan | Sint Jan, St-Jan, St. Jan, campus Sint-Jan
André Dumont | Andre Dumont, A. Dumont, campus André Dumont
Sint-Barbara | Sint Barbara, St-Barbara, St. Barbara
Maas en Kempen | Maas & Kempen, M&K
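
Because the campus set is closed, an exact alias lookup is enough. The sketch below builds one from the variants in the table; CAMPUS_ALIASES and normalize_campus are illustrative names, not the identifiers used in the codebase.

```python
from typing import Optional

# Canonical campus name -> variants taken from the table above.
CAMPUS_ALIASES = {
    "Sint-Jan": ["Sint Jan", "St-Jan", "St. Jan", "campus Sint-Jan"],
    "André Dumont": ["Andre Dumont", "A. Dumont", "campus André Dumont"],
    "Sint-Barbara": ["Sint Barbara", "St-Barbara", "St. Barbara"],
    "Maas en Kempen": ["Maas & Kempen", "M&K"],
}

# Flatten to a case-insensitive lookup that also contains the canonical names.
_LOOKUP = {
    alias.casefold(): canonical
    for canonical, aliases in CAMPUS_ALIASES.items()
    for alias in [canonical, *aliases]
}


def normalize_campus(mention: str) -> Optional[str]:
    """Map a campus mention to its canonical name, or None if unknown."""
    key = " ".join(mention.split()).casefold()  # collapse repeated whitespace
    return _LOOKUP.get(key)


print(normalize_campus("st. jan"))
print(normalize_campus("M&K"))
```

Returning None for unknown mentions, rather than passing them through, keeps an unrecognized spelling from becoming a new campus value.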

Medical Entity Extraction

Conditions, treatments, and examinations are extracted using domain-specific Dutch medical vocabulary patterns:

  • Conditions: Dutch medical condition terminology, suffixes like -itis, -ose, -ie
  • Treatments: Procedure-indicating terms (-plastie, -ectomie, -therapie, operatie)
  • Examinations: Diagnostic terms (scan, onderzoek, test, echo, biopsie)
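
The suffix and keyword rules above can be sketched as an ordered check. The function name classify_term and the rule order are assumptions: examination and treatment terms are tested first because several of them ("biopsie", "therapie") also end in the broad condition suffix -ie.

```python
from typing import Optional

EXAMINATION_TERMS = ("scan", "onderzoek", "test", "echo", "biopsie")
TREATMENT_TERMS = ("plastie", "ectomie", "therapie", "operatie")
CONDITION_SUFFIXES = ("itis", "ose", "ie")


def classify_term(word: str) -> Optional[str]:
    """Classify a single Dutch term by its ending; None if no rule applies."""
    w = word.lower()
    if w.endswith(EXAMINATION_TERMS):
        return "examination"
    if w.endswith(TREATMENT_TERMS):
        return "treatment"
    if w.endswith(CONDITION_SUFFIXES):
        return "condition"
    return None


for term in ["Bronchitis", "Artrose", "Heupoperatie", "CT-scan", "Kniepijn"]:
    print(term, "->", classify_term(term))
```

A suffix as short as -ie also matches non-condition words such as department names, so rules of this kind rely on vocabulary checks and later validation to remove false positives.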

Extraction Pipeline

Metadata Denormalization

After extraction, entity types and campus names are denormalized onto the source document's doc_metadata for use by the search ranking system:

  • doc_metadata.entity_types -- the set of entity types found in the document (e.g., ["doctors", "departments", "conditions"])
  • doc_metadata.campus -- campus names referenced in the document (e.g., ["ZOL Sint-Jan", "ZOL André Dumont"])

This denormalization enables the metadata boosting stage to apply entity type match (+10%) and campus match (+10%) boosts at query time without performing runtime graph lookups.
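
A sketch of how the denormalized fields could feed the boosts. The entity dictionaries, the "campuses" type label, and the choice to add the two 10% boosts into one multiplier are all assumptions made for illustration; only the field names and the boost sizes come from the text above.

```python
from typing import Optional

ENTITY_TYPE_BOOST = 0.10
CAMPUS_BOOST = 0.10


def denormalize(entities: list[dict]) -> dict:
    """Collapse extracted entities into the two doc_metadata fields."""
    return {
        "entity_types": sorted({e["type"] for e in entities}),
        "campus": sorted({e["name"] for e in entities if e["type"] == "campuses"}),
    }


def boosted_score(
    base_score: float,
    doc_metadata: dict,
    wanted_type: Optional[str] = None,
    wanted_campus: Optional[str] = None,
) -> float:
    """Apply the metadata boosts using only fields stored on the document."""
    boost = 0.0
    if wanted_type in doc_metadata.get("entity_types", []):
        boost += ENTITY_TYPE_BOOST
    if wanted_campus in doc_metadata.get("campus", []):
        boost += CAMPUS_BOOST
    return base_score * (1.0 + boost)
```

Both checks are list membership tests on the document itself, which is what lets ranking skip the graph at query time.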

Relationship Inference

When multiple entities are extracted from the same document chunk, their co-occurrence implies relationships. If a chunk mentions both "Dr. Van den Berg" and "Orthopedie," the system infers a WORKS_IN relationship. Co-occurrence-based inference is imperfect but produces a reasonably accurate graph when applied at scale across hundreds of documents.
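
Co-occurrence inference for the WORKS_IN case can be sketched as a cross product over the entities found in one chunk. The (type, name) tuple shape and the function name infer_works_in are assumptions for this example.

```python
from itertools import product


def infer_works_in(chunk_entities: list[tuple[str, str]]) -> list[tuple[str, str, str]]:
    """Pair every doctor with every department mentioned in the same chunk."""
    doctors = [name for kind, name in chunk_entities if kind == "doctor"]
    departments = [name for kind, name in chunk_entities if kind == "department"]
    return [(doc, "WORKS_IN", dept) for doc, dept in product(doctors, departments)]


chunk = [
    ("doctor", "Dr. Van den Berg"),
    ("department", "Orthopedie"),
    ("campus", "Sint-Jan"),
]
print(infer_works_in(chunk))
```

A chunk that names two doctors and two departments yields four edges under this rule, which is one source of the imprecision noted above.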

TypedNodeStorage

TypedNodeStorage persists extracted entities and inferred relationships to PostgreSQL taxonomy tables. When writing, it:

  1. Deduplicates entities using fuzzy matching (prevents "Dr. Vandenber" and "Dr. Van den Berg" from creating separate nodes)
  2. Merges properties when the same entity is mentioned across multiple documents
  3. Timestamps all entities and relationships for provenance tracking
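
The three behaviours can be illustrated with an in-memory stand-in. This is not the PostgreSQL-backed implementation: the class InMemoryNodeStore, its method names, and the 0.8 threshold on whitespace-stripped names are assumptions for the sketch.

```python
from datetime import datetime, timezone
from difflib import SequenceMatcher


class InMemoryNodeStore:
    """Toy store: fuzzy deduplication, property merging, timestamps."""

    def __init__(self, threshold: float = 0.8) -> None:
        self.nodes: dict[str, dict] = {}
        self.threshold = threshold

    @staticmethod
    def _key(name: str) -> str:
        # Ignore case and whitespace so "Van den Berg" and "Vandenberg" align.
        return "".join(name.lower().split())

    def upsert(self, name: str, properties: dict) -> str:
        """Insert or merge an entity; returns the name of the stored node."""
        now = datetime.now(timezone.utc)
        key = self._key(name)
        for existing, node in self.nodes.items():
            ratio = SequenceMatcher(None, key, self._key(existing)).ratio()
            if ratio >= self.threshold:
                node["properties"].update(properties)  # merge across documents
                node["updated_at"] = now
                return existing
        self.nodes[name] = {
            "properties": dict(properties),
            "created_at": now,
            "updated_at": now,
        }
        return name
```

Fuzzy deduplication of personal names can also merge two different people with similar surnames, so a threshold like this needs checking against real data.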

Hub-Only Mode

When graph_golden_only=True (the default), only pages classified as hub pages pass through to taxonomy storage. Regular crawled detail pages still go through extraction and LLM validation to generate page summaries (stored in chunk_metadata) and doc_metadata entity types, but their entities are not written to the taxonomy. The entity store is populated exclusively by the GoldenPageSeeder.
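
The gating rule can be sketched as a small filter. Only the flag name graph_golden_only comes from the text; the function name and the is_hub key on the page dictionary are hypothetical.

```python
def entities_for_taxonomy(
    page: dict, entities: list[dict], graph_golden_only: bool = True
) -> list[dict]:
    """Return the entities that may be written to taxonomy storage."""
    if graph_golden_only and not page.get("is_hub", False):
        # Detail page: summaries and doc_metadata are still produced
        # elsewhere, but nothing from this page reaches the taxonomy.
        return []
    return entities
```

With the default setting a detail page contributes an empty list, so the taxonomy only ever receives entities from hub pages.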

Taxonomy-Driven Normalization

After regex extraction and before LLM validation, all entities pass through a taxonomy normalization layer powered by a centralized module: zol_taxonomy.py. This module is the single source of truth for all ZOL domain knowledge, replacing scattered constants and ad-hoc normalization logic that previously lived across multiple files.

Single Source of Truth

The taxonomy module defines the complete ZOL domain model in one place:

  • 4 campus definitions with exhaustive alias maps (e.g., "St-Jan", "St. Jan", "Sint Jan" all resolve to "Sint-Jan")
  • ~55 department definitions with aliases, campus assignments, and diagnostic flags
  • Doctor name cleanup rules for stripping role tokens
  • Entity type overrides for ambiguous entities
  • Normalization maps for conditions, treatments, and specialties
  • Domain knowledge maps for relationship inference
  • Search aliases for query-time entity resolution

Department Alias Resolution

Prior to the taxonomy module, duplicate department nodes were a persistent quality issue. Variants like "Intensieve Zorgen" and "Intensive Care", or "Spoedgevallen" and "Spoed", created separate graph nodes that fragmented relationships. The taxonomy defines canonical names with explicit alias lists, resolving 7+ duplicate pairs:

Canonical Name | Aliases Resolved
Intensieve Zorgen | Intensive Care, ICU, IZ
Spoedgevallen | Spoed, Spoeddienst
Materniteit | Verloskunde, Kraamafdeling
Neonatologie | NICU, Neonatale Intensieve Zorgen
... (~55 departments total)

Campus Normalization

The taxonomy enforces exactly 4 campuses with comprehensive alias maps. Before this normalization, variant spellings could create phantom 5th campus nodes (e.g., "Andre Dumont" vs. "André Dumont" vs. "A. Dumont"). The alias maps ensure all variants resolve to one of the four canonical campus names.

Doctor Name Cleanup

Regex extraction sometimes captures role tokens as part of doctor names. The taxonomy defines a set of role tokens (e.g., "Hoofdverpleegkundige", "Diensthoofd", "Verpleegkundig specialist") that are stripped during normalization to prevent job titles from appearing as doctor names in the graph.
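
A minimal sketch of the cleanup step, using only the three role tokens named above. ROLE_TOKENS and clean_doctor_name are illustrative names, and the real token set is presumably larger.

```python
import re

# Example role tokens from the text; longest first so multi-word tokens win.
ROLE_TOKENS = ("Hoofdverpleegkundige", "Diensthoofd", "Verpleegkundig specialist")

_ROLE_RE = re.compile(
    r"\b(?:"
    + "|".join(re.escape(t) for t in sorted(ROLE_TOKENS, key=len, reverse=True))
    + r")\b",
    re.IGNORECASE,
)


def clean_doctor_name(raw: str) -> str:
    """Strip role tokens and collapse whitespace; may return an empty string."""
    return " ".join(_ROLE_RE.sub(" ", raw).split())


print(clean_doctor_name("Van den Berg Diensthoofd"))
print(repr(clean_doctor_name("Hoofdverpleegkundige")))
```

An empty result signals that the captured "name" was only a job title, so the caller can drop the entity instead of storing it.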

Entity Type Overrides

Some entity names are inherently ambiguous -- they could be classified as multiple entity types. The taxonomy provides explicit overrides:

  • "Radiotherapie": Classified as a department (not a treatment)
  • "Dialyse": Classified as a treatment (not a department)
  • "Revalidatie": Classified as a department (not a treatment)
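
The overrides listed above amount to a lookup that takes precedence over whatever type the regex stage assigned. The names TYPE_OVERRIDES and resolve_type are assumptions for this sketch.

```python
# Override table built from the three examples above (keys are casefolded).
TYPE_OVERRIDES = {
    "radiotherapie": "department",
    "dialyse": "treatment",
    "revalidatie": "department",
}


def resolve_type(name: str, regex_type: str) -> str:
    """Return the overridden type if one exists, else the regex-assigned type."""
    return TYPE_OVERRIDES.get(name.casefold(), regex_type)


print(resolve_type("Radiotherapie", "treatment"))
print(resolve_type("Orthopedie", "department"))
```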

Dual-Entity Model

Certain entities legitimately function as both a department and a treatment. For example, "Radiotherapie" is both a hospital department where patients go and a treatment modality that doctors prescribe. The taxonomy's dual-entity model creates two separate typed nodes for such entities -- one Department node and one Treatment node -- each with appropriate relationships to other entities in the graph.

Why a Centralized Taxonomy?

Before zol_taxonomy.py, domain knowledge was scattered across medical_extraction.py, typed_nodes.py, llm_entity_validation.py, and configuration constants. This fragmentation caused inconsistencies: a department alias added in one file would be missing from another, leading to duplicate nodes. Centralizing all domain knowledge in a single module ensures that every pipeline stage applies the same normalization rules. See Medical Knowledge Architecture for the full three-layer design separating universal medical knowledge from hospital-specific taxonomy.

LLM Validation Gate (ADR-0014)

While regex extraction is fast and deterministic, it produces systematic semantic errors that no amount of pattern tuning can fix. Body parts get parsed as doctor names ("Borstkas"), job titles become fake entities ("Hoofdverpleegkundige"), and generic terms like "Behandeling" create hub nodes connected to dozens of pages.

The solution is a post-extraction LLM validation gate that uses the Tier 2 (standard) model to validate each entity and relationship before it enters the knowledge graph. A single LLM call per page (batched in groups of up to 12 entities) performs four tasks:

  1. Entity validation: keep, reject, or rename each extracted entity
  2. Relationship validation: keep or reject each inferred relationship
  3. Synonym detection: identify Dutch/Flemish synonyms for entity names
  4. Page summary generation: 2-3 sentence Dutch summary for contextual retrieval

Cross-Page Entity Cache

Once an entity is rejected or renamed on one page, the decision is cached in-memory. When the same entity appears on subsequent pages, the cached decision is applied instantly without an LLM call. This reduces total LLM calls by 10-25% and ensures consistent decisions across the entire corpus.
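
A sketch of such a cache, under the assumption that decisions are keyed by case-insensitive entity name and that only reject and rename decisions are stored. The class and method names are hypothetical.

```python
from typing import Optional


class EntityDecisionCache:
    """In-memory cache of reject/rename decisions made on earlier pages."""

    def __init__(self) -> None:
        self._decisions: dict = {}

    def record(self, name: str, action: str, new_name: Optional[str] = None) -> None:
        """Remember a decision; "keep" decisions are not cached."""
        if action in ("reject", "rename"):
            self._decisions[name.casefold()] = (action, new_name)

    def partition(self, names: list[str]) -> tuple[dict, list[str]]:
        """Split names into already-decided ones and ones that need the LLM."""
        cached, pending = {}, []
        for name in names:
            decision = self._decisions.get(name.casefold())
            if decision is None:
                pending.append(name)
            else:
                cached[name] = decision
        return cached, pending
```

Only the pending list is sent to the model; when it is empty, the page needs no validation call for its entities.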

Fault Tolerance

If the LLM call fails (timeout, rate limit, parse error), the original regex extraction result passes through unchanged. The validation gate is a quality enhancement, not a critical path dependency.
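
The pass-through behaviour can be sketched as a wrapper around the validation call. The function name validate_with_fallback and the dict-in, dict-out validator signature are assumptions.

```python
import logging
from typing import Callable

logger = logging.getLogger(__name__)


def validate_with_fallback(
    extraction: dict, validator: Callable[[dict], dict]
) -> dict:
    """Run LLM validation; on any failure return the regex result unchanged."""
    try:
        return validator(extraction)
    except Exception as exc:  # timeout, rate limit, parse error, ...
        logger.warning("LLM validation failed, keeping regex result: %s", exc)
        return extraction
```

Catching broadly is deliberate in this sketch: the gate is optional, so no failure inside it should stop ingestion.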

Why Not Replace Regex Entirely?

The regex+LLM hybrid is deliberate. Regex provides a fast, deterministic, cost-free baseline that extracts 95%+ of entities correctly. The LLM only validates and cleans the remaining errors -- at ~$1-2 per full corpus run (~2000 pages). LLM-only extraction would be 10-50x more expensive, non-deterministic, and harder to debug.

See ADR-0014: LLM Entity Validation for the full architectural rationale, cost analysis, and alternatives considered.

Future: GLiNER-BioMed Hybrid

While regex extraction is effective for the current entity types, the team has identified GLiNER-BioMed as a promising enhancement for future iterations. GLiNER is a generalist model for named entity recognition: a zero-shot NER model that can extract entity types it was not explicitly trained on, given only their labels.

A hybrid approach would combine:

  • Regex: For well-defined patterns (doctors, campuses) where precision is paramount
  • GLiNER-BioMed: For novel medical entities (new conditions, treatments) where pattern coverage is incomplete
  • LLM validation: Post-extraction quality gate for all extraction methods

This hybrid would maintain the determinism and precision of regex for known patterns while adding the flexibility of neural extraction for emerging content, with LLM validation ensuring quality across all extraction methods.

References