Entity Extraction
This document describes the content ingestion extraction path (regex-based pattern matching during document processing). For the SP-4 Entity Resolution Pipeline that uses LLM extraction on hub pages, see Entity Resolution & Deduplication.
Populating the knowledge graph (Hogan et al., 2021) requires extracting structured entities from unstructured Dutch medical text. Named entity recognition (NER) is a foundational NLP task (Nadeau & Sekine, 2007), and the choice between LLM-based extraction and regex-based extraction trades flexibility for determinism. The ZOL system deliberately chooses the latter.
The Challenge
Hospital content arrives as unstructured text -- brochures, web pages, doctor profiles. Within this text, structured entities are embedded naturally:
"Dr. Van den Berg is orthopedisch chirurg bij de afdeling Orthopedie op campus Sint-Jan. Hij is gespecialiseerd in sportgeneeskunde en voert onder andere knieprotheses uit."
From this single paragraph, the system must extract:
- Doctor: Van den Berg
- Department: Orthopedie
- Campus: Sint-Jan
- Specialty: Sportgeneeskunde
- Treatment: Knieprothese
LLM vs. Regex Extraction
Trade-off Analysis
| Criterion | LLM Extraction | Regex Extraction |
|---|---|---|
| Flexibility | High -- handles novel patterns | Low -- only known patterns |
| Cost | ~$0.001 per page | Zero |
| Latency | ~2 seconds per page | ~5 milliseconds per page |
| Determinism | Non-deterministic | Fully deterministic |
| Dutch language | Good with modern models | Excellent (patterns designed for Dutch) |
| Accuracy on known patterns | ~90-95% | ~98%+ |
| Accuracy on novel patterns | ~80-90% | 0% (fails silently) |
| Testability | Difficult (non-deterministic) | Trivial (deterministic) |
The Decision: Regex for ZOL
The ZOL system uses regex-based extraction for a specific reason: the medical entity patterns in Dutch hospital content are well-defined and predictable. Doctor names follow "Dr." or "dr." prefixes. Department names come from a known vocabulary. Campus names are a fixed set of four. Treatment and condition names follow established Dutch medical terminology.
In a domain where entities follow predictable linguistic patterns, regex extraction delivers higher precision, zero cost, deterministic behavior, and trivially testable results. LLM extraction's flexibility advantage does not materialize because the entity patterns are already known.
Extraction Patterns
Doctor Extraction
Dutch medical professionals are typically referenced with honorific prefixes:
- Pattern: "Dr." or "dr." followed by a capitalized surname
- Variants: Dr. Van den Berg, dr. Peeters, Professor dr. Janssen
- Edge cases: Multi-part surnames common in Dutch/Flemish (Van, De, Den, Van den)
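The doctor pattern can be sketched as a single regular expression. This is a minimal illustration, not the project's actual pattern: the name DOCTOR_RE, the helper extract_doctors, and the exact list of surname particles are assumptions made for the example.

```python
import re

# Hypothetical pattern: an optional "Professor"/"Prof." title, then "Dr." or
# "dr.", optional surname particles (van, de, den, der, ter), and finally a
# capitalized surname. Only the surname with its particles is captured.
DOCTOR_RE = re.compile(
    r"\b(?:Prof(?:essor|\.)\s+)?[Dd]r\.\s+"
    r"((?:(?:[Vv]an|[Dd]e[nr]?|[Tt]er)\s+)*[A-Z][\w'-]+)"
)


def extract_doctors(text: str) -> list[str]:
    """Return doctor surnames in order of appearance."""
    return [match.group(1) for match in DOCTOR_RE.finditer(text)]


text = "Dr. Van den Berg werkt samen met dr. Peeters en Professor dr. Janssen."
print(extract_doctors(text))
```

A surname that merely starts with a particle, such as "Deckers", is still captured whole, because a particle only counts when whitespace follows it.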
Department Extraction
Hospital departments follow Dutch medical naming conventions:
- Pattern: Known vocabulary of ~55 department names
- Matching: Case-insensitive with fuzzy matching (SequenceMatcher / Ratcliff-Obershelp similarity) for variant spellings
- Examples: Orthopedie, Cardiologie, Intensieve Zorgen, Materniteit
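Fuzzy matching via Python's difflib.SequenceMatcher (Ratcliff/Obershelp algorithm, cutoff=0.8) handles common variations and misspellings: "Orthopaedie" matches "Orthopedie" with a similarity ratio of about 0.95. Truncated forms are a different case: "Cardio" scores only about 0.71 against "Cardiologie", below the 0.8 cutoff, so a short form like this needs an explicit alias (or a lower cutoff) to resolve. Fuzzy matching is particularly important for handling OCR artifacts from PDF extraction.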
Campus Extraction
Campus names are drawn from a fixed set of four:
| Campus | Variants Matched |
|---|---|
| Sint-Jan | Sint Jan, St-Jan, St. Jan, campus Sint-Jan |
| André Dumont | Andre Dumont, A. Dumont, campus André Dumont |
| Sint-Barbara | Sint Barbara, St-Barbara, St. Barbara |
| Maas en Kempen | Maas & Kempen, M&K |
Medical Entity Extraction
Conditions, treatments, and examinations are extracted using domain-specific Dutch medical vocabulary patterns:
- Conditions: Dutch medical condition terminology, suffixes like -itis, -ose, -ie
- Treatments: Procedure-indicating terms (-plastie, -ectomie, -therapie, operatie)
- Examinations: Diagnostic terms (scan, onderzoek, test, echo, biopsie)
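The suffix and keyword rules above can be sketched with three regular expressions. The pattern names and classify_terms are assumptions for illustration. The sketch also deliberately leaves out the -ie suffix, because on its own it matches department names such as Orthopedie and treatments ending in -therapie; a real implementation would need a vocabulary check for that case.

```python
import re

# Hypothetical patterns built from the term lists above. [\w-]* lets a
# compound such as "MRI-scan" be captured as one term.
CONDITION_RE = re.compile(r"\b\w+(?:itis|ose)\b", re.IGNORECASE)
TREATMENT_RE = re.compile(
    r"\b[\w-]*(?:plastie|ectomie|therapie|operatie)\b", re.IGNORECASE
)
EXAMINATION_RE = re.compile(
    r"\b[\w-]*(?:scan|onderzoek|test|echo|biopsie)\b", re.IGNORECASE
)


def classify_terms(text: str) -> dict[str, list[str]]:
    """Group matched terms by entity type, in order of appearance."""
    return {
        "conditions": [m.group(0) for m in CONDITION_RE.finditer(text)],
        "treatments": [m.group(0) for m in TREATMENT_RE.finditer(text)],
        "examinations": [m.group(0) for m in EXAMINATION_RE.finditer(text)],
    }


sample = (
    "De patiënt met artrose kreeg een MRI-scan, bloedonderzoek "
    "en daarna fysiotherapie."
)
print(classify_terms(sample))
```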
Extraction Pipeline
Metadata Denormalization
After extraction, entity types and campus names are denormalized onto the source document's doc_metadata for use by the search ranking system:
- doc_metadata.entity_types -- the set of entity types found in the document (e.g., ["doctors", "departments", "conditions"])
- doc_metadata.campus -- campus names referenced in the document (e.g., ["ZOL Sint-Jan", "ZOL André Dumont"])
This denormalization enables the metadata boosting stage to apply entity type match (+10%) and campus match (+10%) boosts at query time without performing runtime graph lookups.
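A small sketch of how denormalized metadata lets ranking apply the two boosts without a graph lookup. The function names, the (entity_type, name) tuple shape, and the choice to combine the boosts multiplicatively are all assumptions; the source states only that each boost is +10%.

```python
def denormalize(doc_metadata: dict, entities: list[tuple[str, str]]) -> dict:
    """Copy entity types and campus names onto the document metadata.

    entities holds (entity_type, name) tuples (hypothetical shape).
    """
    doc_metadata["entity_types"] = sorted({etype for etype, _ in entities})
    doc_metadata["campus"] = sorted(
        {name for etype, name in entities if etype == "campuses"}
    )
    return doc_metadata


def boosted_score(base, doc_metadata, query_entity_type=None, query_campus=None):
    """Apply +10% per metadata match; multiplicative stacking is assumed."""
    score = base
    if query_entity_type in doc_metadata.get("entity_types", []):
        score *= 1.10
    if query_campus in doc_metadata.get("campus", []):
        score *= 1.10
    return score


meta = denormalize(
    {},
    [("doctors", "Van den Berg"), ("campuses", "Sint-Jan"), ("doctors", "Peeters")],
)
print(meta, boosted_score(1.0, meta, "doctors", "Sint-Jan"))
```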
Relationship Inference
When multiple entities are extracted from the same document chunk, their co-occurrence implies relationships. If a chunk mentions both "Dr. Van den Berg" and "Orthopedie," the system infers a WORKS_IN relationship. Co-occurrence-based inference is imperfect but produces a reasonably accurate graph when applied at scale across hundreds of documents.
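Co-occurrence inference for the WORKS_IN case can be sketched as a cross product of the doctors and departments found in one chunk. The tuple shapes and the function name infer_works_in are illustrative assumptions.

```python
from itertools import product


def infer_works_in(chunk_entities: list[tuple[str, str]]) -> list[tuple[str, str, str]]:
    """Pair every doctor with every department mentioned in the same chunk.

    chunk_entities holds (entity_type, name) tuples (hypothetical shape).
    """
    doctors = [name for etype, name in chunk_entities if etype == "doctor"]
    departments = [name for etype, name in chunk_entities if etype == "department"]
    return [(doc, "WORKS_IN", dept) for doc, dept in product(doctors, departments)]


chunk = [
    ("doctor", "Van den Berg"),
    ("department", "Orthopedie"),
    ("campus", "Sint-Jan"),
]
print(infer_works_in(chunk))
```

Because every doctor is paired with every department, a chunk that lists several of each yields spurious edges. This is the imperfection the paragraph above refers to.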
TypedNodeStorage
TypedNodeStorage persists extracted entities and inferred relationships to PostgreSQL taxonomy tables. It:
- Deduplicates entities using fuzzy matching (prevents "Dr. Vandenber" and "Dr. Van den Berg" from creating separate nodes)
- Merges properties when the same entity is mentioned across multiple documents
- Timestamps all entities and relationships for provenance tracking
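The three behaviours above (fuzzy deduplication, property merging, timestamping) can be sketched with an in-memory list standing in for the PostgreSQL tables. The upsert function, the node dictionary layout, and the 0.9 threshold are assumptions for illustration, not the actual storage API.

```python
from datetime import datetime, timezone
from difflib import SequenceMatcher


def _norm(name: str) -> str:
    """Lowercase, drop a leading 'dr.' and remove spaces before comparing."""
    return name.lower().removeprefix("dr.").replace(" ", "")


def upsert(nodes: list[dict], name: str, props: dict, threshold: float = 0.9) -> dict:
    """Merge into a fuzzy-matching node if one exists, else append a new node."""
    now = datetime.now(timezone.utc)
    for node in nodes:
        similarity = SequenceMatcher(None, _norm(node["name"]), _norm(name)).ratio()
        if similarity >= threshold:
            node["props"].update(props)  # merge properties across documents
            node["updated_at"] = now
            return node
    node = {"name": name, "props": dict(props), "created_at": now, "updated_at": now}
    nodes.append(node)
    return node


nodes: list[dict] = []
upsert(nodes, "Dr. Van den Berg", {"specialty": "Sportgeneeskunde"})
upsert(nodes, "Dr. Vandenber", {"campus": "Sint-Jan"})
print(len(nodes), nodes[0]["props"])
```

In this sketch the first spelling seen becomes the node name; choosing a canonical spelling is a separate concern.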
When graph_golden_only=True (the default), only pages classified as hub pages pass through to taxonomy storage. Regular crawled detail pages still go through extraction and LLM validation to generate page summaries (stored in chunk_metadata) and doc_metadata entity types, but their entities are not written to the taxonomy. The entity store is populated exclusively by the GoldenPageSeeder.
Taxonomy-Driven Normalization
After regex extraction and before LLM validation, all entities pass through a taxonomy normalization layer powered by a centralized module: zol_taxonomy.py. This module is the single source of truth for all ZOL domain knowledge, replacing scattered constants and ad-hoc normalization logic that previously lived across multiple files.
Single Source of Truth
The taxonomy module defines the complete ZOL domain model in one place:
- 4 campus definitions with exhaustive alias maps (e.g., "St-Jan", "St. Jan", "Sint Jan" all resolve to "Sint-Jan")
- ~55 department definitions with aliases, campus assignments, and diagnostic flags
- Doctor name cleanup rules for stripping role tokens
- Entity type overrides for ambiguous entities
- Normalization maps for conditions, treatments, and specialties
- Domain knowledge maps for relationship inference
- Search aliases for query-time entity resolution
Department Alias Resolution
Prior to the taxonomy module, duplicate department nodes were a persistent quality issue. Variants like "Intensieve Zorgen" and "Intensive Care", or "Spoedgevallen" and "Spoed", created separate graph nodes that fragmented relationships. The taxonomy defines canonical names with explicit alias lists, resolving 7+ duplicate pairs:
| Canonical Name | Aliases Resolved |
|---|---|
| Intensieve Zorgen | Intensive Care, ICU, IZ |
| Spoedgevallen | Spoed, Spoeddienst |
| Materniteit | Verloskunde, Kraamafdeling |
| Neonatologie | NICU, Neonatale Intensieve Zorgen |
| ... | (~55 departments total) |
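One way to keep such an alias table consistent is to compile it into a single index and fail loudly when an alias is claimed by two canonical names. The sketch below uses only the four rows shown above; build_alias_index and canonical_department are hypothetical names, not functions from zol_taxonomy.py.

```python
# Canonical department -> aliases, copied from the rows shown above.
DEPARTMENT_ALIASES = {
    "Intensieve Zorgen": ["Intensive Care", "ICU", "IZ"],
    "Spoedgevallen": ["Spoed", "Spoeddienst"],
    "Materniteit": ["Verloskunde", "Kraamafdeling"],
    "Neonatologie": ["NICU", "Neonatale Intensieve Zorgen"],
}


def build_alias_index(table: dict[str, list[str]]) -> dict[str, str]:
    """Map every case-folded name or alias to its canonical department."""
    index: dict[str, str] = {}
    for canonical, aliases in table.items():
        for name in (canonical, *aliases):
            existing = index.setdefault(name.casefold(), canonical)
            if existing != canonical:
                raise ValueError(
                    f"{name!r} maps to both {existing!r} and {canonical!r}"
                )
    return index


def canonical_department(mention: str, index: dict[str, str]) -> str:
    """Resolve a mention to its canonical name; unknown mentions pass through."""
    return index.get(mention.strip().casefold(), mention)


INDEX = build_alias_index(DEPARTMENT_ALIASES)
print(canonical_department("ICU", INDEX))
```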
Campus Normalization
The taxonomy enforces exactly 4 campuses with comprehensive alias maps. Before this normalization, variant spellings could create phantom 5th campus nodes (e.g., "Andre Dumont" vs. "André Dumont" vs. "A. Dumont"). The alias maps ensure all variants resolve to one of the four canonical campus names.
Doctor Name Cleanup
Regex extraction sometimes captures role tokens as part of doctor names. The taxonomy defines a set of role tokens (e.g., "Hoofdverpleegkundige", "Diensthoofd", "Verpleegkundig specialist") that are stripped during normalization to prevent job titles from appearing as doctor names in the graph.
Entity Type Overrides
Some entity names are inherently ambiguous -- they could be classified as multiple entity types. The taxonomy provides explicit overrides:
- "Radiotherapie": Classified as a department (not a treatment)
- "Dialyse": Classified as a treatment (not a department)
- "Revalidatie": Classified as a department (not a treatment)
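The overrides listed above amount to a lookup that takes precedence over whatever type the extractor assigned. A minimal sketch, with TYPE_OVERRIDES and resolve_type as hypothetical names:

```python
# Overrides from the list above, keyed by case-folded entity name.
TYPE_OVERRIDES = {
    "radiotherapie": "department",
    "dialyse": "treatment",
    "revalidatie": "department",
}


def resolve_type(name: str, extracted_type: str) -> str:
    """Return the override type if one exists, else the extracted type."""
    return TYPE_OVERRIDES.get(name.casefold(), extracted_type)


print(resolve_type("Dialyse", "department"))
print(resolve_type("Cardiologie", "department"))
```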
Dual-Entity Model
Certain entities legitimately function as both a department and a treatment. For example, "Radiotherapie" is both a hospital department where patients go and a treatment modality that doctors prescribe. The taxonomy's dual-entity model creates two separate typed nodes for such entities -- one Department node and one Treatment node -- each with appropriate relationships to other entities in the graph.
Before zol_taxonomy.py, domain knowledge was scattered across medical_extraction.py, typed_nodes.py, llm_entity_validation.py, and configuration constants. This fragmentation caused inconsistencies: a department alias added in one file would be missing from another, leading to duplicate nodes. Centralizing all domain knowledge in a single module ensures that every pipeline stage applies the same normalization rules. See Medical Knowledge Architecture for the full three-layer design separating universal medical knowledge from hospital-specific taxonomy.
LLM Validation Gate (ADR-0014)
While regex extraction is fast and deterministic, it produces systematic semantic errors that no amount of pattern tuning can fix. Body parts get parsed as doctor names ("Borstkas"), job titles become fake entities ("Hoofdverpleegkundige"), and generic terms like "Behandeling" create hub nodes connected to dozens of pages.
The solution is a post-extraction LLM validation gate that uses the Tier 2 (standard) model to validate each entity and relationship before it enters the knowledge graph. A single LLM call per page (batched in groups of up to 12 entities) performs four tasks:
- Entity validation: keep, reject, or rename each extracted entity
- Relationship validation: keep or reject each inferred relationship
- Synonym detection: identify Dutch/Flemish synonyms for entity names
- Page summary generation: 2-3 sentence Dutch summary for contextual retrieval
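The batching step can be sketched as a simple chunking helper. The function name batch_entities is an assumption, and how batches map onto LLM calls is not specified in this document; the sketch shows only the grouping into lists of at most 12.

```python
def batch_entities(entities: list[str], size: int = 12) -> list[list[str]]:
    """Split a page's entities into consecutive groups of at most size."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [entities[i:i + size] for i in range(0, len(entities), size)]


page_entities = [f"entity-{n}" for n in range(30)]
print([len(batch) for batch in batch_entities(page_entities)])
```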
Cross-Page Entity Cache
Once an entity is rejected or renamed on one page, the decision is cached in-memory. When the same entity appears on subsequent pages, the cached decision is applied instantly without an LLM call. This reduces total LLM calls by 10-25% and ensures consistent decisions across the entire corpus.
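A sketch of the cache behaviour described above, with a plain dictionary as the in-memory store. The decision shape (action, new_name), the function validate_page, and the llm_validate callable are assumptions. As the text states, only reject and rename decisions are cached.

```python
def validate_page(entities, cache, llm_validate):
    """Validate one page's entities, reusing cached reject/rename decisions.

    cache maps case-folded entity names to (action, new_name) tuples.
    llm_validate is a hypothetical callable: list of names -> dict of decisions.
    """
    decisions, pending = {}, []
    for name in entities:
        cached = cache.get(name.casefold())
        if cached is not None:
            decisions[name] = cached  # reuse the earlier decision, no LLM call
        else:
            pending.append(name)
    if pending:
        for name, decision in llm_validate(pending).items():
            decisions[name] = decision
            if decision[0] in ("reject", "rename"):
                cache[name.casefold()] = decision
    return decisions


calls = []


def fake_llm(names):
    """Stand-in for the LLM: rejects 'Borstkas', keeps everything else."""
    calls.append(list(names))
    return {n: (("reject", None) if n == "Borstkas" else ("keep", None)) for n in names}


cache = {}
first = validate_page(["Borstkas", "Orthopedie"], cache, fake_llm)
second = validate_page(["Borstkas"], cache, fake_llm)
print(second, len(calls))
```

The second page triggers no LLM call at all, because its only entity was already rejected on the first page.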
Fault Tolerance
If the LLM call fails (timeout, rate limit, parse error), the original regex extraction result passes through unchanged. The validation gate is a quality enhancement, not a critical path dependency.
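The pass-through behaviour can be sketched as a wrapper around the validation call. The exception types are assumptions: a real LLM client would raise its own timeout and rate-limit errors, which would need to be added to RECOVERABLE. JSON parse errors are covered because json.JSONDecodeError is a subclass of ValueError.

```python
import logging

logger = logging.getLogger("entity_validation")

# Assumed failure types; client-specific exceptions would be added here.
RECOVERABLE = (TimeoutError, ConnectionError, ValueError)


def validate_with_fallback(extraction, llm_validate):
    """Return the validated result, or the raw regex extraction on failure."""
    try:
        return llm_validate(extraction)
    except RECOVERABLE as exc:
        logger.warning("LLM validation failed, passing regex result through: %s", exc)
        return extraction


def failing_llm(_extraction):
    raise TimeoutError("simulated timeout")


raw = {"doctors": ["Van den Berg"], "departments": ["Orthopedie"]}
print(validate_with_fallback(raw, failing_llm) is raw)
```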
The regex+LLM hybrid is deliberate. Regex provides a fast, deterministic, cost-free baseline that extracts 95%+ of entities correctly. The LLM only validates and cleans up the remaining errors, at ~$1-2 per full corpus run (~2000 pages). LLM-only extraction would be 10-50x more expensive, non-deterministic, and harder to debug.
See ADR-0014: LLM Entity Validation for the full architectural rationale, cost analysis, and alternatives considered.
Future: GLiNER-BioMed Hybrid
While regex extraction is effective for the current entity types, the team has identified GLiNER-BioMed as a promising enhancement for future iterations. GLiNER (Generalist and Lightweight model for Named Entity Recognition) is a zero-shot NER model: entity categories are supplied as labels at inference time, so it can extract entities from categories it never saw during training.
A hybrid approach would combine:
- Regex: For well-defined patterns (doctors, campuses) where precision is paramount
- GLiNER-BioMed: For novel medical entities (new conditions, treatments) where pattern coverage is incomplete
- LLM validation: Post-extraction quality gate for all extraction methods
This hybrid would maintain the determinism and precision of regex for known patterns while adding the flexibility of neural extraction for emerging content, with LLM validation ensuring quality across all extraction methods.
References
- Hogan, A., et al. (2021). Knowledge graphs. ACM Computing Surveys, 54(4), 1--37. https://doi.org/10.1145/3447772
- Nadeau, D. & Sekine, S. (2007). A survey of named entity recognition and classification. Lingvisticae Investigationes, 30(1), 3--26. https://doi.org/10.1075/li.30.1.03nad