Relationship Inference Strategies
The knowledge graph extraction pipeline uses deterministic regex-based strategies rather than LLM-based relation extraction. This design choice prioritises reproducibility, auditability, and speed: the same page content always produces the same extraction result, each relationship can be traced to a specific strategy, and extraction completes in milliseconds without API calls.
Why Regex Over LLM?
| Criterion | Regex Strategies | LLM Relation Extraction |
|---|---|---|
| Determinism | Identical output for identical input | Non-deterministic (temperature, context window) |
| Auditability | Each relationship tagged with strategy + confidence | Black-box output |
| Cost | Zero marginal cost per page | ~$0.01-0.05 per page (API call) |
| Latency | <10ms per page | 1-5s per page |
| Coverage | Limited to known patterns | Discovers novel relationships |
The system compensates for regex's limited coverage through domain knowledge maps (Strategy 2) and plausibility guards that filter implausible relationships rather than trying to discover all possible ones.
Strategy Catalog
All strategies are implemented in _extract_relationships() in medical_extraction.py. They run sequentially on every page, each contributing relationships that pass through a shared plausibility filter.
Pre-conditions
Before any strategy runs, the method computes page-level context:
- page_department: Inferred from the URL slug (e.g., /raadplegingen/cardiologie/) or the page title
- is_doctor_page: True if the page has 2 or fewer doctors, or the URL matches doctor/team patterns
- is_focused_dept_page: page_department exists AND the page has 15 or fewer doctors
- allow_works_in: is_doctor_page OR is_focused_dept_page -- gates all WORKS_IN creation to prevent spurious doctor-department links on listing pages
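The page-level flags can be sketched as follows. This is a minimal illustration, not the code in medical_extraction.py: the PageContext container, the compute_page_context function, and the DOCTOR_URL_PATTERN regex are assumptions made for the example, since the real doctor/team URL patterns are not listed here.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical pattern: the actual doctor/team URL patterns are not documented here.
DOCTOR_URL_PATTERN = re.compile(r"/(artsen|arts|team)(/|$)")


@dataclass(frozen=True)
class PageContext:
    page_department: Optional[str]
    is_doctor_page: bool
    is_focused_dept_page: bool
    allow_works_in: bool


def compute_page_context(
    url: str, page_department: Optional[str], doctor_count: int
) -> PageContext:
    # A page counts as a doctor page if it has 2 or fewer doctors
    # or its URL looks like a doctor/team page.
    is_doctor_page = doctor_count <= 2 or bool(DOCTOR_URL_PATTERN.search(url))
    # A focused department page has a known department and at most 15 doctors.
    is_focused_dept_page = page_department is not None and doctor_count <= 15
    return PageContext(
        page_department=page_department,
        is_doctor_page=is_doctor_page,
        is_focused_dept_page=is_focused_dept_page,
        allow_works_in=is_doctor_page or is_focused_dept_page,
    )
```

Under these assumptions, a listing page with 40 doctors and no inferred department yields allow_works_in = False, so no WORKS_IN relationship is created from it.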
Strategy 1: Document-Level URL/Title Inference
| Property | Value |
|---|---|
| Relationships | WORKS_IN |
| Confidence | 0.85 |
| Guard | Requires page_department AND allow_works_in |
If a department is identifiable from the URL or page title, all extracted doctors on the page are linked to that department. Example: a page at /raadplegingen/cardiologie/ with two doctor names produces two WORKS_IN relationships to Cardiologie.
Strategy 2: Domain Knowledge Mapping
| Property | Value |
|---|---|
| Relationships | HANDLES, OFFERS, PERFORMS, TREATS |
| Confidence | 0.8 (HANDLES/OFFERS/PERFORMS), 0.85 (TREATS) |
| Guard | Per-relationship plausibility checks |
Applies taxonomy-driven maps when entities co-occur on a page:
- DEPT_CONDITION_MAP: Department + condition on same page → HANDLES (guarded by _is_plausible_handles)
- DEPT_TREATMENT_MAP: Department + treatment → OFFERS (guarded by _is_plausible_relationship("OFFERS"))
- EXAM_PERFORMS_MAP: Department + examination → PERFORMS (guarded by _is_plausible_relationship("PERFORMS"))
- TREATMENT_CONDITION_MAP: Treatment + condition → TREATS (map match is sufficient)
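A simplified sketch of map-driven co-occurrence for two of the four maps is shown below. The map entries, the Relationship tuple, and the apply_domain_maps function are illustrative assumptions; the real maps are taxonomy-driven, and the HANDLES plausibility guard is left out to keep the example short.

```python
from typing import Dict, List, NamedTuple, Set


class Relationship(NamedTuple):
    source: str
    rel_type: str
    target: str
    confidence: float
    strategy: str


# Illustrative entries only; the real maps are much larger.
DEPT_CONDITION_MAP: Dict[str, Set[str]] = {"Cardiologie": {"hartfalen", "aritmie"}}
TREATMENT_CONDITION_MAP: Dict[str, Set[str]] = {"pacemaker": {"aritmie"}}


def apply_domain_maps(
    departments: List[str], conditions: List[str], treatments: List[str]
) -> List[Relationship]:
    rels: List[Relationship] = []
    # Department + condition on the same page -> HANDLES (0.8).
    # The real pipeline would also call the HANDLES plausibility guard here.
    for dept in departments:
        for cond in conditions:
            if cond in DEPT_CONDITION_MAP.get(dept, set()):
                rels.append(Relationship(dept, "HANDLES", cond, 0.8, "domain_map"))
    # Treatment + condition -> TREATS (0.85); a map match is sufficient.
    for treatment in treatments:
        for cond in conditions:
            if cond in TREATMENT_CONDITION_MAP.get(treatment, set()):
                rels.append(Relationship(treatment, "TREATS", cond, 0.85, "domain_map"))
    return rels
```

Tagging each relationship with its strategy name is what keeps the output auditable: every edge can be traced back to the rule that produced it.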
Strategy 2b: Service-Department Linking
| Property | Value |
|---|---|
| Relationships | OFFERS |
| Confidence | 0.75 |
Links hospital services (e.g., "afspraak maken") to appropriate departments using a curated service-to-department map.
Strategy 3: All-Pairs for Focused Pages
| Property | Value |
|---|---|
| Relationships | WORKS_IN, OFFERS, PERFORMS, HANDLES, RELATED_TO |
| Confidence | 0.5--0.8 (varies by evidence) |
On pages with few entities, assumes all co-occurring entities are related:
| Entity Pair | Cap | Confidence | Relationship |
|---|---|---|---|
| Doctor-Department | ≤3 doctors, 1--3 depts | 0.7 | WORKS_IN |
| Department-Treatment | ≤2 departments | 0.75 | OFFERS |
| Department-Examination | ≤2 departments | 0.75 | PERFORMS |
| Department-Condition | ≤3 departments | 0.5--0.8 | HANDLES |
| Treatment-Condition | both non-empty | 0.6 | RELATED_TO |
Conditions mentioned in headings or early text receive higher confidence (0.8) than body-only mentions (0.5). Because 0.5 is below the minimum confidence threshold (0.65), body-only mentions are filtered out.
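The Department-Condition row can be sketched as below. This is a hedged illustration: the function names are hypothetical, a simple case-insensitive substring test stands in for the real mention detection, and the caller is assumed to supply whatever the pipeline treats as "early text".

```python
from typing import List, Sequence, Tuple

MIN_CONFIDENCE = 0.65  # minimum confidence threshold named in this document


def condition_confidence(condition: str, headings: Sequence[str], early_text: str) -> float:
    # Heading or early-text mentions score 0.8; body-only mentions score 0.5.
    needle = condition.lower()
    prominent = any(needle in h.lower() for h in headings) or needle in early_text.lower()
    return 0.8 if prominent else 0.5


def handles_candidates(
    departments: Sequence[str],
    conditions: Sequence[str],
    headings: Sequence[str],
    early_text: str,
) -> List[Tuple[str, str, str, float]]:
    # The all-pairs rule only applies when the page has 1 to 3 departments.
    if not departments or len(departments) > 3:
        return []
    candidates = []
    for dept in departments:
        for cond in conditions:
            confidence = condition_confidence(cond, headings, early_text)
            if confidence >= MIN_CONFIDENCE:
                candidates.append((dept, "HANDLES", cond, confidence))
    return candidates
```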
Strategy 4: Sentence and Paragraph Co-occurrence
| Property | Value |
|---|---|
| Relationships | WORKS_IN, WORKS_AT, TREATS, DIAGNOSES, USED_FOR, RELATED_TO, OFFERS, PERFORMS |
| Confidence | 0.6--0.9 |
Two analysis levels:
Paragraph-level (skipped on listing pages with >15 doctors or >10 departments):
- Doctor + Department in same paragraph → WORKS_IN (0.65)
- Treatment + Condition → RELATED_TO (0.6)
- Examination + Condition → USED_FOR (0.6)
Sentence-level (always runs):
Uses Dutch relationship indicator patterns ("werkt bij", "behandeling van", "diagnose van") to boost confidence when explicit textual evidence exists:
| Pair Type | With pattern match | Without pattern | Guard |
|---|---|---|---|
| Doctor-Department | WORKS_AT (0.9) | WORKS_IN (0.75) | _is_valid_doctor_department() |
| Treatment-Condition | TREATS (0.9) | RELATED_TO (0.7) | plausibility check |
| Exam-Condition | DIAGNOSES (0.9) | USED_FOR (0.7) | is_plausible_used_for() |
| Department-Treatment | OFFERS (0.8) | OFFERS (0.8) | plausibility check |
| Department-Exam | PERFORMS (0.8) | PERFORMS (0.8) | plausibility check |
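The first three rows of the table can be read as a lookup keyed by pair type. The sketch below assumes one indicator pattern per pair type, which is a simplification; the real pattern lists are not shown in this document, the guards are omitted, and classify_pair is a hypothetical name.

```python
import re
from typing import Tuple

# One illustrative Dutch indicator per pair type; the real lists are likely longer.
INDICATORS = {
    "doctor_department": re.compile(r"\bwerkt bij\b", re.IGNORECASE),
    "treatment_condition": re.compile(r"\bbehandeling van\b", re.IGNORECASE),
    "exam_condition": re.compile(r"\bdiagnose van\b", re.IGNORECASE),
}

# (with pattern match, without pattern match), taken from the table above.
OUTCOMES = {
    "doctor_department": (("WORKS_AT", 0.9), ("WORKS_IN", 0.75)),
    "treatment_condition": (("TREATS", 0.9), ("RELATED_TO", 0.7)),
    "exam_condition": (("DIAGNOSES", 0.9), ("USED_FOR", 0.7)),
}


def classify_pair(pair_type: str, sentence: str) -> Tuple[str, float]:
    # Explicit textual evidence upgrades both the relationship type and the confidence.
    with_pattern, without_pattern = OUTCOMES[pair_type]
    return with_pattern if INDICATORS[pair_type].search(sentence) else without_pattern
```

For example, a sentence containing "werkt bij" yields WORKS_AT at 0.9, while the same doctor and department appearing together without an indicator yield WORKS_IN at 0.75.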
Strategy 5: URL-Inferred Specialty
| Property | Value |
|---|---|
| Relationships | WORKS_IN |
| Confidence | 0.75 |
Creates WORKS_IN from URL-inferred specialty when no department was explicitly extracted. If a doctor was found on a page at /huidziekten/artsen, _enrich_doctors_with_specialty sets metadata["inferred_department"] = "Dermatologie", and Strategy 5 creates the relationship.
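A rough sketch of this two-step flow follows. The URL_SLUG_TO_DEPARTMENT map, the dictionary shape used for doctors, and both function names are assumptions for illustration; only the metadata["inferred_department"] key and the 0.75 confidence come from the description above.

```python
from typing import List, Optional, Tuple
from urllib.parse import urlparse

# Hypothetical slug map; the real slug-to-specialty mapping is not shown here.
URL_SLUG_TO_DEPARTMENT = {"huidziekten": "Dermatologie"}


def enrich_doctor_with_specialty(doctor: dict, url: str) -> dict:
    # Check each path segment against the slug map and record the first hit.
    for segment in urlparse(url).path.strip("/").split("/"):
        department = URL_SLUG_TO_DEPARTMENT.get(segment.lower())
        if department:
            doctor.setdefault("metadata", {})["inferred_department"] = department
            break
    return doctor


def works_in_from_inferred_specialty(
    doctor: dict, extracted_departments: List[str]
) -> Optional[Tuple[str, str, str, float]]:
    # Only fall back to the inferred department when none was explicitly extracted.
    inferred = doctor.get("metadata", {}).get("inferred_department")
    if extracted_departments or inferred is None:
        return None
    return (doctor["name"], "WORKS_IN", inferred, 0.75)
```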
Strategy 6: Campus Relationships
| Property | Value |
|---|---|
| Relationships | LOCATED_AT (facilities), AVAILABLE_AT (services) |
| Confidence | 0.7 |
| Guard | Exactly 1 campus on the page |
Links facilities and services to the single campus mentioned on a page. The single-campus guard prevents cross-product relationships (e.g., a page listing all campuses would incorrectly link every service to every campus).
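The single-campus guard amounts to an early return, as in this minimal sketch. The function name and tuple layout are illustrative, and the campus list is assumed to contain distinct names.

```python
from typing import List, Tuple

Rel = Tuple[str, str, str, float]


def campus_relationships(
    campuses: List[str], facilities: List[str], services: List[str]
) -> List[Rel]:
    # Guard: with zero or several campuses there is no safe target,
    # so nothing is emitted rather than a cross-product.
    if len(campuses) != 1:
        return []
    campus = campuses[0]
    rels: List[Rel] = [(f, "LOCATED_AT", campus, 0.7) for f in facilities]
    rels += [(s, "AVAILABLE_AT", campus, 0.7) for s in services]
    return rels
```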
Strategy 7: Hospital Hierarchy
| Property | Value |
|---|---|
| Relationships | HAS_CAMPUS, BELONGS_TO, LOCATED_AT |
| Confidence | 1.0 (hierarchy), 0.9 (LOCATED_AT) |
Runs unconditionally on every extraction. Establishes the structural hierarchy:
- Hospital HAS_CAMPUS each of the 4 campuses (confidence 1.0)
- Every department BELONGS_TO the hospital (confidence 1.0)
- Departments LOCATED_AT campuses via DEPARTMENT_CAMPUS_MAP (confidence 0.9)
Strategy 7b: Doctor-Campus from Department-Campus
| Property | Value |
|---|---|
| Relationships | WORKS_AT_CAMPUS |
| Confidence | 0.85 |
Derives doctor-campus links from the chain: Doctor → WORKS_IN → Department → LOCATED_AT → Campus. Only creates the link if the department maps to a single campus (prevents cross-product).
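The chain can be sketched as a join between WORKS_IN pairs and a department-to-campus map. The function name, the input shapes, and the campus names in the usage note are assumptions for the example.

```python
from typing import Dict, List, Tuple


def doctor_campus_links(
    works_in: List[Tuple[str, str]],
    department_campuses: Dict[str, List[str]],
) -> List[Tuple[str, str, str, float]]:
    links = []
    for doctor, department in works_in:
        campuses = department_campuses.get(department, [])
        # Only an unambiguous department location is propagated to the doctor.
        if len(campuses) == 1:
            links.append((doctor, "WORKS_AT_CAMPUS", campuses[0], 0.85))
    return links
```

With a department mapped to ["Campus A"] the doctor gets one link; with a department mapped to ["Campus A", "Campus B"] the doctor gets none, because the page gives no evidence for choosing between them.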
Strategy 8: Domain Knowledge PERFORMS
| Property | Value |
|---|---|
| Relationships | PERFORMS |
| Confidence | 0.8 |
Page-independent strategy: whenever an examination entity is extracted, checks EXAM_PERFORMS_MAP to create PERFORMS relationships to the examination's performing department(s), regardless of page co-occurrence. Example: extracting "ECG" on any page triggers PERFORMS from Cardiologie.
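In sketch form, this strategy is a lookup that ignores which departments appear on the page. The single map entry and the lowercase key normalisation are assumptions; the real EXAM_PERFORMS_MAP structure is not shown in this document.

```python
from typing import Dict, List, Tuple

# Illustrative entry; keys are assumed to be lowercase examination names.
EXAM_PERFORMS_MAP: Dict[str, List[str]] = {"ecg": ["Cardiologie"]}


def performs_from_exams(examinations: List[str]) -> List[Tuple[str, str, str, float]]:
    rels = []
    for exam in examinations:
        # The performing department does not need to appear on the page.
        for department in EXAM_PERFORMS_MAP.get(exam.lower(), []):
            rels.append((department, "PERFORMS", exam, 0.8))
    return rels
```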
Plausibility Guard Framework
Every relationship passes through plausibility guards before being accepted. The guards form a layered defense against spurious relationships.
Guard Functions
_is_plausible_handles(dept, condition, confidence) -- Guards HANDLES relationships:
- Diagnostic department block: Departments flagged is_diagnostic=True (Radiologie, Labo, Nucleaire Geneeskunde, Pathologie) never get HANDLES relationships
- Negative map: DEPT_CONDITION_NEGATIVE_MAP blocks specific department-condition pairs (e.g., Patiëntenbegeleiding cannot HANDLE clinical conditions)
- Hub concept guard: Overly broad conditions ("pijn", "tumor", "allergie", "infectie") only link to departments in allowed domain groups
- Positive domain guard: If DEPT_CONDITION_MAP lists conditions for a department, only listed conditions pass
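The four layers can be sketched as a chain of early rejections. Everything below is a simplified assumption: the map contents are illustrative, the confidence parameter of the real function is omitted, HUB_ALLOWED_DOMAIN_GROUPS is modelled as a flat set, and DEPT_DOMAIN_GROUP stands in for the domain group lookup from the hospital config.

```python
from typing import Dict, Optional, Set

DIAGNOSTIC_DEPARTMENTS = {"radiologie", "labo", "nucleaire geneeskunde", "pathologie"}
DEPT_CONDITION_NEGATIVE_MAP: Dict[str, Optional[Set[str]]] = {
    "radiologie": None,  # None means: block every condition
    "patiëntenbegeleiding": {"diabetes", "kanker", "hartfalen"},
}
HUB_CONDITIONS = {"pijn", "tumor", "allergie", "infectie", "ontsteking"}
HUB_ALLOWED_DOMAIN_GROUPS = {"oncology", "emergency_icu", "internal_medicine"}
DEPT_DOMAIN_GROUP = {"cardiologie": "cardiovascular", "oncologie": "oncology"}
DEPT_CONDITION_MAP = {"cardiologie": {"hartfalen", "aritmie"}}


def is_plausible_handles(dept: str, condition: str) -> bool:
    d, c = dept.lower(), condition.lower()
    # Layer 1: diagnostic departments never HANDLE conditions.
    if d in DIAGNOSTIC_DEPARTMENTS:
        return False
    # Layer 2: negative map, either block-all (None) or specific conditions.
    if d in DEPT_CONDITION_NEGATIVE_MAP:
        blocked = DEPT_CONDITION_NEGATIVE_MAP[d]
        if blocked is None or c in blocked:
            return False
    # Layer 3: hub conditions only link to departments in allowed domain groups.
    if c in HUB_CONDITIONS and DEPT_DOMAIN_GROUP.get(d) not in HUB_ALLOWED_DOMAIN_GROUPS:
        return False
    # Layer 4: if the department lists conditions, only those pass.
    listed = DEPT_CONDITION_MAP.get(d)
    if listed and c not in listed:
        return False
    return True
```

Ordering the cheapest, most categorical checks first means most spurious pairs are rejected before any domain group lookup happens.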
_is_plausible_relationship(rel_type, dept, target, confidence) -- Guards OFFERS, PERFORMS, TREATS, USED_FOR:
- OFFERS: 4-layer validation (treatment blocklist, imaging department guard, DEPT_TREATMENT_MAP positive guard, hub treatment guard)
- PERFORMS: Imaging exam guard (MRI/CT only from imaging departments), EXAM_DOMAIN_MAP domain group validation
- TREATS/USED_FOR: Cross-entity domain plausibility check
_is_valid_doctor_department(doctor, dept) -- Guards WORKS_IN:
- Uses SPECIALTY_DEPARTMENT_MAP to validate that a doctor's specialty is compatible with the target department
- Prevents cardiologists from being linked to Pediatrie because they appear on the same page
Domain Groups
Each department is assigned a domain group in the hospital config (zol.yaml). Domain groups cluster clinically related departments and are used by plausibility guards to validate cross-entity relationships.
| Domain Group | Example Departments | Count |
|---|---|---|
| cardiovascular | Cardiologie, Cardiochirurgie, Hartrevalidatie | 4 |
| surgery | Algemene Heelkunde, Orthopedie, Urologie, Plastische Chirurgie | 6 |
| oncology | Oncologie, Hematologie, Radiotherapie | 4 |
| neuroscience | Neurologie, Neurochirurgie | 2 |
| internal_medicine | Gastro-enterologie, Pneumologie, Nefrologie, Endocrinologie | 9 |
| women_children | Gynaecologie, Materniteit, Pediatrie, Neonatologie | 7 |
| diagnostics | Radiologie, Labo, Nucleaire Geneeskunde, Pathologie | 6 |
| sensory | Oftalmologie, Keel- Neus- en Oorziekten | 2 |
| psychiatry | Psychiatrie, Kinder- en Jeugdpsychiatrie | 3 |
| emergency_icu | Spoedgevallen, Intensieve Zorgen, MUG | 4 |
| rehabilitation | Fysische Geneeskunde, Revalidatie | 3 |
| support | Patiëntenbegeleiding, Sociale Dienst, Diëtetiek | 5 |
| centers | Slaapcentrum, Pijncentrum, Borstcentrum | 8 |
Hub Blocking
Hub conditions (HUB_CONDITIONS): Generic terms like "pijn" (pain), "tumor", "allergie", "infectie", "ontsteking" that would connect to nearly every department if unchecked. These only link to departments in HUB_ALLOWED_DOMAIN_GROUPS (typically oncology, emergency/ICU, and internal medicine).
Hub treatments (HUB_TREATMENTS): Generic treatments like "medicatie", "therapie", "revalidatie" that are similarly over-broad.
Negative Maps
DEPT_CONDITION_NEGATIVE_MAP explicitly blocks known-incorrect relationships:
- None (block all): Diagnostic departments and support departments that should never HANDLE any condition. Example: "radiologie": None means Radiologie never gets HANDLES.
- Specific blocks: Named conditions that a department should not handle. Example: "patiëntenbegeleiding": {"diabetes", "kanker", "hartfalen", ...} blocks clinical conditions from the patient support department.
Confidence Scoring
Confidence values are deterministic and source-based, not probabilistic. They encode the quality of evidence:
| Confidence | Evidence Level | Example |
|---|---|---|
| 1.0 | Hub page listing or structural hierarchy | Doctor from /zol-artsen, Hospital HAS_CAMPUS |
| 0.9 | Sentence-level pattern match or config-derived | "Dr. X werkt bij Cardiologie", LOCATED_AT from config |
| 0.85 | URL/title inference or domain knowledge with TREATS | Strategy 1 (URL-inferred WORKS_IN), Strategy 7b |
| 0.8 | Domain knowledge map match | Strategy 2 (HANDLES from DEPT_CONDITION_MAP) |
| 0.75 | Focused page all-pairs or URL-inferred specialty | Strategy 3 (department-treatment), Strategy 5 |
| 0.7 | Sentence co-occurrence without pattern match | Strategy 4 (RELATED_TO from sentence proximity) |
| 0.65 | Paragraph co-occurrence | Strategy 4 paragraph-level WORKS_IN |
| 0.6 | Weak co-occurrence evidence | Strategy 3 (treatment-condition RELATED_TO) |
| 0.5 | Body-only condition mention | Strategy 3 (condition in body text, no heading) |
A minimum confidence threshold of 0.65 is applied during PostgreSQL taxonomy storage, filtering out the weakest evidence.
Pipeline Flow
When multiple strategies produce the same relationship (same source, target, and type), deduplication keeps the one with the highest confidence value.
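A minimal sketch of this deduplication rule, assuming relationships are dictionaries with source, target, type, and confidence keys (the actual data structure is not shown in this document):

```python
from typing import Dict, List, Tuple


def deduplicate(relationships: List[dict]) -> List[dict]:
    best: Dict[Tuple[str, str, str], dict] = {}
    for rel in relationships:
        key = (rel["source"], rel["target"], rel["type"])
        # Keep the highest-confidence instance of each (source, target, type) triple.
        if key not in best or rel["confidence"] > best[key]["confidence"]:
            best[key] = rel
    return list(best.values())
```

If Strategy 1 produces a WORKS_IN edge at 0.85 and Strategy 4 produces the same edge at 0.65, only the 0.85 instance survives, along with its strategy tag.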