Relationship Inference Strategies

The knowledge graph extraction pipeline uses deterministic regex-based strategies rather than LLM-based relation extraction. This design choice prioritises reproducibility, auditability, and speed: the same page content always produces the same extraction result, each relationship can be traced to a specific strategy, and extraction completes in milliseconds without API calls.

Why Regex Over LLM?

Criterion	Regex Strategies	LLM Relation Extraction
Determinism	Identical output for identical input	Non-deterministic (temperature, context window)
Auditability	Each relationship tagged with strategy + confidence	Black-box output
Cost	Zero marginal cost per page	~$0.01-0.05 per page (API call)
Latency	<10ms per page	1-5s per page
Coverage	Limited to known patterns	Discovers novel relationships

The system offsets the limited coverage of regex with domain knowledge maps (Strategy 2). Rather than trying to discover every possible relationship, it relies on plausibility guards to filter out implausible ones.

Strategy Catalog

All strategies are implemented in _extract_relationships() in medical_extraction.py. They run sequentially on every page, each contributing relationships that pass through a shared plausibility filter.

Pre-conditions

Before any strategy runs, the method computes page-level context:

page_department: Inferred from URL slug (e.g., /raadplegingen/cardiologie/) or page title
is_doctor_page: True if the page has 2 or fewer doctors, or the URL matches doctor/team patterns
is_focused_dept_page: page_department exists AND 15 or fewer doctors
allow_works_in: is_doctor_page OR is_focused_dept_page -- gates all WORKS_IN creation to prevent spurious doctor-department links on listing pages
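
The four flags above can be sketched as a small function. This is a minimal illustration, not the code in medical_extraction.py: the `PageContext` class, the `compute_page_context` name, and the `DOCTOR_URL_PATTERN` regex are assumptions made for the example, and the department is passed in already inferred.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical pattern: the real doctor/team URL patterns are not listed in this document.
DOCTOR_URL_PATTERN = re.compile(r"/(artsen|team)(/|$)")


@dataclass(frozen=True)
class PageContext:
    page_department: Optional[str]
    is_doctor_page: bool
    is_focused_dept_page: bool
    allow_works_in: bool


def compute_page_context(url: str, page_department: Optional[str], doctor_count: int) -> PageContext:
    """Derive the page-level flags from the URL, the inferred department and the doctor count."""
    # A page counts as a doctor page with 2 or fewer doctors, or a doctor/team URL.
    is_doctor_page = doctor_count <= 2 or bool(DOCTOR_URL_PATTERN.search(url))
    # A focused department page needs a known department and at most 15 doctors.
    is_focused_dept_page = page_department is not None and doctor_count <= 15
    # WORKS_IN creation is gated on either flag.
    allow_works_in = is_doctor_page or is_focused_dept_page
    return PageContext(page_department, is_doctor_page, is_focused_dept_page, allow_works_in)
```

A listing page with 40 doctors, no inferred department and a non-matching URL yields `allow_works_in=False`, so no WORKS_IN relationship is created from it.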

Strategy 1: Document-Level URL/Title Inference

Property	Value
Relationships	`WORKS_IN`
Confidence	0.85
Guard	Requires `page_department` AND `allow_works_in`

If a department is identifiable from the URL or page title, all extracted doctors on the page are linked to that department. Example: a page at /raadplegingen/cardiologie/ with two doctor names produces two WORKS_IN relationships to Cardiologie.
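
As a rough sketch of this strategy, assuming relationships are represented as a simple named tuple (the `Relationship` type, the function name, and the strategy label are illustrative, not taken from the codebase):

```python
from typing import List, NamedTuple, Optional


class Relationship(NamedTuple):
    source: str
    rel_type: str
    target: str
    confidence: float
    strategy: str  # hypothetical label for tracing a relationship back to its strategy


def infer_works_in_from_page(
    doctors: List[str], page_department: Optional[str], allow_works_in: bool
) -> List[Relationship]:
    """Link every doctor on the page to the page-level department (confidence 0.85)."""
    # Guard: both a page department and the allow_works_in flag are required.
    if not page_department or not allow_works_in:
        return []
    return [
        Relationship(doctor, "WORKS_IN", page_department, 0.85, "url_title_inference")
        for doctor in doctors
    ]
```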

Strategy 2: Domain Knowledge Mapping

Property	Value
Relationships	`HANDLES`, `OFFERS`, `PERFORMS`, `TREATS`
Confidence	0.8 (HANDLES/OFFERS/PERFORMS), 0.85 (TREATS)
Guard	Per-relationship plausibility checks

Applies taxonomy-driven maps when entities co-occur on a page:

DEPT_CONDITION_MAP: Department + condition on same page → HANDLES (guarded by _is_plausible_handles)
DEPT_TREATMENT_MAP: Department + treatment → OFFERS (guarded by _is_plausible_relationship("OFFERS"))
EXAM_PERFORMS_MAP: Department + examination → PERFORMS (guarded by _is_plausible_relationship("PERFORMS"))
TREATMENT_CONDITION_MAP: Treatment + condition → TREATS (map match is sufficient)
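
The map lookups can be pictured as follows. The map contents below are invented sample entries, the lowercase keys are an assumption, and the plausibility guards described later are left out to keep the sketch short.

```python
# Illustrative entries only: the real maps are taxonomy-driven and much larger.
DEPT_CONDITION_MAP = {"cardiologie": {"hartfalen", "hartritmestoornis"}}
TREATMENT_CONDITION_MAP = {"pacemaker": {"hartritmestoornis"}}


def map_relationships(departments, conditions, treatments):
    """Emit HANDLES and TREATS tuples for co-occurring entities that match a map entry."""
    rels = []
    for dept in departments:
        for cond in conditions:
            if cond in DEPT_CONDITION_MAP.get(dept, set()):
                # The real pipeline would also call the HANDLES guard here.
                rels.append((dept, "HANDLES", cond, 0.8))
    for treatment in treatments:
        for cond in conditions:
            # For TREATS, a map match alone is sufficient.
            if cond in TREATMENT_CONDITION_MAP.get(treatment, set()):
                rels.append((treatment, "TREATS", cond, 0.85))
    return rels
```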

Strategy 2b: Service-Department Linking

Property	Value
Relationships	`OFFERS`
Confidence	0.75

Links hospital services (e.g., "afspraak maken") to appropriate departments using a curated service-to-department map.

Strategy 3: All-Pairs for Focused Pages

Property	Value
Relationships	`WORKS_IN`, `OFFERS`, `PERFORMS`, `HANDLES`, `RELATED_TO`
Confidence	0.5--0.8 (varies by evidence)

On pages with few entities, assumes all co-occurring entities are related:

Entity Pair	Cap	Confidence	Relationship
Doctor-Department	≤3 doctors, 1--3 depts	0.7	`WORKS_IN`
Department-Treatment	≤2 departments	0.75	`OFFERS`
Department-Examination	≤2 departments	0.75	`PERFORMS`
Department-Condition	≤3 departments	0.5--0.8	`HANDLES`
Treatment-Condition	both non-empty	0.6	`RELATED_TO`

Conditions mentioned in headings or early text receive higher confidence (0.8) than body-only mentions (0.5). Because 0.5 is below the minimum confidence threshold (0.65), body-only mentions are filtered out.
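
A minimal sketch of the heading-versus-body scoring for Department-Condition pairs. The function name and the `early_chars` cutoff of 500 characters are assumptions; the document does not say how "early text" is delimited.

```python
MIN_CONFIDENCE = 0.65  # minimum confidence threshold quoted in this document


def condition_confidence(condition, headings, text, early_chars=500):
    """Return 0.8 for a heading or early-text mention, 0.5 for a body-only mention."""
    needle = condition.lower()
    if any(needle in heading.lower() for heading in headings):
        return 0.8
    # early_chars is a hypothetical cutoff for what counts as "early text".
    if needle in text[:early_chars].lower():
        return 0.8
    return 0.5


def passes_threshold(confidence):
    """True if a relationship with this confidence survives the minimum threshold."""
    return confidence >= MIN_CONFIDENCE
```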

Strategy 4: Sentence and Paragraph Co-occurrence

Property	Value
Relationships	`WORKS_IN`, `WORKS_AT`, `TREATS`, `DIAGNOSES`, `USED_FOR`, `RELATED_TO`, `OFFERS`, `PERFORMS`
Confidence	0.6--0.9

Two analysis levels:

Paragraph-level (skipped on listing pages with >15 doctors or >10 departments):

Doctor + Department in same paragraph → WORKS_IN (0.65)
Treatment + Condition → RELATED_TO (0.6)
Examination + Condition → USED_FOR (0.6)

Sentence-level (always runs): Uses Dutch relationship indicator patterns ("werkt bij", "behandeling van", "diagnose van") to boost confidence when explicit textual evidence exists:

Pair Type	With pattern match	Without pattern	Guard
Doctor-Department	`WORKS_AT` (0.9)	`WORKS_IN` (0.75)	`_is_valid_doctor_department()`
Treatment-Condition	`TREATS` (0.9)	`RELATED_TO` (0.7)	plausibility check
Exam-Condition	`DIAGNOSES` (0.9)	`USED_FOR` (0.7)	`is_plausible_used_for()`
Department-Treatment	`OFFERS` (0.8)	`OFFERS` (0.8)	plausibility check
Department-Exam	`PERFORMS` (0.8)	`PERFORMS` (0.8)	plausibility check
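
The Doctor-Department row of the table can be sketched like this. Only the "werkt bij" indicator is taken from the text; the regex, the function name, and the substring-based entity matching are simplifications, and the `_is_valid_doctor_department()` guard is omitted.

```python
import re

# One Dutch indicator from the text; the real pattern list is presumably longer.
WORKS_AT_PATTERN = re.compile(r"\bwerkt bij\b", re.IGNORECASE)


def doctor_department_relation(sentence, doctor, department):
    """Return a relationship tuple if both entities occur in the sentence, else None."""
    if doctor not in sentence or department not in sentence:
        return None
    if WORKS_AT_PATTERN.search(sentence):
        # Explicit textual evidence boosts the relationship to WORKS_AT at 0.9.
        return (doctor, "WORKS_AT", department, 0.9)
    # Plain sentence co-occurrence falls back to WORKS_IN at 0.75.
    return (doctor, "WORKS_IN", department, 0.75)
```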

Strategy 5: URL-Inferred Specialty

Property	Value
Relationships	`WORKS_IN`
Confidence	0.75

Creates WORKS_IN from URL-inferred specialty when no department was explicitly extracted. If a doctor was found on a page at /huidziekten/artsen, _enrich_doctors_with_specialty sets metadata["inferred_department"] = "Dermatologie", and Strategy 5 creates the relationship.
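
One possible reading of this fallback, assuming doctors are dictionaries with `name` and `metadata` keys and that "no department was explicitly extracted" is evaluated per page (both are assumptions made for the example):

```python
def works_in_from_inferred_specialty(doctors, extracted_departments):
    """Create WORKS_IN (0.75) from metadata['inferred_department'] when no department was extracted."""
    # Assumption: the fallback only applies when the page yielded no department entity.
    if extracted_departments:
        return []
    rels = []
    for doctor in doctors:
        dept = doctor.get("metadata", {}).get("inferred_department")
        if dept:
            rels.append((doctor["name"], "WORKS_IN", dept, 0.75))
    return rels
```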

Strategy 6: Campus Relationships

Property	Value
Relationships	`LOCATED_AT` (facilities), `AVAILABLE_AT` (services)
Confidence	0.7
Guard	Exactly 1 campus on the page

Links facilities and services to the single campus mentioned on a page. The single-campus guard prevents cross-product relationships (e.g., a page listing all campuses would incorrectly link every service to every campus).
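
The single-campus guard is easy to express in code. The sketch below uses placeholder campus names and a hypothetical function name:

```python
def campus_relationships(campuses, facilities, services):
    """Link facilities and services to the campus only when exactly one campus is on the page."""
    # Guard: zero or several campuses would produce a misleading cross-product.
    if len(campuses) != 1:
        return []
    campus = campuses[0]
    rels = [(facility, "LOCATED_AT", campus, 0.7) for facility in facilities]
    rels += [(service, "AVAILABLE_AT", campus, 0.7) for service in services]
    return rels
```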

Strategy 7: Hospital Hierarchy

Property	Value
Relationships	`HAS_CAMPUS`, `BELONGS_TO`, `LOCATED_AT`
Confidence	1.0 (hierarchy), 0.9 (LOCATED_AT)

Runs unconditionally on every extraction. Establishes the structural hierarchy:

Hospital HAS_CAMPUS each of the 4 campuses (confidence 1.0)
Every department BELONGS_TO the hospital (confidence 1.0)
Departments LOCATED_AT campuses via DEPARTMENT_CAMPUS_MAP (confidence 0.9)

Strategy 7b: Doctor-Campus from Department-Campus

Property	Value
Relationships	`WORKS_AT_CAMPUS`
Confidence	0.85

Derives doctor-campus links from the chain: Doctor → WORKS_IN → Department → LOCATED_AT → Campus. Only creates the link if the department maps to a single campus (prevents cross-product).
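
The chain can be sketched as a lookup over existing WORKS_IN pairs. The map entries and campus names below are placeholders, and the shape of `DEPARTMENT_CAMPUS_MAP` (department to list of campuses) is an assumption.

```python
# Placeholder entries: the real DEPARTMENT_CAMPUS_MAP comes from the hospital config.
DEPARTMENT_CAMPUS_MAP = {
    "Cardiologie": ["Campus A"],
    "Radiologie": ["Campus A", "Campus B"],
}


def doctor_campus_links(works_in_pairs):
    """Derive WORKS_AT_CAMPUS (0.85) from (doctor, department) WORKS_IN pairs."""
    rels = []
    for doctor, department in works_in_pairs:
        campuses = DEPARTMENT_CAMPUS_MAP.get(department, [])
        # Only unambiguous departments (exactly one campus) produce a link.
        if len(campuses) == 1:
            rels.append((doctor, "WORKS_AT_CAMPUS", campuses[0], 0.85))
    return rels
```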

Strategy 8: Domain Knowledge PERFORMS

Property	Value
Relationships	`PERFORMS`
Confidence	0.8

Page-independent strategy: whenever an examination entity is extracted, checks EXAM_PERFORMS_MAP to create PERFORMS relationships to the examination's performing department(s), regardless of page co-occurrence. Example: extracting "ECG" on any page triggers PERFORMS from Cardiologie.
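
A sketch of the page-independent lookup, using the ECG example from the text. The lowercase map keys and the function name are assumptions.

```python
# Single sample entry based on the ECG example; the real map is larger.
EXAM_PERFORMS_MAP = {"ecg": ["Cardiologie"]}


def performs_from_exams(examinations):
    """Create PERFORMS (0.8) from each performing department to the extracted examination."""
    rels = []
    for exam in examinations:
        # No co-occurrence check: the department need not appear on the page.
        for department in EXAM_PERFORMS_MAP.get(exam.lower(), []):
            rels.append((department, "PERFORMS", exam, 0.8))
    return rels
```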

Plausibility Guard Framework

Every relationship passes through plausibility guards before being accepted. The guards form a layered defense against spurious relationships.

Guard Functions

_is_plausible_handles(dept, condition, confidence) -- Guards HANDLES relationships:

Diagnostic department block: Departments flagged is_diagnostic=True (Radiologie, Labo, Nucleaire Geneeskunde, Pathologie) never get HANDLES relationships
Negative map: DEPT_CONDITION_NEGATIVE_MAP blocks specific department-condition pairs (e.g., Patiëntenbegeleiding cannot HANDLE clinical conditions)
Hub concept guard: Overly broad conditions ("pijn", "tumor", "allergie", "infectie") only link to departments in allowed domain groups
Positive domain guard: If DEPT_CONDITION_MAP lists conditions for a department, only listed conditions pass

_is_plausible_relationship(rel_type, dept, target, confidence) -- Guards OFFERS, PERFORMS, TREATS, USED_FOR:

OFFERS: 4-layer validation (treatment blocklist, imaging department guard, DEPT_TREATMENT_MAP positive guard, hub treatment guard)
PERFORMS: Imaging exam guard (MRI/CT only from imaging departments), EXAM_DOMAIN_MAP domain group validation
TREATS/USED_FOR: Cross-entity domain plausibility check

_is_valid_doctor_department(doctor, dept) -- Guards WORKS_IN:

Uses SPECIALTY_DEPARTMENT_MAP to validate that a doctor's specialty is compatible with the target department
Prevents cardiologists from being linked to Pediatrie merely because both appear on the same page
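
The four layers of the HANDLES guard can be illustrated with a simplified stand-in. All map contents here are small samples, the `DEPT_DOMAIN_GROUP` lookup and the function name are hypothetical, and the `confidence` parameter of the real function is left out because its role is not described.

```python
DIAGNOSTIC_DEPARTMENTS = {"radiologie", "labo", "nucleaire geneeskunde", "pathologie"}
DEPT_CONDITION_NEGATIVE_MAP = {
    "radiologie": None,  # None means: block every condition
    "patiëntenbegeleiding": {"diabetes", "kanker", "hartfalen"},
}
HUB_CONDITIONS = {"pijn", "tumor", "allergie", "infectie"}
HUB_ALLOWED_DOMAIN_GROUPS = {"oncology", "emergency_icu", "internal_medicine"}
# Hypothetical lookup; the document says domain groups live in the hospital config.
DEPT_DOMAIN_GROUP = {"cardiologie": "cardiovascular", "oncologie": "oncology"}
DEPT_CONDITION_MAP = {"cardiologie": {"hartfalen"}}


def is_plausible_handles_sketch(dept, condition):
    """Apply the four HANDLES layers in order; any failing layer rejects the pair."""
    dept, condition = dept.lower(), condition.lower()
    # 1. Diagnostic departments never get HANDLES.
    if dept in DIAGNOSTIC_DEPARTMENTS:
        return False
    # 2. Negative map: None blocks everything, a set blocks the named conditions.
    if dept in DEPT_CONDITION_NEGATIVE_MAP:
        blocked = DEPT_CONDITION_NEGATIVE_MAP[dept]
        if blocked is None or condition in blocked:
            return False
    # 3. Hub conditions only link to departments in allowed domain groups.
    if condition in HUB_CONDITIONS and DEPT_DOMAIN_GROUP.get(dept) not in HUB_ALLOWED_DOMAIN_GROUPS:
        return False
    # 4. Positive guard: if the department lists conditions, only those pass.
    listed = DEPT_CONDITION_MAP.get(dept)
    if listed and condition not in listed:
        return False
    return True
```

Ordering the cheap, absolute blocks first means a pair such as Radiologie plus any condition is rejected before the hub and positive-map lookups run.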

Domain Groups

Each department is assigned a domain group in the hospital config (zol.yaml). Domain groups cluster clinically related departments and are used by plausibility guards to validate cross-entity relationships.

Domain Group	Example Departments	Count
`cardiovascular`	Cardiologie, Cardiochirurgie, Hartrevalidatie	4
`surgery`	Algemene Heelkunde, Orthopedie, Urologie, Plastische Chirurgie	6
`oncology`	Oncologie, Hematologie, Radiotherapie	4
`neuroscience`	Neurologie, Neurochirurgie	2
`internal_medicine`	Gastro-enterologie, Pneumologie, Nefrologie, Endocrinologie	9
`women_children`	Gynaecologie, Materniteit, Pediatrie, Neonatologie	7
`diagnostics`	Radiologie, Labo, Nucleaire Geneeskunde, Pathologie	6
`sensory`	Oftalmologie, Keel- Neus- en Oorziekten	2
`psychiatry`	Psychiatrie, Kinder- en Jeugdpsychiatrie	3
`emergency_icu`	Spoedgevallen, Intensieve Zorgen, MUG	4
`rehabilitation`	Fysische Geneeskunde, Revalidatie	3
`support`	Patiëntenbegeleiding, Sociale Dienst, Diëtetiek	5
`centers`	Slaapcentrum, Pijncentrum, Borstcentrum	8

Hub Blocking

Hub conditions (HUB_CONDITIONS): Generic terms like "pijn" (pain), "tumor", "allergie", "infectie", "ontsteking" that would connect to nearly every department if unchecked. These only link to departments in HUB_ALLOWED_DOMAIN_GROUPS (typically oncology, emergency/ICU, and internal medicine).

Hub treatments (HUB_TREATMENTS): Generic treatments like "medicatie", "therapie", "revalidatie" that are similarly over-broad.

Negative Maps

DEPT_CONDITION_NEGATIVE_MAP explicitly blocks known-incorrect relationships:

None (block all): Diagnostic departments and support departments that should never HANDLE any condition. Example: "radiologie": None means Radiologie never gets HANDLES.
Specific blocks: Named conditions that a department should not handle. Example: "patiëntenbegeleiding": {"diabetes", "kanker", "hartfalen", ...} blocks clinical conditions from the patient support department.

Confidence Scoring

Confidence values are deterministic and source-based, not probabilistic. They encode the quality of evidence:

Confidence	Evidence Level	Example
1.0	Hub page listing or structural hierarchy	Doctor from `/zol-artsen`, Hospital HAS_CAMPUS
0.9	Sentence-level pattern match or config-derived	"Dr. X werkt bij Cardiologie", LOCATED_AT from config
0.85	URL/title inference or domain knowledge with TREATS	Strategy 1 (URL-inferred WORKS_IN), Strategy 7b
0.8	Domain knowledge map match	Strategy 2 (HANDLES from DEPT_CONDITION_MAP)
0.75	Focused page all-pairs or URL-inferred specialty	Strategy 3 (department-treatment), Strategy 5
0.7	Sentence co-occurrence without pattern match	Strategy 4 (RELATED_TO from sentence proximity)
0.65	Paragraph co-occurrence	Strategy 4 paragraph-level WORKS_IN
0.6	Weak co-occurrence evidence	Strategy 3 (treatment-condition RELATED_TO)
0.5	Body-only condition mention	Strategy 3 (condition in body text, no heading)

A minimum confidence threshold of 0.65 is applied during PostgreSQL taxonomy storage. This filters out the weakest evidence: relationships whose confidence is 0.6 or 0.5 are not stored.

Pipeline Flow

When multiple strategies produce the same relationship (same source, target, and type), deduplication keeps the one with the highest confidence value.
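
The deduplication rule can be sketched as a dictionary keyed on (source, type, target). The tuple layout `(source, rel_type, target, confidence)` and the function name are assumptions made for the example.

```python
def deduplicate(relationships):
    """Keep one relationship per (source, rel_type, target), preferring the highest confidence."""
    best = {}
    for rel in relationships:
        source, rel_type, target, confidence = rel
        key = (source, rel_type, target)
        # Replace the stored relationship only if this one is more confident.
        if key not in best or confidence > best[key][3]:
            best[key] = rel
    return list(best.values())
```

For example, if paragraph co-occurrence (0.65) and URL/title inference (0.85) both yield the same WORKS_IN link, only the 0.85 version survives.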

