
Relationship Inference Strategies

The knowledge graph extraction pipeline uses deterministic regex-based strategies rather than LLM-based relation extraction. This design choice prioritises reproducibility, auditability, and speed: the same page content always produces the same extraction result, each relationship can be traced to a specific strategy, and extraction completes in milliseconds without API calls.

Why Regex Over LLM?

| Criterion | Regex Strategies | LLM Relation Extraction |
| --- | --- | --- |
| Determinism | Identical output for identical input | Non-deterministic (temperature, context window) |
| Auditability | Each relationship tagged with strategy + confidence | Black-box output |
| Cost | Zero marginal cost per page | ~$0.01-0.05 per page (API call) |
| Latency | <10ms per page | 1-5s per page |
| Coverage | Limited to known patterns | Discovers novel relationships |

The system compensates for regex's limited coverage through domain knowledge maps (Strategy 2) and plausibility guards that filter implausible relationships rather than trying to discover all possible ones.

Strategy Catalog

All strategies are implemented in _extract_relationships() in medical_extraction.py. They run sequentially on every page, each contributing relationships that pass through a shared plausibility filter.

Pre-conditions

Before any strategy runs, the method computes page-level context:

  • page_department: Inferred from URL slug (e.g., /raadplegingen/cardiologie/) or page title
  • is_doctor_page: True if the page has 2 or fewer doctors, or the URL matches doctor/team patterns
  • is_focused_dept_page: page_department exists AND 15 or fewer doctors
  • allow_works_in: is_doctor_page OR is_focused_dept_page. This flag gates all WORKS_IN creation to prevent spurious doctor-department links on listing pages
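
The flags above can be sketched as follows. This is a minimal illustration, not the code in medical_extraction.py: the PageContext dataclass, the compute_page_context function, and the doctor-URL regex are assumptions made for the example. Only the flag names and the thresholds (2 and 15 doctors) come from the description.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical pattern for doctor/team URLs; the real patterns are not shown in this document.
DOCTOR_URL_RE = re.compile(r"/(artsen|team|dokter)(/|$)")


@dataclass
class PageContext:
    page_department: Optional[str]
    is_doctor_page: bool
    is_focused_dept_page: bool
    allow_works_in: bool


def compute_page_context(url: str, page_department: Optional[str], doctor_count: int) -> PageContext:
    # Doctor page: at most 2 doctors, or the URL looks like a doctor/team page.
    is_doctor_page = doctor_count <= 2 or bool(DOCTOR_URL_RE.search(url))
    # Focused department page: a known department and at most 15 doctors.
    is_focused = page_department is not None and doctor_count <= 15
    return PageContext(page_department, is_doctor_page, is_focused, is_doctor_page or is_focused)
```

With these rules, a department page listing 8 doctors allows WORKS_IN, while a 40-doctor listing page with no identifiable department does not.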

Strategy 1: Document-Level URL/Title Inference

| Property | Value |
| --- | --- |
| Relationships | WORKS_IN |
| Confidence | 0.85 |
| Guard | Requires page_department AND allow_works_in |

If a department is identifiable from the URL or page title, all extracted doctors on the page are linked to that department. Example: a page at /raadplegingen/cardiologie/ with two doctor names produces two WORKS_IN relationships to Cardiologie.
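
A minimal sketch of this strategy, assuming relationships are represented as simple named tuples. The Relationship type, the function name, and the strategy label are hypothetical; the guard and the 0.85 confidence follow the table above.

```python
from typing import NamedTuple, Optional


class Relationship(NamedTuple):
    source: str
    rel_type: str
    target: str
    confidence: float
    strategy: str  # recorded so each relationship can be traced to its origin


def infer_works_in_from_page(doctors, page_department: Optional[str], allow_works_in: bool):
    # Guard: both a page-level department and the allow_works_in flag are required.
    if not page_department or not allow_works_in:
        return []
    return [
        Relationship(doctor, "WORKS_IN", page_department, 0.85, "url_title_inference")
        for doctor in doctors
    ]
```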

Strategy 2: Domain Knowledge Mapping

| Property | Value |
| --- | --- |
| Relationships | HANDLES, OFFERS, PERFORMS, TREATS |
| Confidence | 0.8 (HANDLES/OFFERS/PERFORMS), 0.85 (TREATS) |
| Guard | Per-relationship plausibility checks |

Applies taxonomy-driven maps when entities co-occur on a page:

  1. DEPT_CONDITION_MAP: Department + condition on same page → HANDLES (guarded by _is_plausible_handles)
  2. DEPT_TREATMENT_MAP: Department + treatment → OFFERS (guarded by _is_plausible_relationship("OFFERS"))
  3. EXAM_PERFORMS_MAP: Department + examination → PERFORMS (guarded by _is_plausible_relationship("PERFORMS"))
  4. TREATMENT_CONDITION_MAP: Treatment + condition → TREATS (map match is sufficient)
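
The map lookups can be illustrated for two of the four cases (HANDLES and TREATS). The map contents and their shape (key mapped to a set of conditions) are assumptions for the example, and the guard is passed in as a callable rather than being the real _is_plausible_handles.

```python
# Illustrative contents only; the real maps are taxonomy-driven.
DEPT_CONDITION_MAP = {"cardiologie": {"hartfalen", "angina pectoris"}}
TREATMENT_CONDITION_MAP = {"pacemaker": {"hartfalen"}}


def map_relationships(departments, treatments, conditions,
                      is_plausible_handles=lambda dept, cond, conf: True):
    rels = []
    for dept in departments:
        for cond in conditions:
            # HANDLES needs a map match and must also pass the plausibility guard.
            if cond in DEPT_CONDITION_MAP.get(dept, set()) and is_plausible_handles(dept, cond, 0.8):
                rels.append((dept, "HANDLES", cond, 0.8))
    for treatment in treatments:
        for cond in conditions:
            # TREATS: a map match alone is sufficient.
            if cond in TREATMENT_CONDITION_MAP.get(treatment, set()):
                rels.append((treatment, "TREATS", cond, 0.85))
    return rels
```

Co-occurring entities without a map entry (here, a condition such as "eczeem" on a cardiology page) produce no relationship, which is how the maps limit false positives.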

Strategy 2b: Service-Department Linking

| Property | Value |
| --- | --- |
| Relationships | OFFERS |
| Confidence | 0.75 |

Links hospital services (e.g., "afspraak maken", i.e. making an appointment) to appropriate departments using a curated service-to-department map.

Strategy 3: All-Pairs for Focused Pages

| Property | Value |
| --- | --- |
| Relationships | WORKS_IN, OFFERS, PERFORMS, HANDLES, RELATED_TO |
| Confidence | 0.5-0.8 (varies by evidence) |

On pages with few entities, assumes all co-occurring entities are related:

| Entity Pair | Cap | Confidence | Relationship |
| --- | --- | --- | --- |
| Doctor-Department | ≤3 doctors, 1-3 depts | 0.7 | WORKS_IN |
| Department-Treatment | ≤2 departments | 0.75 | OFFERS |
| Department-Examination | ≤2 departments | 0.75 | PERFORMS |
| Department-Condition | ≤3 departments | 0.5-0.8 | HANDLES |
| Treatment-Condition | both non-empty | 0.6 | RELATED_TO |

Conditions mentioned in headings or early text receive a higher confidence (0.8) than body-only mentions (0.5). Body-only mentions that fall below the minimum confidence threshold (0.65), as the 0.5 baseline does, are filtered out.
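
The heading-versus-body rule can be sketched as below. The function names and the early_chars cutoff for "early text" are assumptions; the 0.8, 0.5, and 0.65 values come from the text.

```python
MIN_CONFIDENCE = 0.65  # minimum confidence threshold stated in the text


def condition_confidence(condition: str, headings: list[str], text: str,
                         early_chars: int = 300) -> float:
    # early_chars is an assumed cutoff; the real definition of "early text" is not documented here.
    needle = condition.lower()
    in_heading = any(needle in heading.lower() for heading in headings)
    in_early_text = needle in text[:early_chars].lower()
    return 0.8 if in_heading or in_early_text else 0.5


def survives_threshold(confidence: float) -> bool:
    return confidence >= MIN_CONFIDENCE
```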

Strategy 4: Sentence and Paragraph Co-occurrence

| Property | Value |
| --- | --- |
| Relationships | WORKS_IN, WORKS_AT, TREATS, DIAGNOSES, USED_FOR, RELATED_TO, OFFERS, PERFORMS |
| Confidence | 0.6-0.9 |

Two analysis levels:

Paragraph-level (skipped on listing pages with >15 doctors or >10 departments):

  • Doctor + Department in same paragraph → WORKS_IN (0.65)
  • Treatment + Condition → RELATED_TO (0.6)
  • Examination + Condition → USED_FOR (0.6)

Sentence-level (always runs): Uses Dutch relationship indicator patterns such as "werkt bij" (works at), "behandeling van" (treatment of), and "diagnose van" (diagnosis of) to boost confidence when explicit textual evidence exists:

| Pair Type | With pattern match | Without pattern | Guard |
| --- | --- | --- | --- |
| Doctor-Department | WORKS_AT (0.9) | WORKS_IN (0.75) | _is_valid_doctor_department() |
| Treatment-Condition | TREATS (0.9) | RELATED_TO (0.7) | plausibility check |
| Exam-Condition | DIAGNOSES (0.9) | USED_FOR (0.7) | is_plausible_used_for() |
| Department-Treatment | OFFERS (0.8) | OFFERS (0.8) | plausibility check |
| Department-Exam | PERFORMS (0.8) | PERFORMS (0.8) | plausibility check |
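
A sketch of the sentence-level upgrade for the three pair types whose outcome depends on a pattern match. The regexes contain only the three indicator phrases quoted above (the real pattern list is presumably longer), and the classify_pair function and its pair-type keys are hypothetical. Guards are omitted.

```python
import re

# One indicator phrase per pair type, as quoted in the text.
INDICATORS = {
    ("doctor", "department"): re.compile(r"\bwerkt bij\b", re.IGNORECASE),
    ("treatment", "condition"): re.compile(r"\bbehandeling van\b", re.IGNORECASE),
    ("examination", "condition"): re.compile(r"\bdiagnose van\b", re.IGNORECASE),
}
# (with pattern match, without pattern) per pair type, taken from the table above.
OUTCOMES = {
    ("doctor", "department"): (("WORKS_AT", 0.9), ("WORKS_IN", 0.75)),
    ("treatment", "condition"): (("TREATS", 0.9), ("RELATED_TO", 0.7)),
    ("examination", "condition"): (("DIAGNOSES", 0.9), ("USED_FOR", 0.7)),
}


def classify_pair(sentence: str, pair_type: tuple[str, str]) -> tuple[str, float]:
    with_pattern, without_pattern = OUTCOMES[pair_type]
    # Explicit textual evidence upgrades both the relationship type and the confidence.
    return with_pattern if INDICATORS[pair_type].search(sentence) else without_pattern
```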

Strategy 5: URL-Inferred Specialty

| Property | Value |
| --- | --- |
| Relationships | WORKS_IN |
| Confidence | 0.75 |

Creates WORKS_IN from URL-inferred specialty when no department was explicitly extracted. If a doctor was found on a page at /huidziekten/artsen, _enrich_doctors_with_specialty sets metadata["inferred_department"] = "Dermatologie", and Strategy 5 creates the relationship.

Strategy 6: Campus Relationships

| Property | Value |
| --- | --- |
| Relationships | LOCATED_AT (facilities), AVAILABLE_AT (services) |
| Confidence | 0.7 |
| Guard | Exactly 1 campus on the page |

Links facilities and services to the single campus mentioned on a page. The single-campus guard prevents cross-product relationships: without it, a page listing all campuses would incorrectly link every service to every campus.
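
The single-campus guard amounts to a length check before any link is created. A minimal sketch, with a hypothetical function name and tuple-based relationships:

```python
def campus_relationships(campuses, facilities, services):
    # Guard: act only when exactly one campus is mentioned on the page.
    if len(campuses) != 1:
        return []
    campus = campuses[0]
    rels = [(facility, "LOCATED_AT", campus, 0.7) for facility in facilities]
    rels += [(service, "AVAILABLE_AT", campus, 0.7) for service in services]
    return rels
```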

Strategy 7: Hospital Hierarchy

| Property | Value |
| --- | --- |
| Relationships | HAS_CAMPUS, BELONGS_TO, LOCATED_AT |
| Confidence | 1.0 (hierarchy), 0.9 (LOCATED_AT) |

Runs unconditionally on every extraction. Establishes the structural hierarchy:

  1. Hospital HAS_CAMPUS each of the 4 campuses (confidence 1.0)
  2. Every department BELONGS_TO the hospital (confidence 1.0)
  3. Departments LOCATED_AT campuses via DEPARTMENT_CAMPUS_MAP (confidence 0.9)

Strategy 7b: Doctor-Campus from Department-Campus

| Property | Value |
| --- | --- |
| Relationships | WORKS_AT_CAMPUS |
| Confidence | 0.85 |

Derives doctor-campus links from the chain: Doctor → WORKS_IN → Department → LOCATED_AT → Campus. Only creates the link if the department maps to a single campus (prevents cross-product).
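
The chain derivation can be sketched as a lookup that only fires for unambiguous departments. The function name and the map shape (department mapped to a list of campuses) are assumptions for the example.

```python
def doctor_campus_links(works_in, department_campus_map):
    """works_in: iterable of (doctor, department) pairs.
    department_campus_map: department -> list of campuses (assumed shape)."""
    links = []
    for doctor, dept in works_in:
        campuses = department_campus_map.get(dept, [])
        # Only a department tied to exactly one campus yields a link.
        if len(campuses) == 1:
            links.append((doctor, "WORKS_AT_CAMPUS", campuses[0], 0.85))
    return links
```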

Strategy 8: Domain Knowledge PERFORMS

| Property | Value |
| --- | --- |
| Relationships | PERFORMS |
| Confidence | 0.8 |

Page-independent strategy: whenever an examination entity is extracted, checks EXAM_PERFORMS_MAP to create PERFORMS relationships to the examination's performing department(s), regardless of page co-occurrence. Example: extracting "ECG" on any page triggers PERFORMS from Cardiologie.
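
Because this strategy ignores page co-occurrence, it reduces to a map lookup per extracted examination. In the sketch below the map shape (examination mapped to performing departments), the lowercase keys, and the function name are assumptions; the ECG entry mirrors the example in the text.

```python
# Illustrative contents; assumed shape is examination -> performing departments.
EXAM_PERFORMS_MAP = {"ecg": ["Cardiologie"]}


def performs_from_exams(examinations):
    rels = []
    for exam in examinations:
        # No department needs to appear on the page; the map alone decides.
        for dept in EXAM_PERFORMS_MAP.get(exam.lower(), []):
            rels.append((dept, "PERFORMS", exam, 0.8))
    return rels
```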

Plausibility Guard Framework

Every relationship passes through plausibility guards before being accepted. The guards form a layered defense against spurious relationships.

Guard Functions

_is_plausible_handles(dept, condition, confidence) -- Guards HANDLES relationships:

  1. Diagnostic department block: Departments flagged is_diagnostic=True (Radiologie, Labo, Nucleaire Geneeskunde, Pathologie) never get HANDLES relationships
  2. Negative map: DEPT_CONDITION_NEGATIVE_MAP blocks specific department-condition pairs (e.g., Patiëntenbegeleiding cannot HANDLE clinical conditions)
  3. Hub concept guard: Overly broad conditions ("pijn", "tumor", "allergie", "infectie") only link to departments in allowed domain groups
  4. Positive domain guard: If DEPT_CONDITION_MAP lists conditions for a department, only listed conditions pass
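
The four layers can be read as a sequence of early returns. The sketch below is an illustration under assumptions: the map contents are abbreviated, DEPT_DOMAIN_GROUP is a hypothetical lookup standing in for the hospital config, the order of checks follows the list above, and the confidence parameter of the real function is omitted.

```python
DIAGNOSTIC_DEPTS = {"radiologie", "labo", "nucleaire geneeskunde", "pathologie"}
DEPT_CONDITION_NEGATIVE_MAP = {"patiëntenbegeleiding": {"diabetes", "kanker", "hartfalen"}}
HUB_CONDITIONS = {"pijn", "tumor", "allergie", "infectie"}
HUB_ALLOWED_DOMAIN_GROUPS = {"oncology", "emergency_icu", "internal_medicine"}
DEPT_DOMAIN_GROUP = {"cardiologie": "cardiovascular", "oncologie": "oncology"}  # assumed lookup
DEPT_CONDITION_MAP = {"cardiologie": {"hartfalen"}}  # illustrative contents


def is_plausible_handles(dept: str, condition: str) -> bool:
    # 1. Diagnostic departments never get HANDLES.
    if dept in DIAGNOSTIC_DEPTS:
        return False
    # 2. Explicitly blocked department-condition pairs.
    if condition in DEPT_CONDITION_NEGATIVE_MAP.get(dept, set()):
        return False
    # 3. Hub conditions only link to departments in allowed domain groups.
    if condition in HUB_CONDITIONS and DEPT_DOMAIN_GROUP.get(dept) not in HUB_ALLOWED_DOMAIN_GROUPS:
        return False
    # 4. If the department has a positive list, only listed conditions pass.
    listed = DEPT_CONDITION_MAP.get(dept)
    if listed:
        return condition in listed
    return True
```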

_is_plausible_relationship(rel_type, dept, target, confidence) -- Guards OFFERS, PERFORMS, TREATS, USED_FOR:

  • OFFERS: 4-layer validation (treatment blocklist, imaging department guard, DEPT_TREATMENT_MAP positive guard, hub treatment guard)
  • PERFORMS: Imaging exam guard (MRI/CT only from imaging departments), EXAM_DOMAIN_MAP domain group validation
  • TREATS/USED_FOR: Cross-entity domain plausibility check

_is_valid_doctor_department(doctor, dept) -- Guards WORKS_IN:

  • Uses SPECIALTY_DEPARTMENT_MAP to validate that a doctor's specialty is compatible with the target department
  • Prevents, for example, a cardiologist from being linked to Pediatrie merely because both appear on the same page

Domain Groups

Each department is assigned a domain group in the hospital config (zol.yaml). Domain groups cluster clinically related departments and are used by plausibility guards to validate cross-entity relationships.

| Domain Group | Example Departments | Count |
| --- | --- | --- |
| cardiovascular | Cardiologie, Cardiochirurgie, Hartrevalidatie | 4 |
| surgery | Algemene Heelkunde, Orthopedie, Urologie, Plastische Chirurgie | 6 |
| oncology | Oncologie, Hematologie, Radiotherapie | 4 |
| neuroscience | Neurologie, Neurochirurgie | 2 |
| internal_medicine | Gastro-enterologie, Pneumologie, Nefrologie, Endocrinologie | 9 |
| women_children | Gynaecologie, Materniteit, Pediatrie, Neonatologie | 7 |
| diagnostics | Radiologie, Labo, Nucleaire Geneeskunde, Pathologie | 6 |
| sensory | Oftalmologie, Keel- Neus- en Oorziekten | 2 |
| psychiatry | Psychiatrie, Kinder- en Jeugdpsychiatrie | 3 |
| emergency_icu | Spoedgevallen, Intensieve Zorgen, MUG | 4 |
| rehabilitation | Fysische Geneeskunde, Revalidatie | 3 |
| support | Patiëntenbegeleiding, Sociale Dienst, Diëtetiek | 5 |
| centers | Slaapcentrum, Pijncentrum, Borstcentrum | 8 |

Hub Blocking

Hub conditions (HUB_CONDITIONS): Generic terms like "pijn" (pain), "tumor", "allergie", "infectie", "ontsteking" that would connect to nearly every department if unchecked. These only link to departments in HUB_ALLOWED_DOMAIN_GROUPS (typically oncology, emergency/ICU, and internal medicine).

Hub treatments (HUB_TREATMENTS): Generic treatments like "medicatie", "therapie", "revalidatie" that are similarly over-broad.

Negative Maps

DEPT_CONDITION_NEGATIVE_MAP explicitly blocks known-incorrect relationships:

  • None (block all): Diagnostic departments and support departments that should never HANDLE any condition. Example: "radiologie": None means Radiologie never gets HANDLES.
  • Specific blocks: Named conditions that a department should not handle. Example: "patiëntenbegeleiding": {"diabetes", "kanker", "hartfalen", ...} blocks clinical conditions from the patient support department.
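
The two entry kinds differ only in how a lookup result is interpreted. A minimal sketch, with an abbreviated map and a hypothetical is_blocked helper:

```python
# Abbreviated example entries in the two forms described above.
NEGATIVE_MAP = {
    "radiologie": None,  # None: block every condition
    "patiëntenbegeleiding": {"diabetes", "kanker", "hartfalen"},  # block only these
}


def is_blocked(dept: str, condition: str, negative_map=NEGATIVE_MAP) -> bool:
    if dept not in negative_map:
        return False
    blocked = negative_map[dept]
    # None means "block all"; a set blocks only its members.
    return blocked is None or condition in blocked
```

Using None as a sentinel keeps "block everything" distinct from an empty set, which would block nothing.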

Confidence Scoring

Confidence values are deterministic and source-based, not probabilistic. They encode the quality of evidence:

| Confidence | Evidence Level | Example |
| --- | --- | --- |
| 1.0 | Hub page listing or structural hierarchy | Doctor from /zol-artsen, Hospital HAS_CAMPUS |
| 0.9 | Sentence-level pattern match or config-derived | "Dr. X werkt bij Cardiologie", LOCATED_AT from config |
| 0.85 | URL/title inference or domain knowledge with TREATS | Strategy 1 (URL-inferred WORKS_IN), Strategy 7b |
| 0.8 | Domain knowledge map match | Strategy 2 (HANDLES from DEPT_CONDITION_MAP) |
| 0.75 | Focused page all-pairs or URL-inferred specialty | Strategy 3 (department-treatment), Strategy 5 |
| 0.7 | Sentence co-occurrence without pattern match | Strategy 4 (RELATED_TO from sentence proximity) |
| 0.65 | Paragraph co-occurrence | Strategy 4 paragraph-level WORKS_IN |
| 0.6 | Weak co-occurrence evidence | Strategy 3 (treatment-condition RELATED_TO) |
| 0.5 | Body-only condition mention | Strategy 3 (condition in body text, no heading) |

A minimum confidence threshold of 0.65 is applied during PostgreSQL taxonomy storage, which filters out the weakest evidence levels (0.6 and 0.5).

Pipeline Flow

When multiple strategies produce the same relationship (same source, target, and type), deduplication keeps the one with the highest confidence value.
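
A minimal sketch of this deduplication step, assuming relationships are (source, rel_type, target, confidence) tuples; the function name is hypothetical.

```python
def deduplicate(relationships):
    """Keep one relationship per (source, rel_type, target), preferring the highest confidence."""
    best = {}
    for source, rel_type, target, confidence in relationships:
        key = (source, rel_type, target)
        if key not in best or confidence > best[key]:
            best[key] = confidence
    # Dicts preserve insertion order, so output follows first appearance of each key.
    return [(s, r, t, c) for (s, r, t), c in best.items()]
```

For example, a WORKS_IN link found by paragraph co-occurrence (0.65) and by URL/title inference (0.85) is kept once, at 0.85.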
