Graph Value Assessment — 2026-02-23 09:04 UTC
Label: graph-value-assessment
Abstract
This assessment uses an LLM-as-judge setup (GPT-4.1) to evaluate answer quality dimensions that entity recall cannot capture: relationship accuracy, completeness, navigational utility, factual grounding, and information structure. Each question is scored independently, and the A/B position of the Graph ON and Graph OFF answers is randomized per question to mitigate position bias.
Summary
| Metric | Graph ON | Graph OFF | Delta |
|---|---|---|---|
| Overall quality | 4.609 | 4.676 | -0.067 |
| Judge prefers | 21 (23%) | 55 (61%) | — |
| Ties | 14 (16%) | — | — |
| Questions evaluated | 90 | 90 | — |
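
The preference percentages above follow from the raw counts over the 90 evaluated questions. A minimal sketch of that arithmetic is shown here; the function name `preference_shares` is illustrative and not taken from the evaluation code.

```python
def preference_shares(pref_on: int, pref_off: int, ties: int) -> dict[str, float]:
    """Return each outcome's share of all judged questions, in percent."""
    total = pref_on + pref_off + ties
    return {
        "graph_on": 100 * pref_on / total,
        "graph_off": 100 * pref_off / total,
        "tie": 100 * ties / total,
    }


# Counts copied from the Summary table.
shares = preference_shares(pref_on=21, pref_off=55, ties=14)
for outcome, pct in shares.items():
    print(f"{outcome}: {pct:.1f}%")
```

Rounded to whole percentages, the shares match the 23%, 61%, and 16% reported in the table.
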
Per-Dimension Comparison
| Dimension | Graph ON | Graph OFF | Delta | Winner |
|---|---|---|---|---|
| Relationship Accuracy | 4.77 | 4.82 | -0.06 | Graph OFF |
| Completeness | 4.53 | 4.71 | -0.18 | Graph OFF |
| Navigational Utility | 4.14 | 4.17 | -0.02 | Tie |
| Factual Grounding | 4.81 | 4.84 | -0.03 | Tie |
| Information Structure | 4.79 | 4.83 | -0.04 | Tie |
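
The Methodology section states that overall quality is the average of the per-dimension scores. As a rough consistency check, the sketch below averages the rounded dimension means from the table. Because the inputs are rounded to two decimals, the result should only be expected to agree with the reported overall scores (4.609 and 4.676) to within rounding.

```python
from statistics import mean

# Per-dimension means copied from the table above: (Graph ON, Graph OFF).
dimension_scores = {
    "relationship_accuracy": (4.77, 4.82),
    "completeness": (4.53, 4.71),
    "navigational_utility": (4.14, 4.17),
    "factual_grounding": (4.81, 4.84),
    "information_structure": (4.79, 4.83),
}

overall_on = mean(on for on, _ in dimension_scores.values())
overall_off = mean(off for _, off in dimension_scores.values())

print(f"Graph ON:  {overall_on:.3f}")
print(f"Graph OFF: {overall_off:.3f}")
print(f"Delta:     {overall_on - overall_off:+.3f}")
```
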
Per-Category Analysis
| Category | n | Avg ON | Avg OFF | Delta | Pref ON | Pref OFF | Ties |
|---|---|---|---|---|---|---|---|
| ambiguous_symptom | 5 | 4.56 | 4.88 | -0.32 | 2 | 2 | 1 |
| condition_department | 19 | 4.73 | 4.85 | -0.13 | 3 | 11 | 5 |
| doctor_department | 6 | 4.57 | 4.47 | +0.10 | 2 | 2 | 2 |
| entity_disambiguation | 8 | 4.45 | 4.38 | +0.08 | 0 | 7 | 1 |
| followup_chain | 6 | 4.47 | 4.63 | -0.17 | 1 | 4 | 1 |
| multi_hop_graph | 17 | 4.54 | 4.60 | -0.06 | 4 | 12 | 1 |
| snomed_terminology | 14 | 4.80 | 4.79 | +0.01 | 7 | 6 | 1 |
| taxonomy_alias | 7 | 4.80 | 4.91 | -0.11 | 1 | 6 | 0 |
| treatment_info | 8 | 4.30 | 4.38 | -0.08 | 1 | 5 | 2 |
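
The per-category rows can be cross-checked against the Summary table. The sketch below, using values copied from the table, sums the question counts and preference counts and computes the n-weighted average of the category means. The category averages are rounded, so agreement with the overall scores is approximate.

```python
# (category, n, avg_on, avg_off, pref_on, pref_off, ties) copied from the table.
rows = [
    ("ambiguous_symptom", 5, 4.56, 4.88, 2, 2, 1),
    ("condition_department", 19, 4.73, 4.85, 3, 11, 5),
    ("doctor_department", 6, 4.57, 4.47, 2, 2, 2),
    ("entity_disambiguation", 8, 4.45, 4.38, 0, 7, 1),
    ("followup_chain", 6, 4.47, 4.63, 1, 4, 1),
    ("multi_hop_graph", 17, 4.54, 4.60, 4, 12, 1),
    ("snomed_terminology", 14, 4.80, 4.79, 7, 6, 1),
    ("taxonomy_alias", 7, 4.80, 4.91, 1, 6, 0),
    ("treatment_info", 8, 4.30, 4.38, 1, 5, 2),
]

total_n = sum(r[1] for r in rows)
weighted_on = sum(r[1] * r[2] for r in rows) / total_n
weighted_off = sum(r[1] * r[3] for r in rows) / total_n
pref_totals = tuple(sum(r[i] for r in rows) for i in (4, 5, 6))

print(f"questions: {total_n}")
print(f"weighted avg ON:  {weighted_on:.3f}")
print(f"weighted avg OFF: {weighted_off:.3f}")
print(f"pref ON / pref OFF / ties: {pref_totals}")
```
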
Questions Most Improved by Knowledge Graph
| ID | Category | Question | ON Avg | OFF Avg | Delta | Judge Notes |
|---|---|---|---|---|---|---|
| GQ-088 | multi_hop_graph | Welke behandelingen biedt de Cardiologie aan voor hartfalen? | 4.8 | 2.0 | +2.8 | Dit antwoord noemt de juiste dienst en aandoening, en specificeert meerdere rele |
| GQ-078 | entity_disambiguation | Biedt ZOL revalidatie aan op Sint-Jan? | 4.8 | 4.0 | +0.8 | Geeft een volledig overzicht van revalidatie op beide campussen, benoemt explici |
| GQ-067 | followup_chain | Ik heb last van rugpijn | 5.0 | 4.4 | +0.6 | Noemt alle relevante diensten (Revalidatie en Fysische Geneeskunde, Pijncentrum, |
| GQ-068 | followup_chain | Kan ik daar zonder verwijsbrief terecht? | 5.0 | 4.4 | +0.6 | Uitstekend specifiek antwoord: noemt het Gendercentrum, hormoontherapie, intake, |
| GQ-172 | snomed_terminology | Mijn moeder heeft osteoporose | 5.0 | 4.4 | +0.6 | Answer A accurately describes the relationships between osteoporose and the rele |
| GQ-108 | treatment_info | Wat is logopedie en voor welke aandoeningen helpt het? | 4.0 | 3.6 | +0.4 | Eveneens goed gestructureerd en volledig, met een extra vermelding dat familiele |
| GQ-002 | doctor_department | Welke cardiologen werken bij ZOL? | 4.8 | 4.4 | +0.4 | Answer B provides the same specific list of cardiologists, explicitly states the |
| GQ-101 | multi_hop_graph | Welke behandelingen bestaan er voor een beroerte? | 4.8 | 4.4 | +0.4 | Answer B provides a comprehensive and well-structured overview, explicitly menti |
| GQ-004 | doctor_department | Bij welke afdeling werkt Dr. Rik Houben? | 4.0 | 3.8 | +0.2 | Answer A correctly states the relationship between Dr. Rik Houben and the Neurol |
| GQ-064 | followup_chain | Welke artsen werken bij de Cardiologie? | 5.0 | 4.8 | +0.2 | Answer B is also comprehensive and well-structured, listing the main cardiologis |
| GQ-097 | taxonomy_alias | Mijn kind heeft waterpokken | 5.0 | 4.8 | +0.2 | Answer B also accurately connects waterpokken in children to Kindergeneeskunde, |
| GQ-104 | treatment_info | Welke afdelingen bieden revalidatie aan na een beroerte? | 4.8 | 4.6 | +0.2 | Answer B covers all relevant departments and adds detail about the multidiscipli |
| GQ-121 | multi_hop_graph | Welke dokter behandelt diabetes en op welke campus kan ik bi | 5.0 | 4.8 | +0.2 | Answer B also accurately connects diabetes to the dienst Endocrinologie, lists t |
| GQ-128 | condition_department | Ik heb hepatitis B, bij welke dienst kan ik terecht voor beh | 4.2 | 4.0 | +0.2 | Sterk gestructureerd en duidelijk, noemt de juiste afdeling (Gastro-enterologie) |
| GQ-129 | entity_disambiguation | Ik wil een neuscorrectie laten doen bij ZOL, kan dat? | 4.4 | 4.2 | +0.2 | Benoemt NKO en het Gendercentrum, maar noemt Plastische Heelkunde niet, wat een |
Questions Regressed with Knowledge Graph
| ID | Category | Question | ON Avg | OFF Avg | Delta | Judge Notes |
|---|---|---|---|---|---|---|
| GQ-008 | condition_department | Bij welke dienst moet ik zijn voor rugpijn? | 4.6 | 4.8 | -0.2 | Uitstekend antwoord dat alle relevante diensten noemt (inclusief samenwerking me |
| GQ-010 | condition_department | Welke afdeling helpt bij longproblemen? | 4.6 | 4.8 | -0.2 | Answer A provides a comprehensive overview of all relevant departments and their |
| GQ-022 | treatment_info | Hoe verloopt een bloedafname bij ZOL? | 4.8 | 5.0 | -0.2 | Answer B is also highly accurate and complete, covering all relevant relationshi |
| GQ-040 | condition_department | Mijn kind heeft oorpijn, welke dokter moet ik raadplegen? | 4.8 | 5.0 | -0.2 | Answer B also correctly links oorpijn in children to the KNO department at ZOL, |
| GQ-072 | ambiguous_symptom | Ik heb al weken last van hoofdpijn | 4.8 | 5.0 | -0.2 | Answer B also accurately describes the relationships (huisarts, dienst Neurologi |
| GQ-073 | ambiguous_symptom | Ik voel een knobbeltje in mijn hals | 4.8 | 5.0 | -0.2 | Answer A clearly identifies KNO as the relevant department, explains what condit |
| GQ-092 | multi_hop_graph | Welke onderzoeken doet de dienst Cardiologie? | 4.2 | 4.4 | -0.2 | Alle relevante onderzoeken worden genoemd, inclusief Holter-monitoring en beide |
| GQ-095 | taxonomy_alias | Ik zoek een hartdokter | 4.8 | 5.0 | -0.2 | Uitstekende relatiebeschrijving: koppelt cardiologen aan dienst Cardiologie, ben |
| GQ-099 | taxonomy_alias | Waar kan ik een hartfilmpje laten maken? | 4.8 | 5.0 | -0.2 | Answer B noemt alle relevante afdelingen (Cardiologie, Medium Care), legt uit da |
| GQ-105 | condition_department | Welke dokter kan mij helpen met artrose? | 4.8 | 5.0 | -0.2 | Answer B also accurately links artrose to the correct departments and doctors, a |
Response Time Comparison
| Metric | Graph ON | Graph OFF | Delta |
|---|---|---|---|
| Mean | 14819 ms | 29448 ms | -14629 ms |
| Median | 8220 ms | 11169 ms | -2948 ms |
Entity Recall vs Quality Score Comparison
This section compares crude entity recall with the LLM-judge quality score to show what entity recall alone does and does not capture.
| Metric | Graph ON | Graph OFF | Delta |
|---|---|---|---|
| Entity recall (crude) | 0.940 | 0.963 | -0.023 |
| Quality score (LLM judge) | 4.609 | 4.676 | -0.067 |
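With the knowledge graph enabled, the quality score changes by -0.067 and entity recall by -0.023. Both deltas are negative, and the quality delta is larger in absolute terms, although the two metrics use different scales and their magnitudes are not directly comparable. This suggests the graph does not provide quality gains beyond entity mention.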
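
Because the quality score is on a 5-point scale and entity recall is a fraction between 0 and 1, one way to compare the two deltas is to divide each by the width of its scale. The sketch below does this under two assumptions that the report does not state: that the Likert scale runs from 1 to 5, and that range normalization is a reasonable basis for comparison. The helper `normalized_delta` is illustrative.

```python
def normalized_delta(delta: float, scale_min: float, scale_max: float) -> float:
    """Express a delta as a fraction of the metric's full range."""
    return delta / (scale_max - scale_min)


# Assumption: Likert scores range from 1 to 5; entity recall ranges from 0 to 1.
quality_norm = normalized_delta(-0.067, scale_min=1.0, scale_max=5.0)
recall_norm = normalized_delta(-0.023, scale_min=0.0, scale_max=1.0)

print(f"quality delta (fraction of range): {quality_norm:+.4f}")
print(f"recall delta  (fraction of range): {recall_norm:+.4f}")
```

Under these assumptions both normalized deltas remain negative and small, which is consistent with the reading that the graph does not improve either metric.
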
Methodology
- Judge model: GPT-4.1 (temperature 0.0 to minimize run-to-run variation)
- Position bias mitigation: A/B assignment randomized per question
- Scoring: 5-point Likert scale per dimension, averaged for overall quality
- Categories evaluated: Questions from categories most likely to benefit from graph context
- Categories: ambiguous_symptom, condition_department, doctor_department, entity_disambiguation, followup_chain, multi_hop_graph, snomed_terminology, taxonomy_alias, treatment_info
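
The position-bias mitigation and scoring steps listed above could be sketched as follows. This is an illustrative outline, not the evaluation harness itself: the function names, the dimension keys, and the example scores are all hypothetical, and the call to the judge model is omitted.

```python
import random
from statistics import mean

DIMENSIONS = (
    "relationship_accuracy",
    "completeness",
    "navigational_utility",
    "factual_grounding",
    "information_structure",
)


def assign_positions(rng: random.Random) -> dict[str, str]:
    """Randomly map the judge-facing labels 'A' and 'B' to the two configurations."""
    conditions = ["graph_on", "graph_off"]
    rng.shuffle(conditions)
    return {"A": conditions[0], "B": conditions[1]}


def overall_quality(scores: dict[str, int]) -> float:
    """Average the 5-point Likert scores over all dimensions."""
    return mean(scores[d] for d in DIMENSIONS)


def unblind(assignment: dict[str, str],
            judge_scores: dict[str, dict[str, int]]) -> dict[str, float]:
    """Map overall scores keyed by 'A'/'B' back to graph_on/graph_off."""
    return {assignment[label]: overall_quality(judge_scores[label])
            for label in ("A", "B")}


# Hypothetical judge output for one question (made-up scores, for illustration only).
rng = random.Random(42)  # a fixed seed keeps the assignment reproducible
assignment = assign_positions(rng)
judge_scores = {
    "A": dict(zip(DIMENSIONS, (5, 4, 4, 5, 5))),
    "B": dict(zip(DIMENSIONS, (5, 5, 4, 5, 5))),
}
result = unblind(assignment, judge_scores)
print(assignment)
print(result)
```

Keeping the label-to-configuration mapping separate from the scores means the judge only ever sees "Answer A" and "Answer B", and the configuration is recovered after scoring.
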