Evaluation Report — 2026-02-17 14:31 UTC
Label: v2.5-post-fixes
Summary
| Metric | Value |
|---|---|
| Pass rate | 97.9% (143/146) |
| Failed | 3 |
| Errors | 0 |
| Avg faithfulness | N/A (disabled) |
| Avg answer relevancy | N/A (disabled) |
| Avg context precision | N/A (disabled) |
| Avg context recall | N/A (disabled) |
| Avg entity recall | 0.942 |
| Avg response time | 15533 ms |
| Total eval duration | 2414.5 s |
| Safety refusal accuracy | 100.0% |
System Configuration
Configuration snapshot at evaluation time. Each setting can influence retrieval quality, response generation, and overall pass rates.
Git Context
| Property | Value |
|---|---|
| Branch | master |
| Commit | f146d5c |
| Message | fix: resolve 9 golden eval failures with root-cause analysis (v2.5) |
LLM Models
| Role | Model |
|---|---|
| RAG generation | openai/o4-mini (provider: openrouter) |
| Escalation (Think Harder) | openai/gpt-4.1 |
| Follow-up classification | openai/gpt-4.1-nano |
| Evaluation (DeepEval judge) | openai/gpt-4.1-mini |
| Intent classification | openai/gpt-4.1-mini |
| Embedding | nomic-embed-text (768d, provider: ollama) |
Generation Parameters
| Parameter | Value |
|---|---|
| Temperature | 0.1 |
| Max tokens | 1000 |
| Full-mode temperature | 0.1 |
| Full-mode max tokens | 1500 |
Retrieval Parameters
| Parameter | Value |
|---|---|
| Full mode (always-on reranking) | ON |
| Rerank candidates | 50 |
| Escalation candidates | 100 |
| Escalation min similarity | 0.35 |
| Escalation rerank top-k | 20 |
| Context assembly max tokens | 4000 |
| Context expand window | 1 chunk |
| BM25 hybrid search | ON (weight: 0.3) |
| Vector weight | 0.7 |
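The BM25 weight (0.3) and vector weight (0.7) above sum to 1.0, which is consistent with a weighted fusion of keyword and semantic scores. The sketch below shows one common way to combine them, assuming both score sets are min-max normalized to [0, 1] before weighting. The function names and the normalization step are illustrative assumptions; the report does not show how the pipeline actually fuses the scores.

```python
from typing import Dict

BM25_WEIGHT = 0.3    # value from the Retrieval Parameters table
VECTOR_WEIGHT = 0.7  # value from the Retrieval Parameters table


def min_max_normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Scale scores to [0, 1]. Hypothetical helper, not taken from the pipeline."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All candidates scored the same, so treat them as equally relevant.
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def fuse_scores(bm25: Dict[str, float], vector: Dict[str, float]) -> Dict[str, float]:
    """Weighted sum of normalized BM25 and vector scores (assumed fusion rule)."""
    bm25_n = min_max_normalize(bm25)
    vector_n = min_max_normalize(vector)
    doc_ids = set(bm25_n) | set(vector_n)
    # A document missing from one result list contributes 0.0 for that signal.
    return {
        doc_id: BM25_WEIGHT * bm25_n.get(doc_id, 0.0)
        + VECTOR_WEIGHT * vector_n.get(doc_id, 0.0)
        for doc_id in doc_ids
    }
```

With these weights, a chunk that ranks first only on keyword match can score at most 0.3, while one that ranks first only on embedding similarity can reach 0.7.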
Feature Flags
These flags control which components of the RAG pipeline are active. Toggling a single flag between runs makes it possible to measure that feature's contribution.
| Feature | Status | Impact |
|---|---|---|
| Knowledge Graph (Neo4j) | ON | Multi-hop entity retrieval |
| Graph deep traversal | ON | 3-4 hop graph queries |
| Contextual embeddings | ON | Chunk-level context in embeddings |
| BM25 hybrid search | ON | Keyword + semantic search fusion |
| Context filtering (FILCO) | OFF | Sentence-level relevance filtering |
| Semantic query cache | ON | Cache similar query results |
| Cache similarity threshold | 0.97 | Min cosine for cache hit |
| Intent classification | ON | Safety guardrail pre-filter |
| Safety validation | ON | Post-generation safety check |
| Safety LLM judge | OFF | LLM-as-judge defense-in-depth |
| Quality evaluation | ON | Background quality scoring |
| Auto-refusal on low quality | ON | Refuse if score < 0.4 |
| True token streaming | OFF | Real-time token delivery |
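Two rows in the table above are numeric thresholds rather than switches: the cache similarity threshold (0.97, minimum cosine for a cache hit) and the auto-refusal cutoff (refuse if score < 0.4). The sketch below illustrates how such thresholds are typically applied. The function names, the in-memory list used as a cache, and the choice of `>=` for the cache comparison are assumptions for illustration, not the pipeline's actual code.

```python
import math
from typing import List, Optional, Sequence, Tuple

CACHE_SIMILARITY_THRESHOLD = 0.97  # "Min cosine for cache hit"
AUTO_REFUSAL_THRESHOLD = 0.4       # "Refuse if score < 0.4"


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def lookup_cache(
    query_vec: Sequence[float],
    cache: List[Tuple[Sequence[float], str]],
) -> Optional[str]:
    """Return the cached answer of the most similar query, if it clears the threshold."""
    best_answer: Optional[str] = None
    best_sim = -1.0
    for cached_vec, cached_answer in cache:
        sim = cosine_similarity(query_vec, cached_vec)
        if sim > best_sim:
            best_sim, best_answer = sim, cached_answer
    return best_answer if best_sim >= CACHE_SIMILARITY_THRESHOLD else None


def should_refuse(quality_score: float) -> bool:
    """Apply the auto-refusal rule from the table: refuse if score < 0.4."""
    return quality_score < AUTO_REFUSAL_THRESHOLD
```

A threshold as high as 0.97 means only near-identical queries reuse a cached answer, which trades cache hit rate for a lower risk of serving an answer to a different question.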
Evaluation Run Parameters
| Parameter | Value |
|---|---|
| DeepEval metrics | OFF (entity-recall only) |
| Questions file | golden_questions.json |
Results by Category
| Category | Pass | Fail | Error | Total | Rate |
|---|---|---|---|---|---|
| ambiguous_symptom | 5 | 0 | 0 | 5 | 100.0% |
| campus_info | 6 | 0 | 0 | 6 | 100.0% |
| compound_word | 6 | 0 | 0 | 6 | 100.0% |
| condition_department | 18 | 1 | 0 | 19 | 94.7% |
| doctor_department | 6 | 0 | 0 | 6 | 100.0% |
| emergency | 3 | 0 | 0 | 3 | 100.0% |
| entity_disambiguation | 7 | 1 | 0 | 8 | 87.5% |
| followup_chain | 6 | 0 | 0 | 6 | 100.0% |
| multi_hop_graph | 19 | 0 | 0 | 19 | 100.0% |
| multilingual | 8 | 0 | 0 | 8 | 100.0% |
| navigation | 5 | 0 | 0 | 5 | 100.0% |
| out_of_scope | 9 | 0 | 0 | 9 | 100.0% |
| practical_info | 11 | 1 | 0 | 12 | 91.7% |
| referral | 3 | 0 | 0 | 3 | 100.0% |
| safety_refusal | 7 | 0 | 0 | 7 | 100.0% |
| service_info | 9 | 0 | 0 | 9 | 100.0% |
| taxonomy_alias | 7 | 0 | 0 | 7 | 100.0% |
| treatment_info | 8 | 0 | 0 | 8 | 100.0% |
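The Rate column and the overall pass rate in the Summary can be re-derived from the counts above. The sketch below does that arithmetic, assuming that errors count toward the total (every error count is 0 in this run, so the report itself does not settle this). The `pass_rate` helper is illustrative and not part of run_evaluation.py.

```python
from typing import Dict, Tuple

# (pass, fail, error) per category, copied from the table above.
COUNTS: Dict[str, Tuple[int, int, int]] = {
    "ambiguous_symptom": (5, 0, 0),
    "campus_info": (6, 0, 0),
    "compound_word": (6, 0, 0),
    "condition_department": (18, 1, 0),
    "doctor_department": (6, 0, 0),
    "emergency": (3, 0, 0),
    "entity_disambiguation": (7, 1, 0),
    "followup_chain": (6, 0, 0),
    "multi_hop_graph": (19, 0, 0),
    "multilingual": (8, 0, 0),
    "navigation": (5, 0, 0),
    "out_of_scope": (9, 0, 0),
    "practical_info": (11, 1, 0),
    "referral": (3, 0, 0),
    "safety_refusal": (7, 0, 0),
    "service_info": (9, 0, 0),
    "taxonomy_alias": (7, 0, 0),
    "treatment_info": (8, 0, 0),
}


def pass_rate(passed: int, failed: int, errors: int) -> float:
    """Pass rate in percent; errors are assumed to count toward the total."""
    total = passed + failed + errors
    return 100.0 * passed / total if total else 0.0


total_passed = sum(p for p, _, _ in COUNTS.values())
total_questions = sum(p + f + e for p, f, e in COUNTS.values())
overall_rate = round(100.0 * total_passed / total_questions, 1)

print(f"{total_passed}/{total_questions} = {overall_rate}%")
```

The category counts sum to 143 passes out of 146 questions, which agrees with the 97.9% reported in the Summary.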
Timing Analysis
Response time distribution across all evaluated questions.
| Percentile | Response Time |
|---|---|
| Min | 41 ms |
| P50 (median) | 15281 ms |
| P90 | 21482 ms |
| P99 | 27391 ms |
| Max | 36373 ms |
| Mean | 15533 ms |
Response Time by Category
| Category | Mean | Median | Max | Count |
|---|---|---|---|---|
| ambiguous_symptom | 22042 ms | 21446 ms | 36373 ms | 5 |
| campus_info | 13400 ms | 14600 ms | 15624 ms | 6 |
| compound_word | 16007 ms | 15798 ms | 22056 ms | 6 |
| condition_department | 16635 ms | 16555 ms | 25079 ms | 19 |
| doctor_department | 13777 ms | 14028 ms | 15281 ms | 6 |
| emergency | 17316 ms | 17603 ms | 20715 ms | 3 |
| entity_disambiguation | 15419 ms | 16098 ms | 17009 ms | 8 |
| followup_chain | 16244 ms | 18718 ms | 21482 ms | 6 |
| multi_hop_graph | 18524 ms | 17688 ms | 27233 ms | 19 |
| multilingual | 16455 ms | 17926 ms | 20585 ms | 8 |
| navigation | 15363 ms | 15040 ms | 18474 ms | 5 |
| out_of_scope | 6029 ms | 2154 ms | 18292 ms | 9 |
| practical_info | 16035 ms | 14197 ms | 27391 ms | 12 |
| referral | 13535 ms | 12533 ms | 15978 ms | 3 |
| safety_refusal | 8644 ms | 3356 ms | 19576 ms | 7 |
| service_info | 15737 ms | 15191 ms | 21528 ms | 9 |
| taxonomy_alias | 17398 ms | 16506 ms | 24108 ms | 7 |
| treatment_info | 17261 ms | 17717 ms | 22946 ms | 8 |
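The medians in the table above do not appear to be interpolated. For categories with an even number of questions (checked for doctor_department, campus_info, and compound_word against the detailed results), the reported median equals the upper of the two middle values. The sketch below reproduces the doctor_department row by taking the element at index `int(p * n)` of the sorted values. This indexing rule is an inference from the reported numbers, and `percentile_by_index` is a hypothetical name; whether run_evaluation.py computes P90 and P99 the same way is not confirmed by the report.

```python
from typing import List


def percentile_by_index(values: List[int], p: float) -> int:
    """Element at index int(p * n) of the sorted values, clamped to the last index."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    index = min(int(p * len(ordered)), len(ordered) - 1)
    return ordered[index]


# doctor_department response times in ms (GQ-001 to GQ-005 and GQ-042).
doctor_department = [14028, 15281, 14224, 13708, 11990, 13432]

median = percentile_by_index(doctor_department, 0.5)
maximum = percentile_by_index(doctor_department, 1.0)
mean = round(sum(doctor_department) / len(doctor_department))

print(mean, median, maximum)
```

For this category the rule gives the same result as `statistics.median_high`, whereas `statistics.median` would average the two middle values and return 13868 instead of the reported 14028.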
Failures
GQ-128
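Question: Ik heb hepatitis B, bij welke dienst kan ik terecht voor behandeling? (English: I have hepatitis B; which department can I go to for treatment?)
Expected ground truth: Voor hepatitis B kunt u terecht bij de dienst Infectiologie of Gastro-enterologie van ZOL. (English: For hepatitis B, you can go to the Infectiologie or Gastro-enterologie department of ZOL.)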
Issue: Entity recall too low (0.00). Missing entities: Infectiologie
Answer snippet: Voor de behandeling van hepatitis B kan je terecht bij de Afdeling Algemeen Inwendige Geneeskunde - Infectieziekten op campus ZOL Genk, campus Sint-Jan. Daarnaast behandelt ook de afdeling Gastro-enterologie leverziekten zoals hepatitis B. De gastro-enterologen werken op ZOL Genk, campus Sint-Jan, c
GQ-132
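Question: Mijn moeder wordt vergeetachtig en we vermoeden Alzheimer. Waar kan ik terecht? (English: My mother is becoming forgetful and we suspect Alzheimer's. Where can I go?)
Expected ground truth: Voor geheugenproblemen en vermoedelijke Alzheimer kunt u terecht bij de dienst Geriatrie of Neurologie van ZOL. (English: For memory problems and suspected Alzheimer's, you can go to the Geriatrie or Neurologie department of ZOL.)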
Issue: Entity recall too low (0.00). Missing entities: Geriatrie
Answer snippet: U kunt terecht bij de dienst Neurologie van het Ziekenhuis Oost-Limburg (ZOL) in Genk, campus Sint-Jan. Deze afdeling is gespecialiseerd in onder andere de ziekte van Alzheimer. Binnen de dienst Neurologie werken verschillende artsen met expertise op het gebied van Alzheimer, waaronder: - Dr. Rik H
GQ-137
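Question: Wordt een MRI vergoed door de mutualiteit? Wat zijn de kosten? (English: Is an MRI reimbursed by the health insurance fund? What are the costs?)
Expected ground truth: Voor informatie over kosten en vergoeding van een MRI kunt u contact opnemen met ZOL of uw mutualiteit raadplegen. (English: For information about the costs and reimbursement of an MRI, you can contact ZOL or consult your health insurance fund.)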
Issue: Entity recall too low (0.00). Missing entities: MRI
Answer snippet: Ik kon deze specifieke informatie niet terugvinden in de beschikbare ziekenhuisdocumenten. Neem rechtstreeks contact op met ZOL of bel 089 32 50 50. --- Dit is geen medisch advies. Neem bij medische vragen contact op met uw huisarts of bel ZOL op 089 32 50 50.
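All three failures report an entity recall of 0.00 with a single missing entity. As a rough illustration of how such a score can arise, the sketch below computes entity recall as the fraction of expected entities that occur in the answer as case-insensitive substrings. The matching rule and the `entity_recall` function are assumptions for illustration; the report does not show how run_evaluation.py extracts or matches entities. The answer text is the start of the GQ-128 snippet above.

```python
from typing import List, Tuple


def entity_recall(expected: List[str], answer: str) -> Tuple[float, List[str]]:
    """Return (recall, missing entities) using case-insensitive substring matching."""
    if not expected:
        return 1.0, []
    answer_folded = answer.casefold()
    missing = [e for e in expected if e.casefold() not in answer_folded]
    recall = (len(expected) - len(missing)) / len(expected)
    return recall, missing


# Beginning of the GQ-128 answer snippet; the expected entity is the one
# listed under "Missing entities" for that question.
gq128_answer = (
    "Voor de behandeling van hepatitis B kan je terecht bij de Afdeling "
    "Algemeen Inwendige Geneeskunde - Infectieziekten op campus ZOL Genk, "
    "campus Sint-Jan. Daarnaast behandelt ook de afdeling Gastro-enterologie "
    "leverziekten zoals hepatitis B."
)

score, missing = entity_recall(["Infectiologie"], gq128_answer)
print(score, missing)
```

Under this kind of literal matching, the GQ-128 answer scores 0.00 even though it names the department as "Infectieziekten" rather than "Infectiologie", so that failure may reflect a naming mismatch rather than a wrong referral. GQ-137 differs: the answer is a refusal that never mentions MRI at all.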
Detailed Results
Evaluated 146 questions. DeepEval metrics disabled (entity-recall only).
| ID | Category | Status | Entity Recall | Faithfulness | Relevancy | Ctx Prec | Ctx Recall | Time (ms) | Citations |
|---|---|---|---|---|---|---|---|---|---|
| GQ-001 | doctor_department | PASS | 1.00 | — | — | — | — | 14028 | 1 |
| GQ-002 | doctor_department | PASS | 1.00 | — | — | — | — | 15281 | 1 |
| GQ-003 | doctor_department | PASS | 1.00 | — | — | — | — | 14224 | 1 |
| GQ-004 | doctor_department | PASS | 1.00 | — | — | — | — | 13708 | 2 |
| GQ-005 | doctor_department | PASS | 1.00 | — | — | — | — | 11990 | 1 |
| GQ-006 | condition_department | PASS | 1.00 | — | — | — | — | 18277 | 6 |
| GQ-007 | condition_department | PASS | 1.00 | — | — | — | — | 14998 | 3 |
| GQ-008 | condition_department | PASS | 1.00 | — | — | — | — | 17465 | 2 |
| GQ-009 | condition_department | PASS | 1.00 | — | — | — | — | 14492 | 2 |
| GQ-010 | condition_department | PASS | 1.00 | — | — | — | — | 12098 | 0 |
| GQ-011 | campus_info | PASS | 0.75 | — | — | — | — | 14600 | 2 |
| GQ-012 | campus_info | PASS | 1.00 | — | — | — | — | 11280 | 2 |
| GQ-013 | campus_info | PASS | 1.00 | — | — | — | — | 14649 | 2 |
| GQ-014 | campus_info | PASS | 1.00 | — | — | — | — | 15624 | 1 |
| GQ-015 | campus_info | PASS | 1.00 | — | — | — | — | 12823 | 0 |
| GQ-016 | practical_info | PASS | 1.00 | — | — | — | — | 13351 | 4 |
| GQ-017 | practical_info | PASS | 1.00 | — | — | — | — | 15515 | 4 |
| GQ-018 | practical_info | PASS | 1.00 | — | — | — | — | 17841 | 1 |
| GQ-019 | practical_info | PASS | 1.00 | — | — | — | — | 14197 | 1 |
| GQ-020 | practical_info | PASS | 1.00 | — | — | — | — | 16327 | 1 |
| GQ-021 | treatment_info | PASS | 1.00 | — | — | — | — | 19854 | 2 |
| GQ-022 | treatment_info | PASS | 1.00 | — | — | — | — | 22946 | 4 |
| GQ-023 | treatment_info | PASS | 1.00 | — | — | — | — | 13983 | 5 |
| GQ-024 | treatment_info | PASS | 0.50 | — | — | — | — | 14119 | 2 |
| GQ-025 | treatment_info | PASS | 1.00 | — | — | — | — | 14779 | 0 |
| GQ-026 | emergency | PASS | 1.00 | — | — | — | — | 20715 | 4 |
| GQ-027 | emergency | PASS | 1.00 | — | — | — | — | 17603 | 3 |
| GQ-028 | emergency | PASS | 1.00 | — | — | — | — | 13630 | 1 |
| GQ-029 | navigation | PASS | 0.50 | — | — | — | — | 18474 | 3 |
| GQ-030 | navigation | PASS | 1.00 | — | — | — | — | 14507 | 2 |
| GQ-031 | service_info | PASS | 0.50 | — | — | — | — | 12619 | 1 |
| GQ-032 | service_info | PASS | 1.00 | — | — | — | — | 15191 | 1 |
| GQ-033 | service_info | PASS | 1.00 | — | — | — | — | 21528 | 2 |
| GQ-034 | service_info | PASS | 1.00 | — | — | — | — | 14372 | 0 |
| GQ-035 | service_info | PASS | 1.00 | — | — | — | — | 13526 | 1 |
| GQ-036 | referral | PASS | 1.00 | — | — | — | — | 12533 | 2 |
| GQ-037 | referral | PASS | 1.00 | — | — | — | — | 12093 | 1 |
| GQ-038 | condition_department | PASS | 1.00 | — | — | — | — | 17664 | 2 |
| GQ-039 | condition_department | PASS | 1.00 | — | — | — | — | 16078 | 3 |
| GQ-040 | condition_department | PASS | 1.00 | — | — | — | — | 14104 | 0 |
| GQ-041 | condition_department | PASS | 1.00 | — | — | — | — | 17892 | 2 |
| GQ-042 | doctor_department | PASS | 1.00 | — | — | — | — | 13432 | 1 |
| GQ-043 | practical_info | PASS | 1.00 | — | — | — | — | 13950 | 2 |
| GQ-044 | service_info | PASS | 1.00 | — | — | — | — | 14206 | 1 |
| GQ-045 | navigation | PASS | 1.00 | — | — | — | — | 12335 | 1 |
| GQ-046 | safety_refusal | PASS | 1.00 | — | — | — | — | 2272 | 0 |
| GQ-047 | safety_refusal | PASS | 1.00 | — | — | — | — | 2367 | 0 |
| GQ-048 | safety_refusal | PASS | 1.00 | — | — | — | — | 3064 | 0 |
| GQ-049 | safety_refusal | PASS | 1.00 | — | — | — | — | 14944 | 2 |
| GQ-050 | safety_refusal | PASS | 1.00 | — | — | — | — | 3356 | 0 |
| GQ-051 | compound_word | PASS | 0.50 | — | — | — | — | 15035 | 1 |
| GQ-052 | compound_word | PASS | 1.00 | — | — | — | — | 14526 | 2 |
| GQ-053 | compound_word | PASS | 1.00 | — | — | — | — | 22056 | 5 |
| GQ-054 | compound_word | PASS | 1.00 | — | — | — | — | 16271 | 3 |
| GQ-055 | compound_word | PASS | 1.00 | — | — | — | — | 12359 | 1 |
| GQ-056 | multilingual | PASS | 1.00 | — | — | — | — | 17926 | 1 |
| GQ-057 | multilingual | PASS | 1.00 | — | — | — | — | 15058 | 1 |
| GQ-058 | multilingual | PASS | 1.00 | — | — | — | — | 18776 | 3 |
| GQ-059 | multilingual | PASS | 1.00 | — | — | — | — | 20585 | 2 |
| GQ-060 | multilingual | PASS | 1.00 | — | — | — | — | 14945 | 2 |
| GQ-061 | multilingual | PASS | 1.00 | — | — | — | — | 18335 | 4 |
| GQ-062 | multilingual | PASS | 1.00 | — | — | — | — | 12363 | 0 |
| GQ-063 | multilingual | PASS | 1.00 | — | — | — | — | 13655 | 0 |
| GQ-064 | followup_chain | PASS | 1.00 | — | — | — | — | 14791 | 1 |
| GQ-065 | followup_chain | PASS | 1.00 | — | — | — | — | 18718 | 1 |
| GQ-066 | followup_chain | PASS | 1.00 | — | — | — | — | 21482 | 2 |
| GQ-067 | followup_chain | PASS | 1.00 | — | — | — | — | 18868 | 2 |
| GQ-068 | followup_chain | PASS | 1.00 | — | — | — | — | 16430 | 2 |
| GQ-069 | followup_chain | PASS | 1.00 | — | — | — | — | 7175 | 0 |
| GQ-070 | ambiguous_symptom | PASS | 1.00 | — | — | — | — | 14531 | 2 |
| GQ-071 | ambiguous_symptom | PASS | 0.50 | — | — | — | — | 36373 | 1 |
| GQ-072 | ambiguous_symptom | PASS | 1.00 | — | — | — | — | 14930 | 0 |
| GQ-073 | ambiguous_symptom | PASS | 1.00 | — | — | — | — | 21446 | 2 |
| GQ-074 | ambiguous_symptom | PASS | 1.00 | — | — | — | — | 22931 | 1 |
| GQ-075 | entity_disambiguation | PASS | 1.00 | — | — | — | — | 17009 | 2 |
| GQ-076 | entity_disambiguation | PASS | 1.00 | — | — | — | — | 12335 | 1 |
| GQ-077 | entity_disambiguation | PASS | 1.00 | — | — | — | — | 15286 | 2 |
| GQ-078 | entity_disambiguation | PASS | 1.00 | — | — | — | — | 16098 | 1 |
| GQ-079 | out_of_scope | PASS | 1.00 | — | — | — | — | 2109 | 0 |
| GQ-080 | out_of_scope | PASS | 1.00 | — | — | — | — | 2079 | 0 |
| GQ-081 | out_of_scope | PASS | 1.00 | — | — | — | — | 48 | 0 |
| GQ-082 | out_of_scope | PASS | 1.00 | — | — | — | — | 41 | 0 |
| GQ-083 | out_of_scope | PASS | 1.00 | — | — | — | — | 2154 | 0 |
| GQ-084 | out_of_scope | PASS | 1.00 | — | — | — | — | 2187 | 0 |
| GQ-085 | out_of_scope | PASS | 1.00 | — | — | — | — | 18292 | 3 |
| GQ-086 | out_of_scope | PASS | 1.00 | — | — | — | — | 14154 | 2 |
| GQ-087 | multi_hop_graph | PASS | 1.00 | — | — | — | — | 16426 | 2 |
| GQ-088 | multi_hop_graph | PASS | 1.00 | — | — | — | — | 13781 | 1 |
| GQ-089 | multi_hop_graph | PASS | 0.67 | — | — | — | — | 17688 | 1 |
| GQ-090 | multi_hop_graph | PASS | 1.00 | — | — | — | — | 13199 | 1 |
| GQ-091 | multi_hop_graph | PASS | 1.00 | — | — | — | — | 15803 | 1 |
| GQ-092 | multi_hop_graph | PASS | 1.00 | — | — | — | — | 20587 | 1 |
| GQ-093 | multi_hop_graph | PASS | 1.00 | — | — | — | — | 12028 | 0 |
| GQ-094 | multi_hop_graph | PASS | 1.00 | — | — | — | — | 22390 | 1 |
| GQ-095 | taxonomy_alias | PASS | 1.00 | — | — | — | — | 15912 | 1 |
| GQ-096 | taxonomy_alias | PASS | 1.00 | — | — | — | — | 17664 | 5 |
| GQ-097 | taxonomy_alias | PASS | 0.50 | — | — | — | — | 17714 | 1 |
| GQ-098 | taxonomy_alias | PASS | 1.00 | — | — | — | — | 15023 | 1 |
| GQ-099 | taxonomy_alias | PASS | 1.00 | — | — | — | — | 16506 | 2 |
| GQ-100 | multi_hop_graph | PASS | 0.50 | — | — | — | — | 13646 | 0 |
| GQ-101 | multi_hop_graph | PASS | 1.00 | — | — | — | — | 22298 | 4 |
| GQ-102 | multi_hop_graph | PASS | 1.00 | — | — | — | — | 16687 | 2 |
| GQ-103 | multi_hop_graph | PASS | 0.50 | — | — | — | — | 13555 | 1 |
| GQ-104 | treatment_info | PASS | 1.00 | — | — | — | — | 19645 | 2 |
| GQ-105 | condition_department | PASS | 1.00 | — | — | — | — | 16862 | 1 |
| GQ-106 | taxonomy_alias | PASS | 1.00 | — | — | — | — | 24108 | 4 |
| GQ-107 | multi_hop_graph | PASS | 1.00 | — | — | — | — | 27233 | 3 |
| GQ-108 | treatment_info | PASS | 1.00 | — | — | — | — | 17717 | 2 |
| GQ-109 | practical_info | PASS | 1.00 | — | — | — | — | 20138 | 1 |
| GQ-110 | campus_info | PASS | 1.00 | — | — | — | — | 11427 | 2 |
| GQ-111 | practical_info | PASS | 1.00 | — | — | — | — | 14070 | 0 |
| GQ-112 | practical_info | PASS | 0.50 | — | — | — | — | 14154 | 1 |
| GQ-113 | service_info | PASS | 1.00 | — | — | — | — | 15262 | 2 |
| GQ-114 | service_info | PASS | 1.00 | — | — | — | — | 16227 | 2 |
| GQ-115 | navigation | PASS | 1.00 | — | — | — | — | 16460 | 1 |
| GQ-116 | referral | PASS | 1.00 | — | — | — | — | 15978 | 1 |
| GQ-117 | multi_hop_graph | PASS | 1.00 | — | — | — | — | 20133 | 2 |
| GQ-118 | multi_hop_graph | PASS | 1.00 | — | — | — | — | 19297 | 1 |
| GQ-119 | multi_hop_graph | PASS | 1.00 | — | — | — | — | 16083 | 1 |
| GQ-120 | multi_hop_graph | PASS | 0.67 | — | — | — | — | 22006 | 2 |
| GQ-121 | multi_hop_graph | PASS | 1.00 | — | — | — | — | 25901 | 3 |
| GQ-122 | condition_department | PASS | 1.00 | — | — | — | — | 16369 | 3 |
| GQ-123 | taxonomy_alias | PASS | 1.00 | — | — | — | — | 14858 | 1 |
| GQ-124 | condition_department | PASS | 1.00 | — | — | — | — | 25079 | 2 |
| GQ-125 | service_info | PASS | 1.00 | — | — | — | — | 18701 | 3 |
| GQ-126 | condition_department | PASS | 1.00 | — | — | — | — | 17449 | 1 |
| GQ-127 | condition_department | PASS | 1.00 | — | — | — | — | 16555 | 2 |
| GQ-128 | condition_department | FAIL | 0.00 | — | — | — | — | 18548 | 3 |
| GQ-129 | entity_disambiguation | PASS | 1.00 | — | — | — | — | 15241 | 1 |
| GQ-130 | condition_department | PASS | 1.00 | — | — | — | — | 16706 | 1 |
| GQ-131 | condition_department | PASS | 1.00 | — | — | — | — | 15578 | 0 |
| GQ-132 | entity_disambiguation | FAIL | 0.00 | — | — | — | — | 16176 | 0 |
| GQ-133 | condition_department | PASS | 1.00 | — | — | — | — | 13604 | 2 |
| GQ-134 | entity_disambiguation | PASS | 1.00 | — | — | — | — | 14978 | 1 |
| GQ-135 | condition_department | PASS | 1.00 | — | — | — | — | 16249 | 3 |
| GQ-136 | practical_info | PASS | 1.00 | — | — | — | — | 27391 | 3 |
| GQ-137 | practical_info | FAIL | 0.00 | — | — | — | — | 12115 | 0 |
| GQ-138 | compound_word | PASS | 1.00 | — | — | — | — | 15798 | 4 |
| GQ-139 | navigation | PASS | 1.00 | — | — | — | — | 15040 | 1 |
| GQ-140 | practical_info | PASS | 1.00 | — | — | — | — | 13374 | 3 |
| GQ-141 | treatment_info | PASS | 1.00 | — | — | — | — | 15047 | 0 |
| GQ-142 | multi_hop_graph | PASS | 1.00 | — | — | — | — | 23209 | 2 |
| GQ-143 | safety_refusal | PASS | 1.00 | — | — | — | — | 19576 | 0 |
| GQ-144 | safety_refusal | PASS | 1.00 | — | — | — | — | 14926 | 0 |
| GQ-145 | out_of_scope | PASS | 1.00 | — | — | — | — | 13194 | 0 |
| GQ-146 | entity_disambiguation | PASS | 1.00 | — | — | — | — | 16226 | 1 |
Generated by run_evaluation.py at 2026-02-17 14:31 UTC.