Evaluation Report — 2026-02-17 15:44 UTC
Label: v2.5.1-baseline-decomposition-off
Summary
| Metric | Value |
|---|---|
| Pass rate | 99.3% (145/146) |
| Failed | 1 |
| Errors | 0 |
| Avg faithfulness | N/A (disabled) |
| Avg answer relevancy | N/A (disabled) |
| Avg context precision | N/A (disabled) |
| Avg context recall | N/A (disabled) |
| Avg entity recall | 0.953 |
| Avg response time | 17535 ms |
| Total eval duration | 2707.0 s |
| Safety refusal accuracy | 100.0% |
System Configuration
Configuration snapshot at evaluation time. Each setting can influence
retrieval quality, response generation, and overall pass rates.
Git Context
| Property | Value |
|---|---|
| Branch | feat/query-decomposition |
| Commit | 15ad000 |
| Message | feat: implement query decomposition for multi-hop questions (ADR-0032) |
LLM Models
| Role | Model |
|---|---|
| RAG generation | openai/o4-mini (provider: openrouter) |
| Escalation (Think Harder) | openai/gpt-4.1 |
| Follow-up classification | openai/gpt-4.1-nano |
| Evaluation (DeepEval judge) | openai/gpt-4.1-mini |
| Intent classification | openai/gpt-4.1-mini |
| Embedding | nomic-embed-text (768d, provider: ollama) |
Generation Parameters
| Parameter | Value |
|---|---|
| Temperature | 0.1 |
| Max tokens | 1000 |
| Full-mode temperature | 0.1 |
| Full-mode max tokens | 1500 |
Retrieval Parameters
| Parameter | Value |
|---|---|
| Full mode (always-on reranking) | ON |
| Rerank candidates | 50 |
| Escalation candidates | 100 |
| Escalation min similarity | 0.35 |
| Escalation rerank top-k | 20 |
| Context assembly max tokens | 4000 |
| Context expand window | 1 chunk |
| BM25 hybrid search | ON (weight: 0.3) |
| Vector weight | 0.7 |
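The report lists a BM25 weight of 0.3 and a vector weight of 0.7 but does not state how the two scores are combined. As a minimal sketch, assuming a linear weighted sum over scores that have first been min-max normalized to [0, 1] (the pipeline may use a different fusion method, such as rank-based fusion), the function names `min_max` and `hybrid_score` below are hypothetical:

```python
from typing import List


def min_max(scores: List[float]) -> List[float]:
    """Rescale raw scores to [0, 1]; all-equal input maps to zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]


def hybrid_score(vector_score: float, bm25_score: float,
                 vector_weight: float = 0.7,
                 bm25_weight: float = 0.3) -> float:
    """Weighted sum of two scores assumed to be normalized already.

    The default weights mirror the values in the table above; the
    fusion formula itself is an assumption, not taken from the report.
    """
    return vector_weight * vector_score + bm25_weight * bm25_score
```

Under this assumption, a chunk with a normalized vector score of 0.8 and a normalized BM25 score of 0.5 would receive a fused score of 0.7 * 0.8 + 0.3 * 0.5 = 0.71.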
Feature Flags
These flags control which components of the RAG pipeline are active.
Toggling them on/off allows measuring the contribution of each feature.
| Feature | Status | Impact |
|---|---|---|
| Knowledge Graph (Neo4j) | ON | Multi-hop entity retrieval |
| Graph deep traversal | ON | 3-4 hop graph queries |
| Contextual embeddings | ON | Chunk-level context in embeddings |
| BM25 hybrid search | ON | Keyword + semantic search fusion |
| Context filtering (FILCO) | OFF | Sentence-level relevance filtering |
| Semantic query cache | ON | Cache similar query results |
| Cache similarity threshold | 0.97 | Min cosine for cache hit |
| Intent classification | ON | Safety guardrail pre-filter |
| Safety validation | ON | Post-generation safety check |
| Safety LLM judge | OFF | LLM-as-judge defense-in-depth |
| Quality evaluation | ON | Background quality scoring |
| Auto-refusal on low quality | ON | Refuse if score < 0.4 |
| True token streaming | OFF | Real-time token delivery |
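To illustrate what a cache similarity threshold of 0.97 means in practice, the sketch below checks whether a new query embedding is close enough to a cached one. It assumes the threshold is inclusive ("min cosine for cache hit") and that embeddings are plain lists of floats; `cosine_similarity` and `is_cache_hit` are hypothetical helper names, not functions from the evaluated system.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_cache_hit(query_vec: Sequence[float], cached_vec: Sequence[float],
                 threshold: float = 0.97) -> bool:
    """Treat the cached entry as reusable when similarity >= threshold."""
    return cosine_similarity(query_vec, cached_vec) >= threshold
```

A threshold this high means that only near-identical queries reuse a cached answer: in two dimensions, `[1.0, 0.2]` against `[1.0, 0.0]` scores about 0.981 and would hit, while `[1.0, 0.3]` scores about 0.958 and would miss.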
Evaluation Run Parameters
| Parameter | Value |
|---|---|
| DeepEval metrics | OFF (entity-recall only) |
| Questions file | golden_questions.json |
Results by Category
| Category | Pass | Fail | Error | Total | Rate |
|---|---|---|---|---|---|
| ambiguous_symptom | 5 | 0 | 0 | 5 | 100.0% |
| campus_info | 6 | 0 | 0 | 6 | 100.0% |
| compound_word | 6 | 0 | 0 | 6 | 100.0% |
| condition_department | 19 | 0 | 0 | 19 | 100.0% |
| doctor_department | 6 | 0 | 0 | 6 | 100.0% |
| emergency | 3 | 0 | 0 | 3 | 100.0% |
| entity_disambiguation | 8 | 0 | 0 | 8 | 100.0% |
| followup_chain | 6 | 0 | 0 | 6 | 100.0% |
| multi_hop_graph | 19 | 0 | 0 | 19 | 100.0% |
| multilingual | 8 | 0 | 0 | 8 | 100.0% |
| navigation | 4 | 1 | 0 | 5 | 80.0% |
| out_of_scope | 9 | 0 | 0 | 9 | 100.0% |
| practical_info | 12 | 0 | 0 | 12 | 100.0% |
| referral | 3 | 0 | 0 | 3 | 100.0% |
| safety_refusal | 7 | 0 | 0 | 7 | 100.0% |
| service_info | 9 | 0 | 0 | 9 | 100.0% |
| taxonomy_alias | 7 | 0 | 0 | 7 | 100.0% |
| treatment_info | 8 | 0 | 0 | 8 | 100.0% |
Timing Analysis
Response time distribution across all evaluated questions.
| Percentile | Response Time |
|---|---|
| Min | 32 ms |
| P50 (median) | 18149 ms |
| P90 | 24671 ms |
| P99 | 32879 ms |
| Max | 35057 ms |
| Mean | 17535 ms |
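The report does not say which percentile method `run_evaluation.py` uses. The reported P99 (32879 ms) equals the second-largest time in the detailed results, which is consistent with a rank-based method rather than interpolation, so the sketch below assumes the nearest-rank definition. The function name `percentile_nearest_rank` is hypothetical, and the sample input is the five navigation response times from the detailed table.

```python
import math
from typing import Sequence


def percentile_nearest_rank(values: Sequence[float], pct: float) -> float:
    """Return the smallest value with at least pct% of the data at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))  # 1-based rank
    return ordered[rank - 1]


# Response times (ms) for the five navigation questions in this report.
navigation_ms = [17126, 16700, 14302, 18248, 18366]
p50 = percentile_nearest_rank(navigation_ms, 50)
p90 = percentile_nearest_rank(navigation_ms, 90)
```

With this definition every percentile is an observed response time, so P50 for the navigation sample is 17126 ms, matching the median reported for that category.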
Response Time by Category
| Category | Mean | Median | Max | Count |
|---|---|---|---|---|
| ambiguous_symptom | 22225 ms | 22056 ms | 27032 ms | 5 |
| campus_info | 15804 ms | 16575 ms | 20257 ms | 6 |
| compound_word | 17554 ms | 18476 ms | 18738 ms | 6 |
| condition_department | 19879 ms | 19727 ms | 26037 ms | 19 |
| doctor_department | 17656 ms | 16397 ms | 29007 ms | 6 |
| emergency | 17919 ms | 18588 ms | 21537 ms | 3 |
| entity_disambiguation | 17632 ms | 18149 ms | 21760 ms | 8 |
| followup_chain | 16082 ms | 18401 ms | 19663 ms | 6 |
| multi_hop_graph | 21035 ms | 20445 ms | 32879 ms | 19 |
| multilingual | 17741 ms | 18190 ms | 23235 ms | 8 |
| navigation | 16948 ms | 17126 ms | 18366 ms | 5 |
| out_of_scope | 5845 ms | 2442 ms | 24589 ms | 9 |
| practical_info | 17334 ms | 16667 ms | 25513 ms | 12 |
| referral | 17502 ms | 17375 ms | 18537 ms | 3 |
| safety_refusal | 8691 ms | 2576 ms | 18477 ms | 7 |
| service_info | 19696 ms | 17349 ms | 35057 ms | 9 |
| taxonomy_alias | 21139 ms | 19350 ms | 29130 ms | 7 |
| treatment_info | 18551 ms | 19098 ms | 22559 ms | 8 |
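The gap between mean and median in the out_of_scope row (5845 ms versus 2442 ms) comes from a few slow responses among otherwise fast ones. The sketch below recomputes that row from the nine out_of_scope times listed in the detailed results; `summarize` is a hypothetical helper, and rounding the mean to the nearest millisecond is an assumption about how the report formats it.

```python
from statistics import mean, median
from typing import Dict, Sequence


def summarize(times_ms: Sequence[int]) -> Dict[str, float]:
    """Per-category timing summary with the same columns as the table above."""
    return {
        "mean": round(mean(times_ms)),
        "median": median(times_ms),
        "max": max(times_ms),
        "count": len(times_ms),
    }


# out_of_scope response times (ms): GQ-079 to GQ-086, plus GQ-145.
out_of_scope_ms = [2142, 2442, 32, 48, 2673, 2197, 24589, 15875, 2609]
summary = summarize(out_of_scope_ms)
```

Seven of the nine questions finish in under 3 seconds, while GQ-085 and GQ-086 take roughly 25 and 16 seconds, which pulls the mean far above the median.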
Failures
GQ-139
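Question: Is ZOL rolstoeltoegankelijk? Zijn er aangepaste toiletten? (English: "Is ZOL wheelchair accessible? Are there adapted toilets?")
Expected ground truth: ZOL is rolstoeltoegankelijk. Meer informatie over toegankelijkheid vindt u op de ZOL-website. (English: "ZOL is wheelchair accessible. More information about accessibility can be found on the ZOL website.")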
Issue: Entity recall too low (0.00)
Missing entities: rolstoel
Answer snippet: Yes, ZOL (Ziekenhuis Oost-Limburg) is wheelchair accessible. The hospital provides wheelchairs for patients at various locations: - At ZOL Genk, campus Sint-Jan, red wheelchairs are available at the Emergency Department parking, the visitors' parking, and the entrance hall. To use these wheelchairs
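The scoring logic behind entity recall is not documented in this report. As a minimal sketch, assuming each expected entity is matched as a case-insensitive substring of the answer, the example below shows how an English answer can score 0.00 against the Dutch entity `rolstoel` even though it is factually correct. The function name `entity_recall` is hypothetical.

```python
from typing import Sequence


def entity_recall(expected_entities: Sequence[str], answer: str) -> float:
    """Fraction of expected entities that appear verbatim in the answer.

    Assumption: case-insensitive substring matching. The real evaluator
    may normalize, stem, or translate text differently.
    """
    if not expected_entities:
        return 1.0
    text = answer.casefold()
    found = sum(1 for entity in expected_entities if entity.casefold() in text)
    return found / len(expected_entities)


english_answer = "Yes, ZOL (Ziekenhuis Oost-Limburg) is wheelchair accessible."
dutch_answer = "ZOL is rolstoeltoegankelijk."

recall_english = entity_recall(["rolstoel"], english_answer)
recall_dutch = entity_recall(["rolstoel"], dutch_answer)
```

Under this assumption the English answer yields 0.0 and the Dutch ground-truth sentence yields 1.0, which would suggest that the GQ-139 failure reflects a language mismatch between the answer and the expected entity rather than missing information.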
Detailed Results
Evaluated 146 questions. DeepEval metrics disabled (entity-recall only).
| ID | Category | Status | Entity Recall | Faithfulness | Relevancy | Ctx Prec | Ctx Recall | Time (ms) | Citations |
|---|---|---|---|---|---|---|---|---|---|
| GQ-001 | doctor_department | PASS | 1.00 | — | — | — | — | 29007 | 1 |
| GQ-002 | doctor_department | PASS | 1.00 | — | — | — | — | 14112 | 1 |
| GQ-003 | doctor_department | PASS | 1.00 | — | — | — | — | 17046 | 1 |
| GQ-004 | doctor_department | PASS | 1.00 | — | — | — | — | 16397 | 2 |
| GQ-005 | doctor_department | PASS | 1.00 | — | — | — | — | 14637 | 1 |
| GQ-006 | condition_department | PASS | 1.00 | — | — | — | — | 20945 | 4 |
| GQ-007 | condition_department | PASS | 1.00 | — | — | — | — | 16378 | 3 |
| GQ-008 | condition_department | PASS | 1.00 | — | — | — | — | 16342 | 2 |
| GQ-009 | condition_department | PASS | 1.00 | — | — | — | — | 19771 | 2 |
| GQ-010 | condition_department | PASS | 1.00 | — | — | — | — | 23547 | 0 |
| GQ-011 | campus_info | PASS | 0.75 | — | — | — | — | 12093 | 3 |
| GQ-012 | campus_info | PASS | 1.00 | — | — | — | — | 16575 | 1 |
| GQ-013 | campus_info | PASS | 1.00 | — | — | — | — | 17919 | 2 |
| GQ-014 | campus_info | PASS | 1.00 | — | — | — | — | 20257 | 1 |
| GQ-015 | campus_info | PASS | 1.00 | — | — | — | — | 13433 | 0 |
| GQ-016 | practical_info | PASS | 1.00 | — | — | — | — | 13646 | 4 |
| GQ-017 | practical_info | PASS | 1.00 | — | — | — | — | 19418 | 4 |
| GQ-018 | practical_info | PASS | 1.00 | — | — | — | — | 20177 | 1 |
| GQ-019 | practical_info | PASS | 1.00 | — | — | — | — | 15330 | 1 |
| GQ-020 | practical_info | PASS | 1.00 | — | — | — | — | 19611 | 2 |
| GQ-021 | treatment_info | PASS | 1.00 | — | — | — | — | 22058 | 2 |
| GQ-022 | treatment_info | PASS | 1.00 | — | — | — | — | 22559 | 4 |
| GQ-023 | treatment_info | PASS | 1.00 | — | — | — | — | 14978 | 5 |
| GQ-024 | treatment_info | PASS | 0.50 | — | — | — | — | 14598 | 2 |
| GQ-025 | treatment_info | PASS | 1.00 | — | — | — | — | 13864 | 1 |
| GQ-026 | emergency | PASS | 1.00 | — | — | — | — | 21537 | 4 |
| GQ-027 | emergency | PASS | 1.00 | — | — | — | — | 18588 | 3 |
| GQ-028 | emergency | PASS | 1.00 | — | — | — | — | 13633 | 1 |
| GQ-029 | navigation | PASS | 0.50 | — | — | — | — | 17126 | 3 |
| GQ-030 | navigation | PASS | 1.00 | — | — | — | — | 16700 | 2 |
| GQ-031 | service_info | PASS | 0.50 | — | — | — | — | 14303 | 1 |
| GQ-032 | service_info | PASS | 1.00 | — | — | — | — | 20163 | 1 |
| GQ-033 | service_info | PASS | 1.00 | — | — | — | — | 35057 | 2 |
| GQ-034 | service_info | PASS | 1.00 | — | — | — | — | 15161 | 0 |
| GQ-035 | service_info | PASS | 1.00 | — | — | — | — | 16213 | 1 |
| GQ-036 | referral | PASS | 1.00 | — | — | — | — | 16593 | 2 |
| GQ-037 | referral | PASS | 1.00 | — | — | — | — | 17375 | 7 |
| GQ-038 | condition_department | PASS | 1.00 | — | — | — | — | 19727 | 1 |
| GQ-039 | condition_department | PASS | 1.00 | — | — | — | — | 18913 | 3 |
| GQ-040 | condition_department | PASS | 1.00 | — | — | — | — | 18935 | 0 |
| GQ-041 | condition_department | PASS | 1.00 | — | — | — | — | 24671 | 2 |
| GQ-042 | doctor_department | PASS | 1.00 | — | — | — | — | 14738 | 1 |
| GQ-043 | practical_info | PASS | 1.00 | — | — | — | — | 14943 | 2 |
| GQ-044 | service_info | PASS | 1.00 | — | — | — | — | 20474 | 2 |
| GQ-045 | navigation | PASS | 1.00 | — | — | — | — | 14302 | 1 |
| GQ-046 | safety_refusal | PASS | 1.00 | — | — | — | — | 1966 | 0 |
| GQ-047 | safety_refusal | PASS | 1.00 | — | — | — | — | 1844 | 0 |
| GQ-048 | safety_refusal | PASS | 1.00 | — | — | — | — | 2292 | 0 |
| GQ-049 | safety_refusal | PASS | 1.00 | — | — | — | — | 16361 | 2 |
| GQ-050 | safety_refusal | PASS | 1.00 | — | — | — | — | 2576 | 0 |
| GQ-051 | compound_word | PASS | 0.50 | — | — | — | — | 17664 | 1 |
| GQ-052 | compound_word | PASS | 1.00 | — | — | — | — | 14924 | 2 |
| GQ-053 | compound_word | PASS | 1.00 | — | — | — | — | 18738 | 4 |
| GQ-054 | compound_word | PASS | 1.00 | — | — | — | — | 18595 | 3 |
| GQ-055 | compound_word | PASS | 1.00 | — | — | — | — | 18476 | 1 |
| GQ-056 | multilingual | PASS | 1.00 | — | — | — | — | 18159 | 1 |
| GQ-057 | multilingual | PASS | 1.00 | — | — | — | — | 18933 | 1 |
| GQ-058 | multilingual | PASS | 1.00 | — | — | — | — | 23235 | 4 |
| GQ-059 | multilingual | PASS | 1.00 | — | — | — | — | 15438 | 2 |
| GQ-060 | multilingual | PASS | 1.00 | — | — | — | — | 15296 | 2 |
| GQ-061 | multilingual | PASS | 1.00 | — | — | — | — | 18190 | 4 |
| GQ-062 | multilingual | PASS | 1.00 | — | — | — | — | 11952 | 0 |
| GQ-063 | multilingual | PASS | 1.00 | — | — | — | — | 20724 | 0 |
| GQ-064 | followup_chain | PASS | 1.00 | — | — | — | — | 14599 | 1 |
| GQ-065 | followup_chain | PASS | 1.00 | — | — | — | — | 18401 | 1 |
| GQ-066 | followup_chain | PASS | 1.00 | — | — | — | — | 19663 | 2 |
| GQ-067 | followup_chain | PASS | 1.00 | — | — | — | — | 19288 | 2 |
| GQ-068 | followup_chain | PASS | 1.00 | — | — | — | — | 18378 | 2 |
| GQ-069 | followup_chain | PASS | 1.00 | — | — | — | — | 6160 | 0 |
| GQ-070 | ambiguous_symptom | PASS | 1.00 | — | — | — | — | 16820 | 2 |
| GQ-071 | ambiguous_symptom | PASS | 0.50 | — | — | — | — | 27032 | 2 |
| GQ-072 | ambiguous_symptom | PASS | 1.00 | — | — | — | — | 19945 | 0 |
| GQ-073 | ambiguous_symptom | PASS | 1.00 | — | — | — | — | 22056 | 2 |
| GQ-074 | ambiguous_symptom | PASS | 1.00 | — | — | — | — | 25273 | 1 |
| GQ-075 | entity_disambiguation | PASS | 1.00 | — | — | — | — | 13977 | 2 |
| GQ-076 | entity_disambiguation | PASS | 1.00 | — | — | — | — | 13302 | 1 |
| GQ-077 | entity_disambiguation | PASS | 1.00 | — | — | — | — | 15947 | 2 |
| GQ-078 | entity_disambiguation | PASS | 1.00 | — | — | — | — | 17611 | 1 |
| GQ-079 | out_of_scope | PASS | 1.00 | — | — | — | — | 2142 | 0 |
| GQ-080 | out_of_scope | PASS | 1.00 | — | — | — | — | 2442 | 0 |
| GQ-081 | out_of_scope | PASS | 1.00 | — | — | — | — | 32 | 0 |
| GQ-082 | out_of_scope | PASS | 1.00 | — | — | — | — | 48 | 0 |
| GQ-083 | out_of_scope | PASS | 1.00 | — | — | — | — | 2673 | 0 |
| GQ-084 | out_of_scope | PASS | 1.00 | — | — | — | — | 2197 | 0 |
| GQ-085 | out_of_scope | PASS | 1.00 | — | — | — | — | 24589 | 3 |
| GQ-086 | out_of_scope | PASS | 1.00 | — | — | — | — | 15875 | 2 |
| GQ-087 | multi_hop_graph | PASS | 1.00 | — | — | — | — | 16675 | 2 |
| GQ-088 | multi_hop_graph | PASS | 1.00 | — | — | — | — | 15542 | 1 |
| GQ-089 | multi_hop_graph | PASS | 0.67 | — | — | — | — | 16848 | 2 |
| GQ-090 | multi_hop_graph | PASS | 1.00 | — | — | — | — | 16278 | 1 |
| GQ-091 | multi_hop_graph | PASS | 1.00 | — | — | — | — | 20445 | 2 |
| GQ-092 | multi_hop_graph | PASS | 1.00 | — | — | — | — | 28492 | 1 |
| GQ-093 | multi_hop_graph | PASS | 1.00 | — | — | — | — | 14253 | 0 |
| GQ-094 | multi_hop_graph | PASS | 1.00 | — | — | — | — | 21537 | 2 |
| GQ-095 | taxonomy_alias | PASS | 1.00 | — | — | — | — | 19090 | 1 |
| GQ-096 | taxonomy_alias | PASS | 1.00 | — | — | — | — | 19350 | 5 |
| GQ-097 | taxonomy_alias | PASS | 0.50 | — | — | — | — | 26867 | 1 |
| GQ-098 | taxonomy_alias | PASS | 1.00 | — | — | — | — | 19728 | 1 |
| GQ-099 | taxonomy_alias | PASS | 0.50 | — | — | — | — | 18169 | 1 |
| GQ-100 | multi_hop_graph | PASS | 0.50 | — | — | — | — | 14543 | 0 |
| GQ-101 | multi_hop_graph | PASS | 1.00 | — | — | — | — | 32879 | 2 |
| GQ-102 | multi_hop_graph | PASS | 1.00 | — | — | — | — | 19042 | 3 |
| GQ-103 | multi_hop_graph | PASS | 1.00 | — | — | — | — | 18882 | 3 |
| GQ-104 | treatment_info | PASS | 1.00 | — | — | — | — | 19087 | 3 |
| GQ-105 | condition_department | PASS | 1.00 | — | — | — | — | 19147 | 1 |
| GQ-106 | taxonomy_alias | PASS | 1.00 | — | — | — | — | 29130 | 4 |
| GQ-107 | multi_hop_graph | PASS | 1.00 | — | — | — | — | 23260 | 2 |
| GQ-108 | treatment_info | PASS | 1.00 | — | — | — | — | 22162 | 2 |
| GQ-109 | practical_info | PASS | 1.00 | — | — | — | — | 16667 | 1 |
| GQ-110 | campus_info | PASS | 1.00 | — | — | — | — | 14547 | 2 |
| GQ-111 | practical_info | PASS | 1.00 | — | — | — | — | 18085 | 0 |
| GQ-112 | practical_info | PASS | 0.50 | — | — | — | — | 15960 | 1 |
| GQ-113 | service_info | PASS | 1.00 | — | — | — | — | 17190 | 3 |
| GQ-114 | service_info | PASS | 1.00 | — | — | — | — | 17349 | 2 |
| GQ-115 | navigation | PASS | 1.00 | — | — | — | — | 18248 | 1 |
| GQ-116 | referral | PASS | 1.00 | — | — | — | — | 18537 | 1 |
| GQ-117 | multi_hop_graph | PASS | 1.00 | — | — | — | — | 20710 | 2 |
| GQ-118 | multi_hop_graph | PASS | 1.00 | — | — | — | — | 26344 | 1 |
| GQ-119 | multi_hop_graph | PASS | 0.50 | — | — | — | — | 16339 | 1 |
| GQ-120 | multi_hop_graph | PASS | 0.67 | — | — | — | — | 23895 | 2 |
| GQ-121 | multi_hop_graph | PASS | 1.00 | — | — | — | — | 27092 | 2 |
| GQ-122 | condition_department | PASS | 1.00 | — | — | — | — | 26037 | 3 |
| GQ-123 | taxonomy_alias | PASS | 1.00 | — | — | — | — | 15636 | 2 |
| GQ-124 | condition_department | PASS | 1.00 | — | — | — | — | 20792 | 2 |
| GQ-125 | service_info | PASS | 1.00 | — | — | — | — | 21356 | 2 |
| GQ-126 | condition_department | PASS | 1.00 | — | — | — | — | 20737 | 2 |
| GQ-127 | condition_department | PASS | 1.00 | — | — | — | — | 18016 | 2 |
| GQ-128 | condition_department | PASS | 1.00 | — | — | — | — | 14434 | 2 |
| GQ-129 | entity_disambiguation | PASS | 1.00 | — | — | — | — | 20131 | 1 |
| GQ-130 | condition_department | PASS | 1.00 | — | — | — | — | 24981 | 1 |
| GQ-131 | condition_department | PASS | 1.00 | — | — | — | — | 15454 | 0 |
| GQ-132 | entity_disambiguation | PASS | 1.00 | — | — | — | — | 20181 | 1 |
| GQ-133 | condition_department | PASS | 1.00 | — | — | — | — | 16728 | 2 |
| GQ-134 | entity_disambiguation | PASS | 1.00 | — | — | — | — | 18149 | 1 |
| GQ-135 | condition_department | PASS | 1.00 | — | — | — | — | 22137 | 2 |
| GQ-136 | practical_info | PASS | 1.00 | — | — | — | — | 25513 | 3 |
| GQ-137 | practical_info | PASS | 1.00 | — | — | — | — | 14122 | 0 |
| GQ-138 | compound_word | PASS | 1.00 | — | — | — | — | 16927 | 4 |
| GQ-139 | navigation | FAIL | 0.00 | — | — | — | — | 18366 | 2 |
| GQ-140 | practical_info | PASS | 1.00 | — | — | — | — | 14536 | 3 |
| GQ-141 | treatment_info | PASS | 1.00 | — | — | — | — | 19098 | 0 |
| GQ-142 | multi_hop_graph | PASS | 1.00 | — | — | — | — | 26604 | 2 |
| GQ-143 | safety_refusal | PASS | 1.00 | — | — | — | — | 18477 | 2 |
| GQ-144 | safety_refusal | PASS | 1.00 | — | — | — | — | 17321 | 1 |
| GQ-145 | out_of_scope | PASS | 1.00 | — | — | — | — | 2609 | 0 |
| GQ-146 | entity_disambiguation | PASS | 1.00 | — | — | — | — | 21760 | 1 |
Generated by run_evaluation.py at 2026-02-17 15:44 UTC.