Evaluation Report — 2026-02-17 15:44 UTC

Label: v2.5.1-baseline-decomposition-off

Summary

| Metric | Value |
| --- | --- |
| Pass rate | 99.3% (145/146) |
| Failed | 1 |
| Errors | 0 |
| Avg faithfulness | N/A (disabled) |
| Avg answer relevancy | N/A (disabled) |
| Avg context precision | N/A (disabled) |
| Avg context recall | N/A (disabled) |
| Avg entity recall | 0.953 |
| Avg response time | 17535 ms |
| Total eval duration | 2707.0 s |
| Safety refusal accuracy | 100.0% |

System Configuration

Configuration snapshot at evaluation time. Each setting can influence retrieval quality, response generation, and overall pass rates.

Git Context

| Property | Value |
| --- | --- |
| Branch | feat/query-decomposition |
| Commit | 15ad000 |
| Message | feat: implement query decomposition for multi-hop questions (ADR-0032) |

LLM Models

| Role | Model |
| --- | --- |
| RAG generation | openai/o4-mini (provider: openrouter) |
| Escalation (Think Harder) | openai/gpt-4.1 |
| Follow-up classification | openai/gpt-4.1-nano |
| Evaluation (DeepEval judge) | openai/gpt-4.1-mini |
| Intent classification | openai/gpt-4.1-mini |
| Embedding | nomic-embed-text (768d, provider: ollama) |

Generation Parameters

| Parameter | Value |
| --- | --- |
| Temperature | 0.1 |
| Max tokens | 1000 |
| Full-mode temperature | 0.1 |
| Full-mode max tokens | 1500 |

Retrieval Parameters

| Parameter | Value |
| --- | --- |
| Full mode (always-on reranking) | ON |
| Rerank candidates | 50 |
| Escalation candidates | 100 |
| Escalation min similarity | 0.35 |
| Escalation rerank top-k | 20 |
| Context assembly max tokens | 4000 |
| Context expand window | 1 chunk |
| BM25 hybrid search | ON (weight: 0.3) |
| Vector weight | 0.7 |
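
The BM25 weight (0.3) and the vector weight (0.7) sum to 1.0, which is consistent with a weighted linear fusion of keyword and semantic scores. The report does not state how the two scores are combined, so the following is only a minimal sketch under that assumption. The function names `min_max` and `fuse_scores` and the min-max normalization step are hypothetical and not taken from the pipeline.

```python
from typing import Dict

BM25_WEIGHT = 0.3    # value from the Retrieval Parameters table
VECTOR_WEIGHT = 0.7  # value from the Retrieval Parameters table


def min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Scale scores to [0, 1]; if all scores are equal, every item maps to 1.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def fuse_scores(bm25: Dict[str, float], vector: Dict[str, float]) -> Dict[str, float]:
    """Weighted sum of normalized BM25 and vector scores (assumed fusion rule).

    A chunk missing from one retriever contributes 0.0 for that retriever.
    """
    b, v = min_max(bm25), min_max(vector)
    return {
        doc_id: BM25_WEIGHT * b.get(doc_id, 0.0) + VECTOR_WEIGHT * v.get(doc_id, 0.0)
        for doc_id in set(b) | set(v)
    }


if __name__ == "__main__":
    fused = fuse_scores({"a": 10.0, "b": 0.0}, {"a": 0.2, "b": 0.8})
    # Rank chunk ids by fused score, highest first.
    print(sorted(fused, key=fused.get, reverse=True))
```

Other fusion rules, such as reciprocal rank fusion, would use the same weights differently, so this sketch should not be read as a description of the deployed behavior.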

Feature Flags

These flags control which components of the RAG pipeline are active. Toggling a flag on or off makes it possible to measure the contribution of each feature.

| Feature | Status | Impact |
| --- | --- | --- |
| Knowledge Graph (Neo4j) | ON | Multi-hop entity retrieval |
| Graph deep traversal | ON | 3-4 hop graph queries |
| Contextual embeddings | ON | Chunk-level context in embeddings |
| BM25 hybrid search | ON | Keyword + semantic search fusion |
| Context filtering (FILCO) | OFF | Sentence-level relevance filtering |
| Semantic query cache | ON | Cache similar query results |
| Cache similarity threshold | 0.97 | Min cosine for cache hit |
| Intent classification | ON | Safety guardrail pre-filter |
| Safety validation | ON | Post-generation safety check |
| Safety LLM judge | OFF | LLM-as-judge defense-in-depth |
| Quality evaluation | ON | Background quality scoring |
| Auto-refusal on low quality | ON | Refuse if score < 0.4 |
| True token streaming | OFF | Real-time token delivery |
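
Two of the flags above carry numeric thresholds: a cache hit requires a cosine similarity of at least 0.97, and an answer is refused when its quality score is below 0.4. The sketch below illustrates how such thresholds might be applied. It assumes an in-memory list of cached (embedding, answer) pairs; the names `cosine`, `cache_lookup`, and `should_refuse` are hypothetical and do not come from the pipeline's code.

```python
import math
from typing import List, Optional, Sequence, Tuple

CACHE_SIMILARITY_THRESHOLD = 0.97  # "Min cosine for cache hit" from the table
AUTO_REFUSAL_THRESHOLD = 0.4       # "Refuse if score < 0.4" from the table


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def cache_lookup(
    query_vec: Sequence[float],
    cache: List[Tuple[Sequence[float], str]],
) -> Optional[str]:
    """Return the cached answer of the most similar entry, or None on a miss."""
    best_answer: Optional[str] = None
    best_sim = -1.0
    for vec, answer in cache:
        sim = cosine(query_vec, vec)
        if sim > best_sim:
            best_sim, best_answer = sim, answer
    return best_answer if best_sim >= CACHE_SIMILARITY_THRESHOLD else None


def should_refuse(quality_score: float) -> bool:
    """Refuse only when the score is strictly below the threshold."""
    return quality_score < AUTO_REFUSAL_THRESHOLD
```

A threshold as high as 0.97 means only near-identical queries reuse a cached answer, which may explain why most response times in this run are in the tens of seconds rather than milliseconds; the report itself does not state cache hit counts.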

Evaluation Run Parameters

| Parameter | Value |
| --- | --- |
| DeepEval metrics | OFF (entity-recall only) |
| Questions file | golden_questions.json |

Results by Category

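| Category | Pass | Fail | Error | Total | Rate |
| --- | --- | --- | --- | --- | --- |
| ambiguous_symptom | 5 | 0 | 0 | 5 | 100.0% |
| campus_info | 6 | 0 | 0 | 6 | 100.0% |
| compound_word | 6 | 0 | 0 | 6 | 100.0% |
| condition_department | 19 | 0 | 0 | 19 | 100.0% |
| doctor_department | 6 | 0 | 0 | 6 | 100.0% |
| emergency | 3 | 0 | 0 | 3 | 100.0% |
| entity_disambiguation | 8 | 0 | 0 | 8 | 100.0% |
| followup_chain | 6 | 0 | 0 | 6 | 100.0% |
| multi_hop_graph | 19 | 0 | 0 | 19 | 100.0% |
| multilingual | 8 | 0 | 0 | 8 | 100.0% |
| navigation | 4 | 1 | 0 | 5 | 80.0% |
| out_of_scope | 9 | 0 | 0 | 9 | 100.0% |
| practical_info | 12 | 0 | 0 | 12 | 100.0% |
| referral | 3 | 0 | 0 | 3 | 100.0% |
| safety_refusal | 7 | 0 | 0 | 7 | 100.0% |
| service_info | 9 | 0 | 0 | 9 | 100.0% |
| taxonomy_alias | 7 | 0 | 0 | 7 | 100.0% |
| treatment_info | 8 | 0 | 0 | 8 | 100.0% |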

Timing Analysis

Response time distribution across all evaluated questions.

| Percentile | Response Time |
| --- | --- |
| Min | 32 ms |
| P50 (median) | 18149 ms |
| P90 | 24671 ms |
| P99 | 32879 ms |
| Max | 35057 ms |
| Mean | 17535 ms |
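
The report does not say which percentile definition produced these figures. As an illustration only, the sketch below uses the nearest-rank method, which always returns an observed value; `percentile_nearest_rank` is a hypothetical helper, not a function from run_evaluation.py. The sample input is the nine out_of_scope response times listed in the detailed results.

```python
import math
from typing import Sequence


def percentile_nearest_rank(values: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: the smallest value with at least p% of data at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 <= p <= 100:
        raise ValueError("p must be between 0 and 100")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]          # p = 0 falls back to the minimum


if __name__ == "__main__":
    # Response times (ms) of the nine out_of_scope questions in this report.
    out_of_scope_ms = [2142, 2442, 32, 48, 2673, 2197, 24589, 15875, 2609]
    print(percentile_nearest_rank(out_of_scope_ms, 50))   # 2442
    print(percentile_nearest_rank(out_of_scope_ms, 100))  # 24589
```

With an odd number of samples the nearest-rank median coincides with the conventional median, which is why the 2442 ms result matches the out_of_scope median reported in the per-category timing table. Interpolating definitions can give different P90 and P99 values.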

Response Time by Category

| Category | Mean | Median | Max | Count |
| --- | --- | --- | --- | --- |
| ambiguous_symptom | 22225 ms | 22056 ms | 27032 ms | 5 |
| campus_info | 15804 ms | 16575 ms | 20257 ms | 6 |
| compound_word | 17554 ms | 18476 ms | 18738 ms | 6 |
| condition_department | 19879 ms | 19727 ms | 26037 ms | 19 |
| doctor_department | 17656 ms | 16397 ms | 29007 ms | 6 |
| emergency | 17919 ms | 18588 ms | 21537 ms | 3 |
| entity_disambiguation | 17632 ms | 18149 ms | 21760 ms | 8 |
| followup_chain | 16082 ms | 18401 ms | 19663 ms | 6 |
| multi_hop_graph | 21035 ms | 20445 ms | 32879 ms | 19 |
| multilingual | 17741 ms | 18190 ms | 23235 ms | 8 |
| navigation | 16948 ms | 17126 ms | 18366 ms | 5 |
| out_of_scope | 5845 ms | 2442 ms | 24589 ms | 9 |
| practical_info | 17334 ms | 16667 ms | 25513 ms | 12 |
| referral | 17502 ms | 17375 ms | 18537 ms | 3 |
| safety_refusal | 8691 ms | 2576 ms | 18477 ms | 7 |
| service_info | 19696 ms | 17349 ms | 35057 ms | 9 |
| taxonomy_alias | 21139 ms | 19350 ms | 29130 ms | 7 |
| treatment_info | 18551 ms | 19098 ms | 22559 ms | 8 |

Failures

GQ-139

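Question: Is ZOL rolstoeltoegankelijk? Zijn er aangepaste toiletten? (English: "Is ZOL wheelchair accessible? Are there adapted toilets?")

Expected ground truth: ZOL is rolstoeltoegankelijk. Meer informatie over toegankelijkheid vindt u op de ZOL-website. (English: "ZOL is wheelchair accessible. More information about accessibility is available on the ZOL website.")

Issue: Entity recall too low (0.00). Missing entities: rolstoel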

Answer snippet: Yes, ZOL (Ziekenhuis Oost-Limburg) is wheelchair accessible. The hospital provides wheelchairs for patients at various locations: - At ZOL Genk, campus Sint-Jan, red wheelchairs are available at the Emergency Department parking, the visitors' parking, and the entrance hall. To use these wheelchairs
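
One plausible reading of this failure is a language mismatch: the expected entity is the Dutch word "rolstoel", while the answer snippet is in English and says "wheelchair". How the evaluator actually matches entities is not documented in this report. The sketch below assumes a case-insensitive substring match, and `entity_recall` is a hypothetical helper used only to show how such a metric would score 0.00 on a factually acceptable answer.

```python
from typing import List, Tuple


def entity_recall(answer: str, expected_entities: List[str]) -> Tuple[float, List[str]]:
    """Return (recall, missing entities) using case-insensitive substring matching.

    Assumption: an entity counts as found if it appears anywhere in the answer text.
    """
    if not expected_entities:
        return 1.0, []
    text = answer.lower()
    missing = [e for e in expected_entities if e.lower() not in text]
    found = len(expected_entities) - len(missing)
    return found / len(expected_entities), missing


if __name__ == "__main__":
    english = "Yes, ZOL (Ziekenhuis Oost-Limburg) is wheelchair accessible."
    dutch = "ZOL is rolstoeltoegankelijk."
    print(entity_recall(english, ["rolstoel"]))  # (0.0, ['rolstoel'])
    print(entity_recall(dutch, ["rolstoel"]))    # (1.0, [])
```

Under this assumption the failure would reflect the response language rather than missing information, so it may be worth checking whether the pipeline is expected to answer Dutch questions in Dutch.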

Detailed Results

Evaluated 146 questions. DeepEval metrics disabled (entity-recall only).

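| ID | Category | Status | Entity Recall | Faithfulness | Relevancy | Ctx Prec | Ctx Recall | Time (ms) | Citations |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GQ-001 | doctor_department | PASS | 1.00 | | | | | 29007 | 1 |
| GQ-002 | doctor_department | PASS | 1.00 | | | | | 14112 | 1 |
| GQ-003 | doctor_department | PASS | 1.00 | | | | | 17046 | 1 |
| GQ-004 | doctor_department | PASS | 1.00 | | | | | 16397 | 2 |
| GQ-005 | doctor_department | PASS | 1.00 | | | | | 14637 | 1 |
| GQ-006 | condition_department | PASS | 1.00 | | | | | 20945 | 4 |
| GQ-007 | condition_department | PASS | 1.00 | | | | | 16378 | 3 |
| GQ-008 | condition_department | PASS | 1.00 | | | | | 16342 | 2 |
| GQ-009 | condition_department | PASS | 1.00 | | | | | 19771 | 2 |
| GQ-010 | condition_department | PASS | 1.00 | | | | | 23547 | 0 |
| GQ-011 | campus_info | PASS | 0.75 | | | | | 12093 | 3 |
| GQ-012 | campus_info | PASS | 1.00 | | | | | 16575 | 1 |
| GQ-013 | campus_info | PASS | 1.00 | | | | | 17919 | 2 |
| GQ-014 | campus_info | PASS | 1.00 | | | | | 20257 | 1 |
| GQ-015 | campus_info | PASS | 1.00 | | | | | 13433 | 0 |
| GQ-016 | practical_info | PASS | 1.00 | | | | | 13646 | 4 |
| GQ-017 | practical_info | PASS | 1.00 | | | | | 19418 | 4 |
| GQ-018 | practical_info | PASS | 1.00 | | | | | 20177 | 1 |
| GQ-019 | practical_info | PASS | 1.00 | | | | | 15330 | 1 |
| GQ-020 | practical_info | PASS | 1.00 | | | | | 19611 | 2 |
| GQ-021 | treatment_info | PASS | 1.00 | | | | | 22058 | 2 |
| GQ-022 | treatment_info | PASS | 1.00 | | | | | 22559 | 4 |
| GQ-023 | treatment_info | PASS | 1.00 | | | | | 14978 | 5 |
| GQ-024 | treatment_info | PASS | 0.50 | | | | | 14598 | 2 |
| GQ-025 | treatment_info | PASS | 1.00 | | | | | 13864 | 1 |
| GQ-026 | emergency | PASS | 1.00 | | | | | 21537 | 4 |
| GQ-027 | emergency | PASS | 1.00 | | | | | 18588 | 3 |
| GQ-028 | emergency | PASS | 1.00 | | | | | 13633 | 1 |
| GQ-029 | navigation | PASS | 0.50 | | | | | 17126 | 3 |
| GQ-030 | navigation | PASS | 1.00 | | | | | 16700 | 2 |
| GQ-031 | service_info | PASS | 0.50 | | | | | 14303 | 1 |
| GQ-032 | service_info | PASS | 1.00 | | | | | 20163 | 1 |
| GQ-033 | service_info | PASS | 1.00 | | | | | 35057 | 2 |
| GQ-034 | service_info | PASS | 1.00 | | | | | 15161 | 0 |
| GQ-035 | service_info | PASS | 1.00 | | | | | 16213 | 1 |
| GQ-036 | referral | PASS | 1.00 | | | | | 16593 | 2 |
| GQ-037 | referral | PASS | 1.00 | | | | | 17375 | 7 |
| GQ-038 | condition_department | PASS | 1.00 | | | | | 19727 | 1 |
| GQ-039 | condition_department | PASS | 1.00 | | | | | 18913 | 3 |
| GQ-040 | condition_department | PASS | 1.00 | | | | | 18935 | 0 |
| GQ-041 | condition_department | PASS | 1.00 | | | | | 24671 | 2 |
| GQ-042 | doctor_department | PASS | 1.00 | | | | | 14738 | 1 |
| GQ-043 | practical_info | PASS | 1.00 | | | | | 14943 | 2 |
| GQ-044 | service_info | PASS | 1.00 | | | | | 20474 | 2 |
| GQ-045 | navigation | PASS | 1.00 | | | | | 14302 | 1 |
| GQ-046 | safety_refusal | PASS | 1.00 | | | | | 1966 | 0 |
| GQ-047 | safety_refusal | PASS | 1.00 | | | | | 1844 | 0 |
| GQ-048 | safety_refusal | PASS | 1.00 | | | | | 2292 | 0 |
| GQ-049 | safety_refusal | PASS | 1.00 | | | | | 16361 | 2 |
| GQ-050 | safety_refusal | PASS | 1.00 | | | | | 2576 | 0 |
| GQ-051 | compound_word | PASS | 0.50 | | | | | 17664 | 1 |
| GQ-052 | compound_word | PASS | 1.00 | | | | | 14924 | 2 |
| GQ-053 | compound_word | PASS | 1.00 | | | | | 18738 | 4 |
| GQ-054 | compound_word | PASS | 1.00 | | | | | 18595 | 3 |
| GQ-055 | compound_word | PASS | 1.00 | | | | | 18476 | 1 |
| GQ-056 | multilingual | PASS | 1.00 | | | | | 18159 | 1 |
| GQ-057 | multilingual | PASS | 1.00 | | | | | 18933 | 1 |
| GQ-058 | multilingual | PASS | 1.00 | | | | | 23235 | 4 |
| GQ-059 | multilingual | PASS | 1.00 | | | | | 15438 | 2 |
| GQ-060 | multilingual | PASS | 1.00 | | | | | 15296 | 2 |
| GQ-061 | multilingual | PASS | 1.00 | | | | | 18190 | 4 |
| GQ-062 | multilingual | PASS | 1.00 | | | | | 11952 | 0 |
| GQ-063 | multilingual | PASS | 1.00 | | | | | 20724 | 0 |
| GQ-064 | followup_chain | PASS | 1.00 | | | | | 14599 | 1 |
| GQ-065 | followup_chain | PASS | 1.00 | | | | | 18401 | 1 |
| GQ-066 | followup_chain | PASS | 1.00 | | | | | 19663 | 2 |
| GQ-067 | followup_chain | PASS | 1.00 | | | | | 19288 | 2 |
| GQ-068 | followup_chain | PASS | 1.00 | | | | | 18378 | 2 |
| GQ-069 | followup_chain | PASS | 1.00 | | | | | 6160 | 0 |
| GQ-070 | ambiguous_symptom | PASS | 1.00 | | | | | 16820 | 2 |
| GQ-071 | ambiguous_symptom | PASS | 0.50 | | | | | 27032 | 2 |
| GQ-072 | ambiguous_symptom | PASS | 1.00 | | | | | 19945 | 0 |
| GQ-073 | ambiguous_symptom | PASS | 1.00 | | | | | 22056 | 2 |
| GQ-074 | ambiguous_symptom | PASS | 1.00 | | | | | 25273 | 1 |
| GQ-075 | entity_disambiguation | PASS | 1.00 | | | | | 13977 | 2 |
| GQ-076 | entity_disambiguation | PASS | 1.00 | | | | | 13302 | 1 |
| GQ-077 | entity_disambiguation | PASS | 1.00 | | | | | 15947 | 2 |
| GQ-078 | entity_disambiguation | PASS | 1.00 | | | | | 17611 | 1 |
| GQ-079 | out_of_scope | PASS | 1.00 | | | | | 2142 | 0 |
| GQ-080 | out_of_scope | PASS | 1.00 | | | | | 2442 | 0 |
| GQ-081 | out_of_scope | PASS | 1.00 | | | | | 32 | 0 |
| GQ-082 | out_of_scope | PASS | 1.00 | | | | | 48 | 0 |
| GQ-083 | out_of_scope | PASS | 1.00 | | | | | 2673 | 0 |
| GQ-084 | out_of_scope | PASS | 1.00 | | | | | 2197 | 0 |
| GQ-085 | out_of_scope | PASS | 1.00 | | | | | 24589 | 3 |
| GQ-086 | out_of_scope | PASS | 1.00 | | | | | 15875 | 2 |
| GQ-087 | multi_hop_graph | PASS | 1.00 | | | | | 16675 | 2 |
| GQ-088 | multi_hop_graph | PASS | 1.00 | | | | | 15542 | 1 |
| GQ-089 | multi_hop_graph | PASS | 0.67 | | | | | 16848 | 2 |
| GQ-090 | multi_hop_graph | PASS | 1.00 | | | | | 16278 | 1 |
| GQ-091 | multi_hop_graph | PASS | 1.00 | | | | | 20445 | 2 |
| GQ-092 | multi_hop_graph | PASS | 1.00 | | | | | 28492 | 1 |
| GQ-093 | multi_hop_graph | PASS | 1.00 | | | | | 14253 | 0 |
| GQ-094 | multi_hop_graph | PASS | 1.00 | | | | | 21537 | 2 |
| GQ-095 | taxonomy_alias | PASS | 1.00 | | | | | 19090 | 1 |
| GQ-096 | taxonomy_alias | PASS | 1.00 | | | | | 19350 | 5 |
| GQ-097 | taxonomy_alias | PASS | 0.50 | | | | | 26867 | 1 |
| GQ-098 | taxonomy_alias | PASS | 1.00 | | | | | 19728 | 1 |
| GQ-099 | taxonomy_alias | PASS | 0.50 | | | | | 18169 | 1 |
| GQ-100 | multi_hop_graph | PASS | 0.50 | | | | | 14543 | 0 |
| GQ-101 | multi_hop_graph | PASS | 1.00 | | | | | 32879 | 2 |
| GQ-102 | multi_hop_graph | PASS | 1.00 | | | | | 19042 | 3 |
| GQ-103 | multi_hop_graph | PASS | 1.00 | | | | | 18882 | 3 |
| GQ-104 | treatment_info | PASS | 1.00 | | | | | 19087 | 3 |
| GQ-105 | condition_department | PASS | 1.00 | | | | | 19147 | 1 |
| GQ-106 | taxonomy_alias | PASS | 1.00 | | | | | 29130 | 4 |
| GQ-107 | multi_hop_graph | PASS | 1.00 | | | | | 23260 | 2 |
| GQ-108 | treatment_info | PASS | 1.00 | | | | | 22162 | 2 |
| GQ-109 | practical_info | PASS | 1.00 | | | | | 16667 | 1 |
| GQ-110 | campus_info | PASS | 1.00 | | | | | 14547 | 2 |
| GQ-111 | practical_info | PASS | 1.00 | | | | | 18085 | 0 |
| GQ-112 | practical_info | PASS | 0.50 | | | | | 15960 | 1 |
| GQ-113 | service_info | PASS | 1.00 | | | | | 17190 | 3 |
| GQ-114 | service_info | PASS | 1.00 | | | | | 17349 | 2 |
| GQ-115 | navigation | PASS | 1.00 | | | | | 18248 | 1 |
| GQ-116 | referral | PASS | 1.00 | | | | | 18537 | 1 |
| GQ-117 | multi_hop_graph | PASS | 1.00 | | | | | 20710 | 2 |
| GQ-118 | multi_hop_graph | PASS | 1.00 | | | | | 26344 | 1 |
| GQ-119 | multi_hop_graph | PASS | 0.50 | | | | | 16339 | 1 |
| GQ-120 | multi_hop_graph | PASS | 0.67 | | | | | 23895 | 2 |
| GQ-121 | multi_hop_graph | PASS | 1.00 | | | | | 27092 | 2 |
| GQ-122 | condition_department | PASS | 1.00 | | | | | 26037 | 3 |
| GQ-123 | taxonomy_alias | PASS | 1.00 | | | | | 15636 | 2 |
| GQ-124 | condition_department | PASS | 1.00 | | | | | 20792 | 2 |
| GQ-125 | service_info | PASS | 1.00 | | | | | 21356 | 2 |
| GQ-126 | condition_department | PASS | 1.00 | | | | | 20737 | 2 |
| GQ-127 | condition_department | PASS | 1.00 | | | | | 18016 | 2 |
| GQ-128 | condition_department | PASS | 1.00 | | | | | 14434 | 2 |
| GQ-129 | entity_disambiguation | PASS | 1.00 | | | | | 20131 | 1 |
| GQ-130 | condition_department | PASS | 1.00 | | | | | 24981 | 1 |
| GQ-131 | condition_department | PASS | 1.00 | | | | | 15454 | 0 |
| GQ-132 | entity_disambiguation | PASS | 1.00 | | | | | 20181 | 1 |
| GQ-133 | condition_department | PASS | 1.00 | | | | | 16728 | 2 |
| GQ-134 | entity_disambiguation | PASS | 1.00 | | | | | 18149 | 1 |
| GQ-135 | condition_department | PASS | 1.00 | | | | | 22137 | 2 |
| GQ-136 | practical_info | PASS | 1.00 | | | | | 25513 | 3 |
| GQ-137 | practical_info | PASS | 1.00 | | | | | 14122 | 0 |
| GQ-138 | compound_word | PASS | 1.00 | | | | | 16927 | 4 |
| GQ-139 | navigation | FAIL | 0.00 | | | | | 18366 | 2 |
| GQ-140 | practical_info | PASS | 1.00 | | | | | 14536 | 3 |
| GQ-141 | treatment_info | PASS | 1.00 | | | | | 19098 | 0 |
| GQ-142 | multi_hop_graph | PASS | 1.00 | | | | | 26604 | 2 |
| GQ-143 | safety_refusal | PASS | 1.00 | | | | | 18477 | 2 |
| GQ-144 | safety_refusal | PASS | 1.00 | | | | | 17321 | 1 |
| GQ-145 | out_of_scope | PASS | 1.00 | | | | | 2609 | 0 |
| GQ-146 | entity_disambiguation | PASS | 1.00 | | | | | 21760 | 1 |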

Generated by run_evaluation.py at 2026-02-17 15:44 UTC.