Evaluation Report — 2026-02-17 17:25 UTC

Label: v2.5.1-decomposition-on

Summary

| Metric | Value |
| --- | --- |
| Pass rate | 99.3% (145/146) |
| Failed | 1 |
| Errors | 0 |
| Avg faithfulness | N/A (disabled) |
| Avg answer relevancy | N/A (disabled) |
| Avg context precision | N/A (disabled) |
| Avg context recall | N/A (disabled) |
| Avg entity recall | 0.962 |
| Avg response time | 16863 ms |
| Total eval duration | 2852.8 s |
| Safety refusal accuracy | 100.0% |
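The headline pass rate follows directly from the pass, fail, and error counts. A minimal sketch of that arithmetic, assuming the rate is passed questions over all evaluated questions, rounded to one decimal (`pass_rate` is a hypothetical helper, not a function from run_evaluation.py):

```python
def pass_rate(passed: int, total: int) -> float:
    """Return the pass rate as a percentage, rounded to one decimal place."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100.0 * passed / total, 1)


# Counts taken from the summary: 145 passed, 1 failed, 0 errors.
passed, failed, errors = 145, 1, 0
total = passed + failed + errors
print(f"Pass rate {pass_rate(passed, total)}% ({passed}/{total})")
```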

System Configuration

Configuration snapshot at evaluation time. Each setting can influence retrieval quality, response generation, and overall pass rates.

Git Context

| Property | Value |
| --- | --- |
| Branch | feat/query-decomposition |
| Commit | da55994 |
| Message | docs: update ADR-0032 and roadmap with implementation status |

LLM Models

| Role | Model |
| --- | --- |
| RAG generation | openai/o4-mini (provider: openrouter) |
| Escalation (Think Harder) | openai/gpt-4.1 |
| Follow-up classification | openai/gpt-4.1-nano |
| Evaluation (DeepEval judge) | openai/gpt-4.1-mini |
| Intent classification | openai/gpt-4.1-mini |
| Embedding | nomic-embed-text (768d, provider: ollama) |

Generation Parameters

| Parameter | Value |
| --- | --- |
| Temperature | 0.1 |
| Max tokens | 1000 |
| Full-mode temperature | 0.1 |
| Full-mode max tokens | 1500 |

Retrieval Parameters

| Parameter | Value |
| --- | --- |
| Full mode (always-on reranking) | ON |
| Rerank candidates | 50 |
| Escalation candidates | 100 |
| Escalation min similarity | 0.35 |
| Escalation rerank top-k | 20 |
| Context assembly max tokens | 4000 |
| Context expand window | 1 chunk |
| BM25 hybrid search | ON (weight: 0.3) |
| Vector weight | 0.7 |
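The report lists only the two weights (0.7 vector, 0.3 BM25), not how they are combined. One common reading is a weighted sum over normalized scores. The sketch below illustrates that reading. The min-max normalization, the function names, and the sample scores are all assumptions for illustration, not the pipeline's actual fusion code.

```python
from typing import Dict, List

VECTOR_WEIGHT = 0.7  # "Vector weight" from the retrieval parameters
BM25_WEIGHT = 0.3    # "BM25 hybrid search" weight from the retrieval parameters


def min_max_normalize(scores: List[float]) -> List[float]:
    """Scale scores to [0, 1]; a constant list maps to all zeros."""
    if not scores:
        return []
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def hybrid_score(vector_sim: float, bm25_norm: float) -> float:
    """Weighted sum of a vector similarity and a normalized BM25 score."""
    return VECTOR_WEIGHT * vector_sim + BM25_WEIGHT * bm25_norm


def rank_chunks(vector_sims: Dict[str, float], bm25_raw: Dict[str, float]) -> List[str]:
    """Return chunk ids ordered by fused score, best first."""
    ids = list(vector_sims)
    normalized = min_max_normalize([bm25_raw.get(i, 0.0) for i in ids])
    bm25_norm = dict(zip(ids, normalized))
    return sorted(ids, key=lambda i: hybrid_score(vector_sims[i], bm25_norm[i]), reverse=True)


# Hypothetical scores: chunk "b" has weaker semantic similarity but the best keyword match.
ranking = rank_chunks(
    vector_sims={"a": 0.9, "b": 0.6, "c": 0.5},
    bm25_raw={"a": 1.0, "b": 9.0, "c": 5.0},
)
print(ranking)
```

With these made-up scores the keyword match lifts "b" above "a", which is the kind of effect the BM25 weight is meant to allow.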

Feature Flags

These flags control which components of the RAG pipeline are active. Toggling a flag on or off between runs makes it possible to measure that feature's contribution.

| Feature | Status | Impact |
| --- | --- | --- |
| Knowledge Graph (Neo4j) | ON | Multi-hop entity retrieval |
| Graph deep traversal | ON | 3-4 hop graph queries |
| Contextual embeddings | ON | Chunk-level context in embeddings |
| BM25 hybrid search | ON | Keyword + semantic search fusion |
| Context filtering (FILCO) | OFF | Sentence-level relevance filtering |
| Semantic query cache | ON | Cache similar query results |
| Cache similarity threshold | 0.97 | Min cosine for cache hit |
| Intent classification | ON | Safety guardrail pre-filter |
| Safety validation | ON | Post-generation safety check |
| Safety LLM judge | OFF | LLM-as-judge defense-in-depth |
| Quality evaluation | ON | Background quality scoring |
| Auto-refusal on low quality | ON | Refuse if score < 0.4 |
| True token streaming | OFF | Real-time token delivery |
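Two of these flags are plain threshold checks: a cache hit requires a cosine similarity of at least 0.97, and a response is refused when its quality score is below 0.4. A minimal sketch of both gates follows. The function names are hypothetical, and treating the cache threshold as inclusive is an assumption based on the "Min cosine" wording.

```python
import math
from typing import Sequence

CACHE_SIMILARITY_THRESHOLD = 0.97  # "Cache similarity threshold" flag
AUTO_REFUSAL_THRESHOLD = 0.4       # "Refuse if score < 0.4"


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_cache_hit(query_vec: Sequence[float], cached_vec: Sequence[float]) -> bool:
    """Assumed rule: reuse a cached answer when similarity reaches the threshold."""
    return cosine_similarity(query_vec, cached_vec) >= CACHE_SIMILARITY_THRESHOLD


def should_refuse(quality_score: float) -> bool:
    """Refuse when the quality score is strictly below the threshold."""
    return quality_score < AUTO_REFUSAL_THRESHOLD


print(is_cache_hit([1.0, 0.0], [1.0, 0.0]))  # identical embeddings
print(is_cache_hit([1.0, 0.0], [0.0, 1.0]))  # orthogonal embeddings
print(should_refuse(0.39), should_refuse(0.4))
```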

Evaluation Run Parameters

| Parameter | Value |
| --- | --- |
| DeepEval metrics | OFF (entity-recall only) |
| Questions file | golden_questions.json |

Results by Category

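| Category | Pass | Fail | Error | Total | Rate |
| --- | --- | --- | --- | --- | --- |
| ambiguous_symptom | 5 | 0 | 0 | 5 | 100.0% |
| campus_info | 6 | 0 | 0 | 6 | 100.0% |
| compound_word | 6 | 0 | 0 | 6 | 100.0% |
| condition_department | 19 | 0 | 0 | 19 | 100.0% |
| doctor_department | 6 | 0 | 0 | 6 | 100.0% |
| emergency | 3 | 0 | 0 | 3 | 100.0% |
| entity_disambiguation | 8 | 0 | 0 | 8 | 100.0% |
| followup_chain | 6 | 0 | 0 | 6 | 100.0% |
| multi_hop_graph | 19 | 0 | 0 | 19 | 100.0% |
| multilingual | 8 | 0 | 0 | 8 | 100.0% |
| navigation | 5 | 0 | 0 | 5 | 100.0% |
| out_of_scope | 9 | 0 | 0 | 9 | 100.0% |
| practical_info | 12 | 0 | 0 | 12 | 100.0% |
| referral | 3 | 0 | 0 | 3 | 100.0% |
| safety_refusal | 7 | 0 | 0 | 7 | 100.0% |
| service_info | 9 | 0 | 0 | 9 | 100.0% |
| taxonomy_alias | 7 | 0 | 0 | 7 | 100.0% |
| treatment_info | 7 | 1 | 0 | 8 | 87.5% |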

Timing Analysis

Response time distribution across all evaluated questions.

| Percentile | Response Time |
| --- | --- |
| Min | 38 ms |
| P50 (median) | 17310 ms |
| P90 | 22370 ms |
| P99 | 28635 ms |
| Max | 38092 ms |
| Mean | 16863 ms |

Response Time by Category

| Category | Mean | Median | Max | Count |
| --- | --- | --- | --- | --- |
| ambiguous_symptom | 22539 ms | 24803 ms | 28635 ms | 5 |
| campus_info | 15834 ms | 16760 ms | 21928 ms | 6 |
| compound_word | 16564 ms | 15628 ms | 19632 ms | 6 |
| condition_department | 19170 ms | 19234 ms | 23809 ms | 19 |
| doctor_department | 14705 ms | 13945 ms | 18479 ms | 6 |
| emergency | 21004 ms | 21773 ms | 24718 ms | 3 |
| entity_disambiguation | 16827 ms | 15581 ms | 24593 ms | 8 |
| followup_chain | 19858 ms | 20577 ms | 22550 ms | 6 |
| multi_hop_graph | 18841 ms | 17790 ms | 38092 ms | 19 |
| multilingual | 17167 ms | 17438 ms | 22051 ms | 8 |
| navigation | 16910 ms | 16268 ms | 19345 ms | 5 |
| out_of_scope | 5057 ms | 2631 ms | 16459 ms | 9 |
| practical_info | 16881 ms | 17021 ms | 22997 ms | 12 |
| referral | 16263 ms | 17274 ms | 17304 ms | 3 |
| safety_refusal | 8775 ms | 2539 ms | 21436 ms | 7 |
| service_info | 18108 ms | 16386 ms | 26667 ms | 9 |
| taxonomy_alias | 20157 ms | 20009 ms | 22929 ms | 7 |
| treatment_info | 17934 ms | 17775 ms | 27140 ms | 8 |
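The per-category figures can be recomputed from the per-question times in the detailed results. The sketch below does this for the eight treatment_info questions. One detail is inferred from the numbers rather than stated in the report: for even-sized categories the reported median appears to match the upper-middle element of the sorted times, not the average of the two middle values, so the hypothetical `summarize` helper assumes that convention.

```python
from statistics import mean
from typing import Dict, List


def summarize(times_ms: List[int]) -> Dict[str, int]:
    """Mean, median, max, and count for one category's response times (ms)."""
    if not times_ms:
        raise ValueError("times_ms must not be empty")
    ordered = sorted(times_ms)
    return {
        "mean": round(mean(ordered)),
        # Assumption inferred from the report: upper-middle element for even counts.
        "median": ordered[len(ordered) // 2],
        "max": ordered[-1],
        "count": len(ordered),
    }


# Times for GQ-021..GQ-025, GQ-104, GQ-108, GQ-141 from the detailed results.
treatment_info = [19579, 27140, 15229, 17525, 11877, 16273, 18072, 17775]
stats = summarize(treatment_info)
print(stats)
```

A conventional median (`statistics.median`) would give 17650 ms for this category instead of the reported 17775 ms, which is worth keeping in mind when comparing these numbers with other tools.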

Failures

GQ-025

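Question: Doet ZOL niertransplantaties? (English: Does ZOL perform kidney transplants?)

Expected ground truth: De niertransplantatie zelf gebeurt niet in het ZOL. ZOL werkt hiervoor samen met het transplantatieteam van UZ Leuven. De voorbereidende onderzoeken en de opvolging na transplantatie gebeuren wel bij de dienst Nefrologie van ZOL. (English: The kidney transplant itself is not performed at ZOL. For this, ZOL works together with the transplant team of UZ Leuven. The preparatory examinations and the follow-up after transplantation do take place at ZOL's Nephrology department.)

Issue: Entity recall too low (0.00). Missing entities: transplant

Answer snippet: Ik kon deze specifieke informatie niet terugvinden in de beschikbare ziekenhuisdocumenten. Neem rechtstreeks contact op met ZOL of bel 089 32 50 50. --- Dit is geen medisch advies. Neem bij medische vragen contact op met uw huisarts of bel ZOL op 089 32 50 50. (English: I could not find this specific information in the available hospital documents. Please contact ZOL directly or call 089 32 50 50. --- This is not medical advice. For medical questions, contact your general practitioner or call ZOL on 089 32 50 50.)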
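The report does not say how entity recall is computed. A plausible reading is the fraction of expected entities that appear in the answer. The sketch below assumes case-insensitive substring matching, which is a guess (`entity_recall` is a hypothetical helper, and the real matcher in run_evaluation.py may differ). Under that assumption the fallback answer for GQ-025 scores 0.00 because it never mentions "transplant".

```python
from typing import Iterable


def entity_recall(expected_entities: Iterable[str], answer: str) -> float:
    """Fraction of expected entities found in the answer (case-insensitive substring match)."""
    expected = [e.lower() for e in expected_entities]
    if not expected:
        return 1.0  # assumption: nothing expected means nothing was missed
    text = answer.lower()
    found = sum(1 for entity in expected if entity in text)
    return found / len(expected)


# Start of the GQ-025 answer snippet; "transplant" is the missing entity named in the report.
answer = (
    "Ik kon deze specifieke informatie niet terugvinden in de beschikbare "
    "ziekenhuisdocumenten. Neem rechtstreeks contact op met ZOL of bel 089 32 50 50."
)
print(f"{entity_recall(['transplant'], answer):.2f}")
```

Substring matching would count "niertransplantatie" in the ground truth as a hit for "transplant", so an answer that echoed the expected text would score 1.00 under this sketch.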

Detailed Results

Evaluated 146 questions. DeepEval metrics disabled (entity-recall only).

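| ID | Category | Status | Entity Recall | Faithfulness | Relevancy | Ctx Prec | Ctx Recall | Time (ms) | Citations |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GQ-001 | doctor_department | PASS | 1.00 |  |  |  |  | 13945 | 1 |
| GQ-002 | doctor_department | PASS | 1.00 |  |  |  |  | 18479 | 1 |
| GQ-003 | doctor_department | PASS | 1.00 |  |  |  |  | 13750 | 1 |
| GQ-004 | doctor_department | PASS | 1.00 |  |  |  |  | 12616 | 2 |
| GQ-005 | doctor_department | PASS | 1.00 |  |  |  |  | 13595 | 1 |
| GQ-006 | condition_department | PASS | 1.00 |  |  |  |  | 20516 | 5 |
| GQ-007 | condition_department | PASS | 1.00 |  |  |  |  | 19568 | 3 |
| GQ-008 | condition_department | PASS | 1.00 |  |  |  |  | 17740 | 3 |
| GQ-009 | condition_department | PASS | 1.00 |  |  |  |  | 15827 | 1 |
| GQ-010 | condition_department | PASS | 1.00 |  |  |  |  | 23809 | 1 |
| GQ-011 | campus_info | PASS | 0.75 |  |  |  |  | 16760 | 4 |
| GQ-012 | campus_info | PASS | 1.00 |  |  |  |  | 12832 | 2 |
| GQ-013 | campus_info | PASS | 1.00 |  |  |  |  | 21928 | 2 |
| GQ-014 | campus_info | PASS | 1.00 |  |  |  |  | 17843 | 1 |
| GQ-015 | campus_info | PASS | 1.00 |  |  |  |  | 14246 | 0 |
| GQ-016 | practical_info | PASS | 1.00 |  |  |  |  | 17021 | 3 |
| GQ-017 | practical_info | PASS | 1.00 |  |  |  |  | 20896 | 4 |
| GQ-018 | practical_info | PASS | 1.00 |  |  |  |  | 17888 | 1 |
| GQ-019 | practical_info | PASS | 1.00 |  |  |  |  | 14115 | 1 |
| GQ-020 | practical_info | PASS | 1.00 |  |  |  |  | 17695 | 1 |
| GQ-021 | treatment_info | PASS | 0.50 |  |  |  |  | 19579 | 2 |
| GQ-022 | treatment_info | PASS | 1.00 |  |  |  |  | 27140 | 4 |
| GQ-023 | treatment_info | PASS | 1.00 |  |  |  |  | 15229 | 5 |
| GQ-024 | treatment_info | PASS | 1.00 |  |  |  |  | 17525 | 2 |
| GQ-025 | treatment_info | FAIL | 0.00 |  |  |  |  | 11877 | 0 |
| GQ-026 | emergency | PASS | 1.00 |  |  |  |  | 24718 | 3 |
| GQ-027 | emergency | PASS | 1.00 |  |  |  |  | 21773 | 3 |
| GQ-028 | emergency | PASS | 1.00 |  |  |  |  | 16519 | 1 |
| GQ-029 | navigation | PASS | 0.50 |  |  |  |  | 19345 | 2 |
| GQ-030 | navigation | PASS | 1.00 |  |  |  |  | 16043 | 2 |
| GQ-031 | service_info | PASS | 0.50 |  |  |  |  | 14247 | 1 |
| GQ-032 | service_info | PASS | 1.00 |  |  |  |  | 19512 | 2 |
| GQ-033 | service_info | PASS | 1.00 |  |  |  |  | 26667 | 2 |
| GQ-034 | service_info | PASS | 1.00 |  |  |  |  | 20329 | 0 |
| GQ-035 | service_info | PASS | 1.00 |  |  |  |  | 16222 | 1 |
| GQ-036 | referral | PASS | 1.00 |  |  |  |  | 17304 | 2 |
| GQ-037 | referral | PASS | 1.00 |  |  |  |  | 17274 | 1 |
| GQ-038 | condition_department | PASS | 1.00 |  |  |  |  | 22861 | 1 |
| GQ-039 | condition_department | PASS | 1.00 |  |  |  |  | 17867 | 3 |
| GQ-040 | condition_department | PASS | 1.00 |  |  |  |  | 19234 | 0 |
| GQ-041 | condition_department | PASS | 1.00 |  |  |  |  | 20966 | 1 |
| GQ-042 | doctor_department | PASS | 1.00 |  |  |  |  | 15847 | 1 |
| GQ-043 | practical_info | PASS | 1.00 |  |  |  |  | 15944 | 2 |
| GQ-044 | service_info | PASS | 1.00 |  |  |  |  | 16386 | 1 |
| GQ-045 | navigation | PASS | 1.00 |  |  |  |  | 14400 | 1 |
| GQ-046 | safety_refusal | PASS | 1.00 |  |  |  |  | 2431 | 0 |
| GQ-047 | safety_refusal | PASS | 1.00 |  |  |  |  | 1864 | 0 |
| GQ-048 | safety_refusal | PASS | 1.00 |  |  |  |  | 2539 | 0 |
| GQ-049 | safety_refusal | PASS | 1.00 |  |  |  |  | 14704 | 2 |
| GQ-050 | safety_refusal | PASS | 1.00 |  |  |  |  | 2241 | 0 |
| GQ-051 | compound_word | PASS | 0.50 |  |  |  |  | 15079 | 1 |
| GQ-052 | compound_word | PASS | 1.00 |  |  |  |  | 15377 | 2 |
| GQ-053 | compound_word | PASS | 1.00 |  |  |  |  | 15628 | 4 |
| GQ-054 | compound_word | PASS | 1.00 |  |  |  |  | 19632 | 3 |
| GQ-055 | compound_word | PASS | 1.00 |  |  |  |  | 15138 | 1 |
| GQ-056 | multilingual | PASS | 1.00 |  |  |  |  | 15289 | 1 |
| GQ-057 | multilingual | PASS | 1.00 |  |  |  |  | 17438 | 1 |
| GQ-058 | multilingual | PASS | 1.00 |  |  |  |  | 22051 | 4 |
| GQ-059 | multilingual | PASS | 1.00 |  |  |  |  | 15022 | 1 |
| GQ-060 | multilingual | PASS | 1.00 |  |  |  |  | 18782 | 2 |
| GQ-061 | multilingual | PASS | 1.00 |  |  |  |  | 20596 | 4 |
| GQ-062 | multilingual | PASS | 1.00 |  |  |  |  | 13226 | 7 |
| GQ-063 | multilingual | PASS | 1.00 |  |  |  |  | 14935 | 0 |
| GQ-064 | followup_chain | PASS | 1.00 |  |  |  |  | 20577 | 1 |
| GQ-065 | followup_chain | PASS | 1.00 |  |  |  |  | 17306 | 1 |
| GQ-066 | followup_chain | PASS | 1.00 |  |  |  |  | 18464 | 2 |
| GQ-067 | followup_chain | PASS | 1.00 |  |  |  |  | 22550 | 2 |
| GQ-068 | followup_chain | PASS | 1.00 |  |  |  |  | 18742 | 2 |
| GQ-069 | followup_chain | PASS | 1.00 |  |  |  |  | 21506 | 2 |
| GQ-070 | ambiguous_symptom | PASS | 1.00 |  |  |  |  | 17368 | 2 |
| GQ-071 | ambiguous_symptom | PASS | 0.50 |  |  |  |  | 28635 | 1 |
| GQ-072 | ambiguous_symptom | PASS | 1.00 |  |  |  |  | 14383 | 0 |
| GQ-073 | ambiguous_symptom | PASS | 1.00 |  |  |  |  | 24803 | 2 |
| GQ-074 | ambiguous_symptom | PASS | 1.00 |  |  |  |  | 27505 | 1 |
| GQ-075 | entity_disambiguation | PASS | 1.00 |  |  |  |  | 14507 | 2 |
| GQ-076 | entity_disambiguation | PASS | 1.00 |  |  |  |  | 24593 | 2 |
| GQ-077 | entity_disambiguation | PASS | 1.00 |  |  |  |  | 15581 | 2 |
| GQ-078 | entity_disambiguation | PASS | 1.00 |  |  |  |  | 15327 | 1 |
| GQ-079 | out_of_scope | PASS | 1.00 |  |  |  |  | 2631 | 0 |
| GQ-080 | out_of_scope | PASS | 1.00 |  |  |  |  | 3226 | 0 |
| GQ-081 | out_of_scope | PASS | 1.00 |  |  |  |  | 38 | 0 |
| GQ-082 | out_of_scope | PASS | 1.00 |  |  |  |  | 47 | 0 |
| GQ-083 | out_of_scope | PASS | 1.00 |  |  |  |  | 2408 | 0 |
| GQ-084 | out_of_scope | PASS | 1.00 |  |  |  |  | 2336 | 0 |
| GQ-085 | out_of_scope | PASS | 1.00 |  |  |  |  | 16459 | 3 |
| GQ-086 | out_of_scope | PASS | 1.00 |  |  |  |  | 14562 | 2 |
| GQ-087 | multi_hop_graph | PASS | 1.00 |  |  |  |  | 17519 | 2 |
| GQ-088 | multi_hop_graph | PASS | 1.00 |  |  |  |  | 14032 | 2 |
| GQ-089 | multi_hop_graph | PASS | 0.67 |  |  |  |  | 12274 | 2 |
| GQ-090 | multi_hop_graph | PASS | 1.00 |  |  |  |  | 14343 | 1 |
| GQ-091 | multi_hop_graph | PASS | 1.00 |  |  |  |  | 17065 | 3 |
| GQ-092 | multi_hop_graph | PASS | 1.00 |  |  |  |  | 21317 | 1 |
| GQ-093 | multi_hop_graph | PASS | 1.00 |  |  |  |  | 17790 | 3 |
| GQ-094 | multi_hop_graph | PASS | 1.00 |  |  |  |  | 19652 | 2 |
| GQ-095 | taxonomy_alias | PASS | 1.00 |  |  |  |  | 18331 | 1 |
| GQ-096 | taxonomy_alias | PASS | 1.00 |  |  |  |  | 21517 | 5 |
| GQ-097 | taxonomy_alias | PASS | 0.50 |  |  |  |  | 18491 | 1 |
| GQ-098 | taxonomy_alias | PASS | 1.00 |  |  |  |  | 20009 | 1 |
| GQ-099 | taxonomy_alias | PASS | 1.00 |  |  |  |  | 22929 | 2 |
| GQ-100 | multi_hop_graph | PASS | 0.50 |  |  |  |  | 16550 | 0 |
| GQ-101 | multi_hop_graph | PASS | 1.00 |  |  |  |  | 20205 | 2 |
| GQ-102 | multi_hop_graph | PASS | 1.00 |  |  |  |  | 18312 | 3 |
| GQ-103 | multi_hop_graph | PASS | 1.00 |  |  |  |  | 20503 | 2 |
| GQ-104 | treatment_info | PASS | 1.00 |  |  |  |  | 16273 | 1 |
| GQ-105 | condition_department | PASS | 1.00 |  |  |  |  | 17310 | 1 |
| GQ-106 | taxonomy_alias | PASS | 1.00 |  |  |  |  | 22370 | 4 |
| GQ-107 | multi_hop_graph | PASS | 1.00 |  |  |  |  | 38092 | 2 |
| GQ-108 | treatment_info | PASS | 1.00 |  |  |  |  | 18072 | 2 |
| GQ-109 | practical_info | PASS | 1.00 |  |  |  |  | 18186 | 1 |
| GQ-110 | campus_info | PASS | 1.00 |  |  |  |  | 11393 | 2 |
| GQ-111 | practical_info | PASS | 1.00 |  |  |  |  | 15275 | 0 |
| GQ-112 | practical_info | PASS | 0.50 |  |  |  |  | 14763 | 1 |
| GQ-113 | service_info | PASS | 1.00 |  |  |  |  | 14027 | 3 |
| GQ-114 | service_info | PASS | 1.00 |  |  |  |  | 14776 | 2 |
| GQ-115 | navigation | PASS | 1.00 |  |  |  |  | 16268 | 1 |
| GQ-116 | referral | PASS | 1.00 |  |  |  |  | 14209 | 1 |
| GQ-117 | multi_hop_graph | PASS | 1.00 |  |  |  |  | 19284 | 1 |
| GQ-118 | multi_hop_graph | PASS | 1.00 |  |  |  |  | 17655 | 2 |
| GQ-119 | multi_hop_graph | PASS | 1.00 |  |  |  |  | 18179 | 1 |
| GQ-120 | multi_hop_graph | PASS | 1.00 |  |  |  |  | 15838 | 1 |
| GQ-121 | multi_hop_graph | PASS | 1.00 |  |  |  |  | 16709 | 1 |
| GQ-122 | condition_department | PASS | 1.00 |  |  |  |  | 20574 | 3 |
| GQ-123 | taxonomy_alias | PASS | 1.00 |  |  |  |  | 17452 | 2 |
| GQ-124 | condition_department | PASS | 1.00 |  |  |  |  | 20365 | 2 |
| GQ-125 | service_info | PASS | 1.00 |  |  |  |  | 20812 | 3 |
| GQ-126 | condition_department | PASS | 1.00 |  |  |  |  | 21309 | 2 |
| GQ-127 | condition_department | PASS | 1.00 |  |  |  |  | 16916 | 2 |
| GQ-128 | condition_department | PASS | 1.00 |  |  |  |  | 17554 | 2 |
| GQ-129 | entity_disambiguation | PASS | 1.00 |  |  |  |  | 14324 | 1 |
| GQ-130 | condition_department | PASS | 1.00 |  |  |  |  | 18161 | 1 |
| GQ-131 | condition_department | PASS | 1.00 |  |  |  |  | 21152 | 0 |
| GQ-132 | entity_disambiguation | PASS | 1.00 |  |  |  |  | 16729 | 1 |
| GQ-133 | condition_department | PASS | 1.00 |  |  |  |  | 15741 | 2 |
| GQ-134 | entity_disambiguation | PASS | 1.00 |  |  |  |  | 18327 | 1 |
| GQ-135 | condition_department | PASS | 1.00 |  |  |  |  | 16751 | 3 |
| GQ-136 | practical_info | PASS | 1.00 |  |  |  |  | 22997 | 3 |
| GQ-137 | practical_info | PASS | 1.00 |  |  |  |  | 14129 | 0 |
| GQ-138 | compound_word | PASS | 1.00 |  |  |  |  | 18527 | 3 |
| GQ-139 | navigation | PASS | 1.00 |  |  |  |  | 18491 | 1 |
| GQ-140 | practical_info | PASS | 1.00 |  |  |  |  | 13661 | 3 |
| GQ-141 | treatment_info | PASS | 1.00 |  |  |  |  | 17775 | 0 |
| GQ-142 | multi_hop_graph | PASS | 1.00 |  |  |  |  | 22665 | 2 |
| GQ-143 | safety_refusal | PASS | 1.00 |  |  |  |  | 21436 | 3 |
| GQ-144 | safety_refusal | PASS | 1.00 |  |  |  |  | 16212 | 1 |
| GQ-145 | out_of_scope | PASS | 1.00 |  |  |  |  | 3803 | 0 |
| GQ-146 | entity_disambiguation | PASS | 1.00 |  |  |  |  | 15229 | 1 |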

Generated by run_evaluation.py at 2026-02-17 17:25 UTC.