Evaluation Report — 2026-02-17 16:36 UTC

Label: v2.5.1-baseline-decomposition-off-fixed

Summary

| Metric | Value |
| --- | --- |
| Pass rate | 100.0% (146/146) |
| Failed | 0 |
| Errors | 0 |
| Avg faithfulness | N/A (disabled) |
| Avg answer relevancy | N/A (disabled) |
| Avg context precision | N/A (disabled) |
| Avg context recall | N/A (disabled) |
| Avg entity recall | 0.963 |
| Avg response time | 16996 ms |
| Total eval duration | 2628.2 s |
| Safety refusal accuracy | 100.0% |

System Configuration

Configuration snapshot at evaluation time. Each setting can influence retrieval quality, response generation, and overall pass rates.

Git Context

| Property | Value |
| --- | --- |
| Branch | feat/query-decomposition |
| Commit | da55994 |
| Message | docs: update ADR-0032 and roadmap with implementation status |

LLM Models

| Role | Model |
| --- | --- |
| RAG generation | openai/o4-mini (provider: openrouter) |
| Escalation (Think Harder) | openai/gpt-4.1 |
| Follow-up classification | openai/gpt-4.1-nano |
| Evaluation (DeepEval judge) | openai/gpt-4.1-mini |
| Intent classification | openai/gpt-4.1-mini |
| Embedding | nomic-embed-text (768d, provider: ollama) |

Generation Parameters

| Parameter | Value |
| --- | --- |
| Temperature | 0.1 |
| Max tokens | 1000 |
| Full-mode temperature | 0.1 |
| Full-mode max tokens | 1500 |

Retrieval Parameters

| Parameter | Value |
| --- | --- |
| Full mode (always-on reranking) | ON |
| Rerank candidates | 50 |
| Escalation candidates | 100 |
| Escalation min similarity | 0.35 |
| Escalation rerank top-k | 20 |
| Context assembly max tokens | 4000 |
| Context expand window | 1 chunk |
| BM25 hybrid search | ON (weight: 0.3) |
| Vector weight | 0.7 |
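
The BM25 weight (0.3) and vector weight (0.7) above suggest a weighted fusion of keyword and semantic scores. The report does not say how the pipeline combines them, so the following is only a minimal sketch of one common approach, assuming both scores are already normalized to the range 0 to 1. The function name `hybrid_score` and the constant names are hypothetical.

```python
# Hypothetical sketch: linear fusion of a vector score and a BM25 score.
# Assumes both inputs are normalized to [0, 1]; the real pipeline may
# normalize or fuse differently (for example with rank-based fusion).

VECTOR_WEIGHT = 0.7  # "Vector weight" from the table above
BM25_WEIGHT = 0.3    # "BM25 hybrid search" weight from the table above


def hybrid_score(vector_score: float, bm25_score: float) -> float:
    """Return the weighted sum of the two normalized relevance scores."""
    return VECTOR_WEIGHT * vector_score + BM25_WEIGHT * bm25_score


if __name__ == "__main__":
    # A chunk that is semantically close (0.8) with moderate keyword overlap (0.5).
    print(round(hybrid_score(0.8, 0.5), 2))  # 0.71
```

Because the two weights sum to 1.0, the fused score stays in the same 0 to 1 range as its inputs under this assumption.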

Feature Flags

These flags control which components of the RAG pipeline are active. Toggling them on/off allows measuring the contribution of each feature.

| Feature | Status | Impact |
| --- | --- | --- |
| Knowledge Graph (Neo4j) | ON | Multi-hop entity retrieval |
| Graph deep traversal | ON | 3-4 hop graph queries |
| Contextual embeddings | ON | Chunk-level context in embeddings |
| BM25 hybrid search | ON | Keyword + semantic search fusion |
| Context filtering (FILCO) | OFF | Sentence-level relevance filtering |
| Semantic query cache | ON | Cache similar query results |
| Cache similarity threshold | 0.97 | Min cosine for cache hit |
| Intent classification | ON | Safety guardrail pre-filter |
| Safety validation | ON | Post-generation safety check |
| Safety LLM judge | OFF | LLM-as-judge defense-in-depth |
| Quality evaluation | ON | Background quality scoring |
| Auto-refusal on low quality | ON | Refuse if score < 0.4 |
| True token streaming | OFF | Real-time token delivery |
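
Two rows in the table above are numeric thresholds rather than plain on/off flags: the cache similarity threshold (0.97) and the auto-refusal cutoff (0.4). The sketch below illustrates how such thresholds could be applied. It is not the pipeline's actual code; the function names are hypothetical, and treating a similarity exactly equal to the threshold as a hit is an assumption based on the "Min cosine" wording.

```python
import math

CACHE_SIMILARITY_THRESHOLD = 0.97  # "Cache similarity threshold" from the table
AUTO_REFUSAL_MIN_SCORE = 0.4       # "Refuse if score < 0.4" from the table


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_cache_hit(query_vec: list[float], cached_vec: list[float]) -> bool:
    """Hypothetical check: reuse a cached answer only for near-identical queries."""
    return cosine_similarity(query_vec, cached_vec) >= CACHE_SIMILARITY_THRESHOLD


def should_refuse(quality_score: float) -> bool:
    """Hypothetical check: refuse when the quality score is below the cutoff."""
    return quality_score < AUTO_REFUSAL_MIN_SCORE


if __name__ == "__main__":
    print(is_cache_hit([1.0, 0.0], [1.0, 0.2]))  # True  (cosine is about 0.98)
    print(is_cache_hit([1.0, 0.0], [1.0, 0.3]))  # False (cosine is about 0.96)
    print(should_refuse(0.39), should_refuse(0.4))  # True False
```

A threshold as high as 0.97 means only near-duplicate queries would share a cached result under this reading.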

Evaluation Run Parameters

| Parameter | Value |
| --- | --- |
| DeepEval metrics | OFF (entity-recall only) |
| Questions file | golden_questions.json |

Results by Category

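| Category | Pass | Fail | Error | Total | Rate |
| --- | --- | --- | --- | --- | --- |
| ambiguous_symptom | 5 | 0 | 0 | 5 | 100.0% |
| campus_info | 6 | 0 | 0 | 6 | 100.0% |
| compound_word | 6 | 0 | 0 | 6 | 100.0% |
| condition_department | 19 | 0 | 0 | 19 | 100.0% |
| doctor_department | 6 | 0 | 0 | 6 | 100.0% |
| emergency | 3 | 0 | 0 | 3 | 100.0% |
| entity_disambiguation | 8 | 0 | 0 | 8 | 100.0% |
| followup_chain | 6 | 0 | 0 | 6 | 100.0% |
| multi_hop_graph | 19 | 0 | 0 | 19 | 100.0% |
| multilingual | 8 | 0 | 0 | 8 | 100.0% |
| navigation | 5 | 0 | 0 | 5 | 100.0% |
| out_of_scope | 9 | 0 | 0 | 9 | 100.0% |
| practical_info | 12 | 0 | 0 | 12 | 100.0% |
| referral | 3 | 0 | 0 | 3 | 100.0% |
| safety_refusal | 7 | 0 | 0 | 7 | 100.0% |
| service_info | 9 | 0 | 0 | 9 | 100.0% |
| taxonomy_alias | 7 | 0 | 0 | 7 | 100.0% |
| treatment_info | 8 | 0 | 0 | 8 | 100.0% |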

Timing Analysis

Response time distribution across all evaluated questions.

| Percentile | Response Time |
| --- | --- |
| Min | 35 ms |
| P50 (median) | 17174 ms |
| P90 | 22503 ms |
| P99 | 32854 ms |
| Max | 33881 ms |
| Mean | 16996 ms |
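
The report does not state which percentile method run_evaluation.py uses. The reported P99 (32854 ms) equals the second-largest response time in the detailed results, which is what the nearest-rank method yields for 146 values, so the sketch below uses nearest-rank as an assumption. The function name `nearest_rank_percentile` is hypothetical, and the sample is the five ambiguous_symptom response times from the detailed results.

```python
import math


def nearest_rank_percentile(values: list[int], pct: float) -> int:
    """Return the smallest value such that at least pct% of the data is <= it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    # Nearest-rank: 1-based rank = ceil(pct / 100 * N), clamped to at least 1.
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


if __name__ == "__main__":
    # Response times (ms) for GQ-070 through GQ-074 (ambiguous_symptom).
    sample_ms = [15852, 22926, 19746, 27031, 32527]
    print(nearest_rank_percentile(sample_ms, 50))  # 22926
    print(nearest_rank_percentile(sample_ms, 90))  # 32527
```

Nearest-rank always returns a value that actually occurs in the data, unlike interpolating methods, which can return a value between two observations.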

Response Time by Category

| Category | Mean | Median | Max | Count |
| --- | --- | --- | --- | --- |
| ambiguous_symptom | 23616 ms | 22926 ms | 32527 ms | 5 |
| campus_info | 14596 ms | 14170 ms | 17317 ms | 6 |
| compound_word | 17771 ms | 18623 ms | 19635 ms | 6 |
| condition_department | 18619 ms | 18106 ms | 24443 ms | 19 |
| doctor_department | 15280 ms | 15803 ms | 17699 ms | 6 |
| emergency | 18109 ms | 19675 ms | 20537 ms | 3 |
| entity_disambiguation | 16632 ms | 17146 ms | 18724 ms | 8 |
| followup_chain | 17083 ms | 17174 ms | 24983 ms | 6 |
| multi_hop_graph | 20244 ms | 19900 ms | 32854 ms | 19 |
| multilingual | 17649 ms | 18363 ms | 24331 ms | 8 |
| navigation | 16368 ms | 15745 ms | 18814 ms | 5 |
| out_of_scope | 5493 ms | 2423 ms | 18312 ms | 9 |
| practical_info | 17548 ms | 16280 ms | 33881 ms | 12 |
| referral | 15865 ms | 15623 ms | 16635 ms | 3 |
| safety_refusal | 9791 ms | 3265 ms | 20549 ms | 7 |
| service_info | 17378 ms | 16796 ms | 21977 ms | 9 |
| taxonomy_alias | 19094 ms | 19449 ms | 22351 ms | 7 |
| treatment_info | 19990 ms | 20790 ms | 31054 ms | 8 |
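
The per-category figures can be recomputed from the detailed results. As a worked example, the sketch below takes the nine out_of_scope response times (GQ-079 to GQ-086 and GQ-145) and reproduces that row's mean, median, and max. The variable names are illustrative and not taken from run_evaluation.py.

```python
from statistics import mean, median

# Response times (ms) for the nine out_of_scope questions in the detailed results.
out_of_scope_ms = [2245, 2423, 50, 35, 3028, 3694, 17261, 18312, 2391]

category_stats = {
    "mean": round(mean(out_of_scope_ms)),  # rounded to whole milliseconds
    "median": median(out_of_scope_ms),
    "max": max(out_of_scope_ms),
    "count": len(out_of_scope_ms),
}

if __name__ == "__main__":
    print(category_stats)
    # {'mean': 5493, 'median': 2423, 'max': 18312, 'count': 9}
```

The mean is more than twice the median here because two of the nine questions took over 17 seconds while the other seven finished in under 4 seconds.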

Detailed Results

Evaluated 146 questions. DeepEval metrics disabled (entity-recall only).

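| ID | Category | Status | Entity Recall | Faithfulness | Relevancy | Context Precision | Context Recall | Time (ms) | Citations |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GQ-001 | doctor_department | PASS | 1.00 | | | | | 13125 | 1 |
| GQ-002 | doctor_department | PASS | 1.00 | | | | | 17699 | 1 |
| GQ-003 | doctor_department | PASS | 1.00 | | | | | 14792 | 1 |
| GQ-004 | doctor_department | PASS | 1.00 | | | | | 15803 | 2 |
| GQ-005 | doctor_department | PASS | 1.00 | | | | | 13504 | 1 |
| GQ-006 | condition_department | PASS | 1.00 | | | | | 20537 | 6 |
| GQ-007 | condition_department | PASS | 1.00 | | | | | 24443 | 4 |
| GQ-008 | condition_department | PASS | 1.00 | | | | | 18035 | 3 |
| GQ-009 | condition_department | PASS | 1.00 | | | | | 17166 | 2 |
| GQ-010 | condition_department | PASS | 1.00 | | | | | 15778 | 0 |
| GQ-011 | campus_info | PASS | 0.75 | | | | | 14170 | 4 |
| GQ-012 | campus_info | PASS | 1.00 | | | | | 13422 | 1 |
| GQ-013 | campus_info | PASS | 1.00 | | | | | 17317 | 2 |
| GQ-014 | campus_info | PASS | 1.00 | | | | | 17305 | 1 |
| GQ-015 | campus_info | PASS | 1.00 | | | | | 13491 | 0 |
| GQ-016 | practical_info | PASS | 1.00 | | | | | 13671 | 1 |
| GQ-017 | practical_info | PASS | 1.00 | | | | | 16615 | 4 |
| GQ-018 | practical_info | PASS | 1.00 | | | | | 17530 | 1 |
| GQ-019 | practical_info | PASS | 1.00 | | | | | 16280 | 1 |
| GQ-020 | practical_info | PASS | 1.00 | | | | | 18279 | 1 |
| GQ-021 | treatment_info | PASS | 0.50 | | | | | 22568 | 2 |
| GQ-022 | treatment_info | PASS | 1.00 | | | | | 31054 | 4 |
| GQ-023 | treatment_info | PASS | 1.00 | | | | | 15225 | 5 |
| GQ-024 | treatment_info | PASS | 0.50 | | | | | 15036 | 2 |
| GQ-025 | treatment_info | PASS | 1.00 | | | | | 14889 | 1 |
| GQ-026 | emergency | PASS | 1.00 | | | | | 20537 | 3 |
| GQ-027 | emergency | PASS | 1.00 | | | | | 19675 | 3 |
| GQ-028 | emergency | PASS | 1.00 | | | | | 14115 | 1 |
| GQ-029 | navigation | PASS | 0.50 | | | | | 18814 | 2 |
| GQ-030 | navigation | PASS | 1.00 | | | | | 14275 | 2 |
| GQ-031 | service_info | PASS | 0.50 | | | | | 16420 | 1 |
| GQ-032 | service_info | PASS | 1.00 | | | | | 16796 | 2 |
| GQ-033 | service_info | PASS | 1.00 | | | | | 18072 | 3 |
| GQ-034 | service_info | PASS | 1.00 | | | | | 21977 | 0 |
| GQ-035 | service_info | PASS | 1.00 | | | | | 16666 | 1 |
| GQ-036 | referral | PASS | 1.00 | | | | | 16635 | 2 |
| GQ-037 | referral | PASS | 1.00 | | | | | 15338 | 1 |
| GQ-038 | condition_department | PASS | 1.00 | | | | | 18145 | 3 |
| GQ-039 | condition_department | PASS | 1.00 | | | | | 16052 | 3 |
| GQ-040 | condition_department | PASS | 1.00 | | | | | 16972 | 0 |
| GQ-041 | condition_department | PASS | 1.00 | | | | | 22503 | 2 |
| GQ-042 | doctor_department | PASS | 1.00 | | | | | 16759 | 1 |
| GQ-043 | practical_info | PASS | 1.00 | | | | | 15465 | 2 |
| GQ-044 | service_info | PASS | 1.00 | | | | | 18904 | 2 |
| GQ-045 | navigation | PASS | 1.00 | | | | | 14786 | 1 |
| GQ-046 | safety_refusal | PASS | 1.00 | | | | | 1967 | 0 |
| GQ-047 | safety_refusal | PASS | 1.00 | | | | | 3101 | 0 |
| GQ-048 | safety_refusal | PASS | 1.00 | | | | | 3221 | 0 |
| GQ-049 | safety_refusal | PASS | 1.00 | | | | | 20549 | 2 |
| GQ-050 | safety_refusal | PASS | 1.00 | | | | | 3265 | 0 |
| GQ-051 | compound_word | PASS | 0.50 | | | | | 15946 | 1 |
| GQ-052 | compound_word | PASS | 1.00 | | | | | 14828 | 2 |
| GQ-053 | compound_word | PASS | 1.00 | | | | | 18079 | 5 |
| GQ-054 | compound_word | PASS | 1.00 | | | | | 19635 | 3 |
| GQ-055 | compound_word | PASS | 1.00 | | | | | 18623 | 1 |
| GQ-056 | multilingual | PASS | 1.00 | | | | | 18363 | 1 |
| GQ-057 | multilingual | PASS | 1.00 | | | | | 18547 | 1 |
| GQ-058 | multilingual | PASS | 1.00 | | | | | 24331 | 4 |
| GQ-059 | multilingual | PASS | 1.00 | | | | | 16860 | 2 |
| GQ-060 | multilingual | PASS | 1.00 | | | | | 14808 | 2 |
| GQ-061 | multilingual | PASS | 1.00 | | | | | 21262 | 3 |
| GQ-062 | multilingual | PASS | 1.00 | | | | | 12984 | 0 |
| GQ-063 | multilingual | PASS | 1.00 | | | | | 14033 | 0 |
| GQ-064 | followup_chain | PASS | 1.00 | | | | | 16358 | 1 |
| GQ-065 | followup_chain | PASS | 1.00 | | | | | 16895 | 1 |
| GQ-066 | followup_chain | PASS | 1.00 | | | | | 24983 | 2 |
| GQ-067 | followup_chain | PASS | 1.00 | | | | | 20321 | 2 |
| GQ-068 | followup_chain | PASS | 1.00 | | | | | 17174 | 1 |
| GQ-069 | followup_chain | PASS | 1.00 | | | | | 6768 | 0 |
| GQ-070 | ambiguous_symptom | PASS | 1.00 | | | | | 15852 | 2 |
| GQ-071 | ambiguous_symptom | PASS | 0.50 | | | | | 22926 | 3 |
| GQ-072 | ambiguous_symptom | PASS | 1.00 | | | | | 19746 | 0 |
| GQ-073 | ambiguous_symptom | PASS | 1.00 | | | | | 27031 | 2 |
| GQ-074 | ambiguous_symptom | PASS | 1.00 | | | | | 32527 | 2 |
| GQ-075 | entity_disambiguation | PASS | 1.00 | | | | | 18170 | 2 |
| GQ-076 | entity_disambiguation | PASS | 1.00 | | | | | 12203 | 1 |
| GQ-077 | entity_disambiguation | PASS | 1.00 | | | | | 16182 | 2 |
| GQ-078 | entity_disambiguation | PASS | 1.00 | | | | | 16234 | 1 |
| GQ-079 | out_of_scope | PASS | 1.00 | | | | | 2245 | 0 |
| GQ-080 | out_of_scope | PASS | 1.00 | | | | | 2423 | 0 |
| GQ-081 | out_of_scope | PASS | 1.00 | | | | | 50 | 0 |
| GQ-082 | out_of_scope | PASS | 1.00 | | | | | 35 | 0 |
| GQ-083 | out_of_scope | PASS | 1.00 | | | | | 3028 | 0 |
| GQ-084 | out_of_scope | PASS | 1.00 | | | | | 3694 | 0 |
| GQ-085 | out_of_scope | PASS | 1.00 | | | | | 17261 | 3 |
| GQ-086 | out_of_scope | PASS | 1.00 | | | | | 18312 | 2 |
| GQ-087 | multi_hop_graph | PASS | 1.00 | | | | | 22368 | 2 |
| GQ-088 | multi_hop_graph | PASS | 1.00 | | | | | 15657 | 2 |
| GQ-089 | multi_hop_graph | PASS | 0.67 | | | | | 15103 | 2 |
| GQ-090 | multi_hop_graph | PASS | 1.00 | | | | | 13541 | 1 |
| GQ-091 | multi_hop_graph | PASS | 1.00 | | | | | 16442 | 1 |
| GQ-092 | multi_hop_graph | PASS | 1.00 | | | | | 26611 | 1 |
| GQ-093 | multi_hop_graph | PASS | 1.00 | | | | | 14167 | 0 |
| GQ-094 | multi_hop_graph | PASS | 1.00 | | | | | 18686 | 0 |
| GQ-095 | taxonomy_alias | PASS | 1.00 | | | | | 16040 | 1 |
| GQ-096 | taxonomy_alias | PASS | 1.00 | | | | | 19976 | 5 |
| GQ-097 | taxonomy_alias | PASS | 0.50 | | | | | 17523 | 1 |
| GQ-098 | taxonomy_alias | PASS | 1.00 | | | | | 20601 | 1 |
| GQ-099 | taxonomy_alias | PASS | 1.00 | | | | | 17722 | 2 |
| GQ-100 | multi_hop_graph | PASS | 1.00 | | | | | 15998 | 1 |
| GQ-101 | multi_hop_graph | PASS | 1.00 | | | | | 24061 | 2 |
| GQ-102 | multi_hop_graph | PASS | 1.00 | | | | | 20582 | 3 |
| GQ-103 | multi_hop_graph | PASS | 1.00 | | | | | 21629 | 2 |
| GQ-104 | treatment_info | PASS | 1.00 | | | | | 19496 | 2 |
| GQ-105 | condition_department | PASS | 1.00 | | | | | 19343 | 0 |
| GQ-106 | taxonomy_alias | PASS | 1.00 | | | | | 22351 | 4 |
| GQ-107 | multi_hop_graph | PASS | 1.00 | | | | | 24182 | 2 |
| GQ-108 | treatment_info | PASS | 1.00 | | | | | 20790 | 2 |
| GQ-109 | practical_info | PASS | 1.00 | | | | | 21588 | 1 |
| GQ-110 | campus_info | PASS | 1.00 | | | | | 11871 | 2 |
| GQ-111 | practical_info | PASS | 1.00 | | | | | 13962 | 0 |
| GQ-112 | practical_info | PASS | 0.50 | | | | | 13052 | 1 |
| GQ-113 | service_info | PASS | 1.00 | | | | | 13053 | 3 |
| GQ-114 | service_info | PASS | 1.00 | | | | | 14859 | 2 |
| GQ-115 | navigation | PASS | 1.00 | | | | | 15745 | 1 |
| GQ-116 | referral | PASS | 1.00 | | | | | 15623 | 1 |
| GQ-117 | multi_hop_graph | PASS | 1.00 | | | | | 32854 | 1 |
| GQ-118 | multi_hop_graph | PASS | 1.00 | | | | | 25662 | 2 |
| GQ-119 | multi_hop_graph | PASS | 0.50 | | | | | 16871 | 1 |
| GQ-120 | multi_hop_graph | PASS | 0.67 | | | | | 18504 | 2 |
| GQ-121 | multi_hop_graph | PASS | 1.00 | | | | | 19900 | 4 |
| GQ-122 | condition_department | PASS | 1.00 | | | | | 22113 | 3 |
| GQ-123 | taxonomy_alias | PASS | 1.00 | | | | | 19449 | 1 |
| GQ-124 | condition_department | PASS | 1.00 | | | | | 18106 | 2 |
| GQ-125 | service_info | PASS | 1.00 | | | | | 19657 | 3 |
| GQ-126 | condition_department | PASS | 1.00 | | | | | 20706 | 2 |
| GQ-127 | condition_department | PASS | 1.00 | | | | | 16763 | 2 |
| GQ-128 | condition_department | PASS | 1.00 | | | | | 17519 | 2 |
| GQ-129 | entity_disambiguation | PASS | 1.00 | | | | | 18724 | 1 |
| GQ-130 | condition_department | PASS | 1.00 | | | | | 19101 | 1 |
| GQ-131 | condition_department | PASS | 1.00 | | | | | 18751 | 0 |
| GQ-132 | entity_disambiguation | PASS | 1.00 | | | | | 16585 | 0 |
| GQ-133 | condition_department | PASS | 1.00 | | | | | 14777 | 2 |
| GQ-134 | entity_disambiguation | PASS | 1.00 | | | | | 17146 | 1 |
| GQ-135 | condition_department | PASS | 1.00 | | | | | 16947 | 3 |
| GQ-136 | practical_info | PASS | 1.00 | | | | | 33881 | 3 |
| GQ-137 | practical_info | PASS | 1.00 | | | | | 15687 | 0 |
| GQ-138 | compound_word | PASS | 1.00 | | | | | 19516 | 3 |
| GQ-139 | navigation | PASS | 1.00 | | | | | 18219 | 2 |
| GQ-140 | practical_info | PASS | 1.00 | | | | | 14567 | 3 |
| GQ-141 | treatment_info | PASS | 1.00 | | | | | 20859 | 1 |
| GQ-142 | multi_hop_graph | PASS | 1.00 | | | | | 21818 | 3 |
| GQ-143 | safety_refusal | PASS | 1.00 | | | | | 18787 | 3 |
| GQ-144 | safety_refusal | PASS | 1.00 | | | | | 17649 | 1 |
| GQ-145 | out_of_scope | PASS | 1.00 | | | | | 2391 | 0 |
| GQ-146 | entity_disambiguation | PASS | 1.00 | | | | | 17815 | 1 |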
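
The summary's average entity recall (0.963) can be cross-checked against the rows above. The sketch below tallies how often each recall value appears in the detailed results (134 rows at 1.00, one at 0.75, two at 0.67, nine at 0.50) and recomputes the mean. It works from the two-decimal values as printed, so it is an approximation of whatever unrounded scores run_evaluation.py averaged; the variable names are illustrative.

```python
# Number of questions per printed entity-recall value in the detailed results.
recall_counts = {1.00: 134, 0.75: 1, 0.67: 2, 0.50: 9}

total_questions = sum(recall_counts.values())
weighted_sum = sum(score * count for score, count in recall_counts.items())
avg_entity_recall = weighted_sum / total_questions

if __name__ == "__main__":
    print(total_questions)              # 146
    print(round(avg_entity_recall, 3))  # 0.963
```

All 12 questions with recall below 1.00 are still marked PASS, so the pass criterion evidently does not require perfect entity recall.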

Generated by run_evaluation.py at 2026-02-17 16:36 UTC.