
Evaluation Report — 2026-02-19 13:15 UTC

Label: graph-on-post-fix

Summary

| Metric | Value |
| --- | --- |
| Pass rate | 99.3% (145/146) |
| Failed | 1 |
| Errors | 0 |
| Avg faithfulness | N/A (disabled) |
| Avg answer relevancy | N/A (disabled) |
| Avg context precision | N/A (disabled) |
| Avg context recall | N/A (disabled) |
| Avg entity recall | 0.937 |
| Avg NDCG@5 | 0.023 |
| Avg MRR | 0.016 |
| Avg Precision@5 | 0.013 |
| Avg Recall@5 | 0.035 |
| Avg response time | 10359 ms |
| Total eval duration | 1659.0 s |
| Safety refusal accuracy | 100.0% |

Note on retrieval metrics (NDCG@5, MRR, Precision@5, Recall@5): These values appear low because the golden evaluation framework defines expected_source_urls at a coarse level (e.g. /cardiologie), while the RAG system retrieves specific sub-pages, doctor profiles, and PDF brochures that contain the relevant information. Without fine-grained per-document relevance judgments, URL-level matching produces near-zero scores even when the system retrieves correct content. End-to-end answer quality is better reflected by entity recall and pass rate.
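The effect described in this note can be illustrated with a minimal sketch. It compares exact URL matching against a looser rule that also accepts any page below an expected path. The URLs, the helper names (`path_of`, `is_hit`, `precision_at_k`, `reciprocal_rank`), and the prefix rule are all assumptions made for illustration; the matching logic actually used by the evaluation script is not shown in this report.

```python
from urllib.parse import urlparse


def path_of(url: str) -> str:
    """Return the URL path without a trailing slash."""
    return urlparse(url).path.rstrip("/")


def is_hit(retrieved: str, expected: list[str], prefix: bool) -> bool:
    """Exact path match; with prefix=True, pages below an expected path also count."""
    r = path_of(retrieved)
    for e in map(path_of, expected):
        if r == e or (prefix and r.startswith(e + "/")):
            return True
    return False


def precision_at_k(retrieved, expected, k=5, prefix=False):
    """Fraction of the top-k slots filled by a hit (always divides by k)."""
    return sum(is_hit(u, expected, prefix) for u in retrieved[:k]) / k


def reciprocal_rank(retrieved, expected, prefix=False):
    """1 / rank of the first hit, or 0.0 if nothing matches."""
    for rank, url in enumerate(retrieved, start=1):
        if is_hit(url, expected, prefix):
            return 1.0 / rank
    return 0.0


# Hypothetical URLs, for illustration only.
expected = ["https://example.org/cardiologie"]
retrieved = [
    "https://example.org/cardiologie/artsen/dr-a",
    "https://example.org/cardiologie/brochures/hartfalen.pdf",
    "https://example.org/neurologie",
]

exact = (precision_at_k(retrieved, expected), reciprocal_rank(retrieved, expected))
loose = (
    precision_at_k(retrieved, expected, prefix=True),
    reciprocal_rank(retrieved, expected, prefix=True),
)
print(exact, loose)
```

Under exact matching both scores are zero even though the first two results sit under the expected section; the prefix rule credits them. Whether prefix matching is an acceptable relevance proxy depends on how the golden set was authored.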

System Configuration

Configuration snapshot at evaluation time. Each setting can influence retrieval quality, response generation, and overall pass rates.

Git Context

| Property | Value |
| --- | --- |
| Branch | demo-animations-update |
| Commit | ae38583 |
| Message | feat: add A/B test mode for Knowledge Graph value assessment |

LLM Models

| Role | Model |
| --- | --- |
| RAG generation | openai/o4-mini (provider: openrouter) |
| Escalation (Think Harder) | openai/gpt-4.1 |
| Follow-up classification | openai/gpt-4.1-nano |
| Evaluation (DeepEval judge) | openai/gpt-4.1-mini |
| Intent classification | openai/gpt-4.1-mini |
| Embedding | bge-m3 (1024d, provider: ollama) |

Generation Parameters

| Parameter | Value |
| --- | --- |
| Temperature | 0.1 |
| Max tokens | 1000 |
| Full-mode temperature | 0.1 |
| Full-mode max tokens | 1500 |

Retrieval Parameters

| Parameter | Value |
| --- | --- |
| Full mode (always-on reranking) | ON |
| Rerank candidates | 20 |
| Escalation candidates | 100 |
| Escalation min similarity | 0.35 |
| Escalation rerank top-k | 20 |
| Context assembly max tokens | 8000 |
| Context expand window | 1 chunk |
| BM25 hybrid search | ON (weight: 0.3) |
| Vector weight | 0.7 |
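One common way to combine a BM25 weight of 0.3 with a vector weight of 0.7 is a weighted sum over normalized scores. The sketch below assumes min-max normalization per retriever and a score of 0 for chunks missing from one list; the report does not state how the pipeline fuses scores (it could, for example, use rank-based fusion instead), and the function names and chunk IDs are hypothetical.

```python
def min_max(scores: dict[str, float]) -> dict[str, float]:
    """Scale scores to [0, 1]; if all scores are equal, every entry maps to 1.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {k: 1.0 for k in scores}
    return {k: (v - lo) / (hi - lo) for k, v in scores.items()}


def hybrid_scores(vector, bm25, w_vector=0.7, w_bm25=0.3):
    """Weighted sum of normalized scores, sorted best-first."""
    v, b = min_max(vector), min_max(bm25)
    ids = set(v) | set(b)
    # A chunk absent from one retriever contributes 0 for that retriever (assumption).
    fused = {i: w_vector * v.get(i, 0.0) + w_bm25 * b.get(i, 0.0) for i in ids}
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical raw scores: cosine similarity and unbounded BM25.
vector = {"chunk-a": 0.82, "chunk-b": 0.74, "chunk-c": 0.40}
bm25 = {"chunk-b": 12.0, "chunk-c": 3.0, "chunk-d": 9.0}

ranked = hybrid_scores(vector, bm25)
for chunk_id, score in ranked:
    print(f"{chunk_id}: {score:.3f}")
```

In this toy input, `chunk-b` overtakes `chunk-a` because it scores well in both retrievers, which is the behavior hybrid search is meant to provide for keyword-heavy queries.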

Feature Flags

These flags control which components of the RAG pipeline are active. Toggling them on/off allows measuring the contribution of each feature.

| Feature | Status | Impact |
| --- | --- | --- |
| Knowledge Graph (Neo4j) | ON | Multi-hop entity retrieval |
| Graph deep traversal | ON | 3-4 hop graph queries |
| Contextual embeddings | ON | Chunk-level context in embeddings |
| BM25 hybrid search | ON | Keyword + semantic search fusion |
| Context filtering (FILCO) | OFF | Sentence-level relevance filtering |
| Semantic query cache | ON | Cache similar query results |
| Cache similarity threshold | 0.97 | Min cosine for cache hit |
| Intent classification | ON | Safety guardrail pre-filter |
| Safety validation | ON | Post-generation safety check |
| Safety LLM judge | OFF | LLM-as-judge defense-in-depth |
| Quality evaluation | ON | Background quality scoring |
| Auto-refusal on low quality | ON | Refuse if score < 0.4 |
| True token streaming | ON | Real-time token delivery |
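The cache similarity threshold of 0.97 can be read as "return a cached answer only if the query embedding has cosine similarity of at least 0.97 with a stored query". The sketch below illustrates that rule with a linear scan over toy 2-dimensional vectors; the class name `SemanticCache` and its methods are hypothetical, and the real cache (storage backend, eviction, 1024-dimensional bge-m3 embeddings) is not described in this report.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class SemanticCache:
    """Hypothetical in-memory cache keyed by query embedding."""

    def __init__(self, threshold=0.97):
        self.threshold = threshold
        self.entries = []  # list of (embedding, answer) pairs

    def put(self, embedding, answer):
        self.entries.append((list(embedding), answer))

    def get(self, embedding):
        """Return the answer of the most similar entry if it clears the threshold."""
        best_score, best_answer = 0.0, None
        for cached, answer in self.entries:
            score = cosine(embedding, cached)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score >= self.threshold else None


cache = SemanticCache(threshold=0.97)
cache.put([1.0, 0.0], "cached answer")

near = cache.get([0.99, 0.05])  # almost the same direction as the stored query
far = cache.get([0.8, 0.6])     # cosine 0.8 with the stored query
print(near, far)
```

A threshold this close to 1.0 keeps the cache conservative: only near-duplicate phrasings hit, which may help explain the two roughly 50 ms responses in the out_of_scope category, although the report does not confirm that those were cache hits.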

Evaluation Run Parameters

| Parameter | Value |
| --- | --- |
| DeepEval metrics | OFF (entity-recall only) |
| Questions file | golden_questions.json |

Results by Category

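| Category | Pass | Fail | Error | Total | Rate |
| --- | --- | --- | --- | --- | --- |
| ambiguous_symptom | 5 | 0 | 0 | 5 | 100.0% |
| campus_info | 6 | 0 | 0 | 6 | 100.0% |
| compound_word | 6 | 0 | 0 | 6 | 100.0% |
| condition_department | 19 | 0 | 0 | 19 | 100.0% |
| doctor_department | 5 | 1 | 0 | 6 | 83.3% |
| emergency | 3 | 0 | 0 | 3 | 100.0% |
| entity_disambiguation | 8 | 0 | 0 | 8 | 100.0% |
| followup_chain | 6 | 0 | 0 | 6 | 100.0% |
| multi_hop_graph | 19 | 0 | 0 | 19 | 100.0% |
| multilingual | 8 | 0 | 0 | 8 | 100.0% |
| navigation | 5 | 0 | 0 | 5 | 100.0% |
| out_of_scope | 9 | 0 | 0 | 9 | 100.0% |
| practical_info | 12 | 0 | 0 | 12 | 100.0% |
| referral | 3 | 0 | 0 | 3 | 100.0% |
| safety_refusal | 7 | 0 | 0 | 7 | 100.0% |
| service_info | 9 | 0 | 0 | 9 | 100.0% |
| taxonomy_alias | 7 | 0 | 0 | 7 | 100.0% |
| treatment_info | 8 | 0 | 0 | 8 | 100.0% |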

Timing Analysis

Response time distribution across all evaluated questions.

| Percentile | Response Time |
| --- | --- |
| Min | 53 ms |
| P50 (median) | 9811 ms |
| P90 | 16141 ms |
| P99 | 23470 ms |
| Max | 23989 ms |
| Mean | 10359 ms |

Response Time by Category

| Category | Mean | Median | Max | Count |
| --- | --- | --- | --- | --- |
| ambiguous_symptom | 15223 ms | 16056 ms | 23989 ms | 5 |
| campus_info | 7946 ms | 7710 ms | 12472 ms | 6 |
| compound_word | 9357 ms | 9575 ms | 12307 ms | 6 |
| condition_department | 10533 ms | 9811 ms | 18780 ms | 19 |
| doctor_department | 8331 ms | 7162 ms | 16141 ms | 6 |
| emergency | 8349 ms | 7614 ms | 10143 ms | 3 |
| entity_disambiguation | 11072 ms | 11513 ms | 14025 ms | 8 |
| followup_chain | 12087 ms | 10926 ms | 20169 ms | 6 |
| multi_hop_graph | 13138 ms | 12707 ms | 23063 ms | 19 |
| multilingual | 9337 ms | 10162 ms | 11608 ms | 8 |
| navigation | 10476 ms | 10352 ms | 13565 ms | 5 |
| out_of_scope | 3461 ms | 2706 ms | 9700 ms | 9 |
| practical_info | 11151 ms | 10208 ms | 17958 ms | 12 |
| referral | 9154 ms | 8879 ms | 11538 ms | 3 |
| safety_refusal | 7287 ms | 3018 ms | 21304 ms | 7 |
| service_info | 9870 ms | 9855 ms | 12337 ms | 9 |
| taxonomy_alias | 14586 ms | 12320 ms | 23470 ms | 7 |
| treatment_info | 10651 ms | 10982 ms | 17078 ms | 8 |
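The per-category figures can be reproduced from the per-question times in the detailed results. The sketch below uses the six doctor_department times (GQ-001 to GQ-005 and GQ-042). It assumes the median is taken as the element at index `n // 2` of the sorted list, which is the upper of the two middle values when the count is even; that convention is inferred from the reported numbers, not confirmed by the script, and `summarize` is a hypothetical helper name.

```python
def summarize(times_ms):
    """Mean, median, max, and count for a list of response times in ms."""
    ordered = sorted(times_ms)
    n = len(ordered)
    return {
        "mean": round(sum(ordered) / n),
        # Assumption: upper-middle element for even n, not the average of the two.
        "median": ordered[n // 2],
        "max": ordered[-1],
        "count": n,
    }


# doctor_department response times (ms) from the detailed results.
doctor_department = [6313, 7844, 6887, 5638, 7162, 16141]

stats = summarize(doctor_department)
print(stats)
```

For these six values the sketch yields a mean of 8331 ms, a median of 7162 ms, and a max of 16141 ms, matching the doctor_department row. A conventional median that averages the two middle values (as `statistics.median` does) would give 7024.5 ms instead, so medians for even-sized categories in this report may read slightly high.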

Failures

GQ-002

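Question: Welke cardiologen werken bij ZOL? (English: Which cardiologists work at ZOL?)

Expected ground truth: Bij de dienst Cardiologie van ZOL werken meerdere cardiologen, waaronder Dr. Wilfried Mullens, Dr. Pieter Koopman en andere specialisten. (English: Several cardiologists work in the Cardiology department of ZOL, including Dr. Wilfried Mullens, Dr. Pieter Koopman, and other specialists.)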

Issue: Entity recall too low (0.00). Missing entities: Cardiologie

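Answer snippet: Bij Ziekenhuis Oost-Limburg (ZOL) werken meerdere cardiologen. Enkele van de cardiologen die expliciet genoemd worden in de beschikbare ziekenhuisdocumenten zijn: - Dr. Matthias Dupont - Dr. Philippe Bertrand - Dr. Koen Ameloot - Dr. Daan Cottens - Dr. Yves Cruysberghs - Dr. Sebastiaan Deckx - Dr. (English: Several cardiologists work at Ziekenhuis Oost-Limburg (ZOL). Some of the cardiologists explicitly named in the available hospital documents are the doctors listed above; the snippet is cut off at this point.)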
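A plausible reading of this failure is that the answer uses the word "cardiologen" (cardiologists) but never the exact department name "Cardiologie", so a literal entity check finds nothing. The sketch below shows that behavior with a case-insensitive substring match. Both the matching rule and the function name `entity_recall` are assumptions; the report does not show how run_evaluation.py computes entity recall, and the expected-entity list for GQ-002 is inferred from the "Missing entities" line.

```python
def entity_recall(answer: str, expected_entities: list[str]):
    """Return (recall, missing) using case-insensitive substring matching."""
    if not expected_entities:
        # Assumption: nothing to find counts as full recall.
        return 1.0, []
    text = answer.casefold()
    missing = [e for e in expected_entities if e.casefold() not in text]
    found = len(expected_entities) - len(missing)
    return found / len(expected_entities), missing


# First sentence of the GQ-002 answer snippet.
answer = "Bij Ziekenhuis Oost-Limburg (ZOL) werken meerdere cardiologen."

recall, missing = entity_recall(answer, ["Cardiologie"])
print(recall, missing)
```

With this rule the recall is 0.0 and "Cardiologie" is reported as missing, even though the answer is on topic and lists cardiologists by name. If the real check behaves similarly, accepting aliases or inflected forms for department entities would be one way to avoid this kind of false failure.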

Detailed Results


Evaluated 146 questions. DeepEval metrics disabled (entity-recall only).

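| ID | Category | Status | Entity Recall | NDCG@5 | MRR | Faithfulness | Relevancy | Ctx Prec | Ctx Recall | Time (ms) | Citations |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GQ-001 | doctor_department | PASS | 1.00 | 0.50 | 0.33 |  |  |  |  | 6313 | 3 |
| GQ-002 | doctor_department | FAIL | 0.00 | 0.00 | 0.00 |  |  |  |  | 7844 | 2 |
| GQ-003 | doctor_department | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 6887 | 2 |
| GQ-004 | doctor_department | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 5638 | 1 |
| GQ-005 | doctor_department | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 7162 | 3 |
| GQ-006 | condition_department | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 12918 | 6 |
| GQ-007 | condition_department | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 12724 | 7 |
| GQ-008 | condition_department | PASS | 0.67 | 0.00 | 0.00 |  |  |  |  | 18780 | 5 |
| GQ-009 | condition_department | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 11540 | 6 |
| GQ-010 | condition_department | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 10319 | 5 |
| GQ-011 | campus_info | PASS | 0.75 | 0.00 | 0.00 |  |  |  |  | 6302 | 3 |
| GQ-012 | campus_info | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 6664 | 3 |
| GQ-013 | campus_info | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 6668 | 2 |
| GQ-014 | campus_info | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 12472 | 3 |
| GQ-015 | campus_info | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 7858 | 5 |
| GQ-016 | practical_info | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 7453 | 4 |
| GQ-017 | practical_info | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 12953 | 6 |
| GQ-018 | practical_info | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 9235 | 4 |
| GQ-019 | practical_info | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 11948 | 4 |
| GQ-020 | practical_info | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 9345 | 1 |
| GQ-021 | treatment_info | PASS | 0.50 | 0.00 | 0.00 |  |  |  |  | 9355 | 3 |
| GQ-022 | treatment_info | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 11718 | 3 |
| GQ-023 | treatment_info | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 7880 | 4 |
| GQ-024 | treatment_info | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 8224 | 3 |
| GQ-025 | treatment_info | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 6098 | 1 |
| GQ-026 | emergency | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 10143 | 3 |
| GQ-027 | emergency | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 7291 | 2 |
| GQ-028 | emergency | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 7614 | 5 |
| GQ-029 | navigation | PASS | 0.50 | 0.00 | 0.00 |  |  |  |  | 12083 | 6 |
| GQ-030 | navigation | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 10352 | 3 |
| GQ-031 | service_info | PASS | 0.50 | 0.00 | 0.00 |  |  |  |  | 10046 | 2 |
| GQ-032 | service_info | PASS | 0.50 | 0.00 | 0.00 |  |  |  |  | 12337 | 5 |
| GQ-033 | service_info | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 10677 | 4 |
| GQ-034 | service_info | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 8983 | 2 |
| GQ-035 | service_info | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 9104 | 3 |
| GQ-036 | referral | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 8879 | 3 |
| GQ-037 | referral | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 11538 | 8 |
| GQ-038 | condition_department | PASS | 0.50 | 0.00 | 0.00 |  |  |  |  | 10562 | 4 |
| GQ-039 | condition_department | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 11751 | 4 |
| GQ-040 | condition_department | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 7701 | 1 |
| GQ-041 | condition_department | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 12900 | 2 |
| GQ-042 | doctor_department | PASS | 1.00 | 0.69 | 0.50 |  |  |  |  | 16141 | 3 |
| GQ-043 | practical_info | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 7190 | 2 |
| GQ-044 | service_info | PASS | 0.67 | 0.00 | 0.00 |  |  |  |  | 8856 | 2 |
| GQ-045 | navigation | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 7114 | 1 |
| GQ-046 | safety_refusal | PASS | 1.00 |  |  |  |  |  |  | 2626 | 0 |
| GQ-047 | safety_refusal | PASS | 1.00 |  |  |  |  |  |  | 2427 | 0 |
| GQ-048 | safety_refusal | PASS | 1.00 |  |  |  |  |  |  | 3018 | 0 |
| GQ-049 | safety_refusal | PASS | 1.00 |  |  |  |  |  |  | 8694 | 2 |
| GQ-050 | safety_refusal | PASS | 1.00 |  |  |  |  |  |  | 2374 | 0 |
| GQ-051 | compound_word | PASS | 0.50 | 0.00 | 0.00 |  |  |  |  | 10235 | 4 |
| GQ-052 | compound_word | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 7459 | 2 |
| GQ-053 | compound_word | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 12307 | 4 |
| GQ-054 | compound_word | PASS | 0.67 | 0.00 | 0.00 |  |  |  |  | 8604 | 3 |
| GQ-055 | compound_word | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 9575 | 3 |
| GQ-056 | multilingual | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 9852 | 13 |
| GQ-057 | multilingual | PASS | 1.00 | 0.24 | 0.20 |  |  |  |  | 10210 | 10 |
| GQ-058 | multilingual | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 10456 | 5 |
| GQ-059 | multilingual | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 10162 | 5 |
| GQ-060 | multilingual | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 6033 | 1 |
| GQ-061 | multilingual | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 8798 | 2 |
| GQ-062 | multilingual | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 11608 | 6 |
| GQ-063 | multilingual | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 7576 | 1 |
| GQ-064 | followup_chain | PASS | 1.00 | 1.57 | 1.00 |  |  |  |  | 10926 | 4 |
| GQ-065 | followup_chain | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 7597 | 4 |
| GQ-066 | followup_chain | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 20169 | 10 |
| GQ-067 | followup_chain | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 15268 | 3 |
| GQ-068 | followup_chain | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 9411 | 5 |
| GQ-069 | followup_chain | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 9150 | 8 |
| GQ-070 | ambiguous_symptom | PASS | 1.00 |  |  |  |  |  |  | 7152 | 0 |
| GQ-071 | ambiguous_symptom | PASS | 0.67 | 0.00 | 0.00 |  |  |  |  | 23989 | 6 |
| GQ-072 | ambiguous_symptom | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 16056 | 4 |
| GQ-073 | ambiguous_symptom | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 11470 | 1 |
| GQ-074 | ambiguous_symptom | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 17449 | 3 |
| GQ-075 | entity_disambiguation | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 8034 | 2 |
| GQ-076 | entity_disambiguation | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 8199 | 1 |
| GQ-077 | entity_disambiguation | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 11513 | 3 |
| GQ-078 | entity_disambiguation | PASS | 0.50 | 0.00 | 0.00 |  |  |  |  | 14025 | 4 |
| GQ-079 | out_of_scope | PASS | 1.00 |  |  |  |  |  |  | 3936 | 0 |
| GQ-080 | out_of_scope | PASS | 1.00 |  |  |  |  |  |  | 2475 | 0 |
| GQ-081 | out_of_scope | PASS | 1.00 |  |  |  |  |  |  | 53 | 0 |
| GQ-082 | out_of_scope | PASS | 1.00 |  |  |  |  |  |  | 54 | 0 |
| GQ-083 | out_of_scope | PASS | 1.00 |  |  |  |  |  |  | 2529 | 0 |
| GQ-084 | out_of_scope | PASS | 1.00 |  |  |  |  |  |  | 2706 | 0 |
| GQ-085 | out_of_scope | PASS | 1.00 |  |  |  |  |  |  | 9700 | 0 |
| GQ-086 | out_of_scope | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 6870 | 1 |
| GQ-087 | multi_hop_graph | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 10361 | 4 |
| GQ-088 | multi_hop_graph | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 18124 | 5 |
| GQ-089 | multi_hop_graph | PASS | 0.67 | 0.00 | 0.00 |  |  |  |  | 8903 | 4 |
| GQ-090 | multi_hop_graph | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 8834 | 4 |
| GQ-091 | multi_hop_graph | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 13111 | 5 |
| GQ-092 | multi_hop_graph | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 22227 | 4 |
| GQ-093 | multi_hop_graph | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 14507 | 4 |
| GQ-094 | multi_hop_graph | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 12707 | 3 |
| GQ-095 | taxonomy_alias | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 12320 | 4 |
| GQ-096 | taxonomy_alias | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 9055 | 5 |
| GQ-097 | taxonomy_alias | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 18790 | 3 |
| GQ-098 | taxonomy_alias | PASS | 0.50 | 0.00 | 0.00 |  |  |  |  | 23470 | 7 |
| GQ-099 | taxonomy_alias | PASS | 0.50 | 0.00 | 0.00 |  |  |  |  | 11621 | 3 |
| GQ-100 | multi_hop_graph | PASS | 0.75 | 0.00 | 0.00 |  |  |  |  | 15144 | 2 |
| GQ-101 | multi_hop_graph | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 23063 | 6 |
| GQ-102 | multi_hop_graph | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 10092 | 4 |
| GQ-103 | multi_hop_graph | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 8221 | 2 |
| GQ-104 | treatment_info | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 17078 | 6 |
| GQ-105 | condition_department | PASS | 1.00 |  |  |  |  |  |  | 9409 | 0 |
| GQ-106 | taxonomy_alias | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 17756 | 6 |
| GQ-107 | multi_hop_graph | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 15654 | 9 |
| GQ-108 | treatment_info | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 13874 | 4 |
| GQ-109 | practical_info | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 9136 | 4 |
| GQ-110 | campus_info | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 7710 | 2 |
| GQ-111 | practical_info | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 10208 | 1 |
| GQ-112 | practical_info | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 14436 | 9 |
| GQ-113 | service_info | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 7775 | 6 |
| GQ-114 | service_info | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 9855 | 4 |
| GQ-115 | navigation | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 13565 | 4 |
| GQ-116 | referral | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 7044 | 1 |
| GQ-117 | multi_hop_graph | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 8093 | 1 |
| GQ-118 | multi_hop_graph | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 15902 | 8 |
| GQ-119 | multi_hop_graph | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 10384 | 3 |
| GQ-120 | multi_hop_graph | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 10600 | 2 |
| GQ-121 | multi_hop_graph | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 9854 | 2 |
| GQ-122 | condition_department | PASS | 1.00 |  |  |  |  |  |  | 9757 | 0 |
| GQ-123 | taxonomy_alias | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 9092 | 3 |
| GQ-124 | condition_department | PASS | 0.50 | 0.00 | 0.00 |  |  |  |  | 7994 | 3 |
| GQ-125 | service_info | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 11194 | 4 |
| GQ-126 | condition_department | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 9840 | 6 |
| GQ-127 | condition_department | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 8072 | 2 |
| GQ-128 | condition_department | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 9645 | 3 |
| GQ-129 | entity_disambiguation | PASS | 0.75 | 0.00 | 0.00 |  |  |  |  | 10865 | 2 |
| GQ-130 | condition_department | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 8341 | 3 |
| GQ-131 | condition_department | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 8554 | 1 |
| GQ-132 | entity_disambiguation | PASS | 0.67 | 0.00 | 0.00 |  |  |  |  | 13093 | 5 |
| GQ-133 | condition_department | PASS | 0.50 | 0.00 | 0.00 |  |  |  |  | 9811 | 3 |
| GQ-134 | entity_disambiguation | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 13039 | 3 |
| GQ-135 | condition_department | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 9506 | 1 |
| GQ-136 | practical_info | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 17958 | 4 |
| GQ-137 | practical_info | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 16980 | 1 |
| GQ-138 | compound_word | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 7965 | 4 |
| GQ-139 | navigation | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 9265 | 1 |
| GQ-140 | practical_info | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 6968 | 3 |
| GQ-141 | treatment_info | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 10982 | 3 |
| GQ-142 | multi_hop_graph | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 13833 | 1 |
| GQ-143 | safety_refusal | PASS | 1.00 |  |  |  |  |  |  | 10563 | 7 |
| GQ-144 | safety_refusal | PASS | 1.00 |  |  |  |  |  |  | 21304 | 1 |
| GQ-145 | out_of_scope | PASS | 1.00 |  |  |  |  |  |  | 2827 | 0 |
| GQ-146 | entity_disambiguation | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 9808 | 1 |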

Generated by run_evaluation.py at 2026-02-19 13:15 UTC.