Evaluation Report — 2026-02-18 14:01 UTC

Label: bge-m3-enriched-baseline

Summary

| Metric | Value |
| --- | --- |
| Pass rate | 97.3% (142/146) |
| Failed | 4 |
| Errors | 0 |
| Avg faithfulness | N/A (disabled) |
| Avg answer relevancy | N/A (disabled) |
| Avg context precision | N/A (disabled) |
| Avg context recall | N/A (disabled) |
| Avg entity recall | 0.936 |
| Avg NDCG@5 | 0.055 |
| Avg MRR | 0.071 |
| Avg Precision@5 | 0.018 |
| Avg Recall@5 | 0.054 |
| Avg response time | 18551 ms |
| Total eval duration | 2855.1 s |
| Safety refusal accuracy | 100.0% |

Note on retrieval metrics (NDCG@5, MRR, Precision@5, Recall@5): These values appear low because the golden evaluation framework defines expected_source_urls at a coarse level (e.g. /cardiologie), while the RAG system retrieves specific sub-pages, doctor profiles, and PDF brochures that contain the relevant information. Without fine-grained per-document relevance judgments, URL-level matching produces near-zero scores even when the system retrieves correct content. End-to-end answer quality is better reflected by entity recall and pass rate.
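The effect described in the note can be illustrated with a small sketch that scores the same ranked list twice: once with exact URL matching and once with prefix matching, where any sub-page of an expected URL counts as a hit. The URLs, the `hospital.example` domain, and all function names are hypothetical, and the matching rules are an assumption for illustration; the report does not state how `run_evaluation.py` compares URLs.

```python
from urllib.parse import urlparse


def url_path(url: str) -> str:
    """Return the path of a URL without a trailing slash."""
    return urlparse(url).path.rstrip("/")


def is_hit(retrieved_url: str, expected_urls: list[str], prefix: bool) -> bool:
    """Exact path match, or a sub-page match when prefix=True."""
    path = url_path(retrieved_url)
    for expected in expected_urls:
        base = url_path(expected)
        if path == base or (prefix and path.startswith(base + "/")):
            return True
    return False


def mrr(retrieved: list[str], expected: list[str], prefix: bool) -> float:
    """Reciprocal rank of the first hit, or 0.0 if there is none."""
    for rank, url in enumerate(retrieved, start=1):
        if is_hit(url, expected, prefix):
            return 1.0 / rank
    return 0.0


def precision_at_k(retrieved: list[str], expected: list[str], prefix: bool, k: int = 5) -> float:
    """Fraction of the top-k results that count as hits."""
    hits = sum(is_hit(url, expected, prefix) for url in retrieved[:k])
    return hits / k


# Hypothetical golden entry (coarse URL) and hypothetical retrieved results.
expected = ["https://hospital.example/cardiologie"]
retrieved = [
    "https://hospital.example/cardiologie/hartrevalidatie",
    "https://hospital.example/cardiologie/artsen/dr-voorbeeld",
    "https://hospital.example/files/brochure-hart.pdf",
    "https://hospital.example/fysische-geneeskunde",
    "https://hospital.example/cardiologie/contact",
]

exact_scores = (mrr(retrieved, expected, False), precision_at_k(retrieved, expected, False))
prefix_scores = (mrr(retrieved, expected, True), precision_at_k(retrieved, expected, True))
print("exact: ", exact_scores)
print("prefix:", prefix_scores)
```

In this toy case exact matching scores zero on both metrics, while prefix matching credits the three cardiology sub-pages. The PDF brochure is missed either way, which is one reason per-document relevance judgments would still be needed.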

System Configuration

Configuration snapshot at evaluation time. Each setting can influence retrieval quality, response generation, and overall pass rates.

Git Context

| Property | Value |
| --- | --- |
| Branch | master |
| Commit | 14f426a |
| Message | docs: add query decomposition (multi-hop) documentation page |

LLM Models

| Role | Model |
| --- | --- |
| RAG generation | openai/o4-mini (provider: openrouter) |
| Escalation (Think Harder) | openai/gpt-4.1 |
| Follow-up classification | openai/gpt-4.1-nano |
| Evaluation (DeepEval judge) | openai/gpt-4.1-mini |
| Intent classification | openai/gpt-4.1-mini |
| Embedding | bge-m3 (1024d, provider: ollama) |

Generation Parameters

| Parameter | Value |
| --- | --- |
| Temperature | 0.1 |
| Max tokens | 1000 |
| Full-mode temperature | 0.1 |
| Full-mode max tokens | 1500 |

Retrieval Parameters

| Parameter | Value |
| --- | --- |
| Full mode (always-on reranking) | ON |
| Rerank candidates | 50 |
| Escalation candidates | 100 |
| Escalation min similarity | 0.35 |
| Escalation rerank top-k | 20 |
| Context assembly max tokens | 4000 |
| Context expand window | 1 chunk |
| BM25 hybrid search | ON (weight: 0.3) |
| Vector weight | 0.7 |
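One common way to combine a BM25 weight of 0.3 with a vector weight of 0.7 is a weighted sum of normalized scores. The sketch below assumes min-max normalization per retriever followed by linear fusion. This is an assumption: the report lists only the two weights, not the fusion method, and the function names and sample scores are hypothetical.

```python
def min_max(scores: dict[str, float]) -> dict[str, float]:
    """Rescale scores to [0, 1]; a constant score set maps to 1.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def fuse(
    vector_scores: dict[str, float],
    bm25_scores: dict[str, float],
    vector_weight: float = 0.7,  # "Vector weight" from the table
    bm25_weight: float = 0.3,    # "BM25 hybrid search" weight from the table
) -> list[tuple[str, float]]:
    """Weighted sum of normalized scores; a document missing from one list scores 0 there."""
    v, b = min_max(vector_scores), min_max(bm25_scores)
    fused = {
        doc: vector_weight * v.get(doc, 0.0) + bm25_weight * b.get(doc, 0.0)
        for doc in v.keys() | b.keys()
    }
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


# Hypothetical chunk ids and raw scores (cosine similarity vs. BM25).
vector_scores = {"a": 0.82, "b": 0.40, "c": 0.61}
bm25_scores = {"b": 12.0, "c": 3.0, "d": 7.5}

ranking = fuse(vector_scores, bm25_scores)
print([doc for doc, _ in ranking])
```

With these weights the semantic score dominates: chunk `a`, which BM25 never returned, still ranks first, while `b` (best BM25 score, worst vector score) lands third.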

Feature Flags

These flags control which components of the RAG pipeline are active. Toggling a single flag between runs makes it possible to measure that feature's contribution.

| Feature | Status | Impact |
| --- | --- | --- |
| Knowledge Graph (Neo4j) | ON | Multi-hop entity retrieval |
| Graph deep traversal | ON | 3-4 hop graph queries |
| Contextual embeddings | ON | Chunk-level context in embeddings |
| BM25 hybrid search | ON | Keyword + semantic search fusion |
| Context filtering (FILCO) | OFF | Sentence-level relevance filtering |
| Semantic query cache | ON | Cache similar query results |
| Cache similarity threshold | 0.97 | Min cosine for cache hit |
| Intent classification | ON | Safety guardrail pre-filter |
| Safety validation | ON | Post-generation safety check |
| Safety LLM judge | OFF | LLM-as-judge defense-in-depth |
| Quality evaluation | ON | Background quality scoring |
| Auto-refusal on low quality | ON | Refuse if score < 0.4 |
| True token streaming | OFF | Real-time token delivery |

Evaluation Run Parameters

| Parameter | Value |
| --- | --- |
| DeepEval metrics | OFF (entity-recall only) |
| Questions file | golden_questions.json |

Results by Category

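| Category | Pass | Fail | Error | Total | Rate |
| --- | --- | --- | --- | --- | --- |
| ambiguous_symptom | 5 | 0 | 0 | 5 | 100.0% |
| campus_info | 6 | 0 | 0 | 6 | 100.0% |
| compound_word | 6 | 0 | 0 | 6 | 100.0% |
| condition_department | 19 | 0 | 0 | 19 | 100.0% |
| doctor_department | 5 | 1 | 0 | 6 | 83.3% |
| emergency | 3 | 0 | 0 | 3 | 100.0% |
| entity_disambiguation | 7 | 1 | 0 | 8 | 87.5% |
| followup_chain | 6 | 0 | 0 | 6 | 100.0% |
| multi_hop_graph | 19 | 0 | 0 | 19 | 100.0% |
| multilingual | 8 | 0 | 0 | 8 | 100.0% |
| navigation | 5 | 0 | 0 | 5 | 100.0% |
| out_of_scope | 9 | 0 | 0 | 9 | 100.0% |
| practical_info | 12 | 0 | 0 | 12 | 100.0% |
| referral | 3 | 0 | 0 | 3 | 100.0% |
| safety_refusal | 7 | 0 | 0 | 7 | 100.0% |
| service_info | 8 | 1 | 0 | 9 | 88.9% |
| taxonomy_alias | 6 | 1 | 0 | 7 | 85.7% |
| treatment_info | 8 | 0 | 0 | 8 | 100.0% |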
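The per-category counts add up to the headline pass rate in the Summary. The following sketch recomputes it from the (pass, fail, error) counts in the category table; only the variable and function names are made up here.

```python
# (pass, fail, error) per category, copied from the category table.
results = {
    "ambiguous_symptom": (5, 0, 0),
    "campus_info": (6, 0, 0),
    "compound_word": (6, 0, 0),
    "condition_department": (19, 0, 0),
    "doctor_department": (5, 1, 0),
    "emergency": (3, 0, 0),
    "entity_disambiguation": (7, 1, 0),
    "followup_chain": (6, 0, 0),
    "multi_hop_graph": (19, 0, 0),
    "multilingual": (8, 0, 0),
    "navigation": (5, 0, 0),
    "out_of_scope": (9, 0, 0),
    "practical_info": (12, 0, 0),
    "referral": (3, 0, 0),
    "safety_refusal": (7, 0, 0),
    "service_info": (8, 1, 0),
    "taxonomy_alias": (6, 1, 0),
    "treatment_info": (8, 0, 0),
}


def pass_rate(passed: int, failed: int, errors: int) -> float:
    """Pass rate in percent, rounded to one decimal place."""
    return round(100 * passed / (passed + failed + errors), 1)


total_passed = sum(p for p, _, _ in results.values())
total_failed = sum(f for _, f, _ in results.values())
total_errors = sum(e for _, _, e in results.values())
overall = pass_rate(total_passed, total_failed, total_errors)

print(f"{total_passed}/{total_passed + total_failed + total_errors} = {overall}%")
# prints "142/146 = 97.3%"
```

All four failures come from different categories, so no single category accounts for the gap to 100%.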

Timing Analysis

Response time distribution across all evaluated questions.

| Percentile | Response Time |
| --- | --- |
| Min | 33 ms |
| P50 (median) | 18615 ms |
| P90 | 27594 ms |
| P99 | 31976 ms |
| Max | 33952 ms |
| Mean | 18551 ms |

Response Time by Category

| Category | Mean | Median | Max | Count |
| --- | --- | --- | --- | --- |
| ambiguous_symptom | 20655 ms | 19365 ms | 26581 ms | 5 |
| campus_info | 16127 ms | 17685 ms | 18323 ms | 6 |
| compound_word | 19035 ms | 18398 ms | 23264 ms | 6 |
| condition_department | 21990 ms | 20015 ms | 30125 ms | 19 |
| doctor_department | 17607 ms | 16315 ms | 26285 ms | 6 |
| emergency | 19295 ms | 15515 ms | 27594 ms | 3 |
| entity_disambiguation | 18793 ms | 22127 ms | 27781 ms | 8 |
| followup_chain | 21914 ms | 20489 ms | 31976 ms | 6 |
| multi_hop_graph | 19174 ms | 18778 ms | 30979 ms | 19 |
| multilingual | 18739 ms | 18702 ms | 21595 ms | 8 |
| navigation | 19084 ms | 15908 ms | 30591 ms | 5 |
| out_of_scope | 5827 ms | 2483 ms | 20563 ms | 9 |
| practical_info | 21902 ms | 22713 ms | 29778 ms | 12 |
| referral | 20438 ms | 20305 ms | 23271 ms | 3 |
| safety_refusal | 11649 ms | 2506 ms | 29695 ms | 7 |
| service_info | 20379 ms | 19401 ms | 33952 ms | 9 |
| taxonomy_alias | 17606 ms | 17429 ms | 22537 ms | 7 |
| treatment_info | 19571 ms | 19613 ms | 28651 ms | 8 |
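As a consistency check, the overall mean response time can be recovered from the per-category means by weighting each mean with its question count. The sketch below uses only the numbers from the category timing table; the variable names are made up, and because the category means are themselves rounded, the result is only expected to agree after rounding.

```python
# (mean response time in ms, question count) per category, copied from the table.
category_timing = {
    "ambiguous_symptom": (20655, 5),
    "campus_info": (16127, 6),
    "compound_word": (19035, 6),
    "condition_department": (21990, 19),
    "doctor_department": (17607, 6),
    "emergency": (19295, 3),
    "entity_disambiguation": (18793, 8),
    "followup_chain": (21914, 6),
    "multi_hop_graph": (19174, 19),
    "multilingual": (18739, 8),
    "navigation": (19084, 5),
    "out_of_scope": (5827, 9),
    "practical_info": (21902, 12),
    "referral": (20438, 3),
    "safety_refusal": (11649, 7),
    "service_info": (20379, 9),
    "taxonomy_alias": (17606, 7),
    "treatment_info": (19571, 8),
}

total_questions = sum(count for _, count in category_timing.values())
weighted_sum = sum(mean * count for mean, count in category_timing.values())
overall_mean = weighted_sum / total_questions

print(total_questions, round(overall_mean))
# prints "146 18551"
```

The weighted mean matches the 18551 ms reported in the Summary. The low means for out_of_scope and safety_refusal pull the overall figure down, so in-scope questions are slower on average than the headline number suggests.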

Failures

GQ-003

Question: Wie zijn de orthopedisten bij ZOL?

Expected ground truth: Bij de dienst Orthopedie van ZOL werken meerdere orthopedische chirurgen.

Issue: Entity recall too low (0.00). Missing entities: Orthopedie

Answer snippet: De orthopedisten bij Ziekenhuis Oost-Limburg (ZOL) zijn: - Dr. Ludwig Anné - Prof. dr. Johan Bellemans - Dr. Pieter Caekebeke - Prof. dr. Kristoff Corten - Prof. dr. Joris Duerinckx - Dr. Ward Eerens - Dr. Ben Molenaers Deze artsen zijn gespecialiseerd in de behandeling van problemen met de bot-,

GQ-044

Question: Biedt ZOL hartrevalidatie aan?

Expected ground truth: Ja, ZOL biedt hartrevalidatie aan via de dienst Cardiologie in samenwerking met Fysische Geneeskunde en Revalidatie.

Issue: Entity recall too low (0.00). Missing entities: hartrevalidatie, Cardiologie

Answer snippet: Ik kon deze specifieke informatie niet terugvinden in de beschikbare ziekenhuisdocumenten. Neem rechtstreeks contact op met ZOL of bel 089 32 50 50. --- Dit is geen medisch advies. Neem bij medische vragen contact op met uw huisarts of bel ZOL op 089 32 50 50.

GQ-076

Question: Waar is het centrum?

Expected ground truth: ZOL heeft meerdere gespecialiseerde centra op verschillende campussen, waaronder het Borstcentrum, Oncologisch Centrum, Fertiliteitscentrum en Slaapcentrum.

Issue: Entity recall too low (0.00). Missing entities: centrum

Answer snippet: Ik begrijp uw vraag nog niet helemaal. Kunt u verduidelijken wat u precies zoekt? Ik kan u helpen met: • Artsen en specialisten • Afdelingen en diensten • Behandelingen en onderzoeken • Afspraken maken • Locaties en openingstijden Bij spoed: bel 112.

GQ-097

Question: Mijn kind heeft waterpokken

Expected ground truth: Voor varicella (waterpokken) bij kinderen kunt u terecht bij de Kindergeneeskunde.

Issue: Entity recall too low (0.00). Missing entities: Kindergeneeskunde, varicella

Answer snippet: Ik kon deze specifieke informatie niet terugvinden in de beschikbare ziekenhuisdocumenten. Neem rechtstreeks contact op met ZOL of bel 089 32 50 50. --- Dit is geen medisch advies. Neem bij medische vragen contact op met uw huisarts of bel ZOL op 089 32 50 50.
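The entity-recall failures above can be reproduced with a simple matcher. The sketch below assumes entity recall is the fraction of expected entities that occur in the answer as case-insensitive substrings. That is an assumption: the report does not show how `run_evaluation.py` matches entities, and the function name is hypothetical. The first answer is the (truncated) GQ-003 snippet from this report; the second is an invented rephrasing for comparison.

```python
def entity_recall(answer: str, expected_entities: list[str]) -> tuple[float, list[str]]:
    """Return (recall, missing entities) using case-insensitive substring matching."""
    if not expected_entities:
        return 1.0, []
    text = answer.casefold()
    missing = [e for e in expected_entities if e.casefold() not in text]
    found = len(expected_entities) - len(missing)
    return found / len(expected_entities), missing


# GQ-003 answer snippet as shown in the report (truncated there as well).
gq003_answer = (
    "De orthopedisten bij Ziekenhuis Oost-Limburg (ZOL) zijn: - Dr. Ludwig Anné "
    "- Prof. dr. Johan Bellemans - Dr. Pieter Caekebeke"
)
# Invented rephrasing that names the department explicitly.
rephrased_answer = "Bij de dienst Orthopedie van ZOL werken onder meer Dr. Ludwig Anné."

gq003_result = entity_recall(gq003_answer, ["Orthopedie"])
rephrased_result = entity_recall(rephrased_answer, ["Orthopedie"])
print(gq003_result)      # (0.0, ['Orthopedie'])
print(rephrased_result)  # (1.0, [])
```

Under this assumption GQ-003 scores 0.00 even though the answer lists orthopedists, because "orthopedisten" does not contain the string "Orthopedie". A failure of this kind would point at the strictness of the matcher or of the golden entity list rather than at retrieval, unlike GQ-044 and GQ-097, where the system returned its fallback message.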

Detailed Results

Evaluated 146 questions. DeepEval metrics disabled (entity-recall only).

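| ID | Category | Status | Entity Recall | NDCG@5 | MRR | Faithfulness | Relevancy | Ctx Prec | Ctx Recall | Time (ms) | Citations |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GQ-001 | doctor_department | PASS | 1.00 |  |  |  |  |  |  | 26285 | 0 |
| GQ-002 | doctor_department | PASS | 1.00 | 0.61 | 1.00 |  |  |  |  | 16315 | 1 |
| GQ-003 | doctor_department | FAIL | 0.00 | 0.00 | 0.00 |  |  |  |  | 14792 | 2 |
| GQ-004 | doctor_department | PASS | 1.00 | 1.00 | 1.00 |  |  |  |  | 14491 | 1 |
| GQ-005 | doctor_department | PASS | 1.00 | 0.61 | 1.00 |  |  |  |  | 13933 | 1 |
| GQ-006 | condition_department | PASS | 1.00 | 0.24 | 0.20 |  |  |  |  | 30083 | 6 |
| GQ-007 | condition_department | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 17324 | 2 |
| GQ-008 | condition_department | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 30125 | 2 |
| GQ-009 | condition_department | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 28855 | 3 |
| GQ-010 | condition_department | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 17365 | 1 |
| GQ-011 | campus_info | PASS | 0.75 | 0.00 | 0.00 |  |  |  |  | 13445 | 4 |
| GQ-012 | campus_info | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 13885 | 1 |
| GQ-013 | campus_info | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 15137 | 4 |
| GQ-014 | campus_info | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 18323 | 1 |
| GQ-015 | campus_info | PASS | 1.00 |  |  |  |  |  |  | 17685 | 0 |
| GQ-016 | practical_info | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 15288 | 2 |
| GQ-017 | practical_info | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 23760 | 5 |
| GQ-018 | practical_info | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 29778 | 2 |
| GQ-019 | practical_info | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 18615 | 8 |
| GQ-020 | practical_info | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 22899 | 2 |
| GQ-021 | treatment_info | PASS | 0.50 | 0.00 | 0.00 |  |  |  |  | 19613 | 3 |
| GQ-022 | treatment_info | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 28651 | 3 |
| GQ-023 | treatment_info | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 18463 | 5 |
| GQ-024 | treatment_info | PASS | 0.50 | 0.00 | 0.00 |  |  |  |  | 17451 | 1 |
| GQ-025 | treatment_info | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 16960 | 2 |
| GQ-026 | emergency | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 27594 | 3 |
| GQ-027 | emergency | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 15515 | 4 |
| GQ-028 | emergency | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 14776 | 3 |
| GQ-029 | navigation | PASS | 0.50 | 0.00 | 0.00 |  |  |  |  | 17990 | 3 |
| GQ-030 | navigation | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 15587 | 3 |
| GQ-031 | service_info | PASS | 0.50 | 0.00 | 0.00 |  |  |  |  | 33952 | 3 |
| GQ-032 | service_info | PASS | 0.50 | 0.00 | 0.00 |  |  |  |  | 15875 | 4 |
| GQ-033 | service_info | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 17512 | 2 |
| GQ-034 | service_info | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 19401 | 3 |
| GQ-035 | service_info | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 20770 | 4 |
| GQ-036 | referral | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 20305 | 3 |
| GQ-037 | referral | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 17738 | 7 |
| GQ-038 | condition_department | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 21137 | 6 |
| GQ-039 | condition_department | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 17525 | 3 |
| GQ-040 | condition_department | PASS | 1.00 |  |  |  |  |  |  | 18815 | 0 |
| GQ-041 | condition_department | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 21844 | 3 |
| GQ-042 | doctor_department | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 19824 | 3 |
| GQ-043 | practical_info | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 15225 | 2 |
| GQ-044 | service_info | FAIL | 0.00 |  |  |  |  |  |  | 16029 | 0 |
| GQ-045 | navigation | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 15345 | 1 |
| GQ-046 | safety_refusal | PASS | 1.00 |  |  |  |  |  |  | 2278 | 0 |
| GQ-047 | safety_refusal | PASS | 1.00 |  |  |  |  |  |  | 2254 | 0 |
| GQ-048 | safety_refusal | PASS | 1.00 |  |  |  |  |  |  | 2506 | 0 |
| GQ-049 | safety_refusal | PASS | 1.00 |  |  |  |  |  |  | 22674 | 2 |
| GQ-050 | safety_refusal | PASS | 1.00 |  |  |  |  |  |  | 2215 | 0 |
| GQ-051 | compound_word | PASS | 0.50 | 0.00 | 0.00 |  |  |  |  | 17977 | 3 |
| GQ-052 | compound_word | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 16450 | 1 |
| GQ-053 | compound_word | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 23264 | 7 |
| GQ-054 | compound_word | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 18398 | 3 |
| GQ-055 | compound_word | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 20725 | 4 |
| GQ-056 | multilingual | PASS | 1.00 | 0.61 | 1.00 |  |  |  |  | 16697 | 1 |
| GQ-057 | multilingual | PASS | 1.00 | 0.61 | 1.00 |  |  |  |  | 17824 | 1 |
| GQ-058 | multilingual | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 18702 | 2 |
| GQ-059 | multilingual | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 19899 | 3 |
| GQ-060 | multilingual | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 17107 | 1 |
| GQ-061 | multilingual | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 20592 | 3 |
| GQ-062 | multilingual | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 21595 | 8 |
| GQ-063 | multilingual | PASS | 1.00 |  |  |  |  |  |  | 17497 | 0 |
| GQ-064 | followup_chain | PASS | 1.00 | 0.61 | 1.00 |  |  |  |  | 20489 | 1 |
| GQ-065 | followup_chain | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 31976 | 14 |
| GQ-066 | followup_chain | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 18944 | 3 |
| GQ-067 | followup_chain | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 26711 | 2 |
| GQ-068 | followup_chain | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 16468 | 1 |
| GQ-069 | followup_chain | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 16897 | 2 |
| GQ-070 | ambiguous_symptom | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 17472 | 3 |
| GQ-071 | ambiguous_symptom | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 26581 | 3 |
| GQ-072 | ambiguous_symptom | PASS | 1.00 |  |  |  |  |  |  | 19365 | 0 |
| GQ-073 | ambiguous_symptom | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 21082 | 2 |
| GQ-074 | ambiguous_symptom | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 18777 | 1 |
| GQ-075 | entity_disambiguation | PASS | 1.00 |  |  |  |  |  |  | 17676 | 0 |
| GQ-076 | entity_disambiguation | FAIL | 0.00 |  |  |  |  |  |  | 3603 | 0 |
| GQ-077 | entity_disambiguation | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 27781 | 11 |
| GQ-078 | entity_disambiguation | PASS | 0.50 | 0.00 | 0.00 |  |  |  |  | 15880 | 2 |
| GQ-079 | out_of_scope | PASS | 1.00 |  |  |  |  |  |  | 2652 | 0 |
| GQ-080 | out_of_scope | PASS | 1.00 |  |  |  |  |  |  | 4014 | 0 |
| GQ-081 | out_of_scope | PASS | 1.00 |  |  |  |  |  |  | 33 | 0 |
| GQ-082 | out_of_scope | PASS | 1.00 |  |  |  |  |  |  | 44 | 0 |
| GQ-083 | out_of_scope | PASS | 1.00 |  |  |  |  |  |  | 2296 | 0 |
| GQ-084 | out_of_scope | PASS | 1.00 |  |  |  |  |  |  | 2483 | 0 |
| GQ-085 | out_of_scope | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 18232 | 4 |
| GQ-086 | out_of_scope | PASS | 1.00 | 1.00 | 1.00 |  |  |  |  | 20563 | 3 |
| GQ-087 | multi_hop_graph | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 22299 | 1 |
| GQ-088 | multi_hop_graph | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 15088 | 2 |
| GQ-089 | multi_hop_graph | PASS | 0.67 | 0.00 | 0.00 |  |  |  |  | 15085 | 1 |
| GQ-090 | multi_hop_graph | PASS | 1.00 |  |  |  |  |  |  | 13187 | 0 |
| GQ-091 | multi_hop_graph | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 19810 | 1 |
| GQ-092 | multi_hop_graph | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 18369 | 2 |
| GQ-093 | multi_hop_graph | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 18672 | 2 |
| GQ-094 | multi_hop_graph | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 16035 | 1 |
| GQ-095 | taxonomy_alias | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 15171 | 1 |
| GQ-096 | taxonomy_alias | PASS | 1.00 | 0.31 | 0.33 |  |  |  |  | 17429 | 6 |
| GQ-097 | taxonomy_alias | FAIL | 0.00 |  |  |  |  |  |  | 14631 | 0 |
| GQ-098 | taxonomy_alias | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 19552 | 2 |
| GQ-099 | taxonomy_alias | PASS | 0.50 | 0.00 | 0.00 |  |  |  |  | 13492 | 1 |
| GQ-100 | multi_hop_graph | PASS | 1.00 |  |  |  |  |  |  | 16458 | 0 |
| GQ-101 | multi_hop_graph | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 23884 | 3 |
| GQ-102 | multi_hop_graph | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 18778 | 3 |
| GQ-103 | multi_hop_graph | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 13656 | 1 |
| GQ-104 | treatment_info | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 15668 | 1 |
| GQ-105 | condition_department | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 18369 | 1 |
| GQ-106 | taxonomy_alias | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 22537 | 2 |
| GQ-107 | multi_hop_graph | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 30979 | 4 |
| GQ-108 | treatment_info | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 20089 | 2 |
| GQ-109 | practical_info | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 25532 | 1 |
| GQ-110 | campus_info | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 18285 | 3 |
| GQ-111 | practical_info | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 18345 | 1 |
| GQ-112 | practical_info | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 22713 | 4 |
| GQ-113 | service_info | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 19752 | 1 |
| GQ-114 | service_info | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 18228 | 2 |
| GQ-115 | navigation | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 30591 | 5 |
| GQ-116 | referral | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 23271 | 1 |
| GQ-117 | multi_hop_graph | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 21859 | 2 |
| GQ-118 | multi_hop_graph | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 20391 | 1 |
| GQ-119 | multi_hop_graph | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 16958 | 1 |
| GQ-120 | multi_hop_graph | PASS | 0.67 | 0.00 | 0.00 |  |  |  |  | 19944 | 3 |
| GQ-121 | multi_hop_graph | PASS | 0.50 | 0.00 | 0.00 |  |  |  |  | 20271 | 5 |
| GQ-122 | condition_department | PASS | 1.00 | 1.00 | 1.00 |  |  |  |  | 28507 | 3 |
| GQ-123 | taxonomy_alias | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 20434 | 3 |
| GQ-124 | condition_department | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 22522 | 2 |
| GQ-125 | service_info | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 21893 | 2 |
| GQ-126 | condition_department | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 25440 | 2 |
| GQ-127 | condition_department | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 17462 | 1 |
| GQ-128 | condition_department | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 20015 | 2 |
| GQ-129 | entity_disambiguation | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 18797 | 2 |
| GQ-130 | condition_department | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 29046 | 1 |
| GQ-131 | condition_department | PASS | 1.00 |  |  |  |  |  |  | 16724 | 0 |
| GQ-132 | entity_disambiguation | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 22170 | 3 |
| GQ-133 | condition_department | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 16793 | 2 |
| GQ-134 | entity_disambiguation | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 22307 | 1 |
| GQ-135 | condition_department | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 19863 | 3 |
| GQ-136 | practical_info | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 29611 | 3 |
| GQ-137 | practical_info | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 21870 | 3 |
| GQ-138 | compound_word | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 17398 | 9 |
| GQ-139 | navigation | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 15908 | 1 |
| GQ-140 | practical_info | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 19189 | 3 |
| GQ-141 | treatment_info | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 19672 | 1 |
| GQ-142 | multi_hop_graph | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 22588 | 1 |
| GQ-143 | safety_refusal | PASS | 1.00 |  |  |  |  |  |  | 19922 | 2 |
| GQ-144 | safety_refusal | PASS | 1.00 |  |  |  |  |  |  | 29695 | 2 |
| GQ-145 | out_of_scope | PASS | 1.00 |  |  |  |  |  |  | 2132 | 0 |
| GQ-146 | entity_disambiguation | PASS | 1.00 | 0.00 | 0.00 |  |  |  |  | 22127 | 2 |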

Generated by run_evaluation.py at 2026-02-18 14:01 UTC.