Evaluation Report — 2026-02-19 10:23 UTC

Label: graph-quality-fixes-v27

Summary

| Metric | Value |
| --- | --- |
| Pass rate | 98.6% (144/146) |
| Failed | 2 |
| Errors | 0 |
| Avg faithfulness | N/A (disabled) |
| Avg answer relevancy | N/A (disabled) |
| Avg context precision | N/A (disabled) |
| Avg context recall | N/A (disabled) |
| Avg entity recall | 0.942 |
| Avg response time | 20255 ms |
| Total eval duration | 3104.1 s |
| Safety refusal accuracy | 100.0% |

System Configuration

Configuration snapshot at evaluation time. Each setting can influence retrieval quality, response generation, and overall pass rates.

Git Context

| Property | Value |
| --- | --- |
| Branch | feature/chatbot-ui-improvements |
| Commit | 14f426a |
| Message | docs: add query decomposition (multi-hop) documentation page |

LLM Models

| Role | Model |
| --- | --- |
| RAG generation | openai/o4-mini (provider: openrouter) |
| Escalation (Think Harder) | openai/gpt-4.1 |
| Follow-up classification | openai/gpt-4.1-nano |
| Evaluation (DeepEval judge) | openai/gpt-4.1-mini |
| Intent classification | openai/gpt-4.1-mini |
| Embedding | bge-m3 (1024d, provider: ollama) |

Generation Parameters

| Parameter | Value |
| --- | --- |
| Temperature | 0.1 |
| Max tokens | 1000 |
| Full-mode temperature | 0.1 |
| Full-mode max tokens | 1500 |

Retrieval Parameters

| Parameter | Value |
| --- | --- |
| Full mode (always-on reranking) | ON |
| Rerank candidates | 50 |
| Escalation candidates | 100 |
| Escalation min similarity | 0.35 |
| Escalation rerank top-k | 20 |
| Context assembly max tokens | 4000 |
| Context expand window | 1 chunk |
| BM25 hybrid search | ON (weight: 0.3) |
| Vector weight | 0.7 |
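
The report gives the two fusion weights (BM25 0.3, vector 0.7) but not how the scores are combined. The sketch below shows one common way to do it, a weighted sum over min-max normalized scores. The normalization step, the function names, and the treatment of documents missing from one ranker (score 0.0) are assumptions made for illustration, not the pipeline's confirmed behavior.

```python
from typing import Dict

BM25_WEIGHT = 0.3    # "BM25 hybrid search ... weight: 0.3" in the report
VECTOR_WEIGHT = 0.7  # "Vector weight 0.7" in the report


def min_max_normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1]; if all scores are equal, map them to 1.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def fuse_scores(bm25: Dict[str, float], vector: Dict[str, float]) -> Dict[str, float]:
    """Weighted sum of normalized keyword and vector scores (assumed scheme)."""
    bm25_n = min_max_normalize(bm25)
    vector_n = min_max_normalize(vector)
    doc_ids = set(bm25_n) | set(vector_n)
    # A document returned by only one ranker gets 0.0 from the other (assumption).
    return {
        d: BM25_WEIGHT * bm25_n.get(d, 0.0) + VECTOR_WEIGHT * vector_n.get(d, 0.0)
        for d in doc_ids
    }


# Hypothetical chunk ids and raw scores, for illustration only.
fused = fuse_scores(
    bm25={"chunk_a": 10.0, "chunk_b": 0.0},
    vector={"chunk_a": 0.2, "chunk_b": 0.8, "chunk_c": 0.5},
)
ranking = sorted(fused, key=fused.get, reverse=True)
```

With these made-up inputs, `chunk_b` ranks first because the vector score carries the larger weight, even though `chunk_a` has the best keyword score.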

Feature Flags

These flags control which components of the RAG pipeline are active. Toggling a flag on or off makes it possible to measure that feature's contribution.

| Feature | Status | Impact |
| --- | --- | --- |
| Knowledge Graph (Neo4j) | ON | Multi-hop entity retrieval |
| Graph deep traversal | ON | 3-4 hop graph queries |
| Contextual embeddings | ON | Chunk-level context in embeddings |
| BM25 hybrid search | ON | Keyword + semantic search fusion |
| Context filtering (FILCO) | OFF | Sentence-level relevance filtering |
| Semantic query cache | ON | Cache similar query results |
| Cache similarity threshold | 0.97 | Min cosine for cache hit |
| Intent classification | ON | Safety guardrail pre-filter |
| Safety validation | ON | Post-generation safety check |
| Safety LLM judge | OFF | LLM-as-judge defense-in-depth |
| Quality evaluation | ON | Background quality scoring |
| Auto-refusal on low quality | ON | Refuse if score < 0.4 |
| True token streaming | OFF | Real-time token delivery |
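
Two of the flags above are numeric thresholds: a cache hit needs a cosine similarity of at least 0.97, and an answer is refused when its quality score is below 0.4. The sketch below spells out those two comparisons. The function names and the plain-list vector representation are hypothetical; only the two threshold values and their direction come from the report.

```python
import math
from typing import Sequence

CACHE_SIMILARITY_THRESHOLD = 0.97  # "Min cosine for cache hit"
AUTO_REFUSAL_THRESHOLD = 0.4       # "Refuse if score < 0.4"


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as dissimilar (assumption)
    return dot / (norm_a * norm_b)


def is_cache_hit(query_vec: Sequence[float], cached_vec: Sequence[float]) -> bool:
    """A cached result is reused only at or above the minimum cosine."""
    return cosine_similarity(query_vec, cached_vec) >= CACHE_SIMILARITY_THRESHOLD


def should_refuse(quality_score: float) -> bool:
    """Strict comparison: a score of exactly 0.4 is not refused."""
    return quality_score < AUTO_REFUSAL_THRESHOLD
```

A threshold as high as 0.97 means only near-identical queries reuse a cached answer, which would be consistent with the handful of sub-100 ms responses in the timing data, although the report does not attribute them to the cache.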

Evaluation Run Parameters

| Parameter | Value |
| --- | --- |
| DeepEval metrics | OFF (entity-recall only) |
| Questions file | golden_questions.json |

Results by Category

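| Category | Pass | Fail | Error | Total | Rate |
| --- | --- | --- | --- | --- | --- |
| ambiguous_symptom | 5 | 0 | 0 | 5 | 100.0% |
| campus_info | 6 | 0 | 0 | 6 | 100.0% |
| compound_word | 6 | 0 | 0 | 6 | 100.0% |
| condition_department | 19 | 0 | 0 | 19 | 100.0% |
| doctor_department | 5 | 1 | 0 | 6 | 83.3% |
| emergency | 3 | 0 | 0 | 3 | 100.0% |
| entity_disambiguation | 8 | 0 | 0 | 8 | 100.0% |
| followup_chain | 6 | 0 | 0 | 6 | 100.0% |
| multi_hop_graph | 19 | 0 | 0 | 19 | 100.0% |
| multilingual | 8 | 0 | 0 | 8 | 100.0% |
| navigation | 5 | 0 | 0 | 5 | 100.0% |
| out_of_scope | 9 | 0 | 0 | 9 | 100.0% |
| practical_info | 12 | 0 | 0 | 12 | 100.0% |
| referral | 3 | 0 | 0 | 3 | 100.0% |
| safety_refusal | 7 | 0 | 0 | 7 | 100.0% |
| service_info | 9 | 0 | 0 | 9 | 100.0% |
| taxonomy_alias | 6 | 1 | 0 | 7 | 85.7% |
| treatment_info | 8 | 0 | 0 | 8 | 100.0% |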
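
The rates in this table follow from pass count divided by total, rounded to one decimal. A minimal sketch that recomputes the two non-perfect categories and the overall figure from the counts reported here (the helper name `pass_rate` is hypothetical):

```python
def pass_rate(passed: int, total: int) -> float:
    """Pass rate as a percentage, rounded to one decimal place."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100.0 * passed / total, 1)


doctor_department = pass_rate(5, 6)   # 83.3
taxonomy_alias = pass_rate(6, 7)      # 85.7
overall = pass_rate(144, 146)         # 98.6
```

Because these categories hold only six or seven questions, a single failure moves the category rate by roughly 14 to 17 percentage points, so the per-category rates are best read alongside the raw counts.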

Timing Analysis

Response time distribution across all evaluated questions.

| Percentile | Response Time |
| --- | --- |
| Min | 31 ms |
| P50 (median) | 19982 ms |
| P90 | 30890 ms |
| P99 | 47185 ms |
| Max | 50520 ms |
| Mean | 20255 ms |

Response Time by Category

| Category | Mean | Median | Max | Count |
| --- | --- | --- | --- | --- |
| ambiguous_symptom | 33992 ms | 31212 ms | 50520 ms | 5 |
| campus_info | 16906 ms | 16841 ms | 24259 ms | 6 |
| compound_word | 24039 ms | 23260 ms | 35229 ms | 6 |
| condition_department | 20588 ms | 19984 ms | 24875 ms | 19 |
| doctor_department | 17460 ms | 19453 ms | 21687 ms | 6 |
| emergency | 19223 ms | 15848 ms | 27154 ms | 3 |
| entity_disambiguation | 22767 ms | 24391 ms | 27303 ms | 8 |
| followup_chain | 24983 ms | 25790 ms | 44890 ms | 6 |
| multi_hop_graph | 23064 ms | 21704 ms | 32581 ms | 19 |
| multilingual | 15189 ms | 19854 ms | 31370 ms | 8 |
| navigation | 25753 ms | 21702 ms | 47185 ms | 5 |
| out_of_scope | 7847 ms | 2282 ms | 29245 ms | 9 |
| practical_info | 23190 ms | 22749 ms | 47012 ms | 12 |
| referral | 15351 ms | 18325 ms | 25147 ms | 3 |
| safety_refusal | 11423 ms | 2348 ms | 30365 ms | 7 |
| service_info | 20683 ms | 19585 ms | 34873 ms | 9 |
| taxonomy_alias | 19404 ms | 20162 ms | 35925 ms | 7 |
| treatment_info | 21326 ms | 19692 ms | 30890 ms | 8 |
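
The report does not say how medians and percentiles are computed. The sketch below uses a simple index rule, `sorted(values)[int(p * n)]`, which is an assumption; it happens to reproduce the doctor_department row when applied to that category's six response times from the detailed results (GQ-001 to GQ-005 and GQ-042). The function name is hypothetical.

```python
import statistics
from typing import Sequence


def index_percentile(values: Sequence[float], p: float) -> float:
    """Return sorted(values)[int(p * n)], clamped to the last element."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    idx = min(int(p * len(ordered)), len(ordered) - 1)
    return ordered[idx]


# Response times in ms for the six doctor_department questions.
doctor_department_ms = [19525, 13856, 14914, 15326, 19453, 21687]

mean_ms = round(sum(doctor_department_ms) / len(doctor_department_ms))
median_ms = index_percentile(doctor_department_ms, 0.50)
max_ms = max(doctor_department_ms)

print(mean_ms, median_ms, max_ms)  # 17460 19453 21687
```

For an even number of samples this rule picks the upper of the two middle values, the same result as `statistics.median_high`. An interpolating median (`statistics.median`) would give 17389.5 for this category, which is why the reported median can sit above the mean.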

Failures

GQ-003

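Question: Wie zijn de orthopedisten bij ZOL? (English: Who are the orthopedists at ZOL?)

Expected ground truth: Bij de dienst Orthopedie van ZOL werken meerdere orthopedische chirurgen. (English: Several orthopedic surgeons work in the Orthopedie department of ZOL.)

Issue: Entity recall too low (0.00). Missing entities: Orthopedie

Answer snippet: De orthopedisten bij Ziekenhuis Oost-Limburg (ZOL) zijn: - Dr. Ludwig Anné - Prof. dr. Johan Bellemans - Dr. Pieter Caekebeke - Prof. dr. Kristoff Corten - Prof. dr. Joris Duerinckx - Dr. Ward Eerens - Dr. Ben Molenaers Deze artsen zijn gespecialiseerd in de behandeling van problemen met botten, s (English: The orthopedists at Ziekenhuis Oost-Limburg (ZOL) are the seven doctors listed. These doctors specialise in treating problems with bones, ...)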

GQ-097

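Question: Mijn kind heeft waterpokken (English: My child has chickenpox)

Expected ground truth: Voor varicella (waterpokken) bij kinderen kunt u terecht bij de Kindergeneeskunde. (English: For varicella (chickenpox) in children, you can go to Kindergeneeskunde, the pediatrics department.)

Issue: Entity recall too low (0.00). Missing entities: Kindergeneeskunde, varicella

Answer snippet: Ik kon deze specifieke informatie niet terugvinden in de beschikbare ziekenhuisdocumenten. Neem rechtstreeks contact op met ZOL of bel 089 32 50 50. --- Dit is geen medisch advies. Neem bij medische vragen contact op met uw huisarts of bel ZOL op 089 32 50 50. (English: I could not find this specific information in the available hospital documents. Please contact ZOL directly or call 089 32 50 50. --- This is not medical advice. For medical questions, contact your general practitioner or call ZOL on 089 32 50 50.)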
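
Both failures report an entity recall of 0.00. The report does not describe how run_evaluation.py matches entities, so the sketch below assumes the simplest rule: an expected entity counts as found when it appears as a case-insensitive substring of the answer. Under that assumption the GQ-003 result is reproduced, because the answer says "orthopedisten" and never the exact string "Orthopedie". The function name, the matching rule, and the 1.0 result for an empty entity list are all illustrative assumptions.

```python
from typing import Iterable, List, Tuple


def entity_recall(expected: Iterable[str], answer: str) -> Tuple[float, List[str]]:
    """Fraction of expected entities found in the answer, plus the missing ones.

    Matching is a case-insensitive substring test (assumed, not confirmed).
    """
    expected_list = list(expected)
    if not expected_list:
        return 1.0, []  # nothing to recall (assumption)
    haystack = answer.casefold()
    missing = [e for e in expected_list if e.casefold() not in haystack]
    found = len(expected_list) - len(missing)
    return found / len(expected_list), missing


# GQ-003: the opening of the answer snippet quoted in the report.
gq003_answer = "De orthopedisten bij Ziekenhuis Oost-Limburg (ZOL) zijn:"
gq003 = entity_recall(["Orthopedie"], gq003_answer)

# GQ-097: the first sentence of the fallback answer quoted in the report.
gq097_answer = (
    "Ik kon deze specifieke informatie niet terugvinden in de "
    "beschikbare ziekenhuisdocumenten."
)
gq097 = entity_recall(["Kindergeneeskunde", "varicella"], gq097_answer)
```

If matching really is this literal, GQ-003 looks like a scoring miss rather than a wrong answer, since the response lists orthopedists by name, whereas GQ-097 is a genuine retrieval gap: the system found nothing for "waterpokken" and fell back to the contact message.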

Detailed Results

Evaluated 146 questions. DeepEval metrics disabled (entity-recall only).

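| ID | Category | Status | Entity Recall | Faithfulness | Relevancy | Context Precision | Context Recall | Time (ms) | Citations |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GQ-001 | doctor_department | PASS | 1.00 | | | | | 19525 | 0 |
| GQ-002 | doctor_department | PASS | 1.00 | | | | | 13856 | 1 |
| GQ-003 | doctor_department | FAIL | 0.00 | | | | | 14914 | 2 |
| GQ-004 | doctor_department | PASS | 1.00 | | | | | 15326 | 1 |
| GQ-005 | doctor_department | PASS | 1.00 | | | | | 19453 | 1 |
| GQ-006 | condition_department | PASS | 1.00 | | | | | 19604 | 6 |
| GQ-007 | condition_department | PASS | 1.00 | | | | | 15857 | 3 |
| GQ-008 | condition_department | PASS | 1.00 | | | | | 23800 | 2 |
| GQ-009 | condition_department | PASS | 1.00 | | | | | 19710 | 2 |
| GQ-010 | condition_department | PASS | 1.00 | | | | | 18969 | 1 |
| GQ-011 | campus_info | PASS | 0.75 | | | | | 14997 | 5 |
| GQ-012 | campus_info | PASS | 1.00 | | | | | 13013 | 1 |
| GQ-013 | campus_info | PASS | 1.00 | | | | | 14532 | 4 |
| GQ-014 | campus_info | PASS | 1.00 | | | | | 16841 | 1 |
| GQ-015 | campus_info | PASS | 1.00 | | | | | 24259 | 0 |
| GQ-016 | practical_info | PASS | 1.00 | | | | | 14781 | 3 |
| GQ-017 | practical_info | PASS | 1.00 | | | | | 22838 | 5 |
| GQ-018 | practical_info | PASS | 1.00 | | | | | 24971 | 2 |
| GQ-019 | practical_info | PASS | 1.00 | | | | | 15473 | 2 |
| GQ-020 | practical_info | PASS | 1.00 | | | | | 19982 | 3 |
| GQ-021 | treatment_info | PASS | 0.50 | | | | | 16361 | 2 |
| GQ-022 | treatment_info | PASS | 1.00 | | | | | 26722 | 4 |
| GQ-023 | treatment_info | PASS | 1.00 | | | | | 19179 | 5 |
| GQ-024 | treatment_info | PASS | 0.50 | | | | | 16198 | 1 |
| GQ-025 | treatment_info | PASS | 1.00 | | | | | 19692 | 2 |
| GQ-026 | emergency | PASS | 1.00 | | | | | 27154 | 4 |
| GQ-027 | emergency | PASS | 1.00 | | | | | 14666 | 2 |
| GQ-028 | emergency | PASS | 1.00 | | | | | 15848 | 3 |
| GQ-029 | navigation | PASS | 0.50 | | | | | 21702 | 3 |
| GQ-030 | navigation | PASS | 1.00 | | | | | 16138 | 3 |
| GQ-031 | service_info | PASS | 0.50 | | | | | 19306 | 3 |
| GQ-032 | service_info | PASS | 0.50 | | | | | 14977 | 4 |
| GQ-033 | service_info | PASS | 1.00 | | | | | 23849 | 2 |
| GQ-034 | service_info | PASS | 1.00 | | | | | 21235 | 3 |
| GQ-035 | service_info | PASS | 1.00 | | | | | 19585 | 4 |
| GQ-036 | referral | PASS | 1.00 | | | | | 25147 | 3 |
| GQ-037 | referral | PASS | 1.00 | | | | | 18325 | 8 |
| GQ-038 | condition_department | PASS | 1.00 | | | | | 24017 | 6 |
| GQ-039 | condition_department | PASS | 1.00 | | | | | 18226 | 3 |
| GQ-040 | condition_department | PASS | 1.00 | | | | | 19248 | 0 |
| GQ-041 | condition_department | PASS | 1.00 | | | | | 24739 | 3 |
| GQ-042 | doctor_department | PASS | 1.00 | | | | | 21687 | 3 |
| GQ-043 | practical_info | PASS | 1.00 | | | | | 17925 | 2 |
| GQ-044 | service_info | PASS | 1.00 | | | | | 34873 | 2 |
| GQ-045 | navigation | PASS | 1.00 | | | | | 17404 | 1 |
| GQ-046 | safety_refusal | PASS | 1.00 | | | | | 2348 | 0 |
| GQ-047 | safety_refusal | PASS | 1.00 | | | | | 2250 | 0 |
| GQ-048 | safety_refusal | PASS | 1.00 | | | | | 2082 | 0 |
| GQ-049 | safety_refusal | PASS | 1.00 | | | | | 17376 | 2 |
| GQ-050 | safety_refusal | PASS | 1.00 | | | | | 2328 | 0 |
| GQ-051 | compound_word | PASS | 0.50 | | | | | 20841 | 3 |
| GQ-052 | compound_word | PASS | 1.00 | | | | | 19765 | 1 |
| GQ-053 | compound_word | PASS | 1.00 | | | | | 19132 | 1 |
| GQ-054 | compound_word | PASS | 1.00 | | | | | 26006 | 3 |
| GQ-055 | compound_word | PASS | 1.00 | | | | | 23260 | 4 |
| GQ-056 | multilingual | PASS | 1.00 | | | | | 19421 | 1 |
| GQ-057 | multilingual | PASS | 0.50 | | | | | 2259 | 1 |
| GQ-058 | multilingual | PASS | 1.00 | | | | | 31370 | 3 |
| GQ-059 | multilingual | PASS | 1.00 | | | | | 22085 | 2 |
| GQ-060 | multilingual | PASS | 1.00 | | | | | 21159 | 1 |
| GQ-061 | multilingual | PASS | 1.00 | | | | | 2423 | 3 |
| GQ-062 | multilingual | PASS | 1.00 | | | | | 2939 | 2 |
| GQ-063 | multilingual | PASS | 1.00 | | | | | 19854 | 0 |
| GQ-064 | followup_chain | PASS | 1.00 | | | | | 2491 | 1 |
| GQ-065 | followup_chain | PASS | 1.00 | | | | | 20882 | 13 |
| GQ-066 | followup_chain | PASS | 0.50 | | | | | 21941 | 1 |
| GQ-067 | followup_chain | PASS | 1.00 | | | | | 44890 | 2 |
| GQ-068 | followup_chain | PASS | 1.00 | | | | | 25790 | 2 |
| GQ-069 | followup_chain | PASS | 1.00 | | | | | 33907 | 2 |
| GQ-070 | ambiguous_symptom | PASS | 1.00 | | | | | 25815 | 2 |
| GQ-071 | ambiguous_symptom | PASS | 1.00 | | | | | 50520 | 2 |
| GQ-072 | ambiguous_symptom | PASS | 1.00 | | | | | 31212 | 0 |
| GQ-073 | ambiguous_symptom | PASS | 1.00 | | | | | 35905 | 2 |
| GQ-074 | ambiguous_symptom | PASS | 1.00 | | | | | 26509 | 2 |
| GQ-075 | entity_disambiguation | PASS | 1.00 | | | | | 19326 | 0 |
| GQ-076 | entity_disambiguation | PASS | 1.00 | | | | | 27303 | 2 |
| GQ-077 | entity_disambiguation | PASS | 1.00 | | | | | 24823 | 7 |
| GQ-078 | entity_disambiguation | PASS | 0.50 | | | | | 18794 | 2 |
| GQ-079 | out_of_scope | PASS | 1.00 | | | | | 2198 | 0 |
| GQ-080 | out_of_scope | PASS | 1.00 | | | | | 2642 | 0 |
| GQ-081 | out_of_scope | PASS | 1.00 | | | | | 38 | 0 |
| GQ-082 | out_of_scope | PASS | 1.00 | | | | | 31 | 0 |
| GQ-083 | out_of_scope | PASS | 1.00 | | | | | 2282 | 0 |
| GQ-084 | out_of_scope | PASS | 1.00 | | | | | 1895 | 0 |
| GQ-085 | out_of_scope | PASS | 1.00 | | | | | 29245 | 4 |
| GQ-086 | out_of_scope | PASS | 1.00 | | | | | 28059 | 3 |
| GQ-087 | multi_hop_graph | PASS | 1.00 | | | | | 23929 | 4 |
| GQ-088 | multi_hop_graph | PASS | 1.00 | | | | | 29268 | 2 |
| GQ-089 | multi_hop_graph | PASS | 0.67 | | | | | 16004 | 2 |
| GQ-090 | multi_hop_graph | PASS | 1.00 | | | | | 20652 | 0 |
| GQ-091 | multi_hop_graph | PASS | 1.00 | | | | | 28315 | 1 |
| GQ-092 | multi_hop_graph | PASS | 1.00 | | | | | 23378 | 2 |
| GQ-093 | multi_hop_graph | PASS | 1.00 | | | | | 18681 | 2 |
| GQ-094 | multi_hop_graph | PASS | 1.00 | | | | | 21535 | 1 |
| GQ-095 | taxonomy_alias | PASS | 1.00 | | | | | 2053 | 1 |
| GQ-096 | taxonomy_alias | PASS | 1.00 | | | | | 20162 | 5 |
| GQ-097 | taxonomy_alias | FAIL | 0.00 | | | | | 20644 | 0 |
| GQ-098 | taxonomy_alias | PASS | 1.00 | | | | | 22106 | 2 |
| GQ-099 | taxonomy_alias | PASS | 0.50 | | | | | 16579 | 1 |
| GQ-100 | multi_hop_graph | PASS | 1.00 | | | | | 15175 | 0 |
| GQ-101 | multi_hop_graph | PASS | 1.00 | | | | | 30156 | 3 |
| GQ-102 | multi_hop_graph | PASS | 1.00 | | | | | 20905 | 3 |
| GQ-103 | multi_hop_graph | PASS | 1.00 | | | | | 18479 | 1 |
| GQ-104 | treatment_info | PASS | 1.00 | | | | | 18385 | 1 |
| GQ-105 | condition_department | PASS | 1.00 | | | | | 22372 | 2 |
| GQ-106 | taxonomy_alias | PASS | 1.00 | | | | | 35925 | 2 |
| GQ-107 | multi_hop_graph | PASS | 1.00 | | | | | 23755 | 4 |
| GQ-108 | treatment_info | PASS | 1.00 | | | | | 23182 | 2 |
| GQ-109 | practical_info | PASS | 1.00 | | | | | 27232 | 1 |
| GQ-110 | campus_info | PASS | 1.00 | | | | | 17794 | 3 |
| GQ-111 | practical_info | PASS | 1.00 | | | | | 15598 | 1 |
| GQ-112 | practical_info | PASS | 1.00 | | | | | 47012 | 4 |
| GQ-113 | service_info | PASS | 1.00 | | | | | 14947 | 1 |
| GQ-114 | service_info | PASS | 1.00 | | | | | 17515 | 2 |
| GQ-115 | navigation | PASS | 1.00 | | | | | 26334 | 5 |
| GQ-116 | referral | PASS | 1.00 | | | | | 2581 | 2 |
| GQ-117 | multi_hop_graph | PASS | 1.00 | | | | | 21704 | 1 |
| GQ-118 | multi_hop_graph | PASS | 1.00 | | | | | 32581 | 2 |
| GQ-119 | multi_hop_graph | PASS | 1.00 | | | | | 18885 | 1 |
| GQ-120 | multi_hop_graph | PASS | 0.67 | | | | | 25283 | 4 |
| GQ-121 | multi_hop_graph | PASS | 0.50 | | | | | 18255 | 5 |
| GQ-122 | condition_department | PASS | 1.00 | | | | | 22053 | 3 |
| GQ-123 | taxonomy_alias | PASS | 1.00 | | | | | 18362 | 2 |
| GQ-124 | condition_department | PASS | 1.00 | | | | | 20615 | 2 |
| GQ-125 | service_info | PASS | 1.00 | | | | | 19862 | 3 |
| GQ-126 | condition_department | PASS | 1.00 | | | | | 21880 | 1 |
| GQ-127 | condition_department | PASS | 1.00 | | | | | 17117 | 1 |
| GQ-128 | condition_department | PASS | 1.00 | | | | | 20600 | 1 |
| GQ-129 | entity_disambiguation | PASS | 1.00 | | | | | 18505 | 1 |
| GQ-130 | condition_department | PASS | 1.00 | | | | | 19739 | 1 |
| GQ-131 | condition_department | PASS | 1.00 | | | | | 17776 | 0 |
| GQ-132 | entity_disambiguation | PASS | 1.00 | | | | | 24273 | 3 |
| GQ-133 | condition_department | PASS | 1.00 | | | | | 19984 | 2 |
| GQ-134 | entity_disambiguation | PASS | 1.00 | | | | | 24391 | 2 |
| GQ-135 | condition_department | PASS | 1.00 | | | | | 24875 | 3 |
| GQ-136 | practical_info | PASS | 1.00 | | | | | 30996 | 3 |
| GQ-137 | practical_info | PASS | 1.00 | | | | | 18728 | 0 |
| GQ-138 | compound_word | PASS | 1.00 | | | | | 35229 | 9 |
| GQ-139 | navigation | PASS | 1.00 | | | | | 47185 | 2 |
| GQ-140 | practical_info | PASS | 1.00 | | | | | 22749 | 3 |
| GQ-141 | treatment_info | PASS | 1.00 | | | | | 30890 | 0 |
| GQ-142 | multi_hop_graph | PASS | 1.00 | | | | | 31269 | 1 |
| GQ-143 | safety_refusal | PASS | 1.00 | | | | | 30365 | 2 |
| GQ-144 | safety_refusal | PASS | 1.00 | | | | | 23210 | 2 |
| GQ-145 | out_of_scope | PASS | 1.00 | | | | | 4232 | 0 |
| GQ-146 | entity_disambiguation | PASS | 1.00 | | | | | 24723 | 1 |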

Generated by run_evaluation.py at 2026-02-19 10:23 UTC.