Evaluation Report — 2026-02-21 05:35 UTC

Label: all-three-on

Summary

Metric	Value
Pass rate	96.9% (158/163)
Failed	5
Errors	0
Avg faithfulness	N/A (disabled)
Avg answer relevancy	N/A (disabled)
Avg context precision	N/A (disabled)
Avg context recall	N/A (disabled)
Avg entity recall	0.916
Avg NDCG@5	0.027
Avg MRR	0.018
Avg Precision@5	0.013
Avg Recall@5	0.039
Avg response time	13602 ms
Total eval duration	2381.3 s
Safety refusal accuracy	100.0%

Note on retrieval metrics (NDCG@5, MRR, Precision@5, Recall@5): These values are low because the golden evaluation framework defines expected_source_urls at a coarse level (e.g. /cardiologie), while the RAG system retrieves specific sub-pages, doctor profiles, and PDF brochures that contain the relevant information. Because there are no fine-grained per-document relevance judgments, URL-level matching produces near-zero scores even when the system retrieves correct content. Entity recall and pass rate are better indicators of end-to-end answer quality.
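
To make the effect described in the note concrete, the following is a minimal sketch of exact-match URL scoring for a single question. It assumes binary relevance and strict string equality between retrieved and expected URLs; the report does not show the actual matching code, and every URL except /cardiologie is hypothetical.

```python
from math import log2


def retrieval_metrics_at_k(retrieved, expected, k=5):
    """Exact-match URL metrics for one question (binary relevance).

    Assumption: a retrieved URL counts as relevant only if it is
    string-identical to one of the expected URLs.
    """
    top_k = retrieved[:k]
    hits = [1 if url in expected else 0 for url in top_k]

    precision = sum(hits) / k
    recall = len(set(top_k) & set(expected)) / len(expected) if expected else 0.0

    # Reciprocal rank of the first relevant result (0.0 if there is none).
    reciprocal_rank = 0.0
    for rank, hit in enumerate(hits, start=1):
        if hit:
            reciprocal_rank = 1.0 / rank
            break

    # DCG over the top-k, normalised by the best achievable DCG.
    dcg = sum(hit / log2(rank + 1) for rank, hit in enumerate(hits, start=1))
    ideal_hits = min(len(expected), k)
    idcg = sum(1 / log2(rank + 1) for rank in range(1, ideal_hits + 1))
    ndcg = dcg / idcg if idcg else 0.0

    return {"precision": precision, "recall": recall,
            "mrr": reciprocal_rank, "ndcg": ndcg}


# Coarse expected URL versus hypothetical fine-grained retrieved URLs.
expected = {"/cardiologie"}
retrieved = [
    "/cardiologie/hartfalen",      # hypothetical sub-page
    "/artsen/dr-voorbeeld",        # hypothetical doctor profile
    "/brochures/cardiologie.pdf",  # hypothetical PDF brochure
]
scores = retrieval_metrics_at_k(retrieved, expected)
print(scores)
```

Under these assumptions all four metrics are zero for this question even though every retrieved page belongs to the expected department, which is the behaviour the note attributes to coarse URL labels.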

Statistical Analysis

95% bootstrap confidence intervals (10,000 resamples, percentile method). The n column gives the number of questions scored for each metric. Narrower intervals indicate more precise estimates.

Metric	Mean	95% CI	Width	n
Entity Recall	0.916	[0.880, 0.947]	0.066	163
NDCG@5	0.027	[0.002, 0.058]	0.056	128
MRR	0.018	[0.002, 0.039]	0.037	128
Precision@5	0.013	[0.002, 0.027]	0.025	128
Recall@5	0.039	[0.004, 0.086]	0.082	128
Pass Rate	0.969	[0.939, 0.994]	0.055	163
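
As an illustration of the percentile method named above, here is a minimal sketch that bootstraps the pass rate from the reported counts (158 passes out of 163). The function name, the seed, and the use of the standard library are assumptions; the code that produced the table is not shown in this report.

```python
import random


def bootstrap_ci(values, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile-method bootstrap confidence interval for the mean."""
    rng = random.Random(seed)  # fixed seed (an assumption) for repeatability
    n = len(values)
    # Resample with replacement, record the mean of each resample, then sort.
    means = sorted(
        sum(rng.choices(values, k=n)) / n for _ in range(n_resamples)
    )
    # Read the alpha/2 and 1 - alpha/2 percentiles off the sorted means.
    lower = means[round((alpha / 2) * n_resamples)]
    upper = means[round((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Per-question pass/fail outcomes reconstructed from the summary: 158/163.
outcomes = [1] * 158 + [0] * 5
mean = sum(outcomes) / len(outcomes)
low, high = bootstrap_ci(outcomes)
print(f"mean={mean:.3f} ci=[{low:.3f}, {high:.3f}] width={high - low:.3f}")
```

The exact bounds depend on the random seed, so small differences from the table are expected.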

System Configuration

Configuration snapshot at evaluation time. Each setting can influence retrieval quality, response generation, and overall pass rates.

Git Context

Property	Value
Branch	`master`
Commit	`69ac48c`
Message	docs: add evaluation methodology justification (ER vs LLM-as-a-judge)

LLM Models

Role	Model
RAG generation	`openai/o4-mini` (provider: openrouter)
Escalation (Think Harder)	`openai/gpt-5.2`
Follow-up classification	`openai/gpt-4.1-nano`
Evaluation (DeepEval judge)	`openai/gpt-4.1-mini`
Intent classification	`openai/gpt-4.1-mini`
Safety LLM judge	`openai/gpt-4.1-mini`
Embedding	`bge-m3` (1024d, provider: ollama)

Generation Parameters

Parameter	Value
Temperature	0.1
Max tokens	1000
Full-mode temperature	0.1
Full-mode max tokens	1500

Retrieval Parameters

Parameter	Value
Full mode (always-on reranking)	ON
Rerank candidates	20
Escalation candidates	100
Escalation min similarity	0.35
Escalation rerank top-k	20
Context assembly max tokens	8000
Context expand window	1 chunk
BM25 hybrid search	ON (weight: 0.3)
Vector weight	0.7
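
The BM25 and vector weights above suggest a weighted combination of keyword and semantic scores. The sketch below shows one common way to do that: min-max normalise each score set, then take a linear combination. Whether the pipeline normalises this way, or uses a different fusion such as rank-based fusion, is not stated in this report, so treat the method, the function names, and the chunk identifiers as assumptions.

```python
BM25_WEIGHT = 0.3    # from the retrieval parameters table
VECTOR_WEIGHT = 0.7  # from the retrieval parameters table


def min_max(scores):
    """Scale a {doc_id: score} mapping to the range [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def fuse(bm25_scores, vector_scores):
    """Weighted linear fusion; a doc missing from one list scores 0 there."""
    bm25_n = min_max(bm25_scores)
    vec_n = min_max(vector_scores)
    fused = {
        doc: BM25_WEIGHT * bm25_n.get(doc, 0.0)
        + VECTOR_WEIGHT * vec_n.get(doc, 0.0)
        for doc in set(bm25_n) | set(vec_n)
    }
    # Highest fused score first.
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


# Hypothetical raw scores for three chunks.
bm25 = {"chunk_a": 12.0, "chunk_b": 4.0}
vector = {"chunk_a": 0.71, "chunk_b": 0.82, "chunk_c": 0.60}
ranking = fuse(bm25, vector)
print([doc for doc, _ in ranking])
```

With a 0.7 vector weight, the chunk with the best semantic score outranks the chunk with the best keyword score in this toy example.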

Feature Flags

These flags control which components of the RAG pipeline are active. Toggling them on/off allows measuring the contribution of each feature.

Feature	Status	Impact
Knowledge Graph (Neo4j)	ON	Multi-hop entity retrieval
Graph deep traversal	ON	3-4 hop graph queries
Contextual embeddings	ON	Chunk-level context in embeddings
BM25 hybrid search	ON	Keyword + semantic search fusion
Context filtering (FILCO)	OFF	Sentence-level relevance filtering
Semantic query cache	ON	Cache similar query results
Cache similarity threshold	0.97	Min cosine for cache hit
Intent classification	ON	Safety guardrail pre-filter
Safety validation	ON	Post-generation safety check
Safety LLM judge	ON	LLM-as-judge defense-in-depth
Quality evaluation	ON	Background quality scoring
Auto-refusal on low quality	ON	Refuse if score < 0.4
True token streaming	ON	Real-time token delivery
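
As an illustration of the cache similarity threshold in the table above, the following sketch returns a cached answer only when the cosine similarity between the query embedding and a cached embedding is at least 0.97. The function names, the in-memory cache layout, and the two-dimensional vectors are assumptions made for brevity; the report lists 1024-dimensional embeddings and does not describe the cache implementation.

```python
from math import sqrt

CACHE_SIMILARITY_THRESHOLD = 0.97  # from the feature flags table


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def cache_lookup(query_embedding, cache_entries,
                 threshold=CACHE_SIMILARITY_THRESHOLD):
    """Return the best-matching cached answer, or None on a cache miss."""
    best_answer, best_similarity = None, -1.0
    for embedding, answer in cache_entries:
        similarity = cosine_similarity(query_embedding, embedding)
        if similarity > best_similarity:
            best_answer, best_similarity = answer, similarity
    return best_answer if best_similarity >= threshold else None


# Toy cache with a single hypothetical entry.
cache = [([1.0, 0.0], "cached answer")]
near_hit = cache_lookup([0.99, 0.05], cache)  # cosine is roughly 0.999
far_miss = cache_lookup([0.8, 0.6], cache)    # cosine is 0.8
print(near_hit, far_miss)
```

A threshold this close to 1.0 means only near-duplicate queries reuse a cached result.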

Evaluation Run Parameters

Parameter	Value
DeepEval metrics	OFF (entity-recall only)
Questions file	`golden_questions.json`

Results by Category

Category	Pass	Fail	Total	Rate
adversarial_gcg	12	0	12	100.0%
ambiguous_symptom	5	0	5	100.0%
campus_info	6	0	6	100.0%
compound_word	6	0	6	100.0%
condition_department	18	1	19	94.7%
doctor_department	5	1	6	83.3%
emergency	3	0	3	100.0%
entity_disambiguation	8	0	8	100.0%
followup_chain	5	1	6	83.3%
multi_hop_graph	19	0	19	100.0%
multilingual	6	2	8	75.0%
navigation	5	0	5	100.0%
out_of_scope	12	0	12	100.0%
practical_info	12	0	12	100.0%
referral	3	0	3	100.0%
safety_refusal	9	0	9	100.0%
service_info	9	0	9	100.0%
taxonomy_alias	7	0	7	100.0%
treatment_info	8	0	8	100.0%

Timing Analysis

Response time distribution across all evaluated questions.

Percentile	Response Time
Min	31 ms
P50 (median)	14757 ms
P90	19537 ms
P99	24225 ms
Max	25135 ms
Mean	13602 ms

Response Time by Category

Category	Mean	Median	Max	Count
adversarial_gcg	8499 ms	12657 ms	20024 ms	12
ambiguous_symptom	16378 ms	17470 ms	20016 ms	5
campus_info	12360 ms	14143 ms	18379 ms	6
compound_word	12758 ms	13002 ms	15096 ms	6
condition_department	16596 ms	16195 ms	21518 ms	19
doctor_department	12190 ms	12602 ms	15358 ms	6
emergency	12987 ms	13116 ms	14638 ms	3
entity_disambiguation	13131 ms	14910 ms	19646 ms	8
followup_chain	12834 ms	13041 ms	17430 ms	6
multi_hop_graph	16796 ms	16883 ms	24225 ms	19
multilingual	10308 ms	10415 ms	16409 ms	8
navigation	15112 ms	16662 ms	17363 ms	5
out_of_scope	7420 ms	10574 ms	17293 ms	12
practical_info	17861 ms	18447 ms	22760 ms	12
referral	16496 ms	15798 ms	18441 ms	3
safety_refusal	13151 ms	12006 ms	16463 ms	9
service_info	10796 ms	11882 ms	19901 ms	9
taxonomy_alias	17368 ms	17336 ms	25135 ms	7
treatment_info	13254 ms	15393 ms	19874 ms	8

Failures

GQ-004

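Question: Bij welke afdeling werkt Dr. Rik Houben? (English: Which department does Dr. Rik Houben work in?)

Expected ground truth: Dr. Rik Houben werkt bij de dienst Neurologie van Ziekenhuis Oost-Limburg (ZOL). (English: Dr. Rik Houben works in the Neurology department of Ziekenhuis Oost-Limburg (ZOL).)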

Issue: Entity recall too low (0.00). Missing entities: Houben
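
For readers unfamiliar with the metric, a minimal sketch of how an entity recall score like the one above could be computed follows. It assumes case-insensitive substring matching of expected entities against the generated answer; the evaluator's actual matching rules, the full expected entity list for GQ-004, and the system's answer are not shown in this report, so the inputs below are hypothetical.

```python
def entity_recall(answer, expected_entities):
    """Return (recall, missing) for expected entities found in the answer.

    Assumption: an entity counts as found if it appears in the answer
    as a case-insensitive substring.
    """
    if not expected_entities:
        return 1.0, []
    answer_folded = answer.casefold()
    missing = [
        entity for entity in expected_entities
        if entity.casefold() not in answer_folded
    ]
    found = len(expected_entities) - len(missing)
    return found / len(expected_entities), missing


# Hypothetical answer that never names the doctor.
recall, missing = entity_recall(
    "I could not find information about this doctor.",
    ["Houben"],
)
print(recall, missing)
```

Under this definition, an answer that omits the doctor's surname scores 0.00 when that surname is the only expected entity, regardless of what else the answer contains.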