Evaluation Report — 2026-02-20 15:12 UTC

Label: filco-regression-fix-validation

Summary

Metric	Value
Pass rate	100.0% (13/13)
Failed	0
Errors	0
Avg faithfulness	N/A (disabled)
Avg answer relevancy	N/A (disabled)
Avg context precision	N/A (disabled)
Avg context recall	N/A (disabled)
Avg entity recall	1.000
Avg NDCG@5	0.083
Avg MRR	0.083
Avg Precision@5	0.033
Avg Recall@5	0.083
Avg response time	22943 ms
Total eval duration	310.6 s
Safety refusal accuracy	100.0%

Note on retrieval metrics (NDCG@5, MRR, Precision@5, Recall@5): These values are low because the golden evaluation framework defines expected_source_urls at a coarse level (e.g. /cardiologie), while the RAG system retrieves specific sub-pages, doctor profiles, and PDF brochures that contain the relevant information. Without fine-grained per-document relevance judgments, URL-level matching scores a retrieval as a miss even when the retrieved content is correct: in this run only GQ-064 registers a match (NDCG@5 and MRR of 1.00), and every other scored question is 0.00. Entity recall and pass rate are therefore the better indicators of end-to-end answer quality here.
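The effect described in the note can be illustrated with a minimal sketch. It assumes the evaluator compares URL paths for exact equality, which this report does not confirm. The host `example.org`, the sub-page paths, and the function names are hypothetical; only `/cardiologie` comes from the note itself.

```python
from urllib.parse import urlparse


def _path(url: str) -> str:
    """Return the URL path without a trailing slash."""
    return urlparse(url).path.rstrip("/")


def is_relevant(retrieved: str, expected: list[str], mode: str) -> bool:
    """mode='exact' requires identical paths; mode='prefix' also accepts sub-pages."""
    r = _path(retrieved)
    for e in expected:
        ep = _path(e)
        if r == ep:
            return True
        if mode == "prefix" and r.startswith(ep + "/"):
            return True
    return False


def mrr(retrieved: list[str], expected: list[str], mode: str) -> float:
    """Reciprocal rank of the first relevant URL, or 0.0 if none is relevant."""
    for rank, url in enumerate(retrieved, start=1):
        if is_relevant(url, expected, mode):
            return 1.0 / rank
    return 0.0


def precision_at_k(retrieved: list[str], expected: list[str], mode: str, k: int = 5) -> float:
    """Fraction of the top-k slots filled with relevant URLs."""
    hits = sum(is_relevant(u, expected, mode) for u in retrieved[:k])
    return hits / k


# Hypothetical retrieval result for a question whose expected source is /cardiologie.
retrieved = [
    "https://example.org/cardiologie/equipe/dr-x",
    "https://example.org/cardiologie/brochure.pdf",
    "https://example.org/neurologie",
]
expected = ["/cardiologie"]

exact_mrr = mrr(retrieved, expected, "exact")
prefix_mrr = mrr(retrieved, expected, "prefix")
prefix_p5 = precision_at_k(retrieved, expected, "prefix")
```

Under exact matching both cardiology sub-pages count as misses. Under prefix matching they count as hits, which is one possible way to make URL-level scores track the coarse expected_source_urls more closely.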

Statistical Analysis

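95% bootstrap confidence intervals (10,000 resamples, percentile method). Narrower intervals indicate more stable estimates, but at n = 12 or 13 a zero-width interval only means that every evaluated question received the same score. Retrieval metrics have n = 12 because GQ-072 reports no retrieval scores.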

Metric	Mean	95% CI	Width	n
Entity Recall	1.000	[1.000, 1.000]	0.000	13
NDCG@5	0.083	[0.000, 0.250]	0.250	12
MRR	0.083	[0.000, 0.250]	0.250	12
Precision@5	0.033	[0.000, 0.100]	0.100	12
Recall@5	0.083	[0.000, 0.250]	0.250	12
Pass Rate	1.000	[1.000, 1.000]	0.000	13
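For readers who want to check the intervals, here is a minimal sketch of a percentile bootstrap over per-question scores. It uses only the standard library. The function name, the seed, and the index-based percentile rule are assumptions, not the code in run_evaluation.py.

```python
import random


def bootstrap_ci(values, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile-method bootstrap CI for the mean of `values`.

    Returns (mean, lower, upper). The seed is fixed so the sketch is repeatable.
    """
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement n items at a time and record each resample mean.
    means = sorted(sum(rng.choices(values, k=n)) / n for _ in range(n_resamples))
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return sum(values) / n, means[lo_idx], means[hi_idx]


# Per-question scores taken from the detailed results table:
# NDCG@5 is 1.00 for GQ-064 and 0.00 for the other 11 scored questions.
ndcg_scores = [1.0] + [0.0] * 11
# Entity recall is 1.00 for all 13 questions.
entity_recall_scores = [1.0] * 13

ndcg_mean, ndcg_lo, ndcg_hi = bootstrap_ci(ndcg_scores)
er_mean, er_lo, er_hi = bootstrap_ci(entity_recall_scores)
```

With a single non-zero score out of 12, resample means can only take the values k/12, which is why the interval bounds in the table fall on such round numbers.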

System Configuration

Configuration snapshot at evaluation time. Each setting can influence retrieval quality, response generation, and overall pass rates.

Git Context

Property	Value
Branch	`master`
Commit	`8e52e54`
Message	fix(W4-2): CRAG rrf_score bug, cross-lingual discount, pymupdf4llm + test coverage

LLM Models

Role	Model
RAG generation	`openai/o4-mini` (provider: openrouter)
Escalation (Think Harder)	`openai/gpt-5.2`
Follow-up classification	`openai/gpt-4.1-nano`
Evaluation (DeepEval judge)	`openai/gpt-4.1-mini`
Intent classification	`openai/gpt-4.1-mini`
Safety LLM judge	`openai/gpt-4.1-mini`
Embedding	`bge-m3` (1024d, provider: ollama)

Generation Parameters

Parameter	Value
Temperature	0.1
Max tokens	1000
Full-mode temperature	0.1
Full-mode max tokens	1500

Retrieval Parameters

Parameter	Value
Full mode (always-on reranking)	ON
Rerank candidates	20
Escalation candidates	100
Escalation min similarity	0.35
Escalation rerank top-k	20
Context assembly max tokens	8000
Context expand window	1 chunk
BM25 hybrid search	ON (weight: 0.3)
Vector weight	0.7
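One plausible reading of the 0.7 / 0.3 weights is a weighted linear combination of normalized vector and BM25 scores. The sketch below illustrates that reading only. The commit message mentions an `rrf_score`, so the actual pipeline may fuse by reciprocal rank instead. The function names and the min-max normalization are assumptions.

```python
def min_max(scores: dict[str, float]) -> dict[str, float]:
    """Rescale scores to [0, 1]; a constant score set maps to 1.0 for every item."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def fuse(vector_scores, bm25_scores, vector_weight=0.7, bm25_weight=0.3):
    """Combine two score sets; a document missing from one set contributes 0 there."""
    v = min_max(vector_scores)
    b = min_max(bm25_scores)
    fused = {
        doc_id: vector_weight * v.get(doc_id, 0.0) + bm25_weight * b.get(doc_id, 0.0)
        for doc_id in set(v) | set(b)
    }
    # Highest fused score first.
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


# Hypothetical chunk ids and raw scores, for illustration only.
ranked = fuse(
    vector_scores={"a": 0.9, "b": 0.5, "c": 0.1},
    bm25_scores={"b": 12.0, "c": 3.0},
)
ranked_ids = [doc_id for doc_id, _ in ranked]
fused_scores = dict(ranked)
```

Normalizing before weighting matters because BM25 scores are unbounded while cosine similarities are not. Without it the 0.3 weight would not mean much.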

Feature Flags

These flags control which components of the RAG pipeline are active. Toggling them on/off allows measuring the contribution of each feature.

Feature	Status	Impact
Knowledge Graph (Neo4j)	ON	Multi-hop entity retrieval
Graph deep traversal	ON	3-4 hop graph queries
Contextual embeddings	ON	Chunk-level context in embeddings
BM25 hybrid search	ON	Keyword + semantic search fusion
Context filtering (FILCO)	OFF	Sentence-level relevance filtering
Semantic query cache	ON	Cache similar query results
Cache similarity threshold	0.97	Min cosine for cache hit
Intent classification	ON	Safety guardrail pre-filter
Safety validation	ON	Post-generation safety check
Safety LLM judge	ON	LLM-as-judge defense-in-depth
Quality evaluation	ON	Background quality scoring
Auto-refusal on low quality	ON	Refuse if score < 0.4
True token streaming	ON	Real-time token delivery
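The two numeric thresholds in the table (cache similarity 0.97, auto-refusal below 0.4) can be read as simple gates. The following is a minimal sketch under that assumption. The function names, the cache layout, and the example vectors are hypothetical and are not taken from the evaluated system.

```python
import math

CACHE_SIMILARITY_THRESHOLD = 0.97  # "Min cosine for cache hit" from the table
REFUSAL_QUALITY_THRESHOLD = 0.4    # "Refuse if score < 0.4" from the table


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def cache_lookup(query_vec, cache):
    """Return the cached answer most similar to the query, or None below the threshold.

    `cache` is a list of (embedding, answer) pairs (hypothetical layout).
    """
    best_answer, best_sim = None, -1.0
    for embedding, answer in cache:
        sim = cosine(query_vec, embedding)
        if sim > best_sim:
            best_answer, best_sim = answer, sim
    return best_answer if best_sim >= CACHE_SIMILARITY_THRESHOLD else None


def should_refuse(quality_score: float) -> bool:
    """Refuse only when the quality score is strictly below the threshold."""
    return quality_score < REFUSAL_QUALITY_THRESHOLD


cache = [([1.0, 0.0], "cached answer")]
hit = cache_lookup([1.0, 0.0], cache)   # identical direction, similarity 1.0
miss = cache_lookup([0.9, 0.5], cache)  # similarity is roughly 0.87
```

A threshold as high as 0.97 means only near-identical queries reuse a cached answer, which limits the risk of serving a stale or mismatched response.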

Evaluation Run Parameters

Parameter	Value
DeepEval metrics	OFF (entity-recall only)
Questions file	`golden_questions.json`
ID filter	`GQ-061, GQ-062, GQ-063, GQ-066, GQ-067, GQ-068, GQ-069, GQ-071, GQ-072, GQ-096, GQ-128`

Results by Category

Category	Pass	Total	Rate
ambiguous_symptom	2	2	100.0%
condition_department	1	1	100.0%
followup_chain	6	6	100.0%
multilingual	3	3	100.0%
taxonomy_alias	1	1	100.0%

Timing Analysis

Response time distribution across all evaluated questions.

Percentile	Response Time
Min	11551 ms
P50 (median)	21987 ms
P90	40173 ms
P99	41452 ms
Max	41452 ms
Mean	22943 ms
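The percentiles above are consistent with a simple index rule on the sorted response times, `sorted_values[int(p * n)]`. The sketch below applies that rule to the 13 response times from the detailed results table. Whether run_evaluation.py actually uses this rule is an assumption, and the function name is hypothetical.

```python
def percentile(sorted_values: list[int], p: float) -> int:
    """Pick sorted_values[int(p * n)], clamped to the last element (no interpolation)."""
    idx = min(len(sorted_values) - 1, int(p * len(sorted_values)))
    return sorted_values[idx]


# Response times in ms, copied from the detailed results table.
times_ms = sorted([
    11551, 11862, 11552, 14057, 18358, 21987, 26866,
    22861, 17555, 40173, 30544, 29439, 41452,
])

p50 = percentile(times_ms, 0.50)
p90 = percentile(times_ms, 0.90)
p99 = percentile(times_ms, 0.99)
mean_ms = round(sum(times_ms) / len(times_ms))

# The same rule on the two ambiguous_symptom timings returns the larger value.
ambiguous_median = percentile([30544, 40173], 0.50)
```

With only 13 samples, P99 and Max select the same element, so P99 carries no extra information in this run. The rule also returns the upper of two middle values for even-sized groups, which matches the per-category medians reported for ambiguous_symptom and followup_chain.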

Response Time by Category

Category	Mean	Median	Max	Count
ambiguous_symptom	35358 ms	40173 ms	40173 ms	2
condition_department	41452 ms	41452 ms	41452 ms	1
followup_chain	20281 ms	21987 ms	26866 ms	6
multilingual	11655 ms	11552 ms	11862 ms	3
taxonomy_alias	29439 ms	29439 ms	29439 ms	1

Detailed Results

Evaluated 13 questions. DeepEval metrics disabled (entity-recall only).

ID	Category	Status	Entity Recall	NDCG@5	MRR	Faithfulness	Relevancy	Ctx Prec	Ctx Recall	Time (ms)	Citations
GQ-061	multilingual	PASS	1.00	0.00	0.00	—	—	—	—	11551	2
GQ-062	multilingual	PASS	1.00	0.00	0.00	—	—	—	—	11862	8
GQ-063	multilingual	PASS	1.00	0.00	0.00	—	—	—	—	11552	1
GQ-064	followup_chain	PASS	1.00	1.00	1.00	—	—	—	—	14057	2
GQ-065	followup_chain	PASS	1.00	0.00	0.00	—	—	—	—	18358	4
GQ-066	followup_chain	PASS	1.00	0.00	0.00	—	—	—	—	21987	9
GQ-067	followup_chain	PASS	1.00	0.00	0.00	—	—	—	—	26866	3
GQ-068	followup_chain	PASS	1.00	0.00	0.00	—	—	—	—	22861	8
GQ-069	followup_chain	PASS	1.00	0.00	0.00	—	—	—	—	17555	9
GQ-071	ambiguous_symptom	PASS	1.00	0.00	0.00	—	—	—	—	40173	5
GQ-072	ambiguous_symptom	PASS	1.00	—	—	—	—	—	—	30544	0
GQ-096	taxonomy_alias	PASS	1.00	0.00	0.00	—	—	—	—	29439	8
GQ-128	condition_department	PASS	1.00	0.00	0.00	—	—	—	—	41452	2

Generated by run_evaluation.py at 2026-02-20 15:12 UTC.