Evaluation Report — 2026-02-20 15:12 UTC

Label: filco-regression-fix-validation

Summary

| Metric | Value |
| --- | --- |
| Pass rate | 100.0% (13/13) |
| Failed | 0 |
| Errors | 0 |
| Avg faithfulness | N/A (disabled) |
| Avg answer relevancy | N/A (disabled) |
| Avg context precision | N/A (disabled) |
| Avg context recall | N/A (disabled) |
| Avg entity recall | 1.000 |
| Avg NDCG@5 | 0.083 |
| Avg MRR | 0.083 |
| Avg Precision@5 | 0.033 |
| Avg Recall@5 | 0.083 |
| Avg response time | 22943 ms |
| Total eval duration | 310.6 s |
| Safety refusal accuracy | 100.0% |

Note on retrieval metrics (NDCG@5, MRR, Precision@5, Recall@5): These values appear low because the golden evaluation framework defines expected_source_urls at a coarse level (e.g. /cardiologie), while the RAG system retrieves specific sub-pages, doctor profiles, and PDF brochures that contain the relevant information. Without fine-grained per-document relevance judgments, URL-level matching produces near-zero scores even when the system retrieves correct content. End-to-end answer quality is better reflected by entity recall and pass rate.
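
The effect described in the note can be illustrated with a minimal sketch. It assumes the evaluator compares retrieved URLs to expected_source_urls by exact string equality, which the report does not confirm. The function names and every retrieved URL other than /cardiologie are hypothetical, chosen only for illustration.

```python
def reciprocal_rank_exact(retrieved_urls, expected_urls):
    """1/rank of the first retrieved URL that exactly equals an expected URL, else 0.0."""
    expected = set(expected_urls)
    for rank, url in enumerate(retrieved_urls, start=1):
        if url in expected:
            return 1.0 / rank
    return 0.0


def reciprocal_rank_prefix(retrieved_urls, expected_urls):
    """Same, but a retrieved URL also counts when it sits underneath an expected path."""
    for rank, url in enumerate(retrieved_urls, start=1):
        for exp in expected_urls:
            if url == exp or url.startswith(exp.rstrip("/") + "/"):
                return 1.0 / rank
    return 0.0


# Coarse expected URL (from the report) versus hypothetical specific sub-pages.
expected = ["/cardiologie"]
retrieved = [
    "/cardiologie/team/dr-example",         # hypothetical doctor profile
    "/cardiologie/brochure-example.pdf",    # hypothetical PDF brochure
    "/contact",                             # hypothetical unrelated page
]

exact_rr = reciprocal_rank_exact(retrieved, expected)
prefix_rr = reciprocal_rank_prefix(retrieved, expected)
```

Under exact matching the reciprocal rank is 0.0 even though the first result belongs to the expected department, while prefix matching credits it with 1.0. Prefix matching is only one possible relaxation and is not something the report states the framework supports.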

Statistical Analysis

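95% bootstrap confidence intervals (10,000 resamples, percentile method). Narrower intervals indicate more reliable estimates. The n column is the number of questions that have a value for the metric; the retrieval metrics have n = 12 because GQ-072 has no retrieval scores in the detailed results.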

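| Metric | Mean | 95% CI | Width | n |
| --- | --- | --- | --- | --- |
| Entity Recall | 1.000 | [1.000, 1.000] | 0.000 | 13 |
| NDCG@5 | 0.083 | [0.000, 0.250] | 0.250 | 12 |
| MRR | 0.083 | [0.000, 0.250] | 0.250 | 12 |
| Precision@5 | 0.033 | [0.000, 0.100] | 0.100 | 12 |
| Recall@5 | 0.083 | [0.000, 0.250] | 0.250 | 12 |
| Pass Rate | 1.000 | [1.000, 1.000] | 0.000 | 13 |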
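
A percentile bootstrap of the kind described above can be sketched as follows. This is not the code from run_evaluation.py: the function name bootstrap_ci, the fixed seed, and the exact index used to pick the percentile bounds are assumptions made for illustration.

```python
import random


def bootstrap_ci(values, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile-method bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement, record the mean of each resample, then sort.
    means = sorted(sum(rng.choices(values, k=n)) / n for _ in range(n_resamples))
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper


# MRR per question as listed in the detailed results: one hit at rank 1, eleven misses.
mrr_values = [1.0] + [0.0] * 11
mrr_ci = bootstrap_ci(mrr_values)

# Entity recall is 1.00 for all 13 questions, so every resample has mean 1.0.
entity_recall_ci = bootstrap_ci([1.0] * 13)
```

With a single non-zero value out of twelve, the interval is driven by how often that one question is drawn into a resample, which is why the MRR interval spans 0.000 to 0.250. When every value is identical, as for entity recall, the interval collapses to zero width.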

System Configuration

Configuration snapshot at evaluation time. Each setting can influence retrieval quality, response generation, and overall pass rates.

Git Context

| Property | Value |
| --- | --- |
| Branch | master |
| Commit | 8e52e54 |
| Message | fix(W4-2): CRAG rrf_score bug, cross-lingual discount, pymupdf4llm + test coverage |

LLM Models

| Role | Model |
| --- | --- |
| RAG generation | openai/o4-mini (provider: openrouter) |
| Escalation (Think Harder) | openai/gpt-5.2 |
| Follow-up classification | openai/gpt-4.1-nano |
| Evaluation (DeepEval judge) | openai/gpt-4.1-mini |
| Intent classification | openai/gpt-4.1-mini |
| Safety LLM judge | openai/gpt-4.1-mini |
| Embedding | bge-m3 (1024d, provider: ollama) |

Generation Parameters

| Parameter | Value |
| --- | --- |
| Temperature | 0.1 |
| Max tokens | 1000 |
| Full-mode temperature | 0.1 |
| Full-mode max tokens | 1500 |

Retrieval Parameters

| Parameter | Value |
| --- | --- |
| Full mode (always-on reranking) | ON |
| Rerank candidates | 20 |
| Escalation candidates | 100 |
| Escalation min similarity | 0.35 |
| Escalation rerank top-k | 20 |
| Context assembly max tokens | 8000 |
| Context expand window | 1 chunk |
| BM25 hybrid search | ON (weight: 0.3) |
| Vector weight | 0.7 |
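
To show how the 0.7 vector weight and 0.3 BM25 weight could interact, here is a minimal sketch of weighted score fusion. It assumes a linear combination of min-max normalized scores. The report does not state the fusion formula, and the commit message mentions rrf_score, so the actual pipeline may fuse ranks instead. The function names and example scores are hypothetical.

```python
def min_max(scores):
    """Rescale a {doc_id: score} mapping to the range [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def hybrid_scores(vector_scores, bm25_scores, vector_weight=0.7, bm25_weight=0.3):
    """Weighted sum of normalized vector and BM25 scores; a missing score counts as 0."""
    v = min_max(vector_scores) if vector_scores else {}
    b = min_max(bm25_scores) if bm25_scores else {}
    return {
        doc: vector_weight * v.get(doc, 0.0) + bm25_weight * b.get(doc, 0.0)
        for doc in set(v) | set(b)
    }


# Hypothetical scores: cosine similarities for vector search, raw BM25 scores.
fused = hybrid_scores({"a": 0.9, "b": 0.5}, {"b": 12.0, "c": 3.0})
ranking = sorted(fused, key=fused.get, reverse=True)
```

Normalizing first keeps the two weights meaningful, because raw BM25 scores and cosine similarities live on different scales.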

Feature Flags

These flags control which components of the RAG pipeline are active. Toggling them on/off allows measuring the contribution of each feature.

| Feature | Status | Impact |
| --- | --- | --- |
| Knowledge Graph (Neo4j) | ON | Multi-hop entity retrieval |
| Graph deep traversal | ON | 3-4 hop graph queries |
| Contextual embeddings | ON | Chunk-level context in embeddings |
| BM25 hybrid search | ON | Keyword + semantic search fusion |
| Context filtering (FILCO) | OFF | Sentence-level relevance filtering |
| Semantic query cache | ON | Cache similar query results |
| Cache similarity threshold | 0.97 | Min cosine for cache hit |
| Intent classification | ON | Safety guardrail pre-filter |
| Safety validation | ON | Post-generation safety check |
| Safety LLM judge | ON | LLM-as-judge defense-in-depth |
| Quality evaluation | ON | Background quality scoring |
| Auto-refusal on low quality | ON | Refuse if score < 0.4 |
| True token streaming | ON | Real-time token delivery |
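
The cache similarity threshold can be read as a minimum cosine similarity between the new query embedding and a cached one. The sketch below illustrates that rule under stated assumptions: the cache is a plain list of (embedding, answer) pairs, the best match above the threshold wins, and the names cosine and cache_lookup are hypothetical. The real cache implementation is not shown in the report.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def cache_lookup(query_vec, cache, threshold=0.97):
    """Return the cached answer with the highest similarity >= threshold, else None."""
    best_answer, best_sim = None, threshold
    for cached_vec, answer in cache:
        sim = cosine(query_vec, cached_vec)
        if sim >= best_sim:
            best_answer, best_sim = answer, sim
    return best_answer


# Toy 2-dimensional embeddings; the report's embedding model produces 1024 dimensions.
cache = [([1.0, 0.0], "answer A"), ([0.0, 1.0], "answer B")]
hit = cache_lookup([0.99, 0.05], cache)   # very close to the first entry
miss = cache_lookup([0.7, 0.7], cache)    # between both entries, below 0.97
```

A threshold as high as 0.97 means only near-identical queries reuse a cached result, trading cache hit rate for a lower risk of serving an answer to a subtly different question.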

Evaluation Run Parameters

| Parameter | Value |
| --- | --- |
| DeepEval metrics | OFF (entity-recall only) |
| Questions file | golden_questions.json |
| ID filter | GQ-061, GQ-062, GQ-063, GQ-066, GQ-067, GQ-068, GQ-069, GQ-071, GQ-072, GQ-096, GQ-128 |

Results by Category

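| Category | Pass | Fail | Error | Total | Rate |
| --- | --- | --- | --- | --- | --- |
| ambiguous_symptom | 2 | 0 | 0 | 2 | 100.0% |
| condition_department | 1 | 0 | 0 | 1 | 100.0% |
| followup_chain | 6 | 0 | 0 | 6 | 100.0% |
| multilingual | 3 | 0 | 0 | 3 | 100.0% |
| taxonomy_alias | 1 | 0 | 0 | 1 | 100.0% |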

Timing Analysis

Response time distribution across all evaluated questions.

| Percentile | Response Time |
| --- | --- |
| Min | 11551 ms |
| P50 (median) | 21987 ms |
| P90 | 40173 ms |
| P99 | 41452 ms |
| Max | 41452 ms |
| Mean | 22943 ms |
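
The percentiles above are consistent with a nearest-rank rule applied to the 13 response times listed under Detailed Results. The sketch below reproduces them. The indexing rule (sorted value at position int(n * p)) is inferred from the numbers and is an assumption, not the confirmed implementation in run_evaluation.py.

```python
# Response times in ms, taken from the Time (ms) column of the detailed results.
times_ms = [
    11551, 11862, 11552,                       # multilingual
    14057, 18358, 21987, 26866, 22861, 17555,  # followup_chain
    40173, 30544,                              # ambiguous_symptom
    29439,                                     # taxonomy_alias
    41452,                                     # condition_department
]


def percentile(values, p):
    """Nearest-rank percentile: the sorted value at index int(n * p), capped at the last index."""
    ordered = sorted(values)
    idx = min(int(len(ordered) * p), len(ordered) - 1)
    return ordered[idx]


p50 = percentile(times_ms, 0.50)
p90 = percentile(times_ms, 0.90)
p99 = percentile(times_ms, 0.99)
mean_ms = round(sum(times_ms) / len(times_ms))
```

With only 13 samples, P99 coincides with the maximum and P90 with the second-slowest question, so the tail figures describe individual questions rather than a stable distribution.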

Response Time by Category

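| Category | Mean | Median | Max | Count |
| --- | --- | --- | --- | --- |
| ambiguous_symptom | 35358 ms | 40173 ms | 40173 ms | 2 |
| condition_department | 41452 ms | 41452 ms | 41452 ms | 1 |
| followup_chain | 20281 ms | 21987 ms | 26866 ms | 6 |
| multilingual | 11655 ms | 11552 ms | 11862 ms | 3 |
| taxonomy_alias | 29439 ms | 29439 ms | 29439 ms | 1 |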

Detailed Results

Evaluated 13 questions. DeepEval metrics disabled (entity-recall only).

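| ID | Category | Status | Entity Recall | NDCG@5 | MRR | Faithfulness | Relevancy | Ctx Prec | Ctx Recall | Time (ms) | Citations |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GQ-061 | multilingual | PASS | 1.00 | 0.00 | 0.00 | | | | | 11551 | 2 |
| GQ-062 | multilingual | PASS | 1.00 | 0.00 | 0.00 | | | | | 11862 | 8 |
| GQ-063 | multilingual | PASS | 1.00 | 0.00 | 0.00 | | | | | 11552 | 1 |
| GQ-064 | followup_chain | PASS | 1.00 | 1.00 | 1.00 | | | | | 14057 | 2 |
| GQ-065 | followup_chain | PASS | 1.00 | 0.00 | 0.00 | | | | | 18358 | 4 |
| GQ-066 | followup_chain | PASS | 1.00 | 0.00 | 0.00 | | | | | 21987 | 9 |
| GQ-067 | followup_chain | PASS | 1.00 | 0.00 | 0.00 | | | | | 26866 | 3 |
| GQ-068 | followup_chain | PASS | 1.00 | 0.00 | 0.00 | | | | | 22861 | 8 |
| GQ-069 | followup_chain | PASS | 1.00 | 0.00 | 0.00 | | | | | 17555 | 9 |
| GQ-071 | ambiguous_symptom | PASS | 1.00 | 0.00 | 0.00 | | | | | 40173 | 5 |
| GQ-072 | ambiguous_symptom | PASS | 1.00 | | | | | | | 30544 | 0 |
| GQ-096 | taxonomy_alias | PASS | 1.00 | 0.00 | 0.00 | | | | | 29439 | 8 |
| GQ-128 | condition_department | PASS | 1.00 | 0.00 | 0.00 | | | | | 41452 | 2 |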

Generated by run_evaluation.py at 2026-02-20 15:12 UTC.