Evaluation Report — 2026-02-20 15:12 UTC
Label: filco-regression-fix-validation
Summary
| Metric | Value |
|---|---|
| Pass rate | 100.0% (13/13) |
| Failed | 0 |
| Errors | 0 |
| Avg faithfulness | N/A (disabled) |
| Avg answer relevancy | N/A (disabled) |
| Avg context precision | N/A (disabled) |
| Avg context recall | N/A (disabled) |
| Avg entity recall | 1.000 |
| Avg NDCG@5 | 0.083 |
| Avg MRR | 0.083 |
| Avg Precision@5 | 0.033 |
| Avg Recall@5 | 0.083 |
| Avg response time | 22943 ms |
| Total eval duration | 310.6 s |
| Safety refusal accuracy | 100.0% |
Note on retrieval metrics (NDCG@5, MRR, Precision@5, Recall@5): These values appear low because the golden evaluation framework defines `expected_source_urls` at a coarse level (e.g. `/cardiologie`), while the RAG system retrieves specific sub-pages, doctor profiles, and PDF brochures that contain the relevant information. Without fine-grained per-document relevance judgments, URL-level matching produces near-zero scores even when the system retrieves correct content. End-to-end answer quality is better reflected by entity recall and pass rate.
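The effect described in this note can be illustrated with a small sketch. The URLs are hypothetical, and both matching rules (`exact_match`, `prefix_match`) are assumptions made for illustration; the actual matcher used by the evaluation framework is not shown in this report.

```python
from urllib.parse import urlparse


def exact_match(url, expected_path):
    """URL counts as relevant only if its path equals the expected path."""
    return urlparse(url).path.rstrip("/") == expected_path.rstrip("/")


def prefix_match(url, expected_path):
    """URL counts as relevant if it is the expected path or lies beneath it."""
    path = urlparse(url).path.rstrip("/")
    expected = expected_path.rstrip("/")
    return path == expected or path.startswith(expected + "/")


def reciprocal_rank(retrieved, expected_paths, match):
    """Return 1/rank of the first relevant URL, or 0.0 if none is relevant."""
    for rank, url in enumerate(retrieved, start=1):
        if any(match(url, exp) for exp in expected_paths):
            return 1.0 / rank
    return 0.0


# Hypothetical golden entry and retrieval result (not taken from the run).
expected_paths = ["/cardiologie"]
retrieved = [
    "https://hospital.example/cardiologie/team/profile-1",
    "https://hospital.example/cardiologie/brochures/leaflet.pdf",
]

rr_exact = reciprocal_rank(retrieved, expected_paths, exact_match)
rr_prefix = reciprocal_rank(retrieved, expected_paths, prefix_match)
```

Under these assumptions the exact rule scores the query at 0 while the prefix rule credits the first hit, which is the kind of gap the note attributes to coarse golden URLs.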
Statistical Analysis
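95% bootstrap confidence intervals (10,000 resamples, percentile method). Narrower intervals indicate more precise estimates, but the zero-width intervals for entity recall and pass rate only show that all 13 sampled values were identical. Retrieval metrics use n = 12 because GQ-072 has no retrieval scores in the detailed results.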
| Metric | Mean | 95% CI | Width | n |
|---|---|---|---|---|
| Entity Recall | 1.000 | [1.000, 1.000] | 0.000 | 13 |
| NDCG@5 | 0.083 | [0.000, 0.250] | 0.250 | 12 |
| MRR | 0.083 | [0.000, 0.250] | 0.250 | 12 |
| Precision@5 | 0.033 | [0.000, 0.100] | 0.100 | 12 |
| Recall@5 | 0.083 | [0.000, 0.250] | 0.250 | 12 |
| Pass Rate | 1.000 | [1.000, 1.000] | 0.000 | 13 |
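As a rough illustration of how a percentile bootstrap interval of this kind can be computed, here is a minimal sketch. The function name `bootstrap_ci`, the fixed seed, and the index-based percentile lookup are assumptions; the code in run_evaluation.py may differ.

```python
import random


def bootstrap_ci(values, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        sum(rng.choices(values, k=n)) / n for _ in range(n_resamples)
    )
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return means[lo_idx], means[hi_idx]


# Per-question values copied from the Detailed Results table.
entity_recall = [1.0] * 13
ndcg_at_5 = [1.0] + [0.0] * 11  # GQ-064 scored 1.00, the other 11 scored 0.00

entity_ci = bootstrap_ci(entity_recall)
ndcg_ci = bootstrap_ci(ndcg_at_5)
```

When every input value is identical, each resample has the same mean, so the interval collapses to a single point regardless of the sample size.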
System Configuration
Configuration snapshot at evaluation time. Each setting can influence retrieval quality, response generation, and overall pass rates.
Git Context
| Property | Value |
|---|---|
| Branch | master |
| Commit | 8e52e54 |
| Message | fix(W4-2): CRAG rrf_score bug, cross-lingual discount, pymupdf4llm + test coverage |
LLM Models
| Role | Model |
|---|---|
| RAG generation | openai/o4-mini (provider: openrouter) |
| Escalation (Think Harder) | openai/gpt-5.2 |
| Follow-up classification | openai/gpt-4.1-nano |
| Evaluation (DeepEval judge) | openai/gpt-4.1-mini |
| Intent classification | openai/gpt-4.1-mini |
| Safety LLM judge | openai/gpt-4.1-mini |
| Embedding | bge-m3 (1024d, provider: ollama) |
Generation Parameters
| Parameter | Value |
|---|---|
| Temperature | 0.1 |
| Max tokens | 1000 |
| Full-mode temperature | 0.1 |
| Full-mode max tokens | 1500 |
Retrieval Parameters
| Parameter | Value |
|---|---|
| Full mode (always-on reranking) | ON |
| Rerank candidates | 20 |
| Escalation candidates | 100 |
| Escalation min similarity | 0.35 |
| Escalation rerank top-k | 20 |
| Context assembly max tokens | 8000 |
| Context expand window | 1 chunk |
| BM25 hybrid search | ON (weight: 0.3) |
| Vector weight | 0.7 |
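One way the two weights above could combine a BM25 ranking with a vector ranking is weighted reciprocal rank fusion. This is a sketch under assumptions: the report does not state the fusion formula, and the function `weighted_rrf`, the constant `k=60`, and the document IDs are hypothetical.

```python
def weighted_rrf(vector_ranked, bm25_ranked, w_vector=0.7, w_bm25=0.3, k=60):
    """Fuse two ranked lists of document IDs into one ranking.

    Each list contributes weight / (k + rank) per document; k is an
    assumed smoothing constant, not a value taken from this report.
    """
    scores = {}
    for weight, ranking in ((w_vector, vector_ranked), (w_bm25, bm25_ranked)):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Hypothetical rankings from the two retrievers.
fused = weighted_rrf(
    vector_ranked=["a", "b", "c"],
    bm25_ranked=["c", "a", "d"],
)
```

Documents that appear in both lists accumulate score from each, so they tend to outrank documents found by only one retriever.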
Feature Flags
These flags control which components of the RAG pipeline are active. Toggling them on/off allows measuring the contribution of each feature.
| Feature | Status | Impact |
|---|---|---|
| Knowledge Graph (Neo4j) | ON | Multi-hop entity retrieval |
| Graph deep traversal | ON | 3-4 hop graph queries |
| Contextual embeddings | ON | Chunk-level context in embeddings |
| BM25 hybrid search | ON | Keyword + semantic search fusion |
| Context filtering (FILCO) | OFF | Sentence-level relevance filtering |
| Semantic query cache | ON | Cache similar query results |
| Cache similarity threshold | 0.97 | Min cosine for cache hit |
| Intent classification | ON | Safety guardrail pre-filter |
| Safety validation | ON | Post-generation safety check |
| Safety LLM judge | ON | LLM-as-judge defense-in-depth |
| Quality evaluation | ON | Background quality scoring |
| Auto-refusal on low quality | ON | Refuse if score < 0.4 |
| True token streaming | ON | Real-time token delivery |
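The semantic query cache row implies a lookup of roughly the following shape: a cached answer is reused only if the new query embedding is close enough to a cached one. This is a minimal sketch; `cache_lookup`, the in-memory list of `(embedding, answer)` pairs, and the toy 2-dimensional vectors are assumptions, not the pipeline's actual cache.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cache_lookup(query_vec, cache, threshold=0.97):
    """Return the most similar cached answer at or above threshold, else None."""
    best_answer, best_sim = None, threshold
    for cached_vec, answer in cache:
        sim = cosine(query_vec, cached_vec)
        if sim >= best_sim:
            best_answer, best_sim = answer, sim
    return best_answer


# Toy cache with one entry; real embeddings would be 1024-dimensional.
cache = [([1.0, 0.0], "cached answer")]
hit = cache_lookup([1.0, 0.1], cache)   # cosine is about 0.995
miss = cache_lookup([1.0, 0.5], cache)  # cosine is about 0.894
```

A threshold as high as 0.97 means only near-duplicate queries reuse a cached result; anything less similar goes through full retrieval and generation.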
Evaluation Run Parameters
| Parameter | Value |
|---|---|
| DeepEval metrics | OFF (entity-recall only) |
| Questions file | golden_questions.json |
| ID filter | GQ-061, GQ-062, GQ-063, GQ-066, GQ-067, GQ-068, GQ-069, GQ-071, GQ-072, GQ-096, GQ-128 |
Results by Category
| Category | Pass | Fail | Error | Total | Rate |
|---|---|---|---|---|---|
| ambiguous_symptom | 2 | 0 | 0 | 2 | 100.0% |
| condition_department | 1 | 0 | 0 | 1 | 100.0% |
| followup_chain | 6 | 0 | 0 | 6 | 100.0% |
| multilingual | 3 | 0 | 0 | 3 | 100.0% |
| taxonomy_alias | 1 | 0 | 0 | 1 | 100.0% |
Timing Analysis
Response time distribution across all evaluated questions.
| Percentile | Response Time |
|---|---|
| Min | 11551 ms |
| P50 (median) | 21987 ms |
| P90 | 40173 ms |
| P99 | 41452 ms |
| Max | 41452 ms |
| Mean | 22943 ms |
Response Time by Category
| Category | Mean | Median | Max | Count |
|---|---|---|---|---|
| ambiguous_symptom | 35358 ms | 40173 ms | 40173 ms | 2 |
| condition_department | 41452 ms | 41452 ms | 41452 ms | 1 |
| followup_chain | 20281 ms | 21987 ms | 26866 ms | 6 |
| multilingual | 11655 ms | 11552 ms | 11862 ms | 3 |
| taxonomy_alias | 29439 ms | 29439 ms | 29439 ms | 1 |
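The percentile and median values in the two timing tables are consistent with a simple index-based rule on the sorted response times. The sketch below applies that rule to the times listed under Detailed Results. The rule itself is inferred from the reported numbers and is an assumption, not code taken from run_evaluation.py.

```python
def nearest_rank(sorted_values, p):
    """Return the element at index int(p * n), clamped to the last index."""
    n = len(sorted_values)
    return sorted_values[min(int(p * n), n - 1)]


# Response times (ms) copied from the Detailed Results table.
times_ms = sorted([
    11551, 11862, 11552, 14057, 18358, 21987, 26866,
    22861, 17555, 40173, 30544, 29439, 41452,
])

summary = {
    "min": times_ms[0],
    "p50": nearest_rank(times_ms, 0.50),
    "p90": nearest_rank(times_ms, 0.90),
    "p99": nearest_rank(times_ms, 0.99),
    "max": times_ms[-1],
    "mean": round(sum(times_ms) / len(times_ms)),
}

# ambiguous_symptom has only two samples; the same rule picks the upper one.
ambiguous_median = nearest_rank(sorted([40173, 30544]), 0.50)
```

With this rule the median of an even-sized group is the upper of the two middle values rather than their average, which would explain why ambiguous_symptom reports a median of 40173 ms next to a mean of 35358 ms. With 13 samples, P99 necessarily coincides with the maximum.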
Detailed Results
Evaluated 13 questions. DeepEval metrics disabled (entity-recall only).
| ID | Category | Status | Entity Recall | NDCG@5 | MRR | Faithfulness | Relevancy | Ctx Prec | Ctx Recall | Time (ms) | Citations |
|---|---|---|---|---|---|---|---|---|---|---|---|
| GQ-061 | multilingual | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 11551 | 2 |
| GQ-062 | multilingual | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 11862 | 8 |
| GQ-063 | multilingual | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 11552 | 1 |
| GQ-064 | followup_chain | PASS | 1.00 | 1.00 | 1.00 | — | — | — | — | 14057 | 2 |
| GQ-065 | followup_chain | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 18358 | 4 |
| GQ-066 | followup_chain | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 21987 | 9 |
| GQ-067 | followup_chain | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 26866 | 3 |
| GQ-068 | followup_chain | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 22861 | 8 |
| GQ-069 | followup_chain | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 17555 | 9 |
| GQ-071 | ambiguous_symptom | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 40173 | 5 |
| GQ-072 | ambiguous_symptom | PASS | 1.00 | — | — | — | — | — | — | 30544 | 0 |
| GQ-096 | taxonomy_alias | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 29439 | 8 |
| GQ-128 | condition_department | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 41452 | 2 |
Generated by run_evaluation.py at 2026-02-20 15:12 UTC.