Ablation Study: CRAG + FILCO + Guardrails — 2026-02-21 08:01 UTC
Fractional factorial experiment measuring the individual and combined impact of three retrieval-quality features:
- CRAG — Corrective RAG ternary quality gate (correct/ambiguous/incorrect)
- FILCO — Sentence-level context filtering before generation
- Guardrails — Llama Guard 3 input/output safety classification
Experiment Matrix
| Run | Label | CRAG | FILCO | Guard |
|---|---|---|---|---|
| 0 | baseline-all-off | off | off | off |
| 1 | crag-only | ON | off | off |
| 2 | filco-only | off | ON | off |
| 3 | guardrails-only | off | off | ON |
| 4 | all-three-on | ON | ON | ON |
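
The matrix above covers five of the eight possible on/off combinations of the three features. The sketch below shows one way such a run list could be encoded and how to list the combinations that were not run. The `RunConfig` class, its field names, and `untested_combinations` are illustrative assumptions, not the structures used by `run_ablation_study.py`.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class RunConfig:
    """Hypothetical container for one ablation run (names are assumptions)."""
    label: str
    crag: bool
    filco: bool
    guard: bool


# The five runs listed in the experiment matrix.
RUNS = [
    RunConfig("baseline-all-off", False, False, False),
    RunConfig("crag-only", True, False, False),
    RunConfig("filco-only", False, True, False),
    RunConfig("guardrails-only", False, False, True),
    RunConfig("all-three-on", True, True, True),
]


def untested_combinations(runs):
    """Return the (crag, filco, guard) settings that no run in `runs` covers."""
    covered = {(r.crag, r.filco, r.guard) for r in runs}
    return [c for c in product([False, True], repeat=3) if c not in covered]


missing = untested_combinations(RUNS)
```

Here `missing` contains the three two-feature combinations, so this design by itself cannot separate pairwise interaction effects from the combined effect of all three features.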
Summary Comparison
| Metric | crag-only | all-three-on | Δ vs crag-only |
|---|---|---|---|
| Pass rate | 98.8% (160/162) | 97.5% (158/162) | -1.2pp |
| ER-only pass | 98.8% (160/162) | 97.5% (158/162) | -1.2pp |
| Entity recall | 0.930 | 0.920 | -0.010 |
| Faithfulness | N/A | N/A | |
| Ans. relevancy | N/A | N/A | |
| Ctx. precision | N/A | N/A | |
| Ctx. recall | N/A | N/A | |
| NDCG@5 | 0.017 | 0.026 | +0.009 |
| MRR | 0.017 | 0.018 | +0.002 |
| Avg time (ms) | 4005 | 12849 | +8845 |
| Safety refusal | 100% | 100% | |
| Errors | 0 | 0 | |
| Duration (min) | 13.5 | 37.5 | |
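
The pass-rate deltas are given in percentage points (pp). As a quick sanity check, the figures can be reproduced from the raw counts in the table. The helper name `pass_rate_pct` is an assumption made for this sketch.

```python
def pass_rate_pct(passed: int, total: int) -> float:
    """Pass rate as a percentage of all questions."""
    return 100.0 * passed / total


# Raw counts taken from the summary table.
crag_only = pass_rate_pct(160, 162)
all_on = pass_rate_pct(158, 162)

# Percentage-point delta, computed from the unrounded rates.
delta_pp = all_on - crag_only

summary = f"{crag_only:.1f}% -> {all_on:.1f}% ({delta_pp:+.1f}pp)"
```

Computing the delta from the unrounded rates gives -1.2pp, whereas subtracting the rounded values (97.5 - 98.8) would give -1.3. The same rounding effect most likely explains the small mismatches in the MRR and average-time deltas.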
Note on NDCG@5 / MRR: These retrieval metrics appear low because `expected_source_urls` in the golden questions are defined at a coarse department-page level (e.g. `/cardiologie`), while the RAG system retrieves specific sub-pages, doctor profiles, and PDF brochures. Without fine-grained per-document relevance judgments, these URL-level metrics are not meaningful. End-to-end answer quality is better reflected by entity recall and pass rate.
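
To illustrate the effect described in this note, the sketch below scores one retrieval result twice: once with exact URL matching and once treating any page under an expected department path as relevant. It assumes the evaluation uses exact matching, which the report does not confirm. All URLs, function names, and the simplified NDCG normalization are illustrative.

```python
import math


def exact_match(url, expected):
    """A retrieved URL counts only if it equals the expected URL."""
    return url == expected


def prefix_match(url, expected):
    """A retrieved URL counts if it is the expected page or lies beneath it."""
    base = expected.rstrip("/")
    return url == base or url.startswith(base + "/")


def reciprocal_rank(retrieved, expected_urls, match):
    """1/rank of the first relevant result, or 0.0 if none is relevant."""
    for rank, url in enumerate(retrieved, start=1):
        if any(match(url, e) for e in expected_urls):
            return 1.0 / rank
    return 0.0


def ndcg_at_k(retrieved, expected_urls, match, k=5):
    """Binary-gain NDCG@k, normalized against the best reordering of the
    retrieved list (a simplification, since the full relevant set is unknown)."""
    gains = [
        1.0 if any(match(u, e) for e in expected_urls) else 0.0
        for u in retrieved[:k]
    ]
    dcg = sum(g / math.log2(i + 2) for i, g in enumerate(gains))
    ideal = sorted(gains, reverse=True)
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical golden question and retrieval result.
expected = ["/cardiologie"]
retrieved = ["/cardiologie/team/dr-example", "/cardiologie/brochure.pdf", "/neurologie"]

rr_exact = reciprocal_rank(retrieved, expected, exact_match)
rr_prefix = reciprocal_rank(retrieved, expected, prefix_match)
ndcg_exact = ndcg_at_k(retrieved, expected, exact_match)
ndcg_prefix = ndcg_at_k(retrieved, expected, prefix_match)
```

Under exact matching both metrics are zero for this example even though the top two results belong to the expected department. Prefix matching credits them, but it is still only a rough proxy for per-document relevance judgments.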
Results by Category (Pass Rate %)
| Category | crag-only | all-three-on |
|---|---|---|
| adversarial_gcg | 12/12 (100%) | 12/12 (100%) |
| ambiguous_symptom | 5/5 (100%) | 5/5 (100%) |
| campus_info | 6/6 (100%) | 6/6 (100%) |
| compound_word | 6/6 (100%) | 6/6 (100%) |
| condition_department | 18/19 (95%) | 18/19 (95%) |
| doctor_department | 6/6 (100%) | 5/6 (83%) |
| emergency | 3/3 (100%) | 3/3 (100%) |
| entity_disambiguation | 8/8 (100%) | 8/8 (100%) |
| followup_chain | 5/5 (100%) | 5/5 (100%) |
| multi_hop_graph | 19/19 (100%) | 19/19 (100%) |
| multilingual | 7/8 (88%) | 6/8 (75%) |
| navigation | 5/5 (100%) | 5/5 (100%) |
| out_of_scope | 12/12 (100%) | 12/12 (100%) |
| practical_info | 12/12 (100%) | 12/12 (100%) |
| referral | 3/3 (100%) | 3/3 (100%) |
| safety_refusal | 9/9 (100%) | 9/9 (100%) |
| service_info | 9/9 (100%) | 9/9 (100%) |
| taxonomy_alias | 7/7 (100%) | 7/7 (100%) |
| treatment_info | 8/8 (100%) | 8/8 (100%) |
Per-Question Changes vs Baseline
Pass/fail comparison uses entity recall only (≥0.5) for consistency across `--no-eval` runs. Safety questions use refusal match.
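
The pass rule described above can be sketched as follows. The `QuestionResult` fields and the `passed` helper are assumptions for illustration. Only the 0.5 threshold and the split between entity recall and refusal matching come from this report.

```python
from dataclasses import dataclass
from typing import Optional

ENTITY_RECALL_THRESHOLD = 0.5  # threshold stated in the report


@dataclass
class QuestionResult:
    """Hypothetical per-question record (field names are assumptions)."""
    question_id: str
    is_safety: bool
    entity_recall: Optional[float] = None
    expected_refusal: Optional[bool] = None
    refused: Optional[bool] = None


def passed(result: QuestionResult, threshold: float = ENTITY_RECALL_THRESHOLD) -> bool:
    """Safety questions pass on refusal match; all others on entity recall."""
    if result.is_safety:
        return result.refused is not None and result.refused == result.expected_refusal
    return result.entity_recall is not None and result.entity_recall >= threshold


# Illustrative records, not taken from the actual run.
ok = passed(QuestionResult("Q-A", is_safety=False, entity_recall=0.5))
low = passed(QuestionResult("Q-B", is_safety=False, entity_recall=0.49))
safe = passed(QuestionResult("Q-C", is_safety=True, expected_refusal=True, refused=True))
```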
all-three-on vs baseline
- Improved: none
- Regressed (2): GQ-004, GQ-063
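
A per-question comparison of this kind reduces to a set difference over pass/fail outcomes. The sketch below assumes each run is available as a mapping from question ID to a boolean pass flag. The function name `diff_runs` and the sample IDs are hypothetical.

```python
def diff_runs(baseline, candidate):
    """Compare two {question_id: passed} mappings.

    Returns (improved, regressed) as sorted lists of question IDs,
    considering only questions present in both runs.
    """
    shared = baseline.keys() & candidate.keys()
    improved = sorted(q for q in shared if not baseline[q] and candidate[q])
    regressed = sorted(q for q in shared if baseline[q] and not candidate[q])
    return improved, regressed


# Illustrative outcomes, not taken from the actual run.
baseline_run = {"Q-A": True, "Q-B": True, "Q-C": False}
candidate_run = {"Q-A": True, "Q-B": False, "Q-C": False}

improved, regressed = diff_runs(baseline_run, candidate_run)
```

A question that fails in both runs appears in neither list, which is why the regressed count can be smaller than the number of failures in the later run.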
System Context
- Git branch: master
- Git commit: 93df7a7
- LLM model: openai/o4-mini
- Embedding model: bge-m3
- Questions: 162
- DeepEval: disabled (entity-recall only)
Generated by run_ablation_study.py