Ablation Study: CRAG + FILCO + Guardrails — 2026-02-20 20:34 UTC
Fractional factorial experiment (5 of the 8 possible on/off combinations) measuring the individual and combined impact of three retrieval-quality features:
- CRAG — Corrective RAG ternary quality gate (correct/ambiguous/incorrect)
- FILCO — Sentence-level context filtering before generation
- Guardrails — Llama Guard 3 input/output safety classification
Experiment Matrix
| Run | Label | CRAG | FILCO | Guard |
|---|---|---|---|---|
| 0 | baseline-all-off | off | off | off |
| 1 | crag-only | ON | off | off |
| 2 | filco-only | off | ON | off |
| 3 | guardrails-only | off | off | ON |
| 4 | all-three-on | ON | ON | ON |
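The matrix above can be checked against the full 2^3 design to see which combinations were not run. The following is a minimal sketch; the `RUNS` dict and its tuple layout are illustrative assumptions and are not taken from run_ablation_study.py.

```python
from itertools import product

# Run labels and flags copied from the experiment matrix.
# Tuple order is (CRAG, FILCO, Guard); this layout is an assumption
# made for illustration, not the script's actual data structure.
RUNS = {
    "baseline-all-off": (False, False, False),
    "crag-only": (True, False, False),
    "filco-only": (False, True, False),
    "guardrails-only": (False, False, True),
    "all-three-on": (True, True, True),
}

# Full 2^3 factorial design over the three on/off features.
full_design = set(product([False, True], repeat=3))
covered = set(RUNS.values())
missing = sorted(full_design - covered)

print(f"covered {len(covered)} of {len(full_design)} combinations")
for crag, filco, guard in missing:
    print(f"not run: CRAG={crag} FILCO={filco} Guard={guard}")
```

The three combinations that were not run are the pairwise ones (exactly two features on), so this design on its own cannot attribute the all-three-on result to any specific pair of features.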
Summary Comparison
| Metric | baseline-all-off | crag-only | filco-only | guardrails-only | all-three-on |
|---|---|---|---|---|---|
| Pass rate | 95.7% (156/163) | 98.2% (160/163) +2.5pp | 98.2% (160/163) +2.5pp | 99.4% (162/163) +3.7pp | 96.3% (157/163) +0.6pp |
| Entity recall | 0.937 | 0.946 +0.009 | 0.933 -0.004 | 0.945 +0.008 | 0.926 -0.011 |
| Faithfulness | 0.941 | 0.938 -0.004 | 0.942 — | 0.959 +0.018 | 0.923 -0.018 |
| Ans. relevancy | 0.776 | 0.788 +0.012 | 0.774 -0.002 | 0.800 +0.023 | 0.776 — |
| Ctx. precision | 0.460 | 0.369 -0.091 | 0.400 -0.060 | 0.410 -0.050 | 0.425 -0.035 |
| Ctx. recall | 0.417 | 0.342 -0.075 | 0.417 — | 0.390 -0.028 | 0.426 +0.009 |
| NDCG@5 | 0.000 | 0.000 — | 0.000 — | 0.000 — | 0.000 — |
| MRR | 0.000 | 0.000 — | 0.000 — | 0.000 — | 0.000 — |
| Avg time (ms) | 15022 | 15751 +729 | 10664 -4358 | 11577 -3444 | 22501 +7479 |
| Safety refusal | 100% | 100% | 100% | 100% | 100% |
| Errors | 0 | 0 | 0 | 0 | 0 |
| Duration (min) | 73.9 | 76.4 | 61.9 | 64.0 | 89.3 |
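The percentage-point deltas in the pass-rate row follow directly from the pass counts. A minimal sketch of that arithmetic, using the counts from the table (the variable and function names are hypothetical):

```python
TOTAL_QUESTIONS = 163

# Pass counts copied from the "Pass rate" row of the summary table.
passed = {
    "baseline-all-off": 156,
    "crag-only": 160,
    "filco-only": 160,
    "guardrails-only": 162,
    "all-three-on": 157,
}


def pass_rate_pct(n_passed, total=TOTAL_QUESTIONS):
    """Pass rate as a percentage of all questions."""
    return 100.0 * n_passed / total


baseline_pct = pass_rate_pct(passed["baseline-all-off"])

# Delta versus baseline in percentage points, rounded to one decimal.
deltas_pp = {
    label: round(pass_rate_pct(n) - baseline_pct, 1)
    for label, n in passed.items()
    if label != "baseline-all-off"
}

for label, delta in deltas_pp.items():
    print(f"{label}: {delta:+.1f}pp")
```

With 163 questions, a single question is worth about 0.61pp, so the deltas in this row correspond to between one and six questions changing outcome.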
Note on NDCG@5 / MRR: These retrieval metrics are 0.000 in every run because the `expected_source_urls` in the golden questions are defined at a coarse department-page level (e.g. `/cardiologie`), while the RAG system retrieves specific sub-pages, doctor profiles, and PDF brochures. Without fine-grained per-document relevance judgments, these URL-level metrics are not meaningful. End-to-end answer quality is better reflected by entity recall and pass rate.
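To illustrate the mismatch described in this note, the sketch below computes a reciprocal rank under two matching rules. It assumes the evaluation compares URLs by exact equality, which this report does not confirm. The retrieved paths are made-up examples, and the prefix rule is only one possible alternative, not something the pipeline is known to implement.

```python
def reciprocal_rank(retrieved, expected, is_match):
    """Return 1/rank of the first retrieved URL that matches any expected URL."""
    for rank, url in enumerate(retrieved, start=1):
        if any(is_match(url, exp) for exp in expected):
            return 1.0 / rank
    return 0.0


def exact_match(retrieved_url, expected_url):
    # Assumed current behaviour: a hit only when the URLs are identical.
    return retrieved_url == expected_url


def prefix_match(retrieved_url, expected_url):
    # Hypothetical alternative: count a sub-page of the expected
    # department page as relevant.
    base = expected_url.rstrip("/")
    return retrieved_url == base or retrieved_url.startswith(base + "/")


# Department-level expectation, as in the golden questions.
expected = ["/cardiologie"]
# Illustrative retrieved paths (invented for this example).
retrieved = [
    "/cardiologie/team/dr-example",
    "/cardiologie/brochure.pdf",
    "/neurologie",
]

rr_exact = reciprocal_rank(retrieved, expected, exact_match)
rr_prefix = reciprocal_rank(retrieved, expected, prefix_match)
print(rr_exact, rr_prefix)
```

Under exact matching, no sub-page ever equals the department URL, so every question scores zero and the averaged MRR is 0.000 regardless of retrieval quality.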
Results by Category (Pass Rate %)
| Category | baseline-all-off | crag-only | filco-only | guardrails-only | all-three-on |
|---|---|---|---|---|---|
| adversarial_gcg | 12/12 (100%) | 12/12 (100%) | 12/12 (100%) | 12/12 (100%) | 12/12 (100%) |
| ambiguous_symptom | 4/5 (80%) | 5/5 (100%) | 5/5 (100%) | 5/5 (100%) | 5/5 (100%) |
| campus_info | 6/6 (100%) | 6/6 (100%) | 6/6 (100%) | 6/6 (100%) | 6/6 (100%) |
| compound_word | 6/6 (100%) | 6/6 (100%) | 6/6 (100%) | 6/6 (100%) | 6/6 (100%) |
| condition_department | 18/19 (95%) | 18/19 (95%) | 19/19 (100%) | 19/19 (100%) | 17/19 (89%) |
| doctor_department | 5/6 (83%) | 6/6 (100%) | 5/6 (83%) | 5/6 (83%) | 6/6 (100%) |
| emergency | 2/3 (67%) | 3/3 (100%) | 2/3 (67%) | 3/3 (100%) | 3/3 (100%) |
| entity_disambiguation | 8/8 (100%) | 8/8 (100%) | 8/8 (100%) | 8/8 (100%) | 8/8 (100%) |
| followup_chain | 6/6 (100%) | 6/6 (100%) | 6/6 (100%) | 6/6 (100%) | 6/6 (100%) |
| multi_hop_graph | 19/19 (100%) | 19/19 (100%) | 19/19 (100%) | 19/19 (100%) | 19/19 (100%) |
| multilingual | 8/8 (100%) | 7/8 (88%) | 7/8 (88%) | 8/8 (100%) | 7/8 (88%) |
| navigation | 4/5 (80%) | 5/5 (100%) | 5/5 (100%) | 5/5 (100%) | 5/5 (100%) |
| out_of_scope | 12/12 (100%) | 12/12 (100%) | 12/12 (100%) | 12/12 (100%) | 11/12 (92%) |
| practical_info | 11/12 (92%) | 11/12 (92%) | 12/12 (100%) | 12/12 (100%) | 10/12 (83%) |
| referral | 3/3 (100%) | 3/3 (100%) | 3/3 (100%) | 3/3 (100%) | 3/3 (100%) |
| safety_refusal | 9/9 (100%) | 9/9 (100%) | 9/9 (100%) | 9/9 (100%) | 9/9 (100%) |
| service_info | 9/9 (100%) | 9/9 (100%) | 9/9 (100%) | 9/9 (100%) | 9/9 (100%) |
| taxonomy_alias | 7/7 (100%) | 7/7 (100%) | 7/7 (100%) | 7/7 (100%) | 7/7 (100%) |
| treatment_info | 7/8 (88%) | 8/8 (100%) | 8/8 (100%) | 8/8 (100%) | 8/8 (100%) |
Per-Question Changes vs Baseline
crag-only vs baseline
- Improved (5): GQ-005, GQ-028, GQ-029, GQ-071, GQ-104
- Regressed (1): GQ-059
filco-only vs baseline
- Improved (6): GQ-005, GQ-016, GQ-029, GQ-071, GQ-104, GQ-122
- Regressed (2): GQ-004, GQ-059
guardrails-only vs baseline
- Improved (7): GQ-005, GQ-016, GQ-028, GQ-029, GQ-071, GQ-104, GQ-122
- Regressed (1): GQ-004
all-three-on vs baseline
- Improved (5): GQ-005, GQ-028, GQ-029, GQ-071, GQ-104
- Regressed (4): GQ-043, GQ-059, GQ-086, GQ-133
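The improved and regressed lists above can be derived by comparing per-question pass/fail outcomes between a run and the baseline. A minimal sketch follows, assuming results are available as a mapping from question ID to a boolean. The function name and data layout are hypothetical, and the sample data is a subset reconstructed from the crag-only lists in this report rather than from raw results.

```python
def diff_vs_baseline(baseline, variant):
    """Return (improved, regressed) question IDs, each sorted.

    improved:  failed in baseline, passed in variant
    regressed: passed in baseline, failed in variant
    """
    improved = sorted(
        q for q, ok in baseline.items() if not ok and variant.get(q, False)
    )
    regressed = sorted(
        q for q, ok in baseline.items() if ok and not variant.get(q, False)
    )
    return improved, regressed


# Subset of questions whose outcome differs from "pass in both runs".
# The baseline failed 7 questions (163 - 156); GQ-059 passed in baseline.
baseline = {
    "GQ-005": False, "GQ-016": False, "GQ-028": False, "GQ-029": False,
    "GQ-071": False, "GQ-104": False, "GQ-122": False, "GQ-059": True,
}
crag_only = {
    "GQ-005": True, "GQ-016": False, "GQ-028": True, "GQ-029": True,
    "GQ-071": True, "GQ-104": True, "GQ-122": False, "GQ-059": False,
}

improved, regressed = diff_vs_baseline(baseline, crag_only)
print("Improved:", improved)
print("Regressed:", regressed)

# Consistency check against the summary table: 156 + 5 - 1 = 160.
net_passed = 156 + len(improved) - len(regressed)
print("crag-only passes:", net_passed)
```

The same check holds for the other runs: baseline passes plus improved minus regressed gives 160 for filco-only, 162 for guardrails-only, and 157 for all-three-on.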
System Context
- Git branch: master
- Git commit: 2f17c29
- LLM model: openai/o4-mini
- Embedding model: bge-m3
- Questions: 163
- DeepEval: enabled
Generated by run_ablation_study.py