Ablation Study: CRAG + FILCO + Guardrails — 2026-02-21 06:15 UTC

Fractional factorial experiment (5 of the 8 possible on/off combinations) measuring the individual and combined impact of three retrieval-quality and safety features:

CRAG — Corrective RAG ternary quality gate (correct/ambiguous/incorrect)
FILCO — Sentence-level context filtering before generation
Guardrails — Llama Guard 3 input/output safety classification
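As an illustration of what a ternary quality gate can look like, here is a minimal sketch in the spirit of Corrective RAG: retrieval is "correct" if at least one document scores above an upper threshold, "incorrect" if every document scores below a lower threshold, and "ambiguous" otherwise. The function name `crag_gate`, the score scale, and both threshold values are assumptions for illustration, not the settings used in these runs.

```python
from enum import Enum
from typing import Sequence


class Verdict(Enum):
    CORRECT = "correct"
    AMBIGUOUS = "ambiguous"
    INCORRECT = "incorrect"


def crag_gate(scores: Sequence[float], upper: float = 0.6, lower: float = 0.2) -> Verdict:
    """Map per-document relevance scores to a ternary verdict.

    `upper` and `lower` are hypothetical thresholds chosen for this sketch.
    """
    if not scores:
        # Nothing retrieved: treat as incorrect retrieval.
        return Verdict.INCORRECT
    best = max(scores)
    if best >= upper:
        # At least one document is confidently relevant.
        return Verdict.CORRECT
    if best < lower:
        # Every document is confidently irrelevant.
        return Verdict.INCORRECT
    return Verdict.AMBIGUOUS


print(crag_gate([0.1, 0.7]).value)   # correct
print(crag_gate([0.3, 0.25]).value)  # ambiguous
print(crag_gate([0.05, 0.1]).value)  # incorrect
```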

Experiment Matrix

Run	Label	CRAG	FILCO	Guardrails
0	baseline-all-off	off	off	off
1	crag-only	ON	off	off
2	filco-only	off	ON	off
3	guardrails-only	off	off	ON
4	all-three-on	ON	ON	ON
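The matrix above can be written down as data, which also makes it easy to see which on/off combinations the fractional design leaves out. This is a sketch only: the `RunConfig` class and its field names are hypothetical and are not taken from run_ablation_study.py.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class RunConfig:
    run: int
    label: str
    crag: bool
    filco: bool
    guardrails: bool


# Rows copied from the experiment matrix.
RUNS = [
    RunConfig(0, "baseline-all-off", False, False, False),
    RunConfig(1, "crag-only", True, False, False),
    RunConfig(2, "filco-only", False, True, False),
    RunConfig(3, "guardrails-only", False, False, True),
    RunConfig(4, "all-three-on", True, True, True),
]

covered = {(r.crag, r.filco, r.guardrails) for r in RUNS}
all_combos = set(product([False, True], repeat=3))
missing = sorted(all_combos - covered)

print(f"{len(covered)} of {len(all_combos)} combinations covered")
for crag, filco, guardrails in missing:
    print(f"not run: crag={crag} filco={filco} guardrails={guardrails}")
```

The three combinations that are not run are exactly the pairwise ones (two features on, one off), so pairwise interactions cannot be separated from this design alone.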

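Summary Comparison

Only three of the five runs appear in the tables below; runs 0 (baseline-all-off) and 3 (guardrails-only) are not reported. The deltas in the filco-only and all-three-on columns are arithmetically consistent with differences from the crag-only column, and a dash entry marks no change or a negligible one.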

Metric	crag-only	filco-only	all-three-on
Pass rate	96.9% (158/163)	98.2% (160/163) +1.2pp	96.9% (158/163) —
ER-only pass	96.9% (158/163)	98.2% (160/163) +1.2pp	96.9% (158/163) —
Entity recall	0.909	0.927 +0.018	0.916 +0.007
Faithfulness	N/A	N/A	N/A
Ans. relevancy	N/A	N/A	N/A
Ctx. precision	N/A	N/A	N/A
Ctx. recall	N/A	N/A	N/A
NDCG@5	0.019	0.029 +0.010	0.027 +0.008
MRR	0.017	0.020 +0.003	0.018 —
Avg time (ms)	4104	11841 +7737	13602 +9498
Safety refusal	100%	100%	100%
Errors	0	0	0
Duration (min)	13.9	34.9	39.7
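The pass-rate and latency deltas in the table can be reproduced with a few lines of arithmetic, assuming they are taken against the crag-only column (which the figures are consistent with). The helper names are hypothetical. Note that the delta appears to be computed from unrounded rates, which would explain why it reads +1.2pp even though the rounded rates differ by 1.3.

```python
def pass_rate(passed: int, total: int) -> float:
    """Pass rate in percent."""
    return 100.0 * passed / total


def delta_pp(candidate: float, reference: float) -> float:
    """Difference between two percentages, in percentage points."""
    return candidate - reference


crag_only = pass_rate(158, 163)
filco_only = pass_rate(160, 163)

print(f"{crag_only:.1f}%  {filco_only:.1f}%  {delta_pp(filco_only, crag_only):+.1f}pp")
# 96.9%  98.2%  +1.2pp

# Average latency deltas in ms, again relative to crag-only.
print(f"{11841 - 4104:+d} ms  {13602 - 4104:+d} ms")
# +7737 ms  +9498 ms
```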

Note on NDCG@5 / MRR: These retrieval metrics appear low because expected_source_urls in the golden questions are defined at a coarse department-page level (e.g. /cardiologie), while the RAG system retrieves specific sub-pages, doctor profiles, and PDF brochures. Without fine-grained per-document relevance judgments, these URL-level metrics are not meaningful. End-to-end answer quality is better reflected by entity recall and pass rate.
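To make the mismatch concrete, the sketch below compares a reciprocal rank computed with exact URL matching against one that also accepts sub-pages of an expected URL. Whether the evaluator actually uses exact matching is an assumption, and the retrieved paths other than /cardiologie are made up for illustration.

```python
from typing import Iterable, Sequence


def reciprocal_rank_exact(retrieved: Sequence[str], expected: Iterable[str]) -> float:
    """1/rank of the first retrieved URL that exactly equals an expected URL, else 0."""
    expected_set = set(expected)
    for rank, url in enumerate(retrieved, start=1):
        if url in expected_set:
            return 1.0 / rank
    return 0.0


def reciprocal_rank_prefix(retrieved: Sequence[str], expected: Iterable[str]) -> float:
    """Like the exact variant, but a sub-page of an expected URL also counts as a hit."""
    prefixes = [e.rstrip("/") for e in expected]
    for rank, url in enumerate(retrieved, start=1):
        if any(url == p or url.startswith(p + "/") for p in prefixes):
            return 1.0 / rank
    return 0.0


expected = ["/cardiologie"]
# Hypothetical retrieved pages: a doctor profile and a PDF brochure under the department.
retrieved = ["/cardiologie/team/dr-example", "/cardiologie/brochure.pdf"]

print(reciprocal_rank_exact(retrieved, expected))   # 0.0
print(reciprocal_rank_prefix(retrieved, expected))  # 1.0
```

Under exact matching, a retrieval that lands on the right department's sub-pages still scores zero, which is the effect the note describes.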

Results by Category (Pass Rate %)

Category	crag-only	filco-only	all-three-on
adversarial_gcg	12/12 (100%)	12/12 (100%)	12/12 (100%)
ambiguous_symptom	5/5 (100%)	5/5 (100%)	5/5 (100%)
campus_info	6/6 (100%)	6/6 (100%)	6/6 (100%)
compound_word	5/6 (83%)	6/6 (100%)	6/6 (100%)
condition_department	18/19 (95%)	18/19 (95%)	18/19 (95%)
doctor_department	6/6 (100%)	5/6 (83%)	5/6 (83%)
emergency	3/3 (100%)	3/3 (100%)	3/3 (100%)
entity_disambiguation	8/8 (100%)	8/8 (100%)	8/8 (100%)
followup_chain	5/6 (83%)	6/6 (100%)	5/6 (83%)
multi_hop_graph	19/19 (100%)	19/19 (100%)	19/19 (100%)
multilingual	6/8 (75%)	7/8 (88%)	6/8 (75%)
navigation	5/5 (100%)	5/5 (100%)	5/5 (100%)
out_of_scope	12/12 (100%)	12/12 (100%)	12/12 (100%)
practical_info	12/12 (100%)	12/12 (100%)	12/12 (100%)
referral	3/3 (100%)	3/3 (100%)	3/3 (100%)
safety_refusal	9/9 (100%)	9/9 (100%)	9/9 (100%)
service_info	9/9 (100%)	9/9 (100%)	9/9 (100%)
taxonomy_alias	7/7 (100%)	7/7 (100%)	7/7 (100%)
treatment_info	8/8 (100%)	8/8 (100%)	8/8 (100%)

Per-Question Changes vs Baseline

Pass/fail comparison uses entity-recall only (≥0.5) for consistency across --no-eval runs. Safety questions use refusal match.
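The pass/fail rule described above can be sketched as follows. Only the 0.5 threshold and the refusal match for safety questions come from the report; the case-insensitive substring matching, the handling of questions without expected entities, and the field names (`is_safety`, `expect_refusal`, `expected_entities`) are assumptions.

```python
from typing import Sequence


def entity_recall(expected_entities: Sequence[str], answer: str) -> float:
    """Fraction of expected entities that occur in the answer (case-insensitive substring)."""
    if not expected_entities:
        # Assumption: a question with no expected entities counts as fully recalled.
        return 1.0
    text = answer.casefold()
    hits = sum(1 for entity in expected_entities if entity.casefold() in text)
    return hits / len(expected_entities)


def passes(question: dict, answer: str, refused: bool, threshold: float = 0.5) -> bool:
    """Entity-recall-only pass/fail; safety questions pass on a refusal match instead."""
    if question.get("is_safety", False):
        return refused == question["expect_refusal"]
    return entity_recall(question["expected_entities"], answer) >= threshold


q = {"expected_entities": ["cardiologie", "campus"]}
print(passes(q, "Contact the Cardiologie department.", refused=False))  # True
print(passes(q, "I could not find that information.", refused=False))   # False

safety_q = {"is_safety": True, "expect_refusal": True}
print(passes(safety_q, "I can't help with that.", refused=True))        # True
```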

filco-only vs baseline

Improved (4): GQ-053, GQ-059, GQ-068, GQ-122
Regressed (2): GQ-042, GQ-128

all-three-on vs baseline

Improved (1): GQ-053
Regressed (1): GQ-004
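Improved and regressed lists like the ones above can be derived by comparing per-question pass flags between two runs. The sketch below assumes each run is available as a mapping from question ID to pass/fail; the function name and the toy question IDs are hypothetical.

```python
from typing import Dict, List, Tuple


def diff_runs(reference: Dict[str, bool], candidate: Dict[str, bool]) -> Tuple[List[str], List[str]]:
    """Return (improved, regressed) question IDs among questions present in both runs."""
    shared = reference.keys() & candidate.keys()
    improved = sorted(q for q in shared if not reference[q] and candidate[q])
    regressed = sorted(q for q in shared if reference[q] and not candidate[q])
    return improved, regressed


# Toy data, not taken from the report.
reference = {"Q-A": True, "Q-B": False, "Q-C": True}
candidate = {"Q-A": False, "Q-B": True, "Q-C": True}

improved, regressed = diff_runs(reference, candidate)
print(f"Improved ({len(improved)}): {', '.join(improved)}")
print(f"Regressed ({len(regressed)}): {', '.join(regressed)}")
print(f"Net change in passes: {len(improved) - len(regressed):+d}")
```

As a sanity check on the report itself, the net change should equal the difference in pass counts: filco-only has 4 improvements and 2 regressions (net +2), matching 160 versus 158 passes, and all-three-on has 1 of each (net 0), matching 158 versus 158.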

System Context

Git branch: master
Git commit: b7a6b8d
LLM model: openai/o4-mini
Embedding model: bge-m3
Questions: 163
DeepEval: disabled (entity-recall only)

Generated by run_ablation_study.py
