Ablation Study: CRAG + FILCO + Guardrails — 2026-02-20 20:34 UTC

Fractional factorial experiment (5 of the 8 possible on/off combinations: baseline, each feature alone, and all three together) measuring the individual and combined impact of three retrieval-quality features:

CRAG — Corrective RAG ternary quality gate (correct/ambiguous/incorrect)
FILCO — Sentence-level context filtering before generation
Guardrails — Llama Guard 3 input/output safety classification

Experiment Matrix

Run	Label	CRAG	FILCO	Guardrails
0	baseline-all-off	off	off	off
1	crag-only	ON	off	off
2	filco-only	off	ON	off
3	guardrails-only	off	off	ON
4	all-three-on	ON	ON	ON
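
The matrix above can also be written as data, which makes it easy to check which on/off combinations the study covers. This is a minimal sketch: the `RunConfig` dataclass and its field names are hypothetical and are not taken from run_ablation_study.py.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class RunConfig:
    """One row of the experiment matrix (hypothetical structure)."""
    run: int
    label: str
    crag: bool
    filco: bool
    guard: bool


# Rows copied from the Experiment Matrix table.
MATRIX = [
    RunConfig(0, "baseline-all-off", False, False, False),
    RunConfig(1, "crag-only", True, False, False),
    RunConfig(2, "filco-only", False, True, False),
    RunConfig(3, "guardrails-only", False, False, True),
    RunConfig(4, "all-three-on", True, True, True),
]


def missing_combinations(matrix):
    """Return the (crag, filco, guard) combinations of the full 2^3 design that were not run."""
    covered = {(c.crag, c.filco, c.guard) for c in matrix}
    return [combo for combo in product([False, True], repeat=3) if combo not in covered]


missing = missing_combinations(MATRIX)
```

Here `missing` contains the three pairwise combinations (two features on, one off). Because those runs are absent, pairwise interactions cannot be separated from the three-way effect seen in all-three-on.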

Summary Comparison

Metric	baseline-all-off	crag-only	filco-only	guardrails-only	all-three-on
Pass rate	95.7% (156/163)	98.2% (160/163) +2.5pp	98.2% (160/163) +2.5pp	99.4% (162/163) +3.7pp	96.3% (157/163) +0.6pp
Entity recall	0.937	0.946 +0.009	0.933 -0.004	0.945 +0.008	0.926 -0.011
Faithfulness	0.941	0.938 -0.004	0.942 —	0.959 +0.018	0.923 -0.018
Ans. relevancy	0.776	0.788 +0.012	0.774 -0.002	0.800 +0.023	0.776 —
Ctx. precision	0.460	0.369 -0.091	0.400 -0.060	0.410 -0.050	0.425 -0.035
Ctx. recall	0.417	0.342 -0.075	0.417 —	0.390 -0.028	0.426 +0.009
NDCG@5	0.000	0.000 —	0.000 —	0.000 —	0.000 —
MRR	0.000	0.000 —	0.000 —	0.000 —	0.000 —
Avg time (ms)	15022	15751 +729	10664 -4358	11577 -3444	22501 +7479
Safety refusal	100%	100%	100%	100%	100%
Errors	0	0	0	0	0
Duration (min)	73.9	76.4	61.9	64.0	89.3
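
The pass-rate cells combine a percentage, the raw counts, and a delta in percentage points (pp) against the baseline. The sketch below reproduces that cell format from the counts in the table; the function name `pass_rate_cell` is hypothetical, and computing the delta from unrounded rates is an assumption about how the generator works.

```python
def pass_rate_cell(passed, total, baseline_passed=None):
    """Format a cell such as '98.2% (160/163) +2.5pp' (hypothetical helper)."""
    rate = 100.0 * passed / total
    cell = f"{rate:.1f}% ({passed}/{total})"
    if baseline_passed is not None:
        # Assumption: the delta is taken between unrounded rates, then rounded.
        delta = rate - 100.0 * baseline_passed / total
        cell += f" {delta:+.1f}pp"
    return cell


baseline_cell = pass_rate_cell(156, 163)
crag_cell = pass_rate_cell(160, 163, baseline_passed=156)
guard_cell = pass_rate_cell(162, 163, baseline_passed=156)
all_on_cell = pass_rate_cell(157, 163, baseline_passed=156)
```

If the other deltas are likewise computed before rounding, that would explain cells where the displayed delta differs by 0.001 from the difference of the displayed values (for example Faithfulness 0.938 against 0.941 shown as -0.004). This is an inference from the table, not something the report states.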

Note on NDCG@5 / MRR: These retrieval metrics are 0.000 in every run because expected_source_urls in the golden questions are defined at a coarse department-page level (e.g. /cardiologie), while the RAG system retrieves specific sub-pages, doctor profiles, and PDF brochures, so the retrieved URLs do not match the expected ones. Without fine-grained per-document relevance judgments, these URL-level metrics are not meaningful. Entity recall and pass rate better reflect end-to-end answer quality.
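
The effect described in the note can be illustrated with a small reciprocal-rank calculation. This is a sketch under stated assumptions: the retrieved URLs are invented for illustration, exact string comparison is an assumption about how the evaluation matches URLs, and `prefix_match` is one possible relaxation that the study does not implement.

```python
def reciprocal_rank(retrieved_urls, expected_urls, match):
    """Return 1/rank of the first retrieved URL that matches any expected URL, else 0.0."""
    for rank, url in enumerate(retrieved_urls, start=1):
        if any(match(url, exp) for exp in expected_urls):
            return 1.0 / rank
    return 0.0


def exact_match(retrieved, expected):
    # Assumed current behaviour: URLs must be identical.
    return retrieved == expected


def prefix_match(retrieved, expected):
    # Hypothetical relaxation: a page under the expected path counts as relevant.
    return retrieved == expected or retrieved.startswith(expected.rstrip("/") + "/")


expected = ["/cardiologie"]
# Hypothetical retrieval result: sub-pages and a brochure, never the department page itself.
retrieved = [
    "/cardiologie/team/dr-example",
    "/brochures/example.pdf",
    "/cardiologie/consultations",
]

rr_exact = reciprocal_rank(retrieved, expected, exact_match)
rr_prefix = reciprocal_rank(retrieved, expected, prefix_match)
```

With exact matching the reciprocal rank is 0.0 for this query, and averaging such values over all questions gives an MRR of 0.000. Prefix matching would only help for sub-pages that actually live under the department path; brochures and profiles hosted elsewhere would still need per-document relevance labels.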

Results by Category (Pass Rate %)

Category	baseline-all-off	crag-only	filco-only	guardrails-only	all-three-on
adversarial_gcg	12/12 (100%)	12/12 (100%)	12/12 (100%)	12/12 (100%)	12/12 (100%)
ambiguous_symptom	4/5 (80%)	5/5 (100%)	5/5 (100%)	5/5 (100%)	5/5 (100%)
campus_info	6/6 (100%)	6/6 (100%)	6/6 (100%)	6/6 (100%)	6/6 (100%)
compound_word	6/6 (100%)	6/6 (100%)	6/6 (100%)	6/6 (100%)	6/6 (100%)
condition_department	18/19 (95%)	18/19 (95%)	19/19 (100%)	19/19 (100%)	17/19 (89%)
doctor_department	5/6 (83%)	6/6 (100%)	5/6 (83%)	5/6 (83%)	6/6 (100%)
emergency	2/3 (67%)	3/3 (100%)	2/3 (67%)	3/3 (100%)	3/3 (100%)
entity_disambiguation	8/8 (100%)	8/8 (100%)	8/8 (100%)	8/8 (100%)	8/8 (100%)
followup_chain	6/6 (100%)	6/6 (100%)	6/6 (100%)	6/6 (100%)	6/6 (100%)
multi_hop_graph	19/19 (100%)	19/19 (100%)	19/19 (100%)	19/19 (100%)	19/19 (100%)
multilingual	8/8 (100%)	7/8 (88%)	7/8 (88%)	8/8 (100%)	7/8 (88%)
navigation	4/5 (80%)	5/5 (100%)	5/5 (100%)	5/5 (100%)	5/5 (100%)
out_of_scope	12/12 (100%)	12/12 (100%)	12/12 (100%)	12/12 (100%)	11/12 (92%)
practical_info	11/12 (92%)	11/12 (92%)	12/12 (100%)	12/12 (100%)	10/12 (83%)
referral	3/3 (100%)	3/3 (100%)	3/3 (100%)	3/3 (100%)	3/3 (100%)
safety_refusal	9/9 (100%)	9/9 (100%)	9/9 (100%)	9/9 (100%)	9/9 (100%)
service_info	9/9 (100%)	9/9 (100%)	9/9 (100%)	9/9 (100%)	9/9 (100%)
taxonomy_alias	7/7 (100%)	7/7 (100%)	7/7 (100%)	7/7 (100%)	7/7 (100%)
treatment_info	7/8 (88%)	8/8 (100%)	8/8 (100%)	8/8 (100%)	8/8 (100%)

Per-Question Changes vs Baseline

crag-only vs baseline

Improved (5): GQ-005, GQ-028, GQ-029, GQ-071, GQ-104
Regressed (1): GQ-059

filco-only vs baseline

Improved (6): GQ-005, GQ-016, GQ-029, GQ-071, GQ-104, GQ-122
Regressed (2): GQ-004, GQ-059

guardrails-only vs baseline

Improved (7): GQ-005, GQ-016, GQ-028, GQ-029, GQ-071, GQ-104, GQ-122
Regressed (1): GQ-004

all-three-on vs baseline

Improved (5): GQ-005, GQ-028, GQ-029, GQ-071, GQ-104
Regressed (4): GQ-043, GQ-059, GQ-086, GQ-133
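
The improved and regressed lists in this section can be derived by comparing per-question pass/fail results between two runs. A minimal sketch follows; `diff_vs_baseline` is a hypothetical helper, not the function used by run_ablation_study.py. The failing sets in the example are inferred from the lists in this section: the seven baseline failures are the questions that guardrails-only improved, and the crag-only failures are the two baseline failures it did not fix plus its one regression.

```python
def diff_vs_baseline(baseline, variant):
    """Compare two runs; each argument maps question ID -> passed (bool)."""
    improved = sorted(q for q in baseline if not baseline[q] and variant.get(q, False))
    regressed = sorted(q for q in baseline if baseline[q] and not variant.get(q, False))
    return improved, regressed


# Failing question IDs inferred from the report (7 baseline failures, 3 crag-only failures).
baseline_failed = {"GQ-005", "GQ-016", "GQ-028", "GQ-029", "GQ-071", "GQ-104", "GQ-122"}
crag_failed = {"GQ-016", "GQ-059", "GQ-122"}

# Only questions whose outcome is known to differ or fail are listed; all others pass in both runs.
ids = baseline_failed | crag_failed
baseline = {q: q not in baseline_failed for q in ids}
crag_only = {q: q not in crag_failed for q in ids}

improved, regressed = diff_vs_baseline(baseline, crag_only)
```

The counts are consistent with the summary table: 5 improved and 1 regressed gives a net gain of 4 questions, which matches the move from 156/163 to 160/163.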

System Context

Git branch: master
Git commit: 2f17c29
LLM model: openai/o4-mini
Embedding model: bge-m3
Questions: 163
DeepEval: enabled

Generated by run_ablation_study.py
