Ablation Study: CRAG + FILCO + Guardrails — 2026-02-20 20:34 UTC

Fractional factorial experiment (five of the eight possible on/off combinations) measuring the individual and combined impact of three retrieval-quality features:

  • CRAG — Corrective RAG ternary quality gate (correct/ambiguous/incorrect)
  • FILCO — Sentence-level context filtering before generation
  • Guardrails — Llama Guard 3 input/output safety classification

Experiment Matrix

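| Run | Label | CRAG | FILCO | Guard |
|---|---|---|---|---|
| 0 | baseline-all-off | off | off | off |
| 1 | crag-only | ON | off | off |
| 2 | filco-only | off | ON | off |
| 3 | guardrails-only | off | off | ON |
| 4 | all-three-on | ON | ON | ON |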
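
The matrix above covers five of the eight possible on/off combinations. As a minimal sketch (the dictionary layout and the function name `untested_combinations` are illustrative assumptions, not taken from run_ablation_study.py), the runs can be written as feature flags and compared against the full factorial design to see which combinations were not measured:

```python
from itertools import product

# Feature order used for the flag tuples below.
FEATURES = ("crag", "filco", "guard")

# The five runs from the experiment matrix, keyed by label.
RUNS = {
    "baseline-all-off": {"crag": False, "filco": False, "guard": False},
    "crag-only":        {"crag": True,  "filco": False, "guard": False},
    "filco-only":       {"crag": False, "filco": True,  "guard": False},
    "guardrails-only":  {"crag": False, "filco": False, "guard": True},
    "all-three-on":     {"crag": True,  "filco": True,  "guard": True},
}


def untested_combinations(runs):
    """Return the on/off combinations of FEATURES that no run covers."""
    tested = {tuple(cfg[f] for f in FEATURES) for cfg in runs.values()}
    full_design = product((False, True), repeat=len(FEATURES))
    return [combo for combo in full_design if combo not in tested]


missing = untested_combinations(RUNS)
for combo in missing:
    enabled = [f for f, on in zip(FEATURES, combo) if on]
    print("not tested:", " + ".join(enabled))
```

The three untested combinations are the pairwise ones, so any interaction seen in all-three-on cannot be attributed to a specific pair of features from this data alone.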

Summary Comparison

| Metric | baseline-all-off | crag-only | filco-only | guardrails-only | all-three-on |
|---|---|---|---|---|---|
| Pass rate | 95.7% (156/163) | 98.2% (160/163) +2.5pp | 98.2% (160/163) +2.5pp | 99.4% (162/163) +3.7pp | 96.3% (157/163) +0.6pp |
| Entity recall | 0.937 | 0.946 +0.009 | 0.933 -0.004 | 0.945 +0.008 | 0.926 -0.011 |
| Faithfulness | 0.941 | 0.938 -0.004 | 0.942 — | 0.959 +0.018 | 0.923 -0.018 |
| Ans. relevancy | 0.776 | 0.788 +0.012 | 0.774 -0.002 | 0.800 +0.023 | 0.776 — |
| Ctx. precision | 0.460 | 0.369 -0.091 | 0.400 -0.060 | 0.410 -0.050 | 0.425 -0.035 |
| Ctx. recall | 0.417 | 0.342 -0.075 | 0.417 — | 0.390 -0.028 | 0.426 +0.009 |
| NDCG@5 | 0.000 | 0.000 — | 0.000 — | 0.000 — | 0.000 — |
| MRR | 0.000 | 0.000 — | 0.000 — | 0.000 — | 0.000 — |
| Avg time (ms) | 15022 | 15751 +729 | 10664 -4358 | 11577 -3444 | 22501 +7479 |
| Safety refusal | 100% | 100% | 100% | 100% | 100% |
| Errors | 0 | 0 | 0 | 0 | 0 |
| Duration (min) | 73.9 | 76.4 | 61.9 | 64.0 | 89.3 |
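
The pass-rate deltas in the summary are percentage-point differences against the baseline. A minimal sketch of that arithmetic, using the pass counts from the table (the function name `pass_rate_delta_pp` is an illustrative assumption, not the script's own):

```python
def pass_rate_delta_pp(passed, baseline_passed, total):
    """Return (pass rate in %, delta vs baseline in percentage points)."""
    rate = 100.0 * passed / total
    baseline_rate = 100.0 * baseline_passed / total
    return round(rate, 1), round(rate - baseline_rate, 1)


TOTAL = 163
BASELINE_PASSED = 156

# Pass counts per run, taken from the summary table.
passed_by_run = {
    "crag-only": 160,
    "filco-only": 160,
    "guardrails-only": 162,
    "all-three-on": 157,
}

for label, passed in passed_by_run.items():
    rate, delta = pass_rate_delta_pp(passed, BASELINE_PASSED, TOTAL)
    print(f"{label}: {rate}% ({passed}/{TOTAL}) {delta:+}pp")
```

With 163 questions, a single question changing outcome moves the pass rate by roughly 0.6pp, which is the entire difference between all-three-on and the baseline.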

Note on NDCG@5 / MRR: Both retrieval metrics score 0.000 in every run because expected_source_urls in the golden questions are defined at a coarse department-page level (e.g. /cardiologie), while the RAG system retrieves specific sub-pages, doctor profiles, and PDF brochures. Without fine-grained per-document relevance judgments, these URL-level metrics are not meaningful; entity recall and pass rate better reflect end-to-end answer quality.
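
To illustrate the mismatch described in the note, here is a hedged sketch of MRR under two URL-matching rules. It assumes, without confirmation from the source, that the metric compares retrieved URLs to expected_source_urls by exact equality. The retrieved sub-page URLs are hypothetical examples, and prefix matching is shown only as one possible coarser rule, not as something the study uses.

```python
def mrr(retrieved, expected, match):
    """Reciprocal rank of the first retrieved URL that matches any expected URL."""
    for rank, url in enumerate(retrieved, start=1):
        if any(match(url, exp) for exp in expected):
            return 1.0 / rank
    return 0.0


def exact(url, expected):
    # Assumed current behaviour: the URL must equal the expected URL.
    return url == expected


def under_prefix(url, expected):
    # Alternative rule: count any page at or below the department page.
    base = expected.rstrip("/")
    return url == base or url.startswith(base + "/")


# Expected URL at department level (from the note); retrieved URLs are hypothetical.
expected_source_urls = ["/cardiologie"]
retrieved_urls = [
    "/cardiologie/hartfalen",
    "/cardiologie/artsen/example-doctor",
    "/neurologie",
]

print(mrr(retrieved_urls, expected_source_urls, exact))         # 0.0
print(mrr(retrieved_urls, expected_source_urls, under_prefix))  # 1.0
```

Prefix matching would still be a department-level judgment, so it would not replace per-document relevance labels.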

Results by Category (Pass Rate %)

| Category | baseline-all-off | crag-only | filco-only | guardrails-only | all-three-on |
|---|---|---|---|---|---|
| adversarial_gcg | 12/12 (100%) | 12/12 (100%) | 12/12 (100%) | 12/12 (100%) | 12/12 (100%) |
| ambiguous_symptom | 4/5 (80%) | 5/5 (100%) | 5/5 (100%) | 5/5 (100%) | 5/5 (100%) |
| campus_info | 6/6 (100%) | 6/6 (100%) | 6/6 (100%) | 6/6 (100%) | 6/6 (100%) |
| compound_word | 6/6 (100%) | 6/6 (100%) | 6/6 (100%) | 6/6 (100%) | 6/6 (100%) |
| condition_department | 18/19 (95%) | 18/19 (95%) | 19/19 (100%) | 19/19 (100%) | 17/19 (89%) |
| doctor_department | 5/6 (83%) | 6/6 (100%) | 5/6 (83%) | 5/6 (83%) | 6/6 (100%) |
| emergency | 2/3 (67%) | 3/3 (100%) | 2/3 (67%) | 3/3 (100%) | 3/3 (100%) |
| entity_disambiguation | 8/8 (100%) | 8/8 (100%) | 8/8 (100%) | 8/8 (100%) | 8/8 (100%) |
| followup_chain | 6/6 (100%) | 6/6 (100%) | 6/6 (100%) | 6/6 (100%) | 6/6 (100%) |
| multi_hop_graph | 19/19 (100%) | 19/19 (100%) | 19/19 (100%) | 19/19 (100%) | 19/19 (100%) |
| multilingual | 8/8 (100%) | 7/8 (88%) | 7/8 (88%) | 8/8 (100%) | 7/8 (88%) |
| navigation | 4/5 (80%) | 5/5 (100%) | 5/5 (100%) | 5/5 (100%) | 5/5 (100%) |
| out_of_scope | 12/12 (100%) | 12/12 (100%) | 12/12 (100%) | 12/12 (100%) | 11/12 (92%) |
| practical_info | 11/12 (92%) | 11/12 (92%) | 12/12 (100%) | 12/12 (100%) | 10/12 (83%) |
| referral | 3/3 (100%) | 3/3 (100%) | 3/3 (100%) | 3/3 (100%) | 3/3 (100%) |
| safety_refusal | 9/9 (100%) | 9/9 (100%) | 9/9 (100%) | 9/9 (100%) | 9/9 (100%) |
| service_info | 9/9 (100%) | 9/9 (100%) | 9/9 (100%) | 9/9 (100%) | 9/9 (100%) |
| taxonomy_alias | 7/7 (100%) | 7/7 (100%) | 7/7 (100%) | 7/7 (100%) | 7/7 (100%) |
| treatment_info | 7/8 (88%) | 8/8 (100%) | 8/8 (100%) | 8/8 (100%) | 8/8 (100%) |

Per-Question Changes vs Baseline

crag-only vs baseline

Improved (5): GQ-005, GQ-028, GQ-029, GQ-071, GQ-104
Regressed (1): GQ-059

filco-only vs baseline

Improved (6): GQ-005, GQ-016, GQ-029, GQ-071, GQ-104, GQ-122
Regressed (2): GQ-004, GQ-059

guardrails-only vs baseline

Improved (7): GQ-005, GQ-016, GQ-028, GQ-029, GQ-071, GQ-104, GQ-122
Regressed (1): GQ-004

all-three-on vs baseline

Improved (5): GQ-005, GQ-028, GQ-029, GQ-071, GQ-104
Regressed (4): GQ-043, GQ-059, GQ-086, GQ-133
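
The improved and regressed lists above can be read as set differences between the questions that fail in the baseline and those that fail in a variant. The sketch below is an assumption about how such a diff could be computed, not the script's actual code. The failure sets are not stated in the report: the baseline set is taken to be the seven questions that guardrails-only improved (the baseline has exactly seven failures), and the crag-only set is derived from its improved and regressed lists.

```python
def diff_failures(baseline_failed, variant_failed):
    """Return (improved, regressed) question IDs for a variant vs the baseline."""
    improved = sorted(baseline_failed - variant_failed)   # failed before, pass now
    regressed = sorted(variant_failed - baseline_failed)  # passed before, fail now
    return improved, regressed


# Failing question IDs, derived from the lists above rather than read from raw results.
baseline_failed = {
    "GQ-005", "GQ-016", "GQ-028", "GQ-029", "GQ-071", "GQ-104", "GQ-122",
}
crag_only_failed = {"GQ-016", "GQ-059", "GQ-122"}

improved, regressed = diff_failures(baseline_failed, crag_only_failed)
print(f"Improved ({len(improved)}):", ", ".join(improved))
print(f"Regressed ({len(regressed)}):", ", ".join(regressed))
```

Net change is improved minus regressed: +4 for crag-only (156 to 160 passes) and +1 for all-three-on (156 to 157).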

System Context

  • Git branch: master
  • Git commit: 2f17c29
  • LLM model: openai/o4-mini
  • Embedding model: bge-m3
  • Questions: 163
  • DeepEval: enabled

Generated by run_ablation_study.py