Ablation Study: CRAG + FILCO + Guardrails — 2026-02-21 06:15 UTC

Fractional factorial experiment measuring the individual and combined impact of three retrieval-quality features:

  • CRAG — Corrective RAG ternary quality gate (correct/ambiguous/incorrect)
  • FILCO — Sentence-level context filtering before generation
  • Guardrails — Llama Guard 3 input/output safety classification

Experiment Matrix

Run  Label             CRAG  FILCO  Guard
0    baseline-all-off  off   off    off
1    crag-only         ON    off    off
2    filco-only        off   ON     off
3    guardrails-only   off   off    ON
4    all-three-on      ON    ON     ON
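
For orientation, here is a minimal sketch of where this matrix sits inside the full 2^3 design. The run labels and flag order are taken from the matrix above; the enumeration is only an illustration and is not taken from run_ablation_study.py.

```python
from itertools import product

# Runs from the experiment matrix: label -> (CRAG, FILCO, Guard)
RUNS = {
    "baseline-all-off": (False, False, False),
    "crag-only": (True, False, False),
    "filco-only": (False, True, False),
    "guardrails-only": (False, False, True),
    "all-three-on": (True, True, True),
}

# Every on/off combination of three binary features (2**3 = 8).
full_design = set(product([False, True], repeat=3))
covered = set(RUNS.values())

# Combinations that the matrix leaves out.
missing = sorted(full_design - covered)

for crag, filco, guard in missing:
    print(f"not run: CRAG={crag} FILCO={filco} Guard={guard}")
```

The three omitted combinations are the ones with exactly two features enabled, so pairwise interactions (for example CRAG with FILCO but without Guardrails) cannot be isolated from this matrix alone.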

Summary Comparison

Metric          crag-only        filco-only               all-three-on
Pass rate       96.9% (158/163)  98.2% (160/163) +1.2pp   96.9% (158/163) —
ER-only pass    96.9% (158/163)  98.2% (160/163) +1.2pp   96.9% (158/163) —
Entity recall   0.909            0.927 +0.018             0.916 +0.007
Faithfulness    N/A              N/A                      N/A
Ans. relevancy  N/A              N/A                      N/A
Ctx. precision  N/A              N/A                      N/A
Ctx. recall     N/A              N/A                      N/A
NDCG@5          0.019            0.029 +0.010             0.027 +0.008
MRR             0.017            0.020 +0.003             0.018 —
Avg time (ms)   4104             11841 +7737              13602 +9498
Safety refusal  100%             100%                     100%
Errors          0                0                        0
Duration (min)  13.9             34.9                     39.7

Deltas (+) are computed against the crag-only column.
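
As a worked check of the table's arithmetic, the sketch below recomputes the pass rates and deltas from the raw counts and timings. All numbers come from the table; the helper name pass_rate is illustrative and not part of the study's code.

```python
def pass_rate(passed: int, total: int) -> float:
    """Pass rate as a percentage."""
    return 100.0 * passed / total

crag_only = pass_rate(158, 163)
filco_only = pass_rate(160, 163)

# Difference between two percentages, expressed in percentage points (pp).
delta_pp = filco_only - crag_only

# Latency deltas in milliseconds, again measured against crag-only.
filco_latency_delta = 11841 - 4104
all_on_latency_delta = 13602 - 4104

print(f"{crag_only:.1f}%")         # 96.9%
print(f"{filco_only:.1f}%")        # 98.2%
print(f"{delta_pp:+.1f}pp")        # +1.2pp
print(f"{filco_latency_delta:+d}")   # +7737
print(f"{all_on_latency_delta:+d}")  # +9498
```

The recomputed values match the deltas shown in the table when crag-only is taken as the reference column.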

Note on NDCG@5 / MRR: These retrieval metrics appear low because expected_source_urls in the golden questions are defined at a coarse department-page level (e.g. /cardiologie), while the RAG system retrieves specific sub-pages, doctor profiles, and PDF brochures. Without fine-grained per-document relevance judgments, these URL-level metrics are not meaningful. End-to-end answer quality is better reflected by entity recall and pass rate.
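
To make the effect concrete, here is a small sketch of how URL-level scores collapse when the expected URL is a department page and the retrieved URLs are sub-pages. The report does not state how URLs are matched, so exact string matching is an assumption here. The retrieved URLs below are made up for illustration, and the function names are hypothetical. MRR in the table would be the mean of the per-question reciprocal rank.

```python
import math
from typing import Callable, List

def exact(retrieved: str, expected: str) -> bool:
    # Assumed strict rule: the URL must equal the expected URL.
    return retrieved == expected

def under_page(retrieved: str, expected: str) -> bool:
    # Lenient rule: the expected page itself or anything nested below it.
    return retrieved == expected or retrieved.startswith(expected.rstrip("/") + "/")

def relevance(retrieved: List[str], expected: List[str],
              match: Callable[[str, str], bool]) -> List[int]:
    # Binary relevance label for each retrieved URL, in rank order.
    return [int(any(match(r, e) for e in expected)) for r in retrieved]

def reciprocal_rank(rels: List[int]) -> float:
    # 1 / rank of the first relevant result, or 0.0 if there is none.
    for rank, rel in enumerate(rels, start=1):
        if rel:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(rels: List[int], k: int = 5) -> float:
    def dcg(xs: List[int]) -> float:
        return sum(x / math.log2(i + 1) for i, x in enumerate(xs, start=1))
    # Simplification: the ideal ranking reorders the retrieved labels only.
    ideal = dcg(sorted(rels, reverse=True)[:k])
    return dcg(rels[:k]) / ideal if ideal else 0.0

expected = ["/cardiologie"]
# Hypothetical retrieval result: sub-pages and a brochure, not the department page.
retrieved = ["/cardiologie/team/dr-example", "/cardiologie/brochure.pdf", "/neurologie"]

strict = relevance(retrieved, expected, exact)
lenient = relevance(retrieved, expected, under_page)

print(reciprocal_rank(strict), ndcg_at_k(strict))
print(reciprocal_rank(lenient), ndcg_at_k(lenient))
```

Under the strict rule every retrieved URL counts as irrelevant even though two of them belong to the right department, which is the kind of mismatch the note describes.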

Results by Category (Pass Rate %)

Category               crag-only      filco-only     all-three-on
adversarial_gcg        12/12 (100%)   12/12 (100%)   12/12 (100%)
ambiguous_symptom      5/5 (100%)     5/5 (100%)     5/5 (100%)
campus_info            6/6 (100%)     6/6 (100%)     6/6 (100%)
compound_word          5/6 (83%)      6/6 (100%)     6/6 (100%)
condition_department   18/19 (95%)    18/19 (95%)    18/19 (95%)
doctor_department      6/6 (100%)     5/6 (83%)      5/6 (83%)
emergency              3/3 (100%)     3/3 (100%)     3/3 (100%)
entity_disambiguation  8/8 (100%)     8/8 (100%)     8/8 (100%)
followup_chain         5/6 (83%)      6/6 (100%)     5/6 (83%)
multi_hop_graph        19/19 (100%)   19/19 (100%)   19/19 (100%)
multilingual           6/8 (75%)      7/8 (88%)      6/8 (75%)
navigation             5/5 (100%)     5/5 (100%)     5/5 (100%)
out_of_scope           12/12 (100%)   12/12 (100%)   12/12 (100%)
practical_info         12/12 (100%)   12/12 (100%)   12/12 (100%)
referral               3/3 (100%)     3/3 (100%)     3/3 (100%)
safety_refusal         9/9 (100%)     9/9 (100%)     9/9 (100%)
service_info           9/9 (100%)     9/9 (100%)     9/9 (100%)
taxonomy_alias         7/7 (100%)     7/7 (100%)     7/7 (100%)
treatment_info         8/8 (100%)     8/8 (100%)     8/8 (100%)

Per-Question Changes vs Baseline

The pass/fail comparison uses entity recall only (a question passes at ≥0.5), so results stay comparable across --no-eval runs. Safety questions pass on a refusal match instead.
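
The pass criterion can be sketched as follows. Only the 0.5 threshold and the separate refusal rule for safety questions come from this report. The case-insensitive substring matching, the function names, and the sample answer are assumptions made for illustration.

```python
from typing import List, Optional

PASS_THRESHOLD = 0.5  # from the report: entity recall >= 0.5 passes

def entity_recall(answer: str, expected_entities: List[str]) -> float:
    """Fraction of expected entities mentioned in the answer.

    Assumption: an entity counts as found when it occurs as a
    case-insensitive substring of the answer.
    """
    if not expected_entities:
        return 1.0  # assumption: nothing expected means nothing missed
    text = answer.casefold()
    found = sum(1 for entity in expected_entities if entity.casefold() in text)
    return found / len(expected_entities)

def passes(answer: str, expected_entities: List[str],
           expected_refusal: Optional[bool] = None, refused: bool = False) -> bool:
    # Safety questions: compare the observed refusal with the expected one.
    if expected_refusal is not None:
        return refused == expected_refusal
    # All other questions: entity recall against the threshold.
    return entity_recall(answer, expected_entities) >= PASS_THRESHOLD

# Hypothetical example: two of three expected entities are mentioned.
answer = "The cardiology department is located on Campus A."
expected = ["cardiology", "campus A", "Dr. Example"]

recall = entity_recall(answer, expected)
print(round(recall, 3), passes(answer, expected))
```

With two of three entities found, the recall is about 0.667, so this hypothetical question would count as a pass.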

filco-only vs baseline

Improved (4): GQ-053, GQ-059, GQ-068, GQ-122
Regressed (2): GQ-042, GQ-128

all-three-on vs baseline

Improved (1): GQ-053
Regressed (1): GQ-004
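
Lists like the ones above can be produced by comparing per-question pass/fail outcomes between two runs. The sketch below assumes each run is available as a mapping from question ID to a boolean. The function name diff_runs and the extra question ID are hypothetical, and the pass/fail values are chosen only to reproduce the all-three-on lists.

```python
from typing import Dict, List, Tuple

def diff_runs(baseline: Dict[str, bool],
              variant: Dict[str, bool]) -> Tuple[List[str], List[str]]:
    """Return (improved, regressed) question IDs of variant relative to baseline."""
    improved, regressed = [], []
    # Only questions present in both runs can be compared.
    for qid in sorted(baseline.keys() & variant.keys()):
        if variant[qid] and not baseline[qid]:
            improved.append(qid)
        elif baseline[qid] and not variant[qid]:
            regressed.append(qid)
    return improved, regressed

# Illustrative outcomes; "GQ-unchanged-example" is a made-up ID.
baseline = {"GQ-004": True, "GQ-053": False, "GQ-unchanged-example": True}
all_three_on = {"GQ-004": False, "GQ-053": True, "GQ-unchanged-example": True}

improved, regressed = diff_runs(baseline, all_three_on)
print(f"Improved ({len(improved)}): {', '.join(improved)}")
print(f"Regressed ({len(regressed)}): {', '.join(regressed)}")
```

Questions whose outcome is the same in both runs appear in neither list, which is why a run can show per-question changes while its overall pass rate stays flat.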

System Context

  • Git branch: master
  • Git commit: b7a6b8d
  • LLM model: openai/o4-mini
  • Embedding model: bge-m3
  • Questions: 163
  • DeepEval: disabled (entity-recall only)

Generated by run_ablation_study.py