Ablation Study: CRAG + FILCO + Guardrails — 2026-02-21 08:01 UTC

Fractional factorial experiment (5 of the 8 possible on/off combinations) measuring the individual and combined impact of three retrieval-quality features:

  • CRAG — Corrective RAG ternary quality gate (correct/ambiguous/incorrect)
  • FILCO — Sentence-level context filtering before generation
  • Guardrails — Llama Guard 3 input/output safety classification
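As a rough illustration of what a ternary quality gate can look like, the sketch below maps per-document relevance scores to one of the three verdicts. It is not the implementation used in this study: the function name `crag_gate`, the score range, and the `upper`/`lower` thresholds are all assumptions chosen for the example.

```python
from enum import Enum


class Verdict(Enum):
    CORRECT = "correct"
    AMBIGUOUS = "ambiguous"
    INCORRECT = "incorrect"


def crag_gate(scores, upper=0.7, lower=0.3):
    """Classify a retrieval result from per-document relevance scores.

    Hypothetical rule: CORRECT if at least one document scores above
    `upper`, INCORRECT if every document scores below `lower`,
    AMBIGUOUS otherwise. Thresholds are illustrative, not from the study.
    """
    best = max(scores, default=0.0)  # no documents counts as a score of 0.0
    if best > upper:
        return Verdict.CORRECT
    if best < lower:
        return Verdict.INCORRECT
    return Verdict.AMBIGUOUS
```

A downstream pipeline would typically branch on the verdict, for example by using the retrieved context as-is, refining it, or falling back to another source.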

Experiment Matrix

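| Run | Label | CRAG | FILCO | Guard |
|---|---|---|---|---|
| 0 | baseline-all-off | off | off | off |
| 1 | crag-only | ON | off | off |
| 2 | filco-only | off | ON | off |
| 3 | guardrails-only | off | off | ON |
| 4 | all-three-on | ON | ON | ON |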
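The matrix can be written down as data, which also makes it easy to see which combinations of a full 2^3 design were left out. This is a minimal sketch: the run labels come from the table, while the names `FEATURES`, `RUNS`, and `as_flags` are assumptions and may not match run_ablation_study.py.

```python
from itertools import product

FEATURES = ("crag", "filco", "guard")

# The five runs from the experiment matrix: label -> set of enabled features.
RUNS = {
    "baseline-all-off": set(),
    "crag-only": {"crag"},
    "filco-only": {"filco"},
    "guardrails-only": {"guard"},
    "all-three-on": {"crag", "filco", "guard"},
}


def as_flags(enabled):
    """Expand a set of enabled features into an explicit on/off mapping."""
    return {name: name in enabled for name in FEATURES}


# Full 2^3 factorial, for comparison with the subset that was actually run.
full_factorial = [
    {f for f, on in zip(FEATURES, bits) if on}
    for bits in product((False, True), repeat=len(FEATURES))
]
not_run = [combo for combo in full_factorial if combo not in RUNS.values()]
```

Here `not_run` contains the three pairwise combinations (two features on, one off), which is why pairwise interaction effects cannot be separated from this design alone.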

Summary Comparison

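| Metric | crag-only | all-three-on | Δ vs crag-only |
|---|---|---|---|
| Pass rate | 98.8% (160/162) | 97.5% (158/162) | -1.2pp |
| ER-only pass | 98.8% (160/162) | 97.5% (158/162) | -1.2pp |
| Entity recall | 0.930 | 0.920 | -0.010 |
| Faithfulness | N/A | N/A | |
| Ans. relevancy | N/A | N/A | |
| Ctx. precision | N/A | N/A | |
| Ctx. recall | N/A | N/A | |
| NDCG@5 | 0.017 | 0.026 | +0.009 |
| MRR | 0.017 | 0.018 | +0.002 |
| Avg time (ms) | 4005 | 12849 | +8845 |
| Safety refusal | 100% | 100% | |
| Errors | 0 | 0 | |
| Duration (min) | 13.5 | 37.5 | |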
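The pass-rate delta is reported in percentage points (pp), not as a relative change. The following sketch reproduces that arithmetic from the pass counts in the summary. The helper names are assumptions for illustration.

```python
def pass_rate(passed, total):
    """Pass rate as a percentage."""
    return 100.0 * passed / total


def pp_delta(reference, candidate):
    """Difference between two percentages, in percentage points."""
    return candidate - reference


crag_only = pass_rate(160, 162)     # about 98.77%
all_three_on = pass_rate(158, 162)  # about 97.53%
delta = pp_delta(crag_only, all_three_on)

print(f"{crag_only:.1f}% -> {all_three_on:.1f}% ({delta:+.1f}pp)")
```

The delta is computed from the unrounded rates, which is why it reads -1.2pp even though the rounded figures 98.8 and 97.5 differ by 1.3.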

Note on NDCG@5 / MRR: These retrieval metrics appear low because expected_source_urls in the golden questions are defined at a coarse department-page level (e.g. /cardiologie), while the RAG system retrieves specific sub-pages, doctor profiles, and PDF brochures. Without fine-grained per-document relevance judgments, these URL-level metrics are not meaningful. End-to-end answer quality is better reflected by entity recall and pass rate.
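The effect described in the note can be reproduced with a small reciprocal-rank sketch. It assumes the metric matches retrieved URLs against expected URLs by exact string equality, which this report does not confirm. The example URLs and the prefix-based matcher are hypothetical and are shown only for contrast.

```python
def reciprocal_rank(retrieved, expected, is_match):
    """1/rank of the first retrieved URL that matches any expected URL, else 0."""
    for rank, url in enumerate(retrieved, start=1):
        if any(is_match(url, e) for e in expected):
            return 1.0 / rank
    return 0.0


def exact(url, expected):
    """Assumed current behaviour: URLs must be identical."""
    return url == expected


def under_prefix(url, expected):
    """Hypothetical alternative: sub-pages of the expected page also count."""
    return url == expected or url.startswith(expected.rstrip("/") + "/")


expected = ["/cardiologie"]
# Hypothetical retrieval result: specific sub-pages, never the department page.
retrieved = [
    "/cardiologie/artsen/dr-example",
    "/cardiologie/brochures/hartkatheterisatie.pdf",
    "/cardiologie/raadpleging",
]

rr_exact = reciprocal_rank(retrieved, expected, exact)
rr_prefix = reciprocal_rank(retrieved, expected, under_prefix)
```

With exact matching every retrieved document counts as a miss even though all three belong to the expected department, so the score says little about actual retrieval quality.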

Results by Category (Pass Rate %)

| Category | crag-only | all-three-on |
|---|---|---|
| adversarial_gcg | 12/12 (100%) | 12/12 (100%) |
| ambiguous_symptom | 5/5 (100%) | 5/5 (100%) |
| campus_info | 6/6 (100%) | 6/6 (100%) |
| compound_word | 6/6 (100%) | 6/6 (100%) |
| condition_department | 18/19 (95%) | 18/19 (95%) |
| doctor_department | 6/6 (100%) | 5/6 (83%) |
| emergency | 3/3 (100%) | 3/3 (100%) |
| entity_disambiguation | 8/8 (100%) | 8/8 (100%) |
| followup_chain | 5/5 (100%) | 5/5 (100%) |
| multi_hop_graph | 19/19 (100%) | 19/19 (100%) |
| multilingual | 7/8 (88%) | 6/8 (75%) |
| navigation | 5/5 (100%) | 5/5 (100%) |
| out_of_scope | 12/12 (100%) | 12/12 (100%) |
| practical_info | 12/12 (100%) | 12/12 (100%) |
| referral | 3/3 (100%) | 3/3 (100%) |
| safety_refusal | 9/9 (100%) | 9/9 (100%) |
| service_info | 9/9 (100%) | 9/9 (100%) |
| taxonomy_alias | 7/7 (100%) | 7/7 (100%) |
| treatment_info | 8/8 (100%) | 8/8 (100%) |

Per-Question Changes vs Baseline

Pass/fail comparison uses entity-recall only (≥0.5) for consistency across --no-eval runs. Safety questions use refusal match.
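The entity-recall pass rule can be sketched as follows. The 0.5 threshold comes from the text above. The matching strategy (case-insensitive substring search) and the function names are assumptions, and the actual script may normalise or match entities differently.

```python
PASS_THRESHOLD = 0.5  # from the report: a question passes at entity recall >= 0.5


def entity_recall(answer, expected_entities):
    """Fraction of expected entities that appear in the answer.

    Assumption: an entity counts as found if it occurs as a
    case-insensitive substring of the answer text.
    """
    if not expected_entities:
        raise ValueError("expected_entities must not be empty")
    text = answer.lower()
    found = sum(1 for entity in expected_entities if entity.lower() in text)
    return found / len(expected_entities)


def passes(answer, expected_entities):
    """Apply the entity-recall-only pass rule."""
    return entity_recall(answer, expected_entities) >= PASS_THRESHOLD


# Hypothetical example: two of the three expected entities are mentioned.
answer = "Dr. Peeters works in the Cardiologie department."
expected = ["Peeters", "cardiologie", "Neurologie"]
```

In this example the recall is 2/3, so the question passes. An answer mentioning only one of the three entities would score 1/3 and fail.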

all-three-on vs baseline

Improved: none
Regressed (2): GQ-004, GQ-063

System Context

  • Git branch: master
  • Git commit: 93df7a7
  • LLM model: openai/o4-mini
  • Embedding model: bge-m3
  • Questions: 162
  • DeepEval: disabled (entity-recall only)

Generated by run_ablation_study.py