Quality Evaluation
A RAG system that cannot verify the quality of its own responses is fundamentally incomplete. The ZOL Intelligent Search employs a two-phase hybrid evaluation strategy that balances the need for immediate quality assurance with the desire for comprehensive long-term analytics.
The Problem: Measuring RAG Response Quality
RAG quality evaluation requires assessing multiple dimensions: faithfulness (grounded in sources), relevance (answers the question), completeness (addresses all aspects), and safety (no medical advice). Comprehensive LLM-based evaluation via DeepEval is accurate but takes 40-60 seconds — prohibitively slow for real-time use.
The Latency Challenge
Early iterations of the system attempted to run full evaluation inline -- blocking the response until all quality metrics were computed. The results were unacceptable:
| Evaluation Approach | Latency | User Experience |
|---|---|---|
| No evaluation | ~5.5s | Fast, but no quality assurance |
| Full inline evaluation | ~51-65s | Quality assured, but unusable |
| Hybrid (fast gate + background) | ~6.4s | Fast with quality assurance |
A 51-65 second wait for a search response is unacceptable in any context, let alone a hospital website where patients expect the responsiveness of Google search. This observation drove the design of the two-phase approach.
Two-Phase Hybrid Evaluation
Phase 1: Fast Quality Gate
Latency: ~600ms | Blocking: Yes | Purpose: Immediate quality assurance
The fast quality gate operates on embedding cosine similarity, using the same configured embedding model that powers the retrieval pipeline (currently OpenAI text-embedding-3-large per ADR-0048; see @openai2024embeddings). The gate calls get_embedding_service(), which returns whichever provider/model is configured. It computes two similarity scores and combines them into a single weighted average:
- Context Alignment (weight: 0.7): Cosine similarity between the response embedding and the retrieved context embedding. Measures whether the response is faithful to the source material.
- Semantic Similarity (weight: 0.3): Cosine similarity between the response embedding and the query embedding. Measures whether the response is topically relevant to what was asked.
The weighted score (0.7 × context_alignment + 0.3 × semantic_similarity) must meet or exceed a 50% threshold for the response to pass. Context alignment receives the higher weight because faithfulness to source material is more important than topical relevance in a hospital information system. The threshold was raised from 0.40 to 0.50 on 2026-05-10 after the Wave 2.C.1 empirical revalidation described below — see backend/app/config.py:eval_fast_threshold and the regression-pinning unit test tests/unit/services/test_evaluation_service_unit.py::test_fast_eval_threshold_pinned_to_wave_2c1_empirical_anchor.
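The gate arithmetic described above can be sketched as follows. This is a minimal illustration that assumes the three embeddings are already available as plain lists of floats; the names `cosine_similarity`, `fast_gate_score`, and `passes_gate` are hypothetical and are not taken from evaluation_service.py. Only the weights (0.7 and 0.3) and the 0.50 threshold come from the text.

```python
import math
from typing import Sequence

CONTEXT_WEIGHT = 0.7        # weight on cos(response, context), from the text
SEMANTIC_WEIGHT = 0.3       # weight on cos(response, query), from the text
FAST_EVAL_THRESHOLD = 0.50  # current gate threshold, from the text


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def fast_gate_score(
    response_vec: Sequence[float],
    context_vec: Sequence[float],
    query_vec: Sequence[float],
) -> float:
    """Weighted average of context alignment and semantic similarity."""
    context_alignment = cosine_similarity(response_vec, context_vec)
    semantic_similarity = cosine_similarity(response_vec, query_vec)
    return CONTEXT_WEIGHT * context_alignment + SEMANTIC_WEIGHT * semantic_similarity


def passes_gate(score: float, threshold: float = FAST_EVAL_THRESHOLD) -> bool:
    """The response passes when the score meets or exceeds the threshold."""
    return score >= threshold
```

As a worked example, a response with context alignment 0.60 and semantic similarity 0.30 scores 0.7 × 0.60 + 0.3 × 0.30 = 0.51 and passes the 0.50 gate, whereas the same response would fail if its context alignment dropped to 0.55 (score 0.475).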
The threshold was originally lowered from 65% to 40% during the BGE-M3 era, after empirical testing showed BGE-M3 cosine similarities for topically related Dutch hospital content fell in the 35-55% range. After the 2026-04-30 embedding migration to text-embedding-3-large (ADR-0048) the cosine distribution shifted upward, and a Wave 2.C revalidation (2026-05-10) measured the new distribution directly.
Method. A reproducible offline script at backend/scripts/revalidate_fast_gate_threshold.py samples 200 real (question, answer) pairs from app.conversation_messages and embeds each with the production text-embedding-3-large (1536d) stack. For each sample it builds two cohorts that mimic the gate's runtime semantics (score = 0.7 × cos(answer, context) + 0.3 × cos(question, answer)):
- Topically related — the answer is paired with its actual top-5 retrieved chunks (the production retrieval path).
- Topically unrelated — the same answer is paired with a uniformly-random chunk drawn from a different document.
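The unrelated-cohort pairing can be sketched as below. This is an illustrative reading of the method, not the code in revalidate_fast_gate_threshold.py: the `Chunk` type, its `doc_id` field, and the `pick_unrelated_chunk` helper are hypothetical, and "a different document" is assumed here to mean a document that contributed none of the answer's retrieved chunks.

```python
import random
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Chunk:
    doc_id: str  # hypothetical identifier of the source document
    text: str


def pick_unrelated_chunk(
    retrieved: Sequence[Chunk],
    corpus: Sequence[Chunk],
    rng: random.Random,
) -> Chunk:
    """Draw one chunk uniformly at random from documents not among the retrieved ones."""
    retrieved_docs = {chunk.doc_id for chunk in retrieved}
    candidates: List[Chunk] = [c for c in corpus if c.doc_id not in retrieved_docs]
    if not candidates:
        raise ValueError("corpus has no chunk from a different document")
    return rng.choice(candidates)
```

Passing a seeded `random.Random` instance keeps the sampling reproducible between runs, which matters when the report is compared across embedding-model changes.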
Distribution shift (text-embedding-3-large vs the old BGE-M3 anchor).
| Cohort | p05 | p25 | p50 | p75 | p95 |
|---|---|---|---|---|---|
| Topically related | 0.44 | 0.62 | 0.70 | 0.76 | 0.81 |
| Topically unrelated | 0.30 | 0.39 | 0.44 | 0.50 | 0.58 |
The "35-55%" claim that anchored the 40% threshold no longer describes related content: the related-cohort p05 (0.44) now sits near the middle of the old range, and the related median (0.70) is well above the old 0.55 ceiling. The unrelated cohort's median (0.44) now falls inside the range that related content used to occupy.
What this means for the 0.40 gate. At 0.40 the gate accepts 96% of related answers (TPR=0.96) but also passes 72% of unrelated answers (FPR=0.72), giving a Youden-J of 0.24. Under text-embedding-3-large the 0.40 threshold has limited discrimination value — most of the unrelated-cohort distribution clears the bar.
Empirically optimal threshold (Youden-optimal, J = TPR − FPR). The script's grid sweep over [0.20, 0.91] finds the maximum at:
| Threshold | TPR | FPR | Youden J |
|---|---|---|---|
| 0.40 (legacy BGE-M3 anchor; retired 2026-05-10) | 0.960 | 0.720 | 0.240 |
| 0.50 (current, post-Wave 2.C.1) | 0.920 | 0.280 | 0.640 |
| 0.58 (Youden-optimal — left on table) | 0.830 | 0.045 | 0.785 |
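The threshold sweep behind this table can be sketched as follows. This is a minimal illustration rather than the script itself: the names `OperatingPoint`, `operating_point`, and `sweep` are hypothetical, and the 0.01 step size and the inclusive upper bound are assumptions, since the text only gives the sweep range [0.20, 0.91].

```python
from typing import List, NamedTuple, Sequence


class OperatingPoint(NamedTuple):
    threshold: float
    tpr: float       # share of related-cohort scores at or above the threshold
    fpr: float       # share of unrelated-cohort scores at or above the threshold
    youden_j: float  # J = TPR - FPR


def operating_point(
    related: Sequence[float], unrelated: Sequence[float], threshold: float
) -> OperatingPoint:
    """TPR, FPR and Youden's J for one candidate threshold."""
    tpr = sum(score >= threshold for score in related) / len(related)
    fpr = sum(score >= threshold for score in unrelated) / len(unrelated)
    return OperatingPoint(threshold, tpr, fpr, tpr - fpr)


def sweep(
    related: Sequence[float],
    unrelated: Sequence[float],
    start: float = 0.20,
    stop: float = 0.91,
    step: float = 0.01,  # assumed grid resolution
) -> OperatingPoint:
    """Return the grid point with the highest J (the lowest threshold wins ties)."""
    n_steps = int(round((stop - start) / step))
    thresholds: List[float] = [round(start + i * step, 4) for i in range(n_steps + 1)]
    points = [operating_point(related, unrelated, t) for t in thresholds]
    return max(points, key=lambda point: point.youden_j)
```

Each row of the table is one such operating point; for instance, at 0.50 the reported values give J = 0.920 − 0.280 = 0.640.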
Operating note: 0.58 reaches near-zero FPR but at the cost of rejecting ~17% of related answers. A more conservative 0.50 holds TPR ≈ 0.92 with FPR ≈ 0.28 (J ≈ 0.64). The right choice depends on whether the cost of refusing a good answer outweighs the cost of letting a weak one through; for a navigational hospital-search system the asymmetry plausibly favours something in the 0.50-0.55 range.
Status. The threshold has been raised from 0.40 to 0.50 (Wave 2.C.1, 2026-05-10) after deliberate operational review of the empirical findings above. The 0.50 value lives in three places, kept in sync by the regression test tests/unit/services/test_evaluation_service_unit.py::test_fast_eval_threshold_pinned_to_wave_2c1_empirical_anchor (R2 contract per CLAUDE.md):
- backend/app/config.py:eval_fast_threshold (the canonical Settings default)
- backend/app/services/evaluation_service.py:EvalConfig.fast_eval_threshold (fallback default if Settings is unavailable)
- backend/app/services/evaluation_service.py:FastEvaluationResult.passed_quality_gate (boundary in the result-dataclass property)
The script and JSON report (backend/scripts/revalidate_fast_gate_threshold.json) are reproducible; re-run the script after each embedding-model change to detect distribution drift. Always update the regression test, the three threshold sites, and this Status section together. The further FPR reduction available at the Youden-optimal 0.58 threshold (0.28 → 0.045, raising J from 0.64 to 0.785) is left on the table because the accompanying 0.92 → 0.83 TPR drop (rejecting ~17% of legitimate answers vs ~8% at 0.50) is judged too aggressive for the navigational-search use case; a future re-tuning study post-pilot may revisit this.
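The idea of pinning one value across three sites can be sketched with a self-contained test. The classes below are stand-ins written for illustration, not the real definitions in config.py or evaluation_service.py; the attribute names `eval_fast_threshold`, `fast_eval_threshold`, and `passed_quality_gate` come from the text, while the `overall_score` field and the test body are assumptions.

```python
from dataclasses import dataclass

EXPECTED_THRESHOLD = 0.50  # Wave 2.C.1 anchor, from the text


@dataclass
class Settings:  # stand-in for the Settings class in backend/app/config.py
    eval_fast_threshold: float = 0.50


@dataclass
class EvalConfig:  # stand-in for the fallback default
    fast_eval_threshold: float = 0.50


@dataclass
class FastEvaluationResult:  # stand-in for the result dataclass
    overall_score: float  # hypothetical field name

    @property
    def passed_quality_gate(self) -> bool:
        # "Meet or exceed": a score exactly at the threshold passes.
        return self.overall_score >= 0.50


def test_fast_eval_threshold_pinned() -> None:
    """Fail loudly if any of the three sites drifts from the pinned value."""
    assert Settings().eval_fast_threshold == EXPECTED_THRESHOLD
    assert EvalConfig().fast_eval_threshold == EXPECTED_THRESHOLD
    # Boundary check on the property: at the threshold passes, just below fails.
    assert FastEvaluationResult(EXPECTED_THRESHOLD).passed_quality_gate
    assert not FastEvaluationResult(EXPECTED_THRESHOLD - 0.01).passed_quality_gate
```

Checking the property at the boundary, rather than only comparing constants, is what catches a hard-coded literal in the dataclass drifting away from the configured default.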
Phase 2: Background Analytics
Latency: ~40-60s | Blocking: No | Purpose: Long-term quality monitoring
After the response is delivered to the user, comprehensive evaluation runs asynchronously in the background. This phase uses DeepEval, a framework for evaluating LLM applications, to compute:
| Metric | What It Measures | Model Used |
|---|---|---|
| Faithfulness | Does the response only contain information from the context? | Tier 2 |
| Answer Relevancy | Does the response address the user's actual question? | Tier 2 |
These metrics are reported to Prometheus and visualized on quality dashboards, enabling the team to:
- Track quality trends over time
- Identify content areas where responses are consistently weak
- Detect quality regressions after system changes
- Provide evidence for stakeholder reporting
Evaluation Architecture
Quality Rejection Handling
When the fast quality gate rejects a response, the system does not silently fail. Instead, it:
- Returns a helpful fallback: A polite message acknowledging the question and suggesting alternative navigation paths (e.g., department phone numbers, website sections)
- Logs the rejection: The rejected query, response, and scores are logged for analysis, helping identify patterns in quality failures
- Does not retry automatically: Retrying with the same context would likely produce the same result. The rejection signals a genuine gap in content coverage or retrieval quality.
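The control flow described in this section and the two phases above can be sketched with asyncio. This is a hedged illustration, not the production handler: `deliver`, `GateOutcome`, the logger name, and the fallback wording are all hypothetical, and scheduling the background evaluation only for responses that pass the gate is an assumption the text does not confirm.

```python
import asyncio
import logging
from dataclasses import dataclass
from typing import Any, Callable, Coroutine, Optional

logger = logging.getLogger("evaluation")  # hypothetical logger name

# Hypothetical wording; the real fallback message is not given in the text.
FALLBACK_MESSAGE = (
    "Sorry, I could not find a reliable answer to that question. "
    "Please try the department pages or contact the hospital by phone."
)


@dataclass
class GateOutcome:
    text: str                               # what the user sees
    passed: bool                            # did the fast gate accept the response?
    background_task: Optional[asyncio.Task]  # Phase 2 task, if one was scheduled


async def deliver(
    query: str,
    response: str,
    gate_score: float,
    background_eval: Callable[[str, str], Coroutine[Any, Any, dict]],
    threshold: float = 0.50,
) -> GateOutcome:
    """Apply the blocking gate, then schedule the slow evaluation without awaiting it."""
    if gate_score < threshold:
        # Log the rejection for later analysis and return the fallback; no retry.
        logger.warning(
            "fast gate rejected response: query=%r response=%r score=%.3f",
            query, response, gate_score,
        )
        return GateOutcome(FALLBACK_MESSAGE, False, None)
    # Keep a reference to the task so it is not garbage-collected before it finishes.
    task = asyncio.create_task(background_eval(query, response))
    return GateOutcome(response, True, task)
```

The caller returns `outcome.text` to the user immediately; the background task completes later and would report its metrics without affecting response latency.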
Summary
The fast quality gate provides the safety net (immediate, deterministic, and free of LLM-judge calls). Background analytics provide the learning loop (comprehensive LLM-based metrics for continuous improvement). Together they deliver immediate quality assurance for users and comprehensive quality intelligence for the development team.
Evaluation Framework Theory
The ZOL evaluation approach draws on two established frameworks in the RAG evaluation literature:
RAGAS (Retrieval Augmented Generation Assessment)
RAGAS was introduced by Es et al. (2023). The paper proposes faithfulness, answer relevance, and context relevance; the accompanying open-source library has since settled on four core metrics: faithfulness, answer relevancy, context precision, and context recall. The ZOL system implements faithfulness and answer relevancy as the primary quality signals, as these directly address the hospital domain's core requirements: responses must be grounded in verified content (faithfulness) and must answer the patient's question (relevancy).
DeepEval
DeepEval extends the RAGAS framework with additional metrics and a standardized evaluation pipeline. It provides the implementation used in the ZOL background evaluation phase, executing LLM-as-judge evaluations in which the Tier 2 model assesses whether the generated response is faithful to the provided context and relevant to the user's question.
The Two-Phase Innovation
The ZOL system's contribution is the two-phase separation of evaluation into a fast embedding-based gate and a slow LLM-based analysis. This separation is driven by the insight that embedding cosine similarity, while less nuanced than LLM-based evaluation, is sufficient for real-time quality gating (detecting catastrophic failures) while being roughly two orders of magnitude faster (~600 ms versus 40-60 s).
References
- Confident AI. (2024). DeepEval: The open-source LLM evaluation framework. https://deepeval.com/docs/metrics-ragas
- Es, S., James, J., Espinosa-Anke, L., & Schockaert, S. (2023). RAGAS: Automated evaluation of retrieval augmented generation. arXiv preprint, arXiv:2309.15217. https://arxiv.org/abs/2309.15217
- Zheng, L., et al. (2023). Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. Advances in Neural Information Processing Systems, 36. https://arxiv.org/abs/2306.05685