Value Framework
Package: backend/app/services/value_framework/
Migration: 066 (backend/alembic/versions/066_category_mismatch_telemetry.py)
Tests: 76 unit tests + 8-test wheelchair regression integration test (counts may drift; run pytest --collect-only backend/tests/integration/services/test_value_framework_*.py backend/tests/unit/services/value_framework/ for the current count)
This is the retrieval-steering member of a three-subsystem set. It is best read alongside the Taxonomy (the structured knowledge graph it reranks against) and SNOMED CT (the ontology that resolves the entities the taxonomy routes). See the Core Concepts flow for how all three compose end-to-end in a single query. The Value Framework runs as Stage 5b of the Taxonomy Query Enrichment pipeline.
The cross-category contamination problem
The Value Framework was built to close a concrete regression. A caller asked about wheelchair accessibility at the hospital entrance. The answer they expected: "There are wheelchair-accessible entrances at Schiepse Bos and Genk." The answer they received: a confused reimbursement-process explanation.
Why? The query contained the word "rolstoel" (wheelchair). The orthopedic chunk about wheelchair-prescription reimbursement lexically over-scored on "rolstoel" and outranked the parking/accessibility chunk — because pgvector cosine similarity (@pgvector_docs) on a 3072-dim embedding (@openai2024embeddings) sees "rolstoel" in both documents and gives them nearly identical scores. The top-K context then contained primarily regulatory content, and the LLM faithfully answered from that context.
This is cross-category contamination: a query with intent navigation_or_practical_info gets answered from regulatory content. The embedding model is doing its job; the problem is that similarity alone cannot distinguish "I want to park my wheelchair" from "I want to claim reimbursement for my wheelchair prescription."
Why retrieval-only solutions don't suffice
Three retrieval-side approaches were considered before adding categorical-affinity reranking. Each was rejected:
| Alternative | Why rejected |
|---|---|
| Add BM25 keyword search to widen the recall pool (@robertson2009bm25) | Already in the hybrid retriever — both branches surface the same regulatory chunk on rolstoel. Adding keyword recall makes the contamination worse, not better. |
| Reciprocal Rank Fusion across vector + BM25 (@cormack2009rrf) | RRF combines rankings without normalising scores; both branches independently rank the regulatory chunk high, so RRF preserves the contamination. |
| Cross-encoder reranker (BERT-style passage rerank, @nogueira2019passagererank) | Would help, but the cross-encoder is a per-query inference (~50 ms on Jina v2) over the entire candidate set. On the voice channel that adds 50 ms before the LLM call, and the cross-encoder still has no signal that "parking" and "reimbursement" are different categories of fact. |
The Value Framework is structurally a categorical-affinity multiplier applied to the existing similarity score. It is conceptually descended from RRF (rank-fusion lineage) but extends it with a per-intent × per-category multiplier matrix. This places it in Gao et al.'s "Modular RAG" taxonomy (@gao2024ragsurvey) — a specialised retrieval-augmentation module orchestrated alongside the existing retriever, not replacing it.
Six canonical content categories
backend/app/services/value_framework/category_classifier.py assigns one of six categories to any retrieved chunk using multi-language keyword matching (word-boundary regex, case-insensitive, nl/en/fr/it):
| Category | What it covers |
|---|---|
| practical | Parking, entrance, visiting hours, accessibility, cafeteria, wifi, lift, directions |
| clinical_info | Conditions, treatments, diagnoses, symptoms, medications, procedures |
| regulatory | Reimbursement, prescriptions, insurance, social fund, orthopedic devices, co-pay |
| appointments | Scheduling, consultation, office hours, cancel/reschedule, secretariat |
| legal_admin | Privacy, complaints, ombudsman, patient rights, accreditation, governance |
| general | Fallback: no strong keyword signal in the chunk |
The classifier is hospital-agnostic: keyword sets are linguistic (Dutch/French/Italian/English vocabulary), not specific to ZOL. A future tenant gets the same classifier without configuration. This satisfies the hospital-agnostic invariant of the multi-tenant architecture (@bezemer2010multitenant).
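A minimal sketch of the word-boundary, case-insensitive keyword matching described above. The keyword sets are a tiny illustrative sample, not the lists in category_classifier.py, and scoring by hit count with a general fallback is an assumption about how ties and misses are handled.

```python
import re

# Hypothetical keyword sets: an illustrative sample, not the real lists.
KEYWORDS = {
    "practical": ["parking", "ingang", "entrance", "entrée", "parcheggio", "wifi"],
    "regulatory": ["terugbetaling", "reimbursement", "remboursement", "rimborso", "voorschrift"],
    "appointments": ["afspraak", "appointment", "rendez-vous", "appuntamento"],
}

# One case-insensitive, word-boundary pattern per category.
PATTERNS = {
    category: re.compile(r"\b(?:" + "|".join(map(re.escape, words)) + r")\b", re.IGNORECASE)
    for category, words in KEYWORDS.items()
}

def classify_chunk_category(text: str) -> str:
    """Return the category with the most keyword hits, or 'general' if none match."""
    hits = {category: len(pattern.findall(text)) for category, pattern in PATTERNS.items()}
    best = max(hits, key=hits.get)
    return best if hits[best] > 0 else "general"
```

Because of the word boundaries, a compound such as "parkingticket" does not count as a hit for "parking"; whether the real classifier handles Dutch compounds this way is not stated in this document.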
Default 7-intent × 6-category affinity matrix
backend/app/services/value_framework/affinity.py — DEFAULT_AFFINITY:
| Intent | practical | clinical_info | regulatory | appointments | legal_admin | general |
|---|---|---|---|---|---|---|
| navigation_or_practical_info | 1.30 | 0.65 | 0.55 | 1.05 | 0.85 | 1.00 |
| appointment_scheduling | 1.05 | 0.80 | 0.75 | 1.30 | 0.95 | 1.00 |
| medical_information | 0.75 | 1.25 | 1.05 | 0.95 | 0.85 | 1.00 |
| doctor_information | 0.90 | 1.10 | 0.85 | 1.20 | 0.85 | 1.00 |
| department_or_service | 1.10 | 1.10 | 0.85 | 1.20 | 0.90 | 1.00 |
| administrative_or_legal | 0.90 | 0.80 | 1.20 | 0.95 | 1.30 | 1.00 |
| billing_or_insurance | 0.85 | 0.85 | 1.30 | 0.95 | 1.10 | 1.00 |
Multipliers > 1.0 boost; < 1.0 penalize. 1.0 is neutral (no change). Intents not in the map default to neutral across all categories — the framework never makes things worse than the unfiltered ranking.
The maximum boost is 1.30 and the maximum penalty is 0.55. The boundary values were chosen so that even on a worst-case ranking inversion (boosted chunk at similarity 0.50 vs penalised chunk at similarity 0.95), the boost can flip the order:
- Boosted: 0.50 × 1.30 = 0.65
- Penalised: 0.95 × 0.55 = 0.52
The wheelchair example: intent navigation_or_practical_info, regulatory chunk at similarity 0.85. After rerank: 0.85 × 0.55 = 0.47. Practical chunk at 0.65 becomes 0.65 × 1.30 = 0.85. The practical chunk wins.
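The wheelchair arithmetic can be reproduced with a short sketch. Only the navigation_or_practical_info row is copied from the table; the function name rerank, the similarity key, and the decision to return copies rather than mutate the input are assumptions for illustration, not the signature of apply_intent_category_affinity.

```python
# One row copied from the affinity table above; the real map is DEFAULT_AFFINITY.
AFFINITY = {
    "navigation_or_practical_info": {
        "practical": 1.30, "clinical_info": 0.65, "regulatory": 0.55,
        "appointments": 1.05, "legal_admin": 0.85, "general": 1.00,
    },
}

def rerank(chunks, intent):
    """Multiply each similarity by the intent x category affinity and re-sort."""
    row = AFFINITY.get(intent, {})  # unknown intent -> empty row -> neutral 1.0
    scored = []
    for chunk in chunks:
        multiplier = row.get(chunk["framework_category"], 1.0)
        new_score = min(1.0, max(0.0, chunk["similarity"] * multiplier))
        scored.append({**chunk, "similarity": new_score})
    return sorted(scored, key=lambda c: c["similarity"], reverse=True)

chunks = [
    {"id": "reimbursement", "framework_category": "regulatory", "similarity": 0.85},
    {"id": "entrance", "framework_category": "practical", "similarity": 0.65},
]
reranked = rerank(chunks, "navigation_or_practical_info")
unchanged = rerank(chunks, "some_unmapped_intent")
```

Here reranked puts the entrance chunk first (0.845 against 0.4675 before rounding), while unchanged keeps the original order because the unmapped intent falls back to a neutral multiplier.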
Intent-to-category empirical justification
The seven intents come from IntentClassificationService (the hospital-domain intent taxonomy). The matrix entries were derived from the empirical question distribution in the 299-question text golden set, classified manually against the six categories. The wheelchair regression contributed the entry navigation_or_practical_info × regulatory = 0.55. The structural pattern (a categorical-affinity multiplier on top of vector retrieval) is adjacent to the knowledge-graph + vector hybrids surveyed in Sarmah et al. 2024.
The 7-intent × 6-category space is small enough to maintain manually. A future tenant whose ontology requires more categories (e.g., research-hospital with a clinical_trials category) can extend DEFAULT_AFFINITY without changing the application logic.
Formal definition of the affinity operator
Let C = {c₁, …, cₙ} be the candidate chunk set returned by hybrid retrieval, each chunk cᵢ carrying a base relevance score sᵢ ∈ [0,1] (the fused pgvector + BM25 similarity). Let κ(c) → K be the category classifier mapping a chunk to one of the six categories K, and let q ∈ I be the query's intent class drawn from the seven-intent taxonomy. The affinity matrix is a function
A : I × K → ℝ₊ with A(q, k) ∈ [0.55, 1.30]
The reranked score is the clamped product, and the operator re-sorts C by sᵢ′ descending:
sᵢ′ = clamp[0,1]( sᵢ · A(q, κ(cᵢ)) )
Three properties make this safe to run unconditionally on the request path:
- Identity on unknown intent. If q ∉ dom(A), the matrix degenerates to A(q, ·) = 1, so sᵢ′ = sᵢ and the ordering is unchanged. For such intents the framework cannot rank worse than the unfiltered retriever; it is a Pareto-safe refinement.
- Order-preserving within a category. For two chunks of the same category, A is a constant positive scalar, so their relative order is preserved. Reranking only ever changes order across categories, which is exactly the contamination axis it targets.
- Non-idempotent on raw re-application. sᵢ″ = sᵢ · A² ≠ sᵢ′ whenever A ≠ 1, so applying the operator twice would compound the multiplier incorrectly. The implementation guards against this by caching the computed category on the chunk dict (framework_category) and treating an already-scored chunk as terminal; see test 5 in the regression suite.
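The third property can be illustrated with a small guard. This is a sketch under assumptions: the marker key framework_scored is hypothetical, since this document only states that framework_category is cached and that an already-scored chunk is treated as terminal.

```python
def apply_once(chunk, multiplier):
    """Apply the multiplier a single time; later calls return the chunk unchanged."""
    if chunk.get("framework_scored"):  # hypothetical marker key
        return chunk
    chunk["similarity"] = min(1.0, max(0.0, chunk["similarity"] * multiplier))
    chunk["framework_scored"] = True
    return chunk

# Raw re-application compounds the penalty: 0.85 * 0.55 * 0.55.
naive = 0.85 * 0.55 * 0.55

# Guarded re-application: the second call is a no-op.
guarded = {"similarity": 0.85}
apply_once(guarded, 0.55)
apply_once(guarded, 0.55)
```

After both calls guarded holds 0.85 × 0.55, whereas the naive product has been penalised twice.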
Relationship to rank fusion
The operator is a generalisation of the score-multiplier family that Reciprocal Rank Fusion (@cormack2009rrf) belongs to. Where RRF fuses two ranked lists by a rank-reciprocal weight Σₗ 1/(k + rₗ(c)) with no semantic signal, the Value Framework fuses one ranked list with a categorical prior A(q, κ(c)). The two are composable: RRF first produces the fused sᵢ across the vector and BM25 lanes, then A applies the intent-conditioned categorical correction. This places the framework firmly in the "Modular RAG" orchestration layer (@gao2024ragsurvey) rather than the retrieval layer.
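The composition described above can be sketched as two steps. The constant k = 60 is the value used in Cormack et al.; the lane contents, the helper names rrf_fuse and apply_affinity, and the omission of any score normalisation between the two steps are assumptions for illustration.

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion: score(c) = sum over lists of 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return scores

def apply_affinity(scores, categories, affinity_row):
    """Scale each fused score by the categorical prior and sort descending."""
    adjusted = {
        chunk_id: score * affinity_row.get(categories[chunk_id], 1.0)
        for chunk_id, score in scores.items()
    }
    return sorted(adjusted, key=adjusted.get, reverse=True)

# Both lanes independently rank the regulatory chunk first.
vector_lane = ["reimbursement", "entrance", "cafeteria"]
bm25_lane = ["reimbursement", "entrance"]
categories = {"reimbursement": "regulatory", "entrance": "practical", "cafeteria": "practical"}
row = {"practical": 1.30, "regulatory": 0.55}  # navigation_or_practical_info entries

fused = rrf_fuse([vector_lane, bm25_lane])
order_before = sorted(fused, key=fused.get, reverse=True)
order_after = apply_affinity(fused, categories, row)
```

RRF alone leaves the reimbursement chunk on top because both lanes agree on it; the categorical prior is what moves it below the practical chunks.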
The boundary values [0.55, 1.30] are not arbitrary: they are the smallest multiplier band that guarantees a worst-case ranking inversion (a maximally-penalised chunk at s=0.95 vs a maximally-boosted chunk at s=0.50) can still flip, since 0.50 · 1.30 = 0.65 > 0.52 = 0.95 · 0.55. A tighter band would leave pathological inversions unfixable; a wider band would risk boosting weakly-relevant on-category chunks above strongly-relevant ones.
Per-turn rerank flow
The category is computed lazily and cached on the chunk dict (framework_category key), so apply_intent_category_affinity, primary_category, and record_category_mismatch each read the same classification without re-running the classifier.
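A minimal sketch of the lazy caching pattern, assuming chunks are plain dicts with a text field (the text key and the get_category helper are hypothetical). A counting stand-in classifier shows that the second read does not re-run classification.

```python
def get_category(chunk, classify):
    """Return the cached category, classifying the chunk text only on first access."""
    if "framework_category" not in chunk:
        chunk["framework_category"] = classify(chunk.get("text", ""))
    return chunk["framework_category"]

calls = []

def counting_classifier(text):
    """Stand-in classifier that records every invocation."""
    calls.append(text)
    return "practical"

chunk = {"text": "Wheelchair-accessible entrance next to the car park."}
first = get_category(chunk, counting_classifier)
second = get_category(chunk, counting_classifier)  # served from the cache
```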
Fix B — Primary-category prompt guard
After reranking, primary_category(chunks, top_k=5) identifies the dominant category among the top-5 chunks (by cumulative score, not count). The system prompt includes an instruction keyed on this category:
"The context is primarily about practical information (parking, accessibility, navigation). Answer from that perspective. Do not fuse with reimbursement or prescription content even if the query contains a word that appears in both."
This gives the answer LLM an explicit signal to stay in one lane, rather than inferring intent from the mixed retrieval set. The prompt-level guard pairs with the score-level rerank: rerank reduces the probability of mixed retrieval; the prompt handles the residual cases where mixed retrieval still occurs.
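One way the guard could be wired into prompt assembly is sketched below. Only the practical wording is quoted in this document, so the template map contains just that entry; GUARD_TEMPLATES and build_system_prompt are hypothetical names, and falling back to the base prompt for other categories is an assumption.

```python
# Only the practical guard text is given in this document; other categories are left out.
GUARD_TEMPLATES = {
    "practical": (
        "The context is primarily about practical information "
        "(parking, accessibility, navigation). Answer from that perspective. "
        "Do not fuse with reimbursement or prescription content even if the "
        "query contains a word that appears in both."
    ),
}

def build_system_prompt(base_prompt, primary_category):
    """Append the category guard when one is defined; otherwise return the base prompt."""
    guard = GUARD_TEMPLATES.get(primary_category)
    return f"{base_prompt}\n\n{guard}" if guard else base_prompt
```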
primary_category election algorithm
The election uses cumulative similarity score, not chunk count, to weight high-confidence retrieval:
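
    from collections import defaultdict

    def primary_category(chunks, top_k=5):
        # Sum similarity per category over the top_k chunks, then take the argmax.
        weights = defaultdict(float)
        for chunk in chunks[:top_k]:
            weights[chunk["framework_category"]] += chunk["similarity"]
        return max(weights, key=weights.get)
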
Counting chunks would let three weak regulatory chunks (similarities 0.45, 0.45, 0.45 = 1.35 total) outvote two strong practical chunks (similarities 0.85, 0.85 = 1.70 total) — the wrong answer. Weighted-by-score gives the right answer in this scenario.
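The scenario above can be checked directly. This sketch compares a count-based election with the score-weighted one on the five chunks from the example; the dict keys follow the chunk-dict convention used elsewhere in this document, and the similarity key name is an assumption.

```python
from collections import Counter, defaultdict

top5 = [
    {"framework_category": "practical", "similarity": 0.85},
    {"framework_category": "practical", "similarity": 0.85},
    {"framework_category": "regulatory", "similarity": 0.45},
    {"framework_category": "regulatory", "similarity": 0.45},
    {"framework_category": "regulatory", "similarity": 0.45},
]

# Count-based election: three regulatory chunks outvote two practical ones.
by_count = Counter(c["framework_category"] for c in top5).most_common(1)[0][0]

# Score-weighted election: 1.70 (practical) beats 1.35 (regulatory).
weights = defaultdict(float)
for c in top5:
    weights[c["framework_category"]] += c["similarity"]
by_score = max(weights, key=weights.get)
```

Here by_count elects regulatory and by_score elects practical, which is the behaviour the weighted election is meant to secure.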
Fix C — Unit-mismatch admission
backend/app/services/value_framework/unit_mismatch.py. When the query contains a per-minute or per-session unit (parking tariffs) but the top chunks discuss per-kWh or per-item pricing (EV charger costs), the framework emits a structured gap note into the context:
[UNIT MISMATCH] Query asks about per-minute parking rates.
Retrieved context covers per-kWh EV charger pricing.
If you cannot directly answer the unit asked about, say so explicitly.
This prevents the LLM from silently transposing units and producing a confidently wrong answer. The pattern catches the "data exists but at the wrong granularity" failure mode that pure similarity ranking can't detect.
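A rough sketch of how such a detector might work, assuming simple regex unit vocabularies. The patterns, the helper names, and the rule "emit a note only when the query's units never appear in the context" are assumptions; the actual logic in unit_mismatch.py is not shown in this document.

```python
import re

# Hypothetical unit vocabularies, for illustration only.
UNIT_PATTERNS = {
    "per-minute": re.compile(r"\bper\s+(?:minute|minuut)\b", re.IGNORECASE),
    "per-session": re.compile(r"\bper\s+(?:session|sessie)\b", re.IGNORECASE),
    "per-kWh": re.compile(r"\bper\s+kwh\b", re.IGNORECASE),
    "per-item": re.compile(r"\bper\s+(?:item|stuk)\b", re.IGNORECASE),
}

def detect_units(text):
    """Return the set of unit labels whose pattern occurs in the text."""
    return {unit for unit, pattern in UNIT_PATTERNS.items() if pattern.search(text)}

def unit_mismatch_note(query, chunk_texts):
    """Return a gap note when the query's units never appear in the context, else None."""
    asked = detect_units(query)
    found = set().union(*(detect_units(text) for text in chunk_texts))
    if not asked or not found or asked & found:
        return None
    return (
        f"[UNIT MISMATCH] Query asks about {', '.join(sorted(asked))} rates.\n"
        f"Retrieved context covers {', '.join(sorted(found))} pricing.\n"
        "If you cannot directly answer the unit asked about, say so explicitly."
    )
```

A query about parking cost per minute against a chunk that only mentions billing per kWh yields a note; the same query against a chunk that mentions a per-minute tariff yields None.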
Fix E — category_mismatch_rate telemetry
backend/app/services/value_framework/telemetry.py — record_category_mismatch.
Per-turn metric written to app.category_mismatch_telemetry:
| Column | Type | Description |
|---|---|---|
| tenant_id | uuid | Isolates metrics per hospital tenant |
| conversation_id | uuid | Correlates turns in one call |
| message_id | uuid | Links to app.conversation_messages |
| intent_class | text | Intent that drove the rerank |
| primary_category | text | Dominant category of top-5 chunks |
| mismatch_rate | numeric | 0.0 – 1.0 fraction of top-5 chunks below boost threshold |
| chunks_total | int | Top-K evaluated |
| chunks_off_category | int | Chunks with affinity < 1.0 |
| query_preview | text | First 200 chars of query (no PII: voice queries are already PII-redacted via voice_pii_redaction) |
Exposed at GET /api/v1/admin/ops/category-mismatch (backend/app/api/admin_ops.py:342) and charted on the /analytics/system Operations tab. A sustained spike (e.g., mismatch_rate > 0.6 for 50+ turns) indicates an emerging query class that the affinity table doesn't cover yet — the fix is adding a new intent row to DEFAULT_AFFINITY.
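The metric columns and the spike heuristic can be sketched as follows. This assumes mismatch_rate is chunks_off_category divided by chunks_total, which matches the column descriptions but is not confirmed by the source; category_mismatch and sustained_spike are hypothetical helper names, not functions in telemetry.py.

```python
def category_mismatch(chunks, affinity_row, top_k=5):
    """Summarise how many of the top_k chunks sit in a category with affinity < 1.0."""
    top = chunks[:top_k]
    off = sum(1 for c in top if affinity_row.get(c["framework_category"], 1.0) < 1.0)
    total = len(top)
    return {
        "chunks_total": total,
        "chunks_off_category": off,
        "mismatch_rate": off / total if total else 0.0,
    }

def sustained_spike(rates, threshold=0.6, window=50):
    """True when each of the last `window` turns exceeds the threshold."""
    recent = rates[-window:]
    return len(recent) == window and all(rate > threshold for rate in recent)

# navigation_or_practical_info entries from the affinity table.
row = {"practical": 1.30, "clinical_info": 0.65, "regulatory": 0.55,
       "appointments": 1.05, "legal_admin": 0.85, "general": 1.00}
chunks = [{"framework_category": c} for c in
          ["regulatory", "regulatory", "regulatory", "practical", "practical"]]
record = category_mismatch(chunks, row)
```

With three of five chunks in a penalised category, record reports a mismatch_rate of 0.6, above the 0.5 level the regression suite treats as detectable.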
The function follows the R1 silent-failure discipline (CLAUDE.md §Silent-Failure Discipline): it logs the rate at INFO on every call ([ValueFramework] category_mismatch_rate=... intent=... primary=... off=.../...), so a scan of the logs reveals the metric per turn without needing to query the DB.
Latency budget
The full Value Framework pass adds modest latency to the RAG pipeline. * markers indicate stages whose timings have not yet been pinned to a histogram on the pilot.
| Stage | Local-dev p50 | Notes |
|---|---|---|
| classify_chunk_category (per chunk) | < 0.1 ms* | Pure-Python regex word-boundary match |
| apply_intent_category_affinity (10 chunks) | < 1 ms* | One dict lookup + one float multiply per chunk |
| primary_category (top-5) | < 0.1 ms* | Sum + argmax over 5 entries |
| record_category_mismatch (DB write) | 5–20 ms* | One Postgres INSERT |
The total framework overhead is ~5–25 ms per turn — small relative to the 600–1 200 ms RAG inner loop. Per Beyer et al. 2016 §4 (SLOs at the tail), pilot measurement should pin a p95 budget; ~50 ms is the working assumption.
Wheelchair regression test
backend/tests/integration/services/test_value_framework_wheelchair_regression.py
8 tests that pin the end-to-end contract for the wheelchair conflation scenario:
- Regulatory chunk for wheelchair prescription scores BELOW practical chunk for wheelchair entrance after rerank with navigation_or_practical_info intent
- The category_mismatch_rate for a practical-intent query against a regulatory-dominant result set is > 0.5 (detectable)
- primary_category returns practical after affinity rerank on the wheelchair accessibility query
- classify_chunk_category correctly labels the parking/entrance chunk as practical and the reimbursement chunk as regulatory
- Calling apply_intent_category_affinity twice is idempotent on stable inputs (multiplying twice would compound incorrectly; the function accepts the already-classified chunks via the framework_category cache)
- Unknown intent (not in affinity map) leaves scores unchanged
- Chunks without a readable score field pass through without error
- The unit-mismatch detector fires on parking-per-minute vs EV-per-kWh mismatch
This is the R2 regression-pin per CLAUDE.md §Silent-Failure Discipline — a test that lives WITH the fix, asserts the user-visible post-state, and would catch the regression on day 1 if it returned.
References
- backend/app/services/value_framework/affinity.py: apply_intent_category_affinity, DEFAULT_AFFINITY, primary_category, category_match_ratio
- backend/app/services/value_framework/category_classifier.py: classify_chunk_category, Category enum
- backend/app/services/value_framework/telemetry.py: record_category_mismatch
- backend/app/services/value_framework/unit_mismatch.py: unit-mismatch detection
- backend/tests/integration/services/test_value_framework_wheelchair_regression.py: regression test suite
- backend/alembic/versions/066_category_mismatch_telemetry.py: migration creating app.category_mismatch_telemetry
- Cormack et al. 2009: Reciprocal Rank Fusion, the rank-fusion lineage that the affinity multiplier extends
- Robertson & Zaragoza 2009: BM25 baseline, alternative considered and rejected for the contamination problem
- Nogueira & Cho 2019: passage reranker baseline, cross-encoder alternative considered and rejected on latency
- Sarmah et al. 2024: HybridRAG; structurally adjacent (knowledge-graph + vector hybrid). The Value Framework is the categorical-affinity sibling: a category-classifier output that augments the vector ranking
- Gao et al. 2024: Modular RAG taxonomy; the framework is a specialised retrieval-augmentation module
- Bezemer & Zaidman 2010: multi-tenant SaaS isolation; the hospital-agnostic claim of the keyword sets satisfies tenant-additive extensibility