Value Framework

Package: backend/app/services/value_framework/
Migration: 066 (backend/alembic/versions/066_category_mismatch_telemetry.py)
Tests: 76 unit tests + 8-test wheelchair regression integration test (counts may drift; run pytest --collect-only backend/tests/integration/services/test_value_framework_*.py backend/tests/unit/services/value_framework/ for the current count)

Part of the Knowledge & Retrieval Steering triad

This is the retrieval-steering member of a three-subsystem set. It is best read alongside the Taxonomy (the structured knowledge graph it reranks against) and SNOMED CT (the ontology that resolves the entities the taxonomy routes). See the Core Concepts flow for how all three compose end-to-end in a single query. The Value Framework runs as Stage 5b of the Taxonomy Query Enrichment pipeline.

The cross-category contamination problem

The Value Framework was built to close a concrete regression. A caller asked about wheelchair accessibility at the hospital entrance. The answer they expected: "There are wheelchair-accessible entrances at Schiepse Bos and Genk." The answer they received: a confused reimbursement-process explanation.

Why? The query contained the word "rolstoel" (wheelchair). The orthopedic chunk about wheelchair-prescription reimbursement lexically over-scored on "rolstoel" and outranked the parking/accessibility chunk — because pgvector cosine similarity (@pgvector_docs) on a 3072-dim embedding (@openai2024embeddings) sees "rolstoel" in both documents and gives them nearly identical scores. The top-K context then contained primarily regulatory content, and the LLM faithfully answered from that context.

This is cross-category contamination: a query with intent navigation_or_practical_info gets answered from regulatory content. The embedding model is doing its job; the problem is that similarity alone cannot distinguish "I want to park my wheelchair" from "I want to claim reimbursement for my wheelchair prescription."

Why retrieval-only solutions don't suffice

Three retrieval-side approaches were considered before adding categorical-affinity reranking. Each was rejected:

Alternative	Why rejected
Add BM25 keyword search to widen the recall pool (@robertson2009bm25)	Already in the hybrid retriever — both branches surface the same regulatory chunk on `rolstoel`. Adding keyword recall makes the contamination worse, not better.
Reciprocal Rank Fusion across vector + BM25 (@cormack2009rrf)	RRF combines rankings without normalising scores; both branches independently rank the regulatory chunk high, so RRF preserves the contamination.
Cross-encoder reranker (BERT-style passage rerank, @nogueira2019passagererank)	Would help, but a cross-encoder runs a per-query inference (~50 ms on Jina v2) over the entire candidate set. On the voice channel that adds ~50 ms before the LLM call, and the cross-encoder still has no signal that "parking" and "reimbursement" are different categories of fact.

The Value Framework is structurally a categorical-affinity multiplier applied to the existing similarity score. It is conceptually descended from RRF (rank-fusion lineage) but extends it with a per-intent × per-category multiplier matrix. This places it in Gao et al.'s "Modular RAG" taxonomy (@gao2024ragsurvey) — a specialised retrieval-augmentation module orchestrated alongside the existing retriever, not replacing it.

Six canonical content categories

backend/app/services/value_framework/category_classifier.py assigns one of six categories to any retrieved chunk using multi-language keyword matching (word-boundary regex, case-insensitive, nl/en/fr/it):

Category	What it covers
`practical`	Parking, entrance, visiting hours, accessibility, cafeteria, wifi, lift, directions
`clinical_info`	Conditions, treatments, diagnoses, symptoms, medications, procedures
`regulatory`	Reimbursement, prescriptions, insurance, social fund, orthopedic devices, co-pay
`appointments`	Scheduling, consultation, office hours, cancel/reschedule, secretariat
`legal_admin`	Privacy, complaints, ombudsman, patient rights, accreditation, governance
`general`	Fallback — no strong keyword signal in the chunk

The classifier is hospital-agnostic: keyword sets are linguistic (Dutch/French/Italian/English vocabulary), not specific to ZOL. A future tenant gets the same classifier without configuration. This satisfies the hospital-agnostic invariant of the multi-tenant architecture (@bezemer2010multitenant).
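The keyword-matching approach can be sketched as follows. This is a minimal illustration, not the contents of category_classifier.py: the keyword lists are hypothetical and heavily abbreviated, only three of the six categories are shown, and picking the category with the most hits is an assumption about how competing matches are resolved.

```python
import re

# Hypothetical, abbreviated keyword sets (nl/en/fr/it). The real sets are
# larger and are not reproduced here.
KEYWORDS = {
    "practical": ["parking", "ingang", "entrance", "entrée", "ingresso"],
    "regulatory": ["terugbetaling", "reimbursement", "remboursement", "rimborso"],
    "appointments": ["afspraak", "appointment", "rendez-vous", "appuntamento"],
}

# One case-insensitive, word-boundary pattern per category.
PATTERNS = {
    category: re.compile(
        r"\b(?:" + "|".join(re.escape(word) for word in words) + r")\b",
        re.IGNORECASE,
    )
    for category, words in KEYWORDS.items()
}


def classify_chunk_category(text: str) -> str:
    """Return the category with the most keyword hits, or "general" if none."""
    hits = {category: len(pattern.findall(text)) for category, pattern in PATTERNS.items()}
    best = max(hits, key=hits.get)
    return best if hits[best] > 0 else "general"
```

Because the vocabulary is linguistic rather than tenant-specific, nothing in the sketch refers to a particular hospital, which is the property the paragraph above relies on.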

Default 7-intent × 6-category affinity matrix

backend/app/services/value_framework/affinity.py — DEFAULT_AFFINITY:

Intent	practical	clinical_info	regulatory	appointments	legal_admin	general
`navigation_or_practical_info`	1.30	0.65	0.55	1.05	0.85	1.00
`appointment_scheduling`	1.05	0.80	0.75	1.30	0.95	1.00
`medical_information`	0.75	1.25	1.05	0.95	0.85	1.00
`doctor_information`	0.90	1.10	0.85	1.20	0.85	1.00
`department_or_service`	1.10	1.10	0.85	1.20	0.90	1.00
`administrative_or_legal`	0.90	0.80	1.20	0.95	1.30	1.00
`billing_or_insurance`	0.85	0.85	1.30	0.95	1.10	1.00

Multipliers > 1.0 boost; < 1.0 penalize. 1.0 is neutral (no change). Intents not in the map default to neutral across all categories, so for those intents the ranking is identical to the unfiltered one.

The maximum boost is 1.30 and the strongest penalty is 0.55. The boundary values were chosen so that even in a worst-case ranking inversion (boosted chunk at similarity 0.50 vs penalised chunk at similarity 0.95), the rerank can flip the order:

Boosted: 0.50 × 1.30 = 0.65
Penalised: 0.95 × 0.55 ≈ 0.52

The wheelchair example: with intent navigation_or_practical_info, the regulatory chunk at similarity 0.85 drops to 0.85 × 0.55 ≈ 0.47 after rerank, while the practical chunk at 0.65 rises to 0.65 × 1.30 ≈ 0.85. The practical chunk wins.
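The lookup and the wheelchair arithmetic can be reproduced in a few lines. This is a sketch under assumptions: the nested-dict layout and the names AFFINITY_SKETCH and affinity are illustrative (the real structure of DEFAULT_AFFINITY is not shown in this document), only two of the seven rows are copied from the table, and treating unknown categories as neutral is an assumption alongside the documented neutral default for unknown intents.

```python
NEUTRAL = 1.0

# Two of the seven rows from the table above, as a nested dict (assumed layout).
AFFINITY_SKETCH = {
    "navigation_or_practical_info": {
        "practical": 1.30, "clinical_info": 0.65, "regulatory": 0.55,
        "appointments": 1.05, "legal_admin": 0.85, "general": 1.00,
    },
    "billing_or_insurance": {
        "practical": 0.85, "clinical_info": 0.85, "regulatory": 1.30,
        "appointments": 0.95, "legal_admin": 1.10, "general": 1.00,
    },
}


def affinity(intent: str, category: str) -> float:
    """Look up the multiplier; unknown intents (and categories) stay neutral."""
    return AFFINITY_SKETCH.get(intent, {}).get(category, NEUTRAL)


# The wheelchair example from the text.
intent = "navigation_or_practical_info"
regulatory_score = 0.85 * affinity(intent, "regulatory")  # about 0.47
practical_score = 0.65 * affinity(intent, "practical")    # about 0.85
print(practical_score > regulatory_score)  # prints True
```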

Intent-to-category empirical justification

The seven intents come from IntentClassificationService (the hospital-domain intent taxonomy). The matrix entries were derived from the empirical question distribution in the 299-question text golden set, classified manually against the six categories. The wheelchair regression contributed the entry navigation_or_practical_info × regulatory = 0.55. The structural pattern (a categorical-affinity multiplier on top of vector retrieval) is adjacent to the knowledge-graph + vector hybrids surveyed in Sarmah et al. 2024.

The 7-intent × 6-category space is small enough to maintain manually. A future tenant whose ontology requires more categories (e.g., research-hospital with a clinical_trials category) can extend DEFAULT_AFFINITY without changing the application logic.
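Extending the matrix with a tenant-specific category might look like the sketch below. The helper with_category, the nested-dict layout, and the 1.20 multiplier are all hypothetical, chosen only for illustration; only the medical_information row is copied from the table.

```python
import copy

# One row copied from the default table; layout as a nested dict is assumed.
BASE_AFFINITY = {
    "medical_information": {
        "practical": 0.75, "clinical_info": 1.25, "regulatory": 1.05,
        "appointments": 0.95, "legal_admin": 0.85, "general": 1.00,
    },
}


def with_category(matrix, category, weights, default=1.0):
    """Return a copy of the matrix with one extra category column.

    Intents missing from `weights` get the neutral default, so the new
    category never changes their ranking.
    """
    extended = copy.deepcopy(matrix)
    for intent, row in extended.items():
        row[category] = weights.get(intent, default)
    return extended


# Hypothetical value: boost trial content for medical questions.
tenant_affinity = with_category(
    BASE_AFFINITY, "clinical_trials", {"medical_information": 1.20}
)
```

The category classifier would also need keywords for the new category; without them no chunk would ever be labelled clinical_trials and the new column would have no effect.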

Formal definition of the affinity operator

Let C = {c₁, …, cₙ} be the candidate chunk set returned by hybrid retrieval, each chunk cᵢ carrying a base relevance score sᵢ ∈ [0,1] (the fused pgvector + BM25 similarity). Let κ(c) → K be the category classifier mapping a chunk to one of the six categories K, and let q ∈ I be the query's intent class drawn from the seven-intent taxonomy. The affinity matrix is a function

A : I × K → ℝ₊      with   A(q, k) ∈ [0.55, 1.30]

The reranked score is the clamped product, and the operator re-sorts C by sᵢ′ descending:

sᵢ′ = clamp[0,1]( sᵢ · A(q, κ(cᵢ)) )

Three properties make this safe to run unconditionally on the request path:

Identity on unknown intent. If q ∉ dom(A) the matrix degenerates to A(q, ·) = 1, so sᵢ′ = sᵢ and the ordering is unchanged. For such intents the framework cannot rank worse than the unfiltered retriever.
Order-preserving within a category. For two chunks of the same category, A is a constant positive scalar, so their relative order is preserved (clamping at 1 can at most turn a strict ordering into a tie). Reranking only ever changes order across categories, which is exactly the contamination axis it targets.
Non-idempotent on raw re-application. sᵢ″ = sᵢ · A² ≠ sᵢ′ whenever A ≠ 1, so applying the operator twice would compound the multiplier incorrectly. The implementation guards this by caching the computed category on the chunk dict (framework_category) and treating an already-scored chunk as terminal; see test 5 in the regression suite.
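A minimal sketch of the operator with its three properties follows. It is not the code in affinity.py: the function name rerank, the "similarity" key, and the "framework_scored" guard flag are assumptions made for illustration, and the sketch mutates the chunk dicts in place.

```python
from typing import Dict, List

Matrix = Dict[str, Dict[str, float]]


def rerank(chunks: List[dict], intent: str, matrix: Matrix) -> List[dict]:
    """Apply s' = clamp[0,1](s * A(q, category)) and re-sort descending."""
    row = matrix.get(intent, {})  # unknown intent: empty row, all neutral
    for chunk in chunks:
        if chunk.get("framework_scored"):  # hypothetical terminal flag
            continue  # already scored: never multiply a second time
        multiplier = row.get(chunk["framework_category"], 1.0)
        chunk["similarity"] = min(1.0, max(0.0, chunk["similarity"] * multiplier))
        chunk["framework_scored"] = True
    # sorted() is stable, so chunks with equal scores keep their input order.
    return sorted(chunks, key=lambda c: c["similarity"], reverse=True)
```

Calling rerank a second time on its own output leaves every score untouched because of the guard flag, which is the behaviour the regression suite pins.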

Relationship to rank fusion

The operator is a generalisation of the score-multiplier family that Reciprocal Rank Fusion (@cormack2009rrf) belongs to. Where RRF fuses two ranked lists by a rank-reciprocal weight Σₗ 1/(k + rₗ(c)) with no semantic signal, the Value Framework fuses one ranked list with a categorical prior A(q, κ(c)). The two are composable: RRF first produces the fused sᵢ across the vector and BM25 lanes, then A applies the intent-conditioned categorical correction. This places the framework firmly in the "Modular RAG" orchestration layer (@gao2024ragsurvey) rather than the retrieval layer.
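The composition described above can be sketched as two steps: fuse the lanes with RRF, then apply the categorical prior. The function names and chunk ids are hypothetical, k = 60 is the constant commonly used with RRF, and the sketch skips any normalisation of the fused score into [0, 1] as well as the clamp.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion: score(c) = sum over lists of 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    return dict(scores)


def apply_prior(fused, categories, affinity_row):
    """Multiply each fused score by the intent's multiplier for its category."""
    adjusted = {
        chunk_id: score * affinity_row.get(categories[chunk_id], 1.0)
        for chunk_id, score in fused.items()
    }
    return sorted(adjusted, key=adjusted.get, reverse=True)


# Both lanes rank the reimbursement chunk first, so RRF alone keeps it on top.
vector_lane = ["reimbursement", "entrance"]
bm25_lane = ["reimbursement", "entrance"]
categories = {"reimbursement": "regulatory", "entrance": "practical"}
navigation_row = {"practical": 1.30, "regulatory": 0.55}

fused = rrf_fuse([vector_lane, bm25_lane])
ranked = apply_prior(fused, categories, navigation_row)
```

In this toy input the fused ranking still leads with the reimbursement chunk, and only the categorical prior moves the entrance chunk to the top.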

The boundary values [0.55, 1.30] are not arbitrary: they were chosen so that a worst-case ranking inversion (a maximally-penalised chunk at s=0.95 vs a maximally-boosted chunk at s=0.50) can still flip, since 0.50 · 1.30 = 0.65 > 0.52 ≈ 0.95 · 0.55. Flipping that pair requires a band ratio above 0.95 / 0.50 = 1.9, and 1.30 / 0.55 ≈ 2.36 clears it with some margin. A much tighter band would leave pathological inversions unfixable; a wider band would risk boosting weakly-relevant on-category chunks above strongly-relevant ones.

Per-turn rerank flow

The category is computed lazily and cached on the chunk dict (framework_category key), so apply_intent_category_affinity, primary_category, and record_category_mismatch each read the same classification without re-running the classifier.
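The lazy-cache pattern can be illustrated as below. The helper get_category and the "content" key are hypothetical; only the framework_category key comes from the text.

```python
def get_category(chunk: dict, classifier) -> str:
    """Classify on first access, then reuse the value cached on the chunk."""
    if "framework_category" not in chunk:
        chunk["framework_category"] = classifier(chunk.get("content", ""))
    return chunk["framework_category"]


# Demonstration with a stand-in classifier that records how often it runs.
calls = []


def counting_classifier(text: str) -> str:
    calls.append(text)
    return "practical"


chunk = {"content": "Parking at the main entrance"}
first = get_category(chunk, counting_classifier)
second = get_category(chunk, counting_classifier)  # served from the cache
```

Every consumer that goes through the same accessor sees one consistent label per chunk, and the classifier runs at most once per chunk per turn.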

Fix B — Primary-category prompt guard

After reranking, primary_category(chunks, top_k=5) identifies the dominant category among the top-5 chunks (by cumulative score, not count). The system prompt includes an instruction keyed on this category:

"The context is primarily about practical information (parking, accessibility, navigation). Answer from that perspective. Do not fuse with reimbursement or prescription content even if the query contains a word that appears in both."

This gives the answer LLM an explicit signal to stay in one lane, rather than inferring intent from the mixed retrieval set. The prompt-level guard pairs with the score-level rerank: rerank reduces the probability of mixed retrieval; the prompt handles the residual cases where mixed retrieval still occurs.
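One way to key the instruction on the elected category is a simple lookup, sketched below. The names GUARDS and build_system_prompt are hypothetical, only the practical wording is taken from the example above, and returning the base prompt unchanged for categories without a guard is an assumption.

```python
# Guard text per primary category. Only the "practical" wording comes from
# the example above; other categories would need their own text.
GUARDS = {
    "practical": (
        "The context is primarily about practical information (parking, "
        "accessibility, navigation). Answer from that perspective. Do not "
        "fuse with reimbursement or prescription content even if the query "
        "contains a word that appears in both."
    ),
}


def build_system_prompt(base_prompt: str, primary: str) -> str:
    """Append the category guard, if one exists, to the base system prompt."""
    guard = GUARDS.get(primary)
    return f"{base_prompt}\n\n{guard}" if guard else base_prompt
```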

`primary_category` election algorithm

The election uses cumulative similarity score, not chunk count, to weight high-confidence retrieval:

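from collections import defaultdict

def primary_category(chunks, top_k=5):
    # Sum similarity per category over the top_k chunks (each chunk is a dict).
    weights = defaultdict(float)
    for chunk in chunks[:top_k]:
        weights[chunk["framework_category"]] += chunk["similarity"]
    # Highest cumulative similarity wins. Assumed fallback: "general" on empty input.
    return max(weights, key=weights.get) if weights else "general"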

Counting chunks would let three weak regulatory chunks (similarities 0.45, 0.45, 0.45 = 1.35 total) outvote two strong practical chunks (similarities 0.85, 0.85 = 1.70 total) — the wrong answer. Weighted-by-score gives the right answer in this scenario.
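The scenario above can be checked directly by running both election rules on the same five chunks. The chunk dicts use the framework_category and similarity fields named in the text; the variable names are illustrative.

```python
from collections import Counter, defaultdict

top5 = [
    {"framework_category": "practical", "similarity": 0.85},
    {"framework_category": "practical", "similarity": 0.85},
    {"framework_category": "regulatory", "similarity": 0.45},
    {"framework_category": "regulatory", "similarity": 0.45},
    {"framework_category": "regulatory", "similarity": 0.45},
]

# Election by chunk count: three regulatory chunks beat two practical ones.
by_count = Counter(c["framework_category"] for c in top5).most_common(1)[0][0]

# Election by cumulative similarity: 1.70 practical beats 1.35 regulatory.
totals = defaultdict(float)
for c in top5:
    totals[c["framework_category"]] += c["similarity"]
by_score = max(totals, key=totals.get)

print(by_count, by_score)  # prints: regulatory practical
```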

Fix C — Unit-mismatch admission

backend/app/services/value_framework/unit_mismatch.py. When the query contains a per-minute or per-session unit (parking tariffs) but the top chunks discuss per-kWh or per-item pricing (EV charger costs), the framework emits a structured gap note into the context:

[UNIT MISMATCH] Query asks about per-minute parking rates.
Retrieved context covers per-kWh EV charger pricing.
If you cannot directly answer the unit asked about, say so explicitly.

This prevents the LLM from silently transposing units and producing a confidently wrong answer. The pattern catches the "data exists but at the wrong granularity" failure mode that pure similarity ranking can't detect.
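A detector of this kind can be sketched with a small unit vocabulary. Everything here is an assumption for illustration: the patterns, the function names, and the rule that a note is emitted only when the query's units and the context's units are both non-empty and disjoint. The note layout follows the example above.

```python
import re
from typing import List, Optional, Set

# Hypothetical unit vocabulary; the real detector's patterns are not shown here.
UNIT_PATTERNS = {
    "per-minute": re.compile(r"\bper\s+(?:minute|minuut)\b", re.IGNORECASE),
    "per-session": re.compile(r"\bper\s+(?:session|sessie)\b", re.IGNORECASE),
    "per-kWh": re.compile(r"\bper\s+kwh\b", re.IGNORECASE),
}


def find_units(text: str) -> Set[str]:
    """Return the names of all unit patterns that occur in the text."""
    return {unit for unit, pattern in UNIT_PATTERNS.items() if pattern.search(text)}


def unit_mismatch_note(query: str, chunk_texts: List[str]) -> Optional[str]:
    """Build a gap note when the query's units never appear in the context."""
    asked = find_units(query)
    covered: Set[str] = set()
    for text in chunk_texts:
        covered |= find_units(text)
    if not asked or not covered or asked & covered:
        return None  # nothing to compare, or at least one unit matches
    return (
        f"[UNIT MISMATCH] Query asks about {', '.join(sorted(asked))} rates.\n"
        f"Retrieved context covers {', '.join(sorted(covered))} pricing.\n"
        "If you cannot directly answer the unit asked about, say so explicitly."
    )
```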

Fix E — `category_mismatch_rate` telemetry

backend/app/services/value_framework/telemetry.py — record_category_mismatch.

Per-turn metric written to app.category_mismatch_telemetry:

Column	Type	Description
`tenant_id`	uuid	Isolates metrics per hospital tenant
`conversation_id`	uuid	Correlates turns in one call
`message_id`	uuid	Links to `app.conversation_messages`
`intent_class`	text	Intent that drove the rerank
`primary_category`	text	Dominant category of top-5 chunks
`mismatch_rate`	numeric	0.0 – 1.0 fraction of top-5 chunks below boost threshold
`chunks_total`	int	Top-K evaluated
`chunks_off_category`	int	Chunks with affinity < 1.0
`query_preview`	text	First 200 chars of query (no PII — voice queries are already PII-redacted via `voice_pii_redaction`)

Exposed at GET /api/v1/admin/ops/category-mismatch (backend/app/api/admin_ops.py:342) and charted on the /analytics/system Operations tab. A sustained spike (e.g., mismatch_rate > 0.6 for 50+ turns) indicates an emerging query class that the affinity table doesn't cover yet — the fix is adding a new intent row to DEFAULT_AFFINITY.
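A sustained spike can be detected with a run-length check over the per-turn rates. This is a sketch: the function name is hypothetical, and reading "for 50+ turns" as 50 consecutive turns above the threshold is an assumption.

```python
from typing import Iterable


def sustained_spike(rates: Iterable[float], threshold: float = 0.6, min_turns: int = 50) -> bool:
    """True if `min_turns` consecutive rates exceed `threshold`."""
    run = 0
    for rate in rates:
        run = run + 1 if rate > threshold else 0  # any dip resets the run
        if run >= min_turns:
            return True
    return False
```

A windowed average would be a reasonable alternative if isolated low-mismatch turns should not reset the count.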

The function follows the R1 silent-failure discipline (CLAUDE.md §Silent-Failure Discipline): it logs the rate at INFO on every call ([ValueFramework] category_mismatch_rate=... intent=... primary=... off=.../...), so a scan of the logs reveals the metric per turn without needing to query the DB.
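The metric and its log line might be computed as sketched below, without the database write. The function names, the logger name, and the two-decimal formatting are hypothetical; defining the rate as off-category chunks divided by evaluated chunks is an assumption based on the column descriptions above.

```python
import logging
from typing import Dict, List, Tuple

logger = logging.getLogger("value_framework")  # hypothetical logger name


def category_mismatch_rate(
    chunks: List[dict], affinity_row: Dict[str, float], top_k: int = 5
) -> Tuple[float, int, int]:
    """Return (rate, off_category_count, evaluated_count) for the top_k chunks."""
    top = chunks[:top_k]
    off = sum(
        1 for chunk in top
        if affinity_row.get(chunk["framework_category"], 1.0) < 1.0
    )
    rate = off / len(top) if top else 0.0
    return rate, off, len(top)


def log_mismatch(intent: str, primary: str, rate: float, off: int, total: int) -> None:
    """Emit the per-turn INFO line in the format quoted above."""
    logger.info(
        "[ValueFramework] category_mismatch_rate=%.2f intent=%s primary=%s off=%d/%d",
        rate, intent, primary, off, total,
    )
```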

Latency budget

The full Value Framework pass adds modest latency to the RAG pipeline. * markers indicate stages whose timings have not yet been pinned to a histogram on the pilot.

Stage	Local-dev p50	Notes
`classify_chunk_category` (per chunk)	< 0.1 ms*	Pure-Python regex word-boundary match
`apply_intent_category_affinity` (10 chunks)	< 1 ms*	One dict lookup + one float multiply per chunk
`primary_category` (top-5)	< 0.1 ms*	Sum + argmax over 5 entries
`record_category_mismatch` (DB write)	5–20 ms*	One Postgres INSERT

The total framework overhead is ~5–25 ms per turn — small relative to the 600–1 200 ms RAG inner loop. Per Beyer et al. 2016 §4 (SLOs at the tail), pilot measurement should pin a p95 budget; ~50 ms is the working assumption.
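Until the pilot histograms exist, per-stage timings can be collected with a small harness like the one below. It is a generic sketch, not the project's instrumentation: measure_ms and p50_p95 are hypothetical helpers, and the percentile method (inclusive quantiles) is one of several reasonable choices.

```python
import statistics
import time
from typing import Callable, List, Tuple


def measure_ms(fn: Callable, *args, repeats: int = 200) -> List[float]:
    """Time `fn(*args)` `repeats` times and return wall-clock samples in ms."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        samples.append((time.perf_counter() - start) * 1000.0)
    return samples


def p50_p95(samples: List[float]) -> Tuple[float, float]:
    """Return the 50th and 95th percentiles of the samples."""
    cuts = statistics.quantiles(samples, n=100, method="inclusive")
    return cuts[49], cuts[94]  # cuts[i] is the (i + 1)-th percentile
```

Passing a stage function and a representative chunk list to measure_ms, then feeding the samples to p50_p95, would give numbers to compare against the ~50 ms working assumption.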

Wheelchair regression test

backend/tests/integration/services/test_value_framework_wheelchair_regression.py

8 tests that pin the end-to-end contract for the wheelchair conflation scenario:

1. Regulatory chunk for wheelchair prescription scores BELOW practical chunk for wheelchair entrance after rerank with navigation_or_practical_info intent
2. The category_mismatch_rate for a practical-intent query against a regulatory-dominant result set is > 0.5 (detectable)
3. primary_category returns practical after affinity rerank on the wheelchair accessibility query
4. classify_chunk_category correctly labels the parking/entrance chunk as practical and the reimbursement chunk as regulatory
5. Calling apply_intent_category_affinity twice is idempotent on stable inputs (multiplying twice would compound incorrectly; the function accepts the already-classified chunks via the framework_category cache)
6. Unknown intent (not in affinity map) leaves scores unchanged
7. Chunks without a readable score field pass through without error
8. The unit-mismatch detector fires on parking-per-minute vs EV-per-kWh mismatch

This is the R2 regression-pin per CLAUDE.md §Silent-Failure Discipline — a test that lives WITH the fix, asserts the user-visible post-state, and would catch the regression on day 1 if it returned.

References

backend/app/services/value_framework/affinity.py — apply_intent_category_affinity, DEFAULT_AFFINITY, primary_category, category_match_ratio
backend/app/services/value_framework/category_classifier.py — classify_chunk_category, Category enum
backend/app/services/value_framework/telemetry.py — record_category_mismatch
backend/app/services/value_framework/unit_mismatch.py — unit-mismatch detection
backend/tests/integration/services/test_value_framework_wheelchair_regression.py — regression test suite
backend/alembic/versions/066_category_mismatch_telemetry.py — migration creating app.category_mismatch_telemetry
Cormack et al. 2009 — Reciprocal Rank Fusion, the rank-fusion lineage that the affinity multiplier extends
Robertson & Zaragoza 2009 — BM25 baseline, alternative considered and rejected for the contamination problem
Nogueira & Cho 2019 — passage reranker baseline, cross-encoder alternative considered and rejected on latency
Sarmah et al. 2024 — HybridRAG; structurally adjacent (knowledge-graph + vector hybrid). The Value Framework is the categorical-affinity sibling: a category-classifier output that augments the vector ranking
Gao et al. 2024 — Modular RAG taxonomy; the framework is a specialised retrieval-augmentation module
Bezemer & Zaidman 2010 — multi-tenant SaaS isolation; the hospital-agnostic claim of the keyword sets satisfies tenant-additive extensibility
