
Value Framework

Package: backend/app/services/value_framework/
Migration: 066 (backend/alembic/versions/066_category_mismatch_telemetry.py)
Tests: 76 unit tests plus an 8-test wheelchair regression integration test. Counts may drift; run pytest --collect-only backend/tests/integration/services/test_value_framework_*.py backend/tests/unit/services/value_framework/ for the current count.

Part of the Knowledge & Retrieval Steering triad

This is the retrieval-steering member of a three-subsystem set. It is best read alongside the Taxonomy (the structured knowledge graph it reranks against) and SNOMED CT (the ontology that resolves the entities the taxonomy routes). See the Core Concepts flow for how all three compose end-to-end in a single query. The Value Framework runs as Stage 5b of the Taxonomy Query Enrichment pipeline.

The cross-category contamination problem

The Value Framework was built to close a concrete regression. A caller asked about wheelchair accessibility at the hospital entrance. The answer they expected: "There are wheelchair-accessible entrances at Schiepse Bos and Genk." The answer they received: a confused reimbursement-process explanation.

Why? The query contained the word "rolstoel" (wheelchair). The orthopedic chunk about wheelchair-prescription reimbursement lexically over-scored on "rolstoel" and outranked the parking/accessibility chunk — because pgvector cosine similarity (@pgvector_docs) on a 3072-dim embedding (@openai2024embeddings) sees "rolstoel" in both documents and gives them nearly identical scores. The top-K context then contained primarily regulatory content, and the LLM faithfully answered from that context.

This is cross-category contamination: a query with intent navigation_or_practical_info gets answered from regulatory content. The embedding model is doing its job; the problem is that similarity alone cannot distinguish "I want to park my wheelchair" from "I want to claim reimbursement for my wheelchair prescription."

Why retrieval-only solutions don't suffice

Three retrieval-side approaches were considered before adding categorical-affinity reranking. Each was rejected:

| Alternative | Why rejected |
| --- | --- |
| Add BM25 keyword search to widen the recall pool (@robertson2009bm25) | Already in the hybrid retriever: both branches surface the same regulatory chunk on rolstoel. Adding keyword recall makes the contamination worse, not better. |
| Reciprocal Rank Fusion across vector + BM25 (@cormack2009rrf) | RRF combines rankings without normalising scores; both branches independently rank the regulatory chunk high, so RRF preserves the contamination. |
| Cross-encoder reranker (BERT-style passage rerank, @nogueira2019passagererank) | Would help, but the cross-encoder is a per-query inference (~50 ms on Jina v2) on the entire candidate set. Voice channel adds 50 ms before the LLM call, and the cross-encoder still has no signal that "parking" and "reimbursement" are different categories of fact. |
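To make the RRF rejection concrete, here is a minimal sketch of Reciprocal Rank Fusion over two lanes. The chunk ids and both ranked lists are invented for illustration, and k = 60 is the constant conventionally used with RRF, not a value taken from this codebase.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of chunk ids with the RRF score sum(1 / (k + rank))."""
    fused = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            fused[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(fused, key=fused.get, reverse=True)


# Hypothetical lanes: both rank the regulatory chunk first on "rolstoel".
vector_lane = ["regulatory_wheelchair", "practical_entrance", "clinical_ortho"]
bm25_lane = ["regulatory_wheelchair", "clinical_ortho", "practical_entrance"]

fused_order = reciprocal_rank_fusion([vector_lane, bm25_lane])
print(fused_order[0])
```

Because both lanes agree on the contaminating chunk, fusion has nothing to correct: the regulatory chunk stays on top.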

The Value Framework is structurally a categorical-affinity multiplier applied to the existing similarity score. It is conceptually descended from RRF (rank-fusion lineage) but extends it with a per-intent × per-category multiplier matrix. This places it in Gao et al.'s "Modular RAG" taxonomy (@gao2024ragsurvey) — a specialised retrieval-augmentation module orchestrated alongside the existing retriever, not replacing it.

Six canonical content categories

backend/app/services/value_framework/category_classifier.py assigns one of six categories to any retrieved chunk using multi-language keyword matching (word-boundary regex, case-insensitive, nl/en/fr/it):

| Category | What it covers |
| --- | --- |
| practical | Parking, entrance, visiting hours, accessibility, cafeteria, wifi, lift, directions |
| clinical_info | Conditions, treatments, diagnoses, symptoms, medications, procedures |
| regulatory | Reimbursement, prescriptions, insurance, social fund, orthopedic devices, co-pay |
| appointments | Scheduling, consultation, office hours, cancel/reschedule, secretariat |
| legal_admin | Privacy, complaints, ombudsman, patient rights, accreditation, governance |
| general | Fallback: no strong keyword signal in the chunk |

The classifier is hospital-agnostic: keyword sets are linguistic (Dutch/French/Italian/English vocabulary), not specific to ZOL. A future tenant gets the same classifier without configuration. This satisfies the hospital-agnostic invariant of the multi-tenant architecture (@bezemer2010multitenant).
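A minimal sketch of word-boundary, case-insensitive keyword classification may help here. The keyword lists below are hypothetical stand-ins (the real sets live in category_classifier.py and cover all six categories), and picking the category with the most hits is an assumption about the scoring rule.

```python
import re

# Hypothetical keyword sets for three of the six categories.
CATEGORY_KEYWORDS = {
    "practical": ["parking", "parkeren", "ingang", "entrance", "entrée", "ingresso"],
    "regulatory": ["terugbetaling", "reimbursement", "remboursement", "rimborso", "voorschrift"],
    "appointments": ["afspraak", "appointment", "rendez-vous", "appuntamento"],
}

# One case-insensitive, word-boundary pattern per category.
CATEGORY_PATTERNS = {
    category: re.compile(r"\b(?:" + "|".join(map(re.escape, words)) + r")\b", re.IGNORECASE)
    for category, words in CATEGORY_KEYWORDS.items()
}


def classify_chunk_category(text):
    """Return the category with the most keyword hits, or "general" if none match."""
    hits = {cat: len(pattern.findall(text)) for cat, pattern in CATEGORY_PATTERNS.items()}
    best = max(hits, key=hits.get)  # ties resolve to the first category in dict order
    return best if hits[best] > 0 else "general"


print(classify_chunk_category("De rolstoeltoegankelijke ingang ligt naast de parking."))
```

Nothing in the patterns names a hospital, which is the property the hospital-agnostic invariant relies on.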

Default 7-intent × 6-category affinity matrix

backend/app/services/value_framework/affinity.py, DEFAULT_AFFINITY:

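| Intent | practical | clinical_info | regulatory | appointments | legal_admin | general |
| --- | --- | --- | --- | --- | --- | --- |
| navigation_or_practical_info | 1.30 | 0.65 | 0.55 | 1.05 | 0.85 | 1.00 |
| appointment_scheduling | 1.05 | 0.80 | 0.75 | 1.30 | 0.95 | 1.00 |
| medical_information | 0.75 | 1.25 | 1.05 | 0.95 | 0.85 | 1.00 |
| doctor_information | 0.90 | 1.10 | 0.85 | 1.20 | 0.85 | 1.00 |
| department_or_service | 1.10 | 1.10 | 0.85 | 1.20 | 0.90 | 1.00 |
| administrative_or_legal | 0.90 | 0.80 | 1.20 | 0.95 | 1.30 | 1.00 |
| billing_or_insurance | 0.85 | 0.85 | 1.30 | 0.95 | 1.10 | 1.00 |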

Multipliers > 1.0 boost; < 1.0 penalize. 1.0 is neutral (no change). Intents not in the map default to neutral across all categories — the framework never makes things worse than the unfiltered ranking.
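One way to picture the matrix and its neutral default is a nested dict with a two-step lookup. The values are those of the matrix above; the nested-dict shape, the NEUTRAL constant, and the affinity helper are assumptions for illustration, not the actual layout of affinity.py.

```python
NEUTRAL = 1.0
CATEGORIES = ("practical", "clinical_info", "regulatory", "appointments", "legal_admin", "general")

# One multiplier per category, in CATEGORIES order.
_ROWS = {
    "navigation_or_practical_info": (1.30, 0.65, 0.55, 1.05, 0.85, 1.00),
    "appointment_scheduling": (1.05, 0.80, 0.75, 1.30, 0.95, 1.00),
    "medical_information": (0.75, 1.25, 1.05, 0.95, 0.85, 1.00),
    "doctor_information": (0.90, 1.10, 0.85, 1.20, 0.85, 1.00),
    "department_or_service": (1.10, 1.10, 0.85, 1.20, 0.90, 1.00),
    "administrative_or_legal": (0.90, 0.80, 1.20, 0.95, 1.30, 1.00),
    "billing_or_insurance": (0.85, 0.85, 1.30, 0.95, 1.10, 1.00),
}

DEFAULT_AFFINITY = {intent: dict(zip(CATEGORIES, row)) for intent, row in _ROWS.items()}


def affinity(intent, category):
    """Multiplier for (intent, category); unknown intents or categories are neutral."""
    return DEFAULT_AFFINITY.get(intent, {}).get(category, NEUTRAL)


print(affinity("navigation_or_practical_info", "regulatory"))
print(affinity("small_talk", "regulatory"))  # hypothetical intent that is not in the map
```

Falling back to 1.0 at both levels is what keeps an unmapped intent from changing the ranking at all.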

The maximum boost is 1.30 and the maximum penalty is 0.55. The boundary values were chosen so that even on a worst-case ranking inversion (boosted chunk at similarity 0.50 vs penalised chunk at similarity 0.95), the boost can flip the order:

  • Boosted: 0.50 × 1.30 = 0.65
  • Penalised: 0.95 × 0.55 ≈ 0.52

The wheelchair example: intent navigation_or_practical_info, regulatory chunk at similarity 0.85. After rerank: 0.85 × 0.55 ≈ 0.47. Practical chunk at 0.65 becomes 0.65 × 1.30 ≈ 0.85. The practical chunk wins.
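The wheelchair arithmetic can be replayed with a small rerank sketch. The chunk ids, the similarity and reranked_score keys, and the rerank helper are hypothetical; only the multipliers and the two similarity values come from the example above.

```python
def rerank(chunks, multipliers):
    """Scale each chunk's similarity by its category multiplier and re-sort."""
    rescored = []
    for chunk in chunks:
        factor = multipliers.get(chunk["framework_category"], 1.0)
        # Clamp the product into [0, 1] before sorting.
        score = min(1.0, max(0.0, chunk["similarity"] * factor))
        rescored.append({**chunk, "reranked_score": score})
    return sorted(rescored, key=lambda c: c["reranked_score"], reverse=True)


# The navigation_or_practical_info row of the matrix.
navigation_row = {
    "practical": 1.30, "clinical_info": 0.65, "regulatory": 0.55,
    "appointments": 1.05, "legal_admin": 0.85, "general": 1.00,
}

chunks = [
    {"id": "wheelchair_reimbursement", "framework_category": "regulatory", "similarity": 0.85},
    {"id": "wheelchair_entrance", "framework_category": "practical", "similarity": 0.65},
]

for chunk in rerank(chunks, navigation_row):
    print(chunk["id"], round(chunk["reranked_score"], 4))
```

The entrance chunk is listed first even though its raw similarity was 0.20 lower than the reimbursement chunk's.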

Intent-to-category empirical justification

The seven intents come from IntentClassificationService (the hospital-domain intent taxonomy). The matrix entries were derived from the empirical question distribution in the 299-question text golden set, classified manually against the six categories. The wheelchair regression contributed the entry navigation_or_practical_info × regulatory = 0.55. The structural pattern, a categorical-affinity multiplier on top of vector retrieval, is adjacent to the knowledge-graph + vector hybrids surveyed in Sarmah et al. 2024.

The 7-intent × 6-category space is small enough to maintain manually. A future tenant whose ontology requires more categories (e.g., research-hospital with a clinical_trials category) can extend DEFAULT_AFFINITY without changing the application logic.

Formal definition of the affinity operator

Let C = {c₁, …, cₙ} be the candidate chunk set returned by hybrid retrieval, each chunk cᵢ carrying a base relevance score sᵢ ∈ [0,1] (the fused pgvector + BM25 similarity). Let κ(c) → K be the category classifier mapping a chunk to one of the six categories K, and let q ∈ I be the query's intent class drawn from the seven-intent taxonomy. The affinity matrix is a function

A : I × K → ℝ₊ with A(q, k) ∈ [0.55, 1.30]

The reranked score is the clamped product, and the operator re-sorts C by sᵢ′ descending:

sᵢ′ = clamp[0,1]( sᵢ · A(q, κ(cᵢ)) )

Three properties make this safe to run unconditionally on the request path:

  1. Identity on unknown intent. If q ∉ dom(A) the matrix degenerates to A(q, ·) = 1, so sᵢ′ = sᵢ and the ordering is unchanged. The framework can never rank worse than the unfiltered retriever — it is a Pareto-safe refinement.
  2. Order-preserving within a category. For two chunks of the same category, A is a constant positive scalar, so their relative order is never reversed; the only exception is that boosted chunks which both saturate at the clamp ceiling of 1 end up tied. Reranking only ever changes order across categories, which is exactly the contamination axis it targets.
  3. Non-idempotent on raw re-application. Whenever A ≠ 1, sᵢ″ = sᵢ · A² ≠ sᵢ′, so applying the operator twice would compound the multiplier incorrectly. The implementation guards this by caching the computed category on the chunk dict (framework_category) and treating an already-scored chunk as terminal; see test 5 in the regression suite.
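The third property is the one most easily broken in code, so a sketch of a re-application guard may be useful. The affinity_applied marker key and the apply_affinity_once helper are hypothetical; the real implementation may detect an already-scored chunk differently.

```python
def apply_affinity_once(chunk, factor):
    """Apply the multiplier in place, but only the first time a chunk is seen."""
    if chunk.get("affinity_applied"):  # hypothetical marker key
        return chunk  # already scored: treat as terminal
    chunk["similarity"] = min(1.0, max(0.0, chunk["similarity"] * factor))
    chunk["affinity_applied"] = True
    return chunk


chunk = {"framework_category": "regulatory", "similarity": 0.80}
apply_affinity_once(chunk, 0.55)
first_pass = chunk["similarity"]
apply_affinity_once(chunk, 0.55)  # second call is a no-op
print(first_pass, chunk["similarity"])
```

Without the marker, the second call would multiply by 0.55 again and push the chunk well below its intended score.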

Relationship to rank fusion

The operator is a generalisation of the score-multiplier family that Reciprocal Rank Fusion (@cormack2009rrf) belongs to. Where RRF fuses two ranked lists by a rank-reciprocal weight Σₗ 1/(k + rₗ(c)) with no semantic signal, the Value Framework fuses one ranked list with a categorical prior A(q, κ(c)). The two are composable: RRF first produces the fused sᵢ across the vector and BM25 lanes, then A applies the intent-conditioned categorical correction. This places the framework firmly in the "Modular RAG" orchestration layer (@gao2024ragsurvey) rather than the retrieval layer.

The boundary values [0.55, 1.30] are not arbitrary: they were chosen so that a worst-case ranking inversion (a maximally-penalised chunk at s=0.95 vs a maximally-boosted chunk at s=0.50) can still flip, since 0.50 · 1.30 = 0.65 > 0.95 · 0.55 ≈ 0.52. In general the flip needs a boost-to-penalty ratio above 0.95 / 0.50 = 1.9, and 1.30 / 0.55 ≈ 2.36 clears that with margin. A band whose ratio fell below 1.9 would leave such inversions unfixable; a much wider band would risk boosting weakly-relevant on-category chunks above strongly-relevant ones.

Per-turn rerank flow

The category is computed lazily and cached on the chunk dict (framework_category key), so apply_intent_category_affinity, primary_category, and record_category_mismatch each read the same classification without re-running the classifier.

Fix B — Primary-category prompt guard

After reranking, primary_category(chunks, top_k=5) identifies the dominant category among the top-5 chunks (by cumulative score, not count). The system prompt includes an instruction keyed on this category:

"The context is primarily about practical information (parking, accessibility, navigation). Answer from that perspective. Do not fuse with reimbursement or prescription content even if the query contains a word that appears in both."

This gives the answer LLM an explicit signal to stay in one lane, rather than inferring intent from the mixed retrieval set. The prompt-level guard pairs with the score-level rerank: rerank reduces the probability of mixed retrieval; the prompt handles the residual cases where mixed retrieval still occurs.

primary_category election algorithm

The election uses cumulative similarity score, not chunk count, to weight high-confidence retrieval:

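    from collections import defaultdict

    def primary_category(chunks, top_k=5):
        # Sum similarity per category over the top-k chunks (chunks are dicts).
        weights = defaultdict(float)
        for chunk in chunks[:top_k]:
            weights[chunk["framework_category"]] += chunk["similarity"]
        # The category with the largest cumulative similarity wins.
        return max(weights, key=weights.get)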

Counting chunks would let three weak regulatory chunks (similarities 0.45, 0.45, 0.45 = 1.35 total) outvote two strong practical chunks (similarities 0.85, 0.85 = 1.70 total) — the wrong answer. Weighted-by-score gives the right answer in this scenario.
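The scenario can be checked directly by running both election rules on the same five chunks. This is an illustrative sketch; the dict keys mirror the election routine, and the similarity values are the ones quoted in the paragraph.

```python
from collections import Counter, defaultdict

top_chunks = [
    {"framework_category": "practical", "similarity": 0.85},
    {"framework_category": "practical", "similarity": 0.85},
    {"framework_category": "regulatory", "similarity": 0.45},
    {"framework_category": "regulatory", "similarity": 0.45},
    {"framework_category": "regulatory", "similarity": 0.45},
]

# Election by chunk count: three regulatory chunks beat two practical ones.
by_count = Counter(c["framework_category"] for c in top_chunks).most_common(1)[0][0]

# Election by cumulative similarity: 1.70 for practical beats 1.35 for regulatory.
weights = defaultdict(float)
for c in top_chunks:
    weights[c["framework_category"]] += c["similarity"]
by_score = max(weights, key=weights.get)

print(by_count, by_score)
```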

Fix C — Unit-mismatch admission

backend/app/services/value_framework/unit_mismatch.py. When the query contains a per-minute or per-session unit (parking tariffs) but the top chunks discuss per-kWh or per-item pricing (EV charger costs), the framework emits a structured gap note into the context:

[UNIT MISMATCH] Query asks about per-minute parking rates.
Retrieved context covers per-kWh EV charger pricing.
If you cannot directly answer the unit asked about, say so explicitly.

This prevents the LLM from silently transposing units and producing a confidently wrong answer. The pattern catches the "data exists but at the wrong granularity" failure mode that pure similarity ranking can't detect.
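As a rough sketch of how such a detector could work, the code below extracts pricing units from the query and from the chunks and emits the gap note when they do not overlap. The unit patterns, function names, and sample sentences are all hypothetical; the actual logic in unit_mismatch.py may differ.

```python
import re

# Hypothetical unit vocabulary; a real detector likely covers more units and languages.
UNIT_PATTERNS = {
    "per-minute": re.compile(r"\bper\s+minu(?:te|ut)\b", re.IGNORECASE),
    "per-session": re.compile(r"\bper\s+(?:session|sessie)\b", re.IGNORECASE),
    "per-kWh": re.compile(r"\bper\s+kwh\b", re.IGNORECASE),
    "per-item": re.compile(r"\bper\s+(?:item|stuk)\b", re.IGNORECASE),
}


def find_units(text):
    """Return the set of pricing units mentioned in the text."""
    return {unit for unit, pattern in UNIT_PATTERNS.items() if pattern.search(text)}


def unit_mismatch_note(query, chunk_texts):
    """Return a gap note when none of the query's units appear in the chunks, else None."""
    asked = find_units(query)
    covered = set().union(*(find_units(text) for text in chunk_texts))
    if not asked or not covered or asked & covered:
        return None
    return (
        f"[UNIT MISMATCH] Query asks about {', '.join(sorted(asked))} rates.\n"
        f"Retrieved context covers {', '.join(sorted(covered))} pricing.\n"
        "If you cannot directly answer the unit asked about, say so explicitly."
    )


print(unit_mismatch_note("What does parking cost per minute?", ["Charging is billed per kWh."]))
```

The note is only produced when both sides mention a unit and the sets are disjoint, so ordinary queries without pricing units pass through untouched.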

Fix E — category_mismatch_rate telemetry

backend/app/services/value_framework/telemetry.py, record_category_mismatch.

Per-turn metric written to app.category_mismatch_telemetry:

| Column | Type | Description |
| --- | --- | --- |
| tenant_id | uuid | Isolates metrics per hospital tenant |
| conversation_id | uuid | Correlates turns in one call |
| message_id | uuid | Links to app.conversation_messages |
| intent_class | text | Intent that drove the rerank |
| primary_category | text | Dominant category of top-5 chunks |
| mismatch_rate | numeric | Fraction (0.0–1.0) of top-5 chunks below boost threshold |
| chunks_total | int | Top-K evaluated |
| chunks_off_category | int | Chunks with affinity < 1.0 |
| query_preview | text | First 200 chars of query (no PII; voice queries are already PII-redacted via voice_pii_redaction) |

Exposed at GET /api/v1/admin/ops/category-mismatch (backend/app/api/admin_ops.py:342) and charted on the /analytics/system Operations tab. A sustained spike (e.g., mismatch_rate > 0.6 for 50+ turns) indicates an emerging query class that the affinity table doesn't cover yet — the fix is adding a new intent row to DEFAULT_AFFINITY.
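The "sustained spike" condition can be sketched as a sliding-window check. The MismatchSpikeDetector class is hypothetical, and reading "sustained" as every one of the last 50 consecutive turns exceeding 0.6 is an assumption; an average-over-window rule would be an equally plausible reading.

```python
from collections import deque


class MismatchSpikeDetector:
    """Flags a sustained run of high category_mismatch_rate values."""

    def __init__(self, threshold=0.6, window=50):
        self.threshold = threshold
        self.recent = deque(maxlen=window)  # keeps only the last `window` turns

    def observe(self, mismatch_rate):
        """Record one turn; return True once the last `window` turns all exceed the threshold."""
        self.recent.append(mismatch_rate)
        return (
            len(self.recent) == self.recent.maxlen
            and all(rate > self.threshold for rate in self.recent)
        )


detector = MismatchSpikeDetector(threshold=0.6, window=3)  # small window for the demo
print([detector.observe(rate) for rate in (0.8, 0.7, 0.9, 0.2)])
```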

The function follows the R1 silent-failure discipline (CLAUDE.md §Silent-Failure Discipline): it logs the rate at INFO on every call ([ValueFramework] category_mismatch_rate=... intent=... primary=... off=.../...), so a scan of the logs reveals the metric per turn without needing to query the DB.
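A minimal sketch of computing the rate and emitting the INFO line in the format quoted above. It assumes a chunk counts as off-category when its multiplier for the current intent is below 1.0, and it leaves out the database insert; the function name, logger name, and signature are hypothetical.

```python
import logging

logger = logging.getLogger("value_framework")  # hypothetical logger name


def category_mismatch_rate(chunks, multipliers, intent, primary, top_k=5):
    """Return (rate, off_category, total) for the top-k chunks and log the result."""
    top = chunks[:top_k]
    # Assumption: off-category means the intent's multiplier for the chunk's category is < 1.0.
    off = sum(1 for c in top if multipliers.get(c["framework_category"], 1.0) < 1.0)
    rate = off / len(top) if top else 0.0
    logger.info(
        "[ValueFramework] category_mismatch_rate=%.2f intent=%s primary=%s off=%d/%d",
        rate, intent, primary, off, len(top),
    )
    return rate, off, len(top)


navigation_row = {"practical": 1.30, "regulatory": 0.55}
chunks = [{"framework_category": "regulatory"}] * 3 + [{"framework_category": "practical"}] * 2
print(category_mismatch_rate(chunks, navigation_row, "navigation_or_practical_info", "regulatory"))
```

Logging unconditionally, rather than only on failure, is what lets a log scan stand in for a database query.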

Latency budget

The full Value Framework pass adds modest latency to the RAG pipeline. * markers indicate stages whose timings have not yet been pinned to a histogram on the pilot.

| Stage | Local-dev p50 | Notes |
| --- | --- | --- |
| classify_chunk_category (per chunk) | < 0.1 ms* | Pure-Python regex word-boundary match |
| apply_intent_category_affinity (10 chunks) | < 1 ms* | One dict lookup + one float multiply per chunk |
| primary_category (top-5) | < 0.1 ms* | Sum + argmax over 5 entries |
| record_category_mismatch (DB write) | 5–20 ms* | One Postgres INSERT |

The total framework overhead is ~5–25 ms per turn, small relative to the 600–1,200 ms RAG inner loop. Per Beyer et al. 2016 §4 (SLOs at the tail), pilot measurement should pin a p95 budget; ~50 ms is the working assumption.
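Until the stages are pinned to a histogram, a small timing harness is one way to get first p50 and p95 numbers. The helper names are hypothetical, and local wall-clock timings are only indicative of pilot behaviour.

```python
import statistics
import time


def time_calls(fn, repeats=200):
    """Return per-call wall-clock durations in milliseconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return samples


def summarise(samples_ms):
    """Return the p50 and p95 of a list of latency samples (needs at least two samples)."""
    cut_points = statistics.quantiles(samples_ms, n=100)  # 99 percentile cut points
    return {"p50": statistics.median(samples_ms), "p95": cut_points[94]}


# Placeholder workload; swap in the stage under measurement.
print(summarise(time_calls(lambda: sum(range(1000)))))
```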

Wheelchair regression test

backend/tests/integration/services/test_value_framework_wheelchair_regression.py

8 tests that pin the end-to-end contract for the wheelchair conflation scenario:

  1. Regulatory chunk for wheelchair prescription scores BELOW practical chunk for wheelchair entrance after rerank with navigation_or_practical_info intent
  2. The category_mismatch_rate for a practical-intent query against a regulatory-dominant result set is > 0.5 (detectable)
  3. primary_category returns practical after affinity rerank on the wheelchair accessibility query
  4. classify_chunk_category correctly labels the parking/entrance chunk as practical and the reimbursement chunk as regulatory
  5. Calling apply_intent_category_affinity twice is idempotent on stable inputs: a second multiplication would compound incorrectly, so the function recognises already-classified chunks via the framework_category cache
  6. Unknown intent (not in affinity map) leaves scores unchanged
  7. Chunks without a readable score field pass through without error
  8. The unit-mismatch detector fires on parking-per-minute vs EV-per-kWh mismatch

This is the R2 regression-pin per CLAUDE.md §Silent-Failure Discipline — a test that lives WITH the fix, asserts the user-visible post-state, and would catch the regression on day 1 if it returned.
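For orientation, here is how two of the pins (tests 6 and 7) might look in pytest style. The rerank function is a minimal stand-in written for this sketch, not the real apply_intent_category_affinity, and the test names are hypothetical.

```python
def _score(chunk):
    """Return the chunk's similarity if it is numeric, else 0.0 for sorting purposes."""
    value = chunk.get("similarity")
    return value if isinstance(value, (int, float)) else 0.0


def rerank(chunks, multipliers):
    """Stand-in reranker: scales numeric scores, passes everything else through."""
    out = []
    for chunk in chunks:
        score = chunk.get("similarity")
        if isinstance(score, (int, float)):
            factor = multipliers.get(chunk.get("framework_category"), 1.0)
            chunk = {**chunk, "similarity": min(1.0, score * factor)}
        out.append(chunk)  # chunks without a numeric score are kept untouched
    return sorted(out, key=_score, reverse=True)


def test_unknown_intent_leaves_scores_unchanged():
    chunks = [{"framework_category": "regulatory", "similarity": 0.85}]
    assert rerank(chunks, {}) == chunks  # an empty row stands for an intent not in the map


def test_chunk_without_score_passes_through():
    chunks = [
        {"framework_category": "practical"},
        {"framework_category": "practical", "similarity": "n/a"},
    ]
    assert len(rerank(chunks, {"practical": 1.30})) == 2
```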

References

  • backend/app/services/value_framework/affinity.py: apply_intent_category_affinity, DEFAULT_AFFINITY, primary_category, category_match_ratio
  • backend/app/services/value_framework/category_classifier.py: classify_chunk_category, Category enum
  • backend/app/services/value_framework/telemetry.py: record_category_mismatch
  • backend/app/services/value_framework/unit_mismatch.py: unit-mismatch detection
  • backend/tests/integration/services/test_value_framework_wheelchair_regression.py: regression test suite
  • backend/alembic/versions/066_category_mismatch_telemetry.py: migration creating app.category_mismatch_telemetry
  • Cormack et al. 2009 — Reciprocal Rank Fusion, the rank-fusion lineage that the affinity multiplier extends
  • Robertson & Zaragoza 2009 — BM25 baseline, alternative considered and rejected for the contamination problem
  • Nogueira & Cho 2019 — passage reranker baseline, cross-encoder alternative considered and rejected on latency
  • Sarmah et al. 2024 — HybridRAG; structurally adjacent (knowledge-graph + vector hybrid). The Value Framework is the categorical-affinity sibling: a category-classifier output that augments the vector ranking
  • Gao et al. 2024 — Modular RAG taxonomy; the framework is a specialised retrieval-augmentation module
  • Bezemer & Zaidman 2010 — multi-tenant SaaS isolation; the hospital-agnostic claim of the keyword sets satisfies tenant-additive extensibility