Voice Citation Pipeline
The problem: voice has no inline markers
The chat channel's citation extractor assumes the LLM's answer contains inline [1], [2], [3] markers that map to the retrieved chunks. The extractor pattern-matches these markers, deduplicates the referenced sources, and writes them to conversation_messages.citations.
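The marker-based extraction described above can be sketched roughly as follows. This is a minimal illustration, not the code in rag_service.py: the function name `extract_citations`, the regex, and the chunk fields `source_url` and `title` are assumptions chosen to match the snippets later in this document.

```python
import re

# Matches inline markers such as [1] or [12].
MARKER_RE = re.compile(r"\[(\d+)\]")


def extract_citations(answer: str, chunks: list[dict]) -> list[dict]:
    """Return the chunks referenced by [N] markers, first-seen order, no duplicates."""
    seen: set[int] = set()
    citations: list[dict] = []
    for match in MARKER_RE.finditer(answer):
        index = int(match.group(1))
        # Markers are 1-based; skip repeats and numbers with no matching chunk.
        if index in seen or not 1 <= index <= len(chunks):
            continue
        seen.add(index)
        chunk = chunks[index - 1]
        citations.append(
            {"source": chunk.get("source_url", ""), "title": chunk.get("title", "")}
        )
    return citations
```

Note that an answer with no markers simply yields an empty list; nothing is raised, which is the behaviour the rest of this section hinges on.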
The voice system prompt forbids inline citation markers because they cannot be spoken. A text-to-speech system reading "De cardiologie is bereikbaar via ingang B [1]." ("Cardiology can be reached via entrance B [1].") produces audible gibberish: "…via ingang B bracket one bracket." The prompt therefore instructs the LLM never to include [N] markers in the answer.
This created a silent cascade failure affecting every voice turn:
voice answer contains no [N] markers
→ marker extractor returns []
→ dedup step sees empty list, returns []
→ cache write stores [] for the cache key
→ conversation_messages.citations = NULL
→ v2 diagnostic: dimensional_scores = {} (JSON schema validation fails)
→ v2 diagnostic silently falls back to v1 rendering
Nothing in this cascade raised an exception. Every function behaved correctly for its stated inputs. The failure was architectural: the chat path's assumption (markers exist) propagated into the voice path without a guard.
This is a textbook R2 silent-failure branch (CLAUDE.md §Silent-Failure Discipline): a code path that "fails quietly" with an empty fallback. The fix required a regression test that asserts citations != [] on a voice turn, not just "no exception."
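A regression test of that kind might look like the sketch below. `finalize_citations` is a hypothetical stand-in for the citation step of _qs_finalize (its real signature is not shown here), and the example URL is invented for the test; only the assertion style is the point.

```python
def finalize_citations(channel, marker_citations, retrieved_chunks):
    """Hypothetical stand-in for the citation step of _qs_finalize."""
    citations = list(marker_citations)
    if channel == "voice" and not citations:
        # Fall back to the chunks that were used to build the LLM context.
        citations = [
            {"source": c.get("source_url", ""), "title": c.get("title", "")}
            for c in retrieved_chunks[:5]
            if c.get("source_url") or c.get("title")
        ]
    return citations


def test_voice_turn_has_citations():
    chunks = [{"source_url": "https://example.org/cardiologie", "title": "Cardiologie"}]
    citations = finalize_citations("voice", [], chunks)
    # An empty list is a failure here, even though nothing raised.
    assert citations != []
    assert citations[0]["title"] == "Cardiologie"
```

Asserting on the content rather than on the absence of an exception is what turns a silent empty fallback into a failing test.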
The cascade in detail
Three functions each assumed markers were present:
| Function | Location | Assumption | Effect when no markers |
|---|---|---|---|
| _qs_extract_citations | rag_service.py | Parses [N] from answer text | Returns [] |
| _qs_deduplicate_citations | rag_service.py | Deduplicates by marker reference | Returns [] on empty input |
| _qs_write_citation_cache | rag_service.py | Writes (answer_hash, citations) to app.semantic_query_cache | Writes citations=[]; future cache hits on the same query return citations=[] |
The v2 diagnostic's downstream failure (dimensional_scores={}) was not the root cause — it was the symptom that surfaced the citation pipeline bug in the first place. The diagnostic's LLM call was returning a valid JSON object with per-dimension scores; the cache read was overriding it with the stale citations=[] from the broken voice turns that had already poisoned the cache.
The fix — three commits
d130df74 — voice fallback in _qs_finalize
When channel == "voice" and the marker extractor returns [], _qs_finalize now derives citations directly from the retrieved chunks that were passed to the LLM:
    # Voice path: no inline markers in the answer, so derive citations from
    # the retrieved chunks that were used to build the context.
    if request.channel == "voice" and not citations:
        citations = [
            {"source": chunk.get("source_url", ""), "title": chunk.get("title", "")}
            for chunk in retrieved_chunks[:5]
            if chunk.get("source_url") or chunk.get("title")
        ]
        logger.info("voice_citation_fallback", count=len(citations))
3cd5cc2f — skip dedup when no markers
_qs_deduplicate_citations now checks whether the input was marker-derived or chunk-derived. Chunk-derived citations have no duplicate marker structure, so the dedup step is a no-op and is skipped cleanly.
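One way to express that skip is sketched below. How the real function tells marker-derived input from chunk-derived input is not shown in this document; the `marker_derived` flag and the (source, title) dedup key are assumptions made for illustration.

```python
def deduplicate_citations(citations: list[dict], marker_derived: bool) -> list[dict]:
    """Collapse repeated citations, but only for marker-derived input."""
    if not marker_derived:
        # Chunk-derived citations carry no marker structure to collapse,
        # so they pass through unchanged.
        return citations
    seen: set[tuple[str, str]] = set()
    unique: list[dict] = []
    for citation in citations:
        key = (citation.get("source", ""), citation.get("title", ""))
        if key not in seen:
            seen.add(key)
            unique.append(citation)
    return unique
```

Passing the origin explicitly keeps the decision visible at the call site instead of inferring it from the shape of the list.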
11a51ab2 — cleanup and R1 logging
Added logger.info("voice_citations_written", count=len(citations)) immediately before the cache write so the log stream surfaces the citation count per voice turn. A count of 0 in the logs is now a visible signal, not a silent empty write.
Cache flush discipline
After any change to the citation pipeline, flush app.semantic_query_cache.
The cache stores (query_hash → {answer, citations, dimensional_scores}). Stale cache entries from before the fix contain citations=[]; they will serve incorrect results on cache hits until they expire or are flushed.
    -- Flush the entire semantic query cache (dev / staging)
    DELETE FROM app.semantic_query_cache;

    -- Or flush only voice-channel entries (safer on a live pilot)
    DELETE FROM app.semantic_query_cache
    WHERE metadata->>'channel' = 'voice';
On the pilot server, run this via the backend's admin endpoint (if exposed) or directly via psql. Do NOT rely on TTL-based expiry for a citation-pipeline change — the TTL can be several hours, and the stale entries will poison all traffic during that window.
v2 diagnostic — downstream consequence
The v2 diagnostic endpoint (POST /api/v1/query?response_format=v2) renders per-dimension scores (relevance, safety, citation quality, etc.) in individual cards. It calls a separate LLM completion with response_format={"type":"json_object"} enforced (commit c1cfa026). Before the citation fix, this LLM call was succeeding, but the stale citations=[] from the cache was making the response schema validation fail, causing the diagnostic to silently fall back to v1 rendering with dimensional_scores={}.
After the fix:
- Voice turns write chunk-derived citations (non-empty)
- Cache stores non-empty citations
- v2 diagnostic reads non-empty citations
- Schema validation passes
- Per-dimension cards render correctly
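The dependency between citations and v2 rendering can be illustrated with a small sketch. It assumes, based on the behaviour described above, that the v2 response schema rejects an empty citations list; the function names, the payload shape, and the warning log are hypothetical and use the standard library logger rather than the project's own.

```python
import logging

logger = logging.getLogger("diagnostic")


def is_valid_v2(payload: dict) -> bool:
    """Assumed v2 rule: citations and dimensional_scores must both be non-empty."""
    citations = payload.get("citations")
    scores = payload.get("dimensional_scores")
    return (
        isinstance(citations, list) and len(citations) > 0
        and isinstance(scores, dict) and len(scores) > 0
    )


def choose_rendering(payload: dict) -> str:
    """Return "v2" when the payload validates, otherwise fall back to "v1" loudly."""
    if is_valid_v2(payload):
        return "v2"
    # Log the fallback so it is a visible signal rather than a silent downgrade.
    logger.warning(
        "v2 validation failed, falling back to v1 (citations=%d)",
        len(payload.get("citations") or []),
    )
    return "v1"
```

With valid scores but a stale `citations=[]`, this sketch selects v1, which mirrors the symptom that first exposed the bug.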
References
- Memory file: feedback-voice-citation-pipeline.md (full post-mortem with conversation IDs)
- Commits: d130df74 (fallback), 3cd5cc2f (skip dedup), 11a51ab2 (R1 log + cleanup)
- Commit c1cfa026: response_format={"type":"json_object"} enforcement on the v2 diagnostic call
- CLAUDE.md §Silent-Failure Discipline (R1/R2/R3): the framework that motivated the regression test requirement