Voice Citation Pipeline

The problem: voice has no inline markers

The chat channel's citation extractor assumes the LLM's answer contains inline [1], [2], [3] markers that map to the retrieved chunks. The extractor pattern-matches these markers, deduplicates the referenced sources, and writes them to conversation_messages.citations.
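A minimal sketch of what such a marker extractor might look like, for illustration only. The function name `extract_marker_citations`, the 1-based marker indexing, and the `source_url`/`title` chunk keys are assumptions here, not the actual `rag_service.py` code.

```python
import re

# Matches inline markers such as [1], [2], [12].
MARKER_RE = re.compile(r"\[(\d+)\]")


def extract_marker_citations(answer: str, chunks: list[dict]) -> list[dict]:
    """Map inline [N] markers to retrieved chunks, deduplicated in first-seen order.

    Assumes markers are 1-based indexes into `chunks` (hypothetical convention).
    """
    seen: set[int] = set()
    citations: list[dict] = []
    for match in MARKER_RE.finditer(answer):
        index = int(match.group(1))
        # Skip repeated markers and markers that point past the retrieved chunks.
        if index in seen or not 1 <= index <= len(chunks):
            continue
        seen.add(index)
        chunk = chunks[index - 1]
        citations.append(
            {"source": chunk.get("source_url", ""), "title": chunk.get("title", "")}
        )
    return citations
```

Note that an answer without any markers yields an empty list, which is a perfectly valid return value for this function on its own.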

The voice system prompt explicitly forbids inline citation markers because they cannot be spoken. A text-to-speech system reading "De cardiologie is bereikbaar via ingang B [1]." ("Cardiology is reachable via entrance B [1].") produces audible gibberish: "…via ingang B bracket one bracket." The prompt therefore instructs the LLM never to include [N] markers in the answer.

This created a silent cascade failure affecting every voice turn:

voice answer contains no [N] markers
→ marker extractor returns []
→ dedup step sees empty list, returns []
→ cache write stores [] for the cache key
→ conversation_messages.citations = NULL
→ v2 diagnostic: dimensional_scores = {} (JSON schema validation fails)
→ v2 diagnostic silently falls back to v1 rendering

Nothing in this cascade raised an exception. Every function behaved correctly for its stated inputs. The failure was architectural: the chat path's assumption (markers exist) propagated into the voice path without a guard.

This is a textbook R2 silent-failure branch (CLAUDE.md §Silent-Failure Discipline): a code path that "fails quietly" with an empty fallback. The fix required a regression test that asserts citations != [] on a voice turn, not just "no exception."
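The shape of such a regression test could look roughly like the sketch below. It is not the project's actual test: `assert_voice_turn_has_citations`, `marker_only`, and `with_voice_fallback` are hypothetical stand-ins, and the example URL is invented. The point is that the assertion targets the citation list itself rather than the absence of an exception.

```python
import re
from typing import Callable

# A pipeline takes (channel, answer, retrieved_chunks) and returns citations.
Pipeline = Callable[[str, str, list[dict]], list[dict]]


def assert_voice_turn_has_citations(run_turn: Pipeline) -> None:
    """Fail if a marker-free voice answer ends up with no citations."""
    chunks = [{"source_url": "https://example.org/cardiologie", "title": "Cardiologie"}]
    answer = "De cardiologie is bereikbaar via ingang B."  # no [N] markers, as on voice
    citations = run_turn("voice", answer, chunks)
    assert citations, "voice turn produced no citations"


def marker_only(channel: str, answer: str, chunks: list[dict]) -> list[dict]:
    """Stand-in for the pre-fix behavior: citations come only from [N] markers."""
    indexes = [int(n) for n in re.findall(r"\[(\d+)\]", answer)]
    return [chunks[i - 1] for i in indexes if 1 <= i <= len(chunks)]


def with_voice_fallback(channel: str, answer: str, chunks: list[dict]) -> list[dict]:
    """Stand-in for the post-fix behavior: voice falls back to the retrieved chunks."""
    citations = marker_only(channel, answer, chunks)
    if channel == "voice" and not citations:
        citations = [
            {"source": c.get("source_url", ""), "title": c.get("title", "")}
            for c in chunks[:5]
        ]
    return citations
```

Run against `marker_only`, the check raises `AssertionError` even though the pipeline itself never raised; run against `with_voice_fallback`, it passes.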

The cascade in detail

Three functions each assumed markers were present:

| Function | Location | Assumption | Effect when no markers |
| --- | --- | --- | --- |
| _qs_extract_citations | rag_service.py | Parses [N] from answer text | Returns [] |
| _qs_deduplicate_citations | rag_service.py | Deduplicates by marker reference | Returns [] on empty input |
| _qs_write_citation_cache | rag_service.py | Writes (answer_hash, citations) to app.semantic_query_cache | Writes citations=[]; future cache hits on the same query return citations=[] |

The v2 diagnostic's downstream failure (dimensional_scores={}) was not the root cause — it was the symptom that surfaced the citation pipeline bug in the first place. The diagnostic's LLM call was returning a valid JSON object with per-dimension scores; the cache read was overriding it with the stale citations=[] from the broken voice turns that had already poisoned the cache.

The fix — three commits

d130df74 — voice fallback in _qs_finalize

When channel == "voice" and the marker extractor returns [], _qs_finalize now derives citations directly from the retrieved chunks that were passed to the LLM:

# Voice path: no inline markers in the answer — derive citations from
# the retrieved chunks that were used to build the context.
if request.channel == "voice" and not citations:
    citations = [
        {"source": chunk.get("source_url", ""), "title": chunk.get("title", "")}
        for chunk in retrieved_chunks[:5]
        if chunk.get("source_url") or chunk.get("title")
    ]
    logger.info("voice_citation_fallback", count=len(citations))

3cd5cc2f — skip dedup when no markers

_qs_deduplicate_citations now checks whether the input was marker-derived or chunk-derived. Chunk-derived citations have no duplicate marker structure, so the dedup step is a no-op and is skipped cleanly.
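How the real function tells the two cases apart is not shown here. One possible shape, assuming a hypothetical `origin` argument and a hypothetical `marker` key on marker-derived citations, is sketched below.

```python
def deduplicate_citations(citations: list[dict], origin: str) -> list[dict]:
    """Deduplicate marker-derived citations; pass chunk-derived ones through.

    `origin` is a hypothetical flag: "markers" or "chunks".
    """
    if origin == "chunks":
        # Chunk-derived citations carry no marker references, so there is
        # nothing to deduplicate on. Return them unchanged.
        return citations

    seen: set[int] = set()
    unique: list[dict] = []
    for citation in citations:
        marker = citation["marker"]  # hypothetical key holding the [N] index
        if marker in seen:
            continue
        seen.add(marker)
        unique.append(citation)
    return unique
```

Making the origin explicit keeps the skip visible at the call site instead of relying on the dedup step happening to tolerate input without markers.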

11a51ab2 — cleanup and R1 logging

Added logger.info("voice_citations_written", count=len(citations)) immediately before the cache write so the log stream surfaces the citation count per voice turn. A count of 0 in the logs is now a visible signal, not a silent empty write.

Cache flush discipline

After any change to the citation pipeline, flush app.semantic_query_cache.

The cache stores (query_hash → {answer, citations, dimensional_scores}). Stale cache entries from before the fix contain citations=[]; they will serve incorrect results on cache hits until they expire or are flushed.

-- Flush the entire semantic query cache (dev / staging)
DELETE FROM app.semantic_query_cache;

-- Or flush only voice-channel entries (safer on a live pilot)
DELETE FROM app.semantic_query_cache
WHERE metadata->>'channel' = 'voice';

On the pilot server, run this via the backend's admin endpoint (if exposed) or directly via psql. Do NOT rely on TTL-based expiry for a citation-pipeline change — the TTL can be several hours, and the stale entries will poison all traffic during that window.
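To illustrate why the flush matters, the toy model below mimics the cache with an in-memory dictionary. `ToySemanticQueryCache` and its methods are invented for this sketch and say nothing about the real table layout; it only shows that an entry written before the fix keeps being served until it is removed.

```python
from typing import Optional


class ToySemanticQueryCache:
    """In-memory stand-in for app.semantic_query_cache (illustration only)."""

    def __init__(self) -> None:
        self._entries: dict[str, dict] = {}

    def put(self, query_hash: str, channel: str, citations: list[dict]) -> None:
        self._entries[query_hash] = {"channel": channel, "citations": citations}

    def get(self, query_hash: str) -> Optional[dict]:
        return self._entries.get(query_hash)

    def flush(self, channel: Optional[str] = None) -> int:
        """Delete all entries, or only one channel's; return the number removed."""
        doomed = [
            key
            for key, entry in self._entries.items()
            if channel is None or entry["channel"] == channel
        ]
        for key in doomed:
            del self._entries[key]
        return len(doomed)


cache = ToySemanticQueryCache()
cache.put("q1", "voice", [])  # written before the fix: poisoned entry
cache.put("q2", "chat", [{"source": "https://example.org/a", "title": "A"}])

stale = cache.get("q1")  # deploying the fix alone does not change this hit
removed = cache.flush(channel="voice")  # targeted flush, chat entries survive
```

After the targeted flush, the next voice query misses the cache and is recomputed through the fixed pipeline.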

v2 diagnostic — downstream consequence

The v2 diagnostic endpoint (POST /api/v1/query?response_format=v2) renders per-dimension scores (relevance, safety, citation quality, etc.) in individual cards. It calls a separate LLM completion with response_format={"type":"json_object"} enforced (commit c1cfa026). Before the citation fix, this LLM call was succeeding, but the stale citations=[] from the cache was making the response schema validation fail, causing the diagnostic to silently fall back to v1 rendering with dimensional_scores={}.

After the fix:

  1. Voice turns write chunk-derived citations (non-empty)
  2. Cache stores non-empty citations
  3. v2 diagnostic reads non-empty citations
  4. Schema validation passes
  5. Per-dimension cards render correctly
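The validation step can be pictured with the sketch below. It assumes, based only on the behavior described above, that the v2 schema rejects empty citations and empty dimensional_scores; the function name `select_render_mode` and the log message are hypothetical. Unlike the original behavior, this version logs a warning when it falls back, so the downgrade to v1 is visible.

```python
import logging

logger = logging.getLogger("v2_diagnostic")


def select_render_mode(payload: dict) -> str:
    """Return "v2" if the payload can drive per-dimension cards, else "v1"."""
    problems: list[str] = []

    if not payload.get("citations"):
        problems.append("citations is empty")

    scores = payload.get("dimensional_scores")
    if not isinstance(scores, dict) or not scores:
        problems.append("dimensional_scores is empty")

    if problems:
        # Make the fallback loud instead of silent.
        logger.warning("v2_diagnostic_fallback_to_v1: %s", "; ".join(problems))
        return "v1"
    return "v2"
```

For example, a payload with valid scores but `citations=[]` selects "v1" here, which matches the pre-fix symptom, while the same payload with chunk-derived citations selects "v2".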

References

  • Memory file: feedback-voice-citation-pipeline.md — full post-mortem with conversation IDs
  • Commits: d130df74 (fallback), 3cd5cc2f (skip dedup), 11a51ab2 (R1 log + cleanup)
  • Commit c1cfa026: response_format={"type":"json_object"} enforcement on v2 diagnostic call
  • CLAUDE.md §Silent-Failure Discipline (R1/R2/R3) — the framework that motivated the regression test requirement