
Release Notes: May 29, 2026

Doctor-Routing & Medical-Refusal Robustness · 3 Fixes · 2 Premise-Checks That Changed the Plan

~9 commits | 1 session | 1 pilot deploy (N*-afdeling) | 2 features still flag-off behind their eval gates | jellyfish added then reverted on evidence

This window came entirely from reading real pilot voice transcripts and tracing each "the agent should have answered that" moment back to its root cause. Three distinct defects surfaced, and two of them were not what the symptom suggested — which is the transferable lesson of this release. In both cases an empirical premise-check ran before any code was built, and in both cases the measurement contradicted the obvious fix:

  1. The "voice won't give the ibuprofen dose" complaint turned out to be, in part, a correct refusal (the corpus has only an adult dose, dangerous for the child in the query).
  2. The "add a phonetic library" plan was killed when measurement showed the phonetic algorithms recall none of the real mishears while plain character-ratio recalls all of them.

The headline themes:

  1. N*-afdeling ward-code suppression (DEPLOYED). The doctor-enrichment block rendered "Bij de afdeling N*-afdeling kan u terecht bij: Dr. Dirk Van Moorsel" on a diabetes query. Root cause was a display-selection bug — min(by length) preferred a cryptic ward-code entity over the real specialty name. Fixed with a specialty-token filter; shipped to pilot.
  2. F2 pediatric medical-refusal gate (fixed; feature still flag-off). The intent-aware refusal enrichment fell through to a bare refusal for pediatric dosing because its relevance gate only matched plural "voor kinderen", not the singular "voor een kind van 18 kg geldt 125 mg" that brochures actually use.
  3. Phonetic doctor-name recovery (new; flag-off behind a voice-eval gate). STT transcribes "Dr. Dupont" as "Depot"; the LIKE '%depot%' lookup finds nothing though Dr. Matthias Dupont exists. New char-ratio recall → constrained LLM disambiguation → confirm-before-route, multilingual by construction.

1 · N*-afdeling — the display heuristic that preferred a ward code

Commit: 2f398317 · Status: deployed to pilot (zol-rag-app:2f398317)

A caller asked "wie behandelt suikerziekte?" (who treats diabetes). The LLM answer was correct and well-cited — but a deterministic doctor-enrichment block appended:

Bij de afdeling N*-afdeling kan u terecht bij: Dr. Dirk Van Moorsel, …

The doctors were the right endocrinologists. Only the department label was garbage.

Root cause: min(filtered_depts, key=len)

The enrichment in taxonomy_mixin.py selected the department to display with min(filtered_depts, key=len), i.e. the shortest matched name. The ZOL taxonomy contains a real (and duplicated) DEPARTMENT entity literally named N*-afdeling (a nursing-ward code). The endocrinologists' WORKS_IN edges point to Endocrinologie (14 chars), but N*-afdeling (11 chars) also entered the candidate set and, being shorter, won the selection outright. A second amplifier: the matcher accepted N*-afdeling on the bare generic word "afdeling" alone (the N* token is too short to count).
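The selection failure is easy to reproduce in isolation. The snippet below is a minimal sketch: the two-element candidate list is an assumption standing in for whatever the real matcher in taxonomy_mixin.py returned for this query.

```python
# Hypothetical candidate set for the diabetes query; the real list is
# produced by the taxonomy matcher and may contain more entries.
filtered_depts = ["Endocrinologie", "N*-afdeling"]

# The selection described above: the shortest matched name wins.
display = min(filtered_depts, key=len)

print(len("Endocrinologie"), len("N*-afdeling"))  # 14 11
print(display)  # N*-afdeling
```

Length is a poor proxy for "most specific name" as soon as the candidate set contains codes or abbreviations, which are short by nature.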

A prior fix (2026-05-14) had added an intent gate to suppress enrichment for irrelevant queries — but doctor_lookup is a legitimate intent, so the gate passed and the display-name bug surfaced untouched.

Fix: a specialty-token filter (_has_specialty_token)

A department name must contain an alphabetic token of length ≥ 4 that is not a generic suffix (afdeling/dienst/eenheid/unit/afd). N*-afdeling has no such token, so it is dropped at the source (never matches, never contributes doctors, never displays); Endocrinologie passes. When no displayable department remains, enrichment is skipped rather than printing garbage.
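The rule above can be sketched as follows. This is an illustration, not the shipped code: the function names, the regex tokenizer, and the idea of keeping min-by-length after filtering are assumptions. Only the length threshold and the generic-suffix list come from the description above.

```python
import re
from typing import Optional

# Generic suffixes listed in the release note; they never count as a specialty.
GENERIC_TOKENS = {"afdeling", "dienst", "eenheid", "unit", "afd"}
MIN_TOKEN_LEN = 4


def has_specialty_token(name: str) -> bool:
    """True if the name has an alphabetic token of length >= 4 that is not generic."""
    # [^\W\d_]+ matches runs of letters only (digits, "*", "-" split tokens).
    tokens = re.findall(r"[^\W\d_]+", name.lower())
    return any(len(t) >= MIN_TOKEN_LEN and t not in GENERIC_TOKENS for t in tokens)


def pick_display_department(matched: list[str]) -> Optional[str]:
    """Hypothetical wrapper: filter first, then choose; None means skip enrichment."""
    displayable = [d for d in matched if has_specialty_token(d)]
    if not displayable:
        return None
    return min(displayable, key=len)


print(has_specialty_token("N*-afdeling"))     # False
print(has_specialty_token("Endocrinologie"))  # True
print(pick_display_department(["Endocrinologie", "N*-afdeling"]))  # Endocrinologie
print(pick_display_department(["N*-afdeling"]))                    # None
```

Returning None rather than a fallback label mirrors the "skip rather than print garbage" behavior: the caller can simply omit the enrichment block.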

Verified in the live data (fork A): N*-afdeling is a real duplicated entity, and Van Moorsel WORKS_IN Endocrinologie. The fix therefore changes only the displayed label; the doctors were always correct. The garbage N*-afdeling entity remains a separate taxonomy-hygiene cleanup, now harmless to output.


2 · F2 pediatric medical-refusal gate — singular dosing phrasing

Commit: 0dfc88f6 · Status: F2 feature remains flag-off (intent_aware_medical_refusal_enabled=false)

A caller asked "Mijn kind heeft koorts, hoeveel ibuprofen mag ik geven?" The classifier correctly routed it to out_of_scope_medical_advice. The F2 enrichment — designed to surface corpus-grounded brochure info alongside the refusal instead of a bare "I can't help" — exists and is well-tested, but was dormant. When activated, it still fell through to the legacy refusal for pediatric dosing.

Root cause: a plural-only relevance gate

F2's relevance gate (_retrieval_addresses_query) requires a topic-specific keyword to confirm the retrieved brochure is on-topic. For pediatric_medication the keyword set was plural-only — "voor kinderen", "bij kinderen", … — but brochure dosing is written in the singular: "Voor een kind van 18 kg geldt 125 mg ibuprofen per keer." That matched no keyword, so the gate rejected genuinely on-topic content and F2 fell back to the bare refusal.

Fix: pediatric term + dosing/weight signal

_is_pediatric_dosing() passes the gate when a hit contains a pediatric term (kind, baby, zuigeling, …) and a dosing/weight signal (mg/ml dose, kg, or dosering/dosis). Both are required, so the incidental "geen kinderen onder 12 jaar bij IZ-bezoek" (no dose) is still correctly rejected. The previously-failing F2 test now passes, plus three new gate-regression tests.
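A minimal sketch of the two-signal check, next to the old plural-only keyword test for contrast. The term lists and regexes are assumptions for illustration (the real lists are longer, as the "…" above indicates), and is_pediatric_dosing here is a stand-in for the real _is_pediatric_dosing().

```python
import re

# Old behaviour, as described above: plural-only keywords (illustrative subset).
PLURAL_KEYWORDS = ("voor kinderen", "bij kinderen")

# Assumed term lists; the production sets are not reproduced here.
PEDIATRIC_TERMS = re.compile(r"\b(kind|kinderen|baby|zuigeling)\b", re.IGNORECASE)
DOSING_SIGNAL = re.compile(
    r"\b\d+(?:[.,]\d+)?\s?(?:mg|ml|kg)\b|\b(?:dosering|dosis)\b",
    re.IGNORECASE,
)


def old_gate(text: str) -> bool:
    """Plural-only keyword match."""
    lowered = text.lower()
    return any(k in lowered for k in PLURAL_KEYWORDS)


def is_pediatric_dosing(text: str) -> bool:
    """Require BOTH a pediatric term and a dosing/weight signal."""
    return bool(PEDIATRIC_TERMS.search(text)) and bool(DOSING_SIGNAL.search(text))


brochure = "Voor een kind van 18 kg geldt 125 mg ibuprofen per keer."
visiting = "Geen kinderen onder 12 jaar bij IZ-bezoek."

print(old_gate(brochure))             # False
print(is_pediatric_dosing(brochure))  # True
print(is_pediatric_dosing(visiting))  # False
```

Requiring both signals is what keeps the visiting-rules sentence out: it mentions children but carries no dose, weight, or dosing word.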

The bigger finding: a corpus content gap, not just a code bug

Tracing the live corpus revealed the ibuprofen dosing it contains is adult/general (Ibuprofen 400mg, < 60 kg: 400mg 3×/dag — a light-adult prescription threshold), not pediatric. Surfacing that for a child query would be a dangerous adult-dose-for-child mismatch. So the voice refusal, for that query, was arguably protecting the caller — and the safe path is a pediatric dosing brochure (a clinical/content decision), not a code change that synthesizes a dose. F2 remains gated; activation is benchmark- and safety-probe-dependent.


3 · Phonetic doctor-name recovery — and the dependency the evidence killed

Commits: fa7f4f76 (recall) · 4aeec139 (flag) · 1760e7c9 (lookup hook) · df9397ac (LLM disambiguation) · Status: flag-off (phonetic_doctor_recovery_enabled=false)

A caller asked for "dokter Depot / Depo" — STT's rendering of Dr. Dupont. The lookup (find_doctor_by_name, a SQL LIKE '%depot%' over app.doctors) found nothing, though Dr. Matthias Dupont is in the directory. Chat works because chat users type the name; voice is at the mercy of STT, whose errors are phonetic.

The premise-check that killed jellyfish

The plan's first task was to add the jellyfish phonetic library — but it began with an empirical premise-check (per the project rule "plan targets are hypotheses, not specs"). The measurement:

STT heard   Metaphone   vs Dupont (TPNT)   char-ratio
depot       TPT         ✗ no match         0.727
depo        TP          ✗ no match         0.600
dupon       TPN         ✗ no match         0.909

Metaphone, Soundex, and NYSIIS all failed to group the real mishears. The reason is subtle: STT didn't substitute a similar sound — it dropped a phoneme ("Dupont" → "Depot" loses the /n/), which orthographic char-ratio tolerates but phonetic encoders (which encode the /n/) do not. Plain character-ratio (difflib, already in the repo) recalled all three at floor 0.55. jellyfish was added (40c97928) and reverted (adf0ee58) on this evidence — a needless dependency avoided.
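The character-ratio column can be reproduced with the standard library alone. The sketch below assumes the ratio is difflib.SequenceMatcher.ratio() on lowercased strings; the helper names and the two-name directory are hypothetical, while the 0.55 floor is the one quoted above.

```python
from difflib import SequenceMatcher

RECALL_FLOOR = 0.55  # floor quoted in the release note


def char_ratio(heard: str, name: str) -> float:
    """Case-insensitive character similarity in [0, 1]."""
    return SequenceMatcher(None, heard.lower(), name.lower()).ratio()


def recall_candidates(
    heard: str, surnames: list[str], floor: float = RECALL_FLOOR
) -> list[tuple[str, float]]:
    """Return (surname, score) pairs at or above the floor, best first."""
    scored = [(name, char_ratio(heard, name)) for name in surnames]
    hits = [(name, score) for name, score in scored if score >= floor]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Hypothetical mini-directory built from surnames mentioned in this note.
directory = ["Dupont", "Van Moorsel"]

for heard in ("depot", "depo", "dupon"):
    print(heard, round(char_ratio(heard, "Dupont"), 3))
# depot 0.727
# depo 0.6
# dupon 0.909

print([name for name, _ in recall_candidates("depot", directory)])  # ['Dupont']
```

The floor is deliberately permissive: recall only has to put the right name in the candidate list, and the later stages are responsible for precision.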

Architecture: recall (cheap, deterministic) → precision (LLM) → confirm

Why this shape:

  • Multilingual by construction. Phonetic algorithms are language-specific (no good Romanian option exists); the LLM does the per-language phonetic reasoning, so adding a tenant language (e.g. clinicajosesilva.ro) needs zero new code. char-ratio recall is language-agnostic.
  • Anti-hallucination. select_recovered_doctor() returns a name only if it is in the candidate list (or None) — a hallucinated doctor can never be routed to.
  • Confirm before routing. Wrong-doctor routing is worse than "not found", so the agent always confirms before acting. Recovered candidates carry a recovered=True marker (and type=doctor, so downstream formatting renders them). The recovery branch is wrapped in a fail-safe that degrades to today's not-found behavior on any error.
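The anti-hallucination and fail-safe properties above can be sketched as below. Only the name select_recovered_doctor and the recovered/type markers come from this note; the signatures, the recover_doctor wrapper, and the injected recall/disambiguate callables are assumptions made so the example runs without an LLM.

```python
from typing import Callable, Optional


def select_recovered_doctor(candidates: list[str], llm_choice: Optional[str]) -> Optional[str]:
    """Return the LLM's pick only if it is literally one of the candidates."""
    if llm_choice is None:
        return None
    wanted = llm_choice.strip().casefold()
    for name in candidates:
        if name.casefold() == wanted:
            return name  # always the directory's own spelling
    return None  # anything outside the candidate list is discarded


def recover_doctor(
    heard: str,
    recall: Callable[[str], list[str]],
    disambiguate: Callable[[str, list[str]], Optional[str]],
) -> Optional[dict]:
    """Hypothetical recovery branch: any failure degrades to not-found (None)."""
    try:
        candidates = recall(heard)
        if not candidates:
            return None
        chosen = select_recovered_doctor(candidates, disambiguate(heard, candidates))
        if chosen is None:
            return None
        # The marker tells the agent to ask for confirmation before routing.
        return {"name": chosen, "type": "doctor", "recovered": True}
    except Exception:
        return None


# Stubs standing in for char-ratio recall and the constrained LLM call.
recall = lambda heard: ["Dupont"]
good_llm = lambda heard, cands: "dupont"
bad_llm = lambda heard, cands: "Dubois"  # not in the candidate list

print(recover_doctor("depot", recall, good_llm))
# {'name': 'Dupont', 'type': 'doctor', 'recovered': True}
print(recover_doctor("depot", recall, bad_llm))  # None
```

Because the returned name is taken from the candidate list rather than from the model's text, a misspelled or invented answer can only ever produce a not-found, never a routing.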

Status & gate

The feature is flag-off (phonetic_doctor_recovery_enabled=false). Activation is gated on a pilot voice-eval over a real mishear set with a hard kill criterion: zero wrong-doctor routings that survive the confirmation step. Spec + plan: docs/superpowers/specs/2026-05-29-phonetic-doctor-matching-design.md, docs/superpowers/plans/2026-05-29-phonetic-doctor-matching.md.


The transferable lesson

Two of three fixes had a symptom that pointed at the wrong cause, and in both the discipline that saved the work was measuring before building:

  • "Voice won't give the dose" → the dose in the corpus is adult; refusal was partly protective.
  • "Add a phonetic library" → phonetic algorithms recall none of the real mishears; char-ratio (zero-dep) recalls all.

Each premise-check cost minutes. The first prevented a safety regression (adult dose to a child); the second prevented a needless dependency. The same rule that retired pydantic-ai and OpenRouter retired jellyfish here: a plausible fix is a hypothesis until the data agrees.