SOTA Positioning Matrix (May 2026)

A competitive analysis of the ZOL voice stack against twenty vendors across eight comparison axes, written at engineer-buyer register. Every cell is either backed by a public source (vendor doc URL, public benchmark, or our own measured number with a source-of-truth pointer) or marked explicitly as not measured / not publicly documented. Inferred numbers are forbidden: if a buyer asks "where did you get this number?" we point at a URL.

This document is a snapshot. Vendors ship; the matrix decays. The methodology section at the end describes the refresh cadence and the cite-or-blank discipline.

1. Executive summary

The voice-AI market in May 2026 is split across five tiers. Voice-AI specialists like Retell (retellai.com), Vapi (vapi.ai), Synthflow, Bland, and Cognigy Voice Gateway (cognigy.com/products/voice-gateway) compete on conversational fluidity, barge-in tuning, and developer ergonomics. Hyperscalers like the OpenAI Realtime API (platform.openai.com/docs/guides/realtime), Deepgram Voice Agent (deepgram.com/product/voice-agent-api), Google Dialogflow CX, and Microsoft Voice Bot compete on infrastructure scale and enterprise compliance certifications. Healthcare-specific vendors like Suki, DeepScribe, Abridge, and Hyro compete on a different axis — most are clinician-facing scribes, not patient-facing voice search; the closest analogue is Hyro. Contact-center incumbents like Genesys Cloud CX, NICE CXone, Five9, and Talkdesk compete on enterprise integrations, marketplace ecosystems, and call-routing maturity. Open-source baselines like LiveKit Agents (raw, github.com/livekit/agents), Pipecat (github.com/pipecat-ai/pipecat), and Vocode are reference implementations that ship the runtime but not the cognition.

The ZOL voice stack competes with all five tiers on different axes. Against the voice-AI specialists we compete on domain depth and provenance — our retrieval pipeline, citation discipline, and multi-language safety architecture are not features they offer out of the box. Against the hyperscalers we compete on honesty and observability — our per-turn telemetry, citation-grounded answers, and documented LLM-as-judge bias controls (Zheng et al. 2023) describe a system that knows when it is wrong; their managed services do not surface that signal at the same granularity. Against the healthcare-specific vendors we compete on scope clarity — our system is not a clinical scribe, not clinical decision support, and the architecture is shaped by that negative scope (see thin voice architecture). Against the contact-center incumbents we compete on engineering rigor and time-to-onboard — our multi-tenant overlay system (architecture/multi-tenancy) admits a new hospital with zero source-code changes; their integrations are powerful but heavyweight. Against the open-source baselines we compete on the layer above the runtime — they ship LiveKit-equivalent plumbing, we ship the seven-layer stack documented in the Voice Stack Compendium.

Three honest gaps shape this snapshot. Infrastructure reliability — managed hyperscaler stacks have one less moving part than a self-hosted Twilio + LiveKit deployment, and at our scale (≤25 K queries/month, single-region pilot) we have not yet measured uptime against a managed alternative. Conversational fluidity — Retell has shipped fine-grained barge-in tuning (docs.retellai.com) that we have not yet matched at the per-turn level. Marketplace integrations — Salesforce, HubSpot, and Microsoft Dynamics connectors that contact-center incumbents ship out of the box are not in our codebase. Section 3 describes each gap in detail; Section 4 commits each to a time-boxed roadmap line.

The single-sentence answer to "why us" is this: we are the only stack on this list that combines patient-facing voice with citation-grounded retrieval, multi-tenant overlay onboarding, GDPR Art. 35 DPIA + AI Act Art. 50 limited-risk classification on file, and an empirically measured 99.0 % pass rate against a 302-question regulated-domain benchmark (thesis Chapter 4, Table 4.1). Every other vendor in the matrix is missing at least one of those four. The roadmap in Section 4 closes the three honest gaps without giving up the four differentiators.

2. Per-axis comparison matrix

The matrix uses one row per vendor, grouped by tier, with a final ZOL row. Each cell is one of:

A specific number, language list, or feature claim with an inline URL or a [file:line] source-of-truth pointer
Not publicly documented — vendor material does not state this, and inference is forbidden
Not yet measured — we have not benchmarked this against the vendor; pilot Phase 5 will backfill some of these
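The three cell states above can be modelled as a small tagged type, so that a blank can never be mistaken for a number. A minimal Python sketch; the names `CellStatus` and `Cell` are hypothetical and are not taken from the ZOL codebase:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class CellStatus(Enum):
    SOURCED = "sourced"  # value backed by a URL or [file:line] pointer
    NOT_DOCUMENTED = "not publicly documented"
    NOT_MEASURED = "not yet measured"


@dataclass(frozen=True)
class Cell:
    status: CellStatus
    value: Optional[str] = None
    source: Optional[str] = None

    def __post_init__(self) -> None:
        # Cite-or-blank: a value without a source is rejected, never inferred.
        if self.status is CellStatus.SOURCED:
            if not self.value or not self.source:
                raise ValueError("sourced cell needs both a value and a source")
        elif self.value is not None or self.source is not None:
            raise ValueError("blank cell must not carry a value or source")

    def render(self) -> str:
        if self.status is CellStatus.SOURCED:
            return f"{self.value} ({self.source})"
        return self.status.value.capitalize()


vapi_latency = Cell(CellStatus.SOURCED, "500–700 ms voice-to-voice", "docs.vapi.ai/quickstart")
retell_latency = Cell(CellStatus.NOT_DOCUMENTED)
```

Constructing `Cell(CellStatus.SOURCED, "42 ms")` without a source raises `ValueError`, which is the cite-or-blank rule expressed as a type constraint.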

The five tiers are framed at the head of axis 1; subsequent axes reuse the same vendor groupings.

2.1 Latency

Tier A — Voice-AI specialists: vendors that build and sell a turnkey voice-agent platform. Their value proposition is "give us your prompt, we run the call." They compete on conversational fluidity and developer ergonomics.

Tier B — Hyperscalers: speech-and-LLM providers (OpenAI, Deepgram, Google, Microsoft) that expose voice-agent or realtime-API surfaces. Their value proposition is "infrastructure scale and compliance posture."

Tier C — Healthcare-specific: vendors selling into healthcare. Three of the four (Suki, DeepScribe, Abridge) are clinician-facing scribes and therefore not direct competitors on the patient-search axis; Hyro is the only patient-facing analogue.

Tier D — Contact-center incumbents: enterprise CCaaS platforms with voice-bot extensions. Their value proposition is integrations and routing maturity.

Tier E — Open-source baselines: frameworks (LiveKit Agents raw, Pipecat, Vocode) that ship the runtime without cognition or domain logic.

Vendor	Tier	TTFT (time to first audio)	End-to-end turn latency
Retell AI	A	Not publicly documented	Not publicly documented (vendor publishes a latency troubleshooting page but not a target SLO)
Vapi	A	Not publicly documented	500–700 ms voice-to-voice (docs.vapi.ai/quickstart)
Synthflow	A	Not publicly documented	Not publicly documented
Bland AI	A	Not publicly documented	Not publicly documented
Cognigy Voice Gateway	A	Not publicly documented	Not publicly documented (vendor cites "99.7 % intent recognition" and "25K+ concurrent conversations" but no end-to-end latency target — cognigy.com/products/voice-gateway)
OpenAI Realtime API	B	Not publicly documented at a target SLO	Not publicly documented at a target SLO
Deepgram Voice Agent	B	Not publicly documented	Not publicly documented (vendor markets "real-time responsiveness" without published p50/p95)
Google Dialogflow CX	B	Not publicly documented	Not publicly documented
Microsoft Voice Bot / Azure AI Speech	B	Not publicly documented	Not publicly documented
Suki AI	C	N/A — clinician scribe, no caller-facing turn loop	N/A
DeepScribe	C	N/A — clinician scribe	N/A
Abridge	C	N/A — clinician scribe	N/A
Hyro	C	Not publicly documented	Not publicly documented
Genesys Cloud CX	D	Not publicly documented	Not publicly documented
NICE CXone	D	Not publicly documented	Not publicly documented
Five9	D	Not publicly documented	Not publicly documented
Talkdesk	D	Not publicly documented	Not publicly documented
LiveKit Agents (raw)	E	N/A — framework, depends on plugin choices (github.com/livekit/agents)	N/A
Pipecat	E	N/A — framework	N/A
Vocode	E	N/A — framework	N/A
ZOL Voice Stack	—	Not yet measured at p95 on pilot; local-dev p50 of ElevenLabs first-audio is 200–400 ms (voice/architecture)	Not yet measured at p95 on pilot; local-dev stage budget targets ~5.5 s end-to-end on the chat channel (performance/overview). Voice-channel p95 is on the Phase-5 measurement list.

Latency targets follow the Beyer et al. 2016 SRE practice of writing SLOs at the tail (p95, p99) rather than the mean. The underlying UX thresholds are from Nielsen 1993: 0.1 s for instantaneous feedback, 1 s for seamless flow, 10 s as the upper attention bound. Reading the table: of the 20 competitors, only one (Vapi) publishes a numeric latency figure. The other 19 either do not document a target, sell a clinician scribe with no caller-facing turn loop, or sell a framework that pushes the latency question down to the integrator. This is informative on its own: competitive latency is mostly a marketing claim, not a published number.
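To make the tail-versus-mean distinction concrete, the sketch below computes a nearest-rank percentile and classifies the result against the Nielsen thresholds. The sample latencies are hypothetical, not measured data, and the band labels are our own shorthand:

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], q: float) -> float:
    """Nearest-rank percentile: smallest sample with at least q% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < q <= 100:
        raise ValueError("q must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(q / 100 * len(ordered))
    return ordered[rank - 1]


def nielsen_band(seconds: float) -> str:
    """Classify a latency against the Nielsen 1993 thresholds (0.1 s, 1 s, 10 s)."""
    if seconds <= 0.1:
        return "instantaneous"
    if seconds <= 1.0:
        return "seamless flow"
    if seconds <= 10.0:
        return "attention held"
    return "attention lost"


# Hypothetical turn latencies in seconds; not measured data.
turn_latencies = [0.6, 0.7, 0.8, 0.9, 1.1, 1.2, 1.4, 2.0, 3.5, 6.0]
mean = sum(turn_latencies) / len(turn_latencies)
p50 = percentile(turn_latencies, 50)
p95 = percentile(turn_latencies, 95)
```

With a skewed sample like this one, the mean and the median both sit well below the p95, which is why an SLO written at the mean would hide the slow turns callers actually notice.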

2.2 Multilingual

Vendor	Tier	Languages	Mid-call switching policy
Retell AI	A	At least nl, en, es, fr, de, hi, ru, pt, ja, it (docs.retellai.com/agent/multilingual); vendor notes per-voice subsets	Not publicly documented at a single policy level
Vapi	A	Not publicly documented at a complete list	Not publicly documented
Synthflow	A	Not publicly documented	Not publicly documented
Bland AI	A	Not publicly documented	Not publicly documented
Cognigy Voice Gateway	A	"100+ languages" with built-in machine translation (cognigy.com/products/voice-gateway)	Not publicly documented
OpenAI Realtime API	B	Not publicly documented at a complete list	Not publicly documented
Deepgram Voice Agent	B	Not publicly documented at the agent level (Nova-3 STT supports multiple languages — deepgram.com)	Not publicly documented
Google Dialogflow CX	B	Not publicly documented at this granularity	Not publicly documented
Microsoft Voice Bot / Azure AI Speech	B	Not publicly documented at this granularity	Not publicly documented
Hyro	C	Not publicly documented	Not publicly documented
Suki / DeepScribe / Abridge	C	N/A — scribe, not voice agent	N/A
Genesys / NICE / Five9 / Talkdesk	D	Not publicly documented at this granularity	Not publicly documented
LiveKit Agents (raw)	E	Depends on STT/TTS plugin choice (livekit_agents_docs)	Depends on integrator
Pipecat	E	Depends on integrator	Depends on integrator
Vocode	E	Depends on integrator	Depends on integrator
ZOL Voice Stack	—	nl, en, fr, it — production-validated; Dutch (Flemish) is primary, see voice/language-locking	Locked at first STT-confirmed utterance for the duration of the call (ADR-0052). Mid-call switching is explicitly traded away to preserve Flemish acoustic accuracy after two empirical pilot regressions documented in the ADR.

The salient observation: most vendors' language lists are not published at agent-product granularity. Retell publishes the longest verifiable list; Cognigy claims the most ("100+") via machine translation. Our four are fewer in count but each is production-tuned with safety regex packs (see §2.3). The locked-at-first-utterance policy is a deliberate trade-off, not a limitation — multi-language Deepgram measurably degrades Flemish accuracy, per the empirical evidence in ADR-0052.
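The locked-at-first-utterance policy can be sketched as a small state holder. This is an illustration only: the class name `LanguageLock`, the confidence threshold, and the reading of "STT-confirmed" as "confidence above a threshold" are assumptions, not details taken from ADR-0052.

```python
from typing import Optional

SUPPORTED = ("nl", "en", "fr", "it")


class LanguageLock:
    """Locks the call language at the first confident STT utterance (hypothetical sketch)."""

    def __init__(self, min_confidence: float = 0.8) -> None:
        self.min_confidence = min_confidence  # illustrative threshold, not from ADR-0052
        self.locked: Optional[str] = None

    def observe(self, detected_lang: str, confidence: float) -> Optional[str]:
        """Feed one STT result; returns the locked language, or None if still unlocked."""
        if (
            self.locked is None
            and detected_lang in SUPPORTED
            and confidence >= self.min_confidence
        ):
            self.locked = detected_lang
        return self.locked


lock = LanguageLock()
lock.observe("nl", 0.55)  # too uncertain: stays unlocked
lock.observe("nl", 0.93)  # locks to Dutch
lock.observe("fr", 0.99)  # ignored: the language no longer changes mid-call
```

Once locked, later detections are ignored, so the STT stage can stay in a single-language configuration for the rest of the call.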

2.3 Domain depth (regulated-healthcare voice)

Vendor	Tier	Out-of-box healthcare safety	Medical-advice refusal	STT-mishearing awareness	Voice-channel safety architecture
Retell AI	A	Not publicly documented as a healthcare-specific feature	Not publicly documented	Not publicly documented	Not publicly documented
Vapi	A	Not publicly documented	Not publicly documented	Not publicly documented	Not publicly documented
Synthflow	A	Not publicly documented	Not publicly documented	Not publicly documented	Not publicly documented
Bland AI	A	Not publicly documented	Not publicly documented	Not publicly documented	Not publicly documented
Cognigy Voice Gateway	A	Healthcare listed as an industry vertical; specific features not documented (cognigy.com/products/voice-gateway)	Not publicly documented	Not publicly documented	Not publicly documented
OpenAI Realtime API	B	Not publicly documented	OpenAI safety policies apply at model level; agent-product surface not documented	Not publicly documented	Not publicly documented
Deepgram Voice Agent	B	Not publicly documented	Not publicly documented	Not publicly documented	Not publicly documented
Google Dialogflow CX	B	Not publicly documented	Not publicly documented	Not publicly documented	Not publicly documented
Microsoft Voice Bot / Azure AI Speech	B	Not publicly documented	Not publicly documented	Not publicly documented	Not publicly documented
Suki AI	C	Clinician scribe; HIPAA / SOC 2 (suki.ai)	Out of scope — not patient-facing	N/A — clinician audio context, different problem	N/A
DeepScribe	C	HIPAA, SOC 2 (deepscribe.ai)	Out of scope — clinician scribe	N/A	N/A
Abridge	C	Enterprise healthcare claim (abridge.com); specific compliance certifications not on the homepage	Out of scope — clinician scribe	N/A	N/A
Hyro	C	Healthcare-specific patient-facing assistant; specific safety architecture not publicly documented at engineering depth	Not publicly documented	Not publicly documented	Not publicly documented
Genesys / NICE / Five9 / Talkdesk	D	Healthcare verticals; specific safety architecture not publicly documented at engineering depth	Not publicly documented	Not publicly documented	Not publicly documented
LiveKit Agents / Pipecat / Vocode	E	None — frameworks, no domain	None — depends on integrator	None — depends on integrator	None
ZOL Voice Stack	—	Multi-language safety regex packs (nl/en/fr/it) — see voice/triple-defense and adversarial hardening. Pattern set covers diagnostic, prescription, and dosage queries with a 100 % pass rate on the 14-question safety-refusal cohort and 12-question adversarial-GCG cohort (thesis Chapter 4, Table 4.1, citing Zou et al. 2023 for the GCG benchmark methodology).	Three-stage post-LLM disclaimer — automatic medical-content detection on the answer text + disclaimer prepender (voice/triple-defense). Disclaimer wording in nl/en/fr/it.	STT-mishearing aware — the language-locking ADR and the "Hoe wordt migraine behandeld?" / "Behandel ik migraine?" phoneme-pair example documented in the Voice Stack Compendium §1.	Regex pre-filter → agentic LLM → regex post-filter → answer-shaper → disclaimer prepender (voice/architecture). Architecture is the only voice path; legacy 8-stage `VoiceOrchestrator` was deleted in commit `158d793` (ADR-0049, ADR-0051). Base tool set: `search_hospital_kb`, `transfer_to_helpdesk`, `end_call`; extended to six tools in June 2026 with `find_consulting_doctors`, `list_department_doctors`, `get_doctor_schedule` (flag-gated, default on).

The salient observation: the four cells in our row are populated at engineering depth; the same four cells for every other vendor are "not publicly documented," "out of scope," or a vertical or certification claim without architecture detail. This is the axis on which the architecture differentiates most cleanly. The clinician-scribe vendors (Suki, DeepScribe, Abridge) are not on this axis at all: they sell into a different problem (clinician documentation). Hyro, the closest patient-facing analogue, does not document safety architecture at the engineering depth our compendium does. See Inan et al. 2023 for the Llama Guard lineage of LLM-output safety classifiers; our regex post-filter is a deterministic complement to that line of work.
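The five-stage voice path named in the ZOL row (regex pre-filter, agentic LLM, regex post-filter, answer-shaper, disclaimer prepender) can be sketched as a single function. The patterns, the refusal and disclaimer wording, and the truncation-based shaper below are hypothetical stand-ins; the real regex packs and shaper are not reproduced here, and the LLM is stubbed as a plain callable.

```python
import re
from typing import Callable

# Illustrative patterns only; the real multi-language regex packs are not shown.
ADVICE_REQUEST = re.compile(r"\b(should i take|what dose|diagnose)\b", re.IGNORECASE)
MEDICAL_CONTENT = re.compile(r"\b(mg|dosage|medication|treatment)\b", re.IGNORECASE)

REFUSAL = "I cannot give medical advice. Shall I transfer you to the helpdesk?"
DISCLAIMER = "This is general information, not medical advice. "


def answer_turn(utterance: str, llm: Callable[[str], str], max_chars: int = 300) -> str:
    # Stage 1: regex pre-filter refuses before any LLM call is made.
    if ADVICE_REQUEST.search(utterance):
        return REFUSAL
    # Stage 2: agentic LLM (stubbed as a plain callable here).
    draft = llm(utterance)
    # Stage 3: regex post-filter detects medical content in the answer text.
    is_medical = bool(MEDICAL_CONTENT.search(draft))
    # Stage 4: answer-shaper, reduced here to a length cap for spoken output.
    shaped = draft[:max_chars].strip()
    # Stage 5: disclaimer prepender.
    return DISCLAIMER + shaped if is_medical else shaped
```

The point of the deterministic stages is that a refusal or a disclaimer does not depend on the LLM behaving well: both regex checks run regardless of what the model returns.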

2.4 Customization (multi-tenant onboarding)

Vendor	Tier	Per-tenant overlay	Intent affinity tuning	FAQ override	Day-1 onboarding effort
Retell AI	A	Per-account agent configuration (docs.retellai.com)	Not publicly documented at a tunable matrix level	Not publicly documented as a separate feature	Not publicly documented at minute-level
Vapi	A	Per-account agent configuration (docs.vapi.ai)	Not publicly documented	Not publicly documented	Not publicly documented
Synthflow / Bland	A	Not publicly documented at this granularity	Not publicly documented	Not publicly documented	Not publicly documented
Cognigy Voice Gateway	A	Per-account flow configuration	Not publicly documented at a tunable matrix level	Not publicly documented as a separate feature	Not publicly documented
OpenAI Realtime API	B	Per-API-key + system prompt; not multi-tenant out of box	None (LLM-only)	None (LLM-only)	Not applicable — building-block API
Deepgram Voice Agent	B	Per-account configuration	Not publicly documented	Not publicly documented	Not publicly documented
Google Dialogflow CX / Microsoft Voice Bot	B	Per-project / per-resource isolation	Not publicly documented at a tunable matrix level	Per-intent override (Dialogflow)	Not publicly documented at minute-level
Hyro	C	Per-customer deployment	Not publicly documented at engineering depth	Not publicly documented at engineering depth	Not publicly documented
Suki / DeepScribe / Abridge	C	Per-clinic deployment; different problem	N/A	N/A	N/A
Genesys / NICE / Five9 / Talkdesk	D	Per-tenant routing + flow isolation	Not publicly documented at a tunable matrix level	Per-flow / per-skill	Not publicly documented at minute-level
LiveKit Agents / Pipecat / Vocode	E	Depends on integrator	Depends on integrator	Depends on integrator	Depends on integrator
ZOL Voice Stack	—	Two-plane configuration: DB-driven for web/RAG (`site_crawl_configs`, `golden_pages`, `PromptContext`) + YAML overlay for voice (`tenant_overlays/_yaml/<slug>.yaml`); see architecture/multi-tenancy. Tenant identity comes from the Keycloak JWT claim — cryptographically bound, not header-resolved. Architectural lineage: Bezemer & Zaidman 2010 shared-schema multi-tenant SaaS taxonomy.	7-intent × 6-category affinity matrix (voice/value-framework). Per-intent multipliers tune retrieval categorical fit without changing prompts.	DB-driven FAQ renderers (voice/tenant-overlay-system) — per-tenant FAQ entries, STT phonetic-recovery overrides, and pre-filter classifier overrides land in version control without source changes.	Zero source-code change to onboard a new tenant is the architectural target. Empirical day-1 onboarding effort has not yet been measured against a competitor (pilot Phase 5 commitment, see §4).

Reading the table: most vendors have some form of per-tenant configuration, but none publish an intent-to-category affinity matrix as a tunable surface. Our Value Framework is unusual in exposing this lever (voice/value-framework); it is a direct consequence of the wheelchair cross-category contamination regression documented on the same page.
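A sketch of how per-intent multipliers can tune categorical fit without touching prompts. The intent names, category names, and multiplier values below are hypothetical; the real matrix is 7 × 6 and its contents are not reproduced here.

```python
from typing import Dict, List, Tuple

# Hypothetical 2-intent x 3-category excerpt of an affinity matrix.
AFFINITY: Dict[str, Dict[str, float]] = {
    "find_doctor": {"doctors": 1.5, "departments": 1.2, "accessibility": 0.6},
    "visit_logistics": {"doctors": 0.7, "departments": 1.0, "accessibility": 1.4},
}


def rerank(intent: str, hits: List[Tuple[str, str, float]]) -> List[Tuple[str, float]]:
    """Scale each (chunk_id, category, score) hit by the intent's category multiplier."""
    row = AFFINITY.get(intent, {})
    scaled = [
        (chunk_id, score * row.get(category, 1.0))
        for chunk_id, category, score in hits
    ]
    return sorted(scaled, key=lambda pair: pair[1], reverse=True)


hits = [
    ("chunk-wheelchair", "accessibility", 0.80),
    ("chunk-cardiologist", "doctors", 0.70),
]
ranked = rerank("find_doctor", hits)
```

In this example the accessibility chunk has the higher raw retrieval score, but for a doctor-lookup intent the multiplier demotes it below the doctors chunk, which is the kind of cross-category contamination the matrix is meant to suppress.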

2.5 Observability

Vendor	Tier	Per-turn telemetry	Diagnostic accuracy metric	Operator feedback loop	Cost dashboard
Retell AI	A	Vendor exposes a call-history view; per-turn detail not publicly documented at engineering depth	Not publicly documented	Not publicly documented	Per-account billing dashboard
Vapi	A	Vendor exposes call logs; per-turn detail not publicly documented	Not publicly documented	Not publicly documented	Per-account billing
Synthflow / Bland	A	Not publicly documented at engineering depth	Not publicly documented	Not publicly documented	Not publicly documented
Cognigy Voice Gateway	A	"99.7 % intent recognition" claimed at the platform level (cognigy.com); per-turn detail not publicly documented	Not publicly documented	Not publicly documented at engineering depth	Per-account billing
OpenAI Realtime API	B	OpenAI usage dashboard; per-turn telemetry not publicly documented at engineering depth	Not publicly documented	Not publicly documented	OpenAI billing dashboard
Deepgram Voice Agent	B	Per-account dashboard; per-turn detail not publicly documented at engineering depth	Not publicly documented	Not publicly documented	Per-account billing
Google Dialogflow CX / Microsoft Voice Bot	B	Per-project Cloud Monitoring / Azure Monitor dashboards; agent-specific quality metrics not publicly documented	Not publicly documented	Not publicly documented	Per-project billing
Hyro	C	Not publicly documented at engineering depth	Not publicly documented	Not publicly documented	Not publicly documented
Suki / DeepScribe / Abridge	C	Different problem domain; observability geared to clinician QA, not patient-call telemetry	Not applicable	Clinician feedback loops (vendor-specific)	Not applicable
Genesys / NICE / Five9 / Talkdesk	D	Mature CCaaS reporting (call recording, IVR analytics)	Not publicly documented at agent-quality level	Not publicly documented at engineering depth	Per-account billing
LiveKit Agents / Pipecat / Vocode	E	Frameworks emit events; integrator builds the dashboard	Depends on integrator	Depends on integrator	Depends on integrator
ZOL Voice Stack	—	`pipeline_telemetry` Postgres table — per-stage latency, retrieval cardinality, intent class, primary content category, category-mismatch indicator. Per-turn writes are unconditional. See Voice Stack Compendium §4.	Diagnostic V2 endpoint (`POST /api/v1/query?response_format=v2`) — per-dimension scoring (correctness, safety, memory, tool_use, latency) by `VoiceTurnEvaluator` (schema-validated via the `structured_call` helper). LLM-as-judge bias controls follow Zheng et al. 2023.	Operations dashboard — per-tenant trend charts (Category Mismatch Trend, Diagnostic Accuracy Trend) on the Costs tab; described in feedback-dashboard-metrics.	Costs page — dollarised per-LLM-call breakdown, see performance/overview.

This is the second axis where our row is populated and most competitor cells are not. The driver is engineering choice, not technical difficulty: vendors could expose per-turn telemetry, and most presumably collect it internally; what they do not publish is the engineering shape of those metrics. The diagnostic-accuracy metric in particular ("did the answer match the caller's intent on the documented 5-dimension rubric?") is, to our reading of public material, unique on this list.
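One possible shape for a per-turn score on the five documented dimensions is sketched below. The `TurnScore` type, the [0, 1] scale, and the per-dimension averaging are assumptions for illustration; they are not the `VoiceTurnEvaluator` schema.

```python
from dataclasses import dataclass
from typing import Dict, Iterable

DIMENSIONS = ("correctness", "safety", "memory", "tool_use", "latency")


@dataclass(frozen=True)
class TurnScore:
    """Per-dimension scores for one voice turn, each in [0, 1] (hypothetical scale)."""

    scores: Dict[str, float]

    def __post_init__(self) -> None:
        missing = set(DIMENSIONS) - set(self.scores)
        extra = set(self.scores) - set(DIMENSIONS)
        if missing or extra:
            raise ValueError(
                f"dimension mismatch: missing={sorted(missing)} extra={sorted(extra)}"
            )
        if any(not 0.0 <= v <= 1.0 for v in self.scores.values()):
            raise ValueError("scores must lie in [0, 1]")


def dimension_means(turns: Iterable[TurnScore]) -> Dict[str, float]:
    """Average each dimension separately so a weak axis is not hidden by a strong one."""
    turns = list(turns)
    if not turns:
        raise ValueError("no turns")
    return {d: sum(t.scores[d] for t in turns) / len(turns) for d in DIMENSIONS}
```

Reporting the dimensions separately, rather than as one blended number, keeps a safety regression visible even when correctness and latency look healthy.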

2.6 Compliance

Vendor	Tier	GDPR DPIA on file	AI Act classification on file	PII redaction	Data residency
Retell AI	A	Not publicly documented	Not publicly documented	Add-on at $0.01/min (retellai.com/pricing)	Not publicly documented
Vapi	A	Not publicly documented	Not publicly documented	Not publicly documented	Not publicly documented
Synthflow / Bland	A	Not publicly documented	Not publicly documented	Not publicly documented	Not publicly documented
Cognigy Voice Gateway	A	GDPR compliance claimed at platform level (cognigy.com); DPIA per-deployment artifact not publicly documented	Not publicly documented	Not publicly documented	Not publicly documented
OpenAI Realtime API	B	OpenAI publishes a DPA framework; per-deployment DPIA is the customer's	Not publicly documented at this product level	Not publicly documented	EU data residency available on Enterprise plans (verify per-tier)
Deepgram Voice Agent	B	Vendor publishes a DPA framework	Not publicly documented at this product level	Not publicly documented	Not publicly documented at agent-product level
Google Dialogflow CX	B	GCP DPA framework	Not publicly documented at this product level	DLP API can be wired in (separate product)	Multi-region available
Microsoft Voice Bot / Azure AI Speech	B	Azure DPA framework	Not publicly documented at this product level	Azure-side PII tooling (separate)	Multi-region available
Suki AI	C	HIPAA, SOC 2 (suki.ai)	Not publicly documented	HIPAA-compliant by design	Not publicly documented
DeepScribe	C	HIPAA, SOC 2 (deepscribe.ai)	Not publicly documented	HIPAA-compliant by design	Not publicly documented
Abridge	C	"Enterprise-grade" claim (abridge.com); specific certifications not on the public marketing page	Not publicly documented	Not publicly documented at marketing-page depth	Not publicly documented
Hyro	C	Not publicly documented at engineering depth	Not publicly documented	Not publicly documented	Not publicly documented
Genesys / NICE / Five9 / Talkdesk	D	Mature DPA frameworks; specific DPIA artifacts are customer-side	Not publicly documented at this product level	Vendor-specific PII tooling	Multi-region available
LiveKit Agents / Pipecat / Vocode	E	None — depends on integrator	None — depends on integrator	None — depends on integrator	Self-hosted (integrator's choice)
ZOL Voice Stack	—	GDPR Art. 35 DPIA on file — see safety/dpia. Lawful basis, processing scope, residual-risk register, and Article-by-Article mapping are documented. References GDPR directly.	AI Act Art. 50 limited-risk classification on file — see safety/ai-act-compliance. High-risk Annex III analysis included with explicit scope-limit warning. References AI Act and MDR directly. Trustworthy-AI principles from HLEG 2019 cited.	Multi-language voice PII redaction (`voice_pii_redaction.py`) at telemetry write time. Architecture in safety/pii-protection.	Self-hosted; data does not leave the pilot server except to subprocessors-of-record (OpenAI, Deepgram, ElevenLabs) under GDPR Art. 28 — see Voice Stack Compendium §1.

This axis is populated at engineering depth in our row and at marketing depth in most competitor rows, not because the competitors are non-compliant, but because compliance artifacts are typically per-deployment work product and not published as part of vendor marketing. The honest read: on the four cells of this axis we have published the engineering-depth artifacts; competitors have published the marketing claim. A buyer evaluating this axis should ask each vendor to produce their equivalent of our DPIA and AI Act memo. ISO 27001:2022 (iso27001_2022) and ISO 27018:2019 (iso27018_2019) provide the cross-vendor reference for what those artifacts should contain.
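Redaction "at telemetry write time" means the transcript is cleaned before the row is stored, so raw PII never reaches the table. A minimal sketch, assuming two illustrative patterns and hypothetical function names; the actual rules in `voice_pii_redaction.py` are not reproduced here.

```python
import re

# Illustrative patterns only; real multi-language rules would be broader.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\+?\d[\d\s./-]{7,}\d")


def redact(text: str) -> str:
    """Replace e-mail addresses and phone-like digit runs with placeholder tokens."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)


def write_telemetry(row: dict, sink: list) -> None:
    """Redact free-text fields before the row is appended, so raw PII is never stored."""
    cleaned = {k: redact(v) if isinstance(v, str) else v for k, v in row.items()}
    sink.append(cleaned)


sink: list = []
write_telemetry(
    {"transcript": "Call me on +32 470 12 34 56 or jan@example.com", "latency_ms": 812},
    sink,
)
```

Placing the redaction inside the write path, rather than in a later batch job, removes the window in which unredacted transcripts would sit in storage.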

2.7 Cost

Vendor	Tier	Headline price	Notes
Retell AI	A	$0.07–$0.31 / min voice (retellai.com/pricing)	Components: voice infra $0.055/min + TTS $0.015/min + LLM ($0.003–$0.08/min) + add-ons (knowledge base $0.005/min, denoising $0.005/min, PII removal $0.01/min)
Vapi	A	Not publicly documented at headline level	Vendor publishes a pricing page; verifiable at vapi.ai
Synthflow / Bland	A	Not publicly documented at headline level
Cognigy Voice Gateway	A	Not publicly documented (demo-gated)
OpenAI Realtime API	B	Not publicly documented at a per-minute headline (token-based pricing at platform.openai.com)
Deepgram Voice Agent	B	$4.50/hr flat with full stack (deepgram.com/product/voice-agent-api) — equivalent to $0.075/min	Reduced rates with bring-your-own-model
Google Dialogflow CX / Microsoft Voice Bot	B	Per-request and per-minute pricing on respective cloud pricing pages	Token + audio + STT/TTS components priced separately
Suki / DeepScribe / Abridge	C	Per-clinician seat, not per-minute; demo-gated	Different problem
Hyro	C	Not publicly documented (demo-gated)
Genesys / NICE / Five9 / Talkdesk	D	Per-seat/agent pricing, voice-bot add-on per-minute; demo-gated
LiveKit Agents / Pipecat / Vocode	E	Free (open-source); cost is vendor-passthrough (STT/LLM/TTS) + infrastructure
ZOL Voice Stack	—	~$8.70/month at projected 25K queries/month (internal cost-tracking; see performance/overview cost table). Per-tenant marginal cost is dominated by LLM token spend, not infrastructure (self-hosted Twilio + LiveKit).	At 25K queries × 45 s average call duration (~18,750 minutes/month), the headline-equivalent is ~$0.00046/min or ~$0.028/hour. This is not directly comparable to vendor per-minute pricing because (a) our number excludes infrastructure depreciation and engineer-time amortisation, and (b) vendor numbers typically bundle STT + LLM + TTS into the headline rate. A like-for-like cost comparison is on the Phase 5 list.

The honest framing: direct $/minute comparison across vendors is misleading without a normalisation pass that we have not yet run. Headline rates bundle different components. Our $8.70/month figure is internal cost-tracking, not a vendor-equivalent rate. The Phase 5 commitment in §4 includes building the normalised cost-comparison spreadsheet.
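The unit conversion behind any such normalisation pass is simple arithmetic. The sketch below uses only figures quoted in the cost table and converts them to a common $/min and $/hr basis; it does not correct for the bundling differences described above, so the outputs are not a like-for-like comparison.

```python
QUERIES_PER_MONTH = 25_000
AVG_CALL_SECONDS = 45
MONTHLY_COST_USD = 8.70  # internal cost-tracking figure from the cost table

minutes_per_month = QUERIES_PER_MONTH * AVG_CALL_SECONDS / 60
zol_usd_per_min = MONTHLY_COST_USD / minutes_per_month

# Vendor headline rates converted to the same unit.
deepgram_usd_per_min = 4.50 / 60  # $4.50/hr flat
retell_floor_usd_per_min = 0.055 + 0.015 + 0.003  # voice infra + TTS + cheapest LLM tier

for name, rate in [
    ("ZOL (internal tracking)", zol_usd_per_min),
    ("Deepgram Voice Agent", deepgram_usd_per_min),
    ("Retell AI (floor)", retell_floor_usd_per_min),
]:
    print(f"{name:<24} ${rate:.5f}/min  ${rate * 60:.3f}/hr")
```

Keeping the inputs as named constants makes it easy to rerun the conversion when the projected volume or the average call duration changes.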

2.8 Provenance

Vendor	Tier	Citations on answers	Chunk-id traceability	Deletion compliance	Audit-log retention
Retell AI	A	Not publicly documented as a feature	Not publicly documented	Per-account data deletion (docs.retellai.com)	Not publicly documented
Vapi	A	Not publicly documented as a feature	Not publicly documented	Per-account	Not publicly documented
Synthflow / Bland	A	Not publicly documented	Not publicly documented	Per-account	Not publicly documented
Cognigy Voice Gateway	A	Per-flow knowledge-source attribution exists (cognigy.com); chunk-id detail not publicly documented at engineering depth	Not publicly documented at engineering depth	Per-account	Not publicly documented
OpenAI Realtime API	B	None — LLM-only product	None — no retrieval surface	OpenAI account deletion	OpenAI default
Deepgram Voice Agent	B	Not publicly documented as a feature	Not publicly documented	Per-account	Not publicly documented
Google Dialogflow CX	B	Knowledge-base attribution exists; engineering depth not publicly documented	Not publicly documented at engineering depth	Per-project	Cloud Logging retention
Microsoft Voice Bot / Azure AI Speech	B	Not publicly documented at engineering depth	Not publicly documented at engineering depth	Per-project	Azure Monitor retention
Suki / DeepScribe / Abridge	C	Note-citation features in clinician scribes; not patient-facing	Not applicable	Per-customer	Vendor-specific
Hyro	C	Not publicly documented at engineering depth	Not publicly documented	Not publicly documented	Not publicly documented
Genesys / NICE / Five9 / Talkdesk	D	Knowledge-base attribution in some flows; engineering depth not publicly documented	Not publicly documented at engineering depth	Per-tenant	Per-tenant
LiveKit Agents / Pipecat / Vocode	E	Depends on integrator	Depends on integrator	Depends on integrator	Depends on integrator
ZOL Voice Stack	—	Citations on every substantive answer — chunk-derived for voice (no inline `[N]` markers) and marker-derived for chat. Pipeline in voice/citation-pipeline and Voice Stack Compendium §3.	Per-chunk traceability to source `document_chunks` row, including page number and document URL. The citation extractor is a three-helper cascade documented after the 2026-05-07 silent-failure regression; see silent-failure discipline R1/R2/R3.	GDPR Art. 17 right-to-erasure mapped to deletion of conversation rows + audit-log retention exception. See safety/data-retention-policy.	Audit logs retained per documented policy with audit-log retention exception under GDPR Art. 17(3)(e). See safety/data-retention-policy.

The salient observation: citation-grounded retrieval as a per-turn feature is not a routine vendor capability. Most vendors expose retrieval-augmented generation; few expose chunk-id traceability and per-turn citation pipelines at engineering depth. The lineage is Lewis et al. 2020 for the RAG architecture and Gao et al. 2024 for the modular-RAG taxonomy that places our Value Framework among orchestrated retrieval modules.
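Chunk-derived citation, as opposed to parsing inline `[N]` markers, can be sketched as follows. The `Chunk` and `Citation` types and the `used_in_answer` flag are hypothetical; the three-helper cascade in the real extractor is not reproduced here.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Chunk:
    chunk_id: int  # primary key of the source document_chunks row
    document_url: str
    page_number: int
    used_in_answer: bool  # hypothetical flag: did this chunk ground the answer?


@dataclass(frozen=True)
class Citation:
    chunk_id: int
    document_url: str
    page_number: int


def chunk_derived_citations(chunks: List[Chunk]) -> List[Citation]:
    """Build citations from the grounding chunks, one per (url, page), in retrieval order."""
    seen = set()
    citations = []
    for chunk in chunks:
        key = (chunk.document_url, chunk.page_number)
        if chunk.used_in_answer and key not in seen:
            seen.add(key)
            citations.append(Citation(chunk.chunk_id, chunk.document_url, chunk.page_number))
    return citations
```

Because a spoken answer cannot carry `[N]` markers, the citation list has to come from the retrieval result itself; each citation keeps the `chunk_id` so it can be traced back to the stored row.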

2.9 Cell-summary statistics

Axis	ZOL cells: verified	Competitor cells: verified	Competitor cells: not publicly documented / not measured
Latency (TTFT, end-to-end)	0 verified, 2 "not yet measured"	1 (Vapi end-to-end)	35
Multilingual (languages, switching)	2 verified	4 verified, partial	32
Domain depth (4 sub-axes)	4 verified	0	64 (mostly "not publicly documented")
Customization (4 sub-axes)	4 verified	~6 partial	~54
Observability (4 sub-axes)	4 verified	~4 partial	~56
Compliance (4 sub-axes)	4 verified	~12 partial (DPA frameworks, HIPAA claims)	~50
Cost (1 axis)	1 verified	2 verified (Retell, Deepgram)	16
Provenance (4 sub-axes)	4 verified	~6 partial	~58

Reading the totals: most cells in the matrix are blank because vendors do not publish the engineering depth that the comparison requires. Our blanks (the latency cells marked "not yet measured") are commitments to backfill in pilot Phase 5; competitor blanks are usually not commitments to publish at all. This is the matrix's core honest finding — competitive positioning at the engineering depth our buyer cares about is mostly a research exercise on the buyer side, because vendor marketing pages do not publish the answers.

3. Honest gap analysis

This section is deliberately unflattering. Three gaps shape the snapshot, and each is a place where a buyer would correctly say "competitor X is ahead of you on this axis." We name the deficit, the cost, and the closing-the-gap commitment in §4.

3.1 Infrastructure reliability — managed hyperscaler stacks have one less moving part

The OpenAI Realtime API and Deepgram Voice Agent are managed services; the hyperscaler runs the infrastructure, the integrator runs the prompt. Our stack is self-hosted Twilio Elastic SIP Trunk + LiveKit SIP + LiveKit Server + LiveKit Agents on a single pilot server, with multiple subprocessors of record (OpenAI, Deepgram, ElevenLabs) under GDPR Art. 28. At pilot scale (≤25 K queries/month, single-region) this is operationally fine — see the runbook in ADR-0050 — but we have not yet measured uptime against a managed alternative on an apples-to-apples SLO basis.

The cost of this gap to a buyer: a CTO asking "what is your committed uptime SLO?" gets the honest answer "we have not yet built a multi-region failover and we have not yet posted a public SLO." A managed-stack vendor can answer that question with a number. We cannot, yet. The §4.1 line commits to backfilling pilot uptime measurement and posting an SLO in Q3 2026.

3.2 Conversational fluidity — Retell has shipped fine-grained barge-in tuning

Retell publishes a latency troubleshooting page and a barge-in tuning surface that exposes per-call interruption sensitivity, end-of-turn detection thresholds, and tunable response delay. Our voice agent has basic barge-in support via LiveKit Agents' semantic turn detection (livekit_agents_docs), but we have not yet tuned per-call barge-in sensitivity. In particular, the elderly demographic that dominates hospital helpdesk traffic frequently pauses mid-sentence, and current Voice Activity Detection tuning may treat such a pause as end-of-turn and cut the caller off prematurely.
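
The mid-sentence pause problem can be illustrated with a naive silence-timeout endpointer. This is a sketch under stated assumptions, not how LiveKit Agents' semantic turn detector or Retell's tuning surface works: the function name, the 20 ms frame size, and both thresholds are hypothetical, and the speech flags stand in for the output of a Voice Activity Detection stage.

```python
from typing import Optional, Sequence


def end_of_turn_index(
    speech_flags: Sequence[bool], frame_ms: int, min_silence_ms: int
) -> Optional[int]:
    """Return the frame index at which end-of-turn is declared, or None.

    End-of-turn fires once the caller has spoken and then stayed silent
    for at least `min_silence_ms` of consecutive frames.
    """
    silence_ms = 0
    heard_speech = False
    for i, is_speech in enumerate(speech_flags):
        if is_speech:
            heard_speech = True
            silence_ms = 0  # any speech resets the silence timer
        elif heard_speech:
            silence_ms += frame_ms
            if silence_ms >= min_silence_ms:
                return i
    return None


FRAME_MS = 20  # hypothetical frame size
# 1 s of speech, an 800 ms mid-sentence pause, 1 s of speech, 2 s of silence.
utterance = [True] * 50 + [False] * 40 + [True] * 50 + [False] * 100

short_cut = end_of_turn_index(utterance, FRAME_MS, min_silence_ms=500)
patient_cut = end_of_turn_index(utterance, FRAME_MS, min_silence_ms=1200)
print(short_cut, patient_cut)
```

With the 500 ms threshold the turn ends inside the pause (frame 74, before the caller resumes at frame 90). With the 1200 ms threshold it ends only in the trailing silence (frame 199). The price of the longer threshold is added response delay on every turn, which is why the knob is worth exposing rather than hard-coding.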

The cost of this gap to a buyer: a hospital sponsor running a side-by-side smoke test with Retell will hear the Retell agent feel more "natural" on barge-in. This is a UX gap, not a correctness gap (our citations are still grounded, our safety is still enforced), but UX gaps shape buyer impressions. The §4.1 line commits to barge-in tuning improvements and to a documented per-tenant tuning surface in Q3 2026. The lineage of full-duplex voice systems is Lin et al. 2026; the conversational-analysis lineage is Sacks, Schegloff & Jefferson 1974.

3.3 Marketplace integrations — no Salesforce/HubSpot connectors

Genesys, NICE, Five9, and Talkdesk ship marketplace ecosystems with hundreds of pre-built connectors — Salesforce, HubSpot, Microsoft Dynamics, Zendesk, ServiceNow. A contact-center buyer who already lives inside a Salesforce CRM gets immediate value from a vendor whose voice bot can read and write Salesforce records. Our codebase has zero such connectors today. Our integration story is HTTP API + Postgres queries + DB-backed FAQ; none of those are marketplace connectors.

The cost of this gap to a buyer: the appointment-booking spinoff buyer who asks "can your agent log a follow-up task into our existing Salesforce instance?" gets the honest answer "not without engineering work." The §4.3 line commits to Salesforce and HubSpot connectors as the first two marketplace integrations in 2027 H1, contingent on pilot expansion to a tenant that needs them.

3.4 Two smaller gaps worth naming

Two further gaps are worth naming for completeness even if they do not justify roadmap lines on their own:

Mid-call language switching — by ADR-0052 this is a deliberate trade-off, not a deficit, but a buyer comparing language-list cells in §2.2 should be told that Cognigy claims 100+ languages with mid-call translation while we lock at first utterance to preserve Flemish accuracy. Both choices are defensible; the buyer should know which one we made.
No 24×7 operator-NOC dashboard — at pilot scale we have an operations dashboard (architecture/feedback-dashboard-metrics), but no on-call operator NOC. Contact-center incumbents ship a NOC. We do not, and at pilot scale we should not.

Neither of these is on the §4 roadmap; both are on this list so a buyer reading §3 has the complete honest picture.

4. Closing-the-gap roadmap

Each item below maps to a specific gap in §3 or to a specific cell in §2; the one adjacent item in §4.3 is flagged as such in its "Maps to" column. Items that map to neither have been deleted; this list is the roadmap, not a wish list.

4.1 Q3 2026 (Jul–Sep): SLA-grade pilot measurement + barge-in tuning

Item	Maps to	Description
Pilot uptime SLO posting	§3.1 (infrastructure reliability)	Backfill three months of pilot uptime data; post a public SLO. Requires the per-stage histogram instrumentation that §2.1 marks as "not yet measured at p95 on pilot."
Latency cell backfill	§2.1 ("not yet measured" cells)	Replace dev p50 numbers in voice/architecture latency-budget table with pilot p95 numbers. Methodology is Beyer et al. 2016 tail-latency framing.
Barge-in tuning v1	§3.2 (conversational fluidity)	Per-tenant Voice Activity Detection sensitivity tuning. Surface as a tenant-overlay knob in the YAML overlay (voice/tenant-overlay-system).
Faster TTFT via streaming TTS	§3.2 + §2.1	Investigate the ElevenLabs streaming endpoint for first-audio reduction below 200 ms.
Like-for-like cost comparison spreadsheet	§2.7 (cost normalisation)	Normalise vendor headline rates against component-broken-out cost so the buyer-facing per-minute comparison is honest.
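
The "Latency cell backfill" row replaces p50 numbers with p95 numbers because the two can diverge sharply when a few turns are slow. A minimal sketch using the nearest-rank percentile definition; the sample values are invented for illustration and are not pilot or dev measurements.

```python
from typing import Sequence


def percentile(samples: Sequence[float], p: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples or not 0 < p <= 100:
        raise ValueError("need at least one sample and 0 < p <= 100")
    ordered = sorted(samples)
    rank = (p * len(ordered) + 99) // 100  # ceil(p * n / 100) in integer arithmetic
    return ordered[rank - 1]


# Hypothetical end-to-end latencies in ms: 18 ordinary turns plus 2 slow outliers.
latencies_ms = [200 + 10 * i for i in range(18)] + [1400, 1600]

p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
print(p50, p95)
```

Here the median is 290 ms while the p95 is 1400 ms: the median hides exactly the turns a caller is most likely to complain about.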

4.2 Q4 2026 (Oct–Dec): zero-shot mode + second-pilot deployment

Item	Maps to	Description
Open-source intent-classifier benchmark	§2.4 (customization, intent affinity tuning)	Publish the 7-intent × 6-category affinity matrix as an evaluable benchmark. Methodology framed by Wohlin et al. 2012 experimentation in software engineering.
Zero-shot prompt mode	§2.4 (day-1 onboarding)	New tenant onboards by filling in a structured prompt template — no YAML overlay, no FAQ entries, just LLM + retrieval. Useful as a fast-mode for proof-of-concept tenants.
Second-pilot deployment	§2.4 (multi-tenant overlay validation)	Onboard a second hospital with only YAML overlay + DB rows; zero source-code commits to the codebase. Empirical proof of the multi-tenant onboarding architecture.
Diagnostic V2 metric publication	§2.5 (observability)	Publish per-dimension v2 diagnostic numbers as an internal benchmark. LLM-as-judge bias controls follow Zheng et al. 2023.

4.3 2027 H1: marketplace connectors + multi-region

Item	Maps to	Description
Salesforce connector	§3.3 (marketplace integrations)	First marketplace integration. Read + write Salesforce records via the agentic LLM's tool surface.
HubSpot connector	§3.3	Second marketplace integration.
Multi-region deploy	§3.1 (infrastructure reliability, second-region failover)	Add a second-region pilot deployment + DNS-level failover.
AI Act high-risk pathway documentation	(not §3, but adjacent)	If any future feature crosses the high-risk threshold (clinical decision support, scheduling that materially affects care delivery), we already have the limited-risk memo on file (safety/ai-act-compliance) — the high-risk pathway is the next step. Cite MDR 2017/745 as the medical-device adjacency.

The roadmap totals thirteen items across three time horizons (five in Q3 2026, four in Q4 2026, four in 2027 H1), each with its mapping stated in the "Maps to" column. No marketing-roadmap padding.

5. Why-us summary

Four differentiators carry the weight here. Each is named with concrete evidence; the engineering depth lives at the cross-link.

5.1 Domain depth — the Value Framework + safety triple-defense

The Value Framework (voice/value-framework) is a 7-intent × 6-category affinity matrix that prevents cross-category contamination — a wheelchair-accessibility query gets a parking answer, not an orthopaedic-reimbursement answer. The safety triple-defense (voice/triple-defense) layers regex pre-filter + LLM-side prompting + regex post-filter + post-LLM disclaimer prepender; multi-language regex packs cover nl/en/fr/it (adversarial hardening). Empirical evidence: 100 % pass rate on the 14-question safety-refusal cohort and 12-question adversarial-GCG cohort (thesis Chapter 4, Table 4.1, citing Zou et al. 2023). The Llama Guard line of work (Inan et al. 2023) is the academic adjacency for LLM-output safety; our regex post-filter is a deterministic complement.
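
The layering described above can be sketched as a small pipeline. Everything concrete here is an assumption made for illustration: the patterns, the refusal and disclaimer wording, the function name, and the choice to prepend the disclaimer to every answer are hypothetical and are not taken from the production regex packs.

```python
import re
from typing import Callable

# Hypothetical multi-language pre-filter (en/nl/fr/it words for "dosage").
PRE_FILTER = re.compile(r"\b(dosage|dosering|posologie|dosaggio)\b", re.IGNORECASE)
# Hypothetical post-filter: catches directive medical phrasing in the LLM draft.
POST_FILTER = re.compile(r"\byou should take\b", re.IGNORECASE)

REFUSAL = "I cannot help with medical questions. Please contact your care team."
DISCLAIMER = "This is general information, not medical advice. "


def guarded_answer(query: str, llm: Callable[[str], str]) -> str:
    """Run a query through pre-filter, LLM, post-filter, and disclaimer prepender."""
    if PRE_FILTER.search(query):
        return REFUSAL  # deterministic refusal; the LLM is never called
    draft = llm(query)  # LLM-side prompting lives inside `llm`
    if POST_FILTER.search(draft):
        return REFUSAL  # deterministic backstop if the prompt-level defense fails
    return DISCLAIMER + draft


def fake_llm(query: str) -> str:
    # Stand-in for the real model call.
    return "Visitor parking P1 is next to the main entrance."


print(guarded_answer("Where can I park?", fake_llm))
print(guarded_answer("What dosage of ibuprofen is safe?", fake_llm))
```

The point of the deterministic layers is that they do not depend on the model behaving: an adversarial suffix that defeats the prompt still has to get past a regex on the way in and another on the way out.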

5.2 Provenance + observability — citations + diagnostic V2 + Operations dashboard

Every substantive answer carries chunk-derived citations (voice/citation-pipeline). The diagnostic V2 endpoint scores per-turn correctness, safety, memory, tool-use, and latency on a documented rubric, with LLM-as-judge bias controls per Zheng et al. 2023. The Operations dashboard (architecture/feedback-dashboard-metrics) renders per-tenant trend charts on Category Mismatch and Diagnostic Accuracy. The architectural lineage is Lewis et al. 2020 for RAG and Gao et al. 2024 for modular RAG.

5.3 Multi-tenant SaaS architecture — zero-source-change onboarding

Tenant identity is bound to the Keycloak JWT claim, not to a header — cryptographically resolved per request (architecture/multi-tenancy). The two-plane configuration (DB-driven for web/RAG + YAML overlay for voice) puts slow-moving voice content in version control and fast-moving crawl rules in DB rows that platform admins edit through the API. The architectural lineage is Bezemer & Zaidman 2010 shared-schema multi-tenant SaaS taxonomy. The zero-source-change invariant is the architectural target; pilot Phase 5 commits to empirical verification via second-pilot deployment.
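
A minimal sketch of the claim-not-header rule, assuming the JWT signature has already been verified upstream and the claims arrive as a plain mapping. The claim name `tenant_id`, the header name, and the function are hypothetical; the real claim layout is defined in architecture/multi-tenancy.

```python
from typing import Mapping

TENANT_CLAIM = "tenant_id"  # hypothetical claim name


def resolve_tenant(verified_claims: Mapping[str, object], headers: Mapping[str, str]) -> str:
    """Resolve the tenant from signature-verified JWT claims only.

    `headers` is accepted only to make the point explicit: it is never
    consulted, so a client-supplied tenant header cannot change the result.
    """
    tenant = verified_claims.get(TENANT_CLAIM)
    if not isinstance(tenant, str) or not tenant:
        raise PermissionError("token carries no tenant claim")
    return tenant


claims = {"sub": "user-123", TENANT_CLAIM: "hospital-a"}
spoofed_headers = {"X-Tenant": "hospital-b"}
print(resolve_tenant(claims, spoofed_headers))
```

Because the claim sits inside the signed token, changing the tenant requires forging the signature, whereas a header can be set by any client.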

5.4 Engineering rigor — 50+ ADRs, 62-entry verified bibliography, silent-failure discipline

Architectural decisions live in the ADR series (decisions is a representative example); the bibliography (references) has 62 verified entries with last-verified dates and one-line summaries. The silent-failure discipline (R1: log size on collection-returning functions; R2: regression test for every silent-failure branch; R3: contract test for cross-component shared state) was codified after a real-world voice-history regression on 2026-05-07 — see the project's CLAUDE.md for the canonical writeup. The thesis (thesis Chapter 4) ships the empirical evidence for every quantitative claim. Software-engineering experimentation methodology is Wohlin et al. 2012; SRE practice for tail-latency is Beyer et al. 2016; software-craftsmanship practice is Martin 2017.
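
Rule R1 (log size on collection-returning functions) lends itself to a decorator. The sketch below is one possible shape, not the project's implementation: the logger name, the decorator, and `load_voice_history` are hypothetical.

```python
import functools
import logging
from typing import Callable, Sized, TypeVar

logger = logging.getLogger("silent_failure")  # hypothetical logger name
T = TypeVar("T", bound=Sized)


def log_size(func: Callable[..., T]) -> Callable[..., T]:
    """Log how many items a collection-returning function produced.

    An empty result is logged at WARNING so that "returned nothing"
    is visible in the logs instead of failing silently.
    """

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        result = func(*args, **kwargs)
        level = logging.WARNING if len(result) == 0 else logging.INFO
        logger.log(level, "%s returned %d item(s)", func.__name__, len(result))
        return result

    return wrapper


@log_size
def load_voice_history(session_id: str) -> list:
    # Hypothetical lookup that finds nothing for this session.
    return []


logging.basicConfig(level=logging.INFO)
load_voice_history("session-1")
```

The decorator does not change behaviour. It only guarantees that an empty collection leaves a trace, which is the precondition for the regression and contract tests that R2 and R3 require.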

The single sentence

We are the only stack on this list that combines patient-facing voice with citation-grounded retrieval, multi-tenant overlay onboarding, GDPR Art. 35 DPIA + AI Act Art. 50 limited-risk classification on file, and an empirically measured 99.0 % pass rate against a 302-question regulated-domain benchmark. The roadmap in §4 closes the three honest gaps without giving up the four differentiators.

6. Methodology and caveats

6.1 How the matrix was built

Public-material research was the primary source. For each vendor and each cell:

The vendor's documentation page, pricing page, or product page was fetched. Vendor URLs are in §2 inline.
If the page stated the answer, the cell got the answer + the URL.
If the page did not state the answer in five minutes of reading, the cell was marked Not publicly documented.
Inferred numbers were forbidden. If the buyer asks "where did you get this number?" we point at a URL, not a guess.
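
The cite-or-blank rule above can be enforced mechanically rather than by review. A minimal sketch, assuming a matrix cell is modelled as a value plus an optional source URL; the `Cell` class and the example URL are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

NOT_DOCUMENTED = "Not publicly documented"


@dataclass(frozen=True)
class Cell:
    """One matrix cell: either a sourced answer or an explicit blank."""

    value: str = NOT_DOCUMENTED
    source_url: Optional[str] = None

    def __post_init__(self) -> None:
        # A stated answer without a URL is an inferred number, which is forbidden.
        if self.value != NOT_DOCUMENTED and not self.source_url:
            raise ValueError("a stated value needs a source URL")


blank = Cell()
sourced = Cell("example answer", "https://example.com/docs")
print(blank.value, "|", sourced.value, "<-", sourced.source_url)
```

Constructing `Cell("example answer")` with no URL raises `ValueError`, so an unsourced number cannot enter the matrix by accident.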

Our own cells were sourced from:

The audit ledger under docs/audits/2026-05-09-*.md (drift register, source-of-truth for "what we have")
The thesis (thesis Chapter 4) for empirical golden-eval numbers
The fast-gate threshold study (backend/scripts/revalidate_fast_gate_threshold.json)
The ADR series for architectural decisions
The Voice Stack Compendium for engineering-grade claims
The references bibliography for academic claims (62 verified entries)

Where our own number is not yet measured, the cell says not yet measured with a Phase-5 commitment. Same discipline as for competitors.

6.2 Caveats

The matrix is a snapshot. Vendors ship; the matrix decays. Specific decay risks:

Pricing pages move quickly. Retell's per-minute components (retellai.com/pricing) and Deepgram's flat-rate Voice Agent (deepgram.com/product/voice-agent-api) were verified on 2026-05-09–10. Both will likely shift before Q3 2026.
Vendor language lists expand. Retell's multilingual page lists nl/en/es/fr/de/hi/ru/pt/ja/it as of verification; this can grow.
Hyperscaler features land continuously. OpenAI, Deepgram, Google, and Microsoft ship voice features on multi-week cadences; the cells marked "not publicly documented" today may be documented next quarter.
Engineering-depth competitive material is rarely public. Most blank cells exist because vendors do not publish the engineering depth our buyer cares about. A competitive-procurement reviewer should ask each vendor to produce their equivalent of our DPIA, AI Act memo, and Voice Stack Compendium.

6.3 Refresh cadence

This matrix is timestamped 2026-05 in the URL slug. The next refresh is committed to Q3 2026 alongside the pilot uptime SLO posting (§4.1). Each refresh will:

Re-verify every vendor URL (redirect, removal, content drift)
Replace "not yet measured" cells in the ZOL row with measured numbers as pilot Phase 5 backfills land
Add new tier rows if a new vendor category emerges (e.g., agentic-voice frameworks beyond LiveKit Agents / Pipecat / Vocode)
Strike vendors that have exited the market or pivoted out of voice

The matrix is honest about being a snapshot. A buyer reading this in 2027 H1 should refresh — or ask us to refresh — before relying on any specific cell.

7. References

This document cites the following bibliography keys (see references for full entries):

Lewis et al. 2020 — RAG architecture
Gao et al. 2024 — Modular-RAG taxonomy
Beyer et al. 2016 — SRE tail-latency SLO practice
Nielsen 1993 — Response-time UX thresholds
Bezemer & Zaidman 2010 — Multi-tenant SaaS architecture
Zheng et al. 2023 — LLM-as-judge bias controls
Inan et al. 2023 — Llama Guard / LLM-output safety
Zou et al. 2023 — Greedy Coordinate Gradient adversarial benchmark
Wohlin et al. 2012 — Software-engineering experimentation methodology
Sacks, Schegloff & Jefferson 1974 — Conversational turn-taking
Lin et al. 2026 — Full-duplex voice benchmark
Martin 2017 — Software-craftsmanship practice
GDPR, AI Act, MDR — EU regulatory texts
HLEG 2019 — Trustworthy-AI ethics guidelines
ISO 27001:2022, ISO 27018:2019 — Information-security management standards
LiveKit Agents, Deepgram Nova-3, ElevenLabs Multilingual v2 — Vendor stack components

Vendor product pages are cited inline by URL. They are intentionally not bibliography entries because they are product-marketing material, not academic work.
