SOTA Positioning Matrix (May 2026)
A competitive analysis of the ZOL voice stack against twenty vendors across eight comparison axes, written at engineer-buyer register. Every cell is either backed by a public source (vendor doc URL, public benchmark, or our own measured number with a source-of-truth pointer) or marked explicitly as not measured / not publicly documented. Inferred numbers are forbidden: if a buyer asks "where did you get this number?", we point at a URL or a source-of-truth pointer.
This document is a snapshot. Vendors ship; the matrix decays. The methodology section at the end describes the refresh cadence and the cite-or-blank discipline.
1. Executive summary
The voice-AI market in May 2026 is split across five tiers. Voice-AI specialists like Retell (retellai.com), Vapi (vapi.ai), Synthflow, Bland, and Cognigy Voice Gateway (cognigy.com/products/voice-gateway) compete on conversational fluidity, barge-in tuning, and developer ergonomics. Hyperscalers like the OpenAI Realtime API (platform.openai.com/docs/guides/realtime), Deepgram Voice Agent (deepgram.com/product/voice-agent-api), Google Dialogflow CX, and Microsoft Voice Bot compete on infrastructure scale and enterprise compliance certifications. Healthcare-specific vendors like Suki, DeepScribe, Abridge, and Hyro compete on a different axis — most are clinician-facing scribes, not patient-facing voice search; the closest analogue is Hyro. Contact-center incumbents like Genesys Cloud CX, NICE CXone, Five9, and Talkdesk compete on enterprise integrations, marketplace ecosystems, and call-routing maturity. Open-source baselines like LiveKit Agents (raw, github.com/livekit/agents), Pipecat (github.com/pipecat-ai/pipecat), and Vocode are reference implementations that ship the runtime but not the cognition.
The ZOL voice stack competes with all five tiers on different axes. Against the voice-AI specialists we compete on domain depth and provenance — our retrieval pipeline, citation discipline, and multi-language safety architecture are not features they offer out of the box. Against the hyperscalers we compete on honesty and observability — our per-turn telemetry, citation-grounded answers, and documented LLM-as-judge bias controls (Zheng et al. 2023) describe a system that knows when it is wrong; their managed services do not surface that signal at the same granularity. Against the healthcare-specific vendors we compete on scope clarity — our system is not a clinical scribe, not clinical decision support, and the architecture is shaped by that negative scope (see thin voice architecture). Against the contact-center incumbents we compete on engineering rigor and time-to-onboard — our multi-tenant overlay system (architecture/multi-tenancy) admits a new hospital with zero source-code changes; their integrations are powerful but heavyweight. Against the open-source baselines we compete on the layer above the runtime — they ship LiveKit-equivalent plumbing, we ship the seven-layer stack documented in the Voice Stack Compendium.
Three honest gaps shape this snapshot. Infrastructure reliability — managed hyperscaler stacks have one less moving part than a self-hosted Twilio + LiveKit deployment, and at our scale (≤25 K queries/month, single-region pilot) we have not yet measured uptime against a managed alternative. Conversational fluidity — Retell has shipped fine-grained barge-in tuning (docs.retellai.com) that we have not yet matched at the per-turn level. Marketplace integrations — Salesforce, HubSpot, and Microsoft Dynamics connectors that contact-center incumbents ship out of the box are not in our codebase. Section 3 describes each gap in detail; Section 4 commits each to a time-boxed roadmap line.
The single-sentence answer to "why us" is this: we are the only stack on this list that combines patient-facing voice with citation-grounded retrieval, multi-tenant overlay onboarding, GDPR Art. 35 DPIA + AI Act Art. 50 limited-risk classification on file, and an empirically measured 99.0 % pass rate against a 302-question regulated-domain benchmark (thesis Chapter 4, Table 4.1). Every other vendor in the matrix is missing at least one of those four. The roadmap in Section 4 closes the three honest gaps without giving up the four differentiators.
2. Per-axis comparison matrix
The matrix uses one row per vendor, grouped by tier, with a final ZOL row. Each cell is one of:
- A specific number, language list, or feature claim with an inline URL or a [file:line] source-of-truth pointer
- Not publicly documented: vendor material does not state this, and inference is forbidden
- Not yet measured: we have not benchmarked this against the vendor; pilot Phase 5 will backfill some of these
The five tiers are framed at the head of axis 1; subsequent axes reuse the same vendor groupings.
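The three cell kinds listed above lend themselves to a mechanical check. The following is a minimal sketch, assuming a hypothetical `Cell` record that is not part of the ZOL codebase; the status labels and field names are illustrative only. It rejects any value that arrives without a source.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical status labels mirroring the three cell kinds in the matrix.
CITED = "cited"
NOT_DOCUMENTED = "not publicly documented"
NOT_MEASURED = "not yet measured"
STATUSES = (CITED, NOT_DOCUMENTED, NOT_MEASURED)


@dataclass(frozen=True)
class Cell:
    status: str
    value: Optional[str] = None   # number, language list, or feature claim
    source: Optional[str] = None  # URL or file:line pointer

    def __post_init__(self) -> None:
        if self.status not in STATUSES:
            raise ValueError(f"unknown status: {self.status!r}")
        # A cited cell must carry both the claim and where it came from.
        if self.status == CITED and not (self.value and self.source):
            raise ValueError("a cited cell needs both a value and a source")
        # Cite-or-blank: no cell of any kind may hold an unsourced value.
        if self.value and not self.source:
            raise ValueError("a value without a source is an inferred number")


vapi_latency = Cell(CITED, "500-700 ms voice-to-voice", "docs.vapi.ai/quickstart")
retell_latency = Cell(NOT_DOCUMENTED)
```

Constructing `Cell(CITED, "300 ms")` raises `ValueError`, which is the point of the discipline: an unsourced number never reaches the table.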
2.1 Latency
Tier A — Voice-AI specialists: vendors that build and sell a turnkey voice-agent platform. Their value proposition is "give us your prompt, we run the call." They compete on conversational fluidity and developer ergonomics.
Tier B — Hyperscalers: speech-and-LLM providers (OpenAI, Deepgram, Google, Microsoft) that expose voice-agent or realtime-API surfaces. Their value proposition is "infrastructure scale and compliance posture."
Tier C — Healthcare-specific: vendors selling into healthcare. Three of the four (Suki, DeepScribe, Abridge) are clinician-facing scribes and therefore not direct competitors on the patient-search axis; Hyro is the only patient-facing analogue.
Tier D — Contact-center incumbents: enterprise CCaaS platforms with voice-bot extensions. Their value proposition is integrations and routing maturity.
Tier E — Open-source baselines: frameworks (LiveKit Agents raw, Pipecat, Vocode) that ship the runtime without cognition or domain logic.
| Vendor | Tier | TTFT (time to first audio) | End-to-end turn latency |
|---|---|---|---|
| Retell AI | A | Not publicly documented | Not publicly documented (vendor publishes a latency troubleshooting page but not a target SLO) |
| Vapi | A | Not publicly documented | 500–700 ms voice-to-voice (docs.vapi.ai/quickstart) |
| Synthflow | A | Not publicly documented | Not publicly documented |
| Bland AI | A | Not publicly documented | Not publicly documented |
| Cognigy Voice Gateway | A | Not publicly documented | Not publicly documented (vendor cites "99.7 % intent recognition" and "25K+ concurrent conversations" but no end-to-end latency target — cognigy.com/products/voice-gateway) |
| OpenAI Realtime API | B | Not publicly documented at a target SLO | Not publicly documented at a target SLO |
| Deepgram Voice Agent | B | Not publicly documented | Not publicly documented (vendor markets "real-time responsiveness" without published p50/p95) |
| Google Dialogflow CX | B | Not publicly documented | Not publicly documented |
| Microsoft Voice Bot / Azure AI Speech | B | Not publicly documented | Not publicly documented |
| Suki AI | C | N/A — clinician scribe, no caller-facing turn loop | N/A |
| DeepScribe | C | N/A — clinician scribe | N/A |
| Abridge | C | N/A — clinician scribe | N/A |
| Hyro | C | Not publicly documented | Not publicly documented |
| Genesys Cloud CX | D | Not publicly documented | Not publicly documented |
| NICE CXone | D | Not publicly documented | Not publicly documented |
| Five9 | D | Not publicly documented | Not publicly documented |
| Talkdesk | D | Not publicly documented | Not publicly documented |
| LiveKit Agents (raw) | E | N/A — framework, depends on plugin choices (github.com/livekit/agents) | N/A |
| Pipecat | E | N/A — framework | N/A |
| Vocode | E | N/A — framework | N/A |
| ZOL Voice Stack | — | Not yet measured at p95 on pilot; local-dev p50 of ElevenLabs first-audio is 200–400 ms (voice/architecture) | Not yet measured at p95 on pilot; local-dev stage budget targets ~5.5 s end-to-end on the chat channel (performance/overview). Voice-channel p95 is on the Phase-5 measurement list. |
Latency targets follow the Beyer et al. 2016 SRE practice of writing SLOs at the tail (p95, p99) rather than the mean. The underlying UX thresholds are from Nielsen 1993: 0.1 s for instantaneous feedback, 1 s for seamless flow, 10 s as the upper attention bound. Reading the table: of the 20 competitors, only one (Vapi) publishes a numeric latency target. The other 19 do not document a target, have no caller-facing turn loop (the three clinician scribes), or sell a framework that pushes the latency question down to the integrator. This is informative on its own: competitive latency is mostly a marketing claim, not a published number.
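To illustrate why the tail matters more than the mean, here is a minimal sketch using the nearest-rank percentile definition. The latency samples are made up for illustration and are not pilot measurements.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample such that at
    least p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[rank - 1]


# Illustrative per-turn latencies in milliseconds (NOT measured data).
latencies_ms = [420, 450, 480, 510, 530, 560, 600, 640, 700, 2900]

mean_ms = sum(latencies_ms) / len(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p95_ms = percentile(latencies_ms, 95)
print(f"mean={mean_ms:.0f} ms  p50={p50_ms} ms  p95={p95_ms} ms")
```

With these samples the mean is 779 ms and the median 530 ms, both under Nielsen's 1 s threshold, while the p95 is 2900 ms. One slow turn in ten is invisible in the mean but is what the caller remembers, which is why the SLO is written at the tail.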
2.2 Multilingual
| Vendor | Tier | Languages | Mid-call switching policy |
|---|---|---|---|
| Retell AI | A | At least nl, en, es, fr, de, hi, ru, pt, ja, it (docs.retellai.com/agent/multilingual); vendor notes per-voice subsets | Not publicly documented at a single policy level |
| Vapi | A | Not publicly documented at a complete list | Not publicly documented |
| Synthflow | A | Not publicly documented | Not publicly documented |
| Bland AI | A | Not publicly documented | Not publicly documented |
| Cognigy Voice Gateway | A | "100+ languages" with built-in machine translation (cognigy.com/products/voice-gateway) | Not publicly documented |
| OpenAI Realtime API | B | Not publicly documented at a complete list | Not publicly documented |
| Deepgram Voice Agent | B | Not publicly documented at the agent level (Nova-3 STT supports multiple languages — deepgram.com) | Not publicly documented |
| Google Dialogflow CX | B | Not publicly documented at this granularity | Not publicly documented |
| Microsoft Voice Bot / Azure AI Speech | B | Not publicly documented at this granularity | Not publicly documented |
| Hyro | C | Not publicly documented | Not publicly documented |
| Suki / DeepScribe / Abridge | C | N/A — scribe, not voice agent | N/A |
| Genesys / NICE / Five9 / Talkdesk | D | Not publicly documented at this granularity | Not publicly documented |
| LiveKit Agents (raw) | E | Depends on STT/TTS plugin choice (livekit_agents_docs) | Depends on integrator |
| Pipecat | E | Depends on integrator | Depends on integrator |
| Vocode | E | Depends on integrator | Depends on integrator |
| ZOL Voice Stack | — | nl, en, fr, it — production-validated; Dutch (Flemish) is primary, see voice/language-locking | Locked at first STT-confirmed utterance for the duration of the call (ADR-0052). Mid-call switching is explicitly traded away to preserve Flemish acoustic accuracy after two empirical pilot regressions documented in the ADR. |
The salient observation: most vendors' language lists are not published at agent-product granularity. Retell publishes the longest verifiable list; Cognigy claims the most ("100+") via machine translation. Our four are fewer in count, but each is production-tuned with safety regex packs (see §2.3). The locked-at-first-utterance policy is a deliberate trade-off, not a limitation: per the empirical evidence in ADR-0052, running Deepgram in multi-language mode measurably degrades Flemish accuracy.
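The locked-at-first-utterance policy can be sketched in a few lines. This is an illustration under stated assumptions, not the ADR-0052 implementation: the `LanguageLock` class, the `confirmed` flag from the STT layer, and the Dutch default before the lock are all hypothetical.

```python
from typing import Optional

SUPPORTED = ("nl", "en", "fr", "it")


class LanguageLock:
    """Locks the call language at the first STT-confirmed utterance."""

    def __init__(self, default: str = "nl") -> None:
        # Assumption: Dutch is used until a language has been confirmed.
        self._default = default
        self._locked: Optional[str] = None

    def observe(self, detected: str, confirmed: bool) -> str:
        """Record one utterance and return the language for this turn."""
        if self._locked is None and confirmed and detected in SUPPORTED:
            self._locked = detected
        # Once locked, later detections are ignored for the rest of the call.
        return self._locked or self._default


lock = LanguageLock()
turns = [
    lock.observe("fr", confirmed=False),  # unconfirmed: stay on default
    lock.observe("fr", confirmed=True),   # first confirmed utterance: lock
    lock.observe("en", confirmed=True),   # mid-call switch is ignored
]
print(turns)
```

The third turn stays on French even though English was detected, which is the trade-off the policy makes explicit.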
2.3 Domain depth (regulated-healthcare voice)
| Vendor | Tier | Out-of-box healthcare safety | Medical-advice refusal | STT-mishearing awareness | Voice-channel safety architecture |
|---|---|---|---|---|---|
| Retell AI | A | Not publicly documented as a healthcare-specific feature | Not publicly documented | Not publicly documented | Not publicly documented |
| Vapi | A | Not publicly documented | Not publicly documented | Not publicly documented | Not publicly documented |
| Synthflow | A | Not publicly documented | Not publicly documented | Not publicly documented | Not publicly documented |
| Bland AI | A | Not publicly documented | Not publicly documented | Not publicly documented | Not publicly documented |
| Cognigy Voice Gateway | A | Healthcare listed as an industry vertical; specific features not documented (cognigy.com/products/voice-gateway) | Not publicly documented | Not publicly documented | Not publicly documented |
| OpenAI Realtime API | B | Not publicly documented | OpenAI safety policies apply at model level; agent-product surface not documented | Not publicly documented | Not publicly documented |
| Deepgram Voice Agent | B | Not publicly documented | Not publicly documented | Not publicly documented | Not publicly documented |
| Google Dialogflow CX | B | Not publicly documented | Not publicly documented | Not publicly documented | Not publicly documented |
| Microsoft Voice Bot / Azure AI Speech | B | Not publicly documented | Not publicly documented | Not publicly documented | Not publicly documented |
| Suki AI | C | Clinician scribe; HIPAA / SOC 2 (suki.ai) | Out of scope — not patient-facing | N/A — clinician audio context, different problem | N/A |
| DeepScribe | C | HIPAA, SOC 2 (deepscribe.ai) | Out of scope — clinician scribe | N/A | N/A |
| Abridge | C | Enterprise healthcare claim (abridge.com); specific compliance certifications not on the homepage | Out of scope — clinician scribe | N/A | N/A |
| Hyro | C | Healthcare-specific patient-facing assistant; specific safety architecture not publicly documented at engineering depth | Not publicly documented | Not publicly documented | Not publicly documented |
| Genesys / NICE / Five9 / Talkdesk | D | Healthcare verticals; specific safety architecture not publicly documented at engineering depth | Not publicly documented | Not publicly documented | Not publicly documented |
| LiveKit Agents / Pipecat / Vocode | E | None — frameworks, no domain | None — depends on integrator | None — depends on integrator | None |
| ZOL Voice Stack | — | Multi-language safety regex packs (nl/en/fr/it) — see voice/triple-defense and adversarial hardening. Pattern set covers diagnostic, prescription, and dosage queries with a 100 % pass rate on the 14-question safety-refusal cohort and 12-question adversarial-GCG cohort (thesis Chapter 4, Table 4.1, citing Zou et al. 2023 for the GCG benchmark methodology). | Three-stage post-LLM disclaimer — automatic medical-content detection on the answer text + disclaimer prepender (voice/triple-defense). Disclaimer wording in nl/en/fr/it. | STT-mishearing aware — the language-locking ADR and the "Hoe wordt migraine behandeld?" / "Behandel ik migraine?" phoneme-pair example documented in the Voice Stack Compendium §1. | Regex pre-filter → agentic LLM with three tools → regex post-filter → answer-shaper → disclaimer prepender (voice/architecture). Architecture is the only voice path; legacy 8-stage VoiceOrchestrator was deleted in commit 158d793 (ADR-0049, ADR-0051). |
The salient observation: the four cells in our row are populated; the same four cells across every other vendor are either "not publicly documented" or "out of scope." This is the axis on which the architecture differentiates most cleanly. The healthcare-specific vendors (Suki, DeepScribe, Abridge) are not on this axis at all — they sell into a different problem (clinician documentation), and Hyro, the closest patient-facing analogue, does not document safety architecture at the engineering depth our compendium does. See Inan et al. 2023 for the Llama Guard lineage of LLM-output safety classifiers; our regex post-filter is a deterministic complement to that line of work.
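The stage order in the ZOL row (regex pre-filter, LLM, regex post-filter, answer-shaper, disclaimer prepender) can be sketched as a single function. This is a minimal sketch under assumptions: the patterns, the refusal and disclaimer wording, and the behaviour of each stage are illustrative English-only stand-ins, and the LLM is passed in as a plain callable rather than the agentic three-tool setup.

```python
import re
from typing import Callable

# Illustrative patterns only; the production regex packs are not shown here.
ADVICE_QUERY = re.compile(r"\b(dosage|dose|prescribe|diagnos\w*)\b", re.IGNORECASE)
MEDICAL_ANSWER = re.compile(r"\b(treatment|medication|symptom\w*)\b", re.IGNORECASE)

REFUSAL = "I cannot give medical advice. Please contact your doctor."
DISCLAIMER = "This is general information, not medical advice. "


def answer_turn(query: str, llm: Callable[[str], str]) -> str:
    # Stage 1: regex pre-filter refuses advice-seeking queries outright.
    if ADVICE_QUERY.search(query):
        return REFUSAL
    # Stage 2: the LLM drafts an answer (stubbed by the caller here).
    draft = llm(query)
    # Stage 3: regex post-filter catches advice that slipped through.
    if ADVICE_QUERY.search(draft):
        return REFUSAL
    # Stage 4: answer-shaper stand-in that only normalises whitespace.
    shaped = " ".join(draft.split())
    # Stage 5: disclaimer prepender for answers with medical content.
    if MEDICAL_ANSWER.search(shaped):
        return DISCLAIMER + shaped
    return shaped


print(answer_turn("Where is cardiology?", lambda q: "Cardiology is on  floor 2."))
```

Because both filters are deterministic, the same input always yields the same refusal decision, which is what makes the regex layer a complement to a learned safety classifier rather than a replacement for one.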
2.4 Customization (multi-tenant onboarding)
| Vendor | Tier | Per-tenant overlay | Intent affinity tuning | FAQ override | Day-1 onboarding effort |
|---|---|---|---|---|---|
| Retell AI | A | Per-account agent configuration (docs.retellai.com) | Not publicly documented at a tunable matrix level | Not publicly documented as a separate feature | Not publicly documented at minute-level |
| Vapi | A | Per-account agent configuration (docs.vapi.ai) | Not publicly documented | Not publicly documented | Not publicly documented |
| Synthflow / Bland | A | Not publicly documented at this granularity | Not publicly documented | Not publicly documented | Not publicly documented |
| Cognigy Voice Gateway | A | Per-account flow configuration | Not publicly documented at a tunable matrix level | Not publicly documented as a separate feature | Not publicly documented |
| OpenAI Realtime API | B | Per-API-key + system prompt; not multi-tenant out of box | None (LLM-only) | None (LLM-only) | Not applicable — building-block API |
| Deepgram Voice Agent | B | Per-account configuration | Not publicly documented | Not publicly documented | Not publicly documented |
| Google Dialogflow CX / Microsoft Voice Bot | B | Per-project / per-resource isolation | Not publicly documented at a tunable matrix level | Per-intent override (Dialogflow) | Not publicly documented at minute-level |
| Hyro | C | Per-customer deployment | Not publicly documented at engineering depth | Not publicly documented at engineering depth | Not publicly documented |
| Suki / DeepScribe / Abridge | C | Per-clinic deployment; different problem | N/A | N/A | N/A |
| Genesys / NICE / Five9 / Talkdesk | D | Per-tenant routing + flow isolation | Not publicly documented at a tunable matrix level | Per-flow / per-skill | Not publicly documented at minute-level |
| LiveKit Agents / Pipecat / Vocode | E | Depends on integrator | Depends on integrator | Depends on integrator | Depends on integrator |
| ZOL Voice Stack | — | Two-plane configuration: DB-driven for web/RAG (site_crawl_configs, golden_pages, PromptContext) + YAML overlay for voice (tenant_overlays/_yaml/<slug>.yaml); see architecture/multi-tenancy. Tenant identity comes from the Keycloak JWT claim — cryptographically bound, not header-resolved. Architectural lineage: Bezemer & Zaidman 2010 shared-schema multi-tenant SaaS taxonomy. | 7-intent × 6-category affinity matrix (voice/value-framework). Per-intent multipliers tune retrieval categorical fit without changing prompts. | DB-driven FAQ renderers (voice/tenant-overlay-system) — per-tenant FAQ entries, STT phonetic-recovery overrides, and pre-filter classifier overrides land in version control without source changes. | Zero source-code change to onboard a new tenant is the architectural target. Empirical day-1 onboarding effort has not yet been measured against a competitor (pilot Phase 5 commitment, see §4). |
Reading the table: most vendors have some form of per-tenant configuration, but none publish an intent-to-category affinity matrix as a tunable surface. Our Value Framework (voice/value-framework) is unusual in exposing this lever; it is a direct consequence of the wheelchair cross-category contamination regression documented on the same page.
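To show how per-intent multipliers can tune categorical fit without touching prompts, here is a minimal sketch. It assumes a hypothetical two-intent, three-category subset of the matrix; the intent names, category names, and multiplier values are invented for illustration and are not the production 7 × 6 matrix.

```python
from typing import Dict, List, Tuple

# Hypothetical subset of an intent x category affinity matrix.
AFFINITY: Dict[str, Dict[str, float]] = {
    "find_department": {"departments": 1.5, "practical_info": 1.0, "medical_devices": 0.5},
    "rent_equipment": {"departments": 0.5, "practical_info": 1.0, "medical_devices": 1.5},
}

Hit = Tuple[str, str, float]  # (chunk_id, category, retrieval_score)


def rerank(intent: str, hits: List[Hit]) -> List[Tuple[str, float]]:
    """Scale each retrieval score by the intent's category multiplier."""
    row = AFFINITY.get(intent, {})
    scored = [
        (chunk_id, score * row.get(category, 1.0))  # unknown pairs stay neutral
        for chunk_id, category, score in hits
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


hits = [("c1", "medical_devices", 0.80), ("c2", "departments", 0.70)]
print([chunk_id for chunk_id, _ in rerank("find_department", hits)])
print([chunk_id for chunk_id, _ in rerank("rent_equipment", hits)])
```

The same two hits come back in opposite orders depending on the intent, so a chunk from the wrong category can no longer win on raw similarity alone.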
2.5 Observability
| Vendor | Tier | Per-turn telemetry | Diagnostic accuracy metric | Operator feedback loop | Cost dashboard |
|---|---|---|---|---|---|
| Retell AI | A | Vendor exposes a call-history view; per-turn detail not publicly documented at engineering depth | Not publicly documented | Not publicly documented | Per-account billing dashboard |
| Vapi | A | Vendor exposes call logs; per-turn detail not publicly documented | Not publicly documented | Not publicly documented | Per-account billing |
| Synthflow / Bland | A | Not publicly documented at engineering depth | Not publicly documented | Not publicly documented | Not publicly documented |
| Cognigy Voice Gateway | A | "99.7 % intent recognition" claimed at the platform level (cognigy.com); per-turn detail not publicly documented | Not publicly documented | Not publicly documented at engineering depth | Per-account billing |
| OpenAI Realtime API | B | OpenAI usage dashboard; per-turn telemetry not publicly documented at engineering depth | Not publicly documented | Not publicly documented | OpenAI billing dashboard |
| Deepgram Voice Agent | B | Per-account dashboard; per-turn detail not publicly documented at engineering depth | Not publicly documented | Not publicly documented | Per-account billing |
| Google Dialogflow CX / Microsoft Voice Bot | B | Per-project Cloud Monitoring / Azure Monitor dashboards; agent-specific quality metrics not publicly documented | Not publicly documented | Not publicly documented | Per-project billing |
| Hyro | C | Not publicly documented at engineering depth | Not publicly documented | Not publicly documented | Not publicly documented |
| Suki / DeepScribe / Abridge | C | Different problem domain; observability geared to clinician QA, not patient-call telemetry | Not applicable | Clinician feedback loops (vendor-specific) | Not applicable |
| Genesys / NICE / Five9 / Talkdesk | D | Mature CCaaS reporting (call recording, IVR analytics) | Not publicly documented at agent-quality level | Not publicly documented at engineering depth | Per-account billing |
| LiveKit Agents / Pipecat / Vocode | E | Frameworks emit events; integrator builds the dashboard | Depends on integrator | Depends on integrator | Depends on integrator |
| ZOL Voice Stack | — | pipeline_telemetry Postgres table — per-stage latency, retrieval cardinality, intent class, primary content category, category-mismatch indicator. Per-turn writes are unconditional. See Voice Stack Compendium §4. | Diagnostic V2 endpoint (POST /api/v1/query?response_format=v2) — per-dimension scoring (correctness, safety, memory, tool_use, latency) by VoiceTurnEvaluator (schema-validated via the structured_call helper). LLM-as-judge bias controls follow Zheng et al. 2023. | Operations dashboard — per-tenant trend charts (Category Mismatch Trend, Diagnostic Accuracy Trend) on the Costs tab; described in feedback-dashboard-metrics. | Costs page — dollarised per-LLM-call breakdown, see performance/overview. |
This is the second axis where our row is populated and most competitor cells are not. The driver is engineering choice, not technical difficulty: vendors could expose per-turn telemetry, and most do so internally; what they do not publish is the engineering shape of those metrics. The diagnostic-accuracy metric in particular ("did the answer match the caller's intent on the documented 5-dimension rubric?") is, to our reading of public material, unique on this list.
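As an illustration of what a per-turn record and a trend metric derived from it could look like, here is a minimal sketch. The `TurnTelemetry` fields follow the columns named in the ZOL row, but the class itself, the stage names, and the sample values are hypothetical and do not reproduce the `pipeline_telemetry` schema.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class TurnTelemetry:
    stage_latency_ms: Dict[str, float]  # per-stage latency
    retrieval_count: int                # retrieval cardinality
    intent: str                         # intent class
    primary_category: str               # primary content category
    category_mismatch: bool             # category-mismatch indicator


def total_latency_ms(turn: TurnTelemetry) -> float:
    """End-to-end latency as the sum of the recorded stages."""
    return sum(turn.stage_latency_ms.values())


def mismatch_rate(turns: List[TurnTelemetry]) -> float:
    """Share of turns flagged as category mismatches (0.0 if no turns)."""
    if not turns:
        return 0.0
    return sum(t.category_mismatch for t in turns) / len(turns)


# Made-up sample turns, not pilot data.
sample = [
    TurnTelemetry({"stt": 300.0, "llm": 1200.0, "tts": 250.0}, 5, "find_department", "departments", False),
    TurnTelemetry({"stt": 280.0, "llm": 900.0, "tts": 240.0}, 3, "rent_equipment", "departments", True),
    TurnTelemetry({"stt": 310.0, "llm": 1100.0, "tts": 260.0}, 4, "find_department", "departments", False),
    TurnTelemetry({"stt": 290.0, "llm": 1000.0, "tts": 230.0}, 6, "find_department", "practical_info", False),
]
print(total_latency_ms(sample[0]), mismatch_rate(sample))
```

Writing the record unconditionally on every turn is what makes a rate like this trustworthy: a turn that fails silently would otherwise drop out of the denominator.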
2.6 Compliance
| Vendor | Tier | GDPR DPIA on file | AI Act classification on file | PII redaction | Data residency |
|---|---|---|---|---|---|
| Retell AI | A | Not publicly documented | Not publicly documented | Add-on at $0.01/min (retellai.com/pricing) | Not publicly documented |
| Vapi | A | Not publicly documented | Not publicly documented | Not publicly documented | Not publicly documented |
| Synthflow / Bland | A | Not publicly documented | Not publicly documented | Not publicly documented | Not publicly documented |
| Cognigy Voice Gateway | A | GDPR compliance claimed at platform level (cognigy.com); DPIA per-deployment artifact not publicly documented | Not publicly documented | Not publicly documented | Not publicly documented |
| OpenAI Realtime API | B | OpenAI publishes a DPA framework; per-deployment DPIA is the customer's | Not publicly documented at this product level | Not publicly documented | EU data residency available on Enterprise plans (verify per-tier) |
| Deepgram Voice Agent | B | Vendor publishes a DPA framework | Not publicly documented at this product level | Not publicly documented | Not publicly documented at agent-product level |
| Google Dialogflow CX | B | GCP DPA framework | Not publicly documented at this product level | DLP API can be wired in (separate product) | Multi-region available |
| Microsoft Voice Bot / Azure AI Speech | B | Azure DPA framework | Not publicly documented at this product level | Azure-side PII tooling (separate) | Multi-region available |
| Suki AI | C | HIPAA, SOC 2 (suki.ai) | Not publicly documented | HIPAA-compliant by design | Not publicly documented |
| DeepScribe | C | HIPAA, SOC 2 (deepscribe.ai) | Not publicly documented | HIPAA-compliant by design | Not publicly documented |
| Abridge | C | "Enterprise-grade" claim (abridge.com); specific certifications not on the public marketing page | Not publicly documented | Not publicly documented at marketing-page depth | Not publicly documented |
| Hyro | C | Not publicly documented at engineering depth | Not publicly documented | Not publicly documented | Not publicly documented |
| Genesys / NICE / Five9 / Talkdesk | D | Mature DPA frameworks; specific DPIA artifacts are customer-side | Not publicly documented at this product level | Vendor-specific PII tooling | Multi-region available |
| LiveKit Agents / Pipecat / Vocode | E | None — depends on integrator | None — depends on integrator | None — depends on integrator | Self-hosted (integrator's choice) |
| ZOL Voice Stack | — | GDPR Art. 35 DPIA on file — see safety/dpia. Lawful basis, processing scope, residual-risk register, and Article-by-Article mapping are documented. References GDPR directly. | AI Act Art. 50 limited-risk classification on file — see safety/ai-act-compliance. High-risk Annex III analysis included with explicit scope-limit warning. References AI Act and MDR directly. Trustworthy-AI principles from HLEG 2019 cited. | Multi-language voice PII redaction (voice_pii_redaction.py) at telemetry write time. Architecture in safety/pii-protection. | Self-hosted; data does not leave the pilot server except to subprocessors-of-record (OpenAI, Deepgram, ElevenLabs) under GDPR Art. 28 — see Voice Stack Compendium §1. |
This axis is populated at engineering depth in our row and at marketing depth in most competitor rows, not because the competitors are non-compliant, but because compliance artifacts are typically per-deployment work product and not published as part of vendor marketing. The honest read: on the four cells of this axis we have published the engineering-depth artifacts; competitors have published the marketing claim. A buyer evaluating this axis should ask each vendor to produce their equivalent of our DPIA and AI Act memo. ISO/IEC 27001:2022 (iso27001_2022) and ISO/IEC 27018:2019 (iso27018_2019) provide the cross-vendor reference for what those artifacts should contain.
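Redaction at telemetry write time means the transcript is scrubbed before it is persisted, so raw identifiers never reach the table. The following is a minimal sketch of that idea with two deliberately simple, hypothetical patterns; it does not reproduce `voice_pii_redaction.py`, and a production pack would need language-specific rules and far more categories.

```python
import re

# Illustrative patterns only (assumptions, not the production rules).
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE = re.compile(r"\+?\d[\d\s./-]{7,}\d")


def redact(text: str) -> str:
    """Replace e-mail addresses and phone-like digit runs with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)


transcript = "Mail me at jan.peeters@example.com or call 0470 12 34 56."
print(redact(transcript))
```

A phone pattern this loose will also catch other long digit runs; for telemetry that over-redaction is usually the safer failure mode.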
2.7 Cost
| Vendor | Tier | Headline price | Notes |
|---|---|---|---|
| Retell AI | A | $0.07–$0.31 / min voice (retellai.com/pricing) | Components: voice infra $0.055/min + TTS $0.015/min + LLM ($0.003–$0.08/min) + add-ons (knowledge base $0.005/min, denoising $0.005/min, PII removal $0.01/min) |
| Vapi | A | Not publicly documented at headline level | Vendor publishes a pricing page; verifiable at vapi.ai |
| Synthflow / Bland | A | Not publicly documented at headline level | |
| Cognigy Voice Gateway | A | Not publicly documented (demo-gated) | |
| OpenAI Realtime API | B | Not publicly documented at a per-minute headline (token-based pricing at platform.openai.com) | |
| Deepgram Voice Agent | B | $4.50/hr flat with full stack (deepgram.com/product/voice-agent-api) — equivalent to $0.075/min | Reduced rates with bring-your-own-model |
| Google Dialogflow CX / Microsoft Voice Bot | B | Per-request and per-minute pricing on respective cloud pricing pages | Token + audio + STT/TTS components priced separately |
| Suki / DeepScribe / Abridge | C | Per-clinician seat, not per-minute; demo-gated | Different problem |
| Hyro | C | Not publicly documented (demo-gated) | |
| Genesys / NICE / Five9 / Talkdesk | D | Per-seat/agent pricing, voice-bot add-on per-minute; demo-gated | |
| LiveKit Agents / Pipecat / Vocode | E | Free (open-source); cost is vendor-passthrough (STT/LLM/TTS) + infrastructure | |
| ZOL Voice Stack | — | ~$8.70/month at projected 25K queries/month (internal cost-tracking, see performance/overview cost table). Per-tenant marginal cost is dominated by LLM token spend, not infrastructure (self-hosted Twilio + LiveKit). | At 25K queries × 45 s average call duration (18,750 minutes, or 312.5 hours, per month), the headline-equivalent of $8.70/month is ~$0.03/hour or ~$0.0005/min. This is not directly comparable to vendor per-minute pricing because (a) our number excludes infrastructure depreciation and engineer-time amortisation, and (b) vendor numbers typically bundle STT + LLM + TTS into the headline rate. A like-for-like cost comparison is on the Phase 5 list. |
The honest framing: direct $/minute comparison across vendors is misleading without a normalisation pass that we have not yet run. Headline rates bundle different components. Our $8.70/month figure is internal cost-tracking, not a vendor-equivalent rate. The Phase 5 commitment in §4 includes building the normalised cost-comparison spreadsheet.
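The unit conversion behind a per-minute equivalent is simple enough to check by hand. The sketch below uses only figures already stated in this section (the $8.70 monthly figure, 25K queries, 45 s average call duration, and Deepgram's $4.50/hr); it performs no normalisation, so the two rates it prints remain non-comparable for the reasons given above.

```python
monthly_cost_usd = 8.70      # internal cost-tracking figure from the table
queries_per_month = 25_000   # projected volume
avg_call_seconds = 45        # average call duration

call_minutes = queries_per_month * avg_call_seconds / 60
usd_per_minute = monthly_cost_usd / call_minutes
usd_per_hour = usd_per_minute * 60

# Deepgram's published flat rate, converted to the same unit.
deepgram_usd_per_minute = 4.50 / 60

print(f"call minutes per month: {call_minutes:,.0f}")
print(f"ZOL internal figure:    ${usd_per_minute:.6f}/min (${usd_per_hour:.4f}/hour)")
print(f"Deepgram headline:      ${deepgram_usd_per_minute:.3f}/min")
```

This works out to 18,750 call minutes, about $0.000464 per minute and $0.0278 per hour for the internal figure, against $0.075 per minute for the Deepgram headline rate. The size of that ratio is itself a signal that the two numbers measure different things.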
2.8 Provenance
| Vendor | Tier | Citations on answers | Chunk-id traceability | Deletion compliance | Audit-log retention |
|---|---|---|---|---|---|
| Retell AI | A | Not publicly documented as a feature | Not publicly documented | Per-account data deletion (docs.retellai.com) | Not publicly documented |
| Vapi | A | Not publicly documented as a feature | Not publicly documented | Per-account | Not publicly documented |
| Synthflow / Bland | A | Not publicly documented | Not publicly documented | Per-account | Not publicly documented |
| Cognigy Voice Gateway | A | Per-flow knowledge-source attribution exists (cognigy.com); chunk-id detail not publicly documented at engineering depth | Not publicly documented at engineering depth | Per-account | Not publicly documented |
| OpenAI Realtime API | B | None — LLM-only product | None — no retrieval surface | OpenAI account deletion | OpenAI default |
| Deepgram Voice Agent | B | Not publicly documented as a feature | Not publicly documented | Per-account | Not publicly documented |
| Google Dialogflow CX | B | Knowledge-base attribution exists; engineering depth not publicly documented | Not publicly documented at engineering depth | Per-project | Cloud Logging retention |
| Microsoft Voice Bot / Azure AI Speech | B | Not publicly documented at engineering depth | Not publicly documented at engineering depth | Per-project | Azure Monitor retention |
| Suki / DeepScribe / Abridge | C | Note-citation features in clinician scribes; not patient-facing | Not applicable | Per-customer | Vendor-specific |
| Hyro | C | Not publicly documented at engineering depth | Not publicly documented | Not publicly documented | Not publicly documented |
| Genesys / NICE / Five9 / Talkdesk | D | Knowledge-base attribution in some flows; engineering depth not publicly documented | Not publicly documented at engineering depth | Per-tenant | Per-tenant |
| LiveKit Agents / Pipecat / Vocode | E | Depends on integrator | Depends on integrator | Depends on integrator | Depends on integrator |
| ZOL Voice Stack | — | Citations on every substantive answer — chunk-derived for voice (no inline [N] markers) and marker-derived for chat. Pipeline in voice/citation-pipeline and Voice Stack Compendium §3. | Per-chunk traceability to source document_chunks row, including page number and document URL. The citation extractor is a three-helper cascade documented after the 2026-05-07 silent-failure regression; see silent-failure discipline R1/R2/R3. | GDPR Art. 17 right-to-erasure mapped to deletion of conversation rows + audit-log retention exception. See safety/data-retention-policy. | Audit logs retained per documented policy with audit-log retention exception under GDPR Art. 17(3)(e). See safety/data-retention-policy. |
The salient observation: citation-grounded retrieval as a per-turn feature is not a routine vendor capability. Most vendors expose retrieval-augmented generation; few expose chunk-id traceability and per-turn citation pipelines at engineering depth. The lineage is Lewis et al. 2020 for the RAG architecture and Gao et al. 2024 for the modular-RAG taxonomy that places our Value Framework among orchestrated retrieval modules.
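Chunk-id traceability amounts to resolving every chunk an answer used back to its source row. Here is a minimal sketch, assuming a hypothetical `ChunkRow` shape with the page number and document URL mentioned in the table; it is not the three-helper cascade from the citation pipeline. It fails loudly when a chunk id cannot be resolved rather than dropping the citation silently.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class ChunkRow:
    chunk_id: int
    document_url: str
    page_number: int


def citations_for(used_chunk_ids: Iterable[int], rows: Dict[int, ChunkRow]) -> List[ChunkRow]:
    """Resolve chunk ids to source rows, de-duplicated by (URL, page)."""
    seen = set()
    citations: List[ChunkRow] = []
    for chunk_id in used_chunk_ids:
        if chunk_id not in rows:
            # Surface the gap instead of returning an uncited answer.
            raise KeyError(f"chunk {chunk_id} has no source row")
        row = rows[chunk_id]
        key = (row.document_url, row.page_number)
        if key not in seen:
            seen.add(key)
            citations.append(row)
    return citations


# Made-up rows for illustration.
rows = {
    1: ChunkRow(1, "https://example.org/visiting-hours", 1),
    2: ChunkRow(2, "https://example.org/visiting-hours", 1),
    3: ChunkRow(3, "https://example.org/parking", 2),
}
print([c.chunk_id for c in citations_for([1, 2, 3], rows)])
```

Chunks 1 and 2 share a URL and page, so the caller hears one citation for that page rather than two.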
2.9 Cell-summary statistics
| Axis | ZOL cells: verified | Competitor cells: verified | Competitor cells: not publicly documented / not measured |
|---|---|---|---|
| Latency (TTFT, end-to-end) | 0 verified, 2 "not yet measured" | 1 (Vapi end-to-end) | 35 |
| Multilingual (languages, switching) | 2 verified | 4 verified, partial | 32 |
| Domain depth (4 sub-axes) | 4 verified | 0 | 64 (mostly "not publicly documented") |
| Customization (4 sub-axes) | 4 verified | ~6 partial | ~54 |
| Observability (4 sub-axes) | 4 verified | ~4 partial | ~56 |
| Compliance (4 sub-axes) | 4 verified | ~12 partial (DPA frameworks, HIPAA claims) | ~50 |
| Cost (1 axis) | 1 verified | 2 verified (Retell, Deepgram) | 16 |
| Provenance (4 sub-axes) | 4 verified | ~6 partial | ~58 |
Reading the totals: most cells in the matrix are blank because vendors do not publish the engineering depth that the comparison requires. Our blanks (the latency cells marked "not yet measured") are commitments to backfill in pilot Phase 5; competitor blanks are usually not commitments to publish at all. This is the matrix's core honest finding — competitive positioning at the engineering depth our buyer cares about is mostly a research exercise on the buyer side, because vendor marketing pages do not publish the answers.
3. Honest gap analysis
This section is deliberately unflattering. Three gaps shape the snapshot, and each is a place where a buyer would correctly say "competitor X is ahead of you on this axis." We name the deficit, the cost, and the closing-the-gap commitment in §4.
3.1 Infrastructure reliability — managed hyperscaler stacks have one less moving part
The OpenAI Realtime API and Deepgram Voice Agent are managed services; the hyperscaler runs the infrastructure, the integrator runs the prompt. Our stack is self-hosted Twilio Elastic SIP Trunk + LiveKit SIP + LiveKit Server + LiveKit Agents on a single pilot server, with multiple subprocessors of record (OpenAI, Deepgram, ElevenLabs) under GDPR Art. 28. At pilot scale (≤25 K queries/month, single-region) this is operationally fine — see the runbook in ADR-0050 — but we have not yet measured uptime against a managed alternative on an apples-to-apples SLO basis.
The cost of this gap to a buyer: a CTO asking "what is your committed uptime SLO?" gets the honest answer "we have not yet built a multi-region failover and we have not yet posted a public SLO." A managed-stack vendor can answer that question with a number. We cannot, yet. The §4.1 line commits to backfilling pilot uptime measurement and posting an SLO in Q3 2026.
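To make the SLO question concrete, the arithmetic behind an uptime number can be sketched as follows. The 99.5 % target in the usage note is purely illustrative and is not a commitment made anywhere in this document; the function names are hypothetical.

```python
def availability(total_minutes: float, downtime_minutes: float) -> float:
    """Fraction of the measurement window during which the service was up."""
    if total_minutes <= 0:
        raise ValueError("total_minutes must be positive")
    if not 0 <= downtime_minutes <= total_minutes:
        raise ValueError("downtime_minutes must lie within the window")
    return 1.0 - downtime_minutes / total_minutes


def error_budget_minutes(total_minutes: float, slo_target: float) -> float:
    """Downtime the window can absorb before the SLO target is missed."""
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    return total_minutes * (1.0 - slo_target)
```

For a 30-day window (43,200 minutes), an illustrative 99.5 % target leaves roughly 216 minutes of error budget. A single-server deployment spends that budget on every maintenance restart, which is why the measurement has to precede the posted number.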
3.2 Conversational fluidity — Retell has shipped fine-grained barge-in tuning
Retell publishes a latency troubleshooting page and a barge-in tuning surface that exposes per-call interruption sensitivity, end-of-turn detection thresholds, and tunable response delay. Our voice agent has basic barge-in support via LiveKit Agents' semantic turn detection (livekit_agents_docs) but we have not yet tuned per-call barge-in sensitivity. In particular, the elderly demographic that dominates hospital helpdesk traffic frequently pauses mid-sentence in ways that current Voice Activity Detection tuning may treat as end-of-turn, so the agent starts responding before the caller has finished speaking.
The cost of this gap to a buyer: a hospital sponsor running a side-by-side smoke test with Retell will hear the Retell agent feel more "natural" on barge-in. This is a UX gap, not a correctness gap (our citations are still grounded, our safety is still enforced), but UX gaps shape buyer impressions. The §4.1 line commits to barge-in tuning improvements and to a documented per-tenant tuning surface in Q3 2026. The lineage of full-duplex voice systems is Lin et al. 2026; the conversational-analysis lineage is Sacks, Schegloff & Jefferson 1974.
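The mid-sentence-pause problem can be illustrated with a deliberately simplified silence-threshold detector. This is not LiveKit's semantic turn detection and not our agent's code; the function, its parameters, and the frame sizes are assumptions chosen to show why one global threshold is a poor fit for callers who pause.

```python
from typing import Optional


def end_of_turn_frame(
    vad_frames: list[bool], frame_ms: int, min_silence_ms: int
) -> Optional[int]:
    """Return the frame index at which end-of-turn fires, or None.

    vad_frames holds one boolean per audio frame (True means speech).
    End-of-turn fires once trailing silence after speech reaches
    min_silence_ms. Leading silence before any speech is ignored.
    """
    silence_ms = 0
    heard_speech = False
    for index, is_speech in enumerate(vad_frames):
        if is_speech:
            heard_speech = True
            silence_ms = 0
        elif heard_speech:
            silence_ms += frame_ms
            if silence_ms >= min_silence_ms:
                return index
    return None


# A caller speaks, pauses for 600 ms mid-sentence, resumes, then stops.
frames = [True] * 5 + [False] * 6 + [True] * 5 + [False] * 15
early = end_of_turn_frame(frames, frame_ms=100, min_silence_ms=500)
patient = end_of_turn_frame(frames, frame_ms=100, min_silence_ms=1200)
```

With a 500 ms threshold the detector fires inside the pause (`early` is frame 9); with 1,200 ms it waits until the caller has actually finished (`patient` is frame 27). A per-tenant or per-call knob trades that patience against response delay.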
3.3 Marketplace integrations — no Salesforce/HubSpot connectors
Genesys, NICE, Five9, and Talkdesk ship marketplace ecosystems with hundreds of pre-built connectors — Salesforce, HubSpot, Microsoft Dynamics, Zendesk, ServiceNow. A contact-center buyer who already lives inside a Salesforce CRM gets immediate value from a vendor whose voice bot can read and write Salesforce records. Our codebase has zero such connectors today. Our integration story is HTTP API + Postgres queries + DB-backed FAQ; none of those are marketplace connectors.
The cost of this gap to a buyer: the appointment-booking spinoff buyer who asks "can your agent log a follow-up task into our existing Salesforce instance?" gets the honest answer "not without engineering work." The §4.3 line commits to Salesforce and HubSpot connectors as the first two marketplace integrations in 2027 H1, contingent on pilot expansion to a tenant that needs them.
3.4 Two smaller gaps worth naming
Two further gaps are worth naming for completeness even if they do not justify roadmap lines on their own:
- Mid-call language switching — by ADR-0052 this is a deliberate trade-off, not a deficit, but a buyer comparing language-list cells in §2.2 should be told that Cognigy claims 100+ languages with mid-call translation while we lock at first utterance to preserve Flemish accuracy. Both choices are defensible; the buyer should know which one we made.
- No 24×7 operator-NOC dashboard — at pilot scale we have an operations dashboard (architecture/feedback-dashboard-metrics), but no on-call operator NOC. Contact-center incumbents ship a NOC. We do not, and at pilot scale we should not.
Neither of these is on the §4 roadmap; both are on this list so a buyer reading §3 has the complete honest picture.
4. Closing-the-gap roadmap
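Each item below maps to a specific gap in §3 or to an open cell in §2, as named in the "Maps to" column; the one adjacent item (the AI Act pathway in §4.3) is labelled as such. Items that do not map have been deleted; this list is the roadmap, not a wish list.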
4.1 Q3 2026 (Jul–Sep): SLA-grade pilot measurement + barge-in tuning
| Item | Maps to | Description |
|---|---|---|
| Pilot uptime SLO posting | §3.1 (infrastructure reliability) | Backfill three months of pilot uptime data; post a public SLO. Requires the per-stage histogram instrumentation that §2.1 marks as "not yet measured at p95 on pilot." |
| Latency cell backfill | §2.1 ("not yet measured" cells) | Replace dev p50 numbers in voice/architecture latency-budget table with pilot p95 numbers. Methodology is Beyer et al. 2016 tail-latency framing. |
| Barge-in tuning v1 | §3.2 (conversational fluidity) | Per-tenant Voice Activity Detection sensitivity tuning. Surface as a tenant-overlay knob in the YAML overlay (voice/tenant-overlay-system). |
| Faster TTFT via streaming TTS | §3.2 + §2.1 | Investigate the ElevenLabs streaming endpoint for first-audio reduction below 200 ms. |
| Like-for-like cost comparison spreadsheet | §2.7 (cost normalisation) | Normalise vendor headline rates against component-broken-out cost so the buyer-facing per-minute comparison is honest. |
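The difference between a p50 and a p95 latency cell can be sketched with a nearest-rank percentile. The sample values below are invented for illustration and are not measurements from the pilot or from dev.

```python
import math


def percentile_nearest_rank(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: smallest sample with at least p% at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


# Invented per-turn latencies in milliseconds: most turns fast, a slow tail.
latencies_ms = [180.0] * 18 + [900.0] * 2
p50 = percentile_nearest_rank(latencies_ms, 50)
p95 = percentile_nearest_rank(latencies_ms, 95)
```

Here `p50` is 180 ms while `p95` is 900 ms: the median hides the tail entirely. That is why the backfill item replaces p50 numbers rather than supplementing them.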
4.2 Q4 2026 (Oct–Dec): zero-shot mode + second-pilot deployment
| Item | Maps to | Description |
|---|---|---|
| Open-source intent-classifier benchmark | §2.4 (customization, intent affinity tuning) | Publish the 7-intent × 6-category affinity matrix as an evaluable benchmark. Methodology framed by Wohlin et al. 2012 experimentation in software engineering. |
| Zero-shot prompt mode | §2.4 (day-1 onboarding) | New tenant onboards by filling in a structured prompt template — no YAML overlay, no FAQ entries, just LLM + retrieval. Useful as a fast-mode for proof-of-concept tenants. |
| Second-pilot deployment | §2.4 (multi-tenant overlay validation) | Onboard a second hospital with only YAML overlay + DB rows; zero source-code commits to the codebase. Empirical proof of the multi-tenant onboarding architecture. |
| Diagnostic V2 metric publication | §2.5 (observability) | Publish per-dimension v2 diagnostic numbers as an internal benchmark. LLM-as-judge bias controls follow Zheng et al. 2023. |
4.3 2027 H1: marketplace connectors + multi-region
| Item | Maps to | Description |
|---|---|---|
| Salesforce connector | §3.3 (marketplace integrations) | First marketplace integration. Read + write Salesforce records via the agentic LLM's tool surface. |
| HubSpot connector | §3.3 | Second marketplace integration. |
| Multi-region deploy | §3.1 (infrastructure reliability, second-region failover) | Add a second-region pilot deployment + DNS-level failover. |
| AI Act high-risk pathway documentation | (not §3, but adjacent) | If any future feature crosses the high-risk threshold (clinical decision support, scheduling that materially affects care delivery), we already have the limited-risk memo on file (safety/ai-act-compliance); the high-risk pathway is the next step. MDR 2017/745 is the medical-device adjacency. |
The roadmap totals thirteen items across three time horizons, each with an explicit "Maps to" entry. No marketing-roadmap padding.
5. Why-us summary
Four differentiators carry the weight here. Each is named with concrete evidence; the engineering depth lives at the cross-link.
5.1 Domain depth — the Value Framework + safety triple-defense
The Value Framework (voice/value-framework) is a 7-intent × 6-category affinity matrix that prevents cross-category contamination: a wheelchair-accessibility query gets a parking answer, not an orthopaedic-reimbursement answer. The safety triple-defense (voice/triple-defense) layers three defenses (regex pre-filter, LLM-side prompting, regex post-filter) and adds a post-LLM disclaimer prepender; multi-language regex packs cover nl/en/fr/it (adversarial hardening). Empirical evidence: 100 % pass rate on the 14-question safety-refusal cohort and 12-question adversarial-GCG cohort (thesis Chapter 4, Table 4.1, citing Zou et al. 2023). The Llama Guard line of work (Inan et al. 2023) is the academic adjacency for LLM-output safety; our regex post-filter is a deterministic complement.
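A toy version of an intent-by-category affinity matrix shows the mechanism. The intent names, category names, weights, and the `min_affinity` cut-off are all hypothetical; the real matrix is 7 × 6 and lives in voice/value-framework.

```python
# Hypothetical 2 x 3 slice of an intent-by-category affinity matrix.
AFFINITY: dict[str, dict[str, float]] = {
    "accessibility": {"parking": 0.9, "reimbursement": 0.1, "visiting_hours": 0.4},
    "billing": {"parking": 0.1, "reimbursement": 0.9, "visiting_hours": 0.1},
}


def rerank(
    intent: str, hits: list[tuple[str, float]], min_affinity: float = 0.3
) -> list[tuple[str, float]]:
    """Weight retrieval hits by intent-category affinity.

    hits is a list of (category, retrieval_score). Categories whose affinity
    for the intent falls below min_affinity are dropped; the rest are
    re-scored as retrieval_score * affinity and sorted best-first.
    """
    row = AFFINITY[intent]
    weighted = [
        (category, score * row.get(category, 0.0))
        for category, score in hits
        if row.get(category, 0.0) >= min_affinity
    ]
    return sorted(weighted, key=lambda hit: hit[1], reverse=True)


hits = [("reimbursement", 0.95), ("parking", 0.80)]
kept = rerank("accessibility", hits)
```

Even though the reimbursement chunk has the higher raw retrieval score, only the parking hit survives for an accessibility intent. That is the cross-category contamination the matrix is meant to prevent.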
5.2 Provenance + observability — citations + diagnostic V2 + Operations dashboard
Every substantive answer carries chunk-derived citations (voice/citation-pipeline). The diagnostic V2 endpoint scores per-turn correctness, safety, memory, tool-use, and latency on a documented rubric, with LLM-as-judge bias controls per Zheng et al. 2023. The Operations dashboard (architecture/feedback-dashboard-metrics) renders per-tenant trend charts on Category Mismatch and Diagnostic Accuracy. The architectural lineage is Lewis et al. 2020 for RAG and Gao et al. 2024 for modular RAG.
5.3 Multi-tenant SaaS architecture — zero-source-change onboarding
Tenant identity is bound to the Keycloak JWT claim, not to a header — cryptographically resolved per request (architecture/multi-tenancy). The two-plane configuration (DB-driven for web/RAG + YAML overlay for voice) puts slow-moving voice content in version control and fast-moving crawl rules in DB rows that platform admins edit through the API. The architectural lineage is Bezemer & Zaidman 2010 shared-schema multi-tenant SaaS taxonomy. The zero-source-change invariant is the architectural target; pilot Phase 5 commits to empirical verification via second-pilot deployment.
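The claim-not-header rule can be sketched in a few lines. The claim name `tenant_id` is an assumption, and signature verification of the Keycloak token is assumed to have happened upstream; this function only shows which input is trusted.

```python
def resolve_tenant(verified_claims: dict, headers: dict) -> str:
    """Resolve the tenant from verified token claims only.

    headers is accepted and deliberately never read: a client-supplied
    header such as X-Tenant must not be able to select a tenant.
    The claim name "tenant_id" is hypothetical.
    """
    tenant = verified_claims.get("tenant_id")
    if not isinstance(tenant, str) or not tenant:
        raise PermissionError("token carries no usable tenant claim")
    return tenant
```

A request whose header names a different tenant than its token still resolves to the token's tenant, and a token with no tenant claim is rejected rather than defaulted.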
5.4 Engineering rigor — 50+ ADRs, 62-entry verified bibliography, silent-failure discipline
Architectural decisions live in the ADR series (decisions is a representative example); the bibliography (references) has 62 verified entries with last-verified dates and one-line summaries. The silent-failure discipline (R1: log size on collection-returning functions; R2: regression test for every silent-failure branch; R3: contract test for cross-component shared state) was codified after a real-world voice-history regression on 2026-05-07 — see the project's CLAUDE.md for the canonical writeup. The thesis (thesis Chapter 4) ships the empirical evidence for every quantitative claim. Software-engineering experimentation methodology is Wohlin et al. 2012; SRE practice for tail-latency is Beyer et al. 2016; software-craftsmanship practice is Martin 2017.
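One way to read rule R1 (log size on collection-returning functions) is as a decorator. The sketch below is an interpretation, not the project's implementation; the logger name and the choice to warn on empty results are assumptions.

```python
import functools
import logging

logger = logging.getLogger("silent_failure")  # hypothetical logger name


def log_collection_size(func):
    """Log the size of whatever collection func returns.

    An empty result is logged at WARNING so that a silently empty history
    or result list shows up in logs instead of passing unnoticed.
    """
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        result = func(*args, **kwargs)
        try:
            size = len(result)
        except TypeError:
            logger.warning("%s returned a value without a length", func.__name__)
            return result
        if size == 0:
            logger.warning("%s returned an empty collection", func.__name__)
        else:
            logger.info("%s returned %d items", func.__name__, size)
        return result
    return wrapper


@log_collection_size
def load_history(rows: list[str]) -> list[str]:
    """Stand-in for a collection-returning function."""
    return [row for row in rows if row]
```

The decorator does not change the return value; it only makes the "returned nothing" branch observable, which is the branch R2 then asks for a regression test on.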
The single sentence
We are the only stack on this list that combines patient-facing voice with citation-grounded retrieval, multi-tenant overlay onboarding, GDPR Art. 35 DPIA + AI Act Art. 50 limited-risk classification on file, and an empirically measured 99.0 % pass rate against a 302-question regulated-domain benchmark. The roadmap in §4 closes the three honest gaps without giving up the four differentiators.
6. Methodology and caveats
6.1 How the matrix was built
Public-material research was the primary source. For each vendor and each cell:
- The vendor's documentation page, pricing page, or product page was fetched. Vendor URLs are in §2 inline.
- If the page stated the answer, the cell got the answer + the URL.
- If the page did not state the answer in five minutes of reading, the cell was marked "Not publicly documented".
- Inferred numbers were forbidden. If the buyer asks "where did you get this number?" we point at a URL, not a guess.
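The cite-or-blank rule described in the list above can be sketched as a small data model that refuses an unsourced value. The class, constants, and `tally` helper are hypothetical names for illustration and do not describe tooling that exists in the repository.

```python
from dataclasses import dataclass
from typing import Optional

NOT_DOCUMENTED = "Not publicly documented"
NOT_MEASURED = "Not yet measured"
BLANKS = {NOT_DOCUMENTED, NOT_MEASURED}


@dataclass(frozen=True)
class Cell:
    """One matrix cell: either a blank marker or a value with a source URL."""
    value: str
    source_url: Optional[str] = None

    def __post_init__(self) -> None:
        if self.value not in BLANKS and not self.source_url:
            raise ValueError("a stated value needs a source URL")


def tally(cells: list[Cell]) -> tuple[int, int]:
    """Return (verified, blank) counts for a list of cells."""
    blank = sum(1 for cell in cells if cell.value in BLANKS)
    return len(cells) - blank, blank
```

Constructing `Cell("some number")` without a URL raises immediately, so an inferred number cannot enter the matrix by accident. `tally` yields the kind of verified-versus-blank counts reported in §2.9.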
Our own cells were sourced from:
- The audit ledger under `docs/audits/2026-05-09-*.md` (drift register, source-of-truth for "what we have")
- The thesis (thesis Chapter 4) for empirical golden-eval numbers
- The fast-gate threshold study (`backend/scripts/revalidate_fast_gate_threshold.json`)
- The ADR series for architectural decisions
- The Voice Stack Compendium for engineering-grade claims
- The references bibliography for academic claims (62 verified entries)
Where our own number is not yet measured, the cell says not yet measured with a Phase-5 commitment. Same discipline as for competitors.
6.2 Caveats
The matrix is a snapshot. Vendors ship; the matrix decays. Specific decay risks:
- Pricing pages move quickly. Retell's per-minute components (retellai.com/pricing) and Deepgram's flat-rate Voice Agent (deepgram.com/product/voice-agent-api) were verified on 2026-05-09–10. Both will likely shift before Q3 2026.
- Vendor language lists expand. Retell's multilingual page lists nl/en/es/fr/de/hi/ru/pt/jp/it as of verification; this can grow.
- Hyperscaler features land continuously. OpenAI, Deepgram, Google, and Microsoft ship voice features on multi-week cadences; the cells marked "not publicly documented" today may be documented next quarter.
- Engineering-depth competitive material is rarely public. Most blank cells exist because vendors do not publish the engineering depth our buyer cares about. A competitive-procurement reviewer should ask each vendor to produce their equivalent of our DPIA, AI Act memo, and Voice Stack Compendium.
6.3 Refresh cadence
This matrix is timestamped 2026-05 in the URL slug. The next refresh is committed to Q3 2026 alongside the pilot uptime SLO posting (§4.1). Each refresh will:
- Re-verify every vendor URL (redirect, removal, content drift)
- Replace "not yet measured" cells in the ZOL row with measured numbers as pilot Phase 5 backfills land
- Add new tier rows if a new vendor category emerges (e.g., agentic-voice frameworks beyond LiveKit Agents / Pipecat / Vocode)
- Strike vendors that have exited the market or pivoted out of voice
The matrix is honest about being a snapshot. A buyer reading this in 2027 H1 should refresh — or ask us to refresh — before relying on any specific cell.
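The URL re-verification step in the refresh list can be sketched as a pure classification function. It makes no network calls; the caller is assumed to have fetched the page and to pass in the status code, the final URL after redirects, and the body. The names and the whole-body hashing strategy are assumptions.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    """What was recorded when the cell was last verified."""
    url: str
    content_sha256: str


def snapshot_of(url: str, body: bytes) -> Snapshot:
    return Snapshot(url, hashlib.sha256(body).hexdigest())


def classify(snapshot: Snapshot, status: int, final_url: str, body: bytes) -> str:
    """Classify a re-fetch as removed, error, redirect, content drift, or unchanged."""
    if status in (404, 410):
        return "removed"
    if status != 200:
        return "error"
    if final_url != snapshot.url:
        return "redirect"
    if hashlib.sha256(body).hexdigest() != snapshot.content_sha256:
        return "content drift"
    return "unchanged"
```

Hashing the whole body flags any change, including cosmetic ones, so a "content drift" result is a prompt for a human to re-read the page, not proof that the cited fact changed.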
7. References
This document cites the following bibliography keys (see references for full entries):
- Lewis et al. 2020 — RAG architecture
- Gao et al. 2024 — Modular-RAG taxonomy
- Beyer et al. 2016 — SRE tail-latency SLO practice
- Nielsen 1993 — Response-time UX thresholds
- Bezemer & Zaidman 2010 — Multi-tenant SaaS architecture
- Zheng et al. 2023 — LLM-as-judge bias controls
- Inan et al. 2023 — Llama Guard / LLM-output safety
- Zou et al. 2023 — Greedy Coordinate Gradient adversarial benchmark
- Wohlin et al. 2012 — Software-engineering experimentation methodology
- Sacks, Schegloff & Jefferson 1974 — Conversational turn-taking
- Lin et al. 2026 — Full-duplex voice benchmark
- Martin 2017 — Software-craftsmanship practice
- GDPR, AI Act, MDR — EU regulatory texts
- HLEG 2019 — Trustworthy-AI ethics guidelines
- ISO 27001:2022, ISO 27018:2019 — Information-security management standards
- LiveKit Agents, Deepgram Nova-3, ElevenLabs Multilingual v2 — Vendor stack components
Vendor product pages are cited inline by URL. They are intentionally not bibliography entries because they are product-marketing material, not academic work.