SOTA Positioning Matrix (May 2026)

A competitive analysis of the ZOL voice stack against twenty vendors across eight comparison axes, written at engineer-buyer register. Every cell is either backed by a public source (vendor doc URL, public benchmark, or our own measured number with a source-of-truth pointer) or marked explicitly as not measured / not publicly documented. Inferred numbers are forbidden: if a buyer asks "where did you get this number?" we point at a URL or a source-of-truth pointer.

This document is a snapshot. Vendors ship; the matrix decays. The methodology section at the end describes the refresh cadence and the cite-or-blank discipline.

1. Executive summary

The voice-AI market in May 2026 is split across five tiers. Voice-AI specialists like Retell (retellai.com), Vapi (vapi.ai), Synthflow, Bland, and Cognigy Voice Gateway (cognigy.com/products/voice-gateway) compete on conversational fluidity, barge-in tuning, and developer ergonomics. Hyperscalers like the OpenAI Realtime API (platform.openai.com/docs/guides/realtime), Deepgram Voice Agent (deepgram.com/product/voice-agent-api), Google Dialogflow CX, and Microsoft Voice Bot compete on infrastructure scale and enterprise compliance certifications. Healthcare-specific vendors like Suki, DeepScribe, Abridge, and Hyro compete on a different axis — most are clinician-facing scribes, not patient-facing voice search; the closest analogue is Hyro. Contact-center incumbents like Genesys Cloud CX, NICE CXone, Five9, and Talkdesk compete on enterprise integrations, marketplace ecosystems, and call-routing maturity. Open-source baselines like LiveKit Agents (raw, github.com/livekit/agents), Pipecat (github.com/pipecat-ai/pipecat), and Vocode are reference implementations that ship the runtime but not the cognition.

The ZOL voice stack competes with all five tiers on different axes. Against the voice-AI specialists we compete on domain depth and provenance — our retrieval pipeline, citation discipline, and multi-language safety architecture are not features they offer out of the box. Against the hyperscalers we compete on honesty and observability — our per-turn telemetry, citation-grounded answers, and documented LLM-as-judge bias controls (Zheng et al. 2023) describe a system that knows when it is wrong; their managed services do not surface that signal at the same granularity. Against the healthcare-specific vendors we compete on scope clarity — our system is not a clinical scribe, not clinical decision support, and the architecture is shaped by that negative scope (see thin voice architecture). Against the contact-center incumbents we compete on engineering rigor and time-to-onboard — our multi-tenant overlay system (architecture/multi-tenancy) admits a new hospital with zero source-code changes; their integrations are powerful but heavyweight. Against the open-source baselines we compete on the layer above the runtime — they ship LiveKit-equivalent plumbing, we ship the seven-layer stack documented in the Voice Stack Compendium.

Three honest gaps shape this snapshot. Infrastructure reliability — managed hyperscaler stacks have one less moving part than a self-hosted Twilio + LiveKit deployment, and at our scale (≤25 K queries/month, single-region pilot) we have not yet measured uptime against a managed alternative. Conversational fluidity — Retell has shipped fine-grained barge-in tuning (docs.retellai.com) that we have not yet matched at the per-turn level. Marketplace integrations — Salesforce, HubSpot, and Microsoft Dynamics connectors that contact-center incumbents ship out of the box are not in our codebase. Section 3 describes each gap in detail; Section 4 commits each to a time-boxed roadmap line.

The single-sentence answer to "why us" is this: we are the only stack on this list that combines patient-facing voice with citation-grounded retrieval, multi-tenant overlay onboarding, GDPR Art. 35 DPIA + AI Act Art. 50 limited-risk classification on file, and an empirically measured 99.0 % pass rate against a 302-question regulated-domain benchmark (thesis Chapter 4, Table 4.1). Every other vendor in the matrix is missing at least one of those four. The roadmap in Section 4 closes the three honest gaps without giving up the four differentiators.

2. Per-axis comparison matrix

The matrix uses one row per vendor, grouped by tier, with a final ZOL row. Each cell is one of:

  • A specific number, language list, or feature claim with an inline URL or a [file:line] source-of-truth pointer
  • Not publicly documented — vendor material does not state this, and inference is forbidden
  • Not yet measured — we have not benchmarked this against the vendor; pilot Phase 5 will backfill some of these
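The three cell kinds above can be enforced mechanically. The following is a minimal sketch, assuming a hypothetical `Cell` type (the names `Cell` and `CellKind` are illustrative and not taken from the ZOL codebase): a sourced cell must carry both a value and a pointer, and a blank cell must not carry a value at all.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class CellKind(Enum):
    SOURCED = "sourced"                          # claim backed by a URL or [file:line] pointer
    NOT_DOCUMENTED = "not publicly documented"   # vendor material is silent
    NOT_MEASURED = "not yet measured"            # we have not benchmarked this


@dataclass(frozen=True)
class Cell:
    kind: CellKind
    value: Optional[str] = None    # claim text, only allowed on SOURCED cells
    source: Optional[str] = None   # URL or [file:line] pointer

    def __post_init__(self) -> None:
        if self.kind is CellKind.SOURCED:
            if not (self.value and self.source):
                raise ValueError("a sourced cell needs both a value and a source pointer")
        elif self.value is not None:
            raise ValueError("a blank cell must not carry an inferred value")


# Two cells taken from the latency table in section 2.1.
vapi_latency = Cell(CellKind.SOURCED, "500-700 ms voice-to-voice", "docs.vapi.ai/quickstart")
retell_latency = Cell(CellKind.NOT_DOCUMENTED)
```

A check of this shape makes the cite-or-blank rule a property of the data rather than a review convention: a number without a pointer cannot be constructed.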

The five tiers are framed at the head of axis 1; subsequent axes reuse the same vendor groupings.

2.1 Latency

Tier A — Voice-AI specialists: vendors that build and sell a turnkey voice-agent platform. Their value proposition is "give us your prompt, we run the call." They compete on conversational fluidity and developer ergonomics.

Tier B — Hyperscalers: speech-and-LLM providers (OpenAI, Deepgram, Google, Microsoft) that expose voice-agent or realtime-API surfaces. Their value proposition is "infrastructure scale and compliance posture."

Tier C — Healthcare-specific: vendors selling into healthcare. Three of the four (Suki, DeepScribe, Abridge) are clinician-facing scribes and therefore not direct competitors on the patient-search axis; Hyro is the only patient-facing analogue.

Tier D — Contact-center incumbents: enterprise CCaaS platforms with voice-bot extensions. Their value proposition is integrations and routing maturity.

Tier E — Open-source baselines: frameworks (LiveKit Agents raw, Pipecat, Vocode) that ship the runtime without cognition or domain logic.

| Vendor | Tier | TTFT (time to first audio) | End-to-end turn latency |
|---|---|---|---|
| Retell AI | A | Not publicly documented | Not publicly documented (vendor publishes a latency troubleshooting page but not a target SLO) |
| Vapi | A | Not publicly documented | 500–700 ms voice-to-voice (docs.vapi.ai/quickstart) |
| Synthflow | A | Not publicly documented | Not publicly documented |
| Bland AI | A | Not publicly documented | Not publicly documented |
| Cognigy Voice Gateway | A | Not publicly documented | Not publicly documented (vendor cites "99.7 % intent recognition" and "25K+ concurrent conversations" but no end-to-end latency target; cognigy.com/products/voice-gateway) |
| OpenAI Realtime API | B | Not publicly documented at a target SLO | Not publicly documented at a target SLO |
| Deepgram Voice Agent | B | Not publicly documented | Not publicly documented (vendor markets "real-time responsiveness" without published p50/p95) |
| Google Dialogflow CX | B | Not publicly documented | Not publicly documented |
| Microsoft Voice Bot / Azure AI Speech | B | Not publicly documented | Not publicly documented |
| Suki AI | C | N/A (clinician scribe, no caller-facing turn loop) | N/A |
| DeepScribe | C | N/A (clinician scribe) | N/A |
| Abridge | C | N/A (clinician scribe) | N/A |
| Hyro | C | Not publicly documented | Not publicly documented |
| Genesys Cloud CX | D | Not publicly documented | Not publicly documented |
| NICE CXone | D | Not publicly documented | Not publicly documented |
| Five9 | D | Not publicly documented | Not publicly documented |
| Talkdesk | D | Not publicly documented | Not publicly documented |
| LiveKit Agents (raw) | E | N/A (framework, depends on plugin choices; github.com/livekit/agents) | N/A |
| Pipecat | E | N/A (framework) | N/A |
| Vocode | E | N/A (framework) | N/A |
| ZOL Voice Stack | | Not yet measured at p95 on pilot; local-dev p50 of ElevenLabs first-audio is 200–400 ms (voice/architecture) | Not yet measured at p95 on pilot; local-dev stage budget targets ~5.5 s end-to-end on the chat channel (performance/overview). Voice-channel p95 is on the Phase-5 measurement list. |

Latency targets follow the Beyer et al. 2016 SRE practice of writing SLOs at the tail (p95, p99) rather than the mean. The underlying UX thresholds are from Nielsen 1993: 0.1 s for instantaneous feedback, 1 s for seamless flow, 10 s as the upper attention bound. Reading the table: of the 20 competitors, only one (Vapi) publishes a numeric latency target. The other 19 either do not document a target, sell a clinician scribe with no caller-facing turn loop, or sell a framework that pushes the latency question down to the integrator. This is informative on its own: competitive latency is mostly a marketing claim, not a published number.
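To make the tail-versus-mean point concrete, here is a minimal sketch that computes nearest-rank percentiles over a set of turn latencies and maps a latency onto the Nielsen thresholds. The sample values are made up for illustration and are not pilot measurements; the function names are hypothetical.

```python
import math

# Nielsen 1993 thresholds, in seconds, as quoted in the text above.
NIELSEN_THRESHOLDS_S = {
    "instantaneous": 0.1,
    "seamless flow": 1.0,
    "attention bound": 10.0,
}


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p % of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def nielsen_band(latency_s):
    """Return the first Nielsen band whose limit the latency does not exceed."""
    for label, limit in NIELSEN_THRESHOLDS_S.items():
        if latency_s <= limit:
            return label
    return "beyond attention bound"


# Illustrative (made-up) end-to-end turn latencies in seconds.
turn_latencies_s = [0.6, 0.7, 0.8, 0.9, 1.1, 1.2, 1.4, 2.0, 3.5, 9.0]

mean_s = sum(turn_latencies_s) / len(turn_latencies_s)
p50_s = percentile(turn_latencies_s, 50)
p95_s = percentile(turn_latencies_s, 95)
print(f"mean={mean_s:.2f}s p50={p50_s}s p95={p95_s}s band(p95)={nielsen_band(p95_s)}")
```

In this made-up sample the mean and median both sit near the low end while the p95 is several times larger, which is why an SLO written at the mean would hide the turns that callers actually notice.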

2.2 Multilingual

| Vendor | Tier | Languages | Mid-call switching policy |
|---|---|---|---|
| Retell AI | A | At least nl, en, es, fr, de, hi, ru, pt, ja, it (docs.retellai.com/agent/multilingual); vendor notes per-voice subsets | Not publicly documented at a single policy level |
| Vapi | A | Not publicly documented at a complete list | Not publicly documented |
| Synthflow | A | Not publicly documented | Not publicly documented |
| Bland AI | A | Not publicly documented | Not publicly documented |
| Cognigy Voice Gateway | A | "100+ languages" with built-in machine translation (cognigy.com/products/voice-gateway) | Not publicly documented |
| OpenAI Realtime API | B | Not publicly documented at a complete list | Not publicly documented |
| Deepgram Voice Agent | B | Not publicly documented at the agent level (Nova-3 STT supports multiple languages; deepgram.com) | Not publicly documented |
| Google Dialogflow CX | B | Not publicly documented at this granularity | Not publicly documented |
| Microsoft Voice Bot / Azure AI Speech | B | Not publicly documented at this granularity | Not publicly documented |
| Hyro | C | Not publicly documented | Not publicly documented |
| Suki / DeepScribe / Abridge | C | N/A (scribe, not voice agent) | N/A |
| Genesys / NICE / Five9 / Talkdesk | D | Not publicly documented at this granularity | Not publicly documented |
| LiveKit Agents (raw) | E | Depends on STT/TTS plugin choice (livekit_agents_docs) | Depends on integrator |
| Pipecat | E | Depends on integrator | Depends on integrator |
| Vocode | E | Depends on integrator | Depends on integrator |
| ZOL Voice Stack | | nl, en, fr, it (production-validated); Dutch (Flemish) is primary, see voice/language-locking | Locked at first STT-confirmed utterance for the duration of the call (ADR-0052). Mid-call switching is explicitly traded away to preserve Flemish acoustic accuracy after two empirical pilot regressions documented in the ADR. |

The salient observation: most vendors' language lists are not published at agent-product granularity. Retell publishes the longest verifiable list; Cognigy claims the most ("100+") via machine translation. Our four are fewer in count but each is production-tuned with safety regex packs (see §2.3). The locked-at-first-utterance policy is a deliberate trade-off, not a limitation — multi-language Deepgram measurably degrades Flemish accuracy, per the empirical evidence in ADR-0052.
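The locked-at-first-utterance policy can be sketched as a small state holder. This is an illustration of the policy as described here, not the ADR-0052 implementation: the class name, the `stt_confirmed` flag, and the fallback to Dutch before the lock is set are all assumptions.

```python
from typing import Optional

SUPPORTED_LANGUAGES = ("nl", "en", "fr", "it")  # the four languages in the ZOL row
PRIMARY_LANGUAGE = "nl"                         # assumed fallback before the lock is set


class CallLanguageLock:
    """Holds the call language once the first STT-confirmed utterance fixes it."""

    def __init__(self) -> None:
        self._locked: Optional[str] = None

    def observe(self, detected: str, stt_confirmed: bool) -> str:
        """Return the language to use for this utterance, locking on first confirmation."""
        if self._locked is None and stt_confirmed and detected in SUPPORTED_LANGUAGES:
            self._locked = detected
        # Later detections are ignored: mid-call switching is traded away by design.
        return self._locked or PRIMARY_LANGUAGE


lock = CallLanguageLock()
print(lock.observe("fr", stt_confirmed=False))  # not confirmed yet, falls back to primary
print(lock.observe("fr", stt_confirmed=True))   # first confirmed utterance locks French
print(lock.observe("en", stt_confirmed=True))   # a later English detection does not switch
```

Keeping the lock in one object per call means the STT configuration can stay single-language for the whole call, which is the property the trade-off is meant to protect.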

2.3 Domain depth (regulated-healthcare voice)

| Vendor | Tier | Out-of-box healthcare safety | Medical-advice refusal | STT-mishearing awareness | Voice-channel safety architecture |
|---|---|---|---|---|---|
| Retell AI | A | Not publicly documented as a healthcare-specific feature | Not publicly documented | Not publicly documented | Not publicly documented |
| Vapi | A | Not publicly documented | Not publicly documented | Not publicly documented | Not publicly documented |
| Synthflow | A | Not publicly documented | Not publicly documented | Not publicly documented | Not publicly documented |
| Bland AI | A | Not publicly documented | Not publicly documented | Not publicly documented | Not publicly documented |
| Cognigy Voice Gateway | A | Healthcare listed as an industry vertical; specific features not documented (cognigy.com/products/voice-gateway) | Not publicly documented | Not publicly documented | Not publicly documented |
| OpenAI Realtime API | B | Not publicly documented | OpenAI safety policies apply at model level; agent-product surface not documented | Not publicly documented | Not publicly documented |
| Deepgram Voice Agent | B | Not publicly documented | Not publicly documented | Not publicly documented | Not publicly documented |
| Google Dialogflow CX | B | Not publicly documented | Not publicly documented | Not publicly documented | Not publicly documented |
| Microsoft Voice Bot / Azure AI Speech | B | Not publicly documented | Not publicly documented | Not publicly documented | Not publicly documented |
| Suki AI | C | Clinician scribe; HIPAA / SOC 2 (suki.ai) | Out of scope (not patient-facing) | N/A (clinician audio context, different problem) | N/A |
| DeepScribe | C | HIPAA, SOC 2 (deepscribe.ai) | Out of scope (clinician scribe) | N/A | N/A |
| Abridge | C | Enterprise healthcare claim (abridge.com); specific compliance certifications not on the homepage | Out of scope (clinician scribe) | N/A | N/A |
| Hyro | C | Healthcare-specific patient-facing assistant; specific safety architecture not publicly documented at engineering depth | Not publicly documented | Not publicly documented | Not publicly documented |
| Genesys / NICE / Five9 / Talkdesk | D | Healthcare verticals; specific safety architecture not publicly documented at engineering depth | Not publicly documented | Not publicly documented | Not publicly documented |
| LiveKit Agents / Pipecat / Vocode | E | None (frameworks, no domain) | None (depends on integrator) | None (depends on integrator) | None |
| ZOL Voice Stack | | Multi-language safety regex packs (nl/en/fr/it); see voice/triple-defense and adversarial hardening. Pattern set covers diagnostic, prescription, and dosage queries with a 100 % pass rate on the 14-question safety-refusal cohort and 12-question adversarial-GCG cohort (thesis Chapter 4, Table 4.1, citing Zou et al. 2023 for the GCG benchmark methodology). | Three-stage post-LLM disclaimer: automatic medical-content detection on the answer text + disclaimer prepender (voice/triple-defense). Disclaimer wording in nl/en/fr/it. | STT-mishearing aware: the language-locking ADR and the "Hoe wordt migraine behandeld?" / "Behandel ik migraine?" phoneme-pair example documented in the Voice Stack Compendium §1. | Regex pre-filter → agentic LLM with three tools → regex post-filter → answer-shaper → disclaimer prepender (voice/architecture). Architecture is the only voice path; legacy 8-stage VoiceOrchestrator was deleted in commit 158d793 (ADR-0049, ADR-0051). |

The salient observation: all four cells in our row are populated, while the competitor cells on this axis are almost all "not publicly documented" or "out of scope." The exceptions are compliance claims (HIPAA / SOC 2 from Suki and DeepScribe), healthcare vertical listings, and OpenAI's model-level safety policies, none of which describes a voice-channel safety architecture. This is the axis on which the architecture differentiates most cleanly. The clinician-scribe vendors (Suki, DeepScribe, Abridge) are not on this axis at all, because they sell into a different problem (clinician documentation). Hyro, the closest patient-facing analogue, does not document safety architecture at the engineering depth our compendium does. See Inan et al. 2023 for the Llama Guard lineage of LLM-output safety classifiers; our regex post-filter is a deterministic complement to that line of work.
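The deterministic part of the pipeline named in the ZOL row (regex pre-filter, LLM, regex post-filter, disclaimer prepender) can be sketched as follows. The patterns, refusal text, and disclaimer text are hypothetical English-only placeholders, the LLM is passed in as a plain callable, and the answer-shaper stage is omitted; the real nl/en/fr/it regex packs are not shown in this document.

```python
import re
from typing import Callable

# Hypothetical English-only patterns, standing in for the multi-language packs.
ADVICE_REQUEST = re.compile(r"\b(diagnos\w*|prescri\w*|dos(e|age))\b", re.IGNORECASE)
MEDICAL_CONTENT = re.compile(r"\b(treatment|medication|symptom\w*)\b", re.IGNORECASE)

REFUSAL = "I cannot give medical advice. Please contact your doctor."
DISCLAIMER = "This is general information, not medical advice. "


def answer_turn(question: str, llm: Callable[[str], str]) -> str:
    """Run one turn through pre-filter, LLM, post-filter, and disclaimer prepender."""
    if ADVICE_REQUEST.search(question):   # pre-filter: refuse before any LLM call
        return REFUSAL
    draft = llm(question)
    if ADVICE_REQUEST.search(draft):      # post-filter: catch advice that slipped through
        return REFUSAL
    if MEDICAL_CONTENT.search(draft):     # medical content detected: prepend the disclaimer
        return DISCLAIMER + draft
    return draft


# Stub LLMs for illustration only.
print(answer_turn("What dosage of ibuprofen should I take?", lambda q: "unused"))
print(answer_turn("Where is neurology?", lambda q: "Neurology offers treatment on floor 2."))
```

Because both filters are plain regular expressions, their behaviour on a fixed question set is reproducible, which is what makes a pass rate on a safety-refusal cohort a stable number.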

2.4 Customization (multi-tenant onboarding)

| Vendor | Tier | Per-tenant overlay | Intent affinity tuning | FAQ override | Day-1 onboarding effort |
|---|---|---|---|---|---|
| Retell AI | A | Per-account agent configuration (docs.retellai.com) | Not publicly documented at a tunable matrix level | Not publicly documented as a separate feature | Not publicly documented at minute-level |
| Vapi | A | Per-account agent configuration (docs.vapi.ai) | Not publicly documented | Not publicly documented | Not publicly documented |
| Synthflow / Bland | A | Not publicly documented at this granularity | Not publicly documented | Not publicly documented | Not publicly documented |
| Cognigy Voice Gateway | A | Per-account flow configuration | Not publicly documented at a tunable matrix level | Not publicly documented as a separate feature | Not publicly documented |
| OpenAI Realtime API | B | Per-API-key + system prompt; not multi-tenant out of box | None (LLM-only) | None (LLM-only) | Not applicable (building-block API) |
| Deepgram Voice Agent | B | Per-account configuration | Not publicly documented | Not publicly documented | Not publicly documented |
| Google Dialogflow CX / Microsoft Voice Bot | B | Per-project / per-resource isolation | Not publicly documented at a tunable matrix level | Per-intent override (Dialogflow) | Not publicly documented at minute-level |
| Hyro | C | Per-customer deployment | Not publicly documented at engineering depth | Not publicly documented at engineering depth | Not publicly documented |
| Suki / DeepScribe / Abridge | C | Per-clinic deployment; different problem | N/A | N/A | N/A |
| Genesys / NICE / Five9 / Talkdesk | D | Per-tenant routing + flow isolation | Not publicly documented at a tunable matrix level | Per-flow / per-skill | Not publicly documented at minute-level |
| LiveKit Agents / Pipecat / Vocode | E | Depends on integrator | Depends on integrator | Depends on integrator | Depends on integrator |
| ZOL Voice Stack | | Two-plane configuration: DB-driven for web/RAG (site_crawl_configs, golden_pages, PromptContext) + YAML overlay for voice (tenant_overlays/_yaml/<slug>.yaml); see architecture/multi-tenancy. Tenant identity comes from the Keycloak JWT claim (cryptographically bound, not header-resolved). Architectural lineage: Bezemer & Zaidman 2010 shared-schema multi-tenant SaaS taxonomy. | 7-intent × 6-category affinity matrix (voice/value-framework). Per-intent multipliers tune retrieval categorical fit without changing prompts. | DB-driven FAQ renderers (voice/tenant-overlay-system): per-tenant FAQ entries, STT phonetic-recovery overrides, and pre-filter classifier overrides land in version control without source changes. | Zero source-code change to onboard a new tenant is the architectural target. Empirical day-1 onboarding effort has not yet been measured against a competitor (pilot Phase 5 commitment, see §4). |

Reading the table: most vendors have some form of per-tenant configuration, but none publish an intent-to-category affinity matrix as a tunable surface. Our Value Framework is unusual in exposing this lever (voice/value-framework) and is a direct consequence of the wheelchair-cross-category-contamination regression documented in the same page.
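As an illustration of how per-intent multipliers can tune categorical fit without touching prompts, here is a minimal sketch of affinity-weighted re-ranking. The intent names, category names, multiplier values, and scores are all hypothetical; the actual 7 × 6 matrix lives in voice/value-framework and is not reproduced here.

```python
from typing import Dict, List, Tuple

# Hypothetical slice of an intent x category multiplier matrix.
AFFINITY: Dict[str, Dict[str, float]] = {
    "visiting_hours": {"practical_info": 1.5, "medical_equipment": 0.5},
    "equipment_rental": {"practical_info": 0.8, "medical_equipment": 1.5},
}


def rerank(intent: str, hits: List[Tuple[str, str, float]]) -> List[Tuple[str, float]]:
    """Re-rank (chunk_id, category, retrieval_score) hits by intent-category affinity.

    Unknown intents or categories get a neutral multiplier of 1.0.
    """
    weights = AFFINITY.get(intent, {})
    scored = [(chunk_id, score * weights.get(category, 1.0)) for chunk_id, category, score in hits]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# A wheelchair-rental chunk outscores the visiting-hours chunk on raw retrieval score.
hits = [
    ("wheelchair-rental", "medical_equipment", 0.80),
    ("visiting-hours", "practical_info", 0.70),
]
print(rerank("visiting_hours", hits))  # the practical_info chunk now ranks first
```

The lever is per tenant and per intent, so a cross-category contamination like the one described above can be corrected in configuration rather than in the prompt or the retriever.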

2.5 Observability

| Vendor | Tier | Per-turn telemetry | Diagnostic accuracy metric | Operator feedback loop | Cost dashboard |
|---|---|---|---|---|---|
| Retell AI | A | Vendor exposes a call-history view; per-turn detail not publicly documented at engineering depth | Not publicly documented | Not publicly documented | Per-account billing dashboard |
| Vapi | A | Vendor exposes call logs; per-turn detail not publicly documented | Not publicly documented | Not publicly documented | Per-account billing |
| Synthflow / Bland | A | Not publicly documented at engineering depth | Not publicly documented | Not publicly documented | Not publicly documented |
| Cognigy Voice Gateway | A | "99.7 % intent recognition" claimed at the platform level (cognigy.com); per-turn detail not publicly documented | Not publicly documented | Not publicly documented at engineering depth | Per-account billing |
| OpenAI Realtime API | B | OpenAI usage dashboard; per-turn telemetry not publicly documented at engineering depth | Not publicly documented | Not publicly documented | OpenAI billing dashboard |
| Deepgram Voice Agent | B | Per-account dashboard; per-turn detail not publicly documented at engineering depth | Not publicly documented | Not publicly documented | Per-account billing |
| Google Dialogflow CX / Microsoft Voice Bot | B | Per-project Cloud Monitoring / Azure Monitor dashboards; agent-specific quality metrics not publicly documented | Not publicly documented | Not publicly documented | Per-project billing |
| Hyro | C | Not publicly documented at engineering depth | Not publicly documented | Not publicly documented | Not publicly documented |
| Suki / DeepScribe / Abridge | C | Different problem domain; observability geared to clinician QA, not patient-call telemetry | Not applicable | Clinician feedback loops (vendor-specific) | Not applicable |
| Genesys / NICE / Five9 / Talkdesk | D | Mature CCaaS reporting (call recording, IVR analytics) | Not publicly documented at agent-quality level | Not publicly documented at engineering depth | Per-account billing |
| LiveKit Agents / Pipecat / Vocode | E | Frameworks emit events; integrator builds the dashboard | Depends on integrator | Depends on integrator | Depends on integrator |
| ZOL Voice Stack | | pipeline_telemetry Postgres table: per-stage latency, retrieval cardinality, intent class, primary content category, category-mismatch indicator. Per-turn writes are unconditional. See Voice Stack Compendium §4. | Diagnostic V2 endpoint (POST /api/v1/query?response_format=v2): per-dimension scoring (correctness, safety, memory, tool_use, latency) by VoiceTurnEvaluator (schema-validated via the structured_call helper). LLM-as-judge bias controls follow Zheng et al. 2023. | Operations dashboard: per-tenant trend charts (Category Mismatch Trend, Diagnostic Accuracy Trend) on the Costs tab; described in feedback-dashboard-metrics. | Costs page: dollarised per-LLM-call breakdown, see performance/overview. |

This is the second axis where our row is populated and most competitor cells are not. The driver is engineering choice, not technical hardness — vendors could expose per-turn telemetry, and most do so to themselves internally; what they don't publish is the engineering shape of those metrics. The diagnostic-accuracy metric in particular ("did the answer match the caller's intent on the documented 5-dimension rubric?") is, to our reading of public material, unique on this list.
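To show what "engineering shape" means for these metrics, here is a minimal sketch of a per-turn telemetry record and a validator for the five-dimension rubric. The field names follow the columns listed in the ZOL row, but the class itself, the `expected_category` field, and the 0 to 1 score range are assumptions; this is not the pipeline_telemetry schema or the VoiceTurnEvaluator contract.

```python
from dataclasses import dataclass
from typing import Dict

# The five rubric dimensions named in the ZOL row.
RUBRIC_DIMENSIONS = ("correctness", "safety", "memory", "tool_use", "latency")


@dataclass(frozen=True)
class TurnTelemetry:
    stage_latency_ms: Dict[str, float]
    retrieval_cardinality: int
    intent_class: str
    primary_content_category: str
    expected_category: str  # hypothetical: the category the intent should map to

    @property
    def category_mismatch(self) -> bool:
        return self.primary_content_category != self.expected_category

    @property
    def total_latency_ms(self) -> float:
        return sum(self.stage_latency_ms.values())


def validate_scores(scores: Dict[str, float]) -> Dict[str, float]:
    """Reject score sets that miss a dimension, add one, or leave the assumed 0..1 range."""
    if set(scores) != set(RUBRIC_DIMENSIONS):
        raise ValueError("scores must cover exactly the five rubric dimensions")
    for name, value in scores.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"score for {name} is outside 0..1")
    return scores


# Made-up values for illustration.
turn = TurnTelemetry(
    stage_latency_ms={"stt": 180.0, "retrieval": 95.0, "llm": 1400.0, "tts": 310.0},
    retrieval_cardinality=6,
    intent_class="visiting_hours",
    primary_content_category="medical_equipment",
    expected_category="practical_info",
)
print(turn.category_mismatch, turn.total_latency_ms)
```

A record of this shape is what allows trend charts such as a category-mismatch rate to be computed per tenant without re-reading transcripts.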

2.6 Compliance

| Vendor | Tier | GDPR DPIA on file | AI Act classification on file | PII redaction | Data residency |
|---|---|---|---|---|---|
| Retell AI | A | Not publicly documented | Not publicly documented | Add-on at $0.01/min (retellai.com/pricing) | Not publicly documented |
| Vapi | A | Not publicly documented | Not publicly documented | Not publicly documented | Not publicly documented |
| Synthflow / Bland | A | Not publicly documented | Not publicly documented | Not publicly documented | Not publicly documented |
| Cognigy Voice Gateway | A | GDPR compliance claimed at platform level (cognigy.com); DPIA per-deployment artifact not publicly documented | Not publicly documented | Not publicly documented | Not publicly documented |
| OpenAI Realtime API | B | OpenAI publishes a DPA framework; per-deployment DPIA is the customer's | Not publicly documented at this product level | Not publicly documented | EU data residency available on Enterprise plans (verify per-tier) |
| Deepgram Voice Agent | B | Vendor publishes a DPA framework | Not publicly documented at this product level | Not publicly documented | Not publicly documented at agent-product level |
| Google Dialogflow CX | B | GCP DPA framework | Not publicly documented at this product level | DLP API can be wired in (separate product) | Multi-region available |
| Microsoft Voice Bot / Azure AI Speech | B | Azure DPA framework | Not publicly documented at this product level | Azure-side PII tooling (separate) | Multi-region available |
| Suki AI | C | HIPAA, SOC 2 (suki.ai) | Not publicly documented | HIPAA-compliant by design | Not publicly documented |
| DeepScribe | C | HIPAA, SOC 2 (deepscribe.ai) | Not publicly documented | HIPAA-compliant by design | Not publicly documented |
| Abridge | C | "Enterprise-grade" claim (abridge.com); specific certifications not on the public marketing page | Not publicly documented | Not publicly documented at marketing-page depth | Not publicly documented |
| Hyro | C | Not publicly documented at engineering depth | Not publicly documented | Not publicly documented | Not publicly documented |
| Genesys / NICE / Five9 / Talkdesk | D | Mature DPA frameworks; specific DPIA artifacts are customer-side | Not publicly documented at this product level | Vendor-specific PII tooling | Multi-region available |
| LiveKit Agents / Pipecat / Vocode | E | None (depends on integrator) | None (depends on integrator) | None (depends on integrator) | Self-hosted (integrator's choice) |
| ZOL Voice Stack | | GDPR Art. 35 DPIA on file; see safety/dpia. Lawful basis, processing scope, residual-risk register, and Article-by-Article mapping are documented. References GDPR directly. | AI Act Art. 50 limited-risk classification on file; see safety/ai-act-compliance. High-risk Annex III analysis included with explicit scope-limit warning. References AI Act and MDR directly. Trustworthy-AI principles from HLEG 2019 cited. | Multi-language voice PII redaction (voice_pii_redaction.py) at telemetry write time. Architecture in safety/pii-protection. | Self-hosted; data does not leave the pilot server except to subprocessors-of-record (OpenAI, Deepgram, ElevenLabs) under GDPR Art. 28; see Voice Stack Compendium §1. |

This axis is populated at engineering depth in our row and at marketing depth in most competitor rows. That is not because the competitors are non-compliant, but because compliance artifacts are typically per-deployment work product and are not published as part of vendor marketing. The honest read: on the four cells of this axis we have published the engineering-depth artifacts; competitors have published the marketing claim. A buyer evaluating this axis should ask each vendor to produce their equivalent of our DPIA and AI Act memo. ISO 27001:2022 (iso27001_2022) and ISO 27018:2019 (iso27018_2019) provide the cross-vendor reference for what those artifacts should contain.

2.7 Cost

| Vendor | Tier | Headline price | Notes |
|---|---|---|---|
| Retell AI | A | $0.07–$0.31 / min voice (retellai.com/pricing) | Components: voice infra $0.055/min + TTS $0.015/min + LLM ($0.003–$0.08/min) + add-ons (knowledge base $0.005/min, denoising $0.005/min, PII removal $0.01/min) |
| Vapi | A | Not publicly documented at headline level | Vendor publishes a pricing page; verifiable at vapi.ai |
| Synthflow / Bland | A | Not publicly documented at headline level | |
| Cognigy Voice Gateway | A | Not publicly documented (demo-gated) | |
| OpenAI Realtime API | B | Not publicly documented at a per-minute headline (token-based pricing at platform.openai.com) | |
| Deepgram Voice Agent | B | $4.50/hr flat with full stack (deepgram.com/product/voice-agent-api), equivalent to $0.075/min | Reduced rates with bring-your-own-model |
| Google Dialogflow CX / Microsoft Voice Bot | B | Per-request and per-minute pricing on respective cloud pricing pages | Token + audio + STT/TTS components priced separately |
| Suki / DeepScribe / Abridge | C | Per-clinician seat, not per-minute; demo-gated | Different problem |
| Hyro | C | Not publicly documented (demo-gated) | |
| Genesys / NICE / Five9 / Talkdesk | D | Per-seat/agent pricing, voice-bot add-on per-minute; demo-gated | |
| LiveKit Agents / Pipecat / Vocode | E | Free (open-source); cost is vendor-passthrough (STT/LLM/TTS) + infrastructure | |
| ZOL Voice Stack | | ~$8.70/month at projected 25K queries/month (internal cost-tracking, see performance/overview cost table). Per-tenant marginal cost is dominated by LLM token spend, not infrastructure (self-hosted Twilio + LiveKit). | At 25K queries × 45 s average call duration (~18,750 minutes/month), the headline-equivalent is ~$0.028/hour or ~$0.0005/min, but this is not directly comparable to vendor per-minute pricing because (a) our number excludes infrastructure depreciation and engineer-time amortisation, and (b) vendor numbers typically bundle STT + LLM + TTS into the headline rate. A like-for-like cost comparison is on the Phase 5 list. |

The honest framing: direct $/minute comparison across vendors is misleading without a normalisation pass that we have not yet run. Headline rates bundle different components. Our $8.70/month figure is internal cost-tracking, not a vendor-equivalent rate. The Phase 5 commitment in §4 includes building the normalised cost-comparison spreadsheet.
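The unit conversion behind a headline-equivalent rate is simple enough to spell out. The sketch below uses only the inputs stated in the ZOL row ($8.70/month, 25K queries/month, 45 s average call) and the Deepgram hourly rate; it performs the conversion only and does not attempt the normalisation pass that a like-for-like comparison would need.

```python
# Inputs as stated in the cost table.
MONTHLY_COST_USD = 8.70
QUERIES_PER_MONTH = 25_000
AVG_CALL_SECONDS = 45
DEEPGRAM_USD_PER_HOUR = 4.50

# Call volume in minutes and hours per month.
minutes_per_month = QUERIES_PER_MONTH * AVG_CALL_SECONDS / 60
hours_per_month = minutes_per_month / 60

# Internal cost expressed in per-minute and per-hour terms.
zol_usd_per_minute = MONTHLY_COST_USD / minutes_per_month
zol_usd_per_hour = MONTHLY_COST_USD / hours_per_month

# Vendor hourly rate expressed per minute, for the same unit.
deepgram_usd_per_minute = DEEPGRAM_USD_PER_HOUR / 60

print(f"call minutes per month: {minutes_per_month:,.0f}")
print(f"internal cost: ${zol_usd_per_minute:.6f}/min, ${zol_usd_per_hour:.4f}/hour")
print(f"Deepgram headline: ${deepgram_usd_per_minute:.3f}/min")
```

Putting both figures in the same unit is only the first step; the internal figure still excludes infrastructure and engineer time, while the vendor figure bundles STT, LLM, and TTS.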

2.8 Provenance

| Vendor | Tier | Citations on answers | Chunk-id traceability | Deletion compliance | Audit-log retention |
|---|---|---|---|---|---|
| Retell AI | A | Not publicly documented as a feature | Not publicly documented | Per-account data deletion (docs.retellai.com) | Not publicly documented |
| Vapi | A | Not publicly documented as a feature | Not publicly documented | Per-account | Not publicly documented |
| Synthflow / Bland | A | Not publicly documented | Not publicly documented | Per-account | Not publicly documented |
| Cognigy Voice Gateway | A | Per-flow knowledge-source attribution exists (cognigy.com); chunk-id detail not publicly documented at engineering depth | Not publicly documented at engineering depth | Per-account | Not publicly documented |
| OpenAI Realtime API | B | None (LLM-only product) | None (no retrieval surface) | OpenAI account deletion | OpenAI default |
| Deepgram Voice Agent | B | Not publicly documented as a feature | Not publicly documented | Per-account | Not publicly documented |
| Google Dialogflow CX | B | Knowledge-base attribution exists; engineering depth not publicly documented | Not publicly documented at engineering depth | Per-project | Cloud Logging retention |
| Microsoft Voice Bot / Azure AI Speech | B | Not publicly documented at engineering depth | Not publicly documented at engineering depth | Per-project | Azure Monitor retention |
| Suki / DeepScribe / Abridge | C | Note-citation features in clinician scribes; not patient-facing | Not applicable | Per-customer | Vendor-specific |
| Hyro | C | Not publicly documented at engineering depth | Not publicly documented | Not publicly documented | Not publicly documented |
| Genesys / NICE / Five9 / Talkdesk | D | Knowledge-base attribution in some flows; engineering depth not publicly documented | Not publicly documented at engineering depth | Per-tenant | Per-tenant |
| LiveKit Agents / Pipecat / Vocode | E | Depends on integrator | Depends on integrator | Depends on integrator | Depends on integrator |
| ZOL Voice Stack | | Citations on every substantive answer: chunk-derived for voice (no inline [N] markers) and marker-derived for chat. Pipeline in voice/citation-pipeline and Voice Stack Compendium §3. | Per-chunk traceability to source document_chunks row, including page number and document URL. The citation extractor is a three-helper cascade documented after the 2026-05-07 silent-failure regression; see silent-failure discipline R1/R2/R3. | GDPR Art. 17 right-to-erasure mapped to deletion of conversation rows + audit-log retention exception. See safety/data-retention-policy. | Audit logs retained per documented policy with audit-log retention exception under GDPR Art. 17(3)(e). See safety/data-retention-policy. |

The salient observation: citation-grounded retrieval as a per-turn feature is not a routine vendor capability. Most vendors expose retrieval-augmented generation; few expose chunk-id traceability and per-turn citation pipelines at engineering depth. The lineage is Lewis et al. 2020 for the RAG architecture and Gao et al. 2024 for the modular-RAG taxonomy that places our Value Framework among orchestrated retrieval modules.
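For the voice channel, where there are no inline [N] markers to parse, citations have to be derived from the chunks that fed the answer. The following is a minimal sketch of that idea; the `Chunk` and `Citation` types, the de-duplication by (URL, page), and the example URLs are assumptions for illustration and do not reproduce the three-helper cascade mentioned in the ZOL row.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Chunk:
    chunk_id: int        # stands in for the id of a document_chunks row
    document_url: str
    page_number: int
    text: str


@dataclass(frozen=True)
class Citation:
    chunk_id: int
    document_url: str
    page_number: int


def citations_from_chunks(used_chunks: List[Chunk]) -> List[Citation]:
    """Derive citations from the chunks used for an answer.

    Keeps the first chunk per (document_url, page_number) and preserves order.
    """
    seen = set()
    citations: List[Citation] = []
    for chunk in used_chunks:
        key = (chunk.document_url, chunk.page_number)
        if key in seen:
            continue
        seen.add(key)
        citations.append(Citation(chunk.chunk_id, chunk.document_url, chunk.page_number))
    return citations


# Hypothetical chunks; two of them come from the same page.
used = [
    Chunk(101, "https://example.org/visiting", 1, "Visiting hours are ..."),
    Chunk(102, "https://example.org/visiting", 1, "Exceptions apply on ..."),
    Chunk(250, "https://example.org/parking", 3, "Parking is available ..."),
]
for citation in citations_from_chunks(used):
    print(citation)
```

Each citation keeps the chunk id, so an answer can be traced back to the exact stored row as well as to the page a caller could be pointed to.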

2.9 Cell-summary statistics

| Axis | ZOL cells: verified | Competitor cells: verified | Competitor cells: not publicly documented / not measured |
|---|---|---|---|
| Latency (TTFT, end-to-end) | 0 verified, 2 "not yet measured" | 1 (Vapi end-to-end) | 35 |
| Multilingual (languages, switching) | 2 verified | 4 verified, partial | 32 |
| Domain depth (4 sub-axes) | 4 verified | 0 | 64 (mostly "not publicly documented") |
| Customization (4 sub-axes) | 4 verified | ~6 partial | ~54 |
| Observability (4 sub-axes) | 4 verified | ~4 partial | ~56 |
| Compliance (4 sub-axes) | 4 verified | ~12 partial (DPA frameworks, HIPAA claims) | ~50 |
| Cost (1 axis) | 1 verified | 2 verified (Retell, Deepgram) | 16 |
| Provenance (4 sub-axes) | 4 verified | ~6 partial | ~58 |

Reading the totals: most cells in the matrix are blank because vendors do not publish the engineering depth that the comparison requires. Our blanks (the latency cells marked "not yet measured") are commitments to backfill in pilot Phase 5; competitor blanks are usually not commitments to publish at all. This is the matrix's core honest finding — competitive positioning at the engineering depth our buyer cares about is mostly a research exercise on the buyer side, because vendor marketing pages do not publish the answers.

3. Honest gap analysis

This section is deliberately unflattering. Three gaps shape the snapshot, and each is a place where a buyer would correctly say "competitor X is ahead of you on this axis." We name the deficit, the cost, and the closing-the-gap commitment in §4.

3.1 Infrastructure reliability — managed hyperscaler stacks have one less moving part

The OpenAI Realtime API and Deepgram Voice Agent are managed services; the hyperscaler runs the infrastructure, the integrator runs the prompt. Our stack is self-hosted Twilio Elastic SIP Trunk + LiveKit SIP + LiveKit Server + LiveKit Agents on a single pilot server, with multiple subprocessors of record (OpenAI, Deepgram, ElevenLabs) under GDPR Art. 28. At pilot scale (≤25 K queries/month, single-region) this is operationally fine — see the runbook in ADR-0050 — but we have not yet measured uptime against a managed alternative on an apples-to-apples SLO basis.

The cost of this gap to a buyer: a CTO asking "what is your committed uptime SLO?" gets the honest answer "we have not yet built a multi-region failover and we have not yet posted a public SLO." A managed-stack vendor can answer that question with a number. We cannot, yet. The §4.1 line commits to backfilling pilot uptime measurement and posting an SLO in Q3 2026.
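For readers less familiar with how an uptime SLO translates into operational terms, here is a minimal sketch of the standard conversion between an availability target and allowed downtime. The 99.9 % target and the downtime figure are placeholders chosen for illustration; no SLO has been posted for the pilot.

```python
def allowed_downtime_minutes(slo: float, days: int = 30) -> float:
    """Downtime budget, in minutes, implied by an availability target over a window."""
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be a fraction in (0, 1]")
    return (1.0 - slo) * days * 24 * 60


def measured_availability(window_minutes: float, downtime_minutes: float) -> float:
    """Fraction of the window during which the service was up."""
    return (window_minutes - downtime_minutes) / window_minutes


# Placeholder target and placeholder measurement, not pilot data.
budget = allowed_downtime_minutes(0.999, days=30)
observed = measured_availability(window_minutes=30 * 24 * 60, downtime_minutes=90)

print(f"99.9 % over 30 days allows {budget:.1f} minutes of downtime")
print(f"90 minutes of downtime corresponds to {observed:.4%} availability")
```

The comparison a buyer cares about is the second line against the first: a measured availability is only meaningful once a target and a measurement window have both been written down.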

3.2 Conversational fluidity — Retell has shipped fine-grained barge-in tuning

Retell publishes a latency troubleshooting page and a barge-in tuning surface that exposes per-call interruption sensitivity, end-of-turn detection thresholds, and tunable response delay. Our voice agent has basic barge-in support via LiveKit Agents' semantic turn detection (livekit_agents_docs), but we have not yet tuned per-call barge-in sensitivity. In particular, the elderly demographic that dominates hospital helpdesk traffic frequently pauses mid-sentence, and the current Voice Activity Detection tuning may treat such a pause as end-of-turn, so the agent starts answering before the caller has finished.
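The mid-sentence pause problem can be illustrated with a deliberately simplified end-of-turn rule. The sketch below uses a plain silence timeout over speech segments, which is not how LiveKit's semantic turn detection works; the function name, the segment timings, and both thresholds are hypothetical.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start_s, end_s) of detected speech


def split_turns(speech: List[Segment], end_of_turn_silence_s: float) -> List[Segment]:
    """Merge speech segments into turns.

    A gap shorter than the threshold is treated as a pause inside the same turn;
    a gap at or above the threshold ends the turn.
    """
    turns: List[Segment] = []
    for start, end in speech:
        if turns and start - turns[-1][1] < end_of_turn_silence_s:
            turns[-1] = (turns[-1][0], end)   # extend the current turn across the pause
        else:
            turns.append((start, end))
    return turns


# One caller sentence with a 1.2 s mid-sentence pause (made-up timings).
segments = [(0.0, 1.8), (3.0, 4.5)]

print(split_turns(segments, end_of_turn_silence_s=0.8))  # the pause splits the sentence
print(split_turns(segments, end_of_turn_silence_s=1.5))  # the pause stays inside one turn
```

A longer threshold keeps slow speakers intact but delays every response by the same amount, which is why a per-tenant or per-call tuning surface is the more useful lever than a single global value.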

The cost of this gap to a buyer: a hospital sponsor running a side-by-side smoke test with Retell will hear the Retell agent feel more "natural" on barge-in. This is a UX gap, not a correctness gap (our citations are still grounded, our safety is still enforced), but UX gaps shape buyer impressions. The §4.1 line commits to barge-in tuning improvements and to a documented per-tenant tuning surface in Q3 2026. The lineage of full-duplex voice systems is Lin et al. 2026; the conversational-analysis lineage is Sacks, Schegloff & Jefferson 1974.

3.3 Marketplace integrations — no Salesforce/HubSpot connectors

Genesys, NICE, Five9, and Talkdesk ship marketplace ecosystems with hundreds of pre-built connectors — Salesforce, HubSpot, Microsoft Dynamics, Zendesk, ServiceNow. A contact-center buyer who already lives inside a Salesforce CRM gets immediate value from a vendor whose voice bot can read and write Salesforce records. Our codebase has zero such connectors today. Our integration story is HTTP API + Postgres queries + DB-backed FAQ; none of those are marketplace connectors.

The cost of this gap to a buyer: the appointment-booking spinoff buyer who asks "can your agent log a follow-up task into our existing Salesforce instance?" gets the honest answer "not without engineering work." The §4.3 line commits to Salesforce and HubSpot connectors as the first two marketplace integrations in 2027 H1, contingent on pilot expansion to a tenant that needs them.

3.4 Two smaller gaps worth naming

Two further gaps are worth naming for completeness even if they do not justify roadmap lines on their own:

  • Mid-call language switching — by ADR-0052 this is a deliberate trade-off, not a deficit, but a buyer comparing language-list cells in §2.2 should be told that Cognigy claims 100+ languages with mid-call translation while we lock at first utterance to preserve Flemish accuracy. Both choices are defensible; the buyer should know which one we made.
  • No 24×7 operator-NOC dashboard — at pilot scale we have an operations dashboard (architecture/feedback-dashboard-metrics), but no on-call operator NOC. Contact-center incumbents ship a NOC. We do not, and at pilot scale we should not.

Neither of these is on the §4 roadmap; both are on this list so a buyer reading §3 has the complete honest picture.

4. Closing-the-gap roadmap

Each item below maps to a specific gap in §3 or to an open cell in §2, with one adjacent exception flagged in §4.3. Items that map to neither have been deleted; this list is the roadmap, not a wish list.

4.1 Q3 2026 (Jul–Sep): SLA-grade pilot measurement + barge-in tuning

| Item | Maps to | Description |
| --- | --- | --- |
| Pilot uptime SLO posting | §3.1 (infrastructure reliability) | Backfill three months of pilot uptime data; post a public SLO. Requires the per-stage histogram instrumentation that §2.1 marks as "not yet measured at p95 on pilot." |
| Latency cell backfill | §2.1 ("not yet measured" cells) | Replace dev p50 numbers in voice/architecture latency-budget table with pilot p95 numbers. Methodology is Beyer et al. 2016 tail-latency framing. |
| Barge-in tuning v1 | §3.2 (conversational fluidity) | Per-tenant Voice Activity Detection sensitivity tuning. Surface as a tenant-overlay knob in the YAML overlay (voice/tenant-overlay-system). |
| Faster TTFT via streaming TTS | §3.2 + §2.1 | Investigate the ElevenLabs streaming endpoint for first-audio reduction below 200 ms. |
| Like-for-like cost comparison spreadsheet | §2.7 (cost normalisation) | Normalise vendor headline rates against component-broken-out cost so the buyer-facing per-minute comparison is honest. |
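
The difference between the dev p50 numbers and the pilot p95 numbers named in the latency backfill item can be sketched with a nearest-rank percentile. This is an illustration only: the sample latencies are invented for the example and are not pilot measurements, and the function name is hypothetical.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


# Invented per-turn latencies in ms: most turns are fast, two are slow.
latencies_ms = [100.0] * 18 + [900.0, 1200.0]

print(percentile(latencies_ms, 50))  # 100.0: the median hides the slow turns
print(percentile(latencies_ms, 95))  # 900.0: the tail is what a caller notices
```

A p50 figure would describe this distribution as a 100 ms system; the p95 figure shows that one turn in twenty takes close to a second, which is the reason the roadmap replaces the former with the latter.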

4.2 Q4 2026 (Oct–Dec): zero-shot mode + second-pilot deployment

| Item | Maps to | Description |
| --- | --- | --- |
| Open-source intent-classifier benchmark | §2.4 (customization, intent affinity tuning) | Publish the 7-intent × 6-category affinity matrix as an evaluable benchmark. Methodology framed by Wohlin et al. 2012 experimentation in software engineering. |
| Zero-shot prompt mode | §2.4 (day-1 onboarding) | New tenant onboards by filling in a structured prompt template: no YAML overlay, no FAQ entries, just LLM + retrieval. Useful as a fast-mode for proof-of-concept tenants. |
| Second-pilot deployment | §2.4 (multi-tenant overlay validation) | Onboard a second hospital with only YAML overlay + DB rows; zero source-code commits to the codebase. Empirical proof of the multi-tenant onboarding architecture. |
| Diagnostic V2 metric publication | §2.5 (observability) | Publish per-dimension v2 diagnostic numbers as an internal benchmark. LLM-as-judge bias controls follow Zheng et al. 2023. |

4.3 2027 H1: marketplace connectors + multi-region

| Item | Maps to | Description |
| --- | --- | --- |
| Salesforce connector | §3.3 (marketplace integrations) | First marketplace integration. Read + write Salesforce records via the agentic LLM's tool surface. |
| HubSpot connector | §3.3 | Second marketplace integration. |
| Multi-region deploy | §3.1 (infrastructure reliability, second-region failover) | Add a second-region pilot deployment + DNS-level failover. |
| AI Act high-risk pathway documentation | (not §3, but adjacent) | If any future feature crosses the high-risk threshold (clinical decision support, scheduling that materially affects care delivery), we already have the limited-risk memo on file (safety/ai-act-compliance); the high-risk pathway is the next step. Cite MDR 2017/745 as the medical-device adjacency. |

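The roadmap totals thirteen items across three time horizons: five in Q3 2026, four in Q4 2026, and four in 2027 H1. Each is tied to a specific gap in §3 or an open cell in §2, except the AI Act pathway item, which is flagged as adjacent. No marketing-roadmap padding.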

5. Why-us summary

Four differentiators carry the weight here. Each is named with concrete evidence; the engineering depth lives at the cross-link.

5.1 Domain depth — the Value Framework + safety triple-defense

The Value Framework (voice/value-framework) is a 7-intent × 6-category affinity matrix that prevents cross-category contamination — a wheelchair-accessibility query gets a parking answer, not an orthopaedic-reimbursement answer. The safety triple-defense (voice/triple-defense) layers regex pre-filter + LLM-side prompting + regex post-filter + post-LLM disclaimer prepender; multi-language regex packs cover nl/en/fr/it (adversarial hardening). Empirical evidence: 100 % pass rate on the 14-question safety-refusal cohort and 12-question adversarial-GCG cohort (thesis Chapter 4, Table 4.1, citing Zou et al. 2023). The Llama Guard line of work (Inan et al. 2023) is the academic adjacency for LLM-output safety; our regex post-filter is a deterministic complement.
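
How an intent-by-category affinity matrix can keep a retrieval result in its lane is sketched below. This is a toy illustration, not the Value Framework itself: the two intents, two categories, scores, threshold, and function name are all assumptions, and the real matrix is 7 × 6.

```python
from typing import Dict, List, Tuple

# Toy affinity matrix keyed by (intent, category). All values are illustrative.
AFFINITY: Dict[Tuple[str, str], float] = {
    ("accessibility", "parking"): 0.9,
    ("accessibility", "reimbursement"): 0.1,
    ("billing", "reimbursement"): 0.9,
    ("billing", "parking"): 0.2,
}


def gate_chunks(
    intent: str,
    chunks: List[Tuple[str, float, str]],
    min_affinity: float = 0.5,
) -> List[Tuple[str, str]]:
    """Drop retrieved chunks whose category has low affinity with the intent,
    then order the survivors by retrieval score weighted by affinity.

    Each chunk is (category, retrieval_score, text). Unknown pairs score 0.0.
    """
    kept = []
    for category, score, text in chunks:
        affinity = AFFINITY.get((intent, category), 0.0)
        if affinity >= min_affinity:
            kept.append((score * affinity, category, text))
    kept.sort(reverse=True)
    return [(category, text) for _, category, text in kept]


# The reimbursement chunk has the higher raw retrieval score but is gated out.
retrieved = [
    ("reimbursement", 0.95, "Orthopaedic reimbursement rules"),
    ("parking", 0.80, "Wheelchair parking bays"),
]
print(gate_chunks("accessibility", retrieved))
# [('parking', 'Wheelchair parking bays')]
```

The point of the sketch is that the gate runs on category affinity rather than on raw similarity, so a lexically close chunk from the wrong category cannot win on score alone.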

5.2 Provenance + observability — citations + diagnostic V2 + Operations dashboard

Every substantive answer carries chunk-derived citations (voice/citation-pipeline). The diagnostic V2 endpoint scores per-turn correctness, safety, memory, tool-use, and latency on a documented rubric, with LLM-as-judge bias controls per Zheng et al. 2023. The Operations dashboard (architecture/feedback-dashboard-metrics) renders per-tenant trend charts on Category Mismatch and Diagnostic Accuracy. The architectural lineage is Lewis et al. 2020 for RAG and Gao et al. 2024 for modular RAG.

5.3 Multi-tenant SaaS architecture — zero-source-change onboarding

Tenant identity is bound to the Keycloak JWT claim, not to a header — cryptographically resolved per request (architecture/multi-tenancy). The two-plane configuration (DB-driven for web/RAG + YAML overlay for voice) puts slow-moving voice content in version control and fast-moving crawl rules in DB rows that platform admins edit through the API. The architectural lineage is Bezemer & Zaidman 2010 shared-schema multi-tenant SaaS taxonomy. The zero-source-change invariant is the architectural target; pilot Phase 5 commits to empirical verification via second-pilot deployment.
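
The claim-not-header rule can be sketched as follows. This is a minimal illustration under stated assumptions: it presumes the JWT signature has already been verified upstream, and the claim name `tenant_id`, the header name `X-Tenant-ID`, and the function and exception names are hypothetical rather than taken from the codebase.

```python
from typing import Mapping


class TenantResolutionError(Exception):
    """Raised when a verified token does not carry a usable tenant claim."""


def resolve_tenant(
    verified_claims: Mapping[str, object], headers: Mapping[str, str]
) -> str:
    """Resolve the tenant from already-verified JWT claims only.

    The headers argument is accepted to make the point explicit: a
    client-supplied tenant header is never consulted, because any caller
    can set it, whereas the claim is covered by the token signature.
    """
    tenant = verified_claims.get("tenant_id")  # hypothetical claim name
    if not isinstance(tenant, str) or not tenant:
        raise TenantResolutionError("token carries no tenant claim")
    return tenant


# A spoofed header does not change the resolved tenant.
claims = {"sub": "user-123", "tenant_id": "hospital-a"}
spoofed_headers = {"X-Tenant-ID": "hospital-b"}
print(resolve_tenant(claims, spoofed_headers))  # hospital-a
```

Failing closed when the claim is absent, rather than falling back to a header or a default tenant, is the design choice that keeps one tenant's request from ever resolving into another tenant's data.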

5.4 Engineering rigor — 50+ ADRs, 62-entry verified bibliography, silent-failure discipline

Architectural decisions live in the ADR series (decisions is a representative example); the bibliography (references) has 62 verified entries with last-verified dates and one-line summaries. The silent-failure discipline (R1: log size on collection-returning functions; R2: regression test for every silent-failure branch; R3: contract test for cross-component shared state) was codified after a real-world voice-history regression on 2026-05-07 — see the project's CLAUDE.md for the canonical writeup. The thesis (thesis Chapter 4) ships the empirical evidence for every quantitative claim. Software-engineering experimentation methodology is Wohlin et al. 2012; SRE practice for tail-latency is Beyer et al. 2016; software-craftsmanship practice is Martin 2017.
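
Rule R1 (log size on collection-returning functions) can be sketched as a small decorator. This is one possible shape, not the project's actual implementation; the decorator name, logger name, and the `load_voice_history` stand-in are hypothetical.

```python
import functools
import logging

logger = logging.getLogger("silent_failure")


def log_collection_size(func):
    """Log how many items a collection-returning function produced, so an
    unexpectedly empty result leaves a trace instead of failing silently."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        result = func(*args, **kwargs)
        try:
            size = len(result)
        except TypeError:
            size = None  # result has no length (e.g. None or a generator)
        logger.info("%s returned %s items", func.__name__, size)
        return result

    return wrapper


@log_collection_size
def load_voice_history(session_id: str) -> list:
    """Hypothetical stand-in for a history lookup that finds nothing."""
    return []


history = load_voice_history("session-1")  # logs: load_voice_history returned 0 items
```

An empty list is a valid return value, which is exactly why it is dangerous: without the size in the log, a regression that empties the collection looks identical to a caller with no history.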

The single sentence

We are the only stack on this list that combines patient-facing voice with citation-grounded retrieval, multi-tenant overlay onboarding, GDPR Art. 35 DPIA + AI Act Art. 50 limited-risk classification on file, and an empirically measured 99.0 % pass rate against a 302-question regulated-domain benchmark. The roadmap in §4 closes the three honest gaps without giving up the four differentiators.

6. Methodology and caveats

6.1 How the matrix was built

Public-material research was the primary source. For each vendor and each cell:

  1. The vendor's documentation page, pricing page, or product page was fetched. Vendor URLs are in §2 inline.
  2. If the page stated the answer, the cell got the answer + the URL.
  3. If the page did not state the answer within five minutes of reading, the cell was marked Not publicly documented.
  4. Inferred numbers were forbidden. If the buyer asks "where did you get this number?" we point at a URL, not a guess.
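
The four steps above amount to a cite-or-blank rule that can be enforced mechanically. The sketch below is one way to express it, not the tooling used to build the matrix; the `Cell` type, the `make_cell` function, and the example URL are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

NOT_DOCUMENTED = "Not publicly documented"


@dataclass(frozen=True)
class Cell:
    value: str
    source_url: Optional[str] = None


def make_cell(value: Optional[str], source_url: Optional[str]) -> Cell:
    """Build a matrix cell under the cite-or-blank rule.

    A stated value must come with the URL it was read from. A missing value
    becomes an explicit blank. A value without a URL is rejected outright,
    because that is what an inferred number looks like.
    """
    if value and source_url:
        return Cell(value, source_url)
    if value:
        raise ValueError("a stated value needs a source URL")
    return Cell(NOT_DOCUMENTED)


cited = make_cell("flat per-minute rate", "https://example.com/pricing")
blank = make_cell(None, None)
print(blank.value)  # Not publicly documented
```

Making the uncited case an error rather than a warning is deliberate: it leaves no code path by which a guess can reach the published matrix.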

Our own cells were sourced from:

  • The audit ledger under docs/audits/2026-05-09-*.md (drift register, source-of-truth for "what we have")
  • The thesis (thesis Chapter 4) for empirical golden-eval numbers
  • The fast-gate threshold study (backend/scripts/revalidate_fast_gate_threshold.json)
  • The ADR series for architectural decisions
  • The Voice Stack Compendium for engineering-grade claims
  • The references bibliography for academic claims (62 verified entries)

Where our own number is not yet measured, the cell says not yet measured with a Phase-5 commitment. Same discipline as for competitors.

6.2 Caveats

The matrix is a snapshot. Vendors ship; the matrix decays. Specific decay risks:

  • Pricing pages move quickly. Retell's per-minute components (retellai.com/pricing) and Deepgram's flat-rate Voice Agent (deepgram.com/product/voice-agent-api) were verified on 2026-05-09–10. Both will likely shift before Q3 2026.
  • Vendor language lists expand. Retell's multilingual page lists nl/en/es/fr/de/hi/ru/pt/jp/it as of verification; this can grow.
  • Hyperscaler features land continuously. OpenAI, Deepgram, Google, and Microsoft ship voice features on multi-week cadences; the cells marked "not publicly documented" today may be documented next quarter.
  • Engineering-depth competitive material is rarely public. Most blank cells exist because vendors do not publish the engineering depth our buyer cares about. A competitive-procurement reviewer should ask each vendor to produce their equivalent of our DPIA, AI Act memo, and Voice Stack Compendium.

6.3 Refresh cadence

This matrix is timestamped 2026-05 in the URL slug. The next refresh is committed to Q3 2026 alongside the pilot uptime SLO posting (§4.1). Each refresh will:

  • Re-verify every vendor URL (redirect, removal, content drift)
  • Replace "not yet measured" cells in the ZOL row with measured numbers as pilot Phase 5 backfills land
  • Add new tier rows if a new vendor category emerges (e.g., agentic-voice frameworks beyond LiveKit Agents / Pipecat / Vocode)
  • Strike vendors that have exited the market or pivoted out of voice

The matrix is honest about being a snapshot. A buyer reading this in 2027 H1 should refresh — or ask us to refresh — before relying on any specific cell.

7. References

This document cites the following bibliography keys (see references for full entries):

Vendor product pages are cited inline by URL. They are intentionally not bibliography entries because they are product-marketing material, not academic work.