Next-Generation Platform Architecture

"The system should work by just crawling the website and sending them the link."

Executive Summary

We are building a healthcare navigation platform — not a chatbot. The chatbot is the entry point; the intelligence underneath is what delivers value. This document describes the architectural evolution from a single-hospital knowledge graph to a multi-hospital, multi-language, multi-channel platform that scales across European healthcare.

The core innovation is a three-phase taxonomy extraction pipeline: the system crawls any hospital website, automatically identifies structural pages, extracts a proposed taxonomy using premium LLMs backed by SNOMED-CT medical terminology, and presents it for human approval. No custom code per hospital. No manual relationship mapping. No developer-as-operator.

The Problem We Solved

Before: The Knowledge Graph Trap

Traditional healthcare knowledge graph systems — including our own v1 architecture — follow an extract-validate-fix loop:

Crawl website → LLM extracts entities → Store in graph database →
  Quality audit reveals errors → Developer adds blocklists/rules →
    New edge cases appear → Repeat

This approach has well-documented limitations in the literature. Healthcare knowledge graph construction remains "one of the most challenging tasks in healthcare informatics" due to the inherent ambiguity of medical text and the complexity of clinical relationships (Shi et al., 2023). Our own experience confirmed this: 9 iteration rounds, 16,000 lines of hospital-specific code, and the system still produced quality-gate bypasses, canonical-name collisions, and implausible department-condition assignments.

The root cause is architectural, not technical: extraction from unstructured text is inherently noisy, and no amount of post-hoc filtering makes it clean.

After: Proposed-and-Approved

Our new architecture inverts the paradigm:

Crawl website → Classify pages → Extract proposed taxonomy →
  Human operator reviews and approves →
    Approved structure becomes the navigational backbone

Every relationship that drives patient navigation has been seen and approved by a human. The system proposes; the operator decides. Unreviewed extractions never reach patients, which removes the class of quality issues that plagued the extract-validate-fix loop.

Architecture Overview

The Three-Phase Pipeline

Phase 1: Crawl & Classify

The system crawls any hospital website and uses an LLM to classify each page as either a hub page (a structured directory listing multiple medical entities) or a detail page (everything else). Pages classified as hubs are proposed to the operator as candidates and are only used for extraction once confirmed.

This classification is hospital-agnostic — it works on any hospital's page layout, in any language, without custom HTML parsers.
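
As a rough illustration of the classification step, the sketch below treats the LLM as an injectable callable so that it stays independent of any provider. The prompt wording, the `Page` type, and the function names are assumptions made for this example; the source does not specify them.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Page:
    url: str
    text: str


# Hypothetical prompt; the actual wording used by the platform is not documented here.
PROMPT = (
    "Classify this hospital web page as 'hub' (a structured directory "
    "listing multiple medical entities) or 'detail' (everything else). "
    "Answer with one word.\n\n{text}"
)


def classify_page(page: Page, llm: Callable[[str], str]) -> str:
    """Return 'hub' or 'detail'; any unexpected answer falls back to 'detail'."""
    answer = llm(PROMPT.format(text=page.text[:4000])).strip().lower()
    return "hub" if answer == "hub" else "detail"


def hub_candidates(pages: list[Page], llm: Callable[[str], str]) -> list[str]:
    """URLs to propose to the operator as hub pages (not yet confirmed)."""
    return [p.url for p in pages if classify_page(p, llm) == "hub"]
```

Because the classifier only sees page text, nothing in it depends on a particular hospital's HTML layout, which is the property the paragraph above relies on.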

Phase 2: Extract & Propose

A generic LLM-powered extractor processes confirmed hub pages, producing structured entities (departments, doctors, conditions, treatments, examinations) and relationships (department handles condition, department offers treatment, doctor works in department).

Each extracted item receives:

A confidence score from the LLM
Provenance (source URL and extraction timestamp)
A SNOMED-CT code mapped via three-tier matching (exact, fuzzy, LLM-assisted)
Plausibility validation against the SNOMED hierarchy
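
One possible shape for such an extracted item is sketched below. The field names, the `status` values, and the `Proposal` class itself are illustrative assumptions, not the platform's actual schema.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class Proposal:
    kind: str                      # e.g. "department", "condition", "relationship"
    label: str                     # term as it appears on the hospital website
    confidence: float              # LLM-reported score in [0, 1]
    source_url: str                # provenance: page the item was extracted from
    extracted_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )                              # provenance: extraction timestamp (UTC)
    snomed_code: Optional[str] = None
    match_tier: Optional[str] = None   # "exact", "fuzzy" or "llm"
    plausible: Optional[bool] = None   # result of the SNOMED hierarchy check
    status: str = "proposed"           # nothing is live until an operator approves

    def __post_init__(self) -> None:
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be in [0, 1]")
```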

Premium LLMs (GPT-4.1, Claude Opus) are used for extraction. This is a one-time cost per crawl cycle, and extraction quality is the foundation of the entire system.

Phase 3: Curate & Approve

A purpose-built curation interface presents the proposed taxonomy for human review. High-confidence items (>= 0.85) can be bulk-approved. Lower-confidence items are flagged for individual review, with plausibility warnings when SNOMED hierarchy analysis suggests a mismatch.

The operator can approve, reject, correct, merge duplicates, or add missing relationships. Once approved, records become the live navigational backbone that powers all search and routing.
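
The split between bulk-approvable and individually reviewed items could look roughly like this. Only the 0.85 threshold comes from the text above; the dictionary keys and the `warning` flag are assumptions for illustration, and how high-confidence but implausible items should be routed is not specified in the source.

```python
BULK_APPROVE_THRESHOLD = 0.85  # from the curation rule described above


def triage(proposals: list[dict]) -> tuple[list[dict], list[dict]]:
    """Split proposals into (bulk-approvable, needs individual review)."""
    bulk, review = [], []
    for p in proposals:
        if p["confidence"] >= BULK_APPROVE_THRESHOLD:
            bulk.append(p)
        else:
            # Attach a warning when the SNOMED hierarchy check flagged a mismatch.
            review.append({**p, "warning": p.get("plausible") is False})
    return bulk, review
```

Bulk-approvable items still require an explicit operator action; the function only decides which queue an item appears in.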

Technical Foundation

Unified PostgreSQL Architecture

A key architectural decision is the elimination of Neo4j in favor of PostgreSQL as the single data store for vectors, structured taxonomy, content, and authentication. This is grounded in the observation that hospital navigation requires shallow traversals (2-4 hops) that are optimally served by relational JOINs, not graph database engines designed for deep traversals on millions of nodes.

This aligns with emerging best practices in healthcare data architecture. The World Economic Forum's 2026 analysis of healthcare data architectures emphasizes that flexibility and interoperability matter more than any specific database paradigm (WEF, 2026). By consolidating into PostgreSQL with pgvector, we achieve:

Single database operations — backup, monitoring, versioning in one place
Combined vector + structural queries — no cross-database round trips
ACID transactions on taxonomy updates
Standard SQL tooling — every developer knows SQL; Cypher is niche
Multi-tenant isolation via hospital_id foreign keys on every table

The "patient journey" query (condition -> department -> campus -> doctors) executes as 4 SQL JOINs in under 10 milliseconds at the expected data volumes (~500 departments, ~400 doctors, ~250 conditions per hospital).
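
To make the shape of that query concrete, here is a minimal sketch using Python's built-in `sqlite3` as a stand-in for PostgreSQL. The table and column names are assumptions, not the platform's real schema, and the example says nothing about latency.

```python
import sqlite3

# Hypothetical schema; hospital_id appears on every table for tenant isolation.
SCHEMA = """
CREATE TABLE campuses    (id INTEGER PRIMARY KEY, hospital_id INTEGER, name TEXT);
CREATE TABLE departments (id INTEGER PRIMARY KEY, hospital_id INTEGER, name TEXT,
                          campus_id INTEGER REFERENCES campuses(id));
CREATE TABLE conditions  (id INTEGER PRIMARY KEY, hospital_id INTEGER, name TEXT,
                          snomed_code TEXT);
CREATE TABLE doctors     (id INTEGER PRIMARY KEY, hospital_id INTEGER, name TEXT,
                          department_id INTEGER REFERENCES departments(id));
CREATE TABLE department_conditions (department_id INTEGER, condition_id INTEGER,
                                    hospital_id INTEGER);
"""

# condition -> department -> campus -> doctors, expressed as four JOINs.
JOURNEY = """
SELECT d.name, ca.name, doc.name
FROM conditions c
JOIN department_conditions dc ON dc.condition_id = c.id
JOIN departments d            ON d.id = dc.department_id
JOIN campuses ca              ON ca.id = d.campus_id
JOIN doctors doc              ON doc.department_id = d.id
WHERE c.hospital_id = ? AND c.snomed_code = ?
ORDER BY doc.name
"""


def patient_journey(conn: sqlite3.Connection, hospital_id: int, snomed_code: str):
    """Return (department, campus, doctor) rows for one condition in one hospital."""
    return conn.execute(JOURNEY, (hospital_id, snomed_code)).fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.executescript("""
INSERT INTO campuses VALUES (1, 1, 'Main Campus');
INSERT INTO departments VALUES (1, 1, 'Cardiology', 1);
INSERT INTO conditions VALUES (1, 1, 'hartfalen', '84114007');
INSERT INTO doctors VALUES (1, 1, 'Dr. A', 1), (2, 1, 'Dr. B', 1);
INSERT INTO department_conditions VALUES (1, 1, 1);
""")
rows = patient_journey(conn, 1, "84114007")
```

The same SQL should carry over to PostgreSQL largely unchanged, apart from the parameter placeholder style of the driver in use.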

SNOMED-CT: The Medical Lingua Franca

SNOMED Clinical Terms (SNOMED-CT) is the world's most comprehensive clinical terminology system, with over 350,000 concepts and official translations in Dutch, French, German, English, and 10+ additional languages (SNOMED International, 2026).

Our integration uses SNOMED-CT for three purposes:

Term Resolution: Hospital-specific terms ("hartkloppingen") map to stable concept codes (80313002), enabling canonical deduplication and synonym handling without manual alias maps.
Cross-Language Portability: A Dutch hospital's "hartfalen" and a French hospital's "insuffisance cardiaque" map to the same SNOMED code (84114007). Department routing logic written once works across all languages.
Plausibility Checking: SNOMED's IS_A hierarchy groups conditions under clinical domains. A universal mapping table (~50-100 rows) from SNOMED top-level categories to department domain groups replaces thousands of lines of hardcoded plausibility guards.
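
The plausibility check can be sketched as an ancestor lookup followed by a table lookup. In the toy data below, only the two leaf codes come from this document; the category keys, the domain group names, and the decision to treat unmapped concepts as plausible are assumptions for illustration.

```python
# Toy IS_A edges (child -> parents). Category keys are illustrative placeholders,
# not real SNOMED identifiers.
IS_A = {
    "84114007": ["heart-disease"],      # hartfalen / insuffisance cardiaque
    "80313002": ["cardiac-finding"],    # hartkloppingen
    "heart-disease": ["cardiovascular"],
    "cardiac-finding": ["cardiovascular"],
}

# Stand-in for the universal mapping table: category -> department domain groups.
DOMAIN_GROUPS = {"cardiovascular": {"cardiology", "internal-medicine"}}


def ancestors(code: str) -> set[str]:
    """All concepts reachable from `code` by following IS_A edges upward."""
    seen, stack = set(), [code]
    while stack:
        for parent in IS_A.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def is_plausible(code: str, department_domain: str) -> bool:
    """True if the department's domain is allowed for any ancestor category."""
    allowed: set[str] = set()
    for concept in ancestors(code):
        allowed |= DOMAIN_GROUPS.get(concept, set())
    # Assumption: with no mapped category there is nothing to check, so no warning.
    return not allowed or department_domain in allowed
```

Because the check keys on concept codes rather than on labels, the same table applies whether the hospital's website is in Dutch or French.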

This approach is consistent with the literature on SNOMED-CT implementation. Mills (2013) and subsequent systematic reviews demonstrate that SNOMED-CT integration is most effective when it is "not simply viewed as a bolt-on but fully integrated into the semantic understanding of clinical terms" (JMIR Medical Informatics, 2023). Our three-tier matching strategy (exact match -> trigram fuzzy match -> LLM-assisted mapping) ensures high coverage while maintaining deterministic behavior for the majority of terms.

Three-Tier Term Matching
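
The strategy described above (exact match, then trigram fuzzy match, then LLM-assisted mapping) can be sketched as follows. This is a simplified illustration: the trigram similarity only approximates what a database trigram extension would compute, the 0.5 threshold is an arbitrary assumption, and `llm_map` is a hypothetical callable standing in for the LLM tier.

```python
from typing import Callable, Optional

# Tiny stand-in for the SNOMED description table; codes taken from this document.
TERMS = {"hartfalen": "84114007", "hartkloppingen": "80313002"}


def trigrams(s: str) -> set[str]:
    """Character trigrams of the padded, lower-cased string."""
    padded = f"  {s.lower().strip()} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def similarity(a: str, b: str) -> float:
    """Jaccard similarity of the two trigram sets."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)


def match_term(
    term: str,
    llm_map: Optional[Callable[[str], Optional[str]]] = None,
    threshold: float = 0.5,
) -> tuple[Optional[str], str]:
    """Return (snomed_code, tier), where tier records which step matched."""
    key = term.lower().strip()
    # Tier 1: deterministic exact match.
    if key in TERMS:
        return TERMS[key], "exact"
    # Tier 2: deterministic fuzzy match on trigram similarity.
    best = max(TERMS, key=lambda t: similarity(key, t))
    if similarity(key, best) >= threshold:
        return TERMS[best], "fuzzy"
    # Tier 3: non-deterministic fallback, only when the first two tiers fail.
    if llm_map is not None:
        code = llm_map(term)
        if code:
            return code, "llm"
    return None, "unmatched"
```

Recording the tier alongside the code lets the curation interface show the operator how each mapping was obtained.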

Multi-Channel Delivery

The taxonomy and RAG pipeline produces a structured response object that is channel-agnostic. Each delivery channel renders it according to its constraints:

Channel	Response Style	Latency Target	Key Constraint
Web Chatbot	Full markdown, links, expandable sections	< 3 seconds	Rich formatting available
WhatsApp	Short text, action buttons	< 2 seconds	Character limits, button constraints
Voice/Phone	Concise spoken response, transfer offers	< 1.5 seconds	No visual aids, must be spoken naturally

The unified PostgreSQL architecture directly enables the voice channel's latency requirements: combined vector + taxonomy queries execute in a single database round trip, eliminating the 50-200ms cross-database overhead of the previous Neo4j architecture.
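
A channel-agnostic response object with per-channel renderers might look like the sketch below. The `NavResponse` fields, the 300-character limit, and the three-button cap are assumptions chosen for illustration, not documented platform or channel limits.

```python
from dataclasses import dataclass, field


@dataclass
class NavResponse:
    summary: str
    department: str
    links: list = field(default_factory=list)  # (title, url) pairs
    phone: str = ""


def render_web(r: NavResponse) -> str:
    """Full markdown with links."""
    lines = [f"**{r.department}**", "", r.summary]
    lines += [f"- [{title}]({url})" for title, url in r.links]
    return "\n".join(lines)


def render_whatsapp(r: NavResponse, limit: int = 300) -> dict:
    """Short text plus a handful of action buttons."""
    text = f"{r.department}: {r.summary}"
    if len(text) > limit:
        text = text[: limit - 1].rstrip() + "…"
    buttons = [title for title, _ in r.links][:3]
    return {"text": text, "buttons": buttons}


def render_voice(r: NavResponse) -> str:
    """Concise spoken sentence, with a transfer offer when a number is known."""
    offer = f" Shall I transfer you to {r.department}?" if r.phone else ""
    return f"{r.summary}{offer}"
```

Keeping the renderers as pure functions of one response object means a new channel only needs a new renderer, not a new retrieval pipeline.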

Compliance by Design

EU AI Act (Regulation 2024/1689)

The system's classification under the EU AI Act depends on its intended purpose:

Use Case	Risk Classification	Key Obligations
Navigational search ("find the right department")	Limited risk	Transparency: disclose AI usage
Pre-triage routing ("you should contact Cardiology")	Potentially high-risk	If influencing clinical decisions: MDR + AI Act dual compliance
Medical advice	Prohibited by design	System explicitly does not provide medical advice

Our architecture is designed to satisfy high-risk requirements from day one, even while operating as limited risk:

Human oversight: Every navigational relationship is operator-approved
Audit trails: extraction_proposals and response_explanations tables provide full decision lineage
Explainability: Each response can trace which taxonomy relationships and content chunks contributed
Data quality: SNOMED-CT codes provide standardized, verifiable medical concept references

This forward-looking compliance posture is informed by MDCG 2025-6, which establishes that AI systems classified as medical devices under MDR face dual compliance requirements from August 2026 (European Commission, 2025).

GDPR (Regulation 2016/679)

Requirement	Implementation
Data minimization	No patient health data collected — only anonymized search queries
EU data residency	European servers, EU-based LLM endpoints
Right to erasure	Query audit log with retention policy and purge mechanism
Transparency	Clear AI disclosure on every response
DPO requirement	Required when processing health-adjacent data at scale
Data Processing Agreements	Required with all LLM providers
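
The retention policy and purge mechanism listed in the table could be implemented along these lines. This is a sketch using `sqlite3` as a stand-in for PostgreSQL; the `query_audit_log` table, its columns, and the 90-day retention period are assumptions, not values taken from the platform.

```python
import sqlite3
from datetime import datetime, timedelta, timezone
from typing import Optional

RETENTION_DAYS = 90  # assumed value; the real retention period is a policy decision

DDL = """
CREATE TABLE query_audit_log (
    id INTEGER PRIMARY KEY,
    query_text TEXT NOT NULL,
    created_at TEXT NOT NULL  -- ISO 8601 timestamp, always stored in UTC
)
"""


def purge_expired(
    conn: sqlite3.Connection,
    now: Optional[datetime] = None,
    retention_days: int = RETENTION_DAYS,
) -> int:
    """Delete audit rows older than the retention window; return rows removed."""
    now = now or datetime.now(timezone.utc)
    cutoff = (now - timedelta(days=retention_days)).isoformat()
    # String comparison is safe only because every timestamp uses the same
    # UTC ISO 8601 format.
    cur = conn.execute(
        "DELETE FROM query_audit_log WHERE created_at < ?", (cutoff,)
    )
    conn.commit()
    return cur.rowcount
```

Running such a purge on a schedule, and logging the returned count, gives a simple record that the retention policy is actually being enforced.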

Future: HL7 FHIR Readiness

By storing SNOMED-CT codes from day one, the system is pre-positioned for hospital backend integration via HL7 FHIR, the healthcare interoperability standard for which HL7 Europe publishes European implementation guides (HL7 Europe, 2026). The platform's entities map directly to FHIR resources:

Condition -> FHIR Condition
Department -> FHIR Organization
Doctor -> FHIR Practitioner
Appointment -> FHIR Schedule + Slot
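
As a small illustration of that mapping, the sketch below builds minimal FHIR-shaped JSON for a department, a doctor, and a SNOMED-coded concept. The function names and identifiers are hypothetical, and the resources are deliberately incomplete fragments rather than validated FHIR output.

```python
SNOMED_SYSTEM = "http://snomed.info/sct"  # code system URI used for SNOMED CT in FHIR


def snomed_codeable_concept(code: str, display: str) -> dict:
    """CodeableConcept fragment, usable for example as the code of a Condition."""
    return {
        "coding": [{"system": SNOMED_SYSTEM, "code": code, "display": display}],
        "text": display,
    }


def department_to_fhir(dept_id: str, name: str) -> dict:
    """Department -> minimal Organization resource."""
    return {"resourceType": "Organization", "id": dept_id, "name": name}


def doctor_to_fhir(doc_id: str, family: str, given: str) -> dict:
    """Doctor -> minimal Practitioner resource."""
    return {
        "resourceType": "Practitioner",
        "id": doc_id,
        "name": [{"family": family, "given": [given]}],
    }
```

Since the SNOMED code is already stored with each approved condition, producing the coded part of a FHIR resource requires no additional terminology mapping.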

Competitive Differentiation

The Data Moat

Each hospital's curated taxonomy — operator-reviewed, SNOMED-mapped, confidence-scored — gets better over time. A competitor starting from scratch cannot replicate months of curation. The SNOMED domain mappings improve with each hospital: hospital #5 benefits from plausibility patterns learned from hospitals #1-4.

Compliance as Competitive Advantage

Competitors rushing to market with quick chatbot wrappers will encounter the EU AI Act compliance wall in 2026-2027. Our architecture satisfies high-risk requirements from day one — audit trails, human oversight, explainability, and standardized medical terminology are structural features, not afterthoughts.

The ROI Pitch

The platform sells itself on immediately provable return on investment:

Contact center load reduction of 30-40% — a direct, measurable cost saving
Intelligent routing reduces misdirected calls and wasted specialist time
Dashboard-driven proof: "Last month, 12,847 queries were resolved without a phone call"

This is not a technology pitch. It is a financial pitch backed by technology.

Zero-Integration Onboarding

The onboarding experience for a new hospital:

Provide your website URL
We crawl and classify your content (automated)
Review and approve the proposed taxonomy (your staff, our UI)
Go live

No integration project. No custom development. No 6-month implementation timeline. This is a SaaS product, not a consulting engagement.

Platform Vision

Phase 1 (current): Intelligent search replacing keyword search. Crawl, extract, curate, serve.

Phase 2: Multi-channel delivery with measurable contact center deflection. The ROI pitch.

Phase 3: Hospital backend integration via FHIR. Doctor-facing tools. Schedule management. The platform becomes the hospital's digital front desk.

References

European Commission (2025). MDCG 2025-6: Guidance on AI systems used as or within medical devices. Medical Device Coordination Group.
HL7 Europe (2026). Standards & Communities — HL7 FHIR Implementation Guides for European Healthcare. https://hl7europe.org/standards/
JMIR Medical Informatics (2023). "Systematized Nomenclature of Medicine-Clinical Terminology (SNOMED CT) Clinical Use Cases in the Context of Electronic Health Record Systems: Systematic Literature Review." JMIR Med Inform, 11, e43750.
Mills, S. (2013). "Implementation of SNOMED CT in an online clinical database." Pacific Symposium on Biocomputing.
Shi, Y., et al. (2023). "Healthcare knowledge graph construction: A systematic review of the state-of-the-art, open issues, and opportunities." Journal of Biomedical Informatics, 141, 104361. PMC10225120.
SNOMED International (2026). Practical Guides: Vendor Introduction to SNOMED CT — Choosing an Approach to Implementation. https://docs.snomed.org/
World Economic Forum (2026). "Why we need to transform our healthcare data architecture." https://www.weforum.org/stories/2026/01/ai-healthcare-data-architecture/
European Parliament (2024). Regulation (EU) 2024/1689 — The Artificial Intelligence Act. Official Journal of the European Union.
Quickchat AI (2026). "GDPR-Compliant Chatbot: Step-by-Step Guide." https://quickchat.ai/post/gdpr-compliant-chatbot-guide
