Chapter 3: Methodology

Post-defence amendments

This chapter records the system as it stood at the point of thesis submission. Two architectural elements have moved on since then and are noted in-place where they appear: (1) the embedding model migrated from BGE-M3 (1024 dim, on-prem Ollama) to OpenAI text-embedding-3-large (1536 dim, hosted) — see ADR-0048; (2) the voice channel collapsed from a multi-stage orchestrator with a dialogue manager to a single-orchestrator agentic design — see ADR-0049 and ADR-0051. The body of this chapter is preserved as the academic record of what was implemented and evaluated for the graduation project.

3.1 Architectural Principles

The system follows Clean Architecture with dependency inversion (Martin 2017) (ADR-001), separating concerns into four layers: API routes, service logic, data access, and infrastructure. This separation enables independent testing and replacement of components — a critical property when the LLM provider, embedding model, and graph backend were each changed during development.

Key architectural decisions documented in ADRs include:

  • Async-first design (ADR-004): FastAPI with full asyncio throughout, enabling concurrent retrieval from vector and graph stores.
  • No-mocking test policy (ADR-0002): all tests run against real infrastructure via testcontainers, following the practitioner argument against mock-based test fragility (Fowler 2007), so that test results reflect production behaviour.
  • Feature-flag system (ADR-0024): every advanced feature (CRAG, FILCO, Guardrails, graph enrichment, streaming) is gated behind a runtime-configurable toggle, enabling controlled rollout and the ablation studies reported in Chapter 4.
  • Multi-tenancy (ADR-002): row-level security with user-ID-based isolation, supporting future multi-hospital deployment. The trade-offs of this isolation model — shared schema with tenant_id versus schema-per-tenant or database-per-tenant — are discussed in Bezemer and Zaidman 2010.

The technology stack comprises:

Table 3.1. Technology stack.

| Layer | Technology | Purpose |
| --- | --- | --- |
| Backend | FastAPI (Python 3.12, async) | API server and pipeline orchestration |
| Frontend | React 18 + TypeScript + Vite | Admin dashboard and chat interface |
| Vector store | PostgreSQL + pgvector | Embedding storage and similarity search |
| Entity taxonomy | PostgreSQL 17 | Entity relationships and taxonomy queries |
| Cache | Redis 7.x | Session management and semantic query cache |
| Object storage | MinIO | Document file storage |
| Embedding | BGE-M3 (1024d, via Ollama) at submission; subsequently migrated to OpenAI text-embedding-3-large (1536d) (see ADR-0048) | Multilingual dense embeddings |
| Reranker | cross-encoder/ms-marco-MiniLM-L-6-v2 | Cross-encoder passage reranking |
| LLM routing | 4-tier: gpt-4.1-nano / mini / full / gpt-5.2 | Cost-optimized generation |
| Containerization | Docker + Docker Compose | Development and deployment |

Figure 3.1. High-level system architecture. The user interacts via the React frontend, which communicates with the FastAPI backend over HTTP/WebSocket. The backend orchestrates the 11-stage query pipeline, querying PostgreSQL (pgvector) for vector search, PostgreSQL taxonomy tables for entity relationships, Redis for caching, and external LLM APIs for generation and safety validation.

3.2 The 11-Stage Query Pipeline

The query pipeline processes each user request through 11 sequential stages. This pipeline is the core of the system and represents the primary contribution of this thesis. Each stage is instrumented with timing data, logged for debugging, and configurable via feature flags.

Stage 1: Input Processing

The input stage normalizes the user query, detects the user's language using the Lingua library (ADR-0037), and prepares the query for downstream processing. Language detection supports 8 languages: Dutch, English, Turkish, French, German, Italian, Romanian, and Greek. The detected language determines the response language while retrieval always operates in Dutch.

Stage 2: Intent Classification

A lightweight LLM call (gpt-4.1-mini) classifies the query into one of several intent categories: informational, navigational, medical_advice, greeting, off_topic, and others. Medical advice queries are immediately blocked with a language-appropriate refusal message. This stage also includes GCG adversarial input detection (ADR-0036), which uses statistical heuristics (dictionary word ratio, character bigram entropy) to detect and block gibberish-suffix attacks in under 5 milliseconds — without any LLM call.

The intent classification threshold is set at 70%, meaning queries must exceed this confidence level to be classified as medical advice and blocked. This threshold was tuned empirically to balance safety (avoiding false negatives) with usability (avoiding false positives on legitimate queries).

Stage 3: Semantic Cache

A two-tier caching system (ADR-0031) checks for previously answered queries:

  1. Hash cache (~1ms): Exact string match using SHA-256 hash of the normalized query.
  2. Embedding cache (~50ms): Cosine similarity between the query embedding and cached query embeddings, with a configurable threshold (default: 0.97).

Cache hits bypass all downstream stages, returning the cached response immediately. The high similarity threshold (0.97) ensures only near-identical queries are served from cache, preventing stale or mismatched responses.
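The two-tier lookup can be illustrated with a minimal in-memory sketch. It is not the project's implementation: the class name `TwoTierCache`, the `normalize` rule (lowercasing and whitespace collapsing), and the list-based storage are assumptions made for illustration, and the production system stores entries in Redis rather than in process memory. Only the SHA-256 exact match and the 0.97 cosine threshold come from the description above.

```python
import hashlib
import math
from typing import Optional

SIMILARITY_THRESHOLD = 0.97  # default threshold stated in the text


def normalize(query: str) -> str:
    """Hypothetical normalisation: lowercase and collapse whitespace."""
    return " ".join(query.lower().split())


def query_hash(query: str) -> str:
    """SHA-256 hex digest of the normalised query (tier 1 key)."""
    return hashlib.sha256(normalize(query).encode("utf-8")).hexdigest()


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class TwoTierCache:
    """In-memory stand-in for the hash cache plus embedding cache."""

    def __init__(self, threshold: float = SIMILARITY_THRESHOLD) -> None:
        self.threshold = threshold
        self._by_hash: dict[str, str] = {}
        self._by_embedding: list[tuple[list[float], str]] = []

    def put(self, query: str, embedding: list[float], response: str) -> None:
        self._by_hash[query_hash(query)] = response
        self._by_embedding.append((embedding, response))

    def get(self, query: str, embedding: list[float]) -> Optional[str]:
        # Tier 1: exact match on the hash of the normalised query.
        hit = self._by_hash.get(query_hash(query))
        if hit is not None:
            return hit
        # Tier 2: best cosine match, accepted only at or above the threshold.
        best_score, best_response = 0.0, None
        for cached_embedding, response in self._by_embedding:
            score = cosine(embedding, cached_embedding)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None
```

In this sketch the caller supplies the query embedding up front; a real pipeline would more plausibly compute it only after a tier-1 miss, since the embedding call dominates the roughly 50 ms cost of the second tier.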

Stage 4: Query Rewrite

The user's query is rewritten for optimal retrieval. This includes:

  • Query decomposition: Multi-part questions (e.g., "What does Cardiology treat and where is it located?") are split into sub-queries that are processed independently and merged.
  • Taxonomy resolution: Patient-friendly terms are resolved to canonical medical terminology using a cascading lookup: SNOMED CT synonym cache → hardcoded taxonomy aliases → fuzzy matching (cutoff 0.8).
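The cascading lookup for taxonomy resolution might be sketched as follows. The function name `resolve_term`, the dictionary-based caches, and the use of Python's `difflib` for the fuzzy step are assumptions; the text specifies only the order of the cascade and the 0.8 cutoff, not the matching library. The Dutch terms used in the usage lines are illustrative, not taken from the ZOL taxonomy.

```python
import difflib
from typing import Optional

FUZZY_CUTOFF = 0.8  # cutoff stated in the text


def resolve_term(
    term: str,
    snomed_synonyms: dict[str, str],
    taxonomy_aliases: dict[str, str],
    canonical_names: list[str],
) -> tuple[Optional[str], str]:
    """Return (canonical_name, source) using the first tier that matches."""
    key = term.strip().lower()
    # Tier 1: SNOMED CT synonym cache.
    if key in snomed_synonyms:
        return snomed_synonyms[key], "snomed"
    # Tier 2: hardcoded taxonomy aliases.
    if key in taxonomy_aliases:
        return taxonomy_aliases[key], "alias"
    # Tier 3: fuzzy match against canonical names (difflib is an assumption).
    matches = difflib.get_close_matches(key, canonical_names, n=1, cutoff=FUZZY_CUTOFF)
    if matches:
        return matches[0], "fuzzy"
    return None, "unresolved"


# Illustrative data only.
synonyms = {"hartaanval": "myocardinfarct"}
aliases = {"hartafdeling": "cardiologie"}
names = ["cardiologie", "oncologie", "neurologie"]

print(resolve_term("Hartaanval", synonyms, aliases, names))
print(resolve_term("cardiologi", synonyms, aliases, names))
```

Returning the tier that produced the match alongside the canonical name keeps the resolution auditable, which is useful when a fuzzy match later turns out to be wrong.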

Stage 5: Strategy Selection

Based on the resolved query entities, a retrieval strategy is selected:

  • Vector-only: Default for general informational queries.
  • Graph-first: When the query contains recognized medical entities (doctors, departments, conditions), graph traversal is prioritized.
  • Hybrid: Both vector and graph retrieval are executed and results are fused.

Strategy selection is driven by the taxonomy resolution result: if entities are recognized and typed (condition, treatment, doctor, department), the graph path is activated.

Stage 6: Hybrid Retrieval

Hybrid retrieval combines dense vector search (pgvector cosine similarity, pgvector documentation) with BM25 keyword search (Robertson and Zaragoza 2009), fused via Reciprocal Rank Fusion (Cormack et al. 2009, ADR-0020). The fusion weights are configurable (default: 70 % vector, 30 % BM25). The top 20 candidates are retrieved for reranking. The choice of an HNSW-style approximate-nearest-neighbour index over exact nearest neighbour at our corpus size (10 K+ chunks) follows the latency-quality trade-off documented by Johnson et al. 2017 for billion-scale similarity search.
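A weighted form of Reciprocal Rank Fusion can be sketched as below. How the 70/30 weights enter the fusion is not specified here, so multiplying each list's reciprocal-rank term by its weight is an assumption; the constant k = 60 is the value used by Cormack et al. 2009, and the function name `weighted_rrf` is hypothetical.

```python
from collections import defaultdict

RRF_K = 60  # smoothing constant from Cormack et al. 2009


def weighted_rrf(
    vector_ranking: list[str],
    bm25_ranking: list[str],
    vector_weight: float = 0.7,
    bm25_weight: float = 0.3,
    k: int = RRF_K,
    top_n: int = 20,
) -> list[str]:
    """Fuse two ranked lists of chunk IDs; the best candidates come first."""
    scores: dict[str, float] = defaultdict(float)
    for weight, ranking in ((vector_weight, vector_ranking), (bm25_weight, bm25_ranking)):
        for rank, chunk_id in enumerate(ranking, start=1):
            # Each list contributes weight / (k + rank) for every chunk it ranks.
            scores[chunk_id] += weight / (k + rank)
    # Sort by descending score; ties are broken by chunk ID for determinism.
    ordered = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [chunk_id for chunk_id, _ in ordered[:top_n]]


print(weighted_rrf(["a", "b", "c"], ["c", "a", "d"]))  # ['a', 'c', 'b', 'd']
```

Because RRF uses ranks rather than raw scores, the cosine similarities and BM25 scores never need to be normalised onto a common scale before fusion.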

The vector store uses BGE-M3 embeddings (Chen et al. 2024, 1024 dim) with contextual enrichment: each chunk is embedded with a document-level summary prefix to improve retrieval for out-of-context passages (ADR-0019). (Post-submission update: the production system has since migrated to OpenAI text-embedding-3-large (OpenAI 2024, 1536 dim) per ADR-0048; the contextual-enrichment pre-pending behaviour is unchanged.)

Stage 7: Reranking

A cross-encoder reranker (ms-marco-MiniLM-L-6-v2) re-scores the top 20 candidates from hybrid retrieval, following the two-stage retrieval paradigm established by Nogueira and Cho 2019. Reranking is always-on in production mode (ADR-0026), as ablation experiments showed consistent quality improvement. The reranker produces calibrated scores that feed into the CRAG quality assessment.

Stage 8: Graph Enrichment

When the retrieval strategy includes graph queries, PostgreSQL taxonomy tables are queried using parameterised SQL queries that leverage the taxonomy entity types. The structural pattern — taxonomy lookup feeding LLM prompt context — is the same hybrid-RAG / KG-enriched-RAG pattern documented by Sarmah et al. 2024 and, in the biomedical domain, by Soman et al. 2024, although our "graph" is a relational taxonomy_relationships table rather than a dedicated graph database.

  • Doctor queries: Match by canonical name, return department, campus, and specialties.
  • Condition queries: Match by canonical name or SNOMED synonyms, return handling departments and associated treatments.
  • Department queries: Match by name or alias, return doctors, conditions treated, and campus location.
  • Multi-hop queries: Traverse 3-4 hops for complex questions (e.g., condition → department → doctors → campus).

Graph results are formatted as structured text and injected into the context alongside vector-retrieved passages. Critically, graph injection is conditional: it is only applied when the graph contributes additional information not already present in the vector results. This finding — that unconditional graph injection can actually harm answer quality for some query types — emerged from the ablation study (see Chapter 4).
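The conditional-injection rule can be illustrated with a deliberately simple sketch. The novelty test used here (a graph fact counts as already present if it occurs verbatim, ignoring case, in the retrieved passages) is an assumption chosen for brevity; the criterion the system actually applies is not described in this section, and the function names are hypothetical.

```python
def novel_graph_facts(graph_facts: list[str], passages: list[str]) -> list[str]:
    """Keep only graph facts not already present in the vector passages."""
    haystack = " ".join(passages).lower()
    return [fact for fact in graph_facts if fact.lower() not in haystack]


def build_graph_section(graph_facts: list[str], passages: list[str]) -> str:
    """Return a structured-text block, or an empty string if nothing is new."""
    novel = novel_graph_facts(graph_facts, passages)
    if not novel:
        return ""  # graph adds nothing: skip injection entirely
    return "Graph facts:\n" + "\n".join(f"- {fact}" for fact in novel)


passages = ["Cardiologie is located on Campus A."]
facts = ["Cardiologie is located on Campus A", "Dr. X works in Cardiologie"]
print(build_graph_section(facts, passages))
```

A production-grade check would more plausibly compare resolved entities or embeddings than raw strings, but the control flow is the same: the graph section is omitted whenever it would only repeat the vector context.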

Stage 9: Context Building

Retrieved passages and graph results are assembled into a context string within a configurable token budget (default: 8 000 tokens). The 8 000-token budget is chosen to leave headroom below the 16 K context windows of the production-tier models while remaining well below the threshold at which Liu et al. 2024 document significant lost-in-the-middle attention degradation. The context builder:

  1. Applies CRAG assessment: classify retrieval confidence as correct/ambiguous/incorrect (ADR-0038).
  2. For ambiguous results, triggers a refinement retry with relaxed parameters.
  3. Applies FILCO sentence-level filtering when enabled: removes irrelevant sentences from retrieved passages.
  4. Deduplicates overlapping content between vector and graph results.
  5. Formats citations for source attribution.
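Two of the steps above, the ternary CRAG verdict and packing passages into the token budget, can be sketched as follows. The score cut-offs (0.7 and 0.3), the whitespace-based token estimate, and all names are assumptions introduced for illustration; only the three verdict labels and the 8 000-token default come from the text.

```python
from dataclasses import dataclass

CORRECT_THRESHOLD = 0.7    # hypothetical cut-off, not from the source
INCORRECT_THRESHOLD = 0.3  # hypothetical cut-off, not from the source
TOKEN_BUDGET = 8000        # default budget stated in the text


@dataclass
class Passage:
    text: str
    score: float  # reranker score, assumed to lie in [0, 1]


def crag_verdict(passages: list[Passage]) -> str:
    """Classify retrieval confidence as correct, ambiguous, or incorrect."""
    if not passages:
        return "incorrect"
    best = max(p.score for p in passages)
    if best >= CORRECT_THRESHOLD:
        return "correct"
    if best < INCORRECT_THRESHOLD:
        return "incorrect"
    return "ambiguous"


def estimate_tokens(text: str) -> int:
    """Crude proxy: one token per whitespace-separated word."""
    return len(text.split())


def pack_context(passages: list[Passage], budget: int = TOKEN_BUDGET) -> list[str]:
    """Greedily add the highest-scoring, non-duplicate passages that still fit."""
    selected: list[str] = []
    seen: set[str] = set()
    used = 0
    for passage in sorted(passages, key=lambda p: p.score, reverse=True):
        key = " ".join(passage.text.lower().split())
        if key in seen:
            continue  # exact duplicate after normalisation
        cost = estimate_tokens(passage.text)
        if used + cost > budget:
            continue  # does not fit; a smaller passage may still fit later
        seen.add(key)
        selected.append(passage.text)
        used += cost
    return selected
```

A real implementation would count tokens with the tokenizer of the target model; the whitespace estimate is used here only to keep the sketch dependency-free.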

Stage 10: LLM Generation

The assembled context, query, and system prompt are sent to the LLM for response generation. The system uses English system prompts following GPT-4.1 best practices (ADR-0027) with structured sections and few-shot examples. The response is generated in the user's detected language, regardless of the Dutch-language context.

Table 3.2. 4-tier LLM routing configuration.

| Tier | Model | Use Case | Cost |
| --- | --- | --- | --- |
| Tier 1 | gpt-4.1-nano | Followup classification, simple routing | Lowest |
| Tier 2 | gpt-4.1-mini | Intent classification, entity validation | Low |
| Tier 3 | gpt-4.1 | RAG generation (primary) | Medium |
| Escalation | gpt-5.2 | "Think Harder" escalated queries | Highest |

Token streaming (ADR-0026) delivers response tokens to the client in real-time via WebSocket, improving perceived responsiveness.

Stage 11: Response Post-Processing

The final stage applies safety validation and formatting:

  1. Quality gate: Background evaluation scores faithfulness (threshold: 60%) and answer relevancy (threshold: 50%). Responses below thresholds trigger auto-refusal.
  2. Safety judge (optional): LLM-as-judge validation checks for medical advice in the generated response.
  3. Guardrails (optional): Llama Guard 3 classifies the response across multiple safety dimensions.
  4. Disclaimer: A language-appropriate medical disclaimer is appended to every response.
  5. Citation formatting: Source references are formatted as clickable links to ZOL website pages.
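The quality gate and disclaimer steps can be expressed as a short sketch. The thresholds are those listed above; treating a score exactly at the threshold as a pass, running the gate synchronously rather than in the background, and the function names are all assumptions.

```python
FAITHFULNESS_THRESHOLD = 0.60  # from the text
RELEVANCY_THRESHOLD = 0.50     # from the text


def passes_quality_gate(faithfulness: float, answer_relevancy: float) -> bool:
    """Both metrics must reach their thresholds (inclusive, by assumption)."""
    return (
        faithfulness >= FAITHFULNESS_THRESHOLD
        and answer_relevancy >= RELEVANCY_THRESHOLD
    )


def finalize(
    answer: str,
    faithfulness: float,
    answer_relevancy: float,
    refusal: str,
    disclaimer: str,
) -> str:
    """Replace a failing answer with the refusal, then append the disclaimer."""
    body = answer if passes_quality_gate(faithfulness, answer_relevancy) else refusal
    # The disclaimer is appended to every response, refusals included.
    return f"{body}\n\n{disclaimer}"
```

Requiring both metrics to pass means a fluent but unsupported answer (high relevancy, low faithfulness) is refused just like an off-topic one.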

Figure 3.2. The 11-stage query pipeline. Stages are executed sequentially, with cache hits bypassing stages 4-11. Safety gates at stages 2 and 11 can terminate the pipeline early with refusal responses.

3.3 Knowledge Graph Design

3.3.1 Entity Types and Relationships

The knowledge graph uses six primary entity types and one organizational entity:

Table 3.3. Knowledge graph entity types and properties.

| Entity Type | Count | Properties |
| --- | --- | --- |
| Doctor (Arts) | 352 | name, canonical_name, specialty |
| Department (Afdeling) | 64 | name, canonical_name, aliases |
| Condition (Aandoening) | 169 | name, canonical_name, snomed_concept_id, snomed_synonyms |
| Treatment (Behandeling) | 207 | name, canonical_name, snomed_concept_id |
| Examination (Onderzoek) | 103 | name, canonical_name, aliases |
| Campus | 4 | name, canonical_name |
| Hospital | 1 | name |

Relationships include WORKS_IN (doctor→department), HANDLES (department→condition), TREATS (treatment→condition), PERFORMS (department→examination), LOCATED_AT (department→campus), HAS_CAMPUS (hospital→campus), and BELONGS_TO (campus→hospital).

Figure 3.3. Knowledge graph entity-relationship schema. Six primary entity types are connected by typed relationships enabling multi-hop traversal for complex medical queries.

3.3.2 Entity Extraction Pipeline

Entity extraction follows a three-stage process:

  1. Regex-based extraction: Named entity patterns are applied to crawled page content, identifying potential medical entities by syntactic context.
  2. LLM validation (ADR-0013): Extracted entities are validated by an LLM call (gpt-4.1-mini) that confirms entity types, normalizes names, and generates page summaries. A cross-page cache prevents redundant LLM calls for previously validated entities.
  3. Taxonomy-driven normalization (ADR-0014): Validated entities are normalized against the frozen taxonomy — a curated set of canonical names, aliases, and plausibility rules. The taxonomy prevents "scope leakage" (e.g., a Cardiology page extracting Oncology conditions mentioned in passing) through positive and negative domain guards.

3.3.3 SNOMED CT Enrichment

The SNOMED CT integration (ADR-0015) adds medical terminology intelligence in three phases:

  • Phase A: The full SNOMED CT Belgian Edition RF2 dataset is loaded into PostgreSQL (356K concepts, 656K descriptions, 1.2M relationships, 4.7M transitive closure entries).
  • Phase B: Taxonomy entities are enriched with snomed_concept_id, snomed_preferred_term, and snomed_synonyms properties, enabling taxonomy queries that match against validated medical synonyms.
  • Phase C: A JSON synonym cache generated after each graph seeding run provides fast query-time synonym resolution without database lookups.

3.4 Safety Architecture

The safety architecture implements defense in depth with five independent layers:

Layer 1: Intent Classification (Pre-Retrieval)

An LLM-based classifier (gpt-4.1-mini) categorizes each query by intent. Medical advice queries are blocked before any retrieval occurs. Eight regex patterns provide additional fast-path detection of common injection attempts.

Layer 2: GCG Adversarial Detection (Pre-LLM)

A statistical anomaly detector (ADR-0036) identifies GCG-style adversarial inputs (Zou et al. 2023; generalised by Liao et al. 2024) using four heuristics: dictionary-word ratio against a 5 K Dutch word list, character-bigram entropy, consecutive non-alphabetic-character sequences, and special-token ratio. Both dictionary and entropy checks must fail simultaneously to trigger blocking, minimising false positives. Processing time is under 5 ms with no LLM call required. The threat-class taxonomy follows OWASP 2025 LLM Top 10 (LLM01 Prompt Injection).
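A minimal sketch of the two decisive heuristics, dictionary-word ratio and character-bigram entropy, is given below. It omits the non-alphabetic-sequence and special-token checks. The cut-off values, the assumption that high entropy counts as a failed check, and all function names are illustrative and are not the values or identifiers used in ADR-0036.

```python
import math
import re
from collections import Counter

MIN_DICTIONARY_RATIO = 0.3  # hypothetical cut-off
MAX_BIGRAM_ENTROPY = 4.5    # hypothetical cut-off, in bits


def dictionary_ratio(text: str, dictionary: set[str]) -> float:
    """Fraction of alphabetic tokens that appear in the word list."""
    words = re.findall(r"[^\W\d_]+", text.lower())
    if not words:
        return 0.0
    return sum(word in dictionary for word in words) / len(words)


def bigram_entropy(text: str) -> float:
    """Shannon entropy (bits) of the character-bigram distribution."""
    chars = text.lower()
    bigrams = [chars[i:i + 2] for i in range(len(chars) - 1)]
    if not bigrams:
        return 0.0
    total = len(bigrams)
    return sum((n / total) * math.log2(total / n) for n in Counter(bigrams).values())


def is_adversarial(
    text: str,
    dictionary: set[str],
    min_ratio: float = MIN_DICTIONARY_RATIO,
    max_entropy: float = MAX_BIGRAM_ENTROPY,
) -> bool:
    """Block only when the dictionary check AND the entropy check both fail."""
    dictionary_failed = dictionary_ratio(text, dictionary) < min_ratio
    entropy_failed = bigram_entropy(text) > max_entropy
    return dictionary_failed and entropy_failed
```

Requiring both checks to fail is what keeps false positives low: a query full of rare medical terms fails the dictionary check but still has natural-language bigram statistics, so it passes.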

Layer 3: Quality Gate (Post-Retrieval)

The CRAG ternary quality gate assesses retrieval confidence before generation. Queries with insufficient retrieval quality are refused rather than generating potentially unsupported responses.

Layer 4: LLM Safety Judge (Post-Generation)

A secondary LLM call evaluates the generated response for medical advice content. This catches cases where safe-looking queries elicit unsafe responses through subtle context manipulation. The judge has a 3-second timeout and is skippable for safe intent categories (greeting, off_topic) to optimize cost.

Layer 5: Guardrails — Llama Guard 3 (Post-Generation)

Meta's Llama Guard 3 provides model-independent input/output safety classification. It evaluates content across multiple safety dimensions and, because it is a different model architecture from the generation LLM, provides an independent safety assessment.

The system maintains a strict KPI of zero medical-advice incidents across all testing and evaluation.

3.5 Evaluation Methodology

3.5.1 Golden Standard Framework

The evaluation framework (ADR-005) defines 302 golden questions organized into 21 categories:

Table 3.4. Golden evaluation question categories (n = 302).

| Category | Questions | Purpose |
| --- | --- | --- |
| adversarial_gcg | 12 | GCG attack detection |
| ambiguous_symptom | 5 | Ambiguous medical queries |
| campus_info | 6 | Campus location queries |
| compound_word | 6 | Dutch compound word resolution |
| condition_department | 19 | Condition-to-department mapping |
| doctor_department | 6 | Doctor-to-department lookup |
| emergency | 3 | Emergency information |
| entity_disambiguation | 8 | Entity name conflicts |
| followup_chain | 6 | Multi-turn conversation |
| multi_hop_graph | 19 | Multi-hop graph reasoning |
| multilingual | 8 | Cross-language queries |
| navigation | 5 | Website navigation |
| out_of_scope | 12 | Out-of-domain queries |
| practical_info | 12 | Practical hospital info |
| referral | 3 | Referral requirements |
| safety_refusal | 9 | Medical advice refusal |
| service_info | 9 | Hospital service queries |
| snomed_terminology | 15 | SNOMED synonym resolution |
| taxonomy_alias | 7 | Taxonomy alias resolution |
| treatment_info | 8 | Treatment information |

Each question specifies expected entities (for entity recall computation), expected behavior (answer or refuse), and category metadata for stratified analysis.

3.5.2 Automated Evaluation

The evaluation pipeline runs each golden question through the full 11-stage pipeline and evaluates the response using:

  1. Entity recall (primary): Fraction of expected entities found in the response. Pass threshold: 0.5. The 0.5 threshold was selected based on the observation that responses containing at least half of the expected entities consistently provide useful navigational information to the user. A sensitivity analysis across thresholds [0.3, 0.4, 0.5, 0.6, 0.7] showed that 0.5 maximizes the separation between clearly correct responses (median entity recall: 1.0) and clearly incorrect responses (median entity recall: 0.0), with few responses falling in the ambiguous zone near the threshold.
  2. Safety refusal accuracy: For safety questions, whether the system correctly refused.
  3. RAGAS metrics (optional): Faithfulness, answer relevancy, context precision, and context recall using DeepEval with a GPT-5.2 judge.
  4. Timing metrics: Per-stage and total response time.
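The primary metric can be computed with a few lines of code. The sketch below assumes case-insensitive substring matching of each expected entity against the response text; the matching rule actually used by the evaluation pipeline is not specified here, and the function names are hypothetical.

```python
PASS_THRESHOLD = 0.5  # pass threshold stated in the text


def entity_recall(expected: list[str], response: str) -> float:
    """Fraction of expected entities that occur in the response."""
    if not expected:
        return 1.0  # nothing was expected, so nothing is missing
    text = response.lower()
    found = sum(1 for entity in expected if entity.lower() in text)
    return found / len(expected)


def passes(expected: list[str], response: str, threshold: float = PASS_THRESHOLD) -> bool:
    """A response passes when at least the threshold share of entities is found."""
    return entity_recall(expected, response) >= threshold


print(entity_recall(["alpha", "beta", "gamma", "delta"], "Alpha and beta only"))  # 0.5
```

Because the computation involves no model call, repeated runs on the same response always yield the same score, which is why entity recall serves as the deterministic anchor of the evaluation.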

The evaluation can run in fast mode (entity recall only, ~25 minutes for 302 questions) or full mode (all RAGAS metrics, ~90 minutes).

3.5.3 Ablation Study Design

A fractional-factorial experiment design (Wohlin et al. 2012) — a standard approach in experimental software engineering for reducing the number of required configurations while maintaining the ability to estimate main effects — isolates the contribution of three features: CRAG, FILCO, and Guardrails. Five configurations are tested:

Table 3.5. Ablation study configurations (fractional factorial design).

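| Configuration | CRAG | FILCO | Guardrails |
| --- | --- | --- | --- |
| baseline-all-off | OFF | OFF | OFF |
| crag-only | ON | OFF | OFF |
| filco-only | OFF | ON | OFF |
| guardrails-only | OFF | OFF | ON |
| all-three-on | ON | ON | ON |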

Each configuration runs the full 178-question golden evaluation, enabling direct comparison of pass rates, entity recall, response time, and per-category performance.
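The relationship between the five tested configurations and the full 2^3 factorial can be made explicit with a short enumeration. The dictionary `TESTED` mirrors Table 3.5; the helper names are hypothetical.

```python
from itertools import product

FEATURES = ("CRAG", "FILCO", "Guardrails")

# Tested configurations from Table 3.5, as (CRAG, FILCO, Guardrails) flags.
TESTED = {
    "baseline-all-off": (False, False, False),
    "crag-only": (True, False, False),
    "filco-only": (False, True, False),
    "guardrails-only": (False, False, True),
    "all-three-on": (True, True, True),
}


def untested_configurations() -> list[tuple[bool, ...]]:
    """Configurations of the full factorial that the study did not run."""
    tested = set(TESTED.values())
    return [
        combo
        for combo in product((False, True), repeat=len(FEATURES))
        if combo not in tested
    ]


def describe(combo: tuple[bool, ...]) -> str:
    """Human-readable label listing the enabled features."""
    enabled = [name for name, on in zip(FEATURES, combo) if on]
    return "+".join(enabled) if enabled else "none"


print([describe(c) for c in untested_configurations()])
# ['FILCO+Guardrails', 'CRAG+Guardrails', 'CRAG+FILCO']
```

The three configurations left out are exactly the pairwise combinations, which is why this design supports main-effect estimates but cannot separate two-way interactions.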

3.6 Threats to Validity

The evaluation design has several limitations that the reader should weigh against the headline numbers reported in Chapter 4. Following the structure proposed by Wohlin et al. (Wohlin et al. 2012), the threats are grouped under four categories.

Internal Validity

Several factors could confound the observed results.

  1. LLM non-determinism. Repeated evaluation runs may produce slightly different outputs: response generation uses temperature=0.2, and even calls made at temperature=0.0 are subject to non-zero provider-side stochasticity. Bootstrap confidence intervals (Efron and Tibshirani 1993) (Section 4.1.2) quantify this variability with 10 000 resamples.
  2. Test-set authoring bias. The golden evaluation questions were designed by the development team, creating a risk that the test suite is inadvertently tuned to the system's strengths. We mitigate this by including adversarial-GCG, multilingual, and SNOMED-terminology categories that were added independently of observed system behaviour (i.e., the categories were added before the implementation that supports them). The risk is acknowledged honestly: the golden set is a regression-detection instrument, not a substitute for an independent benchmark.
  3. Ablation-design confounding. The fractional-factorial ablation design (5 of 8 possible configurations) confounds pairwise interaction effects. The missing configurations (CRAG+FILCO, CRAG+Guardrails, FILCO+Guardrails) would be needed to fully disentangle feature interactions. Time constraints motivated the fractional design; the missing configurations are identified as future work.
  4. Evaluation-LLM bias. The LLM-as-judge evaluator (GPT-5.2) is itself a model whose biases could systematically reward or penalise particular response styles. The use of deterministic entity recall as the primary metric reduces but does not eliminate this risk.
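The bootstrap confidence intervals mentioned in item 1 can be sketched with a percentile bootstrap over per-question pass/fail outcomes. Resampling at the question level, the 95 % confidence level, the fixed seed, and the function name are assumptions; only the 10 000 resamples are taken from the text.

```python
import random


def bootstrap_pass_rate_ci(
    outcomes: list[int],
    n_resamples: int = 10_000,
    confidence: float = 0.95,
    seed: int = 0,
) -> tuple[float, float]:
    """Percentile-bootstrap confidence interval for the mean of 0/1 outcomes."""
    rng = random.Random(seed)  # seeded for reproducible reports
    n = len(outcomes)
    # Resample the questions with replacement and record each pass rate.
    rates = sorted(sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples))
    alpha = (1.0 - confidence) / 2.0
    lower = rates[int(alpha * n_resamples)]
    upper = rates[min(n_resamples - 1, int((1.0 - alpha) * n_resamples))]
    return lower, upper


# Illustrative input: 90 passes and 10 failures.
low, high = bootstrap_pass_rate_ci([1] * 90 + [0] * 10, n_resamples=2000)
print(f"95% CI for pass rate: [{low:.2f}, {high:.2f}]")
```

An interval built this way captures uncertainty from the finite question set; capturing run-to-run LLM variability as well would require resampling across repeated evaluation runs, not just across questions.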

External Validity

  1. Single-hospital scope. All evaluation uses ZOL-specific content, taxonomy, and golden questions. Generalisation to other hospitals would require rebuilding the domain-specific components (taxonomy, plausibility guards, golden questions). The architectural pattern is intended to generalise; the measured results do not.
  2. Provider dependency. The system's dependence on specific LLM model versions (GPT-4.1 family at the time of evaluation; multiple migrations since) means that API version changes could affect results. ADRs (notably ADR-0048 for the embedding-model migration) preserve the decision context for these model swaps.
  3. Multilingual coverage. The BGE-M3 embedding model's multilingual performance may vary for languages not well represented in its training data. The 87.5 % multilingual pass rate is the lowest of any non-safety category; this is a real limitation acknowledged in Chapter 5.

Construct Validity

  1. Entity recall as primary metric. Entity recall (threshold 0.5) measures whether expected entities appear in the response but does not capture response coherence, helpfulness, or appropriate level of detail. The 0.5 threshold was chosen to balance strictness (capturing most expected information) with tolerance for valid alternative phrasings where the LLM correctly answers the question using different terminology. A sensitivity analysis across [0.3, 0.4, 0.5, 0.6, 0.7] showed that 0.5 maximises the separation between clearly correct (median 1.0) and clearly incorrect (median 0.0) responses.
  2. Composite quality gate. When LLM-as-judge faithfulness is low for taxonomy-injected content, a composite gate combining faithfulness with entity recall and answer relevancy is used (see evaluation/composite-quality-gate.md). This is a documented evaluation methodology choice, not a mechanism to mask system failures; raw faithfulness is always recorded alongside the composite verdict.
  3. No external benchmark anchor. The evaluation is internal-only. External benchmarks such as BEIR (Thakur et al. 2021) and MTEB (Muennighoff et al. 2022) provide reproducible reference points for retrieval-component validation but do not measure the hospital-specific end-to-end quality that is the system's raison d'être. A three-layer evaluation architecture combining external benchmarks (Layer 1), domain-specific retrieval benchmark (Layer 2), and end-to-end golden questions (Layer 3) is described in evaluation/academic-critical-assessment.md.

Reliability

LLM-based evaluation introduces stochasticity: the same response may receive different faithfulness scores across runs. We mitigate this by using entity recall (deterministic) as the primary metric and reserving LLM-based metrics for supplementary analysis. All evaluation runs log the git commit hash, model versions, and feature-flag configuration. The dated, immutable evaluation reports under docusaurus/zol-documentation/docs/evaluation/reports/ are the empirical truth-source for any number cited in Chapter 4.

3.7 Ethical Considerations

Research-Subject Status

As stated in Section 1.6.1, the empirical work reported in this thesis involves no human subjects. All evaluation results derive from synthetic golden questions authored by the development team. The work as reported is therefore exempt from ethical review. A real-user study (Chapter 6) would require ethics review on its own terms.

Data Privacy and GDPR Compliance

The system processes user search queries, which may contain personally identifiable information (e.g., a patient mentioning their own name or condition). All query data is stored in a PostgreSQL database with row-level tenant isolation. The system does not share query data with third parties beyond the LLM API provider (OpenAI), whose data-processing role under GDPR Art. 28 is governed by the data controller (ZOL or its assigned operator).

The data-protection design aligns with Regulation (EU) 2016/679 (GDPR) Articles 5 (lawfulness, purpose limitation, storage limitation), 6 (lawful bases), 25 (data protection by design), 30 (records of processing), 32 (security of processing), and 35 (data-protection impact assessment). The PII-redaction pattern set is informed by the eighteen identifier categories of the US HIPAA Safe Harbor de-identification method; HIPAA is cited only as a comparative reference, since the applicable regulatory regime is GDPR. Information-security alignment targets ISO/IEC 27001:2022 (the project is not certified). Authentication uses OpenID Connect / OAuth 2.0 (RFC 6749, RFC 7519, OpenID Connect Core 1.0) via Keycloak.

EU AI Act Risk Classification and MDR Negative Classification

Under Regulation (EU) 2024/1689 (AI Act), the system is positioned as an information-retrieval and navigation tool. We document a risk-management plan (Art. 9), data-governance practice (Art. 10), technical documentation (Art. 11), record-keeping (Art. 12), transparency (Art. 13), human oversight (Art. 14), and accuracy/robustness/cybersecurity measures (Art. 15) in docs/safety/ai-act-compliance.md. Article 50 transparency obligations are met by explicit disclosure that the user is interacting with an AI system.

We argue that the system is not a medical device under Regulation (EU) 2017/745 (MDR): per Art. 2(1) and Annex VIII Rule 11, the system performs informational and navigational tasks rather than providing diagnostic or therapeutic information about specific patients. The negative classification is the load-bearing argument for the safety posture and is documented in docs/safety/overview.md. The HLEG Ethics Guidelines for Trustworthy AI (European Commission HLEG 2019) provide the policy lineage: the seven HLEG principles map onto AI Act Articles 13–15 and 50.

Every response includes a medical disclaimer informing users that the system provides navigational information, not medical advice. The system does not present itself as a medical professional. The chatbot interface clearly identifies itself as an AI-powered search tool, satisfying AI Act Art. 50 transparency disclosure. Source citations are provided for all factual claims, enabling users to verify information against the original ZOL website content.

Algorithmic Bias

The system's retrieval and generation quality depends on the coverage and quality of ZOL's website content. Conditions, departments, or treatments with less online documentation may receive lower-quality responses. The multilingual evaluation (87.5 % pass rate for non-Dutch queries vs 100 % for Dutch) confirms that language-based quality differences exist. Future work should evaluate whether specific patient demographics receive systematically different response quality.

Safety as an Ethical Imperative

In a healthcare information context, the consequence of providing incorrect medical advice can be severe. The five-layer safety architecture reflects the ethical principle that system designers bear responsibility for foreseeable misuse. The zero-incident KPI is not merely a technical metric but an ethical commitment.