Research Bibliography
This page predates the canonical bibliography at /docs/references. The entries below are kept for thesis-chapter cross-references and the discursive prose that contextualises each reference, but the canonical, machine-checkable list is /docs/references (rendered from docs/references.bib per ADR conventions). New citations across the documentation should land in references.bib and deep-link via /docs/references#bibkey.
This chapter presents the academic and technical literature that informs the architectural design, implementation decisions, and evaluation methodology of the ZOL Intelligent Search system. The references are organized thematically, with each section providing an introductory discussion of the topic's relevance to the project. All citations follow the APA 7th edition format.
1. Retrieval-Augmented Generation
Retrieval-Augmented Generation (RAG) constitutes the foundational paradigm upon which the ZOL system is built. RAG architectures address a fundamental limitation of large language models: their inability to access information beyond their training data, and their tendency to generate plausible but factually incorrect responses -- a phenomenon known as hallucination. In the context of hospital information retrieval, where factual accuracy is paramount and the information changes frequently (doctor schedules, department offerings, visiting hours), RAG provides a mechanism to ground LLM responses in verified, up-to-date source content.
The seminal contribution by Lewis et al. (2020) established the RAG framework by combining a parametric memory (a pre-trained language model) with a non-parametric memory (a document index accessed via dense retrieval). Their work demonstrated that this combination outperforms either component in isolation on knowledge-intensive NLP tasks. The ZOL system implements a production variant of this architecture, extending it with hybrid retrieval (vector + BM25 + knowledge graph), multi-layer safety filtering, and streaming response generation.
Gao et al. (2024) provide a comprehensive survey that classifies RAG systems into three generations: Naive RAG (basic retrieve-then-generate), Advanced RAG (with pre-retrieval and post-retrieval optimization), and Modular RAG (with interchangeable, composable components). The ZOL system incorporates elements of both Advanced RAG (intent-based routing, metadata boosting, quality gates) and Modular RAG (pluggable retrieval strategies, configurable model routing, taxonomy-driven normalization).
-
Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W., Rocktäschel, T., Riedel, S., & Kiela, D. (2020). Retrieval-augmented generation for knowledge-intensive NLP tasks. In Advances in Neural Information Processing Systems (NeurIPS 2020). https://arxiv.org/abs/2005.11401
-
Gao, Y., Xiong, Y., Gao, X., Jia, K., Pan, J., Bi, Y., Dai, Y., Sun, J., Wang, M., & Wang, H. (2024). Retrieval-augmented generation for large language models: A survey. arXiv preprint, arXiv:2312.10997. https://arxiv.org/abs/2312.10997
2. Advanced RAG Techniques
The ZOL system's query pipeline incorporates several architectural patterns documented in the recent RAG literature. Self-RAG (Asai et al., 2024) introduces the concept of self-reflection during retrieval and generation, where the model learns to decide when to retrieve and how to critique its own output. While the ZOL system does not implement full Self-RAG, its hybrid quality evaluation (fast embedding-similarity gate followed by asynchronous DeepEval metrics) serves an analogous purpose: ensuring that generated responses are faithful to retrieved context before delivery.
Corrective Retrieval Augmented Generation (Yan et al., 2024) proposes evaluating retrieval quality before generation, classifying retrieved documents as Correct, Incorrect, or Ambiguous. This pre-generation filtering pattern represents a potential evolution path for the ZOL system's current post-generation quality gate architecture.
RAPTOR (Sarthi et al., 2024) introduces hierarchical summarization through recursive clustering, enabling retrieval at multiple levels of abstraction. This approach addresses a limitation of the ZOL system's flat chunking strategy, where multi-document coherence is lost. HyPE-RAG (Vake et al., 2025) proposes embedding hypothetical questions alongside document chunks, a technique partially implemented in the ZOL system through canonical question generation during ingestion.
Agentic RAG (Singh et al., 2025) represents the frontier of RAG architectures, where autonomous agents dynamically select retrieval strategies, tools, and reasoning paths. The ZOL system's intent-based routing and taxonomy-driven query resolution represent preliminary steps toward agentic behaviour, though the current pipeline follows a fixed execution sequence rather than adaptive agent-based orchestration.
-
Asai, A., Wu, Z., Wang, Y., Sil, A., & Hajishirzi, H. (2024). Self-RAG: Learning to retrieve, generate, and critique through self-reflection. In Proceedings of the 12th International Conference on Learning Representations (ICLR 2024). https://arxiv.org/abs/2310.11511
-
Yan, S., Gu, J., Zhu, Y., & Ling, Z. (2024). Corrective retrieval augmented generation. arXiv preprint, arXiv:2401.15884. https://arxiv.org/abs/2401.15884
-
Sarthi, P., Abdullah, S., Tuli, A., Khanna, S., Goldie, A., & Manning, C. D. (2024). RAPTOR: Recursive abstractive processing for tree-organized retrieval. In Proceedings of the 12th International Conference on Learning Representations (ICLR 2024). https://arxiv.org/abs/2401.18059
-
Vake, L., Stanny, O., & Guthrie, R. (2025). HyPE-RAG: Hypothetical prompt embeddings for retrieval-augmented generation. SSRN Electronic Journal. https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5139335
-
Singh, A., Ehtesham, A., Kumar, S., & Talaei Khoei, T. (2025). Agentic retrieval-augmented generation: A survey on agentic RAG. arXiv preprint, arXiv:2501.09136. https://arxiv.org/abs/2501.09136
3. Graph-Augmented Retrieval
The integration of knowledge graphs with RAG systems addresses a fundamental limitation of vector-only retrieval: the inability to traverse structured relationships between entities. In the hospital domain, the question "Which doctor treats breast cancer at Campus Sint-Jan?" requires traversing Doctor--Department--Condition--Campus relationships -- a query that cannot be reliably answered through semantic similarity search alone.
HybridRAG (Sarmah et al., 2024) formalises the combination of knowledge graph retrieval with vector retrieval, demonstrating that the hybrid approach outperforms either modality independently. GraphRAG (Edge et al., 2024) takes this further by using LLMs to construct and query knowledge graphs from source documents, enabling global summarisation capabilities. The ZOL system implements a variant of HybridRAG where the knowledge graph is populated through regex extraction with LLM validation, and graph queries are executed in parallel with vector search using asyncio.gather().
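The parallel execution pattern mentioned above can be illustrated with a minimal sketch. The function names `vector_search`, `graph_search`, and `hybrid_retrieve` are hypothetical stand-ins, not the ZOL system's actual API; only `asyncio.gather()` is taken from the text.

```python
import asyncio


async def vector_search(query: str) -> list[str]:
    """Hypothetical stand-in for a pgvector similarity search."""
    await asyncio.sleep(0.01)  # simulate I/O latency
    return [f"chunk about {query}"]


async def graph_search(query: str) -> list[str]:
    """Hypothetical stand-in for a Cypher query against the knowledge graph."""
    await asyncio.sleep(0.01)  # simulate I/O latency
    return [f"graph fact about {query}"]


async def hybrid_retrieve(query: str) -> dict[str, list[str]]:
    # gather() runs both coroutines concurrently and returns their
    # results in the same order as the arguments.
    chunks, facts = await asyncio.gather(vector_search(query), graph_search(query))
    return {"chunks": chunks, "facts": facts}


result = asyncio.run(hybrid_retrieve("cardiology"))
```

Because both lookups are I/O-bound, the total latency in this sketch is roughly that of the slower call rather than the sum of the two.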
MedRAG (Shang et al., 2025) specifically addresses medical RAG by leveraging knowledge graph-elicited reasoning, demonstrating significant reductions in misdiagnosis rates. This work validates the ZOL system's architectural investment in the knowledge graph, particularly for entity-rich queries where structured relationships between departments, conditions, and treatments are essential for accurate responses.
Pan et al. (2024) provide a comprehensive roadmap for unifying large language models and knowledge graphs, identifying three paradigms: KG-enhanced LLMs, LLM-augmented KGs, and synergised LLMs+KGs. The ZOL system primarily follows the KG-enhanced LLM paradigm, where the knowledge graph provides structured context that is injected into the LLM prompt alongside vector-retrieved document chunks.
-
Sarmah, B., Hall, B., Rao, R., Patel, S., Pasquali, S., & Mehta, D. (2024). HybridRAG: Integrating knowledge graphs and vector retrieval augmented generation for efficient information extraction. arXiv preprint, arXiv:2408.04948. https://arxiv.org/abs/2408.04948
-
Edge, D., Trinh, H., Cheng, N., Bradley, J., Chao, A., Mody, A., Truitt, S., & Larson, J. (2024). From local to global: A graph RAG approach to query-focused summarization. Microsoft Research. https://github.com/microsoft/graphrag
-
Peng, B., Galley, M., He, P., Cheng, H., Xie, Y., Hu, Y., Huang, Q., Liden, L., Yu, Z., Chen, W., & Gao, J. (2025). Retrieval-augmented generation with graphs (GraphRAG). arXiv preprint, arXiv:2501.00309. https://arxiv.org/abs/2501.00309
-
Shang, J., et al. (2025). MedRAG: Enhancing retrieval-augmented generation with knowledge graph-elicited reasoning for healthcare copilot. In Proceedings of the ACM Web Conference 2025. https://dl.acm.org/doi/10.1145/3696410.3714782
-
Pan, S., Luo, L., Wang, Y., Chen, C., Wang, J., & Wu, X. (2024). Unifying large language models and knowledge graphs: A roadmap. IEEE Transactions on Knowledge and Data Engineering. https://arxiv.org/abs/2306.08302
4. Medical Domain RAG and Healthcare AI
The application of RAG systems in healthcare presents unique challenges and opportunities. Medical information retrieval demands higher accuracy than general-purpose search, as incorrect information can influence patient decisions. Simultaneously, the regulatory environment (EU AI Act, Medical Device Regulation) imposes constraints on how AI systems may interact with patients.
Gargari and Habibi (2025) provide a narrative review of RAG applications in healthcare, identifying key benefits including reduced hallucination, improved source attribution, and the ability to incorporate institution-specific knowledge. Zakka et al. (2024) conduct a systematic review of RAG for healthcare LLMs, establishing evaluation criteria and identifying common failure modes.
The ZOL system positions itself as a "search tool" (zoekfunctie) rather than a "clinical decision support system," a deliberate architectural decision that places it outside the scope of the EU Medical Device Regulation (2017/745). This distinction is critical: the system provides navigational assistance (which department to contact, which doctor to see) rather than diagnostic or therapeutic recommendations.
-
Gargari, O. K., & Habibi, G. (2025). Enhancing medical AI with retrieval-augmented generation: A mini narrative review. SAGE Digital Health. https://pmc.ncbi.nlm.nih.gov/articles/PMC12059965/
-
Zakka, C., Chaurasia, A., Shad, R., Dalal, A. R., Kim, J. L., Moor, M., Topol, E. J., & Hiesinger, W. (2024). Retrieval augmented generation for large language models in healthcare: A systematic review. PLOS Digital Health. https://journals.plos.org/digitalhealth/article?id=10.1371/journal.pdig.0000877
-
European Parliament. (2024). Regulation (EU) 2024/1689 laying down harmonised rules on artificial intelligence (Artificial Intelligence Act). Official Journal of the European Union. https://eur-lex.europa.eu/eli/reg/2024/1689/oj
-
European Parliament. (2017). Regulation (EU) 2017/745 on medical devices (Medical Device Regulation). Official Journal of the European Union. https://eur-lex.europa.eu/eli/reg/2017/745/oj
5. Information Retrieval Foundations
The ZOL system's retrieval layer combines three complementary paradigms: sparse lexical retrieval (BM25), dense semantic retrieval (vector search), and structured graph retrieval (Cypher queries). The theoretical foundations of these approaches are well-established in the information retrieval literature.
Robertson and Zaragoza (2009) provide the definitive treatment of the BM25 scoring function within the probabilistic relevance framework. BM25 remains the gold standard for lexical retrieval, particularly effective for exact-match queries (e.g., doctor names, medical procedure codes) where semantic similarity measures may underperform.
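Reciprocal Rank Fusion (Cormack et al., 2009) provides the score-agnostic fusion mechanism used by the ZOL system to combine ranked lists from vector search and BM25 search. Each document receives score = 1/(k + r + 1) from every list in which it appears, where r is taken to be its zero-based rank in that list and k is a smoothing constant, and the per-list contributions are summed; with one-based ranks this is the 1/(k + rank) form given by Cormack et al. Because only ranks enter the formula, no score normalisation is needed, which is particularly advantageous when combining results from fundamentally different scoring mechanisms (cosine similarity vs. BM25 term frequency).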
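As a minimal sketch of the fusion step, the following function sums 1/(k + r + 1) over the ranked lists, treating r as a zero-based rank. The function name, the document identifiers, and the tie-breaking rule are illustrative assumptions; k = 60 is the constant suggested by Cormack et al.

```python
from collections import defaultdict


def reciprocal_rank_fusion(
    rankings: list[list[str]], k: int = 60
) -> list[tuple[str, float]]:
    """Fuse ranked lists of document ids; r is the zero-based rank."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for r, doc_id in enumerate(ranking):
            scores[doc_id] += 1.0 / (k + r + 1)
    # Highest fused score first; ties broken by id for determinism (assumption).
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical result lists from the two retrieval channels.
vector_hits = ["doc_a", "doc_b", "doc_c"]
bm25_hits = ["doc_b", "doc_d", "doc_a"]
fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
```

In this toy input, `doc_b` ranks first because it sits near the top of both lists, even though neither list's raw scores are ever compared.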
Dense Passage Retrieval (Karpukhin et al., 2020) established the paradigm of using dual-encoder architectures for semantic retrieval, replacing traditional sparse retrieval with learned dense representations. The ZOL system uses this approach through OpenAI's text-embedding-3-large model (1536 dim, hosted) for query and document encoding — see ADR-0048. Earlier production builds used BGE-M3 (Chen et al., 2024) via Ollama before voice-channel latency drove the migration.
-
Robertson, S., & Zaragoza, H. (2009). The probabilistic relevance framework: BM25 and beyond. Foundations and Trends in Information Retrieval, 3(4), 333–389. https://doi.org/10.1561/1500000019
-
Cormack, G. V., Clarke, C. L. A., & Buettcher, S. (2009). Reciprocal rank fusion outperforms Condorcet and individual rank learning methods. In Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2009). https://doi.org/10.1145/1571941.1572114
-
Karpukhin, V., Oğuz, B., Min, S., Lewis, P., Wu, L., Edunov, S., Chen, D., & Yih, W. (2020). Dense passage retrieval for open-domain question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP 2020). https://arxiv.org/abs/2004.04906
6. Embedding Models and Vector Search
The choice of embedding model and vector index structure directly impacts retrieval quality and system performance. The ZOL system has migrated through three embedding models: nomic-embed-text (768-dim) → BGE-M3 (Chen et al., 2024, 1024-dim, on-prem Ollama, MTEB-NL 60.0) → OpenAI text-embedding-3-large (1536-dim, hosted). The current model — selected per ADR-0048 on 2026-04-30 — was chosen to eliminate the on-prem serialization tax (Ollama's OLLAMA_NUM_PARALLEL=1 constraint paid 1.7–5.8 s wall-clock per voice turn) while preserving multilingual retrieval quality.
The Hierarchical Navigable Small World (HNSW) algorithm (Malkov & Yashunin, 2018) provides the approximate nearest neighbour index used by pgvector. HNSW constructs a multi-layer graph where each layer provides progressively coarser navigation, enabling logarithmic-time search complexity. For the ZOL corpus (~50,000 chunks), HNSW with m=16 and ef_construction=200 provides sub-second query times with high recall.
Sentence-BERT (Reimers & Gurevych, 2019) introduced the use of Siamese and triplet network architectures for generating semantically meaningful sentence embeddings, establishing the foundation upon which modern multilingual embedding models — including BGE-M3 and the OpenAI text-embedding-3 family — are built.
Anthropic's research on contextual retrieval (2024) demonstrates that prepending document-level context to individual chunks reduces retrieval failure rates by 49% when combined with hybrid search. The ZOL system implements this technique through page summaries generated during ingestion, which are stored in the chunk_metadata JSONB column and prepended to chunks at query time.
-
Nussbaum, Z., Morris, J. X., Duderstadt, B., & Mulyar, A. (2024). Nomic Embed: Training a reproducible long context text embedder. Nomic AI Technical Report. https://arxiv.org/abs/2402.01613
-
Chen, J., Xiao, S., Zhang, P., Luo, K., Lian, D., & Liu, Z. (2024). BGE M3-Embedding: Multi-lingual, multi-functionality, multi-granularity text embeddings through self-knowledge distillation. arXiv preprint, arXiv:2402.03216. https://arxiv.org/abs/2402.03216
-
Malkov, Y. A., & Yashunin, D. A. (2018). Efficient and robust approximate nearest neighbor using Hierarchical Navigable Small World graphs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(4), 824–836.
-
Reimers, N., & Gurevych, I. (2019). Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP 2019). https://arxiv.org/abs/1908.10084
-
Anthropic. (2024). Introducing contextual retrieval. Anthropic Research Blog. https://www.anthropic.com/news/contextual-retrieval
7. Knowledge Graphs
The theoretical foundations of the ZOL knowledge graph draw on both general graph theory and domain-specific medical ontology research. The property graph model, as implemented in Neo4j, provides the formal framework for representing entities as nodes with typed properties and relationships as directed, typed edges carrying their own properties.
Robinson et al. (2015) provide a comprehensive introduction to graph databases, establishing the property graph model as the dominant paradigm for connected data applications. Hogan et al. (2021) offer an authoritative survey of knowledge graph theory, construction methodologies, and applications, providing the taxonomic framework within which the ZOL knowledge graph is situated.
Ernst et al. (2015) demonstrate approaches to biomedical knowledge graph construction, addressing challenges that are directly relevant to the ZOL system: entity disambiguation, relationship extraction from unstructured medical text, and ontology alignment. The ZOL system addresses these challenges through a combination of compiled regex extraction, LLM validation, and a curated taxonomy module containing 580+ lines of domain knowledge.
-
Robinson, I., Webber, J., & Eifrem, E. (2015). Graph databases: New opportunities for connected data (2nd ed.). O'Reilly Media.
-
Hogan, A., Blomqvist, E., Cochez, M., d'Amato, C., de Melo, G., Gutierrez, C., Kirrane, S., Labra Gayo, J. E., Navigli, R., Neumaier, S., Ngonga Ngomo, A.-C., Polleres, A., Rashid, S. M., Rula, A., Schmelzeisen, L., Sequeda, J., Staab, S., & Zimmermann, A. (2021). Knowledge graphs. ACM Computing Surveys, 54(4), Article 71.
-
Ernst, P., Siu, A., & Weikum, G. (2015). KnowLife: A versatile approach for constructing a large knowledge graph for biomedical sciences. BMC Bioinformatics, 16, Article 157.
8. Dutch Clinical Natural Language Processing
The ZOL system operates primarily in Dutch, a language with relatively limited NLP resources compared to English. Dutch medical text presents specific challenges: compound word formation (e.g., "borstonderzoek" vs. "borst onderzoek"), code-switching in clinical settings, and limited availability of domain-specific pre-trained models.
MedRoBERTa.nl (Verkijk & Vossen, 2024) represents the first medical language model pre-trained on Dutch Electronic Health Records, demonstrating that domain-specific pre-training significantly improves clinical NLP task performance. While the ZOL system currently relies on multilingual models (Tier 3 flagship LLM, OpenAI text-embedding-3-large for retrieval per ADR-0048) rather than Dutch-specific models, MedRoBERTa.nl represents a potential future enhancement for entity extraction and embedding quality.
Afzal et al. (2014) provide an inventory of tools for Dutch clinical language processing, establishing the landscape of available resources and identifying gaps. The ZOL system addresses several of these gaps through its custom taxonomy module, which contains 110+ condition aliases, 50+ treatment aliases, and 40+ search aliases mapping colloquial Dutch to canonical medical terminology.
-
Verkijk, S., & Vossen, P. (2024). Creating, anonymizing and evaluating the first medical language model pre-trained on Dutch Electronic Health Records: MedRoBERTa.nl. Artificial Intelligence in Medicine. https://www.sciencedirect.com/science/article/pii/S0933365725000831
-
Afzal, Z., Pons, E., Kang, N., Sturkenboom, M. C., Schuemie, M. J., & Kors, J. A. (2014). ContextD: An algorithm to identify contextual properties of medical terms in a Dutch clinical corpus. BMC Bioinformatics, 15, Article 373. https://pubmed.ncbi.nlm.nih.gov/22874189/
9. RAG Evaluation Methodology
Evaluation of RAG systems requires specialised metrics that capture both retrieval quality and generation quality. The ZOL system employs a hybrid evaluation strategy combining fast embedding-similarity gates with comprehensive LLM-as-judge metrics.
RAGAS (Es et al., 2023) provides a reference-free evaluation framework that uses LLM-as-judge to compute metrics including Faithfulness (factual consistency between response and context), Context Precision (proportion of relevant information ranked highly), Context Recall (proportion of relevant information retrieved), and Answer Relevancy (semantic relevance of response to query). The ZOL system integrates a subset of these metrics via the DeepEval framework.
DeepEval (Confident AI, 2024) is the open-source LLM evaluation framework integrated into the ZOL system's production pipeline. It provides the FaithfulnessMetric and AnswerRelevancyMetric used in the background quality analytics, with the Tier 2 (standard) model as the evaluation judge.
Zheng et al. (2023) established the LLM-as-a-Judge paradigm through MT-Bench and Chatbot Arena, demonstrating that strong LLMs can serve as reliable evaluation judges. This paradigm underpins both the RAGAS and DeepEval frameworks used by the ZOL system.
ARES (Saad-Falcon et al., 2024) proposes an automated evaluation framework specifically designed for RAG systems, combining prediction-powered inference with judge LLMs to provide statistically rigorous evaluation without human annotations.
-
Es, S., James, J., Espinosa-Anke, L., & Schockaert, S. (2023). RAGAS: Automated evaluation of retrieval augmented generation. arXiv preprint, arXiv:2309.15217. https://arxiv.org/abs/2309.15217
-
Confident AI. (2024). DeepEval: The open-source LLM evaluation framework. https://deepeval.com/docs/metrics-ragas
-
Zheng, L., Chiang, W., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., Lin, Z., Li, Z., Li, D., Xing, E., Zhang, H., Gonzalez, J. E., & Stoica, I. (2023). Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. In Advances in Neural Information Processing Systems (NeurIPS 2023).
-
Saad-Falcon, J., Khattab, O., Potts, C., & Zaharia, M. (2024). ARES: An automated evaluation framework for retrieval-augmented generation systems. arXiv preprint, arXiv:2311.09476. https://arxiv.org/abs/2311.09476
10. Semantic Caching
Semantic caching addresses a significant operational challenge in RAG systems: the latency and cost of repeated LLM inference for semantically equivalent queries. Traditional exact-match caching is insufficient because natural language queries exhibit high lexical variance -- "Which doctors work in cardiology?" and "Who are the cardiologists?" are semantically identical but lexically distinct.
Zhu et al. (2024) formalise GPT Semantic Cache, using embedding similarity to identify cache-worthy query pairs. Their work demonstrates that semantic caching with appropriate similarity thresholds can reduce LLM costs by 40-60% while maintaining response quality. The ZOL system implements a two-tier variant: Tier 1 (exact hash match on LLM-reformulated queries) and Tier 2 (cosine similarity >= 0.97 via pgvector HNSW index).
GPTCache provides an open-source reference implementation for semantic caching, though the ZOL system implements its own caching layer in PostgreSQL to leverage the existing pgvector infrastructure and avoid additional dependencies.
-
Zhu, Z., Wang, Z., Li, M., Xu, J., Tan, P., & Zhu, Y. (2024). GPT Semantic Cache: Reducing LLM costs and latency via semantic embedding caching. arXiv preprint, arXiv:2411.05276. https://arxiv.org/abs/2411.05276
-
GPTCache. (2024). GPTCache: An open-source semantic cache for LLM applications. https://github.com/zilliztech/GPTCache
11. Safety, Explainability, and Regulation
The deployment of AI systems in healthcare contexts is subject to stringent safety requirements and regulatory frameworks. The ZOL system's multi-layer safety architecture reflects the defence-in-depth principle, where multiple independent mechanisms guard against medical advice generation, prompt injection, and hallucination.
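The EU Artificial Intelligence Act (European Parliament, 2024) establishes a risk-based regulatory framework for AI systems. Healthcare AI systems can be classified as "high-risk" either as products or safety components covered by the Medical Device Regulation (Article 6(1) in conjunction with Annex I) or through the use cases listed in Annex III, such as emergency patient triage; high-risk systems are subject to conformity assessment, risk management, and transparency obligations. The ZOL system mitigates regulatory exposure by positioning itself as an information retrieval tool rather than a clinical decision support system.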
Amann et al. (2020) provide a multidisciplinary perspective on explainability in healthcare AI, arguing that transparency is essential for clinical trust and adoption. The ZOL system addresses this through its pipeline progress emissions (WebSocket-based real-time visibility), source citations with verifiable URLs, and the frontend debug panel that exposes intent classification, retrieval strategy, and quality evaluation scores.
The WHO guidance on AI ethics for health (2021) establishes six principles: protecting human autonomy, promoting well-being, ensuring transparency, fostering responsibility, ensuring inclusiveness and equity, and promoting responsive and sustainable AI. The ZOL system's safety architecture directly addresses several of these principles through mandatory disclaimers, medical advice refusal, and comprehensive audit logging.
-
Amann, J., Blasimme, A., Vayena, E., Frey, D., & Madai, V. I. (2020). Explainability for artificial intelligence in healthcare: A multidisciplinary perspective. BMC Medical Informatics and Decision Making, 20, Article 310.
-
World Health Organization. (2021). Ethics and governance of artificial intelligence for health. WHO.
-
Topol, E. J. (2019). High-performance medicine: The convergence of human and artificial intelligence. Nature Medicine, 25, 44–56.
12. Web Crawling and Incremental Ingestion
The ZOL system's document ingestion pipeline relies on sitemap-driven web crawling with content-hash-based change detection for incremental updates. The theoretical foundations of this approach are well-established in the web crawling literature, which addresses the fundamental challenge of maintaining collection freshness — the fraction of a local document collection that accurately reflects the current state of the source.
Olston and Najork (2010) provide the authoritative survey of web crawling techniques, covering URL frontier management, content change detection, and freshness optimisation strategies. Their classification of crawling architectures — batch, incremental, and focused — directly informs the ZOL system's hybrid approach: batch sitemap discovery combined with incremental content-hash change detection.
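Content-hash change detection can be sketched in a few lines. The function names, the whitespace normalisation, and the example URLs are illustrative assumptions rather than the ZOL pipeline's actual code; the idea is simply that a page is re-ingested only when the hash of its extracted text differs from the stored one.

```python
import hashlib


def content_hash(text: str) -> str:
    # Collapse whitespace so purely cosmetic changes do not trigger re-ingestion
    # (an assumption of this sketch).
    normalised = " ".join(text.split())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def plan_ingestion(
    crawled: dict[str, str], stored_hashes: dict[str, str]
) -> dict[str, list[str]]:
    """Classify crawled URLs against the hashes stored from the previous run."""
    plan: dict[str, list[str]] = {"new": [], "changed": [], "unchanged": [], "removed": []}
    for url, text in crawled.items():
        digest = content_hash(text)
        if url not in stored_hashes:
            plan["new"].append(url)
        elif stored_hashes[url] != digest:
            plan["changed"].append(url)
        else:
            plan["unchanged"].append(url)
    # URLs known from the previous run that no longer appear in the sitemap.
    plan["removed"] = sorted(set(stored_hashes) - set(crawled))
    return plan


# Hypothetical previous state and current crawl.
previous = {
    "/cardiology": content_hash("Cardiology is open on weekdays."),
    "/visiting": content_hash("Visiting hours: 14:00 to 20:00."),
    "/old-page": content_hash("Retired content."),
}
current = {
    "/cardiology": "Cardiology   is open on weekdays.",
    "/visiting": "Visiting hours: 15:00 to 20:00.",
    "/neurology": "Neurology department.",
}
plan = plan_ingestion(current, previous)
```

Only the entries under `new` and `changed` would need to be re-chunked and re-embedded, which is where the incremental approach saves work.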
Cho and Garcia-Molina (2000) established the theoretical framework for incremental web crawlers, demonstrating that selective index updating significantly outperforms periodic batch re-crawling in terms of collection freshness. Their subsequent work on change frequency estimation (Cho & Garcia-Molina, 2003) showed that web crawlers could achieve a 35% improvement in freshness by adopting Poisson-based change frequency estimators, which supports the incremental, change-driven update strategy used by the ZOL system.
-
Olston, C., & Najork, M. (2010). Web crawling. Foundations and Trends in Information Retrieval, 4(3), 175–246. https://doi.org/10.1561/1500000017
-
Cho, J., & Garcia-Molina, H. (2000). The evolution of the web and implications for an incremental crawler. In Proceedings of the 26th International Conference on Very Large Data Bases (VLDB 2000), 200–209. https://www.vldb.org/conf/2000/P200.pdf
-
Cho, J., & Garcia-Molina, H. (2003). Estimating frequency of change. ACM Transactions on Internet Technology, 3(3), 256–290. https://doi.org/10.1145/857166.857170
13. Context Filtering for RAG
A critical challenge in RAG systems is that retrieved passages often contain irrelevant information that can cause the generation model to hallucinate or produce unfocused responses. Two complementary approaches address this: ingestion-time context enrichment and query-time context filtering.
FILCO (Wang et al., 2023) formalises query-time context filtering by training models to identify useful context using lexical overlap, string inclusion, and conditional cross-mutual information. Tested across six knowledge-intensive tasks with FLAN-T5 and LLaMA-2, FILCO outperforms baselines on extractive QA, multi-hop reasoning, and fact verification while reducing prompt lengths by up to 64%. The ZOL system addresses the same underlying problem through ingestion-time contextual enrichment (ADR-0019), which prepends LLM-generated document context to chunks before embedding — a complementary approach that shifts the filtering burden from query time to ingestion time.
Late Chunking (Günther et al., 2024) proposes an alternative approach where a long-context embedding model first encodes the entire document and the chunk embeddings are then pooled from the resulting token embeddings, so that each chunk vector retains context from the surrounding text. This approach eliminates the need for explicit context enrichment but requires models with very long context windows and introduces architectural complexity in the embedding pipeline.
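The pooling step of late chunking can be illustrated with a small sketch. It assumes the token embeddings have already been produced by a long-context model over the whole document; the toy two-dimensional vectors, the span boundaries, and the function names are hypothetical.

```python
def mean_pool(vectors: list[list[float]]) -> list[float]:
    """Average a non-empty list of equal-length vectors component-wise."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def late_chunk(
    token_embeddings: list[list[float]], spans: list[tuple[int, int]]
) -> list[list[float]]:
    """Pool document-level token embeddings into one vector per chunk.

    Each span is a (start, end) token range with an exclusive end.
    """
    return [mean_pool(token_embeddings[start:end]) for start, end in spans]


# Toy token embeddings for a four-token document (assumed model output).
token_embeddings = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]]
chunk_vectors = late_chunk(token_embeddings, [(0, 2), (2, 4)])
```

The difference from conventional chunking is the order of operations: here the split happens after encoding, so every token embedding was computed with the full document in view.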
MAIN-RAG (Shi et al., 2025) introduces multi-agent filtering where multiple LLM agents independently evaluate document relevance, achieving consensus-based filtering that outperforms single-agent approaches. This represents a potential evolution path for the ZOL system's post-retrieval processing.
-
Wang, Z., Araki, J., Jiang, Z., Parvez, M. R., & Neubig, G. (2023). Learning to filter context for retrieval-augmented generation. arXiv preprint, arXiv:2311.08377. https://arxiv.org/abs/2311.08377
-
Günther, M., Mohr, I., Williams, D. J., Wang, B., & Xiao, H. (2024). Late chunking: Contextual chunk embeddings using long-context embedding models. arXiv preprint, arXiv:2409.04701. https://arxiv.org/abs/2409.04701
-
Shi, Y., et al. (2025). MAIN-RAG: Multi-agent filtering retrieval-augmented generation. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL 2025). https://arxiv.org/abs/2501.00332
14. Cross-Encoder Reranking
Two-stage retrieval — where a fast first-stage retriever generates candidates and a slower cross-encoder reranker refines the ranking — is the dominant paradigm for high-precision RAG systems. The theoretical advantage of cross-encoders over bi-encoders is their ability to jointly attend to query-document token interactions, capturing fine-grained relevance signals that bi-encoder cosine similarity cannot model.
Nogueira and Cho (2019) established the cross-encoder reranking paradigm by demonstrating that BERT-based passage re-ranking achieves state-of-the-art results on the MS MARCO leaderboard, outperforming prior approaches by 27% relative in MRR@10. This foundational work motivates the ZOL system's use of Jina Reranker v2 (with bge-reranker-v2-m3 as fallback) for always-on reranking.
Bruch et al. (2023) provide a systematic analysis of fusion functions for hybrid retrieval, comparing Reciprocal Rank Fusion (RRF) with convex score combination. Their work demonstrates that the optimal fusion strategy depends on the quality differential between retrieval channels — a finding that informed the ZOL system's choice of RRF with k=60 for combining vector and BM25 results.
-
Nogueira, R., & Cho, K. (2019). Passage re-ranking with BERT. arXiv preprint, arXiv:1901.04085. https://arxiv.org/abs/1901.04085
-
Bruch, S., Gai, S., & Ingber, A. (2023). An analysis of fusion functions for hybrid retrieval. ACM Transactions on Information Systems, 42(1), Article 16. https://doi.org/10.1145/3596512
15. Query Decomposition for Multi-Hop Reasoning
Multi-hop questions — queries requiring information from multiple evidence sources to construct a complete answer — represent a fundamental challenge for RAG systems. In the hospital domain, questions like "Which doctor treats epilepsy in children at Campus Sint-Jan?" require traversing condition→department→doctor→campus relationship chains.
Ammann et al. (2025) demonstrate that incorporating question decomposition into a RAG pipeline yields significant gains: +36.7% MRR@10 and +11.6% F1 over standard RAG baselines. Their approach — decomposing queries into sub-questions, retrieving passages for each, and merging the candidate pool before reranking — directly mirrors the ZOL system's query decomposition implementation (ADR-0032).
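The decompose, retrieve, merge, and rerank flow described above can be sketched with toy components. Everything here is a hypothetical stand-in: the corpus, the keyword-overlap retriever and reranker, and the fixed sub-questions replace the LLM decomposer, hybrid retriever, and cross-encoder that a real pipeline would use.

```python
from typing import Callable

Doc = tuple[str, str]  # (doc_id, text)

# Hypothetical three-document corpus.
CORPUS = {
    "d1": "pediatric neurology treats epilepsy in children",
    "d2": "a doctor works at campus sint-jan in pediatric neurology",
    "d3": "visiting hours are from 14:00 to 20:00",
}


def tokens(text: str) -> set[str]:
    return set(text.lower().replace("?", "").split())


def toy_decompose(query: str) -> list[str]:
    # Stand-in for an LLM decomposition call; sub-questions are hard-coded.
    return [
        "which department treats epilepsy in children",
        "which doctors work at campus sint-jan",
    ]


def toy_retrieve(question: str) -> list[Doc]:
    # Keep documents sharing at least two tokens with the question.
    q = tokens(question)
    return [(i, t) for i, t in CORPUS.items() if len(q & tokens(t)) >= 2]


def toy_rerank(query: str, candidates: list[Doc]) -> list[Doc]:
    # Order by token overlap with the ORIGINAL query, ties broken by id.
    q = tokens(query)
    return sorted(candidates, key=lambda c: (-len(q & tokens(c[1])), c[0]))


def decompose_and_retrieve(
    query: str,
    decompose: Callable[[str], list[str]],
    retrieve: Callable[[str], list[Doc]],
    rerank: Callable[[str, list[Doc]], list[Doc]],
    top_k: int = 5,
) -> list[Doc]:
    pool: dict[str, str] = {}
    for sub_question in decompose(query):
        for doc_id, text in retrieve(sub_question):
            pool.setdefault(doc_id, text)  # merge and de-duplicate candidates
    return rerank(query, list(pool.items()))[:top_k]


results = decompose_and_retrieve(
    "Which doctor treats epilepsy in children at Campus Sint-Jan?",
    toy_decompose,
    toy_retrieve,
    toy_rerank,
)
```

Reranking against the original query rather than the sub-questions is what lets the merged pool be ordered by relevance to what the user actually asked.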
Min et al. (2019) established the foundational framework for question decomposition in multi-hop reading comprehension, demonstrating that decomposition with rescoring outperforms end-to-end models on complex reasoning tasks. GenDec (Li et al., 2024) extends this with a generative decomposition approach that produces independent and complete sub-questions incorporating extracted evidence.
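The decompose, retrieve, merge, and rerank flow described above can be sketched as follows. This is an illustration under stated assumptions rather than the ADR-0032 implementation: the function names, the toy index, the identity reranker, and the choice to keep the best first-stage score when a passage is retrieved for several sub-questions are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Passage = Tuple[str, float]  # (passage_id, first-stage score)


def retrieve_with_decomposition(
    query: str,
    decompose: Callable[[str], List[str]],
    retrieve: Callable[[str], List[Passage]],
    rerank: Callable[[str, List[str]], List[str]],
    top_k: int = 5,
) -> List[str]:
    """Retrieve per sub-question, merge into one pool, then rerank once."""
    # Fall back to the original query if decomposition yields nothing.
    sub_questions = decompose(query) or [query]
    pool: Dict[str, float] = {}
    for sub_question in sub_questions:
        for passage_id, score in retrieve(sub_question):
            # Deduplicate: keep the best score seen for each passage.
            pool[passage_id] = max(score, pool.get(passage_id, float("-inf")))
    candidates = sorted(pool, key=lambda pid: (-pool[pid], pid))
    # The reranker scores candidates against the ORIGINAL query.
    return rerank(query, candidates)[:top_k]


# Hypothetical stand-ins for the LLM decomposer, retriever, and reranker.
TOY_INDEX = {
    "Which department treats epilepsy in children?": [("p_neuro", 0.9), ("p_paeds", 0.7)],
    "Which doctors work at Campus Sint-Jan?": [("p_staff", 0.8), ("p_paeds", 0.6)],
}


def toy_decompose(query: str) -> List[str]:
    return list(TOY_INDEX)


def toy_retrieve(sub_question: str) -> List[Passage]:
    return TOY_INDEX.get(sub_question, [])


def toy_rerank(query: str, candidates: List[str]) -> List[str]:
    return candidates  # identity; a cross-encoder would reorder here


result = retrieve_with_decomposition(
    "Which doctor treats epilepsy in children at Campus Sint-Jan?",
    toy_decompose, toy_retrieve, toy_rerank, top_k=3,
)
```

Merging before reranking means the comparatively expensive reranker runs once over a deduplicated pool instead of once per sub-question.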
- Ammann, P. J. L., et al. (2025). Question decomposition for retrieval-augmented generation. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL 2025 SRW). https://arxiv.org/abs/2507.00355
- Min, S., Zhong, V., Socher, R., & Xiong, C. (2019). Multi-hop reading comprehension through question decomposition and rescoring. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL 2019).
- Li, J., et al. (2024). GenDec: A robust generative question-decomposition method for multi-hop reasoning. arXiv preprint, arXiv:2402.11166. https://arxiv.org/abs/2402.11166
16. Adversarial Robustness and LLM Safety
The deployment of RAG systems in healthcare contexts requires robust defences against adversarial inputs, including prompt injection, jailbreaking, and the emerging class of gradient-based adversarial suffix attacks. The ZOL system's multi-layer safety architecture (ADR-0036) implements defence-in-depth against these threat vectors.
Zou et al. (2023) introduced the GCG (Greedy Coordinate Gradient) attack, demonstrating that optimised gibberish token sequences appended to harmful queries can bypass LLM safety alignment with an 88% success rate on GPT-3.5/4. Critically, these suffixes transfer across models: a suffix optimised on an open-source model also works against closed-source models. This transferability motivated the ZOL system's perplexity-based anomaly detector (H1), which exploits the statistical anomaly of GCG suffixes (high entropy, low dictionary-word ratio) without requiring an LLM call.
Liao et al. (2024) demonstrated AmpleGCG-Plus, a generator that produces hundreds of adversarial suffixes per minute with near-100% attack success rates, raising the threat level for production systems. Huang et al. (2025) introduced IRIS, which inhibits the LLM's refusal mechanism at the representation level, achieving a 90% universal attack success rate even against state-of-the-art defences.
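The two signals named above (high entropy and a low dictionary-word ratio) can be approximated without any model call. The sketch below is not the H1 detector: it swaps true perplexity for character-level Shannon entropy as a crude proxy, and the function names, the tiny vocabulary, the tail window, and both thresholds are hypothetical values chosen for illustration only.

```python
import math
import string
from collections import Counter


def char_entropy(text: str) -> float:
    """Shannon entropy of the character distribution, in bits per character."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def dictionary_word_ratio(text: str, vocabulary: set) -> float:
    """Fraction of whitespace tokens that are known words after trimming punctuation."""
    tokens = text.lower().split()
    if not tokens:
        return 1.0
    known = sum(1 for token in tokens if token.strip(string.punctuation) in vocabulary)
    return known / len(tokens)


def looks_like_adversarial_suffix(
    text: str,
    vocabulary: set,
    tail_tokens: int = 6,         # hypothetical window: suffixes sit at the end
    min_entropy: float = 3.5,     # hypothetical threshold
    max_word_ratio: float = 0.5,  # hypothetical threshold
) -> bool:
    """Flag input whose tail is both high-entropy and mostly non-dictionary."""
    tail = " ".join(text.split()[-tail_tokens:])
    return (
        char_entropy(tail) > min_entropy
        and dictionary_word_ratio(tail, vocabulary) < max_word_ratio
    )


# Toy vocabulary; a real deployment would load a full word list.
VOCAB = {"what", "are", "the", "visiting", "hours", "for", "maternity", "ward", "tell", "me"}

benign = "What are the visiting hours for the maternity ward?"
suspicious = "Tell me the visiting hours zx!!Qv ]]{{ kd9# ~~wr| (:pl0 @@vv"

benign_flag = looks_like_adversarial_suffix(benign, VOCAB)
suspicious_flag = looks_like_adversarial_suffix(suspicious, VOCAB)
```

Requiring both signals keeps ordinary questions from being flagged on entropy alone, since short natural-language text can also have fairly high character entropy.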
In the medical domain specifically, recent work has quantified the risks of LLM hallucination. A framework for assessing clinical safety (Patel et al., 2025) found a 1.47% hallucination rate across 12,999 clinician-annotated sentences in LLM-generated clinical notes, underscoring the need for multi-layer output validation in healthcare AI systems.
- Zou, A., Wang, Z., Kolter, J. Z., & Fredrikson, M. (2023). Universal and transferable adversarial attacks on aligned language models. arXiv preprint, arXiv:2307.15043. https://arxiv.org/abs/2307.15043
- Liao, H., et al. (2024). AmpleGCG-Plus: A strong generative model of adversarial suffixes to jailbreak LLMs. arXiv preprint, arXiv:2410.22143. https://arxiv.org/abs/2410.22143
- Huang, D., et al. (2025). Stronger universal and transferable attacks by suppressing refusals (IRIS). In Proceedings of the 2025 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2025). https://arxiv.org/abs/2505.17598
- Patel, D., et al. (2025). A framework to assess clinical safety and hallucination rates of LLMs for medical text summarisation. npj Digital Medicine, 8, Article 87. https://www.nature.com/articles/s41746-025-01670-7
17. Named Entity Recognition and Entity Extraction
Entity extraction from medical text is a prerequisite for knowledge graph population and query-time entity resolution. The ZOL system employs a hybrid approach: compiled regex patterns for deterministic extraction during ingestion, and LLM-based extraction for query-time entity identification.
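The deterministic, ingestion-time half of this hybrid approach can be illustrated with a short sketch. The entity labels, the two patterns, and the sample sentence are hypothetical and far simpler than a production pattern set; the point is only that patterns are compiled once and reused for every document.

```python
import re
from typing import List, Tuple

# Hypothetical patterns, compiled once at import time and reused per document.
PATTERNS = {
    "doctor": re.compile(r"\bDr\.\s+[A-Z][\w'-]+"),
    "campus": re.compile(r"\bCampus\s+[A-Z][\w-]+"),
}


def extract_entities(text: str) -> List[Tuple[str, str, int]]:
    """Return (label, matched_text, start_offset) tuples in document order."""
    found = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((label, match.group(0), match.start()))
    return sorted(found, key=lambda entity: entity[2])


entities = extract_entities("Dr. Peeters holds consultations at Campus Sint-Jan.")
```

Regex extraction is reproducible and cheap, which suits bulk ingestion; it cannot resolve paraphrases or misspellings, which is where query-time LLM extraction complements it.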
GLiNER (Zaratiana et al., 2023) presents a generalist model for zero-shot named entity recognition using bidirectional transformers, offering a potential future enhancement for the ZOL system's entity extraction pipeline. Unlike traditional NER models that require labelled training data for each entity type, GLiNER can extract novel entity types from natural language descriptions alone.
- Zaratiana, U., Araujo, V., Neves, L., Coelho, G. P., & Nguyen, D. Q. (2023). GLiNER: Generalist model for named entity recognition using bidirectional transformer. arXiv preprint, arXiv:2311.08526. https://arxiv.org/abs/2311.08526
18. Response Time and User Experience
The perceived performance of interactive AI systems significantly influences user satisfaction and adoption. Nielsen (1993) established three fundamental response time thresholds: 0.1 seconds (instantaneous feedback), 1.0 second (maintained flow of thought), and 10 seconds (maximum attention retention). The ZOL system's streaming architecture targets the 10-second threshold as an upper bound, with visible progress indicators maintaining user engagement during the ~5.5-second pipeline execution.
- Nielsen, J. (1993). Usability engineering. Academic Press.
- W3C. (2018). Web Content Accessibility Guidelines (WCAG) 2.1. https://www.w3.org/TR/WCAG21/
19. Software Architecture Patterns
The ZOL system's architecture draws on established software engineering patterns, adapted for the specific requirements of production RAG systems.
- Fowler, M. (2002). Patterns of enterprise application architecture. Addison-Wesley.
- Richards, M. (2022). Software architecture patterns (2nd ed.). O'Reilly Media.
This bibliography represents the academic foundations identified during the documentation audit of 2026-02-15. As the system evolves, additional references should be added to reflect new architectural decisions and their theoretical motivations. All ADRs (Architecture Decision Records) cross-reference the relevant entries from this bibliography where applicable.