References
This page renders the project's BibTeX bibliography (docs/references.bib — single source of truth). Citations elsewhere in the documentation use [@key] syntax and deep-link to the matching anchor on this page. Each entry has been URL-verified.
The convention used in Docusaurus pages is to write the citation as an explicit markdown link, for example [Karpukhin et al. 2020](/docs/references#karpukhin2020dpr). The BibTeX file at docs/references.bib carries the canonical entries; this page mirrors those entries in human-readable form.
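The link convention above can be sketched as a small helper. This is a minimal illustration, not part of the documentation tooling: the function name `citation_link` and its `base` parameter are hypothetical, and the only assumption taken from this page is that each BibTeX key doubles as the anchor.

```python
def citation_link(label: str, key: str, base: str = "/docs/references") -> str:
    """Build a Markdown link that deep-links to a BibTeX key's anchor.

    `label` is the visible citation text, `key` is the BibTeX key.
    `base` is assumed to be the route of the references page.
    """
    if not key or any(ch.isspace() for ch in key):
        raise ValueError(f"invalid BibTeX key: {key!r}")
    return f"[{label}]({base}#{key})"


print(citation_link("Karpukhin et al. 2020", "karpukhin2020dpr"))
```

Keeping the key and the anchor identical means a citation can be checked mechanically against docs/references.bib.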
Retrieval & RAG
karpukhin2020dpr
Karpukhin, V., Oğuz, B., Min, S., Lewis, P., Wu, L., Edunov, S., Chen, D., & Yih, W. (2020). Dense Passage Retrieval for Open-Domain Question Answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6769–6781. Association for Computational Linguistics. arXiv: 2004.04906. DOI: 10.18653/v1/2020.emnlp-main.550.
Canonical URL: https://aclanthology.org/2020.emnlp-main.550/
We cite this for: foundational dense bi-encoder retrieval, the academic precedent for the pgvector-backed semantic-search half of our hybrid retrieval pipeline (ADR-0017 Stage 2a).
lewis2020rag
Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W., Rocktäschel, T., Riedel, S., & Kiela, D. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. In Advances in Neural Information Processing Systems (NeurIPS), volume 33. arXiv: 2005.11401.
Canonical URL: https://arxiv.org/abs/2005.11401
We cite this for: the original "RAG" paper — architectural precedent for retrieval-augmented generation in the Service Layer overview and query-pipeline.md.
khattab2020colbert
Khattab, O., & Zaharia, M. (2020). ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 39–48. ACM. arXiv: 2004.12832. DOI: 10.1145/3397271.3401075.
Canonical URL: https://arxiv.org/abs/2004.12832
We cite this for: late-interaction multi-vector reranker — cited by ADR-0039 (ColBERT multi-vector reranking, feature-flagged) and as the academic basis for token-level reranking when vector + BM25 aren't enough.
liu2024lostinmiddle
Liu, N. F., Lin, K., Hewitt, J., Paranjape, A., Bevilacqua, M., Petroni, F., & Liang, P. (2024). Lost in the Middle: How Language Models Use Long Contexts. Transactions of the Association for Computational Linguistics, 12, 157–173. arXiv: 2307.03172. DOI: 10.1162/tacl_a_00638.
Canonical URL: https://aclanthology.org/2024.tacl-1.9/
We cite this for: empirical demonstration that LLMs under-attend to mid-context tokens — rationale for our 8 000-token context budget and chunk-ordering choices in context_assembly_service.py.
Embeddings
chen2024bgem3
Chen, J., Xiao, S., Zhang, P., Luo, K., Lian, D., & Liu, Z. (2024). BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation. arXiv: 2402.03216.
Canonical URL: https://arxiv.org/abs/2402.03216
We cite this for: BGE-M3 multilingual embedder. Historical reference for ADR-0033 (since superseded by ADR-0048 / OpenAI text-embedding-3-large). Retained because ADR-0017 / ADR-0029 / ADR-0033 cite the model by name, and BGE-M3 still powers the optional ColBERT reranker.
openai2024embeddings
OpenAI. (2024, January). New Embedding Models and API Updates.
Canonical URL: https://openai.com/index/new-embedding-models-and-api-updates/
We cite this for: the official announcement of text-embedding-3-small and text-embedding-3-large — source for model spec, dimensionality (1536/3072), and pricing. Cited by ADR-0048.
Vector Indexing & Storage
johnson2017faiss
Johnson, J., Douze, M., & Jégou, H. (2019). Billion-scale similarity search with GPUs. IEEE Transactions on Big Data, 7(3), 535–547. arXiv: 1702.08734. DOI: 10.1109/TBDATA.2019.2921572.
Canonical URL: https://arxiv.org/abs/1702.08734
We cite this for: Faiss / approximate nearest neighbour at scale — background in storage.md for why HNSW (an ANN index) is the correct trade-off vs exact nearest neighbour at our corpus size (10 K+ chunks).
pgvector_docs
Kane, A., & pgvector contributors. (2024). pgvector: Open-source vector similarity search for Postgres. GitHub repository.
Canonical URL: https://github.com/pgvector/pgvector
We cite this for: the pgvector project page (HNSW + IVFFlat + cosine/L2 ops) — the canonical documentation for our embedding store. Referenced by ADR-0048 / ADR-0053 / ADR-0017 / storage.md.
neo4j_gds_manual
Neo4j, Inc. (2024). Neo4j Graph Data Science Library Manual. Vendor documentation.
Canonical URL: https://neo4j.com/docs/graph-data-science/current/
We cite this for: the Neo4j GDS manual landing page. Cited historically by ADR-006 / ADR-0017 / ADR-0029 (typed-node knowledge graph) and now superseded by ADR-0053 (Neo4j removal in favour of pgvector + app.entity_relationships). Retained for historical fidelity.
LLM Providers, Voice Stack, and Tooling
anthropic_claude3_card
Anthropic. (2024). Claude 3 Model Card. Anthropic, Inc., model card PDF.
Canonical URL: https://assets.anthropic.com/m/61e7d27f8c8f5919/original/Claude-3-Model-Card.pdf
We cite this for: the most recent stable Anthropic model card available at a fixed canonical URL — safety / capability source-of-truth for any Anthropic-hosted Claude path in our LLM stack.
livekit_agents_docs
LiveKit, Inc. (2024). LiveKit Agents Documentation. Vendor documentation.
Canonical URL: https://docs.livekit.io/agents/
We cite this for: the LiveKit Agents framework reference — runtime that hosts our voice agent. Cited by ADR-0050 (Twilio + self-hosted LiveKit SIP) and the voice compendium.
deepgram_nova3
Deepgram. (2025). Introducing Nova-3: The First Speech-to-Text Model Designed for Real-Time Enterprise Use Cases. Vendor announcement.
Canonical URL: https://deepgram.com/learn/introducing-nova-3-speech-to-text-api
We cite this for: the Deepgram Nova-3 STT model announcement — STT model spec, language coverage, and the first-utterance language-locking justification. Cited by ADR-0052 and the voice compendium.
elevenlabs_multilingual_v2
ElevenLabs. (2024). ElevenLabs Models — Multilingual v2. Vendor model documentation.
Canonical URL: https://elevenlabs.io/docs/models
We cite this for: the Eleven Multilingual v2 TTS model card (29 languages, prosody-injection support) — TTS model spec for the voice channel (nl/en/fr/it).
pydantic_ai_docs
Pydantic Services Inc. (2024). Pydantic AI: Agent Framework / Shim for using Pydantic with LLMs. Project documentation.
Canonical URL: https://ai.pydantic.dev/
We cite this for: the Pydantic AI documentation — the Agent[None, OutputModel] structured-output pattern (output_retries=3, UnexpectedModelBehavior fallback) used at our 8 LLM call sites. Cited by llm-stack.md and the voice compendium.
Safety, Guardrails, and Medical AI
owasp_llm_top10
OWASP Foundation. (2025). OWASP Top 10 for Large Language Model Applications. OWASP project page.
Canonical URL: https://genai.owasp.org/llm-top-10/
We cite this for: the OWASP LLM Top 10 (prompt injection, insecure output handling, training-data poisoning, model DoS, etc.) — practitioner taxonomy for LLM-application threats. Cited by ADR-0036 (adversarial input hardening) and the safety / compliance section of the architecture docs.
Operational Practices & Multi-tenant Systems
bezemer2010multitenant
Bezemer, C.-P., & Zaidman, A. (2010). Multi-tenant SaaS applications: maintenance dream or nightmare? In Proceedings of the Joint ERCIM Workshop on Software Evolution (EVOL) and International Workshop on Principles of Software Evolution (IWPSE), pages 88–92. ACM. DOI: 10.1145/1862372.1862393.
Canonical URL: https://dl.acm.org/doi/10.1145/1862372.1862393
We cite this for: the canonical taxonomy of multi-tenant SaaS isolation models (shared schema with tenant_id vs schema-per-tenant vs database-per-tenant) and the maintenance-cost trade-offs that informed our pool-model choice. Cited by architecture/multi-tenancy.md.
beyer2016sre
Beyer, B., Jones, C., Petoff, J., & Murphy, N. R. (2016). Site Reliability Engineering: How Google Runs Production Systems. O'Reilly Media. ISBN 978-1-491-92912-4.
Canonical URL: https://sre.google/sre-book/table-of-contents/ (free online edition)
We cite this for: the canonical practitioner statement (chapter 4, "Service Level Objectives") that latency SLOs should be written at the tail (P95, P99) rather than the mean — mean latency hides tail behaviour; the worst 1-in-20 user experience is what operators care about. Cited by architecture/feedback-dashboard-metrics.md.
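The point about mean versus tail latency can be illustrated with a short sketch. The latency samples are invented for illustration, and the nearest-rank percentile used here is one common definition among several; it is not necessarily the one used by our dashboards.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must be non-empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical data: 18 fast requests and 2 slow ones, in milliseconds.
latencies_ms = [120] * 18 + [4000, 5000]

mean_ms = sum(latencies_ms) / len(latencies_ms)
p95_ms = percentile(latencies_ms, 95)
p99_ms = percentile(latencies_ms, 99)

print(f"mean={mean_ms:.0f} ms  p95={p95_ms} ms  p99={p99_ms} ms")
```

With these samples the mean is 558 ms, a value no request actually experienced, while P95 and P99 expose the multi-second outliers.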
Information Retrieval Foundations (BM25, RRF, Rerankers)
robertson2009bm25
Robertson, S., & Zaragoza, H. (2009). The Probabilistic Relevance Framework: BM25 and Beyond. Foundations and Trends in Information Retrieval, 3(4), 333–389. DOI: 10.1561/1500000019.
Canonical URL: https://www.staff.city.ac.uk/~sbrp622/papers/foundations_bm25_review.pdf
We cite this for: the canonical BM25 reference (Foundations and Trends survey). The keyword-search half of our hybrid retrieval pipeline approximates BM25 via PostgreSQL tsvector + ts_rank on inverted-index lookups. Cited by rag/hybrid-search.md and rag/context-retrieval.md.
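For comparison with the tsvector + ts_rank approximation, the reference BM25 scoring function can be sketched as follows. This is a minimal illustration under stated assumptions: documents are pre-tokenised lists of strings, `k1=1.2` and `b=0.75` are commonly used defaults rather than values from our configuration, and the IDF uses the non-negative "+1" variant. It does not describe how PostgreSQL computes ts_rank.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenised document against the query with Okapi BM25."""
    if not docs:
        raise ValueError("docs must be non-empty")
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Length normalisation: long documents are penalised via b.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical toy corpus.
docs = [
    ["visiting", "hours", "cardiology"],
    ["parking", "tariffs"],
    ["visiting", "hours", "visiting", "rules"],
]
scores = bm25_scores(["visiting", "hours"], docs)
print(scores)
```

Documents that contain none of the query terms score exactly zero, which is why a keyword ranking can be fused with a vector ranking without sharing a score scale.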
cormack2009rrf
Cormack, G. V., Clarke, C. L. A., & Büttcher, S. (2009). Reciprocal Rank Fusion Outperforms Condorcet and Individual Rank Learning Methods. In Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 758–759. ACM. DOI: 10.1145/1571941.1572114.
Canonical URL: https://plg.uwaterloo.ca/~gvcormac/cormacksigir09-rrf.pdf
We cite this for: the canonical Reciprocal Rank Fusion (RRF) reference. RRF combines vector and BM25 rankings without requiring score normalisation; the formula score = sum(1 / (k + rank_i)) with k=60 is our production default. Cited by rag/hybrid-search.md and rag/what-is-rag.md.
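The RRF formula above can be sketched in a few lines. The function name and the chunk identifiers are hypothetical; only the formula and `k=60` come from the text, and ties are broken by identifier purely to keep the output deterministic.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of ids: score(d) = sum over lists of 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):  # ranks are 1-based
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical result lists from the two retrieval halves.
vector_hits = ["c3", "c1", "c7"]
keyword_hits = ["c1", "c9", "c3"]

fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
for doc_id, score in fused:
    print(doc_id, round(score, 6))
```

Because only ranks enter the sum, the cosine similarities and keyword scores never need to be brought onto a common scale.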
nogueira2019passagererank
Nogueira, R., & Cho, K. (2019). Passage Re-ranking with BERT. arXiv: 1901.04085.
Canonical URL: https://arxiv.org/abs/1901.04085
We cite this for: the early canonical demonstration that a BERT cross-encoder (jointly encoding query + passage with full attention) is highly effective as a second-stage reranker over first-stage BM25 candidates. The production model Jina Reranker v2 is a direct descendant of this lineage. Cited by rag/reranking-evaluation.md.
Recent RAG-Specific Surveys & Extensions (2023–2024)
gao2024ragsurvey
Gao, Y., Xiong, Y., Gao, X., Jia, K., Pan, J., Bi, Y., Dai, Y., Sun, J., Wang, M., & Wang, H. (2024). Retrieval-Augmented Generation for Large Language Models: A Survey. arXiv: 2312.10997.
Canonical URL: https://arxiv.org/abs/2312.10997
We cite this for: comprehensive RAG survey covering Naive / Advanced / Modular RAG paradigms. Our pipeline is "Modular RAG" by Gao et al.'s typology — multiple specialised retrieval/augmentation modules orchestrated by a coordinator (RAGService). Cited by rag/what-is-rag.md.
sarmah2024hybridrag
Sarmah, B., Mehta, D., Hall, B., Rao, R., Patel, S., & Pasquali, S. (2024). HybridRAG: Integrating Knowledge Graphs and Vector Retrieval Augmented Generation for Efficient Information Extraction. arXiv: 2408.04948.
Canonical URL: https://arxiv.org/abs/2408.04948
We cite this for: a recent paper combining knowledge-graph-based retrieval and vector retrieval (HybridRAG). The ZOL pipeline is structurally similar (taxonomy lookup feeding query enrichment + Stage 5b/5c context injection feeding chunk retrieval) although our 'graph' is the Postgres taxonomy_relationships table rather than a dedicated graph DB. Cited by rag/taxonomy-query-enrichment.md.
soman2024biomedicalkg
Soman, K., Rose, P. W., Morris, J. H., Akbas, R. E., Smith, B., Peetoom, B., Villouta-Reyes, C., Cerono, G., Shi, Y., Rizk-Jackson, A., et al. (2024). Biomedical knowledge graph-optimized prompt generation for large language models. Bioinformatics. arXiv: 2311.17330.
Canonical URL: https://arxiv.org/abs/2311.17330
We cite this for: biomedical-domain knowledge-graph-augmented prompt generation, demonstrating that ontology-grounded retrieval improves factuality for medical-domain LLM applications. Our pipeline follows the same structural pattern (KG entity disambiguation feeding LLM prompt context) although our taxonomy is hospital-organisational rather than clinical-ontology. Cited by rag/taxonomy-query-enrichment.md.
Telephony Standards
rfc3261_sip
Rosenberg, J., Schulzrinne, H., Camarillo, G., Johnston, A., Peterson, J., Sparks, R., Handley, M., & Schooler, E. (2002). SIP: Session Initiation Protocol. RFC 3261. Internet Engineering Task Force. DOI: 10.17487/RFC3261.
Canonical URL: https://datatracker.ietf.org/doc/html/rfc3261
We cite this for: the IETF canonical SIP specification — standards basis for our PSTN → SIP gateway → LiveKit telephony path. Cited by voice/twilio-livekit-sip.md and voice/architecture.md.
itu_e164
International Telecommunication Union (2010). The international public telecommunication numbering plan (Recommendation E.164). ITU-T.
Canonical URL: https://www.itu.int/rec/T-REC-E.164
We cite this for: the ITU-T international phone-number format used in caller-ID, dialled-number validation, and tenant-routing tables. Cited by voice/twilio-livekit-sip.md.
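A structural check for E.164-formatted numbers might look like the sketch below. It assumes the common written form with a leading "+", a non-zero first digit, and at most 15 digits in total; the function name is hypothetical and the sample number is invented. It does not verify that a country code is actually assigned.

```python
import re

# "+" followed by 2 to 15 digits, the first of which is not zero.
_E164_PATTERN = re.compile(r"\+[1-9][0-9]{1,14}")


def is_e164(number: str) -> bool:
    """Return True if `number` has the structural shape of an E.164 number."""
    return _E164_PATTERN.fullmatch(number) is not None


print(is_e164("+3212345678"))   # international form
print(is_e164("012345678"))     # national form, no "+" prefix
```

Normalising caller IDs to this form before lookup keeps routing-table keys unambiguous across national dialling conventions.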
Voice / Speech Models
radford2023whisper
Radford, A., Kim, J. W., Xu, T., Brockman, G., McLeavey, C., & Sutskever, I. (2023). Robust Speech Recognition via Large-Scale Weak Supervision. ICML 2023. arXiv: 2212.04356.
Canonical URL: https://arxiv.org/abs/2212.04356
We cite this for: OpenAI Whisper — the alternative STT architecture considered before standardising on Deepgram Nova-3 (lower latency, Flemish-tuned acoustic model, native LiveKit Agents integration). Cited by voice/language-locking.md.
wang2017tacotron
Wang, Y., Skerry-Ryan, R. J., Stanton, D., Wu, Y., Weiss, R. J., Jaitly, N., Yang, Z., Xiao, Y., Chen, Z., Bengio, S., Le, Q., Agiomyrgiannakis, Y., Clark, R., & Saurous, R. A. (2017). Tacotron: Towards End-to-End Speech Synthesis. arXiv: 1703.10135.
Canonical URL: https://arxiv.org/abs/1703.10135
We cite this for: the foundational neural-TTS paper (Google's Tacotron). The production model (ElevenLabs Multilingual v2) is a direct descendant of this lineage; Tacotron established that prosody and pacing emerge from the model's attention over the input rather than from rule-based prosodic markup. Cited by voice/prosody-injection.md and voice/adaptive-tts-speed.md.
lin2026fullduplexbench
Lin, G.-T., Chen, C., Chen, Z., & Lee, H. (2026). Full-Duplex-Bench-v3: Benchmarking Tool Use for Full-Duplex Voice Agents Under Real-World Disfluency. arXiv: 2604.04847.
Canonical URL: https://arxiv.org/abs/2604.04847
We cite this for: a recent benchmark for full-duplex voice agents under tool-use and disfluency conditions; FDBv3 directly evaluates the kind of mid-utterance signalling the listening-ack mechanism implements. Cited by voice/listening-ack.md.
Conversation Analysis (Voice UX Foundations)
sacks1974turntaking
Sacks, H., Schegloff, E. A., & Jefferson, G. (1974). A Simplest Systematics for the Organization of Turn-Taking for Conversation. Language, 50(4), 696–735. Linguistic Society of America. DOI: 10.2307/412243.
Canonical URL: https://www.jstor.org/stable/412243
We cite this for: the foundational social-science reference for turn-taking organisation in spoken conversation; voice agents implement (or violate) the Sacks-Schegloff-Jefferson rules whether their designers know it or not. Cited by voice/conversational-intent.md, voice/context-aware-filler.md, and voice/listening-ack.md.
UX & Response-Time Perception
nielsen1993responsetimes
Nielsen, J. (1993). Response Times: The 3 Important Limits. Nielsen Norman Group article (excerpted from Usability Engineering, Morgan Kaufmann, 1993).
Canonical URL: https://www.nngroup.com/articles/response-times-3-important-limits/
We cite this for: Nielsen's three response-time thresholds (0.1 s instantaneous, 1 s seamless flow, 10 s loss of attention). Our latency budgets cluster around the 1 s and 10 s boundaries — the filler ladder fires precisely at these thresholds. Cited by voice/architecture.md, voice/adaptive-tts-speed.md, and architecture/pipeline-animation.md.
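The three limits can be expressed as a simple classifier. This is an illustrative sketch only: the band names and the function are hypothetical and do not describe how the filler ladder is implemented; only the 0.1 s, 1 s, and 10 s boundaries come from Nielsen.

```python
def perceived_band(elapsed_s: float) -> str:
    """Map an elapsed response time (seconds) onto Nielsen's three limits."""
    if elapsed_s < 0:
        raise ValueError("elapsed time cannot be negative")
    if elapsed_s <= 0.1:
        return "instantaneous"    # feels like direct manipulation
    if elapsed_s <= 1.0:
        return "flow-preserved"   # delay noticed, train of thought intact
    if elapsed_s <= 10.0:
        return "attention-held"   # user waits, but stays on task
    return "attention-lost"       # user is likely to switch away


for t in (0.05, 0.8, 4.0, 12.0):
    print(t, perceived_band(t))
```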
Healthcare Compliance
hipaa_safe_harbor
U.S. Department of Health and Human Services, Office for Civil Rights (2012). Guidance Regarding Methods for De-identification of Protected Health Information in Accordance with the HIPAA Privacy Rule. HHS guidance document.
Canonical URL: https://www.hhs.gov/hipaa/for-professionals/privacy/special-topics/de-identification/index.html
We cite this for: the U.S. regulatory baseline for what counts as de-identified protected health information. The 18 HIPAA Safe Harbor identifiers (names, dates, locations, phone numbers, etc.) inform our PII-redaction pattern set even though our primary regulatory regime is GDPR + AI Act. Cited by voice/twilio-livekit-sip.md (PII-redaction section).
EU Regulations & Ethics Guidance
gdpr_regulation
European Parliament and Council of the European Union (2016). General Data Protection Regulation (GDPR): Regulation (EU) 2016/679. Official Journal of the European Union, L 119/1.
Canonical URL: https://eur-lex.europa.eu/eli/reg/2016/679/oj
We cite this for: the canonical GDPR text. Specific articles cited across safety/overview.md, safety/dpia.md, safety/pii-protection.md, safety/data-retention-policy.md: Art. 4(5) pseudonymisation, Art. 5 principles, Art. 6 lawful bases, Art. 9 special-category data, Art. 25 DPbD, Art. 28 processors, Art. 30 records, Art. 32 security, Art. 35 DPIA, Art. 36 prior consultation.
ai_act_regulation
European Parliament and Council of the European Union (2024). EU Artificial Intelligence Act: Regulation (EU) 2024/1689. Official Journal of the European Union, L series.
Canonical URL: https://eur-lex.europa.eu/eli/reg/2024/1689/oj
We cite this for: the canonical AI Act text. Specific articles cited across safety/overview.md, safety/ai-act-compliance.md: Art. 5 prohibited practices, Art. 6 + Annex III high-risk classification, Art. 9 risk management, Art. 10 data governance, Art. 11 technical documentation, Art. 13 transparency, Art. 14 human oversight, Art. 15 robustness/cybersecurity, Art. 50 transparency/disclosure.
mdr_regulation
European Parliament and Council of the European Union (2017). EU Medical Device Regulation (MDR): Regulation (EU) 2017/745. Official Journal of the European Union, L 117/1.
Canonical URL: https://eur-lex.europa.eu/eli/reg/2017/745/oj
We cite this for: the canonical MDR text. Cited by safety/overview.md and safety/ai-act-compliance.md for the negative classification: Art. 2(1) medical-device definition + Annex VIII Rule 11 software classification, used to argue our system is informational/navigational and NOT a medical device under MDR.
hleg2019trustworthyai
High-Level Expert Group on Artificial Intelligence, European Commission (2019). Ethics Guidelines for Trustworthy AI. European Commission policy document.
Canonical URL: https://digital-strategy.ec.europa.eu/en/library/ethics-guidelines-trustworthy-ai
We cite this for: the European Commission HLEG Ethics Guidelines (April 2019). The seven HLEG key requirements (human agency, technical robustness, privacy, transparency, diversity, well-being, accountability) map onto AI Act Articles 13-15 and Article 50. Cited by safety/ai-act-compliance.md as the policy lineage that informed the AI Act's ethics framing.
Information Security & Authentication Standards
iso27001_2022
International Organization for Standardization (2022). ISO/IEC 27001:2022 — Information security, cybersecurity and privacy protection — Information security management systems — Requirements. ISO standard.
Canonical URL: https://www.iso.org/standard/27001
We cite this for: the target framework for our information security management system (ISMS). We are NOT certified; the document references specific Annex A controls (A.5.34 privacy/PII, A.8.10 information deletion, A.8.11 data masking) as the alignment target. Cited by safety/dpia.md, safety/data-retention-policy.md, safety/security.md.
iso27018_2019
International Organization for Standardization (2019). ISO/IEC 27018:2019 — Code of practice for protection of personally identifiable information (PII) in public clouds acting as PII processors. ISO standard.
Canonical URL: https://www.iso.org/standard/76559.html
We cite this for: the cloud-PII-processor controls. OpenAI and Twilio (our subprocessors under GDPR Art. 28) are expected to align with ISO/IEC 27018; we do not certify ourselves. Cited by safety/data-retention-policy.md.
rfc6749_oauth2
Hardt, D. (2012). The OAuth 2.0 Authorization Framework. RFC 6749. Internet Engineering Task Force. DOI: 10.17487/RFC6749.
Canonical URL: https://datatracker.ietf.org/doc/html/rfc6749
We cite this for: the IETF canonical OAuth 2.0 specification — the standards basis for our Keycloak-backed authorization framework. Cited by safety/security.md.
rfc7519_jwt
Jones, M. B., Bradley, J., & Sakimura, N. (2015). JSON Web Token (JWT). RFC 7519. Internet Engineering Task Force. DOI: 10.17487/RFC7519.
Canonical URL: https://datatracker.ietf.org/doc/html/rfc7519
We cite this for: the access-token format used in our Keycloak OIDC flow. Cited by safety/security.md.
openid_connect_core_1_0
Sakimura, N., Bradley, J., Jones, M. B., de Medeiros, B., & Mortimore, C. (2014). OpenID Connect Core 1.0 incorporating errata set 2. OpenID Foundation specification.
Canonical URL: https://openid.net/specs/openid-connect-core-1_0.html
We cite this for: the authentication-protocol layer above OAuth 2.0. Our Keycloak realm exposes the standard OIDC discovery and token endpoints. Cited by safety/security.md.
Adversarial Robustness Research
zou2023gcg
Zou, A., Wang, Z., Carlini, N., Nasr, M., Kolter, J. Z., & Fredrikson, M. (2023). Universal and Transferable Adversarial Attacks on Aligned Language Models. arXiv: 2307.15043.
Canonical URL: https://arxiv.org/abs/2307.15043
We cite this for: the original Greedy Coordinate Gradient (GCG) attack paper, the canonical reference for the adversarial-suffix threat class our hardening targets. Our regex-based detector targets the surface signatures of GCG-style suffixes (high-entropy sequences, character-level perturbations) and therefore needs none of the gradient-based machinery the paper uses to construct them. Cited by safety/dpia.md and safety/adversarial-hardening.md.
liao2024amplegcg
Liao, Z., & Sun, H. (2024). AmpleGCG: Learning a Universal and Transferable Generative Model of Adversarial Suffixes for Jailbreaking Both Open and Closed LLMs. arXiv: 2404.07921.
Canonical URL: https://arxiv.org/abs/2404.07921
We cite this for: AmpleGCG, which generalises the Zou et al. 2023 GCG attack via a generative model of adversarial suffixes (the AmpleGCG-Plus variant is a separate follow-up paper and is not part of this entry). Defences must assume a steady stream of new suffix patterns rather than a fixed adversarial vocabulary. Cited by safety/adversarial-hardening.md.
RAG Variants & Evaluation Frameworks
guu2020realm
Guu, K., Lee, K., Tung, Z., Pasupat, P., & Chang, M.-W. (2020). REALM: Retrieval-Augmented Language Model Pre-Training. arXiv: 2002.08909.
Canonical URL: https://arxiv.org/abs/2002.08909
We cite this for: the precursor demonstrating that retrieval-augmented pre-training improves language-model factuality. Cited by thesis/02-literature-review.md.
thakur2021beir
Thakur, N., Reimers, N., Rücklé, A., Srivastava, A., & Gurevych, I. (2021). BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models. NeurIPS Datasets and Benchmarks. arXiv: 2104.08663.
Canonical URL: https://arxiv.org/abs/2104.08663
We cite this for: BEIR — the canonical heterogeneous IR benchmark for zero-shot retrieval evaluation. The external benchmark anchor that the ZOL evaluation does NOT yet use (acknowledged methodology gap). Cited by thesis/02-literature-review.md, thesis/03-methodology.md, thesis/05-discussion.md.
muennighoff2022mteb
Muennighoff, N., Tazi, N., Magne, L., & Reimers, N. (2022). MTEB: Massive Text Embedding Benchmark. arXiv: 2210.07316.
Canonical URL: https://arxiv.org/abs/2210.07316
We cite this for: MTEB — the canonical embedding benchmark. Same role as BEIR for embedding-quality measurement; not yet used by ZOL evaluation. Cited by thesis/03-methodology.md, thesis/05-discussion.md.
yan2024crag
Yan, S.-Q., Gu, J.-C., Zhu, Y., & Ling, Z.-H. (2024). Corrective Retrieval Augmented Generation. arXiv: 2401.15884.
Canonical URL: https://arxiv.org/abs/2401.15884
We cite this for: Corrective RAG (CRAG) — introduces a retrieval evaluator triaging results into Correct/Incorrect/Ambiguous. The academic precedent for our ADR-0038 corrective-RAG quality gate. Cited by thesis/02-literature-review.md.
wang2024filco
Wang, Z., Araki, J., Jiang, Z., Parvez, M. R., & Neubig, G. (2024). Learning to Filter Context for Retrieval-Augmented Generation. arXiv: 2311.08377.
Canonical URL: https://arxiv.org/abs/2311.08377
We cite this for: FiLCO (Filter Context) — per-sentence context filtering for RAG generation. Cited by thesis/02-literature-review.md.
asai2024selfrag
Asai, A., Wu, Z., Wang, Y., Sil, A., & Hajishirzi, H. (2024). Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection. ICLR. arXiv: 2310.11511.
Canonical URL: https://arxiv.org/abs/2310.11511
We cite this for: Self-RAG — on-demand retrieval with self-reflection tokens. A recent RAG variant compared against our pipeline. Cited by thesis/02-literature-review.md.
edge2024graphrag
Edge, D., Trinh, H., Cheng, N., Bradley, J., Chao, A., Mody, A., Truitt, S., & Larson, J. (2024). From Local to Global: A Graph RAG Approach to Query-Focused Summarization. arXiv: 2404.16130.
Canonical URL: https://arxiv.org/abs/2404.16130
We cite this for: Microsoft GraphRAG — hierarchical knowledge-graph summarisation for RAG. The academic precedent for graph-augmented RAG; our conditional-injection finding contrasts with GraphRAG's unconditional fusion. Cited by thesis/02-literature-review.md, thesis/05-discussion.md.
es2023ragas
Es, S., James, J., Espinosa-Anke, L., & Schockaert, S. (2023). RAGAS: Automated Evaluation of Retrieval Augmented Generation. arXiv: 2309.15217.
Canonical URL: https://arxiv.org/abs/2309.15217
We cite this for: RAGAS framework — automated RAG evaluation via faithfulness, answer-relevancy, context-relevancy, and context-recall metrics. The methodology framework underpinning our golden-eval pipeline. Cited by thesis/01-introduction.md, thesis/02-literature-review.md, evaluation/index.md, evaluation/composite-quality-gate.md.
zheng2023llmjudge
Zheng, L., Chiang, W.-L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., Lin, Z., Li, Z., Li, D., Xing, E. P., Zhang, H., Gonzalez, J. E., & Stoica, I. (2023). Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. NeurIPS Datasets and Benchmarks. arXiv: 2306.05685.
Canonical URL: https://arxiv.org/abs/2306.05685
We cite this for: the canonical reference on LLM-as-a-judge evaluation, including known biases (position, verbosity, self-enhancement). Methodology grounding for our LLM-judge faithfulness scoring. Cited by evaluation/composite-quality-gate.md.
Adversarial Inputs & Hallucination
ji2023hallucination
Ji, Z., Lee, N., Frieske, R., Yu, T., Su, D., Xu, Y., Ishii, E., Bang, Y., Madotto, A., & Fung, P. (2023). Survey of Hallucination in Natural Language Generation. ACM Computing Surveys, 55(12), 1–38. DOI: 10.1145/3571730. arXiv: 2202.03629.
Canonical URL: https://arxiv.org/abs/2202.03629
We cite this for: comprehensive survey of hallucination in NLG. The canonical taxonomy for the hallucination failure mode our system must avoid. Cited by thesis/02-literature-review.md.
inan2023llamaguard
Inan, H., Upasani, K., Chi, J., Rungta, R., Iyer, K., Mao, Y., Tontchev, M., Hu, Q., Fuller, B., Testuggine, D., & Khabsa, M. (2023). Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations. arXiv: 2312.06674.
Canonical URL: https://arxiv.org/abs/2312.06674
We cite this for: Meta's Llama Guard — fine-tuned input-output safety classifier. A representative LLM-based safety-classification approach (alternative to our regex-based pre/post-filter design). Cited by thesis/02-literature-review.md.
greshake2023indirectinjection
Greshake, K., Abdelnabi, S., Mishra, S., Endres, C., Holz, T., & Fritz, M. (2023). Not what you've signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection. arXiv: 2302.12173.
Canonical URL: https://arxiv.org/abs/2302.12173
We cite this for: the canonical reference on indirect prompt injection (compromising LLM applications via attacker-controlled retrieval data). The threat-model basis for our adversarial test cases. Cited by evaluation/golden-questions.md.
Information Retrieval Foundations & Medical Terminology
manning2008ir
Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to Information Retrieval. Cambridge University Press. ISBN 978-0521865715.
Canonical URL: https://nlp.stanford.edu/IR-book/information-retrieval-book.html
We cite this for: Manning et al.'s canonical IR textbook (free online edition at Stanford NLP). The vocabulary-mismatch problem in classical IR. Cited by thesis/01-introduction.md.
voorhees2002philosophy
Voorhees, E. M. (2002). The Philosophy of Information Retrieval Evaluation. In Evaluation of Cross-Language Information Retrieval Systems (CLEF 2001), pages 355–370. Springer.
Canonical URL: https://trec.nist.gov/pubs/trec11/papers/OVERVIEW.11.pdf
We cite this for: Voorhees' canonical paper on IR evaluation methodology. The philosophy of test-collection-based evaluation. Cited by evaluation/golden-questions.md.
bodenreider2004umls
Bodenreider, O. (2004). The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Research, 32(Database issue), D267–D270. DOI: 10.1093/nar/gkh061.
Canonical URL: https://doi.org/10.1093/nar/gkh061
We cite this for: the canonical UMLS reference and the foundational biomedical-terminology reference for our SNOMED CT integration. The URL was verified through the DOI registry because academic.oup.com blocks automated checks. Cited by thesis/01-introduction.md, thesis/02-literature-review.md.
snomed_international
SNOMED International (2024). SNOMED CT: The global common language for health terms. SNOMED International official website.
Canonical URL: https://www.snomed.org/
We cite this for: SNOMED International — the standards body that maintains SNOMED CT. The Belgian Edition (~280K concepts, ~580K Dutch descriptions). Cited by thesis/02-literature-review.md.
Software Engineering Foundations
martin2017clean
Martin, R. C. (2017). Clean Architecture: A Craftsman's Guide to Software Structure and Design. Pearson. ISBN 978-0134494166.
Canonical URL: Pearson catalog page
We cite this for: Robert C. Martin's Clean Architecture canonical reference. The dependency-inversion architectural pattern (and ADR-001). Cited by thesis/03-methodology.md.
fowler2007mocks
Fowler, M. (2007). Mocks Aren't Stubs. martinfowler.com article.
Canonical URL: https://martinfowler.com/articles/mocksArentStubs.html
We cite this for: Fowler's canonical article on test doubles. The practitioner argument informing our no-mocking-by-default test policy (and ADR-0002). Cited by thesis/01-introduction.md, thesis/03-methodology.md.
efron1993bootstrap
Efron, B., & Tibshirani, R. J. (1993). An Introduction to the Bootstrap. Chapman & Hall/CRC. ISBN 978-0412042317.
Canonical URL: Routledge catalog page
We cite this for: Efron & Tibshirani's canonical bootstrap textbook. Percentile-based bootstrap confidence intervals (10000 resamples) for golden-eval reliability estimates. Cited by thesis/03-methodology.md, thesis/04-results.md.
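A percentile bootstrap interval of the kind described above can be sketched with the standard library alone. The scores are invented, the statistic is assumed to be the mean, and the function name, the 95% level, and the fixed seed are illustrative choices; only the percentile method and the 10000 resamples come from the text.

```python
import random
import statistics


def bootstrap_percentile_ci(scores, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `scores`."""
    if not scores:
        raise ValueError("scores must be non-empty")
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(scores)
    # Resample with replacement, record the mean of each resample, then sort.
    means = sorted(
        statistics.fmean(rng.choices(scores, k=n)) for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical per-question evaluation scores in [0, 1].
scores = [0.82, 0.91, 0.77, 0.88, 0.95, 0.69, 0.84, 0.90, 0.73, 0.86]
low, high = bootstrap_percentile_ci(scores)
print(f"mean={statistics.fmean(scores):.3f}  95% CI=({low:.3f}, {high:.3f})")
```

The percentile method makes no normality assumption, which suits small evaluation sets with bounded scores.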
wohlin2012experimentation
Wohlin, C., Runeson, P., Höst, M., Ohlsson, M. C., Regnell, B., & Wesslén, A. (2012). Experimentation in Software Engineering. Springer. ISBN 978-3-642-29044-2. DOI: 10.1007/978-3-642-29044-2.
Canonical URL: Springer book page
We cite this for: Wohlin et al.'s canonical SE-experimentation textbook. The fractional-factorial experiment design and the four-category threats-to-validity framework (conclusion/internal/construct/external). Cited by thesis/01-introduction.md, thesis/03-methodology.md.