A/B Experiment: Vector-Only vs Hybrid RAG

Experiment Metadata
| Field | Value |
| --- | --- |
| Date | 2026-02-17 |
| Branch | bugfixes-and-consolidation (commit 4751218) |
| Conditions | A = Vector-only, B = Hybrid (vector + graph) |
| Sample size | 121 golden questions |
| Repetitions | 1 per condition |
| Statistical test | Paired Wilcoxon signed-rank test |
| Primary metric | Entity recall (case-insensitive substring match) |
| Infrastructure | PostgreSQL + pgvector (768-dim), Neo4j knowledge graph |

1. Motivation

The ZOL Intelligent Search system supports two retrieval modes: vector-only (semantic similarity search over pgvector embeddings) and hybrid (vector search augmented with Neo4j knowledge graph traversal). The graph provides structured entity relationships -- doctors linked to departments, departments mapped to campuses, conditions routed to specialties -- that complement unstructured vector retrieval.

This experiment measures the incremental value of the knowledge graph component. Specifically, we test the hypothesis:

H1: Hybrid retrieval produces higher entity recall than vector-only retrieval across the golden question benchmark.

The null hypothesis (H0) states that there is no difference in entity recall between the two conditions. A paired design controls for question-level variance, and the non-parametric Wilcoxon signed-rank test is used because entity recall scores are bounded, non-normal, and contain ties.
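The primary metric can be sketched in a few lines. This is a minimal illustration of "case-insensitive substring match", not the project's evaluation code: the function name `entity_recall`, the handling of an empty expected list, and the sample answer are all assumptions made for the example.

```python
def entity_recall(answer: str, expected_entities: list[str]) -> float:
    """Fraction of expected entities that occur in the answer (case-insensitive substring)."""
    if not expected_entities:
        # Assumption: a question with no expected entities counts as full recall.
        return 1.0
    haystack = answer.casefold()
    hits = sum(1 for entity in expected_entities if entity.casefold() in haystack)
    return hits / len(expected_entities)


# Illustrative inputs only; not taken from the golden question set.
answer = "Psoriasis wordt behandeld door Dermatologie op campus Sint-Barbara."
full = entity_recall(answer, ["dermatologie", "sint-barbara"])
partial = entity_recall(answer, ["Dermatologie", "Sint-Barbara", "Cardiologie"])
print(full, round(partial, 3))
```

Because a question typically has only a handful of expected entities, the score takes few distinct values (0, 0.5, 0.67, 1.0 and so on), which is why the paired scores are bounded and heavily tied.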

2. Experimental Design

2.1 Protocol

The experiment follows a within-subjects design where each golden question is evaluated under both conditions sequentially:

Phase 1 (Vector-Only)
├── Disable graph RAG via user preference API
├── Execute all 121 questions with fresh conversation IDs
└── Record: answer, entity recall, latency, contexts, citations

Phase 2 (Hybrid)
├── Enable graph RAG via user preference API
├── Execute all 121 questions with fresh conversation IDs
└── Record: answer, entity recall, latency, contexts, citations

Phase 3 (Analysis)
├── Paired Wilcoxon signed-rank test on entity recall
├── Cohen's d effect size
├── Per-category and per-graph-hops stratification
└── Outlier identification (|delta| > 0.3)
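The bookkeeping part of Phase 3 (win/loss/tie counts and the |delta| > 0.3 outlier rule) can be sketched as follows. The function name `summarise_deltas`, the dictionary layout, and the question IDs are hypothetical; the real analysis script may be organised differently.

```python
def summarise_deltas(recall_a, recall_b, outlier_threshold=0.3):
    """Classify paired recall scores into wins, losses, ties, and outliers."""
    # Delta is hybrid minus vector-only, matching the "Delta (B-A)" convention.
    deltas = {qid: recall_b[qid] - recall_a[qid] for qid in recall_a}
    return {
        "wins": sum(1 for d in deltas.values() if d > 0),
        "losses": sum(1 for d in deltas.values() if d < 0),
        "ties": sum(1 for d in deltas.values() if d == 0),
        "outliers": sorted(q for q, d in deltas.items() if abs(d) > outlier_threshold),
    }


# Made-up scores for four hypothetical question IDs.
vector = {"Q1": 0.0, "Q2": 1.0, "Q3": 1.0, "Q4": 0.5}
hybrid = {"Q1": 1.0, "Q2": 0.5, "Q3": 1.0, "Q4": 0.75}
summary = summarise_deltas(vector, hybrid)
print(summary)
```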

2.2 Controls

  • Same backend instance: Both phases ran against the same deployed backend (FastAPI, PostgreSQL, Neo4j, Redis).
  • Same embedding model: nomic-embed-text (768-dimensional) for all vector searches.
  • Same LLM: GPT-4.1-mini for response generation, GPT-4.1-nano for intent classification.
  • Fresh conversations: Each question received a new conversation_id to prevent context leakage between questions. Follow-up chain questions (depends_on) shared a conversation within their chain.
  • No cache hits: The semantic cache was active but produced no hits for the golden questions (each uses unique phrasing).

2.3 Limitations

  • Single repetition: With one run per condition, LLM non-determinism cannot be distinguished from treatment effects. Because the temperature is above 0, the same prompt can produce different outputs across runs.
  • Graph state: The Neo4j knowledge graph was populated with the full ZOL entity set (~2,400 nodes, ~4,800 relationships). Graph quality affects hybrid results.
  • Confounded latency: Response time includes LLM inference, which varies with token count and API load. Latency differences are observational, not causal.

3. Results

3.1 Overall Entity Recall

| Metric | Vector-Only (A) | Hybrid (B) | Delta (B-A) | p-value | Cohen's d |
| --- | --- | --- | --- | --- | --- |
| Entity recall (mean) | 0.881 | 0.915 | +0.034 | 0.081 | 0.181 |
| Entity recall (std) | 0.261 | 0.224 | -0.037 | -- | -- |
| Perfect score (1.0) | 96/121 (79.3%) | 103/121 (85.1%) | +7 questions | -- | -- |
| Passing (>=0.5) | 115/121 (95.0%) | 117/121 (96.7%) | +2 questions | -- | -- |

The hybrid condition shows a +3.4 percentage-point improvement in mean entity recall. The p-value of 0.081 falls outside the conventional significance threshold (alpha = 0.05) but within a liberal threshold (alpha = 0.10). Cohen's d = 0.181 indicates a small effect size (Cohen, 1988). Notably, the standard deviation is lower in the hybrid condition, suggesting more consistent performance.

Key Finding

Hybrid retrieval increased perfect-score questions from 96 to 103 (+7 questions; 79.3% to 85.1% of the benchmark, a 7.3% relative increase), primarily in multi-hop and multilingual categories. The improvement is directionally positive but not statistically significant at alpha = 0.05 with a single repetition.

3.2 Win/Loss/Tie Analysis

| Outcome | Count | Percentage |
| --- | --- | --- |
| Hybrid wins (B > A) | 10 | 8.3% |
| Hybrid loses (B < A) | 3 | 2.5% |
| Tie (B = A) | 108 | 89.3% |

The win:loss ratio of 10:3 favours hybrid retrieval. Of the 13 non-tied questions, hybrid improved 76.9% and regressed 23.1%.
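Only the 13 non-tied questions contribute to the signed-rank statistic, so the test can be re-derived from the per-question deltas listed in Section 4. The sketch below is a standard-library reimplementation under assumptions the report does not state: zero differences are dropped, tied |delta| values receive mid-ranks, and the two-sided p-value uses the normal approximation with tie and continuity corrections. The function name is hypothetical.

```python
from collections import Counter
from math import sqrt
from statistics import NormalDist


def wilcoxon_signed_rank(deltas):
    """Return (W+, W-, two-sided p) using the normal approximation."""
    nonzero = [d for d in deltas if d != 0]  # assumption: zero differences are dropped
    n = len(nonzero)
    abs_sorted = sorted(abs(d) for d in nonzero)
    # Assign each distinct |delta| the mean of the ranks it occupies (mid-ranks).
    rank_of = {}
    i = 0
    while i < n:
        j = i
        while j < n and abs_sorted[j] == abs_sorted[i]:
            j += 1
        rank_of[abs_sorted[i]] = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        i = j
    w_plus = sum(rank_of[abs(d)] for d in nonzero if d > 0)
    w_minus = sum(rank_of[abs(d)] for d in nonzero if d < 0)
    expected = n * (n + 1) / 4
    tie_term = sum(t**3 - t for t in Counter(abs_sorted).values()) / 48
    variance = n * (n + 1) * (2 * n + 1) / 24 - tie_term
    z = (min(w_plus, w_minus) - expected + 0.5) / sqrt(variance)  # continuity correction
    p_value = min(1.0, 2 * NormalDist().cdf(z))
    return w_plus, w_minus, p_value


# Deltas from Section 4 (0.33 = 1.00 - 0.67), plus 108 ties.
deltas = [1.0] * 2 + [0.5] * 6 + [0.33] * 2 + [-0.5] * 3 + [0.0] * 108
w_plus, w_minus, p_value = wilcoxon_signed_rank(deltas)
print(w_plus, w_minus, round(p_value, 3))
```

Under these assumptions the rank sums are 70 for hybrid wins and 21 for hybrid losses, and the p-value rounds to the reported 0.081. An exact-distribution variant, or one without the continuity correction, would give a somewhat different value.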

3.3 Per-Category Breakdown

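| Category | n | Vector (A) | Hybrid (B) | Delta | Direction |
| --- | --- | --- | --- | --- | --- |
| ambiguous_symptom | 5 | 0.700 | 0.700 | +0.000 | -- |
| campus_info | 6 | 0.958 | 0.958 | +0.000 | -- |
| compound_word | 5 | 0.900 | 0.900 | +0.000 | -- |
| condition_department | 10 | 0.967 | 0.950 | -0.017 | slightly worse |
| doctor_department | 6 | 1.000 | 1.000 | +0.000 | -- |
| emergency | 3 | 1.000 | 1.000 | +0.000 | -- |
| entity_disambiguation | 4 | 1.000 | 1.000 | +0.000 | -- |
| followup_chain | 6 | 0.833 | 0.750 | -0.083 | worse |
| multi_hop_graph | 18 | 0.806 | 0.880 | +0.074 | better |
| multilingual | 8 | 0.812 | 1.000 | +0.188 | much better |
| navigation | 4 | 0.792 | 0.792 | +0.000 | -- |
| out_of_scope | 8 | 1.000 | 1.000 | +0.000 | -- |
| practical_info | 9 | 0.944 | 1.000 | +0.056 | better |
| referral | 3 | 1.000 | 1.000 | +0.000 | -- |
| safety_refusal | 5 | 1.000 | 1.000 | +0.000 | -- |
| service_info | 8 | 0.750 | 0.938 | +0.188 | much better |
| taxonomy_alias | 6 | 0.833 | 0.917 | +0.083 | better |
| treatment_info | 7 | 0.786 | 0.714 | -0.071 | worse |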

Strongest improvements: multilingual (+18.8pp), service_info (+18.8pp), and multi_hop_graph (+7.4pp). These are categories where graph traversal provides entity relationships not easily captured by vector similarity alone.

Regressions: followup_chain (-8.3pp) and treatment_info (-7.1pp). These are attributed to LLM non-determinism rather than systematic degradation (see Section 4).

3.4 Per-Graph-Hops Stratification

Golden questions are annotated with the minimum number of graph hops required to answer them (0 = vector-sufficient, 1-3 = requires graph traversal). The 58 questions without an annotation are reported as "unknown".

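| Hops | n | Vector (A) | Hybrid (B) | Delta |
| --- | --- | --- | --- | --- |
| 0 | 17 | 0.971 | 0.971 | +0.000 |
| 1 | 23 | 0.826 | 0.891 | +0.065 |
| 2 | 16 | 0.771 | 0.865 | +0.094 |
| 3 | 7 | 0.810 | 0.857 | +0.048 |
| unknown | 58 | 0.915 | 0.930 | +0.014 |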

The graph's benefit grows from 0 hops (+0.0pp) through 1 hop (+6.5pp) to a peak at 2 hops (+9.4pp), then drops to +4.8pp at 3 hops, where the sample is small (n=7). This is consistent with the expected behaviour: questions requiring multi-hop reasoning (e.g., "which department treats condition X, and on which campus?") benefit most from structured graph traversal.
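As a consistency check, the per-stratum means weighted by their question counts should recombine to the overall means in Section 3.1. The sketch below uses the rounded values from the hops table, so agreement is only expected to within rounding; the helper name `pooled_mean` is hypothetical.

```python
# (n, vector-only recall, hybrid recall) per hop stratum, copied from the table.
strata = {
    "0": (17, 0.971, 0.971),
    "1": (23, 0.826, 0.891),
    "2": (16, 0.771, 0.865),
    "3": (7, 0.810, 0.857),
    "unknown": (58, 0.915, 0.930),
}


def pooled_mean(rows, column):
    """Mean of per-stratum means weighted by stratum size (column 1 = A, 2 = B)."""
    total_n = sum(row[0] for row in rows.values())
    return sum(row[0] * row[column] for row in rows.values()) / total_n


total_questions = sum(row[0] for row in strata.values())
pooled_vector = pooled_mean(strata, 1)
pooled_hybrid = pooled_mean(strata, 2)
print(total_questions, round(pooled_vector, 4), round(pooled_hybrid, 4))
```

The strata cover all 121 questions, and both pooled values land within 0.001 of the reported 0.881 and 0.915.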

Figure: Entity Recall by Graph Hops. X-axis: graph hops (0, 1, 2, 3, unknown); y-axis: entity recall (0.70 to 1.00); series: Vector-Only and Hybrid. The plotted values are those in the table above.

3.5 Latency Analysis

| Metric | Vector-Only (A) | Hybrid (B) | Delta |
| --- | --- | --- | --- |
| Mean | 14,493 ms | 15,323 ms | +830 ms (+5.7%) |
| Std dev | 7,365 ms | 4,969 ms | -2,396 ms |
| Median (p50) | 15,996 ms | 15,514 ms | -482 ms |
| p90 | 21,979 ms | 20,165 ms | -1,814 ms |
| p95 | 23,466 ms | 22,181 ms | -1,285 ms |
| p99 | 29,392 ms | 25,233 ms | -4,159 ms |

The hybrid condition has a slightly higher mean latency (+830ms, +5.7%) but lower variance and tighter tail latencies. The p90, p95, and p99 are all lower for hybrid, indicating that while the average is marginally slower, the worst-case performance is better controlled. The lower standard deviation (4,969ms vs 7,365ms) suggests more predictable response times.

Per-Category Latency

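| Category | n | Vector (ms) | Hybrid (ms) | Delta (ms) |
| --- | --- | --- | --- | --- |
| ambiguous_symptom | 5 | 18,683 | 19,076 | +393 |
| campus_info | 6 | 19,938 | 13,929 | -6,009 |
| compound_word | 5 | 12,744 | 17,576 | +4,832 |
| condition_department | 10 | 17,383 | 17,827 | +444 |
| doctor_department | 6 | 17,115 | 15,080 | -2,036 |
| emergency | 3 | 13,245 | 20,476 | +7,232 |
| entity_disambiguation | 4 | 10,849 | 15,314 | +4,465 |
| followup_chain | 6 | 12,708 | 16,697 | +3,990 |
| multi_hop_graph | 18 | 16,213 | 17,546 | +1,334 |
| multilingual | 8 | 9,163 | 16,218 | +7,055 |
| navigation | 4 | 16,941 | 13,749 | -3,191 |
| out_of_scope | 8 | 5,348 | 5,554 | +206 |
| practical_info | 9 | 18,988 | 15,513 | -3,476 |
| referral | 3 | 8,023 | 13,798 | +5,774 |
| safety_refusal | 5 | 5,544 | 4,907 | -637 |
| service_info | 8 | 14,830 | 15,027 | +197 |
| taxonomy_alias | 6 | 12,961 | 17,089 | +4,127 |
| treatment_info | 7 | 20,882 | 17,472 | -3,410 |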

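Latency differences across categories appear to be dominated by LLM inference variability and response length rather than retrieval strategy, as both conditions use the same embedding and LLM infrastructure. With one run per condition and only 3 to 18 questions per category, the per-category deltas are noisy and should not be read as retrieval effects.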

3.6 Context Retrieval & Citations

| Metric | Vector-Only (A) | Hybrid (B) |
| --- | --- | --- |
| Mean contexts retrieved | 1.52 | 1.64 |
| Zero-context queries | 19 (15.7%) | 21 (17.4%) |
| Mean citations | 1.52 | 1.64 |
| Total citations | 184 | 199 |

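Hybrid retrieval yields slightly more contexts and citations per query (+0.12 on average), consistent with graph traversal contributing additional context. Zero-context queries were marginally more frequent under hybrid (21 vs 19), so the increase is not uniform across questions.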

3.7 Safety Compliance

| Metric | Vector-Only (A) | Hybrid (B) |
| --- | --- | --- |
| Safety questions | 11 | 11 |
| Correct refusals | 11/11 (100%) | 11/11 (100%) |

Both conditions achieve 100% safety refusal accuracy. The retrieval strategy does not affect the safety layer's ability to detect and refuse medical advice queries.

4. Outlier Analysis

Thirteen questions showed a large per-question change in entity recall (|delta| > 0.3); these are the same 13 questions that are not ties in Section 3.2. Ten improved under hybrid and three regressed.

4.1 Improvements (Hybrid > Vector)

| QID | Category | Vector | Hybrid | Delta | Explanation |
| --- | --- | --- | --- | --- | --- |
| GQ-033 | service_info | 0.00 | 1.00 | +1.00 | "Heeft ZOL een apotheek?" -- Graph provided the Apotheek service entity; vector found no relevant chunks. |
| GQ-063 | multilingual | 0.00 | 1.00 | +1.00 | Turkish: "Hangi kampuste cocuk psikiyatrisi var?" -- Graph resolved the entity relationship across language barrier. |
| GQ-044 | service_info | 0.50 | 1.00 | +0.50 | "Biedt ZOL hartrevalidatie aan?" -- Graph linked Hartrevalidatie to Cardiologie department. |
| GQ-057 | multilingual | 0.50 | 1.00 | +0.50 | Turkish: "ZOL'de kalp doktoru var mi?" -- Graph entity lookup compensated for poor cross-lingual embedding similarity. |
| GQ-094 | multi_hop_graph | 0.50 | 1.00 | +0.50 | "Psoriasis op Sint-Barbara?" -- 2-hop query (condition -> department -> campus) resolved by graph. |
| GQ-106 | taxonomy_alias | 0.50 | 1.00 | +0.50 | "Suikerziekte onderzoeken" -- Taxonomy alias (suikerziekte -> Diabetes) enabled correct graph traversal. |
| GQ-112 | practical_info | 0.50 | 1.00 | +0.50 | "Wat meebrengen naar raadpleging?" -- Graph retrieved additional practical context documents. |
| GQ-041 | condition_department | 0.67 | 1.00 | +0.33 | "Knobbel in borst" -- Graph added Borstcentrum/Oncologie entity to response. |
| GQ-102 | multi_hop_graph | 0.67 | 1.00 | +0.33 | "Chemotherapie bij borstkanker" -- 3-hop traversal (condition -> treatment -> department -> campus). |
| GQ-100 | multi_hop_graph | 0.00 | 0.50 | +0.50 | "Onderzoeken bij hartfalen" -- Partial graph traversal improved from zero to partial recall. |

4.2 Regressions (Hybrid < Vector)

| QID | Category | Vector | Hybrid | Delta | Root Cause |
| --- | --- | --- | --- | --- | --- |
| GQ-025 | treatment_info | 0.50 | 0.00 | -0.50 | LLM non-determinism. Hybrid happened to retrieve 0 contexts (vs 1 for vector). The LLM produced a minimal fallback response. Not systematic -- same query enrichment in both conditions. |
| GQ-040 | condition_department | 1.00 | 0.50 | -0.50 | Entity alias mismatch. Vector used "NKO", hybrid used "KNO" -- both valid abbreviations for Neus-Keel-Oorheelkunde. The hybrid answer was objectively better (listed 6 doctors). Fixed post-experiment by updating expected entities to language-resilient substrings. |
| GQ-068 | followup_chain | 1.00 | 0.50 | -0.50 | LLM non-determinism on follow-up chain. Depends on GQ-067; different conversation context led to different retrieval path. Not systematic. |

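All three regressions are attributed to LLM non-determinism or evaluation methodology artefacts rather than systematic degradation from graph integration. With a single repetition this attribution rests on case-by-case inspection and cannot be confirmed statistically (see Section 2.3).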
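Because the other 108 questions have a delta of zero, the thirteen deltas listed above are enough to recompute the headline effect. The sketch below assumes that the reported Cohen's d is the paired-differences form (mean delta divided by the sample standard deviation of the deltas); the report does not state which variant was used.

```python
from statistics import mean, stdev

# Per-question deltas (hybrid minus vector) from Sections 4.1 and 4.2, plus 108 ties.
improvements = [1.0, 1.0, 0.5, 0.5, 0.5, 0.5, 0.5, 0.33, 0.33, 0.5]
regressions = [-0.5, -0.5, -0.5]
paired_deltas = improvements + regressions + [0.0] * 108

mean_delta = mean(paired_deltas)
# Assumption: d is computed on paired differences with the sample (n-1) standard deviation.
cohens_d = mean_delta / stdev(paired_deltas)
print(len(paired_deltas), round(mean_delta, 3), round(cohens_d, 3))
```

Under that assumption the values round to the reported +0.034 and 0.181, which suggests the summary statistics and the outlier tables are mutually consistent.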

5. Discussion

5.1 Interpretation

The hybrid condition shows a directional improvement in overall entity recall, with the largest gains in the categories where graph traversal provides structural advantages:

  1. Multi-hop queries (+9.4pp at 2 hops): Questions requiring traversal across entity relationships (condition -> department -> campus) cannot be reliably answered by vector similarity alone, as the relevant information may span multiple source documents that are not semantically similar to each other.

  2. Multilingual queries (+18.8pp): The knowledge graph acts as a language-agnostic entity bridge. A Turkish query about "kalp doktoru" (heart doctor) maps to the same Cardiologie node regardless of input language, compensating for weak cross-lingual embedding similarity in the monolingual embedding model (nomic-embed-text).

  3. Service/taxonomy queries (+18.8pp, +8.3pp): Alias resolution through the taxonomy (suikerziekte -> Diabetes, hartrevalidatie -> Cardiologie) ensures that patient-friendly Dutch terms reach the correct entity nodes.

5.2 Statistical Power

The experiment's primary limitation is statistical power. With a single repetition per condition and LLM non-determinism contributing noise, the Wilcoxon test's p-value of 0.081 is suggestive but inconclusive at the conventional alpha = 0.05 threshold. The Cohen's d of 0.181 (small effect) is consistent with a modest but real improvement that would require approximately 3-5 repetitions to detect reliably at alpha = 0.05 with 80% power.
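A back-of-envelope sample-size check can make the power argument concrete. The sketch below uses the textbook normal approximation for a two-sided paired test, n = ((z_{1-alpha/2} + z_{power}) / d)^2, which is not a Wilcoxon-specific power analysis. It also treats every repeated run of a question as an independent pair, which is optimistic because repeated runs of the same question are correlated. The function name `required_pairs` is hypothetical.

```python
from math import ceil
from statistics import NormalDist


def required_pairs(effect_size, alpha=0.05, power=0.80):
    """Approximate number of paired observations for a two-sided paired test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the test
    z_power = NormalDist().inv_cdf(power)          # quantile for the target power
    return ceil(((z_alpha + z_power) / effect_size) ** 2)


pairs_needed = required_pairs(0.181)
# Assumption: each pass over the 121 golden questions contributes 121 independent pairs.
passes_needed = ceil(pairs_needed / 121)
print(pairs_needed, passes_needed)
```

Under these assumptions roughly 240 pairs are needed, or about two passes over the benchmark. Since the independence assumption is generous, that figure is best read as a lower bound, and the 3-5 repetitions suggested above are the more conservative target.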

5.3 Practical Significance

Despite the lack of statistical significance, the practical implications are meaningful:

  • 7 additional perfect-score questions (79.3% -> 85.1%)
  • 10:3 win:loss ratio on non-tied questions
  • No safety degradation (100% refusal accuracy maintained)
  • Tighter tail latencies (p95: -1,285ms, p99: -4,159ms)
  • All regressions attributed to LLM non-determinism or an evaluation-data alias mismatch, not systematic issues

For a hospital search system where each percentage point of entity recall represents better patient navigation, the practical benefit of hybrid retrieval justifies its inclusion even before reaching formal statistical significance.

6. Conclusion

Hybrid retrieval (vector + knowledge graph) improves entity recall by +3.4 percentage points over vector-only retrieval, with the effect concentrated in multi-hop, multilingual, and service information queries. The improvement is directionally consistent (10 wins vs 3 losses) and practically meaningful (+7 perfect-score questions), though it does not reach statistical significance at alpha = 0.05 with a single repetition.

The three observed regressions are attributable to LLM non-determinism and an entity alias mismatch in the evaluation data, not to systematic degradation from graph integration. Safety compliance remains at 100% under both conditions.

Recommendation: Retain hybrid retrieval as the default mode. Consider running 3-5 repetitions in a future experiment to achieve adequate statistical power for formal hypothesis testing.

7. Reproducibility

7.1 Running the Experiment

cd backend
source venv/bin/activate

# Dry run (no API calls)
python -m tests.evaluation.run_ab_experiment --dry-run

# Full experiment (requires running backend + infrastructure)
python -m tests.evaluation.run_ab_experiment --repetitions 1

# With multiple repetitions for higher statistical power
python -m tests.evaluation.run_ab_experiment --repetitions 5

7.2 Data Location

| Artifact | Path |
| --- | --- |
| Raw vector results | backend/tests/evaluation/ab_results/results_vector.json |
| Raw hybrid results | backend/tests/evaluation/ab_results/results_hybrid.json |
| Structured report | backend/tests/evaluation/ab_results/ab_experiment_report.json |
| Markdown report | backend/tests/evaluation/ab_results/ab_experiment_report.md |
| Golden questions | backend/tests/evaluation/golden_questions.json |

7.3 System Configuration at Time of Experiment

| Component | Version/Setting |
| --- | --- |
| Embedding model | nomic-embed-text (768-dim) |
| LLM (generation) | gpt-4.1-mini |
| LLM (intent) | gpt-4.1-nano |
| Vector DB | PostgreSQL 16 + pgvector |
| Graph DB | Neo4j 5.x (~2,400 nodes, ~4,800 relationships) |
| Reranker | cross-encoder/ms-marco-MiniLM-L-6-v2 |
| Backend | FastAPI (Python 3.12) |

References

  • Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences (2nd ed.). Lawrence Erlbaum Associates.
  • Es, S., James, J., Espinosa-Anke, L., & Schockaert, S. (2023). RAGAS: Automated evaluation of retrieval augmented generation. arXiv preprint, arXiv:2309.15217. https://arxiv.org/abs/2309.15217
  • Voorhees, E. M. (2002). The philosophy of information retrieval evaluation. Lecture Notes in Computer Science, 2406, 355--370. https://doi.org/10.1007/3-540-45691-0_34