Query Decomposition for Multi-Hop Retrieval
Single-shot dense retrieval embeds an entire question into one vector and competes that vector against the corpus. For questions that compose multiple independent evidence chains — which department offers procedure P? + which doctors work in that department? + when do those doctors consult on campus C? — a single embedding cannot represent the conjunction faithfully. Empirically, the model picks two of the three aspects, retrieves chunks that satisfy those two, and either hallucinates or silently omits the third aspect.
Query decomposition (see ADR-0032) breaks multi-hop queries into focused single-hop sub-questions, retrieves each in parallel, and merges the evidence before the LLM sees the context. This page documents the algorithm, the LLM prompt, the trade-off matrix that shaped the design, the empirical results from the A/B run, and the worked example.
The problem in one query
Consider:
"Welke arts op campus Sint-Jan doet knieoperaties en wanneer consulteert hij?"
Which doctor on campus Sint-Jan does knee operations, and when does he consult?
A faithful answer requires three independent lookups:
- Which department offers knee surgery? (treatment → department)
- Which doctors work in that department? (department → doctors)
- When do those doctors consult on campus Sint-Jan? (doctors + campus → schedule)
Without decomposition, vector search blends all three aspects into one embedding. The retrieved chunks typically cover Orthopedie and a handful of orthopedic surgeons, but consultation-schedule chunks are not in the top-K because the embedding's "schedule on Sint-Jan" component is diluted by the dominant "knee surgery" component. The LLM either hallucinates the schedule or quietly omits it. Decomposition fixes this at retrieval time, before the LLM is involved.
Trade-offs
| Decision | Chosen | Alternatives considered | Rejected because |
|---|---|---|---|
| When to decompose | LLM gate on every query that fails the cheap heuristic gate | Always decompose; never decompose; rule-based regex classifier | Always-decompose blows the per-query latency budget for the ~85 % of single-hop queries (cost: ~200 ms + ~$0.0001 each). Never-decompose loses the multi-hop case the feature exists to solve. A regex classifier was tried first and missed conjunctive forms across Dutch, English, French, and Italian patient input — patterns multiplied without ever achieving acceptable recall. The two-stage gate (heuristic + LLM) gives free pass-through for obviously simple queries and structured-output reliability where it matters. |
| Output validation | structured_call(output_model=DecompositionOutput) with retries; raw json.loads retained as fallback only on StructuredCallError | Raw OpenAI client + json.loads; LangChain output parsers; Pydantic AI Agent | Raw json.loads shipped malformed outputs to downstream code without retry, which is exactly the silent-failure pattern the project banned in 2026-05 (see CLAUDE.md silent-failure discipline). LangChain pulled in a heavier dependency surface than its runtime value justified. A Pydantic AI Agent was trialed but removed 2026-05-12 (+720 ms/call). The structured_call helper enforces the schema, retries on validation failure, and only triggers the legacy fallback when all retries are exhausted. |
| Sub-question reranking | Per-sub-question reranking + round-robin interleave | Merge all sub-question results, rerank against the original blended query | The merge-then-rerank approach silently dropped minority topics: chunks about rolstoelen scored poorly against a 5-topic query dominated by afspraak and parking, even when the chunks existed in the corpus. Per-sub-question reranking scores each chunk against its own sub-question; round-robin interleaving guarantees every topic's best chunk reaches the top of the assembled context. |
| Sub-question concurrency | Parallel asyncio.gather() for sub-retrieval; sequential reranking | Fully parallel (retrieval + rerank); fully sequential | Concurrent reranker calls trip API rate limits or — when the API fails — trigger parallel local-model fallback downloads that deadlock on the GPU lock. Sequential reranking is N × ~450 ms (typically ≤3 sub-questions, so ~1.4 s in the worst case), which is acceptable on the multi-hop branch where the user has already accepted higher latency in exchange for a complete answer. |
Architecture
Query decomposition sits at Stage 1b of the pipeline, between intent classification and retrieval:
Stage 1 — Heuristic gate (zero cost)
A query passes the heuristic gate as single-hop if both of:
- 6 words or fewer, AND
- The intent classifier extracted at most one entity type (campus, department, condition, treatment, examination, doctor)
For ~85 % of pilot traffic the gate accepts immediately, saving ~200 ms and ~$0.0001 per query. The gate accepts both the raw query and the ExtractedEntities from the upstream classifier.
Stage 2 — LLM gate (structured_call helper)
Queries that do not pass the heuristic gate are sent to a Tier 2 model (default: gpt-4.1-mini) through structured_call(output_model=DecompositionOutput). The helper enforces a JSON schema:
class DecompositionOutput(BaseModel):
multi_hop: bool
sub_questions: list[str] = Field(default_factory=list)
with retries — on a validation failure the helper re-prompts the model before raising StructuredCallError. Only that exception triggers the legacy json.loads fallback path, which itself exists only as a defence-in-depth signal (in pilot operation it has not fired).
The prompt instructs the model on multi-hop and single-hop indicators:
Multi-hop indicators (decompose)
- Doctor identification + department / campus / schedule
- Treatment-or-condition to department mapping + doctor listing
- Campus-specific filtering + service availability
- Temporal constraints (schedule, consultation hours) + entity lookup
- Cross-entity comparison ("arts die X doet EN op campus Y werkt")
Single-hop indicators (do NOT decompose)
- Simple entity lookup: "Wie is Dr. Peeters?"
- Single department: "Welke artsen werken bij Cardiologie?"
- Simple condition routing: "Waar kan ik terecht met diabetes?"
- Factual: "Wat zijn de bezoekuren?"
When multi_hop=true the agent returns ≤6 sub-questions (default cap; configurable). Each sub-question must be:
- Written in Dutch (regardless of input language — already reformulated upstream).
- Self-contained — no pronouns or anaphoric references.
- Targeting a single entity type or relationship.
- Ordered by dependency (department before doctors, doctors before schedule).
Example output:
{
"multi_hop": true,
"sub_questions": [
"Welke afdeling van het ZOL biedt knieoperaties aan?",
"Welke artsen werken bij de afdeling Orthopedie?",
"Wanneer consulteren deze artsen op campus Sint-Jan?"
]
}
Stage 3 — Parallel retrieval
Each sub-question fans out to an independent retrieval call via asyncio.gather(). Sub-retrievals share the asyncpg session pool but use distinct sessions, so they truly execute in parallel:
sub_results = await asyncio.gather(*[
self._execute_retrieval(sub_q, intent, entities) for sub_q in sub_questions
])
Wall-clock latency for the retrieval phase scales with the slowest sub-query, not the sum.
Stage 4 — Per-sub-question reranking
Each sub-question's result list is reranked against its own sub-question by the cross-encoder (Jina Reranker v2, with local BGE fallback — see Reranking & Evaluation). Graph results and keyword-rescue chunks are pinned — they pass through reranking unchanged because their relevance is already known structurally.
Reranking is sequential across sub-questions to avoid concurrent reranker API calls — the sequential cost is ~450 ms × N for ≤3 typical sub-questions.
Per-sub-question top-K is ceil(total_top_k / n_sub_questions), with a floor of 3 chunks per sub-question.
Stage 5 — Round-robin interleave + dedup
def interleave(reranked_lists: list[list[Chunk]]) -> list[Chunk]:
"""Take rank-0 from each list, then rank-1, etc. Dedup on chunk_id."""
seen: set[ChunkId] = set()
out: list[Chunk] = []
max_len = max(len(lst) for lst in reranked_lists)
for i in range(max_len):
for lst in reranked_lists:
if i < len(lst) and lst[i].id not in seen:
seen.add(lst[i].id)
out.append(lst[i])
return out
Round-robin guarantees every sub-question's best chunk reaches the top of the merged pool. Without it, a sub-question with three near-tied top chunks could push a different sub-question's top chunk past the assembled-context budget. After dedup the merged pool is handed to context assembly.
Configuration
Three settings in backend/app/config.py, all hot-toggleable from the admin Settings UI:
| Setting | Default | Range | Description |
|---|---|---|---|
query_decomposition_enabled | false | bool | Master feature flag. When false, the service is not instantiated and adds zero overhead. |
query_decomposition_model | gpt-4.1-mini | model name | Tier 2 model used for the LLM gate (no provider prefix; uses RAG_LLM_PROVIDER for routing). |
query_decomposition_max_subquestions | 6 | 2 – 8 | Upper bound on sub-questions per query. |
Fault tolerance
The service is designed to never break the pipeline:
| Failure mode | Behaviour |
|---|---|
structured_call validation failure | Retry on malformed output; on exhaustion raise StructuredCallError. |
StructuredCallError raised | Fall back to legacy json.loads path; if that also fails, fall back to single-hop with the original query. |
| LLM timeout / network error | Fall back to single-hop. |
multi_hop=true with empty sub_questions | Treat as single-hop. |
Sub-question count exceeds max_subquestions | Truncate to the cap. |
| Sub-retrieval fails for one sub-question | Continue with the remaining sub-questions; the failed one contributes no chunks. |
Feature flag false | Service not instantiated; zero overhead. |
Cost and latency
| Query class (rough share of pilot traffic) | Additional cost | Additional latency |
|---|---|---|
| Single-hop accepted by heuristic gate (~85 %) | $0.000 | 0 ms |
| Single-hop confirmed by LLM gate (~10 %) | ~$0.0001 | ~200 ms |
| Multi-hop (~5 %) | ~$0.0001 + N × sub-retrieval cost | ~200 ms + slowest-sub-retrieval |
Wall-clock latency on the multi-hop branch is dominated by the slowest sub-retrieval rather than the sum of all sub-retrievals (the asyncio.gather parallelism), and by the sequential per-sub-question reranking. Cost tracking writes the query_decomposition stage to CostTracker per call with prompt and completion token counts.
A/B experiment results
A 121-question A/B run on the golden evaluation set compared Vector-Only baseline vs Hybrid + decomposition enabled at the time the feature shipped:
| Metric | Vector-only (A) | Hybrid + decomp (B) | Δ |
|---|---|---|---|
| Entity recall | 0.881 | 0.915 | +3.4 % |
By category (largest improvements):
| Category | n | Vector-only | Hybrid + decomp | Δ |
|---|---|---|---|---|
multi_hop_graph | 18 | 0.806 | 0.880 | +7.4 % |
multilingual | 8 | 0.812 | 1.000 | +18.8 % |
service_info | 8 | 0.750 | 0.938 | +18.8 % |
taxonomy_alias | 6 | 0.833 | 0.917 | +8.3 % |
practical_info | 9 | 0.944 | 1.000 | +5.6 % |
By graph hops required:
| Hops | n | Vector-only | Hybrid + decomp | Δ |
|---|---|---|---|---|
| 0 (simple) | 17 | 0.971 | 0.971 | 0.000 |
| 1 | 23 | 0.826 | 0.891 | +6.5 % |
| 2 | 16 | 0.771 | 0.865 | +9.4 % |
| 3 | 7 | 0.810 | 0.857 | +4.8 % |
The pattern is exactly the one the design predicts:
- 0-hop questions show zero delta. The heuristic gate shields simple queries from any overhead or regression.
- 2-hop questions benefit most (+9.4 %). These are precisely the queries that need independent evidence chains.
multi_hop_graphcategory +7.4 %. The category the feature was built for.
A pilot-corpus re-run after the BGE-M3 to text-embedding-3-large migration (ADR-0048) is not yet measured; the pre-migration deltas above are retained as the historical record.
Worked example
Query: "Welke arts op campus Sint-Jan doet knieoperaties en wanneer consulteert hij?"
Step 1 — Heuristic gate. 11 words; entities campus=Sint-Jan and treatment=knieoperatie (two entity types). Gate routes to LLM.
Step 2 — LLM gate (structured_call helper). multi_hop=true with three sub-questions covering department, doctors, and schedule (see the JSON example in Stage 2 above).
Step 3 — Parallel retrieval (asyncio.gather):
- Sub-Q1: graph finds Orthopedie via OFFERS relationship; vector finds the knee-surgery brochure.
- Sub-Q2: graph finds three orthopedic surgeons via WORKS_IN relationship; vector finds doctor brochure pages.
- Sub-Q3: vector finds the Orthopedie consultation-schedule page for campus Sint-Jan.
Step 4 — Per-sub-question reranking + Step 5 — Round-robin interleave. The merged pool contains 12 unique chunks covering department + doctors + schedule, with the consultation-schedule chunk in position 1 of its sub-question's reranked list — so it survives interleaving instead of being dominated by the higher-volume orthopedic-surgeon chunks.
Result. The LLM receives complete context for all three aspects and emits a single answer covering department, surgeon names, and consultation times, with citations to each.
Implementation pointers
| Component | File | What's there |
|---|---|---|
| Service | backend/app/services/query_decomposition_service.py | Heuristic gate, structured_call LLM gate, legacy fallback, DecompositionResult dataclass |
| Pipeline integration | backend/app/services/rag_service.py | Lazy import + cost tracking + parallel sub-retrieval orchestration |
| Per-sub-question reranking | RAGService._rerank_multi_hop() | Sequential rerank across sub-questions, round-robin interleave, pin graph results |
| Configuration | backend/app/config.py (lines ~245–250) | Three settings with defaults and validation |
| Settings API | backend/app/api/settings.py | Hot toggle from the admin Settings page |
| Unit tests | backend/tests/unit/services/test_query_decomposition.py | Heuristic gate, prompt building, Agent retry, fallback parser, dataclass round-trip |
| Rerank tests | backend/tests/unit/services/test_rag_service_unit.py::TestRerankMultiHop | Round-robin interleave, dedup, pinned graph chunks, empty-sub-question handling |
Timing metrics emitted for observability:
decomposition_ms— time spent in the decomposition service.multi_hop_sub_questions— number of sub-questions generated.multi_hop_chunks_before_merge— total chunks before deduplication.
References
- Khattab, O., et al. (2022). Demonstrate-Search-Predict. arXiv 2212.14024. — Composable retrieve-then-search-then-predict pattern; the conceptual lineage of decomposition-style retrieval.
- Trivedi, H., et al. (2023). Interleaving Retrieval with Chain-of-Thought Reasoning for Knowledge-Intensive Multi-Step Questions. ACL 2023. — Multi-hop QA via decomposition + interleaved retrieval.
- Min, S., et al. (2019). Multi-hop Reading Comprehension through Question Decomposition and Rescoring. ACL 2019. — Foundational work on question decomposition for multi-hop reasoning.
- The
structured_callhelper (app.llm.structured) — schema-validated LLM output with retries and a typedStructuredCallError; replaced a trialed Pydantic AI Agent (removed 2026-05-12, see Decision-Cost Rubric). - Anthropic. (2024). Introducing Contextual Retrieval.