Query Decomposition for Multi-Hop Retrieval

Single-shot dense retrieval embeds an entire question into one vector and competes that vector against the corpus. For questions that compose multiple independent evidence chains — which department offers procedure P? + which doctors work in that department? + when do those doctors consult on campus C? — a single embedding cannot represent the conjunction faithfully. Empirically, the model picks two of the three aspects, retrieves chunks that satisfy those two, and either hallucinates or silently omits the third aspect.

Query decomposition (see ADR-0032) breaks multi-hop queries into focused single-hop sub-questions, retrieves each in parallel, and merges the evidence before the LLM sees the context. This page documents the algorithm, the LLM prompt, the trade-off matrix that shaped the design, the empirical results from the A/B run, and the worked example.

The problem in one query

Consider:

"Welke arts op campus Sint-Jan doet knieoperaties en wanneer consulteert hij?"

Which doctor on campus Sint-Jan does knee operations, and when does he consult?

A faithful answer requires three independent lookups:

Which department offers knee surgery? (treatment → department)
Which doctors work in that department? (department → doctors)
When do those doctors consult on campus Sint-Jan? (doctors + campus → schedule)

Without decomposition, vector search blends all three aspects into one embedding. The retrieved chunks typically cover Orthopedie and a handful of orthopedic surgeons, but consultation-schedule chunks are not in the top-K because the embedding's "schedule on Sint-Jan" component is diluted by the dominant "knee surgery" component. The LLM either hallucinates the schedule or quietly omits it. Decomposition fixes this at retrieval time, before the LLM is involved.

Trade-offs

Decision	Chosen	Alternatives considered	Rejected because
When to decompose	LLM gate on every query that fails the cheap heuristic gate	Always decompose; never decompose; rule-based regex classifier	Always-decompose blows the per-query latency budget for the ~85 % of single-hop queries (cost: ~200 ms + ~$0.0001 each). Never-decompose loses the multi-hop case the feature exists to solve. A regex classifier was tried first and missed conjunctive forms across Dutch, English, French, and Italian patient input — patterns multiplied without ever achieving acceptable recall. The two-stage gate (heuristic + LLM) gives free pass-through for obviously simple queries and structured-output reliability where it matters.
Output validation	`structured_call(output_model=DecompositionOutput)` with retries; raw `json.loads` retained as fallback only on `StructuredCallError`	Raw OpenAI client + `json.loads`; LangChain output parsers; Pydantic AI Agent	Raw `json.loads` shipped malformed outputs to downstream code without retry, which is exactly the silent-failure pattern the project banned in 2026-05 (see CLAUDE.md silent-failure discipline). LangChain pulled in a heavier dependency surface than its runtime value justified. A Pydantic AI Agent was trialed but removed 2026-05-12 (+720 ms/call). The `structured_call` helper enforces the schema, retries on validation failure, and only triggers the legacy fallback when all retries are exhausted.
Sub-question reranking	Per-sub-question reranking + round-robin interleave	Merge all sub-question results, rerank against the original blended query	The merge-then-rerank approach silently dropped minority topics: chunks about rolstoelen scored poorly against a 5-topic query dominated by afspraak and parking, even when the chunks existed in the corpus. Per-sub-question reranking scores each chunk against its own sub-question; round-robin interleaving guarantees every topic's best chunk reaches the top of the assembled context.
Sub-question concurrency	Parallel `asyncio.gather()` for sub-retrieval; sequential reranking	Fully parallel (retrieval + rerank); fully sequential	Concurrent reranker calls trip API rate limits or — when the API fails — trigger parallel local-model fallback downloads that deadlock on the GPU lock. Sequential reranking is N × ~450 ms (typically ≤3 sub-questions, so ~1.4 s in the worst case), which is acceptable on the multi-hop branch where the user has already accepted higher latency in exchange for a complete answer.

Architecture

Query decomposition sits at Stage 1b of the pipeline, between intent classification and retrieval:

Stage 1 — Heuristic gate (zero cost)

A query passes the heuristic gate as single-hop if both of:

6 words or fewer, AND
The intent classifier extracted at most one entity type (campus, department, condition, treatment, examination, doctor)

For ~85 % of pilot traffic the gate accepts immediately, saving ~200 ms and ~$0.0001 per query. The gate accepts both the raw query and the ExtractedEntities from the upstream classifier.

Stage 2 — LLM gate (`structured_call` helper)

Queries that do not pass the heuristic gate are sent to a Tier 2 model (default: gpt-4.1-mini) through structured_call(output_model=DecompositionOutput). The helper enforces a JSON schema:

class DecompositionOutput(BaseModel):
    multi_hop: bool
    sub_questions: list[str] = Field(default_factory=list)

with retries — on a validation failure the helper re-prompts the model before raising StructuredCallError. Only that exception triggers the legacy json.loads fallback path, which itself exists only as a defence-in-depth signal (in pilot operation it has not fired).

The prompt instructs the model on multi-hop and single-hop indicators:

Multi-hop indicators (decompose)

Doctor identification + department / campus / schedule
Treatment-or-condition to department mapping + doctor listing
Campus-specific filtering + service availability
Temporal constraints (schedule, consultation hours) + entity lookup
Cross-entity comparison ("arts die X doet EN op campus Y werkt")

Single-hop indicators (do NOT decompose)

Simple entity lookup: "Wie is Dr. Peeters?"
Single department: "Welke artsen werken bij Cardiologie?"
Simple condition routing: "Waar kan ik terecht met diabetes?"
Factual: "Wat zijn de bezoekuren?"

When multi_hop=true the agent returns ≤6 sub-questions (default cap; configurable). Each sub-question must be:

Written in Dutch (regardless of input language — already reformulated upstream).
Self-contained — no pronouns or anaphoric references.
Targeting a single entity type or relationship.
Ordered by dependency (department before doctors, doctors before schedule).

Example output:

{
  "multi_hop": true,
  "sub_questions": [
    "Welke afdeling van het ZOL biedt knieoperaties aan?",
    "Welke artsen werken bij de afdeling Orthopedie?",
    "Wanneer consulteren deze artsen op campus Sint-Jan?"
  ]
}

Stage 3 — Parallel retrieval

Each sub-question fans out to an independent retrieval call via asyncio.gather(). Sub-retrievals share the asyncpg session pool but use distinct sessions, so they truly execute in parallel:

sub_results = await asyncio.gather(*[
    self._execute_retrieval(sub_q, intent, entities) for sub_q in sub_questions
])

Wall-clock latency for the retrieval phase scales with the slowest sub-query, not the sum.

Stage 4 — Per-sub-question reranking

Each sub-question's result list is reranked against its own sub-question by the cross-encoder (Jina Reranker v2, with local BGE fallback — see Reranking & Evaluation). Graph results and keyword-rescue chunks are pinned — they pass through reranking unchanged because their relevance is already known structurally.

Reranking is sequential across sub-questions to avoid concurrent reranker API calls — the sequential cost is ~450 ms × N for ≤3 typical sub-questions.

Per-sub-question top-K is ceil(total_top_k / n_sub_questions), with a floor of 3 chunks per sub-question.

Stage 5 — Round-robin interleave + dedup

def interleave(reranked_lists: list[list[Chunk]]) -> list[Chunk]:
    """Take rank-0 from each list, then rank-1, etc. Dedup on chunk_id."""
    seen: set[ChunkId] = set()
    out: list[Chunk] = []
    max_len = max(len(lst) for lst in reranked_lists)
    for i in range(max_len):
        for lst in reranked_lists:
            if i < len(lst) and lst[i].id not in seen:
                seen.add(lst[i].id)
                out.append(lst[i])
    return out

Round-robin guarantees every sub-question's best chunk reaches the top of the merged pool. Without it, a sub-question with three near-tied top chunks could push a different sub-question's top chunk past the assembled-context budget. After dedup the merged pool is handed to context assembly.

Configuration

Three settings in backend/app/config.py, all hot-toggleable from the admin Settings UI:

Setting	Default	Range	Description
`query_decomposition_enabled`	`false`	bool	Master feature flag. When false, the service is not instantiated and adds zero overhead.
`query_decomposition_model`	`gpt-4.1-mini`	model name	Tier 2 model used for the LLM gate (no provider prefix; uses `RAG_LLM_PROVIDER` for routing).
`query_decomposition_max_subquestions`	`6`	2 – 8	Upper bound on sub-questions per query.

Fault tolerance

The service is designed to never break the pipeline:

Failure mode	Behaviour
`structured_call` validation failure	Retry on malformed output; on exhaustion raise `StructuredCallError`.
`StructuredCallError` raised	Fall back to legacy `json.loads` path; if that also fails, fall back to single-hop with the original query.
LLM timeout / network error	Fall back to single-hop.
`multi_hop=true` with empty `sub_questions`	Treat as single-hop.
Sub-question count exceeds `max_subquestions`	Truncate to the cap.
Sub-retrieval fails for one sub-question	Continue with the remaining sub-questions; the failed one contributes no chunks.
Feature flag `false`	Service not instantiated; zero overhead.

Cost and latency

Query class (rough share of pilot traffic)	Additional cost	Additional latency
Single-hop accepted by heuristic gate (~85 %)	$0.000	0 ms
Single-hop confirmed by LLM gate (~10 %)	~$0.0001	~200 ms
Multi-hop (~5 %)	~$0.0001 + N × sub-retrieval cost	~200 ms + slowest-sub-retrieval

Wall-clock latency on the multi-hop branch is dominated by the slowest sub-retrieval rather than the sum of all sub-retrievals (the asyncio.gather parallelism), and by the sequential per-sub-question reranking. Cost tracking writes the query_decomposition stage to CostTracker per call with prompt and completion token counts.

A/B experiment results

A 121-question A/B run on the golden evaluation set compared Vector-Only baseline vs Hybrid + decomposition enabled at the time the feature shipped:

Metric	Vector-only (A)	Hybrid + decomp (B)	Δ
Entity recall	0.881	0.915	+3.4 %

By category (largest improvements):

Category	n	Vector-only	Hybrid + decomp	Δ
`multi_hop_graph`	18	0.806	0.880	+7.4 %
`multilingual`	8	0.812	1.000	+18.8 %
`service_info`	8	0.750	0.938	+18.8 %
`taxonomy_alias`	6	0.833	0.917	+8.3 %
`practical_info`	9	0.944	1.000	+5.6 %

By graph hops required:

Hops	n	Vector-only	Hybrid + decomp	Δ
0 (simple)	17	0.971	0.971	0.000
1	23	0.826	0.891	+6.5 %
2	16	0.771	0.865	+9.4 %
3	7	0.810	0.857	+4.8 %

The pattern is exactly the one the design predicts:

0-hop questions show zero delta. The heuristic gate shields simple queries from any overhead or regression.
2-hop questions benefit most (+9.4 %). These are precisely the queries that need independent evidence chains.
multi_hop_graph category +7.4 %. The category the feature was built for.

A pilot-corpus re-run after the BGE-M3 to text-embedding-3-large migration (ADR-0048) is not yet measured; the pre-migration deltas above are retained as the historical record.

Worked example

Query: "Welke arts op campus Sint-Jan doet knieoperaties en wanneer consulteert hij?"

Step 1 — Heuristic gate. 11 words; entities campus=Sint-Jan and treatment=knieoperatie (two entity types). Gate routes to LLM.

Step 2 — LLM gate (structured_call helper). multi_hop=true with three sub-questions covering department, doctors, and schedule (see the JSON example in Stage 2 above).

Step 3 — Parallel retrieval (asyncio.gather):

Sub-Q1: graph finds Orthopedie via OFFERS relationship; vector finds the knee-surgery brochure.
Sub-Q2: graph finds three orthopedic surgeons via WORKS_IN relationship; vector finds doctor brochure pages.
Sub-Q3: vector finds the Orthopedie consultation-schedule page for campus Sint-Jan.

Step 4 — Per-sub-question reranking + Step 5 — Round-robin interleave. The merged pool contains 12 unique chunks covering department + doctors + schedule, with the consultation-schedule chunk in position 1 of its sub-question's reranked list — so it survives interleaving instead of being dominated by the higher-volume orthopedic-surgeon chunks.

Result. The LLM receives complete context for all three aspects and emits a single answer covering department, surgeon names, and consultation times, with citations to each.

Implementation pointers

Component	File	What's there
Service	`backend/app/services/query_decomposition_service.py`	Heuristic gate, `structured_call` LLM gate, legacy fallback, `DecompositionResult` dataclass
Pipeline integration	`backend/app/services/rag_service.py`	Lazy import + cost tracking + parallel sub-retrieval orchestration
Per-sub-question reranking	`RAGService._rerank_multi_hop()`	Sequential rerank across sub-questions, round-robin interleave, pin graph results
Configuration	`backend/app/config.py` (lines ~245–250)	Three settings with defaults and validation
Settings API	`backend/app/api/settings.py`	Hot toggle from the admin Settings page
Unit tests	`backend/tests/unit/services/test_query_decomposition.py`	Heuristic gate, prompt building, Agent retry, fallback parser, dataclass round-trip
Rerank tests	`backend/tests/unit/services/test_rag_service_unit.py::TestRerankMultiHop`	Round-robin interleave, dedup, pinned graph chunks, empty-sub-question handling

Timing metrics emitted for observability:

decomposition_ms — time spent in the decomposition service.
multi_hop_sub_questions — number of sub-questions generated.
multi_hop_chunks_before_merge — total chunks before deduplication.

References

Khattab, O., et al. (2022). Demonstrate-Search-Predict. arXiv 2212.14024. — Composable retrieve-then-search-then-predict pattern; the conceptual lineage of decomposition-style retrieval.
Trivedi, H., et al. (2023). Interleaving Retrieval with Chain-of-Thought Reasoning for Knowledge-Intensive Multi-Step Questions. ACL 2023. — Multi-hop QA via decomposition + interleaved retrieval.
Min, S., et al. (2019). Multi-hop Reading Comprehension through Question Decomposition and Rescoring. ACL 2019. — Foundational work on question decomposition for multi-hop reasoning.
The structured_call helper (app.llm.structured) — schema-validated LLM output with retries and a typed StructuredCallError; replaced a trialed Pydantic AI Agent (removed 2026-05-12, see Decision-Cost Rubric).
Anthropic. (2024). Introducing Contextual Retrieval.

The problem in one query​

Trade-offs​

Architecture​

Stage 1 — Heuristic gate (zero cost)​

Stage 2 — LLM gate (structured_call helper)​

Stage 3 — Parallel retrieval​

Stage 4 — Per-sub-question reranking​

Stage 5 — Round-robin interleave + dedup​

Configuration​

Fault tolerance​

Cost and latency​

A/B experiment results​

Worked example​

Implementation pointers​

References​