ADR-0013: Reasoning Model Token Budget

Date: 2026-02-07 | Status: Accepted (Superseded)

Model Superseded

The previous reasoning model was replaced by the Tier 2 (standard) model for all classification tasks. The current Tier 2 model is not a reasoning model, so the token budget issue described in this ADR no longer applies. Standard max_tokens values (250-500) work correctly. This ADR is preserved as a historical reference.

Discovery and fix for intent classification failures caused by insufficient token budget for the former reasoning model's overhead.

Context

Intent classification began returning unknown with 0.0 confidence for every query, effectively breaking the entire pipeline's routing logic. All queries fell back to the HYBRID retrieval strategy regardless of content, and no safety blocking occurred for medical advice queries.

Investigation

The root cause was traced to the interaction between max_tokens budget and the former reasoning model's behavior:

The former classification model was a reasoning model: Unlike standard completion models, it uses hidden "thinking" tokens before producing visible output. These reasoning tokens count against the max_tokens budget but are not included in the response content.
Original budget was 250-300 tokens: This was sized for the expected JSON output (approximately 50-100 tokens for an intent classification response).
Reasoning overhead consumed ~500-600 tokens: The model's internal reasoning about Dutch query classification, intent categories, and confidence scoring used approximately 500-600 tokens of thinking before generating any output.
Result: empty output content: With only 250-300 tokens allocated and ~500-600 needed for reasoning alone, the model hit the token limit before producing any visible output. The API returned finish_reason: "length" (truncated) instead of "stop" (complete), with empty or partial content.
Downstream failure: The empty content caused a JSON parse exception in _parse_classification_response(), which caught the error and returned the default IntentClassificationResult(intent=UserIntent.UNKNOWN, confidence=0.0).
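
The failure chain above can be illustrated with a minimal sketch. The enum, dataclass, and parser below are hypothetical stand-ins that reuse the names mentioned in this ADR; the real _parse_classification_response() is not shown in this document and may differ in detail.

```python
import json
from dataclasses import dataclass
from enum import Enum


class UserIntent(Enum):
    # Hypothetical subset of intents, for illustration only.
    UNKNOWN = "unknown"
    DEPARTMENT_LOOKUP = "department_lookup"


@dataclass
class IntentClassificationResult:
    intent: UserIntent
    confidence: float


def parse_classification_response(content: str) -> IntentClassificationResult:
    """Parse the model's JSON output, falling back silently on any error."""
    try:
        data = json.loads(content)
        return IntentClassificationResult(
            intent=UserIntent(data["intent"]),
            confidence=float(data["confidence"]),
        )
    except (json.JSONDecodeError, KeyError, ValueError, TypeError):
        # The silent fallback: a truncated (empty) response becomes
        # indistinguishable from a genuinely unclassifiable query.
        return IntentClassificationResult(UserIntent.UNKNOWN, 0.0)


# Empty content, as returned when the whole budget went to hidden reasoning.
truncated = parse_classification_response("")
# A complete response parses normally.
complete = parse_classification_response(
    '{"intent": "department_lookup", "confidence": 0.92}'
)
```

In this sketch, truncated ends up as UNKNOWN with 0.0 confidence without any error surfacing to the caller, which matches the symptom pattern described in this ADR.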

Symptoms Observed

All queries classified as unknown with 0.0 confidence
No out_of_scope_medical_advice safety blocks triggered
All queries routed to HYBRID (fallback strategy) regardless of content
API responses showed finish_reason: "length" instead of "stop"

Decision

Increase max_tokens from 250/300 to 2000 for all reasoning model classification calls.

Parameter	Before	After	Rationale
`max_tokens`	250–300	2000	Accommodate reasoning overhead (~500-600 tokens)
`stream`	false	false	Classification requires complete JSON response
`temperature`	0.0	0.0	Deterministic classification for safety

The budget of 2000 tokens provides ample headroom:

~500-600 tokens for reasoning overhead
~50-100 tokens for the JSON classification output
~1300-1400 tokens of safety margin for complex queries with long conversation history

This applies to both the standalone classify_intent() and the combined classify_and_rewrite() methods.
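
The budget arithmetic can be checked with a short sketch. The figures are the approximate values reported in this ADR; the function names are illustrative and not part of the codebase.

```python
# Approximate figures reported in this ADR (not exact measurements).
REASONING_TOKENS = (500, 600)  # hidden reasoning overhead per call
TOTAL_USAGE = (600, 700)       # observed total usage per call


def visible_budget(max_tokens: int, reasoning_tokens: int) -> int:
    """Tokens left for visible output once hidden reasoning is deducted."""
    return max(0, max_tokens - reasoning_tokens)


def safety_margin(max_tokens: int) -> tuple[int, int]:
    """(worst-case, best-case) unused tokens given the observed total usage."""
    low_usage, high_usage = TOTAL_USAGE
    return max_tokens - high_usage, max_tokens - low_usage


# Old budgets leave nothing for the JSON output even in the best case.
old_visible = [visible_budget(b, REASONING_TOKENS[0]) for b in (250, 300)]
# New budget leaves room for output even with worst-case reasoning.
new_visible = visible_budget(2000, REASONING_TOKENS[1])
margin = safety_margin(2000)
```

Under these assumptions, both old budgets leave zero tokens for visible output, while the 2000-token budget leaves roughly 1300-1400 tokens unused.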

Consequences

Positive

Intent classification now returns correct intents with proper confidence scores
Safety blocking for out_of_scope_medical_advice queries is restored
finish_reason is consistently "stop" (complete output)
No measurable latency impact (the model still produces the same amount of reasoning; it is simply no longer truncated)

Negative

Higher theoretical maximum cost per classification call (up to 2000 billable tokens vs. 300), though actual token usage remains ~600-700 tokens total
The 2000-token budget is empirically determined and may need adjustment if OpenAI changes the model's reasoning behavior

Lessons Learned

Reasoning models require larger token budgets than their output size suggests. When using reasoning models that perform chain-of-thought reasoning, max_tokens must account for both hidden reasoning tokens and visible output.
Silent failures in classification are dangerous in safety-critical systems. The fallback to unknown/0.0 masked a complete classification failure that disabled safety blocking. Consider adding explicit monitoring for classification failure rates.
finish_reason is a critical signal: Checking for "length" vs. "stop" would have immediately revealed the truncation issue.
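
A minimal sketch of how the last two lessons might be applied: fail loudly on truncation instead of falling back silently, and track a rolling failure rate. The response is modeled as a plain dict with a choices[0].finish_reason / message.content layout, which is an assumption about the API's response shape; TruncatedResponseError, extract_content, and FailureRateMonitor are hypothetical names, and the window and threshold values are arbitrary.

```python
from collections import deque


class TruncatedResponseError(RuntimeError):
    """Raised when the model stopped because it hit the token limit."""


def extract_content(response: dict) -> str:
    """Return the visible content, failing loudly on truncation or empty output."""
    choice = response["choices"][0]
    finish_reason = choice.get("finish_reason")
    content = choice.get("message", {}).get("content") or ""
    if finish_reason == "length":
        raise TruncatedResponseError(
            "finish_reason is 'length': max_tokens was exhausted before completion"
        )
    if not content.strip():
        raise ValueError("model returned empty content")
    return content


class FailureRateMonitor:
    """Tracks the share of failed classifications over a rolling window."""

    def __init__(self, window: int = 100, threshold: float = 0.2) -> None:
        self._outcomes = deque(maxlen=window)
        self._threshold = threshold

    def record(self, failed: bool) -> None:
        self._outcomes.append(failed)

    @property
    def failure_rate(self) -> float:
        if not self._outcomes:
            return 0.0
        return sum(self._outcomes) / len(self._outcomes)

    def is_alerting(self) -> bool:
        return self.failure_rate > self._threshold


# Usage: record each call's outcome so a total outage shows up in metrics.
monitor = FailureRateMonitor(window=4, threshold=0.5)
sample = {"choices": [{"finish_reason": "length", "message": {"content": ""}}]}
try:
    extract_content(sample)
    monitor.record(False)
except (TruncatedResponseError, ValueError):
    monitor.record(True)
```

Raising a dedicated exception keeps truncation distinguishable from a genuinely unparseable answer, so the fallback to unknown can still exist while the failure is counted and surfaced.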

Verification

After increasing max_tokens to 2000:

Query "Welke dokters werken bij cardiologie?" returns department_lookup with confidence > 0.7
Query "Ik heb hoofdpijn, wat moet ik nemen?" returns out_of_scope_medical_advice (safety block)
API responses show finish_reason: "stop" consistently
Actual token usage is ~600-700 per call (well within the 2000 budget)
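
The two query checks above could be kept as regression cases. This is a sketch only: the classify callable, the Case type, and both stub classifiers are hypothetical, and a real check would call the actual classifier instead of the stubs.

```python
from typing import Callable, NamedTuple, Optional


class Case(NamedTuple):
    query: str
    expected_intent: str
    min_confidence: Optional[float]  # None: confidence is not checked


# The two verification queries listed in this ADR.
CASES = [
    Case("Welke dokters werken bij cardiologie?", "department_lookup", 0.7),
    Case("Ik heb hoofdpijn, wat moet ik nemen?", "out_of_scope_medical_advice", None),
]


def failed_cases(
    classify: Callable[[str], tuple[str, float]],
    cases: list[Case] = CASES,
) -> list[str]:
    """Return the queries whose classification does not meet expectations."""
    failures = []
    for case in cases:
        intent, confidence = classify(case.query)
        wrong_intent = intent != case.expected_intent
        low_confidence = (
            case.min_confidence is not None and confidence <= case.min_confidence
        )
        if wrong_intent or low_confidence:
            failures.append(case.query)
    return failures


# Stub that mimics the broken behaviour: everything is unknown with 0.0.
def broken_stub(query: str) -> tuple[str, float]:
    return ("unknown", 0.0)


# Stub that mimics the expected behaviour after the fix (values are made up).
def fixed_stub(query: str) -> tuple[str, float]:
    if "cardiologie" in query:
        return ("department_lookup", 0.9)
    return ("out_of_scope_medical_advice", 0.8)


broken_failures = failed_cases(broken_stub)
fixed_failures = failed_cases(fixed_stub)
```

With the broken stub both queries are reported as failures, and with the fixed stub none are.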
