ADR-0013: Reasoning Model Token Budget
Date: 2026-02-07 | Status: Accepted (Superseded)
The previous reasoning model was replaced by the Tier 2 (standard) model for all classification tasks. The current Tier 2 model is not a reasoning model, so the token budget issue described in this ADR no longer applies. Standard max_tokens values (250-500) work correctly. This ADR is preserved as a historical reference.
Discovery and fix for intent classification failures caused by insufficient token budget for the former reasoning model's overhead.
Context
Intent classification began returning unknown with 0.0 confidence for every query, effectively breaking the entire pipeline's routing logic. All queries fell back to the HYBRID retrieval strategy regardless of content, and no safety blocking occurred for medical advice queries.
Investigation
The root cause was traced to the interaction between max_tokens budget and the former reasoning model's behavior:
- The former classification model was a reasoning model: Unlike standard completion models, it used hidden "thinking" tokens before producing visible output. These reasoning tokens count against the max_tokens budget but are not included in the response content.
- Original budget was 250-300 tokens: This was sized for the expected JSON output (approximately 50-100 tokens for an intent classification response).
- Reasoning overhead needed ~500-600 tokens: The model's internal reasoning about Dutch query classification, intent categories, and confidence scoring used approximately 500-600 tokens of thinking before generating any output.
- Result: empty output content: With only 250-300 tokens allocated and reasoning alone needing ~500-600, the model hit the token limit before it could finish its visible output. The API returned finish_reason: "length" (truncated) instead of "stop" (complete), with empty or partial content.
- Downstream failure: The empty content caused a JSON parse exception in _parse_classification_response(), which caught the error and returned the default IntentClassificationResult(intent=UserIntent.UNKNOWN, confidence=0.0).
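The failure path above can be reproduced in a few lines. This is a minimal sketch, not the project's actual code: the names UserIntent and IntentClassificationResult come from this ADR, but the JSON field names ("intent", "confidence"), the DEPARTMENT_LOOKUP member, the dict shape of the API choice, and the body of the parser are assumptions made for illustration.

```python
import json
from dataclasses import dataclass
from enum import Enum


class UserIntent(Enum):
    UNKNOWN = "unknown"
    DEPARTMENT_LOOKUP = "department_lookup"  # assumed member, for illustration


@dataclass
class IntentClassificationResult:
    intent: UserIntent
    confidence: float


def parse_classification_response(content):
    """Hypothetical stand-in for _parse_classification_response()."""
    try:
        payload = json.loads(content)
        return IntentClassificationResult(
            intent=UserIntent(payload["intent"]),
            confidence=float(payload["confidence"]),
        )
    except (json.JSONDecodeError, KeyError, ValueError, TypeError):
        # Empty or malformed content silently collapses to the default,
        # which is indistinguishable from a genuine "unknown" result.
        return IntentClassificationResult(UserIntent.UNKNOWN, 0.0)


# Simulated API choices (assumed shape): one truncated, one complete.
truncated_choice = {"finish_reason": "length", "message": {"content": ""}}
complete_choice = {
    "finish_reason": "stop",
    "message": {"content": '{"intent": "department_lookup", "confidence": 0.9}'},
}

truncated_result = parse_classification_response(truncated_choice["message"]["content"])
complete_result = parse_classification_response(complete_choice["message"]["content"])
```

In this sketch the parser never looks at finish_reason, so the truncated response yields the same unknown/0.0 result as any other parse failure, and nothing upstream can tell the difference.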
Symptoms Observed
- All queries classified as unknown with 0.0 confidence
- No out_of_scope_medical_advice safety blocks triggered
- All queries routed to HYBRID (fallback strategy) regardless of content
- API responses showed finish_reason: "length" instead of "stop"
Decision
Increase max_tokens from 250/300 to 2000 for all reasoning model classification calls.
| Parameter | Before | After | Rationale |
|---|---|---|---|
| max_tokens | 250–300 | 2000 | Accommodate reasoning overhead (~500 tokens) |
| stream | false | false | Classification requires complete JSON response |
| temperature | 0.0 | 0.0 | Deterministic classification for safety |
The budget of 2000 tokens provides ample headroom:
- ~500-600 tokens for reasoning overhead
- ~50-100 tokens for the JSON classification output
- ~1300-1400 tokens of safety margin for complex queries with long conversation history
This applies to both the standalone classify_intent() and the combined classify_and_rewrite() methods.
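The headroom arithmetic can be checked directly. The sketch below uses only the approximate figures quoted in this ADR; the helper name headroom() is hypothetical.

```python
# Approximate figures from this ADR (all values in tokens).
OLD_MAX_TOKENS = 300
NEW_MAX_TOKENS = 2000
REASONING_RANGE = (500, 600)  # hidden reasoning overhead
OUTPUT_RANGE = (50, 100)      # visible JSON classification output


def headroom(max_tokens, reasoning, output):
    """Tokens left after reasoning and output; negative means truncation."""
    return max_tokens - reasoning - output


# Old budget: negative even in the best case, so output was always cut off.
old_best_case = headroom(OLD_MAX_TOKENS, REASONING_RANGE[0], OUTPUT_RANGE[0])   # -250

# New budget: margin between the heaviest and lightest expected call.
new_worst_case = headroom(NEW_MAX_TOKENS, REASONING_RANGE[1], OUTPUT_RANGE[1])  # 1300
new_best_case = headroom(NEW_MAX_TOKENS, REASONING_RANGE[0], OUTPUT_RANGE[0])   # 1450
```

The worst case leaves 1300 tokens and the best case 1450, in line with the roughly 1300-1400 tokens of margin quoted above. Under the old budget the headroom is negative even in the best case, which is why every call was truncated.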
Consequences
Positive
- Intent classification now returns correct intents with proper confidence scores
- Safety blocking for out_of_scope_medical_advice queries is restored
- finish_reason is consistently "stop" (complete output)
- No measurable latency impact (the model still produces the same amount of reasoning; it is simply no longer truncated)
Negative
- Higher theoretical maximum cost per classification call (up to 2000 output tokens vs. 300), though actual token usage remains ~600-700 tokens total
- The 2000-token budget is empirically determined and may need adjustment if OpenAI changes the model's reasoning behavior
Lessons Learned
- Reasoning models require larger token budgets than their output size suggests. When using reasoning models that perform chain-of-thought reasoning, max_tokens must account for both hidden reasoning tokens and visible output.
- Silent failures in classification are dangerous in safety-critical systems. The fallback to unknown/0.0 masked a complete classification failure that disabled safety blocking. Consider adding explicit monitoring for classification failure rates.
- finish_reason is a critical signal: Checking for "length" vs. "stop" would have immediately revealed the truncation issue.
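One way to act on the last two lessons is to fail loudly on truncation and to track the failure rate over a sliding window. This is a sketch under stated assumptions, not the project's implementation: ClassificationFailure, extract_content(), FailureRateMonitor, the dict shape of the API choice, and the window and threshold values are all hypothetical.

```python
from collections import deque


class ClassificationFailure(RuntimeError):
    """Raised when the model response cannot be a complete classification."""


def extract_content(choice):
    """Return the response text, rejecting truncated or empty responses."""
    finish_reason = choice.get("finish_reason")
    if finish_reason == "length":
        raise ClassificationFailure("response truncated: token budget exhausted")
    content = (choice.get("message") or {}).get("content") or ""
    if not content.strip():
        raise ClassificationFailure(f"empty content (finish_reason={finish_reason!r})")
    return content


class FailureRateMonitor:
    """Tracks the share of failed classifications over the last `window` calls."""

    def __init__(self, window=100, threshold=0.2):
        self._outcomes = deque(maxlen=window)  # True means the call failed
        self.threshold = threshold

    def record(self, failed):
        self._outcomes.append(bool(failed))

    @property
    def failure_rate(self):
        if not self._outcomes:
            return 0.0
        return sum(self._outcomes) / len(self._outcomes)

    def should_alert(self):
        return self.failure_rate >= self.threshold


# Usage: record each call's outcome and alert when failures dominate.
monitor = FailureRateMonitor(window=5, threshold=0.5)
for choice in (
    {"finish_reason": "length", "message": {"content": ""}},
    {"finish_reason": "length", "message": {"content": ""}},
    {"finish_reason": "stop", "message": {"content": '{"intent": "unknown"}'}},
):
    try:
        extract_content(choice)
        monitor.record(failed=False)
    except ClassificationFailure:
        monitor.record(failed=True)
```

Raising a dedicated exception keeps a truncated response from being folded into the same unknown/0.0 default as a genuinely unclassifiable query. The caller can still fall back to a safe default, but only after the failure has been counted.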
Verification
After increasing max_tokens to 2000:
- Query "Welke dokters werken bij cardiologie?" returns department_lookup with confidence > 0.7
- Query "Ik heb hoofdpijn, wat moet ik nemen?" returns out_of_scope_medical_advice (safety block)
- API responses show finish_reason: "stop" consistently
- Actual token usage is ~600-700 per call (well within the 2000 budget)