Decision-Cost Rubric (Multi-Dimensional Change Evaluation)
What: Every architectural change, dependency adoption, or pattern shift must be evaluated against six axes before adoption. The rubric is not optional. The rubric is not a sticker. A proposal that addresses one axis (e.g., "this improves consistency") and does not address the other five is incomplete, not "lighter weight." Where an axis cannot be quantified at the proposal time, the proposal explicitly states "not measured because…" with a rationale that survives sober review.
The six axes:
| Axis | The question the proposal must answer |
|---|---|
| Latency | Estimated p50/p95 impact per call. If not measured, why is that safe to defer? |
| Dependency surface | New packages pulled in, transitive deps, lines we own vs. lines we depend on. What happens when an upstream yanks a release? |
| Debuggability | When this fails at 3am in production, what does the stack trace look like? Can a new engineer fix it? Are the failure modes observable? |
| Reversibility | Hours to undo if wrong. Single config flip vs. an 8-file refactor vs. a database migration. |
| Blast radius | How many code paths are affected. Is the change additive (a new helper alongside existing code) or substitutive (replaces the existing path everywhere)? |
| Alternative considered | At least one credible alternative + a one-sentence "why rejected." If you cannot name an alternative, the proposal has not been thought through. |
Why: The default failure mode in proposal-driven development is single-dimensional evaluation. The proposer (human or AI) names one benefit — consistency, type safety, developer experience, "industry standard" — and the conversation evaluates the proposal against that single dimension. The other five dimensions never enter the discussion. The proposal is adopted. The hidden costs surface in production months later, by which time the cost of reversal is multiple orders of magnitude higher than the cost of asking the question at brainstorm.
In AI-human collaboration the failure mode is amplified by the AI's training-time gradients toward agreement. When the human proposes X with framing Y, the AI optimizes against Y. This is structural, not a politeness bug. The fix is not "the AI should try harder to disagree" — willpower against a gradient is unreliable. The fix is a procedural rubric that makes "did I evaluate axis X" a checkable condition rather than a feeling.
Evidence: One case in the ZOL Hospital Intelligent Search project (2026-05-09 → 2026-05-12) demonstrates the rubric's value in stark terms.
In May 2026 the project adopted pydantic-ai across eight LLM call sites (intent classification, query decomposition, conversation classifier, feedback investigation, voice turn evaluator, two diagnostic-v2 services, and admin feedback digest). The proposal was framed as a single-dimension win: "world-class consistency" across structured-LLM call sites + automatic validation-error retry. The AI agent and human collaborator both evaluated against that single dimension; both agreed; the adoption proceeded.
Three days later, a measurement endpoint added during a latency-optimization session (the /admin/ops/latency-percentiles aggregator over pipeline_telemetry) revealed that intent classification had a p50 latency of 2,454 ms — three times the original research estimate. Direct OpenAI timing experiments decomposed the 2.4 seconds into ~1,750 ms of raw model generation (driven by an output schema with nine fields) plus ~720 ms of pydantic-ai Agent.run() framework overhead per call. Multiplied across eight call sites, the framework cost ~5.6 seconds of cumulative latency on a worst-case turn.
The framework was removed and replaced with a ~190-line thin helper that preserved every load-bearing behavior (Pydantic-model validation with field validators, retry-on-validation-error with feedback fed back to the model, fallback to legacy path on exhausted retries). The output schema was also slimmed where downstream code overrode or ignored LLM-emitted fields. Total latency reduction on the intent classification stage: ~55% measured against the pre-removal baseline. Total engineering cost of the migration and the reversal: roughly four engineer-hours plus a Docker-image rebuild cycle.
Had the Decision-Cost Rubric been applied at the original proposal, the question "Latency: estimated p50/p95 impact per call" would have been on the table. The estimate at that time might have been wrong by a factor of two or three, but the act of producing an estimate would have forced a baseline measurement. The 720 ms tax would have been visible at adoption, instead of days later in production measurement.
How: The rubric integrates into the development lifecycle at two enforcement points.
At brainstorm phase (high leverage): When a proposal triggers the Brainstorm Gate (Section 3.1), the gate cannot be cleared until a Pre-Mortem Block is emitted that addresses all six axes plus three reflection fields (strongest risk, what would change my mind, confidence). The visible structure of the block makes its absence detectable to the human collaborator in real time. The block format and trigger conditions are documented in Section 3.1.
At commit/ADR phase (audit trail): Load-bearing architectural commits and ADRs include a Decision context: block that records which axes were considered and what the estimates were. This is not "audit theater" — it is the artifact that the consolidation review (Section 2.8) reads against actual outcomes to detect when an axis was systematically underestimated. The pydantic-ai commits did not have this block. That absence is the root cause of why the latency tax stayed invisible until the measurement endpoint landed.
Anti-pattern to defend against: The rubric becomes ceremonial when proposers fill in the six axes with stickered answers ("performance: likely fine", "alternatives: nothing comparable") to clear the gate, then proceed with the predetermined conclusion. Three defenses are baked into the Pre-Mortem Block format:
- "Strongest risk I see" demands a specific failure mode with a named system component. Generic answers ("complexity") do not pass.
- "What would change my mind" demands a falsifiable signal — a measurement, a benchmark threshold, a user report. "If it turns out to be slow" does not pass; "if
intent_classificationp50 stays above 1,200 ms after a week on pilot" passes. - The Pre-Mortem Block becomes a load-bearing reference that the post-decision audit reads when the decision underperforms. The rubric improves over time because failed decisions surface which axis was systematically underweighted.