Decision-Cost Rubric (Multi-Dimensional Change Evaluation)

In one line: Evaluate every architectural change against seven axes — a proposal that answers one and is silent on the rest is incomplete, not "lighter weight."

Do this: Before adopting any dependency or pattern shift, fill in all seven axes; where an axis can't be quantified yet, write "not measured because…" with a rationale that survives review.

What: Every architectural change, dependency adoption, or pattern shift must be evaluated against seven axes before adoption. The rubric is not optional and not a sticker. Where an axis cannot be quantified at proposal time, the proposal explicitly states "not measured because…".

The seven axes:

Axis	The question the proposal must answer
Latency	Estimated p50/p95 impact per call. If not measured, why is that safe to defer?
Dependency surface	New packages pulled in, transitive deps, lines we own vs. lines we depend on. What happens when an upstream yanks a release?
Debuggability	When this fails at 3am in production, what does the stack trace look like? Can a new engineer fix it? Are the failure modes observable?
Reversibility	Hours to undo if wrong. Single config flip vs. an 8-file refactor vs. a database migration.
Blast radius	How many code paths are affected. Is the change additive (a new helper alongside existing code) or substitutive (replaces the existing path everywhere)?
Alternative considered	At least one credible alternative + a one-sentence "why rejected." If you cannot name an alternative, the proposal has not been thought through.
Cost (compute/token spend)	Estimated token/compute spend per task or session; when does a premium model pay for itself? If not measured, why is that safe to defer?

Worked Cost example (break-even). A plan dispatches a wave of roughly a dozen near-independent implementation tasks. The question the Cost axis forces: run the wave on a high-capability model tier, or a mid tier at a fraction of the per-token price? The mid tier is cheaper per token, but if its weaker tasks bounce back through review-and-rework, each retry pays the dispatch cost twice and burns an orchestrator review cycle. The premium tier pays for itself once the expected rework rate on the cheaper tier crosses a low threshold — a handful of redo cycles across the wave is enough, because a rework loop costs far more than the per-token delta on a single clean pass. The estimate need not be precise; producing it surfaces whether the work is rework-prone (premium tier) or genuinely mechanical (mid tier is the honest choice). The axis is answered the moment that reasoning is on the table.

A proposal that addresses one axis and is silent on the other six is incomplete, not "lighter weight."

Why: The default failure mode is single-dimensional evaluation. The proposer names one benefit — consistency, type safety, "industry standard" — and the conversation evaluates against that single dimension. The others never come up. The proposal is adopted and the hidden costs surface in production months later, when reversal costs orders of magnitude more than asking the question at brainstorm would have.

With AI the failure is amplified by training-time gradients toward agreement: propose X with framing Y and the AI optimizes against Y. That is structural, not a politeness bug. The fix is not "the AI should disagree harder" — willpower against a gradient is unreliable. The fix is a procedural rubric that makes "did I evaluate axis X" a checkable condition rather than a feeling.

Evidence: The canonical lesson: a framework adopted on a single-dimension "consistency" argument added hundreds of milliseconds of per-call overhead, invisible until a latency-measurement endpoint exposed it days later; the fix was to replace the framework with a thin in-house helper preserving every load-bearing behavior. Had the rubric's "Latency: p50/p95 impact per call" axis been on the table at proposal time, the mere act of producing an estimate would have forced a baseline measurement and surfaced the tax at adoption. The mechanism is the Pre-Mortem Block (Section 3.1): the gate will not clear without the seven axes filled.

How: The rubric integrates into the development lifecycle at two enforcement points.

At brainstorm phase (high leverage): When a proposal triggers the Brainstorm Gate (Section 3.1), the gate cannot be cleared until a Pre-Mortem Block is emitted that addresses all seven axes plus three reflection fields (strongest risk, what would change my mind, confidence). The visible structure of the block makes its absence detectable to the human collaborator in real time. The block format and trigger conditions are documented in Section 3.1.

At commit/ADR phase (audit trail): Load-bearing architectural commits and ADRs include a Decision context: block recording which axes were considered and what the estimates were. This is not audit theater — it is the artifact the consolidation review (Section 2.8) reads against actual outcomes to detect a systematically underestimated axis. When this block is absent, a hidden cost (like a latency tax) stays invisible until production measurement exposes it.

Anti-pattern to defend against: The rubric becomes ceremonial when proposers fill in the axes with stickered answers ("performance: likely fine", "alternatives: nothing comparable") to clear the gate, then proceed with the predetermined conclusion. Three defenses are baked into the Pre-Mortem Block format:

"Strongest risk I see" demands a specific failure mode with a named system component. Generic answers ("complexity") do not pass.
"What would change my mind" demands a falsifiable signal — a measurement, a benchmark threshold, a user report. "If it turns out to be slow" does not pass; "if the classifier's p50 stays above 1,200 ms after a week in production" passes.
The Pre-Mortem Block becomes a load-bearing reference that the post-decision audit reads when the decision underperforms. The rubric improves over time because failed decisions surface which axis was systematically underweighted.