Skip to main content

Investigative Discipline

What: When a symptom appears (failing test, error log, user-reported bug), the default first move is to investigate the symptom against current production code before classifying it as stale, transient, or unrelated. Three rules govern the investigation:

  1. A failing test is REAL until proven stale. A test that fails on master is at least as likely to be detecting a real production regression as it is to be staleness from an un-coordinated code change. Default-investigate the prod code change in the touched area before declaring the test stale.

  2. Misattributed-cause is technical debt. When a defensive guard is added with a comment that guesses at the cause (rather than documenting the cause that was actually verified), future debuggers will waste time chasing the wrong hypothesis. Comments must distinguish "verified" from "hypothesized" and be updated when the actual cause is later identified.

  3. Convergent design intuition is a quality signal. When two engineers (or an engineer and an AI agent) independently arrive at the same design from different angles, the design is likely right. The convergence is a positive correctness signal and worth calling out — both as confidence that the implementation is on track and as documentation of why the design was chosen.

Why: Three distinct failure modes are addressed by these rules.

The "test is stale" mis-classification is the most common: a test fails on master, the developer runs git stash and observes the failure existed before their change, then declares "pre-existing — therefore stale" and moves on. The "pre-existing" half of that inference is correct; the "therefore stale" half is the wrong leap. A pre-existing failing test could be a pre-existing real bug — and frequently is. The shortcut bypasses the step that would have caught the regression.

The "misattributed cause" failure happens when a developer (or AI agent) hits a symptom they don't have time to fully diagnose. They write a defensive workaround, then write a comment guessing at the cause. The workaround works (because it covers the real cause too, even though the guess was wrong). Months later, another developer encounters a similar symptom, reads the comment, and chases the wrong hypothesis for hours before realizing the original attribution was speculation. The cost compounds across the codebase as defensive guards accumulate.

The "convergent design intuition" rule is positive rather than defensive: AI-assisted development often involves separate threads of reasoning — the human's intuition from product context, the AI's pattern-match against similar prior implementations, and the test suite's empirical signal from observed behavior. When these independently agree on a design, that agreement is meaningful information about correctness. Engineering teams that treat convergence as noise (rather than signal) miss a free quality check.

Evidence: Two cases on a single day in the ZOL Hospital Intelligent Search project (2026-05-10) demonstrated all three rules in action.

Case 1 — The phone-number test stale-misclassification. A test (test_phone_number_inserts_commas) was failing on master. Initial diagnosis: "the production joiner changed on 2026-05-07 from , to , ... (ellipsis-padded for ElevenLabs pause control); test was written for the old joiner; therefore stale." A git stash baseline confirmed the failure was pre-existing. The test was tagged for a future "fix the stale test" cleanup pass.

Hours later, deeper investigation revealed the test was correctly detecting a real production silent-failure bug: the new ellipsis joiner was being interpreted by _split_sentences_abbrev_safe as a sentence boundary, causing phone numbers to truncate mid-rendering when the default max_sentences=2 cap fired. The bug had shipped to production on 2026-05-07 and stayed broken for 3 days because the canary test was misclassified as stale. Fix: protect ... (ellipsis + whitespace) like protected abbreviations using the existing _PROTECT_MARKER mechanism. The test that "looked stale" was the bug's only detector.

Case 2 — The fastmcp OPENAI_API_KEY misattribution. A defensive guard had been added to revalidate_fast_gate_threshold.py with a comment blaming "fastmcp's MCP-server config loader" for contaminating the process environment. The guard was effective; the comment was speculation. Investigation showed (a) zero fastmcp imports anywhere in the backend's app/ tree, (b) zero top-level os.environ writes, (c) zero OPENAI_API_KEY matches in the user's shell rc files, and (d) python -m scripts.X in a clean shell does not contaminate the env at all. The actual cause: pydantic-settings prefers process-env over .env files, so a stale shell-level env var (typically exported in a parallel terminal) would override the project key. The comment had been misleading future debuggers since it was written.

Case 3 — The convergent-design moment for the LLM-investigation timeout. While the engineering team was deploying a n_turns * 700 + 1500 token-budget formula to fix an LLM regression (the formula derived from per-turn structured-output token math + reasoning-token overhead), the user — without seeing the deploy — proposed "maybe we can use dynamic cap on tokens depending on number of turns." The two threads converged on the same design. The convergence was treated as confirming evidence and explicitly called out in the post-deploy summary, not dismissed as coincidence.

How: Three operational mechanisms encode these rules.

For "failing test is real until proven stale," the bug-fix lifecycle (Section 3.2) requires investigating the production code change in the touched area before declaring a test stale. Specifically: run git log -- <prod-file> and git log -- <test-file> and compare. If the prod file changed after the test was last touched, the prior is that the test is right and the prod regressed. Run the prod code path manually (REPL, integration test, live invocation) and compare what it does with what the test asserts. Only declare the test stale when (a) the prod change was intentional AND (b) the test wasn't updated in the same commit that should have updated it — and that combination is itself an R2-discipline debt entry, not a "no-op" closure.

For "misattributed-cause is technical debt," code comments on defensive guards must distinguish verified from hypothesized causes. The format is structural: Verified <date> via <method>: <cause> for known causes, Hypothesized <date>: <cause> (untested; defensive guard works regardless) for guesses. When the actual cause is later identified, the comment is updated and any cross-references in memory entries are corrected. Memory entries that documented disproven hypotheses retain the original framing at the bottom for historical fidelity but lead with the corrected diagnosis.

For "convergent design intuition," the methodology recognizes the convergence as a positive quality signal, not noise. When the AI agent and human reach the same design from different angles, that agreement is documented in the commit message or design doc as a correctness signal. This builds team confidence in agentic execution and creates a feedback loop: human intuitions that converge with AI reasoning are observably correct more often, which calibrates the team's trust in subsequent agentic work.

For documentation-modernization-pass project shape that exercises all three rules at scale, see appendix-n-documentation-excellence-passes.md.