Effort Estimation

Tracking development effort for the ZOL Intelligent Search project -- a PXL AI Technology Architect graduation project built with AI-assisted development.

Updated Weekly

This page is updated weekly to reflect the latest project activity. Last full table update: 2026-05-31.

Summary

| Metric | Value |
|---|---|
| Project start | 2026-02-06 |
| Current date | 2026-05-31 |
| Duration | ~17 weeks |
| Unique working days | ~95 |
| Total commits | 2,432 |
| Estimated total prompting hours | 190--250 hours |
| Working model | Human architect + Claude Code AI pair programming |

The development model pairs a human architect (responsible for design decisions, architecture, and quality oversight) with Claude Code as an AI pair programmer (responsible for code generation, refactoring, testing, and documentation). The human drives every session through natural-language prompts; the AI executes within those instructions.

Weekly Breakdown

| Week | Date Range | Commits | Est. Hours | Cumulative Commits | Key Focus Areas |
|---|---|---|---|---|---|
| W06 | Feb 6--12 | 47 | 5--8 | 47 | Project kickoff, initial RAG pipeline, PostgreSQL setup |
| W07 | Feb 13--19 | 99 | 8--12 | 146 | Hybrid search, embedding models, safety layer |
| W08 | Feb 20--26 | 173 | 15--20 | 319 | Intent classification, query rewriting, reranking |
| W09 | Feb 27 -- Mar 5 | 167 | 15--20 | 486 | Knowledge graph, entity extraction, taxonomy |
| W10 | Mar 6--12 | 221 | 15--20 | 707 | Graph RAG, SNOMED terminology, golden eval setup |
| W11 | Mar 13--19 | 303 | 15--20 | 1,010 | Draft/publish system, pipeline wizard, fuzzy dedup |
| W12 | Mar 20--26 | 127 | 8--12 | 1,137 | PDF corpus scaling, content deduplication |
| W13 | Mar 27--31 | 79 | 8--12 | 1,216 | Hospital-agnostic refactoring, taxonomy dedup |
| W14 | Apr 1--7 | 73 | 5--8 | 1,289 | Code review, security hardening, type safety |
| W15 | Apr 8--10 | 62 | 5--8 | 1,351 | Clarifying questions trigger, production debugging |
| W16 | Apr 13--19 | 57 | 5--8 | 1,408 | Voice Phase A (LiveKit + Twilio SIP scaffolding), nightly auto-ingest live on pilot, CI unblock cascade |
| W17 | Apr 20--26 | 180 | 15--20 | 1,588 | Voice marathon: Q2 sprint complete, 5 production bug fixes, dialogue-manager spec + foundation, live-LLM e2e scenarios harness |
| W18 | Apr 27 -- May 3 | 41 | 5--8 | 1,629 | Legacy 8-stage voice pipeline removed (~7 K LOC deleted), Voice batch A/B (compound subtopic + STT phonetic fallback), thin-pipeline migration |
| W19 | May 4--10 | 204 | 15--20 | 1,833 | Voice batch C/D/F, tenant-overlay system (multi-tenant FAQ + STT + DB renderers), Twilio Phase A SIP, pilot-review-readiness 5-phase rewrite (~7 K LOC docs), ADR-0053→0057, methodology v2.2 |
| W20 | May 11--17 | 157 | 15--20 | 1,990 | Comparison RCA (MedChat 50-Q: 87.5 → 91.1 avg), 7 RAG fixes (T1-T7), autonomous latency optimisations (O3/O4/O5/O10/O12/O16), dedup-heuristic RCA + flip (24 docs restored), methodology v2.3 (Brainstorm Gate), Q5 laadpalen RCA |
| W21 | May 18--24 | 110 | 8--12 | 2,100 | B1+B2 demo-night PRs, ADR-0053 LLM-first agentic voice (native streaming-with-tools), 4-hotfix cascade (Citation JSON, tool_choice, logger.exception, two-call latency), voice quality refit (Rule 4.5, temp 0, STT phonetic sweep, Rule 6.5, tier-1 rate limit), voice ops infrastructure (trace/replay/SLO + operator runbook), 88/89 voice eval, first SLO-discipline win (phantom-bug caught before deploy) |
| W22 | May 25--31 | 332 | 15--20 | 2,432 | Final consolidation sprint. Multi-tenant chat made operable (slug→tenant resolver, admin Chat-URL config, Romanian Layer B code+safety), semantic synthesis gate (embedding cosine vs vetted exemplars, calibrated 0.72, default ON), grounded medical-dosing citations end-to-end (5 bugs across A2+F2), cache channel-scoping + operator control panel, shared capability registry (chat shipped, voice gated) + concurrent-session 500 fix, chat-prompt de-contamination (+30% detail, byte-identical safety), voice language locking (turn-1 hysteresis + per-language schedule/prose), public-chat history persistence + mobile, corpus-wide docs pass (canonical Glossary, SNOMED golden-pages, 6-group sidebar, 155+ admonition fix) |
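
The Cumulative Commits column is a running total of the weekly counts. As a minimal sketch, the Python below recomputes it from the commit counts listed in the table above; the variable names are illustrative and not part of the project's tooling.

```python
from itertools import accumulate

# Weekly commit counts copied from the table above (W06 through W22).
weekly_commits = {
    "W06": 47, "W07": 99, "W08": 173, "W09": 167, "W10": 221, "W11": 303,
    "W12": 127, "W13": 79, "W14": 73, "W15": 62, "W16": 57, "W17": 180,
    "W18": 41, "W19": 204, "W20": 157, "W21": 110, "W22": 332,
}

# Running total per week, in table order (dicts preserve insertion order).
cumulative = dict(zip(weekly_commits, accumulate(weekly_commits.values())))

for week, total in cumulative.items():
    print(f"{week}: {weekly_commits[week]:>4} commits, {total:>5,} cumulative")
```

The final running total equals the 2,432 commits reported in the Summary.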

Visual Progress

The following chart shows weekly commit volume. Each # represents approximately 7.5 commits.

W06 |###### | 47
W07 |############# | 99
W08 |####################### | 173
W09 |###################### | 167
W10 |############################# | 221
W11 |######################################## | 303
W12 |################ | 127
W13 |########## | 79
W14 |######### | 73
W15 |######## | 62
W16 |####### | 57
W17 |######################## | 180
W18 |##### | 41
W19 |########################### | 204
W20 |##################### | 157
W21 |############## | 110
W22 |############################################| 332
0 75 150 225 300
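
A chart of this kind can be regenerated from the commit counts. The sketch below assumes about 7.5 commits per character (W06's 47 commits are drawn as 6 characters above) and a 44-character bar area; the function name, the rounding-down rule, and both parameters are assumptions made for illustration, not the script that produced this page.

```python
def render_bar(label: str, commits: int, per_char: float = 7.5, width: int = 44) -> str:
    """Return one chart row: label, a left-aligned bar of '#', and the count."""
    # Round down to whole characters and cap at the bar area width (assumed rule).
    blocks = min(width, int(commits / per_char))
    return f"{label} |{'#' * blocks:<{width}}| {commits}"


# A few rows using commit counts from the weekly breakdown.
for label, commits in [("W06", 47), ("W11", 303), ("W22", 332)]:
    print(render_bar(label, commits))
```

Padding each bar to a fixed width keeps the closing `|` and the counts aligned when the chart is shown in a monospaced font.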

Cumulative Progress

The system grew incrementally, with each week adding distinct capabilities on top of the previous foundation.

| Week | System Capabilities at End of Week |
|---|---|
| W06 | Basic RAG pipeline operational: document ingestion, pgvector embeddings, simple vector search, PostgreSQL schema, FastAPI skeleton, React frontend shell |
| W07 | Hybrid search (vector + BM25 via RRF), BGE-M3 embedding model, initial safety layer with medical advice detection, user authentication |
| W08 | Intent classification (navigational vs. informational vs. medical), LLM-based query rewriting, cross-encoder reranking, response quality gates |
| W09 | Knowledge graph with entity extraction, hospital taxonomy (doctors, departments, conditions, treatments), entity-aware retrieval |
| W10 | Graph RAG integration, SNOMED CT medical terminology mapping, golden evaluation framework (299 questions), automated regression testing |
| W11 | Draft/publish content workflow, pipeline wizard for bulk processing, fuzzy entity deduplication, eval pass rate rising from a 95.1% baseline to 99% |
| W12 | PDF brochure corpus (573 documents), content-level deduplication, chunk quality improvements, scaling fixes for large document sets |
| W13 | Hospital-agnostic architecture (multi-tenant ready), taxonomy deduplication (12,997 to 2,663 entities), database-backed configuration |
| W14 | Security hardening (input validation, rate limiting), type safety improvements, code review remediation across 45+ files |
| W15 | Clarifying question system for ambiguous queries, production debugging and stability improvements, ambiguity detection pipeline |
| W16 | First voice channel reaching pilot: LiveKit Agents worker + Twilio Elastic SIP gateway, Deepgram Nova-3 STT, ElevenLabs Multilingual v2 TTS, nightly auto-ingest live (INGEST_MODE=auto, 03:00 UTC), CI pipeline green end-to-end |
| W17 | Voice dialogue manager (built, then later removed): 6-tool dispatcher, system prompt, frustration ladder, FAQ children, orchestrator integration, 15 integration tests. Live-LLM end-to-end scenarios harness for 8 dialogue flows. Voice path treated as stateful conversation, not stateless Q&A |
| W18 | Architectural simplification (~7,000 LOC deleted): legacy 8-stage VoiceOrchestrator + dialogue-manager + speculative-STT cache + preprocessor LLM + safety gate + conversational-intent resolver + 17 legacy tests. Thin pipeline (regex pre-filter → FAQ → RAG) becomes the only production behaviour on every channel |
| W19 | Tenant overlay system shipped: multi-tenant FAQ + STT phonetic recovery + DB-driven answer renderers, zero duplicated tenant data. Twilio Phase A SIP integration. Pilot-review documentation pass: 5-phase rewrite producing ~7K LOC of documentation across architecture, safety, voice, compendium, positioning, methodology |
| W20 | Comparison RCA against MedChat (50-Q benchmark): 87.5 avg / 3 wins / 21 losses → 91.1 avg / 23 wins / 7 losses / 0 P0 regressions via 7 RAG fixes. Autonomous latency wave: O3/O4/O5/O10/O12/O16 (~700 ms saved per call, pydantic-ai removed). Methodology v2.3 ratified: Decision-Cost Rubric (6 axes) + Brainstorm Gate (Pre-Mortem Block) |
| W21 | LLM-first agentic voice (ADR-0053): native OpenAI streaming-with-tools, single call per tool-decision iteration. 4-hotfix cascade survived. Voice quality refit (Rule 4.5 no-repeated-clarifications, temperature 0, 80-term Belgian-Dutch STT phonetic sweep, Rule 6.5 procedure explanations, tier-1 session rate limit). Voice operator runbook + diagnostic infrastructure (trace/replay/SLO) ends seven-week reactive prompt-cycle. 88/89 voice golden eval verdict. First SLO-discipline win: phantom safety bug caught before shipping a regression-prone prompt rule |
| W22 | Multi-tenant chat operable end-to-end: per-request slug→tenant resolution (fail-closed), admin Chat-URL configuration, per-tenant ingest language filter, and Romanian as the first non-Dutch locale (code + layer-3 safety parity, dormant pending native-speaker review). Safety boundary hardened: the synthesis decision moves from a punctuation-fragile regex to a calibrated embedding-similarity gate (threshold 0.72, default ON); grounded medical-dosing answers synthesise from vetted brochures and carry citations across both render paths. Cache becomes channel-aware (no voice answer served to web) with an operator control panel. Chat and voice share one capability registry; the concurrent-session HTTP 500 is resolved. Chat prompt de-contaminated (+30% detail, byte-identical safety). Voice locks turn-1 language and answers per-caller-language. Public chat gains history persistence + mobile layout. Documentation consolidated: canonical Glossary, SNOMED golden-pages, six-group sidebar, Docusaurus v3 admonition fix across 155+ blocks |

How This Is Measured

Commits as a Proxy for Effort

Git commits serve as the primary effort proxy in this project. While commits are an imperfect measure of time, they correlate well with active development sessions in an AI-assisted workflow for the following reasons:

  • Session-driven development: Each working session consists of a human architect prompting Claude Code with design instructions. A typical session lasts 2--4 hours and produces 15--40 commits, depending on whether the work involves greenfield features (more commits) or debugging/refactoring (fewer commits).
  • Atomic commits: The AI pair programmer produces small, atomic commits -- one per logical change -- rather than large monolithic commits. This makes commit count a more granular measure than in traditional development.
  • Release notes: Each significant session is documented in release notes, providing a secondary source for effort validation.
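
The session figures in the list above can be cross-checked against the Summary totals. The sketch below is plain arithmetic on numbers quoted on this page (2,432 commits, 190--250 estimated hours, 15--40 commits per 2--4 hour session); the variable names are illustrative.

```python
# Figures quoted on this page.
total_commits = 2432
hours_low, hours_high = 190, 250

# Overall commits per prompting hour: fewest when hours are highest.
overall_rate_min = total_commits / hours_high
overall_rate_max = total_commits / hours_low

# Rate implied by a typical session: 15 commits in 4 h up to 40 commits in 2 h.
session_rate_min = 15 / 4
session_rate_max = 40 / 2

print(f"overall: {overall_rate_min:.1f} to {overall_rate_max:.1f} commits/hour")
print(f"session: {session_rate_min:.2f} to {session_rate_max:.0f} commits/hour")
```

The overall rate of roughly 9.7 to 12.8 commits per prompting hour sits inside the 3.75 to 20 range implied by a typical session, so the two sets of figures are consistent with each other.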

Estimation Methodology

Prompting hours are estimated by categorizing weeks into three tiers:

| Tier | Commits/Week | Estimated Hours/Week | Rationale |
|---|---|---|---|
| High intensity | 150--303 | 15--20 | Multiple long sessions, greenfield feature development |
| Medium intensity | 80--149 | 8--12 | Mixed feature work, refinement, and testing |
| Lower intensity | 45--79 | 5--8 | Focused debugging, review, or short-week periods |

These estimates are conservative. They count only active prompting time -- the hours during which the human architect is actively instructing the AI. They exclude time spent on design thinking, reading documentation, reviewing outputs, or writing specifications outside of the AI coding sessions.
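
The tier rule can be expressed as a small lookup. The sketch below is an illustration only: the `Tier` type, the function names, and the choice to treat each band as open-ended (any week with 150 or more commits is high intensity, anything under 80 is lower intensity) are assumptions, not the project's actual tooling.

```python
from typing import NamedTuple


class Tier(NamedTuple):
    name: str
    min_commits: int  # inclusive lower bound of the band
    hours_low: int
    hours_high: int


# Bands from the tier table, highest first. Treating each band as
# open-ended upward is an assumption of this sketch.
TIERS = [
    Tier("High intensity", 150, 15, 20),
    Tier("Medium intensity", 80, 8, 12),
    Tier("Lower intensity", 0, 5, 8),
]


def classify(commits: int) -> Tier:
    """Return the first tier whose lower bound the commit count reaches."""
    for tier in TIERS:
        if commits >= tier.min_commits:
            return tier
    raise ValueError("commit count must be non-negative")


def estimate_hours(weekly_commits: list[int]) -> tuple[int, int]:
    """Sum the low and high hour bounds over all weeks."""
    tiers = [classify(c) for c in weekly_commits]
    return sum(t.hours_low for t in tiers), sum(t.hours_high for t in tiers)


print(classify(221).name)             # High intensity
print(classify(127).name)             # Medium intensity
print(estimate_hours([47, 99, 173]))  # (28, 40)
```

Open-ended bands are used because some weeks fall outside the printed limits: W18 (41 commits) is below 45 and W22 (332 commits) is above 303. The weekly table also applies some judgement (W13, at 79 commits, is listed at 8--12 hours), so a mechanical rule like this one will not reproduce every row.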

What These Hours Represent

In the AI-assisted development model, a single prompting hour is significantly more productive than a traditional solo development hour. The human architect focuses exclusively on what to build and why, while the AI handles the how: writing code, tests, migrations, and documentation. As a result, the estimated 190--250 prompting hours over roughly 17 weeks produced output that would traditionally require substantially more engineering time: a production RAG system with a voice channel, multi-tenant overlays, structured-output validation, ADR-backed architectural decisions (49+ ADRs), a 299-question golden eval harness, a 10-persona voice eval harness, ~251 documentation pages, and an SLO-discipline operator runbook with diagnostic tooling.

This is not a claim about replacing developers. It is an observation that the human-AI pair programming model shifts the human role from writing code to directing code generation, which changes the relationship between hours spent and output produced.