Effort Estimation

Tracking development effort for the ZOL Intelligent Search project -- a PXL AI Technology Architect graduation project built with AI-assisted development.

Updated Weekly

This page is updated weekly to reflect the latest project activity. Last full table update: 2026-05-31.

Summary

Metric	Value
Project start	2026-02-06
Current date	2026-05-31
Duration	~17 weeks
Unique working days	~95
Total commits	2,432
Estimated total prompting hours	190--250 hours
Working model	Human architect + Claude Code AI pair programming

The development model pairs a human architect (responsible for design decisions, architecture, and quality oversight) with Claude Code as an AI pair programmer (responsible for code generation, refactoring, testing, and documentation). The human drives every session through natural-language prompts; the AI executes within those instructions.
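
The headline figures in the summary table can be cross-checked with a few lines of arithmetic. The sketch below uses only values from the table. Counting a partial week as a full week is an assumption made here to match the W06 through W22 labelling, and the variable names are illustrative.

```python
import math
from datetime import date

# Figures copied from the summary table.
project_start = date(2026, 2, 6)
current_date = date(2026, 5, 31)
total_commits = 2432
working_days = 95                # approximate, as stated in the table
hours_low, hours_high = 190, 250

elapsed_days = (current_date - project_start).days
# Assumption: a partial week is counted as a full calendar week.
calendar_weeks = math.ceil(elapsed_days / 7)

commits_per_day = total_commits / working_days
# More hours means fewer commits per hour, so the high estimate gives the low rate.
rate_low = total_commits / hours_high
rate_high = total_commits / hours_low

print(f"Elapsed days: {elapsed_days}")
print(f"Calendar weeks: {calendar_weeks}")
print(f"Commits per working day: {commits_per_day:.1f}")
print(f"Commits per prompting hour: {rate_low:.1f} to {rate_high:.1f}")
```

Because the working-day and hour figures are themselves estimates, the derived rates are indicative rather than exact.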

Weekly Breakdown

Week	Date Range	Commits	Est. Hours	Cumulative Commits	Key Focus Areas
W06	Feb 6--12	47	5--8	47	Project kickoff, initial RAG pipeline, PostgreSQL setup
W07	Feb 13--19	99	8--12	146	Hybrid search, embedding models, safety layer
W08	Feb 20--26	173	15--20	319	Intent classification, query rewriting, reranking
W09	Feb 27 -- Mar 5	167	15--20	486	Knowledge graph, entity extraction, taxonomy
W10	Mar 6--12	221	15--20	707	Graph RAG, SNOMED terminology, golden eval setup
W11	Mar 13--19	303	15--20	1,010	Draft/publish system, pipeline wizard, fuzzy dedup
W12	Mar 20--26	127	8--12	1,137	PDF corpus scaling, content deduplication
W13	Mar 27--31	79	8--12	1,216	Hospital-agnostic refactoring, taxonomy dedup
W14	Apr 1--7	73	5--8	1,289	Code review, security hardening, type safety
W15	Apr 8--10	62	5--8	1,351	Clarifying questions trigger, production debugging
W16	Apr 13--19	57	5--8	1,408	Voice Phase A (LiveKit + Twilio SIP scaffolding), nightly auto-ingest live on pilot, CI unblock cascade
W17	Apr 20--26	180	15--20	1,588	Voice marathon: Q2 sprint complete, 5 production bug fixes, dialogue-manager spec + foundation, live-LLM e2e scenarios harness
W18	Apr 27 -- May 3	41	5--8	1,629	Legacy 8-stage voice pipeline removed (~7 K LOC deleted), Voice batch A/B (compound subtopic + STT phonetic fallback), thin-pipeline migration
W19	May 4--10	204	15--20	1,833	Voice batch C/D/F, tenant-overlay system (multi-tenant FAQ + STT + DB renderers), Twilio Phase A SIP, pilot-review-readiness 5-phase rewrite (~7 K LOC docs), ADR-0053→0057, methodology v2.2
W20	May 11--17	157	15--20	1,990	Comparison RCA (MedChat 50-Q: 87.5 → 91.1 avg), 7 RAG fixes (T1-T7), autonomous latency optimisations (O3/O4/O5/O10/O12/O16), dedup-heuristic RCA + flip (24 docs restored), methodology v2.3 (Brainstorm Gate), Q5 laadpalen RCA
W21	May 18--24	110	8--12	2,100	B1+B2 demo-night PRs, ADR-0053 LLM-first agentic voice (native streaming-with-tools), 4-hotfix cascade (Citation JSON, tool_choice, logger.exception, two-call latency), voice quality refit (Rule 4.5, temp 0, STT phonetic sweep, Rule 6.5, tier-1 rate limit), voice ops infrastructure (trace/replay/SLO + operator runbook), 88/89 voice eval, first SLO-discipline win (phantom-bug caught before deploy)
W22	May 25--31	332	15--20	2,432	Final consolidation sprint. Multi-tenant chat made operable (slug→tenant resolver, admin Chat-URL config, Romanian Layer B code+safety), semantic synthesis gate (embedding cosine vs vetted exemplars, calibrated 0.72, default ON), grounded medical-dosing citations end-to-end (5 bugs across A2+F2), cache channel-scoping + operator control panel, shared capability registry (chat shipped, voice gated) + concurrent-session 500 fix, chat-prompt de-contamination (+30% detail, byte-identical safety), voice language locking (turn-1 hysteresis + per-language schedule/prose), public-chat history persistence + mobile, corpus-wide docs pass (canonical Glossary, SNOMED golden-pages, 6-group sidebar, 155+ admonition fix)
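
The Cumulative Commits column is a running total of the Commits column. A minimal sketch for recomputing it after a weekly update is shown below. The weekly counts are copied from the table, and the variable names are illustrative.

```python
from itertools import accumulate

# Weekly commit counts copied from the table above (W06 through W22).
weekly_commits = {
    "W06": 47, "W07": 99, "W08": 173, "W09": 167, "W10": 221, "W11": 303,
    "W12": 127, "W13": 79, "W14": 73, "W15": 62, "W16": 57, "W17": 180,
    "W18": 41, "W19": 204, "W20": 157, "W21": 110, "W22": 332,
}

# Running total per week, in table order (dicts preserve insertion order).
cumulative = dict(zip(weekly_commits, accumulate(weekly_commits.values())))

for week, total in cumulative.items():
    print(f"{week}\t{weekly_commits[week]}\t{total:,}")
```

The final running total should equal the Total commits figure in the summary.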

Visual Progress

The following chart shows weekly commit volume. Each `#` represents approximately 7.5 commits, so the 44-character bar width corresponds to roughly 330 commits.

W06  |######                                      |  47
W07  |#############                               |  99
W08  |#######################                     | 173
W09  |######################                      | 167
W10  |#############################               | 221
W11  |########################################    | 303
W12  |################                            | 127
W13  |##########                                  |  79
W14  |#########                                   |  73
W15  |########                                    |  62
W16  |#######                                     |  57
W17  |########################                    | 180
W18  |#####                                       |  41
W19  |###########################                 | 204
W20  |#####################                       | 157
W21  |##############                              | 110
W22  |############################################| 332
      0        75       150       225       300
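
A chart in this style can be generated from the weekly counts. The sketch below assumes a scale of 7.5 commits per character, inferred from the axis (300 commits at 40 characters), and rounds down. Both choices are assumptions, so individual bars may differ from the chart above by one character.

```python
COMMITS_PER_BLOCK = 7.5   # assumed scale, inferred from the axis labels
BAR_WIDTH = 44            # characters between the two '|' delimiters


def render_bar(week: str, commits: int) -> str:
    """Return one chart row; the bar is capped at BAR_WIDTH characters."""
    blocks = min(BAR_WIDTH, int(commits / COMMITS_PER_BLOCK))
    bar = "#" * blocks
    return f"{week}  |{bar:<{BAR_WIDTH}}| {commits:>3}"


# A few weeks from the table, used here as sample input.
for week, commits in [("W06", 47), ("W11", 303), ("W22", 332)]:
    print(render_bar(week, commits))
```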

Cumulative Progress

The system grew incrementally, with each week adding distinct capabilities on top of the previous foundation.

Week	System Capabilities at End of Week
W06	Basic RAG pipeline operational: document ingestion, pgvector embeddings, simple vector search, PostgreSQL schema, FastAPI skeleton, React frontend shell
W07	Hybrid search (vector + BM25 via RRF), BGE-M3 embedding model, initial safety layer with medical advice detection, user authentication
W08	Intent classification (navigational vs. informational vs. medical), LLM-based query rewriting, cross-encoder reranking, response quality gates
W09	Knowledge graph with entity extraction, hospital taxonomy (doctors, departments, conditions, treatments), entity-aware retrieval
W10	Graph RAG integration, SNOMED CT medical terminology mapping, golden evaluation framework (299 questions), automated regression testing
W11	Draft/publish content workflow, pipeline wizard for bulk processing, fuzzy entity deduplication, eval pass rate rising from a 95.1 percent baseline to 99 percent
W12	PDF brochure corpus (573 documents), content-level deduplication, chunk quality improvements, scaling fixes for large document sets
W13	Hospital-agnostic architecture (multi-tenant ready), taxonomy deduplication (12,997 to 2,663 entities), database-backed configuration
W14	Security hardening (input validation, rate limiting), type safety improvements, code review remediation across 45+ files
W15	Clarifying question system for ambiguous queries, production debugging and stability improvements, ambiguity detection pipeline
W16	First voice channel reaching pilot: LiveKit Agents worker + Twilio Elastic SIP gateway, Deepgram Nova-3 STT, ElevenLabs Multilingual v2 TTS, nightly auto-ingest live (`INGEST_MODE=auto`, 03:00 UTC), CI pipeline green end-to-end
W17	Voice dialogue manager (built, then later removed): 6-tool dispatcher, system prompt, frustration ladder, FAQ children, orchestrator integration, 15 integration tests. Live-LLM end-to-end scenarios harness for 8 dialogue flows. Voice path treated as stateful conversation, not stateless Q&A
W18	Architectural simplification (~7,000 LOC deleted): legacy 8-stage VoiceOrchestrator + dialogue-manager + speculative-STT cache + preprocessor LLM + safety gate + conversational-intent resolver + 17 legacy tests. Thin pipeline (regex pre-filter → FAQ → RAG) becomes the only production behaviour on every channel
W19	Tenant overlay system shipped: multi-tenant FAQ + STT phonetic recovery + DB-driven answer renderers, zero duplicated tenant data. Twilio Phase A SIP integration. Pilot-review documentation pass — 5-phase rewrite producing ~7K LOC of documentation across architecture, safety, voice, compendium, positioning, methodology
W20	Comparison RCA against MedChat (50-Q benchmark): 87.5 avg / 3 wins / 21 losses → 91.1 avg / 23 wins / 7 losses / 0 P0 regressions via 7 RAG fixes. Autonomous latency wave: O3/O4/O5/O10/O12/O16 (~700 ms saved per call, pydantic-ai removed). Methodology v2.3 ratified: Decision-Cost Rubric (6 axes) + Brainstorm Gate (Pre-Mortem Block)
W21	LLM-first agentic voice (ADR-0053): native OpenAI streaming-with-tools, single call per tool-decision iteration. 4-hotfix cascade survived. Voice quality refit (Rule 4.5 no-repeated-clarifications, temperature 0, 80-term Belgian-Dutch STT phonetic sweep, Rule 6.5 procedure explanations, tier-1 session rate limit). Voice operator runbook + diagnostic infrastructure (trace/replay/SLO) ends seven-week reactive prompt-cycle. 88/89 voice golden eval verdict. First SLO-discipline win: phantom safety bug caught before shipping a regression-prone prompt rule
W22	Multi-tenant chat operable end-to-end: per-request slug→tenant resolution (fail-closed), admin Chat-URL configuration, per-tenant ingest language filter, and Romanian as the first non-Dutch locale (code + layer-3 safety parity, dormant pending native-speaker review). Safety boundary hardened: the synthesis decision moves from a punctuation-fragile regex to a calibrated embedding-similarity gate (threshold 0.72, default ON); grounded medical-dosing answers synthesise from vetted brochures and carry citations across both render paths. Cache becomes channel-aware (no voice answer served to web) with an operator control panel. Chat and voice share one capability registry; the concurrent-session HTTP 500 is resolved. Chat prompt de-contaminated (+30% detail, byte-identical safety). Voice locks turn-1 language and answers per-caller-language. Public chat gains history persistence + mobile layout. Documentation consolidated: canonical Glossary, SNOMED golden-pages, six-group sidebar, Docusaurus v3 admonition fix across 155+ blocks

How This Is Measured

Commits as a Proxy for Effort

Git commits serve as the primary effort proxy in this project. While commits are an imperfect measure of time, they correlate well with active development sessions in an AI-assisted workflow for the following reasons:

Session-driven development: Each working session consists of a human architect prompting Claude Code with design instructions. A typical session lasts 2--4 hours and produces 15--40 commits, depending on whether the work involves greenfield features (more commits) or debugging/refactoring (fewer commits).
Atomic commits: The AI pair programmer produces small, atomic commits -- one per logical change -- rather than large monolithic commits. This makes commit count a more granular measure than in traditional development.
Release notes: Each significant session is documented in release notes, providing a secondary source for effort validation.

Estimation Methodology

Prompting hours are estimated by categorizing weeks into three tiers:

Tier	Commits/Week	Estimated Hours/Week	Rationale
High intensity	150--332	15--20	Multiple long sessions, greenfield feature development
Medium intensity	80--149	8--12	Mixed feature work, refinement, and testing
Lower intensity	41--79	5--8	Focused debugging, review, or short-week periods

These estimates are conservative. They count only active prompting time -- the hours during which the human architect is actively instructing the AI. They exclude time spent on design thinking, reading documentation, reviewing outputs, or writing specifications outside of the AI coding sessions.
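
The tier lookup can be expressed as a small function. This is a sketch that classifies by commit count alone, using the 150 and 80 boundaries from the table. Treating the top and bottom tiers as open-ended is an assumption, and the `Tier` type and `classify_week` name are hypothetical. The published weekly figures may also reflect week length and type of work: W13, for example, is listed at 8--12 hours with 79 commits.

```python
from typing import NamedTuple


class Tier(NamedTuple):
    name: str
    hours_low: int
    hours_high: int


HIGH = Tier("High intensity", 15, 20)
MEDIUM = Tier("Medium intensity", 8, 12)
LOWER = Tier("Lower intensity", 5, 8)


def classify_week(commits: int) -> Tier:
    """Map a weekly commit count to an estimated-hours tier (commit count only)."""
    if commits >= 150:
        return HIGH
    if commits >= 80:
        return MEDIUM
    return LOWER


# Sample weeks taken from the Weekly Breakdown table.
for week, commits in [("W08", 173), ("W12", 127), ("W15", 62)]:
    tier = classify_week(commits)
    print(f"{week}: {commits} commits -> {tier.name}, {tier.hours_low}--{tier.hours_high} h")
```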

What These Hours Represent

In the AI-assisted development model, a single prompting hour is significantly more productive than a traditional solo development hour. The human architect focuses exclusively on what to build and why, while the AI handles the how -- writing code, tests, migrations, and documentation. As a result, the 190--250 prompting hours spent over roughly 17 weeks produced output that would traditionally require substantially more engineering time: a production RAG system with a voice channel, multi-tenant overlays, structured-output validation, ADR-backed architectural decisions (49+ ADRs), a 299-question golden eval harness, a 10-persona voice eval harness, ~251 documentation pages, and an SLO-discipline operator runbook with diagnostic tooling.

This is not a claim about replacing developers. It is an observation that the human-AI pair programming model shifts the human role from writing code to directing code generation, which changes the relationship between hours spent and output produced.
