Effort Estimation
Tracking development effort for the ZOL Intelligent Search project -- a PXL AI Technology Architect graduation project built with AI-assisted development.
This page is updated weekly to reflect the latest project activity. Last full table update: 2026-05-31.
Summary
| Metric | Value |
|---|---|
| Project start | 2026-02-06 |
| Current date | 2026-05-31 |
| Duration | ~17 weeks |
| Unique working days | ~95 |
| Total commits | 2,432 |
| Estimated total prompting hours | 190--250 hours |
| Working model | Human architect + Claude Code AI pair programming |
The development model pairs a human architect (responsible for design decisions, architecture, and quality oversight) with Claude Code as an AI pair programmer (responsible for code generation, refactoring, testing, and documentation). The human drives every session through natural-language prompts; the AI executes within those instructions.
Weekly Breakdown
| Week | Date Range | Commits | Est. Hours | Cumulative Commits | Key Focus Areas |
|---|---|---|---|---|---|
| W06 | Feb 6--12 | 47 | 5--8 | 47 | Project kickoff, initial RAG pipeline, PostgreSQL setup |
| W07 | Feb 13--19 | 99 | 8--12 | 146 | Hybrid search, embedding models, safety layer |
| W08 | Feb 20--26 | 173 | 15--20 | 319 | Intent classification, query rewriting, reranking |
| W09 | Feb 27 -- Mar 5 | 167 | 15--20 | 486 | Knowledge graph, entity extraction, taxonomy |
| W10 | Mar 6--12 | 221 | 15--20 | 707 | Graph RAG, SNOMED terminology, golden eval setup |
| W11 | Mar 13--19 | 303 | 15--20 | 1,010 | Draft/publish system, pipeline wizard, fuzzy dedup |
| W12 | Mar 20--26 | 127 | 8--12 | 1,137 | PDF corpus scaling, content deduplication |
| W13 | Mar 27--31 | 79 | 8--12 | 1,216 | Hospital-agnostic refactoring, taxonomy dedup |
| W14 | Apr 1--7 | 73 | 5--8 | 1,289 | Code review, security hardening, type safety |
| W15 | Apr 8--10 | 62 | 5--8 | 1,351 | Clarifying questions trigger, production debugging |
| W16 | Apr 13--19 | 57 | 5--8 | 1,408 | Voice Phase A (LiveKit + Twilio SIP scaffolding), nightly auto-ingest live on pilot, CI unblock cascade |
| W17 | Apr 20--26 | 180 | 15--20 | 1,588 | Voice marathon: Q2 sprint complete, 5 production bug fixes, dialogue-manager spec + foundation, live-LLM e2e scenarios harness |
| W18 | Apr 27 -- May 3 | 41 | 5--8 | 1,629 | Legacy 8-stage voice pipeline removed (~7 K LOC deleted), Voice batch A/B (compound subtopic + STT phonetic fallback), thin-pipeline migration |
| W19 | May 4--10 | 204 | 15--20 | 1,833 | Voice batch C/D/F, tenant-overlay system (multi-tenant FAQ + STT + DB renderers), Twilio Phase A SIP, pilot-review-readiness 5-phase rewrite (~7 K LOC docs), ADR-0053→0057, methodology v2.2 |
| W20 | May 11--17 | 157 | 15--20 | 1,990 | Comparison RCA (MedChat 50-Q: 87.5 → 91.1 avg), 7 RAG fixes (T1-T7), autonomous latency optimisations (O3/O4/O5/O10/O12/O16), dedup-heuristic RCA + flip (24 docs restored), methodology v2.3 (Brainstorm Gate), Q5 laadpalen RCA |
| W21 | May 18--24 | 110 | 8--12 | 2,100 | B1+B2 demo-night PRs, ADR-0053 LLM-first agentic voice (native streaming-with-tools), 4-hotfix cascade (Citation JSON, tool_choice, logger.exception, two-call latency), voice quality refit (Rule 4.5, temp 0, STT phonetic sweep, Rule 6.5, tier-1 rate limit), voice ops infrastructure (trace/replay/SLO + operator runbook), 88/89 voice eval, first SLO-discipline win (phantom-bug caught before deploy) |
| W22 | May 25--31 | 332 | 15--20 | 2,432 | Final consolidation sprint. Multi-tenant chat made operable (slug→tenant resolver, admin Chat-URL config, Romanian Layer B code+safety), semantic synthesis gate (embedding cosine vs vetted exemplars, calibrated 0.72, default ON), grounded medical-dosing citations end-to-end (5 bugs across A2+F2), cache channel-scoping + operator control panel, shared capability registry (chat shipped, voice gated) + concurrent-session 500 fix, chat-prompt de-contamination (+30% detail, byte-identical safety), voice language locking (turn-1 hysteresis + per-language schedule/prose), public-chat history persistence + mobile, corpus-wide docs pass (canonical Glossary, SNOMED golden-pages, 6-group sidebar, 155+ admonition fix) |
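The Cumulative Commits column is a running total of the Commits column. A minimal sketch of that arithmetic follows, with the weekly counts copied from the table above. The names `WEEKLY_COMMITS` and `cumulative_commits` are illustrative and are not part of the project codebase.

```python
from itertools import accumulate

# Weekly commit counts copied from the Weekly Breakdown table (W06 through W22).
WEEKLY_COMMITS = {
    "W06": 47, "W07": 99, "W08": 173, "W09": 167, "W10": 221, "W11": 303,
    "W12": 127, "W13": 79, "W14": 73, "W15": 62, "W16": 57, "W17": 180,
    "W18": 41, "W19": 204, "W20": 157, "W21": 110, "W22": 332,
}


def cumulative_commits(weekly):
    """Return {week: running total}, preserving the table's week order."""
    totals = accumulate(weekly.values())
    return dict(zip(weekly.keys(), totals))


if __name__ == "__main__":
    # Print week, weekly commits, and running total side by side.
    for week, total in cumulative_commits(WEEKLY_COMMITS).items():
        print(f"{week}  {WEEKLY_COMMITS[week]:>4}  {total:>6,}")
```

Re-running a check like this after each weekly update is one way to keep the last cumulative value in step with the Total commits figure in the Summary.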
Visual Progress
The following chart shows weekly commit volume. Each # represents approximately 7.5 commits, matching the axis scale of 75 commits per ten columns.
W06 |###### | 47
W07 |############# | 99
W08 |####################### | 173
W09 |###################### | 167
W10 |############################# | 221
W11 |######################################## | 303
W12 |################ | 127
W13 |########## | 79
W14 |######### | 73
W15 |######## | 62
W16 |####### | 57
W17 |######################## | 180
W18 |##### | 41
W19 |########################### | 204
W20 |##################### | 157
W21 |############## | 110
W22 |############################################| 332
0 75 150 225 300
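A chart of this kind can be regenerated from the weekly counts instead of being edited by hand. The sketch below is illustrative only: the function names are hypothetical, and the scale of 7.5 commits per block is an assumption derived from the axis, where 75 commits span ten columns. Because it rounds down, individual bars may differ by one block from the chart above.

```python
def render_bar(label, commits, commits_per_block=7.5, width=44):
    """Render one chart row, e.g. 'W06 |######   ...| 47'.

    commits_per_block is an assumed scale (75 commits per ten columns);
    width is the number of columns between the two pipe characters.
    """
    # Round down to whole blocks and never overflow the chart width.
    blocks = max(0, min(int(commits / commits_per_block), width))
    bar = "#" * blocks
    return f"{label} |{bar:<{width}}| {commits}"


def render_chart(weekly, **kwargs):
    """Render every week in order, one row per line."""
    return "\n".join(render_bar(week, n, **kwargs) for week, n in weekly.items())


if __name__ == "__main__":
    sample = {"W06": 47, "W11": 303, "W18": 41, "W22": 332}
    print(render_chart(sample))
```

Keeping the scale as a parameter means the same function still works if a future week exceeds the current chart width: raising `commits_per_block` rescales every bar consistently.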
Cumulative Progress
The system grew incrementally, with each week adding distinct capabilities on top of the previous foundation.
| Week | System Capabilities at End of Week |
|---|---|
| W06 | Basic RAG pipeline operational: document ingestion, pgvector embeddings, simple vector search, PostgreSQL schema, FastAPI skeleton, React frontend shell |
| W07 | Hybrid search (vector + BM25 via RRF), BGE-M3 embedding model, initial safety layer with medical advice detection, user authentication |
| W08 | Intent classification (navigational vs. informational vs. medical), LLM-based query rewriting, cross-encoder reranking, response quality gates |
| W09 | Knowledge graph with entity extraction, hospital taxonomy (doctors, departments, conditions, treatments), entity-aware retrieval |
| W10 | Graph RAG integration, SNOMED CT medical terminology mapping, golden evaluation framework (299 questions), automated regression testing |
| W11 | Draft/publish content workflow, pipeline wizard for bulk processing, fuzzy entity deduplication, eval pass rate rising from a 95.1% baseline to 99% |
| W12 | PDF brochure corpus (573 documents), content-level deduplication, chunk quality improvements, scaling fixes for large document sets |
| W13 | Hospital-agnostic architecture (multi-tenant ready), taxonomy deduplication (12,997 to 2,663 entities), database-backed configuration |
| W14 | Security hardening (input validation, rate limiting), type safety improvements, code review remediation across 45+ files |
| W15 | Clarifying question system for ambiguous queries, production debugging and stability improvements, ambiguity detection pipeline |
| W16 | First voice channel reaching pilot: LiveKit Agents worker + Twilio Elastic SIP gateway, Deepgram Nova-3 STT, ElevenLabs Multilingual v2 TTS, nightly auto-ingest live (INGEST_MODE=auto, 03:00 UTC), CI pipeline green end-to-end |
| W17 | Voice dialogue manager (built, then later removed): 6-tool dispatcher, system prompt, frustration ladder, FAQ children, orchestrator integration, 15 integration tests. Live-LLM end-to-end scenarios harness for 8 dialogue flows. Voice path treated as stateful conversation, not stateless Q&A |
| W18 | Architectural simplification (~7,000 LOC deleted): legacy 8-stage VoiceOrchestrator + dialogue-manager + speculative-STT cache + preprocessor LLM + safety gate + conversational-intent resolver + 17 legacy tests. Thin pipeline (regex pre-filter → FAQ → RAG) becomes the only production behaviour on every channel |
| W19 | Tenant overlay system shipped: multi-tenant FAQ + STT phonetic recovery + DB-driven answer renderers, zero duplicated tenant data. Twilio Phase A SIP integration. Pilot-review documentation pass — 5-phase rewrite producing ~7K LOC of documentation across architecture, safety, voice, compendium, positioning, methodology |
| W20 | Comparison RCA against MedChat (50-Q benchmark): 87.5 avg / 3 wins / 21 losses → 91.1 avg / 23 wins / 7 losses / 0 P0 regressions via 7 RAG fixes. Autonomous latency wave: O3/O4/O5/O10/O12/O16 (~700 ms saved per call, pydantic-ai removed). Methodology v2.3 ratified: Decision-Cost Rubric (6 axes) + Brainstorm Gate (Pre-Mortem Block) |
| W21 | LLM-first agentic voice (ADR-0053): native OpenAI streaming-with-tools, single call per tool-decision iteration. 4-hotfix cascade survived. Voice quality refit (Rule 4.5 no-repeated-clarifications, temperature 0, 80-term Belgian-Dutch STT phonetic sweep, Rule 6.5 procedure explanations, tier-1 session rate limit). Voice operator runbook + diagnostic infrastructure (trace/replay/SLO) ends seven-week reactive prompt-cycle. 88/89 voice golden eval verdict. First SLO-discipline win: phantom safety bug caught before shipping a regression-prone prompt rule |
| W22 | Multi-tenant chat operable end-to-end: per-request slug→tenant resolution (fail-closed), admin Chat-URL configuration, per-tenant ingest language filter, and Romanian as the first non-Dutch locale (code + layer-3 safety parity, dormant pending native-speaker review). Safety boundary hardened: the synthesis decision moves from a punctuation-fragile regex to a calibrated embedding-similarity gate (threshold 0.72, default ON); grounded medical-dosing answers synthesise from vetted brochures and carry citations across both render paths. Cache becomes channel-aware (no voice answer served to web) with an operator control panel. Chat and voice share one capability registry; the concurrent-session HTTP 500 is resolved. Chat prompt de-contaminated (+30% detail, byte-identical safety). Voice locks turn-1 language and answers per-caller-language. Public chat gains history persistence + mobile layout. Documentation consolidated: canonical Glossary, SNOMED golden-pages, six-group sidebar, Docusaurus v3 admonition fix across 155+ blocks |
How This Is Measured
Commits as a Proxy for Effort
Git commits serve as the primary effort proxy in this project. While commits are an imperfect measure of time, they correlate well with active development sessions in an AI-assisted workflow for the following reasons:
- Session-driven development: Each working session consists of a human architect prompting Claude Code with design instructions. A typical session lasts 2--4 hours and produces 15--40 commits, depending on whether the work involves greenfield features (more commits) or debugging/refactoring (fewer commits).
- Atomic commits: The AI pair programmer produces small, atomic commits -- one per logical change -- rather than large monolithic commits. This makes commit count a more granular measure than in traditional development.
- Release notes: Each significant session is documented in release notes, providing a secondary source for effort validation.
Estimation Methodology
Prompting hours are estimated by categorizing weeks into three tiers:
| Tier | Commits/Week | Estimated Hours/Week | Rationale |
|---|---|---|---|
| High intensity | 150 or more | 15--20 | Multiple long sessions, greenfield feature development |
| Medium intensity | 80--149 | 8--12 | Mixed feature work, refinement, and testing |
| Lower intensity | Under 80 | 5--8 | Focused debugging, review, or short-week periods |
These estimates are conservative. They count only active prompting time -- the hours during which the human architect is actively instructing the AI. They exclude time spent on design thinking, reading documentation, reviewing outputs, or writing specifications outside of the AI coding sessions.
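The tier lookup can be sketched as a small function. This is an illustration under stated assumptions, not the procedure actually used to fill the weekly table: the `Tier` type and `estimate_hours` function are hypothetical, and the top and bottom tiers are treated as open-ended so that weeks such as W22 (332 commits) and W18 (41 commits) still map to a tier. The weekly table is not purely mechanical either: W13, with 79 commits, is listed at 8--12 hours.

```python
from typing import NamedTuple


class Tier(NamedTuple):
    name: str
    min_commits: int  # inclusive lower bound on commits per week
    hours_low: int
    hours_high: int


# Ordered from highest to lowest threshold. Treating the first and last
# tiers as open-ended is an assumption made for this sketch.
TIERS = [
    Tier("High intensity", 150, 15, 20),
    Tier("Medium intensity", 80, 8, 12),
    Tier("Lower intensity", 0, 5, 8),
]


def estimate_hours(commits):
    """Return the first tier whose threshold the weekly commit count meets."""
    if commits < 0:
        raise ValueError("commit count must be non-negative")
    for tier in TIERS:
        if commits >= tier.min_commits:
            return tier
    raise AssertionError("unreachable: the last tier has a threshold of 0")


if __name__ == "__main__":
    for week, commits in {"W11": 303, "W12": 127, "W15": 62}.items():
        tier = estimate_hours(commits)
        print(f"{week}: {commits} commits -> {tier.name}, "
              f"{tier.hours_low}--{tier.hours_high} h")
```

Returning the whole tier rather than a single number keeps the estimate as a range, which matches how the page reports hours.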
What These Hours Represent
In the AI-assisted development model, a single prompting hour is significantly more productive than a traditional solo development hour. The human architect focuses exclusively on what to build and why, while the AI handles the how: writing code, tests, migrations, and documentation. As a result, the estimated 190--250 prompting hours over ~17 weeks produced output that would traditionally require substantially more engineering time: a production RAG system with a voice channel, multi-tenant overlays, structured-output validation, ADR-backed architectural decisions (49+ ADRs), a 299-question golden eval harness, a 10-persona voice eval harness, ~251 documentation pages, and an SLO-discipline operator runbook with diagnostic tooling.
This is not a claim about replacing developers. It is an observation that the human-AI pair programming model shifts the human role from writing code to directing code generation, which changes the relationship between hours spent and output produced.