Five Fatal Assumptions: Why T-Shirt Sizing Systematically Fails for AI Projects

Notice: This research summary and analysis were automatically generated using AI technology. For accuracy, please refer to the original arXiv source.

Agile estimation techniques, particularly T-shirt sizing, are widely used in software development for their simplicity and utility in scoping work. However, when we apply these methods to artificial intelligence initiatives – especially those involving large language models (LLMs) and multi-agent systems – the results can be systematically misleading. This paper shares an evidence-backed analysis of five foundational assumptions we often make during T-shirt sizing. While these assumptions usually hold true for traditional software, they tend to fail in AI contexts: (1) linear effort scaling, (2) repeatability from prior experience, (3) effort-duration fungibility, (4) task decomposability, and (5) deterministic completion criteria. Drawing on recent research into multi-agent system failures, scaling principles, and the inherent unreliability of multi-turn conversations, we show how AI development breaks these rules. We see this through non-linear performance jumps, complex interaction surfaces, and “tight coupling” where a small change in data cascades through the entire stack. To help teams navigate this, we propose Checkpoint Sizing: a more human-centric, iterative approach that uses explicit decision gates where scope and feasibility are reassessed based on what we learn during development, rather than what we assumed at the start. This paper is intended for engineering managers, technical leads, and product owners responsible for planning and delivering AI initiatives.


💡 Research Summary

The paper “Five Fatal Assumptions: Why T‑Shirt Sizing Systematically Fails for AI Projects” investigates why a popular agile estimation technique—T‑shirt sizing (categorizing work as Small, Medium, Large, Extra‑Large)—produces systematically misleading estimates when applied to modern artificial‑intelligence initiatives, especially those involving large language models (LLMs) and multi‑agent systems. The authors identify five implicit assumptions that underpin T‑shirt sizing in traditional software development and demonstrate, with empirical evidence from recent AI literature, how each assumption collapses in AI contexts.

  1. Linear Effort Scaling – Traditional software assumes effort grows roughly proportionally with feature size. In AI, scaling laws show that moving from 85 % to 95 % model accuracy can require an order‑of‑magnitude increase in data, compute, and experimentation. Multi‑agent systems further exacerbate this by introducing combinatorial interaction complexity that grows as N × (N‑1), where N is the number of agents. Consequently, a modest increase in scope can explode effort requirements.
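The N × (N−1) growth above is easy to see with a toy calculation (an illustrative sketch, not code from the paper): each agent in a fully connected multi-agent system can message every other agent, so the number of directed interaction channels grows quadratically.

```python
def interaction_channels(n_agents: int) -> int:
    """Directed communication channels in a fully connected
    multi-agent system: each agent can message every other agent."""
    return n_agents * (n_agents - 1)

# Doubling the agent count roughly quadruples the interaction surface:
for n in (2, 4, 8, 16):
    print(n, interaction_channels(n))
# 2 -> 2, 4 -> 12, 8 -> 56, 16 -> 240
```

Going from 2 agents to 16 is an 8× increase in "scope" by headcount, but a 120× increase in interaction channels to test and debug, which is why a modest scope change can explode effort.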

  2. Repeatability from Prior Experience – Software teams rely on past projects to predict effort. AI projects, however, are dominated by data‑centric uncertainty. Each dataset brings unique distributional characteristics, annotation quality, and edge‑case density. Moreover, LLMs exhibit a 39 % performance drop in multi‑turn conversations compared with single‑turn settings, and context‑window decay after 20+ turns introduces “unknown unknowns” that only surface late in the development cycle.
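One planning heuristic a team might derive from the 39 % figure (assuming it is a relative drop; the numbers and function are illustrative, not from the paper) is to discount single-turn benchmark scores before committing to multi-turn product targets:

```python
def multi_turn_estimate(single_turn_score: float,
                        relative_drop: float = 0.39) -> float:
    """Rough planning heuristic: discount a single-turn benchmark score
    by an assumed relative multi-turn degradation factor."""
    return single_turn_score * (1.0 - relative_drop)

# A model scoring 0.90 on single-turn evals may land near 0.55 in
# multi-turn use -- far from what the prior-experience baseline predicts.
print(round(multi_turn_estimate(0.90), 3))
```

The point is not the exact discount but that prior single-turn experience is a biased predictor of multi-turn effort.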

  3. Effort‑Duration Fungibility – In software, adding engineers can compress schedule (within limits). AI pipelines contain mandatory sequential phases—data collection → cleaning → model training → hyper‑parameter tuning → evaluation → deployment—that cannot be parallelized regardless of headcount. Compute latency, GPU allocation limits, and API rate caps create hard time floors that additional staff cannot eliminate.
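The schedule floor argument can be sketched numerically (phase durations are illustrative assumptions, not figures from the paper): summing the mandatory sequential phases gives a minimum calendar time that the naive effort-divided-by-headcount estimate cannot reach.

```python
# Illustrative per-phase durations in days; each phase depends on the
# previous one's output, so they cannot overlap.
SEQUENTIAL_PHASES = {
    "data collection": 10,
    "cleaning": 5,
    "model training": 7,
    "hyperparameter tuning": 6,
    "evaluation": 4,
    "deployment": 3,
}

def schedule_floor(phases: dict) -> int:
    """Minimum calendar time: sequential phases sum regardless of headcount."""
    return sum(phases.values())

def naive_estimate(total_effort_days: int, engineers: int) -> float:
    """The fungibility assumption: calendar time = effort / headcount."""
    return total_effort_days / engineers

floor = schedule_floor(SEQUENTIAL_PHASES)
naive = naive_estimate(sum(SEQUENTIAL_PHASES.values()), engineers=5)
print(floor, naive)  # 35-day floor vs. a 7-day "estimate" from 5 engineers
```

Compute latency, GPU queues, and API rate caps add further floors on top of this sum; more staff changes neither.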

  4. Task Decomposability – Traditional projects can be broken into loosely coupled components that are developed in parallel. AI systems are tightly coupled across data engineering, model architecture, and prompt engineering. A change in data schema may trigger model retraining; a verbose agent can exhaust a shared token window, starving downstream agents. These interdependencies invalidate additive estimates based on isolated components.
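The token-window starvation described above can be modeled with a toy pipeline (budget and per-agent usage are illustrative assumptions): agents drawing from a shared context budget are coupled, so one verbose agent degrades others that were estimated in isolation.

```python
SHARED_TOKEN_BUDGET = 8_000  # illustrative shared context window

def run_pipeline(agent_outputs, budget=SHARED_TOKEN_BUDGET):
    """Return (served, starved) agent names given per-agent token usage
    drawn from one shared budget, in pipeline order."""
    served, starved = [], []
    remaining = budget
    for name, tokens in agent_outputs:
        if tokens <= remaining:
            remaining -= tokens
            served.append(name)
        else:
            starved.append(name)
    return served, starved

# A verbose planner leaves no room for the reviewer downstream:
served, starved = run_pipeline(
    [("planner", 7_500), ("coder", 400), ("reviewer", 300)]
)
print(served, starved)  # ['planner', 'coder'] ['reviewer']
```

Sizing "planner", "coder", and "reviewer" separately and adding the estimates misses exactly this failure mode, because the coupling lives in the shared resource, not in any single component.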

  5. Deterministic Completion Criteria – Software “Definition of Done” is stable: tests pass, specs are met, code ships. AI projects have moving goalposts: after meeting accuracy targets, legal reviews may reject hallucination rates; ethical audits may uncover bias; multi‑turn performance degradation may require additional mitigation. The notion of “done” therefore evolves throughout the project, breaking the assumption of a fixed completion point.

The authors argue that these five “fatal” assumptions cause T‑shirt sizing to systematically under‑estimate scope, over‑promise timelines, and fail to capture risk in AI work. To address this, they propose Checkpoint Sizing, an iterative, gate‑based estimation framework. Instead of a single upfront size, the project is divided into checkpoints where explicit decision gates evaluate data quality, model performance, resource consumption, and compliance (ethical, legal, safety). At each gate, teams reassess scope and schedule based on the latest empirical evidence, allowing estimates to be continuously refined.
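A decision gate of the kind described could be sketched as a set of predicates over observed metrics (the metric names and thresholds here are hypothetical illustrations, not the paper's specification):

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

@dataclass
class Gate:
    """A Checkpoint Sizing decision gate: named checks over observed metrics."""
    name: str
    checks: Dict[str, Callable[[float], bool]]

    def evaluate(self, metrics: Dict[str, float]) -> Tuple[bool, List[str]]:
        """Return (passed, failing_metric_names); a missing metric fails."""
        failures = [m for m, ok in self.checks.items()
                    if m not in metrics or not ok(metrics[m])]
        return (not failures, failures)

# Hypothetical pre-deployment gate covering performance, compliance,
# and resource consumption:
gate = Gate(
    name="pre-deployment",
    checks={
        "accuracy": lambda v: v >= 0.92,
        "hallucination_rate": lambda v: v <= 0.02,
        "gpu_budget_used": lambda v: v <= 1.0,
    },
)

passed, failures = gate.evaluate(
    {"accuracy": 0.94, "hallucination_rate": 0.05, "gpu_budget_used": 0.8}
)
print(passed, failures)  # False ['hallucination_rate']
```

A failed gate triggers the re-scoping conversation: the accuracy target was met, but the hallucination rate forces the team to revise the estimate with evidence in hand rather than ship on the original size.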

The paper contributes (1) a taxonomy of the five fatal assumptions, (2) a literature‑backed empirical grounding for each failure mode, (3) quantitative metrics (e.g., N(N‑1) interaction growth, 39 % multi‑turn degradation), and (4) the Checkpoint Sizing methodology as a concrete alternative for AI project planning. The authors conclude that while T‑shirt sizing remains useful for traditional software, AI development demands a more human‑centric, feedback‑driven approach. Future work is suggested to validate Checkpoint Sizing in real‑world industry settings and to develop automated metrics for gate evaluation.

