📝 Original Info
- Title:
- ArXiv ID: 2512.21316
- Date:
- Authors: Unknown
📝 Abstract
This paper derives "Scaling Laws for Economic Impacts": empirical relationships between the training compute of Large Language Models (LLMs) and professional productivity. In a preregistered experiment, over 500 consultants, data analysts, and managers completed professional tasks using one of 13 LLMs. We find that each year of model progress reduced task time by 8%, with roughly 58% of the gains driven by increased compute and 42% by algorithmic progress. However, productivity gains were significantly larger for non-agentic analytical tasks than for agentic workflows requiring tool use. These findings suggest continued model scaling could boost U.S. productivity by approximately 20% over the next decade.
📄 Full Content
To address this, we conducted a large-scale randomized controlled trial (RCT) involving over 500 professionals across three high-skill domains: management, data analysis, and consulting. Participants were tasked with completing complex workflows designed to be representative of their professions, ranging from strategic report writing and statistical hypothesis testing to tasks requiring multi-step tool use such as creating presentation slides or Gantt charts. Workers were randomly assigned to a control group or to a treatment group equipped with one of thirteen different LLMs, spanning various compute scales and release dates. We utilized high-powered incentives, including bonus payments that doubled base earnings for high-quality submissions as evaluated by expert peer graders.

Our primary contribution is the derivation of "Scaling Laws for Economic Impacts", quantifying the relationship between model inputs and professional productivity. We decompose the progress of frontier AI into two distinct factors: the scaling of training compute and algorithmic innovation (e.g., architectural improvements and better training data). First, we identify a robust "calendar-time" scaling effect, which captures the aggregate of both factors, where each year of frontier model progress is associated with an 8% reduction in task completion time (p < 0.05). Second, isolating the effect of scale, we find that a tenfold (10x) increase in model training compute is associated with a 6.3% reduction in time taken. When decomposing these gains, we find that approximately 42% of the observed improvement is attributable to algorithmic progress over time, while the remainder is driven by pure compute scaling.
We further establish the baseline "AI premium" by pooling all treatment groups. Access to any AI model increased base Earnings Per Minute (EPM) by 81.3% (p = 0.001) and raised expert-assessed quality by 0.34 standard deviations. The compounding effects of speed and quality resulted in a 146% increase in Total Earnings Per Minute (TEPM), inclusive of performance bonuses, with nearly equal contributions from greater speed (52.6%) and higher quality (47.4%).
However, we uncover significant heterogeneity in these gains across task types. While AI assistance delivered total earnings gains of $1.58 per minute on non-agentic, analytical tasks, the gain fell to just $0.34 per minute for "agentic" tasks requiring multi-step interactions with external tools, a disparity significant at the 5% level (p = 0.043). This suggests that while current scaling paradigms are rapidly commoditizing analytical cognition, the productivity frontier for tasks requiring procedural agency currently remains significantly more resistant to automation.¹
Next, we investigate scaling laws for quality. We find a striking divergence: while the quality of autonomous model output scales linearly with compute, the quality of human-assisted output remains stagnant across model generations. This implies that human users effectively cap the realized capabilities of frontier models, satisficing at a fixed quality threshold rather than maximizing the tool's potential.
Finally, we utilize these experimental elasticities within an aggregate growth framework (Acemoglu, 2024). We estimate that continued model scaling could boost U.S. productivity by approximately 20% over the next decade, assuming marginal costs of inference remain low. This figure significantly exceeds prior conservative estimates by explicitly incorporating the dynamic gains from predictable advancements in model compute, rather than treating AI capabilities as fixed.
The remainder of the paper is organized as follows. Section 2 reviews the related literature and Section 3 outlines the experimental methodology. Section 4 presents the experimental results, establishing the baseline AI premium, deriving the economic scaling laws, and decomposing the drivers of productivity growth. Section 5 uses these elasticities to estimate aggregate productivity gains for the U.S. economy and Section 6 concludes. Regression outputs for all figures, as well as supplementary regressions, are available in Appendix A, and all tasks completed by experiment participants can be found in Appendix B.
This paper contributes to three emerging strands of literature at the intersection of artificial intelligence and labor economics. First, we build upon a rapidly expanding body of experimental evidence documenting the productivity “uplift” of generative AI across diverse domains. Significant productivity gains have been documented for legal analysis by Choi, Monahan, and Schwarcz (2024), and for software engineering by Peng et al. (2023) and Cui et al. (2025). In the domain of high-skill knowledge work, Dell’Acqua et al. (2023) find substantial quality and speed improvements for management consultants. Similar effects are observed for professional writing by Noy and Zhang (2023), call center operations by Brynjolfsson, Li, and Raymond (2025), translation by Merali (2024), and entrepreneurship by Otis et al. (2024). Our work unifies these domain-specific findings by applying a consistent experimental design across multiple professions and, crucially, across thirteen distinct models to measure the elasticity of these uplifts with respect to model capabilities.
Second, we connect the economic literature to the study of AI scaling and evaluation. While Kaplan et al. (2020) and Hoffmann et al. (2022) have established rigorous scaling laws for model perplexity, the relationship between model size and downstream economic utility has remained under-explored. Our approach responds to calls from the machine learning community for more rigorous "Centaur Evaluations", benchmarks where humans and AI solve tasks cooperatively, as advocated by Haupt and Brynjolfsson (2025). By measuring the joint productivity of the human-AI system, we provide an economic counterweight to standard "imitation games" that test models in isolation. Furthermore, our findings speak to critiques of benchmark-centric evaluation by Kulveit et al. (2025) and Raji et al. (2021), who argue that leaderboard-style, model-in-isolation benchmarks can misrepresent real-world performance and can steer research away from human-AI complementarity in realistic workflows.
Finally, this paper contributes to the development of economically grounded benchmarks for AI progress. A new wave of evaluations has attempted to quantify the potential economic utility of frontier models, including OpenAI's "GDPval" (OpenAI, 2025), Mercor's "AI Productivity Index" (APEX) (Vidgen et al., 2025), and technical benchmarks such as SWE-bench (Jimenez et al., 2024) and GAIA (Mialon et al., 2023). However, these frameworks generally evaluate model outputs in isolation (even when graded by experts or judge models), rather than measuring the productivity of human-AI teams. By contrast, our results constitute a benchmark of human-augmented productivity. By quantifying the elasticity of real-world economic output, measured via both speed and quality, with respect to model capabilities, we provide a direct metric of the realized marginal productivity of labor in the presence of frontier technology. This offers a necessary complement to automated evaluations, ensuring that economic forecasting is grounded in the actual, rather than theoretical, deployment of frontier models.
For this experiment, over 500 participants were recruited, primarily through the online research platform Prolific, with a smaller subset of experienced professionals recruited directly. These participants were stratified roughly evenly across three high-skill domains: Management, Consulting, and Data Analysis. All participants underwent a rigorous multi-stage screening process to ensure professional competency. Eligibility requirements included a self-reported annual salary exceeding $40,000, at least one year of professional experience in their respective field, and high self-rated proficiency in key professional skills. Additionally, participants were required to pass a demanding screening survey containing strict attention checks and competency verifications, which resulted in a screen-out rate of approximately 90% of initial applicants.
The final pool of participants was highly experienced. Analysis of the sample indicates that over 79% of participants had at least three years of professional experience, with nearly 47% possessing five or more years of experience in their field. In self-reported assessments, participants rated their overall professional abilities at an average of 4.13 out of 5. Furthermore, they reported moderate familiarity with AI tools, rating their familiarity at 3.62 out of 5 and their ability to use such tools at 3.56 out of 5.
Participants completed one of three tasks specific to their profession during the experiment. These tasks were designed to simulate real-world workflows and were categorized into “Agentic” tasks (requiring multi-step reasoning and tool use) and “Non-Agentic” tasks (focused on analysis and writing). The tasks covered a diverse range of professional activities, ranging from revising financial expansion reports and creating presentation slides to conducting statistical A/B test analyses and evaluating vendor contracts. A full description of the tasks can be found in Appendix B. On average, participants took approximately 26 minutes to complete each task and rated the difficulty of these tasks at 3.18 out of 5.
Participants were given high-powered incentives to ensure high-quality effort. The base payment for completing a task was set at $15. However, participants could earn a substantial bonus of an additional $15 (doubling their total pay to $30) if their submission received a grade of 5, 6, or 7 out of 7. Grades were assigned by expert peer graders with over five years of professional experience in the relevant field. Submissions receiving a score of 0 or 1 out of 7 (typically indicating a failure to follow instructions or evidence of bot usage) were not compensated; such instances of non-compliance occurred in less than 5% of the sample.
To assist with their tasks, participants were randomly assigned to a treatment group with access to one of thirteen specific AI models, or to a control group. Access was provided through a custom-built website that allowed for the monitoring of model usage. Prior to the main tasks, participants completed a monitored practice attempt to familiarize themselves with the assigned AI model and ensure compliance with experimental protocols. Participants in the treatment group reported high engagement with the tools, rating their usage of the bot at an average of 3.59 out of 5 and its helpfulness at 3.58 out of 5.
We begin by establishing the baseline impact of AI assistance on worker productivity and output quality across all professions in our sample (Consultants, Data Analysts, and Managers). To do so, we pool the results from all thirteen AI models against the control group. The data reveals substantial productivity gains across four key dimensions.
We find that AI usage drives a statistically significant increase in base Earnings Per Minute (EPM) of $0.56 (p = 0.001), representing an 81.3% increase relative to the control group baseline. Simultaneously, AI assistance raised output quality by 0.6 points (p < 0.001), an increase of 18% or 0.34 standard deviations. When accounting for the performance bonuses awarded for high-scoring tasks (bonus payments doubled earnings and were awarded for grades of 5 or higher out of seven), this dual improvement in speed and quality compounded to raise Total Earnings Per Minute (TEPM) by $1.06 (p < 0.001), a 146% increase in overall economic value. Finally, we conclude by demonstrating that these aggregate premiums are driven by specific task types, distinguishing between the impacts on agentic versus non-agentic workflows.
First, we show the effect of AI on Earnings Per Minute through faster task completion. For tasks meeting the minimum quality threshold (a grade of at least 2 out of 7), average earnings rose from $0.69 per minute for the control group to $1.24 per minute for the treatment group, as depicted in Figure 1 below (p < 0.001), with full regression results available in Table 1 in Appendix A (all results on time taken are also in Appendix A).
Secondly, we examine how AI assistance affects output quality. Excluding only tasks that received a grade of zero, we find that access to any AI model substantially raises expert-assessed grades. In regression-adjusted terms (controlling for profession and task), the average grade increases from 3.32 without AI to 3.91 with AI, an improvement of 0.60 points on a seven-point scale (p < 0.001), corresponding to roughly one-third of a standard deviation. These results, summarized in Figure 2, indicate that the productivity gains documented above do not come at the expense of lower quality; on the contrary, AI assistance simultaneously accelerates task completion and improves output quality. Full regression results are reported in Table 2 in Appendix A.
Thirdly, we consider the impact of AI assistance on Total Earnings Per Minute (TEPM), which incorporates both base pay and performance bonuses. Workers earned a fixed $15 per task conditional on meeting the minimum quality threshold (grade ≥ 2), and an additional $15 bonus (i.e. earnings doubled) for high-quality submissions with grades of 5 or higher. Using the same paid-task sample and controlling for profession and task, AI assistance raises TEPM from $0.73 per minute to $1.79 per minute, an increase of $1.06 (p < 0.001), corresponding to a 146% increase in total earnings per unit of time. Figure 3 summarizes these regression-adjusted means, with full regression results reported in Table 3 in Appendix A. Taken together, these estimates imply that the overall TEPM premium of $1.06 per minute from AI assistance reflects both faster task completion and higher bonus rates. Roughly $0.56 per minute (52.6% of the total) comes from higher base earnings per minute (EPM), while the remaining $0.50 per minute (47.4%) is driven by improved quality that pushes more tasks above the bonus threshold. AI thus delivers economically meaningful gains along both the speed and quality margins, rather than merely shifting the composition of pay.
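To make the construction of these metrics concrete, the following is a minimal sketch in Python under the stated payment rules; the function names and example values are ours, not the paper's.

```python
# Minimal sketch of the earnings metrics under the stated payment rules:
# $15 base pay for grade >= 2, an additional $15 bonus for grade >= 5
# (earnings double), and no pay for grades 0-1. Names are illustrative.

def base_pay(grade: int) -> float:
    """Base payment for a task, per the experiment's rules."""
    return 15.0 if grade >= 2 else 0.0

def total_pay(grade: int) -> float:
    """Base payment plus the quality bonus for grades of 5 or higher."""
    return base_pay(grade) + (15.0 if grade >= 5 else 0.0)

def epm(grade: int, minutes: float) -> float:
    """Base Earnings Per Minute (EPM)."""
    return base_pay(grade) / minutes

def tepm(grade: int, minutes: float) -> float:
    """Total Earnings Per Minute (TEPM), inclusive of the bonus."""
    return total_pay(grade) / minutes

# Example: a 26-minute task (the sample average) graded 5 out of 7.
print(epm(5, 26.0))   # ~0.58 dollars per minute
print(tepm(5, 26.0))  # ~1.15 dollars per minute
```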
Finally, we examine how these effects vary across task types. We classify tasks as agentic when successful completion requires multi-step external actions and tool use (e.g. creating Gantt charts in spreadsheet software and sending them via email, or extracting information from multiple 100+ page PDFs and running statistical tests). In contrast, non-agentic tasks are primarily analytical or interpretive (e.g. hypothesis testing, interpreting results, or writing analytical reports).
Using the same specifications as above, we find much larger gains from AI usage on non-agentic tasks (regression outputs shown in Table 4 and Table 5 of Appendix A). For Earnings Per Minute, the effect of any AI usage is $0.83 per minute on non-agentic tasks (p < 0.001) but only $0.18 per minute on agentic tasks (p = 0.48), with the difference between the two significant at the 5% level (p = 0.050). For grades, AI raises quality by 0.82 points on a seven-point scale on non-agentic tasks (p < 0.001), compared to 0.27 points on agentic tasks (p = 0.31); the difference is not statistically significant (p = 0.11).
Finally, Total Earnings Per Minute displays the sharpest heterogeneity: AI increases total earnings by $1.58 per minute on non-agentic tasks (p < 0.001), but only $0.34 per minute on agentic tasks (p = 0.46), with the difference between these effects significant at the 5% level (p = 0.043). Figure 4 illustrates this heterogeneity in total earnings, and Table 6 in Appendix A reports the corresponding regressions.

Our empirical strategy therefore traces out "economic scaling laws" along two related axes. First, we estimate how log time taken, earnings per minute (EPM), and total earnings per minute (TEPM) vary with model release month, interpreting the slope as the average change in economic outcomes per year of frontier progress. We then allow these calendar-time slopes to differ by profession and by task type (agentic versus non-agentic). Second, we replace calendar time with log training compute and recover elasticities with respect to a tenfold increase in compute. Finally, we combine both months and log compute in the same specification to shed light on how much of the observed improvement in economic outcomes is associated with algorithmic progress over time versus simple scaling of compute.
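A minimal sketch of these three specifications, assuming synthetic data and hypothetical column names (the paper's regression code is not published), might look as follows; the robust (HC1) standard errors mirror the tables in Appendix A.

```python
# Sketch of the three scaling-law specifications described above, using
# synthetic data; column names and coefficients are illustrative only.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "release_month": rng.integers(0, 30, n),       # months since Nov 2022
    "log_compute": rng.uniform(23.0, 26.0, n),     # log10 training FLOP
    "profession": rng.choice(["consulting", "data", "management"], n),
    "task": rng.choice(["task1", "task2", "task3"], n),
})
df["log_time"] = 3.25 - 0.007 * df["release_month"] + rng.normal(0, 0.4, n)

# (1) Calendar time: slope = change in log time per month of frontier progress.
m_calendar = smf.ols("log_time ~ release_month + C(profession) + C(task)",
                     data=df).fit(cov_type="HC1")

# (2) Compute: slope = change in log time per tenfold increase in compute.
m_compute = smf.ols("log_time ~ log_compute + C(profession) + C(task)",
                    data=df).fit(cov_type="HC1")

# (3) Both together: separates algorithmic progress (months) from pure scaling.
m_both = smf.ols("log_time ~ release_month + log_compute + C(profession) + C(task)",
                 data=df).fit(cov_type="HC1")

print(m_calendar.params["release_month"])  # monthly log-time slope
```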
We first ask whether the economic gains from AI assistance improve as frontier models become more capable over calendar time. The magnitudes are economically large. Moving one year forward in model release date is associated with roughly an 8% reduction in time taken to complete a task, as shown in Figure 5 below (p ≈ 0.04; full regression results and a breakdown of the scaling law by profession can be found in Table 9 and Figure A2 of Appendix A, respectively).² Adding the full set of background controls (demographics, abilities, AI familiarity and use, and country dummies) leaves this estimate essentially unchanged. Given that the average task in the human-only group takes around 25 minutes, this corresponds to workers completing the same tasks several minutes faster per year purely from frontier progress in the underlying models.

Given the substantial reductions in task completion time documented above, it is perhaps unsurprising that we observe a corresponding scaling law for Earnings Per Minute (EPM). As base pay is fixed per task (conditional on meeting the quality threshold) and excludes performance bonuses, EPM is functionally the inverse of time taken. As illustrated in Figure 6, the economic value generated by workers scales linearly with the release date of the model they utilize. Our regression estimates indicate that each month of frontier model progress is associated with an increase in base earnings of $0.019 per minute (p < 0.01). Aggregated over a year, this implies that algorithmic progress and increased model training compute together raise the value of professional output by approximately $14 per hour annually. These results, detailed in Table 10 in Appendix A, confirm that the time savings afforded by newer models translate directly into higher hourly productivity rates.
Next, we turn to the most comprehensive metric of economic productivity: Total Earnings Per Minute (TEPM). As shown in Figure 7, the economic scaling law is steepest here: our estimates indicate that for every month of frontier model progress, a worker's total earnings capacity increases by $0.037 per minute (p < 0.001), or an effective hourly wage increase of approximately $26.30 per year. However, it is crucial to interpret the magnitude of this slope in the context of the incentive structure, where the performance bonus (for submissions with grades 5 or higher out of 7) doubles the effective piece rate. We observe that the TEPM coefficient (0.037) is approximately twice the magnitude of the base EPM coefficient (0.019), both figures reported in Tables 10 and 11 of Appendix A respectively. This suggests that the compounding value of frontier models is driven primarily by the acceleration of workflows, simply allowing workers to complete high-value, bonus-eligible tasks at a faster rate, rather than by a distinct scaling law that drastically alters the probability of achieving a bonus.
Finally, disaggregating these results by task structure reveals two distinct scaling laws. As shown in Figure 8, both task types benefit from frontier model progress, but the rate of improvement differs significantly. For non-agentic tasks, the scaling curve is steep: the estimated coefficient implies that task completion time falls by approximately 10.7% per year of frontier model progress (p < 0.05). For agentic tasks, while the sign remains positive, the slope is considerably flatter (approximately a 4.8% reduction per year) and statistically indistinguishable from zero in this sample. Similar results are shown for earnings per minute in Figure 9 below. This divergence suggests that while scaling effectively compresses the time required for well-defined analytical and managerial workflows, the complex planning and multi-step reasoning required for agentic tasks may face more stubborn bottlenecks that recent model improvements have not yet fully overcome. Full regression results for this split are provided in Tables 16 and 17 of Appendix A.

To conclude this section, we have documented robust "economic scaling laws" across multiple dimensions: frontier model progress over time significantly reduces task completion times, increases earnings per minute, and raises total earnings capacity. We further showed that these gains are not uniform, with non-agentic workflows benefiting considerably more than complex agentic ones. However, while calendar time serves as a useful proxy for total progress, it conflates two distinct forces: the massive increase in training compute used to train newer models, and "algorithmic" advancements, such as architectural improvements, higher quality data, and training efficiency, that occur independent of scale. In the next section, we decompose the observed productivity gains into these two components to determine how much of the economic value of AI is driven by raw compute versus algorithmic ingenuity.
To better understand the drivers of the productivity gains outlined in the section above, it is useful to decompose the improvements observed over calendar time into distinct components: total progress, compute progress, and algorithmic progress. The gains shown in the previous section using calendar time represent total progress, capturing the combined effect of increased compute and better underlying algorithms.
We now switch from considering the productivity gains from model progress as a function of calendar time (including both increases in model training compute and algorithmic progress), to productivity gains from increases in model training compute alone. This is depicted in Figure 10, which illustrates the relationship between model compute and task efficiency, plotting log time taken against log training compute. The regression analysis reveals a log-linear relationship where a tenfold increase in training compute corresponds to approximately a 6% reduction in task completion time (p ≈ 0.31). Unlike the calendar-time regressions, which capture the aggregate effects of all advancements, this specification isolates the gains attributable strictly to the model's computational scale.

We quantify these contributions by first estimating the annualized rate of improvement implied by each specification. The calendar-time regression (Table 9 in Appendix A) yields a monthly log-time reduction of 0.0069, which translates to a total annualized efficiency gain of roughly 8.3% (0.0069 × 12). Economically, this represents the gross productivity growth realized in the market, capturing the sum of all technological and human advancements. Conversely, the compute specification isolates the contribution of model scaling. With training compute growing approximately 6.1× per year during the sample, and using the estimated coefficient of -0.0608 (Table 12 in Appendix A), we calculate that raw scale contributes a 4.8% reduction in task time annually (log10(6.1) × 0.0608 ≈ 0.048). This figure reflects the capital-intensive component of progress: the gains purchased strictly through larger training budgets.
Comparing these two estimates allows us to decompose the total gain into hardware and software components. By subtracting the compute effect (0.048) from the total effect (0.083), we isolate a residual of 0.035. This residual represents algorithmic progress, an economic catch-all for improvements in model architecture, software optimization, and user learning, effectively the Solow residual of AI production. In percentage terms, this decomposition suggests that compute scaling drives approximately 58% of the total reduction in time, while algorithmic advancements account for the remaining 42%. It is important to note that while the aggregate time trend is statistically significant (p < 0.05), the specific contribution of compute is estimated with less precision (p ≈ 0.31), suggesting some uncertainty in the exact split between these factors.
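The arithmetic of this decomposition can be reproduced directly from the reported coefficients; the sketch below simply restates the calculation above in Python.

```python
# Decomposition of the annual efficiency gain into compute vs. algorithms,
# reproducing the arithmetic in the text from the reported coefficients.
import math

monthly_beta = 0.0069                 # monthly log-time reduction (Table 9)
total_annual = monthly_beta * 12      # ~0.083: total annual gain in log points

compute_growth = 6.1                  # approx. annual growth factor of training compute
beta_per_10x = 0.0608                 # log-time reduction per 10x compute (Table 12)
compute_annual = math.log10(compute_growth) * beta_per_10x   # ~0.048

algo_annual = total_annual - compute_annual                  # ~0.035 residual

print(f"compute share:     {compute_annual / total_annual:.0%}")  # ~58%
print(f"algorithmic share: {algo_annual / total_annual:.0%}")     # ~42%
```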
We now turn to the second dimension of economic productivity: output quality. While the previous sections established that AI assistance dramatically reduces the time required to complete tasks, the question remains whether this speed comes at the cost of quality, or if frontier models also improve the standard of work produced. For this analysis, we focus specifically on non-agentic tasks. Because these tasks are self-contained and require no external actions, they allow us to cleanly compare three distinct modes of production: humans working alone, humans working with AI assistance, and the raw output of the AI models themselves.
We begin by establishing the baseline effect of AI access, as illustrated in Figure 11. Comparing the average grade (on a 0-7 scale) for non-agentic tasks performed by participants in the control group against those with access to any AI model reveals a clear improvement: access to AI raises the average grade from 3.52 to 4.34 (p < 0.01). This confirms that, on average, AI serves as a quality-enhancing tool, raising the floor of performance for typical consulting and data analysis tasks.
To understand what drives this improvement, we first isolate the capabilities of the technology itself. We collected "AI-only" responses for the same non-agentic tasks across a range of models, grading them using the same expert rubric applied to human participants. The relationship between these grades and the log training compute of the model is plotted in Figure 12, with full regression results in Table 30 of Appendix A.
The results reveal a robust scaling law for raw model capability. As training compute increases, the quality of the model’s autonomous output improves significantly (p < 0.01). The estimated slope suggests that a tenfold increase in compute corresponds to a 0.51-point increase in the average grade. Notably, the most capable models in our sample achieve average grades exceeding 6.0 out of 7-performance levels that are “superhuman” compared to the unassisted human average of 3.52. This confirms that the underlying “engine” of economic productivity is becoming measurably more capable with scale.
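In equation form, the estimated autonomous-quality scaling law implies the following (the intercept is left symbolic, since only the slope is reported in the text):

```latex
\text{Grade}_{\text{AI-only}}(C) \approx \alpha + 0.51 \cdot \log_{10}(C),
\qquad
\Delta \text{Grade} \approx +0.51 \text{ per tenfold increase in compute } C.
```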
However, a striking puzzle emerges when we examine how these capabilities translate into actual economic output when a human is in the loop. The average grades of human participants assisted by AI are displayed against the log compute of the model they used in Figure 13. Unlike the clear upward trend seen in the AI-only data (Figure 12) or the time-savings data discussed in Section 4.2, the scaling law for quality completely disappears (p ≈ 0.85). Whether a participant uses a small, early-generation model or a frontier giant, the final output quality remains stagnant at approximately 4.3 out of 7.

Comparing the AI-only and human-AI results reveals a complex dynamic of complementarity and substitution. For weaker models (where the raw AI grade is ∼3.3-3.7), human intervention is highly additive: participants successfully refine imperfect drafts, raising the final quality to the ∼4.3 range. However, for the strongest models (where raw AI grades reach ∼5.0-6.0), human intervention appears to be destructive. Rather than preserving or enhancing the high-quality raw output, participants actively degrade it, bringing the final grade down to the same ∼4.3 average. This suggests that humans are only able to improve model outputs that are slightly better than their own. For weaker models (which still outperform humans by themselves on average), human participants are able to meaningfully add to the quality of the output. For intermediate models, humans are unable to add to output quality at all. For the best models, however, whose output quality is well above the human average, participants are not only unable to further improve output quality but actively degrade it.

In conclusion, while scaling laws for capability are alive and well, with stronger models autonomously producing far superior work, scaling laws for realized quality are broken by the human user. The economic value of algorithmic progress is currently being lost in the "last mile" of human-computer interaction, where final output quality regresses to the same mean regardless of the sophistication of the tool the user wields.
In this section, we utilize the experimental results on the productivity gains from model scaling to estimate aggregate productivity gains from AI over the next ten years. In doing so, we leverage the framework from Acemoglu (2024). Here, a version of Hulten’s theorem (Hulten, 1978) is used to estimate aggregate productivity gains based on the fractions of tasks in the economy that are affected by AI and the average task-level cost savings.
The estimates from Acemoglu (2024) are based on the multiplication of four parameters. First, the share of tasks exposed to AI is taken from Eloundou et al. (2023), which estimates it at 19.9%. Second, as some tasks involve a combination of labor and capital, the (AI-exposure adjusted) labor share of 0.57 is used. Neither of these parameters is changed in the analysis of this section.
However, the third parameter, estimating the labor cost savings (or productivity boost) from AI, is updated significantly based on the findings of this study. Acemoglu (2024) relies on an average productivity effect size of 27%, derived from early studies of call center workers (Brynjolfsson et al., 2023) and writing tasks (Noy and Zhang, 2023). Our experimental data offers a more current and granular baseline. Pooling all thirteen models in our study, we find a baseline productivity premium, measured via Earnings Per Minute (EPM), of 81.3% (p = 0.001) relative to the control group. Even conservatively ignoring the bonus-driven "Total Earnings" metric (which showed a 146% gain), the baseline efficiency gain for professional consulting, data analysis, and management tasks is approximately triple the figure used in previous aggregate estimates.

Furthermore, we must account for the scaling of model capabilities over the next decade. Our results derive a specific "Economic Scaling Law" for time savings: each year of frontier model progress is associated with an 8% reduction in task completion time (p < 0.05). To estimate the impact over the next ten years, we extrapolate this trend forward. We model the time taken to complete a task in year n as T_n = T_start(1 − 0.08)^n. This compounding reduction in time yields a convex increase in productivity. Starting from our baseline premium of 81.3% (where AI-assisted workers are 1.81× as productive as unassisted workers), the 8% annual reduction in time implies that by year 5, the midpoint of the decade, tasks will require only 66% of the time they take with current AI models (0.92^5 ≈ 0.66). This compounds the effective productivity boost to roughly 175% relative to the human baseline.³ Using this 5-year midpoint as a representative average for the decade, we estimate an average task-level productivity boost of 175.1%.
Finally, regarding economic feasibility, we assume that automation remains profitable given that inference costs are negligible relative to professional wages and are falling rapidly. Combining these parameters, the total productivity gains from AI over the next ten years are estimated through the multiplication of three values: the share of tasks that can be automated (19.9%), multiplied by the average productivity effect derived from our scaling projections (175.1%), multiplied by the labor share of costs (57%).
Total GDP Gain = 0.199 × 1.751 × 0.57 ≈ 19.9%
This yields an estimate of approximately 20.0% productivity growth over the next decade driven by LLMs. This is significantly higher than the baseline estimate in Acemoglu (2024), reflecting both the higher starting productivity of current frontier models and the predictable accumulation of gains from continued model scaling.
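The full projection can be reproduced in a few lines; this sketch simply chains the experimental estimates through the Acemoglu (2024) aggregation described above.

```python
# Sketch of the aggregate productivity projection, chaining the experimental
# estimates through the Hulten-style aggregation of Acemoglu (2024).
baseline_premium = 1.813    # AI-assisted workers are 1.813x as productive today
annual_time_cut = 0.08      # 8% reduction in task time per year of progress

# Year-5 midpoint of the decade: tasks take only 0.92^5 ~ 66% of current time.
midpoint_boost = baseline_premium / (1 - annual_time_cut) ** 5 - 1   # ~1.751

exposed_share = 0.199       # share of tasks exposed to AI (Eloundou et al., 2023)
labor_share = 0.57          # AI-exposure-adjusted labor share

gdp_gain = exposed_share * midpoint_boost * labor_share
print(f"{gdp_gain:.1%}")    # ~19.9% productivity gain over the decade
```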
It should be noted that this calculation rests on several restrictive assumptions. It assumes the task structure of the economy remains fixed and does not model general equilibrium effects. Furthermore, it treats technological progress as exogenous. As noted in Section 2, there is growing evidence that AI accelerates R&D itself (e.g., Jumper et al., 2021). If AI scaling unlocks new categories of innovation that fundamentally alter production functions, the true aggregate gains could be substantially higher (Korinek and Suh, 2024). Thus, our estimate of 19.9% may still represent a lower bound.

³ Calculated as 1.813 / 0.92^5 − 1 ≈ 1.751. By year 10, the theoretical boost would exceed 300%, but we utilize the midpoint to represent the average effect over the decade.
This paper provides an experimental derivation of "Economic Scaling Laws": empirical elasticities mapping the computational scale and algorithmic progress of Large Language Models (LLMs) to human professional productivity. By analyzing over 500 professionals across 13 distinct models, we move beyond the binary assessment of whether AI aids productivity to the dynamic question of how rapidly this productivity gain is increasing. Our results suggest that the economic value of frontier models is increasing according to a predictable power law: a tenfold increase in training compute is associated with a 6.3% reduction in task completion time, while calendar-time progress delivers an annualized efficiency gain of approximately 8%.
Applying these elasticities to an aggregate growth framework using Hulten’s theorem suggests that the accumulation of model capabilities alone could drive significant economic gains. By geometrically extrapolating the observed time-savings over the next decade, we estimate that AI-exposed tasks could see productivity improvements exceeding 175%, contributing approximately 20% to aggregate productivity growth. Furthermore, our decomposition analysis suggests that these gains are driven by dual engines of progress: roughly 58% of the efficiency improvements are attributable to the scaling of raw compute, while the remaining 42% stem from algorithmic innovations and software optimization. These aggregate estimates, however, mask important heterogeneities. While analytical and interpretive tasks benefit robustly from model scaling, we find that “agentic” workflows requiring multi-step tool use and external actions see significantly smaller gains that are statistically indistinguishable from zero in our sample. This suggests that the translation of current model capabilities into economic value is not uniform, but rather conditional on the specific structural requirements of the task.
There are several limitations to this study. First, while our tasks were designed to be representative of professional workflows, they remain relatively short-horizon activities (taking between 20 and 60 minutes) compared to the multi-day projects common in the corporate sector. Second, our experiment focuses on individual productivity and does not capture general equilibrium effects, such as changes in wages or employment levels. Finally, our scaling laws are derived from the current paradigm of transformer-based LLMs; distinct architectural breakthroughs or bottlenecks in data availability could alter the slope of these curves in unpredictable ways. Importantly, the models provided in this study did not offer tool access or anywhere near the full range of 'agentic' capabilities currently offered by AI providers.
In conclusion, the evidence provided suggests that if historical relationships between model compute and capability hold, future generations of LLMs may have significant economic implications. Even without accounting for potential accelerations in R&D or scientific discovery, the direct application of scaling laws to professional tasks points to a substantial expansion in labor productivity.
Appendix B: Experimental Tasks

Background: You are a senior manager at TextilesConsulting Inc., specializing in textile manufacturing operations. Your team recently received a detailed request from a client considering expanding manufacturing operations to Greece. You assigned an analyst, who recently joined your team straight out of university, to produce a detailed report based on two provided IMF documents: one focused on Greece's economy and the other providing a regional European outlook.
After reviewing the analyst’s submission, you suspect that the analyst did not thoroughly consult the IMF reports, as several numbers and facts seem inaccurate or unsupported by the provided documents. You believe the analyst likely relied on an AI-generated report due to the overall low quality. The report is urgently needed, so you must personally revise the analyst’s submission for accuracy and clarity.
Your Tasks:

- Revised Analyst Report (500 words): Rewrite the analyst's original submission, ensuring that all data and analysis accurately reflect the detailed IMF documents provided. The original instructions given to the analyst are provided below, along with the analyst's draft.

- Feedback Email to Analyst (150 words): Draft a constructive feedback email addressing the analyst's report.
Analyst’s Task:
Background: The client EuroTextiles Inc., specializes in textile manufacturing, typically employing older workers and women in its plants internationally. EuroTextiles is considering expanding its operations to Europe, specifically Greece, to benefit from financial stability, competitive wages, and low-tariff access within Europe compared to Asia. You have been provided with two detailed IMF reports: one on Greece's economy and another on the broader European regional outlook.
Your Task: Prepare a 500-word summary outlining whether Greece presents favorable conditions for establishing a new textile manufacturing operation. Your recommendation should thoroughly address Greece's labor market and financial conditions. As part of this summary, specifically address the following points; give exact quantitative numbers and estimates from the report wherever possible.
Labor Market Conditions:
• Evaluate Greece's labor force participation rate, particularly focusing on older workers and women. Mention any recent policy changes influencing labor market participation, especially among older workers.
• Analyze the ease or difficulty of recruiting new workers based on unemployment rates and labor market participation.
Financial Conditions:

• Describe Greece's non-performing loan (NPL) ratio relative to the EU average and detail how it has evolved over the past five years.
• Assess the trajectory of Greece’s government debt over recent years and discuss the potential for another financial crisis.
• Provide the specific quantitative interest rate in Greece as of September 2024, comparing this rate with alternative investment locations such as Poland and the Czech Republic.
Trade and Tariff Considerations:
• Evaluate the tariff-equivalent of non-tariff barriers within Europe, providing a numerical measure comparing the difference between these barriers for EU manufacturing versus interstate commerce within the United States.
Note: All the information required to answer these questions is contained within the two provided IMF reports. Your analysis should rely exclusively on data and findings from these reports. The links to the IMF reports can be found below (please open in a new tab):
• Greece 2025 - IMF Report
• Regional Economic Outlook - Europe
Analyst’s Submission:
In evaluating the potential for establishing a textile manufacturing operation in Greece, we have analyzed the labor market and financial conditions based on IMF reports.
Labor Market Conditions: Greece’s labor force participation rate stands at 59.3% (2021), with a notable presence of older workers (65+) and women (58.4% and 49.3%, respectively, in 2021). The Greek government has implemented policies to encourage labor market participation among older workers, such as extending the retirement age and offering incentives for continued employment. Despite these efforts, high unemployment rates persist, particularly among the younger demographic, which may pose challenges in recruiting new workers.
Financial Conditions: Greece’s non-performing loan (NPL) ratio remains elevated at 36.4% (2021), significantly higher than the EU average of 3.6%. Over the past five years, the NPL ratio has shown a downward trend, indicating gradual improvement in the banking sector. However, Greece’s government debt remains high at 189.8% of GDP (2021), raising concerns about the potential for another financial crisis.
In terms of interest rates, the quantitative rate in Greece as of September 2024 is 2.5%, which is relatively competitive compared to Poland (2.3%) and the Czech Republic (2.2%).
Trade and Tariff Considerations: The tariff-equivalent of non-tariff barriers within Europe is estimated at 3.6% (2019), compared to an insignificant level within the United States for interstate commerce. This suggests that EU manufacturing may face slightly higher non-tariff barriers compared to domestic commerce in the U.S.
In conclusion, Greece presents a mixed picture for EuroTextiles Inc.’s textile manufacturing expansion. The country offers competitive wages, a relatively skilled labor force, and low-tariff access within Europe. However, challenges such as high unemployment rates, particularly among the younger demographic, and the lingering effects of a high government debt burden must be carefully considered. Additionally, the presence of non-tariff barriers within Europe may impact the competitiveness of Greek-based operations. A thorough risk assessment and strategic planning will be essential to ensure the success of this potential investment.
Your Tasks:
- Revised Analyst Report (500 words): Rewrite the analyst’s original submission, ensuring that all data and analysis accurately reflect the detailed IMF documents provided.
Data Analyst Task Three: Correlation Analysis of Inflation and Economic Growth
Background: You are a data analyst at the macroeconomic analysis desk of Global Insights Financial Ltd. Your manager recently discussed two competing theories regarding inflation’s effects on economic growth:
• Hypothesis 1: Inflation acts as a leading indicator of poor economic growth in the future.
• Hypothesis 2: Inflation directly coincides with poor economic growth in the same period.
Your manager has asked you to empirically test these theories using the provided IMF dataset on European countries.
Note: All required data for these tasks is provided in the attached IMF document. Your analysis must be based solely on this dataset.
Your Tasks:
- Correlation Analysis:
• Calculate the correlation between European countries’ inflation rates in 2024 and their projected economic growth rates in 2026 (testing Hypothesis 1).
• Calculate the correlation between European countries’ inflation rates in 2024 and their projected economic growth rates in 2024 (testing Hypothesis 2).
• Clearly report both correlation values and briefly explain your calculation methods.
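For illustration, the two correlations could be computed as in the sketch below; the column names and placeholder values are hypothetical, as the actual IMF dataset is provided separately to participants.

```python
# Sketch of the two correlation tests; data and column names are placeholders.
import pandas as pd

df = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "inflation_2024": [2.1, 5.4, 3.2, 8.0],
    "growth_2024": [1.8, 0.6, 1.2, -0.3],
    "growth_2026": [2.0, 1.1, 1.6, 0.4],
})

# Hypothesis 1: inflation today vs. projected growth two years ahead.
r_leading = df["inflation_2024"].corr(df["growth_2026"])

# Hypothesis 2: inflation vs. growth in the same period.
r_coincident = df["inflation_2024"].corr(df["growth_2024"])

print(r_leading, r_coincident)   # Pearson correlations by default
```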
Consulting Task Two: Market Expansion (EuroTextiles Inc.)
Background: Your firm, EuroTextiles Inc., specializes in textile manufacturing, typically employing older workers and women in its plants internationally. EuroTextiles is considering expanding its operations to Europe, specifically Greece, to benefit from financial stability, competitive wages, and low-tariff access within Europe compared to Asia. You have been provided with two detailed IMF reports: one on Greece’s economy and another on the broader European regional outlook.
Your Task: Prepare a 500-word summary outlining whether Greece presents favorable conditions for establishing a new textile manufacturing operation. Your recommendation should thoroughly address Greece's labor market and financial conditions. As part of this summary, specifically address the following points; give exact quantitative numbers and estimates from the report wherever possible.
Labor Market Conditions:
• Evaluate Greece's labor force participation rate, particularly focusing on older workers and women. Mention any recent policy changes influencing labor market participation, especially among older workers.
• Analyze the ease or difficulty of recruiting new workers based on unemployment rates and labor market participation.
Financial Conditions:

• Describe Greece's non-performing loan (NPL) ratio relative to the EU average and detail how it has evolved over the past five years.
• Assess the trajectory of Greece’s government debt over recent years and discuss the potential for another financial crisis.
• Provide the specific quantitative interest rate in Greece as of September 2024, comparing this rate with alternative investment locations such as Poland and the Czech Republic.
Trade and Tariff Considerations:
• Evaluate the tariff-equivalent of non-tariff barriers within Europe, providing a numerical measure comparing the difference between these barriers for EU manufacturing versus interstate commerce within the United States.
Note: All the information required to answer these questions is contained within the two provided IMF reports. Your analysis should rely exclusively on data and findings from these reports. The links to the IMF reports can be found below:
• Greece 2025 - IMF Report
• Regional Economic Outlook - Europe
InnovateTech Solutions, a B2B software development company, specializes in building custom workflow automation tools for mid-to-large enterprises. Recently, the company has been experiencing growing competition from newer, more agile SaaS startups, putting pressure on service expectations.
You are the Senior Account Manager responsible for overseeing key client relationships. One of your most high-value clients, Atlas Logistics, which accounts for 8% of annual revenue, has canceled their contract, citing repeated failures in service delivery. Losing this client not only impacts revenue but also damages the company’s reputation in the logistics sector.
Senior leadership has tasked you with two urgent actions:
• Respond to Atlas Logistics with an email that apologizes for their experience, addresses their complaints, and proposes concrete steps to regain their trust.
• Develop an internal strategy document identifying key process failures and proposing structured improvements to prevent similar issues from recurring.
Atlas Logistics has outlined the following detailed concerns in their contract termination notice:
Unreliable Project Timelines:
• The initial contract promised a 12-week development cycle for a custom inventory tracking system. The final product was delivered 7 weeks late, causing significant disruptions to their supply chain.
• The client was informed of delays only after missed deadlines, with no advance warnings.
• A previous feature update scheduled for January was postponed twice, and the client was given contradictory explanations from different team members.
Poor Communication & Lack of Accountability:
• Atlas Logistics repeatedly struggled to get timely responses from your customer support team.
• Key points of contact (project managers) were often unavailable, requiring clients to escalate to multiple people before getting answers.
• Support ticket requests were left unresolved for days without follow-up.
• Some email responses were generic or unhelpful, failing to address their actual concerns.
Inconsistent Service Quality:
• A recent software patch introduced a critical bug that disrupted warehouse inventory tracking for 36 hours, leading to thousands of dollars in lost productivity.
• The client’s IT team was frustrated by conflicting troubleshooting advice from different members of the support team.
An internal post-mortem analysis of the account has highlighted several key process issues that contributed to Atlas Logistics’ dissatisfaction:
• 60% of project delays were due to an overstretched development team, lacking sufficient engineers to handle concurrent client requests.
• The customer support team has been understaffed, leading to response times exceeding 48 hours, frustrating clients.
• Delays were often communicated reactively, rather than proactively keeping clients informed.
• No formal check-in system exists to update clients on potential project risks.
• Different teams (sales, product, customer support) were not aligned on project status, leading to conflicting updates being sent to clients.
• Client issues were not tracked consistently-support teams and project managers often had different information, leading to confusion when addressing problems.
Expected Output:

1. Internal Strategy Document (200-300 words + Gantt chart)

• Outline three major process improvements based on identified failures and specify expected outcomes

Data Analyst Task One: A/B Test Analysis

• Calculate 95% confidence intervals for each version's conversion rate.
• Compare the results and evaluate statistical significance.
• Recommend which design (if any) should be implemented site-wide.
• Suggest next steps for further testing or improvements. Which A/B test, if any, would you recommend be conducted next?
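A normal-approximation interval is one standard way to produce the confidence intervals this task asks for; the conversion counts in the sketch below are hypothetical.

```python
# Sketch: 95% confidence interval for a conversion rate via the normal
# approximation. Conversion counts are hypothetical placeholders.
import math

def conversion_ci(conversions: int, visitors: int, z: float = 1.96):
    """Return (lower, upper) bounds of an approximate 95% CI."""
    p = conversions / visitors
    se = math.sqrt(p * (1 - p) / visitors)
    return p - z * se, p + z * se

# Example: version A converts 120/2000 visitors, version B 150/2000.
print(conversion_ci(120, 2000))   # ~(0.050, 0.070)
print(conversion_ci(150, 2000))   # ~(0.064, 0.086)
```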
Data Analyst Task Two: Evaluating the Impact of Sales Partners on Sales Success

Background: Your company, BizGrow Solutions, provides consulting services to businesses and assigns sales partners to assist in closing deals. Senior management is considering whether to expand the use of sales partners.
An initial analysis by another analyst suggested that having a sales partner significantly increases the probability of closing a deal. You have been given access to the raw data and have been asked to review the methodology and verify the findings before management makes a major strategic decision.
You have been provided with a dataset that includes:
• Whether a sale was completed (1 = Yes, 0 = No)
• Sales revenue
• The lead score for the client (a measure of how promising the client was)
• Whether the client was assigned a sales partner (1 = Yes, 0 = No)
To access the data: Copy and paste the dataset from the provided table into a tool of your choice.
A previous analyst ran the following simple linear regression model:

Y_i = β0 + β1 · X1_i + ε_i

where Y_i is a binary variable denoting whether the sale was completed or not, and X1_i is a binary variable denoting whether a sales partner was assigned or not.
They found that β1 was large (and positive) and statistically significant, concluding that assigning a sales partner strongly increases the probability of making a sale. Senior management is prepared to expand the sales partner program based on this result.
• Critique of Previous Analysis
• Identify any possible data quality issues
• Explain whether the previous analyst’s conclusions were justified based on the methodology used.
• Outline an appropriate approach for re-examining the impact of sales partners, specifying the correct statistical model and key variables to include.
Written Analysis Memo (250-300 words)
• Summarize why the original analysis may have been misleading.
• Outline a better regression approach and explain how it might change the conclusions.
• Explain the intuition for why this extra review should be conducted to a non-technical management team.

Note: You do not need to actually perform the regression analysis; only outline the steps needed to conduct a proper analysis. Senior management will use your findings to determine whether to increase, decrease, or refine the use of sales partners in future sales strategies.
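As one illustration of the point this task is testing, the sketch below simulates data in which sales partners are assigned to the most promising leads; the naive regression then overstates the partner effect, and controlling for lead score corrects it. All data and names are synthetic.

```python
# Synthetic illustration of the confounding this task is probing: partners
# are assigned to high-lead-score clients, so a naive regression of sale on
# partner picks up lead quality. Linear probability models for simplicity.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 300
lead_score = rng.uniform(0, 10, n)
partner = (lead_score + rng.normal(0, 2, n) > 5).astype(int)          # hot leads get partners
sale = (0.1 * lead_score + rng.normal(0, 0.5, n) > 0.5).astype(int)   # partner has no true effect
df = pd.DataFrame({"sale": sale, "partner": partner, "lead_score": lead_score})

naive = smf.ols("sale ~ partner", data=df).fit(cov_type="HC1")
adjusted = smf.ols("sale ~ partner + lead_score", data=df).fit(cov_type="HC1")

# The partner coefficient shrinks toward zero once lead score is controlled.
print(naive.params["partner"], adjusted.params["partner"])
```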
Consulting Task Three: Crisis Management - Reputation Damage Control

Background: NaturaCosmetics, a mid-tier skincare brand, is facing a major public relations crisis after a viral social media post alleged that one of its products caused severe skin irritation. The video, posted by a high-profile beauty influencer, has gained 2.3 million views on TikTok and is continuing to spread. As a result, online sales have dropped by 18% in the past 48 hours, and several customers have expressed concerns across social media platforms.
The company’s leadership has convened an emergency meeting to decide how to respond. While NaturaCosmetics has not had any confirmed safety violations, there have been five similar customer complaints about the PureGlow Night Cream in the past year. These incidents were handled individually but did not previously attract widespread attention.
• Consumer Trust Erosion: Without a well-executed response, this crisis could lead to long-term damage to NaturaCosmetics’ brand reputation.
• Regulatory & Legal Risks: While no official product recalls have been issued, regulatory agencies may begin an investigation if more complaints surface.
• Retail & Distribution Concerns: Large retailers that stock NaturaCosmetics products, such as Sephora and Ulta, are monitoring the situation closely and may reconsider shelf space if the controversy escalates.
• Competitor Actions: Rival skincare brands have started subtly marketing their own “dermatologist-approved” products in response to the controversy.
Internal Crisis Strategy Memo (350-500 words):
• Immediate Response Plan: Steps to contain the backlash, including engagement with customers, potential influencer partnerships, and legal considerations.
• Long-Term Reputation Recovery Strategy: Proposals for improving product credibility.
• Risk Assessment of Response Options: Evaluate different potential actions and their implications.
Your response should balance damage control with transparency, ensuring NaturaCosmetics does not escalate the crisis further while rebuilding trust with customers.
Manager Task Three: High-Stakes Vendor Selection & Negotiation

Background: MetroBuild Infrastructure, a construction project management firm, is preparing to bid on a $50 million government contract to build a new transit hub. The success of this bid depends heavily on selecting the right steel supplier, as material costs, reliability, and delivery times will directly impact the project timeline and profit margins.
As the Procurement Manager, you are responsible for selecting the best vendor from three shortlisted suppliers. Additionally, your CEO has tasked you with negotiating better payment terms before finalizing the deal.
• Project delays of more than 2 weeks will result in heavy penalties in the government contract.
• Create a Gantt chart detailing how and when these process improvements will be made
¹ An important caveat is that participants were restricted to standard chatbot interfaces with text-based input and output and limited tool access. Consequently, this design may underestimate the capabilities of current AI models, particularly regarding agentic tasks.
² The annual effect is computed as 100 × (exp(12β) − 1), where β is the monthly coefficient on months since November 2022.
International Monetary Fund. (2025). Greece: 2025 Article IV Consultation-Press Release; Staff Report; and Statement by the Executive Director for Greece (IMF Country Report No. 2025/088).
International Monetary Fund. (2024). Regional Economic Outlook: Europe. Washington, D.C.: IMF.