From Prompt to Product: A Human-Centered Benchmark of Agentic App Generation Systems

Reading time: 5 minutes
...

📝 Original Info

  • Title: From Prompt to Product: A Human-Centered Benchmark of Agentic App Generation Systems
  • ArXiv ID: 2512.18080
  • Date: 2025-12-19
  • Authors: Marcos Ortiz, Justin Hill, Collin Overbay, Ingrida Semenec, Frederic Sauve-Hoover, Jim Schwoebel, Joel Shor

📝 Abstract

Agentic AI systems capable of generating full-stack web applications from natural language prompts ("prompt-to-app") represent a significant shift in software development. However, evaluating these systems remains challenging, as visual polish, functional correctness, and user trust are often misaligned. As a result, it is unclear how existing prompt-to-app tools compare under realistic, human-centered evaluation criteria. In this paper, we introduce a human-centered benchmark for evaluating prompt-to-app systems and conduct a large-scale comparative study of three widely used platforms: Replit, Bolt, and Firebase Studio. Using a diverse set of 96 prompts spanning common web application tasks, we generate 288 unique application artifacts. We evaluate these systems through a large-scale human-rater study involving 205 participants and 1,071 quality-filtered pairwise comparisons, assessing task-based ease of use, visual appeal, perceived completeness, and user trust. Our results show that these systems are not interchangeable: Firebase Studio consistently outperforms competing platforms across all human-evaluated dimensions, achieving the highest win rates for ease of use, trust, visual appeal, and visual appropriateness. Bolt performs competitively on visual appeal but trails Firebase on usability and trust, while Replit underperforms relative to both across most metrics. These findings highlight a persistent gap between visual polish and functional reliability in prompt-to-app systems and demonstrate the necessity of interactive, task-based evaluation. We release our benchmark framework, prompt set, and generated artifacts to support reproducible evaluation and future research in agentic application generation.
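The abstract reports per-dimension win rates derived from 1,071 quality-filtered pairwise comparisons. As a rough illustration of how such pairwise preferences can be aggregated, the sketch below computes per-platform win rates with bootstrap confidence intervals. The paper's exact procedure and data schema are not reproduced here; the column names and tie handling are assumptions for illustration only.

```python
# Hypothetical aggregation of pairwise comparisons into per-dimension win rates.
# Field names ("platform_a", "platform_b", "winner", "dimension") are assumptions,
# not the paper's actual data schema.
import random
from collections import defaultdict

def win_rates(comparisons, dimension):
    """comparisons: list of dicts with platform_a, platform_b, winner, dimension."""
    wins, totals = defaultdict(int), defaultdict(int)
    for c in comparisons:
        if c["dimension"] != dimension or c["winner"] == "tie":
            continue  # ties are simply excluded in this sketch
        for p in (c["platform_a"], c["platform_b"]):
            totals[p] += 1
        wins[c["winner"]] += 1
    return {p: wins[p] / totals[p] for p in totals}

def bootstrap_ci(comparisons, dimension, platform, n_boot=2000, alpha=0.05):
    """Percentile bootstrap CI for one platform's win rate on one dimension."""
    estimates = []
    for _ in range(n_boot):
        sample = random.choices(comparisons, k=len(comparisons))
        rates = win_rates(sample, dimension)
        if platform in rates:
            estimates.append(rates[platform])
    estimates.sort()
    lo = estimates[int(alpha / 2 * len(estimates))]
    hi = estimates[int((1 - alpha / 2) * len(estimates)) - 1]
    return lo, hi
```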

💡 Deep Analysis

Figure 1: Overview of the benchmark pipeline, in which prompts are fed to each app generation system, the resulting app artifacts are deployed where possible, and human raters assess individual apps and compare apps side by side (full caption in the Full Content section below).

📄 Full Content

From Prompt to Product: A Human-Centered Benchmark of Agentic App Generation Systems

Marcos Ortiz¹,†, Justin Hill¹,†, Collin Overbay¹, Ingrida Semenec¹, Frederic Sauve-Hoover¹, Jim Schwoebel¹ and Joel Shor²,¹,*
¹Quome Inc., West Hollywood, California
²Move37 Labs, Somerville, MA, USA

Keywords: prompt-to-app, agents, agentic AI, benchmark, generative software, human-centered, low-code, no-code

1. Introduction

The advent of large language models has triggered a rapid evolution in software development tools, moving from simple code completion to sophisticated, agentic systems. A new frontier in this space is the "prompt-to-app" agent, which aims to generate entire full-stack, deployable web applications from a single natural language description. These systems promise to dramatically lower the barrier to software creation, enabling "citizen developers" and accelerating prototyping for experienced engineers. However, this promise is tempered by a significant lack of rigorous evaluation. How do we know if these generated applications are good? Are they functional? Do they produce results that real users can interact with?

Existing benchmarks for code generation typically focus on small, function-level snippets or algorithmic challenges. These are insufficient for evaluating prompt-to-app agents, which must correctly orchestrate complex interactions between a frontend, a backend, and a database, all while delivering a coherent user experience. To address this gap, we present the first large-scale, human-centered benchmark of prompt-to-app generation agents. This addresses the need for new methodologies for user-centered evaluation and real-world usefulness of agentic systems, moving beyond static benchmarks to capture the holistic developer experience.
Joint Proceedings of the ACM IUI Workshops 2026, March 23-26, 2026, Paphos, Cyprus
*Corresponding author. †These authors contributed equally.
joel.shor@move37labs.ai (J. Shor) · https://move37labs.ai/ (J. Shor)
© 2025 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Figure 1: To evaluate prompt-to-app systems, we curated a list of prompts and used each system to generate app artifacts. Those artifacts were then deployed, when possible, as live web applications. Human raters were then asked to evaluate individual apps, as well as make side-by-side comparisons between apps generated by different systems using the same prompt.

Our contributions are as follows:

  • A novel mixed-methods evaluation framework for prompt-to-app agents, combining automated checks with task-based human assessment.
  • A comparative analysis of three leading systems: Replit, Bolt, and Firebase Studio.
  • A public dataset of the 288 generated application artifacts, 96 from each platform.
  • Key findings that characterize the current state of the field:
      – Isolated evaluation of these systems proves difficult, while side-by-side comparisons yield clear user preferences.
      – It is possible for a single system to dominate across several metrics.
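Side-by-side preference data of this kind can also be turned into a single strength-based ranking; a Bradley-Terry model is a standard choice for that. The sketch below fits per-system strengths with the usual MM updates. It is an illustrative sketch only, not the authors' implementation, and the comparison counts in the usage example are hypothetical rather than the study's results.

```python
# Minimal Bradley-Terry fit via the standard MM updates (Hunter, 2004).
from collections import defaultdict

def bradley_terry(wins, n_iter=200):
    """wins[(a, b)] = number of times system a was preferred over system b."""
    items = {p for pair in wins for p in pair}
    strength = {p: 1.0 / len(items) for p in items}
    total_wins = defaultdict(float)   # total preferences won by each system
    pair_counts = defaultdict(float)  # comparisons per unordered pair
    for (a, b), w in wins.items():
        total_wins[a] += w
        pair_counts[tuple(sorted((a, b)))] += w
    for _ in range(n_iter):
        new = {}
        for i in items:
            denom = 0.0
            for j in items:
                if j == i:
                    continue
                n_ij = pair_counts.get(tuple(sorted((i, j))), 0.0)
                if n_ij:
                    denom += n_ij / (strength[i] + strength[j])
            new[i] = total_wins[i] / denom if denom else strength[i]
        norm = sum(new.values())
        strength = {p: s / norm for p, s in new.items()}
    return strength

# Hypothetical preference counts, for illustration only.
example = {
    ("Firebase Studio", "Bolt"): 60, ("Bolt", "Firebase Studio"): 40,
    ("Firebase Studio", "Replit"): 70, ("Replit", "Firebase Studio"): 30,
    ("Bolt", "Replit"): 55, ("Replit", "Bolt"): 45,
}
for system, score in sorted(bradley_terry(example).items(), key=lambda kv: -kv[1]):
    print(f"{system}: {score:.3f}")
```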


Reference

This content is AI-processed based on open access ArXiv data.
