TreeMind: Automatically Reproducing Android Bug Reports via LLM-empowered Monte Carlo Tree Search

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Automatically reproducing Android app crashes from textual bug reports is challenging, particularly when the reports are incomplete and modern UIs exhibit high combinatorial complexity. Existing approaches based solely on reinforcement learning or large language models (LLMs) exhibit limitations in such scenarios. They struggle to infer unobserved steps and reconstruct the underlying user action sequences to navigate the vast UI interaction space, primarily due to limited goal-directed reasoning and planning. We present TreeMind, a novel technique that integrates LLMs with an adapted Monte Carlo Tree Search (MCTS) algorithm to achieve strategic UI exploration in bug reproduction. To the best of our knowledge, this is the first work to combine external decision-making with LLM semantic reasoning for reliable and accurate reproduction processes. We formulate the reproduction task as a target-driven search problem, leveraging MCTS as the core planning mechanism to iteratively refine action sequences. To enhance MCTS with semantic reasoning, we introduce two LLM-guided agents with distinct roles: Expander generates top-k promising actions based on the current UI state and exploration history, while Simulator estimates the likelihood that each candidate action leads toward successful reproduction by additionally leveraging dynamic environment feedback. By incorporating multi-modal UI inputs and tailored prompting strategies, TreeMind performs feedback-aware navigation that identifies essential user actions and incrementally reconstructs reproduction paths. We evaluate TreeMind on a dataset of 93 real-world Android bug reports from three widely used benchmarks. Experimental results show that it significantly outperforms four state-of-the-art baselines, including ReBL, ReActDroid, AdbGPT, and ReproBot, in reproduction success rate.


💡 Research Summary

TreeMind tackles the long‑standing problem of automatically reproducing Android crashes from textual bug reports, especially when those reports are incomplete and the UI space is combinatorially large. The authors formulate bug reproduction as a target‑driven search problem and adopt Monte Carlo Tree Search (MCTS) as the core planner. To overcome two major challenges—(i) an explosion of possible actions in ambiguous UI states, and (ii) the high latency of full‑rollout simulations—they introduce two LLM‑driven agents that align with the expansion and simulation phases of MCTS.
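To make the search formulation concrete, here is a minimal sketch of an MCTS-style loop with pluggable expansion and one-step evaluation hooks. It is not the authors' implementation: the `Node` fields, the `search` signature, the classic UCB1 selection, and the toy three-step target are all assumptions made for illustration, and the toy `evaluate` function (which peeks at the target) merely stands in for an LLM-based estimate.

```python
import math
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

State = Tuple[str, ...]  # the action sequence executed so far (illustrative)


@dataclass
class Node:
    state: State
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)
    visits: int = 0
    total_reward: float = 0.0


def ucb(node: Node, c: float = 1.4) -> float:
    """Classic UCB1; each child is visited once at creation, so visits >= 1."""
    mean = node.total_reward / node.visits
    return mean + c * math.sqrt(math.log(node.parent.visits) / node.visits)


def backpropagate(node: Optional[Node], reward: float) -> None:
    """Add the reward to the node and every ancestor up to the root."""
    while node is not None:
        node.visits += 1
        node.total_reward += reward
        node = node.parent


def search(
    root: Node,
    expand: Callable[[State], List[State]],
    evaluate: Callable[[State], float],
    is_goal: Callable[[State], bool],
    iterations: int = 100,
) -> Optional[State]:
    for _ in range(iterations):
        node = root
        while node.children:                        # selection
            node = max(node.children, key=ucb)
        successors = expand(node.state)             # expansion
        if not successors:
            backpropagate(node, 0.0)                # dead end: lower its score
            continue
        for state in successors:
            child = Node(state=state, parent=node)
            node.children.append(child)
            if is_goal(state):                      # e.g. the crash was observed
                return state
            backpropagate(child, evaluate(state))   # one-step evaluation
    return None


# Toy stand-ins: a hidden three-step path and a prefix-match "likelihood".
TARGET: State = ("open_menu", "settings", "toggle_sync")
ACTIONS = ["open_menu", "settings", "toggle_sync", "back"]


def expand(state: State) -> List[State]:
    return [state + (a,) for a in ACTIONS] if len(state) < len(TARGET) else []


def evaluate(state: State) -> float:
    matched = 0
    for done, wanted in zip(state, TARGET):
        if done != wanted:
            break
        matched += 1
    return matched / len(TARGET)


found = search(Node(state=()), expand, evaluate, lambda s: s == TARGET)
```

In this toy setting the loop recovers the hidden action sequence. In a setup like the one summarized here, `expand` and `evaluate` would be backed by the two LLM agents rather than by hard-coded functions.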

The Expander agent receives a multimodal description of the current UI (textual hierarchy plus screenshot) together with the bug report and the exploration history. Using a carefully engineered prompt, it generates the top‑k most promising actions (e.g., clicks, text inputs, gestures) rather than enumerating the entire action space. This dramatically reduces the branching factor while keeping the search focused on semantically plausible steps.
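The expansion step can be sketched as a prompt builder plus a strict parser for the model's reply. This is a hedged illustration only: the prompt wording, the JSON action format, the `ALLOWED_TYPES` vocabulary, and the `llm` callable are hypothetical, and the screenshot input described above is omitted so the example stays text-only.

```python
import json
from typing import Callable, Dict, List

ALLOWED_TYPES = {"click", "input", "swipe"}  # assumed action vocabulary


def build_prompt(bug_report: str, widgets: List[Dict], history: List[str], k: int) -> str:
    """Assemble a text-only prompt from the report, UI widgets, and history."""
    widget_lines = "\n".join(f'- {w["id"]}: {w["text"]}' for w in widgets)
    history_lines = "\n".join(history) or "(none)"
    return (
        f"Bug report:\n{bug_report}\n\n"
        f"Visible widgets:\n{widget_lines}\n\n"
        f"Actions so far:\n{history_lines}\n\n"
        f"Return a JSON list of at most {k} actions, best first, "
        'each like {"type": "click", "target": "<widget id>"}.'
    )


def is_valid(action: object, known_ids: set) -> bool:
    """Accept only dicts with a known action type and an on-screen target."""
    if not isinstance(action, dict):
        return False
    kind, target = action.get("type"), action.get("target")
    return (
        isinstance(kind, str) and kind in ALLOWED_TYPES
        and isinstance(target, str) and target in known_ids
    )


def propose_actions(
    llm: Callable[[str], str],
    bug_report: str,
    widgets: List[Dict],
    history: List[str],
    k: int = 3,
) -> List[Dict]:
    """Ask the model for candidates and keep the first k that are executable."""
    raw = llm(build_prompt(bug_report, widgets, history, k))
    try:
        candidates = json.loads(raw)
    except json.JSONDecodeError:
        return []
    if not isinstance(candidates, list):
        return []
    known_ids = {w["id"] for w in widgets}
    return [a for a in candidates if is_valid(a, known_ids)][:k]


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call; returns a fixed reply."""
    return json.dumps([
        {"type": "click", "target": "btn_settings"},
        {"type": "click", "target": "btn_missing"},   # not on screen, dropped
        {"type": "input", "target": "field_name", "text": "demo"},
        {"type": "click", "target": "btn_save"},
    ])


widgets = [
    {"id": "btn_settings", "text": "Settings"},
    {"id": "field_name", "text": "Name"},
    {"id": "btn_save", "text": "Save"},
]
actions = propose_actions(
    fake_llm, "App crashes after saving an empty name", widgets, [], k=2
)
```

Filtering against the widgets that are actually on screen keeps hallucinated targets out of the tree, so the branching factor stays at most k per node.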

The Simulator agent executes each candidate action on a real device (or emulator), captures the resulting UI state and logs, and feeds this feedback back to the LLM. The LLM then estimates a “bug‑trigger likelihood” for that one‑step rollout. This likelihood is transformed into a proxy reward that is back‑propagated through the MCTS tree, replacing costly multi‑step random rollouts with a single, informed evaluation.
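One way to turn such a likelihood into a reward and propagate it is sketched below. The summary does not give the exact transformation, so the clamping to [0, 1], the crash override based on a logcat marker, and the `Stats` record are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Stats:
    """Per-node statistics kept by the search tree (illustrative)."""
    visits: int = 0
    mean_reward: float = 0.0


def proxy_reward(
    llm_likelihood: float,
    logcat: str,
    crash_marker: str = "FATAL EXCEPTION",
) -> float:
    """Map the LLM estimate to a reward in [0, 1].

    Assumption: an observed crash in the device log overrides the estimate.
    """
    if crash_marker in logcat:
        return 1.0
    return min(1.0, max(0.0, llm_likelihood))


def backpropagate(path: List[Stats], reward: float) -> None:
    """Update every node on the root-to-leaf path with a running mean."""
    for stats in path:
        stats.visits += 1
        stats.mean_reward += (reward - stats.mean_reward) / stats.visits


root, leaf = Stats(visits=3, mean_reward=0.2), Stats()
reward = proxy_reward(0.6, logcat="")
backpropagate([root, leaf], reward)
```

After this update the leaf holds the raw one-step estimate, while the root's mean moves only a quarter of the way toward it, which is how repeated evaluations gradually reshape the scores used during selection.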

Because LLM‑derived rewards can be noisy, the authors replace the classic deterministic Upper Confidence Bound (UCB) selection policy with a softmax‑over‑UCB strategy, turning UCB scores into a probability distribution. This stochastic selection prevents premature convergence to sub‑optimal paths and encourages broader exploration early in the search.
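A minimal sketch of softmax-over-UCB selection follows. The summary does not state the exact formula, so the UCB1 form, the exploration constant `c`, the `temperature` parameter, and the rule of visiting unvisited children first are assumptions.

```python
import math
import random
from typing import List, Sequence, Tuple


def ucb_score(mean_reward: float, visits: int, parent_visits: int, c: float = 1.4) -> float:
    """UCB1: exploitation term plus an exploration bonus."""
    if visits == 0:
        return math.inf
    return mean_reward + c * math.sqrt(math.log(parent_visits) / visits)


def softmax(scores: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax; lower temperature means greedier choices."""
    top = max(scores)
    exps = [math.exp((s - top) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def select_child(
    children: Sequence[Tuple[float, int]],   # (mean_reward, visits) per child
    parent_visits: int,
    rng: random.Random,
    temperature: float = 1.0,
) -> int:
    """Sample a child index with probability proportional to exp(UCB / T)."""
    scores = [ucb_score(m, n, parent_visits) for m, n in children]
    unvisited = [i for i, s in enumerate(scores) if math.isinf(s)]
    if unvisited:                             # avoid inf - inf inside softmax
        return rng.choice(unvisited)
    probs = softmax(scores, temperature)
    return rng.choices(range(len(children)), weights=probs, k=1)[0]


children = [(0.6, 5), (0.2, 5), (0.4, 2)]
scores = [ucb_score(m, n, parent_visits=12) for m, n in children]
probs = softmax(scores)
picked = select_child(children, parent_visits=12, rng=random.Random(0))
```

With deterministic UCB the highest-scoring child would always be chosen; sampling from `probs` instead lets lower-scored children be tried occasionally, which limits the damage when a noisy reward inflates one branch.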

The system is evaluated on 93 real-world bug reports drawn from three public benchmarks. Compared against four state-of-the-art tools (ReActDroid, ReBL, AdbGPT, and ReproBot), TreeMind achieves a 64.5% success rate, ahead of the next best method (ReActDroid at 45.2%) by 19.3 percentage points. Ablation studies show that removing either the Expander or the Simulator, or reverting to pure UCB selection, drops the success rate by 20 to 30 percentage points, confirming that each component is essential.
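As a quick sanity check on these figures, the reported percentages can be converted back into report counts. This assumes both rates are computed over all 93 reports, which the summary does not state explicitly.

```python
TOTAL_REPORTS = 93

# Reported success rates, converted to the nearest whole number of reports.
treemind_count = round(0.645 * TOTAL_REPORTS)
reactdroid_count = round(0.452 * TOTAL_REPORTS)

# Recompute the rates from the counts to confirm they round to the same values.
treemind_rate = round(100 * treemind_count / TOTAL_REPORTS, 1)
reactdroid_rate = round(100 * reactdroid_count / TOTAL_REPORTS, 1)

extra_reports = treemind_count - reactdroid_count
```

Under that assumption the rates correspond to 60 and 42 reproduced reports, a difference of 18 reports.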

Key insights include: (1) LLMs excel at semantic inference (e.g., guessing missing UI actions) but lack systematic exploration; (2) MCTS provides a principled framework for balancing exploration and exploitation in a huge, partially observable UI graph; (3) coupling the two yields a feedback‑aware planner that can recover hidden steps and converge quickly.

Limitations are acknowledged: the approach depends on the underlying LLM’s quality, the Simulator incurs real‑device latency, and the current implementation handles only single‑APK scenarios. Future work could explore larger multimodal models (e.g., GPT‑4‑Turbo), virtual UI simulators to reduce latency, online reinforcement learning from user feedback, and extensions to multi‑app or background‑foreground transitions.

Overall, TreeMind demonstrates that integrating LLM semantic reasoning with MCTS‑based planning is a promising direction for robust, automated bug reproduction, offering a substantial leap over existing reinforcement‑learning or pure‑LLM baselines.

