Prune4Web: DOM Tree Pruning Programming for Web Agent

Reading time: 6 minutes

📝 Abstract

Web automation uses intelligent agents to perform high-level tasks by mimicking human interactions with webpages. Despite recent advances in LLM-based web agents, efficiently navigating complex, real-world webpages remains challenging due to massive DOM structures (10,000∼100,000 tokens). Current approaches either truncate DOMs, losing vital information, or rely on inefficient heuristics and separate ranking models, failing to balance precision and scalability. We introduce Prune4Web, a novel paradigm that shifts DOM processing from LLM-based filtering to programmatic pruning. Our key innovation is DOM Tree Pruning Programming, in which an LLM generates executable Python scoring programs that dynamically filter DOM elements based on semantic clues from decomposed sub-tasks. This approach eliminates the need for LLMs to process full DOMs, instead delegating traversal and scoring to lightweight, interpretable programs. The result is a 25∼50-fold reduction in candidate elements for grounding, enabling precise action localization without attention dilution. Additionally, we propose a data annotation method and a two-turn dialogue training strategy that jointly optimizes the Planner, Programmatic Filter, and Grounder in a unified framework. Experiments demonstrate state-of-the-art performance: on our low-level sub-task grounding benchmark, Prune4Web raises grounding accuracy from 46.80% to 88.28%, highlighting its effectiveness.

📄 Content

Web automation enables the completion of high-level tasks, such as booking flights or shopping online, through intelligent agents that mimic human interaction on webpages. These agents achieve this by interpreting high-level tasks, breaking them down into low-level sub-tasks, and seamlessly interacting with web elements. Recently, large language models (LLMs) have demonstrated impressive capabilities in autonomous web navigation through their strong reasoning and decision-making abilities (Yao et al. 2022; Deng et al. 2023a). Current web agent approaches fall into three main categories: 1) Textual HTML/DOM-based (Yao et al. 2022; Song et al. 2025), 2) Visual Screenshot-based (Lin et al. 2024; Cheng et al. 2024), and 3) Multimodal-based (He et al. 2024a; Zheng et al. 2024a).

Figure 1: Comparison between existing multi-modal web agents and our Prune4Web paradigm. Compared to existing multi-modal web agent paradigms, we propose a programmatic pruning strategy that efficiently removes redundant DOM elements. Our Prune4Web approach relaxes the token limits of LLMs and increases accuracy on low-level sub-task grounding from 46.80% to 88.28%.

Visual screenshots provide an intuitive, human-like understanding of webpage state, making them effective for reasoning about low-level sub-tasks. However, they contain limited semantic information, especially for special icons, and are sensitive to variations in resolution and overlapping elements. In contrast, HTML/DOMs offer precise and stable semantic and structural information that enables accurate element selection with minimal ambiguity.

In this paper, we leverage the complementary advantages of textual and visual multi-modal information and design a multi-stage framework: a Planner model takes the high-level task (e.g., “Book a flight to New York”) and a screenshot, then decomposes it into a low-level sub-task (e.g., “Find the destination field and type NYC”). Based on the sub-task, an action Grounder model processes the DOM to precisely localize and execute the required operation (e.g., selecting the field and typing “NYC”). However, modern webpage DOMs typically contain 10,000∼100,000 tokens, far exceeding the context capacity of most LLMs. This results in token truncation and attention dilution, leading to critical information loss and significant processing delays (Gou et al. 2024a; Deng et al. 2023a). Existing HTML pruning methods fall short, either relying on overly simplistic heuristic filtering (He et al. 2024a; Pan et al. 2024) or requiring separate language models for element ranking (Deng et al. 2023a). Neither approach effectively addresses the core issue. The fundamental challenge remains: how to efficiently and accurately retrieve task-relevant elements from complete DOM structures.
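The Planner → Filter → Grounder flow described above can be sketched as a single step of the agent loop. This is an illustrative sketch only: the function names, signatures, and stub components are assumptions for exposition, not the paper's actual API.

```python
def run_step(high_level_task, screenshot, dom_tree, planner, filter_llm, grounder):
    """One agent step: decompose, prune the DOM programmatically, then ground."""
    sub_task = planner(high_level_task, screenshot)   # e.g. "Find the destination field and type NYC"
    scoring_program = filter_llm(sub_task)            # LLM emits a small scoring program (runs outside the LLM)
    shortlist = scoring_program(dom_tree)             # full-DOM traversal, 25-50x fewer candidates survive
    return grounder(sub_task, shortlist)              # grounder picks the element and operation

# Stub components standing in for the trained models (illustrative only).
planner = lambda task, shot: "Find the destination field and type NYC"
filter_llm = lambda sub_task: (lambda dom: [el for el in dom if "destination" in el])
grounder = lambda sub_task, shortlist: ("type", shortlist[0], "NYC")

action = run_step("Book a flight to New York", None,
                  ["input:destination", "div:deals", "a:login"],
                  planner, filter_llm, grounder)
```

Note that only the Planner and Grounder calls involve an LLM; the pruning step is plain Python over the complete DOM, which is what keeps the DOM itself out of the LLM's context.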

To this end, we propose the Prune4Web pipeline built on a novel paradigm: DOM Tree Pruning Programming. We observe that the low-level sub-tasks (e.g., “Find the destination field”) output by the Planner contain extensive semantic clues about potentially relevant DOM elements. This insight motivates us to shift the LLM’s role from directly locating elements in lengthy DOMs to generating a locator program based solely on the low-level sub-task, thereby avoiding the need to feed long DOM sources into the LLM (Jiang et al. 2024a; Zhang et al. 2023b). Specifically, we implement this concept through our Programmatic Element Filter model. This filter receives a specific low-level sub-task from the upstream Planner and prompts the LLM to generate a concise, task-specific Python scoring program. We design a heuristic-based scoring program template, requiring the LLM to generate only key parameters for better controllability and flexibility. The generated program runs independently outside the LLM, efficiently traversing the complete DOM tree to score and rank all elements. This approach reduces candidate elements by 25∼50 times, enabling precise action localization without attention dilution. A downstream LLM-based Action Grounder then selects the final element from this refined shortlist, completing the grounding task.
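The paper does not reproduce its scoring template here, but a minimal sketch of how such an LLM-parameterized filter could work follows. Everything below, the `DomNode` structure, the keyword/tag-weight heuristic, and the concrete parameters, is an illustrative assumption: in the real system the LLM would emit only the parameters (e.g., `keywords`, `tag_weights`) that plug into a fixed traversal-and-scoring template.

```python
from dataclasses import dataclass, field

@dataclass
class DomNode:
    tag: str
    text: str = ""
    attrs: dict = field(default_factory=dict)
    children: list = field(default_factory=list)

def score_element(node, keywords, tag_weights):
    """Heuristic score: keyword hits in text/attributes plus a per-tag prior."""
    haystack = " ".join([node.text, *map(str, node.attrs.values())]).lower()
    hits = sum(1 for kw in keywords if kw in haystack)
    return hits + tag_weights.get(node.tag, 0.0)

def prune(root, keywords, tag_weights, top_k=10):
    """Traverse the complete DOM tree, score every element, keep the top-k."""
    scored, stack = [], [root]
    while stack:
        node = stack.pop()
        scored.append((score_element(node, keywords, tag_weights), node))
        stack.extend(node.children)
    scored.sort(key=lambda pair: -pair[0])
    return [node for s, node in scored[:top_k] if s > 0]

# Sub-task: "Find the destination field and type NYC".
# Hypothetical parameters the LLM might emit for this sub-task:
keywords = ["destination", "arrival"]
tag_weights = {"input": 1.0, "select": 0.5}

root = DomNode("body", children=[
    DomNode("input", attrs={"placeholder": "Destination city"}),
    DomNode("div", text="Weekly deals"),
])
shortlist = prune(root, keywords, tag_weights)  # only the destination input survives
```

Because the template fixes the traversal logic, the LLM's output stays short and interpretable, and a malformed program cannot silently drop parts of the DOM the way truncation does.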

To train the models within Prune4Web, we create an automated data synthesis pipeline that annotates structured intermediate outputs from raw data with minimal human intervention. These include low-level sub-tasks for the Planner and key parameters for the Programmatic Element Filter. For optimization, we develop a novel two-turn dialogue training strategy that jointly trains the Planner, Filter, and Grounder as a unified model. We initially use Supervised Fine-Tuning (SFT) with our annotated data to train a base model (Zheng et al. 2024b). Subsequently, we apply Reinforcement Fine-Tuning (RFT) to enhance the Planner’s long-term planning capabilities while integrating the programmatic filtering process into this optimization framework. Extensive experiments on benchmark datasets (Deng et al. 2023a; Pan et al. 2024) demonstrate the effectiveness of the proposed Prune4Web.
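A two-turn dialogue training sample might be structured roughly as follows. The schema is an assumption for illustration (the paper's exact annotation format is not shown in this digest): turn one trains planning and filter-parameter generation, and turn two trains grounding over the pruned shortlist that the executed program produced.

```python
# Hypothetical two-turn SFT sample; field names and element IDs are illustrative.
sample = {
    "turn_1": {
        "user": "High-level task: Book a flight to New York. [screenshot]",
        "assistant": {
            "sub_task": "Find the destination field and type NYC",
            "filter_params": {                      # parameters for the scoring template
                "keywords": ["destination", "arrival"],
                "tag_weights": {"input": 1.0, "select": 0.5},
            },
        },
    },
    "turn_2": {
        # Shortlist produced by executing the turn-1 scoring program on the DOM.
        "user": "Pruned candidates: [input#dest, input#date, button#search]",
        "assistant": {"action": "type", "element": "input#dest", "value": "NYC"},
    },
}
```

Packing both turns into one dialogue lets SFT (and later RFT) optimize the Planner, Filter, and Grounder jointly, since the turn-2 grounding loss depends on the quality of the turn-1 filter parameters.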

This content is AI-processed based on ArXiv data.
