LegalWebAgent: Empowering Access to Justice via LLM-Based Web Agents


📝 Original Info

  • Title: LegalWebAgent: Empowering Access to Justice via LLM-Based Web Agents
  • ArXiv ID: 2512.04105
  • Date: 2025-11-28
  • Authors: Jinzhe Tan, Karim Benyekhlef

📝 Abstract

Access to justice remains a global challenge, with many citizens still finding it difficult to seek help from the justice system when facing legal issues. Although the internet provides abundant legal information and services, navigating complex websites, understanding legal terminology, and filling out procedural forms continue to pose barriers to accessing justice. This paper introduces the LegalWebAgent framework that employs a web agent powered by multimodal large language models to bridge the gap in access to justice for ordinary citizens. The framework combines the natural language understanding capabilities of large language models with multimodal perception, enabling a complete process from user query to concrete action. It operates in three stages: the Ask Module understands user needs through natural language processing; the Browse Module autonomously navigates webpages, interacts with page elements (including forms and calendars), and extracts information from HTML structures and webpage screenshots; the Act Module synthesizes information for users or performs direct actions like form completion and schedule booking. To evaluate its effectiveness, we designed a benchmark test covering 15 real-world tasks, simulating typical legal service processes relevant to Québec civil law users, from problem identification to procedural operations. Evaluation results show LegalWebAgent achieved a peak success rate of 86.7%, with an average of 84.4% across all tested models, demonstrating high autonomy in complex real-world scenarios.

📄 Full Content

The significant gap between the public's legal needs and their ability to find legal information and solutions has created a "justice gap" in countries worldwide [1]. For ordinary citizens, the cost of hiring a lawyer is prohibitively expensive [2], forcing them to spend a great deal of time navigating a maze of government websites, legal statutes, and procedural forms on their own. Even so, they often struggle to search using correct legal terminology or to locate relevant information on cluttered websites, and they make critical errors when filling out online forms.

Existing legal tech tools, such as static frequently asked questions (FAQ) portals and simple rule-driven chatbots, seek to address this issue. They typically provide information in plain language, or offer simplified legal pathways and relevant cases [3,4] to help reduce the user's cognitive load. However, as the legal domains covered by these websites and the volume of information they contain grow, the cognitive burden on users increases correspondingly. Furthermore, these tools still lack the ability to interact with the broader web ecosystem or perform actions on behalf of the user. This means that the most difficult and error-prone "operational" steps, such as submitting forms or scheduling appointments, are still left to the user.

Recent developments in multimodal large language models (MLLMs) have advanced the role of AI in building general-purpose tools to enhance access to justice. The improved reasoning capabilities of LLMs have made it possible to build autonomous agents [5,6]. GPT-4o-vision and more advanced LLMs demonstrate strong capabilities in natural language understanding and generation. Moreover, their multimodal understanding allows them to process inputs beyond text, such as images [7], making it feasible to build web agents that comprehend both the textual and visual elements of webpages [8]. In this study, we introduce the LegalWebAgent framework (see Figure 1), a multimodal web agent designed to autonomously create plans based on a user's query and perform tasks such as web browsing, information gathering, and web interaction.

In the following sections, we review related work (Section 2), introduce the LegalWebAgent framework (Section 3), outline our experiment design (Section 4), and present results (Section 5). We then discuss key insights and current limitations (Section 6), and conclude.

Given a user's query, LegalWebAgent formulates a plan, analyzes the webpage's HTML elements and screenshots, and determines the appropriate actions (such as clicks, scrolls, or inputs). After gathering the necessary information or completing the requested task, it generates a concise summary of the process and presents the results to the user.

The access to justice gap is a global issue that not only incurs monetary, temporal, and psychological costs [9,10,11,12,13] but also erodes public confidence in the justice system. AI technology has been widely applied to bridge this gap [14], with efforts ranging from providing legal information [15,16] and assisting with form completion [17,18] to online dispute resolution [19,20,21]. In addition to these primarily text-focused efforts, Vision-Large Language Models (VLLMs) have also undergone preliminary exploration in the legal domain in recent years [22]. Researchers have also explored integrating AI with information portals to help users map natural language descriptions to relevant legal issues [16,23]. However, these systems are often built for specific contexts and thus lack generalizability. Furthermore, they are largely passive, requiring users to independently read, comprehend, and act upon the information provided.

The internet has long been one of the primary channels for users to obtain information. It can be viewed as a continuously updated real-time database. In the legal field, modifications to legal information, promulgation of new laws, and introduction of new cases are all published on the internet in a very timely manner. When needed, people can readily find the information they require online. Furthermore, tasks such as booking meetings, filling out online forms, and completing online questionnaires offer users the possibility of handling tasks remotely, saving them valuable time.

In practice, however, this process is far from straightforward: (1) users may encounter websites containing outdated or misleading information; (2) retrieving legal information often requires extensive cross-verification and synthesizing information from numerous web pages; (3) users may be unaware of the existence of certain authoritative resources; and (4) human cognitive limitations, along with states such as distraction and fatigue, can significantly hinder users’ ability to navigate the web effectively [24].

The internet is designed for humans, enabling them to interact with the digital world through operations such as clicking, scrolling, and typing [25]. Creating web agents capable of simulating these operations is expected to significantly help reduce the sense of disorientation and inefficiency experienced during the aforementioned web browsing process.

Such web agents, and web automation more broadly, have been explored for decades. Early research typically relied on structured data sources or site-specific wrappers to perform tasks such as flight booking or information scraping [26,27,28]. These methods required significant manual configuration for each website and were highly brittle, often breaking when sites changed [29,30].

With the development of deep learning and reinforcement learning, more generalized web agents have become possible. Depending on the architecture, they range from agents trained purely on HTML files [31,32,33], to models that combine HTML and visual screenshots [34,35], and even to agents that rely solely on visual input from screenshots [36]. These modern web agents are capable of helping users interact with real-world websites and perform everyday tasks [37].

LegalWebAgent is composed of three modules (Ask Module / Browse Module / Act Module) and works in concert with a web browsing environment. LegalWebAgent builds on top of the open-source Browser-Use framework [38], whose core is Playwright [39], a library that enables reliable and fast web automation by providing a unified API for Chromium, Firefox, and WebKit. When a user command is received, LegalWebAgent launches a Chromium browser and then provides context to the LLM by collecting the webpage's HTML elements and a screenshot (see Figure 2). Based on the current state, LegalWebAgent analyzes the completion status of the previous objective, updates its memory, and formulates the next objective. According to the objective, LegalWebAgent proposes an action to be executed and performs the operation within the browser environment.
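The perceive–reason–act cycle described above can be sketched as a small control loop. This is a minimal illustration, not the paper's implementation: `BrowserState`, `AgentStep`, and the injected `perceive`/`decide`/`execute` callables are hypothetical names standing in for the real Playwright-backed environment and the LLM call.

```python
from dataclasses import dataclass

@dataclass
class BrowserState:
    """Snapshot given to the LLM at each step (hypothetical structure):
    a simplified DOM summary plus a page screenshot."""
    url: str
    html_elements: list[str]
    screenshot_b64: str = ""

@dataclass
class AgentStep:
    objective: str
    action: str          # e.g. "click", "scroll", "input", or "done"
    target: str = ""

def agent_loop(query, perceive, decide, execute, max_steps=10):
    """Repeatedly collect page state, let the LLM check the previous
    objective and propose the next action, then execute it in the
    browser. The three callables are injected so this sketch runs
    without a real browser or LLM."""
    memory: list[AgentStep] = []
    for _ in range(max_steps):
        state = perceive()                   # HTML elements + screenshot
        step = decide(query, state, memory)  # LLM: update memory, pick action
        if step.action == "done":
            break
        execute(step)                        # perform it in the browser
        memory.append(step)
    return memory
```

In the real system, `decide` would prompt the MLLM with the accumulated memory and the multimodal page state, and `execute` would dispatch to Playwright.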

When the user provides a prompt (e.g., “Find the rental dispute department closest to H3A0G4.”), the Ask module performs the following steps: (1) It uses an LLM to parse the user’s intent. In this example, the module may recognize that the user needs legal information on landlord-tenant disputes in Québec, as well as the address and phone number of the relevant tribunal. (2) Based on the parsed intent, the Ask module generates a web navigation plan. For instance, the plan might be:

  1. Search for rental dispute resolution services;
  2. The user’s postal code is H3A0G4, so focus the search on Montreal.
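The two Ask-module steps can be illustrated with a toy sketch. The keyword table below is a deliberately crude stand-in for the LLM's intent parsing, and the postal-code regex is an assumption of ours about how location context might be extracted; neither is taken from the paper.

```python
import re

# Hypothetical stand-in for the LLM's topic recognition: in the real
# system an LLM maps the free-text query to a legal issue.
TOPIC_KEYWORDS = {
    "rental": "landlord-tenant dispute",
    "lease": "landlord-tenant dispute",
    "divorce": "family law",
}

def parse_intent(query: str) -> dict:
    """Step (1): extract the legal topic and any Canadian-style postal
    code from the user's plain-language query."""
    topic = next(
        (t for kw, t in TOPIC_KEYWORDS.items() if kw in query.lower()),
        "general legal information",
    )
    m = re.search(r"\b[A-Za-z]\d[A-Za-z]\s?\d[A-Za-z]\d\b", query)
    return {"topic": topic, "postal_code": m.group(0).upper() if m else None}

def make_plan(intent: dict) -> list[str]:
    """Step (2): turn the parsed intent into an ordered navigation plan."""
    plan = [f"Search for {intent['topic']} resolution services"]
    if intent["postal_code"]:
        plan.append(f"Narrow the search to offices near {intent['postal_code']}")
    return plan
```

For the example prompt, `parse_intent` yields the landlord-tenant topic and postal code H3A0G4, and `make_plan` produces a two-step plan mirroring the one shown above.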

The Browse module receives the current state of the webpage at each execution step and decides the next action to take. Because modern webpages often contain a large number of elements and complex layouts, we adopt a multimodal perception mechanism to improve the robustness of the Browse module.

HTML Analysis. The module parses the HTML Document Object Model (DOM) of the page and identifies candidate interactive elements (e.g., links, buttons, and input fields).
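As a rough sketch of this HTML-analysis step (not the framework's actual parser, which works through Playwright's live DOM), one can walk an HTML document and collect the tags a user could interact with:

```python
from html.parser import HTMLParser

# Tags a user can typically interact with; a simplifying assumption,
# not an exhaustive list from the paper.
INTERACTIVE_TAGS = {"a", "button", "input", "select", "textarea"}

class InteractiveElementFinder(HTMLParser):
    """Collects candidate interactive elements from raw HTML, roughly
    mirroring the Browse module's DOM-analysis step."""

    def __init__(self):
        super().__init__()
        self.elements: list[dict] = []

    def handle_starttag(self, tag, attrs):
        if tag in INTERACTIVE_TAGS:
            self.elements.append({"tag": tag, "attrs": dict(attrs)})

def find_interactive(html: str) -> list[dict]:
    """Return the interactive elements found in an HTML snippet."""
    finder = InteractiveElementFinder()
    finder.feed(html)
    return finder.elements
```

The resulting element list, paired with a screenshot, would form the multimodal page state handed to the LLM for action selection.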
