📝 Original Info
- Title: Improving Language Agents through BREW
- ArXiv ID: 2511.20297
- Date: 2025-11-26
- Authors: Shashank Kirtania, Param Biyani, Priyanshu Gupta, Yasharth Bajpai, Roshni Iyer, Sumit Gulwani, Gustavo Soares (Microsoft)
📝 Abstract
Large Language Model (LLM)-based agents are increasingly applied to tasks requiring structured reasoning, tool use, and environmental adaptation, such as data manipulation, multi-step planning, and computer-use automation. Despite their versatility, however, current training paradigms that optimize model weights, such as PPO and GRPO, remain relatively impractical due to their high computational overhead for rollout convergence. In addition, the resulting agent policies are difficult to interpret, adapt, or incrementally improve. To address this, we investigate creating and refining a structured memory of an agent's experiential learning from its environment as an alternative route to agent optimization. We introduce BREW (Bootstrapping expeRientially-learned Environmental knoWledge), a framework for optimizing agents on downstream tasks via knowledge base (KB) construction and refinement. In our formulation, we introduce an effective method for partitioning agent memory for more efficient retrieval and refinement. BREW uses task graders and behavior rubrics to learn insights, while leveraging state-space search to ensure robustness to the noise and non-specificity of natural language. Empirical results on real-world, domain-grounded benchmarks -- OSWorld, $\tau^2$Bench, and SpreadsheetBench -- show that BREW achieves a $10-20\%$ improvement in task precision and a $10-15\%$ reduction in API/tool calls, leading to faster execution, all while maintaining computational efficiency on par with base models. Unlike prior work that treats memory as static context, we establish the KB as a modular and controllable substrate for agent optimization -- an explicit lever for shaping behavior in a transparent, interpretable, and extensible manner.
📄 Full Content
Improving Language Agents through BREW
Shashank Kirtania, Param Biyani, Priyanshu Gupta, Yasharth Bajpai,
Roshni Iyer, Sumit Gulwani, Gustavo Soares
Microsoft
{t-skirtania,t-pbiyani,priyansgupta,ybajpai,iyerroshni,sumitg,gustavo.soares}@microsoft.com
Abstract
Large Language Model (LLM)-based agents are increasingly applied to tasks requiring structured reasoning, tool use, and environmental adaptation, such as data manipulation, multi-step planning, and computer-use automation. Despite their versatility, however, current training paradigms that optimize model weights, such as PPO and GRPO, remain relatively impractical due to their high computational overhead for rollout convergence. In addition, the resulting agent policies are difficult to interpret, adapt, or incrementally improve. To address this, we investigate creating and refining a structured memory of an agent's experiential learning from its environment as an alternative route to agent optimization. We introduce BREW (Bootstrapping expeRientially-learned Environmental knoWledge), a framework for optimizing agents on downstream tasks via knowledge base (KB) construction and refinement. In our formulation, we introduce an effective method for partitioning agent memory for more efficient retrieval and refinement. BREW uses task graders and behavior rubrics to learn insights, while leveraging state-space search to ensure robustness to the noise and non-specificity of natural language. Empirical results on real-world, domain-grounded benchmarks – OSWorld, τ²Bench, and SpreadsheetBench – show that BREW achieves a 10–20% improvement in task precision and a 10–15% reduction in API/tool calls, leading to faster execution, all while maintaining computational efficiency on par with base models. Unlike prior work that treats memory as static context, we establish the KB as a modular and controllable substrate for agent optimization – an explicit lever for shaping behavior in a transparent, interpretable, and extensible manner.
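The abstract's partitioned agent memory can be pictured as a concept-keyed store. The sketch below is a minimal, hypothetical Python illustration of that idea, not the paper's implementation: the `ConceptEntry` and `KnowledgeBase` names, the naive deduplication, and the keyword-overlap retriever are all assumptions introduced here for clarity.

```python
# Minimal sketch of a concept-partitioned knowledge base (KB), as suggested by the
# abstract's "partitioning agent memory for more efficient retrieval and refinement".
# All names and the lexical-overlap retriever are illustrative assumptions.
from dataclasses import dataclass, field


@dataclass
class ConceptEntry:
    concept: str                                        # e.g., "Compress and Extract Files"
    insights: list[str] = field(default_factory=list)   # experiential lessons for this concept


class KnowledgeBase:
    """Agent memory partitioned by concept so retrieval touches only relevant entries."""

    def __init__(self) -> None:
        self.partitions: dict[str, ConceptEntry] = {}

    def integrate(self, concept: str, insight: str) -> None:
        # Merge a {concept, insight} pair (as produced by a reflection step) into the KB.
        entry = self.partitions.setdefault(concept, ConceptEntry(concept))
        if insight not in entry.insights:                # naive deduplication stand-in
            entry.insights.append(insight)

    def retrieve(self, query: str, k: int = 3) -> list[ConceptEntry]:
        # Toy lexical-overlap retrieval; a real system would likely use embeddings.
        def score(entry: ConceptEntry) -> int:
            return len(set(query.lower().split()) & set(entry.concept.lower().split()))
        ranked = sorted(self.partitions.values(), key=score, reverse=True)
        return [e for e in ranked[:k] if score(e) > 0]


if __name__ == "__main__":
    kb = KnowledgeBase()
    kb.integrate("Compress and Extract Files",
                 "Prefer right-click -> Compress... over shell commands when a GUI is available.")
    for entry in kb.retrieve("extract the archive files on the desktop"):
        print(entry.concept, entry.insights)
```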
1 Introduction
Large Language Model (LLM)-based agents are rapidly being deployed for structured reasoning, tool use, and autonomous interaction in real-world environments [16]. From computer-use and spreadsheet automation to software engineering pipelines, these agents drive tasks such as multi-step planning, data manipulation, and adaptive workflows [24, 13, 37, 2, 22]. For example, a language agent might help automate a multi-step workflow such as collecting data from different sources, cleaning or validating it, and then uploading it to a dedicated server, all while adjusting its plan if the format or structure of the data changes unexpectedly [36, 40, 28, 3]. Yet, despite these successes, top-performing agents generally score underwhelmingly on challenging real-world benchmarks, well behind human experts [39, 4, 32, 19]. As an example, consider the following scenario:
Case Study on Computer Use Agents
A computer-use agent in an Ubuntu environment tasked with automating software installation
across multiple sessions.
[Figure 1 graphic: panels show (1) trajectory generation, where a user query such as "Can you enable the 'Do Not Track' feature in Chrome to enhance my online privacy?" is evaluated against human-validated rubrics for agent alignment and correctness (e.g., "How well does the agent handle unexpected states or failures in the environment? Does it adapt or recover?") together with a task-specific grader; (2–4) the Reflector Agent extracting {concept, insight} pairs from trajectories (e.g., "Compress and Extract Files", "Create Charts from Data", "Export as PDF") and merging them via lemmatization and semantic deduplication into a meta-concept list; (5) the Integrator Agent and Expand-and-Gather MCTS refining the KB during the bootstrapping process.]
Figure 1: BREW architecture overview using examples from the OSWorld dataset. Step 1 indicates the trajectory generation process, with agent alignment to human-validated rubrics and correctness checked by a task-specific grader. Steps 2–4 indicate the Reflector Agent, which learns key concepts and corresponding insights from trajectories. Step 5 indicates the Integrator Agent, which integrates knowledge from the Reflector Agent to bootstrap the KB. We introduce Expand-and-Gather MCTS for finding the best KB configuration by a reward-guided search.
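To make the caption's five steps concrete, here is a hypothetical Python sketch of one bootstrapping pass under simplifying assumptions: the agent, the Reflector, and the grader are stubbed as callables, and a single greedy reward-guided selection over candidate KB states stands in for the paper's Expand-and-Gather MCTS.

```python
# Illustrative sketch of the Figure 1 loop: generate a trajectory, reflect it into
# {concept, insight} pairs, integrate candidates into the KB, and keep the candidate
# with the best reward. Greedy selection replaces the paper's Expand-and-Gather MCTS.
# Requires Python 3.10+ for the "KB | None" annotation.
from copy import deepcopy
from typing import Callable

Trajectory = list[str]        # simplified: a trajectory is a list of step descriptions
KB = dict[str, list[str]]     # concept -> insights, a simplified concept-partitioned KB


def bootstrap_kb(
    tasks: list[str],
    run_agent: Callable[[str, KB], Trajectory],              # step 1: trajectory generation
    reflect: Callable[[Trajectory], list[tuple[str, str]]],  # steps 2-4: Reflector Agent
    grade: Callable[[KB], float],                            # task grader + behavior rubrics
    kb: KB | None = None,
) -> KB:
    kb = kb or {}
    for task in tasks:
        trajectory = run_agent(task, kb)
        candidates = [kb]                                    # keep the current KB as a fallback
        for concept, insight in reflect(trajectory):         # step 5: Integrator Agent
            candidate = deepcopy(kb)
            candidate.setdefault(concept, [])
            if insight not in candidate[concept]:
                candidate[concept].append(insight)
            candidates.append(candidate)
        kb = max(candidates, key=grade)                      # reward-guided selection
    return kb
```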