Self-Guided Function Calling in Large Language Models via Stepwise Experience Recall

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original ArXiv source.

Function calling enables large language models (LLMs) to interact with external systems by leveraging tools and APIs. When faced with multi-step tool usage, however, LLMs still struggle with tool selection, parameter generation, and tool-chain planning. Existing methods typically rely on manually designed task-specific demonstrations or on retrieval from a curated library. These approaches demand substantial expert effort, and prompt engineering becomes increasingly complex and inefficient as tool diversity and task difficulty scale. To address these challenges, we propose a self-guided method, Stepwise Experience Recall (SEER), which performs fine-grained, stepwise retrieval from a continually updated experience pool. Instead of relying on a static or manually curated library, SEER incrementally augments the experience pool with past successful trajectories, enabling continuous expansion of the pool and improved model performance over time. Evaluated on the ToolQA benchmark, SEER achieves an average improvement of 6.1% on easy and 4.7% on hard questions. We further test SEER on τ-bench, which includes two real-world domains. Powered by Qwen2.5-7B and Qwen2.5-72B models, SEER demonstrates substantial accuracy gains of 7.44% and 23.38%, respectively.


💡 Research Summary

The paper tackles a core limitation of large language models (LLMs) when they need to invoke external tools or APIs in multi‑step scenarios. Existing approaches either hand‑craft task‑specific demonstrations or retrieve examples from a static, curated library. Both strategies become unwieldy as the number of tools grows and tasks become more complex, because prompt length is limited and manual effort scales poorly.

To overcome these issues, the authors introduce Stepwise Experience Recall (SEER), a self‑guided framework that dynamically selects in‑context examples from an ever‑expanding “experience pool” of previously successful interaction trajectories. SEER consists of three components:

  1. Trajectory Experience Extraction – Each completed interaction (a trajectory τ) is transformed into a structured tuple ⟨Eτ, Eq, Iτ, Uτ⟩. Eτ is a vector embedding of the whole trajectory, Eq encodes the user’s initial query, Iτ is a discrete intent label inferred by the LLM itself, and Uτ is the set of tools used (order ignored). This representation requires no human annotation.

  2. Stepwise Experience Recall – When the model reaches a new step t with history Ht, SEER scores every candidate trajectory τ′ in the pool using three complementary metrics:

    • Trajectory Similarity (s1) – Normalized cosine similarity between the embeddings of Ht and τ′, capturing overall conversational flow.
    • ToolChain Coverage (s2) – The proportion of tools present in the current task that also appear in τ′, encouraging reuse of effective tool sequences.
    • Intent Match (s3) – A binary indicator of whether the inferred intents of Ht and τ′ coincide.
     The final relevance score is a weighted sum Score(τ′) = λ1·s1 + λ2·s2 + λ3·s3 (the paper uses equal weights). The top‑k trajectories (k=4 by default) are returned as in‑context demonstrations to guide the LLM’s next decision. This multi‑dimensional scoring goes beyond simple query similarity and explicitly aligns tool usage patterns and user goals.

  3. Continual Experience Accumulation – After a task finishes, an “LLM‑as‑a‑judge” module compares the model’s output with a reference answer, tolerating minor formatting or numeric variations. If judged successful, the trajectory is added to the experience pool, automatically expanding the knowledge base without any external labeling.
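The recall step above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's code: all names are hypothetical, the embedding representation is left abstract, and the normalization of s1 (shifting cosine similarity into [0, 1]) is an assumption. Equal weights λ1 = λ2 = λ3 = 1/3 and k=4 follow the defaults stated above.

```python
import math
from dataclasses import dataclass

@dataclass
class Experience:
    embedding: list          # E_tau: embedding of the whole trajectory
    query_embedding: list    # E_q: embedding of the initial user query
    intent: str              # I_tau: intent label inferred by the LLM
    tools: frozenset         # U_tau: set of tools used (order ignored)

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def score(hist_emb, hist_intent, hist_tools, exp, w=(1/3, 1/3, 1/3)):
    # s1: trajectory similarity, cosine shifted into [0, 1] (assumed normalization)
    s1 = (cosine(hist_emb, exp.embedding) + 1) / 2
    # s2: tool-chain coverage -- fraction of the current task's tools also in exp
    s2 = (len(hist_tools & exp.tools) / len(hist_tools)) if hist_tools else 0.0
    # s3: binary intent match
    s3 = 1.0 if hist_intent == exp.intent else 0.0
    return w[0] * s1 + w[1] * s2 + w[2] * s3

def recall(hist_emb, hist_intent, hist_tools, pool, k=4):
    """Return the top-k experiences to serve as in-context demonstrations."""
    ranked = sorted(pool, key=lambda e: score(hist_emb, hist_intent, hist_tools, e),
                    reverse=True)
    return ranked[:k]
```

In this sketch, continual accumulation would simply append a judged-successful trajectory's `Experience` tuple to `pool`, so retrieval quality improves as the pool grows.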

Experimental Evaluation
The authors evaluate SEER on two benchmarks:

  • ToolQA, a synthetic suite covering a wide range of tools and question difficulties (easy/hard).
  • τ‑bench, which contains two real‑world domains and uses GPT‑4o as a simulated user.

Both Qwen2.5‑7B and Qwen2.5‑72B models are used as the base LLM. Across all settings, SEER outperforms strong baselines such as AR‑T, ExpeL, and standard chain‑of‑thought prompting. On ToolQA, SEER improves accuracy by 6.1% on easy questions and 4.7% on hard ones. On τ‑bench, the gains are 7.44% for the 7‑billion‑parameter model and a striking 23.38% for the 72‑billion‑parameter model.

Ablation studies reveal that each scoring component contributes meaningfully: removing tool‑chain coverage drops performance by ~2–3%, while omitting intent matching leads to larger errors on complex goals. Varying the top‑k value shows modest improvements when increasing from 2 to 8, but token‑budget considerations make k=4 a practical sweet spot.

Insights and Limitations
SEER demonstrates a clear self‑improvement loop: as more successful trajectories are collected, the relevance of retrieved examples improves, leading to higher downstream accuracy. This addresses the data‑scarcity problem that plagues many LLM‑augmented systems. However, the current tool‑chain coverage metric treats tool sets as unordered, so it cannot capture nuanced ordering or conditional dependencies. Moreover, the reliance on an LLM‑based evaluator introduces a potential source of bias; future work could explore more robust, possibly human‑in‑the‑loop verification.

Conclusion
Stepwise Experience Recall (SEER) offers a principled, scalable solution for multi‑step function calling. By jointly considering trajectory similarity, tool‑chain overlap, and intent alignment, and by continuously enriching its experience pool through self‑assessment, SEER enables LLMs to select highly relevant demonstrations on the fly and to improve autonomously over time. The reported gains across synthetic and real‑world benchmarks suggest that SEER could become a foundational component for deploying tool‑augmented LLM agents in complex, evolving application domains.

