HuggingR$^{4}$: A Progressive Reasoning Framework for Discovering Optimal Model Companions

Reading time: 5 minutes

📝 Original Info

  • Title: HuggingR$^{4}$: A Progressive Reasoning Framework for Discovering Optimal Model Companions
  • ArXiv ID: 2511.18715
  • Date: 2025-11
  • Authors:

📝 Abstract

Large Language Models (LLMs) have made remarkable progress in their ability to interact with external interfaces. Selecting suitable external interfaces has thus become a crucial step in constructing LLM agents. In contrast to invoking API tools, directly calling AI models across different modalities from the community (e.g., HuggingFace) poses challenges due to the vast scale (>10k), metadata gaps, and unstructured descriptions. Current methods for model selection often incorporate entire model descriptions into prompts, resulting in prompt bloat, wasted tokens, and limited scalability. To address these issues, we propose HuggingR$^{4}$, a novel framework that combines Reasoning, Retrieval, Refinement, and Reflection to efficiently select models. Specifically, we first perform multiple rounds of reasoning and retrieval to obtain a coarse list of candidate models. Then, we conduct fine-grained refinement by analyzing candidate model descriptions, followed by reflection to assess the results and determine whether the retrieval scope should be expanded. This method considerably reduces token consumption by decoupling user query processing from complex model description handling. Through a pre-established vector database, complex model descriptions are stored externally and retrieved on demand, allowing the LLM to concentrate on interpreting user intent while accessing only relevant candidate models without prompt bloat. In the absence of standardized benchmarks, we construct a multimodal human-annotated dataset comprising 14,399 user requests across 37 tasks and conduct a thorough evaluation. HuggingR$^{4}$ attains a workability rate of 92.03% and a reasonability rate of 82.46%, surpassing the existing method by 26.51% and 33.25% respectively on GPT-4o-mini.
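The decoupling described in the abstract can be sketched in a few lines: model descriptions are embedded once into an offline vector index, and only the top-k matches for a query ever enter the LLM prompt. The snippet below is a minimal illustration, not the paper's implementation; the encoder choice and helper names (`build_index`, `retrieve`) are assumptions.

```python
# Minimal sketch of the decoupling idea: model descriptions live in an
# offline vector index, and only the top-k matches ever reach the prompt.
# Encoder choice and helper names are illustrative, not the paper's API.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # any text encoder works

def build_index(model_cards: list[dict]) -> np.ndarray:
    """Embed every model card once, offline."""
    texts = [f"{c['task']}: {c['description']}" for c in model_cards]
    return np.asarray(encoder.encode(texts, normalize_embeddings=True))

def retrieve(query: str, index: np.ndarray, cards: list[dict], k: int = 5):
    """At query time, fetch only the k most relevant candidates."""
    q = encoder.encode([query], normalize_embeddings=True)[0]
    scores = index @ q                  # cosine similarity (unit vectors)
    top = np.argsort(-scores)[:k]
    return [cards[i] for i in top]      # only these enter the LLM prompt
```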

📄 Full Content

The Model Context Protocol (MCP) [3] has catalyzed the evolution of large language models (LLMs), making the integration of diverse external interfaces a key paradigm in AI agent development, with remarkable results across various fields [4,10,46]. These external interfaces fall into two categories: functional APIs [16,21,26] and system-level interfaces [2,5,41]. By leveraging these tools, LLMs break through the boundaries of pure text-based reasoning, enabling them to process visual inputs, access real-time data, and undertake complex tasks. With the rapid growth of a tool ecosystem spanning vision, language, and cross-modal capabilities, selecting the most appropriate tool from a vast pool based on user queries has become a critical step in building AI agents.

While significant breakthroughs have been made in API tool calls [6,23], current research has largely neglected the invocation of community-driven AI models. To date, the HuggingFace community has amassed over 1.7 million models and 1,200 datasets, and supports 4,700 languages, vastly expanding the available toolset for AI agents. We believe that each model excels in its specialized domain, and models fine-tuned on domain-specific data are better suited to meeting users' particular needs. For instance, BioBERT [15] is better suited for medical text analysis than BERT [13]. However, the rapid expansion of the community model ecosystem and its metadata gaps significantly impact retrieval system performance, forcing researchers to allocate 78% of optimization costs to data cleaning rather than algorithmic innovation [32]. Developing model selection algorithms that balance efficiency and accuracy has therefore become a pivotal bottleneck in unlocking the full potential of community models.

Current mainstream approaches for community model selection, such as HuggingGPT [30], directly embed the descriptions of all candidate models into prompts, leading to significant token consumption. Recent research has shifted towards a paradigm driven by meta-learners that select models based on dataset similarity [19,22,47,49]. However, these methods are limited to fixed, small-scale, task-specific repositories and fail to generalize to other tasks or adapt to dynamically evolving large-scale model communities. Moreover, existing approaches do not perform fine-grained analysis of model descriptions, making it difficult to achieve an optimal match between the heterogeneous nature of community models and the diverse needs of users.
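The token cost of the in-prompt approach can be made concrete with a back-of-the-envelope count. The sketch below is illustrative only: the card texts are placeholders, and `o200k_base` is simply the tokenizer used by the GPT-4o model family.

```python
# Illustrative token-cost comparison: every model description in the
# prompt (HuggingGPT-style) vs. only k retrieved candidates.
import tiktoken

enc = tiktoken.get_encoding("o200k_base")   # GPT-4o family tokenizer

# Placeholder pool: 1,000 cards with paragraph-length descriptions.
cards = [f"model-{i}: " + "a detailed description of capabilities. " * 40
         for i in range(1_000)]

full_prompt = "\n".join(cards)           # all cards in-context
retrieved_prompt = "\n".join(cards[:5])  # only top-5 retrieved cards

print(len(enc.encode(full_prompt)))       # grows linearly with pool size
print(len(enc.encode(retrieved_prompt)))  # constant in pool size
```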

To address the above issues, we introduce HuggingR$^{4}$, a novel LLM-powered framework for model selection in constantly changing model communities. It employs a coarse-to-fine strategy that combines efficient broad-scale retrieval with precise fine-grained refinement, achieving competitive performance while maintaining remarkably low token consumption compared to existing methods. As shown in Figure 1c, the HuggingR$^{4}$ pipeline comprises three steps: 1) Reasoning and Retrieval. Instruct the LLM to engage in step-by-step reasoning and iteratively invoke targeted retrieval operations to obtain a compact set of competitive candidate models. 2) Refinement. Instruct the LLM to conduct comprehensive fine-grained analysis of the candidate models' full descriptions, enabling informed selection of the most suitable model based on task-specific requirements and performance criteria. 3) Reflection. Instruct the LLM to validate the results through self-reflection; if the results are unreasonable, it dynamically adjusts the retrieval parameters and returns to the first step. To address metadata gaps, we introduce a failure traceback module after each retrieval to prevent incorrect results caused by missing metadata. Furthermore, we propose a novel sliding window strategy that manages the LLM's access to model card information at each processing step, optimizing both context utilization and token efficiency.
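A schematic of this loop, with the LLM-backed steps abstracted as callables, might look as follows. All helper names and the scope-doubling heuristic are assumptions for illustration; the paper does not specify its components at this level of detail.

```python
# Schematic of the coarse-to-fine select loop. The LLM-backed steps are
# passed in as callables; names and the k-doubling policy are assumptions.
from typing import Callable, Optional

def select_model(
    query: str,
    retrieve: Callable[[str, int], list[dict]],   # vector-index search
    refine: Callable[[str, list[dict]], dict],    # fine-grained LLM read
    reflect: Callable[[str, dict], bool],         # LLM self-check
    max_rounds: int = 3,
) -> Optional[dict]:
    k = 5                                   # initial retrieval scope
    for _ in range(max_rounds):
        # 1) Reasoning + Retrieval: coarse candidate list from the index.
        candidates = retrieve(query, k)
        if not candidates:                  # failure-traceback hook:
            k *= 2                          # widen scope on empty results
            continue
        # 2) Refinement: analyze full descriptions (fed through a sliding
        #    window in the paper) and pick the best candidate.
        best = refine(query, candidates)
        # 3) Reflection: accept, or expand the retrieval scope and retry.
        if reflect(query, best):
            return best
        k *= 2
    return None                             # no reasonable model found
```

Doubling `k` is just one possible way to realize the "retrieval scope expansion" triggered by a failed reflection; any widening policy would fit the same skeleton.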

To evaluate our approach effectively, we present the first forward-labeled dataset of user requests for this task. This benchmark includes 1,016 single-task requests and 13,383 multi-task requests spanning vision, language, audio, and multimodal scenarios, each rigorously human-annotated for model selection. We employ two key metrics: workability (the ability to execute and produce results) and reasonability (the ability to meet user requirements); a toy computation of both rates is sketched after the list below. In summary, our contributions are as follows:

• We introduce HuggingR$^{4}$, the first community-driven model selection system that seamlessly combines LLM-driven reasoning and vector-based retrieval, requiring no additional training data or domain adaptation.
• To address the unique challenges of community model selection, we introduce a failure traceback module and a sliding window strategy. These mechanisms greatly enhance system robustness and reduce token consumption.
• We create the first forward-labeled user request dataset for this task, comprising 14,399 human-annotated requests across single-task and multi-task scenarios.
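As referenced above, the two evaluation rates reduce to simple proportions over per-request judgments. The records below are invented for illustration; in the paper each judgment comes from human annotation.

```python
# Toy computation of workability and reasonability rates over a set of
# per-request judgments (invented here; human-annotated in the paper).
results = [
    {"workable": True,  "reasonable": True},
    {"workable": True,  "reasonable": False},
    {"workable": False, "reasonable": False},
]

workability = sum(r["workable"] for r in results) / len(results)
reasonability = sum(r["reasonable"] for r in results) / len(results)
print(f"workability={workability:.2%}, reasonability={reasonability:.2%}")
```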

This content is AI-processed based on open access ArXiv data.
