LOCUS: Low-Dimensional Model Embeddings for Efficient Model Exploration, Comparison, and Selection
The rapidly growing ecosystem of Large Language Models (LLMs) makes it increasingly challenging to manage and utilize the vast and dynamic pool of models effectively. We propose LOCUS, a method that produces low-dimensional vector embeddings that compactly represent a language model’s capabilities across queries. LOCUS is an attention-based approach that generates embeddings through a single deterministic forward pass of an encoder model over query encodings and evaluation scores, enabling seamless incorporation of new models into the pool and refinement of existing model embeddings without any retraining. We additionally train a correctness predictor that uses model embeddings and query encodings to achieve state-of-the-art routing accuracy on unseen queries. Experiments show that LOCUS needs up to 4.8x fewer query evaluation samples than baselines to produce informative and robust embeddings. Moreover, the learned embedding space is geometrically meaningful: proximity reflects model similarity, enabling a range of downstream applications including model comparison and clustering, model portfolio selection, and resilient proxies for unavailable models.
💡 Research Summary
The paper introduces LOCUS, a novel framework for generating low‑dimensional embeddings that succinctly capture the capabilities of large language models (LLMs) across a set of queries. The motivation stems from the explosive growth of LLMs, which creates practical challenges in comparing, organizing, and selecting models for inference. Existing approaches fall into two categories: parametric methods that learn per‑model embeddings as trainable parameters (e.g., EmbedLLM, IRT‑Net) and non‑parametric methods that directly compute embeddings from a shared set of evaluation queries (e.g., LLM‑DNA). Parametric methods suffer from instability—different training runs produce different embeddings—and require retraining whenever a new model is added. Non‑parametric methods, while training‑free, demand that every model be evaluated on the exact same query set and cannot easily incorporate additional evaluations.
LOCUS combines the strengths of both families. It treats each model‑query evaluation as a “token” composed of a query embedding (obtained from a pretrained sentence encoder) and a correctness score (binary or otherwise). A small MLP (the tokenizer) maps each (query, score) pair into a fixed‑dimensional token vector. All tokens for a model are stacked into a matrix and fed into a transformer‑style encoder that uses bidirectional multi‑head attention without positional encodings, guaranteeing permutation invariance with respect to the order of evaluations. To keep computation tractable when a model has many evaluations, LOCUS introduces a latent bottleneck: a set of r ≪ n learnable latent vectors that first attend to the token set and then broadcast back, reducing the attention cost from O(n²) to O(n · r). After L such bottleneck layers, a learned “query” vector attends over the final token representations, producing a single fixed‑size embedding zₘ for model m.
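The architecture described above can be sketched in a few lines of NumPy. This is a minimal, single-head illustration of the latent-bottleneck idea, not the paper's implementation: the tokenizer MLP is replaced by a simple score-weighted embedding, all learned projections are omitted, and every function and variable name here is hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(q, k, v):
    """Scaled dot-product attention: q (a, d), k/v (b, d) -> (a, d)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def locus_encode(query_embs, scores, latents, readout):
    """Illustrative sketch of the LOCUS encoder (one bottleneck layer).

    query_embs: (n, d) sentence-encoder embeddings of evaluated queries
    scores:     (n,)   correctness scores for those evaluations
    latents:    (r, d) learnable latent vectors, r << n (the bottleneck)
    readout:    (d,)   learned "query" vector that pools the final tokens
    """
    # Tokenizer stand-in: fold each score into its query embedding.
    # (The paper uses a small MLP over (query, score) pairs instead.)
    tokens = query_embs * scores[:, None]            # (n, d)

    # Bottleneck layer: latents attend to tokens (cost O(n*r) rather
    # than O(n^2)), then tokens attend back to the updated latents.
    lat = attend(latents, tokens, tokens)            # (r, d)
    tokens = attend(tokens, lat, lat)                # (n, d)

    # Readout: a single learned query vector attends over the final
    # token representations, producing one fixed-size model embedding.
    z = attend(readout[None, :], tokens, tokens)[0]  # (d,)
    return z
```

Because there are no positional encodings and attention pools over keys symmetrically, the resulting embedding is invariant to the order in which evaluations are presented, matching the permutation-invariance property claimed for the encoder.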
In parallel, a lightweight correctness predictor Gψ takes a model embedding zₘ and a new query embedding ϕ(x) and outputs a probability of correctness via a sigmoid‑activated MLP. The encoder Fθ (including the tokenizer) and predictor Gψ are jointly trained on a supervised dataset of model‑query evaluation pairs, minimizing binary cross‑entropy between predicted and observed correctness.
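The predictor and its training loss can be sketched similarly. Again this is an assumed, minimal one-hidden-layer version; the weight shapes and names are illustrative rather than taken from the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def predict_correctness(z_m, phi_x, W1, b1, w2, b2):
    """Sketch of the predictor G_psi: concatenate the model embedding
    z_m with the query embedding phi(x), pass the result through a
    one-hidden-layer ReLU MLP, and apply a sigmoid to get P(correct).
    All weight names here are hypothetical."""
    h = np.maximum(0.0, np.concatenate([z_m, phi_x]) @ W1 + b1)
    return sigmoid(h @ w2 + b2)

def bce_loss(p, y, eps=1e-12):
    """Binary cross-entropy between predicted probability p and the
    observed 0/1 correctness label y -- the joint training objective
    for the encoder F_theta and predictor G_psi."""
    p = np.clip(p, eps, 1.0 - eps)
    return -(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))
```

During joint training, gradients of this loss flow through both the predictor weights and, via the model embedding, the encoder and tokenizer parameters.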
Empirical evaluation involves several hundred publicly available LLMs spanning multiple domains (code generation, math reasoning, general QA). Key findings include:
- Sample efficiency – LOCUS achieves comparable or higher routing accuracy than EmbedLLM and IRT‑Net while using 2.3–4.8× fewer evaluation samples per model. This demonstrates that the attention‑based token aggregation can extract strong capability signals from limited data.
- Embedding stability – Because embeddings are deterministic functions of the evaluation set (no stochastic training of per‑model parameters), repeated runs on the same data yield nearly identical vectors, and shuffling token order has no effect.
- Geometric meaningfulness – Distances (cosine or Euclidean) between embeddings correlate strongly with actual performance similarity. Visualization shows clear clustering of models by architecture family, size, or training data regime.
- Downstream utility – The learned space enables nearest‑neighbor model discovery, hierarchical clustering for model family identification, portfolio selection (choosing a minimal subset that collectively covers the capability space), and resilient fallback proxies for unavailable models (selecting the nearest embedding as a surrogate).
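Several of these downstream uses reduce to nearest-neighbor search in the embedding space. A minimal sketch, assuming cosine similarity as the distance (the paper reports both cosine and Euclidean) and hypothetical function names:

```python
import numpy as np

def nearest_models(z_target, Z, k=3):
    """Rank the k models whose embeddings (rows of Z) are most similar
    to z_target under cosine similarity. This single primitive supports
    model discovery and fallback-proxy selection: if the target model is
    unavailable, its nearest neighbor serves as a surrogate.

    z_target: (d,)   embedding of the model of interest
    Z:        (m, d) embeddings of the candidate pool
    """
    Zn = Z / np.linalg.norm(Z, axis=1, keepdims=True)
    zn = z_target / np.linalg.norm(z_target)
    sims = Zn @ zn                      # cosine similarity to each model
    order = np.argsort(-sims)[:k]       # indices of the top-k models
    return order, sims[order]
```

Hierarchical clustering and portfolio selection can be built on the same pairwise-similarity matrix `Zn @ Zn.T`.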
The authors enumerate several desiderata satisfied by LOCUS: black‑box compatibility (only query‑score pairs needed), support for varying evaluation sets across models, training‑free onboarding of new models, and the ability to refine embeddings as more evaluations become available. Limitations include reliance on the quality of the underlying query encoder and the binary correctness formulation, which may not capture nuanced performance dimensions such as latency, cost, or confidence. Future work is suggested in extending the framework to multi‑metric scores, handling more complex interactive queries (e.g., multi‑turn code generation), and integrating embeddings into automated model versioning and monitoring pipelines.
In summary, LOCUS provides an efficient, deterministic, and geometrically interpretable method for embedding LLMs based solely on black‑box evaluation data. By dramatically reducing the number of required evaluations while preserving rich downstream functionality, it offers a practical solution for managing the ever‑expanding landscape of language models.