Verbalized Algorithms
Instead of querying LLMs in a one-shot manner and hoping to get the right answer for a reasoning task, we propose a paradigm we call \emph{verbalized algorithms} (VAs), which combines LLMs with classical algorithms that have established theoretical guarantees. VAs decompose a task into simple elementary operations on natural language strings that LLMs are able to answer reliably, and limit the scope of LLMs to those simple tasks. For example, to sort a series of natural language strings, \emph{verbalized sorting} uses an LLM as a binary comparison oracle inside a known and well-analyzed sorting algorithm (e.g., a bitonic sorting network). Although this is already known as \emph{pairwise ranking} in the literature, we additionally demonstrate the effectiveness of \emph{verbalized maximum}, \emph{verbalized clustering}, and \emph{verbalized submodular maximization} for numerical reasoning, topic clustering, and multi-hop Q&A RAG tasks, which guarantee $O(n)$ runtime, $O(n \log n)$ runtime, and $1-1/e$ optimality, respectively. Clustering and submodular maximization outperformed or improved upon nearest-neighbor search using state-of-the-art embedding models.
💡 Research Summary
The paper introduces “Verbalized Algorithms” (VAs), a framework that integrates large language models (LLMs) into classical algorithmic pipelines by replacing only the atomic comparison or decision operations with LLM‑driven yes/no queries. The authors argue that many reasoning tasks can be expressed as formal problems for which optimal classical algorithms already exist; the bottleneck is often the lack of a reliable oracle. By treating the LLM as a (potentially noisy) oracle, VAs preserve the high‑level control flow, theoretical runtime guarantees, and correctness properties of the underlying algorithm, while leveraging the LLM’s natural‑language understanding for the elementary operations.
The paper defines three VA categories: (1) Naïve VA, which trusts the LLM output as ground truth; (2) Robust VA, which assumes a bounded error rate and mitigates it through repeated queries and majority voting, with error bounds derived from Hoeffding's inequality; and (3) Probabilistic VA, which exploits the LLM's probability logits to make probabilistic decisions or to symmetrize bias (e.g., positional bias and a "yes" tendency). The theoretical discussion shows that if the LLM is perfect, the original algorithm's time and space complexity and correctness guarantees are retained; with a noisy LLM, the robust or probabilistic variants can still bound the degradation.
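The Robust VA idea can be sketched in a few lines. `robust_query` and `make_noisy_oracle` are illustrative names, not the paper's code; the noisy oracle is a seeded simulation standing in for a fallible LLM.

```python
import random
from collections import Counter

def robust_query(oracle, prompt, k=3):
    """Ask the (possibly noisy) yes/no oracle k times and return the
    majority answer. By Hoeffding's inequality, if each call is correct
    with probability p > 1/2, the majority errs with probability at most
    exp(-2 * k * (p - 0.5) ** 2), so accuracy improves exponentially in k."""
    votes = Counter(oracle(prompt) for _ in range(k))
    return votes.most_common(1)[0][0]

def make_noisy_oracle(truth, p=0.8, seed=0):
    """Simulated oracle: answers truth(prompt) with probability p and
    flips the answer otherwise (a stand-in for an imperfect LLM)."""
    rng = random.Random(seed)
    def oracle(prompt):
        ans = truth(prompt)
        return ans if rng.random() < p else not ans
    return oracle
```

This is the robustness/latency trade-off the authors discuss: each unit increase in K buys an exponential reduction in per-decision error but adds a full round of LLM queries.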
Four case studies illustrate the approach:

- Verbalized Maximum – Finding the maximum integer in a list is implemented by overriding Python's `max` function with an LLM-based binary comparator ("Is X larger than Y?"). Experiments on Qwen-3 models (1.7B–32B) demonstrate that naïve prompting yields large mean absolute errors (MAE > 500 for n = 10), whereas the VA version achieves near-zero MAE even for n = 100, confirming O(n) correctness.
- Verbalized Sorting – Sorting natural-language strings (Amazon product reviews) is performed using two classic comparison-based sorts: Powersort (O(n log n)) and a bitonic sorting network (parallel O((log n)²) depth). Each comparison is a single LLM yes/no query. Robust VA (K = 3 majority voting) and Symmetrized VA (dual-prompt bias correction) significantly improve Kendall-τ scores compared to baseline constrained-decoding or scoring-based methods. Notably, a 1.7B model with Robust VA outperforms a 32B baseline. Bitonic networks also achieve faster wall-clock times thanks to parallel batch queries and KV-cache reuse.
- Verbalized Clustering – Documents are assigned to pre-defined topic seeds via LLM binary relevance checks. Compared to embedding-based k-means, VA clustering yields higher intra-cluster homogeneity, especially when the number of clusters is small, demonstrating that LLM semantic judgments can replace dense vector similarity in certain settings.
- Verbalized Submodular Maximization – The greedy algorithm for submodular maximization (used for multi-hop retrieval-augmented generation) is adapted so that marginal gains are evaluated by LLM prompts ("Does adding this passage increase relevance?"). The method retains the classic 1 − e⁻¹ approximation guarantee while achieving a 7% lift in Recall@10 over state-of-the-art dense retrievers, showing that VA can improve both quality and theoretical optimality in complex selection problems.
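The comparator-as-oracle pattern behind verbalized maximum and verbalized sorting can be sketched as follows. `llm_is_larger` is a hypothetical stand-in for the single yes/no LLM query; here it parses numbers so the sketch runs without a model.

```python
from functools import cmp_to_key

def llm_is_larger(a: str, b: str) -> bool:
    """Placeholder for an LLM yes/no query such as
    'Is "{a}" larger than "{b}"? Answer yes or no.'
    Faked with numeric parsing so this sketch is runnable."""
    return float(a) > float(b)

def verbalized_max(items):
    """O(n) linear scan driven by the oracle, mirroring Python's max()."""
    best = items[0]
    for x in items[1:]:
        if llm_is_larger(x, best):
            best = x
    return best

def verbalized_sort(items):
    """Comparison sort where every comparison is one yes/no oracle call,
    so an O(n log n) sort costs O(n log n) LLM queries."""
    def cmp(a, b):
        if llm_is_larger(a, b):
            return 1
        if llm_is_larger(b, a):
            return -1
        return 0
    return sorted(items, key=cmp_to_key(cmp))
```

Note that CPython's built-in `sorted` has used the Powersort merge policy since 3.11, so swapping a real LLM call into `llm_is_larger` recovers one of the two sorting variants the paper evaluates.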
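The greedy template behind verbalized submodular maximization can also be sketched directly. The toy `coverage_gain` objective (passages as sets of covered facts) is an illustrative assumption standing in for the paper's LLM marginal-relevance prompt.

```python
def greedy_submodular(candidates, gain, budget):
    """Classic greedy for monotone submodular maximization: repeatedly
    add the candidate with the largest marginal gain. Achieves the
    (1 - 1/e) approximation to the best budget-sized set.
    gain(selected, c) plays the role of the LLM marginal-gain query."""
    selected = []
    remaining = list(candidates)
    for _ in range(budget):
        if not remaining:
            break
        best = max(remaining, key=lambda c: gain(selected, c))
        if gain(selected, best) <= 0:  # no candidate adds anything new
            break
        selected.append(best)
        remaining.remove(best)
    return selected

def coverage_gain(selected, candidate):
    """Marginal gain of a toy coverage objective: how many facts does
    this passage cover that the selected set does not already cover?"""
    covered = set().union(*selected) if selected else set()
    return len(set(candidate) - covered)
```

In the RAG setting described above, `gain` would instead issue a prompt like "Does adding this passage increase relevance?", and the control flow plus the 1 − e⁻¹ guarantee carry over unchanged.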
The authors also analyze practical aspects: KV‑cache reuse dramatically reduces the per‑query cost because most comparison prompts share a long prefix; batch processing and multi‑GPU parallelism further close the gap between theoretical and observed runtimes. They discuss the trade‑off between robustness (larger K) and latency, and note that prompt design remains critical—different domains may require tailored yes/no phrasing.
Limitations include dependence on LLM accuracy (high error rates inflate K), focus on English‑centric models (generalization to Korean or multilingual contexts is untested), and the need for careful prompt engineering to avoid systematic bias. Future work could extend VAs to other algorithmic families (dynamic programming, graph algorithms) and explore automated prompt optimization.
In conclusion, the paper presents a compelling hybrid paradigm: by confining LLMs to simple, well‑understood oracle calls, one can combine the expressive power of language models with the rigor of classical algorithms, achieving provable guarantees, superior empirical performance, and better scalability than end‑to‑end LLM prompting or formalization‑based pipelines. This work paves the way for more reliable, efficient, and theoretically grounded AI systems that leverage LLMs as modular components rather than monolithic reasoners.