Exploring Depth Generalization in Large Language Models for Solving Recursive Logic Tasks

Reading time: 5 minutes
...

📝 Original Info

  • Title: Exploring Depth Generalization in Large Language Models for Solving Recursive Logic Tasks
  • ArXiv ID: 2512.02677
  • Date: 2025-12-02
  • Authors: Zhiyuan He

📝 Abstract

Large language models have demonstrated remarkable capabilities across many tasks, yet face significant challenges when dealing with recursive reasoning problems, those requiring the resolution of nested hierarchical structures. While prior research has extensively studied length generalization (a model's ability to handle longer sequences than seen during training), we investigate a distinct and underexplored limitation: depth generalization. Here, depth refers to the number of nested levels in a hierarchical problem, such as the layers of parentheses in a mathematical expression or the nesting of logical clauses in a Boolean formula. Our work reveals that standard transformer architectures struggle with problems involving deeper recursion than encountered during training, even when they perform well on longer but non-nested sequences. This limitation stems from their inability to maintain stack-like behavior, the capacity to track and resolve multiple levels of nested dependencies. Through systematic analysis, we demonstrate how this architectural constraint leads to rapid performance decay as the depth of the recursion increases. To address this challenge, we develop a novel looped locate-and-replace pipeline that decomposes recursive problems into manageable subcomponents. The approach employs two specialized models: a locator that identifies solvable subexpressions and a replacer that evaluates these components while preserving the overall structure. We evaluated this method in three carefully designed domains: Boolean algebra, recursive arithmetic, and propositional logic, each with a controllable depth of recursion. We show that our method effectively alleviates the performance decay when tested on out-of-distribution recursion depth.
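The looped locate-and-replace pipeline described above can be sketched in a few lines. This is a hypothetical illustration, not the paper's implementation: the paper's locator and replacer are trained models, whereas here a regex stands in for the locator and Python's `eval` for the replacer.

```python
import re

# A subexpression enclosed in parentheses with no nested parens inside,
# i.e. an innermost, directly solvable component.
INNERMOST = re.compile(r"\(([^()]+)\)")

def locate(expr: str):
    """Stand-in locator: find one innermost solvable subexpression."""
    m = INNERMOST.search(expr)
    return (m.start(), m.end(), m.group(1)) if m else None

def replace(expr: str, start: int, end: int, sub: str) -> str:
    """Stand-in replacer: evaluate the located subexpression and splice
    the result back in, preserving the surrounding structure."""
    value = eval(sub)  # the paper uses a learned model here, not eval
    return expr[:start] + str(value) + expr[end:]

def solve(expr: str) -> str:
    """Loop locate-and-replace; each pass reduces recursion depth by one."""
    while (found := locate(expr)) is not None:
        expr = replace(expr, *found)
    return str(eval(expr))

print(solve("3 * (2 + (5 / 1))"))  # → 21.0
```

Because each iteration removes exactly one level of nesting, the loop runs in time proportional to the depth, which is why the decomposition sidesteps the out-of-distribution depth problem: each individual locate or replace call only ever sees shallow input.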

💡 Deep Analysis

Figure 1

📄 Full Content

Exploring Depth Generalization in Large Language Models for Solving Recursive Logic Tasks

Zhiyuan He
University College London
zcabebx@ucl.ac.uk

Abstract

Large language models have demonstrated remarkable capabilities across many tasks, yet face significant challenges when dealing with recursive reasoning problems, those requiring the resolution of nested hierarchical structures. While prior research has extensively studied length generalization (a model's ability to handle longer sequences than seen during training), we investigate a distinct and underexplored limitation: depth generalization. Here, depth refers to the number of nested levels in a hierarchical problem, such as the layers of parentheses in a mathematical expression or the nesting of logical clauses in a Boolean formula. Our work reveals that standard transformer architectures struggle with problems involving deeper recursion than encountered during training, even when they perform well on longer but non-nested sequences. This limitation stems from their inability to maintain stack-like behavior, the capacity to track and resolve multiple levels of nested dependencies. Through systematic analysis, we demonstrate how this architectural constraint leads to rapid performance decay as the depth of the recursion increases. To address this challenge, we develop a novel looped locate-and-replace pipeline that decomposes recursive problems into manageable subcomponents. The approach employs two specialized models: a locator that identifies solvable subexpressions and a replacer that evaluates these components while preserving the overall structure. We evaluated this method in three carefully designed domains: Boolean algebra, recursive arithmetic, and propositional logic, each with a controllable depth of recursion. We show that our method effectively alleviates the performance decay when tested on out-of-distribution recursion depth.
Introduction

Large language models (LLMs), particularly transformer-based architectures (Vaswani et al. 2017), have achieved remarkable success across diverse domains, from natural language processing to symbolic reasoning (Brown et al. 2020; Radford et al. 2019). However, as their adoption grows, understanding their fundamental limitations becomes critical. Recent research has extensively explored the boundaries of transformers, with length generalization (Lin et al. 2025; Xiao and Liu 2025; Cai et al. 2025; Zhou et al. 2024; Abbe et al. 2024b,a; Li et al. 2024; Anil et al. 2022), the ability to generalize to longer sequences than those seen during training, being a prominent focus. Tasks such as multi-digit arithmetic, copying, and sorting have served as benchmarks for these studies, revealing both strengths and failures in extrapolating to longer inputs.

Copyright © 2026, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Yet, length is only one dimension of generalization. In this work, we investigate a distinct and underexplored axis: depth generalization, where the complexity of a problem is measured by its recursion depth (e.g., the nesting level of hierarchical structures). While length generalization tests scalability, depth generalization probes a model's capacity for compositional reasoning and handling recursive patterns, a capability central to human cognition and formal systems. Recursion underpins many fundamental domains, including propositional logic (Pospesel 1974) (nested quantifiers and clauses), Boolean algebra (Sikorski et al. 1969) (compound expressions like (A ∧ (B ∨ ¬C))), and recursive arithmetic (nested operations like 3 * (2 + (5/1))). The depth of recursion reflects the complexity of hierarchical relationships, demanding models to track intermediate states and compose operations systematically.
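The depth measure used here, the maximum nesting level of a hierarchical expression, is easy to make concrete. A minimal sketch, assuming depth is counted as the deepest parenthesis nesting (so a flat expression has depth 0):

```python
def recursion_depth(expr: str) -> int:
    """Maximum parenthesis nesting level of an expression string.
    A flat expression with no parentheses has depth 0."""
    depth = deepest = 0
    for ch in expr:
        if ch == "(":
            depth += 1
            deepest = max(deepest, depth)
        elif ch == ")":
            depth -= 1
    return deepest

print(recursion_depth("3 * (2 + (5 / 1))"))  # → 2
print(recursion_depth("1 + 2 + 3"))          # → 0
```

Note the contrast the text draws: `"1 + 2 + 3 + 4"` can grow arbitrarily long while staying at depth 0, which is why depth generalization is a different axis from length generalization.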
For instance, evaluating an expression with depth k requires resolving k layers of nested dependencies, a challenge distinct from processing a flat sequence of length k. We hypothesize that while non-recursive tasks (e.g., multi-digit addition or sequence reversal) can often be solved via transformers' autoregressive nature or enhanced positional encodings, recursive problems pose a fundamentally harder challenge. Unlike linear sequences, recursive structures require stack-like behavior, the ability to push, pop, and backtrack through nested contexts, which transformers lack by design. Attention mechanisms, despite their global receptive field, struggle to implicitly manage dynamic stacks or resolve long-range dependencies across hierarchical layers. This limitation suggests that depth generalization may demand architectural innovations beyond standard positional biases or data scaling, such as explicit memory mechanisms or syntactic scaffolding.

Understanding this gap is essential for applications requiring rigorous symbolic reasoning, such as code generation (recursive function calls) (Allamanis
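The "stack-like behavior" the text refers to is classical parsing machinery, not the paper's method. A minimal sketch of an explicit-stack evaluator, assuming fully parenthesized binary arithmetic, shows exactly the push/pop bookkeeping a transformer would have to emulate implicitly:

```python
import operator

OPS = {"+": operator.add, "-": operator.sub,
       "*": operator.mul, "/": operator.truediv}

def eval_nested(expr: str) -> float:
    """Evaluate a fully parenthesized expression like (3 * (2 + (5 / 1)))
    with an explicit stack: push tokens on the way in, pop and resolve one
    (value op value) frame at every closing parenthesis."""
    stack = []
    for tok in expr.replace("(", " ( ").replace(")", " ) ").split():
        if tok == ")":
            # pop one complete frame: right operand, operator, left operand,
            # and the matching "(" marker, then push the resolved value
            right, op, left, _ = (stack.pop(), stack.pop(),
                                  stack.pop(), stack.pop())
            stack.append(OPS[op](left, right))
        elif tok in OPS or tok == "(":
            stack.append(tok)
        else:
            stack.append(float(tok))
    return stack.pop()

print(eval_nested("(3 * (2 + (5 / 1)))"))  # → 21.0
```

The stack's peak size here equals the recursion depth of the input, which makes the paper's point tangible: a deeper input requires tracking more simultaneous open contexts, not merely reading a longer sequence.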

📸 Image Gallery

app-layer-demo.png depth-vs-length.png full-pipeline.png layers-depth-inputs-v2.png lenghdepth.png step-solver.png

Reference

This content is AI-processed based on open access ArXiv data.
