Functorial Neural Architectures from Higher Inductive Types
Authors: Karen Sargsyan
Submitted to: ACT 2026

Karen Sargsyan, Institute of Chemistry, Academia Sinica, Taipei, Taiwan
karen.sarkisyan@gmail.com

Neural networks systematically fail at compositional generalization: producing correct outputs for novel combinations of known parts. We show that this failure is architectural: compositional generalization is equivalent to functoriality of the decoder, and this perspective yields both guarantees and impossibility results. We compile Higher Inductive Type (HIT) specifications into neural architectures via a monoidal functor from the path groupoid of a target space to a category of parametric maps: path constructors become generator networks, composition becomes structural concatenation, and 2-cells witnessing group relations become learned natural transformations. We prove that decoders assembled by structural concatenation of independently generated segments are strict monoidal functors (compositional by construction), while softmax self-attention is not functorial for any non-trivial compositional task. Both results are formalized in Cubical Agda. Experiments on three spaces validate the full hierarchy: on the torus (Z^2), functorial decoders outperform non-functorial ones by 2–2.7×; on S^1 ∨ S^1 (F_2), the type-A/B gap widens to 5.5–10×; on the Klein bottle (Z ⋊ Z), a learned 2-cell closes a 46% error gap on words exercising the group relation.

1 Introduction

A model that has learned to add 2-digit numbers should handle 5-digit numbers: the algorithm is the same, applied to more parts. A robot that can navigate around one obstacle should handle two obstacles by composing single-obstacle plans. A language model trained on simple commands should handle "go left then right" without retraining. In each case, the task decomposes into parts that combine by a known rule, and the model must respect that rule on inputs never seen during training.
This is compositional generalization, and neural networks systematically fail at it. Standard neural networks fail on SCAN [1] (linguistic composition), COGS [2] (semantic parsing), and multi-step arithmetic [3]. These failures are not capacity limitations: they persist as models scale.

We argue that the failure is architectural. Consider a decoder that handles a combined input by producing each part's output independently, then combining the results: D(w1 · w2) = D(w1) ⊕ D(w2). In categorical language, this equation says the decoder is a monoidal functor from the input algebra to the output algebra. We prove that a specific class of decoders (transport decoders) are monoidal functors (Theorem 3.3), and that softmax attention is not one, for any choice of parameters (Theorem 4.1). The reason is concrete: two different input sequences can represent the same compositional meaning (e.g., ab and ba in an abelian group), so a compositional decoder must produce the same output for both. But attention computes different key vectors for different token orderings, producing different outputs regardless of the learned weights.

The principle that compositional semantics is functoriality originates in DisCoCat [4, 5]. To realize this principle in learnable systems, one needs a categorical account of neural architectures themselves; categorical deep learning [6, 7, 8] provides this, organizing parametric maps into categories and formalizing learning and architectural constraints via functorial and algebraic structure. But in both programmes, building a functorial architecture for a given compositional structure (say, Z^2 for the torus, or F_2 for two-obstacle navigation) requires ad hoc engineering with no guarantee that the result is actually functorial.
We provide the missing step: a compilation functor from Higher Inductive Type (HIT) specifications to neural architectures, so that the algebraic structure of the task determines the architecture automatically, with compositional correctness by construction. Experiments show that the resulting architectures outperform non-functorial alternatives by 2–10×.

Contributions. (1) A compilation functor mapping HIT constructors (basepoints, loops, 2-cells) to architectural components (generator networks, structural concatenation, learned homotopies), so that compositional correctness is guaranteed by construction (§3). (2) Transport decoders (which concatenate independently generated segments) are strict monoidal functors; softmax attention is not, for any parameters. Both results are formalized in Cubical Agda (§4). (3) Experiments on T^2 (fundamental group π1 = Z^2, abelian), S^1 ∨ S^1 (π1 = F_2, free), and the Klein bottle (π1 = Z ⋊ Z, semidirect product) validate all three levels: winding constraints, monoidal composition, and a learned 2-cell witnessing a group relation (§5).

2 From Types to Categories

The three ingredients below (HIT specifications, the monoidal structure of π1, and parametric maps) combine in §3 to give the compilation functor. The reader familiar with HoTT and categorical deep learning may skip to §3.

2.1 Spaces from generators and relations

To build decoders with topological guarantees, we need a specification language that describes spaces by their generators and relations, much as a group presentation describes a group by generators and relations. Higher inductive types provide exactly this. In HoTT [9, 10], every type A has an identity type a =_A b. Elements p : a = b are paths; they compose (p · q : a = c), have inverses (p^{-1} : b = a), and satisfy groupoid laws up to higher coherence.
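To make the specification idea concrete, a 1-truncated HIT can be written down as plain data: a basepoint, a list of loop generators, and a list of 2-cell relations (pairs of words over the generators). The sketch below is hypothetical (the paper's actual compiler input is a Cubical Agda declaration, per §6); uppercase letters stand for inverses.

```python
from dataclasses import dataclass, field

@dataclass
class HIT:
    """A minimal 1-truncated HIT specification: one basepoint (implicit),
    loop generators, and 2-cell relations given as pairs of words.
    Uppercase letters denote inverse generators (e.g. 'A' = a^-1)."""
    name: str
    generators: list
    relations: list = field(default_factory=list)  # [(lhs_word, rhs_word)]

# The three spaces driving the paper, written as specification records:
torus = HIT("T^2", ["a", "b"], [("ab", "ba")])  # surf : ab = ba
wedge = HIT("S^1 v S^1", ["a", "b"])            # free: no 2-cells
klein = HIT("K", ["a", "b"], [("baB", "A")])    # rel : b a b^-1 = a^-1

# The number of 2-cells is the number of homotopy networks H_j to compile.
assert len(torus.relations) == 1
assert len(wedge.relations) == 0
```

Each record determines an architecture under Construction 3.2: one generator network per entry of `generators`, one learned homotopy per entry of `relations`.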
The loop space Ω(A, a) := (a = a) carries a group structure up to higher coherence under path composition; its set-truncation ∥Ω(A, a)∥_0 is the group π1(A, a). A Higher Inductive Type (HIT) specifies a type by listing its constructors at each dimension: a basepoint, loops (paths from the basepoint to itself), and 2-cells (homotopies between loops witnessing algebraic relations). Three examples drive the paper:

Torus T^2: base : T^2; loop_a, loop_b : base = base; surf : loop_a · loop_b = loop_b · loop_a. π1(T^2) = Z^2 (abelian; the 2-cell surf witnesses commutativity).

Wedge of circles S^1 ∨ S^1: base; loop_a, loop_b : base = base; no 2-cells. π1(S^1 ∨ S^1) = F_2 (free group; no relations, ab ≠ ba).

Klein bottle K: base; loop_a, loop_b; rel : loop_b · loop_a · loop_b^{-1} = loop_a^{-1}. π1(K) = ⟨a, b | bab^{-1} = a^{-1}⟩ ≅ Z ⋊ Z (non-trivial relation).

These three span a hierarchy: abelian with a 2-cell (T^2), non-abelian without one (S^1 ∨ S^1), and non-abelian with a non-trivial 2-cell (K). Each exercises a different level of the compilation functor in §3.

2.2 Composition as monoidal structure

The word "compositional" needs a precise meaning. We need a category whose monoidal product is word concatenation, so that a decoder preserving this product is compositional by definition, not by empirical evaluation on a test set. A group G = ⟨a_1, ..., a_k | r_1, ..., r_m⟩ can be viewed as a single-object category BG: the morphisms are elements of G, and composition is the group operation, i.e., word concatenation followed by reduction via the relations r_j. This category is monoidal under composition. When G = π1(X, x_0), the morphisms are homotopy classes of loops and composition is path concatenation. A decoder is then a functor D : BG → C into some target category.
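The "concatenation followed by reduction" reading of composition in BG can be sketched in a few lines. This toy (not from the paper; uppercase letters again denote inverses) contrasts reduction in the free group F_2 with the abelianization Z^2:

```python
def free_reduce(word):
    """Reduce a word in F2: cancel adjacent inverse pairs (aA, Aa, bB, Bb).
    The result is the unique reduced normal form of the group element."""
    out = []
    for ch in word:
        if out and out[-1] == ch.swapcase():
            out.pop()          # adjacent generator/inverse pair cancels
        else:
            out.append(ch)
    return "".join(out)

def abelianize(word):
    """Image in Z^2: only net exponents of a and b survive."""
    return (word.count("a") - word.count("A"),
            word.count("b") - word.count("B"))

# Composition in BG is concatenation followed by reduction via the relations:
assert free_reduce("ab") != free_reduce("ba")          # F2: ab != ba
assert abelianize("ab") == abelianize("ba") == (1, 1)  # Z^2: ab = ba
assert free_reduce("abBA") == ""                       # inverses cancel to e
```

The same concatenated word thus names different morphisms depending on which relations the target group imposes, which is exactly the ambiguity a functorial decoder must respect.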
If D preserves the monoidal product, D(w1 · w2) = D(w1) ⊕ D(w2), it is a monoidal functor, and compositional by construction.

2.3 Neural networks as parametric maps

To land the compilation functor, we need a target category whose morphisms are neural networks. Following Cruttwell et al. [6] and Gavranović et al. [7], a neural network layer is a parametric map (Θ, f : Θ × X → Y), a morphism in the category Para(Smooth) parameterized by weights θ ∈ Θ. Composing two layers pairs their parameter spaces: (Θ1 × Θ2, g ∘ f). Fong, Spivak, and Tuyéras [8] show that gradient descent defines a monoidal functor from Para to a category of learners, so anything landing in Para is trainable by backpropagation. Our compilation functor targets a subcategory: it maps each generator of π1(X) to a parametric loop in Para(Smooth), and composition in π1 maps to concatenation of loop segments. Functoriality of this map is compositional generalization. The next section makes this concrete.

3 From Specifications to Architectures

Let X be a pointed connected space with π1(X) = G = ⟨a_1, ..., a_k | r_1, ..., r_m⟩.

Definition 3.1 (Parametric loops). ParLoop(X) is the monoidal category with a single object ⋆ (the basepoint x_0), whose morphisms ⋆ → ⋆ are parametric loops (Θ, L : Θ × [0, 1] → X) with L(θ, 0) = L(θ, 1) = x_0 for all θ, and whose monoidal product is loop concatenation: (L1 ⊕ L2)(t) = L1(2t) for t ≤ 1/2, and L2(2t − 1) for t > 1/2.

Construction 3.2 (HIT compilation). Given a HIT for X with π1(X) = G, the compilation functor D : BG → ParLoop(X) is:

(i) Generators. For each generator a_i of G, a neural network g_{a_i}(θ_i, t) : [0, 1] → X (an MLP taking time t and outputting a point on X) producing a parametric loop whose winding number is constrained to [a_i] ∈ π1(X) by construction. For inverses: g_{a_i^{-1}}(θ_i, t) := g_{a_i}(θ_i, 1 − t).
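The concatenation formula of Definition 3.1 and the time-reversal rule for inverses can be written out directly. A minimal sketch on the unit circle (an illustrative stand-in for a trained generator network):

```python
import math

def concat_loops(L1, L2):
    """(L1 (+) L2)(t) from Definition 3.1: run L1 at double speed on
    [0, 1/2], then L2 on (1/2, 1]."""
    return lambda t: L1(2 * t) if t <= 0.5 else L2(2 * t - 1)

def loop_a(t):
    """A loop on the unit circle based at x0 = (1, 0), winding once."""
    return (math.cos(2 * math.pi * t), math.sin(2 * math.pi * t))

def loop_a_inv(t):
    """Inverse loop: the same path traversed backwards, g(theta, 1 - t)."""
    return loop_a(1 - t)

L = concat_loops(loop_a, loop_a_inv)
# Concatenation preserves the basepoint condition L(0) = L(1) = x0,
# so the result is again a morphism * -> * in ParLoop.
assert L(0.0) == (1.0, 0.0) and L(1.0) == (1.0, 0.0)
```

Note that `concat_loops` is associative only up to reparametrization of [0, 1], which is exactly the subtlety Remark 3.4 addresses with the discrete list-append representation.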
(ii) Composition. For a word w = a_{i_1} ··· a_{i_L}, D(w) := g_{a_{i_1}} ⊕ ··· ⊕ g_{a_{i_L}} (structural concatenation of the L loop segments).

(iii) 2-cells. For each relation r_j : LHS_j = RHS_j, a parametric homotopy H_j : Θ_{H_j} × [0, 1]_s × [0, 1]_t → X with boundary conditions H_j(−, 0, −) = D(LHS_j) and H_j(−, 1, −) = D(RHS_j): a continuous deformation from one loop to the other, implemented as a separate MLP.

Theorem 3.3 (Transport composition = strict functoriality). Let D be the transport decoder from Construction 3.2(i–ii).

(a) Free monoid. For all words w1, w2 over the generators and for all parameter values θ: D(w1 · w2) = D(w1) ⊕ D(w2). That is, D is a strict monoidal functor from the free monoid on the generators to ParLoop(X). This is an architectural identity, not a learned approximation.

(b) Extension to G. With the 2-cells of Construction 3.2(iii), D respects the group relations up to homotopy: for each relation r_j : LHS_j = RHS_j, the learned homotopy H_j provides a path D(LHS_j) ≃ D(RHS_j) in the loop space of X. The result is a functor from BG to the homotopy category of parametric loops: strict on composition, with H_j witnessing each relation.

Proof. Part (a). Write w1 = a_{i_1} ··· a_{i_n} and w2 = a_{j_1} ··· a_{j_m}. By Construction 3.2(ii):

    D(w1 · w2) = g_{a_{i_1}} ⊕ ··· ⊕ g_{a_{i_n}} ⊕ g_{a_{j_1}} ⊕ ··· ⊕ g_{a_{j_m}} = D(w1) ⊕ D(w2),

using associativity of ⊕. The unit law D(e) = x_0 holds by convention (the empty concatenation is the constant loop at x_0).

Part (b). Each H_j is a parametric homotopy satisfying the boundary conditions of Construction 3.2(iii). By definition, this makes D(LHS_j) and D(RHS_j) equal in the set-truncation ∥Ω(X, x_0)∥_0 = π1(X), so D descends to a well-defined map on the quotient BG.
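Theorem 3.3(a) is an architectural identity, and it can be checked mechanically on a toy model. In the sketch below, each "generator network" is a stand-in (a fixed random point cloud; a real g_a would be an MLP over t ∈ [0, 1]), and ⊕ is the discrete list-append of the implementation:

```python
import random

def make_generator(seed, n_pts=32):
    """Toy stand-in for a generator network g_a(theta, t): here a fixed
    random point cloud per generator, playing the role of one loop segment."""
    rng = random.Random(seed)
    cloud = [(rng.random(), rng.random()) for _ in range(n_pts)]
    return lambda: cloud

gens = {"a": make_generator(0), "b": make_generator(1)}

def D(word):
    """Transport decoder: generate each segment independently from its
    generator network, then structurally concatenate (list-append)."""
    out = []
    for ch in word:
        out.extend(gens[ch]())   # the monoidal product (+)
    return out

w1, w2 = "ab", "ba"
# Strict monoidal functoriality: holds for every word and every "parameter"
# choice, by construction rather than by training.
assert D(w1 + w2) == D(w1) + D(w2)
assert D("") == []   # unit law: the empty word is the empty concatenation
```

No property of the individual segments is used in the assertion: the identity D(w1 · w2) = D(w1) ⊕ D(w2) follows purely from how the outputs are assembled, which is the content of Theorem 3.3(a).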
Formalized in Cubical Agda (TransportCoherence.agda).

Remark 3.4 (Strict vs. up-to-reparametrization). The continuous concatenation formula of Definition 3.1 (run L1 at double speed, then L2) is associative only up to reparametrization of [0, 1]. In the implementation, ⊕ is point-cloud list-append, which is strictly associative. The strict functoriality of Theorem 3.3(a) relies on this discrete representation. In the continuous setting, the same result holds as a monoidal functor up to coherent reparametrization isomorphisms, a standard situation in homotopy theory, where path concatenation is associative only up to higher coherence. Put simply: the transport decoder composes by list-append, and list-append is associative with the empty list as unit. No learning is involved in the composition step; it is a structural identity that holds for every parameter value and every word length.

Remark 3.5 (When the 2-cell matters). On T^2 and S^1 ∨ S^1, the 2-cell H is either trivially satisfiable (abelian: any ordering is homotopic) or absent (free: no relations to witness). On K, the relation bab^{-1} = a^{-1} is non-trivial: after composing g_b ⊕ g_a ⊕ g_{b^{-1}}, the result must be homotopic to g_{a^{-1}}. Without H, the transport decoder generates geometrically correct segments but globally incoherent loops; it ignores the frame flip. With H, the decoder learns the deformation.

The topological constraints do not destroy geometric expressivity. Each generator g_{a_i}(θ_i) can represent any continuous loop in its homotopy class (by the universal approximation theorem for MLPs); the constraint acts only between classes, enforcing that D(w) is assembled from these generators. An alternative architecture, the cover decoder, assigns an independent shape to each homotopy class, giving strictly more geometric freedom. But this extra freedom is precisely the freedom to be incoherent: to produce outputs that violate composition.
Functoriality trades inter-class freedom for the guarantee (Appendix J). This motivates a formal distinction:

Definition 3.6 (Type-A / Type-B). An architecture is type-B (functorial) if its decoder defines a monoidal functor G → ParLoop(X) for all parameter values. It is type-A (non-functorial) otherwise. Type-B architectures generate each loop segment independently from the corresponding generator network, then concatenate. Type-A architectures use cross-segment dependencies (e.g., attention between positions belonging to different generators), breaking the monoidal factorization D(w1 · w2) = D(w1) ⊕ D(w2).

4 What Attention Cannot Compose

Construction 3.2 produces type-B architectures. We now show that the dominant composition mechanism in modern architectures, softmax attention, is inherently type-A: no parameter setting makes it functorial. Softmax self-attention [11] aggregates information across positions via content-dependent weights: each position attends to every other position based on learned similarity between their representations, so every position's output depends on the tokens at all other positions. (The standard formulation in terms of query, key, and value matrices is recalled in Appendix A.)

Theorem 4.1 (Attention is not functorial). Let T_θ be a transformer with softmax self-attention, and let G be a non-trivial group. There exist no parameters θ for which T_θ defines a monoidal functor G → ParLoop(X).

Proof sketch (full proof in Appendix A). Suppose for contradiction that T is functorial, so T(w1 · w2) = T(w1) ⊕ T(w2). Since ⊕ is a function of T(w1) and T(w2) alone, the output for the w1-segment can depend on w2 only through T(w2): any two words representing the same group element must produce the same effect on w1's output.
But attention violates this: at each layer, position i ∈ w1 attends to positions j ∈ w2 via key vectors K_j^h determined by the token embeddings of w2, not by T(w2). Since G is non-trivial, there exist distinct words w2 ≠ w2′ with [w2] = [w2′] ∈ G: same group element, different token sequences. Under any standard embedding (which maps distinct generator symbols to distinct vectors), these produce different key vectors, causing w1's output to differ between T(w1 · w2) and T(w1 · w2′), contradicting T(w2) = T(w2′).

In short: attention is a content-based routing mechanism, and content-based routing cannot be compositional. A functorial decoder must treat all words representing the same group element identically, but attention reads the tokens themselves, not their equivalence class, so it distinguishes words that the functor must identify.

Scaling prediction. For type-B decoders, the per-segment error is O(1) as word length grows: each segment is generated independently by the same network. For type-A decoders, per-segment error degrades beyond the training word length because attention patterns over longer sequences are out-of-distribution (proved formally in Appendix I). The experiments in §5 confirm this across all three spaces.

Depth obstruction for nonsolvable groups. The functoriality obstruction (Theorem 4.1) applies to all non-trivial groups, abelian or not. For groups with nonsolvable finite quotients (such as F_2 ↠ S_5), a separate depth obstruction compounds the problem: prefix products are NC^1-complete [12], outside the AC^0 circuit class that fixed-depth transformers implement. For solvable groups (such as Z ⋊ Z), constant-depth shortcuts exist [21] and only the functoriality obstruction applies. The full trichotomy (abelian, solvable non-abelian, nonsolvable) is developed in Appendix H (Corollary H.2).
Section 5.2 (F_2) tests both obstructions simultaneously; Section 5.3 (Z ⋊ Z, solvable) isolates functoriality alone.

Relationship to DisCoCat and categorical deep learning. Theorem 4.1 provides a structural impossibility result absent from prior work on functorial semantics: softmax cross-attention is incompatible with monoidal functoriality, whether the codomain is FdVect (as in DisCoCat [4, 5]) or ParLoop(X). A stronger version (Appendix F) drops group structure entirely, requiring only sequential compositionality. The categorical deep learning programme [6, 8, 7] provides the semantic foundations we build on (§2); our contribution extends it from analysis to synthesis, deriving functorial architectures from type-theoretic specifications rather than analyzing existing ones.

5 Three Spaces, Three Predictions

The theorems make testable predictions: type-B decoders should show constant per-segment error as word length L grows (Theorem 3.3), while type-A decoders should degrade (Theorem 4.1). We test on three spaces spanning the HIT hierarchy.

Task. The task is continuous geometric generation: given a word w = w_1 ··· w_L over generators of π1(X), produce a point cloud γ̂ ⊂ R^d approximating a loop on X in the homotopy class [w], a curve that winds around X in the way prescribed by w (with d = 3 for T^2 and S^1 ∨ S^1; d = 4 for K). Training uses all words of length ≤ 2 (6 words for T^2 and S^1 ∨ S^1; 16 words for K, which has inverses); test words have length 3, 4, 6, 8, 10, never seen during training. This is up to a 5× length extrapolation test. Ground-truth loops and data generation details are in Appendix B.

Architectures. Table 1 summarizes the six decoders tested, classified by which HIT constructors they compile and whether they are functorial (type-B) or not (type-A).
The type-A/B classification is determined by a single question: does the decoder structurally compose outputs from independently generated parts, or does it allow cross-segment information flow?

Table 1: Architectures tested on T^2. The other two spaces use a subset of these (see Tables 3, 4). "Compiled from HIT" indicates which level of Construction 3.2 the architecture implements: winding = hard topological constraint only, generators = independent loop networks, all = generators plus learned 2-cell H. Type-A decoders allow cross-segment information flow or lack compositional structure; type-B decoders compose independently generated segments.

Architecture    Type  Compiled from HIT     Params  Why type-A or B?
Transf. (WC)    A     winding only          585K    Attn. mixes segments
Cover           A     winding only          205K    No composition
Transp. Attn.   A     winding + positions   585K    Correct positions, attn. mixes
Transport       B     winding + generators  170K    Struct. concatenation
Homotopy        B     all (incl. 2-cell)    195K    Concat. + learned H

Type-A decoders (non-functorial). The transformer (WC) augments a standard transformer with a hard winding constraint: angle increments sum to the correct winding number by construction, but attention still mixes information across segments. (An unconstrained transformer achieves 0% winding accuracy and is excluded from all experiments.) The cover decoder assigns a single independent network to each homotopy class; it respects winding but has no compositional structure, since each class is generated from scratch rather than assembled from parts. It is type-A not because of cross-segment information flow, but because it lacks monoidal factorization entirely. The transport attention decoder is designed to isolate the effect of the composition method: our framework derives its positional encoding (Appendix G), it has 3.4× more parameters than the transport decoder, and it shares the winding constraint; yet it is type-A because attention still mixes across segments. If this decoder degrades on longer words, the cause is the composition method, not the conditioning.

Type-B decoders (functorial). The transport decoder is the Construction 3.2(i–ii) compilation: independent generator networks whose outputs are structurally concatenated. The homotopy decoder adds the learned 2-cell H from Construction 3.2(iii), which witnesses group relations as continuous deformations between loops.

All architectures are trained with the same protocol: Chamfer loss optimized by AdamW (lr = 10^{-3}, weight decay 10^{-4}), cosine learning rate schedule, 500 epochs max with early stopping (patience 80). On T^2, all architectures converge to comparable training loss (2.23–2.27); on S^1 ∨ S^1, training losses differ (a matched-loss ablation in Appendix C confirms the gap is architectural, not statistical). Results are aggregated over 3 random seeds (mean ± std). Full hyperparameters are in Appendix B.

Metrics. The Chamfer distance d_C(P, Q) is the symmetric average nearest-neighbor squared distance between two point clouds, a standard metric in geometric deep learning, insensitive to parameterization (formula in Appendix B).

Definition 5.1 (Per-segment Chamfer distance). Given a generated loop γ̂ and ground-truth γ for a word w of length L, both loops are divided into L segments (one per generator), each resampled to a fixed resolution of n_pts = 32 points. The per-segment Chamfer distance is:

    d̄_L(w) = (1/L) Σ_{i=1}^{L} d_C(γ̂_i, γ_i)

For a functorial decoder, d̄_L = ε_gen (constant), independent of L. This is the primary metric in all experiments. Circle accuracy (for S^1 ∨ S^1 only): the fraction of generated segments whose centroid is closer to the correct circle (A or B) than to the wrong one.
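Both metrics admit short reference implementations. The sketch below follows the verbal definitions given here (the paper's exact formulas are in its Appendix B); the circle-accuracy check uses the circle centers as the stated centroid-based proxy:

```python
def chamfer(P, Q):
    """d_C: symmetric average nearest-neighbor squared distance
    between two point clouds (lists of coordinate tuples)."""
    def sq(p, q):
        return sum((pi - qi) ** 2 for pi, qi in zip(p, q))
    a = sum(min(sq(p, q) for q in Q) for p in P) / len(P)
    b = sum(min(sq(q, p) for p in P) for q in Q) / len(Q)
    return 0.5 * (a + b)

def per_segment_chamfer(gen_segments, gt_segments):
    """d-bar_L of Definition 5.1: mean Chamfer over the L segments."""
    L = len(gt_segments)
    return sum(chamfer(g, t) for g, t in zip(gen_segments, gt_segments)) / L

def circle_accuracy(segments, labels,
                    center_a=(-1.0, 0.0, 0.0), center_b=(0.0, 0.0, -1.0)):
    """Fraction of segments whose centroid is nearer the correct circle's
    center; defaults follow the S^1 v S^1 embedding of Section 5.2."""
    def d2(p, c):
        return sum((pi - ci) ** 2 for pi, ci in zip(p, c))
    hits = 0
    for seg, lab in zip(segments, labels):
        cen = tuple(sum(p[i] for p in seg) / len(seg) for i in range(3))
        near_a = d2(cen, center_a) < d2(cen, center_b)
        hits += (near_a and lab == "a") or (not near_a and lab == "b")
    return hits / len(segments)

# Identical segments score zero; a unit-shifted segment contributes 1.0.
seg = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
shifted = [(0.0, 1.0, 0.0), (1.0, 1.0, 0.0)]
assert per_segment_chamfer([seg], [seg]) == 0.0
assert per_segment_chamfer([seg, shifted], [seg, seg]) == 0.5
```

Because Chamfer compares unordered point sets, it is insensitive to how each segment is parameterized, which is why it is the right loss for loops defined only up to reparametrization.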
This detects a failure mode specific to non-abelian structure: a decoder that cannot distinguish which generator to trace will scatter segments across both circles.

5.1 Experiment 1: Torus T^2 (π1 = Z^2, abelian)

T^2 is embedded in R^3 with major radius R = 2.0 and minor radius r = 0.8. The generators a, b trace loops around the two cycles.

Table 2: Torus T^2: per-segment Chamfer distance d̄_L (mean ± std, 3 seeds) at increasing word length L. Training uses L ≤ 2; test lengths 3–10 are never seen. Type-B decoders (bottom two rows) stabilize; type-A decoders (top three rows) stagnate or degrade despite more parameters.

Architecture   Type  L=2        L=4        L=6        L=8        L=10
Transf. (WC)   A     1.89±.12   1.42±.31   1.42±.37   1.53±.51   1.54±.34
Cover          A     2.15±.31   1.65±.11   2.07±.06   1.98±.20   1.91±.10
Transp. Attn.  A     1.83±.18   1.63±.17   1.77±.21   2.05±.19   2.09±.13
Transport      B     1.68±.22   0.86±.12   0.74±.03   0.73±.03   0.77±.03
Homotopy       B     1.68±.19   0.93±.09   0.74±.03   0.74±.02   0.80±.03

At L = 10, the type-B range is 0.77–0.80 and the type-A range is 1.54–2.09: a 2–2.7× gap. The transport attention decoder is especially informative: it shares the winding constraint, uses theoretically motivated positional encoding, and has 3.4× more parameters (585K vs. 170K), yet it degrades because cross-segment attention breaks functoriality. Correct conditioning cannot substitute for correct composition. Two further observations sharpen the interpretation. First, the cover decoder has hard winding and factors through Z^2, yet stagnates (2.15 → 1.91): a correct homotopy class is necessary but not sufficient; composition structure is required. Second, all architectures converge to comparable training loss (2.23–2.27; Appendix B), so the gap is not a training artifact.
On this abelian space, transport ≈ homotopy: the proof term is unnecessary because Z^2 has no non-trivial relations. The non-abelian experiments below remove the letter-counting crutch.

5.2 Experiment 2: S^1 ∨ S^1 (π1 = F_2, non-abelian free)

The theory makes a sharp prediction: the type-A/B gap should widen dramatically. On T^2, letter-counting partially compensated for the lack of functoriality; on S^1 ∨ S^1, the non-abelian fundamental group F_2 removes this crutch. The decoder must now produce different loops for different orderings of the same letters (ab ≠ ba in F_2), and an architecture that collapses word order will fail categorically, not just quantitatively.

Setup. S^1 ∨ S^1 is embedded in R^3 as two unit circles meeting at the origin: circle A in the xy-plane centered at (−1, 0, 0), circle B in the yz-plane centered at (0, 0, −1). A word w = w_1 ··· w_L traces the loop that follows circle w_1, returns to the wedge point, follows circle w_2, and so on. A sequential decoder (GRU: a gated recurrent network that processes the word left-to-right, maintaining a hidden state) is included as a type-A baseline that respects sequential order.

Table 3: Wedge of circles S^1 ∨ S^1 (π1 = F_2, non-abelian): per-segment Chamfer and circle accuracy. Non-abelian structure dramatically amplifies the type-A/B gap. Full L-progression in Appendix D.

                    Per-seg Chamfer d̄_L               Circle acc. (%)
Architecture  Type  L=2        L=6        L=10        L=2   L=10
Transformer   A     .152±.002  .411±.027  .537±.027   33    14
Sequential    A     .010±.001  .173±.021  .297±.027   67    11
Transport     B     .002±.001  .018±.008  .054±.019   100   100

The non-abelian separation. The gap widens to 5.5–10×: transport achieves 0.054 at L = 10, while the transformer reaches 0.537 (10×) and the sequential decoder 0.297 (5.5×).

Topological collapse. Circle accuracy is the sharpest diagnostic.
Even at L = 2 (a training length), the transformer assigns only 33% of segments to the correct circle; by L = 10 this drops to 14%. It cannot even distinguish which generator to trace, producing points near an arbitrary mixture of the two circles. This is a categorically different failure from T^2, where type-A decoders at least produced loops on the torus (with wrong per-segment geometry but correct topology). On S^1 ∨ S^1, the transformer collapses the non-abelian structure entirely, producing outputs that are topologically meaningless. The transport decoder maintains 100% circle accuracy at all lengths.

The sequential decoder. The GRU outperforms the transformer at all lengths (0.297 vs. 0.537 at L = 10), consistent with the depth trichotomy (Appendix H): sequential processing helps with the prefix products that fixed-depth attention cannot compute. However, the GRU is still type-A (its segment generation depends on accumulated context, not structural composition), and it still degrades: per-segment Chamfer increases from 0.010 to 0.297 (30×). Sequential processing is necessary but not sufficient for compositional generalization in non-abelian groups.

A matched-loss ablation (Appendix C) rules out training quality as the explanation: retraining type-A architectures with double epochs to match the best type-B training loss barely changes extrapolation (transformer d̄_10: 0.537 → 0.491, still 9× worse; sequential: 0.297 → 0.282, still 5× worse). The gap is architectural, not statistical.

5.3 Experiment 3: Klein bottle K (π1 = Z ⋊ Z): the 2-cell

The first two experiments test levels 0 and 1 of the hierarchy: winding constraints and monoidal composition. The Klein bottle tests level 2: the learned proof term H. The relation bab^{-1} = a^{-1} means that traversing the b-generator flips the orientation of a: after going around b, the a-direction reverses.
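The frame-flip bookkeeping that the 2-cell must track can be made concrete. A small sketch (illustrative, not from the paper; uppercase letters denote inverses) computes the normal form a^m b^n of a word in π1(K) = Z ⋊ Z, where each traversal of b negates the contribution of subsequent a's:

```python
def klein_normal_form(word):
    """Normal form (m, n) ~ a^m b^n in <a, b | b a b^-1 = a^-1>.
    Multiplying a^m b^n by a on the right gives a^(m + (-1)^n) b^n,
    so the parity of the accumulated b-count flips each a's sign."""
    m, n = 0, 0
    for ch in word:
        if ch in "aA":
            da = 1 if ch == "a" else -1
            m += da if n % 2 == 0 else -da   # the frame flip after b
        else:
            n += 1 if ch == "b" else -1
    return (m, n)

# The defining relation holds: b a b^-1 equals a^-1.
assert klein_normal_form("baB") == klein_normal_form("A") == (-1, 0)
# Non-abelian: ab and ba land in different homotopy classes on K.
assert klein_normal_form("ab") != klein_normal_form("ba")
# Canonical words (all a's before all b's) never exercise the relation.
assert klein_normal_form("aabb") == (2, 2)
```

A transport decoder without H concatenates segments in word order but never applies the sign flip in the second branch above; the learned homotopy H is what supplies that correction on non-canonical words.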
The theory predicts that H should matter only for words that exercise this relation, and should be dormant otherwise.

The group Z ⋊ Z is solvable, so the depth obstruction of §4 does not apply: constant-depth transformers have sufficient computational power for this group in principle [21]. The transformer's failure on K is therefore entirely due to the functoriality obstruction (Theorem 4.1), making K a cleaner test of functoriality than S^1 ∨ S^1.

K is embedded in R^4 via the standard half-angle twist parameterization. We test on two classes of words. Canonical words have all a-generators before all b-generators (e.g., aabb): the relation is not exercised, so the frame never flips mid-word. Non-canonical words interleave a and b (e.g., abab, baa): the relation is exercised, requiring the decoder to track the flip.

Table 4: Klein bottle: per-segment Chamfer at L = 10, split by word type. The homotopy decoder (type-B with proof term H) closes the gap on non-canonical words.

Architecture   Type  Canonical  Non-canonical
Cover          A     1.57±.10   1.96±.07
Transf. (WC)   A     2.28±.35   2.26±.36
Transport      B     0.84±.03   1.52±.08
Homotopy       B     0.82±.03   0.82±.06

The three decoders respond differently. The cover decoder factors through the abelianization Z ⋊ Z → Z^2, which collapses the relation entirely; it cannot distinguish ba from ab even though these represent different homotopy classes on K. The transport decoder concatenates generators in word order and respects this distinction, but ignores the frame flip after b. Only the homotopy decoder, with its learned proof term H witnessing the relation, tracks the frame correctly.

Three results. (1) The proof term closes a 46% gap. On non-canonical words, homotopy achieves 0.82 vs. transport's 1.52, a 1.85× improvement entirely attributable to the learned 2-morphism H. (2) On canonical words, transport ≈ homotopy (0.84 vs. 0.82): when the relation is not exercised, H is dormant and adds no benefit. This confirms that the improvement is specifically due to the relation, not to additional capacity. (3) Homotopy eliminates the canonical/non-canonical asymmetry: transport's non-canonical/canonical ratio is 1.8×; homotopy's is 1.0×.

5.4 Cross-experiment summary

Table 5: Cross-experiment summary at L = 10. "Best A / best B" gives the tightest type comparison. K values are overall averages across canonical and non-canonical words (Table 4 shows the split). The Klein bottle is the only space where the 2-cell matters.

                        T^2 (Z^2)    S^1 ∨ S^1 (F_2)  K (Z ⋊ Z)
Best type-B d̄_10       0.77         0.054            0.82
Best type-A d̄_10       1.54         0.297            1.76
Gap (best A / best B)   2.0×         5.5×             2.1×
Transport vs. Homotopy  ≈ 1.0        ≈ 1.0            1.85×
2-cell needed?          No (abelian) No (free)        Yes (relation)

Each experiment validates a different level of the functorial hierarchy. T^2: winding constraints and monoidal composition (2–2.7× gap). S^1 ∨ S^1: non-abelian amplification; attention is especially harmful when segment identity (which circle) must be preserved. K: the learned 2-cell (natural transformation). Only the full HIT compilation, functorial composition plus proof terms, achieves uniformly low error across all spaces and word orderings.

6 Formalization and Verified Machine Learning

The experiments confirm that functorial architectures outperform non-functorial ones, but the theoretical guarantees are stronger than any finite experiment can demonstrate, because they hold for all parameter values and all word lengths. To make this claim precise, we formalize the core results in a proof assistant. This inverts the standard approach to neural network verification. Post-hoc verification [13] attempts to prove properties of a trained network, which is computationally intractable for all but the simplest properties.
Our approach verifies at design time: given a property P (expressed as a homotopy type), construct an architecture that satisfies P by construction, for all parameter values. Training adjusts geometric detail within the topologically correct constraint space; it cannot violate the type-theoretic guarantees.

The core positive and negative results are formalized in Cubical Agda (v2.6.4, --cubical --safe, no postulates) in four modules. Torus.agda: T² as a HIT; transport commutativity (tr_a ∘ tr_b = tr_b ∘ tr_a) holds definitionally. WedgeOfCircles.agda: ab ≠ ba in F₂ (proved by case analysis on head letters). TransportCoherence.agda: Theorem 3.3—the transport decoder preserves monoidal structure—proved by induction on the first word, using associativity of concatenation. NonCompositionality.agda: the abstract schema of Theorem 4.1—transport coherence and global mixing are contradictory—proved as an impossibility theorem over any output type. The instantiation to softmax (that α_ij > 0 for all i, j makes attention globally mixing) relies on the standard positivity of softmax, which is not formalized. The Klein bottle relation and proof term H are not yet formalized; this would require extending the word type with inverses and formalizing the boundary conditions of the homotopy.

The guarantees form a hierarchy mirroring the Postnikov tower of the target space:

Level   Guarantee              Mechanism                 Validated by
0       Correct π₁ class       Hard winding constraint   All 3 spaces
1       Monoidal composition   Generator concatenation   T², S¹ ∨ S¹
2       Group relations        Learned proof term H      Klein bottle K
≥ 3     π₂, π₃, ...            Higher-dim. cells         Future work

Each level adds architectural constraints that narrow the output space while preserving within-class expressivity (Appendix J). The three experiments validate levels 0–2; extending to higher homotopy groups remains open.
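As a plain-language companion to the formalized statements, the underlying algebra can be checked directly. The sketch below (Python, not the Agda development; the presentations are the standard ones and are assumptions of this sketch) shows torus transports commuting, ab ≠ ba in the Klein bottle group Z ⋊ Z, and the relation b a b⁻¹ = a⁻¹:

```python
# Torus: transports act on the two angle coordinates independently,
# so they commute (cf. Torus.agda: tr_a ∘ tr_b = tr_b ∘ tr_a).
def tr_a(p):
    u, v = p
    return ((u + 0.3) % 1.0, v)

def tr_b(p):
    u, v = p
    return (u, (v + 0.4) % 1.0)

p = (0.15, 0.25)
assert tr_a(tr_b(p)) == tr_b(tr_a(p))

# Klein bottle group Z ⋊ Z, with b acting on a by inversion:
# (m1, n1) * (m2, n2) = (m1 + (-1)**n1 * m2, n1 + n2).
def k_mul(g, h):
    (m1, n1), (m2, n2) = g, h
    return (m1 + (-1) ** n1 * m2, n1 + n2)

a, b = (1, 0), (0, 1)
a_inv, b_inv = (-1, 0), (0, -1)

assert k_mul(a, b) != k_mul(b, a)           # non-abelian: ab ≠ ba
assert k_mul(k_mul(b, a), b_inv) == a_inv   # relation b a b⁻¹ = a⁻¹

# Letter counts (the abelianized view a cover-style decoder sees)
# collapse exactly the distinction the relation creates:
def counts(word):
    return (word.count("a"), word.count("b"))

assert counts("ab") == counts("ba")
```

The last assertion is the algebraic core of why the cover decoder cannot distinguish ba from ab in Table 4.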
The result is a verified ML pipeline: specify (write a HIT), verify (prove properties in Cubical Agda), compile (apply Construction 3.2), train (standard gradient descent). The guarantees hold for the trained network at any parameter value, because they are properties of the architecture, not of the learned weights. To our knowledge, this is the first instance where machine-verified proofs provide compositional generalization guarantees for a neural architecture.

The long-term vision is a compiler from HIT specifications to certified neural architectures: parse a HIT declaration in Cubical Agda to extract generators, paths, and higher cells; assign each constructor to an architectural component per Construction 3.2; generate the hard constraints (winding, boundary conditions) as differentiable layers; and output a training script. This would make type-theoretic guarantees accessible to practitioners who can specify their domain's topology without requiring knowledge of HoTT.

7 Related Work

The framework draws on three lines of prior work: categorical foundations for deep learning, equivariant network design, and the empirical study of compositional generalization failures.

Categorical deep learning. Cruttwell et al. [6] formalize gradient-based learning via lenses, parametric maps, and reverse derivative categories. Gavranović et al. [7] propose monads in a 2-category of parametric maps as a unified theory of architectures, recovering geometric deep learning constraints. Fong, Spivak, and Tuyéras [8] showed backpropagation is a functor; Spivak's broader programme on compositionality [14] grounds the philosophical claim that systems compose functorially. We extend this lineage by constructing (not merely analyzing) functorial architectures from type-theoretic specifications.

Equivariant networks. Cohen and Welling [15] and Weiler and Cesa [16] enforce f(g · x) = g · f(x)—a constraint on a single morphism.
Our functoriality is a constraint on the entire decoder: D(g₁ · g₂) = D(g₁) ⊕ D(g₂). The distinction is not merely formal: the transport attention decoder (§5.1) has equivariant positional encoding derived from group theory, shares the winding constraint, and has 3.4× more parameters than the transport decoder—yet degrades on longer words because cross-segment attention breaks functoriality. Equivariant conditioning (a morphism-level property) does not imply compositional generalization (a functor-level property).

Compositional generalization in ML. Lake and Baroni [1], Kim and Linzen [2], and Dziri et al. [3] document systematic failures. Kobayashi et al. [17] find that an architectural bottleneck enables compositionality—our type-A/B distinction provides the categorical explanation: the bottleneck enforces monoidal factorization, i.e., functoriality.

Transformer expressivity and group composition. Liu et al. [21] characterize the depth at which transformers can simulate finite-state automata, showing that solvable automata admit O(1)-depth shortcuts (via Krohn-Rhodes decomposition) while nonsolvable automata require Ω(log T) depth (via Barrington's theorem). Their results yield the solvable/nonsolvable trichotomy in Corollary H.2 and make the Klein bottle (Z ⋊ Z, solvable) a clean test of functoriality rather than depth. Concurrent work by Marchetti et al. [20] studies the same problem from the perspective of learning dynamics, proving width and depth separations for finite group composition. Their analysis is complementary: they characterize how networks learn group composition; we characterize which architectures preserve it and compile them from type-theoretic specifications.

We do not compare against methods developed for SCAN, COGS, or CFQ because the task types are incommensurable: discrete sequence transduction vs.
continuous geometric generation, with exact-match accuracy vs. Chamfer distance. Running a SCAN-specialized architecture on loop generation, or our transport decoder on command parsing, would be a category error. However, the theoretical connection is direct: Theorem 4.1 (and its stronger form, Appendix F) applies to SCAN—the failure to generalize from a primitive command ("jump" seen in isolation) to its composed forms ("jump twice", "jump around left", etc.) is an instance of cross-segment dependencies corrupting compositional structure.

8 Conclusion

Compositional generalization is functoriality. HIT specifications provide a systematic source of functorial architectures: the compilation functor maps type-theoretic constructors to architectural components, guaranteeing monoidal preservation. Three experiments validate the full hierarchy—winding constraints, monoidal composition, and 2-cells witnessing group relations—with the Klein bottle experiment demonstrating the first neural architecture where a natural transformation is both theoretically necessary and empirically measurable.

The design principle is immediately actionable: when a task has compositional structure, use structural composition; attention-based architectures cannot learn this structure. This shifts the question from "can the network learn to compose?" to "does the architecture guarantee functoriality?" The plane with k obstacles has π₁ = F_k—our S¹ ∨ S¹ experiment (π₁ = F₂) tests the same algebraic structure as 2-obstacle path planning, and the 5.5–10× gap quantifies the advantage. Any domain with compositional structure—modular programs, multi-step plans, molecular ring systems—can be specified as a HIT and compiled into a certified architecture via the specify–verify–compile–train pipeline.

Limitations. The framework currently handles π₁ constraints. Extending to higher homotopy groups (π₂, π₃) would require higher-dimensional proof terms.
The experiments use three spaces with relatively simple π₁; surfaces of higher genus would test H more strenuously. For natural language, the theoretical results provide explanatory value, but the constructive approach awaits an appropriate formalization of linguistic compositional structure.

Code and data availability. Cubical Agda formalization and Python code to replicate all experiments are available at https://github.com/karsar/hott_neuro .

References

[1] B. Lake and M. Baroni. Generalization without systematicity: On the compositional skills of sequence-to-sequence recurrent networks. In ICML 2018, pp. 2873–2882, 2018.
[2] N. Kim and T. Linzen. COGS: A compositional generalization challenge based on semantic interpretation. In EMNLP 2020, pp. 9087–9105, 2020.
[3] N. Dziri, X. Lu, M. Sclar, et al. Faith and fate: Limits of transformers on compositionality. In NeurIPS 2023, 2023.
[4] B. Coecke, M. Sadrzadeh, and S. Clark. Mathematical foundations for a compositional distributional model of meaning. Linguistic Analysis, 36:345–384, 2010.
[5] B. Coecke, E. Grefenstette, and M. Sadrzadeh. Lambek vs. Lambek: Functorial vector space semantics and string diagrams for Lambek calculus. Annals of Pure and Applied Logic, 164(11):1079–1095, 2013.
[6] G. S. H. Cruttwell, B. Gavranović, N. Ghani, P. Wilson, and F. Zanasi. Categorical foundations of gradient-based learning. In ESOP 2022, LNCS, pp. 1–28. Springer, 2022.
[7] B. Gavranović, P. Lessard, A. Dudzik, T. von Glehn, J. G. M. Araújo, and P. Veličković. Position: Categorical deep learning is an algebraic theory of all architectures. In ICML 2024, 2024.
[8] B. Fong, D. I. Spivak, and R. Tuyéras. Backprop as functor: A compositional perspective on supervised learning. In LICS 2019, pp. 1–13. IEEE, 2019.
[9] The Univalent Foundations Program. Homotopy Type Theory: Univalent Foundations of Mathematics. Institute for Advanced Study, 2013.
[10] C.
Cohen, T. Coquand, S. Huber, and A. Mörtberg. Cubical type theory: A constructive interpretation of the univalence axiom. FLAP, 4(10):3127–3170, 2018.
[11] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In NeurIPS 2017, pp. 5998–6008, 2017.
[12] D. A. Barrington. Bounded-width polynomial-size branching programs recognize exactly those languages in NC¹. J. Comput. Syst. Sci., 38(1):150–164, 1989.
[13] G. Katz, C. Barrett, D. L. Dill, K. Julian, and M. J. Kochenderfer. Reluplex: An efficient SMT solver for verifying deep neural networks. In CAV 2017, LNCS, pp. 97–117. Springer, 2017.
[14] B. Fong and D. I. Spivak. An Invitation to Applied Category Theory: Seven Sketches in Compositionality. Cambridge University Press, 2019.
[15] T. Cohen and M. Welling. Group equivariant convolutional networks. In ICML 2016, pp. 2990–2999, 2016.
[16] M. Weiler and G. Cesa. General E(2)-equivariant steerable CNNs. In NeurIPS 2019, pp. 14334–14345, 2019.
[17] S. Kobayashi, S. Schug, Y. Akram, F. Redhardt, J. von Oswald, R. Pascanu, G. Lajoie, and J. Sacramento. When can transformers compositionally generalize in-context?, 2024.
[18] J. Su, M. Ahmed, Y. Lu, S. Pan, W. Bo, and Y. Liu. RoFormer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024.
[19] A. Gu and T. Dao. Mamba: Linear-time sequence modeling with selective state spaces, 2023.
[20] G. L. Marchetti, D. Kunin, A. Myers, F. Acosta, and N. Miolane. Sequential group composition: A window into the mechanics of deep learning, 2026.
[21] B. Liu, J. T. Ash, S. Goel, A. Krishnamurthy, and C. Zhang. Transformers learn shortcuts to automata. In ICLR 2023, 2023.

A Full Proof of Theorem 4.1

The main text gives a proof sketch.
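Before the formal argument, the mixing mechanism can be seen in a toy computation. The sketch below (pure Python, a single softmax self-attention head with Q = K = V = identity; the embeddings and dimensions are illustrative assumptions, not the paper's trained models) shows that the output at a w₁ position changes when only w₂'s token changes:

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attend(h):
    """One softmax self-attention layer with Q = K = V = identity."""
    d = len(h[0])
    out = []
    for q in h:
        scores = [sum(qc * kc for qc, kc in zip(q, k)) / math.sqrt(d) for k in h]
        alpha = softmax(scores)  # strictly positive for every (i, j)
        out.append([sum(a * v[c] for a, v in zip(alpha, h)) for c in range(d)])
    return out

emb = {"a": [1.0, 0.0], "b": [0.0, 1.0], "b2": [0.0, 2.0]}  # b2: alternative w2 token

h1 = attend([emb["a"], emb["b"]])    # word w1 . w2  = a b
h2 = attend([emb["a"], emb["b2"]])   # word w1 . w2' = a b2

# The representation at the w1 position (index 0) depends on w2's content:
assert h1[0] != h2[0]
```

The residual connection and multiple heads of the real architecture are omitted; they do not remove the strictly positive cross-segment term that the assertion exhibits.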
Here we make the argument fully explicit, tracking the information flow through each attention layer to show exactly where functoriality breaks.

Proof. Let T_θ be a transformer with parameters θ, L layers of multi-head softmax self-attention, and embedding dimension d. The input is a word w = w₁ ⋯ w_n over generators of a non-trivial group G, with each w_i embedded as h_i^(0) ∈ R^d. Suppose for contradiction that T_θ defines a monoidal functor: there exists ⊕ such that T_θ(w₁ · w₂) = T_θ(w₁) ⊕ T_θ(w₂) for all w₁, w₂.

At layer l, position i in the w₁-segment computes

    h_i^(l) = h_i^(l−1) + MHA^(l)(h_i^(l−1), {h_j^(l−1)}_j),

where multi-head attention aggregates

    Attn_i = Σ_j α_ij V h_j^(l−1)   with   α_ij = softmax_j( Q h_i^(l−1) · K h_j^(l−1) / √d ).

For j in the w₂-segment, the key K h_j^(l−1) depends on the embedding of w₂'s tokens. Since G is non-trivial, there exist distinct words w₂ ≠ w₂′ with [w₂] = [w₂′] ∈ G. These have different tokens, hence different embeddings, hence different keys, hence different α_ij, hence different h_i^(1) for i ∈ w₁. Propagating through layers: T_θ(w₁ · w₂) ≠ T_θ(w₁ · w₂′). But functoriality and T_θ(w₂) = T_θ(w₂′) require T_θ(w₁ · w₂) = T_θ(w₁) ⊕ T_θ(w₂) = T_θ(w₁) ⊕ T_θ(w₂′) = T_θ(w₁ · w₂′). Contradiction.

B Training Details

Data generation. For each training word, 1000 ground-truth loops are generated by tracing the standard geometric generators on X with random phase offsets (uniformly distributed starting points along each generator circle) and small Gaussian noise (σ = 0.02) applied independently to each output coordinate. This provides within-class geometric variation, so that the network learns the shape of each homotopy class rather than memorizing a single curve.

Chamfer distance.
The Chamfer distance between two point clouds P = {p_i}_{i=1}^N and Q = {q_j}_{j=1}^M is

    d_C(P, Q) = (1/2N) Σ_i min_j ∥p_i − q_j∥² + (1/2M) Σ_j min_i ∥p_i − q_j∥².

Common protocol. All experiments use: Chamfer loss; AdamW optimizer (lr = 10⁻³, weight decay 10⁻⁴); cosine learning rate schedule with 20-epoch warmup; up to 500 epochs; early stopping (patience 80). Generator networks are 2-layer MLPs with 128 hidden units producing n_pts = 32 points per segment. All decoders output 64 points total, which for type-B decoders means L · 32 intermediate points resampled to 64. Seeds: 42, 179, 316.

T². Embedded in R³. Training: 6 words ({a, b, aa, ab, ba, bb}), 1000 samples each. Training losses: all architectures converge to 2.23–2.27. The homotopy decoder adds smoothness regularization (λ = 0.05) on the proof term.

S¹ ∨ S¹. Two unit circles in R³ meeting at the origin. Same 6 training words (but ab ≠ ba). Training losses: transport 0.0023 ± 0.0003, transformer 0.0030 ± 0.0002, sequential 0.0051 ± 0.0002.

Klein bottle K. Embedded in R⁴ via half-angle twist. Training: all 16 words of length ≤ 2 in {a, a⁻¹, b, b⁻¹}. The proof term H is a 2-layer MLP (64 hidden units) with boundaries H(s = 0) = g_b ⊕ g_a, H(s = 1) = g_{a⁻¹} ⊕ g_b, trained jointly with the generators. Training losses: cover 0.831, transport 0.920, homotopy 0.920, transformer WC 0.851 (all ± < 0.02).

C Matched-Loss Ablation

A natural objection to the main results is that type-A architectures simply need more training. We test this directly: if we retrain type-A decoders to match the type-B training loss, does the extrapolation gap close? To rule out training quality as the explanation for the type-A/B gap, we retrained type-A architectures on S¹ ∨ S¹ with 2× epochs and a reduced learning rate (5 × 10⁻⁴), targeting the best type-B training loss.
                    Train loss              d̄_10
Architecture        Original   Retrained    Original   Retrained
Transformer         .0030      .0030        .537       .491
Sequential          .0051      .0054        .297       .282
Transport (ref.)    .0023      —            .054       —

Neither type-A architecture can reach the type-B training loss even with 2× epochs: the transformer stays at 0.0030 (target 0.0023), and the sequential decoder worsens to 0.0054 (from 0.0051 with standard training; the lower learning rate finds a different basin). This is itself an architectural limitation: the compositional structure of the transport decoder helps optimization, not just generalization. The per-segment Chamfer at L = 10 barely changes after retraining: transformer .537 → .491 (still 9× worse than transport), sequential .297 → .282 (still 5× worse). The gap is architectural.

D Full L-Progression Tables

The main text reports selected lengths to save space. The full trajectories below reveal how degradation unfolds: whether it is gradual or abrupt, and whether it saturates.

Table 6: S¹ ∨ S¹: per-segment Chamfer d̄_L (mean ± std, 3 seeds), full L-progression.

Architecture   Type   L = 2        L = 3        L = 4        L = 6        L = 8        L = 10
Transformer    A      .152 ± .002  .245 ± .006  .297 ± .022  .411 ± .027  .464 ± .039  .537 ± .027
Sequential     A      .010 ± .001  .100 ± .005  .129 ± .009  .173 ± .021  .247 ± .034  .297 ± .027
Transport      B      .002 ± .001  .015 ± .010  .012 ± .005  .018 ± .008  .030 ± .011  .054 ± .019

The transformer is already at 33% circle accuracy at L = 2 (a training length), confirming that the failure is fundamental, not merely an extrapolation issue: it cannot represent non-abelian structure even in-distribution. The sequential decoder starts better (67%) but converges to the same floor.
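For reference, the per-segment metric in these tables is the Chamfer distance of Appendix B divided by word length L. A minimal pure-Python implementation of the Chamfer distance itself (illustrative, not the paper's code):

```python
def chamfer(P, Q):
    """Symmetric Chamfer distance between point clouds P and Q
    (squared Euclidean nearest-neighbour terms, as in Appendix B)."""
    def sq(p, q):
        return sum((pc - qc) ** 2 for pc, qc in zip(p, q))
    a = sum(min(sq(p, q) for q in Q) for p in P) / (2 * len(P))
    b = sum(min(sq(p, q) for p in P) for q in Q) / (2 * len(Q))
    return a + b

P = [(0.0, 0.0), (1.0, 0.0)]
assert chamfer(P, P) == 0.0            # identical clouds: distance zero

Q = [(0.0, 0.0), (2.0, 0.0)]
# P->Q nearest terms: 0 and 1; Q->P: 0 and 1  =>  1/4 + 1/4 = 0.5
assert chamfer(P, Q) == 0.5
```

The per-segment value d̄_L would then be `chamfer(pred, truth) / L` for a length-L word.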
The Klein bottle tables show a striking asymmetry: the homotopy decoder improves with length (from 1.43 at L = 2 to 0.82 at L = 10), suggesting that the per-segment metric becomes more stable as L grows, while the transport decoder's non-canonical error worsens from L = 4 onward, reflecting cumulative frame-flip errors that the proof term H would correct.

Table 7: S¹ ∨ S¹: circle accuracy (%, mean over 3 seeds). Transport achieves 100% by construction.

Architecture   Type   L = 2   L = 3   L = 4   L = 6   L = 8   L = 10
Transformer    A      33      19      16      19      12      14
Sequential     A      67      36      20      14      8       11
Transport      B      100     100     100     100     100     100

Table 8: Klein bottle K: per-segment Chamfer d̄_L on all test words (mean ± std, 3 seeds).

Architecture   Type   L = 2        L = 3        L = 4        L = 6        L = 8        L = 10
Cover          A      1.81 ± .13   1.90 ± .04   1.69 ± .04   1.86 ± .06   1.81 ± .13   1.76 ± .07
Transf. (WC)   A      1.80 ± .16   1.84 ± .21   1.84 ± .24   2.03 ± .29   2.28 ± .39   2.27 ± .33
Transport      B      1.54 ± .18   1.51 ± .05   1.25 ± .06   1.24 ± .05   1.18 ± .03   1.17 ± .04
Homotopy       B      1.43 ± .14   1.10 ± .04   0.97 ± .07   0.85 ± .04   0.79 ± .05   0.82 ± .03

E Coherence Battery

Per-segment Chamfer measures how well each segment is generated. But functoriality also implies structural identities—exact zeros, not approximate ones. The following battery tests whether these identities hold, distinguishing architectural guarantees from learned approximations.

For T², we test four structural coherence properties. Composition gap: d_∞(D(w₁) ⊕ D(w₂), D(w₁w₂)) for canonical pairs. Commutativity: d_∞(D(ab), D(ba)). Reordering gap: d_∞(D(b) ⊕ D(a), D(ab)). Non-canonical: d_∞(D(abab), D(aabb)). Commutativity and non-canonical gaps are exactly zero for cover, transport, and homotopy (they factor through Z², so ab and ba produce identical output by construction).
Transport attention shows 1.54 ± 1.39: learned coherence is fragile and high-variance, while structural coherence is exact.

Order sensitivity on S¹ ∨ S¹. We test whether each decoder distinguishes words that differ only in generator order (e.g., ab vs. ba, aab vs. aba)—all distinct homotopy classes in F₂. Transport distinguishes 80% of test pairs (the remaining 20% have cross-Chamfer near the noise floor, reflecting geometric similarity rather than architectural failure). The transformer distinguishes only 40 ± 16%—barely above chance—confirming that attention collapses word order.

Table 9: Klein bottle K: per-segment Chamfer on non-canonical words only (words where b precedes a, exercising the relation bab⁻¹ = a⁻¹). The transport–homotopy gap isolates the proof term's contribution.

Architecture   Type   L = 2        L = 3        L = 4        L = 6        L = 8        L = 10
Cover          A      1.86 ± .20   2.08 ± .11   1.68 ± .07   1.99 ± .08   2.02 ± .08   1.96 ± .07
Transf. (WC)   A      1.84 ± .11   1.93 ± .33   1.87 ± .28   1.96 ± .30   2.32 ± .40   2.26 ± .36
Transport      B      1.73 ± .12   1.82 ± .06   1.37 ± .08   1.45 ± .04   1.44 ± .05   1.52 ± .08
Homotopy       B      1.38 ± .12   1.13 ± .02   0.97 ± .08   0.83 ± .02   0.75 ± .02   0.82 ± .06

Table 10: T² coherence battery. Exact zeros are architectural guarantees, not learned.

Arch.          Comp.        Comm.         Reorder      Noncan.
Transf. (WC)   4.05 ± .21   0.24 ± .13    4.37 ± .10   2.04 ± 1.65
Cover          4.25 ± .18   0.000         4.96 ± .39   0.000
Transp. Attn.  4.21 ± .14   1.54 ± 1.39   4.79 ± .28   0.96 ± .56
Transport      4.28 ± .13   0.000         3.82 ± .07   0.000
Homotopy       4.24 ± .19   0.000         3.79 ± .13   0.000

F General Non-Compositionality of Attention

Theorem 4.1 in the main text uses group structure. A natural question is whether groups are essential to the argument, or whether the obstruction is more fundamental. The following result shows it is the latter: softmax attention is incompatible with any form of sequential compositionality, regardless of algebraic structure.

Definition F.1 (Segment-independent compositionality). A decoder D : Σ* → Y is segment-independently compositional if there exists a combination operation ⊕ on Y such that: (i) D(w₁ · w₂) = D(w₁) ⊕ D(w₂) for all words w₁, w₂; (ii) D(w) is computed from w alone; (iii) ⊕ does not introduce cross-segment interactions: when Y is a sequence, the output positions corresponding to w₁ are determined by D(w₁) alone.

Theorem F.2 (General non-compositionality of attention). Let T be a network containing at least one softmax attention layer over the full input sequence. Then T is not segment-independently compositional for any alphabet |Σ| ≥ 2, for generic parameters (Q, K, V).

Proof. We show condition (iii) is violated. At the first attention layer applied to w₁ · w₂, position i within the w₁ segment computes

    h′_i = Σ_{j ∈ w₁} α_ij V h_j + Σ_{j ∈ w₂} α_ij V h_j,   where α_ij = softmax(Q h_i · K h_j / √d).

The cross-segment sum depends on the tokens of w₂ through both α_ij and V h_j. For generic (Q, K, V) and |Σ| ≥ 2, replacing any token in w₂ changes K h_j and V h_j, hence changes h′_i. Therefore the output at position i ∈ w₁ depends on the content of w₂, violating condition (iii).

Remark F.3. This theorem requires no group structure, no topology, and no specific target space. It provides a structural explanation for why transformers struggle with compositional generalization across SCAN, COGS, and CFQ—the obstacle is architectural, not task-specific. The result fails for causal attention with a hard segment partition (where positions in w₁ attend only to w₁), but such masking prevents w₁'s representation from depending on w₂—precisely the loss of global context that motivates full attention.

Corollary F.4 (Structural limits on learned compositionality).
Theorems 4.1 and F.2 do not imply that a transformer cannot learn to approximate compositional behaviour on the training distribution. What they constrain is the nature of any such learned compositionality:

(a) Approximate, never exact. The cross-segment attention term Σ_{j ∈ w₂} α_ij V h_j is architecturally present and strictly positive (softmax > 0). Training can make this term small but not zero.

(b) Distribution-dependent. The suppression of cross-segment attention is learned for the specific input statistics (word lengths, winding ranges, token frequencies) encountered during training.

(c) No length transfer guarantee. For L > L_train, the cross-segment attention weights that were learned-to-be-small on training data have no architectural reason to remain small. The quality of learned compositionality may degrade smoothly, abruptly, or not at all—this is an empirical question, not a structural guarantee.

Proof. (a) By the strict positivity of softmax, α_ij > 0 for all (i, j), so the cross-segment sum is nonzero whenever V h_j ≠ 0 (which holds generically). (b) The trained parameters (Q, K, V) minimize a loss on the training distribution; there is no penalty on cross-segment attention magnitude as such. (c) At length L > L_train, both the number of cross-segment terms and the positional encodings (at untrained positions) change; the learned suppression was not trained for these configurations.

G Abelian Transport-Attention Theorem

The transport attention decoder in our experiments uses a positional encoding derived from the framework rather than chosen ad hoc. This appendix shows that the derivation recovers a well-known technique—rotary position embeddings (RoPE)—as a special case, and explains why principled positions alone are insufficient for compositional generalization.
The main text notes that the framework derives 2D rotary positional encoding from group theory. Here we state the full result.

Definition G.1 (Transport-structured attention). Let G be a finitely generated abelian group with generators g₁, ..., g_k. For each generator g_i, let T_i ∈ O(d) be a learnable orthogonal matrix. The transport-structured attention at position p with group element γ_p ∈ G is

    α_pq = softmax_q( q_p · T^{γ_q − γ_p} k_q / √d ),   T^γ = T₁^{n₁} ⋯ T_k^{n_k} for γ = (n₁, ..., n_k).   (1)

Theorem G.2 (Abelian transport attention). Let G be a finitely generated abelian group. Then:

(a) Transport-structured attention (1) is well-defined: T^γ is independent of the decomposition of γ into generators (since G is abelian and the T_i commute as orthogonal matrices with matching rotation planes).

(b) The computational cost is O(T²d)—identical to standard attention.

(c) When T_i acts as block-diagonal rotations on d/2 independent planes, this reduces to 2k-dimensional rotary position embeddings (RoPE) [18]. Standard RoPE is the G = Z special case; the framework gives the natural G = Z^k extension.

Proof. (a) For abelian G, each γ has a unique decomposition γ = n₁g₁ + ⋯ + n_k g_k. Block-diagonal rotations on the same planes commute (angles add), so T^γ = T₁^{n₁} ⋯ T_k^{n_k} is independent of ordering. (b) Computing T^{γ_q − γ_p} requires k matrix powers in O(d) each (block-diagonal); the dominant cost is the T² pairwise computations at O(d) each. (c) With T_i block-diagonal having 2×2 rotation blocks R(ω_ij), the transport T_i^n acts as R(n ω_ij) on each plane. The combined T^γ acts as R(Σ_i n_i ω_ij)—exactly multi-dimensional RoPE with frequency vectors ω_i = (ω_i1, ..., ω_{i,d/2}).

Remark G.3. This explains why RoPE improves length generalization: it implements transport in the universal cover of the classifying space of Z.
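Case (c) can be sketched numerically. The snippet below (pure Python, one 2×2 rotation block; the frequency and vectors are illustrative assumptions, not the paper's trained T_i) checks the defining property of the rotary scheme: the attention logit depends only on the relative offset between positions.

```python
import math

def rot(theta):
    """2x2 rotation matrix R(theta) as a tuple of rows."""
    c, s = math.cos(theta), math.sin(theta)
    return ((c, -s), (s, c))

def apply(R, v):
    return (R[0][0] * v[0] + R[0][1] * v[1], R[1][0] * v[0] + R[1][1] * v[1])

def dot(u, v):
    return u[0] * v[0] + u[1] * v[1]

w = 0.7                          # frequency of the single rotation plane
q, k = (0.3, -1.1), (0.8, 0.4)   # query and key vectors (illustrative)

def logit(p_pos, q_pos):
    """RoPE-style score: rotate query and key by their absolute positions."""
    return dot(apply(rot(w * p_pos), q), apply(rot(w * q_pos), k))

# Same offset (3), different absolute positions: identical score.
assert abs(logit(2, 5) - logit(10, 13)) < 1e-9

# Equivalently, rotating the key by the offset alone gives the same score,
# i.e. the score is q . R(w * (q_pos - p_pos)) k, as in Theorem G.2(c).
assert abs(logit(2, 5) - dot(q, apply(rot(w * 3), k))) < 1e-9
```

Because angles add under composition, extending this to k commuting frequency blocks gives the Z^k case of Theorem G.2.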
However, transport-structured positional encoding does not make the decoder type-B (functorial): the attention mechanism still mixes information across segments. The transport attention decoder in our experiments confirms this—it has principled positions but the wrong composition, and degrades like any type-A architecture.

H Depth Obstruction and RNN–Transformer Trichotomy

Why does the sequential decoder (GRU) outperform the transformer on S¹ ∨ S¹ but not match the transport decoder? The answer involves two independent obstructions. The first is functoriality: attention mixes across segments for any non-trivial group (Theorem 4.1). The second is depth: for groups with nonsolvable finite quotients, computing prefix products requires depth Ω(log n), which fixed-depth transformers cannot provide. For solvable groups, however, constant-depth shortcuts exist [21], so the depth obstruction vanishes—only the functoriality obstruction remains. The GRU has the right depth (sequential scan) but the wrong composition (context-dependent, not structural), which is why it outperforms the transformer on F₂ yet still falls far short of the transport decoder.

Theorem H.1 (Depth obstruction for prefix products). Let G be a finitely generated non-abelian group. Any transport-coherent attention mechanism for G requires computing the accumulated group element γ_p = g_{w₁} · g_{w₂} ⋯ g_{w_p} for each prefix of the input. This prefix-product computation:

(a) Cannot be performed by a fixed-depth transformer when G surjects onto a nonsolvable finite group. A fixed-depth transformer implements a bounded-depth, unbounded-fan-in computation—the circuit class AC⁰. The free group F₂ surjects onto nonsolvable finite groups (e.g., S₅, which is 2-generated), and the word problem for any nonsolvable finite group is NC¹-complete under AC⁰ reductions [12].
Since NC¹ ⊄ AC⁰ (parity separates them), computing prefix products in F₂ is at least as hard as solving these word problems, and hence outside AC⁰. For solvable groups, this obstruction does not apply: Liu et al. [21] show that all solvable semiautomata admit constant-depth Transformer simulators via the Krohn-Rhodes decomposition (their Theorem 2), though the resulting shortcuts are empirically brittle and do not generalize out of distribution.

(b) Requires Ω(n) total work (each prefix product depends on the full prefix, and non-commutativity prevents any shortcut).

(c) Can be performed by an associative scan (parallel prefix) in O(log n) depth with O(n) work—a sequential-scan structure absent from standard transformers but present in state space models [19] and recurrent architectures. Marchetti et al. [20] prove this rigorously for finite groups: they construct explicit RNN solutions that compose in O(k) steps and multilayer MLP solutions that compose in O(log k) layers, both with hidden width O(|G|^{3/2}) independent of sequence length—compared to the O(exp k) width required by two-layer networks.

Proof. (a) The key is the circuit-complexity classification of group word problems. Barrington [12] showed that the word problem for any fixed nonsolvable group G—deciding whether a product of generators equals the identity—is complete for NC¹ (fan-in 2, depth O(log n)) under AC⁰ reductions (bounded-depth, unbounded-fan-in circuits). Since NC¹ ⊄ AC⁰ (the parity function is in NC¹ but not in AC⁰), no AC⁰ circuit can solve these word problems. The free group F₂ surjects onto S₅ (which is 2-generated and nonsolvable): the natural quotient map F₂ ↠ S₅ reduces the word problem for S₅ to computing prefix products in F₂.
A fixed-depth transformer with a fixed number of layers implements a bounded-depth, unbounded-fan-in computation—that is, an AC⁰ circuit—and therefore cannot compute these products. The converse—solvable groups admit constant-depth solutions—follows from Liu et al. [21], who apply the Krohn-Rhodes decomposition to factor solvable semiautomata into modular counters and memory units, each simulable at depth 1. (b) The prefix product γ_p depends on all p inputs; since G is non-abelian, no proper subset determines γ_p. (c) Group multiplication is associative, so the parallel prefix algorithm computes all n prefix products in O(log n) depth with O(n) multiplications.

Corollary H.2 (Depth trichotomy). The computational requirements for prefix products depend on the solvability structure of G, yielding three regimes:

(a) Abelian. The product depends only on letter counts, computable in AC⁰. Bounded-depth parallel attention suffices for the computation.

(b) Solvable non-abelian (e.g. Z ⋊ Z). Constant-depth shortcuts exist via the Krohn-Rhodes decomposition [21], with depth independent of sequence length but potentially large width. Fixed-depth transformers have sufficient computational power for prefix products, but attention still mixes across segments, so the functoriality obstruction (Theorem 4.1) remains.

(c) Nonsolvable quotients (e.g. F₂ ↠ S₅). Prefix products are NC¹-complete, outside AC⁰. Fixed-depth transformers fail for two independent reasons: insufficient depth and broken functoriality.

Functoriality (Theorem 4.1) is the uniform obstruction across all three cases. Depth is an additional obstruction only in case (c).

Proof. (a) For abelian G, the product g₁ ⋯ g_n depends only on the multiplicity of each generator, which is a sum—computable by AC⁰ circuits. (b) Liu et al.
[21] sho w that all solv able semiautomata admit O ( 1 ) - depth Transformer simulators (their Theorem 2), via the Krohn-Rhodes decomposition into modular counters and memory units. Since every finite quotient of a solvable group is solv able, prefix products in any solvable G can be computed at constant depth. Ho we ver , constant-depth computability does not imply functoriality: by Theorem 4.1, the attention mechanism still mixes tok en-lev el information across segments. (c) Theorem H.1(a): Barrington’ s theorem plus NC 1 ⊂ A C 0 . The GRU’ s intermediate performance on S 1 ∨ S 1 (0 . 297 vs. transformer’ s 0 . 537, transport’ s 0 . 054) con- firms case (c): the GR U has the right depth structure (sequential processing) but the wrong composition structure (context-dependent, not structural). On the Klein bottle, the transformer’ s failure illustrates case (b): e ven when depth is not an obstruction, cross-segment attention prevents functorial composi- tion. I P er -Segment Error Scaling The experiments sho w that type-B error is flat in L while type-A error grows. Is this an accident of specific architectures, or does it follow from the theorems? Here we prove it follows: functoriality implies O ( 1 ) scaling, while non-functoriality implies Ω ( 1 ) degradation for L ≫ L train . Proposition I.1 (Per-segment error scaling) . Let ¯ d L = E w : | w | = L [ Chamfer ( D ( w ) , D ∗ ( w ))] / L denote the per-se gment Chamfer distance for wor ds of length L. (a) Type-B: F or a transport decoder with con ver ged training, ¯ d L ≤ max i Chamfer ( g i ( θ ) , g ∗ i ) + η ( L ) / L L → ∞ − − − → ε gen , wher e ε gen is the gener ator appr oximation err or , independent of L. (b) Type-A: F or type-A arc hitectur es tested at L > L train , ¯ d L ≥ Ω ( e ( L − L train )) wher e e ( · ) is the extr apolation err or of the network’ s learned functions be yond their training domain. F or the K. 
Sar gsyan 21 cover decoder , the MLP evaluates at winding magnitudes up to L (trained only to L train ), giving e = Ω ( L − L train ) by the T aylor r emainder . F or transport attention, 2D RoPE r otation angles scale as n a / L train times any tr aining angle, placing attention in an untrained re gime. F or the transformer (WC), attention patterns at length L in volve L 2 pairwise interactions vs. L 2 train during training; the learned attention weights have no ar chitectur al reason to generalize to the longer -range interac- tions. In all cases, ¯ d L = Ω ( 1 ) for L ≫ L train . Pr oof. (a) By Theorem 3.3, Chamfer ( D ( w ) , D ∗ ( w )) ≤ L · ε gen + η ( L ) , where η ( L ) accounts for resam- pling ( O ( √ L ) interpolation error). Dividing by L : ¯ d L ≤ ε gen + η ( L ) / L → ε gen . (b) A word of length L has winding pair ( n a , n b ) with n a + n b = L . The cov er decoder’ s MLP ev aluates at ∥ ( n a , n b ) ∥ up to L , while training co vered only ∥ ( n a , n b ) ∥ ≤ L train . For smooth functions, extrapolation error is Ω ( ∥ x − x train ∥ ) . For transport attention, rotation angles are n a / L train times larger than an y training angle. For the transformer (WC), the number of cross-segment attention pairs grows as ( L − L train ) new positions per segment, each contributing untrained interactions. All three mechanisms give per -position error Ω ( L − L train ) , hence ¯ d L = Ω (( L − L train ) / L ) = Ω ( 1 ) for L ≫ L train . J Expr essivity of T ransport Decoders A natural concern is that functoriality constrains the decoder so tightly that it cannot represent interesting loops. The follo wing result shows this is not the case: within each homotopy class, the transport decoder is a uni versal approximator . The constraint acts only between classes, enforcing composition. Proposition J.1 (Expressivity) . Let D tr θ be a transport decoder with g enerator networks g a ( θ ) , g b ( θ ) . 
Let L_n ⊂ L(X) denote the space of loops with homotopy class n ∈ G. Then:

(a) Within-class expressivity: each generator g_a(θ) can represent any continuous loop in L_{(1,0)} given sufficient network capacity (by the universal approximation theorem applied to the generator network in polar coordinates).

(b) Between-class constraint: for homotopy class (n_a, n_b), the output is the n_a-fold concatenation of g_a followed by the n_b-fold concatenation of g_b. The geometric degrees of freedom are those of g_a and g_b alone.

(c) Degrees of freedom: the number of independent geometric degrees of freedom equals the number of generators in the group presentation. The cover decoder has strictly more geometric freedom (an independent shape per homotopy class), but this extra freedom is precisely the freedom to be incoherent.

Proof. (a) Universal approximation in polar coordinates. (b) By construction of the transport decoder. (c) For Z^2, the group is free abelian on two generators; the transport decoder factors through the canonical form a^{n_a} b^{n_b}, making the two generator shapes the only free parameters.

K Resampling Artifact Analysis

Comparing loops of different lengths with a fixed-resolution metric creates a subtle confound. This appendix shows that our per-segment Chamfer metric avoids it, and quantifies how large the artifact would be with a naive alternative.

All decoders produce a fixed-length output of T_out = 64 points regardless of word length L. This creates a systematic confound: as L grows, the underlying curve lengthens while point density drops, causing naive Chamfer distance to decrease even when per-segment error is constant. We verified this with synthetic curves under constant additive noise σ = 0.1:

                                             L = 2    L = 4    L = 8    L = 10
    Per-seg Chamfer (fair, 32 pts/seg)       0.026    0.024    0.029    0.026
    Per-seg Chamfer (naive, resample to 64)  0.010    0.003    0.002    0.001
    Artifact ratio (naive/fair)              0.4×     0.1×     0.07×    0.05×

The fair metric (comparing each segment at a fixed 32 points) is flat (∼0.026), as expected for constant noise. The naive metric decreases by 10× from L = 2 to L = 10, a pure measurement artifact. Our per-segment Chamfer d̄_L avoids this in all experiments by comparing each of the L segments at a fixed resolution of 32 points, ensuring that the metric measures generalization quality, not resampling fidelity.
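The distinction between the two metrics can be sketched in code. The following is an illustrative reimplementation, not the paper's evaluation code: the helper names, the circle-shaped segments, and the index-based resampling scheme are our own assumptions, chosen only to show where the two metrics differ (the fair metric matches segment i against segment i; the naive metric lets nearest-neighbour matching cross segment boundaries after resampling the whole curve to 64 points).

```python
import numpy as np

def chamfer(P, Q):
    """Symmetric Chamfer distance between point sets P (N, d) and Q (M, d)."""
    D = np.linalg.norm(P[:, None, :] - Q[None, :, :], axis=-1)
    return 0.5 * (D.min(axis=1).mean() + D.min(axis=0).mean())

def per_segment_chamfer(pred_segs, true_segs):
    """Fair metric: compare segment i of the prediction only against
    segment i of the target, each at its fixed native resolution."""
    return float(np.mean([chamfer(p, t) for p, t in zip(pred_segs, true_segs)]))

def naive_chamfer(pred_segs, true_segs, n_out=64):
    """Naive metric: flatten to whole curves, resample to n_out points,
    and allow nearest-neighbour matches across segment boundaries."""
    def resample(segs):
        pts = np.concatenate(segs)
        idx = np.linspace(0, len(pts) - 1, n_out).round().astype(int)
        return pts[idx]
    return chamfer(resample(pred_segs), resample(true_segs))

# Toy data (our assumption): every segment traces the same unit circle,
# so different passes overlap in space and cross-segment matches are cheap.
rng = np.random.default_rng(0)
L, pts = 8, 32
theta = np.linspace(0, 2 * np.pi, pts, endpoint=False)
seg = np.stack([np.cos(theta), np.sin(theta)], axis=1)
true_segs = [seg.copy() for _ in range(L)]
pred_segs = [s + rng.normal(0, 0.1, s.shape) for s in true_segs]

fair = per_segment_chamfer(pred_segs, true_segs)
cheap = naive_chamfer(pred_segs, true_segs)
print(f"fair per-segment: {fair:.3f}   naive whole-curve: {cheap:.3f}")
```

The exact magnitude of the artifact depends on the curve family and noise model, as the table above quantifies for the paper's setting; the sketch only fixes the mechanics of the two metrics.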