Hierarchical Latent Structure Learning through Online Inference


Authors: Ines Aitsahalia, Kiyohito Iigaya

Ines Aitsahalia (1) and Kiyohito Iigaya (1,2,3,4)

(1) Center for Theoretical Neuroscience and Zuckerman Institute, Columbia University, New York, NY 10027
(2) Department of Psychiatry, Columbia University Irving Medical Center, New York, NY 10032
(3) New York State Psychiatric Institute, New York, NY 10032
(4) Columbia Data Science Institute, New York, NY 10027

Abstract

Learning systems must balance generalization across experiences with discrimination of task-relevant details. Effective learning therefore requires representations that support both. Online latent-cause models support incremental inference but assume flat partitions, whereas hierarchical Bayesian models capture multilevel structure but typically require offline inference. We introduce the Hierarchical Online Learning of Multiscale Experience Structure (HOLMES) model, a computational framework for hierarchical latent structure learning through online inference. HOLMES combines a variation on the nested Chinese Restaurant Process prior with sequential Monte Carlo inference to perform tractable trial-by-trial inference over hierarchical latent representations without explicit supervision over the latent structure. In simulations, HOLMES matched the predictive performance of flat models while learning more compact representations that supported one-shot transfer to higher-level latent categories. In a context-dependent task with nested temporal structure, HOLMES also improved outcome prediction relative to flat models. These results provide a tractable computational framework for discovering hierarchical structure in sequential data.

Introduction

A central challenge in learning is balancing generalization and discrimination.
Generalization allows agents to apply prior knowledge to novel situations, whereas discrimination enables sensitivity to subtle but meaningful differences in observations. [1, 2, 3, 4] Effective learning therefore requires representations that capture shared structure across multiple levels of abstraction while preserving task-relevant distinctions among observations. One influential approach is latent structure inference, in which an agent partitions observations into hidden causes that generate the data. Bayesian nonparametric models provide a principled framework for this process by allowing the number of latent causes to grow with experience. [5, 6, 7, 8] However, most existing models assume flat partitions in which all latent causes exist at a single level of abstraction. Many environments are instead inherently hierarchical, [9] where observations share intermediate categories and broader contextual regularities. Hierarchical Bayesian models [10] such as the nested Chinese Restaurant Process (nCRP) [11, 12] capture such structure, but typically rely on offline batch inference.

Figure 1: Online hierarchical extension of the Chinese Restaurant Process (CRP) for structure learning. (A) CRP algorithm. The CRP provides a Bayesian nonparametric prior over partitions, assigning each observation i (filled circle) to an existing cluster k with probability proportional to its occupancy (n_k) and the concentration parameter α, or to a new cluster. (B) HOLMES prior with nested Chinese Restaurant Process (nCRP) algorithm. The nCRP provides a Bayesian nonparametric prior over tree-structured partitions, assigning observations to paths by sequentially selecting branches at each level L with probability proportional to occupancy and α_L, the level-specific concentration.
Conversely, online latent-cause models based on sequential Monte Carlo methods support incremental trial-by-trial inference but typically assume flat latent spaces. [13] Bridging this gap requires a model that combines hierarchical representation with online inference. Here, we introduce Hierarchical Online Learning of Multiscale Experience Structure (HOLMES), a framework that combines hierarchical nonparametric structure with online sequential inference. Our model performs trial-by-trial inference over dynamically expanding latent trees, enabling hierarchical representations to be discovered directly from sequential experience. We evaluate the model in two classes of synthetic tasks. In compositional environments, hierarchical inference preserves predictive performance while learning more compact representations that support one-shot transfer across latent categories. In a context-dependent decision-making task with nested temporal structure, hierarchical inference additionally improves outcome prediction by capturing latent rule structure across contexts and timescales.

Methods

We formalize HOLMES as a Bayesian nonparametric model in which observations are assigned to paths through a latent tree. The model combines a hierarchical prior over tree structure with sequential Monte Carlo inference, allowing latent representations to be constructed and updated online from sequential observations. We first describe the standard flat latent-cause model, [7, 8] then introduce our hierarchical model and its online inference procedure.

Formal problem definition

We consider a task in which an agent observes a sequential dataset

$$D = \{d_1, d_2, \ldots, d_N\}, \tag{1}$$

where each observation d_i is an F-dimensional binary vector indicating the presence (1) or absence (0) of each feature at time i. In classical conditioning, for example, this vector may encode the presence of a cue and an outcome.
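For concreteness, such binary feature vectors might be laid out as follows (a hypothetical encoding for illustration only; the feature names and ordering are our own, not part of the model):

```python
# Each observation d_i is an F-dimensional binary vector.
# Here F = 3: [cue_light, cue_tone, outcome_reward] (hypothetical feature order).
D = [
    [1, 0, 1],  # light present, no tone, reward delivered
    [1, 0, 1],  # same contingency repeated
    [0, 1, 0],  # tone present, no reward
]
N, F = len(D), len(D[0])
assert all(len(d) == F and set(d) <= {0, 1} for d in D)
```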
The agent's goal is to infer, at each time step i, a latent assignment sequence

$$Z = \{Z_1, Z_2, \ldots, Z_N\}, \tag{2}$$

where Z_i ∈ {1, 2, ..., K} indexes the cluster associated with observation i. Critically, both the number of clusters K and their assignments are unknown and must be inferred from data. Because the number of possible partitions grows super-exponentially, exact inference is intractable. [14] We therefore combine a Bayesian nonparametric prior [15] with a sequential Monte Carlo method (particle filtering) [16, 17] to approximate posterior beliefs online.

Flat latent-cause model

In the flat latent-cause model, observations are grouped using a Bayesian nonparametric clustering formalism, with clusters commonly interpreted as "latent causes" that generate the observations. [7] On each trial, the model evaluates whether the current observation is better explained by an existing cluster or by introducing a new one.

Prior over cluster assignments

The prior over latent assignments is specified using the Chinese Restaurant Process (CRP) (Fig. 1A). This allows the number of clusters (tables in a restaurant) to grow with experience while favoring reuse of previously inferred clusters. [15] Under the CRP, the probability that observation i is assigned to an existing cluster k or a new cluster is:

$$P(Z_i = k \mid Z_{1:i-1}) = \frac{n_k}{i - 1 + \alpha}, \qquad P(Z_i = \text{new} \mid Z_{1:i-1}) = \frac{\alpha}{i - 1 + \alpha}, \tag{3}$$

where n_k is the number of previous observations assigned to cause k, and α is a concentration parameter. Larger α favors finer partitions with more clusters, whereas smaller α favors reuse of existing clusters and broader generalization.

Likelihood and online updates

Observations are modeled independently across features within each cluster using a factorized likelihood.
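Before detailing the likelihood, the CRP assignment rule (Eq. 3) can be sketched as follows (a minimal illustration; the function name and sampling loop are our own, not the authors' implementation):

```python
import random

def crp_assign(counts, alpha, rng=None):
    """Sample a cluster for the next observation under the CRP prior (Eq. 3).

    counts: occupancies n_k of the existing clusters.
    alpha:  concentration parameter (larger alpha favors new clusters).
    Returns an index into counts, or len(counts) for a new cluster.
    """
    rng = rng or random.Random(0)
    denom = sum(counts) + alpha          # (i - 1) + alpha
    probs = [n / denom for n in counts] + [alpha / denom]
    u, cum = rng.random(), 0.0
    for k, p in enumerate(probs):
        cum += p
        if u < cum:
            return k
    return len(counts)  # guard against floating-point underflow

# Example: three clusters with occupancies 4, 2, 1 and alpha = 1.0
k = crp_assign([4, 2, 1], alpha=1.0)
assert 0 <= k <= 3
```

With these occupancies, the largest cluster is reused with probability 4/8 and a new cluster is created with probability 1/8, illustrating the rich-get-richer dynamics of the prior.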
For each cluster k and feature f, the model maintains sufficient statistics:

• n_{f,k}: the number of times feature f has been observed as present in cluster k
• b_{f,k}: the number of times feature f has been observed as absent in cluster k

All feature counts are initialized with a symmetric pseudocount Ω > 0:

$$n^{(0)}_{f,k} = \Omega, \qquad b^{(0)}_{f,k} = \Omega. \tag{4}$$

This observation bias regularizes inference when data are sparse and encodes prior uncertainty about feature distributions. At each observation i, the agent observes a binary feature vector

$$\mathbf{d}_i = [d_{i,1}, \ldots, d_{i,F}], \qquad d_{i,f} \in \{0, 1\}, \tag{5}$$

and updates the sufficient statistics for the assigned cluster Z_i = k:

$$n^{(i)}_{f,k} = n^{(i-1)}_{f,k} + d_{i,f}, \tag{6}$$

$$b^{(i)}_{f,k} = b^{(i-1)}_{f,k} + (1 - d_{i,f}). \tag{7}$$

These recursive updates allow learning to proceed fully online.

Likelihood computation

Given the current feature counts, the likelihood of observation d_i under cluster k follows a Beta–Bernoulli distribution:

$$P(\mathbf{d}_i \mid Z_i = k) = \prod_{f=1}^{F} \begin{cases} \dfrac{n^{(i-1)}_{f,k}}{n^{(i-1)}_{f,k} + b^{(i-1)}_{f,k}} & \text{if } d_{i,f} = 1 \\[1.5ex] \dfrac{b^{(i-1)}_{f,k}}{n^{(i-1)}_{f,k} + b^{(i-1)}_{f,k}} & \text{if } d_{i,f} = 0 \end{cases} \tag{8}$$

where counts are initialized with pseudocount Ω (Eq. 4) and include all observations up to observation i − 1. The parameter Ω controls the influence of new evidence: larger Ω values promote slower updating and greater generalization, whereas smaller Ω values allow more rapid context differentiation at the cost of increased sensitivity to noise and a greater risk of overfitting.

Sequential Monte Carlo inference

Exact posterior inference over latent assignments is intractable. We therefore approximate the posterior using particle filtering. The particle filter maintains an ensemble of P particles, each representing a hypothesis about the current clustering state, including cluster assignments and feature statistics.
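Before moving to inference, the count updates (Eqs. 6–7) and the Beta–Bernoulli likelihood (Eq. 8) admit a compact sketch (a minimal illustration; the dict-based cluster representation and function names are our own, not the authors' implementation):

```python
def init_counts(F, omega):
    """Initialize symmetric pseudocounts (Eq. 4): n_{f,k} = b_{f,k} = omega."""
    return {"n": [float(omega)] * F, "b": [float(omega)] * F}

def likelihood(cluster, d):
    """Beta-Bernoulli predictive likelihood of binary vector d under one cluster (Eq. 8)."""
    p = 1.0
    for f, x in enumerate(d):
        n, b = cluster["n"][f], cluster["b"][f]
        p *= n / (n + b) if x == 1 else b / (n + b)
    return p

def update(cluster, d):
    """Online count updates for the assigned cluster (Eqs. 6-7)."""
    for f, x in enumerate(d):
        cluster["n"][f] += x
        cluster["b"][f] += 1 - x

# Example: two features, pseudocount omega = 1
c = init_counts(F=2, omega=1.0)
assert likelihood(c, [1, 0]) == 0.25                # (1/2) * (1/2) before any data
update(c, [1, 0])
assert abs(likelihood(c, [1, 0]) - 4 / 9) < 1e-12   # (2/3) * (2/3) after one update
```

Repeating the same observation makes it progressively more likely under the cluster, which is exactly the evidence accumulation the recursive updates are meant to provide.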
At observation i, particles sample assignments from the CRP prior and are weighted by the likelihood of the current observation:

$$w^{(p)}_i = P(\mathbf{d}_i \mid Z^{(p)}_i). \tag{9}$$

Weights are normalized and particles are resampled at each observation to prevent particle degeneracy. The model's predictions are computed as posterior expectations over the particle ensemble, naturally reflecting uncertainty when particles disagree.

HOLMES: online hierarchical latent structure learning

We extend the flat latent-cause model by assigning observations to paths through a hierarchical tree of latent causes. Unlike standard flat formulations, this extension organizes latent structure across multiple levels of abstraction, allowing higher-level nodes to capture shared regularities and lower-level nodes to capture more detailed distinctions.

Hierarchical prior over latent paths

We define the prior over tree structures using a depth-decayed, sticky nested Chinese Restaurant Process (nCRP). [11, 12] Unlike a standard CRP, where observations are assigned to a single partition, the nCRP recursively partitions data at multiple levels (Fig. 1B). At each trial t, each particle p samples a sequence of latent assignments

$$\mathbf{c}^{(p)}_t = \big( c^{(p,0)}_t, c^{(p,1)}_t, \ldots, c^{(p,L-1)}_t \big),$$

where c^{(p,l)}_t indexes the node at level l, and the depth L may vary across trials.

Depth-decayed concentration and stopping. We define the depth-adjusted concentration as

$$\alpha_\ell = \alpha \, e^{-\alpha \ell}, \tag{10}$$

which implements a depth budget constraint: the exponential decay rate is directly tied to the base concentration, such that models with higher α (favoring proliferation of clusters) deplete their capacity for deep branching more rapidly. This coupling ensures that the model cannot simultaneously maintain high branching factors and deep hierarchies, a form of implicit model capacity control.
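The depth budget implied by Eq. 10 can be seen numerically (a minimal sketch; the helper name is our own):

```python
import math

def alpha_at(alpha, level):
    """Depth-decayed concentration (Eq. 10): alpha_l = alpha * exp(-alpha * l)."""
    return alpha * math.exp(-alpha * level)

# A high base concentration branches readily at the root but depletes its
# depth budget quickly; a low one branches less but decays more slowly.
for base in (0.5, 2.0):
    print(base, [round(alpha_at(base, level), 3) for level in range(4)])
```

Already at level 1, the larger base concentration (α = 2.0) yields a smaller α_ℓ than the smaller one (α = 0.5), consistent with the tradeoff between branching factor and depth described above.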
Under this parameterization, α_ℓ is non-monotone in α for fixed ℓ > 0 (it is maximized at α = 1/ℓ), reflecting the tradeoff between local branching propensity and global depth budget. We interpret α as controlling overall model capacity rather than concentration at any single level. [18, 19]

At level ℓ, given parent node u, the probability of assigning an observation to an existing child branch k or creating a new branch follows the standard CRP form with the depth-adjusted concentration described above:

$$P(c^{(p,\ell)}_t = k \mid u) = \frac{n^{(p,\ell)}_{u,k}}{N^{(p,\ell)}_u + \alpha_\ell}, \qquad P(c^{(p,\ell)}_t = \text{new} \mid u) = \frac{\alpha_\ell}{N^{(p,\ell)}_u + \alpha_\ell}, \tag{11}$$

where n^{(p,ℓ)}_{u,k} is the number of previous assignments to child k at level ℓ within particle p, and N^{(p,ℓ)}_u = Σ_k n^{(p,ℓ)}_{u,k} is the total count at level ℓ within particle p. We impose a maximum branching factor of 20 children per node for computational tractability, as a weak-limit approximation.

In addition to depth-dependent branching, the model includes a stochastic stopping mechanism that allows path traversal to terminate at any level. [20] After selecting a branch at level ℓ > 0, traversal stops with probability:

$$P(\text{stop at level } \ell) = \frac{1}{1 + \alpha_\ell}. \tag{12}$$

Stopping is not permitted at level ℓ = 0, ensuring that at least one representational level exists. Together, depth-dependent concentration and stochastic stopping define a flexible prior over hierarchical depth without requiring an explicit depth limit.

Persistence and node reuse

To capture temporal persistence in latent structure, we include a stickiness bias [21] that increases the probability of reusing the previously inferred latent configuration at each level and allows reuse of global node identities across particles.
When different particles discover identical substructures, they share the same global node rather than maintaining redundant local copies [12] (see Supplementary Methods for the full implementation).

Hierarchical inference procedure

Inference uses the same particle filtering framework as in the flat model. At each trial, each particle samples a path through the hierarchical prior, evaluates the likelihood of the observation at the leaf node, and updates its weight accordingly. Weights are normalized and particles resampled to approximate the posterior distribution. As in the flat model, likelihoods are evaluated only at the leaf nodes of the inferred hierarchy, based on the observation statistics (see Supplementary Methods). This allows the inference space to remain constrained even as the hypothesis space grows.

Simulation environment

Compositional task environment

We first evaluated the models in synthetic compositional tasks (Fig. 2A). These tasks contained two to five levels, where each level represented a binary latent category that determined observable feature values. Observations were binary feature vectors (Fig. S2) generated by level-specific rules. The top n − 2 levels were encoded as single binary features, the next lower identity was encoded with two binary features, and the lowest levels were encoded as a one-hot vector across four random features. Observations were generated by sampling a latent category value at each hierarchical level, then deterministically encoding the resulting configuration as features. Outcomes were determined by a conjunctive rule over the latent category assignments at the two highest levels (Fig. 2A).

Context-dependent task with nested temporal structure

We next evaluated the models in a synthetic context-dependent task with nested temporal structure.
On each trial, the model observed one of four stimuli defined by the combination of two binary feature dimensions (e.g., color and shape), and reward depended on the current task context. [22, 23] The task alternated between two slowly changing rule contexts, each specifying which feature dimension determined reward. In one context, the first feature dimension was predictive of reward while the second was irrelevant; in the other context, the second feature dimension was predictive while the first was irrelevant. Within each slow context, the rewarded feature value also switched, creating nested temporal structure across timescales (Fig. 4A,B).

Evaluation metrics

Outcome prediction accuracy. Outcome prediction accuracy was defined as the proportion of trials in which the model correctly predicted whether an outcome occurred.

Representational efficiency. To quantify representational efficiency, we measured the entropy of cluster assignments within each ground-truth category, $H_m = -\sum_k p^m_k \log p^m_k$, where $p^m_k$ is the proportion of category-m trials assigned to cluster k. Lower entropy indicates that trials from a category concentrate in fewer clusters. In the transfer task, categories corresponded to ground-truth labels at a given hierarchical level. For the temporal task, categories were the four latent states formed by crossing contexts (shape rule vs. texture rule) with rewarded values (e.g., shape-circle, texture-stripes). For hierarchical models, we report results from the highest non-root tree level created. We averaged entropy across categories, weighted by trial frequency (see Supplementary Methods for details).

One-shot transfer. To assess generalization, we evaluated one-shot transfer after training.
A single higher-level labeled exemplar was presented, and models were required to generalize that label to previously observed stimuli belonging to the same latent category. Transfer performance was quantified using recall: the proportion of true same-category trials correctly identified. [24] We focus on recall as the primary metric because (1) category membership is balanced across trials, making recall directly interpretable as generalization success, and (2) the scientific question concerns the model's ability to recognize same-category instances across contexts, which recall measures directly. Recall is highly correlated with the F1 score across all task complexities (for additional metrics, see Supplementary Table 2).

Results

Hierarchical inference preserves outcome prediction while improving representational efficiency

We first examined whether hierarchical inference improves the efficiency with which compositional structure is represented during learning. In many environments, observable features combine to form higher-level latent categories that determine outcomes. Learning in such settings requires discovering how observations compose into reusable abstractions. We therefore asked whether hierarchical clustering produces more efficient internal representations than flat clustering when learning compositional tasks.

To test this, we constructed synthetic categorization environments in which observations were generated from latent factors organized in a nested hierarchy (Fig. 2A). In the simplest 2-level version, observable binary features combine to define a latent category, which in turn determines the binary outcome (reward/no reward; Fig. 2A). More complex versions add additional latent factors in the hierarchy above this category, creating deeper hierarchical structures. These tasks allow a direct comparison between two representational strategies.
A flat model represents each feature combination independently, whereas a hierarchical model can reuse latent structure across observations by organizing them into a tree of shared causes. Fig. 2B illustrates the difference in the 2-level task: the flat model creates separate clusters for each observation-outcome combination at the observation level, while HOLMES compresses this information into a tree with category-level groups. This organization allows multiple observations to inherit shared latent structure rather than being encoded independently.

Despite their architectural differences, both models achieved equivalent performance on the primary learning objective. We quantified learning performance as outcome prediction accuracy: the proportion of trials on which models correctly predicted binary outcomes (reward vs. no reward) from observable features. At 2-level complexity, flat and hierarchical models showed statistically indistinguishable accuracy (Fig. 2C). Thus hierarchical organization does not impair predictive learning.

However, the models differed markedly in representational efficiency. To quantify this, we measured the entropy of cluster assignments within each true category, a metric capturing how consistently trials from the same category mapped to the same cluster. Lower entropy indicates more organized, efficient representations. HOLMES showed significantly lower entropy than flat models on the 2-level task (Fig. 2D), indicating more consistent category-to-cluster mappings despite equivalent outcome prediction.

We next examined how these effects scale with task complexity. We generated tasks with up to 5 hierarchical levels, training both models to predict the same binary outcome across increasing structural complexity. Across all complexity levels, both models achieved comparable accuracy in predicting outcomes (Fig. 2E).
However, HOLMES showed significantly lower entropy than flat models at all complexity levels (Fig. 2F), indicating more consistent category-to-cluster mappings regardless of task depth. We further quantified compression efficiency by measuring the number of clusters learned relative to the true number of latent categories. HOLMES maintained near-optimal compression across task complexities, using approximately as many clusters as true category labels (Fig. 2G, light). In contrast, flat models showed increasing redundancy with complexity, requiring progressively more clusters than necessary (Fig. 2G, dark). Together, these results demonstrate that hierarchical clustering preserves predictive performance while learning substantially more efficient internal representations of compositional structure.

Figure 2: Hierarchical inference preserves outcome prediction accuracy while improving representational efficiency. (A) Two-level hierarchical task structure. Observations (e.g., A′, B″) vary along binary features (observation level) and category identity (latent). Category determines reward outcome. (B) Example learned model structures. The flat latent-cause model represents each observation type separately at the observation level, while HOLMES learns a compressed category-level representation. (C) Outcome prediction performance on the 2-level task. Both models achieve equivalently high accuracy. Error bars: 95% CI across 200 parameter combinations. (D) Representational efficiency measured by average entropy of cluster assignments. HOLMES (hierarchical models) shows significantly lower entropy (0.076 ± 0.017 vs. 0.131 ± 0.030), indicating better compression. Error bars: 95% CI across 200 parameter combinations. (E) Outcome prediction performance across task complexity. Both models achieve comparable asymptotic accuracy (84–100%) across all complexity levels, with small effect sizes (Cohen's |d| < 0.4). Error bars: 95% CI across 200 parameter combinations. (F) Representational efficiency across complexity. HOLMES shows significantly lower entropy at all levels, with increasing advantage at higher complexities (2-level: −0.055; 5-level: −0.827). Error bars: 95% CI across 200 parameter combinations. (G) Number of learned clusters across task complexity. HOLMES (light) maintains near-optimal compression across complexities, while flat models (dark) show increasing redundancy, using significantly more clusters at all levels (differences: −1.1 to −2.7 clusters). Error bars: 95% CI across 200 parameter combinations.

Hierarchical representations enable one-shot transfer across latent categories

If hierarchical inference discovers latent compositional structure during learning, the resulting representations should support rapid generalization across observations sharing the same latent category. We therefore next asked whether the hierarchical model's more efficient representations enable one-shot transfer across latent categories.

To test this prediction, we implemented a one-shot transfer paradigm (Fig. 3A). Models were first trained on outcome prediction exactly as before, learning to predict reward outcomes without ever receiving explicit category labels (e.g., "category A"). After training, we introduced a teaching phase consisting of a single labeled example: showing the model one previously observed stimulus and labeling it as belonging to a particular latent category.

Figure 3: Hierarchical advantage in one-shot transfer emerges with task complexity. (A) Transfer task structure. Models are first trained on binary outcome prediction as in the previous task. After training, models receive a single labeled example identifying one observation as a particular label, such as "observation A′ belongs to A" (teaching phase). In the transfer test, models must generalize this category label to identify which past observations belong to the same latent category, despite never receiving explicit category labels during training. (B) One-shot transfer across task complexity. While both models showed a decrease in recall accuracy as task complexity increased, HOLMES (light, triangles) maintained superior transfer at higher complexities, while flat models (dark, circles) showed declining performance. At 2-level complexity, both models performed comparably (flat: 89.7 ± 2.0%, hierarchical: 89.3 ± 1.8%); however, hierarchical models significantly outperformed flat models at higher complexities (3-level: +21.0%, 95% CI [19.2%, 22.8%]; 4-level: +24.7%, 95% CI [22.6%, 26.8%]; 5-level: +26.6%, 95% CI [24.4%, 28.8%]). Error bars represent 95% confidence intervals across 200 parameter combinations. (C) Hierarchical advantage (hierarchical minus flat transfer accuracy) increases systematically with task complexity, transitioning from no advantage at 2 levels to substantial advantages at 3+ levels. Error bars represent 95% confidence intervals across n = 200 parameter combinations. (D) One-shot transfer performance on the most complex task for all levels tested. HOLMES (hierarchical model; right) outperforms the flat model, with a greater advantage at deeper levels. (E) Relationship between outcome prediction accuracy and one-shot transfer accuracy across 200 parameter settings for hierarchical (light) and flat (dark) models. Each point represents one parameter combination averaged over seeds. Regression lines show 95% confidence bands.
In the transfer test, models had to generalize this label to identify which other previously observed stimuli belonged to the same category, despite having only seen a single labeled example and never being trained on categorization directly. This paradigm requires models to have spontaneously discovered the latent category structure during outcome prediction learning. A model that only learned observation-level associations must treat each stimulus independently and therefore cannot reliably generalize categories. In contrast, a model that learned abstract category-level representations should readily transfer the label across different colored instances of the same latent structure.

Consistent with this prediction, HOLMES showed stronger transfer performance than the flat model as task complexity increased (Fig. 3B). While both models performed similarly in the simplest 2-level task, HOLMES substantially outperformed the flat model as the tasks became deeper (Fig. 3B). The hierarchical advantage increased systematically with task complexity (Fig. 3C), indicating that the benefit of hierarchical representations becomes more pronounced as latent structure deepens.

To examine generalization across different abstraction levels, we tested both models on all hierarchical levels in the most complex (5-level) task. The hierarchical model outperformed the flat model across all tested levels, with the advantage most pronounced at deeper levels of the hierarchy (Fig. 3D).

We next examined the relationship between outcome prediction accuracy and one-shot transfer performance across parameter regimes. Both model architectures exhibited a negative correlation between these objectives, suggesting a fundamental tradeoff between fine-grained discrimination and broad generalization.
Parameter regimes that optimize outcome prediction create many fine-grained observation-level clusters, which can impair generalization. However, at matched levels of outcome prediction accuracy, hierarchical models consistently achieved superior transfer performance, suggesting a Pareto improvement rather than a simple parameter-dependent tradeoff (Fig. 3E). This advantage likely reflects the hierarchical model's ability to simultaneously maintain task-relevant fine-grained distinctions at lower tree levels while building reusable abstractions at higher levels, a form of representational specialization unavailable to flat architectures.

To assess whether these advantages depended on specific parameter choices, we evaluated model performance across a broad range of concentration and bias parameters. Across 200 sampled parameter combinations, HOLMES consistently produced more efficient representations (Supplementary Fig. S4A,B) and higher transfer accuracy than the flat latent-cause model (Supplementary Fig. S4C,D). These results indicate that the hierarchical advantage reflects architectural properties of the model rather than parameter tuning. Together, these findings show that hierarchical structure learning enables agents to organize experience into reusable abstractions that support both accurate prediction and rapid transfer from minimal supervision.

Hierarchical inference improves outcome prediction in a context-dependent task with nested temporal structure

So far we have examined compositional environments, in which latent structure is embedded in combinations of observable features. However, many environments are structural: the relevant regularities are rules that govern how features map to outcomes across contexts and time. We therefore asked whether hierarchical inference also improves learning when the relevant structure unfolds across multiple timescales.
To test this, we constructed a synthetic context-dependent task with nested temporal structure. In this task, models had to infer both a slowly changing rule context and faster switches in the rewarded feature value within each context. On each trial, the model observed one of four stimuli defined by two binary feature dimensions. The rewarded feature dimension depended on a slowly changing rule context. In one context, the first feature dimension predicted reward, whereas in the other context, the second dimension was predictive. Within each rule context, the rewarded feature value also switched periodically, creating nested temporal structure across timescales (Fig. 4A,B).

Unlike in the compositional tasks, we found that the hierarchical model showed a clear advantage in outcome prediction. The flat model treats each stimulus–outcome contingency independently and therefore operates near chance performance. In contrast, HOLMES can capture the latent rule structure linking contexts, feature dimensions, and rewarded feature values (Fig. 4C).

Representational efficiency mirrored the pattern observed in the compositional tasks. HOLMES produced substantially more specialized representations, with trials from each latent state concentrated in fewer clusters. This resulted in significantly lower within-state entropy than the flat model (Fig. 4D) and fewer clusters used per latent state (Fig. 4E). These results show that the hierarchical model discovers compact representations aligned with the latent rule structure across contexts and timescales, improving both predictive performance and representational efficiency.

Discussion

We introduced HOLMES, an online hierarchical latent-cause model for sequential structure learning.
The model combines hierarchical nonparametric structure with tractable sequential inference, allowing latent causes to be organized across multiple levels of abstraction during online learning. Across compositional tasks, the hierarchical model matched the predictive performance of flat latent-cause models while learning more compact representations and enabling improved one-shot transfer. In a context-dependent task with nested temporal structure, hierarchical inference additionally improved outcome prediction by capturing latent rule structure across contexts and timescales. These results demonstrate how hierarchical latent structure can support both efficient representation and flexible generalization in sequential learning.

Our model bridges two previously separate approaches. Hierarchical Bayesian models, such as the nested Chinese Restaurant Process (nCRP),11,12 originally developed for hierarchical topic modeling, can represent rich multi-level structure but typically rely on batch inference. Online latent-cause inference models7 perform sequential inference but operate over flat latent spaces. By combining a modified nCRP prior with particle filtering, our model supports trial-by-trial inference over hierarchical latent structure. Unlike other hierarchical models that rely on batch inference, fixed hierarchies,25 or Bayesian model selection over candidate structures (e.g., the COIN model26), HOLMES performs fully online inference of arbitrary depth, enabling dynamic construction of hierarchical representations during sequential experience.

Our simulations illustrate two functional consequences of hierarchical inference. In compositional environments, hierarchical structure learning produced substantially more efficient representations while preserving outcome prediction performance.
These compressed representations aligned with the latent generative structure of the task and enabled rapid generalization across observations sharing higher-level categories. The flat latent-cause model, in contrast, required multiple independent clusters to represent the same structure, limiting its ability to support transfer across related observations. This result highlights how representational organization, rather than predictive accuracy alone, can determine the range of generalization operations supported by a learned model.

Figure 4: Hierarchical inference improves prediction in a context-dependent task with nested temporal structure. (A) Task structure. The task involves two slow-changing contexts, each specifying which of two binary feature dimensions determines reward (illustrated here as shape and texture). In shape-rule contexts, shape determines reward (circle or triangle), while texture is irrelevant. In texture-rule contexts, texture determines reward (dots or stripes), while shape is irrelevant. (B) Nested temporal structure. Each block presents all four stimulus combinations (shape × texture), ensuring the same stimulus can yield different outcomes depending on the current context. Within each slow context, the rewarded feature value switches in sub-blocks. Dashed lines indicate these fast value switches within contexts. Stars denote rewarded stimuli. (C) Outcome prediction accuracy. HOLMES (hierarchical models; light gray) achieves significantly higher accuracy than flat models (dark gray) (Flat: 48.1 ± 0.3, HOLMES: 80.3 ± 1.1; 95% CI). (D) Within-state entropy. HOLMES achieves lower within-state entropy, indicating that each of the four latent states (shape-circle, shape-triangle, texture-stripes, texture-dots) maps to fewer clusters. Lower entropy indicates more efficient representations (Flat: 2.6 ± 0.03, HOLMES: 1.8 ± 0.1; 95% CI). (E) Representational efficiency. HOLMES uses fewer clusters per latent state (Flat: 32.3 ± 0.9, HOLMES: 15.1 ± 1.1; 95% CI).

Prior work has shown that one-shot generalization can arise from richly structured generative priors (e.g., 4). The open question, however, is how reusable abstractions that support one-shot generalization can be discovered incrementally from sequential experience without explicit supervision over the latent structure. Our contribution is therefore not one-shot learning per se, but a tractable online framework for inferring hierarchical latent representations that make one-shot transfer possible. Unlike standard one-shot concept-learning benchmarks, the task studied here requires category structure to be inferred from unlabeled sequential experience rather than specified in advance by a richly structured concept prior.

In a context-dependent task with nested temporal structure, hierarchical inference provided an additional advantage by enabling discovery of latent rule structure across contexts and timescales. Because stimulus-outcome associations alone were insufficient to solve the task, successful learning required identifying the rule governing which feature dimension predicted reward. The hierarchical model could capture this structure by organizing contexts and value states into a nested representation, whereas the flat model treated each contingency independently. This result highlights how hierarchical inference can support rule learning in environments where relevant structure unfolds across multiple temporal scales.

An important refinement of our model is that α decays with tree depth. This can be interpreted as a bounded-rational27 prior over structure: agents cannot "over-carve" the world at every scale but must allocate a limited segmentation budget.
This constraint is consistent with the idea that abstraction is economical: higher levels of representation are sparse and conservative, while finer distinctions emerge only when prediction errors justify additional complexity. This depth-dependent attenuation implements a soft penalty on hierarchical elaboration, capturing the trade-off between expressive capacity and representational simplicity. More broadly, the model connects to theories of approximate Bayesian inference under resource constraints.28,29,27 Particle filtering approximates the Bayesian posterior using limited computational resources, introducing a trade-off between accuracy and computational cost: the number of particles determines how well the filter can represent multimodal hierarchical beliefs.

Several limitations remain. Our analysis focused on synthetic tasks with discrete binary features and known generative trees. In real-world environments, hierarchical structure may be ambiguous or partially overlapping. The model also currently lacks a forgetting or pruning mechanism,30 which may be important in non-stationary environments. Incorporating structural pruning or recency-weighted updates may allow the model to adapt more flexibly to changing environments. Despite these limitations, our work provides a tractable computational framework for online hierarchical latent-cause inference, offering a foundation for studying compositional reasoning, rule discovery, continual learning, and structured generalization in both biological and artificial learning systems.

Code availability. All model and simulation code will be made available at https://github.com/Iigaya-Lab/HOLMES-2026. All simulations were implemented in Python using NumPy for numerical operations, SciPy and scikit-learn for statistics; figures were generated using Matplotlib and Seaborn.
Acknowledgments

The authors would like to thank Kim Stachenfeld, Adithya Gungi, Deniz Yagmur Urey, and Sashank Pisupati. Research reported in this publication was supported by the National Institute of Mental Health (R01MH136214; KI), the Brain & Behavior Research Foundation Young Investigator Grant (KI), and the National Institute of Neurological Disorders and Stroke (T32NS064929; IA).

References

1. Joshua B. Tenenbaum and Thomas L. Griffiths. Generalization, similarity, and Bayesian inference. Behavioral and Brain Sciences, 24(4):629–640, 2001.

2. Joshua B. Tenenbaum, Charles Kemp, Thomas L. Griffiths, and Noah D. Goodman. How to grow a mind: Statistics, structure, and abstraction. Science, 331(6022):1279–1285, 2011.

3. Omri Barak, Mattia Rigotti, and Stefano Fusi. The sparseness of mixed selectivity neurons controls the generalization–discrimination trade-off. The Journal of Neuroscience, 33(9):3844–3856, February 2013.

4. Brenden M. Lake, Ruslan Salakhutdinov, and Joshua B. Tenenbaum. Human-level concept learning through probabilistic program induction. Science, 350(6266):1332–1338, December 2015.

5. John R. Anderson. The adaptive nature of human categorization. Psychological Review, 98(3):409, 1991.

6. V. K. Mansinghka, C. Kemp, J. B. Tenenbaum, and T. L. Griffiths. Structured priors for structure learning. In Proceedings of the Twenty-Second Conference on Uncertainty in Artificial Intelligence, pages 324–331, 2006.

7. Samuel J. Gershman, David M. Blei, and Yael Niv. Context, learning, and extinction. Psychological Review, 117(1):197–209, January 2010.

8. Samuel J. Gershman and Yael Niv. Exploring a latent cause theory of classical conditioning. Learning & Behavior, 40(3):255–268, September 2012.

9. John P. O'Doherty, Ueli Rutishauser, and Kiyohito Iigaya. The hierarchical construction of value.
Current Opinion in Behavioral Sciences, 41:71–77, October 2021.

10. Yee Whye Teh, Michael I. Jordan, Matthew J. Beal, and David M. Blei. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476):1566–1581, December 2006.

11. Thomas Griffiths, Michael Jordan, Joshua Tenenbaum, and David Blei. Hierarchical topic models and the nested Chinese restaurant process. Advances in Neural Information Processing Systems, 16, 2003.

12. David M. Blei, Thomas L. Griffiths, and Michael I. Jordan. The nested Chinese restaurant process and Bayesian nonparametric inference of topic hierarchies. Journal of the ACM, 57(2):1–30, 2010.

13. Samuel J. Gershman and David M. Blei. A tutorial on Bayesian nonparametric models. Journal of Mathematical Psychology, 56(1):1–12, February 2012.

14. Nils Lid Hjort, Chris Holmes, Peter Müller, and Stephen G. Walker. An Invitation to Bayesian Nonparametrics, pages 1–21. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2010.

15. David J. Aldous. Exchangeability and Related Topics, pages 1–198. Springer Berlin Heidelberg, 1985.

16. Maarten Speekenbrink. A tutorial on particle filters. Journal of Mathematical Psychology, 73:140–152, August 2016.

17. Luca Martino, Víctor Elvira, and Gustau Camps-Valls. Group importance sampling for particle filtering and MCMC. Digital Signal Processing, 82:133–151, November 2018.

18. Ishita Dasgupta and Thomas L. Griffiths. Clustering and the efficient use of cognitive resources. Journal of Mathematical Psychology, 109:102675, August 2022.

19. Payam Piray and Nathaniel D. Daw. A model for learning based on the joint estimation of stochasticity and volatility. Nature Communications, 12(1), November 2021.

20. Zoubin Ghahramani, Michael Jordan, and Ryan P. Adams. Tree-structured stick breaking for hierarchical data.
Advances in Neural Information Processing Systems, 23, 2010.

21. Emily B. Fox, Erik B. Sudderth, Michael I. Jordan, and Alan S. Willsky. The sticky HDP-HMM: Bayesian nonparametric hidden Markov models with persistent states. arXiv preprint, 2007.

22. Valerio Mante, David Sussillo, Krishna V. Shenoy, and William T. Newsome. Context-dependent computation by recurrent dynamics in prefrontal cortex. Nature, 503(7474):78–84, November 2013.

23. Silvia Bernardi, Marcus K. Benna, Mattia Rigotti, Jérôme Munuera, Stefano Fusi, and C. Daniel Salzman. The geometry of abstraction in the hippocampus and prefrontal cortex. Cell, 183(4):954–967.e21, November 2020.

24. Michael Buckland and Fredric Gey. The relationship between recall and precision. Journal of the American Society for Information Science, 45(1):12–19, 1994.

25. Chong Wang, John Paisley, and David M. Blei. Online variational inference for the hierarchical Dirichlet process. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 752–760. JMLR Workshop and Conference Proceedings, 2011.

26. James B. Heald, Máté Lengyel, and Daniel M. Wolpert. Contextual inference underlies the learning of sensorimotor repertoires. Nature, 600(7889):489–493, November 2021.

27. Rahul Bhui, Lucy Lai, and Samuel J. Gershman. Resource-rational decision making. Current Opinion in Behavioral Sciences, 41:15–21, October 2021.

28. Herbert A. Simon. Models of Man: Social and Rational. Wiley, 1957.

29. Samuel J. Gershman, Eric J. Horvitz, and Joshua B. Tenenbaum. Computational rationality: A converging paradigm for intelligence in brains, minds, and machines. Science, 349(6245):273–278, July 2015.

30. Quentin J. M. Huys, Neir Eshel, Elizabeth O'Nions, Luke Sheridan, Peter Dayan, and Jonathan P. Roiser.
Bonsai trees in your head: How the Pavlovian system sculpts goal-directed choices by pruning decision trees. PLoS Computational Biology, 8(3):e1002410, 2012.

31. N. Kantas, A. Doucet, S. S. Singh, and J. M. Maciejowski. An overview of sequential Monte Carlo methods for parameter estimation in general state-space models. IFAC Proceedings Volumes, 42(10):774–785, 2009.

32. Emily B. Fox, Erik B. Sudderth, Michael I. Jordan, and Alan S. Willsky. A sticky HDP-HMM with application to speaker diarization. The Annals of Applied Statistics, 5(2A), June 2011.

Supplemental Materials

Extended Methods

Particle Filter

Exact posterior inference over latent assignments is intractable in both the flat and hierarchical models. We therefore approximate the posterior using a particle filter with P = 200 particles, which we found to provide stable estimates without unnecessary computational cost. Each particle p ∈ {1, …, P} maintains a cluster assignment Z_t^(p) sampled from the CRP prior, together with feature count statistics {n_{f,k}^(p), b_{f,k}^(p)} for each cluster k discovered by that particle. Each particle's weight w_t^(p) represents how well that hypothesis explains the observed data. Since particles are sampled directly from the CRP prior rather than from a separate proposal distribution, the weight reduces to the data likelihood under that particle's cluster assignment:

    w_t^(p) = P(O_t | Z_t^(p)).    (13)

Weights are normalized so that Σ_p w_t^(p) = 1, forming a discrete approximation to the posterior. Particles are resampled at each time step with probability proportional to their weights, concentrating computational resources on high-likelihood hypotheses and pruning implausible ones, thereby preventing particle degeneracy.31 Hypotheses that predict data well proliferate, while poor hypotheses are eliminated.
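The propose-weight-resample cycle described above can be sketched in a few lines of Python. This is a minimal illustration only: the function and variable names are ours, and the toy `sample_prior` and `likelihood` stand in for the model's actual CRP prior and Beta-Bernoulli likelihood.

```python
import numpy as np

def particle_filter_step(particles, obs, sample_prior, likelihood, rng):
    """One trial of sequential importance resampling.

    Assignments are proposed from the prior, so each particle's importance
    weight reduces to the data likelihood (Eq. 13); weights are normalized
    and particles resampled in proportion to them.
    """
    P = len(particles)
    weights = np.empty(P)
    for p in range(P):
        particles[p] = sample_prior(particles[p], rng)   # propose from the prior
        weights[p] = likelihood(obs, particles[p])       # w_t^(p) = P(O_t | Z_t^(p))
    weights = weights / weights.sum()                    # normalize: sum_p w_t^(p) = 1
    idx = rng.choice(P, size=P, p=weights)               # resample proportional to weight
    particles = [dict(particles[i]) for i in idx]
    return particles, np.full(P, 1.0 / P)                # uniform weights after resampling

# Toy demo with 200 particles, matching the particle count used in the paper.
rng = np.random.default_rng(0)
particles = [{"z": 0} for _ in range(200)]

def sample_prior(state, rng):
    state["z"] = int(rng.integers(0, 2))       # toy stand-in for a CRP draw
    return state

def likelihood(obs, state):
    return 0.8 if state["z"] == obs else 0.2   # obs is more probable under z == obs

particles, weights = particle_filter_step(particles, 1, sample_prior, likelihood, rng)
```

After one step, hypotheses consistent with the observation (here z = 1) dominate the ensemble, illustrating how resampling concentrates particles on high-likelihood hypotheses.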
The model's predictions are computed as posterior expectations over the particle ensemble, naturally reflecting uncertainty when particles disagree about cluster assignments: when particles are distributed across different cluster assignments, the prediction is a mixture of different cluster statistics. We consider particle filtering to be particularly well-suited to modeling cognition because it operates in a single forward pass through the data stream, maintaining a running posterior approximation that can generate predictions at any time step. This mirrors the constraints faced by biological agents processing temporal sequences of experience, in contrast to batch sampling methods such as Gibbs sampling that require multiple passes over the full dataset.

Depth-decay interpretation. The exponential decay schedule in Eq. 10 has a natural bounded-rationality interpretation: it implements a fixed segmentation budget that must be allocated across levels of the hierarchy. Higher α exhausts this budget quickly, producing shallow trees; lower α preserves budget for deeper levels, encouraging finer-grained distinctions when evidence warrants them. This adaptive depth control emerges automatically from the concentration schedule without requiring an explicit depth limit. The combination of depth decay and stochastic stopping (Eq. 12) provides two complementary mechanisms for controlling tree complexity: depth decay exponentially reduces the prior probability of branching at deeper levels, while stochastic stopping allows the model to terminate paths early even when branches exist. Together, they implement a flexible prior over hierarchical depth that adapts to environmental structure.

Likelihood computation. Likelihoods are evaluated only at the leaf nodes of the inferred hierarchy.
Each leaf maintains Beta-Bernoulli feature statistics {n_{f,k}^(p), b_{f,k}^(p)}, initialized with pseudocount Ω, exactly as in the flat model. After observing feedback, feature counts for the active leaf are updated incrementally. For numerical stability, all likelihood computations are performed in log-space.

Figure S1: Flat and hierarchical model sequential learning. (A) One-layer ("flat") model. Standard CRP-based models maintain a single partition over observations, incrementally assigning each observation to an existing cluster or creating a new one. Trial 1: the first observation creates a new group. Trial 2: the second observation (dashed) creates another new group. Trial 3: the third observation is assigned to the existing compatible group. Trial n: after many trials, the model has discovered multiple groups at a single level of abstraction. (B) Online hierarchical latent structure learning. Our model generalizes the flat formulation by organizing latent structure across multiple levels of abstraction online. Trial 1: the first observation creates new nodes at multiple levels (each marked 'new'). Trial 2: the second observation creates a new branch, forming a sibling relationship with the first observation at higher levels while differing at lower levels, and deepening the hierarchy. Trial 3: the third observation reuses existing structure at higher levels but creates new structure at the observation level. Trial n: the model has discovered a multi-level tree with shared structure at higher levels.

Canonical node reuse. When different particles discover identical substructures during inference, they share the same global node identity rather than maintaining redundant local copies.
This canonical node reuse12 supports hierarchical abstraction by ensuring that the inferred tree represents shared latent structure across the particle ensemble, rather than a collection of particle-specific local trees. Global node identities are assigned at the time a new branch is first created and are preserved across all particles that subsequently visit the same branch.

Stickiness

To encourage temporal persistence, we add 'stickiness' to states.32 This sticky parameter augments the Chinese Restaurant Process by adding extra weight Ω to self-transitions:

    P(c_t^(p,ℓ) = k | c_{t−1}^(p,ℓ), u) =
        [n_{u,k}^(p,ℓ) + Ω · 1[k = c_{t−1}^(p,ℓ)]] / [N_u^(p,ℓ) + α_ℓ + Ω],   if k is an existing cluster,
        α_ℓ / [N_u^(p,ℓ) + α_ℓ + Ω],                                          if k is new,    (14)

where Ω serves dual roles as both the outcome prior pseudocount and the stickiness parameter (denoted κ in Fox et al.). This coupling maintains consistent prior strength across model components: strong prior beliefs (Ω large) simultaneously bias toward fewer clusters (via the outcome prior) and persistent assignments (via stickiness), naturally linking representational parsimony with temporal stability.

Simulations

We instantiate the hierarchical latent-cause model parameterized by the concentration parameter (and depth limiter) α and the observation pseudocount (and stickiness) Ω as described above. We ran simulations using 200 particles for all main analyses, providing stable estimates without unnecessary computational cost. For hierarchical modeling, we set a maximum tree depth of 20 levels and a maximum of 20 children per node to limit runaway growth while retaining substantial representational capacity. To systematically assess the representational efficiency of hierarchical versus flat structure, we generated synthetic tasks with controlled hierarchical ground truth and varying complexity.

Scalable task structure.
We constructed a scalable synthetic categorization task in which the number of hierarchical levels L was varied parametrically from 2 to 5. Each level introduced an additional binary latent variable, producing task environments of increasing structural complexity. The observation feature space consisted of three types of binary features. First, observation-level features (4 dimensions): a one-hot encoding representing the lowest level of the hierarchy, with the feature at index o ∈ {0, 1, 2, 3} set to 1. Second, first latent features (2 dimensions, present for L ≥ 2): both dimensions were assigned the same binary value ℓ₁ ∈ {0, 1}, encoding the first latent category level. Third, higher latent features (1 dimension per level above 2, present for L ≥ 3): each additional hierarchical level contributed one binary feature encoding that level's latent value. For a task with L levels, the total number of observable features was F = max(0, L − 2) + 2 + 4, giving F = L + 4 for L ≥ 2. The outcome label was appended as a final feature, yielding a feature matrix with dimensions (F + 1) × T, where T is the total number of trials.

Context enumeration. All possible contexts were enumerated recursively. At the base level (L = 1), four contexts corresponded to the four observation-level values. At each additional level, the context set doubled by pairing all existing sub-contexts with each of two binary values for the new latent level. This yielded 4 × 2^(L−1) contexts for a task of L levels:

    Hierarchical Levels (L)   Number of Contexts   Trials per Level (10 per context)
    2                          8                    80
    3                         16                   160
    4                         32                   320
    5                         64                   640

Table 1: Task complexity scaling with number of hierarchical levels.

Figure S2: Example feature vector encoding for 2-level task structure.

Outcome rule.
Outcomes were binary (y ∈ {0, 1}) and were determined by a conjunctive rule defined over the two highest latent levels of the hierarchy. For L = 2, the outcome was y = 1 if the first latent level value (level 2) = 0, and y = 0 otherwise. For L ≥ 3, the outcome was y = 1 if and only if both the top latent level value (level L) = 0 and the second-highest latent level value (level L − 1) = 0; otherwise y = 0. This conjunctive rule ensured that no single observation-level feature was sufficient to predict the outcome: models were required to identify and represent the relevant higher-order latent structure to achieve high outcome prediction accuracy.

Observation generation and noise. For each context, a prototype feature vector was constructed according to the encoding rules above. On each trial, a copy of the prototype was presented with a small probability of perceptual noise: with probability 0.02, one randomly selected observation-level feature dimension was bit-flipped (0 → 1 or 1 → 0). All trials across all contexts were shuffled uniformly at random prior to presentation using a seeded random number generator, ensuring independent trial orderings across seeds. Both the flat and hierarchical models received the full feature matrix as input and were evaluated on the same task instance per seed.

Outcome prediction accuracy. The primary learning metric was outcome prediction accuracy, computed as the proportion of trials on which the model's binarized outcome estimate matched the true outcome:

    Acc_outcome = (1/T) Σ_{t=1}^{T} 1[ 1[r̂_t > 0.5] = y_t ]    (15)

where r̂_t is the model's posterior mean outcome estimate at trial t and y_t ∈ {0, 1} is the true outcome. This was computed separately for the flat model (r̂_t^flat) and the hierarchical model (r̂_t^hier).
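Equation 15 amounts to thresholding the posterior mean estimate at 0.5 and scoring agreement with the true outcomes. A minimal sketch (the function name and example values are ours, not from the released code):

```python
import numpy as np

def outcome_accuracy(r_hat, y):
    """Outcome prediction accuracy (Eq. 15): fraction of trials on which the
    binarized posterior mean outcome estimate matches the true outcome."""
    r_hat = np.asarray(r_hat, dtype=float)
    y = np.asarray(y, dtype=int)
    return float(np.mean((r_hat > 0.5).astype(int) == y))

# Four trials: the estimate 0.6 is binarized to 1 but the true outcome is 0,
# so 3 of 4 trials match.
print(outcome_accuracy([0.9, 0.2, 0.6, 0.4], [1, 0, 0, 0]))  # 0.75
```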
To characterize the representational efficiency of each model, we computed two key metrics relating the model's learned cluster structure to the ground-truth categorical labels at each hierarchical level. For each trial, the cluster assignment was determined by majority vote across particles. Trials with no valid assignment (cluster id = −1) were excluded, with analyses requiring at least 10% valid coverage and a minimum of 5 valid trials.

• Number of clusters (K): the total number of distinct cluster identities present in the valid trial assignments.

• Average within-label entropy: for each ground-truth label m, the Shannon entropy of the cluster assignment distribution over trials with that label, H_m = −Σ_k p_k^m log p_k^m, where p_k^m is the proportion of label-m trials assigned to cluster k. Entropy was normalized by log K_m, where K_m is the number of distinct clusters used for label m (set to 0 if K_m = 1). The weighted average across labels (weighted by label frequency) was reported.

For the flat model, representational efficiency metrics were computed directly from the particle filter's cluster assignments. For the hierarchical model, representational efficiency metrics were computed using the same tree level that was selected for transfer, ensuring consistency between the generalization and representational efficiency assessments.

Parameter selection. Scaling task analyses were performed across 200 randomly sampled parameter combinations (α ∈ [0.1, 3.0], Ω ∈ [0.1, 3.0]), with each combination evaluated over 6 independent random seeds. For each parameter combination, we computed mean performance across its 6 seeds. Reported statistics (confidence intervals, effect sizes) were computed across these 200 parameter combinations, capturing variability across the parameter space rather than seed-to-seed variability alone.
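The frequency-weighted, normalized within-label entropy defined above can be sketched as follows (the function name and toy data are ours; the input would be the majority-vote cluster assignments described in the text):

```python
import math
from collections import Counter

def within_label_entropy(labels, clusters):
    """Average within-label entropy of cluster assignments, normalized by
    log K_m per label (0 if K_m = 1) and weighted by label frequency."""
    total, n = 0.0, len(labels)
    for m in set(labels):
        ks = [c for lab, c in zip(labels, clusters) if lab == m]
        counts = Counter(ks)
        K_m = len(counts)                      # distinct clusters used for label m
        if K_m == 1:
            H = 0.0                            # a single cluster carries no entropy
        else:
            probs = [v / len(ks) for v in counts.values()]
            H = -sum(p * math.log(p) for p in probs) / math.log(K_m)
        total += (len(ks) / n) * H             # weight by label frequency
    return total

# Label 'a' maps to one cluster (entropy 0); label 'b' splits evenly across
# two clusters (normalized entropy 1); the frequency-weighted average is 0.5.
print(within_label_entropy(["a", "a", "b", "b"], [0, 0, 1, 2]))  # 0.5
```

Lower values indicate that each ground-truth category maps onto fewer clusters, the sense in which the hierarchical model's representations are more efficient.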
The asymptotic outcome prediction accuracy was computed from the final 70% of trials. Error bars represent 95% confidence intervals unless otherwise noted.

Compression efficiency is robust

Hierarchical models showed superior compression efficiency (lower entropy) across 100% of tested parameter combinations (mean entropy advantage: −0.677, range: [−0.990, −0.155]; Figure S4, panels A-B). The negative values indicate that hierarchical models consistently achieved lower entropy than flat models, demonstrating more efficient category-to-cluster mappings regardless of parameter settings. This universal advantage indicates that hierarchical compression reflects an architectural property rather than parameter-specific tuning.

One-shot transfer generalization. To assess whether each model had learned a representation that supported generalization, we evaluated one-shot categorization transfer after task completion. For each hierarchical level ℓ ∈ {2, …, L}, the true factor at that level partitioned trials into two categories (factor value 0 or 1). A single labeled exemplar was selected as the first trial in the sequence for which the factor value was 0. The model's cluster assignment at that trial was taken as the anchor cluster. All remaining trials were then classified as belonging to the same category as the anchor (predicted category 0) if assigned to the same cluster, or to a different category (predicted category 1) otherwise. Transfer performance was quantified using three metrics that characterize different aspects of generalization.
Recall measures the proportion of true same-category trials correctly identified as such:

    Recall(ℓ) = TP / (TP + FN) = (1/|T₀|) Σ_{t ∈ T₀} 1[ĉ_t = ĉ_anchor]    (16)

where T₀ = {t : z_t^(ℓ) = 0} is the set of trials belonging to true category 0 at level ℓ, ĉ_t is the majority-voted cluster assignment of trial t across particles, ĉ_anchor is the majority-voted assignment of the labeled exemplar trial, TP denotes true positives (correct same-category identifications), and FN denotes false negatives (same-category trials misclassified as different). Precision quantifies the specificity of same-category predictions: Precision = TP / (TP + FP), where FP denotes false positives (different-category trials misclassified as same). F1 score is the harmonic mean of precision and recall, balancing both aspects: F1 = 2 × (Precision × Recall) / (Precision + Recall).

We report recall as the primary metric because it directly quantifies the model's generalization ability (the capacity to recognize same-category instances across contexts), independent of overgeneralization tendencies. Category membership is balanced in our task design, making recall directly interpretable. Critically, recall is highly correlated with F1 score across all parameter combinations (Table 2 note), confirming that conclusions are robust to metric choice. Precision and F1 provide complementary information about the precision-recall tradeoff.

For the hierarchical model, cluster assignments were extracted from the tree path at a target depth corresponding to the level being evaluated (depth = ℓ − 1). A fallback procedure was applied: if fewer than 10% of trials had valid (non-negative) node assignments at the target depth, the model fell back to the next shallower level, continuing until a sufficient level was found or no valid level remained. Table 2 presents transfer performance across all metrics and task complexities.
Table 2: Transfer generalization metrics by model and task complexity

Level  Model         Recall         Precision      F1
2      Flat          0.897 (0.147)  0.995 (0.025)  0.918 (0.117)
       Hierarchical  0.893 (0.128)  0.656 (0.097)  0.707 (0.106)
3      Flat          0.445 (0.064)  0.661 (0.132)  0.507 (0.064)
       Hierarchical  0.655 (0.137)  0.579 (0.091)  0.541 (0.080)
4      Flat          0.329 (0.098)  0.517 (0.076)  0.368 (0.080)
       Hierarchical  0.576 (0.186)  0.601 (0.077)  0.468 (0.114)
5      Flat          0.355 (0.085)  0.533 (0.077)  0.397 (0.068)
       Hierarchical  0.621 (0.186)  0.590 (0.076)  0.507 (0.110)

Mean (SD) across 200 parameter combinations (6 seeds each). At 3-5 levels, hierarchical models achieve substantially higher recall and F1 scores, indicating superior generalization with a balanced precision-recall tradeoff. Recall and F1 are highly correlated across parameter combinations (3-level: r = 0.85; 4-level: r = 0.95; 5-level: r = 0.95).

Transfer advantage holds across most of the parameter space. Hierarchical models achieved superior transfer performance across 94% of parameter combinations (Supplementary Figure S4, panels C-D). These results confirm that the hierarchical transfer advantage is robust across nearly all reasonable parameter settings.

Figure S3: Transfer advantage parameters. Parameter sweep over both parameters in the range 0.1-3.0, with 200 randomly sampled parameter values. (A) Parameter heatmaps of task accuracy and average transfer advantage; stars mark the parameter regimes that conferred the greatest accuracy or recall advantage, respectively. (B) Averaged learning curves for each level of task complexity over 50 randomly sampled runs; models reach equivalent asymptotic performance. (C) Transfer accuracy at all levels tested; brighter colors show higher accuracy. (D) Transfer performance is negatively correlated with task accuracy for both models across all levels, possibly because transfer encourages generalization, which may inhibit early learning in the task.

Figure S4: (A) Compression efficiency (entropy difference: hierarchical minus flat) as a function of the concentration parameter $\alpha$. Each point represents one parameter combination (n = 100). Negative values indicate that hierarchical models achieve lower entropy (better compression); hierarchical models showed superior compression across 100% of tested combinations. (B) Compression efficiency as a function of the stickiness parameter $\Omega$; the hierarchical advantage was universal across all $\Omega$ values. (C) Transfer performance (transfer difference: hierarchical minus flat) as a function of $\alpha$. Positive values indicate that hierarchical models achieve higher transfer accuracy; hierarchical models outperformed flat models across 94% of parameter combinations. (D) Transfer performance as a function of $\Omega$; the hierarchical advantage increased with higher prior and stickiness, reflecting the benefits of stable hierarchical structure.

Switching task structure. We constructed a nested temporal task requiring discovery of structure across two timescales. The task environment alternated between two rule contexts every 100 trials (slow timescale), with each rule context further subdivided into alternating value contexts every 12 trials (fast timescale). The observation feature space consisted of two binary dimensions: a shape feature (1 dimension) encoding whether the stimulus was a circle ($s = 0$) or a triangle ($s = 1$), and a texture feature (1 dimension) encoding whether the stimulus had stripes ($x = 0$) or dots ($x = 1$). The outcome label was appended as a final feature, yielding a feature matrix of dimensions $(3 \times T)$, where $T$ is the total number of trials.

Context structure. The task comprised four slow rule contexts (two shape-rule, two texture-rule) that alternated every 50 trials.
In shape-rule contexts, outcomes depended on the shape feature while texture was irrelevant. In texture-rule contexts, outcomes depended on the texture feature while shape was irrelevant. Within each slow context, the rewarded feature value alternated every 12 trials: in shape-rule contexts, circles were rewarded, then triangles, then circles again; in texture-rule contexts, stripes were rewarded, then dots, then stripes again. All four stimulus combinations (circle+stripes, circle+dots, triangle+stripes, triangle+dots) were presented in every 12-trial block with equal frequency. This design ensured that individual stimulus combinations carried no predictive information: for example, circle+dots was rewarded 50% of the time overall, with outcomes determined by the current position in the nested temporal structure.

Outcome rule. Outcomes were binary ($y \in \{0, 1\}$) and were determined by the conjunction of the current slow context (which dimension matters) and fast context (which value is rewarded). In shape-rule contexts with circles rewarded, $y = 1$ if $s = 0$, otherwise $y = 0$. In shape-rule contexts with triangles rewarded, $y = 1$ if $s = 1$, otherwise $y = 0$. Texture-rule contexts were defined analogously on the texture feature $x$.

Observation generation and noise. Within each fast context block, one trial of each of the four stimulus combinations was presented in random order. With probability 0.02, the outcome on any trial was bit-flipped ($0 \to 1$ or $1 \to 0$), introducing stochastic noise. The full sequence of 200 trials (4 slow contexts × 50 trials each) was generated using a seeded random number generator. Both the flat and hierarchical models received the same feature matrix as input and were evaluated on identical task instances per seed.
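The generative procedure above can be sketched as follows. This is an illustrative reconstruction from the description, not the authors' code: the function name and the exact layout of stimulus repetitions within a fast block are our assumptions.

```python
import numpy as np

def generate_switching_task(n_trials=200, slow_block=50, fast_block=12,
                            noise_p=0.02, seed=0):
    """Nested switching task: slow contexts set the relevant dimension
    (shape vs. texture); fast contexts set which value is rewarded."""
    rng = np.random.default_rng(seed)
    # The four stimulus combinations: (shape s, texture x).
    combos = [(0, 0), (0, 1), (1, 0), (1, 1)]
    features, outcomes = [], []
    t = 0
    while t < n_trials:
        # Present each combination once, in random order.
        block = [combos[i] for i in rng.permutation(4)]
        for s, x in block:
            if t >= n_trials:
                break
            rule_dim = (t // slow_block) % 2        # 0: shape rule, 1: texture
            fast_idx = (t % slow_block) // fast_block
            rewarded_value = fast_idx % 2           # alternates each fast block
            relevant = s if rule_dim == 0 else x
            y = int(relevant == rewarded_value)
            if rng.random() < noise_p:              # stochastic bit-flip noise
                y = 1 - y
            features.append([s, x, y])
            outcomes.append(y)
            t += 1
    # Feature matrix of shape (3, T): shape, texture, appended outcome label.
    return np.array(features).T, np.array(outcomes)
```

With the noise probability set to zero, the outcome is a deterministic function of the trial's position in the nested block structure, which makes the "no single stimulus is predictive" property easy to verify empirically.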
Outcome prediction accuracy. The primary learning metric was outcome prediction accuracy, computed as the proportion of trials on which the model's binarized outcome estimate matched the true outcome:

$$\mathrm{Acc}_{\mathrm{outcome}} = \frac{1}{T} \sum_{t=1}^{T} \mathbb{1}\left[\mathbb{1}[\hat{r}_t > 0.5] = y_t\right] \quad (17)$$

where $\hat{r}_t$ is the model's posterior mean outcome estimate at trial $t$ and $y_t \in \{0, 1\}$ is the true outcome. Main-text figures were generated by running 150 random runs using the parameters that yielded the best average total accuracy in a sweep of 50 parameter settings run with 6 seeds each.

Efficiency. To quantify representational efficiency, we measured within-state cluster entropy: how concentrated the trials from each of the 4 true latent states were across clusters. For each latent state $m$, we computed the entropy of cluster assignments:

$$H_m = -\sum_k p_{mk} \log p_{mk} \quad (18)$$

where $p_{mk}$ is the proportion of state-$m$ trials assigned to cluster $k$ by majority vote across particles. Lower entropy indicates more specialized representations: trials from a given state concentrate in fewer clusters. We then averaged across states, weighted by the number of trials in each state:

$$H_{\mathrm{avg}} = \frac{1}{T} \sum_m n_m H_m \quad (19)$$

where $n_m$ is the number of trials in state $m$ and $T = \sum_m n_m$ is the total number of trials.

Figure S5: Example learning trace across 400 trials of the context learning task (100-trial blocks). Shading reflects the rule context and dashed lines mark the switches between sub-contexts. The bolded line shows the hierarchical model's trace.

For the hierarchical model, we analyzed the deepest tree level with substantial posterior mass, defined as the deepest level at which at least 1% of trials were assigned by the particle filter. This threshold ensures we measure the most refined compositional structure the model actively uses rather than shallower or unused levels. We also quantified representational complexity as the average number of clusters (or nodes) used per latent state.
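The entropy computation of Eqs. (18)-(19) can be sketched as follows (a minimal illustration; the function and variable names are ours):

```python
import numpy as np

def within_state_entropy(states, clusters):
    """Trial-weighted within-state cluster entropy, Eqs. (18)-(19).

    states:   (T,) true latent state per trial
    clusters: (T,) majority-voted cluster assignment per trial
    """
    T = len(states)
    h_avg = 0.0
    for m in np.unique(states):
        members = clusters[states == m]
        n_m = len(members)
        # p_mk: proportion of state-m trials falling in each cluster k
        _, counts = np.unique(members, return_counts=True)
        p = counts / n_m
        h_m = -np.sum(p * np.log(p))   # Eq. (18)
        h_avg += (n_m / T) * h_m       # Eq. (19), accumulated per state
    return h_avg
```

A perfectly specialized representation (each state mapped to a single cluster) yields $H_{\mathrm{avg}} = 0$; splitting a state evenly across two clusters contributes $\log 2$, scaled by that state's share of trials.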
