Bayesian Structural Inference for Hidden Processes
Santa Fe Institute Working Paper 13-09-027
arXiv:1309.1392 [stat.ML]

Christopher C. Strelioff^{1,*} and James P. Crutchfield^{1,2,†}

^1 Complexity Sciences Center and Physics Department, University of California at Davis, One Shields Avenue, Davis, CA 95616
^2 Santa Fe Institute, 1399 Hyde Park Road, Santa Fe, NM 87501
(Dated: December 10, 2013)
* strelioff@ucdavis.edu
† chaos@ucdavis.edu

We introduce a Bayesian approach to discovering patterns in structurally complex processes. The proposed method of Bayesian Structural Inference (BSI) relies on a set of candidate unifilar hidden Markov model (uHMM) topologies for inference of process structure from a data series. We employ a recently developed exact enumeration of topological ε-machines. (A sequel then removes the topological restriction.) This subset of the uHMM topologies has the added benefit that inferred models are guaranteed to be ε-machines, irrespective of estimated transition probabilities. Properties of ε-machines and uHMMs allow for the derivation of analytic expressions for estimating transition probabilities, inferring start states, and comparing the posterior probability of candidate model topologies, despite process internal structure being only indirectly present in data. We demonstrate BSI's effectiveness in estimating a process's randomness, as reflected by the Shannon entropy rate, and its structure, as quantified by the statistical complexity. We also compare using the posterior distribution over candidate models and the single, maximum a posteriori model for point estimation and show that the former more accurately reflects uncertainty in estimated values. We apply BSI to in-class examples of finite- and infinite-order Markov processes, as well as to an out-of-class, infinite-state hidden process.

Keywords: stochastic process, hidden Markov model, ε-machine, causal states
PACS numbers: 02.50.-r 89.70.+c 05.45.Tp 02.50.Ey 02.50.Ga

I. INTRODUCTION

Emergent patterns are a hallmark of complex, adaptive behavior, whether exhibited by natural or designed systems. Practically, discovering and quantifying the structures making up emergent patterns from a sequence of observations lies at the heart of our ability to understand, predict, and control the world. But, what are the statistical signatures of structure? A common modeling assumption is that observations are independent and identically distributed (IID). This is tantamount, though, to assuming a system is structureless. And so, pattern discovery depends critically on testing when the IID assumption is violated. Said more directly, successful pattern discovery extracts the (typically hidden) mechanisms that create departures from IID structurelessness. In many applications, the search for structure is made all the more challenging by limited available data. The very real consequences, when pattern discovery is done incorrectly with finite data, are that structure can be mistaken for randomness and randomness for structure.

Here, we develop an approach to pattern discovery that removes these confusions, focusing on data series consisting of a sequence of symbols from a finite alphabet. That is, we wish to discover temporal patterns, as they occur in discrete-time and discrete-state time series. (The approach also applies to spatial data exhibiting one-dimensional patterns.)
Inferring structure from data series of this type is integral to many fields of science, ranging from bioinformatics [1, 2], dynamical systems [3–6], and linguistics [7, 8] to single-molecule spectroscopy [9, 10], neuroscience [11, 12], and crystallography [13, 14]. Inferred structure assumes a meaning distinctive to each field. For example, in single-molecule dynamics structure reflects stable molecular configurations, as well as the rates and types of transition between them. In the study of coarse-grained dynamical systems and linguistics, structure often reflects forbidden words and relative frequencies of symbolic strings that make the language or dynamical system functional. Thus, the results of successful pattern discovery teach one much more about a process than models that are only highly predictive.

Our goal is to infer structure using a finite data sample from some process of interest and a set of candidate ε-machine model topologies. This choice of model class is made because ε-machines provide optimal prediction as well as being a minimal and unique representation [15]. In addition, given an ε-machine, structure and randomness can be quantified using the statistical complexity $C_\mu$ and Shannon entropy rate $h_\mu$. Previous efforts to infer ε-machines from finite data include subtree merging (SM) [16], ε-machine spectral reconstruction (εMSR) [17], and causal-state splitting reconstruction (CSSR) [18, 19]. These methods produce a single, best estimate of the appropriate ε-machine given the available data.

The following develops a distinctively different approach to the problem of structural inference—Bayesian Structural Inference (BSI). BSI requires a data series D and a set of candidate unifilar hidden Markov model (uHMM) topologies, which we denote M. However, for our present goal of introducing BSI, we consider only a subset of unifilar hidden Markov models—the topological ε-machines—that are guaranteed to be ε-machines irrespective of estimated transition probabilities [20]. Unlike the inference methods cited above, BSI's output is not a single best estimate. Instead, BSI determines the posterior probability of each model topology conditioned on D and M. One result is that many model topologies are viable candidates for a given data set. The shorter the data series, the more prominent this effect becomes. We argue, in this light, that the most careful approach to structural inference and estimation is to use the complete set of model topologies according to their posterior probability. Another consequence, familiar in a Bayesian setting, is that principled estimates of uncertainty—including uncertainty in model topology—can be straightforwardly obtained from the posterior distribution.

The methods developed here draw from several fields, ranging from computational mechanics [15] and dynamical systems [21–23] to methods of Bayesian statistical inference [24]. As a result, elements of the following will be unfamiliar to some readers. To create a bridge, we provide an informal overview of foundational concepts in Sec. II before moving to BSI's technical details in Sec. III.

II. PROCESS STRUCTURE, MODEL TOPOLOGIES, AND FINITE DATA

To start, we offer a nontechnical introduction to structural inference to be clear how we distinguish (i) a process and its inherent structure from (ii) model topology and these from (iii) sampled data series.
A process represents all possible behaviors of a system of interest. It is the object of our focus. Saying that we infer structure means we want to find the process's organization—the internal mechanisms that generate its observed behavior. However, in any empirical setting we only have samples of the process's behavior in the form of finite data series. A data series necessarily provides an incomplete picture of the process due to the finite nature of the observation. Finally, we use a model or, more precisely, a model topology to express the process's structure. The model topology—the set of states and transitions, their connections and observed output symbols—explicitly represents the process's structure. Typically, there are many model topologies that accurately describe the probabilistic structure of a given process. ε-Machines are special within the set of accurate models, however, in that they are the model topology that provides the unique and minimal representation of process structure.

To ground this further, let's graphically survey different model topologies and consider what processes they represent and how they generate finite data samples. Figure 1 shows models with one or two states that generate binary processes—observed behavior is a sequence of 0s and 1s. For example, the smallest model topology is shown in Fig. 1(a) and represents the IID binary process. This model generates data by starting in state A and outputs a 0 with probability p and a 1 with probability 1 − p, always returning to state A.

A more complex model topology, shown in Fig. 1(g), has two states and four edges. In this case, when the model is in state A it generates a 0 with probability p and returns to state A, or it generates a 1 with probability 1 − p and moves to state B. When in state B, a 0 is generated with probability q and a 1 with probability 1 − q, moving to state A in both cases. If p ≠ q this model topology represents a unique, structured process. However, if p = q the probability of generating a 0 or 1 does not depend on states A and B and the resulting process is IID. Thus, this model topology with p = q becomes an overly verbose representation of the IID process, which requires only a single state—the topology of Fig. 1(a). This setting of the transition probabilities is an example where a model topology describes the probabilistic behavior of a process, but does not reflect the structure. In fact, the model topology in Fig. 1(g) is not an ε-machine when p = q. Rather, the process structure is properly represented by Fig. 1(a), which is.
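To make the generative reading of Fig. 1(g) concrete, the following Python sketch simulates the two-state topology just described. The edge structure comes from the text above; the particular values of p and q, the fixed start state, and the seed are illustrative assumptions only.

```python
import random

# Edges of the topology in Fig. 1(g): from A, emit 0 (prob p) and stay,
# or emit 1 (prob 1 - p) and move to B; from B, emit 0 (prob q) or
# 1 (prob 1 - q), returning to A in both cases.

def generate_fig1g(T, p=0.3, q=0.8, start="A", seed=42):
    """Generate T symbols from the two-state uHMM of Fig. 1(g)."""
    rng = random.Random(seed)
    state, symbols = start, []
    for _ in range(T):
        if state == "A":
            if rng.random() < p:
                symbols.append("0")                # A --p|0--> A
            else:
                symbols.append("1"); state = "B"   # A --(1-p)|1--> B
        else:
            symbols.append("0" if rng.random() < q else "1")
            state = "A"                            # both edges from B lead to A
    return "".join(symbols)

print(generate_fig1g(20))  # a length-20 symbol string; with p == q the output is IID
```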
This example and other cases where specific model topologies are not minimal and unique representations of a process's structure motivate identifying a subclass of model topologies. All model topologies in Fig. 1 are unifilar hidden Markov models (defined shortly). However, the six model topologies with two states and four edges, Fig. 1(g-i, l-n), are not minimal when p = q. As with the previous example, they all become overly complex representations of the IID process for this parameter setting. Excluding these uHMMs leaves a subset of topologies called topological ε-machines, Fig. 1(a-f, j-k), that are guaranteed to be minimal and unique representations of process structure for any transition probability setting, other than 0 or 1.

Partly to emphasize the role of process structure and partly to simplify technicalities, in this first introduction to BSI we only consider topological ε-machines. A sequel lifts this restriction, adapting BSI to work with all ε-machines.

FIG. 1. All binary, unifilar hidden Markov model topologies with one or two states. Each topology, designated (a) through (n), also has a unique label that provides the number of states n = 1, 2, the alphabet size k = 2, and a unique id that comes from the algorithm used to enumerate all possible model topologies [20]. Model edges are labeled with a transition probability and output symbol using the format: probability | symbol.

In this way, we see how a process's structure is expressed in model topology and how possible ambiguities arise. This is the forward problem of statistical inference. Now, consider the complementary inverse problem: Given an observed data series, find the model topology that most effectively describes the unknown process structure. In a Bayesian setting, the first step is to identify those model topologies that can generate the observed data. As just discussed, we do this by choosing a specific model topology and start state and attempting to trace the hidden-state path through the model, using the observed symbols to determine the edges to follow. If there is a path for at least one start state, the model topology is a viable candidate. This process is repeated for each model topology in a specified set, such as that displayed in Fig. 1. The procedure that lists, and tests, model topologies in a set of candidates we call enumeration.

To clarify the procedure for tracing hidden-state paths let's consider a specific example of observed data consisting of the short binary sequence:

$$ 11101100111101111001 . \qquad (1) $$

If tested against each candidate in Fig. 1, eight of the fourteen model topologies are possible: (a, e, g-i, l-n). For example, using Fig. 1(i) and starting in state A, the observed data is generated by the hidden-state path:

$$ ABABBABBBABABBABABBBA . \qquad (2) $$

One way to describe this path—one that is central to statistical estimation—is to count the number of times each edge in the model was traversed. Using $n(\sigma x | \sigma_0)$ to denote the number of times that symbol x is generated using an edge from state σ given that the sequence starts in state $\sigma_0$, we obtain: $n(A0|A) = 0$, $n(A1|A) = 7$, $n(B0|A) = 6$, and $n(B1|A) = 7$, again assuming $\sigma_0 = A$. Similar paths and sets of edge counts are found for the eight viable topologies cited above. These counts are the basis for estimating a topology's transition and start-state probabilities. From these, one can then calculate the probability that each model topology produced the observed data series—each candidate's posterior probability.
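The path-tracing and edge-counting procedure is mechanical enough to state as code. The sketch below assumes the edge structure of Fig. 1(i) implied by the worked example (the destination of the unused A-on-0 edge is our assumption); it returns None when the data cannot be generated from the given start state, which is exactly the viability test used during enumeration.

```python
def trace_path(data, edges, start):
    """Follow `data` through a uHMM; return (state path, edge counts) or None.

    edges: dict mapping (state, symbol) -> next state. Unifilarity means
    each (state, symbol) pair has at most one outgoing edge, so the path
    is unique once a start state is assumed.
    """
    state, path, counts = start, [start], {}
    for x in data:
        if (state, x) not in edges:
            return None                      # data disallowed from this start state
        counts[(state, x)] = counts.get((state, x), 0) + 1
        state = edges[(state, x)]
        path.append(state)
    return path, counts

# Topology of Fig. 1(i), as implied by the worked example; the unused
# (A, '0') edge's destination is an assumption for completeness.
edges_i = {("A", "0"): "A", ("A", "1"): "B", ("B", "0"): "B", ("B", "1"): "A"}

path, counts = trace_path("11101100111101111001", edges_i, "A")
print("".join(path))  # ABABBABBBABABBABABBBA, matching Eq. (2)
print(counts)         # ('A','1'): 7, ('B','1'): 7, ('B','0'): 6, as in the text
```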
By way of outlining what is to follow, let's formalize the procedure just sketched in terms of the primary goal of estimating candidates' posterior probabilities. First, Sec. III recapitulates what is known about the space of structured processes, reviewing how they are represented as ε-machines and how topological ε-machines are exactly enumerated. Then, Sec. IV adapts Bayesian inference methods to this model class, analyzing transition probability and start state estimation for a single, known topology. Next, setting the context for comparing model topologies, it explores the organization of the prior over the set M of candidate models. Section IV closes with a discussion of how to estimate various process statistics from functions of model parameters. Finally, Sec. V applies BSI to a series of increasingly complex processes: (i) a finite-order Markov process, (ii) an infinite-order Markov process, and, finally, (iii) an infinite-memory process. Each illustrates BSI's effectiveness by emphasizing its ability to accurately estimate a process's stored information (statistical complexity $C_\mu$) and randomness (Shannon entropy rate $h_\mu$).

III. STRUCTURED PROCESSES

We describe a system of interest in terms of its observed behavior, following the approach of computational mechanics, as reviewed in [15]. Again, a process is the collection of behaviors that the system produces. A process's probabilistic description is a bi-infinite chain of random variables, denoted by capital letters $\ldots X_{t-2} X_{t-1} X_t X_{t+1} X_{t+2} \ldots$. A realization is indicated by lowercase letters $\ldots x_{t-2} x_{t-1} x_t x_{t+1} x_{t+2} \ldots$. We assume the value $x_t$ belongs to a discrete alphabet $\mathcal{X}$. We work with blocks $X_{t:t'} = X_t \ldots X_{t'-1}$, where the first index is inclusive and the second exclusive.

ε-Machines were originally defined in terms of prediction, in the so-called history formulation [15, 16]. Given a past realization $x_{-\infty:t} = \ldots x_{t-2} x_{t-1}$ and future random variables $X_{t:\infty} = X_t X_{t+1} \ldots$, the conditional distributions $P(X_{t:\infty} | x_{-\infty:t})$ define the predictive equivalence relation over pasts:

$$ x_{-\infty:t} \sim x'_{-\infty:t'} \iff P(X_{t:\infty} | x_{-\infty:t}) = P(X_{t':\infty} | x'_{-\infty:t'}) . \qquad (3) $$

Within the history formulation, a process determines the ε-machine topology through ∼: The causal states $\mathcal{S}$ are its equivalence classes and these, in turn, induce state-transition dynamics [15]. This way of connecting a process and its ε-machine influenced previous approaches to structural inference [16, 19, 25].

The ε-machine generator formulation, an alternative, was motivated by the problem of synchronization [26, 27]. There, an ε-machine topology defines the process that can be generated by it. Recently, the generator and history formulations were proven to be equivalent [28]. Although the history view is sometimes more intuitive, the generator view is useful in a variety of applications, especially the approach to structural inference developed here. Following [26–28], we start with four definitions that delineate the model classes relevant for temporal pattern discovery.

Definition 1. A finite-state, edge-labeled hidden Markov model (HMM) consists of:

1. A finite set of hidden states $\mathcal{S} = \{\sigma_1, \ldots, \sigma_N\}$.
2. A finite output alphabet $\mathcal{X}$.
3. A set of $N \times N$ symbol-labeled transition matrices $T^{(x)}$, $x \in \mathcal{X}$, where $T^{(x)}_{i,j}$ is the probability of transitioning from state $\sigma_i$ to state $\sigma_j$ and emitting symbol x. The corresponding overall state-to-state transition matrix is denoted $T = \sum_{x \in \mathcal{X}} T^{(x)}$.

Definition 2. A finite-state, edge-labeled, unifilar HMM (uHMM) is a finite-state, edge-labeled HMM with the following property:

• Unifilarity: For each state $\sigma_i \in \mathcal{S}$ and each symbol $x \in \mathcal{X}$ there is at most one outgoing edge from state $\sigma_i$ that outputs symbol x.
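Definitions 1 and 2 translate directly into a small data structure. As a non-authoritative illustration, the following Python/NumPy sketch stores the symbol-labeled matrices $T^{(x)}$ and checks unifilarity; the example matrices encode the Even Process that appears later in Fig. 3.

```python
import numpy as np

# Symbol-labeled transition matrices T^(x) (Def. 1), here for the Even
# Process of Fig. 3 with p(0|A) = 1/2: A --1/2|0--> A, A --1/2|1--> B,
# B --1|1--> A. States are indexed [A, B].
T = {
    "0": np.array([[0.5, 0.0], [0.0, 0.0]]),
    "1": np.array([[0.0, 0.5], [1.0, 0.0]]),
}

def is_valid_hmm(T):
    """The overall matrix T = sum_x T^(x) must be row-stochastic."""
    total = sum(T.values())
    return np.allclose(total.sum(axis=1), 1.0)

def is_unifilar(T):
    """Def. 2: for each state and each symbol, at most one outgoing edge."""
    return all((Tx > 0).sum(axis=1).max() <= 1 for Tx in T.values())

print(is_valid_hmm(T), is_unifilar(T))  # True True
```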
Definition 3. A finite-state ε-machine is a uHMM with the following property:

• Probabilistically distinct states: For each pair of distinct states $\sigma_k, \sigma_j \in \mathcal{S}$ there exists some finite word $w = x_0 x_1 \ldots x_{L-1}$ such that: $P(w | \sigma_0 = \sigma_k) \neq P(w | \sigma_0 = \sigma_j)$.

Definition 4. A topological ε-machine is a finite-state ε-machine where the transition probabilities for leaving each state are equal for all outgoing edges.

These definitions provide a hierarchy in the model topologies to be considered. The most general set (Def. 1) consists of finite-state, edge-labeled HMM topologies with few restrictions. These are similar to models employed in many machine learning and bioinformatics applications; see, e.g., [1]. Using Def. 2, the class of HMMs is further restricted to be unifilar. The inference methods developed here apply to all model topologies in this class, as well as all more restricted subclasses. As a point of reference, Fig. 1 shows all binary, full-alphabet (able to generate both 0s and 1s) uHMM topologies with one or two states. If all states in the model are probabilistically distinct, following Def. 3, these model topologies are also valid generator ε-machines. Whether a uHMM is also a valid ε-machine often depends on the specific transition probabilities for the machine; see Sec. II for an example. This dependence motivates the final restriction to topological ε-machines (Def. 4), which are guaranteed to be minimal even if transition probabilities are equal.

TABLE I. Size $F_{n,2}$ of the enumerated library of full-alphabet, binary topological ε-machines from one to five states. Reproduced with permission from [20].

    States n    ε-Machines F_{n,2}
    1                    1
    2                    7
    3                   78
    4                1,388
    5               35,186

Here, we employ the set of topological ε-machines for structural inference. Although specific settings of the transition probabilities are used to define the set of allowed model topologies, this does not affect the actual inference procedure. For example, in Fig. 1 only (a-f, j-k) are topological ε-machines. However, the set of topological ε-machines does exclude a variety of model topologies that might be useful for general time-series inference. For example, when Def. 4 is applied, all processes with full support (all words allowed) reduce to a single-state model. However, broadening the class of topologies beyond the set considered here is straightforward and so we address extending the present methods to them in a sequel. The net result emphasizes structure arising from the distribution's support and guarantees that inferred models can be interpreted as valid ε-machines. And, the goal is to present BSI's essential ideas for one class of structured processes—the topological ε-machines.

The set of topological ε-machines can be exactly and efficiently enumerated [20], motivating the use of this model class as our first example application of BSI. Table I lists the number $F_{n,k}$ of full-alphabet topologies with $n = 1, \ldots, 5$ states and alphabet size $k = 2$. Compare this table with the model topologies in Fig. 1, where all $n = 1$ and $n = 2$ uHMMs are shown. Only Fig. 1(a-f, j-k) are topological ε-machines, accounting for the difference between the eight models in the table above and the fourteen in Fig. 1. For comparison, the library has been enumerated up to eight states, containing approximately $2 \times 10^9$ distinct topologies.
However, for the examples to follow we employ all 36,660 binary model topologies up to and including five states as the candidate basis for structural inference.

IV. BAYESIAN INFERENCE

Previously, we developed methods for kth-order Markov chains to infer models of discrete stochastic processes and coarse-grained continuous chaotic dynamical systems [6, 29]. There, we demonstrated that correct models for in-class data sources could be effectively and parsimoniously estimated. In addition, we showed that the hidden-state nature of out-of-class data sources could be extracted via model comparison between Markov orders as a function of data series length. Notably, we also found that the entropy rate can be accurately estimated, even when out-of-class data was considered.

The following extends the Markov chain methods to the topologically richer model class of unifilar hidden Markov models. The starting point depends on the unifilar nature of the HMM topologies considered here (Def. 2)—transitions from each state have a unique emitted symbol and destination state. As we demonstrated in Sec. II, unifilarity also means that, given an assumed start state, an observed data series corresponds to at most one path through the hidden states. The ability to directly connect observed data and hidden-state paths is not possible in the more general class of HMMs (Def. 1) because they can have many, often exponentially many, possible hidden paths for a single observed data series. In contrast, as a result of unifilarity, our analytic methods previously developed for "nonhidden" Markov chains [29] can be applied to infer uHMMs and ε-machines by adding a latent (hidden) variable for the unknown start state.

We note in passing that for the more general class of HMMs, including nonunifilar topologies, there are two approaches to statistical inference. The first is to convert them to a uHMM (if possible), using mixed states [30, 31]. The second is to use more conventional computational methods, such as Baum-Welch [32].

Setting aside these alternatives for now, we formalize the connection between observed data series and a candidate uHMM topology discussed in Sec. II. We assume that a data series $D_{0:T} = x_0 x_1 \ldots x_{T-2} x_{T-1}$ of length T has been obtained from the process of interest, with $x_t$ taking values in a discrete alphabet $\mathcal{X}$. When a specific model topology and start state are assumed, a hidden-state sequence corresponding to the observed data can sometimes, but not always, be found. We denote a hidden state at time t as $\sigma_t$ and a hidden-state sequence corresponding to $D_{0:T}$ as $S_{0:T+1} = \sigma_0 \sigma_1 \ldots \sigma_{T-1} \sigma_T$. Note that the state sequence is longer than the observed data series since the start and final states are included. Using this notation, an observed symbol $x_t$ is emitted when transitioning from state $\sigma_t$ to $\sigma_{t+1}$. For example, using the observed data in Eq. (1), a hidden-state path corresponding to Eq. (2) can be obtained by assuming topology Fig. 1(i) and start state A.

We can now write out the probability of an observed data series. We assume a stationary uHMM topology $M_i$ with a set of hidden states $\sigma_i \in \mathcal{S}_i$. We add the subscript i to make it clear that we are analyzing a set of distinct, enumerated model topologies. As demonstrated in the example from Sec. II,
edge counts $n(\sigma_i x | \sigma_{i,0})$ are obtained by tracing the hidden-state path given an assumed start state $\sigma_{i,0}$. Putting this all together, the probability of observed data $D_{0:T}$ and corresponding state path $S_{0:T+1}$ is:

$$ P(S_{0:T+1}, D_{0:T}) = p(\sigma_{i,0}) \prod_{\sigma_i \in \mathcal{S}_i} \prod_{x \in \mathcal{X}} p(x|\sigma_i)^{n(\sigma_i x | \sigma_{i,0})} . \qquad (4) $$

A slight manipulation of Eq. (4) lets us write the probability of observed data and hidden dynamics, given an assumed start state $\sigma_{i,0}$, as:

$$ P(S_{0:T+1}, D_{0:T} | \sigma_{i,0}) = \prod_{\sigma_i \in \mathcal{S}_i} \prod_{x \in \mathcal{X}} p(x|\sigma_i)^{n(\sigma_i x | \sigma_{i,0})} . \qquad (5) $$

The development of Eq. (5) and the simple example provided in Sec. II lay the groundwork for our application of Bayesian methods. That is, given topology $M_i$ and start state $\sigma_{i,0}$, the probability of observed data $D_{0:T}$ and hidden dynamics $S_{0:T+1}$ can be calculated. For the purposes of inference, the combination of observed and hidden sequences is our data $D = (D_{0:T}, S_{0:T+1})$.

A. Inferring Transition Probabilities

The first step is to infer transition probabilities for a single uHMM or topological ε-machine $M_i$. As noted above, we must assume a start state $\sigma_{i,0}$ so that edge counts $n(\sigma_i x | \sigma_{i,0})$ can be obtained from $D_{0:T}$. This requirement means that the inferred transition probabilities also depend on the assumed start state. At a later stage, when comparing model topologies, we demonstrate that the uncertainty in start state can be averaged over.

The set $\{\theta_i\}$ of parameters to estimate consists of those transition probabilities defined to be neither one nor zero by the assumed topology: $\theta_i = \{0 < p(x|\sigma_i, \sigma_{i,0}) < 1 : \sigma_i \in \mathcal{S}^*_i, \sigma_{i,0} \in \mathcal{S}_i\}$, where $\mathcal{S}^*_i \subseteq \mathcal{S}_i$ is the subset of hidden states with more than one outgoing edge. The resulting likelihood follows directly from Eq. (5):

$$ P(D | \theta_i, \sigma_{i,0}, M_i) = \prod_{\sigma_i \in \mathcal{S}_i} \prod_{x \in \mathcal{X}} p(x|\sigma_i, \sigma_{i,0})^{n(\sigma_i x | \sigma_{i,0})} . \qquad (6) $$

We note that the set of transition probabilities used in the above expression is unknown when doing statistical inference. However, we can still write the probability of the observed data given a setting for these unknown values, as indicated by the notation for the likelihood: $P(D | \theta_i, \sigma_{i,0}, M_i)$. Although not made explicit above, there is also a possibility that the likelihood vanishes for some, or all, start states if the observed data is not compatible with the topology. For example, if we attempt to use Fig. 1(d) for the observed data in Eq. (1) we find that neither $\sigma_{i,0} = A$ nor $\sigma_{i,0} = B$ leads to a viable path for the observed data, resulting in zero likelihood. For later use, we denote the number of times a hidden state is visited by $n(\sigma_i \bullet | \sigma_{i,0}) = \sum_{x \in \mathcal{X}} n(\sigma_i x | \sigma_{i,0})$.

Equation (6) exposes the Markov nature of the dynamics on the hidden states and suggests adapting the methods we previously developed for Markov chains [29]. Said simply, states that corresponded there to histories of length k for Markov chain models are replaced by a hidden state $\sigma_i$. Mirroring the earlier approach, we employ a conjugate prior for transition probabilities. This choice means that the posterior distribution has the same form as the prior, but with modified parameters.
In the present case, the conjugate prior is a product of Dirichlet distributions:

$$ P(\theta_i | \sigma_{i,0}, M_i) = \prod_{\sigma_i \in \mathcal{S}^*_i} \left\{ \frac{\Gamma(\alpha(\sigma_i \bullet | \sigma_{i,0}))}{\prod_{x \in \mathcal{X}} \Gamma(\alpha(\sigma_i x | \sigma_{i,0}))} \, \delta\!\left(1 - \sum_{x \in \mathcal{X}} p(x|\sigma_i, \sigma_{i,0})\right) \prod_{x \in \mathcal{X}} p(x|\sigma_i, \sigma_{i,0})^{\alpha(\sigma_i x | \sigma_{i,0}) - 1} \right\} , \qquad (7) $$

where $\alpha(\sigma_i \bullet | \sigma_{i,0}) = \sum_{x \in \mathcal{X}} \alpha(\sigma_i x | \sigma_{i,0})$. In the examples to follow we take $\alpha(\sigma_i x | \sigma_{i,0}) = 1$ for all parameters of the prior. This results in a uniform density over the simplex for all transition probabilities to be inferred, irrespective of start state [33].

The product of Dirichlet distributions includes transition probabilities only from hidden states in $\mathcal{S}^*_i$ because these states have more than one outgoing edge. For transition probabilities from states $\sigma_i \notin \mathcal{S}^*_i$ there is no need for an explicit prior because the transition probability must be zero or one by definition of the uHMM topology. As a result, the prior expectation for transition probabilities is:

$$ \mathrm{E}_{\text{prior}}[p(x|\sigma_i, \sigma_{i,0})] = \frac{\alpha(\sigma_i x | \sigma_{i,0})}{\alpha(\sigma_i \bullet | \sigma_{i,0})} , \qquad (8) $$

for states $\sigma_i \in \mathcal{S}^*_i$.

Next, we employ Bayes' Theorem to obtain the posterior distribution for the transition probabilities given data and prior assumptions. In this context, it takes the form:

$$ P(\theta_i | D, \sigma_{i,0}, M_i) = \frac{P(D | \theta_i, \sigma_{i,0}, M_i) \, P(\theta_i | \sigma_{i,0}, M_i)}{P(D | \sigma_{i,0}, M_i)} . \qquad (9) $$

The terms in the numerator are already specified above as the likelihood and the prior, Eqs. (6) and (7), respectively.

The normalization factor in Eq. (9) is called the evidence, or marginal likelihood. This term integrates the product of the likelihood and prior with respect to the set of transition probabilities $\theta_i$:

$$ P(D | \sigma_{i,0}, M_i) = \int d\theta_i \, P(D | \theta_i, \sigma_{i,0}, M_i) \, P(\theta_i | \sigma_{i,0}, M_i) = \prod_{\sigma_i \in \mathcal{S}^*_i} \left\{ \frac{\Gamma(\alpha(\sigma_i \bullet | \sigma_{i,0}))}{\prod_{x \in \mathcal{X}} \Gamma(\alpha(\sigma_i x | \sigma_{i,0}))} \times \frac{\prod_{x \in \mathcal{X}} \Gamma(\alpha(\sigma_i x | \sigma_{i,0}) + n(\sigma_i x | \sigma_{i,0}))}{\Gamma(\alpha(\sigma_i \bullet | \sigma_{i,0}) + n(\sigma_i \bullet | \sigma_{i,0}))} \right\} , \qquad (10) $$

resulting in the average of the likelihood with respect to the prior. In addition to normalizing the posterior distribution (Eq. (9)), the evidence is important in our subsequent applications of Bayes' Theorem. In particular, the quantity is central to the model selection to follow and is used to (i) determine the start state given the model and (ii) compare model topologies.

As discussed above, conjugate priors result in a posterior distribution of the same form, with prior parameters modified by observed counts:

$$ P(\theta_i | D, \sigma_{i,0}, M_i) = \prod_{\sigma_i \in \mathcal{S}^*_i} \left\{ \frac{\Gamma(\alpha(\sigma_i \bullet | \sigma_{i,0}) + n(\sigma_i \bullet | \sigma_{i,0}))}{\prod_{x \in \mathcal{X}} \Gamma(\alpha(\sigma_i x | \sigma_{i,0}) + n(\sigma_i x | \sigma_{i,0}))} \, \delta\!\left(1 - \sum_{x \in \mathcal{X}} p(x|\sigma_i, \sigma_{i,0})\right) \prod_{x \in \mathcal{X}} p(x|\sigma_i, \sigma_{i,0})^{\alpha(\sigma_i x | \sigma_{i,0}) + n(\sigma_i x | \sigma_{i,0}) - 1} \right\} . \qquad (11) $$

Comparing Eqs. (7) and (11)—prior and posterior, respectively—shows that the distributions are very similar: $\alpha(\sigma_i x | \sigma_{i,0})$ (prior only) is replaced by $\alpha(\sigma_i x | \sigma_{i,0}) + n(\sigma_i x | \sigma_{i,0})$ (prior plus data). Thus, one can immediately write down the posterior mean for the transition probabilities:

$$ \mathrm{E}_{\text{post}}[p(x|\sigma_i, \sigma_{i,0})] = \frac{\alpha(\sigma_i x | \sigma_{i,0}) + n(\sigma_i x | \sigma_{i,0})}{\alpha(\sigma_i \bullet | \sigma_{i,0}) + n(\sigma_i \bullet | \sigma_{i,0})} , \qquad (12) $$

for states $\sigma_i \in \mathcal{S}^*_i$. As with the prior, probabilities for transitions from states $\sigma_i \notin \mathcal{S}^*_i$ are zero or one, as defined by the model topology.

Notably, the posterior mean for the transition probabilities does not completely specify our knowledge since the uncertainty, reflected in functions of the posterior's higher moments, can be large. These moments are available elsewhere [33]. However, using methods detailed below, we employ sampling from the posterior at this level, as well as other inference levels, to capture estimation uncertainty.
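Equations (10) and (12) reduce to a few lines of code once edge counts are in hand. The following sketch, under the symmetric-prior assumption $\alpha(\sigma_i x | \sigma_{i,0}) = \alpha$ used in the text, consumes counts of the form produced by the trace_path() sketch above; start states whose path tracing fails should be assigned zero evidence before this function is reached.

```python
from math import lgamma

def log_evidence(edges, counts, alpha=1.0):
    """Log marginal likelihood, Eq. (10), for one topology and an
    assumed start state.

    edges:  dict (state, symbol) -> next state, defining the topology.
    counts: dict (state, symbol) -> n(sigma x | sigma_0), e.g., from the
            trace_path() sketch above.
    alpha:  symmetric Dirichlet hyperparameter; alpha = 1 is the uniform
            prior over each simplex used in the text.
    """
    # Group outgoing symbols by state; only states with more than one
    # outgoing edge (the set S*_i) carry free parameters.
    out = {}
    for (s, x) in edges:
        out.setdefault(s, []).append(x)
    log_ev = 0.0
    for s, symbols in out.items():
        if len(symbols) < 2:
            continue  # forced edge: probability one, contributes nothing
        n = [counts.get((s, x), 0) for x in symbols]
        log_ev += lgamma(len(symbols) * alpha) - len(symbols) * lgamma(alpha)
        log_ev += sum(lgamma(alpha + ni) for ni in n) \
                  - lgamma(len(symbols) * alpha + sum(n))
    return log_ev

def posterior_mean_probs(edges, counts, alpha=1.0):
    """Posterior-mean transition probabilities, Eq. (12)."""
    out = {}
    for (s, x) in edges:
        out.setdefault(s, []).append(x)
    means = {}
    for s, symbols in out.items():
        total = sum(counts.get((s, x), 0) for x in symbols)
        for x in symbols:
            means[(s, x)] = 1.0 if len(symbols) == 1 else \
                (alpha + counts.get((s, x), 0)) / (len(symbols) * alpha + total)
    return means
```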
B. Inferring Start States

The next task is to calculate the probabilities for each start state given a proposed machine topology and observed data. Although we are not typically interested in the actual start state, introducing this latent variable is necessary to develop the previous section's analytic methods. And, in any case, another level of Bayes' Theorem allows us to average over uncertainty in start state to obtain the probability of observed data for the topology, independent of start state.

We begin with the evidence $P(D | \sigma_{i,0}, M_i)$ derived in Eq. (10) to estimate transition probabilities. When determining the start state, the evidence (marginal likelihood) from inferring transition probabilities becomes the likelihood for start state estimation. As before, we apply Bayes' Theorem, this time with unknown start states, instead of unknown transition probabilities:

$$ P(\sigma_{i,0} | D, M_i) = \frac{P(D | \sigma_{i,0}, M_i) \, P(\sigma_{i,0} | M_i)}{P(D | M_i)} . \qquad (13) $$

This calculation requires defining a prior over start states $P(\sigma_{i,0} | M_i)$. In practice, setting start states as equally probable a priori is a sensible choice in light of the larger goal of structural inference. The normalization $P(D | M_i)$, or evidence, at this level follows by averaging over the uncertainty in $\sigma_{i,0}$:

$$ P(D | M_i) = \sum_{\sigma_{i,0} \in \mathcal{S}_i} P(D | \sigma_{i,0}, M_i) \, P(\sigma_{i,0} | M_i) . \qquad (14) $$

The result of this calculation no longer explicitly depends on start states or transition probabilities. The uncertainty created by these unknowns has been averaged over, producing a very useful quantity for comparing different topologies: $P(D | M_i)$. However, one must not forget that inferring transition and start state probabilities underlies the structural comparisons to follow. In particular, the priors set at the levels of transition probabilities and start states can impact the structures detected due to the hierarchical nature of the inference: $P(D | \theta_i, \sigma_{i,0}, M_i) \rightarrow P(D | \sigma_{i,0}, M_i) \rightarrow P(D | M_i)$.
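A minimal sketch of this level of inference, building on the trace_path() and log_evidence() sketches above and assuming the uniform start-state prior discussed in the text:

```python
import math

def start_state_posterior(data, edges, states, alpha=1.0):
    """Eqs. (13)-(14): posterior over start states for one topology,
    plus the start-state-averaged evidence log P(D|M_i). Assumes the
    uniform prior P(sigma_0 | M_i) = 1/|S_i|.
    """
    log_like = {}
    for s0 in states:
        traced = trace_path(data, edges, s0)
        if traced is not None:                 # otherwise zero likelihood
            _, counts = traced
            log_like[s0] = log_evidence(edges, counts, alpha)
    if not log_like:
        return {}, float("-inf")               # topology cannot generate the data
    log_prior = -math.log(len(states))
    m = max(log_like.values())
    # log-sum-exp over start states gives Eq. (14)
    log_ev = log_prior + m + math.log(sum(math.exp(v - m)
                                          for v in log_like.values()))
    post = {s0: math.exp(log_prior + v - log_ev) for s0, v in log_like.items()}
    return post, log_ev
```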
C. Inferring Model Topology

So far, we inferred transition probabilities and start states for a given model topology. Now, we are ready to compare different topologies in a set M of candidate models. As with inferring start states given a topology, we write down yet another version of Bayes' Theorem, except one for model topology:

$$ P(M_i | D, \mathrm{M}) = \frac{P(D | M_i, \mathrm{M}) \, P(M_i | \mathrm{M})}{P(D | \mathrm{M})} , \qquad (15) $$

writing the likelihood as $P(D | M_i, \mathrm{M})$ to make the nature of the conditional distributions clear. This is exactly the same, however, as the evidence derived above in Eq. (14): $P(D | M_i) = P(D | M_i, \mathrm{M})$. Equality holds because nothing in calculating the previous evidence term directly depends on the set of models considered. The evidence $P(D | \mathrm{M})$, or normalization term, in Eq. (15) has the general form:

$$ P(D | \mathrm{M}) = \sum_{M_j \in \mathrm{M}} P(D | M_j, \mathrm{M}) \, P(M_j | \mathrm{M}) . \qquad (16) $$

To apply Eq. (15) we must first provide an explicit prior over model topologies. One general form, tuned by a single parameter β, is:

$$ P(M_i | \mathrm{M}) = \frac{\exp(-\beta \, \phi(M_i))}{\sum_{M_j \in \mathrm{M}} \exp(-\beta \, \phi(M_j))} , \qquad (17) $$

where $\phi(M_i)$ is some desired function of model topology. In the examples to follow we use the number of causal states—$\phi(M_i) = |M_i|$—thereby penalizing for model size. This is particularly important when a short data series is being investigated. However, setting β = 0 removes the penalty, making all models in M a priori equally likely. It is important to investigate the effects of choosing a specific β for a given set of candidate topologies. Below, we first demonstrate the effect of choosing β = 0, 2, or 4. After that, however, we employ β = 4 since this value, in combination with the set of one- to five-state binary-alphabet topological ε-machines, produces a preference for one- and two-state machines for short data series and still allows for inferring larger machines with only a few thousand symbols. Experience with this β shows that it is structurally conservative.

In the examples we explore two approaches to using the results of structural inference. The first takes into account all model topologies in the set considered, weighted according to the posterior distribution given in Eq. (15).

FIG. 2. Pseudocode for generating $N_s$ samples of a function $f(\theta_i)$ of model parameters $\{\theta_i\}$. Algorithm 1 samples a topology each time through the loop, whereas Algorithm 2 uses the MAP topology for all iterations. The sampling at each stage allows for the creation of a set of samples $\{f_n\}$ that accurately reflects the many sources of uncertainty in the posterior distribution.

The second selects a single model $M_{\text{map}}$ that is the maximum a posteriori (MAP) topology:

$$ M_{\text{map}} = \underset{M_i \in \mathrm{M}}{\arg\max} \, P(M_i | D, \mathrm{M}) . \qquad (18) $$

The difference between these methods is most dramatic for short data series. Also, using the MAP topology often underestimates the uncertainty in functions of the model parameters, as we discuss shortly. Of course, since one throws away any number of comparable models, estimating uncertainty in any quantity that explicitly depends on the model topology cannot be done properly if MAP selection is employed. However, we expect some will want or need to use a single model topology, so we consider both methods.
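Combining the pieces gives a sketch of the full topology-level computation, Eqs. (15)-(18). The candidates dictionary here is a hypothetical stand-in for the enumerated library of [20]; the prior normalization of Eq. (17) cancels in the posterior and is therefore omitted.

```python
import math

def topology_posterior(data, candidates, beta=4.0, alpha=1.0):
    """Posterior over candidate topologies, Eq. (15), with the
    size-penalty prior of Eq. (17), phi(M_i) = number of states.

    candidates: dict name -> (edges, states); a hypothetical stand-in
    for the enumerated library of topological epsilon-machines [20].
    Returns (posterior dict, MAP model name per Eq. (18)).
    """
    log_joint = {}
    for name, (edges, states) in candidates.items():
        _, log_ev = start_state_posterior(data, edges, states, alpha)
        if log_ev > float("-inf"):
            # log P(D|M_i) plus the unnormalized log prior -beta * |M_i|
            log_joint[name] = log_ev - beta * len(states)
    m = max(log_joint.values())
    z = m + math.log(sum(math.exp(v - m) for v in log_joint.values()))
    post = {name: math.exp(v - z) for name, v in log_joint.items()}
    return post, max(post, key=post.get)
```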
D. Estimating Functions of Model Parameters

A primary goal in inference is estimating functions that depend on an inferred model's parameters. We denote this $f(\theta_i)$ to indicate the dependence on transition probabilities. Unfortunately, substituting the posterior mean for the transition probabilities into some function of interest does not provide the desired expectation. In general, obtaining analytic expressions for the posterior mean of desired functions is quite difficult; see, for example, [34, 35]. Deriving expressions for the uncertainty in the resulting estimates is equally involved and typically not done; although see [34].

Above, the inference method required inferring transition probabilities, start state, and topology. Function estimation, as a result, should take into account all these sources of uncertainty. Instead of deriving analytic expressions for posterior means (if possible), we turn to numerical methods to estimate function means and uncertainties in great detail. We do this by repeatedly sampling from the posterior distribution at each level to obtain a sample ε-machine and evaluating the function of interest for the sampled parameter values. The algorithms in Fig. 2 detail the process of sampling $f(\theta_i)$ using all candidate models M (Algorithm 1) or the single $M_{\text{MAP}}$ model (Algorithm 2). Given a set of samples of the function of interest, any summary statistic can be employed. In the examples, we generate $N_s = 50{,}000$ samples from which we estimate a variety of properties. More specifically, these samples are employed to estimate the posterior mean and the 95%, equal-tailed, credible interval (CI) [24]. This means there is a 5% probability of samples being outside the specified interval, with equal probability of being above or below the interval. Finally, a Gaussian kernel density estimation (Gkde) is used to visualize the posterior density for the functions of interest.

The examples demonstrate estimating process randomness and structure from data series using the two algorithms introduced above. For a known ε-machine topology $M_i$, with specified transition probabilities $\{p(x|\sigma_i)\}$, these properties are quantified using the entropy rate $h_\mu$ and statistical complexity $C_\mu$, respectively. The entropy rate is:

$$ h_\mu = - \sum_{\sigma_i \in \mathcal{S}_i} p(\sigma_i) \sum_{x \in \mathcal{X}} p(x|\sigma_i) \log_2 p(x|\sigma_i) \qquad (19) $$

and the statistical complexity is:

$$ C_\mu = - \sum_{\sigma_i \in \mathcal{S}_i} p(\sigma_i) \log_2 p(\sigma_i) . \qquad (20) $$

In these expressions, the $p(\sigma_i)$ are the asymptotic state probabilities determined by the left eigenvector (normalized in probability) of the internal Markov chain transition matrix $T = \sum_{x \in \mathcal{X}} T^{(x)}$. Of course, $h_\mu$ and $C_\mu$ are also functions of the model topology and transition probabilities, so these quantities provide good examples of how to estimate functions of model parameters in general.
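Both information measures are short computations given the symbol-labeled matrices of the Def. 1 sketch. The following is a sketch, not the authors' implementation; the entropy-rate formula applies as written only to unifilar presentations, matching Eq. (19).

```python
import numpy as np

def stationary_distribution(T_total):
    """Left eigenvector of T for eigenvalue 1, normalized in probability."""
    w, v = np.linalg.eig(T_total.T)
    pi = np.real(v[:, np.argmin(np.abs(w - 1.0))])
    return pi / pi.sum()

def entropy_rate(T):
    """Eq. (19). T: dict symbol -> labeled matrix T^(x), as in the Def. 1 sketch."""
    pi = stationary_distribution(sum(T.values()))
    h = 0.0
    for Tx in T.values():
        px = Tx.sum(axis=1)          # p(x|sigma) for a unifilar model
        nz = px > 0
        h -= np.sum(pi[nz] * px[nz] * np.log2(px[nz]))
    return h

def statistical_complexity(T):
    """Eq. (20): Shannon entropy of the asymptotic state distribution."""
    pi = stationary_distribution(sum(T.values()))
    nz = pi > 0
    return -np.sum(pi[nz] * np.log2(pi[nz]))

# For the Even Process matrices of the earlier sketch these give
# h_mu = 2/3 bit and C_mu = log2(3) - 2/3 ≈ 0.918 bits.
```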
V. EXAMPLES

We divide the examples into two parts. First, we demonstrate inferring transition probabilities and start states for a known topology. Second, we focus on inferring ε-machine topology using the set of all binary, one- to five-state topological ε-machines, consisting of 36,660 candidates; see Table I. We use the convergence of estimates for the information-theoretic values $h_\mu$ and $C_\mu$ to monitor structure discovery. However, estimating model parameters is at the core of the later examples and so we start with this procedure.

FIG. 3. State-transition diagram for the Even Process's ε-machine topology. The "true" value of the unspecified transition probability is $p(0|A) = 1/2$. For this topology, $\mathcal{S}_{\text{even}} = \{A, B\}$ and $\mathcal{S}^*_{\text{even}} = \{A\}$ because state B has only one outgoing transition.

FIG. 4. (Color online) Convergence of posterior density $P(p(0|A) | D_{0:L}, M_{\text{even}})$ as a function of subsample length $L = 2^i$, $i = 0, 1, 2, \ldots, 17$. Each posterior density plot uses a Gaussian kernel density estimator with 50,000 samples from the posterior. The true value of $p(0|A) = 1/2$ appears as a dashed line and the posterior mean as a solid line.

For each example we generate a single data series $D_{0:T}$ of length $T = 2^{17}$. When analyzing convergence, we consider subsamples $D_{0:L}$ of lengths $L = 2^i$, using $i = 0, 1, 2, \ldots, 17$. For example, a four-symbol sequence starting at the first data point is designated $D_{0:4} = x_0 x_1 x_2 x_3$. The overlapping analysis of a single data series gives insight into convergence for the inferred models and for the statistics estimated.

A. Estimating Parameters

1. Even Process

We first explore a single example of inferring properties of a known data source using Eqs. (6)-(11). We generate a data series from the Even Process and then, using the correct topology (Fig. 3), we infer start states and transition probabilities and estimate the entropy rate and statistical complexity. We do not concentrate on this level of inference in subsequent examples, preferring to focus instead on model topology and its representation of the unknown process structure. Nonetheless, the procedure detailed here underlies all of the examples.

The Even Process is notable because it has infinite Markov order. This means no finite-order Markov chain can reproduce its word distribution [29]. It can be finitely modeled, though, with a finite-state unifilar HMM—the ε-machine of Fig. 3. A single data series was generated using the Even Process ε-machine with $p(0|A) = 1/2$. The start state was randomized before generating sequence data of length $T = 2^{17}$. As it turned out, the initial segment was $D_{0:T} = 100\ldots$, indicating that the unknown start state was B on that realization. This is so because the first symbol is a 1, which can be generated starting in either state A or B, but the sequence 10 is only possible by starting at node B.

Next, we estimate the transitions from the generated data series using length-L subsamples $D_{0:L} = x_0 x_1 \ldots x_{L-1}$ to track convergence. Although the mean and other moments of the Dirichlet posterior can be calculated analytically [33], we sample values using Algorithm 2 in Fig. 2. However, in this example we employ $M_{\text{even}}$ instead of $M_{\text{map}}$ because we are focused on the model parameters for a known topology. The posterior density for each subsample $D_{0:L}$ is plotted in Fig. 4 using Gaussian kernel density estimation (Gkde). The true value of $p(0|A)$ is shown as a black, dashed line and the posterior mean as a solid, gray line. (Both lines connect values evaluated at each length $L = 2^0, 2^1, \ldots, 2^{17}$.) The convergence of the posterior density to the correct value of $p(0|A) = 1/2$ with increasing data size is clear and, moreover, the true value is always in a region of positive probability.

For our final example using a known topology we estimate $h_\mu$ and $C_\mu$ from the Even Process data. This illustrates estimating these functions of model parameters when the ε-machine topology is known but there is uncertainty in start state and transition probabilities. As above, we use Algorithm 2 in Fig. 2 and employ the known machine structure. We sample start states and transition probabilities, followed by calculating $h_\mu$ and $C_\mu$—via Eqs. (19) and (20), respectively—to build a posterior density for these quantities.

Figure 5 presents the joint distribution for $C_\mu$ and $h_\mu$ along with the Gkde estimation of their marginal densities. Samples from the joint posterior distribution are plotted in the lower left panel for subsample lengths L = 1, 64, and 16,384. Only 5,000 of the available samples are displayed in this panel to minimize the graphic's size. The marginal densities for $h_\mu$ (top panel) and $C_\mu$ (right panel) are plotted using a Gkde with all 50,000 samples.
Small data size (L = 1, indicated by black points) samples allow a wide range of structure and randomness constrained only by the Even Process ε-machine topology. The range of $h_\mu$ and $C_\mu$ reflects the flat priors set for start states and transition probabilities. We note that a uniform prior distribution over transition probabilities and start states does not produce a uniform distribution over $h_\mu$ or $C_\mu$. Increasing the size of the data subsample to L = 64 (brown points) results in a considerable reduction in the uncertainty for both functions. For this amount of data, the possible values of entropy rate and statistical complexity curve around the true value in the $h_\mu$-$C_\mu$ plane and result in a shifted peak for the marginal density for $h_\mu$. For subsample length L = 16,384 (blue points) the estimates of both functions of model parameters converge to the true values, indicated by the black, dashed lines.

FIG. 5. Convergence of randomness ($h_\mu$) and structure ($C_\mu$) calculated with transition probabilities and start states estimated from Even Process data, assuming the correct topology. 50,000 samples were taken from the joint posterior $P(h_\mu, C_\mu | D_{0:L}, M_{\text{even}})$. (Lower left) A subsample of size 5,000 for data sizes L = 1 (black), L = 64 (brown), and L = 16,384 (blue). Gaussian kernel density estimates (using all 50,000 samples) of the marginal distributions $P(h_\mu | D_{0:L}, M_{\text{even}})$ (top) and $P(C_\mu | D_{0:L}, M_{\text{even}})$ (right) for the same values of L. Dashed lines indicate the true values of $h_\mu$ and $C_\mu$ for the Even Process.

B. Inferring Process Structure

We are now ready to demonstrate BSI's efficacy for structural inference via a series of increasingly complex processes, monitoring convergence using data subsamples up to a length of $L = 2^{17}$. In this, we determine the number of hidden states, number of edges connecting them, and symbols output on each transition. As discussed above, we use the set of topological ε-machines as candidates because an efficient and exhaustive enumeration is available.

For comparison, we first explore the organization of the prior over the set of candidate ε-machines using intrinsic informational coordinates—the process entropy rate $h_\mu$ and statistical complexity $C_\mu$. We focus on their joint distribution, as induced by various settings of the prior parameter β. The results lead us to use β = 4 for the subsequent examples. This value creates a preference for small models when little data is available but allows for a larger number of states when reasonable amounts of data support it.

We establish BSI's effectiveness by inferring the structure of a finite-order Markov process, an infinite-order Markov process, and an infinite-memory process. Again, the proxy for convergence is estimating structure and randomness as a function of the data subsample length L. Comparing these quantities' posterior distributions with their prior illustrates uncertainty reduction as more data is analyzed.

1. Priors for Structured Processes

Here, we use a prior over all binary-alphabet, topological ε-machines with one to five states. (Recall Table I.) We denote the set of topological ε-machines detailed in Table I as M. Equation (17) allows specifying a preference for smaller ε-machines by setting β > 0 and defining the function of model structure to be the number of states: $\phi(M_i) = |M_i|$.
Beyond setting this explicitly, there is an inherent bias to smaller models inversely proportional to the parameter space dimension. The parameter space is that of the estimated transition probabilities. Its dimension is the number of states with more than one outgoing transition. However, candidate ε-machine topologies with many states and few transitions result in a small parameter space and so may be assigned high probability for short data series. In addition, the prior over topologies must take into account the increasing number of candidates as the number of states increases. Setting β sufficiently high so that large models are not given high probability under these conditions is reasonable, as we would like to approach structure estimates ($C_\mu$) monotonically from below, as data size increases.

FIG. 6. Model prior dependence on penalty parameter β: 50,000 samples were taken from the joint prior $P(h_\mu, C_\mu | \mathrm{M})$ using all binary-alphabet, topological ε-machines with 1-5 states and parameters: β = 0 (black), β = 2 (brown), and β = 4 (blue). (Lower left) A subsample of size 5,000 from the joint distribution is shown for each value of β. A Gaussian kernel density estimation, using all 50,000 samples for each value of β, of the marginal distributions $P(h_\mu | \mathrm{M})$ (top) and $P(C_\mu | \mathrm{M})$ (right).

Figure 6 plots samples from the resulting joint prior for ($h_\mu$, $C_\mu$) as well as the corresponding Gkde for marginal densities of both quantities. The data are generated by using the method of Sec. IV D and replacing the posterior density with the prior density. Specifically, rather than sampling a topology $M_i$ from $P(M_i | D, \mathrm{M})$, we sample from $P(M_i | \mathrm{M})$. Similar substitutions are made at each level, using the distributions that do not depend on observed data, resulting in samples from the prior. Each color in the figure reflects samples using all ε-machines in M with different values for the prior parameter: β = 0 (black), β = 2 (brown), and β = 4 (blue). While β = 0 has many samples at high $C_\mu$, reflecting the large number of five-state ε-machines, increasing to β = 2 results in noticeable bands in the $h_\mu$-$C_\mu$ plane and peaks at $C_\mu = \log_2 1$, $C_\mu = \log_2 2$, $C_\mu = \log_2 3$ bits, and so on. This reflects the fact that larger β makes smaller machines more likely. As a consequence, the emergence of patterns due to one-, two-, and three-state topologies is seen. Setting β = 4 shows a stronger a priori preference for one- and two-state machines, reflected by the strong peaks at $C_\mu = 0$ bits and $C_\mu = 1$ bit. Interestingly, the prior distribution over $h_\mu$ and $C_\mu$ is quite similar for β = 0 and 2, with more distributional structure due to smaller ε-machines at β = 2. However, the prior distribution for $h_\mu$ and $C_\mu$ is quite different for β = 4, creating a strong preference for one- and two-state topologies. This results in an a priori preference for low $C_\mu$ and high $h_\mu$ that, as we demonstrate shortly, is modified for moderate amounts of data. We employ β = 4 as a reasonable value in all subsequent examples. In practice, sensitivity to this choice should be tested in each application to verify that the resulting behavior is appropriate. We suggest small, nonzero values as reasonable starting points. As always, sufficient data makes the choice relatively unimportant for the resulting inference.

FIG. 7. Golden Mean Process's ε-machine.
2. Markov Example: The Golden Mean Process

The first example of structural inference explores the Golden Mean Process, pictured in Fig. 7. Although it is illustrated as an HMM in the figure, it is effectively a Markov chain with no hidden states: observing a 1 corresponds to state A, whereas observing a 0 means the process is in state B. Previously, we showed that this data source can be inferred using the model class of kth-order Markov chains, as expected [29]. However, the Golden Mean Process is also a member of the class of binary-alphabet, topological ε-machines considered here. As a result, structural inference from Golden Mean data is an example of in-class modeling.

We proceed using the approach laid out above for the Even Process transition probabilities and start states. We generated a single data series by randomizing the start state and creating a symbol sequence of length $T = 2^{17}$ using the Golden Mean Process ε-machine. As above, we monitor the convergence using subsamples $D_{0:L} = x_0 x_1 \ldots x_{L-1}$ for lengths $L = 2^i$, $i = 0, 1, \ldots, 17$. The candidate machines M consist of all 36,660 ε-machine topologies in Table I. Estimating $h_\mu$ and $C_\mu$ aids in monitoring convergence of inferred topology and related properties to the correct values. In addition, we provide supplementary tables and figures, using both M and the maximum a posteriori model $M_{\text{MAP}}$ at each data length L, to give a detailed view of structural inference.

Figure 8 plots samples from the joint posterior over ($h_\mu$, $C_\mu$), as well as their marginal distributions, for three subsample lengths. As in Fig. 5, we consider L = 1 (black), L = 64 (brown), and L = 16,384 (blue). However, this example employs the full set M of candidate topologies. For small data size (L = 1) the distribution closely approximates the prior distribution for β = 4, as it should. At data size L = 64, the samples of both $h_\mu$ and $C_\mu$ are still broad, resulting in multimodal behavior with considerable weight given to both two- and three-state topologies. Consulting Table S2 in the supplementary material, we see that this is the shortest length that selects the correct topology for the Golden Mean Process (denoted n2k2id5 in Table S2). For smaller L, the single-state, two-edge topology is preferred (denoted n1k2id3). However, the probability of the correct model is only 78.7%, leaving a substantial probability for alternative candidates. The uncertainty is further reflected in the large credible interval for $C_\mu$ provided by the complete set of models M (see Table S1), ranging from 0.8235 bits as the lower bound to 1.797 bits as the upper bound. However, by subsample length L = 16,384 the probability of the correct topology is 99.998%, given the set of candidate machines M, and estimates of both $h_\mu$ and $C_\mu$ have converged to accurately reflect the correct values.

In addition to Tables S1 and S2, the supplementary materials provide Fig. S1 showing the Gkde estimates of both $h_\mu$ and $C_\mu$ using M and $M_{\text{MAP}}$ as a function of subsample length. The four panels clearly show the convergence of estimates to the correct values as L increases. For long data series, there is little difference between the inference made using the maximum a posteriori (MAP) model and the posterior over the entire candidate set.
However, this is not true for short time series, where using the full set more accurately captures the uncertainty in estimation of the information-theoretic quantities of interest. We note that the $C_\mu$ estimates approach the true value from below, preferring small topologies when there is little data and selecting the correct, larger topology only as available data increases. This desired behavior results from setting β = 4 for the prior over M. Setting β = 2, shown in Fig. S2, does not have this effect. This value of β is insufficient to overcome the large number of three-, four-, and five-state ε-machines. Finally, Fig. S3 plots samples from the joint posterior of $h_\mu$ and $C_\mu$ using only the MAP model for subsample lengths L = 1, 64, and 16,384. This should be compared with Fig. 8, where the complete set M is used. Again, there is a substantial difference for short data series and much in common for larger L.

FIG. 8. Convergence of randomness ($h_\mu$) and structure ($C_\mu$) calculated with model topologies, transition probabilities, and start states estimated from Golden Mean Process data, using all one- to five-state topological ε-machines. 50,000 samples were taken from the joint posterior distribution $P(h_\mu, C_\mu | D_{0:L}, \mathrm{M})$. (Lower left) Subsample of size 5,000 for data sizes L = 1 (black), L = 64 (brown), and L = 16,384 (blue). Gaussian kernel density estimates of the marginal distributions (using all 50,000 samples) $P(h_\mu | D_{0:L}, \mathrm{M})$ (top) and $P(C_\mu | D_{0:L}, \mathrm{M})$ (right) for the same values of L. Dashed lines indicate the true values of $h_\mu$ and $C_\mu$ for the Golden Mean Process.

Before moving to the next example, let's briefly return to consider start-state inference. The data series generated to test inferring the Golden Mean Process started with the sequence $D_{0:T} = 1110\ldots$. We note that the correct start state, which happens to be state A in that realization, cannot be inferred and has lower probability than state B due to the process's structure: $P(\sigma_{\text{gm},0} = A | D_{0:T} = 1110\ldots, M_{\text{gm}}) \approx 0.3328$ using Eq. (13). The reason for the inability to discern the start state is straightforward. Consulting Fig. 7, we can see that the string 1110 can be produced beginning in both states A and B. On the one hand, assuming $\sigma_{\text{gm},0} = A$, the state path would be AAAAB with probability $p(1|A)^3 \, p(0|A) = (1/2)^4$. On the other hand, assuming $\sigma_{\text{gm},0} = B$, the state path is BAAAB with probability $p(1|B) \, p(1|A)^2 \, p(0|A) = 1 \times (1/2)^3$. The only difference in the probabilities is a factor of $p(1|A) = 1/2$ versus $p(1|B) = 1$, resulting in:

$$ P(\sigma_{i,0} = A | D = 1110, M_{\text{gm}}) = \frac{(1/2)^4}{(1/2)^4 + (1/2)^3} = 1/3 . $$

This calculation agrees nicely with the result stated above, using finite data and the inference calculations from Eq. (13). It turns out that any observed data series from the Golden Mean Process that begins with a 1 will have this ambiguity in start state. However, observed sequences that begin with a 0 uniquely identify A as the start state since a 0 is not allowed leaving state B. Despite this, the correct topology is inferred and accurate estimates of $h_\mu$ and $C_\mu$ are obtained.
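This hand calculation is easy to check by direct enumeration. The snippet below is a sketch with the Golden Mean transition probabilities fixed at their true values, so it reproduces the exact 1/3 rather than the finite-data value 0.3328 obtained from Eq. (13).

```python
# Golden Mean edges with their true probabilities:
# A --1/2|1--> A, A --1/2|0--> B, B --1|1--> A.
probs = {("A", "1"): (0.5, "A"), ("A", "0"): (0.5, "B"), ("B", "1"): (1.0, "A")}

def path_probability(data, start):
    p, state = 1.0, start
    for x in data:
        if (state, x) not in probs:
            return 0.0                 # word disallowed from this start state
        edge_p, state = probs[(state, x)]
        p *= edge_p
    return p

likes = {s: path_probability("1110", s) for s in ("A", "B")}
# With a uniform start-state prior the posterior is the likelihood ratio:
print(likes["A"] / sum(likes.values()))  # 0.3333..., matching the 1/3 above
```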
3. Infinite-order Markov Example: The Even Process

Next, we consider inferring the structure of the Even Process using the same set of binary-alphabet, one- to five-state, topological ε-machines. To be clear, this example differs from Sec. V A 1, where the correct topology was assumed. Now, we explore Even Process structure using M. As noted above, the Even Process is an infinite-order Markov process, and inference requires the set of topological ε-machines considered here. (However, see out-of-class inference of the Even Process using k-th-order Markov chains in [29].) As a result, this is an example of in-class inference, since the Even Process topology is contained within the set M. As with the previous example, a single data series was generated from the Even Process.

Figure 9 shows samples from the posterior distribution over (h_µ, C_µ) using three subsample lengths L = 1, 64, and 16,384, as before. An equivalent plot using only the MAP model is provided in the supplementary materials for comparison; see Fig. S6. Again, for short data series the samples mirror the prior distribution, as they should. (See black points for L = 1.) At subsample length L = 64 the values of h_µ and C_µ are much more tightly delineated. Comparing samples for the Golden Mean Process in Fig. 8 shows that there is much less uncertainty in structure for the Even Process at this data size. Consulting Table S4, the MAP topology at this value of L already identifies the correct topology (denoted n2k2id7) and assigns it a probability of 99.41%. This high probability is reflected in the smaller spread, when compared with the Golden Mean example, of the samples of h_µ and C_µ. At subsample length L = 16,384 the probability of the correct topology has grown to 99.998%. Estimates of both h_µ and C_µ are also very accurate, with small uncertainties, at this L; see Table S3.

The supplementary materials provide Figs. S4 and S5 to show the convergence of the posterior densities for h_µ and C_µ as a function of subsample length. Figure S4 shows estimates using both M and M_MAP for β = 4, whereas Fig. S5 demonstrates the effects of using a small penalty (β = 2) for model size. As seen with the Golden Mean Process, the difference is most apparent at small data sizes. At large L, the difference between using the complete set M of models versus the MAP model is minor, as is the effect of choosing β = 4 or β = 2. However, at small data sizes the choices impact the resulting inference. In particular, the choice of β = 4 allows the inference machinery to approach the correct C_µ from below, whereas the choice of β = 2 approaches C_µ from above; see Figs. S4 and S5. This behavior, which we believe is desirable, is similar to the inference dynamics observed for the Golden Mean Process, further strengthening the apparent suitability of using β = 4.

Unlike the previous example, the start state for the correct structure is inferred with little data. In this example, the data series begins with the symbols D_{0:T} = 10..., which can only be generated from state B. So, at L = 2 the start state for the correct topology is determined, but it takes more data (32 symbols in this case) for this structure to become the most probable in the set considered.
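For intuition about the role of β, the sketch below assumes a prior that penalizes an n-state topology in proportion to e^(-β n); both this exponential form and the per-size candidate counts are illustrative assumptions of ours, not necessarily the prior or the enumeration used in the paper. It shows how β = 4, unlike β = 2, keeps the far more numerous large topologies from dominating the prior.

```python
import math
from collections import Counter

# Hypothetical size prior: P(M) proportional to exp(-beta * n) for an
# n-state topology. This functional form is an illustrative assumption.
def size_prior(num_states_per_model, beta):
    w = [math.exp(-beta * n) for n in num_states_per_model]
    z = sum(w)
    return [wi / z for wi in w]

# Toy candidate set (counts per size are made up, not Table I's values),
# chosen only to mimic the rapid growth in topologies with state count.
sizes = [1] * 3 + [2] * 30 + [3] * 500 + [4] * 5000 + [5] * 31000
for beta in (2, 4):
    prior = size_prior(sizes, beta)
    by_size = Counter()
    for n, p in zip(sizes, prior):
        by_size[n] += p
    print(beta, {n: round(p, 3) for n, p in sorted(by_size.items())})
# With beta = 2 most prior mass sits on four- and five-state machines;
# with beta = 4 the one-state machines dominate, matching the text's
# observation that beta = 2 cannot overcome the count of large topologies.
```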
4. Out-of-Class Structural Inference: The Simple Nonunifilar Source

The Simple Nonunifilar Source (SNS) is our final and most challenging example of structural inference, due to its being out-of-class. The SNS is not only infinite-order Markov; any unifilar presentation of it also requires an infinite number of states. In particular, its ε-machine, the minimal unifilar presentation, has a countable infinity of causal states [36]. We can see the difference between the SNS and the previous processes by inspecting state A, where both outgoing edges emit the symbol 1. (See Fig. 10 for a hidden Markov model presentation that is not an ε-machine.) This makes the SNS a nonunifilar topology, as the name suggests. Importantly, even if we assume a start state, there is no longer a single, unique path through the hidden states for an observed output data series. This is completely different from the unifilar examples previously considered, where an assumed start state and observed data series either determined a unique path through hidden states or was disallowed. As a result, the inference tools developed here cannot use the HMM topology of Fig. 10. Concretely, this class of representation breaks our method for counting transitions.

FIG. 9. Convergence of randomness (h_µ) and structure (C_µ) calculated with model topologies, transition probabilities, and start states estimated from Even Process data, using all one- to five-state topological ε-machines. 50,000 samples from the joint posterior P(h_µ, C_µ | D_{0:L}, M). (Lower left) A subsample of 5,000 for data sizes L = 1 (black), L = 64 (brown), and L = 16,384 (blue). Gaussian kernel density estimates of the marginal distributions (using all 50,000 samples) P(h_µ | D_{0:L}, M) (top) and P(C_µ | D_{0:L}, M) (right) are shown for the same values of L. Dashed lines indicate the true values of h_µ and C_µ for the Even Process.

FIG. 10. The Simple Nonunifilar Source.

Our goal, though, is to use the set of unifilar, topological ε-machines at our disposal to infer properties of the Simple Nonunifilar Source. (One reason to do this is that unifilar models are required to calculate h_µ.) Typical data series generated by the SNS model are accepted by many of the unifilar topologies in M, and a posterior distribution over these models can be calculated. As with previous examples, we demonstrate estimating h_µ and C_µ for the data source. Due to the nonunifilar nature of the source, we expect C_µ estimates to increase with the size of the available data series. However, the ability to estimate h_µ accurately is unclear a priori. Of course, in this example we cannot find the correct model topology, because infinite structures are not contained in M.
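The path-counting obstruction is simple to see in code: for a unifilar topology, following the data from an assumed start state either yields one definite set of transition counts or fails outright, whereas a nonunifilar presentation like Fig. 10 would require summing over many state paths. A sketch, with a hypothetical edge-map encoding of ours:

```python
# Given a start state, a unifilar topology either determines a unique
# state path for the data or rejects it; transition counts fall out of
# the walk. topology: dict (state, symbol) -> next state.
def follow(topology, start, data):
    """Return per-edge transition counts along the unique path, or None
    if the data has no valid path from this start state."""
    counts, state = {}, start
    for x in data:
        if (state, x) not in topology:
            return None  # data rejected: zero likelihood from this start
        counts[(state, x)] = counts.get((state, x), 0) + 1
        state = topology[(state, x)]
    return counts

# Golden Mean topology: one edge per (state, symbol), hence unifilar.
GM = {("A", "1"): "A", ("A", "0"): "B", ("B", "1"): "A"}
print(follow(GM, "A", "1110"))  # {('A','1'): 3, ('A','0'): 1}
print(follow(GM, "B", "0101"))  # None: no 0-edge leaves state B
```

For a nonunifilar machine this function has no analogue: two edges leaving a state on the same symbol make the next state, and therefore the counts, ambiguous.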
FIG. 11. Convergence of randomness (h_µ) and structure (C_µ) calculated with model topologies, transition probabilities, and start states estimated from Simple Nonunifilar Source data, using all one- to five-state topological ε-machines. Sample sizes, colors, and line types mirror those in previous figures.

Figure 11 presents the joint posterior for (h_µ, C_µ) for three subsample lengths. As previously, a single data series of length 2^17 is generated using the SNS, and analysis of subsamples D_{0:L} demonstrates convergence. The short subsample (L = 1, black points) is predictably uninteresting, reflecting the prior distribution over models. For subsamples shorter than L = 64 the MAP model is the single-state, two-edge topology. (Denoted n1k2id3 in Table S6.) At L = 64 the Golden Mean Process topology becomes most probable, with a posterior probability of 53.01%. The probability of the single-state topology is still 43.98%, though, resulting in the strongly bimodal marginal posterior for C_µ observed at L = 64. (See Fig. 11, brown points, right panel.) Bimodality also appears in the marginal posterior for h_µ, with the largest peak coming from the two-state topology and the high entropy rates being contributed by the single-state model. At large data size (L = 16,384, blue points) h_µ has converged on the true value, while C_µ has sharp, bimodal peaks due to many nearly equally probable five-state topologies. Consulting Table S6, we see that the MAP structure for this value of L has five states (denoted n5k2id22979 there) and a low posterior probability of only 8.63%. Further investigation reveals that there are four additional ε-machine topologies (making a total of five) with similar posterior probability. These general details persist for longer subsample sequences, including the complete data series at length 2^17. Although the h_µ estimate converges smoothly, the inference of structure, as reflected by C_µ, does not show signs of graceful convergence.

We provide supplementary plots in Figs. S7 and S8 that show the convergence of h_µ and C_µ using M and M_MAP for prior parameters β = 4 and β = 2, respectively. Again, the choice of β matters most at small data sizes. While the C_µ estimate increases as a function of L for β = 4, the use of β = 2 results in posterior means for C_µ that first decrease as a function of L, then increase. Again, this supports the use of β = 4 for this set of binary-alphabet, topological ε-machines. The need to employ the complete model set M versus the MAP topology is most evident at small data sizes, as was also seen in previous examples. However, the C_µ inference in this example is more complicated, due to the large number of five-state topologies with roughly equal probability. The MAP method selects just one model, of course, and so cannot represent the posterior distribution's bimodal behavior. Given that the data source is out-of-class, this trouble is perhaps not surprising. Figure S9 shows samples from the joint posterior of (h_µ, C_µ) using only the MAP topology. This approach also suffers from requiring one to select a single exemplar topology for a posterior distribution that is simply not well represented by a single ε-machine.
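For reference, given a sampled unifilar machine, the two information coordinates follow from the stationary state distribution π: h_µ = Σ_σ π_σ H[p(·|σ)] and C_µ = H[π], with H the Shannon entropy in bits. These relations are standard for unifilar HMMs; the sketch below is our encoding, not the paper's code.

```python
import numpy as np

def entropy_rate_and_complexity(states, edges):
    """h_mu and C_mu for a unifilar HMM.
    edges: dict (state, symbol) -> (next_state, probability)."""
    idx = {s: i for i, s in enumerate(states)}
    T = np.zeros((len(states), len(states)))  # state-to-state matrix
    for (s, _), (t, p) in edges.items():
        T[idx[s], idx[t]] += p
    # Stationary distribution pi as the left fixed point of T, found by
    # power iteration (fine for these small, ergodic machines).
    pi = np.ones(len(states)) / len(states)
    for _ in range(10000):
        pi = pi @ T
    h_mu = -sum(pi[idx[s]] * p * np.log2(p)
                for (s, _), (_, p) in edges.items() if p > 0)
    c_mu = -sum(p * np.log2(p) for p in pi if p > 0)
    return h_mu, c_mu

gm = {("A", "1"): ("A", 0.5), ("A", "0"): ("B", 0.5), ("B", "1"): ("A", 1.0)}
print(entropy_rate_and_complexity(["A", "B"], gm))
```

On the Golden Mean ε-machine this returns h_µ ≈ 0.667 bits and C_µ ≈ 0.918 bits, matching the asymptotic values in Tables S1 and S2.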
VI. DISCUSSION

The examples demonstrated structural inference of unifilar hidden Markov models using the set of one- to five-state, binary-alphabet, topological ε-machines. We found that in-class examples, including the Golden Mean and Even Processes, were effectively and efficiently discovered. That is, the correct topology was accorded the largest posterior probability, and estimates of the information coordinates h_µ and C_µ were accurate. However, we found that a sufficiently large value of β, providing the model-size penalty, was key to a conservative structural inference. Conservative means that C_µ estimates approach the true value from below, effectively counteracting the increasing number of topologies with larger state sets. For the out-of-class example, given by the Simple Nonunifilar Source, these broader patterns held true. However, structure could not be captured, as reflected in the increasing number of states inferred as a function of data length. Also, many topologies had relevant posterior probability for the SNS data, reflecting a lack of consensus and a large degeneracy with regard to structure. This resulted in a multimodal posterior distribution for C_µ and a MAP model with very low posterior probability.

One of the surprises was the number of accepting topologies for a given data set. By this we mean the number of candidate structures for which the data series of interest had a valid path through hidden states, resulting in nonzero posterior probability. In many ways, this aspect of structural inference mirrors grammatical inference for deterministic finite automata (DFAs) [37, 38]. In the supplementary material we provide plots, for the three processes considered above, showing the number of accepting topologies in the set of one- to five-state ε-machines used for M. (See Supplemental Fig. S10.) For all of these processes, a rapid decline in the number of accepting topologies occurs over the first 2^6 to 2^7 symbols, followed by a plateau at a fixed set of accepting topologies. For the smaller topologies, which come from the model class under consideration, this pattern makes sense: often, the smaller topology is embedded within a larger set of states, some of which are never used. For out-of-class examples like the SNS this behavior is less transparent. The rejection of a data series by a given topology provides a first level of filtering, assigning zero posterior probability to the structure due to the vanishing likelihood of the data given the model. For the examples above, of the 36,660 possible topologies, 6,225 accepted Golden Mean data, 3,813 accepted Even Process data, and 6,225 accepted SNS data when the full data series was considered.

In all of the examples the data sources were stationary, so that statistics did not change over the course of the data series. This is important because stationarity is built into the model class definition employed: the model topology and transition probabilities did not depend on time. However, given a general data series with unknown properties, it is unwise to assume stationarity holds. How can this be probed? One method is to subdivide the data into overlapping segments of equal length. Given these, inference using M or M_MAP should return similar results for each segment. For in-class data sources like the Even and Golden Mean Processes, the true model should be returned for each data subsegment. For out-of-class but stationary sources like the Simple Nonunifilar Source, the true topology cannot be returned, but a consistent model within M should be returned for each data segment. However, one form of relatively simple nonstationarity, such as a structural change-point problem in which the source switches between the Golden Mean and Even Processes, can be detected by BSI applied to subsegments. The inferred topology for early segments returns the Golden Mean topology and later segments return the Even topology. Notably, the inferred topology using all of the data, or a subsegment overlapping the switch, returns a more complicated model topology reflecting both structures. Of course, detection of this behavior requires sufficient data and slow switching between data sources.
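A sketch of this segment-based stationarity probe; `infer_map_topology` is a stand-in for the BSI machinery described above, and the segment length and stride are arbitrary illustrative choices.

```python
# Carve the series into overlapping, equal-length segments and compare
# inference results across them: agreement is consistent with
# stationarity, while a change in the inferred topology flags a possible
# structural change point.
def overlapping_segments(data, length, stride):
    return [data[i:i + length]
            for i in range(0, len(data) - length + 1, stride)]

def probe_stationarity(data, infer_map_topology, length=4096, stride=2048):
    """Return the MAP topology inferred for each overlapping segment."""
    return [infer_map_topology(seg)
            for seg in overlapping_segments(data, length, stride)]

# Example: a 2**14-symbol series cut into 4096-symbol segments with 50%
# overlap yields seven segments.
segs = overlapping_segments("01" * (2**13), 4096, 2048)
print(len(segs), len(segs[0]))  # 7 4096
```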
In a sequel we compare BSI to alternative structural inference methods. The range of, and differences with, these methods is large, and so a comparison demands its own venue. The sequel also addresses expanding the model candidates beyond the set of topological ε-machines to the full set of unifilar hidden Markov models, a necessary step before useful comparisons can be explored.

VII. CONCLUSION

We demonstrated effective and efficient inference of topological ε-machines using a library of candidate structures and the tools of Bayesian inference. Several avenues for further development are immediately obvious. First, as just noted, using full, unrestricted ε-machines (allowing models outside the set of topological ε-machines) is straightforward. This will provide a broad array of candidates within the more general class of unifilar hidden Markov models. In the present setting, by way of contrast, processes with full support (all words allowed) can map only to the single-state topology. Second, refining the eminently parallelizable Bayesian Structural Inference algorithms will allow them to take advantage of large compute clusters and cloud computing, dramatically expanding the number of candidate topologies considered. For comparison, the current implementation uses nonoptimized Python on a single thread. This configuration (running on a contemporary Linux compute node) takes between 0.6 and 1.6 hours, depending on the number of accepting topologies, to calculate the posterior distribution over the 36,660 candidates for a data series of length 2^17. An additional 10 to 20 minutes is needed to generate the 50,000 samples from the posterior used to estimate functions of model parameters, like h_µ and C_µ.
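Since each candidate topology's contribution to the posterior can be computed independently, the loop over M parallelizes trivially. A sketch of that structure, with `log_evidence` as a placeholder for the per-topology calculation; this is not the paper's actual implementation.

```python
import math
from multiprocessing import Pool

# Placeholder for the per-topology computation: the marginal likelihood
# of the data given the machine (plus its log-prior). The real calculation
# integrates out transition probabilities and start states.
def log_evidence(args):
    topology, data = args
    return 0.0  # placeholder value

def posterior_over_candidates(candidates, data, processes=8):
    """Normalized posterior weights over the candidate set.
    Call from under `if __name__ == "__main__":` on platforms that spawn
    worker processes."""
    with Pool(processes) as pool:
        logs = pool.map(log_evidence, [(m, data) for m in candidates])
    top = max(logs)
    weights = [math.exp(l - top) for l in logs]  # stable normalization
    z = sum(weights)
    return [w / z for w in weights]
```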
We note that the methods of Bayesian Structural Inference can be applied to any set of unifilar hidden Markov models and, moreover, they do not have to employ a large, enumerated library. For example, a small set of candidate fifty-state topologies could be compared for a given data series. This ability opens the door to automated methods for generating candidate structures. Of course, as always, one must keep in mind that all inferences are then conditioned on the, possibly limited or inappropriate, set of model topologies chosen.

Finally, let's return to the scientific and engineering problem areas cited in the introduction that motivated structural inference in the first place. Generally, Bayesian Structural Inference will find application in fields, such as those mentioned, that rely on finite-order Markov chains or the broader class of (nonunifilar) hidden Markov models. It will also find application in areas requiring accurate estimates of various system statistics. The model class considered here (ε-machines) consists of a novel set of topologies and usefully allows one to estimate both randomness and structure via h_µ and C_µ, two of the most basic informational measures. As a result, we expect Bayesian Structural Inference to find an array of applications in bioinformatics, linguistics, and dynamical systems.

ACKNOWLEDGMENTS

The authors thank Ryan James and Chris Ellison for helpful comments and implementation advice. Partial support was provided by ARO grants W911NF-12-1-0234 and W911NF-12-1-0288.

[1] B.-J. Yoon. Hidden Markov models and their applications in biological sequence analysis. Curr. Genomics, 10:402–415, 2009.
[2] L. Narlikar, N. Mehta, S. Galande, and M. Arjunwadkar. One size does not fit all: On how Markov model order dictates performance of genomic sequence analyses. Nucleic Acids Res., 2012.
[3] R. L. Davidchack, Y.-C. Lai, E. M. Bollt, and M. Dhamala. Estimating generating partitions of chaotic systems by unstable periodic orbits. Phys. Rev. E, 61:1353–1356, 2000.
[4] C. S. Daw, C. E. A. Finney, and E. R. Tracy. A review of symbolic analysis of experimental data. Rev. Sci. Instrum., 74:915–930, 2003.
[5] M. B. Kennel and M. Buhl. Estimating good discrete partitions from observed data: Symbolic false nearest neighbors. Phys. Rev. Lett., 91:084102, 2003.
[6] C. C. Strelioff and J. P. Crutchfield. Optimal instruments and models for noisy chaos. CHAOS, 17:043127, 2007.
[7] R. P. N. Rao, N. Yadav, M. N. Vahia, H. Joglekar, R. Adhikari, and I. Mahadevan. A Markov model of the Indus script. Proc. Natl. Acad. Sci. USA, 106:13685–13690, 2009.
[8] R. Lee, P. Jonathan, and P. Ziman. Pictish symbols revealed as a written language through application of Shannon entropy. Proc. Roy. Soc. A, 2010.
[9] D. Kelly, M. Dillingham, A. Hudson, and K. Wiesner. A new method for inferring hidden Markov models from noisy time sequences. PLoS ONE, 7:e29703, 2012.
[10] C.-B. Li, H. Yang, and T. Komatsuzaki. Multiscale complex network of protein conformational fluctuations in single-molecule time series. Proc. Natl. Acad. Sci. USA, 105:536–541, 2008.
[11] P. Graben, J. D. Saddy, M. Schlesewsky, and J. Kurths. Symbolic dynamics of event-related brain potentials. Phys. Rev. E, 62:5518–5541, 2000.
[12] R. Haslinger, K. L. Klinkner, and C. R. Shalizi. The computational structure of spike trains. Neural Comput., 22:121–157, 2009.
[13] D. P. Varn, G. S. Canright, and J. P. Crutchfield. Inferring pattern and disorder in close-packed structures via ε-machine reconstruction theory: Structure and intrinsic computation in zinc sulphide. Acta Cryst. Sec. B, 63(2):169–182, 2007.
[14] D. P. Varn, G. S. Canright, and J. P. Crutchfield. ε-Machine spectral reconstruction theory: A direct method for inferring planar disorder and structure from X-ray diffraction studies. Acta Cryst. Sec. A, 69(2):197–206, 2013.
[15] J. P. Crutchfield. Between order and chaos. Nature Physics, 8(January):17–24, 2012.
[16] J. P. Crutchfield and K. Young. Inferring statistical complexity. Phys. Rev. Lett., 63:105–108, 1989.
[17] D. P. Varn, G. S. Canright, and J. P. Crutchfield. Discovering planar disorder in close-packed structures from X-ray diffraction: Beyond the fault model. Phys. Rev. B, 66(17):174110–3, 2002.
[18] C. R. Shalizi, K. L. Shalizi, and J. P. Crutchfield. Pattern discovery in time series, Part I: Theory, algorithm, analysis, and convergence. 2002. Santa Fe Institute Working Paper 02-10-060; arXiv.org/abs/cs.LG/0210025.
[19] C. R. Shalizi, K. L. Shalizi, and R. Haslinger. Quantifying self-organization with optimal predictors. Phys. Rev. Lett., 93:118701, 2004.
[20] B. D. Johnson, J. P. Crutchfield, C. J. Ellison, and C. S. McTague. Enumerating finitary processes. 2012. SFI Working Paper 10-11-027; arxiv.org:1011.0036 [cs.FL].
[21] E. Ott. Chaos in Dynamical Systems. Cambridge University Press, New York, 1993.
[22] S. H. Strogatz. Nonlinear Dynamics and Chaos: with applications to physics, biology, chemistry, and engineering. Addison-Wesley, Reading, Massachusetts, 1994.
[23] D. Lind and B. Marcus. An Introduction to Symbolic Dynamics and Coding. Cambridge University Press, New York, 1995.
[24] A. B. Gelman, J. S. Carlin, H. S. Stern, and D. B. Rubin. Bayesian Data Analysis. Chapman & Hall/CRC, 1995.
[25] C. R. Shalizi and J. P. Crutchfield. Computational mechanics: Pattern and prediction, structure and simplicity. J. Stat. Phys., 104:817–879, 2001.
[26] N. Travers and J. P. Crutchfield. Exact synchronization for finite-state sources. J. Stat. Phys., 145(5):1181–1201, 2011.
[27] N. Travers and J. P. Crutchfield. Asymptotic synchronization for finite-state sources. J. Stat. Phys., 145(5):1202–1223, 2011.
[28] N. Travers and J. P. Crutchfield. Equivalence of history and generator ε-machines. 2011. SFI Working Paper 11-11-051; arxiv.org:1111.4500 [math.PR].
[29] C. C. Strelioff, J. P. Crutchfield, and A. W. Hübler. Inferring Markov chains: Bayesian estimation, model comparison, entropy rate, and out-of-class modeling. Phys. Rev. E, 76:011106, 2007.
[30] C. J. Ellison, J. R. Mahoney, and J. P. Crutchfield. Prediction, retrodiction, and the amount of information stored in the present. J. Stat. Phys., 136(6):1005–1034, 2009.
[31] C. J. Ellison and J. P. Crutchfield. States of states of uncertainty. arxiv.org: 13XX.XXXX, in preparation.
[32] L. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE, 77:257–286, 1989.
[33] S. S. Wilks. Mathematical Statistics. John Wiley & Sons, Inc., New York, NY, 1962.
[34] D. H. Wolpert and D. R. Wolf. Estimating functions of probability distributions from a finite set of samples. Phys. Rev. E, 52:6841–6854, 1995.
[35] L. Yuan and H. K. Kesavan. Bayesian estimation of Shannon entropy. Commun. Stat. Theory Methods, 26:139–148, 1997.
[36] J. P. Crutchfield. The calculi of emergence: Computation, dynamics, and induction. Physica D, 75:11–54, 1994.
[37] K. J. Lang, B. A. Pearlmutter, and R. A. Price. Results of the Abbadingo One DFA learning competition and a new evidence-driven state merging algorithm. In V. Honavar and G. Slutzki, editors, Grammatical Inference, volume 1433 of Lect. Notes Comp. Sci., pages 1–12. Springer Berlin Heidelberg, 1998.
[38] C. de la Higuera. A bibliographical study of grammatical inference. Patt. Recog., 38:1332–1348, 2005.

Supplementary Material
Bayesian Structural Inference for Hidden Processes
Christopher C. Strelioff and James P. Crutchfield

Appendix A: Overview

The supplementary materials provide tables and figures that give an in-depth picture of the Bayesian Structural Inference examples. Unless otherwise noted, all analyses presented here use the same single data series and parameter settings detailed in the main text. Please use the main text as the primary guide.

The first three sections address the Golden Mean, Even, and SNS processes. Each provides a table of estimates of h_µ and C_µ using the complete set M of one- to five-state ε-machines. Estimates are given for each subsample length L = 2^i, where i = 0, 1, 2, ..., 17, as in the main text. To be clear, this means that we analyze subsamples D_{0:L} = x_0 x_1 x_2 ... x_{L-1} using different initial segments of a single long data series, allowing for a consistent view of estimate convergence.
For both information-theoretic quantities, we list the posterior mean and an equal-tailed, 95% credible interval (CI) constructed using the 2.5% and 97.5% quantiles estimated from 50,000 samples of the posterior distribution. The CI is denoted by parenthesized number pairs. A second table provides the same estimates of h_µ and C_µ using only the M_MAP model. As a result, this table no longer reflects uncertainty in model topology, which may be small or large depending on the data and subsample length under consideration. An additional column in this second table provides the MAP topology along with its posterior probability, the latter denoted in parentheses.

In addition to the tables of estimates, figures demonstrate the convergence of the h_µ and C_µ marginal posterior distributions as a function of subsample length L. In these, we consider the difference between posteriors using the complete set M of candidate models and those that employ only the MAP topology. This set of figures also illustrates the difference between β = 4 and β = 2. (We use different data, but still a single time series, for the β = 2 example.) In all plots the marginal posterior distribution for the quantity of interest is estimated using a Gaussian kernel density estimate (Gkde) computed from 50,000 samples of the appropriate posterior. If there is little or no variation in the samples, the Gkde fails and no density is drawn. This happens, for example, when the MAP topology has one state, and C_µ = 0, at small data sizes. The posterior samples are valid, however, and the posterior mean and credible interval can still be provided (see tables).

Section E plots the number of accepting topologies as a function of subsample length for each of the example data sources in Fig. S10. The panels demonstrate that there are many valid candidate topologies for a given data series, even when subsamples of considerable length are available. Finally, Sec. F illustrates all topologies that met the MAP criterion for the data sources considered. Notably, there are not many structures to consider, despite the large number of topologies that accept the data.
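For concreteness, the posterior mean, the equal-tailed 95% CI, and the Gkde can all be computed directly from posterior samples. A sketch using numpy and scipy, with stand-in random draws in place of actual posterior samples:

```python
import numpy as np
from scipy.stats import gaussian_kde

def summarize(samples):
    """Posterior mean and equal-tailed 95% credible interval,
    as reported in the tables below."""
    lo, hi = np.quantile(samples, [0.025, 0.975])
    return float(np.mean(samples)), (float(lo), float(hi))

# Stand-in draws; in practice these are 50,000 posterior samples of
# h_mu or C_mu.
samples = np.random.default_rng(0).beta(8.0, 4.0, size=50_000)
print(summarize(samples))

# Gaussian kernel density estimate of the marginal posterior. With
# (near-)constant samples the bandwidth estimate is singular and the
# Gkde fails, as noted above for the one-state MAP topology, where
# C_mu is identically zero; the mean and CI remain valid.
try:
    density = gaussian_kde(samples)
    xs = np.linspace(samples.min(), samples.max(), 200)
    ys = density(xs)
except np.linalg.LinAlgError:
    ys = None  # no density drawn
```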
Appendix B: Golden Mean Process: Structural Inference

TABLE S1. Inference of Golden Mean Process properties using M, β = 4.

L        h_µ mean (95% CI)                   C_µ mean (95% CI)
1        6.767e-01 (3.682e-02, 9.994e-01)    1.467e-01 (0.000e+00, 1.333e+00)
2        6.400e-01 (6.662e-02, 9.990e-01)    1.074e-01 (0.000e+00, 1.089e+00)
4        7.771e-01 (2.760e-01, 9.996e-01)    1.146e-01 (0.000e+00, 1.000e+00)
8        7.753e-01 (3.557e-01, 9.994e-01)    1.441e-01 (0.000e+00, 1.000e+00)
16       7.941e-01 (4.751e-01, 9.976e-01)    1.128e-01 (0.000e+00, 9.469e-01)
32       7.697e-01 (5.221e-01, 9.773e-01)    2.564e-01 (0.000e+00, 1.556e+00)
64       6.440e-01 (5.207e-01, 6.942e-01)    1.052e+00 (8.235e-01, 1.797e+00)
128      6.575e-01 (5.953e-01, 6.930e-01)    9.209e-01 (8.667e-01, 9.590e-01)
256      6.684e-01 (6.311e-01, 6.917e-01)    9.128e-01 (8.740e-01, 9.437e-01)
512      6.718e-01 (6.477e-01, 6.889e-01)    9.107e-01 (8.835e-01, 9.338e-01)
1024     6.622e-01 (6.428e-01, 6.780e-01)    9.217e-01 (9.048e-01, 9.369e-01)
2048     6.618e-01 (6.483e-01, 6.736e-01)    9.225e-01 (9.107e-01, 9.333e-01)
4096     6.587e-01 (6.490e-01, 6.678e-01)    9.253e-01 (9.172e-01, 9.329e-01)
8192     6.645e-01 (6.582e-01, 6.704e-01)    9.203e-01 (9.143e-01, 9.259e-01)
16384    6.643e-01 (6.599e-01, 6.685e-01)    9.205e-01 (9.164e-01, 9.245e-01)
32768    6.647e-01 (6.615e-01, 6.676e-01)    9.202e-01 (9.173e-01, 9.231e-01)
65536    6.662e-01 (6.640e-01, 6.682e-01)    9.188e-01 (9.167e-01, 9.208e-01)
131072   6.670e-01 (6.655e-01, 6.684e-01)    9.180e-01 (9.165e-01, 9.194e-01)

TABLE S2. Inference of Golden Mean Process properties using M_MAP, β = 4.

L        h_µ mean (95% CI)                   C_µ mean (95% CI)                   MAP topology (post. prob.)
1        7.221e-01 (9.729e-02, 9.996e-01)    0.000e+00 (0.000e+00, 0.000e+00)    n1k2id3 (8.570e-01)
2        6.603e-01 (6.849e-02, 9.992e-01)    0.000e+00 (0.000e+00, 0.000e+00)    n1k2id3 (8.954e-01)
4        8.116e-01 (3.066e-01, 9.997e-01)    0.000e+00 (0.000e+00, 0.000e+00)    n1k2id3 (8.896e-01)
8        8.129e-01 (3.811e-01, 9.995e-01)    0.000e+00 (0.000e+00, 0.000e+00)    n1k2id3 (8.600e-01)
16       8.141e-01 (4.787e-01, 9.981e-01)    0.000e+00 (0.000e+00, 0.000e+00)    n1k2id3 (8.795e-01)
32       8.134e-01 (5.668e-01, 9.830e-01)    0.000e+00 (0.000e+00, 0.000e+00)    n1k2id3 (7.324e-01)
64       6.636e-01 (5.842e-01, 6.942e-01)    9.061e-01 (8.188e-01, 9.622e-01)    n2k2id5 (7.873e-01)
128      6.577e-01 (5.962e-01, 6.929e-01)    9.198e-01 (8.666e-01, 9.583e-01)    n2k2id5 (9.971e-01)
256      6.684e-01 (6.316e-01, 6.918e-01)    9.125e-01 (8.736e-01, 9.433e-01)    n2k2id5 (9.987e-01)
512      6.717e-01 (6.477e-01, 6.889e-01)    9.108e-01 (8.836e-01, 9.338e-01)    n2k2id5 (9.994e-01)
1024     6.621e-01 (6.429e-01, 6.781e-01)    9.217e-01 (9.046e-01, 9.369e-01)    n2k2id5 (9.997e-01)
2048     6.617e-01 (6.481e-01, 6.735e-01)    9.226e-01 (9.108e-01, 9.335e-01)    n2k2id5 (9.998e-01)
4096     6.588e-01 (6.491e-01, 6.677e-01)    9.253e-01 (9.172e-01, 9.328e-01)    n2k2id5 (9.999e-01)
8192     6.645e-01 (6.582e-01, 6.705e-01)    9.202e-01 (9.143e-01, 9.259e-01)    n2k2id5 (1.000e+00)
16384    6.643e-01 (6.599e-01, 6.685e-01)    9.205e-01 (9.164e-01, 9.245e-01)    n2k2id5 (1.000e+00)
32768    6.646e-01 (6.616e-01, 6.677e-01)    9.202e-01 (9.173e-01, 9.231e-01)    n2k2id5 (1.000e+00)
65536    6.662e-01 (6.640e-01, 6.682e-01)    9.188e-01 (9.167e-01, 9.208e-01)    n2k2id5 (1.000e+00)
131072   6.670e-01 (6.655e-01, 6.684e-01)    9.180e-01 (9.165e-01, 9.194e-01)    n2k2id5 (1.000e+00)

FIG. S1. Golden Mean Process, β = 4: Convergence of the posterior densities for C_µ (top) and h_µ (bottom) as a function of subsample length L, using the set of all topological ε-machines with one to five states, M (left column), and the maximum a posteriori model M_MAP (right column). In each panel, the black dashed line indicates the true value and the gray solid line shows the posterior mean.
FIG. S2. Golden Mean Process, β = 2: Convergence of the posterior densities for C_µ (top) and h_µ (bottom) as a function of subsample length L, using the set of all topological ε-machines with one to five states, M (left column), and the maximum a posteriori model M_MAP (right column). In each panel, the black dashed line indicates the true value and the gray solid line shows the posterior mean. Contrast these panels with those in Fig. S1, where the penalty for structure is higher.

FIG. S3. Golden Mean Process: Joint distribution samples using the MAP model at the given lengths instead of the full set of candidate models. Colors correspond to data subsample length, as in previous plots. The MAP topology for L = 1 (black) has one state and C_µ = 0, as indicated by the samples in the h_µ-C_µ plane. No Gkde approximation of these samples is provided due to this complete lack of variation.

Appendix C: Even Process: Structural Inference

TABLE S3. Inference of Even Process properties using M, β = 4.

L        h_µ mean (95% CI)                   C_µ mean (95% CI)
1        6.777e-01 (3.811e-02, 9.994e-01)    1.480e-01 (0.000e+00, 1.388e+00)
2        7.414e-01 (0.000e+00, 9.997e-01)    2.222e-01 (0.000e+00, 1.528e+00)
4        7.697e-01 (2.359e-01, 9.996e-01)    1.191e-01 (0.000e+00, 1.224e+00)
8        8.572e-01 (4.097e-01, 9.998e-01)    1.249e-01 (0.000e+00, 1.422e+00)
16       8.235e-01 (4.751e-01, 9.998e-01)    3.080e-01 (0.000e+00, 9.454e-01)
32       6.457e-01 (4.655e-01, 9.616e-01)    6.909e-01 (0.000e+00, 8.961e-01)
64       6.804e-01 (6.276e-01, 6.942e-01)    8.746e-01 (7.675e-01, 9.464e-01)
128      6.824e-01 (6.453e-01, 6.942e-01)    8.854e-01 (8.166e-01, 9.359e-01)
256      6.783e-01 (6.485e-01, 6.939e-01)    8.993e-01 (8.568e-01, 9.333e-01)
512      6.679e-01 (6.422e-01, 6.868e-01)    9.151e-01 (8.890e-01, 9.374e-01)
1024     6.756e-01 (6.602e-01, 6.874e-01)    9.069e-01 (8.875e-01, 9.243e-01)
2048     6.700e-01 (6.581e-01, 6.801e-01)    9.144e-01 (9.016e-01, 9.260e-01)
4096     6.666e-01 (6.578e-01, 6.744e-01)    9.181e-01 (9.096e-01, 9.263e-01)
8192     6.704e-01 (6.647e-01, 6.757e-01)    9.142e-01 (9.080e-01, 9.202e-01)
16384    6.666e-01 (6.623e-01, 6.707e-01)    9.183e-01 (9.141e-01, 9.225e-01)
32768    6.660e-01 (6.629e-01, 6.689e-01)    9.189e-01 (9.160e-01, 9.219e-01)
65536    6.657e-01 (6.635e-01, 6.677e-01)    9.193e-01 (9.172e-01, 9.213e-01)
131072   6.658e-01 (6.643e-01, 6.672e-01)    9.192e-01 (9.177e-01, 9.206e-01)

TABLE S4. Inference of Even Process properties using M_MAP, β = 4.
L        h_µ mean (95% CI)                   C_µ mean (95% CI)                   MAP topology (post. prob.)
1        7.226e-01 (1.003e-01, 9.996e-01)    0.000e+00 (0.000e+00, 0.000e+00)    n1k2id3 (8.570e-01)
2        8.426e-01 (3.541e-01, 9.998e-01)    0.000e+00 (0.000e+00, 0.000e+00)    n1k2id3 (7.893e-01)
4        8.100e-01 (2.982e-01, 9.997e-01)    0.000e+00 (0.000e+00, 0.000e+00)    n1k2id3 (8.721e-01)
8        9.027e-01 (5.764e-01, 9.999e-01)    0.000e+00 (0.000e+00, 0.000e+00)    n1k2id3 (8.626e-01)
16       9.517e-01 (7.735e-01, 9.999e-01)    0.000e+00 (0.000e+00, 0.000e+00)    n1k2id3 (6.023e-01)
32       6.316e-01 (4.650e-01, 6.941e-01)    7.152e-01 (4.825e-01, 8.861e-01)    n2k2id7 (9.434e-01)
64       6.802e-01 (6.282e-01, 6.942e-01)    8.728e-01 (7.690e-01, 9.445e-01)    n2k2id7 (9.941e-01)
128      6.823e-01 (6.456e-01, 6.942e-01)    8.845e-01 (8.165e-01, 9.351e-01)    n2k2id7 (9.973e-01)
256      6.783e-01 (6.483e-01, 6.939e-01)    8.991e-01 (8.566e-01, 9.334e-01)    n2k2id7 (9.989e-01)
512      6.681e-01 (6.426e-01, 6.869e-01)    9.149e-01 (8.887e-01, 9.370e-01)    n2k2id7 (9.995e-01)
1024     6.757e-01 (6.604e-01, 6.873e-01)    9.068e-01 (8.878e-01, 9.241e-01)    n2k2id7 (9.997e-01)
2048     6.700e-01 (6.581e-01, 6.801e-01)    9.143e-01 (9.017e-01, 9.260e-01)    n2k2id7 (9.999e-01)
4096     6.666e-01 (6.579e-01, 6.744e-01)    9.181e-01 (9.096e-01, 9.262e-01)    n2k2id7 (9.999e-01)
8192     6.704e-01 (6.647e-01, 6.757e-01)    9.142e-01 (9.080e-01, 9.202e-01)    n2k2id7 (1.000e+00)
16384    6.666e-01 (6.623e-01, 6.707e-01)    9.183e-01 (9.141e-01, 9.224e-01)    n2k2id7 (1.000e+00)
32768    6.660e-01 (6.629e-01, 6.689e-01)    9.189e-01 (9.160e-01, 9.219e-01)    n2k2id7 (1.000e+00)
65536    6.657e-01 (6.635e-01, 6.678e-01)    9.193e-01 (9.172e-01, 9.213e-01)    n2k2id7 (1.000e+00)
131072   6.658e-01 (6.642e-01, 6.672e-01)    9.192e-01 (9.177e-01, 9.207e-01)    n2k2id7 (1.000e+00)

FIG. S4. Even Process, β = 4: Convergence of the posterior densities for C_µ (top) and h_µ (bottom) as a function of subsample length L, using the set of all topological ε-machines with one to five states, M (left column), and the maximum a posteriori model M_MAP (right column). In each panel, the black dashed line indicates the true value and the gray solid line shows the posterior mean.

FIG. S5. Even Process, β = 2: Convergence of the posterior densities for C_µ (top) and h_µ (bottom) as a function of subsample length L, using the set of all topological ε-machines with one to five states, M (left column), and the maximum a posteriori model M_MAP (right column). In each panel, the black dashed line indicates the true value and the gray solid line shows the posterior mean.

FIG. S6. Even Process: Samples of the joint distribution using the MAP model at the given lengths instead of the full set of candidate models. Colors correspond to data subsample length, as in previous plots. The MAP topology for L = 1 (black) has one state and C_µ = 0, as indicated by the samples in the h_µ-C_µ plane. No Gkde approximation of these samples is provided due to this complete lack of variation.

Appendix D: SNS Process: Structural Inference

TABLE S5. Inference of SNS Process properties using M, β = 4.
L        h_µ mean (95% CI)                   C_µ mean (95% CI)
1        6.780e-01 (3.817e-02, 9.993e-01)    1.483e-01 (0.000e+00, 1.325e+00)
2        7.425e-01 (0.000e+00, 9.997e-01)    2.207e-01 (0.000e+00, 1.525e+00)
4        7.698e-01 (2.398e-01, 9.997e-01)    1.207e-01 (0.000e+00, 1.225e+00)
8        7.781e-01 (3.449e-01, 9.994e-01)    1.326e-01 (0.000e+00, 1.357e+00)
16       7.952e-01 (2.702e-01, 9.994e-01)    3.679e-01 (0.000e+00, 2.084e+00)
32       7.555e-01 (4.978e-01, 9.605e-01)    8.161e-02 (0.000e+00, 8.579e-01)
64       7.228e-01 (5.935e-01, 9.142e-01)    4.627e-01 (0.000e+00, 1.043e+00)
128      6.808e-01 (6.365e-01, 6.942e-01)    8.006e-01 (6.982e-01, 8.808e-01)
256      6.756e-01 (6.411e-01, 6.937e-01)    7.801e-01 (7.088e-01, 8.407e-01)
512      6.799e-01 (6.562e-01, 6.929e-01)    8.151e-01 (7.419e-01, 1.390e+00)
1024     6.849e-01 (6.693e-01, 6.931e-01)    9.021e-01 (7.717e-01, 1.757e+00)
2048     6.827e-01 (6.701e-01, 6.922e-01)    1.441e+00 (7.905e-01, 2.219e+00)
4096     6.825e-01 (6.756e-01, 6.896e-01)    1.787e+00 (1.673e+00, 2.228e+00)
8192     6.828e-01 (6.782e-01, 6.874e-01)    2.002e+00 (1.692e+00, 2.233e+00)
16384    6.800e-01 (6.769e-01, 6.832e-01)    2.198e+00 (2.168e+00, 2.231e+00)
32768    6.789e-01 (6.766e-01, 6.811e-01)    2.197e+00 (2.170e+00, 2.229e+00)
65536    6.784e-01 (6.769e-01, 6.800e-01)    2.199e+00 (2.174e+00, 2.228e+00)
131072   6.788e-01 (6.777e-01, 6.799e-01)    2.201e+00 (2.178e+00, 2.230e+00)

TABLE S6. Inference of SNS Process properties using M_MAP, β = 4.

L        h_µ mean (95% CI)                   C_µ mean (95% CI)                   MAP topology (post. prob.)
1        7.231e-01 (9.607e-02, 9.996e-01)    0.000e+00 (0.000e+00, 0.000e+00)    n1k2id3 (8.570e-01)
2        8.414e-01 (3.462e-01, 9.998e-01)    0.000e+00 (0.000e+00, 0.000e+00)    n1k2id3 (7.893e-01)
4        8.086e-01 (2.981e-01, 9.997e-01)    0.000e+00 (0.000e+00, 0.000e+00)    n1k2id3 (8.721e-01)
8        8.136e-01 (3.826e-01, 9.996e-01)    0.000e+00 (0.000e+00, 0.000e+00)    n1k2id3 (8.829e-01)
16       8.800e-01 (5.927e-01, 9.997e-01)    0.000e+00 (0.000e+00, 0.000e+00)    n1k2id3 (7.774e-01)
32       7.665e-01 (5.040e-01, 9.641e-01)    0.000e+00 (0.000e+00, 0.000e+00)    n1k2id3 (9.105e-01)
64       6.712e-01 (5.947e-01, 6.942e-01)    7.842e-01 (6.406e-01, 8.918e-01)    n2k2id5 (5.301e-01)
128      6.803e-01 (6.370e-01, 6.942e-01)    7.981e-01 (7.021e-01, 8.756e-01)    n2k2id5 (9.835e-01)
256      6.755e-01 (6.408e-01, 6.937e-01)    7.786e-01 (7.083e-01, 8.393e-01)    n2k2id5 (9.953e-01)
512      6.804e-01 (6.600e-01, 6.928e-01)    7.887e-01 (7.416e-01, 8.313e-01)    n2k2id5 (9.721e-01)
1024     6.858e-01 (6.746e-01, 6.929e-01)    8.029e-01 (7.714e-01, 8.321e-01)    n2k2id5 (8.989e-01)
2048     6.871e-01 (6.801e-01, 6.922e-01)    8.066e-01 (7.848e-01, 8.273e-01)    n2k2id5 (3.419e-01)
4096     6.826e-01 (6.760e-01, 6.893e-01)    1.703e+00 (1.672e+00, 1.733e+00)    n4k2id3334 (1.336e-01)
8192     6.834e-01 (6.792e-01, 6.877e-01)    1.709e+00 (1.687e+00, 1.730e+00)    n4k2id3334 (6.462e-02)
16384    6.800e-01 (6.769e-01, 6.831e-01)    2.177e+00 (2.166e+00, 2.188e+00)    n5k2id22979 (8.630e-02)
32768    6.789e-01 (6.766e-01, 6.810e-01)    2.176e+00 (2.169e+00, 2.184e+00)    n5k2id22979 (8.632e-02)
65536    6.784e-01 (6.769e-01, 6.799e-01)    2.178e+00 (2.173e+00, 2.184e+00)    n5k2id22979 (8.560e-02)
131072   6.788e-01 (6.777e-01, 6.798e-01)    2.181e+00 (2.177e+00, 2.185e+00)    n5k2id22979 (8.539e-02)

FIG. S7. Simple Nonunifilar Source, β = 4: Convergence of the posterior densities for C_µ (top) and h_µ (bottom) as a function of subsample length L, using the set of all topological ε-machines with one to five states, M (left column), and the maximum a posteriori model M_MAP (right column). In each panel, the black dashed line indicates the true value and the gray solid line shows the posterior mean.
FIG. S8. Simple Nonunifilar Source, β = 2: Convergence of the posterior densities for C_µ (top) and h_µ (bottom) as a function of subsample length L, using the set of all topological ε-machines with one to five states, M (left column), and the maximum a posteriori model M_MAP (right column). In each panel, the black dashed line indicates the true value and the gray solid line shows the posterior mean.

FIG. S9. Simple Nonunifilar Source: Joint distribution samples using the MAP model at the given lengths instead of the full set of candidate models. Colors correspond to data subsample length, as in previous plots. The MAP topology for L = 1 (black) has one state and C_µ = 0, as indicated by the samples in the h_µ-C_µ plane. No Gkde approximation of these samples is provided due to this complete lack of variation.

Appendix E: Number of Accepting Topologies for Processes

FIG. S10. Number of accepting topologies for each of the example processes as a function of subsample length L. For each, a set of ten data series was created, and subsamples of length L were analyzed to determine the number of binary-alphabet, one- to five-state topological ε-machines that had at least one valid path for that length. (A valid path results in nonzero likelihood and posterior probability.) For each data series, a gray point is plotted. Overlapping gray points, created by multiple data series with the same number of accepting topologies at the given value of L, generate a darker gray or black point. The horizontal lines indicate the total number of candidate structures (36,660, gray dashed line) and the asymptotic number of accepting topologies (solid gray line). For the Even Process (bottom left panel), 3,813 topologies were asymptotically accepting, whereas the Golden Mean Process (top left panel) and Simple Nonunifilar Source (top right panel) both had 6,225.
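A sketch of this acceptance count, mirroring the path-following test shown in the main text; the two candidate topologies and the data series are toy stand-ins of ours for the full 36,660-machine library.

```python
# Count accepting topologies, as in Fig. S10: for each subsample length,
# tally the candidates that admit at least one valid path for the data.
def accepts(topology, data):
    states = {s for s, _ in topology}
    return any(_valid_from(topology, s, data) for s in states)

def _valid_from(topology, state, data):
    for x in data:
        nxt = topology.get((state, x))
        if nxt is None:
            return False
        state = nxt
    return True

candidates = {
    "n1k2id3": {("A", "0"): "A", ("A", "1"): "A"},                   # full support
    "n2k2id5": {("A", "1"): "A", ("A", "0"): "B", ("B", "1"): "A"},  # Golden Mean
}
data = "100100"  # contains "00", which the Golden Mean topology forbids
for L in (1, 2, 4, len(data)):
    n = sum(accepts(t, data[:L]) for t in candidates.values())
    print(L, n)  # the count drops from 2 to 1 once the prefix reaches "00"
```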
Appendix F: Maximum a posteriori topologies

Figure S11 lists all MAP topologies encountered when inferring ε-machine structure using data from the Even, Golden Mean, and SNS Processes. All processes had n1k2id3 (panel A) as the MAP topology for small L, reflecting a preference for small structures when limited data is available. The Golden Mean Process transitioned from n1k2id3 to n2k2id5 (panel B), the correct structure, at L = 64, as documented in Table S2. In a similar manner, the Even Process changed from n1k2id3 to n2k2id7 (panel C), the correct structure, at data size L = 32, as shown in Table S4. That these in-class ε-machine structures quickly converge on the correct topology is perhaps expected; predicting the sample size at which this occurs, however, is not obvious. The Simple Nonunifilar Source has a more complicated series of MAP topologies, starting with the simple n1k2id3 and progressing through n2k2id5 (panel B) to n4k2id3334 (panel D) and, finally, to n5k2id22979 (panel E) at data size 2^17. Of course, this out-of-class data source cannot be exactly captured by the set of finite-state unifilar ε-machines considered here. Nonetheless, we expect the size of the inferred model to increase if more data from the SNS were employed and a larger number of states were allowed. It is important to note that the MAP structure in this case has very low posterior probability. As discussed in the main text, the topology listed is one of five similar structures with nearly equal posterior probabilities.

FIG. S11. Maximum a posteriori topologies for the Golden Mean, Even, and SNS Process data series. Transitions are labeled only with the emitted output symbol. Transition probabilities are inferred from data for states that have more than one outgoing transition; transitions from states with only one outgoing edge must have probability one, by definition of the topology. Consult Tables S2, S4, and S6 to see when these structures corresponded to the MAP topologies for the given data sources.