Restricted Collapsed Draw: Accurate Sampling for Hierarchical Chinese Restaurant Process Hidden Markov Models
Authors: not specified in the paper body (arXiv ID: 1106.0474v1)
Abstract

We propose a restricted collapsed draw (RCD) sampler, a general Markov chain Monte Carlo sampler of simultaneous draws from a hierarchical Chinese restaurant process (HCRP) with restriction. Models that require simultaneous restricted draws from a hierarchical Dirichlet process, such as infinite hidden Markov models (iHMMs), have been unable to enjoy the benefits of the HCRP because of the combinatorial explosion involved in calculating the distributions of coupled draws. By constructing a proposal of seating arrangements (partitionings) and stochastically accepting the proposal with the Metropolis-Hastings algorithm, the RCD sampler samples accurately from complex combinations of draws while retaining the efficiency of the HCRP representation. Based on the RCD sampler, we developed a series of sophisticated sampling algorithms for iHMMs, including blocked Gibbs sampling, beam sampling, and split-merge sampling, that outperformed conventional iHMM samplers in experiments.

1 Introduction

Existing sampling algorithms for infinite hidden Markov models (iHMMs, also known as hierarchical Dirichlet process HMMs) [? ?] do not use the hierarchical Chinese restaurant process (HCRP) [?], which is a way of representing the predictive distribution of a hierarchical Dirichlet process (HDP) by collapsing, i.e., integrating out, the underlying distributions of the Dirichlet processes (DPs). While an HCRP representation provides efficient sampling for many other HDP-based models [??] by reducing the dimension of the sampling space, it has been considered rather "awkward" [?] to use an HCRP for iHMMs, owing to the difficulty of handling the coupling between random variables.
In the simplest case, consider step-wise Gibbs sampling from an iHMM defined as π_k ∼ DP(α_0, β) and β ∼ GEM(γ).

[Figure 1: Step-wise Gibbs sampling in an iHMM. Since the Dirichlet process prior is placed on transitions, resampling x_i involves taking two transitions, x_{i−1} → x_i and x_i → x_{i+1}, simultaneously. We therefore consider the distribution of two draws (x'_i, x'_{i+1}) with the restriction that the draws are consistent with the remaining sequence, i.e., x'_{i+1} = x_{i+1}.]

Given x_1, …, x_{i−1}, x_{i+1}, …, x_T, resampling hidden state x_i at time step i actually consists of two draws (Figure 1), x'_i ∼ π_{x_{i−1}} and x'_{i+1} ∼ π_{x'_i}, under the restriction (x'_i, x'_{i+1}) ∈ C that these draws be consistent with the following sequence, i.e., C = {(x'_i, x'_{i+1}) | x'_{i+1} = x_{i+1}}. Under the HCRP, the two draws are coupled even if x_{i−1} ≠ x'_i, because the distributions π_{x_{i−1}} and π_{x'_i} as well as the base measure β are integrated out, and this coupling complicates sampling from the restricted distribution.

To generalize, the main part of the difficulty is obtaining a sample from a restricted joint distribution of simultaneous draws from collapsed distributions, which we call a restricted collapsed draw (RCD). Consider resampling L draws simultaneously, x = (x_{j_1 i_1}, …, x_{j_L i_L}), from the respective restaurants j = (j_1, …, j_L), under a restriction C such that x ∈ C. Step-wise Gibbs sampling from an iHMM fits into the RCD setting with L = 2 by allowing restaurant index j_2 to depend on the preceding draw x_{j_1 i_1}. In this paper, we point out that it is not enough to consider the distribution of the draws.
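To make the coupling concrete, the following sketch (ours, not from the paper; a single-level collapsed CRP stands in for the full HCRP, and the counts are hypothetical) computes the unnormalized restricted joint of the two draws. The point is that the first draw must be seated before the second draw's predictive probability is evaluated, so the joint does not factorize into two independent predictives:

```python
from collections import Counter

alpha0 = 1.0
K = 2  # two states known so far (illustrative)
# trans_counts[j][k]: transitions j -> k seen so far (hypothetical data)
trans_counts = {0: Counter({0: 3, 1: 2}), 1: Counter({0: 1, 1: 4})}

def pred(j, k):
    """Collapsed predictive probability of drawing state k in restaurant j."""
    n = sum(trans_counts[j].values())
    return (trans_counts[j][k] + alpha0 / K) / (n + alpha0)

def joint_restricted(x_prev, k, x_next):
    """Unnormalized p(x'_i = k, x'_{i+1} = x_next | C): seat the first draw
    before evaluating the second, then undo the bookkeeping. The second
    factor depends on the first draw, which is the coupling in question."""
    p1 = pred(x_prev, k)
    trans_counts[x_prev][k] += 1   # the first draw becomes part of the state
    p2 = pred(k, x_next)
    trans_counts[x_prev][k] -= 1   # restore
    return p1 * p2

# the restricted joint is NOT the product of two independent predictives:
assert abs(joint_restricted(0, 0, 0) - pred(0, 0) * pred(0, 0)) > 1e-9
```

Normalizing over the K possible values of x'_i is cheap for L = 2, but the analogous enumeration over all L draws (and, in the full HCRP, over all seating arrangements) grows combinatorially, which is the explosion the RCD sampler avoids.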
Since the HCRP introduces an additional set of latent variables s that accounts for the seating arrangements of the restaurants, we also have to compute an exact distribution of s under the restriction. We want to sample from the following conditional distribution:

    p(x, s | C) = (1/Z_C) · I[x ∈ C] · p(x, s),    (1)

where Z_C is a normalization constant and I is the indicator function, whose value is 1 if the condition is true and 0 otherwise. Although the non-restricted probability p(x, s) is easily calculated for given x and s, calculating the normalization constant Z_C leads to a combinatorial explosion in terms of L.

To solve this issue, we propose the restricted collapsed draw (RCD) sampler, which provides accurate distributions of simultaneous draws and seating arrangements from an HCRP. The RCD sampler constructs a proposal of seating arrangements from a given proposal of draws, and the pair of proposals is stochastically accepted by the Metropolis-Hastings algorithm [?]. Since the RCD sampler can handle any combination of restricted collapsed draws simultaneously, we were able to develop a series of sampling methods for the HCRP-HMM, including a blocked collapsed Gibbs sampler, a collapsed beam sampler, and a split-merge sampler. Through experiments we found that our collapsed samplers outperformed their non-collapsed counterparts.

2 HCRP representation for iHMM

2.1 Infinite HMM

An infinite hidden Markov model (iHMM) [??] is defined over the following HDP:

    G_0 ∼ DP(γ, H),    G_k ∼ DP(α_0, G_0).    (2)

To see the relation of this HDP to the transition matrix π, consider the explicit representation of the parameters:

    G_0 = Σ_{k'=1}^∞ β_{k'} φ_{k'},    G_k = Σ_{k'=1}^∞ π_{k'k} φ_{k'},    (3)

where transition probability π_k is given as π_k ∼ DP(α_0, β), β ∼ GEM(γ) is the stick-breaking construction of DPs [?], and φ_k ∼ H.
A formal definition of the HDP based on this representation is:

    β | γ ∼ GEM(γ),    π_j | α_0, β ∼ DP(α_0, β),    (4)
    x_{ji} | (π_k)_{k=1}^∞ ∼ π_j,    φ_k ∼ H,    y_{ji} | x_{ji} ∼ F(φ_{x_{ji}}).    (5)

Given an HDP and an initial state x_0, we can construct an infinite HMM by extracting a sequence of draws x_i as x_i = x_{x_{i−1}, i}, with corresponding observations y_i = y_{x_i, i}. Figure 2 shows a graphical representation of the iHMM.

2.2 HCRP-HMM

As another way of representing the HDP in the iHMM (Eq. 2), we introduce the hierarchical Chinese restaurant process (HCRP, also known as the Chinese restaurant franchise), which does not require sampling the transition distributions π or their base measure β in Eq. (4):

    k_{jt} | γ ∼ CRP(γ),    t_{ji} | α_0 ∼ CRP(α_0),    (6)
    x_{ji} = k_{j, t_{ji}},    (7)
    φ_k ∼ H,    y_{ji} | x_{ji}, φ ∼ F(φ_{x_{ji}}).    (8)

Using the Chinese restaurant metaphor, we say that customer i of restaurant j sits at table t_{ji}, which serves the dish with index k_{j, t_{ji}}.

To understand the connection between the HDP and the HCRP, consider a finite model of grouped observations x_{ji}, in which each group j chooses a subset of M mixture components from a model-wide set of K mixture components:

    β | γ ∼ Dir(γ/K, …, γ/K),    k_{jt} | β ∼ β,    (9)
    τ_j | α_0 ∼ Dir(α_0/M, …, α_0/M),    t_{ji} | τ_j ∼ τ_j.    (10)

As K → ∞ and M → ∞, the limit of this model is the HCRP; hence the infinite limit of this model is also the HDP. Equation (6) is derived by taking the infinite limits of K and M after integrating out β and τ in Eqs. (9) and (10). The distribution π_j in Eq. (4) can be derived from τ_j and k_j as follows:

    π_j = Σ_t τ_{jt} δ_{k_{jt}}.    (11)

To consider sampling of x_{ji} using the HCRP (Eqs. 7 and 8), we use the count notation n_{jtk} for the number of customers in restaurant j sitting at table t and served the dish with index k, and m_{jk} for the number of tables in restaurant j serving the dish with index k.
We also use dots for marginal counts (e.g., m_{·k} = Σ_j m_{jk}). We sample the table index t_{ji} from the following distribution:

    p(t_{ji} = t | t_{j1}, …, t_{j,i−1}) = n_{jt·} / (n_{j··} + α_0),    (12)
    p(t_{ji} = t_new | t_{j1}, …, t_{j,i−1}) = α_0 / (n_{j··} + α_0).    (13)

When t_{ji} = t_new (i.e., the customer sits at a new table), we need to sample k_{j,t_new}, whose distribution is:

    p(k_{jt} = k | k_{11}, …) = m_{·k} / (m_{··} + γ),    (14)
    p(k_{jt} = k_new | k_{11}, …) = γ / (m_{··} + γ).    (15)

These variables determine the new sample x_{ji} = k_{j, t_{ji}}. Since x_{ji} does not uniquely determine the state of the HCRP model, we need to keep the latent variables t_{ji} and k_{jt} for subsequent sampling.

[Figure 2: Graphical representation of the iHMM.]

We denote by s^{(j)} = (t_{j1}, t_{j2}, …) the seating arrangement in restaurant j, by s^{(0)} = (k_{11}, k_{12}, …, k_{21}, …) the seating arrangement in the root restaurant, and by s the collection of all seating arrangements, corresponding to the sampled model state. In Bayesian inference based on sampling, we need a procedure, called addCustomer, that samples the latent variables given the value of a new draw x_{ji} and the seating arrangements s for the other draws.

Construction of the HCRP-HMM is the same as that of the iHMM: extract a sequence of draws x_i given x_0 as x_i = x_{x_{i−1}, i}, with corresponding observations y_i = y_{x_i, i}.

3 Restricted Collapsed Draw Sampler

What we want is a sampling algorithm for the HCRP-HMM. As described in the Introduction, the problem reduces to sampling from p(x, s | C), the distribution of a restricted collapsed draw with seating arrangements (Eq. 1). Our idea is to apply the Metropolis-Hastings algorithm [?] to the seating arrangements, stochastically accepting a proposed seating arrangement.
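As a concrete illustration of Eqs. (12)-(15), the addCustomer primitive can be sketched as follows (our own minimal data structures, not the paper's implementation; removeCustomer would undo the same bookkeeping in reverse):

```python
import random

class Restaurant:
    """One CRP restaurant j: table sizes n_{jt.} and table dishes k_{jt}."""
    def __init__(self, alpha0):
        self.alpha0 = alpha0
        self.table_counts = []  # n_{jt.}: customers per table
        self.table_dish = []    # k_{jt}: dish served at each table

def add_customer(rest, root_dish_counts, gamma, rng):
    """Seat one customer: pick a table via Eqs. (12)-(13); if a new table
    opens, draw its dish from the root CRP via Eqs. (14)-(15).
    root_dish_counts maps dish k to m_{.k}, the total tables serving k."""
    weights = rest.table_counts + [rest.alpha0]        # existing tables, then t_new
    t = rng.choices(range(len(weights)), weights=weights)[0]
    if t == len(rest.table_counts):                    # t = t_new
        dishes = list(root_dish_counts) + [None]       # None stands for k_new
        dish_w = list(root_dish_counts.values()) + [gamma]
        k = rng.choices(dishes, weights=dish_w)[0]
        if k is None:
            k = len(root_dish_counts)                  # fresh dish index
        root_dish_counts[k] = root_dish_counts.get(k, 0) + 1  # m_{.k} += 1
        rest.table_counts.append(1)
        rest.table_dish.append(k)
    else:
        rest.table_counts[t] += 1
    return rest.table_dish[t]                          # the new draw x_{ji}

rng = random.Random(7)
rest, root = Restaurant(alpha0=1.0), {}
draws = [add_customer(rest, root, gamma=1.0, rng=rng) for _ in range(10)]
```

Here root_dish_counts plays the role of the root restaurant s^{(0)}: every new table sends exactly one customer to the root, so m_{·k} counts tables, not customers.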
Although it is hard to directly specify a proposal distribution q(s) over seating arrangements, our method constructs q(s) by combining q_x(x) with q_s(s | x), a proposal of seating arrangements given the proposed draws, which is based on the addCustomer procedure standardly used in Gibbs sampling of the HCRP.

3.1 Overall sampling

The Metropolis-Hastings algorithm provides a way of constructing an MCMC sampler using an unnormalized probability value p̃(z). After sampling z* from a proposal distribution q(z* | z), the algorithm computes the acceptance probability R:

    R = min(1, [p̃(z*) / p̃(z_old)] · [q(z_old | z*) / q(z* | z_old)]).    (16)

The result is z_new = z* with probability R, and z_new = z_old otherwise. Repeating this process constitutes an MCMC sampler from the required distribution p(z) ∝ p̃(z).

Within the context of the HCRP, the sample space z consists of the draws x and the seating arrangements s. From Eq. (1), we can use the non-restricted probability p(x, s) as the unnormalized probability value p̃(z), but it is not easy to provide a proposal for the joint distribution q(x*, s*). Our idea is to factorize the proposal distribution as:

    q(x*, s* | s_0) = q_x(x* | s_0) · q_s(s* | x*, s_0).    (17)

The first factor q_x is the proposal distribution of the draws. The second factor q_s is the proposal distribution of the seating arrangements given the proposed draws; we use the result of the addCustomer procedure, which stochastically updates the seating arrangements, as this proposal.

3.2 Computing Factors

The following describes each factor in R and its computation.

True Probability. p(x, s) in Eq.
(1) is the joint probability of all draws x_{ji}:

    p(x, s) = Π_j p(s^{(j)}) · p(s^{(0)}),    (18)

where

    p(s^{(j)}) = [Γ(α_0) / Γ(α_0 + n_{j··})] · α_0^{m_{j·}} · Π_t Γ(n_{jt·}),    (19)
    p(s^{(0)}) = [Γ(γ) / Γ(γ + m_{··})] · γ^K · Π_k Γ(m_{·k}),    (20)

and Γ is the gamma function. This is the product of the probabilities of the seating arrangements (Eqs. 12 to 15) over all customers.

In practice, we only need to calculate the probabilities that account for the change in seating from s_0, because the probabilities of the unchanged customers cancel out in the fraction in R. Let s_0 be the seating arrangement of the unchanged customers; then

    p(s*) / p(s_old) = p(s* | s_0) / p(s_old | s_0).    (21)

In fact, p(x*, s* | s_0) is easily calculated along with addCustomer operations:

    p(x*, s* | s_0) = p(x*_1, s*_1 | s_0) · p(x*_2, s*_2 | s*_1) ⋯ p(x*_L, s* | s*_{L−1}).    (22)

Here, p(x_ℓ, s_ℓ | s_{ℓ−1}) is the probability p(x_{j_ℓ i_ℓ}, t_{j_ℓ i_ℓ} | j_ℓ, s_{ℓ−1}) of obtaining seating arrangement s_ℓ as the result of drawing a sample from restaurant j:

    p(x_{ji} = k, t_{ji} = t | j, s) = 1/(n_{j··} + α_0) ×
        { n_{jtk}                      if n_{jt·} ≥ 1,
          α_0 · m_{·k} / (m_{··} + γ)  if n_{jt·} = 0 and m_{·k} ≥ 1,
          α_0 · γ / (m_{··} + γ)       if n_{jt·} = 0 and m_{·k} = 0.    (23)

The same applies to the calculation of p(s_old | s_0), which can be done along with removeCustomer operations.

Proposal Distribution of Draws. q_x(x) can be anything as long as it is ergodic within the restriction C. To increase the acceptance probability, however, it is preferable that the proposal distribution be close to the true distribution. A good starting point is the joint distribution composed of the predictive distributions of the individual draws, as used in the approximated Gibbs sampler [?]:

    q_x(x) = I[x ∈ C] Π_{i=1}^L p(x_i | s_0).
(24)

We return to the proposal distribution of draws for the HCRP-HMM case in Section 4.

Proposal Distribution of Seating Arrangements. q_s(s* | x, s_0) is the product of the probabilities of the individual operations of adding a customer:

    q_s(s* | x*, s_0) = q_s(s*_1 | x*_1, s_0) · q_s(s*_2 | x*_2, s*_1) ⋯ q_s(s* | x*_L, s*_{L−1}).    (25)

Here, q_s(s_ℓ | x_ℓ, s_{ℓ−1}) = p(t_{j_ℓ i_ℓ} | x_{j_ℓ i_ℓ}, j_ℓ, s_{ℓ−1}), i.e., the probability of obtaining seating arrangement s_ℓ as the result of the addCustomer(x_{j_ℓ i_ℓ}, j_ℓ, s_{ℓ−1}) operation:    (26)

    p(t_{ji} = t | x_{ji} = k, j, s) =
        { n_{jtk} / (n_{j·k} + α_0 m_{·k} / (m_{··} + γ))                      if n_{jt·} ≥ 1 and k_{jt} = k,
          [α_0 m_{·k} / (m_{··} + γ)] / (n_{j·k} + α_0 m_{·k} / (m_{··} + γ))  if n_{jt·} = 0 and m_{·k} > 0,
          1                                                                    if n_{jt·} = 0 and m_{·k} = 0.    (27)

3.3 Simplification

Noting that both Eqs. (23) and (27) are calculated along a series of addCustomer calls, we introduce the factors

    r*_ℓ = p(x*_ℓ, s_ℓ | s_{ℓ−1}) / q_s(s_ℓ | x*_ℓ, s_{ℓ−1}),    r_old,ℓ = p(x_old,ℓ, s_ℓ | s_{ℓ−1}) / q_s(s_ℓ | x_old,ℓ, s_{ℓ−1}),    (28)

which simplify the calculation of R:

    R = min(1, [p(s*) / p(s_old)] · [q_s(s_old | x_old, s_0) / q_s(s* | x*, s_0)] · [q_x(x_old) / q_x(x*)])
      = min(1, [r(x*, s* | s_0) / r(x_old, s_old | s_0)] · [q_x(x_old) / q_x(x*)]),    (29)

where

    r(x*, s* | s_0) = p(x*, s* | s_0) / q_s(s* | x*, s_0)
                    = [p(x*_1, s_1 | s_0) / q_s(s_1 | x*_1, s_0)] ⋯ [p(x*_L, s_L | s_{L−1}) / q_s(s_L | x*_L, s_{L−1})]
                    = r*_1 · r*_2 ⋯ r*_L.    (30)

Surprisingly, substituting Eqs. (23) and (27) into Eq. (28) reveals that r*_ℓ equals p(x_{j_ℓ i_ℓ} = x*_ℓ | s*_{ℓ−1}), i.e., the probability of the new customer at restaurant j_ℓ eating dish x*_ℓ:

    p(x_{ji} = k | s) = [n_{j·k} + α_0 m_{·k} / (m_{··} + γ)] / (n_{j··} + α_0),    (31)
    p(x_{ji} = k_new | s) = [α_0 γ / (m_{··} + γ)] / (n_{j··} + α_0).
(32)

In other words, the calculation of the accept ratio does not use t_{ji} (the table index of each customer), despite the fact that the values of t_{ji} are being proposed; t_{ji} indirectly affects the accept ratio by changing the subsequent draw probabilities p(x*_{ℓ+1} | s*_ℓ), p(x*_{ℓ+2} | s*_{ℓ+1}), …, through modifying n_{jtk} and m_{jk}, i.e., the numbers of customers and tables. It is now clear that, as done in some previous work [?], we can save storage space by using an alternative representation of the seating arrangements s, in which the table index t_{ji} of each customer is forgotten and only the counts n_{jt·}, k_{jt}, and m_{jk} are retained. The only remaining reference to t_{ji}, in the removeCustomer procedure, can then be safely replaced by sampling.

Note, however, that we have to revert to the original seating assignment s_old whenever the proposal is rejected. Putting the old draws x_old back into s_0 by running the addCustomer procedure again would lead to an incorrect distribution of seating assignments, and consequently an incorrect distribution of draws.

Algorithm 1 is our proposed procedure. It obtains new samples x_new drawn simultaneously from the restaurants indexed by j, together with the associated seating arrangement s_new, given the previous samples x_old and s_old.

Algorithm 1 MH-RCDSampler(j, x_old, s_old): Metropolis-Hastings sampler for restricted collapsed draws
 1: s_old,L = s_old
 2: for ℓ = L downto 1 do
 3:     s_old,ℓ−1 = removeCustomer(x_old,ℓ, j_ℓ, s_old,ℓ)    {remove the customers for x_old,1, …, x_old,L sequentially from s_old}
 4:     r_old,ℓ = p(x_old,ℓ | s_old,ℓ−1)    {calculate the factors for the accept ratio}
 5: end for
 6: s*_0 = s_0 = s_old,0
 7: x* ∼ q_x(x; s_0)    {draw x* from the proposal distribution q_x(x) of draws}
 8: for ℓ = 1 to L do
 9:     r*_ℓ = p(x*_ℓ | s*_{ℓ−1})    {calculate the factors for the accept ratio}
10:     s*_ℓ = addCustomer(x*_ℓ, j_ℓ, s*_{ℓ−1})    {add the customers for x*_1, …, x*_L sequentially to s*_0}
11: end for
12: s* = s*_L    {obtain the proposal seating s*}
13: R = min(1, [q_x(x_old) / q_x(x*)] · Π_{ℓ=1}^L r*_ℓ / r_old,ℓ)    {calculate the acceptance probability}
14: return ⟨x_new, s_new⟩ = ⟨x*, s*⟩ with probability R, and ⟨x_old, s_old⟩ otherwise    {accept/reject the proposed sample}

The first half of this sampler is similar to a sampler for a single draw; it consists of removing the old customers (line 3), choosing a new sample (line 7), and adding the customers again (line 10). The main differences are that removeCustomer/addCustomer are iterated L times per call, and that the factors r are computed for the acceptance probability R.

4 Gibbs sampler for HCRP-HMM

This section describes a series of samplers for the HCRP-HMM. First, we present the step-wise Gibbs sampler as the simplest example. After that, we describe a blocked Gibbs sampler using a forward-backward algorithm. We also explain the HCRP version of the beam sampler [?] as well as the split-merge sampler [?] for the iHMM.

4.1 Step-wise Gibbs sampler

A step-wise Gibbs sampler for the HCRP-HMM is easily constructed using an RCD sampler (Algorithm 5 in the Appendix describes one Gibbs sweep). We slightly modified the proposal distribution q(x_t) from that suggested in Section 3.2, in order to ensure that x_{t+1} is proposed with non-zero probability even when no table in s_0 serves dish x_{t+1}:

    q_x(x_t) ∝ [p(x_t | s_0^{(x_{t−1})}) + α_0 γ / ((α_0 + n_{x_{t−1}··})(γ + m_{··})) · δ_{x_{t+1}}] · p(x_{t+1} | s_0^{(x_t)}) · F_{x_t}(y_t).
(33)

4.2 Blocked Gibbs sampler

Under the RCD framework we can construct an alternative sampling scheme that resamples a block of hidden states simultaneously, based on the forward-backward sampler [?]. The idea is to run the forward-backward sampler with a predictive transition distribution from the HCRP-HMM, and to use the result as a proposal of a restricted collapsed draw.

For the iHMM, the forward-backward sampling algorithm [?] cannot be used directly, because the forward probability values for an infinite number of states would have to be stored for each time step t [?]. This is not the case for the HCRP-HMM, because the predictive transition probability π̂ from a given seating assignment s_0, given by Eqs. (31) and (32), only contains transition probabilities for a finite number K of states plus one for k_new. Thus we only need to store K + 1 forward probabilities for each time step t.

The result x̄ of the forward-backward sampler, however, cannot be used directly as the proposal; the ℓ-th state of the proposal x*_ℓ equals x̄_ℓ when x̄_ℓ ≠ k_new, but we need to assign new state indices to x*_ℓ whenever x̄_ℓ = k_new. In particular, when k_new has appeared W ≥ 2 times, the appearances of k_new may refer to the same new state, to W different states, or to anything in between, in which some appearances of k_new share a new state.

To achieve this, we prepare a special CRP Q* that accounts for the previously unseen states marked by k_new in the result of the forward-backward sampler. Specifically, each table in Q* serves a dish with an unused state index, and each appearance of k_new is replaced with a draw from Q*. This construction ensures that every state sequence is proposed with non-zero probability, and allows the proposal probability to be calculated easily. The concentration parameter of Q* is set equal to γ.
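The relabeling of k_new appearances through the auxiliary CRP Q* can be sketched as follows (a minimal sketch under our own naming; the real sampler would also accumulate the probability of each pick, since it enters the proposal probability of Eq. 34):

```python
import random

def relabel_knew(xbar, used_states, gamma, rng):
    """Replace each appearance of "k_new" in a forward-backward result by a
    draw from an auxiliary CRP Q*: an existing Q* table is reused in
    proportion to its customer count, and a new table opens a fresh,
    previously unused state index."""
    q_counts = {}                                  # Q* table (state) -> count
    next_free = max(used_states, default=-1) + 1   # first unused state index
    out = []
    for x in xbar:
        if x != "k_new":
            out.append(x)
            continue
        picks = list(q_counts) + [None]            # None = open a new table
        weights = list(q_counts.values()) + [gamma]
        pick = rng.choices(picks, weights=weights)[0]
        if pick is None:
            pick = next_free
            next_free += 1
            q_counts[pick] = 1
        else:
            q_counts[pick] += 1
        out.append(pick)
    return out

# two k_new slots may map to one shared new state or to two distinct ones
seq = relabel_knew([2, "k_new", 0, "k_new"], used_states={0, 1, 2},
                   gamma=1.0, rng=random.Random(0))
```

The first k_new always opens a new table (Q* starts empty), so it gets the first unused index; later appearances either share it or open further tables, covering all the sharing patterns described above.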
To handle the case where some of the new states equal x_{t_b+1}, i.e., the index of the state that succeeds the resampled block, we add to Q* an extra customer corresponding to x_{t_b+1} when x_{t_b+1} does not appear in s_0. The resulting proposal probability is:

    q_x(x*) = [Π_{ℓ=0}^L π̂_{x̄_{ℓ+1} x̄_ℓ} · Π_{ℓ=1}^L F_{x̄_ℓ}(y_ℓ)] · Π_{ℓ: x̄_ℓ = k_new} p(x*_ℓ | Q*),    (34)

where the first factor accounts for the forward probability of the sequence, and the second factor accounts for the probability of the new state assignments.

Note also that, to make a sensible proposal distribution, we cannot resample the whole state sequence simultaneously. We need to divide the state sequence into several blocks, and resample each block given the other blocks. The block size affects efficiency: blocks that are too large have a low accept probability, while with blocks that are too small the algorithm has little advantage over step-wise Gibbs sampling. Algorithm 8 in the Appendix describes one sweep of a blocked Gibbs sampler for an HCRP-HMM.

4.3 Beam sampling

Beam sampling for the HDP-HMM [?] uses slice sampling [?] on the transition probabilities to extract a finite subset of the state space. Although the possible states are already finite in the HCRP-HMM, the same technique may still benefit sampling of the HCRP-HMM by reducing the number of states considered during one sampling step. We need only replace the call to ForwardBackwardSampling in Algorithm 8 with a call to BeamSampling to use beam sampling with the HCRP-HMM. A brief overview of beam sampling:

1. Sample auxiliary variables u = (u_0, …, u_L) as u_ℓ ∼ Uniform(0, π_{x_ℓ x_{ℓ−1}}).
2. For ℓ = 1, …, L, calculate the forward probability q(x'_ℓ = k') using a slice of the transition probabilities: q(x'_ℓ = k') = F_{k'}(y_ℓ) Σ_k I(π_{k'k} > u_{ℓ−1}) q(x'_{ℓ−1} = k).
3. For ℓ = L, …, 1, sample the states x'_ℓ backwards, i.e., p(x'_ℓ = k) ∝ q(x'_ℓ = k) I(π_{x'_{ℓ+1} k} > u_ℓ).

For details, refer to the original paper [?].

Some remarks are needed on the calculation of q*_x, the proposal probability of the state sequence. Although beam sampling has a proposal distribution different from forward-backward sampling, we can use the same calculation of the proposal probability in the acceptance probability as for forward-backward sampling. This is because beam sampling satisfies the detailed balance equation, which ensures that the ratio of proposal probabilities under beam sampling, q*_slice / q_old_slice, is always equal to the ratio q* / q_old obtained by forward-backward sampling.

4.4 Split-Merge Sampling

Using the RCD sampler, we can integrate the split-merge sampling algorithm, another sampling approach for Dirichlet process mixture models [?], into the HCRP-HMM. A split-merge sampler makes a proposal move that tries to merge two mixture components into one, or to split a mixture component into two; the sampler then uses a Metropolis-Hastings step to stochastically accept the proposal. Based on the RCD framework, we can extend the split-merge sampler to the HCRP, which can then be applied to the HCRP-HMM. In the context of HMMs, the sampler either merges two state indices into one or splits a state index into two. Our implementation is based on an improved version of the split-merge sampler, the sequentially-allocated merge-split sampler [?], which produces a split proposal while sequentially allocating components in random order.
To deal with the temporal dependency in the HMM, we identify fragments of state sequences to be resampled within the state sequence, and perform blocked Gibbs sampling for each fragment in random order.

We added one important optimization to the split-merge sampling algorithm. Since a merge move is proposed much more frequently than a split move, and the move has a relatively low accept probability, it is beneficial to have a way of determining early whether a merge move will be rejected. We point out that, as the proposal probability for a merge move is calculated, the accept probability is monotonically decreasing. Consequently, we sample R_thr, a threshold on the accept probability, at the beginning of the algorithm and stop further calculation when R falls below R_thr. Algorithm 9 in the Appendix is the split-merge sampling algorithm for the HCRP-HMM.

Split-merge sampling allows faster mixing when it is interleaved with other sampling strategies. We examine split-merge sampling combined with each of the samplers presented in this paper.

5 Experiments and Discussion

This section presents two series of experiments, the first with small artificial sequences and the second with a sequence of natural-language words.

5.1 Settings

We put a gamma prior Gamma(1, 1) on α_0 and γ, and sampled them between every sweep using an auxiliary-variable method [?] in all the experiments. We introduced an HCRP as the prior of the emission distributions as well, and its hyperparameters were also sampled in the same way. The initial state sequence given to the sampler is the result of a particle filter with 100 particles.

We measured autocorrelation time (ACT) to evaluate mixing. Given a sequence of values x = x_1, x_2, ...
, x_T with mean μ and variance σ², the autocorrelation function ACF and ACT(x) are defined as follows:

    ACF_t(x) = [1 / ((T − t) σ²)] Σ_{i=1}^{T−t} (x_i − μ)(x_{i+t} − μ),    (35)

    ACT(x) = 1/2 + Σ_{t=1}^∞ ACF_t(x).    (36)

Since ACF_t(x) is expected to converge to zero for larger t, we summed ACF_t(x) for t ≤ 1000.

For the artificial sequences, we evaluated the mutual information between h_t, the hidden state used in sequence generation, and x_t, the inferred state:

    MI = Σ_h Σ_x p(x, h) log [p(x, h) / (p(x) p(h))].    (37)

For natural-language text, the inferred model is evaluated by multiple runs of a particle filter on a given test sequence of length T_test. Specifically, we construct a particle filter with Z = 100 particles for each sampled model state s_z, and evaluate the likelihood l(y_i | s_z) for each emission. Finally, we calculate the perplexity (the reciprocal geometric mean of the emission probabilities) of the test sequence:

    PPL = exp(−(1/T_test) Σ_i log l̂(y_i)),    (38)

where

    l̂(y_i) = (1/Z) Σ_{z=1}^Z l(y_i | s_z).    (39)

The samplers we chose for comparison are the step-wise Gibbs sampler with the direct-assignment representation [?], which uses stick-breaking for the root DP and the CRP for the other DPs; the step-wise Gibbs sampler with the stick-breaking construction; and the beam sampler with the stick-breaking construction [?]. For a fair comparison between the different algorithms, we collected samples to evaluate autocorrelation time and perplexity on a CPU-time basis (excluding the time used in evaluation). All the algorithms were implemented in C++ and tested on machines with an Intel Xeon E5450 at 3 GHz.

[Figure 3: Automaton that generates Sequence 2. Circles denote hidden states, and the same alphabet emissions are observed from states within an oval group. A dashed arrow denotes a transition with probability 0.8, a bold arrow denotes a transition with probability 0.84, and a solid arrow denotes an emission with probability 1/3.]

5.2 Artificial data

The first series of experiments was performed with two small artificial sequences. Sequence 1 consists of the repeating sequence of symbols A-B-C-D-B-C-D-E-... of length T = 500; we ran the sampler 30 s for burn-in, after which a model state was sampled every 2 s until a total of 300 s was reached. Sequence 2 is generated from the simple finite-state automaton in Figure 3 with length T = 2500; we used 60 s for burn-in and 600 s in total. We evaluated the mutual information between the inferred hidden states and the true hidden states.

Figure 4 shows the distribution of mutual information over 100 trials after 300 s. We can see that some of the samplers based on the proposed method achieved better mutual information than the existing samplers. The improvement depends on the type of sequence and the sampler.

For Sequence 1, split-merge sampling yields better results than the other samplers. Although an HMM with eight hidden states can completely predict the sequence, the samplers tend to be trapped in a local optimum with five states in the initial phase, because our selected prior on γ puts larger probability on a smaller number of hidden states. Detailed investigations (Figure 5) confirmed this analysis.

For Sequence 2, on the other hand, the blocked samplers worked very efficiently. Step-wise samplers generally worked poorly on this sequence, because the strong dependency between temporally adjacent states impedes mixing. Still, the step-wise Gibbs sampler for the HCRP-HMM outperformed the beam sampler with the stick-breaking process.
The blocked Gibbs sampler had inferior performance because of its heavy computation for a large number of states, but the beam sampler for the HCRP-HMM was efficient and performed well. Combination with a small number of split-merge moves increases performance (more split-merge sampling leads to lower performance by taking computational resources away from the beam sampler). From the averaged statistics of the samplers (Table 1), we can see that (1) the increase in mutual information cannot be explained only by the increase in the number of states; (2) the Gibbs proposals have a very high accept rate; and (3) split-merge proposals have a very low accept rate, but still improve mutual information.

5.3 Natural language text

We also tested the samplers using a sequence of natural-language words from Alice's Adventures in Wonderland. We converted the text to lower case, removed punctuation, and placed a special word EOS after every sentence to obtain a corpus of 28,120 words; we kept the last 1,000 words as the test corpus and learned on a sequence of length T = 27,120. We introduced a special word UNK (unknown) to replace every word that occurred only once, resulting in |Σ| = 1,487 unique words in the text. We took 10,000 s for burn-in, and sampled a model state every 120 s, up to a total of 172,800 s. Table 2 summarizes the averaged statistics over 18 trials.

We found that step-wise sampling outperformed blocked sampling (including beam sampling). The reason may be the nature of the sequence, which has lower temporal dependency. Blocked Gibbs sampling, in particular, consumes too much time per sweep to be of any practical use. We also found that split-merge sampling had a very low accept rate and thus made little contribution to the result.
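The perplexity figures discussed here follow Eqs. (38)-(39); as a minimal sketch (ours; test_likelihoods holds hypothetical per-particle emission likelihoods l(y_i | s_z)):

```python
import math

def perplexity(test_likelihoods):
    """Eqs. (38)-(39): average the per-emission likelihood over the Z
    particles, then take the reciprocal geometric mean over the test
    sequence. test_likelihoods[i][z] = l(y_i | s_z)."""
    log_sum = 0.0
    for per_particle in test_likelihoods:
        lhat = sum(per_particle) / len(per_particle)   # Eq. (39)
        log_sum += math.log(lhat)
    return math.exp(-log_sum / len(test_likelihoods))  # Eq. (38)

# a uniform emission likelihood 1/V over a vocabulary of V words gives PPL = V
assert abs(perplexity([[0.01] * 4] * 10) - 100.0) < 1e-9
```

A model that assigns probability 1 to every test emission reaches the minimum perplexity of 1; lower perplexity means better prediction of the held-out sequence.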
Yet, we can see the advantage of using the HCRP representation over the stick-breaking representation. The direct assignment (DA) algorithm showed a competitively good perplexity, reflecting the fact that DA uses stick-breaking only for the root DP and uses the CRP representation for the other DPs. Though step-wise Gibbs sampling and its slice sampling version seem to outperform DA slightly, we need to collect more data to show that the difference is significant. At least, however, we can say that many sampling algorithms are now available for inference, and we can choose a suitable one depending on the nature of the sequence.

6 Conclusion and Future Work

We have proposed a method of sampling directly from constrained distributions of simultaneous draws from a hierarchical Chinese restaurant process (HCRP). We pointed out that, to obtain a correct sample distribution, the seating arrangements (partitioning) must be correctly sampled for a restricted collapsed draw, and we thus proposed applying the Metropolis-Hastings algorithm to the seating arrangements. Our algorithm, called the Restricted Collapsed Draw (RCD) sampler, uses a naïve sampler to provide a proposal distribution for seating arrangements. Based on the sampler, we developed various sampling algorithms for the HDP-HMM based on the HCRP representation, including blocked Gibbs sampling, beam sampling, and split-merge sampling. The applications of the RCD sampler, which is at the heart of our algorithms, are not limited to the HCRP-HMM. The experimental results revealed that some of the proposed algorithms outperform existing sampling methods, indicating that the benefits of using a collapsed representation exceed the cost of rejecting proposals. The main contribution of this study is that it opens a way of developing more complex Bayesian models based on CRPs.
Since the RCD sampler is simple, flexible, and independent of the particular structure of a hierarchy, it can be applied to any combination or hierarchical structure of CRPs. Our future work includes using this algorithm to construct new Bayesian models based on hierarchical CRPs, which are hard to implement using a non-collapsed representation. Planned work includes extending the HDP-IOHMM [?] with a three-level hierarchical DP (e.g., the second level could correspond to actions, and the third level to input symbols).

Figure 4: Average mutual information of sampled hidden states. (a) Sequence 1; (b) Sequence 2.

Figure 5: Distribution of mutual information for Sequence 1. The X-axis shows mutual information and the Y-axis shows frequency. Block size ≈ 6 for HCRP-HMM Beam sampling. Panels: (a) HDP-HMM (SB) SGibbs; (b) HDP-HMM (DA) SGibbs; (c) HDP-HMM (SB) Beam; (d) HCRP-HMM Beam; (e) HCRP-HMM Beam Split-Merge 3/sweep; (f) HCRP-HMM Beam Split-Merge 25/sweep.

Table 1: Experimental results for Sequence 2

name                     MI    ACT    #states  secs/sweep  Gibbs accept rate  SM accept rate
HDP-HMM (DA) SGibbs      2.92  0.527  14.910   0.044       —                  —
HDP-HMM (SB) Beam        3.04  0.640  14.720   0.032       —                  —
HCRP-HMM SGibbs          3.18  0.719  16.210   0.026       0.999666           —
HCRP-HMM SGibbs+SM=2     3.28  0.615  17.190   0.030       0.999632           0.000631
HCRP-HMM SGibbs+SM=13    3.19  0.493  16.830   0.044       0.999619           0.000650
HCRP-HMM SSlice          2.82  0.820  15.950   0.009       0.999847           —
HCRP-HMM SSlice+SM=2     2.86  0.705  17.330   0.013       0.999822           0.000827
HCRP-HMM SSlice+SM=13    2.59  0.604  16.400   0.030       0.999830           0.000910
HCRP-HMM BGibbs          3.01  0.317  14.900   0.206       0.995525           —
HCRP-HMM BGibbs+SM=2     3.12  0.513  16.270   0.237       0.995135           0.000985
HCRP-HMM BGibbs+SM=13    3.18  0.637  16.530   0.265       0.994715           0.000715
HCRP-HMM Beam            3.21  0.866  15.180   0.016       0.997233           —
HCRP-HMM Beam+SM=2       3.37  0.898  16.910   0.019       0.996369           0.000497
HCRP-HMM Beam+SM=13      3.28  0.875  17.070   0.034       0.996316           0.000532

DA: Direct Assignment; SB: Stick-Breaking construction; MI: Mutual Information; ACT: auto-correlation time (samples collected every 0.1 s); #states: number of states; SGibbs: step-wise Gibbs; SSlice: step-wise Gibbs with slice sampling (beam sampling with block size = 1); BGibbs: blocked Gibbs (block size ≈ 8); Beam: beam sampling (block size ≈ 8 for HCRP-HMM, T for stick-breaking); SM: split-merge sampler (+SM=n denotes n SM trials per Gibbs sweep).

Table 2: Experiments on natural language text

Sampler                   Perplexity  #states   sec/sweep  Gibbs accept rate  SM accept rate
HDP-HMM (DA) SGibbs       134.22      313.056   10.017     —                  —
HDP-HMM (SB) SGibbs       151.10      242.389   38.045     —                  —
HDP-HMM (SB) Beam         178.59      68.444    16.126     —                  —
HCRP-HMM SGibbs           133.31      379.833   7.027      0.999861           —
HCRP-HMM SGibbs+SM=130    131.66      386.833   7.664      0.999857           0.000052
HCRP-HMM SGibbs+SM=5400   135.94      336.278   31.751     0.999880           0.000050
HCRP-HMM SSlice           131.17      422.000   0.469      0.999986           —
HCRP-HMM SSlice+SM=130    131.67      409.833   0.976      0.999993           0.000052
HCRP-HMM SSlice+SM=5400   152.76      254.722   36.617     0.999993           0.000056
HCRP-HMM BGibbs           199.14      1840.833  29380.261  0.992748           —
HCRP-HMM Beam             141.77      603.333   80.681     0.995627           —
HCRP-HMM Beam+SM=130      142.69      567.278   72.217     0.995612           0.000124
HCRP-HMM Beam+SM=5400     141.07      495.667   84.925     0.995554           0.000101

For HCRP-HMM, the block size ≈ 10.

A Miscellaneous Algorithms

Algorithm 2 getProb(j, k, s): calculate p(x^j_i = k | s)
  if m_{·k} = 0 then
    return (α0 γ) / ((m_{··} + γ)(n_{j··} + α0))
  else
    return (n_{j·k} + α0 m_{·k} / (m_{··} + γ)) / (n_{j··} + α0)
  end if

Algorithm 3 addCustomer(j, k, s_old): adds a new customer eating dish k to restaurant j
  s := s_old
  With probabilities proportional to:
    n_{jtk} (t = 1, ..., m_{j·}): increment n_{jtk} (the customer sits at the t-th table)
    α0 m_{·k} / (m_{··} + γ): seat the customer at a new table t_new serving dish k in restaurant j
      (n_{j t_new k} := 1, k_{j t_new} := k, increment m_{jk})
  return the updated s

Algorithm 4 removeCustomer(j, k, s_old): removes an existing customer eating dish k from restaurant j
  s := s_old
  Sample t^j_i in proportion to n_{j t^j_i k}
  Decrement n_{j t^j_i k} (the customer at the t^j_i-th table is removed)
  if n_{j t^j_i k} becomes zero then
    Remove the unoccupied table t^j_i from restaurant j and decrement m_{jk}
  end if
  return the updated s

B Step-wise Gibbs sampler

To manipulate the emission probability F(x_i) with a conjugate prior, we introduce a notation similar to the HCRP, which can be understood intuitively.

Algorithm 5 Step-wise Gibbs sweep for an HCRP-HMM
  Input:
    y_1, ..., y_T: observed emissions
    x_1, ..., x_T: previously inferred states
    s_old: set of CRP seating arrangements
    F_old: set of emission distributions
  for t = 1, ..., T, in random order do
    s_1 = removeCustomer(x_{t+1}, x_t, s_old)
    r_old_3 = getProb(x_{t+1}, x_t, s_1)
    F_0 = removeCustomer(y_t, x_t, F_old)
    r_old_2 = getProb(y_t, x_t, F_0)
    s_0 = removeCustomer(x_t, x_{t−1}, s_1)
    r_old_1 = getProb(x_t, x_{t−1}, s_0)
    Sample x*_t in proportion to q(x_t), where
      q(x_t) ∝ (α0 γ) / ((α0 + n_{x_{t−1}··})(γ + m_{··})) δ_{x_{t+1}}
               + p(x_t | S_0^{(x_{t−1})}) · p(y_t | F_{x_t}) · p(x_{t+1} | S_0^{(x_t)})
    r*_1 = getProb(x*_t, x_{t−1}, s_0)
    s_1 = addCustomer(x*_t, x_{t−1}, s_0)
    r*_2 = getProb(y_t, x*_t, F_0)
    F* = addCustomer(y_t, x*_t, F_0)
    r*_3 = getProb(x_{t+1}, x*_t, s_1)
    s* = addCustomer(x_{t+1}, x*_t, s_1)
    R := min(1, (r*_1 r*_2 r*_3 q(x_t)) / (r_old_1 r_old_2 r_old_3 q(x*_t)))
    ⟨x_t, s, F⟩ := ⟨x*_t, s*, F*⟩ with probability R; ⟨x_t, s_old, F_old⟩ otherwise
  end for

C Blocked Gibbs sampler

For details on the ForwardBackwardSampling routine, please refer to the literature [?].

Algorithm 6 removeSeq(i_0, i_1, x, s_old, F_old): remove the customers for a part of the state sequence (x_{i_0}, ..., x_{i_1})
  L = i_1 − i_0 − 1
  s_L = removeCustomer(x_{i_0+L}, x_{i_1}, s_old)
  r_old_L = getProb(x_{i_0+L}, x_{i_1}, s_L)
  F_L = F_old
  for ℓ = L−1 downto 0 do
    s_ℓ = removeCustomer(x_{i_0+ℓ}, x_{i_0+ℓ−1}, s_{ℓ+1})
    F_ℓ = removeCustomer(y_{i_0+ℓ}, x_{i_0+ℓ}, F_{ℓ+1})
    r_old_ℓ = getProb(x_{i_0+ℓ}, x_{i_0+ℓ−1}, s_ℓ) · getProb(y_{i_0+ℓ}, x_{i_0+ℓ}, F_ℓ)
  end for
  return ⟨s_0, F_0, ∏_{ℓ=0}^{L} r_old_ℓ⟩

Algorithm 7 addSeq(i_0, i_1, x, s_0, F_0): add the customers for a part of the state sequence (x_{i_0}, ..., x_{i_1})
  L = i_1 − i_0 − 1
  for ℓ = 0 to L−1 do
    r*_ℓ = getProb(x*_{i_0+ℓ}, x*_{i_0+ℓ−1}, s_ℓ) · getProb(y_{i_0+ℓ}, x*_{i_0+ℓ}, F_ℓ)
    s_{ℓ+1} = addCustomer(x*_{i_0+ℓ}, x*_{i_0+ℓ−1}, s_ℓ)
    F_{ℓ+1} = addCustomer(y_{i_0+ℓ}, x*_{i_0+ℓ}, F_ℓ)
  end for
  r*_L = getProb(x_{i_1}, x*_{i_1−1}, s_L)
  s* = addCustomer(x_{i_1}, x*_{i_1−1}, s_L); F* = F_L
  return ⟨s*, F*, ∏_{ℓ=0}^{L} r*_ℓ⟩

Algorithm 8 Blocked Gibbs sweep for an HCRP-HMM
  Input:
    y_1, ..., y_T: observed emissions
    x = x_1, ..., x_T: previously inferred states
    s: set of CRP seating arrangements
    F: set of emission distributions
    B: number of blocks
  Choose block boundaries i_1, ..., i_{B−1} ∈ {2, ..., T}; i_0 := 1, i_B := T
  for b = 0, ..., B−1, in random order do
    L = i_{b+1} − i_b
    ⟨s_0, F_0, r_old⟩ = removeSeq(i_b, i_{b+1}, x, s, F)
    x*_t = x_t for all t < i_b or t ≥ i_{b+1}
    (x*_{i_b}, ..., x*_{i_b+L−1}) = FBSampler(π̂ | s_0, F_0, y_{i_b:i_b+L−1}, x_{i_b−1}, x_{i_b+L})
    Calculate q_old = q(x_{i_b}, ..., x_{i_b+L−1}) and q* = q(x*_{i_b}, ..., x*_{i_b+L−1})
    Q_old = CRP(γ, H); Q* = CRP(γ, H)
    if x_{i_{b+1}} refers to a new state in s_0 then
      Q* := addCustomer(x_{i_{b+1}}, Q*)
      Q_old := addCustomer(x_{i_{b+1}}, Q_old)
    end if
    for t = i_b to i_{b+1} − 1 do
      if x_t refers to a new state in s_0 then
        q_old := q_old · getProb(x_t, Q_old)
        Q_old := addCustomer(x_t, Q_old)
      end if
      if x*_t = k_new then
        sample k ~ Q*; x*_t := k
        q* := q* · getProb(x*_t, Q*)
        Q* := addCustomer(x*_t, Q*)
      end if
    end for
    ⟨s*, F*, r*⟩ = addSeq(i_b, i_{b+1}, x*, s_0, F_0)
    R := min(1, (q_old r*) / (q* r_old))
    ⟨x, s, F⟩ := ⟨x*, s*, F*⟩ with probability R; ⟨x_old, s_old, F_old⟩ otherwise
  end for

D Split-Merge sampler

Algorithm 9 Split-merge sampler for an HCRP-HMM
  Input:
    y_1, ..., y_T: observed emissions
    x_1, ..., x_T: previously inferred states
    s_old: set of CRP seating arrangements
    F_old: set of emission distributions
  R_thr ~ Uniform(0, 1)
  Choose distinct t_1, t_2 ∈ {1, ..., T}
  Identify all fragments (b_i, e_i) such that for all t ∈ (b_i, ..., e_i), x_t ∈ {x_{t_1}, x_{t_2}} ∧ t ∉ {t_1, t_2}, and the fragment is not contained in another fragment
  Permute the fragments randomly; let U be the number of fragments
  s_{U+1} = s_old, F_{U+1} = F_old
  if x_{t_1} = x_{t_2} then  {try a split move}
    for i = U downto 1 do
      ⟨s_i, F_i, r_old_i⟩ = removeSeq(b_i, e_i, x, s_{i+1}, F_{i+1})
      q_old_i = 1
    end for
    x*_{t_2} = a new state index k_new
  else  {try a merge move}
    for i = U downto 1 do
      ⟨s_i, F_i, r_old_i⟩ = removeSeq(b_i, e_i, x, s_{i+1}, F_{i+1})
      q_old_i = SeqProb(π̂ | s_{i+1}, F_i, y_{b_i:e_i}, x_{b_i−1:e_i+1})
                / ForwardProb(π̂ | s_{i+1}, F_i, y_{b_i:e_i}, x_{b_i−1}, x_{e_i+1}; {x_{t_1}, x_{t_2}})
    end for
    x*_{t_2} = x_{t_1}
  end if
  {Remove the customers that account for the transitions around the old x_{t_2}}
  F_0 = removeCustomer(y_{t_2}, x_{t_2}, F_1)
  p_old_0 = p(y_{t_2} | F_{x_{t_2}})
  s_0 := s_1; r_old_0 := 1
  if t_2 − 1 is not in any fragment then
    s_0 := removeCustomer(x_{t_2}, x_{t_2−1}, s_0)
    r_old_0 *= getProb(x_{t_2}, x_{t_2−1}, s_0)
  end if
  if t_2 + 1 is not in any fragment then
    s_0 := removeCustomer(x_{t_2+1}, x_{t_2}, s_0)
    r_old_0 *= getProb(x_{t_2+1}, x_{t_2}, s_0)
  end if
  q_old_0 = q*_0 = 1
  (continued in Algorithm 10)

Algorithm 10 Split-merge sampler for an HCRP-HMM (continued)
  {Add the customers that account for the transitions around x*_{t_2}}
  p*_0 = p(y_{t_2} | F*_{x*_{t_2}})
  F*_1 = addCustomer(y_{t_2}, x*_{t_2}, F_0)
  s*_1 := s_0; r*_0 := 1
  if t_2 + 1 is not in any fragment then
    r*_0 *= getProb(x_{t_2+1}, x*_{t_2}, s*_1)
    s*_1 := addCustomer(x_{t_2+1}, x*_{t_2}, s*_1)
  end if
  if t_2 − 1 is not in any fragment then
    r*_0 *= getProb(x*_{t_2}, x_{t_2−1}, s*_1)
    s*_1 := addCustomer(x*_{t_2}, x_{t_2−1}, s*_1)
  end if
  if x_{t_1} = x_{t_2} then  {try a split move}
    for i = 1 to U do
      (x*_{b_i}, ..., x*_{e_i}) = LimitedFBSampler(π̂ | s*_i, F*_i, y_{b_i:e_i}, x*_{b_i−1}, x*_{e_i+1}; {x*_{t_1}, x*_{t_2}})
      q*_i = SeqProb(π̂ | s*_i, F*_i, y_{b_i:e_i}, x*_{b_i−1:e_i+1})
             / ForwardProb(π̂ | s*_i, F*_i, y_{b_i:e_i}, x*_{b_i−1}, x*_{e_i+1}; {x*_{t_1}, x*_{t_2}})
      ⟨s*_{i+1}, F*_{i+1}, r*_i⟩ = addSeq(b_i, e_i, x*, s*_i, F*_i)
    end for
  else  {try a merge move}
    for i = 1 to U do
      R_cur = ∏_{i'=0}^{i−1} (r*_{i'} / q*_{i'}) · ∏_{i'=0}^{U} (q_old_{i'} / r_old_{i'})
      if R_thr ≥ R_cur then
        rejection is determined; exit the loop
      end if
      x*_{b_i}, ..., x*_{e_i} = x_{t_1}; q*_i = 1
      ⟨s*_{i+1}, F*_{i+1}, r*_i⟩ = addSeq(b_i, e_i, x*, s*_i, F*_i)
    end for
  end if
  R = ∏_{i=0}^{U} (r*_i / r_old_i) · ∏_{i=1}^{U} (q_old_i / q*_i)
  ⟨x, s, F⟩ := ⟨x*, s*, F*⟩ if R_thr < R; ⟨x_old, s_old, F_old⟩ otherwise
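The CRP bookkeeping behind Algorithms 2–4 can be sketched compactly. The Python below is a minimal illustration under our own naming (HCRPRestaurant, get_prob, add_customer, remove_customer), not the authors' implementation; for a dish served nowhere we substitute the γ term of Algorithm 2, and the density of the base measure H over new dishes is assumed to be multiplied in by the caller.

```python
import random
from collections import defaultdict

class HCRPRestaurant:
    """Two-level Chinese restaurant franchise: restaurants j share dishes k.
    Notation follows the paper: n_{jtk} table occupancy, m_{.k} tables serving k."""

    def __init__(self, alpha0, gamma):
        self.alpha0, self.gamma = alpha0, gamma
        self.tables = defaultdict(list)  # tables[j] = list of [dish k, customer count]
        self.m_k = defaultdict(int)      # m_{.k}: tables serving dish k over all restaurants
        self.m_total = 0                 # m_{..}
        self.n_jk = defaultdict(int)     # n_{j.k}: customers eating dish k in restaurant j
        self.n_j = defaultdict(int)      # n_{j..}

    def get_prob(self, j, k):
        """Collapsed predictive weight of dish k in restaurant j (Algorithm 2)."""
        if self.m_k[k] > 0:
            base = self.alpha0 * self.m_k[k] / (self.m_total + self.gamma)
        else:  # unseen dish: the gamma term of Algorithm 2 (H density left to caller)
            base = self.alpha0 * self.gamma / (self.m_total + self.gamma)
        return (self.n_jk[(j, k)] + base) / (self.n_j[j] + self.alpha0)

    def _new_table_weight(self, k):
        # weight alpha0 * m_{.k} / (m_{..} + gamma); gamma substituted for a new dish
        top = self.m_k[k] if self.m_k[k] > 0 else self.gamma
        return self.alpha0 * top / (self.m_total + self.gamma)

    def add_customer(self, j, k):
        """Seat a customer: existing table w.p. ∝ n_{jtk}, else a new table (Algorithm 3)."""
        ts = [t for t in self.tables[j] if t[0] == k]
        weights = [t[1] for t in ts] + [self._new_table_weight(k)]
        r = random.random() * sum(weights)
        for t, w in zip(ts, weights):
            r -= w
            if r < 0:
                t[1] += 1                    # sit at an existing table
                break
        else:
            self.tables[j].append([k, 1])    # open a new table serving dish k
            self.m_k[k] += 1
            self.m_total += 1
        self.n_jk[(j, k)] += 1
        self.n_j[j] += 1

    def remove_customer(self, j, k):
        """Unseat a customer from a table chosen ∝ its occupancy (Algorithm 4)."""
        ts = [t for t in self.tables[j] if t[0] == k]
        r = random.random() * sum(t[1] for t in ts)
        for t in ts:
            r -= t[1]
            if r < 0:
                t[1] -= 1
                if t[1] == 0:                # table became empty: close it
                    self.tables[j].remove(t)
                    self.m_k[k] -= 1
                    self.m_total -= 1
                break
        self.n_jk[(j, k)] -= 1
        self.n_j[j] -= 1

random.seed(1)
rest = HCRPRestaurant(alpha0=1.0, gamma=2.0)
for state in "ABABA":                        # e.g., transitions out of one restaurant
    rest.add_customer(0, state)
print(rest.n_j[0], rest.n_jk[(0, "A")], rest.n_jk[(0, "B")])  # → 5 3 2
```

The customer counts are deterministic; only the table assignment is stochastic, which is exactly the seating-arrangement uncertainty that the RCD sampler resolves with a Metropolis-Hastings correction.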