Deep Dive into Online Learning in Discrete Hidden Markov Models.
We present and analyse three online algorithms for learning in discrete Hidden Markov Models (HMMs) and compare them with the Baldi-Chauvin Algorithm. Using the Kullback-Leibler divergence as a measure of generalisation error we draw learning curves in simplified situations. The performance for learning drifting concepts of one of the presented algorithms is analysed and compared with the Baldi-Chauvin algorithm in the same situations. A brief discussion about learning and symmetry breaking based on our results is also presented.
Hidden Markov Models (HMMs) [1,2] are extensively studied machine learning models for time series, with applications in fields like speech recognition [2], bioinformatics [3,4] and LDPC codes [5]. They consist of a Markov chain of non-observable hidden states q_t ∈ S, t = 1, ..., T, S = {s_1, s_2, ..., s_n}, with initial probability vector π_i = P(q_1 = s_i) and transition matrix A_{ij}(t) = P(q_{t+1} = s_j | q_t = s_i), i, j = 1, ..., n. At each discrete time t, the state q_t emits an observed state y_t ∈ O, O = {o_1, ..., o_m}, with emission probability matrix B_{iα}(t) = P(y_t = o_α | q_t = s_i), i = 1, ..., n, α = 1, ..., m; these emissions are the actual observations of the time series, represented from time t = 1 to t = T by the observed sequence y_1^T = {y_1, y_2, ..., y_T}. The q_t's form the so-called hidden sequence q_1^T = {q_1, q_2, ..., q_T}. The probability of observing a sequence y_1^T given ω ≡ (π, A, B) is

P(y_1^T | ω) = Σ_{q_1^T} π_{q_1} B_{q_1 y_1} ∏_{t=1}^{T-1} A_{q_t q_{t+1}} B_{q_{t+1} y_{t+1}},

where the sum runs over all n^T possible hidden sequences.
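The likelihood P(y_1^T | ω) above can be evaluated without enumerating all n^T hidden sequences by the standard forward recursion, which costs O(n^2 T). A minimal sketch with illustrative (not paper-specific) parameter values:

```python
import numpy as np

def sequence_probability(pi, A, B, y):
    """P(y_1^T | omega) for a discrete HMM, computed by the forward algorithm.

    pi : (n,) initial state probabilities, pi[i] = P(q_1 = s_i)
    A  : (n, n) transition matrix, A[i, j] = P(q_{t+1} = s_j | q_t = s_i)
    B  : (n, m) emission matrix, B[i, a] = P(y_t = o_a | q_t = s_i)
    y  : observed sequence as symbol indices in 0..m-1
    """
    alpha = pi * B[:, y[0]]                 # alpha_1(i) = pi_i B_{i y_1}
    for t in range(1, len(y)):
        alpha = (alpha @ A) * B[:, y[t]]    # alpha_{t+1} = (alpha_t A) * B_{., y_{t+1}}
    return alpha.sum()

# Toy HMM with n = 2 hidden states, m = 3 observed symbols (illustrative values).
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3],
              [0.2, 0.8]])
B = np.array([[0.5, 0.4, 0.1],
              [0.1, 0.3, 0.6]])

# Sanity check: probabilities of all m^T length-T sequences must sum to 1.
total = sum(sequence_probability(pi, A, B, (a, b))
            for a in range(3) for b in range(3))
```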
In the learning process, the HMM is fed with a time series and adapts its parameters so as to produce similar sequences. Data feeding can range from offline (all data are fed and the parameters computed at once) to online (data are fed sequentially and partial updates are made).
We study a scenario where the data are generated by an HMM of unknown parameters, an extension of the student-teacher scenario from neural networks. The performance, as a function of the number of observations, is given by how far the student is from the teacher, measured by a suitable criterion. Here we use the naturally arising Kullback-Leibler (KL) divergence which, although not accessible in practice since it requires knowledge of the teacher, extends the idea of generalisation error and is very informative.
We propose three algorithms and compare them with the Baldi-Chauvin Algorithm (BC) [6]: the Baum-Welch Online Algorithm (BWO), an adaptation of the offline Baum-Welch reestimation formulas (BW) [1]; and, starting from a Bayesian formulation, an approximation named the Bayesian Online Algorithm (BOnA), which can be further simplified, without noticeable loss of performance, to a Mean Posterior Algorithm (MPA). BOnA and MPA, inspired by Amari [7] and Opper [8], are essentially mean field methods [9] in which a manifold of tractable prior distributions is introduced and each new datum leads, through Bayes' theorem, to a non-tractable posterior. The key step is to take as the new prior not the posterior itself, but the closest distribution (in some sense) within the manifold.
The paper is organised as follows: first, BWO is introduced and analysed. Next, we derive BOnA for HMMs and, from it, MPA. We compare MPA and BC for drifting concepts. Then, we discuss learning and symmetry breaking and end with our conclusions.
The Baum-Welch Online Algorithm (BWO) is an online adaptation of BW in which, at each iteration, y is replaced by y_p, the p-th observed sequence. Multiplying the BW increment by a learning rate η_BW, we get the update equations for ω:

ω_{p+1} = ω_p + η_BW Δω_p,

with Δω_p the BW variations for y_p. The complexity of BWO is polynomial in n and T.
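A sketch of one BWO step, under the assumption (consistent with the description above) that Δω_p is the difference between the single-sequence BW reestimate and the current parameters; all matrix values are illustrative:

```python
import numpy as np

def bw_reestimate(pi, A, B, y):
    """One Baum-Welch (EM) reestimation of (pi, A, B) from a single sequence y."""
    n, T = len(pi), len(y)
    alpha = np.zeros((T, n))                   # forward pass
    alpha[0] = pi * B[:, y[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, y[t]]
    beta = np.zeros((T, n))                    # backward pass
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, y[t + 1]] * beta[t + 1])
    Py = alpha[-1].sum()                       # P(y | omega)
    gamma = alpha * beta / Py                  # gamma_t(i) = P(q_t = s_i | y, omega)
    xi = np.zeros((n, n))                      # expected transition counts
    for t in range(T - 1):
        xi += alpha[t][:, None] * A * (B[:, y[t + 1]] * beta[t + 1])[None, :] / Py
    pi_new = gamma[0]
    A_new = xi / gamma[:-1].sum(axis=0)[:, None]
    B_new = np.zeros_like(B)
    for t in range(T):
        B_new[:, y[t]] += gamma[t]
    B_new /= gamma.sum(axis=0)[:, None]
    return pi_new, A_new, B_new

def bwo_step(pi, A, B, y, eta):
    """One online step: omega <- omega + eta * (BW reestimate - omega)."""
    pi_hat, A_hat, B_hat = bw_reestimate(pi, A, B, y)
    return (pi + eta * (pi_hat - pi),
            A + eta * (A_hat - A),
            B + eta * (B_hat - B))

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.2, 0.8]])
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
pi, A, B = bwo_step(pi, A, B, (0, 2), eta=0.1)   # one observed sequence y_p
```

Since the update is a convex combination for η_BW in [0, 1], the parameters stay valid stochastic matrices at every step.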
In figure 1, the HMM learns sequences generated by a teacher with n = 2, m = 3 and T = 2, for different η_BW. Initial students have matrices with all entries set to the same value, which we call a symmetric initial student. We took averages over 500 random teachers; distances are given by the KL divergence between two HMMs ω_1 and ω_2,

D(ω_1 ∥ ω_2) = Σ_{y_1^T} P(y_1^T | ω_1) ln [ P(y_1^T | ω_1) / P(y_1^T | ω_2) ],

where the sum runs over all m^T observable sequences.
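For small m and T the KL divergence between the two sequence distributions can be computed exactly by enumerating all m^T observable sequences. A sketch, with illustrative teacher values and a symmetric student as described above:

```python
import numpy as np
from itertools import product

def seq_prob(pi, A, B, y):
    """P(y | omega) via the forward algorithm."""
    alpha = pi * B[:, y[0]]
    for t in range(1, len(y)):
        alpha = (alpha @ A) * B[:, y[t]]
    return alpha.sum()

def kl_hmm(w1, w2, m, T):
    """KL divergence between the length-T sequence distributions of two HMMs."""
    d = 0.0
    for y in product(range(m), repeat=T):
        p = seq_prob(*w1, y)
        q = seq_prob(*w2, y)
        d += p * np.log(p / q)
    return d

# Illustrative teacher with n = 2, m = 3 (not the paper's actual draws).
teacher = (np.array([0.6, 0.4]),
           np.array([[0.7, 0.3], [0.2, 0.8]]),
           np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]))
# Symmetric student: every entry of each matrix set to the same value.
student = (np.full(2, 0.5), np.full((2, 2), 0.5), np.full((2, 3), 1.0 / 3.0))

d = kl_hmm(teacher, student, m=3, T=2)   # >= 0, zero iff the distributions agree
```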
We see that after a certain number of sequences the HMM stops learning; this saturation is particular to the symmetric initial student and disappears for a non-symmetric one.
Denoting the variation of the parameters in BC by Δ^{BC}, in BW by Δ^{BW} and in BWO by Δ^{BWO}, and with γ_t(i) ≡ P(q_t = s_i | y_p, ω_p), we find, to first order in λ, that for η_BW ≈ λ η_BC / n and small λ the variations in BC are proportional to those in BWO, but with different effective learning rates for each matrix depending on y_p. Simulations show that the actual values are of the same order as the approximated ones.
The Bayesian Online Algorithm (BOnA) [8] uses Bayesian inference to adjust ω in the HMM using a data set D_P = {y_1, …, y_P}. For each datum, the prior distribution is updated by Bayes' theorem. This update takes a prior from a parametric family and transforms it into a posterior which, in general, no longer has the same parametric form. The strategy used by BOnA is then to project the posterior back onto the initial parametric family. To achieve this, we minimise the KL-divergence between the posterior and a distribution in the parametric family; this minimisation yields the parameters of the closest parametric distribution, by which we approximate the posterior. At each step of the learning process, the student HMM parameters ω are estimated as the means of the projected distribution.
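The projection step can be made concrete with a generic one-dimensional illustration (not the distribution family over HMM parameters used in the paper): when the tractable family is Gaussian, minimising KL(posterior ∥ family member) reduces to matching the posterior's mean and variance. The bimodal "posterior" below is purely illustrative:

```python
import numpy as np

# A bimodal "posterior" p(x): mixture 0.7 N(2, 1) + 0.3 N(-3, 4), on a fine grid.
x = np.linspace(-12.0, 12.0, 24001)
dx = x[1] - x[0]
p = (0.7 * np.exp(-0.5 * (x - 2.0) ** 2) / np.sqrt(2 * np.pi)
     + 0.3 * np.exp(-(x + 3.0) ** 2 / 8.0) / np.sqrt(8 * np.pi))

# Projection onto the Gaussian family: argmin_q KL(p || q) matches the first
# two moments of p, so the "closest" tractable distribution is N(mu, var).
mu = np.sum(x * p) * dx                  # matched mean (here 0.7*2 + 0.3*(-3) = 0.5)
var = np.sum((x - mu) ** 2 * p) * dx     # matched variance
q = np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
```

In BOnA the same idea is applied to distributions over the HMM parameters (π, A, B), and MPA then reads off the parameter estimates as the means of the projected distribution, exactly as `mu` summarises `q` here.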
For a parametric family of the form P(x) ∝ e^{−Σ_i λ_i f_i(x)}, which can be obtained from the MaxEnt principle by constraining the averages over P(x) of arbitrary functions f_i(x), minimising the KL-divergence
…(Full text truncated)…