arXiv:0809.4086v2 [cs.LG] 8 Jan 2011
SUBMITTED TO IEEE TRANSACTIONS ON INFORMATION THEORY, SEPTEMBER 2008
Learning Hidden Markov Models using
Non-Negative Matrix Factorization
George Cybenko, Fellow, IEEE, and Valentino Crespi, Member, IEEE
Abstract—The Baum-Welch algorithm together with its derivatives and variations has been the main technique for learning Hidden Markov Models (HMM) from observational data. We present an HMM learning algorithm based on the non-negative matrix factorization (NMF) of higher order Markovian statistics that is structurally different from the Baum-Welch and its associated approaches. The described algorithm supports estimation of the number of recurrent states of an HMM and iterates the non-negative matrix factorization (NMF) algorithm to improve the learned HMM parameters. Numerical examples are provided as well.

Index Terms—Hidden Markov Models, machine learning, non-negative matrix factorization.
I. INTRODUCTION
Hidden Markov Models (HMM) have been successfully used to model stochastic systems arising in a variety of applications ranging from biology to engineering to finance [1], [2], [3], [4], [5], [6]. Following accepted notation for representing the parameters and structure of HMM's (see [7], [8], [9], [1], [10] for example), we will use the following terminology and definitions:
1) N is the number of states of the Markov chain underlying the HMM. The state space is S = {S_1, ..., S_N} and the system's state process at time t is denoted by x_t;
2) M is the number of distinct observables or symbols generated by the HMM. The set of possible observables is V = {v_1, ..., v_M} and the observation process at time t is denoted by y_t. We denote by y_{t_1}^{t_2} the subprocess y_{t_1} y_{t_1+1} ... y_{t_2};
3) The joint probabilities
   a_{ij}(k) = P(x_{t+1} = S_j, y_{t+1} = v_k | x_t = S_i)
are the time-invariant probabilities of transitioning to state S_j at time t+1 and emitting observation v_k given that at time t the system was in state S_i. Observation v_k is emitted during the transition from state S_i to state S_j. We use A(k) = (a_{ij}(k)) to denote the matrix of state transition probabilities that emit the same symbol v_k. Note that A = Σ_k A(k) is the stochastic matrix representing the Markov chain state process x_t.
4) The initial state distribution, at time t = 1, is given by Γ = {γ_1, ..., γ_N} where γ_i = P(x_1 = S_i) ≥ 0 and Σ_i γ_i = 1.
G. Cybenko is with the Thayer School of Engineering, Dartmouth College, Hanover, NH 03755 USA, e-mail: gvc@dartmouth.edu.
V. Crespi is with the Department of Computer Science, California State University at Los Angeles, LA 90032 USA, e-mail: vcrespi@calstatela.edu.
Manuscript submitted September 2008.
Collectively, matrices A(k) and Γ completely define the HMM and we say that a model for the HMM is λ = ({A(k) | 1 ≤ k ≤ M}, Γ).
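To make these definitions concrete, the following sketch parameterizes a small HMM by its A(k) matrices and Γ, checks that A = Σ_k A(k) is row-stochastic, and samples an observation sequence. The 2-state, 2-symbol matrices are a hypothetical example, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical model: A[k][i, j] = P(x_{t+1} = S_j, y_{t+1} = v_k | x_t = S_i).
A = np.array([
    [[0.5, 0.1],   # A(1): transitions that emit symbol v_1
     [0.2, 0.1]],
    [[0.3, 0.1],   # A(2): transitions that emit symbol v_2
     [0.3, 0.4]],
])
gamma = np.array([0.6, 0.4])  # initial state distribution Γ

# A = Σ_k A(k) must be a stochastic matrix (each row sums to 1).
assert np.allclose(A.sum(axis=0).sum(axis=1), 1.0)

def sample(model_A, init, T):
    """Draw an observation sequence y_1 .. y_T from λ = ({A(k)}, Γ)."""
    M, N, _ = model_A.shape
    x = rng.choice(N, p=init)          # initial hidden state
    ys = []
    for _ in range(T):
        # Joint distribution over (emitted symbol k, next state j) given x.
        joint = model_A[:, x, :]        # shape (M, N); entries sum to 1
        idx = rng.choice(M * N, p=joint.ravel() / joint.sum())
        k, x = divmod(idx, N)           # recover symbol index and next state
        ys.append(k)
    return ys

print(sample(A, gamma, 10))
```

Note that the symbol is attached to the transition, not to the destination state alone, exactly as in definition 3).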
We present an algorithm for learning an HMM from single or multiple observation sequences. The traditional approach for learning an HMM is the Baum-Welch Algorithm [1] which has been extended in a variety of ways by others [11], [12], [13].
Recently, a novel and promising approach to the HMM approximation problem was proposed by Finesso et al. [14]. That approach is based on Anderson's HMM stochastic realization technique [15] which demonstrates that a positive factorization of a certain Hankel matrix (consisting of observation string probabilities) can be used to recover the hidden Markov model's probability matrices. Finesso and his coauthors used recently developed non-negative matrix factorization (NMF) algorithms [16] to express those stochastic realization techniques as an operational algorithm. Earlier ideas in that vein were anticipated by Upper in 1997 [17], although that work did not benefit from HMM stochastic realization techniques or NMF algorithms, both of which were developed after 1997.
Methods based on stochastic realization techniques, including the one presented here, are fundamentally different from Baum-Welch based methods in that the algorithms take observation sequence probabilities as input, rather than raw observation sequences. Anderson's and Finesso's approaches use system realization methods, while our algorithm is in the spirit of the Myhill-Nerode construction [18] for building automata from languages. In the Myhill-Nerode construction, states are defined as equivalence classes of pasts that produce the same futures. In an HMM, the "future" of a state is a probability distribution over future observations. Following this intuition, we derive our result in a manner that is more concise and elementary than the approaches of Anderson and Finesso.
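The "future of a state" can be computed explicitly from the definitions above: by summing over intermediate state paths, P(y_{t+1} .. y_{t+L} = w | x_t = S_i) = (A(w_1) A(w_2) ... A(w_L) 1)_i, where 1 is the all-ones vector. The sketch below evaluates this distribution for every word of length L, using the same hypothetical 2-state, 2-symbol matrices as before (not from the paper):

```python
import itertools
import numpy as np

# Hypothetical A(k) matrices; M = 2 symbols, N = 2 states.
A = np.array([
    [[0.5, 0.1], [0.2, 0.1]],   # A(1)
    [[0.3, 0.1], [0.3, 0.4]],   # A(2)
])
M, N, _ = A.shape

def future_dist(L):
    """Return P(y_{t+1..t+L} = w | x_t = S_i) for every word w of length L.

    Row r of F corresponds to word words[r]; column i conditions on state S_i.
    P(w | S_i) = (A(w_1) A(w_2) ... A(w_L) 1)_i.
    """
    words = list(itertools.product(range(M), repeat=L))
    F = np.empty((len(words), N))
    for r, w in enumerate(words):
        v = np.ones(N)
        for k in reversed(w):           # apply A(w_L) first, A(w_1) last
            v = A[k] @ v
        F[r] = v
    return words, F

words, F = future_dist(2)
# Words of length L partition the future, so each column sums to 1.
print(F.sum(axis=0))
```

Two states are distinguishable, in the Myhill-Nerode sense, exactly when the corresponding columns of F differ for some L.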
At a conceptual level, our algorithm operates as follows. We first estimate the matrix of an observation sequence's high order statistics. This matrix has a natural non-negative matrix factorization (NMF) [16] which can be interpreted in terms of the probability distribution of future observations given the current state of the underlying
…(Full text truncated)…
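The paper's construction of the statistics matrix is cut off above, but the NMF subroutine the text refers to [16] is commonly realized with multiplicative updates of the Lee-Seung type. The following is a generic sketch of such a factorization on a toy non-negative matrix; the paper's specific statistics matrix and iteration scheme are not reproduced here:

```python
import numpy as np

rng = np.random.default_rng(1)

def nmf(P, r, iters=1000):
    """Lee-Seung multiplicative updates minimizing ||P - W H||_F.

    A generic NMF routine of the kind cited in the text; the updates
    preserve non-negativity of W and H at every step.
    """
    m, n = P.shape
    W = rng.random((m, r))
    H = rng.random((r, n))
    eps = 1e-12                          # guard against division by zero
    for _ in range(iters):
        H *= (W.T @ P) / (W.T @ W @ H + eps)
        W *= (P @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy non-negative "statistics" matrix with exact rank 2.
P = rng.random((6, 2)) @ rng.random((2, 8))
W, H = nmf(P, r=2)
print(np.linalg.norm(P - W @ H))        # reconstruction error
```

In the paper's setting the inner dimension r of the factorization plays the role of the number of recurrent hidden states, which is what makes NMF a natural tool for estimating N.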