Adjusted Viterbi training for hidden Markov models

Reading time: 5 minutes
...

📝 Original Info

  • Title: Adjusted Viterbi training for hidden Markov models
  • ArXiv ID: 0709.2317
  • Date: 2007-09-17
  • Authors: Not explicitly listed in the excerpt; consult the original arXiv entry (0709.2317) for the full author list.

📝 Abstract

To estimate the emission parameters in hidden Markov models one commonly uses the EM algorithm or its variation. Our primary motivation, however, is the Philips speech recognition system wherein the EM algorithm is replaced by the Viterbi training algorithm. Viterbi training is faster and computationally less involved than EM, but it is also biased and need not even be consistent. We propose an alternative to the Viterbi training -- adjusted Viterbi training -- that has the same order of computational complexity as Viterbi training but gives more accurate estimators. Elsewhere, we studied the adjusted Viterbi training for a special case of mixtures, supporting the theory by simulations. This paper proves the adjusted Viterbi training to be also possible for more general hidden Markov models.

📄 Full Content

We consider a set of procedures to estimate the emission parameters of a finite-state hidden Markov model given observations x_1, ..., x_n. Thus, Y is a Markov chain with (finite) state space S, transition matrix P = (p_ij), and initial distribution π. To every state l ∈ S there corresponds an emission distribution P_l with density f_l, known up to the parametrization f_l(x; θ_l). When Y reaches state l, an observation distributed according to P_l, and independent of everything else, is emitted.
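As a concrete illustration of this setup, the minimal sketch below samples from a two-state chain with Gaussian emissions. The state space, transition matrix, initial distribution, and the parametrization θ_l = (mean, std) are illustrative choices for the sketch, not values taken from the paper.

```python
# A minimal sketch of the model just described, assuming Gaussian emissions
# f_l(x; theta_l) with theta_l = (mu_l, sigma_l); all concrete values are illustrative.
import numpy as np

rng = np.random.default_rng(0)

S = [0, 1]                      # finite state space
P = np.array([[0.9, 0.1],       # transition matrix P = (p_ij)
              [0.2, 0.8]])
pi = np.array([0.5, 0.5])       # initial distribution
theta = {0: (-1.0, 1.0),        # emission parameters theta_l = (mean, std)
         1: ( 2.0, 1.0)}

def sample_hmm(n):
    """Draw a hidden path y_1..y_n and observations x_1..x_n."""
    y = np.empty(n, dtype=int)
    x = np.empty(n)
    y[0] = rng.choice(S, p=pi)
    for t in range(1, n):
        y[t] = rng.choice(S, p=P[y[t - 1]])
    for t in range(n):
        mu, sigma = theta[y[t]]
        x[t] = rng.normal(mu, sigma)   # emitted independently given y_t
    return y, x

y, x = sample_hmm(1000)
```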

The standard method for finding the maximum likelihood estimator of the emission parameters θ_l is the EM algorithm, which in the present context is also known as the Baum-Welch or forward-backward algorithm [1,2,8,9,18,19]. Since the EM algorithm can in practice be slow and computationally expensive, one seeks reasonable alternatives. One such alternative is Viterbi training (VT). VT is used in speech recognition [8,15,19,20,21,22], natural language modeling [16], image analysis [14], and bioinformatics [5,17]. We are also motivated by connections with constrained vector quantization [4,6]. The basic idea behind VT is to replace the computationally costly expectation (E) step of the EM algorithm by an appropriate maximization step with fewer and simpler computations. In speech recognition, essentially the same training procedure was already described by L. Rabiner et al. in [10,20] (see also [18,19]). Rabiner considered this procedure as a variation of the Lloyd algorithm used in vector quantization, referring to Viterbi training as segmental K-means training. The analogy with vector quantization is especially pronounced when the underlying chain is simply a sequence of i.i.d. variables, so that the observations form an i.i.d. sample from a mixture distribution. For such mixture models, VT was also described by R. Gray et al. in [4], where the training algorithm was considered in the vector quantization context under the name of entropy-constrained vector quantization (ECVQ).

The VT algorithm for estimation of the emission parameters of the hidden Markov model can be described as follows. Using some initial values for the parameters, find a realization of Y that maximizes the likelihood of the given observations. Such an n-tuple of states is called a Viterbi alignment. Every Viterbi alignment partitions the sample into subsamples corresponding to the states appearing in the alignment. A subsample corresponding to state l is regarded as an i.i.d. sample from P_l and is used to find μ_l, the maximum likelihood estimate of θ_l. These estimates are then used to find an alignment in the next step of the training, and so on. It can be shown that in general this procedure converges in finitely many steps; it is also usually much faster than the EM algorithm.
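The loop just described can be sketched as follows, assuming the Gaussian-emission model from the sampler above and treating the transition matrix and initial distribution as known, so that only the emission parameters are re-estimated. Function names and the stopping rule (a fixed number of iterations) are illustrative, not the paper's notation.

```python
# A minimal sketch of Viterbi training (alignment step + per-subsample ML step),
# reusing x, P, pi from the sampling sketch above.
import numpy as np
from scipy.stats import norm

def viterbi(x, P, pi, theta):
    """Most likely state sequence (Viterbi alignment) for observations x."""
    n, K = len(x), len(pi)
    log_emit = np.array([norm.logpdf(x, theta[l][0], theta[l][1]) for l in range(K)]).T
    logP = np.log(P)
    delta = np.zeros((n, K))
    back = np.zeros((n, K), dtype=int)
    delta[0] = np.log(pi) + log_emit[0]
    for t in range(1, n):
        scores = delta[t - 1][:, None] + logP      # scores[i, j]: best ending in i, moving to j
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_emit[t]
    path = np.empty(n, dtype=int)
    path[-1] = delta[-1].argmax()
    for t in range(n - 2, -1, -1):                 # backtrack the optimal path
        path[t] = back[t + 1, path[t + 1]]
    return path

def viterbi_training(x, P, pi, theta0, n_iter=20):
    theta = dict(theta0)
    for _ in range(n_iter):
        path = viterbi(x, P, pi, theta)            # alignment with current parameters
        for l in range(len(pi)):
            sub = x[path == l]                     # subsample assigned to state l
            if len(sub) > 1:
                theta[l] = (sub.mean(), sub.std()) # Gaussian ML estimate on the subsample
    return theta

theta_hat = viterbi_training(x, P, pi, {0: (0.0, 1.0), 1: (1.0, 1.0)})
```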

Although VT is computationally feasible and converges fast, it has a significant disadvantage: the obtained estimators need not be (local) maximum likelihood estimators; moreover, they are generally biased and inconsistent. (VT does not necessarily increase the likelihood; it is, however, an ascent algorithm maximizing a certain other objective function.) Despite this deficiency, speech recognition experiments do not show any significant degradation of recognition performance when the EM algorithm is replaced by VT. There appears to be no explanation of this phenomenon other than the “curse of complexity” of the overall HMM-based speech recognition system. This paper considers VT largely outside the speech recognition context. We regard the VT procedure merely as a parameter estimation method, and we address the following question: Is it possible to adjust VT in such a way that the adjusted training retains the attractive properties of VT (fast convergence and computational feasibility) while its estimators are, at the same time, “more accurate” than those of the baseline procedure? In particular, we focus on a special property of the EM algorithm that VT lacks. This property ensures that the true parameters are asymptotically a fixed point of the algorithm. In other words, for a sufficiently large sample, the EM algorithm “recognizes” the true parameters and does not change them much. VT does not have this property; even when the initial parameters are correct (and n is arbitrarily large), an iteration of the training procedure would in general disturb them. We thus attempt to modify VT in order to make the true parameters an asymptotic fixed point of the resulting algorithm.
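The fixed-point deficiency can be probed numerically with the sketches above: start a single VT iteration at the true parameters on a long sample and measure how far they move. This is only an illustration under the assumptions of the earlier sketches, not an experiment reported in the paper.

```python
# Starting VT at the true emission parameters and running one iteration on a
# long sample generally moves them, exposing the asymptotic bias discussed above.
y_true, x_long = sample_hmm(50_000)
theta_after_one = viterbi_training(x_long, P, pi, theta, n_iter=1)
for l in S:
    drift = abs(theta_after_one[l][0] - theta[l][0])
    print(f"state {l}: mean drift after one VT step = {drift:.3f}")
# For the EM algorithm this drift would vanish as n grows; for plain VT it does
# not, which is exactly what adjusted Viterbi training is designed to correct.
```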

In accomplishing this task it is crucial to understand the asymptotic behavior of P^n_l, the empirical measures corresponding to the subsamples obtained from the alignment. These measures depend on the set of parameters used by the alignment, and in order for the true parameters to be asymptotically fixed by (adjusted) VT, the following must hold: if P^n_l is obtained by the alignment with the true parameters, and n is sufficiently large, then μ_l, the estimator obtained from this subsample, should be close to the true value of θ_l.

Reference

This content is AI-processed based on open access ArXiv data.
