
Page 1: Nonparametric hidden Markov models

Nonparametric hidden Markov models

Jurgen Van Gael and Zoubin Ghahramani

Page 2: Nonparametric hidden Markov models

Introduction

- HM models: time series with discrete hidden states
- Infinite HM models (iHMM): a nonparametric Bayesian approach
- Equivalence between the Polya urn and HDP interpretations of the iHMM
- Inference algorithms: collapsed Gibbs sampler, beam sampler
- Use of the iHMM: a simple sequence labeling task

Page 3: Nonparametric hidden Markov models

Introduction

Examples of underlying hidden structure:
- Observed pixels corresponding to objects
- Power-spectra coefficients of a speech signal corresponding to phones
- Price movements of financial instruments corresponding to underlying economic and political events

Models with such underlying hidden variables can be more interpretable and have better predictive properties than models relating the observed variables directly.

The HMM assumes a first-order Markov property on the chain of hidden variables, governed by a K x K transition matrix.

Each observation depends on the hidden state through an observation model F parameterized by a state-dependent parameter.

Choosing the number of states K: the nonparametric Bayesian approach yields a hidden Markov model with a countably infinite number of hidden states.
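
For concreteness, the generative model summarized above can be written as follows (notation assumed here and used throughout these slides: s_t the hidden state, y_t the observation, pi the transition matrix, phi_k the emission parameter of state k):

$$ s_t \mid s_{t-1} \sim \mathrm{Multinomial}\big(\pi_{s_{t-1}}\big), \qquad y_t \mid s_t \sim F\big(\phi_{s_t}\big), \qquad t = 1, \dots, T, $$

with \pi_k denoting the k-th row of the K x K transition matrix and \phi_k the parameter of state k's observation model.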

Page 4: Nonparametric hidden Markov models

From HMMs to Bayesian HMMs

An example of an HMM: speech recognition.
- Hidden state sequence: phones
- Observations: acoustic signals
- The parameters π, φ either come from a physical model of speech or can be learned from recordings of speech.

Computational questions:
1. (π, φ, K) is given: apply Bayes' rule to find the posterior over the hidden variables. The computation can be done by a dynamic programming procedure called the forward-backward algorithm (a sketch follows below).
2. K is given, π, φ are not: apply EM.
3. (π, φ, K) is not given: penalization approaches, etc.
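
A minimal Python sketch of the forward-backward recursion from point 1, assuming a finite K-state HMM with a known initial distribution pi0, transition matrix pi, and per-state observation likelihoods lik (all names here are illustrative, not from the slides):

import numpy as np

def hmm_state_posteriors(pi0, pi, lik):
    """pi0: (K,) initial distribution, pi: (K, K) transition matrix,
    lik: (T, K) observation likelihoods p(y_t | s_t = k).
    Returns the marginal posteriors p(s_t = k | y_{1:T})."""
    T, K = lik.shape
    alpha = np.zeros((T, K))          # forward (filtered) messages, normalized each step
    beta = np.ones((T, K))            # backward messages, normalized each step
    alpha[0] = pi0 * lik[0]
    alpha[0] /= alpha[0].sum()
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ pi) * lik[t]
        alpha[t] /= alpha[t].sum()
    for t in range(T - 2, -1, -1):
        beta[t] = pi @ (lik[t + 1] * beta[t + 1])
        beta[t] /= beta[t].sum()
    post = alpha * beta               # proportional to p(s_t | y_{1:T})
    return post / post.sum(axis=1, keepdims=True)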

Page 5: Nonparametric hidden Markov models

From HMMs to Bayesian HMMs

Fully Bayesian approach: add priors over π and φ and extend the full joint pdf accordingly.

Compute the marginal likelihood, or evidence, for comparing, choosing or averaging over different values of K.

Analytic computation of the marginal likelihood is intractable.
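
Written out (a reconstruction in the notation of these slides, with p(π) and p(φ) the priors just introduced), the evidence whose analytic computation is intractable is

$$ p(y_{1:T} \mid K) \;=\; \int\!\!\int p(\pi)\, p(\phi) \sum_{s_{1:T}} \prod_{t=1}^{T} p(s_t \mid s_{t-1}, \pi)\, p(y_t \mid s_t, \phi) \; \mathrm{d}\pi \, \mathrm{d}\phi , $$

where the coupling between the integrals over the parameters and the sum over exponentially many state sequences is what prevents a closed-form answer.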

Page 6: Nonparametric hidden Markov models

From HMMs to Bayesian HMMs

Methods for dealing with the intractability:
- MCMC 1: estimate the marginal likelihood explicitly, e.g. annealed importance sampling or bridge sampling. Computationally expensive.
- MCMC 2: switch between different values of K, e.g. reversible jump MCMC.
- Approximation using a good state sequence: given the hidden states, independence of the parameters and conjugacy between prior and likelihood mean the marginal likelihood can be computed analytically.
- Variational Bayesian inference: compute a lower bound on the marginal likelihood and apply VB inference.

Page 7: Nonparametric hidden Markov models

Infinite HMM – hierarchical Polya Urn

iHMM: instead of defining K different HMMs, implicitly define a distribution over the number of visited states.

Polya urn with concentration parameter α: add a ball of a new color with probability α / (α + Σ_j n_j) and a ball of an existing color i with probability n_i / (α + Σ_j n_j). This is a nonparametric clustering scheme.

Hierarchical Polya urn:
- Assume a separate urn, Urn(k), for each state k.
- At each time step t, select a ball from the urn of the previous state, Urn(s_{t-1}).
- Interpretation of the transition probability in terms of the number n_ij of balls of color j in the urn of color i: the probability of moving from state i to an existing state j is n_ij / (Σ_j n_ij + α).
- Probability of drawing from the oracle urn instead: α / (Σ_j n_ij + α). The oracle urn shares colors across all the state urns and can introduce new colors (see the simulation sketch below).
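
A small simulation sketch of this hierarchical Polya urn prior over state sequences (a toy illustration of the scheme just described, not code from the chapter; alpha is the per-urn concentration, gamma the oracle urn's concentration, and all names are made up here):

import random
from collections import defaultdict

def draw(counts, weight_new, rng):
    """Return an existing key with probability proportional to its count,
    or None (meaning 'draw something new') with weight `weight_new`."""
    total = sum(counts.values())
    r = rng.uniform(0, total + weight_new)
    for key, n in counts.items():
        r -= n
        if r < 0:
            return key
    return None

def sample_ihmm_states(T, alpha=1.0, gamma=1.0, seed=0):
    """Simulate a state sequence from the hierarchical Polya urn prior."""
    rng = random.Random(seed)
    urns = defaultdict(lambda: defaultdict(int))   # urns[i][j]: balls of color j in the urn of state i
    oracle = defaultdict(int)                      # oracle[j]: balls of color j in the oracle urn
    states, prev, next_color = [], 0, 0            # prev = 0 is an arbitrary initial state for this sketch
    for _ in range(T):
        color = draw(urns[prev], alpha, rng)       # reuse a color from the urn of the previous state, or ...
        if color is None:                          # ... query the oracle urn
            color = draw(oracle, gamma, rng)
            if color is None:                      # the oracle creates a brand-new color (state)
                color = next_color
                next_color += 1
            oracle[color] += 1                     # oracle queries add a ball to the oracle urn as well
        urns[prev][color] += 1
        states.append(color)
        prev = color
    return states

Drawing, say, sample_ihmm_states(1000) typically visits only a modest number of distinct colors, which is exactly the implicit distribution over the number of visited states mentioned above.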

Page 8: Nonparametric hidden Markov models
Page 9: Nonparametric hidden Markov models

Infinite HMM – HDP

Page 10: Nonparametric hidden Markov models

HDP and hierarchical Polya Urn

Set the rows of the transition matrix equal to the sticks of G_j.

G_j corresponds to the urn for the j-th state. Key fact: all urns share the same set of parameters (atoms) via the oracle urn, i.e. the shared base measure.
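
In stick-breaking form this is the usual HDP-HMM construction (stated here for reference, in standard iHMM notation: β are the shared "oracle" sticks with concentration γ, π_k the per-state transition rows with concentration α, H the base distribution over emission parameters):

$$ \beta \sim \mathrm{GEM}(\gamma), \qquad \pi_k \mid \beta \sim \mathrm{DP}(\alpha, \beta), \qquad \phi_k \sim H, $$
$$ s_t \mid s_{t-1} \sim \mathrm{Multinomial}\big(\pi_{s_{t-1}}\big), \qquad y_t \mid s_t \sim F\big(\phi_{s_t}\big). $$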

Page 11: Nonparametric hidden Markov models

Inference

- Gibbs sampler: O(KT^2)
- Approximate Gibbs sampler: O(KT)
- State sequence variables are strongly correlated, leading to slow mixing
- Beam sampler: an auxiliary-variable MCMC algorithm that resamples the whole Markov chain at once, and hence suffers less from slow mixing

Page 12: Nonparametric hidden Markov models

Inference – collapsed Gibbs sampler

Given β and s_{1:T}, the DPs for the individual transition rows become independent. By fixing s_{1:T}, the parameters of the j-th state no longer depend on those of the other states and can be marginalized out.

Page 13: Nonparametric hidden Markov models

Inference – collapsed Gibbs sampler

Sampling s_t: the conditional factors into two terms (reconstructed below).

The conditional likelihood of y_t: available in closed form when the base distribution H is conjugate to the likelihood F.

The second factor is a draw from a Polya urn.
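
A reconstruction of the two factors referred to above, in standard collapsed-sampler notation (s_{-t} denotes all states except s_t, and the counts n are transition counts computed from s_{-t}):

$$ p(s_t = k \mid s_{-t}, \beta, y_{1:T}) \;\propto\; p(y_t \mid s_t = k, s_{-t}, y_{-t}) \;\cdot\; p(s_t = k \mid s_{-t}, \beta), $$

where the first factor is the conditional likelihood of y_t and the second factor is the hierarchical Polya urn predictive, built from terms such as n_{s_{t-1},k} + \alpha\beta_k for the transition into k and an analogous term for the transition k \to s_{t+1} (with small corrections when k coincides with s_{t-1} or s_{t+1}).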

Page 14: Nonparametric hidden Markov models

Inference – collapsed Gibbs sampler

Sampling β: from the Polya urn of the base distribution (the oracle urn).

m_ij: the number of oracle calls that returned a ball with label j when the oracle was queried from state i.

Note on the quantities used when sampling β: n_ij is the number of transitions from i to j, while m_ij is the number of elements of S_ij (the transitions from i to j) that were obtained by querying the oracle.

Complexity: O(TK + K^2). Strong correlation in the sequential data leads to slow mixing behavior.
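
Given these oracle counts, the conditional for the shared sticks takes the standard Dirichlet form used in HDP samplers (a reconstruction; γ denotes the oracle urn's concentration parameter):

$$ \beta \mid m, \gamma \;\sim\; \mathrm{Dirichlet}\big(m_{\cdot 1}, \ldots, m_{\cdot K}, \gamma\big), \qquad m_{\cdot j} = \sum_{i} m_{ij}, $$

where the final component is the stick mass reserved for as-yet-unrepresented states.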

Page 15: Nonparametric hidden Markov models

Inference – Beam sampler

A method for resampling the whole state sequence at once.

The forward-filtering backward-sampling algorithm does not apply directly, because the number of states, and hence the number of potential state trajectories, is infinite.

Introduce auxiliary variables u_{1:T}: conditioned on u_{1:T}, the number of trajectories with non-zero probability is finite. These auxiliary variables do not change the marginal distributions over the other variables, hence MCMC sampling still converges to the true posterior.

Sampling u and π: u_t ~ Uniform(0, π_{s_{t-1} s_t}); each row π_k is independent of the others conditional on β and s_{1:T}.

Page 16: Nonparametric hidden Markov models

Inference – Beam sampler

Compute the forward message p(s_t | y_{1:t}, u_{1:t}) only for the finitely many (s_{t-1}, s_t) pairs with π_{s_{t-1} s_t} > u_t.

Page 17: Nonparametric hidden Markov models

Inference – Beam sampler

Complexity: O(TK^2) when K states are represented.

Remarks: the auxiliary variables need not be sampled from a uniform distribution. A Beta distribution could also be used, to bias the auxiliary variables towards the boundaries of the interval (0, π_{s_{t-1} s_t}).
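
A minimal sketch of one beam-sampling sweep over the state sequence, assuming the transition matrix has already been instantiated as a finite truncation pi that covers every transition whose probability can exceed the smallest slice (all function and variable names here are illustrative, not from the chapter):

import numpy as np

def beam_resample_states(pi0, pi, lik, s, rng):
    """One beam sweep: draw slice variables u, then run forward-filtering
    backward-sampling restricted to transitions above the slice.

    pi0: (K,) initial state distribution, pi: (K, K) transition matrix,
    lik: (T, K) observation likelihoods p(y_t | s_t = k) (assumed > 0),
    s:   (T,) current state sequence (numpy int array)."""
    T, K = lik.shape
    # Slice variables: u_1 ~ U(0, pi0[s_1]), u_t ~ U(0, pi[s_{t-1}, s_t]).
    u0 = rng.uniform(0.0, pi0[s[0]])
    u = rng.uniform(0.0, pi[s[:-1], s[1:]])        # u[t-1] is the slice for the transition into time t
    # Forward filtering: the transition probability is replaced by the
    # indicator pi > u, so only finitely many transitions contribute.
    alpha = np.zeros((T, K))
    alpha[0] = lik[0] * (pi0 > u0)
    alpha[0] /= alpha[0].sum()
    for t in range(1, T):
        allowed = (pi > u[t - 1]).astype(float)
        alpha[t] = lik[t] * (alpha[t - 1] @ allowed)
        alpha[t] /= alpha[t].sum()
    # Backward sampling: draw the whole trajectory jointly given the slices.
    new_s = np.empty(T, dtype=int)
    new_s[T - 1] = rng.choice(K, p=alpha[T - 1])
    for t in range(T - 2, -1, -1):
        w = alpha[t] * (pi[:, new_s[t + 1]] > u[t])
        new_s[t] = rng.choice(K, p=w / w.sum())
    return new_s

In the full iHMM sampler one would keep breaking off new sticks, extending pi and lik, until the leftover transition mass falls below the smallest slice, and then resample π, φ and β given the new sequence (e.g. with rng = np.random.default_rng(0)).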

Page 18: Nonparametric hidden Markov models

Example: unsupervised part-of-speech (PoS) tagging

PoS tagging: annotating the words in a sentence with their appropriate part-of-speech tag.

"The man sat": 'The' is a determiner, 'man' a noun, 'sat' a verb. An HM model is commonly used:
- Observations: words
- Hidden: the unknown PoS tags

Usually learned from a corpus of annotated sentences, but building such a corpus is expensive.

In the iHMM a multinomial likelihood is assumed, with the base distribution H a symmetric Dirichlet so that it is conjugate to the multinomial likelihood.

Trained on section 0 of the WSJ portion of the Penn Treebank: 1917 sentences with a total of 50282 word tokens (observations) and 7904 word types (dictionary size). The sampler was initialized with 50 states and run for 50000 iterations.

Page 19: Nonparametric hidden Markov models

Example: unsupervised part-of-speech (PoS) tagging

Top 5 words for the five most common states. Top line: state ID and frequency; rows: the top 5 words with their frequencies in the sample.
- State 9: the class of prepositions
- State 12: determiners + possessive pronouns
- State 8: punctuation + some coordinating conjunctions
- State 18: nouns
- State 17: personal pronouns

Page 20: Nonparametric hidden Markov models

Beyond the iHMM: input-output (IO) iHMM

A Markov chain affected by external factors. Example: a robot driving around in a room while taking pictures (room index → picture). If the robot follows a particular policy, the robot's actions can be integrated as an input to the iHMM (IO-iHMM), giving a three-dimensional transition matrix indexed by the current input as well as the previous and next states.

Page 21: Nonparametric hidden Markov models

Beyond the iHMM: sticky and block-diagonal iHMM

The weight on the diagonal of the transition matrix controls the frequency of state transitions; the probability of staying in state i for g consecutive steps is geometric, π_ii^g (1 − π_ii).

Sticky iHMM: add extra prior probability mass to the diagonal of the transition matrix (see the prior written out below) and apply dynamic programming based inference. Appropriate for segmentation problems where the number of segments is not known a priori. A parameter κ controls the switching rate.

Block-diagonal iHMM: for grouping of states. The sticky iHMM is the special case of blocks of size 1; larger blocks allow unsupervised clustering of states. Used for unsupervised learning of view-based object models from video data, where each block corresponds to an object. Intuition: temporally contiguous video frames are more likely to correspond to different views of the same object than to different objects.

Hidden semi-Markov model: assume an explicit duration model for the time spent in a particular state.
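
One standard way of writing the extra diagonal mass (as in the sticky HDP-HMM formulation of Fox et al., with κ ≥ 0 the switching-rate parameter mentioned above; a reconstruction, since the slide's formula was not preserved):

$$ \pi_j \mid \beta \;\sim\; \mathrm{DP}\!\left(\alpha + \kappa,\; \frac{\alpha \beta + \kappa \delta_j}{\alpha + \kappa}\right), $$

so that an extra κ of prior mass is placed on the self-transition j → j, and κ = 0 recovers the ordinary iHMM.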

Page 22: Nonparametric hidden Markov models

Beyond the iHMM: iHMM with Pitman-Yor base distribution

Frequency vs. rank of colors (on a log-log scale): the DP is quite specific about the distribution implied by the Polya urn, and the number of colors that appear only once or twice is very small. The Pitman-Yor process allows more control over the tails: it fits a power-law distribution (a straight line in the log-log plot). In most cases the DP can simply be replaced by a Pitman-Yor process, and the chapter gives helpful comments on how the beam sampler carries over.

Page 23: Nonparametric hidden Markov models

Beyond the iHMM: autoregressive iHMM, SLD-iHMM

AR-iHMM: observations follow autoregressive dynamics.

SLD-iHMM (switching linear dynamical system): part of the continuous variables is observed and the unobserved variables follow linear dynamics.

Figures: the SLD model and the FA-HMM model.