
Page 1: Nonparametric hidden Markov models

Nonparametric hidden Markov models

Jurgen Van Gael and Zoubin Ghahramani

Page 2: Nonparametric hidden Markov models

Introduction

- HM models: time series with discrete hidden states
- Infinite HM models (iHMM): a nonparametric Bayesian approach
- Equivalence between the Polya urn and HDP interpretations of the iHMM
- Inference algorithms: collapsed Gibbs sampler, beam sampler
- Use of the iHMM: a simple sequence labeling task

Page 3: Nonparametric hidden Markov models

Introduction

Examples of underlying hidden structure:
- Observed pixels corresponding to objects
- Power-spectra coefficients of a speech signal corresponding to phones
- Price movements of financial instruments corresponding to underlying economic and political events

Models with such underlying hidden variables can be more interpretable and have better predictive properties than models relating the observed variables directly.

The HMM assumes a first-order Markov property on the chain of hidden variables, governed by a K x K transition matrix.

Each observation depends on the hidden state through an observation model F parameterized by a state-dependent parameter.

Choosing the number of states K: the nonparametric Bayesian approach yields a hidden Markov model with a countably infinite number of hidden states.
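
For concreteness, the generative model summarized above can be written as follows (notation assumed here and used throughout these slides: s_t the hidden state, y_t the observation, pi the transition matrix, phi_k the emission parameter of state k):

$$ s_t \mid s_{t-1} \sim \mathrm{Multinomial}\big(\pi_{s_{t-1}}\big), \qquad y_t \mid s_t \sim F\big(\phi_{s_t}\big), \qquad t = 1, \dots, T, $$

with \pi_k denoting the k-th row of the K x K transition matrix and \phi_k the parameter of state k's observation model.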

Page 4: Nonparametric hidden Markov models

From HMMs to Bayesian HMMs

An example of an HMM: speech recognition.
- Hidden state sequence: phones
- Observations: acoustic signals
- The parameters π, φ either come from a physical model of speech or can be learned from recordings of speech.

Computational questions:
1. (π, φ, K) is given: apply Bayes' rule to find the posterior over the hidden variables. The computation can be done by a dynamic programming procedure called the forward-backward algorithm (a sketch follows below).
2. K is given, π, φ are not: apply EM.
3. (π, φ, K) is not given: penalization approaches, etc.
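
A minimal Python sketch of the forward-backward recursion from point 1, assuming a finite K-state HMM with a known initial distribution pi0, transition matrix pi, and per-state observation likelihoods lik (all names here are illustrative, not from the slides):

import numpy as np

def hmm_state_posteriors(pi0, pi, lik):
    """pi0: (K,) initial distribution, pi: (K, K) transition matrix,
    lik: (T, K) observation likelihoods p(y_t | s_t = k).
    Returns the marginal posteriors p(s_t = k | y_{1:T})."""
    T, K = lik.shape
    alpha = np.zeros((T, K))          # forward (filtered) messages, normalized each step
    beta = np.ones((T, K))            # backward messages, normalized each step
    alpha[0] = pi0 * lik[0]
    alpha[0] /= alpha[0].sum()
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ pi) * lik[t]
        alpha[t] /= alpha[t].sum()
    for t in range(T - 2, -1, -1):
        beta[t] = pi @ (lik[t + 1] * beta[t + 1])
        beta[t] /= beta[t].sum()
    post = alpha * beta               # proportional to p(s_t | y_{1:T})
    return post / post.sum(axis=1, keepdims=True)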

Page 5: Nonparametric hidden Markov models

From HMMs to Bayesian HMMs

Fully Bayesian approach: add priors over π and φ and extend the full joint pdf accordingly.

Compute the marginal likelihood, or evidence, for comparing, choosing or averaging over different values of K.

Analytic computation of the marginal likelihood is intractable.
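
Written out (a reconstruction in the notation of these slides, with p(π) and p(φ) the priors just introduced), the evidence whose analytic computation is intractable is

$$ p(y_{1:T} \mid K) \;=\; \int\!\!\int p(\pi)\, p(\phi) \sum_{s_{1:T}} \prod_{t=1}^{T} p(s_t \mid s_{t-1}, \pi)\, p(y_t \mid s_t, \phi) \; \mathrm{d}\pi \, \mathrm{d}\phi , $$

where the coupling between the integrals over the parameters and the sum over exponentially many state sequences is what prevents a closed-form answer.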

Page 6: Nonparametric hidden Markov models

From HMMs to Bayesian HMMs

Methods for dealing with the intractability:
- MCMC 1: estimate the marginal likelihood explicitly, e.g. annealed importance sampling or bridge sampling. Computationally expensive.
- MCMC 2: switch between different values of K, e.g. reversible jump MCMC.
- Approximation using a good state sequence: given the hidden states, independence of the parameters and conjugacy between prior and likelihood mean the marginal likelihood can be computed analytically.
- Variational Bayesian inference: compute a lower bound on the marginal likelihood and apply VB inference.

Page 7: Nonparametric hidden Markov models

Infinite HMM – hierarchical Polya Urn

iHMM: instead of defining K different HMMs, implicitly define a distribution over the number of visited states.

Polya urn with concentration parameter α: add a ball of a new color with probability α / (α + Σ_j n_j) and a ball of an existing color i with probability n_i / (α + Σ_j n_j). This is a nonparametric clustering scheme.

Hierarchical Polya urn:
- Assume a separate urn, Urn(k), for each state k.
- At each time step t, select a ball from the urn of the previous state, Urn(s_{t-1}).
- Interpretation of the transition probability in terms of the number n_ij of balls of color j in the urn of color i: the probability of moving from state i to an existing state j is n_ij / (Σ_j n_ij + α).
- Probability of drawing from the oracle urn instead: α / (Σ_j n_ij + α). The oracle urn shares colors across all the state urns and can introduce new colors (see the simulation sketch below).
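
A small simulation sketch of this hierarchical Polya urn prior over state sequences (a toy illustration of the scheme just described, not code from the chapter; alpha is the per-urn concentration, gamma the oracle urn's concentration, and all names are made up here):

import random
from collections import defaultdict

def draw(counts, weight_new, rng):
    """Return an existing key with probability proportional to its count,
    or None (meaning 'draw something new') with weight `weight_new`."""
    total = sum(counts.values())
    r = rng.uniform(0, total + weight_new)
    for key, n in counts.items():
        r -= n
        if r < 0:
            return key
    return None

def sample_ihmm_states(T, alpha=1.0, gamma=1.0, seed=0):
    """Simulate a state sequence from the hierarchical Polya urn prior."""
    rng = random.Random(seed)
    urns = defaultdict(lambda: defaultdict(int))   # urns[i][j]: balls of color j in the urn of state i
    oracle = defaultdict(int)                      # oracle[j]: balls of color j in the oracle urn
    states, prev, next_color = [], 0, 0            # prev = 0 is an arbitrary initial state for this sketch
    for _ in range(T):
        color = draw(urns[prev], alpha, rng)       # reuse a color from the urn of the previous state, or ...
        if color is None:                          # ... query the oracle urn
            color = draw(oracle, gamma, rng)
            if color is None:                      # the oracle creates a brand-new color (state)
                color = next_color
                next_color += 1
            oracle[color] += 1                     # oracle queries add a ball to the oracle urn as well
        urns[prev][color] += 1
        states.append(color)
        prev = color
    return states

Drawing, say, sample_ihmm_states(1000) typically visits only a modest number of distinct colors, which is exactly the implicit distribution over the number of visited states mentioned above.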

Page 8: Nonparametric hidden Markov models
Page 9: Nonparametric hidden Markov models

Infinite HMM – HDP

Page 10: Nonparametric hidden Markov models

HDP and hierarchical Polya Urn

Set the rows of the transition matrix equal to the sticks of G_j.

G_j corresponds to the urn for the j-th state. Key fact: all urns share the same set of parameters (atoms) via the oracle urn, i.e. the shared base measure.
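
In stick-breaking form this is the usual HDP-HMM construction (stated here for reference, in standard iHMM notation: β are the shared "oracle" sticks with concentration γ, π_k the per-state transition rows with concentration α, H the base distribution over emission parameters):

$$ \beta \sim \mathrm{GEM}(\gamma), \qquad \pi_k \mid \beta \sim \mathrm{DP}(\alpha, \beta), \qquad \phi_k \sim H, $$
$$ s_t \mid s_{t-1} \sim \mathrm{Multinomial}\big(\pi_{s_{t-1}}\big), \qquad y_t \mid s_t \sim F\big(\phi_{s_t}\big). $$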

Page 11: Nonparametric hidden Markov models

Inference

- Gibbs sampler: O(KT^2)
- Approximate Gibbs sampler: O(KT)
- State sequence variables are strongly correlated, leading to slow mixing
- Beam sampler: an auxiliary-variable MCMC algorithm that resamples the whole Markov chain at once, and hence suffers less from slow mixing

Page 12: Nonparametric hidden Markov models

Inference – collapsed Gibbs sampler

Given β and s_{1:T}, the DPs for the individual transition rows become independent. By fixing s_{1:T}, the parameters of the j-th state no longer depend on those of the other states and can be marginalized out.

Page 13: Nonparametric hidden Markov models

Inference – collapsed Gibbs sampler

Sampling s_t: the conditional factors into two terms (reconstructed below).

The conditional likelihood of y_t: available in closed form when the base distribution H is conjugate to the likelihood F.

The second factor is a draw from a Polya urn.
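
A reconstruction of the two factors referred to above, in standard collapsed-sampler notation (s_{-t} denotes all states except s_t, and the counts n are transition counts computed from s_{-t}):

$$ p(s_t = k \mid s_{-t}, \beta, y_{1:T}) \;\propto\; p(y_t \mid s_t = k, s_{-t}, y_{-t}) \;\cdot\; p(s_t = k \mid s_{-t}, \beta), $$

where the first factor is the conditional likelihood of y_t and the second factor is the hierarchical Polya urn predictive, built from terms such as n_{s_{t-1},k} + \alpha\beta_k for the transition into k and an analogous term for the transition k \to s_{t+1} (with small corrections when k coincides with s_{t-1} or s_{t+1}).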

Page 14: Nonparametric hidden Markov models

Inference – collapsed Gibbs sampler

Sampling β: from the Polya urn of the base distribution (the oracle urn).

m_ij: the number of oracle calls that returned a ball with label j when the oracle was queried from state i.

Note on the quantities used when sampling β: n_ij is the number of transitions from i to j, while m_ij is the number of elements of S_ij (the transitions from i to j) that were obtained by querying the oracle.

Complexity: O(TK + K^2). Strong correlation in the sequential data leads to slow mixing behavior.
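
Given these oracle counts, the conditional for the shared sticks takes the standard Dirichlet form used in HDP samplers (a reconstruction; γ denotes the oracle urn's concentration parameter):

$$ \beta \mid m, \gamma \;\sim\; \mathrm{Dirichlet}\big(m_{\cdot 1}, \ldots, m_{\cdot K}, \gamma\big), \qquad m_{\cdot j} = \sum_{i} m_{ij}, $$

where the final component is the stick mass reserved for as-yet-unrepresented states.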

Page 15: Nonparametric hidden Markov models

Inference – Beam sampler

A method for resampling the whole state sequence at once.

The forward-filtering backward-sampling algorithm does not apply directly, because the number of states, and hence the number of potential state trajectories, is infinite.

Introduce auxiliary variables u_{1:T}: conditioned on u_{1:T}, the number of trajectories with non-zero probability is finite. These auxiliary variables do not change the marginal distributions over the other variables, hence MCMC sampling still converges to the true posterior.

Sampling u and π: u_t ~ Uniform(0, π_{s_{t-1} s_t}); each row π_k is independent of the others conditional on β and s_{1:T}.

Page 16: Nonparametric hidden Markov models

Inference – Beam sampler

Compute the forward message p(s_t | y_{1:t}, u_{1:t}) only for the finitely many (s_{t-1}, s_t) pairs with π_{s_{t-1} s_t} > u_t.

Page 17: Nonparametric hidden Markov models

Inference – Beam sampler

Complexity: O(TK^2) when K states are represented.

Remarks: the auxiliary variables need not be sampled from a uniform distribution. A Beta distribution could also be used, to bias the auxiliary variables towards the boundaries of the interval (0, π_{s_{t-1} s_t}).
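
A minimal sketch of one beam-sampling sweep over the state sequence, assuming the transition matrix has already been instantiated as a finite truncation pi that covers every transition whose probability can exceed the smallest slice (all function and variable names here are illustrative, not from the chapter):

import numpy as np

def beam_resample_states(pi0, pi, lik, s, rng):
    """One beam sweep: draw slice variables u, then run forward-filtering
    backward-sampling restricted to transitions above the slice.

    pi0: (K,) initial state distribution, pi: (K, K) transition matrix,
    lik: (T, K) observation likelihoods p(y_t | s_t = k) (assumed > 0),
    s:   (T,) current state sequence (numpy int array)."""
    T, K = lik.shape
    # Slice variables: u_1 ~ U(0, pi0[s_1]), u_t ~ U(0, pi[s_{t-1}, s_t]).
    u0 = rng.uniform(0.0, pi0[s[0]])
    u = rng.uniform(0.0, pi[s[:-1], s[1:]])        # u[t-1] is the slice for the transition into time t
    # Forward filtering: the transition probability is replaced by the
    # indicator pi > u, so only finitely many transitions contribute.
    alpha = np.zeros((T, K))
    alpha[0] = lik[0] * (pi0 > u0)
    alpha[0] /= alpha[0].sum()
    for t in range(1, T):
        allowed = (pi > u[t - 1]).astype(float)
        alpha[t] = lik[t] * (alpha[t - 1] @ allowed)
        alpha[t] /= alpha[t].sum()
    # Backward sampling: draw the whole trajectory jointly given the slices.
    new_s = np.empty(T, dtype=int)
    new_s[T - 1] = rng.choice(K, p=alpha[T - 1])
    for t in range(T - 2, -1, -1):
        w = alpha[t] * (pi[:, new_s[t + 1]] > u[t])
        new_s[t] = rng.choice(K, p=w / w.sum())
    return new_s

In the full iHMM sampler one would keep breaking off new sticks, extending pi and lik, until the leftover transition mass falls below the smallest slice, and then resample π, φ and β given the new sequence (e.g. with rng = np.random.default_rng(0)).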

Page 18: Nonparametric hidden Markov models

Example: unsupervised part-of-speech (PoS) tagging

PoS tagging: annotating the words in a sentence with their appropriate part-of-speech tag.

"The man sat": 'The' is a determiner, 'man' a noun, 'sat' a verb. An HM model is commonly used:
- Observations: words
- Hidden: the unknown PoS tags

Usually learned from a corpus of annotated sentences, but building such a corpus is expensive.

In the iHMM a multinomial likelihood is assumed, with the base distribution H a symmetric Dirichlet so that it is conjugate to the multinomial likelihood.

Trained on section 0 of the WSJ portion of the Penn Treebank: 1917 sentences with a total of 50282 word tokens (observations) and 7904 word types (dictionary size). The sampler was initialized with 50 states and run for 50000 iterations.

Page 19: Nonparametric hidden Markov models

Example: unsupervised part-of-speech (PoS) tagging

Top 5 words for the five most common states. Top line: state ID and frequency; rows: the top 5 words with their frequencies in the sample.
- State 9: the class of prepositions
- State 12: determiners + possessive pronouns
- State 8: punctuation + some coordinating conjunctions
- State 18: nouns
- State 17: personal pronouns

Page 20: Nonparametric hidden Markov models

Beyond the iHMM: input-output (IO) iHMM

A Markov chain affected by external factors. Example: a robot driving around in a room while taking pictures (room index → picture). If the robot follows a particular policy, the robot's actions can be integrated as an input to the iHMM (IO-iHMM), giving a three-dimensional transition matrix indexed by the current input as well as the previous and next states.

Page 21: Nonparametric hidden Markov models

Beyond the iHMM: sticky and block-diagonal iHMM

The weight on the diagonal of the transition matrix controls the frequency of state transitions; the probability of staying in state i for g consecutive steps is geometric, π_ii^g (1 − π_ii).

Sticky iHMM: add extra prior probability mass to the diagonal of the transition matrix (see the prior written out below) and apply dynamic programming based inference. Appropriate for segmentation problems where the number of segments is not known a priori. A parameter κ controls the switching rate.

Block-diagonal iHMM: for grouping of states. The sticky iHMM is the special case of blocks of size 1; larger blocks allow unsupervised clustering of states. Used for unsupervised learning of view-based object models from video data, where each block corresponds to an object. Intuition: temporally contiguous video frames are more likely to correspond to different views of the same object than to different objects.

Hidden semi-Markov model: assume an explicit duration model for the time spent in a particular state.
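
One standard way of writing the extra diagonal mass (as in the sticky HDP-HMM formulation of Fox et al., with κ ≥ 0 the switching-rate parameter mentioned above; a reconstruction, since the slide's formula was not preserved):

$$ \pi_j \mid \beta \;\sim\; \mathrm{DP}\!\left(\alpha + \kappa,\; \frac{\alpha \beta + \kappa \delta_j}{\alpha + \kappa}\right), $$

so that an extra κ of prior mass is placed on the self-transition j → j, and κ = 0 recovers the ordinary iHMM.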

Page 22: Nonparametric hidden Markov models

Beyond the iHMM: iHMM with Pitman-Yor base distribution

Frequency vs. rank of colors (on a log-log scale): the DP is quite specific about the distribution implied by the Polya urn, and the number of colors that appear only once or twice is very small. The Pitman-Yor process allows more control over the tails: it fits a power-law distribution (a straight line in the log-log plot). In most cases the DP can simply be replaced by a Pitman-Yor process, and the chapter gives helpful comments on how the beam sampler carries over.

Page 23: Nonparametric hidden Markov models

Beyond the iHMM: autoregressive iHMM, SLD-iHMM

AR-iHMM: observations follow autoregressive dynamics.

SLD-iHMM (switching linear dynamical system): part of the continuous variables is observed and the unobserved variables follow linear dynamics.

Figures: the SLD model and the FA-HMM model.