101035 中文信息处理
Chinese NLP
Lecture 7
词——词性标注（2） Part-of-Speech Tagging (2)
• 统计模型的训练（Training a statistical model）
• 马尔可夫链（Markov chain）
• 隐马尔可夫模型（Hidden Markov Model, or HMM）
• 隐马尔可夫标注算法（HMM POS tagging）
统计模型的训练 Training a Statistical Model
• Back to POS tagging
• Given a word sequence w_1^n, decide its best POS sequence t̂_1^n among all possible t_1^n:
  t̂_1^n = argmax_{t_1^n} P(t_1^n | w_1^n)
• Bayes Rule:
  t̂_1^n = argmax_{t_1^n} P(w_1^n | t_1^n) · P(t_1^n) / P(w_1^n) = argmax_{t_1^n} P(w_1^n | t_1^n) · P(t_1^n),
  where P(w_1^n | t_1^n) is the likelihood and P(t_1^n) is the prior. The denominator P(w_1^n) is the same for every tag sequence, so it can be dropped.
• Computing Probabilities
• Simplifying assumptions (bigram tag transitions; each word depends only on its own tag):
  P(t_1^n) ≈ ∏_{i=1..n} P(t_i | t_{i-1})    P(w_1^n | t_1^n) ≈ ∏_{i=1..n} P(w_i | t_i)
• Counts from corpus:
  P(t_i | t_{i-1}) = C(t_{i-1}, t_i) / C(t_{i-1})    P(w_i | t_i) = C(t_i, w_i) / C(t_i)
The above probability computation is oversimplified. Consult the textbook about deleted interpolation and other smoothing methods for better probability computation.
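To make the count-based estimates concrete, here is a minimal sketch of maximum-likelihood estimation from a tagged corpus. The two-sentence corpus is invented for illustration and is far too small for real use (which is exactly why smoothing matters):

```python
from collections import Counter

# Hypothetical toy tagged corpus: each sentence is a list of (word, tag) pairs.
corpus = [
    [("the", "DT"), ("race", "NN"), ("ended", "VBD")],
    [("to", "TO"), ("race", "VB"), ("tomorrow", "NR")],
]

tag_bigrams, tag_counts, emissions = Counter(), Counter(), Counter()
for sent in corpus:
    tags = [t for _, t in sent]
    for w, t in sent:
        emissions[(t, w)] += 1      # C(t_i, w_i)
        tag_counts[t] += 1          # C(t_i)
    for prev, cur in zip(tags, tags[1:]):
        tag_bigrams[(prev, cur)] += 1   # C(t_{i-1}, t_i)

def p_tag(cur, prev):
    # P(t_i | t_{i-1}) = C(t_{i-1}, t_i) / C(t_{i-1})
    return tag_bigrams[(prev, cur)] / tag_counts[prev]

def p_word(word, tag):
    # P(w_i | t_i) = C(t_i, w_i) / C(t_i)
    return emissions[(tag, word)] / tag_counts[tag]

print(p_tag("VB", "TO"))     # 1.0 in this tiny corpus
print(p_word("race", "VB"))  # 1.0
```

With so little data most estimates saturate at 0 or 1, which is the oversimplification the smoothing methods mentioned above are meant to fix.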
• Using Probabilities for POS Tagging
• Example
What POS is "race" in "Secretariat is expected to race tomorrow"?
• Using Probabilities for POS Tagging
• Example
• Using the (87-tag) Brown corpus, we get
P(NN|TO) = 0.00047 P(VB|TO) = 0.83
P(race|NN) = 0.00057 P(race|VB) = 0.00012
P(NR|VB) = 0.0027 P(NR|NN) = 0.0012
• Compare
P(VB|TO) × P(NR|VB) × P(race|VB) = 0.00000027
P(NN|TO) × P(NR|NN) × P(race|NN) = 0.00000000032
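The comparison can be checked by direct multiplication of the Brown-corpus probabilities quoted above:

```python
# Brown-corpus probabilities quoted on this slide
P_VB_TO, P_NN_TO = 0.83, 0.00047
P_race_VB, P_race_NN = 0.00012, 0.00057
P_NR_VB, P_NR_NN = 0.0027, 0.0012

vb = P_VB_TO * P_NR_VB * P_race_VB   # score of tagging race as VB
nn = P_NN_TO * P_NR_NN * P_race_NN   # score of tagging race as NN

print(f"{vb:.2e}")   # 2.69e-07
print(f"{nn:.2e}")   # 3.21e-10
print(vb > nn)       # True: race is tagged VB
```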
马尔可夫链 Markov Chain
• Definition
• A Markov chain is a special case of a weighted automaton in which the input sequence uniquely determines which states the automaton will go through.
• Graphical Model Representation
• A set of N states: Q = q1q2 … qN
• A transition probability matrix: A = a_01 a_02 … a_n1 … a_nn, each a_ij representing the probability of moving from state i to state j
• A special start state and end state: q0, qF
• Alternatively, we use an initial probability distribution over states:
π = π_1 π_2 … π_N, where π_i is the probability that the Markov chain will start in state i.
• What is Special about Markov Chains
• A Markov chain can’t represent inherently ambiguous problems, so it is only useful for assigning probabilities to unambiguous sequences.
• A Markov chain is not suitable for POS tagging, because the states (POS) cannot be directly observed.
• Markov assumption: P(q_i | q_1 … q_{i-1}) = P(q_i | q_{i-1})
In-Class Exercise
• Using the following Markov chain, compute the probability of the sequence: {cold hot cold hot}
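The exercise's Markov chain figure is not reproduced in this text, so the sketch below uses assumed start and transition probabilities purely to illustrate the computation (chain rule plus the Markov assumption):

```python
# Assumed parameters -- the figure's actual numbers are not available here.
start = {"hot": 0.5, "cold": 0.5}
trans = {("hot", "hot"): 0.6, ("hot", "cold"): 0.4,
         ("cold", "hot"): 0.3, ("cold", "cold"): 0.7}

def chain_prob(seq):
    # P(q_1 ... q_T) = pi(q_1) * prod_t P(q_t | q_{t-1})
    p = start[seq[0]]
    for a, b in zip(seq, seq[1:]):
        p *= trans[(a, b)]
    return p

print(chain_prob(["cold", "hot", "cold", "hot"]))  # 0.5*0.3*0.4*0.3 = 0.018
```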
隐马尔可夫模型 Hidden Markov Model
• Markov Chain vs HMM
• A Markov chain is useful when we need to compute a probability for a sequence of events that we can observe in the world.
• HMM allows us to talk about both observed events (like words that we see in the input) and hidden events (like POS tags).
• HMM Components
• A set of N states: Q = q1q2 … qN
• A transition probability matrix: A = a_11 a_12 … a_n1 … a_nn
• A sequence of T observations: O = o1o2 … oT
• A sequence of observation likelihoods, or emission probabilities:
B = b_i(o_t), the probability of observation o_t being generated from state i
• A special start state and end state: q0, qF
• Alternatively, we use an initial probability distribution over states:
π = π_1 π_2 … π_N, where π_i is the probability that the Markov chain will start in state i.
• HMM Assumptions
• Markov assumption
• Output independence assumption
• Fundamental Problems
• Computing likelihood: Given an HMM λ = (A,B) and an observation sequence O, determine the likelihood P(O|λ).
• Decoding: Given an observation sequence O and an HMM λ = (A,B), discover the best hidden state sequence Q.
• Learning: Given an observation sequence O and the set of states in the HMM, learn the HMM parameters A and B.
• A Running Example
• Jason eating ice cream on a sequence of days
There is some relation between the weather state (hot, cold) on a day and the number of ice creams eaten that day.
An integer represents the number of ice creams eaten on a given day (observed), and a sequence of H and C designates the weather states (hidden) that caused Jason to eat that much ice cream.
• Computing Likelihood
• Given an HMM model, what is the likelihood of {3, 1, 3}? Note that we do not know the hidden states (weather).
• Forward algorithm (a kind of dynamic programming)
• αt(j) represents the probability of being in state j after seeing the first t observations, given the model λ.
α_t(j) = Σ_{i=1..N} α_{t-1}(i) · a_ij · b_j(o_t)
  α_{t-1}(i): the previous forward path probability from the previous time step
  a_ij: the transition probability from previous state q_i to current state q_j
  b_j(o_t): the state observation likelihood of the observation symbol o_t given the current state j
• Computing Likelihood
• Algorithm
• Initialization: α_1(j) = a_0j · b_j(o_1),  1 ≤ j ≤ N
• Recursion: α_t(j) = Σ_{i=1..N} α_{t-1}(i) · a_ij · b_j(o_t),  1 ≤ j ≤ N, 1 < t ≤ T
• Termination: P(O | λ) = α_T(q_F) = Σ_{i=1..N} α_T(i) · a_iF
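A minimal sketch of the forward algorithm on the ice cream example. The slides' HMM parameters are in a figure that is not reproduced here, so the numbers below (π, A, B) are assumed for illustration only:

```python
# Assumed ice cream HMM: H = hot day, C = cold day.
states = ["H", "C"]
pi = {"H": 0.8, "C": 0.2}                                  # start distribution
A = {("H","H"): 0.6, ("H","C"): 0.4,
     ("C","H"): 0.5, ("C","C"): 0.5}                       # transitions
B = {"H": {1: 0.2, 2: 0.4, 3: 0.4},
     "C": {1: 0.5, 2: 0.4, 3: 0.1}}                        # emissions

def forward(obs):
    # alpha[j] = probability of the observation prefix, ending in state j
    alpha = {j: pi[j] * B[j][obs[0]] for j in states}      # initialization
    for o in obs[1:]:                                      # recursion
        alpha = {j: sum(alpha[i] * A[(i, j)] for i in states) * B[j][o]
                 for j in states}
    return sum(alpha.values())                             # termination: P(O|lambda)

print(forward([3, 1, 3]))
```

Each time step sums over all ways of reaching each state, which is what keeps the cost linear in T instead of exponential.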
• Decoding
• Given an HMM model and an ice cream sequence {3, 1, 3}, what is the most likely hidden weather state sequence?
• Viterbi algorithm (a kind of dynamic programming)
• v_t(j) represents the probability that the HMM is in state j after seeing the first t observations and passing through the most probable state sequence q_0, q_1, …, q_{t-1}, given the model λ.
v_t(j) = max_{i=1..N} v_{t-1}(i) · a_ij · b_j(o_t)
  v_{t-1}(i): the previous Viterbi path probability from the previous time step
  a_ij: the transition probability from previous state q_i to current state q_j
  b_j(o_t): the state observation likelihood of the observation symbol o_t given the current state j
• Decoding
• Algorithm
• Initialization: v_1(j) = a_0j · b_j(o_1);  bt_1(j) = 0
• Recursion: v_t(j) = max_{i=1..N} v_{t-1}(i) · a_ij · b_j(o_t);  bt_t(j) = argmax_{i=1..N} v_{t-1}(i) · a_ij
• Termination: P* = v_T(q_F) = max_{i=1..N} v_T(i) · a_iF; the best state sequence is recovered by following the backpointers bt back from q_F
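The same assumed ice cream HMM as in the forward sketch can be decoded with a small Viterbi implementation that keeps, for each state, the best path reaching it (the explicit path list plays the role of the backpointers):

```python
# Same illustrative HMM as before; all parameters are assumed, not from the slides.
states = ["H", "C"]
pi = {"H": 0.8, "C": 0.2}
A = {("H","H"): 0.6, ("H","C"): 0.4,
     ("C","H"): 0.5, ("C","C"): 0.5}
B = {"H": {1: 0.2, 2: 0.4, 3: 0.4},
     "C": {1: 0.5, 2: 0.4, 3: 0.1}}

def viterbi(obs):
    v = {j: pi[j] * B[j][obs[0]] for j in states}          # initialization
    paths = {j: [j] for j in states}
    for o in obs[1:]:                                      # recursion
        v_new, paths_new = {}, {}
        for j in states:
            # best predecessor state for reaching j (max instead of sum)
            best = max(states, key=lambda i: v[i] * A[(i, j)])
            v_new[j] = v[best] * A[(best, j)] * B[j][o]
            paths_new[j] = paths[best] + [j]
        v, paths = v_new, paths_new
    best_last = max(states, key=lambda j: v[j])            # termination
    return paths[best_last], v[best_last]

print(viterbi([3, 1, 3]))
```

Note the only change from the forward algorithm is replacing the sum over predecessors with a max, plus remembering which predecessor won.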
• Decoding
• Algorithm (worked trellis figures not reproduced)
• Learning
• Given an ice cream sequence {3, 1, 3} and the set of possible weather states {H, C}, what are the HMM parameters (A and B)?
• Forward-Backward algorithm (a kind of Expectation Maximization)
• βt(j) represents the probability of seeing the observations from time t+1 to the end, given that we are in state j at time t, and the model λ.
• Initialization: β_T(i) = a_iF,  1 ≤ i ≤ N
• Recursion: β_t(i) = Σ_{j=1..N} a_ij · b_j(o_{t+1}) · β_{t+1}(j),  1 ≤ i ≤ N, 1 ≤ t < T
• Termination: P(O | λ) = Σ_{j=1..N} a_0j · b_j(o_1) · β_1(j)
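For the same assumed ice cream HMM used in the earlier sketches, the backward pass looks like this; a useful sanity check is that its termination value equals the forward algorithm's P(O | λ):

```python
# Same illustrative HMM; parameters are assumed, not from the slides.
states = ["H", "C"]
pi = {"H": 0.8, "C": 0.2}
A = {("H","H"): 0.6, ("H","C"): 0.4,
     ("C","H"): 0.5, ("C","C"): 0.5}
B = {"H": {1: 0.2, 2: 0.4, 3: 0.4},
     "C": {1: 0.5, 2: 0.4, 3: 0.1}}

def backward(obs):
    beta = {i: 1.0 for i in states}            # initialization at t = T
    for o in reversed(obs[1:]):                # recursion, right to left
        beta = {i: sum(A[(i, j)] * B[j][o] * beta[j] for j in states)
                for i in states}
    # termination: P(O|lambda) = sum_j pi_j * b_j(o_1) * beta_1(j)
    return sum(pi[j] * B[j][obs[0]] * beta[j] for j in states)

print(backward([3, 1, 3]))   # same value the forward algorithm gives
```

The forward-backward (Baum-Welch) learning step then combines α and β to get the γ and ξ quantities described below.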
• Learning
• Algorithm
γ_t(j): the probability of being in state j at time t
ξ_t(i, j): the probability of being in state i at time t and state j at time t+1
Solving the learning problem is the most complicated. Consult your textbook to find more details.
隐马尔可夫标注算法 HMM POS Tagging
• Using Viterbi to solve the decoding problem
• An English example: I want to race.
Transition probabilities A and emission probabilities B (tables not reproduced)
• An English Example (trellis figure not reproduced)
In-Class Exercise
• Compute v_3(3) for the preceding example, using the given probabilities. Note that you first need to compute all of the v_2(·) values.
• Other Tagging Methods
• CLAWS (a brute-force algorithm)
• Within a word span whose first and last words each have a unique POS, enumerate every possible tag path and choose the one with the maximum probability (minimum cost).
• VOLSUNGA (a greedy algorithm)
• As an improvement on CLAWS, it builds the path step by step, keeping only the best choice found at each step. The final path is the concatenation of these locally optimal steps, so it is faster but may miss the global optimum.
• A Chinese Example
• In implementation, we often use log probabilities to prevent numerical underflow caused by products of many small probabilities. With negative log probabilities, finding the maximum product becomes finding the minimum sum.
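A quick demonstration of why negative log costs are used:

```python
import math

# Multiplying many small probabilities underflows double precision.
p = 1.0
for _ in range(400):
    p *= 1e-3
print(p)                                   # 0.0 -- underflow

# Negative log probabilities turn the product into a stable sum,
# and maximizing a product becomes minimizing a sum of costs.
cost = sum(-math.log10(1e-3) for _ in range(400))
print(cost)                                # about 1200
```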
,报道新闻了,
Transition probabilities A / transition costs TC, and emission probabilities B / emission costs EC (tables not reproduced; the cost values are quoted in the computations below)
• A Chinese Example
• CLAWS
Path 1:w-n-n-u-w
Cost1 = TC[w,n]+TC[n,n]+TC[n,u]+TC[u,w]=2.09+1.76+2.40+2.22=8.47
Path 2:w-n-n-v-w
Cost2 = TC[w,n]+TC[n,n]+TC[n,v]+TC[v,w]=2.09+1.76+1.71+1.85=7.41
Path 3:w-n-n-y-w
Cost3 = TC[w,n]+TC[n,n]+TC[n,y]+TC[y,w]=2.09+1.76+5.10+0.08=9.03
Path 4:w-v-n-u-w
Cost4 = TC[w,v]+TC[v,n]+TC[n,u]+TC[u,w]=1.90+1.72+2.40+2.22=8.24
Path 5:w-v-n-v-w
Cost5 = TC[w,v]+TC[v,n]+TC[n,v]+TC[v,w]=1.90+1.72+1.71+1.85=7.18
Path 6:w-v-n-y-w
Cost6 = TC[w,v]+TC[v,n]+TC[n,y]+TC[y,w]=1.90+1.72+5.10+0.08=8.80
The result is 报道 /v 新闻 /n 了 /v
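The six-path enumeration can be reproduced mechanically. As on the slide, CLAWS here scores paths by transition costs only:

```python
from itertools import product

# Transition costs quoted in the path computations above (negative log probabilities).
TC = {("w","n"): 2.09, ("w","v"): 1.90, ("v","n"): 1.72, ("n","n"): 1.76,
      ("n","u"): 2.40, ("n","v"): 1.71, ("n","y"): 5.10,
      ("u","w"): 2.22, ("v","w"): 1.85, ("y","w"): 0.08}

# Candidate tags per token of ", 报道 新闻 了 ,"
cands = [["w"], ["n", "v"], ["n"], ["u", "v", "y"], ["w"]]

def path_cost(path):
    return sum(TC[(a, b)] for a, b in zip(path, path[1:]))

# Brute force: score all tag paths, keep the cheapest.
best = min(product(*cands), key=path_cost)
print(best, round(path_cost(best), 2))   # ('w', 'v', 'n', 'v', 'w') 7.18
```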
• A Chinese Example
• VOLSUNGA
Step1:min{TC[w,n]+EC[报道 |n], TC[w,v]+EC[报道 |v]}
= min{2.09+8.22,1.90+5.69}
T[1] = v
Step2:min{TC[v,n]+EC[新闻 |n]}
= min{1.72+6.55}
T[2] = n
Step3:min{TC[n,u]+EC[了 |u], TC[n,v]+EC[了 |v], TC[n,y]+EC[了 |y]}
= min{2.40+1.98,1.71+7.76,5.10+0.38}
T[3] = u
The result is 报道 /v 新闻 /n 了 /u
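The three greedy steps can be sketched directly with the cost values quoted above; note how committing to each locally cheapest tag reproduces the VOLSUNGA result:

```python
# Transition and emission costs quoted in the steps above.
TC = {("w","n"): 2.09, ("w","v"): 1.90, ("v","n"): 1.72, ("n","n"): 1.76,
      ("n","u"): 2.40, ("n","v"): 1.71, ("n","y"): 5.10,
      ("u","w"): 2.22, ("v","w"): 1.85, ("y","w"): 0.08}
EC = {("报道","n"): 8.22, ("报道","v"): 5.69, ("新闻","n"): 6.55,
      ("了","u"): 1.98, ("了","v"): 7.76, ("了","y"): 0.38}

tokens = ["报道", "新闻", "了"]
cands = [["n", "v"], ["n"], ["u", "v", "y"]]

prev, tags = "w", []                      # sentence opens with ",/w"
for word, cand in zip(tokens, cands):
    # greedy: commit to the locally cheapest tag at each step
    best = min(cand, key=lambda t: TC[(prev, t)] + EC[(word, t)])
    tags.append(best)
    prev = best
print(tags)                               # ['v', 'n', 'u']
```

Here the greedy result (了/u) differs from the Viterbi result below because the cheap y→w transition is invisible to a method that never looks ahead.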
• A Chinese Example
• Viterbi
Step1: k = 1
Cost[1, 报道 /n] = (Cost[0, w] + TC[w,n]) + EC[报道 |n] = 10.31
Cost[1, 报道 /v] = (Cost[0, w] + TC[w,v]) + EC[报道 |v] = 7.59
Step2: k = 2
Cost[2, 新闻 /n] = min{(Cost[1, v] + TC[v,n]), (Cost[1, n] + TC[n,n])}+ EC[新闻 |n] = 7.59 + 1.72 + 6.55 = 15.86
Step3: k = 3
Cost[3, 了 /u] = (Cost[2, n] + TC[n,u]) + EC[了 |u] = 20.24
Cost[3, 了 /v] = (Cost[2, n] + TC[n,v]) + EC[了 |v] = 25.33
Cost[3, 了 /y] = (Cost[2, n] + TC[n,y]) + EC[了 |y] = 21.34
Step4: k = 4
Cost[4, , /w] = min{(Cost[3, u] + TC[u,w]), (Cost[3, v] + TC[v,w]), (Cost[3, y] + TC[y,w])}+ EC[, |w] = 21.34 + 0.08 + 0 = 21.42
The result is 报道 /v 新闻 /n 了 /y
Cost[k, t] comes from the negative log probability: Cost[k, t] = min_s {Cost[k-1, s] + TC[s, t]} + EC[w_k|t]
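The recurrence can be run end to end with the cost values quoted in the steps above, reproducing the minimum cost 21.42:

```python
# Cost tables quoted in the worked steps above (negative log probabilities).
TC = {("w","n"): 2.09, ("w","v"): 1.90, ("v","n"): 1.72, ("n","n"): 1.76,
      ("n","u"): 2.40, ("n","v"): 1.71, ("n","y"): 5.10,
      ("u","w"): 2.22, ("v","w"): 1.85, ("y","w"): 0.08}
EC = {("报道","n"): 8.22, ("报道","v"): 5.69, ("新闻","n"): 6.55,
      ("了","u"): 1.98, ("了","v"): 7.76, ("了","y"): 0.38, (",","w"): 0.0}

words = [",", "报道", "新闻", "了", ","]
cands = [["w"], ["n", "v"], ["n"], ["u", "v", "y"], ["w"]]

# cost[t] = (min cost of any path ending in tag t, that path)
cost = {"w": (0.0, ["w"])}
for word, cand in zip(words[1:], cands[1:]):
    new = {}
    for t in cand:
        # Cost[k, t] = min_s {Cost[k-1, s] + TC[s, t]} + EC[w_k | t]
        c, path = min((cost[s][0] + TC[(s, t)], cost[s][1]) for s in cost)
        new[t] = (c + EC[(word, t)], path + [t])
    cost = new

best_cost, best_path = cost["w"]
print(best_path, round(best_cost, 2))    # ['w', 'v', 'n', 'y', 'w'] 21.42
```

This confirms the slide's result 报道/v 新闻/n 了/y: Viterbi keeps all partial paths alive per tag, so the cheap y→w transition that the greedy method missed is found.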
Wrap-Up
• 统计模型的训练 Training a Statistical Model
  • Computing Probabilities
• 马尔可夫链 Markov Chain
  • Definition
  • Graphical Representation
• 隐马尔可夫模型 Hidden Markov Model
  • Components
  • Assumptions
  • Computing Likelihood
  • Decoding
  • Learning
• 隐马尔可夫标注算法 HMM POS Tagging
  • Viterbi
  • Examples
  • Vs. Other Methods