101035 中文信息处理 Chinese NLP Lecture 7


Page 1

101035 中文信息处理

Chinese NLP

Lecture 7

Page 2

词——词性标注(2) Part-of-Speech Tagging (2)

• 统计模型的训练 (Training a statistical model)

• 马尔可夫链 (Markov chain)

• 隐马尔可夫模型 (Hidden Markov Model, or HMM)

• 隐马尔可夫标注算法 (HMM POS tagging)

Page 3

统计模型的训练 Training a Statistical Model

• Back to POS tagging

• Given a word sequence w_1^n, decide its best POS tag sequence t̂_1^n among all possible tag sequences t_1^n:

t̂_1^n = argmax_{t_1^n} P(t_1^n | w_1^n)

• By Bayes' rule, and dropping the denominator P(w_1^n), which is the same for every tag sequence:

t̂_1^n = argmax_{t_1^n} P(w_1^n | t_1^n) · P(t_1^n)

where P(w_1^n | t_1^n) is the likelihood and P(t_1^n) is the prior.

Page 4

• Computing Probabilities

• Simplifying assumptions: a tag depends only on the previous tag, and a word depends only on its own tag:

P(t_1^n) ≈ ∏_{i=1}^{n} P(t_i | t_{i−1}),   P(w_1^n | t_1^n) ≈ ∏_{i=1}^{n} P(w_i | t_i)

• Both kinds of probabilities are estimated from counts in a tagged corpus:

P(t_i | t_{i−1}) = C(t_{i−1}, t_i) / C(t_{i−1}),   P(w_i | t_i) = C(t_i, w_i) / C(t_i)

The above probability computation is oversimplified. Consult the textbook for deleted interpolation and other smoothing methods that give better probability estimates.
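As a rough illustration (not part of the slides), the two count-based estimates above can be computed directly from a tagged corpus. The toy corpus, tag names, and function names below are made up for the sketch:

```python
from collections import Counter

# Hypothetical toy corpus: each sentence is a list of (word, tag) pairs.
corpus = [
    [("Secretariat", "NNP"), ("is", "VBZ"), ("expected", "VBN"),
     ("to", "TO"), ("race", "VB"), ("tomorrow", "NR")],
    [("the", "DT"), ("race", "NN"), ("is", "VBZ"), ("over", "RB")],
]

tag_count = Counter()      # C(t): how often each tag occurs (incl. start symbol)
bigram_count = Counter()   # C(t_{i-1}, t_i)
emit_count = Counter()     # C(t, w)

for sentence in corpus:
    prev = "<s>"                       # sentence-initial pseudo-tag
    tag_count[prev] += 1
    for word, tag in sentence:
        bigram_count[(prev, tag)] += 1
        emit_count[(tag, word)] += 1
        tag_count[tag] += 1
        prev = tag

def p_trans(prev, tag):
    """P(t_i | t_{i-1}) = C(t_{i-1}, t_i) / C(t_{i-1})"""
    return bigram_count[(prev, tag)] / tag_count[prev]

def p_emit(tag, word):
    """P(w_i | t_i) = C(t_i, w_i) / C(t_i)"""
    return emit_count[(tag, word)] / tag_count[tag]

print(p_trans("TO", "VB"))    # 1.0 in this tiny corpus
print(p_emit("NN", "race"))   # 1.0 in this tiny corpus
```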

Page 5

• Using Probabilities for POS Tagging

• Example

What POS is "race" in "Secretariat is expected to race tomorrow"?

Page 6

• Using Probabilities for POS Tagging

• Example

• Using the (87-tag) Brown corpus, we get

P(NN|TO) = 0.00047 P(VB|TO) = 0.83

P(race|NN) = 0.00057 P(race|VB) = 0.00012

P(NR|VB) = 0.0027 P(NR|NN) = 0.0012

• Compare

P(VB|TO) P(NR|VB) P(race|VB) = 0.00000027

P(NN|TO) P(NR|NN) P(race|NN) = 0.00000000032
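Each line of the comparison is just the product of three probabilities; a quick arithmetic check (values copied from above):

```python
# Probabilities taken from the (87-tag) Brown corpus figures above.
p_vb = 0.83 * 0.0027 * 0.00012      # P(VB|TO) * P(NR|VB) * P(race|VB)
p_nn = 0.00047 * 0.0012 * 0.00057   # P(NN|TO) * P(NR|NN) * P(race|NN)
print(f"{p_vb:.2e}")   # ~2.7e-07 -> the verb reading wins
print(f"{p_nn:.2e}")   # ~3.2e-10
```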

Page 7

马尔可夫链 Markov Chain

• Definition

• A Markov chain is a special case of a weighted automaton in which the input sequence uniquely determines which states the automaton will go through.

Page 8

• Graphical Model Representation

• A set of N states: Q = q1q2 … qN

• A transition probability matrix: A = a01a02 … an1 … ann

• A special start state and end state: q0, qF

• Alternatively, we can use an initial probability distribution over states.

π = π1π2 … πN. πi is the probability that the Markov chain will start in state i.
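A minimal sketch of these components in Python, using hypothetical weather states and numbers (not the chain pictured on the slides): the probability of an observed state sequence is the initial probability times the transition probabilities along the path.

```python
# Hypothetical two-state weather Markov chain (states: HOT, COLD).
states = ["HOT", "COLD"]
pi = {"HOT": 0.8, "COLD": 0.2}                    # initial distribution pi_i
A = {                                             # transition matrix a_ij
    "HOT":  {"HOT": 0.7, "COLD": 0.3},
    "COLD": {"HOT": 0.4, "COLD": 0.6},
}

def sequence_probability(seq):
    """P(q_1 ... q_T) = pi[q_1] * prod_t A[q_{t-1}][q_t] (Markov assumption)."""
    p = pi[seq[0]]
    for prev, cur in zip(seq, seq[1:]):
        p *= A[prev][cur]
    return p

print(sequence_probability(["COLD", "HOT", "COLD", "HOT"]))
# 0.2 * 0.4 * 0.3 * 0.4 = 0.0096
```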

Page 9

• What is Special about Markov Chains

• A Markov chain can’t represent inherently ambiguous problems, so it is only useful for assigning probabilities to unambiguous sequences.

• A Markov chain is not suitable for POS tagging, because the states (POS) cannot be directly observed.

• Markov assumption: P(q_i | q_1 … q_{i−1}) = P(q_i | q_{i−1})

Page 10

In-Class Exercise

• Using the following Markov chain, compute the probability of the sequence: {cold hot cold hot}

Page 11

隐马尔可夫模型 Hidden Markov Model

• Markov Chain vs HMM

• A Markov chain is useful when we need to compute a probability for a sequence of events that we can observe in the world.

• An HMM allows us to talk about both observed events (like the words we see in the input) and hidden events (like POS tags).

Page 12

• HMM Components

• A set of N states: Q = q1q2 … qN

• A transition probability matrix: A = a11a12 … an1 … ann

• A sequence of T observations: O = o1o2 … oT

• A sequence of observation likelihoods, or emission probabilities: B = b_i(o_t), the probability of observation o_t being generated from state i

• A special start state and end state: q0, qF

• Alternatively, we can use an initial probability distribution over states.

π = π1π2 … πN. πi is the probability that the Markov chain will start in state i.
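As a sketch, these components can be bundled into a simple container (the class and field names below are mine, using the initial-distribution formulation instead of explicit q0/qF states; the ice-cream numbers are placeholders, not the figure from the slides):

```python
from typing import NamedTuple, Dict, List, Hashable

class HMM(NamedTuple):
    states: List[str]                          # Q = q_1 ... q_N
    A: Dict[str, Dict[str, float]]             # transition probabilities a_ij
    B: Dict[str, Dict[Hashable, float]]        # emission probabilities b_i(o_t)
    pi: Dict[str, float]                       # initial distribution over states

# Hypothetical ice-cream HMM (placeholder numbers):
hmm = HMM(
    states=["HOT", "COLD"],
    A={"HOT": {"HOT": 0.6, "COLD": 0.4}, "COLD": {"HOT": 0.5, "COLD": 0.5}},
    B={"HOT": {1: 0.2, 2: 0.4, 3: 0.4}, "COLD": {1: 0.5, 2: 0.4, 3: 0.1}},
    pi={"HOT": 0.8, "COLD": 0.2},
)
```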

Page 13

• HMM Assumptions

• Markov assumption: P(q_i | q_1 … q_{i−1}) = P(q_i | q_{i−1})

• Output independence assumption: P(o_i | q_1 … q_T, o_1 … o_T) = P(o_i | q_i)

• Fundamental Problems

• Computing likelihood: Given an HMM λ = (A,B) and an observation sequence O, determine the likelihood P(O|λ).

• Decoding: Given an observation sequence O and an HMM λ = (A,B), discover the best hidden state sequence Q.

• Learning: Given an observation sequence O and the set of states in the HMM, learn the HMM parameters A and B.

Page 14

• A Running Example

• Jason eating ice cream on a given day

There is some relation between weather states (hot, cold) and the number of ice creams eaten on that day.

An integer represents the number of ice creams eaten on a given day (observed), and a sequence of H and C designates the weather states (hidden) that caused Jason to eat the ice cream.

Page 15

• Computing Likelihood

• Given an HMM model, what is the likelihood of {3, 1, 3}? Note that we do not know the hidden states (weather).

• Forward algorithm (a kind of dynamic programming)

• αt(j) represents the probability of being in state j after seeing the first t observations, given the model λ.

α_t(j) = Σ_{i=1}^{N} α_{t−1}(i) · a_{ij} · b_j(o_t)

where α_{t−1}(i) is the previous forward path probability from the previous time step, a_{ij} is the transition probability from previous state q_i to current state q_j, and b_j(o_t) is the state observation likelihood of the observation symbol o_t given the current state j.

Page 16

• Computing Likelihood

• Algorithm

• Initialization: α_1(j) = a_{0j} · b_j(o_1),  1 ≤ j ≤ N

• Recursion: α_t(j) = Σ_{i=1}^{N} α_{t−1}(i) · a_{ij} · b_j(o_t),  1 ≤ j ≤ N, 1 < t ≤ T

• Termination: P(O | λ) = α_T(q_F) = Σ_{i=1}^{N} α_T(i) · a_{iF}

In a trellis implementation, the cell forward[s, t] stores α_t(s).
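A minimal Python sketch of these three steps, using an initial distribution π instead of the explicit q0/qF states; the ice-cream parameters below are hypothetical placeholders, not taken from the slides:

```python
def forward(obs, states, pi, A, B):
    """Likelihood P(O | lambda) via the forward probabilities alpha_t(j)."""
    # Initialization: alpha_1(j) = pi_j * b_j(o_1)
    alpha = [{j: pi[j] * B[j][obs[0]] for j in states}]
    # Recursion: alpha_t(j) = sum_i alpha_{t-1}(i) * a_ij * b_j(o_t)
    for o in obs[1:]:
        prev = alpha[-1]
        alpha.append({j: sum(prev[i] * A[i][j] for i in states) * B[j][o]
                      for j in states})
    # Termination: P(O | lambda) = sum_j alpha_T(j)   (no explicit end state)
    return sum(alpha[-1].values())

# Hypothetical ice-cream HMM (placeholder numbers, not the slides' figure).
states = ["HOT", "COLD"]
pi = {"HOT": 0.8, "COLD": 0.2}
A = {"HOT": {"HOT": 0.6, "COLD": 0.4}, "COLD": {"HOT": 0.5, "COLD": 0.5}}
B = {"HOT": {1: 0.2, 2: 0.4, 3: 0.4}, "COLD": {1: 0.5, 2: 0.4, 3: 0.1}}

print(forward([3, 1, 3], states, pi, A, B))   # likelihood of observing {3, 1, 3}
```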

Page 17

• Computing Likelihood

Page 18

• Decoding

• Given an HMM and an ice cream sequence {3, 1, 3}, what is the hidden weather state sequence?

• Viterbi algorithm (a kind of dynamic programming)

• v_t(j) represents the probability that the HMM is in state j after seeing the first t observations and passing through the most probable state sequence q_0, q_1, …, q_{t−1}, given the model λ.

v_t(j) = max_{i=1..N} v_{t−1}(i) · a_{ij} · b_j(o_t)

where v_{t−1}(i) is the previous Viterbi path probability from the previous time step, a_{ij} is the transition probability from previous state q_i to current state q_j, and b_j(o_t) is the state observation likelihood of the observation symbol o_t given the current state j.


Page 19

• Decoding

• Algorithm

• Initialization: v_1(j) = a_{0j} · b_j(o_1);  bt_1(j) = 0,  1 ≤ j ≤ N

• Recursion: v_t(j) = max_{i=1..N} v_{t−1}(i) · a_{ij} · b_j(o_t);  bt_t(j) = argmax_{i=1..N} v_{t−1}(i) · a_{ij}

• Termination: best score P* = v_T(q_F) = max_{i=1..N} v_T(i) · a_{iF};  start of backtrace q_T* = argmax_{i=1..N} v_T(i) · a_{iF}
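A matching Viterbi sketch in Python (same caveats as the forward sketch: initial-distribution formulation, hypothetical example numbers):

```python
def viterbi(obs, states, pi, A, B):
    """Return (best path probability, best hidden state sequence)."""
    # Initialization: v_1(j) = pi_j * b_j(o_1); no backpointer yet
    v = [{j: pi[j] * B[j][obs[0]] for j in states}]
    backpointer = [{j: None for j in states}]
    # Recursion: v_t(j) = max_i v_{t-1}(i) * a_ij * b_j(o_t)
    for o in obs[1:]:
        prev = v[-1]
        cur, bp = {}, {}
        for j in states:
            best_i = max(states, key=lambda i: prev[i] * A[i][j])
            cur[j] = prev[best_i] * A[best_i][j] * B[j][o]
            bp[j] = best_i
        v.append(cur)
        backpointer.append(bp)
    # Termination: pick the best final state, then follow the backpointers
    last = max(states, key=lambda j: v[-1][j])
    path = [last]
    for bp in reversed(backpointer[1:]):
        path.append(bp[path[-1]])
    path.reverse()
    return v[-1][last], path

# Hypothetical ice-cream HMM (placeholder numbers, not the slides' figure).
states = ["HOT", "COLD"]
pi = {"HOT": 0.8, "COLD": 0.2}
A = {"HOT": {"HOT": 0.6, "COLD": 0.4}, "COLD": {"HOT": 0.5, "COLD": 0.5}}
B = {"HOT": {1: 0.2, 2: 0.4, 3: 0.4}, "COLD": {1: 0.5, 2: 0.4, 3: 0.1}}

print(viterbi([3, 1, 3], states, pi, A, B))   # best weather sequence for {3, 1, 3}
```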

Page 20

• Decoding

• Algorithm

Page 21

• Decoding

Page 22

• Learning

• Given an ice cream sequence {3, 1, 3} and the set of possible weather states {H, C}, what are the HMM parameters (A and B)?

• Forward-Backward algorithm (a kind of Expectation Maximization)

• βt(j) represents the probability of seeing the observations from time t+1 to the end, given that we are in state j at time t, and the model λ.

• Initialization: β_T(i) = a_{iF},  1 ≤ i ≤ N

• Recursion: β_t(i) = Σ_{j=1}^{N} a_{ij} · b_j(o_{t+1}) · β_{t+1}(j),  1 ≤ i ≤ N, 1 ≤ t < T

• Termination: P(O | λ) = α_T(q_F) = β_1(q_0) = Σ_{j=1}^{N} a_{0j} · b_j(o_1) · β_1(j)
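A minimal sketch of the backward pass only (the full EM re-estimation of A and B is left to the textbook, as the slide suggests); without explicit q0/qF states, β_T(i) is simply initialized to 1:

```python
def backward(obs, states, A, B):
    """beta[t][i] = P(o_{t+1} ... o_T | state i at time t), for all t and i."""
    T = len(obs)
    beta = [{} for _ in range(T)]
    for i in states:
        beta[T - 1][i] = 1.0                      # initialization (no end state)
    # Recursion: beta_t(i) = sum_j a_ij * b_j(o_{t+1}) * beta_{t+1}(j)
    for t in range(T - 2, -1, -1):
        for i in states:
            beta[t][i] = sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j]
                             for j in states)
    return beta
```

Combined with the forward probabilities, β gives the quantities γ and ξ used on the next slide.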

Page 23

• Learning

• Algorithm

γ_t(j): the probability of being in state j at time t

ξ_t(i, j): the probability of being in state i at time t and state j at time t+1

Solving the learning problem is the most complicated of the three. Consult your textbook for more details.

Page 24

隐马尔可夫标注算法 HMM POS Tagging

• Using Viterbi to solve the decoding problem

• An English example: I want to race.

Transition probabilities A

Emission probabilities B

Page 25

• An English Example

Page 26

In-Class Exercise

• Compute v_3(3) on the previous page, using the given probabilities. Note that you need to compute all of the v_2(·) values first.

Page 27

• Other Tagging Methods

• CLAWS (a brute-force algorithm)

• Within a span of words whose first and last words each have a unique POS, enumerate every possible tag path and choose the best-scoring one.

• VOLSUNGA (a greedy algorithm)

• As an improvement on CLAWS, it builds the path step by step, keeping only the best path found so far at each step; the final path is simply the concatenation of these locally optimal choices.

Page 28

• A Chinese Example

• In implementation, we often use log probabilities to prevent numerical underflow caused by products of many small probabilities. If we take negative log probabilities, finding the maximum product becomes finding the minimum sum.
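In symbols, since the logarithm is monotonic:

$$\arg\max \prod_{k} P_k \;=\; \arg\min \sum_{k} \bigl(-\log P_k\bigr)$$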

Example sentence: ，报道 新闻 了，

Transition probabilities A / Transition costs TC

Emission probabilities B / Emission costs EC

Page 29

• A Chinese Example

• CLAWS

Path 1:w-n-n-u-w

Cost1 = TC[w,n]+TC[n,n]+TC[n,u]+TC[u,w]=2.09+1.76+2.40+2.22=8.47

Path 2:w-n-n-v-w

Cost2 = TC[w,n]+TC[n,n]+TC[n,v]+TC[v,w]=2.09+1.76+1.71+1.85=7.41

Path 3:w-n-n-y-w

Cost3 = TC[w,n]+TC[n,n]+TC[n,y]+TC[y,w]=2.09+1.76+5.10+0.08=9.03

Path 4:w-v-n-u-w

Cost4 = TC[w,v]+TC[v,n]+TC[n,u]+TC[u,w]=1.90+1.72+2.40+2.22=8.24

Path 5:w-v-n-v-w

Cost5 = TC[w,v]+TC[v,n]+TC[n,v]+TC[v,w]=1.90+1.72+1.71+1.85=7.18

Path 6:w-v-n-y-w

Cost6 = TC[w,v]+TC[v,n]+TC[n,y]+TC[y,w]=1.90+1.72+5.10+0.08=8.80

Path 5 has the minimum total cost (7.18), so the result is 报道 /v 新闻 /n 了 /v

Page 30

• A Chinese Example

• VOLSUNGA

Step1:min{TC[w,n]+EC[报道 |n], TC[w,v]+EC[报道 |v]}

= min{2.09+8.22,1.90+5.69}

T[1] = v

Step2:min{TC[v,n]+EC[新闻 |n]}

= min{1.72+6.55}

T[2] = n

Step3:min{TC[n,u]+EC[了 |u], TC[n,v]+EC[了 |v], TC[n,y]+EC[了 |y]}

= min{2.40+1.98,1.71+7.76,5.10+0.38}

T[3] = u

The result is 报道 /v 新闻 /n 了 /u

Page 31

• A Chinese Example

• Viterbi

Step1: k = 1

Cost[1, 报道 /n] = (Cost[0, w] + TC[w,n]) + EC[报道 |n] = 10.31

Cost[1, 报道 /v] = (Cost[0, w] + TC[w,v]) + EC[报道 |v] = 7.59

Step2: k = 2

Cost[2, 新闻 /n] = min{(Cost[1, v] + TC[v,n]), (Cost[1, n] + TC[n,n])}+ EC[新闻 |n] = 7.59 + 1.72 + 6.55 = 15.86

Step3: k = 3

Cost[3, 了 /u] = (Cost[2, n] + TC[n,u]) + EC[了 |u] = 20.24

Cost[3, 了 /v] = (Cost[2, n] + TC[n,v]) + EC[了 |v] = 25.33

Cost[3, 了 /y] = (Cost[2, n] + TC[n,y]) + EC[了 |y] = 21.34

Step4: k = 4

Cost[4, , /w] = min{(Cost[3, u] + TC[u,w]), (Cost[3, v] + TC[v,w]), (Cost[3, y] + TC[y,w])}+ EC[, |w] = 21.34 + 0.08 + 0 = 21.42

The cheapest final path goes through 了/y (21.34 + 0.08 < 20.24 + 2.22), so backtracing gives the result 报道 /v 新闻 /n 了 /y

Cost[k, t] comes from the negative log probability: Cost[k, t] = min_s { Cost[k−1, s] + TC[s, t] } + EC[w_k | t]
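A minimal Python sketch of this cost-based recursion, using only the TC and EC values that appear in the example (the dictionary layout and function name are mine); it reproduces the total cost 21.42 and the tag sequence 报道/v 新闻/n 了/y:

```python
# Transition costs TC[s][t] and emission costs EC[word][tag] from the example.
TC = {"w": {"n": 2.09, "v": 1.90},
      "v": {"n": 1.72, "w": 1.85},
      "n": {"n": 1.76, "u": 2.40, "v": 1.71, "y": 5.10},
      "u": {"w": 2.22}, "y": {"w": 0.08}}
EC = {"报道": {"n": 8.22, "v": 5.69},
      "新闻": {"n": 6.55},
      "了":   {"u": 1.98, "v": 7.76, "y": 0.38},
      "，":   {"w": 0.0}}

def viterbi_cost(words, candidates, start="w"):
    """Cost[k, t] = min_s { Cost[k-1, s] + TC[s, t] } + EC[w_k | t]."""
    cost = [{start: 0.0}]
    back = [{}]
    for w in words:
        cur, bp = {}, {}
        for t in candidates[w]:
            best_s = min(cost[-1], key=lambda s: cost[-1][s] + TC[s][t])
            cur[t] = cost[-1][best_s] + TC[best_s][t] + EC[w][t]
            bp[t] = best_s
        cost.append(cur)
        back.append(bp)
    # Backtrace from the cheapest final tag
    tag = min(cost[-1], key=cost[-1].get)
    path = [tag]
    for bp in reversed(back[2:]):   # back[1] points to the sentence-initial "w"
        path.append(bp[path[-1]])
    path.reverse()
    return cost[-1][tag], path

candidates = {"报道": ["n", "v"], "新闻": ["n"], "了": ["u", "v", "y"], "，": ["w"]}
total, tags = viterbi_cost(["报道", "新闻", "了", "，"], candidates)
print(round(total, 2), tags)   # 21.42 ['v', 'n', 'y', 'w']  -> 报道/v 新闻/n 了/y plus the final ，/w
```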

Page 32

Wrap-Up

• 统计模型的训练 (Training a statistical model)
  • Computing Probabilities

• 马尔可夫链 (Markov chain)
  • Definition
  • Graphical Representation

• 隐马尔可夫模型 (Hidden Markov Model)
  • Components
  • Assumptions
  • Computing Likelihood
  • Decoding
  • Learning

• 隐马尔可夫标注算法 (HMM POS tagging)
  • Viterbi
  • Examples
  • Vs Other Methods