101035 中文信息处理
Chinese NLP
Lecture 7
词——词性标注（2） Part-of-Speech Tagging (2)
• 统计模型的训练（Training a statistical model）
• 马尔可夫链（Markov chain）
• 隐马尔可夫模型（Hidden Markov Model, or HMM）
• 隐马尔可夫标注算法（HMM POS tagging）
统计模型的训练 Training a Statistical Model
• Back to POS tagging
• Given a word sequence w_1^n, decide its best POS sequence t̂_1^n among all possible t_1^n:
  t̂_1^n = argmax_{t_1^n} P(t_1^n | w_1^n)
• Bayes Rule:
  t̂_1^n = argmax_{t_1^n} P(w_1^n | t_1^n) · P(t_1^n) / P(w_1^n) = argmax_{t_1^n} P(w_1^n | t_1^n) · P(t_1^n),
  where P(w_1^n | t_1^n) is the likelihood and P(t_1^n) is the prior. The denominator P(w_1^n) is the same for every tag sequence, so it can be dropped.
• Computing Probabilities
• Simplifying assumptions (bigram tag transitions; each word depends only on its own tag):
  P(t_1^n) ≈ ∏_{i=1..n} P(t_i | t_{i-1})    P(w_1^n | t_1^n) ≈ ∏_{i=1..n} P(w_i | t_i)
• Counts from corpus:
  P(t_i | t_{i-1}) = C(t_{i-1}, t_i) / C(t_{i-1})    P(w_i | t_i) = C(t_i, w_i) / C(t_i)
The above probability computation is oversimplified. Consult the textbook about deleted interpolation and other smoothing methods for better probability computation.
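To make the count-based estimates concrete, here is a minimal sketch of maximum-likelihood estimation from a tagged corpus. The two-sentence corpus is invented for illustration and is far too small for real use (which is exactly why smoothing matters):

```python
from collections import Counter

# Hypothetical toy tagged corpus: each sentence is a list of (word, tag) pairs.
corpus = [
    [("the", "DT"), ("race", "NN"), ("ended", "VBD")],
    [("to", "TO"), ("race", "VB"), ("tomorrow", "NR")],
]

tag_bigrams, tag_counts, emissions = Counter(), Counter(), Counter()
for sent in corpus:
    tags = [t for _, t in sent]
    for w, t in sent:
        emissions[(t, w)] += 1      # C(t_i, w_i)
        tag_counts[t] += 1          # C(t_i)
    for prev, cur in zip(tags, tags[1:]):
        tag_bigrams[(prev, cur)] += 1   # C(t_{i-1}, t_i)

def p_tag(cur, prev):
    # P(t_i | t_{i-1}) = C(t_{i-1}, t_i) / C(t_{i-1})
    return tag_bigrams[(prev, cur)] / tag_counts[prev]

def p_word(word, tag):
    # P(w_i | t_i) = C(t_i, w_i) / C(t_i)
    return emissions[(tag, word)] / tag_counts[tag]

print(p_tag("VB", "TO"))     # 1.0 in this tiny corpus
print(p_word("race", "VB"))  # 1.0
```

With so little data most estimates saturate at 0 or 1, which is the oversimplification the smoothing methods mentioned above are meant to fix.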
• Using Probabilities for POS Tagging
• Example
What POS is "race" in "Secretariat is expected to race tomorrow"?
• Using Probabilities for POS Tagging
• Example
• Using the (87-tag) Brown corpus, we get
P(NN|TO) = 0.00047 P(VB|TO) = 0.83
P(race|NN) = 0.00057 P(race|VB) = 0.00012
P(NR|VB) = 0.0027 P(NR|NN) = 0.0012
• Compare
P(VB|TO) × P(NR|VB) × P(race|VB) = 0.00000027
P(NN|TO) × P(NR|NN) × P(race|NN) = 0.00000000032
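The comparison can be checked by direct multiplication of the Brown-corpus probabilities quoted above:

```python
# Brown-corpus probabilities quoted on this slide
P_VB_TO, P_NN_TO = 0.83, 0.00047
P_race_VB, P_race_NN = 0.00012, 0.00057
P_NR_VB, P_NR_NN = 0.0027, 0.0012

vb = P_VB_TO * P_NR_VB * P_race_VB   # score of tagging race as VB
nn = P_NN_TO * P_NR_NN * P_race_NN   # score of tagging race as NN

print(f"{vb:.2e}")   # 2.69e-07
print(f"{nn:.2e}")   # 3.21e-10
print(vb > nn)       # True: race is tagged VB
```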
马尔可夫链 Markov Chain
• Definition
• A Markov chain is a special case of a weighted automaton in which the input sequence uniquely determines which states the automaton will go through.
• Graphical Model Representation
• A set of N states: Q = q1q2 … qN
• A transition probability matrix: A = a_01 a_02 … a_n1 … a_nn, each a_ij representing the probability of moving from state i to state j
• A special start state and end state: q0, qF
• Alternatively, we use an initial probability distribution over states:
π = π_1 π_2 … π_N, where π_i is the probability that the Markov chain will start in state i.
• What is Special about Markov Chains
• A Markov chain can’t represent inherently ambiguous problems, so it is only useful for assigning probabilities to unambiguous sequences.
• A Markov chain is not suitable for POS tagging, because the states (POS) cannot be directly observed.
• Markov assumption: P(q_i | q_1 … q_{i-1}) = P(q_i | q_{i-1})
In-Class Exercise
• Using the following Markov chain, compute the probability of the sequence: {cold hot cold hot}
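The exercise's Markov chain figure is not reproduced in this text, so the sketch below uses assumed start and transition probabilities purely to illustrate the computation (chain rule plus the Markov assumption):

```python
# Assumed parameters -- the figure's actual numbers are not available here.
start = {"hot": 0.5, "cold": 0.5}
trans = {("hot", "hot"): 0.6, ("hot", "cold"): 0.4,
         ("cold", "hot"): 0.3, ("cold", "cold"): 0.7}

def chain_prob(seq):
    # P(q_1 ... q_T) = pi(q_1) * prod_t P(q_t | q_{t-1})
    p = start[seq[0]]
    for a, b in zip(seq, seq[1:]):
        p *= trans[(a, b)]
    return p

print(chain_prob(["cold", "hot", "cold", "hot"]))  # 0.5*0.3*0.4*0.3 = 0.018
```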
隐马尔可夫模型 Hidden Markov Model
• Markov Chain vs HMM
• A Markov chain is useful when we need to compute a probability for a sequence of events that we can observe in the world.
• HMM allows us to talk about both observed events (like words that we see in the input) and hidden events (like POS tags).
• HMM Components
• A set of N states: Q = q1q2 … qN
• A transition probability matrix: A = a_11 a_12 … a_n1 … a_nn
• A sequence of T observations: O = o1o2 … oT
• A sequence of observation likelihoods, or emission probabilities:
B = b_i(o_t), the probability of observation o_t being generated from state i
• A special start state and end state: q0, qF
• Alternatively, we use an initial probability distribution over states:
π = π_1 π_2 … π_N, where π_i is the probability that the Markov chain will start in state i.
• HMM Assumptions
• Markov assumption
• Output independence assumption
• Fundamental Problems
• Computing likelihood: Given an HMM λ = (A,B) and an observation sequence O, determine the likelihood P(O|λ).
• Decoding: Given an observation sequence O and an HMM λ = (A,B), discover the best hidden state sequence Q.
• Learning: Given an observation sequence O and the set of states in the HMM, learn the HMM parameters A and B.
• A Running Example
• Jason eating ice cream on a sequence of days
There is some relation between the weather state (hot, cold) on a day and the number of ice creams eaten that day.
An integer represents the number of ice creams eaten on a given day (observed), and a sequence of H and C designates the weather states (hidden) that caused Jason to eat that much ice cream.
• Computing Likelihood
• Given an HMM model, what is the likelihood of {3, 1, 3}? Note that we do not know the hidden states (weather).
• Forward algorithm (a kind of dynamic programming)
• αt(j) represents the probability of being in state j after seeing the first t observations, given the model λ.
α_t(j) = Σ_{i=1..N} α_{t-1}(i) · a_ij · b_j(o_t)
  α_{t-1}(i): the previous forward path probability from the previous time step
  a_ij: the transition probability from previous state q_i to current state q_j
  b_j(o_t): the state observation likelihood of the observation symbol o_t given the current state j
• Computing Likelihood
• Algorithm
• Initialization: α_1(j) = a_0j · b_j(o_1),  1 ≤ j ≤ N
• Recursion: α_t(j) = Σ_{i=1..N} α_{t-1}(i) · a_ij · b_j(o_t),  1 ≤ j ≤ N, 1 < t ≤ T
• Termination: P(O | λ) = α_T(q_F) = Σ_{i=1..N} α_T(i) · a_iF
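A minimal sketch of the forward algorithm on the ice cream example. The slides' HMM parameters are in a figure that is not reproduced here, so the numbers below (π, A, B) are assumed for illustration only:

```python
# Assumed ice cream HMM: H = hot day, C = cold day.
states = ["H", "C"]
pi = {"H": 0.8, "C": 0.2}                                  # start distribution
A = {("H","H"): 0.6, ("H","C"): 0.4,
     ("C","H"): 0.5, ("C","C"): 0.5}                       # transitions
B = {"H": {1: 0.2, 2: 0.4, 3: 0.4},
     "C": {1: 0.5, 2: 0.4, 3: 0.1}}                        # emissions

def forward(obs):
    # alpha[j] = probability of the observation prefix, ending in state j
    alpha = {j: pi[j] * B[j][obs[0]] for j in states}      # initialization
    for o in obs[1:]:                                      # recursion
        alpha = {j: sum(alpha[i] * A[(i, j)] for i in states) * B[j][o]
                 for j in states}
    return sum(alpha.values())                             # termination: P(O|lambda)

print(forward([3, 1, 3]))
```

Each time step sums over all ways of reaching each state, which is what keeps the cost linear in T instead of exponential.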
• Decoding
• Given an HMM model and an ice cream sequence {3, 1, 3}, what is the most likely hidden weather state sequence?
• Viterbi algorithm (a kind of dynamic programming)
• v_t(j) represents the probability that the HMM is in state j after seeing the first t observations and passing through the most probable state sequence q_0, q_1, …, q_{t-1}, given the model λ.
v_t(j) = max_{i=1..N} v_{t-1}(i) · a_ij · b_j(o_t)
  v_{t-1}(i): the previous Viterbi path probability from the previous time step
  a_ij: the transition probability from previous state q_i to current state q_j
  b_j(o_t): the state observation likelihood of the observation symbol o_t given the current state j
• Decoding
• Algorithm
• Initialization: v_1(j) = a_0j · b_j(o_1);  bt_1(j) = 0
• Recursion: v_t(j) = max_{i=1..N} v_{t-1}(i) · a_ij · b_j(o_t);  bt_t(j) = argmax_{i=1..N} v_{t-1}(i) · a_ij
• Termination: P* = v_T(q_F) = max_{i=1..N} v_T(i) · a_iF; the best state sequence is recovered by following the backpointers bt back from q_F
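The same assumed ice cream HMM as in the forward sketch can be decoded with a small Viterbi implementation that keeps, for each state, the best path reaching it (the explicit path list plays the role of the backpointers):

```python
# Same illustrative HMM as before; all parameters are assumed, not from the slides.
states = ["H", "C"]
pi = {"H": 0.8, "C": 0.2}
A = {("H","H"): 0.6, ("H","C"): 0.4,
     ("C","H"): 0.5, ("C","C"): 0.5}
B = {"H": {1: 0.2, 2: 0.4, 3: 0.4},
     "C": {1: 0.5, 2: 0.4, 3: 0.1}}

def viterbi(obs):
    v = {j: pi[j] * B[j][obs[0]] for j in states}          # initialization
    paths = {j: [j] for j in states}
    for o in obs[1:]:                                      # recursion
        v_new, paths_new = {}, {}
        for j in states:
            # best predecessor state for reaching j (max instead of sum)
            best = max(states, key=lambda i: v[i] * A[(i, j)])
            v_new[j] = v[best] * A[(best, j)] * B[j][o]
            paths_new[j] = paths[best] + [j]
        v, paths = v_new, paths_new
    best_last = max(states, key=lambda j: v[j])            # termination
    return paths[best_last], v[best_last]

print(viterbi([3, 1, 3]))
```

Note the only change from the forward algorithm is replacing the sum over predecessors with a max, plus remembering which predecessor won.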
• Decoding
• Algorithm (worked trellis figures not reproduced)
• Learning
• Given an ice cream sequence {3, 1, 3} and the set of possible weather states {H, C}, what are the HMM parameters (A and B)?
• Forward-Backward algorithm (a kind of Expectation Maximization)
• βt(j) represents the probability of seeing the observations from time t+1 to the end, given that we are in state j at time t, and the model λ.
• Initialization: β_T(i) = a_iF,  1 ≤ i ≤ N
• Recursion: β_t(i) = Σ_{j=1..N} a_ij · b_j(o_{t+1}) · β_{t+1}(j),  1 ≤ i ≤ N, 1 ≤ t < T
• Termination: P(O | λ) = Σ_{j=1..N} a_0j · b_j(o_1) · β_1(j)
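For the same assumed ice cream HMM used in the earlier sketches, the backward pass looks like this; a useful sanity check is that its termination value equals the forward algorithm's P(O | λ):

```python
# Same illustrative HMM; parameters are assumed, not from the slides.
states = ["H", "C"]
pi = {"H": 0.8, "C": 0.2}
A = {("H","H"): 0.6, ("H","C"): 0.4,
     ("C","H"): 0.5, ("C","C"): 0.5}
B = {"H": {1: 0.2, 2: 0.4, 3: 0.4},
     "C": {1: 0.5, 2: 0.4, 3: 0.1}}

def backward(obs):
    beta = {i: 1.0 for i in states}            # initialization at t = T
    for o in reversed(obs[1:]):                # recursion, right to left
        beta = {i: sum(A[(i, j)] * B[j][o] * beta[j] for j in states)
                for i in states}
    # termination: P(O|lambda) = sum_j pi_j * b_j(o_1) * beta_1(j)
    return sum(pi[j] * B[j][obs[0]] * beta[j] for j in states)

print(backward([3, 1, 3]))   # same value the forward algorithm gives
```

The forward-backward (Baum-Welch) learning step then combines α and β to get the γ and ξ quantities described below.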
• Learning
• Algorithm
γ_t(j): the probability of being in state j at time t
ξ_t(i, j): the probability of being in state i at time t and state j at time t+1
Solving the learning problem is the most complicated. Consult your textbook to find more details.
隐马尔可夫标注算法 HMM POS Tagging
• Using Viterbi to solve the decoding problem
• An English example: I want to race.
Transition probabilities A and emission probabilities B (tables not reproduced)
• An English Example (trellis figure not reproduced)
In-Class Exercise
• Compute v_3(3) for the preceding example, using the given probabilities. Note that you first need to compute all of the v_2(·) values.
• Other Tagging Methods
• CLAWS (a brute-force algorithm)
• Within a word span whose first and last words each have a unique POS, enumerate every possible tag path and choose the one with the maximum probability (minimum cost).
• VOLSUNGA (a greedy algorithm)
• As an improvement on CLAWS, it builds the path step by step, keeping only the best choice found at each step. The final path is the concatenation of these locally optimal steps, so it is faster but may miss the global optimum.
• A Chinese Example
• In implementation, we often use log probabilities to prevent numerical underflow caused by products of many small probabilities. With negative log probabilities, finding the maximum product becomes finding the minimum sum.
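A quick demonstration of why negative log costs are used:

```python
import math

# Multiplying many small probabilities underflows double precision.
p = 1.0
for _ in range(400):
    p *= 1e-3
print(p)                                   # 0.0 -- underflow

# Negative log probabilities turn the product into a stable sum,
# and maximizing a product becomes minimizing a sum of costs.
cost = sum(-math.log10(1e-3) for _ in range(400))
print(cost)                                # about 1200
```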
,报道新闻了,
Transition probabilities A / transition costs TC, and emission probabilities B / emission costs EC (tables not reproduced; the cost values are quoted in the computations below)
• A Chinese Example
• CLAWS
Path 1:w-n-n-u-w
Cost1 = TC[w,n]+TC[n,n]+TC[n,u]+TC[u,w]=2.09+1.76+2.40+2.22=8.47
Path 2:w-n-n-v-w
Cost2 = TC[w,n]+TC[n,n]+TC[n,v]+TC[v,w]=2.09+1.76+1.71+1.85=7.41
Path 3:w-n-n-y-w
Cost3 = TC[w,n]+TC[n,n]+TC[n,y]+TC[y,w]=2.09+1.76+5.10+0.08=9.03
Path 4:w-v-n-u-w
Cost4 = TC[w,v]+TC[v,n]+TC[n,u]+TC[u,w]=1.90+1.72+2.40+2.22=8.24
Path 5:w-v-n-v-w
Cost5 = TC[w,v]+TC[v,n]+TC[n,v]+TC[v,w]=1.90+1.72+1.71+1.85=7.18
Path 6:w-v-n-y-w
Cost6 = TC[w,v]+TC[v,n]+TC[n,y]+TC[y,w]=1.90+1.72+5.10+0.08=8.80
The result is 报道 /v 新闻 /n 了 /v
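The six-path enumeration can be reproduced mechanically. As on the slide, CLAWS here scores paths by transition costs only:

```python
from itertools import product

# Transition costs quoted in the path computations above (negative log probabilities).
TC = {("w","n"): 2.09, ("w","v"): 1.90, ("v","n"): 1.72, ("n","n"): 1.76,
      ("n","u"): 2.40, ("n","v"): 1.71, ("n","y"): 5.10,
      ("u","w"): 2.22, ("v","w"): 1.85, ("y","w"): 0.08}

# Candidate tags per token of ", 报道 新闻 了 ,"
cands = [["w"], ["n", "v"], ["n"], ["u", "v", "y"], ["w"]]

def path_cost(path):
    return sum(TC[(a, b)] for a, b in zip(path, path[1:]))

# Brute force: score all tag paths, keep the cheapest.
best = min(product(*cands), key=path_cost)
print(best, round(path_cost(best), 2))   # ('w', 'v', 'n', 'v', 'w') 7.18
```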
• A Chinese Example
• VOLSUNGA
Step1:min{TC[w,n]+EC[报道 |n], TC[w,v]+EC[报道 |v]}
= min{2.09+8.22,1.90+5.69}
T[1] = v
Step2:min{TC[v,n]+EC[新闻 |n]}
= min{1.72+6.55}
T[2] = n
Step3:min{TC[n,u]+EC[了 |u], TC[n,v]+EC[了 |v], TC[n,y]+EC[了 |y]}
= min{2.40+1.98,1.71+7.76,5.10+0.38}
T[3] = u
The result is 报道 /v 新闻 /n 了 /u
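The three greedy steps can be sketched directly with the cost values quoted above; note how committing to each locally cheapest tag reproduces the VOLSUNGA result:

```python
# Transition and emission costs quoted in the steps above.
TC = {("w","n"): 2.09, ("w","v"): 1.90, ("v","n"): 1.72, ("n","n"): 1.76,
      ("n","u"): 2.40, ("n","v"): 1.71, ("n","y"): 5.10,
      ("u","w"): 2.22, ("v","w"): 1.85, ("y","w"): 0.08}
EC = {("报道","n"): 8.22, ("报道","v"): 5.69, ("新闻","n"): 6.55,
      ("了","u"): 1.98, ("了","v"): 7.76, ("了","y"): 0.38}

tokens = ["报道", "新闻", "了"]
cands = [["n", "v"], ["n"], ["u", "v", "y"]]

prev, tags = "w", []                      # sentence opens with ",/w"
for word, cand in zip(tokens, cands):
    # greedy: commit to the locally cheapest tag at each step
    best = min(cand, key=lambda t: TC[(prev, t)] + EC[(word, t)])
    tags.append(best)
    prev = best
print(tags)                               # ['v', 'n', 'u']
```

Here the greedy result (了/u) differs from the Viterbi result below because the cheap y→w transition is invisible to a method that never looks ahead.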
• A Chinese Example
• Viterbi
Step1: k = 1
Cost[1, 报道 /n] = (Cost[0, w] + TC[w,n]) + EC[报道 |n] = 10.31
Cost[1, 报道 /v] = (Cost[0, w] + TC[w,v]) + EC[报道 |v] = 7.59
Step2: k = 2
Cost[2, 新闻 /n] = min{(Cost[1, v] + TC[v,n]), (Cost[1, n] + TC[n,n])}+ EC[新闻 |n] = 7.59 + 1.72 + 6.55 = 15.86
Step3: k = 3
Cost[3, 了 /u] = (Cost[2, n] + TC[n,u]) + EC[了 |u] = 20.24
Cost[3, 了 /v] = (Cost[2, n] + TC[n,v]) + EC[了 |v] = 25.33
Cost[3, 了 /y] = (Cost[2, n] + TC[n,y]) + EC[了 |y] = 21.34
Step4: k = 4
Cost[4, , /w] = min{(Cost[3, u] + TC[u,w]), (Cost[3, v] + TC[v,w]), (Cost[3, y] + TC[y,w])}+ EC[, |w] = 21.34 + 0.08 + 0 = 21.42
The result is 报道 /v 新闻 /n 了 /y
Cost[k, t] comes from the negative log probability: Cost[k, t] = min_s {Cost[k-1, s] + TC[s, t]} + EC[w_k|t]
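The recurrence can be run end to end with the cost values quoted in the steps above, reproducing the minimum cost 21.42:

```python
# Cost tables quoted in the worked steps above (negative log probabilities).
TC = {("w","n"): 2.09, ("w","v"): 1.90, ("v","n"): 1.72, ("n","n"): 1.76,
      ("n","u"): 2.40, ("n","v"): 1.71, ("n","y"): 5.10,
      ("u","w"): 2.22, ("v","w"): 1.85, ("y","w"): 0.08}
EC = {("报道","n"): 8.22, ("报道","v"): 5.69, ("新闻","n"): 6.55,
      ("了","u"): 1.98, ("了","v"): 7.76, ("了","y"): 0.38, (",","w"): 0.0}

words = [",", "报道", "新闻", "了", ","]
cands = [["w"], ["n", "v"], ["n"], ["u", "v", "y"], ["w"]]

# cost[t] = (min cost of any path ending in tag t, that path)
cost = {"w": (0.0, ["w"])}
for word, cand in zip(words[1:], cands[1:]):
    new = {}
    for t in cand:
        # Cost[k, t] = min_s {Cost[k-1, s] + TC[s, t]} + EC[w_k | t]
        c, path = min((cost[s][0] + TC[(s, t)], cost[s][1]) for s in cost)
        new[t] = (c + EC[(word, t)], path + [t])
    cost = new

best_cost, best_path = cost["w"]
print(best_path, round(best_cost, 2))    # ['w', 'v', 'n', 'y', 'w'] 21.42
```

This confirms the slide's result 报道/v 新闻/n 了/y: Viterbi keeps all partial paths alive per tag, so the cheap y→w transition that the greedy method missed is found.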
Wrap-Up
• 统计模型的训练 Training a Statistical Model
  • Computing Probabilities
• 马尔可夫链 Markov Chain
  • Definition
  • Graphical Representation
• 隐马尔可夫模型 Hidden Markov Model
  • Components
  • Assumptions
  • Computing Likelihood
  • Decoding
  • Learning
• 隐马尔可夫标注算法 HMM POS Tagging
  • Viterbi
  • Examples
  • Vs. Other Methods