Statistical Natural Language Processing

POSTECH Natural Language Processing Lab
Tutorial, 1999.8.27
이근배
Contents
- statistical vs. structured NLP
- statistics for computational linguistics
- POS tagging
- PCFG parsing
- other applications
- conclusion
References

(basic and applied statistics)
- Eugene Charniak. Statistical Language Learning. MIT Press, 1993.
- Christopher D. Manning and Hinrich Schutze. Foundations of Statistical Natural Language Processing. MIT Press, 1999.
- Brigitte Krenn and Christer Samuelsson. The Linguist's Guide to Statistics. Internet shareware, http://www.coli.uni-sb.de/~krenn/edu.html

(general)
- Abney, S. Statistical methods and linguistics. In J. Klavans and P. Resnik (eds.), The Balancing Act, MIT Press, 1996.
- Church and Mercer. Introduction to the special issue on computational linguistics using large corpora. Computational Linguistics, 19, 1993.

(POS tagging)
- Cutting et al. A practical part-of-speech tagger. In Proceedings of the 3rd Conference on Applied Natural Language Processing, 1992.
- Church. A stochastic parts program and noun phrase parser for unrestricted text. In Proceedings of the 2nd Conference on Applied Natural Language Processing, 1988.
- Weischedel et al. Coping with ambiguity and unknown words through probabilistic models. Computational Linguistics, 19(2), 1993.
- J. Kupiec. Robust part-of-speech tagging using a hidden Markov model. Computer Speech and Language, 6, 1992.
- E. Brill. A simple rule-based part-of-speech tagger. In Proceedings of the 3rd Conference on Applied NLP, 1992.
References

- E. Roche and Y. Schabes. Deterministic part-of-speech tagging with finite-state transducers. Computational Linguistics, 21, 1995.
- B. Merialdo. Tagging English text with a probabilistic model. Computational Linguistics, 20, 1994.

(statistical parsing)
- K. Lari and S. Young. The estimation of stochastic context-free grammars using the inside-outside algorithm. Computer Speech and Language, 4, 1990.
- F. Pereira and Y. Schabes. Inside-outside reestimation from partially bracketed corpora. ACL 30, 1992.
- T. Briscoe and J. Carroll. Generalized probabilistic LR parsing of natural language (corpora) with unification-based grammars. Computational Linguistics, 19, 1993.
- E. Black et al. Towards history-based grammars: using richer models for probabilistic parsing. ACL 31, 1993.
- D. Magerman. Statistical decision-tree models for parsing. ACL 33, 1995.
- Brill. Automatic grammar induction and parsing free text: a transformation-based approach. ACL 31, 1993.
References

(statistical disambiguation)
- D. Hindle and M. Rooth. Structural ambiguity and lexical relations. Computational Linguistics, 19, 1993.
- K. Church and P. Hanks. Word association norms, mutual information, and lexicography. ACL 28, 1990.
- Alshawi and Carter. Training and scaling preference functions for disambiguation. Computational Linguistics, 20(4), 1994.

(word classes and WSD)
- Gale et al. Work on statistical methods for word-sense disambiguation. In Proceedings of the AAAI Fall Symposium: Probabilistic Approaches to Natural Language, 1992.
- Gale et al. Estimating upper and lower bounds on the performance of word-sense disambiguation programs. ACL 30, 1992.
- Yarowsky. Word-sense disambiguation using statistical models of Roget's categories trained on large corpora. Coling, 1992.
- Yarowsky. Unsupervised word-sense disambiguation rivaling supervised methods. ACL 33, 1995.
- Pereira et al. Distributional clustering of English words. ACL 31, 1993.
- Dagan et al. Contextual word similarity and estimation from sparse data. ACL 31, 1993.
- Dagan et al. Similarity-based estimation of word cooccurrence probabilities. ACL 32, 1994.
References

(text alignment and machine translation)
- Kay and Roscheisen. Text-translation alignment. Computational Linguistics, 19, 1993.
- Gale and Church. A program for aligning sentences in bilingual corpora. Computational Linguistics, 19, 1993.
- Brown et al. The mathematics of statistical machine translation: parameter estimation. Computational Linguistics, 1993.
- Brown et al. A statistical approach to machine translation. Computational Linguistics, 16, 1990.
- Wu. Aligning a parallel English-Chinese corpus statistically with lexical criteria. ACL 32, 1994.
- Church. Char_align: a program for aligning parallel texts at the character level. ACL 31, 1993.
- Sproat et al. A stochastic finite-state word-segmentation algorithm for Chinese. ACL 32, 1994.

(lexical knowledge acquisition)
- Manning. Automatic acquisition of a large subcategorization dictionary from corpora. ACL 31, 1993.
- Smadja. Retrieving collocations from text: Xtract. Computational Linguistics, 1993.

(speech and others)
- Brown et al. Class-based n-gram models of natural language. Computational Linguistics, 18(4), 1992.
Structure or Statistics: the Pendulum of History

Statistical analysis              Structural analysis
- data driven                     - rule driven
- empirical                       - rational
- connectionist                   - symbolic
- speech community                - NLU; Chomskian, Schankian, AI community
Structural NLP

- grammar rules + lexicons
  - grammatical categories (POS, syntactic categories)
  - unification features (connectivity, agreement, semantics, ...)
- chart parsing
- compositional semantics

Limitation: enormous ambiguity.
"List the sales of the products produced in 1973 with the products produced in 1972"
==> 455 parses (Martin et. al. 1981)
Statistical NLP

The role of grammar at the syntactic level: which word sequences form valid sentences?
Compute Pr(w_1, w_2, ..., w_n) via the chain rule:
  Pr(w_1, ..., w_n) = Pr(w_1) Pr(w_2 | w_1) Pr(w_3 | w_1 w_2) ... Pr(w_n | w_1, ..., w_{n-1})
Bigram estimate: pr(w_2 | w_1) = count(w_1 w_2) / count(w_1)   [MLE]
  e.g. the (big, pig) dog
Shannon game: predicting the next word given a word sequence.

Language modeling: a probability matrix over word sequences.
Evaluating a language model: apply the cross-entropy concept,
  H = - Σ pr(w_{1,n}) log pr_M(w_{1,n})
When pr_M(w_{1,n}) = pr(w_{1,n}), the cross entropy reaches its minimum and the language model M is perfect.
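To make the chain-rule / MLE recipe concrete, here is a minimal bigram language model sketch in Python; the toy corpus, the <s>/</s> boundary markers, and the function names are illustrative assumptions, not part of the tutorial.

```python
from collections import Counter

def train_bigram_mle(sentences):
    """MLE bigram model: pr(w2 | w1) = count(w1 w2) / count(w1)."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        words = ["<s>"] + sent + ["</s>"]
        unigrams.update(words[:-1])            # count(w1) as a bigram context
        bigrams.update(zip(words, words[1:]))
    return lambda w1, w2: bigrams[(w1, w2)] / unigrams[w1] if unigrams[w1] else 0.0

def sentence_prob(bigram_p, sent):
    """Chain rule under the 1st-order Markov assumption."""
    words = ["<s>"] + sent + ["</s>"]
    prob = 1.0
    for w1, w2 in zip(words, words[1:]):
        prob *= bigram_p(w1, w2)
    return prob

corpus = [["the", "big", "dog"], ["the", "pig"], ["the", "big", "pig"]]
p = train_bigram_mle(corpus)
print(sentence_prob(p, ["the", "big", "dog"]))  # Shannon-game style scoring
```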
Chomsky vs. Shannon

What Shannon says: for the n-th order probability approximation of a sentence s,
  grammatical(s) <--> lim_{n→∞} P_n(s) ≠ 0
What Chomsky says: there is no way to choose n such that, for all sentences s,
  grammatical(s) <--> P_n(s) ≠ 0
What's wrong with the n-th order Markov model? The probabilities are fine, but the finite-state machine is not.
Contents
- statistical vs. structured NLP
- statistics for computational linguistics
- POS tagging
- PCFG parsing
- other applications
- conclusion
"But I don't know any statistics!"

Bayesian inversion formula
  p(a | b) = p(a) p(b | a) / p(b)

Maximum-likelihood estimation (MLE)
  Make the probability of the observed outcome maximal:
  max_θ L(x_1, x_2, ..., x_n; θ) ==> relative frequency
  e.g. pr(w_2 | w_1) = count(w_1 w_2) / count(w_1)
  cf. MAP estimation, Bayesian estimation, function approximation using NNs

Smoothing (discounting) required: adding one, held-out, deleted estimation, Good-Turing, Charniak's linear interpolation, Katz's backing-off, etc. (a sketch of the simplest follows below)

Random variable
  rv: Ω (sample space) --> R (real numbers)
Random process (stochastic process)
  characterized by p(x_{t+1} | x_1, x_2, ..., x_t)
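A hedged sketch of the simplest discounting scheme above, add-one (Laplace) smoothing, applied to the bigram estimate; the counts and the vocabulary size are made-up toy values.

```python
from collections import Counter

def add_one_bigram(bigrams, unigrams, vocab_size):
    """pr(w2 | w1) = (count(w1 w2) + 1) / (count(w1) + V),
    so unseen bigrams keep a little probability mass."""
    def p(w1, w2):
        return (bigrams[(w1, w2)] + 1) / (unigrams[w1] + vocab_size)
    return p

unigrams = Counter({"the": 3, "big": 2})
bigrams = Counter({("the", "big"): 2, ("big", "dog"): 1})
p = add_one_bigram(bigrams, unigrams, vocab_size=5)
print(p("the", "big"))     # (2+1)/(3+5) = 0.375
print(p("the", "unseen"))  # (0+1)/(3+5) = 0.125, not zero
```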
"I still don't get it."

Markov chain: a special stochastic process
  p(x_{t+1} | x_1, x_2, ..., x_t) = p(x_{t+1} | x_t) : transition matrix

Markov model
  state transition matrix: p_ij = p(s_j | s_i)   <-- 1st Markov assumption
  signal matrix:           a_ij = p(o_j | s_i)   <-- 2nd Markov assumption
  initial state vector:    v_i = p(s_i)

Hidden Markov model
  the state sequence is hidden; only the signal sequence can be observed
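The three parameter sets above can be carried around as plain dictionaries; a minimal container sketch (the two-tag toy numbers are invented for illustration).

```python
from dataclasses import dataclass

@dataclass
class HMM:
    trans: dict  # trans[si][sj] = p(sj | si)   (1st Markov assumption)
    emit: dict   # emit[si][oj]  = p(oj | si)   (2nd Markov assumption)
    init: dict   # init[si]      = p(si)        (initial state vector)

toy = HMM(
    trans={"N": {"N": 0.3, "V": 0.7}, "V": {"N": 0.8, "V": 0.2}},
    emit={"N": {"dog": 0.6, "barks": 0.4}, "V": {"dog": 0.1, "barks": 0.9}},
    init={"N": 0.7, "V": 0.3},
)
# Only the signal sequence ("dog", "barks", ...) is observable;
# the state sequence ("N", "V", ...) is hidden.
```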
"Now it's starting to make sense."

Entropy
  For a discrete r.v. with p(x_i): how much uncertainty there is from not knowing the outcome of the r.v.
  The average number of bits needed to code each particular outcome:
    H = E[-log2 p(x)] = Σ_i -p_i log2 p_i
  Maximal under the uniform distribution.

Joint entropy / conditional entropy

Mutual information
  MI[r1, r2] = E[log p(r1, r2) / (p(r1) p(r2))] = H[r1] - H[r1 | r2]
  MI >> 0: strong correlation; MI << 0: strong negative correlation
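A small sketch of both quantities; the probabilities are toy assumptions. Note the second function computes the pointwise term, of which the slide's MI is the expectation.

```python
import math

def entropy(probs):
    """H = sum_i -p_i log2 p_i : average bits to code an outcome."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def pointwise_mi(p_xy, p_x, p_y):
    """log2[ p(x,y) / (p(x) p(y)) ] for a single outcome pair."""
    return math.log2(p_xy / (p_x * p_y))

print(entropy([0.5, 0.5]))            # 1.0 bit (fair coin)
print(entropy([0.25] * 4))            # 2.0 bits: uniform maximizes entropy
print(pointwise_mi(0.2, 0.25, 0.25))  # > 0: the pair co-occurs more than chance
```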
"Is that all?"

Perplexity
  perp[r] = e^{H[r]} = the branching factor in the word sequence

Cross entropy: how good is your model?
  H_p[q] = - Σ_x p(x) ln q(x)   (minimal when p = q)

Relative entropy (KL distance; KL divergence)
  a distance measure between two distributions p, q:
  D[p || q] = H_p[q] - H[p] >= 0

Information radius (IRad)
  D[p || (p+q)/2] + D[q || (p+q)/2]
  among the best divergence measures between probability distributions
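The three model-quality measures, transcribed directly; a minimal sketch with invented two-outcome distributions (natural log throughout, matching the slide).

```python
import math

def cross_entropy(p, q):
    """H_p[q] = -sum_x p(x) ln q(x); assumes q(x) > 0 wherever p(x) > 0."""
    return -sum(px * math.log(qx) for px, qx in zip(p, q) if px > 0)

def kl(p, q):
    """D[p||q] = H_p[q] - H[p] >= 0, with equality iff p == q."""
    return cross_entropy(p, q) - cross_entropy(p, p)

def irad(p, q):
    """Information radius: D[p||(p+q)/2] + D[q||(p+q)/2]."""
    m = [(px + qx) / 2 for px, qx in zip(p, q)]
    return kl(p, m) + kl(q, m)

p, q = [0.5, 0.5], [0.9, 0.1]
print(kl(p, p), kl(p, q), irad(p, q))  # 0.0, > 0, symmetric and finite
print(math.exp(cross_entropy(p, p)))   # perplexity e^H of the true model: 2.0
```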
Basic corpus linguistics

Empirical evaluation
- black-box evaluation: test the system as a whole
- glass-box evaluation: component-wise tests
- needs designed test material + annotated corpora + evaluation measures
Basic corpus linguistics

Contingency table measures
  recall = a / (a + c)      ; for completeness
  precision = a / (a + b)   ; for correctness
  fallout = b / (b + d)

                yes correct     no correct
  decide yes    true pos (a)    false pos (b)    a+b
  decide no     false neg (c)   true neg (d)     c+d
                a+c             b+d              n
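A direct transcription of the three measures into Python; the cell counts in the call are invented for illustration.

```python
def contingency_measures(a, b, c, d):
    """a = true pos, b = false pos, c = false neg, d = true neg."""
    return {
        "recall": a / (a + c),      # completeness
        "precision": a / (a + b),   # correctness
        "fallout": b / (b + d),
    }

print(contingency_measures(a=40, b=10, c=20, d=30))
# {'recall': 0.666..., 'precision': 0.8, 'fallout': 0.25}
```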
Basic corpus linguistics

Corpora according to text type
- balanced corpora (e.g. Brown corpus)
- pyramidal corpora: from large samples of a few representative genres to small samples of a wide variety of genres
- opportunistic corpora

Corpora according to annotation type
- raw corpus: tokenized/cleaned/meta-tagged
- POS-tagged corpora
- tree banks
- sense-tagged corpora

Corpora according to use
- training
- testing/evaluation
- cross validation (10-fold)
Basic corpus linguistics

Zipf's law: f ∝ 1/r   (f: frequency of a word, r: rank (position) of the word)
- the principle of least effort, for both the speaker (a small, frequent vocabulary) and the listener (a large vocabulary of rarer words, for less ambiguity)
- only a few words will have enough examples ==> always needs smoothing!

Collocations: the whole is beyond the sum of the parts
  cf. co-occurrence: no order constraints (= association)
- compounds (e.g. disk drive)
- phrasal verbs (e.g. make up)
- stock phrases (e.g. bacon and eggs)
- idioms (e.g. kick the bucket)
- can be several words long and discontinuous
- measures: variance, t-test, χ²-test, likelihood ratio, MI, etc.

KWIC (key word in context)
Contents
- statistical vs. structured NLP
- statistics for computational linguistics
- POS tagging
- PCFG parsing
- other applications
- conclusion
POS tagging

Example: a tagger trace for the Korean sentence
  포항공대 이근배 교수님께서 신을 신고 신고하러 가신다.
  ("Professor 이근배 of POSTECH puts on his shoes and goes to file a report.")
The trace assigns each morpheme a tag and a probability; note that the ambiguous 신고 is resolved once as verb stem 신 + connective 고 ("putting on ... and") and once as the noun 신고 ("report") + 하러.
POS tagging

Task: finding argmax_{t_{1,n}} p(w_{1,n}, t_{1,n})

HMM modeling: states -- tags; signals -- words
  cf. ME (maximum entropy) modeling / NN (neural net) modeling

The 3 problems of HMM modeling, given a sequence of signals (o_{1,n}) and states (s_{1,n}):
1. estimate the signal-sequence probability pr(o_{1,n}) ==> e.g. language identification / language modeling
2. determine the most probable state sequence argmax_{s_{1,n}} p(o_{1,n}, s_{1,n}) ==> e.g. POS tagging, speech recognition
3. determine the model parameters (P, A, v) for a given signal sequence ==> HMM training (MLE)
POS tagging

Task: finding argmax_{t_{1,n}} p(w_{1,n}, t_{1,n})
  = argmax_{t_{1,n}} Π_i p(w_i | t_i) p(t_{i+1} | t_i)
  (two Markov assumptions after chain-rule conditionalization)

Viterbi algorithm: time-synchronous dynamic programming
  d_i(t+1) = max_k [d_k(t) * p(s_i | s_k)] * p(w_{t+1} | s_i)
  d_i: probability of the best state (tag) sequence ending with s_i
  s: state (tag); w: word
  (trellis figure: states s_i, s_j, s_k across time steps T, T+1, T+2)
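A minimal log-space implementation of this recursion, sketched under toy assumptions: the two-tag grammar and the 1e-12 floor for unseen words are invented for illustration, and back-pointers recover the best tag sequence.

```python
import math

def viterbi(words, states, init, trans, emit):
    """Time-synchronous DP: d[t][s] = best log-prob of a tag sequence
    ending in state s after emitting words[0..t]."""
    d = [{s: math.log(init[s] * emit[s].get(words[0], 1e-12)) for s in states}]
    back = [{}]
    for t in range(1, len(words)):
        d.append({})
        back.append({})
        for s in states:
            best_prev = max(states, key=lambda k: d[t - 1][k] + math.log(trans[k][s]))
            d[t][s] = (d[t - 1][best_prev] + math.log(trans[best_prev][s])
                       + math.log(emit[s].get(words[t], 1e-12)))
            back[t][s] = best_prev
    # follow back-pointers from the best final state
    last = max(states, key=lambda s: d[-1][s])
    tags = [last]
    for t in range(len(words) - 1, 0, -1):
        tags.append(back[t][tags[-1]])
    return list(reversed(tags))

states = ["N", "V"]
init = {"N": 0.7, "V": 0.3}
trans = {"N": {"N": 0.3, "V": 0.7}, "V": {"N": 0.8, "V": 0.2}}
emit = {"N": {"dog": 0.6, "barks": 0.01}, "V": {"dog": 0.05, "barks": 0.9}}
print(viterbi(["dog", "barks"], states, init, trans, emit))  # ['N', 'V']
```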
POS tagging

Training a POS tagger
  r_1(i) = probability of starting in state s_i
  Σ_t e_t(i,j) = expected number of transitions from state s_i to state s_j
  Σ_t r_t(i) = expected number of transitions out of state s_i
  Σ_{t: w_t = w_j} r_t(i) = expected number of times word w_j is emitted in state s_i

  v_i = r_1(i)                                    (initial state vector)
  p_ij = Σ_t e_t(i,j) / Σ_t r_t(i)                (transition matrix)
  a_ij = Σ_{t: w_t = w_j} r_t(i) / Σ_t r_t(i)     (observation matrix)

Tagged corpus (supervised): frequency counts (a Markov model, not an HMM) -- see the sketch below
Raw corpus (unsupervised): EM algorithm (Baum-Welch re-estimation)
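For the supervised case the expectations collapse to plain relative-frequency counts; a minimal sketch (the one-sentence toy corpus and tag names are assumptions).

```python
from collections import Counter

def train_supervised(tagged_sentences):
    """MLE parameters from a tagged corpus: v, P, A as relative frequencies."""
    init, trans, emit, tag_tot = Counter(), Counter(), Counter(), Counter()
    for sent in tagged_sentences:
        init[sent[0][1]] += 1                      # first tag of the sentence
        for (w, t) in sent:
            emit[(t, w)] += 1
            tag_tot[t] += 1
        for (_, t1), (_, t2) in zip(sent, sent[1:]):
            trans[(t1, t2)] += 1
    n = sum(init.values())
    return ({t: c / n for t, c in init.items()},
            {k: c / tag_tot[k[0]] for k, c in trans.items()},
            {k: c / tag_tot[k[0]] for k, c in emit.items()})

corpus = [[("the", "D"), ("dog", "N"), ("barks", "V")]]
init, trans, emit = train_supervised(corpus)
print(emit[("N", "dog")], trans[("D", "N")])  # 1.0 1.0 on this toy corpus
```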
POS tagging

Problems of HMM training (using the EM algorithm = MLE)
- critical points (never move) -- use random noise
- over-fitting to the training data
- local maxima

Calculation of p(w_{1,n})
  define a_i(t): prob. of ending in state s_i having emitted w_{1,t-1} (forward variable)
  define b_i(t): prob. of seeing w_{t,n} if the state is s_i at time t (backward variable)
  a_j(t+1) = [Σ_i a_i(t) p(s_j | s_i)] p(w_t | s_j)
  p(w_{1,n}) = Σ_i a_i(n+1)
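A direct transcription of the forward recursion into Python, following the slide's convention that a_i(1) = v_i and that w_t is emitted on entering the state at time t+1; the toy parameters repeat the earlier two-tag assumption.

```python
def forward(words, states, init, trans, emit):
    """p(w_{1,n}) = sum_i a_i(n+1), with
    a_j(t+1) = [sum_i a_i(t) p(s_j|s_i)] p(w_t|s_j)."""
    a = dict(init)                      # a_i(1) = v_i
    for w in words:
        a = {j: sum(a[i] * trans[i][j] for i in states) * emit[j].get(w, 0.0)
             for j in states}
    return sum(a.values())

states = ["N", "V"]
init = {"N": 0.7, "V": 0.3}
trans = {"N": {"N": 0.3, "V": 0.7}, "V": {"N": 0.8, "V": 0.2}}
emit = {"N": {"dog": 0.6, "barks": 0.01}, "V": {"dog": 0.05, "barks": 0.9}}
print(forward(["dog", "barks"], states, init, trans, emit))
```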
Contents
- statistical vs. structured NLP
- statistics for computational linguistics
- POS tagging
- PCFG parsing
- other applications
- conclusion
PCFG parsing

P(w_{1,n}) = Σ P(t_{1,n}), where t_{1,n} ranges over parse trees covering w_1 ... w_n
p(t_{1,n}) = Π p(rule), using the sub-tree independence assumption
(figure: a nonterminal N^j_{k,l} dominating the span w_k ... w_l)
PCFG parsing

Why PCFG?
- ordering of the parses (structural ambiguity)
- accounts for grammar induction (with only positive examples)
- compatible with lexical language modeling

"the green banana": pcfg: n --> banana / n --> time / n --> number ...; trigram: the green (time/number/banana)
"Fred watered his mother's small garden": trigram: p(garden | mother's small); pcfg: p(x = garden | x is head of the direct object of "to water")
PCFG parsingPCFG parsing
Pcfg vs HMM (prob. Regular grammar) ==> different the way tPcfg vs HMM (prob. Regular grammar) ==> different the way they assign probabilityhey assign probability in pcfg: s p(s) = 1 (s: sentences)
in hmm: w1,n p(w1,n) = 1 (w1,n: sentence of length n)
p1 : Alice went to the ____. p2: Alice went to the office. in hmm p1 > p2 in pcfg p1 < p2
PCFG parsing

Finding p(w_{1,n})
  b_j(k,l) = p(w_{k,l} | N^j_{k,l}) : inside probability (similar to the backward probability)
  a_j(k,l) = p(w_{1,k-1}, N^j_{k,l}, w_{l+1,n}) : outside probability (similar to the forward probability)
(figure: root N^1 over the whole sentence, with N^j spanning w_k ... w_l inside w_1 ... w_{k-1} [N^j] w_{l+1} ... w_n)
PCFG parsing

Finding p(w_{1,n}) using the inside probability
  cf. finding the most likely parse for a sentence
  b_1(1,n) = p(w_{1,n} | N^1_{1,n}) = p(w_{1,n})

For Chomsky normal form:
  b_j(k,k) = p(N^j --> w_k)
  b_j(k,l) = Σ_{p,q,m} p(N^j --> N^p N^q) b_p(k,m) b_q(m+1,l)
(figure: N^j rewriting as N^p over w_k ... w_m and N^q over w_{m+1} ... w_l)
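A bottom-up (CYK-style) sketch of the inside recursion for a CNF grammar; the rule tables, probabilities, and the start symbol "S" are toy assumptions.

```python
from collections import defaultdict

def inside(words, lexical, binary, start="S"):
    """Inside probability for a CNF PCFG:
    b[(j,k,l)] = p(words[k..l] | nonterminal j spans k..l), 1-indexed."""
    n = len(words)
    b = defaultdict(float)
    for k, w in enumerate(words, 1):                 # b_j(k,k) = p(j -> w_k)
        for (j, word), p in lexical.items():
            if word == w:
                b[(j, k, k)] = p
    for span in range(2, n + 1):                     # widen spans bottom-up
        for k in range(1, n - span + 2):
            l = k + span - 1
            for (j, p_, q), prob in binary.items():  # rule j -> p_ q
                b[(j, k, l)] += sum(prob * b[(p_, k, m)] * b[(q, m + 1, l)]
                                    for m in range(k, l))
    return b[(start, 1, n)]                          # = p(w_{1,n})

# toy CNF grammar (an illustrative assumption)
lexical = {("N", "dogs"): 1.0, ("V", "bark"): 1.0}
binary = {("S", "N", "V"): 1.0}
print(inside(["dogs", "bark"], lexical, binary))  # 1.0
```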
PCFG parsing

Training a PCFG: inside-outside re-estimation (MLE training a la Baum-Welch)

For Chomsky normal form:
  p_e(N^i --> s^j) = c(N^i --> s^j) / Σ_k c(N^i --> s^k)   (s^j: sentential form j)
  c(N^j --> N^p N^q) = 1/p(w_{1,n}) Σ_{k,l,m} a_j(k,l) p(N^j --> N^p N^q) b_p(k,m) b_q(m+1,l)
  c(N^i --> w^j) = 1/p(w_{1,n}) Σ_k a_i(k,k) p(N^i --> w^j, w^j = w_k)
(figure: N^j_{k,l} rewriting as N^p over w_k ... w_m and N^q over w_{m+1} ... w_l)
Contents
- statistical vs. structured NLP
- statistics for computational linguistics
- POS tagging
- PCFG parsing
- other applications
- conclusion
Other applications

Local syntactic disambiguation
- PP attachment problems
  "She ate the soup with the spoon" (n2)
  p(A = n1 | prep, v, n1) > p(A = v | prep, v, n1)
- relative-clause attachment
  "Fred awarded a prize to the dog and Bill who trained it"
- noun/noun and adjective/noun combination ambiguity
  song bird feeder kit / metal bird feeder kit / novice bird feeder kit
Other applications

Word clustering
- n-dim vector --> distance metric --> clustering algorithm
- n-dim vector (features): collocations, verb-noun case roles
- distance metric: mutual information, relative entropy

WSD (word-sense disambiguation)
- sense tagging for polysemous and homonymous words
- contextual properties of a target word --> n-dim vector
- supervised method (see the sketch after this list)
  - use a sense-tagged corpus (or a bilingual parallel corpus)
  - find argmax_s p(s) Π_x p(x | s), with x in c(w_i) assumed mutually independent
    (s: a sense; c(w_i): the local context of word w_i)
- unsupervised methods: dictionary-based, thesaurus-based, one-sense-per-collocation, local-syntax-based, sense discrimination (clustering)
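The supervised decision rule above is naive Bayes; a minimal sketch with add-one smoothing (the two-sense "bank"-style training examples are invented, and log-probabilities avoid underflow).

```python
import math
from collections import Counter

def train_nb_wsd(examples):
    """Naive-Bayes WSD: estimate p(s) and p(x|s) from (sense, context) pairs."""
    prior, cond, vocab = Counter(), Counter(), set()
    for sense, ctx in examples:
        prior[sense] += 1
        for x in ctx:
            cond[(sense, x)] += 1
            vocab.add(x)
    totals = Counter()
    for (s, _), c in cond.items():
        totals[s] += c
    def classify(ctx):
        def score(s):  # log p(s) + sum_x log p(x|s), add-one smoothed
            return (math.log(prior[s]) +
                    sum(math.log((cond[(s, x)] + 1) / (totals[s] + len(vocab)))
                        for x in ctx))
        return max(prior, key=score)
    return classify

classify = train_nb_wsd([
    ("financial", ["money", "loan"]),
    ("river", ["water", "shore"]),
])
print(classify(["loan", "interest"]))  # 'financial'
```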
Other applications

Text alignment
- align lists of pairs (words, sentences, paragraphs) from two texts
- solution: a subset of the Cartesian product

Statistical machine translation
- assign a probability to every pair of sentences
- find argmax_s p(s | t) = argmax_s p(s, t)   (s: source, t: target)
- noisy-channel architecture (figure): a source language model p(s) feeds a translation model p(t | s); the decoder computes argmax_s p(s | t), mapping S --> T --> S'
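The diagram compresses a Bayes step worth making explicit; a short derivation using the Bayesian inversion formula from earlier, with the slide's notation (p(t) can be dropped because it is fixed for a given target sentence):

```latex
\[
\hat{s} \;=\; \arg\max_{s} p(s \mid t)
        \;=\; \arg\max_{s} \frac{p(s)\, p(t \mid s)}{p(t)}
        \;=\; \arg\max_{s} \underbrace{p(s)}_{\text{language model}}\;
                           \underbrace{p(t \mid s)}_{\text{translation model}}
\]
```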
Conclusion: Structure or Statistics - Hybrid NLP

Statistical analysis              Structural analysis
- data driven                     - rule driven
- empirical                       - rational
- connectionist                   - symbolic
- speech community                - NLU; Chomskian, Schankian, AI community

Hybrid: structural (linguistic theory; prior knowledge) + statistical preference