
Page 1:

통계적 자연어 처리 (Statistical Natural Language Processing)

POSTECH Natural Language Processing Lab
Tutorial, 1999.8.27

이근배

Page 2:

contents

- statistical vs. structured NLP
- statistics for computational linguistics
- POS tagging
- PCFG parsing
- other applications
- conclusion

Page 3:

references

- Eugene Charniak. Statistical Language Learning. MIT Press, 1993.
- Christopher D. Manning and Hinrich Schutze. Foundations of Statistical Natural Language Processing. MIT Press, 1999.

(basic and applied statistics)
- Brigitte Krenn and Christer Samuelsson. The Linguist's Guide to Statistics. Internet shareware, http://www.coli.uni-sb.de/~krenn/edu.html

(general)
- Abney, S. Statistical methods and linguistics. In J. Klavans and P. Resnik (eds.), The Balancing Act. MIT Press, 1996.
- Church and Mercer. Introduction to the special issue on computational linguistics using large corpora. Computational Linguistics, 19, 1993.

(POS tagging)
- Cutting et al. A practical part-of-speech tagger. In Proceedings of the 3rd Conference on Applied Natural Language Processing, 1992.
- Church. A stochastic parts program and noun phrase parser for unrestricted text. In Proceedings of the 2nd Conference on Applied Natural Language Processing, 1988.
- Weischedel et al. Coping with ambiguity and unknown words through probabilistic models. Computational Linguistics, 19(2), 1993.
- J. Kupiec. Robust part-of-speech tagging using a hidden Markov model. Computer Speech and Language, 6, 1992.
- E. Brill. A simple rule-based part-of-speech tagger. In Proceedings of the 3rd Conference on Applied NLP, 1992.

Page 4:

references

- E. Roche and Y. Schabes. Deterministic part-of-speech tagging with finite-state transducers. Computational Linguistics, 21, 1995.
- B. Merialdo. Tagging English text with a probabilistic model. Computational Linguistics, 20, 1994.

(statistical parsing)
- K. Lari and S. Young. The estimation of stochastic context-free grammars using the inside-outside algorithm. Computer Speech and Language, 4, 1990.
- F. Pereira and Y. Schabes. Inside-outside reestimation from partially bracketed corpora. ACL 30, 1992.
- T. Briscoe and J. Carroll. Generalized probabilistic LR parsing of natural language (corpora) with unification-based grammars. Computational Linguistics, 19, 1993.
- E. Black et al. Towards history-based grammars: using richer models for probabilistic parsing. ACL 31, 1993.
- D. Magerman. Statistical decision-tree models for parsing. ACL 33, 1995.
- Brill. Automatic grammar induction and parsing free text: a transformation-based approach. ACL 31, 1993.

Page 5:

references

(statistical disambiguation)
- D. Hindle and M. Rooth. Structural ambiguity and lexical relations. Computational Linguistics, 19, 1993.
- K. Church and P. Hanks. Word association norms, mutual information and lexicography. ACL 28, 1990.
- Alshawi and Carter. Training and scaling preference functions for disambiguation. Computational Linguistics, 20(4), 1994.

(word classes and WSD)
- Gale et al. Work on statistical methods for word-sense disambiguation. In Proceedings of the AAAI Fall Symposium: Probabilistic Approaches to Natural Language, 1992.
- Gale et al. Estimating upper and lower bounds on the performance of word-sense disambiguation programs. ACL 30, 1992.
- Yarowsky. Word-sense disambiguation using statistical models of Roget's categories trained on large corpora. Coling, 1992.
- Yarowsky. Unsupervised word-sense disambiguation rivaling supervised methods. ACL 33, 1995.
- Pereira et al. Distributional clustering of English words. ACL 31, 1993.
- Dagan et al. Contextual word similarity and estimation from sparse data. ACL 31, 1993.
- Dagan et al. Similarity-based estimation of word cooccurrence probabilities. ACL 32, 1994.

Page 6:

references

(text alignment and machine translation)
- Kay and Roscheisen. Text-translation alignment. Computational Linguistics, 19, 1993.
- Gale and Church. A program for aligning sentences in bilingual corpora. Computational Linguistics, 19, 1993.
- Brown et al. The mathematics of statistical machine translation: parameter estimation. Computational Linguistics, 1993.
- Brown et al. A statistical approach to machine translation. Computational Linguistics, 16, 1990.
- Wu. Aligning a parallel English-Chinese corpus statistically with lexical criteria. ACL 32, 1994.
- Church. Char_align: a program for aligning parallel texts at the character level. ACL 31, 1993.
- Sproat et al. A stochastic finite-state word-segmentation algorithm for Chinese. ACL 32, 1994.

(lexical knowledge acquisition)
- Manning. Automatic acquisition of a large subcategorization dictionary from corpora. ACL 31, 1993.
- Smadja. Retrieving collocations from text: Xtract. Computational Linguistics, 1993.

(speech and others)
- Brown et al. Class-based n-gram models of natural language. Computational Linguistics, 18(4), 1992.

Page 7:

Structure or statistics: the pendulum of history

Statistical analysis: data driven, empirical, connectionist (the speech community)

Structural analysis: rule driven, rational, symbolic (NLU, Chomskian, Schankian, the AI community)

Page 8:

Structural NLP

grammar rules + lexicons
- grammatical category (POS, syntactic category)
- unification features (connectivity, agreement, semantics, ...)

chart parsing
compositional semantics

Limitation: enormous ambiguity
"List the sales of the products produced in 1973 with the products produced in 1972" ==> 455 parses (Martin et al. 1981)

Page 9:

Statistical NLP

The role of grammar at the syntactic level -- which word sequences form valid sentences?
Computing Pr(w1, w2, ..., wn):

pr(w1) pr(w2|w1) pr(w3|w1 w2) ... pr(wn|w1, ..., wn-1)

pr(w2|w1) = count(w1 w2) / count(w1)  [MLE]

e.g., the (big, pig) dog
Shannon game -- predicting the next word given a word sequence

language modeling -- probability matrix
evaluating a language model -- apply the cross-entropy concept: -Σ pr(w1,n) log prM(w1,n)

when prM(w1,n) = pr(w1,n), the cross entropy is minimal and the language model M is perfect
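A minimal sketch of the MLE bigram estimate and the chain-rule score above (the toy corpus and all names are illustrative, not from the tutorial):

```python
from collections import Counter

def bigram_mle(tokens):
    """MLE bigram estimate: pr(w2|w1) = count(w1 w2) / count(w1)."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return lambda w1, w2: bigrams[(w1, w2)] / unigrams[w1] if unigrams[w1] else 0.0

def sentence_prob(words, p, start="<s>"):
    """Chain-rule score pr(w1) pr(w2|w1) ..., approximated with bigrams."""
    prob, prev = 1.0, start
    for w in words:
        prob *= p(prev, w)
        prev = w
    return prob

corpus = "<s> the big dog </s> <s> the pig dog </s>".split()
p = bigram_mle(corpus)
print(p("the", "big"))                            # 0.5: 'the' -> 'big' once out of twice
print(sentence_prob("the big dog </s>".split(), p))
```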

Page 10:

Chomsky vs. Shannon

What Shannon says: for the n-th order probability approximation Pn(s) of sentence s,
grammatical(s) <--> lim (n->inf) Pn(s) > 0

What Chomsky says: there is no way to choose n and a threshold e such that for all sentences s,
grammatical(s) <--> Pn(s) > e

What's wrong with the n-th order Markov model? The probability is OK, but the finite-state machine is not.

Page 11:

contents

- statistical vs. structured NLP
- statistics for computational linguistics
- POS tagging
- PCFG parsing
- other applications
- conclusion

Page 12:

But I don't know any statistics?

Bayesian inversion formula
p(a|b) = p(a) p(b|a) / p(b)

Maximum-likelihood estimation (MLE)
choose the parameters that maximize the probability of the observed outcomes: max L(x1, x2, ..., xn; θ) ==> relative frequency
e.g., pr(w2|w1) = count(w1 w2) / count(w1)

cf. MAP estimation, Bayesian estimation, function approximation using NN

smoothing (discounting) required: adding one, held-out, deleted estimation, Good-Turing, Charniak's linear interpolation, Katz's backing-off, etc.

Random variable
rv: Ω (sample space) --> R (real numbers)

random process (stochastic process)
characterized by p(x_t+1 | x1, x2, ..., x_t)
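A minimal sketch of the "adding one" smoothing listed above, applied to the bigram estimate (toy data, illustrative names):

```python
from collections import Counter

def bigram_addone(tokens, vocab):
    """Add-one (Laplace) smoothed bigram:
    pr(w2|w1) = (count(w1 w2) + 1) / (count(w1) + |V|),
    so unseen bigrams keep a small nonzero probability."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    V = len(vocab)
    return lambda w1, w2: (bigrams[(w1, w2)] + 1) / (unigrams[w1] + V)

corpus = "<s> the big dog </s>".split()
p = bigram_addone(corpus, set(corpus))
print(p("big", "pig") > 0)   # True: unseen bigram, yet nonzero probability
```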

Page 13:

Still don't get it?

Markov chain: a special stochastic process
p(x_t+1 | x1, x2, ..., x_t) = p(x_t+1 | x_t) : transition matrix

Markov model
state transition matrix p_ij = p(sj | si)  <-- 1st Markov assumption
signal matrix a_ij = p(oj | si)  <-- 2nd Markov assumption
initial state vector v_i = p(si)

hidden Markov model
the state sequence is hidden; one can only observe the signal sequence
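For concreteness, a toy HMM with the three parameter sets above (the states, signals, and all probabilities are hypothetical; the Viterbi and forward sketches later in this deck reuse this `hmm` dict):

```python
# p[i][j] = p(sj | si) : state transition matrix (1st Markov assumption)
# a[i][o] = p(o | si)  : signal matrix           (2nd Markov assumption)
# v[i]    = p(si)      : initial state vector
hmm = {
    "states": ["N", "V"],
    "v": {"N": 0.7, "V": 0.3},
    "p": {"N": {"N": 0.4, "V": 0.6}, "V": {"N": 0.8, "V": 0.2}},
    "a": {"N": {"dog": 0.6, "walks": 0.4}, "V": {"dog": 0.1, "walks": 0.9}},
}
```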

Page 14:

Now I'm starting to get it

Entropy
for a discrete r.v. with p(xi): how much uncertainty from not knowing the outcome of the r.v.
the average number of bits needed to encode each particular outcome: E[-log2 p(r)] = Σ_i -p_i log2 p_i
maximal for the uniform distribution

joint entropy / conditional entropy

mutual information
MI[r1, r2] = E[log p(r1, r2) / (p(r1) p(r2))] = H[r1] - H[r1|r2]
MI >> 0: strong correlation; MI << 0: strong negative correlation
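A minimal sketch of both definitions (distributions passed in as plain lists/dicts; the toy numbers are illustrative):

```python
import math

def entropy(p):
    """H = Σ_i -p_i log2 p_i, in bits."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def mutual_information(joint, px, py):
    """MI = E[log2 p(x,y) / (p(x) p(y))] for a joint distribution {(x, y): prob}."""
    return sum(pxy * math.log2(pxy / (px[x] * py[y]))
               for (x, y), pxy in joint.items() if pxy > 0)

print(entropy([0.5, 0.5]), entropy([0.9, 0.1]))  # uniform is maximal: 1.0 > 0.469
joint = {("a", "b"): 0.4, ("a", "c"): 0.1, ("d", "b"): 0.1, ("d", "c"): 0.4}
print(mutual_information(joint, {"a": 0.5, "d": 0.5}, {"b": 0.5, "c": 0.5}))  # > 0
```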

Page 15:

Is that all?

Perplexity
perp[r] = e^H[r] = the branching factor in the word sequence

cross entropy -- how good is your model?
Hp[q] = -Σ_x p(x) ln q(x)  (minimal when p = q)

relative entropy (KL distance; KL divergence): a distance measure between two distributions p, q
D[p||q] = Hp[q] - H[p] >= 0

information radius (IRad)
D[p||(p+q)/2] + D[q||(p+q)/2]: among the best divergence measures between probability distributions
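The three quantities, sketched directly from the definitions above (distributions as dicts; names illustrative):

```python
import math

def cross_entropy(p, q):
    """Hp[q] = -Σ_x p(x) ln q(x); minimal when q = p."""
    return -sum(p[x] * math.log(q[x]) for x in p if p[x] > 0)

def kl(p, q):
    """D[p||q] = Hp[q] - H[p] >= 0."""
    return cross_entropy(p, q) - cross_entropy(p, p)

def irad(p, q):
    """Information radius: D[p||m] + D[q||m] with m = (p+q)/2."""
    m = {x: 0.5 * (p.get(x, 0) + q.get(x, 0)) for x in set(p) | set(q)}
    return kl(p, m) + kl(q, m)

p, q = {"a": 0.5, "b": 0.5}, {"a": 0.9, "b": 0.1}
print(kl(p, q), irad(p, q))           # both >= 0, and 0 iff p == q
print(math.exp(cross_entropy(p, p)))  # perplexity e^H[p] = 2.0 here
```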

Page 16:

Basic corpus linguistics

Empirical evaluation
black box evaluation: test the system as a whole
glass box evaluation: component-wise tests
needs designed test material + annotated corpora + evaluation measures

Page 17:

Basic corpus linguistics

Contingency table measures
recall = a / (a+c) ; for completeness
precision = a / (a+b) ; for correctness
fallout = b / (b+d)

                yes correct      no correct
decide yes      true pos (a)     false pos (b)    a+b
decide no       false neg (c)    true neg (d)     c+d
                a+c              b+d              n
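A minimal sketch computing the three measures from the table's four cells (the counts in the example call are made up):

```python
def contingency_measures(a, b, c, d):
    """a: true pos, b: false pos, c: false neg, d: true neg."""
    return {
        "recall": a / (a + c),     # completeness
        "precision": a / (a + b),  # correctness
        "fallout": b / (b + d),
    }

print(contingency_measures(a=40, b=10, c=20, d=30))
# {'recall': 0.666..., 'precision': 0.8, 'fallout': 0.25}
```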

Page 18:

Basic corpus linguistics

Corpora according to text type
balanced corpora (e.g. Brown corpus)
pyramidal corpora: from a large sample of a few representative genres to small samples of a wide variety of genres
opportunistic corpora

corpora according to annotation type
raw corpus: tokenized/cleaned/meta-tagged
POS-tagged
tree banks
sense tagged

corpora according to use
training
testing/evaluation
cross validation (10-fold)

Page 19:

Basic corpus linguistics

Zipf's law
f ∝ 1/r (f: frequency of a word, r: rank (position) of the word)
principle of least effort for both the speaker (a small, frequent vocabulary) and the listener (a large vocabulary of rarer words for fewer ambiguities)
only a few words will have enough examples ==> always needs smoothing!

Collocations: the whole is beyond the sum of the parts
cf. co-occurrence: no order constraints (= association)
compounds (e.g. disk drive)
phrasal verbs (e.g. make up)
stock phrases (e.g. bacon and eggs)
idioms (e.g. kick the bucket)
can be several words long and discontinuous
measures: variance, t-test, χ2-test, likelihood ratio, MI, etc (a PMI-based sketch follows below)

KWIC (key word in contexts)
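A minimal sketch of one such measure, scoring adjacent pairs by pointwise mutual information (the corpus handling and the `min_count` cutoff are illustrative choices):

```python
import math
from collections import Counter

def pmi_bigrams(tokens, min_count=2):
    """PMI(w1, w2) = log2 [ p(w1 w2) / (p(w1) p(w2)) ] over adjacent pairs.
    High PMI suggests a collocation; by Zipf's law most pairs are too rare
    to score reliably, hence the min_count cutoff."""
    n = len(tokens)
    uni = Counter(tokens)
    bi = Counter(zip(tokens, tokens[1:]))
    scores = {pair: math.log2((c / (n - 1)) / ((uni[pair[0]] / n) * (uni[pair[1]] / n)))
              for pair, c in bi.items() if c >= min_count}
    return sorted(scores.items(), key=lambda kv: -kv[1])  # best candidates first
```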

Page 20:

contents

- statistical vs. structured NLP
- statistics for computational linguistics
- POS tagging
- PCFG parsing
- other applications
- conclusion

Page 21:

POS tagging

포항공대 이근배 교수님께서 신을 신고 신고하러 가신다. ("Professor 이근배 of POSTECH puts on his shoes and goes to file a report" -- the two occurrences of 신고 must be tagged differently)

[sample tagger output: the full analysis trace for this sentence, listing for each position the character span, the chosen morpheme and its tag (e.g. MPO<포항공대>, MPN<이근배>, MC<교수>, jC<을>, DR<신>, eCC<고>, MC<신고>, DI<가>, eGS<시>, eGE<는다>), and the local and cumulative probabilities]

Page 22:

POS tagging

Task: finding argmax_t1,n p(w1,n, t1,n)

HMM modeling
states -- tags; signals -- words
cf. ME (max entropy) modeling / NN (neural net) modeling

3 problems of HMM modeling, given the sequence of signals (o1,n) and states (s1,n):
- estimate the signal sequence probability pr(o1,n) ==> e.g. language identification / language modeling
- determine the most probable state sequence argmax_s1,n p(o1,n, s1,n) ==> e.g. POS tagging, speech recognition
- determine the model parameters (P, A, v) for a given signal sequence ==> HMM training (MLE)

Page 23:

POS tagging

Task: finding argmax_t1,n p(w1,n, t1,n)

argmax_t1,n Π_i p(wi | ti) p(ti+1 | ti)

two Markov assumptions after chain-rule conditionalization

Viterbi algorithm: time-synchronous dynamic programming

[trellis diagram: states sk, sj, si plotted against times T, T+1, T+2]

d_i(t+1) = d_j(t) * p(si|sj) * p(w_t+1|si), where j = argmax_k d_k(t) p(si|sk)

d: probability of the best state (tag) sequence ending with si
s: state (tag), w: word
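A minimal sketch of this recurrence in Python, reusing the toy `hmm` dict from the earlier Markov-model sketch (states and probabilities hypothetical):

```python
def viterbi(words, hmm):
    """Time-synchronous DP: d_i(t+1) = d_j(t) * p(si|sj) * p(w_t+1|si),
    with j the best predecessor argmax_k d_k(t) p(si|sk)."""
    states, v, p, a = hmm["states"], hmm["v"], hmm["p"], hmm["a"]
    d = {s: v[s] * a[s].get(words[0], 0.0) for s in states}  # initialize with w_1
    back = []
    for w in words[1:]:
        step, new_d = {}, {}
        for si in states:
            j = max(states, key=lambda k: d[k] * p[k][si])   # best predecessor of si
            step[si] = j
            new_d[si] = d[j] * p[j][si] * a[si].get(w, 0.0)
        back.append(step)
        d = new_d
    best = max(states, key=lambda s: d[s])                   # best final state
    tags = [best]
    for step in reversed(back):                              # follow back-pointers
        tags.append(step[tags[-1]])
    return list(reversed(tags))

print(viterbi(["dog", "walks"], hmm))  # ['N', 'V'] with the toy parameters
```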

Page 24:

POS tagging

Training the POS tagger
r_1(i) = prob of starting in state si
e_t(i,j) = expected number of transitions from state si to state sj at time t
r_t(i) = expected number of transitions out of state si at time t

v_i = r_1(i)  (initial state vector)
p_ij = Σ_t e_t(i,j) / Σ_t r_t(i)  (transition matrix)
a_ij = Σ_{t: wt = wj} r_t(i) / Σ_t r_t(i)  (observation matrix)

tagged corpus (supervised): frequency counts (Markov model, not HMM)
raw corpus (unsupervised): EM algorithm (Baum-Welch re-estimation)
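A minimal sketch of the supervised case, where the expected counts above reduce to plain frequency counts over a tagged corpus (input format and names are illustrative):

```python
from collections import Counter

def train_supervised(tagged_sentences):
    """Relative-frequency estimates of v, p, a from a tagged corpus.
    tagged_sentences: list of sentences, each a list of (word, tag)."""
    start, trans, emit = Counter(), Counter(), Counter()
    tag_total, trans_total = Counter(), Counter()
    for sent in tagged_sentences:
        start[sent[0][1]] += 1
        for w, t in sent:
            emit[(t, w)] += 1
            tag_total[t] += 1
        for (_, t1), (_, t2) in zip(sent, sent[1:]):
            trans[(t1, t2)] += 1
            trans_total[t1] += 1
    v = {t: c / len(tagged_sentences) for t, c in start.items()}
    p = {(t1, t2): c / trans_total[t1] for (t1, t2), c in trans.items()}
    a = {(t, w): c / tag_total[t] for (t, w), c in emit.items()}
    return v, p, a

v, p, a = train_supervised([[("the", "D"), ("dog", "N"), ("walks", "V")]])
print(p[("D", "N")], a[("N", "dog")])  # 1.0 1.0 on this one-sentence corpus
```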

Page 25:

POS tagging

Problems of HMM training (using the EM algorithm = MLE)
critical points (never move) -- use random noise
over-fitting to the training data
local maxima (EM finds only a local optimum)

Calculation of p(w1,n)
define a_i(t): prob of ending in state si having emitted w1,t-1 (forward variable)
define b_i(t): prob of seeing wt,n if the state is si at time t (backward variable)

a_j(t+1) = [Σ_i a_i(t) p(sj|si)] p(wt|sj)

p(w1,n) = Σ_i a_i(n+1)
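A minimal sketch of the forward recursion, again reusing the toy `hmm` dict from earlier (the indexing follows the usual "after emitting w1..wt" convention):

```python
def forward_prob(words, hmm):
    """p(w1,n) via forward variables: start with v_i * a_i(w1), then
    a_j(t+1) = [Σ_i a_i(t) p(sj|si)] * p(w_t+1|sj); finally sum over states."""
    states, v, p, a = hmm["states"], hmm["v"], hmm["p"], hmm["a"]
    alpha = {s: v[s] * a[s].get(words[0], 0.0) for s in states}
    for w in words[1:]:
        alpha = {sj: sum(alpha[si] * p[si][sj] for si in states) * a[sj].get(w, 0.0)
                 for sj in states}
    return sum(alpha.values())

print(forward_prob(["dog", "walks"], hmm))  # total prob of the signal sequence
```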

Page 26:

contents

- statistical vs. structured NLP
- statistics for computational linguistics
- POS tagging
- PCFG parsing
- other applications
- conclusion

Page 27:

PCFG parsing

P(w1,n) = Σ P(t1,n) : t1,n = parse trees covering w1 .. wn

p(t1,n) = Π p(rule) : using the sub-tree independence assumption

[tree diagram: nonterminal N^j_k,l dominating the span wk .. wl]

Page 28:

PCFG parsing

Why PCFG?
ordering of the parses (structural ambiguity)
accounts for grammar induction (with only positive examples)
compatible with lexical language modeling

the green banana
pcfg: n --> banana / n --> time / n --> number ...
trigram: the green (time/number/banana)

Fred watered his mother's small garden
trigram: p(garden | mother's small)
pcfg: p(x = garden | x head of DO of "to water")

Page 29:

PCFG parsing

PCFG vs. HMM (prob. regular grammar) ==> they assign probability differently
in pcfg: Σ_s p(s) = 1 (s: all sentences)
in hmm: Σ_w1,n p(w1,n) = 1 (w1,n: sentences of length n)

p1: Alice went to the ____.   p2: Alice went to the office.
in hmm p1 > p2; in pcfg p1 < p2

Page 30:

PCFG parsing

Finding p(w1,n)
b_j(k,l) = p(wk,l | N^j_k,l) : inside prob (similar to the backward prob)
a_j(k,l) = p(w1,k-1, N^j_k,l, wl+1,n) : outside prob (similar to the forward prob)

[tree diagram: root N^1 spanning w1 ... wk-1 wk ... wl wl+1 ... wn, with nonterminal N^j_k,l dominating wk .. wl]

Page 31:

PCFG parsing

Finding p(w1,n) using the inside probability
cf. finding the most likely parse for a sentence

b_1(1,n) = p(w1,n | N^1_1,n) = p(w1,n)

For Chomsky normal form:
b_j(k,k) = p(Nj --> wk)
b_j(k,l) = Σ_p,q,m p(Nj --> Np Nq) b_p(k,m) b_q(m+1,l)

[tree diagram: Nj --> Np Nq, with Np dominating wk .. wm and Nq dominating wm+1 .. wl]
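A minimal sketch of this recurrence as a CYK-style dynamic program (the grammar encoding and the toy grammar are illustrative):

```python
def inside_prob(words, binary_rules, lexical_rules, root="S"):
    """Inside probability for a CNF PCFG:
    b_j(k,k) = p(Nj -> wk)
    b_j(k,l) = Σ over rules (Nj -> Np Nq) and split points m of
               p(rule) * b_p(k,m) * b_q(m+1,l).
    binary_rules: {(J, P, Q): prob}; lexical_rules: {(J, word): prob}."""
    n = len(words)
    b = {}  # (label, k, l) -> inside prob, 0-indexed inclusive spans
    for k, w in enumerate(words):
        for (j, word), pr in lexical_rules.items():
            if word == w:
                b[(j, k, k)] = b.get((j, k, k), 0.0) + pr
    for span in range(2, n + 1):            # widen spans bottom-up
        for k in range(n - span + 1):
            l = k + span - 1
            for (j, np_, nq_), pr in binary_rules.items():
                for m in range(k, l):       # split point between Np and Nq
                    b[(j, k, l)] = (b.get((j, k, l), 0.0)
                                    + pr * b.get((np_, k, m), 0.0)
                                         * b.get((nq_, m + 1, l), 0.0))
    return b.get((root, 0, n - 1), 0.0)     # b_1(1,n) = p(w1,n)

binary = {("S", "NP", "VP"): 1.0, ("NP", "D", "N"): 1.0}
lexical = {("D", "the"): 1.0, ("N", "dog"): 1.0, ("VP", "barks"): 1.0}
print(inside_prob(["the", "dog", "barks"], binary, lexical))  # 1.0
```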

Page 32:

PCFG parsing

Training PCFG
inside-outside re-estimation (MLE training a la Baum-Welch)

For Chomsky normal form:
p_e(Ni --> s^j) = c(Ni --> s^j) / Σ_k c(Ni --> s^k)  (s^j: sentential form j)

c(Nj --> Np Nq) = 1/p(w1,n) Σ_k,l,m a_j(k,l) p(Nj --> Np Nq) b_p(k,m) b_q(m+1,l)

c(Ni --> wj) = 1/p(w1,n) Σ_k a_i(k,k) p(Ni --> wj, wj = wk)

[tree diagram: N^j_k,l --> Np Nq, with Np dominating wk .. wm and Nq dominating wm+1 .. wl]

Page 33:

contents

- statistical vs. structured NLP
- statistics for computational linguistics
- POS tagging
- PCFG parsing
- other applications
- conclusion

Page 34:

Other applications

Local syntactic disambiguation
PP attachment problems
She ate the soup with the spoon (n2)
p(A = n1 | prep, v, n1) > p(A = v | prep, v, n1)

relative-clause attachment
Fred awarded a prize to the dog and Bill who trained it

noun/noun, adjective/noun combination ambiguity
song bird feeder kit / metal bird feeder kit / novice bird feeder kit

Page 35:

Other applications

Word clustering
n-dim vector --> distance metric --> clustering algorithm
n-dim vector (features): collocations, verb-noun case roles
distance metrics: mutual information, relative entropy

WSD (word-sense disambiguation)
sense tagging for polysemous and homonymous words
contextual properties of a target word --> n-dim vector
supervised method
use a sense-tagged corpus (or a bilingual parallel corpus)
find argmax_s p(s) Π_x p(x | s), [x in c(wi): mutually independent]
(s: a sense, c(wi): local context of word wi)
unsupervised methods
dictionary-based, thesaurus-based, one-sense-per-collocation, local-syntax-based, sense discrimination (clustering)
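A minimal sketch of the supervised argmax above as a naive-Bayes classifier (the toy senses, contexts, and the add-one smoothing are illustrative choices):

```python
import math
from collections import Counter

def nb_wsd_train(examples):
    """argmax_s p(s) * Π_x p(x|s) over context features x, assumed
    mutually independent. examples: list of (sense, [context words])."""
    sense_count, feat_count = Counter(), Counter()
    vocab = set()
    for s, ctx in examples:
        sense_count[s] += 1
        for x in ctx:
            feat_count[(s, x)] += 1
            vocab.add(x)
    def classify(ctx):
        def log_p(s):
            total = sum(c for (ss, _), c in feat_count.items() if ss == s)
            prior = math.log(sense_count[s] / sum(sense_count.values()))
            # add-one smoothing so unseen context words don't zero out a sense
            return prior + sum(math.log((feat_count[(s, x)] + 1) / (total + len(vocab)))
                               for x in ctx)
        return max(sense_count, key=log_p)
    return classify

classify = nb_wsd_train([
    ("money", ["deposit", "loan", "cash"]),
    ("river", ["water", "shore", "boat"]),
])
print(classify(["loan", "water", "cash"]))  # 'money'
```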

Page 36:

Other applications

Text alignment
align lists of pairs (words, sentences, paragraphs) from two texts
solution: a subset of the Cartesian product

Statistical machine translation
assign a prob to every pair of sentences
find argmax_s p(s | t) = argmax_s p(s, t) = argmax_s p(s) p(t | s)  (s: source, t: target)

[noisy-channel diagram: source language model p(s) --> S --> translation model p(t|s) --> T --> decoder argmax_s p(s|t) --> S']
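A toy rendering of that argmax over an explicit candidate set (every name and probability here is hypothetical; a real decoder searches rather than enumerates):

```python
def decode(t, candidates, p_s, p_t_given_s):
    """argmax_s p(s|t) = argmax_s p(s) * p(t|s):
    language model score times translation model score."""
    return max(candidates,
               key=lambda s: p_s.get(s, 0.0) * p_t_given_s.get((t, s), 0.0))

p_s = {"the dog": 0.6, "dog the": 0.1}   # source language model
p_ts = {("le chien", "the dog"): 0.5, ("le chien", "dog the"): 0.5}
print(decode("le chien", ["the dog", "dog the"], p_s, p_ts))  # 'the dog'
```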

Page 37:

Conclusion: structure or statistics - Hybrid NLP

Statistical analysis: data driven, empirical, connectionist (the speech community)

Structural analysis: rule driven, rational, symbolic (NLU, Chomskian, Schankian, the AI community)

Hybrid: structural (linguistic theory; prior knowledge) + statistical preference