Natural Language Processing
Lecture 5: Language Models and Smoothing
Language Modeling
• Is this sentence good?
– This is a pen
– Pen this is a
• Help choose between options, help score options
– 他向记者介绍了发言的主要内容 (the Chinese source; candidate translations below)
– He briefed to reporters on the chief contents of the statement
– He briefed reporters on the chief contents of the statement
– He briefed to reporters on the main contents of the statement
– He briefed reporters on the main contents of the statement
One-Slide Review of Probability Terminology
• Random variables take different values, depending on chance.
• Notation:
– p(X = x) is the probability that r.v. X takes value x
– p(x) is shorthand for the same
– p(X) is the distribution over values X can take (a function)
• Joint probability: p(X = x, Y = y)
– Independence
– Chain rule
• Conditional probability: p(X = x | Y = y)
Unigram Model
• Every word in Σ is assigned some probability.
• Random variables W1, W2, ... (one per word).
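As a minimal sketch (helper names and the toy corpus are invented, not from the slides), a unigram model is trained by counting and scores a sentence as a product of independent per-word probabilities:

```python
# A minimal sketch (hypothetical names, toy corpus): train by counting,
# score a sentence as the product of per-word probabilities.
import math
from collections import Counter

def train_unigram(corpus):
    """Relative-frequency estimate: p(w) = count(w) / total tokens."""
    counts = Counter(w for sent in corpus for w in sent)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def unigram_prob(model, sentence):
    """p(W1..Wn) = product of p(Wi); every word is independent."""
    p = 1.0
    for w in sentence:
        p *= model.get(w, 0.0)  # unseen words get probability 0 (see smoothing later)
    return p

model = train_unigram([["this", "is", "a", "pen"], ["this", "is", "good"]])
# A unigram model cannot tell "This is a pen" from "Pen this is a":
assert math.isclose(unigram_prob(model, ["this", "is", "a", "pen"]),
                    unigram_prob(model, ["pen", "this", "is", "a"]))
```

Word order never enters the score, which is exactly why a unigram model cannot prefer the grammatical sentence from the opening slide over its scrambled version.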
Part of a Unigram Distribution
[rank 1]
p(the) = 0.038
p(of) = 0.023
p(and) = 0.021
p(to) = 0.017
p(is) = 0.013
p(a) = 0.012
p(in) = 0.012
p(for) = 0.009
...
[rank 1001]
p(joint) = 0.00014
p(relatively) = 0.00014
p(plot) = 0.00014
p(DEL1SUBSEQ) = 0.00014
p(rule) = 0.00014
p(62.0) = 0.00014
p(9.1) = 0.00014
p(evaluated) = 0.00014
...
Unigram Model as a Generator
first, from less the This different 2004), out which goal 19.2 Model their It ~(i?1), given 0.62 these (x0; match 1 schedules. x 60 1998. under by Notices of stated CFG 120 be 100 a location accuracy If models note 21.8 each 0 WP that the that Novák. to functions; to [0, to different values, model 65 cases. said - 24.94 sentences not that 2 In to clustering each K&M 100 Boldface X))] applied; In 104 S. grammar as (Section contrastive thesis, the machines table -5.66 trials: An the textual (family applications. We have for models 40.1 no 156 expected are neighborhood
Full History Model
• Every word in Σ is assigned some probability, conditioned on every history.
Bill Clinton's unusually direct comment Wednesday on the possible role of race in the election was in keeping with the Clintons' bid to portray Obama, who is aiming to become the first black U.S. president, as the clear favorite, thereby lessening the potential fallout if Hillary Clinton does not win in South Carolina.
N-Gram Model
• Every word in Σ is assigned some probability, conditioned on a fixed-length history (n – 1 words).
Bigram Model as a Generator
e. (A.33) (A.34) A.5 Models are also been completely surpassed in performance on drafts of online algorithms can achieve far more so while substantially improved using CE. 4.4.1 MLE as a Case of CE 71 26.34 23.1 57.8 K&M 42.4 62.7 40.9 44 43 90.7 100.0 100.0 100.0 15.1 30.9 18.0 21.2 60.1 undirected evaluations directed DEL1 TRANS1 neighborhood. This continues, with supervised init., semisupervised MLE with the METU-Sabancı Treebank 195 ADJA ADJD ADV APPR APPRART APPO APZR ART CARD FM ITJ KOUI KOUS KON KOKOM NN NN NN IN JJ NN Their problem is y x. The evaluation offers the hypothesized link grammar with a Gaussian
Trigram Model as a Generator
top(xI ,right,B). (A.39) vine0(X, I) rconstit0(I 1, I). (A.40) vine(n). (A.41) These equations were presented in both cases; these scores into a probability distribution is even smaller (r = 0.05). This is exactly EM. During DA, is gradually relaxed. This approach could be efficiently used in previous chapters) before training (test) K&M Zero Local random models Figure 4.12: Directed accuracy on all six languages. Importantly, these papers achieved state-of-the-art results on their tasks and unlabeled data and the verbs are allowed (for instance) to select the cardinality of discrete structures, like matchings on weighted graphs (McDonald et al., 1993) (35 tag types, 3.39 bits). The Bulgarian,
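The samples on the last few slides were produced by repeatedly drawing the next word from the model's conditional distribution. A hedged sketch of such a generator for the bigram case (function names and the toy corpus are invented for illustration):

```python
# Sketch of an n-gram text generator (bigram case). Sentence padding with
# <s>/</s> guarantees every observed word has an observed continuation.
import random
from collections import Counter, defaultdict

def train_bigram(corpus):
    """counts[prev][w] = how often w followed prev (with sentence padding)."""
    counts = defaultdict(Counter)
    for sent in corpus:
        for prev, w in zip(["<s>"] + sent, sent + ["</s>"]):
            counts[prev][w] += 1
    return counts

def generate(counts, max_len=20, seed=None):
    """Sample words one at a time from p(w | previous word)."""
    rng = random.Random(seed)
    word, out = "<s>", []
    while len(out) < max_len:
        nxt = counts[word]
        word = rng.choices(list(nxt), weights=list(nxt.values()))[0]
        if word == "</s>":
            break
        out.append(word)
    return out

corpus = [["he", "briefed", "reporters"], ["he", "briefed", "them"]]
print(" ".join(generate(train_bigram(corpus), seed=0)))
```

With a real corpus the output looks like the bigram sample above: locally plausible word pairs, little global coherence.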
What’s in a word?
• Is punctuation a word?
– Does knowing the last “word” is a “,” help?
• In speech:
– I do uh main- mainly business processing
– Is “uh” a word?
For Thought
• Do N-Gram models “know” English?
• Unknown words
• N-gram models and finite-state automata
Starting and Stopping
Unigram model:
...
Bigram model:
...
Trigram model:
...
Evaluation
Which model is better?
• Can I get a number about how good my model is for a test set?
• What is P(test set | model)?
• We measure this by perplexity
• Perplexity is the inverse probability of the test set, normalized by the number of words
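Concretely, perplexity is P(test set) raised to the power −1/N, where N is the number of words; equivalently, 2 raised to the average negative log2 probability per word. A minimal sketch (the helper name is invented), computed in log space to avoid underflow:

```python
# Perplexity = P(test)^(-1/N) = 2^(average negative log2 probability per word).
import math

def perplexity(log2_probs):
    """log2_probs: log2 p(w_i | context) for every word of the test set."""
    avg_neg_ll = -sum(log2_probs) / len(log2_probs)
    return 2 ** avg_neg_ll

# A model that gives every word probability 1/4 has perplexity 4 --
# the "average branching factor" reading of perplexity:
assert perplexity([math.log2(0.25)] * 100) == 4.0
```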
Perplexity
Perplexity of different models
• Better models have lower perplexity
– WSJ: Unigram 962; Bigram 170; Trigram 109
• Different tasks have different perplexity
– WSJ (109) vs. Bus Information Queries (~25)
• The higher the conditional probability, the lower the perplexity
• Perplexity is the average branching factor
What about open-class words?
• What is the probability of unseen words?
– (Naïve answer is 0.0)
• But that’s not what you want
– Test set will usually include words not in training
• What is the probability of
– P(Nebuchadnezzur | son of)
LM smoothing
• Laplace or add-one smoothing
– Add one to all counts
– Or add “epsilon” to all counts
– You still need to know all your vocabulary
• Have an OOV word in your vocabulary
– The probability of seeing an unseen word
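A hedged sketch of add-λ smoothing for unigrams (λ = 1 gives Laplace/add-one), with an explicit OOV entry; the function name and toy data are invented:

```python
# Add-lambda smoothing: p(w) = (count(w) + lam) / (total + lam * V).
# An <OOV> entry reserves probability mass for unseen words -- but note
# you still need to fix the vocabulary (and hence V) in advance.
from collections import Counter

def add_lambda_unigram(corpus, lam=1.0):
    counts = Counter(w for sent in corpus for w in sent)
    counts["<OOV>"] += 0                 # explicit out-of-vocabulary entry
    total = sum(counts.values())
    V = len(counts)
    def p(w):
        key = w if w in counts else "<OOV>"
        return (counts[key] + lam) / (total + lam * V)
    return p

p = add_lambda_unigram([["a", "a", "b"]])
assert p("zzz") > 0                      # unseen words now get probability mass
assert abs(p("a") - (2 + 1) / (3 + 3)) < 1e-12
```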
Good-Turing Smoothing
• Good (1953), from Turing.
– Use the count of things you’ve seen once to estimate the count of things you’ve never seen.
• Calculate the frequency of frequencies of N-grams
– Count of N-grams that appear 1 time
– Count of N-grams that appear 2 times
– Count of N-grams that appear 3 times
– …
– Estimate: c* = (c + 1) · N_{c+1} / N_c
• Change the counts a little so we get a better estimate for count 0
Good-Turing’s Discounted Counts

     AP Newswire bigrams           Berkeley Restaurants bigrams    Smith Thesis bigrams
c    N_c             c*            N_c         c*                  N_c      c*
0    74,671,100,000  0.0000270     2,081,496   0.002553            x        38,048 / x
1    2,018,046       0.446         5,315       0.533960            38,048   0.21147
2    449,721         1.26          1,419       1.357294            4,032    1.05071
3    188,933         2.24          642         2.373832            1,409    2.12633
4    105,668         3.24          381         4.081365            749      2.63685
5    68,379          4.22          311         3.781350            395      3.91899
6    48,190          5.19          196         4.500000            258      4.42248
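The adjusted counts in the table can be reproduced directly from c* = (c + 1) · N_{c+1} / N_c. A small check against the AP Newswire column (the frequency-of-frequencies dictionary below just restates the table; the function name is invented):

```python
# Good-Turing adjusted counts: c* = (c + 1) * N_{c+1} / N_c.
def good_turing_counts(freq_of_freqs, max_c):
    """freq_of_freqs: {c: N_c}. Returns {c: c*} for c = 0..max_c."""
    return {c: (c + 1) * freq_of_freqs.get(c + 1, 0) / freq_of_freqs[c]
            for c in range(max_c + 1)}

ap = {0: 74_671_100_000, 1: 2_018_046, 2: 449_721, 3: 188_933,
      4: 105_668, 5: 68_379, 6: 48_190}
cstar = good_turing_counts(ap, 5)
assert round(cstar[1], 3) == 0.446   # matches the c = 1 row of the table
```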
Backoff
• If no trigram, use bigram
• If no bigram, use unigram
• If no unigram … smooth the unigrams
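A hedged sketch of this trigram → bigram → unigram fallback in the "stupid" backoff style (a fixed penalty alpha at each step instead of proper discounting; all counts and names here are toy examples):

```python
# Stupid backoff: use the highest-order estimate available, multiplying by a
# fixed penalty alpha for each level we had to back off.
def stupid_backoff(trigrams, bigrams, unigrams, total, alpha=0.4):
    def score(w, u=None, v=None):
        if u is not None and v is not None and (u, v, w) in trigrams:
            return trigrams[(u, v, w)] / bigrams[(u, v)]        # trigram seen
        if v is not None and (v, w) in bigrams:
            return alpha * bigrams[(v, w)] / unigrams[v]        # back off once
        return alpha * alpha * unigrams.get(w, 0) / total       # back off twice
    return score

uni = {"a": 2, "b": 2, "c": 1}
bi = {("a", "b"): 2, ("b", "c"): 1}
tri = {("a", "b", "c"): 1}
s = stupid_backoff(tri, bi, uni, total=5)
assert s("c", "a", "b") == 1 / 2          # trigram seen: use it directly
assert s("c", None, "b") == 0.4 * 1 / 2   # no trigram: penalized bigram
```

Note that these scores are not normalized probabilities (hence "stupid"); proper schemes like Katz backoff redistribute discounted mass instead.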
Estimating p(w | history)
• Relative frequencies (count & normalize)
• Transform the counts:
– Laplace / “add one” / “add λ”
– Good-Turing discounting
• Interpolate or “back off”:
– With Good-Turing discounting: Katz backoff
– “Stupid” backoff
– Absolute discounting: Kneser-Ney
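Interpolation, the main alternative to backoff above, deserves a one-line sketch: instead of falling back only when a count is missing, mix all orders with weights λ that sum to 1 (the weights here are arbitrary; in practice they are tuned on held-out data):

```python
# Linear interpolation of trigram, bigram, and unigram estimates.
def interpolate(p3, p2, p1, lambdas=(0.6, 0.3, 0.1)):
    l3, l2, l1 = lambdas
    assert abs(l3 + l2 + l1 - 1.0) < 1e-9   # weights must sum to 1
    return l3 * p3 + l2 * p2 + l1 * p1

# Unlike backoff, every order contributes even when the trigram was seen:
assert interpolate(0.5, 0.2, 0.1) == 0.6 * 0.5 + 0.3 * 0.2 + 0.1 * 0.1
```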