Natural Language Processing Lecture 5: Language Models and Smoothing
demo.clab.cs.cmu.edu/NLP/S19/files/slides/05-lm.pdf

Page 1:

Natural Language Processing

Lecture 5: Language Models and Smoothing

Page 2:

Language Modeling

• Is this sentence good?

– This is a pen

– Pen this is a

• Help choose between options, help score options

– 他向记者介绍了发言的主要内容 (source; literally “He briefed reporters on the main contents of the statement”)

– He briefed to reporters on the chief contents of the statement

– He briefed reporters on the chief contents of the statement

– He briefed to reporters on the main contents of the statement

– He briefed reporters on the main contents of the statement

Page 3:

One-Slide Review of Probability Terminology

• Random variables take different values, depending on chance.

• Notation: p(X = x) is the probability that r.v. X takes value x; p(x) is shorthand for the same; p(X) is the distribution over values X can take (a function)

• Joint probability: p(X = x, Y = y)

– Independence (stated below)

– Chain rule

• Conditional probability: p(X = x | Y = y)
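
The bullets above name independence and the chain rule without stating them. Written out (standard definitions; the slide's own formulas were images and are not reproduced here):

\begin{align*}
\text{Chain rule:} \quad & p(X = x, Y = y) = p(X = x)\,p(Y = y \mid X = x) \\
\text{Independence:} \quad & p(X = x, Y = y) = p(X = x)\,p(Y = y) \quad \text{for all } x, y
\end{align*}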

Page 4:

Unigram Model

• Every word in Σ is assigned some probability.

• Random variables W1, W2, ... (one per word).
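
The slide's formula was an image. In standard form, the unigram model treats every word as an independent draw from the same distribution:

p(W_1 = w_1, \ldots, W_n = w_n) = \prod_{i=1}^{n} p(w_i)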

Page 5:

Part of a Unigram Distribution

[rank 1]
p(the) = 0.038
p(of) = 0.023
p(and) = 0.021
p(to) = 0.017
p(is) = 0.013
p(a) = 0.012
p(in) = 0.012
p(for) = 0.009
...

…[rank 1001]
p(joint) = 0.00014
p(relatively) = 0.00014
p(plot) = 0.00014
p(DEL1SUBSEQ) = 0.00014
p(rule) = 0.00014
p(62.0) = 0.00014
p(9.1) = 0.00014
p(evaluated) = 0.00014
...

Page 6:

Unigram Model as a Generator

first, from less the This different 2004), out which goal 19.2 Model their It ~(i?1), given 0.62 these (x0; match 1 schedule. x 60 1998. under by Notice e of stated CFG 120 be 100 a location accuracy If models note 21.8 each 0 WP that the that Nov?ak. to functions; to [0, to different values, model 65 cases. said - 24.94 sentences not that 2 In to clustering each K&M 100 Boldface X))] applied; In 104 S. grammar was (Section contrastive thesis, the machines table -5.66 trials: An the textual (family applications. We have for models 40.1 no 156 expected are neighborhood

Page 7:

Full History Model

• Every word in Σ is assigned some probability, conditioned on every history.
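
In standard form (the slide's formula was an image), the full history model is the exact chain-rule decomposition:

p(w_1, \ldots, w_n) = \prod_{i=1}^{n} p(w_i \mid w_1, \ldots, w_{i-1})

Because almost every long history occurs at most once in training data, the estimated conditionals simply memorize the training text, which is why the sample on the next slide is a verbatim news sentence.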

Page 8:

Bill Clinton's unusually direct comment Wednesday on the possible role of race in the election was in keeping with the Clintons' bid to portray Obama, who is aiming to become the first black U.S. president, as the clear favorite, thereby lessening the potential fallout if Hillary Clinton does not win in South Carolina.

Page 9:

N-Gram Model

• Every word in Σ is assigned some probability, conditioned on a fixed-length history of n – 1 words.
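
In standard form, the n-gram assumption truncates each history to its last n – 1 words:

p(w_i \mid w_1, \ldots, w_{i-1}) \approx p(w_i \mid w_{i-n+1}, \ldots, w_{i-1})

For a bigram model n = 2 (condition on one previous word); for a trigram model n = 3.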

Page 10:

Bigram Model as a Generator

e. (A.33) (A.34) A.5 Models are also been completely surpassed in performance on drafts of online algorithms can achieve far more so while substantially improved using CE. 4.4.1 MLE as a Case of CE 71 26.34 23.1 57.8 K&M 42.4 62.7 40.9 44 43 90.7 100.0 100.0 100.0 15.1 30.9 18.0 21.2 60.1 undirected evaluations directed DEL1 TRANS1 neighborhood. This continues, with supervised init., semisupervised MLE with the METU-Sabanci Treebank 195 ADJA ADJD ADV APPR APPRART APPO APZR ART CARD FM ITJ KOUI KOUS KON KOKOM NN NN NN IN JJ NN Their problem is y x. The evaluation offers the hypothesized link grammar with a Gaussian

Page 11:

Trigram Model as a Generator

top(xI ,right,B). (A.39) vines0(X, I) rconstit0(I 1, I). (A.40) vines(n). (A.41) These equations were presented in both cases; these scores u<AC>into a probability distribution is even smaller (r = 0.005). This is exactly rEM. During DA, is gradually relaxed. This approach could be efficiently used in previous chapters) before training (test) K&M Zero Local random models Figure 4.12: Directed accuracy on all six languages. Importantly, these papers achieved state-of-the-art results on their tasks and unlabeled data and the verbs are allowed (for instance) to select the cardinality of discrete structures, like matchings on weighted graphs (McDonald et al., 1993) (35 tag types, 3.39 bits). The Bulgarian,

Page 12:

What’s in a word

• Is punctuation a word?

– Does knowing the last “word” is a “,” help?

• In speech

– I do uh main- mainly business processing

– Is “uh” a word?

Page 13:

For Thought

• Do N-Gram models “know” English?

• Unknown words

• N-gram models and finite-state automata

Page 14:

Starting and Stopping

Unigram model:

...

Bigram model:

...

Trigram model:

...
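
The three formulas above were images. A standard reconstruction (the usual start/stop convention, not necessarily the slide's exact notation) pads each sentence with start symbols and ends it with a stop symbol so that probabilities over sentences of all lengths sum to 1:

\begin{align*}
\text{Unigram:} \quad & p(x_1 \ldots x_n) = \prod_{i=1}^{n+1} p(x_i), && x_{n+1} = \textsc{stop} \\
\text{Bigram:} \quad & p(x_1 \ldots x_n) = \prod_{i=1}^{n+1} p(x_i \mid x_{i-1}), && x_0 = \textsc{start} \\
\text{Trigram:} \quad & p(x_1 \ldots x_n) = \prod_{i=1}^{n+1} p(x_i \mid x_{i-2}, x_{i-1}), && x_{-1} = x_0 = \textsc{start}
\end{align*}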

Page 15:

Evaluation

Page 16:

Which model is better?

• Can I get a single number for how good my model is on a test set?

• What is P(test set | model)?

• We measure this with perplexity

• Perplexity is the inverse probability of the test set, normalized by the number of words

Page 17:

Perplexity
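
The formula on this slide was an image. The standard definition, matching "inverse probability of the test set, normalized by the number of words" on the previous slide:

\mathrm{PP}(w_1 \ldots w_N) = p(w_1 \ldots w_N)^{-1/N} = 2^{-\frac{1}{N}\log_2 p(w_1 \ldots w_N)}

Minimizing perplexity is equivalent to maximizing the probability the model assigns to the test set.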

Page 18:

Perplexity of different models

• Better models have lower perplexity

– WSJ: Unigram 962; Bigram 170; Trigram 109

• Different tasks have different perplexity

– WSJ (109) vs. Bus Information Queries (~25)

• The higher the conditional probability, the lower the perplexity

• Perplexity is the average branching factor

Page 19:

What about open-class words?

• What is the probability of unseen words?

– (Naïve answer is 0.0)

• But that’s not what you want

– The test set will usually include words not seen in training

• What is P(Nebuchadnezzar | son of)?

Page 20:

LM smoothing

• Laplace or add-one smoothing (a sketch follows this list)

– Add one to all counts

– Or add “epsilon” to all counts

– You still need to know all your vocabulary

• Have an OOV word in your vocabulary

– The probability of seeing an unseen word
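
A minimal sketch of add-one/add-k smoothing for a bigram model, with an <OOV> type reserving mass for unseen words. The function and variable names here are illustrative, not from the slides:

from collections import Counter

def add_k_bigram_prob(bigrams, unigrams, vocab_size, w1, w2, k=1.0):
    # p(w2 | w1) with k added to every count; the denominator grows by
    # k * |V| so the smoothed distribution still sums to 1.
    return (bigrams[(w1, w2)] + k) / (unigrams[w1] + k * vocab_size)

# Toy usage: the <OOV> type is the "OOV word in your vocabulary" above.
train = "<s> this is a pen <s> this is a cat".split()
unigrams = Counter(train)
bigrams = Counter(zip(train, train[1:]))
V = len(unigrams) + 1                                          # +1 for <OOV>
print(add_k_bigram_prob(bigrams, unigrams, V, "is", "a"))      # seen bigram
print(add_k_bigram_prob(bigrams, unigrams, V, "is", "<OOV>"))  # unseen word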

Page 21:

Good-Turing Smoothing

• Good (1953), from Turing

– Use the count of things you’ve seen once to estimate the count of things you’ve never seen

• Calculate the frequency of frequencies of N-grams (sketched in code after this list)

– Count of N-grams that appear 1 time

– Count of N-grams that appear 2 times

– Count of N-grams that appear 3 times

– …

– Estimate new count: c* = (c + 1) N_{c+1} / N_c

• Change the counts a little so we get a better estimate for count 0
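
A minimal sketch of the count transformation just described; names are mine. It maps each observed count c to c* = (c + 1) N_{c+1} / N_c and reserves N_1 / N of the probability mass for unseen n-grams:

from collections import Counter

def good_turing_discounts(ngram_counts):
    # N_c = number of distinct n-grams observed exactly c times.
    freq_of_freq = Counter(ngram_counts.values())
    # Discounted count: c* = (c + 1) * N_{c+1} / N_c for each observed c.
    discounted = {c: (c + 1) * freq_of_freq.get(c + 1, 0) / n_c
                  for c, n_c in freq_of_freq.items()}
    # Total probability mass reserved for unseen n-grams: N_1 / N.
    p_unseen = freq_of_freq.get(1, 0) / sum(ngram_counts.values())
    return discounted, p_unseen

counts = Counter({("this", "is"): 3, ("is", "a"): 2,
                  ("a", "pen"): 1, ("a", "cat"): 1})
print(good_turing_discounts(counts))
# Caveat: N_{c+1} is often 0 for large c, so practical implementations
# first smooth the N_c curve (e.g., Gale & Sampson's simple Good-Turing).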

Page 22:

Good-Turing’s Discounted Counts

        AP Newswire bigrams         Berkeley Restaurants bigrams   Smith Thesis bigrams
c       Nc               c*         Nc            c*               Nc         c*
0       74,671,100,000   0.0000270  2,081,496     0.002553         x          38,048 / x
1       2,018,046        0.446      5,315         0.533960         38,048     0.21147
2       449,721          1.26       1,419         1.357294         4,023      1.05071
3       188,933          2.24       642           2.373832         1,409      2.12633
4       105,668          3.24       381           4.081365         749        2.63685
5       68,379           4.22       311           3.781350         395        3.91899
6       48,190           5.19       196           4.500000         258        4.42248

Page 23:

Backoff

• If no trigram, use bigram

• If no bigram, use unigram

• If no unigram … smooth the unigrams (the cascade is sketched below)
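
A sketch of the cascade above for a trigram model, using plain relative frequencies; names are mine. Proper Katz backoff additionally discounts the higher-order estimate and scales the lower-order one so probabilities still sum to 1:

def backoff_prob(w3, w2, w1, tri, bi, uni, total):
    # p(w3 | w1, w2): fall back to shorter histories when counts are zero.
    if tri.get((w1, w2, w3), 0) > 0:
        return tri[(w1, w2, w3)] / bi[(w1, w2)]   # trigram relative frequency
    if bi.get((w2, w3), 0) > 0:
        return bi[(w2, w3)] / uni[w2]             # no trigram: use bigram
    # No bigram either: use the unigram (0 here means "smooth the unigrams").
    return uni.get(w3, 0) / total

tri = {("this", "is", "a"): 2}
bi = {("this", "is"): 2, ("is", "a"): 2}
uni = {"this": 2, "is": 2, "a": 2}
print(backoff_prob("a", "is", "this", tri, bi, uni, total=6))    # from trigram
print(backoff_prob("pen", "is", "this", tri, bi, uni, total=6))  # falls through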

Page 24:

Estimating p(w | history)

• Relative frequencies (count and normalize)

• Transform the counts:

– Laplace/“add one”/“add λ”

– Good-Turing discounting

• Interpolate or “backoff” (interpolation sketched below):

– With Good-Turing discounting: Katz backoff

– “Stupid” backoff

– Absolute discounting: Kneser-Ney
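
For contrast with backoff, a sketch of simple linear interpolation (the "interpolate" option above): always mix all three orders. The lambda weights here are placeholders; in practice they are tuned on held-out data and must sum to 1. Katz and Kneser-Ney define the mixing differently; this only illustrates the interpolation idea:

def interpolated_prob(w3, w2, w1, tri, bi, uni, total, lambdas=(0.6, 0.3, 0.1)):
    # Weighted mixture of trigram, bigram, and unigram relative frequencies.
    l3, l2, l1 = lambdas
    p3 = tri.get((w1, w2, w3), 0) / bi[(w1, w2)] if bi.get((w1, w2)) else 0.0
    p2 = bi.get((w2, w3), 0) / uni[w2] if uni.get(w2) else 0.0
    p1 = uni.get(w3, 0) / total
    return l3 * p3 + l2 * p2 + l1 * p1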