
Corpora and Statistical Methods – Lecture 7



Albert Gatt
Corpora and Statistical Methods – Lecture 7
Smoothing (aka discounting) techniques
Part 2

Overview
Smoothing methods:
- Simple smoothing
- Witten-Bell & Good-Turing estimation
- Held-out estimation and cross-validation

- Combining several n-gram models: back-off models

Rationale behind smoothing
Sample frequencies give us:
- seen events, with probability P
- unseen events (including grammatical zeroes), with probability 0

Real population frequencies cover all seen events, including those unseen in our sample. We approximate them by adding smoothing to the sample frequencies, which results in:
- lower probabilities for seen events (discounting)
- the left-over probability mass distributed over the unseens (smoothing)
Examples: Laplace's law, Lidstone's law and the Jeffreys-Perks law.

Example: instances in the training corpus of "inferior to ________"

[Figures: F(w), the frequency distribution over words following "inferior to"]

Maximum Likelihood Estimate
- Unknowns are assigned 0% probability mass.

Actual probability distribution
- These unseen words have non-zero probabilities in the real distribution.

Laplace's Law (add-one smoothing)
- NB: this method ends up assigning most of the probability mass to unseens.

Generalisation: Lidstone's Law

P = (C(x) + λ) / (N + λV)

where:
- P = probability of a specific n-gram
- C(x) = count of n-gram x in the training data
- N = total n-grams in the training data
- V = number of bins (possible n-grams)
- λ = a small positive number

M.L.E.: λ = 0
Laplace's Law: λ = 1 (add-one smoothing)
Jeffreys-Perks Law: λ = 0.5
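To make the role of λ concrete, here is a minimal Python sketch (not from the lecture; the toy counts and the vocabulary size are invented for illustration) computing Lidstone estimates for the three settings above:

```python
from collections import Counter

def lidstone_prob(counts, x, N, V, lam):
    """Lidstone estimate: P(x) = (C(x) + lambda) / (N + lambda * V)."""
    return (counts[x] + lam) / (N + lam * V)

# Toy unigram counts: 10 tokens over a vocabulary of 6 types ("dog" is unseen).
counts = Counter({"the": 4, "cat": 2, "sat": 2, "on": 1, "mat": 1})
N = sum(counts.values())   # total n-gram tokens in the training data
V = 6                      # number of bins (possible types)

for lam, name in [(0, "M.L.E."), (1, "Laplace"), (0.5, "Jeffreys-Perks")]:
    print(name,
          "P(the) =", round(lidstone_prob(counts, "the", N, V, lam), 3),
          "P(dog) =", round(lidstone_prob(counts, "dog", N, V, lam), 3))
```

With λ = 0 the unseen word gets probability 0; any λ > 0 reserves some mass for it, at the cost of the seen words.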

Objections to Lidstone's Law
- Need an a priori way to determine λ.

- Predicts all unseen events to be equally likely.

- Gives probability estimates linear in the M.L.E. frequency.

Witten-Bell discounting
Main intuition:
- A zero-frequency event can be thought of as an event which hasn't happened (yet).
- The probability of it happening can be estimated from the probability of something happening for the first time.

- The count of things which are seen only once can be used to estimate the count of things that are never seen.

Witten-Bell method
T = no. of times we saw an event for the first time = no. of different n-gram types (bins) attested
NB: T is the no. of types actually attested (unlike V, the no. of possible types in add-one smoothing).

Estimate the total probability mass of unseen n-grams:

    Σ p(unseen) = T / (N + T)

where the denominator is the no. of actual n-gram tokens (N) plus the no. of actual types (T).
- Each token is an event and each new type is an event, so the equation above gives the MLE of the probability of a new-type event occurring (being seen for the first time).
- This is the total probability mass to be distributed among all zero events (unseens).

Witten-Bell method
Divide the total probability mass among all the zero n-grams. We can distribute it equally:

    p(unseen n-gram) = T / (Z(N + T)),  where Z = no. of n-gram types with zero count

Remove this probability mass from the non-zero n-grams (discounting):

    p(seen n-gram with count c) = c / (N + T)
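A minimal Python sketch of the unigram case (not from the lecture; the toy vocabulary and counts are invented) showing how the seen counts are discounted and the reserved mass T/(N+T) is shared equally among the unseen types:

```python
from collections import Counter

def witten_bell_unigram(counts, vocab):
    """Witten-Bell smoothed unigram probabilities -- a sketch.

    counts : Counter of observed token frequencies
    vocab  : full set of possible types (so we know which ones have zero counts)
    """
    N = sum(counts.values())                 # no. of observed tokens
    T = len(counts)                          # no. of types actually attested
    Z = sum(1 for w in vocab if counts[w] == 0)   # no. of unseen types

    probs = {}
    for w in vocab:
        if counts[w] > 0:
            probs[w] = counts[w] / (N + T)   # discounted probability for seen types
        else:
            # the reserved mass T/(N+T), shared equally among the Z unseen types
            probs[w] = T / (Z * (N + T)) if Z else 0.0
    return probs

# Toy data: 10 tokens over 5 attested types; "dog" is in the vocabulary but unseen.
vocab = {"the", "cat", "sat", "on", "mat", "dog"}
counts = Counter({"the": 4, "cat": 2, "sat": 2, "on": 1, "mat": 1})
probs = witten_bell_unigram(counts, vocab)
print(probs["the"], probs["dog"], sum(probs.values()))   # last value is ~1.0
```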

Witten-Bell vs. Add-one
If we work with unigrams, Witten-Bell and add-one smoothing give very similar results.

The difference is with n-grams for n>1.

Main idea: estimate the probability of an unseen bigram from the probability of seeing a bigram starting with w1 for the first time.

Witten-Bell with bigrams
Generalised total probability mass estimate:

    Σ p(unseen bigram starting with wx) = T(wx) / (N(wx) + T(wx))

where:
- T(wx) = no. of bigram types beginning with wx
- N(wx) = no. of bigram tokens beginning with wx
This gives the estimated total probability of unseen bigrams starting with some word wx.

Witten-Bell with bigrams
Non-zero bigrams get discounted as before, but again conditioning on history:

    p(wi | wx) = c(wx wi) / (N(wx) + T(wx))   for c(wx wi) > 0

Note: Witten-Bell won't assign the same probability mass to all unseen n-grams. The amount assigned will depend on the first word in the bigram (the first n-1 words in the n-gram).
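As a rough sketch of the bigram case (not from the lecture; the toy bigram counts and vocabulary are invented), the same scheme conditioned on the history wx might look like this:

```python
from collections import Counter, defaultdict

def witten_bell_bigram(bigram_counts, vocab):
    """Witten-Bell smoothed bigram probabilities P(wi | wx) -- a sketch.

    bigram_counts : Counter mapping (wx, wi) pairs to their frequencies
    vocab         : set of word types
    """
    # Per-history statistics: N(wx) = bigram tokens, T(wx) = attested continuation types
    N = defaultdict(int)
    T = defaultdict(int)
    for (wx, _wi), c in bigram_counts.items():
        N[wx] += c
        T[wx] += 1

    def prob(wi, wx):
        denom = N[wx] + T[wx]
        if denom == 0:
            return 0.0                       # history wx never seen at all
        c = bigram_counts[(wx, wi)]
        if c > 0:
            return c / denom                 # discounted seen bigram
        Z = len(vocab) - T[wx]               # no. of unseen continuations of wx
        return T[wx] / (Z * denom)           # equal share of the reserved mass

    return prob

# Toy counts; "none" has never followed "to", so it gets a share of the unseen mass.
vocab = {"inferior", "to", "the", "a", "none"}
bigram_counts = Counter({("inferior", "to"): 3, ("to", "the"): 2, ("to", "a"): 1})
p = witten_bell_bigram(bigram_counts, vocab)
print(p("the", "to"), p("none", "to"))
```

Because T(wx) and N(wx) differ from one history to the next, the mass reserved for unseen continuations also differs per wx, which is exactly the point made in the note above.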