DESCRIPTION
The first talk on the topic of my bachelor thesis with a focus on Kneser-Ney smoothing.
Web Science & Technologies
University of Koblenz ▪ Landau, Germany
Introduction to Kneser-Ney Smoothing on Top of Generalized Language Models for Next Word Prediction
Martin Körner
Oberseminar
25.07.2013
Content
Introduction
Language Models
Generalized Language Models
Smoothing
Progress
Summary
Introduction: Motivation
Next word prediction: What is the next word a user will type?
Use cases for next word prediction:
Augmentative and Alternative Communication (AAC)
Small keyboards (smartphones)
Introduction to next word prediction
How do we predict words?
1. Rationalist approach
• Manually encoding information about language
• “Toy” problems only
2. Empiricist approach
• Statistical, pattern recognition, and machine learning methods applied to corpora
• Result: Language models
Language models in general
Language model: How likely is a sentence $s$?
Probability distribution: $P(s)$
Calculate $P(s)$ by multiplying conditional probabilities
Example:
P(If you're going to San Francisco , be sure …) = P(you're | If) · P(going | If you're) · P(to | If you're going) · P(San | If you're going to) · P(Francisco | If you're going to San) · ⋯
Estimating these conditional probabilities with their full histories empirically would fail: most long word sequences never appear in a corpus.
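To make the chain rule concrete, here is a minimal Python sketch (not the thesis implementation): it multiplies the conditional probability of each word given its history, with `cond_prob` as a hypothetical placeholder for any concrete estimator.

```python
# Chain rule: P(s) is the product of P(word | all preceding words).
# `cond_prob` is a hypothetical placeholder for a concrete estimator.
def sentence_probability(words, cond_prob):
    prob = 1.0
    for i, word in enumerate(words):
        prob *= cond_prob(word, words[:i])
    return prob

# Toy estimator that assigns 0.1 to every word, regardless of history:
toy = lambda word, history: 0.1
sentence = "If you're going to San Francisco , be sure".split()
print(sentence_probability(sentence, toy))  # 0.1 ** 9, roughly 1e-09
```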
Conditional probabilities simplified
Markov assumption [JM80]:
Only the last n-1 words are relevant for a prediction
Example with n = 5:
P(sure | If you're going to San Francisco , be) ≈ P(sure | San Francisco , be)
(the comma counts as a word)
Definitions and Markov assumption
n-gram: a sequence of length n together with its count
E.g. the 5-gram "If you're going to San" with count 4
Sequence naming: $w_1^{i-1} := w_1 w_2 \dots w_{i-1}$
Markov assumption formalized:
$P(w_i \mid w_1^{i-1}) \approx P(w_i \mid w_{i-n+1}^{i-1})$
(the history is cut down to the last n-1 words)
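As an aside, the n-gram counts $c(\cdot)$ used in the rest of the talk can be collected with a few lines of Python; this is a minimal sketch over a tokenized corpus, not the indexing pipeline built for the thesis.

```python
from collections import Counter

# Collect n-gram counts c(w_1^n) from a tokenized corpus.
def count_ngrams(tokens, n):
    counts = Counter()
    for i in range(len(tokens) - n + 1):
        counts[tuple(tokens[i:i + n])] += 1
    return counts

tokens = "if you're going to san francisco be sure to wear some flowers".split()
print(count_ngrams(tokens, 2)[("going", "to")])  # 1
```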
Formalizing next word prediction
Instead of $P(s)$:
Only one conditional probability with the Markov assumption: $P(w_i \mid w_{i-n+1}^{i-1})$
• Simplify $P(w_i \mid w_{i-n+1}^{i-1})$ to $P(w_n \mid w_1^{n-1})$ by renumbering the n-1 context words
$\mathrm{NWP}(w_1^{n-1}) = \operatorname{argmax}_{w_n \in W} P(w_n \mid w_1^{n-1})$
where $W$ is the set of all words in the corpus and $w_1^{n-1}$ are the n-1 preceding words
How to calculate the probability $P(w_n \mid w_1^{n-1})$?
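A minimal sketch of the NWP function, assuming some estimator `prob(word, context)` is available (the concrete estimators follow on the next slides):

```python
# Next word prediction as an argmax over the vocabulary W.
# `prob` is a placeholder for any conditional probability estimator.
def predict_next_word(context, vocabulary, prob):
    return max(vocabulary, key=lambda w: prob(w, context))

# Usage: predict_next_word(("going", "to"), {"san", "the", "a"}, prob)
```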
How to calculate $P(w_n \mid w_1^{n-1})$
The easiest way: maximum likelihood:
$P_{\mathrm{ML}}(w_n \mid w_1^{n-1}) = \frac{c(w_1^n)}{c(w_1^{n-1})}$
Example:
P(San | If you're going to) = c(If you're going to San) / c(If you're going to)
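A minimal sketch of this estimate on top of plain n-gram counts; the counts in the example are made up for illustration.

```python
from collections import Counter

# P_ML(w_n | w_1^{n-1}) = c(w_1^n) / c(w_1^{n-1})
def p_ml(context, word, counts):
    history = tuple(context)
    history_count = counts.get(history, 0)
    if history_count == 0:
        return 0.0  # the history was never seen, so the ratio is undefined
    return counts.get(history + (word,), 0) / history_count

counts = Counter({("going", "to"): 4, ("going", "to", "San"): 3})
print(p_ml(("going", "to"), "San", counts))  # 0.75
```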
Intro Generalized Language Models (GLMs)
Main idea:
Insert wildcard words (∗) into sequences
Example:
Instead of P(San | If you're going to):
• P(San | If ∗ ∗ ∗)
• P(San | If ∗ ∗ to)
• P(San | If ∗ going ∗)
• P(San | If ∗ going to)
• P(San | If you're ∗ ∗)
• …
Separate different types of GLMs based on:
1. Sequence length
2. Number of wildcard words
(e.g. P(San | If ∗ ∗ to) has sequence length 5 and 2 wildcard words)
Aggregate the results
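A minimal sketch of enumerating such wildcard contexts; it keeps the first history word fixed, mirroring the examples above, but whether the thesis generates exactly this set is an assumption.

```python
from itertools import combinations

# Enumerate generalized contexts by replacing subsets of the history
# words (all but the first one here) with the wildcard "*".
def generalized_contexts(history):
    positions = range(1, len(history))
    for k in range(len(history)):
        for wild in combinations(positions, k):
            yield tuple("*" if i in wild else w for i, w in enumerate(history))

for ctx in generalized_contexts(("If", "you're", "going", "to")):
    print(ctx)  # ('If', "you're", 'going', 'to'), ..., ('If', '*', '*', '*')
```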
Why Generalized Language Models?
Data sparsity of n-grams
“If you′re going to San” is seen less often than, for example, “If ∗ ∗ to San”
Question: Does that really improve the prediction?
Result of evaluation: Yes
… but we should use smoothing for language models
Smoothing
Problem: Unseen sequences
Try to estimate probabilities of unseen sequences
Probabilities of seen sequences need to be reduced
Two approaches:
1. Backoff smoothing
2. Interpolation smoothing
Backoff smoothing
If a sequence is unseen: use a shorter sequence
E.g.: if P(San | going to) = 0, use P(San | to)
$P_{\mathrm{back}}(w_n \mid w_i^{n-1}) = \begin{cases} \tau(w_n \mid w_i^{n-1}) & \text{if } c(w_i^n) > 0 \\ \gamma \cdot P_{\mathrm{back}}(w_n \mid w_{i+1}^{n-1}) & \text{if } c(w_i^n) = 0 \end{cases}$
Here $\tau$ is the higher-order probability, $\gamma$ is a weight, and $P_{\mathrm{back}}(w_n \mid w_{i+1}^{n-1})$ is the lower-order probability (applied recursively).
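A minimal sketch of the recursion, with `tau` and `gamma` as placeholders for whatever discounted probability and weight a concrete backoff method (e.g. Katz) defines:

```python
# Back off to the shorter context whenever the full sequence is unseen.
# `tau(word, context)` and `gamma(context)` are placeholders supplied by
# a concrete smoothing method; `counts` maps token tuples to frequencies.
def p_backoff(word, context, counts, tau, gamma):
    context = tuple(context)
    if not context:
        return tau(word, ())
    if counts.get(context + (word,), 0) > 0:
        return tau(word, context)
    return gamma(context) * p_backoff(word, context[1:], counts, tau, gamma)
```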
Interpolated Smoothing
Always use the shorter sequence in the calculation:
$P_{\mathrm{inter}}(w_n \mid w_i^{n-1}) = \tau(w_n \mid w_i^{n-1}) + \gamma \cdot P_{\mathrm{inter}}(w_n \mid w_{i+1}^{n-1})$
Again $\tau$ is the higher-order probability, $\gamma$ is a weight, and the recursive call is the lower-order probability.
Seems to work better than backoff smoothing.
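The corresponding sketch for interpolation differs from the backoff version only in that the lower-order term is always added, not just for unseen sequences (again with hypothetical `tau` and `gamma`):

```python
# Interpolation: always mix the lower-order estimate into the result.
def p_interpolated(word, context, tau, gamma):
    context = tuple(context)
    if not context:
        return tau(word, ())
    lower = p_interpolated(word, context[1:], tau, gamma)
    return tau(word, context) + gamma(context) * lower
```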
Kneser-Ney smoothing [KN95] intro
Interpolated smoothing
Idea: improve the lower-order calculation
Example: the context word "visiting" is unseen in the corpus
P(Francisco | visiting) = 0
Normal interpolation: 0 + γ · P(Francisco)
P(San | visiting) = 0
Normal interpolation: 0 + γ · P(San)
Result: Francisco is as likely as San at that position
Is that correct? What is the difference between Francisco and San?
Answer: the number of different contexts they appear in
Kneser-Ney smoothing idea
For the lower-order calculation:
Don't use $c(w_n)$. Instead, use the number of different bigrams the word completes:
$N_{1+}(\bullet\, w_n) := \left|\{\, w_{n-1} : c(w_{n-1}^{n}) > 0 \,\}\right|$
Or in general:
$N_{1+}(\bullet\, w_{i+1}^{n}) = \left|\{\, w_i : c(w_i^{n}) > 0 \,\}\right|$
In addition:
$N_{1+}(\bullet\, w_{i+1}^{n-1}\, \bullet) = \sum_{w_n} N_{1+}(\bullet\, w_{i+1}^{n})$
$N_{1+}(w_i^{n-1}\, \bullet) = \left|\{\, w_n : c(w_i^{n}) > 0 \,\}\right|$
(c denotes the count of a sequence)
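A minimal sketch of computing the continuation count $N_{1+}(\bullet\, w)$ from bigram counts; the toy numbers are invented, but they show why "Francisco" ends up with a much smaller continuation count than "San":

```python
from collections import defaultdict

# N_1+(* w): number of distinct bigram types the word w completes,
# i.e. the number of different left neighbours it was seen with.
def continuation_counts(bigram_counts):
    left = defaultdict(set)
    for (w_prev, w), c in bigram_counts.items():
        if c > 0:
            left[w].add(w_prev)
    return {w: len(prevs) for w, prevs in left.items()}

bigrams = {("San", "Francisco"): 50, ("in", "San"): 20, ("to", "San"): 15}
print(continuation_counts(bigrams))  # {'Francisco': 1, 'San': 2}
```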
Kneser-Ney smoothing equation (highest)
Highest-order calculation:
$P_{\mathrm{KN}}(w_n \mid w_i^{n-1}) = \frac{\max\{c(w_i^{n}) - D,\, 0\}}{c(w_i^{n-1})} + \frac{D}{c(w_i^{n-1})}\, N_{1+}(w_i^{n-1}\, \bullet)\; P_{\mathrm{KN}}(w_n \mid w_{i+1}^{n-1})$
In the first term the discounted count is divided by the total count; the max assures a non-negative value; $D$ is the discount value with $0 \le D \le 1$. The second term is the lower-order weight multiplied by the lower-order probability (recursion).
Kneser-Ney smoothing equation
Lower-order calculation:
$P_{\mathrm{KN}}(w_n \mid w_i^{n-1}) = \frac{\max\{N_{1+}(\bullet\, w_i^{n}) - D,\, 0\}}{N_{1+}(\bullet\, w_i^{n-1}\, \bullet)} + \frac{D}{N_{1+}(\bullet\, w_i^{n-1}\, \bullet)}\, N_{1+}(w_i^{n-1}\, \bullet)\; P_{\mathrm{KN}}(w_n \mid w_{i+1}^{n-1})$
Lowest-order calculation:
$P_{\mathrm{KN}}(w_n) = \frac{N_{1+}(\bullet\, w_n)}{N_{1+}(\bullet\, \bullet)}$
Here the discounted continuation count is divided by the total continuation count; the max assures a non-negative value; $D$ is the discount value; the last factor is the lower-order probability (recursion) scaled by the lower-order weight.
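Putting the highest- and lowest-order steps together, here is a minimal sketch of interpolated Kneser-Ney for a bigram model (n = 2), with a single fixed discount D and invented data structures; the thesis implementation over higher orders and GLMs is of course more involved.

```python
# Interpolated Kneser-Ney for a bigram model: one highest-order step,
# then the lowest-order continuation probability. D is a fixed discount.
def p_kn_bigram(w_prev, w, bigram_counts, unigram_counts, D=0.75):
    seen = [pair for pair, c in bigram_counts.items() if c > 0]
    n1p_all = len(seen)                                    # N_1+(* *)
    n1p_left = sum(1 for (_a, b) in seen if b == w)        # N_1+(* w)
    n1p_right = sum(1 for (a, _b) in seen if a == w_prev)  # N_1+(w_prev *)

    p_lowest = n1p_left / n1p_all                          # P_KN(w)
    c_hist = unigram_counts.get(w_prev, 0)
    if c_hist == 0:
        return p_lowest                 # unseen history: rely on the lower order
    higher = max(bigram_counts.get((w_prev, w), 0) - D, 0) / c_hist
    weight = D / c_hist * n1p_right     # lower-order weight (gamma)
    return higher + weight * p_lowest
```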
Modified Kneser-Ney smoothing [CG98]
Different discount values for different absolute counts
Lower-order calculation:
$P_{\mathrm{KN}}(w_n \mid w_i^{n-1}) = \frac{\max\{N_{1+}(\bullet\, w_i^{n}) - D(c(w_i^{n})),\, 0\}}{N_{1+}(\bullet\, w_i^{n-1}\, \bullet)} + \frac{D_1 N_1(w_i^{n-1}\, \bullet) + D_2 N_2(w_i^{n-1}\, \bullet) + D_{3+} N_{3+}(w_i^{n-1}\, \bullet)}{N_{1+}(\bullet\, w_i^{n-1}\, \bullet)}\; P_{\mathrm{KN}}(w_n \mid w_{i+1}^{n-1})$
State of the art (for 15 years!)
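The three discounts are usually estimated from the counts-of-counts as described by Chen and Goodman [CG98]; a minimal sketch follows (it will divide by zero on corpora that lack n-grams with counts 1 to 3):

```python
from collections import Counter

# Estimate D_1, D_2 and D_3+ from n_k, the number of n-grams seen exactly
# k times, following the formulas in Chen & Goodman [CG98].
def modified_kn_discounts(ngram_counts):
    n = Counter(ngram_counts.values())  # counts-of-counts
    y = n[1] / (n[1] + 2 * n[2])
    d1 = 1 - 2 * y * n[2] / n[1]
    d2 = 2 - 3 * y * n[3] / n[2]
    d3_plus = 3 - 4 * y * n[4] / n[3]
    return d1, d2, d3_plus
```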
Smoothing of GLMs
We can use all smoothing techniques on GLMs as well!
Small modification:
E.g.: P(San | If ∗ going ∗)
Lower-order sequence:
– Normally: P(San | ∗ going ∗)
– Instead use: P(San | going ∗)
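One plausible reading of this modification as code: when shortening a generalized context, also drop any wildcard that would end up in the leading position. This is a hypothetical helper, not taken from the thesis.

```python
# Lower-order context of a generalized sequence: drop the first word as
# usual, then also drop wildcards left dangling at the front.
def glm_lower_order(context):
    shorter = list(context[1:])
    while shorter and shorter[0] == "*":
        shorter.pop(0)
    return tuple(shorter)

print(glm_lower_order(("If", "*", "going", "*")))  # ('going', '*')
```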
Progress
Done so far:
Extracting text from XML files
Building GLMs
Kneser-Ney and modified Kneser-Ney smoothing
Indexing with MySQL
To do:
Finish the evaluation program
Run the evaluation
Analyze the results
Summary
Data sets: more data, better data
Language models: n-grams, Generalized Language Models
Smoothing: Katz, Good-Turing, Witten-Bell, Kneser-Ney, …
Thank you for your attention!
Questions?
Martin Körner
mkoerner@uni-koblenz.de
Sources
Images:
Wheelchair joystick (slide 4): http://i01.i.aliimg.com/img/pb/741/422/527/527422741_355.jpg
Smartphone keyboard (slide 4): https://activecaptain.com/articles/mobilePhones/iPhone/iPhone_Keyboard.jpg
References:
[CG98]: Stanley Chen and Joshua Goodman. An empirical study of smoothing techniques for language modeling. Technical Report TR-10-98, Harvard University, August 1998.
[JM80]: F. Jelinek and R. L. Mercer. Interpolated estimation of Markov source parameters from sparse data. In Proceedings of the Workshop on Pattern Recognition in Practice, pages 381–397, 1980.
[KN95]: Reinhard Kneser and Hermann Ney. Improved backing-off for m-gram language modeling. In Proceedings of ICASSP-95, volume 1, pages 181–184. IEEE, 1995.