DESCRIPTION
The first talk on the topic of my bachelor thesis with a focus on Kneser-Ney smoothing.
Web Science & Technologies
University of Koblenz ▪ Landau, Germany
Introduction to Kneser-Ney Smoothing on Top of Generalized Language Models for Next Word Prediction
Martin Körner
Oberseminar
25.07.2013
Content
Introduction
Language Models
Generalized Language Models
Smoothing
Progress
Summary
Introduction: Motivation
Next word prediction: What is the next word a user will type?
Use cases for next word prediction:
Augmentative and Alternative Communication (AAC)
Small keyboards (smartphones)
Introduction to next word prediction
How do we predict words?
1. Rationalist approach
• Manually encoding information about language
• “Toy” problems only
2. Empiricist approach
• Statistical, pattern recognition, and machine learning methods applied to corpora
• Result: Language models
Language models in general
Language model: How likely is a sentence $s$?
Probability distribution: $P(s)$
Calculate $P(s)$ by multiplying conditional probabilities
Example:
P(If you're going to San Francisco , be sure …) = P(you're | If) · P(going | If you're) · P(to | If you're going) · P(San | If you're going to) · P(Francisco | If you're going to San) · ⋯
Estimating these conditional probabilities with their full histories empirically would fail: most long word sequences never appear in a corpus.
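To make the chain rule concrete, here is a minimal Python sketch (not the thesis implementation): it multiplies the conditional probability of each word given its history, with `cond_prob` as a hypothetical placeholder for any concrete estimator.

```python
# Chain rule: P(s) is the product of P(word | all preceding words).
# `cond_prob` is a hypothetical placeholder for a concrete estimator.
def sentence_probability(words, cond_prob):
    prob = 1.0
    for i, word in enumerate(words):
        prob *= cond_prob(word, words[:i])
    return prob

# Toy estimator that assigns 0.1 to every word, regardless of history:
toy = lambda word, history: 0.1
sentence = "If you're going to San Francisco , be sure".split()
print(sentence_probability(sentence, toy))  # 0.1 ** 9, roughly 1e-09
```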
Conditional probabilities simplified
Markov assumption [JM80]:
Only the last n-1 words are relevant for a prediction
Example with n = 5:
P(sure | If you're going to San Francisco , be) ≈ P(sure | San Francisco , be)
(the comma counts as a word)
Definitions and Markov assumption
n-gram: a sequence of length n together with its count
E.g. the 5-gram "If you're going to San" with count 4
Sequence naming: $w_1^{i-1} := w_1 w_2 \dots w_{i-1}$
Markov assumption formalized:
$P(w_i \mid w_1^{i-1}) \approx P(w_i \mid w_{i-n+1}^{i-1})$
(the history is cut down to the last n-1 words)
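As an aside, the n-gram counts $c(\cdot)$ used in the rest of the talk can be collected with a few lines of Python; this is a minimal sketch over a tokenized corpus, not the indexing pipeline built for the thesis.

```python
from collections import Counter

# Collect n-gram counts c(w_1^n) from a tokenized corpus.
def count_ngrams(tokens, n):
    counts = Counter()
    for i in range(len(tokens) - n + 1):
        counts[tuple(tokens[i:i + n])] += 1
    return counts

tokens = "if you're going to san francisco be sure to wear some flowers".split()
print(count_ngrams(tokens, 2)[("going", "to")])  # 1
```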
Formalizing next word prediction
Instead of $P(s)$:
Only one conditional probability with the Markov assumption: $P(w_i \mid w_{i-n+1}^{i-1})$
• Simplify $P(w_i \mid w_{i-n+1}^{i-1})$ to $P(w_n \mid w_1^{n-1})$ by renumbering the n-1 context words
$\mathrm{NWP}(w_1^{n-1}) = \operatorname{argmax}_{w_n \in W} P(w_n \mid w_1^{n-1})$
where $W$ is the set of all words in the corpus and $w_1^{n-1}$ are the n-1 preceding words
How to calculate the probability $P(w_n \mid w_1^{n-1})$?
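A minimal sketch of the NWP function, assuming some estimator `prob(word, context)` is available (the concrete estimators follow on the next slides):

```python
# Next word prediction as an argmax over the vocabulary W.
# `prob` is a placeholder for any conditional probability estimator.
def predict_next_word(context, vocabulary, prob):
    return max(vocabulary, key=lambda w: prob(w, context))

# Usage: predict_next_word(("going", "to"), {"san", "the", "a"}, prob)
```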
How to calculate $P(w_n \mid w_1^{n-1})$
The easiest way: maximum likelihood:
$P_{\mathrm{ML}}(w_n \mid w_1^{n-1}) = \frac{c(w_1^n)}{c(w_1^{n-1})}$
Example:
P(San | If you're going to) = c(If you're going to San) / c(If you're going to)
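A minimal sketch of this estimate on top of plain n-gram counts; the counts in the example are made up for illustration.

```python
from collections import Counter

# P_ML(w_n | w_1^{n-1}) = c(w_1^n) / c(w_1^{n-1})
def p_ml(context, word, counts):
    history = tuple(context)
    history_count = counts.get(history, 0)
    if history_count == 0:
        return 0.0  # the history was never seen, so the ratio is undefined
    return counts.get(history + (word,), 0) / history_count

counts = Counter({("going", "to"): 4, ("going", "to", "San"): 3})
print(p_ml(("going", "to"), "San", counts))  # 0.75
```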
Intro Generalized Language Models (GLMs)
Main idea:
Insert wildcard words (∗) into sequences
Example:
Instead of P(San | If you're going to):
• P(San | If ∗ ∗ ∗)
• P(San | If ∗ ∗ to)
• P(San | If ∗ going ∗)
• P(San | If ∗ going to)
• P(San | If you're ∗ ∗)
• …
Separate different types of GLMs based on:
1. Sequence length
2. Number of wildcard words
(e.g. P(San | If ∗ ∗ to) has sequence length 5 and 2 wildcard words)
Aggregate the results
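A minimal sketch of enumerating such wildcard contexts; it keeps the first history word fixed, mirroring the examples above, but whether the thesis generates exactly this set is an assumption.

```python
from itertools import combinations

# Enumerate generalized contexts by replacing subsets of the history
# words (all but the first one here) with the wildcard "*".
def generalized_contexts(history):
    positions = range(1, len(history))
    for k in range(len(history)):
        for wild in combinations(positions, k):
            yield tuple("*" if i in wild else w for i, w in enumerate(history))

for ctx in generalized_contexts(("If", "you're", "going", "to")):
    print(ctx)  # ('If', "you're", 'going', 'to'), ..., ('If', '*', '*', '*')
```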
Why Generalized Language Models?
Data sparsity of n-grams
“If you′re going to San” is seen less often than, for example, “If ∗ ∗ to San”
Question: Does that really improve the prediction?
Result of evaluation: Yes
… but we should use smoothing for language models
Smoothing
Problem: Unseen sequences
Try to estimate probabilities of unseen sequences
Probabilities of seen sequences need to be reduced
Two approaches:
1. Backoff smoothing
2. Interpolation smoothing
Backoff smoothing
If a sequence is unseen: use a shorter sequence
E.g.: if P(San | going to) = 0, use P(San | to)
$P_{\mathrm{back}}(w_n \mid w_i^{n-1}) = \begin{cases} \tau(w_n \mid w_i^{n-1}) & \text{if } c(w_i^n) > 0 \\ \gamma \cdot P_{\mathrm{back}}(w_n \mid w_{i+1}^{n-1}) & \text{if } c(w_i^n) = 0 \end{cases}$
Here $\tau$ is the higher-order probability, $\gamma$ is a weight, and $P_{\mathrm{back}}(w_n \mid w_{i+1}^{n-1})$ is the lower-order probability (applied recursively).
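A minimal sketch of the recursion, with `tau` and `gamma` as placeholders for whatever discounted probability and weight a concrete backoff method (e.g. Katz) defines:

```python
# Back off to the shorter context whenever the full sequence is unseen.
# `tau(word, context)` and `gamma(context)` are placeholders supplied by
# a concrete smoothing method; `counts` maps token tuples to frequencies.
def p_backoff(word, context, counts, tau, gamma):
    context = tuple(context)
    if not context:
        return tau(word, ())
    if counts.get(context + (word,), 0) > 0:
        return tau(word, context)
    return gamma(context) * p_backoff(word, context[1:], counts, tau, gamma)
```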
Interpolated Smoothing
Always use the shorter sequence in the calculation:
$P_{\mathrm{inter}}(w_n \mid w_i^{n-1}) = \tau(w_n \mid w_i^{n-1}) + \gamma \cdot P_{\mathrm{inter}}(w_n \mid w_{i+1}^{n-1})$
Again $\tau$ is the higher-order probability, $\gamma$ is a weight, and the recursive call is the lower-order probability.
Seems to work better than backoff smoothing.
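The corresponding sketch for interpolation differs from the backoff version only in that the lower-order term is always added, not just for unseen sequences (again with hypothetical `tau` and `gamma`):

```python
# Interpolation: always mix the lower-order estimate into the result.
def p_interpolated(word, context, tau, gamma):
    context = tuple(context)
    if not context:
        return tau(word, ())
    lower = p_interpolated(word, context[1:], tau, gamma)
    return tau(word, context) + gamma(context) * lower
```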
Kneser-Ney smoothing [KN95] intro
Interpolated smoothing
Idea: improve the lower-order calculation
Example: the context word "visiting" is unseen in the corpus
P(Francisco | visiting) = 0
Normal interpolation: 0 + γ · P(Francisco)
P(San | visiting) = 0
Normal interpolation: 0 + γ · P(San)
Result: Francisco is as likely as San at that position
Is that correct? What is the difference between Francisco and San?
Answer: the number of different contexts they appear in
Kneser-Ney smoothing idea
For the lower-order calculation:
Don't use $c(w_n)$. Instead, use the number of different bigrams the word completes:
$N_{1+}(\bullet\, w_n) := \left|\{\, w_{n-1} : c(w_{n-1}^{n}) > 0 \,\}\right|$
Or in general:
$N_{1+}(\bullet\, w_{i+1}^{n}) = \left|\{\, w_i : c(w_i^{n}) > 0 \,\}\right|$
In addition:
$N_{1+}(\bullet\, w_{i+1}^{n-1}\, \bullet) = \sum_{w_n} N_{1+}(\bullet\, w_{i+1}^{n})$
$N_{1+}(w_i^{n-1}\, \bullet) = \left|\{\, w_n : c(w_i^{n}) > 0 \,\}\right|$
(c denotes the count of a sequence)
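A minimal sketch of computing the continuation count $N_{1+}(\bullet\, w)$ from bigram counts; the toy numbers are invented, but they show why "Francisco" ends up with a much smaller continuation count than "San":

```python
from collections import defaultdict

# N_1+(* w): number of distinct bigram types the word w completes,
# i.e. the number of different left neighbours it was seen with.
def continuation_counts(bigram_counts):
    left = defaultdict(set)
    for (w_prev, w), c in bigram_counts.items():
        if c > 0:
            left[w].add(w_prev)
    return {w: len(prevs) for w, prevs in left.items()}

bigrams = {("San", "Francisco"): 50, ("in", "San"): 20, ("to", "San"): 15}
print(continuation_counts(bigrams))  # {'Francisco': 1, 'San': 2}
```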
Kneser-Ney smoothing equation (highest)
Highest-order calculation:
$P_{\mathrm{KN}}(w_n \mid w_i^{n-1}) = \frac{\max\{c(w_i^{n}) - D,\, 0\}}{c(w_i^{n-1})} + \frac{D}{c(w_i^{n-1})}\, N_{1+}(w_i^{n-1}\, \bullet)\; P_{\mathrm{KN}}(w_n \mid w_{i+1}^{n-1})$
In the first term the discounted count is divided by the total count; the max assures a non-negative value; $D$ is the discount value with $0 \le D \le 1$. The second term is the lower-order weight multiplied by the lower-order probability (recursion).
Kneser-Ney smoothing equation
Lower-order calculation:
$P_{\mathrm{KN}}(w_n \mid w_i^{n-1}) = \frac{\max\{N_{1+}(\bullet\, w_i^{n}) - D,\, 0\}}{N_{1+}(\bullet\, w_i^{n-1}\, \bullet)} + \frac{D}{N_{1+}(\bullet\, w_i^{n-1}\, \bullet)}\, N_{1+}(w_i^{n-1}\, \bullet)\; P_{\mathrm{KN}}(w_n \mid w_{i+1}^{n-1})$
Lowest-order calculation:
$P_{\mathrm{KN}}(w_n) = \frac{N_{1+}(\bullet\, w_n)}{N_{1+}(\bullet\, \bullet)}$
Here the discounted continuation count is divided by the total continuation count; the max assures a non-negative value; $D$ is the discount value; the last factor is the lower-order probability (recursion) scaled by the lower-order weight.
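Putting the highest- and lowest-order steps together, here is a minimal sketch of interpolated Kneser-Ney for a bigram model (n = 2), with a single fixed discount D and invented data structures; the thesis implementation over higher orders and GLMs is of course more involved.

```python
# Interpolated Kneser-Ney for a bigram model: one highest-order step,
# then the lowest-order continuation probability. D is a fixed discount.
def p_kn_bigram(w_prev, w, bigram_counts, unigram_counts, D=0.75):
    seen = [pair for pair, c in bigram_counts.items() if c > 0]
    n1p_all = len(seen)                                    # N_1+(* *)
    n1p_left = sum(1 for (_a, b) in seen if b == w)        # N_1+(* w)
    n1p_right = sum(1 for (a, _b) in seen if a == w_prev)  # N_1+(w_prev *)

    p_lowest = n1p_left / n1p_all                          # P_KN(w)
    c_hist = unigram_counts.get(w_prev, 0)
    if c_hist == 0:
        return p_lowest                 # unseen history: rely on the lower order
    higher = max(bigram_counts.get((w_prev, w), 0) - D, 0) / c_hist
    weight = D / c_hist * n1p_right     # lower-order weight (gamma)
    return higher + weight * p_lowest
```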
Modified Kneser-Ney smoothing [CG98]
Different discount values for different absolute counts
Lower-order calculation:
$P_{\mathrm{KN}}(w_n \mid w_i^{n-1}) = \frac{\max\{N_{1+}(\bullet\, w_i^{n}) - D(c(w_i^{n})),\, 0\}}{N_{1+}(\bullet\, w_i^{n-1}\, \bullet)} + \frac{D_1 N_1(w_i^{n-1}\, \bullet) + D_2 N_2(w_i^{n-1}\, \bullet) + D_{3+} N_{3+}(w_i^{n-1}\, \bullet)}{N_{1+}(\bullet\, w_i^{n-1}\, \bullet)}\; P_{\mathrm{KN}}(w_n \mid w_{i+1}^{n-1})$
State of the art (for 15 years!)
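The three discounts are usually estimated from the counts-of-counts as described by Chen and Goodman [CG98]; a minimal sketch follows (it will divide by zero on corpora that lack n-grams with counts 1 to 3):

```python
from collections import Counter

# Estimate D_1, D_2 and D_3+ from n_k, the number of n-grams seen exactly
# k times, following the formulas in Chen & Goodman [CG98].
def modified_kn_discounts(ngram_counts):
    n = Counter(ngram_counts.values())  # counts-of-counts
    y = n[1] / (n[1] + 2 * n[2])
    d1 = 1 - 2 * y * n[2] / n[1]
    d2 = 2 - 3 * y * n[3] / n[2]
    d3_plus = 3 - 4 * y * n[4] / n[3]
    return d1, d2, d3_plus
```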
Smoothing of GLMs
We can use all smoothing techniques on GLMs as well!
Small modification:
E.g.: P(San | If ∗ going ∗)
Lower-order sequence:
– Normally: P(San | ∗ going ∗)
– Instead use: P(San | going ∗)
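One plausible reading of this modification as code: when shortening a generalized context, also drop any wildcard that would end up in the leading position. This is a hypothetical helper, not taken from the thesis.

```python
# Lower-order context of a generalized sequence: drop the first word as
# usual, then also drop wildcards left dangling at the front.
def glm_lower_order(context):
    shorter = list(context[1:])
    while shorter and shorter[0] == "*":
        shorter.pop(0)
    return tuple(shorter)

print(glm_lower_order(("If", "*", "going", "*")))  # ('going', '*')
```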
Progress
Done so far:
Extracting text from XML files
Building GLMs
Kneser-Ney and modified Kneser-Ney smoothing
Indexing with MySQL
To do:
Finish the evaluation program
Run the evaluation
Analyze the results
Summary
Data sets: more data, better data
Language models: n-grams, Generalized Language Models
Smoothing: Katz, Good-Turing, Witten-Bell, Kneser-Ney, …
Thank you for your attention!
Questions?
Martin Körner
mkoerner@uni-koblenz.de
Sources
Images:
Wheelchair joystick (slide 4): http://i01.i.aliimg.com/img/pb/741/422/527/527422741_355.jpg
Smartphone keyboard (slide 4): https://activecaptain.com/articles/mobilePhones/iPhone/iPhone_Keyboard.jpg
References:
[CG98]: Stanley Chen and Joshua Goodman. An empirical study of smoothing techniques for language modeling. Technical Report TR-10-98, Harvard University, August 1998.
[JM80]: F. Jelinek and R. L. Mercer. Interpolated estimation of Markov source parameters from sparse data. In Proceedings of the Workshop on Pattern Recognition in Practice, pages 381–397, 1980.
[KN95]: Reinhard Kneser and Hermann Ney. Improved backing-off for m-gram language modeling. In Proceedings of ICASSP-95, volume 1, pages 181–184. IEEE, 1995.