
6. N-GRAMs

Pusan National University, Artificial Intelligence Laboratory (Seongja Choi)


Word prediction

“I’d like to make a collect …” The next word is likely call, telephone, or person-to-person.

- Spelling error detection

- Augmentative communication

- Context-sensitive spelling error correction


Language Model

Language Model (LM): a statistical model of word sequences

n-gram: use the previous n-1 words to predict the next word
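A language model assigns a probability to a whole word sequence. By the chain rule (standard notation, assumed here rather than taken from the slide), that probability factors into a product of next-word predictions, each of which the n-gram model approximates from only the previous n-1 words:

P(w_1 w_2 \ldots w_k) = P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_1 w_2) \cdots P(w_k \mid w_1 \ldots w_{k-1})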


Applications

Context-sensitive spelling error detection and correction
– “He is trying to fine out.” (“fine” should be “find”)
– “The design an construction will take a year.” (“an” should be “and”)

Machine translation


Counting Words in Corpora

Corpora (on-line text collections)

Which words to count
– What we are going to count
– Where we are going to find the things to count


Brown Corpus

1 million words, 500 texts
Varied genres (newspaper, novels, non-fiction, academic, etc.)
Assembled at Brown University in 1963-64
The first large on-line text collection used in corpus-based NLP research


Issues in Word Counting

Punctuation symbols (. , ? !)
Capitalization (“He” vs. “he”, “Bush” vs. “bush”)
Inflected forms (“cat” vs. “cats”)
– Wordform: cat, cats, eat, eats, ate, eating, eaten
– Lemma (stem): cat, eat


Types vs. Tokens

Tokens (N): total number of running words
Types (B): number of distinct words in a corpus (size of the vocabulary)

Example: “They picnicked by the pool, then lay back on the grass and looked at the stars.”
– 16 word tokens, 14 word types (not counting punctuation)

※ “Types” will mean wordform types, not lemma types, and punctuation marks will generally be counted as words.


How Many Words in English?

Shakespeare’s complete works
– 884,647 wordform tokens
– 29,066 wordform types

Brown Corpus
– 1 million wordform tokens
– 61,805 wordform types
– 37,851 lemma types


Simple (Unsmoothed) N-grams

Task: estimating the probability of a word

First attempt:
– Suppose there is no corpus available
– Use a uniform distribution
– Assume: word types = V (e.g., 100,000)
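With no corpus, every one of the V word types gets the same probability. A quick sketch of the arithmetic, using the V = 100,000 above:

P(w) = 1 / V = 1 / 100,000 = 0.00001 for every word w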


Simple (Unsmoothed) N-grams

Task: estimating the probability of a word

Second attempt:
– Suppose there is a corpus
– Assume: word tokens = N, # times w appears in the corpus = C(w)
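The resulting estimate is just relative frequency (N and C(w) as defined above):

P(w) = C(w) / N

For instance, a word appearing 1,000 times in a 1,000,000-token corpus would get P(w) = 0.001 (illustrative numbers, not from the slides).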


Simple (Unsmoothed) N-grams

Task: estimating the probability of a word

Third attempt:
– Suppose there is a corpus
– Assume a word depends on its n-1 previous words


Simple (Unsmoothed) N-grams

n-gram approximation:
– w_k only depends on its previous n-1 words
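Written out in standard notation (w_k and n as above):

P(w_k \mid w_1 \ldots w_{k-1}) \approx P(w_k \mid w_{k-n+1} \ldots w_{k-1})

For a bigram model (n = 2) this is P(w_k | w_{k-1}); for a trigram model (n = 3) it is P(w_k | w_{k-2} w_{k-1}).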


Bigram Approximation

Example: P(I want to eat British food)

= P(I|<s>) P(want|I) P(to|want) P(eat|to) P(British|eat) P(food|British)

<s>: a special word meaning “start of sentence”
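A minimal Python sketch of this computation; the bigram probability values below are made-up placeholders for illustration, not counts from any corpus:

bigram_prob = {
    ("<s>", "I"): 0.25, ("I", "want"): 0.32, ("want", "to"): 0.65,
    ("to", "eat"): 0.26, ("eat", "British"): 0.001, ("British", "food"): 0.60,
}

def sentence_prob(words, probs):
    # P(w1 ... wn) under a bigram model, conditioning the first word on <s>
    p = 1.0
    for prev, cur in zip(["<s>"] + words, words):
        p *= probs[(prev, cur)]
    return p

print(sentence_prob("I want to eat British food".split(), bigram_prob))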


Note on Practical Problem

Multiplying many probabilities results in a very small number and can cause numerical underflow

Use log probabilities (logprobs) instead in the actual computation
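The same product computed in log space, as a small illustration (the probability values are again made up):

import math

p = [0.25, 0.32, 0.65, 0.26, 0.001, 0.60]   # illustrative bigram probabilities
prob = math.prod(p)                          # direct product: shrinks toward underflow as sentences grow
logprob = sum(math.log(x) for x in p)        # log-space sum: numerically stable
print(prob, logprob, math.exp(logprob))      # exp(logprob) recovers the same probability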


Estimating N-gram Probability

Maximum Likelihood Estimate (MLE)
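The standard MLE estimate for a bigram (the same notation is used in the count example on the next slide):

P(w_n \mid w_{n-1}) = C(w_{n-1} w_n) / C(w_{n-1})

i.e., the count of the bigram divided by the count of its first word.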


Estimating Bigram Probability

Example:
– C(to eat) = 860

– C(to) = 3256
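Plugging these counts into the MLE estimate above:

P(eat \mid to) = C(to eat) / C(to) = 860 / 3256 \approx 0.26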


Two Important Facts

N-gram models become increasingly accurate as we increase the value of N

N-gram models depend very strongly on their training corpus (in particular its genre and its size in words)


Smoothing

Any particular training corpus is finite
Sparse data problem
Must deal with zero probabilities


Smoothing

Smoothing
– Re-evaluating zero-probability n-grams and assigning them non-zero probability

Also called discounting
– Lowering non-zero n-gram counts in order to assign some probability mass to the zero n-grams


Add-One Smoothing for Bigram
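In its usual form (V is the vocabulary size; counts as in the MLE estimate above), add-one (Laplace) smoothing adds 1 to every bigram count before normalizing:

P_{add-1}(w_n \mid w_{n-1}) = (C(w_{n-1} w_n) + 1) / (C(w_{n-1}) + V)

so a bigram never seen in training still receives probability 1 / (C(w_{n-1}) + V) rather than 0.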


Things Seen Once

Use the count of things seen once to help estimate the count of things never seen


Witten-Bell Discounting


Witten-Bell Discounting for Bigram
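A sketch of the usual bigram form, following Jurafsky and Martin. The symbols are assumptions, not taken from the slide: T(w) is the number of distinct word types seen after w in training, N(w) is the number of bigram tokens starting with w, and Z(w) is the number of word types never seen after w.

Total mass reserved for unseen bigrams after w_{n-1}:  T(w_{n-1}) / (N(w_{n-1}) + T(w_{n-1}))
Each unseen bigram:  P^*(w_i \mid w_{n-1}) = T(w_{n-1}) / (Z(w_{n-1}) (N(w_{n-1}) + T(w_{n-1})))
Each seen bigram:    P^*(w_i \mid w_{n-1}) = C(w_{n-1} w_i) / (C(w_{n-1}) + T(w_{n-1}))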


Good-Turing Discounting for Bigram
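The core Good-Turing re-estimate, in its standard form (N_c is the number of bigram types that occur exactly c times in training; these symbols are not defined on the slide):

c^* = (c + 1) N_{c+1} / N_c

and the total probability mass given to unseen bigrams is N_1 / N, where N is the total number of observed bigram tokens.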


Backoff
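A sketch of the idea in its usual Katz-style bigram form (P^* is a discounted estimate and \alpha(w_{n-1}) is a normalizing backoff weight; both symbols are assumptions, not from the slide):

P_{backoff}(w_n \mid w_{n-1}) = P^*(w_n \mid w_{n-1})        if C(w_{n-1} w_n) > 0
P_{backoff}(w_n \mid w_{n-1}) = \alpha(w_{n-1}) P(w_n)       otherwise

i.e., use the bigram estimate when the bigram was seen, and otherwise back off to the (weighted) unigram estimate.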


Entropy

Measure of uncertainty
Used to evaluate the quality of n-gram models (how well a language model matches a given language)
Entropy H(X) of a random variable X:
– Measured in bits
– The number of bits needed to encode the information in the optimal coding scheme
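The standard definition, with a small illustrative computation (not one of the slide's own examples):

H(X) = - \sum_{x} p(x) \log_2 p(x)

For a uniform distribution over 8 equally likely outcomes, H(X) = -8 \cdot (1/8) \log_2(1/8) = 3 bits.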


Example 1


Example 2


Perplexity
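Perplexity is usually defined from the per-word entropy H(W) of the model on a test sequence W = w_1 ... w_N (standard formulation, not shown on the slide):

Perplexity(W) = 2^{H(W)} = P(w_1 w_2 \ldots w_N)^{-1/N}

Lower perplexity means the model finds the test text less surprising.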


Entropy of a Sequence


Entropy of a Language
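In the usual formulation, the entropy of a sequence is normalized per word, and the entropy of a language L is its limit over longer and longer sequences (standard notation, assumed here):

(1/n) H(w_1 \ldots w_n) = -(1/n) \sum_{W_1^n \in L} p(W_1^n) \log_2 p(W_1^n)

H(L) = \lim_{n \to \infty} (1/n) H(w_1 \ldots w_n)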


Cross Entropy

Used for comparing two language models
p: the actual probability distribution that generated some data
m: a model of p (an approximation to p)
Cross entropy of m on p:
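In the usual formulation (for a language L with true distribution p and model m):

H(p, m) = \lim_{n \to \infty} -(1/n) \sum_{W_1^n \in L} p(W_1^n) \log_2 m(W_1^n)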


Cross Entropy

By the Shannon-McMillan-Breiman theorem:

Property of cross entropy:

The difference between H(p,m) and H(p) is a measure of how accurate model m is

The more accurate a model, the lower its cross-entropy
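Stated in formulas (standard forms, not shown on the slide): the theorem lets cross entropy be estimated from a single sufficiently long sample without knowing p, and cross entropy upper-bounds the true entropy:

H(p, m) = \lim_{n \to \infty} -(1/n) \log_2 m(w_1 w_2 \ldots w_n)

H(p) \le H(p, m)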
