
6. N-GRAMs

Pusan National University, Artificial Intelligence Laboratory (Seongja Choi)


Word prediction

“I’d like to make a collect …” The next word is likely call, telephone, or person-to-person.

- Spelling error detection

- Augmentative communication

- Context-sensitive spelling error correction


Language Model

Language Model (LM): a statistical model of word sequences

n-gram: use the previous n-1 words to predict the next word
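A language model assigns a probability to a whole word sequence. By the chain rule (standard notation, assumed here rather than taken from the slide), that probability factors into a product of next-word predictions, each of which the n-gram model approximates from only the previous n-1 words:

P(w_1 w_2 \ldots w_k) = P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_1 w_2) \cdots P(w_k \mid w_1 \ldots w_{k-1})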


Applications

Context-sensitive spelling error detection and correction
– “He is trying to fine out.” (“fine” should be “find”)
– “The design an construction will take a year.” (“an” should be “and”)

Machine translation


Counting Words in Corpora

Corpora (on-line text collections)

Which words to count
– What we are going to count
– Where we are going to find the things to count


Brown Corpus

1 million words, 500 texts
Varied genres (newspaper, novels, non-fiction, academic, etc.)
Assembled at Brown University in 1963-64
The first large on-line text collection used in corpus-based NLP research


Issues in Word Counting

Punctuation symbols (. , ? !)
Capitalization (“He” vs. “he”, “Bush” vs. “bush”)
Inflected forms (“cat” vs. “cats”)
– Wordform: cat, cats, eat, eats, ate, eating, eaten
– Lemma (stem): cat, eat


Types vs. Tokens

Tokens (N): total number of running words
Types (B): number of distinct words in a corpus (size of the vocabulary)

Example: “They picnicked by the pool, then lay back on the grass and looked at the stars.”
– 16 word tokens, 14 word types (not counting punctuation)

※ “Types” will mean wordform types, not lemma types, and punctuation marks will generally be counted as words.


How Many Words in English?

Shakespeare’s complete works
– 884,647 wordform tokens
– 29,066 wordform types

Brown Corpus
– 1 million wordform tokens
– 61,805 wordform types
– 37,851 lemma types


Simple (Unsmoothed) N-grams

Task: estimating the probability of a word

First attempt:
– Suppose there is no corpus available
– Use a uniform distribution
– Assume: word types = V (e.g., 100,000)
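With no corpus, every one of the V word types gets the same probability. A quick sketch of the arithmetic, using the V = 100,000 above:

P(w) = 1 / V = 1 / 100,000 = 0.00001 for every word w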


Simple (Unsmoothed) N-grams

Task: estimating the probability of a word

Second attempt:
– Suppose there is a corpus
– Assume: word tokens = N, # times w appears in the corpus = C(w)
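The resulting estimate is just relative frequency (N and C(w) as defined above):

P(w) = C(w) / N

For instance, a word appearing 1,000 times in a 1,000,000-token corpus would get P(w) = 0.001 (illustrative numbers, not from the slides).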


Simple (Unsmoothed) N-grams

Task: estimating the probability of a word

Third attempt:
– Suppose there is a corpus
– Assume a word depends on its n-1 previous words


Simple (Unsmoothed) N-grams

n-gram approximation:
– w_k only depends on its previous n-1 words
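Written out in standard notation (w_k and n as above):

P(w_k \mid w_1 \ldots w_{k-1}) \approx P(w_k \mid w_{k-n+1} \ldots w_{k-1})

For a bigram model (n = 2) this is P(w_k | w_{k-1}); for a trigram model (n = 3) it is P(w_k | w_{k-2} w_{k-1}).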


Bigram Approximation

Example: P(I want to eat British food)

= P(I|<s>) P(want|I) P(to|want) P(eat|to) P(British|eat) P(food|British)

<s>: a special word meaning “start of sentence”
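A minimal Python sketch of this computation; the bigram probability values below are made-up placeholders for illustration, not counts from any corpus:

bigram_prob = {
    ("<s>", "I"): 0.25, ("I", "want"): 0.32, ("want", "to"): 0.65,
    ("to", "eat"): 0.26, ("eat", "British"): 0.001, ("British", "food"): 0.60,
}

def sentence_prob(words, probs):
    # P(w1 ... wn) under a bigram model, conditioning the first word on <s>
    p = 1.0
    for prev, cur in zip(["<s>"] + words, words):
        p *= probs[(prev, cur)]
    return p

print(sentence_prob("I want to eat British food".split(), bigram_prob))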


Note on Practical Problem

Multiplying many probabilities results in a very small number and can cause numerical underflow

Use log probabilities (logprobs) instead in the actual computation
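The same product computed in log space, as a small illustration (the probability values are again made up):

import math

p = [0.25, 0.32, 0.65, 0.26, 0.001, 0.60]   # illustrative bigram probabilities
prob = math.prod(p)                          # direct product: shrinks toward underflow as sentences grow
logprob = sum(math.log(x) for x in p)        # log-space sum: numerically stable
print(prob, logprob, math.exp(logprob))      # exp(logprob) recovers the same probability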


Estimating N-gram Probability

Maximum Likelihood Estimate (MLE)
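The standard MLE estimate for a bigram (the same notation is used in the count example on the next slide):

P(w_n \mid w_{n-1}) = C(w_{n-1} w_n) / C(w_{n-1})

i.e., the count of the bigram divided by the count of its first word.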


Estimating Bigram Probability

Example:
– C(to eat) = 860

– C(to) = 3256
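Plugging these counts into the MLE estimate above:

P(eat \mid to) = C(to eat) / C(to) = 860 / 3256 \approx 0.26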


Two Important Facts

N-gram models become increasingly accurate as we increase the value of N

N-gram models depend very strongly on their training corpus (in particular its genre and its size in words)


Smoothing

Any particular training corpus is finite
Sparse data problem
Must deal with zero probabilities


Smoothing

Smoothing
– Re-evaluating zero-probability n-grams and assigning them non-zero probability

Also called discounting
– Lowering non-zero n-gram counts in order to assign some probability mass to the zero n-grams


Add-One Smoothing for Bigram
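In its usual form (V is the vocabulary size; counts as in the MLE estimate above), add-one (Laplace) smoothing adds 1 to every bigram count before normalizing:

P_{add-1}(w_n \mid w_{n-1}) = (C(w_{n-1} w_n) + 1) / (C(w_{n-1}) + V)

so a bigram never seen in training still receives probability 1 / (C(w_{n-1}) + V) rather than 0.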


Things Seen Once

Use the count of things seen once to help estimate the count of things never seen


Witten-Bell Discounting


Witten-Bell Discounting for Bigram
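A sketch of the usual bigram form, following Jurafsky and Martin. The symbols are assumptions, not taken from the slide: T(w) is the number of distinct word types seen after w in training, N(w) is the number of bigram tokens starting with w, and Z(w) is the number of word types never seen after w.

Total mass reserved for unseen bigrams after w_{n-1}:  T(w_{n-1}) / (N(w_{n-1}) + T(w_{n-1}))
Each unseen bigram:  P^*(w_i \mid w_{n-1}) = T(w_{n-1}) / (Z(w_{n-1}) (N(w_{n-1}) + T(w_{n-1})))
Each seen bigram:    P^*(w_i \mid w_{n-1}) = C(w_{n-1} w_i) / (C(w_{n-1}) + T(w_{n-1}))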


Good-Turing Discounting for Bigram
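The core Good-Turing re-estimate, in its standard form (N_c is the number of bigram types that occur exactly c times in training; these symbols are not defined on the slide):

c^* = (c + 1) N_{c+1} / N_c

and the total probability mass given to unseen bigrams is N_1 / N, where N is the total number of observed bigram tokens.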


Backoff
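A sketch of the idea in its usual Katz-style bigram form (P^* is a discounted estimate and \alpha(w_{n-1}) is a normalizing backoff weight; both symbols are assumptions, not from the slide):

P_{backoff}(w_n \mid w_{n-1}) = P^*(w_n \mid w_{n-1})        if C(w_{n-1} w_n) > 0
P_{backoff}(w_n \mid w_{n-1}) = \alpha(w_{n-1}) P(w_n)       otherwise

i.e., use the bigram estimate when the bigram was seen, and otherwise back off to the (weighted) unigram estimate.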


Entropy

Measure of uncertainty
Used to evaluate the quality of n-gram models (how well a language model matches a given language)
Entropy H(X) of a random variable X:
– Measured in bits
– The number of bits needed to encode the information in the optimal coding scheme
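The standard definition, with a small illustrative computation (not one of the slide's own examples):

H(X) = - \sum_{x} p(x) \log_2 p(x)

For a uniform distribution over 8 equally likely outcomes, H(X) = -8 \cdot (1/8) \log_2(1/8) = 3 bits.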


Example 1


Example 2


Perplexity
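Perplexity is usually defined from the per-word entropy H(W) of the model on a test sequence W = w_1 ... w_N (standard formulation, not shown on the slide):

Perplexity(W) = 2^{H(W)} = P(w_1 w_2 \ldots w_N)^{-1/N}

Lower perplexity means the model finds the test text less surprising.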


Entropy of a Sequence


Entropy of a Language
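In the usual formulation, the entropy of a sequence is normalized per word, and the entropy of a language L is its limit over longer and longer sequences (standard notation, assumed here):

(1/n) H(w_1 \ldots w_n) = -(1/n) \sum_{W_1^n \in L} p(W_1^n) \log_2 p(W_1^n)

H(L) = \lim_{n \to \infty} (1/n) H(w_1 \ldots w_n)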


Cross Entropy

Used for comparing two language models
p: the actual probability distribution that generated some data
m: a model of p (an approximation to p)
Cross entropy of m on p:
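In the usual formulation (for a language L with true distribution p and model m):

H(p, m) = \lim_{n \to \infty} -(1/n) \sum_{W_1^n \in L} p(W_1^n) \log_2 m(W_1^n)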


Cross Entropy

By the Shannon-McMillan-Breiman theorem:

Property of cross entropy:

The difference between H(p,m) and H(p) is a measure of how accurate model m is

The more accurate a model, the lower its cross-entropy
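Stated in formulas (standard forms, not shown on the slide): the theorem lets cross entropy be estimated from a single sufficiently long sample without knowing p, and cross entropy upper-bounds the true entropy:

H(p, m) = \lim_{n \to \infty} -(1/n) \log_2 m(w_1 w_2 \ldots w_n)

H(p) \le H(p, m)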
