
8. Word Classes and Part-of-Speech Tagging. May 26, 2007. Artificial Intelligence Laboratory, 이경택. Text: Speech and Language Processing, pp. 287-303



Origin of POS

Techne: a grammatical sketch of Greek written by Dionysius Thrax of Alexandria (c. 100 B.C.), or possibly by someone else.

Eight parts of speech: noun, verb, pronoun, preposition, adverb, conjunction, participle, article.

The basis for practically all subsequent part-of-speech descriptions of Greek, Latin, and most European languages for the next 2000 years.


Recent Lists of POS

Recent POS lists are much larger than before.

- Penn Treebank (Marcus et al., 1993): 45

- Brown corpus (Francis, 1979; Francis and Kučera, 1982): 87

- C7 tagset (Garside et al., 1997): 146

Synonyms of POS

- word classes

- morphological classes

- lexical tags


POS can be used in

Recognizing or producing the pronunciation of words

- CONtent (noun), conTENT (adjective)

- OBject (noun), obJECT (verb)

- ...

Information retrieval

- Stemming

- Selecting out nouns or other important words

ASR language models, such as class-based N-grams

Partial parsing
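As a minimal sketch of the pronunciation use case, a lookup keyed on (word, tag) can pick the stress pattern once a tagger has supplied the tag. The table, the `stress_pattern` function, and the coarse tag names are illustrative assumptions, not something taken from the text.

```python
# Hypothetical stress lexicon keyed by (lowercased word, coarse POS tag).
STRESS_LEXICON = {
    ("content", "NOUN"): "CONtent",
    ("content", "ADJ"): "conTENT",
    ("object", "NOUN"): "OBject",
    ("object", "VERB"): "obJECT",
}


def stress_pattern(word, tag):
    """Return the stress-marked form of a word given its POS tag.

    Falls back to the word itself when the pair is not in the lexicon.
    """
    return STRESS_LEXICON.get((word.lower(), tag), word)


print(stress_pattern("content", "NOUN"))
print(stress_pattern("object", "VERB"))
```

The same spelling maps to different stress patterns, so the tag is the only thing that selects between them.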


8.1 (Mostly) English Word Classes

Closed class types: have relatively fixed membership

Ex. prepositions: new prepositions are rarely coined.

Generally function words (ex. of, it, and, or, ...):

- Very short

- Occur frequently

- Play an important role in grammar

Open class types: have relatively open membership

Ex. nouns and verbs: new words are continually coined or borrowed from other languages.

Four major open classes (but not all human languages have all of these)

- Nouns

- Verbs

- Adjectives

- Adverbs


Open Classes

Noun, verb, adjective, adverb


Definition of Noun

Semantic definition is not good

- The name given to the lexical class in which the words for most people, places, or things occur

- But what about bandwidth? Relationship? Pacing?

Functional definition of noun

- Things like its ability to occur with determiners (a goat, its bandwidth, Plato’s Republic), to take possessives (IBM’s annual revenue), and, for most but not all nouns, to occur in the plural form (goats, abaci).


Grouping Noun

Uniqueness

- Proper nouns: Regina, Colorado, IBM, ...

- Common nouns: book, stair, apple, ...

Countability

Count nouns

- Can occur in both the singular and plural: goat(s), relationship(s), ...

- Can be counted: (one, two, ...) goat(s)

Mass nouns

- Cannot be counted: two snows (x), two communisms (x)

- Can appear without articles where singular count nouns cannot: Snow is white (o), Goat is white (x)


Verbs

Verbs have a number of morphological forms

- Non-3rd-person-sg: eat

- 3rd-person-sg: eats

- Progressive: eating

- Past participle: eaten

Auxiliaries: subclass of English verbs


Adjectives

Terms that describe properties or qualities

- Concepts of color, age, value, ...

There are languages without adjectives. (ex. Chinese)


Adverbs

Directional adverbs, locative adverbs: specify the direction or location of some action

- Ex. home, here, downhill

Degree adverbs: specify the extent of some action, process, or property

- Ex. extremely, very, somewhat

Manner adverbs: describe the manner of some action or process

- Ex. slowly, delicately

Temporal adverbs: describe the time that some action or event took place

- Ex. yesterday, Monday

Some adverbs (ex. Monday) are tagged in some tagging schemes as nouns


Closed Classes

Prepositions: on, under, over, near, by, at, from, to, with

Determiners: a, an, the

Pronouns: she, who, I, others

Conjunctions: and, but, or, as, if, when

Auxiliary verbs: can, may, should, are

Particles: up, down, on, off, in, out, at, by

Numerals: one, two, three, first, second, third


Prepositions

Occur before noun phrases

Often indicate spatial or temporal relations

- Literal (ex. on it, before then, by the house)

- Metaphorical (ex. on time, with gusto, beside herself)

Often indicate other relations as well

- Ex. Hamlet was written by Shakespeare, and [from Shakespeare] “And I did laugh sans intermission an hour by his dial”

Figure 8.1 Prepositions (and particles) of English from the CELEX on-line dictionary. Frequency counts are from the COBUILD 16 million word corpus.


Particle

Often combines with a verb to form a larger unit called a phrasal verb

- Come in: adverb

- Come with: preposition

- Come on: particle

Figure 8.2 English single-word particles from Quirk et al. (1985).


Determiners (articles)

a, an: mark a noun phrase as indefinite

the: marks it as definite

this?, that?

COBUILD statistics out of 16 million words

- the: 1,071,676

- a: 413,887

- an: 59,359
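The counts above can be turned into relative frequencies. This is a minimal sketch that assumes the corpus size is exactly 16,000,000 tokens (the slide gives only the rounded figure "16 million"); the variable and function names are illustrative.

```python
# COBUILD article counts as given on the slide.
ARTICLE_COUNTS = {"the": 1_071_676, "a": 413_887, "an": 59_359}

# Assumption: "16 million words" is treated as exactly 16,000,000 tokens.
CORPUS_SIZE = 16_000_000


def relative_frequency(count, total=CORPUS_SIZE):
    """Fraction of corpus tokens accounted for by `count` occurrences."""
    return count / total


for word, count in ARTICLE_COUNTS.items():
    print(f"{word}: {relative_frequency(count):.2%}")

total_share = relative_frequency(sum(ARTICLE_COUNTS.values()))
print(f"all three articles: {total_share:.2%}")
```

Under that assumption the three articles together account for a little under 10% of all tokens, which illustrates how frequent closed-class words are.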


Conjunctions

Used to join two phrases, clauses, or sentences

Coordinating conjunction: equal status

- and, or, but

Subordinating conjunction: embedded status

- that (ex. I thought that you might like some milk)

- Complementizers: subordinating conjunctions like that which link a verb to its argument in this way (more: Chapters 9, 11)

Figure 8.3 Coordinating and subordinating conjunctions of English from the CELEX on-line dictionary. Frequency counts are from the COBUILD 16 million word corpus.


Pronouns

A kind of shorthand for referring to some noun phrase, entity, or event.

- Personal pronouns: you, she, I, it, me, ...

- Possessive pronouns: my, your, his, her, its, one’s, our, their, ...

- Wh-pronouns: what, who, whom, whoever

Figure 8.4 Pronouns of English from the CELEX on-line dictionary. Frequency counts are from the COBUILD 16 million word corpus.


Auxiliary Verbs

Words that mark certain semantic features of a main verb; they include the modal verbs

- be: copula verb

- do

- have: perfect tenses

- can: ability, possibility

- may: permission, possibility

- ...

Figure 8.5 English modal verbs from the CELEX on-line dictionary. Frequency counts are from the COBUILD 16 million word corpus.


Other Closed Classes

Interjections: oh, ah, hey, man, alas, ...

Negatives: no, not, ...

Politeness markers: please, thank you, ...

Greetings: hello, goodbye, ...

Existential there: There are two on the table


8.2 Tagsets for English

There are various tagsets for English.

- Brown corpus (Francis, 1979; Francis and Kučera, 1982): 87 tags

- Penn Treebank (Marcus et al., 1993): 45 tags

- British National Corpus (Garside et al., 1997): 61 tags (C5 tagset)

- C7 tagset: 146 tags

Which tagset to use for a particular application depends on how much information the application needs

Figure 8.6 Penn Treebank part-of-speech tags (including punctuation)


8.3 Part-of-Speech Tagging

Definition: Process of assigning a POS or other lexical class marker to each word in a corpus.

Input: a string of words and a tagset (ex. Book that flight, Penn Treebank tagset)

Output: a single best tag for each word (ex. Book/VB that/DT flight/NN ./.)

Problem: resolve ambiguity → disambiguation

- Ex. book (Hand me that book, Book that flight)
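The word/TAG notation used in the output example can be read and written with a few lines of code. This is a minimal sketch; the function names `parse_tagged` and `format_tagged` are illustrative, and splitting on the last slash is an assumption made so that a token such as `./.` still parses.

```python
def parse_tagged(text):
    """Split 'word/TAG word/TAG ...' into a list of (word, tag) pairs."""
    pairs = []
    for token in text.split():
        # rpartition splits on the LAST slash, so './.' yields ('.', '.').
        word, _, tag = token.rpartition("/")
        pairs.append((word, tag))
    return pairs


def format_tagged(pairs):
    """Inverse of parse_tagged: join (word, tag) pairs back into one string."""
    return " ".join(f"{word}/{tag}" for word, tag in pairs)


tagged = parse_tagged("Book/VB that/DT flight/NN ./.")
print(tagged)
print(format_tagged(tagged))
```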

Figure 8.7 The number of word types in the Brown corpus by degree of ambiguity (after DeRose (1988)).


Taggers

Rule-based taggers

- Generally involve a large database of hand-written disambiguation rules

- Ex. ENGTWOL (based on the Constraint Grammar architecture of Karlsson et al. (1995))

Stochastic taggers

- Generally resolve tagging ambiguities by using a training corpus to compute the probability of a given word having a given tag in a given context

- Ex. HMM tagger (= Maximum Likelihood Tagger = Markov model tagger), based on the Hidden Markov Model

Transformation-based tagger, Brill tagger (after Brill (1995))

- Shares features of rule-based and stochastic taggers

- The rules are automatically induced from a previously tagged training corpus
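To make the stochastic idea concrete, the simplest corpus-based estimate ignores context and picks, for each word, the tag it carried most often in training. This is a unigram baseline, not the HMM tagger named above, and the tiny training corpus and all function names are invented for illustration.

```python
from collections import Counter, defaultdict

# Toy hand-tagged corpus (invented), using Penn Treebank style tags.
TRAINING = [
    [("Book", "VB"), ("that", "DT"), ("flight", "NN")],
    [("Hand", "VB"), ("me", "PRP"), ("that", "DT"), ("book", "NN")],
    [("I", "PRP"), ("read", "VBD"), ("the", "DT"), ("book", "NN")],
]


def train(sentences):
    """Count how often each (lowercased) word appears with each tag."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return counts


def tag_probability(counts, word, tag):
    """Relative-frequency estimate of P(tag | word); 0.0 for unseen words."""
    word_counts = counts.get(word.lower(), Counter())
    total = sum(word_counts.values())
    return word_counts[tag] / total if total else 0.0


def most_frequent_tag(counts, word, default="NN"):
    """Pick the most frequent training tag; `default` for unseen words is an assumption."""
    word_counts = counts.get(word.lower())
    if not word_counts:
        return default
    return word_counts.most_common(1)[0][0]


counts = train(TRAINING)
print([(w, most_frequent_tag(counts, w)) for w in ["Book", "that", "flight"]])
```

With this toy corpus the baseline tags Book in "Book that flight" as NN, because book is a noun in two of its three training occurrences. That is the kind of error that taking context into account is meant to fix.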


8.4 Rule-Based Part-of-Speech Tagging

Earliest algorithm

Based on a two-stage architecture

- First stage: assign each word a list of potential POS using a dictionary

- Second stage: winnow down the lists using hand-written disambiguation rules
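The two stages can be sketched as a dictionary lookup followed by rule-based elimination. The lexicon and the single rule below are invented for illustration (they are not ENGTWOL's), the tag names follow the Penn Treebank, and defaulting unknown words to NN is an assumption.

```python
# Stage 1 resource: hypothetical dictionary mapping words to candidate tags.
LEXICON = {
    "book": {"NN", "VB"},
    "that": {"DT", "IN", "WDT", "RB"},
    "the": {"DT"},
    "flight": {"NN"},
}


def assign_candidates(words):
    """Stage 1: give every word a fresh copy of its candidate tag set."""
    # Assumption: words missing from the lexicon are treated as nouns.
    return [set(LEXICON.get(word.lower(), {"NN"})) for word in words]


def no_verb_after_determiner(candidates, i):
    """Toy rule: drop VB from a word that directly follows an unambiguous determiner."""
    if i > 0 and candidates[i - 1] == {"DT"} and len(candidates[i]) > 1:
        candidates[i].discard("VB")


RULES = [no_verb_after_determiner]


def tag(words):
    """Stage 2: winnow each candidate list by applying every rule at every position."""
    candidates = assign_candidates(words)
    for i in range(len(words)):
        for rule in RULES:
            rule(candidates, i)
    return candidates


print(tag(["the", "book"]))
print(tag(["Book", "that", "flight"]))
```

The rule only removes tags and never empties a candidate set, which mirrors the winnowing idea: words that no rule can resolve simply stay ambiguous.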


ENGTWOL (Voutilainen, 1995)

Lexicon

- Based on two-level morphology

- Uses 56,000 entries for English word stems (Heikkilä, 1995)

- Counts a word with multiple POS as separate entries

Figure 8.8 Sample lexical entries from the ENGTWOL lexicon described in Voutilainen (1995) and Heikkilä (1995)


ENGTWOL – Process1

Process

First stage: each word is run through the two-level lexicon transducer, and all possible POS are returned.

- Ex. (one reading per line; a word with several readings is repeated)

- Pavlov: PAVLOV N NOM SG PROPER

- had: HAVE V PAST VFIN SVO

- had: HAVE PCP2 SVO

- shown: SHOW PCP2 SVOO SVO SV

- that: ADV

- that: PRON DEM SG

- that: DET CENTRAL DEM SG

- that: CS

- salivation: N NOM SG


ENGTWOL – Process2

Second stage

- Eliminate tags that are inconsistent with the context, using a set of about 1,100 constraints applied in a negative way (each constraint rules tags out)

- Ex. Adverbial-that rule

Given input: “that”
if
    (+1 A/ADV/QUANT);  /* if next word is adj, adverb, or quantifier */
    (+2 SENT-LIM);     /* and following which is a sentence boundary, */
    (NOT -1 SVOC/A);   /* and the previous word is not a verb like */
                       /* ‘consider’ which allows adjs as object complements */
then eliminate non-ADV tags
else eliminate ADV tag

Also uses

- Probabilistic constraints

- Other syntactic information
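The adverbial-that constraint can be sketched over per-word candidate tag sets. This is an illustrative reading of the rule, not ENGTWOL's implementation: the token layout, the names `has_any`, `apply_adverbial_that`, and `sentence`, and the reduction of each word's readings to a flat set of ENGTWOL-style labels are all assumptions.

```python
def has_any(tokens, index, labels):
    """True if the token at `index` exists and carries any of `labels`."""
    if 0 <= index < len(tokens):
        return bool(tokens[index]["tags"] & labels)
    return False


def apply_adverbial_that(tokens, i):
    """Apply the adverbial-that constraint to tokens[i] in place."""
    token = tokens[i]
    if token["word"].lower() != "that" or "ADV" not in token["tags"]:
        return
    adverbial_context = (
        has_any(tokens, i + 1, {"A", "ADV", "QUANT"})   # (+1 A/ADV/QUANT)
        and has_any(tokens, i + 2, {"SENT-LIM"})        # (+2 SENT-LIM)
        and not has_any(tokens, i - 1, {"SVOC/A"})      # (NOT -1 SVOC/A)
    )
    if adverbial_context:
        token["tags"] = {"ADV"}           # eliminate non-ADV tags
    elif len(token["tags"]) > 1:
        token["tags"].discard("ADV")      # eliminate ADV tag


def sentence(*pairs):
    """Build tokens from (word, tags) pairs, copying each tag set."""
    return [{"word": word, "tags": set(tags)} for word, tags in pairs]


THAT = ("that", {"ADV", "PRON", "DET", "CS"})

odd = sentence(("it", {"PRON"}), ("isn't", {"V"}), THAT,
               ("odd", {"A"}), (".", {"SENT-LIM"}))
apply_adverbial_that(odd, 2)

consider = sentence(("I", {"PRON"}), ("consider", {"V", "SVOC/A"}), THAT,
                    ("odd", {"A"}), (".", {"SENT-LIM"}))
apply_adverbial_that(consider, 2)

print(odd[2]["tags"], consider[2]["tags"])
```

In the first sentence all three conditions hold, so only the ADV reading survives. In the second, the preceding verb carries SVOC/A, so the ADV reading is removed and the other readings are left for later constraints.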