임성신
[email protected]@pusan.ac.kr
Speech and Language Processing
Ch8. WORD CLASSES AND PART-OF-SPEECH TAGGING
Artificial Intelligence Laboratory
Agenda
• What are they?
• Distribution
• Tagsets
• Tagging
  • Rules
  • Probabilities
  • Transformation-Based (Brill)
Parts of Speech
• Start with eight basic categories: noun, verb, pronoun, preposition, adjective, adverb, article, conjunction
• These categories are based on morphological and distributional properties (not semantics)
• Some cases are easy, others are murky
Parts of Speech
Two kinds of category
Closed class
• Prepositions, articles, conjunctions, pronouns
Open class
• Nouns, verbs, adjectives, adverbs
Fig 8.1 Prepositions (and particles) of English from the CELEX on-line dictionary. Frequency counts are from the COBUILD 16 million word corpus.
Fig 8.2 English single-word particles from Quirk et al. (1985).
Fig 8.3 Coordinating and subordinating conjunctions of English from the CELEX on-line dictionary. Frequency counts are from the COBUILD 16 million word corpus.
Fig 8.4 Pronouns of English from the CELEX on-line dictionary. Frequency counts are from the COBUILD 16 million word corpus.
Fig 8.5 English modal verbs from the CELEX on-line dictionary. Frequency counts are from the COBUILD 16 million word corpus.
Sets of Parts of Speech: Tagsets
• There are various standard tagsets to choose from; some have a lot more tags than others
• The choice of tagset is based on the application
• Accurate tagging can be done even with large tagsets
Fig 8.6 Penn Treebank part-of-speech tags (including punctuation).
Tagging
Part-of-speech tagging is the process of assigning a part of speech to each word in a sentence. Assume we have:
• A tagset
• A dictionary that gives the possible set of tags for each entry
• A text to be tagged
• A reason?
The/DT grand/JJ jury/NN commented/VBD on/IN a/DT number/NN of/IN other/JJ topics/NNS ./.
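The word/TAG notation in the example sentence is easy to read programmatically. The following is a minimal sketch, assuming tokens are separated by whitespace and the tag follows the last slash in each token; the function name `parse_tagged` is hypothetical and not part of the slides.

```python
def parse_tagged(sentence):
    """Split a 'word/TAG word/TAG ...' string into (word, tag) pairs."""
    pairs = []
    for token in sentence.split():
        # rpartition splits on the LAST slash, so a word that itself
        # contains '/' keeps its slash and only the final part is the tag.
        word, _, tag = token.rpartition("/")
        pairs.append((word, tag))
    return pairs


tagged = parse_tagged(
    "The/DT grand/JJ jury/NN commented/VBD on/IN a/DT number/NN "
    "of/IN other/JJ topics/NNS ./."
)
for word, tag in tagged:
    print(f"{word}\t{tag}")
```

The resulting list of pairs is the form assumed by most counting and training steps in tagging.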
Figure 8.7 The number of word types in the Brown corpus by degree of ambiguity (after DeRose (1988)).
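A table of word types by degree of ambiguity can be computed from any tagged corpus by counting how many distinct tags each word type receives. The sketch below assumes the corpus is a list of (word, tag) pairs and that case is ignored; the data is a toy list invented for illustration, not drawn from the Brown corpus, and `ambiguity_histogram` is a hypothetical name.

```python
from collections import Counter, defaultdict


def ambiguity_histogram(tagged_words):
    """Map 'number of distinct tags' to 'number of word types with that many tags'."""
    tags_per_word = defaultdict(set)
    for word, tag in tagged_words:
        tags_per_word[word.lower()].add(tag)
    return Counter(len(tags) for tags in tags_per_word.values())


# Toy data, invented for illustration only.
toy = [
    ("the", "DT"), ("race", "NN"), ("to", "TO"), ("race", "VB"),
    ("the", "DT"), ("back", "RB"), ("back", "NN"), ("back", "JJ"),
    ("jury", "NN"),
]
hist = ambiguity_histogram(toy)
for degree in sorted(hist):
    print(f"{degree} tag(s): {hist[degree]} word type(s)")
```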
Tagging - Rules
Hand-crafted rules for ambiguous words that test the context to make appropriate choices
• Early attempts fairly error-prone
• Extremely labor-intensive
Figure 8.8 Sample lexical entries from the ENGTWOL lexicon described in Voutilainen (1995) and Heikkila (1995).
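Tagging - Probabilities
Advantages
• Given a sufficiently large tagged corpus, the statistics needed for tagging are easy to extract, so the approach scales well, applies broadly, and achieves relatively high overall accuracy
Disadvantages
• Dependent on the corpus
• A tagged corpus above a certain size is needed to extract meaningful statistics
• Building a corpus takes a great deal of time and effort
• When the corpus is skewed or too small, reliability drops because of data sparseness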
Tagging - Probabilities
We want the best set of tags for a sequence of words (a sentence)
$$\arg\max_{T} P(T \mid W) = \arg\max_{T} \frac{P(W \mid T)\,P(T)}{P(W)}$$
$$\arg\max_{T} P(T \mid W) = \arg\max_{T} P(W \mid T)\,P(T)$$
W is a sequence of words
T is a sequence of tags
The probability of the word sequence P(W) will be the same for each tag sequence
$$\arg\max_{T} \prod_{i=1}^{n} P(w_i \mid t_i) \cdot P(t_1) \cdot \prod_{i=2}^{n} P(t_i \mid t_{i-1})$$
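The bigram factorization above (an initial-tag probability, tag-to-tag transition probabilities, and word-given-tag probabilities) can be maximized without enumerating every tag sequence by dynamic programming. The slides do not name a search procedure; the sketch below uses the Viterbi algorithm as one common choice. All probabilities are toy values invented for illustration, and the names `viterbi`, `start_p`, `trans_p`, and `emit_p` are assumptions.

```python
import math


def viterbi(words, tags, start_p, trans_p, emit_p):
    """Return the tag sequence maximizing P(t1) * prod P(ti|ti-1) * prod P(wi|ti)."""
    def log(p):
        # Log space avoids underflow; zero probability maps to -infinity.
        return math.log(p) if p > 0 else float("-inf")

    # best[i][t]: best log-probability of a tag sequence for words[:i+1] ending in t
    best = [{t: log(start_p.get(t, 0)) + log(emit_p[t].get(words[0], 0)) for t in tags}]
    back = [{}]
    for i in range(1, len(words)):
        best.append({})
        back.append({})
        for t in tags:
            prev, score = max(
                ((p, best[i - 1][p] + log(trans_p[p].get(t, 0))) for p in tags),
                key=lambda pair: pair[1],
            )
            best[i][t] = score + log(emit_p[t].get(words[i], 0))
            back[i][t] = prev
    # Follow back-pointers from the best final tag.
    path = [max(tags, key=lambda t: best[-1][t])]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return list(reversed(path))


# Toy model, invented for illustration only.
tags = ["TO", "NN", "VB"]
start_p = {"TO": 0.4, "NN": 0.4, "VB": 0.2}
trans_p = {
    "TO": {"TO": 0.1, "NN": 0.1, "VB": 0.8},
    "NN": {"TO": 0.3, "NN": 0.4, "VB": 0.3},
    "VB": {"TO": 0.4, "NN": 0.5, "VB": 0.1},
}
emit_p = {
    "TO": {"to": 1.0},
    "NN": {"race": 0.0006},
    "VB": {"race": 0.0001},
}
print(viterbi(["to", "race"], tags, start_p, trans_p, emit_p))
```

In this toy model "race" is more probable as a noun in isolation, but after "to" the transition term P(VB | TO) outweighs the lexical preference, so the sequence TO VB scores higher than TO NN.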
Tagging - Transformation-Based (Brill tagging)
Combine rules and statistics…
• TBL (Transformation-Based Learning) is based on rules
• Rules are automatically induced from the data (ML)
Brill tagging - Examples
Race
• “race” as NN: .98
• “race” as VB: .02
So you’ll be wrong 2% of the time, which really isn’t bad
Patch the cases where you know it has to be a verb
• Change NN to VB when previous tag is TO
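The patch rule "Change NN to VB when previous tag is TO" can be written as a small function over a tag sequence. This is a minimal sketch, assuming the initial tags come from a most-frequent-tag baseline and that the rule tests the tags as they stood before the pass began; `patch_tags` is a hypothetical name and the example sentence is invented.

```python
def patch_tags(tags, from_tag, to_tag, prev_tag):
    """Change from_tag to to_tag wherever the preceding tag is prev_tag."""
    out = list(tags)
    for i in range(1, len(tags)):
        # The context is read from the input tags, not from tags already
        # rewritten in this pass (an assumption of this sketch).
        if tags[i] == from_tag and tags[i - 1] == prev_tag:
            out[i] = to_tag
    return out


# "is expected to race tomorrow", with "race" initially tagged by its most
# frequent tag, NN.
baseline = ["VBZ", "VBN", "TO", "NN", "NN"]
patched = patch_tags(baseline, from_tag="NN", to_tag="VB", prev_tag="TO")
print(patched)
```

Only the NN that directly follows TO is changed; the final NN ("tomorrow") is left alone because its previous tag is not TO.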
Brill tagging - Rules
Where did that transformational rule come from?
• Define a hypothesis space of rules that might help decrease the error rate
• Search that space (exhaustively?) to find the rules that most reduce the error rate
• Continue to add rules until some stopping criterion is reached
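The three steps above can be sketched as a greedy loop. This is an illustration under strong simplifying assumptions, not Brill's implementation: the hypothesis space contains a single template ("change tag a to tag b when the previous tag is z"), the search is exhaustive, the corpus is one flat tag list with sentence boundaries ignored, and learning stops when no rule lowers the error count. The names `learn_rules`, `apply_rule`, and `errors` are hypothetical, and the data is invented.

```python
from itertools import product


def apply_rule(tags, rule):
    """Apply 'change a to b when the previous tag is z' across the sequence."""
    a, b, z = rule
    return [
        b if i > 0 and t == a and tags[i - 1] == z else t
        for i, t in enumerate(tags)
    ]


def errors(tags, gold):
    """Count positions where the current tag differs from the gold tag."""
    return sum(t != g for t, g in zip(tags, gold))


def learn_rules(initial, gold, tagset, max_rules=10):
    """Greedily add the rule that most reduces errors until none helps."""
    current = list(initial)
    learned = []
    # Hypothesis space: every (a, b, z) instantiation of the one template.
    candidates = [(a, b, z) for a, b, z in product(tagset, repeat=3) if a != b]
    for _ in range(max_rules):
        best = min(candidates, key=lambda r: errors(apply_rule(current, r), gold))
        if errors(apply_rule(current, best), gold) >= errors(current, gold):
            break  # stopping criterion: no rule reduces the error count
        current = apply_rule(current, best)
        learned.append(best)
    return learned, current


# Toy data, invented for illustration only.
tagset = ["DT", "NN", "TO", "VB"]
gold = ["TO", "VB", "DT", "NN", "TO", "VB"]
initial = ["TO", "NN", "DT", "NN", "TO", "NN"]  # most-frequent-tag baseline
learned, final = learn_rules(initial, gold, tagset)
print(learned)
```

On this toy data the loop recovers the rule from the "race" example, changing NN to VB after TO, and then stops because no further rule reduces the error count.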
Figure 8.9 Brill’s (1995) templates. Each begins with “Change tag a to tag b when: …”. The variables a, b, z and w range over parts-of-speech.