101035 Chinese Information Processing (中文信息处理)

Chinese NLP

Lecture 6

Words: Part-of-Speech Tagging (1)

• Parts of speech and POS tagging (词性和词性标注)

• Tagsets (标签集)

• Rule-based POS tagging (基于规则的词性标注)

• POS tagging evaluation (词性标注评测)

POS Tagging Basics (词性和词性标注)

• Motivation

• Part of speech (词性): also called POS, word class, morphological class, or lexical tag

• Open-class words (e.g. noun, verb) vs closed-class words (e.g. auxiliary, conjunction)

• POS tagging is crucial for various NLP applications.

Raw text: 编辑这篇报道

Segmented: 编辑 这 篇 报道

Tagged: 编辑/v 这/r 篇/q 报道/n

• Tagging Ambiguity

• One word may be labeled with different POS tags.

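book

I will book a flight to Chicago. (verb)

Please hand me that book. (noun)

报道

这篇报道写得很及时。 (noun: "This report was written in a very timely manner.")

央视新闻报道了春晚的筹备进展。 (verb: "CCTV News reported on the preparations for the Spring Festival Gala.")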

Fortunately, in both English and Chinese, many words have only one possible tag, so tagging them is trivial. For the remaining words, POS tagging is a non-trivial disambiguation problem.

• Disambiguation

• For many words, their multiple POS tags are not equally likely.

• Context information is important for tag disambiguation. For example, English articles are often followed by nouns.

can

Auxiliary (be able to)

Noun (a metal container)

Verb (to put something in a metal container)

的

Auxiliary particle (助词): marks possession

Noun (名词): the center of an archery target
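The observation that a word's possible tags are not equally likely suggests a simple baseline: always choose the tag seen most often for that word. The sketch below assumes a tiny hand-made tagged corpus (the sentences are invented for illustration, not drawn from a real corpus) and Penn Treebank tag names. The helper names `train_most_frequent_tag` and `tag` are hypothetical.

```python
from collections import Counter, defaultdict

# Toy tagged corpus, invented for illustration only.
TAGGED = [
    ("I", "PRP"), ("can", "MD"), ("open", "VB"), ("the", "DT"), ("can", "NN"),
    ("you", "PRP"), ("can", "MD"), ("go", "VB"),
]

def train_most_frequent_tag(tagged_words):
    """Count tags per (lowercased) word and keep the most frequent one."""
    counts = defaultdict(Counter)
    for word, tag_ in tagged_words:
        counts[word.lower()][tag_] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}

def tag(words, model, default="NN"):
    """Tag each word with its most frequent tag; unknown words get `default`."""
    return [(w, model.get(w.lower(), default)) for w in words]

model = train_most_frequent_tag(TAGGED)
print(tag(["You", "can", "open", "the", "can"], model))
# [('You', 'PRP'), ('can', 'MD'), ('open', 'VB'), ('the', 'DT'), ('can', 'MD')]
```

The baseline tags both occurrences of "can" as MD, including the one after "the". That error is exactly the kind that context information, such as "articles are often followed by nouns", is meant to correct.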

• Tagging Methods

• Rule-based tagging

• Statistical (e.g., HMM-based) tagging

• Transformation-based tagging

• Memory-based tagging

• Hybrid tagging

Tagsets (标签集)

• Definition

• A tagset is a list of all POS tags for a particular language.

• Tagsets are manually compiled and language-specific.

• There are different tagsets for languages like English and Chinese.

• English Tagsets

• Penn Treebank tagset (45 tags)

• Brown corpus tagset (87 tags)

• C7 tagset (146 tags)

• Example (Penn Treebank Tagset)

• Chinese Tagsets

• 《新著国语文法》 (5 classes, 9 types)

• 《文法简论》 (4 classes, 9 types)

• 《信息处理用现代汉语词类及标记集规范》 (20 major classes, 24 subclasses, 8 sub-subclasses)

• 《现代汉语语法信息词典》 (4 classes, 9 types)

• PKU 《人民日报》 (People's Daily) POS tagset (39 tags)

• Example (PKU 《人民日报》 (People's Daily) POS tagset)

In-Class Exercise

• Using the Penn Treebank tagset (page 8) and the following tagging result, write down the part of speech of each word (e.g., The: determiner).

The/DT grand/JJ jury/NN commented/VBD on/IN a/DT number/NN of/IN other/JJ topics/NNS ./.

Rule-based POS Tagging (基于规则的词性标注)

• Rule-based POS Tagging for English

• Two-stage architecture

• Stage 1: use a dictionary to assign each word all of its possible POS tags

• Stage 2: use hand-written disambiguation rules to narrow each word down to a single POS tag

• One of the most comprehensive rule-based approaches is the Constraint Grammar approach. The EngCG tagger is based on it.
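The two-stage architecture can be sketched as follows. This is a minimal illustration, not EngCG or Constraint Grammar: the lexicon is a toy dictionary, the two context rules are assumptions made up for this example, and `assign_candidates`, `disambiguate`, and `rule_based_tag` are hypothetical names.

```python
# Stage 1 resource: every possible tag for each word (toy entries, Penn Treebank tags).
LEXICON = {
    "i": ["PRP"], "will": ["MD"], "book": ["NN", "VB"], "a": ["DT"],
    "flight": ["NN"], "please": ["UH"], "hand": ["VB", "NN"],
    "me": ["PRP"], "that": ["DT"],
}

def assign_candidates(words):
    """Stage 1: look up all candidate tags; unknown words default to NN."""
    return [list(LEXICON.get(w.lower(), ["NN"])) for w in words]

def disambiguate(candidates):
    """Stage 2: apply hand-written context rules, left to right."""
    tags = []
    for cands in candidates:
        prev = tags[-1] if tags else None
        if len(cands) > 1:
            if prev == "DT" and "NN" in cands:
                cands = ["NN"]   # assumed rule: after a determiner, keep the noun reading
            elif prev == "MD" and "VB" in cands:
                cands = ["VB"]   # assumed rule: after a modal, keep the verb reading
        tags.append(cands[0])    # no rule fired: fall back to the first listed tag
    return tags

def rule_based_tag(words):
    return list(zip(words, disambiguate(assign_candidates(words))))

print(rule_based_tag("I will book a flight".split()))
print(rule_based_tag("Please hand me that book".split()))
```

With these toy rules, "book" comes out as VB after "will" and as NN after "that", matching the two readings in the earlier "book" examples.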

• Rule-based POS Tagging for Chinese

• Rules based on collocation

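• If a word that can be either n or v is coordinated with an unambiguous noun (through a coordinating conjunction or the enumeration comma 、), label it n.

哲学的产生是人类思想 (n) 和认识 (n, v) 的伟大变革。

Here 认识 is coordinated with the noun 思想 through 和, so it is labeled n.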

• If an n/v word and an unambiguous noun appear in the same syntactic context, i.e., they share or are modified by the same word or structure, label the n/v word n.

生产产品时既要重视产出 (n, v),又要重视质量 (n)。

• If an n/v word follows a determiner that can only modify nouns, label it n.

她是一位研究自然语言处理的女教授 (n, v) 。

• If an n/v word follows an adjective that can only modify nouns, label it n.

中国为世界和平做出了伟大贡献 (n, v) 。

Similarly, there are rules based on pronouns (代词), numerals and quantifiers (数量词), nominal-object verbs (体宾动词), monosyllabic adjectives (单音节形容词), and signature characters (特征字).
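Two of the collocation rules above (coordination with a noun, and following a noun-only adjective) can be sketched in code. The sketch assumes the sentence is already segmented and that each word carries its set of candidate tags. The conjunction list and the noun-only adjective list are small illustrative assumptions, and `apply_collocation_rules` is a hypothetical name.

```python
CONJUNCTIONS = {"和", "与", "及", "、"}   # assumed list of coordinators
NOUN_ONLY_ADJECTIVES = {"伟大"}          # assumed list; 伟大 comes from the example above

def apply_collocation_rules(tokens):
    """tokens: list of (word, set of candidate tags). Resolves n/v words to n where a rule fires."""
    result = [(w, set(tags)) for w, tags in tokens]
    for i, (word, tags) in enumerate(result):
        if tags != {"n", "v"}:
            continue
        # Coordination rule: "N CONJ X" or "X CONJ N", where N is an unambiguous noun.
        left = i >= 2 and result[i - 1][0] in CONJUNCTIONS and result[i - 2][1] == {"n"}
        right = (i + 2 < len(result) and result[i + 1][0] in CONJUNCTIONS
                 and result[i + 2][1] == {"n"})
        # Adjective rule: X directly follows an adjective that only modifies nouns.
        after_adj = i >= 1 and result[i - 1][0] in NOUN_ONLY_ADJECTIVES
        if left or right or after_adj:
            result[i] = (word, {"n"})
    return result

# Segmented fragment of: 人类思想和认识的伟大变革
tokens = [
    ("思想", {"n"}), ("和", {"c"}), ("认识", {"n", "v"}),
    ("的", {"u"}), ("伟大", {"a"}), ("变革", {"n", "v"}),
]
resolved = apply_collocation_rules(tokens)
for word, tags in resolved:
    print(word, sorted(tags))
```

Words that no rule touches keep their full candidate set, so a later rule or a statistical tagger could still resolve them.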

• Rules derived from phrase structure

• Phrase structure rules are derived from grammatical phenomena actually observed in a corpus. There are generic rules and specific rules.

• Generic rules are designed for a certain POS and are function-driven.

• Specific rules are tailored for individual words and are word-driven.

• Affix rules (词缀规则)

• R1: Let K1 = { 金、银、红、黄、绿、蓝、白、灰、黑 } and length = 3. If X1 ∈ K1 and X2X3 has the reduplicated form BB, then X (= X1X2X3) is a (形容词, adjective).

• R2: Let K2 = { 一、几 } and length = 3. If X1 ∈ K2 and X2X3 has the reduplicated form BB, then X is m (数量词, numeral-quantifier).

金灿灿 (a),绿油油 (a)

一件件 (m),几次次 (m)

• R3: Let K3 = { 老、大、小 }. If X1 ∈ K3 and X2…XL is a surname, then X is n (名词, noun).

• R4: Let K4 = { 老、总、局 }. If XL ∈ K4 and X1…XL-1 is a surname, then X is n (名词, noun).

老陈 (n),大张 (n),小王 (n)

王老 (n),孙总 (n),赵局 (n)

• R5: Let K5 = { 赛、酸、仪、家、学、色 }. If XL ∈ K5, then X is n (名词, noun).

• R6: Let K6 = { 化 }. If XL ∈ K6, then X is v (动词, verb).

• R7: Let K7 = { 然 }. If XL ∈ K7, then X is d (副词, adverb).

篮球赛 (n),核苷酸 (n),地震仪 (n),音乐家 (n),计算语言学 (n),深红色 (n)

自动化 (v),全球化 (v)

欣然 (d),忽然 (d)
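Rules R1 to R7 are simple enough to write down directly as string tests. The sketch below is one possible encoding, not a reference implementation. The surname set is a tiny illustrative assumption, the rules are tried in numeric order (the slides do not specify a priority), and `affix_tag` is a hypothetical name.

```python
K1 = set("金银红黄绿蓝白灰黑")
K2 = set("一几")
K3 = set("老大小")
K4 = set("老总局")
K5 = set("赛酸仪家学色")
SURNAMES = {"陈", "张", "王", "孙", "赵"}   # tiny illustrative list, an assumption

def affix_tag(x):
    """Return the tag predicted by R1-R7 for word x, or None if no rule applies."""
    if len(x) == 3 and x[1] == x[2]:          # X2X3 has the form BB
        if x[0] in K1:
            return "a"                        # R1
        if x[0] in K2:
            return "m"                        # R2
    if len(x) >= 2:
        if x[0] in K3 and x[1:] in SURNAMES:
            return "n"                        # R3: prefix + surname
        if x[-1] in K4 and x[:-1] in SURNAMES:
            return "n"                        # R4: surname + suffix
        if x[-1] in K5:
            return "n"                        # R5
        if x[-1] == "化":
            return "v"                        # R6
        if x[-1] == "然":
            return "d"                        # R7
    return None

for w in ["金灿灿", "一件件", "老陈", "孙总", "地震仪", "全球化", "忽然", "黑蛐蛐"]:
    print(w, affix_tag(w))
```

The last word, 黑蛐蛐, is tagged a by R1 even though it is a noun, which shows how purely form-based rules over-generalize.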

• Repetition rules (重叠词规则)

• R8: If X has the form AABB and AB is an adjective or a verb, then X is likewise a (形容词, adjective) or v (动词, verb).

• R9: If X has the form AA and A is a verb or a quantifier, then X is likewise v (动词, verb) or q (量词, quantifier).

打打闹闹 (v),高高兴兴 (a)

朵朵 (q) 云彩,让我听听 (v)
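The repetition rules R8 and R9 need the tag of the base word (AB or A), so a sketch of them has to assume a lexicon. Here the lexicon is a four-entry toy dictionary built from the examples above, and `repetition_tag` is a hypothetical name.

```python
# Toy base lexicon (an assumption): tags of the unreduplicated words.
BASE_LEXICON = {"打闹": "v", "高兴": "a", "朵": "q", "听": "v"}

def repetition_tag(x, lexicon=BASE_LEXICON):
    """Return the tag predicted by R8 or R9 for word x, or None if neither applies."""
    # R8: AABB inherits the tag of AB when AB is an adjective or a verb.
    if len(x) == 4 and x[0] == x[1] and x[2] == x[3] and x[0] != x[2]:
        base = lexicon.get(x[0] + x[2])
        if base in ("a", "v"):
            return base
    # R9: AA inherits the tag of A when A is a verb or a quantifier.
    if len(x) == 2 and x[0] == x[1]:
        base = lexicon.get(x[0])
        if base in ("v", "q"):
            return base
    return None

for w in ["打打闹闹", "高高兴兴", "朵朵", "听听"]:
    print(w, repetition_tag(w))
```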

• Exceptions to the rules

• Rules capture only the majority pattern; almost every rule has exceptions.

黑蛐蛐 ("black cricket", a noun) violates R1

敬老 ("to respect the elderly", a verb) violates R4

In-Class Exercise

• Give one more counterexample (反例) to any of the rules for Chinese POS tagging (R1 – R9). Explain which rule it is against.

POS Tagging Evaluation (词性标注评测)

• Word-Level Accuracy

• Sentence-Level Accuracy

• Word-Level vs Sentence-Level

• Generally, sentence-level accuracy is much lower than word-level accuracy.

• Word-level accuracy is the tagging accuracy by default.

• Sentence-level accuracy makes more sense for syntactic analysis.

• Word-level accuracy makes more sense for semantic analysis.
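The two measures can be computed as below. The sketch assumes the usual definitions: word-level accuracy is the fraction of words tagged correctly, and sentence-level accuracy is the fraction of sentences in which every word is tagged correctly. The gold and predicted tag sequences are invented toy data, and the function names are hypothetical.

```python
def word_accuracy(gold, pred):
    """Fraction of words whose predicted tag equals the gold tag."""
    total = sum(len(sentence) for sentence in gold)
    correct = sum(g == p for gs, ps in zip(gold, pred) for g, p in zip(gs, ps))
    return correct / total

def sentence_accuracy(gold, pred):
    """Fraction of sentences in which every tag is correct."""
    return sum(gs == ps for gs, ps in zip(gold, pred)) / len(gold)

# Toy data: two sentences, one tagging error in the second.
gold = [["DT", "NN", "VBD"], ["PRP", "MD", "VB", "DT", "NN"]]
pred = [["DT", "NN", "VBD"], ["PRP", "MD", "NN", "DT", "NN"]]

print(word_accuracy(gold, pred))      # 0.875
print(sentence_accuracy(gold, pred))  # 0.5
```

A single wrong tag costs one word out of eight but one sentence out of two. As a rough illustration, if errors were independent (a simplifying assumption), a word-level accuracy of 0.97 on 20-word sentences would give a sentence-level accuracy of about 0.97^20 ≈ 0.54.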

• Error Analysis

• In word-level evaluation, a confusion matrix often helps us to better understand where we make mistakes.

[Figure: confusion matrix with axes labeled "Correct POS" and "Tagged POS"]
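A confusion matrix for word-level evaluation can be built by counting (correct tag, tagged tag) pairs. The sketch below uses invented toy tag sequences, and `confusion_matrix` and `top_confusions` are hypothetical helper names.

```python
from collections import Counter

def confusion_matrix(gold_tags, pred_tags):
    """Count how often each correct tag was tagged as each predicted tag."""
    return Counter(zip(gold_tags, pred_tags))

def top_confusions(matrix, k=3):
    """Return the k most frequent off-diagonal cells, i.e., the most common errors."""
    errors = {pair: n for pair, n in matrix.items() if pair[0] != pair[1]}
    return sorted(errors.items(), key=lambda item: -item[1])[:k]

# Toy data, flattened over all sentences.
gold_tags = ["NN", "VB", "NN", "JJ", "NN", "VB"]
pred_tags = ["NN", "NN", "NN", "NN", "VB", "NN"]

matrix = confusion_matrix(gold_tags, pred_tags)
print(matrix[("NN", "NN")])       # 2 (correctly tagged nouns)
print(top_confusions(matrix, 1))  # [(('VB', 'NN'), 2)]
```

In this toy data the most frequent error is a verb tagged as a noun, which would point to the n/v ambiguity as the first thing to fix.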

Wrap-Up

• POS and POS tagging (词性和词性标注): Motivation, Ambiguity

• Tagsets (标签集): English, Chinese

• Rule-based POS tagging (基于规则的词性标注): English, Chinese (Collocation), Chinese (Phrase Structure)

• POS tagging evaluation (词性标注评测): Word-Level, Sentence-Level, Error Analysis