101035 中文信息处理 Chinese NLP Lecture 13



Page 1

101035 中文信息处理

Chinese NLP

Lecture 13

Page 2

应用——文本分类(1) Text Classification (1)

• 文本分类概况( Overview)

• 文本分类的用途( Applications)

• 文本的表示( Text representation )

• 文本特征选择( Feature selection )

Page 3

文本分类概况 Overview

• Definition

• Text classification, or text categorization, is the process of assigning a text to one or more given classes or categories.

• In this definition, a text can be a news report, a technical paper, an email, a patent, a webpage, a book chapter, or a part of one; it can range from a single character or word to an entire book.

Page 4

• Classification System (分类系统)

• Text classification is mainly concerned with content-based classification.

• Some well-known classification systems include the Thomson Reuters Business Classification (TRBC) and Chinese Library Classification (CLC, 中图分类 ).

• In some domains, the classification system is usually manually crafted.

Politics, sports, economy, entertainment, …

Spam, ham

Sensitive, insensitive

Positive, neutral, negative

Page 5

• Types of Classification

• Two classes (binary), one label

• Multiple classes, one label

• Multiple classes, multiple labels

Page 6

• Supervised Learning Approach (有监督学习)

[Diagram] Labeled training documents are fed to a learning machine (an algorithm), producing a trained machine; an unseen (test/query) document is then given to the trained machine, which outputs a labeled document.
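To make the workflow concrete, here is a minimal sketch of the train-then-classify loop, assuming scikit-learn is available; the tiny spam/ham corpus and the Naive Bayes learner are illustrative choices, not part of the lecture.

```python
# Minimal supervised-learning sketch (assumes scikit-learn; toy data).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_docs = ["cheap pills buy now", "meeting agenda attached", "win a free prize"]
train_labels = ["spam", "ham", "spam"]             # labeled training documents

vectorizer = CountVectorizer()                     # text -> feature vectors
X_train = vectorizer.fit_transform(train_docs)

model = MultinomialNB()                            # the "learning machine"
model.fit(X_train, train_labels)                   # -> trained machine

X_query = vectorizer.transform(["free pills now"]) # unseen (test/query) document
print(model.predict(X_query))                      # -> labeled document, e.g. ['spam']
```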

Page 7

• Mathematical Definition of Text Classification (数学定义)

• Mathematically, text classification is a mapping of unclassified text to the given classes. The mapping can be one-to-one or one-to-many.

• For each pair $(d_i, c_i)$, where $d_i$ is a document in the document set D and $c_i$ is a class in the class set C, a Boolean value is assigned: if the value is True, the document belongs to $c_i$; otherwise it does not. The classification model is to construct a function

$\Phi: D \times C \to \{T, F\}$

Page 8

In-Class Exercise

• Automatically deciding whether an English word is spelled correctly or not is a _____________ classification problem.

A) one-class, one-label

B) one-class, two-label

C) two-class, one-label

D) two-class, two-label

Page 9

文本分类的用途 Applications

• Spam Filtering (垃圾邮件过滤)

• Genre Recognition (文体识别)

Page 10

• Authorship Identification (作者身份识别)

• Webpage Categorization (网页分类)

• Sentiment Analysis (情感分析)

Page 11

文本的表示 Text Representation

• Before being applied to a learning algorithm, a target text must be properly represented.

• Features are used to represent the most important information in the text.

• N features define an N-dimensional vector representation of the text.

Page 12

• Text Features

• Characters

• Applicable to Chinese text (字)

• Words

• For Chinese, after word segmentation is done

• Many text classification applications use only word features, called the BOW (Bag-of-Words) model.

• N-grams

• N-grams are a generalization of words (a word is a unigram)

• The bigrams of 中国人民 are (中国, 国人, 人民)

• Large n-grams cause a data sparseness problem

Page 13

• Text Features

• POS

• Rarely used alone

• Punctuations and Symbols

• Some of them (e.g. "!" and the emoticon ":-)") are effective for special text such as tweets

• Syntactic Patterns

• After syntactic parsing is done

• A pattern (feature) is like “NP VP PP”

• Semantic Patterns

• After semantic analysis (e.g. SRL) is done

• A pattern (feature) is like “Agent Target Patient Instrument”

Page 14

• Vector Space Model (向量空间模型)

• The Vector Space Model (VSM) is based on statistics and vector algebra.

• A document is represented as a vector of features (e.g. words).

• Each dimension corresponds to a feature. If there are n features, a document is an n-dimensional vector.

• If a feature occurs in the document, its value in the vector is non-zero (known as the weight of the term, which can be binary, count or real-valued).

Page 15

• Binary Weights

Doc 1: Computers have brought the world to our fingertips. We will try to understand at a basic level the science – old and new – underlying this new Computational Universe. Our quest takes us on a broad sweep of scientific knowledge and related technologies.

Doc 2: An introduction to computer science in the context of scientific, engineering, and commercial applications. The goal of the course is to teach basic principles and practical issues, while at the same time preparing students to use computers effectively for applications in computer science.

Features:   engineering  knowledge  science
Doc 1:           0            1         1
Doc 2:           1            0         1

The representation of a set of documents as vectors in a common vector space is known as the Vector Space Model.

Page 16

• Term Frequency (TF) Weights

Doc 1: Computers have brought the world to our fingertips. We will try to understand at a basic level the science – old and new – underlying this new Computational Universe. Our quest takes us on a broad sweep of scientific knowledge and related technologies.

Doc 2: An introduction to computer science in the context of scientific, engineering, and commercial applications. The goal of the course is to teach basic principles and practical issues, while at the same time preparing students to use computers effectively for applications in computer science.

Features:   engineering  knowledge  science
Doc 1:           0            1         1
Doc 2:           1            0         2

The representation of a set of documents as vectors in a common vector space is known as the Vector Space Model.
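As a quick illustration of how such binary and TF vectors can be computed, here is a plain-Python sketch; doc1 and doc2 below are abridged stand-ins for the full Doc 1 and Doc 2 texts, shortened so that the counts of the three features still match the tables above.

```python
# Sketch of binary vs. term-frequency (TF) weighting over three features.
import re

features = ["engineering", "knowledge", "science"]
doc1 = "... the science ... scientific knowledge and related technologies."
doc2 = "... computer science ... engineering ... applications in computer science."

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def tf_vector(text):
    tokens = tokenize(text)
    return [tokens.count(f) for f in features]           # raw term frequencies

def binary_vector(text):
    return [1 if c > 0 else 0 for c in tf_vector(text)]  # presence / absence

print(binary_vector(doc1), tf_vector(doc1))  # [0, 1, 1]  [0, 1, 1]
print(binary_vector(doc2), tf_vector(doc2))  # [1, 0, 1]  [1, 0, 2]
```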

Page 17

• Term Weighting Schemes

• The raw tf is usually normalized by some variable related to document length to prevent a bias towards longer documents.

• A usual way of normalization is Euclidean Normalization.

• If d = (d1, d2, …, dn) is the vector representation of a document in an n-dimensional vector space, the Euclidean length of d is defined to be $|d| = \sqrt{\sum_{i=1}^{n} d_i^2}$.

• The normalized tf values are then obtained by dividing each raw tf value by $|d|$.

tf values:
               Doc 1   Doc 2   Doc 3
engineering      0       1       2
knowledge        1       0       0
science          1       2       4
Length          √2      √5     √20

Euclidean normalized tf values:
               Doc 1   Doc 2   Doc 3
engineering      0     0.447   0.447
knowledge      0.707     0       0
science        0.707   0.894   0.894
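A minimal NumPy sketch of the Euclidean normalization above, using the tf matrix from the first table:

```python
# Euclidean normalization of raw tf vectors (rows = documents).
import numpy as np

tf = np.array([[0, 1, 1],    # Doc 1: engineering, knowledge, science
               [1, 0, 2],    # Doc 2
               [2, 0, 4]])   # Doc 3

lengths = np.linalg.norm(tf, axis=1)   # Euclidean lengths: sqrt(2), sqrt(5), sqrt(20)
normalized = tf / lengths[:, None]     # divide each document vector by its length
print(normalized.round(3))
# [[0.    0.707 0.707]
#  [0.447 0.    0.894]
#  [0.447 0.    0.894]]
```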

Page 18

• Term Weighting Schemes

• The inverse document frequency is a measure of the general importance of a term t in the document collection.

• The idf weight of term t is defined as $\mathrm{idf}_t = \lg \frac{N}{\mathrm{df}_t}$,

where N is the total number of documents in the collection and the document frequency df_t is the number of documents in the collection that contain t.

• The tf.idf weight of a term is the product of its tf weight and its idf weight. It is one of the best known weighting schemes and used widely in NLP applications.
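A small sketch of the idf and tf.idf computations; lg is taken to be the base-10 logarithm, which is consistent with the idf values used in the exercise on the next slide.

```python
# idf and tf.idf weights (lg = log base 10).
import math

def idf(N, df_t):
    """Inverse document frequency of a term occurring in df_t of N documents."""
    return math.log10(N / df_t)

def tf_idf(tf_t, N, df_t):
    """tf.idf weight: the tf weight times the idf weight."""
    return tf_t * idf(N, df_t)

# A term occurring twice in a document and in 1 of 3 documents overall:
print(tf_idf(2, N=3, df_t=1))   # 2 * lg(3) ≈ 0.954
```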

Page 19

In-Class Exercise

• The following table lists the TF of 3 documents as well as the IDF for the 3 words. Compute the vectors for the 3 documents using the tf.idf weighting scheme.

Features:  engineering  knowledge  science
IDF:          0.477       0.477       0
Doc 1:          0            1         1
Doc 2:          1            0         2
Doc 3:          0            0         2

Page 20

文本特征选择 Feature Selection

• Motivation

[Diagram] The training data form an m × n feature matrix X = {xij}, where each row xi is a document represented by n features; y = {yj} is the vector of class labels and w a weight vector.

n is usually large, so we need to select only a subset of all the features.

Page 21

• Information Gain (IG, 信息增益 )

• For feature t, IG measures how much information about the classes is gained by knowing whether a document contains t:

$IG(t) = -\sum_{i=1}^{m} P(c_i)\lg P(c_i) + P(t)\sum_{i=1}^{m} P(c_i|t)\lg P(c_i|t) + P(\bar{t})\sum_{i=1}^{m} P(c_i|\bar{t})\lg P(c_i|\bar{t})$

P(ci): probability of documents of class ci
P(t): probability of documents with feature t
P(~t): probability of documents without feature t
P(ci|t): probability of documents of class ci given that they have feature t
P(ci|~t): probability of documents of class ci given that they do not have feature t
m: number of classes

Page 22

• Information Gain

• The probabilities are estimated using MLE (Maximum Likelihood Estimation, 最大似然估计 ); e.g., P(t) is estimated as the fraction of training documents that contain t, and P(ci|t) as the fraction of those documents that belong to class ci (see the sketch below).

• One advantage of IG is that it considers the contribution of a feature not occurring in the text.

• IG performs poorly if the class distribution and feature distribution are very unbalanced.
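The sketch below shows how IG(t) can be computed from per-class document counts, with the probabilities estimated by MLE as just described; the counts in the last line are made up for illustration.

```python
# Information gain of a feature t from per-class document counts:
# docs_with_t[i]    = number of documents of class c_i that contain t
# docs_without_t[i] = number of documents of class c_i that do not contain t
import math

def lg(x):
    return math.log10(x) if x > 0 else 0.0   # so that 0 * lg(0) contributes 0

def info_gain(docs_with_t, docs_without_t):
    n_t, n_not = sum(docs_with_t), sum(docs_without_t)
    N = n_t + n_not
    p_t, p_not = n_t / N, n_not / N
    ig = 0.0
    for a, b in zip(docs_with_t, docs_without_t):
        p_c = (a + b) / N                     # P(c_i)
        ig -= p_c * lg(p_c)                   # -sum P(c_i) lg P(c_i)
        if n_t:
            ig += p_t * (a / n_t) * lg(a / n_t)        # + P(t) P(c_i|t) lg P(c_i|t)
        if n_not:
            ig += p_not * (b / n_not) * lg(b / n_not)  # + P(~t) P(c_i|~t) lg P(c_i|~t)
    return ig

print(info_gain([8, 1], [2, 9]))   # t is strongly associated with the first class
```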

Page 23

• Mutual Information (MI, 互信息 )

• MI measures the correlation between feature t and class c, which is defined as:

$MI(t, c) = \lg \frac{P(t \wedge c)}{P(t)\,P(c)}$   or, estimated from counts,   $MI(t, c) \approx \lg \frac{A \times N}{(A + C)(A + B)}$

where N = A + B + C + D and A, B, C, D come from the contingency table:

          c     ~c
 t        A      B
 ~t       C      D

Page 24

• Mutual Information

• MI is a widely used method in statistical language models.

• For multiple classes, we often take either the maximum or the average MI: $MI_{\max}(t) = \max_{i} MI(t, c_i)$ or $MI_{avg}(t) = \sum_{i=1}^{m} P(c_i)\,MI(t, c_i)$.

• MI is not very effective for low-frequency features.
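A sketch of MI(t, c) computed from the A/B/C/D contingency counts, together with the maximum and the prior-weighted average over classes; the counts and priors below are invented.

```python
# MI(t, c) ≈ lg(A*N / ((A+C)(A+B))) from a per-class contingency table.
import math

def mi(A, B, C, D):
    N = A + B + C + D
    return math.log10((A * N) / ((A + C) * (A + B))) if A else float("-inf")

# One (A, B, C, D) table per class, plus the class priors P(c_i):
tables = [(40, 10, 10, 40), (5, 45, 45, 5)]
priors = [0.5, 0.5]

scores = [mi(*t) for t in tables]
mi_max = max(scores)                                  # MI_max(t)
mi_avg = sum(p * s for p, s in zip(priors, scores))   # MI_avg(t)
print(mi_max, mi_avg)
```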

Page 25

• Chi Square (χ², 卡方统计)

• χ2 measures the correlation between feature t and class c, which is defined as:

$\chi^2(t, c) = \frac{N\,(AD - CB)^2}{(A + C)(B + D)(A + B)(C + D)}$

where N = A + B + C + D and A, B, C, D come from the same contingency table:

          c     ~c
 t        A      B
 ~t       C      D

Page 26

• Chi Square

• For multiple classes, we often take either the maximum or the average χ²: $\chi^2_{\max}(t) = \max_{i} \chi^2(t, c_i)$ or $\chi^2_{avg}(t) = \sum_{i=1}^{m} P(c_i)\,\chi^2(t, c_i)$.

• Unlike MI, χ2 is a normalized statistic.

• Like MI, χ2 is not very effective for low-frequency features.
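The corresponding χ² computation from the same kind of contingency table, again as a minimal sketch with a made-up table:

```python
# Chi-square statistic of feature t for class c from the A/B/C/D table.
def chi2(A, B, C, D):
    N = A + B + C + D
    return N * (A * D - C * B) ** 2 / ((A + C) * (B + D) * (A + B) * (C + D))

print(chi2(40, 10, 10, 40))   # 36.0 for this strongly correlated table
```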

Page 27

• Summary

• Using IG, MI, or χ², we can select the features that score above a threshold (an absolute value), or a given proportion of the features (e.g. 10%); see the sketch after this list.

• Using selected features often results in lower computational cost and similar or even better performance.

• Experiments are needed to decide which measure is the best for a target problem.
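As referenced above, a tiny sketch of the selection step itself: keep the features whose scores (from IG, MI, or χ²) fall in the top proportion; the feature scores below are invented.

```python
# Keep the top `proportion` of features ranked by their selection score.
def select_top(scores, proportion=0.10):
    """scores: dict mapping feature -> score; returns the best-scoring features."""
    k = max(1, int(len(scores) * proportion))
    return sorted(scores, key=scores.get, reverse=True)[:k]

scores = {"经济": 1.8, "球队": 2.3, "的": 0.01, "选举": 1.5, "今天": 0.2,
          "股票": 1.7, "比赛": 2.1, "我们": 0.05, "政策": 1.2, "进球": 1.9}
print(select_top(scores))   # -> ['球队'] with 10 features and proportion 0.10
```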

Page 28

Wrap-Up

• 文本分类概况 (Overview)
  • Definitions
  • Classification Systems
  • Classification Types

• 文本分类的用途 (Applications)

• 文本的表示 (Text Representation)
  • Text Features
  • Vector Space Model

• 文本特征选择 (Feature Selection)
  • Information Gain
  • Mutual Information
  • Chi Square