101035 中文信息处理 Chinese NLP Lecture 13



Page 1

101035 中文信息处理

Chinese NLP

Lecture 13

Page 2

应用——文本分类(1) Text Classification (1)

• 文本分类概况( Overview)

• 文本分类的用途( Applications)

• 文本的表示( Text representation )

• 文本特征选择( Feature selection )

Page 3

文本分类概况 Overview

• Definition

• Text classification, or text categorization, is the process of assigning a text to one or more given classes or categories.

• In this definition, a text can be a news report, a technical paper, an email, a patent, a webpage, a book chapter, or a part of one; it can range from a single character or word to an entire book.

Page 4

• Classification System (分类系统)

• Text classification is mainly concerned with content-based classification.

• Some well-known classification systems include the Thomson Reuters Business Classification (TRBC) and Chinese Library Classification (CLC, 中图分类 ).

• In some domains, the classification system is usually manually crafted.

Politics, sports, economy, entertainment, …

Spam, ham

Sensitive, insensitive

Positive, neutral, negative

Page 5

• Types of Classification

• Two classes (binary), one label

• Multiple classes, one label

• Multiple classes, multiple labels

Page 6

• Supervised Learning Approach (有监督学习)

[Diagram] Labeled training documents are fed to a learning machine (an algorithm), producing a trained machine; an unseen (test/query) document is then given to the trained machine, which outputs a labeled document.
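To make the workflow concrete, here is a minimal sketch of the train-then-classify loop, assuming scikit-learn is available; the tiny spam/ham corpus and the Naive Bayes learner are illustrative choices, not part of the lecture.

```python
# Minimal supervised-learning sketch (assumes scikit-learn; toy data).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_docs = ["cheap pills buy now", "meeting agenda attached", "win a free prize"]
train_labels = ["spam", "ham", "spam"]             # labeled training documents

vectorizer = CountVectorizer()                     # text -> feature vectors
X_train = vectorizer.fit_transform(train_docs)

model = MultinomialNB()                            # the "learning machine"
model.fit(X_train, train_labels)                   # -> trained machine

X_query = vectorizer.transform(["free pills now"]) # unseen (test/query) document
print(model.predict(X_query))                      # -> labeled document, e.g. ['spam']
```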

Page 7

• Mathematical Definition of Text Classification (数学定义)

• Mathematically, text classification is a mapping of unclassified text to the given classes. The mapping can be one-to-one or one-to-many.

• For each pair $(d_i, c_i)$, where $d_i$ is a document in the document set D and $c_i$ is a class in the class set C, a Boolean value is assigned: if the value is True, the document belongs to $c_i$; otherwise it does not. The classification model is to construct a function

$\Phi: D \times C \to \{T, F\}$

Page 8

In-Class Exercise

• Automatically deciding whether an English word is spelled correctly or not is a _____________ classification problem.

A) one-class, one-label

B) one-class, two-label

C) two-class, one-label

D) two-class, two-label

Page 9

文本分类的用途 Applications

• Spam Filtering (垃圾邮件过滤)

• Genre Recognition (文体识别)

Page 10

• Authorship Identification (作者身份识别)

• Webpage Categorization (网页分类)

• Sentiment Analysis (情感分析)

Page 11

文本的表示 Text Representation

• Before being applied to a learning algorithm, a target text must be properly represented.

• Features are used to represent the most important information in the text.

• N features define an N-dimensional vector representation of the text.

Page 12

• Text Features

• Characters

• Applicable to Chinese text (字)

• Words

• For Chinese, after word segmentation is done

• Many text classification applications use only word features, called the BOW (Bag-of-Words) model.

• N-grams

• N-grams are a generalization of words (a word is a unigram)

• The bigrams of 中国人民 are (中国, 国人, 人民)

• Large n-grams cause a data sparseness problem

Page 13

• Text Features

• POS

• Rarely used alone

• Punctuations and Symbols

• Some of them (e.g. "!" and the emoticon ":-)") are effective for special text such as tweets

• Syntactic Patterns

• After syntactic parsing is done

• A pattern (feature) is like “NP VP PP”

• Semantic Patterns

• After semantic analysis (e.g. SRL) is done

• A pattern (feature) is like “Agent Target Patient Instrument”

Page 14

• Vector Space Model (向量空间模型)

• The Vector Space Model (VSM) is based on statistics and vector algebra.

• A document is represented as a vector of features (e.g. words).

• Each dimension corresponds to a feature. If there are n features, a document is an n-dimensional vector.

• If a feature occurs in the document, its value in the vector is non-zero (known as the weight of the term, which can be binary, count or real-valued).

Page 15

• Binary Weights

Doc 1: Computers have brought the world to our fingertips. We will try to understand at a basic level the science – old and new – underlying this new Computational Universe. Our quest takes us on a broad sweep of scientific knowledge and related technologies.

Doc 2: An introduction to computer science in the context of scientific, engineering, and commercial applications. The goal of the course is to teach basic principles and practical issues, while at the same time preparing students to use computers effectively for applications in computer science.

Features:   engineering  knowledge  science
Doc 1:           0            1         1
Doc 2:           1            0         1

The representation of a set of documents as vectors in a common vector space is known as the Vector Space Model.

Page 16

• Term Frequency (TF) Weights

Doc 1: Computers have brought the world to our fingertips. We will try to understand at a basic level the science – old and new – underlying this new Computational Universe. Our quest takes us on a broad sweep of scientific knowledge and related technologies.

Doc 2: An introduction to computer science in the context of scientific, engineering, and commercial applications. The goal of the course is to teach basic principles and practical issues, while at the same time preparing students to use computers effectively for applications in computer science.

Features:   engineering  knowledge  science
Doc 1:           0            1         1
Doc 2:           1            0         2

The representation of a set of documents as vectors in a common vector space is known as the Vector Space Model.
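As a quick illustration of how such binary and TF vectors can be computed, here is a plain-Python sketch; doc1 and doc2 below are abridged stand-ins for the full Doc 1 and Doc 2 texts, shortened so that the counts of the three features still match the tables above.

```python
# Sketch of binary vs. term-frequency (TF) weighting over three features.
import re

features = ["engineering", "knowledge", "science"]
doc1 = "... the science ... scientific knowledge and related technologies."
doc2 = "... computer science ... engineering ... applications in computer science."

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def tf_vector(text):
    tokens = tokenize(text)
    return [tokens.count(f) for f in features]           # raw term frequencies

def binary_vector(text):
    return [1 if c > 0 else 0 for c in tf_vector(text)]  # presence / absence

print(binary_vector(doc1), tf_vector(doc1))  # [0, 1, 1]  [0, 1, 1]
print(binary_vector(doc2), tf_vector(doc2))  # [1, 0, 1]  [1, 0, 2]
```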

Page 17

• Term Weighting Schemes

• The raw tf is usually normalized by some variable related to document length to prevent a bias towards longer documents.

• A usual way of normalization is Euclidean Normalization.

• If d = (d1, d2, …, dn) is the vector representation of a document in an n-dimensional vector space, the Euclidean length of d is defined to be $|d| = \sqrt{\sum_{i=1}^{n} d_i^2}$.

• The normalized tf values are then obtained by dividing each raw tf value by $|d|$.

tf values:
               Doc 1   Doc 2   Doc 3
engineering      0       1       2
knowledge        1       0       0
science          1       2       4
Length          √2      √5     √20

Euclidean normalized tf values:
               Doc 1   Doc 2   Doc 3
engineering      0     0.447   0.447
knowledge      0.707     0       0
science        0.707   0.894   0.894
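A minimal NumPy sketch of the Euclidean normalization above, using the tf matrix from the first table:

```python
# Euclidean normalization of raw tf vectors (rows = documents).
import numpy as np

tf = np.array([[0, 1, 1],    # Doc 1: engineering, knowledge, science
               [1, 0, 2],    # Doc 2
               [2, 0, 4]])   # Doc 3

lengths = np.linalg.norm(tf, axis=1)   # Euclidean lengths: sqrt(2), sqrt(5), sqrt(20)
normalized = tf / lengths[:, None]     # divide each document vector by its length
print(normalized.round(3))
# [[0.    0.707 0.707]
#  [0.447 0.    0.894]
#  [0.447 0.    0.894]]
```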

Page 18

• Term Weighting Schemes

• The inverse document frequency is a measure of the general importance of a term t in the document collection.

• The idf weight of term t is defined as $\mathrm{idf}_t = \lg \frac{N}{\mathrm{df}_t}$,

where N is the total number of documents in the collection and the document frequency df_t is the number of documents in the collection that contain t.

• The tf.idf weight of a term is the product of its tf weight and its idf weight. It is one of the best known weighting schemes and used widely in NLP applications.
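A small sketch of the idf and tf.idf computations; lg is taken to be the base-10 logarithm, which is consistent with the idf values used in the exercise on the next slide.

```python
# idf and tf.idf weights (lg = log base 10).
import math

def idf(N, df_t):
    """Inverse document frequency of a term occurring in df_t of N documents."""
    return math.log10(N / df_t)

def tf_idf(tf_t, N, df_t):
    """tf.idf weight: the tf weight times the idf weight."""
    return tf_t * idf(N, df_t)

# A term occurring twice in a document and in 1 of 3 documents overall:
print(tf_idf(2, N=3, df_t=1))   # 2 * lg(3) ≈ 0.954
```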

Page 19

In-Class Exercise

• The following table lists the TF of 3 documents as well as the IDF for the 3 words. Compute the vectors for the 3 documents using the tf.idf weighting scheme.

Features:  engineering  knowledge  science
IDF:          0.477       0.477       0
Doc 1:          0            1         1
Doc 2:          1            0         2
Doc 3:          0            0         2

Page 20

文本特征选择 Feature Selection

• Motivation

[Diagram] The training data form an m × n feature matrix X = {xij}, where each row xi is a document represented by n features; y = {yj} is the vector of class labels and w a weight vector.

n is usually large, so we need to select only a subset of all the features.

Page 21

• Information Gain (IG, 信息增益 )

• For feature t, IG measures how much information about the classes is gained by knowing whether a document contains t:

$IG(t) = -\sum_{i=1}^{m} P(c_i)\lg P(c_i) + P(t)\sum_{i=1}^{m} P(c_i|t)\lg P(c_i|t) + P(\bar{t})\sum_{i=1}^{m} P(c_i|\bar{t})\lg P(c_i|\bar{t})$

P(ci): probability of documents of class ci
P(t): probability of documents with feature t
P(~t): probability of documents without feature t
P(ci|t): probability of documents of class ci given that they have feature t
P(ci|~t): probability of documents of class ci given that they do not have feature t
m: number of classes

Page 22

• Information Gain

• The probabilities are estimated using MLE (Maximum Likelihood Estimation, 最大似然估计 ); e.g., P(t) is estimated as the fraction of training documents that contain t, and P(ci|t) as the fraction of those documents that belong to class ci (see the sketch below).

• One advantage of IG is that it considers the contribution of a feature not occurring in the text.

• IG performs poorly if the class distribution and feature distribution are very unbalanced.
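The sketch below shows how IG(t) can be computed from per-class document counts, with the probabilities estimated by MLE as just described; the counts in the last line are made up for illustration.

```python
# Information gain of a feature t from per-class document counts:
# docs_with_t[i]    = number of documents of class c_i that contain t
# docs_without_t[i] = number of documents of class c_i that do not contain t
import math

def lg(x):
    return math.log10(x) if x > 0 else 0.0   # so that 0 * lg(0) contributes 0

def info_gain(docs_with_t, docs_without_t):
    n_t, n_not = sum(docs_with_t), sum(docs_without_t)
    N = n_t + n_not
    p_t, p_not = n_t / N, n_not / N
    ig = 0.0
    for a, b in zip(docs_with_t, docs_without_t):
        p_c = (a + b) / N                     # P(c_i)
        ig -= p_c * lg(p_c)                   # -sum P(c_i) lg P(c_i)
        if n_t:
            ig += p_t * (a / n_t) * lg(a / n_t)        # + P(t) P(c_i|t) lg P(c_i|t)
        if n_not:
            ig += p_not * (b / n_not) * lg(b / n_not)  # + P(~t) P(c_i|~t) lg P(c_i|~t)
    return ig

print(info_gain([8, 1], [2, 9]))   # t is strongly associated with the first class
```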

Page 23

• Mutual Information (MI, 互信息 )

• MI measures the correlation between feature t and class c, which is defined as:

$MI(t, c) = \lg \frac{P(t \wedge c)}{P(t)\,P(c)}$   or, estimated from counts,   $MI(t, c) \approx \lg \frac{A \times N}{(A + C)(A + B)}$

where N = A + B + C + D and A, B, C, D come from the contingency table:

          c     ~c
 t        A      B
 ~t       C      D

Page 24

• Mutual Information

• MI is a widely used method in statistical language models.

• For multiple classes, we often take either the maximum or the average MI: $MI_{\max}(t) = \max_{i} MI(t, c_i)$ or $MI_{avg}(t) = \sum_{i=1}^{m} P(c_i)\,MI(t, c_i)$.

• MI is not very effective for low-frequency features.
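A sketch of MI(t, c) computed from the A/B/C/D contingency counts, together with the maximum and the prior-weighted average over classes; the counts and priors below are invented.

```python
# MI(t, c) ≈ lg(A*N / ((A+C)(A+B))) from a per-class contingency table.
import math

def mi(A, B, C, D):
    N = A + B + C + D
    return math.log10((A * N) / ((A + C) * (A + B))) if A else float("-inf")

# One (A, B, C, D) table per class, plus the class priors P(c_i):
tables = [(40, 10, 10, 40), (5, 45, 45, 5)]
priors = [0.5, 0.5]

scores = [mi(*t) for t in tables]
mi_max = max(scores)                                  # MI_max(t)
mi_avg = sum(p * s for p, s in zip(priors, scores))   # MI_avg(t)
print(mi_max, mi_avg)
```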

Page 25

• Chi Square (χ², 卡方统计)

• χ2 measures the correlation between feature t and class c, which is defined as:

$\chi^2(t, c) = \frac{N\,(AD - CB)^2}{(A + C)(B + D)(A + B)(C + D)}$

where N = A + B + C + D and A, B, C, D come from the same contingency table:

          c     ~c
 t        A      B
 ~t       C      D

Page 26

• Chi Square

• For multiple classes, we often take either the maximum or the average χ²: $\chi^2_{\max}(t) = \max_{i} \chi^2(t, c_i)$ or $\chi^2_{avg}(t) = \sum_{i=1}^{m} P(c_i)\,\chi^2(t, c_i)$.

• Unlike MI, χ2 is a normalized statistic.

• Like MI, χ2 is not very effective for low-frequency features.
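The corresponding χ² computation from the same kind of contingency table, again as a minimal sketch with a made-up table:

```python
# Chi-square statistic of feature t for class c from the A/B/C/D table.
def chi2(A, B, C, D):
    N = A + B + C + D
    return N * (A * D - C * B) ** 2 / ((A + C) * (B + D) * (A + B) * (C + D))

print(chi2(40, 10, 10, 40))   # 36.0 for this strongly correlated table
```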

Page 27

• Summary

• Using IG, MI, or χ², we can select the features that score above a threshold (an absolute value), or a given proportion of the features (e.g. 10%); see the sketch after this list.

• Using selected features often results in lower computational cost and similar or even better performance.

• Experiments are needed to decide which measure is the best for a target problem.
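As referenced above, a tiny sketch of the selection step itself: keep the features whose scores (from IG, MI, or χ²) fall in the top proportion; the feature scores below are invented.

```python
# Keep the top `proportion` of features ranked by their selection score.
def select_top(scores, proportion=0.10):
    """scores: dict mapping feature -> score; returns the best-scoring features."""
    k = max(1, int(len(scores) * proportion))
    return sorted(scores, key=scores.get, reverse=True)[:k]

scores = {"经济": 1.8, "球队": 2.3, "的": 0.01, "选举": 1.5, "今天": 0.2,
          "股票": 1.7, "比赛": 2.1, "我们": 0.05, "政策": 1.2, "进球": 1.9}
print(select_top(scores))   # -> ['球队'] with 10 features and proportion 0.10
```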

Page 28

Wrap-Up

• 文本分类概况 (Overview)
  • Definitions
  • Classification Systems
  • Classification Types

• 文本分类的用途 (Applications)

• 文本的表示 (Text Representation)
  • Text Features
  • Vector Space Model

• 文本特征选择 (Feature Selection)
  • Information Gain
  • Mutual Information
  • Chi Square