
Speaker: Shau-Shiang Hung (洪紹祥)

Adviser: Shu-Chen Cheng (鄭淑真)

Date: 99/05/04


Qirui Zhang, Jinghua Tan, Huaying Zhou, Weiye Tao, Kejing He, "Machine Learning Methods for Medical Text Categorization," 2009 Pacific-Asia Conference on Circuits, Communications and Systems (PACCS), pp. 494-497, 2009.

Outline

Introduction

Document indexing

Classification Algorithm

Experiments

Conclusion


Introduction

Text categorization (TC) is the process of automatically assigning one or more predefined category labels to text documents.

The volume of digital medical information is increasing rapidly as networks develop.

Handling and organizing this information effectively is a problem in the field of medical informatics.


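Document indexing

Because classifiers cannot directly interpret documents, the documents must be transformed into a form that classifiers can process.

The vector space model (VSM) is a well-known statistical model. Each document d_j is represented as a vector of term weights:

$$\vec{d}_j = (w_{1j}, w_{2j}, w_{3j}, \ldots, w_{ij}, \ldots, w_{|T|j})$$

Document 1 = (historic site, tourism, service, ...)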


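Document indexing

A. Standard Term Frequency Inverse Document Frequency (TFIDF)

$$\mathrm{tfidf}(t_k, d_j) = \#(t_k, d_j) \cdot \log \frac{|Tr|}{\#_{Tr}(t_k)}$$

In the example, the term frequency is taken relative to the document, t_k / all_j. For the term "historic site" in Document 1:

$$\mathrm{tfidf}(\text{historic site}, \text{Document 1}) = \frac{10}{100} \times \log \frac{1000}{100}$$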
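The TFIDF weight on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function and parameter names are hypothetical, the base-10 logarithm is assumed because the slides do not state the base, and the reading of the example numbers (10 occurrences among 100 terms, 100 of 1000 training documents containing the term) is an interpretation of the slide.

```python
import math

def tfidf(term_count, doc_length, num_train_docs, docs_with_term):
    """TFIDF weight of one term in one document.

    term_count / doc_length is the relative term frequency (t_k / all_j).
    log(num_train_docs / docs_with_term) is the inverse document frequency.
    The base-10 logarithm is an assumption; the slides do not give the base.
    """
    tf = term_count / doc_length
    idf = math.log10(num_train_docs / docs_with_term)
    return tf * idf

# Slide example, as read here: the term occurs 10 times among the 100 terms
# of Document 1, and 100 of the 1000 training documents contain it.
weight = tfidf(10, 100, 1000, 100)
print(weight)
```

A term that appears in every training document gets an inverse document frequency of log(1) = 0, so its weight vanishes regardless of how often it occurs.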


Document indexing

So that the weights fall in the [0,1] interval and the documents are represented by vectors of equal length, the weights resulting from tfidf are often normalized by cosine normalization.

$$w_{kj} = \frac{\mathrm{tfidf}(t_k, d_j)}{\sqrt{\sum_{s=1}^{|T|} \left(\mathrm{tfidf}(t_s, d_j)\right)^2}}$$

The sum under the square root adds up the squared TFIDF values of all keywords in Document 1.
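Cosine normalization divides each weight by the Euclidean length of the document's weight vector. A minimal sketch follows; the function name, the sample weights and the handling of an all-zero vector are assumptions made for illustration.

```python
import math

def cosine_normalize(weights):
    """Divide every weight by the Euclidean norm of the whole vector."""
    norm = math.sqrt(sum(w * w for w in weights))
    if norm == 0.0:
        # All-zero document: nothing to scale (assumed behaviour).
        return list(weights)
    return [w / norm for w in weights]

# Hypothetical tfidf values of three terms in one document.
doc_weights = [3.0, 4.0, 0.0]
normalized = cosine_normalize(doc_weights)
print(normalized)
```

After normalization every document vector has length 1, so non-negative weights fall in [0,1] and long documents no longer dominate short ones.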


Document indexing

B. Improvement

Term Frequency, Inverted Document Frequency and Inverted Entropy (TFIDFIE)

In text classification, the importance of a term depends not only on its term frequency but also on its contribution to classification. For example:

Term 1 (guest room) and Term 2 (scenery) have the same weight.


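Document indexing

To bring out the relation between terms and categories, we also calculate, when weighting a term, how the documents containing it are distributed over the categories. This distribution can be measured by the information entropy H.

$$H(t_k, d_j) = -\sum_{l=1}^{|C|} \frac{DF_{kl}}{\#_{Tr}(t_k)} \log \frac{DF_{kl}}{\#_{Tr}(t_k)}$$

Example for the term "guest room" in Document 1, over categories 1 to 4:

$$H(\text{guest room}, \text{Document 1}) = -\left(\frac{10}{100}\log\frac{10}{100} + \frac{20}{100}\log\frac{20}{100} + \frac{25}{100}\log\frac{25}{100} + \frac{15}{100}\log\frac{15}{100}\right)$$

$$\mathrm{tfidfie}(t_k, d_j) = \frac{\#(t_k, d_j) \cdot \log \dfrac{|Tr|}{\#_{Tr}(t_k)}}{H(t_k, d_j)}$$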


$$w_{kj} = \frac{\mathrm{tfidfie}(t_k, d_j)}{\sqrt{\sum_{s=1}^{|T|} \left(\mathrm{tfidf}(t_s, d_j)\right)^2}}$$
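The entropy-based weighting can be illustrated with a short Python sketch. Several points are assumptions rather than details confirmed by the slides: TFIDFIE is read as TFIDF divided by the entropy H (following the name "inverted entropy"), the logarithm is base 10, and the `eps` guard for H = 0 as well as all function names are hypothetical.

```python
import math

def category_entropy(df_per_category, docs_with_term):
    """H = -sum_l p_l * log(p_l), with p_l = DF_kl / #Tr(t_k)."""
    h = 0.0
    for df in df_per_category:
        if df > 0:  # a category without the term contributes nothing
            p = df / docs_with_term
            h -= p * math.log10(p)
    return h

def tfidfie(tfidf_weight, entropy, eps=1e-9):
    """TFIDF divided by the entropy; eps is a hypothetical guard for H = 0."""
    return tfidf_weight / max(entropy, eps)

# The four category counts from the slide example.
spread = category_entropy([10, 20, 25, 15], 100)
# Hypothetical term concentrated in a single category.
focused = category_entropy([97, 1, 1, 1], 100)

print(spread, focused)
print(tfidfie(0.1, spread), tfidfie(0.1, focused))
```

With the same TFIDF value, the term concentrated in one category has the lower entropy and therefore receives the larger TFIDFIE weight, which is the effect the slide is after.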

Classification Algorithm

A. K-Nearest Neighbor (KNN)

B. Support Vector Machine (SVM)

C. Naïve Bayes (NB)

D. Clonal Selection Algorithm Based on Antibody Density (CSABAD)

Because the nature of an immune algorithm is to distinguish between self and non-self, it can be used in text categorization.


Classification Algorithm

• CSABAD

In text categorization:

Antigen: a training text.

B cell: an individual classifier.

Antibody: the affinity between the individual and the training documents.

The final classifier is composed of many memory B cells.

The cosine of two vectors is used to measure the affinity f(x_i, d_j) between B cell x_i and antigen d_j.

The affinity f(x_i) of B cell x_i with N antigens is defined as the average of all N affinities.

The antibody selection probability P(x_i) over the M B cells is defined as follows:

$$P(x_i) = \frac{\sum_{j=1}^{M} |f(x_i) - f(x_j)|}{\sum_{i=1}^{M} \sum_{j=1}^{M} |f(x_i) - f(x_j)|}$$
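The affinity and selection-probability definitions on this slide can be sketched as follows. This is not the CSABAD algorithm itself, only the two formulas; the function names, the sample vectors and the uniform fallback when all affinities are equal are assumptions.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def mean_affinity(b_cell, antigens):
    """f(x_i): average cosine affinity of one B cell with all N antigens."""
    return sum(cosine(b_cell, d) for d in antigens) / len(antigens)

def selection_probabilities(affinities):
    """P(x_i) = sum_j |f(x_i) - f(x_j)| / sum_i sum_j |f(x_i) - f(x_j)|."""
    diffs = [sum(abs(fi - fj) for fj in affinities) for fi in affinities]
    total = sum(diffs)
    if total == 0.0:
        # All affinities equal: fall back to a uniform choice (assumed).
        return [1.0 / len(affinities)] * len(affinities)
    return [d / total for d in diffs]

# Hypothetical antigens (training vectors) and B cells (candidate classifiers).
antigens = [[1.0, 0.0], [0.8, 0.6]]
b_cells = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]

f = [mean_affinity(x, antigens) for x in b_cells]
p = selection_probabilities(f)
print(f)
print(p)
```

A B cell whose affinity differs most from the rest of the population gets the highest selection probability, so the formula favors individuals in sparsely populated regions of affinity.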


Experiments

A. Data collection

OHSUMED is a bibliographic document collection.

The experiments use OHSCAL, a single-label subset of OHSUMED that consists of 11,162 documents in 10 categories.


Experiments

B. Experiment results and analysis

The OHSCAL dataset was randomly divided into a training set and a test set in the proportion 2:1.

To reduce the effect of chance on the experimental results, we ran ten independent experiments on OHSCAL.
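The evaluation protocol (a random 2:1 split, repeated ten times) might be set up along these lines. The integer document ids stand in for the real OHSCAL documents, and seeding each run with its index is an assumption added for reproducibility; the slides do not say how the runs were randomized.

```python
import random

def split_two_to_one(documents, rng):
    """Shuffle a copy of the documents and cut it into train and test at 2:1."""
    shuffled = list(documents)
    rng.shuffle(shuffled)
    cut = (2 * len(shuffled)) // 3
    return shuffled[:cut], shuffled[cut:]

# Stand-ins for the 11162 OHSCAL documents.
documents = list(range(11162))

runs = []
for run in range(10):
    rng = random.Random(run)  # one seed per run (assumed, for reproducibility)
    train, test = split_two_to_one(documents, rng)
    runs.append((train, test))

print(len(runs), len(runs[0][0]), len(runs[0][1]))
```

Each run trains on its own training part and is scored on the matching test part, so the ten results come from ten different random partitions of the same collection.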


Conclusion

In this paper, we propose an improved approach called TFIDFIE. It takes into account how the training documents in which a term occurs are distributed.

The experiments show that SVM and CSABAD significantly outperform kNN and Naive Bayes, and that TFIDFIE is more effective than TFIDF.

Considering the characteristics of professional medical terms, we will study feature selection for medical text classification in future work.
