DATA MINING & MACHINE LEARNING FINAL PROJECT

Group 2 R95922027 李庭閣 R95922034 孔垂玖 R95922081 許守傑 R95942129 鄭力維


Page 1:

DATA MINING & MACHINE LEARNING FINAL PROJECT

Group 2
R95922027 李庭閣
R95922034 孔垂玖
R95922081 許守傑
R95942129 鄭力維

Page 2:

Outline

Experiment setting
Feature extraction
Model training
Hybrid-Model
Conclusion
Reference

Page 3:

Experiment setting

Selected online corpus: Enron

Preprocessing: removing HTML tags and factoring in important headers.

Six folders, enron1 to enron6, contain 13,496 spam mails and 15,045 ham mails in total.

Page 4:

Outline

Experiment setting
Feature extraction
Model training
Hybrid-Model
Conclusion
Reference

Page 5:

Feature Extraction

1. Transmitted time of the mail
2. Number of receivers
3. Existence of an attachment
4. Existence of images in the mail
5. Existence of cited URLs in the mail
6. Symbols in the mail title
7. Mail body

Page 6:

Transmitted Time of the Mail & Number of Receivers

Spam: non-uniform distribution over transmitted time

Spam: only a single receiver

Page 7:

Probability of being Spam for Transmitted Time & Receiver Size

P(ham | date = h) = P(h | ham) / (P(h | spam) + P(h | ham))

P(ham | # of receivers = r) = P(r | ham) / (P(r | spam) + P(r | ham))
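The ratio on this slide can be estimated directly from per-mail label counts. A minimal sketch (the function name and toy data are ours; equal class priors are assumed, as in the formula above):

```python
from collections import Counter

def p_ham_given(v, values, labels):
    """P(ham | feature = v) with equal class priors.

    `values` holds one feature value per training mail (e.g. the
    transmitted hour, or the receiver count); `labels` holds its tag.
    """
    spam = Counter(x for x, y in zip(values, labels) if y == "spam")
    ham = Counter(x for x, y in zip(values, labels) if y == "ham")
    p_v_spam = spam[v] / max(sum(spam.values()), 1)  # P(v | spam)
    p_v_ham = ham[v] / max(sum(ham.values()), 1)     # P(v | ham)
    denom = p_v_spam + p_v_ham
    return p_v_ham / denom if denom else 0.5

# toy data: transmitted hour of six mails
hours = [3, 3, 3, 14, 14, 9]
tags = ["spam", "spam", "ham", "ham", "ham", "spam"]
print(p_ham_given(14, hours, tags))  # → 1.0 (hour 14 never seen in spam)
```

The same function covers the receiver-size feature by passing receiver counts instead of hours.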

Page 8:

Attachment, Images, and URL

       Attachment   Image     URL
Spam   0.0307%      0.6816%   30.779%
Ham    7.3712%      0%        7.0521%

P(Spam | mail with attachment) = 0.0307 / (0.0307 + 7.3712) ≈ 0.004

P(Spam | mail containing images) ≈ 0.999

P(Spam | mail citing URLs) = 30.8 / (30.8 + 7.1) ≈ 0.8

Page 9:

Symbols in Mail Titles

Marks                 Probability of being spam   Feature showing rate
~ ^ | * % [ ] ! ? =   0.911                       28% in spam
\ / ; &               0.182                       16% in ham

Title absence: spam senders add titles now.

Arabic numerals: almost equal probability (dates, IDs).

Non-alphanumeric characters & punctuation marks: the first group of marks appears more often in spam; the second group appears more often in ham.

Page 10:

Mail-body

Build the internal structure of words Use a good NLP tool called Treetagger

to help us do word stemming Given the stemmed words appeared

in each mail, we build a sparse format vector to represent the “semantic” of a mail
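The sparse-format vector described above can be sketched as a word-id-to-count mapping that stores only nonzero entries (the vocabulary and function name are ours; in the project the input words come from TreeTagger's stemmer):

```python
from collections import Counter

def sparse_vector(stemmed_words, vocabulary):
    """Represent a mail as {word_id: count}, storing only nonzero entries."""
    counts = Counter(stemmed_words)
    return {vocabulary[w]: c for w, c in counts.items() if w in vocabulary}

vocab = {"buy": 0, "cheap": 1, "meeting": 2}
print(sparse_vector(["buy", "cheap", "cheap", "offer"], vocab))  # → {0: 1, 1: 2}
```

Out-of-vocabulary words ("offer" here) are simply dropped, which keeps the vectors aligned with a fixed word-document matrix.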

Page 11:

Outline

Experiment setting
Feature extraction
Model training
Hybrid-Model
Conclusion
Reference

Page 12:

Naïve Bayes

Given a bag of words (x1, x2, x3, …, xn), Naïve Bayes is powerful for document classification:

log P(x_j | C_i) = log c(x_j, C_i) − log c(C_i)

where c(x_j, C_i) is the count of word x_j in class C_i and c(C_i) is the total word count of class C_i.
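A count-based Naïve Bayes classifier following this formula can be sketched as below (function names and toy data are ours; add-one smoothing is our addition to handle unseen words, and class priors are omitted under the assumption of a roughly balanced corpus):

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """Collect per-class word counts c(x, C) and total counts c(C)."""
    word_counts = defaultdict(Counter)
    totals = Counter()
    for words, y in zip(docs, labels):
        word_counts[y].update(words)
        totals[y] += len(words)
    return word_counts, totals

def classify(words, word_counts, totals):
    """argmax over classes of sum_j log P(x_j | C), with add-one smoothing."""
    vocab = {w for c in word_counts.values() for w in c}
    best, best_score = None, -math.inf
    for y in word_counts:
        score = sum(
            math.log((word_counts[y][w] + 1) / (totals[y] + len(vocab)))
            for w in words
        )
        if score > best_score:
            best, best_score = y, score
    return best

docs = [["buy", "cheap", "now"], ["meeting", "at", "noon"], ["cheap", "pills"]]
labels = ["spam", "ham", "spam"]
wc, tot = train_nb(docs, labels)
print(classify(["cheap", "now"], wc, tot))  # → spam
```

Working in log space, as on the slide, avoids floating-point underflow when a mail contains many words.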

Page 13:

Vector Space Model

Create a word-document (mail) matrix by SRILM.

For every mail (column) pair, a similarity value can be calculated.

[Figure: an M×N word-document matrix with rows w_1 … w_M (words), columns d_1 … d_N (mails), and entries w_ij]

similarity(d_i, d_j) = (d_i^T d_j) / (||d_i|| * ||d_j||)
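The cosine similarity above, applied to two mail columns stored as sparse vectors, can be sketched as (function name and toy vectors are ours):

```python
import math

def cosine_similarity(d_i, d_j):
    """similarity(d_i, d_j) = d_i^T d_j / (||d_i|| * ||d_j||) on sparse vectors."""
    dot = sum(v * d_j.get(k, 0.0) for k, v in d_i.items())
    norm_i = math.sqrt(sum(v * v for v in d_i.values()))
    norm_j = math.sqrt(sum(v * v for v in d_j.values()))
    return dot / (norm_i * norm_j) if norm_i and norm_j else 0.0

a = {0: 1.0, 1: 2.0}   # mail column d_i as {word_id: weight}
b = {1: 2.0, 2: 1.0}
print(cosine_similarity(a, b))
```

Iterating over the smaller vector's nonzero entries keeps the cost proportional to the number of distinct words in a mail rather than the vocabulary size M.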

Page 14:

KNN (Vector Space Model)

With K = 1, the KNN classification model shows the best accuracy.
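KNN over the vector space model reduces to ranking training mails by cosine similarity and taking a majority vote; a self-contained sketch (function names and toy data are ours):

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse {word_id: weight} vectors."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def knn_classify(mail, train_mails, train_labels, k=1):
    """Majority vote among the k training mails most similar to `mail`."""
    ranked = sorted(zip(train_mails, train_labels),
                    key=lambda p: cosine(mail, p[0]), reverse=True)
    votes = Counter(y for _, y in ranked[:k])
    return votes.most_common(1)[0][0]

train = [{0: 3.0, 1: 1.0}, {2: 2.0, 3: 2.0}, {0: 1.0, 1: 2.0}]
labels = ["spam", "ham", "spam"]
print(knn_classify({0: 2.0, 1: 1.0}, train, labels, k=1))  # → spam
```

With k=1 this is exactly the nearest-neighbor rule the slide reports as most accurate.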

Page 15:

Maximum Entropy

Maximize the entropy and minimize the Kullback-Leibler distance between the model and the real distribution.

The elements of the word-document matrix are modified to binary values {0, 1}.

Page 16:

SVM

Binary: use binary values {0, 1} to represent whether each word appears.

Normalized: count the occurrences of each word and divide them by their maximum occurrence count.
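The two SVM feature encodings can be sketched as follows (function names are ours; the slide does not say whether the maximum is taken per mail or per word across the corpus, so the sketch assumes per mail):

```python
def binary_features(counts):
    """{0, 1} per word: does the word appear in the mail at all?"""
    return {w: 1 for w, c in counts.items() if c > 0}

def normalized_features(counts):
    """Occurrence counts divided by the maximum count in the mail."""
    m = max(counts.values())
    return {w: c / m for w, c in counts.items()}

counts = {"cheap": 4, "buy": 2, "now": 1}
print(binary_features(counts))      # → {'cheap': 1, 'buy': 1, 'now': 1}
print(normalized_features(counts))  # → {'cheap': 1.0, 'buy': 0.5, 'now': 0.25}
```

Both encodings bound every feature to [0, 1], which keeps the SVM's margin from being dominated by a few very frequent words.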

Page 17:

Outline

Experiment setting
Feature extraction
Model training
Hybrid-Model
Conclusion
Reference

Page 18:

Single-layered-perceptron Hybrid Model

[Figure: a single-layer perceptron whose input layer takes the outputs of Naïve Bayes, KNN, and Maximum Entropy and feeds a single output layer]

The accuracy of the NN-based hybrid model is always the highest.

[Figure: a mail (bag of words) is fed to Naïve Bayes, K-nearest-neighbor, and Maximum Entropy classifiers, whose outputs go to a decision maker (the committee)]
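The single-layer-perceptron combiner reduces to a weighted sum of the member classifiers' scores followed by a threshold; a minimal inference sketch (the weights, bias, and score values here are hypothetical placeholders, not the project's learned parameters):

```python
def perceptron_hybrid(scores, weights, bias):
    """Single-layer perceptron over the member classifiers' spam scores."""
    s = bias + sum(w * x for w, x in zip(weights, scores))
    return "spam" if s > 0 else "ham"

# hypothetical weights for [naive_bayes, knn, maxent] scores in [0, 1]
weights = [2.0, 1.0, 1.5]
bias = -2.2
print(perceptron_hybrid([0.9, 1.0, 0.8], weights, bias))  # → spam
print(perceptron_hybrid([0.1, 0.0, 0.2], weights, bias))  # → ham
```

In the project the weights would be trained on the classifiers' outputs over the corpus, letting the combiner trust the more reliable members more.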

Page 19:

Committee-based Hybrid-model

The voting model averages the classification results, slightly improving the filter. However, voting can sometimes reduce accuracy because of majority misjudgments.

1. KNN + Naïve Bayes + Maximum Entropy
2. Naïve Bayes + Maximum Entropy + SVM
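The committee decision for either three-member combination is a plain majority vote; a one-function sketch (the function name and example labels are ours):

```python
from collections import Counter

def committee_vote(predictions):
    """Majority vote over the member classifiers' labels for one mail."""
    return Counter(predictions).most_common(1)[0][0]

# e.g., KNN + Naive Bayes + Maximum Entropy on one mail
print(committee_vote(["spam", "ham", "spam"]))  # → spam
```

With an odd number of members and two labels there are no ties, but if two of the three misjudge a mail the committee inherits the error, as noted above.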

Page 20:

Outline

Experiment setting
Feature extraction
Model training
Hybrid-Model
Conclusion
Reference

Page 21:

Conclusion

Seven features show mail-type discrimination:
transmitted time & receiver size; attachment, image, and URL; non-alphanumeric characters & punctuation marks.

Five popular machine learning models are shown to be suitable for spam filtering:
Naïve Bayes, KNN, Maximum Entropy, SVM.

Two model-combination approaches are tested:
committee-based & single neural network.

Page 22:

Reference

[1] M. Sahami, S. Dumais, D. Heckerman, and E. Horvitz, "A Bayesian Approach to Filtering Junk E-Mail," in Proc. AAAI 1998, Jul. 1998.

[2] A Plan for Spam: http://www.paulgraham.com/spam.html

[3] Enron Corpus: http://www.aueb.gr/users/ion/

[4] TreeTagger: http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/DecisionTreeTagger.html

[5] Maximum Entropy toolkit: http://homepages.inf.ed.ac.uk/s0450736/maxent_toolkit.html

[6] SRILM: http://www.speech.sri.com/projects/srilm/

[7] SVMlight: http://svmlight.joachims.org/