A MapReduce Implementation of Spam Filtering Using a Naïve Bayesian Classifier

2008. 8. 27

한재선 (CEO, NexR)

jshan0000@gmail.com

www.nexr.co.kr

Target Application

Spam Filtering

Spam Email

This is the easiest, fastest, and most effective way to lose both pounds and inches permanently!!! This weight loss program is designed specifically to "boost" weight-loss efforts by assisting body metabolism, and helping the body's ability to manage weight. A powerful, safe, 30 Day Program. This is one program you won't feel starved on. Complete program for one amazing low price! Program includes: <b>BONUS AMAZING FAT ABSORBER CAPSULES, 30 DAY - WEIGHT REDUCTION PLAN, PROGRESS REPORT!</b><br><br>SPECIAL BONUS..."FAT ABSORBERS", AS SEEN ON TV With every order...AMAZING MELT AWAY FAT ABSORBER CAPSULES with directions ( Absolutely Free ) ...With these capsules you can eat what you enjoy, without the worry of fat in your diet. 2 to 3 capsules 15 minutes before eating or snack, and the fat will be absorbed and passed through the body without the digestion of fat into the body. <br><br>You will be losing by tomorrow! Don't Wait, visit our webpage below, and order now!

Basic Idea

Treat an email as spam if it contains many of the words that appear frequently in spam emails!

Training

Classifying

Training: Concept

Feature Extraction (feature = the fingerprint of a document)

word        count
program     9
price       8
reduction   8
bonus       7
amazing     7
diet        6
capsules    4
...
order       2
boost       1
manage      1
visit       1
tomorrow    1
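Feature extraction can be sketched in a few lines of plain Java. The class name `FeatureExtractor`, the rule that a word is a run of three or more letters, and the stripping of HTML tags are all assumptions made for illustration; the slides do not specify the tokenizer.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: turns raw email text into word counts (the "fingerprint").
class FeatureExtractor {
    // Assumptions: HTML tags are dropped, text is lowercased,
    // and a word is a run of at least three ASCII letters.
    static Map<String, Integer> extract(String email) {
        String text = email.replaceAll("<[^>]*>", " ").toLowerCase(Locale.ROOT);
        Map<String, Integer> counts = new TreeMap<>();
        for (String token : text.split("[^a-z]+")) {
            if (token.length() >= 3) {
                counts.merge(token, 1, Integer::sum); // increment the count for this word
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        String sample = "Complete program for one amazing low price! <b>BONUS AMAZING CAPSULES</b>";
        System.out.println(extract(sample));
    }
}
```

A sorted `TreeMap` is used only so the printed counts are easy to read; any map would do.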

2 categories (classes): Training dataset (Spam) and Training dataset (Ham)

word      freq in spam   Pr(word | spam)   freq in ham   Pr(word | ham)
bonus     3590           0.7               737           0.15
hadoop    252            0.05              1308          0.24
…         …              …                 …             …

Training: Implementation

1. Count word frequencies in each category.
2. Compute word probabilities in each category.

Spam emails: word frequency counting, then word probability calculation
Ham emails: word frequency counting, then word probability calculation

Both pipelines fill in the columns of the word table (freq in spam, Pr(word | spam), freq in ham, Pr(word | ham)).
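Step 2 can be sketched as a simple division. The slides do not say which denominator is used, so this sketch assumes Pr(word | category) is the number of emails in the category that contain the word, divided by the number of emails in that category; `WordProbabilities` and `totalEmails` are hypothetical names.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: converts per-category word frequencies into probabilities.
class WordProbabilities {
    // Assumption: freq counts each email at most once per word, so that
    // freq / totalEmails stays within [0, 1].
    static Map<String, Double> fromCounts(Map<String, Integer> freq, int totalEmails) {
        if (totalEmails <= 0) {
            throw new IllegalArgumentException("totalEmails must be positive");
        }
        Map<String, Double> pr = new HashMap<>();
        for (Map.Entry<String, Integer> e : freq.entrySet()) {
            pr.put(e.getKey(), e.getValue() / (double) totalEmails);
        }
        return pr;
    }
}
```

With a frequency of 3590 and an assumed 5000 spam emails this yields 0.718, which is in the neighborhood of the 0.7 shown for "bonus"; the actual dataset size is not given in the slides.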

Training: MapReduce

Input (Spam and Ham datasets): (spam, contents), (ham, contents)
Map (parsing): emits (spam::bonus, 1), (ham::bonus, 1)
Reduce (adding): emits (spam::bonus, 3590), (ham::bonus, 737)
Transform (frequencies to probabilities): use MapReduce?
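The map and reduce steps can be illustrated without a cluster. The following is a plain-Java stand-in that mimics the key/value flow, not code written against Hadoop's Mapper and Reducer interfaces; the class name `TrainingMapReduceSketch`, the tokenizer, and the in-memory shuffle are assumptions.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory simulation of the training job's map and reduce phases.
class TrainingMapReduceSketch {
    // Map (parsing): (category, contents) -> list of (category::word, 1)
    static List<Map.Entry<String, Integer>> map(String category, String contents) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : contents.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!word.isEmpty()) {
                out.add(new AbstractMap.SimpleEntry<>(category + "::" + word, 1));
            }
        }
        return out;
    }

    // Reduce (adding): sums all the 1s emitted for one key.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Driver: groups map output by key (the "shuffle"), then reduces each group.
    // Each input row is {category, contents}.
    static Map<String, Integer> run(String[][] inputs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String[] in : inputs) {
            for (Map.Entry<String, Integer> kv : map(in[0], in[1])) {
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }
}
```

Prefixing the key with the category ("spam::", "ham::") lets a single job count both datasets at once, which matches the key format shown on the slide.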

Classifying: Concept

Test email -> feature extraction -> words w1, w2, …, each looked up in the word probability table (Pr(word | spam), Pr(word | ham))

Pr(email | spam) = Pr(w1 | spam) * Pr(w2 | spam) * …

Pr(email | ham) = Pr(w1 | ham) * Pr(w2 | ham) * …
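The two products can be computed directly from a per-category word table. This is a minimal sketch: the class name `Likelihood` and the default probability `UNSEEN` for words missing from the table are assumptions, since the slides do not say how unseen words are handled.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: Pr(email | category) as a product of per-word probabilities.
class Likelihood {
    // Assumption: words absent from the table get a small fixed probability
    // so that one unseen word does not force the whole product to zero.
    static final double UNSEEN = 0.01;

    static double of(List<String> words, Map<String, Double> prWordGivenCategory) {
        double p = 1.0;
        for (String w : words) {
            p *= prWordGivenCategory.getOrDefault(w, UNSEEN);
        }
        return p;
    }

    public static void main(String[] args) {
        Map<String, Double> spam = new HashMap<>();
        spam.put("bonus", 0.7);
        spam.put("hadoop", 0.05);
        Map<String, Double> ham = new HashMap<>();
        ham.put("bonus", 0.15);
        ham.put("hadoop", 0.24);

        List<String> email = Arrays.asList("bonus", "hadoop");
        System.out.println("Pr(email | spam) = " + of(email, spam));
        System.out.println("Pr(email | ham)  = " + of(email, ham));
    }
}
```

For long emails a product of many small numbers can underflow to zero; summing the logarithms of the probabilities instead is a common alternative.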

Classifying: Bayes Theorem

The probabilities we want are

Pr(spam | email) and Pr(ham | email)

Bayes Theorem

Pr(A | B) = Pr(B | A) * Pr(A) / Pr(B)

Pr(Cat | Email) = Pr(Email | Cat) * Pr(Cat) / Pr(Email)
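Because Pr(Email) is the same for both categories, it can be obtained by summing Pr(Email | Cat) * Pr(Cat) over spam and ham, and it does not change which category scores higher. A minimal sketch follows; the class name `BayesPosterior`, the parameter `prSpam` for the prior, and the fallback to the prior when both likelihoods are zero are assumptions.

```java
// Hypothetical helper: applies Bayes' theorem for the two-category case.
class BayesPosterior {
    // Returns Pr(spam | email). The prior Pr(ham) is taken as 1 - prSpam.
    static double spamPosterior(double prEmailGivenSpam, double prEmailGivenHam, double prSpam) {
        double spamScore = prEmailGivenSpam * prSpam;        // Pr(Email | spam) * Pr(spam)
        double hamScore = prEmailGivenHam * (1.0 - prSpam);  // Pr(Email | ham) * Pr(ham)
        double prEmail = spamScore + hamScore;               // total probability of the email
        if (prEmail == 0.0) {
            return prSpam; // assumption: with no evidence either way, fall back to the prior
        }
        return spamScore / prEmail;
    }

    public static void main(String[] args) {
        // Likelihoods for an email containing "bonus" and "hadoop", using the table values.
        System.out.println(spamPosterior(0.7 * 0.05, 0.15 * 0.24, 0.5));
    }
}
```

With equal priors, the example email comes out marginally more likely to be ham than spam, because 0.15 * 0.24 is slightly larger than 0.7 * 0.05.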

Classifying: MapReduce

Input (Unknown emails): (contents)
Map (parsing & calculation): emits (spam, contents)
Reduce (Identity)

Reading & instantiation: the word probability table (word, freq in spam, Pr(word | spam), freq in ham, Pr(word | ham)) is read and instantiated as Category objects for the Map to use.
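The "reading & instantiation" step might look like the following sketch. The slides do not show the `Category` class or the on-disk layout of the table, so the fields, the method names, and the tab-separated line format are all assumptions.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical Category object: one per class, holding Pr(word | category).
class Category {
    final String name;
    final Map<String, Double> prWord = new HashMap<>();

    Category(String name) {
        this.name = name;
    }

    // Assumed line format: word<TAB>freqInSpam<TAB>prSpam<TAB>freqInHam<TAB>prHam
    // Returns {spam, ham}.
    static Category[] load(BufferedReader in) throws IOException {
        Category spam = new Category("spam");
        Category ham = new Category("ham");
        String line;
        while ((line = in.readLine()) != null) {
            String[] f = line.trim().split("\t");
            if (f.length != 5) {
                continue; // skip blank or malformed lines
            }
            spam.prWord.put(f[0], Double.parseDouble(f[2]));
            ham.prWord.put(f[0], Double.parseDouble(f[4]));
        }
        return new Category[] { spam, ham };
    }

    // Convenience wrapper for in-memory text, e.g. in tests.
    static Category[] parse(String text) {
        try {
            return load(new BufferedReader(new StringReader(text)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In a real job, `load` would typically be called once per map task during setup, so that every call to the map function reuses the same in-memory tables.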

Advanced: Sharing Training Results Among Maps

HDFS

DistributedCache

HBase

RDBMS

Advanced: DistributedCache

• Distribute application-specific large, read-only files efficiently

• Cache files (text, archives, jars, etc.)

– Only copied once per job

• Code Example

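    // imports needed by the two fragments below
    import java.io.BufferedReader;
    import java.io.FileReader;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.JobConf;

    // in Job configure: register the file to be cached
    JobConf conf = new JobConf(getConf(), NaiveBayesianClassifierMR.class);
    DistributedCache.addCacheFile(new Path(cachePath).toUri(), conf);

    // in Map configure: open the node-local copy of the cached file
    Path[] localFiles = DistributedCache.getLocalCacheFiles(conf);
    String cachedFile = localFiles[0].toString();
    BufferedReader br = new BufferedReader(new FileReader(cachedFile));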

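Advanced: Passing Custom Parameters

• Custom variables need to be passed to each Map

– e.g., thresholds for the spam/ham decision

(a measure to reduce false positives, where ham is judged to be spam)

• conf.set() & conf.get() (conf.setInt() & conf.getInt() for integer values)

– e.g., conf.setInt("nbc.spam_threshold", 3)

conf.setInt("nbc.ham_threshold", 1)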
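The slides do not spell out how the two thresholds are applied. One plausible reading, sketched here, is that an email is called spam only when its spam score exceeds the ham score by a factor of the spam threshold, and otherwise ham only when the ham score exceeds the spam score by the ham threshold. The class `ThresholdDecision`, the "unknown" outcome, and the use of `java.util.Properties` as a stand-in for the Hadoop configuration object are assumptions for illustration.

```java
import java.util.Properties;

// Hypothetical decision rule driven by configurable thresholds.
class ThresholdDecision {
    // conf stands in for the job configuration; the keys match the slide's examples.
    static String decide(double spamScore, double hamScore, Properties conf) {
        double spamThreshold = Double.parseDouble(conf.getProperty("nbc.spam_threshold", "3"));
        double hamThreshold = Double.parseDouble(conf.getProperty("nbc.ham_threshold", "1"));

        // Assumption: spam must beat ham by a factor of spamThreshold,
        // which makes a false positive (ham judged as spam) less likely.
        if (spamScore > hamScore * spamThreshold) {
            return "spam";
        }
        if (hamScore > spamScore * hamThreshold) {
            return "ham";
        }
        return "unknown";
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.setProperty("nbc.spam_threshold", "3");
        conf.setProperty("nbc.ham_threshold", "1");
        System.out.println(decide(0.9, 0.1, conf));
        System.out.println(decide(0.6, 0.4, conf));
    }
}
```

Under this rule, scores of 0.9 versus 0.1 are classified as spam, while 0.6 versus 0.4 fall between the two thresholds and are left undecided rather than risk flagging a legitimate email.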

Advanced: Complete ML Framework

Filter

Validator

Classifier

Evaluator

Visualizer

Clustering

Recommendation

Input

Output

DataSink

DataSource