Introduction to Weka

ML Seminar for Rookies

2012-02-03

Byoung-Hee Kim

Biointelligence Lab, Seoul National University

(C) 2007-2012, SNU Biointelligence Lab, http://bi.snu.ac.kr/ 2

BI?

(Predictive) Analytics

Data Mining

Machine Learning

AI

Hype Cycle of Emerging Technologies 2010, Gartner

Analytics as a Mainstream Technology


Components of Data Mining


Weka as a Must-Have Tool


“I use Weka constantly in my speech work. It is the first thing I reach for when encountering a new problem. What a terrific tool.”

“A must for anyone even marginally interested in machine learning and classification techniques.”

“One of the most useful AI software packages available. Its only serious flaw is being infected with the GNU virus.”

(Reviews on Sourceforge.net)


Agenda


General Information on Weka

Weka: Data Mining Software in Java

• Weka is a collection of machine learning algorithms for data mining and machine learning tasks
• With Weka you can do data pre-processing, feature selection, classification, regression, clustering, association rules, and visualization
• Weka is open source software issued under the GNU General Public License
• How to get it: http://www.cs.waikato.ac.nz/ml/weka/ or just search for ‘Weka’ on Google

Components of Weka


Explorer lets you do various data mining tasks in an interactive, step-by-step way. It is usually the first choice.

KnowledgeFlow allows you to design configurations for streamed data processing.

Experimenter lets you run classification and regression experiments in batch mode:
• Different parameter settings
• Various datasets
• Comparison of models
• Large-scale statistical experiments

Simple CLI lies behind the other interfaces. By entering textual commands, you can access all features of the Weka system.

Auxiliary Tools in the menu

Practice: Classifying Iris Flower


Iris virginica, Iris versicolor, Iris setosa

Features for Classification


Terminology


Neural Networks

• MLP (Multilayer Perceptron)
• In Weka: classifiers - functions - MultilayerPerceptron

Decision Trees

• J48 (Java implementation of C4.5)
• In Weka: classifiers - trees - J48

Support Vector Machines

• SMO (sequential minimal optimization) for training SVM
• In Weka: classifiers - functions - SMO

Practice Scenario

Basic
• Comparing the performances of algorithms: MultilayerPerceptron vs. J48 vs. SVM
• Checking the trained model (structure & parameters)
• Tuning parameters to get better models
• Understanding ‘Test options’ & ‘Classifier output’ in Weka

Advanced
• Building committee machines using ‘meta’ algorithms for classification
• Preprocessing / data manipulation: applying ‘Filter’
• Batch experiments with ‘Experimenter’
• Designing & running a batch process with ‘KnowledgeFlow’

Dataset for Practice with Weka

Just open “iris.arff” in the data folder of Weka


Data format for Weka (.ARFF)

Header

@RELATION iris
@ATTRIBUTE sepallength REAL
@ATTRIBUTE sepalwidth REAL
@ATTRIBUTE petallength REAL
@ATTRIBUTE petalwidth REAL
@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}
@DATA

Data (CSV format)

5.1, 3.5, 1.4, 0.2, Iris-setosa
4.9, 3.0, 1.4, 0.2, Iris-setosa
4.7, 3.2, 1.3, 0.2, Iris-setosa
…
7.0, 3.2, 4.7, 1.4, Iris-versicolor
6.4, 3.2, 4.5, 1.5, Iris-versicolor
6.9, 3.1, 4.9, 1.5, Iris-versicolor
…

Note: You can easily generate an ‘arff’ file by adding a header to an ordinary CSV text file.
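The idea of turning a CSV file into an ARFF file by prepending a header can be sketched in a few lines of Python. This is only an illustration: the helper name `csv_to_arff` and the two sample rows are assumptions made for the example, not part of Weka, and it handles only simple comma-separated values without quoting or missing-value markers.

```python
import csv
import io


def csv_to_arff(csv_text, relation, attributes):
    """Prepend an ARFF header to plain CSV rows (hypothetical helper).

    attributes is a list of (name, kind) pairs, where kind is either an
    ARFF type string such as "REAL" or a list of nominal values.
    """
    lines = [f"@RELATION {relation}"]
    for name, kind in attributes:
        if isinstance(kind, (list, tuple)):
            # Nominal attribute: list the allowed values in braces.
            kind = "{" + ",".join(kind) + "}"
        lines.append(f"@ATTRIBUTE {name} {kind}")
    lines.append("@DATA")
    for row in csv.reader(io.StringIO(csv_text)):
        if row:  # skip blank lines
            lines.append(",".join(cell.strip() for cell in row))
    return "\n".join(lines) + "\n"


# Attribute declarations copied from the iris header shown on the slide.
IRIS_ATTRIBUTES = [
    ("sepallength", "REAL"),
    ("sepalwidth", "REAL"),
    ("petallength", "REAL"),
    ("petalwidth", "REAL"),
    ("class", ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]),
]

# Two rows taken from the slide, used here as a tiny sample input.
SAMPLE_CSV = (
    "5.1, 3.5, 1.4, 0.2, Iris-setosa\n"
    "7.0, 3.2, 4.7, 1.4, Iris-versicolor\n"
)

arff_text = csv_to_arff(SAMPLE_CSV, "iris", IRIS_ATTRIBUTES)
print(arff_text)
```

The output can be saved with an `.arff` extension and opened in the Explorer; for real data, the attribute types have to be chosen by hand to match the columns.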

Neural Networks in Weka

• Load a file that contains the training data by clicking the ‘Open file’ button
• ‘ARFF’ or ‘CSV’ formats are readable
• Click the ‘Classify’ tab
• Click the ‘Choose’ button
• Select ‘weka - classifiers - functions - MultilayerPerceptron’
• Click ‘MultilayerPerceptron’
• Set parameters for MLP
• Set parameters for Test
• Click ‘Start’ for learning

Some Notes on the Parameter Setting

Parameter setting = car tuning
• It needs much experience or many trials
• You may get worse results if you are unlucky

Multilayer Perceptron (MLP)
• Main parameters for learning: hiddenLayers, learningRate, momentum, trainingTime (epochs), seed

J48
• Main parameters: unpruned, numFolds, minNumObj
• Many parameters control the size of the resulting tree, e.g. confidenceFactor, pruning

SMO (SVM)
• Main parameters: c (complexity parameter), kernel, kernel parameters

Test Options and Classifier Output


There are various metrics for evaluation

Setting the data set used for evaluation


How to Evaluate the Performance? (1/2)

• Usually, build a ‘Confusion Matrix’ on the test data set
• Evaluation metrics
  - Accuracy (percent correct)
  - Precision
  - Recall
  - Many other metrics: F-measure, Kappa score, etc.
• For fair evaluation, the ‘cross-validation’ scheme is used

How to Evaluate the Performance? (2/2)

Confusion Matrix

                      Real positive        Real negative
Predicted positive    TP                   FP                    All with positive test
Predicted negative    FN                   TN                    All with negative test
                      All with disease     All without disease   Everyone

$$\text{Accuracy} = \frac{TP + TN}{TP + FP + FN + TN}$$

$$\text{Recall} = \frac{TP}{TP + FN}$$

$$\text{Precision} = \frac{TP}{TP + FP}$$

As recall ↑, precision ↓; conversely, as recall ↓, precision ↑.
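As a worked example, accuracy, precision, and recall can be computed directly from the four cells of a binary confusion matrix. The sketch below uses made-up counts (40 TP, 10 FP, 20 FN, 30 TN) purely for illustration; the function name `confusion_metrics` is hypothetical and not a Weka API.

```python
def confusion_metrics(tp, fp, fn, tn):
    """Compute basic metrics from binary confusion-matrix counts."""
    total = tp + fp + fn + tn
    return {
        # Fraction of all instances that were classified correctly.
        "accuracy": (tp + tn) / total,
        # Fraction of predicted positives that are really positive.
        "precision": tp / (tp + fp),
        # Fraction of real positives that were predicted positive.
        "recall": tp / (tp + fn),
    }


# Hypothetical counts, chosen only to make the arithmetic easy to follow.
metrics = confusion_metrics(tp=40, fp=10, fn=20, tn=30)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these counts the accuracy is 70/100, the precision is 40/50, and the recall is 40/60. The sketch assumes that at least one instance is predicted positive and at least one is really positive; otherwise precision or recall is undefined.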

Evaluation Method - Cross Validation

K-fold Cross Validation
• The data set is randomly divided into k subsets.
• One of the k subsets is used as the ‘test set’ and the other k-1 subsets are put together to form a ‘training set’.
• This is repeated so that each subset serves once as the test set, and the k errors are averaged:

$$\text{Error} = \frac{1}{k} \sum_{i=1}^{k} \text{Error}_i$$

Figure: 6-fold cross validation. The data are split into six subsets D1 to D6 of 128 instances each; in each round one subset (D6, D5, …, D1) is held out for testing and the other five are used for training.
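The k-fold procedure can be sketched in plain Python as follows. This is a minimal illustration and not Weka's implementation: the helper names `k_fold_indices` and `cross_validate` are hypothetical, the `evaluate` callback stands in for "train a model on the training indices and return its error on the test indices", and the sizes (768 instances, 6 folds) are chosen to mirror the six subsets of 128 instances in the figure.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Randomly partition the indices 0..n-1 into k disjoint folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    # Taking every k-th index gives folds whose sizes differ by at most one.
    return [indices[i::k] for i in range(k)]


def cross_validate(n, k, evaluate, seed=0):
    """Average the error returned by evaluate(train, test) over k folds."""
    folds = k_fold_indices(n, k, seed)
    errors = []
    for i, test in enumerate(folds):
        # The training set is the union of the other k-1 folds.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        errors.append(evaluate(train, test))
    return sum(errors) / k


# 768 instances in 6 folds gives 128 instances per fold.
folds = k_fold_indices(n=768, k=6)
print([len(fold) for fold in folds])
```

In practice `evaluate` would wrap a real learner; Weka's ‘Cross-validation’ test option performs the splitting and averaging for you.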

Data Manipulation with Filter in Weka

• Attribute filters: selection, discretization
• Instance filters: re-sampling, selecting specified folds

Using Experimenter in Weka

Tool for ‘batch’ experiments

• Click ‘New’
• Set experiment type / iteration control
• Set datasets / algorithms
• Select the ‘Run’ tab and click ‘Start’
• If it has finished successfully, click the ‘Analyse’ tab and see the summary

Usages of Experimenter

• Model selection for classification/regression
  - Various approaches: repeated training/test set split, repeated cross-validation (cf. double cross-validation), averaging
• Comparison between models / algorithms
  - Paired t-test
  - On various metrics: accuracy, RMSE, etc.
• Batch and/or distributed processing
  - Load/save experiment settings
  - Remote experiments: http://weka.wikispaces.com/Remote+Experiment
  - Multi-core support: utilize all the cores on a multi-core machine
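To make the paired t-test concrete, the sketch below computes the t statistic for two algorithms evaluated on the same runs. It is the plain textbook paired t-test, which may differ from the exact variant the Experimenter applies, and the accuracy values are invented for illustration only.

```python
import math
import statistics


def paired_t_statistic(scores_a, scores_b):
    """Return the paired t statistic for two equally long score lists."""
    # Work on the per-run differences between the two algorithms.
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    # Undefined when all differences are identical (sd_diff == 0).
    return mean_diff / (sd_diff / math.sqrt(n))


# Hypothetical accuracies of two classifiers on the same five runs.
acc_a = [0.96, 0.94, 0.95, 0.97, 0.93]
acc_b = [0.94, 0.93, 0.92, 0.95, 0.93]

t_value = paired_t_statistic(acc_a, acc_b)
print(f"t = {t_value:.3f} with {len(acc_a) - 1} degrees of freedom")
```

The resulting t value is compared against a t distribution with n - 1 degrees of freedom to judge whether the difference between the two algorithms is significant.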

KnowledgeFlow for Analysis Process Design

(cf. ‘Process Flow Diagram’ of SAS® Enterprise Miner)

KnowledgeFlow: Example Usage

Decision tree (J48)


Simple CLI

Example command and result

java weka.classifiers.functions.MultilayerPerceptron -L 0.3 -M 0.2 -N 500 -V 0 -S 0 -E 20 -H a -t "C:\Program Files\Weka-3-6\data\iris.arff"

• You can easily build a command line script for various experiments
• Refer to Ch. 1 of WekaManual-3-*-*.pdf for further information
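One way to script a series of experiments is to generate one command line per parameter combination. The sketch below only builds the command strings and does not run them; it reuses the -L, -M, -N, -H, and -t options from the example command above, while the helper name `build_commands`, the relative path `data/iris.arff`, and the parameter grids are assumptions made for illustration.

```python
import itertools
import shlex


def build_commands(data_path, learning_rates, momenta, epochs=500):
    """Build one MultilayerPerceptron command per (rate, momentum) pair."""
    commands = []
    for rate, momentum in itertools.product(learning_rates, momenta):
        args = [
            "java", "weka.classifiers.functions.MultilayerPerceptron",
            "-L", str(rate),       # learning rate
            "-M", str(momentum),   # momentum
            "-N", str(epochs),     # number of training epochs
            "-H", "a",             # hidden layer setting from the example
            "-t", data_path,       # training file
        ]
        # shlex.join quotes arguments for a POSIX shell where needed.
        commands.append(shlex.join(args))
    return commands


# Hypothetical grid: two learning rates times two momentum values.
commands = build_commands("data/iris.arff", [0.1, 0.3], [0.2, 0.5])
for command in commands:
    print(command)
```

Actually running these commands requires weka.jar on the Java classpath; the generated lines could be written to a shell script or passed to `subprocess.run`.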

Other Open Source ML Software

• KNIME (Konstanz Information Miner): http://www.knime.org/
• RapidMiner (with Weka as its core): http://rapid-i.com/index.php?lang=en
• TANAGRA: http://eric.univ-lyon2.fr/~ricco/tanagra/en/tanagra.html

General Information on Weka

Current version (as of 2012-02-03)
• Stable version: 3.6.6
• Developer version: 3.7.5

References

• Weka Wiki: http://weka.wikispaces.com/ (the Primer is a good starting point)
• Weka online documentation: http://www.cs.waikato.ac.nz/ml/weka/index_documentation.html
• Textbook: Ian H. Witten, Eibe Frank, Mark A. Hall, Data Mining: Practical Machine Learning Tools and Techniques (Third Edition), Morgan Kaufmann, Jan. 2011.
• Articles
  - Data mining with WEKA, Part 1, Part 2, Part 3, in IBM Technical Library
  - Building a prediction program with Weka (Weka 를 이용한 예측프로그램 만들기), serialized in the monthly magazine 마소 (July, August, and September 2009 issues); blog, MS Live