Janez Demšar

Data Mining

Material for lectures on Data Mining at Kyoto University, Dept. of Health Informatics, in July 2010
Contents

INTRODUCTION
(FREE) DATA MINING SOFTWARE
BASIC DATA MANIPULATION AND PREPARATION
SOME TERMINOLOGY
DATA PREPARATION
FORMATTING THE DATA
BASIC DATA INTEGRITY CHECKS
MAKING THE DATA USER FRIENDLY
KEEPING SOME OF THE DATA FOR LATER TESTING
MISSING DATA
DISCRETE VERSUS NUMERIC DATA
FROM DISCRETE TO CONTINUOUS
FROM CONTINUOUS TO DISCRETE
VARIABLE SELECTION
FEATURE CONSTRUCTION AND DIMENSIONALITY REDUCTION
SIMPLIFYING VARIABLES
CONSTRUCTION OF DERIVED VARIABLES
VISUALIZATION
HISTOGRAMS
SCATTER PLOT
PARALLEL COORDINATES
MOSAIC PLOT
SIEVE DIAGRAM
"INTELLIGENT" VISUALIZATIONS
DATA MODELING (LEARNING, CLUSTERING)
PREDICTIVE MODELING / LEARNING
CLASSIFICATION TREES
CLASSIFICATION RULES
NAÏVE BAYESIAN CLASSIFIER AND LOGISTIC REGRESSION
OTHER METHODS
UNSUPERVISED MODELING
DISTANCES
HIERARCHICAL CLUSTERING
K-MEANS CLUSTERING
MULTIDIMENSIONAL SCALING
VISUALIZATION OF NETWORKS
EVALUATION OF MODEL PERFORMANCE
SAMPLING THE DATA FOR TESTING
MEASURES OF MODEL PERFORMANCE
SCALAR MEASURES OF QUALITY
RECEIVER OPERATING CHARACTERISTIC (ROC) CURVE
CLASSIFICATION COST
Introduction

This material contains the "handouts" given to students for the data mining lectures held at the Department of Health Informatics at the University of Kyoto. The informality of the "course", which took five two-hour lectures, is reflected in the informality of this text, too, as it even includes references to discussions at past lectures and so on.

Some more material is available at http://www.ailab.si/janez/kyoto.
(Free) Data mining software

Here is a very small selection of free data mining software. Google for more.

- Orange (http://www.ailab.si) is the free data mining software we are going to use for these lectures.
- Knime (http://www.knime.org/) is excellent, well designed and nice looking. It has fewer visualization methods than Orange and it's a bit more difficult to work with, but it won't cost you anything to install it and try it out.
- Weka (http://www.cs.waikato.ac.nz/ml/weka/) is oriented more towards machine learning. While very strong in terms of the different kinds of prediction models it can build, it has practically no visualization and it's difficult to use. I don't really recommend it.
- Tanagra (http://eric.univ-lyon2.fr/~ricco/tanagra/) is a smaller package written by a one-man band, but it has much more statistics than Orange.

Orange has a different user interface from what you are used to (similar to Knime and, say, SAS Miner and SPSS Modeler). Its central part is the Orange Canvas, onto which we put widgets. Each widget has a certain function (loading data, filtering, fitting a certain model, showing some visualization), and receives and/or gives data from/to other widgets. We say that widgets send signals to each other. We set up the data flow by arranging the widgets into a schema and connecting them. Each widget is represented with an icon which can have "ears" on the left (inputs) and right (outputs). Two widgets can only be connected if their signal types match: a widget outputting a data table cannot be connected to a widget which expects something completely different. We will learn more about Orange as we go.
Basic data manipulation and preparation

Some terminology

Throughout the lectures you will have to listen to an unfortunate mixture of terminology from statistics, machine learning and elsewhere. Here is a short dictionary.

Our data sample will consist of data instances, e.g. patient records. In machine learning we prefer to say we have a data set and it consists of examples.

Each example is described by features or values of variables (statistics) or attributes (machine learning). These features can be numerical (continuous) or discrete (symbolic, ordinal or nominal; the latter two terms actually refer to two different things: ordinal values can be ordered, e.g. small, medium, big, while nominal values have no specific order, just names, e.g. red, green, blue).

Often there is a certain variable we want to predict. Statisticians call it the dependent variable, in medicine you will hear the term outcome, and in machine learning we call it the class or class attribute. Such data is called classified data (no relation to the CIA or KGB).

Machine learning considers examples to belong to two or more classes, and the task is then to determine the unknown class of new examples (e.g. diagnoses for newly arrived patients). We say that we classify new examples, the method for doing this is called a classifier, the problem of finding a good classifier is called a classification problem, and so on. The process of constructing the classifier is called classifier induction from data or also learning from data. The part of the data used for learning is often called learning data or training data (as opposed to testing data, which we use for testing).

The "class" can also be continuous. If so, we shouldn't call it a class (although we often do). The problem is then not a classification problem anymore; we call it a regression problem.

In statistical terminology, the task is to predict, and the statistical term for a classifier is a predictive model. We do not learn here, we fit the model to the data. We sometimes also use the term modeling. If the data is classified, the problem is called supervised modeling, otherwise it's unsupervised.

Since data mining is based on both fields, we will mix the terminology all the time. Get used to it.
Data preparation

This is related to Orange, but similar things also have to be done when using any other data mining software. Moreover, especially the procedures we will talk about in the beginning are also applicable to any means of handling the data, including statistics. They are so obvious that we typically forget about them and scratch our heads later on, when things begin looking weird.
If all this seems to you like common sense, it is simply because most of data mining is just the systematic application of pure common sense.
Formatting the data

Orange does not handle Excel data and SQL databases (if you don't know what SQL is, don't bother) well yet, so we feed it the data in textual tab-delimited form. The simplest way to prepare the data is to use Excel. Put the names of the variables into the first row, followed by data instances in each row. Save as tab-delimited, with extension .txt.

This format is somewhat limited with respect to what you can tell Orange about the variables. A bit more complicated (but still simple) format uses the second and the third row to define the variable types and properties. You can read more about it at http://www.ailab.si/orange/doc/reference/tabdelimited.htm. You can see several examples in c:\Python26\lib\site-packages\orange\doc\datasets, assuming that you use Windows and Python 2.6.
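To make this concrete, here is a sketch of what such a file might look like. The variable names come from the heart disease data used later in these lectures, but the rows are invented, and the second and third header rows (variable types and the class marker) follow the three-row variant described above, so check the documentation link for the authoritative description:

    age         gender    chest pain      diameter narrowing
    continuous  discrete  discrete        discrete
                                          class
    63          male      typical ang     0
    67          female    asymptomatic    1

The columns must be separated by actual tab characters; saving from Excel as "Text (Tab delimited)" takes care of that.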
For technical reasons, Orange cannot handle values, variable names or even file names which include non-ASCII (e.g. kanji) characters. There is little we can do to fix Orange at the moment.
Basic data integrity checks

Check the data for errors and inconsistencies. Check for spelling errors. If somebody mistypes male as mole, we are suddenly going to have three sexes in our data: females, males and moles.

Orange (and many other tools) is case sensitive: you should not write the same values sometimes with small and sometimes with capital letters. That is, female and Female are two different species.

Check for any other obviously wrong data, such as mistyped numbers (there are not many 18 kg patients around; did they mean 81?!).

How are unknown values denoted? There are various "conventions": question marks, periods, blank fields, NA, -1, 9, 99... Value 9 seems to be especially popular for denoting unknowns in symbolic values encoded numerically, like gender, e.g. 0 for female, 1 for male and 9 for unknown. If 9's are kept in the data and the software is not specifically told that 9 means an unknown value, it will either think that there are three or even ten sexes (as if life was not complicated enough with two), or even that sex is a numerical variable ranging from 0 to 9.
You will often encounter data in which unknowns are marked differently for different variables, or even for one and the same variable.

Does the same data always mean the same thing? And the opposite: can the same thing be encoded by more than one different value? Check for errors in the data recording protocol, especially if the data was collected by different people or institutions. In the data I'm working on right now, it would seem that some hospitals have a majority of patients classified as Simple trauma, while others have most patients with Other_Ent. Could this be one and the same thing, but recorded differently in different hospitals due to lax specifications?
A typical classical example, which I happily also encountered while in Nara, is that some institutions would record the patients' height in meters and others in centimeters, giving an interesting bimodal distribution (besides rendering the variable useless for further analysis).

Most such problems can be discovered by observing the distributions of variable values.

Such discrepancies should be found and solved in the beginning. If you were given the data by somebody else, you should ask him to examine these problems.
Hands on: Load the data into Orange and see whether it looks OK. Try to find any mistakes and fix them (use Excel or a similar program for that). (Widgets: File, Data Table, Attribute Statistics, Distributions)
Making the data user friendly

Data is often encoded by numbers: 0 for female, 1 for male (wait, or was it the opposite? you will ask yourself all the time while working with such data). I just got data in which all nominal values had their descriptive names, except gender, for some reason. Consider renaming these values; it will make your life more pleasant.

This of course depends upon the software and the methods which you are going to use. We will encounter a simple case quite soon in which we actually want the gender to be encoded by 0 and 1.
Keeping some of the data for later testing

An important problem, which was also pointed out by Prof. Nakayama at our first lecture, is how to combine statistical hypothesis testing with data mining. The only correct way I know of (and the only one with which you should have no problem trusting your results and convincing the reviewers of your paper of their validity) is to use a separate data set for testing your hypotheses and models. Since you typically get all the data which is available in the beginning, you should sacrifice a part of your data for testing.

As a rule of thumb, you would randomly select 30% of the data instances, put them into a separate file and leave it alone till just before the end.

If your data has dates (e.g. admittance date), it may make a more convincing case if, instead of randomly selecting 30%, you take out the last 30% of the data. This would then look like a simulated follow-up study. If you decide for this, you only have to be careful that your data is unbiased. For instance, if you are researching a condition related to diabetes and the last 30% of the data was collected in the summer, using this data as a test set may not be a good idea (since diabetic patients have more problems in summer, as far as I know).
You should split the data at this point. Fixing the problems in the data which we discussed above before splitting it is a good idea; otherwise you will have to repeat the cleaning on the test data separately. But from this point on, we shall do things with the data which already affect your model, so the test data should not be involved in these procedures.
Hands on: There is a general-purpose widget in Orange to select random samples from the data (Data Sampler). You can put the data from this widget into the one for saving data (Save). Try it and prepare a separate data file for testing.
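The same holdout split can also be done outside Orange in a few lines. Here is a minimal plain-Python sketch (this is not the Data Sampler widget; the file names are invented and the 30% fraction is just the rule of thumb from above):

    import random

    def holdout_split(rows, test_fraction=0.3, seed=1):
        # Randomly set aside a fraction of the data instances for testing.
        rows = list(rows)
        random.Random(seed).shuffle(rows)    # fixed seed: reproducible split
        n_test = int(len(rows) * test_fraction)
        return rows[n_test:], rows[:n_test]  # (training data, testing data)

    # Example: split the lines of a tab-delimited file, keeping the header row.
    with open("heart_disease.txt") as f:
        header, data = f.readline(), f.readlines()
    train, test = holdout_split(data)
    open("train.txt", "w").writelines([header] + train)
    open("test.txt", "w").writelines([header] + test)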
Missing data

This topic again also concerns statistics, although typical statistical courses show little interest in it. Some data will often be missing. What to do about that?

First, there are different kinds of missing data.

- It can be missing because we don't care about it, since it is irrelevant and was not collected for a particular patient. (I didn't measure his pulse rate since he seemed perfectly OK.)
- It can be inapplicable. (No, my son is not pregnant. Yes, doctor, I'm certain about it.)
- It may be just unknown.

Especially in the latter case, it may be interesting to know: why? Is the fact that it is unknown related to the case? Does it tell us something?

Maybe the data was not collected since its value was known (e.g. pulse rate was not recorded when it was normal). Maybe the missing values can be concluded from the other attributes (blood loss was not recorded when there were no wounds, so we know it is zero) or are inapplicable (the reason of death is missing, but another attribute tells us that the patient survived).
For a good example: in a data set which I once worked with, a number of patients had a missing value for urine color. When I asked the person who gave me the data about the reason, he told me that they were unconscious at the time they were picked up by the ER team.
Knowing why it is missing can help in dealing with it. Here are a few things we can do:

- Ask the owner of the data or a domain expert if he can provide either per-instance or general values to impute (the technical term for inserting values for unknowns). If the pulse was not recorded when it was normal, well, we know what the missing value should be.
- Sometimes you may want to replace the missing value with a special value, or even add a specific variable telling whether the missing value is known or not. For the urine color example, if the data does not contain a variable telling us whether the patient is conscious, we can add the variable Conscious and determine its value based on whether the urine color is known or not. If it is known, the patient is conscious; if not, he is "maybe conscious" (or even "unconscious", if we know that the data was collected so that the color is known for all conscious patients). Not knowing a value can thus be quite informative.
- Sometimes you can use a model to impute the unknown value. For instance, if the data is on children below 5, and some data on their weights is missing but you know their ages, you can impute the average body mass for the child's age. You can even use sophisticated models for that (that is, instead of predicting what you have to predict for this data, e.g. the diagnosis, you predict values of features based on values of other features).
- Impute constant values, such as the worst or the best possible, or the most common value (for discrete attributes) or the average/median (for continuous ones).
- Remove all instances with missing values. This is the safest route, except that it reduces your data set. You may not like this one if your sample is already small and you have too many missing values.
- Remove the variables with missing values. You won't like this one if the variable is important, obviously. You may combine this with the previous: remove the variables with the most missing values and then remove the data instances which are missing values of any of the other attributes. This may still be painful, though.
- Keep everything as it is. Most methods, especially in machine learning, can handle missing values without any problem. Be aware, though, that this often implies some kind of implicit imputation, guessing. If you know something about these missing values, it is better to use one of the first solutions described.
Hands on: There is an imputation widget in Orange (Impute). See what it offers.
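For illustration, here is a minimal plain-Python sketch of the two simplest imputations mentioned above: the average for continuous variables and the most common value for discrete ones (the function names are mine, not Orange's):

    from collections import Counter

    def impute_average(values):
        # Replace None with the average of the known values.
        known = [v for v in values if v is not None]
        avg = sum(known) / len(known)
        return [avg if v is None else v for v in values]

    def impute_most_common(values):
        # Replace None with the most common known value.
        known = [v for v in values if v is not None]
        most_common = Counter(known).most_common(1)[0][0]
        return [most_common if v is None else v for v in values]

    print(impute_average([36.5, None, 38.2]))         # [36.5, 37.35, 38.2]
    print(impute_most_common(["f", None, "f", "m"]))  # ['f', 'f', 'f', 'm']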
Discrete versus numeric data

Most statistical methods require numeric data, and many machine learning methods prefer discrete data. How do we convert back and forth?
From discrete to continuous

This task is really simple for variables which can only take two values, like gender (if we don't have the gender 9). Here you can simply change these values to 0 and 1. So if 0 is female and 1 is male, the variable represents the "masculinity" of the patient. Consider renaming the variable to "male" so you won't have to guess the meaning of 0 and 1 all the time. Statisticians call such variables dummy variables, dummies, indicator variables, or indicators.

If the variable is actually ordered (small, medium, big) you may consider encoding it as 0, 1, 2. Be careful, though: some methods will assume linearity. Is the difference between small and big really two times that between small and medium?

With general multivalued variables, there are two options. When your patients can be Asian, Caucasian or Black, you can replace the variable Race with three variables: Asian, Caucasian and Black. For each patient, only one would be 1. (In statistics, if you do, for instance, linear regression on this data, you need to remove the constant term!)

The other option occurs in cases where you can designate one value as "normal". If urine color can be normal, red or green, you would replace it with two attributes, red and green. At most one of them can be 1; if both are zero, that means that the color is normal.
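As a small sketch of both options (the function name is invented for illustration; the second call shows the variant with a designated "normal" value):

    def dummy_encode(values, baseline=None):
        # Replace a discrete variable with one 0/1 indicator per value.
        # If a baseline ("normal") value is given, it gets no indicator:
        # it is encoded as all zeros.
        levels = sorted(set(values))
        if baseline is not None:
            levels.remove(baseline)
        return {level: [1 if v == level else 0 for v in values]
                for level in levels}

    colors = ["normal", "red", "green", "red"]
    print(dummy_encode(colors))                     # indicators: green, normal, red
    print(dummy_encode(colors, baseline="normal"))  # indicators: green, red only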
Hands on: There is a widget for such conversions in Orange (Continuize). See what it offers.
From continuous to discrete

This procedure is called categorization in statistics, since it splits the data into "categories". For instance, the patients can be split into categories young, middle-aged and old. In machine learning this is called discretization.

Discretization can be unsupervised or supervised (there go these ugly terms again!). Unsupervised discretization is based on the values of the variable alone. Most typically we split the data into quartiles, that is, into four approximately equal groups based on the value of the attribute. We could thus have the youngest quarter of the patients, the second youngest, the older and the oldest quarter. This procedure is simple and usually works best. Instead of four we can also use another number of intervals, although we seldom use more than 5 or fewer than 3 (unless we have a very good reason for doing so).

Supervised methods set the boundaries (also called thresholds) between intervals so that the data is split into groups which differ in class membership. Put simply, say that a certain disease tends to be more common for people older than 70. 70 is thus the natural threshold which splits the patients into two groups, and it does not really matter whether it splits the sample of patients in half or where the quartiles lie.

The supervised methods are based on optimizing a function which defines the quality of the split. One can, for instance, compute the chi-square statistic to compare the distribution of outcomes (class distributions, as we would say in machine learning) for those above and below a certain threshold. Different thresholds yield different groups and thus different chi-squares. We can then simply choose the threshold with the highest chi-square.
A popular approach in machine learning is to compute the so-called information gain of the split, which measures how much information about the instance's class (e.g. disease outcome) we get from knowing whether the patient is above or below a certain threshold. While related to chi-square and similar measures, one nice thing about this approach is that it can evaluate whether setting the threshold gives more information than it requires. In other words, it includes a criterion which tells us whether to split the interval or not.

With that, we can run the method in an automatic mode which splits the values of the continuous variable into ever smaller subintervals until it decides it's enough. Quite often it will not split the variable at all, hence proposing to use "a single interval". This implicitly tells us that the variable is useless (that is, the variable tells us nothing, it has an information gain of 0 for any split) and we can discard it.
You can combine supervised methods with manual setting of thresholds. The figure above shows a part of Orange's Discretize widget. The graph shows the chi-square statistic (y axis) for the attribute we would get if we split the interval of values of the continuous attribute in two at the given threshold (x axis). The gray curve shows the probability of class Y (right-hand y axis) at various values of the attribute. Histograms at the bottom and the top show the distributions of the attribute's values for both classes.
Finally, there may exist a standard categorization for the feature, so you can forget about the unsupervised and supervised methods, and use the expert-provided intervals instead. For instance, by some Japanese definition, a BMI between 18.5 and 22.9 is normal, 23.0 to 24.9 is overweight and 25.0 and above is obese. These numbers were decided with some reason, so you can just use them unless you have a particular reason for defining other thresholds.
While discretization would seem to decrease the precision of measurements, the discretized data is usually just as good as the original numerical data. It may even increase the accuracy when the data has some strong outliers, that is, measurements out of scale. By splitting the data into four intervals, the outliers are simply put into the last interval and do not affect the data as strongly as before. (For those more versed in statistics: this has a similar effect as using a non-parametric test instead of a parametric one, for instance Wilcoxon's test instead of the t-test.)
Hands on: Load the heart disease data from the documentation data sets and observe how the Discretize widget discretizes the data. In particular, observe the Age attribute. Select only women and see if you can find anything interesting. (Widgets: Discretize, Select Data)
Variable selection

We often need to select a subset of variables and of data instances. Using fewer variables can often result in better predictive models in terms of accuracy and stability.
- Decide which attributes are really important. If you take too many, you are going to have a mess. If you remove too many, you may miss something important.
- As we already mentioned, you can remove attributes with too many missing values. These will be problematic for many data mining procedures, and are not very useful anyway, since they don't give you much data. Also, if they weren't measured when your data was collected (too expensive? difficult? invasive or painful?), think about whether those who will use your model will be prepared to measure them.
- If you have more attributes with a similar meaning, it is often beneficial to keep only one, unless they are measured separately and there is a lot of noise, so you can use multiple attributes to average out the noise. (You can decide later which one you are keeping.)
- If you are building a predictive model, verify which attributes will be available when the prediction needs to be made. Remove the attributes that are only measured later, or those whose values depend upon the prediction (for instance, drugs that are administered depending upon the prediction for a cancer recurrence).

Data mining software can help you in making these decisions.
First, there are simple univariate measures of attribute quality. With two-class problems, you can measure the attribute significance by a t-test for continuous and chi-square for discrete attributes. For multi-class problems, you have ANOVA for continuous and chi-square (again) for discrete attributes. In regression problems you probably want to check the univariate regression coefficients (I honestly confess that I haven't done so yet).
Second, you may observe the relations between attributes and the relations between attributes and the outcome graphically. Sometimes relations will just pop up, so you want to keep the attribute, and sometimes it will be obvious that there are none, so you want to discard them.
Finally, if you are constructing a supervised model, the features can be chosen during the model induction. This is called model-based variable selection or, in machine learning, a wrapper approach to feature subset selection. In any case, there are three basic methods. Backward feature selection starts with all variables, constructs the model and then checks, one by one, which variable it can exclude without harming (or with the least harm to) the predictive accuracy (or another measure of quality) of the model. It removes that variable and repeats the process until any further removal would drastically decrease the model's quality. Forward feature selection starts with no variables and adds one at a time, always the one from which the model profits most. It stops when we gain nothing more by further additions. Finally, in stepwise selection, one feature can be added or removed at each step.
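As pseudocode-flavored Python, forward selection might look like the sketch below; fit_and_score stands for whatever model and quality measure you plug in (it is an assumed callback, not a fixed API):

    def forward_selection(all_features, fit_and_score, min_gain=0.001):
        # Greedily add the feature which improves the model most;
        # stop when nothing improves the score by at least min_gain.
        selected, best_score = [], float("-inf")
        while True:
            candidates = [f for f in all_features if f not in selected]
            if not candidates:
                break
            # try each remaining feature and keep the most helpful one
            scored = [(fit_and_score(selected + [f]), f) for f in candidates]
            score, feature = max(scored)
            if score < best_score + min_gain:
                break                        # no worthwhile improvement: stop
            selected.append(feature)
            best_score = score
        return selected

Backward selection is the mirror image: start from all features and repeatedly drop the one whose removal hurts the score least.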
The figure shows a part of Orange's widget for ranking variables according to various measures (we won't go into details about what they are, but we can if you are interested; just ask!).
Note that the general problem with univariate measures (that is, those which only consider a single attribute at a time) is that they cannot assess (or have difficulties with assessing) the importance of attributes which are important in the context of other attributes. If a certain disease is common among young males and old females, then neither gender nor age by itself tells us anything, so both variables would seem worthless. Considered together, they tell us a lot.
Hands on: There are two widgets for selecting a subset of attributes. One is concentrated on estimating the attribute quality (Rank); with the other (Select Attributes) we can remove or add attributes and change their roles (class attribute, meta attributes). Especially the latter is very useful when exploring the data, for instance to test different subsets of attributes when building prediction models.
Feature construction and dimensionality reduction

Simplifying variables

You can often achieve better results by merging several values of a discrete variable into a single one. Sometimes you may also want to simplify the problem by merging the class values. For instance, in non-alcoholic fatty liver disease, which I'm studying right now, the liver condition is classified into one of four Brunt stages, yet most (or even all) papers on this topic which construct predictive models converted the problem into a binary one, for instance whether the stage is 0, 1 or 2 (Outcome=0), or 3 or 4 (Outcome=1). Reformulated like this, the problem is easier to solve. Besides that, many statistical and machine learning methods only work with binary and not multivalued outcomes.

There are no widgets for this in Orange. You have to do it in, say, Excel. Orange can help you in deciding which values are "compatible" in a sense. We will learn about this next time, as we explore visualizations.
Construction of derived variables

Sometimes it is useful to combine several features into one, e.g. the presence of neurological, pulmonary and cardiovascular problems into general problems. There can be various reasons for doing this, the simplest being a small proportion of patients with each single disease. Some features need to be combined because of interactions, as in the case described above: while age and gender considered together may give all the information we need, most statistical and machine learning methods just cannot use them, since they consider (and discard) them one at a time. We need to combine the two variables into a new one, telling us whether the patient belongs to one of the increased-risk groups.
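As a toy sketch of such a derived variable (the age threshold of 40 is invented; the young-male/old-female disease is the naïve example from the variable selection section):

    def increased_risk(age, gender, threshold=40):
        # Combine two interacting features into one 0/1 indicator:
        # 1 for young males and old females, 0 otherwise.
        young = age < threshold
        if (young and gender == "male") or (not young and gender == "female"):
            return 1
        return 0

    print(increased_risk(25, "male"))    # 1
    print(increased_risk(25, "female"))  # 0
    print(increased_risk(70, "female"))  # 1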
Finally, we should mention (but really only mention) the so-called curse of dimensionality. In many areas of science, particularly in genetics, we have more variables than instances. For instance, we measure expressions of 30,000 genes on 100 human cancer tissues. A rule of thumb (but nothing more than a rule of thumb) in statistics is that the sample size must be at least 30 times the number of variables, so we certainly cannot run, say, linear or logistic regression (nor any machine learning algorithm) on this data.

There are no easy solutions to this problem. For instance, one may apply principal component analysis (or supervised principal component analysis or similar procedures), a statistical procedure which can construct a small set of, say, 10 variables computed from the entire set of variables in such a way that these 10 variables carry as much (or all) of the information from the original variables. The problem is that principal component analysis itself would also require, well, 30 times more data instances than variables (we can bend this rule, but not to the point when the number of variables exceeds the sample size, at least not by that much).
OK, if we cannot keep the information from all variables, can we just discard the majority of variables, since we know that they are not important? Out of 30,000 genes, only a couple, a dozen maybe, are related to the disease we study. We could do this, but how do we determine which genes are related to the disease? T-tests or information gain or similar measures for each gene separately wouldn't work: with so many genes and such a small sample, many genes will be randomly related to what we are trying to predict. If we are unlucky (and in this context we are), the randomly related genes will look more related to the disease than those which are actually relevant.
The only way out is to use supplementary data sources to filter the variables. In genetics, one may consider only genes which are already known to be related to the disease, and genes which are known to be related to these genes. More advanced methods use variables which represent average expressions (or whatever we measure) of groups of genes, where these groups are defined based on a large number of independent experiments, on gene sequence similarities, co-expressions and similar.

The area of dimensionality reduction is on the frontier of data mining in genetics. The progress in discovering the genetics of cancer and potential cures lies not so much in collecting more data but rather in finding ways to split the data we have into irrelevant and relevant.
Visualization

We shall explore some general-purpose visualizations offered by Orange and similar packages.

Histograms

The histogram is a common visualization for representing absolute or relative distributions of discrete or numeric data. In line with its machine learning origins, Orange adds functionality related to classified data, that is, data in which instances belong to two or more classes. The widget for plotting histograms is called Distributions. The picture below shows the gender distribution on the heart disease data set, which we briefly worked with in the last lecture.
On the left-hand side, the user can choose the variable whose distribution(s) he wants to observe. Here we chose the discrete attribute gender. The scale on the left side of the plot corresponds to the number of instances with each value of the attribute, which is shown by the heights of the bars. Colored regions of the bars correspond to the two classes: blue represents outcome "0" (vessels around the heart are not narrowed), and red parts represent the patients with clogged vessels (outcome "1"). The data contains twice as many men as women. With regard to proportions, males are much more likely to be "red" (have the disease) than females.

To compare such probabilities, we can use the scale on the right, which corresponds to probabilities of the target value (we chose target value "1", see the settings on the left). The circles and lines drawn over the bars show these probabilities: for females it's around 25% (with a confidence interval of approx. 20-30%), and for males it's around 55% (51-59). You can also see these numbers by moving the mouse cursor over parts of the bar.
Note: to see the confidence intervals, you need to click Settings (top left), and then check Show confidence intervals.
Similar graphs can be drawn for numeric attributes, for instance SBP at rest. The number of bars can be set in Settings. Probabilities are now shown with the thicker line in the middle, and the thinner lines represent the confidence intervals. You can notice how these spread out on the left and right side, where there are fewer data instances available to make the estimation.
Hands-on exercise: Plot the histogram for the age distribution. Do you notice something interesting with regard to probabilities?
Scatter plot

Scatter plots show individual data instances as symbols whose positions in the plane are defined by the values of two attributes. Additional features can be represented by the shape, color and size of the symbols. There are many ways to tweak the scatter plot into even more interesting visualizations. The Gap Minder application, with which we amused ourselves in our first lecture (Japan vs. Slovenia vs. China), is nothing else than a scatter plot with an additional slider to browse through historic data.

Although scatter plot visualizations in data mining software are less fancy than those in specialized applications like Gap Minder, they are still the most useful and versatile visualization we have, in my opinion.
Here is one of my favorites: the patients' age versus maximal heart rate during exercise. Colors again tell whether the patient has clogged vessels or not.

First, disregarding the colors, we see the relation between the two variables (this is what scatter plots are most useful for). Younger patients can reach higher heart rates (the bottom right side of the plot is empty). Among older patients, there are some who can still reach heart rates almost as high as youngsters (although the topmost right area is a bit empty), while some patients' heart rates are rather low. The dots are spread in a kind of triangle, and the age and heart rate are inversely correlated: the older the patient, the lower the heart rate.

Furthermore, the dots at the bottom right are almost exclusively red, so these are the people with clogged vessels. At the top left, people are young and healthy, while those at the top right can be both.
This is a typical example of how data mining goes: at this point we can start exploring the pattern. Is it true or just coincidental? The triangular shape of the dot cloud seems rather convincing, so let's say we believe it. If so, what is the reason? Can we explain it? A naïve explanation could be that when we age, some people still exercise a lot and their hearts are "as good as new", while some are not this lucky.

This is not necessarily true. When I showed this graph to colleagues with more knowledge in cardiology, they told me that the lower heart rate may be caused by beta blockers, so this can be a kind of confounding factor which is not included in the data. If this is the case, then the heart rate tells us (or is at least related to) whether the patient is taking beta blockers or not.
To verify this we could set the symbol size to denote the rest SBP (to make the picture clearer, I also increased the symbol size and transparency, see the Settings tab). Instead of the rest SBP, we could also plot the ST segment variable. I don't know the data and the problem well enough to be able to say whether this really tells us something or not.

The icons at the bottom let us choose tools for magnifying parts of the graph or for selecting data in the graph (icons with the rectangle and the lasso). When examples are selected, the widget sends them to any widgets connected to its output.
Hands-on exercise: Draw the scatter plot so that the colors represent the gender instead of the outcome. Does it seem that the patients whose heart rates go down are prevalently of one gender (which)? Can you verify whether this is actually the case? (Hint: select these examples and use the Attribute Statistics widget to compare the distribution of genders on the entire data set and on the selected sample.)
The widget has many more tricks and uses. For instance, it has multiple outputs. You can give it a subset of examples which come from another widget. It will show the corresponding examples from this subset with filled symbols, while other symbols will be hollow. In the example below we used the Select Data widget to select patients with exercise-induced angina and rest SBP below 131, and show the group as a subgroup in the Scatter plot. Note that this schema is really simple: many more widgets can output data, and the subset plotted can be far more interesting than this one.
Hands-on exercise: Construct the schema above and see what happens in the Scatter plot.

Hands-on exercise: Let's explore the difference between genders in some more detail. Remember the interesting finding about the probabilities of having clogged vessels with respect to the age of the patient? Use the Select Data widget to separate women and men and then use two Distributions widgets to plot the probability of disease with respect to age for each gender separately. Can you explain this very interesting finding (although not unexpected, when you think about it) from a medical point of view?
There is another widget for plotting scatter plots, in which the positions of instances can be defined by more than two attributes. It is called Linear Projection. To plot meaningful plots, all data needs to be either continuous or binary. We will feed it through the widget Continuize to ensure this.
On the left-hand side we can select the variables we want visualized. For the heart disease data, we will select all except the outcome, diameter narrowing. Linear projection does not use perpendicular axes, as the ordinary scatter plot does, but can instead use an arbitrary arrangement of axes. The initial arrangement is circular. To get a useful visualization, click the FreeViz button at the bottom. In the dialog that shows up, select Optimize separation. The visualization shows you, basically, which variables are typical for patients who do and do not have narrowed vessels.

Alternatively, in the FreeViz dialog you can choose the Projections tab and then Supervised principal component analysis.
For an easier explanation of how to interpret the visualization, here is a visualization of another data set, in which the task is to predict the kind of animal based on its features. We have added some colors by clicking Show probabilities in the Settings tab.

Each dot represents an animal, with colors corresponding to different types of animals. Positions are computed from the animal's features, e.g. those which are airborne are in the upper left (and those which are airborne and have teeth would be in the middle; tough luck, but there are probably not many such animals, which is also the reason why the features airborne and toothed are on opposite sides).
There are quite a few hypotheses (some true and some false) that we can make from this picture. For instance:

- Mammals are hairy and give milk; they are also cat-sized, although this is not that typical.
- Airborne animals are insects and birds.
- Birds have feathers.
- Having feathers and being airborne are related features.
- Animals either give milk or lay eggs; these two features are on opposite sides.
- Similarly, some animals are aquatic and have fins, and some have legs and breathe.
- Whether the animal is venomous or domestic tells us nothing about its kind.
- Backbone is typical for animals with fins. (?!)

Some of these hypotheses are partially true and the last one is clearly wrong. The problem is, obviously, that all features need to be placed somewhere on the two-dimensional plane, so they cannot all have a meaningful place. The widget and the corresponding optimization can be a powerful generator of hypotheses, but all these hypotheses need to be verified against common sense and additional data.
Parallel coordinates

In the parallel coordinates visualization, the axes which represent the features are placed parallel to each other, and each data instance is represented with a line which goes through the points on the axes that correspond to the instance's features.

On the zoo data set, each line represents one animal. We see that mammals give milk and have hair (most pink lines go through the top of the two axes representing milk and hair). Feathery animals are exclusively birds (all lines through the top of the feathers axis are red). Some birds are airborne and some are not (red lines split between the feathers and the airborne axes). Fish (green lines) do not give milk, have no hair, have a backbone, no feathers, cannot fly, and lay eggs (we can see this by following the green lines).

Hands-on exercise: Draw the following graph for the heart disease data. Select the same subset of features and arrange them in the same order. Set everything to make the visualization appear as below.
Can you see the same age vs. maximal heart rate relation as we spotted in the scatter plot? Anything else which might seem interesting? Are females in the data really older than males? Is there any relation between the maximal heart rate and the type of chest pain?
Mosaic plot

In the Mosaic plot we split the data set into ever smaller subsets based on the values of discrete (or discretized) attributes. We can see the proportions of examples in each group and the class distribution. The figures below show consecutive splits of the zoo data set.
Hairy animals represent almost one half of the data set (the right box in the first picture is almost as wide as the left box). (Note: this data set does not represent an unbiased sample of the animal kingdom.) The "hairy" box is violet, therefore most hairy animals are mammals. The yellowish part represents the hairy insects. There is almost no violet in the left box, therefore there are almost no non-hairy mammals.

We split each of the two subgroups (non-hairy, hairy) by the number of legs. Most hairy animals have four legs, followed by those with two and those with six. There are only a few with no legs. Four- and two-legged hairy creatures are mammals, and six-legged ones are insects.
A third of non-hairy animals have no legs. Half of them (in our data set only, not in nature!) are fish, followed by a few invertebrates, mammals and reptiles. One third of non-hairy animals have two legs; all of them are birds. Four-legged non-hairies are amphibians and reptiles, and the six-legged ones are mostly insects.
For the last split, we have split each group according to whether they are aquatic or not. All mammals without legs are aquatic, while the number of aquatics among the two- and four-legged ones is small (the vertical strips on the right-hand side are thin). There are no hairy aquatic insects.

The majority of non-hairy animals without legs are aquatic: fish, and some mammals and invertebrates. Non-aquatic ones are invertebrates and reptiles. As we discovered before, two-legged non-hairy creatures are birds (red). One third of them are aquatic and two thirds are not. Four-legged non-hairy animals can be non-aquatic reptiles, while a two-thirds majority goes to the aquatic amphibians.
Mosaic plots can also show a priori distributions or expected distributions (in a manner similar to those of the chi-square contingency table, that is, the expected number of instances computed from the marginal distributions). The plot below splits the data on heart disease by age group and gender. The small strips on the left side of each box show the prior class distribution (the proportion of patients with and without narrowed vessels on the entire data set).

The occurrence of narrowed vessels is very small among young and old females, whereas in the middle age group the proportion is far larger (but still below the total data set average! the problem is so much more common among males!). For males, the proportion simply increases with age.
Hands-on exercise: How did we actually plot this? If we feed the data with continuous variables into the Mosaic widget, it will automatically split the values of numeric attributes into intervals using one of the built-in discretization methods. Here we manually defined the age groups by using the Discretize widget. Try to do it yourself! (Hint: put Discretize between File and Mosaic. In Discretize, click Explore and set individual discretizations. Select Age and enter the interval thresholds into the Custom 1 box, or define the intervals on the graph below: left-clicking adds and right-clicking removes thresholds, and you can also drag thresholds around. Do not forget to click Commit in the widgets, or check Commit automatically.)
Sieve diagram

The Sieve diagram, also called a parquet diagram, is a visualization of two-way contingency tables, similar to those used in chi-square. The rectangle representing the data set is split horizontally and vertically according to the proportions of the values of two discrete attributes.

Note the difference between this and the mosaic plot. In the mosaic plot, the data is split into ever smaller subgroups: each age group is split separately into groups according to the distribution of genders within that specific age group. The size of the box in the Mosaic plot thus corresponds to the observed (actual) number of instances of each kind. In contrast, the Sieve diagram splits by the two attributes independently, so the size of each rectangle reflects the expected number of instances. This number can be compared to the observed number: the Sieve plot shows the comparison by grid density and color (blue represents a higher than expected number of instances and red represents a lower than expected number). The diagram above suggests that, for instance, younger males are more common than expected according to the prior distribution. We can check the number by moving the mouse cursor over the corresponding box.

Given that the younger age group includes 39.27% of the data and males represent 67.99%, we would expect 39.27% × 67.99% = 26.70% young males. The actual number is 28.71%, which is more than expected.
This difference is, however, negligible. Here is a better example. In the figure below, the vertical split follows the distribution of genders (the male part is twice the height of the female part), and the horizontal split corresponds to the results of the thal test (one half of the instances have normal results and most others have a reversible defect).

The number of males with a normal thal test is 28.57% (86 patients), which is less than the expected 37.56% (113 patients). The number of women with a normal thal test is unexpectedly high: 80 instead of the expected 53. And so forth.
"Intelligent" visualizations
There are many different scatter plots we can draw for some data, and the number of different, more complex visualizations is even higher. Orange features a set of tools for finding good data visualizations. For instance, this plot (a variation of the scatter plot called RadViz), which was found automatically by Orange, uses the expressions of 7 genes out of 2308 to distinguish between the four types of SRBCT cancer. In most widgets you will find buttons Optimize or VizRank (Visualization Ranking), which will try out a huge number of different visualizations and present those which seem most interesting.
Data Modeling (Learning, Clustering)

Modeling refers to describing the data in a more abstract form called, obviously, a model. The model (usually) does not include the data on the individual instances from which it is constructed, but rather describes the general properties of the entire data set (machine learning often refers to this as generalization).

We usually distinguish between unsupervised and supervised modeling. Roughly, imagine that data instances belong to some kind of groups and our task is to describe these groups. The computer's work can be supervised in the sense that the computer is told which instances belong to which group. We called these groups "classes"; we say that we know the class of each example. The computer needs to learn how to distinguish between the groups. Such knowledge is represented by a model that, if in a proper form, can teach us something (new, potentially useful, actionable) about the data. Moreover, we can use it to determine the group membership of new data instances. For example, if the data represents two groups of people, some sick and some not, the resulting model can classify new patients into one of these two groups.

In unsupervised learning, the instances are not split into classes, so the computer is not guided (supervised) in the same sense as above. The task is typically to find a suitable grouping of instances. In statistics, this is called clustering (a cluster being a synonym for a group). We also often use visual methods which plot the data in such a way that groups become obvious to the human observer, who can then manually "draw the borders" between them or use the observed grouping for whatever he needs.

(Figure: data for unsupervised learning and for supervised learning)
Predictive modeling / Learning

The task of predictive modeling is to use the given data (training examples, learning examples) to construct models which can predict classes or, even better, class probabilities for new, unseen instances. In the context of data mining, we would like these models to provide some insight into the data: the models should be interpretable; they should show patterns in the data. The models should be able to explain their predictions in terms of the features: the patient is probably sick, and the reason why we think so is because he has these and these symptoms.
Classification trees

Induction of classification trees (also called decision trees) is a popular traditional machine learning method, which produces models like the one in the figure below.

While this is visually pleasing, the tree as presented by the widget below may be easier to read.

We have used three widgets related to trees: Classification Tree gets the data and outputs a tree (this is the first widget we have met so far which does not output data instances but something else: a model!). Classification Tree Viewer and Classification Tree Graph get trees on their inputs and show them. They also provide output: when you click on tree nodes, they output all training examples which belong to the node. You will find all these widgets in the category Classify.
Say we get a new patient. The tree suggests that we should first perform the thal test (whatever this means). Let's say that the results are normal. For such patients we then ask about the chest pain; suppose that it's typical anginal. In that case, we check the age of the patient. Say he is 45, which is below 56.5. Therefore we predict a 0.04 probability of having the disease.
The algorithm for the construction of such trees from data works like this. First, it checks all available variables and finds the one which splits the data into subsets which are as distinct as possible; in the ideal case, the attribute would perfectly distinguish between the classes, in practice, it tries to get as close as possible. For our data, it decided for the thal test. Then it repeats the search for the best attribute in each of the two subgroups. Incidentally, the same feature (the type of chest pain) seems the best attribute in both. The process then continues for all subgroups until certain stopping criteria are met: the number of data instances in a group is too small to split them further, the subgroups are pure enough (ideally, all instances belong to the same class, so there is no need to split them any further), there are no more useful features to split by, the tree is too large, etc.
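For the curious, here is a compressed sketch of this greedy procedure, reusing the information_gain function from the discretization section. It is a bare-bones illustration (binary splits on numeric features only, crude stopping criteria), not the algorithm Orange actually uses:

    def majority(labels):
        return max(set(labels), key=labels.count)

    def build_tree(rows, labels, min_size=5):
        # rows: list of dicts {feature: value}; labels: the class of each row.
        # Stop on a pure node or when too few instances remain.
        if len(set(labels)) == 1 or len(rows) < min_size:
            return {"leaf": majority(labels)}
        best = (0.0, None, None)                # (gain, feature, threshold)
        for feature in rows[0]:
            values = [r[feature] for r in rows]
            for t in sorted(set(values))[:-1]:  # candidate thresholds
                gain = information_gain(values, labels, t)
                if gain > best[0]:
                    best = (gain, feature, t)
        gain, feature, t = best
        if feature is None:                     # no split gains anything: stop
            return {"leaf": majority(labels)}
        left = [(r, c) for r, c in zip(rows, labels) if r[feature] <= t]
        right = [(r, c) for r, c in zip(rows, labels) if r[feature] > t]
        return {"split": (feature, t),
                "left": build_tree([r for r, c in left], [c for r, c in left]),
                "right": build_tree([r for r, c in right], [c for r, c in right])}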
Trees are great for data mining, because

- there's no sophisticated mathematics behind them, so they are conceptually easy to understand,
- a tree can be printed out, analyzed and easily used in real-life situations,
- in data mining tools, we can play with them interactively; in Orange, we can connect the tree to other widgets, click the nodes of the tree to observe the subgroups and so on.

They are, however, not so great because

- while they may perform relatively well in terms of the proportion of correct predictions, they perform poorly in terms of probability estimates;
- they are "unstable": a small change in the data set, like removing a few instances, can result in a completely different tree. The same goes for changing the parameters of the learning algorithm. Trees give one perspective of the data, one possible model, not a special, unique, distinguished model;
- trees need to be kept small: a computer can easily induce a tree with a few hundred leaves, yet such trees are difficult to understand and to use, and the number of instances in individual leaves is then so small that we cannot "trust" them. On the other hand, small trees refer to only a small number of attributes, e.g. to only a small number of medical symptoms related to the disease.
The latter point is really important. Trees are good for conceptually simple problems, where a few features are enough to classify the instances. They are a really bad choice for problems in which the prediction must be made by observing and summing up the evidence from a larger number of features. A usefully small tree cannot contain more than, say, ten features. If making the prediction requires observing more variables, forget about trees.

Hands-on exercise: Construct a tree like the one above, then plot the same data in a Mosaic plot in which you select the attributes from the root of the tree and the first branches of the tree.
Classification rules

Classification rules are essentially similar to trees, except that instead of having a hierarchy of choices, they are described by lists of conditions, like those below.

The algorithm for the construction of such rules starts with an empty rule (no conditions; it predicts the same class for all examples). It then adds conditions which specialize the rule, fitting it to a particular group of instances, until it decides that it's enough (stopping criteria are based on the purity of the group, the number of instances covered by the rule and some statistical tests). It prints out the rule, removes the data instances covered by the rule and repeats the process with the remaining instances.
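A toy sketch of this covering loop is shown below; real rule learners such as CN2 search the space of conditions far more cleverly, while here a "rule" is a single feature=value condition chosen by the purity of the covered group:

    def learn_rules(rows, labels, min_covered=3):
        # rows: list of dicts {feature: value}. Repeatedly find the purest
        # single-condition rule, keep it, and remove the covered instances.
        rules = []
        while rows:
            best = None
            for feature in rows[0]:
                for value in set(r[feature] for r in rows):
                    covered = [c for r, c in zip(rows, labels)
                               if r[feature] == value]
                    if len(covered) < min_covered:
                        continue
                    maj = max(set(covered), key=covered.count)
                    purity = covered.count(maj) / len(covered)
                    if best is None or purity > best[0]:
                        best = (purity, feature, value, maj)
            if best is None:
                break                        # nothing left worth covering
            purity, feature, value, maj = best
            rules.append((feature, value, maj))
            keep = [(r, c) for r, c in zip(rows, labels) if r[feature] != value]
            rows = [r for r, c in keep]
            labels = [c for r, c in keep]
        return rules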
Hands-on exercise: Construct the schema below. Examples coming from the CN2 Rules Viewer should be connected to the Example Subset input of the Scatter plot (the simplest way to achieve this is to first draw the connection between File and Scatter plot and then between the CN2 Rules Viewer and the Scatter plot). Now click on individual rules in the CN2 Rules Viewer and the Scatter plot will mark the training instances covered by the rule.
Classification rules have similar properties as trees, except that their shortcomings are somewhat less acute: they can handle more attributes and are better at predicting probabilities. They may be especially applicable to problems with multiple classes, as in the following examples from our beloved zoo data set.
Naïve Bayesian Classifier and Logistic Regression

The naïve Bayesian classifier computes conditional class probabilities, with feature values as conditions. For instance, it computes the probability of having the disease for females and for males, the probability of having the disease for younger and older patients, for those with non-anginal pain, typical anginal pain and atypical anginal pain, and so on. To predict the overall probability of having the disease, it "sums up" the conditional probabilities corresponding to all the data about the patient.

The model can be simplified into summing "points" plotted in a nomogram, like the one shown below. Suppose we have a patient whose ST by exercise equals 2, his heart rate goes up to 200, and he feels asymptomatic chest pain. Other values are unknown. The naïve Bayesian classifier would give him +1 point for the ST by exercise, -3.75 for the heart rate, +1 for the chest pain; these mappings can be seen by "projecting" the positions of the points corresponding to the values onto the axis above the nomogram. Altogether, this gives +1 - 3.75 + 1 = -1.75 points. The scale below transforms the points into actual probabilities: -1.75 points correspond to around a 15% chance of having the disease.
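The last step, turning points into a probability, is just an inverse logit. A minimal sketch, assuming the points are log-odds added to the prior log-odds of the class (the usual construction behind such nomograms; with a 50% prior it reproduces the numbers from the example above):

    from math import exp, log

    def probability_from_points(points, prior=0.5):
        # Sum the evidence (log-odds points) and transform back to a probability.
        logit = log(prior / (1 - prior)) + sum(points)
        return 1 / (1 + exp(-logit))

    print(probability_from_points([+1, -3.75, +1]))  # about 0.148, i.e. around 15%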
There is a similar model from statistics called logistic regression. On the outside, they look the same: they can both be visualized using a nomogram. Mathematically, they differ in the way they compute the number of points given for each feature: the naïve Bayesian classifier observes each feature separately, while logistic regression tries to optimize the whole model at once. We shall not go into the mathematics here. We will only describe the practical consequences of the difference between the two approaches.
These models are just great.

- They are very intuitive and easy to explain; they can be printed out and used in the physician's office.
- They give surprisingly accurate predictions.
- The predictions can be easily explained: the strongest argument that the person from our example is healthy is his extremely high heart rate. The other two pieces of data we have, the ST by exercise and the type of chest pain, both actually oppose the prediction that he is healthy, yet their influence is smaller than that of the heart rate. The inside view into how the model makes predictions allows the expert to decide whether they make sense or whether they should be overruled and ignored.
- The model can show us which features are more important than others in general, without a particular given instance. ST by exercise and maximal heart rate are, in general, the two most informative features, since they have the longest axes: they can provide the strongest evidence (the greatest number of points) for or against the disease. The chest pain and thal test are the least important (among those shown in the pictures). A feature can also have zero influence, when all its values are on the vertical line (0 points). A feature can sometimes even speak only in favor of or only against the target class; there may be tests which can, if their results come out positive, tell that the patient is sick, while a negative result may mean nothing.
What about their shortcomings? For the naïve Bayesian classifier, its shortcomings come from observing (that is, computing the strength of) each feature separately from the others.

- In case of diseases whose prediction requires observing two interacting features (for a naïve example, a disease which attacks young men and old women), it cannot "combine" the two features. Such diseases (and, in general, such features) are however rare in practice.
- Another, more common problem are correlated features. Observing each feature separately, the naïve Bayesian classifier can count the same evidence twice. Consider a childhood disease whose probability decreases with the child's age, and there are no other factors related to the disease. If the data we collected contains the children's age and body height, the naïve Bayesian classifier will (wrongly?) consider height as an important factor, although this is only because of its relation with age. Counting the same piece of evidence twice causes the naïve Bayesian classifier to "overshoot" the probability predictions.
The latter is not much of a problem: the classifier is still pretty accurate. Yet the difference is important for comparison with logistic regression: logistic regression does not have this problem. The class probabilities it predicts are typically very accurate. And the price?

- Logistic regression cannot be used to evaluate the importance of individual features. In the childhood disease example from above, logistic regression might say that the disease probability is related to age alone, and not to height, which would be correct. On the other hand, it can say that the only important attribute is body height, while age cannot offer any new evidence, which is misleading. Usually it does something in between.
Logistic regressioncannothandleunknownvalues.Alldatahas tobeknown toclassifyapatient.NotsofornaveBayesianclassifierforwhichwecanaddjustnewevidencewhenwegetit.
Both models, especially the naïve Bayesian classifier, are good for summing up the evidence from a large number of features. They can learn from a single feature, but they can also sum up small pieces of evidence from thousands of features. For instance, naïve Bayesian classifiers are often used for spam filtering, in which each word from the message (and all other properties of the message) contributes some evidence for the mail being spam or ham.
While we avoided any theory here, there is really a lot of it behind (unlike classification trees, which are a simple machine learning hack). Logistic regression is a modification of linear regression which uses the logistic function for modeling probabilities. The "points" assigned to features are actually log-odds ratios, which are a more practical alternative to conditional probabilities. The naïve Bayesian classifier represents the simplest case of a Bayesian network, a model which can incorporate probabilistic causal models. Moreover, there is a whole branch of statistics, called Bayesian statistics, which represents an alternative to inferential statistics (testing of hypotheses etc.) and is growing fast in popularity and usefulness.
Other methods

There are many other similar methods. Here we only mentioned the simple ones, yet, in my experience, the naïve Bayesian classifier can easily compete (and lose by only a small difference) with the most powerful and complex machine learning algorithms like, for instance, support vector machines. If you want models which are simple to use and understand, you should use logistic regression and naïve Bayesian classifiers. The possible few percent loss in performance will be more than paid off by your ability to explain and, if needed, overrule the predictions they make.
Unsupervisedmodeling
Distances
Unsupervised modeling is often based on a notion of distance or difference between data points. In order to do any grouping based on similarity, we need to somehow define the similarity (or its opposite: dissimilarity, difference).

When dealing with ordinary data, the most common definition is based on the Euclidean distance. If a and b are two data instances described by values (a1, a2, a3) and (b1, b2, b3), we can define the difference between them as (a1-b1)² + (a2-b2)² + (a3-b3)² (we can take the square root of that, but it won't change much). We only need to be careful to put all numbers on the same scale: a difference of 0.1 m in body height may mean much more than a difference of 0.1 kg in body weight. The simplest (although not necessarily the best) way of putting the data on the same scale is to normalize all variables by subtracting the average and dividing by the standard deviation.
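For the technically inclined, here is a minimal sketch of this computation in Python with NumPy; the height/weight values are invented for illustration and this is not Orange's internal code:

    import numpy as np

    # Toy data: each row is one instance, columns are height (m) and weight (kg).
    data = np.array([[1.62, 54.0],
                     [1.80, 81.0],
                     [1.75, 68.0]])

    # Normalize each column: subtract the average and divide by the standard
    # deviation, so that height and weight are on the same scale.
    normalized = (data - data.mean(axis=0)) / data.std(axis=0)

    # Euclidean distance between the first two instances.
    distance = np.sqrt(np.sum((normalized[0] - normalized[1]) ** 2))
    print(distance)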
For discrete data, we usually say that when two values are different, the difference between them is 1.

Sometimes the distance is not so trivial, especially when the data instances are not described by features/attributes/variables. Say that the objects of my study are lecturers at Kyoto University (including occasional speakers, like me) and we want to cluster them based on their scientific areas. How do we define a meaningful difference between them? This depends, of course, on the purpose of the grouping. Say that I would like to group them by their fields of expertise (and that I don't know the departments at which they actually work). The problem is actually easier than it looks: we have access to public databases of scientific papers. Say that Dr. Aoki and I have written 3 papers together, while altogether (that is, Dr. Aoki, me or both of us in coauthorship) we wrote 36 papers (I am making this up!). We can then say that the similarity between us is 3/36 or 0.08. The similarity between Dr. Nakayama and me is then 0 (we have no common papers, so the similarity is zero divided by something). Two professors who wrote all their papers together would have a similarity of 1. To compute the difference, we simply subtract the similarity from 1: the difference between Dr. Aoki and me would be 1 - 0.08 = 0.92. This definition of similarity is known as the Jaccard index (http://en.wikipedia.org/wiki/Jaccard_index).
The purpose of this discussion is not to present the Jaccard index but rather to demonstrate that there may be cases in which the definition of distance/difference/similarity requires a bit of creativity. (See an example on the web page: a "map" of scientific journals from computer science, where the similarity between a pair of journals is based on whether the same people contribute papers to both of them.)
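If you ever need it, the Jaccard-based difference from the example above is easy to compute yourself; a minimal sketch in plain Python, with paper sets invented to reproduce the numbers above:

    def jaccard_difference(papers_a, papers_b):
        """1 minus |common papers| / |all papers| for two sets of papers."""
        common = len(papers_a & papers_b)
        altogether = len(papers_a | papers_b)
        return 1 - common / altogether

    # Invented paper IDs: 3 common papers out of 36 altogether -> about 0.92.
    aoki = set(range(1, 20))        # papers 1..19
    demsar = set(range(17, 37))     # papers 17..36
    print(jaccard_difference(aoki, demsar))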
Hierarchical clustering
We will show how this works on animals. All data in the zoo data set is discrete, so the dissimilarity (distance) between two animals equals the number of features in which they differ. The schema in Orange Canvas looks like this:

The Example Distance widget gets a set of training instances and computes the distances between them. A couple of definitions are available, but all are attribute-based in the sense that they somehow sum up the differences between values of individual attributes.
Hierarchical clustering of this data is shown in the left picture, and the right picture shows a part of it. (I haven't constructed the data and have no idea why girls were included as animals similar to bears and dolphins.)

The figure is called a dendrogram. Probably not much explanation is needed with regard to what it means and how it needs to be read: it merges the animals into ever larger groups or, read from left to right, splits them into ever smaller groups.

Clustering is intended to be informative by itself. If I came from another planet and had no idea about the Earthly animals, this could help me organize them into a meaningful hierarchy.
We can play with the clustering widget by drawing a vertical threshold and "cutting" the clustering at a certain point, defining a certain number of distinct groups.

For the curious: this is how clustering actually works. The method goes from the bottom up (or, in this visualization, from right to left). At each step, it merges the two most similar clusters into a common cluster. It starts so that each animal (object, in general) has its own cluster, and finishes the merging when there is only a single cluster left.
The only detail to work out is how we define distances between clusters, if we know (that is, have defined, using Euclidean distances or the Jaccard index or whatever) the distance between single objects. The first choice is obvious: the distance between two clusters can be the average distance between all objects in one cluster and objects in another. This is called average linkage. The distance can also be defined as the distance between the closest objects belonging to the two clusters, the so-called minimal linkage. As you can guess, the maximal linkage is the distance between the most distant pair of objects belonging to the two clusters. And nobody can guess what the fourth option, Ward linkage, is. Neither does it matter: remember that the term linkage refers to the way clusters are compared to each other. When constructing the clustering, try all of them and find the one which gives the nicest picture.

The length of horizontal lines in the dendrogram corresponds to the similarity between clusters. The objects on the lowest level are similar to each other; later the distances usually grow. The optimal number of clusters is typically one at which we notice a sharp jump in the distances/line lengths.
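The same bottom-up procedure is implemented in many libraries; here is a minimal sketch with SciPy, assuming a small invented data matrix. The method names 'average', 'single' and 'complete' correspond to the average, minimal and maximal linkage described above (and 'ward' to Ward linkage):

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster
    from scipy.spatial.distance import pdist

    data = np.random.rand(10, 4)    # ten invented instances with four features
    distances = pdist(data)         # pairwise Euclidean distances

    # Merge the two closest clusters at each step, using average linkage;
    # 'single', 'complete' and 'ward' are the other options discussed above.
    tree = linkage(distances, method='average')

    # "Cut" the dendrogram into three clusters, like drawing the threshold line.
    labels = fcluster(tree, t=3, criterion='maxclust')
    print(labels)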
We can also turn the data around and compute distances between variables instead of between animals. We can, for instance, compute the Pearson correlation coefficient (which is actually closely related to the Euclidean distance).
Results may seem a bit strange at first: having hair is closely related to giving milk, that's OK. But having milk and laying eggs?! Being aquatic, having fins and breathing?! It makes sense, though: laying eggs and giving milk are strongly correlated, though the correlation is in this case negative. Knowing that the animal is aquatic tells us something about it having fins, but also something about whether it can breathe or not. We could, of course, define the distance differently and get a clustering of features which actually co-occur (positive correlation). As always in unsupervised modeling, it's all a question of how we define the distance.
Hands-on Exercise
 Load the data set called Iris.tab, compute Euclidean distances, construct a hierarchical clustering, and connect the result to the Scatter plot (the middle part of the schema below).
 For the clustering widget, see the recommended settings on the right. In particular, check Show cutoff line and Append cluster IDs, and set the place to Class attribute.
 The best pair of attributes to observe in the Scatter plot is petal length and petal width.
 Now you can put a cutoff line into the dendrogram; the dendrogram will be split into discrete clusters shown by colors and the widget will output instances classified according to their cluster membership, that is, the "class" of an instance will correspond to the cluster it belongs to. The Scatter plot will color the instances according to this cluster-based class instead of the original class. Play with the widget by setting different thresholds and observing the results in the Scatter plot. (If the widget does not output examples, uncheck Commit on change and check it again, or leave it unchecked and push Commit when you change the threshold; sorry for the bug.)
 You can also attach the Distribution widget to the clustering. In this case, change the Place in the Hierarchical clustering widget from Class attribute to Attribute. The ID of the cluster will now be added to instances as an ordinary attribute, not the class. By observing the distribution of iris types (the original class) within each cluster, we can see how well the constructed groups match the classes. In other words, this shows how well unsupervised learning can function in problems for which we could also use supervised learning.
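The same check can also be scripted outside Orange; a minimal sketch, assuming scikit-learn with its built-in copy of the iris data in place of Iris.tab and its agglomerative clustering in place of the widget:

    from collections import Counter
    from sklearn.datasets import load_iris
    from sklearn.cluster import AgglomerativeClustering

    iris = load_iris()
    clusters = AgglomerativeClustering(n_clusters=3).fit_predict(iris.data)

    # For each cluster, count the original iris types that fall into it.
    for cluster_id in range(3):
        species = Counter(iris.target_names[iris.target[clusters == cluster_id]])
        print(cluster_id, dict(species))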
K-means clustering
K-means clustering splits the data into k clusters, where the parameter k is set manually. It always uses the Euclidean distance (or a similar measure, but let's not go into that). It cannot offer any fancy dendrograms. It simply splits the data into k groups and that's it.

The advantage of k-means clustering is that its clusters are often "better defined" (purely subjectively and intuitively speaking) than those we get by hierarchical clustering. In a sense, if two objects in k-means belong to the same cluster, this "means more" than if two objects are in the same cluster when using hierarchical clustering.
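As a minimal sketch of the idea (here with scikit-learn and invented data, which is an assumption on my part, not what Orange actually calls internally):

    import numpy as np
    from sklearn.cluster import KMeans

    data = np.random.rand(100, 2)          # invented two-dimensional data

    kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
    labels = kmeans.fit_predict(data)      # cluster ID for every instance
    print(kmeans.cluster_centers_)         # the three cluster centers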
Hands-on Exercise
 Use the schema below to analyze how well the clusters found by k-means clustering match the actual classes, like we did with hierarchical clustering before.
 Note that the input to k-Means is not a matrix of distances between instances, but the instances (examples) themselves.
 The k-Means widget can propose the number of clusters, or we can set it manually. Try setting it to 3 and see how it works. Then let the widget determine the optimal number of clusters itself. If it proposes two, check the scatter plots on the first page of this text. Yes, that's it. See one of the basic problems of clustering (and unsupervised learning in general)?
Multidimensional scaling
Clustering may sometimes seem too rigid: an object belongs to one and only one cluster. Especially in hierarchical clustering (but also in k-means) it can often happen that an object is actually more similar to objects from another cluster than to objects from its own cluster. (This seems wrong, but if you think about it, you'll discover it's true and unavoidable.)
Enter multidimensional scaling. You can imagine it like this: you would want to have a map of Honshu cities, but all you have are the distances between all pairs of cities. Can you use them to reconstruct the map? Sure you can. The resulting map may have a wrong orientation (the data does not tell you what's north and what's west), and it can even be mirrored, but you can reconstruct a map of Honshu which is correct in the sense that it puts Osaka, Nara and Kyoto close to each other and halfway between Tokyo and Hiroshima.

You can do the same with other data. We can construct a map of animals in which similar animals (where similarity is computed in any way we want, say using Euclidean distances, as before) are close to each other and animals which share properties of two animal groups are put in between them. As an Orange-specific twist, we can organize the image by connecting the most similar pairs.

See the image below. (I must again apologize for the rude and politically incorrect connection between girls and gorillas; I repeat once again that this is a data set from the internet which is commonly used to demonstrate machine learning methods. I have nothing to do with it.)
Note that this method does not define any clusters, like hierarchical clustering or k-means do. Yet it makes the clusters obvious to the user, and this is often enough.
Multidimensional scaling uses an optimization method which can get "stuck" in a certain placement. When using it, you should repeat the layout optimization process multiple times and select the result which you like the most, or even keep multiple maps if they, for instance, show the data from different interesting perspectives.
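A minimal sketch of the idea, assuming scikit-learn and an invented distance matrix (note the random_state: different values give different placements, which is exactly why repeating the optimization pays off):

    import numpy as np
    from sklearn.manifold import MDS

    # Invented symmetric distance matrix between four "cities".
    distances = np.array([[0, 5, 6, 9],
                          [5, 0, 2, 7],
                          [6, 2, 0, 4],
                          [9, 7, 4, 0]])

    # dissimilarity='precomputed' tells MDS we supply distances, not coordinates.
    mds = MDS(n_components=2, dissimilarity='precomputed', random_state=0)
    coordinates = mds.fit_transform(distances)
    print(coordinates)   # a 2D "map" preserving the distances as well as possible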
Visualization of networks
Networks are becoming a very popular model in genetics and also elsewhere. They consist of vertices (nodes), which represent objects and are connected by edges. The network can be constructed in different ways. We can construct it from distances like those in clustering: we can connect two objects if they are sufficiently close to each other. A connection can also represent the existence of a relation. For instance, we can connect pairs of proteins which interact with each other. Two scientists are connected if they coauthored at least one paper. Two symptoms can be connected if they often co-occur (e.g. sore throat and fever). Two airports are connected if there is a direct flight between them. Two hospitals are connected if patients are transferred between them.
Analysis of networks is somewhat similar to observing the multidimensional scaling map (moreover, the technique used for optimizing the layout is similar, too). We can find "islands" of mutually interrelated objects (in graph theory, these are called cliques): groups of collaborating scientists, collaborating hospitals, related symptoms...

We can do more than that, though: we can select a bunch of objects based on their properties and then tell the tool we use (say, the Network explorer widget in Orange) to select and show only their vicinity, e.g. the network up to, say, their second-degree neighbors (nodes which are at most two edges away from the selected nodes). In genetics, we can select a subset of genes for which it is known that they play a role in a certain hereditary disease and then observe the genes related to these genes, in hope that some of these may be related to the disease, too.
We can find the hubs of the network: the nodes which have a significantly larger number of connections than other nodes and thus hold the network together. In genetics, these may be some general-purpose genes, e.g. genes corresponding to some common proteins. In social networks, these may be the most influential members of the society. In another context, these nodes may represent Al Qaeda leaders...

A small collection of interesting networks is published on the web site.
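For scripting such analyses, here is a minimal sketch with the NetworkX library and an invented collaboration network (the node names and edges are made up):

    import networkx as nx

    # An invented network: edges connect collaborating "scientists".
    edges = [("A", "B"), ("A", "C"), ("A", "D"), ("A", "E"), ("B", "C"),
             ("D", "F"), ("F", "G")]
    graph = nx.Graph(edges)

    # Hubs: the nodes with the largest number of connections.
    print(sorted(graph.degree, key=lambda pair: pair[1], reverse=True)[:2])

    # The vicinity of a node: everything at most two edges away from it.
    vicinity = nx.single_source_shortest_path_length(graph, "G", cutoff=2)
    print(sorted(vicinity))   # the second-degree neighborhood of node G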
Evaluation of model performance
Last time, we discussed supervised and unsupervised modeling, which give us various models describing the data. Today we will learn how to measure the quality of the obtained models, either to choose the optimal model or to check the quality of the model to decide whether it is good enough to be used.

In the context of unsupervised modeling, like clustering, the quality of models is difficult to evaluate. In most cases we can only evaluate them subjectively, by checking whether they make sense, whether they conform to our expectations and look useful for whatever we want to use them for. Supervised learning, however, gives models which make predictions, so their quality can be measured objectively by observing the quality of predictions. In today's lecture we'll learn how to prepare the test data and discuss different measures of quality of models.
Sampling the data for testing
There is a very simple rule: never test the model on the data which was used for its construction.

The purpose of assessing the performance of models is to determine how well the model will behave if used in practice. Since the model will be used on new, yet unseen data instances (patients), it also has to be tested on data which it has not seen before. Otherwise the best model would simply be one which remembers all data and uses that knowledge when making predictions (or, better, "postdictions").

For this purpose, we use resampling techniques in which only a (randomly) chosen subset of data is used for constructing the model and the remaining part is used to compute its quality. This is repeated multiple times and in the end we report the average quality of the model over all trials.
The most common technique used is called cross-validation. It splits the data into (usually) ten subsets, called folds. In the first step, it reserves the first fold for testing: it uses data from folds 2-10 for constructing the model and then tests it on the first fold. Then it reserves the second fold for testing; folds 1 and 3-10 are used for constructing the model and fold 2 for testing it. And so on; at each step one fold is held out for testing and the others are used for learning.
Another sampling method is called leave-one-out or jackknife. Instead of leaving out one fold (10% of examples, when we do a 10-fold cross-validation), it leaves out only one example at a time. It constructs a model from all examples except this one, and tests the prediction on this example. This gives very reliable results, but takes a lot of time: if the data has 1000 instances, it will construct 1000 models.
In repeated random sampling, we choose a given number of instances (say 70%) for constructing the model, use the remaining 30% for testing, and repeat this whole procedure a certain number of times.

Besides these three, which are also available in Orange, there exist a number of other resampling methods, which are nicely described here: http://en.wikipedia.org/wiki/Resampling_(statistics).
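To make the mechanics concrete, here is a minimal sketch of ten-fold cross-validation; the data is invented and the use of scikit-learn's naïve Bayes is my assumption for illustration (Orange's Test Learners widget, which we meet below, does the equivalent for you):

    import numpy as np
    from sklearn.model_selection import KFold
    from sklearn.naive_bayes import GaussianNB

    X = np.random.rand(200, 5)              # invented features
    y = np.random.randint(0, 2, 200)        # invented class labels

    accuracies = []
    for train_index, test_index in KFold(n_splits=10, shuffle=True).split(X):
        # Learn on nine folds, test on the held-out fold.
        model = GaussianNB().fit(X[train_index], y[train_index])
        accuracies.append(model.score(X[test_index], y[test_index]))
    print(np.mean(accuracies))              # average quality over all trials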
A small problem with all these methods is that they give the average performance of the constructed models and not of a single model. Say that we decide that the best model for our data is the naïve Bayesian classifier, which has an AUC of 0.90 (we will soon see what AUC is, don't worry) based on the average over ten-fold cross-validation. Having decided, we use the entire data set to construct one single naïve Bayesian model. We however do not know the expected AUC of this model. We only know that naïve Bayesian models for this data have an average AUC of 0.90, while this model may be better or worse.

We usually do not worry about this. First, the final model was constructed from a bit larger data set: from all 10 folds and not just 9 of them. More data usually means better models. Second, most models are quite stable (classification trees excluded!). All ten naïve Bayesian models from cross-validation were probably quite similar to each other and to this final model.
There is, though, one bigger problem. Constructing the model usually involves a lot of manual work: everything which we covered in our past lectures. When we get the data, we will use various visualizations and other methods to select the features which we want to use in our models, we may construct new features from the existing ones, and we will fit various thresholds and parameters of learning algorithms (e.g. we will set the desired size of the trees). Although we may have used some resampling technique in the end, the entire data set was used when preparing for the modeling, which is a part of modeling, too, and should not be done on the data on which the model is later tested.

This is the reason why we stressed in the first and in the second lecture that a part of the data needs to be left aside. As a rule of thumb, we usually randomly select 30% of the data and put it in a separate file. Then we go through everything we (hopefully) learned during these lectures using the remaining 70% of the data. When we decide on the final model, we take the held-out data and use it to test the model. These results are what we report as the true expected performance of our model.
Measures of model performance
There are multiple perspectives from which we can judge models with regard to the quality of their predictions.
Scalar measures of quality
Classification accuracy (CA) is the easiest and the most intuitive measure of quality. It measures the proportion of correct answers the model gives: if it makes the correct prediction for 85 cases out of 100, we say the classification accuracy is 85%.
Let us measure the accuracy of the naïve Bayesian classifier, logistic regression and classification trees on the heart disease data set. The central widget in today's lecture is called Test Learners. On its input it requires a data set (we will provide one from the File widget) and learning algorithms. The schema below looks a bit unusual: last week, we gave data to learning algorithms so that they could construct models from it. How do they construct models now, without data?
The widget for the naïve Bayesian classifier has two outputs. When given data (as last week), it outputs a model (which we fed to, for instance, the widget showing nomograms). But even when not given any data, it outputs a learning algorithm. A learning algorithm is not a model, but a procedure for building the model.

So, what Test Learners gets in this schema are not data and three models, but data and three procedures for making models. Test Learners splits the data in different ways (cross-validation, leave-one-out...), repeatedly runs the provided procedures on one subset and tests the resulting models on the other.
Hands-on exercise
 Construct the above schema.
Classification accuracy does not distinguish between the "meanings" of classes. Not all mistakes are equal. For instance, failing to detect a disease can cause the death of a patient, while diagnosing a healthy person as sick is usually less fateful. To distinguish between different types of wrong predictions, we will select one of the classes and call it the target class or the positive class. In medicine, positive will usually mean having a disease, a condition or something like that. Now we can distinguish between four cases of correct and wrong predictions:
 true positive (TP) denotes a data instance (e.g. a patient) for whom the model made a positive prediction and the prediction is correct (so the patient is truly positive)
 false positive (FP) denotes instances which were falsely classified as positive (e.g. we predict the patient to have the disease, though he does not)
 true negative (TN) instances are those for which the model gave a negative prediction and the prediction is correct
 false negative (FN) represents instances which are falsely classified as negative (they are sick but the model predicted them to be healthy)
                                predicted value
                          positive (P')          negative (N')
actual   positive (P)     true positive (TP)     false negative (FN)
value    negative (N)     false positive (FP)    true negative (TN)
Besides TP, FN, FP and TN, we will also use P and N to denote the number of positive and negative instances, and P' and N' for the number of all positive and negative predictions. Obviously, P = TP + FN, N = TN + FP, P' = TP + FP and N' = TN + FN. A matrix like this one is a special case of a confusion matrix. Orange shows it in a widget called Confusion matrix, which we connect to the Test Learners widget.
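The four counts and the classification accuracy are straightforward to compute from predictions; a minimal sketch in plain Python with invented labels:

    actual    = [1, 1, 1, 0, 0, 0, 1, 0, 0, 1]   # invented true classes (1 = positive)
    predicted = [1, 0, 1, 0, 1, 0, 1, 0, 0, 0]   # invented model predictions

    TP = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    FN = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    FP = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    TN = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)

    print(TP, FN, FP, TN)                    # the four cells of the confusion matrix
    print((TP + TN) / (TP + TN + FP + FN))   # classification accuracy (CA)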
Hands-on exercise
 To make things interesting, we will also add a Data table widget and a Scatter plot widget, to which we give examples from the File widget and from the Confusion matrix. The latter should represent a subset of data (the easiest way to