Janez Demšar

Data Mining

Material for lectures on Data Mining at Kyoto University, Dept. of Health Informatics, in July 2010
Contents

INTRODUCTION
(FREE) DATA MINING SOFTWARE
BASIC DATA MANIPULATION AND PREPARATION
SOME TERMINOLOGY
DATA PREPARATION
FORMATTING THE DATA
BASIC DATA INTEGRITY CHECKS
MAKING THE DATA USER FRIENDLY
KEEPING SOME OF THE DATA FOR LATER TESTING
MISSING DATA
DISCRETE VERSUS NUMERIC DATA
FROM DISCRETE TO CONTINUOUS
FROM CONTINUOUS TO DISCRETE
VARIABLE SELECTION
FEATURE CONSTRUCTION AND DIMENSIONALITY REDUCTION
SIMPLIFYING VARIABLES
CONSTRUCTION OF DERIVED VARIABLES
VISUALIZATION
HISTOGRAMS
SCATTER PLOT
PARALLEL COORDINATES
MOSAIC PLOT
SIEVE DIAGRAM
"INTELLIGENT" VISUALIZATIONS
DATA MODELING (LEARNING, CLUSTERING)
PREDICTIVE MODELING / LEARNING
CLASSIFICATION TREES
CLASSIFICATION RULES
NAÏVE BAYESIAN CLASSIFIER AND LOGISTIC REGRESSION
OTHER METHODS
UNSUPERVISED MODELING
DISTANCES
HIERARCHICAL CLUSTERING
K-MEANS CLUSTERING
MULTIDIMENSIONAL SCALING
VISUALIZATION OF NETWORKS
EVALUATION OF MODEL PERFORMANCE
SAMPLING THE DATA FOR TESTING
MEASURES OF MODEL PERFORMANCE
SCALAR MEASURES OF QUALITY
RECEIVER OPERATING CHARACTERISTIC (ROC) CURVE
CLASSIFICATION COST
Introduction

This material contains the "handouts" given to students for the data mining lectures held at the Department of Health Informatics at the University of Kyoto. The informality of the "course", which took five two-hour lectures, is reflected in the informality of this text, too, as it even includes references to discussions at past lectures and so on.

Some more material is available at http://www.ailab.si/janez/kyoto.
(Free) Data mining software

Here is a very small selection of free data mining software. Google for more.

- Orange (http://www.ailab.si) is the free data mining software we are going to use for these lectures.
- Knime (http://www.knime.org/) is excellent, well designed and nice looking. It has fewer visualization methods than Orange and it's a bit more difficult to work with, but it won't cost you anything to install it and try it out.
- Weka (http://www.cs.waikato.ac.nz/ml/weka/) is oriented more towards machine learning. While very strong in terms of the different kinds of prediction models it can build, it has practically no visualization and it's difficult to use. I don't really recommend it.
- Tanagra (http://eric.univ-lyon2.fr/~ricco/tanagra/) is a smaller package written by a one-man band, but it has much more statistics than Orange.

Orange has a different user interface from what you are used to (similar to Knime and, say, SAS Miner and SPSS Modeler). Its central part is the Orange Canvas, onto which we put widgets. Each widget has a certain function (loading data, filtering, fitting a certain model, showing some visualization), and receives and/or gives data from/to other widgets. We say that widgets send signals to each other. We set up the data flow by arranging the widgets into a schema and connecting them. Each widget is represented with an icon which can have "ears" on the left (inputs) and right (outputs). Two widgets can only be connected if their signal types match: a widget outputting a data table cannot be connected to a widget which expects something completely different. We will learn more about Orange as we go.
Basic data manipulation and preparation

Some terminology

Throughout the lectures you will have to listen to an unfortunate mixture of terminology from statistics, machine learning and elsewhere. Here is a short dictionary.

Our data sample will consist of data instances, e.g. patient records. In machine learning we prefer to say we have a data set and it consists of examples.

Each example is described by features or values of variables (statistics) or attributes (machine learning). These features can be numerical (continuous) or discrete (symbolic, ordinal or nominal; the latter two terms actually refer to two different things: ordinal values can be ordered, e.g. small, medium, big, while nominal values have no specific order, just names, e.g. red, green, blue).

Often there is a certain variable we want to predict. Statisticians call it the dependent variable, in medicine you will hear the term outcome, and in machine learning we call it the class or class attribute. Such data is called classified data (no relation to the CIA or KGB).

Machine learning considers examples to belong to two or more classes, and the task is then to determine the unknown class of new examples (e.g. diagnoses for newly arrived patients). We say that we classify new examples, the method for doing this is called a classifier, the problem of finding a good classifier is called a classification problem, and so on. The process of constructing the classifier is called classifier induction from data or also learning from data. The part of the data used for learning is often called learning data or training data (as opposed to testing data, which we use for testing).

The "class" can also be continuous. If so, we shouldn't call it a class (although we often do). The problem is then not a classification problem anymore; we call it a regression problem.

In statistical terminology, the task is to predict, and the statistical term for a classifier is a predictive model. We do not learn here, we fit the model to the data. We sometimes also use the term modeling. If the data is classified, the problem is called supervised modeling, otherwise it's unsupervised.

Since data mining is based on both fields, we will mix the terminology all the time. Get used to it.
Data preparation

This is related to Orange, but similar things also have to be done when using any other data mining software. Moreover, especially the procedures we will talk about in the beginning are also applicable to any means of handling the data, including statistics. They are so obvious that we typically forget about them and scratch our heads later on, when things begin looking weird.
If all this seems to you like common sense, it is simply because most of data mining is just the systematic application of pure common sense.
Formatting the data

Orange does not handle Excel data and SQL databases (if you don't know what SQL is, don't bother) well yet, so we feed it the data in textual tab-delimited form. The simplest way to prepare the data is to use Excel. Put the names of the variables into the first row, followed by data instances in each row. Save as tab-delimited, with extension .txt.

This format is somewhat limited with respect to what you can tell Orange about the variables. A bit more complicated (but still simple) format uses the second and the third row to define the variable types and properties. You can read more about it at http://www.ailab.si/orange/doc/reference/tabdelimited.htm. You can see several examples in c:\Python26\lib\site-packages\orange\doc\datasets, assuming that you use Windows and Python 2.6.
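To make this concrete, here is a sketch of what such a file might look like. The variable names come from the heart disease data used later in these lectures, but the rows are invented, and the second and third header rows (variable types and the class marker) follow the three-row variant described above, so check the documentation link for the authoritative description:

    age         gender    chest pain      diameter narrowing
    continuous  discrete  discrete        discrete
                                          class
    63          male      typical ang     0
    67          female    asymptomatic    1

The columns must be separated by actual tab characters; saving from Excel as "Text (Tab delimited)" takes care of that.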
For technical reasons, Orange cannot handle values, variable names or even file names which include non-ASCII (e.g. kanji) characters. There is little we can do to fix Orange at the moment.
Basic data integrity checks

Check the data for errors and inconsistencies. Check for spelling errors. If somebody mistypes male as mole, we are suddenly going to have three sexes in our data: females, males and moles.

Orange (and many other tools) is case sensitive: you should not write the same values sometimes with small and sometimes with capital letters. That is, female and Female are two different species.

Check for any other obviously wrong data, such as mistyped numbers (there are not many 18 kg patients around; did they mean 81?!).

How are unknown values denoted? There are various "conventions": question marks, periods, blank fields, NA, -1, 9, 99... Value 9 seems to be especially popular for denoting unknowns in symbolic values encoded numerically, like gender, e.g. 0 for female, 1 for male and 9 for unknown. If 9's are kept in the data and the software is not specifically told that 9 means an unknown value, it will either think that there are three or even ten sexes (as if life was not complicated enough with two), or even that sex is a numerical variable ranging from 0 to 9.
You will often encounter data in which unknowns are marked differently for different variables, or even for one and the same variable.

Does the same data always mean the same thing? And the opposite: can the same thing be encoded by more than one different value? Check for errors in the data recording protocol, especially if the data was collected by different people or institutions. In the data I'm working on right now, it would seem that some hospitals have a majority of patients classified as Simple trauma, while others have most patients with Other_Ent. Could this be one and the same thing, but recorded differently in different hospitals due to lax specifications?
A typical classical example, which I happily also encountered while in Nara, is that some institutions would record the patients' height in meters and others in centimeters, giving an interesting bimodal distribution (besides rendering the variable useless for further analysis).

Most such problems can be discovered by observing the distributions of variable values.

Such discrepancies should be found and solved in the beginning. If you were given the data by somebody else, you should ask him to examine these problems.
Hands on: Load the data into Orange and see whether it looks OK. Try to find any mistakes and fix them (use Excel or a similar program for that). (Widgets: File, Data Table, Attribute Statistics, Distributions)
Making the data user friendly

Data is often encoded by numbers: 0 for female, 1 for male (wait, or was it the opposite? you will ask yourself all the time while working with such data). I just got data in which all nominal values had their descriptive names, except gender, for some reason. Consider renaming these values; it will make your life more pleasant.

This of course depends upon the software and the methods which you are going to use. We will encounter a simple case quite soon in which we actually want the gender to be encoded by 0 and 1.
Keeping some of the data for later testing

An important problem, which was also pointed out by Prof. Nakayama at our first lecture, is how to combine statistical hypothesis testing with data mining. The only correct way I know of (and the only one with which you should have no problem trusting your results and convincing the reviewers of your paper of their validity) is to use a separate data set for testing your hypotheses and models. Since you typically get all the data which is available in the beginning, you should sacrifice a part of your data for testing.

As a rule of thumb, you would randomly select 30% of the data instances, put them into a separate file and leave it alone till just before the end.

If your data has dates (e.g. admittance date), it may make a more convincing case if, instead of randomly selecting 30%, you take out the last 30% of the data. This would then look like a simulated follow-up study. If you decide for this, you only have to be careful that your data is unbiased. For instance, if you are researching a condition related to diabetes and the last 30% of the data was collected in the summer, using this data as a test set may not be a good idea (since diabetic patients have more problems in summer, as far as I know).
You should split the data at this point. Fixing the problems in the data which we discussed above before splitting it is a good idea; otherwise you will have to repeat the cleaning on the test data separately. But from this point on, we shall do things with the data which already affect your model, so the test data should not be involved in these procedures.
Hands on: There is a general-purpose widget in Orange to select random samples from the data (Data Sampler). You can put the data from this widget into the one for saving data (Save). Try it and prepare a separate data file for testing.
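The same holdout split can also be done outside Orange in a few lines. Here is a minimal plain-Python sketch (this is not the Data Sampler widget; the file names are invented and the 30% fraction is just the rule of thumb from above):

    import random

    def holdout_split(rows, test_fraction=0.3, seed=1):
        # Randomly set aside a fraction of the data instances for testing.
        rows = list(rows)
        random.Random(seed).shuffle(rows)    # fixed seed: reproducible split
        n_test = int(len(rows) * test_fraction)
        return rows[n_test:], rows[:n_test]  # (training data, testing data)

    # Example: split the lines of a tab-delimited file, keeping the header row.
    with open("heart_disease.txt") as f:
        header, data = f.readline(), f.readlines()
    train, test = holdout_split(data)
    open("train.txt", "w").writelines([header] + train)
    open("test.txt", "w").writelines([header] + test)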
Missing data

This topic again also concerns statistics, although typical statistical courses show little interest in it. Some data will often be missing. What to do about that?

First, there are different kinds of missing data.

- It can be missing because we don't care about it, since it is irrelevant and was not collected for a particular patient. (I didn't measure his pulse rate since he seemed perfectly OK.)
- It can be inapplicable. (No, my son is not pregnant. Yes, doctor, I'm certain about it.)
- It may be just unknown.

Especially in the latter case, it may be interesting to know: why? Is the fact that it is unknown related to the case? Does it tell us something?

Maybe the data was not collected since its value was known (e.g. pulse rate was not recorded when it was normal). Maybe the missing values can be concluded from the other attributes (blood loss was not recorded when there were no wounds, so we know it is zero) or are inapplicable (the reason of death is missing, but another attribute tells us that the patient survived).
For a good example: in a data set which I once worked with, a number of patients had a missing value for urine color. When I asked the person who gave me the data about the reason, he told me that they were unconscious at the time they were picked up by the ER team.
Knowing why it is missing can help in dealing with it. Here are a few things we can do:

- Ask the owner of the data or a domain expert if he can provide either per-instance or general values to impute (the technical term for inserting values for unknowns). If the pulse was not recorded when it was normal, well, we know what the missing value should be.
- Sometimes you may want to replace the missing value with a special value, or even add a specific variable telling whether the missing value is known or not. For the urine color example, if the data does not contain a variable telling us whether the patient is conscious, we can add the variable Conscious and determine its value based on whether the urine color is known or not. If it is known, the patient is conscious; if not, he is "maybe conscious" (or even "unconscious", if we know that the data was collected so that the color is known for all conscious patients). Not knowing a value can thus be quite informative.
- Sometimes you can use a model to impute the unknown value. For instance, if the data is on children below 5, and some data on their weights is missing but you know their ages, you can impute the average body mass for the child's age. You can even use sophisticated models for that (that is, instead of predicting what you have to predict for this data, e.g. the diagnosis, you predict values of features based on values of other features).
- Impute constant values, such as the worst or the best possible, or the most common value (for discrete attributes) or the average/median (for continuous ones).
- Remove all instances with missing values. This is the safest route, except that it reduces your data set. You may not like this one if your sample is already small and you have too many missing values.
- Remove the variables with missing values. You won't like this one if the variable is important, obviously. You may combine this with the previous: remove the variables with the most missing values and then remove the data instances which are missing values of any of the other attributes. This may still be painful, though.
- Keep everything as it is. Most methods, especially in machine learning, can handle missing values without any problem. Be aware, though, that this often implies some kind of implicit imputation, guessing. If you know something about these missing values, it is better to use one of the first solutions described.
Hands on: There is an imputation widget in Orange (Impute). See what it offers.
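For illustration, here is a minimal plain-Python sketch of the two simplest imputations mentioned above: the average for continuous variables and the most common value for discrete ones (the function names are mine, not Orange's):

    from collections import Counter

    def impute_average(values):
        # Replace None with the average of the known values.
        known = [v for v in values if v is not None]
        avg = sum(known) / len(known)
        return [avg if v is None else v for v in values]

    def impute_most_common(values):
        # Replace None with the most common known value.
        known = [v for v in values if v is not None]
        most_common = Counter(known).most_common(1)[0][0]
        return [most_common if v is None else v for v in values]

    print(impute_average([36.5, None, 38.2]))         # [36.5, 37.35, 38.2]
    print(impute_most_common(["f", None, "f", "m"]))  # ['f', 'f', 'f', 'm']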
Discrete versus numeric data

Most statistical methods require numeric data, and many machine learning methods prefer discrete data. How do we convert back and forth?
From discrete to continuous

This task is really simple for variables which can only take two values, like gender (if we don't have the gender 9). Here you can simply change these values to 0 and 1. So if 0 is female and 1 is male, the variable represents the "masculinity" of the patient. Consider renaming the variable to "male" so you won't have to guess the meaning of 0 and 1 all the time. Statisticians call such variables dummy variables, dummies, indicator variables, or indicators.

If the variable is actually ordered (small, medium, big) you may consider encoding it as 0, 1, 2. Be careful, though: some methods will assume linearity. Is the difference between small and big really two times that between small and medium?

With general multivalued variables, there are two options. When your patients can be Asian, Caucasian or Black, you can replace the variable Race with three variables: Asian, Caucasian and Black. For each patient, only one would be 1. (In statistics, if you do, for instance, linear regression on this data, you need to remove the constant term!)

The other option occurs in cases where you can designate one value as "normal". If urine color can be normal, red or green, you would replace it with two attributes, red and green. At most one of them can be 1; if both are zero, that means that the color is normal.
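As a small sketch of both options (the function name is invented for illustration; the second call shows the variant with a designated "normal" value):

    def dummy_encode(values, baseline=None):
        # Replace a discrete variable with one 0/1 indicator per value.
        # If a baseline ("normal") value is given, it gets no indicator:
        # it is encoded as all zeros.
        levels = sorted(set(values))
        if baseline is not None:
            levels.remove(baseline)
        return {level: [1 if v == level else 0 for v in values]
                for level in levels}

    colors = ["normal", "red", "green", "red"]
    print(dummy_encode(colors))                     # indicators: green, normal, red
    print(dummy_encode(colors, baseline="normal"))  # indicators: green, red only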
Hands on: There is a widget for such conversions in Orange (Continuize). See what it offers.
From continuous to discrete

This procedure is called categorization in statistics, since it splits the data into "categories". For instance, the patients can be split into categories young, middle-aged and old. In machine learning this is called discretization.

Discretization can be unsupervised or supervised (there go these ugly terms again!). Unsupervised discretization is based on the values of the variable alone. Most typically we split the data into quartiles, that is, into four approximately equal groups based on the value of the attribute. We could thus have the youngest quarter of the patients, the second youngest, the older and the oldest quarter. This procedure is simple and usually works best. Instead of four we can also use another number of intervals, although we seldom use more than 5 or fewer than 3 (unless we have a very good reason for doing so).

Supervised methods set the boundaries (also called thresholds) between intervals so that the data is split into groups which differ in class membership. Put simply, say that a certain disease tends to be more common for people older than 70. 70 is thus the natural threshold which splits the patients into two groups, and it does not really matter whether it splits the sample of patients in half or where the quartiles lie.

The supervised methods are based on optimizing a function which defines the quality of the split. One can, for instance, compute the chi-square statistic to compare the distribution of outcomes (class distributions, as we would say in machine learning) for those above and below a certain threshold. Different thresholds yield different groups and thus different chi-squares. We can then simply choose the threshold with the highest chi-square.
A popular approach in machine learning is to compute the so-called information gain of the split, which measures how much information about the instance's class (e.g. disease outcome) we get from knowing whether the patient is above or below a certain threshold. While related to chi-square and similar measures, one nice thing about this approach is that it can evaluate whether setting the threshold gives more information than it requires. In other words, it includes a criterion which tells us whether to split the interval or not.

With that, we can run the method in an automatic mode which splits the values of the continuous variable into ever smaller subintervals until it decides it's enough. Quite often it will not split the variable at all, hence proposing to use "a single interval". This implicitly tells us that the variable is useless (that is, the variable tells us nothing, it has an information gain of 0 for any split) and we can discard it.
You can combine supervised methods with manual setting of thresholds. The figure above shows a part of Orange's Discretize widget. The graph shows the chi-square statistic (y axis) for the attribute we would get if we split the interval of values of the continuous attribute in two at the given threshold (x axis). The gray curve shows the probability of class Y (right-hand y axis) at various values of the attribute. Histograms at the bottom and the top show the distributions of the attribute's values for both classes.
Finally, there may exist a standard categorization for the feature, so you can forget about the unsupervised and supervised methods, and use the expert-provided intervals instead. For instance, by some Japanese definition, a BMI between 18.5 and 22.9 is normal, 23.0 to 24.9 is overweight and 25.0 and above is obese. These numbers were decided with some reason, so you can just use them unless you have a particular reason for defining other thresholds.
While discretization would seem to decrease the precision of measurements, the discretized data is usually just as good as the original numerical data. It may even increase the accuracy when the data has some strong outliers, that is, measurements out of scale. By splitting the data into four intervals, the outliers are simply put into the last interval and do not affect the data as strongly as before. (For those more versed in statistics: this has a similar effect as using a non-parametric test instead of a parametric one, for instance Wilcoxon's test instead of the t-test.)
Hands on: Load the heart disease data from the documentation data sets and observe how the Discretize widget discretizes the data. In particular, observe the Age attribute. Select only women and see if you can find anything interesting. (Widgets: Discretize, Select Data)
Variable selection

We often need to select a subset of variables and of data instances. Using fewer variables can often result in better predictive models in terms of accuracy and stability.
- Decide which attributes are really important. If you take too many, you are going to have a mess. If you remove too many, you may miss something important.
- As we already mentioned, you can remove attributes with too many missing values. These will be problematic for many data mining procedures, and are not very useful anyway, since they don't give you much data. Also, if they weren't measured when your data was collected (too expensive? difficult? invasive or painful?), think about whether those who will use your model will be prepared to measure them.
- If you have more attributes with a similar meaning, it is often beneficial to keep only one, unless they are measured separately and there is a lot of noise, so you can use multiple attributes to average out the noise. (You can decide later which one you are keeping.)
- If you are building a predictive model, verify which attributes will be available when the prediction needs to be made. Remove the attributes that are only measured later, or those whose values depend upon the prediction (for instance, drugs that are administered depending upon the prediction for a cancer recurrence).

Data mining software can help you in making these decisions.
First, there are simple univariate measures of attribute quality. With two-class problems, you can measure the attribute significance by a t-test for continuous and chi-square for discrete attributes. For multi-class problems, you have ANOVA for continuous and chi-square (again) for discrete attributes. In regression problems you probably want to check the univariate regression coefficients (I honestly confess that I haven't done so yet).
Second, you may observe the relations between attributes and the relations between attributes and the outcome graphically. Sometimes relations will just pop up, so you want to keep the attribute, and sometimes it will be obvious that there are none, so you want to discard them.
Finally, if you are constructing a supervised model, the features can be chosen during the model induction. This is called model-based variable selection or, in machine learning, a wrapper approach to feature subset selection. In any case, there are three basic methods. Backward feature selection starts with all variables, constructs the model and then checks, one by one, which variable it can exclude without harming (or with the least harm to) the predictive accuracy (or another measure of quality) of the model. It removes that variable and repeats the process until any further removal would drastically decrease the model's quality. Forward feature selection starts with no variables and adds one at a time, always the one from which the model profits most. It stops when we gain nothing more by further additions. Finally, in stepwise selection, one feature can be added or removed at each step.
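As pseudocode-flavored Python, forward selection might look like the sketch below; fit_and_score stands for whatever model and quality measure you plug in (it is an assumed callback, not a fixed API):

    def forward_selection(all_features, fit_and_score, min_gain=0.001):
        # Greedily add the feature which improves the model most;
        # stop when nothing improves the score by at least min_gain.
        selected, best_score = [], float("-inf")
        while True:
            candidates = [f for f in all_features if f not in selected]
            if not candidates:
                break
            # try each remaining feature and keep the most helpful one
            scored = [(fit_and_score(selected + [f]), f) for f in candidates]
            score, feature = max(scored)
            if score < best_score + min_gain:
                break                        # no worthwhile improvement: stop
            selected.append(feature)
            best_score = score
        return selected

Backward selection is the mirror image: start from all features and repeatedly drop the one whose removal hurts the score least.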
The figure shows a part of Orange's widget for ranking variables according to various measures (we won't go into details about what they are, but we can if you are interested; just ask!).
Note that the general problem with univariate measures (that is, those which only consider a single attribute at a time) is that they cannot assess (or have difficulties with assessing) the importance of attributes which are important in the context of other attributes. If a certain disease is common among young males and old females, then neither gender nor age by itself tells us anything, so both variables would seem worthless. Considered together, they tell us a lot.
Hands on: There are two widgets for selecting a subset of attributes. One is concentrated on estimating the attribute quality (Rank); with the other (Select Attributes) we can remove or add attributes and change their roles (class attribute, meta attributes). Especially the latter is very useful when exploring the data, for instance to test different subsets of attributes when building prediction models.
Feature construction and dimensionality reduction

Simplifying variables

You can often achieve better results by merging several values of a discrete variable into a single one. Sometimes you may also want to simplify the problem by merging the class values. For instance, in non-alcoholic fatty liver disease, which I'm studying right now, the liver condition is classified into one of four Brunt stages, yet most (or even all) papers on this topic which construct predictive models converted the problem into a binary one, for instance whether the stage is 0, 1 or 2 (Outcome=0), or 3 or 4 (Outcome=1). Reformulated like this, the problem is easier to solve. Besides that, many statistical and machine learning methods only work with binary and not multivalued outcomes.

There are no widgets for this in Orange. You have to do it in, say, Excel. Orange can help you in deciding which values are "compatible" in a sense. We will learn about this next time, as we explore visualizations.
Construction of derived variables

Sometimes it is useful to combine several features into one, e.g. the presence of neurological, pulmonary and cardiovascular problems into general problems. There can be various reasons for doing this, the simplest being a small proportion of patients with each single disease. Some features need to be combined because of interactions, as in the case described above: while age and gender considered together may give all the information we need, most statistical and machine learning methods just cannot use them, since they consider (and discard) them one at a time. We need to combine the two variables into a new one, telling us whether the patient belongs to one of the increased-risk groups.
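As a toy sketch of such a derived variable (the age threshold of 40 is invented; the young-male/old-female disease is the naïve example from the variable selection section):

    def increased_risk(age, gender, threshold=40):
        # Combine two interacting features into one 0/1 indicator:
        # 1 for young males and old females, 0 otherwise.
        young = age < threshold
        if (young and gender == "male") or (not young and gender == "female"):
            return 1
        return 0

    print(increased_risk(25, "male"))    # 1
    print(increased_risk(25, "female"))  # 0
    print(increased_risk(70, "female"))  # 1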
Finally, we should mention (but really only mention) the so-called curse of dimensionality. In many areas of science, particularly in genetics, we have more variables than instances. For instance, we measure expressions of 30,000 genes on 100 human cancer tissues. A rule of thumb (but nothing more than a rule of thumb) in statistics is that the sample size must be at least 30 times the number of variables, so we certainly cannot run, say, linear or logistic regression (nor any machine learning algorithm) on this data.

There are no easy solutions to this problem. For instance, one may apply principal component analysis (or supervised principal component analysis or similar procedures), a statistical procedure which can construct a small set of, say, 10 variables computed from the entire set of variables in such a way that these 10 variables carry as much (or all) of the information from the original variables. The problem is that principal component analysis itself would also require, well, 30 times more data instances than variables (we can bend this rule, but not to the point when the number of variables exceeds the sample size, at least not by that much).
OK, if we cannot keep the information from all variables, can we just discard the majority of variables, since we know that they are not important? Out of 30,000 genes, only a couple, a dozen maybe, are related to the disease we study. We could do this, but how do we determine which genes are related to the disease? T-tests or information gain or similar measures for each gene separately wouldn't work: with so many genes and such a small sample, many genes will be randomly related to what we are trying to predict. If we are unlucky (and in this context we are), the randomly related genes will look more related to the disease than those which are actually relevant.
The only way out is to use supplementary data sources to filter the variables. In genetics, one may consider only genes which are already known to be related to the disease, and genes which are known to be related to these genes. More advanced methods use variables which represent average expressions (or whatever we measure) of groups of genes, where these groups are defined based on a large number of independent experiments, on gene sequence similarities, co-expressions and similar.

The area of dimensionality reduction is on the frontier of data mining in genetics. The progress in discovering the genetics of cancer and potential cures lies not so much in collecting more data but rather in finding ways to split the data we have into irrelevant and relevant.
Visualization

We shall explore some general-purpose visualizations offered by Orange and similar packages.

Histograms

The histogram is a common visualization for representing absolute or relative distributions of discrete or numeric data. In line with its machine learning origins, Orange adds functionality related to classified data, that is, data in which instances belong to two or more classes. The widget for plotting histograms is called Distributions. The picture below shows the gender distribution on the heart disease data set, which we briefly worked with in the last lecture.
On the left-hand side, the user can choose the variable whose distribution(s) he wants to observe. Here we chose the discrete attribute gender. The scale on the left side of the plot corresponds to the number of instances with each value of the attribute, which is shown by the heights of the bars. Colored regions of the bars correspond to the two classes: blue represents outcome "0" (vessels around the heart are not narrowed), and red parts represent the patients with clogged vessels (outcome "1"). The data contains twice as many men as women. With regard to proportions, males are much more likely to be "red" (have the disease) than females.

To compare such probabilities, we can use the scale on the right, which corresponds to probabilities of the target value (we chose target value "1", see the settings on the left). The circles and lines drawn over the bars show these probabilities: for females it's around 25% (with a confidence interval of approx. 20-30%), and for males it's around 55% (51-59). You can also see these numbers by moving the mouse cursor over parts of the bar.
Note: to see the confidence intervals, you need to click Settings (top left), and then check Show confidence intervals.
Similar graphs can be drawn for numeric attributes, for instance SBP at rest. The number of bars can be set in Settings. Probabilities are now shown with the thicker line in the middle, and the thinner lines represent the confidence intervals. You can notice how these spread out on the left and right side, where there are fewer data instances available to make the estimation.
Hands-on exercise: Plot the histogram for the age distribution. Do you notice something interesting with regard to probabilities?
Scatter plot

Scatter plots show individual data instances as symbols whose positions in the plane are defined by the values of two attributes. Additional features can be represented by the shape, color and size of the symbols. There are many ways to tweak the scatter plot into even more interesting visualizations. The Gap Minder application, with which we amused ourselves in our first lecture (Japan vs. Slovenia vs. China), is nothing else than a scatter plot with an additional slider to browse through historic data.

Although scatter plot visualizations in data mining software are less fancy than those in specialized applications like Gap Minder, they are still the most useful and versatile visualization we have, in my opinion.
Here is one of my favorites: the patients' age versus maximal heart rate during exercise. Colors again tell whether the patient has clogged vessels or not.

First, disregarding the colors, we see the relation between the two variables (this is what scatter plots are most useful for). Younger patients can reach higher heart rates (the bottom right side of the plot is empty). Among older patients, there are some who can still reach heart rates almost as high as youngsters (although the topmost right area is a bit empty), while some patients' heart rates are rather low. The dots are spread in a kind of triangle, and the age and heart rate are inversely correlated: the older the patient, the lower the heart rate.

Furthermore, the dots at the bottom right are almost exclusively red, so these are the people with clogged vessels. At the top left, people are young and healthy, while those at the top right can be both.
This is a typical example of how data mining goes: at this point we can start exploring the pattern. Is it true or just coincidental? The triangular shape of the dot cloud seems rather convincing, so let's say we believe it. If so, what is the reason? Can we explain it? A naïve explanation could be that when we age, some people still exercise a lot and their hearts are "as good as new", while some are not this lucky.

This is not necessarily true. When I showed this graph to colleagues with more knowledge in cardiology, they told me that the lower heart rate may be caused by beta blockers, so this can be a kind of confounding factor which is not included in the data. If this is the case, then the heart rate tells us (or is at least related to) whether the patient is taking beta blockers or not.
To verify this we could set the symbol size to denote the rest SBP (to make the picture clearer, I also increased the symbol size and transparency, see the Settings tab). Instead of the rest SBP, we could also plot the ST segment variable. I don't know the data and the problem well enough to be able to say whether this really tells us something or not.

The icons at the bottom let us choose tools for magnifying parts of the graph or for selecting data in the graph (icons with the rectangle and the lasso). When examples are selected, the widget sends them to any widgets connected to its output.
Hands-on exercise: Draw the scatter plot so that the colors represent the gender instead of the outcome. Does it seem that the patients whose heart rates go down are prevalently of one gender (which)? Can you verify whether this is actually the case? (Hint: select these examples and use the Attribute Statistics widget to compare the distribution of genders on the entire data set and on the selected sample.)
The widget has many more tricks and uses. For instance, it has multiple outputs. You can give it a subset of examples which come from another widget. It will show the corresponding examples from this subset with filled symbols, while other symbols will be hollow. In the example below we used the Select Data widget to select patients with exercise-induced angina and rest SBP below 131, and show the group as a subgroup in the Scatter plot. Note that this schema is really simple: many more widgets can output data, and the subset plotted can be far more interesting than this one.
Hands-on exercise: Construct the schema above and see what happens in the Scatter plot.

Hands-on exercise: Let's explore the difference between genders in some more detail. Remember the interesting finding about the probabilities of having clogged vessels with respect to the age of the patient? Use the Select Data widget to separate women and men and then use two Distributions widgets to plot the probability of disease with respect to age for each gender separately. Can you explain this very interesting finding (although not unexpected, when you think about it) from a medical point of view?
There is another widget for plotting scatter plots, in which the positions of instances can be defined by more than two attributes. It is called Linear Projection. To plot meaningful plots, all data needs to be either continuous or binary. We will feed it through the widget Continuize to ensure this.
On the left-hand side we can select the variables we want visualized. For the heart disease data, we will select all except the outcome, diameter narrowing. Linear projection does not use perpendicular axes, as the ordinary scatter plot does, but can instead use an arbitrary arrangement of axes. The initial arrangement is circular. To get a useful visualization, click the FreeViz button at the bottom. In the dialog that shows up, select Optimize separation. The visualization shows you, basically, which variables are typical for patients who do and do not have narrowed vessels.

Alternatively, in the FreeViz dialog you can choose the Projections tab and then Supervised principal component analysis.
For an easier explanation of how to interpret the visualization, here is a visualization of another data set, in which the task is to predict the kind of animal based on its features. We have added some colors by clicking Show probabilities in the Settings tab.

Each dot represents an animal, with colors corresponding to different types of animals. Positions are computed from the animal's features, e.g. those which are airborne are in the upper left (and those which are airborne and have teeth would be in the middle; tough luck, but there are probably not many such animals, which is also the reason why the features airborne and toothed are on opposite sides).
There are quite a few hypotheses (some true and some false) that we can make from this picture. For instance:

- Mammals are hairy and give milk; they are also cat-sized, although this is not that typical.
- Airborne animals are insects and birds.
- Birds have feathers.
- Having feathers and being airborne are related features.
- Animals either give milk or lay eggs; these two features are on opposite sides.
- Similarly, some animals are aquatic and have fins, and some have legs and breathe.
- Whether the animal is venomous or domestic tells us nothing about its kind.
- Backbone is typical for animals with fins. (?!)

Some of these hypotheses are partially true and the last one is clearly wrong. The problem is, obviously, that all features need to be placed somewhere on the two-dimensional plane, so they cannot all have a meaningful place. The widget and the corresponding optimization can be a powerful generator of hypotheses, but all these hypotheses need to be verified against common sense and additional data.
Parallel coordinates

In the parallel coordinates visualization, the axes which represent the features are placed parallel to each other, and each data instance is represented with a line which goes through the points on the axes that correspond to the instance's features.

On the zoo data set, each line represents one animal. We see that mammals give milk and have hair (most pink lines go through the top of the two axes representing milk and hair). Feathery animals are exclusively birds (all lines through the top of the feathers axis are red). Some birds are airborne and some are not (red lines split between the feathers and the airborne axes). Fish (green lines) do not give milk, have no hair, have a backbone, no feathers, cannot fly, and lay eggs (we can see this by following the green lines).

Hands-on exercise: Draw the following graph for the heart disease data. Select the same subset of features and arrange them in the same order. Set everything to make the visualization appear as below.
Can you see the same age vs. maximal heart rate relation as we spotted in the scatter plot? Anything else which might seem interesting? Are females in the data really older than males? Is there any relation between the maximal heart rate and the type of chest pain?
Mosaic plot

In the Mosaic plot we split the data set into ever smaller subsets based on the values of discrete (or discretized) attributes. We can see the proportions of examples in each group and the class distribution. The figures below show consecutive splits of the zoo data set.
Hairy animals represent almost one half of the data set (the right box in the first picture is almost as wide as the left box). (Note: this data set does not represent an unbiased sample of the animal kingdom.) The "hairy" box is violet, therefore most hairy animals are mammals. The yellowish part represents the hairy insects. There is almost no violet in the left box, therefore there are almost no non-hairy mammals.

We split each of the two subgroups (non-hairy, hairy) by the number of legs. Most hairy animals have four legs, followed by those with two and those with six. There are only a few with no legs. Four- and two-legged hairy creatures are mammals, and six-legged ones are insects.
A third of non-hairy animals have no legs. Half of them (in our data set only, not in nature!) are fish, followed by a few invertebrates, mammals and reptiles. One third of non-hairy animals have two legs; all of them are birds. Four-legged non-hairies are amphibians and reptiles, and the six-legged ones are mostly insects.
For the last split, we have split each group according to whether they are aquatic or not. All mammals without legs are aquatic, while the number of aquatics among the two- and four-legged ones is small (the vertical strips on the right-hand side are thin). There are no hairy aquatic insects.

The majority of non-hairy animals without legs are aquatic: fish, and some mammals and invertebrates. Non-aquatic ones are invertebrates and reptiles. As we discovered before, two-legged non-hairy creatures are birds (red). One third of them are aquatic and two thirds are not. Four-legged non-hairy animals can be non-aquatic reptiles, while a two-thirds majority goes to the aquatic amphibians.
Mosaic plots can also show a priori distributions or expected distributions (in a manner similar to those of the chi-square contingency table, that is, the expected number of instances computed from the marginal distributions). The plot below splits the data on heart disease by age group and gender. The small strips on the left side of each box show the prior class distribution (the proportion of patients with and without narrowed vessels on the entire data set).

The occurrence of narrowed vessels is very small among young and old females, whereas in the middle age group the proportion is far larger (but still below the total data set average! the problem is so much more common among males!). For males, the proportion simply increases with age.
Hands-on exercise: How did we actually plot this? If we feed the data with continuous variables into the Mosaic widget, it will automatically split the values of numeric attributes into intervals using one of the built-in discretization methods. Here we manually defined the age groups by using the Discretize widget. Try to do it yourself! (Hint: put Discretize between File and Mosaic. In Discretize, click Explore and set individual discretizations. Select Age and enter the interval thresholds into the Custom 1 box, or define the intervals on the graph below: left-clicking adds and right-clicking removes thresholds, and you can also drag thresholds around. Do not forget to click Commit in the widgets, or check Commit automatically.)
Sieve diagram

The Sieve diagram, also called a parquet diagram, is a visualization of two-way contingency tables, similar to those used in chi-square. The rectangle representing the data set is split horizontally and vertically according to the proportions of the values of two discrete attributes.

Note the difference between this and the mosaic plot. In the mosaic plot, the data is split into ever smaller subgroups: each age group is split separately into groups according to the distribution of genders within that specific age group. The size of the box in the Mosaic plot thus corresponds to the observed (actual) number of instances of each kind. In contrast, the Sieve diagram splits by the two attributes independently, so the size of each rectangle reflects the expected number of instances. This number can be compared to the observed number: the Sieve plot shows the comparison by grid density and color (blue represents a higher than expected number of instances and red represents a lower than expected number). The diagram above suggests that, for instance, younger males are more common than expected according to the prior distribution. We can check the number by moving the mouse cursor over the corresponding box.

Given that the younger age group includes 39.27% of the data and males represent 67.99%, we would expect 39.27% × 67.99% = 26.70% young males. The actual number is 28.71%, which is more than expected.
This difference is, however, negligible. Here is a better example. In the figure below, the vertical split follows the distribution of genders (the male part is twice the height of the female part), and the horizontal split corresponds to the results of the thal test (one half of the instances have normal results and most others have a reversible defect).

The number of males with a normal thal test is 28.57% (86 patients), which is less than the expected 37.56% (113 patients). The number of women with a normal thal test is unexpectedly high: 80 instead of the expected 53. And so forth.
"Intelligent" visualizations
There are many different scatter plots we can draw for some data, and the number of different, more complex visualizations is even higher. Orange features a set of tools for finding good data visualizations. For instance, this plot (a variation of the scatter plot called RadViz), which was found automatically by Orange, uses the expressions of 7 genes out of 2308 to distinguish between the four types of SRBCT cancer. In most widgets you will find buttons Optimize or VizRank (Visualization Ranking), which will try out a huge number of different visualizations and present those which seem most interesting.
Data Modeling (Learning, Clustering)

Modeling refers to describing the data in a more abstract form called, obviously, a model. The model (usually) does not include the data on the individual instances from which it is constructed, but rather describes the general properties of the entire data set (machine learning often refers to this as generalization).

We usually distinguish between unsupervised and supervised modeling. Roughly, imagine that data instances belong to some kind of groups and our task is to describe these groups. The computer's work can be supervised in the sense that the computer is told which instances belong to which group. We called these groups "classes"; we say that we know the class of each example. The computer needs to learn how to distinguish between the groups. Such knowledge is represented by a model that, if in a proper form, can teach us something (new, potentially useful, actionable) about the data. Moreover, we can use it to determine the group membership of new data instances. For example, if the data represents two groups of people, some sick and some not, the resulting model can classify new patients into one of these two groups.

In unsupervised learning, the instances are not split into classes, so the computer is not guided (supervised) in the same sense as above. The task is typically to find a suitable grouping of instances. In statistics, this is called clustering (a cluster being a synonym for a group). We also often use visual methods which plot the data in such a way that groups become obvious to the human observer, who can then manually "draw the borders" between them or use the observed grouping for whatever he needs.

(Figure: data for unsupervised learning and for supervised learning)
Predictive modeling / Learning

The task of predictive modeling is to use the given data (training examples, learning examples) to construct models which can predict classes or, even better, class probabilities for new, unseen instances. In the context of data mining, we would like these models to provide some insight into the data: the models should be interpretable; they should show patterns in the data. The models should be able to explain their predictions in terms of the features: the patient is probably sick, and the reason why we think so is because he has these and these symptoms.
Classification trees

Induction of classification trees (also called decision trees) is a popular traditional machine learning method, which produces models like the one in the figure below.

While this is visually pleasing, the tree as presented by the widget below may be easier to read.

We have used three widgets related to trees: Classification Tree gets the data and outputs a tree (this is the first widget we have met so far which does not output data instances but something else: a model!). Classification Tree Viewer and Classification Tree Graph get trees on their inputs and show them. They also provide output: when you click on tree nodes, they output all training examples which belong to the node. You will find all these widgets in the category Classify.
Say we get a new patient. The tree suggests that we should first perform the thal test (whatever this means). Let's say that the results are normal. For such patients we then ask about the chest pain; suppose that it's typical anginal. In that case, we check the age of the patient. Say he is 45, which is below 56.5. Therefore we predict a 0.04 probability of having the disease.
The algorithm for the construction of such trees from data works like this. First, it checks all available variables and finds the one which splits the data into subsets which are as distinct as possible; in the ideal case, the attribute would perfectly distinguish between the classes, in practice, it tries to get as close as possible. For our data, it decided for the thal test. Then it repeats the search for the best attribute in each of the two subgroups. Incidentally, the same feature (the type of chest pain) seems the best attribute in both. The process then continues for all subgroups until certain stopping criteria are met: the number of data instances in a group is too small to split them further, the subgroups are pure enough (ideally, all instances belong to the same class, so there is no need to split them any further), there are no more useful features to split by, the tree is too large, etc.
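For the curious, here is a compressed sketch of this greedy procedure, reusing the information_gain function from the discretization section. It is a bare-bones illustration (binary splits on numeric features only, crude stopping criteria), not the algorithm Orange actually uses:

    def majority(labels):
        return max(set(labels), key=labels.count)

    def build_tree(rows, labels, min_size=5):
        # rows: list of dicts {feature: value}; labels: the class of each row.
        # Stop on a pure node or when too few instances remain.
        if len(set(labels)) == 1 or len(rows) < min_size:
            return {"leaf": majority(labels)}
        best = (0.0, None, None)                # (gain, feature, threshold)
        for feature in rows[0]:
            values = [r[feature] for r in rows]
            for t in sorted(set(values))[:-1]:  # candidate thresholds
                gain = information_gain(values, labels, t)
                if gain > best[0]:
                    best = (gain, feature, t)
        gain, feature, t = best
        if feature is None:                     # no split gains anything: stop
            return {"leaf": majority(labels)}
        left = [(r, c) for r, c in zip(rows, labels) if r[feature] <= t]
        right = [(r, c) for r, c in zip(rows, labels) if r[feature] > t]
        return {"split": (feature, t),
                "left": build_tree([r for r, c in left], [c for r, c in left]),
                "right": build_tree([r for r, c in right], [c for r, c in right])}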
Trees are great for data mining, because

- there's no sophisticated mathematics behind them, so they are conceptually easy to understand,
- a tree can be printed out, analyzed and easily used in real-life situations,
- in data mining tools, we can play with them interactively; in Orange, we can connect the tree to other widgets, click the nodes of the tree to observe the subgroups and so on.

They are, however, not so great because

- while they may perform relatively well in terms of the proportion of correct predictions, they perform poorly in terms of probability estimates;
- they are "unstable": a small change in the data set, like removing a few instances, can result in a completely different tree. The same goes for changing the parameters of the learning algorithm. Trees give one perspective of the data, one possible model, not a special, unique, distinguished model;
- trees need to be kept small: a computer can easily induce a tree with a few hundred leaves, yet such trees are difficult to understand and to use, and the number of instances in individual leaves is then so small that we cannot "trust" them. On the other hand, small trees refer to only a small number of attributes, e.g. to only a small number of medical symptoms related to the disease.
The latter point is really important. Trees are good for conceptually simple problems, where a few features are enough to classify the instances. They are a really bad choice for problems in which the prediction must be made by observing and summing up the evidence from a larger number of features. A usefully small tree cannot contain more than, say, ten features. If making the prediction requires observing more variables, forget about trees.

Hands-on exercise: Construct a tree like the one above, then plot the same data in a Mosaic plot in which you select the attributes from the root of the tree and the first branches of the tree.
Classification rules

Classification rules are essentially similar to trees, except that instead of having a hierarchy of choices, they are described by lists of conditions, like those below.

The algorithm for the construction of such rules starts with an empty rule (no conditions; it predicts the same class for all examples). It then adds conditions which specialize the rule, fitting it to a particular group of instances, until it decides that it's enough (stopping criteria are based on the purity of the group, the number of instances covered by the rule and some statistical tests). It prints out the rule, removes the data instances covered by the rule and repeats the process with the remaining instances.
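A toy sketch of this covering loop is shown below; real rule learners such as CN2 search the space of conditions far more cleverly, while here a "rule" is a single feature=value condition chosen by the purity of the covered group:

    def learn_rules(rows, labels, min_covered=3):
        # rows: list of dicts {feature: value}. Repeatedly find the purest
        # single-condition rule, keep it, and remove the covered instances.
        rules = []
        while rows:
            best = None
            for feature in rows[0]:
                for value in set(r[feature] for r in rows):
                    covered = [c for r, c in zip(rows, labels)
                               if r[feature] == value]
                    if len(covered) < min_covered:
                        continue
                    maj = max(set(covered), key=covered.count)
                    purity = covered.count(maj) / len(covered)
                    if best is None or purity > best[0]:
                        best = (purity, feature, value, maj)
            if best is None:
                break                        # nothing left worth covering
            purity, feature, value, maj = best
            rules.append((feature, value, maj))
            keep = [(r, c) for r, c in zip(rows, labels) if r[feature] != value]
            rows = [r for r, c in keep]
            labels = [c for r, c in keep]
        return rules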
Hands-on exercise: Construct the schema below. Examples coming from the CN2 Rules Viewer should be connected to the Example Subset input of the Scatter plot (the simplest way to achieve this is to first draw the connection between File and Scatter plot and then between the CN2 Rules Viewer and the Scatter plot). Now click on individual rules in the CN2 Rules Viewer and the Scatter plot will mark the training instances covered by the rule.
Classification rules have similar properties as trees, except that their shortcomings are somewhat less acute: they can handle more attributes and are better at predicting probabilities. They may be especially applicable to problems with multiple classes, as in the following examples from our beloved zoo data set.
Naïve Bayesian Classifier and Logistic Regression

The naïve Bayesian classifier computes conditional class probabilities, with feature values as conditions. For instance, it computes the probability of having the disease for females and for males, the probability of having the disease for younger and older patients, for those with non-anginal pain, typical anginal pain and atypical anginal pain, and so on. To predict the overall probability of having the disease, it "sums up" the conditional probabilities corresponding to all the data about the patient.

The model can be simplified into summing "points" plotted in a nomogram, like the one shown below. Suppose we have a patient whose ST by exercise equals 2, his heart rate goes up to 200, and he feels asymptomatic chest pain. Other values are unknown. The naïve Bayesian classifier would give him +1 point for the ST by exercise, -3.75 for the heart rate, +1 for the chest pain; these mappings can be seen by "projecting" the positions of the points corresponding to the values onto the axis above the nomogram. Altogether, this gives +1 - 3.75 + 1 = -1.75 points. The scale below transforms the points into actual probabilities: -1.75 points correspond to around a 15% chance of having the disease.
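The last step, turning points into a probability, is just an inverse logit. A minimal sketch, assuming the points are log-odds added to the prior log-odds of the class (the usual construction behind such nomograms; with a 50% prior it reproduces the numbers from the example above):

    from math import exp, log

    def probability_from_points(points, prior=0.5):
        # Sum the evidence (log-odds points) and transform back to a probability.
        logit = log(prior / (1 - prior)) + sum(points)
        return 1 / (1 + exp(-logit))

    print(probability_from_points([+1, -3.75, +1]))  # about 0.148, i.e. around 15%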
There is a similar model from statistics called logistic regression. On the outside, they look the same: they can both be visualized using a nomogram. Mathematically, they differ in the way they compute the number of points given for each feature: the naïve Bayesian classifier observes each feature separately, while logistic regression tries to optimize the whole model at once. We shall not go into the mathematics here. We will only describe the practical consequences of the difference between the two approaches.
These models are just great.

- They are very intuitive and easy to explain; they can be printed out and used in the physician's office.
- They give surprisingly accurate predictions.
- The predictions can be easily explained: the strongest argument that the person from our example is healthy is his extremely high heart rate. The other two pieces of data we have, the ST by exercise and the type of chest pain, both actually oppose the prediction that he is healthy, yet their influence is smaller than that of the heart rate. The inside view into how the model makes predictions allows the expert to decide whether they make sense or whether they should be overruled and ignored.
- The model can show us which features are more important than others in general, without a particular given instance. ST by exercise and maximal heart rate are, in general, the two most informative features, since they have the longest axes: they can provide the strongest evidence (the greatest number of points) for or against the disease. The chest pain and thal test are the least important (among those shown in the pictures). A feature can also have zero influence, when all its values are on the vertical line (0 points). A feature can sometimes even speak only in favor of or only against the target class; there may be tests which can, if their results come out positive, tell that the patient is sick, while a negative result may mean nothing.
What about their shortcomings? For the naïve Bayesian classifier, its shortcomings come from observing (that is, computing the strength of) each feature separately from the others.

- In case of diseases whose prediction requires observing two interacting features (for a naïve example, a disease which attacks young men and old women), it cannot "combine" the two features. Such diseases (and, in general, such features) are however rare in practice.
- Another, more common problem are correlated features. Observing each feature separately, the naïve Bayesian classifier can count the same evidence twice. Consider a childhood disease whose probability decreases with the child's age, and there are no other factors related to the disease. If the data we collected contains the children's age and body height, the naïve Bayesian classifier will (wrongly?) consider height as an important factor, although this is only because of its relation with age. Counting the same piece of evidence twice causes the naïve Bayesian classifier to "overshoot" the probability predictions.
The latter is not much of a problem: the classifier is still pretty accurate. Yet the difference is important for comparison with logistic regression: logistic regression does not have this problem. The class probabilities it predicts are typically very accurate. And the price?

- Logistic regression cannot be used to evaluate the importance of individual features. In the childhood disease example from above, logistic regression might say that the disease probability is related to age alone, and not to height, which would be correct. On the other hand, it can say that the only important attribute is body height, while age cannot offer any new evidence, which is misleading. Usually it does something in between.
Logistic regressioncannothandleunknownvalues.Alldatahas tobeknown toclassifyapatient.NotsofornaveBayesianclassifierforwhichwecanaddjustnewevidencewhenwegetit.
Both models, especially the naïve Bayesian classifier, are good for summing up the evidence from a large number of features. They can learn from a single feature, but they can also sum up small pieces of evidence from thousands of features. For instance, naïve Bayesian classifiers are often used for spam filtering, in which each word from the message (and all other properties of the message) contributes some evidence for the mail being spam or ham.
While we avoided any theory here, there is really a lot of it behind (unlike classification trees, which are a simple machine learning hack). Logistic regression is a modification of linear regression which uses the logistic function for modeling probabilities. The "points" assigned to features are actually log-odds ratios, which are a more practical alternative to conditional probabilities. The naïve Bayesian classifier represents the simplest case of a Bayesian network, a model which can incorporate probabilistic causal models. Moreover, there is a whole branch of statistics, called Bayesian statistics, which represents an alternative to inferential statistics (testing of hypotheses etc.) and is growing fast in popularity and usefulness.
Other methods

There are many other similar methods. Here we only mentioned the simple ones, yet, in my experience, the naïve Bayesian classifier can easily compete (and lose by only a small difference) with the most powerful and complex machine learning algorithms like, for instance, support vector machines. If you want models which are simple to use and understand, you should use logistic regression and naïve Bayesian classifiers. The possible few percent loss in performance will be more than paid off by your ability to explain and, if needed, overrule the predictions they make.
Unsupervisedmodeling
Distances
Unsupervised modeling is often based on a notion of distance or difference between data points. In order to do any grouping based on similarity, we need to somehow define the similarity (or its opposite: dissimilarity, difference).

When dealing with ordinary data, the most common definition is based on the Euclidean distance. If a and b are two data instances described by values (a1, a2, a3) and (b1, b2, b3), we can define the difference between them as (a1-b1)² + (a2-b2)² + (a3-b3)² (we can take the square root of that, but it won't change much). We only need to be careful to put all numbers on the same scale: a difference of 0.1 m in body height may mean much more than a difference of 0.1 kg in body weight. The simplest (although not necessarily the best) way of putting the data on the same scale is to normalize all variables by subtracting the average and dividing by the standard deviation.
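For the technically inclined, here is a minimal sketch of this computation in Python with NumPy; the height/weight values are invented for illustration and this is not Orange's internal code:

    import numpy as np

    # Toy data: each row is one instance, columns are height (m) and weight (kg).
    data = np.array([[1.62, 54.0],
                     [1.80, 81.0],
                     [1.75, 68.0]])

    # Normalize each column: subtract the average and divide by the standard
    # deviation, so that height and weight are on the same scale.
    normalized = (data - data.mean(axis=0)) / data.std(axis=0)

    # Euclidean distance between the first two instances.
    distance = np.sqrt(np.sum((normalized[0] - normalized[1]) ** 2))
    print(distance)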
For discrete data, we usually say that when two values are different, the difference between them is 1.

Sometimes the distance is not so trivial, especially when the data instances are not described by features/attributes/variables. Say that the objects of my study are lecturers at Kyoto University (including occasional speakers, like me) and we want to cluster them based on their scientific areas. How do we define a meaningful difference between them? This depends, of course, on the purpose of the grouping. Say that I would like to group them by their fields of expertise (and that I don't know the departments at which they actually work). The problem is actually easier than it looks: we have access to public databases of scientific papers. Say that Dr. Aoki and I have written 3 papers together, while altogether (that is, Dr. Aoki, me or both of us in coauthorship) we wrote 36 papers (I am making this up!). We can then say that the similarity between us is 3/36 or 0.08. The similarity between Dr. Nakayama and me is then 0 (we have no common papers, so the similarity is zero divided by something). Two professors who wrote all their papers together would have a similarity of 1. To compute the difference, we simply subtract the similarity from 1: the difference between Dr. Aoki and me would be 1 - 0.08 = 0.92. This definition of similarity is known as the Jaccard index (http://en.wikipedia.org/wiki/Jaccard_index).
The purpose of this discussion is not to present the Jaccard index but rather to demonstrate that there may be cases in which the definition of distance/difference/similarity requires a bit of creativity. (See an example on the web page: a "map" of scientific journals from computer science, where the similarity between a pair of journals is based on whether the same people contribute papers to both of them.)
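If you ever need it, the Jaccard-based difference from the example above is easy to compute yourself; a minimal sketch in plain Python, with paper sets invented to reproduce the numbers above:

    def jaccard_difference(papers_a, papers_b):
        """1 minus |common papers| / |all papers| for two sets of papers."""
        common = len(papers_a & papers_b)
        altogether = len(papers_a | papers_b)
        return 1 - common / altogether

    # Invented paper IDs: 3 common papers out of 36 altogether -> about 0.92.
    aoki = set(range(1, 20))        # papers 1..19
    demsar = set(range(17, 37))     # papers 17..36
    print(jaccard_difference(aoki, demsar))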
Hierarchical clustering
We will show how this works on animals. All data in the zoo data set is discrete, so the dissimilarity (distance) between two animals equals the number of features in which they differ. The schema in Orange Canvas looks like this:

The Example Distance widget gets a set of training instances and computes the distances between them. A couple of definitions are available, but all are attribute-based in the sense that they somehow sum up the differences between values of individual attributes.
Hierarchical clustering of this data is shown in the left picture, and the right picture shows a part of it. (I haven't constructed the data and have no idea why girls were included as animals similar to bears and dolphins.)

The figure is called a dendrogram. Probably not much explanation is needed with regard to what it means and how it needs to be read: it merges the animals into ever larger groups or, read from left to right, splits them into ever smaller groups.

Clustering is intended to be informative by itself. If I came from another planet and had no idea about the Earthly animals, this could help me organize them into a meaningful hierarchy.
We can play with the clustering widget by drawing a vertical threshold and "cutting" the clustering at a certain point, defining a certain number of distinct groups.

For the curious: this is how clustering actually works. The method goes from the bottom up (or, in this visualization, from right to left). At each step, it merges the two most similar clusters into a common cluster. It starts so that each animal (object, in general) has its own cluster, and finishes the merging when there is only a single cluster left.
The only detail to work out is how we define distances between clusters, if we know (that is, have defined, using Euclidean distances or the Jaccard index or whatever) the distance between single objects. The first choice is obvious: the distance between two clusters can be the average distance between all objects in one cluster and objects in another. This is called average linkage. The distance can also be defined as the distance between the closest objects belonging to the two clusters, the so-called minimal linkage. As you can guess, the maximal linkage is the distance between the most distant pair of objects belonging to the two clusters. And nobody can guess what the fourth option, Ward linkage, is. Neither does it matter: remember that the term linkage refers to the way clusters are compared to each other. When constructing the clustering, try all of them and find the one which gives the nicest picture.

The length of horizontal lines in the dendrogram corresponds to the similarity between clusters. The objects on the lowest level are similar to each other; later the distances usually grow. The optimal number of clusters is typically one at which we notice a sharp jump in the distances/line lengths.
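The same bottom-up procedure is implemented in many libraries; here is a minimal sketch with SciPy, assuming a small invented data matrix. The method names 'average', 'single' and 'complete' correspond to the average, minimal and maximal linkage described above (and 'ward' to Ward linkage):

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster
    from scipy.spatial.distance import pdist

    data = np.random.rand(10, 4)    # ten invented instances with four features
    distances = pdist(data)         # pairwise Euclidean distances

    # Merge the two closest clusters at each step, using average linkage;
    # 'single', 'complete' and 'ward' are the other options discussed above.
    tree = linkage(distances, method='average')

    # "Cut" the dendrogram into three clusters, like drawing the threshold line.
    labels = fcluster(tree, t=3, criterion='maxclust')
    print(labels)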
We can also turn the data around and compute distances between variables instead of between animals. We can, for instance, compute the Pearson correlation coefficient (which is actually closely related to the Euclidean distance).
Results may seem a bit strange at first: having hair is closely related to giving milk, that's OK. But having milk and laying eggs?! Being aquatic, having fins and breathing?! It makes sense, though: laying eggs and giving milk are strongly correlated, though the correlation is in this case negative. Knowing that the animal is aquatic tells us something about it having fins, but also something about whether it can breathe or not. We could, of course, define the distance differently and get a clustering of features which actually co-occur (positive correlation). As always in unsupervised modeling, it's all a question of how we define the distance.
Hands-on Exercise
 Load the data set called Iris.tab, compute Euclidean distances, construct a hierarchical clustering, and connect the result to the Scatter plot (the middle part of the schema below).
 For the clustering widget, see the recommended settings on the right. In particular, check Show cutoff line and Append cluster IDs, and set the place to Class attribute.
 The best pair of attributes to observe in the Scatter plot is petal length and petal width.
 Now you can put a cutoff line into the dendrogram; the dendrogram will be split into discrete clusters shown by colors and the widget will output instances classified according to their cluster membership, that is, the "class" of an instance will correspond to the cluster it belongs to. The Scatter plot will color the instances according to this cluster-based class instead of the original class. Play with the widget by setting different thresholds and observing the results in the Scatter plot. (If the widget does not output examples, uncheck Commit on change and check it again, or leave it unchecked and push Commit when you change the threshold; sorry for the bug.)
 You can also attach the Distribution widget to the clustering. In this case, change the Place in the Hierarchical clustering widget from Class attribute to Attribute. The ID of the cluster will now be added to instances as an ordinary attribute, not the class. By observing the distribution of iris types (the original class) within each cluster, we can see how well the constructed groups match the classes. In other words, this shows how well unsupervised learning can function in problems for which we could also use supervised learning.
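The same check can also be scripted outside Orange; a minimal sketch, assuming scikit-learn with its built-in copy of the iris data in place of Iris.tab and its agglomerative clustering in place of the widget:

    from collections import Counter
    from sklearn.datasets import load_iris
    from sklearn.cluster import AgglomerativeClustering

    iris = load_iris()
    clusters = AgglomerativeClustering(n_clusters=3).fit_predict(iris.data)

    # For each cluster, count the original iris types that fall into it.
    for cluster_id in range(3):
        species = Counter(iris.target_names[iris.target[clusters == cluster_id]])
        print(cluster_id, dict(species))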
K-means clustering
K-means clustering splits the data into k clusters, where the parameter k is set manually. It always uses the Euclidean distance (or a similar measure, but let's not go into that). It cannot offer any fancy dendrograms. It simply splits the data into k groups and that's it.

The advantage of k-means clustering is that its clusters are often "better defined" (purely subjectively and intuitively speaking) than those we get by hierarchical clustering. In a sense, if two objects in k-means belong to the same cluster, this "means more" than if two objects are in the same cluster when using hierarchical clustering.
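As a minimal sketch of the idea (here with scikit-learn and invented data, which is an assumption on my part, not what Orange actually calls internally):

    import numpy as np
    from sklearn.cluster import KMeans

    data = np.random.rand(100, 2)          # invented two-dimensional data

    kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
    labels = kmeans.fit_predict(data)      # cluster ID for every instance
    print(kmeans.cluster_centers_)         # the three cluster centers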
Hands-on Exercise
 Use the schema below to analyze how well the clusters found by k-means clustering match the actual classes, like we did with hierarchical clustering before.
 Note that the input to k-Means is not a matrix of distances between instances, but the instances (examples) themselves.
 The k-Means widget can propose the number of clusters, or we can set it manually. Try setting it to 3 and see how it works. Then let the widget determine the optimal number of clusters itself. If it proposes two, check the scatter plots on the first page of this text. Yes, that's it. See one of the basic problems of clustering (and unsupervised learning in general)?
Multidimensional scaling
Clustering may sometimes seem too rigid: an object belongs to one and only one cluster. Especially in hierarchical clustering (but also in k-means) it can often happen that an object is actually more similar to objects from another cluster than to objects from its own cluster. (This seems wrong, but if you think about it, you'll discover it's true and unavoidable.)
Enter multidimensional scaling. You can imagine it like this: you would want to have a map of Honshu cities, but all you have are the distances between all pairs of cities. Can you use them to reconstruct the map? Sure you can. The resulting map may have a wrong orientation (the data does not tell you what's north and what's west), and it can even be mirrored, but you can reconstruct a map of Honshu which is correct in the sense that it puts Osaka, Nara and Kyoto close to each other and halfway between Tokyo and Hiroshima.

You can do the same with other data. We can construct a map of animals in which similar animals (where similarity is computed in any way we want, say using Euclidean distances, as before) are close to each other and animals which share properties of two animal groups are put in between them. As an Orange-specific twist, we can organize the image by connecting the most similar pairs.

See the image below. (I must again apologize for the rude and politically incorrect connection between girls and gorillas; I repeat once again that this is a data set from the internet which is commonly used to demonstrate machine learning methods. I have nothing to do with it.)
Note that this method does not define any clusters, like hierarchical clustering or k-means do. Yet it makes the clusters obvious to the user, and this is often enough.
Multidimensional scaling uses an optimization method which can get "stuck" in a certain placement. When using it, you should repeat the layout optimization process multiple times and select the result which you like the most, or even keep multiple maps if they, for instance, show the data from different interesting perspectives.
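A minimal sketch of the idea, assuming scikit-learn and an invented distance matrix (note the random_state: different values give different placements, which is exactly why repeating the optimization pays off):

    import numpy as np
    from sklearn.manifold import MDS

    # Invented symmetric distance matrix between four "cities".
    distances = np.array([[0, 5, 6, 9],
                          [5, 0, 2, 7],
                          [6, 2, 0, 4],
                          [9, 7, 4, 0]])

    # dissimilarity='precomputed' tells MDS we supply distances, not coordinates.
    mds = MDS(n_components=2, dissimilarity='precomputed', random_state=0)
    coordinates = mds.fit_transform(distances)
    print(coordinates)   # a 2D "map" preserving the distances as well as possible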
Visualization of networks
Networks are becoming a very popular model in genetics and also elsewhere. They consist of vertices (nodes), which represent objects and are connected by edges. The network can be constructed in different ways. We can construct it from distances like those in clustering: we can connect two objects if they are sufficiently close to each other. A connection can also represent the existence of a relation. For instance, we can connect pairs of proteins which interact with each other. Two scientists are connected if they coauthored at least one paper. Two symptoms can be connected if they often co-occur (e.g. sore throat and fever). Two airports are connected if there is a direct flight between them. Two hospitals are connected if patients are transferred between them.
Analysis of networks is somewhat similar to observing the multidimensional scaling map (moreover, the technique used for optimizing the layout is similar, too). We can find "islands" of mutually interrelated objects (in graph theory, these are called cliques): groups of collaborating scientists, collaborating hospitals, related symptoms...

We can do more than that, though: we can select a bunch of objects based on their properties and then tell the tool we use (say, the Network explorer widget in Orange) to select and show only their vicinity, e.g. the network up to, say, their second-degree neighbors (nodes which are at most two edges away from the selected nodes). In genetics, we can select a subset of genes for which it is known that they play a role in a certain hereditary disease and then observe the genes related to these genes, in hope that some of these may be related to the disease, too.
We can find the hubs of the network: the nodes which have a significantly larger number of connections than other nodes and thus hold the network together. In genetics, these may be some general-purpose genes, e.g. genes corresponding to some common proteins. In social networks, these may be the most influential members of the society. In another context, these nodes may represent Al Qaeda leaders...

A small collection of interesting networks is published on the web site.
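For scripting such analyses, here is a minimal sketch with the NetworkX library and an invented collaboration network (the node names and edges are made up):

    import networkx as nx

    # An invented network: edges connect collaborating "scientists".
    edges = [("A", "B"), ("A", "C"), ("A", "D"), ("A", "E"), ("B", "C"),
             ("D", "F"), ("F", "G")]
    graph = nx.Graph(edges)

    # Hubs: the nodes with the largest number of connections.
    print(sorted(graph.degree, key=lambda pair: pair[1], reverse=True)[:2])

    # The vicinity of a node: everything at most two edges away from it.
    vicinity = nx.single_source_shortest_path_length(graph, "G", cutoff=2)
    print(sorted(vicinity))   # the second-degree neighborhood of node G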
Evaluation of model performance
Last time, we discussed supervised and unsupervised modeling, which give us various models describing the data. Today we will learn how to measure the quality of the obtained models, either to choose the optimal model or to check the quality of the model to decide whether it is good enough to be used.

In the context of unsupervised modeling, like clustering, the quality of models is difficult to evaluate. In most cases we can only evaluate them subjectively, by checking whether they make sense, whether they conform to our expectations and look useful for whatever we want to use them for. Supervised learning, however, gives models which make predictions, so their quality can be measured objectively by observing the quality of predictions. In today's lecture we'll learn how to prepare the test data and discuss different measures of quality of models.
Sampling the data for testing
There is a very simple rule: never test the model on the data which was used for its construction.

The purpose of assessing the performance of models is to determine how well the model will behave if used in practice. Since the model will be used on new, yet unseen data instances (patients), it also has to be tested on data which it has not seen before. Otherwise the best model would simply be one which remembers all data and uses that knowledge when making predictions (or, better, "postdictions").

For this purpose, we use resampling techniques in which only a (randomly) chosen subset of data is used for constructing the model and the remaining part is used to compute its quality. This is repeated multiple times and in the end we report the average quality of the model over all trials.
The most common technique used is called cross-validation. It splits the data into (usually) ten subsets, called folds. In the first step, it reserves the first fold for testing: it uses data from folds 2-10 for constructing the model and then tests it on the first fold. Then it reserves the second fold for testing; folds 1 and 3-10 are used for constructing the model and fold 2 for testing it. And so on; at each step one fold is held out for testing and the others are used for learning.
Another sampling method is called leave-one-out or jackknife. Instead of leaving out one fold (10% of examples, when we do a 10-fold cross-validation), it leaves out only one example at a time. It constructs a model from all examples except this one, and tests the prediction on this example. This gives very reliable results, but takes a lot of time: if the data has 1000 instances, it will construct 1000 models.
In repeated random sampling, we choose a given number of instances (say 70%) for constructing the model, use the remaining 30% for testing, and repeat this whole procedure a certain number of times.

Besides these three, which are also available in Orange, there exist a number of other resampling methods, which are nicely described here: http://en.wikipedia.org/wiki/Resampling_(statistics).
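To make the mechanics concrete, here is a minimal sketch of ten-fold cross-validation; the data is invented and the use of scikit-learn's naïve Bayes is my assumption for illustration (Orange's Test Learners widget, which we meet below, does the equivalent for you):

    import numpy as np
    from sklearn.model_selection import KFold
    from sklearn.naive_bayes import GaussianNB

    X = np.random.rand(200, 5)              # invented features
    y = np.random.randint(0, 2, 200)        # invented class labels

    accuracies = []
    for train_index, test_index in KFold(n_splits=10, shuffle=True).split(X):
        # Learn on nine folds, test on the held-out fold.
        model = GaussianNB().fit(X[train_index], y[train_index])
        accuracies.append(model.score(X[test_index], y[test_index]))
    print(np.mean(accuracies))              # average quality over all trials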
A small problem with all these methods is that they give the average performance of the constructed models and not of a single model. Say that we decide that the best model for our data is the naïve Bayesian classifier, which has an AUC of 0.90 (we will soon see what AUC is, don't worry) based on the average over ten-fold cross-validation. Having decided, we use the entire data set to construct one single naïve Bayesian model. We however do not know the expected AUC of this model. We only know that naïve Bayesian models for this data have an average AUC of 0.90, while this model may be better or worse.

We usually do not worry about this. First, the final model was constructed from a bit larger data set: from all 10 folds and not just 9 of them. More data usually means better models. Second, most models are quite stable (classification trees excluded!). All ten naïve Bayesian models from cross-validation were probably quite similar to each other and to this final model.
There is, though, one bigger problem. Constructing the model usually involves a lot of manual work: everything which we covered in our past lectures. When we get the data, we will use various visualizations and other methods to select the features which we want to use in our models, we may construct new features from the existing ones, and we will fit various thresholds and parameters of learning algorithms (e.g. we will set the desired size of the trees). Although we may have used some resampling technique in the end, the entire data set was used when preparing for the modeling, which is a part of modeling, too, and should not be done on the data on which the model is later tested.

This is the reason why we stressed in the first and in the second lecture that a part of the data needs to be left aside. As a rule of thumb, we usually randomly select 30% of the data and put it in a separate file. Then we go through everything we (hopefully) learned during these lectures using the remaining 70% of the data. When we decide on the final model, we take the held-out data and use it to test the model. These results are what we report as the true expected performance of our model.
Measures of model performance
There are multiple perspectives from which we can judge models with regard to the quality of their predictions.
Scalar measures of quality
Classification accuracy (CA) is the easiest and the most intuitive measure of quality. It measures the proportion of correct answers the model gives: if it makes the correct prediction for 85 cases out of 100, we say the classification accuracy is 85%.
Let us measure the accuracy of the naïve Bayesian classifier, logistic regression and classification trees on the heart disease data set. The central widget in today's lecture is called Test Learners. On its input it requires a data set (we will provide one from the File widget) and learning algorithms. The schema below looks a bit unusual: last week, we gave data to learning algorithms so that they could construct models from it. How do they construct models now, without data?
The widget for the naïve Bayesian classifier has two outputs. When given data (as last week), it outputs a model (which we fed to, for instance, the widget showing nomograms). But even when not given any data, it outputs a learning algorithm. A learning algorithm is not a model, but a procedure for building the model.

So, what Test Learners gets in this schema are not data and three models, but data and three procedures for making models. Test Learners splits the data in different ways (cross-validation, leave-one-out...), repeatedly runs the provided procedures on one subset and tests the resulting models on the other.
Hands-on exercise
 Construct the above schema.
Classification accuracy does not distinguish between the "meanings" of classes. Not all mistakes are equal. For instance, failing to detect a disease can cause the death of a patient, while diagnosing a healthy person as sick is usually less fateful. To distinguish between different types of wrong predictions, we will select one of the classes and call it the target class or the positive class. In medicine, positive will usually mean having a disease, a condition or something like that. Now we can distinguish between four cases of correct and wrong predictions:
 true positive (TP) denotes a data instance (e.g. a patient) for whom the model made a positive prediction and the prediction is correct (so the patient is truly positive)
 false positive (FP) denotes instances which were falsely classified as positive (e.g. we predict the patient to have the disease, though he does not)
 true negative (TN) instances are those for which the model gave a negative prediction and the prediction is correct
 false negative (FN) represents instances which are falsely classified as negative (they are sick but the model predicted them to be healthy)
                                predicted value
                          positive (P')          negative (N')
actual   positive (P)     true positive (TP)     false negative (FN)
value    negative (N)     false positive (FP)    true negative (TN)
Besides TP, FN, FP and TN, we will also use P and N to denote the number of positive and negative instances, and P' and N' for the number of all positive and negative predictions. Obviously, P = TP + FN, N = TN + FP, P' = TP + FP and N' = TN + FN. A matrix like this one is a special case of a confusion matrix. Orange shows it in a widget called Confusion matrix, which we connect to the Test Learners widget.
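The four counts and the classification accuracy are straightforward to compute from predictions; a minimal sketch in plain Python with invented labels:

    actual    = [1, 1, 1, 0, 0, 0, 1, 0, 0, 1]   # invented true classes (1 = positive)
    predicted = [1, 0, 1, 0, 1, 0, 1, 0, 0, 0]   # invented model predictions

    TP = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    FN = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    FP = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    TN = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)

    print(TP, FN, FP, TN)                    # the four cells of the confusion matrix
    print((TP + TN) / (TP + TN + FP + FN))   # classification accuracy (CA)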
Hands-on exercise
 To make things interesting, we will also add a Data table widget and a Scatter plot widget, to which we give examples from the File widget and from the Confusion matrix. The latter should represent a subset of data (the easiest way to