
Janez Demšar

Data Mining

Material for lectures on Data Mining at Kyoto University, Dept. of Health Informatics, in July 2010


Contents

INTRODUCTION
(FREE) DATA MINING SOFTWARE
BASIC DATA MANIPULATION AND PREPARATION
SOME TERMINOLOGY
DATA PREPARATION
FORMATTING THE DATA
BASIC DATA INTEGRITY CHECKS
MAKING THE DATA USER FRIENDLY
KEEPING SOME OF THE DATA FOR LATER TESTING
MISSING DATA
DISCRETE VERSUS NUMERIC DATA
FROM DISCRETE TO CONTINUOUS
FROM CONTINUOUS TO DISCRETE
VARIABLE SELECTION
FEATURE CONSTRUCTION AND DIMENSIONALITY REDUCTION
SIMPLIFYING VARIABLES
CONSTRUCTION OF DERIVED VARIABLES
VISUALIZATION
HISTOGRAMS
SCATTER PLOT
PARALLEL COORDINATES
MOSAIC PLOT
SIEVE DIAGRAM
"INTELLIGENT" VISUALIZATIONS
DATA MODELING (LEARNING, CLUSTERING)
PREDICTIVE MODELING / LEARNING
CLASSIFICATION TREES
CLASSIFICATION RULES
NAÏVE BAYESIAN CLASSIFIER AND LOGISTIC REGRESSION


OTHER METHODS
UNSUPERVISED MODELING
DISTANCES
HIERARCHICAL CLUSTERING
K-MEANS CLUSTERING
MULTIDIMENSIONAL SCALING
VISUALIZATION OF NETWORKS
EVALUATION OF MODEL PERFORMANCE
SAMPLING THE DATA FOR TESTING
MEASURES OF MODEL PERFORMANCE
SCALAR MEASURES OF QUALITY
RECEIVER OPERATING CHARACTERISTIC (ROC) CURVE
CLASSIFICATION COST


    Introduction

This material contains the "handouts" given to students for the data mining lectures held at the Department of Health Informatics at the University of Kyoto. The informality of the "course", which took five two-hour lectures, is reflected in the informality of this text, too, as it even includes references to discussions at past lectures and so on.

Some more material is available at http://www.ailab.si/janez/kyoto.


(Free) Data mining software

Here is a very small selection of free data mining software. Google for more.

Orange (http://www.ailab.si) is the free data mining software we are going to use for these lectures.

Knime (http://www.knime.org/) is excellent, well designed and nice looking. It has fewer visualization methods than Orange and it's a bit more difficult to work with, but it won't cost you anything to install it and try it out.

Weka (http://www.cs.waikato.ac.nz/ml/weka/) is oriented more towards machine learning. While very strong in terms of the different kinds of prediction models it can build, it has practically no visualization and it's difficult to use. I don't really recommend it.

Tanagra (http://eric.univ-lyon2.fr/~ricco/tanagra/) is a smaller package written by a one-man band, but it has much more statistics than Orange.

Orange has a different user interface from what you are used to (similar to Knime and, say, SAS Miner and SPSS Modeler). Its central part is the Orange Canvas onto which we put widgets. Each widget has a certain function (loading data, filtering, fitting a certain model, showing some visualization), and receives and/or gives data from/to other widgets. We say that widgets send signals to each other. We set up the data flow by arranging the widgets into a schema and connecting them. Each widget is represented by an icon which can have "ears" on the left (inputs) and right (outputs). Two widgets can only be connected if their signal types match: a widget outputting a data table cannot be connected to a widget which expects something completely different. We will learn more about Orange as we go.


Basic data manipulation and preparation

Some terminology

Throughout the lectures you will have to listen to an unfortunate mixture of terminology from statistics, machine learning and elsewhere. Here is a short dictionary.

Our data sample consists of data instances, e.g. patient records. In machine learning we prefer to say we have a data set and it consists of examples.

Each example is described by features, or values of variables (statistics), or attributes (machine learning). These features can be numerical (continuous) or discrete (symbolic, ordinal or nominal; the latter two terms actually refer to two different things: ordinal values can be ordered, e.g. small, medium, big, while nominal values have no specific order, just names, e.g. red, green, blue).

Often there is a certain variable we want to predict. Statisticians call it the dependent variable, in medicine you will hear the term outcome, and in machine learning we call it the class or class attribute. Such data is called classified data (no relation to the CIA or KGB).

Machine learning considers examples to belong to two or more classes, and the task is then to determine the unknown class of new examples (e.g. diagnoses for newly arrived patients). We say that we classify new examples, the method for doing this is called a classifier, the problem of finding a good classifier is called a classification problem, and so on. The process of constructing the classifier is called classifier induction from data, or also learning from data. The part of the data used for learning is often called the learning data or training data (as opposed to the testing data, which we use for testing).

The "class" can also be continuous. If so, we shouldn't call it a class (although we often do). The problem is then not a classification problem anymore; we call it a regression problem.

In statistical terminology, the task is to predict, and the statistical term for a classifier is a predictive model. We do not learn here, we fit the model to the data. We sometimes also use the term modeling. If the data is classified, the problem is called supervised modeling, otherwise it's unsupervised.

Since data mining is based on both fields, we will mix the terminology all the time. Get used to it.

Data preparation

This is related to Orange, but similar things also have to be done when using any other data mining software. Moreover, especially the procedures we will talk about in the beginning are also applicable to any means of handling the data, including statistics. They are so obvious that we typically forget about them and scratch our heads later on, when things begin looking weird.


If all this seems like common sense to you, that is simply because most of data mining is just the systematic application of pure common sense.

Formatting the data

Orange does not handle Excel data and SQL databases (if you don't know what SQL is, don't bother) well yet, so we feed it the data in textual tab-delimited form. The simplest way to prepare the data is to use Excel. Put the names of the variables into the first row, followed by one data instance in each subsequent row. Save as tab-delimited, with the extension .txt.

This format is somewhat limited with respect to what you can tell Orange about the variables. A bit more complicated (but still simple) format uses the second and the third row to define the variable types and properties. You can read more about it at http://www.ailab.si/orange/doc/reference/tabdelimited.htm. You can see several examples in c:\Python26\lib\site-packages\orange\doc\datasets, assuming that you use Windows and Python 2.6.
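If you prefer to prepare such a file programmatically rather than in Excel, here is a minimal sketch (the variable names and the file name are hypothetical; it uses only Python's standard csv module):

    import csv

    rows = [
        ["gender", "age", "SBP", "outcome"],   # first row: variable names
        ["male",   63,    145,   "1"],         # one data instance per row
        ["female", 41,    130,   "0"],
    ]

    with open("patients.txt", "w", newline="") as f:   # .txt, tab-delimited
        writer = csv.writer(f, delimiter="\t")
        writer.writerows(rows)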

For technical reasons, Orange cannot handle values, variable names or even file names which include non-ASCII (e.g. kanji) characters. There is little we can do to fix Orange at the moment.

Basic data integrity checks

Check the data for errors and inconsistencies. Check for spelling errors. If somebody mistypes male as mole, we are suddenly going to have three sexes in the data: females, males and moles.

Orange (and many other tools) is case sensitive: you should not write the same values sometimes with small and sometimes with capital letters. That is, female and Female are two different species.

Check for any other obviously wrong data, such as mistyped numbers (there are not many 18 kg patients around; did they mean 81?!).

How are unknown values denoted? There are various "conventions": question marks, periods, blank fields, NA, -1, 9, 99... The value 9 seems to be especially popular for denoting unknowns in symbolic values encoded numerically, like gender, e.g. 0 for female, 1 for male and 9 for unknown. If the 9's are kept in the data and the software is not specifically told that 9 means an unknown value, it will either think that there are three or even ten sexes (as if life was not complicated enough with two), or even that sex is a numerical variable ranging from 0 to 9.

You will often encounter data in which unknowns are marked differently for different variables, or even for one and the same variable.

Does the same data always mean the same thing? And the opposite: can the same thing be encoded by more than one value? Check for errors in the data recording protocol, especially if the data was collected by different people or institutions. In the data I'm working on right now, it would seem that some hospitals have the majority of patients classified as Simple trauma, while others have most patients with Other_Ent. Could this be one and the same thing, but recorded differently in different hospitals due to lax specifications?

A typical, classical example, which I happily also encountered while in Nara, is that some institutions would record the patients' height in meters and others in centimeters, giving an interesting bimodal distribution (besides rendering the variable useless for further analysis).

Most such problems can be discovered by observing the distributions of variable values.

Such discrepancies should be found and solved in the beginning. If you were given the data by somebody else, you should ask him to examine these problems.

Hands-on: Load the data into Orange and see whether it looks OK. Try to find any mistakes and fix them (use Excel or a similar program for that). Widgets: File, Data Table, Attribute Statistics, Distributions.

Making the data user friendly

Data is often encoded by numbers: 0 for female, 1 for male (wait, or was it the opposite? you will ask yourself all the time while working with such data). I just got data in which all nominal values had their descriptive names, except gender, for some reason. Consider renaming such values; it will make your life more pleasant.

This of course depends upon the software and the methods which you are going to use. We will encounter a simple case quite soon in which we actually want the gender to be encoded by 0 and 1.

Keeping some of the data for later testing

An important problem, which was also pointed out by Prof. Nakayama at our first lecture, was how to combine statistical hypothesis testing with data mining. The only correct way I know of (the only one with which you should have no problem trusting your results and convincing the reviewers of your paper of their validity, too) is to use a separate data set for testing your hypotheses and models. Since you typically get all the data which is available in the beginning, you should sacrifice a part of your data for testing.

As a rule of thumb, you would randomly select 30% of the data instances, put them into a separate file and leave it alone till just before the end.

If your data has dates (e.g. admittance date), it may make a more convincing case if, instead of randomly selecting 30%, you take out the last 30% of the data. This would then look like a simulated follow-up study. If you decide for this, you only have to be careful that your data is unbiased. For instance, if you are researching a condition related to diabetes and the last 30% of the data was collected in the summer, using this data as a test set may not be a good idea (since diabetic patients have more problems in summer, as far as I know).

You should split the data at this point. Fixing the problems in the data which we discussed above before splitting it is a good idea; otherwise you will have to repeat the cleaning on the test data separately. But from this point on, we shall do things with the data which already affect your model, so the test data should not be involved in these procedures.

Hands-on: There is a general-purpose widget in Orange to select random samples from the data. You can feed the data from this widget into the one for saving data. Try it and prepare a separate data file for testing. Widgets: Data Sampler, Save.
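If you want to make the same split outside Orange, here is a minimal sketch (assuming the tab-delimited file described earlier; the file names are hypothetical):

    import random

    with open("patients.txt") as f:
        header, *rows = f.readlines()       # first line holds the variable names

    random.seed(1)                          # make the split reproducible
    random.shuffle(rows)
    n_test = int(0.3 * len(rows))           # roughly 30% of instances for testing

    with open("test.txt", "w") as f:
        f.writelines([header] + rows[:n_test])
    with open("train.txt", "w") as f:
        f.writelines([header] + rows[n_test:])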

Missing data

This topic again also concerns statistics, although typical statistical courses show little interest in it. Some data will often be missing. What to do about that?

First, there are different kinds of missing data.
- It can be missing because we don't care about it, since it is irrelevant and was not collected for a particular patient. (I didn't measure his pulse rate since he seemed perfectly OK.)
- It can be inapplicable. (No, my son is not pregnant. Yes, doctor, I'm certain about it.)
- It may be just unknown.

Especially in the latter case, it may be interesting to know: why? Is its unknownness related to the case? Does it tell us something?

Maybe the data was not collected since its value was known (e.g. the pulse rate was not recorded when it was normal). Maybe the missing values can be inferred from the other attributes (blood loss was not recorded when there were no wounds, so we know it is zero), or the value is inapplicable (the reason of death is missing, but another attribute tells us that the patient survived).

For a good example: in a data set which I once worked with, a number of patients had a missing value for urine color. When I asked the person who gave me the data about the reason, he told me that they were unconscious at the time they were picked up by the ER team.

Knowing why a value is missing can help in dealing with it. Here are a few things we can do:
- Ask the owner of the data or a domain expert if he can provide either per-instance or general values to impute (the technical term for inserting values for unknowns). If the pulse was not recorded when it was normal, well, we know what the missing value should be.


- Sometimes you may want to replace the missing value with a special value, or even add a specific variable telling whether the missing value is known or not. For the urine color example, if the data does not contain a variable telling us whether the patient is conscious, we can add the variable Conscious and determine its value based on whether the urine color is known or not. If it is known, the patient is conscious; if not, he is "maybe conscious" (or even "unconscious", if we know that the data was collected so that the color is known for all conscious patients). Not knowing a value can thus be quite informative.
- Sometimes you can use a model to impute the unknown value. For instance, if the data is on children below 5, and some data on their weights is missing but you know their ages, you can impute the average body mass for the child's age. You can even use sophisticated models for that (that is, instead of predicting what you have to predict for this data, e.g. the diagnosis, you predict values of features based on values of other features).
- Impute constant values, such as the worst or the best possible, or the most common value (for discrete attributes) or the average/median (for continuous ones).
- Remove all instances with missing values. This is the safest route, except that it reduces your data set. You may not like this one if your sample is already small and you have too many missing values.
- Remove the variables with missing values. You won't like this one if the variable is important, obviously. You may combine this with the previous: remove the variables with most missing values and then remove the data instances which are missing values of any of the other attributes. This may still be painful, though.
- Keep everything as it is. Most methods, especially in machine learning, can handle missing values without any problem. Be aware, though, that this often implies some kind of implicit imputation, guessing. If you know something about these missing values, it is better to use one of the first solutions described.

Hands-on: There is an imputation widget in Orange. See what it offers. Widget: Impute.
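Outside Orange, the simplest imputations from the list above take only a few lines; a sketch with pandas on a made-up table (the column names are hypothetical):

    import numpy as np
    import pandas as pd

    data = pd.DataFrame({
        "SBP":         [145, np.nan, 130, 150],
        "gender":      ["male", "female", None, "male"],
        "urine_color": ["normal", None, "red", None],
    })

    # numeric variable: impute the mean (the median works the same way)
    data["SBP"] = data["SBP"].fillna(data["SBP"].mean())

    # discrete variable: impute the most common value
    data["gender"] = data["gender"].fillna(data["gender"].mode()[0])

    # or keep the unknowns and add an indicator variable instead
    data["urine_color_known"] = data["urine_color"].notna()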

Discrete versus numeric data

Most statistical methods require numeric data, and many machine learning methods prefer discrete data. How do we convert back and forth?

From discrete to continuous

This task is really simple for variables which can only take two values, like gender (if we don't have the gender 9). Here you can simply change these values to 0 and 1. So if 0 is female and 1 is male, the variable represents the "masculinity" of the patient. Consider renaming the variable to "male" so you won't have to guess the meaning of 0 and 1 all the time. Statisticians call such variables dummy variables, dummies, indicator variables, or indicators.

If the variable is actually ordered (small, medium, big) you may consider encoding it as 0, 1, 2. Be careful, though: some methods will assume linearity. Is the difference between small and big really two times that between small and medium?

With general multi-valued variables, there are two options. When your patients can be Asian, Caucasian or Black, you can replace the variable Race with three variables: Asian, Caucasian and Black. For each patient, only one would be 1. (In statistics, if you do, for instance, linear regression on this data, you need to remove the constant term!)

The other option occurs in cases where you can designate one value as "normal". If urine color can be normal, red or green, you would replace it with two attributes, red and green. At most one of them can be 1; if both are zero, that means that the color is normal.

Hands-on: There is a widget in Orange for this kind of conversion. See what it offers. Widget: Continuize.
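The same indicator encoding can also be written out by hand; a minimal sketch with made-up urine colors, treating "normal" as the baseline value that gets no indicator of its own:

    colors = ["normal", "red", "green", "red", "normal"]

    # one indicator per non-baseline value; "normal" is encoded as (0, 0)
    encoded = [(int(c == "red"), int(c == "green")) for c in colors]
    print(encoded)   # [(0, 0), (1, 0), (0, 1), (1, 0), (0, 0)]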

From continuous to discrete

This procedure is called categorization in statistics, since it splits the data into "categories". For instance, the patients can be split into the categories young, middle-aged and old. In machine learning this is called discretization.

Discretization can be unsupervised or supervised (there go those ugly terms again!). Unsupervised discretization is based on the values of the variable alone. Most typically we split the data into quartiles, that is, into four approximately equal groups based on the value of the attribute. We could thus have the youngest quarter of the patients, the second youngest, the older and the oldest quarter. This procedure is simple and usually works best. Instead of four we can also use another number of intervals, although we seldom use more than 5 or fewer than 3 (unless we have a very good reason for doing so).
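A minimal sketch of such equal-frequency (quartile) discretization, assuming numpy and made-up ages:

    import numpy as np

    age = np.array([29, 34, 41, 45, 50, 52, 57, 61, 63, 70, 74, 77])

    # thresholds at the 25th, 50th and 75th percentile give four groups
    thresholds = np.percentile(age, [25, 50, 75])
    groups = np.digitize(age, thresholds)   # 0 = youngest quarter ... 3 = oldest
    print(thresholds, groups)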

Supervised methods set the boundaries (also called thresholds) between intervals so that the data is split into groups which differ in class membership. Put simply, say that a certain disease tends to be more common for people older than 70. 70 is thus the natural threshold which splits the patients into two groups, and it does not really matter whether it splits the sample of patients in half or where the quartiles lie.

The supervised methods are based on optimizing a function which defines the quality of the split. One can, for instance, compute the chi-square statistic to compare the distribution of outcomes (class distributions, as we would say in machine learning) for those above and below a certain threshold. Different thresholds yield different groups and thus different chi-squares. We can then simply choose the threshold with the highest chi-square.
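A sketch of such a threshold search, assuming scipy's chi2_contingency for the chi-square statistic and a made-up toy sample:

    import numpy as np
    from scipy.stats import chi2_contingency

    age     = np.array([35, 42, 50, 55, 61, 63, 68, 72, 75, 80])
    disease = np.array([ 0,  0,  0,  1,  0,  1,  1,  1,  1,  1])

    def chi2_for_threshold(t):
        above = age > t
        # 2x2 contingency table: above/below threshold vs. disease yes/no
        table = [[np.sum( above & (disease == 1)), np.sum( above & (disease == 0))],
                 [np.sum(~above & (disease == 1)), np.sum(~above & (disease == 0))]]
        return chi2_contingency(table)[0]

    candidates = (age[:-1] + age[1:]) / 2        # midpoints between sorted values
    best = max(candidates, key=chi2_for_threshold)
    print("best threshold:", best)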

A popular approach in machine learning is to compute the so-called information gain of the split, which measures how much information about the instance's class (e.g. the disease outcome) we get from knowing whether the patient is above or below a certain threshold. While related to chi-square and similar measures, one nice thing about this approach is that it can evaluate whether setting the threshold gives more information than it requires. In other words, it includes a criterion which tells us whether to split the interval or not.

With that, we can run the method in an automatic mode which splits the values of the continuous variable into ever smaller subintervals until it decides it's enough. Quite often it will not split the variable at all, hence proposing to use "a single interval". This implicitly tells us that the variable is useless (that is, the variable tells us nothing; it has an information gain of 0 for any split) and we can discard it.

You can combine supervised methods with manual setting of thresholds. The figure above shows a part of Orange's Discretize widget. The graph shows the chi-square statistic (y axis) for the attribute we would get if we split the interval of values of the continuous attribute in two at the given threshold (x axis). The gray curve shows the probability of class Y (right-hand y axis) at various values of the attribute. Histograms at the bottom and the top show the distribution of the attribute's values for both classes.

Finally, there may exist a standard categorization for the feature, so you can forget about the unsupervised and supervised methods and use the expert-provided intervals instead. For instance, by some Japanese definition a BMI between 18.5 and 22.9 is normal, 23.0 to 24.9 is overweight, and 25.0 and above is obese. These numbers were decided with some reason, so you can just use them unless you have a particular reason for defining other thresholds.

While discretization would seem to decrease the precision of measurements, the discretized data is usually just as good as the original numerical data. It may even increase the accuracy when the data has some strong outliers, that is, measurements out of scale. By splitting the data into four intervals, the outliers are simply put into the last interval and do not affect the data as strongly as before. (For those more versed in statistics: this has a similar effect as using a non-parametric test instead of a parametric one, for instance Wilcoxon's test instead of the t-test.)

Hands-on: Load the heart disease data from the documentation data sets and observe how the Discretize widget discretizes the data. In particular, observe the Age attribute. Select only women and see if you can find anything interesting. Widgets: Discretize, Select Data.

Variable selection

We often need to select a subset of variables and of data instances. Using fewer variables can often result in better predictive models in terms of accuracy and stability.

- Decide which attributes are really important. If you take too many, you are going to have a mess. If you remove too many, you may miss something important.
- As we already mentioned, you can remove attributes with too many missing values. These will be problematic for many data mining procedures, and they are not very useful anyway, since they don't give you much data. Also, if they weren't measured when your data was collected (too expensive? difficult? invasive or painful?), think about whether those who use your model will be prepared to measure them.
- If you have several attributes with a similar meaning, it is often beneficial to keep only one, unless they are measured separately and there is a lot of noise, so you can use multiple attributes to average out the noise. (You can decide later which one you are keeping.)
- If you are building a predictive model, verify which attributes will be available when the prediction needs to be made. Remove the attributes that are only measured later, or those whose values depend upon the prediction (for instance, drugs that are administered depending upon the prediction of a cancer recurrence).

Data mining software can help you in making these decisions.

First, there are simple univariate measures of attribute quality. With two-class problems, you can measure the attribute significance by a t-test for continuous and chi-square for discrete attributes. For multi-class problems, you have ANOVA for continuous and chi-square (again) for discrete attributes. In regression problems you probably want to check the univariate regression coefficients (I honestly confess that I haven't done so yet).
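A sketch of such univariate scoring on made-up data, assuming scipy: a t-test for a continuous attribute and chi-square for a discrete one (smaller p-values suggest more informative attributes):

    import numpy as np
    from scipy.stats import ttest_ind, chi2_contingency

    outcome = np.array([0, 0, 0, 0, 1, 1, 1, 1])
    age     = np.array([40, 45, 50, 52, 58, 63, 66, 70])          # continuous attribute
    gender  = np.array(["f", "f", "m", "f", "m", "m", "m", "f"])  # discrete attribute

    # continuous attribute: compare the two class groups with a t-test
    t, p_age = ttest_ind(age[outcome == 0], age[outcome == 1])

    # discrete attribute: chi-square on the gender-by-outcome contingency table
    table = [[np.sum((gender == g) & (outcome == c)) for c in (0, 1)] for g in ("f", "m")]
    chi2, p_gender = chi2_contingency(table)[:2]

    print(p_age, p_gender)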



Second, you may observe the relations between attributes, and between attributes and the outcome, graphically. Sometimes relations will just pop up, so you want to keep the attribute, and sometimes it will be obvious that there are none, so you want to discard it.

Finally, if you are constructing a supervised model, the features can be chosen during the model induction. This is called model-based variable selection or, in machine learning, a wrapper approach to feature subset selection. In any case, there are three basic methods. Backward feature selection starts with all variables, constructs the model and then checks, one by one, which variable it can exclude without harming (or with the least harm to) the predictive accuracy (or another measure of quality) of the model. It removes that variable and repeats the process until any further removal would drastically decrease the model's quality. Forward feature selection starts with no variables and adds one at a time, always the one from which the model profits most; it stops when we gain nothing more by further additions. Finally, in stepwise selection, one feature can be added or removed at each step.
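A sketch of forward selection as a wrapper around an arbitrary model, assuming scikit-learn (the data here is synthetic; any other model and quality measure could be plugged in):

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5))                 # five candidate attributes
    y = (X[:, 0] + X[:, 2] > 0).astype(int)       # only attributes 0 and 2 matter

    def score(features):
        # quality of the model built from the chosen attribute subset
        return cross_val_score(LogisticRegression(max_iter=1000), X[:, features], y, cv=5).mean()

    selected, remaining = [], list(range(X.shape[1]))
    while remaining:
        best = max(remaining, key=lambda f: score(selected + [f]))
        if selected and score(selected + [best]) <= score(selected):
            break                                 # no further gain: stop adding
        selected.append(best)
        remaining.remove(best)

    print("selected attributes:", selected)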

The figure shows a part of Orange's widget for ranking variables according to various measures (we won't go into details about what they are, but we can if you are interested; just ask!).

Note that the general problem with univariate measures (that is, those which only consider a single attribute at a time) is that they cannot assess (or have difficulties with assessing) the importance of attributes which are important in the context of other attributes. If a certain disease is common among young males and old females, then neither gender nor age by itself tells us anything, so both variables would seem worthless. Considered together, they tell us a lot.

Hands-on: There are two widgets for selecting a subset of attributes. One is concentrated on estimating the attribute quality (Rank); with the other we can remove/add attributes and change their roles (class attribute, meta attributes). Especially the latter is very useful when exploring the data, for instance to test different subsets of attributes when building prediction models. Widgets: Rank, Select Attributes.



Feature construction and dimensionality reduction

Simplifying variables

You can often achieve better results by merging several values of a discrete variable into a single one. Sometimes you may also want to simplify the problem by merging the class values. For instance, in the non-alcoholic fatty liver disease which I'm studying right now, the liver condition is classified into one of four Brunt stages, yet most (or even all) papers on this topic which construct predictive models converted the problem into a binary one, for instance whether the stage is 0, 1 or 2 (Outcome=0), or 3 or 4 (Outcome=1). Reformulated like this, the problem is easier to solve. Besides that, many statistical and machine learning methods only work with binary and not multi-valued outcomes.

There are no widgets for this in Orange; you have to do it in, say, Excel. Orange can help you in deciding which values are "compatible" in a sense. We will learn about this next time, as we explore visualizations.
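Outside Excel, merging class values is a one-liner in Python; a sketch that recodes the staging example above (the list of stages is made up):

    stages = [0, 3, 1, 4, 2, 3]

    # stages 0-2 become outcome 0, stages 3-4 become outcome 1
    outcome = [int(stage >= 3) for stage in stages]
    print(outcome)   # [0, 1, 0, 1, 0, 1]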

Construction of derived variables

Sometimes it is useful to combine several features into one, e.g. the presence of neurological, pulmonary and cardiovascular problems into general problems. There can be various reasons for doing this, the simplest being a small proportion of patients with each single disease. Some features need to be combined because of interactions, as in the case described above: while age and gender considered together may give all the information we need, most statistical and machine learning methods just cannot use them, since they consider (and discard) them one at a time. We need to combine the two variables into a new one telling us whether the patient belongs to one of the increased-risk groups.

Finally, we should mention, but really only mention, the so-called curse of dimensionality. In many areas of science, particularly in genetics, we have more variables than instances. For instance, we measure the expressions of 30,000 genes on 100 human cancer tissues. A rule of thumb (but nothing more than a rule of thumb) in statistics is that the sample size must be at least 30 times the number of variables, so we certainly cannot run, say, linear or logistic regression (nor any machine learning algorithm) on this data.

There are no easy solutions to this problem. For instance, one may apply principal component analysis (or supervised principal component analysis or similar procedures), a statistical procedure which can construct a small set of, say, 10 variables computed from the entire set of variables in such a way that these 10 variables carry as much (or all) of the data from the original variables. The problem is that principal component analysis itself would also require, well, 30 times more data instances than variables (we can bend this rule, but not to the point where the number of variables exceeds the sample size, at least not by that much).
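A minimal sketch of principal component analysis with scikit-learn, reducing made-up expression data to 10 derived variables:

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    expressions = rng.normal(size=(100, 3000))   # 100 tissues, 3000 gene expressions

    pca = PCA(n_components=10)
    reduced = pca.fit_transform(expressions)     # 100 x 10 matrix of derived variables
    print(reduced.shape, pca.explained_variance_ratio_.sum())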

OK, if we cannot keep the data from all the variables, can we just discard the majority of variables, since we know that they are not important? Out of 30,000 genes, only a couple, a dozen maybe, are related to the disease we study. We could do this, but how do we determine which genes are related to the disease?


T-tests or information gain or similar measures for each gene separately wouldn't work: with so many genes and such a small sample, many genes will be randomly related to what we are trying to predict. If we are unlucky (and in this context we are), the randomly related genes will look more related to the disease than those which are actually relevant.

The only way out is to use supplementary data sources to filter the variables. In genetics, one may consider only genes which are already known to be related to the disease, and genes which are known to be related to these genes. More advanced methods use variables which represent average expressions (or whatever we measure) of groups of genes, where these groups are defined based on a large number of independent experiments, on gene sequence similarities, co-expressions and similar.

The area of dimensionality reduction is on the frontier of data mining in genetics. The progress in discovering the genetics of cancer and potential cures lies not so much in collecting more data but rather in finding ways to split the data we have into irrelevant and relevant.


Visualization

We shall explore some general-purpose visualizations offered by Orange and similar packages.

Histograms

The histogram is a common visualization for representing absolute or relative distributions of discrete or numeric data. In line with its machine learning origins, Orange adds functionality related to classified data, that is, data in which instances belong to two or more classes. The widget for plotting histograms is called Distributions. The picture below shows the gender distribution on the heart disease data set, which we briefly worked with in the last lecture.

On the left-hand side, the user can choose the variable whose distribution(s) he wants to observe. Here we chose the discrete attribute gender. The scale on the left side of the plot corresponds to the number of instances with each value of the attribute, which is shown by the heights of the bars. Colored regions of the bars correspond to the two classes: blue represents outcome "0" (the vessels around the heart are not narrowed), and red parts represent the patients with clogged vessels (outcome "1"). The data contains twice as many men as women. With regard to proportions, males are much more likely to be "red" (have the disease) than females.

To compare such probabilities, we can use the scale on the right, which corresponds to the probability of the target value (we chose target value "1", see the settings on the left). The circles and lines drawn over the bars show these probabilities: for females it's around 25% (with a confidence interval of approximately 20-30%), and for males it's around 55% (51-59). You can also see these numbers by moving the mouse cursor over parts of the bar.

Note: to see the confidence intervals, you need to click Settings (top left) and then check Show confidence intervals.

Similar graphs can be drawn for numeric attributes, for instance SBP at rest. The number of bars can be set in Settings. Probabilities are now shown with the thicker line in the middle, and the thinner lines represent the confidence intervals. You can notice how these spread out on the left and right sides, where there are fewer data instances available to make the estimation.

Hands-on exercise: Plot the histogram for the age distribution. Do you notice something interesting with regard to the probabilities?

Scatter plot

Scatter plots show individual data instances as symbols whose positions in the plane are defined by the values of two attributes. Additional features can be represented by the shape, color and size of the symbols. There are many ways to tweak the scatter plot into even more interesting visualizations. The Gap Minder application, with which we amused ourselves in our first lecture (Japan vs. Slovenia vs. China), is nothing else than a scatter plot with an additional slider to browse through historic data.

    Here isoneofmy favorites:patients' ageversusmaximalheart rateduringexercise.Colorsagain tell

    whether

    the

    patient

    has

    clogged

    vessels

    or

    not.

    First,

    disregarding

    the

    colors,

    we

    see

    the

    relation

    between

    the

    two

    variables

    (this

    is

    what

    scatter

    plots

    aremostusefulfor).Youngerpatientscanreachhigherheartrates(thebottomrightsideoftheplotis

    empty).Amongolderpatients,therearesomewhocanstillreachalmosttheheartratesalmostashigh

    as youngsters (although the topmost right area is a bit empty), while some patients' heart rates are

    ratherlow.Thedotsarespreadinakindoftriangle,andtheageandheartrateareinverselycorrelated

    theolderthepatient,thelowertheheartrate.

    Furthermore, the dots right below are almost exclusively red, so these are the people with clogged

    vessels.Leftabove,peopleareyoungandhealthy,whilethoseatthetopabovecanbeboth.

    Thisisatypicalexampleofhowdatamininggoes:atthispointwecanstartexploringthepattern.Isit

    trueor

    just

    coincidental?

    The

    triangular

    shape

    of

    the

    dot

    cloud

    seems

    rather

    convincing,

    so

    let's

    say

    we

    believeit.Ifso,whatisthereason?Canweexplainit?Anaveexplanationcouldbethatwhenweage,

    somepeoplestillexercisealotandtheirheartsare"asgoodasnew",whilesomearenotthislucky.

    Thisisnotnecessarilytrue.WhenIshowedthisgraphtocolleagueswithmoreknowledgeincardiology,

    they told me that the lower heart rate may be caused by beta blockers, so this can be a kind of

    confoundingfactorwhichisnotincludedinthedata.Ifthisisthecase,thenheartratetellsus(oritisat

    leastrelatedto)whetherthepatientistakingbetablockersornot.


To verify this we could set the symbol size to denote the rest SBP (to make the picture clearer, I also increased the symbol size and transparency, see the Settings tab). Instead of the rest SBP, we could also plot the ST segment variable. I don't know the data and the problem well enough to be able to say whether this really tells us something or not.

The icons at the bottom let us choose tools for magnifying parts of the graph or for selecting data in the graph (icons with the rectangle and the lasso). When examples are selected, the widget sends them to any widgets connected to its output.

Hands-on exercise: Draw the scatter plot so that the colors represent the gender instead of the outcome. Does it seem that the patients whose heart rates go down are prevalently of one gender (which one)? Can you verify whether this is actually the case? (Hint: select these examples and use the Attribute Statistics widget to compare the distribution of genders on the entire data set and on the selected sample.)

The widget has many more tricks and uses. For instance, it has multiple outputs. You can give it a subset of examples which comes from another widget. It will show the corresponding examples from this subset with filled symbols, while the other symbols will be hollow. In the example below we used the Select Data widget to select patients with exercise-induced angina and rest SBP below 131, and show the group as a subgroup in the Scatter plot. Note that this schema is really simple: many more widgets can output data, and the subset plotted can be far more interesting than this one.


Hands-on exercise: Construct the schema above and see what happens in the Scatter plot.

Hands-on exercise: Let's explore the difference between the genders in some more detail. Remember the interesting finding about the probabilities of having clogged vessels with respect to the age of the patient? Use the Select Data widget to separate women and men and then use two Distributions widgets to plot the probability of disease with respect to age for each gender separately. Can you explain this very interesting finding (although not unexpected, when you think about it) from a medical point of view?

There is another widget for plotting scatter plots, in which the positions of instances can be defined by more than two attributes. It is called Linear Projection. To plot meaningful plots, all data needs to be either continuous or binary. We will feed it through the Continuize widget to ensure this.


On the left-hand side we can select the variables we want visualized. For the heart disease data, we will select all except the outcome, diameter narrowing. Linear projection does not use perpendicular axes, as the ordinary scatter plot does, but can instead use an arbitrary arrangement of axes. The initial arrangement is circular. To get a useful visualization, click the FreeViz button at the bottom. In the dialog that shows up, select Optimize separation. The visualization shows you, basically, which variables are typical for patients which do and do not have narrowed vessels.

Alternatively, in the FreeViz dialog you can choose the Projections tab and then Supervised principal component analysis.

For an easier explanation of how to interpret the visualization, here is a visualization of another data set in which the task is to predict the kind of animal based on its features. We have added some colors by clicking Show probabilities in the Settings tab.

Each dot represents an animal, with the colors corresponding to different types of animals. Positions are computed from the animal's features, e.g. those which are airborne are in the upper left (and those which are airborne and have teeth will be in the middle; tough luck, but there are probably not many such animals, which is also the reason why the features airborne and toothed are on opposite sides).


There are quite a few hypotheses (some true and some false) that we can make from this picture. For instance:
- Mammals are hairy and give milk; they are also cat-sized, although this is not that typical.
- Airborne animals are insects and birds.
- Birds have feathers.
- Having feathers and being airborne are related features.
- Animals either give milk or lay eggs; these two features are on opposite sides.
- Similarly, some animals are aquatic and have fins, and some have legs and breathe.
- Whether the animal is venomous or domestic tells us nothing about its kind.
- A backbone is typical for animals with fins. (?!)

Some of these hypotheses are partially true and the last one is clearly wrong. The problem is, obviously, that all features need to be placed somewhere on the two-dimensional plane, so they cannot all have a meaningful place. The widget and the corresponding optimization can be a powerful generator of hypotheses, but all these hypotheses need to be verified against common sense and additional data.


Parallel coordinates

In the parallel coordinates visualization, the axes which represent the features are put parallel to each other, and each data instance is represented by a line which goes through the points on the axes that correspond to the instance's values.

On the zoo data set, each line represents one animal. We see that mammals give milk and have hair (most pink lines go through the top of the two axes representing milk and hair). Feathery animals are exclusively birds (all lines through the top of the feathers axis are red). Some birds are airborne and some are not (the red lines split between the feathers and the airborne axes). Fish (green lines) do not give milk, have no hair, have a backbone, no feathers, cannot fly, and lay eggs (we can see this by following the green lines).

Hands-on exercise: Draw the following graph for the heart disease data. Select the same subset of features and arrange them in the same order. Set everything to make the visualization appear as below. Can you see the same age vs. maximal heart rate relation as we spotted in the scatter plot? Anything else which might seem interesting? Are females in the data really older than males? Is there any relation between the maximal heart rate and the type of chest pain?

Mosaic plot

In the Mosaic plot we split the data set into ever smaller subsets based on the values of discrete (or discretized) attributes. We can see the proportions of examples in each group and the class distribution. The figures below show consecutive splits of the zoo data set.


Hairy animals represent almost one half of the data set (the right box in the first picture is almost as wide as the left box). (Note: this data set does not represent an unbiased sample of the animal kingdom.) The "hairy" box is violet, therefore most hairy animals are mammals. The yellowish part represents the hairy insects. There is almost no violet in the left box, therefore there are almost no non-hairy mammals.

We split each of the two subgroups (non-hairy, hairy) by the number of legs. Most hairy animals have four legs, followed by those with two and those with six. There are only a few with no legs. Four- and two-legged hairy creatures are mammals, and the six-legged ones are insects.


A third of non-hairy animals have no legs. Half of them (in our data set only, not in nature!) are fish, followed by a few invertebrates, mammals and reptiles. One third of non-hairy animals have two legs; all of them are birds. Four-legged non-hairies are amphibians and reptiles, and the six-legged ones are mostly insects.

For the last split, we have split each group according to whether they are aquatic or not. All mammals without legs are aquatic, while the number of aquatics among the two- and four-legged ones is small (the vertical strips on the right-hand side are thin). There are no hairy aquatic insects.

The majority of non-hairy animals without legs are aquatic: fish, and some mammals and invertebrates. The non-aquatic ones are invertebrates and reptiles. As we discovered before, two-legged non-hairy animals are birds (red). One third of them are aquatic and two thirds are not. Four-legged non-hairy animals can be non-aquatic reptiles, while a two-thirds majority goes to the aquatic amphibians.

Mosaic plots can also show a priori distributions or expected distributions (in a manner similar to those of the chi-square contingency table, that is, the expected number of instances computed from the marginal distributions). The plot below splits the data on heart disease by age group and gender. The small strips on the left side of each box show the prior class distribution (the proportion of patients with and without narrowed vessels on the entire data set).

The occurrence of narrowed vessels is very small among young and old females, whereas in the middle age group the proportion is far larger (but still below the total data set average; the problem is so much more common among males!). For males, the proportion simply increases with age.


Hands-on exercise: How did we actually plot this? If we feed data with continuous variables into the Mosaic widget, it will automatically split the values of numeric attributes into intervals using one of the built-in discretization methods. Here we manually defined the age groups by using the Discretize widget. Try to do it yourself! (Hint: put Discretize between File and Mosaic. In Discretize, click Explore and set individual discretizations. Select Age and enter the interval thresholds into the Custom 1 box, or define the intervals on the graph below: left-clicking adds and right-clicking removes thresholds, and you can also drag thresholds around. Do not forget to click Commit in the widgets, or check Commit automatically.)

Sieve diagram

The sieve diagram, also called a parquet diagram, is a visualization of two-way contingency tables, similar to those used in chi-square. The rectangle representing the data set is split horizontally and vertically according to the proportions of values of two discrete attributes.

Note the difference between this and the mosaic plot. In the mosaic plot, the data is split into ever smaller subgroups: each age group is split separately into groups according to the distribution of genders within that specific age group. The size of the box in the Mosaic plot thus corresponds to the observed (actual) number of instances of each kind. In contrast, the Sieve diagram splits by the two attributes independently, so the size of the rectangle reflects the expected number of instances. This number can be compared to the observed number: the Sieve plot shows the comparison by grid density and color (blue represents a higher than expected number of instances and red represents a lower than expected number). The diagram above suggests that, for instance, younger males are more common than expected according to the prior distribution. We can check the numbers by moving the mouse cursor over the corresponding box.

Given that the younger age group includes 39.27% of the data and males represent 67.99%, we would expect 39.27% × 67.99% = 26.70% young males. The actual number is 28.71%, which is more than expected.
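The expected share the Sieve diagram compares against comes straight from the marginals; a two-line check of the numbers above:

    # marginal shares from the example above: younger patients and males
    p_young, p_male = 0.3927, 0.6799

    expected = p_young * p_male          # share of young males expected under independence
    print(f"{100 * expected:.2f} %")     # 26.70 %, compared with the observed 28.71 %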

This difference is, however, negligible. Here is a better example. In the figure below, the vertical split follows the distribution of genders (the male part is twice the height of the female part), and the horizontal split corresponds to the results of the thal test (one half of the instances have normal results and most others have a reversible defect).

The number of males with a normal thal test is 28.57% (86 patients), which is less than the expected 37.56% (113 patients). The number of women with a normal thal test is unexpectedly high, 80 instead of the expected 53. And so forth.


    "Intelligent"visualizations

    Thereare

    many

    different

    scatter

    plots

    we

    can

    draw for some data, and the number of

    differentmorecomplexvisualizations iseven

    higher. Orange features a set of tools for

    findinggooddatavisualizations.For instance,

    this plot (a variation of scatter plot called

    RadViz), which was found automatically by

    Orange, uses expressions of 7 genes out of

    2308,todistinguishbetweenthefourtypesof

    SRBCT

    cancer.

    In

    most

    widgets

    you

    will

    find

    buttons Optimize or VizRank (Visualization

    ranking), which will try out huge number of

    different visualizations and present those

    whichseemmostinteresting.


Data modeling (learning, clustering)

Modeling refers to describing the data in a more abstract form called, obviously, a model. The model (usually) does not include the data on the individual instances from which it is constructed, but rather describes the general properties of the entire data set (machine learning often refers to this as generalization).

We usually distinguish between unsupervised and supervised modeling. Roughly, imagine that data instances belong to some kind of groups and our task is to describe these groups. The computer's work can be supervised in the sense that the computer is told which instances belong to which group. We called these groups "classes"; we say that we know the class of each example. The computer needs to learn how to distinguish between the groups. Such knowledge is represented by a model that, if in a proper form, can teach us something (new, potentially useful, actionable) about the data. Moreover, we can use it to determine the group membership of new data instances. For example, if the data represents two groups of people, some sick and some not, the resulting model can classify new patients into one of these two groups.

In unsupervised learning, the instances are not split into classes, so the computer is not guided (supervised) in the same sense as above. The task is typically to find a suitable grouping of instances. In statistics, this is called clustering (cluster being a synonym for a group). We also often use visual methods which plot the data in such a way that groups become obvious to the human observer, who can then manually "draw the borders" between them or use the observed grouping for whatever he needs.

Data for unsupervised learning and for supervised learning


Predictive modeling / learning

The task of predictive modeling is to use the given data (training examples, learning examples) to construct models which can predict classes or, even better, class probabilities for new, unseen instances. In the context of data mining, we would like these models to provide some insight into the data: the models should be interpretable; they should show patterns in the data. The models should be able to explain their predictions in terms of the features: the patient is probably sick, and the reason why we think so is that he has these and these symptoms.

Classification trees

Induction of classification trees (also called decision trees) is a popular traditional machine learning method, which produces models like the one in the figure below.

While this is visually pleasing, the tree as presented by the widget below may be easier to read.

We have used three widgets related to trees: Classification Tree gets the data and outputs a tree (this is the first widget we have met so far which does not output data instances but something else, a model!). Classification Tree Viewer and Classification Tree Graph get trees on their inputs and show them. They also provide output: when you click on tree nodes, they output all the training examples which belong to the node. You will find all these widgets in the category Classify.

Say we get a new patient. The tree suggests that we should first perform the thal test (whatever that means). Let's say that the results are normal. For such patients we then ask about the chest pain; suppose that it's typical anginal. In that case, we check the age of the patient. Say he is 45, which is below 56.5. Therefore we predict a 0.04 probability of having the disease.

The algorithm for constructing such trees from data works like this. First, it checks all available variables and finds the one which splits the data into subsets which are as distinct as possible: in the ideal case, the attribute would perfectly distinguish between the classes; in practice, it tries to get as close as possible. For our data, it decided for the thal test. Then it repeats the search for the best attribute in each of the two subgroups. Incidentally, the same feature (the type of chest pain) seems the best attribute in both. The process then continues for all subgroups until certain stopping criteria are met: the number of data instances in a group is too small to split them further, the subgroups are pure enough (ideally, all instances belong to the same class, so there is no need to split them any further), there are no more useful features to split by, the tree is too large, etc.
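For readers who want to reproduce the idea outside Orange, here is a sketch with scikit-learn's decision tree on synthetic data (the attribute names are hypothetical, not the actual heart disease variables):

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier, export_text

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 3))                       # three hypothetical attributes
    y = ((X[:, 0] > 0) & (X[:, 1] > 0)).astype(int)     # class depends on two of them

    tree = DecisionTreeClassifier(min_samples_leaf=10)  # stopping criterion: leaf size
    tree.fit(X, y)
    print(export_text(tree, feature_names=["age", "max_heart_rate", "rest_SBP"]))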

Trees are great for data mining, because
- there's no sophisticated mathematics behind them, so they are conceptually easy to understand,
- a tree can be printed out, analyzed and easily used in real-life situations,
- in data mining tools, we can play with them interactively; in Orange, we can connect the tree to other widgets, click the nodes of the tree to observe the subgroups and so on.

They are, however, not so great because
- while they may perform relatively well in terms of the proportion of correct predictions, they perform poorly in terms of probability estimates;
- they are "unstable": a small change in the data set, like removing a few instances, can result in a completely different tree. The same goes for changing the parameters of the learning algorithm. Trees give one perspective of the data, one possible model, not a special, unique, distinguished model;
- trees need to be kept small: a computer can easily induce a tree with a few hundred leaves, yet such trees are difficult to understand and to use, and the number of instances in individual leaves is then so small that we cannot "trust" them. On the other hand, small trees refer to only a small number of attributes, e.g. to only a small number of medical symptoms related to the disease.


The latter point is really important. Trees are good for conceptually simple problems, where a few features are enough to classify the instances. They are a really bad choice for problems in which the prediction must be made by observing and summing up the evidence from a larger number of features. A usefully small tree cannot contain more than, say, ten features. If making the prediction requires observing more variables, forget about trees.

Hands-on exercise: Construct a tree like the one above, then plot the same data in a Mosaic plot in which you select the attributes from the root of the tree and the first branches of the tree.

Classification rules

Classification rules are essentially similar to trees, except that instead of having a hierarchy of choices, they are described by lists of conditions, like those below.

The algorithm for constructing such rules starts with an empty rule (no conditions; it predicts the same class for all examples). It then adds conditions, which specialize the rule, fitting it to a particular group of instances, until it decides that it's enough (the stopping criteria are based on the purity of the group, the number of instances covered by the rule and some statistical tests). It prints out the rule, removes the data instances covered by the rule and repeats the process with the remaining instances.
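A toy sketch of this "print a rule, remove the covered instances, repeat" loop. It is not CN2 itself: real rule learners grow conjunctions of conditions and use statistical stopping criteria, while here a rule is a single condition chosen by purity on a made-up zoo-like table:

    data = [
        ({"hair": "yes", "legs": 4}, "mammal"),
        ({"hair": "yes", "legs": 2}, "mammal"),
        ({"hair": "no",  "legs": 2}, "bird"),
        ({"hair": "no",  "legs": 2}, "bird"),
        ({"hair": "no",  "legs": 0}, "fish"),
    ]

    def best_rule(examples):
        # try every single condition (attribute, value) and keep the purest one
        best, best_purity = None, -1.0
        for attrs, _ in examples:
            for a, v in attrs.items():
                covered = [c for x, c in examples if x[a] == v]
                majority = max(set(covered), key=covered.count)
                purity = covered.count(majority) / len(covered)
                if purity > best_purity:
                    best, best_purity = (a, v, majority), purity
        return best

    remaining = list(data)
    while remaining:
        a, v, cls = best_rule(remaining)
        print(f"IF {a} = {v} THEN class = {cls}")
        # remove the covered instances and continue with the rest
        remaining = [(x, c) for x, c in remaining if x[a] != v]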


Hands-on exercise: Construct the schema below. Examples coming from the CN2 Rules Viewer should be connected to the Example Subset input of the Scatter plot (the simplest way to achieve this is to first draw the connection between File and Scatter plot and then between the CN2 Rules Viewer and Scatter plot). Now click on individual rules in the CN2 Rules Viewer and the Scatter plot will mark the training instances covered by the rule.

Classification rules have similar properties as trees, except that their shortcomings are somewhat less acute: they can handle more attributes and are better at predicting probabilities. They may be especially applicable to problems with multiple classes, as in the following examples from our beloved zoo data set.


Naïve Bayesian Classifier and Logistic Regression

The naïve Bayesian classifier computes conditional class probabilities, with feature values as conditions. For instance, it computes the probability for having the disease for females and for males, the probability for having the disease for younger and older patients, for those with non-anginal pain, typical anginal pain and atypical anginal pain, and so on. To predict the overall probability for having the disease, it "sums up" the conditional probabilities corresponding to all data about the patient.

The model can be simplified into summing "points" plotted in a nomogram, like the one shown below. Suppose we have a patient whose ST by exercise equals 2, his heart rate goes up to 200, and he feels asymptomatic chest pain. Other values are unknown. The naïve Bayesian classifier would give him +1 point for the ST by exercise, -3.75 for the heart rate and +1 for the chest pain; these mappings can be seen by "projecting" the positions of the points corresponding to the values onto the axis above the nomogram. Altogether, this gives +1 - 3.75 + 1 = -1.75 points. The scale below transforms the points into actual probabilities: -1.75 points correspond to around a 15% chance of having the disease.
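For the curious, such points behave like log odds, so the bottom scale of the nomogram is essentially the logistic function. A minimal sketch of that conversion follows; it ignores the contribution of the class prior, which the actual nomogram folds into its scale, so the resulting 15% should only be read as an approximate check of the example above.

```python
import math

def points_to_probability(points, prior=0.5):
    """Convert summed nomogram points (log odds contributions) into a
    probability, adding the log odds of the assumed prior."""
    log_odds = math.log(prior / (1 - prior)) + sum(points)
    return 1 / (1 + math.exp(-log_odds))

# the patient from the text: +1 (ST by exercise), -3.75 (heart rate), +1 (chest pain)
p = points_to_probability([+1, -3.75, +1])
print(f"{p:.0%}")   # roughly 15% with a balanced prior
```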

There is a similar model from statistics called logistic regression. On the outside, the two models look the same: they can both be visualized using a nomogram. Mathematically, they differ in the way they compute the number of points given for each feature: the naïve Bayesian classifier observes each feature separately, while logistic regression tries to optimize the whole model at once. We shall not go into the mathematics here. We will only describe the practical consequences of the difference between the two approaches.


These models are just great.

- They are very intuitive and easy to explain; they can be printed out and used in the physician's office.
- They give surprisingly accurate predictions.
- The predictions can be easily explained: the strongest argument that the person from our example is healthy is his extremely high heart rate. The other two pieces of data we have, the ST by exercise and the type of chest pain, both actually oppose the prediction that he is healthy, yet their influence is smaller than that of the heart rate. The inside view into how the model makes predictions allows the expert to decide whether they make sense or whether they should be overruled and ignored.

- The model can show us which features are more important than others in general, without a particular given instance. ST by exercise and maximal heart rate are, in general, the two most informative features, since they have the longest axes: they can provide the strongest evidence (the greatest number of points) for or against the disease. The chest pain and the thal test are the least important (among those shown in the pictures). A feature can also have zero influence, when all its values are on the vertical line (0 points). A feature can sometimes even speak only in favor of or only against the target class; there may be tests which can, if their results come out positive, tell that the patient is sick, while a negative result may mean nothing.

What about their shortcomings? For the naïve Bayesian classifier, the shortcomings come from observing (that is, computing the strength of) each feature separately from the others.

- In the case of diseases whose prediction requires observing two interacting features (for a naïve example, a disease which attacks young men and old women), it cannot "combine" the two features. Such diseases (and, in general, such features) are, however, rare in practice.

- Another, more common problem are correlated features. Observing each feature separately, the naïve Bayesian classifier can count the same evidence twice. Consider a childhood disease whose probability decreases with the child's age; there are no other factors related to the disease. If the data we collected contains the children's age and body height, the naïve Bayesian classifier will (wrongly?) consider height an important factor, although this is only because of its relation with age. Counting the same piece of evidence twice causes the naïve Bayesian classifier to "overshoot" the probability predictions.

The latter is not much of a problem: the classifier is still pretty accurate. Yet the difference is important for the comparison with logistic regression: logistic regression does not have this problem. The class probabilities it predicts are typically very accurate. And the price?

- Logistic regression cannot be used to evaluate the importance of individual features. In the childhood disease example from above, logistic regression might say that the disease probability is related to age alone and not to height, which would be correct. On the other hand, it can say that the only important attribute is body height, while age cannot offer any new evidence, which is misleading. Usually it does something in between.


- Logistic regression cannot handle unknown values. All data has to be known to classify a patient. Not so for the naïve Bayesian classifier, to which we can just add new evidence when we get it.

Both models, especially the naïve Bayesian classifier, are good at summing up the evidence from a large number of features. They can learn from a single feature, but they can also sum up small pieces of evidence from thousands of features. For instance, naïve Bayesian classifiers are often used for spam filtering, in which each word from the message (and all other properties of the message) contributes some evidence for the mail being spam or ham.

While we avoided any theory here, there is really a lot of it behind these models (unlike classification trees, which are a simple machine learning hack). Logistic regression is a modification of linear regression which uses the logistic function for modeling probabilities. The "points" assigned to features are actually log odds ratios, which are a more practical alternative to conditional probabilities. The naïve Bayesian classifier represents the simplest case of a Bayesian network, a model which can incorporate probabilistic causal models. Moreover, there is a whole branch of statistics, called Bayesian statistics, which represents an alternative to inferential statistics (testing of hypotheses etc.) and is growing fast in popularity and usefulness.

Other methods

There are many other similar methods. Here we only mentioned the simple ones, yet, in my experience, the naïve Bayesian classifier can easily compete with (and lose by only a small difference to) the most powerful and complex machine learning algorithms like, for instance, support vector machines. If you want models which are simple to use and understand, you should use logistic regression and naïve Bayesian classifiers. The possible few percent loss in performance will be more than paid off by your ability to explain and, if needed, overrule the predictions they make.

Unsupervised modeling

    Distances

Unsupervised modeling is often based on a notion of distance or difference between data points. In order to do any grouping based on similarity, we need to somehow define the similarity (or its opposite: dissimilarity, difference).

When dealing with ordinary data, the most common definition is based on the Euclidean distance. If a and b are two data instances described by values (a1, a2, a3) and (b1, b2, b3), we can define the difference between them as (a1 - b1)² + (a2 - b2)² + (a3 - b3)² (we can take a square root of that, but it won't change much). We only need to be careful to put all numbers on the same scale: a difference of 0.1 m in body height may mean much more than a difference of 0.1 kg in body weight. The simplest (although not necessarily the best) way of putting the data on the same scale is to normalize all variables by subtracting the average and dividing by the standard deviation.
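A minimal NumPy sketch of this normalization and distance computation; the three made-up instances (height in m, weight in kg) only stand in for real patients.

```python
import numpy as np

# three hypothetical patients described by (height in m, weight in kg)
data = np.array([[1.75, 70.0],
                 [1.85, 71.0],
                 [1.60, 95.0]])

# put all columns on the same scale: subtract the mean, divide by the standard deviation
normalized = (data - data.mean(axis=0)) / data.std(axis=0)

# Euclidean distance between the first two (normalized) instances
a, b = normalized[0], normalized[1]
distance = np.sqrt(np.sum((a - b) ** 2))
print(distance)
```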


For discrete data, we usually say that when two values are different, the difference between them is 1.

Sometimes the distance is not so trivial, especially when the data instances are not described by features/attributes/variables. Say that the objects of my study are lecturers at Kyoto University (including occasional speakers, like me) and we would want to cluster them based on their scientific areas. How do we define a meaningful difference between them? This depends, of course, on the purpose of the grouping. Say that I would like to group them by their fields of expertise (and that I don't know the departments at which they actually work). The problem is actually easier than it looks: we have access to public databases of scientific papers. Say that Dr. Aoki and I have written 3 papers together, while altogether (that is, Dr. Aoki, me or both of us in co-authorship) we wrote 36 papers (I am making this up!). We can then say that the similarity between us is 3/36 or 0.08. The similarity between Dr. Nakayama and me is then 0 (we have no common papers, so the similarity is zero divided by something). Two professors who wrote all their papers together would have a similarity of 1. To compute the difference, we simply subtract the similarity from 1: the difference between Dr. Aoki and me would be 1 - 0.08 = 0.92. This definition of similarity is known as the Jaccard index (http://en.wikipedia.org/wiki/Jaccard_index).
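A few lines of Python computing the same index on sets of paper identifiers; the publication lists here are entirely invented and only illustrate the arithmetic.

```python
def jaccard_similarity(papers_a, papers_b):
    """Jaccard index: shared papers divided by all papers written by either author."""
    common = len(papers_a & papers_b)
    total = len(papers_a | papers_b)
    return common / total if total else 0.0

# made-up publication lists
demsar = {"p1", "p2", "p3", "p4", "p5"}
aoki = {"p3", "p4", "p5", "p6"}
nakayama = {"p7", "p8"}

print(jaccard_similarity(demsar, aoki))          # similarity
print(1 - jaccard_similarity(demsar, aoki))      # difference = 1 - similarity
print(1 - jaccard_similarity(demsar, nakayama))  # no common papers -> difference 1
```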

The purpose of this discussion is not to present the Jaccard index but rather to demonstrate that there may be cases in which the definition of distance/difference/similarity requires a bit of creativity. (See an example on the web page: a "map" of scientific journals from computer science, where the similarity between a pair of journals is based on whether the same people contribute papers to both of them.)

Hierarchical clustering

We will show how this works on animals. All data in the zoo data set is discrete, so the dissimilarity (distance) between two animals equals the number of features in which they differ. The schema in Orange Canvas looks like this:

The Example Distance widget gets a set of training instances and computes the distances between them. A couple of definitions are available, but all are attribute-based in the sense that they somehow sum up the differences between values of individual attributes.

Hierarchical clustering of this data is shown in the left picture, and the right picture shows a part of it. (I haven't constructed the data and have no idea why girls were included as animals similar to bears and dolphins.)

The figure is called a dendrogram. Probably not much explanation is needed with regard to what it means and how it needs to be read: it merges the animals into ever larger groups or, read from left to right, splits them into ever smaller groups.

Clustering is intended to be informative by itself. If I came from another planet and had no idea about the Earthly animals, this could help me organize them into a meaningful hierarchy.


We can play with the clustering widget by drawing a vertical threshold and "cutting" the clustering at a certain point, defining a certain number of distinct groups.

For the curious, this is how clustering actually works. The method goes from the bottom up (or, in this visualization, from right to left). At each step, it merges the two most similar clusters into a common cluster. It starts so that each animal (object, in general) has its own cluster, and finishes the merging when there is only a single cluster left.

The only detail to work out is how we define the distances between clusters if we know (that is, have defined, using Euclidean distances or the Jaccard index or whatever) the distance between single objects. The first choice is obvious: the distance between two clusters can be the average distance between all objects in one cluster and the objects in the other. This is called average linkage. The distance can also be defined as the distance between the closest objects belonging to the two clusters, the so-called minimal linkage. As you can guess, maximal linkage is the distance between the most distant pair of objects belonging to the two clusters. And nobody can guess what the fourth option, Ward linkage, is. Neither does it matter: remember that the term linkage refers to the way clusters are compared to each other. When constructing the clustering, try all of them and find the one which gives the nicest picture.

The length of the horizontal lines in the dendrogram corresponds to the distance between the clusters being merged. The objects on the lowest level are similar to each other; later the distances usually grow. The optimal number of clusters is typically one at which we notice a sharp jump in the distances (line lengths).
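If you ever want to reproduce this outside Orange, SciPy offers the same linkages and dendrograms. A minimal sketch on made-up data (the random matrix only stands in for a real feature table such as the zoo data, and scipy and matplotlib are assumed to be installed):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
data = rng.random((20, 5))                  # 20 made-up instances, 5 features

merges = linkage(data, method="average")    # try "single", "complete", "ward" too
dendrogram(merges)
plt.show()

# "cut" the dendrogram into 3 clusters, like dragging the threshold in Orange
labels = fcluster(merges, t=3, criterion="maxclust")
print(labels)
```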

We can also turn the data around and compute distances between variables instead of between animals. We can, for instance, compute the Pearson correlation coefficient (which is actually closely related to the Euclidean distance).


The results may seem a bit strange at first: having hair is closely related to giving milk, that's OK. But giving milk and laying eggs?! Being aquatic, having fins and breathing?! It makes sense, though: laying eggs and giving milk are strongly correlated, though the correlation is in this case negative. Knowing that the animal is aquatic tells us something about it having fins, but also something about whether it can breathe or not. We could, of course, define the distance differently and get a clustering of features which actually co-occur (positive correlation). As always in unsupervised modeling, it's all a question of how we define the distance.

Hands-on Exercise
Load the data set called Iris.tab, compute Euclidean distances, construct a hierarchical clustering and connect the result to the Scatter plot (the middle part of the schema below).

For the clustering widget, see the recommended settings on the right. In particular, check Show cutoff line and Append cluster IDs, and set the place to Class attribute.

The best pair of attributes to observe in the Scatter plot is petal length and petal width.

Now you can put a cutoff line into the dendrogram; the dendrogram will be split into discrete clusters shown by colors and the widget will output instances classified according to their cluster membership, that is, the "class" of an instance will correspond to the cluster it belongs to. The Scatter plot will color the instances according to this cluster-based class instead of the original class. Play with the widget by setting different thresholds and observing the results in the Scatter plot. (If the widget does not output examples, uncheck Commit on change and check it again, or


leave it unchecked and push Commit when you change the threshold; sorry for the bug.)

You can also attach the Distribution widget to the clustering. In this case, change the Place in the Hierarchical clustering widget from Class attribute to Attribute. The ID of the cluster will now be added to the instances as an ordinary attribute, not the class. By observing the distribution of iris types (the original class) within each cluster, we can see how well the constructed groups match the classes. In other words, this shows how well unsupervised learning can function in problems for which we could also use supervised learning.

K-means clustering

K-means clustering splits the data into k clusters, where the parameter k is set manually. It always uses the Euclidean distance (or a similar measure, but let's not go into that). It cannot offer any fancy dendrograms. It simply splits the data into k groups and that's it.

The advantage of k-means clustering is that its clusters are often "better defined" (purely subjectively, intuitively speaking) than those we get by hierarchical clustering. In a sense, if two objects in k-means belong to the same cluster, this "means more" than if two objects are in the same cluster when using hierarchical clustering.
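Outside Orange, the same idea takes a few lines with scikit-learn; this sketch assumes scikit-learn is installed and uses its bundled iris loader as a stand-in for Iris.tab. It also already hints at the comparison with the true classes done in the exercise below.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

# split the data into k = 3 groups; the labels are arbitrary cluster IDs, not classes
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# compare cluster membership with the true iris species
for cluster in range(3):
    counts = [int((labels == cluster)[y == species].sum()) for species in range(3)]
    print("cluster", cluster, "contains", counts, "instances of the three species")
```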

Hands-on Exercise
Use the below schema to analyze how well the clusters found by k-means clustering match the actual classes, like we did with hierarchical clustering before.

Note that the input to k-Means is not a matrix of distances between instances, but the instances (examples) themselves.

The k-Means widget can propose the number of clusters, or we can set it manually. Try setting it to 3 and see how it works. Then let the widget determine the optimal number of clusters itself. If it proposes two, check the scatter plots on the first page of this text. Yes, that's it. See one of the basic problems of clustering (and unsupervised learning in general)?

Multidimensional scaling

Clustering may sometimes seem too rigid: an object belongs to one and only one cluster. Especially in hierarchical clustering (but also in k-means) it can often happen that an object is actually more similar to objects from another cluster than to objects from its own cluster. (This seems wrong, but if you think about it, you'll discover it's true and unavoidable.)


Enter multidimensional scaling. You can imagine it like this: you would want to have a map of Honshu cities, but all you have are the distances between all pairs of cities. Can you use them to reconstruct the map? Sure you can. The resulting map may have a wrong orientation (the data does not tell you what's north and what's west), and it can even be mirrored, but you can reconstruct a map of Honshu which is correct in the sense that it puts Osaka, Nara and Kyoto close to each other and halfway between Tokyo and Hiroshima.

You can do the same with other data. We can construct a map of animals in which similar animals (where similarity is computed in any way we want, say using Euclidean distances, as before) are close to each other, and animals which share properties of two animal groups are put in between them. As an Orange-specific twist, we can organize the visualization by connecting the most similar pairs.

See the image below. (I must again apologize for the rude and politically incorrect connection between girls and gorillas; I repeat once again that this is a data set from the internet which is commonly used to demonstrate machine learning methods. I have nothing to do with it.)

Note that this method does not define any clusters, like hierarchical clustering or k-means do. Yet it makes


the clusters obvious to the user, and this is often enough.

Multidimensional scaling uses an optimization method which can get "stuck" in a certain placement. When using it, you should repeat the layout optimization process multiple times and select the result which you like the most, or even keep multiple maps if they, for instance, show the data from different interesting perspectives.
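A sketch of the same idea with scikit-learn, starting (as in the Honshu example) from nothing but a matrix of pairwise distances; the "cities" here are just random points invented to produce such a matrix, and scikit-learn is assumed to be installed.

```python
import numpy as np
from sklearn.manifold import MDS
from sklearn.metrics import pairwise_distances

# pretend we only know the pairwise distances, not the original "map"
rng = np.random.default_rng(0)
true_positions = rng.random((10, 2))          # 10 made-up "cities"
distances = pairwise_distances(true_positions)

# reconstruct a 2D layout from the distances alone; repeat with different
# random_state values and keep the layout you like best
mds = MDS(n_components=2, dissimilarity="precomputed", random_state=1)
layout = mds.fit_transform(distances)
print(layout[:3])
```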

Visualization of networks

Networks are becoming a very popular model in genetics and also elsewhere. They consist of vertices (nodes), which represent objects and are connected by edges. The network can be constructed in different ways. We can construct it from distances like those in clustering: we can connect two objects if they are sufficiently close to each other. A connection can also represent the existence of a relation. For instance, we can connect pairs of proteins which interact with each other. Two scientists are connected if they co-authored at least one paper. Two symptoms can be connected if they often co-occur (e.g. sore throat and fever). Two airports are connected if there is a direct flight between them. Two hospitals are connected if patients are transferred between them.

Analysis of networks is somewhat similar to observing the multidimensional scaling map (moreover, the technique used for optimizing the layout is similar, too). We can find "islands" of mutually interrelated objects (in graph theory, these are called cliques): groups of collaborating scientists, collaborating hospitals, related symptoms...

We can do more than that, though: we can select a bunch of objects based on their properties and then tell the tool we use (say the Network Explorer widget in Orange) to select and show only their vicinity, e.g. the network up to, say, their second-degree neighbors (nodes which are at most two edges away from the selected nodes). In genetics, we can select a subset of genes for which it is known that they play a role in a certain hereditary disease and then observe the genes related to the genes related to these genes, in the hope that some of these may be related to the disease, too.

We can find the hubs of the network: the nodes which have a significantly larger number of connections than other nodes and thus hold the network together. In genetics, these may be some general-purpose genes, e.g. genes corresponding to some common proteins. In social networks, these may be the most influential members of the society. In another context, these nodes may represent Al Qaeda leaders...

A small collection of interesting networks is published on the website.
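If you want to experiment with these ideas in a script rather than in Orange, the networkx library covers both of them. A small sketch on a made-up co-authorship graph (all node names are invented):

```python
import networkx as nx

# made-up co-authorship network: an edge means "wrote at least one paper together"
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"),
         ("D", "E"), ("E", "F"), ("F", "G")]
graph = nx.Graph(edges)

# hubs: nodes with the largest number of connections
degrees = sorted(graph.degree, key=lambda item: item[1], reverse=True)
print("hubs:", degrees[:2])

# the "vicinity" of node A up to its second-degree neighbors
vicinity = nx.ego_graph(graph, "A", radius=2)
print("within two edges of A:", sorted(vicinity.nodes))
```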


Evaluation of model performance

Last time, we discussed supervised and unsupervised modeling, which give us various models describing the data. Today we will learn how to measure the quality of the obtained models, either to choose the optimal model or to check the quality of the model and decide whether it is good enough to be used.

In the context of unsupervised modeling, like clustering, the quality of models is difficult to evaluate. In most cases we can only evaluate them subjectively, by checking whether they make sense, whether they conform to our expectations and look useful for whatever we want to use them for. Supervised learning, however, gives models which make predictions, so their quality can be measured objectively by observing the quality of the predictions. In today's lecture we'll learn how to prepare the test data and discuss different measures of the quality of models.

Sampling the data for testing

There is a very simple rule: never test the model on the data which was used for its construction.

The purpose of assessing the performance of models is to determine how well the model will behave if used in practice. Since the model will be used on new, yet unseen data instances (patients), it also has to be tested on data which it has not seen before. Otherwise the best model would simply be the one which remembers all the data and uses that knowledge when making predictions (or, better, "postdictions").

For this purpose, we use resampling techniques in which only a (randomly) chosen subset of the data is used for constructing the model and the remaining part is used to compute its quality. This is repeated multiple times and in the end we report the average quality of the model over all trials.

The most common technique used is called cross-validation. It splits the data into (usually) ten subsets, called folds. In the first step, it reserves the first fold for testing: it uses the data from folds 2 to 10 for constructing the model and then tests it on the first fold. Then it reserves the second fold for testing; folds 1 and 3 to 10 are used for constructing the model and fold 2 for testing it. And so on; at each step one fold is held out for testing and the others are used for learning.

Another sampling method is called leave-one-out or jackknife. Instead of leaving out one fold (10% of the examples, when we do a 10-fold cross-validation), it leaves out only one example at a time. It constructs a model from all examples except this one, and tests the prediction on this example. This gives very reliable results, but takes a lot of time: if the data has 1000 instances, it will construct 1000 models.

In repeated random sampling, we choose a given number of instances (say 70%) for constructing the model, use the remaining 30% for testing, and repeat this whole procedure a certain number of times.

Besides these three, which are also available in Orange, there exist a number of other resampling methods, which are nicely described here: http://en.wikipedia.org/wiki/Resampling_(statistics).
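For scripting outside Orange, all three sampling schemes are also available in scikit-learn. A sketch of ten-fold cross-validation of a naïve Bayesian classifier on the iris data; scikit-learn, its GaussianNB variant of the naïve Bayesian classifier and its bundled iris data set are all assumptions here, standing in for Orange and the heart disease data.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, LeaveOneOut, ShuffleSplit
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
model = GaussianNB()

# ten-fold cross-validation: each fold is held out for testing exactly once
accuracies = []
for train_idx, test_idx in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
    model.fit(X[train_idx], y[train_idx])
    accuracies.append(model.score(X[test_idx], y[test_idx]))
print("cross-validation accuracy:", sum(accuracies) / len(accuracies))

# the other two schemes only change the splitter:
#   LeaveOneOut()                             -> leave-one-out / jackknife
#   ShuffleSplit(n_splits=10, test_size=0.3)  -> repeated random sampling
```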


A small problem with all these methods is that they give the average performance of the constructed models and not of a single model. Say that we decide that the best model for our data is the naïve Bayesian classifier, which has an AUC of 0.90 (we will soon see what AUC is, don't worry), based on the average over ten-fold cross-validation. Having decided, we use the entire data set to construct one single naïve Bayesian model. We, however, do not know the expected AUC of this model. We only know that naïve Bayesian models for this data have an average AUC of 0.90, while this model may be better or worse.

We usually do not worry about this. First, the final model was constructed from a bit larger data set: from all 10 folds and not just 9 of them. More data usually means better models. Second, most models are quite stable (classification trees excluded!). All ten naïve Bayesian models from the cross-validation were probably quite similar to each other and to this final model.

There is, though, one bigger problem. Constructing the model usually involves a lot of manual work: everything which we covered in our past lectures. When we get the data, we will use various visualizations and other methods to select the features which we want to use in our models, we may construct new features from the existing ones, and we will fit various thresholds and parameters of the learning algorithms (e.g. we will set the desired size of the trees). Although we may have used some resampling technique in the end, the entire data set was used when preparing for the modeling, which is a part of modeling, too, and should not be done on the data on which the model is later tested.

This is the reason why we stressed in the first and in the second lecture that a part of the data needs to be left aside. As a rule of thumb, we usually randomly select 30% of the data and put it in a separate file. Then we go through everything we (hopefully) learned during these lectures using the remaining 70% of the data. When we decide on the final model, we take the held-out data and use it to test the model. These results are what we report as the true expected performance of our model.

Measures of model performance

There are multiple perspectives from which we can judge models with regard to the quality of their predictions.

Scalar measures of quality

Classification accuracy (CA) is the easiest and the most intuitive measure of quality. It measures the proportion of correct answers the model gives: if it makes the correct prediction for 85 cases out of 100, we say the classification accuracy is 85%.

Let us measure the accuracy of the naïve Bayesian classifier, logistic regression and classification trees on the heart disease data set. The central widget in today's lecture is called Test Learners. On its input it requires a data set (we will provide one from the File widget) and learning algorithms. The schema below


looks a bit unusual: last week, we gave data to the learning algorithms so that they could construct models from it. How do they construct models now, without data?

The widget for the naïve Bayesian classifier has two outputs. When given data (as last week), it outputs a model (which we fed to, for instance, the widget showing nomograms). But even when not given any data, it outputs a learning algorithm. A learning algorithm is not a model, but a procedure for building the model.

So, what Test Learners gets in this schema are not data and three models but data and three procedures for making models. Test Learners splits the data in different ways (cross-validation, leave-one-out, ...), repeatedly runs the provided procedures on one subset and tests the resulting models on the other.

Hands-on exercise
Construct the above schema.

Classification accuracy does not distinguish between the "meanings" of classes. Not all mistakes are equal. For instance, failing to detect a disease can cause the death of a patient, while diagnosing a healthy person as sick is usually less fateful. To distinguish between different types of wrong predictions, we will select one of the classes and call it the target class or the positive class. In medicine, positive will usually mean having a disease, a condition or something like that. Now we can distinguish between four cases of correct and wrong predictions:


- A true positive (TP) denotes a data instance (e.g. a patient) for whom the model made a positive prediction and the prediction is correct (so the patient is truly positive).
- A false positive (FP) denotes an instance which was falsely classified as positive (e.g. we predict the patient to have the disease, though he does not).
- True negative (TN) instances are those for which the model gave a negative prediction and the prediction is correct.
- A false negative (FN) represents an instance which is falsely classified as negative (the patient is sick but the model predicted him to be healthy).

                            predicted value
                            positive (P')          negative (N')
actual value  positive (P)  true positive (TP)     false negative (FN)
              negative (N)  false positive (FP)    true negative (TN)

Besides TP, FN, FP and TN, we will also use P and N to denote the number of positive and negative instances, and P' and N' for the number of all positive and negative predictions. Obviously, P = TP + FN, N = TN + FP, P' = TP + FP and N' = TN + FN. A matrix like this one is a special case of a confusion matrix. Orange shows it in a widget called Confusion matrix, which we connect to the Test Learners widget.
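A small sketch of these quantities computed directly from two lists of labels; the labels below are made up, and 1 stands for the positive class.

```python
actual    = [1, 1, 1, 0, 0, 0, 1, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 1, 0, 0, 1]

TP = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
FN = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
FP = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
TN = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)

P, N = TP + FN, TN + FP            # actual positives and negatives
CA = (TP + TN) / (P + N)           # classification accuracy
print(TP, FN, FP, TN, CA)
```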

Hands-on exercise
To make things interesting, we will also add a Data table widget and a Scatter plot widget, to which we give examples from the File widget and from the Confusion matrix. The latter should represent a subset of data (the easiest way to