Weka

DESCRIPTION

Weka: Practical Use. Antonín Pavelka. Weka, an introduction: a system for data analysis and predictive modeling from the University of Waikato, New Zealand. 1993: TCL/TK, C, Makefiles. 1997: decision to move to pure Java. Integrated into RapidMiner and Pentaho (a business intelligence system).

  • Weka: Practical Use, Antonín Pavelka

    Weka

  • Weka: Introduction

    A system for data analysis and predictive modeling
    University of Waikato, New Zealand
    1993: TCL/TK, C, Makefiles
    1997: decision to move to pure Java
    Integrated into: RapidMiner, Pentaho (a business intelligence system)
    GNU General Public License

  • Controlling Weka

    Graphical interface
      Explorer: individual tasks, one click at a time
      Experimenter: systematic comparison
      Knowledge Flow: tasks arranged as a flow
    Command line
    Java API

  • Demo: graphical interface ...

  • ... command line ...

    java -classpath weka.jar weka.classifiers.bayes.NaiveBayes -t data/iris.arff

  • ... Java API

  • 1. Attribute-Relation File Format (ARFF)

    ARFF file

    @relation spambase
    % spam, non-spam
    @attribute word_freq_make real
    @attribute 'char_freq_#' real
    @attribute {spam, ham}
    @data
    0,0.64,0.64,spam
    0.21,0.28,0.5,spam
    0.06,0,0.71,ham

    Missing values

    4.4,?,1.5,?,Tolkien

    Strings

    @attribute LCC string
    @attribute LCSH string
    @data
    AG5, 'Encyclopedias and dictionaries.;Twentieth century.'

    Time

    @ATTRIBUTE timestamp DATE "yyyy-MM-dd HH:mm:ss"
    @DATA
    "2001-04-03 12:12:12"
    "2001-05-03 12:59:55"

    Sparse format (dense row on the left, its sparse equivalent on the right)

    0, X, 0, Y, "class A"    {1 X, 3 Y, 4 "class A"}
    0, 0, W, 0, "class B"    {2 W, 4 "class B"}
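
The sparse rows on the ARFF slide pair a 0-based attribute index with a value and leave out every position that holds 0. The following is a minimal Python sketch of that expansion, not Weka's own parser. The function name `sparse_to_dense` and the choice to return values as strings are assumptions made for illustration.

```python
import re

# One sparse entry: a 0-based index, whitespace, then a quoted or bare value.
_ENTRY = re.compile(r"""(\d+)\s+("[^"]*"|'[^']*'|[^,]+)""")


def sparse_to_dense(row, num_attributes, default="0"):
    """Expand one sparse ARFF data row into a dense list of string values."""
    body = row.strip()
    if not (body.startswith("{") and body.endswith("}")):
        raise ValueError("a sparse row must be wrapped in braces")
    dense = [default] * num_attributes  # unlisted positions hold 0
    for match in _ENTRY.finditer(body[1:-1]):
        index = int(match.group(1))
        value = match.group(2).strip()
        if len(value) >= 2 and value[0] in "\"'" and value[-1] == value[0]:
            value = value[1:-1]  # drop the surrounding quotes
        if index >= num_attributes:
            raise ValueError(f"attribute index {index} out of range")
        dense[index] = value
    return dense


print(sparse_to_dense('{1 X, 3 Y, 4 "class A"}', 5))
print(sparse_to_dense('{2 W, 4 "class B"}', 5))
```

Both calls reproduce the dense rows shown next to the sparse ones on the slide.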

  • 2. Data Preprocessing

  • Histograms

    A useful numeric attribute
    A suspicious numeric attribute
    A binary target attribute
    A 20-valued attribute

  • Filters

    Remove -V -R 1-5,8 (-V = invert the selection, i.e. keep only these attributes)
    Discretize
      some algorithms do not work with numbers
      speed-up
      sometimes also higher accuracy
    Resampling
    Filling in missing attribute values, removing missing values
    Obfuscator
    Principal Component Analysis, Partial Least Squares
    AttributeSelection
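
The effect of `Remove -V -R 1-5,8` can be illustrated on a single data row. This is a rough Python sketch that assumes the range list is 1-based and contains only numbers and `a-b` spans; Weka's range syntax also accepts keywords such as `first` and `last`, which are not handled here. The helper names `parse_range` and `remove_attributes` are invented for the example.

```python
def parse_range(spec):
    """Turn a 1-based range list such as '1-5,8' into a set of 0-based indices."""
    indices = set()
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            start, end = part.split("-")
            indices.update(range(int(start) - 1, int(end)))
        else:
            indices.add(int(part) - 1)
    return indices


def remove_attributes(row, spec, invert=False):
    """Drop the listed attributes; with invert=True keep only the listed ones."""
    selected = parse_range(spec)
    if invert:
        return [value for i, value in enumerate(row) if i in selected]
    return [value for i, value in enumerate(row) if i not in selected]


row = list("abcdefghij")  # ten toy attribute values
print(remove_attributes(row, "1-5,8", invert=True))   # the -V case
print(remove_attributes(row, "1-5,8", invert=False))  # plain Remove
```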

  • StringToWordVector

    @attribute text string
    @attribute class {class1,class2,class3}

    @data
    '\n\t\n\t\tDumbek\'s Rand'
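
StringToWordVector turns a string attribute into a set of attributes that describe which words occur in each text. The sketch below shows the general bag-of-words idea in Python with binary presence indicators. It is an illustration only: the tokenization rule, the function name, and the toy texts are assumptions, and the real filter offers many options that are not modeled here.

```python
import re


def string_to_word_vector(texts):
    """Return (vocabulary, vectors) with one 0/1 presence flag per word."""
    tokenized = [re.findall(r"[A-Za-z']+", text) for text in texts]
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    vectors = [
        [1 if word in set(tokens) else 0 for word in vocabulary]
        for tokens in tokenized
    ]
    return vocabulary, vectors


# Toy texts invented for this example.
vocabulary, vectors = string_to_word_vector(["free money now", "meeting now"])
print(vocabulary)
print(vectors)
```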

  • Classification: algorithms 1

    NaiveBayes, BayesNet, Averaged One-Dependence Estimators (AODE)
    SMO, SMOreg, LibSVM

  • StringKernel

    @attribute name string
    @attribute class {female, male}
    @data
    Midori,female
    Koichi,male

    291 female and 385 male names (13 unisex names removed)
    First run: Q2 = 63 %

  • Further SVM parameters and their optimization

    meta.CVParameterSelection -P "C 0.5 50000.0 5.0" ...
    Cross-validation Parameter: '-C' ranged from 0.5 to 50000.0 with 5.0 steps
    Classifier Options: -C 12500.375 ...
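
The specification "C 0.5 50000.0 5.0" names a parameter, a lower bound, an upper bound, and a number of steps. Here is a small Python sketch of the candidate values, assuming they are evenly spaced and include both endpoints. That assumption is consistent with the reported `-C 12500.375`, which is the second of the five values. The function name `parameter_grid` is hypothetical.

```python
def parameter_grid(lower, upper, steps):
    """Evenly spaced candidate values from lower to upper, endpoints included."""
    steps = int(steps)
    if steps < 2:
        return [lower]
    width = (upper - lower) / (steps - 1)
    return [lower + i * width for i in range(steps)]


candidates = parameter_grid(0.5, 50000.0, 5.0)
print(candidates)
```

Each candidate would then be scored by cross-validation and the best one kept, as the "Classifier Options" line of the output indicates.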

    Without confidence prediction

                   TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
                   0.887    0.192    0.777      0.887   0.828      0.847     female
                   0.808    0.113    0.904      0.808   0.853      0.847     male
    Weighted Avg.  0.842    0.147    0.849      0.842   0.842      0.847

    Confidence prediction by logistic regression

                   TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
                   0.835    0.148    0.81       0.835   0.822      0.921     female
                   0.852    0.165    0.872      0.852   0.862      0.921     male
    Weighted Avg.  0.845    0.158    0.846      0.845   0.845      0.921
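
The F-Measure column in these tables can be checked against the Precision and Recall columns, since the F-measure is their harmonic mean. A short Python check using the per-class values reported above follows. The inputs are already rounded to three decimals, so agreement at the third decimal is not guaranteed in general, although it holds for these four rows.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F-measure) taken from the two tables above
rows = [
    (0.777, 0.887, 0.828),  # female, without confidence prediction
    (0.904, 0.808, 0.853),  # male, without confidence prediction
    (0.81, 0.835, 0.822),   # female, logistic regression
    (0.872, 0.852, 0.862),  # male, logistic regression
]
for precision, recall, reported in rows:
    print(round(f_measure(precision, recall), 3), reported)
```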

  • Predicted confidence and ROC curve

  • Classification: algorithms 2

    MultilayerPerceptron
      validation set
      slow
    LinearRegression
    PLSClassifier: Partial Least Squares regression
    Trees
      J48, RandomForest, ...
    Meta
      boosting, bagging, ...
      ClassificationViaRegression
      AttributeSelectedClassifier
      CostSensitiveClassifier

  • Weighting errors

    TP Rate 0.81 0.915

    meta.CostSensitiveClassifier

    % Rows Columns
    2 2
    % Matrix elements
    0 2
    1 0

    The cost of a misclassified P is twice the cost of a misclassified N
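
One way a cost matrix like this can influence predictions is to choose the class with the lowest expected cost rather than the most probable class. The Python sketch below assumes that rows index the actual class, columns index the predicted class, and P is the first class. It is a generic illustration, not Weka's implementation; a cost-sensitive wrapper may instead reweight the training instances. The probabilities are invented.

```python
def min_expected_cost_class(probabilities, cost_matrix):
    """Return (best_class, expected_costs) for one instance.

    probabilities[a] is the estimated probability of actual class a;
    cost_matrix[a][p] is the cost of predicting p when the truth is a.
    """
    n = len(probabilities)
    expected = [
        sum(probabilities[a] * cost_matrix[a][p] for a in range(n))
        for p in range(n)
    ]
    best = min(range(n), key=lambda p: expected[p])
    return best, expected


cost_matrix = [[0, 2],   # actual P: predicting N costs 2
               [1, 0]]   # actual N: predicting P costs 1
# Invented estimate: P has probability 0.4, N has probability 0.6.
best, expected = min_expected_cost_class([0.4, 0.6], cost_matrix)
print(best, expected)
```

With these numbers the instance is assigned to P (index 0) although N is more probable, because missing a P is twice as expensive.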

  • Attribute selection

    Evaluation method
      for single attributes: ChiSquaredAttributeEval, SVMAttributeEval
      for subsets: CfsSubsetEval, WrapperSubsetEval, ClassifierSubsetEval

    Search method
      for single attributes: Ranker
      for subsets: BestFirst, GeneticSearch

    Dimensionality reduction with a filter
      Principal Component Analysis, Partial Least Squares

  • Experimenter

  • Knowledge Flow

  • Resources

    Books
      WEKA Manual for Version 3-7-0
      Data Mining: Practical Machine Learning Tools and Techniques

    Web
      http://www.cs.waikato.ac.nz/ml/weka/
      http://weka.wikispaces.com/
      http://wekadocs.com/
      http://www.hakank.org/weka/

  • Starting Weka

    ssh -X lethe
    module add java
    create a working directory (mkdir, cd)
    wget loschmidt.chemi.muni.cz/~tonda/w.zip
    unzip w.zip
    java -Xmx256m -jar weka.jar

  • Task 1: Explorer, J48 and SMO

    Start Weka twice and open the Explorer in each instance
    In both, open spambase.arff and go to the Classify tab
    Under Test options, More options, set Output predictions to Plain text
    In the first instance
      choose the classifier trees.J48
      click the field to the right of the Choose button and set useLaplace: True
      run 10-fold cross-validation
    In the second instance
      choose the classifier functions.SMO
      click the field to the right of the Choose button and set buildLogisticModels: True, numFolds: 10
      run 10-fold cross-validation
    Compare the speed and accuracy of the two algorithms. Assess how useful the predicted confidence of each result is (=== Predictions on test data ===, column prediction).

  • Task 2: Knowledge Flow, ROC curves

    Start Knowledge Flow
    Open spam_roc.kf
    Set the ArffLoader to spambase.arff
    Right-click the ArffLoader and choose Start loading
    Compare the ROC curves of NaiveBayes and BayesNet (right-click the upper Model Performance Chart, Show chart)
    Compare the ROC curves of BayesNet and AODE. Clicking a point on a curve displays its values. What percentage of spam do we identify if we are willing to tolerate 4 % of ham ending up in the spam folder (spam = class 1, X axis: FPR = FP / N, Y axis: TPR = TP / P)?
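
The question in Task 2 amounts to reading the highest TPR on the ROC curve whose FPR does not exceed 0.04. The following Python sketch shows how such points arise from ranked classifier scores, using the definitions FPR = FP / N and TPR = TP / P given above. The scores and labels are invented toy data, and the function names are hypothetical; in the task itself the values are read from Weka's chart.

```python
def roc_points(scores, labels, positive=1):
    """Return (FPR, TPR) pairs obtained by lowering the decision threshold."""
    num_pos = sum(1 for label in labels if label == positive)
    num_neg = len(labels) - num_pos
    if num_pos == 0 or num_neg == 0:
        raise ValueError("both classes must be present")
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Instances with equal scores cross the threshold together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == positive:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / num_neg, tp / num_pos))
    return points


def best_tpr_at_fpr(points, max_fpr):
    """Highest TPR among ROC points whose FPR is at most max_fpr."""
    return max(tpr for fpr, tpr in points if fpr <= max_fpr)


# Toy data invented for this example (1 = spam, 0 = ham).
scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [1, 1, 0, 1, 1, 0, 1, 0, 0, 0]
points = roc_points(scores, labels)
print(points)
print(best_tpr_at_fpr(points, 0.04))
```

With only five ham examples the FPR moves in steps of 0.2, so the toy answer is coarse; on spambase.arff the curve is much finer.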

  • Task 3: Experimenter, comparing classifiers

    Start the Experimenter
    Click the New button
    Result destination: set the path and choose a name for a new ARFF file
    Add the dataset spam_discretized.arff
    Add the algorithms bayes.AODE, trees.J48, trees.RandomForest
    Run the computation in the Run tab
    When it finishes, go to the Analyse tab
    Click Experiment and then Perform Test
    Is the accuracy of any of the methods on this dataset statistically significantly better at the 0.05 level? What about Area_under_ROC?