L2: Introduction to WEKA

  • 7/22/2019 L2-Gioi Thieu WEKA

    Data Mining

    ([email protected])

    Hanoi University of Science and Technology

    School of Information and Communication Technology

    Academic year 2012-2013

    Course contents:

    Introduction to data mining

    Introduction to the WEKA tool

    Data preprocessing

    Association rule discovery

    Clustering techniques

    Collaborative filtering

    WEKA: Introduction

    WEKA is a software tool written in Java for machine learning and data mining.

    Main features:

    A collection of data preprocessing tools and machine learning algorithms

    A graphical user interface (including data visualization)

    An environment for comparing machine learning and data mining algorithms

    Available for download from:

    . . . .

    WEKA: Main environments

    Simple CLI
    A simple command-line interface

    Explorer (we will mainly use this environment!)
    An environment for exploring data

    Experimenter
    An environment for running experiments and performing statistical tests between machine learning models

    KnowledgeFlow
    An environment that lets you design the steps (components) of an experiment graphically, by drag and drop

    WEKA: The Explorer environment

    WEKA: The Explorer environment

    Preprocess
    Select and modify (process) the working data

    Classify
    Train and test machine learning models (classification, or regression/prediction)

    Cluster
    Learn groups from the data (clustering)

    Associate
    Discover association rules from the data

    Select attributes
    Identify and select the most relevant (important) attributes of the data

    Visualize
    View interactive 2-D plots of the data

    WEKA: Dataset format

    WEKA works with text files in the ARFF format.

    Example of a dataset:

    % Name of the dataset
    @relation weather

    % Nominal attribute
    @attribute outlook {sunny, overcast, rainy}

    % Numeric attributes
    @attribute temperature real
    @attribute humidity real

    @attribute windy {TRUE, FALSE}

    % Class attribute (by default, the last attribute)
    @attribute play {yes, no}

    % The examples
    @data
    sunny, 85, 85, FALSE, no
    overcast, 83, 86, FALSE, yes
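To make the structure of the format concrete, here is a minimal Python sketch that reads the header and data rows of an ARFF text like the one above. It is a simplified illustration, not WEKA's own loader: the function name `parse_arff` is hypothetical, and the sketch assumes no quoted values, no sparse rows, and no date or string attributes.

```python
SAMPLE = """\
% Weather example from the slide
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature real
@attribute humidity real
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}
@data
sunny, 85, 85, FALSE, no
overcast, 83, 86, FALSE, yes
"""


def parse_arff(text):
    """Return (relation name, list of (attribute name, type), data rows)."""
    relation, attributes, rows = None, [], []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and comment lines
        lower = line.lower()
        if in_data:
            # Every line after @data is one example, comma-separated
            rows.append([value.strip() for value in line.split(",")])
        elif lower.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif lower.startswith("@attribute"):
            _, name, kind = line.split(None, 2)
            if kind.startswith("{"):
                # Nominal attribute: keep the list of allowed values
                kind = [value.strip() for value in kind.strip("{}").split(",")]
            attributes.append((name, kind))
        elif lower.startswith("@data"):
            in_data = True
    return relation, attributes, rows


relation, attributes, rows = parse_arff(SAMPLE)
class_name, class_values = attributes[-1]  # last attribute is the class by default
```

Values are kept as strings here; converting the numeric columns is left out to keep the sketch short.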

    WEKA Explorer: Data preprocessing

    Data can be imported from a file in one of these formats: ARFF, CSV

    Data can also be read from a URL, or from a database through JDBC

    WEKA's data preprocessing tools are called filters:

    Discretization

    Re-sampling

    Attribute selection

    Transforming and combining attributes

    Let's look at the WEKA Explorer interface
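As an illustration of what a discretization filter does, the sketch below applies equal-width binning to a numeric attribute. Equal-width binning is only one common strategy, chosen here as an assumption; the slides do not say which method is used. The function name and the temperature values are hypothetical.

```python
def equal_width_bins(values, bins):
    """Map each numeric value to a bin index in 0..bins-1 using equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    if width == 0:
        return [0] * len(values)  # all values identical: a single bin
    # The maximum value would fall just past the last interval, so clamp it
    return [min(int((v - lo) / width), bins - 1) for v in values]


temperatures = [64, 65, 70, 75, 85]  # hypothetical values
binned = equal_width_bins(temperatures, 3)  # [0, 0, 0, 1, 2]
```

After binning, the numeric attribute can be treated as a nominal one with three values.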

    WEKA Explorer: Classifiers (1)

    Classifiers in WEKA are models that predict nominal quantities (classification) or numeric quantities (regression/prediction)

    Naïve Bayes classifier and Bayesian networks

    Decision trees

    Instance-based classifiers

    Support vector machines

    Neural networks

    Let's look at the WEKA Explorer interface

    WEKA Explorer: Classifiers (2)

    Select a classifier

    Select the test options:

    Use training set. The learned classifier is evaluated on the training set

    Supplied test set. A separate dataset (different from the training set) is used for evaluation

    Cross-validation. The dataset is divided into k subsets (folds) of approximately equal size, and the learned classifier is evaluated by cross-validation

    Percentage split. Specifies the percentage of the dataset used for training; the remainder is held out for evaluation
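The two splitting options can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the function names are hypothetical, the folds are not stratified by class, and WEKA's own shuffling and splitting may differ in detail.

```python
import random


def k_fold_indices(n, k, seed=1):
    """Shuffle the indices 0..n-1 and deal them into k folds of near-equal size."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index, so fold sizes differ by at most 1
    return [indices[i::k] for i in range(k)]


def cross_validation_splits(n, k, seed=1):
    """Yield one (train, test) pair of index lists per fold."""
    folds = k_fold_indices(n, k, seed)
    for i, test in enumerate(folds):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test


def percentage_split(n, train_percent, seed=1):
    """Shuffle the indices and keep the first train_percent percent for training."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    cut = round(n * train_percent / 100)
    return indices[:cut], indices[cut:]


splits = list(cross_validation_splits(14, 10))  # 10-fold split of 14 examples
train, test = percentage_split(14, 66)          # 9 training and 5 test examples
```

In cross-validation each example lands in the test set exactly once, so the k test-set results can be pooled into a single estimate.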

    WEKA Explorer: Classifiers (3)

    More options

    Output per-class stats. Displays precision/recall statistics for each class

    Output entropy evaluation measures. Displays entropy-based evaluation measures for the dataset

    Output confusion matrix. Displays the confusion matrix of the learned classifier

    Store predictions for visualization. The classifier's predictions are kept in memory so that they can be visualized later

    Output predictions. Displays the detailed predictions for the test set

    Cost-sensitive evaluation. Errors (of the classifier) are evaluated according to a specified cost matrix

    Random seed for XVal / % Split. Specifies the random seed used when randomly selecting the examples for the test set
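To illustrate the idea behind cost-sensitive evaluation, the sketch below weights each cell of a confusion matrix by the matching cell of a cost matrix. The function name and all numbers are hypothetical, and rows are assumed to be actual classes and columns predicted classes.

```python
def total_cost(confusion, cost):
    """Sum confusion[i][j] * cost[i][j] over all (actual i, predicted j) cells."""
    return sum(
        confusion[i][j] * cost[i][j]
        for i in range(len(confusion))
        for j in range(len(confusion[i]))
    )


# Hypothetical two-class results: rows = actual class, columns = predicted class
confusion = [[7, 2],
             [3, 2]]
# Hypothetical costs: correct predictions are free, and one kind of error
# is five times as expensive as the other
cost = [[0, 1],
        [5, 0]]

cost_sum = total_cost(confusion, cost)                    # 2 * 1 + 3 * 5 = 17
average_cost = cost_sum / sum(sum(row) for row in confusion)
```

With a cost matrix of all ones off the diagonal, the total cost reduces to the plain number of misclassified examples.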

    WEKA Explorer: Classifiers (4)

    Classifier output shows the important information:

    Run information. The model options, the name of the dataset, the number of examples, the attributes, and the evaluation method

    Classifier model (full training set). A textual representation of the learned classifier

    Predictions on test data. Detailed information about the classifier's predictions on the test set

    Summary. Statistics on the accuracy of the classifier for the chosen evaluation method

    Detailed Accuracy By Class. Detailed information about the accuracy of the classifier for each class

    Confusion Matrix. The entries of this matrix show the number of test instances classified correctly and incorrectly
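The relationship between the confusion matrix and the per-class figures can be sketched as follows. This is an illustrative computation with hypothetical function names and made-up predictions, not a reproduction of WEKA's output; rows are assumed to be actual classes and columns predicted classes.

```python
from collections import Counter


def confusion_matrix(actual, predicted, labels):
    """Entry [i][j] counts the examples of class labels[i] predicted as labels[j]."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in labels] for a in labels]


def per_class_stats(matrix, labels):
    """Return {label: (precision, recall)} computed from the confusion matrix."""
    stats = {}
    for i, label in enumerate(labels):
        true_positives = matrix[i][i]
        predicted_as_label = sum(row[i] for row in matrix)  # column total
        actually_label = sum(matrix[i])                      # row total
        precision = true_positives / predicted_as_label if predicted_as_label else 0.0
        recall = true_positives / actually_label if actually_label else 0.0
        stats[label] = (precision, recall)
    return stats


# Hypothetical test-set results
actual = ["yes", "yes", "yes", "no", "no"]
predicted = ["yes", "yes", "no", "no", "yes"]
labels = ["yes", "no"]

matrix = confusion_matrix(actual, predicted, labels)  # [[2, 1], [1, 1]]
stats = per_class_stats(matrix, labels)
accuracy = sum(matrix[i][i] for i in range(len(labels))) / len(actual)
```

The diagonal holds the correctly classified examples, so overall accuracy is the diagonal sum divided by the number of test instances.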

    WEKA Explorer: Classifiers (5)

    The result list provides several useful functions:

    Save model. Saves the learned model to a binary file

    Load model. Loads a previously learned model from a file

    Re-evaluate model on current test set. Evaluates a previously learned model (classifier) on the current test set

    Visualize classifier errors. Opens a plot window showing the classification results

    Correctly classified examples are shown as crosses (x), while misclassified examples are shown as squares (□)

    WEKA Explorer: Clusterers (1)

    Clusterers (cluster builders) in WEKA are models that find groups of similar examples in a dataset

    Expectation maximization (EM)

    k-Means

    ...

    Cluster results can be displayed and compared with the actual clusters (classes)

    Let's look at the WEKA Explorer interface
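For readers unfamiliar with k-means, here is a minimal sketch of the basic algorithm (alternating assignment and centroid update steps). It is a textbook version under stated assumptions, not WEKA's implementation: the function name is hypothetical, the initial centroids are passed in by the caller rather than chosen at random, and squared Euclidean distance is assumed.

```python
def kmeans(points, centroids, iterations=20):
    """Basic k-means: assign points to the nearest centroid, then recompute centroids."""
    centroids = [tuple(c) for c in centroids]
    k = len(centroids)
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid)
        new_centroids = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # converged: assignments can no longer change
        centroids = new_centroids
    return centroids, clusters


# Hypothetical 2-D data with two visible groups
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.1)]
centroids, clusters = kmeans(points, centroids=[(1.0, 1.0), (8.0, 8.0)])
```

The result depends on the initial centroids, which is why clustering tools usually expose a random seed for choosing them.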

    WEKA Explorer: Clusterers (2)

    Select a clusterer (cluster builder)

    Select the cluster mode:

    Use training set. The learned clusters are evaluated on the training set

    Supplied test set. A separate dataset is used to evaluate the learned clusters

    Percentage split. Specifies the proportion in which the original dataset is split to build the test set

    Classes to clusters evaluation. Compares how well the learned clusters match the specified classes

    Store clusters for visualization
    Keeps the clusters in memory so that they can be visualized later

    Ignore attributes

    WEKA Explorer: Association rules

    Select an association rule mining model (algorithm)

    Associator output

    Run information. The options of the association rule mining model, the name of the dataset, the number of examples, and the attributes

    Associator model (full training set). A textual representation of the discovered association rules:

    Minimum support

    Minimum confidence

    Sizes of the frequent (large) itemsets

    The list of association rules found

    Let's look at the WEKA Explorer interface
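Support and confidence, the two thresholds listed above, can be computed directly from a list of transactions. The sketch below uses the standard definitions; the function names and the shopping-basket data are hypothetical and serve only as an illustration.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item of the itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """support(antecedent and consequent together) / support(antecedent).

    Assumes the antecedent occurs in at least one transaction.
    """
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


# Hypothetical transactions, each one a set of items
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

rule_support = support(transactions, {"bread", "milk"})          # 2 of 4 transactions
rule_confidence = confidence(transactions, {"bread"}, {"milk"})  # 2 of the 3 with bread

# A rule is reported only if it meets both thresholds
min_support, min_confidence = 0.4, 0.6
rule_kept = rule_support >= min_support and rule_confidence >= min_confidence
```

Raising either threshold shrinks the list of reported rules; lowering them increases the number of frequent itemsets that must be examined.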

    WEKA Explorer: Attribute selection

    Identifies which attributes are the most important

    In WEKA, an attribute selection method consists of two parts:

    Attribute Evaluator. Determines how the relevance of the attributes is assessed
    E.g.: correlation-based, wrapper, information gain, chi-squared, ...

    Search Method. Determines the method (order) in which the attributes are examined
    E.g.: best-first, random, exhaustive, ranking, ...

    Let's look at the WEKA Explorer interface
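As one concrete pairing of the two parts, the sketch below scores nominal attributes by information gain (the evaluator) and sorts them by score (a ranking search). It uses the standard entropy-based definition; the function names, attribute names, and data are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(values, labels):
    """Class entropy minus the weighted class entropy after splitting on the attribute."""
    total = len(labels)
    remainder = 0.0
    for v in set(values):
        subset = [label for x, label in zip(values, labels) if x == v]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder


# Hypothetical dataset: two nominal attributes and a class label
labels = ["yes", "yes", "no", "no"]
columns = {
    "informative": ["a", "a", "b", "b"],  # separates the classes perfectly
    "useless": ["a", "b", "a", "b"],      # tells nothing about the class
}

gains = {name: information_gain(values, labels) for name, values in columns.items()}
ranking = sorted(gains, key=gains.get, reverse=True)  # best attribute first
```

A ranking search scores each attribute on its own, so it cannot detect attributes that are only useful in combination; subset searches such as best-first address that at a higher cost.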

    WEKA Explorer: Data visualization

    Visualizing data is essential in practice: it helps gauge the difficulty of the learning problem

    WEKA can visualize:

    Each attribute individually (1-D visualization)

    Pairs of attributes (2-D visualization)

    Different class values (labels) are shown in different colors

    The Jitter slider makes the plot easier to read when too many examples (points) are concentrated around one position

    Zooming in and out (by increasing or decreasing the values of PlotSize and PointSize)

    Let's look at the WEKA Explorer interface
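The effect of jitter can be sketched as adding a small random offset to each plotted point, so that coinciding points become individually visible. This is a generic illustration: the slides do not describe how WEKA computes the offset, and the function name, uniform noise, and data here are assumptions.

```python
import random


def jitter(points, amount, seed=1):
    """Shift each (x, y) point by independent uniform noise in [-amount, amount]."""
    rng = random.Random(seed)
    return [
        (x + rng.uniform(-amount, amount), y + rng.uniform(-amount, amount))
        for x, y in points
    ]


# Five hypothetical examples that would be drawn on top of each other
overlapping = [(1.0, 2.0)] * 5
spread = jitter(overlapping, amount=0.1)
```

Jitter changes only the display coordinates; the underlying attribute values are untouched.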
