陳勇汀 (布丁) [email protected] Course page: http://l.pulipuli.info/nccu/17-tm 2017/3/31 Super Simple! An Introduction to Machine Text Classification

A Simple Introduction to Semantic Text Classification (20170331)


  • 2

  • 3

    ~~

    (1)...

    P4

    P1

    P2 /

    P3

    P4

    P5

  • 4

  • 5

    P4 P6

  • 6

    Weka

  • https://www.youtube.com/watch?v=-W3pnicVgn0

    7

  • 8

  • 1.

    2.

    3.

    4.

    5.

    6.

    9

  • http://l.pulipuli.info/nccu/17-tm

    10

    Weka 3.8

    ()

    Google

    CSV to ARFF, ARFF to CSV

    http://l.pulipuli.info/nccu/17-tm
    https://docs.google.com/document/d/1FMJz4rWNGuJnSVEwJG5vFx_0lpLk0VqNTCO0oEHokPs/pub
    https://docs.google.com/spreadsheets/
    http://pulipulichen.github.io/jieba-js/weka_csv_arff.html
    http://pulipulichen.github.io/jieba-js/weka_arff_csv.html

  • 11

    Part 1.

  • 12

  • 13

  • 14

    cm g

    2.98 0.74

    3.5 0.76

  • 15

  • 16

    1.

    2.

    3.

    ? !

  • 17

    1

    ? !

    46

    26

    : 8

    : 2

    2

  • ,,,,,,

    18

    2

    ? !

    2

    2

  • 19

    3

    ? !

  • 20

    3

    ? !

    5

    6

    5 1 1 1 1 1 0 0

    6 1 1 1 0 0 1 1

    1=; 0=

  • 21

    https://www.wikiwand.com/zh-tw/%E5%90%91%E9%87%8F%E7%A9%BA%E9%96%93%E6%A8%A1%E5%9E%8B

    Gerard Salton

    d1, d2
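
The 1/0 rows on the previous slide can be read as binary term vectors in a vector space model. A minimal sketch follows, assuming a made-up seven-term vocabulary (`t1` to `t7`) in place of the real words, which are not recoverable here. Cosine similarity is used as one common way to compare two such vectors; the slides themselves only name the model.

```python
import math

def binary_vector(tokens, vocabulary):
    """Return 1 for each vocabulary term present in tokens, else 0."""
    present = set(tokens)
    return [1 if term in present else 0 for term in vocabulary]

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical vocabulary and documents (placeholders for the slide's words).
vocabulary = ["t1", "t2", "t3", "t4", "t5", "t6", "t7"]
doc5 = ["t1", "t2", "t3", "t4", "t5"]
doc6 = ["t1", "t2", "t3", "t6", "t7"]

v5 = binary_vector(doc5, vocabulary)   # [1, 1, 1, 1, 1, 0, 0]
v6 = binary_vector(doc6, vocabulary)   # [1, 1, 1, 0, 0, 1, 1]
similarity = cosine_similarity(v5, v6)  # 3 shared terms / (sqrt(5) * sqrt(5)) = 0.6
```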

  • 22

    Part 2.

  • 23

    1

    2

    3

  • 24

  • 25

  • 26

  • https://github.com/fxsjy/jieba

    Python

    PHP, Java, Node.js, .NET (C#), C++, R

    27

    Jieba

    https://www.slideshare.net/ssuser4568b0/jieba

    1.

    2. Trie

    3. DAG

    4.

    5.

    HMM, Viterbi
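
The dictionary-based part of this pipeline (word lookup, DAG construction, best-path search) can be sketched in a few lines. This is not Jieba's code: the frequency table `FREQ` is a made-up toy dictionary over Latin letters, and the HMM/Viterbi step for out-of-dictionary words is left out entirely.

```python
import math

# Hypothetical word frequencies standing in for a real dictionary.
FREQ = {"a": 5, "b": 5, "c": 3, "d": 3, "ab": 10, "cd": 8, "bc": 1}
TOTAL = sum(FREQ.values())

def build_dag(sentence):
    """Map each start index to the end indices of dictionary words starting there."""
    dag = {}
    n = len(sentence)
    for i in range(n):
        ends = [j for j in range(i, n) if sentence[i:j + 1] in FREQ]
        dag[i] = ends or [i]  # unknown character: fall back to a single-char word
    return dag

def best_route(sentence, dag):
    """Dynamic programming from right to left over log probabilities."""
    n = len(sentence)
    route = {n: (0.0, 0)}
    log_total = math.log(TOTAL)
    for i in range(n - 1, -1, -1):
        route[i] = max(
            (math.log(FREQ.get(sentence[i:j + 1], 1)) - log_total + route[j + 1][0], j)
            for j in dag[i]
        )
    return route

def segment(sentence):
    """Follow the best route and cut the sentence into words."""
    route = best_route(sentence, build_dag(sentence))
    words, i = [], 0
    while i < len(sentence):
        j = route[i][1]
        words.append(sentence[i:j + 1])
        i = j + 1
    return words

segments = segment("abcd")  # the split "ab" + "cd" has the highest probability
```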


  • 28

    Jieba-JS

    https://goo.gl/YrSTn9


  • 29

  • 30

  • 31

    CSV to ARFF

    1. CSV

    []

    2. CSV to ARFF

    3. ARFF

  • LibreOffice Calc, Google

    Unicode

    32

    Microsoft Office

    Big5

    1. CSV

  • 33

    1. CSV

    document

    class

    class: ?
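
To show what the `document` and `class` columns turn into, here is a rough sketch of a CSV-to-ARFF conversion. It is not the course's web converter, whose output may differ. The relation name, the sample rows and the helper names are assumptions; only the `document`/`class` column names and the `?` marker for unknown classes come from the slide.

```python
import csv
import io

def quote(value):
    """Quote a value for ARFF, escaping backslashes and single quotes."""
    escaped = value.replace("\\", "\\\\").replace("'", "\\'")
    return "'" + escaped + "'"

def csv_to_arff(csv_text, relation="documents"):
    """Convert CSV text with 'document' and 'class' columns to ARFF text."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    # Nominal class values are collected from the labelled rows only.
    labels = sorted({r["class"] for r in rows if r["class"] != "?"})
    lines = [
        "@relation " + relation,
        "",
        "@attribute document string",
        "@attribute class {" + ",".join(quote(l) for l in labels) + "}",
        "",
        "@data",
    ]
    for r in rows:
        # '?' is ARFF's missing-value marker, used for rows to be predicted.
        label = "?" if r["class"] == "?" else quote(r["class"])
        lines.append(quote(r["document"]) + "," + label)
    return "\n".join(lines) + "\n"

# Hypothetical sample data.
sample_csv = "document,class\ngood movie,pos\nbad plot,neg\nnice cast,?\n"
arff_text = csv_to_arff(sample_csv)
```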

  • http://l.pulipuli.info/nccu/17-tm

    34

    1. CSV

    CSV

    http://l.pulipuli.info/nccu/17-tm
    https://docs.google.com/spreadsheets/d/1jaHDl0692t5OHzRlE4YOXJiRKTgv5xQajl_wqWGlQUY/export?format=csv

  • 1. CSV

    35

  • http://l.pulipuli.info/nccu/17-tm

    36

    2. CSV to ARFF

    CSV to ARFF

    http://l.pulipuli.info/nccu/17-tm
    https://pulipulichen.github.io/jieba-js/weka_csv_arff.html

  • 37

    2. CSV to ARFF

    1. CSV

    2.

    3.

  • 38

    3. ARFF&

    train

    test

  • 39

    Weka

    weka.filters.unsupervised.attribute.StringToWordVector

  • http://l.pulipuli.info/nccu/17-tm

    40

    1. CSV

    2. CSV to ARFF

    3. ARFF

    train

    test


  • 41

    Part 3.

  • 42

    Weka

    Java

    Windows, Mac OS, Linux

  • 43

    Weka

    http://l.pulipuli.info/nccu/17-tm

    Weka 3.8.1

    https://docs.google.com/document/d/1FMJz4rWNGuJnSVEwJG5vFx_0lpLk0VqNTCO0oEHokPs/pub

  • Weka (1/2)

    Weka

    C:\Program Files\Weka-3-8

    RunWeka.ini

    >

    44

  • 45

    Weka (2/2)

    Change fileEncoding=Cp1252

    to fileEncoding=utf-8

  • 46

    &

    1

    2

    3

    4

  • 1-1.

    1-2.

    1-3. ()

    47

    1.

    train

    test

  • 48

    2. &

    2-1. Weka Explorer

    2-2.

    2-3. Meta

    2-4. NaiveBayes

    2-5.

    StringToWordVector

    2-6.

  • 49

    2-1. Weka Explorer

    Weka 3.8 Explorer

  • 50

    2-2. (1/2)

    Open file

    train

  • 51

    2-2. (2/2)

    6

    document class

  • 52

    2-3. Meta

  • 53

    2-3. Meta

    weka.classifiers.meta.FilteredClassifier

  • 54

    2-4. NaiveBayes

    weka.classifiers.bayes.NaiveBayes

  • 55

    2-5. ()StringToWordVector

    weka.filters.unsupervised.attribute.StringToWordVector

  • 56

  • 57

    2-6.

    class

  • 58

    2. &

    2-1. Weka Explorer

    2-2.

    2-3. Meta

    2-4. NaiveBayes

    2-5.

    StringToWordVector

    2-6.
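
To make step 2 less of a black box, the sketch below trains a small multinomial naive Bayes classifier on word counts. It is a stand-in written from scratch, not Weka's `NaiveBayes` or `FilteredClassifier`, and it will not reproduce Weka's numbers. The training documents, the labels and the add-one smoothing are assumptions chosen for illustration.

```python
import math
from collections import Counter, defaultdict

def train(docs):
    """docs: list of (tokens, label). Collect class and per-class word counts."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for tokens, label in docs:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return class_counts, word_counts, vocabulary

def predict(model, tokens):
    """Return the label with the highest log posterior (add-one smoothing)."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    scores = {}
    for label, n in class_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n / total_docs)  # class prior
        for tok in tokens:
            if tok in vocabulary:  # words never seen in training are ignored
                score += math.log(
                    (word_counts[label][tok] + 1) / (total_words + len(vocabulary))
                )
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical, already-segmented training documents.
training = [
    (["cheap", "pills", "offer"], "spam"),
    (["offer", "ends", "today"], "spam"),
    (["meeting", "notes", "today"], "ham"),
    (["project", "meeting", "agenda"], "ham"),
]
model = train(training)
label = predict(model, ["cheap", "offer"])
```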

  • 59

    3.

    3-1.

    3-2.

    3-3.

  • Cross-validation: 610

    (6)

    60

    3-1.

  • 1~6

    61

    6

    ()

    1

    2

    3

    4

    5

    6

    1.

    2.

    4/6=66.7%
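
The procedure above (hold out one instance, train on the rest, repeat for all six, then count the hits) can be sketched as follows. The labels and the majority-vote "classifier" are assumptions picked so that the toy run also ends at 4/6. Weka additionally shuffles and stratifies the folds, which this sketch does not do.

```python
from collections import Counter

def k_fold_indices(n, k):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, test
        start += size

def majority_label(labels):
    """Trivial stand-in classifier: always predict the most frequent label."""
    return Counter(labels).most_common(1)[0][0]

def cross_validate(labels, k):
    """Return the fraction of instances predicted correctly across all folds."""
    correct = 0
    for train_idx, test_idx in k_fold_indices(len(labels), k):
        guess = majority_label([labels[i] for i in train_idx])
        correct += sum(1 for i in test_idx if labels[i] == guess)
    return correct / len(labels)

# Hypothetical class labels for six training instances.
labels = ["A", "A", "A", "A", "B", "B"]
accuracy = cross_validate(labels, k=6)  # 4 of 6 held-out instances are guessed right
```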

  • 62

    3-2.

    Start

  • 63

    3-3.

    Correctly Classified

    Instances:

    66.7%

  • 64

    ...

    66.7%

    31

  • 3-1.

    3-2.

    3-3.

    65

    3.

  • 66

    4.

    4-1.

    4-2.

    4-3.

    4-4.

    4-5. ARFF to CSV

    4-6.

  • 4-1. (1/2)

    67

    Supplied test set: [Set...]

  • 4-1. (2/2)

    68

    Open file

    test

    class

  • 69

    4-2. (1/2)

    More options...

    Output predictions:

    [Choose] CSV

  • 70

    4-2. (2/2)

    outputDistribution: True

    CSV

  • 71

    4-3.

    Start

  • 72

    4-4. ARFF

    Save result buffer

    result.txt

  • 73

    4-5. ARFF to CSV (1/3)

    http://l.pulipuli.info/nccu/17-tm

    ARFF to CSV

    http://pulipulichen.github.io/jieba-js/weka_arff_csv.html

  • 74

    4-5. ARFF to CSV (2/3)

    test

    result.txt

  • 75

    4-5. ARFF to CSV (3/3)

    .csv

  • 76

    4-6. (1/4) Google

    http://l.pulipuli.info/nccu/17-tm

    Google

    https://drive.google.com/drive

  • 77

    4-6. (2/4)

    .csv

  • 78

    4-6. (3/4)

  • 79

    4-6. (4/4)

  • 80

    document class

    predicted

    class

    pro_dis:

    pro_dis:

    ? *1 0

    ? *0.619 0.381

    1896

    ? *1 0

    Google

    ? 0.001 *0.999
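
In the table above, the asterisk marks the probability of the predicted class. A small sketch of reading such cells is given below. The class names, the function name and the 0.7 review threshold are assumptions; the three rows of probabilities are the ones shown on the slide.

```python
def parse_distribution(cells, labels):
    """Return (predicted_label, probability) from cells such as ['*0.619', '0.381']."""
    probabilities = [float(c.lstrip("*")) for c in cells]
    starred = [i for i, c in enumerate(cells) if c.startswith("*")]
    # Prefer the starred cell; fall back to the largest probability if none is starred.
    index = starred[0] if starred else probabilities.index(max(probabilities))
    return labels[index], probabilities[index]

# Hypothetical class names; the probability cells are taken from the slide.
labels = ["class_a", "class_b"]
rows = [["*1", "0"], ["*0.619", "0.381"], ["0.001", "*0.999"]]

predictions = [parse_distribution(r, labels) for r in rows]

# Predictions below an arbitrary 0.7 threshold may deserve a manual check.
uncertain = [p for p in predictions if p[1] < 0.7]
```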

  • 4-1.

    4-2.

    4-3.

    4-4.

    4-5. ARFF to CSV

    4-6.

    81

    4.

  • 82

    Part 4.

  • 83

  • 84

    Information Gain

    ()

    1100% 260%40%

    166% 33% 2100% 3100%

    ()

    2 1

    2 1

    2 3

    1 1

    1 2

    2 2

    2 2
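
Information gain measures how much knowing a feature reduces the entropy of the class. A minimal sketch follows, assuming a toy set of six documents split evenly over two classes and one term that appears in two documents of the same class. The data is invented for illustration and is not the slide's example.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """Class entropy minus the weighted entropy after splitting on the feature."""
    total = len(labels)
    remainder = 0.0
    for value in set(feature_values):
        subset = [l for f, l in zip(feature_values, labels) if f == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder

# Hypothetical data: six documents, three per class.
labels = ["A", "A", "A", "B", "B", "B"]
term_present = [1, 1, 0, 0, 0, 0]  # the term occurs only in two class-A documents

gain = information_gain(term_present, labels)  # about 0.459
```

A term that is spread evenly over both classes would score 0, which is why ranking terms by this value surfaces the words most tied to one class.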

  • 85

  • 86

    1.

    2.:

    StringToWordVector

    3.Class

    4.

    InfoGainAttributeEval

    Ranker

    5.

  • 87

    2. (1/2)

    weka.filters.unsupervised.attribute.StringToWordVector

  • 88

    2. (2/2)

  • 89

    3. Class

    class

  • 90

    4.

    weka.attributeSelection.InfoGainAttributeEval

    weka.attributeSelection.Ranker

  • 91

    5. (1/2)

    Start

  • 92

    5. (2/2)

    Selected attributes

  • 93

    Ranked attributes:

    0.459 9

    0.459 49

    0.459 46

    document class

    28 15

    80

    1891

    3D

  • 94

    document class

    28 15

    80

    1891

    3D

    111

  • 1.

    2.:

    StringToWordVector

    3.Class

    4.

    InfoGainAttributeEval

    Ranker

    5.

    95

  • 96

    Part 5.

  • 97

    1.

    2.

    3.

    4. TF-IDF

    5.

    6.() ()

    7.

  • 98

    1891

    1891

  • 99

    tokenizer:

    CharacterNGramTokenizer

    -max 1

    -min 1

    StringToWordVector
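
With both the minimum and maximum n-gram size set to 1, the tokenizer treats every single character as a term, so no prior word segmentation is needed. A rough sketch of the idea follows. It is not Weka's implementation; in particular, dropping whitespace is this sketch's own choice, and the sample string is made up.

```python
def character_ngrams(text, min_n=1, max_n=1):
    """Return all character n-grams of length min_n..max_n, ignoring whitespace."""
    chars = [c for c in text if not c.isspace()]
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(chars) - n + 1):
            grams.append("".join(chars[i:i + n]))
    return grams

# Hypothetical sample text.
unigrams = character_ngrams("文本分類")        # ['文', '本', '分', '類']
bigrams = character_ngrams("文本分類", 2, 2)   # ['文本', '本分', '分類']
```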

  • 100

  • 101

    (1/2)

  • 102

    (2/2)

  • 103

    ? !

    1 1 1 1 1 8

    1 1 1 1 1 9

  • 104

    ()

    28 15

    1 1 1 1 1

    1 2 2 1 1

  • 105

    TF-IDF

    ...

    1 0 1

    0 1 1

    TF-IDF

    ...

    2 0 1

    0 2 1

  • 106

    TF-IDF

    =

    =

  • 107

    TF-IDF

    =

    2

    2

    *2

  • 108

    TF-IDF

    (1/2)

    =

    *3

  • 109

    TF-IDF

    (2/2)

    =

    (

    )

    *()

  • 110

    Weka

    2-5: IDFTransform

    TFTransform
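
One common TF-IDF variant multiplies a term's count in a document by log(N / df), where N is the number of documents and df is the number of documents containing the term. The sketch below uses that variant on two made-up documents. Weka's exact weighting depends on which of the options above are switched on, so treat the numbers as illustrative only.

```python
import math

def tf_idf(documents):
    """Return one {term: weight} dict per document, using count * log(N / df)."""
    n_docs = len(documents)
    vocabulary = sorted({t for d in documents for t in d})
    doc_freq = {t: sum(1 for d in documents if t in d) for t in vocabulary}
    return [
        {t: d.count(t) * math.log(n_docs / doc_freq[t]) for t in vocabulary}
        for d in documents
    ]

# Hypothetical, already-segmented documents.
docs = [
    ["apple", "apple", "fruit"],
    ["banana", "banana", "fruit"],
]
weights = tf_idf(docs)
# "fruit" occurs in every document, so log(2 / 2) = 0 removes its weight,
# while "apple" keeps 2 * log(2) in the first document.
```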

  • 111

    weka.classifiers.bayes.NaiveBayes

    weka.classifiers.functions.Logistic

    weka.classifiers.trees.J48

    weka.classifiers.functions.SMO

    weka.classifiers.functions.MultilayerPerceptron

    weka.classifiers.functions.NeuralNetwork

  • Weka

    112

  • Weka

    :

    WEKA

    2015

    ISBN: 978-986-379-067-9

    113

    510.25474 368

  • () ()

    114

    97 23

    document

    class

    ...

    ....

    document

    class

    ... 97

    23

  • 115

    doNotOperateOnPerClassBasis: True

    StringToWordVector

  • 116

    weka.classifiers.bayes.NaiveBayes

    weka.classifiers.functions.MultilayerPerceptron

    /

  • 117

    &

    1

    2

    3

    4 Weka

  • 118

    Part 6.

  • 119

    P4

    P6

  • 120

    P6194.87%

    ()

  • 121

    ?

  • 122

  • 123

    http://l.pulipuli.info/nccu/17-tm


  • http://blog.pulipuli.info/
