Introduction to the Japanese Analysis Tools MeCab and CaboCha (Kudo)

[PPT] MeCab: a general-purpose Japanese morphological analysis engine. chasen.naist.jp/.../doc/mecab-cabocha-nlp-seminar-2008.ppt · Introduction to the Japanese analysis tools MeCab and CaboCha, Kudo


  • MeCab, CaboCha

  • Spam ... Morphological analysis comprises word segmentation (tokenization), base-form recovery (stemming, lemmatization), and part-of-speech tagging. MeCab:

  • MeCab

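    [Example output: nine morphemes, one per line, each with nine comma-separated feature fields (* for an unused field), ending with EOS]

    Output format: the surface form, then four part-of-speech levels (1, 2, 3, 4) and five further feature fields (in the IPADIC layout: conjugation type, conjugation form, base form, reading, pronunciation). EOS: End of sentence.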
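The output layout described above is easy to consume from a script. The following is a minimal sketch that splits MeCab-style text output into surface forms and feature lists, assuming a tab between surface and features and the nine-field IPADIC layout. The sample text and the function name `parse_mecab_output` are illustrative assumptions, not taken from the slide.

```python
def parse_mecab_output(text):
    """Parse MeCab-style output into a list of (surface, features) pairs."""
    morphemes = []
    for line in text.splitlines():
        if line == "EOS":  # end-of-sentence marker
            break
        # Assumption: surface and features are separated by a single tab.
        surface, _, feature_str = line.partition("\t")
        morphemes.append((surface, feature_str.split(",")))
    return morphemes


# Hypothetical sample in the IPADIC-style layout (not from the slide).
sample = (
    "すもも\t名詞,一般,*,*,*,*,すもも,スモモ,スモモ\n"
    "も\t助詞,係助詞,*,*,*,*,も,モ,モ\n"
    "EOS\n"
)

for surface, features in parse_mecab_output(sample):
    # features[0] is the top-level part of speech, features[6] the base form.
    print(surface, features[0], features[6])
```

Keeping the features as a plain list avoids hard-coding field names, which differ between dictionaries.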


  • TRIE

    The data structure used for dictionary lookup (..., MeCab)
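To illustrate why a TRIE suits dictionary lookup, here is a minimal sketch of common-prefix search: given a position in the input, it returns every dictionary word that starts there. This is a plain dictionary-of-children trie written for clarity; MeCab itself uses a more compact double-array representation, and the word list below is a made-up example.

```python
class Trie:
    """A simple character trie (illustrative, not MeCab's double array)."""

    def __init__(self):
        self.children = {}
        self.is_word = False

    def insert(self, word):
        node = self
        for ch in word:
            node = node.children.setdefault(ch, Trie())
        node.is_word = True

    def common_prefix_search(self, text, start=0):
        """Return every dictionary word that is a prefix of text[start:]."""
        node, found = self, []
        for end in range(start, len(text)):
            node = node.children.get(text[end])
            if node is None:  # no dictionary word continues this way
                break
            if node.is_word:
                found.append(text[start:end + 1])
        return found


trie = Trie()
for w in ["す", "すもも", "も", "もも", "の", "うち"]:  # hypothetical dictionary
    trie.insert(w)

text = "すもももももももものうち"
print(trie.common_prefix_search(text, 0))  # ['す', 'すもも']
print(trie.common_prefix_search(text, 1))  # ['も', 'もも']
```

Running the search at every character position yields all candidate words, which is the raw material for the lattice.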

  • [Figure: word lattice running from BOS to EOS, with eight candidate-word nodes]

  • [Figure: BOS-to-EOS word lattice] Approaches of the 1980s, with KAKASI given as an example

  • [Figure: BOS-to-EOS word lattice] Path selection with the Viterbi algorithm

  • Viterbi algorithm

    [Figure sequence, six steps: costs are accumulated from left to right over the lattice, and the lowest accumulated cost is kept at each node. The first step shows the values 4500, 4200, 5700, 3150, 3200, 7100, 4550; later steps add further values such as 2700, 1400, 6900, 7250, 7900, 7300, 7400.]
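The step-by-step cost accumulation can be sketched in a few lines. This is a minimal minimum-cost Viterbi search over a word lattice, assuming each node is a `(start, end, word, word_cost)` tuple and that connection costs come from a caller-supplied function. The toy lattice, its costs, and the flat connection cost are hypothetical, and using the word itself as the connection key is a simplification (MeCab uses context ids).

```python
def viterbi(length, nodes, conn_cost):
    """Return (total cost, path) of the cheapest BOS-to-EOS path.

    nodes: iterable of (start, end, word, word_cost) over positions 0..length.
    conn_cost(prev_word, word): cost of placing `word` right after `prev_word`.
    """
    # ends_at[pos] holds, for each node ending at pos, its best (cost, path).
    ends_at = {0: [(0, ["BOS"])]}
    # Sorting by start guarantees every node ending at `start` is already done.
    for start, end, word, word_cost in sorted(nodes):
        candidates = [
            (total + conn_cost(path[-1], word) + word_cost, path + [word])
            for total, path in ends_at.get(start, [])
        ]
        if candidates:
            ends_at.setdefault(end, []).append(min(candidates))
    finals = [
        (total + conn_cost(path[-1], "EOS"), path + ["EOS"])
        for total, path in ends_at.get(length, [])
    ]
    return min(finals)  # raises ValueError if no path reaches the end


def flat_conn(prev_word, word):
    """Hypothetical connection cost: the same for every pair."""
    return 500


# Hypothetical lattice over the three-character string "東京都".
nodes = [
    (0, 1, "東", 3000),
    (0, 2, "東京", 2000),
    (1, 2, "京", 3500),
    (1, 3, "京都", 2000),
    (2, 3, "都", 2500),
]

cost, path = viterbi(3, nodes, flat_conn)
print(cost, path)  # 6000 ['BOS', '東京', '都', 'EOS']
```

Because only the best entry per node is kept, the work grows with the number of lattice arcs rather than with the number of possible paths.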

  • (90) () ... + +

  • (VisualMorphs)

  • Conditional Random Fields

    MeCab 0.90 uses CRF instead of HMM for parameter estimation. HMM vs. CRF: a factor of 1/3.

  • CRF training

    [Figure sequence, five steps, over the BOS-to-EOS lattice with twelve pairwise features, each written as two labels joined by "-". All twelve weights start at 0. One update adds +1 and -1 contributions, after which the weights read +1, 0, -1, 0, +1, -1, +1, +1, -1, -1, -1, +1. Where a +1 and a -1 fall on the same feature they cancel: +1 -1 = 0.]
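The +1 / -1 bookkeeping in the figures can be imitated with a very small update rule. The sketch below is a perceptron-style simplification, not full CRF training: it adds 1 to each pairwise feature on a reference path and subtracts 1 for each feature on a competing path. Real CRF training instead subtracts expected feature counts over all paths. The tag sequences and the helper names are hypothetical.

```python
from collections import Counter


def pair_features(path):
    """Count the adjacent-label features ("a-b") along a path."""
    return Counter(a + "-" + b for a, b in zip(path, path[1:]))


def update(weights, correct_path, wrong_path):
    """Add +1 per feature on the correct path, -1 per feature on the wrong one."""
    for feature, count in pair_features(correct_path).items():
        weights[feature] += count
    for feature, count in pair_features(wrong_path).items():
        weights[feature] -= count
    return weights


weights = Counter()  # every weight starts at 0
correct = ["BOS", "noun", "particle", "verb", "EOS"]  # hypothetical paths
wrong = ["BOS", "noun", "noun", "verb", "EOS"]
update(weights, correct, wrong)

for feature in sorted(weights):
    print(feature, weights[feature])
```

Features shared by both paths, here `BOS-noun` and `verb-EOS`, receive +1 and -1 and so stay at 0, which matches the cancellation shown in the figures.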

  • 1 1,2,3 11

  • [Figure: development timeline of JUMAN, ChaSen, MeCab, and Sen. Labels on the figure: 92 0.6; 94 0.6; 96 3.1; 96.09.30; 97 1.0; 02.12.4; 03; 03 2.3.3; 03 4.0; 05 5.1; 06 0.9; 06 1.2.2; "prolog, C"; "C++"; "NAIST"; "TRIE"; "back port" (twice); Sen as a Java port of MeCab.]

  • MeCab

    /grep *.cpp :-)()

  • MeCab

    Comparison with ChaSen: ... API available for C/C++, Perl, Java, Python, Ruby, C#, ...; N-best output; ...

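    use strict;
    use warnings;
    use MeCab;

    my $str = "";    # sentence to analyze
    my $mecab = new MeCab::Tagger();

    # Walk the linked list of morpheme nodes, printing surface form and features.
    for (my $n = $mecab->parseToNode($str); $n; $n = $n->{next}) {
        printf "%s\t%s\n", $n->{surface}, $n->{feature};
    }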

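  • MeCab dictionary files

    1. dic.csv (word dictionary)

    ,166,166,8487,,,,*,*,*,,,
    ,1306,1306,1849,,,,,*,*,,,
    ,1304,1304,7265,,,,,*,*,,,
    ....

    Columns: surface form, left context id, right context id, word cost, then the features as free-form CSV fields. (id, id: -1 ...)

    2. matrix.def (connection cost matrix)

    1306 166 -2559
    1304 1303 401
    166 1304 608

    Columns: right context id of the preceding word, left context id of the following word, connection cost.

    [Figure: three lattice nodes, each labeled left id / word cost / right id: 1306 / 1849 / 1306, 166 / 8487 / 166, 1304 / 7265 / 1304. The arc from 1306 to 166 carries connection cost -2559; the arc from 166 to 1304 carries 608.]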
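As a worked example of how the two files combine, the sketch below loads the entries shown above and sums word costs and connection costs along the path 1306 → 166 → 1304. The surface forms were lost from the slide, so the placeholders `w166`, `w1306`, and `w1304` are hypothetical. Connection costs to BOS and EOS are left out because the slide does not give them, and the header line of a real matrix.def is not handled.

```python
import csv
import io

# Entries from the slide, with hypothetical placeholder surface forms.
DIC_CSV = """\
w166,166,166,8487
w1306,1306,1306,1849
w1304,1304,1304,7265
"""

MATRIX_DEF = """\
1306 166 -2559
1304 1303 401
166 1304 608
"""


def load_dic(text):
    """Map surface form -> (left id, right id, word cost)."""
    entries = {}
    for row in csv.reader(io.StringIO(text)):
        surface, left_id, right_id, cost = row[:4]
        entries[surface] = (int(left_id), int(right_id), int(cost))
    return entries


def load_matrix(text):
    """Map (right id of previous word, left id of next word) -> cost."""
    matrix = {}
    for line in text.splitlines():
        right_id, left_id, cost = map(int, line.split())
        matrix[(right_id, left_id)] = cost
    return matrix


def path_cost(words, dic, matrix):
    """Sum word costs plus the connection cost between each adjacent pair."""
    total, prev_right = 0, None
    for word in words:
        left_id, right_id, cost = dic[word]
        if prev_right is not None:
            total += matrix[(prev_right, left_id)]
        total += cost
        prev_right = right_id
    return total


dic = load_dic(DIC_CSV)
matrix = load_matrix(MATRIX_DEF)
# 1849 - 2559 + 8487 + 608 + 7265
print(path_cost(["w1306", "w166", "w1304"], dic, matrix))  # 15650
```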

  • MeCab dictionary files

    4. char.def (character category definitions)

    NUMERIC 1 1 0
    ALPHA 1 1 0
    HIRAGANA 0 1 2

    0x00C0..0x00FF ALPHA
    0x3041..0x309F HIRAGANA
    ...

    Code point ranges are given in Unicode.

    5. unk.def (unknown word definitions)

    KANJI,1285,1285,11426,,,*,*,*,*,*
    NUMERIC,1295,1295,27386,,,*,*,*,*,*
    ALPHA,1285,1285,13398,,,*,*,*,*,*
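To show how a char.def fragment like the one above could drive unknown-word handling, here is a minimal sketch that reads the category lines and the code point ranges, then looks up the category of a character. It assumes the three numbers after a category name are the invoke, group, and length settings, and that each range maps to a single category. The `DEFAULT` fallback name and the function names are illustrative assumptions.

```python
CHAR_DEF = """\
NUMERIC 1 1 0
ALPHA 1 1 0
HIRAGANA 0 1 2
0x00C0..0x00FF ALPHA
0x3041..0x309F HIRAGANA
"""


def load_char_def(text):
    """Return (category settings, list of (low, high, category) ranges)."""
    categories, ranges = {}, []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0].startswith("0x"):
            # A range line: "0xLOW..0xHIGH CATEGORY" or "0xCODE CATEGORY".
            low, _, high = fields[0].partition("..")
            ranges.append((int(low, 16), int(high or low, 16), fields[1]))
        else:
            name, invoke, group, length = fields
            categories[name] = {
                "invoke": int(invoke),
                "group": int(group),
                "length": int(length),
            }
    return categories, ranges


def category_of(ch, ranges, default="DEFAULT"):
    """Return the first category whose range contains the character."""
    code_point = ord(ch)
    for low, high, name in ranges:
        if low <= code_point <= high:
            return name
    return default


categories, ranges = load_char_def(CHAR_DEF)
print(category_of("あ", ranges))  # HIRAGANA
print(category_of("é", ranges))   # ALPHA
print(categories["HIRAGANA"])
```

The category found this way selects the matching template line in unk.def, which supplies the context ids and cost for the unknown word.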

  • MeCab

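    The dictionary is plain CSV. MeCab interprets only the first 4 columns, so the remaining columns can carry arbitrary information, such as a URL or, as below, an English gloss.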

    ,166,166,8487,,,,*,*,*,,particle

    ,1304,1304,7265,,,,,*,*,,,cherry

    ....

  • CaboCha

    Japanese dependency parser based on Support Vector Machines (SVMs); fast SVM classification with PKE; deterministic parsing (Cascaded Chunking Model)

    To overcome these problems, we introduce a new parsing algorithm called the cascaded chunking model.

    The original idea of our model stems from the English parsing method proposed by Abney (1991).

    The cascaded chunking model parses a sentence deterministically, deciding only whether the current segment modifies the segment on its immediate right-hand side.

    Furthermore, training examples are extracted using this algorithm itself.
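The deterministic loop described above can be sketched as follows. This is a simplified reading of the cascaded chunking idea, not CaboCha's implementation. In each pass, every remaining segment is tagged D if it modifies its immediate right neighbor and O otherwise. A D-tagged segment is then attached and removed when nothing to its left can still modify it, which is the case when it is leftmost or its left neighbor is tagged O. That removal rule, the forced D on the second-to-last segment, and the oracle standing in for the SVM classifier are assumptions made for illustration.

```python
def cascaded_chunking(n, modifies):
    """Return heads[i] = index of the segment that segment i modifies (-1 for root).

    modifies(i, j): True if segment i modifies segment j, its current right neighbor.
    """
    heads = [-1] * n
    active = list(range(n))
    while len(active) > 1:
        # Tag every segment except the last with D (modifies neighbor) or O.
        tags = [
            "D" if modifies(active[k], active[k + 1]) else "O"
            for k in range(len(active) - 1)
        ]
        tags[-1] = "D"  # the second-to-last segment must modify the last one
        remaining = []
        for k, segment in enumerate(active[:-1]):
            if tags[k] == "D" and (k == 0 or tags[k - 1] == "O"):
                heads[segment] = active[k + 1]  # attach and remove
            else:
                remaining.append(segment)
        remaining.append(active[-1])
        active = remaining
    return heads


# Hypothetical gold dependencies for a five-segment sentence.
gold = [4, 3, 3, 4, -1]


def oracle(i, j):
    """Stand-in for the classifier: answers from the gold dependencies."""
    return gold[i] == j


print(cascaded_chunking(len(gold), oracle))  # [4, 3, 3, 4, -1]
```

Running the same loop with the gold answers, as the oracle does here, also shows how training examples can be collected: each call to `modifies` is one classification instance with a known label.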

  • CaboCha

  • MeCab/CaboCha

    MeCab: http://mecab.sourceforge.net/

    CaboCha: http://chasen.org/~taku/software/cabocha/

    Vol.41, No.11, pp.1208-1214, November 2000.

    Vol.43, No.3, pp.685-695, March 2002.

    MeCab: Taku Kudo, Kaoru Yamamoto, Yuji Matsumoto (2004), Applying Conditional Random Fields to Japanese Morphological Analysis, EMNLP 2004.

    CaboCha: Vol.43, No.6, pp.1834-1842, June 2002.

    Vol.45, No.9, pp.2177-2185, September 2004.

    Vol.19, No.3, pp.334-339, May 2004.
