
  • UniDic version 1.3.9

    July 2008

  • UniDic version 1.3.9 Users Manual

    Yasuharu Den, Atsushi Yamada, Hideki Ogura, Hanae Koiso, and Toshinobu Ogiso

    Copyright © 2007-2008 The UniDic consortium. All rights reserved.

    version 1.3.0: 2 April 2007
    version 1.3.5: 12 October 2007
    version 1.3.8: 25 April 2008
    version 1.3.9: 15 July 2008

  • I

    1
    1.1
    1.2
    1.3

    2 UniDic-chasen
    2.1
    2.2
    2.3
    2.4
    2.5
    2.6 chasenrc

    3 UniDic-mecab
    3.1
    3.2
    3.3 dicrc

    II

    4 UniDic
    4.1
    4.2
    4.3
    4.4

    5
    5.1
    5.2
    5.3

    6
    6.1
    6.2
    6.3
    6.4
    6.5
    6.6
    6.7

    7
    7.1

    A
    A.1 Version 1.3.0
    A.2 Version 1.3.5
    A.3 Version 1.3.8

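  • UniDic is distributed in two forms: UniDic-chasen, a dictionary for ChaSen, and UniDic-mecab, a dictionary for MeCab.

    UniDic is a dictionary separate from ipadic, the IPA dictionary distributed for ChaSen.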

    ISTC

    21 1822

    18

    Tel: 042-540-4300

    E-mail: [email protected]


  • I

    1

    UniDic is used with ChaSen (ver. 2.4.0) or MeCab (ver. 0.96).

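    1.1

    Windows and Linux/Cygwin environments are covered.

    Windows: chauni. Linux/Cygwin: see the configure options in this section.

    On Linux/Cygwin, tell configure which of ChaSen/MeCab is used:

    ./configure --with-use-mecab=0 # ChaSen only

    ./configure --with-use-chasen=0 # MeCab only

    On Cygwin, also pass the Cygwin installation directory to configure:

    ./configure --with-systemtop=D:/Cygwin

    With the utf8 dictionary, run ChaSen with the -i w option: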

    chasen -i w <
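    The -i w invocation above can also be driven from a script. The following is a minimal Python sketch, assuming a chasen executable is on PATH and a utf8 UniDic-chasen dictionary is already configured; the function names chasen_command and analyze are hypothetical helpers, not part of ChaSen or UniDic.

```python
import shutil
import subprocess


def chasen_command(utf8=True):
    """Build the ChaSen command line; -i w is the utf8 option named in the text."""
    cmd = ["chasen"]
    if utf8:
        cmd += ["-i", "w"]
    return cmd


def analyze(text):
    """Pipe UTF-8 text to ChaSen on stdin and return its output as a string."""
    if shutil.which("chasen") is None:
        raise FileNotFoundError("chasen is not on PATH")
    result = subprocess.run(
        chasen_command(),
        input=text.encode("utf-8"),
        stdout=subprocess.PIPE,
        check=True,
    )
    return result.stdout.decode("utf-8")
```

    Only chasen_command is free of side effects; analyze needs a working ChaSen installation and raises an error otherwise.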

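    1.2

    1.2.1 UniDic-chasen

    Windows

    1. Extract unidic-chasen139_XXXX.zip to obtain the folder unidic-chasen139_XXXX, where XXXX is the character encoding (utf8, sjis, or eucj).

    2. Locate the dic folder under the ChaSen installation folder (C:\Program Files\ChaSen).

    3. Copy the contents of the folder from step 1 into the dic folder from step 2.

    Linux/Cygwin

    1. Extract unidic-chasen139_XXXX.tar.gz to obtain the directory unidic-chasen139_XXXX, where XXXX is the character encoding (utf8, eucj, or sjis).

    2. Create a directory named unidic under the ChaSen dictionary directory (/usr/local/lib/chasen/dic *1).

    3. Copy the contents of the directory from step 1 into the unidic directory from step 2.

    4. Copy chasenrc from the unidic directory of step 3 to $HOME/.chasenrc and set GRAMMAR to that directory.

    On Cygwin, write the GRAMMAR path in Windows form:

    (GRAMMAR D:/Cygwin/usr/local/lib/chasen/dic/unidic)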
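    Editing the GRAMMAR entry of a copied chasenrc can be scripted. The sketch below is one possible way to do it in Python, assuming the file contains at most one uncommented (GRAMMAR ...) entry; set_grammar is a hypothetical helper and does not try to recognise commented-out lines.

```python
import re


def set_grammar(chasenrc_text, dic_dir):
    """Return chasenrc_text with its (GRAMMAR ...) entry pointing at dic_dir."""
    entry = "(GRAMMAR {})".format(dic_dir)
    pattern = re.compile(r"\(GRAMMAR\s+[^)]*\)")
    if pattern.search(chasenrc_text):
        # A function replacement keeps any backslashes in dic_dir literal.
        return pattern.sub(lambda _m: entry, chasenrc_text, count=1)
    # No GRAMMAR entry yet: put one at the top of the file.
    return entry + "\n" + chasenrc_text


sample = "(GRAMMAR /usr/local/lib/chasen/dic)\n(DADIC chadic)\n"
print(set_grammar(sample, "D:/Cygwin/usr/local/lib/chasen/dic/unidic"))
```

    The result would then be written to $HOME/.chasenrc; that file handling is left out so the sketch stays free of side effects.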

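    1.2.2 UniDic-mecab

    Windows

    1. Extract unidic-mecab139_XXXX.zip to obtain the folder unidic-mecab139_XXXX, where XXXX is the character encoding (utf8, sjis, or eucj).

    2. Create a folder named unidic in the dic folder under the MeCab installation folder (C:\Program Files\MeCab).

    3. Copy the contents of the folder from step 1 into the unidic folder from step 2.

    When running MeCab, select the dictionary with the -d option:

    mecab -d "C:\Program Files\MeCab\dic\unidic"

    Linux/Cygwin

    1. Extract unidic-mecab139_XXXX.tar.gz to obtain the directory unidic-mecab139_XXXX, where XXXX is the character encoding (utf8, eucj, or sjis).

    2. Create a directory named unidic under the MeCab dictionary directory (/usr/local/lib/mecab/dic *2).

    3. Copy the contents of the directory from step 1 into the unidic directory from step 2.

    When running MeCab, select the dictionary with the -d option:

    mecab -d /usr/local/lib/mecab/dic/unidic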
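    The dictionary path passed to -d depends on where MeCab keeps its dictionaries. As a rough Python sketch, the path can be taken from mecab-config --dicdir when that command is available, falling back to the default location used in this section; mecab_dicdir and mecab_command are hypothetical helper names.

```python
import shutil
import subprocess

DEFAULT_DICDIR = "/usr/local/lib/mecab/dic"  # default location given in the text


def mecab_dicdir():
    """Ask mecab-config for the dictionary directory, else use the default."""
    if shutil.which("mecab-config"):
        out = subprocess.run(
            ["mecab-config", "--dicdir"],
            stdout=subprocess.PIPE,
            check=True,
        ).stdout.decode().strip()
        if out:
            return out
    return DEFAULT_DICDIR


def mecab_command(dicdir=None):
    """Build the mecab invocation that selects the unidic dictionary with -d."""
    base = dicdir if dicdir is not None else mecab_dicdir()
    return ["mecab", "-d", base.rstrip("/") + "/unidic"]
```

    Passing dicdir explicitly skips the mecab-config lookup, which is convenient on machines where MeCab is installed in a non-standard place.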

    *1 The ChaSen dictionary directory is reported by chasen-config --dicdir.
    *2 The MeCab dictionary directory is reported by mecab-config --dicdir.

    1.3

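  • 1.3.1 UniDic-chasen

    Windows

    1. Extract unidic-chasen139src.zip to obtain the folder unidic-chasen139src.

    2. Locate the dic folder under the ChaSen installation folder (C:\Program Files\ChaSen).

    3. Copy the unidic-chasen139src folder from step 1 into the ChaSen installation folder.

    4. Run Makefile.bat in the unidic-chasen139src folder from step 3. The dictionary is placed in the dic folder from step 2.

    To exclude some of the dictionaries, rename the corresponding dictionary files (for example Filler.dic) to an extension other than .dic before step 4.

    Note: the source dictionary is encoded in utf8, and Makefile.bat builds a utf8 dictionary. To build a Shift-JIS or EUC-JP dictionary from the utf8 source, use Makefile_sjis.bat or Makefile_eucj.bat instead.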
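    The exclusion step on Windows (renaming a dictionary file such as Filler.dic to an extension other than .dic before running Makefile.bat) can be sketched in Python as follows. The helper name exclude_dictionary and the replacement extension .excluded are assumptions made for illustration; any extension other than .dic would do.

```python
import tempfile
from pathlib import Path


def exclude_dictionary(src_dir, name, new_suffix=".excluded"):
    """Rename src_dir/name so its extension is no longer .dic; return the new path."""
    path = Path(src_dir) / name
    if path.suffix != ".dic":
        raise ValueError("expected a .dic file: {}".format(name))
    target = path.with_suffix(new_suffix)
    path.rename(target)
    return target


# Demonstration in a throwaway directory rather than a real source tree.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "Filler.dic").write_text("", encoding="utf-8")
    renamed = exclude_dictionary(tmp, "Filler.dic")
    print(renamed.name)
```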

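    Linux/Cygwin

    1. Extract unidic-chasen139src.tar.gz to obtain the directory unidic-chasen139src.

    2. In the unidic-chasen139src directory from step 1, run ./configure && make.

    3. Run make install. The dictionary is installed into /usr/local/lib/chasen/dic/unidic.

    4. Copy chasenrc from the unidic directory of step 3 to $HOME/.chasenrc and set GRAMMAR to that directory.

    To exclude some of the dictionaries, pass the --with-exclude-dic option to configure in step 2, separating multiple dictionaries with commas (,):

    ./configure --with-exclude-dic=Filler.dic

    On Cygwin, also pass the Cygwin installation directory to configure:

    ./configure --with-systemtop=D:/Cygwin

    On Cygwin, write the GRAMMAR path in Windows form:

    (GRAMMAR D:/Cygwin/usr/local/lib/chasen/dic/unidic)

    Note: the source dictionary is encoded in utf8, and a plain make install yields a utf8 dictionary. For another encoding, pass --with-encoding=s (Shift-JIS) or --with-encoding=e (EUC-JP) to configure in step 2.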

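  • 1.3.2 UniDic-mecab

    Windows

    1. Extract unidic-mecab139src.zip to obtain the folder unidic-mecab139src.

    2. Create a folder named unidic in the dic folder under the MeCab installation folder (C:\Program Files\MeCab).

    3. Copy the unidic-mecab139src folder from step 1 into the MeCab installation folder.

    4. Run Makefile.bat in the unidic-mecab139src folder from step 3. The dictionary is placed in the unidic folder from step 2.

    To exclude some of the dictionaries, rename the corresponding files (for example Filler.csv) to an extension other than .csv before step 4.

    Note: the source dictionary is encoded in utf8, and Makefile.bat builds a utf8 dictionary. To build a Shift-JIS or EUC-JP dictionary from the utf8 source, use Makefile_sjis.bat or Makefile_eucj.bat instead.

    When running MeCab, select the dictionary with the -d option:

    mecab -d "C:\Program Files\MeCab\dic\unidic"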

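    Linux/Cygwin

    1. Extract unidic-mecab139src.tar.gz to obtain the directory unidic-mecab139src.

    2. In the unidic-mecab139src directory from step 1, run ./configure && make.

    3. Run make install. The dictionary is installed into /usr/local/lib/mecab/dic/unidic.

    To exclude some of the dictionaries, pass the --with-exclude-dic option to configure in step 2, separating multiple dictionaries with commas (,):

    ./configure --with-exclude-dic=Filler.csv

    Note: the source dictionary is encoded in utf8, and a plain make install yields a utf8 dictionary. For another encoding, pass --with-charset=sjis (Shift-JIS) or --with-charset=euc-jp (EUC-JP) to configure.

    When running MeCab, select the dictionary with the -d option:

    mecab -d /usr/local/lib/mecab/dic/unidic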
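    The configure options for the two source packages differ only in the encoding flag and in the file names passed to --with-exclude-dic. The Python sketch below assembles the argument list for either package from the options quoted in sections 1.3.1 and 1.3.2. The function name configure_args and the encoding keys sjis and eucj (borrowed from the package name suffixes) are assumptions, and the sketch only builds the command without running it.

```python
ENCODING_OPTIONS = {
    # analyzer -> {encoding key -> configure option}, as quoted in the text
    "chasen": {"sjis": "--with-encoding=s", "eucj": "--with-encoding=e"},
    "mecab": {"sjis": "--with-charset=sjis", "eucj": "--with-charset=euc-jp"},
}


def configure_args(analyzer, encoding="utf8", exclude=()):
    """Return the ./configure argument list for a UniDic source package."""
    if analyzer not in ENCODING_OPTIONS:
        raise ValueError("unknown analyzer: {}".format(analyzer))
    args = ["./configure"]
    if exclude:
        # Several dictionaries are separated by commas.
        args.append("--with-exclude-dic=" + ",".join(exclude))
    if encoding != "utf8":
        # utf8 is the default, so it needs no option.
        try:
            args.append(ENCODING_OPTIONS[analyzer][encoding])
        except KeyError:
            raise ValueError("unknown encoding: {}".format(encoding))
    return args


print(" ".join(configure_args("mecab", "sjis", exclude=["Filler.csv"])))
```

    The returned list can be handed to subprocess.run in the extracted source directory, followed by make and make install as in the steps above.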


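  • 2 UniDic-chasen

    2.1

    Parts of speech are defined in grammar.cha. A part of speech marked with % is one that conjugates; its conjugation types and conjugated forms are defined in ctypes.cha and cforms.cha.

    (
    ()()()()()())

    (( %)( %))

    2.2

    Conjugation types are defined in ctypes.cha.

    (( )
    (-...
    -))

    2.3

    Conjugated forms are defined in cforms.cha.

    (-
    ((- *)(- *)( *)( *)(- *)(- *)(- *)(- *)(- *)(- *)(- *)))

    Unlike ipadic for ChaSen, UniDic-chasen writes every ending in cforms.cha as *.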

    2.4

    Entries in the .dic files have the following form.

    (POS ( ))
    ((LEX ( 0)) (READING ) (PRON )
     (INFO orthBase="" kanaBase="" pronBase=""
      lForm="" lemma="" form=""
      aConType=" %[email protected], %F1, %[email protected]" goshu=""))

    (POS ( ))
    ((LEX ( 4000)) (READING ) (PRON )
     (INFO orthBase="" kanaBase="" pronBase=""
      lForm="" lemma="" form=""
      aType="1" aConType="C3" goshu=""))

    aModType

    (POS ( ))
    ((LEX ( 261)) (READING ) (PRON )
     (CTYPE -) (CFORM -) (BASE )
     (INFO orthBase="" kanaBase="" pronBase=""
      lForm="" lemma="" form=""
      aType="2" aConType="C1" goshu=""))

    (POS ( ))
    ((LEX ( 261)) (READING ) (PRON )
     (CTYPE -) (CFORM -) (BASE )
     (INFO orthBase="" kanaBase="" pronBase=""
      lForm="" lemma="" form=""
      aType="2" aConType="C1" goshu=""))

    (POS ( ))
    ((LEX ( 261)) (READING ) (PRON )
     (CTYPE -) (CFORM ) (BASE )
     (INFO orthBase="" kanaBase="" pronBase=""
      lForm="" lemma="" form=""
      aType="2" aConType="C1" aModType="[email protected]" goshu=""))

    ...

    (POS ( ))
    ((LEX ( 261)) (READING ) (PRON )
     (CTYPE -) (CFORM -) (BASE )
     (INFO orthBase="" kanaBase="" pronBase=""
      lForm="" lemma="" form=""
      aType="2" aConType="C1" aModType="[email protected]" goshu=""))

    UniDic-chasen stores its additional attributes in the INFO field of the ChaSen entry. See section 4.4 for details.
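    Because the INFO field is a flat list of key="value" pairs, it can be read with a small regular expression. The Python sketch below extracts those pairs from one entry; parse_info is a hypothetical helper, the sample entry reuses the attribute names shown in this section with the stripped values left empty, and values that themselves contain a double quote are not handled.

```python
import re

ATTR = re.compile(r'(\w+)="([^"]*)"')


def parse_info(entry):
    """Return the key="value" pairs that follow INFO in one entry as a dict."""
    start = entry.find("(INFO")
    if start < 0:
        return {}
    return dict(ATTR.findall(entry[start:]))


sample = (
    '(POS ( ))((LEX ( 4000)) (READING ) (PRON )'
    '(INFO orthBase="" kanaBase="" pronBase="" lForm="" lemma="" '
    'form="" aType="1" aConType="C3" goshu=""))'
)
info = parse_info(sample)
print(info["aType"], info["aConType"])  # prints: 1 C3
```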


  • 2.5

    Connection rules are defined in connect.cha.

    (( ((( )))
       ((( ))) )
     814)

    (( ((( ) - -))
       ((() - - )) )
     147)

    (( (((*)))
       ((() - - )) )
     8000)

    (( ((( ) * * ))
       ((() - - )) )
     425)

    2.6 chasenrc

    chasenrc is the configuration file for ChaSen.

    (GRAMMAR /usr/local/lib/chasen/dic)

    (DADIC chadic)

    (UNKNOWN_POS ( ))

    (OUTPUT_FORMAT ; 1"%m\n")

    (OUTPUT_COMPOUND "SEG")

    (EOS_STRING "")

    (DEF_CONN_COST 10000)

    (POS_COST

    ((*) 1)

    ((UNKNOWN) 30000) )

    (CONN_WEIGHT 1)

    (MORPH_WEIGHT 1)

    (COST_WIDTH 0)

    (ANNOTATION (("") "%m\n"))

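  • GRAMMAR: the directory that contains the grammar files.

    UNKNOWN_POS: the part of speech assigned to unknown words.

    OUTPUT_FORMAT: the format in which each morpheme is printed.

    EOS_STRING: the string printed at the end of each sentence.

    ANNOTATION: how annotation strings in the input are recognized and printed.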

    xml

    UniDic-chasen ChaSen

    xml

    OUTPUT_FORMAT ; ... 1

    The xml output can be converted with xslt; uniutils includes the stylesheet xml2txt.xsl.

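  • 3 UniDic-mecab

    3.1

    The .csv files hold the dictionary entries; they correspond to the .dic files of UniDic-chasen.

    3.2

    For the .def files, see the MeCab documentation at http://mecab.sourceforge.net/.

    3.3 dicrc

    dicrc is the dictionary configuration file read by MeCab.

    cost-factor = 700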

    bos-feature = BOS/EOS,*,*,*,*,*,*,*,*,*,*,*,*

    eval-size = 9

    unk-eval-size = 4

    max-grouping-size = 10

    output-format-type = unidic

    node-format-unidic = %m\t%f[10]\t%f[6]\t%f[7]\t%F-[0,1,2,3]\t%f[4]\t%f[5]\t%f[12]\n

    unk-format-unidic = %m\t%m\t%m
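    The dicrc settings above are plain key = value lines, and node-format-unidic describes one tab-separated output line per morpheme. The Python sketch below reads such a file into a dictionary and lists the columns of the selected output format. The helper names parse_dicrc and output_columns are hypothetical, treating lines that start with ; or # as comments is an assumption, and the meaning of each %f[n] index is not interpreted here.

```python
def parse_dicrc(text):
    """Parse 'key = value' lines into a dict, skipping blanks and comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith((";", "#")):
            continue
        key, sep, value = line.partition("=")
        if sep:
            settings[key.strip()] = value.strip()
    return settings


def output_columns(settings):
    """Split the node format chosen by output-format-type into its columns."""
    fmt = settings["node-format-" + settings["output-format-type"]]
    # The file stores \t and \n as two-character escapes, not real tabs.
    return fmt.replace("\\n", "").split("\\t")


DICRC = r"""
cost-factor = 700
output-format-type = unidic
node-format-unidic = %m\t%f[10]\t%f[6]\t%f[7]\t%F-[0,1,2,3]\t%f[4]\t%f[5]\t%f[12]\n
"""
settings = parse_dicrc(DICRC)
cols = output_columns(settings)
print(len(cols), cols[0], cols[-1])  # prints: 8 %m %f[12]
```

    With this format each line of MeCab output can be split on tabs into eight fields, the first being the surface form (%m).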