UniDic version 1.3.9 Users Manual
July 2008
Yasuharu Den, Atsushi Yamada, Hideki Ogura, Hanae Koiso, and Toshinobu Ogiso
Copyright © 2007-2008 The UniDic consortium. All rights reserved.
version 1.3.0: 2 April 2007
version 1.3.5: 12 October 2007
version 1.3.8: 25 April 2008
version 1.3.9: 15 July 2008
Contents
I
1
  1.1
  1.2
  1.3
2 UniDic-chasen
  2.1
  2.2
  2.3
  2.4
  2.5
  2.6 chasenrc
3 UniDic-mecab
  3.1
  3.2
  3.3 dicrc
II
4 UniDic
  4.1
  4.2
  4.3
  4.4
5
  5.1
  5.2
  5.3
6
  6.1
  6.2
  6.3
  6.4
  6.5
  6.6
  6.7
7
  7.1
A
  A.1 Version 1.3.0
  A.2 Version 1.3.5
  A.3 Version 1.3.8
UniDic is provided as UniDic-chasen for ChaSen and UniDic-mecab for MeCab, to be used with ChaSen and MeCab respectively.
ChaSen IPA
ipadic
UniDic
ISTC
21 1822
18
Tel: 042-540-4300
E-mail: [email protected]
I

1
UniDic is used with ChaSen (ver. 2.4.0) or MeCab (ver. 0.96).

1.1
Windows, Linux/Cygwin
Windows chauni Linux/Cygwin
On Linux/Cygwin, configure can be limited to ChaSen or to MeCab:
./configure --with-use-mecab=0 # ChaSen
./configure --with-use-chasen=0 # MeCab
On Cygwin, pass the Cygwin root to configure:
./configure --with-systemtop=D:/Cygwin
With utf8, run ChaSen with the option -i w:
chasen -i w <
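
1.2

1.2.1 UniDic-chasen
Windows
1. Extract unidic-chasen139_XXXX.zip to obtain the folder unidic-chasen139_XXXX. XXXX is one of utf8, sjis, eucj.
2. In the ChaSen installation folder (C:\Program Files\ChaSen), prepare the dic folder.
3. Copy the contents of the folder from step 1 into the dic folder from step 2.
Linux/Cygwin
1. Extract unidic-chasen139_XXXX.tar.gz to obtain the directory unidic-chasen139_XXXX. XXXX is one of utf8, eucj, sjis.
2. Create a unidic directory under the ChaSen dictionary directory (/usr/local/lib/chasen/dic*1).
3. Copy the contents of the directory from step 1 into the unidic directory from step 2.
4. Copy the chasenrc file in the unidic directory from step 3 to $HOME/.chasenrc and set its GRAMMAR entry to that unidic directory.
On Cygwin, write the GRAMMAR entry as a Windows path:
(GRAMMAR D:/Cygwin/usr/local/lib/chasen/dic/unidic)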
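The GRAMMAR edit in the last step can also be scripted. The following is a minimal sketch, not part of UniDic: the helper name `set_grammar` is hypothetical, and it assumes the chasenrc file holds its GRAMMAR entry on one line in the `(GRAMMAR path)` form shown above.

```python
import re


def set_grammar(chasenrc_text, dic_dir):
    """Return chasenrc_text with its (GRAMMAR ...) entry pointing at dic_dir.

    Hypothetical helper: assumes the entry sits on a single line.
    """
    pattern = re.compile(r"^\(GRAMMAR\s+[^)]*\)", flags=re.MULTILINE)
    new_entry = "(GRAMMAR {})".format(dic_dir)
    # A function replacement keeps any backslashes in dic_dir literal.
    updated, count = pattern.subn(lambda _match: new_entry, chasenrc_text)
    if count == 0:
        # No GRAMMAR entry was found, so add one at the top.
        updated = new_entry + "\n" + chasenrc_text
    return updated


if __name__ == "__main__":
    sample = "(GRAMMAR /usr/local/lib/chasen/dic)\n(DADIC chadic)\n"
    print(set_grammar(sample, "D:/Cygwin/usr/local/lib/chasen/dic/unidic"))
```

The function works on text rather than on the file itself, so the result can be inspected before $HOME/.chasenrc is overwritten.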
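
1.2.2 UniDic-mecab
Windows
1. Extract unidic-mecab139_XXXX.zip to obtain the folder unidic-mecab139_XXXX. XXXX is one of utf8, sjis, eucj.
2. Create a unidic folder under the dic folder of the MeCab installation folder (C:\Program Files\MeCab).
3. Copy the contents of the folder from step 1 into the unidic folder from step 2.
When running MeCab, specify the dictionary with the -d option:
mecab -d "C:\Program Files\MeCab\dic\unidic"
Linux/Cygwin
1. Extract unidic-mecab139_XXXX.tar.gz to obtain the directory unidic-mecab139_XXXX. XXXX is one of utf8, eucj, sjis.
2. Create a unidic directory under the MeCab dictionary directory (/usr/local/lib/mecab/dic*2).
3. Copy the contents of the directory from step 1 into the unidic directory from step 2.
When running MeCab, specify the dictionary with the -d option:
mecab -d /usr/local/lib/mecab/dic/unidic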
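The dictionary directory does not have to be typed by hand, since the manual's footnotes name `chasen-config --dicdir` and `mecab-config --dicdir` as the commands that report it. The sketch below is an illustration only: the helper names `query_dicdir` and `mecab_command` are hypothetical, and it assumes the dictionary was installed in a subdirectory named unidic, as in the steps above.

```python
import posixpath
import shutil
import subprocess


def query_dicdir(config_tool):
    """Ask chasen-config or mecab-config for its dictionary directory.

    Returns None when the tool is not on PATH.
    """
    if shutil.which(config_tool) is None:
        return None
    result = subprocess.run(
        [config_tool, "--dicdir"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def mecab_command(dicdir, dic_name="unidic"):
    """Build the mecab argument list that selects the dictionary with -d."""
    return ["mecab", "-d", posixpath.join(dicdir, dic_name)]


if __name__ == "__main__":
    # Fall back to the path used in the manual if mecab-config is missing.
    dicdir = query_dicdir("mecab-config") or "/usr/local/lib/mecab/dic"
    print(" ".join(mecab_command(dicdir)))
```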
*1 For ChaSen, this directory is reported by chasen-config --dicdir.
*2 For MeCab, this directory is reported by mecab-config --dicdir.

1.3
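
1.3.1 UniDic-chasen
Windows
1. Extract unidic-chasen139src.zip to obtain the folder unidic-chasen139src.
2. In the ChaSen installation folder (C:\Program Files\ChaSen), prepare the dic folder.
3. Copy the unidic-chasen139src folder from step 1 into the ChaSen installation folder.
4. Run Makefile.bat in the unidic-chasen139src folder from step 3. The dictionary is placed in the dic folder from step 2.
In step 4, .dic files such as Filler.dic can be left out of the build.
By default the dictionary is built in utf8. Makefile_sjis.bat and Makefile_eucj.bat build it in an encoding other than utf8.
Linux/Cygwin
1. Extract unidic-chasen139src.tar.gz to obtain the directory unidic-chasen139src.
2. In the unidic-chasen139src directory from step 1, run ./configure && make.
3. Run make install. The dictionary is installed in /usr/local/lib/chasen/dic/unidic.
4. Copy the chasenrc file in the unidic directory from step 3 to $HOME/.chasenrc and set its GRAMMAR entry to that unidic directory.
In step 2, the configure option --with-exclude-dic leaves dictionary files out of the build. Separate multiple files with a comma (,).
./configure --with-exclude-dic=Filler.dic
On Cygwin, pass the Cygwin root to configure:
./configure --with-systemtop=D:/Cygwin
and write the GRAMMAR entry as a Windows path:
(GRAMMAR D:/Cygwin/usr/local/lib/chasen/dic/unidic)
By default the dictionary is built in utf8. If --with-encoding=s (Shift-JIS) or --with-encoding=e (EUC-JP) is passed to configure in step 2, make install installs a dictionary in that encoding instead of utf8.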
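
1.3.2 UniDic-mecab
Windows
1. Extract unidic-mecab139src.zip to obtain the folder unidic-mecab139src.
2. Create a unidic folder under the dic folder of the MeCab installation folder (C:\Program Files\MeCab).
3. Copy the unidic-mecab139src folder from step 1 into the MeCab installation folder.
4. Run Makefile.bat in the unidic-mecab139src folder from step 3. The dictionary is placed in the unidic folder from step 2.
In step 4, .csv files such as Filler.csv can be left out of the build.
By default the dictionary is built in utf8. Makefile_sjis.bat and Makefile_eucj.bat build it in an encoding other than utf8.
When running MeCab, specify the dictionary with the -d option:
mecab -d "C:\Program Files\MeCab\dic\unidic"
Linux/Cygwin
1. Extract unidic-mecab139src.tar.gz to obtain the directory unidic-mecab139src.
2. In the unidic-mecab139src directory from step 1, run ./configure && make.
3. Run make install. The dictionary is installed in /usr/local/lib/mecab/dic/unidic.
In step 2, the configure option --with-exclude-dic leaves dictionary files out of the build. Separate multiple files with a comma (,).
./configure --with-exclude-dic=Filler.csv
By default the dictionary is built in utf8. If --with-charset=sjis (Shift-JIS) or --with-charset=euc-jp (EUC-JP) is passed to configure, make install installs a dictionary in that encoding instead of utf8.
When running MeCab, specify the dictionary with the -d option:
mecab -d /usr/local/lib/mecab/dic/unidic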

2 UniDic-chasen

2.1
grammar.cha defines the parts of speech. A part of speech marked with % conjugates; its conjugation is defined in ctypes.cha and cforms.cha.
(
()()()()()())
(( %)( %))

2.2
ctypes.cha
(( )
(-...
-))

2.3
cforms.cha
(-
((- *)(- *)( *)( *)(- *)(- *)(- *)(- *)(- *)(- *)(- *)))
ChaSen ipadic UniDic-chasen cforms.cha

2.4
Dictionary entries are written in .dic files.
(POS ( ))
((LEX ( 0)) (READING ) (PRON )
 (INFO orthBase="" kanaBase="" pronBase="" lForm="" lemma="" form="" aConType=" %[email protected], %F1, %[email protected]" goshu=""))
(POS ( ))
((LEX ( 4000)) (READING ) (PRON )
 (INFO orthBase="" kanaBase="" pronBase="" lForm="" lemma="" form="" aType="1" aConType="C3" goshu=""))
aModType
(POS ( ))
((LEX ( 261)) (READING ) (PRON ) (CTYPE -) (CFORM -) (BASE )
 (INFO orthBase="" kanaBase="" pronBase="" lForm="" lemma="" form="" aType="2" aConType="C1" goshu=""))
(POS ( ))
((LEX ( 261)) (READING ) (PRON ) (CTYPE -) (CFORM -) (BASE )
 (INFO orthBase="" kanaBase="" pronBase="" lForm="" lemma="" form="" aType="2" aConType="C1" goshu=""))
(POS ( ))
((LEX ( 261)) (READING ) (PRON ) (CTYPE -) (CFORM ) (BASE )
 (INFO orthBase="" kanaBase="" pronBase="" lForm="" lemma="" form="" aType="2" aConType="C1" aModType="[email protected]" goshu=""))
...
(POS ( ))
((LEX ( 261)) (READING ) (PRON ) (CTYPE -) (CFORM -) (BASE )
 (INFO orthBase="" kanaBase="" pronBase="" lForm="" lemma="" form="" aType="2" aConType="C1" aModType="[email protected]" goshu=""))
UniDic-chasen carries its additional attributes in ChaSen's INFO field (see Section 4.4).
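The INFO field in the entries above is a run of name="value" pairs, so it can be split into a mapping with one regular expression. This is a minimal sketch rather than a UniDic tool: the function name `parse_info` is hypothetical, and it assumes values never contain a double quote.

```python
import re

# One name="value" pair; assumes values never contain a double quote.
INFO_ATTR = re.compile(r'(\w+)="([^"]*)"')


def parse_info(info_text):
    """Split an INFO string into a dict mapping attribute name to value."""
    return {name: value for name, value in INFO_ATTR.findall(info_text)}


if __name__ == "__main__":
    # Attribute names and values copied from the entries in this section.
    sample = ('orthBase="" kanaBase="" pronBase="" lForm="" lemma="" '
              'form="" aType="2" aConType="C1" goshu=""')
    info = parse_info(sample)
    print(info["aType"], info["aConType"])
```

Because the pattern matches each pair on its own, it does not matter whether neighbouring pairs are separated by a space.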

2.5
connect.cha defines the connection rules and their costs.
(( ((( )))((( ))) ) 814)
(( ((( ) - -))((() - - )) ) 147)
(( (((*)))((() - - )) ) 8000)
(( ((( ) * * ))((() - - )) ) 425)

2.6 chasenrc
chasenrc is the ChaSen configuration file.
(GRAMMAR /usr/local/lib/chasen/dic)
(DADIC chadic)
(UNKNOWN_POS ( ))
(OUTPUT_FORMAT ; 1"%m\n")
(OUTPUT_COMPOUND "SEG")
(EOS_STRING "")
(DEF_CONN_COST 10000)
(POS_COST
((*) 1)
((UNKNOWN) 30000) )
(CONN_WEIGHT 1)
(MORPH_WEIGHT 1)
(COST_WIDTH 0)
(ANNOTATION (("") "%m\n"))
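GRAMMAR: the directory that contains the grammar and dictionary files
UNKNOWN_POS: the part of speech assigned to unknown words
OUTPUT_FORMAT: the output format
EOS_STRING: the string printed at the end of each sentence
ANNOTATION: the handling and output format of annotations
xml
With UniDic-chasen, the ChaSen output is in xml.
OUTPUT_FORMAT ; ... 1
An xslt stylesheet, xml2txt.xsl, is provided in uniutils.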
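
3 UniDic-mecab

3.1
Dictionary entries are written in .csv files, which correspond to the .dic files of UniDic-chasen.

3.2
For the .def files, see http://mecab.sourceforge.net/

3.3 dicrc
dicrc is the MeCab configuration file.
cost-factor = 700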
bos-feature = BOS/EOS,*,*,*,*,*,*,*,*,*,*,*,*
eval-size = 9
unk-eval-size = 4
max-grouping-size = 10
output-format-type = unidic
node-format-unidic = %m\t%f[10]\t%f[6]\t%f[7]\t%F-[0,1,2,3]\t%f[4]\t%f[5]\t%f[12]\n
unk-format-unidic = %m\t%m\t%m
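The node-format-unidic setting above emits one tab-separated line per morpheme with eight columns: the surface form (%m), features 10, 6 and 7, features 0 to 3 joined with "-", then features 4, 5 and 12. A hedged sketch of reading such a line back follows. The column keys (f10, f6, and so on) are placeholder labels derived from the feature indices, not names defined by UniDic, and the sample values in the usage part are dummy strings.

```python
# Placeholder column labels, in the order given by node-format-unidic.
NODE_COLUMNS = ("surface", "f10", "f6", "f7", "pos", "f4", "f5", "f12")


def parse_node_line(line):
    """Parse one output line produced with node-format-unidic.

    Raises ValueError for lines with another column count, such as
    lines produced by a different format.
    """
    cols = line.rstrip("\n").split("\t")
    if len(cols) != len(NODE_COLUMNS):
        raise ValueError("expected %d columns, got %d"
                         % (len(NODE_COLUMNS), len(cols)))
    record = dict(zip(NODE_COLUMNS, cols))
    # %F-[0,1,2,3] joins the part-of-speech fields with "-".
    record["pos"] = record["pos"].split("-")
    return record


if __name__ == "__main__":
    # Dummy values, only to show the shape of a line.
    dummy = "\t".join(["w", "a", "b", "c", "p0-p1", "d", "e", "f"]) + "\n"
    print(parse_node_line(dummy))
```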