
Profiling Japanese Lexical and Grammatical Information Using a Ten-Billion-Word Corpus


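The 3rd Corpus Japanese Linguistics Workshop, 28 February 2013. Profiling Japanese Lexical and Grammatical Information Using a Ten-Billion-Word Corpus. Irena Srdanović (National Institute for Japanese Language and Linguistics / University of Ljubljana), Vít Suchomel (Natural Language Processing Centre, Masaryk University), Toshinobu Ogiso (National Institute for Japanese Language and Linguistics), Adam Kilgarriff (Lexical Computing / University of Leeds). [email protected] Outline: background to corpus development; the TenTen corpus family; building the JpTenTen corpus; annotation with UniDic short-unit and long-unit words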

Text of Profiling Japanese Lexical and Grammatical Information Using a Ten-Billion-Word Corpus


KOTONOHA, BCCWJ; 2006, 2008, 2008, 2008, 2006

JpTenTen 1
TenTen; Corpus Factory (Kilgarriff 2010); JpTenTen11: ten billion words (Pomikálek & Suchomel 2012)

1. (Kilgarriff 2010)
2. SpiderLing (Pomikálek & Suchomel 2012)
3. jusText (Pomikálek 2011): text in sentences only
4. de-duplicate (Pomikálek 2011)

JpTenTen 2
5. MeCab 0.98, UniDic 2.1.0 (2011)
6. Comainu 0.60, UniDic
7. (Srdanović 2008), UniDic
8. SkE (Kilgarriff 2004), Manatee

JpTenTen
10,321,875,665 (UniDic); 15,553,207; 734,758; GB
Com, jp, net, info, Other: 50%, 32%, 9%, 5%, 4%

JpTenTen
Tag          Count            %
N.c.g        1,588,826,682    15.39%
P.case       1,359,247,326    13.17%
Aux            863,345,234     8.36%
N.c.vs         554,425,638     5.37%
V.bnd          532,961,623     5.16%
V.g            506,815,227     4.91%
Supsym.p       447,290,326     4.33%
N.num          414,314,084     4.01%
Supsym.c       376,814,114     3.65%
P.conj         372,440,519     3.61%
P.bind         355,183,028     3.44%
Supsym.g       280,555,426     2.72%
N.c.adv        215,750,266     2.09%
Suff.n.g       206,247,617     2.00%
Sym.ch         204,585,993     1.98%
Adv            173,481,056     1.68%
Supsym.bo      171,503,010     1.66%
P.adv          158,837,049     1.54%
Supsym.bc      153,531,627     1.49%
N.c.count      128,986,663     1.25%
P.fin          119,879,442     1.16%
Pron           114,109,541     1.11%

JpTenTen 1
JpWaC: ChaSen, IPADIC
JpTenTen: MeCab, UniDic (2011)
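The percentages given with the part-of-speech counts above can be reproduced by dividing each tag count by the corpus size. The following is a minimal Python sketch, assuming that the figure 10,321,875,665 is the total token count used as the denominator; the variable and function names are illustrative and do not come from the slides.

```python
# Token counts copied from the part-of-speech figures above (UniDic-based tags).
TOTAL_TOKENS = 10_321_875_665  # assumed to be the denominator for the shares

pos_counts = {
    "N.c.g": 1_588_826_682,
    "P.case": 1_359_247_326,
    "Aux": 863_345_234,
    "N.c.vs": 554_425_638,
}


def share(count, total=TOTAL_TOKENS):
    """Return count as a percentage of total, rounded to two decimals."""
    return round(100 * count / total, 2)


shares = {tag: share(n) for tag, n in pos_counts.items()}

for tag, pct in shares.items():
    print(f"{tag:8s} {pct:5.2f}%")
```

For the four tags shown, the computed values agree with the listed shares, which supports reading the first figure as the overall token count.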


(n) equality; good order

JpTenTen 2: UniDic
Tag        Name
Pron       pronoun
Adv        adverb
Aux        auxiliary_verb
P.bind     particle(binding)
P.adv      particle(adverbial)

Tag        Name
ku_wrd     ku_wording
Cond.g     conditional.general (katei)
Cond.int   conditional.integrated (katei)
Imp        imperative (meirei)
Real.g     realis.general (izen)
ka_irr     kahen_verb.irregular
sa_irr     sahen_verb.irregular
za_irr     zahen_verb.irregular
V1i.a      kamiichidan_verb_i_row.a_column
V1i.ka     kamiichidan_verb_i_row.ka_column

JpTenTen

https://the.sketchengine.co.uk/


CQL

CQL (Corpus Query Language)
Attributes: word, lemma, tag; lemma_kana, infl_type, infl_form

CQL: [word=*"][word=""][tag="N.*"][word=""][tag="Ai.*" & infl_form="Attr.*"]
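To make the query semantics concrete, here is a minimal Python sketch of how a CQL-style query of this kind can be read: each bracket describes one token, conditions joined by `&` must all hold, and each value is a regular expression matched against the whole attribute value. The sample tokens, the specific tag values `Ai.g` and `Attr.g`, and the helper names are assumptions for illustration only; this is not the Sketch Engine implementation.

```python
import re


def token_matches(token, conditions):
    """True if every attribute of the token fully matches its regex."""
    return all(
        re.fullmatch(pattern, token.get(attr, "")) is not None
        for attr, pattern in conditions.items()
    )


def find_matches(tokens, query):
    """Return start positions where consecutive tokens satisfy the query."""
    hits = []
    for start in range(len(tokens) - len(query) + 1):
        window = tokens[start:start + len(query)]
        if all(token_matches(t, c) for t, c in zip(window, query)):
            hits.append(start)
    return hits


# Hypothetical tagged tokens; the attribute values are assumed, not corpus data.
tokens = [
    {"word": "美しい", "tag": "Ai.g", "infl_form": "Attr.g"},
    {"word": "花", "tag": "N.c.g"},
    {"word": "と", "tag": "P.case"},
    {"word": "青い", "tag": "Ai.g", "infl_form": "Attr.g"},
    {"word": "空", "tag": "N.c.g"},
]

# Rough equivalent of: [tag="Ai.*" & infl_form="Attr.*"][tag="N.*"]
query = [{"tag": "Ai.*", "infl_form": "Attr.*"}, {"tag": "N.*"}]

hits = find_matches(tokens, query)
print(hits)
```

A token that lacks an attribute, such as a noun without `infl_form`, simply fails any condition on that attribute in this sketch.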

(1) (2) (3) (4)

2007; Srdanović 2008; Gahl 1998 (corpus query syntax)

(1) ChaSen, IPADIC; MeCab-UniDic

(2)

(3)


infl_form=Cont.*, tag="Ai.*", tag=N.c.vs

*DUAL
=modifier_Ai_cont/modifies_N+
2:[tag="Ai.*" & word!="|" & infl_form="Cont.*"] [tag="Pref"]? 1:[tag="N.c.vs"]
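The effect of this rule can be sketched in Python as follows. The sketch assumes the usual reading of a dual relation: the token labelled 1 receives the first relation name with token 2 as its collocate, and token 2 receives the second name with token 1 as its collocate. The list of excluded word forms did not survive in the source, so `excluded` is a placeholder parameter; the sample tokens and the tag values `Ai.g` and `Cont.g` are likewise hypothetical.

```python
import re


def full(pattern, value):
    """Whole-value regex match, treating a missing attribute as empty."""
    return re.fullmatch(pattern, value or "") is not None


def apply_rule(tokens, excluded=frozenset()):
    """Collect (relation, headword, collocate) triples for the rule
    2:[Ai, Cont form] [Pref]? 1:[N.c.vs]."""
    relations = []
    for i, tok in enumerate(tokens):
        is_adjective = (
            full("Ai.*", tok["tag"])
            and full("Cont.*", tok.get("infl_form"))
            and tok["word"] not in excluded
        )
        if not is_adjective:
            continue
        j = i + 1
        # Optional prefix token between the adjective and the noun.
        if j < len(tokens) and full("Pref", tokens[j]["tag"]):
            j += 1
        if j < len(tokens) and full("N.c.vs", tokens[j]["tag"]):
            noun, adjective = tokens[j]["lemma"], tok["lemma"]
            relations.append(("modifier_Ai_cont", noun, adjective))
            relations.append(("modifies_N+", adjective, noun))
    return relations


def tok(word, lemma, tag, infl_form=None):
    return {"word": word, "lemma": lemma, "tag": tag, "infl_form": infl_form}


# Hypothetical token sequence; tags and forms are assumed for illustration.
tokens = [
    tok("詳しく", "詳しい", "Ai.g", "Cont.g"),
    tok("説明", "説明", "N.c.vs"),
    tok("する", "する", "V.bnd"),
    tok("正しく", "正しい", "Ai.g", "Cont.g"),
    tok("再", "再", "Pref"),
    tok("利用", "利用", "N.c.vs"),
]

relations = apply_rule(tokens)
for relation in relations:
    print(relation)
```

Passing a set of word forms as `excluded` suppresses matches for those adjectives, which mirrors the `word!=` condition in the rule.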

WordSketch: Word Sketch, Thesaurus, WS Diff


Word Sketch: Word Sketch, UniDic

Multiword sketches

Word Sketch ~ Word Sketch, MWU

[tag="P.conj" & word=""] [tag="V.bnd"]
1964 (Martin 2004, 512)

Martin 2004; JpTenTen; 100


References

(2008) Sketch Engine, 23, 59-80
(2007) 22, 101-123
(2011) 4
(2011) BCCWJ, 22, 331-338
(2011) UniDic2.0, 22, 411-8
Baroni, Marco & Kilgarriff, Adam (2006). Large linguistically-processed Web corpora for multiple languages. In Proceedings of EACL, Trento, Italy.
Gahl, S. (1998). Automatic extraction of subcorpora based on subcategorization frames from a part-of-speech tagged corpus. Ms., ICSI-Berkeley.
Kilgarriff, Adam, Rychlý, Pavel, Smrž, Pavel & Tugwell, David (2004). The Sketch Engine. Proceedings of EURALEX. France: Université de Bretagne. 105-116.
Kilgarriff, A., Kovář, V., Krek, S., Srdanović, I., Tiberius, C. (2010). A Quantitative Evaluation of Word Sketches. Proceedings of the XIV Euralex International Congress. Leeuwarden: Fryske Academy. 7pp.
Kilgarriff, Adam, Reddy, Siva, Pomikálek, Jan & PVS, Avinesh (2010). A corpus factory for many languages. In Proceedings of LREC, Malta.
Martin, Samuel E. (2004). A reference grammar of Japanese. Honolulu: University of Hawaii Press.
Pomikálek, Jan (2011). Removing Boilerplate and Duplicate Content from Web Corpora. PhD thesis, Masaryk University, Brno.
Pomikálek, Jan & Suchomel, Vít (2012). Efficient Web Crawling for Large Text Corpora. ACL SIGWAC Web as Corpus (at conference WWW).
Sharoff, S. (2006). Open-source corpora: using the net to fish for linguistic data. International Journal of Corpus Linguistics, 11 (4), pp. 435-462.
Srdanović, Irena, Erjavec, Tomaž & Kilgarriff, Adam (2008). A web corpus and word-sketches for Japanese. Shizen gengo shori (Journal of Natural Language Processing) 15/2. 137-159.
Srdanović, Irena, Ida, Naomi, Shigemori Bučar, Chikako, Kilgarriff, Adam & Kovář, Vojtěch (2011). Japanese Word Sketches: Advantages and Problems. Acta Linguistica Asiatica, 1 (2).

URL

KOTONOHA: http://www.ninjal.ac.jp/kotonoha/
Sketch Engine: http://www.sketchengine.co.uk/
SpiderLing: http://nlp.fi.muni.cz/trac/spiderling
Comainu: https://maro.ninjal.ac.jp/Comainu/related_paper/
UniDic: http://download.unidic.org/
MeCab: Yet Another Part-of-Speech and Morphological Analyzer: http://mecab.googlecode.com
