29/11/06 NCLT Seminar series: Automatic conversion of a Japanese text corpus into f-structure. OYA, Masanori (大矢 政徳), National Center for Language Technology, School of Computing, Dublin City University


29/11/06 NCLT Seminar series

Automatic conversion of a Japanese text corpus into f-structure

OYA, Masanori ( 大矢 政徳 )

National Center for Language Technology

School of Computing, Dublin City University


1. Overview

– Japanese grammar
– Kyoto Text Corpus (KTC)
– Converting KTC into dependency trees
– Converting KTC into f-structure
– Problems
– Evaluation
– Summary


2. Japanese grammar

Syntax
– Writing system
– SOV as the basic word order
– Use of particles for grammatical functions
– Tense, aspect and mood are specified by verbal or adjectival morphology
– “bunsetsu” (sentential units)
– Ellipses of core arguments
– Topicalization
– Two types of relative clauses
– Case particles derived from verbs
– Adverbial nouns
– Coordination


2. Japanese grammar

Writing system: three different types of scripts
– Chinese characters (1,945 and more)
    Nouns (can also be written in Hiragana or Katakana)
    Stems of verbs and adjectives
– Hiragana (104)
    Inflections of verbs and adjectives
    Particles
    Words that have no Chinese counterparts
– Katakana (124)
    Nouns borrowed from foreign languages
    Technical and scientific names
    Onomatopoeia
– No spaces are given between words


2. Japanese grammar

The chart of Hiragana

(vowels): あ a, い i, う u, え e, お o
k: か ka, き ki, く ku, け ke, こ ko, きゃ kya, きゅ kyu, きょ kyo
g: が ga, ぎ gi, ぐ gu, げ ge, ご go, ぎゃ gya, ぎゅ gyu, ぎょ gyo
s: さ sa, し shi, す su, せ se, そ so, しゃ sha, しゅ shu, しょ sho
z: ざ za, じ ji, ず zu, ぜ ze, ぞ zo, じゃ ja, じゅ ju, じょ jo
t: た ta, ち chi, つ tsu, て te, と to, ちゃ cha, ちゅ chu, ちょ cho
d: だ da, ぢ ji, づ zu, で de, ど do
n: な na, に ni, ぬ nu, ね ne, の no, にゃ nya, にゅ nyu, にょ nyo, ん n’
h: は ha, ひ hi, ふ fu, へ he, ほ ho, ひゃ hya, ひゅ hyu, ひょ hyo
b: ば ba, び bi, ぶ bu, べ be, ぼ bo, びゃ bya, びゅ byu, びょ byo
p: ぱ pa, ぴ pi, ぷ pu, ぺ pe, ぽ po, ぴゃ pya, ぴゅ pyu, ぴょ pyo
m: ま ma, み mi, む mu, め me, も mo, みゃ mya, みゅ myu, みょ myo
y: や ya, ゆ yu, よ yo
r: ら ra, り ri, る ru, れ re, ろ ro, りゃ rya, りゅ ryu, りょ ryo
w: わ wa, を o

The chart of Katakana

(vowels): ア a, イ i, ウ u, エ e, オ o
k: カ ka, キ ki, ク ku, ケ ke, コ ko, キャ kya, キュ kyu, キョ kyo
g: ガ ga, ギ gi, グ gu, ゲ ge, ゴ go, ギャ gya, ギュ gyu, ギョ gyo
s: サ sa, シ si, ス su, セ se, ソ so
sh: シャ sha, シ shi, シュ shu, シェ she, ショ sho
z: ザ za, ジ zi, ズ zu, ゼ ze, ゾ zo
j: ジャ ja, ジ ji, ジュ ju, ジェ je, ジョ jo
t: タ ta, ティ ti, ツ tsu, テ te, ト to
ch: チャ cha, チ chi, チュ chu, チェ che, チョ cho
d: ダ da, ディ di, デュ du, デ de, ド do
n: ナ na, ニ ni, ヌ nu, ネ ne, ノ no, ニャ nya, ニュ nyu, ニョ nyo, ン n
h: ハ ha, ヒ hi, フ hu, ヘ he, ホ ho, ヒャ hya, ヒュ hyu, ヒョ hyo
f: ファ fa, フィ fi, フ fu, フェ fe, フォ fo
b: バ ba, ビ bi, ブ bu, ベ be, ボ bo, ビャ bya, ビュ byu, ビョ byo
v: ヴァ va, ヴィ vi, ヴ vu, ヴェ ve, ヴォ vo
p: パ pa, ピ pi, プ pu, ペ pe, ポ po, ピャ pya, ピュ pyu, ピョ pyo
m: マ ma, ミ mi, ム mu, メ me, モ mo, ミャ mya, ミュ myu, ミョ myo
y: ヤ ya, ユ yu, ヨ yo
r: ラ ra, リ ri, ル ru, レ re, ロ ro, リャ rya, リュ ryu, リョ ryo
w: ワ wa, ウィ wi, ウ wu, ウェ we, ウォ wo, ヲ o


2. Japanese grammar

SOV as the basic word order; scrambling is prevalent
Use of particles for grammatical functions

Example:
太郎はダブリンの大学に行った。
Taro-wa dabulin-no daigaku-ni it-ta
Taro-TOP Dublin-in college-to go-PST
“Taro went to a college in Dublin.”

– “-wa”, “-ga”, “-wo” and “-ni” are used for core grammatical functions
– Other particles are used for adjuncts (postpositional phrases or complementizers) (Tsujimura 2006)
– The particle “-ni” is ambiguous; it can be used as the OBL case marker or as a postposition for temporal or locative adverbials (a semantic distinction is possible).


2. Japanese grammar

Tense, aspect and mood are specified by verbal or adjectival morphology

Example:
太郎はダブリンの大学に行っている。
Taro-wa dabulin-no daigaku-ni it-teiru
Taro-TOP Dublin-in college-to go-PROG.PRES
“Taro is going to a college in Dublin.”

太郎はダブリンの大学に行ったのだろうか。
Taro-wa dabulin-no daigaku-ni it-ta-nodarou-ka
Taro-TOP Dublin-in college-to go-PST-AUX-INT
“(I wonder) whether Taro went to a college in Dublin.” etc.


2. Japanese grammar

“Bunsetsu”, or syntactic units
– One bunsetsu = a content word + a particle or inflection ≈ Chinese characters + hiragana or katakana

Example:
太郎はダブリンの大学に行っている。
Taro-wa dabulin-no daigaku-ni it-teiru
Unit 0: Taro-wa; Unit 1: dabulin-no; Unit 2: daigaku-ni; Unit 3: it-teiru

• Spaces represent bunsetsu boundaries.
• Hyphens represent morphological boundaries within a bunsetsu.


2. Japanese grammar

Ellipses of core arguments
– Pro-drop: contextually evident units are absent from the sentence
– Gender, person and number of the subject are not specified by verbal or adjectival morphology

Example:
ダブリンの大学に行った。
dabulin-no daigaku-ni it-ta
Dublin-in college-OBL go-PST
“I/We/You/He/She/It/They/(Someone in the context) went to a college in Dublin.”

– Personal pronouns are also available, but they are not equivalent to personal pronouns in English (e.g., variations of 1st singular personal pronouns: ‘ore’, ‘atashi’, ‘boku’, ‘watashi’, ‘watakushi’, etc.; variations of 2nd singular personal pronouns: ‘kimi’, ‘anata’, ‘anta’, ‘omae’, etc.)


2. Japanese grammar

Topicalization
– Topicalized units have the particle “wa”
– Non-topicalized units are the focus of the sentence

Example:
ダブリンの大学には太郎が行った。
dabulin-no daigaku-ni-wa Taro-ga it-ta
Dublin-in college-OBL-TOP Taro-NOM go-PST
“To a college in Dublin, Taro went.” or “It is Taro who went to a college in Dublin.”


2. Japanese grammar

Relative clauses
– If a clause ends with a verb in a sentence-ending form and it comes before a noun, then the clause is a relative clause: Japanese has no relative pronouns.

Example:
私が行った大学
watashi-ga itta daigaku
1sg-NOM go-PST college
“the college I went to.”


2. Japanese grammar

Two types of “relative clauses”: “inner” relative clauses (true relative clauses) and “outer” relative clauses (appositions) (Teramura 1991)

Example:
– 私が答えを見つけた証拠
  watashi-ga kotae-wo mitsuketa shoko
  1sg-NOM answer-ACC find-PST evidence
  “The evidence that I found out the answer” (“outer”)

– 私が見つけた証拠
  watashi-ga mitsuketa shoko
  1sg-NOM find-PST evidence
  “The evidence that I found out ∅” (“inner”: ∅ = evidence)
  “The evidence that I found out PRO” (“outer”: PRO ≠ evidence; something evident in the context)

– If one of the core arguments is in ellipsis, then it is difficult to distinguish a true relative clause from an apposition.


2. Japanese grammar

Particles derived from verbs:
– Some case particles are derived from verbs; case particles of this type have a fixed meaning (Masuoka and Takubo 1992)

Example:
ついて tsuite “about” (same form as the adverbial form of the verb つく “attach”)

私は計算言語学について話した。
Watashi-wa keisangengogaku-ni-tsuite hanashi-ta
I-TOP computational linguistics-OBL-about talk-PST
“I talked about computational linguistics.”


2. Japanese grammar

Adverbial nouns
– They function as the head of an adverbial phrase with a complement (Masuoka and Takubo 1992)

Example:
ダブリンの大学に通っている時、津波が日本を襲った。
Dabulin-no daigaku-ni kayotteiru toki, tsunami-ga nihon-wo osot-ta.
Dublin-in college-OBL go-PROG time, tsunami-NOM Japan-ACC strike-PST
“When I was studying at a college in Dublin, a tsunami struck Japan.”

– It is also difficult to distinguish the complements in these cases from relative clauses; neither syntactic nor morphological clues are available.


2. Japanese grammar

Coordination
– The first coordinated bunsetsu has the particle “to” (but not necessarily), and it is dependent on the next coordinated bunsetsu.

Example:
ダブリンの大学に通っている時、地震と津波が日本を襲った。
Dabulin-no daigaku-ni kayotteiru toki, jishin-to tsunami-ga nihon-wo osot-ta.
Dublin-in college-OBL go-PROG time, earthquake-AND tsunami-NOM Japan-ACC strike-PST
“When I was studying at a college in Dublin, an earthquake and a tsunami struck Japan.”

– Only the last coordinated bunsetsu has the particle which specifies its grammatical function.


3. Kyoto Text Corpus (KTC)

– An automatically parsed text corpus of a newspaper (Mainichi Shimbun)
– All articles from the 1st to the 17th of January, 1995 (19,687 sentences, 518,687 tokens) and the editorials from January to December, 1995 (18,708 sentences, 453,995 tokens)
– Developed by Sadao Kurohashi and Makoto Nagao at the University of Kyoto, using JUMAN and KNP


3. Kyoto Text Corpus (KTC)

– All the texts are automatically annotated with morphological tags by JUMAN (Kurohashi and Nagao 1994) (“juman” means 100,000)
– The output of JUMAN is parsed by KNP (Kurohashi and Nagao 1994) based on the dependency among “bunsetsu”, and corrected manually
– No syntactic CFG category tags are annotated
– Valency of verbal predicates is not annotated


3. Kyoto Text Corpus (KTC)

JUMAN: morphological analyzer for Japanese based on bigram information

Least-cost path method (Kurohashi and Kawahara 1992)
– Costs are assigned to each morpheme and to each pair of two morphemes in a sentence
    The lower the morpheme frequency, or the lower the frequency of pairs of morphemes, the higher the cost
– If a sentence has several possible analyses, JUMAN sums up the costs and selects the least-cost analysis as the most plausible analysis for the sentence

Accuracy: around 99.0% (comparison of automatic analysis and manually corrected analysis of 10,000 sentences)
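
The least-cost path idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the lattice, the word costs and the pair costs below are invented for the example and are not JUMAN's actual dictionary or cost values, and `least_cost_path` is a hypothetical helper name.

```python
from typing import Dict, List, Tuple

# A candidate morpheme: (start offset, end offset, surface form, word cost).
# All costs here are invented for illustration; they are NOT JUMAN's costs.
Morpheme = Tuple[int, int, str, int]

def least_cost_path(text_len: int,
                    morphemes: List[Morpheme],
                    pair_cost: Dict[Tuple[str, str], int],
                    default_pair_cost: int = 5) -> Tuple[int, List[str]]:
    """Return (total cost, segmentation) of the cheapest path through the lattice."""
    # best[(end, surface)] = (cost of the cheapest path ending in this morpheme, path)
    best: Dict[Tuple[int, str], Tuple[int, List[str]]] = {(0, "<BOS>"): (0, [])}
    # Sorting by start offset guarantees every predecessor is processed first.
    for start, end, surface, cost in sorted(morphemes):
        candidates = []
        for (prev_end, prev_surface), (prev_cost, path) in best.items():
            if prev_end != start:
                continue
            link = pair_cost.get((prev_surface, surface), default_pair_cost)
            candidates.append((prev_cost + link + cost, path + [surface]))
        if candidates:
            key = (end, surface)
            cheapest = min(candidates)
            if key not in best or cheapest < best[key]:
                best[key] = cheapest
    finals = [value for (end, _), value in best.items() if end == text_len]
    return min(finals)

# Toy lattice for 太郎は大学に行った (9 characters), with competing analyses.
LATTICE: List[Morpheme] = [
    (0, 2, "太郎", 5), (0, 1, "太", 8), (1, 2, "郎", 9),
    (2, 3, "は", 1),
    (3, 5, "大学", 3), (3, 4, "大", 6), (4, 5, "学", 7),
    (5, 6, "に", 1),
    (6, 9, "行った", 4),
]
# Frequent pairs get a low connection cost; unseen pairs fall back to the default.
PAIR_COST = {("太郎", "は"): 1, ("大学", "に"): 1}

cost, segmentation = least_cost_path(9, LATTICE, PAIR_COST)
print(cost, segmentation)
```

With these invented costs the whole-word analyses 太郎 and 大学 win over the character-by-character alternatives because both their word costs and their connection costs are lower.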


3. Kyoto Text Corpus (KTC)

The example of the output of JUMAN:

太郎は大学に行った。 “Taro went to a college.”
Taro wa daigaku ni itta.
Taro TOP college OBL went

#S-ID: 950101001-001
太郎 tarou * Noun Name * *
は wa * Particle AdverbialParticle * *
大学 daigaku * Noun NormalNoun * *
に ni * Particle CaseParticle * *
行った itta iku Verb * ConsonantVerb Past
。 * mark period * *
EOS


3. Kyoto Text Corpus (KTC)

KNP: dependency structure analyzer based on “bunsetsu”

KNP converts the output of JUMAN into strings of bunsetsu.

Accuracy: 90% (comparison of automatic analysis and manually corrected analysis of 10,000 sentences) (Kurohashi and Nagao 1998)


3. Kyoto Text Corpus (KTC)

太郎は大学に行った。 “Taro went to a college.”
Taro wa daigaku ni itta.
Taro TOP college OBL went

#S-ID: 950101001-001
* 0 2D
太郎 tarou * Noun Name * *
は wa * Particle AdverbialParticle * *
* 1 2D
大学 daigaku * Noun NormalNoun * *
に ni * Particle CaseParticle * *
* 2 -1D
行った itta iku Verb * ConsonantVerb Past
。 * mark period * *
EOS

Unit 0: 太郎は; Unit 1: 大学に; Unit 2: 行った。


3. Kyoto Text Corpus (KTC)

大学に太郎は行った。 “Taro went to a college.”
daigaku ni Taro wa itta.
college OBL Taro TOP went

#S-ID: 950101001-001
* 0 2D
大学 daigaku * Noun NormalNoun * *
に ni * Particle CaseParticle * *
* 1 2D
太郎 tarou * Noun Name * *
は wa * Particle AdverbialParticle * *
* 2 -1D
行った itta iku Verb * ConsonantVerb Past
。 * mark period * *
EOS

Unit 0: 大学に; Unit 1: 太郎は; Unit 2: 行った。
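
A minimal sketch of how the annotation shown in these examples could be read into bunsetsu units follows. It assumes only the layout visible on the slides (a `#S-ID` line, `* index headD` lines opening each unit, one morpheme per line, and `EOS`); the class and function names are hypothetical, and the real corpus may use further fields or dependency-type letters not shown here.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple

@dataclass
class Bunsetsu:
    index: int
    head: int  # index of the bunsetsu this one depends on; -1 marks the root
    morphemes: List[List[str]] = field(default_factory=list)

    @property
    def surface(self) -> str:
        # The first field of every morpheme line is its surface form.
        return "".join(m[0] for m in self.morphemes)

def parse_sentence(text: str) -> List[Bunsetsu]:
    units: List[Bunsetsu] = []
    for raw in text.strip().splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or line == "EOS":
            continue  # sentence ID and end-of-sentence marker carry no units
        if line.startswith("*"):
            _, index, dep = line.split()
            # dep looks like "2D" or "-1D"; the trailing letter is the dependency type
            units.append(Bunsetsu(int(index), int(dep[:-1])))
        else:
            units[-1].morphemes.append(line.split())
    return units

def dependencies(units: List[Bunsetsu]) -> Set[Tuple[str, str]]:
    """(dependent surface, head surface) pairs; assumes unit index == list position."""
    return {(u.surface, units[u.head].surface) for u in units if u.head != -1}

CANONICAL = """
#S-ID: 950101001-001
* 0 2D
太郎 tarou * Noun Name * *
は wa * Particle AdverbialParticle * *
* 1 2D
大学 daigaku * Noun NormalNoun * *
に ni * Particle CaseParticle * *
* 2 -1D
行った itta iku Verb * ConsonantVerb Past
。 * mark period * *
EOS
"""

SCRAMBLED = """
#S-ID: 950101001-001
* 0 2D
大学 daigaku * Noun NormalNoun * *
に ni * Particle CaseParticle * *
* 1 2D
太郎 tarou * Noun Name * *
は wa * Particle AdverbialParticle * *
* 2 -1D
行った itta iku Verb * ConsonantVerb Past
。 * mark period * *
EOS
"""

print(dependencies(parse_sentence(CANONICAL)) == dependencies(parse_sentence(SCRAMBLED)))
```

The comparison at the end illustrates the point of the scrambled example: the word order changes, but the set of dependencies between bunsetsu stays the same.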


4. Converting KTC into dependency trees

Motivation:
– LFG-based automatic grammar induction for Japanese; GramLab: Treebank based Acquisition of Multilingual Resources (Cahill et al. 2002, etc.)

Related work:
– Japanese XLE at Fuji Xerox (Masuichi et al. 2006, etc.)
– PCFG-based automatic grammar induction from Japanese corpus (Tokunaga et al. 2005, etc.)
– Case frame induction from Japanese corpus (Kurohashi et al. 2006, etc.)


4. Converting KTC into dependency trees

Procedure: Text corpus → Dependency trees → F-structures

At least one syntactic category is annotated on each “bunsetsu” in a sentence. All “bunsetsu” in a sentence are integrated into a dependency tree of the sentence.


4. Converting KTC into dependency trees

太郎は大学に行った。 “Taro went to college.”
Taro wa daigaku ni itta.
Taro TOP college OBL went

#S-ID: 950101001-001
* 0 2D
太郎 tarou * Noun Name * *
は wa * Particle AdverbialParticle * *
* 1 2D
大学 daigaku * Noun NormalNoun * *
に ni * Particle CaseParticle * *
* 2 -1D
行った itta iku Verb * ConsonantVerb Past
。 * mark period * *
EOS

Unit 0: 太郎は, TopP (Topic of Unit 2)
Unit 1: 大学に, NP (OBL of Unit 2)
Unit 2: 行った。, V


4. Converting KTC into dependency trees

[Figure: dependency tree of the example sentence, with the nodes Taro, wa, daigaku, ni, itta and 。]
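
As a rough sketch of how the units can be integrated into a dependency tree, the snippet below links each unit to its head using the indices from the example annotation. The function names and the tuple layout are illustrative assumptions, not the author's implementation.

```python
from typing import Dict, List, Tuple

Unit = Tuple[int, str, int]  # (index, surface, head index); head -1 marks the root

def build_tree(units: List[Unit]) -> Tuple[int, Dict[int, List[int]]]:
    """Return the root index and a map from each unit to its dependents."""
    children: Dict[int, List[int]] = {index: [] for index, _, _ in units}
    root = -1
    for index, _, head in units:
        if head == -1:
            root = index
        else:
            children[head].append(index)
    return root, children

def render(units: List[Unit], node: int, children: Dict[int, List[int]],
           depth: int = 0) -> List[str]:
    """Indented text rendering of the tree below `node`."""
    surface = {index: text for index, text, _ in units}
    lines = ["  " * depth + surface[node]]
    for child in children[node]:
        lines.extend(render(units, child, children, depth + 1))
    return lines

UNITS: List[Unit] = [(0, "太郎は", 2), (1, "大学に", 2), (2, "行った。", -1)]
root, children = build_tree(UNITS)
print("\n".join(render(UNITS, root, children)))
```

The verb unit comes out as the root, with the topic and the oblique unit as its two dependents.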


5. Converting KTC into f-structures

Motivation:
– Are syntactic categories necessary for Japanese?
    Word order is (relatively) free.
    The type (or absence) of particles in each unit specifies its grammatical function (e.g., if a noun has a particle “wo”, then it is an object).
    Verbal morphology specifies the grammatical function of each clause (but not always unambiguously).


5. Converting KTC into f-structures

Generating f-structure equations directly from the corpus

Text corpus → Dependency trees → F-structures


5. Converting KTC into f-structures

Generating f-structure equations directly from the corpus

Text corpus → F-structures

F-structure equations are directly generated from each unit. All the units are unified into the f-structure of the sentence according to the dependency.


5. Converting KTC into f-structures

太郎は大学に行った。 “Taro went to college.”
Taro wa daigaku ni itta.
Taro TOP college OBL went

#S-ID: 950101001-001
* 0 2D
太郎 tarou * Noun Name * *
は wa * Particle AdverbialParticle * *
* 1 2D
大学 daigaku * Noun NormalNoun * *
に ni * Particle CaseParticle * *
* 2 -1D
行った itta iku Verb * ConsonantVerb Past
。 * mark period * *
EOS

f0: 太郎は (Topic of f2)
f1: 大学に (OBL of f2)
f2: 行った。


5. Converting KTC into f-structures

太郎は大学に行った。 “Taro went to college.”
Taro wa daigaku ni itta.
Taro TOP college OBL went

#S-ID: 950101001-001
* 0 2D
太郎 tarou * Noun Name * *
は wa * Particle AdverbialParticle * *
* 1 2D
大学 daigaku * Noun NormalNoun * *
に ni * Particle CaseParticle * *
* 2 -1D
行った itta iku Verb * ConsonantVerb Past
。 * mark period * *
EOS

Functional equations from the corpus:
F2:pred = '行く',
F2:tns = 'pst',
F2:stmt = 'decl',
F2:style = 'plain',
F0:pred = '太郎',
F0:prtav = 'は',
F0 elm F2:topic,
F1:pred = '大学',
F1:case = 'に',
F2:obl = F1.
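
To make the unit-by-unit generation concrete, here is a small sketch that emits equations in the notation shown on the slide. The rule table is a simplified assumption for illustration (the slides only confirm the は/topic and に/obl pairings and mention “wo” for objects); the `stmt` and `style` features are left out, and `unit_equations` is a hypothetical name.

```python
from typing import List

# Hypothetical, simplified rule table (not the author's full rule set):
# particle -> (feature that records the particle, grammatical function, set-valued?)
PARTICLE_RULES = {
    "は": ("prtav", "topic", True),
    "が": ("case", "subj", False),
    "を": ("case", "obj", False),
    "に": ("case", "obl", False),
}

def unit_equations(index: int, head: int, lemma: str,
                   particle: str = "", tense: str = "") -> List[str]:
    """Equations contributed by one bunsetsu, in the notation used on the slide."""
    equations = [f"F{index}:pred = '{lemma}'"]
    if tense:
        equations.append(f"F{index}:tns = '{tense}'")
    if particle in PARTICLE_RULES:
        feature, gf, is_set = PARTICLE_RULES[particle]
        equations.append(f"F{index}:{feature} = '{particle}'")
        if is_set:
            # set-valued functions such as topic take the unit as an element
            equations.append(f"F{index} elm F{head}:{gf}")
        else:
            equations.append(f"F{head}:{gf} = F{index}")
    return equations

# (index, head, lemma, particle, tense) for 太郎は / 大学に / 行った
UNITS = [
    (0, 2, "太郎", "は", ""),
    (1, 2, "大学", "に", ""),
    (2, -1, "行く", "", "pst"),
]
EQUATIONS = [eq for unit in UNITS for eq in unit_equations(*unit)]
print(",\n".join(EQUATIONS) + ".")
```

Each unit is handled without looking at its neighbours, which is exactly why the context-dependent cases discussed in the Problems section need extra treatment.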


5. Converting KTC into f-structures

太郎は大学に行った。 “Taro went to college.”
Taro wa daigaku ni itta.
Taro TOP college OBL went

F2:pred = '行く',
F2:tns = 'pst',
F2:stmt = 'decl',
F2:style = 'plain',
F0:pred = '太郎',
F0:prtav = 'は',
F0 elm F2:topic,
F1:pred = '大学',
F1:case = 'に',
F2:obl = F1.

F-structure from the functional equations:

pred  : '行く'
tns   : pst
stmt  : decl
style : plain
topic : 1 : pred  : '太郎'
            prtav : 'は'
obl   : pred : '大学'
        case : 'に'
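
The step from equations to an f-structure can be sketched as a small solver over the three equation shapes that appear in the example. This is a simplification under stated assumptions: it handles only these three shapes, represents f-structures as Python dictionaries and set-valued functions as lists, and checks clashes only for atomic values; `solve` is a hypothetical name.

```python
import re
from typing import Dict, List

def solve(equations: List[str], root: str) -> dict:
    """Build nested dictionaries from equations written in the slide's notation."""
    nodes: Dict[str, dict] = {}

    def node(name: str) -> dict:
        return nodes.setdefault(name, {})

    for eq in equations:
        atomic = re.fullmatch(r"(F\d+):(\w+) = '(.*)'", eq)   # F2:tns = 'pst'
        embed = re.fullmatch(r"(F\d+):(\w+) = (F\d+)", eq)    # F2:obl = F1
        member = re.fullmatch(r"(F\d+) elm (F\d+):(\w+)", eq)  # F0 elm F2:topic
        if atomic:
            name, attr, value = atomic.groups()
            # unification fails if the attribute already holds a different value
            if node(name).setdefault(attr, value) != value:
                raise ValueError(f"clash on {name}:{attr}")
        elif embed:
            name, attr, other = embed.groups()
            node(name)[attr] = node(other)
        elif member:
            element, name, attr = member.groups()
            node(name).setdefault(attr, []).append(node(element))
        else:
            raise ValueError(f"unrecognised equation: {eq}")
    return nodes[root]

EQUATIONS = [
    "F2:pred = '行く'", "F2:tns = 'pst'", "F2:stmt = 'decl'", "F2:style = 'plain'",
    "F0:pred = '太郎'", "F0:prtav = 'は'", "F0 elm F2:topic",
    "F1:pred = '大学'", "F1:case = 'に'", "F2:obl = F1",
]
F_STRUCTURE = solve(EQUATIONS, "F2")
print(F_STRUCTURE)
```

Because embedded f-structures are shared dictionary objects, the order in which the equations arrive does not matter for this example.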


5. Problems

This “Generating f-structure equations directly from the corpus” method does not always work well.
– Core argument ellipses
– Two types of relative clauses
– Particles derived from verbs
– Adverbial nouns
– Coordination

The context among units must be taken into consideration to make the generation more accurate.


5. Problems

Ellipses of core arguments
– Contextually evident units are absent from the sentence
– Gender, person and number of the subject are not specified by verbal or adjectival morphology

Example:
ダブリンの大学に行った。
dabulin-no daigaku-ni it-ta
Dublin-in college-OBL go-PST
“He/She/They went to a college in Dublin.”


5. Problems

Core argument ellipses
– KTC does not annotate missing elements.
– No equations for missing elements can be generated from KTC.
– For an f-structure with ellipses, “PRO” must be added to make the f-structure complete.


5. Problems

Core argument ellipses
– If a predicate has no subject in the clause, then an equation for the subject is added.
– If a transitive verb has no object, then an equation for the object must be added …
– However, KTC does not annotate the valency of verbal predicates, hence it is impossible to tell which verbs are transitive on the basis of the annotated information alone.
– Case-frame information is required to detect missing objects of transitive verbs.
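
The subject rule can be sketched as a post-processing pass over the generated equations. The notation `F2:subj:pred = 'PRO'` is an assumption made for this illustration (the slides do not show how the PRO equation is written), the equation list is a reduced version of the pro-drop example, and `add_pro_subject` is a hypothetical name.

```python
from typing import List

def add_pro_subject(equations: List[str], predicate: str) -> List[str]:
    """Append a PRO subject for `predicate` (e.g. "F2") when no subj equation exists.

    The output notation for PRO is an assumption made for this sketch.
    """
    has_subject = any(eq.startswith(f"{predicate}:subj =") for eq in equations)
    if has_subject:
        return list(equations)
    return list(equations) + [f"{predicate}:subj:pred = 'PRO'"]

# Reduced equations for ダブリンの大学に行った。 (no overt subject)
PRO_DROP = [
    "F2:pred = '行く'",
    "F2:tns = 'pst'",
    "F1:pred = '大学'",
    "F1:case = 'に'",
    "F2:obl = F1",
]
print(add_pro_subject(PRO_DROP, "F2")[-1])
```

No comparable pass is possible for objects with this information alone: as noted above, deciding whether a verb is transitive needs valency or case-frame data that KTC does not provide.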


5. Problems

Two types of “relative clauses”: “inner” relative clauses (true relative clauses) and “outer” relative clauses (appositions) (Teramura 1991)

Example:
私が答えを見つけた証拠
watashi-ga kotae-wo mitsuketa shoko
1sg-NOM answer-ACC find-PST evidence
“The evidence that I found out the answer” (“outer”)

私が見つけた証拠
watashi-ga mitsuketa shoko
1sg-NOM find-PST evidence
“The evidence that I found out ∅” (“inner”: ∅ = evidence)
“The evidence that I found out PRO” (“outer”: PRO ≠ evidence; something evident in the context)

If one of the core arguments is in ellipsis, then it is difficult to distinguish an “outer” relative clause from an “inner” relative clause.


5. Problems

Two types of relative clause
– Features in one bunsetsu are not enough to distinguish them.
– A probabilistic model for analysing them (Abekawa and Okumura 2005) employs the cooccurrence probability of head nouns and verbal predicates in “outer” relative clauses.
– This method is expected to be applicable to the present method (in future).


5. Problems

Case particles derived from verbs
– Case particles of this type are analyzed by KNP as verbs, not as case particles.
– Bunsetsus with them are analyzed as sentential adjuncts, not as postpositional adjuncts or as complements (in the case of “という”).
– The equations must be revised properly.






5. Problems

Adverbial nouns
– Features in one bunsetsu are not enough to distinguish between them.
– If a clause is dependent on an adverbial noun and it is analyzed as a relative clause, then the equation of the clause must be replaced by that of a complement.
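
The replacement rule can be sketched as a rewrite over the equation list. Two things here are assumptions made only for illustration: that a relative clause is encoded as a set member (`Fi elm Fj:rel`) and a complement as `Fj:comp = Fi`, and that the adverbial-noun list contains just 時 (toki) from the example; a usable list would need to be longer.

```python
import re
from typing import Dict, List

ADVERBIAL_NOUNS = {"時"}  # "toki" from the example; a real list would be longer

def fix_adverbial_noun_clauses(equations: List[str]) -> List[str]:
    """Rewrite relative-clause equations whose head noun is an adverbial noun.

    Assumes relative clauses are written "Fi elm Fj:rel" and complements
    "Fj:comp = Fi"; both encodings are guesses made for this sketch.
    """
    preds: Dict[str, str] = {}
    for eq in equations:
        match = re.fullmatch(r"(F\d+):pred = '(.*)'", eq)
        if match:
            preds[match[1]] = match[2]

    fixed = []
    for eq in equations:
        match = re.fullmatch(r"(F\d+) elm (F\d+):rel", eq)
        if match and preds.get(match[2]) in ADVERBIAL_NOUNS:
            fixed.append(f"{match[2]}:comp = {match[1]}")
        else:
            fixed.append(eq)
    return fixed

EXAMPLE = ["F2:pred = '通う'", "F3:pred = '時'", "F2 elm F3:rel"]
print(fix_adverbial_noun_clauses(EXAMPLE))
```

Unlike the unit-by-unit generation step, this pass has to look at two units at once (the clause and the noun it depends on), which is the kind of context the slides call for.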




5. Problems

Coordination
– Only the last coordinated bunsetsu has the particle which specifies its grammatical function; other coordinated bunsetsus cannot be analyzed to have appropriate grammatical functions.
– The last coordinated bunsetsu does not have any feature within it as a coordinate; the bunsetsu context must be taken into consideration in order to convert it properly into f-structure equations.
– Dependency among coordinated bunsetsus must also be reanalyzed.


5. Problems

Coordination
– Dependency among coordinated bunsetsus must be reanalyzed.

Example:
jishin-to tsunami-ga
[Jishin-to tsunami]-ga

The coordinated bunsetsus are the elements of a new unit, which constitutes a new bunsetsu with the case particle (“ga” in this example).
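
The reanalysis can be sketched as grouping a run of “to”-marked bunsetsus with the bunsetsu they depend on. The dictionary layout and the function name are hypothetical, and the heuristic is deliberately naive: it treats every “to” attached to the immediately following unit as coordination, which would misfire on other uses of the particle and misses coordination without “to”.

```python
from typing import Dict, List

def group_coordination(units: List[Dict]) -> List[Dict]:
    """Merge "to"-marked conjuncts into the unit that carries the case particle."""
    grouped: List[Dict] = []
    pending: List[str] = []
    for unit in units:
        if unit["particle"] == "to" and unit["head"] == unit["index"] + 1:
            pending.append(unit["word"])  # conjunct waiting for the last member
            continue
        if pending:
            # The last conjunct supplies the particle and head for the whole group.
            unit = {**unit, "conjuncts": pending + [unit["word"]]}
            del unit["word"]
            pending = []
        grouped.append(unit)
    return grouped

UNITS = [
    {"index": 0, "word": "jishin", "particle": "to", "head": 1},
    {"index": 1, "word": "tsunami", "particle": "ga", "head": 2},
    {"index": 2, "word": "osotta", "particle": "", "head": -1},
]
for unit in group_coordination(UNITS):
    print(unit)
```

After grouping, the new unit holds both conjuncts and the particle “ga”, so a grammatical function can be assigned to the coordination as a whole.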


5. Problems

Among these problems, the following still remain in the method:
– Object ellipses
– Distinguishing two types of relative clauses
– Particles derived from verbs


6. Evaluation of the method

– 200 sentences were randomly selected from KTC.
– F-structures of these sentences are automatically generated by the method.
– These f-structures are manually corrected, and used as the gold standard.
– The automatically generated f-structures of these 200 sentences are compared with the gold standard.


6. Evaluation of the method

Pred-only GFs   PRECISION(%)   RECALL(%)   F-SCORE(%)
adj              80.60          96.43       87.80
cj              100.00          96.80       98.37
comp             74.19          58.97       65.71
obj              98.73          82.54       89.91
obl              85.62          91.91       88.65
padj             98.45          91.81       95.01
rel              70.86          96.40       81.68
sadj             82.26          65.38       72.86
subj             93.17          92.29       92.73
topic            98.68          95.51       97.07
(average)        88.26          86.80       86.98
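
For readers who want to check the figures, the snippet below recomputes the F-score as the harmonic mean of precision and recall, and the bottom row as the unweighted mean of each column. The numbers are copied from the table; the function name is illustrative. Recomputing a row from the rounded percentages can differ from the table in the last digit (for example `adj`), presumably because the original scores were computed from raw counts.

```python
from statistics import mean

def f_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both in percent)."""
    return 2 * precision * recall / (precision + recall)

# (precision, recall, f-score) per grammatical function, copied from the table
SCORES = {
    "adj": (80.60, 96.43, 87.80),
    "cj": (100.00, 96.80, 98.37),
    "comp": (74.19, 58.97, 65.71),
    "obj": (98.73, 82.54, 89.91),
    "obl": (85.62, 91.91, 88.65),
    "padj": (98.45, 91.81, 95.01),
    "rel": (70.86, 96.40, 81.68),
    "sadj": (82.26, 65.38, 72.86),
    "subj": (93.17, 92.29, 92.73),
    "topic": (98.68, 95.51, 97.07),
}

print(round(f_score(*SCORES["comp"][:2]), 2))
# Column means across the ten grammatical functions (the table's bottom row)
AVERAGES = tuple(round(mean(row[i] for row in SCORES.values()), 2) for i in range(3))
print(AVERAGES)
```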


7. Future work …

– The method of generating f-structure equations directly from the dependency-based corpus of Japanese needs more improvement.
– The result can be applied to improve the parsing result of KNP.
– Using Japanese f-structures in MT


References

Abekawa, T. and M. Okumura. 2005. Corpus-Based Analysis of Japanese Relative Clause Constructions. IJCNLP 2005, pp. 46-57.

Cahill, A., M. McCarthy, J. van Genabith and A. Way. 2002. Automatic Annotation of the Penn-Treebank with LFG F-Structure Information. LREC 2002 workshop on Linguistic Knowledge Acquisition and Representation, pp. 8-15.

Kurohashi, S. and D. Kawahara. 1992. JUMAN: user's manual. ms.

Kurohashi, S. and M. Nagao. 1994. A syntactic analysis method of long Japanese sentences based on the detection of conjunctive structures. Computational Linguistics, 20(4), pp. 507-534.

Kurohashi, S. and M. Nagao. 1998. Building a Japanese Parsed Corpus while Improving the Parsing System. Proceedings of the 1st International Conference on Language Resources and Evaluation, pp. 719-724.

Kurohashi, S., D. Kawahara and T. Shibata. 2005. Morphological and syntactic analyses using JUMAN/KNP. ms.

Masuoka, T. and Y. Takubo. 1992. Kiso nihongo bunpo. Kuroshio Publication.

Noguchi, M., H. Ichikawa, T. Hashimoto and T. Takenobu. 2006. A new approach to syntactic annotation. Proceedings of 5th International Conference on Language Resources and Evaluation (LREC2006). pp.6

Noro, T., C. Koike, T. Hashimoto, T. Tokunaga and H. Tanaka. 2005. Evaluation of a Japanese CFG Derived from a Syntactically Annotated Corpus with respect to Dependency Measures. The 5th Workshop on Asian Language Resources. pp.9

Ohkuma, T., H. Masuichi and T. Yoshioka. 2006. Disambiguation of Japanese Focus Particles by using Lexical Functional Grammar. Journal of Natural Language Processing, 13(1):27-52.

Shibatani, M. 1990. The Languages of Japan. Cambridge University Press.

Teramura, H. 1991. Nihongo no shintakusu to imi. Kuroshio Publication.

Tsujimura, N. 2006. An Introduction to Japanese Linguistics (2nd ed.). Blackwell Publications.


Thank you very much!