
Page 1: Japanese Dependency Analysis using Cascaded  Chunking

Japanese Dependency Analysis using Cascaded Chunking

Taku Kudo 工藤 拓

Yuji Matsumoto 松本 裕治

Nara Institute of Science and Technology, JAPAN

Page 2: Japanese Dependency Analysis using Cascaded  Chunking

Motivation

Kudo, Matsumoto 2000 (VLC):

• Presented a state-of-the-art Japanese dependency parser using SVMs (89.09% on the standard dataset)

• Demonstrated the high generalization performance and feature selection abilities of SVMs

Problems:

• Not scalable: 2 weeks of training on 7,958 sentences; hard to train with larger data

• Slow parsing: 2–3 sec./sentence; too slow for actual NL applications

Page 3: Japanese Dependency Analysis using Cascaded  Chunking

Goal

Improve scalability and parsing efficiency without losing accuracy!

How?

Apply the cascaded chunking model to dependency parsing and to the selection of training examples:

• Reduce the number of times the SVMs are consulted during parsing

• Reduce the number of negative examples to be learned

Page 4: Japanese Dependency Analysis using Cascaded  Chunking

Outline

• Japanese dependency analysis

• Two models: the probabilistic model (previous) and the cascaded chunking model (new!)

• Features used for training and classification

• Experiments and results

• Conclusion and future work

Page 5: Japanese Dependency Analysis using Cascaded  Chunking

Japanese Dependency Analysis (1/2)

Analysis of the relationships between phrasal units called bunsetsu (segments), comparable to base phrases in English.

Two constraints:

• Each segment modifies one of the segments to its right (Japanese is a head-final language)

• Dependencies do not cross each other
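Operationally, these two constraints are easy to check. A minimal sketch, assuming `heads` maps each segment's index to the index of its modifiee (the final segment, the sentence head, has no entry):

```python
def is_valid_parse(heads):
    """Check the two constraints of Japanese dependency analysis.

    heads: dict mapping segment index -> modifiee index; the final
    segment has no entry.
    """
    # Constraint 1: every segment modifies a segment to its right.
    if any(h <= i for i, h in heads.items()):
        return False
    # Constraint 2: no crossing; arcs i -> heads[i] and j -> heads[j]
    # cross exactly when i < j < heads[i] < heads[j].
    return not any(i < j < heads[i] < heads[j]
                   for i in heads for j in heads)
```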

Page 6: Japanese Dependency Analysis using Cascaded  Chunking

Japanese Dependency Analysis (2/2)

Raw text:

私は彼女と京都に行きます (I go to Kyoto with her.)

↓ Morphological analysis and bunsetsu identification

私は / 彼女と / 京都に / 行きます
I-top / with her / Kyoto-loc / go

↓ Dependency analysis

私は / 彼女と / 京都に / 行きます

Page 7: Japanese Dependency Analysis using Cascaded  Chunking

Probabilistic Model

Input: 私は 1 / 彼女と 2 / 京都に 3 / 行きます 4
(I-top / with her / Kyoto-loc / go)

1. Build a dependency matrix using ME, DT, or SVMs (how probable it is that one segment modifies another):

Modifier \ Modifiee    2     3     4
1                      0.1   0.2   0.7
2                      -     0.2   0.8
3                      -     -     1.0

2. Search for the optimal dependencies, which maximize the sentence probability, using CYK or chart parsing.

Output: the dependency structure over 私は 1 / 彼女と 2 / 京都に 3 / 行きます 4
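A minimal sketch of the search objective, using the example matrix above: it brute-forces every head-final, non-crossing head assignment and keeps the most probable one. The paper uses CYK/chart parsing instead, so this is for illustration only:

```python
from itertools import product

# prob[i][j]: probability that segment i modifies segment j,
# taken from the example dependency matrix above.
prob = {1: {2: 0.1, 3: 0.2, 4: 0.7},
        2: {3: 0.2, 4: 0.8},
        3: {4: 1.0}}
n = 4

best_score, best_heads = 0.0, None
# Enumerate a head to the right for every segment except the last.
for choice in product(*[range(i + 1, n + 1) for i in range(1, n)]):
    heads = dict(zip(range(1, n), choice))
    # Reject crossing arcs: i -> heads[i] and j -> heads[j] cross
    # exactly when i < j < heads[i] < heads[j].
    if any(i < j < heads[i] < heads[j] for i in heads for j in heads):
        continue
    score = 1.0
    for i, h in heads.items():
        score *= prob[i][h]
    if score > best_score:
        best_score, best_heads = score, heads

print(best_heads, best_score)   # -> {1: 4, 2: 4, 3: 4}, score ~0.56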

Page 8: Japanese Dependency Analysis using Cascaded  Chunking

Problems of Probabilistic Model (1/2)

Selection of training examples: all pairs of two segments are candidates.

• Dependency relation → positive example
• No dependency relation → negative example

This straightforward selection requires a total of n(n−1)/2 training examples per sentence (where n is the number of segments in the sentence); a 30-segment sentence, for example, yields 435 examples.

This makes it difficult to combine the probabilistic model with SVMs, whose training cost is polynomial in the number of examples.

Page 9: Japanese Dependency Analysis using Cascaded  Chunking

Problems of Probabilistic Model (2/2)

O(n³) parsing time is necessary with CYK or chart parsing.

Even if beam search is applied, O(n²) parsing time is always necessary.

The classification cost of SVMs is much higher than that of other ML algorithms such as ME and DT.

Page 10: Japanese Dependency Analysis using Cascaded  Chunking

Cascaded Chunking Model

Based on the cascaded chunking approach to English parsing [Abney 1991].

• Parses a sentence deterministically, deciding only whether the current segment modifies the segment on its immediate right-hand side (see the sketch below)

• Training examples are extracted using this algorithm itself
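The parsing loop can be sketched as follows. The precise condition for when a decided segment is removed is inferred from the worked example on the next two slides (a segment tagged D is kept while its own left neighbor modifies it), so treat this as an approximation of the paper's algorithm rather than a faithful transcription:

```python
def cascaded_chunking_parse(segments, modifies):
    """Parse deterministically by repeated right-neighbor decisions.

    segments: list of bunsetsu in sentence order.
    modifies(segments, live, i): True ("D") if the i-th live segment
        modifies the segment on its immediate right, else False ("O").
    Returns: dict mapping each segment's original index to the
        original index of its modifiee (the last segment has no head).
    """
    heads = {}
    live = list(range(len(segments)))     # indices still to be decided
    while len(live) > 1:
        tags = [modifies(segments, live, i) for i in range(len(live) - 1)]
        tags.append(False)                # the last segment is always "O"
        kept = []
        for i, idx in enumerate(live):
            # Fix the dependency and drop the modifier, but keep a
            # segment whose own left neighbor modifies it: future left
            # neighbors may still modify it after deletions.
            if tags[i] and not (i > 0 and tags[i - 1]):
                heads[idx] = live[i + 1]
            else:
                kept.append(idx)
        if kept == live:                  # guard against a stuck parse
            break
        live = kept
    return heads
```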

Page 11: Japanese Dependency Analysis using Cascaded  Chunking

Example: Training Phase

Annotated sentence:

彼は 1   彼女の 2   温かい 3   真心に 4   感動した。 5
He-top / her / warm / heart / be moved (He was moved by her warm heart.)

Pairs of tag (D or O) and context (features) are stored as training data for the SVMs; each tag is decided by the annotated corpus.

彼は 1   彼女の 2   温かい 3   真心に 4   感動した。 5
tags:  O    O    D    D    O   → 温かい 3 → 真心に 4 decided; 温かい 3 is deleted
(真心に 4 is also tagged D, but it stays while its left neighbor still modifies it)

彼は 1   彼女の 2   真心に 4   感動した。 5
tags:  O    D    D    O        → 彼女の 2 → 真心に 4 decided; 彼女の 2 is deleted

彼は 1   真心に 4   感動した。 5
tags:  O    D    O             → 真心に 4 → 感動した。 5 decided; 真心に 4 is deleted

彼は 1   感動した。 5
tags:  D    O                  → 彼は 1 → 感動した。 5 decided; 彼は 1 is deleted

感動した。 5                    → finish
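Because the training examples are extracted by the algorithm itself, the oracle version of the loop doubles as the extractor. A sketch reusing `cascaded_chunking_parse` from the earlier slide, where `featurize` is a hypothetical feature extractor:

```python
def extract_training_data(segments, gold_heads, featurize):
    """Collect (tag, features) pairs by running the cascaded-chunking
    loop with oracle tags read off the annotated corpus.

    gold_heads: dict of annotated modifier -> modifiee indices.
    featurize(segments, live, i): hypothetical feature extractor for
        the i-th live segment and its right neighbor.
    """
    examples = []

    def oracle(segs, live, i):
        tag = gold_heads.get(live[i]) == live[i + 1]    # D or O
        examples.append((tag, featurize(segs, live, i)))
        return tag

    cascaded_chunking_parse(segments, oracle)  # from the earlier sketch
    return examples
```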

Page 12: Japanese Dependency Analysis using Cascaded  Chunking

Example: Test Phase

Test sentence:

彼は 1   彼女の 2   温かい 3   真心に 4   感動した。 5
He-top / her / warm / heart / be moved (He was moved by her warm heart.)

Each tag is now decided by the SVMs built in the training phase; the parse proceeds exactly as in the training-phase trace:

彼は 1   彼女の 2   温かい 3   真心に 4   感動した。 5   (tags: O O D D O)
彼は 1   彼女の 2   真心に 4   感動した。 5              (tags: O D D O)
彼は 1   真心に 4   感動した。 5                         (tags: O D O)
彼は 1   感動した。 5                                    (tags: D O)
感動した。 5                                             → finish

Page 13: Japanese Dependency Analysis using Cascaded  Chunking

Advantages of the Cascaded Chunking Model

Simple and efficient:

• Parsing: probabilistic O(n³) vs. cascaded chunking O(n²); in practice even lower than O(n²), since most segments modify the segment on their immediate right-hand side

• The number of training examples is much smaller

Independent of the ML algorithm:

• Can be combined with any ML algorithm that works as a binary classifier

• Probabilities of dependency are not necessary

Page 14: Japanese Dependency Analysis using Cascaded  Chunking

Features

彼の 1  友人は 2   この本を 3   持っている 4   女性を 5   探している 6
His / friend-top / this book-acc / have / lady-acc / be looking for
(His friend is looking for a lady who has this book.)

(Figure: one modifier/modifiee pair in the sentence, with dynamic-feature positions B, A, C around it — "modify or not?")

Static features of the modifier/modifiee:

• Head/functional word: surface, POS, POS subcategory, inflection type, inflection form
• Brackets, quotations, punctuation, position

Static features between the segments: distance, case particles, brackets, quotations, punctuation

Dynamic features [Kudo, Matsumoto 2000]:

• A, B: static features of the functional word
• C: static features of the head word
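As a rough illustration of how such a feature set might be assembled for one candidate pair; all attribute and key names here are invented for the sketch, not the paper's templates:

```python
def pair_features(modifier, modifiee, between):
    """Build the static feature dict for one candidate pair (a sketch).

    modifier / modifiee: dicts with "head" and "func" word entries,
        each holding surface, POS, POS subcategory, inflection type/form.
    between: summary of the segments separating the pair.
    """
    feats = {}
    for role, seg in (("mod", modifier), ("head", modifiee)):
        for word in ("head", "func"):
            for attr in ("surface", "pos", "pos_sub",
                         "infl_type", "infl_form"):
                feats[f"{role}:{word}:{attr}"] = seg[word][attr]
        feats[f"{role}:punct"] = seg.get("punct", "none")
    feats["between:distance"] = between["distance"]
    feats["between:case_particles"] = ",".join(between["case_particles"])
    return feats
```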

Page 15: Japanese Dependency Analysis using Cascaded  Chunking

Experimental Setting

Corpus: Kyoto University Corpus 2.0/3.0

Standard data set:
• Training: 7,958 sentences / Test: 1,246 sentences
• Same data as [Uchimoto et al. 98; Kudo, Matsumoto 00]

Large data set:
• 2-fold cross-validation over all 38,383 sentences

Kernel function: 3rd-degree polynomial

Evaluation: dependency accuracy and sentence accuracy
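For illustration only: given the extracted (tag, features) pairs, an analogous D/O classifier with a 3rd-degree polynomial kernel could be trained as below. The authors used their own SVM implementation, not scikit-learn:

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

def train_d_o_classifier(examples):
    """examples: list of (tag, feature_dict) pairs, tag True for D."""
    tags, feats = zip(*examples)
    model = make_pipeline(
        DictVectorizer(),                 # one-hot encode string features
        SVC(kernel="poly", degree=3),     # 3rd-degree polynomial kernel
    )
    model.fit(list(feats), list(tags))
    return model
```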

Page 16: Japanese Dependency Analysis using Cascaded  Chunking

Results

Data set                     Standard                           Large
Model                        Cascaded Chunking   Probabilistic  Cascaded Chunking   Probabilistic
Dependency acc. (%)          89.29               89.09          90.04               N/A
Sentence acc. (%)            47.53               46.17          53.16               N/A
# of training sentences      7,956               7,956          19,191              19,191
# of training examples       110,355             459,105        251,254             1,074,316
Training time (hours)        8                   336            48                  N/A
Parsing time (sec./sent.)    0.5                 2.1            0.7                 N/A

Page 17: Japanese Dependency Analysis using Cascaded  Chunking

Effect of Dynamic Features (1/2)

Page 18: Japanese Dependency Analysis using Cascaded  Chunking

Effect of Dynamic Features (2/2)

Deleted dynamic feature type   Δ Dependency acc.   Δ Sentence acc.
A                              −0.28%              −0.89%
B                              −0.10%              −0.89%
C                              −0.28%              −0.56%
A, B                           −0.33%              −1.21%
A, C                           −0.55%              −0.97%
B, C                           −0.54%              −1.61%
A, B, C                        −0.58%              −2.34%

(Differences are relative to the model with all dynamic features.)

彼の 1  友人は 2   この本を 3   持っている 4   女性を 5   探している 6
His / friend-top / this book-acc / have / lady-acc / be looking for

(Same modifier/modifiee pair and dynamic-feature positions B, A, C as on the Features slide: "modify or not?")

Page 19: Japanese Dependency Analysis using Cascaded  Chunking

Probabilistic vs. Cascaded Chunking (1/2)

The probabilistic model uses all candidate dependency pairs as training data:

彼は 1  この本を 2   持っている 3   女性を 4   探している 5
He-top / this book-acc / have / lady-acc / be looking for
(He is looking for a lady who has this book.)

Positive: この本を 2 → 持っている 3
Negative: この本を 2 → 探している 5 (unnecessary)

Probabilistic models commit to a number of unnecessary training examples.

Page 20: Japanese Dependency Analysis using Cascaded  Chunking

Probabilistic vs. Cascaded Chunking (2/2)

Probabilistic:
• Strategy: maximize the sentence probability
• Merit: can see all candidates of dependency
• Demerit: not efficient; commits to unnecessary training examples

Cascaded chunking:
• Strategy: shift-reduce, deterministic
• Merit: simple, efficient, and scalable; as accurate as the probabilistic model
• Demerit: cannot see all the (posterior) candidates of dependency

Page 21: Japanese Dependency Analysis using Cascaded  Chunking

Conclusion

• A new Japanese dependency parser using a cascaded chunking model

• It outperforms the previous probabilistic model in accuracy, efficiency, and scalability

• Dynamic features contribute significantly to the performance

Page 22: Japanese Dependency Analysis using Cascaded  Chunking

Future Work

Coordinate structure analysis:
• Coordinate structures frequently appear in long Japanese sentences and make analysis hard

Use of posterior context:
• Sentences like the following are hard to parse using the cascaded chunking model alone:

僕の  母の  ダイヤの  指輪
my / mother's / diamond / ring

Page 23: Japanese Dependency Analysis using Cascaded  Chunking

Comparison with Related Work

Model                                        Training corpus (# of sentences)   Acc. (%)
Our model (Cascaded Chunking + SVMs)         Kyoto Univ. (19,191)               90.46
Our model (Cascaded Chunking + SVMs)         Kyoto Univ. (7,956)                89.29
Kudo et al. 00 (Prob. + SVMs)                Kyoto Univ. (7,956)                89.09
Uchimoto et al. 00 (Prob. + ME)              Kyoto Univ. (7,956)                87.93
Kanayama et al. 00 (Prob. + ME + HPSG)       EDR (192,778)                      88.55
Haruno et al. 98 (Prob. + DT + Boosting)     EDR (50,000)                       85.03
Fujio et al. 98 (Prob. + ML)                 EDR (190,000)                      86.67

Page 24: Japanese Dependency Analysis using Cascaded  Chunking

Support Vector Machines [Vapnik]

Binary classification with labels y_i = +1 and y_i = −1.

Separating hyperplane: w · x + b = 0, with margin hyperplanes w · x + b = +1 and w · x + b = −1.

Maximize the margin:

d = 2 / ||w||

Equivalently:

Minimize:   L(w) = ||w||² / 2
Subject to: y_i (w · x_i + b) ≥ 1

• Soft margin
• Kernel function
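The last two bullets name the standard extensions without formulas. For reference, their usual textbook forms in LaTeX (general background, not taken from the slide; the polynomial kernel is assumed to be of the common form matching the 3rd-degree kernel in the experiments):

```latex
% Soft margin: slack variables \xi_i permit constraint violations,
% traded off against margin size by the constant C:
\min_{w,\,b,\,\xi}\ \tfrac{1}{2}\lVert w\rVert^2 + C\sum_i \xi_i
\quad \text{s.t.} \quad y_i\,(w\cdot x_i + b) \ge 1-\xi_i,\ \ \xi_i \ge 0.

% Kernel function: the decision function depends on inputs only
% through inner products, which a kernel K replaces; a 3rd-degree
% polynomial kernel has the form
K(x, x') = (x\cdot x' + 1)^3.
```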