



Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, pages 277–287, Vancouver, Canada, July 30 - August 4, 2017. © 2017 Association for Computational Linguistics

https://doi.org/10.18653/v1/P17-1026


A* CCG Parsing with a Supertag and Dependency Factored Model

Masashi Yoshikawa and Hiroshi Noji and Yuji Matsumoto
Graduate School of Information Science
Nara Institute of Science and Technology
8916-5, Takayama, Ikoma, Nara, 630-0192, Japan

{ masashi.yoshikawa.yh8, noji, matsu }@is.naist.jp

Abstract

We propose a new A* CCG parsing model in which the probability of a tree is decomposed into factors of CCG categories and its syntactic dependencies, both defined on bi-directional LSTMs. Our factored model allows the precomputation of all probabilities and runs very efficiently, while modeling sentence structures explicitly via dependencies. Our model achieves the state-of-the-art results on English and Japanese CCG parsing.1

1 Introduction

Supertagging in lexicalized grammar parsing is known as almost parsing (Bangalore and Joshi, 1999), in that each supertag is syntactically informative and most ambiguities are resolved once a correct supertag is assigned to every word. Recently this property has been effectively exploited in A* Combinatory Categorial Grammar (CCG; Steedman (2000)) parsing (Lewis and Steedman, 2014; Lewis et al., 2016), in which the probability of a CCG tree y on a sentence x of length N is the product of the probabilities of supertags (categories) c_i (locally factored model):

P(y|x) = \prod_{i \in [1,N]} P_{tag}(c_i|x).   (1)

By not modeling every combinatory rule in a derivation, this formulation enables us to employ efficient A* search (see Section 2), which finds the most probable supertag sequence that can build a well-formed CCG tree.

Although much ambiguity is resolved with this supertagging, some ambiguity still remains. Figure 1 shows an example, where the two CCG

1 Our software and the pretrained models are available at: https://github.com/masashi-y/depccg.

[Figure 1 shows two CCG derivations of "a house in Paris in France" built from the same supertags: in (a) the phrase "in France" attaches to "house", while in (b) it attaches to "Paris".]

Figure 1: CCG trees that are equally likely under Eq. 1. Our model resolves this ambiguity by modeling the head of every word (dependencies).

parses are derived from the same supertags. Lewis et al.'s approach to this problem is to resort to some deterministic rule. For example, Lewis et al. (2016) employ the attach low heuristic, which is motivated by the right-branching tendency of English, and always prioritizes (b) for this type of ambiguity. Though it empirically works well for English, an obvious limitation is that it does not always derive the correct parse; consider the phrase "a house in Paris with a garden", for which the correct parse has the structure corresponding to (a) instead.

In this paper, we provide a way to resolve these remaining ambiguities under the locally factored model, by explicitly modeling bilexical dependencies as shown in Figure 1. Our joint model is still locally factored so that an efficient A* search can be applied. The key idea is to predict the head of every word independently as in Eq. 1 with a strong unigram model, for which we utilize the scoring model of the recent successful graph-based dependency parsing on LSTMs (Kiperwasser and Goldberg, 2016; Dozat and Manning, 2016).


Specifically, we extend the bi-directional LSTM (bi-LSTM) architecture of Lewis et al. (2016), which predicts the supertag of each word, to also predict its head at the same time with a bilinear transformation.

The importance of modeling structures beyond supertags is demonstrated by the performance gain in Lee et al. (2016), which adds a recursive component to the model of Eq. 1. Unfortunately, this formulation loses the efficiency of the original one, since it needs to compute a recursive neural network every time it searches for a new node. Our model does not resort to recursive networks while still modeling tree structures via dependencies.

We also extend the tri-training method of Lewis et al. (2016) to learn our model with dependencies from unlabeled data. On English CCGbank test data, our model with this technique achieves 88.8% and 94.0% in terms of labeled and unlabeled F1, which mark the best scores so far.

Besides English, we provide experiments on Japanese CCG parsing. Japanese has freer word order, dominated by case markers, so a deterministic rule such as the attach low method may not work well. We show that this is actually the case; our method outperforms the simple application of Lewis et al. (2016) by a large margin, 10.0 points in terms of clause dependency accuracy.

2 Background

Our work is built on A* CCG parsing (Section 2.1), which we extend in Section 3 with a head prediction model on bi-LSTMs (Section 2.2).

2.1 Supertag-factored A* CCG Parsing

CCG has a nice property that, since every category is highly informative about attachment decisions, assigning one to every word (supertagging) resolves most of the syntactic structure. Lewis and Steedman (2014) utilize this characteristic of the grammar. Let a CCG tree y be a list of categories ⟨c_1, . . . , c_N⟩ and a derivation on it. Their model looks for the most probable y given a sentence x of length N from the set Y(x) of possible CCG trees under the model of Eq. 1:

\hat{y} = \arg\max_{y \in Y(x)} \sum_{i \in [1,N]} \log P_{tag}(c_i|x).

Since this score is factored into each supertag, they call the model a supertag-factored model.

Exact inference of this problem is possible by A* parsing (Klein and Manning, 2003), which uses the following two scores on a chart:

b(C_{i,j}) = \sum_{c_k \in c_{i,j}} \log P_{tag}(c_k|x),

a(C_{i,j}) = \sum_{k \in [1,N] \setminus [i,j]} \max_{c_k} \log P_{tag}(c_k|x),

where C_{i,j} is a chart item called an edge, which abstracts parses spanning interval [i, j] rooted by category C. The chart maps each edge to the derivation with the highest score, i.e., the Viterbi parse for C_{i,j}. c_{i,j} is the sequence of categories on such a Viterbi parse, and thus b is called the Viterbi inside score, while a is an approximation (upper bound) of the Viterbi outside score.

A* parsing is a kind of CKY chart parsing augmented with an agenda, a priority queue that keeps the edges to be explored. At every step it pops the edge e with the highest priority b(e) + a(e), inserts it into the chart, and enqueues any edges that can be built by combining e with other edges in the chart. The algorithm terminates when an edge C_{1,N} is popped from the agenda.
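To make the control flow concrete, the following minimal Python sketch traces the agenda-driven loop just described. It is not the authors' implementation: initial_edges, combine and outside_estimate are hypothetical helpers (initial_edges yields per-word items with their inside scores, combine enumerates grammar combinations of two adjacent edges, and outside_estimate returns the precomputed upper bound a).

import heapq
from collections import defaultdict

def astar_parse(n_words, initial_edges, combine, outside_estimate):
    """Sketch of the A* chart-parsing loop with an agenda (priority queue)."""
    agenda = []  # entries are (-priority, edge, inside score b)
    chart = {}   # edge -> best inside score found so far
    by_span = defaultdict(list)  # (start, end) -> edges, for finding neighbors

    for edge, b in initial_edges:
        heapq.heappush(agenda, (-(b + outside_estimate(edge)), edge, b))

    while agenda:
        _, edge, b = heapq.heappop(agenda)
        if edge in chart:           # already expanded with a better score
            continue
        chart[edge] = b
        start, end, _category = edge
        if (start, end) == (0, n_words):   # the goal edge C_{1,N} is popped
            return edge, b
        by_span[(start, end)].append(edge)
        # combine the popped edge with adjacent edges already in the chart
        for (s, e), others in by_span.items():
            if e == start or s == end:
                for other in others:
                    left, right = (other, edge) if e == start else (edge, other)
                    for new_edge, delta in combine(left, right):
                        new_b = chart[left] + chart[right] + delta
                        prio = new_b + outside_estimate(new_edge)
                        heapq.heappush(agenda, (-prio, new_edge, new_b))
    return None

Here an edge is a (start, end, category) tuple, and delta is the score contribution of the applied rule (zero for the supertag-only model of Eq. 1).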

A* search for this model is quite efficient because both b and a can be obtained from the unigram category distribution on every word, which can be precomputed before search. The heuristic a gives an upper bound on the true Viterbi outside score (i.e., it is admissible). Along with the condition that the inside score never increases by expansion (monotonicity), this guarantees that the first derivation found on C_{1,N} is always optimal. a(C_{i,j}) matches the true outside score if the one-best category assignments on the outside words (\arg\max_{c_k} \log P_{tag}(c_k|x)) can comprise a well-formed tree with C_{i,j}, which is generally not true.

Scoring model  For modeling P_tag, Lewis and Steedman (2014) use a log-linear model with features from a fixed window context. Lewis et al. (2016) extend this with bi-LSTMs, which encode the complete sentence and capture long range syntactic information. We base our model on this bi-LSTM architecture, and extend it to model a head word at the same time.

Attachment ambiguity  In A* search, an edge with the highest priority b + a is searched first, but as shown in Figure 1 the same categories (with the same priority) may sometimes derive more than


one tree. Lewis and Steedman (2014) prioritize the parse with longer dependencies, which they judge with a conversion rule from a CCG tree to a dependency tree (Section 4). Lewis et al. (2016) employ another heuristic prioritizing low attachments of constituents, but inevitably such heuristics cannot be flawless in all situations. We provide a simple solution to this problem by explicitly modeling bilexical dependencies.

2.2 Bi-LSTM Dependency Parsing

For modeling dependencies, we borrow the idea from recent graph-based neural dependency parsing (Kiperwasser and Goldberg, 2016; Dozat and Manning, 2016), in which each dependency arc is scored directly on the outputs of bi-LSTMs. Though the model is first-order, bi-LSTMs enable conditioning on the entire sentence and lead to state-of-the-art performance. Note that this mechanism is similar to the modeling of the supertag distribution discussed above, in that for each word the distribution of the head choice is unigram and can be precomputed. As we will see, this keeps our joint model locally factored and A* search tractable. For score calculation, we use the extended bilinear transformation of Dozat and Manning (2016) that also models the prior headness of each token, which they call biaffine.
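As a small illustration of why this head model keeps the search cheap, the NumPy sketch below precomputes the full N x N table of head log-probabilities from bi-LSTM outputs with a biaffine scorer, so that every arc score needed later is a single lookup. It assumes the bi-LSTM vectors r and trained parameters W and w are given, and omits the MLP projections used in the actual model.

import numpy as np

def head_log_probs(r, W, w):
    """Precompute log P_dep(h | x) for all (child, head) pairs.

    r : (N, d) bi-LSTM output vectors, W : (d, d) biaffine weight matrix,
    w : (d,) head-bias vector.  Returns an (N, N) matrix whose row i is the
    log-distribution over possible heads of word i."""
    scores = r @ W @ r.T + r @ w                 # biaffine arc scores, child x head
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    log_z = np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return scores - log_z                        # log-softmax over heads, per child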

3 Proposed Method

3.1 A* Parsing with Supertag and Dependency Factored Model

We define a CCG tree y for a sentence x = ⟨x_1, . . . , x_N⟩ as a triplet of a list of CCG categories c = ⟨c_1, . . . , c_N⟩, dependencies h = ⟨h_1, . . . , h_N⟩, and the derivation, where h_i is the head index of x_i. Our model is defined as follows:

P(y|x) = \prod_{i \in [1,N]} P_{tag}(c_i|x) \prod_{i \in [1,N]} P_{dep}(h_i|x).   (2)

The added term P_dep is a unigram distribution of the head choice.

A* search is still tractable under this model. The search problem is changed as:

\hat{y} = \arg\max_{y \in Y(x)} \Big( \sum_{i \in [1,N]} \log P_{tag}(c_i|x) + \sum_{i \in [1,N]} \log P_{dep}(h_i|x) \Big),

[Figure 2 shows edge e1 (spanning "John", NP) and edge e2 (spanning "met Mary", S\NP) being combined into e3 ("John met Mary", S), with b(e3) = b(e1) + b(e2) + log P_dep(met → John).]

Figure 2: The Viterbi inside score for edge e3 under our model is the sum of those of e1 and e2 and the score of the dependency arc going from the head of e2 to that of e1 (the head direction changes according to the child categories).

and the inside score is given by:

b(C_{i,j}) = \sum_{c_k \in c_{i,j}} \log P_{tag}(c_k|x) + \sum_{k \in [i,j] \setminus \{root(h^{C_{i,j}})\}} \log P_{dep}(h_k|x),   (3)

where h^{C_{i,j}} is the dependency subtree for the Viterbi parse on C_{i,j} and root(h) returns its root index. We exclude the head score for the subtree root token since it cannot be resolved inside [i, j]. This causes a mismatch between the goal inside score b(C_{1,N}) and the true model score (log of Eq. 2), which we adjust by adding a special unary rule that is always applied to the popped goal edge C_{1,N}.

We can calculate the dependency terms in Eq. 3 on the fly when expanding the chart. Let the currently popped edge be A_{i,k}, which will be combined with B_{k,j} into C_{i,j}. The key observation is that only one dependency arc (between root(h^{A_{i,k}}) and root(h^{B_{k,j}})) is resolved at every combination (see Figure 2). For every rule C → A B we can define the head direction (see Section 4) and P_dep is obtained accordingly. For example, when the right child B becomes the head, b(C_{i,j}) = b(A_{i,k}) + b(B_{k,j}) + \log P_{dep}(h_l = m|x), where l = root(h^{A_{i,k}}) and m = root(h^{B_{k,j}}) (l < m).
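A sketch of this incremental update, under the same assumptions as the earlier snippets (edges are (start, end, category) tuples; b, root and head_logp are hypothetical lookup tables, with head_logp the precomputed N x N matrix of head log-probabilities):

def combine_edges(left, right, category, right_is_head, b, root, head_logp):
    """Inside-score update of Eq. 3 when `left` and `right` are combined.

    b[edge] is the Viterbi inside score, root[edge] the index of the edge's
    dependency-subtree root, and head_logp[child][head] = log P_dep(h_child = head | x).
    right_is_head follows the per-rule head direction of Section 4."""
    l, m = root[left], root[right]
    if right_is_head:
        arc = head_logp[l][m]      # the left root takes the right root as head
        new_root = m
    else:
        arc = head_logp[m][l]      # the right root takes the left root as head
        new_root = l
    new_edge = (left[0], right[1], category)   # spans [i, j] with category C
    b[new_edge] = b[left] + b[right] + arc
    root[new_edge] = new_root
    return new_edge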

The Viterbi outside score is changed as:

a(C_{i,j}) = \sum_{k \in [1,N] \setminus [i,j]} \max_{c_k} \log P_{tag}(c_k|x) + \sum_{k \in L} \max_{h_k} \log P_{dep}(h_k|x),

where L = [1,N] \setminus \{k' \mid k' \in [i,j], k' \neq root(h^{C_{i,j}})\}. We regard root(h^{C_{i,j}}) as an outside word since its head is not resolved yet. For every outside word we independently assign the weight of its argmax


head, which may not comprise a well-formed dependency tree. We initialize the agenda by adding an item for every supertag C and word x_i with the score a(C_{i,i}) = \sum_{k \in [1,N] \setminus \{i\}} \max_{c_k} \log P_{tag}(c_k|x) + \sum_{k \in [1,N]} \max_{h_k} \log P_{dep}(h_k|x). Note that the dependency component of this score is the same for every word.
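The agenda initialization can be read off directly from the two precomputed score matrices; the sketch below (hypothetical helper, NumPy only, no category pruning) returns one item per word/supertag pair with its priority, inside score b, and outside estimate a as defined above.

import numpy as np

def initialize_agenda(tag_logp, head_logp):
    """tag_logp : (N, T) matrix of log P_tag(c | x) per word;
    head_logp : (N, N) matrix of log P_dep(h | x) per word.
    Returns (priority, word index, category index, inside score) items."""
    best_tag = tag_logp.max(axis=1)    # max_c log P_tag(c_k | x), per word
    best_head = head_logp.max(axis=1)  # max_h log P_dep(h_k | x), per word
    dep_total = best_head.sum()        # dependency part: same for every item
    tag_total = best_tag.sum()
    items = []
    N, T = tag_logp.shape
    for i in range(N):
        for c in range(T):
            b = tag_logp[i, c]                          # inside score of C_{i,i}
            a = (tag_total - best_tag[i]) + dep_total   # outside estimate a(C_{i,i})
            items.append((b + a, i, c, b))
    return items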

3.2 Network Architecture

Following Lewis et al. (2016) and Dozat and Manning (2016), we model P_tag and P_dep using bi-LSTMs to exploit the entire sentence and capture long range phenomena. See Figure 3 for the overall network architecture, where P_tag and P_dep share the common bi-LSTM hidden vectors.

First we map every word x_i to a hidden vector r_i with bi-LSTMs. The input to the LSTMs is word embeddings, which we describe in Section 6. Following Lewis et al. (2016), we add special start and end tokens with trainable parameters to each sentence. For P_dep, we use the biaffine transformation of Dozat and Manning (2016):

g^{dep}_i = MLP^{dep}_{child}(r_i),
g^{dep}_{h_i} = MLP^{dep}_{head}(r_{h_i}),
P_{dep}(h_i|x) \propto \exp\big((g^{dep}_i)^T W_{dep}\, g^{dep}_{h_i} + w_{dep} \cdot g^{dep}_{h_i}\big),   (4)

where MLP is a multilayered perceptron. Though Lewis et al. (2016) simply use an MLP to map r_i to P_tag, we additionally utilize the hidden vector of the most probable head \hat{h}_i = \arg\max_{h'_i} P_{dep}(h'_i|x), and apply r_i and r_{\hat{h}_i} to a bilinear function:2

g^{tag}_i = MLP^{tag}_{child}(r_i),
g^{tag}_{\hat{h}_i} = MLP^{tag}_{head}(r_{\hat{h}_i}),   (5)
\ell = (g^{tag}_i)^T U_{tag}\, g^{tag}_{\hat{h}_i} + W_{tag} \big[ g^{tag}_i ; g^{tag}_{\hat{h}_i} \big] + b_{tag},
P_{tag}(c_i|x) \propto \exp(\ell_c),

where U_{tag} is a third order tensor. As in Lewis et al. (2016), these values can be precomputed before search, which makes our A* parsing quite efficient.
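The NumPy sketch below traces the forward computation of Eqs. 4 and 5 under simplifying assumptions: r stands for already-computed bi-LSTM outputs, the mlp_* arguments are callables standing in for the trained MLPs, and all parameter names and shapes are illustrative rather than the authors' implementation.

import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def forward(r, mlp_dep_child, mlp_dep_head, mlp_tag_child, mlp_tag_head,
            W_dep, w_dep, U_tag, W_tag, b_tag):
    """Forward pass of Eqs. 4 and 5 on bi-LSTM outputs r of shape (N, d)."""
    # Eq. 4: biaffine head scores; row i is the distribution over heads of word i
    gd_c = mlp_dep_child(r)                              # (N, d')
    gd_h = mlp_dep_head(r)                               # (N, d')
    dep_scores = gd_c @ W_dep @ gd_h.T + gd_h @ w_dep    # (N, N)
    p_dep = softmax(dep_scores, axis=1)

    # pick the most probable head of every word and take its hidden vector
    best_head = p_dep.argmax(axis=1)                     # (N,)
    gt_c = mlp_tag_child(r)                              # (N, d')
    gt_h = mlp_tag_head(r[best_head])                    # (N, d')

    # Eq. 5: bilinear term (third-order tensor U_tag) plus linear term on [g_c; g_h]
    bilinear = np.einsum('nd,cde,ne->nc', gt_c, U_tag, gt_h)       # (N, C)
    linear = np.concatenate([gt_c, gt_h], axis=1) @ W_tag.T + b_tag
    p_tag = softmax(bilinear + linear, axis=1)
    return p_tag, p_dep

Because both outputs depend only on the input sentence, p_tag and p_dep can be computed once and reused throughout A* search.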

4 CCG to Dependency Conversion

Now we describe our rules for converting a CCG tree to a dependency tree, which we use for two purposes: 1) creation of the training data for the dependency component of our model; and 2) extraction of a dependency arc at each combinatory rule during A* search (Section 3.1).

2 This is inspired by the formulation of label prediction in Dozat and Manning (2016), which performs the best among other settings that remove or reverse the dependence between the head model and the supertag model.

Figure 3: Neural networks of our supertag and dependency factored model. First we map every word x_i to a hidden vector r_i by bi-LSTMs, and then apply biaffine (Eq. 4) and bilinear (Eq. 5) transformations to obtain the distributions of dependency heads (P_dep) and supertags (P_tag).

Lewis and Steedman (2014) describe one way to extract dependencies from a CCG tree (LEWISRULE). In addition to this, we describe two simpler alternatives below (HEADFIRST and HEADFINAL), and examine their effects on parsing performance in our experiments (Section 6). See Figure 4 for an overview.

LEWISRULE  This is the same as the conversion rule in Lewis and Steedman (2014). As shown in Figure 4c, the output looks like a familiar English dependency tree.

For forward application and (generalized) forward composition, we define the head to be the left argument of the combinatory rule, unless it matches either X/X or X/(X\Y), in which case the right argument is the head. For example, on "Black Monday" in Figure 4a we choose Monday as the head of Black. For the backward rules, the conversions are defined as the reverse of the corresponding forward rules. For other rules, RemovePunctuation (rp) chooses the non-punctuation argument as the head, while Conjunction (Φ) chooses the right argument.3

3 When applying LEWISRULE to Japanese, we ignore the feature values in determining the head argument, which we find often leads to a more natural dependency structure. For example, in "tabe ta" (eat PAST), the category of the auxiliary verb "ta" is S_{f1}\S_{f2} with f1 = f2, and thus S_{f1} = S_{f2}. We choose "tabe" as the head in this case by removing the feature values, which makes the category X\X.
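A toy sketch of the forward-rule head direction described above. It is only an illustration of the X/X and X/(X\Y) test on string categories (hypothetical representation; it does not handle modifiers whose X part itself contains slashes, and it is not the authors' implementation).

import re

def lewisrule_right_is_head(left_cat, right_cat, rule):
    """Return True if the right argument is the head under the forward rules:
    the left argument heads forward application/composition unless it looks
    like a modifier category X/X or X/(X\\Y)."""
    assert rule in ("fa", "fc")   # forward application / forward composition
    m = re.match(r'^\(?([^/\\()]+)\)?/\(?([^/\\()]+)(\\[^/\\()]+)?\)?$', left_cat)
    if m and m.group(1) == m.group(2):   # X/X or X/(X\Y): a modifier
        return True                      # the right argument becomes the head
    return False                         # otherwise the left argument heads

# on "Black Monday", NP/NP matches X/X, so "Monday" (right) becomes the head;
# on "in Paris", (NP\NP)/NP is not modifier-shaped, so the left argument heads.
print(lewisrule_right_is_head("NP/NP", "NP", "fa"))        # True
print(lewisrule_right_is_head("(NP\\NP)/NP", "NP", "fa"))  # False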


[Figure 4 panels: (a) an English CCG derivation of "No, it wasn't Black Monday."; (b) a Japanese CCG derivation of "Boku wa eigo wo hanasi tai." ("I want to speak English."); (c) LEWISRULE dependencies for (a); (d) HEADFIRST dependencies for (a); (e) HEADFINAL dependencies for (b).]

Figure 4: Examples of applying the conversion rules in Section 4 to English and Japanese sentences.

One issue when applying this method for obtaining the training data is that, due to the mismatch between the rule set of our CCG parser, for which we follow Lewis and Steedman (2014), and the grammar in English CCGbank (Hockenmaier and Steedman, 2007), we cannot extract dependencies from some of the annotated CCG trees.4 For this reason, we instead obtain the training data for this method from the original dependency annotations on CCGbank. Fortunately the dependency annotations of CCGbank match LEWISRULE above in most cases and thus they can be a good approximation to it.

HEADFINAL  Among SOV languages, Japanese is known as a strictly head final language, meaning that the head of every word always follows it. Japanese dependency parsing (Uchimoto et al., 1999; Kudo and Matsumoto, 2002) has exploited this property explicitly by only allowing left-to-right dependency arcs. Inspired by this tradition, we try a simple HEADFINAL rule in Japanese CCG parsing, in which we always select the right argument as the head. For example we obtain the head final dependency tree in Figure 4e from the Japanese CCG tree in Figure 4b.

HEADFIRST  We apply a similar idea to English. Since English has the opposite, SVO word order, we define a simple "head first" rule, in which the left argument always becomes the head (Figure 4d).
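Because both rules ignore the categories entirely, the conversion can be written as a short recursion over the binarized derivation. The sketch below uses a hypothetical tree encoding (a leaf is a word index, an internal node is a pair of subtrees) and returns the root index and the list of (dependent, head) arcs.

def convert(tree, head_final=False):
    """Sketch of the HEADFIRST / HEADFINAL conversions: at every combination,
    the head of one child becomes the parent of the other child's head."""
    if isinstance(tree, int):            # leaf: a single word, no arcs yet
        return tree, []
    left, right = tree
    l_root, l_arcs = convert(left, head_final)
    r_root, r_arcs = convert(right, head_final)
    if head_final:                       # HEADFINAL: right argument is the head
        dep, head = l_root, r_root
    else:                                # HEADFIRST: left argument is the head
        dep, head = r_root, l_root
    return head, l_arcs + r_arcs + [(dep, head)]

# usage on a toy right-branching binary derivation over word indices 0..5:
tree = (0, (1, (2, (3, (4, 5)))))
print(convert(tree))   # (0, [(5, 4), (4, 3), (3, 2), (2, 1), (1, 0)])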

4 For example, the combinatory rules in Lewis and Steedman (2014) do not contain N_conj → N N, which is used in CCGbank. Another difficulty is that in English CCGbank the name of each combinatory rule is not annotated explicitly.

Though this conversion may look odd at first sight, it also has some advantages over LEWISRULE. First, since the model with LEWISRULE is trained on the CCGbank dependencies, at inference the two components P_dep and P_tag occasionally conflict in their predictions. For example, the true Viterbi parse may have a lower score in terms of dependencies, in which case the parser slows down and may degrade in accuracy. HEADFIRST, in contrast, does not suffer from such conflicts. Second, by fixing the direction of arcs, the prediction of heads becomes easier, meaning that the dependency predictions become more reliable. Later we show that this is in fact the case for existing dependency parsers (see Section 5), and in practice, we find that this simple conversion rule leads to higher parsing scores than LEWISRULE on English (Section 6).

5 Tri-training

We extend the existing tri-training method to our models and apply it to our English parsers.

Tri-training is a semi-supervised method in which the outputs of two parsers on unlabeled data are intersected to create new (silver) training data. This method has been successfully applied to dependency parsing (Weiss et al., 2015) and CCG supertagging (Lewis et al., 2016).

We simply combine the two previous approaches. Lewis et al. (2016) obtain their silver data annotated with high quality supertags. Since they make this data publicly available,5 we obtain our silver data by assigning dependency

5 https://github.com/uwnlp/taggerflow


structures on top of them.6

We train two very different dependency parsers on training data extracted from CCGbank Sections 02-21. This training data differs depending on our dependency conversion strategies (Section 4). For LEWISRULE, we extract the original dependency annotations of CCGbank. For HEADFIRST, we extract the head first dependencies from the CCG trees. Note that we cannot annotate dependency labels, so we assign a dummy "none" label to every arc. The first parser is the graph-based RBGParser (Lei et al., 2014) with the default settings, except that we train an unlabeled parser and use the word embeddings of Turian et al. (2010). The second parser is the transition-based lstm-parser (Dyer et al., 2015) with the default parameters.

On the development set (Section 00), with LEWISRULE dependencies RBGParser shows a 93.8% unlabeled attachment score while that of lstm-parser is 92.5%, using gold POS tags. Interestingly, the parsers with HEADFIRST dependencies achieve higher scores: 94.9% by RBGParser and 94.6% by lstm-parser, suggesting that HEADFIRST dependencies are easier to parse. For both kinds of dependencies, we obtain more than 1.7 million sentences on which the two parsers agree.
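A minimal sketch of this agreement filtering, assuming hypothetical parser_a and parser_b callables that map a supertagged sentence to a list of head indices (it is a schematic view of the intersection step, not the authors' exact pipeline):

def tritrain_silver(tagged_sentences, parser_a, parser_b):
    """Keep a sentence only when the two parsers predict the same head for
    every word, and pair it with the agreed dependency structure."""
    silver = []
    for sentence in tagged_sentences:   # Lewis et al.'s supertagged silver data
        heads_a = parser_a(sentence)
        heads_b = parser_b(sentence)
        if heads_a == heads_b:          # the two parsers agree on all arcs
            silver.append((sentence, heads_a))
    return silver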

Following Lewis et al. (2016), we include 15 copies of the CCGbank training set when using this silver data. Also, to reduce the effect of the tri-training samples, we multiply their loss by 0.4.
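One simple way to realize this mixture is to attach a loss weight to every example, as in the sketch below (hypothetical helper; the example and loss representations are placeholders):

def build_training_stream(gold, silver, gold_copies=15, silver_loss_weight=0.4):
    """15 copies of the gold CCGbank examples at full loss weight, plus the
    silver examples with their loss scaled by 0.4."""
    stream = [(ex, 1.0) for ex in gold] * gold_copies
    stream += [(ex, silver_loss_weight) for ex in silver]
    return stream

# during training, a weighted example simply scales its loss term:
#   total_loss = sum(w * loss(model, ex) for ex, w in minibatch)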

6 Experiments

We perform experiments on the English and Japanese CCGbanks.

6.1 English Experimental Settings

We follow the standard data splits and use Sections 02-21 for training, Section 00 for development, and Section 23 for final evaluation. We report labeled and unlabeled F1 of the extracted CCG semantic dependencies, obtained using the generate program supplied with the C&C parser.

For our models, we adopt the pruning strategies in Lewis and Steedman (2014) and allow at most 50 categories per word, use a variable-width beam with β = 0.00001, and utilize a tag dictionary, which maps frequent words to the possible

6 We annotate POS tags on this data using the Stanford POS tagger (Toutanova et al., 2003).

supertags.7 Unless otherwise stated, we only allow normal form parses (Eisner, 1996; Hockenmaier and Bisk, 2010), choosing the same subset of the constraints as Lewis and Steedman (2014).

As word representations we use the concatenation of word vectors initialized to GloVe8 (Pennington et al., 2014) and randomly initialized prefix and suffix vectors of lengths 1 to 4, which is inspired by Lewis et al. (2016). All affixes appearing fewer than two times in the training data are mapped to "UNK".
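A small sketch of what the affix features might look like; the exact handling of words shorter than the affix length is an assumption (padded with a null symbol here), and the feature keys are purely illustrative.

def affix_features(word, max_len=4):
    """For each length 1..4, take the word's prefix and suffix; each feature
    indexes its own randomly initialized embedding."""
    feats = []
    for n in range(1, max_len + 1):
        prefix = word[:n] if len(word) >= n else "<pad>"
        suffix = word[-n:] if len(word) >= n else "<pad>"
        feats.append(("prefix", n, prefix))
        feats.append(("suffix", n, suffix))
    return feats

# the input vector of a word is then the concatenation of its GloVe vector
# and the embeddings looked up for each of these eight affix features.
print(affix_features("parsing"))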

Other model configurations are: 4-layer bi-LSTMs with left and right 300-dimensional LSTMs; 1-layer 100-dimensional MLPs with ELU non-linearity (Clevert et al., 2015) for all of MLP^{dep}_{child}, MLP^{dep}_{head}, MLP^{tag}_{child} and MLP^{tag}_{head}; and the Adam optimizer with β1 = 0.9, β2 = 0.9, L2 regularization (1e−6), and learning rate decay with ratio 0.75 every 2,500 iterations starting from 2e−3, which is shown to be effective for training the biaffine parser (Dozat and Manning, 2016).
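For concreteness, the step-wise learning rate schedule stated above can be written as:

def learning_rate(step, base_lr=2e-3, decay=0.75, every=2500):
    """Start at 2e-3 and multiply by 0.75 after every 2,500 iterations."""
    return base_lr * decay ** (step // every)

print(learning_rate(0))      # 0.002
print(learning_rate(2500))   # 0.0015
print(learning_rate(5000))   # 0.001125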

6.2 Japanese Experimental Settings

We follow the default train/dev/test splits of the Japanese CCGbank (Uematsu et al., 2013). For the baselines, we use an existing shift-reduce CCG parser implemented in the NLP tool Jigg9 (Noji and Miyao, 2016), and our implementation of the supertag-factored model using bi-LSTMs.

For Japanese, we use as word representations the concatenation of word vectors initialized to the Japanese Wikipedia Entity Vectors10 and 100-dimensional vectors computed from randomly initialized 50-dimensional character embeddings through convolution (dos Santos and Zadrozny, 2014). We do not use affix vectors as affixes are less informative in Japanese. All characters appearing fewer than two times are mapped to "UNK". We use the same parameter settings as English for bi-LSTMs, MLPs, and optimization.

One issue in the Japanese experiments is evaluation. The Japanese CCGbank is encoded in a different format than the English bank, and no standalone script for extracting semantic dependencies is available yet. For this reason, we evaluate the parser outputs by converting them to bunsetsu

7 We use the same tag dictionary provided with their bi-LSTM model.
8 http://nlp.stanford.edu/projects/glove/
9 https://github.com/mynlp/jigg
10 http://www.cl.ecei.tohoku.ac.jp/~m-suzuki/jawiki_vector/


Method                Labeled  Unlabeled
CCGbank
  LEWISRULE w/o dep     85.8     91.7
  LEWISRULE             86.0     92.5
  HEADFIRST w/o dep     85.6     91.6
  HEADFIRST             86.6     92.8
Tri-training
  LEWISRULE             86.9     93.0
  HEADFIRST             87.6     93.3

Table 1: Parsing results (F1) on the English development set. "w/o dep" means that the model discards the dependency components at prediction.

Method                Labeled  Unlabeled  # violations
CCGbank
  LEWISRULE w/o dep     85.8     91.7       2732
  LEWISRULE             85.4     92.2        283
  HEADFIRST w/o dep     85.6     91.6       2773
  HEADFIRST             86.8     93.0         89
Tri-training
  LEWISRULE             86.7     92.8        253
  HEADFIRST             87.7     93.5         66

Table 2: Parsing results (F1) on the English development set when excluding the normal form constraints. "# violations" is the number of combinations violating the constraints in the outputs.

dependencies, the syntactic representation ordinarily used in Japanese NLP (Kudo and Matsumoto, 2002). Given a CCG tree, we obtain these by first segmenting the sentence into bunsetsu (chunks) using CaboCha11 and then extracting the dependencies that cross a bunsetsu boundary, after obtaining the word-level, head final dependencies as in Figure 4b. For example, the sentence in Figure 4e is segmented as "Boku wa | eigo wo | hanashi tai", from which we extract two dependencies (Boku wa) ← (hanashi tai) and (eigo wo) ← (hanashi tai). We perform this conversion for both gold and output CCG trees and calculate the (unlabeled) attachment accuracy. Though this is imperfect, it can detect important parse errors such as attachment errors and thus can be a good proxy for the performance as a CCG parser.
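A sketch of this conversion step, assuming the word-level head-final heads and a CaboCha-style chunking are already available (hypothetical data layout; the exact chunk ids and head indices below are only illustrative):

def bunsetsu_dependencies(word_heads, bunsetsu_of):
    """word_heads[i] = head index of word i (or -1 for the root);
    bunsetsu_of[i] = bunsetsu id of word i.  Keep only the arcs that cross
    a bunsetsu boundary, expressed between bunsetsu ids."""
    arcs = set()
    for dep, head in enumerate(word_heads):
        if head >= 0 and bunsetsu_of[dep] != bunsetsu_of[head]:
            arcs.add((bunsetsu_of[dep], bunsetsu_of[head]))
    return arcs

# "Boku wa | eigo wo | hanashi tai": words 0..5 with chunks 0,0,1,1,2,2 and
# head-final word heads like [1, 5, 3, 5, 5, -1] yield the two bunsetsu arcs.
print(bunsetsu_dependencies([1, 5, 3, 5, 5, -1], [0, 0, 1, 1, 2, 2]))  # {(0, 2), (1, 2)}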

6.3 English Parsing Results

Effect of Dependency  We first see how the dependency components added in our model affect performance. Table 1 shows the results on the development set with several configurations, in which "w/o dep" means discarding the

11 http://taku910.github.io/cabocha/

Method                            Labeled  Unlabeled
CCGbank
  C&C (Clark and Curran, 2007)      85.5     91.7
  w/ LSTMs (Vaswani et al., 2016)   88.3     -
  EasySRL (Lewis et al., 2016)      87.2     -
  EasySRL reimpl                    86.8     92.3
  HEADFIRST w/o NF (Ours)           87.7     93.4
Tri-training
  EasySRL (Lewis et al., 2016)      88.0     92.9
  neuralccg (Lee et al., 2016)      88.7     93.7
  HEADFIRST w/o NF (Ours)           88.8     94.0

Table 3: Parsing results (F1) on the English test set (Section 23).

dependency terms of the model and applying the attach low heuristic (Section 1) instead (i.e., a supertag-factored model; Section 2.1). We can see that for both LEWISRULE and HEADFIRST, adding the dependency terms improves performance.

Choice of Dependency Conversion Rule  To our surprise, our simple HEADFIRST strategy always leads to better results than the linguistically motivated LEWISRULE. The absolute improvements by tri-training are equally large (about 1.0 points), suggesting that our model with dependencies can also benefit from the silver data.

Excluding Normal Form Constraints  One advantage of HEADFIRST is that the direction of arcs is always rightward, making the structures simpler and more parsable (Section 5). From another viewpoint, this fixed direction means that the constituent structure behind a (head first) dependency tree is unique. Since the constituent structures of CCGbank trees basically follow the normal form (NF), we hypothesize that the model learned with HEADFIRST has the ability to force its outputs into NF automatically. We summarize the results without the NF constraints in Table 2, which shows that this argument is correct; the number of NF rule violations in the outputs of HEADFIRST is much smaller than that of LEWISRULE (89 vs. 283). Interestingly, the scores of HEADFIRST slightly increase over the models with NF (e.g., 86.8 vs. 86.6 for CCGbank), suggesting that the NF constraints occasionally hinder the search of the HEADFIRST models.

Results on Test Set  Parsing results on the test set (Section 23) are shown in Table 3, where we compare our best performing HEADFIRST dependency model without NF constraints with several existing parsers. In the CCGbank


             EasySRL reimpl  neuralccg   Ours
Tagging           24.8          21.7     16.6
A* Search        185.2          16.7    114.6
Total             21.9          9.33     14.5

Table 4: Results of the efficiency experiment, where each number is the number of sentences processed per second. We compare our proposed parser against neuralccg and our reimplementation of EasySRL.

experiment, our parser shows better results than all the baseline parsers except C&C with an LSTM supertagger (Vaswani et al., 2016). Our parser outperforms EasySRL by 0.5% and our reimplementation of that parser (EasySRL reimpl) by 0.9% in terms of labeled F1. In the tri-training experiment, our parser shows a much increased performance of 88.8% labeled F1 and 94.0% unlabeled F1, outperforming the current state-of-the-art neuralccg (Lee et al., 2016), which uses recursive neural networks, by 0.1 and 0.3 points in terms of labeled and unlabeled F1. This is the best reported F1 in English CCG parsing.

Efficiency Comparison  We compare the efficiency of our parser with neuralccg and EasySRL reimpl.12 The results are shown in Table 4. For overall speed (the third row), our parser is faster than neuralccg although it lags behind EasySRL reimpl. Inspecting the details, our supertagger runs slower than those of neuralccg and EasySRL reimpl, while in A* search our parser processes over 7 times more sentences than neuralccg. The delay in supertagging can be attributed to several factors, in particular the differences in network architectures, including the number of bi-LSTM layers (4 vs. 2) and the use of a bilinear transformation instead of a linear one. There are also many implementation differences between our parser (a C++ A* parser with the neural network model implemented in Chainer (Tokui et al., 2015)) and neuralccg (a Java parser with a C++ TensorFlow (Abadi et al., 2015) supertagger and a recursive neural model in C++ DyNet (Neubig et al., 2017)).

6.4 Japanese Parsing Results

We show the results of the Japanese parsing experiment in Table 5. The simple application of Lewis

12 This experiment is performed on a laptop with a 4-thread 2.0 GHz CPU.

Method                  Category  Bunsetsu Dep.
Noji and Miyao (2016)     93.0       87.5
Supertag model            93.7       81.5
LEWISRULE (Ours)          93.8       90.8
HEADFINAL (Ours)          94.1       91.5

Table 5: Results on the Japanese CCGbank.

[Figure 5 shows two CCG derivations of "Kinoo kat-ta karee-wo tabe-ta" (Yesterday buy-PAST curry-ACC eat-PAST) that share the same supertags but attach the adverb "Kinoo" to different verbs.]

Figure 5: Examples of an ambiguous Japanese sentence given fixed supertags. The English translation is "I ate the curry I bought yesterday."

et al. (2016) (Supertag model) is not effective for Japanese, showing the lowest attachment score of 81.5%. We observe a performance boost with our method, especially with HEADFINAL dependencies, which outperforms the baseline shift-reduce parser by 1.1 points on category assignment and 4.0 points on bunsetsu dependencies.

The degraded results of the simple application of the supertag-factored model can be attributed to the fact that the structure of a Japanese sentence is still highly ambiguous given the supertags (Figure 5). This is particularly the case in constructions where phrasal adverbial/adnominal modifiers (with the supertag S/S) are involved. The result suggests the importance of modeling dependencies in some languages, at least Japanese.

7 Related Work

There is some past work that utilizes dependencies in lexicalized grammar parsing, which we review briefly here.

For Head-driven Phrase Structure Grammar (HPSG; Pollard and Sag (1994)), there are studies that use predicted dependency structures to improve HPSG parsing accuracy. Sagae et al. (2007) use dependencies to constrain the form of the output tree. As in our method, for every rule (schema) application they define which child becomes the head and impose a soft constraint that these dependencies agree with the output of a dependency parser. Our method is different


in that we do not use the one-best dependency structure alone, but rather search for a CCG tree that is optimal in terms of both dependencies and CCG supertags. Zhang et al. (2010) use syntactic dependencies in a different way, and show that dependency-based features are useful for predicting HPSG supertags.

In the CCG parsing literature, some work optimizes a dependency model instead of supertags or a derivation (Clark and Curran, 2007; Xu et al., 2014). This approach is reasonable given that the objective matches the evaluation metric. Instead of modeling dependencies alone, our method finds a CCG derivation that has a higher dependency score. Lewis et al. (2015) present a joint model of CCG parsing and semantic role labeling (SRL), which is closely related to our approach. They map each CCG semantic dependency to an SRL relation, for which they give the A* upper bound by the score from a predicate to its most probable argument. Our approach is similar; the largest difference is that we instead model syntactic dependencies from each token to its head, and this is the key to our success. Since dependency parsing can be formulated as independent head selection similar to tagging, we can build the entire model on LSTMs to exploit features from the whole sentence. This formulation is not straightforward in the case of the multi-headed semantic dependencies in their model.

8 Conclusion

We have presented a new A* CCG parsing method, in which the probability of a CCG tree is decomposed into local factors of the CCG categories and its dependency structure. By explicitly modeling the dependency structure, we do not require any deterministic heuristics to resolve attachment ambiguities, and we keep the model locally factored so that all the probabilities can be precomputed before running the search. Our parser efficiently finds the optimal parse and achieves state-of-the-art performance in both English and Japanese parsing.

Acknowledgments

We are grateful to Mike Lewis for answering our questions and for his GitHub repository, from which we learned many things. We also thank Yuichiro Sawai for the faster LSTM implementation. This work was in part supported by JSPS KAKENHI Grant Number 16H06981, and also by JST CREST Grant Number JPMJCR1301.

References

Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mane, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viegas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2015. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. Software available from tensorflow.org. http://tensorflow.org/.

Srinivas Bangalore and Aravind K. Joshi. 1999. Supertagging: An Approach to Almost Parsing. Computational Linguistics 25(2):237–265.

Stephen Clark and James R. Curran. 2007. Wide-Coverage Efficient Statistical Parsing with CCG and Log-Linear Models. Computational Linguistics 33(4). http://aclweb.org/anthology/J07-4004.

Djork-Arné Clevert, Thomas Unterthiner, and Sepp Hochreiter. 2015. Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs). CoRR abs/1511.07289. http://arxiv.org/abs/1511.07289.

Cícero Nogueira dos Santos and Bianca Zadrozny. 2014. Learning Character-level Representations for Part-of-Speech Tagging. In ICML.

Timothy Dozat and Christopher D. Manning. 2016. Deep Biaffine Attention for Neural Dependency Parsing. CoRR abs/1611.01734. http://arxiv.org/abs/1611.01734.

Chris Dyer, Miguel Ballesteros, Wang Ling, Austin Matthews, and Noah A. Smith. 2015. Transition-Based Dependency Parsing with Stack Long Short-Term Memory. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Association for Computational Linguistics, pages 334–343. https://doi.org/10.3115/v1/P15-1033.

Jason Eisner. 1996. Efficient Normal-Form Parsing for Combinatory Categorial Grammar. In 34th Annual Meeting of the Association for Computational Linguistics. http://aclweb.org/anthology/P96-1011.


Julia Hockenmaier and Yonatan Bisk. 2010. Normal-form parsing for Combinatory Categorial Grammars with generalized composition and type-raising. In Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010). Coling 2010 Organizing Committee, pages 465–473. http://aclweb.org/anthology/C10-1053.

Julia Hockenmaier and Mark Steedman. 2007. CCGbank: A Corpus of CCG Derivations and Dependency Structures Extracted from the Penn Treebank. Computational Linguistics 33(3):355–396. http://www.aclweb.org/anthology/J07-3004.

Eliyahu Kiperwasser and Yoav Goldberg. 2016. Simple and Accurate Dependency Parsing Using Bidirectional LSTM Feature Representations. Transactions of the Association for Computational Linguistics 4:313–327. https://www.transacl.org/ojs/index.php/tacl/article/view/885.

Dan Klein and Christopher D. Manning. 2003. A* Parsing: Fast Exact Viterbi Parse Selection. In Proceedings of the 2003 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics. http://aclweb.org/anthology/N03-1016.

Taku Kudo and Yuji Matsumoto. 2002. Japanese Dependency Analysis using Cascaded Chunking. In Proceedings of the 6th Conference on Natural Language Learning, CoNLL 2002, Held in cooperation with COLING 2002, Taipei, Taiwan. http://aclweb.org/anthology/W/W02/W02-2016.pdf.

Kenton Lee, Mike Lewis, and Luke Zettlemoyer. 2016. Global Neural CCG Parsing with Optimality Guarantees. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, pages 2366–2376. http://aclweb.org/anthology/D16-1262.

Tao Lei, Yu Xin, Yuan Zhang, Regina Barzilay, and Tommi Jaakkola. 2014. Low-Rank Tensors for Scoring Dependency Structures. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, pages 1381–1391. https://doi.org/10.3115/v1/P14-1130.

Mike Lewis, Luheng He, and Luke Zettlemoyer. 2015. Joint A* CCG Parsing and Semantic Role Labelling. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, pages 1444–1454. https://doi.org/10.18653/v1/D15-1169.

Mike Lewis, Kenton Lee, and Luke Zettlemoyer. 2016. LSTM CCG Parsing. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, pages 221–231. https://doi.org/10.18653/v1/N16-1026.

Mike Lewis and Mark Steedman. 2014. A* CCG Parsing with a Supertag-factored Model. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, pages 990–1000. https://doi.org/10.3115/v1/D14-1107.

Graham Neubig, Chris Dyer, Yoav Goldberg, Austin Matthews, Waleed Ammar, Antonios Anastasopoulos, Miguel Ballesteros, David Chiang, Daniel Clothiaux, Trevor Cohn, Kevin Duh, Manaal Faruqui, Cynthia Gan, Dan Garrette, Yangfeng Ji, Lingpeng Kong, Adhiguna Kuncoro, Gaurav Kumar, Chaitanya Malaviya, Paul Michel, Yusuke Oda, Matthew Richardson, Naomi Saphra, Swabha Swayamdipta, and Pengcheng Yin. 2017. DyNet: The Dynamic Neural Network Toolkit. arXiv preprint arXiv:1701.03980.

Hiroshi Noji and Yusuke Miyao. 2016. Jigg: A Framework for an Easy Natural Language Processing Pipeline. In Proceedings of ACL-2016 System Demonstrations. Association for Computational Linguistics, pages 103–108. https://doi.org/10.18653/v1/P16-4018.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global Vectors for Word Representation. In Empirical Methods in Natural Language Processing (EMNLP). Pages 1532–1543. http://www.aclweb.org/anthology/D14-1162.

Carl Pollard and Ivan A. Sag. 1994. Head-driven Phrase Structure Grammar. University of Chicago Press.

Kenji Sagae, Yusuke Miyao, and Jun'ichi Tsujii. 2007. HPSG Parsing with Shallow Dependency Constraints. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics. Association for Computational Linguistics, pages 624–631. http://aclweb.org/anthology/P07-1079.

Mark Steedman. 2000. The Syntactic Process. The MIT Press.

Seiya Tokui, Kenta Oono, Shohei Hido, and Justin Clayton. 2015. Chainer: a Next-Generation Open Source Framework for Deep Learning. In Proceedings of the Workshop on Machine Learning Systems (LearningSys) in The Twenty-ninth Annual Conference on Neural Information Processing Systems (NIPS). http://learningsys.org/papers/LearningSys_2015_paper_33.pdf.

Kristina Toutanova, Dan Klein, Christopher D. Manning, and Yoram Singer. 2003. Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network. In Proceedings of the 2003 Human Language Technology Conference of the North American Chapter of the Association for Computational


Linguistics. http://www.aclweb.org/anthology/N03-1033.

Joseph Turian, Lev-Arie Ratinov, and Yoshua Bengio. 2010. Word Representations: A Simple and General Method for Semi-Supervised Learning. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, pages 384–394. http://aclweb.org/anthology/P10-1040.

Kiyotaka Uchimoto, Satoshi Sekine, and Hitoshi Isahara. 1999. Japanese Dependency Structure Analysis Based on Maximum Entropy Models. In Ninth Conference of the European Chapter of the Association for Computational Linguistics. http://aclweb.org/anthology/E99-1026.

Sumire Uematsu, Takuya Matsuzaki, Hiroki Hanaoka, Yusuke Miyao, and Hideki Mima. 2013. Integrating Multiple Dependency Corpora for Inducing Wide-coverage Japanese CCG Resources. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Sofia, Bulgaria, pages 1042–1051. http://www.aclweb.org/anthology/P13-1103.

Ashish Vaswani, Yonatan Bisk, Kenji Sagae, and Ryan Musa. 2016. Supertagging With LSTMs. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, pages 232–237. https://doi.org/10.18653/v1/N16-1027.

David Weiss, Chris Alberti, Michael Collins, and Slav Petrov. 2015. Structured Training for Neural Network Transition-Based Parsing. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Association for Computational Linguistics, pages 323–333. https://doi.org/10.3115/v1/P15-1032.

Wenduan Xu, Stephen Clark, and Yue Zhang. 2014. Shift-Reduce CCG Parsing with a Dependency Model. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, pages 218–227. https://doi.org/10.3115/v1/P14-1021.

Yao-zhong Zhang, Takuya Matsuzaki, and Jun'ichi Tsujii. 2010. A Simple Approach for HPSG Supertagging Using Dependency Information. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics. Association for Computational Linguistics, pages 645–648. http://aclweb.org/anthology/N10-1090.
