An introduction to "Supervised Model Learning with Feature Grouping based on a Discrete Constraint" (Suzuki+, ACL 2013), 03/10/12, kensuke-mi

slides for "Supervised Model Learning with Feature Grouping based on a Discrete Constraint"


Page 1: slides for "Supervised Model Learning with Feature Grouping based on a Discrete Constraint"

An introduction to "Supervised Model Learning with Feature Grouping based on a Discrete Constraint" (Suzuki+, ACL 2013)

03/10/12 kensuke-mi

Page 2

Background

✦ In NLP tasks, machine-learned models easily become huge, because the weight vector grows with the number of features.

✦ Introducing an L1 regularization term makes the model smaller. But is that really the right trade-off?

✦ In general, for today's NLP tasks:
  L1 regularization: the model size becomes small; fast decoding is possible.
  L2 regularization: better accuracy than L1, but the model size blows up.


Page 3

Review

Excerpt from "Chapter 5: Statistical Machine Learning Algorithms and Their Applications" (scanned textbook page, cleaned up):

"... is far off the mark, a disappointing result. Note that least squares was used to produce this figure; this does not mean that least squares is a useless method. The problem in this example was assuming a 14th-degree polynomial as the model y = f(x). Because the model was given far too many degrees of freedom, the empirical loss became 0, but at the cost of a rather disastrous result that does not match human intuition. If the degree of the polynomial were lowered to around 2 or 4, least squares would work well.

Regularization: That said, whether a model's degrees of freedom are too high or too low can be seen from a graph in a simple example like Fig. 5.5, but in real problems a human cannot judge it by looking at the data. We therefore introduce the concept of regularization: a penalty imposed on model complexity. A complex model can fall into an over-fitted state like the middle panel of Fig. 5.4. In the over-fitted state it is often possible to drive the empirical loss to 0, but, as we just saw, an overly complex model is often weak on unseen data. To avoid this, a regularization term that penalizes model complexity is added to the objective function. Commonly used regularizers are L1 and L2 regularization, which are the sum of the absolute values of the parameter-vector elements and the sum of their squares, respectively. As formulas:

  L1 regularization: Ω(w) = C Σ_k |w_k|
  L2 regularization: Ω(w) = C Σ_k w_k²

With L2 regularization, the penalty becomes almost 0 as w_k approaches 0, whereas with L1 regularization the penalty is 0 only when the value is exactly 0. Hence L1 works more strongly to reduce the number of non-zero elements of the parameter vector. Put this way, L1 regularization sounds better, but ..."

With L2 regularization, the penalty becomes almost 0 as w_k approaches 0.

With L1 regularization, the penalty becomes 0 only when the value is exactly 0.

The L1 norm therefore works more strongly to reduce the number of non-zero elements of the parameter vector.
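The contrast in the three notes above is exactly why L1 yields sparse models. It can be seen directly by comparing the proximal operators of the two penalties; a minimal sketch (illustrative code, not from the slides or the paper):

```python
# Why L1 zeroes weights out while L2 only shrinks them: compare the
# proximal operators of lam*|w| (soft-thresholding) and (lam/2)*w^2.

def prox_l1(w, lam):
    # Soft-thresholding: any weight with |w| <= lam becomes exactly 0.
    if w > lam:
        return w - lam
    if w < -lam:
        return w + lam
    return 0.0

def prox_l2(w, lam):
    # Pure shrinkage: w is scaled toward 0 but never reaches it exactly.
    return w / (1.0 + lam)

lam = 0.5
for w in (0.25, -1.0, 2.0):
    print(w, "->", prox_l1(w, lam), "(L1) vs", prox_l2(w, lam), "(L2)")
# The small weight 0.25 maps to exactly 0.0 under L1 but only shrinks under L2.
```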

Page 4

The idea of this paper

✦ Group the values of the weight vector to reduce the amount of computation.

✦ This has the following merits:
  The model size can be kept down (i.e., lower model complexity).
  Over-fitting can be avoided.
  The stability of selecting non-zero features improves.
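The first merit can be made concrete: once many features share the same weight value, the model can be stored as a small codebook of unique values plus one small integer per feature. A hypothetical sketch (the function names are mine, not the paper's):

```python
# Hypothetical illustration: a grouped weight vector compresses losslessly
# into a codebook of unique values plus an integer code per feature.

def compress(weights):
    codebook = sorted(set(weights))                # unique weight values
    index = {v: i for i, v in enumerate(codebook)}
    codes = [index[w] for w in weights]            # small int per feature
    return codebook, codes

def decompress(codebook, codes):
    return [codebook[c] for c in codes]

# w1 from Sec. 2 of the paper: five features but only two unique values.
w1 = [0.1, 0.5, 0.1, 0.5, 0.1]
codebook, codes = compress(w1)
assert decompress(codebook, codes) == w1           # no information lost
print(len(codebook), "distinct values represent", len(w1), "features")
```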


Page 5

So, how is the grouping done?


Paper excerpt shown on the slide:

Supervised Model Learning with Feature Grouping based on a Discrete Constraint

Jun Suzuki and Masaaki Nagata
NTT Communication Science Laboratories, NTT Corporation
2-4 Hikaridai, Seika-cho, Soraku-gun, Kyoto, 619-0237 Japan
{suzuki.jun, nagata.masaaki}@lab.ntt.co.jp

Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, pages 18-23, Sofia, Bulgaria, August 4-9 2013. (c) 2013 Association for Computational Linguistics

Abstract

This paper proposes a framework of supervised model learning that realizes feature grouping to obtain lower complexity models. The main idea of our method is to integrate a discrete constraint into model learning with the help of the dual decomposition technique. Experiments on two well-studied NLP tasks, dependency parsing and NER, demonstrate that our method can provide state-of-the-art performance even if the degrees of freedom in trained models are surprisingly small, i.e., 8 or even 2. This significant benefit enables us to provide compact model representation, which is especially useful in actual use.

1 Introduction

This paper focuses on the topic of supervised model learning, which is typically represented as the following form of the optimization problem:

  ŵ = argmin_w { O(w; D) },
  O(w; D) = L(w; D) + Ω(w),   (1)

where D is supervised training data that consists of the corresponding input x and output y pairs, that is, (x, y) ∈ D. w is an N-dimensional vector representation of a set of optimization variables, which are also interpreted as feature weights. L(w; D) and Ω(w) represent a loss function and a regularization term, respectively. Nowadays, we, in most cases, utilize a supervised learning method expressed as the above optimization problem to estimate the feature weights of many natural language processing (NLP) tasks, such as text classification, POS-tagging, named entity recognition, dependency parsing, and semantic role labeling.

In the last decade, the L1-regularization technique, which incorporates the L1-norm into Ω(w), has become popular and widely-used in many NLP tasks (Gao et al., 2007; Tsuruoka et al., 2009). The reason is that L1-regularizers encourage feature weights to be zero as much as possible in model learning, which makes the resultant model a sparse solution (many zero-weights exist). We can discard all features whose weight is zero from the trained model[1] without any loss. Therefore, L1-regularizers have the ability to easily and automatically yield compact models without strong concern over feature selection.

Compact models generally have significant and clear advantages in practice: instances are faster loading speed to memory, less memory occupation, and even faster decoding is possible if the model is small enough to be stored in cache memory. Given this background, our aim is to establish a model learning framework that can reduce the model complexity beyond that possible by simply applying L1-regularizers. To achieve our goal, we focus on the recently developed concept of automatic feature grouping (Tibshirani et al., 2005; Bondell and Reich, 2008). We introduce a model learning framework that achieves feature grouping by incorporating a discrete constraint during model learning.

2 Feature Grouping Concept

Going beyond L1-regularized sparse modeling, the idea of 'automatic feature grouping' has recently been developed. Examples are fused lasso (Tibshirani et al., 2005), grouping pursuit (Shen and Huang, 2010), and OSCAR (Bondell and Reich, 2008). The concept of automatic feature grouping is to find accurate models that have fewer degrees of freedom. This is equivalent to enforcing every optimization variable to be equal as much as possible. A simple example is that ŵ1 = (0.1, 0.5, 0.1, 0.5, 0.1) is preferred over ŵ2 = (0.1, 0.3, 0.2, 0.5, 0.3) since ŵ1 and ŵ2 have two and four unique values, respectively.

There are several merits to reducing the degree of freedom. For example, previous studies clarified that it can reduce the chance of over-fitting to the training data (Shen and Huang, 2010). This is an important property for many NLP tasks since they are often modeled with a high-dimensional feature space, and thus, the over-fitting problem is readily triggered. It has also been reported that it can improve the stability of selecting non-zero features beyond that possible with the standard L1-regularizer given the existence of many highly correlated features (Jornsten and Yu, 2003; Zou and Hastie, 2005). Moreover, it can dramatically reduce model complexity. This is because we can merge all features whose feature weight values are equivalent in the trained model into a single feature cluster without any loss.

3 Modeling with Feature Grouping

This section describes our proposal for obtaining a feature grouping solution.

3.1 Integration of a Discrete Constraint

Let S be a finite set of discrete values, e.g., the set of integers from −4 to 4, that is, S = {−4, ..., −1, 0, 1, ..., 4}. The detailed discussion of how we define S can be found in our experiments section since it deeply depends on training data. Then, we define the objective that can simultaneously achieve feature grouping and model learning as follows:

  O(w; D) = L(w; D) + Ω(w)
  s.t. w ∈ S^N,   (2)

where S^N is the Cartesian power of the set S. The only difference with Eq. 1 is the additional discrete constraint, namely, w ∈ S^N. This constraint means that each variable (feature weight) in trained models must take a value in S, that is, w_n ∈ S, where w_n is the n-th factor of ŵ, and n ∈ {1, ..., N}. As a result, feature weights in trained models are automatically grouped in terms of the basis of model learning. This is the basic idea of feature grouping proposed in this paper.

However, a concern is how we can efficiently optimize Eq. 2 since it involves an NP-hard combinatorial optimization problem. The time complexity of the direct optimization is exponential in N. The next section introduces a feasible algorithm.

3.2 Dual Decomposition Formulation

Hereafter, we strictly assume that L(w; D) and Ω(w) are both convex in w. Then, the properties of our method are unaffected by the selection of L(w; D) and Ω(w). Thus, we ignore their specific definition in this section. Typical cases can be found in the experiments section. Then, we reformulate Eq. 2 by using the dual decomposition technique (Everett, 1963):

  O(w, u; D) = L(w; D) + Ω(w) + Υ(u)
  s.t. w = u, and u ∈ S^N.   (3)

Different from Eq. 2, Eq. 3 has an additional term Υ(u), which is similar to the regularizer Ω(w), and whose optimization variables w and u are tightened with the equality constraint w = u. Here, this paper only considers the case Υ(u) = (λ2/2)||u||_2^2 + λ1||u||_1, with λ2 ≥ 0 and λ1 ≥ 0[2]. This objective can also be viewed as the decomposition of the standard loss minimization problem shown in Eq. 1 and the additional discrete constraint regularizer by the dual decomposition technique.

To solve the optimization in Eq. 3, we leverage the alternating direction method of multipliers (ADMM) (Gabay and Mercier, 1976; Boyd et al., 2011). ADMM provides a very efficient optimization framework for the problem in the dual decomposition form. Here, α represents the dual variables for the equivalence constraint w = u. ADMM introduces the augmented Lagrangian term (ρ/2)||w − u||_2^2 with ρ > 0, which ensures strict convexity and increases robustness[3].

Finally, the optimization problem in Eq. 3 can be converted into a series of iterative optimization problems. A detailed derivation in the general case can be found in (Boyd et al., 2011). Fig. 1 shows the entire model learning framework of our proposed method. The remarkable point is that ADMM works by iteratively computing one of the three optimization variable sets w, u, and α while holding the other variables fixed in the iterations t = 1, 2, ... until convergence.

Step 1 (w-update): This part of the optimization problem shown in Eq. 4 is essentially Eq. 1 with a 'biased' L2-regularizer. 'Bias' means here that the direction of regularization is toward point a instead of the origin. Note that it becomes a standard L2-regularizer if a = 0. We can select any learning algorithm that can handle the L2-regularizer for this part of the optimization.

Step 2 (u-update): This part of the optimization problem shown in Eq. 5 can be rewritten in the ...

[1] This paper refers to a model after completion of (supervised) model learning as a "trained model".
[2] Note that this setting includes the use of only L1-, L2-, or no regularizers (L1 only: λ1 > 0 and λ2 = 0; L2 only: λ1 = 0 and λ2 > 0; without regularizer: λ1 = 0, λ2 = 0).
[3] Standard dual decomposition can be viewed as ρ = 0.

The original learning formulation.

L is the loss function, Ω the regularization term.

Define a set S of weight values; for example, for the range −4 to 4, S = {−4, −3, −2, ..., 3, 4}.

Each weight is then chosen from S; i.e., the weight vector is chosen from the Cartesian power S^N.
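As a sketch of what the constraint w ∈ S^N means (illustration only: the paper enforces the constraint during learning via the algorithm in Sec. 3.2, not by post-hoc rounding like this):

```python
# Illustration only: project a real-valued weight vector onto S^N by
# nearest-value rounding, so every coordinate takes a value in S.

S = list(range(-4, 5))        # S = {-4, -3, ..., 3, 4}

def project_onto_S(w):
    return [min(S, key=lambda s: abs(s - wn)) for wn in w]

w = [0.2, 1.7, -3.6, 0.4]
print(project_onto_S(w))      # -> [0, 2, -4, 0]: every weight is now in S
```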

Page 6

So, how is the grouping done?



In other words: build a set of weight values, and choose every weight from this set. This act itself is the grouping.

Page 7

However, this problem cannot be solved in the usual way.



Since the Cartesian power S^N is enormous, a combinatorial explosion occurs during optimization; in other words, it is an NP-hard problem.
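The size of that search space is easy to check: with |S| = 9 candidate values per weight and N features, brute force must consider 9^N assignments.

```python
# Direct optimization under the discrete constraint is exponential in N:
# |S| = 9 values per weight gives 9**N candidate weight vectors.
S_size = 9
for N in (10, 100, 1000):
    print(f"N={N}: 9**N has {len(str(S_size ** N))} digits")
# Even N=100 already gives a 96-digit count of candidates.
```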

So dual decomposition is introduced to solve it.

Page 8

Review of dual decomposition


Dual decomposition is, in short, "a way to solve an NP-hard problem by splitting it up".

1. argmax_y ( g(y) + h(y) ) is NP-hard, so it cannot be solved directly.
2. So decompose the problem as argmax_{z,y} ( g(z) + h(y) ) s.t. z = y.
3. Apply the Lagrangian method to the formulation in step 2 and define a new function L; let its optimum be L*. By the duality theorem, L* = min_u L(u).
4. L(u) is convex, so its optimum can be found with a gradient method, using the update u := u − μ(y* − z*).
5. When y* = z*, the problem in step 2 is solved.

For details, see http://research.preferred.jp/2010/11/dual-decomposition/
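Steps 1 to 5 above can be run on a toy problem. In this sketch g and h are made-up score tables over a three-value domain and mu is an arbitrary step size; with these values the two subproblems reach agreement after a couple of updates:

```python
# Toy dual decomposition following steps 1-5 above: maximize g(y) + h(y)
# by splitting it into two easy subproblems tied by the constraint z = y.
# g, h, and the step size mu are made-up values for illustration.

g = [1.0, 2.0, 0.0]    # g(y) for y in {0, 1, 2}
h = [0.0, 2.5, 3.0]    # h(y) for y in {0, 1, 2}
mu = 0.6               # gradient step size

u = 0.0                # Lagrange multiplier for the constraint z = y
for t in range(50):
    z_star = max(range(3), key=lambda z: g[z] - u * z)   # g-subproblem
    y_star = max(range(3), key=lambda y: h[y] + u * y)   # h-subproblem
    if z_star == y_star:               # step 5: agreement reached
        break
    u -= mu * (y_star - z_star)        # step 4: u := u - mu*(y* - z*)

# Agreement recovers the argmax of the original joint objective.
assert z_star == y_star == max(range(3), key=lambda y: g[y] + h[y])
print("solution y =", y_star)
```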

Page 9

Applying dual decomposition in this paper


Υ is a term similar to Ω (see Sec. 3.1). Υ can be almost anything, but this paper uses Υ(u) = (λ2/2)||u||_2^2 + λ1||u||_1, with λ2 ≥ 0 and λ1 ≥ 0.


Step2 (u-update): This part of the optimizationproblem shown in Eq. 5 can be rewritten in the

2Note that this setting includes the use of only L1-, L2-,or without regularizers (L1 only: �1>0 and �2=0, L2 only:�1=0 and �2>0, and without regularizer: �1=0,�2=0).

3Standard dual decomposition can be viewed as ⇢=0

19

The original objective is reformulated by dual decomposition; only the case Υ(u) = (λ₂/2)‖u‖₂² + λ₁‖u‖₁ is considered.

Page 10: slides for "Supervised Model Learning with Feature Grouping based on a Discrete Constraint "

Parameter Updates and Optimization

✦ To solve the decomposed formulation, the ADMM algorithm is used (reportedly a standard algorithm for dual decomposition).

✦ For the detailed parameter updates, see Sec. 3.1 (presumably gradient-style updates).

✦ The computational cost is kept to O(N log |S|).

✦ Online learning can be used inside ADMM for further speed-up (Sec. 3.3).
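The ADMM loop above can be sketched in a few lines. This is an illustrative toy, not the paper's implementation: it assumes a least-squares loss L(w; D) = ½‖Xw − y‖², synthetic data, and λ₁ = λ₂ = 0, so the u-update reduces to elementwise nearest-value quantization onto S.

```python
import numpy as np

# Synthetic data for the toy problem (assumption: squared loss, not a CRF).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
w_true = np.clip(np.round(rng.normal(scale=2.0, size=10)), -4, 4)
y = X @ w_true

S = np.arange(-4.0, 5.0)  # discrete candidate set S = {-4, ..., 4}

def w_update(u, alpha, rho):
    # Step 1: argmin_w 0.5*||Xw - y||^2 + (rho/2)*||w - (u - alpha)||^2,
    # i.e. a 'biased' L2-regularizer pulling w toward a = u - alpha.
    A = X.T @ X + rho * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ y + rho * (u - alpha))

def u_update(w, alpha):
    # Step 2: project w + alpha onto S, elementwise (nearest value in S).
    # With S sorted this is a binary search per coordinate: O(N log |S|).
    v = w + alpha
    return S[np.abs(v[:, None] - S[None, :]).argmin(axis=1)]

rho = 1.0
u = np.zeros(10)
alpha = np.zeros(10)
for t in range(50):
    w = w_update(u, alpha, rho)
    u = u_update(w, alpha)
    alpha += w - u  # dual variable update for the constraint w = u
```

After the loop, every entry of u lies in S, so the learned weights collapse into at most |S| groups.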


Page 11: slides for "Supervised Model Learning with Feature Grouping based on a Discrete Constraint "

Evaluation: Two Tasks, Two Axes

✦ Evaluated on two NLP tasks: a Named Entity Recognition task (NER) and a Dependency Parsing task (DEPAR).

✦ Task performance: Complete Sentence Accuracy (COMP, exact match over the whole sentence); F-sc (F-score) for NER; UAS (accuracy of unlabeled edges) for DEPAR.

✦ Model complexity: #nzF, the number of features whose corresponding weight is non-zero; #DoF, the number of unique non-zero weights.
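The two complexity measures are easy to compute from a weight vector; a toy illustration (weight values are made up):

```python
import numpy as np

# Toy weight vector (invented values) to illustrate the two measures.
w = np.array([0.0, 0.5, -0.8, 0.5, 2.0, 0.0, -0.8])

nzF = int(np.count_nonzero(w))        # #nzF: features with a non-zero weight
DoF = int(np.unique(w[w != 0]).size)  # #DoF: distinct non-zero weight values

# nzF == 5, DoF == 3: five active features share only three distinct values.
```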


Page 12: slides for "Supervised Model Learning with Feature Grouping based on a Discrete Constraint "

Definition of the Weight Set S (Sec. 4.1)

S may be defined freely, but the following template is generally best:

Our decoding models are the Viterbi algorithm on CRF (Lafferty et al., 2001) for NER and the second-order parsing model proposed by (Carreras, 2007) for DEPAR. Features are automatically generated according to the predefined feature templates widely used in previous studies. We also integrated the cluster features obtained by the method explained in (Koo et al., 2008) as additional features, to evaluate our method in the range of the current best systems.

Evaluation measures: The purpose of our experiments is to investigate the effectiveness of our proposed method in terms of both its performance and the complexity of the trained model. Therefore, our evaluation measures consist of two axes. Task performance was mainly evaluated in terms of complete sentence accuracy (COMP), since the objective of all model learning methods evaluated in our experiments is to maximize COMP. We also report the F_{β=1} score (F-sc) for NER, and the unlabeled attachment score (UAS) for DEPAR, for comparison with previous studies. Model complexity is evaluated by the number of non-zero active features (#nzF) and the degree of freedom (#DoF) (Zhong and Kwok, 2011). #nzF is the number of features whose corresponding feature weight is non-zero in the trained model, and #DoF is the number of unique non-zero feature weights.

Baseline methods: Our main baseline is L1-regularized sparse modeling. To cover both batch and online learning, we selected L1-regularized CRF (L1CRF) (Lafferty et al., 2001) optimized by OWL-QN (Andrew and Gao, 2007) for the NER experiment, and the L1-regularized regularized dual averaging (L1RDA) method (Xiao, 2010)⁴ for DEPAR. Additionally, we also evaluated L2-regularized CRF (L2CRF) with L-BFGS (Liu and Nocedal, 1989) for NER, and the passive-aggressive algorithm (L2PA) (Crammer et al., 2006)⁵ for DEPAR, since the L2-regularizer often provides better results than the L1-regularizer (Gao et al., 2007).

For a fair comparison, we applied the procedure of Step 2 as a simple quantization method to the trained models obtained from L1-regularized model learning, which we refer to as (QT).

⁴RDA provided better results, at least in our experiments, than L1-regularized FOBOS (Duchi and Singer, 2009) and its variant (Tsuruoka et al., 2009), which are more familiar to the NLP community.

⁵L2PA is also known as a loss-augmented variant of one-best MIRA, well known in DEPAR (McDonald et al., 2005).

4.1 Configurations of Our Method

Base learning algorithm: The settings of our method in our experiments imitate the L1-regularized learning algorithms, since the purpose of our experiments is to investigate the effectiveness against standard L1-regularized learning algorithms. We then have the following two possible settings. DC-ADMM: we leveraged the baseline L1-regularized learning algorithm to solve Step 1, and set λ₁ = 0 and λ₂ = 0 for Step 2. DCwL1-ADMM: we leveraged the baseline L2-regularized learning algorithm, but without the L2-regularizer, to solve Step 1, and set λ₁ > 0 and λ₂ = 0 for Step 2. The difference can be found in the objective function O(w, u; D) shown in Eq. 3:

(DC-ADMM):    O(w, u; D) = L(w; D) + λ₁‖w‖₁
(DCwL1-ADMM): O(w, u; D) = L(w; D) + λ₁‖u‖₁

In other words, DC-ADMM utilizes the L1-regularizer as part of the base learning algorithm, Ω(w) = λ₁‖w‖₁, while DCwL1-ADMM discards the regularizer Ω(w) of the base learning algorithm and instead introduces Υ(u) = λ₁‖u‖₁. Note that these two configurations are essentially identical, since the objectives are identical even though the formulations and algorithms differ. For space reasons we only report the results of DC-ADMM, since the results of DCwL1-ADMM were nearly equivalent.

Definition of S: DC-ADMM can utilize any finite set for S. However, we have to select it carefully since it deeply affects performance; this is in fact the most important design decision of our method. We preliminarily investigated several settings. Here, we introduce an example of a template that is suitable for large feature sets. Let η, δ, and κ represent non-negative real-valued constants, ζ a positive integer, Σ = {−1, 1}, and a function f_{η,δ,κ}(x, y) = y(ηκ^x + δ). Then we define a finite set of values S as follows:

S_{η,δ,κ,ζ} = {f_{η,δ,κ}(x, y) | (x, y) ∈ S_ζ × Σ} ∪ {0},

where S_ζ is the set of non-negative integers from zero to ζ − 1, that is, S_ζ = {m}_{m=0}^{ζ−1}. For example, if we set η = 0.1, δ = 0.4, κ = 4, and ζ = 3, then S_{η,δ,κ,ζ} = {−2.0, −0.8, −0.5, 0, 0.5, 0.8, 2.0}. The intuition behind this template is that the distribution of the feature weights in a trained model often takes a form similar to that of a 'power law' in the case of large feature sets. Therefore, using an exponential function with a scale and a bias seems appropriate for fitting them.


Here η, κ, and δ are non-negative reals, ζ is a positive integer, and S_ζ is the set of integers from 0 to ζ − 1.

Rationale for the formula above: the distribution of feature weights in trained models tends to follow a power law, so an exponential function is used for the fit.

Incidentally, #DoF can be controlled via ζ.
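The template above can be sketched as a small generator (the function name make_S is ours, not the paper's):

```python
def make_S(eta, delta, kappa, zeta):
    # S_{eta,delta,kappa,zeta} = { y * (eta * kappa**x + delta) :
    #   x in {0, ..., zeta-1}, y in {-1, +1} } ∪ {0}
    pos = [eta * kappa ** x + delta for x in range(zeta)]
    return sorted([-v for v in pos] + [0.0] + pos)

# The paper's example: eta=0.1, delta=0.4, kappa=4, zeta=3
# gives S = {-2.0, -0.8, -0.5, 0, 0.5, 0.8, 2.0}.
print(make_S(0.1, 0.4, 4, 3))
```

Note that |S| = 2ζ + 1, so ζ directly bounds the number of weight groups.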

Page 13: slides for "Supervised Model Learning with Feature Grouping based on a Discrete Constraint "

Experimental Results

[Figure 3: Performance vs. degree of freedom in the trained model for the development data. Both panels plot complete sentence accuracy against #DoF (log-scale): (a) NER, comparing DC-ADMM, L1CRF (w/ QT, quantized), L1CRF, and L2CRF; (b) DEPAR, comparing DC-ADMM, L1RDA (w/ QT, quantized), L1RDA, and L2PA.]

Note that we can control the upper bound of #DoF in the trained model by ζ: if ζ = 4, then the upper bound of #DoF is 8 (doubled to cover the positive and negative sides). We fixed ρ = 1, ξ = 1, λ₂ = 0, κ = 4 (or 2 if ζ ≥ 5), and δ = η/2 in all experiments. Thus the only tunable parameter in our experiments is η for each ζ.

4.2 Results and Discussions

Fig. 3 shows the task performance on the development data against model complexity in terms of the degrees of freedom in the trained models. Plots are given by changing the ζ value for DC-ADMM and the L1-regularized methods with QT. The plots of the standard L1-regularized methods are given by changing the regularization constant λ₁. Moreover, Table 1 shows the final results of our experiments on the test data. The tunable parameters were fixed at the values that provided the best performance on the development data.

According to the figure and table, the most remarkable point is that DC-ADMM successfully maintained the task performance even when #DoF (the degree of freedom) was 8, and the performance drop-offs were surprisingly limited even when #DoF was 2, which is the upper bound of feature grouping. Moreover, it is worth noting that the DC-ADMM performance sometimes even improved; the reason may be that such low degrees of freedom prevent over-fitting to the training data. Surprisingly, the simple quantization method (QT) provided fairly good results. However, we emphasize that the models produced by the QT approach offer no guarantee as to the optimal solution. In contrast, DC-ADMM can truly provide the optimal solution of Eq. 3, since the discrete constraint is also considered during model learning.

Table 1: Comparison results of the methods on test data (K: thousand, M: million)

NER               COMP   F-sc  | #nzF   #DoF
L2CRF             84.88  89.97 | 61.6M  38.6M
L1CRF             84.85  89.99 | 614K   321K
  (w/ QT ζ=4)     78.39  85.33 | 568K   8
  (w/ QT ζ=2)     73.40  81.45 | 454K   4
  (w/ QT ζ=1)     65.53  75.87 | 454K   2
DC-ADMM (ζ=4)     84.96  89.92 | 643K   8
        (ζ=2)     84.04  89.35 | 455K   4
        (ζ=1)     83.06  88.62 | 364K   2

DEPAR             COMP   UAS   | #nzF   #DoF
L2PA              49.67  93.51 | 15.5M  5.59M
L1RDA             49.54  93.48 | 7.76M  3.56M
  (w/ QT ζ=4)     38.58  90.85 | 6.32M  8
  (w/ QT ζ=2)     34.19  89.42 | 3.08M  4
  (w/ QT ζ=1)     30.42  88.67 | 3.08M  2
DC-ADMM (ζ=4)     49.83  93.55 | 5.81M  8
        (ζ=2)     48.97  93.18 | 4.11M  4
        (ζ=1)     46.56  92.86 | 6.37M  2

In general, a trained model consists of two parts: the feature weights, and an indexed structure of feature strings used as the keys for obtaining the corresponding feature weights. This paper mainly discussed how to reduce the size of the former part, and described its successful reduction. We note that it is also possible to reduce the latter part, especially if the feature-string structure is a TRIE. We omit the details here since it is not the main topic of this paper, but by merging feature strings that have the same feature weights, the size of the entire trained model in our DEPAR case can be reduced to about 10 times smaller than that obtained by standard L1-regularization, i.e., to 12.2 MB from 124.5 MB.
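The merging idea can be sketched as follows (feature names are invented for illustration): once weights are grouped, the model need store only one float per group, and all feature strings sharing a weight can point at the same stored value.

```python
from collections import defaultdict

# Invented feature names with weights drawn from a small grouped set.
weights = {
    "w[-1]=the": 0.5,
    "w[0]=bank": 0.8,
    "pos[0]=NN": 0.5,
    "suffix=ing": -0.8,
}

# Invert the mapping: weight value -> features sharing that value.
groups = defaultdict(list)
for feat, wt in weights.items():
    groups[wt].append(feat)

# Four features collapse into three weight groups, so the weight table
# stores 3 floats instead of 4; the saving grows with model size.
```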

5 Conclusion

This paper proposed a model learning framework that can simultaneously realize feature grouping by incorporating a simple discrete constraint into the model learning optimization. The paper also introduced a feasible algorithm, DC-ADMM, which removes the infeasible combinatorial optimization part from the entire learning algorithm with the help of the ADMM technique. Experiments showed that DC-ADMM drastically reduced model complexity, in terms of the degrees of freedom of the trained models, while maintaining performance. There may exist theoretically cleverer approaches to feature grouping, but the performance of DC-ADMM is close to the upper bound. We believe our method, DC-ADMM, to be very useful for actual use.


On both NER and DEPAR, the method achieved accuracy on par with the baselines while keeping model complexity low.

Page 14: slides for "Supervised Model Learning with Feature Grouping based on a Discrete Constraint "

Summary

✦ Machine learning models for NLP tasks tend to be highly complex because of their huge number of feature weights.

✦ To curb this complexity, the feature weights are grouped.

✦ The NP-hard optimization problem that grouping introduces is solved via dual decomposition.

✦ On the Named Entity Recognition and Dependency Parsing tasks, model complexity was reduced while accuracy was maintained.