Fast Methods for Kernel-based Text Analysis


1

Fast Methods for Kernel-based Text Analysis

Taku Kudo (工藤 拓) and Yuji Matsumoto (松本 裕治), NAIST (Nara Institute of Science and Technology)

41st Annual Meeting of the Association for Computational Linguistics, Sapporo, Japan

2

Background

Kernel methods (e.g., SVM) have become popular: prior knowledge can be incorporated independently of the learning algorithm by supplying a task-dependent kernel (a generalized dot product), and they achieve high accuracy

3

Problem

Kernel-based text analyzers are too slow for real NL applications (e.g., QA or text mining) because of their inefficiency in testing; some kernel-based parsers run at only 2-3 seconds per sentence

4

Goals

Build fast but still accurate kernel-based text analyzers, making them usable in a wider range of NL applications

5

Outline

Polynomial Kernel of degree d
Fast Methods for the Polynomial Kernel: PKI, PKE
Experiments
Conclusions and Future Work

6

Outline

Polynomial Kernel of degree d
Fast Methods for the Polynomial Kernel: PKI, PKE
Experiments
Conclusions and Future Work

7

Kernel Methods

No need to represent examples as explicit feature vectors

f(X) = Σ_{i=1..L} α_i φ(X_i)·φ(X) = Σ_{i=1..L} α_i K(X_i, X)

Training data: T = {X_1, X_2, ..., X_L}

Complexity of testing is O(L·|X|)
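As a concrete illustration, here is a minimal Python sketch of the dual-form decision function above. The support vectors, weights, and cubic set-intersection kernel are illustrative toy values (the same ones used in the toy example later in the deck), not the output of a trained SVM.

```python
# Minimal sketch of a dual-form kernel classifier: f(X) = sum_i alpha_i * K(X_i, X).
# Toy support vectors and weights; K is the polynomial kernel over feature sets.

def K(x1, x2, d=3):
    """Polynomial kernel of degree d over sets: (|x1 & x2| + 1) ** d."""
    return (len(x1 & x2) + 1) ** d

def f(x, svs, alphas, d=3):
    """Decision function in dual form; testing costs O(L * |X|)."""
    return sum(a * K(sv, x, d) for sv, a in zip(svs, alphas))

svs = [{"a", "b", "c"}, {"a", "b", "d"}, {"b", "c", "d"}]
alphas = [1.0, 0.5, -2.0]
print(f({"a", "c", "e"}, svs, alphas))  # 1*27 + 0.5*8 - 2*8 = 15.0
```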

8

Kernels for Sets (1/3)

Focus on the special case where examples are represented as sets; instances in NLP are usually represented as sets (e.g., bag-of-words)

Feature set: F = {f_1, f_2, ..., f_N}

Training data: T = {X_1, X_2, ..., X_L}, where each X_j ⊆ F

9

Kernels for Sets (2/3)

X1 = {a, b, c, d},  X2 = {a, b, d, e}

Simple definition: K(X1, X2) = |X1 ∩ X2| = |{a, b, d}| = 3

Combinations (subsets) of features:
2nd order: {{a, b}, {a, d}, {b, d}}
3rd order: {{a, b, d}}

10

Kernels for Sets (3/3)

I ate a cake
PRP VBD DT NN

Is the (head, modifier) pair dependent (+1) or independent (-1)? (head: ate, modifier: cake)

X = { Head-word: ate, Head-POS: VBD, Modifier-word: cake, Modifier-POS: NN }

Heuristic selection:

X = { Head-word: ate, Head-POS: VBD, Modifier-word: cake, Modifier-POS: NN, Head-POS/Modifier-POS: VBD/NN, Head-word/Modifier-POS: ate/NN, ... }

Subsets (combinations) of basic features are critical to improving overall accuracy in many NL tasks; previous approaches select such combinations heuristically

11

Polynomial Kernel of degree d

Implicit form:

K_d(X1, X2) = (|X1 ∩ X2| + 1)^d,  d = 1, 2, 3, ...

Explicit form:

K_d(X1, X2) = Σ_{r=0..d} c_d(r)·|P_r(X1 ∩ X2)|

c_d(r) = Σ_{l=r..d} C(d, l) · Σ_{m=0..r} (-1)^{r-m} C(r, m) m^l

P_r(X) is the set of all subsets of X with exactly r elements
c_d(r) is the prior weight given to subsets of size r (subset weight)
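The subset weights c_d(r) can be computed directly from the explicit form. A short Python check (a sketch; `math.comb` is the binomial coefficient C(n, k)) recovers the cubic-kernel values used on the next slide:

```python
from math import comb

def c(d, r):
    """Subset weight c_d(r) = sum_{l=r..d} C(d,l) * sum_{m=0..r} (-1)^(r-m) C(r,m) m^l."""
    return sum(
        comb(d, l) * sum((-1) ** (r - m) * comb(r, m) * m ** l for m in range(r + 1))
        for l in range(r, d + 1)
    )

print([c(3, r) for r in range(4)])  # [1, 7, 12, 6]
```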

12

Example (Cubic Kernel, d = 3)

X1 = {a, b, c, d},  X2 = {a, b, d, e},  X1 ∩ X2 = {a, b, d}

Implicit form:

K_3(X1, X2) = (|X1 ∩ X2| + 1)^3 = (3 + 1)^3 = 64

Explicit form:

c_3(0) = 1,  P_0(X1 ∩ X2) = {∅}
c_3(1) = 7,  P_1(X1 ∩ X2) = {{a}, {b}, {d}}
c_3(2) = 12, P_2(X1 ∩ X2) = {{a, b}, {a, d}, {b, d}}
c_3(3) = 6,  P_3(X1 ∩ X2) = {{a, b, d}}

K_3(X1, X2) = 1·1 + 7·3 + 12·3 + 6·1 = 64

Subsets of size up to 3 are used as new features
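To check the arithmetic, a short script can verify that the implicit and explicit forms agree on this example (the helper `c` implements the explicit-form subset weight c_d(r); since only the size of the intersection matters, |P_r(X1 ∩ X2)| is just a binomial coefficient):

```python
from math import comb

# Verify implicit vs. explicit form of the cubic kernel on the slide's example.
# |P_r(X1 ∩ X2)| = C(|X1 ∩ X2|, r), the number of size-r subsets.

def c(d, r):
    """Subset weight c_d(r) from the explicit form."""
    return sum(comb(d, l) * sum((-1) ** (r - m) * comb(r, m) * m ** l
                                for m in range(r + 1))
               for l in range(r, d + 1))

x1, x2, d = {"a", "b", "c", "d"}, {"a", "b", "d", "e"}, 3
n = len(x1 & x2)                         # |X1 ∩ X2| = 3
implicit = (n + 1) ** d                  # (3 + 1)^3
explicit = sum(c(d, r) * comb(n, r) for r in range(d + 1))
print(implicit, explicit)  # 64 64
```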

13

Outline

Polynomial Kernel of degree d
Fast Methods for the Polynomial Kernel: PKI, PKE
Experiments
Conclusions and Future Work

14

Toy Example

Feature set: F = {a, b, c, d, e}

Support vectors (#SVs L = 3):

j   X_j          α_j
1   {a, b, c}    1
2   {a, b, d}    0.5
3   {b, c, d}    -2

Test example: X = {a, c, e}

Kernel: K_3(X_j, X) = (|X_j ∩ X| + 1)^3

15

PKB (Baseline)

j   X_j          α_j    K(X_j, X)
1   {a, b, c}    1      (2+1)^3
2   {a, b, d}    0.5    (1+1)^3
3   {b, c, d}    -2     (1+1)^3

K(X, X') = (|X ∩ X'| + 1)^3

Test example: X = {a, c, e}

f(X) = 1·(2+1)^3 + 0.5·(1+1)^3 - 2·(1+1)^3 = 27 + 4 - 16 = 15

Complexity is always O(L·|X|)

16

PKI (Inverted Representation)

j   X_j          α_j
1   {a, b, c}    1
2   {a, b, d}    0.5
3   {b, c, d}    -2

K(X, X') = (|X ∩ X'| + 1)^3

Inverted index (feature → SV ids):
a → {1, 2}
b → {1, 2, 3}
c → {1, 3}
d → {2, 3}

Test example: X = {a, c, e}

f(X) = 1·(2+1)^3 + 0.5·(1+1)^3 - 2·(1+1)^3 = 15

Average complexity is O(B·|X| + L), where B is the average size of the inverted-index lists; efficient if the feature space is sparse, which suits many NL tasks
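A minimal PKI sketch (hypothetical helper names, toy data from the slide): build the inverted index once, then accumulate the intersection counts only over the features that actually occur in the test example.

```python
from collections import defaultdict

# PKI sketch: an inverted index maps each feature to the support vectors that
# contain it, so |X_j ∩ X| is accumulated only over features present in X.

svs = [{"a", "b", "c"}, {"a", "b", "d"}, {"b", "c", "d"}]
alphas = [1.0, 0.5, -2.0]

index = defaultdict(list)                 # feature -> list of SV ids
for j, sv in enumerate(svs):
    for feat in sorted(sv):
        index[feat].append(j)

def f_pki(x, d=3):
    inter = defaultdict(int)              # SV id -> |X_j ∩ X|
    for feat in x:
        for j in index[feat]:
            inter[j] += 1
    # every SV contributes alpha_j * (|X_j ∩ X| + 1)^d, even with empty overlap
    return sum(a * (inter[j] + 1) ** d for j, a in enumerate(alphas))

print(f_pki({"a", "c", "e"}))  # 15.0
```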

17

PKE (Expanded Representation)

f(X) = Σ_{i=1..L} α_i K(X_i, X)
     = Σ_{i=1..L} α_i φ(X_i)·φ(X)
     = w·φ(X),  where w = Σ_{i=1..L} α_i φ(X_i)

Convert into linear form by calculating the vector w in advance; φ(X) projects X into its subset space

18

PKE (Expanded Representation)

K(X, X') = (|X ∩ X'| + 1)^3,  c_3(0) = 1, c_3(1) = 7, c_3(2) = 12, c_3(3) = 6

j   X_j          α_j
1   {a, b, c}    1
2   {a, b, d}    0.5
3   {b, c, d}    -2

W (expansion table):
∅          -0.5
{a}        10.5
{b}        -3.5
{c}        -7
{d}        -10.5
{a, b}     18
{a, c}     12
{a, d}     6
{b, c}     -12
{b, d}     -18
{c, d}     -24
{a, b, c}  6
{a, b, d}  3
{a, c, d}  0
{b, c, d}  -12

e.g., w({b, d}) = 12·(0.5 - 2) = -18

Test example: X = {a, c, e}; its subsets are {∅, {a}, {c}, {e}, {a, c}, {a, e}, {c, e}, {a, c, e}}

f(X) = -0.5 + 10.5 - 7 + 12 = 15

Complexity is O(|X|^d), independent of the number of SVs (L); efficient when the number of SVs is large
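A PKE sketch with the same toy data (names are illustrative): precompute the expansion table w over all subsets of the support vectors once, then classify by looking up only the subsets of the test example.

```python
from itertools import combinations

# PKE sketch: w maps each subset (a sorted tuple of features) to its weight
# c_d(|s|) * sum_i alpha_i * [s is a subset of X_i]; classification sums w
# over the subsets of the test example. Toy data and cubic kernel from the slide.

svs = [frozenset("abc"), frozenset("abd"), frozenset("bcd")]
alphas = [1.0, 0.5, -2.0]
c3 = (1, 7, 12, 6)                       # subset weights for d = 3

w = {}
for sv, a in zip(svs, alphas):
    for r in range(4):
        for s in combinations(sorted(sv), r):
            w[s] = w.get(s, 0.0) + c3[r] * a

def f_pke(x):
    """O(|X|^d) classification, independent of the number of SVs."""
    x = sorted(x)
    return sum(w.get(s, 0.0) for r in range(4) for s in combinations(x, r))

print(f_pke({"a", "c", "e"}))  # -0.5 + 10.5 - 7 + 12 = 15.0
```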

19

PKE in Practice

It is hard to calculate the expansion table exactly, so we use an approximated expansion table: subsets with smaller |w| can be removed, since |w| represents a contribution to the final classification. A subset-mining (a.k.a. basket-mining) algorithm makes this calculation efficient.

20

Subset Mining Problem

Transaction database:

id   set
1    {a, c, d}
2    {a, b, c}
3    {a, b, d}
4    {b, c, e}

Extract all subsets that occur in no fewer than σ sets of the transaction database

Results (σ = 2):
{a}:3  {b}:3  {c}:3  {d}:2  {a, b}:2  {b, c}:2  {a, c}:2  {a, d}:2

With no constraints on σ or subset size the problem is NP-hard, but efficient algorithms have been proposed (e.g., Apriori, PrefixSpan)
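To make the problem definition concrete, here is a brute-force enumeration (deliberately not an efficient miner like Apriori or PrefixSpan) that reproduces the slide's results for σ = 2:

```python
from collections import Counter
from itertools import combinations

# Brute-force subset mining, for illustration of the problem definition only:
# count the support of every non-empty subset of every transaction, then keep
# those whose support is at least sigma.

db = [{"a", "c", "d"}, {"a", "b", "c"}, {"a", "b", "d"}, {"b", "c", "e"}]
sigma = 2

support = Counter()
for t in db:
    for r in range(1, len(t) + 1):
        for s in combinations(sorted(t), r):
            support[s] += 1

frequent = {s: n for s, n in support.items() if n >= sigma}
print(frequent)  # the 8 subsets listed on the slide
```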

21

Feature Selection as Mining

• Can efficiently build the approximated table
• σ controls the rate of approximation

i   X_i          α_i
1   {a, b, c}    1
2   {a, b, d}    0.5
3   {b, c, d}    -2

Exhaustive generation and testing of all 15 entries of W is impractical!

Direct generation with subset mining (σ = 10) keeps only the entries with |w| ≥ σ:

{a}        10.5
{d}        -10.5
{a, b}     18
{a, c}     12
{b, c}     -12
{b, d}     -18
{c, d}     -24
{b, c, d}  -12
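The effect of σ can be mimicked by filtering the exact table. Note this is only to reproduce the slide's numbers: real PKE mines the kept entries directly instead of materializing the full table first. The toy table is recomputed from scratch so the snippet stands alone.

```python
from itertools import combinations

# Build the exact expansion table for the toy example, then prune entries
# whose weight magnitude falls below sigma (sigma = 10 as on the slide).

svs = [frozenset("abc"), frozenset("abd"), frozenset("bcd")]
alphas = [1.0, 0.5, -2.0]
c3 = (1, 7, 12, 6)

w = {}
for sv, a in zip(svs, alphas):
    for r in range(4):
        for s in combinations(sorted(sv), r):
            w[s] = w.get(s, 0.0) + c3[r] * a

sigma = 10
approx = {s: v for s, v in w.items() if abs(v) >= sigma}
print(sorted(approx))  # the 8 subsets kept on the slide
```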

22

Outline

Polynomial Kernel of degree d
Fast Methods for the Polynomial Kernel: PKI, PKE
Experiments
Conclusions and Future Work

23

Experimental Settings

Three NL tasks:
• English Base-NP Chunking (EBC)
• Japanese Word Segmentation (JWS)
• Japanese Dependency Parsing (JDP)

Kernel settings:
• Quadratic kernel for EBC
• Cubic kernel for JWS and JDP

24

Results (English Base-NP Chunking)

                 Time (sec./sent.)   Speedup ratio   F-score
PKB              .164                1.0             93.84
PKI              .020                8.3             93.84
PKE (σ=.01)      .0016               105.2           93.79
PKE (σ=.005)     .0016               101.3           93.85
PKE (σ=.001)     .0017               97.7            93.84
PKE (σ=.0005)    .0017               96.8            93.84

25

Results (Japanese Word Segmentation)

                 Time (sec./sent.)   Speedup ratio   Accuracy (%)
PKB              .85                 1.0             97.94
PKI              .49                 1.7             97.94
PKE (σ=.01)      .0024               358.2           97.93
PKE (σ=.005)     .0028               300.1           97.95
PKE (σ=.001)     .0034               242.6           97.94
PKE (σ=.0005)    .0035               238.8           97.94

26

Results (Japanese Dependency Parsing)

                 Time (sec./sent.)   Speedup ratio   Accuracy (%)
PKB              .285                1.0             89.29
PKI              .0226               12.6            89.29
PKE (σ=.01)      .0042               66.8            88.91
PKE (σ=.005)     .0060               47.8            89.05
PKE (σ=.001)     .0086               33.3            89.26
PKE (σ=.0005)    .0090               31.8            89.29

27

Results

PKI gives a 2- to 12-fold speed-up and PKE a 30- to 300-fold speed-up; accuracy is preserved when an appropriate σ is chosen

28

Comparison with Related Work

XQK [Isozaki et al. 02]:
• Same concept as PKE
• Designed only for the quadratic kernel
• Creates the expansion table exhaustively

PKE:
• Designed for general polynomial kernels
• Uses subset-mining algorithms to create the expansion table

29

Conclusions

Proposed two fast methods for the polynomial kernel of degree d: PKI (inverted) and PKE (expanded)

PKI gives a 2- to 12-fold speed-up and PKE a 30- to 300-fold speed-up, while preserving accuracy

30

Future Work

Examine the effectiveness on general machine-learning datasets

Apply PKE to other convolution kernels, e.g., the Tree Kernel [Collins 00]: a dot product between trees whose feature space is all sub-trees; apply a sub-tree mining algorithm [Zaki 02]

31

English Base-NP Chunking

Extract non-overlapping noun phrases from text:
[NP He ] reckons [NP the current account deficit ] will narrow to [NP only # 1.8 billion ] in [NP September ] .

BIO representation (viewed as a tagging task): B = beginning of chunk, I = non-initial chunk, O = outside

Pair-wise method for the 3-class problem

Training: WSJ sections 15-18, test: section 20 (the standard set)

32

Japanese Word Segmentation

Sentence:    太 郎 は 花 子 に 本 を 読 ま せ た   (Taro made Hanako read a book)
Boundaries:   ↑ ↑ ↑ ↑ ↑ ↑ ↑ ↑

Y_i = +1 if there is a word boundary between characters c_i and c_{i+1}, otherwise Y_i = -1

X_i = {c_{i-2}, c_{i-1}, c_i, c_{i+1}, c_{i+2}, c_{i+3}}

Distinguish the relative positions of the characters; also use the character types of the Japanese characters

Training: KUC 01-08, test: KUC 09

33

Japanese Dependency Parsing

私は   ケーキを   食べる     (I eat a cake)
I-top  cake-acc.  eat

Identify the correct dependency relations between two bunsetsu (roughly, base phrases)

Linguistic features related to the modifier and head (word, POS, POS-subcat, inflections, punctuation, etc.)

Binary classification (+1 dependent, -1 independent)

Cascaded Chunking Model [Kudo et al. 02]

Training: KUC 01-08, test: KUC 09

34

Kernel Methods (1/2)

Suppose a learning task g: X → {+1, -1} with training examples T = {X_1, ..., X_L}

g(X) = sgn(f(X)),  f(X) = Σ_{i=1..L} α_i φ(X_i)·φ(X)

X: the example to be classified
X_i: training examples
α_i: the weight for example i
φ: a function mapping examples to another vector space

35

PKE (Expanded Representation)

f(X) = Σ_{i=1..L} α_i Σ_{r=0..d} c_d(r)·|P_r(X_i ∩ X)|

If we calculate in advance, for every subset s ∈ Γ_d(F) = ∪_{r=0..d} P_r(F),

w(s) = c_d(|s|) · Σ_{i=1..L} α_i I(s ∈ P_{|s|}(X_i))

(I is the indicator function), then

f(X) = Σ_{s ∈ Γ_d(X)} w(s),  where Γ_d(X) = ∪_{r=0..d} P_r(X)

36

TRIE representation

Approximated expansion table (σ = 10):

{a}        10.5
{d}        -10.5
{a, b}     18
{a, c}     12
{b, c}     -12
{b, d}     -18
{c, d}     -24
{b, c, d}  -12

[Figure: the table stored as a TRIE rooted at ∅, one edge per feature, with each stored subset's weight on the node reached by its last (sorted) feature]

Compress redundant structures; classification can be done by simply traversing the TRIE
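A minimal trie sketch using plain dicts (not the compact TRIE implementation the slide implies): store each kept subset along its sorted feature path, then classify by a depth-first traversal restricted to features of X. On the σ = 10 table this returns the approximated score (22.5 here, versus the exact 15).

```python
# Trie over the sigma = 10 approximated table; each node is [weight, children].
table = {
    ("a",): 10.5, ("d",): -10.5, ("a", "b"): 18.0, ("a", "c"): 12.0,
    ("b", "c"): -12.0, ("b", "d"): -18.0, ("c", "d"): -24.0,
    ("b", "c", "d"): -12.0,
}

def build_trie(table):
    root = [0.0, {}]
    for subset, weight in table.items():      # subset keys are sorted tuples
        node = root
        for feat in subset:
            node = node[1].setdefault(feat, [0.0, {}])
        node[0] = weight                      # weight sits on the final node
    return root

def classify(trie, x):
    """Sum w(s) over every stored subset s of x via restricted DFS."""
    feats = sorted(x)
    total, stack = 0.0, [(trie, 0)]
    while stack:
        (weight, children), start = stack.pop()
        total += weight
        for i in range(start, len(feats)):    # visit features in sorted order
            if feats[i] in children:
                stack.append((children[feats[i]], i + 1))
    return total

trie = build_trie(table)
print(classify(trie, {"a", "c", "e"}))  # 10.5 + 12.0 = 22.5
```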
