
Semi-supervised Bibliographic Element Segmentation with Latent Permutations


Tomonari MASADA ( 正田 備也 ), Nagasaki University ( 長崎大學 )

[email protected]

T. Masada @ ICADL 2011


PROBLEM


Zouhaier Brahmia Rafik Bouaziz Schema Versioning in Multi-temporal XML Databases. ACIS-ICIS 2008
Xiang Zhang Yakup Genc Bootstrapped Real-Time Ego Motion Estimation and Scene Modeling. 3DIM 2005
Takaki Yoshida Masafumi Watari MD-SCAN Method for Low Power Scan Testing. Asian Test Symposium 2002
Yan Zhao Enterprise Service Oriented Architecture (ESOA) Adoption Reference. IEEE SCC 2006
Rich Jochems Shane Rodgers The Rollercoaster of Required Agile Transition. AGILE 2007
Weigen Qiu Zhibin Hu Composed Fuzzy Rough Set and Its Applications in Fuzzy RSAR. APPT 2007
Guilherme Bittencourt Isabel Tonin A Proof Strategy Based on a Dual Representation. AISC 2000
James A. Kupsch Barton P. Miller How to Open a File and Not Get Hacked. ARES 2008
Claire Grover Alex Lascarides XML-Based Data Preparation for Robust Deep Parsing. ACL 2001
Yuanlin Zhang Roland H. C. Yap Incrementally Solving Functional Constraints. AAAI/IAAI 2002
Gerald Quirchmayr Survivability and Business Continuity Management. ACSW Frontiers 2004
Witold Dzwinel A Cellular Automata Model of Population Infected by Periodic Plague. ACRI 2004
Martin Buchwitz The IDA Standard. The Industrial Information Technology Handbook 2005
Olivier Gutknecht Jacques Ferber MadKit: a generic multi-agent platform. Agents 2000
Hakan Özadam Ferruh Özbudak The Minimum Hamming Distance of Cyclic Codes of Length 2ps. AAECC 2009
Riccardo Torlone Conceptual Multidimensional Models. Multidimensional Databases 2003
Jixue Liu Chengfei Liu A Declarative Way of Extracting XML Data in XSL. ADBIS 2002
Min-Jung Yoo An industrial application of agents for dynamic planning and scheduling. AAMAS 2002
David Fang Rajit Manohar Non-Uniform Access Asynchronous Register Files. ASYNC 2004
Carsten Ullrich Tutorial Planning: Adapting Course Generation to Today's Needs. AIED 2005
Po-Jen Chuang Shien-Da Chang Performance Analysis on Location Tracking in PCS Networks. AINA 2005
Hee Seo Seon Wook Kim OpenMP Directive Extension for BlackFin 561 Dual Core Processor. CIT 2006
V. Benjamin Livshits John Whaley Monica S. Lam Reflection Analysis for Java. APLAS 2005

[The same reference list, with ■ ■ ■ following each reference.]

HMM-based approach [Takasu+ 03]

CRF-based approach [Peng+ 04]

CORA dataset: 500 references (350 for training, 150 for test)


PaiFen


Bayesian Topic Modeling

• “Topic Modeling”
  – Document as a mixture of multiple topics

• “Bayesian”
  – (Posterior) ∝ (Likelihood) × (Prior)


LDA (Latent Dirichlet allocation)

• More effective than PLSI
  – PLSI often suffers from overfitting.

• Easy to implement
  – MCMC can be implemented with only the four arithmetic operations (+, −, ×, ÷).

• Easy to parallelize
  – Inference is scalable to massive data sets.
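The "easy to implement" point can be illustrated with a collapsed Gibbs sampler for LDA, where each token's topic is redrawn from weights built from count tables with nothing beyond basic arithmetic. This is a minimal sketch and not the code behind these slides; the function name `lda_gibbs`, the hyperparameter defaults, and the integer word-id input format are assumptions.

```python
import random


def lda_gibbs(docs, K, W, alpha=0.1, beta=0.01, iters=50, seed=0):
    """Collapsed Gibbs sampling for LDA.

    docs: list of documents, each a list of word ids in [0, W).
    Returns the topic assignments and the two count tables.
    """
    rng = random.Random(seed)
    n_dk = [[0] * K for _ in docs]      # tokens of document d assigned to topic k
    n_kw = [[0] * W for _ in range(K)]  # times word w is assigned to topic k
    n_k = [0] * K                       # total tokens assigned to topic k
    z = []

    # Random initialization of all topic assignments.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(K)
            z_d.append(k)
            n_dk[d][k] += 1
            n_kw[k][w] += 1
            n_k[k] += 1
        z.append(z_d)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current assignment from the counts.
                k = z[d][i]
                n_dk[d][k] -= 1
                n_kw[k][w] -= 1
                n_k[k] -= 1
                # Unnormalized conditional: only +, -, x, / on counts.
                weights = [
                    (n_dk[d][t] + alpha) * (n_kw[t][w] + beta) / (n_k[t] + W * beta)
                    for t in range(K)
                ]
                k = rng.choices(range(K), weights=weights)[0]
                # Record the new assignment.
                z[d][i] = k
                n_dk[d][k] += 1
                n_kw[k][w] += 1
                n_k[k] += 1
    return z, n_dk, n_kw
```

The count tables are the only state, which is also why the sampler lends itself to being split across documents.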


Segmentation by Topics

topic = bibliographic element <interpretation>

MadKit: a generic multi-agent platform. Olivier Gutknecht Jacques Ferber 2000 Agents

(1,1,1,1,1,0,0,0,3,2,2)
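Reading a topic assignment vector as a segmentation amounts to grouping consecutive tokens that share a topic. The sketch below shows that grouping; the helper name `segments` and the whitespace tokenization are assumptions, so the token alignment may differ from the one used on the slide.

```python
from itertools import groupby


def segments(tokens, topics):
    """Group consecutive tokens that share a topic id into (topic, tokens) runs."""
    if len(tokens) != len(topics):
        raise ValueError("tokens and topics must have the same length")
    pairs = zip(topics, tokens)
    return [
        (topic, [token for _, token in run])
        for topic, run in groupby(pairs, key=lambda pair: pair[0])
    ]


# Whitespace tokenization is an assumption made for this illustration.
reference = "MadKit: a generic multi-agent platform. Olivier Gutknecht Jacques Ferber 2000 Agents"
assignment = [1, 1, 1, 1, 1, 0, 0, 0, 3, 2, 2]
for topic, words in segments(reference.split(), assignment):
    print(topic, " ".join(words))
```

Each printed line is one segment, that is, one candidate bibliographic element.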


Beyond LDA

• Contiguity constraint

zd = (3,3,1,3,1,2,0,2,2,0)

→ zd = (3,3,3,1,1,2,2,2,0,0)

We arrange the order of topic assignments!


Latent Permutations [Chen et al. 09]

• Draw a permutation πd from a distribution called the generalized Mallows model

• Arrange the order of topic assignments based on πd

(K=4): (1,2,3,0), (1,3,2,0), (2,1,3,0), (1,2,0,3), (3,2,1,0), (1,0,2,3), (2,0,1,3), (3,1,0,2), (2,1,0,3), (3,2,0,1)
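A common way to draw from a generalized Mallows model is through inversion counts: for each topic j, draw v_j (the number of larger topics placed before j) with probability proportional to exp(−ρ_j v_j), then rebuild the permutation by insertion. The sketch below assumes that representation, takes the identity ordering (0, 1, ..., K−1) as the center, and uses hypothetical function names; the slides do not spell out these details.

```python
import math
import random


def sample_inversion_counts(K, rho, rng):
    """Draw v_0..v_{K-2}, where v_j ranges over 0..K-1-j and p(v_j) is proportional to exp(-rho[j] * v_j)."""
    counts = []
    for j in range(K - 1):
        max_v = K - 1 - j
        weights = [math.exp(-rho[j] * r) for r in range(max_v + 1)]
        counts.append(rng.choices(range(max_v + 1), weights=weights)[0])
    return counts


def inversions_to_permutation(counts, K):
    """Rebuild the permutation: insert topics from largest to smallest.

    When topic j is inserted, every topic already in the list is larger,
    so placing j at index counts[j] puts exactly counts[j] larger topics before it.
    """
    perm = [K - 1]
    for j in range(K - 2, -1, -1):
        perm.insert(counts[j], j)
    return perm


rng = random.Random(0)
v = sample_inversion_counts(4, [1.0, 1.0, 1.0], rng)
print(v, inversions_to_permutation(v, 4))
```

All-zero inversion counts give the center ordering, and larger values of ρ concentrate the draws around it.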


MadKit: a generic multi-agent platform. Olivier Gutknecht Jacques Ferber 2000 Agents

(1,1,1,1,1,0,0,0,3,2,2)

{1,1,0,0,3,2,0,1,1,1,2}: topic assignments

(1,0,3,2): permutation
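The arrangement step can be sketched as follows: count how many tokens each topic received, then emit the topics in permutation order, each repeated by its count. The function name `arrange` is an assumption; the input values are the ones shown on the slide.

```python
from collections import Counter


def arrange(topic_bag, permutation):
    """Order an unordered bag of topic assignments into contiguous blocks."""
    counts = Counter(topic_bag)
    missing = set(counts) - set(permutation)
    if missing:
        raise ValueError(f"topics not covered by the permutation: {sorted(missing)}")
    ordered = []
    for topic in permutation:
        # Topics that received no tokens contribute an empty block.
        ordered.extend([topic] * counts[topic])
    return ordered


print(arrange([1, 1, 0, 0, 3, 2, 0, 1, 1, 1, 2], [1, 0, 3, 2]))
# [1, 1, 1, 1, 1, 0, 0, 0, 3, 2, 2]
```

Because every topic forms one block, the result always satisfies the contiguity constraint.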


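Inference Implementation

[Equations lost in extraction. The surviving fragments show a conditional for each v_jk proportional to exp(...) times ratios of count terms (n, W); a normalizing term of the form (1 − exp(...)) / (1 − exp(...)), annotated "Slice sampling"; and a conditional p(t_ji = k | ...) built from the same count terms.]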


Inference Details


PERFORMANCE OF PaiFen


Synthetic Data

Our datasets

            # docs      # paragraphs   # vocabs    # tokens
DBLP        944,755     17,408,876     685,799     17,408,876
MEDLINE     3,001,207   87,085,708     2,168,061   87,085,708

[Chen et al. 09]

            # docs      # paragraphs   # vocabs    # tokens
Cities      100         6,670          41,978      492,402
Elements    118         2,810          18,008      191,762

<inproceedings mdate="2004-10-27" key="conf/3dpvt/Dobbins04">
  <author>A. Dobbins</author>
  <title>Color, Fusion, and Stereopsis.</title>
  <pages>705</pages>
  <year>2004</year>
  <crossref>conf/3dpvt/2004</crossref>
  <booktitle>3DPVT</booktitle>
  <url>db/conf/3dpvt/3dpvt2004.html#Dobbins04</url>
</inproceedings>

A. Dobbins
Color, Fusion, and Stereopsis.
2004
3DPVT
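Pulling the bibliographic elements out of a DBLP record like the one above can be sketched with the standard library XML parser. The choice of fields (author, title, year, booktitle) mirrors what the slide shows and is otherwise an assumption, as is the helper name `extract_elements`; the full DBLP dump would additionally need its DTD entities handled.

```python
import xml.etree.ElementTree as ET

RECORD = (
    '<inproceedings mdate="2004-10-27" key="conf/3dpvt/Dobbins04">'
    "<author>A. Dobbins</author>"
    "<title>Color, Fusion, and Stereopsis.</title>"
    "<pages>705</pages><year>2004</year>"
    "<crossref>conf/3dpvt/2004</crossref>"
    "<booktitle>3DPVT</booktitle>"
    "<url>db/conf/3dpvt/3dpvt2004.html#Dobbins04</url>"
    "</inproceedings>"
)


def extract_elements(xml_text, fields=("author", "title", "year", "booktitle")):
    """Return (element name, text) pairs for the chosen fields, in the given order."""
    root = ET.fromstring(xml_text)
    pairs = []
    for field in fields:
        # findall keeps every occurrence, e.g. several <author> children.
        for node in root.findall(field):
            pairs.append((field, (node.text or "").strip()))
    return pairs


for name, text in extract_elements(RECORD):
    print(name, "->", text)
```

The element names double as ground-truth labels for evaluating the segmentation.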


Synthesizing “Noisy” Data

• Random shuffle of bibliographic elements
  – “D20” : random shuffle for 20% of the references
  – “D50” : random shuffle for 50% of the references

A. Dobbins Color, Fusion, and Stereopsis. 2004 3DPVT
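The noise synthesis can be sketched as picking a fixed fraction of references at random and shuffling the element order within each of them. The function name `add_noise`, the seeded generator, and the rounding of the fraction are assumptions; note that a random shuffle can leave a reference in its original order by chance.

```python
import random


def add_noise(references, fraction, seed=0):
    """Shuffle the element order in a random `fraction` of the references.

    references: list of references, each a list of element strings.
    Returns the noisy copy and the set of indices that were shuffled.
    """
    rng = random.Random(seed)
    n_shuffled = round(fraction * len(references))
    chosen = set(rng.sample(range(len(references)), n_shuffled))
    noisy = []
    for index, elements in enumerate(references):
        elements = list(elements)  # copy so the input stays untouched
        if index in chosen:
            rng.shuffle(elements)
        noisy.append(elements)
    return noisy, chosen


refs = [["A. Dobbins", "Color, Fusion, and Stereopsis.", "2004", "3DPVT"]] * 10
d20, shuffled = add_noise(refs, 0.20)
print(len(shuffled), "of", len(refs), "references shuffled")
```

Calling it with 0.20 and 0.50 corresponds to the "D20" and "D50" settings.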


PaiFen [Masada 10]

~ 82% accuracy

BanPaiFen


Supervised Topic Models

• No straightforward procedures for using supervised signals in topic models
  – sLDA [Blei+ 07]
  – DiscLDA [Lacoste-Julien+ 08] ... gradient ascent
  – Labeled LDA [Ramage+ 09]
  – MedLDA [Zhu+ 09] ... O(K^3)

• Our "simple" procedure
  – Modify posterior probabilities with penalties


Penalties

• R_j^mis: # mismatches between the supervised label and the inferred topic in the reference d_j

• R_j^red: # redundant assignments in d_j
  e.g. If three tokens are assigned to the topic corresponding to the "year" element, then two among the three are redundant.
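The two penalty counts can be sketched as below. Several details are assumptions made for illustration: unlabeled tokens are marked with `None`, topics are already identified with elements by integer id, and the set of topics that should receive a single token (the slide only gives "year" as an example) is passed in explicitly.

```python
from collections import Counter


def mismatch_penalty(inferred, labels):
    """Count tokens whose supervised label disagrees with the inferred topic.

    labels[i] is None for tokens that carry no supervised label.
    """
    return sum(
        1 for topic, label in zip(inferred, labels)
        if label is not None and topic != label
    )


def redundancy_penalty(inferred, single_token_topics):
    """Count assignments beyond the first for topics expected to cover one token."""
    counts = Counter(inferred)
    return sum(max(counts[topic] - 1, 0) for topic in single_token_topics)


YEAR = 2  # hypothetical topic id standing for the "year" element
inferred = [0, 0, 1, 1, 2, 2, 2]
labels = [None, 0, None, 1, 2, None, 1]
print(mismatch_penalty(inferred, labels))      # 1
print(redundancy_penalty(inferred, {YEAR}))    # 2
```

The second value reproduces the slide's example: three tokens on the year topic give two redundant assignments.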


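Semi-supervised Inference

• Modify posterior probabilities with penalties

\[ p(v_{jk}^{\mathrm{new}} \mid \cdots) \;\propto\; (\text{unpenalized conditional}) \times \exp\!\left(-\frac{I}{M}\, R_j^{\mathrm{mis}}\right) \]

\[ p(t_{ji} = k \mid \cdots) \;\propto\; (\text{unpenalized conditional}) \times \exp\!\left(-\frac{I}{M}\,\bigl(R_j^{\mathrm{mis}} + R_j^{\mathrm{red}}\bigr)\right) \]

I : # iterations already performed

M : some constant to be determined

[The unpenalized conditionals are not recoverable from the extracted text; the negative sign in the exponents is inferred from the term "penalties".]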
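Reading the penalty factor as exp(−I·R/M) (an assumption, since the slide's equation is damaged in this transcript), the modification can be sketched as a reweighting of the unnormalized conditional before sampling. The function names and the way candidate penalties are supplied are hypothetical.

```python
import math
import random


def penalized_weights(base_weights, penalties, iteration, M):
    """Scale each candidate's unnormalized weight by exp(-iteration * penalty / M).

    base_weights[c]: unpenalized weight of candidate c.
    penalties[c]: penalty count (e.g. R_mis, or R_mis + R_red) if candidate c is chosen.
    """
    return [
        weight * math.exp(-iteration * penalty / M)
        for weight, penalty in zip(base_weights, penalties)
    ]


def sample_index(weights, rng):
    """Draw a candidate index with probability proportional to its weight."""
    return rng.choices(range(len(weights)), weights=weights)[0]


rng = random.Random(0)
weights = penalized_weights([1.0, 1.0, 1.0], [0, 1, 3], iteration=20, M=10.0)
print(weights, sample_index(weights, rng))
```

With I in the exponent the penalty is mild in early iterations and grows as sampling proceeds, so candidates that contradict the supervised labels are ruled out gradually rather than from the start.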


Labeling Dictionary

"Cheol-Hee" --> author

"Larrosa" --> author

"Endangered" --> title

"Violation" --> title

"Forschungen" --> journal

"2001" --> year

...


Labeling Procedure

• Prepare a special set of references
  – DBLP database
    • References up to 1999 (399,497)
      – References from 2000 onward are used for inference (944,755)
  – MEDLINE database
    • 100 XML files
      – different from the 100 XML files for inference

• Compose a "labeling dictionary" by using the set


How to Obtain Dictionary Entries

• author, title, journal
  – Words that appear in author/title/journal element and never appear in the other elements
    <-- We'd like to remove ambiguity.

• year
  – Integers from 1900 to 2012

• pages
  – Tokens matching [1-9][0-9]*\-[0-9]*
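The rules above can be sketched as a small dictionary builder plus a token labeler. The input format (references as lists of (element, text) pairs), whitespace tokenization, full-token regex matching, and the precedence of the year and pages rules over dictionary lookup are all assumptions; the toy data is made up around words shown on the slides.

```python
import re
from collections import defaultdict

PAGES_RE = re.compile(r"[1-9][0-9]*\-[0-9]*")  # pattern as given on the slide
YEAR_RE = re.compile(r"[0-9]{4}")
WORD_ELEMENTS = ("author", "title", "journal")


def build_dictionary(labeled_references):
    """Keep words seen in exactly one element type, and only author/title/journal."""
    seen_in = defaultdict(set)
    for reference in labeled_references:
        for element, text in reference:
            for word in text.split():
                seen_in[word].add(element)
    dictionary = {}
    for word, elements in seen_in.items():
        if len(elements) == 1:
            (element,) = elements
            if element in WORD_ELEMENTS:
                dictionary[word] = element
    return dictionary


def label_token(token, dictionary):
    """Return a supervised label for the token, or None if it stays unlabeled."""
    if YEAR_RE.fullmatch(token) and 1900 <= int(token) <= 2012:
        return "year"
    if PAGES_RE.fullmatch(token):
        return "pages"
    return dictionary.get(token)


toy = [
    [("author", "Larrosa"), ("title", "Endangered Violation"), ("journal", "Forschungen")],
    [("author", "Violation Smith"), ("title", "Smith Endangered")],
]
dictionary = build_dictionary(toy)
print([label_token(t, dictionary) for t in ["Larrosa", "Violation", "2001", "705-712"]])
```

In the toy data "Violation" occurs under both author and title, so it is left out of the dictionary, which is the ambiguity removal the slide refers to.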


(Words labeled as journal in MEDLINE)


(Words labeled as author in MEDLINE)


Specs of Supervised Labels


RESULTS


2000~2009 references from DBLP; 0% random shuffle


2000~2009 references from DBLP; 20% random shuffle


2000~2009 references from DBLP; 50% random shuffle


medline09n0400.xml ~ medline09n0499.xml; 0% random shuffle

0.935

medline09n0400.xml ~ medline09n0499.xml; 20% random shuffle


medline09n0400.xml ~ medline09n0499.xml; 50% random shuffle

How did supervised labels work?

• supervised label: page
  – answer: page ... 99.31% (=2683391.5/2702066)
  – answer: title ... 60.99% (=5000.0/8198)

• supervised label: year
  – answer: year ... 99.56% (=2744454.5/2756538)
  – answer: title ... 48.47% (=6000.7/12380)
  – answer: journal ... 50.93% (=126.3/248)
  – answer: page ... 53.72% (=710.7/1323)

Many incorrect labels were amended!


CONCLUSIONS


Conclusions

• BanPaiFen achieves accuracies > 90%
  – BanPaiFen is a semi-supervised approach.
  – BanPaiFen can also amend incorrect labels.

• Future work
  – How can we achieve an "almost perfect" accuracy?

Thank you!