Semi-supervised Bibliographic Element Segmentation with Latent Permutations
Tomonari MASADA ( 正田 備也 ), Nagasaki University ( 長崎大學 )
T. Masada @ ICADL 2011
PROBLEM
Zouhaier Brahmia Rafik Bouaziz Schema Versioning in Multi-temporal XML Databases. ACIS-ICIS 2008
Xiang Zhang Yakup Genc Bootstrapped Real-Time Ego Motion Estimation and Scene Modeling. 3DIM 2005
Takaki Yoshida Masafumi Watari MD-SCAN Method for Low Power Scan Testing. Asian Test Symposium 2002
Yan Zhao Enterprise Service Oriented Architecture (ESOA) Adoption Reference. IEEE SCC 2006
Rich Jochems Shane Rodgers The Rollercoaster of Required Agile Transition. AGILE 2007
Weigen Qiu Zhibin Hu Composed Fuzzy Rough Set and Its Applications in Fuzzy RSAR. APPT 2007
Guilherme Bittencourt Isabel Tonin A Proof Strategy Based on a Dual Representation. AISC 2000
James A. Kupsch Barton P. Miller How to Open a File and Not Get Hacked. ARES 2008
Claire Grover Alex Lascarides XML-Based Data Preparation for Robust Deep Parsing. ACL 2001
Yuanlin Zhang Roland H. C. Yap Incrementally Solving Functional Constraints. AAAI/IAAI 2002
Gerald Quirchmayr Survivability and Business Continuity Management. ACSW Frontiers 2004
Witold Dzwinel A Cellular Automata Model of Population Infected by Periodic Plague. ACRI 2004
Martin Buchwitz The IDA Standard. The Industrial Information Technology Handbook 2005
Olivier Gutknecht Jacques Ferber MadKit: a generic multi-agent platform. Agents 2000
Hakan Özadam Ferruh Özbudak The Minimum Hamming Distance of Cyclic Codes of Length 2ps. AAECC 2009
Riccardo Torlone Conceptual Multidimensional Models. Multidimensional Databases 2003
Jixue Liu Chengfei Liu A Declarative Way of Extracting XML Data in XSL. ADBIS 2002
Min-Jung Yoo An industrial application of agents for dynamic planning and scheduling. AAMAS 2002
David Fang Rajit Manohar Non-Uniform Access Asynchronous Register Files. ASYNC 2004
Carsten Ullrich Tutorial Planning: Adapting Course Generation to Today's Needs. AIED 2005
Po-Jen Chuang Shien-Da Chang Performance Analysis on Location Tracking in PCS Networks. AINA 2005
Hee Seo Seon Wook Kim OpenMP Directive Extension for BlackFin 561 Dual Core Processor. CIT 2006
V. Benjamin Livshits John Whaley Monica S. Lam Reflection Analysis for Java. APLAS 2005
HMM-based approach [Takasu+ 03]
CRF-based approach [Peng+ 04]
CORA dataset: 500 references (350 for training, 150 for test)
PaiFen
Bayesian Topic Modeling
• “Topic Modeling”
  – Document as a mixture of multiple topics
• “Bayesian”
  – (Posterior) ∝ (Likelihood) × (Prior)
LDA (Latent Dirichlet Allocation)
• More effective than PLSI
  – PLSI often suffers from overfitting.
• Easy to implement
  – MCMC can be implemented with + - x /.
• Easy to parallelize
  – Inference is scalable to massive data sets.
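To make the "+ - x /" remark concrete, here is a minimal sketch of collapsed Gibbs sampling for plain LDA, not for the latent-permutation model discussed later. The function name `lda_gibbs`, the hyperparameter values, and the toy input format (documents as lists of integer word ids) are assumptions made for illustration.

```python
import random

def lda_gibbs(docs, K, W, alpha=0.1, beta=0.01, iters=20, seed=0):
    """Collapsed Gibbs sampling for LDA.

    docs: list of documents, each a list of word ids in [0, W).
    K: number of topics. W: vocabulary size.
    alpha, beta: symmetric Dirichlet hyperparameters (illustrative values).
    """
    rng = random.Random(seed)
    n_dk = [[0] * K for _ in docs]      # tokens of document d assigned to topic k
    n_kw = [[0] * W for _ in range(K)]  # times word w is assigned to topic k
    n_k = [0] * K                       # tokens assigned to topic k overall
    z = [[rng.randrange(K) for _ in doc] for doc in docs]  # random initial topics
    for d, doc in enumerate(docs):
        for w, k in zip(doc, z[d]):
            n_dk[d][k] += 1
            n_kw[k][w] += 1
            n_k[k] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current assignment of token i from the counts.
                k = z[d][i]
                n_dk[d][k] -= 1
                n_kw[k][w] -= 1
                n_k[k] -= 1
                # Unnormalised conditional: only additions, products, one division.
                weights = [
                    (n_dk[d][t] + alpha) * (n_kw[t][w] + beta) / (n_k[t] + W * beta)
                    for t in range(K)
                ]
                k = rng.choices(range(K), weights=weights)[0]
                # Record the new assignment.
                z[d][i] = k
                n_dk[d][k] += 1
                n_kw[k][w] += 1
                n_k[k] += 1
    return z, n_dk, n_kw, n_k
```

Each update touches only integer counters and a handful of arithmetic operations, which is also why the sampler is straightforward to distribute over documents.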
Segmentation by Topics
topic = bibliographic element  < interpretation >
MadKit: a generic multi-agent platform. Olivier Gutknecht Jacques Ferber 2000 Agents
(1,1,1,1,1,0,0,0,3,2,2)
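Reading a segmentation off a topic vector can be sketched as follows: consecutive tokens that share a topic id form one bibliographic element. The reference is taken from the list above, but the topic ids and the `NAMES` mapping are hypothetical values chosen for illustration, not output of the model.

```python
from itertools import groupby

def segments(tokens, topics):
    """Group consecutive tokens that share a topic id into (topic, text) pairs."""
    if len(tokens) != len(topics):
        raise ValueError("tokens and topics must have the same length")
    result = []
    for topic, run in groupby(zip(topics, tokens), key=lambda pair: pair[0]):
        result.append((topic, " ".join(token for _, token in run)))
    return result

# Hypothetical topic-to-element interpretation (an assumption for this example).
NAMES = {0: "author", 1: "title", 2: "venue", 3: "year"}

tokens = ("Yan Zhao Enterprise Service Oriented Architecture (ESOA) "
          "Adoption Reference. IEEE SCC 2006").split()
topics = [0, 0, 1, 1, 1, 1, 1, 1, 1, 2, 2, 3]  # hand-written, not inferred

for topic, text in segments(tokens, topics):
    print(f"{NAMES[topic]:>6}: {text}")
```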
Beyond LDA
• Contiguity constraint
zd = (3,3,1,3,1,2,0,2,2,0)
zd = (3,3,3,1,1,2,2,2,0,0)
We arrange the order of topic assignments!
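The contiguity constraint can be checked mechanically: every topic id must occupy a single unbroken run. The helper name `is_contiguous` is an assumption; the two vectors are the ones shown on this slide.

```python
def is_contiguous(z):
    """Return True if each topic id in z appears in exactly one unbroken run."""
    finished = set()  # topics whose run has started at some point
    previous = None
    for k in z:
        if k != previous:
            if k in finished:
                return False  # topic k reappears after its run ended
            finished.add(k)
            previous = k
    return True

print(is_contiguous((3, 3, 1, 3, 1, 2, 0, 2, 2, 0)))  # False
print(is_contiguous((3, 3, 3, 1, 1, 2, 2, 2, 0, 0)))  # True
```

The second vector holds the same counts per topic as the first (three 3s, two 1s, three 2s, two 0s), only rearranged so that each topic forms one block.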
Latent Permutations [Chen et al. 09]
• Draw a permutation πd from a distribution called the generalized Mallows model
• Arrange the order of topic assignments based on πd
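A generalized Mallows draw can be sketched through inversion counts: for each topic j, draw v_j with probability proportional to exp(-ρ_j v_j), then decode the counts into a permutation. The function names, the 0-based topic ids, and the choice of the identity as the canonical ordering are assumptions of this sketch.

```python
import math
import random

def sample_inversion_count(rho, max_v, rng):
    """Draw v from {0, ..., max_v} with P(v) proportional to exp(-rho * v)."""
    weights = [math.exp(-rho * v) for v in range(max_v + 1)]
    return rng.choices(range(max_v + 1), weights=weights)[0]

def inversions_to_permutation(v):
    """Decode inversion counts into a permutation of 0..K-1, where K = len(v) + 1.

    v[j] is the number of items larger than j that precede j.
    Items are inserted from largest to smallest, so when j is inserted
    every item already placed is larger than j.
    """
    K = len(v) + 1
    perm = [K - 1]
    for j in range(K - 2, -1, -1):
        perm.insert(v[j], j)
    return perm

def sample_gmm_permutation(rhos, rng):
    """Draw a permutation of K = len(rhos) + 1 topics; rhos[j] > 0 is the dispersion for item j."""
    K = len(rhos) + 1
    v = [sample_inversion_count(rhos[j], K - 1 - j, rng) for j in range(K - 1)]
    return inversions_to_permutation(v), v

rng = random.Random(0)
perm, v = sample_gmm_permutation([1.0, 1.0, 1.0], rng)  # K = 4
print(perm, v)
```

All-zero counts decode to the identity ordering, and larger ρ values concentrate the draws near it.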
(K=4)
(1,2,3,0)
(1,3,2,0)
(2,1,3,0)
(1,2,0,3)
(3,2,1,0)
(1,0,2,3)
(2,0,1,3)
(3,1,0,2)
(2,1,0,3)
(3,2,0,1)
…
MadKit: a generic multi-agent platform. Olivier Gutknecht Jacques Ferber 2000 Agents
(1,1,1,1,1,0,0,0,3,2,2)
{1,1,0,0,3,2,0,1,1,1,2}
(1,0,3,2) permutation
topic assignments
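The step from the unordered topic draws to the ordered assignment can be sketched as: count how often each topic was drawn, then emit the topics block by block in the order given by the permutation. The helper name `arrange` is an assumption; the inputs are the values shown on this slide.

```python
from collections import Counter

def arrange(bag, perm):
    """Order a bag of topic draws so that topics appear in the order given by perm.

    Assumes perm lists every topic id that occurs in bag.
    """
    counts = Counter(bag)
    ordered = []
    for k in perm:
        ordered.extend([k] * counts[k])
    return ordered

bag = [1, 1, 0, 0, 3, 2, 0, 1, 1, 1, 2]
perm = [1, 0, 3, 2]
print(arrange(bag, perm))  # [1, 1, 1, 1, 1, 0, 0, 0, 3, 2, 2]
```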
Inference Implementation
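[Equations: conditional distributions used by the sampler]
• GMM parameter ρ_k: slice sampling, where
  $$\psi_k(\rho_k) = \frac{1 - \exp(-(K - k + 1)\rho_k)}{1 - \exp(-\rho_k)}$$
• Inversion count v_jk: p(v_jk = v^new | ...) ∝ exp(−ρ_k v^new) × (a ratio of count terms in n, the vocabulary size W, and the "new" counts, taken over topics k and words w)
• Topic draw t_ji: p(t_ji = k | ...) ∝ (a ratio of count terms of the same kind)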
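Slice sampling of a GMM dispersion parameter can be sketched as below. The target density is an assumption: it is taken to be exp(-ρ Σ_j v_j) / ψ(ρ)^N with a flat prior on ρ > 0, where `sum_v` is the total of the inversion counts over N references and `m` is the number of values a count can take. The names `log_psi`, `log_target`, and `slice_sample`, and the step width, are likewise illustrative.

```python
import math
import random

def log_psi(rho, m):
    """Log of the normaliser sum_{v=0}^{m-1} exp(-rho*v) = (1-exp(-m*rho))/(1-exp(-rho))."""
    return math.log(-math.expm1(-m * rho)) - math.log(-math.expm1(-rho))

def log_target(rho, sum_v, n_docs, m):
    """Assumed unnormalised log posterior of rho (flat prior on rho > 0)."""
    if rho <= 0.0:
        return -math.inf
    return -rho * sum_v - n_docs * log_psi(rho, m)

def slice_sample(x0, logf, rng, width=1.0, max_steps=50):
    """One univariate slice-sampling update with stepping out and shrinkage."""
    log_y = logf(x0) + math.log(1.0 - rng.random())  # slice height, 1 - U lies in (0, 1]
    left = x0 - width * rng.random()
    right = left + width
    steps = max_steps
    while steps > 0 and logf(left) > log_y:   # step out to the left
        left -= width
        steps -= 1
    steps = max_steps
    while steps > 0 and logf(right) > log_y:  # step out to the right
        right += width
        steps -= 1
    while True:                               # shrink until a point on the slice is found
        x1 = left + (right - left) * rng.random()
        if logf(x1) >= log_y:
            return x1
        if x1 < x0:
            left = x1
        else:
            right = x1

rng = random.Random(0)
logf = lambda r: log_target(r, sum_v=30, n_docs=20, m=4)
rho = 1.0
for _ in range(100):
    rho = slice_sample(rho, logf, rng)
print(rho)
```

Slice sampling needs only pointwise evaluations of the unnormalised density, which suits a target that has no standard closed-form sampler.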
Inference Details
PERFORMANCE OF PaiFen
Synthetic Data
Our datasets
            # docs      # paragraphs   # vocabs    # tokens
DBLP        944,755     17,408,876     685,799     17,408,876
MEDLINE     3,001,207   87,085,708     2,168,061   87,085,708

[Chen et al. 09]
            # docs      # paragraphs   # vocabs    # tokens
Cities      100         6,670          41,978      492,402
Elements    118         2,810          18,008      191,762
<inproceedings mdate="2004-10-27" key="conf/3dpvt/Dobbins04"><author>A. Dobbins</author><title>Color, Fusion, and Stereopsis.</title><pages>705</pages><year>2004</year><crossref>conf/3dpvt/2004</crossref><booktitle>3DPVT</booktitle><url>db/conf/3dpvt/3dpvt2004.html#Dobbins04</url></inproceedings>
A. Dobbins
Color, Fusion, and Stereopsis.
2004
3DPVT
Synthesizing “Noisy” Data
• Random shuffle of bibliographic elements
  – “D20” : random shuffle for 20% references
  – “D50” : random shuffle for 50% references
A. Dobbins Color, Fusion, and Stereopsis. 2004 3DPVT
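The construction of the shuffled data sets can be sketched as follows. How the affected references are selected is not stated on the slide, so choosing them uniformly at random is an assumption, as are the function name `synthesize` and the representation of a reference as a list of element strings.

```python
import random

def synthesize(references, fraction, seed=0):
    """Concatenate element strings, shuffling element order in a fraction of references.

    references: list of references, each a list of element strings.
    fraction: share of references to shuffle, e.g. 0.2 for "D20", 0.5 for "D50".
    Returns the concatenated strings and the set of shuffled indices.
    """
    rng = random.Random(seed)
    n_shuffle = round(fraction * len(references))
    chosen = set(rng.sample(range(len(references)), n_shuffle))
    output = []
    for index, elements in enumerate(references):
        elements = list(elements)
        if index in chosen:
            rng.shuffle(elements)  # may by chance reproduce the original order
        output.append(" ".join(elements))
    return output, chosen

reference = ["A. Dobbins", "Color, Fusion, and Stereopsis.", "2004", "3DPVT"]
d20, shuffled = synthesize([reference] * 10, 0.2)
print(len(shuffled))  # 2
```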
PaiFen [Masada 10]
~ 82% accuracy
BanPaiFen
Supervised Topic Models
• No straightforward procedures for using supervised signals in topic models
  – sLDA [Blei+ 07]
  – DiscLDA [Lacoste-Julien+ 08] ... gradient ascent
  – Labeled LDA [Ramage+ 09]
  – MedLDA [Zhu+ 09] ... O(K^3)
• Our "simple" procedure
  – Modify posterior probabilities with penalties
Penalties
• R_j^mis: # mismatches between the supervised label and the inferred topic in the reference d_j
• R_j^red: # redundant assignments in d_j
  – e.g. If three tokens are assigned to the topic corresponding to the "year" element, then two of the three are redundant.
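The two penalties can be sketched as simple counts over one reference. Representing topics by element names, marking unlabeled tokens with `None`, and treating a configurable set of elements as single-token are assumptions of this sketch; the slide only gives "year" as an example of the latter.

```python
def mismatch_penalty(inferred, supervised):
    """Count tokens whose inferred topic differs from their supervised label.

    supervised[i] is None when token i carries no supervised label.
    """
    return sum(
        1
        for topic, label in zip(inferred, supervised)
        if label is not None and topic != label
    )

def redundancy_penalty(inferred, single_token_topics):
    """Count assignments beyond the first for topics expected to cover one token."""
    return sum(max(0, inferred.count(k) - 1) for k in single_token_topics)

inferred = ["author", "author", "title", "year", "year", "year"]
supervised = [None, "author", "title", "title", "year", None]
print(mismatch_penalty(inferred, supervised))    # 1
print(redundancy_penalty(inferred, {"year"}))    # 2
```

The second call reproduces the slide's example: three tokens assigned to "year" give two redundant assignments.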
Semi-supervised Inference
• Modify posterior probabilities with penalties
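$$p'(v_{jk} = v^{\mathrm{new}} \mid \cdots) \propto p(v_{jk} = v^{\mathrm{new}} \mid \cdots)\,\exp\!\left(-\frac{I\,R_j^{\mathrm{mis}}}{M}\right)$$
$$p'(t_{ji} = k \mid \cdots) \propto p(t_{ji} = k \mid \cdots)\,\exp\!\left(-\frac{I\,\bigl(R_j^{\mathrm{mis}} + R_j^{\mathrm{red}}\bigr)}{M}\right)$$
where p(· | ···) on the right-hand side is the unpenalized conditional
I: # iterations already performed
M: some constant to be determined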
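One reading of the penalized update is that each candidate's unnormalised posterior is multiplied by exp(-I·R/M) before sampling, so the penalty tightens as the iteration count I grows. The negative sign, the helper names, and the example numbers below are assumptions of this sketch.

```python
import math
import random

def penalized_weights(weights, penalties, iteration, M):
    """Scale each unnormalised posterior weight by exp(-iteration * R / M).

    weights[c]: unnormalised posterior of candidate c.
    penalties[c]: penalty R that would result from choosing candidate c.
    """
    return [w * math.exp(-iteration * r / M) for w, r in zip(weights, penalties)]

def draw(weights, rng):
    """Sample a candidate index in proportion to the given weights."""
    return rng.choices(range(len(weights)), weights=weights)[0]

rng = random.Random(0)
weights = [0.5, 0.3, 0.2]  # hypothetical unnormalised posteriors
penalties = [2, 0, 1]      # hypothetical penalty counts per candidate
for iteration in (0, 10, 100):
    scaled = penalized_weights(weights, penalties, iteration, M=20.0)
    print(iteration, [round(s, 4) for s in scaled], draw(scaled, rng))
```

At iteration 0 the weights are unchanged, so early sweeps behave like the unsupervised sampler and later sweeps increasingly favour candidates with no penalty.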
Labeling Dictionary
"Cheol-Hee" --> author
"Larrosa" --> author
"Endangered" --> title
"Violation" --> title
"Forschungen" --> journal
"2001" --> year
...
Labeling Procedure
• Prepare a special set of references
  – DBLP database
    • References up to 1999 (399,497)
      – References from 2000 onward are used for inference (944,755)
  – MEDLINE database
    • 100 XML files
      – different from the 100 XML files for inference
• Compose a "labeling dictionary" by using the set
How to Obtain Dictionary Entries
• author, title, journal
  – Words that appear in author/title/journal element and never appear in the other elements
    <-- We'd like to remove ambiguity.
• year
  – Integers from 1900 to 2012
• pages
  – Tokens matching [1-9][0-9]*\-[0-9]*
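The rules above can be sketched as a small dictionary builder plus a token labeler. The input format (one dict per reference, mapping element name to text), whitespace tokenization, full-token regex matching, and the precedence of the year and pages rules over dictionary lookup are assumptions of this sketch.

```python
import re
from collections import defaultdict

PAGES_RE = re.compile(r"[1-9][0-9]*\-[0-9]*")  # pattern from the slide
DIGITS_RE = re.compile(r"[0-9]+")

def build_dictionary(labeled_refs, fields=("author", "title", "journal")):
    """Map each word to its element if it occurs in exactly one kind of element."""
    seen_in = defaultdict(set)
    for ref in labeled_refs:
        for element, text in ref.items():
            for word in text.split():
                seen_in[word].add(element)
    dictionary = {}
    for word, elements in seen_in.items():
        if len(elements) == 1:           # unambiguous words only
            (element,) = elements
            if element in fields:
                dictionary[word] = element
    return dictionary

def label_token(token, dictionary):
    """Return a supervised label for token, or None if no rule applies."""
    if DIGITS_RE.fullmatch(token) and 1900 <= int(token) <= 2012:
        return "year"
    if PAGES_RE.fullmatch(token):
        return "pages"
    return dictionary.get(token)

# Hypothetical training references, invented for this example.
refs = [
    {"author": "Ann Lee", "title": "Graph Mining", "journal": "Data Journal"},
    {"author": "Bob Graph", "title": "Data Lee Models", "journal": "Mining Letters"},
]
d = build_dictionary(refs)
print(d)
print(label_token("2001", d), label_token("705-712", d), label_token("Lee", d))
```

Words such as "Lee" or "Graph", which occur in more than one element, receive no entry, which is the ambiguity removal the slide asks for.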
(Words labeled as journal in MEDLINE)
(Words labeled as author in MEDLINE)
Specs of Supervised Labels
RESULTS
2000~2009 references from DBLP; 0% random shuffle
2000~2009 references from DBLP; 20% random shuffle
2000~2009 references from DBLP; 50% random shuffle
medline09n0400.xml ~ medline09n0499.xml; 0% random shuffle
0.935
medline09n0400.xml ~ medline09n0499.xml; 20% random shuffle
medline09n0400.xml ~ medline09n0499.xml; 50% random shuffle
Many incorrect labels were amended!
How did supervised labels work?
• supervised label: page
  – answer: page ... 99.31% (=2683391.5/2702066)
  – answer: title ... 60.99% (=5000.0/8198)
• supervised label: year
  – answer: year ... 99.56% (=2744454.5/2756538)
  – answer: title ... 48.47% (=6000.7/12380)
  – answer: journal ... 50.93% (=126.3/248)
  – answer: page ... 53.72% (=710.7/1323)
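The percentages above follow directly from the quoted ratios, as the short check below shows. Reading each ratio as (tokens whose inferred topic matched the answer) / (tokens with that supervised label and that answer) is an interpretation, not something the slide states; the variable names are illustrative.

```python
# (supervised label, answer, numerator, denominator, reported percentage)
ROWS = [
    ("page", "page", 2683391.5, 2702066, 99.31),
    ("page", "title", 5000.0, 8198, 60.99),
    ("year", "year", 2744454.5, 2756538, 99.56),
    ("year", "title", 6000.7, 12380, 48.47),
    ("year", "journal", 126.3, 248, 50.93),
    ("year", "page", 710.7, 1323, 53.72),
]

def percentages(rows):
    """Recompute each percentage from its numerator and denominator."""
    return [100.0 * numerator / denominator for _, _, numerator, denominator, _ in rows]

for (label, answer, _, _, reported), computed in zip(ROWS, percentages(ROWS)):
    print(f"label={label:<5} answer={answer:<8} computed={computed:.2f}% reported={reported:.2f}%")
```

The rows where the answer differs from the supervised label are the cases in which the dictionary label was wrong and the model still recovered the correct element for roughly half of the tokens or more.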
CONCLUSIONS
Conclusions
• BanPaiFen achieves accuracies > 90%
  – BanPaiFen is a semi-supervised approach.
  – BanPaiFen can also amend incorrect labels.
• Future work
  – How can we achieve an "almost perfect" accuracy?
Thank you!