Inferring XML Schema Definitions from XML Data
Geert Jan Bex, Frank Neven, Stijn Vansummeren
Hasselt Univ. and Transnational Univ. of Limburg, Belgium
VLDB 2007
2008. 02. 15.
Summarized by Chulki Lee, IDS Lab., Seoul National University
Presented by Chulki Lee, IDS Lab., Seoul National University
Copyright 2007 by CEBT
Inferring XML Schema
Why schemas?
– automation & optimization of search
– integration of XML data sources
– …
Why infer schemas?
– 50% of XML documents on the web have no schema
– 33% of schemas are not valid
Why infer XSD (XML Schema Definition)?
– DTD (Document Type Definition) has limitations
  – an element's type depends only on the element's name (the path is not considered)
Example: DTD vs. XSD
[Figure: DTD vs. XSD example, with labels "type" and "name"]
Theorem
The class of all XSDs cannot be learned from positive data (an XML corpus) only
In an XSD, the content model of an element is uniquely determined by the path from the root to that element
Observation: local context
An XSD is k-local if its content models depend only on labels up to the k-th ancestor
98% of XSDs are k-local with k = 2
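The k-local idea can be sketched as grouping the child sequences of every element by the last k labels on its root path. This is a minimal illustration, not the authors' code: the document, the function name `k_local_samples`, and the choice to count the element's own label as part of the k labels are all assumptions. Under that convention, k = 1 keys on the element name alone, as a DTD does.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical one-document corpus: <name> has different children
# under <person> and under <company>.
DOC = (
    "<db>"
    "<person><name><first/><last/></name></person>"
    "<company><name><legal/></name></company>"
    "</db>"
)

def k_local_samples(root, k):
    """Map each context (last k labels of the root path, element included;
    an assumption of this sketch) to the child-label sequences seen there."""
    samples = defaultdict(list)

    def walk(elem, path):
        path = path + (elem.tag,)
        samples[path[-k:]].append([child.tag for child in elem])
        for child in elem:
            walk(child, path)

    walk(root, ())
    return samples

root = ET.fromstring(DOC)
for k in (1, 2):
    print(k, dict(k_local_samples(root, k)))
```

With k = 1 both kinds of `name` fall into one group; with k = 2 they are separated by their parent label, so each group can get its own content model.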
Observation: SORE
Single Occurrence Regular Expression (SORE): every element name occurs at most once
What's a SORE
– title, (author, affiliation?)+, abstract
What's not a SORE
– title, ((author, affiliation)+ + (editor, affiliation)+), abstract
  – duplicated element name: affiliation
99% of regular expressions are single occurrence
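The single occurrence condition is purely syntactic, so it can be checked by counting element names in the expression. The sketch below assumes element names are plain identifiers and treats everything else (commas, parentheses, `?`, `+`, `*`, `|`) as operators; the function names are hypothetical.

```python
import re
from collections import Counter

def repeated_names(expr):
    """Return the element names that occur more than once in a content model."""
    names = re.findall(r"[A-Za-z_][\w.-]*", expr)
    return sorted(name for name, count in Counter(names).items() if count > 1)

def is_sore(expr):
    """A content model is single occurrence if no element name repeats."""
    return not repeated_names(expr)

print(is_sore("title, (author, affiliation?)+, abstract"))
print(repeated_names(
    "title, ((author, affiliation)+ + (editor, affiliation)+), abstract"
))
```

The first call prints `True`; the second prints `['affiliation']`, the duplicated element name in the second expression.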
Proposed Algorithms
Theorem
– XSDs with local context and SORE content models are learnable from positive examples only (given a 'sufficiently large' corpus)
iLocal = iSOA + ToSORE + MINIMIZE
– infers the k-local, single occurrence target XSD
iXSD = iLocal + REDUCE
– REDUCE: unify sufficiently similar types
SOA: Single Occurrence Automaton
Algorithm: iLocal (1/4)
Algorithm: iLocal (2/4)
Algorithm: iLocal (3/4)
iSOA: make SOA from strings
ToSORE: translate SOA → SORE
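One common way to build a single occurrence automaton from sample strings is a 2-gram construction: one state per element name, and an edge a → b whenever b directly follows a in some sample. The sketch below assumes this is what the iSOA step does; the markers `SRC` and `SINK`, the function names, and the sample strings are hypothetical. The ToSORE step is not sketched here.

```python
def infer_soa(strings):
    """Return the edge set of an automaton with one state per element name.

    "SRC" and "SINK" are hypothetical start/end markers; an edge (a, b)
    is added whenever b directly follows a in some sample string.
    """
    edges = set()
    for s in strings:
        seq = ["SRC", *s, "SINK"]
        edges.update(zip(seq, seq[1:]))
    return edges

def accepts(edges, s):
    """A string is accepted if every adjacent pair is an edge."""
    seq = ["SRC", *s, "SINK"]
    return all(pair in edges for pair in zip(seq, seq[1:]))

# Hypothetical child-label strings observed for one context.
SAMPLES = [
    ["title", "author", "abstract"],
    ["title", "author", "affiliation", "author", "abstract"],
]
soa = infer_soa(SAMPLES)

# Generalizes beyond the samples: more (author, affiliation) repetitions.
print(accepts(soa, ["title", "author", "affiliation",
                    "author", "affiliation", "author", "abstract"]))
# A string that skips author was never observed, so it is rejected.
print(accepts(soa, ["title", "abstract"]))
```

Because each element name is a single state, the automaton stays small and generalizes repetition from very few samples.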
Algorithm: iLocal (4/4)
Algorithm: iXSD
With incomplete data, iLocal derives too many types
REDUCE: practical heuristics
– define a distance between types
– for types s and t: if distance(s, t) < ε then unify s and t
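The unification rule can be sketched as a greedy merge over types. The slides do not give the distance function, so the sketch substitutes a Jaccard distance over automaton edge sets as a stand-in; that distance, the greedy merge order, and all names below are assumptions, not the REDUCE heuristic itself.

```python
def jaccard_distance(a, b):
    """Stand-in distance between two edge sets (an assumption of this sketch)."""
    union = a | b
    return 1 - len(a & b) / len(union) if union else 0.0

def reduce_types(types, eps):
    """Greedily unify types (name -> edge set) whose distance is below eps.

    Returns a list of (set of type names, combined edge set) groups.
    """
    merged = []
    for name, edges in types.items():
        for names, group_edges in merged:
            if jaccard_distance(group_edges, edges) < eps:
                names.add(name)
                group_edges.update(edges)
                break
        else:
            merged.append(({name}, set(edges)))
    return merged

# Hypothetical inferred types: t1 and t2 differ by one edge, t3 is unrelated.
TYPES = {
    "t1": {("a", "b"), ("b", "c")},
    "t2": {("a", "b"), ("b", "c"), ("c", "d")},
    "t3": {("x", "y")},
}
for names, edges in reduce_types(TYPES, eps=0.5):
    print(sorted(names), sorted(edges))
```

Here t1 and t2 are at distance 1/3 and are unified, while t3 stays separate. Raising eps merges more types; lowering it keeps more of them apart.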
Experiments
8 schemas & 200 generated documents for each schema
– schemas: 12~23 types with unbounded depth and width
– local with k = 2, 3
Types of iXSD imprecision:
– the content model for the target and inferred type can differ
  – based on positive examples, can't be avoided
– a type in the target XSD can correspond to multiple types in the inferred XSD: false positives
– a type in the inferred XSD can correspond to multiple types in the target XSD: false negatives
– a type in the target XSD is not derived
  – incomplete corpus, can't be avoided
Experiments
k = 3, parsing 697 XSDs (40Mb), Pentium M 1.73 → 17 seconds
k = 2, without REDUCE → 29 false positives
– power of REDUCE
Sensitivity to parameters
– context size k ↑ ⇒ false positives ↑, false negatives ↓
– ε ↑ ⇒ false positives ↓, false negatives ↑
Experiments
iXSD derives good XSDs from small training sets (50~)
Conclusions
Propose two algorithms
– iLocal: sound & k-complete
– iXSD: deals with poor data
  – good performance on real-world data
  – good runtime performance
Future work
– determine the best locality k