Inferring XML Schema Definitions from XML Data
Geert Jan Bex, Frank Neven, Stijn Vansummeren
Hasselt Univ. and transnational Univ. of Limburg, Belgium
VLDB 2007
2008. 02. 15.
Summarized by Chulki Lee, IDS Lab., Seoul National University
Presented by Chulki Lee, IDS Lab., Seoul National University



Copyright 2007 by CEBT

Inferring XML Schema

  • Why schemas?
    – automation and optimization of search
    – integration of XML data sources
    – …
  • Why infer schemas?
    – 50% of XML on the web has no schema
    – 33% of schemas are not valid
  • Why infer XSD (XML Schema Definition)?
    – DTD (Document Type Definition) has limitations: an element's type depends only on the element's name, not on the path leading to it

Example: DTD vs. XSD

[Figure: DTD vs. XSD example; surviving labels: type, name]
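The DTD limitation (an element's type depends only on its name) can be illustrated with a toy sketch. The element names and content models below are hypothetical and are not taken from the slides; content models are written as plain strings for readability.

```python
# DTD-style rules: one content model per element name. Both kinds of "item"
# must share a single, looser model because a DTD cannot tell them apart.
dtd_rules = {
    "store": "order*, stock",
    "order": "item+",
    "stock": "item*",
    "item": "id, qty?, price?, supplier?",
}

# XSD-style rules: the content model is selected by the path from the root,
# so "item" under "order" and "item" under "stock" can have different types.
xsd_rules = {
    ("store",): "order*, stock",
    ("store", "order"): "item+",
    ("store", "stock"): "item*",
    ("store", "order", "item"): "id, qty",
    ("store", "stock", "item"): "id, price, supplier",
}


def dtd_content_model(path):
    """Look up a content model by element name only (last label of the path)."""
    return dtd_rules[path[-1]]


def xsd_content_model(path):
    """Look up a content model by the full root-to-element path."""
    return xsd_rules[tuple(path)]


ordered_item = ["store", "order", "item"]
stocked_item = ["store", "stock", "item"]
print(dtd_content_model(ordered_item), "|", dtd_content_model(stocked_item))
print(xsd_content_model(ordered_item), "|", xsd_content_model(stocked_item))
```

In the DTD-style table both paths resolve to the same merged model, while the path-keyed table keeps the two item types apart.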

Theorem

  • XSDs cannot be learned from positive data (an XML corpus) only
  • In an XSD, the content model of an element is uniquely determined by the path from the root to that element

Observation: local context

  • An XSD is k-local if its content models depend only on labels up to the k-th ancestor
  • 98% of XSDs: k = 2
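A minimal sketch of how k-local contexts could be used to group training data, assuming the context of an element is its own label plus the labels of up to k ancestors (one reading of "up to the k-th ancestor"). The function name and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def child_strings_by_context(xml_text, k):
    """Group each element's sequence of child names by its k-local context.

    Assumption: the context is the element's own label preceded by the
    labels of at most k ancestors.
    """
    groups = defaultdict(list)

    def visit(elem, path):
        path = path + [elem.tag]
        context = tuple(path[-(k + 1):])
        groups[context].append(tuple(child.tag for child in elem))
        for child in elem:
            visit(child, path)

    visit(ET.fromstring(xml_text), [])
    return dict(groups)


doc = (
    "<store>"
    "<order><item><id/><qty/></item></order>"
    "<stock><item><id/><price/></item></stock>"
    "</store>"
)
by_name = child_strings_by_context(doc, 0)    # label only, as in a DTD
by_parent = child_strings_by_context(doc, 1)  # label plus parent label
print(by_name[("item",)])
print(by_parent[("order", "item")], by_parent[("stock", "item")])
```

With k = 0 the two kinds of item fall into one group; with k = 1 they are separated, so a different content model can be inferred for each context.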

Observation: SORE

  • Single Occurrence Regular Expression (SORE)
    – What's a SORE: title, (author, affiliation?)+, abstract
    – What's not a SORE: title, ((author, affiliation)+ + (editor, affiliation)+), abstract (duplicated element names)
  • 99% of regular expressions are single occurrence
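The single occurrence condition can be checked mechanically. The sketch below only counts element names in the expression text; it does not parse or validate the regular expression, and the function name is hypothetical.

```python
import re
from collections import Counter


def is_single_occurrence(expr):
    """Return True if every element name occurs at most once in expr."""
    names = re.findall(r"[A-Za-z_][\w.-]*", expr)
    return all(count == 1 for count in Counter(names).values())


sore = "title, (author, affiliation?)+, abstract"
not_sore = "title, ((author, affiliation)+ + (editor, affiliation)+), abstract"
print(is_single_occurrence(sore))
print(is_single_occurrence(not_sore))
```

The second expression fails the check because affiliation appears twice.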

Proposed Algorithms

  • Theorem: XSDs with local context and SORE content models are learnable from positive examples only (the sample needs to be 'sufficiently large')
  • iLocal = iSOA + TOSORE + MINIMIZE
    – infers a k-local, single occurrence target XSD
  • iXSD = iLocal & REDUCE
    – REDUCE = unify sufficiently similar types
  • SOA: Single Occurrence Automaton

Algorithm: iLocal (1/4)

Algorithm: iLocal (2/4)

Algorithm: iLocal (3/4)

  • iSOA: make an SOA from strings
  • ToSORE: translate SOA → SORE
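The slides do not spell out how iSOA builds the automaton. A minimal sketch, assuming the standard construction in which each element name is its own state and an edge is added for every pair of names seen next to each other, could look as follows. The marker names "src" and "sink" and both function names are hypothetical.

```python
def infer_soa(strings):
    """Build a single occurrence automaton, returned as a set of edges.

    States are the element names plus the markers "src" and "sink".
    An edge (a, b) is added whenever b directly follows a in a sample.
    """
    edges = set()
    for s in strings:
        prev = "src"
        for name in s:
            edges.add((prev, name))
            prev = name
        edges.add((prev, "sink"))
    return edges


def accepts(edges, s):
    """Check whether the automaton accepts the sequence of names s."""
    prev = "src"
    for name in s:
        if (prev, name) not in edges:
            return False
        prev = name
    return (prev, "sink") in edges


samples = [
    ("title", "author", "abstract"),
    ("title", "author", "affiliation", "author", "abstract"),
]
soa = infer_soa(samples)
unseen = ("title", "author", "affiliation", "author", "affiliation", "author", "abstract")
print(accepts(soa, unseen))
print(accepts(soa, ("title", "abstract")))
```

The automaton generalizes beyond the samples: it accepts the longer unseen sequence because every adjacent pair in it was observed, and it rejects a title followed directly by an abstract.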

Algorithm: iLocal (4/4)

Algorithm: iXSD

  • Problem: incomplete data
    – iLocal derives too many types
  • REDUCE: practical heuristics
    – define a distance between types
    – for types s and t: if distance(s, t) < ε, then unify s and t
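The thresholded unification step can be sketched as a greedy merge loop. The slides do not say which distance REDUCE uses, so the Jaccard distance over automaton edge sets below is a placeholder assumption, as are the function names and the sample types.

```python
def jaccard_distance(s, t):
    """Placeholder distance between two types given as sets of SOA edges."""
    union = s | t
    if not union:
        return 0.0
    return 1.0 - len(s & t) / len(union)


def reduce_types(types, eps):
    """Greedily unify types whose distance is below eps.

    types maps a type name to a set of edges. A merged type takes the union
    of the edges and keeps the alphabetically first name of the pair.
    """
    merged = {name: set(edges) for name, edges in types.items()}
    changed = True
    while changed:
        changed = False
        names = sorted(merged)
        for i, s in enumerate(names):
            for t in names[i + 1:]:
                if jaccard_distance(merged[s], merged[t]) < eps:
                    merged[s] |= merged.pop(t)
                    changed = True
                    break
            if changed:
                break  # restart the scan with the updated type set
    return merged


types = {
    "item_a": {("src", "id"), ("id", "qty"), ("qty", "sink")},
    "item_b": {("src", "id"), ("id", "qty"), ("qty", "sink"), ("id", "sink")},
    "person": {("src", "name"), ("name", "sink")},
}
reduced = reduce_types(types, eps=0.5)
print(sorted(reduced))
```

Here item_a and item_b share three of four edges (distance 0.25) and are unified, while person stays separate. A larger ε merges more aggressively, which matches the trade-off between false positives and false negatives reported in the experiments.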

Experiments

  • 8 schemas and 200 generated documents for each schema
    – schemas: 12~23 types, with unbounded depth and width
    – local with k = 2, 3
  • Types of iXSD imprecision:
    – the content models of the target type and the inferred type can differ (based on positive examples only, so this can't be avoided)
    – a type in the target XSD can correspond to multiple types in the inferred XSD: false positives
    – a type in the inferred XSD can correspond to multiple types in the target XSD: false negatives
    – a type in the target XSD is not derived (incomplete corpus, so this can't be avoided)

Experiments

  • k = 3, parsing 697 XSDs (40Mb) on a Pentium M 1.73 → 17 seconds
  • k = 2, without REDUCE → 29 false positives
    – power of REDUCE
  • Sensitivity to parameters
    – context size k ↑ ⇒ false positives ↑, false negatives ↓
    – ε ↑ ⇒ false positives ↓, false negatives ↑

Experiments

  • iXSD derives good XSDs from small training sets (50~)

Conclusions

  • Proposed two algorithms
    – iLocal: sound and k-complete
    – iXSD: deals with poor data, with good performance on real-world data and good runtime performance
  • Future work
    – determine the best locality k