SIGMOD 2013 - Patricia's talk on "Value invention for Data Exchange"

DESCRIPTION

The creation of values to represent incomplete information, often referred to as value invention, is central in data exchange. Within schema mappings, Skolem functions have long been used for value invention as they permit a precise representation of missing information. Recent work on a powerful mapping language called second-order tuple-generating dependencies (SO tgds) has drawn attention to the fact that the use of arbitrary Skolem functions can have negative computational and programmatic properties in data exchange. In this paper, we present two techniques for understanding when the Skolem functions needed to represent the correct semantics of incomplete information are computationally well-behaved. Specifically, we consider when the Skolem functions in second-order (SO) mappings have a first-order (FO) semantics and are therefore programmatically and computationally more desirable for use in practice. Our first technique, linearization, significantly extends the Nash, Bernstein and Melnik unskolemization algorithm by understanding when the sets of arguments of the Skolem functions in a mapping are related by set inclusion. We show that such a linear relationship leads to mappings that have FO semantics and are expressible in popular mapping languages, including source-to-target tgds and nested tgds. Our second technique uses source semantics, specifically functional dependencies (including keys), to transform SO mappings into equivalent FO mappings. We show that our algorithms are applicable to a strictly larger class of mappings than previous approaches but, more importantly, we present an extensive experimental evaluation that quantifies this difference (about 78% improvement) over an extensive schema mapping benchmark and illustrates the applicability of our results on real mappings.
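
To make the set-inclusion condition concrete, here is a minimal Python sketch. It only tests whether the argument sets of a mapping's Skolem functions are pairwise comparable by set inclusion; it is not the Linearize algorithm from the paper. The function name `is_linear` and the dictionary representation of a mapping are assumptions made for illustration.

```python
from itertools import combinations


def is_linear(skolem_args):
    """Check whether the Skolem argument sets are pairwise comparable by inclusion.

    skolem_args maps a (hypothetical) Skolem function name to the set of
    universally quantified variables it takes as arguments.
    """
    arg_sets = [frozenset(args) for args in skolem_args.values()]
    # Every pair must satisfy A <= B or B <= A, i.e. the sets form a chain.
    return all(a <= b or b <= a for a, b in combinations(arg_sets, 2))


# f(d) and g(d, p): {d} is contained in {d, p}, so the sets form a chain.
chain = {"f": {"d"}, "g": {"d", "p"}}

# f(p) and g(d): neither {p} nor {d} contains the other.
incomparable = {"f": {"p"}, "g": {"d"}}

print(is_linear(chain))
print(is_linear(incomparable))
```

The check is quadratic in the number of Skolem functions, which is fine for a sketch. Deciding whether a mapping can actually be rewritten involves more than this test, as the abstract's second technique (using source functional dependencies) indicates.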

Transcript

  • 1. Value Invention in Data Exchange. Patricia Arocena (1), Boris Glavic (2), Renee J. Miller (1). (1) University of Toronto, DBGroup; (2) Illinois Institute of Technology, DBGroup. SIGMOD 2013 - June 25, 2013 - New York, USA

  • 2. Outline: 1 Introduction; 2 Linearization; 3 Exploiting Source Constraints; 4 Experiments; 5 Conclusions

  • 3-5. The Data Exchange Problem (R. Fagin et al., Theor. Comput. Sci. 336 (2005)). Schema mappings: $M = (S, T, \Sigma)$, with source schema S and target schema T; a high-level specification models the relationship between S and T. Data exchange: given an instance of S, how to materialize a target instance of T? [Figure: source schema S with its source data, target schema T with its target data, connected by M.] Slide 1 of 26. Arocena, Glavic, and Miller - Value Invention in Data Exchange: Introduction

  • 6-8. Example and Value Invention. Source schema: WorksOn(Department, Project, City). Target schema: Projects(PId, City, ManagerId).
      Source data: (IT, Web, Toronto); (IT, Big Data, Chicago); (Sales, Mobile, New York)
      Target data with nulls: (NULL, Toronto, NULL); (NULL, Chicago, NULL); (NULL, New York, NULL)
      Target data with invented values: (f(Web), Toronto, g(IT)); (f(Big Data), Chicago, g(IT)); (f(Mobile), New York, g(Sales))
      We usually create values to represent incomplete information!
      Mapping: $\exists f\, \exists g\, (\mathrm{WorksOn}(d, p, c) \rightarrow \mathrm{Project}(f(p), c, g(d)))$

  • 9. Our Goal. Understand when schema mappings specified by SO tgds (flexible and precise value invention) can be rewritten into nested GLAV mappings (desirable computational and programmatic properties).

  • 10-12. Skolem Functions. Introduced by Thoralf A. Skolem (1920s). Widely used in mathematical logic and computer science. Many important uses in information integration:
      to model object identifier (OID) invention (R. Hull, M. Yoshikawa, In VLDB (1990))
      to express correlation semantics, e.g., grouping and data merging (L. Popa et al., In VLDB (2002); A. Fuxman et al., In VLDB (2006); L. Libkin, C. Sirangelo, J. Comput. Syst. Sci. 77 (2011); B. Alexe et al., VLDB J. 21 (2012))
      to provide a precise representation of missing and incomplete information (Y. Papakonstantinou et al., In VLDB (1996); L. Popa et al., In VLDB (2002); R. Fagin et al., TODS 30 (2005))

  • 13-14. Schema Mapping Languages. Various logical mapping formalisms:
      s-t tgds, also known as GLAV (R. Fagin et al., Theor. Comput. Sci. 336 (2005))
      nested s-t tgds, or nested GLAV (A. Fuxman et al., In VLDB (2006))
      second-order (SO) tgds (R. Fagin et al., TODS 30 (2005))
      Expressiveness: SO tgds permit arbitrary Skolems! (R. Fagin et al., TODS 30 (2005)). FO mapping languages have more desirable programmatic and computational properties (B. ten Cate, P. Kolaitis, In ICDT (2009)).

  • 15. Characterization of Mapping Languages (R. Fagin et al., Theor. Comput. Sci. 336 (2005); R. Fagin et al., TODS 30 (2005); B. ten Cate, P. Kolaitis, In ICDT (2009)).
      Property             | GLAV           | nested GLAV        | SO tgds
      Composition          | Not closed     | Not closed         | Closed
      Value Invention      | No correlation | Linear correlation | Fully customized correlation
      Target Homomorphisms | Closed         | Closed             | Not closed
      Model Checking       | PTIME          | PTIME              | NP-Complete

  • 16. The Quest for FO Rewritability.
      Rewritability: many SO tgds are equivalent to FO mappings! We call this FO/GLAV/nested GLAV rewritable. Some SO tgds are not FO rewritable (R. Fagin et al., TODS 30 (2005)), and even testing for FO rewritability is undecidable (I. Feinerer et al., In AMW (2011)).
      Nash, Bernstein and Melnik: first sufficient condition for GLAV rewritability (A. Nash et al., TODS 32 (2007)), tailored to consider SO tgds produced by mapping composition.

  • 17. Our Contributions.
      1 Sufficient condition for nested GLAV rewritability of SO tgds
      2 Linearize: PTIME algorithm for rewriting SO tgds
      3 Equivalence-preserving transformation of SO tgds using source semantics
      4 LinearizeFDs: PTIME algorithm for rewriting SO tgds using source FDs
      5 Extensive experimental evaluation: STBenchmark 2.0 (P. C. Arocena et al., STBenchmark 2.0, tech. rep. (Uni. of Toronto, 2013)); real-life mapping scenarios

  • 18. Outline: 1 Introduction; 2 Linearization; 3 Exploiting Source Constraints; 4 Experiments; 5 Conclusions

  • 19. Intuition of Rewriting. Rewrite SO tgds into nested GLAV:
      Replace second-order existentials with first-order existentials: $f(x) \rightarrow v_f$
      Apply the logical equivalence of Skolemization in the reverse direction
      May have to reorder universal quantifiers to create $x$
      Skolemization equivalence: $\forall x\, \exists v_f\, \phi(x, v_f) \equiv \exists f\, \forall x\, \phi(x, v_f)[v_f \leftarrow f(x)]$

  • 20-21. UnSkolemization Revisited. Example: Key Invention.
      Source schema: WorksOn(Department, Project, BudgetId); Audit(BudgetId, Auditor); City(Department, City)
      Target schema: Project(PId, BudgetId); Dept(Dept, Year, Project, NumEmp); Location(Department, DepId, City, State); Budget(Project, Leader, Size)
      Mapping: $\exists f\, (\forall d\, \forall p\, \forall b\; \mathrm{WorksOn}(d, p, b) \rightarrow \mathrm{Project}(f(d, p), b))$
      We need to introduce $v_f$ nested within the scope of $d$ and $p$.

  • 22. UnSkolemization Revisited. Example: Key Invention. Source Schema WorksOn (Department, Project, BudgetId) Audit (BudgetId, Auditor) City (Department, City) Target Schema Project (PId, BudgetId) Dept (Dept, Year, Project, NumEmp) Location (Department, DepId, City, State) Budget