Upload
mustafa-jarrar
View
644
Download
3
Embed Size (px)
Citation preview
1PalGov © 2011
أكاديمية الحكومة اإللكترونية الفلسطينيةThe Palestinian eGovernment Academy
www.egovacademy.ps
Tutorial II: Data Integration and Open Information Systems
Session 13.1
Data Schema Integration
Dr. Mustafa Jarrar
University of Birzeit
www.jarrar.info
2PalGov © 2011
About
This tutorial is part of the PalGov project, funded by the TEMPUS IV program of the
Commission of the European Communities, grant agreement 511159-TEMPUS-1-
2010-1-PS-TEMPUS-JPHES. The project website: www.egovacademy.ps
University of Trento, Italy
University of Namur, Belgium
Vrije Universiteit Brussel, Belgium
TrueTrust, UK
Birzeit University, Palestine
(Coordinator )
Palestine Polytechnic University, Palestine
Palestine Technical University, PalestineUniversité de Savoie, France
Ministry of Local Government, Palestine
Ministry of Telecom and IT, Palestine
Ministry of Interior, Palestine
Project Consortium:
Coordinator:
Dr. Mustafa Jarrar
Birzeit University, P.O.Box 14- Birzeit, Palestine
Telfax:+972 2 2982935 [email protected]
3PalGov © 2011
© Copyright Notes
Everyone is encouraged to use this material, or part of it, but should
properly cite the project (logo and website), and the author of that part.
No part of this tutorial may be reproduced or modified in any form or by
any means, without prior written permission from the project, who have
the full copyrights on the material.
Attribution-NonCommercial-ShareAlike
CC-BY-NC-SA
This license lets others remix, tweak, and build upon your work non-
commercially, as long as they credit you and license their new creations
under the identical terms.
PalGov © 2011 4
Tutorial Map
Topic h
Session 1: XML Basics and Namespaces 3
Session 2: XML DTD’s 3
Session 3: XML Schemas 3
Session 4: Lab-XML Schemas 3
Session 5: RDF and RDFs 3
Session 6: Lab-RDF and RDFs 3
Session 7: OWL (Ontology Web Language) 3
Session 8: Lab-OWL 3
Session 9: Lab-RDF Stores -Challenges and Solutions 3
Session 10: Lab-SPARQL 3
Session 11: Lab-Oracle Semantic Technology 3
Session 12_1: The problem of Data Integration 1.5
Session 12_2: Architectural Solutions for the Integration Issues 1.5
Session 13_1: Data Schema Integration 1
Session 13_2: GAV and LAV Integration 1
Session 13_3: Data Integration and Fusion using RDF 1
Session 14: Lab-Data Integration and Fusion using RDF 3
Session 15_1: Data Web and Linked Data 1.5
Session 15_2: RDFa 1.5
Session 16: Lab-RDFa 3
Intended Learning Objectives
A: Knowledge and Understanding
2a1: Describe tree and graph data models.
2a2: Understand the notation of XML, RDF, RDFS, and OWL.
2a3: Demonstrate knowledge about querying techniques for data
models as SPARQL and XPath.
2a4: Explain the concepts of identity management and Linked data.
2a5: Demonstrate knowledge about Integration &fusion of
heterogeneous data.
B: Intellectual Skills
2b1: Represent data using tree and graph data models (XML &
RDF).
2b2: Describe data semantics using RDFS and OWL.
2b3: Manage and query data represented in RDF, XML, OWL.
2b4: Integrate and fuse heterogeneous data.
C: Professional and Practical Skills
2c1: Using Oracle Semantic Technology and/or Virtuoso to store
and query RDF stores.
D: General and Transferable Skills2d1: Working with team.
2d2: Presenting and defending ideas.
2d3: Use of creativity and innovation in problem solving.
2d4: Develop communication skills and logical reasoning abilities.
5PalGov © 2011
Module ILOs
After completing this module students will be able to:
- Integrate heterogeneous information systems by schema integration.
6PalGov © 2011
Employee Region
Organization
City
bornIn/ locatedIn/
/Wo
rks
In
locatedIn/
Employee
Organization
/Wo
rks
In
Worker RegionCity
bornIn/ locatedIn/
Organization
Municipality
locatedIn/
Schema 1 Schema 2 Schema 3
Data Schema Integration: A simple example
In ORM:
7PalGov © 2011
Data Schema Integration: A simple exampleSource: Carlo Batini
Employee
Organiza
tion
Empoloyee City Region
Munici
pality
Organi
zation
in
in
works
born
Schema 1
Schema 2
Schema 3
Employee City Region
Organiza
tion
works
born in
inIntegrated schema
In ER:
8PalGov © 2011
Challenges of Data Schema Integration
Schema Integration has two major challenges:
1. Identification of all portions of schemas that pertain to the
same concept, in such a way to unify such different
representations in the global schema.
2. Identification, analysis and resolution of the different types
of conflicts (heterogeneities) in different schemas.
Source: Carlo Batini
9PalGov © 2011
A generic framework for Schema Integration
Schemas
Transformation
Schemas
Matching
Schemas
Integration
Local
Schemas
Integrated Schema
and mappings
Transformation
Rules
Matching
Rules
Integration
Rules
Source: Advances in Object-Oriented Data Modeling, M. P. Papazoglou, S. Spaccapietra, Z. Tari (Eds.),
The MIT Press, 2000
10PalGov © 2011
0. Define the integration strategy
If the number of local schemas to be integrated is large, the order of schema integration becomes important. Several strategies can be adopted.
Input: n source schemas
Output: n source schemas + integration strategies
Method used: heuristics
A generic framework for Schema Integration
One shot strategy
S1 S2 S3
IS
Pair at a time strategy
S1 S2 S3
IS1
IS2
S4
IS
Balanced Strategy
S1 S2 S3
IS1 IS2
S4
IS
- Efficient integration process- Many correspondences between concepts have to be considered together.
- Priority to most relevant and stable schemas. - The integration process is more efficient
-Example: Production, Marketing, Sales.-To be preferred when the cohesion among schemas is high.
…
11PalGov © 2011
A generic framework for Schema Integration
1. Schema transformation (or Pre-integration)
Input: n source schemas
Output: n source schemas homogeneized
Methods used: Model and Design Homogeneization
Reduce model heterogeneities as much as possible to make the sources more suitable for integration.
Goal: use a single, common data model and format.
source DBs homogeneized DBs DW
transformation integration
Source: Stefano Spaccapietra
12PalGov © 2011
Schema Transformation
Schema Transformation involves:
• Data model homogeneization
– Where all data sources are described using the same data model.
• Design homogeneization
– Enforce standard design rules to reduce the number of structural
conflicts (e.g., Normalization: one fact in one place)
• Reverse Engineering
– Reverse engineer the schema from existing data (such as COBOL
files, spreadsheets, legacy relational databases, legacy object-
oriented databases).
13PalGov © 2011
Example of Design homogeneization
(Normalization)
• ONE TABLE:
R1 (#Student, Name, LastName, #Course, CourseName,
Grade, Date)
• Dependencies:
– #Student Name, LastName
– #Course CourseName
– #Student #Course Grade, Date)
• NORMALIZED INTO 3 TABLES: ONE FACT IN ONE PLACE:
R11 (#Student, Name, LastName)
R12 (#Course, CourseName)
R13 (#Student, #Course, Grade, Date)
14PalGov © 2011
Example of Reverse EngineeringSource: Stefano Spaccapietra
15PalGov © 2011
2. Schema matching (Correspondences investigation)
2. Schema matching (Correspondences investigation)Input: n source schemas
Output: n source schemas + correspondences
Method used: techniques to discover correspondences
• Correspondences relate (schema) elements which describe the same
phenomena of the real world.
– This step aims at finding and describing all semantic
links between elements of the input schemas and the
corresponding data.
– By doing so, one matches between the schemas to be
integrated.
– This step fixes the conflicts found in the schema.
16PalGov © 2011
Semantics of Correspondences
Correspondences relate (schema) elements which
describe the same phenomena of the real world.
Source: Stefano Spaccapietra
17PalGov © 2011
Asserting Correspondences
• Finding matching correspondences is done through the use of a rich
language for expressing correspondences (matchings).
• EXAMPLE:
S1.Person S2.Person,
With Corresponding Identifiers: Pin,
With Corresponding Property: name
Source: Stefano Spaccapietra
18PalGov © 2011
Automated Matching
• Fully automated matching is considered impossible, as a computer
process can hardly make ultimate decisions about the semantics of
data.
• But even partial assistance in discovering of correspondences (to be
confirmed or guided by humans) is beneficial, due to the complexity of
the task.
• All proposed methods rely on some similarity measures that try to
evaluate the semantic distance between two descriptions.
• Some state of the art matching systems
Cupid (Microsoft Research, USA)
FOAM/QOM (University of Karlsruhe, Germany)
OLA (INRIA Rhône-Alpes, France / Université de Montréal,Canada)
S-Match (University of Trento, Italy)
19PalGov © 2011
Examples of CorrespondencesSource: Stefano Spaccapietra
20PalGov © 2011
Examples of Correspondences
Employee
Organization
/Wo
rks
In
Worker RegionCity
bornIn/ locatedIn/
Organization
Municipality
locatedIn/
Schema 1
Schema 2
Schema 3
21PalGov © 2011
Examples of CorrespondencesSource: Stefano Spaccapietra
22PalGov © 2011
STEP3: Schemas integration and mapping
generation
3. Schemas integration and mapping generationInput: n source schemas + correspondences
Output: integrated schema + mapping rules btw the integrated schema and input source schemas
Method used: New classification of conflicts + Conflict resolution transformations
GOAL: Creating an Integrated Schema ( IS ) and the mappings to the local databases.
Source: Carlo Batini
23PalGov © 2011
GAV and LAV Integration
Research has identified two methods to set up mappings between the
integrated schema and the input schemas:
(1) GAV (Global As View): proposes to define the integrated schema
as a view over input schemas.
• GAV is usually considered simpler and more efficient for processing
queries on the integrated database, but is weaker in supporting evolution
of the global system through addition of new sources.
(2) LAV (Local As View): proposes to define the local schemas as
views over the integrated schema.
• LAV generates issues of incomplete information, which adds complexity
in handling global queries, but it better supports dynamic addition and
removal of source.
24PalGov © 2011
Integration Process
• After we identified the correspondences (in the previous step), we now
solve the conflicts:
• One can distinguish between four types of conflicts:
– Structural conflicts
– Classification conflicts
– Descriptive conflicts
– Fragmentation conflicts
• Examples of conflicts among related object types
– different classifications (sets of instances)
– different sets of properties
– different structures
– different coding schemes
– …
25PalGov © 2011
Integration Rules
• Rules defining the strategy to solve conflicts
• Example rules:
– If an object type corresponds to an attribute, keep the
object type
– If the population of an object type is included in the
population of another object type, build an is-a
hierarchy
• Integration rules depend on how you want the
integrated schema to look like
26PalGov © 2011
Structural Conflicts
• Different schema element types, e.g.: class, attribute, relationship
• Library example:
– S1 : Book is a class
– S2 : books is an attribute of Author
• Conflict resolution :
Choose the less constraining structure
– Integrated Schema: Book is a class
S1
S2
Source: Stefano Spaccapietra
27PalGov © 2011
Classification Conflicts
• Corresponding elements describe different sets of real
world objects
– S1.Faculty CONTAINS S2.PhD-advisor
• Conflict Resolution:
– Generalization / Specialization hierarchy
– Merging
Phd-advisor
Faculty
Phd-advisor
FacultyS1
S2
Faculty
28PalGov © 2011
Descriptive Conflicts
• Corresponding types have different properties, or
corresponding properties are described in different ways
• Object / Entity / Relationship type:
– naming conflicts :
• synonyms Node , Extremity
• homonyms Highway (EU) , Highway (USA)
– composition conflicts : different attributes and methods
• Employee ( E# , name , address )
• Employee ( E# , position , salary , department )
29PalGov © 2011
Integration Methods: Manual
Easy to implement , Flexible
BUT
time consuming for the DBA
a language
schemas integrated
schema
mapping
rules
DBA
• First method : manual integration
“ do it yourself ”
Source: Stefano Spaccapietra
30PalGov © 2011
Integration Methods: Semi-Automatic
schemasintegrated
schema
mapping rules
TOOL
DBA
correspondences
Opens to visual CASE tools, integration servers
BUT knowledge acquisition can be painful
• Second method : semi-automatic integration
“ tell me about the problem , I will try to fix it “
Source: Stefano Spaccapietra
31PalGov © 2011
References
• Carlo Batini: Course on Data Integration. BZU IT Summer School
2011.
• Stefano Spaccapietra: Information Integration. Presentation at the IFIP
Academy. Porto Alegre. 2005.
• Chris Bizer: The Emerging Web of Linked Data. Presentation at SRI
International, Artificial Intelligence Center. Menlo Park, USA. 2009.