Pal gov.tutorial2.session13 1.data schema integration

1PalGov © 2011

أكاديمية الحكومة اإللكترونية الفلسطينيةThe Palestinian eGovernment Academy

www.egovacademy.ps

Tutorial II: Data Integration and Open Information Systems

Session 13.1

Data Schema Integration

Dr. Mustafa Jarrar

University of Birzeit

[email protected]

www.jarrar.info

mailto:[email protected]

http://www.jarrar.info/

2PalGov © 2011

About

This tutorial is part of the PalGov project, funded by the TEMPUS IV program of the

Commission of the European Communities, grant agreement 511159-TEMPUS-1-

2010-1-PS-TEMPUS-JPHES. The project website: www.egovacademy.ps

University of Trento, Italy

University of Namur, Belgium

Vrije Universiteit Brussel, Belgium

TrueTrust, UK

Birzeit University, Palestine

(Coordinator )

Palestine Polytechnic University, Palestine

Palestine Technical University, PalestineUniversité de Savoie, France

Ministry of Local Government, Palestine

Ministry of Telecom and IT, Palestine

Ministry of Interior, Palestine

Project Consortium:

Coordinator:

Dr. Mustafa Jarrar

Birzeit University, P.O.Box 14- Birzeit, Palestine

Telfax:+972 2 2982935 [email protected]

http://www.egovacademy.ps/

3PalGov © 2011

© Copyright Notes

Everyone is encouraged to use this material, or part of it, but should

properly cite the project (logo and website), and the author of that part.

No part of this tutorial may be reproduced or modified in any form or by

any means, without prior written permission from the project, who have

the full copyrights on the material.

Attribution-NonCommercial-ShareAlike

CC-BY-NC-SA

This license lets others remix, tweak, and build upon your work non-

commercially, as long as they credit you and license their new creations

under the identical terms.

PalGov © 2011 4

Tutorial Map

Topic h

Session 1: XML Basics and Namespaces 3

Session 2: XML DTD’s 3

Session 3: XML Schemas 3

Session 4: Lab-XML Schemas 3

Session 5: RDF and RDFs 3

Session 6: Lab-RDF and RDFs 3

Session 7: OWL (Ontology Web Language) 3

Session 8: Lab-OWL 3

Session 9: Lab-RDF Stores -Challenges and Solutions 3

Session 10: Lab-SPARQL 3

Session 11: Lab-Oracle Semantic Technology 3

Session 12_1: The problem of Data Integration 1.5

Session 12_2: Architectural Solutions for the Integration Issues 1.5

Session 13_1: Data Schema Integration 1

Session 13_2: GAV and LAV Integration 1

Session 13_3: Data Integration and Fusion using RDF 1

Session 14: Lab-Data Integration and Fusion using RDF 3

Session 15_1: Data Web and Linked Data 1.5

Session 15_2: RDFa 1.5

Session 16: Lab-RDFa 3

Intended Learning Objectives

A: Knowledge and Understanding

2a1: Describe tree and graph data models.

2a2: Understand the notation of XML, RDF, RDFS, and OWL.

2a3: Demonstrate knowledge about querying techniques for data

models as SPARQL and XPath.

2a4: Explain the concepts of identity management and Linked data.

2a5: Demonstrate knowledge about Integration &fusion of

heterogeneous data.

B: Intellectual Skills

2b1: Represent data using tree and graph data models (XML &

RDF).

2b2: Describe data semantics using RDFS and OWL.

2b3: Manage and query data represented in RDF, XML, OWL.

2b4: Integrate and fuse heterogeneous data.

C: Professional and Practical Skills

2c1: Using Oracle Semantic Technology and/or Virtuoso to store

and query RDF stores.

D: General and Transferable Skills2d1: Working with team.

2d2: Presenting and defending ideas.

2d3: Use of creativity and innovation in problem solving.

2d4: Develop communication skills and logical reasoning abilities.

5PalGov © 2011

Module ILOs

After completing this module students will be able to:

- Integrate heterogeneous information systems by schema integration.

6PalGov © 2011

Employee Region

Organization

City

bornIn/ locatedIn/

/Wo

rks

In

locatedIn/

Employee

Organization

/Wo

rks

In

Worker RegionCity

bornIn/ locatedIn/

Organization

Municipality

locatedIn/

Schema 1 Schema 2 Schema 3

Data Schema Integration: A simple example

In ORM:

7PalGov © 2011

Data Schema Integration: A simple exampleSource: Carlo Batini

Employee

Organiza

tion

Empoloyee City Region

Munici

pality

Organi

zation

in

in

works

born

Schema 1

Schema 2

Schema 3

Employee City Region

Organiza

tion

works

born in

inIntegrated schema

In ER:

8PalGov © 2011

Challenges of Data Schema Integration

Schema Integration has two major challenges:

1. Identification of all portions of schemas that pertain to the

same concept, in such a way to unify such different

representations in the global schema.

2. Identification, analysis and resolution of the different types

of conflicts (heterogeneities) in different schemas.

Source: Carlo Batini

9PalGov © 2011

A generic framework for Schema Integration

Schemas

Transformation

Schemas

Matching

Schemas

Integration

Local

Schemas

Integrated Schema

and mappings

Transformation

Rules

Matching

Rules

Integration

Rules

Source: Advances in Object-Oriented Data Modeling, M. P. Papazoglou, S. Spaccapietra, Z. Tari (Eds.),

The MIT Press, 2000

10PalGov © 2011

0. Define the integration strategy

If the number of local schemas to be integrated is large, the order of schema integration becomes important. Several strategies can be adopted.

Input: n source schemas

Output: n source schemas + integration strategies

Method used: heuristics


One shot strategy

S1 S2 S3

IS

Pair at a time strategy

S1 S2 S3

IS1

IS2

S4

IS

Balanced Strategy

S1 S2 S3

IS1 IS2

S4

IS

- Efficient integration process- Many correspondences between concepts have to be considered together.

- Priority to most relevant and stable schemas. - The integration process is more efficient

-Example: Production, Marketing, Sales.-To be preferred when the cohesion among schemas is high.

…

11PalGov © 2011


1. Schema transformation (or Pre-integration)

Input: n source schemas

Output: n source schemas homogeneized

Methods used: Model and Design Homogeneization

Reduce model heterogeneities as much as possible to make the sources more suitable for integration.

Goal: use a single, common data model and format.

source DBs homogeneized DBs DW

transformation integration

Source: Stefano Spaccapietra

12PalGov © 2011

Schema Transformation

Schema Transformation involves:

• Data model homogeneization

– Where all data sources are described using the same data model.

• Design homogeneization

– Enforce standard design rules to reduce the number of structural

conflicts (e.g., Normalization: one fact in one place)

• Reverse Engineering

– Reverse engineer the schema from existing data (such as COBOL

files, spreadsheets, legacy relational databases, legacy object-

oriented databases).

13PalGov © 2011

Example of Design homogeneization

(Normalization)

• ONE TABLE:

R1 (#Student, Name, LastName, #Course, CourseName,

Grade, Date)

• Dependencies:

– #Student Name, LastName

– #Course CourseName

– #Student #Course Grade, Date)

• NORMALIZED INTO 3 TABLES: ONE FACT IN ONE PLACE:

R11 (#Student, Name, LastName)

R12 (#Course, CourseName)

R13 (#Student, #Course, Grade, Date)

14PalGov © 2011

Example of Reverse EngineeringSource: Stefano Spaccapietra

15PalGov © 2011

2. Schema matching (Correspondences investigation)

2. Schema matching (Correspondences investigation)Input: n source schemas

Output: n source schemas + correspondences

Method used: techniques to discover correspondences

• Correspondences relate (schema) elements which describe the same

phenomena of the real world.

– This step aims at finding and describing all semantic

links between elements of the input schemas and the

corresponding data.

– By doing so, one matches between the schemas to be

integrated.

– This step fixes the conflicts found in the schema.

16PalGov © 2011

Semantics of Correspondences

Correspondences relate (schema) elements which

describe the same phenomena of the real world.


17PalGov © 2011

Asserting Correspondences

• Finding matching correspondences is done through the use of a rich

language for expressing correspondences (matchings).

• EXAMPLE:

S1.Person S2.Person,

With Corresponding Identifiers: Pin,

With Corresponding Property: name


18PalGov © 2011

Automated Matching

• Fully automated matching is considered impossible, as a computer

process can hardly make ultimate decisions about the semantics of

data.

• But even partial assistance in discovering of correspondences (to be

confirmed or guided by humans) is beneficial, due to the complexity of

the task.

• All proposed methods rely on some similarity measures that try to

evaluate the semantic distance between two descriptions.

• Some state of the art matching systems

Cupid (Microsoft Research, USA)

FOAM/QOM (University of Karlsruhe, Germany)

OLA (INRIA Rhône-Alpes, France / Université de Montréal,Canada)

S-Match (University of Trento, Italy)

19PalGov © 2011

Examples of CorrespondencesSource: Stefano Spaccapietra

20PalGov © 2011

Examples of Correspondences

Employee

Organization

/Wo

rks

In

Worker RegionCity

bornIn/ locatedIn/

Organization

Municipality

locatedIn/

Schema 1

Schema 2

Schema 3

21PalGov © 2011

Examples of CorrespondencesSource: Stefano Spaccapietra

22PalGov © 2011

STEP3: Schemas integration and mapping

generation

3. Schemas integration and mapping generationInput: n source schemas + correspondences

Output: integrated schema + mapping rules btw the integrated schema and input source schemas

Method used: New classification of conflicts + Conflict resolution transformations

GOAL: Creating an Integrated Schema ( IS ) and the mappings to the local databases.

Source: Carlo Batini

23PalGov © 2011

GAV and LAV Integration

Research has identified two methods to set up mappings between the

integrated schema and the input schemas:

(1) GAV (Global As View): proposes to define the integrated schema

as a view over input schemas.

• GAV is usually considered simpler and more efficient for processing

queries on the integrated database, but is weaker in supporting evolution

of the global system through addition of new sources.

(2) LAV (Local As View): proposes to define the local schemas as

views over the integrated schema.

• LAV generates issues of incomplete information, which adds complexity

in handling global queries, but it better supports dynamic addition and

removal of source.

24PalGov © 2011

Integration Process

• After we identified the correspondences (in the previous step), we now

solve the conflicts:

• One can distinguish between four types of conflicts:

– Structural conflicts

– Classification conflicts

– Descriptive conflicts

– Fragmentation conflicts

• Examples of conflicts among related object types

– different classifications (sets of instances)

– different sets of properties

– different structures

– different coding schemes

– …

25PalGov © 2011

Integration Rules

• Rules defining the strategy to solve conflicts

• Example rules:

– If an object type corresponds to an attribute, keep the

object type

– If the population of an object type is included in the

population of another object type, build an is-a

hierarchy

• Integration rules depend on how you want the

integrated schema to look like

26PalGov © 2011

Structural Conflicts

• Different schema element types, e.g.: class, attribute, relationship

• Library example:

– S1 : Book is a class

– S2 : books is an attribute of Author

• Conflict resolution :

Choose the less constraining structure

– Integrated Schema: Book is a class

S1

S2


27PalGov © 2011

Classification Conflicts

• Corresponding elements describe different sets of real

world objects

– S1.Faculty CONTAINS S2.PhD-advisor

• Conflict Resolution:

– Generalization / Specialization hierarchy

– Merging

Phd-advisor

Faculty

Phd-advisor

FacultyS1

S2

Faculty

28PalGov © 2011

Descriptive Conflicts

• Corresponding types have different properties, or

corresponding properties are described in different ways

• Object / Entity / Relationship type:

– naming conflicts :

• synonyms Node , Extremity

• homonyms Highway (EU) , Highway (USA)

– composition conflicts : different attributes and methods

• Employee ( E# , name , address )

• Employee ( E# , position , salary , department )

29PalGov © 2011

Integration Methods: Manual

Easy to implement , Flexible

BUT

time consuming for the DBA

a language

schemas integrated

schema

mapping

rules

DBA

• First method : manual integration

“ do it yourself ”


30PalGov © 2011

Integration Methods: Semi-Automatic

schemasintegrated

schema

mapping rules

TOOL

DBA

correspondences

Opens to visual CASE tools, integration servers

BUT knowledge acquisition can be painful

• Second method : semi-automatic integration

“ tell me about the problem , I will try to fix it “


31PalGov © 2011

References

• Carlo Batini: Course on Data Integration. BZU IT Summer School

2011.

• Stefano Spaccapietra: Information Integration. Presentation at the IFIP

Academy. Porto Alegre. 2005.

• Chris Bizer: The Emerging Web of Linked Data. Presentation at SRI

International, Artificial Intelligence Center. Menlo Park, USA. 2009.

Education

Pal gov.tutorial2.session13 1.data schema integration