31
1 PalGov © 2011 فلسطينيةلكترونية الديمية الحكومة ا أكاThe Palestinian eGovernment Academy www.egovacademy.ps Tutorial II: Data Integration and Open Information Systems Session 13.1 Data Schema Integration Dr. Mustafa Jarrar University of Birzeit [email protected] www.jarrar.info

Pal gov.tutorial2.session13 1.data schema integration

Embed Size (px)

Citation preview

Page 1: Pal gov.tutorial2.session13 1.data schema integration

1PalGov © 2011

أكاديمية الحكومة اإللكترونية الفلسطينيةThe Palestinian eGovernment Academy

www.egovacademy.ps

Tutorial II: Data Integration and Open Information Systems

Session 13.1

Data Schema Integration

Dr. Mustafa Jarrar

University of Birzeit

[email protected]

www.jarrar.info

Page 2: Pal gov.tutorial2.session13 1.data schema integration

2PalGov © 2011

About

This tutorial is part of the PalGov project, funded by the TEMPUS IV program of the

Commission of the European Communities, grant agreement 511159-TEMPUS-1-

2010-1-PS-TEMPUS-JPHES. The project website: www.egovacademy.ps

University of Trento, Italy

University of Namur, Belgium

Vrije Universiteit Brussel, Belgium

TrueTrust, UK

Birzeit University, Palestine

(Coordinator )

Palestine Polytechnic University, Palestine

Palestine Technical University, PalestineUniversité de Savoie, France

Ministry of Local Government, Palestine

Ministry of Telecom and IT, Palestine

Ministry of Interior, Palestine

Project Consortium:

Coordinator:

Dr. Mustafa Jarrar

Birzeit University, P.O.Box 14- Birzeit, Palestine

Telfax:+972 2 2982935 [email protected]

Page 3: Pal gov.tutorial2.session13 1.data schema integration

3PalGov © 2011

© Copyright Notes

Everyone is encouraged to use this material, or part of it, but should

properly cite the project (logo and website), and the author of that part.

No part of this tutorial may be reproduced or modified in any form or by

any means, without prior written permission from the project, who have

the full copyrights on the material.

Attribution-NonCommercial-ShareAlike

CC-BY-NC-SA

This license lets others remix, tweak, and build upon your work non-

commercially, as long as they credit you and license their new creations

under the identical terms.

Page 4: Pal gov.tutorial2.session13 1.data schema integration

PalGov © 2011 4

Tutorial Map

Topic h

Session 1: XML Basics and Namespaces 3

Session 2: XML DTD’s 3

Session 3: XML Schemas 3

Session 4: Lab-XML Schemas 3

Session 5: RDF and RDFs 3

Session 6: Lab-RDF and RDFs 3

Session 7: OWL (Ontology Web Language) 3

Session 8: Lab-OWL 3

Session 9: Lab-RDF Stores -Challenges and Solutions 3

Session 10: Lab-SPARQL 3

Session 11: Lab-Oracle Semantic Technology 3

Session 12_1: The problem of Data Integration 1.5

Session 12_2: Architectural Solutions for the Integration Issues 1.5

Session 13_1: Data Schema Integration 1

Session 13_2: GAV and LAV Integration 1

Session 13_3: Data Integration and Fusion using RDF 1

Session 14: Lab-Data Integration and Fusion using RDF 3

Session 15_1: Data Web and Linked Data 1.5

Session 15_2: RDFa 1.5

Session 16: Lab-RDFa 3

Intended Learning Objectives

A: Knowledge and Understanding

2a1: Describe tree and graph data models.

2a2: Understand the notation of XML, RDF, RDFS, and OWL.

2a3: Demonstrate knowledge about querying techniques for data

models as SPARQL and XPath.

2a4: Explain the concepts of identity management and Linked data.

2a5: Demonstrate knowledge about Integration &fusion of

heterogeneous data.

B: Intellectual Skills

2b1: Represent data using tree and graph data models (XML &

RDF).

2b2: Describe data semantics using RDFS and OWL.

2b3: Manage and query data represented in RDF, XML, OWL.

2b4: Integrate and fuse heterogeneous data.

C: Professional and Practical Skills

2c1: Using Oracle Semantic Technology and/or Virtuoso to store

and query RDF stores.

D: General and Transferable Skills2d1: Working with team.

2d2: Presenting and defending ideas.

2d3: Use of creativity and innovation in problem solving.

2d4: Develop communication skills and logical reasoning abilities.

Page 5: Pal gov.tutorial2.session13 1.data schema integration

5PalGov © 2011

Module ILOs

After completing this module students will be able to:

- Integrate heterogeneous information systems by schema integration.

Page 6: Pal gov.tutorial2.session13 1.data schema integration

6PalGov © 2011

Employee Region

Organization

City

bornIn/ locatedIn/

/Wo

rks

In

locatedIn/

Employee

Organization

/Wo

rks

In

Worker RegionCity

bornIn/ locatedIn/

Organization

Municipality

locatedIn/

Schema 1 Schema 2 Schema 3

Data Schema Integration: A simple example

In ORM:

Page 7: Pal gov.tutorial2.session13 1.data schema integration

7PalGov © 2011

Data Schema Integration: A simple exampleSource: Carlo Batini

Employee

Organiza

tion

Empoloyee City Region

Munici

pality

Organi

zation

in

in

works

born

Schema 1

Schema 2

Schema 3

Employee City Region

Organiza

tion

works

born in

inIntegrated schema

In ER:

Page 8: Pal gov.tutorial2.session13 1.data schema integration

8PalGov © 2011

Challenges of Data Schema Integration

Schema Integration has two major challenges:

1. Identification of all portions of schemas that pertain to the

same concept, in such a way to unify such different

representations in the global schema.

2. Identification, analysis and resolution of the different types

of conflicts (heterogeneities) in different schemas.

Source: Carlo Batini

Page 9: Pal gov.tutorial2.session13 1.data schema integration

9PalGov © 2011

A generic framework for Schema Integration

Schemas

Transformation

Schemas

Matching

Schemas

Integration

Local

Schemas

Integrated Schema

and mappings

Transformation

Rules

Matching

Rules

Integration

Rules

Source: Advances in Object-Oriented Data Modeling, M. P. Papazoglou, S. Spaccapietra, Z. Tari (Eds.),

The MIT Press, 2000

Page 10: Pal gov.tutorial2.session13 1.data schema integration

10PalGov © 2011

0. Define the integration strategy

If the number of local schemas to be integrated is large, the order of schema integration becomes important. Several strategies can be adopted.

Input: n source schemas

Output: n source schemas + integration strategies

Method used: heuristics

A generic framework for Schema Integration

One shot strategy

S1 S2 S3

IS

Pair at a time strategy

S1 S2 S3

IS1

IS2

S4

IS

Balanced Strategy

S1 S2 S3

IS1 IS2

S4

IS

- Efficient integration process- Many correspondences between concepts have to be considered together.

- Priority to most relevant and stable schemas. - The integration process is more efficient

-Example: Production, Marketing, Sales.-To be preferred when the cohesion among schemas is high.

Page 11: Pal gov.tutorial2.session13 1.data schema integration

11PalGov © 2011

A generic framework for Schema Integration

1. Schema transformation (or Pre-integration)

Input: n source schemas

Output: n source schemas homogeneized

Methods used: Model and Design Homogeneization

Reduce model heterogeneities as much as possible to make the sources more suitable for integration.

Goal: use a single, common data model and format.

source DBs homogeneized DBs DW

transformation integration

Source: Stefano Spaccapietra

Page 12: Pal gov.tutorial2.session13 1.data schema integration

12PalGov © 2011

Schema Transformation

Schema Transformation involves:

• Data model homogeneization

– Where all data sources are described using the same data model.

• Design homogeneization

– Enforce standard design rules to reduce the number of structural

conflicts (e.g., Normalization: one fact in one place)

• Reverse Engineering

– Reverse engineer the schema from existing data (such as COBOL

files, spreadsheets, legacy relational databases, legacy object-

oriented databases).

Page 13: Pal gov.tutorial2.session13 1.data schema integration

13PalGov © 2011

Example of Design homogeneization

(Normalization)

• ONE TABLE:

R1 (#Student, Name, LastName, #Course, CourseName,

Grade, Date)

• Dependencies:

– #Student Name, LastName

– #Course CourseName

– #Student #Course Grade, Date)

• NORMALIZED INTO 3 TABLES: ONE FACT IN ONE PLACE:

R11 (#Student, Name, LastName)

R12 (#Course, CourseName)

R13 (#Student, #Course, Grade, Date)

Page 14: Pal gov.tutorial2.session13 1.data schema integration

14PalGov © 2011

Example of Reverse EngineeringSource: Stefano Spaccapietra

Page 15: Pal gov.tutorial2.session13 1.data schema integration

15PalGov © 2011

2. Schema matching (Correspondences investigation)

2. Schema matching (Correspondences investigation)Input: n source schemas

Output: n source schemas + correspondences

Method used: techniques to discover correspondences

• Correspondences relate (schema) elements which describe the same

phenomena of the real world.

– This step aims at finding and describing all semantic

links between elements of the input schemas and the

corresponding data.

– By doing so, one matches between the schemas to be

integrated.

– This step fixes the conflicts found in the schema.

Page 16: Pal gov.tutorial2.session13 1.data schema integration

16PalGov © 2011

Semantics of Correspondences

Correspondences relate (schema) elements which

describe the same phenomena of the real world.

Source: Stefano Spaccapietra

Page 17: Pal gov.tutorial2.session13 1.data schema integration

17PalGov © 2011

Asserting Correspondences

• Finding matching correspondences is done through the use of a rich

language for expressing correspondences (matchings).

• EXAMPLE:

S1.Person S2.Person,

With Corresponding Identifiers: Pin,

With Corresponding Property: name

Source: Stefano Spaccapietra

Page 18: Pal gov.tutorial2.session13 1.data schema integration

18PalGov © 2011

Automated Matching

• Fully automated matching is considered impossible, as a computer

process can hardly make ultimate decisions about the semantics of

data.

• But even partial assistance in discovering of correspondences (to be

confirmed or guided by humans) is beneficial, due to the complexity of

the task.

• All proposed methods rely on some similarity measures that try to

evaluate the semantic distance between two descriptions.

• Some state of the art matching systems

Cupid (Microsoft Research, USA)

FOAM/QOM (University of Karlsruhe, Germany)

OLA (INRIA Rhône-Alpes, France / Université de Montréal,Canada)

S-Match (University of Trento, Italy)

Page 19: Pal gov.tutorial2.session13 1.data schema integration

19PalGov © 2011

Examples of CorrespondencesSource: Stefano Spaccapietra

Page 20: Pal gov.tutorial2.session13 1.data schema integration

20PalGov © 2011

Examples of Correspondences

Employee

Organization

/Wo

rks

In

Worker RegionCity

bornIn/ locatedIn/

Organization

Municipality

locatedIn/

Schema 1

Schema 2

Schema 3

Page 21: Pal gov.tutorial2.session13 1.data schema integration

21PalGov © 2011

Examples of CorrespondencesSource: Stefano Spaccapietra

Page 22: Pal gov.tutorial2.session13 1.data schema integration

22PalGov © 2011

STEP3: Schemas integration and mapping

generation

3. Schemas integration and mapping generationInput: n source schemas + correspondences

Output: integrated schema + mapping rules btw the integrated schema and input source schemas

Method used: New classification of conflicts + Conflict resolution transformations

GOAL: Creating an Integrated Schema ( IS ) and the mappings to the local databases.

Source: Carlo Batini

Page 23: Pal gov.tutorial2.session13 1.data schema integration

23PalGov © 2011

GAV and LAV Integration

Research has identified two methods to set up mappings between the

integrated schema and the input schemas:

(1) GAV (Global As View): proposes to define the integrated schema

as a view over input schemas.

• GAV is usually considered simpler and more efficient for processing

queries on the integrated database, but is weaker in supporting evolution

of the global system through addition of new sources.

(2) LAV (Local As View): proposes to define the local schemas as

views over the integrated schema.

• LAV generates issues of incomplete information, which adds complexity

in handling global queries, but it better supports dynamic addition and

removal of source.

Page 24: Pal gov.tutorial2.session13 1.data schema integration

24PalGov © 2011

Integration Process

• After we identified the correspondences (in the previous step), we now

solve the conflicts:

• One can distinguish between four types of conflicts:

– Structural conflicts

– Classification conflicts

– Descriptive conflicts

– Fragmentation conflicts

• Examples of conflicts among related object types

– different classifications (sets of instances)

– different sets of properties

– different structures

– different coding schemes

– …

Page 25: Pal gov.tutorial2.session13 1.data schema integration

25PalGov © 2011

Integration Rules

• Rules defining the strategy to solve conflicts

• Example rules:

– If an object type corresponds to an attribute, keep the

object type

– If the population of an object type is included in the

population of another object type, build an is-a

hierarchy

• Integration rules depend on how you want the

integrated schema to look like

Page 26: Pal gov.tutorial2.session13 1.data schema integration

26PalGov © 2011

Structural Conflicts

• Different schema element types, e.g.: class, attribute, relationship

• Library example:

– S1 : Book is a class

– S2 : books is an attribute of Author

• Conflict resolution :

Choose the less constraining structure

– Integrated Schema: Book is a class

S1

S2

Source: Stefano Spaccapietra

Page 27: Pal gov.tutorial2.session13 1.data schema integration

27PalGov © 2011

Classification Conflicts

• Corresponding elements describe different sets of real

world objects

– S1.Faculty CONTAINS S2.PhD-advisor

• Conflict Resolution:

– Generalization / Specialization hierarchy

– Merging

Phd-advisor

Faculty

Phd-advisor

FacultyS1

S2

Faculty

Page 28: Pal gov.tutorial2.session13 1.data schema integration

28PalGov © 2011

Descriptive Conflicts

• Corresponding types have different properties, or

corresponding properties are described in different ways

• Object / Entity / Relationship type:

– naming conflicts :

• synonyms Node , Extremity

• homonyms Highway (EU) , Highway (USA)

– composition conflicts : different attributes and methods

• Employee ( E# , name , address )

• Employee ( E# , position , salary , department )

Page 29: Pal gov.tutorial2.session13 1.data schema integration

29PalGov © 2011

Integration Methods: Manual

Easy to implement , Flexible

BUT

time consuming for the DBA

a language

schemas integrated

schema

mapping

rules

DBA

• First method : manual integration

“ do it yourself ”

Source: Stefano Spaccapietra

Page 30: Pal gov.tutorial2.session13 1.data schema integration

30PalGov © 2011

Integration Methods: Semi-Automatic

schemasintegrated

schema

mapping rules

TOOL

DBA

correspondences

Opens to visual CASE tools, integration servers

BUT knowledge acquisition can be painful

• Second method : semi-automatic integration

“ tell me about the problem , I will try to fix it “

Source: Stefano Spaccapietra

Page 31: Pal gov.tutorial2.session13 1.data schema integration

31PalGov © 2011

References

• Carlo Batini: Course on Data Integration. BZU IT Summer School

2011.

• Stefano Spaccapietra: Information Integration. Presentation at the IFIP

Academy. Porto Alegre. 2005.

• Chris Bizer: The Emerging Web of Linked Data. Presentation at SRI

International, Artificial Intelligence Center. Menlo Park, USA. 2009.