Complex Matching of RDF Datatype Properties. Bernardo Pereira Nunes 1,2, Alexander Mera 1, Marco Antonio Casanova 1, Besnik Fetahu 2, Luiz André P. Paes Leme 3, Stefan Dietze 2. 1) Department of Informatics-PUC-Rio, 2) L3S Research Center, Leibniz University Hannover, 3) Computer Science Institute, Fluminense Federal University. DEXA 2013 – Prague, Czech Republic


Paper about complex matching of RDF datatype properties. DEXA Conference, 2013, Prague, Czech Republic.


Page 1: Complex Matching of RDF Datatype Properties

Complex Matching of RDF Datatype Properties

Bernardo Pereira Nunes1,2, Alexander Mera1, Marco Antonio Casanova1, Besnik Fetahu2, Luiz André P. Paes Leme3, Stefan Dietze2

1) Department of Informatics-PUC-Rio, 2) L3S Research Center, Leibniz University Hannover,

3) Computer Science Institute, Fluminense Federal University

DEXA 2013 – Prague, Czech Republic

Page 2: Complex Matching of RDF Datatype Properties

Outline

• Introduction

• Motivation

• Related Work

• Schema Matching Principles

• Our approach:

• Phase 1) Estimated Mutual Information – EMI

• Phase 2) Genetic Programming – GP

• Evaluation

• Results

• Conclusions

Besnik Fetahu, DEXA 2013 – Prague, Czech Republic

Page 3: Complex Matching of RDF Datatype Properties

Introduction

• Data Integration

• Combine different data sources into a unified view of data

• Originally driven by large organizations:

• Merging company databases after acquisitions

• Currently, driven by new Web trends such as:

• Improvement of Web-based search

• Proliferation of Web applications

• e-business

• Examples: momondo.de, semantic search, price-watcher sites, etc.


Page 4: Complex Matching of RDF Datatype Properties

Introduction

• Challenges

• Heterogeneous data

• Different data formats

• Data quality (data impurities, corrupted information)

• Scalability

• Adaptability

• Cost


Page 5: Complex Matching of RDF Datatype Properties

Introduction

• Initiatives to address data integration problems

• Linked Data Principles

• Ontology Alignment Initiatives (OAI)

• Schema Matching tools


Page 6: Complex Matching of RDF Datatype Properties

Motivation

• Given two schemas S and T, a matching from S to T exists when an element e of S is mapped to an element e’ of T by some expression that relates the two elements.


Page 7: Complex Matching of RDF Datatype Properties

Related Work

• Methods

• RiMOM, iMAP, S-Match, DSSim, ATOM, etc.

• Schema-based approach

• Instance-based approach

• Hybrid approach

• Cardinality

• 1:1

• 1:n

• n:m


Rahm, E. and Bernstein, P. 2001. A survey of approaches to automatic schema matching. The VLDB Journal 10, 4 (Dec. 2001), 334-350.

Page 8: Complex Matching of RDF Datatype Properties

Cardinality

• Simple match

• 1:1 – direct matching

• Complex match

• 1:1 / n:1 (mapping functions)

• Example (simple match): ISBN "0-671-72287-5" ↔ ISBN "0-671-72287-5"

• Example (complex match): Fullname "William Shakespeare" ↔ Firstname "William" + Last name "Shakespeare", using split(fullname) in one direction and concatenate(f,l) in the other
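The two mapping functions named on this slide can be sketched in a few lines of Python. This is only an illustration: the slide gives the function names `split` and `concatenate` but not their definitions, so splitting on the first space and joining with a single space are assumptions.

```python
def split(fullname):
    """Split a full name into (first name, last name) at the first space.

    Assumption: the part before the first space is the first name.
    """
    first, _, last = fullname.partition(" ")
    return first, last


def concatenate(f, l):
    """Join a first name and a last name with a single space."""
    return f"{f} {l}"


first, last = split("William Shakespeare")
fullname = concatenate(first, last)
```

Applying `split` and then `concatenate` returns the original value here, which is what makes the pair usable as a mapping in both directions.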

Page 9: Complex Matching of RDF Datatype Properties

Our approach

• Two-phase approach:

• Estimated Mutual Information

• Suggests 1:1 and 1:n mappings

• Serves as a filtering step (filters out data properties that have no mutual information)

• Reduces the search space for the next phase (speeds up the process)

• Genetic Programming

• Automatic process for creating mapping functions

• Reduces the cost of traversing the search space


Page 10: Complex Matching of RDF Datatype Properties

Estimated Mutual Information (EMI)

• EMI Matrix

• p = (p1, …, pu) and q = (q1, …, qv) are two lists of sets (i.e., sets of datatype properties)
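The formula for the EMI matrix did not survive in this transcript. As a rough sketch only, the code below uses the usual count-based estimate of a mutual information term for each cell of a co-occurrence matrix. The function name `emi_matrix`, the input layout (rows for p1..pu, columns for q1..qv) and the exact formula are assumptions, not taken from the slides.

```python
import math


def emi_matrix(m):
    """Score each cell of a co-occurrence matrix with a mutual-information term.

    m[i][j] is a non-negative co-occurrence count (or similarity) between
    p_i and q_j. Assumed formula: (m_ij / M) * log(M * m_ij / (row_i * col_j)),
    where M is the sum of all cells. Cells with m_ij == 0 score 0.
    """
    total = sum(sum(row) for row in m)
    row_sums = [sum(row) for row in m]
    col_sums = [sum(col) for col in zip(*m)]
    emi = []
    for i, row in enumerate(m):
        emi_row = []
        for j, m_ij in enumerate(row):
            if m_ij == 0:
                emi_row.append(0.0)
            else:
                joint = m_ij / total
                expected = (row_sums[i] / total) * (col_sums[j] / total)
                emi_row.append(joint * math.log(joint / expected))
        emi.append(emi_row)
    return emi


# Hypothetical 2x2 co-occurrence matrix: p1 co-occurs only with q1, p2 only with q2.
scores = emi_matrix([[2, 0], [0, 2]])
```

Under this reading, a pair with a positive score shares more values than chance would predict and stays a candidate for matching, while pairs with a score of zero or below can be filtered out.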

[Figure labels: Cosine Similarity, Jaccard Index, …]

Page 11: Complex Matching of RDF Datatype Properties

Estimated Mutual Information (EMI)

• Computing the mutual information:

• Cosine Similarity

• Simple matches: William Shakespeare → William Shakespeare

• Jaccard Similarity

• Simple and Complex matches: William → William Shakespeare
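The contrast between the two measures can be illustrated with a small sketch. It assumes that cosine similarity compares whole property values while the Jaccard index compares whitespace-separated tokens; the slides do not state how values are tokenized, so both function bodies are illustrative assumptions.

```python
import math
from collections import Counter


def cosine_similarity(values_a, values_b):
    """Cosine similarity between two bags of whole property values."""
    a, b = Counter(values_a), Counter(values_b)
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def jaccard_similarity(values_a, values_b):
    """Jaccard index over the whitespace tokens of the property values."""
    tokens_a = {t for v in values_a for t in v.split()}
    tokens_b = {t for v in values_b for t in v.split()}
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0


same = cosine_similarity(["William Shakespeare"], ["William Shakespeare"])
partial_cos = cosine_similarity(["William"], ["William Shakespeare"])
partial_jac = jaccard_similarity(["William"], ["William Shakespeare"])
```

With these definitions the identical values score 1.0 under cosine similarity, while "William" against "William Shakespeare" scores 0.0 under cosine but 0.5 under Jaccard, which is why the token-level measure can also surface complex matches.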


Page 12: Complex Matching of RDF Datatype Properties

Genetic Programming (GP)

• Genetic programming refers to an automated method to create and evolve programs to solve a problem.

• A solution is represented by a tree, whose nodes are labeled with functions (concatenate, split, sum) or with values (strings, numbers, etc).

• New individuals are generated by applying genetic operations to the current population of individuals.

• The individuals that should breed are selected by an evolutionary process.
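The tree representation described above can be sketched as nested tuples that are evaluated against one record. The node kinds `prop` and `const`, the tuple layout and the semantics given to `split` (pick the n-th whitespace token) are assumptions made for this illustration; only the function names concatenate, split and sum come from the slide.

```python
def evaluate(node, record):
    """Evaluate a candidate mapping tree against one record (a dict).

    Leaves: ("prop", name) reads a property value, ("const", value) is a literal.
    Inner nodes: ("concatenate", *children), ("split", child, index),
    ("sum", *children).
    """
    kind = node[0]
    if kind == "prop":
        return record[node[1]]
    if kind == "const":
        return node[1]
    if kind == "concatenate":
        return "".join(str(evaluate(child, record)) for child in node[1:])
    if kind == "split":
        return str(evaluate(node[1], record)).split()[node[2]]
    if kind == "sum":
        return sum(evaluate(child, record) for child in node[1:])
    raise ValueError(f"unknown node kind: {kind!r}")


# Hypothetical individual: firstname + " " + lastname
tree = ("concatenate", ("prop", "firstname"), ("const", " "), ("prop", "lastname"))
result = evaluate(tree, {"firstname": "William", "lastname": "Shakespeare"})
```

Each individual in the population would be one such tree, and the genetic operations rewrite parts of the tree rather than editing source text.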


Page 13: Complex Matching of RDF Datatype Properties

Genetic Programming (GP)

• GP Functions:

• Crossover

• The act of swapping gene values between two potential solutions, simulating the "mating" of the two solutions.

• Mutation

• The act of randomly altering the value of a gene in a potential solution.

• Reproduction

• The act of making a copy of a potential solution
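The three operations can be sketched on small expression trees. In this sketch a tree is either a property name (a string) or a tuple `("+", left, right)`, where "+" stands for concatenation. The representation, the helper names and the choice to mutate only leaves are assumptions for illustration, not the authors' implementation.

```python
import random


def paths(tree, prefix=()):
    """Yield the path (tuple of child indices) of every node in the tree."""
    yield prefix
    if isinstance(tree, tuple):
        for i in (1, 2):
            yield from paths(tree[i], prefix + (i,))


def get_at(tree, path):
    """Return the subtree found by following path from the root."""
    for i in path:
        tree = tree[i]
    return tree


def replace_at(tree, path, new):
    """Return a copy of tree with the node at path replaced by new."""
    if not path:
        return new
    children = list(tree)
    children[path[0]] = replace_at(tree[path[0]], path[1:], new)
    return tuple(children)


def leaves(tree):
    """List the property names at the leaves, left to right."""
    return [get_at(tree, p) for p in paths(tree) if isinstance(get_at(tree, p), str)]


def crossover(a, b, rng):
    """Swap one randomly chosen subtree of a with one of b."""
    pa = rng.choice(list(paths(a)))
    pb = rng.choice(list(paths(b)))
    return replace_at(a, pa, get_at(b, pb)), replace_at(b, pb, get_at(a, pa))


def mutate(tree, properties, rng):
    """Replace one random leaf with a different property name.

    Assumes properties contains at least one name other than the chosen leaf.
    """
    leaf_paths = [p for p in paths(tree) if isinstance(get_at(tree, p), str)]
    target = rng.choice(leaf_paths)
    options = [name for name in properties if name != get_at(tree, target)]
    return replace_at(tree, target, rng.choice(options))


def reproduce(tree):
    """Carry an individual over unchanged (tuples are immutable, so no deep copy)."""
    return tree


rng = random.Random(0)
props = ["Address", "Number", "Complement", "Neighborhood"]
parent_a = ("+", "Address", ("+", "Number", "Complement"))
parent_b = ("+", "Neighborhood", "Number")
child_a, child_b = crossover(parent_a, parent_b, rng)
mutant = mutate(parent_a, props, rng)
```

Because crossover only exchanges subtrees, the two children together always contain exactly the leaves of the two parents, and mutation changes a single leaf while keeping the tree shape.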


Page 14: Complex Matching of RDF Datatype Properties

Genetic Programming (GP)

• Fitness function

• Levenshtein similarity function for string values

• KL-divergence measure for numeric values

• Different measures are applied because the values of numeric data properties can share many common values (such as 0), which can lead to a wrong match. Thus, for numeric values we use KL-divergence to measure how likely it is that two sets of values come from the same distribution.
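A minimal sketch of the two fitness measures follows. The normalization of the Levenshtein distance into a similarity in [0, 1] and the additive smoothing constant `eps` in the KL-divergence are assumptions; the slides name the measures but do not give these details.

```python
import math
from collections import Counter


def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def levenshtein_similarity(a, b):
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def kl_divergence(xs, ys, eps=1e-9):
    """KL(P || Q) between the empirical distributions of two value lists.

    eps is an assumed smoothing constant that keeps the logarithm finite
    when a value occurs in only one of the two lists.
    """
    support = set(xs) | set(ys)
    cx, cy = Counter(xs), Counter(ys)
    total = 0.0
    for v in support:
        p = (cx[v] + eps) / (len(xs) + eps * len(support))
        q = (cy[v] + eps) / (len(ys) + eps * len(support))
        total += p * math.log(p / q)
    return total


string_fitness = levenshtein_similarity("William Shakespeare", "William Shakespeare")
same_distribution = kl_divergence([0, 0, 1], [0, 0, 1])
different_distribution = kl_divergence([0, 0, 1], [5, 5, 6])
```

A string-valued candidate scores higher as its output gets closer to the target values, while a numeric candidate is better when the divergence between the two value distributions is closer to zero.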


Page 15: Complex Matching of RDF Datatype Properties

An Example of Implementation


Page 16: Complex Matching of RDF Datatype Properties

An Example of Implementation

Phase 1 – Co-occurrence matrix

1. Difference between Cosine/Jaccard similarity metrics.


Page 17: Complex Matching of RDF Datatype Properties

An Example of Implementation

Phase 1 – EMI matrix

2. Possible matchings:


Page 18: Complex Matching of RDF Datatype Properties

An Example of Implementation


Page 19: Complex Matching of RDF Datatype Properties

An Example of Implementation

[Figure: GP trees that combine the properties Address, Number, Complement and Neighborhood with "+" nodes, annotated with the crossover, mutation and reproduction operations]

Page 20: Complex Matching of RDF Datatype Properties

An Example of Implementation

[Figure labels: Correct; Repetitive and incorrect mutation]

Page 21: Complex Matching of RDF Datatype Properties

Evaluation

• Datasets

• “Personal Information” dataset lists information about people

• “Real Estate” dataset lists information about houses for sale

• “Inventory” dataset describes product inventories

Except for the “Personal Information” dataset, which is withheld for privacy reasons, the datasets are available at:

http://pages.cs.wisc.edu/~anhai/wisc-si-archive/domains/


Page 22: Complex Matching of RDF Datatype Properties

Results


Page 23: Complex Matching of RDF Datatype Properties

Results


Page 24: Complex Matching of RDF Datatype Properties


Conclusions

• Complex schema matching approach

• Simple + Complex matching:

• Estimated Mutual Information + Genetic Programming

• Reduced search space for matching properties

• Adaptive to variations of 1:1 and n:1 matching instances

• High accuracy and coverage for the generated matches

Page 25: Complex Matching of RDF Datatype Properties

Questions?

Thank you!
