Complex Matching of RDF Datatype Properties
Bernardo Pereira Nunes1,2, Alexander Mera1, Marco Antonio Casanova1, Besnik Fetahu2, Luiz André P. Paes Leme3, Stefan Dietze2
1) Department of Informatics-PUC-Rio, 2) L3S Research Center, Leibniz University Hannover,
3) Computer Science Institute, Fluminense Federal University
DEXA 2013 – Prague, Czech Republic
Outline
• Introduction
• Motivation
• Related Work
• Schema Matching Principles
• Our approach:
• Phase 1) Estimated Mutual Information – EMI
• Phase 2) Genetic Programming - GP
• Evaluation
• Results
• Conclusions
Besnik Fetahu 2DEXA 2013 – Prague, Czech Republic
Introduction
• Data Integration
• Combine different data sources into a unified view of data
• Originally driven by large organizations:
• Merging company databases after acquisitions
• Currently, driven by new Web trends such as:
• Improvement of Web-based search
• Proliferation of Web applications
• e-business
• Examples: momondo.de, semantic search, price-watching sites, etc.
Introduction
• Challenges
• Heterogeneous data
• Different data formats
• Data quality (data impurities, corrupted information)
• Scalability
• Adaptability
• Cost
Introduction
• Initiatives to address data integration problems
• Linked Data Principles
• Ontology Alignment Initiatives (OAI)
• Schema Matching tools
Motivation
• Given two schemas S and T, a matching from S to T exists when an element e of S is mapped to an element e’ of T by some expression that relates the two elements.
Related Work
• Methods
• RiMOM, iMAP, S-Match, DSSim, ATOM, etc.
• Schema-based approach
• Instance-based approach
• Hybrid approach
• Cardinality
• 1:1
• 1:n
• n:m
Rahm, E. and Bernstein, P. 2001. A survey of approaches to automatic schema matching. The VLDB Journal 10, 4 (Dec. 2001), 334-350.
Cardinality
• Simple match
• 1:1 – direct matching
• Complex match
• 1:1 / n:1 (mapping functions)
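• Example (simple match): ISBN 0-671-72287-5 ↔ ISBN 0-671-72287-5
• Example (complex match): Fullname “William Shakespeare” ↔ Firstname “William” + Last name “Shakespeare”, via split(fullname) in one direction and concatenate(f,l) in the other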
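The two mapping functions named on this slide can be sketched in Python. This is only an illustration: the slides name split(fullname) and concatenate(f,l) but do not define them, so the splitting rule (split at the last space) and the function name `split_fullname` are assumptions.

```python
from __future__ import annotations


def split_fullname(fullname: str) -> tuple[str, str]:
    """Split a full name into (first name, last name).

    Assumption: the last whitespace-separated token is the last name.
    """
    first, _, last = fullname.strip().rpartition(" ")
    return first, last


def concatenate(first: str, last: str) -> str:
    """Join a first and last name into one full-name string."""
    return f"{first} {last}".strip()


# Round trip on the value shown on the slide.
first, last = split_fullname("William Shakespeare")
fullname = concatenate(first, last)
```

A 1:1 match such as ISBN needs no function at all; the complex match only holds once one of these functions is applied to the source values.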
Our approach
• Two-phase approach:
• Estimated Mutual Information
• Suggest 1:1 and 1:n mappings
• Serve as a filtering step (filter out data properties that have no mutual information)
• Reduce search space for the next phase (speed up the process)
• Genetic Programming
• Automatic process for creating mapping functions
• Reduces the cost of traversing the search space
Estimated Mutual Information (EMI)
• EMI Matrix
• Let p = (p1,…,pu) and q = (q1,…,qv) be two lists of sets (i.e., sets of datatype properties)
• Similarity measures: Cosine Similarity, Jaccard Index, …
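The slides do not state how an EMI entry is computed, so the following is a sketch under two assumptions: the co-occurrence entry m_ij is the number of values shared by p_i and q_j, and each EMI entry is the usual count-based estimate of a mutual information term, (m_ij / M) · log(M · m_ij / (row_i · col_j)), where M is the sum of all entries. The function names and the toy values are hypothetical.

```python
import math


def cooccurrence_matrix(p, q):
    """m[i][j] = number of values shared by set p[i] and set q[j] (assumed definition)."""
    return [[len(pi & qj) for qj in q] for pi in p]


def emi_matrix(m):
    """Estimate a mutual-information term for every cell of a co-occurrence matrix."""
    total = sum(sum(row) for row in m)
    row_sums = [sum(row) for row in m]
    col_sums = [sum(col) for col in zip(*m)]
    emi = []
    for i, row in enumerate(m):
        emi_row = []
        for j, m_ij in enumerate(row):
            if m_ij == 0:
                # No shared values: the pair carries no mutual information.
                emi_row.append(0.0)
            else:
                ratio = total * m_ij / (row_sums[i] * col_sums[j])
                emi_row.append((m_ij / total) * math.log(ratio))
        emi.append(emi_row)
    return emi


# Toy data: two source properties and two target properties.
p = [{"William Shakespeare", "Jane Austen"}, {"0-671-72287-5"}]
q = [{"0-671-72287-5"}, {"William Shakespeare", "Jane Austen"}]
m = cooccurrence_matrix(p, q)
emi = emi_matrix(m)
```

Cells with a zero score can be filtered out before the next phase, which is how this step reduces the search space.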
Estimated Mutual Information (EMI)
• Computing the mutual information:
• Cosine Similarity
• Simple matches: William Shakespeare → William Shakespeare
• Jaccard Similarity
• Simple and Complex matches: William → William Shakespeare
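The two measures can be compared on the values from this slide with a small sketch. It assumes values are compared as lowercase whitespace-separated tokens; the slides do not say how values are tokenized or which threshold turns a score into a co-occurrence, so both are left open here.

```python
import math
from collections import Counter


def tokens(value):
    """Assumed tokenization: lowercase, split on whitespace."""
    return value.lower().split()


def cosine_similarity(a, b):
    """Cosine of the angle between the token-count vectors of two values."""
    va, vb = Counter(tokens(a)), Counter(tokens(b))
    dot = sum(va[t] * vb[t] for t in va)
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def jaccard_similarity(a, b):
    """Size of the token intersection divided by the size of the token union."""
    sa, sb = set(tokens(a)), set(tokens(b))
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


same = jaccard_similarity("William Shakespeare", "William Shakespeare")
partial = jaccard_similarity("William", "William Shakespeare")
```

Identical values score at the maximum under both measures, while a partial value such as "William" scores strictly between 0 and 1 against "William Shakespeare", which is the signal a complex match relies on.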
Genetic Programming (GP)
• Genetic programming refers to an automated method to create and evolve programs to solve a problem.
• A solution is represented by a tree, whose nodes are labeled with functions (concatenate, split, sum) or with values (strings, numbers, etc).
• New individuals are generated by applying genetic operations to the current population of individuals.
• Individuals that should breed are selected through an evolutionary process.
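The tree representation described above can be sketched as follows. The `Node` class, the `FUNCTIONS` table, and the rule that a leaf label is looked up as a property name before being treated as a constant are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """A tree node: a function name (inner node) or a property/constant (leaf)."""
    label: str
    children: List["Node"] = field(default_factory=list)


# Hypothetical function set; the slides mention concatenate, split and sum.
FUNCTIONS = {
    "concatenate": lambda *parts: " ".join(parts),
    "sum": lambda *nums: sum(nums),
}


def evaluate(node, record):
    """Evaluate a candidate mapping function against one source record."""
    if node.children:
        args = [evaluate(child, record) for child in node.children]
        return FUNCTIONS[node.label](*args)
    # Leaf: use the record's property value if present, else the label itself.
    return record.get(node.label, node.label)


tree = Node("concatenate", [Node("Firstname"), Node("Last name")])
value = evaluate(tree, {"Firstname": "William", "Last name": "Shakespeare"})
```

Each individual in the population is one such tree, and evaluating it over the source instances produces the values that are compared with the target property.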
Genetic Programming (GP)
• GP Functions:
• Crossover
• The act of swapping gene values between two potential solutions, simulating the "mating" of the two solutions.
• Mutation
• The act of randomly altering the value of a gene in a potential solution.
• Reproduction
• The act of making a copy of a potential solution
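The three operations can be sketched on trees written as nested lists, where `["+", left, right]` is an inner node and a plain string is a leaf. This is a simplified illustration, not the operators used in the work itself: here mutation replaces a randomly chosen subtree with a random leaf, crossover swaps one random subtree between two parents, and all helper names are hypothetical.

```python
import copy
import random


def subtree_paths(tree, path=()):
    """Yield the index path of every node in the tree (root is the empty path)."""
    yield path
    if isinstance(tree, list):
        for i, child in enumerate(tree[1:], start=1):
            yield from subtree_paths(child, path + (i,))


def get_subtree(tree, path):
    for index in path:
        tree = tree[index]
    return tree


def replace_subtree(tree, path, new):
    """Return a copy of tree with the node at path replaced by a copy of new."""
    if not path:
        return copy.deepcopy(new)
    result = copy.deepcopy(tree)
    parent = get_subtree(result, path[:-1])
    parent[path[-1]] = copy.deepcopy(new)
    return result


def reproduction(tree):
    """Copy an individual unchanged into the next generation."""
    return copy.deepcopy(tree)


def mutation(tree, leaves, rng):
    """Replace one randomly chosen node with a randomly chosen leaf."""
    path = rng.choice(list(subtree_paths(tree)))
    return replace_subtree(tree, path, rng.choice(leaves))


def crossover(a, b, rng):
    """Swap one randomly chosen subtree between two parents."""
    pa = rng.choice(list(subtree_paths(a)))
    pb = rng.choice(list(subtree_paths(b)))
    sub_a, sub_b = get_subtree(a, pa), get_subtree(b, pb)
    return replace_subtree(a, pa, sub_b), replace_subtree(b, pb, sub_a)


rng = random.Random(0)
parent_a = ["+", "Address", ["+", "Number", "Complement"]]
parent_b = ["+", "Neighborhood", "Number"]
child_a, child_b = crossover(parent_a, parent_b, rng)
mutant = mutation(parent_a, ["Address", "Number", "Complement"], rng)
```

All three functions return new trees and leave the parents untouched, so the current population stays intact while the next generation is built.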
Genetic Programming (GP)
• Fitness function
• Levenshtein similarity function for string values
• KL-divergence measure for numeric values
• Different measures are applied because data property values can share many common values (such as 0), which can lead to a wrong match. Thus, we use KL-divergence to measure the probability of two sets being the same.
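Both fitness measures can be sketched in plain Python. The normalization of the Levenshtein distance into a similarity, the treatment of numeric values as discrete distributions, and the smoothing constant `eps` are assumptions; the slides only name the two measures.

```python
import math
from collections import Counter


def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def levenshtein_similarity(a, b):
    """Assumed normalization: 1 minus distance over the longer length."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def kl_divergence(xs, ys, eps=1e-9):
    """KL divergence between the value distributions of two numeric lists.

    Values are treated as discrete outcomes; eps smoothing avoids log(0).
    """
    cx, cy = Counter(xs), Counter(ys)
    support = set(cx) | set(cy)
    denom_x = len(xs) + eps * len(support)
    denom_y = len(ys) + eps * len(support)
    total = 0.0
    for v in support:
        px = (cx[v] + eps) / denom_x
        py = (cy[v] + eps) / denom_y
        total += px * math.log(px / py)
    return total


string_fitness = levenshtein_similarity("William Shakespeare", "William Shakespear")
numeric_fitness = kl_divergence([0, 0, 1], [0, 5, 5])
```

Comparing whole distributions means that two numeric properties sharing a frequent value such as 0 are not matched on that value alone; a divergence near zero requires the full distributions to agree.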
An Example of Implementation
Phase 1 – Co-occurrence matrix
1. Difference between Cosine/Jaccard similarity metrics.
An Example of Implementation
Phase 1 – EMI matrix
2. Possible matchings:
An Example of Implementation
[Figure: expression trees over the properties Address, Number, Complement and Neighborhood, joined by the + operator, illustrating crossover, mutation and reproduction]
An Example of Implementation
[Figure: generated mapping functions, annotated “Correct” and “Repetitive and Incorrect mutation”]
Evaluation
• Datasets
• “Personal Information” dataset lists information about people
• “Real Estate” dataset lists information about houses for sale
• “Inventory” dataset describes product inventories
With the exception of the “Personal Information” dataset, which is withheld for privacy reasons, the datasets are available at:
http://pages.cs.wisc.edu/~anhai/wisc-si-archive/domains/
Results
Conclusions
• Complex schema matching approach
• Simple + Complex matching:
• Estimated Mutual Information + Genetic Programming
• Reduced search space for matching properties
• Adaptive to variations of 1:1 and n:1 matching instances
• High accuracy on generated matches and coverage
Questions?
Thank you!