Paper about complex matching of RDF datatype properties. DEXA Conference, 2013, Prague, Czech Republic.
Complex Matching of RDF Datatype Properties
Bernardo Pereira Nunes (1,2), Alexander Mera (1), Marco Antonio Casanova (1), Besnik Fetahu (2), Luiz André P. Paes Leme (3), Stefan Dietze (2)
1) Department of Informatics, PUC-Rio, 2) L3S Research Center, Leibniz University Hannover,
3) Computer Science Institute, Fluminense Federal University
DEXA 2013 – Prague, Czech Republic
Outline
• Introduction
• Motivation
• Related Work
• Schema Matching Principles
• Our approach:
• Phase 1) Estimated Mutual Information – EMI
• Phase 2) Genetic Programming – GP
• Evaluation
• Results
• Conclusions
Besnik Fetahu, DEXA 2013 – Prague, Czech Republic
Introduction
• Data Integration
• Combine different data sources into a unified view of data
• Originally driven by large organizations:
• Merging company databases after acquisitions
• Currently, driven by new Web trends such as:
• Improvement of Web-based search
• Proliferation of Web applications
• e-business
• Examples: momondo.de, semantic search, price-watcher sites, etc.
Introduction
• Challenges
• Heterogeneous data
• Different data formats
• Data quality (data impurities, corrupted information)
• Scalability
• Adaptability
• Cost
Introduction
• Initiatives to address data integration problems
• Linked Data Principles
• Ontology Alignment Initiatives (OAI)
• Schema Matching tools
Motivation
• Given two schemas S and T, a matching from S to T maps an element e of S to an element e’ of T through some expression that relates the two elements.
Related Work
• Methods
• RiMOM, iMAP, S-Match, DSSim, ATOM, etc.
• Schema-based approach
• Instance-based approach
• Hybrid approach
• Cardinality
• 1:1
• 1:n
• n:m
Rahm, E. and Bernstein, P. 2001. A survey of approaches to automatic schema matching. The VLDB Journal 10, 4 (Dec. 2001), 334-350.
Cardinality
• Simple match
• 1:1 – direct matching
• Complex match
• 1:1 / n:1 (mapping functions)
• Simple match example: ISBN (0-671-72287-5) ↔ ISBN (0-671-72287-5)
• Complex match example: Fullname (William Shakespeare) ↔ Firstname (William), Last name (Shakespeare), via split(fullname) and concatenate(f,l)
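The split and concatenate mapping functions named in this example can be sketched in Python. This is only an illustration: the function names `split_fullname` and `concatenate_name`, and the rule of splitting at the last space, are assumptions and not the authors' implementation.

```python
from typing import Tuple


def split_fullname(fullname: str) -> Tuple[str, str]:
    """Split a full name into (first name, last name) at the last space.

    Assumption: the last token is the last name. A value without a space
    is returned as a last name with an empty first name.
    """
    first, _, last = fullname.strip().rpartition(" ")
    return first, last


def concatenate_name(first: str, last: str) -> str:
    """Inverse mapping: join the non-empty parts with a single space."""
    return " ".join(part for part in (first, last) if part)


if __name__ == "__main__":
    first, last = split_fullname("William Shakespeare")
    print(first, "|", last)
    print(concatenate_name(first, last))
```

The first function covers the 1:n direction (one source property feeding several target properties) and the second covers the n:1 direction.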
Our approach
• Two-phase approach:
• Estimated Mutual Information
• Suggests 1:1 and 1:n mappings
• Serves as a filtering step (filters out data properties that have no mutual information)
• Reduces the search space for the next phase (speeds up the process)
• Genetic Programming
• Automatic process for creating mapping functions
• Reduces the cost of traversing the search space
Estimated Mutual Information (EMI)
• EMI Matrix
• p = (p1, …, pu) and q = (q1, …, qv) are two lists of sets (i.e. sets of datatype properties)
[Figure: EMI matrix; labels: Cosine Similarity, Jaccard Index, …]
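The slides do not show the EMI formula itself. The sketch below assumes a common definition: a co-occurrence matrix m with m[i][j] = |p_i ∩ q_j|, total M = sum of all m[i][j], and EMI[i][j] = (m[i][j] / M) · log(M · m[i][j] / (row_i · col_j)), where row_i and col_j are the row and column sums. The function names and the sample values are hypothetical.

```python
import math
from typing import List, Set


def cooccurrence(p: List[Set[str]], q: List[Set[str]]) -> List[List[int]]:
    """m[i][j] = number of values shared by p[i] and q[j] (assumed definition)."""
    return [[len(pi & qj) for qj in q] for pi in p]


def emi_matrix(m: List[List[int]]) -> List[List[float]]:
    """Estimated mutual information per cell, under the assumed formula."""
    total = sum(sum(row) for row in m)
    row_sums = [sum(row) for row in m]
    col_sums = [sum(col) for col in zip(*m)]
    emi = []
    for i, row in enumerate(m):
        emi_row = []
        for j, mij in enumerate(row):
            if mij == 0:
                # No shared values: the cell contributes no information.
                emi_row.append(0.0)
            else:
                ratio = total * mij / (row_sums[i] * col_sums[j])
                emi_row.append((mij / total) * math.log(ratio))
        emi.append(emi_row)
    return emi


if __name__ == "__main__":
    # Hypothetical property value sets from two schemas.
    p = [{"William Shakespeare", "Jane Austen"}, {"0-671-72287-5", "isbn-2"}]
    q = [{"0-671-72287-5", "isbn-2"}, {"William Shakespeare", "Jane Austen"}]
    for row in emi_matrix(cooccurrence(p, q)):
        print(row)
```

Under this reading, cells with no co-occurring values score zero and can be dropped before the genetic programming phase, which is the filtering role the slides assign to EMI.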
Estimated Mutual Information (EMI)
• Computing the mutual information:
• Cosine Similarity
• Simple matches: William Shakespeare → William Shakespeare
• Jaccard Similarity
• Simple and Complex matches: William → William Shakespeare
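The two measures can be sketched on whitespace tokens as follows. The tokenisation and the function names are assumptions; the slides do not give the thresholds or preprocessing that make cosine similarity suit simple matches and Jaccard similarity suit complex ones.

```python
import math
from collections import Counter


def cosine_similarity(a: str, b: str) -> float:
    """Cosine of the angle between token-count vectors of two values."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(count * vb[token] for token, count in va.items())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def jaccard_similarity(a: str, b: str) -> float:
    """Size of the token intersection divided by the size of the token union."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


if __name__ == "__main__":
    print(cosine_similarity("William Shakespeare", "William Shakespeare"))
    print(jaccard_similarity("William", "William Shakespeare"))
```

With this tokenisation, a first name shares one of two tokens with the full name, so the Jaccard score is non-zero even though the values are not equal, which is what lets a partial overlap hint at a complex match.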
Genetic Programming (GP)
• Genetic programming refers to an automated method to create and evolve programs to solve a problem.
• A solution is represented by a tree, whose nodes are labeled with functions (concatenate, split, sum) or with values (strings, numbers, etc).
• New individuals are generated by applying genetic operations to the current population of individuals.
• Individuals that should breed are selected by an evolutionary process.
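A candidate solution of this kind can be sketched as a small expression tree. The `Node` class, the `FUNCTIONS` table and the convention that a leaf is either a property name looked up in a record or a literal are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Node:
    """Tree node: a function name (inner node) or a property/literal (leaf)."""
    label: str
    children: List["Node"] = field(default_factory=list)


# Hypothetical function set; the slides mention concatenate, split and sum.
FUNCTIONS = {
    "concatenate": lambda *args: " ".join(str(a) for a in args),
    "sum": lambda *args: sum(args),
}


def evaluate(node: Node, record: Dict[str, Any]) -> Any:
    """Evaluate a tree against one record of source property values."""
    if node.children:
        args = [evaluate(child, record) for child in node.children]
        return FUNCTIONS[node.label](*args)
    # Leaf: use the property value if present, otherwise treat as a literal.
    return record.get(node.label, node.label)


if __name__ == "__main__":
    tree = Node("concatenate", [Node("Firstname"), Node("Last name")])
    record = {"Firstname": "William", "Last name": "Shakespeare"}
    print(evaluate(tree, record))
```

Evaluating the tree on each source record produces candidate target values, which a fitness function can then compare with the actual values of the target property.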
Genetic Programming (GP)
• GP Functions:
• Crossover
• The act of swapping gene values between two potential solutions, simulating the "mating" of the two solutions.
• Mutation
• The act of randomly altering the value of a gene in a potential solution.
• Reproduction
• The act of making a copy of a potential solution
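The three operations can be sketched on trees stored as nested lists, where index 0 holds the function label and the remaining items are children. The representation, the `TERMINALS` list and the choice to mutate only leaves are assumptions; the root is expected to be a function node with at least one child.

```python
import copy
import random

# Hypothetical terminal set (property names a leaf may take).
TERMINALS = ["Address", "Number", "Complement", "Neighborhood"]


def subtree_paths(tree, path=()):
    """Yield the index path of every node below the root."""
    if isinstance(tree, list):
        for i in range(1, len(tree)):  # index 0 is the function label
            yield path + (i,)
            yield from subtree_paths(tree[i], path + (i,))


def get_at(tree, path):
    for i in path:
        tree = tree[i]
    return tree


def set_at(tree, path, value):
    for i in path[:-1]:
        tree = tree[i]
    tree[path[-1]] = value


def reproduction(tree):
    """Copy a solution unchanged into the next generation."""
    return copy.deepcopy(tree)


def mutation(tree, rng):
    """Replace one randomly chosen leaf with a random terminal.

    The new terminal may equal the old one, leaving the tree unchanged.
    """
    child = copy.deepcopy(tree)
    leaves = [p for p in subtree_paths(child)
              if not isinstance(get_at(child, p), list)]
    set_at(child, rng.choice(leaves), rng.choice(TERMINALS))
    return child


def crossover(a, b, rng):
    """Swap one randomly chosen subtree between copies of two parents."""
    a, b = copy.deepcopy(a), copy.deepcopy(b)
    pa = rng.choice(list(subtree_paths(a)))
    pb = rng.choice(list(subtree_paths(b)))
    sub_a, sub_b = get_at(a, pa), get_at(b, pb)
    set_at(a, pa, sub_b)
    set_at(b, pb, sub_a)
    return a, b


if __name__ == "__main__":
    rng = random.Random(42)
    parent1 = ["+", "Number", ["+", "Complement", "Address"]]
    parent2 = ["+", "Neighborhood", "Number"]
    print(crossover(parent1, parent2, rng))
    print(mutation(parent1, rng))
```

Passing an explicit `random.Random` instance keeps runs reproducible, which helps when comparing different operator settings.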
Genetic Programming (GP)
• Fitness function
• Levenshtein similarity function for string values
• KL-divergence measure for numeric values
• Different measures are applied because data property values can share many common values (such as 0), which can lead to a wrong match. Thus, we use KL-divergence to estimate how likely it is that two sets of numeric values are the same.
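Both measures can be sketched with the standard library. How the authors combine them into a fitness score is not shown in the slides; the normalisation of the Levenshtein distance into a similarity and the smoothing constant `eps` are assumptions.

```python
import math
from collections import Counter
from typing import Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) by dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Assumed normalisation: 1 minus distance over the longer length."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def kl_divergence(xs: Sequence[float], ys: Sequence[float], eps: float = 1e-9) -> float:
    """KL(P || Q) between the smoothed empirical distributions of xs and ys."""
    support = set(xs) | set(ys)
    cx, cy = Counter(xs), Counter(ys)
    total_x = len(xs) + eps * len(support)
    total_y = len(ys) + eps * len(support)
    kl = 0.0
    for value in support:
        p = (cx[value] + eps) / total_x
        q = (cy[value] + eps) / total_y
        kl += p * math.log(p / q)
    return kl


if __name__ == "__main__":
    print(levenshtein_similarity("William Shakespeare", "William Shakespear"))
    # Both samples contain 0, but their overall distributions differ.
    print(kl_divergence([0, 0, 1, 2], [0, 5, 5, 5]))
```

A divergence near zero indicates that the generated and the target values follow almost the same distribution, whereas a single shared value such as 0 is not enough to bring the divergence down.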
An Example of Implementation
An Example of Implementation
Phase 1 – Co-occurrence matrix
1. Difference between Cosine/Jaccard similarity metrics.
An Example of Implementation
Phase 1 – EMI matrix
2. Possible matchings:
An Example of Implementation
[Figure: crossover, mutation and reproduction applied to candidate mapping trees that combine the properties Address, Number, Complement and Neighborhood with the "+" operator]
An Example of Implementation
[Figure: example outcomes, labelled "Correct" and "Repetitive and incorrect mutation"]
Evaluation
• Datasets
• “Personal Information” dataset lists information about people
• “Real Estate” dataset lists information about houses for sale
• “Inventory” dataset describes product inventories
Except for the “Personal Information” dataset, which is withheld for privacy reasons, the datasets are available at:
http://pages.cs.wisc.edu/~anhai/wisc-si-archive/domains/
Results
Conclusions
• Complex schema matching approach
• Simple + Complex matching:
• Estimated Mutual Information + Genetic Programming
• Reduced search space for matching properties
• Adaptive to variations of 1:1 and n:1 matching instances
• High accuracy and coverage of the generated matches
Questions?
Thank you!