Page 1
PathSim: Meta Path-Based Top-K Similarity Search in Heterogeneous Information Networks
Yizhou Sun, Jiawei Han, Xifeng Yan, Philip S. Yu, Tianyi Wu
VLDB ’11
Summarized and presented by Kim Chungrim
Page 2
Contents
• Introduction
• Motivation & Terminology
  – Heterogeneous Information Network (HIN)
  – Network Schema
  – Meta Path
• Meta Path-based Similarity Search Framework
• PathSim: A Novel Meta Path-Based Similarity Measure
• Online Query Processing for Top-K Similarity Search
  – PathSim-baseline
  – PathSim-pruning
• Experiments
• Conclusions
Page 3
Introduction
• Logical networks involving multi-typed objects and multi-typed links denoting different relations are arising
  – Bibliographic networks
  – Social media networks
  – Knowledge networks encoded in Wikipedia
• It is important to study similarity search in such networks, as similarity search is a primitive operation in database and Web search engines.
• Similarity search has so far been studied only for traditional relational databases or homogeneous information networks
  – Personalized PageRank (P-PageRank)
  – SimRank
  – Random Walk (RW)
  – Pairwise Random Walk (PRW)
• There are few studies of similarity search on Heterogeneous Information Networks (HIN)
Page 4
Motivation
• When conventional similarity measures for homogeneous information networks are applied to an HIN, the subtle semantic meaning that each type of link carries is ignored
• Limitations of current similarity/proximity measures defined in networks
  – They do NOT distinguish different types of objects and different types of links in the network
    • E.g., personalized PageRank (P-PageRank), SimRank
  – Yet different types of objects and links have different semantic meanings
• To distinguish the semantics of the paths connecting two objects, a meta path-based similarity framework can be considered
Page 5
Terminology
• HIN:
  – Networks containing multi-typed objects, interconnected via multi-typed relationships
  – $G = (V, E)$
  – Examples
    • DBLP network: papers, authors, venues, terms
    • Flickr network: pictures, tags, users, groups
  – Sources
    • From online web services: online shopping websites, social media websites, bibliographic websites, …
    • From database systems: medical databases, university databases, police department databases, …
• Network Schema:
  – Information about the entity types and their binary relations
  – $T_G = (A, R)$
  – Similar to the E-R Model
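[Figure: DBLP network schema (object types Paper, Venue, Term, Author), with an example instance containing authors Jim and Ann, papers P1 and P2, venue VLDB, and terms Network and Data]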
Page 6
Terminology (cont.)
• Meta Path
  – Two objects can be connected via different connectivity paths
  – E.g., two authors can be connected by
    • “author-paper-author” (APA)
    • “author-paper-author-paper-author” (APAPA)
    • “author-paper-venue-paper-author” (APCPA)
• Each connectivity path represents a different semantic meaning and implies different similarity semantics
• A meta path is a meta-level description of the topological connectivity between objects
  – Given a network schema $T_G = (A, R)$, a meta path can be defined as $A_1 \xrightarrow{R_1} A_2 \xrightarrow{R_2} \cdots \xrightarrow{R_l} A_{l+1}$
  – Can be considered as a new relation defined on types $A_1$ and $A_{l+1}$
[Figure: example network instance with authors Jim and Ann, papers P1 and P2, venue VLDB, and terms Network and Data]
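Following a meta path on a concrete network amounts to chaining its relations and counting the path instances that reach each object. The sketch below illustrates this on a toy network; the edge lists and the helper names `invert` and `count_paths` are assumptions made for illustration and are not taken from the paper or the figure.

```python
from collections import defaultdict

# Hypothetical toy bibliographic network; the edges are illustrative only.
writes = {"Jim": ["P1", "P2"], "Ann": ["P2"]}       # author -> paper
published_in = {"P1": ["VLDB"], "P2": ["VLDB"]}     # paper -> venue


def invert(relation):
    """Return the inverse relation, e.g. paper -> author from author -> paper."""
    inverse = defaultdict(list)
    for src, targets in relation.items():
        for dst in targets:
            inverse[dst].append(src)
    return dict(inverse)


def count_paths(start, relations):
    """Count path instances from `start` that follow `relations` in order."""
    counts = {start: 1}
    for relation in relations:
        next_counts = defaultdict(int)
        for node, n in counts.items():
            for neighbor in relation.get(node, []):
                next_counts[neighbor] += n
        counts = dict(next_counts)
    return counts


written_by = invert(writes)
publishes = invert(published_in)

# APA: author -> paper -> author
apa = count_paths("Jim", [writes, written_by])
# APCPA: author -> paper -> venue -> paper -> author
apcpa = count_paths("Jim", [writes, published_in, publishes, written_by])
```

On this toy data the two meta paths give different counts for the same pair of authors, which is the sense in which each meta path carries its own similarity semantics.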
Page 7
Meta Path-based Similarity Search Framework
• Similarity definition
  – Meta Path × Similarity Measure
• Conventional similarity measures
  – Path Count: the number of path instances p between x and y following P
    • $s(x,y) = |\{p : p \in P\}|$
  – Random Walk: the probability of the random walk that starts from x and ends at y following meta path P, which is the sum of the probabilities $Prob(p)$ of all the path instances p
    • $s(x,y) = \sum_{p \in P} Prob(p)$
  – Pairwise Random Walk: for a meta path P that can be decomposed into two shorter meta paths of the same length, $P = (P_1 P_2)$, the pairwise random walk probability is the probability of starting from x and from y and reaching the same middle object z
    • $s(x,y) = \sum_{(p_1 p_2) \in (P_1 P_2)} Prob(p_1)\, Prob(p_2^{-1})$
[Figure: a path instance p from x to y; for pairwise random walk, path instances $p_1$ from x to z and $p_2$ from z to y, meeting at the middle object z]
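The three conventional measures can be compared side by side on a small author-venue example. This is a minimal sketch under stated assumptions: the `counts` table is hypothetical, parallel paper links between an author and a venue are collapsed into one weighted edge, and transition probabilities are taken proportional to those weights.

```python
# Hypothetical author -> venue paper counts; the venue assignment is an assumption.
counts = {
    "Mike": {"SIGMOD": 2, "VLDB": 1},
    "Jim": {"SIGMOD": 50, "VLDB": 20},
}


def out_prob(author, venue):
    """Probability of stepping from an author to a venue (row-normalized)."""
    return counts[author].get(venue, 0) / sum(counts[author].values())


def in_prob(venue, author):
    """Probability of stepping from a venue back to an author (column-normalized)."""
    total = sum(row.get(venue, 0) for row in counts.values())
    return counts[author].get(venue, 0) / total


def path_count(x, y):
    """Number of author-venue-author path instances between x and y."""
    return sum(w * counts[y].get(v, 0) for v, w in counts[x].items())


def random_walk(x, y):
    """Probability that a walk from x over a venue ends at y."""
    return sum(out_prob(x, v) * in_prob(v, y) for v in counts[x])


def pairwise_random_walk(x, y):
    """Probability that walks from x and from y meet at the same venue."""
    return sum(out_prob(x, v) * out_prob(y, v) for v in counts[x])
```

Note that `random_walk` is not symmetric in this sketch: starting from Mike gives a different value than starting from Jim, while `path_count` and `pairwise_random_walk` are symmetric.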
Page 8
PathSim: A Novel Meta Path-Based Similarity Measure
• Similarity in terms of ‘peers’
  – Two similar peer objects should not only be strongly connected, but also share comparable visibility.
• Path count and Random walk (RW)
  – Favor highly visible objects (objects with large degrees)
• Pairwise random walk (PRW)
  – Favors pure objects (objects with highly skewed scatterness in their in-links or out-links)
• PathSim
  – Favors “peers” (objects with similar visibility and strong connectivity under the given meta path)
Page 9
PathSim: A Novel Meta Path-Based Similarity Measure (cont)
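• Restricted to round-trip meta paths
  – A round-trip meta path is a path of the form $P = (P_l P_l^{-1})$
  – Guarantees a symmetric relation
  – $s(x,y) = \dfrac{2 \times |\{p_{x \leadsto y} : p_{x \leadsto y} \in P\}|}{|\{p_{x \leadsto x} : p_{x \leadsto x} \in P\}| + |\{p_{y \leadsto y} : p_{y \leadsto y} \in P\}|}$
[Figure: authors Mike and Jim linked to venues SIGMOD and VLDB; link weights 2 and 1 for Mike, 50 and 20 for Jim]
• Example: $s(\text{Mike}, \text{Jim}) = \dfrac{2 \times (2 \times 50 + 1 \times 20)}{(2 \times 2 + 1 \times 1) + (50 \times 50 + 20 \times 20)} = 0.0826$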
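The Mike/Jim computation can be reproduced in a few lines. This is a minimal sketch, assuming each author is described by a dictionary of per-venue paper counts; the function name `pathsim` and the assignment of the weights to specific venues are illustrative assumptions.

```python
def pathsim(wx, wy):
    """PathSim for a round-trip path, given per-venue counts of x and y."""
    m_xy = sum(w * wy.get(v, 0) for v, w in wx.items())  # paths x -> y
    m_xx = sum(w * w for w in wx.values())                # paths x -> x (visibility)
    m_yy = sum(w * w for w in wy.values())                # paths y -> y (visibility)
    if m_xx + m_yy == 0:
        return 0.0  # neither object has any path instance
    return 2 * m_xy / (m_xx + m_yy)


# Weights from the slide; which venue carries which weight is an assumption.
mike = {"SIGMOD": 2, "VLDB": 1}
jim = {"SIGMOD": 50, "VLDB": 20}

score = pathsim(mike, jim)  # 240 / 2905
```

Although Mike and Jim are strongly connected (120 path instances), the large gap in visibility (5 versus 2900) keeps the score low, which is the "peer" behavior the measure is designed for.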
Page 10
PathSim: A Novel Meta Path-Based Similarity Measure (cont)
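• Properties of PathSim
  – Symmetric: $s(x,y) = s(y,x)$
  – Self-maximum: $s(x,y) \in [0,1]$, $s(x,x) = 1$
  – Balance of visibility: $s(x,y) \le \dfrac{2}{\sqrt{M_{ii}/M_{jj}} + \sqrt{M_{jj}/M_{ii}}}$, where $M$ is the commuting matrix of the meta path and $M_{ii}$, $M_{jj}$ are the visibilities of x and y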
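The balance-of-visibility bound follows from the Cauchy-Schwarz inequality. The short derivation below is a sketch, assuming the commuting matrix of the round-trip path factors as $M = M_{P_l} M_{P_l}^{\top}$ and writing $r_i$ for the row of $M_{P_l}$ belonging to object $x_i$ (the symbol $r_i$ is introduced here for illustration).

```latex
% Assumption: M = M_{P_l} M_{P_l}^{\top}, with r_i the i-th row of M_{P_l}
M_{ij} = r_i \cdot r_j \;\le\; \lVert r_i \rVert \, \lVert r_j \rVert = \sqrt{M_{ii} M_{jj}}
\quad\Longrightarrow\quad
s(x_i, x_j) = \frac{2 M_{ij}}{M_{ii} + M_{jj}}
\;\le\; \frac{2\sqrt{M_{ii} M_{jj}}}{M_{ii} + M_{jj}}
= \frac{2}{\sqrt{M_{ii}/M_{jj}} + \sqrt{M_{jj}/M_{ii}}}
```

For visibilities 5 and 2900 (the sums of squared link weights 2, 1 and 50, 20), the bound evaluates to about 0.0829, so a score of 0.0826 for such a pair is close to the maximum its visibility gap allows.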
Page 11
PathSim: A Novel Meta Path-Based Similarity Measure (cont)
• Comparison with other measures.
Page 12
Online Query Processing for Top-K Similarity Search
• The top-k similarity search problem under PathSim
  – Given an HIN G and its network schema $T_G$, and a round-trip meta path $P = (P_l P_l^{-1})$, top-k similarity search is defined as:
  – For a given query object $x \in A_1$, find the sorted k objects y of the same type $A_1$ that are most similar to x under the PathSim similarity definition
• Major issues for online computation
  – Very large matrix multiplication: need to compute the commuting matrix
  – Calculating the commuting matrix online is too time-consuming
  – Full materialization of the commuting matrix is also expensive in time and space
• Solution
  – Partially materialize commuting matrices for short meta paths, and concatenate them online to get longer ones for a given query
  – Materialize the commuting matrix Mp for the half meta path $P_l$
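Concatenating a partial commuting matrix online can be sketched as computing only the query's row of the full commuting matrix from the stored half. The sparse dict-of-dicts layout, the helper names, and the toy data (including the author Bob and the venue assignments) are assumptions for illustration.

```python
from collections import defaultdict


def build_column_index(half):
    """Offline: index the stored half matrix by column (e.g. venue -> authors)."""
    index = defaultdict(dict)
    for row, cols in half.items():
        for col, weight in cols.items():
            index[col][row] = weight
    return dict(index)


def commuting_row(x, half, column_index):
    """Online: row x of (half x half^T), touching only objects reachable from x."""
    row = defaultdict(int)
    for col, weight in half.get(x, {}).items():
        for y, weight_y in column_index.get(col, {}).items():
            row[y] += weight * weight_y
    return dict(row)


# Hypothetical partial commuting matrix: author -> venue path counts.
half = {
    "Mike": {"SIGMOD": 2, "VLDB": 1},
    "Jim": {"SIGMOD": 50, "VLDB": 20},
    "Bob": {"KDD": 3},
}
index = build_column_index(half)
mike_row = commuting_row("Mike", half, index)
```

Only objects that share at least one column with the query appear in the result, so Bob is never touched when the query is Mike.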
Page 13
Online Query Processing for Top-K Similarity Search – PathSim-baseline
• Find the candidates by traversing the network following meta path P from the query object x
• For each candidate y, calculate s(x,y) using the partial commuting matrix Mp
  – Calculate the path count $M(x,y)$ and scale it by the sum of the visibilities
  – The visibilities $M(x,x)$ and $M(y,y)$ can be pre-computed and stored using Mp
• Sort the candidates y by s(x,y) and return the top-k objects
• Still very time-consuming if the candidate set is very large!
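The baseline steps above can be sketched as follows. This is a simplified illustration, not the authors' implementation: it scans every stored object instead of traversing the network to find candidates, and the function name, data layout, and toy data are assumptions.

```python
import heapq


def pathsim_baseline_topk(x, half, k):
    """Score every connected object against x and return the k best."""
    # Visibilities (diagonal of the commuting matrix); these can be precomputed.
    diag = {obj: sum(w * w for w in cols.values()) for obj, cols in half.items()}
    scores = {}
    for y, cols in half.items():
        m_xy = sum(w * cols.get(c, 0) for c, w in half[x].items())
        if m_xy > 0:  # y is a candidate only if some path instance connects it to x
            scores[y] = 2 * m_xy / (diag[x] + diag[y])
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])


# Hypothetical partial commuting matrix: author -> venue path counts.
half = {
    "Mike": {"SIGMOD": 2, "VLDB": 1},
    "Jim": {"SIGMOD": 50, "VLDB": 20},
    "Ann": {"SIGMOD": 3, "VLDB": 1},
    "Bob": {"KDD": 4},
}
top2 = pathsim_baseline_topk("Mike", half, 2)
```

The query object itself is returned first with score 1, consistent with the self-maximum property. The exact dot product is computed for every object, which is the cost that becomes prohibitive when the candidate set is large.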
Page 14
Online Query Processing for Top-K Similarity Search – PathSim-pruning
• The PathSim-pruning algorithm prunes the candidates that are not promising
• Offline: generate co-clusters according to the partial commuting matrix and store statistics for each block, used to derive an upper bound on similarity
• Online: for each query
  – Calculate the upper bound of the similarity between the query object and the candidate cluster; prune the whole cluster if it is not promising
  – Calculate the upper bound of the similarity between the query and each candidate in the cluster; prune the candidate if it is not promising
  – Calculate the exact similarity between the query and the candidate, and update the top-k list
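The bound-then-prune idea can be illustrated with a much simpler bound than the co-cluster statistics used by PathSim-pruning. The sketch below swaps in the balance-of-visibility bound, $2\sqrt{M_{xx} M_{yy}} / (M_{xx} + M_{yy})$, as a per-candidate upper bound; it has no cluster-level step, and all names and data are assumptions for illustration. It assumes `k >= 1` and that the query has at least one link.

```python
import heapq
from math import sqrt


def pathsim_topk_pruned(x, half, k):
    """Top-k PathSim search that skips candidates whose upper bound is too low."""
    diag = {obj: sum(w * w for w in cols.values()) for obj, cols in half.items()}

    def bound(y):
        # Upper bound on s(x, y) that needs only the two visibilities.
        return 2 * sqrt(diag[x] * diag[y]) / (diag[x] + diag[y])

    # Visit the most promising candidates first so the loop can stop early.
    candidates = sorted((o for o in half if diag[o] > 0), key=bound, reverse=True)
    heap = []  # min-heap of (score, object) holding the current top-k
    exact_evals = 0
    for y in candidates:
        if len(heap) == k and bound(y) <= heap[0][0]:
            break  # no remaining candidate can beat the current k-th score
        m_xy = sum(w * half[y].get(c, 0) for c, w in half[x].items())
        exact_evals += 1
        if m_xy == 0:
            continue  # not connected to the query under this meta path
        score = 2 * m_xy / (diag[x] + diag[y])
        if len(heap) < k:
            heapq.heappush(heap, (score, y))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, y))
    return sorted(heap, reverse=True), exact_evals


# Hypothetical partial commuting matrix: author -> venue path counts.
half = {
    "Mike": {"SIGMOD": 2, "VLDB": 1},
    "Jim": {"SIGMOD": 50, "VLDB": 20},
    "Ann": {"SIGMOD": 3, "VLDB": 1},
    "Bob": {"KDD": 4},
}
top2, evals = pathsim_topk_pruned("Mike", half, 2)
```

Because the bound never underestimates the true score, the pruned search returns the same top-k as an exhaustive scan (up to ties at the k-th score) while computing fewer exact similarities.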
Page 15
Online Query Processing for Top-K Similarity Search – PathSim-pruning
Page 16
Experiments
• The DBLP network
  – Snapshot as of Nov. 2009
  – Contains over 710K authors, 1.2M papers, 5K venues (conferences/journals), and around 70K terms appearing more than once (stopwords have been removed)
  – Called the full DBLP dataset
• Created two subsets
  – DBIS dataset: contains all 464 venues and the top-5000 authors from the database and information system area
  – 4-area dataset: contains 20 venues and the top-5000 authors from 4 areas: database, data mining, machine learning and information retrieval
    • Cluster labels are given for all 20 venues and a subset of 1713 authors
Page 17
Experiment - Effectiveness
• Labeled the top-15 results for 15 queries of venue type in the DBIS dataset
  – Labeled each result object with a relevance score
    • 0: non-relevant
    • 1: some-relevant
    • 2: very-relevant
• Used nDCG to evaluate the quality of a ranking algorithm
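For reference, nDCG over the 0/1/2 relevance labels can be computed as below. The slides do not state which nDCG variant was used; this sketch assumes linear gain with a log2 position discount, and normalizes against the ideal ordering of the same labeled list.

```python
from math import log2


def dcg(relevances):
    """Discounted cumulative gain: relevance at rank i is divided by log2(i + 1)."""
    return sum(rel / log2(rank + 2) for rank, rel in enumerate(relevances))


def ndcg(relevances):
    """DCG normalized by the DCG of the best possible ordering of the same labels."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


# Hypothetical labels for one ranked result list (2 = very, 1 = some, 0 = non-relevant).
perfect = ndcg([2, 2, 1, 0])
reversed_order = ndcg([0, 1, 2, 2])
```

A ranking that places the most relevant objects first scores 1, and any ranking that pushes them down scores lower.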
Page 18
Experiment - Effectiveness
Page 19
Experiment - Efficiency
Page 20
Experiment - Efficiency
Page 21
Conclusion & Contribution
• The authors
  – Defined a meta path-based similarity framework in HIN
  – Proposed a new measure called PathSim, which is able to detect peer objects for the given meta path
  – Proposed a co-clustering-based efficient online search algorithm to support top-k search
Page 22
Summary / Discussion / Future work
• Network inference procedure assumes ad-hoc edge filtering
• Introduced a threshold on edges and a family of networks to find an optimal threshold for a certain prediction task
  – The prediction accuracies peak in a non-obvious yet relatively narrow threshold range
• Tested on too few datasets
• Not enough to give a solid conclusion
• Apply the method to a variety of networks
• Test various thresholds for more interests