Fast Nearest-neighbor Search in Disk-resident Graphs
Presenter: Lu Yiqi
IBM – China Research Lab
Outline
Introduction
Background & related works
Proposed work
Experiments
Introduction-Motivation
Graphs are becoming enormous. Streaming algorithms must take passes over the entire dataset; others perform clever preprocessing that is tied to one specific similarity measure.
This paper introduces analysis and algorithms which try to address the scalability problem in a generalizable way: not specific to one kind of graph partitioning nor one specific proximity measure.
Introduction-Motivation(cont.)
Real-world graphs contain high-degree nodes. Many algorithms compute a node's value by combining the values of its neighbors; whenever a high-degree node is encountered, these algorithms have to examine a much larger neighborhood, leading to severely degraded performance.
Introduction-Motivation(cont.)
Algorithms can no longer assume that the entire graph fits in memory. Compression techniques help, but there are at least three settings where they might not work:
social networks are far less compressible than Web graphs
decompression might lead to an unacceptable increase in query response time
even if a graph could be compressed down to a gigabyte, it might be undesirable to keep it in memory on a machine that is running other applications
Contribution
A simple transform of the graph (turning high-degree nodes into sinks)
A deterministic local algorithm guaranteed to return the nearest neighbors in personalized PageRank from the disk-resident clustered graph
A fully external-memory clustering algorithm (RWDISK) that uses only sequential sweeps over data files
Background-Personalized Pagerank
A random walk starting at node a; at any step, the walk is reset to the start node with probability α.
PPV(a, j): the entry of a's personalized PageRank vector at node j, PPV(a, j) = Σ_{t≥1} α(1−α)^{t−1} P(X_t = j | X_0 = a). A large value indicates high similarity.
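The definition above can be sketched as a short power iteration. This is a minimal in-memory illustration (dense transition matrix, hypothetical helper name), not the paper's disk-based algorithm:

```python
import numpy as np

def personalized_pagerank(P, a, alpha=0.1, iters=200):
    """Approximate the PPV row for start node a by power iteration.

    P is a row-stochastic transition matrix; alpha is the restart
    probability. Illustrative sketch only.
    """
    n = P.shape[0]
    x = np.zeros(n)
    x[a] = 1.0                            # walk distribution at step 0
    ppv = np.zeros(n)
    for t in range(1, iters + 1):
        x = x @ P                         # advance the walk one step
        ppv += alpha * (1 - alpha) ** (t - 1) * x
    return ppv
```

With enough iterations the entries sum to (almost) 1, since the geometric weights α(1−α)^{t−1} sum to 1.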
Background-Clustering
Use random-walk-based approaches to compute a good-quality local graph partition near a given anchor node.
Main intuition: a random walk started inside a low-conductance cluster will mostly stay inside the cluster.
Conductance: Φ_V(A) = |edges crossing between A and V∖A| / min(μ(A), μ(V∖A)), where μ(A) = Σ_{i∈A} degree(i).
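Under this definition, the conductance of a small node set can be computed directly. A minimal sketch for an unweighted, undirected graph given as an adjacency dict (helper name is my own):

```python
def conductance(adj, A):
    """Conductance Phi(A) = cut(A, V\\A) / min(mu(A), mu(V\\A)),
    where mu(S) is the sum of degrees of the nodes in S.
    Illustrative helper, not from the paper."""
    A = set(A)
    # count edges with exactly one endpoint inside A
    cut = sum(1 for u in A for v in adj[u] if v not in A)
    mu_A = sum(len(adj[u]) for u in A)
    mu_rest = sum(len(adj[u]) for u in adj if u not in A)
    return cut / min(mu_A, mu_rest)
```

For two triangles joined by a single edge, each triangle has μ = 7 and the cut is 1, so its conductance is 1/7.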
Proposed Work
First problem: most local algorithms for computing nearest neighbors suffer from the presence of high-degree nodes.
Second problem: computing proximity measures on large disk-resident graphs.
Third problem: finding a good clustering.
Effect of high degree nodes
High-degree nodes are a performance bottleneck. Effect on personalized PageRank:
Main intuition: a very high-degree node passes on only a small fraction of its value to each out-neighbor, which may not be significant enough to justify spending computing resources on.
Claim: stopping a random walk at a high-degree node does not change the personalized PageRank values much at other nodes whose degree is relatively small.
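The transform itself is simple: drop the out-edges of high-degree nodes so that walks can enter them but never leave. A sketch, assuming an adjacency-dict graph and a hypothetical degree threshold:

```python
def make_sinks(adj, max_degree):
    """Turn nodes with out-degree above max_degree into sinks by dropping
    their out-edges; in-edges are kept, so random walks can still reach
    them but stop there. Sketch of the transform described above."""
    return {u: ([] if len(nbrs) > max_degree else list(nbrs))
            for u, nbrs in adj.items()}
```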
Effect of high degree nodes
The error incurred in personalized PageRank is inversely proportional to the degree of the sink node.
Effect of high degree nodes
f_α(i, j) is simply the probability of hitting node j for the first time from node i in this α-discounted walk.
Effect of high degree nodes
The error incurred by introducing a set of sink nodes:
Nearest-neighbors on clustered graphs
How to use the clusters for deterministic computation of nodes "close" to an arbitrary query node.
Use degree-normalized personalized PageRank. For a given node i, the PPV from j to i, i.e. PPV(j, i), can be written as
Assume that j and i are in the same cluster S.
We don't have access to PPV(k, i) for boundary nodes k outside S, so we replace it with upper and lower bounds:
Lower bound: 0; we pretend that S is completely disconnected from the rest of the graph.
Upper bound: a random walk from outside S has to cross the boundary of S to hit node i.
Since S is small, the power method suffices.
At each iteration, maintain the upper and lower bounds for nodes within S.
To expand S: bring in the clusters of the external neighbors of S.
Stop expanding when the global upper bound falls below a pre-specified small threshold γ. In practice, use an additive slack ε, i.e. compare against (ub_{k+1} − ε).
Ranking Step
Return all nodes whose lower bound is greater than the (k+1)-th largest upper bound.
Why: all nodes outside the cluster are guaranteed to have personalized PageRank smaller than the global upper bound, which is smaller than γ.
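The ranking rule can be sketched as follows; the bound maps and helper name are my own, not the paper's code:

```python
def top_k_certain(lb, ub, k):
    """Return the nodes whose lower bound exceeds the (k+1)-th largest
    upper bound, i.e. nodes that are certainly in the top k.
    lb and ub map node -> current lower / upper bound. Sketch only."""
    ubs = sorted(ub.values(), reverse=True)
    if len(ubs) <= k:
        return set(lb)
    thresh = ubs[k]                  # (k+1)-th largest upper bound
    return {i for i, l in lb.items() if l > thresh}
```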
Clustered Representation on Disk
Intuition: use a set of anchor nodes and assign each remaining node to its "closest" anchor, using personalized PageRank as the measure of "closeness".
Algorithm:
Start with a random set of anchors.
Iteratively add new anchors from the set of unreachable nodes, and then recompute the cluster assignments.
Two properties:
New anchors are far away from the existing anchors.
When the algorithm terminates, each node i is guaranteed to be assigned to its closest anchor.
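The assignment step could look like this in outline; `ppv_to` is an assumed interface returning the PPV values from an anchor, not the paper's actual code:

```python
def assign_to_anchors(ppv_to, nodes, anchors):
    """Assign each node to the anchor whose PPV to it is largest.
    ppv_to(anchor) is assumed to return a dict node -> PPV value.
    Nodes reached by no anchor are returned as candidates for new
    anchors, mirroring the iterative step described above. Sketch only."""
    best = {}
    for a in anchors:
        for v, p in ppv_to(a).items():
            if v not in best or p > best[v][1]:
                best[v] = (a, p)          # keep the closest anchor so far
    unreachable = [v for v in nodes if v not in best]
    return {v: a for v, (a, p) in best.items()}, unreachable
```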
RWDISK
Four kinds of files:
Edge file: each line represents an edge by a triplet {src, dst, p}, where p = P(X_t = dst | X_{t-1} = src).
Last file: each line is {src, anchor, value}, where value = P(X_{t-1} = src | X_0 = anchor).
Newt file: contains x_t; each line is {src, anchor, value}, where value = P(X_t = src | X_0 = anchor).
Ans file: represents the values of v_t; each line is {src, anchor, value}, where value = Σ_{i=1}^{t} α(1−α)^{i−1} P(X_i = src | X_0 = anchor).
Algorithm: compute v_t by power iterations.
RWDISK(cont.)
Newt is simply a matrix-vector product between the transition matrix stored in Edges and the vector stored in Last.
Since the files are stored lexicographically sorted, this can be computed by a file-join-like algorithm.
First step: join the two files and accumulate the probability values at a node from its in-neighbors.
Next step: sort and compress the Newt file, in order to add up the values from different in-neighbors.
Multiply the probabilities by α(1−α)^{t−1}. Fix the number of iterations at maxiter.
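The join-and-accumulate pass can be sketched in memory as follows; the real algorithm streams lexicographically sorted files from disk, and the names here are my own:

```python
def rwdisk_step(edges, last, alpha, t):
    """One RWDISK pass (in-memory sketch of the external-memory join):
    Newt[(dst, anchor)] = sum over src of p(src, dst) * Last[(src, anchor)],
    and the contribution to Ans is alpha * (1 - alpha)**(t - 1) * Newt.
    edges: list of (src, dst, p); last: dict (src, anchor) -> probability."""
    newt = {}
    for src, dst, p in edges:                 # join on src, accumulate at dst
        for (s, anchor), val in last.items():
            if s == src:
                key = (dst, anchor)
                newt[key] = newt.get(key, 0.0) + p * val
    ans_inc = {k: alpha * (1 - alpha) ** (t - 1) * v for k, v in newt.items()}
    return newt, ans_inc
```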
One major problem is that the intermediate files can become much larger than the number of edges: in most real-world networks, within 4-5 steps it is possible to reach a huge fraction of the whole graph.
To keep the intermediate files from growing too large, use rounding to reduce file sizes.
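A minimal sketch of the rounding step, assuming simple threshold semantics (the exact rounding rule in the paper may differ): entries below a small ε are dropped, trading file size against PPV accuracy.

```python
def round_small(newt, eps):
    """Drop entries below eps so intermediate files stay small, at the
    cost of approximating the PPV. Assumed threshold semantics."""
    return {k: v for k, v in newt.items() if v >= eps}
```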
Experiments
Dataset
Experiments(cont.)
System details: an off-the-shelf PC; least-recently-used page replacement; page size 4 KB.
Experiments(cont.)-Effect of high degree nodes
Three-fold advantages:
- Speed up external-memory clustering
- Reduce the number of page faults in random-walk simulation
Effect on RWDISK
Experiments(cont.)-Deterministic vs. Simulations
Computing top-10 neighbors with approximation slack 0.005 for 500 randomly picked nodes.
Citeseer: original graph. DBLP: nodes with degree above 1000 turned into sinks. LiveJournal: nodes with degree above 100 turned into sinks.
Experiments(cont.)-RWDISK vs. METIS
maxiter = 30, α = 0.1 and ε = 0.001 for PPV; METIS as the baseline algorithm.
METIS breaks DBLP into 50,000 parts using 20 GB of RAM, and LiveJournal into 75,000 parts using 50 GB of RAM.
In comparison, RWDISK can be executed on a standard PC with 2-4 GB of RAM.
Experiments(cont.)-RWDISK vs. METIS
Measure of cluster quality. A good disk-based clustering must satisfy:
- Low conductance
- Clusters fit in disk-sized pages