Fast Nearest-neighbor Search in Disk-resident Graphs
Presenter: Lu Yiqi
IBM – China Research Lab
Outline
Introduction
Background & related works
Proposed work
Experiments
Introduction-Motivation
Graphs are becoming enormous. Streaming algorithms must take passes over the entire dataset; others perform clever preprocessing that is tied to one specific similarity measure.
This paper introduces analysis and algorithms which try to address the scalability problem in a generalizable way: not specific to one kind of graph partitioning nor one specific proximity measure.
Introduction-Motivation(cont.)
Real-world graphs contain high-degree nodes. Many algorithms compute a node's value by combining the values of its neighbors; whenever a high-degree node is encountered, these algorithms have to examine a much larger neighborhood, leading to severely degraded performance.
Introduction-Motivation(cont.)
Algorithms can no longer assume that the entire graph fits in memory. Compression techniques help, but there are at least three settings where they might not work:
social networks are far less compressible than Web graphs
decompression might lead to an unacceptable increase in query response time
even if a graph could be compressed down to a gigabyte, it might be undesirable to keep it in memory on a machine that is running other applications
Contribution
A simple transform of the graph (turning high-degree nodes into sinks)
A deterministic local algorithm guaranteed to return the nearest neighbors in personalized PageRank from the disk-resident clustered graph
A fully external-memory clustering algorithm (RWDISK) that uses only sequential sweeps over data files
Background-Personalized Pagerank
A random walk starting at node a; at any step, the walk is reset to the start node with probability α.
PPV(a, j): the entry of a's personalized PageRank vector at node j, PPV(a, j) = Σ_{t≥1} α(1−α)^{t−1} P(X_t = j | X_0 = a). A large value indicates high similarity.
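The definition above can be sketched as a short power iteration. This is a minimal in-memory illustration (dense transition matrix, hypothetical helper name), not the paper's disk-based algorithm:

```python
import numpy as np

def personalized_pagerank(P, a, alpha=0.1, iters=200):
    """Approximate the PPV row for start node a by power iteration.

    P is a row-stochastic transition matrix; alpha is the restart
    probability. Illustrative sketch only.
    """
    n = P.shape[0]
    x = np.zeros(n)
    x[a] = 1.0                            # walk distribution at step 0
    ppv = np.zeros(n)
    for t in range(1, iters + 1):
        x = x @ P                         # advance the walk one step
        ppv += alpha * (1 - alpha) ** (t - 1) * x
    return ppv
```

With enough iterations the entries sum to (almost) 1, since the geometric weights α(1−α)^{t−1} sum to 1.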
Background-Clustering
Use random-walk-based approaches to compute a good-quality local graph partition near a given anchor node.
Main intuition: a random walk started inside a low-conductance cluster will mostly stay inside the cluster.
Conductance: Φ_V(A) = |edges crossing between A and V∖A| / min(μ(A), μ(V∖A)), where μ(A) = Σ_{i∈A} degree(i).
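Under this definition, the conductance of a small node set can be computed directly. A minimal sketch for an unweighted, undirected graph given as an adjacency dict (helper name is my own):

```python
def conductance(adj, A):
    """Conductance Phi(A) = cut(A, V\\A) / min(mu(A), mu(V\\A)),
    where mu(S) is the sum of degrees of the nodes in S.
    Illustrative helper, not from the paper."""
    A = set(A)
    # count edges with exactly one endpoint inside A
    cut = sum(1 for u in A for v in adj[u] if v not in A)
    mu_A = sum(len(adj[u]) for u in A)
    mu_rest = sum(len(adj[u]) for u in adj if u not in A)
    return cut / min(mu_A, mu_rest)
```

For two triangles joined by a single edge, each triangle has μ = 7 and the cut is 1, so its conductance is 1/7.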
Proposed Work
First problem: most local algorithms for computing nearest neighbors suffer from the presence of high-degree nodes.
Second problem: computing proximity measures on large disk-resident graphs.
Third problem: finding a good clustering.
Effect of high degree nodes
High-degree nodes are a performance bottleneck. Effect on personalized PageRank:
Main intuition: a very high-degree node passes on only a small fraction of its value to each out-neighbor, which may not be significant enough to justify spending computing resources on.
Claim: stopping a random walk at a high-degree node does not change the personalized PageRank values much at other nodes whose degree is relatively small.
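The transform itself is simple: drop the out-edges of high-degree nodes so that walks can enter them but never leave. A sketch, assuming an adjacency-dict graph and a hypothetical degree threshold:

```python
def make_sinks(adj, max_degree):
    """Turn nodes with out-degree above max_degree into sinks by dropping
    their out-edges; in-edges are kept, so random walks can still reach
    them but stop there. Sketch of the transform described above."""
    return {u: ([] if len(nbrs) > max_degree else list(nbrs))
            for u, nbrs in adj.items()}
```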
Effect of high degree nodes
The error incurred in personalized PageRank is inversely proportional to the degree of the sink node.
Effect of high degree nodes
f_α(i, j) is simply the probability of hitting node j for the first time from node i in this α-discounted walk.
Effect of high degree nodes
The error incurred by introducing a set of sink nodes:
Nearest-neighbors on clustered graphs
How to use the clusters for deterministic computation of nodes "close" to an arbitrary query node.
Use degree-normalized personalized PageRank. For a given node i, the PPV from j to i, i.e. PPV(j, i), can be written as
Assume that j and i are in the same cluster S.
We don't have access to PPV(k, i) for boundary nodes k outside S, so we replace it with upper and lower bounds:
Lower bound: 0; we pretend that S is completely disconnected from the rest of the graph.
Upper bound: a random walk from outside S has to cross the boundary of S to hit node i.
Since S is small, the power method suffices.
At each iteration, maintain the upper and lower bounds for nodes within S.
To expand S: bring in the clusters of the external neighbors of S.
Stop expanding when the global upper bound falls below a pre-specified small threshold γ. In practice, use an additive slack ε, i.e. compare against (ub_{k+1} − ε).
Ranking Step
Return all nodes whose lower bound is greater than the (k+1)-th largest upper bound.
Why: all nodes outside the cluster are guaranteed to have personalized PageRank smaller than the global upper bound, which is smaller than γ.
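The ranking rule can be sketched as follows; the bound maps and helper name are my own, not the paper's code:

```python
def top_k_certain(lb, ub, k):
    """Return the nodes whose lower bound exceeds the (k+1)-th largest
    upper bound, i.e. nodes that are certainly in the top k.
    lb and ub map node -> current lower / upper bound. Sketch only."""
    ubs = sorted(ub.values(), reverse=True)
    if len(ubs) <= k:
        return set(lb)
    thresh = ubs[k]                  # (k+1)-th largest upper bound
    return {i for i, l in lb.items() if l > thresh}
```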
Clustered Representation on Disk
Intuition: use a set of anchor nodes and assign each remaining node to its "closest" anchor, using personalized PageRank as the measure of "closeness".
Algorithm:
Start with a random set of anchors.
Iteratively add new anchors from the set of unreachable nodes, and then recompute the cluster assignments.
Two properties:
New anchors are far away from the existing anchors.
When the algorithm terminates, each node i is guaranteed to be assigned to its closest anchor.
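The assignment step could look like this in outline; `ppv_to` is an assumed interface returning the PPV values from an anchor, not the paper's actual code:

```python
def assign_to_anchors(ppv_to, nodes, anchors):
    """Assign each node to the anchor whose PPV to it is largest.
    ppv_to(anchor) is assumed to return a dict node -> PPV value.
    Nodes reached by no anchor are returned as candidates for new
    anchors, mirroring the iterative step described above. Sketch only."""
    best = {}
    for a in anchors:
        for v, p in ppv_to(a).items():
            if v not in best or p > best[v][1]:
                best[v] = (a, p)          # keep the closest anchor so far
    unreachable = [v for v in nodes if v not in best]
    return {v: a for v, (a, p) in best.items()}, unreachable
```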
RWDISK
Four kinds of files:
Edge file: each line represents an edge by a triplet {src, dst, p}, where p = P(X_t = dst | X_{t-1} = src).
Last file: each line is {src, anchor, value}, where value = P(X_{t-1} = src | X_0 = anchor).
Newt file: contains x_t; each line is {src, anchor, value}, where value = P(X_t = src | X_0 = anchor).
Ans file: represents the values of v_t; each line is {src, anchor, value}, where value = Σ_{i=1}^{t} α(1−α)^{i−1} P(X_i = src | X_0 = anchor).
Algorithm: compute v_t by power iterations.
RWDISK(cont.)
Newt is simply a matrix-vector product between the transition matrix stored in Edges and the vector stored in Last.
Since the files are stored lexicographically sorted, this can be computed by a file-join-like algorithm.
First step: join the two files and accumulate the probability values at a node from its in-neighbors.
Next step: sort and compress the Newt file, in order to add up the values from different in-neighbors.
Multiply the probabilities by α(1−α)^{t−1}. Fix the number of iterations at maxiter.
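The join-and-accumulate pass can be sketched in memory as follows; the real algorithm streams lexicographically sorted files from disk, and the names here are my own:

```python
def rwdisk_step(edges, last, alpha, t):
    """One RWDISK pass (in-memory sketch of the external-memory join):
    Newt[(dst, anchor)] = sum over src of p(src, dst) * Last[(src, anchor)],
    and the contribution to Ans is alpha * (1 - alpha)**(t - 1) * Newt.
    edges: list of (src, dst, p); last: dict (src, anchor) -> probability."""
    newt = {}
    for src, dst, p in edges:                 # join on src, accumulate at dst
        for (s, anchor), val in last.items():
            if s == src:
                key = (dst, anchor)
                newt[key] = newt.get(key, 0.0) + p * val
    ans_inc = {k: alpha * (1 - alpha) ** (t - 1) * v for k, v in newt.items()}
    return newt, ans_inc
```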
One major problem is that the intermediate files can become much larger than the number of edges: in most real-world networks, within 4-5 steps it is possible to reach a huge fraction of the whole graph.
To keep the intermediate files from growing too large, use rounding to reduce file sizes.
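A minimal sketch of the rounding step, assuming simple threshold semantics (the exact rounding rule in the paper may differ): entries below a small ε are dropped, trading file size against PPV accuracy.

```python
def round_small(newt, eps):
    """Drop entries below eps so intermediate files stay small, at the
    cost of approximating the PPV. Assumed threshold semantics."""
    return {k: v for k, v in newt.items() if v >= eps}
```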
Experiments
Dataset
Experiments(cont.)
System details: an off-the-shelf PC; least-recently-used page replacement; page size 4 KB.
Experiments(cont.)-Effect of high degree nodes
Three-fold advantages:
- Speed up external-memory clustering
- Reduce the number of page faults in random-walk simulation
Effect on RWDISK
Experiments(cont.)-Deterministic vs. Simulations
Computing top-10 neighbors with approximation slack 0.005 for 500 randomly picked nodes.
Citeseer: original graph. DBLP: nodes with degree above 1000 turned into sinks. LiveJournal: nodes with degree above 100 turned into sinks.
Experiments(cont.)-RWDISK vs. METIS
maxiter = 30, α = 0.1 and ε = 0.001 for PPV; METIS as the baseline algorithm.
METIS breaks DBLP into 50,000 parts using 20 GB of RAM, and LiveJournal into 75,000 parts using 50 GB of RAM.
In comparison, RWDISK can be executed on a standard PC with 2-4 GB of RAM.
Experiments(cont.)-RWDISK vs. METIS
Measure of cluster quality. A good disk-based clustering must satisfy:
- Low conductance
- Clusters fit in disk-sized pages