8/3/2019 s07-minhash
1/37
Scalable Techniques for Clustering the Web
Taher H. Haveliwala
Aristides Gionis
Piotr Indyk
Stanford University
{taherh,gionis,indyk}@cs.stanford.edu
Project Goals
Generate fine-grained clustering of the web based on topic
Similarity search ("What's Related?")
Two major issues:
  Develop appropriate notion of similarity
  Scale up to millions of documents
Prior Work
Offline: detecting replicas
[Broder-Glassman-Manasse-Zweig97]
[Shivakumar-G. Molina98]
Online: finding/grouping related pages
[Zamir-Etzioni98]
[Manjara]
Link based methods
[Dean-Henzinger99, Clever]
Prior Work: Online, Link
Online: cluster results of search queries
  does not work for clustering entire web offline
Link based approaches are limited
  What about relatively new pages?
  What about less popular pages?
Prior Work: Copy detection
Designed to detect duplicates/near-replicas
Do not scale when notion of similarity is modified to topical similarity
Creation of document-document similarity matrix is the core challenge: join bottleneck
Pairwise similarity
Consider relation Docs(id, sentence)
Must compute:
SELECT D1.id, D2.id
FROM Docs D1, Docs D2
WHERE D1.sentence = D2.sentence
GROUP BY D1.id, D2.id
HAVING count(*) >
What if we change sentence to word?
Pairwise similarity
Relation Docs(id, word)
Compute:
SELECT D1.id, D2.id
FROM Docs D1, Docs D2
WHERE D1.word = D2.word
GROUP BY D1.id, D2.id
HAVING count(*) >
For 25M urls, could take months to compute!
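To make the join bottleneck concrete, here is a minimal Python sketch of the same computation on an in-memory relation. The function name `pairwise_shared_counts`, the `threshold` parameter, and the toy rows are illustrative assumptions (the slide leaves the HAVING threshold unspecified). Unlike the SQL self-join, the sketch reports each unordered pair once.

```python
from collections import defaultdict
from itertools import combinations

def pairwise_shared_counts(rows, threshold):
    """rows: iterable of (doc_id, word) tuples, like Docs(id, word).
    Returns {(id1, id2): shared_count} for pairs of distinct documents
    sharing more than `threshold` distinct words."""
    postings = defaultdict(set)              # word -> ids of docs containing it
    for doc_id, word in rows:
        postings[word].add(doc_id)

    counts = defaultdict(int)
    for ids in postings.values():
        # A word found in n documents contributes n*(n-1)/2 pairs, so a
        # single frequent word already costs quadratic work.
        for a, b in combinations(sorted(ids), 2):
            counts[(a, b)] += 1
    return {pair: c for pair, c in counts.items() if c > threshold}

rows = [(1, "a"), (1, "b"), (2, "a"), (2, "b"), (3, "a")]
result = pairwise_shared_counts(rows, threshold=1)
```

The inner loop is the expensive part: its cost grows with the square of each word's document frequency, which is why switching from sentences to words makes the join so much heavier.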
Overview
Choose document representation
Choose similarity metric
Compute pairwise document similarities
Generate clusters
Document representation
Bag of words model
Bag for each page p consists of
  Title of p
  Anchor text of all pages pointing to p
  (Also include window of words around anchors)
Bag Generation
...click here for a great music page...
...click here for great sports page...
...this music is great...
...what I had for lunch...
http://www.foobar.com/
http://www.baz.com/
http://www.music.com/
Enter our site
MusicWorld
Bag Generation
Union of anchor windows is a concise description of a page.
Note that using anchor windows, we can cluster more documents than we've crawled:
  In general, a set of N documents refers to cN urls
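The bag construction can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function name `build_bags`, the input shapes, the whitespace tokenization, and the toy data (including which snippet points at which url) are all assumptions.

```python
from collections import Counter, defaultdict

def build_bags(titles, anchor_windows):
    """titles: {url: title string} for crawled pages.
    anchor_windows: iterable of (target_url, text), where text is the anchor
    text plus a window of surrounding words taken from the linking page.
    Returns {url: Counter of lowercase words}."""
    bags = defaultdict(Counter)
    for url, title in titles.items():
        bags[url].update(title.lower().split())
    for target, text in anchor_windows:
        # The target may never have been crawled; it still gets a bag.
        bags[target].update(text.lower().split())
    return dict(bags)

# Hypothetical toy input: only foobar.com was crawled.
titles = {"http://www.foobar.com/": "Enter our site"}
anchor_windows = [
    ("http://www.music.com/", "click here for a great music page"),
    ("http://www.music.com/", "this music is great"),
]
bags = build_bags(titles, anchor_windows)
```

Here `http://www.music.com/` receives a bag even though it has no entry in `titles`, which is how a crawl of N documents can describe more than N urls.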
Standard IR
Remove stopwords (~ 750)
Remove high frequency & low frequency terms
Use stemming
Apply TFIDF scaling
Overview
Choose document representation
Choose similarity metric
Compute pairwise document similarities
Generate clusters
Similarity
Similarity metric for pages U1, U2, that were assigned bags B1, B2, respectively
  sim(U1, U2) = |B1 ∩ B2| / |B1 ∪ B2|
Threshold is set to 20%
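As a small illustration, the measure (intersection size over union size) can be written in Python as below. The sketch treats each bag as a set of distinct words; how the original system handled word multiplicities or TFIDF weights is not stated on the slide, so that simplification is an assumption, as is the name `jaccard`.

```python
def jaccard(bag1, bag2):
    """Intersection size divided by union size, on distinct words."""
    s1, s2 = set(bag1), set(bag2)
    union = s1 | s2
    if not union:
        return 0.0          # convention chosen here for two empty bags
    return len(s1 & s2) / len(union)

sim = jaccard({"mouse", "dog"}, {"cat", "mouse"})   # 1 shared word out of 3
is_related = sim >= 0.20                            # the 20% threshold
```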
Reality Check
www.foodchannel.com:
www.epicurious.com/a_home/a00_home/home.html .37
www.gourmetworld.com .36
www.foodwine.com .325
www.cuisinenet.com .3125
www.kitchenlink.com .3125
www.yumyum.com .3
www.menusonline.com .3
www.snap.com/directory/category/0,16,-324,00.html .2875
www.ichef.com .2875
www.home-canning.com .275
Overview
Choose document representation
Choose similarity metric
Compute pairwise document similarities
Generate clusters
Pair Generation
Find all pairs of pages (U1, U2) satisfying sim(U1, U2) ≥ 20%
Ignore all url pairs with sim < 20%
How do we avoid the join bottleneck?
Locality Sensitive Hashing
Idea: use special kind of hashing
Locality Sensitive Hashing (LSH) provides a solution:
  Min-wise hash functions [Broder98]
  LSH [Indyk, Motwani98], [Cohen et al. 2000]
Properties:
  Similar urls are hashed together with high probability (w.h.p.)
  Dissimilar urls are not hashed together
Locality Sensitive Hashing
[Figure: the urls sports.com, golf.com, music.com, opera.com and sing.com hashed into buckets]
Hashing
Two steps
  Min-hash (MH): a way to consistently sample words from bags
  Locality sensitive hashing (LSH): similar pages get hashed to the same bucket while dissimilar ones do not
Step 1: Min-hash
Step 1: Generate m min-hash signatures for each url (m = 80)
For i = 1...m
  Generate a random order h_i on words
  mh_i(u) = argmin { h_i(w) | w ∈ B_u }
Pr(mh_i(u) = mh_i(v)) = sim(u, v)
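The min-hash step can be sketched in Python as below. This is an illustration under stated assumptions: instead of materializing m random orderings of the vocabulary, it ranks words with a salted SHA-1 hash, which only approximates a random order; the names `_rank`, `minhash_signature`, and `estimate_similarity` are hypothetical, and bags must be non-empty.

```python
import hashlib

def _rank(i, word):
    """Pseudo-random rank of `word` under the i-th ordering h_i."""
    digest = hashlib.sha1(f"{i}:{word}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def minhash_signature(bag, m=80):
    """Return [mh_1(u), ..., mh_m(u)]: for each ordering, the word of the
    bag with the smallest rank. Raises ValueError on an empty bag."""
    words = set(bag)
    return [min(words, key=lambda w: _rank(i, w)) for i in range(m)]

def estimate_similarity(sig_u, sig_v):
    """Fraction of positions where two signatures agree; since
    Pr(mh_i(u) = mh_i(v)) = sim(u, v), this estimates sim(u, v)."""
    matches = sum(a == b for a, b in zip(sig_u, sig_v))
    return matches / len(sig_u)

sig_a = minhash_signature({"mouse", "dog", "horse", "ant"})
sig_b = minhash_signature({"cat", "ice", "shoe", "mouse"})
estimate = estimate_similarity(sig_a, sig_b)
```

Because every url uses the same m orderings, signatures can be compared position by position without looking at the full bags again.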
Step 1: Min-hash
Round 1:
ordering = [cat, dog, mouse, banana]
Set A:{mouse, dog}
MH-signature = dog
Set B:{cat, mouse}
MH-signature = cat
Step 1: Min-hash
Round 2:
ordering = [banana, mouse, cat, dog]
Set A:{mouse, dog}
MH-signature = mouse
Set B:{cat, mouse}
MH-signature = mouse
Step 2: LSH
Step 2: Generate l LSH signatures for each url, using k of the min-hash values (l = 125, k = 3)
For i = 1...l
  Randomly select k min-hash indices and concatenate them to form the i-th LSH signature
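A possible Python rendering of this step is shown below. The names `choose_index_sets` and `lsh_signatures` are hypothetical, and two details are assumptions not stated on the slide: the k indices within a round are drawn without replacement, and each signature is kept as a tuple rather than a concatenated string. The one essential point is that the index choices are made once and shared by all urls.

```python
import random

def choose_index_sets(m, l=125, k=3, seed=0):
    """Pick, once for the whole collection, l sets of k min-hash indices."""
    rng = random.Random(seed)
    return [tuple(rng.sample(range(m), k)) for _ in range(l)]

def lsh_signatures(mh_signature, index_sets):
    """The i-th LSH signature is the tuple of min-hash values found at the
    i-th index set."""
    return [tuple(mh_signature[j] for j in idx) for idx in index_sets]

# Toy example with m = 4 min-hash values and two hand-picked index sets
# (0-based indices, so (0, 2, 3) means the 1st, 3rd and 4th values).
mh_a = ["horse", "mouse", "ant", "dog"]
toy_sigs = lsh_signatures(mh_a, [(0, 2, 3), (1, 2, 3)])

index_sets = choose_index_sets(m=80)   # the slide's setting: l = 125, k = 3
```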
Step 2: LSH
Generate candidate pair if u and v have an LSH signature in common in any round
Pr(lsh(u) = lsh(v)) = Pr(mh(u) = mh(v))^k
Step 2: LSH
Set A: {mouse, dog, horse, ant}
  MH1 = horse
  MH2 = mouse
  MH3 = ant
  MH4 = dog
  LSH134 = horse-ant-dog
  LSH234 = mouse-ant-dog
Set B: {cat, ice, shoe, mouse}
  MH1 = cat
  MH2 = mouse
  MH3 = ice
  MH4 = shoe
  LSH134 = cat-ice-shoe
  LSH234 = mouse-ice-shoe
Step 2: LSH
Bottom line - probability of collision in a single round (k = 3):
  10% similarity → 0.1%
  1% similarity → 0.0001%
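These figures follow from raising the similarity to the power k = 3. The short Python sketch below reproduces them and also estimates the chance that a pair collides in at least one of the l = 125 rounds. The second formula assumes the rounds behave independently, which is only an approximation because all rounds draw from the same m min-hash values; both function names are hypothetical.

```python
def round_collision_probability(sim, k=3):
    """Probability that one LSH signature of k min-hash values matches."""
    return sim ** k

def candidate_probability(sim, k=3, l=125):
    """Probability of matching in at least one of l rounds, treating the
    rounds as independent (an approximation)."""
    return 1.0 - (1.0 - sim ** k) ** l

p10 = round_collision_probability(0.10)   # 0.001, i.e. 0.1%
p01 = round_collision_probability(0.01)   # 0.000001, i.e. 0.0001%
p_threshold = candidate_probability(0.20)
```

Under that independence assumption, a pair sitting exactly at the 20% threshold becomes a candidate with probability of roughly 63%, and the probability rises quickly for more similar pairs.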
Step 2: LSH
Round 1
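[Figure: urls placed into buckets by their round-1 LSH signature. urls shown: sports.com, golf.com, party.com, music.com, opera.com, sing.com. Bucket signatures shown: sport-team-win, music-sound-play, sing-music-ear.]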
Step 2: LSH
Round 2
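[Figure: urls placed into buckets by their round-2 LSH signature. urls shown: sports.com, golf.com, music.com, sing.com, opera.com. Bucket signatures shown: game-team-score, audio-music-note, theater-luciano-sing.]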
Sort & Filter
Using all buckets from all LSH rounds, generate candidate pairs
Sort candidate pairs on first field
Filter candidate pairs: keep pair (u, v) only if u and v agree on at least 20% of MH-signatures
Ready for "What's Related?" queries...
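The candidate generation and filtering steps might look like the following Python sketch. It is an in-memory illustration only (at the scale of millions of urls this would be done with external sorting); the function names, the input dictionaries, and the toy data are assumptions.

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(lsh_sigs):
    """lsh_sigs: {url: [signature for round 0, round 1, ...]}, where each
    signature is hashable. Two urls become a candidate pair if they share
    a signature in the same round."""
    buckets = defaultdict(set)
    for url, sigs in lsh_sigs.items():
        for rnd, sig in enumerate(sigs):
            buckets[(rnd, sig)].add(url)
    pairs = set()
    for urls in buckets.values():
        pairs.update(combinations(sorted(urls), 2))
    return sorted(pairs)                 # sorted on first field, then second

def filter_pairs(pairs, mh_sigs, threshold=0.20):
    """Keep (u, v) only if the min-hash signatures agree on at least
    `threshold` of their positions."""
    kept = []
    for u, v in pairs:
        agree = sum(a == b for a, b in zip(mh_sigs[u], mh_sigs[v]))
        if agree / len(mh_sigs[u]) >= threshold:
            kept.append((u, v))
    return kept

lsh_sigs = {"a": [("x",), ("y",)], "b": [("x",), ("z",)], "c": [("q",), ("r",)]}
mh_sigs = {"a": ["x", "y", "p", "p", "p"], "b": ["x", "z", "q", "q", "q"]}
pairs = candidate_pairs(lsh_sigs)
kept = filter_pairs(pairs, mh_sigs)
```

The filter only compares the short min-hash signatures, so the full bags never need to be joined against each other.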
Overview
Choose document representation
Choose similarity metric
Compute pairwise document similarities
Generate clusters
Clustering
The set of document pairs represents the document-document similarity matrix with 20% similarity threshold
Clustering algorithms
  S-Link: connected components
  C-Link: maximal cliques
  Center: approximation to C-Link
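For S-Link, connected components over the pair list can be computed with a union-find structure. The Python sketch below is one standard way to do this, not necessarily what the authors used; the function name `s_link` and the toy pairs are assumptions.

```python
def s_link(pairs):
    """Cluster urls into connected components of the similarity graph.
    pairs: iterable of (u, v) edges. Returns a sorted list of sorted
    clusters."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]    # path halving
            x = parent[x]
        return x

    for u, v in pairs:
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            parent[root_u] = root_v          # merge the two components

    clusters = {}
    for x in list(parent):
        clusters.setdefault(find(x), set()).add(x)
    return sorted(sorted(c) for c in clusters.values())

components = s_link([("a", "b"), ("b", "c"), ("d", "e")])
```

Note that "a" and "c" end up together even though they are not directly similar, which is the chaining behavior that distinguishes S-Link from the clique-based C-Link.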
Center
Scan through pairs (they are sorted on first component)
For each run [(u, v1), ... , (u, vn)]
  if u is not marked
    cluster = u + unmarked neighbors of u
    mark u and all neighbors of u
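The scan above translates fairly directly into Python. In this sketch the name `center_clusters` is hypothetical, and one detail is an assumption: each pair is emitted in both directions before sorting, so that the run for u contains all of u's neighbors.

```python
from itertools import groupby

def center_clusters(pairs):
    """pairs: iterable of (u, v) similar-url pairs. Returns a list of
    clusters, each a list starting with its center."""
    # Assumption: store every pair in both directions so each run is complete.
    directed = sorted({(u, v) for u, v in pairs} | {(v, u) for u, v in pairs})
    marked = set()
    clusters = []
    for u, run in groupby(directed, key=lambda p: p[0]):
        neighbors = [v for _, v in run]
        if u in marked:
            continue
        # u becomes a center; it takes its still-unmarked neighbors.
        clusters.append([u] + [v for v in neighbors if v not in marked])
        marked.add(u)
        marked.update(neighbors)
    return clusters

clusters = center_clusters([("a", "b"), ("a", "c"), ("c", "d"), ("e", "f")])
```

In the toy input, "d" is only similar to "c", which was already claimed by center "a", so "d" ends up as a singleton cluster. Every member of a cluster is directly similar to its center, which is a weaker guarantee than the all-pairs similarity of a clique.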
Results
Algorithm Step    Running Time (hours)
Bag generation    23
Bag sorting       4.7
Min-hash          26
LSH               16
Filtering         83
Sorting           107
CENTER            18

20 Million urls on Pentium-II 450
Sample Cluster
feynman.princeton.edu/~sondhi/205main.html
hep.physics.wisc.edu/wsmith/p202/p202syl.html
hepweb.rl.ac.uk/ppUK/PhysFAQ/relativity.html
pdg.lbl.gov/mc_particle_id_contents.html
physics.ucsc.edu/courses/10.html
town.hall.org/places/SciTech/qmachine
www.as.ua.edu/physics/hetheory.html
www.landfield.com/faqs/by-newsgroup/sci/sci.physics.relativity.html
www.pa.msu.edu/courses/1999spring/PHY492/desc_PHY492.html
www.phy.duke.edu/Courses/271/Synopsis.html
. . . (total of 27 urls) . . .
Ongoing/Future Work
Tune anchor-window length
Develop system to measure quality
  What is ground truth?
  How do you judge clustering of millions of pages?