s07-minhash

8/3/2019 s07-minhash


Scalable Techniques for Clustering the Web

Taher H. Haveliwala

Aristides Gionis

Piotr Indyk

Stanford University

{taherh,gionis,indyk}@cs.stanford.edu



Project Goals

Generate a fine-grained clustering of the web based on topic

Similarity search (What's Related?)

Two major issues:

Develop an appropriate notion of similarity

Scale up to millions of documents



Prior Work

Offline: detecting replicas

[Broder-Glassman-Manasse-Zweig97]

[Shivakumar-G. Molina98]

Online: finding/grouping related pages

[Zamir-Etzioni98]

[Manjara]

Link based methods

[Dean-Henzinger99, Clever]



Prior Work: Online, Link

Online: cluster results of search queries

does not work for clustering the entire web offline

Link-based approaches are limited

What about relatively new pages?

What about less popular pages?



Prior Work: Copy detection

Designed to detect duplicates/near-replicas

Do not scale when the notion of similarity is modified to topical similarity

Creation of the document-document similarity matrix is the core challenge: the join bottleneck



Pairwise similarity

Consider relation Docs(id, sentence)

Must compute:

SELECT D1.id, D2.id
FROM Docs D1, Docs D2
WHERE D1.sentence = D2.sentence
GROUP BY D1.id, D2.id
HAVING count(*) > threshold

What if we change sentence to word?



Pairwise similarity

Relation Docs(id, word)

Compute:

SELECT D1.id, D2.id
FROM Docs D1, Docs D2
WHERE D1.word = D2.word
GROUP BY D1.id, D2.id
HAVING count(*) > threshold

For 25M urls, could take months to compute!
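To make the cost of this join concrete, here is a minimal Python sketch of what the query computes, assuming the relation fits in memory as (id, word) rows. The name `pairwise_counts` and the `threshold` parameter are illustrative and not from the slides. Unlike the SQL, the sketch reports each unordered pair once, skips self-pairs, and counts a repeated word only once per document.

```python
from collections import defaultdict
from itertools import combinations


def pairwise_counts(docs, threshold):
    """Return {(id1, id2): shared_word_count} for pairs sharing more than `threshold` words.

    `docs` is an iterable of (doc_id, word) rows, mirroring Docs(id, word).
    """
    # Inverted index: word -> set of documents containing it.
    postings = defaultdict(set)
    for doc_id, word in docs:
        postings[word].add(doc_id)

    # A word that appears in n documents contributes n*(n-1)/2 pair updates,
    # which is where the quadratic blow-up for common words comes from.
    counts = defaultdict(int)
    for ids in postings.values():
        for a, b in combinations(sorted(ids), 2):
            counts[(a, b)] += 1

    return {pair: c for pair, c in counts.items() if c > threshold}
```

A single word shared by 100,000 pages already produces about 5 billion pair updates, which is the join bottleneck the rest of the talk works around.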



Overview

Choose document representation

Choose similarity metric

Compute pairwise document similarities

Generate clusters



Document representation

Bag of words model

Bag for each page p consists of

Title of p

Anchor text of all pages pointing to p

(Also include a window of words around anchors)
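The bag construction above can be sketched in Python as follows. This is only an illustration: the input shapes (`titles`, `links`), the helper names, the lowercasing and the default window of 2 words are all assumptions, not details given in the slides.

```python
from collections import Counter, defaultdict


def anchor_window(words, start, end, window):
    """Anchor text words[start:end] plus up to `window` words on each side."""
    lo = max(0, start - window)
    hi = min(len(words), end + window)
    return words[lo:hi]


def build_bags(titles, links, window=2):
    """Build one bag (Counter of words) per target url.

    titles: {url: title string} for crawled pages.
    links:  iterable of (target_url, source_words, start, end), where
            source_words[start:end] is the anchor text on the linking page.
    """
    bags = defaultdict(Counter)
    for url, title in titles.items():
        bags[url].update(title.lower().split())
    for target, words, start, end in links:
        bags[target].update(w.lower() for w in anchor_window(words, start, end, window))
    return bags
```

Because bags are keyed by the link target, a url that was never crawled still gets a bag as long as some crawled page points to it.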



Bag Generation

...click here for a great music page...

...click here for great sports page...

...this music is great...

...what I had for lunch...

http://www.foobar.com/

http://www.baz.com/

http://www.music.com/

Enter our site

MusicWorld



Bag Generation

Union of anchor windows is a concise description of a page.

Note that using anchor windows, we can cluster more documents than we've crawled:

In general, a set of N documents refers to cN urls



Standard IR

Remove stopwords (~ 750)

Remove high frequency & low frequency terms

Use stemming

Apply TFIDF scaling
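A rough Python sketch of these preprocessing steps, under stated assumptions: the cut-offs `min_df` and `max_df_ratio` and the tf * log(N/df) weighting are illustrative choices (the slides give no values or formula), and stemming is left out because it needs an external stemmer.

```python
import math
from collections import Counter


def tfidf_bags(bags, stopwords, min_df=2, max_df_ratio=0.5):
    """Filter and reweight bags.

    bags: {url: Counter of raw term counts}.
    Returns {url: {term: tf * idf}} after dropping stopwords and terms whose
    document frequency is below `min_df` or above `max_df_ratio` of all bags.
    """
    n = len(bags)
    df = Counter()
    for bag in bags.values():
        df.update(set(bag))  # count each term once per bag

    keep = {
        t for t, d in df.items()
        if t not in stopwords and d >= min_df and d / n <= max_df_ratio
    }
    return {
        url: {t: tf * math.log(n / df[t]) for t, tf in bag.items() if t in keep}
        for url, bag in bags.items()
    }
```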



Overview




Choose document representation

Choose similarity metric

Compute pairwise document similarities

Generate clusters



Similarity

Similarity metric for pages U1, U2 that were assigned bags B1, B2, respectively:

sim(U1, U2) = |B1 ∩ B2| / |B1 ∪ B2|

Threshold is set to 20%
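As a small illustration, the metric can be computed in Python as below. Treating bags as multisets (min of counts for the intersection, max for the union) is an assumption; with plain sets the same code reduces to ordinary Jaccard similarity.

```python
from collections import Counter


def sim(b1, b2):
    """Jaccard similarity |B1 ∩ B2| / |B1 ∪ B2| of two bags (multisets)."""
    b1, b2 = Counter(b1), Counter(b2)
    union = sum((b1 | b2).values())  # | keeps the max count per word
    if union == 0:
        return 0.0  # convention chosen here for two empty bags
    return sum((b1 & b2).values()) / union  # & keeps the min count per word
```

For example, `sim({"mouse", "dog"}, {"cat", "mouse"})` is 1/3: one shared word out of three distinct words, which clears the 20% threshold.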



Reality Check

www.foodchannel.com:

www.epicurious.com/a_home/a00_home/home.html .37

www.gourmetworld.com .36

www.foodwine.com .325

www.cuisinenet.com .3125

www.kitchenlink.com .3125

www.yumyum.com .3

www.menusonline.com .3

www.snap.com/directory/category/0,16,-324,00.html .2875

www.ichef.com .2875

www.home-canning.com .275



Overview



Choose document representation

Choose similarity metric

Compute pairwise document similarities

Generate clusters



Pair Generation

Find all pairs of pages (U1, U2) satisfying sim(U1, U2) ≥ 20%

Ignore all url pairs with sim < 20%

How do we avoid the join bottleneck?



Locality Sensitive Hashing

Idea: use a special kind of hashing

Locality Sensitive Hashing (LSH) provides a solution:

Min-wise hash functions [Broder98]

LSH [Indyk, Motwani98], [Cohen et al. 2000]

Properties:

Similar urls are hashed together w.h.p. (with high probability)

Dissimilar urls are not hashed together



Locality Sensitive Hashing

sports.com

golf.com

music.com

opera.com

sing.com



Hashing

Two steps:

Min-hash (MH): a way to consistently sample words from bags

Locality sensitive hashing (LSH): similar pages get hashed to the same bucket while dissimilar ones do not



Step 1: Min-hash

Step 1: Generate m min-hash signatures for each url (m = 80)

For i = 1...m

    Generate a random order h_i on words

    mh_i(u) = argmin { h_i(w) | w ∈ B_u }

Pr(mh_i(u) = mh_i(v)) = sim(u, v)
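A minimal Python sketch of Step 1, under two assumptions that are not in the slides: each random order h_i is simulated by ranking words on a SHA-1 hash of the pair (i, word), and bags are treated as sets of distinct words. The function names are illustrative.

```python
import hashlib


def h(i, word):
    """Pseudo-random rank of `word` under the i-th ordering (stand-in for a random permutation)."""
    digest = hashlib.sha1(f"{i}:{word}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(bag, m=80):
    """Return [mh_1(u), ..., mh_m(u)]: the minimum-rank word of a non-empty `bag` under each ordering."""
    words = set(bag)
    return [min(words, key=lambda w: h(i, w)) for i in range(m)]


def estimate_sim(sig_u, sig_v):
    """Fraction of positions where two signatures agree, an estimate of sim(u, v)."""
    return sum(a == b for a, b in zip(sig_u, sig_v)) / len(sig_u)
```

Since each position agrees with probability sim(u, v), the agreement rate over m positions estimates the similarity without ever comparing the full bags.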



Step 1: Min-hash

Round 1:

ordering = [cat, dog, mouse, banana]

Set A:{mouse, dog}

MH-signature = dog

Set B:{cat, mouse}

MH-signature = cat



Step 1: Min-hash

Round 2:

ordering = [banana, mouse, cat, dog]

Set A:{mouse, dog}

MH-signature = mouse

Set B:{cat, mouse}

MH-signature = mouse
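The two rounds above can be replayed with a few lines of Python. The helper `mh` is illustrative; it takes the ordering as an explicit list rather than a hash function.

```python
def mh(ordering, bag):
    """Min-hash of `bag` under an explicit ordering: the bag's earliest word in the list."""
    rank = {word: pos for pos, word in enumerate(ordering)}
    return min(bag, key=rank.__getitem__)


A = {"mouse", "dog"}
B = {"cat", "mouse"}
rounds = [
    ["cat", "dog", "mouse", "banana"],  # Round 1
    ["banana", "mouse", "cat", "dog"],  # Round 2
]
signatures = [(mh(order, A), mh(order, B)) for order in rounds]
agreement = sum(a == b for a, b in signatures) / len(rounds)
```

Here the signatures agree in one of two rounds, while the true similarity of A and B is 1/3; two rounds are far too few for a stable estimate, which is why many rounds are used in practice.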



Step 2: LSH

Step 2: Generate l LSH signatures for each url, using k of the min-hash values (l = 125, k = 3)

For i = 1...l

    Randomly select k min-hash indices and concatenate them to form the i-th LSH signature



Step 2: LSH

Generate a candidate pair if u and v have an LSH signature in common in any round

Pr(lsh(u) = lsh(v)) = Pr(mh(u) = mh(v))^k
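A hedged Python sketch of Step 2 follows. It assumes min-hash values are strings that contain no "-" (so joined signatures are unambiguous) and that the index choice for a round is drawn once and shared by all urls, which is what makes signatures comparable. Function names and the `seed` parameter are illustrative.

```python
import random
from collections import defaultdict
from itertools import combinations


def lsh_index_sets(m, k=3, l=125, seed=0):
    """Choose, once for all urls, l random k-subsets of the m min-hash positions."""
    rng = random.Random(seed)
    return [tuple(sorted(rng.sample(range(m), k))) for _ in range(l)]


def lsh_signature(mh_sig, idx):
    """Concatenate the selected min-hash values into one LSH signature."""
    return "-".join(mh_sig[i] for i in idx)


def candidate_pairs(mh_sigs, index_sets):
    """Return pairs of urls that share an LSH signature in at least one round.

    mh_sigs: {url: list of m min-hash values}.
    """
    pairs = set()
    for idx in index_sets:
        # One hash table per round: signature -> urls in that bucket.
        buckets = defaultdict(list)
        for url, sig in mh_sigs.items():
            buckets[lsh_signature(sig, idx)].append(url)
        for urls in buckets.values():
            pairs.update(combinations(sorted(urls), 2))
    return pairs
```

Only urls that land in the same bucket are ever paired, so the work is driven by bucket sizes rather than by the square of the number of urls.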



Step 2: LSH

Set A: {mouse, dog, horse, ant}

MH1 = horse

MH2 = mouse

MH3 = ant

MH4 = dog

LSH134 = horse-ant-dog

LSH234 = mouse-ant-dog

Set B: {cat, ice, shoe, mouse}

MH1 = cat

MH2 = mouse

MH3 = ice

MH4 = shoe

LSH134 = cat-ice-shoe

LSH234 = mouse-ice-shoe



Step 2: LSH

Bottom line - probability of collision in one round (k = 3):

10% similarity → 0.1%

1% similarity → 0.0001%
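These figures follow from Pr(lsh(u) = lsh(v)) = sim^k with k = 3. The short Python sketch below recomputes them and also estimates the chance of colliding in at least one of l = 125 rounds; that second formula treats rounds as independent, which is an approximation since rounds draw from the same 80 min-hash values.

```python
def p_round(s, k=3):
    """Probability that two urls with similarity s share one LSH signature: s^k."""
    return s ** k


def p_candidate(s, k=3, l=125):
    """Probability of sharing a signature in at least one of l rounds, assuming independent rounds."""
    return 1.0 - (1.0 - s ** k) ** l


for s in (0.01, 0.10, 0.20, 0.50):
    print(f"sim={s:.2f}  per-round={p_round(s):.6%}  any-of-125={p_candidate(s):.2%}")
```

Under that approximation, a pair right at the 20% threshold becomes a candidate roughly 63% of the time, while a pair at 1% similarity almost never does.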



Step 2: LSH

Round 1

sport-team-win: sports.com, golf.com, party.com

music-sound-play: music.com, opera.com

sing-music-ear: sing.com

. . .



Step 2: LSH

Round 2

game-team-score: sports.com, golf.com

audio-music-note: music.com, sing.com

theater-luciano-sing: opera.com

. . .



Sort & Filter

Using all buckets from all LSH rounds, generate candidate pairs

Sort candidate pairs on first field

Filter candidate pairs: keep pair (u, v) only if u and v agree on at least 20% of MH-signatures

Ready for What's Related? queries...
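The sort-and-filter step might look like this in Python, assuming candidate pairs and min-hash signatures are held in memory (the real system at this scale would work from disk). The function name and data shapes are illustrative.

```python
def filter_pairs(candidates, mh_sigs, threshold=0.20):
    """Keep candidate pairs whose min-hash signatures agree on at least `threshold` of positions.

    candidates: iterable of (u, v) url pairs.
    mh_sigs:    {url: list of min-hash values}, all of the same length.
    Returns the surviving pairs in sorted order (first url, then second).
    """
    kept = []
    for u, v in sorted(candidates):
        sig_u, sig_v = mh_sigs[u], mh_sigs[v]
        agree = sum(a == b for a, b in zip(sig_u, sig_v))
        if agree >= threshold * len(sig_u):
            kept.append((u, v))
    return kept
```

The filter reuses the min-hash signatures as a cheap similarity estimate, so the original bags never need to be compared.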



Overview




Choose document representation

Choose similarity metric

Compute pairwise document similarities

Generate clusters



Clustering

The set of document pairs represents the document-document similarity matrix with 20% similarity threshold

Clustering algorithms

S-Link: connected components

C-Link: maximal cliques

Center: approximation to C-Link
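As an illustration of the simplest option, S-Link, connected components of the pair graph can be found with a union-find structure. This is a generic sketch, not necessarily how the authors implemented it; the function name is illustrative.

```python
def s_link(pairs):
    """Cluster urls into connected components of the similarity graph (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for u, v in pairs:
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv

    clusters = {}
    for x in list(parent):
        clusters.setdefault(find(x), set()).add(x)
    return sorted(sorted(c) for c in clusters.values())
```

Note that S-Link chains: if a is similar to b and b to c, all three end up together even when a and c are not similar, which is the looseness that C-Link and Center try to avoid.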



Center

Scan through pairs (they are sorted on first component)

For each run [(u, v1), ... , (u, vn)]

    if u is not marked

        cluster = u + unmarked neighbors of u

        mark u and all neighbors of u
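The pseudocode above translates almost line for line into Python. One assumption is added here that the slide does not state: each similar pair appears in both directions, (u, v) and (v, u), so that the run for u lists all of u's neighbors.

```python
from itertools import groupby


def center(sorted_pairs):
    """CENTER clustering over pairs sorted on their first component.

    Assumes each similar pair is present as both (u, v) and (v, u).
    """
    marked = set()
    clusters = []
    # groupby yields one run [(u, v1), ..., (u, vn)] per distinct first component.
    for u, run in groupby(sorted_pairs, key=lambda p: p[0]):
        neighbors = [v for _, v in run]
        if u in marked:
            continue
        clusters.append([u] + [v for v in neighbors if v not in marked])
        marked.add(u)
        marked.update(neighbors)
    return clusters
```

A single sequential pass over the sorted pairs is enough, and a url whose neighbors were all claimed by earlier centers ends up as a singleton cluster.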



Center



Results

Algorithm Step    Running Time (hours)

Bag generation    23

Bag sorting       4.7

Min-hash          26

LSH               16

Filtering         83

Sorting           107

CENTER            18

20 million urls on Pentium-II 450



Sample Cluster

feynman.princeton.edu/~sondhi/205main.html

hep.physics.wisc.edu/wsmith/p202/p202syl.html

hepweb.rl.ac.uk/ppUK/PhysFAQ/relativity.html

pdg.lbl.gov/mc_particle_id_contents.html

physics.ucsc.edu/courses/10.html

town.hall.org/places/SciTech/qmachine

www.as.ua.edu/physics/hetheory.html

www.landfield.com/faqs/by-newsgroup/sci/sci.physics.relativity.html

www.pa.msu.edu/courses/1999spring/PHY492/desc_PHY492.html

www.phy.duke.edu/Courses/271/Synopsis.html

. . . (total of 27 urls) . . .



Ongoing/Future Work

Tune anchor-window length

Develop system to measure quality

What is ground truth?

How do you judge clustering of millions of pages?
