
  • 8/3/2019 s07-minhash

    1/37

    Scalable Techniques for

    Clustering the Web

    Taher H. Haveliwala

    Aristides Gionis

    Piotr Indyk

    Stanford University

    {taherh,gionis,indyk}@cs.stanford.edu


    Project Goals

    Generate fine-grained clustering of web based on topic

    Similarity search ("What's Related?")

    Two major issues:

    Develop appropriate notion of similarity

    Scale up to millions of documents


    Prior Work

    Offline: detecting replicas

    [Broder-Glassman-Manasse-Zweig97]

    [Shivakumar-G. Molina98]

    Online: finding/grouping related pages

    [Zamir-Etzioni98]

    [Manjara]

    Link based methods

    [Dean-Henzinger99, Clever]


    Prior Work: Online, Link

    Online: cluster results of search queries

    does not work for clustering entire web offline

    Link based approaches are limited

    What about relatively new pages?

    What about less popular pages?


    Prior Work: Copy detection

    Designed to detect duplicates/near-replicas

    Do not scale when notion of similarity is modified to topical similarity

    Creation of document-document similarity matrix is the core challenge: join bottleneck


    Pairwise similarity

    Consider relation Docs(id, sentence)

    Must compute:

    SELECT D1.id, D2.id

    FROM Docs D1, Docs D2

    WHERE D1.sentence = D2.sentence

    GROUP BY D1.id, D2.id

    HAVING count(*) >

    What if we change sentence to word?


    Pairwise similarity

    Relation Docs(id, word)

    Compute:

    SELECT D1.id, D2.id

    FROM Docs D1, Docs D2

    WHERE D1.word = D2.word

    GROUP BY D1.id, D2.id

    HAVING count(*) >

    For 25M urls, could take months to compute!
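To make the join bottleneck concrete, here is a minimal Python sketch of the same self-join over Docs(id, word). The names `shared_word_counts`, `similar_pairs`, and `threshold` are hypothetical; `threshold` stands in for the cutoff that the HAVING clause leaves unspecified. Unlike the SQL above, the sketch reports each unordered pair once and skips self-pairs.

```python
from collections import defaultdict
from itertools import combinations

def shared_word_counts(docs):
    """Count shared words for every pair of ids in (id, word) rows."""
    postings = defaultdict(set)              # word -> ids containing it
    for doc_id, word in docs:
        postings[word].add(doc_id)
    counts = defaultdict(int)
    for ids in postings.values():
        # A word shared by n documents contributes n*(n-1)/2 pairs.
        for d1, d2 in combinations(sorted(ids), 2):
            counts[(d1, d2)] += 1
    return dict(counts)

def similar_pairs(docs, threshold):
    """Pairs sharing more than `threshold` words (the HAVING clause)."""
    counts = shared_word_counts(docs)
    return sorted(pair for pair, c in counts.items() if c > threshold)

rows = [(1, "music"), (1, "great"), (2, "music"),
        (2, "great"), (3, "lunch"), (3, "great")]
pairs = similar_pairs(rows, threshold=1)
print(pairs)
```

The inner loop is where the cost comes from: a single common word that appears in n bags generates on the order of n² pairs, which is why switching from sentences to words makes the join so much more expensive.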


    Overview

    Choose document representation

    Choose similarity metric

    Compute pairwise document similarities

    Generate clusters


    Document representation

    Bag of words model

    Bag for each page p consists of

    Title of p

    Anchor text of all pages pointing to p

    (Also include window of words around anchors)


    Bag Generation

    [Figure: pages http://www.foobar.com/ and http://www.baz.com/ shown alongside http://www.music.com/ (page text: "MusicWorld", "Enter our site"). Text fragments on the other pages: "...click here for a great music page...", "...click here for great sports page...", "...this music is great...", "...what I had for lunch..."]


    Bag Generation

    Union of anchor windows is a concise description of a page.

    Note that using anchor windows, we can cluster more documents than we've crawled:

    In general, a set of N documents refers to cN urls
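The bag construction can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' code: `tokenize`, `build_bags`, the `(target, source_text, anchor)` input shape, the window of two words on each side, and the `sports.example` url are all hypothetical.

```python
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase and split on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_bags(titles, links, window=2):
    """titles: {url: title}. links: (target_url, source_text, anchor_text).
    Each bag holds the page title plus every anchor window pointing at it."""
    bags = defaultdict(Counter)
    for url, title in titles.items():
        bags[url].update(tokenize(title))
    for target, source_text, anchor in links:
        tokens, anchor_tokens = tokenize(source_text), tokenize(anchor)
        n = len(anchor_tokens)
        if n == 0:
            continue
        for i in range(len(tokens) - n + 1):
            if tokens[i:i + n] == anchor_tokens:   # first occurrence only
                lo = max(0, i - window)
                bags[target].update(tokens[lo:i + n + window])
                break
    return bags

titles = {"http://www.music.com/": "MusicWorld"}
links = [
    ("http://www.music.com/", "click here for a great music page", "music page"),
    ("http://www.music.com/", "this music is great", "music"),
    ("http://sports.example/", "click here for great sports page", "sports page"),
]
bags = build_bags(titles, links)
print(bags["http://www.music.com/"])
```

Note that `http://sports.example/` gets a bag even though it has no entry in `titles`: a page that was never crawled can still be described by the anchor windows that point to it.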


    Standard IR

    Remove stopwords (~ 750)

    Remove high frequency & low frequency terms

    Use stemming

    Apply TFIDF scaling
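A rough Python sketch of these preprocessing steps, leaving out stemming (which would need a stemmer library). The function name `tfidf_bags`, the cutoffs `min_df` and `max_df_ratio`, and the particular weighting tf * log(N / df) are assumptions chosen for illustration; the slides do not say which TFIDF variant or frequency cutoffs were used.

```python
import math
from collections import Counter

def tfidf_bags(bags, stopwords, min_df=1, max_df_ratio=1.0):
    """Drop stopwords and too-rare / too-common terms, then weight
    the surviving terms by tf * log(N / df)."""
    n = len(bags)
    df = Counter()                      # document frequency per term
    for bag in bags.values():
        df.update(set(bag))
    keep = {w for w, d in df.items()
            if w not in stopwords and d >= min_df and d / n <= max_df_ratio}
    return {url: {w: tf * math.log(n / df[w])
                  for w, tf in bag.items() if w in keep}
            for url, bag in bags.items()}

bags = {
    "a": Counter(the=3, music=2, great=1),
    "b": Counter(the=1, music=1, opera=1),
    "c": Counter(the=2, sports=1, great=1),
}
weights = tfidf_bags(bags, stopwords={"the"}, min_df=2, max_df_ratio=0.9)
print(weights["a"])
```

In this toy input, "the" is removed as a stopword and "opera" and "sports" are removed as low-frequency terms, so only "music" and "great" survive.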


    Overview

    Choose document representation

    Choose similarity metric

    Compute pairwise document similarities

    Generate clusters


    Similarity

    Similarity metric for pages U1, U2, that were assigned bags B1, B2, respectively

    sim(U1, U2) = |B1 ∩ B2| / |B1 ∪ B2|

    Threshold is set to 20%
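As a small illustration, the similarity can be computed directly in Python. This sketch assumes each bag is treated as a set of distinct words; how repeated words are counted is not stated on the slide.

```python
def sim(bag1, bag2):
    """Size of the intersection divided by size of the union."""
    b1, b2 = set(bag1), set(bag2)
    union = b1 | b2
    if not union:
        return 0.0          # two empty bags: define similarity as 0
    return len(b1 & b2) / len(union)

THRESHOLD = 0.20

a = {"mouse", "dog"}
b = {"cat", "mouse"}
score = sim(a, b)           # one shared word out of three distinct words
print(score, score >= THRESHOLD)
```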


    Reality Check

    www.foodchannel.com:

    www.epicurious.com/a_home/a00_home/home.html .37

    www.gourmetworld.com .36

    www.foodwine.com .325

    www.cuisinenet.com .3125

    www.kitchenlink.com .3125

    www.yumyum.com .3

    www.menusonline.com .3

    www.snap.com/directory/category/0,16,-324,00.html .2875

    www.ichef.com .2875

    www.home-canning.com .275


    Overview

    Choose document representation

    Choose similarity metric

    Compute pairwise document similarities

    Generate clusters


    Pair Generation

    Find all pairs of pages (U1, U2) satisfying sim(U1, U2) ≥ 20%

    Ignore all url pairs with sim < 20%

    How do we avoid the join bottleneck?


    Locality Sensitive Hashing

    Idea: use special kind of hashing

    Locality Sensitive Hashing (LSH) provides a solution:

    Min-wise hash functions [Broder98]

    LSH [Indyk, Motwani98], [Cohen et al. 2000]

    Properties:

    Similar urls are hashed together w.h.p.

    Dissimilar urls are not hashed together


    Locality Sensitive Hashing

    [Figure: urls hashed into buckets: sports.com, golf.com, music.com, opera.com, sing.com]


    Hashing

    Two steps

    Min-hash (MH): a way to consistently sample words from bags

    Locality sensitive hashing (LSH): similar pages get hashed to the same bucket while dissimilar ones do not


    Step 1: Min-hash

    Step 1: Generate m min-hash signatures for each url (m = 80)

    For i = 1...m

    Generate a random order h_i on words

    mh_i(u) = argmin {h_i(w) | w ∈ B_u}

    Pr(mh_i(u) = mh_i(v)) = sim(u, v)
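A minimal Python sketch of Step 1, assuming the random orders are stored as explicit permutations of a small, known vocabulary (practical only for toy inputs; hashing each word is the usual substitute at scale). The names `make_orders`, `minhash_signature`, and `estimate_sim` are hypothetical.

```python
import random

def make_orders(vocabulary, m, seed=0):
    """Generate m random orders h_1..h_m as word -> rank dictionaries."""
    rng = random.Random(seed)
    words = sorted(vocabulary)
    orders = []
    for _ in range(m):
        perm = words[:]
        rng.shuffle(perm)
        orders.append({w: rank for rank, w in enumerate(perm)})
    return orders

def minhash_signature(bag, orders):
    """mh_i(u): the word of the bag that comes first under order h_i.
    Every word of the bag must be in the vocabulary."""
    return [min(bag, key=h.__getitem__) for h in orders]

def estimate_sim(sig_u, sig_v):
    """Fraction of positions where two signatures agree."""
    return sum(a == b for a, b in zip(sig_u, sig_v)) / len(sig_u)

vocab = {"cat", "dog", "mouse", "banana", "horse", "ant"}
orders = make_orders(vocab, m=80)
sig_u = minhash_signature({"mouse", "dog"}, orders)
sig_v = minhash_signature({"cat", "mouse"}, orders)
print(estimate_sim(sig_u, sig_v))   # an estimate of sim(u, v) = 1/3
```

Because each position agrees with probability sim(u, v), the fraction of agreeing positions is an estimate of the similarity, and it tightens as m grows.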


    Step 1: Min-hash

    Round 1:

    ordering = [cat, dog, mouse, banana]

    Set A:{mouse, dog}

    MH-signature = dog

    Set B:{cat, mouse}

    MH-signature = cat


    Step 1: Min-hash

    Round 2:

    ordering = [banana, mouse, cat, dog]

    Set A:{mouse, dog}

    MH-signature = mouse

    Set B:{cat, mouse}

    MH-signature = mouse
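The two rounds can be replayed in a few lines of Python as a check; `mh_signature` is a hypothetical helper name.

```python
def mh_signature(ordering, bag):
    """Return the word of `bag` that appears earliest in `ordering`."""
    rank = {w: i for i, w in enumerate(ordering)}
    return min(bag, key=rank.__getitem__)

round1 = ["cat", "dog", "mouse", "banana"]
round2 = ["banana", "mouse", "cat", "dog"]
set_a = {"mouse", "dog"}
set_b = {"cat", "mouse"}

sigs_a = [mh_signature(o, set_a) for o in (round1, round2)]
sigs_b = [mh_signature(o, set_b) for o in (round1, round2)]
agreement = sum(x == y for x, y in zip(sigs_a, sigs_b)) / 2

print(sigs_a)     # ['dog', 'mouse']
print(sigs_b)     # ['cat', 'mouse']
print(agreement)  # 0.5
```

The signatures agree in one of the two rounds, while the true similarity of the sets is 1/3 (one shared word out of three); with only two rounds the estimate is coarse.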


    Step 2: LSH

    Step 2: Generate l LSH signatures for each url, using k of the min-hash values (l = 125, k = 3)

    For i = 1...l

    Randomly select k min-hash indices and concatenate them to form the i-th LSH signature


    Step 2: LSH

    Generate candidate pair if u and v have an LSH signature in common in any round

    Pr(lsh(u) = lsh(v)) = Pr(mh(u) = mh(v))^k


    Step 2: LSH

    Set A: {mouse, dog, horse, ant}

    MH1 = horse

    MH2 = mouse

    MH3 = ant

    MH4 = dog

    LSH134 = horse-ant-dog

    LSH234 = mouse-ant-dog

    Set B: {cat, ice, shoe, mouse}

    MH1 = cat

    MH2 = mouse

    MH3 = ice

    MH4 = shoe

    LSH134 = cat-ice-shoe

    LSH234 = mouse-ice-shoe
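Step 2 can be sketched in Python and checked against this example. The helper names `choose_index_sets`, `lsh_signatures`, and `is_candidate` are hypothetical, and the sketch assumes the same k indices are used for every url within a round, as the LSH134 / LSH234 labels suggest.

```python
import random

def choose_index_sets(m, k, l, seed=0):
    """Pick l random sets of k min-hash indices, shared by every url."""
    rng = random.Random(seed)
    return [tuple(sorted(rng.sample(range(m), k))) for _ in range(l)]

def lsh_signatures(mh_values, index_sets):
    """Concatenate the selected min-hash values, one signature per round."""
    return ["-".join(mh_values[i] for i in idx) for idx in index_sets]

def is_candidate(lsh_u, lsh_v):
    """u and v are a candidate pair if they collide in any round."""
    return any(a == b for a, b in zip(lsh_u, lsh_v))

# The slide's LSH134 and LSH234, written with zero-based indices.
index_sets = [(0, 2, 3), (1, 2, 3)]
mh_a = ["horse", "mouse", "ant", "dog"]
mh_b = ["cat", "mouse", "ice", "shoe"]
lsh_a = lsh_signatures(mh_a, index_sets)
lsh_b = lsh_signatures(mh_b, index_sets)

print(lsh_a)                       # ['horse-ant-dog', 'mouse-ant-dog']
print(lsh_b)                       # ['cat-ice-shoe', 'mouse-ice-shoe']
print(is_candidate(lsh_a, lsh_b))  # False

# With the slide's parameters: 125 rounds of 3 indices out of 80.
rounds = choose_index_sets(m=80, k=3, l=125)
```

Sets A and B agree on MH2 only, so neither LSH signature matches and the pair is not generated as a candidate in these two rounds.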


    Step 2: LSH

    Bottom line - probability of collision:

    10% similarity → 0.1%

    1% similarity → 0.0001%
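These figures follow from Pr(lsh(u) = lsh(v)) = sim(u, v)^k with k = 3, which a short Python check confirms. The second function is an added approximation, not from the slides: it treats the l = 125 rounds as independent, which is only roughly true because all rounds draw from the same 80 min-hash values.

```python
def round_collision(s, k=3):
    """Probability that one LSH signature matches: s^k."""
    return s ** k

def any_round_collision(s, k=3, l=125):
    """Probability of a match in at least one of l rounds,
    ASSUMING the rounds are independent (an approximation)."""
    return 1 - (1 - s ** k) ** l

for s in (0.10, 0.01):
    print(f"{s:.0%} similarity: {round_collision(s):.6%} per round")

print(f"20% similarity, any of 125 rounds: {any_round_collision(0.20):.2f}")
```

At 10% similarity the per-round probability is 0.1^3 = 0.001, or 0.1%; at 1% it is 0.01^3 = 0.000001, or 0.0001%.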


    Step 2: LSH

    Round 1

    [Figure: Round 1 buckets. LSH signatures: sport-team-win, music-sound-play, sing-music-ear, ...; urls: sports.com, golf.com, party.com, music.com, opera.com, sing.com, ...]


    Step 2: LSH

    Round 2

    [Figure: Round 2 buckets. LSH signatures: game-team-score, audio-music-note, theater-luciano-sing, ...; urls: sports.com, golf.com, music.com, sing.com, opera.com, ...]


    Sort & Filter

    Using all buckets from all LSH rounds, generate candidate pairs

    Sort candidate pairs on first field

    Filter candidate pairs: keep pair (u, v) only if u and v agree on 20% of MH-signatures

    Ready for "What's Related?" queries...
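The sort-and-filter stage can be sketched in Python as below. The function names, the `.example` urls, and the integer min-hash values are hypothetical, and the sketch works in memory, whereas sorting 20 million urls' worth of pairs would be done on disk.

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(lsh_by_url):
    """Bucket urls by (round, LSH signature); emit every pair sharing a bucket."""
    buckets = defaultdict(set)
    for url, sigs in lsh_by_url.items():
        for round_no, sig in enumerate(sigs):
            buckets[(round_no, sig)].add(url)
    pairs = set()
    for urls in buckets.values():
        pairs.update(combinations(sorted(urls), 2))
    return sorted(pairs)                 # sorted on first field

def filter_pairs(pairs, mh_by_url, threshold=0.20):
    """Keep (u, v) only if their min-hash signatures agree on >= threshold."""
    kept = []
    for u, v in pairs:
        sig_u, sig_v = mh_by_url[u], mh_by_url[v]
        agree = sum(a == b for a, b in zip(sig_u, sig_v)) / len(sig_u)
        if agree >= threshold:
            kept.append((u, v))
    return kept

# Toy signatures of length 20: b agrees with a on 3 positions, c on 10.
mh_by_url = {
    "a.example": list(range(20)),
    "b.example": [0, 1, 2] + [100 + i for i in range(17)],
    "c.example": list(range(10)) + [200 + i for i in range(10)],
}
# One LSH round built from min-hash indices 0, 1, 2.
lsh_by_url = {u: ["-".join(map(str, s[:3]))] for u, s in mh_by_url.items()}

candidates = candidate_pairs(lsh_by_url)
kept = filter_pairs(candidates, mh_by_url)
print(candidates)
print(kept)
```

All three urls collide in the single LSH round, but only the pair agreeing on 10 of 20 min-hash values survives the filter; the pairs agreeing on 3 of 20 (15%) are discarded as false positives.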


    Overview

    Choose document representation

    Choose similarity metric

    Compute pairwise document similarities

    Generate clusters


    Clustering

    The set of document pairs represents the document-document similarity matrix with 20% similarity threshold

    Clustering algorithms

    S-Link: connected components

    C-Link: maximal cliques

    Center: approximation to C-Link
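For S-Link, connected components over the pair list can be computed with a union-find structure. The following Python sketch is one standard way to do it, not necessarily how the authors implemented it; `s_link` is a hypothetical name.

```python
def s_link(pairs):
    """Group urls into connected components of the similarity graph."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for u, v in pairs:
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            parent[root_u] = root_v

    clusters = {}
    for x in list(parent):
        clusters.setdefault(find(x), set()).add(x)
    return sorted(sorted(c) for c in clusters.values())

pairs = [("a", "b"), ("b", "c"), ("d", "e")]
print(s_link(pairs))   # [['a', 'b', 'c'], ['d', 'e']]
```

Here a and c end up in the same cluster although they are not a similar pair themselves; connected components chain through b, whereas C-Link requires every pair in a cluster to be similar.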


    Center

    Scan through pairs (they are sorted on first component)

    For each run [(u, v1), ... , (u, vn)]

      if u is not marked

        cluster = u + unmarked neighbors of u

        mark u and all neighbors of u
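The scan above translates almost line for line into Python. This sketch assumes the pair list is symmetric (it contains both (u, v) and (v, u)), so that every url shows up as a first component; the slide does not state this, and `center` is a hypothetical function name.

```python
from itertools import groupby

def center(sorted_pairs):
    """One pass over pairs sorted on first component."""
    marked = set()
    clusters = []
    for u, run in groupby(sorted_pairs, key=lambda pair: pair[0]):
        neighbors = [v for _, v in run]
        if u not in marked:
            # u becomes a center; it claims its still-unmarked neighbors.
            clusters.append([u] + [v for v in neighbors if v not in marked])
            marked.add(u)
            marked.update(neighbors)
    return clusters

# Similar pairs a-b, a-c, b-c, c-d, listed in both directions.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]
pairs = sorted(edges + [(v, u) for u, v in edges])
print(center(pairs))   # [['a', 'b', 'c'], ['d']]
```

On the same graph, connected components would put all four urls in one cluster; Center keeps d separate because its only neighbor, c, was already claimed by center a.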




    Results

    Algorithm Step     Running Time (hours)

    Bag generation     23
    Bag sorting        4.7
    Min-hash           26
    LSH                16
    Filtering          83
    Sorting            107
    CENTER             18

    20 Million urls on Pentium-II 450


    Sample Cluster

    feynman.princeton.edu/~sondhi/205main.html

    hep.physics.wisc.edu/wsmith/p202/p202syl.html

    hepweb.rl.ac.uk/ppUK/PhysFAQ/relativity.html

    pdg.lbl.gov/mc_particle_id_contents.html

    physics.ucsc.edu/courses/10.html

    town.hall.org/places/SciTech/qmachine

    www.as.ua.edu/physics/hetheory.html

    www.landfield.com/faqs/by-newsgroup/sci/sci.physics.relativity.html

    www.pa.msu.edu/courses/1999spring/PHY492/desc_PHY492.html

    www.phy.duke.edu/Courses/271/Synopsis.html

    . . . (total of 27 urls) . . .


    Ongoing/Future Work

    Tune anchor-window length

    Develop system to measure quality

    What is ground truth?

    How do you judge clustering of millions of pages?