091208KenKrugler

  • 7/27/2019 091208KenKrugler

    Web Mining in the Cloud

    Hadoop/Cascading/Bixo in EC2

    Ken Krugler, Bixo Labs, Inc.

    ACM Data Mining SIG

    08 December 2009

    About me

    Background in vertical web crawl

    Krugle search engine for open source code

    Bixo open source web mining toolkit

    Consultant for companies using EC2

    Web mining

    Data processing

    Founder of Bixo Labs

    Elastic web mining platform

    http://bixolabs.com

    Typical Data Mining

    Data Mining Victory!

    Meanwhile, Over at McAfee...

    Web Mining 101

    Extracting & Analyzing Web Data

    More Than Just Search

    Business intelligence, competitive intelligence, events, people, companies, popularity, pricing, social graphs, Twitter feeds, Facebook friends, support forums, shopping carts

    Steps in Web Mining

    Collect - fetch content from web

    Parse - extract data from formats

    Analyze - tokenize, rate, classify, cluster

    Produce - useful data

    Web Mining versus Data Mining

    Scale - 10 million isn't a big number

    Access - public but restricted

    Special implicit rules apply

    Structure - not much

    How to Mine Large Scale Web Data?

    Start with scalable map-reduce platform

    Add a workflow API layer

    Mix in a web crawling toolkit

    Write your custom data processing code

    Run in an elastic cloud environment

    One Solution - the HECB Stack

    Bixo

    Cascading

    Hadoop

    EC2

    EC2 - Amazon Elastic Compute Cloud

    True cost of non-cloud environment

    Cost of servers & networking (2 year life)

    Cost of colo (6 servers/rack)

    Cost of OPS salary (15% of FTE/cluster)

    Managing servers is no fun

    Web mining is perfect for the cloud

    Bursty => savings are even greater

    Data is distilled, so no transfer $$$ pain

    Why Hadoop?

    Perfect for processing lots of data

    Map-reduce

    Distributed file system

    Open source, large community, etc.

    Runs well in EC2 clusters

    Elastic Map Reduce as option

    Why Cascading?

    API on top of Hadoop

    Supports efficient, reliable workflows

    Reduces painful low-level MR details

    Build workflow using pipe model

    Why Bixo?

    Plugs into Cascading-based workflow

    Scales with Hadoop cluster

    Runs well in EC2

    Handles grungy web crawling details

    Polite yet efficient fetching

    Errors, web servers that lie

    Parsing lots of formats, broken HTML

    Open source toolkit for web mining apps

    SEO Keyword Data Mining

    Example of typical web mining task

    Find common keywords (1, 2, 3 word terms)

    Do domain-centric web crawl

    Parse pages to extract title, meta, h1, links

    Output keywords sorted by frequency

    Compare to competitor site(s)
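
The keyword step above (one, two, and three word terms, sorted by frequency) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the talk's workflow ran on Cascading and Bixo, and the tokenizer, function names, and sample snippets here are hypothetical.

```python
import re
from collections import Counter

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a space-joined phrase."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def keyword_counts(texts, max_words=3):
    """Count 1- to max_words-word terms across a list of text snippets."""
    counts = Counter()
    for text in texts:
        # Naive tokenizer (an assumption): lowercase letters, digits, apostrophes.
        tokens = re.findall(r"[a-z0-9']+", text.lower())
        for n in range(1, max_words + 1):
            counts.update(ngrams(tokens, n))
    return counts

# Hypothetical snippets standing in for parsed title/meta/h1 text.
snippets = ["Web mining toolkit", "Open source web mining"]
top_terms = keyword_counts(snippets).most_common(3)
```

Terms are counted per snippet, so a phrase never spans two different titles or headings. Running the same function over a competitor's pages gives a second frequency table to compare against.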

    Workflow

    Custom Code for Example

    Filtering URLs inside domain

    Non-English content

    User-generated content (forums, etc)

    Generating keywords from text

    Special tokenization

    One, two, three word phrases

    But 95% of code was generic
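
As a rough idea of what the custom URL filtering might involve, here is a minimal Python sketch. The function names and the `/forum` path prefix are assumptions made for illustration; the slides do not show how the actual filter decided what counted as user-generated content.

```python
from urllib.parse import urlparse

def in_domain(url, domain):
    """True if the URL's host is the domain itself or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    domain = domain.lower()
    return host == domain or host.endswith("." + domain)

def filter_urls(urls, domain, skip_path_prefixes=("/forum",)):
    """Keep in-domain URLs whose path does not start with a skipped prefix.

    skip_path_prefixes is a hypothetical stand-in for however the real
    crawl recognised forums and other user-generated sections.
    """
    kept = []
    for url in urls:
        if not in_domain(url, domain):
            continue
        path = urlparse(url).path
        if any(path.startswith(prefix) for prefix in skip_path_prefixes):
            continue
        kept.append(url)
    return kept
```

Matching on `"." + domain` rather than a bare suffix avoids accepting look-alike hosts such as `notexample.com` when the target is `example.com`.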

    End Result in Data Mining Tool

    What Next?

    Another example - mining mailing lists

    Go straight to Summary/Q&A

    Talk about Web Scale Mining

    Write tweets, posts & emails

    No minute off-line goes unpunished

    Another Example - HUGMEE

    Hadoop Users who Generate the Most Effective Emails

    Helpful Hadoopers

    Use mailing list archives for data (collect)

    Parse mbox files and emails (parse)

    Score based on key phrases (analyze)

    End result is score/name pair (produce)

    Scoring Algorithm

    Very sophisticated point system

    thanks == 5

    owe you a beer == 50

    worship the ground you walk on == 100
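
The point system above maps directly onto a lookup table. A minimal Python sketch follows; the point values come from the slide, while the case-insensitive substring matching is an assumption about how phrases were detected.

```python
# Point values as listed on the slide; the matching logic is an assumption.
PHRASE_POINTS = {
    "thanks": 5,
    "owe you a beer": 50,
    "worship the ground you walk on": 100,
}

def score_text(text):
    """Sum the points for every occurrence of each key phrase in the text."""
    lowered = text.lower()
    return sum(points * lowered.count(phrase)
               for phrase, points in PHRASE_POINTS.items())
```

For example, `score_text("Thanks, I owe you a beer!")` adds 5 and 50 for a total of 55.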

    High Level Steps

    Collect emails

    Fetch mod_mbox generated page

    Parse it to extract links to mbox files

    Fetch mbox files

    Split into separate emails

    Parse emails

    Extract key headers (messageId, email, etc)

    Parse body to identify quoted text
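
The "parse emails" step can be illustrated with Python's standard `email` package. This is a sketch, not the talk's code: it assumes a single-part plain-text message and treats any line starting with `>` as quoted text, which is a common convention rather than a rule. The sample message and the dictionary keys are hypothetical.

```python
from email import message_from_string
from email.utils import parseaddr

def parse_email(raw):
    """Return key headers plus the body split into new and quoted text.

    Assumes a single-part plain-text message (get_payload() returns a str).
    """
    msg = message_from_string(raw)
    lines = msg.get_payload().splitlines()
    quoted = [ln for ln in lines if ln.lstrip().startswith(">")]
    fresh = [ln for ln in lines if not ln.lstrip().startswith(">")]
    return {
        "message_id": msg["Message-ID"],
        "in_reply_to": msg["In-Reply-To"],
        "from": parseaddr(msg.get("From", ""))[1],
        "new_text": "\n".join(fresh).strip(),
        "quoted_text": "\n".join(quoted),
    }

# Hypothetical reply used only to exercise the function.
RAW = (
    "Message-ID: <2@example.org>\n"
    "In-Reply-To: <1@example.org>\n"
    "From: Alice <alice@example.org>\n"
    "\n"
    "> How do I set the block size?\n"
    "Try the config file. Thanks for asking!\n"
)
parsed = parse_email(RAW)
```

Separating quoted from new text matters for scoring: a "thanks" inside a quoted block belongs to an earlier message and should not be counted again.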

    High Level Steps

    Analyze emails

    Find key phrases in replies (ignore signoff)

    Score emails by phrases

    Group & sum by message ID

    Group & sum by email address

    Produce ranked list

    Toss email addresses with no love

    Sort by summed score
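
The two group-and-sum passes and the final ranking can be sketched in plain Python. One assumption is made loudly here: each reply's score is credited to the message it replies to, and then to that message's author. The slides do not spell this out, and the function and parameter names are hypothetical; the talk itself built this as a Cascading flow.

```python
from collections import defaultdict

def rank_helpers(scored_replies, author_of):
    """Rank authors by the praise their messages received.

    scored_replies: iterable of (replied_to_message_id, score) pairs.
    author_of: dict mapping a message id to its author's email address.
    """
    # Group & sum by message ID.
    by_message = defaultdict(int)
    for message_id, score in scored_replies:
        by_message[message_id] += score

    # Group & sum by email address.
    by_address = defaultdict(int)
    for message_id, total in by_message.items():
        address = author_of.get(message_id)
        if address is not None:
            by_address[address] += total

    # Toss addresses with no love, then sort by summed score.
    ranked = [(addr, total) for addr, total in by_address.items() if total > 0]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)
```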

    Workflow

    Building the Flow

    mod_mbox Page

    Custom Operation

    Validate

    This Hug

    Produce

    Web Scale Mining

    Bigger Data

    100M pages versus 1M pages

    Bigger Breadth

    100K domains versus 1K domains

    Bigger Clusters

    50 servers versus 5 servers

    Web Scale == Endless Heuristics

    Document features detection

    Charset

    Mime-type

    Language

    Many noisy sources of truth

    Duplicates detection

    Quest for the perfect hash function

    Spam/porn/link farm detection
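
To make the duplicates point concrete, here is a minimal Python sketch of the simplest approach: hash a normalized copy of each page's text and keep the first page per hash. This is an assumption-laden baseline, not the method used in the talk. It only catches pages that are identical after whitespace and case normalization; near-duplicates (changed dates, ads, session IDs) slip through, which is why the search for a better hash never really ends.

```python
import hashlib
import re

def content_fingerprint(text):
    """Hash the text after collapsing whitespace and lowercasing."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()

def drop_exact_duplicates(pages):
    """pages: iterable of (url, text). Return URLs of first-seen content."""
    seen = set()
    unique = []
    for url, text in pages:
        fingerprint = content_fingerprint(text)
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(url)
    return unique
```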

    Web Scale == Challenges

    All web servers lie

    Edge cases ad nauseam

    Avoiding spam/porn/junk

    Focusing on English content

    Scaling to 100K domains/100M pages

    Avoid bottlenecks

    Fix large cluster issues

    Public Terabyte Dataset

    Sponsored by Concurrent/Bixolabs

    High quality crawl of top domains

    HECB Stack using Elastic Map Reduce

    Hosted by Amazon in S3, free to EC2 users

    Crawl & processing code available

    Questions, input? http://bixolabs.com/PTD/

    Web Scale Case Study - PTD Crawl

    Robots.txt - Robot Exclusion Protocol

    Not a real standard, lots of extensions

    Many ways to mess it up (HTML, typos, etc)

    Great performance when all is well

    25K pages/minute fetching

    50K pages/minute parsing

    Hadoop 0.18.3 vs. 0.19.2

    Different APIs, behavior, bugs

    At painful cluster tuning stage
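
For the robots.txt point on this slide, the basic check a polite crawler makes before fetching can be shown with Python's standard `urllib.robotparser`. This is a sketch only: the crawl described here used Bixo, and the agent name, rules, and URLs below are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

def allowed(robots_text, agent, url):
    """Return True if the given agent may fetch the URL under these rules."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser.can_fetch(agent, url)
```

Passing the text in directly, rather than letting the parser fetch it, leaves room for a crawler to first handle the failure modes the slide mentions, such as a server returning an HTML page where robots.txt should be.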

    Large Scale Web Mining Summary

    10K is easy, 100M is hard

    You encounter endless edge cases

    There's always another bottleneck

    Cluster tuning is challenging

    Web mining toolkit approach works

    Easier to customize/optimize

    Easier to solve problems

    Summary

    HECB stack works well for web mining

    Cheaper than typical colo option

    Scales to hundreds of millions of pages

    Reliable and efficient workflow

    Web mining has high & increasing value

    Search engine optimization, advertising

    Social networks, reputation

    Competitive pricing

    Etc, etc, etc.

    Any Questions?

    My email:

    [email protected]

    Bixo mailing list:

    http://tech.groups.yahoo.com/group/bixo-dev/
