7/27/2019 091208KenKrugler
Web Mining in the Cloud
Hadoop/Cascading/Bixo in EC2
Ken Krugler, Bixo Labs, Inc.
ACM Data Mining SIG
08 December 2009
About me
Background in vertical web crawl
Krugle search engine for open source code
Bixo open source web mining toolkit
Consultant for companies using EC2
Web mining
Data processing
Founder of Bixo Labs
Elastic web mining platform
http://bixolabs.com
Typical Data Mining
Data Mining Victory!
Meanwhile, Over at McAfee...
Web Mining 101
Extracting & Analyzing Web Data
More Than Just Search
Business intelligence, competitive intelligence, events, people, companies, popularity, pricing, social graphs, Twitter feeds, Facebook friends, support forums, shopping carts
Steps in Web Mining
Collect - fetch content from web
Parse - extract data from formats
Analyze - tokenize, rate, classify, cluster
Produce - useful data
Web Mining versus Data Mining
Scale - 10 million isn't a big number
Access - public but restricted
Special implicit rules apply
Structure - not much
How to Mine Large Scale Web Data?
Start with scalable map-reduce platform
Add a workflow API layer
Mix in a web crawling toolkit
Write your custom data processing code
Run in an elastic cloud environment
One Solution - the HECB Stack
Bixo
Cascading
Hadoop
EC2
EC2 - Amazon Elastic Compute Cloud
True cost of non-cloud environment
Cost of servers & networking (2 year life)
Cost of colo (6 servers/rack)
Cost of OPS salary (15% of FTE/cluster)
Managing servers is no fun
Web mining is perfect for the cloud
Bursty => savings are even greater
Data is distilled, so no transfer $$$ pain
Why Hadoop?
Perfect for processing lots of data
Map-reduce
Distributed file system
Open source, large community, etc.
Runs well in EC2 clusters
Elastic Map Reduce as option
Why Cascading?
API on top of Hadoop
Supports efficient, reliable workflows
Reduces painful low-level MR details
Build workflow using pipe model
Why Bixo?
Plugs into Cascading-based workflow
Scales with Hadoop cluster
Runs well in EC2
Handles grungy web crawling details
Polite yet efficient fetching
Errors, web servers that lie
Parsing lots of formats, broken HTML
Open source toolkit for web mining apps
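"Polite yet efficient fetching" generally means spacing out requests to any one host while interleaving requests to different hosts. The following is a minimal Python sketch of that idea, not Bixo's actual scheduler; the function name `schedule_fetches` and the default delay are assumptions made for illustration.

```python
from collections import defaultdict
from urllib.parse import urlparse


def schedule_fetches(urls, crawl_delay=30.0):
    """Assign each URL an earliest fetch time, in seconds from the start.

    URLs on the same host are spaced crawl_delay seconds apart, while
    URLs on different hosts may share the same time slot.
    """
    next_slot = defaultdict(float)  # host -> next allowed fetch time
    schedule = []
    for url in urls:
        host = urlparse(url).netloc.lower()
        schedule.append((next_slot[host], url))
        next_slot[host] += crawl_delay
    # Sorting by time interleaves hosts, which keeps fetchers busy.
    return sorted(schedule)
```

With a 10 second delay, two URLs on one host land at 0 and 10 seconds, while a URL on a second host can also be fetched at 0 seconds.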
SEO Keyword Data Mining
Example of typical web mining task
Find common keywords (1,2,3 word terms)
Do domain-centric web crawl
Parse pages to extract title, meta, h1, links
Output keywords sorted by frequency
Compare to competitor site(s)
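The keyword step above (one, two and three word terms, sorted by frequency) can be sketched in a few lines of Python. This is an illustration only: the simple alphanumeric tokenizer and the names `keyword_counts` and `top_keywords` are assumptions, not the code used in the talk.

```python
import re
from collections import Counter


def keyword_counts(text, max_words=3):
    """Count every term of 1 to max_words consecutive words in the text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter()
    for n in range(1, max_words + 1):
        for i in range(len(words) - n + 1):
            counts[" ".join(words[i:i + n])] += 1
    return counts


def top_keywords(text, limit=10):
    """Return (term, count) pairs, most frequent first, ties by term."""
    counts = keyword_counts(text)
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:limit]
```

Running the same function over text from a competitor's pages gives two ranked lists that can be compared term by term.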
Workflow
Custom Code for Example
Filtering URLs inside domain
Non-English content
User-generated content (forums, etc)
Generating keywords from text
Special tokenization
One, two, three word phrases
But 95% of code was generic
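The URL filtering mentioned above could look roughly like the following Python sketch. It keeps a URL only if it belongs to the target domain and does not fall under an excluded path. The function name `keep_url` and the path prefixes (a forum path and two language paths) are hypothetical examples; a real site would need its own list.

```python
from urllib.parse import urlparse


def keep_url(url, domain, excluded_prefixes=("/forum", "/fr/", "/de/")):
    """Return True if url is inside domain and not under an excluded path.

    excluded_prefixes is an illustrative list standing in for
    user-generated and non-English sections of a site.
    """
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    domain = domain.lower()
    if not (host == domain or host.endswith("." + domain)):
        return False
    path = parsed.path.lower()
    return not any(path.startswith(prefix) for prefix in excluded_prefixes)
```

Matching on the host suffix with a leading dot accepts subdomains such as `www.example.com` while rejecting look-alike hosts such as `notexample.com`.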
End Result in Data Mining Tool
What Next?
Another example - mining mailing lists
Go straight to Summary/Q&A
Talk about Web Scale Mining
Write tweets, posts & emails
No minute off-line goes unpunished
Another Example - HUGMEE
Hadoop Users who Generate the Most Effective Emails
Helpful Hadoopers
Use mailing list archives for data (collect)
Parse mbox files and emails (parse)
Score based on key phrases (analyze)
End result is score/name pair (produce)
Scoring Algorithm
Very sophisticated point system
thanks == 5
owe you a beer == 50
worship the ground you walk on == 100
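The point system above maps directly onto a lookup table. A minimal Python sketch, assuming case-insensitive substring matching (the slides do not say how phrases were matched); `PHRASE_SCORES` and `score_text` are illustrative names.

```python
# Points per key phrase, taken from the slide.
PHRASE_SCORES = {
    "thanks": 5,
    "owe you a beer": 50,
    "worship the ground you walk on": 100,
}


def score_text(text):
    """Sum the points for every occurrence of each key phrase."""
    lowered = text.lower()
    return sum(
        lowered.count(phrase) * points
        for phrase, points in PHRASE_SCORES.items()
    )
```

For example, a reply containing "Thanks a lot, I owe you a beer!" scores 5 + 50 = 55 points.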
High Level Steps
Collect emails
Fetch mod_mbox generated page
Parse it to extract links to mbox files
Fetch mbox files
Split into separate emails
Parse emails
Extract key headers (messageId, email, etc)
Parse body to identify quoted text
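The "parse emails" step can be sketched with Python's standard `email` package. This is not the Bixo/Cascading code from the talk. It assumes single-part plain-text messages and treats any line starting with ">" as quoted text. The function name `parse_email` and the inclusion of the `In-Reply-To` header are assumptions.

```python
from email import message_from_string
from email.utils import parseaddr


def parse_email(raw):
    """Extract key headers and split the body into new and quoted lines."""
    msg = message_from_string(raw)
    body = msg.get_payload()
    if not isinstance(body, str):  # multipart message: skipped in this sketch
        body = ""
    new_lines, quoted_lines = [], []
    for line in body.splitlines():
        # Assumption: quoted text is marked with a leading ">".
        target = quoted_lines if line.lstrip().startswith(">") else new_lines
        target.append(line)
    return {
        "message_id": msg.get("Message-ID"),
        "in_reply_to": msg.get("In-Reply-To"),
        "email": parseaddr(msg.get("From", ""))[1].lower(),
        "new_text": "\n".join(new_lines),
        "quoted_text": "\n".join(quoted_lines),
    }
```

Splitting an mbox file into separate messages is a distinct step; Python's standard `mailbox.mbox` class is one way to do it.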
High Level Steps
Analyze emails
Find key phrases in replies (ignore signoff)
Score emails by phrases
Group & sum by message ID
Group & sum by email address
Produce ranked list
Toss email addresses with no love
Sort by summed score
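The final grouping and ranking can be sketched in plain Python. The sketch covers only the last stage and assumes each score has already been attributed to the email address being thanked; `rank_helpers` is an illustrative name. In the talk this step ran as part of a Cascading workflow rather than in-memory code like this.

```python
from collections import defaultdict


def rank_helpers(scored):
    """Rank addresses by total score.

    scored is an iterable of (email_address, score) pairs. Returns
    (email_address, total) pairs, highest total first, leaving out
    addresses whose total is zero.
    """
    totals = defaultdict(int)
    for address, score in scored:
        totals[address] += score
    ranked = [(a, t) for a, t in totals.items() if t > 0]
    return sorted(ranked, key=lambda pair: (-pair[1], pair[0]))
```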
Workflow
Building the Flow
mod_mbox Page
Custom Operation
Validate
This Hug
Produce
Web Scale Mining
Bigger Data
100M pages versus 1M pages
Bigger Breadth
100K domains versus 1K domains
Bigger Clusters
50 servers versus 5 servers
Web Scale == Endless Heuristics
Document features detection
Charset
Mime-type
Language
Many noisy sources of truth
Duplicates detection
Quest for the perfect hash function
Spam/porn/link farm detection
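For the duplicates detection item above, the simplest baseline is an exact-match fingerprint over normalized text. The Python sketch below is that baseline, not the approach used in the talk: it only catches pages whose text is identical after lowercasing and whitespace collapsing, which is part of why the search for a better hash function continues. The names `content_fingerprint` and `drop_duplicates` are illustrative.

```python
import hashlib
import re


def content_fingerprint(text):
    """SHA-1 of the text after lowercasing and collapsing whitespace."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def drop_duplicates(pages):
    """Keep the URL of the first page seen for each fingerprint.

    pages is an iterable of (url, text) pairs.
    """
    seen = set()
    unique = []
    for url, text in pages:
        fingerprint = content_fingerprint(text)
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(url)
    return unique
```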
Web Scale == Challenges
All web servers lie
Edge cases ad nauseam
Avoiding spam/porn/junk
Focusing on English content
Scaling to 100K domains/100M pages
Avoid bottlenecks
Fix large cluster issues
Public Terabyte Dataset
Sponsored by Concurrent/Bixolabs
High quality crawl of top domains
HECB Stack using Elastic Map Reduce
Hosted by Amazon in S3, free to EC2 users
Crawl & processing code available
Questions, input? http://bixolabs.com/PTD/
Web Scale Case Study - PTD Crawl
Robots.txt - Robot Exclusion Protocol
Not a real standard, lots of extensions
Many ways to mess it up (HTML, typos, etc)
Great performance when all is well
25K pages/minute fetching
50K pages/minute parsing
Hadoop 0.18.3 vs. 0.19.2
Different APIs, behavior, bugs
At painful cluster tuning stage
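As an illustration of the robots.txt handling mentioned on this slide, here is a small Python sketch using the standard library's `urllib.robotparser`. It is not the parser used for the PTD crawl. The sample rules, the user agent name `mybot` and the helper `load_rules` are made up for the example. `Crawl-delay` is one of the non-standard extensions the slide alludes to.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 10
"""


def load_rules(robots_txt):
    """Parse robots.txt text into a rule set that can be queried."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser
```

A crawler would then call `rules.can_fetch("mybot", url)` before each request and respect `rules.crawl_delay("mybot")` between requests to the same host. Malformed files (HTML served in place of robots.txt, typos in directives) need extra handling that this sketch omits.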
Large Scale Web Mining Summary
10K is easy, 100M is hard
You encounter endless edge cases
There's always another bottleneck
Cluster tuning is challenging
Web mining toolkit approach works
Easier to customize/optimize
Easier to solve problems
Summary
HECB stack works well for web mining
Cheaper than typical colo option
Scales to hundreds of millions of pages
Reliable and efficient workflow
Web mining has high & increasing value
Search engine optimization, advertising
Social networks, reputation
Competitive pricing
Etc, etc, etc.
Any Questions?
My email:
Bixo mailing list:
http://tech.groups.yahoo.com/group/bixo-dev/