Upload
others
View
0
Download
0
Embed Size (px)
Citation preview
Demonstrating Intelligent Crawlingand Archiving of Web ApplicationsMuhammad Faheem
Institut Mines–TélécomTélécom ParisTech; CNRS LTCI
Paris, [email protected]
Pierre SenellartTélécom ParisTech
& The University of Hong KongHong Kong
Traditional crawling: independent of the nature of the sites and theircontent management system
QueueMan-
agement
PageFetching
Link Ex-traction
URLSelection
=⇒ Many HTTP requests, no guarantee of content quality
Traditional crawler
ArchivistInterface
Web application detection module
Indexingmodule
Crawling module
Web applications to crawl
Web application adaptation
Module
Content extraction and annotation
module
Crawled Web pages with annotated Contents
3 2 1
56
8
9
12
4
RDF store
Stored WARC files
7
11
10
XML knowledge
base
Architecture
WordPress vBulletin phpBB0
200
400
600
800
1,000
1,200
Num
bero
fHTT
Pre
ques
ts(×1,000)
AAHwget
Crawl efficiency
•Different crawling techniques for different Web sites•Detect the type of Web application, kind of Web pages inside thisWeb application, and decide crawling actions accordingly•Directly targets useful content-rich areas, avoids archive redundancy,and enriches the archive with semantic description of the content• Implemented in 2 Web crawlers: Internet Memory Foundation crawlerand Heritrix
QueueManage-
ment
ResourceFetching
ApplicationAwareHelper
ResourceSelection
Goal: Smart archiving of the Social Web:1.Performing intelligent Crawling2.Archiving Web objects
Application-aware helper
•Knowledge base of known Web application types, algorithms for flex-ible and adaptive matching of Web applications to these typesDeclarative, XML-based formatIntegrated with YFilter for efficient indexing of KB.
•Type detected using URL patterns, HTTP metadata, textualcontent, XPath patterns, etc. E.g., vBulletin Web forum:contains(//script/@src,’vbulletin_global.js’)•Different crawling actions for different kinds of Web pages under aspecific Web application•Crawling action: not just a list of URLs; can be any action that usesREST API, complicated interaction with AJAX-based application, andextracts semantic Web objects
Methodology
WordPress vBulletin phpBB
97
98
99
Prop
ortio
nof
seenn-g
ram
s(%
)
Crawl effectiveness