My presentation on "Web Crawling" and "Web Scraping", given at the GDG İstanbul February event (23.02.2013).
Web Crawling & Web Scraping
cuneytykaya
cuneyt.yesilkaya
Cüneyt Yeşilkaya
Agenda
● Web Crawling
● Web Scraping
● Web Crawling Tools
● Demo (Crawler4j & Jsoup)
● Crawling - Where to Use
Web Crawling
Browsing the World Wide Web in a methodical, automated manner or in an orderly fashion.
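The core of that methodical, automated browsing is just a frontier queue plus a visited set. As a minimal sketch (a toy in-memory "site" stands in for real HTTP fetches; libraries like Crawler4j, shown later, handle the real thing):

```java
import java.util.*;

public class CrawlSketch {
    // Toy "web": page URL -> outgoing links (stand-in for real HTTP fetches)
    static Map<String, List<String>> site = Map.of(
        "/", List.of("/about", "/blog"),
        "/about", List.of("/"),
        "/blog", List.of("/blog/post-1", "/"),
        "/blog/post-1", List.of("/blog"));

    // Breadth-first crawl: frontier queue plus visited set
    static List<String> crawl(String seed) {
        List<String> order = new ArrayList<>();
        Set<String> visited = new HashSet<>();
        Deque<String> frontier = new ArrayDeque<>();
        frontier.add(seed);
        visited.add(seed);
        while (!frontier.isEmpty()) {
            String url = frontier.poll();
            order.add(url);                      // "visit" the page
            for (String link : site.getOrDefault(url, List.of())) {
                if (visited.add(link)) {         // redundant-link control
                    frontier.add(link);
                }
            }
        }
        return order;
    }

    public static void main(String[] args) {
        System.out.println(crawl("/")); // [/, /about, /blog, /blog/post-1]
    }
}
```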
Web Scraping
Computer software technique of extracting information from websites.
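At its simplest, extraction means pattern-matching over fetched HTML. A deliberately naive stdlib sketch with a regular expression — real scrapers should use an HTML parser such as Jsoup (demoed later), since regexes break on nested or malformed markup:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ScrapeSketch {
    // Naive href extraction for illustration only;
    // prefer a real HTML parser (e.g. Jsoup) in practice.
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = Pattern.compile("href=\"([^\"]+)\"").matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<a href=\"/about\">About</a> <a href=\"/blog\">Blog</a>";
        System.out.println(extractLinks(html)); // [/about, /blog]
    }
}
```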
Web Crawling Tools
Selecting a Crawler?
● Multi-Threaded Structure
● Max Page to Fetch
● Max Page Size
● Max Depth to Crawl
● Redundant Link Control
● Politeness Time
● Resumable
● Well-Documented
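In Crawler4j (covered next), most of these criteria map directly onto CrawlConfig setters. A config sketch — the storage folder path and the specific values here are illustrative assumptions:

```java
CrawlConfig config = new CrawlConfig();
config.setCrawlStorageFolder("/tmp/crawl");   // folder for intermediate crawl data (assumed path)
config.setPolitenessDelay(250);               // politeness time between requests (ms)
config.setMaxPagesToFetch(100);               // max pages to fetch
config.setMaxDepthOfCrawling(3);              // max depth to crawl
config.setMaxDownloadSize(1048576);           // max page size (bytes)
config.setResumableCrawling(true);            // resume after a crash or restart
```

The multi-threaded structure is configured separately, via the number of crawler threads passed to the controller (see the demo).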
Crawler4j
Yasser Ganjisaffar
Microsoft Bing & Microsoft Live Search
Demo - Crawler4j (1/3)
myCrawler.java myController.java
Demo - Crawler4j (2/3)
myCrawler.java
import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.url.WebURL;

public class myCrawler extends WebCrawler {

    @Override
    public boolean shouldVisit(WebURL url) {
        // Only follow links that stay on the seed site
        return url.getURL().startsWith("http://www.gdgistanbul.com");
    }

    @Override
    public void visit(Page page) {
        // Called for each successfully fetched page
        String url = page.getWebURL().getURL();
    }
}
Demo - Crawler4j (3/3)
myController.java
int numberOfCrawlers = 4;

CrawlConfig config = new CrawlConfig();
config.setPolitenessDelay(250);
config.setMaxPagesToFetch(100);

PageFetcher pageFetcher = new PageFetcher(config);
RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);

CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);
controller.addSeed("http://www.gdgistanbul.com");
controller.start(myCrawler.class, numberOfCrawlers);
Demo - Jsoup (1/2)
Jsoup: a convenient way to do HTML parsing in Java
● scrape and parse HTML from a URL, file, or string
● find and extract data, using DOM traversal or CSS selectors
● manipulate the HTML elements, attributes, and text
Demo - Jsoup (2/2)

// Fetch a page and select elements with a CSS selector
Document doc = Jsoup.connect("http://en.wikipedia.org/").get();
Elements newsHeadlines = doc.select("#mp-itn b a");

// Parse HTML from a string
String html = "<html><head><title>First parse</title></head>"
    + "<body><p>Parsed HTML into a doc.</p></body></html>";
Document doc = Jsoup.parse(html);

// DOM traversal
Element content = doc.getElementById("content");
Elements links = content.getElementsByTag("a");
for (Element link : links) {
    String linkHref = link.attr("href");
    String linkText = link.text();
}

// CSS selectors
Elements links = doc.select("a[href]");
Elements media = doc.select("[src]");
Where to Use
● Search Engines (GoogleBot)● Aggregators
○ Data aggregator
○ News aggregator
○ Review aggregator
○ Search aggregator
○ Social network aggregation
○ Video aggregator
● Kaarun Product Collector
www.kaarun.com
○ All Friends
○ Products for each Facebook Like
cyesilkaya.wordpress.com & @cuneytykaya & tr.linkedin/cuneyt.yesilkaya
Thank you...