My presentation on "Web Crawling" and "Web Scraping", given at the GDG İstanbul February event (23.02.2013).
Web Crawling & Web Scraping
cuneytykaya
cuneyt.yesilkaya
Cüneyt Yeşilkaya
Agenda
● Web Crawling
● Web Scraping
● Web Crawling Tools
● Demo (Crawler4j & Jsoup)
● Crawling - Where to Use
Web Crawling
Browsing the World Wide Web in a methodical, automated manner or in an orderly fashion.
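The core of that methodical, automated browsing is just a frontier queue plus a visited set. As a minimal sketch (a toy in-memory "site" stands in for real HTTP fetches; libraries like Crawler4j, shown later, handle the real thing):

```java
import java.util.*;

public class CrawlSketch {
    // Toy "web": page URL -> outgoing links (stand-in for real HTTP fetches)
    static Map<String, List<String>> site = Map.of(
        "/", List.of("/about", "/blog"),
        "/about", List.of("/"),
        "/blog", List.of("/blog/post-1", "/"),
        "/blog/post-1", List.of("/blog"));

    // Breadth-first crawl: frontier queue plus visited set
    static List<String> crawl(String seed) {
        List<String> order = new ArrayList<>();
        Set<String> visited = new HashSet<>();
        Deque<String> frontier = new ArrayDeque<>();
        frontier.add(seed);
        visited.add(seed);
        while (!frontier.isEmpty()) {
            String url = frontier.poll();
            order.add(url);                      // "visit" the page
            for (String link : site.getOrDefault(url, List.of())) {
                if (visited.add(link)) {         // redundant-link control
                    frontier.add(link);
                }
            }
        }
        return order;
    }

    public static void main(String[] args) {
        System.out.println(crawl("/")); // [/, /about, /blog, /blog/post-1]
    }
}
```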
Web Scraping
Computer software technique of extracting information from websites.
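At its simplest, extraction means pattern-matching over fetched HTML. A deliberately naive stdlib sketch with a regular expression — real scrapers should use an HTML parser such as Jsoup (demoed later), since regexes break on nested or malformed markup:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ScrapeSketch {
    // Naive href extraction for illustration only;
    // prefer a real HTML parser (e.g. Jsoup) in practice.
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = Pattern.compile("href=\"([^\"]+)\"").matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<a href=\"/about\">About</a> <a href=\"/blog\">Blog</a>";
        System.out.println(extractLinks(html)); // [/about, /blog]
    }
}
```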
Web Crawling Tools
Selecting a Crawler?
● Multi-Threaded Structure
● Max Page to Fetch
● Max Page Size
● Max Depth to Crawl
● Redundant Link Control
● Politeness Time
● Resumable
● Well-Documented
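In Crawler4j (covered next), most of these criteria map directly onto CrawlConfig setters. A config sketch — the storage folder path and the specific values here are illustrative assumptions:

```java
CrawlConfig config = new CrawlConfig();
config.setCrawlStorageFolder("/tmp/crawl");   // folder for intermediate crawl data (assumed path)
config.setPolitenessDelay(250);               // politeness time between requests (ms)
config.setMaxPagesToFetch(100);               // max pages to fetch
config.setMaxDepthOfCrawling(3);              // max depth to crawl
config.setMaxDownloadSize(1048576);           // max page size (bytes)
config.setResumableCrawling(true);            // resume after a crash or restart
```

The multi-threaded structure is configured separately, via the number of crawler threads passed to the controller (see the demo).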
Crawler4j
Yasser Ganjisaffar
Microsoft Bing & Microsoft Live Search
Demo - Crawler4j (1/3)
myCrawler.java myController.java
Demo - Crawler4j (2/3)
myCrawler.java
import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.url.WebURL;

public class myCrawler extends WebCrawler {

    @Override
    public boolean shouldVisit(WebURL url) {
        // Only follow links that stay on the seed site
        return url.getURL().startsWith("http://www.gdgistanbul.com");
    }

    @Override
    public void visit(Page page) {
        // Called for each successfully fetched page
        String url = page.getWebURL().getURL();
    }
}
Demo - Crawler4j (3/3)
myController.java
int numberOfCrawlers = 4;

CrawlConfig config = new CrawlConfig();
config.setPolitenessDelay(250);
config.setMaxPagesToFetch(100);

PageFetcher pageFetcher = new PageFetcher(config);
RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);

CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);
controller.addSeed("http://www.gdgistanbul.com");
controller.start(myCrawler.class, numberOfCrawlers);
Demo - Jsoup (1/2)
Jsoup: a convenient way to do HTML parsing in Java
● scrape and parse HTML from a URL, file, or string
● find and extract data, using DOM traversal or CSS selectors
● manipulate the HTML elements, attributes, and text
Demo - Jsoup (2/2)

// Fetch a page and select elements with a CSS selector
Document doc = Jsoup.connect("http://en.wikipedia.org/").get();
Elements newsHeadlines = doc.select("#mp-itn b a");

// Parse HTML from a string
String html = "<html><head><title>First parse</title></head>"
    + "<body><p>Parsed HTML into a doc.</p></body></html>";
Document doc = Jsoup.parse(html);

// DOM traversal
Element content = doc.getElementById("content");
Elements links = content.getElementsByTag("a");
for (Element link : links) {
    String linkHref = link.attr("href");
    String linkText = link.text();
}

// CSS selectors
Elements links = doc.select("a[href]");
Elements media = doc.select("[src]");
Where to Use
● Search Engines (GoogleBot)● Aggregators
○ Data aggregator
○ News aggregator
○ Review aggregator
○ Search aggregator
○ Social network aggregation
○ Video aggregator
● Kaarun Product Collector
www.kaarun.com
○ All Friends
○ Products for each Facebook Like
cyesilkaya.wordpress.com & @cuneytykaya & tr.linkedin/cuneyt.yesilkaya
Thank you...