GDG İstanbul February Meetup - Presentation


A talk I gave on "Web Crawling" and "Web Scraping" at the GDG İstanbul February meetup (23.02.2013).


Web Crawling Web Scraping

cuneytykaya

cuneyt.yesilkaya

Cüneyt Yeşilkaya


Agenda

● Web Crawling
● Web Scraping
● Web Crawling Tools
● Demo (Crawler4j & Jsoup)
● Crawling - Where to Use

Web Crawling

Browsing the World Wide Web in a methodical, automated manner or in an orderly fashion.

Web Scraping

A software technique for extracting information from websites.
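As a toy illustration of that definition (not part of the original slides, and deliberately cruder than the Jsoup demo later in the deck), link targets can be pulled out of an HTML string with plain-JDK regex matching; the class name `ScrapeSketch` is mine:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy scraping sketch: pull href values out of raw HTML with a regex.
// Fine for illustration only; real HTML should be parsed with Jsoup.
public class ScrapeSketch {
    private static final Pattern HREF = Pattern.compile("href=\"([^\"]+)\"");

    public static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1)); // the URL inside href="..."
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<a href=\"http://www.gdgistanbul.com\">GDG</a>";
        System.out.println(extractLinks(html)); // prints [http://www.gdgistanbul.com]
    }
}
```

Regex-based extraction breaks quickly on real-world HTML; the Jsoup demo later shows the robust way to do the same thing.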

Web Crawling Tools

Selecting a Crawler?

● Multi-Threaded Structure
● Max Pages to Fetch
● Max Page Size
● Max Depth to Crawl
● Redundant Link Control
● Politeness Time
● Resumable
● Well-Documented
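To make the criteria above concrete, here is a minimal single-threaded sketch (my own illustration, not crawler4j internals; all names are hypothetical) wiring together max pages to fetch, max depth, redundant-link control via a visited set, and a politeness delay:

```java
import java.util.*;

// Sketch of the crawler knobs listed above (illustration only, not crawler4j):
// max pages, max depth, redundant-link control, politeness time.
public class CrawlSketch {
    public interface Fetcher { List<String> links(String url); } // abstracted page fetch

    public static List<String> crawl(String seed, Fetcher fetcher,
                                     int maxPages, int maxDepth, long politenessMs) {
        List<String> fetched = new ArrayList<>();
        Set<String> visited = new HashSet<>();      // redundant link control
        Deque<String[]> queue = new ArrayDeque<>(); // BFS frontier: {url, depth}
        queue.add(new String[]{seed, "0"});
        visited.add(seed);
        while (!queue.isEmpty() && fetched.size() < maxPages) { // max pages to fetch
            String[] entry = queue.poll();
            String url = entry[0];
            int depth = Integer.parseInt(entry[1]);
            fetched.add(url);
            try { Thread.sleep(politenessMs); }     // politeness time between requests
            catch (InterruptedException e) { Thread.currentThread().interrupt(); break; }
            if (depth >= maxDepth) continue;        // max depth to crawl
            for (String next : fetcher.links(url)) {
                if (visited.add(next)) {            // skip already-seen links
                    queue.add(new String[]{next, String.valueOf(depth + 1)});
                }
            }
        }
        return fetched;
    }

    public static void main(String[] args) {
        Map<String, List<String>> web = Map.of("seed", List.of("page1", "page2"));
        System.out.println(crawl("seed", u -> web.getOrDefault(u, List.of()), 10, 1, 0));
    }
}
```

Crawler4j, shown next, provides exactly these knobs (plus multi-threading, page-size limits, and resumability) through its `CrawlConfig`.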

Crawler4j

Written by Yasser Ganjisaffar (Microsoft Bing & Microsoft Live Search)

Demo - Crawler4j (1/3)

myCrawler.java myController.java

Demo - Crawler4j (2/3)

myCrawler.java

import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.url.WebURL;

public class myCrawler extends WebCrawler {

    @Override
    public boolean shouldVisit(WebURL url) {
        // Only follow links that stay on the seed domain
        return url.getURL().startsWith("http://www.gdgistanbul.com");
    }

    @Override
    public void visit(Page page) {
        // Called for each fetched page; process its content here
        String url = page.getWebURL().getURL();
    }
}

Demo - Crawler4j (3/3)

myController.java

int numberOfCrawlers = 4;

CrawlConfig config = new CrawlConfig();
config.setCrawlStorageFolder("/tmp/crawl"); // required: where intermediate crawl data is stored
config.setPolitenessDelay(250);             // ms to wait between requests
config.setMaxPagesToFetch(100);

PageFetcher pageFetcher = new PageFetcher(config);
RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);

CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);
controller.addSeed("http://www.gdgistanbul.com");
controller.start(myCrawler.class, numberOfCrawlers);

Demo - Jsoup (1/2)

Jsoup: a nice way to do HTML parsing in Java

● scrape and parse HTML from a URL, file, or string
● find and extract data, using DOM traversal or CSS selectors
● manipulate the HTML elements, attributes, and text

Demo - Jsoup (2/2)

// Fetch a URL and select elements with a CSS selector
Document doc = Jsoup.connect("http://en.wikipedia.org/").get();
Elements newsHeadlines = doc.select("#mp-itn b a");

// Parse HTML held in a string
String html = "<html><head><title>First parse</title></head>"
        + "<body><p>Parsed HTML into a doc.</p></body></html>";
Document doc = Jsoup.parse(html);

// DOM traversal
Element content = doc.getElementById("content");
Elements links = content.getElementsByTag("a");
for (Element link : links) {
    String linkHref = link.attr("href");
    String linkText = link.text();
}

// Attribute selectors
Elements links = doc.select("a[href]"); // elements with an href attribute
Elements media = doc.select("[src]");   // anything with a src attribute

Where to Use

● Search Engines (GoogleBot)
● Aggregators

○ Data aggregator
○ News aggregator
○ Review aggregator
○ Search aggregator
○ Social network aggregation
○ Video aggregator

● Kaarun Product Collector

www.kaarun.com

All Friends

Products for each Facebook Like

cyesilkaya.wordpress.com & @cuneytykaya & tr.linkedin/cuneyt.yesilkaya

Thank you...