
Deep-Web Crawling “Enlightening the dark side of the web”


Daniele Alfarone (Erasmus student, Milan, Italy)

Structure
- Introduction
  - What is the Deep-Web
  - How to crawl it
- Google's approach
  - Problem statement
  - Main algorithms
  - Performance evaluation
- Improvements
  - Main limitations
  - Some ideas to improve
- Conclusions

What is the Deep-Web?
The Deep-Web is the content hidden behind HTML forms.

This hidden content cannot be reached by traditional crawlers.

The Deep-Web holds about 10 times more data than the currently searchable content.

How do webmasters deal with it?
Search engines are not the only interested party: websites themselves want to be more accessible to crawlers. To that end, some websites publish pages with long lists of static links, so that traditional crawlers can index their content.


How can search engines crawl the Deep-Web?
One option is to develop vertical search engines, each focused on a specific topic (e.g. flights, jobs).

But:
- this only covers the topics for which a vertical search engine has been built
- it is difficult to maintain semantic mappings between the individual data sources and a common database
- the boundaries between different domains are fuzzy

And search engines cannot assume that every website publishes static links for them.


Are there smarter approaches?
The Web currently contains more than 10 million high-quality HTML forms, and it keeps growing exponentially. Any approach that involves human effort cannot scale: we need a fully automatic approach with no site-specific coding.

[Figure: number of websites since 1990; about 7% have a high-quality HTML form]

Solution: the surfacing approach
1. Choose a set of queries to submit to the web form
2. Store the URL of the result page obtained
3. Pass all the URLs to the crawler
(A minimal code sketch of this idea follows.)
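As a toy illustration of steps 1-3, here is a minimal Python sketch that turns one filled-in GET form into a crawlable URL. The form action and field names are hypothetical examples, not taken from the paper.

```python
# Minimal sketch of surfacing one query: encode a filled-in GET form as
# a plain URL that a traditional crawler can fetch and index.
from urllib.parse import urlencode

def surface_url(form_action: str, filled_inputs: dict) -> str:
    """Return the URL a GET submission of `filled_inputs` would produce."""
    return f"{form_action}?{urlencode(filled_inputs)}"

# One query template instantiation for a hypothetical used-car search form.
url = surface_url("http://example.com/used-cars/search",
                  {"make": "Ford", "state": "CA"})
print(url)  # http://example.com/used-cars/search?make=Ford&state=CA
```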

Part 2: Google's approach
- Problem statement
- Main algorithms
- Performance evaluation

Solving the surfacing problem: Google's approach
The problem is divided into two sub-problems:
1. deciding which form inputs to fill (query templates)
2. finding appropriate values to fill them with

HTML form example

[Figure: a sample web form, annotated with its free-text inputs and choice inputs]

HTML form example (continued)

[Figure: the same form, annotated with its selection inputs and presentation inputs]

Which form inputs to fill: query templates
Defined by Google as "the list of input types to be filled to create a set of queries".

[Figure: Query Template #1]


[Figure: Query Template #2]

How to create informative query templates:
- Discard presentation inputs (currently a big challenge)
- Choose the optimal dimension for the template:
  - too big: crawling traffic increases and many submissions produce pages without results
  - too small: every submission returns a large number of results, and the website may limit the number of results shown, or spread them across paginated pages (which are not always easy to follow)

Informativeness tester
How does Google evaluate whether a template is informative? Query templates are evaluated on the distinctness of the web pages resulting from the generated form submissions. To estimate the number of distinct web pages, the results are clustered based on the similarity of their content.

A template is informative if:

    (# distinct pages) / (# pages) > 25%
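A minimal sketch of this test, assuming the result pages for one template have already been fetched. The crude word-frequency signature below stands in for Google's actual content-similarity clustering, which the slides do not specify; it is an assumption for illustration only.

```python
# Sketch of the informativeness test: a template is informative if the
# fraction of distinct result pages exceeds 25%.
from collections import Counter

def signature(page_text: str, k: int = 20) -> frozenset:
    """Approximate a content cluster by the page's k most frequent words."""
    counts = Counter(page_text.lower().split())
    return frozenset(word for word, _ in counts.most_common(k))

def is_informative(result_pages: list, threshold: float = 0.25) -> bool:
    """True if (# distinct pages / # pages) > threshold."""
    if not result_pages:
        return False
    distinct = len({signature(p) for p in result_pages})
    return distinct / len(result_pages) > threshold
```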

How to scale to big web forms?
Given a form with N inputs, the possible templates number 2^N - 1. To avoid running the informativeness tester on every one of them, Google developed an algorithm called Incremental Search for Informative Query Templates (ISIT), sketched below.

ISIT example
[Figure: ISIT walk-through on a sample form]
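A compact sketch of the incremental idea as described above: start from single-input templates, keep only the informative ones, and extend only those by one input at a time. `check_informative` stands in for the tester of the previous section, and the bound on template dimension is an illustrative assumption.

```python
# Sketch of ISIT: grow query templates one input at a time, pruning any
# candidate that fails the informativeness test, so only a small part
# of the 2^N - 1 template space is ever tested.
def isit(inputs, check_informative, max_dimension=3):
    informative = []
    frontier = [frozenset([i]) for i in inputs]   # dimension-1 templates
    while frontier:
        survivors = [t for t in frontier if check_informative(t)]
        informative.extend(survivors)
        # Extend each surviving template with one input it lacks.
        frontier = list({t | {i} for t in survivors for i in inputs
                         if i not in t and len(t) < max_dimension})
    return informative
```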

Generating input values
Assigning values to a select menu is as easy as selecting all of its possible values. Generating meaningful values for text boxes, instead, is a big challenge.

Text boxes are used in different ways in web forms:
- Generic text boxes: retrieve all the documents in a database that match the words typed (e.g. the title or author of a book)
- Typed text boxes: act as a selection predicate on a specific attribute in the WHERE clause of a SQL query (e.g. zip codes, US states, prices)

Values for generic text boxes
1. Initial seed keywords are extracted from the form page
2. A query template with only the generic text box is submitted
3. Additional keywords are extracted from the resulting page
4. Keywords not representative of the page are discarded (TF-IDF rank)
The cycle repeats until a sufficient number of keywords has been extracted (see the code sketch below).

Values for typed text boxes
The number of types that can appear in the HTML forms of different domains is limited (e.g. city, date, price, zip code). Forms with typed text boxes produce reasonable result pages only when given type-appropriate values. To recognize the correct type, the form is submitted with known values of each candidate type, and the type with the highest distinctness fraction is taken to be the correct one.

Performance evaluation: query templates with only select menus
As the number of inputs increases, the number of possible templates grows exponentially, but the number of templates tested grows only linearly, as does the number found to be informative.
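A sketch of the keyword-extraction cycle for generic text boxes (steps 1-4 above). `submit_form` (which submits the one-text-box template and returns the result page's text) and the `idf` table are assumed helpers, not part of any published code.

```python
# Sketch of the iterative seed-keyword loop for a generic text box.
from collections import Counter

def top_tfidf_words(page_text, idf, n=10):
    """Keep the n words that best represent the page (TF-IDF rank)."""
    tf = Counter(page_text.lower().split())
    scores = {w: c * idf.get(w, 0.0) for w, c in tf.items()}
    return sorted(scores, key=scores.get, reverse=True)[:n]

def extract_keywords(seed_words, submit_form, idf, target=500):
    keywords = set(seed_words)
    frontier = list(seed_words)
    while frontier and len(keywords) < target:
        page = submit_form(frontier.pop())       # step 2: submit the template
        for word in top_tfidf_words(page, idf):  # steps 3-4: extract + filter
            if word not in keywords:
                keywords.add(word)
                frontier.append(word)
    return keywords
```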

Performance evaluation: mixed query templates
Testing on 1 million HTML forms, the URLs were generated using templates that had:
- only one text box (57%)
- one or more select menus (37%)
- one text box and one or more select menus (6%)

Today on Google.com, one query out of 10 contains "surfaced" results.

Part 3: Improvements
- Main limitations
- Some ideas to improve

1. POST forms are discarded
The output of Google's entire Deep-Web crawl is a list of URLs for each form considered. But the result pages of a form submitted with method=POST don't have a unique URL. Google bypasses these forms, relying on the fact that the RFC specifications recommend POST forms only for operations that write to the website's database (e.g. comments in a forum, sign-up to a website).

In reality, however, websites make massive use of POST forms, for:
- URL shortening
- maintaining the state of a form after its submission

How can we crawl POST forms?
Two approaches can lift the limitation imposed by Google:
1. POST forms can be crawled by sending the server a complete HTTP request rather than just a URL. The problem then becomes how to link, in the search results page (SERP), the page obtained by submitting the POST form.

2. An approach that would solve all the problems above is to simply convert the POST form into its GET equivalent (sketched below). An analysis would be required to assess what percentage of websites also accept GET parameters for POST forms.
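A sketch of that conversion check using the `requests` library. The URL and field names are hypothetical, and the exact text comparison is a simplification: dynamic pages rarely match byte for byte.

```python
# Sketch: test whether a site answers a GET request with the same
# content it returns for the equivalent POST submission.
import requests

fields = {"query": "deep web", "category": "books"}  # hypothetical form fields

post_resp = requests.post("http://example.com/search", data=fields)
get_resp = requests.get("http://example.com/search", params=fields)

# If the two match, the GET URL can be stored and linked in the SERP.
print(get_resp.url, post_resp.text == get_resp.text)
```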

2. Select menus with bad default values
When instantiating a query template, each select menu not included in the template is assigned its default value, under the assumption that it is a wildcard value like "Any" or "All". This assumption is probably too strong: in several select menus the default option is simply the first one in the list.

For example, in a select menu of U.S. states we would expect "All", but we may find "Alabama". If a bad option like "Alabama" is selected, a high percentage of the database will remain undiscovered.

How can we recognize a bad default value? Idea: submit the form once for every possible value of the menu and count the results. If the number of results for the (potentially) default value is close to the sum of the results for all the other values, it probably is a real wildcard default (a sketch follows). Once we recognize a bad default value, we force the inclusion of that select menu in every template for the given form.
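A sketch of the proposed check. `count_results` is a hypothetical helper that submits the form with the select menu set to one value and counts the result rows; the 90% tolerance is an arbitrary illustrative choice.

```python
# Sketch: decide whether a select menu's default option behaves like a
# real wildcard ("Any"/"All") or like a bad default ("Alabama").
def is_real_wildcard(default_value, all_values, count_results,
                     tolerance=0.9):
    default_count = count_results(default_value)
    other_counts = sum(count_results(v) for v in all_values
                       if v != default_value)
    # A true wildcard should return roughly as many results as all the
    # other options combined.
    return other_counts > 0 and default_count >= tolerance * other_counts
```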

3. Managing mandatory inputs
HTML forms often indicate to the user which inputs are mandatory (e.g. with asterisks or red borders).

Recognizing the mandatory inputs can offer some benefits:
- It reduces the number of URLs generated by ISIT: only the templates that contain all the mandatory fields are passed to the informativeness tester
- It avoids assigning the default value (which is not always correct) to inputs that can simply be discarded because they are not mandatory

4. Filling text boxes by exploiting JavaScript suggestions
An alternative approach for filling text boxes is to exploit auto-completion suggestions whenever a website proposes them via JavaScript.

Algorithm to extract the suggestions (a sketch follows):
1. Type into the text box every possible 3-letter prefix (with the English alphabet: 26^3 = 17,576 submissions)
2. For each combination of 3 letters, retrieve all the auto-completion suggestions using a JavaScript simulator
All suggestions can be assumed to be valid inputs, so there is no need to filter by relevance; a relevance filter is applied only if the website is not particularly interesting.
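A sketch of the prefix enumeration using Selenium as the "JavaScript simulator". The page URL, the input's name, and the dropdown's CSS selector are site-specific assumptions, and the fixed sleep is a simplification of proper explicit waiting.

```python
# Sketch: harvest autocomplete suggestions by typing every 3-letter
# prefix (26^3 = 17,576 submissions) into a suggestion-enabled text box.
import itertools
import string
import time
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
driver.get("http://example.com/search")              # hypothetical page

suggestions = set()
for letters in itertools.product(string.ascii_lowercase, repeat=3):
    box = driver.find_element(By.NAME, "q")          # assumed input name
    box.clear()
    box.send_keys("".join(letters))
    time.sleep(0.2)                                  # let suggestions render
    for item in driver.find_elements(By.CSS_SELECTOR, ".autocomplete li"):
        suggestions.add(item.text)                   # assumed dropdown markup
driver.quit()
```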

5. Input correlations are not taken into account
Google uses the same set of values to fill an input across all templates that contain that input. Yet some inputs are usually correlated, e.g. the text box "US city" and the select menu "US state", or two text boxes representing a range.

Advantages of taking correlations into account:
- More relevant keywords for text boxes: e.g. given a correlation between a text box and a select menu, we can submit the form with different select-menu values and extract relevant keywords for the associated text box
- Fewer zero-result pages are generated, which means less load for the search engine's crawler and for the website's servers

How can we recognize a correlation?
To detect correlations between any two input types, we can:

- Use the informativeness test, assuming that two values are correlated only if the query results they produce are informative

- Recognize particular types of correlations: e.g. with two select menus where filling the first restricts the possible values of the second (US state/city, car brand/model), we can use a JavaScript simulator to handle the correlation

Conclusions
- Deep-Web crawling is one of the most interesting challenges facing today's search engines
- Google has already implemented the surfacing approach, with encouraging results
- But there are still several limitations, and some ideas to overcome them have been illustrated

References
- J. Madhavan et al. (2008). Google's Deep-Web Crawl. http://www.cs.washington.edu/homes/alon/files/vldb08deepweb.pdf
- J. Madhavan et al. (2009). Harnessing the deep web: Present and future. http://arxiv.org/ftp/arxiv/papers/0909/0909.1785.pdf
- W3C. Hypertext Transfer Protocol - HTTP/1.1: GET and POST method definitions. http://www.w3.org/Protocols/rfc2616/rfc2616-sec9.html
- E. Charalambous. How postback works in ASP.NET. http://www.xefteri.com/articles/show.cfm?id=18

Thank you for your attention :)

Questions?