
Cloak and Dagger: Dynamics of Web Search Cloaking



18th ACM Conference on Computer and Communications Security (CCS 2011). Cloak and Dagger: Dynamics of Web Search Cloaking. David Y. Wang, Stefan Savage, and Geoffrey M. Voelker, University of California, San Diego. 左昌國, Seminar @ ADLab, NCU-CSIE.



Cloak and Dagger: Dynamics of Web Search Cloaking

David Y. Wang, Stefan Savage, and Geoffrey M. Voelker
University of California, San Diego

Seminar @ ADLab, NCU-CSIE
18th ACM Conference on Computer and Communications Security (CCS 2011)

Outline

  • Introduction
  • Methodology
  • Results
  • Related Work
  • Conclusion

Introduction

  • Search Engine Optimization (SEO)
      • "Search engine optimization (SEO) is the process of improving the visibility of a website or a web page in search engines via the "natural" or un-paid ("organic" or "algorithmic") search results." (Wikipedia)
      • SEO can be used as a benign technique
  • Cloaking
      • Up to 1999
      • One of the notorious blackhat SEO techniques
      • Delivers different content to different user segments, e.g., search engine crawlers and normal users

Note: Blackhat SEO: link farm, link spamming (for inbound links)

Introduction

[Figure: Normal User vs. Search Engine Crawler]

Introduction

  • Types of cloaking
      • Repeat Cloaking: cookies or IP tracking
      • User Agent Cloaking: User-Agent field in the HTTP request header
      • Referrer Cloaking: Referer field in the HTTP header
      • IP Cloaking
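Note (illustrative sketch): the four cloaking types differ only in which request signal the site inspects. The toy model below lists which signals fire for a given request. It is not taken from the paper; the token lists, the IP prefix (a documentation range used as a placeholder), and the function name `cloaking_signals` are all assumptions made for illustration.

```python
from typing import List, Mapping

# Hypothetical token lists; a real cloaking kit would use its own.
CRAWLER_UA_TOKENS = ("googlebot", "bingbot", "slurp")
SEARCH_REFERER_TOKENS = ("google.", "bing.", "yahoo.")
# Placeholder prefix from the 192.0.2.0/24 documentation range, not a real crawler range.
ASSUMED_CRAWLER_IP_PREFIXES = ("192.0.2.",)


def cloaking_signals(headers: Mapping[str, str], client_ip: str,
                     seen_before: bool = False) -> List[str]:
    """Return the cloaking types whose trigger condition this request satisfies."""
    h = {k.lower(): v.lower() for k, v in headers.items()}
    signals = []
    if seen_before:  # repeat cloaking: cookie or IP tracking says "returning visitor"
        signals.append("repeat")
    if any(t in h.get("user-agent", "") for t in CRAWLER_UA_TOKENS):
        signals.append("user-agent")  # User-Agent header looks like a crawler
    if any(t in h.get("referer", "") for t in SEARCH_REFERER_TOKENS):
        signals.append("referrer")  # visitor arrived from a search result page
    if client_ip.startswith(ASSUMED_CRAWLER_IP_PREFIXES):
        signals.append("ip")  # source address belongs to a known crawler range
    return signals
```

A request that only changes its User-Agent exposes user-agent cloaking but not IP cloaking, which is why a crawler that merely disguises its headers can miss IP-cloaked pages.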

Introduction

  • This paper
      • Designs a system, Dagger, to identify cloaking in near real time
      • Uses this system to
          • Provide a picture of cloaking activity as seen through three search engines (Google, Bing, and Yahoo)
          • Characterize the differences in cloaking behavior between undifferentiated trending keywords and targeted keywords
          • Characterize the dynamic behavior of cloaking activity

Methodology

  • Dagger consists of five functional components
      • Collecting search terms
      • Fetching search results from search engines
      • Crawling the pages linked from the search results
      • Analyzing the pages crawled
      • Repeating measurements over time

Methodology

  • Collecting Search Terms
      • Collecting popular search terms from
          • Google Hot Searches
          • Alexa
          • Twitter
      • Constructing another source of search terms using keyword suggestions from Google Suggest
          • Example: the user enters "viagra 50mg"; suggestions include "viagra 50mg cost" and "viagra 50mg canada"

Note: Google Hot Searches terms, suggest terms

Methodology

  • Querying Search Results
      • Submitting the search terms to search engines (Google, Yahoo, and Bing)
          • Google Hot Searches and Alexa each supply 80 terms every 4 hours
          • Twitter supplies 40
          • Together with 240 additional suggestions based on Google Hot Searches (80 * 3)
          • Total: 440 terms
      • Extracting the top 100 search results for each search term (44,000)
      • Removing whitelisted URLs
      • Grouping similar entries (same URL, source, and search term)
      • On average, roughly 15,000 unique URLs in each measurement period
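Note (worked arithmetic, illustrative): the counts on this slide can be checked with a few lines of Python. The per-source numbers come from the slide; the function names and the tuple layout used for grouping are assumptions made for this sketch.

```python
from typing import Iterable, List, Tuple

TERMS_PER_SOURCE = {"Google Hot Searches": 80, "Alexa": 80, "Twitter": 40}
SUGGESTIONS_PER_HOT_TERM = 3   # 80 * 3 = 240 suggested terms
RESULTS_PER_TERM = 100         # top 100 results per search term


def terms_per_period() -> int:
    """Search terms collected in one 4-hour measurement period."""
    suggested = TERMS_PER_SOURCE["Google Hot Searches"] * SUGGESTIONS_PER_HOT_TERM
    return sum(TERMS_PER_SOURCE.values()) + suggested


def results_per_period() -> int:
    """Search results extracted before whitelisting and grouping."""
    return terms_per_period() * RESULTS_PER_TERM


def group_entries(entries: Iterable[Tuple[str, str, str]]) -> List[Tuple[str, str, str]]:
    """Collapse entries with the same (URL, source, search term), keeping first-seen order."""
    return list(dict.fromkeys(entries))
```

With these inputs, `terms_per_period()` gives 440 and `results_per_period()` gives 44,000, matching the slide.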

Methodology

  • Crawling Search Results
      • Web crawler
          • A Java web crawler using the HttpClient 3.x package from Apache
      • Crawling 3 times for each URL
          • Disguised as a normal user using Internet Explorer, clicking through the search result
          • Disguised as the Googlebot web crawler
          • Disguised as a normal user again, NOT clicking through the search result
      • Dealing with IP cloaking?
          • Fourth crawl using Google Translate
          • More than half of cloaked results do IP cloaking

Notes: HttpClient: HTTP 3xx redirects, HTTP headers, timeouts, error handling; crawling; pure user-agent cloaking, 35%; cloaking site, malicious site
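Note (illustrative sketch): the three disguises differ only in the request headers sent. The slides describe a Java crawler built on HttpClient 3.x; the Python below is a stand-in that only builds the header sets and sends nothing. The Internet Explorer User-Agent string, the Google search URL used as the Referer, the choice to send no Referer as Googlebot, and the name `crawl_profiles` are assumptions.

```python
from typing import Dict, List, Tuple
from urllib.parse import quote_plus

# Illustrative browser string; the exact value used by Dagger is not given in the slides.
IE_USER_AGENT = "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1)"
GOOGLEBOT_USER_AGENT = (
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
)


def crawl_profiles(search_term: str) -> List[Tuple[str, Dict[str, str]]]:
    """Return (name, headers) for the three fetches made for each URL."""
    # Assumed Referer: the search result page the "user" clicked through from.
    search_url = "https://www.google.com/search?q=" + quote_plus(search_term)
    return [
        # 1. Normal user clicking through the search result.
        ("user_click_through", {"User-Agent": IE_USER_AGENT, "Referer": search_url}),
        # 2. Search engine crawler.
        ("googlebot", {"User-Agent": GOOGLEBOT_USER_AGENT}),
        # 3. Normal user visiting directly, without a Referer.
        ("user_direct", {"User-Agent": IE_USER_AGENT}),
    ]
```

Comparing fetch 1 with fetch 2 exposes user-agent cloaking, and comparing fetch 1 with fetch 3 exposes referrer cloaking. Neither comparison changes the source IP address, which is why the slide lists a fourth crawl through Google Translate.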

Notes: paper: scam page; Dagger 2nd crawl: scam page; security crawler; SEO-ed page -> cloaking

Methodology

  • Detecting Cloaking
      • Removing HTTP error responses (on average 4% of URLs)
      • Using text shingling to filter out nearly identical pages
          • 90% of URLs are near duplicates ("near duplicate" means a difference of 10% or less between the two sets of signatures)
      • Measuring the similarity between the snippet of the search result and the user view of the page
          • Removing noise from both the snippet and the body of the user view
          • Searching for substrings from the snippet
          • Score: number of words from unmatched substrings divided by the total number of words from all substrings
          • 1.0 means no match
          • 0.0 means a full match
          • Threshold: 0.33, which filters out 56% of the remaining URLs

Note: Text shingling: hash substrings in each page to construct signatures of the content
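Note (illustrative sketch): text shingling can be pictured as hashing every run of k consecutive words and comparing the resulting signature sets. The slides do not give the shingle size, the hash, or the exact difference measure, so the sketch below assumes 4-word shingles, CRC32 hashes, and Jaccard distance; only the 10% near-duplicate cutoff comes from the slide.

```python
import re
import zlib
from typing import Set


def shingle_signatures(text: str, k: int = 4) -> Set[int]:
    """Hash each run of k consecutive words into a signature (k is an assumption)."""
    words = re.findall(r"\w+", text.lower())
    if not words:
        return set()
    if len(words) < k:
        # Too short for a full shingle: treat the whole text as one shingle.
        return {zlib.crc32(" ".join(words).encode("utf-8"))}
    return {
        zlib.crc32(" ".join(words[i:i + k]).encode("utf-8"))
        for i in range(len(words) - k + 1)
    }


def signature_difference(page_a: str, page_b: str) -> float:
    """Jaccard distance between signature sets: 0.0 identical, 1.0 disjoint."""
    sig_a, sig_b = shingle_signatures(page_a), shingle_signatures(page_b)
    union = sig_a | sig_b
    if not union:
        return 0.0  # two empty pages are treated as identical
    return 1.0 - len(sig_a & sig_b) / len(union)


def is_near_duplicate(page_a: str, page_b: str, max_difference: float = 0.10) -> bool:
    """Near duplicate: 10% or less difference between the two signature sets."""
    return signature_difference(page_a, page_b) <= max_difference
```

Applied to the user view and the crawler view of the same URL, a near-duplicate verdict means the site served essentially the same text to both, so the URL can be dropped before the more expensive checks.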

Noise: character case, punctuation, HTML tags, and redundant whitespace
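Note (illustrative sketch): the snippet score can be computed by normalizing away the noise listed above and checking which snippet substrings appear in the user view. The slides do not say how the snippet is split into substrings; splitting on ellipses is an assumption here, as are the function names and the treatment of an empty snippet.

```python
import re

TAG_RE = re.compile(r"<[^>]+>")
PUNCT_RE = re.compile(r"[^\w\s]")


def normalize(text: str) -> str:
    """Remove noise: HTML tags, character case, punctuation, redundant whitespace."""
    text = TAG_RE.sub(" ", text)
    text = PUNCT_RE.sub(" ", text.lower())
    return " ".join(text.split())


def snippet_score(snippet: str, user_view_html: str) -> float:
    """Words in unmatched snippet substrings / words in all snippet substrings."""
    page = " " + normalize(user_view_html) + " "
    # Assumption: snippet substrings are delimited by ellipses.
    parts = [normalize(p) for p in re.split(r"\.\.\.|…", snippet)]
    parts = [p for p in parts if p]
    total = sum(len(p.split()) for p in parts)
    if total == 0:
        return 0.0  # assumption: an empty snippet gives no evidence of a mismatch
    # Pad with spaces so a substring only matches on whole-word boundaries.
    unmatched = sum(len(p.split()) for p in parts if " " + p + " " not in page)
    return unmatched / total


def passes_snippet_filter(score: float, threshold: float = 0.33) -> bool:
    """URLs scoring below the threshold are filtered out; the rest stay candidates."""
    return score >= threshold
```

For example, with the snippet "Cheap <b>Watches</b> online ... free shipping today" and a page whose text is "Cheap watches online! Best prices.", the first substring (3 words) matches and the second (3 words) does not, so the score is 3 / 6 = 0.5 and the URL stays a cloaking candidate.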

Note: URLs with a score below 0.33 are filtered out

Methodology

  • Detecting Cloaking (cont.)
      • False positives may still exist
      • Examining the DOMs as the final test
          • Computing the sum of an overall comparison and a hierarchical comparison
          • Overall comparison: unmatched tags from the entire page divided by the total number of tags
          • Hierarchical comparison: the sum of the unmatched tags from each level of the DOM hierarchy divided by the total number of tags
          • 2.0 means no match
          • 0.0 means a full match
          • Threshold: 0.66
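Note (illustrative sketch): one way to read the DOM test is as two tag-multiset comparisons, one over the whole page and one per nesting level, each normalized by the total number of tags in both views. The slides do not define "unmatched" or "total" precisely, so counting the multiset symmetric difference over the tags of both pages is an assumption, as is using Python's `html.parser` to approximate the DOM.

```python
from collections import Counter
from html.parser import HTMLParser
from typing import Dict

# Elements without an end tag; they must not change the nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class _TagCollector(HTMLParser):
    """Count start tags per nesting level (a rough stand-in for a real DOM)."""

    def __init__(self) -> None:
        super().__init__()
        self.depth = 0
        self.by_level: Dict[int, Counter] = {}

    def handle_starttag(self, tag, attrs):
        self.by_level.setdefault(self.depth, Counter())[tag] += 1
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS:
            self.depth = max(0, self.depth - 1)


def tags_by_level(html: str) -> Dict[int, Counter]:
    collector = _TagCollector()
    collector.feed(html)
    collector.close()
    return collector.by_level


def _unmatched(a: Counter, b: Counter) -> int:
    """Tags present in one multiset but not matched in the other."""
    return sum(((a - b) + (b - a)).values())


def dom_score(user_html: str, crawler_html: str) -> float:
    """Overall + hierarchical comparison: 0.0 is a full match, 2.0 is no match."""
    levels_a, levels_b = tags_by_level(user_html), tags_by_level(crawler_html)
    all_a = sum(levels_a.values(), Counter())
    all_b = sum(levels_b.values(), Counter())
    total = sum(all_a.values()) + sum(all_b.values())
    if total == 0:
        return 0.0  # assumption: two tag-free pages count as matching
    overall = _unmatched(all_a, all_b) / total
    hierarchical = sum(
        _unmatched(levels_a.get(d, Counter()), levels_b.get(d, Counter()))
        for d in set(levels_a) | set(levels_b)
    ) / total
    return overall + hierarchical
```

Two views with the same tags at the same depths score 0.0; views that share no tags at all score 2.0. Moving a tag to a different depth leaves the overall term unchanged but raises the hierarchical term, which is the reason for summing both.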

Note: cloaking pages: 1.0

Methodology

  • Detecting Cloaking (cont.)
      • Manual inspection
          • False positives: 9.1% (29 of 317) in Google search; 12% (9 of 75) in Yahoo
          • Benign websites that nonetheless deliver different content to search engines
          • Advanced browser detection
  • Temporal Remeasurement
      • Dagger remeasures every 4 hours for up to 7 days

Results

  • Cloaking Over Time

Notes: search engines, cloaking page; 4/1 ~ 5/4; Google cloak, Yahoo; link farm; reverse DNS cloaking for Google IPs

Results

Notes: target term cloaking; trend term cloaking; target term, page rank

Results

  • Sources of Search Terms




Results

  • Search Engine Response

Notes: Google, cloaking; top 100; error bars; cloak error bars

Results




Note: harmful URL

Results

  • Cloaking Duration

Notes: cloak page, cloak; top 100, crawl

Results

  • Cloaked Content

Notes:

  • Cloak cluster (normal user click, DOM); top 62 clusters (total 7671); 61%; Google search trending cloak page
  • Traffic sale, redirection, Fake-AV, CPALead, PPC (pay-per-click)
  • Error: client, flash player
  • Link farm: link
  • Cluster: 34% cloaked result, JavaScript redirection

Results

  • Classify the HTML of cloaked pages
      • File size, substrings
      • Link farm: search engine crawler, link
      • Redirect: user, 4KB; JavaScript redirect, HTML meta refresh
      • Error: page
      • Weak: user, 4KB; not classified
      • Misc: 4KB; not classified

Results

  • Domain Infrastructure

Note: TLD: top-level domain

Results

  • SEO

Note: search engine

Conclusion

  • Cloaking is a standard technique for constructing scam pages.
  • This paper examined the current state of search engine cloaking as used to support Web spam.
  • New techniques for identifying cloaking (via the search engine snippets that identify keyword-related content found at the time of crawling)
  • Exploring the dynamics of cloaked search results and sites over time
