Cloak and Dagger: Dynamics of Web Search Cloaking. David Y. Wang, Stefan Savage, and Geoffrey M. Voelker, University of California, San Diego. 左昌國, Seminar @ ADLab, NCU-CSIE. 18th ACM Conference on Computer and Communications Security (CCS 2011)


Cloak and Dagger: Dynamics of Web Search Cloaking

David Y. Wang, Stefan Savage, and Geoffrey M. Voelker
University of California, San Diego

左昌國
Seminar @ ADLab, NCU-CSIE

18th ACM Conference on Computer and Communications Security (CCS 2011)


Outline
• Introduction
• Methodology
• Results
• Related Work
• Conclusion


Introduction
• Search Engine Optimization (SEO)
  • “Search engine optimization (SEO) is the process of improving the visibility of a website or a web page in search engines via the "natural" or un-paid ("organic" or "algorithmic") search results.” (Wikipedia)
  • SEO techniques can be benign
• Cloaking
  • Dates back to at least 1999
  • One of the notorious blackhat SEO techniques
  • Delivering different content to different user segments
    • e.g., search engine crawlers versus normal users


Introduction
• Figure labels: Normal User; Search Engine Crawler


Introduction
• Types of cloaking
  • Repeat Cloaking
    • Cookies or IP tracking
  • User Agent Cloaking
    • User-Agent field in the HTTP request header
  • Referrer Cloaking
    • Referer field in the HTTP request header
  • IP Cloaking
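
To make the four cloaking types concrete, the following is a minimal sketch of the request signals each type keys on. It is an illustration only, not code from the paper: the function name `cloaking_signals`, the crawler tokens, the search engine substrings, and the IP sets are all assumptions chosen for the example.

```python
def cloaking_signals(headers, client_ip, crawler_ips=frozenset(), seen_ips=frozenset()):
    """Return the cloaking types whose trigger condition holds for one request.

    headers:     dict of HTTP request headers
    client_ip:   address the request came from
    crawler_ips: hypothetical list of known crawler addresses (IP cloaking)
    seen_ips:    hypothetical record of earlier visitors (repeat cloaking)
    """
    signals = set()

    # User agent cloaking: inspect the User-Agent request header.
    user_agent = headers.get("User-Agent", "").lower()
    if any(token in user_agent for token in ("googlebot", "bingbot")):
        signals.add("user-agent")

    # Referrer cloaking: inspect the Referer request header for a search engine.
    referer = headers.get("Referer", "").lower()
    if any(engine in referer for engine in ("google.", "bing.", "yahoo.")):
        signals.add("referrer")

    # IP cloaking: compare the client address with known crawler addresses.
    if client_ip in crawler_ips:
        signals.add("ip")

    # Repeat cloaking: the visitor was seen before (tracked here by IP;
    # a cookie could serve the same purpose).
    if client_ip in seen_ips:
        signals.add("repeat")

    return signals
```

For instance, a request that carries a Googlebot User-Agent from an unknown address triggers only the "user-agent" signal, while a click-through from a search result page triggers "referrer".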


Introduction
• This paper…
  • Designs a system, Dagger, to identify cloaking in near real-time
  • Uses this system to
    • Provide a picture of cloaking activity as seen through three search engines (Google, Bing, and Yahoo)
    • Characterize the differences in cloaking behavior between undifferentiated “trending” keywords and targeted keywords
    • Characterize the dynamic behavior of cloaking activity


Methodology
• Dagger consists of five functional components
  • Collecting search terms
  • Fetching search results from search engines
  • Crawling the pages linked from the search results
  • Analyzing the pages crawled
  • Repeating measurements over time


Methodology
• Collecting Search Terms
  • Collecting popular search terms from
    • Google Hot Searches
    • Alexa
    • Twitter
  • Constructing another source of search terms using keyword suggestions from “Google Suggest”
    • e.g., user enters: viagra 50mg
    • Suggestions: viagra 50mg cost, viagra 50mg canada


Methodology
• Querying Search Results
  • Submitting the search terms to search engines (Google, Yahoo, and Bing)
    • Google Hot Searches and Alexa each supply 80 terms per 4-hour period
    • Twitter supplies 40
    • Together with 240 additional suggestions based on Google Hot Searches (80 * 3)
    • Total: 440 terms
  • Extracting the top 100 search results for each search term (44,000)
  • Removing whitelisted URLs
  • Grouping similar entries (same URL, source, and search term)
  • On average, roughly 15,000 unique URLs in each measurement period
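
The counts on this slide can be checked with a few lines of arithmetic, and the whitelist and grouping steps can be sketched alongside. This is a rough sketch under assumptions: the names below are invented for the example, and representing the whitelist as a set of hostnames is a guess, since the slide does not say how it is stored.

```python
from urllib.parse import urlsplit

# Terms supplied per 4-hour measurement period, as stated on the slide.
TERMS_PER_PERIOD = {"google_hot_searches": 80, "alexa": 80, "twitter": 40}
SUGGESTIONS_PER_HOT_SEARCH = 3   # 80 * 3 = 240 additional suggested terms
RESULTS_PER_TERM = 100           # top 100 results per search term


def terms_per_period():
    """Total search terms per period: direct terms plus suggested terms."""
    suggested = TERMS_PER_PERIOD["google_hot_searches"] * SUGGESTIONS_PER_HOT_SEARCH
    return sum(TERMS_PER_PERIOD.values()) + suggested


def unique_entries(results, whitelist_hosts):
    """Drop whitelisted URLs, then group entries that share URL, source, and term.

    results:         iterable of (url, source, search_term) tuples
    whitelist_hosts: set of hostnames to skip (assumed representation)
    """
    kept = set()
    for url, source, term in results:
        if urlsplit(url).hostname in whitelist_hosts:
            continue
        kept.add((url, source, term))  # the set collapses identical entries
    return kept


total_terms = terms_per_period()                  # 80 + 80 + 40 + 240
total_results = total_terms * RESULTS_PER_TERM    # top 100 per term
```

Here `total_terms` reproduces the 440 terms and `total_results` the 44,000 results quoted above; the drop to roughly 15,000 unique URLs comes from the whitelist and grouping steps.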


Methodology
• Crawling Search Results
  • Web crawler
    • A Java web crawler using the HttpClient 3.x package from Apache
  • Crawling 3 times for each URL
    • Disguised as a normal user using Internet Explorer, clicking through the search result
    • Disguised as the Googlebot Web crawler
    • Disguised as a normal user again, NOT clicking through the search result
  • Dealing with IP cloaking?
    • Fourth crawl using Google Translate
    • More than half of cloaked results do IP cloaking
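
The three disguises differ only in the request headers sent, which the sketch below spells out. It is written in Python for illustration, whereas the crawler described above is in Java; the function name, the exact User-Agent strings, and the form of the search URL used as the Referer are assumptions, and the fourth crawl through Google Translate is not modeled.

```python
from urllib.parse import quote_plus

# Illustrative User-Agent strings; the exact values Dagger sent are not given.
IE_UA = "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0)"
GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"


def crawl_profiles(search_term):
    """Return (name, headers) pairs for the three crawls of one result URL."""
    # A click-through is simulated by a Referer pointing at a search result page.
    referer = "http://www.google.com/search?q=" + quote_plus(search_term)
    return [
        # 1. Normal user with Internet Explorer, clicking through the result.
        ("user_click_through", {"User-Agent": IE_UA, "Referer": referer}),
        # 2. Googlebot Web crawler.
        ("crawler", {"User-Agent": GOOGLEBOT_UA}),
        # 3. Normal user again, without clicking through (no Referer).
        ("user_direct", {"User-Agent": IE_UA}),
    ]
```

Comparing the first and third responses isolates referrer cloaking, while comparing the first and second isolates user agent cloaking.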


Methodology
• Detecting Cloaking
  • Removing HTTP error responses (on average 4% of URLs)
  • Using text shingling to filter out URLs whose user and crawler views are nearly identical
    • 90% of URLs are near duplicates (“near duplicates” means a difference of 10% or less between the two sets of signatures)
  • Measuring the similarity between the snippet of the search result and the user view of the page
    • Removing noise from both the snippet and the body of the user view
    • Searching the user view for substrings from the snippet
    • Score: the number of words in unmatched substrings divided by the total number of words in all substrings
      • 1.0 means no match
      • 0.0 means a full match
    • Threshold: 0.33, which filters out 56% of the remaining URLs
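
The two text filters above can be sketched as follows. This is a simplified illustration, not Dagger's implementation: the shingle size of 4 words, the use of Jaccard distance over raw shingle sets in place of the paper's signatures, the noise removal rule (lowercase, alphanumerics only), and splitting the snippet on ellipses are all assumptions.

```python
import re


def normalize(text):
    """Crude noise removal: lowercase and keep only alphanumeric words."""
    return " ".join(re.findall(r"[a-z0-9]+", text.lower()))


def shingles(text, k=4):
    """Set of k-word shingles of the normalized text."""
    words = normalize(text).split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def near_duplicate(view_a, view_b, max_difference=0.10):
    """True if the two views differ by 10% or less (Jaccard distance)."""
    a, b = shingles(view_a), shingles(view_b)
    if not a and not b:
        return True
    return 1.0 - len(a & b) / len(a | b) <= max_difference


def snippet_score(snippet, user_view):
    """Words in unmatched snippet substrings divided by words in all substrings.

    0.0 means every substring appears in the user view; 1.0 means none does.
    """
    body = " " + normalize(user_view) + " "
    parts = [normalize(p) for p in re.split(r"\.{3}|…", snippet)]
    parts = [p for p in parts if p]
    total = sum(len(p.split()) for p in parts)
    if total == 0:
        return 0.0
    # Pad with spaces so a substring only matches on whole-word boundaries.
    unmatched = sum(len(p.split()) for p in parts if " " + p + " " not in body)
    return unmatched / total
```

With the snippet "cheap flights to paris ... buy pills online now" and a user view containing only the first phrase, half of the snippet words are unmatched, so the score is 0.5, which is above the 0.33 threshold.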


Methodology
• Detecting Cloaking (cont.)
  • False positives may still exist
  • Examining the DOMs as the final test
    • Computing the sum of an overall comparison and a hierarchical comparison
    • Overall comparison: unmatched tags from the entire page divided by the total number of tags
    • Hierarchical comparison: the sum of the unmatched tags from each level of the DOM hierarchy divided by the total number of tags
    • 2.0 means no match
    • 0.0 means a full match
    • Threshold: 0.66
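
One way to read the DOM score described above is sketched below. The slide does not say how tags are matched, so this example assumes tags are compared as multisets of tag names (over the whole page for the overall term, per nesting depth for the hierarchical term) and that the total counts the tags of both pages; the class and function names are invented for the illustration.

```python
from collections import Counter
from html.parser import HTMLParser

# Void elements have no end tag, so they must not increase the depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TagCollector(HTMLParser):
    """Records a multiset of tag names for each nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.levels = {}  # depth -> Counter of tag names

    def handle_starttag(self, tag, attrs):
        self.levels.setdefault(self.depth, Counter())[tag] += 1
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth > 0:
            self.depth -= 1


def tag_levels(html):
    parser = TagCollector()
    parser.feed(html)
    parser.close()
    return parser.levels


def unmatched(a, b):
    """Number of tags present in one multiset but not the other."""
    return sum(((a - b) + (b - a)).values())


def dom_score(html_a, html_b):
    """Overall plus hierarchical comparison: 0.0 (full match) to 2.0 (no match)."""
    levels_a, levels_b = tag_levels(html_a), tag_levels(html_b)
    all_a = sum(levels_a.values(), Counter())
    all_b = sum(levels_b.values(), Counter())
    total = sum(all_a.values()) + sum(all_b.values())
    if total == 0:
        return 0.0
    overall = unmatched(all_a, all_b) / total
    hierarchical = sum(
        unmatched(levels_a.get(d, Counter()), levels_b.get(d, Counter()))
        for d in set(levels_a) | set(levels_b)
    ) / total
    return overall + hierarchical
```

The hierarchical term is what catches pages that reuse the same tags in a different arrangement: two pages with identical tag counts score 0.0 on the overall term but can still exceed the 0.66 threshold if the tags sit at different depths.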


Methodology
• Detecting Cloaking (cont.)
  • Manual inspection
    • False positives: 9.1% (29 of 317) in Google search results and 12% (9 of 75) in Yahoo (benign websites that nevertheless deliver different content to search engines)
    • Advanced browser detection
  • Temporal Remeasurement
    • Dagger remeasures every 4 hours for up to 7 days
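
As a small worked example of the remeasurement schedule, the sketch below lists the times at which a URL would be revisited. The function name and the choice to count from the first detection time are assumptions; the slide only gives the 4-hour interval and the 7-day limit.

```python
from datetime import datetime, timedelta


def remeasurement_times(first_seen, interval_hours=4, max_days=7):
    """Times at which a URL is remeasured after it is first detected."""
    step = timedelta(hours=interval_hours)
    end = first_seen + timedelta(days=max_days)
    times = []
    current = first_seen + step
    while current <= end:
        times.append(current)
        current += step
    return times


schedule = remeasurement_times(datetime(2011, 3, 1, 0, 0))
```

Under these assumptions a URL is remeasured at most 7 * 24 / 4 = 42 times, with the last measurement exactly 7 days after the first detection.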

Results
• Cloaking Over Time
• Sources of Search Terms
• Search Engine Response
• Cloaking Duration
• Cloaked Content
• Domain Infrastructure
• SEO


Conclusion
• Cloaking is a standard technique for constructing scam pages.
• This paper examined the current state of search engine cloaking as used to support Web spam.
• New techniques for identifying cloaking (via the search engine snippets that identify keyword-related content found at the time of crawling)
• Exploring the dynamics of cloaked search results and sites over time.