11 A Hybrid Phish Detection Approach by Identity Discovery and Keywords Retrieval Reporter: 林佳宜 Email: [email protected] 2015/10/17

11

A Hybrid Phish Detection Approach by Identity Discovery and Keywords Retrieval

Reporter: 林佳宜Email: [email protected]/04/20

ReferencesXiang, G., and J.I. Hong. (2009). A

Hybrid Phish Detection Approach by Identity Discovery and Keywords Retrieval. In Proceedings of WWW 2009 Full paper

2

3

OutlineIntroductionHybrid detect approachSystem architectureExperimental evaluationConclusion

IntroductionPhishing is a significant security threat to

the Internet causes tremendous economic loss every year

Proposed a novel hybrid phish detection method

information extraction (IE) 、 information retrieval (IR)

Identity-based component by directly discovering the inconsistency between their

identity and the identity they are imitating.

Keywords-retrieval component utilizes IR algorithms exploiting the power of search engines

to identify phish

4

Define phish webpageSatisfying the following criteria

◦ It impersonates well-known websites by replicating the whole or part of the target sites, showing high visual similarity to its targets.

◦ It is associated with a domain usually unrelated to that of its target website.

◦ It has a login form requesting sensitive information.

5

Phishing Site of eBay

6

Exploits a few propertiesWebsite brand names usually appear

title, copyright field

The domain keyword is the segment in the domain representing the brand name

“Paypal” for “paypal.com

Phishing webpages are much less likely to be crawled and indexed by major search engines

short-lived nature few in-coming links

7

Hybrid approach(1/2)Consists of an

identity-based detection component keywords-retrieval detection component

Requires no training data, no prior knowledge of phishing signatures

and specific implementations and thus is able to adapt quickly to the

constantly appearing new phishing patterns

8

Hybrid approach(2/2)Relies on identity recognition to

find the domain of the page’s declared identity examines the legitimacy of the webpage

by comparing this extracted domain with its own domain site:declared brand domain “page domain”

Not directly match two domain strings some closely related domains (e.g., company

affiliations) such as “blogger.com” and “blogspot.com”

9

Named Entity RecognizerThe NE identity recognition module

augments the retrieval-based one in cases brand names are absent in title and copyright field to

control false positives an auxiliary module to the identitybased

component to reduce false positives

Identifies from the page content meta keywords/description tags

Using the well-known TF-IDF scoring function

get top ranking keywords

10

TF-IDF AlgorithmYield a weight that measures how

important a word is to a document in a corpus

Term Frequency (TF)◦ The number of times a given term appears in a

specific document◦ Measure of the importance of the term within

the particular documentInverse Document Frequency (IDF)

◦ Measure how common a term is across an entire collection of documents

A term has a high TF-IDF weight◦ A high term frequency in a given document ◦ A low document frequency in the whole

collection of documents 11

12

Login form detection(1/2)Using the HTML DOM to identify

login forms

Forms on a page is characterized by three properties

FORM tags INPUT tags login keywords such as password

13

Login form detection(2/2)Designed the following algorithm to

declare the existence of a login form form tags, input tags and login keywords all

appear in the DOM Return true if all three are found

form and input tags are found, but login-related keywords exist outside

searching keyword “search” Return true if a match is found

forms and inputs are detected, but phishers put login keywords in images

phishers put login keywords in images and refrain from using text to avoid being detected

return true if no text is found and only images exist

14

System architecture

15

Two strategiesTo accurately map a brand name to its

domain,define two strategies in selecting domains when domain query matches occur.

Strategy I: evaluate domain-query matches among the top 5 search

results of Google and Yahoo If both search engines have such matches and the

domain of the No.1 match from each side coincides take it as a candidate domain of the brand

corresponding to the query If only one search engine has matches, we take the

No.1 domain as a candidate brand domain

Strategy II: Just take the two branches corresponding to the italicized part

of strategy I.

16

Data and UsageOur webpage collection consists of

phishing cases from one source, and good webpages from six sources

Phishing pages collecting a total of 7906 phishing webpages from

Phishtank

Good pages came from Alexa.com Google: pages with keywords “signin” and “login” 3Sharp Yahoo :directory’s bank category,Yahoo misc

pages prominent pages

17

Detecting Login FormsSuccessfully detected 99.82%

phishing pages with login formsRemaining 0.18% (14 in absolute

number) phishing pages they either do not have a login form use login keywords not in our list such as

“serial key” or organize the form/input tags in a way

our method misses

18

Identity-based Detection under Strategy I(1/2)Experimented with five approaches,

i.e., detection by ◦title◦copyright◦TF-IDF◦title + copyright + NE◦a full-blown method with a combination

of the four

19

Identity-based Detection under Strategy I(2/2)All individual detection algorithms

have low FP(< 1.5%)

20

Identity-based Detection across StrategiesConsidering the different sizes of

the phish (7906) and legitimate (3543) corpus

21

Evaluation with Other TF-IDF ApproachesZhang et al proposed CANTINA

◦a content-based method◦against two state-of-the-art toolbars,

SpoofGuard and Netcraft

22

23

ConclusionPresented the design and

evaluation of a hybrid phish detection method

Achieved a true positive rate of 90.06% with a false positive rate of 1.95%.

Not requiring existing phishing signatures and training data

Questions

24

Documents

11 A Hybrid Phish Detection Approach by Identity Discovery and Keywords Retrieval Reporter: 林佳宜 Email: [email protected] 2015/10/17