Upload
joshua-reynard-webster
View
216
Download
1
Embed Size (px)
Citation preview
11
A Hybrid Phish Detection Approach by Identity Discovery and Keywords Retrieval
Reporter: 林佳宜Email: [email protected]/04/20
ReferencesXiang, G., and J.I. Hong. (2009). A
Hybrid Phish Detection Approach by Identity Discovery and Keywords Retrieval. In Proceedings of WWW 2009 Full paper
2
3
OutlineIntroductionHybrid detect approachSystem architectureExperimental evaluationConclusion
IntroductionPhishing is a significant security threat to
the Internet causes tremendous economic loss every year
Proposed a novel hybrid phish detection method
information extraction (IE) 、 information retrieval (IR)
Identity-based component by directly discovering the inconsistency between their
identity and the identity they are imitating.
Keywords-retrieval component utilizes IR algorithms exploiting the power of search engines
to identify phish
4
Define phish webpageSatisfying the following criteria
◦ It impersonates well-known websites by replicating the whole or part of the target sites, showing high visual similarity to its targets.
◦ It is associated with a domain usually unrelated to that of its target website.
◦ It has a login form requesting sensitive information.
5
Phishing Site of eBay
6
Exploits a few propertiesWebsite brand names usually appear
title, copyright field
The domain keyword is the segment in the domain representing the brand name
“Paypal” for “paypal.com
Phishing webpages are much less likely to be crawled and indexed by major search engines
short-lived nature few in-coming links
7
Hybrid approach(1/2)Consists of an
identity-based detection component keywords-retrieval detection component
Requires no training data, no prior knowledge of phishing signatures
and specific implementations and thus is able to adapt quickly to the
constantly appearing new phishing patterns
8
Hybrid approach(2/2)Relies on identity recognition to
find the domain of the page’s declared identity examines the legitimacy of the webpage
by comparing this extracted domain with its own domain site:declared brand domain “page domain”
Not directly match two domain strings some closely related domains (e.g., company
affiliations) such as “blogger.com” and “blogspot.com”
9
Named Entity RecognizerThe NE identity recognition module
augments the retrieval-based one in cases brand names are absent in title and copyright field to
control false positives an auxiliary module to the identitybased
component to reduce false positives
Identifies from the page content meta keywords/description tags
Using the well-known TF-IDF scoring function
get top ranking keywords
10
TF-IDF AlgorithmYield a weight that measures how
important a word is to a document in a corpus
Term Frequency (TF)◦ The number of times a given term appears in a
specific document◦ Measure of the importance of the term within
the particular documentInverse Document Frequency (IDF)
◦ Measure how common a term is across an entire collection of documents
A term has a high TF-IDF weight◦ A high term frequency in a given document ◦ A low document frequency in the whole
collection of documents 11
12
Login form detection(1/2)Using the HTML DOM to identify
login forms
Forms on a page is characterized by three properties
FORM tags INPUT tags login keywords such as password
13
Login form detection(2/2)Designed the following algorithm to
declare the existence of a login form form tags, input tags and login keywords all
appear in the DOM Return true if all three are found
form and input tags are found, but login-related keywords exist outside
searching keyword “search” Return true if a match is found
forms and inputs are detected, but phishers put login keywords in images
phishers put login keywords in images and refrain from using text to avoid being detected
return true if no text is found and only images exist
14
System architecture
15
Two strategiesTo accurately map a brand name to its
domain,define two strategies in selecting domains when domain query matches occur.
Strategy I: evaluate domain-query matches among the top 5 search
results of Google and Yahoo If both search engines have such matches and the
domain of the No.1 match from each side coincides take it as a candidate domain of the brand
corresponding to the query If only one search engine has matches, we take the
No.1 domain as a candidate brand domain
Strategy II: Just take the two branches corresponding to the italicized part
of strategy I.
16
Data and UsageOur webpage collection consists of
phishing cases from one source, and good webpages from six sources
Phishing pages collecting a total of 7906 phishing webpages from
Phishtank
Good pages came from Alexa.com Google: pages with keywords “signin” and “login” 3Sharp Yahoo :directory’s bank category,Yahoo misc
pages prominent pages
17
Detecting Login FormsSuccessfully detected 99.82%
phishing pages with login formsRemaining 0.18% (14 in absolute
number) phishing pages they either do not have a login form use login keywords not in our list such as
“serial key” or organize the form/input tags in a way
our method misses
18
Identity-based Detection under Strategy I(1/2)Experimented with five approaches,
i.e., detection by ◦title◦copyright◦TF-IDF◦title + copyright + NE◦a full-blown method with a combination
of the four
19
Identity-based Detection under Strategy I(2/2)All individual detection algorithms
have low FP(< 1.5%)
20
Identity-based Detection across StrategiesConsidering the different sizes of
the phish (7906) and legitimate (3543) corpus
21
Evaluation with Other TF-IDF ApproachesZhang et al proposed CANTINA
◦a content-based method◦against two state-of-the-art toolbars,
SpoofGuard and Netcraft
22
23
ConclusionPresented the design and
evaluation of a hybrid phish detection method
Achieved a true positive rate of 90.06% with a false positive rate of 1.95%.
Not requiring existing phishing signatures and training data
Questions
24