Page 1: Clickstream ppt   copy

1

JSPM’s RAJARSHI SHAHU COLLEGE OF ENGG. PUNE-33

BE Computer Engineering

Preliminary Project Presentation

2013-14

DEPARTMENT OF COMPUTER ENGG.

Page 2: Clickstream ppt   copy

2

Identifying Fraudulent Activities Over Online Application Through Clickstream

Analysis

Group No.: 08

Guided by:-Ms. V.M.Barkade

Exam Seat No. Name of Student

B80374244 Nikita Hiremath

B80374265 Surbhi Sonkhaskar

B80374270 Shital More

Page 3: Clickstream ppt   copy

3

Introduction

• The Internet has become an integral part of people's day-to-day activities.

• Fraud followed the advent of e-commerce.

• Every online application that involves monetary transactions must safeguard the money people invest through it.

• Clickstream analysis is one technique that helps detect fraud by analyzing user behavior.

Page 4: Clickstream ppt   copy

4

Problem Statement

To develop a business solution for identifying fraudulent activities on an online application through clickstream analysis using Hadoop.

Page 5: Clickstream ppt   copy

5

Scope

1. Only order frauds are detected.

2. The system is limited to detecting the fraudulent user.

3. No recovery measures are provided.

4. Analysis is performed only for a particular session of a user.

5. There is no restriction on the number of clicks or on the time stamp.

Page 6: Clickstream ppt   copy

6

Literature Survey

• Paper 1: Wichian Premchaiswadi; Walisa Romsaiyud, “Extracting WebLog of Siam University for Learning User Behavior on MapReduce”, Siam University, Thailand. (ICIAS 2012)

• Paper 2: Narayanan Sadagopan; Jie Li, “Characterizing Typical and Atypical User Sessions in Clickstreams”, Yahoo!. WWW 2008.

• Paper 3: Bimal Viswanath; Ansley Post; Krishna P. Gummadi; Alan Mislove, “An Analysis of Social Network-based Sybil Defenses”, MPI-SWS and Northeastern University. SIGCOMM ’10.

Page 7: Clickstream ppt   copy

7

Literature Survey

Paper 1

Abstract: MapReduce is a framework that allows developers to write applications that rapidly process and analyze large volumes of data on a massively parallel scale. Moreover, a clickstream is a record of a user's activity on the Internet. Using a clickstream analysis we can collect, analyze, and report aggregate data about which pages visitors visit in what order – and which are the result of the succession of mouse clicks each visitor makes. Clickstream analysis can reveal usage patterns leading to a heightened understanding of users’ behavior. In this paper, we introduced a novel and efficient web log mining model for web users clustering. In general, our model consists of three main steps: 1) Computing the similarity measure of any path in a web page, 2) Defining the k-mean clustering for group customerID, 3) Generating the report based on the Hadoop MapReduce Framework. Consequently, our experiments were run on real world data derived from weblogs of Siam University at Bangkok, Thailand (www.siam.edu).

Page 8: Clickstream ppt   copy

8

In this paper they have proposed:

The paper shows how two algorithms, similarity calculation over the graph and fuzzy k-means clustering, can be used to analyze user behavior from the clickstream. These algorithms take graphs and a data set as input, respectively.

From this paper we have referred:

A study of an existing system that used clickstream analysis to examine user behavior on an educational website.
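As a rough illustration of the clustering step described for Paper 1, the sketch below groups users by a single numeric feature with plain (not fuzzy) k-means. It is not the paper's implementation: the function name `kmeans_1d`, the one-dimensional feature (for example, page views per user) and the evenly spread starting centroids are all assumptions made for this example.

```python
def kmeans_1d(values, k, iterations=20):
    """Cluster one-dimensional values into k groups with plain k-means.

    Returns (centroids, clusters). Hypothetical helper for illustration only.
    """
    ordered = sorted(values)
    # Assumed deterministic start: pick k values spread evenly over the sorted data.
    centroids = [ordered[(i * (len(ordered) - 1)) // max(k - 1, 1)] for i in range(k)]
    clusters = []
    for _ in range(iterations):
        # Assignment step: each value joins the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda c: abs(v - centroids[c]))
            clusters[nearest].append(v)
        # Update step: each centroid moves to the mean of its cluster.
        new_centroids = [
            sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # converged: assignments can no longer change
        centroids = new_centroids
    return centroids, clusters


# Hypothetical page-view counts for six users.
page_views = [1, 2, 3, 50, 51, 52]
centroids, clusters = kmeans_1d(page_views, k=2)
print(centroids, clusters)
```

With these made-up counts the light and heavy users fall into separate clusters. A fuzzy variant would instead give each user a degree of membership in every cluster.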

Page 9: Clickstream ppt   copy

9

Paper 2

Abstract:

Millions of users retrieve information from the Internet using search engines. Mining these user sessions can provide valuable information about the quality of user experience and the perceived quality of search results. Often search engines rely on accurate estimates of Click Through Rate (CTR) to evaluate the quality of user experience. The vast heterogeneity in the user population and presence of automated software programs (bots) can result in high variance in the estimates of CTR. To improve the estimation accuracy of user experience metrics like CTR, we argue that it is important to identify typical and atypical user sessions in clickstreams. Our approach to identify these sessions is based on detecting outliers using Mahalanobis distance in the user session space. Our user session model incorporates several key clickstream characteristics including a novel conformance score obtained by Markov Chain analysis. Editorial results show that our approach of identifying typical and atypical sessions has a precision of about 89%. Filtering out these atypical sessions reduces the uncertainty (95% confidence interval) of the mean CTR by about 40%. These results demonstrate that our approach of identifying typical and atypical user sessions is extremely valuable for cleaning “noisy” user session data for increased accuracy in evaluating user experience.

Page 10: Clickstream ppt   copy

10

In this paper they have proposed: Markov chain analysis to improve the detection of typical and atypical user sessions. They also use Click Through Rate (CTR) to evaluate the quality of user experience.

From this paper we have referred: Various techniques for classifying users as typical or atypical based on the clicks they make. The paper suggests a few models, namely the click-based model, the time-based model and the hybrid model, by which sessions can be divided and analyzed. The concept of Click Through Rate is also taken from this paper.
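To make the idea of a Markov chain conformance score concrete, here is a minimal sketch. The slides do not give the paper's exact definition, so the score used here (the average log-probability of a session's page-to-page transitions) is an assumption, as are the function names, the `floor` probability for unseen transitions and the sample sessions.

```python
import math
from collections import defaultdict


def fit_transitions(sessions):
    """Estimate first-order transition probabilities from page sequences."""
    counts = defaultdict(lambda: defaultdict(int))
    for session in sessions:
        for src, dst in zip(session, session[1:]):
            counts[src][dst] += 1
    probs = {}
    for src, dsts in counts.items():
        total = sum(dsts.values())
        probs[src] = {dst: n / total for dst, n in dsts.items()}
    return probs


def conformance_score(session, probs, floor=1e-6):
    """Average log-probability of the session's transitions (assumed definition).

    Transitions never seen in training get the small probability `floor`,
    so sessions that follow unusual paths receive a much lower score.
    """
    steps = list(zip(session, session[1:]))
    if not steps:
        return 0.0
    log_likelihood = sum(
        math.log(probs.get(src, {}).get(dst, floor)) for src, dst in steps
    )
    return log_likelihood / len(steps)


# Hypothetical training sessions from ordinary shoppers.
training = [
    ["home", "search", "product", "cart"],
    ["home", "search", "product", "cart"],
    ["home", "product", "cart"],
]
probs = fit_transitions(training)
typical = conformance_score(["home", "search", "product", "cart"], probs)
atypical = conformance_score(["cart", "cart", "cart", "cart"], probs)
print(typical > atypical)
```

A session that keeps hitting the cart page scores far below one that follows a common browsing path, which is the kind of signal an outlier detector could then use.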

Page 11: Clickstream ppt   copy

11

Paper 3

Abstract: Recently, there has been much excitement in the research community over using social networks to mitigate multiple identity, or Sybil, attacks. A number of schemes have been proposed, but they differ greatly in the algorithms they use and in the networks upon which they are evaluated. As a result, the research community lacks a clear understanding of how these schemes compare against each other, how well they would work on real-world social networks with different structural properties, or whether there exist other (potentially better) ways of Sybil defense. In this paper, we show that, despite their considerable differences, existing Sybil defense schemes work by detecting local communities (i.e., clusters of nodes more tightly knit than the rest of the graph) around a trusted node. Our finding has important implications for both existing and future designs of Sybil defense schemes. First, we show that there is an opportunity to leverage the substantial amount of prior work on general community detection algorithms in order to defend against Sybils. Second, our analysis reveals the fundamental limits of current social network-based Sybil defenses: We demonstrate that networks with well-defined community structure are inherently more vulnerable to Sybil attacks, and that, in such networks, Sybils can carefully target their links in order to make their attacks more effective.

Page 12: Clickstream ppt   copy

12

In this paper they have proposed:

An analysis of Sybil attacks on social networking sites, showing how even a well-structured site can be targeted by such attacks.

From this paper we referred:

More information about Sybil attacks on online social networks. We understood that a Sybil attack on an online shopping website cannot block the site completely, but a partial Sybil attack can be carried out through order frauds.

Page 13: Clickstream ppt   copy

13

Requirement Analysis

Software Requirement

• Shell Script

• Apache Hadoop 0.20.x

• Pig Script 0.9.1

• Ubuntu 12.04

Hardware Requirements

• Processor: Intel Pentium IV 2.1 GHz or above

• Clock speed: 500 MHz

• RAM: 128 MB

• HD: 20 GB or higher

Page 14: Clickstream ppt   copy

14

Proposed System

Data gathering

Storing and structuring data

Extraction of weblogs

Pattern matching and map reduce algorithm

Data analysis

Data visualization

HDFS
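The flow above can be sketched in miniature as a map step and a reduce step that count clicks per session. This is only an in-memory illustration of the MapReduce pattern, not Hadoop or Pig code; the log format `timestamp session_id url`, the function names and the sample lines are assumptions made for this example.

```python
from collections import defaultdict


def map_clicks(log_line):
    """Map step: emit (session_id, 1) for one log line.

    Assumes the hypothetical format 'timestamp session_id url';
    malformed lines are skipped.
    """
    parts = log_line.split()
    if len(parts) != 3:
        return []
    _timestamp, session_id, _url = parts
    return [(session_id, 1)]


def reduce_counts(pairs):
    """Reduce step: sum the values emitted for each key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def count_clicks_per_session(log_lines):
    """Run the map step over every line, then reduce the emitted pairs."""
    pairs = []
    for line in log_lines:
        pairs.extend(map_clicks(line))
    return reduce_counts(pairs)


# Hypothetical weblog lines.
logs = [
    "2013-10-01T10:00:00 s1 /home",
    "2013-10-01T10:00:05 s1 /cart",
    "2013-10-01T10:00:07 s2 /home",
    "bad line",
]
print(count_clicks_per_session(logs))
```

In a real deployment the map and reduce steps would run in parallel across data nodes, with the weblogs read from HDFS rather than from a Python list.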

Page 15: Clickstream ppt   copy

15

SYSTEM DIAGRAMS

• Class diagram

• State Transition Diagram

• System Architecture Diagram

• Use case diagram

• Activity Diagram

• Object Diagram

• Sequence Diagram

• Collaboration Diagram

• State chart Diagram

• Component Diagram

• Deployment Diagram

Page 16: Clickstream ppt   copy

16

WORKING OF THE SYSTEM

USER 1 USER 2 USER 3

FLUME AGENT

Extracting Weblogs

Page 17: Clickstream ppt   copy

17

FLUME AGENT

SERVER

PROVIDING WEBLOGS

HDFS

STORING AND STRUCTURING DATA

Page 18: Clickstream ppt   copy

18

HDFS

DATA NODES

PATTERN MATCHING ALGORITHM

PROVIDING THE MATCHED VALUES

Page 19: Clickstream ppt   copy

19

HDFS

SERVER

PROVIDING PROCESSED DATA

DATA ANALYSIS

DATA VISUALIZATION USING DATA ANALYTICS TOOLS

Page 20: Clickstream ppt   copy

20

Algorithms

The pattern matching algorithms that can be applied are:

1. Brute force algorithm

2. Boyer-Moore algorithm

3. Not So Naive algorithm

4. Knuth-Morris-Pratt algorithm

Of the algorithms listed above, we are going to use the Knuth-Morris-Pratt algorithm, since its running time is linear in the combined length of the text and the pattern, which makes it efficient for matching short as well as long patterns.
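A minimal sketch of Knuth-Morris-Pratt matching follows. The function names and the sample click sequence are assumptions for illustration. The slides do not say what the patterns look like in the actual system, so the example simply searches a list of page names for a repeated cart/order pattern.

```python
def build_failure_table(pattern):
    """failure[i] is the length of the longest proper prefix of
    pattern[:i + 1] that is also a suffix of it."""
    failure = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = failure[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        failure[i] = k
    return failure


def kmp_search(text, pattern):
    """Return the start index of every occurrence of pattern in text.

    Works on strings and on lists (for example, sequences of page names).
    """
    if not pattern:
        return []
    failure = build_failure_table(pattern)
    matches = []
    k = 0  # number of pattern items matched so far
    for i, item in enumerate(text):
        # On a mismatch, fall back in the pattern instead of re-reading text.
        while k > 0 and item != pattern[k]:
            k = failure[k - 1]
        if item == pattern[k]:
            k += 1
        if k == len(pattern):
            matches.append(i - k + 1)
            k = failure[k - 1]  # keep going to find overlapping matches
    return matches


# Hypothetical click sequence for one session.
clicks = ["home", "cart", "order", "cart", "order", "cart", "order"]
print(kmp_search(clicks, ["cart", "order", "cart"]))  # [1, 3]
```

The failure table is built once per pattern in time proportional to its length, and the scan then visits each item of the text exactly once.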

Page 21: Clickstream ppt   copy

21

MATHEMATICAL MODEL

Bernoulli Distribution: This distribution describes any situation where a "trial" results in either "success" or "failure", such as tossing a coin or modeling the success or failure of a surgical procedure. The Bernoulli distribution is defined as $f(x) = p^{x}(1-p)^{1-x}$ for $x \in \{0, 1\}$, where $p$ is the probability that a particular event (e.g., success) will occur.

Arithmetic Mean: The arithmetic mean of a set of data is found by taking the sum of the data and dividing it by the total number of values in the set. A mean is commonly referred to as an average: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$, where $n$ is the total number of elements.

Mode: The mode is the most frequently occurring value in a frequency distribution.
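The Bernoulli definition can be checked numerically with a short sketch. The function names and the sample outcomes are assumptions for illustration. Treating each order as a trial that is either fraudulent (1) or not (0) is one possible reading of the model, not something the slides state.

```python
def bernoulli_pmf(x, p):
    """Probability of outcome x (0 or 1) when success has probability p."""
    if x not in (0, 1):
        raise ValueError("x must be 0 or 1")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    return p ** x * (1 - p) ** (1 - x)


def estimate_p(outcomes):
    """Estimate p as the fraction of trials that were successes."""
    return sum(outcomes) / len(outcomes)


# Hypothetical outcomes: 1 marks a flagged order, 0 a normal one.
outcomes = [0, 0, 1, 0, 0, 0, 1, 0, 0, 0]
p_hat = estimate_p(outcomes)
print(p_hat, bernoulli_pmf(1, p_hat), bernoulli_pmf(0, p_hat))
```

The two probabilities always sum to one, since the formula reduces to p for x = 1 and to 1 - p for x = 0.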

Page 22: Clickstream ppt   copy

22

Median: The median is the middle value of the data when the values are sorted in order.

Variance: The variance ($\sigma^2$) is defined as the sum of the squared distances of each term in the distribution from the mean ($\mu$), divided by the number of terms in the distribution ($N$): $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2$.
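The measures defined in the mathematical model can be computed with Python's standard `statistics` module, as sketched below. The function name `summarize_clicks` and the sample click counts are assumptions for illustration. `pvariance` is used because the variance defined here divides by the number of terms N, which is the population variance.

```python
import statistics


def summarize_clicks(values):
    """Return the mean, median, mode and population variance of values."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        # pvariance divides by N, matching the definition of variance above.
        "variance": statistics.pvariance(values),
    }


# Hypothetical number of clicks in eight sessions.
clicks_per_session = [2, 4, 4, 4, 5, 5, 7, 9]
summary = summarize_clicks(clicks_per_session)
print(summary)
```

For these made-up counts the mean is 5, the median 4.5, the mode 4 and the variance 4. A session whose click count lies many standard deviations from the mean would be a natural candidate for closer inspection.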

Page 23: Clickstream ppt   copy

23

Future Scope

• This system can be implemented for any online commercial application.

• Currently the system only detects fraudulent users; it can be expanded to undertake the necessary authentication steps.

Page 24: Clickstream ppt   copy

24

References

1. Sadagopan, N., and Li, J. Characterizing typical and atypical user sessions in clickstreams. In Proc. of WWW (2008).

2. Wang, G., Konolige, T., Wilson, C., Wang, X., Zheng, H., and Zhao, B. Y. You are How You Click: Clickstream Analysis for Sybil Detection.

3. Premchaiswadi, W., and Romsaiyud, W. Extracting WebLog of Siam University for Learning User Behavior on MapReduce.

4. Yu, H., Kaminsky, M., Gibbons, P. B., and Flaxman, A. SybilGuard: defending against Sybil attacks via social networks. In Proc. of SIGCOMM (2006).

5. Douceur, J. R. The Sybil attack. In Proc. of IPTPS (2002).

Page 25: Clickstream ppt   copy

25

THANK YOU!