Bigdata based Fraud Detection 김민경 [email protected] 2015.04.09


Page 1: Bigdata based fraud detection

Bigdata based Fraud Detection

김민경[email protected]

2015.04.09

Page 2: Bigdata based fraud detection

2

•Outline

• Introduction
• Bigdata
• Machine Learning
• Fraud Detection
• Solutions

Page 3: Bigdata based fraud detection

3

•Outline

• Introduction
• Bigdata
• Machine Learning
• Fraud Detection
• Solutions

Page 4: Bigdata based fraud detection

4

•O2O Platform

Introduction

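O2O was originally devised so that online companies could take over offline markets, but it has since developed into a platform on which offline and online interact with each other ...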

Page 5: Bigdata based fraud detection

5

•FinTech

Introduction

It started from the convenience of payment ...

Page 6: Bigdata based fraud detection

6

• Introduction: Market Share

FinTech and O2O platforms are fighting a war to win customers through convenience and to capture the market first.

Page 7: Bigdata based fraud detection

7

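• Introduction: Authentication

Simple payment with a single ID and password entry

Is it practical to shop by first having an accredited certificate (공인인증서) issued? Is it practical to shop by first having an i-PIN issued?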

Page 8: Bigdata based fraud detection

8

• Introduction: Trade-Off

Security / Convenience

Security and convenience are in a trade-off relationship.

Page 9: Bigdata based fraud detection

9

• Introduction: Success

Security

Convenience

Success requires catching both rabbits at once: security and convenience.

Page 10: Bigdata based fraud detection

10

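• Introduction: Motive

Internet-only bank?

• A bank that provides financial services such as deposits, transfers, loans, and fund investment over the internet and mobile channels

• Characteristics: runs a branchless, low-cost operation and offers lower fees and lower loan rates than commercial banks

• Industrial capital is allowed to hold an equity stake of 30% or more

• Large business groups (61 groups) are restricted: groups such as Samsung and Hyundai Motor with assets of KRW 5 trillion or more that are subject to the Fair Trade Commission's cross-shareholding restrictions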

Page 11: Bigdata based fraud detection

11

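• Introduction: Motive

• Inconvenient financial security devices and processes
• Security cards, OTP
• Who is responsible?
• The trend is for financial security to be handled autonomously
• Broader scope of liability for financial companies
• Re-outsourcing of financial protection work is prohibited, except when the Financial Services Commission permits it
• Punitive surcharge: up to KRW 5 billion
• Stronger penalties: up to 10 years' imprisonment or a fine of up to KRW 100 million
• Administrative fine (newly introduced): up to KRW 50 million for failing to meet the obligation to ensure safety
• Mandatory reporting: the CISO reports the results of the monthly information security inspection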

Page 12: Bigdata based fraud detection

12

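• Introduction: Motive

Financial Services Commission Chairman Shin Je-yoon urged the entire financial sector to complete the build-out of fraud detection systems (FDS) for financial security.

"The prerequisite for pursuing the FinTech promotion plan is the importance of security," he said, warning that "a service whose information security is not assured will end up a house built on sand."

On the FinTech promotion plan, he said, "We will reform the offline-centered financial system so that FinTech can be integrated naturally into finance," and added, "We will redesign the regulation of the electronic financial business."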

Page 13: Bigdata based fraud detection

13

•Outline

• Introduction
• Bigdata
• Machine Learning
• Fraud Detection
• Solutions

Page 14: Bigdata based fraud detection

14

•Bigdata Ecosystem

Bigdata

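• What big data means

Large-scale parallel computing technology, devised because data has become not only vast but also complex and is therefore hard to handle with traditional data processing

• Big data processing technology: technology that processes such complex, massive data efficiently through parallel processing

• Big data processing pipeline

Collect - Store - Process - Analyze - Present
Collect - Process - Analyze - Present - Store

• What big data analytics means: analysis technology that applies machine learning or probabilistic and statistical methods to complex, massive data on top of large-scale parallel computing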

Page 15: Bigdata based fraud detection

15

•Bigdata Ecosystem

Open Source Bigdata Ecosystem

• Query (NoSQL) : Cassandra, HBase, MongoDB and more

• Query (SQL) : Hive, Stinger, Impala, Presto, Shark

• Advanced Analytic : Hadoop, Spark, H2O

• Real time : Storm, Samza, S4, Spark Streaming

Bigdata

Page 16: Bigdata based fraud detection

16

•Bigdata Ecosystem

Bigdata

Page 17: Bigdata based fraud detection

17

Veracity

Bigdata Problems

Bigdata

Value Meaning

Page 18: Bigdata based fraud detection

18

•Bigdata Streaming

Bigdata

Page 19: Bigdata based fraud detection

19

•Analytics Problems

Bigdata

DATA ACQUISITION

DATA ANALYSIS

DATA STORAGE

RESULT

Stream pipeline

Page 20: Bigdata based fraud detection

20

•Stream Problems

Bigdata

Big (Volume)

Complex (Variety)

Speed (Velocity)

Page 21: Bigdata based fraud detection

21

•Lambda Architecture

Bigdata

Page 22: Bigdata based fraud detection

22

•Lambda Architecture

Bigdata

Page 23: Bigdata based fraud detection

23

•Lambda Architecture

Bigdata

Page 24: Bigdata based fraud detection

24

•Lambda Architecture

Bigdata

Page 25: Bigdata based fraud detection

25

•Lambda Architecture

Bigdata

Page 26: Bigdata based fraud detection

26

•Lambda Architecture

Bigdata

Page 27: Bigdata based fraud detection

27

•Lambda Architecture

Bigdata

Page 28: Bigdata based fraud detection

28

•Lambda Architecture

Bigdata

•All data entering the system is dispatched to both the batch layer and the speed layer for processing.

•The batch layer has two functions: (i) managing the master dataset (an immutable, append-only set of raw data), and (ii) pre-computing the batch views.

•The serving layer indexes the batch views so that they can be queried in a low-latency, ad-hoc way.

•The speed layer compensates for the high latency of updates to the serving layer and deals with recent data only.

•Any incoming query can be answered by merging results from batch views and real-time views.
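
The flow described in these bullets can be sketched in a few lines of Python. This is only a minimal in-memory illustration, assuming a per-card event count as the view; the names `ingest`, `run_batch` and `query` and the event layout are hypothetical and do not come from the slides or from any Lambda Architecture library.

```python
from collections import Counter

# Hypothetical events: (timestamp, card_id) pairs. All names are illustrative.
master_dataset = []        # batch layer: immutable, append-only raw data
batch_view = Counter()     # serving layer: precomputed count per card
realtime_view = Counter()  # speed layer: counts for data not yet in the batch view


def ingest(event):
    """Dispatch every incoming event to both the batch layer and the speed layer."""
    master_dataset.append(event)   # appended, never updated in place
    realtime_view[event[1]] += 1   # incremental update on recent data only


def run_batch():
    """Recompute the batch view from the whole master dataset, then reset the speed layer."""
    batch_view.clear()
    batch_view.update(card for _, card in master_dataset)
    realtime_view.clear()          # recent data is now covered by the batch view


def query(card_id):
    """Answer a query by merging the batch view with the real-time view."""
    return batch_view[card_id] + realtime_view[card_id]


ingest((1, "card-A"))
ingest((2, "card-B"))
run_batch()
ingest((3, "card-A"))      # arrives after the last batch run
print(query("card-A"))     # one event from the batch view plus one from the speed layer
```

In a real deployment the batch run takes time and events keep arriving during it, so the speed layer would only drop the events that the finished batch view actually covers; the sketch ignores that overlap.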

Page 29: Bigdata based fraud detection

29

•Lambda Architecture

Bigdata

Page 30: Bigdata based fraud detection

30

•In-Stream Processing

Bigdata

Page 31: Bigdata based fraud detection

31

•Seldon Infrastructure

Bigdata

•Real-Time Layer: responsible for handling the live predictive API requests.
•Storage Layer: various types of storage used by other components.
•Near-time/Offline Layer: components that run compute-intensive or non-realtime jobs.
•Stats Layer: components to monitor and analyze the running system.

Page 32: Bigdata based fraud detection

32

•Pulsar Architecture

Bigdata

Pulsar Deployment Architecture

Page 33: Bigdata based fraud detection

33

•Pulsar Architecture

Bigdata

Page 34: Bigdata based fraud detection

34

•Pulsar Architecture

Bigdata

The Pulsar pipeline includes the following components:

• Collector: Ingests events through a REST endpoint
• Sessionizer: Sessionizes the events, maintaining the session state and generating marker events
• Distributor: Filters and mutates events to different consumers; acts as an event router
• Metrics calculator: Calculates metrics by various dimensions and persists them in the metrics store
• Replay: Replays the failed events on other stages
• ConfigApp: Configures dynamic provisioning for the whole pipeline
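
To make the Sessionizer role concrete, the following is a generic sessionization sketch in Python. It is not Pulsar's implementation or API: the `Sessionizer` class, the inactivity gap and the marker tuples are all assumptions made for illustration.

```python
SESSION_GAP = 30 * 60  # assumed inactivity timeout in seconds (not from the slides)


class Sessionizer:
    """Groups events per key into sessions separated by inactivity gaps."""

    def __init__(self, gap=SESSION_GAP):
        self.gap = gap
        self.state = {}  # key -> (session_id, last_seen_timestamp)

    def process(self, key, ts):
        """Return (session_id, markers) for one event; markers describe session boundaries."""
        markers = []
        prev = self.state.get(key)
        if prev is None:
            session_id = 1
            markers.append(("session_start", key, session_id))
        elif ts - prev[1] > self.gap:
            # The previous session timed out: close it and open a new one.
            markers.append(("session_end", key, prev[0]))
            session_id = prev[0] + 1
            markers.append(("session_start", key, session_id))
        else:
            session_id = prev[0]
        self.state[key] = (session_id, ts)
        return session_id, markers


s = Sessionizer(gap=60)
first = s.process("user-1", 0)     # opens session 1
second = s.process("user-1", 30)   # still session 1
third = s.process("user-1", 200)   # gap exceeded: session 1 ends, session 2 starts
print(first, second, third)
```

This version only notices that a session has ended when the next event for the same key arrives; a production sessionizer would also need a timer to expire idle sessions.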

Page 35: Bigdata based fraud detection

35

•Pulsar Architecture

Bigdata

• Complex Event Processing: SQL on stream data

• Custom sub-stream creation: Filtering and Mutation

• In-Memory Aggregation: Multi-Dimensional counting

Page 36: Bigdata based fraud detection

36

•Pulsar Architecture

Bigdata

Page 37: Bigdata based fraud detection

37

•Realtime Ecosystem

Bigdata

Page 38: Bigdata based fraud detection

38

•Realtime Ecosystem

Bigdata

Page 39: Bigdata based fraud detection

39

•Outline

• Introduction
• Bigdata
• Machine Learning
• Fraud Detection
• Solutions

Page 40: Bigdata based fraud detection

40

•What is ?

Machine Learning

It starts from data ...

• Machine + Learning

• Teaching a machine (a computer) how to learn from data

Data is therefore the textbook.

Page 41: Bigdata based fraud detection

41

•ML Types

Machine Learning

• Supervised learning
  • Used when the class of the data is known (categories, labels)
  • e.g. spam mail

• Unsupervised learning
  • Used when the class of the data is unknown but its patterns are of interest
  • e.g. SNS, Twitter

• Semi-supervised learning: supervised + unsupervised learning

• Reinforcement learning
  • Mistakes are fed back into learning

• Evolutionary learning (GA, AIS)

• Meta learning: landmark of data for classifier

Page 42: Bigdata based fraud detection

42

•Lifecycle on Realtime

Machine Learning

ML Modeling

ML Deploy

ML Optimizer

New Data

Decision Making

Alert

Anomaly Store

Hadoop DFS/NoSQL/Hive

Page 43: Bigdata based fraud detection

43

•Genetic algorithm

Abnormal Behavior

Machine Learning

Page 44: Bigdata based fraud detection

44

• Association Rule Mining

Machine Learning

Page 45: Bigdata based fraud detection

45

• Finite State Automata (FSA)

Since the tests can be grouped, a state can represent several tests being performed at the same time. For example, T34 means that T3 and T4 can be done simultaneously.

Machine Learning

Page 46: Bigdata based fraud detection

46

•Clustering

Machine Learning

Page 47: Bigdata based fraud detection

47

•Hidden Markov

Sequence Based Algorithm

•Certain fraudulent activities may not be detectable with instance-based algorithms

•For example, when each transaction involves only a small amount of money, instance-based algorithms will fail to detect the fraud

Machine Learning

Page 48: Bigdata based fraud detection

48

•Decision Tree

Profiling?

Machine Learning

Page 49: Bigdata based fraud detection

49

• Support Vector Machine

Machine Learning

Page 50: Bigdata based fraud detection

50

•Neural Network

Single Layer Feed Forward Model

Machine Learning

Page 51: Bigdata based fraud detection

51

• anti-k nearest neighbor

Outlier Detection

Machine Learning

Page 52: Bigdata based fraud detection

52

• Comparison of Three Algorithms

Machine Learning

Page 53: Bigdata based fraud detection

53

•Outline

• Introduction
• Bigdata
• Machine Learning
• Fraud Detection
• Solutions

Page 54: Bigdata based fraud detection

54

•Banking

• Traffic

Fraud Detection

Daily transactions: 20,000,000
Daily logs: 200,000,000

Period      Total count        Transactions
7 days      1,400,000,000      140,000,000
21 days     4,200,000,000      420,000,000
30 days     6,000,000,000      600,000,000
60 days     12,000,000,000     1,200,000,000
90 days     18,000,000,000     1,800,000,000
120 days    24,000,000,000     2,400,000,000
150 days    30,000,000,000     3,000,000,000
180 days    36,000,000,000     3,600,000,000
360 days    72,000,000,000     7,200,000,000
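
The figures in this table are consistent with simple multiplication of the two daily volumes by the number of days. The short Python check below reproduces them under that assumption; the function name `volume` is hypothetical, and reading the total count as log records is an inference from the arithmetic rather than something the slide states.

```python
DAILY_TRANSACTIONS = 20_000_000   # from the slide
DAILY_LOGS = 200_000_000          # from the slide


def volume(days):
    """Return (total count, transactions) accumulated over `days` at constant daily volume."""
    return days * DAILY_LOGS, days * DAILY_TRANSACTIONS


for days in (7, 21, 30, 60, 90, 120, 150, 180, 360):
    total, txns = volume(days)
    print(f"{days:>3} days  {total:>15,}  {txns:>13,}")
```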

Page 55: Bigdata based fraud detection

55

•Fraud Detection

Credit card data (70-80 variables per transaction):
• Transaction ID
• Transaction type
• Date and time of transaction (to nearest second)
• Amount
• Currency
• Local currency amount
• Merchant category
• Card issuer ID
• ATM ID
• POS type
• Cheque account prefix
• Savings account prefix
• Acquiring institution ID
• Transaction authorisation code
• Online authorisation performed
• New card
• Transaction exceeds floor limit
• Number of times chip has been accessed
• Merchant city name
• Chip terminal capability
• Chip card verification result

Card

Page 56: Bigdata based fraud detection

56

• Fraud Detection Basics

Fraud Detection

Speed is the key!!!

• Many transactions (billions)
• Algorithms must be efficient
• Mixed variable types (generally not text or images)
• Large number of variables
• Incomprehensible variables, irrelevant variables
• Different misclassification costs
• Many ways of committing fraud
• Unbalanced class sizes (c. 0.1% of transactions fraudulent)
• Delay in labelling
• Mislabelled classes
• Random transaction arrival times
• (Reactive) population drift

• Maintain a sliding buffer of the last billion transactions in RAM (fast memory)
• Organize the transactions in such a way that some queries can be executed very fast
• Develop some clever algorithms that operate on this data structure
• Will it work??? Yes, it will!!! Yes, it does …
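
One way to picture the sliding buffer with a fast per-card query is the sketch below. It is an assumption about how such a structure could look, not the design behind the slide: the class `SlidingBuffer`, its fields and the tuple layout are hypothetical, and the capacity is tiny so the eviction is visible.

```python
from collections import defaultdict, deque


class SlidingBuffer:
    """Keeps the last `capacity` transactions in memory, indexed by card."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.buffer = deque()              # all transactions, oldest first
        self.by_card = defaultdict(deque)  # card_id -> its transactions, oldest first

    def add(self, card_id, amount, ts):
        txn = (card_id, amount, ts)
        self.buffer.append(txn)
        self.by_card[card_id].append(txn)
        if len(self.buffer) > self.capacity:
            old = self.buffer.popleft()    # evict the oldest transaction overall
            card_txns = self.by_card[old[0]]
            card_txns.popleft()            # it is also the oldest one for its card
            if not card_txns:
                del self.by_card[old[0]]

    def recent(self, card_id, since_ts):
        """Transactions of one card with timestamp >= since_ts, without scanning the whole buffer."""
        return [t for t in self.by_card.get(card_id, ()) if t[2] >= since_ts]


buf = SlidingBuffer(capacity=3)
buf.add("A", 10.0, 1)
buf.add("B", 5.0, 2)
buf.add("A", 7.0, 3)
buf.add("A", 9.0, 4)       # pushes the transaction at ts=1 out of the buffer
print(buf.recent("A", 0))
```

The per-card index is what makes the "some queries very fast" point work: a lookup touches only one card's recent history instead of the full buffer.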

Page 57: Bigdata based fraud detection

57

• Fraud Detection Basics

Fraud Detection

Challenge: real-time detection!
• Monitor in real time all POS/ATM transactions
• Detect unusual patterns and block compromised cards as quickly as possible
• Ideally: block compromised cards before fraud is discovered!
• A big question: can we do it???
• Some numbers:
  • 3,000,000,000 transactions per year
  • up to 15,000,000 transactions per day
  • up to 400 transactions per second (peak hours)
  • 100,000,000 cards

Page 58: Bigdata based fraud detection

58

• Fraud Detection Basics

Fraud Detection

•Self Healing

•Multi datacenter failovers

•State management

•Shutdown Orchestration

•Dynamic Partitioning

•Elastic Clusters

•Dynamic Flow Routing

•Dynamic Topology Changes

Page 59: Bigdata based fraud detection

59

• Fraud Detection Basics

Fraud Detection

–SQL like language for specifying processing rules

–Analysis over rolling and tumbling windows of time

–Filtering and Joining streams

–Grouping and Ordering output

–For routing events between stages and between clusters

–Event Mutation

–Correlation

–Patterns
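
The difference between tumbling and rolling windows mentioned in this list can be illustrated with plain Python. This is a sketch only, assuming integer timestamps that arrive in sorted order; the function names are hypothetical and do not correspond to any stream-processing product's SQL dialect.

```python
from collections import defaultdict, deque


def tumbling_counts(timestamps, size):
    """Count events in fixed, non-overlapping windows [k*size, (k+1)*size)."""
    counts = defaultdict(int)
    for ts in timestamps:
        counts[(ts // size) * size] += 1
    return dict(counts)


def rolling_counts(timestamps, size):
    """For each event, count the events in the trailing window (ts - size, ts]."""
    window = deque()
    result = []
    for ts in timestamps:              # assumed to be sorted
        window.append(ts)
        while window[0] <= ts - size:  # drop events that fell out of the window
            window.popleft()
        result.append(len(window))
    return result


events = [1, 2, 9, 11, 12, 25]
print(tumbling_counts(events, 10))  # {0: 3, 10: 2, 20: 1}
print(rolling_counts(events, 10))   # [1, 2, 3, 3, 3, 1]
```

A tumbling window emits one result per fixed interval, while a rolling window is re-evaluated as each event arrives, which is why it suits alerts such as "too many transactions in the last N minutes".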

Page 60: Bigdata based fraud detection

60

• Fraud Detection Basics

Fraud Detection

•Rolling window aggregation over long time windows

(hours or days)

•Session store scaling to 1 million insert/update per sec

•Dynamic Joins with graphs and RDBMS tables

•Auto scaling based on load sensing

•Hot deployment of Java source code

Page 61: Bigdata based fraud detection

61

• Fraud Detection Basics

• Outlier detection
  • detecting data points that don't follow the trends and patterns in the data
  • rule-based detection
  • anomaly detection

• Two approaches for treating input
  • focus on an instance of a data point
  • focus on a sequence of data points

• Three kinds of algorithms
  • building a model out of data
  • using data directly
  • immune system based on temporal data

• Real-time fraud detection
  • feasible with a model-based approach
  • a model is built with batch processing of training data
  • a real-time stream processor uses the model and makes predictions in real time

Fraud Detection
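
The model-based approach on this slide (a model built in batch, then applied by a stream processor) can be sketched as follows. The model here is deliberately trivial, a mean and standard deviation with a z-score threshold, and is an assumption chosen for brevity; the training data, the threshold and the function names are made up.

```python
import statistics


def build_model(training_amounts):
    """Batch step: summarize historical amounts as a mean and a standard deviation."""
    return {
        "mean": statistics.fmean(training_amounts),
        "std": statistics.pstdev(training_amounts),
    }


def score_stream(model, stream, threshold=3.0):
    """Streaming step: yield (amount, is_outlier) for each incoming amount."""
    for amount in stream:
        if model["std"] > 0:
            z = abs(amount - model["mean"]) / model["std"]
        else:
            z = 0.0
        yield amount, z > threshold


history = [20.0, 25.0, 22.0, 30.0, 18.0, 24.0, 27.0, 21.0]  # made-up training data
model = build_model(history)
flags = list(score_stream(model, [23.0, 500.0]))
print(flags)
```

The split matters for latency: the expensive pass over historical data happens offline, and the per-transaction work in the stream is a constant-time comparison against the stored model.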

Page 62: Bigdata based fraud detection

62

•Economy Imperative

• Not worth spending $200m to stop $20m fraud

• The Pareto principle
  • the first 50% of fraud is easy to stop
  • next 25% takes the same effort
  • next 12.5% takes the same effort

• Resources available for fraud detection are always limited
  • around 3% of police resources go on fraud?
  • this will not significantly increase

• If we cannot outspend the fraudsters we must out-think them

Fraud Detection
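
The 50% / 25% / 12.5% progression implies that each equal unit of effort stops half of the fraud that remains. Under that reading, which is an interpretation of the slide rather than a stated formula, the cumulative share stopped after n units is 1 - 0.5^n:

```python
def fraud_stopped(effort_units):
    """Cumulative fraction of fraud stopped if every unit of effort halves what remains."""
    return 1 - 0.5 ** effort_units


for n in range(1, 6):
    print(n, f"{fraud_stopped(n):.3%}")
```

After five units of effort about 97% is stopped, and the sixth unit buys only about 1.6 percentage points more, which is the diminishing return the slide argues from.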

Page 63: Bigdata based fraud detection

63

•Types of Anomaly

Fraud Detection

Page 64: Bigdata based fraud detection

64

•Fraud Detection

AIS are adaptive systems inspired by theoretical immunology and observed immune functions, principles and models, which are applied to complex problem domains

•Immune system needs to be able to differentiate between self and non-self cells

•May result in cell death, therefore:
  • Some kind of positive selection (Clonal Selection)
  • Some kind of negative selection

Artificial Immune Systems
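
Negative selection is commonly turned into an anomaly detector by generating random detectors and discarding any that match known "self" samples. The sketch below shows that idea on binary strings; the Hamming-distance matching rule, the radius, the sample patterns and all function names are assumptions for illustration and are not taken from the slides.

```python
import random


def matches(detector, sample, radius):
    """A detector matches a sample if they differ in at most `radius` positions."""
    return sum(a != b for a, b in zip(detector, sample)) <= radius


def generate_detectors(self_set, n_detectors, length, radius, seed=0):
    """Negative selection: keep random candidates only if they match no self sample."""
    rng = random.Random(seed)
    detectors = []
    while len(detectors) < n_detectors:
        candidate = tuple(rng.randint(0, 1) for _ in range(length))
        if not any(matches(candidate, s, radius) for s in self_set):
            detectors.append(candidate)
    return detectors


def is_nonself(sample, detectors, radius):
    """A sample is flagged as non-self when any detector matches it."""
    return any(matches(d, sample, radius) for d in detectors)


# Made-up "normal" patterns standing in for self.
self_set = [(0, 0, 0, 0, 0, 0, 0, 0), (0, 0, 0, 0, 1, 1, 1, 1)]
detectors = generate_detectors(self_set, n_detectors=20, length=8, radius=1)
print(is_nonself((1, 1, 1, 1, 0, 0, 0, 0), detectors, radius=1))
```

By construction no detector fires on a self sample. Whether a given non-self sample is caught depends on how well the random detectors cover the space, so the printed result is not guaranteed.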

Page 65: Bigdata based fraud detection

65

•Fraud Detection

A type of agranulocyte (non-granular white blood cell) involved in immune function; it accounts for about 30% of all white blood cells.

•T cell

•Helper T cell

•Cytotoxic T cell (killer T cell)

•Suppressor T cell

•B cell

•Natural killer cell (NK cell)

Lymphocyte

Page 66: Bigdata based fraud detection

66

•Fraud Detection

B cells are the lymphocytes that produce antibodies.

B cell

Page 67: Bigdata based fraud detection

67

•Fraud Detection

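T cells, or T lymphocytes, are one of the lymphocytes responsible for antigen-specific adaptive immunity. They mature in the thymus, and the name T cell comes from its first letter. About three quarters of all lymphocytes are T cells.

T cells are classified into naive T cells, which have not yet encountered an antigen; effector T cells, which have matured after encountering an antigen (helper T cells, cytotoxic T cells, natural killer T cells); and memory T cells.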

T cell

Page 68: Bigdata based fraud detection

68

•Fraud Detection

each antibody can recognize a single antigen

Antibody, Antigen

Page 69: Bigdata based fraud detection

69

•Fraud Detection Biological Immune System

Page 70: Bigdata based fraud detection

70

•Danger Theory

•Proposed by Polly Matzinger, around 1995

•Traditional self/non-self theory doesn’t always match observations

•Immune system always responds to non-self
•Immune system always tolerates self

•Antigen-presenting cell (APC): T-cell activation by APCs

•Danger theory relates innate and adaptive immune systems
•Tissues induce tolerance towards themselves
•Tissues protect themselves and select class of response

Fraud Detection

Page 71: Bigdata based fraud detection

71

•Tissues induce tolerance by
•Lymphocytes receive 2 signals

•antigen/lymphocyte binding
•antigen is properly presented by APC

•Signal 1 WITHOUT signal 2: lymphocyte death

•Tissues protect themselves
•Alarm signals activate APCs

•Alarm signals come from
•Cells that die unnaturally
•Cells under stress

•APCs activate lymphocytes

•Tissues dictate response type
•Alarm signals may convey information

Danger Theory

Fraud Detection

Page 72: Bigdata based fraud detection

72

•Danger Theory

Fraud Detection

Page 73: Bigdata based fraud detection

73

•Artificial Immune Systems

Fraud Detection

•Vectors
  Ab = {Ab1, Ab2, ..., AbL}
  Ag = {Ag1, Ag2, ..., AgL}

•Real-valued shape-space

•Integer shape-space

•Binary shape-space

•Symbolic shape-space

$$D = \sqrt{\sum_{i=1}^{L} (Ab_i - Ag_i)^2}$$
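
The distance D between an antibody vector Ab and an antigen vector Ag can be computed directly. The Euclidean version below follows the formula on this slide for a real-valued shape-space; the Hamming variant for a binary shape-space is a common choice added here as an assumption, since the slide gives no formula for it. The function names are hypothetical.

```python
import math


def euclidean_distance(ab, ag):
    """Distance between antibody and antigen in a real-valued shape-space."""
    if len(ab) != len(ag):
        raise ValueError("Ab and Ag must have the same length L")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(ab, ag)))


def hamming_distance(ab, ag):
    """Distance in a binary shape-space: the number of differing positions."""
    if len(ab) != len(ag):
        raise ValueError("Ab and Ag must have the same length L")
    return sum(x != y for x, y in zip(ab, ag))


print(euclidean_distance([1.0, 2.0, 3.0], [4.0, 6.0, 3.0]))  # 5.0
print(hamming_distance([1, 0, 1, 1], [1, 1, 0, 1]))          # 2
```

A smaller distance means a closer match between antibody and antigen, so recognition is usually defined as the distance falling below some threshold.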

Artificial Immune System

Page 74: Bigdata based fraud detection

74

•Fraud Detection

Meta-Frameworks

Artificial Immune System

Page 75: Bigdata based fraud detection

75

•Fraud Detection Hybrid Immune Learning

Page 76: Bigdata based fraud detection

76

•Fraud Detection

For the natural immune system, all cells of the body are categorized into two types, self and non-self. The immune process is to detect non-self among these cells.

Use the Positive Selection Algorithm (PSA) to perform non-self detection for recognizing malicious executables.

Non-self Detection Principle

Page 77: Bigdata based fraud detection

77

•Fraud Detection Network Security

Page 78: Bigdata based fraud detection

78

•Fraud Detection Network Security

Architecture of anomaly detection system.

Page 79: Bigdata based fraud detection

79

•Fraud Detection Intrusion Detection Systems

Page 80: Bigdata based fraud detection

80

•Outline

• Introduction
• Bigdata
• Machine Learning
• Fraud Detection
• Solutions

Page 81: Bigdata based fraud detection

81

•Neural Stream Architecture

Solutions

Modeling

Fork

New data stream

Alert

Page 82: Bigdata based fraud detection

82

•Neural Stream Architecture

Solutions

Fork

New data stream

Batch Modeling

Online Compute

Online Learning

Convergence

Alert

Page 83: Bigdata based fraud detection

83

•Neural Stream Architecture

Solutions

Agent

TimeReducer

Daily

Weekly

Monthly

TwoMonth

ThreeMonth

FourMonth

FiveMonth

SixMonth

SevenMonth

EightMonth

NineMonth

TenMonth

TwoWeek

ThreeWeek

TwoDay

ThreeDay

FourDay

FiveDay

SixDay

FourWeek

MetaParser

TimeStore

Long Transaction Memory

BlackList

SeccueCode

POSEntry

Page 84: Bigdata based fraud detection

84

•Neural Stream Architecture

Solutions

Velocity

Volume

Variety

Veracity

NeuralStream

Big Data

On-line Learning Neural Architecture

Machine Learning Platform

Page 85: Bigdata based fraud detection

85

•Neural Stream FDS

Solutions

Page 86: Bigdata based fraud detection

86

•Solutions

• Storage
  • Hadoop
    • HDFS: Distributed File System (DFS)
    • MapReduce: parallel processing

• Algorithms
  • on-line learning (Immune System and Genetic Algorithms)
  • batch model
  • direct data

• Stream
  • Neural stream
    • Decentralized decision process
    • Cell-based detection
    • Network for Artificial Immune Systems
  • Lambda architecture and Pulsar can't use on-line learning

Neural Stream

Page 87: Bigdata based fraud detection

87

• Classical rule-based approach

• Always "too late":
  • New fraud pattern is "invented" by criminals
  • Cardholders lose money and complain
  • Banks investigate complaints and try to understand the new pattern
  • A new rule is implemented a few weeks later
• Expensive to build (knowledge intensive)
• Difficult to maintain:
  • Many rules
  • The situation is dynamically changing, so rules frequently have to be added, modified, or removed …

Solutions

Page 88: Bigdata based fraud detection

88

•Solutions

• Every bank user gets a vector of parameters that describe his/her behavior: an “average-behavior” profile

• The system constantly compares this "long-term" profile with the recent behavior of the cardholder

• Transactions that do not fit into the bank user's profile are flagged as suspicious (or are blocked)

• Profiles are updated with every single transaction, so the system constantly adapts to (slow and small) changes in the bank user's behavior

A system based on profiles
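
A profile of this kind can be approximated with an exponentially weighted running mean and variance per user. The sketch below tracks only the transaction amount and is an assumption about one possible realization: the class `Profile`, the smoothing weight, the threshold and the warm-up length are hypothetical values, not parameters from the slides.

```python
class Profile:
    """Running "average-behavior" profile of one user's transaction amounts."""

    def __init__(self, alpha=0.05, threshold=4.0, warmup=5):
        self.alpha = alpha          # weight of the newest transaction (assumed value)
        self.threshold = threshold  # deviations beyond this many std devs are suspicious
        self.warmup = warmup        # transactions to observe before flagging anything
        self.n = 0
        self.mean = 0.0
        self.var = 0.0

    def check_and_update(self, amount):
        """Return True if the amount does not fit the profile, then update the profile."""
        suspicious = False
        if self.n >= self.warmup and self.var > 0:
            suspicious = abs(amount - self.mean) > self.threshold * self.var ** 0.5
        if self.n == 0:
            self.mean = amount
        else:
            # Exponentially weighted updates: old behavior fades slowly.
            delta = amount - self.mean
            self.mean += self.alpha * delta
            self.var = (1 - self.alpha) * (self.var + self.alpha * delta * delta)
        self.n += 1
        return suspicious


profile = Profile()
flags = [profile.check_and_update(a) for a in [20, 25, 18, 30, 22, 24, 500]]
print(flags)
```

Because the profile is updated with every transaction, including flagged ones, a small `alpha` keeps a single outlier from shifting the long-term profile much. A real system would track many more of the transaction variables than the amount alone.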

Page 89: Bigdata based fraud detection

89

•Solutions: Solve the problems

Page 90: Bigdata based fraud detection

Q&A

Thanks