25
FlonCon 2013 | January 7–10 | Albuquerque, New Mexico John Munro / [email protected] Jason Trost / [email protected]

Clairvoyant Squirrel: Large Scale Malicious Domain Classification

Embed Size (px)

DESCRIPTION

Clairvoyant Squirrel: Large Scale Malicious Domain Classification. Presentation from FloCon 2013

Citation preview

Page 1: Clairvoyant Squirrel: Large Scale Malicious Domain Classification

FlonCon 2013 | January 7–10 | Albuquerque, New Mexico

John Munro / [email protected] Jason Trost / [email protected]

Page 2: Clairvoyant Squirrel: Large Scale Malicious Domain Classification

• John Munro ([email protected])

– Network Security Researcher and Data Scientist

• Jason Trost ([email protected])

– Senior Software Engineer

– Specializes in Hadoop/Storm/BigData

Introductions

Page 3: Clairvoyant Squirrel: Large Scale Malicious Domain Classification

• The Problem

• Our Approach

• DGA Domain Classifier

• String Statistics as Features

• Malicious Domain Classifier

• Demo

• Real-time Streaming Platform

Agenda

Page 4: Clairvoyant Squirrel: Large Scale Malicious Domain Classification

The Problem

yahoo.com

docs.joomla.org

youtube.com

bibz01.apple.com

p4.httzd5e2ufizo.3bawhfuec45dca65.401724.s1.v4.ipv6-exp.l.google.com

ns3.ohio.gov

abulqe.com za6.limfoklubs.com

Ct0u2xj5dbe4.wvw—game465.com

txmxbo.info

Wmk41035u3751s0bgv4n91b0b7h74v.ipcheker.com

Page 5: Clairvoyant Squirrel: Large Scale Malicious Domain Classification

The Problem

yahoo.com

docs.joomla.org

youtube.com

bibz01.apple.com

p4.httzd5e2ufizo.3bawhfuec45dca65.401724.s1.v4.ipv6-exp.l.google.com

ns3.ohio.gov

abulqe.com za6.limfoklubs.com

Ct0u2xj5dbe4.wvw—game465.com

txmxbo.info

Wmk41035u3751s0bgv4n91b0b7h74v.ipcheker.com

Page 6: Clairvoyant Squirrel: Large Scale Malicious Domain Classification

• Massive Volumes

– Some of our partners deal with TBs per day of DNS PCAPs

• Incredible Rates

– One partner sees 13k requests/sec

– Another closer to 100k/sec

The Problem

Page 7: Clairvoyant Squirrel: Large Scale Malicious Domain Classification

• Real-time streaming classification – In parallel across multiple servers

• Markov Models – Random Domain Generation Traffic

– Normal Benign Traffic

• Random Forests – Benign vs Malicious

• Periodically retrained – In order to maintain accuracy

Our Approach: Machine Learning!

Page 8: Clairvoyant Squirrel: Large Scale Malicious Domain Classification

• Benign Domains

– Millions of popular, real domains

• Correlated with the Alexa top 10k domains

• Malicious Domains

– 800k domains gathered from an internal malware sandbox

– Public blacklist domains from Conficker and Murofet Botnets

Data Sources

Page 9: Clairvoyant Squirrel: Large Scale Malicious Domain Classification

Markov Models

Page 10: Clairvoyant Squirrel: Large Scale Malicious Domain Classification

• Domain Generation Algorithm (DGA)

• Popular Domain Model – Trained: 258,039 domains from Day 1 of our Benign set

– Tested: 331,359 domains from Day 2 of our Benign set

– Accuracy: 99.40 % with 1,458 Unknown

• Randomly Generated Domain Model – Trained: 90,884 domains from Conficker Botnet

– Tested: 295,306 domains from Murofet Botnet

– Accuracy: 99.34 % with 1,923 Unknown

Markovian DGA Classifier

Page 11: Clairvoyant Squirrel: Large Scale Malicious Domain Classification

String Statistics as Features

Page 12: Clairvoyant Squirrel: Large Scale Malicious Domain Classification

Feature Usefulness

Page 13: Clairvoyant Squirrel: Large Scale Malicious Domain Classification

Random Forests Algorithm

FPO VIDEO TO COME

Page 14: Clairvoyant Squirrel: Large Scale Malicious Domain Classification

• Pros: – Very high accuracy – Scalable across many nodes – Built-in protection from over fitting – Can handle very large data sets with many features – Robust with respect to goodness of features – Practical for real world use – Does not assume a distribution – Only two parameters to tune – Memory efficient

• Cons: – Not the quickest classifier, but plenty fast in practice

Random Forests

Page 15: Clairvoyant Squirrel: Large Scale Malicious Domain Classification

• Performance measured by 10 – fold Cross Validation

• Training Set

– 200k Benign

– 200k Malicious

Malicious Domain Classifier

0

0.01

0.02

0.03

0.04

0.05

0.06

10 20 30 40 50 60 70 80 90 100

Ou

t o

f B

ag E

rro

r

Number of Trees

Cross Validation

K = 3 K = 5 K = 10

Page 16: Clairvoyant Squirrel: Large Scale Malicious Domain Classification

Results

0.975

0.976

0.977

0.978

0.979

0.98

0.981

0.982

0.983

0.984

10 20 30 40 50 60 70 80 90 100

Pre

cisi

on

Number of Trees

Bad Precision

0.9745

0.975

0.9755

0.976

0.9765

0.977

0.9775

0.978

0.9785

0.979

0.9795

10 20 30 40 50 60 70 80 90 100

Pre

cisi

on

Number of Trees

Good Precision

0.976

0.9765

0.977

0.9775

0.978

0.9785

0.979

0.9795

0.98

0.9805

10 20 30 40 50 60 70 80 90 100

Acc

ura

cy

Number of Trees

Bad Accuracy

0.976

0.9765

0.977

0.9775

0.978

0.9785

0.979

0.9795

0.98

0.9805

10 20 30 40 50 60 70 80 90 100

Acc

ura

cy

Number of Trees

Good Accuracy

K = 3 K = 5 K = 10

Page 17: Clairvoyant Squirrel: Large Scale Malicious Domain Classification

Results

0

10,000

20,000

30,000

40,000

50,000

60,000

10 20 30 40 50 60 70 80 90 100

Cla

ssif

ica

tio

ns/

sec

Number of Trees

Classification Throughput

K=3

K=5

K=10

0

50

100

150

200

250

300

350

400

10 20 30 40 50 60 70 80 90 100

Mo

del

Siz

e (M

B)

Number of Trees

Model Size

K=3

K=5

K=10

Page 18: Clairvoyant Squirrel: Large Scale Malicious Domain Classification

Results

Page 19: Clairvoyant Squirrel: Large Scale Malicious Domain Classification

Demo

Page 20: Clairvoyant Squirrel: Large Scale Malicious Domain Classification

• Velocity is a platform for processing, analyzing, and visualizing large-scale event data in real-time

• It was designed to be horizontally scalable and is built using Twitter’s Storm

• It was built primarily for internal use with DNS events, IDS alerts, and netflow data, but it is in the process of being commercialized

Realtime Streaming Platform

Page 21: Clairvoyant Squirrel: Large Scale Malicious Domain Classification

Velocity Pipeline

Page 22: Clairvoyant Squirrel: Large Scale Malicious Domain Classification

• Malicious domain classification

• DGA domain identification using Markov Models

• Summary Statistics based on domain string work well

• Random Forests are very successful at classifying domains as Benign or Malicious

• Real-time, distributed implementation

Conclusion

Page 23: Clairvoyant Squirrel: Large Scale Malicious Domain Classification

• Include more features: TTL, frequency seen, etc.

• Correlation of bad domains based on ASN, Country, Organization, etc.

• Identify subnets that are infected based on high traffic to bad domains

• Identify Content Delivery Networks

• Self Organizing Maps and other visualizations

Future Work

Page 24: Clairvoyant Squirrel: Large Scale Malicious Domain Classification

Questions

Page 25: Clairvoyant Squirrel: Large Scale Malicious Domain Classification

• John Munro

• Email: [email protected]

• Jason Trost

• Email: [email protected]

• Twitter: @jason_trost

• Blog: www.covert.io

Contact Information