Pattern Recognition and Applications Lab, Department of Electrical and Electronic Engineering, University of Cagliari, Italy

Is Data Clustering in Adversarial Settings Secure?

Battista Biggio (1), Ignazio Pillai (1), Samuel Rota Bulò (2), Davide Ariu (1), Marcello Pelillo (3), and Fabio Roli (1)

(1) Università di Cagliari (IT); (2) FBK-irst (IT); (3) Università Ca’ Foscari di Venezia (IT)

Berlin, 4 November 2013

Battista Biggio @ AISec 2013 - Is Data Clustering in Adversarial Settings Secure?


DESCRIPTION

Clustering algorithms have been increasingly adopted in security applications to spot dangerous or illicit activities. However, they were not originally devised to deal with deliberate attack attempts that may aim to subvert the clustering process itself. Whether clustering can be safely adopted in such settings thus remains questionable. In this work we propose a general framework that allows one to identify potential attacks against clustering algorithms, and to evaluate their impact, by making specific assumptions on the adversary's goal, knowledge of the attacked system, and capabilities of manipulating the input data. We show that an attacker may significantly poison the whole clustering process by adding a relatively small percentage of attack samples to the input data, and that some attack samples may be obfuscated to be hidden within some existing clusters. We present a case study on single-linkage hierarchical clustering, and report experiments on clustering of malware samples and handwritten digits.


Page 2

http://pralab.diee.unica.it

Motivation: is clustering secure?

•  Data clustering increasingly applied in security-sensitive tasks
   –  e.g., malware clustering for anti-virus / IDS signature generation

•  Carefully targeted attacks may mislead the clustering process

[Figure: clusters before and after an attack. Samples can be added to merge (and split) existing clusters]

[Figure: clusters containing hidden attack samples. Samples can be obfuscated and hidden within existing clusters (e.g., fringe clusters)]

(1) D. B. Skillicorn. Adversarial knowledge discovery. IEEE Intelligent Systems, 24:54–61, 2009.
(2) J. G. Dutrisac and D. Skillicorn. Hiding clusters in adversarial settings. In IEEE Int’l Conf. Intelligence and Security Informatics, pp. 185–187, 2008.

Page 3

Our work

•  Framework for security evaluation of clustering algorithms
   1.  Definition of potential attacks
   2.  Empirical evaluation of their impact

•  Adversary’s model
   –  Goal
   –  Knowledge
   –  Capability
   –  Attack strategy

•  Inspired by previous work on adversarial learning
   –  Barreno et al., Can machine learning be secure?, ASIACCS 2006
   –  Huang et al., Adversarial machine learning, AISec 2011
   –  Biggio et al., Security evaluation of pattern classifiers under attack, IEEE Trans. Knowledge and Data Eng., 2013

Page 4

Adversary’s goal

•  Security violation
   –  Integrity: hiding clusters / malicious activities without compromising normal system operation
      •  e.g., creating fringe clusters
   –  Availability: compromising normal system operation by altering the clustering output
      •  e.g., merging existing clusters
   –  Privacy: gaining confidential information about system users by reverse-engineering the clustering process

•  Attack specificity
   –  Targeted: affects clustering of a given subset of samples
   –  Indiscriminate: affects clustering of any sample

Page 5

Adversary’s knowledge

•  The adversary may know:
   –  the input data
   –  the feature representation
   –  the clustering algorithm
   –  the algorithm parameters (e.g., initialization)

•  Perfect knowledge
   –  upper bound on the performance degradation under attack

Page 6

Adversary’s capability

•  Attacker’s capability is bounded:
   –  maximum number of samples that can be added to the input data
      •  e.g., the attacker may only control a small fraction of malware samples collected by a honeypot
   –  maximum amount of modifications (distance in feature space)
      •  e.g., malware samples should preserve their malicious functionality

[Figure: feasible domain (e.g., L1-norm) around a sample x in the (x1, x2) feature space, containing the manipulated sample x']

$$\| x - x' \|_1 \le d_{\max}$$
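The L1-norm bound on this slide can be checked numerically. The following is a minimal sketch, assuming samples are plain numeric vectors; the function name `is_feasible` is hypothetical and does not come from the slides.

```python
import numpy as np

def is_feasible(x, x_prime, d_max):
    """Return True if x_prime lies within L1 distance d_max of x."""
    x = np.asarray(x, dtype=float)
    x_prime = np.asarray(x_prime, dtype=float)
    # L1 distance: sum of absolute coordinate differences.
    return bool(np.abs(x - x_prime).sum() <= d_max)

x = np.array([1.0, 2.0])
print(is_feasible(x, [1.5, 2.25], d_max=1.0))  # L1 distance is 0.75
print(is_feasible(x, [2.0, 3.0], d_max=1.0))   # L1 distance is 2.0
```

Other norms (such as the L2 norm used later for obfuscation attacks) would only change the distance computation.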

Page 7

Formalizing the optimal attack strategy

$$\max_{A'} \; \mathbb{E}_{\theta \sim \mu} \left[ g(A'; \theta) \right] \quad \text{s.t.} \quad A' \in \Omega(A)$$

•  Attacker’s goal: the objective function g
•  Knowledge of the data, features, …: the distribution μ over θ
•  Capability of manipulating the input data: the constraint set Ω(A)

Perfect knowledge:

$$\mathbb{E}_{\theta \sim \mu} \left[ g(A'; \theta) \right] = g(A'; \theta_0)$$

Page 8

Poisoning attacks (availability violation)

•  Goal: maximally compromising the clustering output on D
•  Capability: adding m attack samples

$$\max_{A'} \; g(A'; \theta_0) = d_c\left( C, f_D(D \cup A') \right) \quad \text{s.t.} \quad A' \in \Omega_p = \left\{ \{a'_i\}_{i=1}^{m} \subset \mathbb{R}^d \right\}$$

Here C = f(D) is the clustering of the untainted data, and f_D(D ∪ A') is the clustering of the poisoned data restricted to the samples in D.

[Figure: the clustering C = f(D) before the attack, and f(D ∪ A') after adding the attack samples A']

Heuristics tailored to the clustering algorithm for efficient solution!

Page 9

Single-linkage hierarchical clustering

•  Bottom-up agglomerative clustering
   –  each point is initially considered as a cluster
   –  closest clusters are iteratively merged
   –  single-linkage criterion

$$\operatorname{dist}(C_i, C_j) = \min_{a \in C_i,\; b \in C_j} d(a, b)$$

[Figure: example data set and its dendrogram (merge distances from 0 to 0.8); the dendrogram cut yields the clustering C = f(D)]
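The procedure on this slide can be reproduced with SciPy. This is a minimal sketch, assuming Euclidean distances and a cut that asks for k flat clusters; the helper name `single_linkage` and the toy data are hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def single_linkage(D, k):
    """Cluster the rows of D into (at most) k clusters with single-linkage HC."""
    # Z encodes the dendrogram: one row per merge (cluster a, cluster b, distance, size).
    Z = linkage(D, method="single", metric="euclidean")
    # Cut the dendrogram so that no more than k flat clusters are formed.
    labels = fcluster(Z, t=k, criterion="maxclust")
    return Z, labels

# Toy data (hypothetical): two well-separated groups of 2D points.
D = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [2.0, 2.0], [2.1, 2.0], [2.0, 2.1]])
Z, labels = single_linkage(D, k=2)
print(labels)
```

Note that `criterion="maxclust"` may return fewer than k clusters when several merges happen at exactly the same distance.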

Page 10

Poisoning attacks vs. single-linkage HC

For a given cut criterion:

$$\max_{A'} \; g(A'; \theta_0) = d_c\left( C, f_D(D \cup A') \right) \quad \text{s.t.} \quad A' \in \Omega_p$$

The distance between two clusterings is

$$d_c(Y, Y') = \left\| Y Y^T - Y' Y'^T \right\|_F$$

where Y is the sample-by-cluster assignment matrix (rows correspond to Sample 1 through Sample 5), e.g.,

$$Y = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 0 & 1 \\ 0 & 0 & 1 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \end{pmatrix}, \qquad Y Y^T = \begin{pmatrix} 1 & 0 & 0 & 1 & 0 \\ 0 & 1 & 1 & 0 & 0 \\ 0 & 1 & 1 & 0 & 0 \\ 1 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 1 \end{pmatrix}$$

We assume the most advantageous criterion for the clustering algorithm: the dendrogram cut is chosen to minimize the attacker’s objective!
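The clustering distance on this slide is the Frobenius norm of the difference between two co-membership matrices. A minimal sketch, assuming clusterings are given as integer label vectors over the same samples; the function names are hypothetical.

```python
import numpy as np

def assignment_matrix(labels):
    """One-hot (n_samples x n_clusters) matrix Y built from integer labels."""
    labels = np.asarray(labels)
    clusters = np.unique(labels)
    return (labels[:, None] == clusters[None, :]).astype(float)

def clustering_distance(labels_a, labels_b):
    """d_c(Y, Y') = ||Y Y^T - Y' Y'^T||_F for two labelings of the same samples."""
    Y = assignment_matrix(labels_a)
    Yp = assignment_matrix(labels_b)
    return np.linalg.norm(Y @ Y.T - Yp @ Yp.T, ord="fro")

# Labels reproducing the example matrix Y on the slide (clusters numbered 1..3).
y = [1, 3, 3, 1, 2]
Y = assignment_matrix(y)
print(Y @ Y.T)

# Merging sample 5 into the cluster of samples 1 and 4 changes four entries of
# the co-membership matrix, so the distance is sqrt(4) = 2.0.
y_merged = [1, 3, 3, 1, 1]
print(clustering_distance(y, y_merged))
```

Because only co-membership matters, the distance does not depend on how the clusters are numbered.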

Page 11

Poisoning attacks vs. single-linkage HC

•  Heuristic-based solutions
   –  Greedy approach: adding one attack sample at each iteration

[Figure: dendrogram with the dendrogram cut and the k-1 bridges marked, next to a plot of the objective function over the 2D feature space]

•  Local maxima are often found in between clusters
   –  Close to connections (bridges) that have been cut to obtain the final k clusters
   –  Can be obtained directly from the dendrogram!
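One way to recover the k-1 bridges is sketched below. It relies on the standard equivalence between single-linkage clustering and the minimum spanning tree (MST): cutting the k-1 longest MST edges yields the k single-linkage clusters, assuming distinct edge lengths. Reading the bridges from the MST rather than from the dendrogram, and using bridge midpoints as candidate attack points, are assumptions of this sketch; the function names are hypothetical.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

def find_bridges(D, k):
    """Return the k-1 longest MST edges of D as (i, j, length) tuples."""
    # Dense pairwise Euclidean distances. Zero entries are treated as missing
    # edges by SciPy, so duplicate points are assumed not to occur.
    dist = squareform(pdist(D))
    mst = minimum_spanning_tree(dist).tocoo()
    edges = sorted(zip(mst.row, mst.col, mst.data),
                   key=lambda e: e[2], reverse=True)
    return [(int(i), int(j), float(w)) for i, j, w in edges[:k - 1]]

def bridge_midpoints(D, k):
    """Candidate attack points: midpoints of the k-1 bridges."""
    return np.array([(D[i] + D[j]) / 2.0 for i, j, _ in find_bridges(D, k)])

# Toy data (hypothetical): three groups of points on a line.
D = np.array([[0.0, 0.0], [0.2, 0.0], [1.0, 0.0],
              [1.2, 0.0], [3.0, 0.0], [3.2, 0.0]])
print(find_bridges(D, k=3))
print(bridge_midpoints(D, k=3))
```

In this toy example the two bridges are the gaps of length 1.8 and 0.8, so the candidate attack points lie at x = 2.1 and x = 0.6.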

Page 12

Poisoning attacks vs. single-linkage HC

•  Heuristic-based solutions
   1.  Bridge (Best): evaluates the objective function k-1 times, each time adding an attack point between the two endpoints of a bridge
       –  Requires running the clustering algorithm k-1 times!
   2.  Bridge (Hard): estimates the objective function assuming that each attack point will merge the corresponding clusters
       –  Does not require running the clustering algorithm

[Figure: plot of the objective function over the 2D feature space]
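The Bridge (Best) heuristic can be sketched as a greedy loop. This is an illustration under several assumptions that are not stated on the slides: bridges are taken as the k-1 longest minimum-spanning-tree edges of the current data, attack points are placed at bridge midpoints, k stays fixed at the initial number of clusters, and the cut that minimizes the attacker's objective is found by trying every distinct merge height. All function names are hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

def comembership(labels):
    """Binary matrix Y Y^T: entry (i, j) is 1 if samples i and j share a cluster."""
    labels = np.asarray(labels)
    return (labels[:, None] == labels[None, :]).astype(float)

def objective(D, C, A):
    """d_c(C, f_D(D u A)), with the dendrogram cut chosen to minimize it."""
    X = np.vstack([D, A])
    Z = linkage(X, method="single")
    target = comembership(C)
    best = np.inf
    # Try the all-singletons cut and every distinct merge height.
    for h in np.concatenate(([0.0], np.unique(Z[:, 2]))):
        labels = fcluster(Z, t=h, criterion="distance")[: len(D)]
        best = min(best, np.linalg.norm(target - comembership(labels)))
    return best

def bridge_midpoints(X, k):
    """Midpoints of the k-1 longest MST edges (bridges) of X."""
    mst = minimum_spanning_tree(squareform(pdist(X))).tocoo()
    order = np.argsort(mst.data)[::-1][: k - 1]
    return [(X[mst.row[e]] + X[mst.col[e]]) / 2.0 for e in order]

def bridge_best(D, C, k, m):
    """Greedily add m attack points, each the best of k-1 bridge midpoints."""
    A = np.empty((0, D.shape[1]))
    history = []
    for _ in range(m):
        candidates = bridge_midpoints(np.vstack([D, A]), k)
        # One clustering run per candidate, i.e. k-1 runs per iteration.
        scores = [objective(D, C, np.vstack([A, c])) for c in candidates]
        best = int(np.argmax(scores))
        A = np.vstack([A, candidates[best]])
        history.append(scores[best])
    return A, history

# Toy data (hypothetical): three Gaussian blobs, clustered into k=3 groups.
rng = np.random.default_rng(0)
D = np.vstack([rng.normal(c, 0.3, size=(10, 2)) for c in ([0, 0], [3, 0], [0, 3])])
C = fcluster(linkage(D, method="single"), t=3, criterion="maxclust")
A, history = bridge_best(D, C, k=3, m=5)
print(history)
```

On well-separated data the first attack points may leave the objective at zero, since a favorable cut can still isolate them; this is consistent with the slide's assumption that the cut is chosen to favor the clustering algorithm.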

Page 13

Poisoning attacks vs. single-linkage HC

•  Heuristic-based solutions
   3.  Bridge (Soft): similar to Bridge (Hard), but using soft clustering assignments for Y (estimated with Gaussian KDE)

[Figure: plot of the objective function over the 2D feature space]

[Figure: clustering output after greedily adding 20 attack points]

Page 14

Experiments on poisoning attacks

•  Banana: artificial data, 80 samples, 2 features, k=4 initial clusters

•  Malware: real data (1), 1,000 samples, 6 features, k≈9 initial clusters (estimated from data by minimizing the Davies-Bouldin Index)
   –  Features:
      1.  number of GET requests
      2.  number of POST requests
      3.  average URL length
      4.  average number of URL parameters
      5.  average amount of data sent by POST requests
      6.  average response length

•  MNIST Handwritten Digits: real data, 330 samples per cluster, 28 x 28 = 784 features (pixels), k=3 initial clusters corresponding to digits ‘0’, ‘1’, and ‘6’

(1) R. Perdisci, D. Ariu, and G. Giacinto. Scalable fine-grained behavioral clustering of http-based malware. Computer Networks, 57(2):487–500, 2013.

Page 15

Experiments on poisoning attacks

•  Attack strategies: Bridge (Best), Bridge (Hard), Bridge (Soft), Random, Random (Best)
   –  Random (Best) selects the best random attack over k-1 attempts
   –  Same complexity as Bridge (Best)

[Figure: for each data set (Banana, Malware, Digits), the objective function and the number of clusters (k) as a function of the fraction of samples controlled by the attacker (Banana: 0% to 20%; Malware: 0% to 5%; Digits: 0.0% to 1.0%). Curves: Random, Random (Best), Bridge (Best), Bridge (Soft), Bridge (Hard)]

Page 16

Experiments on poisoning attacks

•  Some attack samples obtained by the given attack strategies on the MNIST Handwritten Digits, at iterations 1, 2, and 10.

[Figure: attack samples at iterations 1, 2, and 10 for Random, Random (Best), Bridge (Best), Bridge (Soft), and Bridge (Hard)]

Page 17

Obfuscation attacks (integrity violation)

•  Goal: hiding attacks A without compromising clustering of D
•  Capability: bounded manipulation of attack samples

$$\max_{A'} \; g(A'; \theta_0) = -d_c\left( C_t, f(D \cup A') \right), \quad \text{where } \pi_D(C_t) = f(D)$$

$$\text{s.t.} \quad A' \in \Omega_o(A) = \left\{ \{a'_i\}_{i=1}^{|A|} : d_s(A, A') = \max_i \| a_i - a'_i \|_2 \le d_{\max} \right\}$$

[Figure: the data D, the attack samples A, and the obfuscated attack samples A' hidden within an existing cluster]

Page 18

Obfuscation attacks vs. single-linkage HC

•  Heuristic-based solution
   –  For each attack sample a in A
   –  Select the closest sample d in D from the cluster to which a should belong, and define a' as

$$a' = a + \alpha \, \frac{d - a}{\| d - a \|_2}, \qquad \alpha = \min\left( d_{\max}, \| d - a \|_2 \right)$$

[Figure: attack sample a moved toward its closest sample d in the target cluster, yielding a']
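The update rule on this slide moves each attack sample toward the closest sample of its target cluster, by at most d_max. A minimal sketch, assuming Euclidean (L2) distances; the function name `obfuscate` and the zero-distance guard are additions for illustration.

```python
import numpy as np

def obfuscate(a, target_cluster, d_max):
    """Move attack sample a toward its closest sample d in the target cluster."""
    a = np.asarray(a, dtype=float)
    T = np.asarray(target_cluster, dtype=float)
    dists = np.linalg.norm(T - a, axis=1)   # ||d - a||_2 for every candidate d
    d = T[np.argmin(dists)]                 # closest sample in the target cluster
    gap = dists.min()
    if gap == 0.0:
        return a.copy()                     # a already coincides with d
    alpha = min(d_max, gap)                 # never overshoot d
    return a + alpha * (d - a) / gap

target = [[3.0, 4.0], [6.0, 8.0]]
print(obfuscate([0.0, 0.0], target, d_max=2.0))   # moves 2 units toward (3, 4)
print(obfuscate([0.0, 0.0], target, d_max=10.0))  # stops exactly at (3, 4)
```

Capping alpha at the distance to d means that a large d_max places the attack sample on top of an existing sample rather than beyond it.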

Page 19

Experiments on obfuscation attacks

•  MNIST Handwritten Digits
   –  Initial clusters ‘0’, ‘1’, ‘6’, ‘3’
   –  Attacker modifies ‘3’s to have them clustered with ‘6’
   –  Attacker minimizes distance from the desired clustering
   –  Clustering minimizes distance from the initial clusters (where ‘3’s are not manipulated)

[Figure: objective function (curves: Clustering, Attacker) and number of clusters (k) as a function of d_max, from 0 to 10]

Page 20

Experiments on obfuscation attacks

[Figure: objective function (curves: Clustering, Attacker) and number of clusters (k) as a function of d_max, from 0 to 10]

Why does the attacker’s objective increase here? Bridging!

[Figure: sketches of the ‘3’ and ‘6’ clusters being bridged by the manipulated samples]

This may suggest a more effective heuristic, based on modifying only a subset of the attack samples!

Page 21

Conclusions and future work

•  Framework for security evaluation of clustering algorithms
•  Definition of poisoning and obfuscation attacks
•  Case study on single-linkage HC highlights vulnerability to attacks
•  Future work
   –  Extensions to other algorithms, common solver for the attack strategy
      •  e.g., black-box optimization with suitable heuristics
   –  Connections with clustering stability
   –  Secure / Robust clustering algorithms

Page 22

Any questions?

Thanks for your attention!