Clustering algorithms have been increasingly adopted in security applications to spot dangerous or illicit activities. However, they have not been originally devised to deal with deliberate attack attempts that may aim to subvert the clustering process itself. Whether clustering can be safely adopted in such settings remains thus questionable. In this work we propose a general framework that allows one to identify potential attacks against clustering algorithms, and to evaluate their impact, by making specific assumptions on the adversary's goal, knowledge of the attacked system, and capabilities of manipulating the input data. We show that an attacker may significantly poison the whole clustering process by adding a relatively small percentage of attack samples to the input data, and that some attack samples may be obfuscated to be hidden within some existing clusters. We present a case study on single-linkage hierarchical clustering, and report experiments on clustering of malware samples and handwritten digits.
Pattern Recognition and Applications Lab
Department of Electrical and Electronic Engineering, University of Cagliari, Italy
Is Data Clustering in Adversarial Settings Secure?
Battista Biggio (1), Ignazio Pillai (1), Samuel Rota Bulò (2), Davide Ariu (1), Marcello Pelillo (3), and Fabio Roli (1)
(1) Università di Cagliari (IT); (2) FBK-irst (IT); (3) Università Ca’ Foscari di Venezia (IT)
Berlin, 4 November 2013
http://pralab.diee.unica.it
Motivation: is clustering secure?
• Data clustering is increasingly applied in security-sensitive tasks
  – e.g., malware clustering for anti-virus / IDS signature generation
• Carefully targeted attacks may mislead the clustering process
(1) D. B. Skillicorn. Adversarial knowledge discovery. IEEE Intelligent Systems, 24:54–61, 2009.
(2) J. G. Dutrisac and D. Skillicorn. Hiding clusters in adversarial settings. In IEEE Int’l Conf. Intelligence and Security Informatics, pp. 185–187, 2008.
[Figure: scatter plots of clusters before and after adding attack samples that bridge them]
Samples can be added to merge (and split) existing clusters
[Figure: scatter plot showing attack samples hidden at the fringe of an existing cluster]
Samples can be obfuscated and hidden within existing clusters (e.g., fringe clusters)
Our work
• Framework for security evaluation of clustering algorithms
  1. Definition of potential attacks
  2. Empirical evaluation of their impact
• Adversary’s model
  – Goal
  – Knowledge
  – Capability
  – Attack strategy
• Inspired by previous work on adversarial learning
  – Barreno et al., Can machine learning be secure?, ASIACCS 2006
  – Huang et al., Adversarial machine learning, AISec 2011
  – Biggio et al., Security evaluation of pattern classifiers under attack, IEEE Trans. Knowledge and Data Eng., 2013
Adversary’s goal
• Security violation
  – Integrity: hiding clusters / malicious activities without compromising normal system operation (e.g., creating fringe clusters)
  – Availability: compromising normal system operation by altering the clustering output (e.g., merging existing clusters)
  – Privacy: gaining confidential information about system users by reverse-engineering the clustering process
• Attack specificity
  – Targeted: affects the clustering of a given subset of samples
  – Indiscriminate: affects the clustering of any sample
Adversary’s knowledge
• The adversary may know: the input data, the feature representation, the clustering algorithm, and the algorithm parameters (e.g., the initialization)
• Perfect knowledge: yields an upper bound on the performance degradation under attack
Adversary’s capability
• The attacker’s capability is bounded by:
  – the maximum number of samples that can be added to the input data
    • e.g., the attacker may only control a small fraction of the malware samples collected by a honeypot
  – the maximum amount of modification (distance in feature space)
    • e.g., malware samples should preserve their malicious functionality
[Figure: feasible domain around a sample x, e.g., an L1-norm ball]
||x − x'||_1 ≤ d_max
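The manipulation constraint can be checked directly in code. A minimal sketch (the function name and sample values are illustrative, not from the slides), assuming numpy and an L1 budget:

```python
import numpy as np

def within_budget(x, x_prime, d_max, norm=1):
    """Check whether a manipulated sample x' stays inside the
    attacker's feasible domain: ||x - x'|| <= d_max."""
    return np.linalg.norm(x - x_prime, ord=norm) <= d_max

x = np.array([1.0, 2.0])
x_prime = np.array([1.5, 1.8])  # candidate manipulated sample, L1 distance 0.7
print(within_budget(x, x_prime, d_max=1.0))
```

The same check with a tighter budget (e.g., d_max = 0.5) would reject the manipulation.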
Formalizing the optimal attack strategy
max_{A'} E_{θ∼μ}[ g(A'; θ) ]   (attacker's goal; the expectation over θ∼μ encodes the knowledge of the data, features, …)
s.t. A' ∈ Ω(A)   (capability of manipulating the input data)

Perfect knowledge: E_{θ∼μ}[ g(A'; θ) ] = g(A'; θ0)
Poisoning attacks (availability violation)
• Goal: maximally compromising the clustering output on D
• Capability: adding m attack samples
max_{A'} g(A'; θ0) = d_c( C, f_D(D ∪ A') )
s.t. A' ∈ Ω_p = { A' = {a'_i}_{i=1}^m ⊂ R^d }
[Figure: the initial clustering C = f(D) and the poisoned clustering f(D ∪ A') after adding the attack samples A']
Heuristics tailored to the clustering algorithm are needed for an efficient solution!
Single-linkage hierarchical clustering
• Bottom-up agglomerative clustering
  – each point is initially considered a cluster
  – the closest clusters are iteratively merged
  – single-linkage criterion:
x x x
x x
x
x x
x x x x
x x x
x
x
dist(C_i, C_j) = min_{a ∈ C_i, b ∈ C_j} d(a, b)

[Figure: dendrogram of the input data; cutting it at a given height yields the clustering C = f(D)]
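The merging procedure above can be sketched in a few lines. This is a naive O(n^3) illustration for clarity, not an efficient implementation (production code would use, e.g., scipy's `linkage`); all names and data here are illustrative:

```python
import numpy as np

def single_linkage(D, k):
    """Naive bottom-up single-linkage clustering: start with one
    cluster per point, then repeatedly merge the two closest clusters
    (closest cross-cluster pair criterion) until k clusters remain."""
    clusters = [[i] for i in range(len(D))]
    while len(clusters) > k:
        best = (None, None, np.inf)
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single-linkage distance: min over cross-cluster pairs
                d = min(np.linalg.norm(D[a] - D[b])
                        for a in clusters[i] for b in clusters[j])
                if d < best[2]:
                    best = (i, j, d)
        i, j, _ = best
        clusters[i] += clusters.pop(j)
    return clusters

# two well-separated groups of 2-D points
D = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [3.0, 3.0], [3.1, 3.0], [3.0, 3.1]])
print(single_linkage(D, k=2))
```

Stopping at k clusters corresponds to cutting the dendrogram at the height of the (k−1)-th last merge.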
Poisoning attacks vs. single-linkage HC
d_c(Y, Y') = || Y Y^T − Y' Y'^T ||_F

where row i of Y encodes the cluster assignment of Sample i, e.g., for five samples in three clusters:

Y =
[ 1 0 0
  0 0 1
  0 0 1
  1 0 0
  0 1 0 ]

Y Y^T =
[ 1 0 0 1 0
  0 1 1 0 0
  0 1 1 0 0
  1 0 0 1 0
  0 0 0 0 1 ]
max_{A'} g(A'; θ0) = d_c( C, f_D(D ∪ A') )
s.t. A' ∈ Ω_p

For a given cut criterion, we assume the most advantageous choice for the clustering algorithm: the dendrogram cut is chosen to minimize the attacker's objective!
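The clustering distance d_c takes a few lines to implement; note that it is invariant to a relabeling of the clusters, since Y Y^T only records which pairs of samples are co-clustered. A small sketch (assuming numpy; variable names are illustrative):

```python
import numpy as np

def clustering_distance(Y1, Y2):
    """d_c(Y, Y') = || Y Y^T - Y' Y'^T ||_F, where each row of Y
    one-hot-encodes the cluster assignment of one sample.
    Y Y^T is the co-clustering matrix: entry (i, j) is 1 iff
    samples i and j fall in the same cluster."""
    return np.linalg.norm(Y1 @ Y1.T - Y2 @ Y2.T, ord='fro')

# the 5-sample example from the slide: samples 1 and 4 share a cluster,
# samples 2 and 3 share a cluster, sample 5 is alone
Y = np.array([[1, 0, 0],
              [0, 0, 1],
              [0, 0, 1],
              [1, 0, 0],
              [0, 1, 0]], dtype=float)
print(clustering_distance(Y, Y))  # identical clusterings are at distance 0
```

Permuting the columns of Y (relabeling the clusters) leaves the distance unchanged, which is exactly why this measure is suitable for comparing clusterings.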
Poisoning attacks vs. single-linkage HC
• Heuristic-based solutions
  – Greedy approach: adding one attack sample at each iteration
[Figure: the attacker's objective over the feature space, and the dendrogram with its cut and the k−1 bridges]
Local maxima are often found in between clusters, close to the connections (bridges) that were cut to obtain the final k clusters. They can be obtained directly from the dendrogram!
Poisoning attacks vs. single-linkage HC
• Heuristic-based solutions
  1. Bridge (Best): evaluates the objective function k−1 times, each time adding an attack point in the middle of a bridge (requires running the clustering algorithm k−1 times!)
  2. Bridge (Hard): estimates the objective function assuming that each attack point will merge the corresponding clusters (does not require running the clustering algorithm)
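As a rough illustration of the bridge idea (not the authors' exact procedure: here the closest cross-cluster pairs are recomputed for every cluster pair instead of being read off the dendrogram, and all names are illustrative), candidate attack points can be placed at the midpoints of the bridges:

```python
import numpy as np
from itertools import combinations

def bridge_attack_points(D, labels):
    """Simplified bridge heuristic: for every pair of clusters, find
    the closest cross-cluster pair of points (a 'bridge') and propose
    an attack point at its midpoint, which tends to merge the two
    clusters under single linkage."""
    candidates = []
    for ci, cj in combinations(sorted(set(labels)), 2):
        A, B = D[labels == ci], D[labels == cj]
        # pairwise distances between the two clusters
        dists = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
        i, j = np.unravel_index(dists.argmin(), dists.shape)
        candidates.append((A[i] + B[j]) / 2.0)
    return np.array(candidates)

D = np.array([[0.0, 0.0], [0.2, 0.0], [1.0, 0.0], [1.2, 0.0]])
labels = np.array([0, 0, 1, 1])
print(bridge_attack_points(D, labels))  # midpoint of the (0.2,0)-(1.0,0) bridge
```

A Bridge (Best)-style strategy would then re-run the clustering once per candidate and keep the point that maximizes the objective.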
Poisoning attacks vs. single-linkage HC
• Heuristic-based solutions
  3. Bridge (Soft): similar to Bridge (Hard), but using soft clustering assignments for Y (estimated with a Gaussian KDE)
[Figure: clustering output after greedily adding 20 attack points]
Experiments on poisoning attacks
• Banana: artificial data, 80 samples, 2 features, k=4 initial clusters
• Malware: real data(1), 1,000 samples, 6 features, k≈9 initial clusters (estimated from the data by minimizing the Davies–Bouldin index)
  – Features:
    1. number of GET requests
    2. number of POST requests
    3. average URL length
    4. average number of URL parameters
    5. average amount of data sent by POST requests
    6. average response length
• MNIST Handwritten Digits: real data, 330 samples per cluster, 28 x 28 = 784 features (pixels), k=3 initial clusters corresponding to digits ‘0’, ‘1’, and ‘6’
(1) R. Perdisci, D. Ariu, and G. Giacinto. Scalable fine-grained behavioral clustering of HTTP-based malware. Computer Networks, 57(2):487–500, 2013.
Experiments on poisoning attacks

• Attack strategies: Bridge (Best), Bridge (Hard), Bridge (Soft), Random, Random (Best)
  – Random (Best) selects the best random attack over k−1 attempts
  – Same complexity as Bridge (Best)

[Figure: objective function and number of clusters (k) vs. the fraction of samples controlled by the attacker, on the Banana, Malware, and Digits data, for Random, Random (Best), Bridge (Best), Bridge (Soft), and Bridge (Hard)]
Experiments on poisoning attacks
• Some attack samples obtained by the given attack strategies on the MNIST Handwritten Digits, at iterations 1, 2, and 10.
[Figure: attack digit images generated by Random, Random (Best), Bridge (Best), Bridge (Soft), and Bridge (Hard) at iterations 1, 2, and 10]
Obfuscation attacks (integrity violation)
• Goal: hiding the attack samples A without compromising the clustering of D
• Capability: bounded manipulation of the attack samples
max_{A'} g(A'; θ0) = −d_c( C_t, f(D ∪ A') ),   where π_D(C_t) = f(D)
s.t. A' ∈ Ω_o(A) = { A' = {a'_i}_{i=1}^{|A|} : d_s(A, A') = max_i ||a_i − a'_i||_2 ≤ d_max }
[Figure: the attack samples A are manipulated into A' to hide them within an existing cluster of the data D]
Obfuscation attacks vs. single-linkage HC
• Heuristic-based solution
  – For each attack sample a in A, select the closest sample d in D from the cluster to which a should belong, and define the manipulated sample a' as:

a' = a + α (d − a) / ||d − a||_2,   α = min( d_max, ||d − a||_2 )
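The update rule above is straightforward to implement. A minimal sketch (assuming numpy; names are illustrative):

```python
import numpy as np

def obfuscate(a, d, d_max):
    """Move attack sample a toward target sample d by at most d_max
    (Euclidean), implementing a' = a + alpha * (d - a)/||d - a||_2
    with alpha = min(d_max, ||d - a||_2)."""
    delta = d - a
    dist = np.linalg.norm(delta)
    if dist == 0.0:
        return a.copy()  # already on target
    alpha = min(d_max, dist)
    return a + alpha * delta / dist

a = np.array([0.0, 0.0])
d = np.array([3.0, 4.0])           # distance 5 from a
print(obfuscate(a, d, d_max=2.0))  # moves 2 units along the direction to d
print(obfuscate(a, d, d_max=10.0)) # budget exceeds the distance: lands on d
```

When the budget d_max exceeds the distance to d, the attack sample is moved exactly onto its target; otherwise it stops at the boundary of the feasible domain.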
[Figure: the attack sample a is moved toward the closest in-cluster sample d, yielding the manipulated sample a']
Experiments on obfuscation attacks
• MNIST Handwritten Digits
  – Initial clusters: ‘0’, ‘1’, ‘6’, ‘3’
  – The attacker modifies ‘3’s to have them clustered with ‘6’s
  – The attacker minimizes the distance from the desired clustering
  – The clustering minimizes the distance from the initial clusters (where the ‘3’s are not manipulated)
[Figure: attacker's and clustering objective, and number of clusters (k), as a function of the maximum manipulation d_max ∈ {0.0, 2.0, 3.0, 4.0, 5.0, 7.0}]
Experiments on obfuscation attacks
[Figure: same plots as before; for intermediate values of d_max the attacker's objective increases]
Why does the attacker's objective increase there?
[Figure: manipulated ‘3’s form a bridge between the ‘3’ and ‘6’ clusters, which are then merged (bridging!)]
This may suggest a more effective heuristic, based on modifying only a subset of the attack samples!
Conclusions and future work
• Framework for security evaluation of clustering algorithms
• Definition of poisoning and obfuscation attacks
• Case study on single-linkage HC highlights its vulnerability to attacks
• Future work
  – Extensions to other algorithms; a common solver for the attack strategy (e.g., black-box optimization with suitable heuristics)
  – Connections with clustering stability
  – Secure / robust clustering algorithms
Any questions? Thanks for your attention!