Clustering algorithms have been increasingly adopted in security applications to spot dangerous or illicit activities. However, they have not been originally devised to deal with deliberate attack attempts that may aim to subvert the clustering process itself. Whether clustering can be safely adopted in such settings remains thus questionable. In this work we propose a general framework that allows one to identify potential attacks against clustering algorithms, and to evaluate their impact, by making specific assumptions on the adversary's goal, knowledge of the attacked system, and capabilities of manipulating the input data. We show that an attacker may significantly poison the whole clustering process by adding a relatively small percentage of attack samples to the input data, and that some attack samples may be obfuscated to be hidden within some existing clusters. We present a case study on single-linkage hierarchical clustering, and report experiments on clustering of malware samples and handwritten digits.
Pattern Recognition and Applications Lab
Department of Electrical and Electronic Engineering, University of Cagliari, Italy
Is Data Clustering in Adversarial Settings Secure?
Battista Biggio (1), Ignazio Pillai (1), Samuel Rota Bulò (2), Davide Ariu (1), Marcello Pelillo (3), and Fabio Roli (1)
(1) Università di Cagliari (IT); (2) FBK-irst (IT); (3) Università Ca’ Foscari di Venezia (IT)
Berlin, 4 November 2013
http://pralab.diee.unica.it
Motivation: is clustering secure?
• Data clustering is increasingly applied in security-sensitive tasks
  – e.g., malware clustering for anti-virus / IDS signature generation
• Carefully targeted attacks may mislead the clustering process
(1) D. B. Skillicorn. Adversarial knowledge discovery. IEEE Intelligent Systems, 24:54–61, 2009.
(2) J. G. Dutrisac and D. Skillicorn. Hiding clusters in adversarial settings. In IEEE Int’l Conf. Intelligence and Security Informatics, pp. 185–187, 2008.
[Figure: scatter plots of clusters before and after adding attack samples that bridge them]
Samples can be added to merge (and split) existing clusters
[Figure: scatter plot showing attack samples hidden at the fringe of an existing cluster]
Samples can be obfuscated and hidden within existing clusters (e.g., fringe clusters)
Our work
• Framework for security evaluation of clustering algorithms
  1. Definition of potential attacks
  2. Empirical evaluation of their impact
• Adversary’s model
  – Goal
  – Knowledge
  – Capability
  – Attack strategy
• Inspired by previous work on adversarial learning
  – Barreno et al., Can machine learning be secure?, ASIACCS 2006
  – Huang et al., Adversarial machine learning, AISec 2011
  – Biggio et al., Security evaluation of pattern classifiers under attack, IEEE Trans. Knowledge and Data Eng., 2013
Adversary’s goal
• Security violation
  – Integrity: hiding clusters / malicious activities without compromising normal system operation (e.g., creating fringe clusters)
  – Availability: compromising normal system operation by altering the clustering output (e.g., merging existing clusters)
  – Privacy: gaining confidential information about system users by reverse-engineering the clustering process
• Attack specificity
  – Targeted: affects the clustering of a given subset of samples
  – Indiscriminate: affects the clustering of any sample
Adversary’s knowledge
• The adversary may know: the input data, the feature representation, the clustering algorithm, and the algorithm parameters (e.g., the initialization)
• Perfect knowledge: yields an upper bound on the performance degradation under attack
Adversary’s capability
• The attacker’s capability is bounded by:
  – the maximum number of samples that can be added to the input data
    • e.g., the attacker may only control a small fraction of the malware samples collected by a honeypot
  – the maximum amount of modification (distance in feature space)
    • e.g., malware samples should preserve their malicious functionality
[Figure: feasible domain around a sample x, e.g., an L1-norm ball]
||x − x'||_1 ≤ d_max
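The manipulation constraint can be checked directly in code. A minimal sketch (the function name and sample values are illustrative, not from the slides), assuming numpy and an L1 budget:

```python
import numpy as np

def within_budget(x, x_prime, d_max, norm=1):
    """Check whether a manipulated sample x' stays inside the
    attacker's feasible domain: ||x - x'|| <= d_max."""
    return np.linalg.norm(x - x_prime, ord=norm) <= d_max

x = np.array([1.0, 2.0])
x_prime = np.array([1.5, 1.8])  # candidate manipulated sample, L1 distance 0.7
print(within_budget(x, x_prime, d_max=1.0))
```

The same check with a tighter budget (e.g., d_max = 0.5) would reject the manipulation.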
Formalizing the optimal attack strategy
max_{A'} E_{θ∼μ}[ g(A'; θ) ]   (attacker's goal; the expectation over θ∼μ encodes the knowledge of the data, features, …)
s.t. A' ∈ Ω(A)   (capability of manipulating the input data)

Perfect knowledge: E_{θ∼μ}[ g(A'; θ) ] = g(A'; θ0)
Poisoning attacks (availability violation)
• Goal: maximally compromising the clustering output on D
• Capability: adding m attack samples
max_{A'} g(A'; θ0) = d_c( C, f_D(D ∪ A') )
s.t. A' ∈ Ω_p = { A' = {a'_i}_{i=1}^m ⊂ R^d }
[Figure: the initial clustering C = f(D) and the poisoned clustering f(D ∪ A') after adding the attack samples A']
Heuristics tailored to the clustering algorithm are needed for an efficient solution!
Single-linkage hierarchical clustering
• Bottom-up agglomerative clustering
  – each point is initially considered a cluster
  – the closest clusters are iteratively merged
  – single-linkage criterion:
x x x
x x
x
x x
x x x x
x x x
x
x
dist(C_i, C_j) = min_{a ∈ C_i, b ∈ C_j} d(a, b)

[Figure: dendrogram of the input data; cutting it at a given height yields the clustering C = f(D)]
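The merging procedure above can be sketched in a few lines. This is a naive O(n^3) illustration for clarity, not an efficient implementation (production code would use, e.g., scipy's `linkage`); all names and data here are illustrative:

```python
import numpy as np

def single_linkage(D, k):
    """Naive bottom-up single-linkage clustering: start with one
    cluster per point, then repeatedly merge the two closest clusters
    (closest cross-cluster pair criterion) until k clusters remain."""
    clusters = [[i] for i in range(len(D))]
    while len(clusters) > k:
        best = (None, None, np.inf)
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single-linkage distance: min over cross-cluster pairs
                d = min(np.linalg.norm(D[a] - D[b])
                        for a in clusters[i] for b in clusters[j])
                if d < best[2]:
                    best = (i, j, d)
        i, j, _ = best
        clusters[i] += clusters.pop(j)
    return clusters

# two well-separated groups of 2-D points
D = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [3.0, 3.0], [3.1, 3.0], [3.0, 3.1]])
print(single_linkage(D, k=2))
```

Stopping at k clusters corresponds to cutting the dendrogram at the height of the (k−1)-th last merge.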
Poisoning attacks vs. single-linkage HC
d_c(Y, Y') = || Y Y^T − Y' Y'^T ||_F

where row i of Y encodes the cluster assignment of Sample i, e.g., for five samples in three clusters:

Y =
[ 1 0 0
  0 0 1
  0 0 1
  1 0 0
  0 1 0 ]

Y Y^T =
[ 1 0 0 1 0
  0 1 1 0 0
  0 1 1 0 0
  1 0 0 1 0
  0 0 0 0 1 ]
max_{A'} g(A'; θ0) = d_c( C, f_D(D ∪ A') )
s.t. A' ∈ Ω_p

For a given cut criterion, we assume the most advantageous choice for the clustering algorithm: the dendrogram cut is chosen to minimize the attacker's objective!
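The clustering distance d_c takes a few lines to implement; note that it is invariant to a relabeling of the clusters, since Y Y^T only records which pairs of samples are co-clustered. A small sketch (assuming numpy; variable names are illustrative):

```python
import numpy as np

def clustering_distance(Y1, Y2):
    """d_c(Y, Y') = || Y Y^T - Y' Y'^T ||_F, where each row of Y
    one-hot-encodes the cluster assignment of one sample.
    Y Y^T is the co-clustering matrix: entry (i, j) is 1 iff
    samples i and j fall in the same cluster."""
    return np.linalg.norm(Y1 @ Y1.T - Y2 @ Y2.T, ord='fro')

# the 5-sample example from the slide: samples 1 and 4 share a cluster,
# samples 2 and 3 share a cluster, sample 5 is alone
Y = np.array([[1, 0, 0],
              [0, 0, 1],
              [0, 0, 1],
              [1, 0, 0],
              [0, 1, 0]], dtype=float)
print(clustering_distance(Y, Y))  # identical clusterings are at distance 0
```

Permuting the columns of Y (relabeling the clusters) leaves the distance unchanged, which is exactly why this measure is suitable for comparing clusterings.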
Poisoning attacks vs. single-linkage HC
• Heuristic-based solutions
  – Greedy approach: adding one attack sample at each iteration
[Figure: the attacker's objective over the feature space, and the dendrogram with its cut and the k−1 bridges]
Local maxima are often found in between clusters, close to the connections (bridges) that were cut to obtain the final k clusters. They can be obtained directly from the dendrogram!
Poisoning attacks vs. single-linkage HC
• Heuristic-based solutions
  1. Bridge (Best): evaluates the objective function k−1 times, each time adding an attack point in the middle of a bridge (requires running the clustering algorithm k−1 times!)
  2. Bridge (Hard): estimates the objective function assuming that each attack point will merge the corresponding clusters (does not require running the clustering algorithm)
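As a rough illustration of the bridge idea (not the authors' exact procedure: here the closest cross-cluster pairs are recomputed for every cluster pair instead of being read off the dendrogram, and all names are illustrative), candidate attack points can be placed at the midpoints of the bridges:

```python
import numpy as np
from itertools import combinations

def bridge_attack_points(D, labels):
    """Simplified bridge heuristic: for every pair of clusters, find
    the closest cross-cluster pair of points (a 'bridge') and propose
    an attack point at its midpoint, which tends to merge the two
    clusters under single linkage."""
    candidates = []
    for ci, cj in combinations(sorted(set(labels)), 2):
        A, B = D[labels == ci], D[labels == cj]
        # pairwise distances between the two clusters
        dists = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
        i, j = np.unravel_index(dists.argmin(), dists.shape)
        candidates.append((A[i] + B[j]) / 2.0)
    return np.array(candidates)

D = np.array([[0.0, 0.0], [0.2, 0.0], [1.0, 0.0], [1.2, 0.0]])
labels = np.array([0, 0, 1, 1])
print(bridge_attack_points(D, labels))  # midpoint of the (0.2,0)-(1.0,0) bridge
```

A Bridge (Best)-style strategy would then re-run the clustering once per candidate and keep the point that maximizes the objective.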
Poisoning attacks vs. single-linkage HC
• Heuristic-based solutions
  3. Bridge (Soft): similar to Bridge (Hard), but using soft clustering assignments for Y (estimated with a Gaussian KDE)
[Figure: clustering output after greedily adding 20 attack points]
Experiments on poisoning attacks
• Banana: artificial data, 80 samples, 2 features, k=4 initial clusters
• Malware: real data(1), 1,000 samples, 6 features, k≈9 initial clusters (estimated from the data by minimizing the Davies–Bouldin index)
  – Features:
    1. number of GET requests
    2. number of POST requests
    3. average URL length
    4. average number of URL parameters
    5. average amount of data sent by POST requests
    6. average response length
• MNIST Handwritten Digits: real data, 330 samples per cluster, 28 x 28 = 784 features (pixels), k=3 initial clusters corresponding to digits ‘0’, ‘1’, and ‘6’
(1) R. Perdisci, D. Ariu, and G. Giacinto. Scalable fine-grained behavioral clustering of HTTP-based malware. Computer Networks, 57(2):487–500, 2013.
Experiments on poisoning attacks

• Attack strategies: Bridge (Best), Bridge (Hard), Bridge (Soft), Random, Random (Best)
  – Random (Best) selects the best random attack over k−1 attempts
  – Same complexity as Bridge (Best)

[Figure: objective function and number of clusters (k) vs. the fraction of samples controlled by the attacker, on the Banana, Malware, and Digits data, for Random, Random (Best), Bridge (Best), Bridge (Soft), and Bridge (Hard)]
Experiments on poisoning attacks
• Some attack samples obtained by the given attack strategies on the MNIST Handwritten Digits, at iterations 1, 2, and 10.
[Figure: attack digit images generated by Random, Random (Best), Bridge (Best), Bridge (Soft), and Bridge (Hard) at iterations 1, 2, and 10]
Obfuscation attacks (integrity violation)
• Goal: hiding the attack samples A without compromising the clustering of D
• Capability: bounded manipulation of the attack samples
max_{A'} g(A'; θ0) = −d_c( C_t, f(D ∪ A') ),   where π_D(C_t) = f(D)
s.t. A' ∈ Ω_o(A) = { A' = {a'_i}_{i=1}^{|A|} : d_s(A, A') = max_i ||a_i − a'_i||_2 ≤ d_max }
[Figure: the attack samples A are manipulated into A' to hide them within an existing cluster of the data D]
Obfuscation attacks vs. single-linkage HC
• Heuristic-based solution
  – For each attack sample a in A, select the closest sample d in D from the cluster to which a should belong, and define the manipulated sample a' as:

a' = a + α (d − a) / ||d − a||_2,   α = min( d_max, ||d − a||_2 )
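The update rule above is straightforward to implement. A minimal sketch (assuming numpy; names are illustrative):

```python
import numpy as np

def obfuscate(a, d, d_max):
    """Move attack sample a toward target sample d by at most d_max
    (Euclidean), implementing a' = a + alpha * (d - a)/||d - a||_2
    with alpha = min(d_max, ||d - a||_2)."""
    delta = d - a
    dist = np.linalg.norm(delta)
    if dist == 0.0:
        return a.copy()  # already on target
    alpha = min(d_max, dist)
    return a + alpha * delta / dist

a = np.array([0.0, 0.0])
d = np.array([3.0, 4.0])           # distance 5 from a
print(obfuscate(a, d, d_max=2.0))  # moves 2 units along the direction to d
print(obfuscate(a, d, d_max=10.0)) # budget exceeds the distance: lands on d
```

When the budget d_max exceeds the distance to d, the attack sample is moved exactly onto its target; otherwise it stops at the boundary of the feasible domain.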
[Figure: the attack sample a is moved toward the closest in-cluster sample d, yielding the manipulated sample a']
Experiments on obfuscation attacks
• MNIST Handwritten Digits
  – Initial clusters: ‘0’, ‘1’, ‘6’, ‘3’
  – The attacker modifies ‘3’s to have them clustered with ‘6’s
  – The attacker minimizes the distance from the desired clustering
  – The clustering minimizes the distance from the initial clusters (where the ‘3’s are not manipulated)
[Figure: attacker's and clustering objective, and number of clusters (k), as a function of the maximum manipulation d_max ∈ {0.0, 2.0, 3.0, 4.0, 5.0, 7.0}]
Experiments on obfuscation attacks
[Figure: same plots as before; for intermediate values of d_max the attacker's objective increases]
Why does the attacker's objective increase there?
[Figure: manipulated ‘3’s form a bridge between the ‘3’ and ‘6’ clusters, which are then merged (bridging!)]
This may suggest a more effective heuristic, based on modifying only a subset of the attack samples!
Conclusions and future work
• Framework for security evaluation of clustering algorithms
• Definition of poisoning and obfuscation attacks
• Case study on single-linkage HC highlights its vulnerability to attacks
• Future work
  – Extensions to other algorithms; a common solver for the attack strategy (e.g., black-box optimization with suitable heuristics)
  – Connections with clustering stability
  – Secure / robust clustering algorithms
Any questions? Thanks for your attention!