Toward a Practical Data Privacy Scheme for a Distributed Implementation of the Smith-Waterman Genome Sequence Comparison Algorithm

Doug Szajda, Mike Pohl*, Jason Owen, Barry Lawson




Large-Scale Distributed Computations

• Easily parallelizable, compute intensive

• Divide into independent tasks to be executed on participant PCs

• Significant results collected by supervisor



Examples

• seti@home
– Finding Martians
• folding@home
– Protein folding
• GIMPS (Entropia)
– Mersenne Prime search
• United Devices, IBM, DOD: Smallpox study
• DNA sequencing
• Graphics
• Exhaustive Regression
• Genetic Algorithms
• Data Mining
• Monte Carlo simulation


A Problem

• Code is executing in untrusted environments
– Data required for task execution may be proprietary
– Can we find a way to have participants execute tasks without divulging data?


Related Work (not exhaustive)

• Computing with Encrypted Data
– Feigenbaum (1985)
– Abadi, Feigenbaum, Kilian (1987)

• Secure Circuit Evaluation
– Abadi and Feigenbaum (1990)
– Sander, Young, and Yung (1999)


Related Work (not exhaustive)

• Privacy Homomorphisms
– Rivest, Adleman, Dertouzos (1978)
– Ahituv, Lapid, Neumann (1987)
– Brickell and Yacobi (1987)

• Multiparty function computation
– Yao (1986)
– Goldreich, Micali, Wigderson (1987)
– Ben-Or, Goldwasser, and Wigderson (1988)
– Chaum, Crepeau, and Damgard (1988)


Computing With Encrypted Data

• Alice has x, wants Bob to compute f(x), but does not want to divulge x

• Alice gives Bob E(x) and f’, tells him to return f’(E(x))

• Alice can determine f(x) from f’(E(x)), but Bob cannot determine x from knowledge of E(x), f’(E(x))
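The Alice and Bob exchange can be made concrete with a toy example. The sketch below is only an illustration under stated assumptions: the prime `P`, the linear function `f`, and the multiplicative blinding used for `E` are all hypothetical choices, not taken from the talk, and this is not a secure scheme (for instance, reusing the same blinding value `r` for two inputs reveals their ratio).

```python
# Toy illustration only: multiplicative blinding over a prime field.
# P, f, and every name below are hypothetical, chosen for this sketch.
P = 2_147_483_647  # the Mersenne prime 2**31 - 1, picked arbitrarily


def f(x, a=12345):
    """The function Alice wants evaluated: f(x) = a * x mod P."""
    return (a * x) % P


def encrypt(x, r):
    """E(x): Alice blinds x with a secret nonzero multiplier r."""
    return (x * r) % P


def f_prime(ex, a=12345):
    """f': what Bob computes on E(x). For this linear f it has the same form as f."""
    return (a * ex) % P


def decode(y, r):
    """Alice recovers f(x) from f'(E(x)) by multiplying by the inverse of r mod P."""
    return (y * pow(r, -1, P)) % P


x, r = 424242, 987654321
bob_result = f_prime(encrypt(x, r))   # Bob never sees x or r
assert decode(bob_result, r) == f(x)  # Alice obtains f(x)
```

The three-argument `pow` with a negative exponent requires Python 3.8 or later.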


In Present Context

• Alice has several x values. Asks Bob to identify those that are significant
– Alice doesn’t need f(x), so greater flexibility in definition of f’ (Sufficient Accuracy)
– Post-filtering means that some false positives are OK.
• Lots of Bobs offering computing services


Adversary (as usual)

• Assumed to be intelligent
– Can decompile, analyze, modify code
– Understands task algorithms and measures used to prevent disclosure of data


The Model

• Computation: evaluate f : D -> R

• Partition D into subsets Di

• Task T(Di): evaluate f(xi) for all xi in Di

• Each task assigned filter function Gi

– Gi returns indices of interesting xi
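A minimal sketch of this model, assuming fixed-size subsets and a simple threshold filter; the helper names `partition`, `run_task`, and `threshold_filter` are hypothetical and appear nowhere in the talk.

```python
from typing import Callable, List, Sequence, TypeVar

X = TypeVar("X")
R = TypeVar("R")


def partition(d: Sequence[X], size: int) -> List[Sequence[X]]:
    """Split the domain D into consecutive subsets Di of at most `size` elements."""
    return [d[i:i + size] for i in range(0, len(d), size)]


def run_task(d_i: Sequence[X],
             f: Callable[[X], R],
             g_i: Callable[[List[R]], List[int]]) -> List[int]:
    """Task T(Di): evaluate f on every xi in Di, then let the filter Gi pick indices."""
    results = [f(x) for x in d_i]
    return g_i(results)


def threshold_filter(p):
    """One possible Gi (an assumption): indices whose result exceeds the threshold p."""
    def g(results):
        return [i for i, r in enumerate(results) if r > p]
    return g


# The supervisor partitions D; each participant runs one task.
parts = partition(list(range(10)), 5)
interesting = [run_task(d_i, lambda x: x * x, threshold_filter(30)) for d_i in parts]
```

The indices returned here are local to each Di; a supervisor would map them back to positions in D.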


Basic Approach

• Transform Di, f, Gi into Di’, f’, Gi’
• Replace T(Di) with T(Di’) such that
– T(Di’) does not leak additional information about values in Di
– Identifiers returned by T(Di’) contain those that would be returned by T(Di)
– Difference between the two sets of identifiers is reasonably small


Reality

• Providing required properties is difficult (impossible for some apps)
• Even when possible, implementation is application specific

• Bottom line: A potential approach, where few (if any) others exist


An Example: Smith-Waterman Genome Sequence Comparison


Genetic Sequence Alignment

• Comparing sequences over alphabet Σ = {A,C,G,T}

• Biologists track evolutionary changes by writing sequences with columns aligned (called an alignment)

• Ex.
CTGTTA
CAGTTA


Sequence Evolution

• Deletion: CTGTTA → CTGTA
• Insertion: CTGTTA → CGTGTTA
• Substitution: CTGTTA → CAGTTA
• Deletions and insertions are collectively called indels


Sequence Evolution (cont.)

• After several “generations”: CTGTTA → CTATGCTCG

• Note: Number of alignments (for pair of realistic length sequences) is huge


Alignment “Types”

• Global alignment
– Considers entire sequence
• Local alignment
– Considers substrings
– Biologists usually consider local alignments


Measuring Alignments

• Scoring function
– +1 if symbols match
– -1 if not
• Gap penalty
– g(k) = a + b(k-1)
– k is gap length (# consecutive dashes in single sequence)
• Alignment score is sum of column scores minus gap penalties
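The scoring rule above can be sketched for a given alignment written as two equal-length rows with "-" marking gaps. The defaults a = 2 and b = 1 are the values quoted for the simulations later in the talk; the function names are hypothetical.

```python
def gap_penalty(k, a=2, b=1):
    """g(k) = a + b(k-1) for a gap of k consecutive dashes in one row."""
    return a + b * (k - 1)


def alignment_score(top, bottom, a=2, b=1):
    """Sum of column scores (+1 match, -1 mismatch) minus the gap penalties."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have equal length")
    score = 0
    gap_row, gap_len = None, 0  # which row holds the current run of dashes
    for s, t in zip(top, bottom):
        if s == "-" and t == "-":
            raise ValueError("a column cannot contain two dashes")
        row = 0 if s == "-" else 1 if t == "-" else None
        if row != gap_row and gap_row is not None:
            score -= gap_penalty(gap_len, a, b)  # the previous gap just ended
            gap_len = 0
        gap_row = row
        if row is None:
            score += 1 if s == t else -1
        else:
            gap_len += 1
    if gap_row is not None:  # a gap that runs to the end of the alignment
        score -= gap_penalty(gap_len, a, b)
    return score


print(alignment_score("CTGTTA", "CAGTTA"))  # five matches, one mismatch
print(alignment_score("CTGTTA", "CTG-TA"))  # five matches, one gap of length 1
```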


Smith-Waterman

• Dynamic programming algorithm guaranteed to produce an optimal alignment
– Global: O(n^2); local: O(n^3)
• Widely used by biologists
• Implemented on commercial volunteer distributed computing platforms
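For orientation, here is a compact sketch of a local-alignment score computed by dynamic programming. It uses the well-known affine-gap recurrence (often attributed to Gotoh), which runs in time proportional to the product of the sequence lengths; that is a swapped-in variant chosen for brevity, not necessarily the formulation the authors implemented. The scoring (+1/-1) and gap parameters (a = 2, b = 1) follow the values given in this talk; the function name is hypothetical.

```python
def local_alignment_score(s, t, a=2, b=1, match=1, mismatch=-1):
    """Best local alignment score of s and t with gap penalty g(k) = a + b(k-1).

    s and t may be strings or any indexable sequences whose items support ==.
    """
    neg_inf = float("-inf")
    n = len(t)
    h_prev = [0] * (n + 1)        # H[i-1][*]: best score ending at (i-1, *)
    f_prev = [neg_inf] * (n + 1)  # F[i-1][*]: best score ending in a gap in t
    best = 0
    for i in range(1, len(s) + 1):
        h_cur = [0] * (n + 1)
        f_cur = [neg_inf] * (n + 1)
        e = neg_inf               # E[i][j]: best score ending in a gap in s
        for j in range(1, n + 1):
            e = max(h_cur[j - 1] - a, e - b)              # open or extend
            f_cur[j] = max(h_prev[j] - a, f_prev[j] - b)  # open or extend
            diag = h_prev[j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            h_cur[j] = max(0, diag, e, f_cur[j])          # 0 restarts the alignment
            best = max(best, h_cur[j])
        h_prev, f_prev = h_cur, f_cur
    return best


print(local_alignment_score("CTGTTA", "CAGTTA"))
```

Only two rows of the table are kept because the sketch returns the score alone; recovering the alignment itself would require storing the full table for traceback.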


Using Smith-Waterman

• Significance of Smith-Waterman score based on probabilistic considerations
• Empirical Evidence: Similarity scores of randomly generated sequences exhibit an extreme value distribution
• Significance threshold p chosen so that the probability that a random score exceeds p is small (typically < 0.003)
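One simple way to pick such a threshold, sketched below, is to take an empirical quantile of scores obtained from randomly generated sequence pairs. This is a stand-in technique for illustration: it does not fit an extreme value distribution, and the function name and the small rounding guard are assumptions of this sketch.

```python
def empirical_threshold(random_scores, tail_probability=0.003):
    """Smallest sampled score p such that at most `tail_probability`
    of the sampled random scores are strictly greater than p."""
    if not random_scores:
        raise ValueError("need at least one score")
    ordered = sorted(random_scores)
    # Number of samples allowed to exceed p; the tiny epsilon guards
    # against floating-point results such as 2.9999999999999996.
    allowed = int(tail_probability * len(ordered) + 1e-9)
    return ordered[len(ordered) - 1 - allowed]


# Eight sampled scores, a quarter of them allowed above the threshold.
p = empirical_threshold([1, 2, 3, 4, 5, 6, 7, 8], tail_probability=0.25)
```

With a tail probability as small as 0.003, many thousands of random scores would be needed for the estimate to be stable.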


A Smith-Waterman Task

• Pairwise comparison of two sets of sequences, A and B
– A : proprietary sequences
– B : sequences from public database

• Returned: indices of well-matched pairs

• Notation: T(A,B,s,g,p), with scoring function s, gap penalty g, and significance threshold p


Our Transformation

• Offset sequences: compare relative distances between occurrences of a specific nucleotide
• U: GCACTTACGCCCTTACGACG
– F(U,A) = {3,4,8,3}
– F(U,C) = {2,2,4,2,1,1,4,3}
– F(U,G) = {1,8,8,3}
– F(U,T) = {5,1,7,1}
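The offset values listed for U are consistent with the following reading: the first entry is the 1-based position of the first occurrence of the nucleotide, and each later entry is the distance from the previous occurrence. A minimal sketch under that reading (the function name `offset_sequence` is hypothetical):

```python
def offset_sequence(seq, literal):
    """F(seq, literal): position of the first occurrence of `literal`,
    followed by the distances between consecutive occurrences."""
    offsets = []
    last = 0  # position of the previous occurrence (0 before the first one)
    for pos, symbol in enumerate(seq, start=1):
        if symbol == literal:
            offsets.append(pos - last)
            last = pos
    return offsets


U = "GCACTTACGCCCTTACGACG"
for nucleotide in "ACGT":
    print(nucleotide, offset_sequence(U, nucleotide))
```

The entries of each offset sequence sum to the position of the last occurrence of that nucleotide, so the offsets determine exactly where that one nucleotide occurs and nothing about the other positions.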


Modified Tasks

• U: GCACTTACGCCCTTACGACG
F(U,C) = {2,2,4,2,1,1,4,3}
• V: GCACTCGCCACTTAGCACG
F(V,C) = {2,2,2,2,1,2,5,2}
• Apply S-W to F(U,C) and F(V,C)
– Scoring function, gap penalty
– “Goodness” threshold


Intuition

• Similar sequences should have similar offsets
– Consider effects of indels, substitutions
• False positives can be reduced
– Consider multiple nucleotides
• I.e., assign A and C info to distinct participants
– Good match if both tasks indicate significance


Using Multiple Nucleotide Literals

• Maximum method
– One task for each of A,C,G,T
– Result significant if any of the four says so
• Adding method
– One task for each of A,C,G,T, results passed to fifth participant
– Result significant if sum of four scores indicates significance
• Costs reduced in either case
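The two combination rules can be sketched as follows. The per-nucleotide thresholds and the single threshold on the sum are hypothetical parameters; the talk does not say how they are chosen.

```python
NUCLEOTIDES = "ACGT"


def significant_by_maximum(scores, thresholds):
    """Maximum method: significant if any one of the four per-nucleotide
    tasks reports a score above its own threshold."""
    return any(scores[n] > thresholds[n] for n in NUCLEOTIDES)


def significant_by_adding(scores, sum_threshold):
    """Adding method: a fifth participant sums the four scores and
    compares the total against a single threshold."""
    return sum(scores[n] for n in NUCLEOTIDES) > sum_threshold


# Hypothetical offset-sequence scores for one sequence pair.
scores = {"A": 12, "C": 31, "G": 9, "T": 14}
thresholds = {"A": 25, "C": 25, "G": 25, "T": 25}
print(significant_by_maximum(scores, thresholds))
print(significant_by_adding(scores, sum_threshold=70))
```

In the adding method the fifth participant sees only four scores per pair, never the offset sequences themselves.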


Security?


Recall…

1. T(Di’) does not leak additional information about values in Di
2. Identifiers returned by T(Di’) contain those that would be returned by T(Di)
3. Difference is reasonably small


Data Privacy?

• Property 1 fails: adversary will know all info about a single nucleotide literal
• Conditional entropy gives rough estimate of amount of information leaked
– Bits leaked: 2N - (N - C∂) log2 3
• C∂ is # of occurrences of ∂ in sequence
– Ex. N = 600, C∂ = N/4: 487 bits (of 1200) leaked (713 bits of uncertainty remain)
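The arithmetic in the example can be checked directly. The reading assumed here is that a sequence of N symbols over a four-letter alphabet carries 2N bits, and that once the positions of one nucleotide are known, each of the remaining N - C positions still has three equally likely values, leaving (N - C) log2 3 bits of uncertainty. The function names are hypothetical.

```python
import math


def bits_remaining(n, c):
    """Uncertainty left after one nucleotide's positions are revealed:
    each of the other n - c positions has 3 possible values."""
    return (n - c) * math.log2(3)


def bits_leaked(n, c):
    """Leak estimate from the slide: 2N - (N - C) log2 3."""
    return 2 * n - bits_remaining(n, c)


n = 600
c = n // 4  # the revealed nucleotide fills a quarter of the positions
print(round(bits_leaked(n, c)), "bits leaked of", 2 * n)
print(round(bits_remaining(n, c)), "bits of uncertainty remain")
```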


Analysis

• Clearly, our scheme does not provide provable security, but it does suggest two questions:

1. Can an adversary determine additional symbols (and if so, how many)?

2. How much information leakage is too much in this context?


“4 out of 5 [Biologists] Agree”

• Given only the position of a single nucleotide literal:

1. No additional elements can be inferred

2. There is no “biologically useful” information that can be inferred

• Given current understanding of the structure and function of the genome


An Extension

• Sequences can be “masked”

– For each task, choose random binary mask

– Remove from sequence all “zeroed” elements

• Our experiments suggest mask with “1” in 90% of positions works well
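A minimal sketch of the masking step, assuming each position is kept independently with probability 0.9 and that the mask is applied to the nucleotide sequence before any further transformation. Both points are assumptions of this sketch; the talk does not state how the mask is drawn or at which stage it is applied, and the function names are hypothetical.

```python
import random


def random_mask(length, keep_fraction=0.9, rng=None):
    """Random binary mask; each bit is 1 with probability `keep_fraction`."""
    rng = rng or random.Random()
    return [1 if rng.random() < keep_fraction else 0 for _ in range(length)]


def apply_mask(seq, mask):
    """Remove from `seq` every element whose mask bit is 0."""
    if len(seq) != len(mask):
        raise ValueError("mask and sequence must have equal length")
    return "".join(symbol for symbol, bit in zip(seq, mask) if bit)


U = "GCACTTACGCCCTTACGACG"
mask = random_mask(len(U), rng=random.Random(2024))  # fresh mask for this task
masked = apply_mask(U, mask)
```

Passing an explicit `random.Random` instance makes a run reproducible; a supervisor would draw a new mask for every task.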


Does it Work?

• In general, yes
– Strong correlation between our scores and S-W
– Not as sensitive as Smith-Waterman
• Some weak matches missed
• Statistical inference techniques show:
– Very few false positives (< 10^-4)
– Very few false negatives (often none)


Simulation Results

• Well-matched sequences artificially generated
– Substring mutated over several generations
– Placed at random location into random sequences
• Scoring function as given earlier (1, -1)
• Gap penalty: g(k) = 2 + 1(k-1)


• 10,000 comparisons, no mask, maximum method for determining significance

• Sequence length 600-800, matching portion length 300, average of 52.5 substitutions and 52.5 indels


• 10,000 comparisons, no mask, adding method for determining significance

• Sequence length 600-800, matching portion length 300, average of 52.5 substitutions and 52.5 indels


• 1,000 comparisons, no mask, maximum method for determining significance

• Sequence length 2000, matching portion length 1000, average of 150 substitutions and 150 indels


• 1,000 comparisons, 90% mask, maximum method for determining significance

• Sequence length 1000-1300, matching portion length 500, average of 86.25 substitutions and 86.25 indels


Conclusions

• Introduced notion of sufficient accuracy
• Presented a strategy for enhancing data privacy in an important real-world application
• Presented an important real-world app that requires privacy and is efficiently parallelizable
– These are relatively rare
– Potential first entry for benchmark suite of apps for privacy study


In the Future

• Solution is less than ideal
– Lack of formal privacy model / provable security
– Need more testing on real genetic data
• But it’s a start
– General problem is difficult, this is a potential avenue of attack
– Smith-Waterman requires more careful study in this context
• Application behavior vs. application configurations