
An Efficient Parallel Approach for Identifying Protein Families from Large-scale Metagenomics Data Changjun Wu, Ananth Kalyanaraman School of Electrical


  • Slide 1
  • An Efficient Parallel Approach for Identifying Protein Families from Large-scale Metagenomics Data
    Changjun Wu, Ananth Kalyanaraman
    School of Electrical Engineering and Computer Science, Washington State University
  • Slide 2
  • Outline
    Problem Introduction; Related Work; Our Parallel Approach for Protein Family Identification; Experimental Results; Conclusions & Future Work; Acknowledgments
    11/19/2008, SC08, Austin, TX
  • Slide 3
  • Outline
    Problem Introduction; Related Work; Our Parallel Approach for Protein Family Identification; Experimental Results; Conclusions & Future Work; Acknowledgments
  • Slide 4
  • Metagenomics
    Application of genomics techniques to the study of microbial communities in their natural environments, without isolation and lab cultivation of individual species.
  • Slide 5
  • Protein Family Identification Problem
    Motivation: family identification; functional annotation; diversity of the protein family universe
    Figure labels: family 1, family 2, family i, known proteins, new metagenomic proteins, new protein family, functional annotation
  • Slide 6
  • What is a Protein Family?
    A protein family is a group of evolutionarily (and thus functionally) related proteins.
    Figure labels: sequence similarity, domain similarity, structure similarity
  • Slide 7
  • Outline
    Problem Introduction; Related Work; Our Parallel Approach for Protein Family Identification; Experimental Results; Conclusions & Future Work; Acknowledgments
  • Slide 8
  • Related Work
    General approach: perform all-against-all sequence comparison (BLAST); group proteins based on pairwise similarity
    Related work: Kriventseva et al. (2001); Enright et al. (2002); Pipenbacher et al. (2002); Kelil et al. (2007); Yooseph et al. (2007)
    Figure label: sequential approach
  • Slide 9
  • GOS Approach
    Yooseph et al. (2007)
    Steps: (1) redundancy removal; (2) graph generation; (3) dense subgraph detection
    Cost annotations: quadratic (n^2) space, quadratic (n^2) time
  • Slide 10
  • Limitations of Current Approaches
    Constructing large graphs can be time-consuming: ~10^6 CPU hours for ~28.6 million proteins (GOS approach)
    Quadratic space requirement
    Brute-force parallel approach
  • Slide 11
  • Outline
    Problem Introduction; Related Work; Our Parallel Approach for Protein Family Identification; Experimental Results; Conclusions & Future Work; Acknowledgments
  • Slide 12
  • Main Ideas of Our Approach
    Idea #1: A dense subgraph (DS) cannot span two connected components (CC).
    Use divide and conquer to drastically reduce problem size!
    Challenge: find connected components without generating the whole graph.
  • Slide 13
  • Main Ideas of Our Approach
    Idea #2: Exact-match based filtering technique
    Example: a 100 bp alignment with 98% sequence similarity contains an exact match of >= 33 bp.
    Eliminate unnecessary all-against-all comparisons!
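
The ">= 33" figure on this slide can be checked with a pigeonhole argument. The sketch below assumes an ungapped alignment in which only mismatches occur: m mismatches split the matched positions into at most m + 1 runs, so the longest run has at least ceil(matches / (m + 1)) positions. The function name and its parameters are illustrative and do not come from the slides.

```python
import math

def min_longest_exact_match(aligned_len: int, identity_pct: int) -> int:
    """Lower bound on the longest run of consecutive matches.

    Assumption: ungapped alignment, so every non-matching column is a mismatch.
    """
    mismatches = aligned_len * (100 - identity_pct) // 100
    matches = aligned_len - mismatches
    # m mismatches cut the matches into at most m + 1 runs (pigeonhole).
    return math.ceil(matches / (mismatches + 1))

print(min_longest_exact_match(100, 98))  # 33
```

A pair that shares no exact match of this length cannot reach the similarity threshold under these assumptions, which is why such pairs can be skipped without alignment.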
  • Slide 14
  • Main Ideas of Our Approach
    Idea #3: High overlap of outlinks indicates a dense subgraph.
    Use outlinks comparison to group vertices into dense subgraphs!
    Figure labels: u, v, web community, outlinks
  • Slide 15
  • Our Parallel Approach for Protein Family Identification
    Steps: (1) redundancy removal; (2) connected component detection; (3) bipartite graph generation; (4) dense subgraph detection
    Figure labels: input protein sequences, connected components, protein sequence, pairwise sequence homology, dense subgraph
  • Slide 16
  • Redundancy Removal
    Criteria: similarity of the match is >= 98%; >= 95% of the shorter sequence is covered by the match
    Figure labels: generalized suffix tree (GST); p1, p2, p3, p4, p5; cut off; >= 95%; >= 98%; idea #2
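
The two criteria above can be written as a single predicate. This is a minimal sketch: the function name, its parameters, and the assumption that alignment statistics (match length and percent identity) are already available are all hypothetical and not taken from the slides.

```python
def is_redundant(match_len: int, identity_pct: float, len_a: int, len_b: int,
                 min_identity_pct: float = 98.0,
                 min_coverage_pct: float = 95.0) -> bool:
    """Apply the two redundancy criteria from the slide.

    match_len: residues of the shorter sequence covered by the match
               (assumed to come from an alignment step not shown here).
    """
    shorter = min(len_a, len_b)
    coverage_pct = 100.0 * match_len / shorter
    return identity_pct >= min_identity_pct and coverage_pct >= min_coverage_pct

# A 100-residue sequence almost fully contained in a 250-residue one.
print(is_redundant(match_len=96, identity_pct=99.0, len_a=100, len_b=250))
```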
  • Slide 17
  • Connected Component Detection
    Master node (M): 1) manage CC using union-find data structure; 2) distribute work in a load-balancing way
    Worker node (W): 1) generate pairs; 2) sequence alignment
    Figure labels: GST 1, GST 2, GST p; pairs; work; + alignment results
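
The master's bookkeeping can be illustrated with a standard union-find (disjoint-set) structure. The sketch below is sequential and only shows how homologous pairs are merged into connected components without storing the graph's edges; the class and function names are illustrative, and the master/worker communication from the slide is not modeled.

```python
class UnionFind:
    """Disjoint-set forest with path halving and union by size."""

    def __init__(self, n: int) -> None:
        self.parent = list(range(n))
        self.size = [1] * n

    def find(self, x: int) -> int:
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a: int, b: int) -> bool:
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return False  # already in the same component
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]
        return True


def connected_components(n: int, homologous_pairs) -> list:
    """Group sequence ids 0..n-1 given pairs reported as homologous."""
    uf = UnionFind(n)
    for a, b in homologous_pairs:
        uf.union(a, b)
    groups = {}
    for v in range(n):
        groups.setdefault(uf.find(v), []).append(v)
    return sorted(groups.values())


print(connected_components(6, [(0, 1), (1, 2), (4, 5)]))
# [[0, 1, 2], [3], [4, 5]]
```

One plausible benefit of this design, stated here as an assumption, is that a candidate pair whose two sequences already have the same `find` root would not need to be aligned at all, since aligning it cannot change the components.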
  • Slide 18
  • Bipartite Graph Generation
    Each connected component G(V, E) is converted to a bipartite graph B(V, V, E).
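
One way to read B(V, V, E) is that every vertex gets a left copy and a right copy, and each undirected edge {u, v} of G becomes the two bipartite edges (u_left, v_right) and (v_left, u_right). The slide does not spell out the construction, so the sketch below is an assumption; the `"L"`/`"R"` tags and the function name are illustrative.

```python
def to_bipartite(edges):
    """Duplicate the vertex set and map each undirected edge to two left-right edges."""
    bipartite_edges = set()
    for u, v in edges:
        bipartite_edges.add((("L", u), ("R", v)))
        bipartite_edges.add((("L", v), ("R", u)))
    return bipartite_edges


for edge in sorted(to_bipartite([(1, 2), (2, 3)])):
    print(edge)
```

Under this reading, the right-side neighbors of a left vertex play the role of its outlinks in the dense subgraph detection step.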
  • Slide 19
  • Dense Subgraph Detection
    Shingle algorithm
    Figure: outlinks(u) and outlinks(v) are each permuted and reduced to a shingle of s elements; the shingles are compared; this is repeated c times.
    s, c: parameters
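
The shingle step can be sketched as a min-hash style computation: for each of c random orderings, a vertex's shingle is the s smallest outlinks under that ordering, and two vertices are compared by counting matching shingles. This is a simplified sketch, not the authors' implementation; it assumes integer vertex ids, uses a linear hash modulo a prime as a stand-in for a random permutation, and all names are illustrative.

```python
import random

PRIME = 2_147_483_647  # vertex ids are assumed to be smaller than this


def shingles(outlinks, s: int, c: int, seed: int = 0):
    """Return c shingles, each the s smallest outlinks under a random ordering."""
    rng = random.Random(seed)  # same seed gives the same c orderings for every vertex
    result = []
    for _ in range(c):
        a = rng.randrange(1, PRIME)
        b = rng.randrange(0, PRIME)
        ranked = sorted(outlinks, key=lambda x: (a * x + b) % PRIME)
        result.append(tuple(ranked[:s]))
    return result


def shared_shingle_count(out_u, out_v, s: int, c: int) -> int:
    """Count the orderings under which u and v produce the same shingle."""
    return sum(1 for x, y in zip(shingles(out_u, s, c), shingles(out_v, s, c))
               if x == y)


print(shared_shingle_count({1, 2, 3, 4, 5}, {1, 2, 3, 4, 5}, s=2, c=4))  # 4
print(shared_shingle_count({1, 2, 3}, {7, 8, 9}, s=2, c=4))              # 0
```

Vertices whose outlink sets overlap heavily share many shingles, while vertices with disjoint outlinks share none, which is the property used to group vertices into candidate dense subgraphs.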
  • Slide 20
  • Dense Subgraph Detection
    Figure labels: B(V, V, E); shingle; 1st pass; 2nd pass; A ~ B; A; B; dense subgraph; steps 1, 2, 3
  • Slide 21
  • Outline
    Problem Introduction; Related Work; Our Parallel Approach for Protein Family Identification; Experimental Results; Conclusions & Future Work; Acknowledgments
  • Slide 22
  • Qualitative Validation with GOS Data
    160k data set: our results vs. GOS results
    Columns: #input seq; #NR; #CC; #DS; mean degree; mean density; size of largest DS
    Row 1: 160,000; 138,633; 1,861; 850; 26; 76%; 13,263
    Row 2: 22,186; 21,348; #CC, #DS, and mean degree appear only as the run "113420"; 78%; 6,828
    Precision Rate (PR) = 95.75%; Sensitivity (SE) = 56.89%; Overlap Quality (OQ) = 55.49%
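
The three reported quality figures are mutually consistent under the usual pair-counting definitions, which are assumed here because the slide does not state them: PR = TP / (TP + FP), SE = TP / (TP + FN), and OQ = TP / (TP + FP + FN). These give 1/OQ = 1/PR + 1/SE - 1, so OQ can be recomputed from PR and SE alone. The function name is illustrative.

```python
def overlap_quality(precision: float, sensitivity: float) -> float:
    """OQ from PR and SE, assuming OQ = TP / (TP + FP + FN)."""
    return 1.0 / (1.0 / precision + 1.0 / sensitivity - 1.0)


oq = overlap_quality(0.9575, 0.5689)
print(f"{100 * oq:.2f}%")  # 55.49%
```

The recomputed value matches the OQ reported on the slide, which supports (but does not prove) that these are the definitions used.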
  • Slide 23
  • Drastic Work Reduction
    40k input data, #(sequence alignment work): ~800 million for all-against-all BLAST vs. ~8 million for our parallel approach
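
The ~800 million figure agrees with the number of unordered pairs among 40,000 sequences, n(n - 1)/2. The short check below assumes the input size is exactly 40,000 and takes the slide's rounded ~8 million as the filtered count; the function name is illustrative.

```python
def all_pairs(n: int) -> int:
    """Number of unordered pairs in an all-against-all comparison."""
    return n * (n - 1) // 2


brute_force = all_pairs(40_000)
filtered = 8_000_000  # rounded figure from the slide
print(brute_force)                    # 799980000
print(round(brute_force / filtered))  # 100
```

So the filtering reduces the alignment work by roughly two orders of magnitude on this input.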
  • Slide 24
  • Run Time as Function of Input Size
  • Slide 25
  • Performance Evaluation
  • Slide 26
  • Conclusions & Future Work
    Presented a parallel approach for protein family identification
    Quality testing: better benchmark
    Parallelization of Shingle algorithm: potential memory problem
    Large-scale application: 28.6 million
  • Slide 27
  • Acknowledgments
    Prof. Srinivas Aluru at Iowa State University for BlueGene/L access
    Anonymous reviewers
    Funding: Washington State University Foundation and the Office of Research
  • Slide 28
  • Thanks! Questions?
