Apply Bioinformatics Applications on Parallel and Grid Computing Environment

Chao-Tung Yang (楊朝棟), Yu-Lun Kuo (郭育倫)
High-Performance Computing Laboratory, Department of Computer Science and Information Engineering, Tunghai University

Presenter: Yu-Ming Wang
Graduate Institute of Electrical Engineering, National Kaohsiung University of Applied Sciences


Outline

1. Introduction
2. Bioinformatics, BioGrid
3. Parallel Bioinformatics
4. System Environment
5. Experimental Results
6. Conclusions


Introduction

Bioinformatics tools can speed up the analysis of large-scale sequence data, especially sequence alignment.

Hardware:
• PC cluster: one master node and seven slave nodes (16 processors in total)
• Sun Fire 6800 server
• Grid system

Bioinformatics tools:
• mpiBLAST (MPI)
• FASTA (MPI)
• HMMs (PVM, Parallel Virtual Machine)


Bioinformatics

1. Creation of databases that allow storage and management of large biological data sets.

2. Development of algorithms and statistics to determine relationships between members of those data sets.

3. Use of these tools for the analysis and interpretation of biological data.


Grid Computing

• Makes more effective use of computer resources.
• Offers a way to solve problems that require enormous amounts of computing power.
• Directs the resources of many computers toward a common objective.


BioGrid

Constructing a BioGrid system is necessary for researchers to reduce sequence alignment time.

Diagram labels:
• PC Cluster
• Local BioGrid
• Global BioGrid


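Parallel Bioinformatics I (BLAST)

BLAST (Basic Local Alignment Search Tool) is an alignment tool for nucleotide and protein sequences.

[blastall]: runs the standard BLAST searches:
• nucleotide sequence against a nucleotide database (blastn)
• protein sequence against a protein database (blastp)
• nucleotide sequence against a protein database (blastx)
• protein sequence against a translated nucleotide database (tblastn)
• nucleotide sequence against a translated nucleotide database (tblastx)

[blastpgp]: runs PSI-BLAST (Position-Specific Iterated BLAST), a BLAST program that takes a protein sequence as input, queries a protein database, and checks whether the sequence belongs to a particular protein family.

[bl2seq]: aligns two nucleotide or protein sequences.

[formatdb]: formats sequence data supplied in FASTA format into a BLAST database.

mpiBLAST is based on MPI.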
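To illustrate how an MPI-based BLAST can divide its work, the sketch below follows the database-segmentation idea usually attributed to mpiBLAST: the database is split into fragments, each worker searches one fragment, and the partial hit lists are merged. It is only a toy model. Threads stand in for MPI processes, the shared 3-mer count stands in for a real BLAST score, and the names `split_database`, `search_fragment`, and `parallel_search` are hypothetical, not part of mpiBLAST.

```python
from concurrent.futures import ThreadPoolExecutor


def split_database(records, n_fragments):
    """Deal (name, sequence) records round-robin into n_fragments lists."""
    return [records[i::n_fragments] for i in range(n_fragments)]


def search_fragment(query, fragment, k=3):
    """Toy scoring (not BLAST): count k-mers shared by the query and each record."""
    query_kmers = {query[i:i + k] for i in range(len(query) - k + 1)}
    hits = []
    for name, seq in fragment:
        seq_kmers = {seq[i:i + k] for i in range(len(seq) - k + 1)}
        hits.append((len(query_kmers & seq_kmers), name))
    return hits


def parallel_search(query, records, n_workers=4):
    """Search every fragment concurrently, then merge and rank all hits."""
    fragments = split_database(records, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial = list(pool.map(lambda frag: search_fragment(query, frag), fragments))
    merged = [hit for hits in partial for hit in hits]
    # Highest score first; ties broken by record name for a stable order.
    return sorted(merged, key=lambda hit: (-hit[0], hit[1]))


if __name__ == "__main__":
    # Hypothetical miniature database, for illustration only.
    database = [("seq_a", "ACGTACGT"), ("seq_b", "TTTTTTTT"), ("seq_c", "ACGTTTTT")]
    for score, name in parallel_search("ACGTAC", database, n_workers=2):
        print(name, score)
```

Because each fragment is searched independently, the merged ranking matches what a single worker would produce over the whole database; only the wall-clock time changes.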


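Parallel Bioinformatics II (FASTA)

FASTA is a suite of sequence search programs whose modes are similar to those of BLAST, with the exception of PSI-BLAST. It provides very fast searches of DNA and protein sequence databases.

[fasta]: uses the FASTA algorithm to compare a DNA sequence against a DNA database, or a protein sequence against a protein database.

[ssearch]: performs the same comparisons using the Smith-Waterman algorithm.

[fastx/fasty]: compares a DNA sequence against a protein database, translating the DNA sequence.

[tfastx/tfasty]: compares a protein sequence against a DNA database, translating the DNA database sequences.

[align]: computes an alignment between two DNA or protein sequences.

[lalign]: computes local alignments between two DNA or protein sequences.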
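The Smith-Waterman algorithm named under [ssearch] can be sketched in a few lines of dynamic programming. The version below returns only the best local alignment score and uses a linear gap penalty; the scoring values (`match=2`, `mismatch=-1`, `gap=-1`) are illustrative assumptions, not the defaults of the FASTA package, which uses substitution matrices and affine gap penalties.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Return the best local alignment score of a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)  # previous row of the scoring matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A cell never drops below zero, which is what makes the alignment local.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    print(smith_waterman_score("XXACGTXX", "YYACGTYY"))
```

Keeping only two rows of the matrix is enough when the score alone is needed; recovering the aligned residues would require storing the full matrix and tracing back from the best cell.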


Parallel Bioinformatics III (HMMs)

HMMs (Hidden Markov Models) can be used to search databases using statistical descriptions of sequence families.

[hmmpfam]: searches an HMM database with a sequence and attempts to annotate unknown sequences.

[hmmindex]: builds a binary SSI index for an HMM database.

[hmmsearch]: searches a sequence database with an HMM to find more similar sequences.

[hmmalign]: aligns multiple sequences.

[hmmbuild]: builds an HMM from a multiple sequence alignment.

[hmmcalibrate]: reads an HMM and calibrates its search statistics.

[hmmemit]: generates a consensus sequence from an HMM.

[hmmfetch]: retrieves an HMM from an HMM database.
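As a rough illustration of what scoring a sequence against an HMM involves, the sketch below implements the standard forward algorithm for a generic discrete HMM. It is not the profile-HMM architecture used by the tools above (which have match, insert, and delete states), and the two-state model in the usage example is made up purely for demonstration.

```python
def forward_probability(obs, states, start_p, trans_p, emit_p):
    """Return P(obs | model), summed over all hidden state paths."""
    # alpha[s] = probability of the symbols seen so far, ending in state s.
    alpha = {s: start_p[s] * emit_p[s][obs[0]] for s in states}
    for symbol in obs[1:]:
        alpha = {
            s: emit_p[s][symbol] * sum(alpha[r] * trans_p[r][s] for r in states)
            for s in states
        }
    return sum(alpha.values())


if __name__ == "__main__":
    # Hypothetical two-state model over the alphabet {A, C}.
    states = ("M", "I")
    start_p = {"M": 0.8, "I": 0.2}
    trans_p = {"M": {"M": 0.9, "I": 0.1}, "I": {"M": 0.5, "I": 0.5}}
    emit_p = {"M": {"A": 0.7, "C": 0.3}, "I": {"A": 0.5, "C": 0.5}}
    print(forward_probability("AAC", states, start_p, trans_p, emit_p))
```

For long sequences a practical implementation would work in log space to avoid numerical underflow; the plain-probability form is kept here for readability.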


Our System Environment (I)

Linux PC cluster

One server node:
• AMD Athlon MP 2000+ processors
• 1 GB shared memory

Seven slave nodes:
• AMD Athlon MP 1800+ processors
• 512 MB shared memory

Interconnect:
• 100 Mbps Ethernet switches

Sun Fire 6800 server

• 8 UltraSPARC III Cu 1.2 GHz processors
• 8 GB main memory
• Solaris 8 operating system


Our System Environment (II)

The Grid System

• Each cluster has one master node and two slave nodes
• 3COM 3C9051 10/100 Fast Ethernet card
• AC-EX3016B switch hub
• Globus Toolkit v2.4


Experimental Results (I)

The Experimental Results on PC Cluster

The Performance of mpiBLAST

• speedup: nearly two times


The Performance of HMMs

• saved about half the time

• speedup: nearly two times


Experimental Results (II)

The Experimental Results on Sun Fire 6800

The Performance of FASTA

• speedup: nearly two times


The Performance of HMMs

• saved about half the time

• speedup: nearly two times


Experimental Results (III)

The Experimental Results on Grid System

Performance improves noticeably, saving about one-third of the time.

Times labeled in the chart: 250 sec and 160 sec.
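The relationship between "time saved" and "speedup" in these results can be checked with simple arithmetic. The sketch below assumes that 250 sec is the baseline run and 160 sec is the run on the grid system; the slide does not state which figure is which, so that pairing is an assumption. The helper names `speedup` and `time_saved_fraction` are illustrative.

```python
def speedup(baseline_sec, improved_sec):
    """Speedup = baseline time divided by improved time."""
    return baseline_sec / improved_sec


def time_saved_fraction(baseline_sec, improved_sec):
    """Fraction of the baseline time that was saved (equals 1 - 1/speedup)."""
    return 1 - improved_sec / baseline_sec


if __name__ == "__main__":
    # Assumed pairing: 250 sec baseline, 160 sec on the grid system.
    print(f"grid speedup: {speedup(250, 160):.2f}x")
    print(f"grid time saved: {time_saved_fraction(250, 160):.0%}")
    # A speedup of two corresponds to saving half the time.
    print(f"time saved at 2x speedup: {time_saved_fraction(2, 1):.0%}")
```

Under that assumption the saving is 36%, consistent with "about one-third", and the same formula shows why a speedup of nearly two corresponds to saving about half the time in the earlier results.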


Conclusions

Parallel computers and grid systems can save time in sequence analysis.

Therefore, parallel bioinformatics tools can help reduce the waiting time for alignment and improve sequence alignment performance.

Life is too short & DNA is too long!!
