Speed Up DNA Sequence Database Search and Alignment by Methods of DSP

Speed Up DNA Sequence Database Search and Alignment

by Methods of DSP

Student: Kang-Hua Hsu 徐康華Advisor: Jian-Jiun Ding 丁建均E-mail: r96942097@ntu.edu.tw

Graduate Institute of Communication EngineeringNational Taiwan University, Taipei, Taiwan, ROC

DISP@MD531

Outline What is Bioinformatics?Sequence alignmentBrute force method

Dynamic programmingHeuristic method

FASTABLAST

Our methodConclusionFuture workReference

DISP@MD531

What is Bioinformatics?

One of the motivations: Similar sequences usually have similar functions, so we try to search for similarities between sequences.

→ Alignment & Database searchProblem: Huge data amount of DNA sequences,

composed of A、 G、 T、 C. (also protein sequences)Solution: Computer

DISP@MD531

Sequence alignment(1)

DISP@MD531

Sequence alignment(2)

DISP@MD531

EX. Global alignment ofＣＴＴＧＡＣＴＡＧＡ andＣＴＡＣＴＧＴＧＡResult:ＣＴＴＧＡＣＴ－ＡＧＡＣＴ－－ＡＣＴＧＴＧＡ

Deletion Insertion Substitution

Dynamic programming

DISP@MD531

Figure out optimal sequence alignment(s).Steps:1. Recurrence relation2. Tabular computation3. TracebackProblem: Inefficient & much memory

→ O(MN) : bad for long sequences

Solution: Heuristic method→ FASTA & BLAST or… our method

Heuristic method Screen phase:

We first pick out the most similar sequences in the database.

Dynamic programming:Use the dynamic programming to further access the similarities of the picked out database sequences.

DISP@MD531

FASTA1. Look-up table for k-tuple words. (k = 4 to 6)Ex. TGACGA & ATGAGC, k=2.

DISP@MD531

Word Pos.1 Pos. 2 OffsetTG 1 2 -1GA 2 3 -1AC 3 XCG 4 XGA 5 XAG 4 XGC 5 XAT 1 X……

DISP@MD531

One X means one k-tuple word match

FASTA2. Find the 10 “best”(high-scoring) diagonal regions.

Note: If there is a long gap of a diagonal, we would cut it into 2 diagonal lines.

DISP@MD531

A G T CA 1 -5 -5 -5G -5 1 -5 -5T -5 -5 1 -5C -5 -5 -5 1

DISP@MD531

FASTA3. Keep only the most high-scoring diagonal regions.

Keep the ones whose score is greater than a threshold.

DISP@MD531

FASTA4. Try to join these remained diagonal regions into a

longer alignment.Score of the longer region = SUM(scores of the individual regions) – Gap penalties

Search for the longer region(initial region) with maximal score(INITN score).

DISP@MD531

FASTA5. Perform a local alignment by the dynamic

programming, and obtain the optimized score.

If the INITN score is greater than a threshold, we perform a local alignment between a 32 residue wide region centered on the best initial region and the query sequence.

DISP@MD531

FASTA6. Evaluate the significance of the optimized score.

DISP@MD531

1.825 0.57721 exp zP Z z e

1 DP Z zE Z z e

Lower E value, higher significance.

BLAST1. Make a k-tuple word list of the query sequence.

DISP@MD531

2. List the high-scoring words for each k-tuple words of the query sequence.

Score by substitution matrix. PQG ↔ PEG = 15, PQG ↔ PQA = 12 If threshold T =13, we only care about PEG in

the database sequences.

DISP@MD531

3. Scan the database sequences for exact match with the remaining high-scoring words.

Such as PEG

DISP@MD531

BLAST4. Extend the exact matches to high-scoring segment

pair (HSP).

DISP@MD531

5. List all of the HSPs in the database whose score is high enough to be considered.

cutoff score S

DISP@MD531

6. Access the significance of the HSP score. Score of random sequences: Gumbel EVD

7. Local alignments of the query and each of the matched database sequences

8. Report the most possible significant database sequences.

DISP@MD531

Our method1. Unitary mapping.

2. UDCR (Unitary Discrete CorRelation)algorithm : estimates the better-aligned location.If not found, insignificant.

DISP@MD531

Our method3. UDCR (better aligned location) + Dynamic

programming (alignments in detail) = CUDCR(Combined UDCR) algorithm

Only for semi-global and local alignments, not for global.

Discrete correlation is implemented by FFT or NTT, faster.

DISP@MD531

Our method

Remember that O(MN) of dynamic programming

By CUDCR, O(MN) can be significantly reduced, because we input shorter sequences to the dynamic programming.

DISP@MD531

Conclusion

UDCR for estimating the better-aligned location.CUDCR for local and semi-global alignments in

detail.Our method is faster than other methods with the

same accuracy.

DISP@MD531

Future Work

Perform FASTA, BLAST and our method by C language.

Try to further speed it up.Compare our method with other method more

impersonally.

DISP@MD531

Reference[1] J. Setubal and J. Meidanis, Introduction to Computational

Molecular Biology, PWS Pub., Boston, 1997.[2] Pearson W. R., Lipman D. J., Improved tools for biological

sequence comparison. Proc Natl Acad Sci U S A. 85, 2444-2448, 1988.

[3] S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipman, “Basic local alignment search tool”, J. Mol. Biol., vol. 215, pp. 403-410, 1990.

[4] D. Gusfield, Algorithms on Strings, Trees, and Sequences. Cambridge University Press, 1997.

[5]http://binfo.ym.edu.tw/ib/courses/course_94_2/advanced_bioinformatics.htm

DISP@MD531

Speed Up DNA Sequence Database Search and Alignment by Methods of DSP

Documents

BLAST and FASTA Heuristics in pairwise sequence alignmentksvi.mff.cuni.cz/~mraz/bioinf/BioAlg10-8.pdf · · 2011-01-04Heuristics in pairwise sequence alignment ... BLAST (1) •

Sequence Alignment 序列組合

Chapter 2: Sequence Alignment - Department of Computer ... · 1 COMS4761-- 2007 Prof. Yechiam Yemini (YY) Computer Science Department Columbia University Chapter 2: Sequence Alignment

EECS 730 Introduction to Bioinformatics Sequence Alignmentjhuan/EECS730_F12/slides/9... · 2012-09-30 · EECS 730 Introduction to Bioinformatics Sequence Alignment ... We know how

Global sequence alignment tool - bib.irb.hr€¦ · University of Zagreb aFculty of Electrical Engineering and Computing Master Thesis Assignment No. 1004 Global sequence alignment

Machine Learning for Biological Data Mining · 2015-12-16 · Machine Learning Techniques for Bio Data Mining?Sequence Alignment? Simulated Annealing? Genetic Algorithms?Structure

A Sub-quadratic Sequence Alignment Algorithm for Unrestricted Cost Matrices Maxime Crochemore Gad M. Landau Michal Ziv-Ukelson

Homology - genetics.wustl.edugenetics.wustl.edu/bio5488/files/2021/02/Bio5488_homology_2021.pdf · •Homology beyond sequence analysis •Next-gen sequencing alignment. Russell Doolittle

Class 70 Sequence Alignment

FIG S1 Protein sequence alignment of AgaO and characterized … · 2016-07-20 · FIG S1 Protein sequence alignment of AgaO and characterized GH50 β-agarases. Amino acid residues

Whole-Genome Prokaryote Phylogeny without Sequence Alignment Bailin HAO and Ji QI T-Life Research Center, Fudan University Shanghai 200433, China Institute

Multiple Sequence Alignment H.C.Huang @ 2005/9/29 2005 Autumn / YM / Bioinformatics

SM How to Move Roller Alignment - Mechanics - AIMCAL · Web 101.72SM – How to Move Roller Alignment - Mechanics ©2012 . ... Alignment Challenges • Bearing Housings ... Alignment

Implementasi Sequence Alignment pada Lingkungan Spark

A Sub-quadratic Sequence Alignment Algorithm for Unrestricted Cost Matrices

WS03/041 Dynamische Programmierung (4) Editierdistanz Approximative Zeichenkettensuche Sequence Alignment Prof. Dr. S. Albers Prof. Dr. Th Ottmann

Techniques for Protein Sequence Alignment and Database Searching G P S Raghava Scientist & Head Bioinformatics Centre, Institute of Microbial Technology,

Biostatistics-Lecture 15 High-throughput sequencing and sequence alignment Ruibin Xi Peking University School of Mathematical Sciences

DSP-chap7 2010cwlin/courses/dsp/notes/DSP... · 2010. 6. 11. · 2 Before we introduce the DFT, we consider the sampling of the Fourier transform of an aperiodic discrete-time sequence

Speed Up DNA Sequence Database Search and Alignment by Methods of DSP Student: Kang-Hua Hsu 徐康華 Advisor: Jian-Jiun Ding 丁建均 E-mail: r96942097@ntu.edu.tw