Visualization of K-Tuple Distribution in Prokaryote Complete Genomes and Their Randomized Counterparts

XIE Huimin ( 谢惠民 )

Department of Mathematics, Suzhou University

and

HAO Bailin ( 郝柏林 )

T-Life Research Center, Fudan University

Beijing Genomics Institute, Academia Sinica

Institute of Theoretical Physics, Academia Sinica


Prokaryote Complete Genomes (PCG)

Biology-inspired mathematics

Combinatorics: Goulden-Jackson cluster method

Factorizable language

Avoided and rare K-strings (K = 6-9, 15, 18)

Species-specificity of avoidance

Phylogeny (failure)

Phylogeny based on PCG compositional distance (success)

Decomposition and reconstruction of AA sequences

Graph theory: Euler paths


1. 2D Histogram of K-Tuples


[Figure: frame of the 2D histogram, with corner labels g, c (top row) and a, t (bottom row).]


Thermoanaerobacter tengcongensis (K = 8)



The Algorithm (Hao Histogram) Implemented at:

National Institute of Standards and Technology (NIST)

http://math.nist.gov/~FHunt/GenPatterns/

European Bioinformatics Institute (EBI)

http://industry.ebi.ac.uk/openBSA/bsa_viewers

However, these implement the 2D histogram only; there are no 1D histograms.
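A minimal sketch of how such a 2D histogram could be built. It assumes the convention suggested by the corner labels on the slides: each letter selects one quadrant of the square (g upper left, c upper right, a lower left, t lower right), and each further letter subdivides that quadrant. The function name `hao_histogram` is illustrative; this is not the NIST or EBI code.

```python
# Quadrant of each nucleotide as (row bit, column bit).
# The layout (g c on top, a t below) is an assumption taken from the slide labels.
QUADRANT = {"g": (0, 0), "c": (0, 1), "a": (1, 0), "t": (1, 1)}

def hao_histogram(seq, k):
    """Count every overlapping K-tuple of seq on a 2^k x 2^k grid."""
    size = 2 ** k
    grid = [[0] * size for _ in range(size)]
    seq = seq.lower()
    for start in range(len(seq) - k + 1):
        row = col = 0
        for letter in seq[start:start + k]:
            if letter not in QUADRANT:
                break  # skip tuples containing ambiguous symbols such as 'n'
            r, c = QUADRANT[letter]
            row, col = 2 * row + r, 2 * col + c
        else:
            # Reached only when all k letters were valid.
            grid[row][col] += 1
    return grid
```

Each cell of the grid corresponds to exactly one K-tuple, so cells that stay at zero mark the strings missing from the sequence.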


Two Mathematical Problems

Dimensions of the complementary sets of portraits of tagged strings.

Number of true and redundant missing strings.

The two problems turn out to be one and the same, the first being a graphical representation of the second.


Two Methods to Solve the Problem

Combinatorial solution: Goulden-Jackson cluster method (1979); number of dirty and clean words.

Language theory solution: factorizable language, minimal deterministic finite-state automaton.
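The automaton idea can be illustrated with a small sketch. It builds a deterministic automaton that tracks how much of one forbidden word has just been read, then counts by dynamic programming the words of length n that never complete it. This handles a single forbidden word only and is not the Goulden-Jackson cluster method itself; the function names are illustrative.

```python
ALPHABET = "acgt"

def build_automaton(word):
    """KMP-style DFA: state s is the length of the longest prefix of
    `word` that is a suffix of the text read so far (0 <= s < len(word))."""
    m = len(word)
    delta = [dict() for _ in range(m)]
    for state in range(m):
        for ch in ALPHABET:
            candidate = word[:state] + ch
            # Longest suffix of candidate that is also a prefix of word.
            length = min(len(candidate), m)
            while length > 0 and candidate[-length:] != word[:length]:
                length -= 1
            delta[state][ch] = length  # length == m means the word occurred
    return delta

def count_avoiding(word, n):
    """Number of length-n words over {a, c, g, t} that do not contain `word`."""
    m = len(word)
    delta = build_automaton(word)
    counts = [0] * m
    counts[0] = 1  # the empty text is in state 0
    for _ in range(n):
        nxt = [0] * m
        for state, ways in enumerate(counts):
            for ch in ALPHABET:
                target = delta[state][ch]
                if target < m:  # drop transitions that complete the word
                    nxt[target] += ways
        counts = nxt
    return sum(counts)
```

For example, `count_avoiding("aa", 3)` counts the 3-letter words with no two consecutive a's.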


2. 1D Histogram of K-Tuples

Collect those K-tuples whose counts fall in a bin from $n$ to $n + \Delta n$.

Plot the number of such K-tuples versus the counts.

This is a 1D histogram, or an expectation curve.
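The construction can be sketched in a few lines, assuming a bin width of one count and overlapping K-tuples. The function names are illustrative.

```python
from collections import Counter

def count_ktuples(seq, k):
    """Count all overlapping K-tuples of seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def one_d_histogram(seq, k, alphabet="acgt"):
    """Return {n: number of K-tuples that occur exactly n times}."""
    counts = count_ktuples(seq, k)
    # Keep only tuples made purely of alphabet letters.
    valid = {t: c for t, c in counts.items() if set(t) <= set(alphabet)}
    histogram = Counter(valid.values())
    # K-tuples that never occur form the n = 0 bin.
    histogram[0] = len(alphabet) ** k - len(valid)
    return dict(histogram)
```

Plotting the values of this dictionary against its keys gives the 1D histogram of a sequence.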


The effect of G+C content on the 2D histograms of the original genome and the randomized sequence:


Escherichia coli original genome


Escherichia coli randomized sequence


Haemophilus influenzae randomized sequence


Mycobacterium leprae original genome


Mycobacterium leprae randomized sequence


Mycobacterium tuberculosis original genome


Mycobacterium tuberculosis randomized sequence


G+C Content of Some Bacteria

Species            G+C Content
H. influenzae      38.15%
E. coli            50.79%
M. leprae          57.80%
M. tuberculosis    65.61%
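A small sketch of how the G+C content of a sequence can be computed, and of one way to produce a randomized counterpart. Randomizing by uniform shuffling is an assumption here; it preserves the letter composition, and therefore the G+C content, exactly. The function names are illustrative.

```python
import random

def gc_content(seq):
    """Fraction of g and c among the a, c, g, t letters of seq (non-empty)."""
    seq = seq.lower()
    total = sum(seq.count(x) for x in "acgt")
    return (seq.count("g") + seq.count("c")) / total

def randomize(seq, seed=0):
    """Return a shuffled copy of seq; the composition is unchanged."""
    letters = list(seq)
    random.Random(seed).shuffle(letters)
    return "".join(letters)
```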


3. Three Artificial Models Generating Sequences

Eiid: equal-probability independently and identically distributed model.

Niid: nonequal-probability independently and identically distributed model.

MMn: Markov model of order n
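The three models can be sketched as sequence generators. The function names and the layout of the `transitions` table are assumptions made for illustration. For the Markov model, the first `order` letters are drawn uniformly, which is a simplification.

```python
import random

ALPHABET = "acgt"

def generate_eiid(length, rng):
    """Eiid: every letter is drawn with probability 1/4."""
    return "".join(rng.choice(ALPHABET) for _ in range(length))

def generate_niid(length, probs, rng):
    """Niid: letters are drawn independently with weights probs = (pa, pc, pg, pt)."""
    return "".join(rng.choices(ALPHABET, weights=probs, k=length))

def generate_markov(length, order, transitions, rng):
    """MMn: transitions maps each context (a string of `order` letters)
    to four weights for the next letter, in the order a, c, g, t."""
    seq = [rng.choice(ALPHABET) for _ in range(order)]
    while len(seq) < length:
        context = "".join(seq[-order:]) if order else ""
        seq.append(rng.choices(ALPHABET, weights=transitions[context])[0])
    return "".join(seq[:length])
```

Passing an explicit `random.Random(seed)` object as `rng` keeps runs reproducible.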


Monte Carlo method: estimation of the expectation (ex) and standard deviation (sd) for an niid model (the composition of a, c, g, t is 15:35:35:15, the length of the sequence is $10^6$, and K = 8).
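A minimal sketch of such a Monte Carlo estimate. It generates many niid sequences, computes for each the number of K-tuples occurring exactly n times, and takes the sample mean and standard deviation for every n. The function names, the number of runs, and the small toy parameters are assumptions chosen so that the example runs quickly; the slide uses length $10^6$ and K = 8.

```python
import random
import statistics
from collections import Counter

def histogram_vector(seq, k, n_max):
    """X_n for n = 0..n_max: number of K-tuples occurring exactly n times."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    occupancy = Counter(counts.values())
    x = [occupancy.get(n, 0) for n in range(n_max + 1)]
    x[0] = 4 ** k - len(counts)  # K-tuples that never occur
    return x

def monte_carlo_ex_sd(weights, length, k, n_max, runs, seed=0):
    """Estimate ex and sd of X_n from `runs` simulated niid sequences."""
    rng = random.Random(seed)
    samples = []
    for _ in range(runs):
        seq = "".join(rng.choices("acgt", weights=weights, k=length))
        samples.append(histogram_vector(seq, k, n_max))
    columns = list(zip(*samples))  # one column of samples per value of n
    ex = [statistics.mean(col) for col in columns]
    sd = [statistics.stdev(col) for col in columns]
    return ex, sd

# Toy setting with the slide's composition a:c:g:t = 15:35:35:15.
ex, sd = monte_carlo_ex_sd([15, 35, 35, 15], length=2000, k=3, n_max=150, runs=20)
```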


Validation of the Robustness of K-Histograms: a comparison of the absolute error from ex in an experiment, with sd as reference


Compare the population obtained by shuffling a given sequence with the population of sequences generated from a stochastic model.

F-test, t-test


4. A Theory for the Expectation Curve (1)

Definition. For each $n \ge 0$, define a random variable

$$X_n = I_{1,n} + I_{2,n} + \cdots + I_{4^K,n} = \sum_{i=1}^{4^K} I_{i,n}, \qquad (1)$$

where the random variable $I_{i,n}$ takes the value 1 if the i-th K-tuple occurs exactly $n$ times in the sequence, and the value 0 otherwise. Its expectation is denoted by $p_{i,n} = E(I_{i,n})$.


A Theory for the Expectation Curve (2)

Theorem. For each $n \ge 0$, the mathematical expectation of the random variable $X_n$ is given by

$$E(X_n) = p_{1,n} + p_{2,n} + \cdots + p_{4^K,n} = \sum_{i=1}^{4^K} p_{i,n},$$

where $p_{i,n} = E(I_{i,n})$ and the random variable $I_{i,n}$ indicates whether the K-tuple of the i-th type occurs exactly $n$ times.


The Exact Computation of Expectation Curve

In order to compute the expectation curve we need to know the probability $p_{i,n}$ for each $i \in \{1, 2, \ldots, 4^K\}$ and $n \ge 0$.

The Goulden-Jackson cluster method can be used successfully for the eiid model.

It is still difficult to do the computation for other models.


Two Experiments (for the eiid model): comparison with a K-histogram, and comparison with the Monte Carlo method.

The red curves are the standard deviation estimates obtained by the Monte Carlo method.


Poisson Approximation for the Expectation Curve

For each K-tuple $i$, calculate its expected number of occurrences $\lambda_i$ in a sequence of length $N$, then apply the probability function of the Poisson distribution and sum over all K-tuples:

$$p_{i,n} \approx \frac{\lambda_i^n}{n!} e^{-\lambda_i}, \qquad 1 \le i \le 4^K, \quad n \ge 0.$$

Remark. This follows from a theorem in Percus and Whitlock, ACM Transactions on Modeling and Computer Simulation, 5 (1995) 87-100 (the model, however, can only be eiid, and the tuples must be overlapless).
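A minimal sketch of the Poisson approximation for an niid model, under some assumptions not stated on the slide. The expected count of a K-tuple is taken as (N - K + 1) times the product of its letter probabilities, and all four probabilities are positive. Tuples with the same letter composition share the same expected count, so they are grouped instead of enumerating all $4^K$ tuples. The function names are illustrative.

```python
import math

def compositions(k):
    """All (na, nc, ng, nt) with na + nc + ng + nt = k."""
    for na in range(k + 1):
        for nc in range(k + 1 - na):
            for ng in range(k + 1 - na - nc):
                yield na, nc, ng, k - na - nc - ng

def poisson_expectation_curve(probs, length, k, n_max):
    """Approximate E(X_n) for n = 0..n_max; probs = (pa, pc, pg, pt), all > 0."""
    positions = length - k + 1  # number of K-tuple start positions
    curve = [0.0] * (n_max + 1)
    for comp in compositions(k):
        multiplicity = math.factorial(k)  # becomes the multinomial coefficient
        lam = positions
        for count, p in zip(comp, probs):
            multiplicity //= math.factorial(count)
            lam *= p ** count
        for n in range(n_max + 1):
            # Poisson probability in log space, to avoid overflow of lam ** n.
            log_pmf = n * math.log(lam) - lam - math.lgamma(n + 1)
            curve[n] += multiplicity * math.exp(log_pmf)
    return curve
```

Under the eiid model all tuples share one expected count, and the curve reduces to $4^K$ times a single Poisson probability function.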


Comparison of the Poisson approximation with the K-histogram for U. urealyticum


Comparison of the Poisson approximation with the 7-histogram for Haemophilus influenzae


Comparison of the Poisson approximation with the 8-histogram for Haemophilus influenzae


A comparison of the Poisson approximation with the Monte Carlo method

In this computation the model is niid, with parameters taken from the randomized sequence of H. influenzae.


5. Analysis of the Mechanism of Multi-Modal K-histograms

An example for H. influenzae. The length of its genome is 1830023. Under the simplified conditions $p_a = p_t = 0.30925$, $p_c = p_g = 0.19075$, for $K = 8$ there are only 9 different values of $\lambda_k$, as shown in the following list.

k            0      1      2      3      4      5     6     7     8
$\lambda_k$  153.1  94.4   58.2   35.9   22.2   13.7  8.4   5.2   3.2


The following figure shows the nine individual probability functions and their sum.

Notice the effect of the ratio of successive modes:

$$\frac{\lambda_{k+1}}{\lambda_k} = \frac{0.19075}{0.30925} = 0.616815 \quad \text{for } k = 0, 1, \ldots, 7.$$
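The nine values for H. influenzae can be reproduced with a short calculation. It assumes that k counts the c or g letters in an 8-tuple, and that the expected count is the genome length times the product of the letter probabilities. The variable names are illustrative.

```python
from math import comb

N = 1830023      # genome length of H. influenzae, from the slide
K = 8
P_AT = 0.30925   # p_a = p_t
P_CG = 0.19075   # p_c = p_g

# Assumption: k is the number of c/g letters in the K-tuple.
lambdas = [N * P_AT ** (K - k) * P_CG ** k for k in range(K + 1)]

# Number of distinct K-tuples sharing each value: choose the c/g positions,
# then pick one of two letters at every position.
class_sizes = [comb(K, k) * 2 ** K for k in range(K + 1)]

ratio = P_CG / P_AT  # ratio of successive expected counts

for k, (lam, size) in enumerate(zip(lambdas, class_sizes)):
    print(f"k={k}  lambda={lam:6.1f}  tuples={size}")
print(f"ratio: {ratio:.6f}")
```

Each class contributes one Poisson-shaped peak near its expected count. A ratio well below 1 keeps the peaks apart, which is consistent with a multi-modal histogram.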


For E. coli the ratio is 0.968931, hence the result is quite different.


6. Analysis of Short-Range Correlation by K-Histograms

Two 8-histograms for E. coli: the left one is from its genome, and the right one is from its Markov model of order 1.
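A Markov model of a given order can be estimated from a sequence by counting which letter follows each context. The sketch below uses plain relative frequencies with no smoothing, which is an assumption; the function name `fit_markov` is illustrative.

```python
from collections import Counter, defaultdict

def fit_markov(seq, order):
    """Estimate P(next letter | previous `order` letters) from seq."""
    counts = defaultdict(Counter)
    for i in range(order, len(seq)):
        counts[seq[i - order:i]][seq[i]] += 1
    model = {}
    for context, following in counts.items():
        total = sum(following.values())
        model[context] = {letter: c / total for letter, c in following.items()}
    return model
```

With `order=0` the only context is the empty string, and the model reduces to the letter frequencies of an niid model.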


Compare the 8-histograms of Markov models of orders 2 to 7 for E. coli.


Using a Markov model of order 5 (MM5) and the Monte Carlo method, compare the 8-histogram of the complete genome sequence of E. coli with the ex and sd of MM5.

The red curve is the expectation curve estimated from 50 simulations.

The ratio curve shows $|X_n - \mathrm{ex}| / \mathrm{sd}$ for $n = 0, \ldots, 200$.
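The ratio curve is a pointwise calculation once the observed histogram, ex, and sd are available as lists indexed by n. In this sketch, returning `None` where sd is zero is an assumption, and the function name is illustrative.

```python
def ratio_curve(observed, ex, sd):
    """Return |X_n - ex_n| / sd_n for each n; None where sd_n is zero."""
    return [abs(x - e) / s if s > 0 else None
            for x, e, s in zip(observed, ex, sd)]
```

Values well above 1 mark counts n at which the genome departs from the model by more than one standard deviation.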


Reference:

Huimin Xie, Bailin Hao, “Visualization of K-tuple distribution in prokaryote complete genomes and their randomized counterparts”, CSB2002: IEEE Computer Systems Bioinformatics Conference Proceedings, IEEE Computer Society, Los Alamitos, 2002, 31-42.


7. Discussion

Most of the results shown above are experimental in nature; many problems are left for future study.

How can the value of K be selected reasonably?

How can 1D visualization be applied to proteins?

What are the properties of the random variables $X_n$?

How can the expectation curve be computed exactly for the niid and MMn models?

Why is the Poisson approximation effective without considering the overlap of K-tuples?
