XIE Huimin ( 谢惠民 ) Department of Mathematics, Suzhou University and HAO Bailin ( 郝柏林 )

Visualization of K-Tuple Distribution in Prokaryote Complete Genomes and Their Randomized Counterparts

XIE Huimin ( 谢惠民 )

Department of Mathematics, Suzhou University

and

HAO Bailin ( 郝柏林 )

T-Life Research Center, Fudan University

Beijing Genomics Institute, Academia Sinica

Institute of Theoretical Physics, Academia Sinica

Prokaryote Complete Genomes ( PCG )

K Biology-inspired mathematics

Combinatorics: Goulden-Jackson cluster method

Factorizable language

Avoided and rare K-strings, K = 6-9, 15, 18

Species-specificity of avoidance

Phylogeny (failure)

Phylogeny based on PCG compositional distance (success)

Decomposition and reconstruction of AA sequences

Graph theory: Euler paths

1. 2D Histogram of K-Tuples

[Figure: 2D histogram of K-tuples for Thermoanaerobacter tengcongensis (K = 8). Corner labels: g (upper left), c (upper right), a (lower left), t (lower right).]
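A minimal sketch of how K-tuple counts could be placed on a 2^K x 2^K grid. The quadrant layout is read off the corner labels of the figure and the recursive subdivision rule is an assumption; the function names `tuple_to_cell` and `histogram_2d` are illustrative and are not taken from the NIST or EBI implementations.

```python
from collections import Counter

# Assumed quadrant layout, read off the corner labels of the figure:
#   g c   (top row)
#   a t   (bottom row)
# Each letter contributes one (row_bit, col_bit) pair.
QUADRANT = {"g": (0, 0), "c": (0, 1), "a": (1, 0), "t": (1, 1)}


def tuple_to_cell(word):
    """Map a K-tuple to (row, col) in a 2^K x 2^K grid.

    The first letter picks the coarsest quadrant, the second letter the
    sub-quadrant inside it, and so on.
    """
    row = col = 0
    for letter in word:
        r, c = QUADRANT[letter]
        row = (row << 1) | r
        col = (col << 1) | c
    return row, col


def histogram_2d(seq, k):
    """Count every overlapping K-tuple of seq and place it on the grid."""
    size = 1 << k
    grid = [[0] * size for _ in range(size)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    for word, n in counts.items():
        # Skip tuples containing letters other than a, c, g, t.
        if all(letter in QUADRANT for letter in word):
            row, col = tuple_to_cell(word)
            grid[row][col] = n
    return grid
```

Cells that stay at zero correspond to K-tuples missing from the sequence; rendering the grid with a color scale is left to any plotting tool.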

The algorithm (Hao histogram) is implemented at:

National Institute of Standards and Technology (NIST)

http://math.nist.gov/~FHunt/GenPatterns/

European Bioinformatics Institute (EBI)

http://industry.ebi.ac.uk/openBSA/bsa_viewers

However, both provide 2D histograms only, not 1D histograms.


Two Mathematical Problems

Dimensions of the complementary sets of portraits of tagged strings.

Number of true and redundant missing strings.

The two problems turn out to be one and the same, the first being a graphic representation of the second.

Two Methods to Solve the Problem

Combinatorial solution: Goulden-Jackson cluster method (1979); number of dirty and clean words.

Language theory solution: factorizable language, minimal deterministic finite-state automaton.

2. 1D Histogram of K-Tuples

Collect those K-tuples whose counts fall in a bin from $n$ to $n + \Delta n$.

Plot the number of such K-tuples versus the count $n$.

The result is a 1D histogram, or an expectation curve.
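The collection step can be sketched as follows, assuming a bin width of 1 and overlapping K-tuples read along a single strand. The function name `k_histogram` is illustrative.

```python
from collections import Counter


def k_histogram(seq, k, alphabet="acgt"):
    """Return {n: number of distinct K-tuples occurring exactly n times}.

    Bin width is 1 here. K-tuples that never occur are counted in n = 0,
    so the values sum to len(alphabet) ** k.
    """
    valid = set(alphabet)
    counts = Counter(
        seq[i:i + k]
        for i in range(len(seq) - k + 1)
        if set(seq[i:i + k]) <= valid   # ignore tuples with other letters
    )
    hist = Counter(counts.values())
    hist[0] = len(alphabet) ** k - len(counts)
    return dict(sorted(hist.items()))
```

Plotting the values against the keys gives the 1D histogram; a wider bin can be obtained by summing neighboring keys.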

The effect of C+G content on the 2D histograms of the original genome and the randomized sequence:

Escherichia coli original genome

Escherichia coli randomized sequence

Haemophilus influenzae randomized sequence

Mycobacterium leprae original genome

Mycobacterium leprae randomized sequence

Mycobacterium tuberculosis original genome

Mycobacterium tuberculosis randomized sequence

G+C Content of Some Bacteria

Species            G+C Content
H. influenzae      38.15%
E. coli            50.79%
M. leprae          57.80%
M. tuberculosis    65.61%
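A small sketch of the two operations behind this comparison: measuring G+C content and producing a randomized counterpart. It assumes that randomization means a uniform shuffle of the letters, which keeps the composition fixed; the function names are illustrative.

```python
import random


def gc_content(seq):
    """Fraction of letters that are g or c (case-insensitive)."""
    s = seq.lower()
    return (s.count("g") + s.count("c")) / len(s)


def randomize(seq, seed=None):
    """Return a shuffled copy of seq; letter composition is unchanged."""
    letters = list(seq)
    random.Random(seed).shuffle(letters)
    return "".join(letters)
```

Because shuffling preserves the letter counts, the original and the randomized sequence have the same G+C content by construction.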

3. Three Artificial Models Generating Sequences

eiid: equal-probability, independent and identically distributed model.

niid: nonequal-probability, independent and identically distributed model.

MMn: Markov model of order n.
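A minimal sketch of sequence generation for the two i.i.d. models. The function name `generate_iid` and its parameters are illustrative; MMn additionally needs a table of transition probabilities and is not covered by this sketch.

```python
import random


def generate_iid(length, probs=None, seed=None):
    """Draw an i.i.d. sequence over a, c, g, t.

    probs=None gives the eiid model (each letter has probability 1/4);
    a dict such as {"a": 0.15, "c": 0.35, "g": 0.35, "t": 0.15} gives niid.
    """
    letters = "acgt"
    weights = [0.25] * 4 if probs is None else [probs[x] for x in letters]
    rng = random.Random(seed)
    return "".join(rng.choices(letters, weights=weights, k=length))
```

For example, `generate_iid(1000, seed=1)` draws an eiid sequence of 1000 letters, and passing a `probs` dict switches to niid.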

Monte Carlo Method: estimation of the expectation (ex) and standard deviation (sd) for an niid model

(the composition of a, c, g, t is 15:35:35:15, the sequence length is $10^6$, and K = 8).
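A sketch of such a Monte Carlo estimate, under the assumption that ex and sd are the sample mean and sample standard deviation of the histogram values over repeated simulations. The function names, the number of runs, and the small sequence length and K used in the example run are illustrative choices made to keep the run short.

```python
import random
from collections import Counter
from statistics import mean, stdev


def k_histogram(seq, k, n_max):
    """hist[n] = number of distinct K-tuples occurring exactly n times.

    K-tuples occurring more than n_max times are ignored.
    """
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    hist = [0] * (n_max + 1)
    hist[0] = 4 ** k - len(counts)          # K-tuples that never occur
    for n in counts.values():
        if n <= n_max:
            hist[n] += 1
    return hist


def monte_carlo_ex_sd(probs, length, k, n_max, runs, seed=0):
    """Estimate ex[n] and sd[n] over `runs` simulated niid sequences."""
    rng = random.Random(seed)
    letters = "acgt"
    weights = [probs[x] for x in letters]
    samples = []
    for _ in range(runs):
        seq = "".join(rng.choices(letters, weights=weights, k=length))
        samples.append(k_histogram(seq, k, n_max))
    columns = list(zip(*samples))            # one column per value of n
    ex = [mean(col) for col in columns]
    sd = [stdev(col) for col in columns]     # sample sd, needs runs >= 2
    return ex, sd


# Small illustrative run; the setting on the slide is length 10**6 and K = 8.
probs = {"a": 0.15, "c": 0.35, "g": 0.35, "t": 0.15}
ex, sd = monte_carlo_ex_sd(probs, length=2000, k=3, n_max=200, runs=20)
```

The lists `ex` and `sd` are indexed by the count n, so they can be plotted directly against a K-histogram with the same K and sequence length.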

Validation of the Robustness of K-Histograms: a comparison of the absolute deviation from ex in an experiment, with sd as reference

Compare the population obtained by shuffling a given sequence with the population of sequences generated from a stochastic model.

F-test, t-test

4. A Theory for the Expectation Curve (1)

Definition. For each $n \ge 0$, define a random variable

$$X_n = I_{n,1} + I_{n,2} + \cdots + I_{n,4^K} = \sum_{i=1}^{4^K} I_{n,i}, \qquad (1)$$

where the random variable $I_{n,i}$ takes the value 1 if the i-th K-tuple occurs exactly n times in the sequence, and the value 0 otherwise. Write $p_{n,i} = E(I_{n,i})$.

A Theory for the Expectation Curve (2)

Theorem. For each $n \ge 0$, the mathematical expectation of the random variable $X_n$ is given by

$$E(X_n) = p_{n,1} + p_{n,2} + \cdots + p_{n,4^K} = \sum_{i=1}^{4^K} p_{n,i},$$

where $p_{n,i} = E(I_{n,i})$ is the probability that the K-tuple of the i-th type occurs exactly n times. This follows from the definition of $X_n$ by linearity of expectation.

The Exact Computation of Expectation Curve

In order to compute the expectation curve we need to know the probability $p_{n,i}$ for each $i \in \{1, 2, \ldots, 4^K\}$ and $n \ge 0$.

The Goulden-Jackson cluster method can be used successfully for the eiid model.

For the other models the computation is still difficult.

Two Experiments (for the eiid model): comparison with a K-histogram, and comparison with the Monte Carlo method.

The red curves are the standard deviation estimates obtained by the Monte Carlo method.

Poisson Approximation for the Expectation Curve

For each K-tuple, calculate its expected number of occurrences $\lambda_i$ in a sequence of length N, then apply the probability function of the Poisson distribution and sum over all K-tuples:

$$p_{n,i} \approx \frac{\lambda_i^{n}}{n!}\, e^{-\lambda_i}, \qquad 1 \le i \le 4^K, \quad n \ge 0.$$

Remark. This follows from a theorem in Percus and Whitlock, ACM Transactions on Modeling and Computer Simulation, 5 (1995) 87-100 (the model, however, can only be eiid, and the tuples must be overlapless).
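A sketch of the Poisson approximation for an i.i.d. letter model. It assumes that the expected number of occurrences of a K-tuple is (N - K + 1) times its probability, and it groups K-tuples by letter composition so that the sum over 4^K tuples stays small. The function names are illustrative.

```python
from math import exp, factorial, lgamma, log


def compositions(k):
    """Yield all (n_a, n_c, n_g, n_t) with non-negative entries summing to k."""
    for na in range(k + 1):
        for nc in range(k + 1 - na):
            for ng in range(k + 1 - na - nc):
                yield na, nc, ng, k - na - nc - ng


def poisson_expectation_curve(probs, length, k, n_max):
    """Approximate the expectation curve for n = 0..n_max.

    probs is (p_a, p_c, p_g, p_t), all assumed positive. K-tuples with the
    same letter composition share one rate, so each composition is weighted
    by its multinomial coefficient.
    """
    windows = length - k + 1              # assumed number of K-tuple positions
    curve = [0.0] * (n_max + 1)
    for comp in compositions(k):
        multiplicity = factorial(k)
        word_prob = 1.0
        for p, m in zip(probs, comp):
            multiplicity //= factorial(m)
            word_prob *= p ** m
        lam = windows * word_prob
        for n in range(n_max + 1):
            # Poisson probability in log space to avoid overflow for large n
            log_pmf = n * log(lam) - lam - lgamma(n + 1)
            curve[n] += multiplicity * exp(log_pmf)
    return curve
```

For the eiid model all rates coincide, so the curve is a single Poisson profile scaled by 4^K; unequal probabilities give a sum of several profiles.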

Comparison of the Poisson approximation with the K-histogram for U. urealyticum

Comparison of the Poisson approximation with the 7-histogram for Haemophilus influenzae

Comparison of the Poisson approximation with the 8-histogram for Haemophilus influenzae

A comparison of the Poisson approximation with the Monte Carlo method

In this computation the model is niid, with parameters taken from the randomized sequence of H. influenzae.

5. Analysis of the Mechanism of Multi-Modal K-histograms

An example for H. influenzae. The length of its genome is 1830023. Under the simplified conditions

$$p_a = p_t = 0.30925, \qquad p_c = p_g = 0.19075$$

for $K = 8$, there are only 9 different values of $\lambda_k$, where $k$ is the number of letters c or g in the K-tuple, as shown in the following list.

k              0      1      2      3      4      5      6      7      8
$\lambda_k$  153.1   94.4   58.2   35.9   22.2   13.7    8.4    5.2    3.2

The following figure shows the nine individual probability functions and their sum.

Notice the effect of the ratio of successive modes:

$$\frac{\lambda_{k+1}}{\lambda_k} = \frac{0.19075}{0.30925} = 0.616815 \quad \text{for } k = 0, 1, \ldots, 7.$$

k

For E. coli the ratio is 0.968931, hence the result is quite different.
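The numbers in this example can be checked with a few lines of arithmetic. The sketch assumes that the expected count of a K-tuple is the genome length times the product of its letter probabilities; the variable names are illustrative.

```python
from math import comb

N = 1830023                 # genome length of H. influenzae (from the slide)
K = 8
P_AT = 0.30925              # p_a = p_t
P_CG = 0.19075              # p_c = p_g

# Expected count of a K-tuple containing exactly k letters from {c, g}
lambdas = [N * P_AT ** (K - k) * P_CG ** k for k in range(K + 1)]

# Number of distinct K-tuples in class k: choose the k positions, then one of
# two letters at every position.
class_sizes = [comb(K, k) * 2 ** K for k in range(K + 1)]

ratio = P_CG / P_AT         # ratio of successive expected counts, same for every k

for k, (lam, size) in enumerate(zip(lambdas, class_sizes)):
    print(f"k={k}  expected count={lam:6.1f}  tuples in class={size}")
print(f"ratio = {ratio:.6f}")
```

A ratio well below 1 spreads the nine Poisson profiles apart, which is what makes separate modes visible; a ratio close to 1, as for E. coli, lets them merge.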

6. Analysis of Short-Range Correlation by K-Histograms

Two 8-histograms for E. coli: the left one is from its genome, and the right one is from its Markov model of order 1.

Comparison of the 8-histograms of Markov models of orders 2 to 7 for E. coli
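A sketch of how a Markov model of a given order might be estimated from a sequence and then used to generate a surrogate sequence. It assumes transition probabilities are taken as observed follower frequencies; the function names and the restart rule for unseen contexts are illustrative choices, not taken from the talk.

```python
import random
from collections import Counter, defaultdict


def fit_markov(seq, order):
    """Count, for every context of `order` letters, which letter follows it."""
    table = defaultdict(Counter)
    for i in range(len(seq) - order):
        table[seq[i:i + order]][seq[i + order]] += 1
    return table


def generate_markov(table, order, length, start, seed=None):
    """Sample a sequence of `length` letters from the fitted model (order >= 1).

    `start` supplies the first `order` letters. If a context was never
    followed by anything in the training sequence, generation restarts
    from a randomly chosen observed context (an ad hoc choice).
    """
    rng = random.Random(seed)
    contexts = sorted(table)
    out = list(start[:order])
    while len(out) < length:
        context = "".join(out[-order:])
        followers = table.get(context)
        if not followers:
            out.extend(rng.choice(contexts))
            continue
        letters = list(followers)
        weights = [followers[x] for x in letters]
        out.append(rng.choices(letters, weights=weights)[0])
    return "".join(out[:length])
```

A K-histogram of the generated sequence can then be set beside that of the genome; a higher order preserves more of the short-range correlation.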

Using a Markov model of order 5 (MM5) and the Monte Carlo method to compare the 8-histogram of the complete genome sequence of E. coli with the ex and sd of MM5.

The red curve is the expectation curve estimated from 50 simulations.

This is the ratio curve

$$\frac{|X_n - \mathrm{ex}|}{\mathrm{sd}} \quad \text{for } n = 0, \ldots, 200.$$

Reference:

Huimin Xie, Bailin Hao, “Visualization of K-tuple distribution in prokaryote complete genomes and their randomized counterparts”, CSB2002: IEEE Computer Systems Bioinformatics Conference Proceedings, IEEE Computer Society, Los Alamitos, 2002, 31-42.

7. Discussion

Most of the results shown above are experimental in nature; many problems are left for future study.

How should the value of K be selected reasonably?

How can 1D visualization be applied to proteins?

What are the properties of the random variables $X_n$?

How can the expectation curve be computed exactly for the niid and MMn models?

Why is the Poisson approximation effective without considering the overlap of K-tuples?
