Visualization of K-Tuple Distribution in Prokaryote Complete Genomes and Their Randomized Counterparts
XIE Huimin ( 谢惠民 )
Department of Mathematics, Suzhou University
and
HAO Bailin ( 郝柏林 )
T-Life Research Center, Fudan University
Beijing Genomics Institute, Academia Sinica
Institute of Theoretical Physics, Academia Sinica
Prokaryote Complete Genomes (PCG)
Biology-inspired mathematics
Combinatorics: Goulden-Jackson cluster method
Factorizable language
Avoided and rare K-strings (K = 6-9, 15, 18)
Species-specificity of avoidance
Phylogeny (failure)
Phylogeny based on PCG compositional distance (success)
Decomposition and reconstruction of AA sequences
Graph theory: Euler paths
1. 2D Histogram of K-Tuples
Figure: 2D histogram of K-tuples for Thermoanaerobacter tengcongensis (K = 8). The corners of the square are labeled g and c (top), a and t (bottom).
The algorithm (Hao histogram) is implemented at:
National Institute of Standards and Technology (NIST)
http://math.nist.gov/~FHunt/GenPatterns/
European Bioinformatics Institute (EBI)
http://industry.ebi.ac.uk/openBSA/bsa_viewers
However, these provide 2D histograms only, not 1D histograms.
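A 2D histogram of this kind can be sketched in Python. The slides only show the corner labels (g, c on top; a, t at the bottom), so the recursive quadrant rule used here, in which each successive letter halves the current square, is an assumption, and the names `CORNER`, `cell` and `histogram_2d` are hypothetical.

```python
# Assumed corner layout: g top-left, c top-right, a bottom-left, t bottom-right.
CORNER = {"g": (0, 0), "c": (0, 1), "a": (1, 0), "t": (1, 1)}

def cell(ktuple):
    """Map a K-tuple to (row, col) in a 2^K x 2^K grid.

    Each letter selects one quadrant of the square chosen so far
    (an assumed rule, not confirmed by the slides).
    """
    row = col = 0
    for ch in ktuple:
        r, c = CORNER[ch]
        row = 2 * row + r
        col = 2 * col + c
    return row, col

def histogram_2d(seq, K):
    """Count every overlapping K-tuple of seq into its grid cell."""
    size = 2 ** K
    grid = [[0] * size for _ in range(size)]
    for i in range(len(seq) - K + 1):
        r, c = cell(seq[i:i + K])
        grid[r][c] += 1
    return grid
```

Cells that stay at zero correspond to K-tuples that never occur in the sequence.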
Two Mathematical Problems
Dimensions of the complementary sets of portraits of tagged strings.
Number of true and redundant missing strings.
The two problems turn out to be one and the same, the first being graphic representation of the second.
Two Methods to Solve the Problem
Combinatorial solution: Goulden-Jackson cluster method (1979); number of dirty and clean words.
Language theory solution: factorizable language, minimal deterministic finite-state automaton.
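To make the notion of clean and dirty words concrete, the following Python sketch counts the words of a given length that avoid one forbidden string, using a finite-state automaton and dynamic programming. It is a simple illustration under the assumption of a single forbidden string; it is neither the Goulden-Jackson cluster method nor the minimal automaton construction for a set of strings, and the function names are hypothetical.

```python
def failure_function(pattern):
    """Knuth-Morris-Pratt failure table for the forbidden string."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail

def count_clean_words(pattern, length, alphabet="acgt"):
    """Number of words of the given length that do not contain pattern."""
    m = len(pattern)
    fail = failure_function(pattern)

    def step(state, ch):
        # state = length of the longest prefix of pattern matched so far
        while state and pattern[state] != ch:
            state = fail[state - 1]
        if pattern[state] == ch:
            state += 1
        return state

    ways = [0] * m          # ways[s] = clean words ending in automaton state s
    ways[0] = 1
    for _ in range(length):
        nxt = [0] * m
        for state, w in enumerate(ways):
            if w:
                for ch in alphabet:
                    s = step(state, ch)
                    if s < m:   # reaching state m would complete the pattern
                        nxt[s] += w
        ways = nxt
    return sum(ways)
```

The number of dirty words of the same length is then `4**length - count_clean_words(pattern, length)`.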
2. 1D Histogram of K-Tuples
Collect those K-tuples whose counts fall in a bin from $n$ to $n + \Delta n$.
Plot the number of such K-tuples versus the count $n$.
This is a 1D histogram, or an expectation curve.
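The construction of a 1D histogram can be sketched in Python as follows. This is a minimal illustration with bin width 1, assuming the sequence is a string over the letters a, c, g, t; the function names `k_tuple_counts` and `k_histogram` are hypothetical.

```python
from collections import Counter

def k_tuple_counts(seq, K):
    """Count every overlapping K-tuple in seq."""
    return Counter(seq[i:i + K] for i in range(len(seq) - K + 1))

def k_histogram(seq, K, alphabet="acgt"):
    """Map each count n to the number of K-tuples occurring exactly n times."""
    counts = k_tuple_counts(seq, K)
    hist = Counter(counts.values())
    # K-tuples that never occur are not in `counts`; add them as n = 0.
    hist[0] = len(alphabet) ** K - len(counts)
    return hist
```

For example, `k_histogram("acgtacgt", 2)` reports that three 2-tuples occur twice, one occurs once, and twelve do not occur at all.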
The effect of G+C content in 2D histograms of the original genome and the randomized sequence:
Escherichia coli original genome
Escherichia coli randomized sequence
Haemophilus influenzae randomized sequence
Mycobacterium leprae original genome
Mycobacterium leprae randomized sequence
Mycobacterium tuberculosis original genome
Mycobacterium tuberculosis randomized sequence
G+C Content of Some Bacteria
Species            G+C Content
H. influenzae      38.15%
E. coli            50.79%
M. leprae          57.80%
M. tuberculosis    65.61%
3. Three Artificial Models Generating Sequences
eiid: equal-probability independently and identically distributed model.
niid: nonequal-probability independently and identically distributed model.
MMn: Markov model of order n.
Monte Carlo Method
Estimation of the expectation (ex) and standard deviation (sd) for an niid model (the composition of a, c, g, t is 15:35:35:15, the sequence length is $10^6$, and K = 8).
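A Monte Carlo estimate of ex and sd for an niid model might look like the following Python sketch. The function names, the `seed` parameter and the number of trials are assumptions made for illustration; with the parameters quoted above (length $10^6$, K = 8) a pure-Python run would be slow, so the usage note uses a much smaller case.

```python
import random
import statistics
from collections import Counter

def niid_sequence(length, weights, rng):
    """Draw an iid sequence with letter weights given in the order a, c, g, t."""
    return "".join(rng.choices("acgt", weights=weights, k=length))

def histogram(seq, K, n_max):
    """hist[n] = number of K-tuples occurring exactly n times, for n = 0..n_max."""
    counts = Counter(seq[i:i + K] for i in range(len(seq) - K + 1))
    hist = [0] * (n_max + 1)
    hist[0] = 4 ** K - len(counts)      # K-tuples that never occur
    for c in counts.values():
        if c <= n_max:
            hist[c] += 1
    return hist

def monte_carlo_ex_sd(length, weights, K, n_max, trials, seed=0):
    """Estimate mean (ex) and sample standard deviation (sd) of each hist[n]."""
    rng = random.Random(seed)
    runs = [histogram(niid_sequence(length, weights, rng), K, n_max)
            for _ in range(trials)]
    ex = [statistics.mean(col) for col in zip(*runs)]
    sd = [statistics.stdev(col) for col in zip(*runs)]
    return ex, sd
```

A small trial such as `monte_carlo_ex_sd(2000, (15, 35, 35, 15), K=3, n_max=400, trials=5)` runs quickly and returns two lists of length 401.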
Validation of the Robustness of K-Histograms: a comparison of the absolute error from ex in an experiment, with sd as reference
Compare the population obtained by shuffling a given sequence with the population of sequences generated from a stochastic model.
F-test, t-test
4. A Theory for the Expectation Curve (1)
Definition. For each $n \ge 0$, define a random variable
$$X_n = I_{1,n} + I_{2,n} + \cdots + I_{4^K,n} = \sum_{i=1}^{4^K} I_{i,n},$$
where the random variable $I_{i,n}$ takes the value 1 if the i-th K-tuple occurs exactly $n$ times in the sequence, and the value 0 otherwise. Write
$$p_{i,n} = E(I_{i,n}). \qquad (1)$$
Theorem. For each $n \ge 0$, the mathematical expectation of the random variable $X_n$ is given by
$$E(X_n) = p_{1,n} + p_{2,n} + \cdots + p_{4^K,n} = \sum_{i=1}^{4^K} p_{i,n},$$
where $p_{i,n}$ is defined by (1): the probability that the occurrence number of the K-tuple of the i-th type is exactly $n$.
A Theory for the Expectation Curve (2)
The Exact Computation of Expectation Curve
In order to compute the expectation curve we need to know the probability $p_{i,n}$ for each $i \in \{1, 2, \ldots, 4^K\}$ and $n \ge 0$.
The Goulden-Jackson cluster method can be used successfully for the eiid model.
It is still difficult to do the computation for other models.
Two Experiments (for the eiid model): compare with a K-histogram; compare with the Monte Carlo method.
The red curves are the standard deviation estimates obtained by the Monte Carlo method.
Poisson Approximation for the Expectation Curve
For each K-tuple $i$, calculate its expected number of appearances $\lambda_i$ in a sequence of length N, then use the probability function of the Poisson distribution and sum over all K-tuples:
$$p_{i,n} \approx \frac{\lambda_i^n}{n!} e^{-\lambda_i}, \qquad 1 \le i \le 4^K, \quad n \ge 0.$$
Remark. This follows from a theorem in Percus and Whitlock, ACM Transactions on Modeling and Computer Simulation, 5 (1995) 87-100 (the model, however, can only be eiid, and the tuples must be overlap-free).
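The Poisson approximation can be sketched in Python for an iid model. Here the expected count of a K-tuple is taken as (N - K + 1) times the product of its letter probabilities, which is an assumption about how the expected number is computed; the symbol $\lambda$ for that expected count and the function name `poisson_expectation_curve` are likewise chosen for illustration. The sketch enumerates all $4^K$ tuples, so it is only practical for small K.

```python
import math
from itertools import product

def poisson_expectation_curve(probs, K, N, n_max):
    """Approximate E(X_n) for n = 0..n_max by summing Poisson probabilities.

    probs maps each letter to a positive probability; N is the sequence length.
    """
    positions = N - K + 1                     # assumed number of K-tuple positions
    curve = [0.0] * (n_max + 1)
    for tup in product("acgt", repeat=K):
        lam = positions * math.prod(probs[ch] for ch in tup)
        for n in range(n_max + 1):
            # Poisson probability computed in log space for numerical stability
            curve[n] += math.exp(-lam + n * math.log(lam) - math.lgamma(n + 1))
    return curve
```

When `n_max` is large enough to cover the Poisson tails, the values of the curve add up to the number of K-tuple types, $4^K$.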
Comparison of the Poisson approximation with the K-histogram for U. urealyticum
Comparison of the Poisson approximation with the 7-histogram for Haemophilus influenzae
Comparison of the Poisson approximation with the 8-histogram for Haemophilus influenzae
A comparison of the Poisson approximation with the Monte Carlo method
In this computation the model is niid, with parameters taken from the randomized sequence of H. influenzae.
5. Analysis of the Mechanism of Multi-Modal K-Histograms
An example for H. influenzae. The length of its genome is 1830023. Under the simplified conditions $p_a = p_t = 0.30925$, $p_c = p_g = 0.19075$, for $K = 8$ there are only 9 different values of the expected count $\lambda_k$, where $k$ is the number of letters c or g in the 8-tuple, as shown in the following list.
$k$           0      1     2     3     4     5     6    7    8
$\lambda_k$   153.1  94.4  58.2  35.9  22.2  13.7  8.4  5.2  3.2
The following figure shows the nine individual probability functions and their sum.
Notice the effect of the ratio of successive modes:
$$\frac{\lambda_{k+1}}{\lambda_k} = \frac{0.19075}{0.30925} = 0.616815 \quad \text{for } k = 0, 1, \ldots, 7.$$
For E. coli the ratio is 0.968931, hence the result is quite different.
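The nine expected counts in the list above can be reproduced with a few lines of Python. The sketch assumes the expected count of an 8-tuple is (N - K + 1) times the product of its letter probabilities and that the index k counts the letters c or g in the tuple; the names `lambdas`, `multiplicity` and `expected_count` are hypothetical.

```python
import math

N, K = 1830023, 8                 # genome length of H. influenzae, tuple length
p_at, p_cg = 0.30925, 0.19075     # simplified letter probabilities

# Expected count of an 8-tuple containing exactly k letters from {c, g}.
lambdas = [(N - K + 1) * p_cg ** k * p_at ** (K - k) for k in range(K + 1)]

# Number of distinct 8-tuples with exactly k letters from {c, g}:
# choose the k positions, then 2 letter choices at each of the 8 positions.
multiplicity = [math.comb(K, k) * 2 ** K for k in range(K + 1)]

def expected_count(n):
    """Poisson-approximated number of 8-tuples occurring exactly n times."""
    return sum(m * math.exp(-lam + n * math.log(lam) - math.lgamma(n + 1))
               for m, lam in zip(multiplicity, lambdas))

ratio = lambdas[1] / lambdas[0]   # equals p_cg / p_at
```

Each term of the sum in `expected_count` is one of the nine individual curves; because successive expected counts differ by the factor 0.616815, the individual peaks are well separated and the sum is multi-modal.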
6. Analysis of Short-Range Correlation by K-Histograms
Two 8-histograms for E. coli: the left one is from its genome, and the right one is from its Markov model of order 1.
Compare the 8-histograms of Markov models of order 2 to 7 for E. coli.
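One way to obtain a Markov-model counterpart of a sequence is to estimate order-n transition counts from the sequence itself and then sample a new sequence from them. The Python sketch below assumes this estimation scheme, since the slides do not state how the models were fitted; the function names, the random choice of starting context, and the restart rule for a context with no observed follower are all assumptions. It requires order >= 1.

```python
import random
from collections import Counter, defaultdict

def fit_markov(seq, order):
    """Count, for each context of `order` letters, the letters that follow it."""
    table = defaultdict(Counter)
    for i in range(len(seq) - order):
        table[seq[i:i + order]][seq[i + order]] += 1
    return table

def generate(table, order, length, rng):
    """Sample a sequence of the given length from the fitted transition counts."""
    out = list(rng.choice(list(table)))       # start from a random seen context
    while len(out) < length:
        followers = table.get("".join(out[-order:]))
        if followers:
            letters, weights = zip(*followers.items())
            out.append(rng.choices(letters, weights=weights)[0])
        else:
            # Context never seen with a follower: restart from a seen context.
            out.extend(rng.choice(list(table)))
    return "".join(out[:length])
```

A K-histogram of the generated sequence can then be compared with that of the original sequence, as in the figures above.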
Using a Markov model of order 5 and the Monte Carlo method to compare the 8-histogram of E. coli's complete genome sequence with the ex and sd of MM5.
The red curve is the expectation curve estimated from 50 simulations.
This is the ratio curve $|X_n - \mathrm{ex}| / \mathrm{sd}$ for $n = 0, \ldots, 200$.
Reference:
Huimin Xie, Bailin Hao, “Visualization of K-tuple distribution in prokaryote complete genomes and their randomized counterparts”, CSB2002: IEEE Computer Systems Bioinformatics Conference Proceedings, IEEE Computer Society, Los Alamitos, 2002, 31-42.
7. Discussion
Most of the results shown above are of an experimental nature; many problems are left for future study.
How should the value of K be selected?
How can 1D visualization be applied to proteins?
What are the properties of the random variables $X_n$?
How can the expectation curve be computed exactly for the niid and MMn models?
Why is the Poisson approximation effective without considering the overlap of K-tuples?