Upload
betty-osborne
View
216
Download
0
Embed Size (px)
Citation preview
Motif Finding
Kun-Pin Wu ( 巫坤品 )Institute of Biomedical InformaticsNational Yang Ming University2007/11/20
www.bioalgorithms.infoAn Introduction to Bioinformatics Algorithms
Finding Regulatory Motifs in DNA Sequences
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Random Sample
atgaccgggatactgataccgtatttggcctaggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg
acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatactgggcataaggtaca
tgagtatccctgggatgacttttgggaacactatagtgctctcccgatttttgaatatgtaggatcattcgccagggtccga
gctgagaattggatgaccttgtaagtgttttccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga
tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatggcccacttagtccacttatag
gtcaatcatgttcttgtgaatggatttttaactgagggcatagaccgcttggcgcacccaaattcagtgtgggcgagcgcaa
cggttttggcccttgttagaggcccccgtactgatggaaactttcaattatgagagagctaatctatcgcgtgcgtgttcat
aacttgagttggtttcgaaaatgctctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta
ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatttcaacgtatgccgaaccgaaagggaag
ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttctgggtactgatagca
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Implanting Motif AAAAAAAGGGGGGG
atgaccgggatactgatAAAAAAAAGGGGGGGggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg
acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaataAAAAAAAAGGGGGGGa
tgagtatccctgggatgacttAAAAAAAAGGGGGGGtgctctcccgatttttgaatatgtaggatcattcgccagggtccga
gctgagaattggatgAAAAAAAAGGGGGGGtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga
tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatAAAAAAAAGGGGGGGcttatag
gtcaatcatgttcttgtgaatggatttAAAAAAAAGGGGGGGgaccgcttggcgcacccaaattcagtgtgggcgagcgcaa
cggttttggcccttgttagaggcccccgtAAAAAAAAGGGGGGGcaattatgagagagctaatctatcgcgtgcgtgttcat
aacttgagttAAAAAAAAGGGGGGGctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta
ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatAAAAAAAAGGGGGGGaccgaaagggaag
ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttAAAAAAAAGGGGGGGa
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Where is the Implanted Motif?
atgaccgggatactgataaaaaaaagggggggggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg
acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaataaaaaaaaaggggggga
tgagtatccctgggatgacttaaaaaaaagggggggtgctctcccgatttttgaatatgtaggatcattcgccagggtccga
gctgagaattggatgaaaaaaaagggggggtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga
tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaataaaaaaaagggggggcttatag
gtcaatcatgttcttgtgaatggatttaaaaaaaaggggggggaccgcttggcgcacccaaattcagtgtgggcgagcgcaa
cggttttggcccttgttagaggcccccgtaaaaaaaagggggggcaattatgagagagctaatctatcgcgtgcgtgttcat
aacttgagttaaaaaaaagggggggctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta
ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcataaaaaaaagggggggaccgaaagggaag
ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttaaaaaaaaggggggga
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Implanting Motif AAAAAAGGGGGGG with Four Mutations
atgaccgggatactgatAgAAgAAAGGttGGGggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg
acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatacAAtAAAAcGGcGGGa
tgagtatccctgggatgacttAAAAtAAtGGaGtGGtgctctcccgatttttgaatatgtaggatcattcgccagggtccga
gctgagaattggatgcAAAAAAAGGGattGtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga
tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatAtAAtAAAGGaaGGGcttatag
gtcaatcatgttcttgtgaatggatttAAcAAtAAGGGctGGgaccgcttggcgcacccaaattcagtgtgggcgagcgcaa
cggttttggcccttgttagaggcccccgtAtAAAcAAGGaGGGccaattatgagagagctaatctatcgcgtgcgtgttcat
aacttgagttAAAAAAtAGGGaGccctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta
ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatActAAAAAGGaGcGGaccgaaagggaag
ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttActAAAAAGGaGcGGa
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Where is the Motif???
atgaccgggatactgatagaagaaaggttgggggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg
acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatacaataaaacggcggga
tgagtatccctgggatgacttaaaataatggagtggtgctctcccgatttttgaatatgtaggatcattcgccagggtccga
gctgagaattggatgcaaaaaaagggattgtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga
tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatataataaaggaagggcttatag
gtcaatcatgttcttgtgaatggatttaacaataagggctgggaccgcttggcgcacccaaattcagtgtgggcgagcgcaa
cggttttggcccttgttagaggcccccgtataaacaaggagggccaattatgagagagctaatctatcgcgtgcgtgttcat
aacttgagttaaaaaatagggagccctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta
ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatactaaaaaggagcggaccgaaagggaag
ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttactaaaaaggagcgga
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Why Finding (15,4) Motif is Difficult?
atgaccgggatactgatAgAAgAAAGGttGGGggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg
acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatacAAtAAAAcGGcGGGa
tgagtatccctgggatgacttAAAAtAAtGGaGtGGtgctctcccgatttttgaatatgtaggatcattcgccagggtccga
gctgagaattggatgcAAAAAAAGGGattGtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga
tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatAtAAtAAAGGaaGGGcttatag
gtcaatcatgttcttgtgaatggatttAAcAAtAAGGGctGGgaccgcttggcgcacccaaattcagtgtgggcgagcgcaa
cggttttggcccttgttagaggcccccgtAtAAAcAAGGaGGGccaattatgagagagctaatctatcgcgtgcgtgttcat
aacttgagttAAAAAAtAGGGaGccctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta
ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatActAAAAAGGaGcGGaccgaaagggaag
ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttActAAAAAGGaGcGGa
AgAAgAAAGGttGGG
cAAtAAAAcGGcGGG..|..|||.|..|||
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Challenge Problem
• Find a motif in a sample of
- 20 “random” sequences (e.g. 600 nt long)
- each sequence containing an implanted
pattern of length 15,
- each pattern appearing with 4 mismatches
as (15,4)-motif.
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Combinatorial Gene Regulation• A microarray experiment showed that when
gene X is knocked out, 20 other genes are not expressed
• How can one gene have such drastic effects?
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Regulatory Proteins• Gene X encodes regulatory protein, a.k.a. a
transcription factor (TF)
• The 20 unexpressed genes rely on gene X’s TF to induce transcription
• A single TF may regulate multiple genes
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Regulatory Regions• Every gene contains a regulatory region (RR) typically st
retching 100-1000 bp upstream of the transcriptional start site
• Located within the RR are the Transcription Factor Binding Sites (TFBS), also known as motifs, specific for a given transcription factor
• TFs influence gene expression by binding to a specific location in the respective gene’s regulatory region - TFBS
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Transcription Factor Binding Sites
• A TFBS can be located anywhere within the
Regulatory Region
• TFBS may vary slightly across different regulatory regions since non-essential bases could mutate
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Motifs and Transcriptional Start Sites
geneATCCCG
geneTTCCGG
geneATCCCG
geneATGCCG
geneATGCCC
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Transcription Factors and Motifs
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Motif Logo
• Motifs can mutate on non important bases
• The five motifs in five different genes have mutations in position 3 and 5
• Representations called motif logos illustrate the conserved and variable regions of a motif
TGGGGGATGAGAGATGGGGGATGAGAGATGAGGGA
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Motif Logos: An Example
(http://www-lmmb.ncifcrf.gov/~toms/sequencelogo.html)
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Identifying Motifs• Genes are turned on or off by regulatory proteins
• These proteins bind to upstream regulatory regions of genes to either attract or block an RNA polymerase
• Regulatory protein (TF) binds to a short DNA sequence called a motif (TFBS)
• So finding the same motif in multiple genes’ regulatory regions suggests a regulatory relationship amongst those genes
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Identifying Motifs: Complications
• We do not know the motif sequence
• We do not know where it is located relative to the genes start
• Motifs can differ slightly from one gene to the next
• How to discern it from “random” motifs?
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
The Motif Finding Problem• Given a random sample of DNA sequences:
cctgatagacgctatctggctatccacgtacgtaggtcctctgtgcgaatctatgcgtttccaaccat
agtactggtgtacatttgatacgtacgtacaccggcaacctgaaacaaacgctcagaaccagaagtgc
aaacgtacgtgcaccctctttcttcgtggctctggccaacgagggctgatgtataagacgaaaatttt
agcctccgatgtaagtcatagctgtaactattacctgccacccctattacatcttacgtacgtataca
ctgttatacaacgcgtcatggcggggtatgcgttttggtcgtcgtacgctcgatcgttaacgtacgtc
• Find the pattern that is implanted in each of the individual sequences, namely, the motif
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
The Motif Finding Problem (cont’d)• Additional information:
• The hidden sequence is of length 8
• The pattern is not exactly the same in each array because random point mutations may occur in the sequences
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
The Motif Finding Problem (cont’d)• The patterns revealed with no mutations:cctgatagacgctatctggctatccacgtacgtaggtcctctgtgcgaatctatgcgtttccaaccat
agtactggtgtacatttgatacgtacgtacaccggcaacctgaaacaaacgctcagaaccagaagtgc
aaacgtacgtgcaccctctttcttcgtggctctggccaacgagggctgatgtataagacgaaaatttt
agcctccgatgtaagtcatagctgtaactattacctgccacccctattacatcttacgtacgtataca
ctgttatacaacgcgtcatggcggggtatgcgttttggtcgtcgtacgctcgatcgttaacgtacgtc
acgtacgtConsensus String
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
The Motif Finding Problem (cont’d)• The patterns with 2 point mutations:
cctgatagacgctatctggctatccaGgtacTtaggtcctctgtgcgaatctatgcgtttccaaccat
agtactggtgtacatttgatCcAtacgtacaccggcaacctgaaacaaacgctcagaaccagaagtgc
aaacgtTAgtgcaccctctttcttcgtggctctggccaacgagggctgatgtataagacgaaaatttt
agcctccgatgtaagtcatagctgtaactattacctgccacccctattacatcttacgtCcAtataca
ctgttatacaacgcgtcatggcggggtatgcgttttggtcgtcgtacgctcgatcgttaCcgtacgGc
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
The Motif Finding Problem (cont’d)• The patterns with 2 point mutations:
cctgatagacgctatctggctatccaGgtacTtaggtcctctgtgcgaatctatgcgtttccaaccat
agtactggtgtacatttgatCcAtacgtacaccggcaacctgaaacaaacgctcagaaccagaagtgc
aaacgtTAgtgcaccctctttcttcgtggctctggccaacgagggctgatgtataagacgaaaatttt
agcctccgatgtaagtcatagctgtaactattacctgccacccctattacatcttacgtCcAtataca
ctgttatacaacgcgtcatggcggggtatgcgttttggtcgtcgtacgctcgatcgttaCcgtacgGc
Can we still find the motif, now that we have 2 mutations?
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Defining Motifs
• To define a motif, lets say we know where the motif starts in the sequence
• The motif start positions in their sequences can be represented as s = (s1,s2,s3,…,st)
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Motifs: Profiles and Consensus a G g t a c T t C c A t a c g tAlignment a c g t T A g t a c g t C c A t C c g t a c g G
_________________
A 3 0 1 0 3 1 1 0Profile C 2 4 0 0 1 4 0 0 G 0 1 4 0 0 0 3 1 T 0 0 0 5 1 0 1 4
_________________
Consensus A C G T A C G T
• Line up the patterns by their start indexes
s = (s1, s2, …, st)
• Construct matrix profile with frequencies of each nucleotide in columns
• Consensus nucleotide in each position has the highest score in column
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Consensus
• Think of consensus as an “ancestor” motif, from which mutated motifs emerged
• The distance between a real motif and the consensus sequence is generally less than that for two real motifs
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Consensus (cont’d)
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Evaluating Motifs
• We have a guess about the consensus sequence, but how “good” is this consensus?
• Need to introduce a scoring function to compare different guesses and choose the “best” one.
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Defining Some Terms
• t - number of sample DNA sequences• n - length of each DNA sequence• DNA - sample of DNA sequences (t x n array)
• l - length of the motif (l-mer)• si - starting position of an l-mer in sequence i
• s=(s1, s2,… st) - array of motif’s starting positions
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Parameters
cctgatagacgctatctggctatccaGgtacTtaggtcctctgtgcgaatctatgcgtttccaaccat
agtactggtgtacatttgatCcAtacgtacaccggcaacctgaaacaaacgctcagaaccagaagtgc
aaacgtTAgtgcaccctctttcttcgtggctctggccaacgagggctgatgtataagacgaaaatttt
agcctccgatgtaagtcatagctgtaactattacctgccacccctattacatcttacgtCcAtataca
ctgttatacaacgcgtcatggcggggtatgcgttttggtcgtcgtacgctcgatcgttaCcgtacgGc
l = 8
t=5
s1 = 26 s2 = 21 s3= 3 s4 = 56 s5 = 60 s
DNA
n = 69
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Scoring Motifs
• Given s = (s1, … st) and DNA:
Score(s,DNA) =
a G g t a c T t C c A t a c g t a c g t T A g t a c g t C c A t C c g t a c g G _________________ A 3 0 1 0 3 1 1 0 C 2 4 0 0 1 4 0 0 G 0 1 4 0 0 0 3 1 T 0 0 0 5 1 0 1 4 _________________
Consensus a c g t a c g t
Score 3+4+4+5+3+4+3+4=30
l
t
l
i GCTAk
ikcount1 },,,{
),(max
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
The Motif Finding Problem
• If starting positions s=(s1, s2,… st) are given, finding consensus is easy even with mutations in the sequences because we can simply construct the profile to find the motif (consensus)
• But… the starting positions s are usually not given. How can we find the “best” profile matrix?
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
The Motif Finding Problem: Formulation
• Goal: Given a set of DNA sequences, find a set of l-mers, one from each sequence, that maximizes the consensus score
• Input: A t x n matrix of DNA, and l, the length of the pattern to find
• Output: An array of t starting positions s = (s1, s2, … st) maximizing Score(s,DNA)
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
The Motif Finding Problem: Brute Force Solution
• Compute the scores for each possible combination of starting positions s
• The best score will determine the best profile and the consensus pattern in DNA
• The goal is to maximize Score(s,DNA) by varying the starting positions si, where:
si = [1, …, n-l+1]i = [1, …, t]
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Running Time of BruteForceMotifSearch• Varying (n - l + 1) positions in each of t sequen
ces, we’re looking at (n - l + 1)t sets of starting positions
• For each set of starting positions, the scoring function makes l operations, so complexity is l (n – l + 1)t = O(l nt)
• That means that for t = 8, n = 1000, l = 10 we must perform approximately 1020 computations – it will take billions years
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
The Median String Problem
• Given a set of t DNA sequences find a pattern that appears in all t sequences with the minimum number of mutations
• This pattern will be the motif
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Hamming Distance
• Hamming distance: • dH(v,w) is the number of nucleotide pairs th
at do not match when v and w are aligned. For example:
dH(AAAAAA,ACAAAC) = 2
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Total Distance: Example
• Given v = “acgtacgt” and s acgtacgt
cctgatagacgctatctggctatccacgtacAtaggtcctctgtgcgaatctatgcgtttccaaccat acgtacgtagtactggtgtacatttgatacgtacgtacaccggcaacctgaaacaaacgctcagaaccagaagtgc acgtacgtaaaAgtCcgtgcaccctctttcttcgtggctctggccaacgagggctgatgtataagacgaaaatttt acgtacgtagcctccgatgtaagtcatagctgtaactattacctgccacccctattacatcttacgtacgtataca acgtacgtctgttatacaacgcgtcatggcggggtatgcgttttggtcgtcgtacgctcgatcgttaacgtaGgtc
v is the sequence in red, x is the sequence in blue
• TotalDistance(v,DNA) = 1+0+2+0+1 = 4
dH(v, x) = 2
dH(v, x) = 1
dH(v, x) = 0
dH(v, x) = 0
dH(v, x) = 1
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Total Distance: Definition
• For each DNA sequence i, compute all dH(v, x), where x is an l-mer with starting position si
(1 < si < n – l + 1)• Find minimum of dH(v, x) among all l-mers in sequen
ce i• TotalDistance(v,DNA) is the sum of the minimum Ha
mming distances for each DNA sequence i• TotalDistance(v,DNA) = mins dH(v, s), where s is the
set of starting positions s1, s2,… st
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
The Median String Problem: Formulation• Goal: Given a set of DNA sequences, find a
median string• Input: A t x n matrix DNA, and l, the length of
the pattern to find• Output: A string v of l nucleotides that minimi
zes TotalDistance(v,DNA) over all strings of that length
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Motif Finding Problem == Median String Problem
• The Motif Finding is a maximization problem while Median String is a minimization problem
• However, the Motif Finding problem and Median String problem are computationally equivalent
• Need to show that minimizing TotalDistance is equivalent to maximizing Score
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
We are looking for the same thing
a G g t a c T t C c A t a c g tAlignment a c g t T A g t a c g t C c A t C c g t a c g G _________________ A 3 0 1 0 3 1 1 0Profile C 2 4 0 0 1 4 0 0 G 0 1 4 0 0 0 3 1 T 0 0 0 5 1 0 1 4 _________________
Consensus a c g t a c g t
Score 3+4+4+5+3+4+3+4
TotalDistance 2+1+1+0+2+1+2+1
Sum 5 5 5 5 5 5 5 5
• At any column iScorei + TotalDistancei = t
• Because there are l columns Score + TotalDistance = l * t
• Rearranging:Score = l * t - TotalDistance
• l * t is constant the minimization of the right side is equivalent to the maximization of the left side
l
t
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Motif Finding Problem vs. Median String Problem• Why bother reformulating the Motif Finding
problem into the Median String problem?
• The Motif Finding Problem needs to examine all the combinations for s. That is (n - l + 1)t combinations!!!
• The Median String Problem needs to examine all 4l combinations for v. This number is relatively smaller
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Motif Finding: Improving the Running Time
1. BruteForceMotifSearch(DNA, t, n, l)2. bestScore 03. for each s=(s1,s2 , . . ., st) from (1,1 . . . 1) to (n-l+1, . . ., n-l+1)4. if (Score(s,DNA) > bestScore)5. bestScore Score(s, DNA)6. bestMotif (s1,s2 , . . . , st) 7. return bestMotif
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Structuring the Search
• How can we perform the line
for each s=(s1,s2 , . . ., st) from (1,1 . . . 1) to (n-l+1, . . ., n-l+1) ?
• We need a method for efficiently structuring and navigating the many possible motifs
• This is not very different than exploring all t-digit numbers
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Median String: Improving the Running Time1. MedianStringSearch (DNA, t, n, l)2. bestWord AAA…A
3. bestDistance ∞
4. for each l-mer s from AAA…A to TTT…T if TotalDistance(s,DNA) < bestDistance
5. bestDistanceTotalDistance(s,DNA)
6. bestWord s
7. return bestWord
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Structuring the Search
• For the Median String Problem we need to consider all 4l possible l-mers:
aa… aaaa… acaa… agaa… at
.
.tt… tt
How to organize this search?
l
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Alternative Representation of the Search Space
• Let A = 1, C = 2, G = 3, T = 4• Then the sequences from AA…A to TT…T become:
11…1111…1211…1311…14
.
.44…44
• Notice that the sequences above simply list all numbers as if we were counting on base 4 without using 0 as a digit
l
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Search Tree
a- c- g- t-
aa ac ag at ca cc cg ct ga gc gg gt ta tc tg tt
--
root
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Three Operations
• NextLeaf• NextVertex• ByPass
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
NextLeaf: Example• Moving to the next leaf:
1- 2- 3- 4-
11 12 13 14 21 22 23 24 31 32 33 34 41 42 43 44
--Next Location
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Depth First Search
• So we can search leaves
• How about searching all vertices of the tree?
• We can do this with a depth first search• Preorder traversal
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
NextVertex: Example• Moving to the next vertices:
1- 2- 3- 4-
11 12 13 14 21 22 23 24 31 32 33 34 41 42 43 44
--
Location after 5 next vertex moves
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
ByPass: Example• Bypassing the descendants of “2-”:
1- 2- 3- 4-
11 12 13 14 21 22 23 24 31 32 33 34 41 42 43 44
--Next Location
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Revisiting Brute Force Search
• Now that we have method for navigating the tree, lets look again at BruteForceMotifSearch• NextLeaf• NextVertex• ByPass
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Can We Do Better?
• Sets of s=(s1, s2, …,st) may have a weak profile for the first i positions (s1, s2, …,si)
• Every row of alignment may add at most l to Score• Optimism: if all subsequent (t-i) positions (si+1, …st) a
dd (t – i ) * l to Score(s,i,DNA)
• If Score(s,i,DNA) + (t – i ) * l < BestScore, it makes no sense to search in vertices of the current subtree• Use ByPass()
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Branch and Bound Algorithm for Motif Search
• Since each level of the tree goes deeper into search, discarding a prefix discards all following branches
• This saves us from looking at (n – l + 1)t-i leaves• Use NextVertex() and ByPas
s() to navigate the tree
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Branch and Bound Applied to Median String Search• Note that if the total distance for a prefix is gr
eater than that for the best word so far:
TotalDistance (prefix, DNA) > BestDistance
there is no use exploring the remaining part of the word
• We can eliminate that branch and BYPASS exploring that branch further
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
More on the Motif Problem
• Exhaustive Search and Median String are both exact algorithms
• They always find the optimal solution, though they may be too slow to perform practical tasks
• Many algorithms sacrifice optimal solution for speed
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
CONSENSUS: Greedy Motif Search• Find two closest l-mers in sequences 1 and 2 and forms 2 x l alignment matrix with Score(s,2,DNA)• At each of the following t-2 iterations CONSENSUS finds a “best”
l-mer in sequence i from the perspective of the already constructed (i-1) x l alignment matrix for the first (i-1) sequences
• In other words, it finds an l-mer in sequence i maximizing Score(s,i,DNA)
under the assumption that the first (i-1) l-mers have been already chosen
• CONSENSUS sacrifices optimal solution for speed: in fact the bulk of the time is actually spent locating the first 2 l-mers
www.bioalgorithms.infoAn Introduction to Bioinformatics Algorithms
Randomized Motif Finding
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Randomized Algorithms
• Randomized algorithms make random rather than deterministic decisions.
• The main advantage is that no input can reliably produce worst-case results because the algorithm runs differently each time.
• These algorithms are commonly used in situations where no exact and fast algorithm is known.
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
A New Motif Finding Approach
• Motif Finding Problem: Given a list of t sequences each of length n, find the “best” pattern of length l that appears in each of the t sequences.
• Previously: we solved the Motif Finding Problem using a Branch and Bound or a Greedy technique.
• Now: randomly select possible locations and find a way to greedily change those locations until we have converged to the hidden motif.
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Profiles Revisited
• Let s=(s1,...,st) be the set of starting positions for l-mers in our t sequences.
• The substrings corresponding to these starting positions will form:
- t x l alignment matrix and - 4 x l profile matrix* P.
*We make a special note that the profile matrix will be defined in terms of the frequency of letters, and not as the count of letters.
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
• Prob(a|P) is defined as the probability that an l-mer a was created by the Profile P.
• If a is very similar to the consensus string of P then Prob(a|P) will be high
• If a is very different, then Prob(a|P) will be low.
n
Prob(a|P) =Π pai , i
i=1
Scoring Strings with a Profile
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Scoring Strings with a Profile (cont’d)Given a profile: P =
A 1/2 7/8 3/8 0 1/8 0
C 1/8 0 1/2 5/8 3/8 0
T 1/8 1/8 0 0 1/4 7/8
G 1/4 0 1/8 3/8 1/4 1/8
Prob(aaacct|P) = ??? The probability of the consensus string:
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Scoring Strings with a Profile (cont’d)Given a profile: P =
A 1/2 7/8 3/8 0 1/8 0
C 1/8 0 1/2 5/8 3/8 0
T 1/8 1/8 0 0 1/4 7/8
G 1/4 0 1/8 3/8 1/4 1/8
Prob(aaacct|P) = 1/2 x 7/8 x 3/8 x 5/8 x 3/8 x 7/8 = .033646The probability of the consensus string:
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Scoring Strings with a Profile (cont’d)Given a profile: P =
A 1/2 7/8 3/8 0 1/8 0
C 1/8 0 1/2 5/8 3/8 0
T 1/8 1/8 0 0 1/4 7/8
G 1/4 0 1/8 3/8 1/4 1/8
Prob(atacag|P) = 1/2 x 1/8 x 3/8 x 5/8 x 1/8 x 1/8 = .001602
Prob(aaacct|P) = 1/2 x 7/8 x 3/8 x 5/8 x 3/8 x 7/8 = .033646The probability of the consensus string:
Probability of a different string:
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
P-Most Probable l-mer• Define the P-most probable l-mer from a sequence
as an l-mer in that sequence which has the highest probability of being created from the profile P.
A 1/2 7/8 3/8 0 1/8 0
C 1/8 0 1/2 5/8 3/8 0
T 1/8 1/8 0 0 1/4 7/8
G 1/4 0 1/8 3/8 1/4 1/8
P =
Given a sequence = ctataaaccttacatc, find the P-most probable l-mer
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Third try: c t a t a a a c c t t a c a t c
Second try: c t a t a a a c c t t a c a t c
First try: c t a t a a a c c t t a c a t c
P-Most Probable l-mer (cont’d)
A 1/2 7/8 3/8 0 1/8 0
C 1/8 0 1/2 5/8 3/8 0
T 1/8 1/8 0 0 1/4 7/8
G 1/4 0 1/8 3/8 1/4 1/8
Find the Prob(a|P) of every possible 6-mer:
-Continue this process to evaluate every possible 6-mer
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
P-Most Probable l-mer (cont’d)
String, Highlighted in Red Calculations prob(a|P)
ctataaaccttacat 1/8 x 1/8 x 3/8 x 0 x 1/8 x 0 0
ctataaaccttacat 1/2 x 7/8 x 0 x 0 x 1/8 x 0 0
ctataaaccttacat 1/2 x 1/8 x 3/8 x 0 x 1/8 x 0 0
ctataaaccttacat 1/8 x 7/8 x 3/8 x 0 x 3/8 x 0 0
ctataaaccttacat 1/2 x 7/8 x 3/8 x 5/8 x 3/8 x 7/8 .0336
ctataaaccttacat 1/2 x 7/8 x 1/2 x 5/8 x 1/4 x 7/8 .0299
ctataaaccttacat 1/2 x 0 x 1/2 x 0 1/4 x 0 0
ctataaaccttacat 1/8 x 0 x 0 x 0 x 0 x 1/8 x 0 0
ctataaaccttacat 1/8 x 1/8 x 0 x 0 x 3/8 x 0 0
ctataaaccttacat 1/8 x 1/8 x 3/8 x 5/8 x 1/8 x 7/8 .0004
Compute prob(a|P) for every possible 6-mer:
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
P-Most Probable l-mer (cont’d)
String, Highlighted in Red Calculations Prob(a|P)
ctataaaccttacat 1/8 x 1/8 x 3/8 x 0 x 1/8 x 0 0
ctataaaccttacat 1/2 x 7/8 x 0 x 0 x 1/8 x 0 0
ctataaaccttacat 1/2 x 1/8 x 3/8 x 0 x 1/8 x 0 0
ctataaaccttacat 1/8 x 7/8 x 3/8 x 0 x 3/8 x 0 0
ctataaaccttacat 1/2 x 7/8 x 3/8 x 5/8 x 3/8 x 7/8 .0336
ctataaaccttacat 1/2 x 7/8 x 1/2 x 5/8 x 1/4 x 7/8 .0299
ctataaaccttacat 1/2 x 0 x 1/2 x 0 1/4 x 0 0
ctataaaccttacat 1/8 x 0 x 0 x 0 x 0 x 1/8 x 0 0
ctataaaccttacat 1/8 x 1/8 x 0 x 0 x 3/8 x 0 0
ctataaaccttacat 1/8 x 1/8 x 3/8 x 5/8 x 1/8 x 7/8 .0004
P-Most Probable 6-mer in the sequence is aaacct:
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
P-Most Probable l-mer (cont’d)
ctataaaccttacatc
because Prob(aaacct|P) = .0336 is greater than the Prob(a|P) of any other 6-mer in the sequence.
aaacct is the P-most probable 6-mer in:
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Dealing with Zeroes
• In our toy example prob(a|P)=0 in many cases. In practice, there will be enough sequences so that the number of elements in the profile with a frequency of zero is small.
• To avoid many entries with prob(a|P)=0, there exist techniques to equate zero to a very small number so that one zero does not make the entire probability of a string zero (we will not address these techniques here).
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
P-Most Probable l-mers in Many Sequences• Find the P-most probab
le l-mer in each of the sequences.
ctataaacgttacatc
atagcgattcgactg
cagcccagaaccct
cggtataccttacatc
tgcattcaatagctta
tatcctttccactcac
ctccaaatcctttaca
ggtcatcctttatcct
A 1/2 7/8 3/8 0 1/8 0
C 1/8 0 1/2 5/8 3/8 0
T 1/8 1/8 0 0 1/4 7/8
G 1/4 0 1/8 3/8 1/4 1/8
P=
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
P-Most Probable l-mers in Many Sequences (cont’d) ctataaacgttacatc
atagcgattcgactg
cagcccagaaccct
cggtgaaccttacatc
tgcattcaatagctta
tgtcctgtccactcac
ctccaaatcctttaca
ggtctacctttatcct
P-Most Probable l-mers form a new profile
1 a a a c g t
2 a t a g c g
3 a a c c c t
4 g a a c c t
5 a t a g c t
6 g a c c t g
7 a t c c t t
8 t a c c t t
A 5/8 5/8 4/8 0 0 0
C 0 0 4/8 6/8 4/8 0
T 1/8 3/8 0 0 3/8 6/8
G 2/8 0 0 2/8 1/8 2/8
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Comparing New and Old Profiles
Red – frequency increased, Blue – frequency descreased
1 a a a c g t
2 a t a g c g
3 a a c c c t
4 g a a c c t
5 a t a g c t
6 g a c c t g
7 a t c c t t
8 t a c c t t
A 5/8 5/8 4/8 0 0 0
C 0 0 4/8 6/8 4/8 0
T 1/8 3/8 0 0 3/8 6/8
G 2/8 0 0 2/8 1/8 2/8
A 1/2 7/8 3/8 0 1/8 0
C 1/8 0 1/2 5/8 3/8 0
T 1/8 1/8 0 0 1/4 7/8
G 1/4 0 1/8 3/8 1/4 1/8
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Greedy Profile Motif SearchUse P-Most probable l-mers to adjust start positions u
ntil we reach a “best” profile; this is the motif.
1) Select random starting positions.2) Create a profile P from the substrings at these start
ing positions.3) Find the P-most probable l-mer a in each sequence
and change the starting position to the starting position of a.
4) Compute a new profile based on the new starting positions after each iteration and proceed until we cannot increase the score anymore.
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
GreedyProfileMotifSearch Analysis
• Since we choose starting positions randomly, there is little chance that our guess will be close to an optimal motif, meaning it will take a very long time to find the optimal motif.
• It is unlikely that the random starting positions will lead us to the correct solution at all.
• In practice, this algorithm is run many times with the hope that random starting positions will be close to the optimum solution simply by chance.
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Gibbs Sampling
• GreedyProfileMotifSearch is probably not the best way to find motifs.
• However, we can improve the algorithm by introducing Gibbs Sampling, an iterative procedure that discards one l-mer after each iteration and replaces it with a new one.
• Gibbs Sampling proceeds more slowly and chooses new l-mers at random increasing the odds that it will converge to the correct solution.
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
How Gibbs Sampling Works1) Randomly choose starting positions
s = (s1,...,st) and form the set of l-mers associated with these starting positions. 2) Randomly choose one of the t sequences.
3) Create a profile P from the other t -1 sequences.4) For each position in the removed sequence, calculate the probability that the l-mer starting at that position was generated by P.5) Choose a new starting position for the removed
sequence at random based on the probabilities calculated in step 4.
6) Repeat steps 2-5 until there is no improvement
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Gibbs Sampling: an Example
Input:
t = 5 sequences, motif length l = 8
1. GTAAACAATATTTATAGC2. AAAATTTACCTCGCAAGG
3. CCGTACTGTCAAGCGTGG 4. TGAGTAAACGACGTCCCA
5. TACTTAACACCCTGTCAA
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Gibbs Sampling: an Example
1) Randomly choose starting positions, s=(s1,s2,s3,s4,s5) in the 5 sequences:
s1=7 GTAAACAATATTTATAGC
s2=11 AAAATTTACCTTAGAAGG
s3=9 CCGTACTGTCAAGCGTGG
s4=4 TGAGTAAACGACGTCCCA
s5=1 TACTTAACACCCTGTCAA
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Gibbs Sampling: an Example
2) Choose one of the sequences at random:Sequence 2: AAAATTTACCTTAGAAGG
s1=7 GTAAACAATATTTATAGC
s2=11 AAAATTTACCTTAGAAGG
s3=9 CCGTACTGTCAAGCGTGG
s4=4 TGAGTAAACGACGTCCCA
s5=1 TACTTAACACCCTGTCAA
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Gibbs Sampling: an Example
2) Choose one of the sequences at random:Sequence 2: AAAATTTACCTTAGAAGG
s1=7 GTAAACAATATTTATAGC
s3=9 CCGTACTGTCAAGCGTGG
s4=4 TGAGTAAACGACGTCCCA
s5=1 TACTTAACACCCTGTCAA
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Gibbs Sampling: an Example3) Create profile P from l-mers in remaining 4 seq
uences:1 A A T A T T T A
3 T C A A G C G T
4 G T A A A C G A
5 T A C T T A A C
A 1/4 2/4 2/4 3/4 1/4 1/4 1/4 2/4
C 0 1/4 1/4 0 0 2/4 0 1/4
T 2/4 1/4 1/4 1/4 2/4 1/4 1/4 1/4
G 1/4 0 0 0 1/4 0 3/4 0Consensus
StringT A A A T C G A
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Gibbs Sampling: an Example4) Calculate the prob(a|P) for every possible 8-
mer in the removed sequence:
Strings Highlighted in Red prob(a|P)
AAAATTTACCTTAGAAGG .000732AAAATTTACCTTAGAAGG .000122AAAATTTACCTTAGAAGG 0AAAATTTACCTTAGAAGG 0AAAATTTACCTTAGAAGG 0AAAATTTACCTTAGAAGG 0AAAATTTACCTTAGAAGG 0AAAATTTACCTTAGAAGG .000183AAAATTTACCTTAGAAGG 0AAAATTTACCTTAGAAGG 0AAAATTTACCTTAGAAGG 0
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Gibbs Sampling: an Example
5) Create a distribution of probabilities of l-mers prob(a|P), and randomly select a new starting position based on this distribution.
Starting Position 1: prob( AAAATTTA | P ) = .000732 / .000122 = 6
Starting Position 2: prob( AAATTTAC | P ) = .000122 / .000122 = 1
Starting Position 8: prob( ACCTTAGA | P ) = .000183 / .000122 = 1.5
a) To create this distribution, divide each probability prob(a|P) by the lowest probability:
Ratio = 6 : 1 : 1.5
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Turning Ratios into Probabilities
Probability (Selecting Starting Position 1): 6/(6+1+1.5)= 0.706
Probability (Selecting Starting Position 2): 1/(6+1+1.5)= 0.118
Probability (Selecting Starting Position 8): 1.5/(6+1+1.5)=0.176
b) Define probabilities of starting positions according to computed ratios
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Gibbs Sampling: an Example
c) Select the start position according to computed ratios:
P(selecting starting position 1): .706
P(selecting starting position 2): .118
P(selecting starting position 8): .176
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Gibbs Sampling: an Example
Assume we select the substring with the highest probability – then we are left with the following new substrings and starting positions.
s1=7 GTAAACAATATTTATAGC
s2=1 AAAATTTACCTCGCAAGG
s3=9 CCGTACTGTCAAGCGTGG
s4=5 TGAGTAATCGACGTCCCA
s5=1 TACTTCACACCCTGTCAA
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Gibbs Sampling: an Example
6) We iterate the procedure again with the above starting positions until we cannot improve the score any more.
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Gibbs Sampler in Practice
• Gibbs sampling needs to be modified when applied to samples with unequal distributions of nucleotides (relative entropy approach).
• Gibbs sampling often converges to locally optimal motifs rather than globally optimal motifs.
• Needs to be run with many randomly chosen seeds to achieve good results.
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Another Randomized Approach• Random Projection Algorithm is a different
way to solve the Motif Finding Problem.• Guiding principle: Some instances of a motif
agree on a subset of positions.• However, it is unclear how to find these “non-
mutated” positions.• To bypass the effect of mutations within a motif,
we randomly select a subset of positions in the pattern creating a projection of the pattern.
• Search for that projection in a hope that the selected positions are not affected by mutations in most instances of the motif.
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Projections• Choose k positions in string of length l.• Concatenate nucleotides at chosen k
positions to form k-tuple.• This can be viewed as a projection of l-
dimensional space onto k-dimensional subspace.
ATGGCATTCAGATTC TGCTGAT
l = 15 k = 7 Projection
Projection = (2, 4, 5, 7, 11, 12, 13)
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Random Projections Algorithm• Select k out of l
positions uniformly at random.
• For each l-tuple in input sequences, hash into bucket based on letters at k selected positions.
• Recover motif from enriched bucket that contain many l-tuples. Bucket TGCT
TGCACCT
Input sequence:…TCAATGCACCTAT...
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Random Projections Algorithm (cont’d)
• Some projections will fail to detect motifs but if we try many of them the probability that one of the buckets fills in is increasing.
• In the example below, the bucket **GC*AC is “bad” while the bucket AT**G*C is “good”
ATGCGTC
...ccATCCGACca...
...ttATGAGGCtc...
...ctATAAGTCgc...
...tcATGTGACac... (7,2) motif
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Example
• l = 7 (motif size) , k = 4 (projection size)• Choose projection (1,2,5,7)
GCTC
...TAGACATCCGACTTGCCTTACTAC...
Buckets
ATGC
ATCCGAC
GCCTTAC
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Hashing and Buckets
• Hash function h(x) obtained from k positions of projection.
• Buckets are labeled by values of h(x).• Enriched buckets: contain more than s l-
tuples, for some parameter s.
ATTCCATCGCTCATGC
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Motif Refinement
• How do we recover the motif from the sequences in the enriched buckets?
• k nucleotides are from hash value of bucket.• Use information in other l-k positions as starting
point for local refinement scheme, e.g. Gibbs sampler.
Local refinement algorithmATGCGAC
Candidate motif
ATGC
ATCCGAC
ATGAGGCATAAGTC
ATGCGAC
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Synergy between Random Projection and Gibbs Sampler• Random Projection is a procedure for finding good
starting points: every enriched bucket is a potential starting point.
• Feeding these starting points into existing algorithms (like Gibbs sampler) provides good local search in vicinity of every starting point.
• These algorithms work particularly well for “good” starting points.
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Building Profiles from Buckets
A 1 0 .25 .50 0 .50 0
C 0 0 .25 .25 0 0 1
G 0 0 .50 0 1 .25 0
T 0 1 0 .25 0 .25 0
Profile P
Gibbs sampler
Refined profile P*
ATCCGAC
ATGAGGC
ATAAGTC
ATGTGAC
ATGC
A Graph-Based Method for Motif Finding
Graphs com from http://biochem218.stanford.edu/
Problems with Current Representations of DNA Motifs All current methods for representing DNA motifs inv
olve either consensus sequences or probabilistic models (such as PSSMs) of the motif.
Consensus sequences do not adequately represent the variability seen in promoters or transcription factor binding sites.
Both consensus sequences and PSSM models assume positional independence.
Neither method can accommodate correlations between positions.
Yeast Motifs
We analyzed yeast motifs for pairwise dependencies. We used a chi-square statistic to find whether two
positions were correlated or not. We found that 25% of motifs have significantl
y correlated positions.
Graph Representation
A Graph-Based Model of a Motif
The main idea is to formulate motif finding as the problem of finding the Maximum Density Subgraph
Given a set of DNA sequences, we construct a graph G=(V,E) V: the set of the k-mer occurrences in the input E: the set of undirected edges, represents nucleoti
de similarities between those k-mers
Maximum Density Subgraph
Nucleotide Dependencies and MDS
Things to be Done
Graph construction Edge weight assignment Background normalization
Finding maximum density subgraph(s) Choice of density function Max-Flow/Min-Cut Neighborhood subgraph …
The Modeling of Regulatory Regions
More than a hundred methods have been proposed for motif discovery A large variation with respect to both algorithmic
approaches as well as the underlying models of regulatory regions
Key Points
How the multiple binding sites for modules of regulatory elements can be modeled
How additional data sources may be integrated into such models
A Integrated Framework to Model Regulatory Regions A single motif, denoted by mg, consists of two parts,
m*: how well the sequence matches a consensus og is a prior on whether any regulatory element is to occur at t
hat position A set of single motifs, together with inter-motif distanc
e restrictions (d), then forms a composite motif (cg) Multiple occurrences of a composite motif in the regul
atory regions of a gene is represented by a gene score Gc
A Schematic View of the Integrated Framework
Single Motif Models (Level 1)
Transcription factor binding sites: This is the most basic element of the regulatory system, and can be modeled using single motif models
A single motif model is defined as a function mg that maps a sequence position p to a real numbered motif score mg(p) It consists of a match score m*(p) and an occurren
ce prior og(p)
The Match Model m*
Probabilistic match model Position weight matrix (PWM) a.k.a. PSSM Incorporating dependencies within motifs
n-th order Markov chains Bayesian network
Deterministic match model oligos regular expression mismatch expressions
Occurrence priors og
Spatial distribution of binding sites Conservation in orthologous sequences DNA structure Nucleotide distribution
Composite Motif Models (Level 2) Clusters of binding sites for cooperating TFs,
often called modules, are believed to be essential building blocks of the regulatory machinery
A composite motif model is defined as a function cg that maps a set of single motif sequence positions p to a real numbered composite motif score cg(p).
A Composite Motif Score cg
Given a set of positions, the score of a composite motif will typically be the sum or product of individual single motif, and distance scores
Distance Functions
constraints fixed distances distances below thresholds distances within intervals all single motifs to be within a window of a certain
length score functions
uniform distribution: geometry, Poisson, etc.
Combining Single Motifs
For methods using deterministic match models and constraints on distances intersection of component scores M out of N single motif scores
For methods that use non-binary single motif scores the sum/product of single motif and distance scores
Specialized models hidden Markov model (HMM) self-organizing map (SOM) artificial neural network (ANN)
Gene level models (Level 3)
A gene score model is defined as a function Gc that maps a gene index g to a real numbered gene score Gc(g)
The gene level score is calculated from composite motif scores, cg, across the regulatory region of gene g, and is referred to as gene score
Multiple Binding Sites/Modules One motif in the regulatory region
The gene level score is often defined simply as the maximum motif score in the regulatory region of a gene
Multiple binding sites/modules sum of log-scores of motifs logistic-regression ANN HMM …
Overview of Methods