Motif Finding Kun-Pin Wu ( 巫坤品 ) Institute of Biomedical Informatics National Yang Ming University 2007/11/20

Motif Finding

Kun-Pin Wu ( 巫坤品 )Institute of Biomedical InformaticsNational Yang Ming University2007/11/20

www.bioalgorithms.infoAn Introduction to Bioinformatics Algorithms

Finding Regulatory Motifs in DNA Sequences

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Random Sample

atgaccgggatactgataccgtatttggcctaggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg

acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatactgggcataaggtaca

tgagtatccctgggatgacttttgggaacactatagtgctctcccgatttttgaatatgtaggatcattcgccagggtccga

gctgagaattggatgaccttgtaagtgttttccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga

tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatggcccacttagtccacttatag

gtcaatcatgttcttgtgaatggatttttaactgagggcatagaccgcttggcgcacccaaattcagtgtgggcgagcgcaa

cggttttggcccttgttagaggcccccgtactgatggaaactttcaattatgagagagctaatctatcgcgtgcgtgttcat

aacttgagttggtttcgaaaatgctctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta

ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatttcaacgtatgccgaaccgaaagggaag

ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttctgggtactgatagca


Implanting Motif AAAAAAAGGGGGGG

atgaccgggatactgatAAAAAAAAGGGGGGGggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg

acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaataAAAAAAAAGGGGGGGa

tgagtatccctgggatgacttAAAAAAAAGGGGGGGtgctctcccgatttttgaatatgtaggatcattcgccagggtccga

gctgagaattggatgAAAAAAAAGGGGGGGtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga

tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatAAAAAAAAGGGGGGGcttatag

gtcaatcatgttcttgtgaatggatttAAAAAAAAGGGGGGGgaccgcttggcgcacccaaattcagtgtgggcgagcgcaa

cggttttggcccttgttagaggcccccgtAAAAAAAAGGGGGGGcaattatgagagagctaatctatcgcgtgcgtgttcat

aacttgagttAAAAAAAAGGGGGGGctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta

ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatAAAAAAAAGGGGGGGaccgaaagggaag

ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttAAAAAAAAGGGGGGGa


Where is the Implanted Motif?

atgaccgggatactgataaaaaaaagggggggggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg

acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaataaaaaaaaaggggggga

tgagtatccctgggatgacttaaaaaaaagggggggtgctctcccgatttttgaatatgtaggatcattcgccagggtccga

gctgagaattggatgaaaaaaaagggggggtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga

tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaataaaaaaaagggggggcttatag

gtcaatcatgttcttgtgaatggatttaaaaaaaaggggggggaccgcttggcgcacccaaattcagtgtgggcgagcgcaa

cggttttggcccttgttagaggcccccgtaaaaaaaagggggggcaattatgagagagctaatctatcgcgtgcgtgttcat

aacttgagttaaaaaaaagggggggctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta

ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcataaaaaaaagggggggaccgaaagggaag

ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttaaaaaaaaggggggga


Implanting Motif AAAAAAGGGGGGG with Four Mutations

atgaccgggatactgatAgAAgAAAGGttGGGggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg

acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatacAAtAAAAcGGcGGGa

tgagtatccctgggatgacttAAAAtAAtGGaGtGGtgctctcccgatttttgaatatgtaggatcattcgccagggtccga

gctgagaattggatgcAAAAAAAGGGattGtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga

tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatAtAAtAAAGGaaGGGcttatag

gtcaatcatgttcttgtgaatggatttAAcAAtAAGGGctGGgaccgcttggcgcacccaaattcagtgtgggcgagcgcaa

cggttttggcccttgttagaggcccccgtAtAAAcAAGGaGGGccaattatgagagagctaatctatcgcgtgcgtgttcat

aacttgagttAAAAAAtAGGGaGccctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta

ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatActAAAAAGGaGcGGaccgaaagggaag

ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttActAAAAAGGaGcGGa


Where is the Motif???

atgaccgggatactgatagaagaaaggttgggggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg

acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatacaataaaacggcggga

tgagtatccctgggatgacttaaaataatggagtggtgctctcccgatttttgaatatgtaggatcattcgccagggtccga

gctgagaattggatgcaaaaaaagggattgtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga

tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatataataaaggaagggcttatag

gtcaatcatgttcttgtgaatggatttaacaataagggctgggaccgcttggcgcacccaaattcagtgtgggcgagcgcaa

cggttttggcccttgttagaggcccccgtataaacaaggagggccaattatgagagagctaatctatcgcgtgcgtgttcat

aacttgagttaaaaaatagggagccctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta

ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatactaaaaaggagcggaccgaaagggaag

ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttactaaaaaggagcgga


Why Finding (15,4) Motif is Difficult?

atgaccgggatactgatAgAAgAAAGGttGGGggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg

acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatacAAtAAAAcGGcGGGa

tgagtatccctgggatgacttAAAAtAAtGGaGtGGtgctctcccgatttttgaatatgtaggatcattcgccagggtccga

gctgagaattggatgcAAAAAAAGGGattGtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga

tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatAtAAtAAAGGaaGGGcttatag

gtcaatcatgttcttgtgaatggatttAAcAAtAAGGGctGGgaccgcttggcgcacccaaattcagtgtgggcgagcgcaa

cggttttggcccttgttagaggcccccgtAtAAAcAAGGaGGGccaattatgagagagctaatctatcgcgtgcgtgttcat

aacttgagttAAAAAAtAGGGaGccctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta

ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatActAAAAAGGaGcGGaccgaaagggaag

ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttActAAAAAGGaGcGGa

AgAAgAAAGGttGGG

cAAtAAAAcGGcGGG..|..|||.|..|||


Challenge Problem

• Find a motif in a sample of

- 20 “random” sequences (e.g. 600 nt long)

- each sequence containing an implanted

pattern of length 15,

- each pattern appearing with 4 mismatches

as (15,4)-motif.


Combinatorial Gene Regulation• A microarray experiment showed that when

gene X is knocked out, 20 other genes are not expressed

• How can one gene have such drastic effects?


Regulatory Proteins• Gene X encodes regulatory protein, a.k.a. a

transcription factor (TF)

• The 20 unexpressed genes rely on gene X’s TF to induce transcription

• A single TF may regulate multiple genes


Regulatory Regions• Every gene contains a regulatory region (RR) typically st

retching 100-1000 bp upstream of the transcriptional start site

• Located within the RR are the Transcription Factor Binding Sites (TFBS), also known as motifs, specific for a given transcription factor

• TFs influence gene expression by binding to a specific location in the respective gene’s regulatory region - TFBS


Transcription Factor Binding Sites

• A TFBS can be located anywhere within the

Regulatory Region

• TFBS may vary slightly across different regulatory regions since non-essential bases could mutate


Motifs and Transcriptional Start Sites

geneATCCCG

geneTTCCGG

geneATCCCG

geneATGCCG

geneATGCCC


Transcription Factors and Motifs


Motif Logo

• Motifs can mutate on non important bases

• The five motifs in five different genes have mutations in position 3 and 5

• Representations called motif logos illustrate the conserved and variable regions of a motif

TGGGGGATGAGAGATGGGGGATGAGAGATGAGGGA


Motif Logos: An Example

(http://www-lmmb.ncifcrf.gov/~toms/sequencelogo.html)


Identifying Motifs• Genes are turned on or off by regulatory proteins

• These proteins bind to upstream regulatory regions of genes to either attract or block an RNA polymerase

• Regulatory protein (TF) binds to a short DNA sequence called a motif (TFBS)

• So finding the same motif in multiple genes’ regulatory regions suggests a regulatory relationship amongst those genes


Identifying Motifs: Complications

• We do not know the motif sequence

• We do not know where it is located relative to the genes start

• Motifs can differ slightly from one gene to the next

• How to discern it from “random” motifs?


The Motif Finding Problem• Given a random sample of DNA sequences:

cctgatagacgctatctggctatccacgtacgtaggtcctctgtgcgaatctatgcgtttccaaccat

agtactggtgtacatttgatacgtacgtacaccggcaacctgaaacaaacgctcagaaccagaagtgc

aaacgtacgtgcaccctctttcttcgtggctctggccaacgagggctgatgtataagacgaaaatttt

agcctccgatgtaagtcatagctgtaactattacctgccacccctattacatcttacgtacgtataca

ctgttatacaacgcgtcatggcggggtatgcgttttggtcgtcgtacgctcgatcgttaacgtacgtc

• Find the pattern that is implanted in each of the individual sequences, namely, the motif


The Motif Finding Problem (cont’d)• Additional information:

• The hidden sequence is of length 8

• The pattern is not exactly the same in each array because random point mutations may occur in the sequences


The Motif Finding Problem (cont’d)• The patterns revealed with no mutations:cctgatagacgctatctggctatccacgtacgtaggtcctctgtgcgaatctatgcgtttccaaccat

agtactggtgtacatttgatacgtacgtacaccggcaacctgaaacaaacgctcagaaccagaagtgc

aaacgtacgtgcaccctctttcttcgtggctctggccaacgagggctgatgtataagacgaaaatttt

agcctccgatgtaagtcatagctgtaactattacctgccacccctattacatcttacgtacgtataca

ctgttatacaacgcgtcatggcggggtatgcgttttggtcgtcgtacgctcgatcgttaacgtacgtc

acgtacgtConsensus String


The Motif Finding Problem (cont’d)• The patterns with 2 point mutations:

cctgatagacgctatctggctatccaGgtacTtaggtcctctgtgcgaatctatgcgtttccaaccat

agtactggtgtacatttgatCcAtacgtacaccggcaacctgaaacaaacgctcagaaccagaagtgc

aaacgtTAgtgcaccctctttcttcgtggctctggccaacgagggctgatgtataagacgaaaatttt

agcctccgatgtaagtcatagctgtaactattacctgccacccctattacatcttacgtCcAtataca

ctgttatacaacgcgtcatggcggggtatgcgttttggtcgtcgtacgctcgatcgttaCcgtacgGc


The Motif Finding Problem (cont’d)• The patterns with 2 point mutations:






Can we still find the motif, now that we have 2 mutations?


Defining Motifs

• To define a motif, lets say we know where the motif starts in the sequence

• The motif start positions in their sequences can be represented as s = (s1,s2,s3,…,st)


Motifs: Profiles and Consensus a G g t a c T t C c A t a c g tAlignment a c g t T A g t a c g t C c A t C c g t a c g G

_________________

A 3 0 1 0 3 1 1 0Profile C 2 4 0 0 1 4 0 0 G 0 1 4 0 0 0 3 1 T 0 0 0 5 1 0 1 4

_________________

Consensus A C G T A C G T

• Line up the patterns by their start indexes

s = (s1, s2, …, st)

• Construct matrix profile with frequencies of each nucleotide in columns

• Consensus nucleotide in each position has the highest score in column


Consensus

• Think of consensus as an “ancestor” motif, from which mutated motifs emerged

• The distance between a real motif and the consensus sequence is generally less than that for two real motifs


Consensus (cont’d)


Evaluating Motifs

• We have a guess about the consensus sequence, but how “good” is this consensus?

• Need to introduce a scoring function to compare different guesses and choose the “best” one.


Defining Some Terms

• t - number of sample DNA sequences• n - length of each DNA sequence• DNA - sample of DNA sequences (t x n array)

• l - length of the motif (l-mer)• si - starting position of an l-mer in sequence i

• s=(s1, s2,… st) - array of motif’s starting positions


Parameters






l = 8

t=5

s1 = 26 s2 = 21 s3= 3 s4 = 56 s5 = 60 s

DNA

n = 69


Scoring Motifs

• Given s = (s1, … st) and DNA:

Score(s,DNA) =

a G g t a c T t C c A t a c g t a c g t T A g t a c g t C c A t C c g t a c g G _________________ A 3 0 1 0 3 1 1 0 C 2 4 0 0 1 4 0 0 G 0 1 4 0 0 0 3 1 T 0 0 0 5 1 0 1 4 _________________

Consensus a c g t a c g t

Score 3+4+4+5+3+4+3+4=30

l

t

l

i GCTAk

ikcount1 },,,{

),(max


The Motif Finding Problem

• If starting positions s=(s1, s2,… st) are given, finding consensus is easy even with mutations in the sequences because we can simply construct the profile to find the motif (consensus)

• But… the starting positions s are usually not given. How can we find the “best” profile matrix?


The Motif Finding Problem: Formulation

• Goal: Given a set of DNA sequences, find a set of l-mers, one from each sequence, that maximizes the consensus score

• Input: A t x n matrix of DNA, and l, the length of the pattern to find

• Output: An array of t starting positions s = (s1, s2, … st) maximizing Score(s,DNA)


The Motif Finding Problem: Brute Force Solution

• Compute the scores for each possible combination of starting positions s

• The best score will determine the best profile and the consensus pattern in DNA

• The goal is to maximize Score(s,DNA) by varying the starting positions si, where:

si = [1, …, n-l+1]i = [1, …, t]


Running Time of BruteForceMotifSearch• Varying (n - l + 1) positions in each of t sequen

ces, we’re looking at (n - l + 1)t sets of starting positions

• For each set of starting positions, the scoring function makes l operations, so complexity is l (n – l + 1)t = O(l nt)

• That means that for t = 8, n = 1000, l = 10 we must perform approximately 1020 computations – it will take billions years


The Median String Problem

• Given a set of t DNA sequences find a pattern that appears in all t sequences with the minimum number of mutations

• This pattern will be the motif


Hamming Distance

• Hamming distance: • dH(v,w) is the number of nucleotide pairs th

at do not match when v and w are aligned. For example:

dH(AAAAAA,ACAAAC) = 2


Total Distance: Example

• Given v = “acgtacgt” and s acgtacgt

cctgatagacgctatctggctatccacgtacAtaggtcctctgtgcgaatctatgcgtttccaaccat acgtacgtagtactggtgtacatttgatacgtacgtacaccggcaacctgaaacaaacgctcagaaccagaagtgc acgtacgtaaaAgtCcgtgcaccctctttcttcgtggctctggccaacgagggctgatgtataagacgaaaatttt acgtacgtagcctccgatgtaagtcatagctgtaactattacctgccacccctattacatcttacgtacgtataca acgtacgtctgttatacaacgcgtcatggcggggtatgcgttttggtcgtcgtacgctcgatcgttaacgtaGgtc

v is the sequence in red, x is the sequence in blue

• TotalDistance(v,DNA) = 1+0+2+0+1 = 4

dH(v, x) = 2

dH(v, x) = 1

dH(v, x) = 0

dH(v, x) = 0

dH(v, x) = 1


Total Distance: Definition

• For each DNA sequence i, compute all dH(v, x), where x is an l-mer with starting position si

(1 < si < n – l + 1)• Find minimum of dH(v, x) among all l-mers in sequen

ce i• TotalDistance(v,DNA) is the sum of the minimum Ha

mming distances for each DNA sequence i• TotalDistance(v,DNA) = mins dH(v, s), where s is the

set of starting positions s1, s2,… st


The Median String Problem: Formulation• Goal: Given a set of DNA sequences, find a

median string• Input: A t x n matrix DNA, and l, the length of

the pattern to find• Output: A string v of l nucleotides that minimi

zes TotalDistance(v,DNA) over all strings of that length


Motif Finding Problem == Median String Problem

• The Motif Finding is a maximization problem while Median String is a minimization problem

• However, the Motif Finding problem and Median String problem are computationally equivalent

• Need to show that minimizing TotalDistance is equivalent to maximizing Score


We are looking for the same thing

a G g t a c T t C c A t a c g tAlignment a c g t T A g t a c g t C c A t C c g t a c g G _________________ A 3 0 1 0 3 1 1 0Profile C 2 4 0 0 1 4 0 0 G 0 1 4 0 0 0 3 1 T 0 0 0 5 1 0 1 4 _________________

Consensus a c g t a c g t

Score 3+4+4+5+3+4+3+4

TotalDistance 2+1+1+0+2+1+2+1

Sum 5 5 5 5 5 5 5 5

• At any column iScorei + TotalDistancei = t

• Because there are l columns Score + TotalDistance = l * t

• Rearranging:Score = l * t - TotalDistance

• l * t is constant the minimization of the right side is equivalent to the maximization of the left side

l

t


Motif Finding Problem vs. Median String Problem• Why bother reformulating the Motif Finding

problem into the Median String problem?

• The Motif Finding Problem needs to examine all the combinations for s. That is (n - l + 1)t combinations!!!

• The Median String Problem needs to examine all 4l combinations for v. This number is relatively smaller


Motif Finding: Improving the Running Time

1. BruteForceMotifSearch(DNA, t, n, l)2. bestScore 03. for each s=(s1,s2 , . . ., st) from (1,1 . . . 1) to (n-l+1, . . ., n-l+1)4. if (Score(s,DNA) > bestScore)5. bestScore Score(s, DNA)6. bestMotif (s1,s2 , . . . , st) 7. return bestMotif


Structuring the Search

• How can we perform the line

for each s=(s1,s2 , . . ., st) from (1,1 . . . 1) to (n-l+1, . . ., n-l+1) ?

• We need a method for efficiently structuring and navigating the many possible motifs

• This is not very different than exploring all t-digit numbers


Median String: Improving the Running Time1. MedianStringSearch (DNA, t, n, l)2. bestWord AAA…A

3. bestDistance ∞

4. for each l-mer s from AAA…A to TTT…T if TotalDistance(s,DNA) < bestDistance

5. bestDistanceTotalDistance(s,DNA)

6. bestWord s

7. return bestWord


Structuring the Search

• For the Median String Problem we need to consider all 4l possible l-mers:

aa… aaaa… acaa… agaa… at

.

.tt… tt

How to organize this search?

l


Alternative Representation of the Search Space

• Let A = 1, C = 2, G = 3, T = 4• Then the sequences from AA…A to TT…T become:

11…1111…1211…1311…14

.

.44…44

• Notice that the sequences above simply list all numbers as if we were counting on base 4 without using 0 as a digit

l


Search Tree

a- c- g- t-

aa ac ag at ca cc cg ct ga gc gg gt ta tc tg tt

--

root


Three Operations

• NextLeaf• NextVertex• ByPass


NextLeaf: Example• Moving to the next leaf:

1- 2- 3- 4-

11 12 13 14 21 22 23 24 31 32 33 34 41 42 43 44

--Next Location


Depth First Search

• So we can search leaves

• How about searching all vertices of the tree?

• We can do this with a depth first search• Preorder traversal


NextVertex: Example• Moving to the next vertices:

1- 2- 3- 4-

11 12 13 14 21 22 23 24 31 32 33 34 41 42 43 44

--

Location after 5 next vertex moves


ByPass: Example• Bypassing the descendants of “2-”:

1- 2- 3- 4-

11 12 13 14 21 22 23 24 31 32 33 34 41 42 43 44

--Next Location


Revisiting Brute Force Search

• Now that we have method for navigating the tree, lets look again at BruteForceMotifSearch• NextLeaf• NextVertex• ByPass


Can We Do Better?

• Sets of s=(s1, s2, …,st) may have a weak profile for the first i positions (s1, s2, …,si)

• Every row of alignment may add at most l to Score• Optimism: if all subsequent (t-i) positions (si+1, …st) a

dd (t – i ) * l to Score(s,i,DNA)

• If Score(s,i,DNA) + (t – i ) * l < BestScore, it makes no sense to search in vertices of the current subtree• Use ByPass()


Branch and Bound Algorithm for Motif Search

• Since each level of the tree goes deeper into search, discarding a prefix discards all following branches

• This saves us from looking at (n – l + 1)t-i leaves• Use NextVertex() and ByPas

s() to navigate the tree


Branch and Bound Applied to Median String Search• Note that if the total distance for a prefix is gr

eater than that for the best word so far:

TotalDistance (prefix, DNA) > BestDistance

there is no use exploring the remaining part of the word

• We can eliminate that branch and BYPASS exploring that branch further


More on the Motif Problem

• Exhaustive Search and Median String are both exact algorithms

• They always find the optimal solution, though they may be too slow to perform practical tasks

• Many algorithms sacrifice optimal solution for speed


CONSENSUS: Greedy Motif Search• Find two closest l-mers in sequences 1 and 2 and forms 2 x l alignment matrix with Score(s,2,DNA)• At each of the following t-2 iterations CONSENSUS finds a “best”

l-mer in sequence i from the perspective of the already constructed (i-1) x l alignment matrix for the first (i-1) sequences

• In other words, it finds an l-mer in sequence i maximizing Score(s,i,DNA)

under the assumption that the first (i-1) l-mers have been already chosen

• CONSENSUS sacrifices optimal solution for speed: in fact the bulk of the time is actually spent locating the first 2 l-mers

www.bioalgorithms.infoAn Introduction to Bioinformatics Algorithms

Randomized Motif Finding


Randomized Algorithms

• Randomized algorithms make random rather than deterministic decisions.

• The main advantage is that no input can reliably produce worst-case results because the algorithm runs differently each time.

• These algorithms are commonly used in situations where no exact and fast algorithm is known.


A New Motif Finding Approach

• Motif Finding Problem: Given a list of t sequences each of length n, find the “best” pattern of length l that appears in each of the t sequences.

• Previously: we solved the Motif Finding Problem using a Branch and Bound or a Greedy technique.

• Now: randomly select possible locations and find a way to greedily change those locations until we have converged to the hidden motif.


Profiles Revisited

• Let s=(s1,...,st) be the set of starting positions for l-mers in our t sequences.

• The substrings corresponding to these starting positions will form:

- t x l alignment matrix and - 4 x l profile matrix* P.

*We make a special note that the profile matrix will be defined in terms of the frequency of letters, and not as the count of letters.


• Prob(a|P) is defined as the probability that an l-mer a was created by the Profile P.

• If a is very similar to the consensus string of P then Prob(a|P) will be high

• If a is very different, then Prob(a|P) will be low.

n

Prob(a|P) =Π pai , i

i=1

Scoring Strings with a Profile


Scoring Strings with a Profile (cont’d)Given a profile: P =

A 1/2 7/8 3/8 0 1/8 0

C 1/8 0 1/2 5/8 3/8 0

T 1/8 1/8 0 0 1/4 7/8

G 1/4 0 1/8 3/8 1/4 1/8

Prob(aaacct|P) = ??? The probability of the consensus string:



A 1/2 7/8 3/8 0 1/8 0

C 1/8 0 1/2 5/8 3/8 0

T 1/8 1/8 0 0 1/4 7/8

G 1/4 0 1/8 3/8 1/4 1/8

Prob(aaacct|P) = 1/2 x 7/8 x 3/8 x 5/8 x 3/8 x 7/8 = .033646The probability of the consensus string:



A 1/2 7/8 3/8 0 1/8 0

C 1/8 0 1/2 5/8 3/8 0

T 1/8 1/8 0 0 1/4 7/8

G 1/4 0 1/8 3/8 1/4 1/8

Prob(atacag|P) = 1/2 x 1/8 x 3/8 x 5/8 x 1/8 x 1/8 = .001602

Prob(aaacct|P) = 1/2 x 7/8 x 3/8 x 5/8 x 3/8 x 7/8 = .033646The probability of the consensus string:

Probability of a different string:


P-Most Probable l-mer• Define the P-most probable l-mer from a sequence

as an l-mer in that sequence which has the highest probability of being created from the profile P.

A 1/2 7/8 3/8 0 1/8 0

C 1/8 0 1/2 5/8 3/8 0

T 1/8 1/8 0 0 1/4 7/8

G 1/4 0 1/8 3/8 1/4 1/8

P =

Given a sequence = ctataaaccttacatc, find the P-most probable l-mer


Third try: c t a t a a a c c t t a c a t c

Second try: c t a t a a a c c t t a c a t c

First try: c t a t a a a c c t t a c a t c

P-Most Probable l-mer (cont’d)

A 1/2 7/8 3/8 0 1/8 0

C 1/8 0 1/2 5/8 3/8 0

T 1/8 1/8 0 0 1/4 7/8

G 1/4 0 1/8 3/8 1/4 1/8

Find the Prob(a|P) of every possible 6-mer:

-Continue this process to evaluate every possible 6-mer



String, Highlighted in Red Calculations prob(a|P)

ctataaaccttacat 1/8 x 1/8 x 3/8 x 0 x 1/8 x 0 0

ctataaaccttacat 1/2 x 7/8 x 0 x 0 x 1/8 x 0 0



ctataaaccttacat 1/2 x 7/8 x 3/8 x 5/8 x 3/8 x 7/8 .0336


ctataaaccttacat 1/2 x 0 x 1/2 x 0 1/4 x 0 0

ctataaaccttacat 1/8 x 0 x 0 x 0 x 0 x 1/8 x 0 0



Compute prob(a|P) for every possible 6-mer:



String, Highlighted in Red Calculations Prob(a|P)







ctataaaccttacat 1/2 x 0 x 1/2 x 0 1/4 x 0 0

ctataaaccttacat 1/8 x 0 x 0 x 0 x 0 x 1/8 x 0 0



P-Most Probable 6-mer in the sequence is aaacct:



ctataaaccttacatc

because Prob(aaacct|P) = .0336 is greater than the Prob(a|P) of any other 6-mer in the sequence.

aaacct is the P-most probable 6-mer in:


Dealing with Zeroes

• In our toy example prob(a|P)=0 in many cases. In practice, there will be enough sequences so that the number of elements in the profile with a frequency of zero is small.

• To avoid many entries with prob(a|P)=0, there exist techniques to equate zero to a very small number so that one zero does not make the entire probability of a string zero (we will not address these techniques here).


P-Most Probable l-mers in Many Sequences• Find the P-most probab

le l-mer in each of the sequences.

ctataaacgttacatc

atagcgattcgactg

cagcccagaaccct

cggtataccttacatc

tgcattcaatagctta

tatcctttccactcac

ctccaaatcctttaca

ggtcatcctttatcct

A 1/2 7/8 3/8 0 1/8 0

C 1/8 0 1/2 5/8 3/8 0

T 1/8 1/8 0 0 1/4 7/8

G 1/4 0 1/8 3/8 1/4 1/8

P=


P-Most Probable l-mers in Many Sequences (cont’d) ctataaacgttacatc

atagcgattcgactg

cagcccagaaccct

cggtgaaccttacatc

tgcattcaatagctta

tgtcctgtccactcac

ctccaaatcctttaca

ggtctacctttatcct

P-Most Probable l-mers form a new profile

1 a a a c g t

2 a t a g c g

3 a a c c c t

4 g a a c c t

5 a t a g c t

6 g a c c t g

7 a t c c t t

8 t a c c t t

A 5/8 5/8 4/8 0 0 0

C 0 0 4/8 6/8 4/8 0

T 1/8 3/8 0 0 3/8 6/8

G 2/8 0 0 2/8 1/8 2/8


Comparing New and Old Profiles

Red – frequency increased, Blue – frequency descreased

1 a a a c g t

2 a t a g c g

3 a a c c c t

4 g a a c c t

5 a t a g c t

6 g a c c t g

7 a t c c t t

8 t a c c t t

A 5/8 5/8 4/8 0 0 0

C 0 0 4/8 6/8 4/8 0

T 1/8 3/8 0 0 3/8 6/8

G 2/8 0 0 2/8 1/8 2/8

A 1/2 7/8 3/8 0 1/8 0

C 1/8 0 1/2 5/8 3/8 0

T 1/8 1/8 0 0 1/4 7/8

G 1/4 0 1/8 3/8 1/4 1/8


Greedy Profile Motif SearchUse P-Most probable l-mers to adjust start positions u

ntil we reach a “best” profile; this is the motif.

1) Select random starting positions.2) Create a profile P from the substrings at these start

ing positions.3) Find the P-most probable l-mer a in each sequence

and change the starting position to the starting position of a.

4) Compute a new profile based on the new starting positions after each iteration and proceed until we cannot increase the score anymore.


GreedyProfileMotifSearch Analysis

• Since we choose starting positions randomly, there is little chance that our guess will be close to an optimal motif, meaning it will take a very long time to find the optimal motif.

• It is unlikely that the random starting positions will lead us to the correct solution at all.

• In practice, this algorithm is run many times with the hope that random starting positions will be close to the optimum solution simply by chance.


Gibbs Sampling

• GreedyProfileMotifSearch is probably not the best way to find motifs.

• However, we can improve the algorithm by introducing Gibbs Sampling, an iterative procedure that discards one l-mer after each iteration and replaces it with a new one.

• Gibbs Sampling proceeds more slowly and chooses new l-mers at random increasing the odds that it will converge to the correct solution.


How Gibbs Sampling Works1) Randomly choose starting positions

s = (s1,...,st) and form the set of l-mers associated with these starting positions. 2) Randomly choose one of the t sequences.

3) Create a profile P from the other t -1 sequences.4) For each position in the removed sequence, calculate the probability that the l-mer starting at that position was generated by P.5) Choose a new starting position for the removed

sequence at random based on the probabilities calculated in step 4.

6) Repeat steps 2-5 until there is no improvement


Gibbs Sampling: an Example

Input:

t = 5 sequences, motif length l = 8

1. GTAAACAATATTTATAGC2. AAAATTTACCTCGCAAGG

3. CCGTACTGTCAAGCGTGG 4. TGAGTAAACGACGTCCCA

5. TACTTAACACCCTGTCAA



1) Randomly choose starting positions, s=(s1,s2,s3,s4,s5) in the 5 sequences:

s1=7 GTAAACAATATTTATAGC

s2=11 AAAATTTACCTTAGAAGG

s3=9 CCGTACTGTCAAGCGTGG

s4=4 TGAGTAAACGACGTCCCA

s5=1 TACTTAACACCCTGTCAA



2) Choose one of the sequences at random:Sequence 2: AAAATTTACCTTAGAAGG


s2=11 AAAATTTACCTTAGAAGG






2) Choose one of the sequences at random:Sequence 2: AAAATTTACCTTAGAAGG






Gibbs Sampling: an Example3) Create profile P from l-mers in remaining 4 seq

uences:1 A A T A T T T A

3 T C A A G C G T

4 G T A A A C G A

5 T A C T T A A C

A 1/4 2/4 2/4 3/4 1/4 1/4 1/4 2/4

C 0 1/4 1/4 0 0 2/4 0 1/4

T 2/4 1/4 1/4 1/4 2/4 1/4 1/4 1/4

G 1/4 0 0 0 1/4 0 3/4 0Consensus

StringT A A A T C G A


Gibbs Sampling: an Example4) Calculate the prob(a|P) for every possible 8-

mer in the removed sequence:

Strings Highlighted in Red prob(a|P)

AAAATTTACCTTAGAAGG .000732AAAATTTACCTTAGAAGG .000122AAAATTTACCTTAGAAGG 0AAAATTTACCTTAGAAGG 0AAAATTTACCTTAGAAGG 0AAAATTTACCTTAGAAGG 0AAAATTTACCTTAGAAGG 0AAAATTTACCTTAGAAGG .000183AAAATTTACCTTAGAAGG 0AAAATTTACCTTAGAAGG 0AAAATTTACCTTAGAAGG 0



5) Create a distribution of probabilities of l-mers prob(a|P), and randomly select a new starting position based on this distribution.

Starting Position 1: prob( AAAATTTA | P ) = .000732 / .000122 = 6

Starting Position 2: prob( AAATTTAC | P ) = .000122 / .000122 = 1

Starting Position 8: prob( ACCTTAGA | P ) = .000183 / .000122 = 1.5

a) To create this distribution, divide each probability prob(a|P) by the lowest probability:

Ratio = 6 : 1 : 1.5


Turning Ratios into Probabilities

Probability (Selecting Starting Position 1): 6/(6+1+1.5)= 0.706

Probability (Selecting Starting Position 2): 1/(6+1+1.5)= 0.118

Probability (Selecting Starting Position 8): 1.5/(6+1+1.5)=0.176

b) Define probabilities of starting positions according to computed ratios



c) Select the start position according to computed ratios:

P(selecting starting position 1): .706





Assume we select the substring with the highest probability – then we are left with the following new substrings and starting positions.


s2=1 AAAATTTACCTCGCAAGG


s4=5 TGAGTAATCGACGTCCCA

s5=1 TACTTCACACCCTGTCAA



6) We iterate the procedure again with the above starting positions until we cannot improve the score any more.


Gibbs Sampler in Practice

• Gibbs sampling needs to be modified when applied to samples with unequal distributions of nucleotides (relative entropy approach).

• Gibbs sampling often converges to locally optimal motifs rather than globally optimal motifs.

• Needs to be run with many randomly chosen seeds to achieve good results.


Another Randomized Approach• Random Projection Algorithm is a different

way to solve the Motif Finding Problem.• Guiding principle: Some instances of a motif

agree on a subset of positions.• However, it is unclear how to find these “non-

mutated” positions.• To bypass the effect of mutations within a motif,

we randomly select a subset of positions in the pattern creating a projection of the pattern.

• Search for that projection in a hope that the selected positions are not affected by mutations in most instances of the motif.


Projections• Choose k positions in string of length l.• Concatenate nucleotides at chosen k

positions to form k-tuple.• This can be viewed as a projection of l-

dimensional space onto k-dimensional subspace.

ATGGCATTCAGATTC TGCTGAT

l = 15 k = 7 Projection

Projection = (2, 4, 5, 7, 11, 12, 13)


Random Projections Algorithm• Select k out of l

positions uniformly at random.

• For each l-tuple in input sequences, hash into bucket based on letters at k selected positions.

• Recover motif from enriched bucket that contain many l-tuples. Bucket TGCT

TGCACCT

Input sequence:…TCAATGCACCTAT...


Random Projections Algorithm (cont’d)

• Some projections will fail to detect motifs but if we try many of them the probability that one of the buckets fills in is increasing.

• In the example below, the bucket **GC*AC is “bad” while the bucket AT**G*C is “good”

ATGCGTC

...ccATCCGACca...

...ttATGAGGCtc...

...ctATAAGTCgc...

...tcATGTGACac... (7,2) motif


Example

• l = 7 (motif size) , k = 4 (projection size)• Choose projection (1,2,5,7)

GCTC

...TAGACATCCGACTTGCCTTACTAC...

Buckets

ATGC

ATCCGAC

GCCTTAC


Hashing and Buckets

• Hash function h(x) obtained from k positions of projection.

• Buckets are labeled by values of h(x).• Enriched buckets: contain more than s l-

tuples, for some parameter s.

ATTCCATCGCTCATGC


Motif Refinement

• How do we recover the motif from the sequences in the enriched buckets?

• k nucleotides are from hash value of bucket.• Use information in other l-k positions as starting

point for local refinement scheme, e.g. Gibbs sampler.

Local refinement algorithmATGCGAC

Candidate motif

ATGC

ATCCGAC

ATGAGGCATAAGTC

ATGCGAC


Synergy between Random Projection and Gibbs Sampler• Random Projection is a procedure for finding good

starting points: every enriched bucket is a potential starting point.

• Feeding these starting points into existing algorithms (like Gibbs sampler) provides good local search in vicinity of every starting point.

• These algorithms work particularly well for “good” starting points.


Building Profiles from Buckets

A 1 0 .25 .50 0 .50 0

C 0 0 .25 .25 0 0 1

G 0 0 .50 0 1 .25 0

T 0 1 0 .25 0 .25 0

Profile P

Gibbs sampler

Refined profile P*

ATCCGAC

ATGAGGC

ATAAGTC

ATGTGAC

ATGC

A Graph-Based Method for Motif Finding

Graphs com from http://biochem218.stanford.edu/

Problems with Current Representations of DNA Motifs All current methods for representing DNA motifs inv

olve either consensus sequences or probabilistic models (such as PSSMs) of the motif.

Consensus sequences do not adequately represent the variability seen in promoters or transcription factor binding sites.

Both consensus sequences and PSSM models assume positional independence.

Neither method can accommodate correlations between positions.

Yeast Motifs

We analyzed yeast motifs for pairwise dependencies. We used a chi-square statistic to find whether two

positions were correlated or not. We found that 25% of motifs have significantl

y correlated positions.

Graph Representation

A Graph-Based Model of a Motif

The main idea is to formulate motif finding as the problem of finding the Maximum Density Subgraph

Given a set of DNA sequences, we construct a graph G=(V,E) V: the set of the k-mer occurrences in the input E: the set of undirected edges, represents nucleoti

de similarities between those k-mers

Maximum Density Subgraph

Nucleotide Dependencies and MDS

Things to be Done

Graph construction Edge weight assignment Background normalization

Finding maximum density subgraph(s) Choice of density function Max-Flow/Min-Cut Neighborhood subgraph …

The Modeling of Regulatory Regions

More than a hundred methods have been proposed for motif discovery A large variation with respect to both algorithmic

approaches as well as the underlying models of regulatory regions

Key Points

How the multiple binding sites for modules of regulatory elements can be modeled

How additional data sources may be integrated into such models

A Integrated Framework to Model Regulatory Regions A single motif, denoted by mg, consists of two parts,

m*: how well the sequence matches a consensus og is a prior on whether any regulatory element is to occur at t

hat position A set of single motifs, together with inter-motif distanc

e restrictions (d), then forms a composite motif (cg) Multiple occurrences of a composite motif in the regul

atory regions of a gene is represented by a gene score Gc

A Schematic View of the Integrated Framework

Single Motif Models (Level 1)

Transcription factor binding sites: This is the most basic element of the regulatory system, and can be modeled using single motif models

A single motif model is defined as a function mg that maps a sequence position p to a real numbered motif score mg(p) It consists of a match score m*(p) and an occurren

ce prior og(p)

The Match Model m*

Probabilistic match model Position weight matrix (PWM) a.k.a. PSSM Incorporating dependencies within motifs

n-th order Markov chains Bayesian network

Deterministic match model oligos regular expression mismatch expressions

Occurrence priors og

Spatial distribution of binding sites Conservation in orthologous sequences DNA structure Nucleotide distribution

Composite Motif Models (Level 2) Clusters of binding sites for cooperating TFs,

often called modules, are believed to be essential building blocks of the regulatory machinery

A composite motif model is defined as a function cg that maps a set of single motif sequence positions p to a real numbered composite motif score cg(p).

A Composite Motif Score cg

Given a set of positions, the score of a composite motif will typically be the sum or product of individual single motif, and distance scores

Distance Functions

constraints fixed distances distances below thresholds distances within intervals all single motifs to be within a window of a certain

length score functions

uniform distribution: geometry, Poisson, etc.

Combining Single Motifs

For methods using deterministic match models and constraints on distances intersection of component scores M out of N single motif scores

For methods that use non-binary single motif scores the sum/product of single motif and distance scores

Specialized models hidden Markov model (HMM) self-organizing map (SOM) artificial neural network (ANN)

Gene level models (Level 3)

A gene score model is defined as a function Gc that maps a gene index g to a real numbered gene score Gc(g)

The gene level score is calculated from composite motif scores, cg, across the regulatory region of gene g, and is referred to as gene score

Multiple Binding Sites/Modules One motif in the regulatory region

The gene level score is often defined simply as the maximum motif score in the regulatory region of a gene

Multiple binding sites/modules sum of log-scores of motifs logistic-regression ANN HMM …

Overview of Methods

Documents

Motif Finding Kun-Pin Wu ( 巫坤品 ) Institute of Biomedical Informatics National Yang Ming University 2007/11/20