126
Motif Finding Kun-Pin Wu ( 巫巫巫 ) Institute of Biomedical Informatic s National Yang Ming University 2007/11/20

Motif Finding Kun-Pin Wu ( 巫坤品 ) Institute of Biomedical Informatics National Yang Ming University 2007/11/20

Embed Size (px)

Citation preview

Page 1: Motif Finding Kun-Pin Wu ( 巫坤品 ) Institute of Biomedical Informatics National Yang Ming University 2007/11/20

Motif Finding

Kun-Pin Wu ( 巫坤品 )Institute of Biomedical InformaticsNational Yang Ming University2007/11/20

Page 2: Motif Finding Kun-Pin Wu ( 巫坤品 ) Institute of Biomedical Informatics National Yang Ming University 2007/11/20

www.bioalgorithms.infoAn Introduction to Bioinformatics Algorithms

Finding Regulatory Motifs in DNA Sequences

Page 3: Motif Finding Kun-Pin Wu ( 巫坤品 ) Institute of Biomedical Informatics National Yang Ming University 2007/11/20

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Random Sample

atgaccgggatactgataccgtatttggcctaggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg

acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatactgggcataaggtaca

tgagtatccctgggatgacttttgggaacactatagtgctctcccgatttttgaatatgtaggatcattcgccagggtccga

gctgagaattggatgaccttgtaagtgttttccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga

tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatggcccacttagtccacttatag

gtcaatcatgttcttgtgaatggatttttaactgagggcatagaccgcttggcgcacccaaattcagtgtgggcgagcgcaa

cggttttggcccttgttagaggcccccgtactgatggaaactttcaattatgagagagctaatctatcgcgtgcgtgttcat

aacttgagttggtttcgaaaatgctctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta

ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatttcaacgtatgccgaaccgaaagggaag

ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttctgggtactgatagca

Page 4: Motif Finding Kun-Pin Wu ( 巫坤品 ) Institute of Biomedical Informatics National Yang Ming University 2007/11/20

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Implanting Motif AAAAAAAGGGGGGG

atgaccgggatactgatAAAAAAAAGGGGGGGggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg

acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaataAAAAAAAAGGGGGGGa

tgagtatccctgggatgacttAAAAAAAAGGGGGGGtgctctcccgatttttgaatatgtaggatcattcgccagggtccga

gctgagaattggatgAAAAAAAAGGGGGGGtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga

tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatAAAAAAAAGGGGGGGcttatag

gtcaatcatgttcttgtgaatggatttAAAAAAAAGGGGGGGgaccgcttggcgcacccaaattcagtgtgggcgagcgcaa

cggttttggcccttgttagaggcccccgtAAAAAAAAGGGGGGGcaattatgagagagctaatctatcgcgtgcgtgttcat

aacttgagttAAAAAAAAGGGGGGGctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta

ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatAAAAAAAAGGGGGGGaccgaaagggaag

ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttAAAAAAAAGGGGGGGa

Page 5: Motif Finding Kun-Pin Wu ( 巫坤品 ) Institute of Biomedical Informatics National Yang Ming University 2007/11/20

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Where is the Implanted Motif?

atgaccgggatactgataaaaaaaagggggggggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg

acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaataaaaaaaaaggggggga

tgagtatccctgggatgacttaaaaaaaagggggggtgctctcccgatttttgaatatgtaggatcattcgccagggtccga

gctgagaattggatgaaaaaaaagggggggtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga

tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaataaaaaaaagggggggcttatag

gtcaatcatgttcttgtgaatggatttaaaaaaaaggggggggaccgcttggcgcacccaaattcagtgtgggcgagcgcaa

cggttttggcccttgttagaggcccccgtaaaaaaaagggggggcaattatgagagagctaatctatcgcgtgcgtgttcat

aacttgagttaaaaaaaagggggggctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta

ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcataaaaaaaagggggggaccgaaagggaag

ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttaaaaaaaaggggggga

Page 6: Motif Finding Kun-Pin Wu ( 巫坤品 ) Institute of Biomedical Informatics National Yang Ming University 2007/11/20

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Implanting Motif AAAAAAGGGGGGG with Four Mutations

atgaccgggatactgatAgAAgAAAGGttGGGggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg

acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatacAAtAAAAcGGcGGGa

tgagtatccctgggatgacttAAAAtAAtGGaGtGGtgctctcccgatttttgaatatgtaggatcattcgccagggtccga

gctgagaattggatgcAAAAAAAGGGattGtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga

tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatAtAAtAAAGGaaGGGcttatag

gtcaatcatgttcttgtgaatggatttAAcAAtAAGGGctGGgaccgcttggcgcacccaaattcagtgtgggcgagcgcaa

cggttttggcccttgttagaggcccccgtAtAAAcAAGGaGGGccaattatgagagagctaatctatcgcgtgcgtgttcat

aacttgagttAAAAAAtAGGGaGccctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta

ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatActAAAAAGGaGcGGaccgaaagggaag

ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttActAAAAAGGaGcGGa

Page 7: Motif Finding Kun-Pin Wu ( 巫坤品 ) Institute of Biomedical Informatics National Yang Ming University 2007/11/20

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Where is the Motif???

atgaccgggatactgatagaagaaaggttgggggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg

acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatacaataaaacggcggga

tgagtatccctgggatgacttaaaataatggagtggtgctctcccgatttttgaatatgtaggatcattcgccagggtccga

gctgagaattggatgcaaaaaaagggattgtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga

tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatataataaaggaagggcttatag

gtcaatcatgttcttgtgaatggatttaacaataagggctgggaccgcttggcgcacccaaattcagtgtgggcgagcgcaa

cggttttggcccttgttagaggcccccgtataaacaaggagggccaattatgagagagctaatctatcgcgtgcgtgttcat

aacttgagttaaaaaatagggagccctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta

ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatactaaaaaggagcggaccgaaagggaag

ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttactaaaaaggagcgga

Page 8: Motif Finding Kun-Pin Wu ( 巫坤品 ) Institute of Biomedical Informatics National Yang Ming University 2007/11/20

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Why Finding (15,4) Motif is Difficult?

atgaccgggatactgatAgAAgAAAGGttGGGggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg

acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatacAAtAAAAcGGcGGGa

tgagtatccctgggatgacttAAAAtAAtGGaGtGGtgctctcccgatttttgaatatgtaggatcattcgccagggtccga

gctgagaattggatgcAAAAAAAGGGattGtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga

tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatAtAAtAAAGGaaGGGcttatag

gtcaatcatgttcttgtgaatggatttAAcAAtAAGGGctGGgaccgcttggcgcacccaaattcagtgtgggcgagcgcaa

cggttttggcccttgttagaggcccccgtAtAAAcAAGGaGGGccaattatgagagagctaatctatcgcgtgcgtgttcat

aacttgagttAAAAAAtAGGGaGccctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta

ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatActAAAAAGGaGcGGaccgaaagggaag

ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttActAAAAAGGaGcGGa

AgAAgAAAGGttGGG

cAAtAAAAcGGcGGG..|..|||.|..|||

Page 9: Motif Finding Kun-Pin Wu ( 巫坤品 ) Institute of Biomedical Informatics National Yang Ming University 2007/11/20

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Challenge Problem

• Find a motif in a sample of

- 20 “random” sequences (e.g. 600 nt long)

- each sequence containing an implanted

pattern of length 15,

- each pattern appearing with 4 mismatches

as (15,4)-motif.

Page 10: Motif Finding Kun-Pin Wu ( 巫坤品 ) Institute of Biomedical Informatics National Yang Ming University 2007/11/20

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Combinatorial Gene Regulation• A microarray experiment showed that when

gene X is knocked out, 20 other genes are not expressed

• How can one gene have such drastic effects?

Page 11: Motif Finding Kun-Pin Wu ( 巫坤品 ) Institute of Biomedical Informatics National Yang Ming University 2007/11/20

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Regulatory Proteins• Gene X encodes regulatory protein, a.k.a. a

transcription factor (TF)

• The 20 unexpressed genes rely on gene X’s TF to induce transcription

• A single TF may regulate multiple genes

Page 12: Motif Finding Kun-Pin Wu ( 巫坤品 ) Institute of Biomedical Informatics National Yang Ming University 2007/11/20

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Regulatory Regions• Every gene contains a regulatory region (RR) typically st

retching 100-1000 bp upstream of the transcriptional start site

• Located within the RR are the Transcription Factor Binding Sites (TFBS), also known as motifs, specific for a given transcription factor

• TFs influence gene expression by binding to a specific location in the respective gene’s regulatory region - TFBS

Page 13: Motif Finding Kun-Pin Wu ( 巫坤品 ) Institute of Biomedical Informatics National Yang Ming University 2007/11/20

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Transcription Factor Binding Sites

• A TFBS can be located anywhere within the

Regulatory Region

• TFBS may vary slightly across different regulatory regions since non-essential bases could mutate

Page 14: Motif Finding Kun-Pin Wu ( 巫坤品 ) Institute of Biomedical Informatics National Yang Ming University 2007/11/20

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Motifs and Transcriptional Start Sites

geneATCCCG

geneTTCCGG

geneATCCCG

geneATGCCG

geneATGCCC

Page 15: Motif Finding Kun-Pin Wu ( 巫坤品 ) Institute of Biomedical Informatics National Yang Ming University 2007/11/20

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Transcription Factors and Motifs

Page 16: Motif Finding Kun-Pin Wu ( 巫坤品 ) Institute of Biomedical Informatics National Yang Ming University 2007/11/20

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Motif Logo

• Motifs can mutate on non important bases

• The five motifs in five different genes have mutations in position 3 and 5

• Representations called motif logos illustrate the conserved and variable regions of a motif

TGGGGGATGAGAGATGGGGGATGAGAGATGAGGGA

Page 17: Motif Finding Kun-Pin Wu ( 巫坤品 ) Institute of Biomedical Informatics National Yang Ming University 2007/11/20

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Motif Logos: An Example

(http://www-lmmb.ncifcrf.gov/~toms/sequencelogo.html)

Page 18: Motif Finding Kun-Pin Wu ( 巫坤品 ) Institute of Biomedical Informatics National Yang Ming University 2007/11/20

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Identifying Motifs• Genes are turned on or off by regulatory proteins

• These proteins bind to upstream regulatory regions of genes to either attract or block an RNA polymerase

• Regulatory protein (TF) binds to a short DNA sequence called a motif (TFBS)

• So finding the same motif in multiple genes’ regulatory regions suggests a regulatory relationship amongst those genes

Page 19: Motif Finding Kun-Pin Wu ( 巫坤品 ) Institute of Biomedical Informatics National Yang Ming University 2007/11/20

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Identifying Motifs: Complications

• We do not know the motif sequence

• We do not know where it is located relative to the genes start

• Motifs can differ slightly from one gene to the next

• How to discern it from “random” motifs?

Page 20: Motif Finding Kun-Pin Wu ( 巫坤品 ) Institute of Biomedical Informatics National Yang Ming University 2007/11/20

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

The Motif Finding Problem• Given a random sample of DNA sequences:

cctgatagacgctatctggctatccacgtacgtaggtcctctgtgcgaatctatgcgtttccaaccat

agtactggtgtacatttgatacgtacgtacaccggcaacctgaaacaaacgctcagaaccagaagtgc

aaacgtacgtgcaccctctttcttcgtggctctggccaacgagggctgatgtataagacgaaaatttt

agcctccgatgtaagtcatagctgtaactattacctgccacccctattacatcttacgtacgtataca

ctgttatacaacgcgtcatggcggggtatgcgttttggtcgtcgtacgctcgatcgttaacgtacgtc

• Find the pattern that is implanted in each of the individual sequences, namely, the motif

Page 21: Motif Finding Kun-Pin Wu ( 巫坤品 ) Institute of Biomedical Informatics National Yang Ming University 2007/11/20

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

The Motif Finding Problem (cont’d)• Additional information:

• The hidden sequence is of length 8

• The pattern is not exactly the same in each array because random point mutations may occur in the sequences

Page 22: Motif Finding Kun-Pin Wu ( 巫坤品 ) Institute of Biomedical Informatics National Yang Ming University 2007/11/20

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

The Motif Finding Problem (cont’d)• The patterns revealed with no mutations:cctgatagacgctatctggctatccacgtacgtaggtcctctgtgcgaatctatgcgtttccaaccat

agtactggtgtacatttgatacgtacgtacaccggcaacctgaaacaaacgctcagaaccagaagtgc

aaacgtacgtgcaccctctttcttcgtggctctggccaacgagggctgatgtataagacgaaaatttt

agcctccgatgtaagtcatagctgtaactattacctgccacccctattacatcttacgtacgtataca

ctgttatacaacgcgtcatggcggggtatgcgttttggtcgtcgtacgctcgatcgttaacgtacgtc

acgtacgtConsensus String

Page 23: Motif Finding Kun-Pin Wu ( 巫坤品 ) Institute of Biomedical Informatics National Yang Ming University 2007/11/20

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

The Motif Finding Problem (cont’d)• The patterns with 2 point mutations:

cctgatagacgctatctggctatccaGgtacTtaggtcctctgtgcgaatctatgcgtttccaaccat

agtactggtgtacatttgatCcAtacgtacaccggcaacctgaaacaaacgctcagaaccagaagtgc

aaacgtTAgtgcaccctctttcttcgtggctctggccaacgagggctgatgtataagacgaaaatttt

agcctccgatgtaagtcatagctgtaactattacctgccacccctattacatcttacgtCcAtataca

ctgttatacaacgcgtcatggcggggtatgcgttttggtcgtcgtacgctcgatcgttaCcgtacgGc

Page 24: Motif Finding Kun-Pin Wu ( 巫坤品 ) Institute of Biomedical Informatics National Yang Ming University 2007/11/20

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

The Motif Finding Problem (cont’d)• The patterns with 2 point mutations:

cctgatagacgctatctggctatccaGgtacTtaggtcctctgtgcgaatctatgcgtttccaaccat

agtactggtgtacatttgatCcAtacgtacaccggcaacctgaaacaaacgctcagaaccagaagtgc

aaacgtTAgtgcaccctctttcttcgtggctctggccaacgagggctgatgtataagacgaaaatttt

agcctccgatgtaagtcatagctgtaactattacctgccacccctattacatcttacgtCcAtataca

ctgttatacaacgcgtcatggcggggtatgcgttttggtcgtcgtacgctcgatcgttaCcgtacgGc

Can we still find the motif, now that we have 2 mutations?

Page 25: Motif Finding Kun-Pin Wu ( 巫坤品 ) Institute of Biomedical Informatics National Yang Ming University 2007/11/20

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Defining Motifs

• To define a motif, lets say we know where the motif starts in the sequence

• The motif start positions in their sequences can be represented as s = (s1,s2,s3,…,st)

Page 26: Motif Finding Kun-Pin Wu ( 巫坤品 ) Institute of Biomedical Informatics National Yang Ming University 2007/11/20

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Motifs: Profiles and Consensus a G g t a c T t C c A t a c g tAlignment a c g t T A g t a c g t C c A t C c g t a c g G

_________________

A 3 0 1 0 3 1 1 0Profile C 2 4 0 0 1 4 0 0 G 0 1 4 0 0 0 3 1 T 0 0 0 5 1 0 1 4

_________________

Consensus A C G T A C G T

• Line up the patterns by their start indexes

s = (s1, s2, …, st)

• Construct matrix profile with frequencies of each nucleotide in columns

• Consensus nucleotide in each position has the highest score in column

Page 27: Motif Finding Kun-Pin Wu ( 巫坤品 ) Institute of Biomedical Informatics National Yang Ming University 2007/11/20

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Consensus

• Think of consensus as an “ancestor” motif, from which mutated motifs emerged

• The distance between a real motif and the consensus sequence is generally less than that for two real motifs

Page 28: Motif Finding Kun-Pin Wu ( 巫坤品 ) Institute of Biomedical Informatics National Yang Ming University 2007/11/20

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Consensus (cont’d)

Page 29: Motif Finding Kun-Pin Wu ( 巫坤品 ) Institute of Biomedical Informatics National Yang Ming University 2007/11/20

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Evaluating Motifs

• We have a guess about the consensus sequence, but how “good” is this consensus?

• Need to introduce a scoring function to compare different guesses and choose the “best” one.

Page 30: Motif Finding Kun-Pin Wu ( 巫坤品 ) Institute of Biomedical Informatics National Yang Ming University 2007/11/20

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Defining Some Terms

• t - number of sample DNA sequences• n - length of each DNA sequence• DNA - sample of DNA sequences (t x n array)

• l - length of the motif (l-mer)• si - starting position of an l-mer in sequence i

• s=(s1, s2,… st) - array of motif’s starting positions

Page 31: Motif Finding Kun-Pin Wu ( 巫坤品 ) Institute of Biomedical Informatics National Yang Ming University 2007/11/20

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Parameters

cctgatagacgctatctggctatccaGgtacTtaggtcctctgtgcgaatctatgcgtttccaaccat

agtactggtgtacatttgatCcAtacgtacaccggcaacctgaaacaaacgctcagaaccagaagtgc

aaacgtTAgtgcaccctctttcttcgtggctctggccaacgagggctgatgtataagacgaaaatttt

agcctccgatgtaagtcatagctgtaactattacctgccacccctattacatcttacgtCcAtataca

ctgttatacaacgcgtcatggcggggtatgcgttttggtcgtcgtacgctcgatcgttaCcgtacgGc

l = 8

t=5

s1 = 26 s2 = 21 s3= 3 s4 = 56 s5 = 60 s

DNA

n = 69

Page 32: Motif Finding Kun-Pin Wu ( 巫坤品 ) Institute of Biomedical Informatics National Yang Ming University 2007/11/20

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Scoring Motifs

• Given s = (s1, … st) and DNA:

Score(s,DNA) =

a G g t a c T t C c A t a c g t a c g t T A g t a c g t C c A t C c g t a c g G _________________ A 3 0 1 0 3 1 1 0 C 2 4 0 0 1 4 0 0 G 0 1 4 0 0 0 3 1 T 0 0 0 5 1 0 1 4 _________________

Consensus a c g t a c g t

Score 3+4+4+5+3+4+3+4=30

l

t

l

i GCTAk

ikcount1 },,,{

),(max

Page 33: Motif Finding Kun-Pin Wu ( 巫坤品 ) Institute of Biomedical Informatics National Yang Ming University 2007/11/20

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

The Motif Finding Problem

• If starting positions s=(s1, s2,… st) are given, finding consensus is easy even with mutations in the sequences because we can simply construct the profile to find the motif (consensus)

• But… the starting positions s are usually not given. How can we find the “best” profile matrix?

Page 34: Motif Finding Kun-Pin Wu ( 巫坤品 ) Institute of Biomedical Informatics National Yang Ming University 2007/11/20

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

The Motif Finding Problem: Formulation

• Goal: Given a set of DNA sequences, find a set of l-mers, one from each sequence, that maximizes the consensus score

• Input: A t x n matrix of DNA, and l, the length of the pattern to find

• Output: An array of t starting positions s = (s1, s2, … st) maximizing Score(s,DNA)

Page 35: Motif Finding Kun-Pin Wu ( 巫坤品 ) Institute of Biomedical Informatics National Yang Ming University 2007/11/20

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

The Motif Finding Problem: Brute Force Solution

• Compute the scores for each possible combination of starting positions s

• The best score will determine the best profile and the consensus pattern in DNA

• The goal is to maximize Score(s,DNA) by varying the starting positions si, where:

si = [1, …, n-l+1]i = [1, …, t]

Page 36: Motif Finding Kun-Pin Wu ( 巫坤品 ) Institute of Biomedical Informatics National Yang Ming University 2007/11/20

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Running Time of BruteForceMotifSearch• Varying (n - l + 1) positions in each of t sequen

ces, we’re looking at (n - l + 1)t sets of starting positions

• For each set of starting positions, the scoring function makes l operations, so complexity is l (n – l + 1)t = O(l nt)

• That means that for t = 8, n = 1000, l = 10 we must perform approximately 1020 computations – it will take billions years

Page 37: Motif Finding Kun-Pin Wu ( 巫坤品 ) Institute of Biomedical Informatics National Yang Ming University 2007/11/20

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

The Median String Problem

• Given a set of t DNA sequences find a pattern that appears in all t sequences with the minimum number of mutations

• This pattern will be the motif

Page 38: Motif Finding Kun-Pin Wu ( 巫坤品 ) Institute of Biomedical Informatics National Yang Ming University 2007/11/20

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Hamming Distance

• Hamming distance: • dH(v,w) is the number of nucleotide pairs th

at do not match when v and w are aligned. For example:

dH(AAAAAA,ACAAAC) = 2

Page 39: Motif Finding Kun-Pin Wu ( 巫坤品 ) Institute of Biomedical Informatics National Yang Ming University 2007/11/20

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Total Distance: Example

• Given v = “acgtacgt” and s acgtacgt

cctgatagacgctatctggctatccacgtacAtaggtcctctgtgcgaatctatgcgtttccaaccat acgtacgtagtactggtgtacatttgatacgtacgtacaccggcaacctgaaacaaacgctcagaaccagaagtgc acgtacgtaaaAgtCcgtgcaccctctttcttcgtggctctggccaacgagggctgatgtataagacgaaaatttt acgtacgtagcctccgatgtaagtcatagctgtaactattacctgccacccctattacatcttacgtacgtataca acgtacgtctgttatacaacgcgtcatggcggggtatgcgttttggtcgtcgtacgctcgatcgttaacgtaGgtc

v is the sequence in red, x is the sequence in blue

• TotalDistance(v,DNA) = 1+0+2+0+1 = 4

dH(v, x) = 2

dH(v, x) = 1

dH(v, x) = 0

dH(v, x) = 0

dH(v, x) = 1

Page 40: Motif Finding Kun-Pin Wu ( 巫坤品 ) Institute of Biomedical Informatics National Yang Ming University 2007/11/20

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Total Distance: Definition

• For each DNA sequence i, compute all dH(v, x), where x is an l-mer with starting position si

(1 < si < n – l + 1)• Find minimum of dH(v, x) among all l-mers in sequen

ce i• TotalDistance(v,DNA) is the sum of the minimum Ha

mming distances for each DNA sequence i• TotalDistance(v,DNA) = mins dH(v, s), where s is the

set of starting positions s1, s2,… st

Page 41: Motif Finding Kun-Pin Wu ( 巫坤品 ) Institute of Biomedical Informatics National Yang Ming University 2007/11/20

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

The Median String Problem: Formulation• Goal: Given a set of DNA sequences, find a

median string• Input: A t x n matrix DNA, and l, the length of

the pattern to find• Output: A string v of l nucleotides that minimi

zes TotalDistance(v,DNA) over all strings of that length

Page 42: Motif Finding Kun-Pin Wu ( 巫坤品 ) Institute of Biomedical Informatics National Yang Ming University 2007/11/20

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Motif Finding Problem == Median String Problem

• The Motif Finding is a maximization problem while Median String is a minimization problem

• However, the Motif Finding problem and Median String problem are computationally equivalent

• Need to show that minimizing TotalDistance is equivalent to maximizing Score

Page 43: Motif Finding Kun-Pin Wu ( 巫坤品 ) Institute of Biomedical Informatics National Yang Ming University 2007/11/20

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

We are looking for the same thing

a G g t a c T t C c A t a c g tAlignment a c g t T A g t a c g t C c A t C c g t a c g G _________________ A 3 0 1 0 3 1 1 0Profile C 2 4 0 0 1 4 0 0 G 0 1 4 0 0 0 3 1 T 0 0 0 5 1 0 1 4 _________________

Consensus a c g t a c g t

Score 3+4+4+5+3+4+3+4

TotalDistance 2+1+1+0+2+1+2+1

Sum 5 5 5 5 5 5 5 5

• At any column iScorei + TotalDistancei = t

• Because there are l columns Score + TotalDistance = l * t

• Rearranging:Score = l * t - TotalDistance

• l * t is constant the minimization of the right side is equivalent to the maximization of the left side

l

t

Page 44: Motif Finding Kun-Pin Wu ( 巫坤品 ) Institute of Biomedical Informatics National Yang Ming University 2007/11/20

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Motif Finding Problem vs. Median String Problem• Why bother reformulating the Motif Finding

problem into the Median String problem?

• The Motif Finding Problem needs to examine all the combinations for s. That is (n - l + 1)t combinations!!!

• The Median String Problem needs to examine all 4l combinations for v. This number is relatively smaller

Page 45: Motif Finding Kun-Pin Wu ( 巫坤品 ) Institute of Biomedical Informatics National Yang Ming University 2007/11/20

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Motif Finding: Improving the Running Time

1. BruteForceMotifSearch(DNA, t, n, l)2. bestScore 03. for each s=(s1,s2 , . . ., st) from (1,1 . . . 1) to (n-l+1, . . ., n-l+1)4. if (Score(s,DNA) > bestScore)5. bestScore Score(s, DNA)6. bestMotif (s1,s2 , . . . , st) 7. return bestMotif

Page 46: Motif Finding Kun-Pin Wu ( 巫坤品 ) Institute of Biomedical Informatics National Yang Ming University 2007/11/20

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Structuring the Search

• How can we perform the line

for each s=(s1,s2 , . . ., st) from (1,1 . . . 1) to (n-l+1, . . ., n-l+1) ?

• We need a method for efficiently structuring and navigating the many possible motifs

• This is not very different than exploring all t-digit numbers

Page 47: Motif Finding Kun-Pin Wu ( 巫坤品 ) Institute of Biomedical Informatics National Yang Ming University 2007/11/20

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Median String: Improving the Running Time1. MedianStringSearch (DNA, t, n, l)2. bestWord AAA…A

3. bestDistance ∞

4. for each l-mer s from AAA…A to TTT…T if TotalDistance(s,DNA) < bestDistance

5. bestDistanceTotalDistance(s,DNA)

6. bestWord s

7. return bestWord

Page 48: Motif Finding Kun-Pin Wu ( 巫坤品 ) Institute of Biomedical Informatics National Yang Ming University 2007/11/20

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Structuring the Search

• For the Median String Problem we need to consider all 4l possible l-mers:

aa… aaaa… acaa… agaa… at

.

.tt… tt

How to organize this search?

l

Page 49: Motif Finding Kun-Pin Wu ( 巫坤品 ) Institute of Biomedical Informatics National Yang Ming University 2007/11/20

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Alternative Representation of the Search Space

• Let A = 1, C = 2, G = 3, T = 4• Then the sequences from AA…A to TT…T become:

11…1111…1211…1311…14

.

.44…44

• Notice that the sequences above simply list all numbers as if we were counting on base 4 without using 0 as a digit

l

Page 50: Motif Finding Kun-Pin Wu ( 巫坤品 ) Institute of Biomedical Informatics National Yang Ming University 2007/11/20

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Search Tree

a- c- g- t-

aa ac ag at ca cc cg ct ga gc gg gt ta tc tg tt

--

root

Page 51: Motif Finding Kun-Pin Wu ( 巫坤品 ) Institute of Biomedical Informatics National Yang Ming University 2007/11/20

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Three Operations

• NextLeaf• NextVertex• ByPass

Page 52: Motif Finding Kun-Pin Wu ( 巫坤品 ) Institute of Biomedical Informatics National Yang Ming University 2007/11/20

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

NextLeaf: Example• Moving to the next leaf:

1- 2- 3- 4-

11 12 13 14 21 22 23 24 31 32 33 34 41 42 43 44

--Next Location

Page 53: Motif Finding Kun-Pin Wu ( 巫坤品 ) Institute of Biomedical Informatics National Yang Ming University 2007/11/20

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Depth First Search

• So we can search leaves

• How about searching all vertices of the tree?

• We can do this with a depth first search• Preorder traversal

Page 54: Motif Finding Kun-Pin Wu ( 巫坤品 ) Institute of Biomedical Informatics National Yang Ming University 2007/11/20

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

NextVertex: Example• Moving to the next vertices:

1- 2- 3- 4-

11 12 13 14 21 22 23 24 31 32 33 34 41 42 43 44

--

Location after 5 next vertex moves

Page 55: Motif Finding Kun-Pin Wu ( 巫坤品 ) Institute of Biomedical Informatics National Yang Ming University 2007/11/20

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

ByPass: Example• Bypassing the descendants of “2-”:

1- 2- 3- 4-

11 12 13 14 21 22 23 24 31 32 33 34 41 42 43 44

--Next Location

Page 56: Motif Finding Kun-Pin Wu ( 巫坤品 ) Institute of Biomedical Informatics National Yang Ming University 2007/11/20

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Revisiting Brute Force Search

• Now that we have method for navigating the tree, lets look again at BruteForceMotifSearch• NextLeaf• NextVertex• ByPass

Page 57: Motif Finding Kun-Pin Wu ( 巫坤品 ) Institute of Biomedical Informatics National Yang Ming University 2007/11/20

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Can We Do Better?

• Sets of s=(s1, s2, …,st) may have a weak profile for the first i positions (s1, s2, …,si)

• Every row of alignment may add at most l to Score• Optimism: if all subsequent (t-i) positions (si+1, …st) a

dd (t – i ) * l to Score(s,i,DNA)

• If Score(s,i,DNA) + (t – i ) * l < BestScore, it makes no sense to search in vertices of the current subtree• Use ByPass()

Page 58: Motif Finding Kun-Pin Wu ( 巫坤品 ) Institute of Biomedical Informatics National Yang Ming University 2007/11/20

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Branch and Bound Algorithm for Motif Search

• Since each level of the tree goes deeper into search, discarding a prefix discards all following branches

• This saves us from looking at (n – l + 1)t-i leaves• Use NextVertex() and ByPas

s() to navigate the tree

Page 59: Motif Finding Kun-Pin Wu ( 巫坤品 ) Institute of Biomedical Informatics National Yang Ming University 2007/11/20

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Branch and Bound Applied to Median String Search• Note that if the total distance for a prefix is gr

eater than that for the best word so far:

TotalDistance (prefix, DNA) > BestDistance

there is no use exploring the remaining part of the word

• We can eliminate that branch and BYPASS exploring that branch further

Page 60: Motif Finding Kun-Pin Wu ( 巫坤品 ) Institute of Biomedical Informatics National Yang Ming University 2007/11/20

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

More on the Motif Problem

• Exhaustive Search and Median String are both exact algorithms

• They always find the optimal solution, though they may be too slow to perform practical tasks

• Many algorithms sacrifice optimal solution for speed

Page 61: Motif Finding Kun-Pin Wu ( 巫坤品 ) Institute of Biomedical Informatics National Yang Ming University 2007/11/20

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

CONSENSUS: Greedy Motif Search• Find two closest l-mers in sequences 1 and 2 and forms 2 x l alignment matrix with Score(s,2,DNA)• At each of the following t-2 iterations CONSENSUS finds a “best”

l-mer in sequence i from the perspective of the already constructed (i-1) x l alignment matrix for the first (i-1) sequences

• In other words, it finds an l-mer in sequence i maximizing Score(s,i,DNA)

under the assumption that the first (i-1) l-mers have been already chosen

• CONSENSUS sacrifices optimal solution for speed: in fact the bulk of the time is actually spent locating the first 2 l-mers

Page 62: Motif Finding Kun-Pin Wu ( 巫坤品 ) Institute of Biomedical Informatics National Yang Ming University 2007/11/20

www.bioalgorithms.infoAn Introduction to Bioinformatics Algorithms

Randomized Motif Finding

Page 63: Motif Finding Kun-Pin Wu ( 巫坤品 ) Institute of Biomedical Informatics National Yang Ming University 2007/11/20

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Randomized Algorithms

• Randomized algorithms make random rather than deterministic decisions.

• The main advantage is that no input can reliably produce worst-case results because the algorithm runs differently each time.

• These algorithms are commonly used in situations where no exact and fast algorithm is known.

Page 64: Motif Finding Kun-Pin Wu ( 巫坤品 ) Institute of Biomedical Informatics National Yang Ming University 2007/11/20

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

A New Motif Finding Approach

• Motif Finding Problem: Given a list of t sequences each of length n, find the “best” pattern of length l that appears in each of the t sequences.

• Previously: we solved the Motif Finding Problem using a Branch and Bound or a Greedy technique.

• Now: randomly select possible locations and find a way to greedily change those locations until we have converged to the hidden motif.

Page 65: Motif Finding Kun-Pin Wu ( 巫坤品 ) Institute of Biomedical Informatics National Yang Ming University 2007/11/20

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Profiles Revisited

• Let s=(s1,...,st) be the set of starting positions for l-mers in our t sequences.

• The substrings corresponding to these starting positions will form:

- t x l alignment matrix and - 4 x l profile matrix* P.

*We make a special note that the profile matrix will be defined in terms of the frequency of letters, and not as the count of letters.

Page 66: Motif Finding Kun-Pin Wu ( 巫坤品 ) Institute of Biomedical Informatics National Yang Ming University 2007/11/20

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

• Prob(a|P) is defined as the probability that an l-mer a was created by the Profile P.

• If a is very similar to the consensus string of P then Prob(a|P) will be high

• If a is very different, then Prob(a|P) will be low.

n

Prob(a|P) =Π pai , i

i=1

Scoring Strings with a Profile

Page 67: Motif Finding Kun-Pin Wu ( 巫坤品 ) Institute of Biomedical Informatics National Yang Ming University 2007/11/20

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Scoring Strings with a Profile (cont’d)Given a profile: P =

A 1/2 7/8 3/8 0 1/8 0

C 1/8 0 1/2 5/8 3/8 0

T 1/8 1/8 0 0 1/4 7/8

G 1/4 0 1/8 3/8 1/4 1/8

Prob(aaacct|P) = ??? The probability of the consensus string:

Page 68: Motif Finding Kun-Pin Wu ( 巫坤品 ) Institute of Biomedical Informatics National Yang Ming University 2007/11/20

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Scoring Strings with a Profile (cont’d)Given a profile: P =

A 1/2 7/8 3/8 0 1/8 0

C 1/8 0 1/2 5/8 3/8 0

T 1/8 1/8 0 0 1/4 7/8

G 1/4 0 1/8 3/8 1/4 1/8

Prob(aaacct|P) = 1/2 x 7/8 x 3/8 x 5/8 x 3/8 x 7/8 = .033646The probability of the consensus string:

Page 69: Motif Finding Kun-Pin Wu ( 巫坤品 ) Institute of Biomedical Informatics National Yang Ming University 2007/11/20

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Scoring Strings with a Profile (cont’d)Given a profile: P =

A 1/2 7/8 3/8 0 1/8 0

C 1/8 0 1/2 5/8 3/8 0

T 1/8 1/8 0 0 1/4 7/8

G 1/4 0 1/8 3/8 1/4 1/8

Prob(atacag|P) = 1/2 x 1/8 x 3/8 x 5/8 x 1/8 x 1/8 = .001602

Prob(aaacct|P) = 1/2 x 7/8 x 3/8 x 5/8 x 3/8 x 7/8 = .033646The probability of the consensus string:

Probability of a different string:

Page 70: Motif Finding Kun-Pin Wu ( 巫坤品 ) Institute of Biomedical Informatics National Yang Ming University 2007/11/20

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

P-Most Probable l-mer• Define the P-most probable l-mer from a sequence

as an l-mer in that sequence which has the highest probability of being created from the profile P.

A 1/2 7/8 3/8 0 1/8 0

C 1/8 0 1/2 5/8 3/8 0

T 1/8 1/8 0 0 1/4 7/8

G 1/4 0 1/8 3/8 1/4 1/8

P =

Given a sequence = ctataaaccttacatc, find the P-most probable l-mer

Page 71: Motif Finding Kun-Pin Wu ( 巫坤品 ) Institute of Biomedical Informatics National Yang Ming University 2007/11/20

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Third try: c t a t a a a c c t t a c a t c

Second try: c t a t a a a c c t t a c a t c

First try: c t a t a a a c c t t a c a t c

P-Most Probable l-mer (cont’d)

A 1/2 7/8 3/8 0 1/8 0

C 1/8 0 1/2 5/8 3/8 0

T 1/8 1/8 0 0 1/4 7/8

G 1/4 0 1/8 3/8 1/4 1/8

Find the Prob(a|P) of every possible 6-mer:

-Continue this process to evaluate every possible 6-mer

Page 72: Motif Finding Kun-Pin Wu ( 巫坤品 ) Institute of Biomedical Informatics National Yang Ming University 2007/11/20

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

P-Most Probable l-mer (cont’d)

String, Highlighted in Red Calculations prob(a|P)

ctataaaccttacat 1/8 x 1/8 x 3/8 x 0 x 1/8 x 0 0

ctataaaccttacat 1/2 x 7/8 x 0 x 0 x 1/8 x 0 0

ctataaaccttacat 1/2 x 1/8 x 3/8 x 0 x 1/8 x 0 0

ctataaaccttacat 1/8 x 7/8 x 3/8 x 0 x 3/8 x 0 0

ctataaaccttacat 1/2 x 7/8 x 3/8 x 5/8 x 3/8 x 7/8 .0336

ctataaaccttacat 1/2 x 7/8 x 1/2 x 5/8 x 1/4 x 7/8 .0299

ctataaaccttacat 1/2 x 0 x 1/2 x 0 1/4 x 0 0

ctataaaccttacat 1/8 x 0 x 0 x 0 x 0 x 1/8 x 0 0

ctataaaccttacat 1/8 x 1/8 x 0 x 0 x 3/8 x 0 0

ctataaaccttacat 1/8 x 1/8 x 3/8 x 5/8 x 1/8 x 7/8 .0004

Compute prob(a|P) for every possible 6-mer:

Page 73: Motif Finding Kun-Pin Wu ( 巫坤品 ) Institute of Biomedical Informatics National Yang Ming University 2007/11/20

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

P-Most Probable l-mer (cont’d)

String, Highlighted in Red Calculations Prob(a|P)

ctataaaccttacat 1/8 x 1/8 x 3/8 x 0 x 1/8 x 0 0

ctataaaccttacat 1/2 x 7/8 x 0 x 0 x 1/8 x 0 0

ctataaaccttacat 1/2 x 1/8 x 3/8 x 0 x 1/8 x 0 0

ctataaaccttacat 1/8 x 7/8 x 3/8 x 0 x 3/8 x 0 0

ctataaaccttacat 1/2 x 7/8 x 3/8 x 5/8 x 3/8 x 7/8 .0336

ctataaaccttacat 1/2 x 7/8 x 1/2 x 5/8 x 1/4 x 7/8 .0299

ctataaaccttacat 1/2 x 0 x 1/2 x 0 1/4 x 0 0

ctataaaccttacat 1/8 x 0 x 0 x 0 x 0 x 1/8 x 0 0

ctataaaccttacat 1/8 x 1/8 x 0 x 0 x 3/8 x 0 0

ctataaaccttacat 1/8 x 1/8 x 3/8 x 5/8 x 1/8 x 7/8 .0004

P-Most Probable 6-mer in the sequence is aaacct:

Page 74: Motif Finding Kun-Pin Wu ( 巫坤品 ) Institute of Biomedical Informatics National Yang Ming University 2007/11/20

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

P-Most Probable l-mer (cont’d)

ctataaaccttacatc

because Prob(aaacct|P) = .0336 is greater than the Prob(a|P) of any other 6-mer in the sequence.

aaacct is the P-most probable 6-mer in:

Page 75: Motif Finding Kun-Pin Wu ( 巫坤品 ) Institute of Biomedical Informatics National Yang Ming University 2007/11/20

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Dealing with Zeroes

• In our toy example prob(a|P)=0 in many cases. In practice, there will be enough sequences so that the number of elements in the profile with a frequency of zero is small.

• To avoid many entries with prob(a|P)=0, there exist techniques to equate zero to a very small number so that one zero does not make the entire probability of a string zero (we will not address these techniques here).

Page 76: Motif Finding Kun-Pin Wu ( 巫坤品 ) Institute of Biomedical Informatics National Yang Ming University 2007/11/20

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

P-Most Probable l-mers in Many Sequences• Find the P-most probab

le l-mer in each of the sequences.

ctataaacgttacatc

atagcgattcgactg

cagcccagaaccct

cggtataccttacatc

tgcattcaatagctta

tatcctttccactcac

ctccaaatcctttaca

ggtcatcctttatcct

A 1/2 7/8 3/8 0 1/8 0

C 1/8 0 1/2 5/8 3/8 0

T 1/8 1/8 0 0 1/4 7/8

G 1/4 0 1/8 3/8 1/4 1/8

P=

Page 77: Motif Finding Kun-Pin Wu ( 巫坤品 ) Institute of Biomedical Informatics National Yang Ming University 2007/11/20

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

P-Most Probable l-mers in Many Sequences (cont’d) ctataaacgttacatc

atagcgattcgactg

cagcccagaaccct

cggtgaaccttacatc

tgcattcaatagctta

tgtcctgtccactcac

ctccaaatcctttaca

ggtctacctttatcct

P-Most Probable l-mers form a new profile

1 a a a c g t

2 a t a g c g

3 a a c c c t

4 g a a c c t

5 a t a g c t

6 g a c c t g

7 a t c c t t

8 t a c c t t

A 5/8 5/8 4/8 0 0 0

C 0 0 4/8 6/8 4/8 0

T 1/8 3/8 0 0 3/8 6/8

G 2/8 0 0 2/8 1/8 2/8

Page 78: Motif Finding Kun-Pin Wu ( 巫坤品 ) Institute of Biomedical Informatics National Yang Ming University 2007/11/20

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Comparing New and Old Profiles

Red – frequency increased, Blue – frequency descreased

1 a a a c g t

2 a t a g c g

3 a a c c c t

4 g a a c c t

5 a t a g c t

6 g a c c t g

7 a t c c t t

8 t a c c t t

A 5/8 5/8 4/8 0 0 0

C 0 0 4/8 6/8 4/8 0

T 1/8 3/8 0 0 3/8 6/8

G 2/8 0 0 2/8 1/8 2/8

A 1/2 7/8 3/8 0 1/8 0

C 1/8 0 1/2 5/8 3/8 0

T 1/8 1/8 0 0 1/4 7/8

G 1/4 0 1/8 3/8 1/4 1/8

Page 79: Motif Finding Kun-Pin Wu ( 巫坤品 ) Institute of Biomedical Informatics National Yang Ming University 2007/11/20

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Greedy Profile Motif SearchUse P-Most probable l-mers to adjust start positions u

ntil we reach a “best” profile; this is the motif.

1) Select random starting positions.2) Create a profile P from the substrings at these start

ing positions.3) Find the P-most probable l-mer a in each sequence

and change the starting position to the starting position of a.

4) Compute a new profile based on the new starting positions after each iteration and proceed until we cannot increase the score anymore.

Page 80: Motif Finding Kun-Pin Wu ( 巫坤品 ) Institute of Biomedical Informatics National Yang Ming University 2007/11/20

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

GreedyProfileMotifSearch Analysis

• Since we choose starting positions randomly, there is little chance that our guess will be close to an optimal motif, meaning it will take a very long time to find the optimal motif.

• It is unlikely that the random starting positions will lead us to the correct solution at all.

• In practice, this algorithm is run many times with the hope that random starting positions will be close to the optimum solution simply by chance.

Page 81: Motif Finding Kun-Pin Wu ( 巫坤品 ) Institute of Biomedical Informatics National Yang Ming University 2007/11/20

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Gibbs Sampling

• GreedyProfileMotifSearch is probably not the best way to find motifs.

• However, we can improve the algorithm by introducing Gibbs Sampling, an iterative procedure that discards one l-mer after each iteration and replaces it with a new one.

• Gibbs Sampling proceeds more slowly and chooses new l-mers at random increasing the odds that it will converge to the correct solution.

Page 82: Motif Finding Kun-Pin Wu ( 巫坤品 ) Institute of Biomedical Informatics National Yang Ming University 2007/11/20

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

How Gibbs Sampling Works1) Randomly choose starting positions

s = (s1,...,st) and form the set of l-mers associated with these starting positions. 2) Randomly choose one of the t sequences.

3) Create a profile P from the other t -1 sequences.4) For each position in the removed sequence, calculate the probability that the l-mer starting at that position was generated by P.5) Choose a new starting position for the removed

sequence at random based on the probabilities calculated in step 4.

6) Repeat steps 2-5 until there is no improvement

Page 83: Motif Finding Kun-Pin Wu ( 巫坤品 ) Institute of Biomedical Informatics National Yang Ming University 2007/11/20

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Gibbs Sampling: an Example

Input:

t = 5 sequences, motif length l = 8

1. GTAAACAATATTTATAGC2. AAAATTTACCTCGCAAGG

3. CCGTACTGTCAAGCGTGG 4. TGAGTAAACGACGTCCCA

5. TACTTAACACCCTGTCAA

Page 84: Motif Finding Kun-Pin Wu ( 巫坤品 ) Institute of Biomedical Informatics National Yang Ming University 2007/11/20

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Gibbs Sampling: an Example

1) Randomly choose starting positions, s=(s1,s2,s3,s4,s5) in the 5 sequences:

s1=7 GTAAACAATATTTATAGC

s2=11 AAAATTTACCTTAGAAGG

s3=9 CCGTACTGTCAAGCGTGG

s4=4 TGAGTAAACGACGTCCCA

s5=1 TACTTAACACCCTGTCAA

Page 85: Motif Finding Kun-Pin Wu ( 巫坤品 ) Institute of Biomedical Informatics National Yang Ming University 2007/11/20

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Gibbs Sampling: an Example

2) Choose one of the sequences at random:Sequence 2: AAAATTTACCTTAGAAGG

s1=7 GTAAACAATATTTATAGC

s2=11 AAAATTTACCTTAGAAGG

s3=9 CCGTACTGTCAAGCGTGG

s4=4 TGAGTAAACGACGTCCCA

s5=1 TACTTAACACCCTGTCAA

Page 86: Motif Finding Kun-Pin Wu ( 巫坤品 ) Institute of Biomedical Informatics National Yang Ming University 2007/11/20

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Gibbs Sampling: an Example

2) Choose one of the sequences at random:Sequence 2: AAAATTTACCTTAGAAGG

s1=7 GTAAACAATATTTATAGC

s3=9 CCGTACTGTCAAGCGTGG

s4=4 TGAGTAAACGACGTCCCA

s5=1 TACTTAACACCCTGTCAA

Page 87: Motif Finding Kun-Pin Wu ( 巫坤品 ) Institute of Biomedical Informatics National Yang Ming University 2007/11/20

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Gibbs Sampling: an Example3) Create profile P from l-mers in remaining 4 seq

uences:1 A A T A T T T A

3 T C A A G C G T

4 G T A A A C G A

5 T A C T T A A C

A 1/4 2/4 2/4 3/4 1/4 1/4 1/4 2/4

C 0 1/4 1/4 0 0 2/4 0 1/4

T 2/4 1/4 1/4 1/4 2/4 1/4 1/4 1/4

G 1/4 0 0 0 1/4 0 3/4 0Consensus

StringT A A A T C G A

Page 88: Motif Finding Kun-Pin Wu ( 巫坤品 ) Institute of Biomedical Informatics National Yang Ming University 2007/11/20

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Gibbs Sampling: an Example4) Calculate the prob(a|P) for every possible 8-

mer in the removed sequence:

Strings Highlighted in Red prob(a|P)

AAAATTTACCTTAGAAGG .000732AAAATTTACCTTAGAAGG .000122AAAATTTACCTTAGAAGG 0AAAATTTACCTTAGAAGG 0AAAATTTACCTTAGAAGG 0AAAATTTACCTTAGAAGG 0AAAATTTACCTTAGAAGG 0AAAATTTACCTTAGAAGG .000183AAAATTTACCTTAGAAGG 0AAAATTTACCTTAGAAGG 0AAAATTTACCTTAGAAGG 0

Page 89: Motif Finding Kun-Pin Wu ( 巫坤品 ) Institute of Biomedical Informatics National Yang Ming University 2007/11/20

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Gibbs Sampling: an Example

5) Create a distribution of probabilities of l-mers prob(a|P), and randomly select a new starting position based on this distribution.

Starting Position 1: prob( AAAATTTA | P ) = .000732 / .000122 = 6

Starting Position 2: prob( AAATTTAC | P ) = .000122 / .000122 = 1

Starting Position 8: prob( ACCTTAGA | P ) = .000183 / .000122 = 1.5

a) To create this distribution, divide each probability prob(a|P) by the lowest probability:

Ratio = 6 : 1 : 1.5

Page 90: Motif Finding Kun-Pin Wu ( 巫坤品 ) Institute of Biomedical Informatics National Yang Ming University 2007/11/20

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Turning Ratios into Probabilities

Probability (Selecting Starting Position 1): 6/(6+1+1.5)= 0.706

Probability (Selecting Starting Position 2): 1/(6+1+1.5)= 0.118

Probability (Selecting Starting Position 8): 1.5/(6+1+1.5)=0.176

b) Define probabilities of starting positions according to computed ratios

Page 91: Motif Finding Kun-Pin Wu ( 巫坤品 ) Institute of Biomedical Informatics National Yang Ming University 2007/11/20

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Gibbs Sampling: an Example

c) Select the start position according to computed ratios:

P(selecting starting position 1): .706

P(selecting starting position 2): .118

P(selecting starting position 8): .176

Page 92: Motif Finding Kun-Pin Wu ( 巫坤品 ) Institute of Biomedical Informatics National Yang Ming University 2007/11/20

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Gibbs Sampling: an Example

Assume we select the substring with the highest probability – then we are left with the following new substrings and starting positions.

s1=7 GTAAACAATATTTATAGC

s2=1 AAAATTTACCTCGCAAGG

s3=9 CCGTACTGTCAAGCGTGG

s4=5 TGAGTAATCGACGTCCCA

s5=1 TACTTCACACCCTGTCAA

Page 93: Motif Finding Kun-Pin Wu ( 巫坤品 ) Institute of Biomedical Informatics National Yang Ming University 2007/11/20

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Gibbs Sampling: an Example

6) We iterate the procedure again with the above starting positions until we cannot improve the score any more.

Page 94: Motif Finding Kun-Pin Wu ( 巫坤品 ) Institute of Biomedical Informatics National Yang Ming University 2007/11/20

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Gibbs Sampler in Practice

• Gibbs sampling needs to be modified when applied to samples with unequal distributions of nucleotides (relative entropy approach).

• Gibbs sampling often converges to locally optimal motifs rather than globally optimal motifs.

• Needs to be run with many randomly chosen seeds to achieve good results.

Page 95: Motif Finding Kun-Pin Wu ( 巫坤品 ) Institute of Biomedical Informatics National Yang Ming University 2007/11/20

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Another Randomized Approach• Random Projection Algorithm is a different

way to solve the Motif Finding Problem.• Guiding principle: Some instances of a motif

agree on a subset of positions.• However, it is unclear how to find these “non-

mutated” positions.• To bypass the effect of mutations within a motif,

we randomly select a subset of positions in the pattern creating a projection of the pattern.

• Search for that projection in a hope that the selected positions are not affected by mutations in most instances of the motif.

Page 96: Motif Finding Kun-Pin Wu ( 巫坤品 ) Institute of Biomedical Informatics National Yang Ming University 2007/11/20

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Projections• Choose k positions in string of length l.• Concatenate nucleotides at chosen k

positions to form k-tuple.• This can be viewed as a projection of l-

dimensional space onto k-dimensional subspace.

ATGGCATTCAGATTC TGCTGAT

l = 15 k = 7 Projection

Projection = (2, 4, 5, 7, 11, 12, 13)

Page 97: Motif Finding Kun-Pin Wu ( 巫坤品 ) Institute of Biomedical Informatics National Yang Ming University 2007/11/20

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Random Projections Algorithm• Select k out of l

positions uniformly at random.

• For each l-tuple in input sequences, hash into bucket based on letters at k selected positions.

• Recover motif from enriched bucket that contain many l-tuples. Bucket TGCT

TGCACCT

Input sequence:…TCAATGCACCTAT...

Page 98: Motif Finding Kun-Pin Wu ( 巫坤品 ) Institute of Biomedical Informatics National Yang Ming University 2007/11/20

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Random Projections Algorithm (cont’d)

• Some projections will fail to detect motifs but if we try many of them the probability that one of the buckets fills in is increasing.

• In the example below, the bucket **GC*AC is “bad” while the bucket AT**G*C is “good”

ATGCGTC

...ccATCCGACca...

...ttATGAGGCtc...

...ctATAAGTCgc...

...tcATGTGACac... (7,2) motif

Page 99: Motif Finding Kun-Pin Wu ( 巫坤品 ) Institute of Biomedical Informatics National Yang Ming University 2007/11/20

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Example

• l = 7 (motif size) , k = 4 (projection size)• Choose projection (1,2,5,7)

GCTC

...TAGACATCCGACTTGCCTTACTAC...

Buckets

ATGC

ATCCGAC

GCCTTAC

Page 100: Motif Finding Kun-Pin Wu ( 巫坤品 ) Institute of Biomedical Informatics National Yang Ming University 2007/11/20

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Hashing and Buckets

• Hash function h(x) obtained from k positions of projection.

• Buckets are labeled by values of h(x).• Enriched buckets: contain more than s l-

tuples, for some parameter s.

ATTCCATCGCTCATGC

Page 101: Motif Finding Kun-Pin Wu ( 巫坤品 ) Institute of Biomedical Informatics National Yang Ming University 2007/11/20

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Motif Refinement

• How do we recover the motif from the sequences in the enriched buckets?

• k nucleotides are from hash value of bucket.• Use information in other l-k positions as starting

point for local refinement scheme, e.g. Gibbs sampler.

Local refinement algorithmATGCGAC

Candidate motif

ATGC

ATCCGAC

ATGAGGCATAAGTC

ATGCGAC

Page 102: Motif Finding Kun-Pin Wu ( 巫坤品 ) Institute of Biomedical Informatics National Yang Ming University 2007/11/20

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Synergy between Random Projection and Gibbs Sampler• Random Projection is a procedure for finding good

starting points: every enriched bucket is a potential starting point.

• Feeding these starting points into existing algorithms (like Gibbs sampler) provides good local search in vicinity of every starting point.

• These algorithms work particularly well for “good” starting points.

Page 103: Motif Finding Kun-Pin Wu ( 巫坤品 ) Institute of Biomedical Informatics National Yang Ming University 2007/11/20

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Building Profiles from Buckets

A 1 0 .25 .50 0 .50 0

C 0 0 .25 .25 0 0 1

G 0 0 .50 0 1 .25 0

T 0 1 0 .25 0 .25 0

Profile P

Gibbs sampler

Refined profile P*

ATCCGAC

ATGAGGC

ATAAGTC

ATGTGAC

ATGC

Page 104: Motif Finding Kun-Pin Wu ( 巫坤品 ) Institute of Biomedical Informatics National Yang Ming University 2007/11/20

A Graph-Based Method for Motif Finding

Graphs com from http://biochem218.stanford.edu/

Page 105: Motif Finding Kun-Pin Wu ( 巫坤品 ) Institute of Biomedical Informatics National Yang Ming University 2007/11/20

Problems with Current Representations of DNA Motifs All current methods for representing DNA motifs inv

olve either consensus sequences or probabilistic models (such as PSSMs) of the motif.

Consensus sequences do not adequately represent the variability seen in promoters or transcription factor binding sites.

Both consensus sequences and PSSM models assume positional independence.

Neither method can accommodate correlations between positions.

Page 106: Motif Finding Kun-Pin Wu ( 巫坤品 ) Institute of Biomedical Informatics National Yang Ming University 2007/11/20

Yeast Motifs

We analyzed yeast motifs for pairwise dependencies. We used a chi-square statistic to find whether two

positions were correlated or not. We found that 25% of motifs have significantl

y correlated positions.

Page 107: Motif Finding Kun-Pin Wu ( 巫坤品 ) Institute of Biomedical Informatics National Yang Ming University 2007/11/20

Graph Representation

Page 108: Motif Finding Kun-Pin Wu ( 巫坤品 ) Institute of Biomedical Informatics National Yang Ming University 2007/11/20

A Graph-Based Model of a Motif

The main idea is to formulate motif finding as the problem of finding the Maximum Density Subgraph

Given a set of DNA sequences, we construct a graph G=(V,E) V: the set of the k-mer occurrences in the input E: the set of undirected edges, represents nucleoti

de similarities between those k-mers

Page 109: Motif Finding Kun-Pin Wu ( 巫坤品 ) Institute of Biomedical Informatics National Yang Ming University 2007/11/20

Maximum Density Subgraph

Page 110: Motif Finding Kun-Pin Wu ( 巫坤品 ) Institute of Biomedical Informatics National Yang Ming University 2007/11/20

Nucleotide Dependencies and MDS

Page 111: Motif Finding Kun-Pin Wu ( 巫坤品 ) Institute of Biomedical Informatics National Yang Ming University 2007/11/20

Things to be Done

Graph construction Edge weight assignment Background normalization

Finding maximum density subgraph(s) Choice of density function Max-Flow/Min-Cut Neighborhood subgraph …

Page 112: Motif Finding Kun-Pin Wu ( 巫坤品 ) Institute of Biomedical Informatics National Yang Ming University 2007/11/20

The Modeling of Regulatory Regions

Page 113: Motif Finding Kun-Pin Wu ( 巫坤品 ) Institute of Biomedical Informatics National Yang Ming University 2007/11/20

More than a hundred methods have been proposed for motif discovery A large variation with respect to both algorithmic

approaches as well as the underlying models of regulatory regions

Page 114: Motif Finding Kun-Pin Wu ( 巫坤品 ) Institute of Biomedical Informatics National Yang Ming University 2007/11/20

Key Points

How the multiple binding sites for modules of regulatory elements can be modeled

How additional data sources may be integrated into such models

Page 115: Motif Finding Kun-Pin Wu ( 巫坤品 ) Institute of Biomedical Informatics National Yang Ming University 2007/11/20

A Integrated Framework to Model Regulatory Regions A single motif, denoted by mg, consists of two parts,

m*: how well the sequence matches a consensus og is a prior on whether any regulatory element is to occur at t

hat position A set of single motifs, together with inter-motif distanc

e restrictions (d), then forms a composite motif (cg) Multiple occurrences of a composite motif in the regul

atory regions of a gene is represented by a gene score Gc

Page 116: Motif Finding Kun-Pin Wu ( 巫坤品 ) Institute of Biomedical Informatics National Yang Ming University 2007/11/20

A Schematic View of the Integrated Framework

Page 117: Motif Finding Kun-Pin Wu ( 巫坤品 ) Institute of Biomedical Informatics National Yang Ming University 2007/11/20

Single Motif Models (Level 1)

Transcription factor binding sites: This is the most basic element of the regulatory system, and can be modeled using single motif models

A single motif model is defined as a function mg that maps a sequence position p to a real numbered motif score mg(p) It consists of a match score m*(p) and an occurren

ce prior og(p)

Page 118: Motif Finding Kun-Pin Wu ( 巫坤品 ) Institute of Biomedical Informatics National Yang Ming University 2007/11/20

The Match Model m*

Probabilistic match model Position weight matrix (PWM) a.k.a. PSSM Incorporating dependencies within motifs

n-th order Markov chains Bayesian network

Deterministic match model oligos regular expression mismatch expressions

Page 119: Motif Finding Kun-Pin Wu ( 巫坤品 ) Institute of Biomedical Informatics National Yang Ming University 2007/11/20

Occurrence priors og

Spatial distribution of binding sites Conservation in orthologous sequences DNA structure Nucleotide distribution

Page 120: Motif Finding Kun-Pin Wu ( 巫坤品 ) Institute of Biomedical Informatics National Yang Ming University 2007/11/20

Composite Motif Models (Level 2) Clusters of binding sites for cooperating TFs,

often called modules, are believed to be essential building blocks of the regulatory machinery

A composite motif model is defined as a function cg that maps a set of single motif sequence positions p to a real numbered composite motif score cg(p).

Page 121: Motif Finding Kun-Pin Wu ( 巫坤品 ) Institute of Biomedical Informatics National Yang Ming University 2007/11/20

A Composite Motif Score cg

Given a set of positions, the score of a composite motif will typically be the sum or product of individual single motif, and distance scores

Page 122: Motif Finding Kun-Pin Wu ( 巫坤品 ) Institute of Biomedical Informatics National Yang Ming University 2007/11/20

Distance Functions

constraints fixed distances distances below thresholds distances within intervals all single motifs to be within a window of a certain

length score functions

uniform distribution: geometry, Poisson, etc.

Page 123: Motif Finding Kun-Pin Wu ( 巫坤品 ) Institute of Biomedical Informatics National Yang Ming University 2007/11/20

Combining Single Motifs

For methods using deterministic match models and constraints on distances intersection of component scores M out of N single motif scores

For methods that use non-binary single motif scores the sum/product of single motif and distance scores

Specialized models hidden Markov model (HMM) self-organizing map (SOM) artificial neural network (ANN)

Page 124: Motif Finding Kun-Pin Wu ( 巫坤品 ) Institute of Biomedical Informatics National Yang Ming University 2007/11/20

Gene level models (Level 3)

A gene score model is defined as a function Gc that maps a gene index g to a real numbered gene score Gc(g)

The gene level score is calculated from composite motif scores, cg, across the regulatory region of gene g, and is referred to as gene score

Page 125: Motif Finding Kun-Pin Wu ( 巫坤品 ) Institute of Biomedical Informatics National Yang Ming University 2007/11/20

Multiple Binding Sites/Modules One motif in the regulatory region

The gene level score is often defined simply as the maximum motif score in the regulatory region of a gene

Multiple binding sites/modules sum of log-scores of motifs logistic-regression ANN HMM …

Page 126: Motif Finding Kun-Pin Wu ( 巫坤品 ) Institute of Biomedical Informatics National Yang Ming University 2007/11/20

Overview of Methods