Group Feature Extraction Based on Multiple Indexing Sequence Alignment

Dr. Tun-Wen Pai
Dept. of Computer Science and Engineering, National Taiwan Ocean University
2006.10.30

• Central idea: finding short approximate patterns

• Motivation: finding ordered combinatorial features

• Objectives:
– constructing evolutionary relationships
– providing key features for structural alignment

System Architecture

(Diagram) Processing stages:
• Sequences
• Motif finding and indexing
• Hierarchical clustering
• Multiple Indexing Sequence Alignment
• Exclusive group feature extraction
• Background model

(Diagram) Outputs:
• Phylogenetic tree
• Consensus motifs
• Combinatorial features
• Exclusive group features
• Background information

Motif finding

• short consensus motifs including tolerable characteristics

• variable-site tolerance: the tolerated sites in a pattern can be variable

• substitutable tolerance: residues in a pattern that have similar chemical properties can be substituted for one another

Variable-site tolerance

• applying the uniqueness and efficient searching of hashing techniques
• original patterns → unique digital value
• comparing patterns using a hash table structure
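
The hashing idea on this slide can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: patterns have a fixed length k over the 20 standard amino acids, each k-mer is encoded as a base-21 integer (digit 0 is reserved for a tolerated site), and every k-mer is indexed under all masks with at most `max_wild` tolerated interior positions. The names `encode`, `index_sequences`, `shared_patterns`, and `max_wild` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"
DIGIT = {aa: i + 1 for i, aa in enumerate(ALPHABET)}  # 0 is reserved for a tolerated site


def encode(kmer, wild=()):
    """Map a k-mer to a unique integer; positions listed in `wild` are encoded as 0."""
    value = 0
    for pos, aa in enumerate(kmer):
        value = value * 21 + (0 if pos in wild else DIGIT[aa])
    return value


def index_sequences(seqs, k=5, max_wild=1):
    """Build a hash table: encoded pattern -> set of (sequence id, offset)."""
    table = defaultdict(set)
    # Assumption: only interior positions may be tolerated sites.
    masks = [m for n in range(max_wild + 1) for m in combinations(range(1, k - 1), n)]
    for sid, seq in enumerate(seqs):
        for off in range(len(seq) - k + 1):
            kmer = seq[off:off + k]
            if any(aa not in DIGIT for aa in kmer):
                continue  # skip windows containing non-standard residues
            for mask in masks:
                table[encode(kmer, mask)].add((sid, off))
    return table


def shared_patterns(table, n_seqs):
    """Encoded patterns whose occurrences cover every input sequence."""
    return {key for key, hits in table.items()
            if len({sid for sid, _ in hits}) == n_seqs}
```

With this encoding, two k-mers that differ only at a tolerated site receive the same key, so comparing them reduces to a single hash-table lookup.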

Substitutable tolerance

• depending on chemical properties
• substitution matrix: BLOSUM62
• bitwise clustering to avoid misjudging two dissimilar residues
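
One way to picture bitwise clustering is to give each residue a bitmask of the property groups it belongs to and to treat two residues as substitutable only if their masks share a bit. The sketch below is an assumption-laden illustration: the `GROUPS` table is a common textbook-style grouping chosen for the example, not the grouping used in the slides, and it is not derived from BLOSUM62 scores.

```python
# Illustrative property groups (an assumption, not the grouping used in the slides).
GROUPS = {
    "aliphatic": "AVLIM",
    "aromatic": "FWY",
    "polar": "STNQ",
    "positive": "KRH",
    "negative": "DE",
    "special": "CGP",
}

# One bit per group; a residue's mask carries the bit of every group it belongs to.
MASK = {}
for bit, residues in enumerate(GROUPS.values()):
    for aa in residues:
        MASK[aa] = MASK.get(aa, 0) | (1 << bit)


def substitutable(a, b):
    """Two residues may substitute for each other if they share a property bit."""
    return (MASK[a] & MASK[b]) != 0


def patterns_match(p, q):
    """Position-wise comparison of two equal-length patterns under the rule above."""
    return len(p) == len(q) and all(substitutable(a, b) for a, b in zip(p, q))
```

Because the test is a single AND per position, residues from different groups can never be judged similar by accident, which is the property the slide emphasizes.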

Hierarchical clustering

• revealing phylogenetic relationships
• two sequences possess more consensus motifs → more similar
• scoring matrix → pairwise similarities
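
The scoring idea can be sketched as below. The slides do not specify the similarity measure or the linkage rule, so this example assumes the score is simply the number of shared consensus motifs and uses average linkage; `similarity_matrix` and `average_linkage` are hypothetical names.

```python
def similarity_matrix(motif_sets):
    """Score each pair of sequences by the number of consensus motifs they share."""
    n = len(motif_sets)
    return [[len(motif_sets[i] & motif_sets[j]) for j in range(n)] for i in range(n)]


def average_linkage(score):
    """Greedy agglomerative clustering; returns the merge order as nested tuples."""
    clusters = {i: (i,) for i in range(len(score))}  # cluster id -> member indices
    trees = {i: i for i in range(len(score))}        # cluster id -> nested tuple
    next_id = len(score)
    while len(clusters) > 1:
        def avg(a, b):
            # Mean pairwise score between the members of clusters a and b.
            pairs = [(x, y) for x in clusters[a] for y in clusters[b]]
            return sum(score[x][y] for x, y in pairs) / len(pairs)

        ids = sorted(clusters)
        # Merge the two clusters with the highest average similarity.
        a, b = max(((p, q) for i, p in enumerate(ids) for q in ids[i + 1:]),
                   key=lambda pq: avg(*pq))
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        trees[next_id] = (trees.pop(a), trees.pop(b))
        next_id += 1
    return next(iter(trees.values()))
```

The nested tuple returned by `average_linkage` can be read as a simple tree: sequences merged earlier share more motifs.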

Exclusive Group Feature Extraction

• Removing common motifs occurring in other subgroups
• CP: combinatorial patterns
• ECP: exclusive combinatorial patterns
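
Read as a set operation, the step from CP to ECP is a set difference. The sketch below assumes each subgroup's CP is available as a set of pattern strings; the function name `exclusive_patterns` and the placeholder labels in the usage note are hypothetical.

```python
def exclusive_patterns(cp_by_group):
    """For each subgroup, keep only the patterns that occur in no other subgroup."""
    ecp = {}
    for name, patterns in cp_by_group.items():
        # Union of the patterns of every other subgroup.
        others = set().union(*(p for g, p in cp_by_group.items() if g != name))
        ecp[name] = set(patterns) - others
    return ecp
```

For example, with toy input `{"g1": {"M1", "M2"}, "g2": {"M2", "M3"}}`, the shared pattern `M2` is removed from both groups.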

Background Model Analysis

• Verifying conspicuousness
• Hit ratio close to 0 → unique
• Hit ratio relatively large → insignificant
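
The slides do not define the hit ratio precisely. One plausible reading, used in the sketch below as an assumption, is the fraction of background sequences in which all motifs of a combinatorial pattern occur in the same order. The names `occurs_in_order` and `hit_ratio` are hypothetical, and no threshold is implied.

```python
def occurs_in_order(motifs, seq):
    """True if every motif occurs in `seq`, left to right and without overlap."""
    pos = 0
    for m in motifs:
        pos = seq.find(m, pos)
        if pos < 0:
            return False
        pos += len(m)  # the next motif must start after this one ends
    return True


def hit_ratio(motifs, background):
    """Fraction of background sequences that contain the ordered motif combination."""
    if not background:
        return 0.0
    return sum(occurs_in_order(motifs, s) for s in background) / len(background)
```

A pattern that almost never hits the background set is specific to its group, while a pattern that hits many background sequences carries little information.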

The combinatorial features of RNase A-like superfamily extracted by MISA

The combinatorial features of RNase A-like superfamily extracted by MISA (cont.)

• The known H-K-H active sites are identified exactly

The combinatorial features of RNase A-like superfamily extracted by ClustalW

• The first H was misaligned


The combinatorial features of RNase A-like superfamily extracted by ClustalW (cont.)

The combinatorial features of RNase A-like superfamily extracted by MEME

The combinatorial features of RNase A-like superfamily extracted by MEME (cont.)

• The first ‘H’ was not successfully detected

The combinatorial features of RNase A-like superfamily extracted by Gibbs Sampler

1, 1, 1    65  qekvt  CKNGQ   gncyk   69  1.00  F  1E21:A
1, 2, 0   107  kerhi  IVACE   gspyv  111  1.00  F  1E21:A
1, 3, 2   116  egspy  VPVHFD  asved  121  1.00  F  1E21:A
2, 1, 1    38  nyqrr  CKNQN   tfllt   42  1.00  F  1GQV:A
2, 2, 0   109  anmfy  IVACD   nrdqr  113  1.00  F  1GQV:A
2, 3, 2   127  pqypv  VPVHLD  rii    132  1.00  F  1GQV:A
3, 1, 1    37  nyrwr  CKNQN   tflrt   41  1.00  F  1DYT:A
3, 2, 0   108  grrfy  VVACD   nrdpr  112  1.00  F  1DYT:A
3, 3, 2   125  prypv  VPVHLD  tti    130  1.00  F  1DYT:A
4, 1, 1    65  ttniq  CKNGK   mnche   69  1.00  F  1RNF:A
4, 2, 0   105  strrv  VIACE   gnpqv  109  1.00  F  1RNF:A
4, 3, 2   114  egnpq  VPVHFD  g      119  1.00  F  1RNF:A
5, 1, 1    59  kaice  NKNGN   phren   63  1.00  F  1B1I:A
5, 2, 0   104  gfrnv  VVACE   nglpv  108  1.00  F  1B1I:A
5, 3, 2   111  aceng  LPVHLD  qsifr  116  1.00  F  1B1I:A
15 motifs

Column 1: Sequence Number, Site Number
Column 2: Motif type
Column 3: Left End Location
Column 4: Motif Element
Column 5: Right End Location
Column 6: Probability of Element
Column 7: Forward Motif (F) or Reverse Complement (R)
Column 8: Sequence Description from FASTA input

The combinatorial features of RNase A-like superfamily extracted by Gibbs Sampler (cont.)

• The first ‘H’ was not successfully detected
• The motif colored in red is wrong

The Comparison in Average RMSD and Aligned Residues

                            MISA       Gibbs Sampler   ClustalW   MEME
Average RMSD                1.039139   1.406796        1.361162   1.336590
Average Aligned Residues    95.25      38.25           60.50      83.00

MISA gives the lowest average RMSD and the highest average number of aligned residues (using a straightforward structure alignment)

MISA for primate map1b upstream sequences

Ref: D. Liu and I. Fischer, “Structural analysis of the proximal region of the microtubule-associated protein 1B promoter”, J Neurochem, 1997, 69: pp. 910-919

MISA for primate hspa2

Hierarchical clustering for p450 family 1

It can be clustered into three subfamilies

Combinatorial features for subfamily 1A

Combinatorial features for subfamily 1B

Combinatorial features for subfamily 1C

Exclusive group features for p450 family 1

• cytochrome P450 subfamily 1A
• ^ E*L*A ^ *PK*L* ^ *W*ARR*LA* ^ L**FS ^ *SC*LEEH*S*E ^ G*F*P ^ *V*SV*NVI ^ *DF*P*LR*LP* ^ **EHY**F ^ **DIT**L ^ **ELD** ^ R*P*LS

• cytochrome P450 subfamily 1B
• ^ F*R*A ^ WK**R ^ R*F*T ^ **RYP**Q*R*Q ^ DQ**LP ^ G**NK*L* ^ **HQC** ^ **LLD**

• cytochrome P450 subfamily 1C
• ^ SI**EWSG**QPAL*A*F ^ **EAC*W* ^ F**YSKQW**HRK*AQS**RAFS*AN*QT* ^ EA**LV**FL ^ F*P*HE*T ^ N**FF**V**KV**HR ^ W**LL ^ *AK*RG*
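
To check whether a sequence carries one of these ordered features, the notation can be translated into a regular expression. The slides do not define the notation, so the sketch below assumes that `^` separates consecutive motifs (with a gap of any length between them) and that `*` stands for exactly one arbitrary residue. The helper names `pattern_to_regex` and `matches` are hypothetical.

```python
import re


def pattern_to_regex(pattern):
    """Translate '^ E*L*A ^ G*F*P' style notation into a compiled regex."""
    motifs = [m.strip() for m in pattern.split("^") if m.strip()]
    # Assumption: '*' is a single tolerated site, '^' is a gap of any length.
    parts = ["".join("." if c == "*" else re.escape(c) for c in m) for m in motifs]
    return re.compile(".*?".join(parts))


def matches(pattern, seq):
    """True if the motifs of `pattern` occur in `seq` in the given order."""
    return pattern_to_regex(pattern).search(seq) is not None
```

Under these assumptions, a sequence matches only if every motif is present and the motifs appear in the listed order.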
