ACADEMIA SINICA, Institute of Information Science
Protein Subcellular Localization Prediction Based on Machine Learning Approaches
Wen-Lian Hsu (許聞廉) and Emily Chia-Yu Su (蘇家玉)
Aug 5 2008
Data Mining course, National Taiwan Normal University (台師大)
Outline
- Introduction: data mining; protein subcellular localization prediction
- A support vector machine model-based method: support vector machines; compartment-specific biological features
- A probabilistic latent semantic analysis-based method: gapped-dipeptides; probabilistic latent semantic analysis
- Applications of protein localization prediction
- Conclusion
About Myself
Education
- Ph.D. Candidate, Bioinformatics Program, Taiwan International Graduate Program (TIGP), Academia Sinica
- M.S., Department of Computer Science and Information Engineering, National Taiwan University
- B.S., Department of Information and Computer Education, National Taiwan Normal University
Research interests
- Bioinformatics, computational biology, machine learning, data mining, text mining, natural language processing, information retrieval/extraction
About TIGP
TIGP programs
- Chemical Biology and Molecular Biophysics (CBMB, 2002)
- Molecular Science and Technology (MST, 2002)
- Molecular and Biological Agricultural Sciences (MBAS, 2003)
- Bioinformatics (Bio, 2003)
- Molecular and Cell Biology (MCB, 2003)
- Nano Science and Technology (Nano, 2003)
- Molecular Medicine (MM, 2004)
- Computational Linguistics and Chinese Language Processing (CLCLP, 2005)
- Earth System Science (ESS, will admit students in 2009)
Website: http://tigp.sinica.edu.tw/
About Bioinformatics Program (1/2)
Wen-Lian Hsu (許聞廉): Natural language processing, literature mining, proteomics, protein structure prediction
Der-Tsai Lee (李德財): Computational geometry, parallel and distributed computing, web-based computing, digital libraries, bioinformatics
Wen-Hsiung Li (李文雄): Molecular evolution, comparative genomics, population genetics, evolution of gene regulation, computational biology
Wen-Chang Lin (林文昌): Bioinformatics, tumor biology, cancer metastasis
About Bioinformatics Program (2/2)
Jenn-Kang Hwang (黃鎮剛): Structure prediction and classification, protein stability, structural alignment, molecular simulation
Cathy S.J. Fann (范盛娟): Biostatistics, genetic epidemiology, genetic statistics, disease gene mapping, population genetics
Grace S. Shieh (謝叔蓉): Biostatistics, microarray analysis, gene regulatory network prediction, protein interaction networks, comparative genomics
Ueng-Cheng Yang (楊永正): Bioinformatics, infobiology, RNA-structure analysis, RNA-protein interaction, comparative genomics, genome annotation
Origins of Data Mining
Data mining draws ideas from machine learning/pattern recognition, statistics/AI, and database systems
Traditional techniques may be unsuitable due to:
- Enormity of data
- High dimensionality of data
- Heterogeneous, distributed nature of data
Excerpted from 柯佳伶's lecture "Introduction to Data Mining" (資料探勘簡介), 2008/7/23.
Two Types of Data Mining Methods
Prediction methods: use some variables to predict unknown or future values of other variables
Description methods: find human-interpretable patterns that describe the data
Different Tasks in Data Mining
- Classification [Predictive]
- Clustering [Descriptive]
- Association rule discovery [Descriptive]
- Sequential pattern discovery [Descriptive]
- Regression [Predictive]
- Deviation detection [Predictive]
Definition of Classification
Definition: given a collection of records (i.e., a training set), where each record contains a set of attributes, find a model for the class attribute as a function of the values of the other attributes.
Goal: assign a class to previously unseen records as accurately as possible. A test set is used to determine the accuracy of the model.
Applications in Classification
Examples of classification:
- Direct marketing: predict whether a consumer is likely to buy a new cell phone
- Fraud detection: predict fraudulent cases in credit card transactions
- Customer attrition/churn: predict whether a customer is likely to be lost to a competitor
- Sky survey cataloging: predict the class (star or galaxy) of sky objects based on telescopic survey images
More applications in biology?
Protein Subcellular Localization (PSL) Prediction
Predict where a protein is located in the cell (Gram-negative bacteria):
C1: cytoplasm; C2: inner membrane; C3: periplasm; C4: outer membrane; C5: extracellular space
Gardy JL, et al. Methods for Predicting Bacterial Protein Subcellular Localization. Nature Reviews Microbiology, 2006.
Importance of PSL Prediction
- Protein function identification: modulate and identify protein functions
- Genome annotation: annotate genomic features
- Drug discovery: give clues to new drug targets
The Available Computational Methods for Bacterial PSL Prediction
Gardy JL, et al. Methods for Predicting Bacterial Protein Subcellular Localization. Nature Reviews Microbiology, 2006.
Classification Formulation
Given an input space ℜ and a set of classes Ω = {ω1, ω2, ..., ωc}, the classification problem is to define a mapping f: ℜ → Ω, where each x in ℜ is assigned to one class.
This mapping function is called a decision function.
Decision Function (1/2)
The basic problem in classification is to find c decision functions
d1(x), d2(x), ..., dc(x)
with the property that, if a pattern x belongs to class i, then
di(x) > dj(x), j = 1, 2, ..., c; j ≠ i
di(x) is a similarity measure between x and class i, based for example on distance or probability.
Decision Function (2/2)
(Example: three class regions separated by the decision boundaries d1 = d2, d1 = d3, and d2 = d3.)
Support Vector Machines (1/2)
Support Vector Machines (SVM)
- Basically a two-class classifier, introduced by Boser, Guyon, and Vapnik (1992)
- Which separating line is optimal?
Support Vector Machines (2/2)
Training vectors: xi, i = 1, ..., l
Consider a simple case with two classes. Define a vector y:
yi = 1 if xi is in class 1
yi = -1 if xi is in class 2
A hyperplane separates all the data.
(Figure: separating plane with margin ρ; the support vectors of class 1 and class 2 lie on the margin boundaries.)
Optimal Hyperplane
Formalization
Training data: D = {(x1, y1), ..., (xl, yl)}, xi ∈ R^n, yi ∈ {1, -1}
Decision function: f(x) = w·x + b, where w is the weight vector (w ≠ 0), x the input vector, and b the bias
Goal: choose w and b to maximize the margin
See: http://en.wikipedia.org/wiki/Support_vector_machine
Multiclass Classification in SVM
One-versus-rest (1-v-r) SVM modelApply a universal set of biological features for different localization classes
One-versus-one (1-v-1) SVM modelDifferent biological features can be used in distinguishing two classes
Multiclass Classification by 1-v-r SVM
Binary classifiers: for each class i, construct a Ci vs. non-Ci binary classifier
# of classifiers = 5
Input features: the same features for all binary classifiers
Class determination: the class with the largest probability (probi: the confidence that a sample is predicted as class i, 0 ≤ probi ≤ 1)
General Biological Features for PSL Prediction
1. Amino acid composition (AA)
2. Dipeptide composition (Dip)
3. Secondary structure elements (SSE)
1. Amino Acid Composition, 2. Dipeptide Composition
Amino acid composition (AA) and dipeptide composition (Dip):
- n-peptide compositions, or their variations, have been shown to be effective in PSL prediction
- If n = 1, the n-peptide composition reduces to the AA (dimension = 20)
- If n = 2, the n-peptide composition yields the Dip (dimension = 20 × 20 = 400)
(Example sequence, N-terminus to C-terminus: Met Ala Phe Leu Phe His Ala Arg Ser Val ...)
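As a concrete sketch (not code from the talk), the AA and Dip compositions can be computed as follows; the function names are illustrative:

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def aa_composition(seq):
    """20-dimensional amino acid composition: frequency of each residue type."""
    counts = Counter(seq)
    n = len(seq)
    return [counts[a] / n for a in AMINO_ACIDS]

def dip_composition(seq):
    """400-dimensional dipeptide composition: frequency of each ordered pair."""
    counts = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    total = len(seq) - 1
    return [counts[a + b] / total for a in AMINO_ACIDS for b in AMINO_ACIDS]
```

Both vectors sum to one, so proteins of different lengths become comparable fixed-length feature vectors.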
3. Secondary Structure Elements
Predicted secondary structure elements (SSE) from HYPROSP II server
Encoding scheme: compute the amino acid compositions of α-helix (H), β-strand (E), and random coil (C) residues
Protein seq.:   M P L D L Y N T L T R R K E R F E P M T P D ..
Predicted SSE:  C C E E E E C C C H H H H H H C C E E E H H ..
This yields three 20-dimensional composition vectors (over A, C, D, ..., Y), one each for the H, E, and C residues.
Training and Testing in SVM
Support Vector Machines (SVM)
- LIBSVM software
- Kernel: Radial Basis Function (RBF)
- Parameter selection: c (cost) and γ (gamma) are tuned
- 10-fold cross-validation (with validation): 8 folds for training, 1 fold for validation, 1 fold for testing
Chang CC and Lin CJ. LIBSVM: a library for support vector machines, 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm
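The talk uses LIBSVM directly; as a hedged sketch, the same RBF-kernel tuning of c (cost) and γ (gamma) with cross-validation can be written with scikit-learn's `SVC`, which wraps LIBSVM. The grid values and fold count below are illustrative, not those used in the study:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC

def tune_rbf_svm(X, y, n_folds=10):
    """Grid-search C (cost) and gamma for an RBF-kernel SVM with k-fold CV."""
    param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1]}
    search = GridSearchCV(SVC(kernel="rbf"), param_grid,
                          cv=StratifiedKFold(n_splits=n_folds))
    search.fit(X, y)  # refits the best (C, gamma) on all of X
    return search.best_estimator_, search.best_params_
```

In practice a log-spaced grid (e.g., powers of 2, as recommended in the LIBSVM guide) is usually searched.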
Gram-Negative Bacteria Data Set
A benchmark data set: ePSORTdb, 1,444 proteins
- Cytoplasmic: 278 (19%)
- Inner membrane: 309 (22%)
- Periplasmic: 276 (19%)
- Outer membrane: 391 (27%)
- Extracellular: 190 (13%)
Gardy JL, et al. PSORTb v.2.0: Expanded Prediction of Bacterial Protein Subcellular Localization and Insights Gained from Comparative Proteome Analysis. Bioinformatics, 2005.
Performance Evaluation
Accuracy (Acc):
Acc = (Σ_{i=1}^{l} TP_i) / (Σ_{i=1}^{l} N_i)
where l = 5 is the total number of localization sites, TP_i is the number of correctly predicted proteins in site i, and N_i is the number of proteins in site i.
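The accuracy formula above amounts to a one-liner; a minimal sketch:

```python
def overall_accuracy(tp, n):
    """Overall accuracy Acc = sum_i TP_i / sum_i N_i over l localization sites."""
    assert len(tp) == len(n), "one (TP_i, N_i) pair per localization site"
    return sum(tp) / sum(n)
```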
Results of 1-v-r SVM Model
Different feature combinations in the 1-v-r SVM model: AA; Dip; SSE; AA+Dip; AA+SSE; Dip+SSE; AA+Dip+SSE
Feature AA Dip SSE AA+Dip AA+SSE Dip+SSE AA+Dip+SSE
Overall Acc 85.56% 84.87% 83.26% 87.71% 84.95% 83.95% 86.25%
Multiclass Classification by 1-v-1 SVM
Binary classifiers: for each pair of classes i and j, construct a Ci vs. Cj binary classifier
# of classifiers = 5 × (5 − 1)/2 = 10
Input features: different features can be used in different classifiers
Class determination: majority votes and average probability
In case of a tie in the majority votes, the class with the largest average probability is selected as the final prediction
(1-v-1 SVM model: 10 binary classifiers C1/C2 (features F12), C1/C3 (F13), C1/C4 (F14), ..., C3/C5 (F35), C4/C5 (F45); their votes determine the predicted class among C1–C5.)
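A minimal sketch of the class-determination rule (majority vote, ties broken by the largest average probability). The data layout is an assumption for illustration, not taken from the talk:

```python
from collections import defaultdict

def predict_1v1(pairwise):
    """Combine 1-v-1 decisions into a final class.
    `pairwise` maps (i, j) -> (winner, prob_i), where prob_i is the
    probability that the sample belongs to class i (class j gets 1 - prob_i).
    """
    votes = defaultdict(int)
    prob_sum = defaultdict(float)
    for (i, j), (winner, prob_i) in pairwise.items():
        votes[winner] += 1
        prob_sum[i] += prob_i
        prob_sum[j] += 1.0 - prob_i
    best = max(votes.values())
    tied = [c for c, v in votes.items() if v == best]
    if len(tied) == 1:
        return tied[0]
    # tie: every class appears in the same number of pairs, so comparing
    # summed probabilities orders them the same way as averages
    return max(tied, key=lambda c: prob_sum[c])
```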
Accuracy and Feature Combination
Features used in 1-v-1 SVM: AA, Dip, and SSE
Advantages of the 1-v-1 SVM model:
- Flexibility of combining different features
- Better accuracy
Compartment-Specific Biological Features
- Integrate more biological features to improve accuracy
- For each binary classifier Cij, a set of compartment-specific features is incorporated: features unique to Ci or Cj
- Select features to mimic bacterial protein secretory pathways: feature selection guided by biological insights
- A binary classifier Cij distinguishes proteins localized in two different compartments: C1 vs. C2, C2 vs. C3, etc.
Compartment-Specific Features in Bacterial Secretory Pathways
C1→C2: Sig, TMα, SA
C2→C3: Sig, TMα, SSE
C1→C3: Tat, Sig, SSE, SA
C3→C4: TMβ, SSE, SA
C3→C5: SSE
C1→C5: SecretomeP, SSE
Modified from Wickner and Schekman with their permission. Wickner W and Schekman R. Protein Translocation Across Biological Membranes. Science, 2005.
More Compartment-Specific Biological Features
1. Amino acid composition (AA)
2. Dipeptide composition (Dip)
3. Secondary structure elements (SSE)
4. Solvent accessibility (SA) – C5
5. Signal peptides (Sig) – C1
6. Transmembrane α-helices (TMA) – C2
7. Transmembrane β-barrels (TMB) – C4
8. Twin-arginine translocase signal peptides (TAT) – C3
9. Non-classical protein secretion (Sec) – C5
Compartment-Specific Biological Features
Implications of different biological features for localization classes:
Feature | Description | Indicative pattern | Class
SA | Solvent accessibility | Acidic high-SA residues | C5
Sig | Signal peptides | Presence of Sig | not C1
TMA | Transmembrane α-helices | Presence of TMA | C2
TAT | Twin-arginine translocase motifs | Presence of TAT | C3
TMB | Transmembrane β-barrels | Presence of TMB | C4
Sec | Non-classical protein secretion | Presence of Sec | C5
Compartment-Specific Feature Selection
System Architecture of PSL101
PSL101 (Protein Subcellular Localization prediction by 1-On-1 classifiers)
(Architecture: an input sequence, e.g. ···RRDFLKGIASSSFVVLGGSSVLTPLN···, is encoded with biological features (AA, Dip, SSE, TMB, ...); the 1-v-1 binary classifiers SVM1 (CP/IM; features FCP,IM), SVM2 (CP/PP; FCP,PP), SVM3 (CP/OM; FCP,OM), ..., SVM10 (OM/EC; FOM,EC) are combined by majority votes and average probabilities to output the predicted localization site(s).)
Feature Selection
Motivation: it is impractical to try all possible feature combinations in the different classifiers; feature selection reduces computational costs.
Sequential forward search algorithm:
- Starting with an empty subset, keep adding the feature set that best improves the accuracy
- The process terminates when adding a feature set no longer makes any improvement
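The greedy procedure above can be sketched generically; `evaluate` stands in for whatever cross-validated accuracy the classifier yields on a candidate subset:

```python
def sequential_forward_search(feature_sets, evaluate):
    """Greedy forward selection: start empty, add the best-improving
    feature set each round, stop when nothing improves accuracy."""
    selected, best_acc = [], float("-inf")
    remaining = list(feature_sets)
    while remaining:
        scored = [(evaluate(selected + [f]), f) for f in remaining]
        acc, best_f = max(scored, key=lambda t: t[0])
        if acc <= best_acc:  # no improvement: terminate
            break
        selected.append(best_f)
        remaining.remove(best_f)
        best_acc = acc
    return selected, best_acc
```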
Accuracy and Feature Combination
More biological features used in 1-v-1 SVM: AA, Dip, SSE, SA, Sig, TMA, TMB, TAT, and Sec
3.2% improvement over applying the general features!
More Refined Encoding Schemes to Encode Protein Structures (SSE)?
Previous encoding: consider only composition of H, E, and C
SSE1 and SSE2 have the same composition but different transition and distribution!
SSE1: CCCCCCCCCCCCCCCCCCCCCHHHHHHHHHHHHHHHHHHHHHHH
SSE2: CCCHHHHHCCCHHHHHCCCCCHHHHHCCCHHHHHCCCCCHHHHH
A More Refined Feature Representation for SSE
1. Composition: the number of amino acids in each of H, E, and C
2. Transition: the percent frequency with which H↔E, H↔C, and E↔C transitions occur along the sequence
3. Distribution: the chain length within which the first, 25%, 50%, 75%, and 100% of the amino acids of a particular property are located:
H1%, H25%, H50%, H75%, H100%, E1%, E25%, E50%, E75%, E100%, C1%, C25%, C50%, C75%, and C100%
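A sketch of the composition/transition/distribution encoding for a predicted SSE string. The index convention used for the distribution descriptor is one plausible reading of Dubchak et al., not something the slides pin down:

```python
def ctd_features(sse):
    """Composition, Transition, Distribution encoding of an H/E/C string."""
    n = len(sse)
    comp = {s: sse.count(s) / n for s in "HEC"}
    # transitions: percent frequency of adjacent pairs HE/EH, HC/CH, EC/CE
    pairs = [sse[i:i + 2] for i in range(n - 1)]
    trans = {key: sum(p in (key, key[::-1]) for p in pairs) / (n - 1)
             for key in ("HE", "HC", "EC")}
    # distribution: relative chain position of the 1st, 25%, 50%, 75% and
    # 100% occurrence of each state (0.0 if the state never occurs)
    dist = {}
    for s in "HEC":
        pos = [i + 1 for i, c in enumerate(sse) if c == s]
        for frac in (0.01, 0.25, 0.50, 0.75, 1.00):
            if pos:
                idx = max(1, round(frac * len(pos)))  # at least the 1st occurrence
                dist[(s, frac)] = pos[idx - 1] / n
            else:
                dist[(s, frac)] = 0.0
    return comp, trans, dist
```

This yields 3 + 3 + 15 = 21 descriptors per SSE string, matching the descriptor list in the slide.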
Dubchak I, et al. Prediction of Protein Folding Class Using Global Description of Amino Acid Sequence. PNAS, 1995.
Accuracy and Feature Combination
Features used in 1-v-1 SVM: AA, Dip, SSE (EC1), SA, Sig, TMA, TMB, TAT, Sec, and SSE (EC2)
The new encoding scheme leads to an improvement of 0.6% in overall accuracy!
Document Classification
Vector Space Model
Salton's Vector Space Model
- Represent each document by a high-dimensional vector in the space of terms/words
- Document → a vector of terms
A Term-Document Matrix
A term-document matrix is an m×n matrix, where m is the number of terms and n is the number of documents.
A = [a_ij], where row i corresponds to term t_i (i = 1, ..., m), column j corresponds to document d_j (j = 1, ..., n), and entry a_ij is the weight of term t_i in document d_j.
Term Weighting by TF-IDF
The term frequency (tf) in a given document d measures the importance of term t_i within that document:
tf(t_i, d) = n_i / Σ_k n_k
where n_i is the number of occurrences of term t_i in d.
The inverse document frequency (idf) is obtained by dividing the number of all documents by the number of documents containing term t_i:
idf(t_i) = log(|D| / |{d : d ⊃ t_i}|)
where |D| is the total number of documents in the corpus and |{d : d ⊃ t_i}| is the number of documents in which term t_i appears.
tfidf = tf × idf
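The two formulas combine into a short routine; a minimal sketch over tokenized documents:

```python
import math
from collections import Counter

def tfidf(docs):
    """tf-idf weights per document: tf(t,d) = n_t / sum_k n_k,
    idf(t) = log(|D| / |{d : t in d}|). Returns one {term: weight} dict per doc."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))          # document frequency of each term
    n_docs = len(docs)
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({t: (c / total) * math.log(n_docs / df[t])
                        for t, c in counts.items()})
    return weights
```

Note that a term occurring in every document gets idf = log(1) = 0, so it contributes no discriminative weight.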
The Terms of Proteins - Gapped-dipeptides
Gapped-dipeptide feature representation: let XdZ denote the gapped-dipeptide of amino acid types X and Z separated by d amino acids.
Incorporation of Evolutionary Information
Position-Specific Scoring Matrix (PSSM): a PSSM is constructed from a multiple alignment of the highest-scoring hits in a PSI-BLAST search.
Weighting Scheme of Gapped-dipeptides
The weight of XdZ in a protein P of length n:
W(XdZ, P) = Σ_i f(i, X) × f(i + d + 1, Z)
where f(i, Y) denotes the normalized value of the PSSM entry at the i-th row and the column corresponding to amino acid type Y.
An example (n = 81, d = 2):
W(M2D, P) = f(1,M) × f(4,D) + f(2,M) × f(5,D) + ... + f(78,M) × f(81,D) = 0.99995×0.04743 + 0.11920×0.00247 + ... + 0.00669×0.26894
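The summation form below is inferred from the worked example (sliding both positions down the sequence with a fixed gap d); a minimal sketch given a normalized PSSM as a list of 20-column rows:

```python
def gapped_dipeptide_weight(pssm, x, z, d, alphabet="ACDEFGHIKLMNPQRSTVWY"):
    """W(XdZ, P) = sum_i f(i, X) * f(i + d + 1, Z), where f(i, Y) is the
    normalized PSSM entry at row i and the column for amino acid type Y."""
    cx, cz = alphabet.index(x), alphabet.index(z)
    n = len(pssm)
    return sum(pssm[i][cx] * pssm[i + d + 1][cz] for i in range(n - d - 1))
```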
Gapped-dipeptide Feature Representation
Problems with representing proteins by gapped-dipeptides:
- Very large number of features
- Increased computational time and complexity
With gap lengths d = 0, ..., 20, there are 8,400 (= 21 × 20 × 20) features in a vector!
Feature Reduction (1/3)
Probabilistic Latent Semantic Analysis (PLSA)
Create a mapping between documents and terms via latent concepts
Feature Reduction
Dimension reduction: reduce the number of gapped-dipeptides by Probabilistic Latent Semantic Analysis (PLSA)
(Illustration: true plot in k dimensions vs. reduced-dimensionality plot)
Feature Reduction (2/3)
A joint probability between a term w and a document d can be modeled as:
P(w, d) = P(d) Σ_{z∈Z} P(w|z) P(z|d)
where z is a latent variable with a "small" number of states, P(w|z) are the concept expression probabilities, and P(z|d) are the document-specific mixing proportions.
The parameters can be estimated by maximizing the likelihood with the EM algorithm.
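A minimal EM sketch for the model above, operating on a term-document count matrix. This is a generic PLSA implementation, not the one used in PSLDoc:

```python
import numpy as np

def plsa(counts, n_topics, n_iter=50, seed=0):
    """Minimal PLSA via EM on a (terms x docs) count matrix.
    Learns P(w|z) and P(z|d) for P(w, d) = P(d) * sum_z P(w|z) P(z|d)."""
    rng = np.random.default_rng(seed)
    n_terms, n_docs = counts.shape
    p_w_z = rng.random((n_terms, n_topics))
    p_w_z /= p_w_z.sum(axis=0, keepdims=True)
    p_z_d = rng.random((n_topics, n_docs))
    p_z_d /= p_z_d.sum(axis=0, keepdims=True)
    for _ in range(n_iter):
        # E-step: responsibilities P(z|w,d) ∝ P(w|z) P(z|d)
        joint = p_w_z[:, :, None] * p_z_d[None, :, :]   # terms x topics x docs
        joint /= joint.sum(axis=1, keepdims=True) + 1e-12
        # M-step: reweight responsibilities by observed counts
        resp = counts[:, None, :] * joint
        p_w_z = resp.sum(axis=2)
        p_w_z /= p_w_z.sum(axis=0, keepdims=True) + 1e-12
        p_z_d = resp.sum(axis=0)
        p_z_d /= p_z_d.sum(axis=0, keepdims=True) + 1e-12
    return p_w_z, p_z_d
```

The columns of P(z|d) are the reduced document representations; in PSLDoc these topic vectors replace the >8,000 gapped-dipeptide dimensions.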
Feature Reduction (3/3)
(Illustration: documents, represented as vectors over terms 1–5 in the term space, are mapped to a lower-dimensional latent space.)
System Architecture of PSLDoc
PSLDoc (Protein Subcellular Localization prediction by Document classification)
Performance Comparison with Other Approaches
PSLDoc compares favorably with other methods, improving overall accuracy by 1.51%.
Chang JM, et al. PSLDoc: Protein Subcellular Localization Prediction Based on Gapped-dipeptides and Probabilistic Latent Semantic Analysis. Proteins: Structure, Function, and Bioinformatics, 2008.
Effect of PLSA Feature Reduction (1/2)
Distribution of topic versus proteins
Use PLSA to reduce gapped-dipeptides to different topics
Effect of PLSA Feature Reduction (2/2)
Distribution of topic versus localization sites
Gapped-dipeptide Signatures for Each Localization
Loc. Gapped-dipeptide Signatures
CP: E0E, K1I, K5V, K1V, D0E; L1H, L5H, L3H, H4L, H0L; A12C, A9C, A13C, A5C, A7C; R3R, R6R, R2R, R0R, R9R; A6A, A13A, A7A, A10A, A11A; I0E, R6I, I3R, I3K, R6V; H3H, H1H, H7H, H13H, H10H; H1M, H2M, H11M, M0H, H0M; A4E, E1E, A2E, V4E, A9E; E4E, K6E, E6E, E3E, E0E
IM: I2I, I3I, I0I, L0I, I0F; L7L, L4L, L10L, L3L, L6L; M3M, M2M, M0M, M8M, M6M; V2I, V2V, V3I, V3V, I0V; T2F, T6F, F3F, T4F, T8F; A1A, A7L, A4A, A1C, A11L; W3W, W0W, W2W, W6W, W4W; Y12L, Y1L, Y11L, L0Y, L1L; M2T, M3T, M10T, M4T, M0L; F10P, F8P, F12P, F3P, F13P
PP: A1A, A2A, A0A, A3A, M4A; M0H, W1Q, W1H, W1K, W5Q; P1E, P0E, E0P, P0K, E1P; D0D, Q0D, D3D, D3Q, D11D; W0E, E4W, W11E, E0W, W13E; K3K, K0K, K2K, K1K, K7K; A3A, A7A, A1P, A6R, A10R; P3N, N4P, N3P, N5P, N0P; H6G, G3M, H7D, G11H, H11G; A10A, A11A, A6A, A12A, A3A
OM: T1R, R3T, R1T, T5R, P0P; R0F, R4F, Y13R, R6F, R2F; N4N, N0N, N10N, N7N, F1N; Q6Q, Q1Q, Q3Q, Q13Q, Q4Q; S0F, A3F, F0S, R9F, F7F; G0G, A0G, A1G, G1A, G3A; N1Q, N1N, Q1Q, N12N, Q11V; W2N, N2W, N0W, D2W, N13W; Q5R, R1Q, Q1R, Q3R, R2Q; Y1Y, Y0Y, Y5Y, Y4Y, Y12Y
EC: S6S, S2S, T11T, S13S, T6S; G8G, G0G, G7G, G9G, G6G; T1T, T3T, T5T, T9T, T10T; N10N, N9N, N13N, N11N, N12N; N1N, N3N, N4N, N11N, N1T; I5Y, Y12S, Y3S, Y9S, Y6I; Q2N, N1Q, Q1Q, N3Q, Q7Q; K1S, S6S, S5S, S11M, S0S; S3G, G3G, G4S, G3S, G2G; N0N, N12V, N4V, V12N, N9V
Amino Acid Compositions of Single Residues and Gapped-dipeptide Signatures
Grouped Amino Acid Compositions of Single Residues and Gapped-dipeptide Signatures
Amino acid grouping: nonpolar (AIGLMV), polar (CNPQST), charged (DEHKR), and aromatic (FYW)
Interaction – Localization Networks in the Secretory Pathway
Co-localized proteins tend to interact with each other!
Scott MS, et al. Refining Protein Subcellular Localization. PLoS Computational Biology, 2005.
Interaction – Localization Networks in ER
Interactions of ER periphery proteins are almost exclusively between members of this group or with cytosolic proteins.
Scott MS, et al. Refining Protein Subcellular Localization. PLoS Computational Biology, 2005.
Gene Expression Level – Localization Relationship
Expression levels are clearly correlated with localization
Drawid A, et al. Genome-wide analysis relating expression level with protein subcellular localization. Trends in Genetics, 2000.
The Discriminative Impact of Features for Different Functional Categories
Subcellular localization-related features are important for elucidating protein function.
Jensen LJ, et al. Prediction of Human Protein Function from Post-translational Modifications and Localization Features. J. Mol. Biol., 2000.
Conclusion (1/2)
SVM models:
- 1-v-1 outperforms 1-v-r in the flexibility of integrating different features
Compartment-specific biological features:
- Compartment-specific features improve accuracy
- A refined feature representation leads to Acc = 92.0% in PSL101 (vs. 90.0% in CELLO II)

Model | Biological features | Acc
1-v-r SVM | AA, Dip, SSE (EC1) | 87.7%
1-v-1 SVM | AA, Dip, SSE (EC1) | 88.2%
1-v-1 SVM | AA, Dip, SSE (EC1), SA, Sig, TMA, TMB, TAT, Sec | 91.4%
1-v-1 SVM | AA, Dip, SSE (EC1), SA, Sig, TMA, TMB, TAT, Sec, SSE (EC2) | 92.0%
Conclusion (2/2)
Proteins represented by gapped-dipeptides: gapped-dipeptides capture remote relationships in the primary sequence.
Effects of feature reduction: PLSA greatly reduces the feature dimensionality (from >8,000 features) without sacrificing performance.
Take Home Message
5 key factors in using data mining to solve a research problem:
- Data set: is the data set large enough for analysis?
- Feature extraction: what kind of feature(s) could be useful for solving this problem?
- Feature representation: how can these features be effectively represented/encoded?
- Data mining approaches: which technique(s) are suitable for this problem?
- Evaluation: what is the performance of this method?
References
Emily Chia-Yu Su, Hua-Sheng Chiu, Allan Lo, Jenn-Kang Hwang, Ting-Yi Sung, and Wen-Lian Hsu, "Protein subcellular localization prediction based on compartment-specific features and structure conservation," BMC Bioinformatics, 8:330, (2007).
Jia-Ming Chang, Emily Chia-Yu Su, Allan Lo, Hua-Sheng Chiu, Ting-Yi Sung, and Wen-Lian Hsu, "PSLDoc: Protein subcellular localization prediction based on gapped-dipeptides and probabilistic latent semantic analysis," PROTEINS: Structure, Function, and Bioinformatics, 72(2): 693, (2008).
Thank You!
Questions?
4. Solvent Accessibility – (1)
Proteins in different localization sites have different solvent accessibility (SA):
- A balance of acidic and basic exposed residues → cytoplasmic proteins (C1)
- A slight excess of acidic exposed residues → extracellular proteins (C5)
Exposed and buried residues: let SAi be the predicted SA of the residue at position i (0–100%)
ACADEMIA SINICAInstitute of Information Science
4. Solvent Accessibility – (2)
Protein seq.:     M  P  L  D  L  Y  N  T  L  T  R  R  K  E  R  F  E  P  M  T  P  D  R  V
Pred. SA (%):     16 75 53 45 62 86 22 19 8  67 ...
Buried/exposed:   B  E  E  E  E  E  B  B  B  E  ...
Encoding scheme of SA: compute the amino acid compositions of exposed and buried residues, yielding two 20-dimensional vectors (over A, C, D, ..., Y), one for buried (B) and one for exposed (E) residues.
5. Signal Peptides – (1)
Signal peptides (Sig) from the SignalP 3.0 server
Signal peptides: N-terminal peptides, between 15 and 40 amino acids long, that target proteins for translocation through the general secretory pathway
Presence of a signal peptide → not cytoplasmic (not C1)
5. Signal Peptides – (2)
Sig features extracted from SignalP server
6. Transmembrane α-helices – (1)
Transmembrane α-helices (TMA) from the TMHMM 2.0 server
Integral inner membrane proteins are characterized by transmembrane α-helices
Presence of transmembrane α-helices → inner membrane proteins (C2)
6. Transmembrane α-helices – (2)
TMA features extracted from the TMHMM server
7. Transmembrane β-barrels – (1)
Transmembrane β-barrels (TMB) from the TMB-Hunt server
Many proteins residing in the outer membrane are characterized by β-barrel structures
Presence of β-barrel structures → outer membrane proteins (C4)
Wong et al. (2001) J Bacteriol 183:367
7. Transmembrane β-barrels – (2)
TMB features extracted from TMB-Hunt server
8. Twin-Arginine Translocase – (1)
Twin-arginine translocase signal peptides (TAT) from the TatP 1.0 server
The TAT system exports proteins from the cytoplasm to the periplasm
Proteins translocated by TAT bear a unique twin-arginine motif
Presence of a TAT motif → periplasmic proteins (C3)
8. Twin-Arginine Translocase – (2)
TAT features extracted from the TatP server
9. Non-classical Protein Secretion – (1)
Non-classical protein secretion (Sec) from the SecretomeP 2.0 server
Several extracellular proteins can be secreted without a classical N-terminal signal peptide
Identification of non-classical protein secretion → extracellular proteins (C5)
9. Non-classical Protein Secretion – (2)
Sec features extracted from the SecretomeP server