ACADEMIA SINICA, Institute of Information Science
Protein Subcellular Localization Prediction Based on Machine Learning Approaches
Wen-Lian Hsu (許聞廉) and Emily Chia-Yu Su (蘇家玉)
Aug 5 2008
Data Mining course, National Taiwan Normal University (台師大)
Outline
- Introduction: data mining; protein subcellular localization prediction
- A support vector machine model-based method: support vector machines; compartment-specific biological features
- A probabilistic latent semantic analysis-based method: gapped-dipeptides; probabilistic latent semantic analysis
- Applications of protein localization prediction
- Conclusion
About Myself
Education
- Ph.D. Candidate, Bioinformatics Program, Taiwan International Graduate Program (TIGP), Academia Sinica
- M.S., Department of Computer Science and Information Engineering, National Taiwan University
- B.S., Department of Information and Computer Education, National Taiwan Normal University
Research interests
- Bioinformatics, computational biology, machine learning, data mining, text mining, natural language processing, information retrieval/extraction
About TIGP
TIGP programs
- Chemical Biology and Molecular Biophysics (CBMB, 2002)
- Molecular Science and Technology (MST, 2002)
- Molecular and Biological Agricultural Sciences (MBAS, 2003)
- Bioinformatics (Bio, 2003)
- Molecular and Cell Biology (MCB, 2003)
- Nano Science and Technology (Nano, 2003)
- Molecular Medicine (MM, 2004)
- Computational Linguistics and Chinese Language Processing (CLCLP, 2005)
- Earth System Science (ESS, will admit students in 2009)
Website: http://tigp.sinica.edu.tw/
About Bioinformatics Program (1/2)
Wen-Lian Hsu (許聞廉): Natural language processing, literature mining, proteomics, protein structure prediction
Der-Tsai Lee (李德財): Computational geometry, parallel and distributed computing, web-based computing, digital libraries, bioinformatics
Wen-Hsiung Li (李文雄): Molecular evolution, comparative genomics, population genetics, evolution of gene regulation, computational biology
Wen-Chang Lin (林文昌): Bioinformatics, tumor biology, cancer metastasis
About Bioinformatics Program (2/2)
Jenn-Kang Hwang (黃鎮剛): Structure prediction and classification, protein stability, structural alignment, molecular simulation
Cathy S.J. Fann (范盛娟): Biostatistics, genetic epidemiology, genetic statistics, disease gene mapping, population genetics
Grace S. Shieh (謝叔蓉): Biostatistics, microarray analysis, gene regulatory network prediction, protein interaction networks, comparative genomics
Ueng-Cheng Yang (楊永正): Bioinformatics, infobiology, RNA-structure analysis, RNA-protein interaction, comparative genomics, genome annotation
Origins of Data Mining
Data mining draws ideas from machine learning/pattern recognition, statistics/AI, and database systems
Traditional techniques may be unsuitable due to:
- Enormity of data
- High dimensionality of data
- Heterogeneous, distributed nature of data
Excerpted from 柯佳伶's lecture "Introduction to Data Mining" (資料探勘簡介), 2008/7/23.
Two Types of Data Mining Methods
Prediction methods: use some variables to predict unknown or future values of other variables
Description methods: find human-interpretable patterns that describe the data
Different Tasks in Data Mining
- Classification [Predictive]
- Clustering [Descriptive]
- Association rule discovery [Descriptive]
- Sequential pattern discovery [Descriptive]
- Regression [Predictive]
- Deviation detection [Predictive]
Definition of Classification
Definition: given a collection of records (i.e., a training set), where each record contains a set of attributes, find a model for the class attribute as a function of the values of the other attributes.
Goal: assign a class to previously unseen records as accurately as possible. A test set is used to determine the accuracy of the model.
Applications in Classification
Examples of classification:
- Direct marketing: predict whether a consumer is likely to buy a new cell phone
- Fraud detection: predict fraudulent cases in credit card transactions
- Customer attrition/churn: predict whether a customer is likely to be lost to a competitor
- Sky survey cataloging: predict the class (star or galaxy) of sky objects based on telescopic survey images
More applications in biology?
Protein Subcellular Localization (PSL) Prediction
Predict where a protein is located in the cell (Gram-negative bacteria):
C1: cytoplasm; C2: inner membrane; C3: periplasm; C4: outer membrane; C5: extracellular space
Gardy JL, et al. Methods for Predicting Bacterial Protein Subcellular Localization. Nature Reviews Microbiology, 2006.
Importance of PSL Prediction
- Protein function identification: modulate and identify protein functions
- Genome annotation: annotate genomic features
- Drug discovery: give clues to new drug targets
The Available Computational Methods for Bacterial PSL Prediction
Gardy JL, et al. Methods for Predicting Bacterial Protein Subcellular Localization. Nature Reviews Microbiology, 2006.
Classification Formulation
Given an input space ℜ and a set of classes Ω = {ω1, ω2, ..., ωc}, the classification problem is to define a mapping f: ℜ → Ω, where each x in ℜ is assigned to one class.
This mapping function is called a decision function.
Decision Function (1/2)
The basic problem in classification is to find c decision functions
d1(x), d2(x), ..., dc(x)
with the property that, if a pattern x belongs to class i, then
di(x) > dj(x), j = 1, 2, ..., c; j ≠ i
di(x) is a similarity measure between x and class i, based for example on distance or probability.
Decision Function (2/2)
(Example: three class regions separated by the decision boundaries d1 = d2, d1 = d3, and d2 = d3.)
Support Vector Machines (1/2)
Support Vector Machines (SVM)
- Basically a two-class classifier, introduced by Boser, Guyon, and Vapnik (1992)
- Which separating line is optimal?
Support Vector Machines (2/2)
Training vectors: xi, i = 1, ..., l
Consider a simple case with two classes. Define a vector y:
yi = 1 if xi is in class 1
yi = -1 if xi is in class 2
A hyperplane separates all the data.
(Figure: separating plane with margin ρ; the support vectors of class 1 and class 2 lie on the margin boundaries.)
Optimal Hyperplane
Formalization
Training data: D = {(x1, y1), ..., (xl, yl)}, xi ∈ R^n, yi ∈ {1, -1}
Decision function: f(x) = w·x + b, where w is the weight vector (w ≠ 0), x the input vector, and b the bias
Goal: choose w and b to maximize the margin
See: http://en.wikipedia.org/wiki/Support_vector_machine
Multiclass Classification in SVM
One-versus-rest (1-v-r) SVM modelApply a universal set of biological features for different localization classes
One-versus-one (1-v-1) SVM modelDifferent biological features can be used in distinguishing two classes
Multiclass Classification by 1-v-r SVM
Binary classifiers: for each class i, construct a Ci vs. non-Ci binary classifier
# of classifiers = 5
Input features: the same features for all binary classifiers
Class determination: the class with the largest probability (probi: the confidence that a sample is predicted as class i, 0 ≤ probi ≤ 1)
General Biological Features for PSL Prediction
1. Amino acid composition (AA)
2. Dipeptide composition (Dip)
3. Secondary structure elements (SSE)
1. Amino Acid Composition, 2. Dipeptide Composition
Amino acid composition (AA) and dipeptide composition (Dip):
- n-peptide compositions, or their variations, have been shown to be effective in PSL prediction
- If n = 1, the n-peptide composition reduces to the AA (dimension = 20)
- If n = 2, the n-peptide composition yields the Dip (dimension = 20 × 20 = 400)
(Example sequence, N-terminus to C-terminus: Met Ala Phe Leu Phe His Ala Arg Ser Val ...)
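As a concrete sketch (not code from the talk), the AA and Dip compositions can be computed as follows; the function names are illustrative:

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def aa_composition(seq):
    """20-dimensional amino acid composition: frequency of each residue type."""
    counts = Counter(seq)
    n = len(seq)
    return [counts[a] / n for a in AMINO_ACIDS]

def dip_composition(seq):
    """400-dimensional dipeptide composition: frequency of each ordered pair."""
    counts = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    total = len(seq) - 1
    return [counts[a + b] / total for a in AMINO_ACIDS for b in AMINO_ACIDS]
```

Both vectors sum to one, so proteins of different lengths become comparable fixed-length feature vectors.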
3. Secondary Structure Elements
Predicted secondary structure elements (SSE) from HYPROSP II server
Encoding scheme: compute the amino acid compositions of α-helix (H), β-strand (E), and random coil (C) residues
Protein seq.:   M P L D L Y N T L T R R K E R F E P M T P D ..
Predicted SSE:  C C E E E E C C C H H H H H H C C E E E H H ..
This yields three 20-dimensional composition vectors (over A, C, D, ..., Y), one each for the H, E, and C residues.
Training and Testing in SVM
Support Vector Machines (SVM)
- LIBSVM software
- Kernel: Radial Basis Function (RBF)
- Parameter selection: c (cost) and γ (gamma) are tuned
- 10-fold cross-validation (with validation): 8 folds for training, 1 fold for validation, 1 fold for testing
Chang CC and Lin CJ. LIBSVM: a library for support vector machines, 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm
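The talk uses LIBSVM directly; as a hedged sketch, the same RBF-kernel tuning of c (cost) and γ (gamma) with cross-validation can be written with scikit-learn's `SVC`, which wraps LIBSVM. The grid values and fold count below are illustrative, not those used in the study:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC

def tune_rbf_svm(X, y, n_folds=10):
    """Grid-search C (cost) and gamma for an RBF-kernel SVM with k-fold CV."""
    param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1]}
    search = GridSearchCV(SVC(kernel="rbf"), param_grid,
                          cv=StratifiedKFold(n_splits=n_folds))
    search.fit(X, y)  # refits the best (C, gamma) on all of X
    return search.best_estimator_, search.best_params_
```

In practice a log-spaced grid (e.g., powers of 2, as recommended in the LIBSVM guide) is usually searched.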
Gram-Negative Bacteria Data Set
A benchmark data set: ePSORTdb, 1,444 proteins
- Cytoplasmic: 278 (19%)
- Inner membrane: 309 (22%)
- Periplasmic: 276 (19%)
- Outer membrane: 391 (27%)
- Extracellular: 190 (13%)
Gardy JL, et al. PSORTb v.2.0: Expanded Prediction of Bacterial Protein Subcellular Localization and Insights Gained from Comparative Proteome Analysis. Bioinformatics, 2005.
Performance Evaluation
Accuracy (Acc):
Acc = (Σ_{i=1}^{l} TP_i) / (Σ_{i=1}^{l} N_i)
where l = 5 is the total number of localization sites, TP_i is the number of correctly predicted proteins in site i, and N_i is the number of proteins in site i.
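The accuracy formula above amounts to a one-liner; a minimal sketch:

```python
def overall_accuracy(tp, n):
    """Overall accuracy Acc = sum_i TP_i / sum_i N_i over l localization sites."""
    assert len(tp) == len(n), "one (TP_i, N_i) pair per localization site"
    return sum(tp) / sum(n)
```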
Results of 1-v-r SVM Model
Different feature combinations in the 1-v-r SVM model: AA; Dip; SSE; AA+Dip; AA+SSE; Dip+SSE; AA+Dip+SSE
Feature AA Dip SSE AA+Dip AA+SSE Dip+SSE AA+Dip+SSE
Overall Acc 85.56% 84.87% 83.26% 87.71% 84.95% 83.95% 86.25%
Multiclass Classification by 1-v-1 SVM
Binary classifiers: for each pair of classes i and j, construct a Ci vs. Cj binary classifier
# of classifiers = 5 × (5 − 1)/2 = 10
Input features: different features can be used in different classifiers
Class determination: majority votes and average probability
In case of a tie in the majority votes, the class with the largest average probability is selected as the final prediction
(1-v-1 SVM model: 10 binary classifiers C1/C2 (features F12), C1/C3 (F13), C1/C4 (F14), ..., C3/C5 (F35), C4/C5 (F45); their votes determine the predicted class among C1–C5.)
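A minimal sketch of the class-determination rule (majority vote, ties broken by the largest average probability). The data layout is an assumption for illustration, not taken from the talk:

```python
from collections import defaultdict

def predict_1v1(pairwise):
    """Combine 1-v-1 decisions into a final class.
    `pairwise` maps (i, j) -> (winner, prob_i), where prob_i is the
    probability that the sample belongs to class i (class j gets 1 - prob_i).
    """
    votes = defaultdict(int)
    prob_sum = defaultdict(float)
    for (i, j), (winner, prob_i) in pairwise.items():
        votes[winner] += 1
        prob_sum[i] += prob_i
        prob_sum[j] += 1.0 - prob_i
    best = max(votes.values())
    tied = [c for c, v in votes.items() if v == best]
    if len(tied) == 1:
        return tied[0]
    # tie: every class appears in the same number of pairs, so comparing
    # summed probabilities orders them the same way as averages
    return max(tied, key=lambda c: prob_sum[c])
```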
Accuracy and Feature Combination
Features used in 1-v-1 SVM: AA, Dip, and SSE
Advantages of the 1-v-1 SVM model:
- Flexibility of combining different features
- Better accuracy
Compartment-Specific Biological Features
- Integrate more biological features to improve accuracy
- For each binary classifier Cij, a set of compartment-specific features is incorporated: features unique to Ci or Cj
- Select features to mimic bacterial protein secretory pathways: feature selection guided by biological insights
- A binary classifier Cij distinguishes proteins localized in two different compartments: C1 vs. C2, C2 vs. C3, etc.
Compartment-Specific Features in Bacterial Secretory Pathways
C1→C2: Sig, TMα, SA
C2→C3: Sig, TMα, SSE
C1→C3: Tat, Sig, SSE, SA
C3→C4: TMβ, SSE, SA
C3→C5: SSE
C1→C5: SecretomeP, SSE
Modified from Wickner and Schekman with their permission. Wickner W and Schekman R. Protein Translocation Across Biological Membranes. Science, 2005.
More Compartment-Specific Biological Features
1. Amino acid composition (AA)
2. Dipeptide composition (Dip)
3. Secondary structure elements (SSE)
4. Solvent accessibility (SA) – C5
5. Signal peptides (Sig) – C1
6. Transmembrane α-helices (TMA) – C2
7. Transmembrane β-barrels (TMB) – C4
8. Twin-arginine translocase signal peptides (TAT) – C3
9. Non-classical protein secretion (Sec) – C5
Compartment-Specific Biological Features
Implications of different biological features for localization classes:
Feature | Description | Indicative pattern | Class
SA | Solvent accessibility | Acidic high-SA residues | C5
Sig | Signal peptides | Presence of Sig | not C1
TMA | Transmembrane α-helices | Presence of TMA | C2
TAT | Twin-arginine translocase motifs | Presence of TAT | C3
TMB | Transmembrane β-barrels | Presence of TMB | C4
Sec | Non-classical protein secretion | Presence of Sec | C5
Compartment-Specific Feature Selection
System Architecture of PSL101
PSL101 (Protein Subcellular Localization prediction by 1-On-1 classifiers)
(Architecture: an input sequence, e.g. ···RRDFLKGIASSSFVVLGGSSVLTPLN···, is encoded with biological features (AA, Dip, SSE, TMB, ...); the 1-v-1 binary classifiers SVM1 (CP/IM; features FCP,IM), SVM2 (CP/PP; FCP,PP), SVM3 (CP/OM; FCP,OM), ..., SVM10 (OM/EC; FOM,EC) are combined by majority votes and average probabilities to output the predicted localization site(s).)
Feature Selection
Motivation: it is impractical to try all possible feature combinations in the different classifiers; feature selection reduces computational costs.
Sequential forward search algorithm:
- Starting with an empty subset, keep adding the feature set that best improves the accuracy
- The process terminates when adding a feature set no longer makes any improvement
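The greedy procedure above can be sketched generically; `evaluate` stands in for whatever cross-validated accuracy the classifier yields on a candidate subset:

```python
def sequential_forward_search(feature_sets, evaluate):
    """Greedy forward selection: start empty, add the best-improving
    feature set each round, stop when nothing improves accuracy."""
    selected, best_acc = [], float("-inf")
    remaining = list(feature_sets)
    while remaining:
        scored = [(evaluate(selected + [f]), f) for f in remaining]
        acc, best_f = max(scored, key=lambda t: t[0])
        if acc <= best_acc:  # no improvement: terminate
            break
        selected.append(best_f)
        remaining.remove(best_f)
        best_acc = acc
    return selected, best_acc
```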
Accuracy and Feature Combination
More biological features used in 1-v-1 SVM: AA, Dip, SSE, SA, Sig, TMA, TMB, TAT, and Sec
3.2% improvement over applying the general features!
More Refined Encoding Schemes to Encode Protein Structures (SSE)?
Previous encoding: consider only composition of H, E, and C
SSE1 and SSE2 have the same composition but different transition and distribution!
SSE1: CCCCCCCCCCCCCCCCCCCCCHHHHHHHHHHHHHHHHHHHHHHH
SSE2: CCCHHHHHCCCHHHHHCCCCCHHHHHCCCHHHHHCCCCCHHHHH
A More Refined Feature Representation for SSE
1. Composition: the number of amino acids in each of H, E, and C
2. Transition: the percent frequency with which H↔E, H↔C, and E↔C transitions occur along the sequence
3. Distribution: the chain length within which the first, 25%, 50%, 75%, and 100% of the amino acids of a particular property are located:
H1%, H25%, H50%, H75%, H100%, E1%, E25%, E50%, E75%, E100%, C1%, C25%, C50%, C75%, and C100%
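A sketch of the composition/transition/distribution encoding for a predicted SSE string. The index convention used for the distribution descriptor is one plausible reading of Dubchak et al., not something the slides pin down:

```python
def ctd_features(sse):
    """Composition, Transition, Distribution encoding of an H/E/C string."""
    n = len(sse)
    comp = {s: sse.count(s) / n for s in "HEC"}
    # transitions: percent frequency of adjacent pairs HE/EH, HC/CH, EC/CE
    pairs = [sse[i:i + 2] for i in range(n - 1)]
    trans = {key: sum(p in (key, key[::-1]) for p in pairs) / (n - 1)
             for key in ("HE", "HC", "EC")}
    # distribution: relative chain position of the 1st, 25%, 50%, 75% and
    # 100% occurrence of each state (0.0 if the state never occurs)
    dist = {}
    for s in "HEC":
        pos = [i + 1 for i, c in enumerate(sse) if c == s]
        for frac in (0.01, 0.25, 0.50, 0.75, 1.00):
            if pos:
                idx = max(1, round(frac * len(pos)))  # at least the 1st occurrence
                dist[(s, frac)] = pos[idx - 1] / n
            else:
                dist[(s, frac)] = 0.0
    return comp, trans, dist
```

This yields 3 + 3 + 15 = 21 descriptors per SSE string, matching the descriptor list in the slide.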
Dubchak I, et al. Prediction of Protein Folding Class Using Global Description of Amino Acid Sequence. PNAS, 1995.
Accuracy and Feature Combination
Features used in 1-v-1 SVM: AA, Dip, SSE (EC1), SA, Sig, TMA, TMB, TAT, Sec, and SSE (EC2)
The new encoding scheme leads to an improvement of 0.6% in overall accuracy!
Document Classification
Vector Space Model
Salton's Vector Space Model
- Represent each document by a high-dimensional vector in the space of terms/words
- Document → a vector of terms
A Term-Document Matrix
A term-document matrix is an m×n matrix, where m is the number of terms and n is the number of documents.
A = [a_ij], where row i corresponds to term t_i (i = 1, ..., m), column j corresponds to document d_j (j = 1, ..., n), and entry a_ij is the weight of term t_i in document d_j.
Term Weighting by TF-IDF
The term frequency (tf) in a given document d measures the importance of term t_i within that document:
tf(t_i, d) = n_i / Σ_k n_k
where n_i is the number of occurrences of term t_i in d.
The inverse document frequency (idf) is obtained by dividing the number of all documents by the number of documents containing term t_i:
idf(t_i) = log(|D| / |{d : d ⊃ t_i}|)
where |D| is the total number of documents in the corpus and |{d : d ⊃ t_i}| is the number of documents in which term t_i appears.
tfidf = tf × idf
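The two formulas combine into a short routine; a minimal sketch over tokenized documents:

```python
import math
from collections import Counter

def tfidf(docs):
    """tf-idf weights per document: tf(t,d) = n_t / sum_k n_k,
    idf(t) = log(|D| / |{d : t in d}|). Returns one {term: weight} dict per doc."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))          # document frequency of each term
    n_docs = len(docs)
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({t: (c / total) * math.log(n_docs / df[t])
                        for t, c in counts.items()})
    return weights
```

Note that a term occurring in every document gets idf = log(1) = 0, so it contributes no discriminative weight.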
The Terms of Proteins - Gapped-dipeptides
Gapped-dipeptide feature representation: let XdZ denote the gapped-dipeptide of amino acid types X and Z separated by d amino acids.
Incorporation of Evolutionary Information
Position-Specific Scoring Matrix (PSSM): a PSSM is constructed from a multiple alignment of the highest-scoring hits in a PSI-BLAST search.
Weighting Scheme of Gapped-dipeptides
The weight of XdZ in a protein P of length n:
W(XdZ, P) = Σ_i f(i, X) × f(i + d + 1, Z)
where f(i, Y) denotes the normalized value of the PSSM entry at the i-th row and the column corresponding to amino acid type Y.
An example (n = 81, d = 2):
W(M2D, P) = f(1,M) × f(4,D) + f(2,M) × f(5,D) + ... + f(78,M) × f(81,D) = 0.99995×0.04743 + 0.11920×0.00247 + ... + 0.00669×0.26894
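The summation form below is inferred from the worked example (sliding both positions down the sequence with a fixed gap d); a minimal sketch given a normalized PSSM as a list of 20-column rows:

```python
def gapped_dipeptide_weight(pssm, x, z, d, alphabet="ACDEFGHIKLMNPQRSTVWY"):
    """W(XdZ, P) = sum_i f(i, X) * f(i + d + 1, Z), where f(i, Y) is the
    normalized PSSM entry at row i and the column for amino acid type Y."""
    cx, cz = alphabet.index(x), alphabet.index(z)
    n = len(pssm)
    return sum(pssm[i][cx] * pssm[i + d + 1][cz] for i in range(n - d - 1))
```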
Gapped-dipeptide Feature Representation
Problems with representing proteins by gapped-dipeptides:
- Very large number of features
- Increased computational time and complexity
With gap lengths d = 0, ..., 20, there are 8,400 (= 21 × 20 × 20) features in a vector!
Feature Reduction (1/3)
Probabilistic Latent Semantic Analysis (PLSA)
Create a mapping between documents and terms via latent concepts
Feature Reduction
Dimension reduction: reduce the number of gapped-dipeptides by Probabilistic Latent Semantic Analysis (PLSA)
(Illustration: true plot in k dimensions vs. reduced-dimensionality plot)
Feature Reduction (2/3)
A joint probability between a term w and a document d can be modeled as:
P(w, d) = P(d) Σ_{z∈Z} P(w|z) P(z|d)
where z is a latent variable with a "small" number of states, P(w|z) are the concept expression probabilities, and P(z|d) are the document-specific mixing proportions.
The parameters can be estimated by maximizing the likelihood with the EM algorithm.
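A minimal EM sketch for the model above, operating on a term-document count matrix. This is a generic PLSA implementation, not the one used in PSLDoc:

```python
import numpy as np

def plsa(counts, n_topics, n_iter=50, seed=0):
    """Minimal PLSA via EM on a (terms x docs) count matrix.
    Learns P(w|z) and P(z|d) for P(w, d) = P(d) * sum_z P(w|z) P(z|d)."""
    rng = np.random.default_rng(seed)
    n_terms, n_docs = counts.shape
    p_w_z = rng.random((n_terms, n_topics))
    p_w_z /= p_w_z.sum(axis=0, keepdims=True)
    p_z_d = rng.random((n_topics, n_docs))
    p_z_d /= p_z_d.sum(axis=0, keepdims=True)
    for _ in range(n_iter):
        # E-step: responsibilities P(z|w,d) ∝ P(w|z) P(z|d)
        joint = p_w_z[:, :, None] * p_z_d[None, :, :]   # terms x topics x docs
        joint /= joint.sum(axis=1, keepdims=True) + 1e-12
        # M-step: reweight responsibilities by observed counts
        resp = counts[:, None, :] * joint
        p_w_z = resp.sum(axis=2)
        p_w_z /= p_w_z.sum(axis=0, keepdims=True) + 1e-12
        p_z_d = resp.sum(axis=0)
        p_z_d /= p_z_d.sum(axis=0, keepdims=True) + 1e-12
    return p_w_z, p_z_d
```

The columns of P(z|d) are the reduced document representations; in PSLDoc these topic vectors replace the >8,000 gapped-dipeptide dimensions.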
Feature Reduction (3/3)
(Illustration: documents, represented as vectors over terms 1–5 in the term space, are mapped to a lower-dimensional latent space.)
System Architecture of PSLDoc
PSLDoc (Protein Subcellular Localization prediction by Document classification)
Performance Comparison with Other Approaches
PSLDoc compares favorably with other methods, improving overall accuracy by 1.51%.
Chang JM, et al. PSLDoc: Protein Subcellular Localization Prediction Based on Gapped-dipeptides and Probabilistic Latent Semantic Analysis. Proteins: Structure, Function, and Bioinformatics, 2008.
Effect of PLSA Feature Reduction (1/2)
Distribution of topic versus proteins
Use PLSA to reduce gapped-dipeptides to different topics
Effect of PLSA Feature Reduction (2/2)
Distribution of topic versus localization sites
Gapped-dipeptide Signatures for Each Localization
Loc. Gapped-dipeptide Signatures
CP: E0E, K1I, K5V, K1V, D0E; L1H, L5H, L3H, H4L, H0L; A12C, A9C, A13C, A5C, A7C; R3R, R6R, R2R, R0R, R9R; A6A, A13A, A7A, A10A, A11A; I0E, R6I, I3R, I3K, R6V; H3H, H1H, H7H, H13H, H10H; H1M, H2M, H11M, M0H, H0M; A4E, E1E, A2E, V4E, A9E; E4E, K6E, E6E, E3E, E0E
IM: I2I, I3I, I0I, L0I, I0F; L7L, L4L, L10L, L3L, L6L; M3M, M2M, M0M, M8M, M6M; V2I, V2V, V3I, V3V, I0V; T2F, T6F, F3F, T4F, T8F; A1A, A7L, A4A, A1C, A11L; W3W, W0W, W2W, W6W, W4W; Y12L, Y1L, Y11L, L0Y, L1L; M2T, M3T, M10T, M4T, M0L; F10P, F8P, F12P, F3P, F13P
PP: A1A, A2A, A0A, A3A, M4A; M0H, W1Q, W1H, W1K, W5Q; P1E, P0E, E0P, P0K, E1P; D0D, Q0D, D3D, D3Q, D11D; W0E, E4W, W11E, E0W, W13E; K3K, K0K, K2K, K1K, K7K; A3A, A7A, A1P, A6R, A10R; P3N, N4P, N3P, N5P, N0P; H6G, G3M, H7D, G11H, H11G; A10A, A11A, A6A, A12A, A3A
OM: T1R, R3T, R1T, T5R, P0P; R0F, R4F, Y13R, R6F, R2F; N4N, N0N, N10N, N7N, F1N; Q6Q, Q1Q, Q3Q, Q13Q, Q4Q; S0F, A3F, F0S, R9F, F7F; G0G, A0G, A1G, G1A, G3A; N1Q, N1N, Q1Q, N12N, Q11V; W2N, N2W, N0W, D2W, N13W; Q5R, R1Q, Q1R, Q3R, R2Q; Y1Y, Y0Y, Y5Y, Y4Y, Y12Y
EC: S6S, S2S, T11T, S13S, T6S; G8G, G0G, G7G, G9G, G6G; T1T, T3T, T5T, T9T, T10T; N10N, N9N, N13N, N11N, N12N; N1N, N3N, N4N, N11N, N1T; I5Y, Y12S, Y3S, Y9S, Y6I; Q2N, N1Q, Q1Q, N3Q, Q7Q; K1S, S6S, S5S, S11M, S0S; S3G, G3G, G4S, G3S, G2G; N0N, N12V, N4V, V12N, N9V
Amino Acid Compositions of Single Residues and Gapped-dipeptide Signatures
Grouped Amino Acid Compositions of Single Residues and Gapped-dipeptide Signatures
Amino acid grouping: nonpolar (AIGLMV), polar (CNPQST), charged (DEHKR), and aromatic (FYW)
Interaction – Localization Networks in the Secretory Pathway
Co-localized proteins tend to interact with each other!
Scott MS, et al. Refining Protein Subcellular Localization. PLoS Computational Biology, 2005.
Interaction – Localization Networks in ER
Interactions of ER periphery proteins are almost exclusively between members of this group or with cytosolic proteins.
Scott MS, et al. Refining Protein Subcellular Localization. PLoS Computational Biology, 2005.
Gene Expression Level – Localization Relationship
Expression levels are clearly correlated with localization
Drawid A, et al. Genome-wide analysis relating expression level with protein subcellular localization. Trends in Genetics, 2000.
The Discriminative Impact of Features for Different Functional Categories
Subcellular localization-related features are important for elucidating protein function.
Jensen LJ, et al. Prediction of Human Protein Function from Post-translational Modifications and Localization Features. J. Mol. Biol., 2000.
Conclusion (1/2)
SVM models:
- 1-v-1 outperforms 1-v-r in the flexibility of integrating different features
Compartment-specific biological features:
- Compartment-specific features improve accuracy
- A refined feature representation leads to Acc = 92.0% in PSL101 (vs. 90.0% in CELLO II)

Model | Biological features | Acc
1-v-r SVM | AA, Dip, SSE (EC1) | 87.7%
1-v-1 SVM | AA, Dip, SSE (EC1) | 88.2%
1-v-1 SVM | AA, Dip, SSE (EC1), SA, Sig, TMA, TMB, TAT, Sec | 91.4%
1-v-1 SVM | AA, Dip, SSE (EC1), SA, Sig, TMA, TMB, TAT, Sec, SSE (EC2) | 92.0%
Conclusion (2/2)
Proteins represented by gapped-dipeptides: gapped-dipeptides capture remote relationships in the primary sequence.
Effects of feature reduction: PLSA greatly reduces the feature dimensionality (from >8,000 features) without sacrificing performance.
Take Home Message
5 key factors in using data mining to solve a research problem:
- Data set: is the data set large enough for analysis?
- Feature extraction: what kind of feature(s) could be useful for solving this problem?
- Feature representation: how can these features be effectively represented/encoded?
- Data mining approaches: which technique(s) are suitable for this problem?
- Evaluation: what is the performance of this method?
References
Emily Chia-Yu Su, Hua-Sheng Chiu, Allan Lo, Jenn-Kang Hwang, Ting-Yi Sung, and Wen-Lian Hsu, "Protein subcellular localization prediction based on compartment-specific features and structure conservation," BMC Bioinformatics, 8:330, (2007).
Jia-Ming Chang, Emily Chia-Yu Su, Allan Lo, Hua-Sheng Chiu, Ting-Yi Sung, and Wen-Lian Hsu, "PSLDoc: Protein subcellular localization prediction based on gapped-dipeptides and probabilistic latent semantic analysis," PROTEINS: Structure, Function, and Bioinformatics, 72(2): 693, (2008).
Thank You!
Questions?
4. Solvent Accessibility – (1)
Proteins in different localization sites have different solvent accessibility (SA):
- A balance of acidic and basic exposed residues → cytoplasmic proteins (C1)
- A slight excess of acidic exposed residues → extracellular proteins (C5)
Exposed and buried residues: let SAi be the predicted SA of the residue at position i (0–100%)
ACADEMIA SINICAInstitute of Information Science
4. Solvent Accessibility – (2)
Protein seq.:     M  P  L  D  L  Y  N  T  L  T  R  R  K  E  R  F  E  P  M  T  P  D  R  V
Pred. SA (%):     16 75 53 45 62 86 22 19 8  67 ...
Buried/exposed:   B  E  E  E  E  E  B  B  B  E  ...
Encoding scheme of SA: compute the amino acid compositions of exposed and buried residues, yielding two 20-dimensional vectors (over A, C, D, ..., Y), one for buried (B) and one for exposed (E) residues.
5. Signal Peptides – (1)
Signal peptides (Sig) from the SignalP 3.0 server
Signal peptides: N-terminal peptides, between 15 and 40 amino acids long, that target proteins for translocation through the general secretory pathway
Presence of a signal peptide → not cytoplasmic (not C1)
5. Signal Peptides – (2)
Sig features extracted from SignalP server
6. Transmembrane α-helices – (1)
Transmembrane α-helices (TMA) from the TMHMM 2.0 server
Integral inner membrane proteins are characterized by transmembrane α-helices
Presence of transmembrane α-helices → inner membrane proteins (C2)
6. Transmembrane α-helices – (2)
TMA features extracted from the TMHMM server
7. Transmembrane β-barrels – (1)
Transmembrane β-barrels (TMB) from the TMB-Hunt server
Many proteins residing in the outer membrane are characterized by β-barrel structures
Presence of β-barrel structures → outer membrane proteins (C4)
Wong et al. (2001) J Bacteriol 183:367
7. Transmembrane β-barrels – (2)
TMB features extracted from TMB-Hunt server
8. Twin-Arginine Translocase – (1)
Twin-arginine translocase signal peptides (TAT) from the TatP 1.0 server
The TAT system exports proteins from the cytoplasm to the periplasm
Proteins translocated by TAT bear a unique twin-arginine motif
Presence of a TAT motif → periplasmic proteins (C3)
8. Twin-Arginine Translocase – (2)
TAT features extracted from the TatP server
9. Non-classical Protein Secretion – (1)
Non-classical protein secretion (Sec) from the SecretomeP 2.0 server
Several extracellular proteins can be secreted without a classical N-terminal signal peptide
Identification of non-classical protein secretion → extracellular proteins (C5)
9. Non-classical Protein Secretion – (2)
Sec features extracted from the SecretomeP server