
  • Protein Subcellular Localization Prediction Based on Machine Learning Approaches

    Wen-Lian Hsu (許聞廉) and Emily Chia-Yu Su (蘇家玉)

    Institute of Information Science, Academia Sinica

    August 5, 2008

    NTNU Data Mining Course

  • Outline

    Introduction
      - Data mining
      - Protein subcellular localization prediction
    A support vector machine model-based method
      - Support vector machines
      - Compartment-specific biological features
    A probabilistic latent semantic analysis-based method
      - Gapped-dipeptides
      - Probabilistic latent semantic analysis
    Applications of protein localization prediction
    Conclusion

  • About Myself

    Education
      - Ph.D. candidate, Bioinformatics Program, Taiwan International Graduate Program (TIGP), Academia Sinica
      - M.S., Department of Computer Science and Information Engineering, National Taiwan University
      - B.S., Department of Information and Computer Education, National Taiwan Normal University

    Research interests
      - Bioinformatics, computational biology, machine learning, data mining, text mining, natural language processing, information retrieval/extraction

  • About TIGP

    TIGP programs
      - Chemical Biology and Molecular Biophysics (CBMB, 2002)
      - Molecular Science and Technology (MST, 2002)
      - Molecular and Biological Agricultural Sciences (MBAS, 2003)
      - Bioinformatics (Bio, 2003)
      - Molecular and Cell Biology (MCB, 2003)
      - Nano Science and Technology (Nano, 2003)
      - Molecular Medicine (MM, 2004)
      - Computational Linguistics and Chinese Language Processing (CLCLP, 2005)
      - Earth System Science (ESS, will admit students in 2009)

    Website: http://tigp.sinica.edu.tw/

  • About Bioinformatics Program (1/2)

    Wen-Lian Hsu (許聞廉): Natural language processing, literature mining, proteomics, protein structure prediction

    Der-Tsai Lee (李德財): Computational geometry, parallel and distributed computing, web-based computing, digital libraries, bioinformatics

    Wen-Hsiung Li (李文雄): Molecular evolution, comparative genomics, population genetics, evolution of gene regulation, computational biology

    Wen-Chang Lin (林文昌): Bioinformatics, tumor biology, cancer metastasis

  • About Bioinformatics Program (2/2)

    Jenn-Kang Hwang (黃鎮剛): Structure prediction and classification, protein stability, structural alignment, molecular simulation

    Cathy S.J. Fann (范盛娟): Biostatistics, genetic epidemiology, genetic statistics, disease gene mapping, population genetics

    Grace S. Shieh (謝叔蓉): Biostatistics, microarray analysis, gene regulatory network prediction, protein interaction networks, comparative genomics

    Ueng-Cheng Yang (楊永正): Bioinformatics, infobiology, RNA-structure analysis, RNA-protein interaction, comparative genomics, genome annotation

  • Outline

    Introduction
      - Data mining
      - Protein subcellular localization prediction
    A support vector machine model-based method
      - Support vector machines
      - Compartment-specific biological features
    A probabilistic latent semantic analysis-based method
      - Gapped-dipeptides
      - Probabilistic latent semantic analysis
    Applications of protein localization prediction
    Conclusion

  • Origins of Data Mining

    Data mining draws ideas from machine learning/pattern recognition, statistics/AI, and database systems

    Traditional techniques may be unsuitable due to
      - Enormity of data
      - High dimensionality of data
      - Heterogeneous, distributed nature of data

    [Figure: data mining at the intersection of machine learning/pattern recognition, statistics/AI, and database systems]

    Adapted from Prof. 柯佳伶, "Introduction to Data Mining" (2008/7/23)

  • Two Types of Data Mining Methods

    Prediction methods: use some variables to predict unknown or future values of other variables

    Description methods: find human-interpretable patterns that describe the data

    Adapted from Prof. 柯佳伶, "Introduction to Data Mining" (2008/7/23)

  • Different Tasks in Data Mining

    - Classification [Predictive]
    - Clustering [Descriptive]
    - Association rule discovery [Descriptive]
    - Sequential pattern discovery [Descriptive]
    - Regression [Predictive]
    - Deviation detection [Predictive]

    Adapted from Prof. 柯佳伶, "Introduction to Data Mining" (2008/7/23)

  • Definition of Classification

    Definition: given a collection of records (i.e., a training set)
      - Each record contains a set of attributes
      - Find a model for the class attribute as a function of the values of the other attributes

    Goal: assign a class to previously unseen records as accurately as possible
      - A test set is used to determine the accuracy of the model

    Adapted from Prof. 柯佳伶, "Introduction to Data Mining" (2008/7/23)

  • Applications in Classification

    Examples of classification:
      - Direct marketing: predict whether a consumer is likely to buy a new cell phone
      - Fraud detection: predict fraudulent cases in credit card transactions
      - Customer attrition/churn: predict whether a customer is likely to be lost to a competitor
      - Sky survey cataloging: predict the class (star or galaxy) of sky objects based on telescopic survey images

    More applications in biology?

    Adapted from Prof. 柯佳伶, "Introduction to Data Mining" (2008/7/23)

  • Outline

    Introduction
      - Data mining
      - Protein subcellular localization prediction
    A support vector machine model-based method
      - Support vector machines
      - Compartment-specific biological features
    A probabilistic latent semantic analysis-based method
      - Gapped-dipeptides
      - Probabilistic latent semantic analysis
    Applications of protein localization prediction
    Conclusion

  • Protein Subcellular Localization (PSL) Prediction

    Predict where a protein is located in a cell. For Gram-negative bacteria:
      - C1: cytoplasm
      - C2: inner membrane
      - C3: periplasm
      - C4: outer membrane
      - C5: extracellular space

    Gardy JL, et al. Methods for Predicting Bacterial Protein Subcellular Localization. Nature Reviews Microbiology, 2006.

  • Importance of PSL Prediction

    Protein function identification: modulate and identify protein functions

    Genome annotation: annotate genomic features

    Drug discovery: give clues to new drug targets

  • The Available Computational Methods for Bacterial PSL Prediction

    [Table: existing bacterial PSL predictors, surveyed in Gardy JL, et al. Methods for Predicting Bacterial Protein Subcellular Localization. Nature Reviews Microbiology, 2006.]

  • Outline

    Introduction
      - Data mining
      - Protein subcellular localization prediction
    A support vector machine model-based method
      - Support vector machines
      - Compartment-specific biological features
    A probabilistic latent semantic analysis-based method
      - Gapped-dipeptides
      - Probabilistic latent semantic analysis
    Applications of protein localization prediction
    Conclusion

  • Classification Formulation

    Given
      - an input space $\Re$
      - a set of classes $\Omega = \{\omega_1, \omega_2, \dots, \omega_c\}$

    the classification problem is to define a mapping $f: \Re \rightarrow \Omega$ where each $x$ in $\Re$ is assigned to one class

    This mapping function is called a decision function

  • Decision Function (1/2)

    The basic problem in classification is to find $c$ decision functions

    $$d_1(x), d_2(x), \dots, d_c(x)$$

    with the property that, if a pattern $x$ belongs to class $i$, then

    $$d_i(x) > d_j(x), \quad j = 1, 2, \dots, c;\ j \neq i$$

    $d_i(x)$ is a similarity measure between $x$ and class $i$, such as a distance or a probability

  • Decision Function (2/2)

    Example

    [Figure: the boundaries $d_1 = d_2$, $d_1 = d_3$, and $d_3 = d_2$ partition the feature space into three class regions]

  • Support Vector Machines (1/2)

    Support Vector Machines (SVM)
      - Basically a 2-class classifier developed by Vapnik and Chervonenkis (1992)
      - Among the many lines that separate the two classes, which line is optimal?

  • Support Vector Machines (2/2)

    Training vectors: $x_i$, $i = 1, \dots, l$

    Consider a simple case with two classes. Define a label vector $y$ with
      - $y_i = +1$ if $x_i$ is in class 1
      - $y_i = -1$ if $x_i$ is in class 2

    Find a hyperplane that separates all the data

    [Figure: a separating plane with margin $\rho$; the closest points of class 1 and class 2 to the plane are the support vectors]

  • Optimal Hyperplane

    Formalization

    Training data: $D = \{(x_1, y_1), \dots, (x_l, y_l)\}$, $x \in R^n$, $y \in \{1, -1\}$

    The decision function: $f(x) = \langle w, x \rangle + b$, where $w$ is the weight vector, $x$ the input vector, and $b$ the bias; the separating hyperplane is $\langle w, x \rangle + b = 0$

    Goal: choose $w$ and $b$ to maximize the margin

    See also: http://en.wikipedia.org/wiki/Support_vector_machine

  • Multiclass Classification in SVM

    One-versus-rest (1-v-r) SVM model
      - Applies a universal set of biological features for all localization classes

    One-versus-one (1-v-1) SVM model
      - Different biological features can be used to distinguish each pair of classes

  • Multiclass Classification by 1-v-r SVM

    Binary classifiers: for each class i, construct a Ci vs. non-Ci binary classifier
      - # of classifiers = 5

    Input features: the same features for all binary classifiers

    Class determination: the class with the largest probability wins (prob_i: the confidence that a sample is predicted as class i, with 0 ≤ prob_i ≤ 1); a sketch follows below
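    The class-determination rule above takes only a few lines. Below is a minimal one-versus-rest sketch in Python, assuming scikit-learn (the talk itself uses LIBSVM); X_train, y_train, and X_test are placeholder data.

```python
# A minimal one-versus-rest sketch with scikit-learn (an assumption; the talk
# uses LIBSVM directly). X_* are n_samples x n_features arrays; y_train holds
# the five localization labels. All inputs here are placeholders.
import numpy as np
from sklearn.svm import SVC

def predict_1vr(X_train, y_train, X_test, classes):
    """Train one Ci-vs-non-Ci classifier per class; pick the most confident."""
    probs = np.zeros((len(X_test), len(classes)))
    for k, c in enumerate(classes):
        clf = SVC(kernel="rbf", probability=True)
        clf.fit(X_train, (y_train == c).astype(int))   # Ci vs. non-Ci
        probs[:, k] = clf.predict_proba(X_test)[:, 1]  # confidence for class c
    return [classes[i] for i in probs.argmax(axis=1)]  # largest prob wins
```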

  • General Biological Features for PSL Prediction

    1. Amino acid composition (AA)
    2. Dipeptide composition (Dip)
    3. Secondary structure elements (SSE)

  • 1. Amino Acid Composition, 2. Dipeptide Composition

    Amino acid composition (AA) and dipeptide composition (Dip)
      - n-peptide compositions, or their variations, have been shown to be effective in PSL prediction
      - If n = 1, the n-peptide composition reduces to the AA (dimension = 20)
      - If n = 2, the n-peptide composition yields the Dip (dimension = 20 * 20 = 400)

    [Figure: a protein sequence read from the N-terminus (Met Ala Phe Leu Phe His Ala Arg Ser Val ...) to the C-terminus]
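    Both compositions are straightforward to compute from a plain sequence string; a minimal sketch, with the example sequence as the only input:

```python
# Amino acid composition (20 dims) and dipeptide composition (400 dims)
# computed from a plain sequence string.
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def aa_composition(seq):
    """Fraction of each of the 20 amino acids in the sequence."""
    return [seq.count(a) / len(seq) for a in AMINO_ACIDS]

def dip_composition(seq):
    """Fraction of each of the 20*20 = 400 adjacent residue pairs."""
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    return [pairs.count(a + b) / len(pairs)
            for a, b in product(AMINO_ACIDS, repeat=2)]

print(aa_composition("MAFLFHARSV")[:3])  # fractions of A, C, D in a toy sequence
```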

  • 3. Secondary Structure Elements

    Predicted secondary structure elements (SSE) from the HYPROSP II server

    Encoding scheme: compute the amino acid composition within each SSE state: α-helix (H), β-strand (E), and random coil (C); a sketch follows below

    Protein seq.    M P L D L Y N T L T R R K E R F E P M T P D ...
    Predicted SSEs  C C E E E E C C C H H H H H H C C E E E H H ...

    [Example encoding: a 3 x 20 matrix of per-state amino acid compositions, columns A, C, D, ..., Y and rows H, E, C, e.g. row H = (0.12, 0.02, 0.03, ..., 0.05)]
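    A sketch of this encoding, assuming the sequence and its predicted SSE string are aligned character-for-character:

```python
# SSE encoding: amino acid composition computed separately within predicted
# helix (H), strand (E), and coil (C) positions, giving 3 x 20 = 60 features.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def sse_composition(seq, sse):
    features = []
    for state in "HEC":
        residues = [a for a, s in zip(seq, sse) if s == state]
        n = len(residues) or 1                     # avoid division by zero
        features += [residues.count(a) / n for a in AMINO_ACIDS]
    return features

print(len(sse_composition("MPLDLYNTLT", "CCEEEECCCH")))  # -> 60
```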

  • Training and Testing in SVM

    Support Vector Machines (SVM)
      - LIBSVM software
      - Kernel: Radial Basis Function (RBF)
      - Parameter selection: c (cost) and γ (gamma) are tuned (a tuning sketch follows below)

    10-fold cross-validation (with validation)
      - 8 folds for training
      - 1 fold for validation
      - 1 fold for testing

    Chang CC and Lin CJ, LIBSVM: a library for support vector machines, 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm
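    A sketch of the tuning protocol using scikit-learn's SVC, which wraps LIBSVM; the parameter grids below are assumptions, since the slide only says that c and gamma are tuned under cross-validation:

```python
# Grid search over C (cost) and gamma for an RBF-kernel SVM, with 10-fold
# cross-validation. The power-of-two grids are a common convention and an
# assumption here; X_train and y_train are placeholders.
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {"C": [2**k for k in range(-5, 16, 2)],
              "gamma": [2**k for k in range(-15, 4, 2)]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=10)
# search.fit(X_train, y_train)      # placeholder training data
# search.best_params_               # the selected c and gamma
```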

  • Gram-Negative Bacteria Data Set

    A benchmark data set: ePSORTdb, 1,444 proteins

    Localization     Proteins  Fraction
    Cytoplasmic      278       19%
    Inner membrane   309       22%
    Periplasmic      276       19%
    Outer membrane   391       27%
    Extracellular    190       13%

    Gardy JL, et al. PSORTb v.2.0: Expanded Prediction of Bacterial Protein Subcellular Localization and Insights Gained from Comparative Proteome Analysis. Bioinformatics, 2005.

  • Performance Evaluation

    Accuracy (Acc):

    $$\mathrm{Acc} = \sum_{i=1}^{l} TP_i \Big/ \sum_{i=1}^{l} N_i$$

    where $l = 5$ is the number of localization sites, $TP_i$ is the number of correctly predicted proteins in site $i$, and $N_i$ is the number of proteins in site $i$
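    Written out in code, the measure is simply the correctly predicted proteins over all proteins; the per-site true-positive counts below are made-up placeholders:

```python
# Overall accuracy from the slide's formula: sum of true positives over the
# sum of proteins, across the l = 5 localization sites.
def overall_accuracy(tp, n):
    """tp[i]: correctly predicted proteins in site i; n[i]: proteins in site i."""
    return sum(tp) / sum(n)

# n matches the benchmark data set; tp values are illustrative placeholders.
print(overall_accuracy([250, 280, 240, 360, 160], [278, 309, 276, 391, 190]))
```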

  • Results of the 1-v-r SVM Model

    Different feature combinations in the 1-v-r SVM model:

    Feature      AA      Dip     SSE     AA+Dip  AA+SSE  Dip+SSE  AA+Dip+SSE
    Overall Acc  85.56%  84.87%  83.26%  87.71%  84.95%  83.95%   86.25%

  • Multiclass Classification by 1-v-1 SVM

    Binary classifiers: for each pair of classes i and j, construct a Ci vs. Cj binary classifier
      - # of classifiers = 5 * (5 - 1) / 2 = 10

    Input features: different features can be used in different classifiers

    Class determination:
      - Majority votes
      - Average probability: in case of a tie in majority votes, the class with the largest average probability is selected as the final predicted class (see the sketch below)

    [Figure: the ten pairwise classifiers C1/C2 (features F12), C1/C3 (F13), C1/C4 (F14), ..., C3/C5 (F35), C4/C5 (F45) vote among classes C1-C5]
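    A sketch of this class-determination rule; `pairwise` maps each class pair to its winning class and probability, and its contents here are placeholders:

```python
# 1-v-1 class determination: majority vote over the 10 pairwise classifiers,
# with average probability breaking ties, as described on the slide.
from collections import defaultdict

def decide_1v1(pairwise):
    votes = defaultdict(int)
    prob_sum, prob_cnt = defaultdict(float), defaultdict(int)
    for (i, j), (winner, prob) in pairwise.items():
        votes[winner] += 1
        prob_sum[winner] += prob
        prob_cnt[winner] += 1
    best = max(votes.values())
    tied = [c for c, v in votes.items() if v == best]
    # tie-break on average probability over the classifiers each class won
    return max(tied, key=lambda c: prob_sum[c] / prob_cnt[c])

print(decide_1v1({(1, 2): (1, 0.9), (1, 3): (1, 0.7), (2, 3): (3, 0.8)}))  # -> 1
```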

  • Accuracy and Feature Combination

    Features used in the 1-v-1 SVM: AA, Dip, and SSE

    Advantages of the 1-v-1 SVM model:
      - Flexibility in combining different features
      - Better accuracy

  • Outline

    Introduction
      - Data mining
      - Protein subcellular localization prediction
    A support vector machine model-based method
      - Support vector machines
      - Compartment-specific biological features
    A probabilistic latent semantic analysis-based method
      - Gapped-dipeptides
      - Probabilistic latent semantic analysis
    Applications of protein localization prediction
    Conclusion

  • Compartment-Specific Biological Features

    Integrate more biological features to improve accuracy
      - For each binary classifier Cij, a set of compartment-specific features is incorporated: features unique to Ci or Cj
      - Select features to mimic bacterial protein secretory pathways

    Feature selection guided by biological insights
      - A binary classifier Cij distinguishes proteins localized in two different compartments: C1 vs. C2, C2 vs. C3, etc.

  • Compartment-Specific Features in Bacterial Secretory Pathways

    Features assigned to each pairwise classifier:
      - C1 vs. C2: Sig, TMα, SA
      - C2 vs. C3: Sig, TMα, SSE
      - C1 vs. C3: Tat, Sig, SSE, SA
      - C3 vs. C4: TMβ, SSE, SA
      - C3 vs. C5: SSE
      - C1 vs. C5: SecretomeP, SSE

    [Figure: bacterial protein translocation pathways across the inner and outer membranes (e.g., Omp85); modified from Wickner and Schekman with their permission]

    Wickner W and Schekman R. Protein Translocation Across Biological Membranes. Science, 2005.

  • More Compartment-Specific Biological Features

    1. Amino acid composition (AA)
    2. Dipeptide composition (Dip)
    3. Secondary structure elements (SSE)
    4. Solvent accessibility (SA) – C5
    5. Signal peptides (Sig) – C1
    6. Transmembrane α-helices (TMA) – C2
    7. Transmembrane β-barrels (TMB) – C4
    8. Twin-arginine translocase signal peptides (TAT) – C3
    9. Non-classical protein secretion (Sec) – C5

  • Compartment-Specific Biological Features

    Implication of the different biological features for the localization classes:

    Fea.  Description                      Feature                  Classes
    SA    Solvent accessibility            Acidic high-SA residues  C5
    Sig   Signal peptides                  Presence of Sig          not C1
    TMA   Transmembrane α-helices          Presence of TMA          C2
    TAT   Twin-arg translocase motifs      Presence of TAT          C3
    TMB   Transmembrane β-barrels          Presence of TMB          C4
    Sec   Non-classical protein secretion  Presence of Sec          C5

  • System Architecture of PSL101

    PSL101 (Protein Subcellular Localization prediction by 1-On-1 classifiers)

    [Figure: an input sequence (e.g., ...RRDFLKGIASSSFVVLGGSSVLTPLN...) is encoded with biological features (AA, Dip, SSE, TMB, ...); the ten 1-v-1 binary classifiers SVM1 (CP vs. IM, features F_CP,IM), SVM2 (CP vs. PP, F_CP,PP), SVM3 (CP vs. OM, F_CP,OM), ..., SVM10 (OM vs. EC, F_OM,EC) vote, and majority votes with average probabilities yield the predicted localization site(s); the name alludes to Taipei 101]

  • Feature Selection

    Motivation: it is impractical to try all possible feature combinations in the different classifiers; feature selection reduces computational cost

    Sequential forward search algorithm:
      - Starting with an empty subset, keep adding the feature set that best improves the accuracy (see the sketch after this list)
      - The process terminates when adding a feature set no longer makes any improvement
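    As referenced above, a sketch of the sequential forward search; `evaluate` is an assumed callback that trains the classifier on a candidate list of feature sets and returns its cross-validation accuracy:

```python
# Sequential forward search: greedily add the feature set that improves
# accuracy most; stop when nothing helps. `candidate_sets` could be names
# like "AA", "Dip", "SSE", "SA", ...; `evaluate` is an assumed callback.
def sequential_forward_search(candidate_sets, evaluate):
    selected, best_acc = [], 0.0
    while True:
        gains = [(evaluate(selected + [f]), f)
                 for f in candidate_sets if f not in selected]
        if not gains:
            break
        acc, f = max(gains)
        if acc <= best_acc:        # no feature set improves accuracy: stop
            break
        selected.append(f)
        best_acc = acc
    return selected, best_acc
```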

  • Accuracy and Feature Combination

    More biological features used in the 1-v-1 SVM: AA, Dip, SSE, SA, Sig, TMA, TMB, TAT, and Sec

    A 3.2% improvement over applying only the general features!

  • More Refined Encoding Schemes to Encode Protein Structures (SSE)?

    The previous encoding considers only the composition of H, E, and C

    SSE1 and SSE2 below have the same composition but different transition and distribution!

    SSE1: CCCCCCCCCCCCCCCCCCCCCHHHHHHHHHHHHHHHHHHHHHHH
    SSE2: CCCHHHHHCCCHHHHHCCCCCHHHHHCCCHHHHHCCCCCHHHHH

  • A More Refined Feature Representation for SSE

    1. Composition: the number of amino acids in states H, E, and C
    2. Transition: the percent frequency with which H↔E, H↔C, and E↔C transitions occur
    3. Distribution: the chain position within which the first 1%, 25%, 50%, 75%, and 100% of the amino acids of a particular state are located: H1%, H25%, H50%, H75%, H100%, E1%, E25%, E50%, E75%, E100%, C1%, C25%, C50%, C75%, and C100% (a sketch follows below)

    Dubchak I, et al. Prediction of Protein Folding Class Using Global Description of Amino Acid Sequence. PNAS, 1995.
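    A sketch of this composition/transition/distribution encoding of a predicted SSE string; the exact normalizations (fractions vs. counts) are assumptions:

```python
# Composition/transition/distribution (CTD) features of an SSE string,
# in the spirit of Dubchak et al. (1995); normalizations are assumptions.
def ctd_features(sse):
    n = len(sse)
    comp = [sse.count(s) / n for s in "HEC"]                   # composition
    trans = []                                                 # transition
    for a, b in ("HE", "HC", "EC"):
        t = sum(1 for x, y in zip(sse, sse[1:]) if {x, y} == {a, b})
        trans.append(t / (n - 1))
    dist = []                                                  # distribution
    for s in "HEC":
        pos = [i + 1 for i, x in enumerate(sse) if x == s]
        for frac in (0.01, 0.25, 0.50, 0.75, 1.00):
            k = max(1, round(frac * len(pos))) if pos else 0
            dist.append(pos[k - 1] / n if pos else 0.0)        # chain position
    return comp + trans + dist   # 3 + 3 + 15 = 21 features

print(len(ctd_features("CCCHHHHHCCCHHHHHCCCCCHHHHH")))  # -> 21
```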

  • Accuracy and Feature Combination

    Features used in the 1-v-1 SVM: AA, Dip, SSE (EC1), SA, Sig, TMA, TMB, TAT, Sec, and SSE (EC2)

    The new encoding scheme leads to an improvement of 0.6% in overall accuracy!

  • Outline

    Introduction
      - Data mining
      - Protein subcellular localization prediction
    A support vector machine model-based method
      - Support vector machines
      - Compartment-specific biological features
    A probabilistic latent semantic analysis-based method
      - Gapped-dipeptides
      - Probabilistic latent semantic analysis
    Applications of protein localization prediction
    Conclusion

  • Document Classification

  • Vector Space Model

    Salton's Vector Space Model: represent each document by a high-dimensional vector in the space of terms/words

    Document → a vector of terms

    [Photo: Gerard Salton]

  • A Term-Document Matrix

    A term-document matrix is an m×n matrix where m is the number of terms and n is the number of documents:

    $$A = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{bmatrix}$$

    with one row per term $t_1, \dots, t_m$ and one column per document $d_1, \dots, d_n$

  • Term Weighting by TFIDF

    The term frequency (tf) of term $t_i$ in a given document $d$ measures the importance of $t_i$ within that particular document:

    $$tf(t_i, d) = \frac{n_i}{\sum_k n_k}$$

    where $n_i$ is the number of occurrences of term $t_i$ in $d$

    The inverse document frequency (idf) is obtained by dividing the number of all documents by the number of documents containing $t_i$:

    $$idf(t_i) = \log \frac{|D|}{|\{d : d \supset t_i\}|}$$

    where $|D|$ is the total number of documents in the corpus and $|\{d : d \supset t_i\}|$ is the number of documents in which $t_i$ appears

    $$tfidf = tf \times idf$$
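    A sketch of tf-idf over toy documents; here each "document" is a protein and the tokens play the role of gapped-dipeptides, anticipating the following slides:

```python
# tf-idf weights matching the formulas above; `docs` are toy token lists.
import math
from collections import Counter

docs = [["K1V", "E0E", "K1V"], ["I2I", "L4L"], ["E0E", "I2I", "I2I"]]

def tfidf(docs):
    df = Counter(t for d in docs for t in set(d))      # document frequency
    weights = []
    for d in docs:
        counts = Counter(d)
        total = sum(counts.values())
        weights.append({t: (c / total) * math.log(len(docs) / df[t])
                        for t, c in counts.items()})   # tf * idf
    return weights

print(tfidf(docs)[0])   # weights for the first "document" (protein)
```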

  • The Terms of Proteins - Gapped-dipeptides

    Gapped-dipeptide feature representation: let XdZ denote the gapped-dipeptide of amino acid types X and Z separated by d amino acids

  • Incorporation of Evolutionary Information

    Position-Specific Scoring Matrix (PSSM)
      - A PSSM is constructed from a multiple alignment of the highest-scoring hits in a PSI-BLAST search

  • Weighting Scheme of Gapped-dipeptides

    The weight of XdZ in protein P:

    $$W(XdZ, P) = \sum_{i} f(i, X)\, f(i + d + 1, Z)$$

    where f(i, Y) denotes the normalized value of the PSSM entry at the ith row and the column corresponding to amino acid type Y

    An example (a code sketch follows below):
    W(M2D, P) = f(1,M) × f(4,D) + f(2,M) × f(5,D) + … + f(78,M) × f(81,D)
              = 0.99995 × 0.04743 + 0.11920 × 0.00247 + … + 0.00669 × 0.26894
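    A sketch of this weighting, assuming the PSSM is available as an n x 20 array of normalized values (the random matrix below is a placeholder):

```python
# Gapped-dipeptide weight W(XdZ, P) from a PSSM, matching the example above:
# sum over positions i of f(i, X) * f(i + d + 1, Z).
import numpy as np

AA_INDEX = {a: i for i, a in enumerate("ACDEFGHIKLMNPQRSTVWY")}

def gapped_dipeptide_weight(pssm, x, d, z):
    xi, zi = AA_INDEX[x], AA_INDEX[z]
    n = pssm.shape[0]
    # 0-based pairs (i, i + d + 1); e.g. d = 2 pairs rows (0, 3), (1, 4), ...
    return float(np.sum(pssm[: n - d - 1, xi] * pssm[d + 1:, zi]))

pssm = np.random.rand(81, 20)      # placeholder "normalized" PSSM
print(gapped_dipeptide_weight(pssm, "M", 2, "D"))
```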

  • Gapped-dipeptide Feature Representation

    Problems with proteins represented by gapped-dipeptides:
      - A very large number of features
      - Increased computational time and complexity

    If d ranges from 0 to 20, there are 8,400 (= 20 * 20 * 21) features in a vector!

  • Outline

    Introduction
      - Data mining
      - Protein subcellular localization prediction
    A support vector machine model-based method
      - Support vector machines
      - Compartment-specific biological features
    A probabilistic latent semantic analysis-based method
      - Gapped-dipeptides
      - Probabilistic latent semantic analysis
    Applications of protein localization prediction
    Conclusion

  • Feature Reduction (1/3)

    Probabilistic Latent Semantic Analysis (PLSA)
      - Creates a mapping between documents and terms via latent concepts

  • Feature Reduction

    Dimension reduction: reduce the number of gapped-dipeptides by Probabilistic Latent Semantic Analysis (PLSA)

    [Figure: a true plot in k dimensions vs. the reduced-dimensionality plot]

  • Feature Reduction (2/3)

    The joint probability of a term w and a document d can be modeled as:

    $$P(w, d) = P(d) \sum_{z \in Z} P(w \mid z)\, P(z \mid d)$$

    where z is a latent variable with a "small" number of states, $P(w \mid z)$ are the concept expression probabilities, and $P(z \mid d)$ are the document-specific mixing proportions

    The parameters can be estimated by maximizing the likelihood with the EM algorithm (a sketch follows below)
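    The sketch promised above: PLSA fit by EM on a toy term-by-document count matrix; dimensions and iteration count are arbitrary choices:

```python
# PLSA by EM, matching P(w,d) = P(d) * sum_z P(w|z) P(z|d).
# N is a term-by-document count matrix; K is the number of latent topics.
import numpy as np

def plsa(N, K, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    M, D = N.shape
    p_w_z = rng.random((M, K)); p_w_z /= p_w_z.sum(axis=0)     # P(w|z)
    p_z_d = rng.random((K, D)); p_z_d /= p_z_d.sum(axis=0)     # P(z|d)
    for _ in range(iters):
        # E-step: P(z|w,d) proportional to P(w|z) P(z|d)
        joint = p_w_z[:, :, None] * p_z_d[None, :, :]          # M x K x D
        post = joint / (joint.sum(axis=1, keepdims=True) + 1e-12)
        # M-step: re-estimate parameters from expected counts
        exp_counts = N[:, None, :] * post                      # M x K x D
        p_w_z = exp_counts.sum(axis=2)
        p_w_z /= p_w_z.sum(axis=0, keepdims=True) + 1e-12
        p_z_d = exp_counts.sum(axis=0)
        p_z_d /= p_z_d.sum(axis=0, keepdims=True) + 1e-12
    return p_w_z, p_z_d

N = np.random.default_rng(1).integers(0, 5, (30, 10))   # toy counts
p_w_z, p_z_d = plsa(N, K=3)
print(p_z_d.shape)   # each document (protein) is now a 3-dim topic vector
```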

  • Feature Reduction (3/3)

    [Figure: documents as vectors in the original term space (Term 1, ..., Term 5) are mapped into a lower-dimensional concept space]

  • System Architecture of PSLDoc

    PSLDoc (Protein Subcellular Localization prediction by Document classification)

    [Figure: system architecture of PSLDoc]

  • Performance Comparison with Other Approaches

    PSLDoc compares favorably with other methods, improving overall accuracy by 1.51%

    Chang JM, et al. PSLDoc: Protein Subcellular Localization Prediction Based on Gapped-dipeptides and Probabilistic Latent Semantic Analysis. Proteins: Structure, Function, and Bioinformatics, 2008.

  • Effect of PLSA Feature Reduction (1/2)

    Distribution of topics versus proteins: use PLSA to reduce the gapped-dipeptides to different topics

  • Effect of PLSA Feature Reduction (2/2)

    Distribution of topics versus localization sites

  • Gapped-dipeptide Signatures for Each Localization

    Loc.  Gapped-dipeptide signatures

    CP    E0E, K1I, K5V, K1V, D0E; L1H, L5H, L3H, H4L, H0L; A12C, A9C, A13C, A5C, A7C;
          R3R, R6R, R2R, R0R, R9R; A6A, A13A, A7A, A10A, A11A; I0E, R6I, I3R, I3K, R6V;
          H3H, H1H, H7H, H13H, H10H; H1M, H2M, H11M, M0H, H0M; A4E, E1E, A2E, V4E, A9E;
          E4E, K6E, E6E, E3E, E0E

    IM    I2I, I3I, I0I, L0I, I0F; L7L, L4L, L10L, L3L, L6L; M3M, M2M, M0M, M8M, M6M;
          V2I, V2V, V3I, V3V, I0V; T2F, T6F, F3F, T4F, T8F; A1A, A7L, A4A, A1C, A11L;
          W3W, W0W, W2W, W6W, W4W; Y12L, Y1L, Y11L, L0Y, L1L; M2T, M3T, M10T, M4T, M0L;
          F10P, F8P, F12P, F3P, F13P

    PP    A1A, A2A, A0A, A3A, M4A; M0H, W1Q, W1H, W1K, W5Q; P1E, P0E, E0P, P0K, E1P;
          D0D, Q0D, D3D, D3Q, D11D; W0E, E4W, W11E, E0W, W13E; K3K, K0K, K2K, K1K, K7K;
          A3A, A7A, A1P, A6R, A10R; P3N, N4P, N3P, N5P, N0P; H6G, G3M, H7D, G11H, H11G;
          A10A, A11A, A6A, A12A, A3A

    OM    T1R, R3T, R1T, T5R, P0P; R0F, R4F, Y13R, R6F, R2F; N4N, N0N, N10N, N7N, F1N;
          Q6Q, Q1Q, Q3Q, Q13Q, Q4Q; S0F, A3F, F0S, R9F, F7F; G0G, A0G, A1G, G1A, G3A;
          N1Q, N1N, Q1Q, N12N, Q11V; W2N, N2W, N0W, D2W, N13W; Q5R, R1Q, Q1R, Q3R, R2Q;
          Y1Y, Y0Y, Y5Y, Y4Y, Y12Y

    EC    S6S, S2S, T11T, S13S, T6S; G8G, G0G, G7G, G9G, G6G; T1T, T3T, T5T, T9T, T10T;
          N10N, N9N, N13N, N11N, N12N; N1N, N3N, N4N, N11N, N1T; I5Y, Y12S, Y3S, Y9S, Y6I;
          Q2N, N1Q, Q1Q, N3Q, Q7Q; K1S, S6S, S5S, S11M, S0S; S3G, G3G, G4S, G3S, G2G;
          N0N, N12V, N4V, V12N, N9V

  • Amino Acid Compositions of Single Residues and Gapped-dipeptide Signatures

  • Grouped Amino Acid Compositions of Single Residues and Gapped-dipeptide Signatures

    Amino acid grouping: nonpolar (AIGLMV), polar (CNPQST), charged (DEHKR), and aromatic (FYW)

  • Outline

    Introduction
      - Data mining
      - Protein subcellular localization prediction
    A support vector machine model-based method
      - Support vector machines
      - Compartment-specific biological features
    A probabilistic latent semantic analysis-based method
      - Gapped-dipeptides
      - Probabilistic latent semantic analysis
    Applications of protein localization prediction
    Conclusion

  • Interaction–Localization Networks in the Secretory Pathway

    Co-localized proteins interact with one another!

    Scott MS, et al. Refining Protein Subcellular Localization. PLoS Computational Biology, 2005.

  • Interaction–Localization Networks in the ER

    Interactions of ER periphery proteins are almost exclusively between members of this group or with cytosolic proteins.

    Scott MS, et al. Refining Protein Subcellular Localization. PLoS Computational Biology, 2005.

  • Gene Expression Level–Localization Relationship

    Expression levels are clearly correlated with localization

    Drawid A, et al. Genome-wide analysis relating expression level with protein subcellular localization. Trends in Genetics, 2000.

  • The Discriminative Impact of Features for Different Functional Categories

    Subcellular localization-related features are important for elucidating protein function.

    Jensen LJ, et al. Prediction of Human Protein Function from Post-translational Modifications and Localization Features. J. Mol. Biol., 2000.

  • Outline

    Introduction
      - Data mining
      - Protein subcellular localization prediction
    A support vector machine model-based method
      - Support vector machines
      - Compartment-specific biological features
    A probabilistic latent semantic analysis-based method
      - Gapped-dipeptides
      - Probabilistic latent semantic analysis
    Applications of protein localization prediction
    Conclusion

  • Conclusion (1/2)

    SVM models
      - 1-v-1 outperforms 1-v-r in the flexibility of integrating different features

    Compartment-specific biological features
      - Compartment-specific features improve accuracy
      - A refined feature representation leads to Acc = 92.0% in PSL101 (vs. 90.0% in CELLO II)

    Model      Biological features                                         Acc
    1-v-r SVM  AA, Dip, SSE (EC1)                                          87.7%
    1-v-1 SVM  AA, Dip, SSE (EC1)                                          88.2%
    1-v-1 SVM  AA, Dip, SSE (EC1), SA, Sig, TMA, TMB, TAT, Sec             91.4%
    1-v-1 SVM  AA, Dip, SSE (EC1), SA, Sig, TMA, TMB, TAT, Sec, SSE (EC2)  92.0%

  • Conclusion (2/2)

    Proteins represented by gapped-dipeptides
      - Gapped-dipeptides capture remote relationships on the primary sequence

    Effects of feature reduction
      - PLSA greatly reduces the feature dimension (from more than 8,000 gapped-dipeptides down to a small number of topics) without sacrificing performance

  • Take Home Message

    Five key factors in using data mining to solve a research problem:
      - Data set: is the data set large enough for analysis?
      - Feature extraction: what kind of feature(s) could be useful for solving this problem?
      - Feature representation: how can these features be represented/encoded effectively?
      - Data mining approaches: which technique(s) are suitable for this problem?
      - Evaluation: what is the performance of this method?

  • References

    Emily Chia-Yu Su, Hua-Sheng Chiu, Allan Lo, Jenn-Kang Hwang, Ting-Yi Sung, and Wen-Lian Hsu, "Protein subcellular localization prediction based on compartment-specific features and structure conservation," BMC Bioinformatics, 8:330 (2007).

    Jia-Ming Chang, Emily Chia-Yu Su, Allan Lo, Hua-Sheng Chiu, Ting-Yi Sung, and Wen-Lian Hsu, "PSLDoc: Protein subcellular localization prediction based on gapped-dipeptides and probabilistic latent semantic analysis," Proteins: Structure, Function, and Bioinformatics, 72(2):693 (2008).

  • Thank You!

  • Questions?

  • 4. Solvent Accessibility – (1)

    Proteins in different localization sites have different solvent accessibility (SA)
      - A balance of acidic and basic exposed residues → cytoplasmic proteins (C1)
      - A slight excess of acidic exposed residues → extracellular proteins (C5)

    Exposed and buried residues: let SA_i be the predicted SA of the residue at position i (0 ≤ SA_i ≤ 100); residues are labeled buried (B) or exposed (E) by thresholding SA_i

  • 4. Solvent Accessibility – (2)

    Protein seq.       M  P  L  D  L  Y  N  T  L  T  R  R  K  E  R  F  E  P  M  T  P  D  R  V
    Pred. SA (%)       16 75 53 45 62 86 22 19 8  67 ...
    Buried or exposed  B  E  E  E  E  E  B  B  B  E  ...

    Encoding scheme of SA: compute the amino acid compositions of the exposed and of the buried residues

    [Example encoding: a 2 x 20 matrix (columns A, C, D, ..., Y; rows B and E), e.g. row B = (0.12, 0.02, 0.03, ..., 0.05) and row E = (0.07, 0.04, 0.02, ..., 0.09)]

  • 5. Signal Peptides – (1)

    Signal peptides (Sig) from the SignalP 3.0 server

    Signal peptides: N-terminal peptides between 15 and 40 amino acids long that target proteins for translocation through the general secretory pathway

    Presence of a signal peptide → not cytoplasmic proteins (not C1)

  • 5. Signal Peptides – (2)

    Sig features are extracted from the SignalP server

  • 6. Transmembrane α-helices – (1)

    Transmembrane α-helices (TMA) from the TMHMM 2.0 server

    Integral inner membrane proteins are characterized by transmembrane α-helices

    Presence of transmembrane α-helices → inner membrane proteins (C2)

  • 6. Transmembrane α-helices – (2)

    TMA features are extracted from the TMHMM server

  • 7. Transmembrane β-barrels – (1)

    Transmembrane β-barrels (TMB) from the TMB-Hunt server

    Many proteins residing in the outer membrane are characterized by β-barrel structures

    Presence of β-barrel structures → outer membrane proteins (C4)

    Wong et al. (2001) J Bacteriol 183:367

  • 7. Transmembrane β-barrels – (2)

    TMB features are extracted from the TMB-Hunt server

  • 8. Twin-Arginine Translocase – (1)

    Twin-arginine translocase signal peptides (TAT) from the TatP 1.0 server

    The TAT system exports proteins from the cytoplasm to the periplasm; the proteins translocated by TAT bear a unique twin-arginine motif

    Presence of a TAT motif → periplasmic proteins (C3)

  • 8. Twin-Arginine Translocase – (2)

    TAT features are extracted from the TatP server

  • 9. Non-classical Protein Secretion – (1)

    Non-classical protein secretion (Sec) from the SecretomeP 2.0 server

    Several extracellular proteins can be secreted without a classical N-terminal signal peptide

    Identification of non-classical protein secretion → extracellular proteins (C5)

  • 9. Non-classical Protein Secretion – (2)

    Sec features are extracted from the SecretomeP server
