BINSFinalt

  • 8/3/2019 BINSFinalt

    1/43

    Muhammad Rehan

    1358-MSCS-08

    Department of Computer Science

    GC University Lahore

    MS(CS) Final Thesis

  • 8/3/2019 BINSFinalt

    2/43

    Background

    Literature Review

    Problem Statement

    Hypothesis

    Methodology

    Results

    Future Work

    References

  • 8/3/2019 BINSFinalt

    3/43

    Bioinformatics

    Any application of computation in biology, including data management, algorithm development and data mining.

    Bioinformatics is the field of science in which computer science, information technology, statistics and various branches of biology merge to form a single discipline.

  • 8/3/2019 BINSFinalt

    4/43

    What Are Proteins?

    Proteins:
      Antibodies
      Hormones
      Structural support and transportation
      Enzymes

  • 8/3/2019 BINSFinalt

    5/43

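    What Are Proteins Made Of?

    General formula of an amino acid: H2N-CH(R)-COOH

      An amino group (H2N), a carboxyl group (COOH), a hydrogen atom (H) and a side chain (R) are all attached to the alpha carbon (C).

      Example side chains from the figure: R = H, R = CH3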

  • 8/3/2019 BINSFinalt

    6/43

    Sr. No  Amino Acid  Single-Letter Code  Three-Letter Code
    1       Alanine     A                   Ala
    2       Arginine    R                   Arg
    3       Serine      S                   Ser
    4       Threonine   T                   Thr
    5       Tyrosine    Y                   Tyr
    6       Glycine     G                   Gly
    7       Histidine   H                   His
    ...     ...         ...                 ...
    20      Leucine     L                   Leu

  • 8/3/2019 BINSFinalt

    7/43

    Experimental Approach

    In vivo

    In vitro

    In silico

  • 8/3/2019 BINSFinalt

    8/43

    Post-Translational Modifications (PTMs)

    Protein modification is essential for a protein's biological activity and for performing its intended task.

    The modification is made by adding phosphate, glycosyl or other groups to certain amino acids.

  • 8/3/2019 BINSFinalt

    9/43

    PTM              Target Amino Acid  Description
    Phosphorylation  S, T, Y, H         Addition of a phosphate group to S, T, Y or H
    Glycosylation    S, T, N            Addition of a glycosyl group to S, T or N
    Sulfation        Y                  Addition of a sulfate group to a Y
    Acetylation                         Addition of an acetyl group, usually at the N-terminus of the protein
    Methylation      R                  Addition of a methyl group, usually at R residues

    List of PTM Types

  • 8/3/2019 BINSFinalt

    10/43

    Database Name      Description                                                                  Reference
    PROSITE            Database of consensus patterns for various PTMs                              Sigrist et al. (2002)
    HPRD               Human Protein Reference Database of disease-related proteins and their PTMs  Peri et al. (2003)
    RESID              Database with a collection of annotations and structures for PTMs            Garavelli (2003)
    PhosphoBase / ELM  Database of validated phosphorylation sites                                  Diella et al. (2004)

    List of PTM Databases

  • 8/3/2019 BINSFinalt

    11/43

    Sr. No  Statement  (Reference)

    1  Proteins often perform diverse and multiple functions. (Jeffery, 1999)

    2  The proteome is more diverse and complex than the genome: the human genome contains 22,000 to 25,000 genes, whereas the number of proteins exceeds 100,000. (Nicolle H. et al., 2007)

    3  Identifying protein functions and events relies mainly on the protein's particular 3-D structure and on the occurrence of targeted amino acid modifications. (Attwood, 2000)

    4  PTMs regulate various protein functions by effecting conformational changes, such as enzyme activation. (Konstantinopoulos et al., 2007)

    5  Identifying phosphorylated serine, threonine and tyrosine residues in vivo using mass spectrometry (MS) is not easy. (Mann et al., 2002)

  • 8/3/2019 BINSFinalt

    12/43

    Sr. No  Statement  (Reference)

    6  Many methods have been developed within the field of proteomics, but these methods are still in their early stages. (Blom, 2004)

    7  Machine learning and statistics have always played a core role in bioinformatics, both in understanding proteomics and in the analysis of PTMs. (Qazi et al., 2006)

    8  The artificial neural network (ANN) is one such approach that has been used extensively in biological sequence analysis. (Wu, C.H., 1997)

    9  Most cellular proteins are regulated by reversible phosphorylation, and at least 30% of proteins carry such an alteration. (Ficarro et al., 2002)

  • 8/3/2019 BINSFinalt

    13/43

    DISPHOS

    PredPhospho

    GPS

    PPSP

    KinasePhos 1.0, KinasePhos 2.0

    NetPhos, NetPhosK

    Neural-genetic

  • 8/3/2019 BINSFinalt

    14/43

    Tool       DISPHOS              NetPhos  Neural-genetic  BPNN
    Method     Logistic regression  ANN      ANN             ANN
    Serine     76%                  69%      75%             72%
    Threonine  81%                  72%      82%             77%
    Tyrosine   83%                  61%      79%             74%

  • 8/3/2019 BINSFinalt

    15/43

    Tool     KinasePhos 2.0  KinasePhos 1.0  PredPhospho  GPS  PPSP
    Method   SVM             HMM             SVM          GPS  BDT
    PKA Sn   92%             91%             88%          91%  90%
    PKA Sp   89%             86%             91%          89%  92%
    PKA Acc  90%             85%             90%          90%  91%
    PKC Sn   84%             80%             79%          82%  82%
    PKC Sp   86%             87%             86%          83%  86%
    PKC Acc  85%             83%             83%          82%  84%

  • 8/3/2019 BINSFinalt

    16/43

    Develop a new method, BINS, to evolve a new classification model by learning from amino acid sequence data with an artificial neural network, a machine-learning method.

    BINS improves the prediction specificity, efficiency and accuracy of the machine-learning simulator GEARS (Genetic Evaluation of Classifier by Learning Residue Rules and Sequences).

  • 8/3/2019 BINSFinalt

    17/43

    The BINS classification method will reduce false negative and false positive predictions.

    The BINS method gives highly accurate predictions of PTMs, identifying the specific sites affected and the kinases that act at each site, and discloses important biological information from noisy data.

    The BINS method can give the best results compared with the existing PTM prediction methods.

  • 8/3/2019 BINSFinalt

    18/43

    An empirical research methodology with an exploratory development life cycle will be used to develop the BINS model.

  • 8/3/2019 BINSFinalt

    19/43

    BINS consists mainly of three parts:

      BINS Data Preparation Module

      BINS Bootstrapping Module

      BINS ANN Module

  • 8/3/2019 BINSFinalt

    21/43

    BINS Data Preparation Module
      PTMs database
      Create protein database grouped by target classes
      Create protein database grouped by non-modified target classes
      Peptide generator
      Removal of duplicate instances

    BINS Bootstrapping Module
      Peptide dataset grouped by non-modified classes
      Sparse encoding
      Merge the sparse-encoded datasets grouped by modified and non-modified target classes
      Training and validation dataset generator
        Training dataset generator
        Validation dataset generator

    BINS ANN Module
      Topology and network configuration
      Training    [SN] [SP] [Acc] [MCC]
      Validation  [SN] [SP] [Acc] [MCC]

  • 8/3/2019 BINSFinalt

    22/43

    BINS Data Preparation Module

      BINS Database Inconsistency Analyzing Utility

      BINS Balance Inverted Site Application

      BINS Peptide Extraction Application

  • 8/3/2019 BINSFinalt

    23/43

    BINS Data Preparation Module

    PID     Sequence         Position  Amino Acid  Modification
    O08539  ASTSMNSYTLKSYA.  4         S           S
    O14543  MVTHSKFPAAGS.    3         T           T
    O14746  MPRAPRCRAVSTA    11        S           S
    O14920  MSWYPSLTQTC.     4         Y           Y
    O15117  ELSFKQGEQIYTA.   3         S           S

  • 8/3/2019 BINSFinalt

    24/43

    BINS Data Preparation Module

    Target site  No. of proteins  No. of positive peptides  No. of negative peptides  No. of balanced negative peptides  No. of merged positive and balanced negative peptides
    S            5431             14467                     326396                    14837                              29304
    T            1940             2907                      35795                     2983                               5890
    Y            1156             2208                      16273                     2325                               4533
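
    The negative peptides far outnumber the positive ones (for S, 326396 against 14467), so a balanced negative set is drawn before the two are merged. The slides do not say how the balancing is done, and the reported balanced counts are slightly larger than the positive counts, so the BINS rule is evidently not an exact match. The following is only a minimal sketch, assuming plain random undersampling to the size of the positive set; the function name `balance_negatives` and the fixed seed are illustrative and not part of BINS.

```python
import random


def balance_negatives(positives, negatives, seed=0):
    """Randomly undersample negatives to the size of the positive set.

    Assumption: simple undersampling without replacement. The actual
    BINS balancing rule is not described in the slides.
    """
    rng = random.Random(seed)
    k = min(len(positives), len(negatives))
    return rng.sample(negatives, k)


# Toy data standing in for positive and negative peptides.
positives = ["pos%d" % i for i in range(5)]
negatives = ["neg%d" % i for i in range(100)]

balanced = balance_negatives(positives, negatives)
merged = positives + balanced  # merged positive + balanced negative set
print(len(balanced), len(merged))  # 5 10
```

    Fixing the seed keeps the sampled negative set reproducible between runs, which matters when training and validation results are compared across experiments.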

  • 8/3/2019 BINSFinalt

    25/43

    BINS Data Preparation Module

    BINS Database Inconsistency Analyzing Utility

    PID     Sequence         Position  Amino Acid  Modification  Length
    O08539  ASTSMNSYTLKSYA.  4         S           S             350
    O14543  MVTHSKFPAAGS.    3         T           T             1030
    O14746  MPRAPRCRAVSTA    11        S           S             1250
    O14920  MSWYPSLTQTC.     4         Y           Y             735
    O15117  ELSFKQGEQIYTA.   3         S           S             952

  • 8/3/2019 BINSFinalt

    26/43

    BINS Data Preparation Module

    BINS Invert Application

    PID     Sequence         Position  Amino Acid  Modification  Length
    O08539  ASTSMNSYTLKSYA.  2         S           S             350
    O08539  ASTSMNSYTLKSYA.  7         S           S             350
    O08539  ASTSMNSYTLKSYA.  12        S           S             350
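
    The rows above are the serine positions of O08539 other than the annotated site at position 4, which suggests that the invert step lists every occurrence of the target residue that is not recorded as modified. A minimal sketch under that assumption (the function name `inverted_sites` is hypothetical; positions are 1-based as in the tables):

```python
def inverted_sites(sequence, residue, modified_positions):
    """Return 1-based positions of `residue` not annotated as modified."""
    modified = set(modified_positions)
    return [
        pos
        for pos, aa in enumerate(sequence, start=1)
        if aa == residue and pos not in modified
    ]


# O08539 fragment from the table, with the trailing "." dropped.
seq = "ASTSMNSYTLKSYA"
negative_sites = inverted_sites(seq, "S", [4])
print(negative_sites)  # [2, 7, 12]
```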

  • 8/3/2019 BINSFinalt

    27/43

    BINS Data Preparation Module

    BINS Peptide Extraction Application

    Peptide ID  Extended Sequence                      Class  P-10  P-9  P-8  ...  P-1  P0  P1  ...  P9  P10
    O08539-2    -,-,-,A,S,T,S,M,N,S,Y,T,L,K,S          0.1    -     -    -    ...  A    S   T   ...  K   S
    O08539-7    -,-,-,A,S,T,S,M,N,S,Y,T,L,K,S,Y,A,-,-  0.1    -     -    -    ...  N    S   Y   ...  -   -
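
    The columns P-10 to P10 indicate a window of 21 residues centred on the candidate site (P0), with "-" filling positions that fall outside the protein. A minimal sketch of such an extraction, assuming a flank of 10 on each side and 1-based site positions; `extract_peptide` is an illustrative name, not the BINS API.

```python
def extract_peptide(sequence, position, flank=10, pad="-"):
    """Return the 2*flank+1 residues centred on a 1-based position.

    Positions outside the sequence are filled with the pad symbol.
    """
    center = position - 1  # convert to a 0-based index
    window = []
    for i in range(center - flank, center + flank + 1):
        window.append(sequence[i] if 0 <= i < len(sequence) else pad)
    return window


seq = "ASTSMNSYTLKSYA"  # O08539 fragment, trailing "." dropped
pep2 = extract_peptide(seq, 2)
pep7 = extract_peptide(seq, 7)
print(",".join(pep2))
print(",".join(pep7))
```

    For position 2 the window has P-1 = A, P0 = S, P9 = K and P10 = S; for position 7 it has P-1 = N, P0 = S, P1 = Y and padding at P9 and P10, in agreement with the table.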

  • 8/3/2019 BINSFinalt

    28/43

    BINS Bootstrapping Module

      BINS Training Dataset Encoding Manager

      BINS Data Table Merging Utility

      BINS Boot Strapping Application

  • 8/3/2019 BINSFinalt

    29/43

    BINS Bootstrapping Module

    Sparse Encoding Scheme

    Amino Acid  Coding Scheme
    A           10000000000000000000
    C           01000000000000000000
    D           00100000000000000000
    E           00010000000000000000
    F           00001000000000000000
    G           00000100000000000000
    H           00000010000000000000
    I           00000001000000000000
    ...         ...
    -           00000000000000000000
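
    In this scheme each residue becomes a 20-bit vector with a single 1, and the padding symbol "-" becomes all zeros, so a 21-residue window yields 21 x 20 = 420 input values. A minimal sketch follows; the residue order beyond the eight rows shown is an assumption (alphabetical by one-letter code), and the function names are illustrative.

```python
# Assumed order: the 20 standard residues, alphabetical by one-letter code.
# Only the first eight positions (A..I) are confirmed by the table.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def sparse_encode(residue):
    """Return a 20-element one-hot list; '-' (or any unknown symbol) is all zeros."""
    bits = [0] * len(AMINO_ACIDS)
    idx = AMINO_ACIDS.find(residue)
    if idx >= 0:
        bits[idx] = 1
    return bits


def encode_peptide(peptide):
    """Concatenate the sparse codes of every residue in the window."""
    encoded = []
    for residue in peptide:
        encoded.extend(sparse_encode(residue))
    return encoded


code_c = "".join(map(str, sparse_encode("C")))
print(code_c)  # 01000000000000000000
print(len(encode_peptide("-" * 9 + "ASTSMNSYTLKS")))  # 420
```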

  • 8/3/2019 BINSFinalt

    30/43

    BINS Bootstrapping Module

    BINS Training Dataset Encoding Manager

    Columns: Peptide ID, Extended Sequence, Class, then the encoded bits P-10-1, P-10-2, P-10-3, ..., P-10-8, P-10-9, P-10-10, ...

    Peptide ID  Extended Sequence                      Class  Encoded bits
    O08539-2    -,-,-,A,S,T,S,M,N,S,Y,T,L,K,S          0.1    0 0 0 0 0 0 0 0
    O08539-7    -,-,-,A,S,T,S,M,N,S,Y,T,L,K,S,Y,A,-,-  0.1    0 0 0 0 0 0 0 0

  • 8/3/2019 BINSFinalt

    31/43

    BINS Bootstrapping Module

    BINS DataTable Merging Utility

    Columns: Peptide ID, Extended Sequence, Class, then the encoded bits P-10-1, P-10-2, P-10-3, ..., P-10-8, P-10-9, P-10-10, ...

    Peptide ID  Extended Sequence                      Class  Encoded bits
    O08539-2    -,-,-,A,S,T,S,M,N,S,Y,T,L,K,S          0.1    0 0 0 0 0 0 0 0
    O08539-7    -,-,-,A,S,T,S,M,N,S,Y,T,L,K,S,Y,A,-,-  0.9    0 1 0 0 0 0 0 0

  • 8/3/2019 BINSFinalt

    32/43

    BINS Bootstrapping Module

    BINS Boot Strapping Application

  • 8/3/2019 BINSFinalt

    33/43

    BINS ANN Module

  • 8/3/2019 BINSFinalt

    34/43

    Evaluation Strategy

    Sn=TP/(TP+FN)

    Sp=TN/(TN+FP)

    Acc=(Sn+Sp)/2

    MCC=(TP*TN-FP*FN)/sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))
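
    These measures follow directly from the four confusion counts. The sketch below computes them with the accuracy defined as the mean of Sn and Sp, as on this slide, and uses the standard Matthews correlation coefficient for MCC. Returning `None` for an undefined ratio is a choice made for this example; the function name `evaluate` is illustrative.

```python
import math


def evaluate(tp, tn, fp, fn):
    """Compute Sn, Sp, Acc = (Sn + Sp) / 2 and MCC from confusion counts.

    Any measure whose denominator is zero is returned as None.
    """
    sn = tp / (tp + fn) if (tp + fn) else None
    sp = tn / (tn + fp) if (tn + fp) else None
    acc = (sn + sp) / 2 if sn is not None and sp is not None else None
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else None
    return {"Sn": sn, "Sp": sp, "Acc": acc, "MCC": mcc}


# A classifier that never predicts "modified": MCC is undefined.
result = evaluate(tp=0, tn=50, fp=0, fn=50)
print(result)  # {'Sn': 0.0, 'Sp': 1.0, 'Acc': 0.5, 'MCC': None}
```

    A result of this shape (Sn = 0, Sp = 1, no MCC) is what a network produces when it predicts no positives at all, because TP + FP = 0 makes the MCC denominator zero.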

  • 8/3/2019 BINSFinalt

    35/43

    Evaluation Strategy

    PID    Sequence    Position  Target   Classification
    O3265  SASNSTSYTS  3         Mod      TP
    O3265  SASNSTSYTS  10        Mod      FN
    O3265  SASNSTSYTS  1         Non-mod  TN
    O3265  SASNSTSYTS  5         Non-mod  FP
    O3265  SASNSTSYTS  7         Non-mod  TN
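
    Each site is labelled by comparing its target class with the predicted class. A minimal sketch of that labelling, with the predictions chosen here so that they reproduce the five outcomes in the table (the predictions themselves are not given in the slides, and `outcome` is an illustrative name):

```python
from collections import Counter


def outcome(target_modified, predicted_modified):
    """Label one site as TP, FN, FP or TN."""
    if target_modified:
        return "TP" if predicted_modified else "FN"
    return "FP" if predicted_modified else "TN"


# (position, target is modified, predicted as modified)
rows = [
    (3, True, True),
    (10, True, False),
    (1, False, False),
    (5, False, True),
    (7, False, False),
]
counts = Counter(outcome(target, predicted) for _, target, predicted in rows)
print(dict(counts))  # {'TP': 1, 'FN': 1, 'TN': 2, 'FP': 1}
```

    With these counts, Sn = 1/(1+1) = 0.5 and Sp = 2/(2+1), about 0.667.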

  • 8/3/2019 BINSFinalt

    36/43

    BINS Serine Result

    Sr. No  Training                    Validation
            Acc    Sn     Sp     MCC    Acc    Sn     Sp     MCC
    1       0.965  1      0.931  0.932  0.497  0      1      None
    2       0.984  0.982  0.987  0.969  0.805  0.612  0.999  0.662
    3       0.996  0.996  0.996  0.992  0.807  0.619  0.995  0.663
    4       0.995  0.995  0.996  0.991  0.807  0.622  0.995  0.664
    5       0.998  0.998  0.998  0.996  0.807  0.616  0.999  0.665
    6       0.996  0.996  0.996  0.992  0.809  0.628  0.991  0.663

  • 8/3/2019 BINSFinalt

    37/43

    BINS Threonine Result

    Sr. No  Training                    Validation
            Acc    Sn     Sp     MCC    Acc    Sn     Sp     MCC
    1       0.972  0.963  0.981  0.946  0.826  0.688  0.965  0.680
    2       0.987  0.986  0.989  0.975  0.834  0.737  0.932  0.683
    3       0.987  0.986  0.988  0.974  0.825  0.750  0.901  0.658
    4       0.987  0.986  0.987  0.974  0.827  0.771  0.884  0.659
    5       0.987  0.986  0.987  0.974  0.824  0.774  0.875  0.653
    6       0.986  0.986  0.986  0.972  0.822  0.772  0.872  0.648
    7       0.988  0.990  0.987  0.977  0.825  0.761  0.890  0.657
    8       0.989  0.989  0.990  0.979  0.823  0.768  0.880  0.652
    9       0.990  0.990  0.990  0.980  0.822  0.770  0.874  0.647
    10      0.990  0.989  0.991  0.980  0.821  0.770  0.871  0.645

  • 8/3/2019 BINSFinalt

    38/43

    BINS Tyrosine Result

    Sr. No  Training                    Validation
            Acc    Sn     Sp     MCC    Acc    Sn     Sp     MCC
    1       0.966  0.952  0.979  0.933  0.846  0.735  0.951  0.705
    2       0.972  0.961  0.983  0.945  0.843  0.741  0.939  0.697
    3       0.977  0.975  0.979  0.955  0.836  0.778  0.891  0.675
    4       0.976  0.975  0.977  0.953  0.837  0.780  0.890  0.676
    5       0.975  0.973  0.976  0.950  0.831  0.779  0.881  0.665
    6       0.973  0.970  0.976  0.947  0.829  0.779  0.877  0.661
    7       0.974  0.971  0.977  0.948  0.828  0.778  0.876  0.659
    8       0.974  0.973  0.975  0.948  0.828  0.779  0.875  0.659
    9       0.974  0.974  0.975  0.949  0.826  0.778  0.872  0.654
    10      0.977  0.974  0.980  0.955  0.825  0.768  0.879  0.652

  • 8/3/2019 BINSFinalt

    39/43

    BINS Comparison with Other Methods

    Algorithm       Y              T              S
                    Acc  Sn   Sp   Acc  Sn   Sp   Acc  Sn   Sp
    BINS            85%  74%  95%  83%  74%  93%  81%  63%  99%
    NetPhos         69%  70%  68%  72%  66%  77%  69%  81%  57%
    DISPHOS         83%  NA   NA   81%  NA   NA   76%  NA   NA
    BPNN            75%  75%  75%  78%  78%  77%  72%  72%  72%
    Neural-genetic  79%  81%  78%  83%  81%  84%  75%  76%  74%

  • 8/3/2019 BINSFinalt

    40/43

    BINS is developed as a desktop application; there is no online (WWW) support in the current version. Nevertheless, growing opportunities on the internet call for an online version of this application, giving it wider scope and making it available to multiple clients in different regions of the world. This effort would not only help us enhance the embedded capability of BINS for efficient PTM prediction but could also become a major resource for multinational research collaborations.

    BINS is a sub-module of GEARS, so the next version will learn and optimize the parameters and weights of the ANN with a genetic algorithm. BINS will then be integrated with other GEARS modules, such as MAPRes and HMM, to achieve the best classification of protein data by weighing the pros and cons of each technique.

  • 8/3/2019 BINSFinalt

    41/43

    Jeffery C.J. Moonlighting proteins. Trends Biochem. Sci., 24:8-11, 1999.

    Bork P., Dandekar T., Diaz-Lazcoz Y., Eisenhaber F., Huynen M. and Yuan Y. Predicting function: from genes to genomes and back. J. Mol. Biol., 283:707-725, 1998.

    Attwood T. The quest to deduce protein function from sequence: the role of pattern databases. Int. J. Biochem. Cell Biol., 32:139-155, 2000.

    Mann M., Ong S., Gronborg M., Steen H. et al. Trends Biotechnol., 20:261-268, 2002.

    Wu C.H. Comput. Chem., 21:237-256, 1997.

    Blom N., Sicheritz-Pontén T., Gupta R., Gammeltoft S. and Brunak S. Prediction of post-translational glycosylation and phosphorylation of proteins from the amino acid sequence. Proteomics, 4:1633-1649, 2004.
