CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Technical University of Denmark - DTU Department of systems biology PERFORMANCE MEASURES Of predictive systems Monday, 7 June 2010

Page 1: PERFORMANCE MEASURES

CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS

Technical University of Denmark - DTU
Department of Systems Biology

PERFORMANCE MEASURES
of predictive systems

Monday, 7 June 2010

Page 2: PERFORMANCE MEASURES

DATA TYPES

Binary

Data point   Value
A            FALSE
B            TRUE
C            TRUE
D            FALSE
E            FALSE
F            TRUE
G            FALSE

Real value

Data point   Value
a            32.3
b            11.2
c            2.1
d            111.1
e            1313
f            0.65
g            72.8

Page 3: PERFORMANCE MEASURES

ACCURACY

[Figure: two pie charts of the proportions of Category A and Category B, one nearly balanced (46% / 54%) and one highly skewed (1% / 99%)]
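The skewed pie chart suggests why accuracy alone can mislead. The sketch below is one reading of that slide, not something the slide states: it builds a hypothetical data set with the 1% / 99% split and scores a predictor that always answers "B". The `accuracy` function and both lists are illustrative names introduced here.

```python
def accuracy(actual, predicted):
    """Fraction of data points whose predicted class equals the actual class."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)


# Hypothetical data set mirroring the 1% / 99% split (assumed, not from the slide).
actual = ["A"] * 1 + ["B"] * 99

# A trivial predictor that never predicts category A.
always_b = ["B"] * 100

print(accuracy(actual, always_b))
```

The trivial predictor is right on 99 of the 100 points although it never finds the single member of category A, which is the kind of case the later measures (sensitivity, specificity, correlation coefficient) are meant to expose.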

Page 6: PERFORMANCE MEASURES

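MATTHEWS CORRELATION COEFFICIENT

[Figure: overlap of the experimental assignment (A / Not A) with the prediction (A / Not A), giving the four outcomes below]

TP   FP
FN   TN

Accuracy = 21/25 = 84%

PPV = 20/23 = 87%

$\mathrm{Sens} = \frac{TP}{AP} = 0.95$

$\mathrm{Spec} = \frac{TN}{AN} = 0.25$

$\mathrm{CC} = \frac{TP \cdot TN - FN \cdot FP}{\sqrt{PP \cdot AN \cdot AP \cdot PN}} = 0.39$

Here AP and AN are the numbers of actual (experimentally assigned) positives and negatives, and PP and PN are the numbers of predicted positives and negatives.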
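As a minimal sketch of how these measures follow from the four counts, the function below computes accuracy, PPV, sensitivity, specificity and the correlation coefficient. The function name and the zero-denominator handling are assumptions made for illustration. The example counts are taken from the "> 0,6" row of the threshold table later in this deck (TP = 8, FN = 6, FP = 3, TN = 10), not from the figure on this slide.

```python
from math import sqrt


def confusion_metrics(tp, fp, fn, tn):
    """Return common performance measures for a 2x2 confusion matrix."""
    ap = tp + fn  # actual positives
    an = fp + tn  # actual negatives
    pp = tp + fp  # predicted positives
    pn = fn + tn  # predicted negatives
    nan = float("nan")
    denom = sqrt(pp * an * ap * pn)
    return {
        "accuracy": (tp + tn) / (ap + an),
        "ppv": tp / pp if pp else nan,
        "sens": tp / ap if ap else nan,
        "spec": tn / an if an else nan,
        # Convention assumed here: CC is undefined (nan) when any margin is zero.
        "cc": (tp * tn - fn * fp) / denom if denom else nan,
    }


metrics = confusion_metrics(tp=8, fp=3, fn=6, tn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Unlike accuracy, the correlation coefficient uses all four margins in its denominator, so it falls toward zero when one class is predicted well only because it is common.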

Page 7: PERFORMANCE MEASURES

SENSITIVITY/SPECIFICITY

[Figure: classes A and Not A with the regions TP, FP, FN and TN marked]

$\mathrm{Sens} = \frac{TP}{AP} = 0.95$

$\mathrm{Spec} = \frac{TN}{AN} = 0.25$

$\mathrm{CC} = \frac{TP \cdot TN - FN \cdot FP}{\sqrt{PP \cdot AN \cdot AP \cdot PN}} = 0.39$

Page 8: PERFORMANCE MEASURES

SENSITIVITY/SPECIFICITY

[Figure: classes A and Not A with the regions TP, FP, FN and TN marked]

$\mathrm{Sens} = \frac{TP}{AP}$

$\mathrm{Spec} = \frac{TN}{AN}$

$\mathrm{CC} = \frac{TP \cdot TN - FN \cdot FP}{\sqrt{PP \cdot AN \cdot AP \cdot PN}}$

Values shown on the slide: 0.75, 0.43, 0.25, 0.37

Page 9: PERFORMANCE MEASURES

FROM REAL LIFE

Page 10: PERFORMANCE MEASURES

REAL VALUE

[Figure: scatter plot of measured affinity (y-axis, 0 to 1) against predicted affinity (x-axis, 0 to 1)]

Page 11: PERFORMANCE MEASURES

REAL VALUE

$\mathrm{PCC} = \frac{\sum_i (a_i - \bar{a})(p_i - \bar{p})}{\sqrt{\sum_i (a_i - \bar{a})^2}\,\sqrt{\sum_i (p_i - \bar{p})^2}}$

[Figure: scatter plot of measured affinity (y-axis, 0 to 1) against predicted affinity (x-axis, 0 to 1)]
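The Pearson correlation coefficient above can be computed directly from paired measured and predicted values. This is a plain-Python sketch of the formula, with a_i read as the measured and p_i as the predicted affinity (a reading of the axis labels). The function name and the error handling for constant inputs are assumptions.

```python
from math import sqrt


def pearson_cc(measured, predicted):
    """Pearson correlation between measured (a_i) and predicted (p_i) values."""
    n = len(measured)
    if n != len(predicted) or n < 2:
        raise ValueError("need two equally long lists with at least two values")
    a_mean = sum(measured) / n
    p_mean = sum(predicted) / n
    covariance = sum((a - a_mean) * (p - p_mean) for a, p in zip(measured, predicted))
    a_spread = sum((a - a_mean) ** 2 for a in measured)
    p_spread = sum((p - p_mean) ** 2 for p in predicted)
    if a_spread == 0 or p_spread == 0:
        raise ValueError("PCC is undefined when one list is constant")
    return covariance / (sqrt(a_spread) * sqrt(p_spread))


# Hypothetical values: predictions that are a linear function of the measurements.
print(pearson_cc([0.1, 0.4, 0.7], [0.2, 0.8, 1.4]))
```

Because the measure is invariant to linear rescaling, predictions that are systematically twice too large still give a coefficient of about 1, so PCC rewards correct ranking and spacing rather than correct absolute values.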

Page 21: PERFORMANCE MEASURES

ROC CURVES

[Figure: scatter plot of measured affinity (y-axis, 0 to 1) against predicted affinity (x-axis, 0 to 1) with the regions TP, TN, AP and AN marked, next to a ROC plot of Sens (y-axis) against 1 - spec (x-axis) showing curves with AUC = 0.5 and AUC = 0.8]

$\mathrm{Sens} = \frac{TP}{AP}$

$\mathrm{Spec} = \frac{TN}{AN}$

$\mathrm{AUC} = \int_0^1 f(x)\,dx$

where x = 1 - specificity (false positive rate) and f(x) is the sensitivity (true positive rate).
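One way to read the scatter plot is that a cutoff on the measured affinity defines the actual classes and a cutoff on the predicted affinity defines the predicted classes. The sketch below counts the four outcomes under that assumption. The data values, the cutoffs of 0.5 and the function name are all hypothetical.

```python
def confusion_counts(measured, predicted, measured_cutoff, predicted_cutoff):
    """Count TP, FP, FN, TN after binarizing both axes with a cutoff."""
    tp = fp = fn = tn = 0
    for m, p in zip(measured, predicted):
        actual_positive = m > measured_cutoff
        predicted_positive = p > predicted_cutoff
        if actual_positive and predicted_positive:
            tp += 1
        elif predicted_positive:
            fp += 1
        elif actual_positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


# Hypothetical affinities on a 0 to 1 scale.
measured = [0.9, 0.7, 0.6, 0.3, 0.2, 0.1]
predicted = [0.8, 0.4, 0.7, 0.6, 0.1, 0.3]

tp, fp, fn, tn = confusion_counts(measured, predicted, 0.5, 0.5)
print("Sens =", tp / (tp + fn))
print("Spec =", tn / (fp + tn))
```

Sweeping `predicted_cutoff` from high to low while keeping `measured_cutoff` fixed yields one (1 - spec, Sens) point per cutoff, which traces out the ROC curve.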

Page 29: PERFORMANCE MEASURES

ECCB/ISMB-2009 - Immunological Bioinformatics Tutorial

CALCULATING A ROC CURVE

True positive: 4      False positive: 1
False negative: 10    True negative: 12

TP/(TP+FN) = 4/14 = 0.29
FP/(FP+TN) = 1/13 = 0.08

Page 30: PERFORMANCE MEASURES

Threshold   TP   FN   TP/(TP+FN)   FP   TN   FP/(FP+TN)
> 0.8        4   10      0.29       1   12      0.08
> 0.6        8    6      0.57       3   10      0.23
> 0.4       11    3      0.79       6    7      0.46
> 0.2       13    1      0.93       9    4      0.69
> 0         14    0      1         13    0      1

[Figure: ROC curve, true positive rate (y-axis) against false positive rate (x-axis), both from 0 to 1, with reference curves for AUC = 0.5 and AUC = 1.0]

$\mathrm{AUC} = \int_0^1 f(x)\,dx$

AUC = 0.71
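The rows of the table can be turned into ROC points and integrated numerically. The sketch below uses the trapezoidal rule, which is an assumption about how the area might be approximated; the slide does not say how its AUC was obtained. With only the five tabulated thresholds it gives about 0.73, slightly above the 0.71 quoted on the slide, which was presumably computed from a finer curve.

```python
# (threshold, TP, FN, FP, TN) exactly as listed in the table.
rows = [
    (0.8, 4, 10, 1, 12),
    (0.6, 8, 6, 3, 10),
    (0.4, 11, 3, 6, 7),
    (0.2, 13, 1, 9, 4),
    (0.0, 14, 0, 13, 0),
]


def roc_points(rows):
    """Return (false positive rate, true positive rate) pairs, starting at (0, 0)."""
    points = [(0.0, 0.0)]
    for _threshold, tp, fn, fp, tn in rows:
        points.append((fp / (fp + tn), tp / (tp + fn)))
    return sorted(points)


def trapezoid_auc(points):
    """Approximate the integral of f(x) over [0, 1] with the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


print(round(trapezoid_auc(roc_points(rows)), 2))
```

The starting point (0, 0) corresponds to a threshold so strict that nothing is predicted positive, and the "> 0" row supplies the end point (1, 1).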

Page 31: PERFORMANCE MEASURES

DEALING WITH SEQUENCE REDUNDANCY

Page 35: PERFORMANCE MEASURES

OUTLINE

• What is data redundancy?

• Why is it a problem?

• How can we deal with it?

Page 36: PERFORMANCE MEASURES

DATABASES ARE REDUNDANT

• Biological reasons
  • Some protein functions or sequence motifs are more common than others

• Laboratory artifacts
  • Some protein families have been heavily investigated, others not
  • Mutagenesis studies produce large numbers of almost identical replicas

Page 37: PERFORMANCE MEASURES

DATA REDUNDANCY

10 MHC restricted peptides

ALAKAAAAM
ALAKAAAAN
ALAKAAAAR
ALAKAAAAT
ALAKAAAAV
GMNERPILT
GILGFVFTM
TLNAWVKVV
KLNEPVLLL
AVVPFIVSV
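To make the redundancy in this list concrete, the sketch below compares the ten peptides pairwise by ungapped, position-by-position identity. The 80% cutoff is an arbitrary illustration value, not one taken from the slides, and the function name is hypothetical.

```python
from itertools import combinations

# The ten 9-mer peptides from the slide.
peptides = [
    "ALAKAAAAM", "ALAKAAAAN", "ALAKAAAAR", "ALAKAAAAT", "ALAKAAAAV",
    "GMNERPILT", "GILGFVFTM", "TLNAWVKVV", "KLNEPVLLL", "AVVPFIVSV",
]


def percent_identity(a, b):
    """Percent of positions with the same residue in two equally long peptides."""
    if len(a) != len(b):
        raise ValueError("peptides must have the same length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


# Pairs at or above an illustrative 80% identity cutoff (assumed value).
near_identical = [
    (a, b) for a, b in combinations(peptides, 2) if percent_identity(a, b) >= 80.0
]
print(len(near_identical), "near-identical pairs")
for a, b in near_identical:
    print(a, b, round(percent_identity(a, b)))
```

All of the reported pairs come from the five ALAKAAAA* peptides, which differ only in the last position, so half of this data set carries essentially one example.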

Page 38: PERFORMANCE MEASURES

What is similarity?

• Sequence identity?

• Blast e-values
  – Often too conservative

• Other

ACDFG
ACEFG
80% ID

versus

DFLKKVPDDHLEFIPYLILGEVFPEWDERELGVGEKLLIKAVA------------MATGIDAKEIEESVKDTGDL-GEDVLLGADDGSLAFVP---------- SEFSISPGEKIVFKNNAGFPHNIVFDEDSIPSGVDASKISMSEEDLLNAKGE
24% ID

Page 39: PERFORMANCE MEASURES

OLE LUND ET AL. (PROTEIN ENGINEERING 1997)

$\%\mathrm{ID} = \frac{290}{\sqrt{\mathrm{alen}}}$

alen = 100: %ID = 29

alen = 30: %ID = 53

[Figure: DSSP secondary structure identity in alignments as a function of the alignment length and the percent sequence identity]
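The length-dependent threshold is a one-line function. The sketch below reproduces the two values on the slide. The function name is hypothetical, and capping the result at 100% for very short alignments is an added assumption, since the formula itself exceeds 100 for alignments shorter than about 9 residues.

```python
from math import sqrt


def identity_threshold(alignment_length):
    """Percent identity threshold 290 / sqrt(alen), capped at 100 (assumed cap)."""
    if alignment_length <= 0:
        raise ValueError("alignment length must be positive")
    return min(100.0, 290.0 / sqrt(alignment_length))


for alen in (100, 30):
    print(f"alen = {alen}: %ID = {round(identity_threshold(alen))}")
```

The threshold falls as the alignment gets longer, so a short alignment needs a much higher percent identity than a long one to count as similar.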

Page 40: PERFORMANCE MEASURES

MHC BINDING PEPTIDES

9mer: $\%\mathrm{id} = \frac{290}{\sqrt{9}} = 97\%$

$\frac{8}{9} = 89\% < 97\%$

15mer: $\%\mathrm{id} = \frac{290}{\sqrt{15}} = 75\%$

$\frac{11}{15} = 73\% < 75\%$
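The two comparisons on this slide can be checked with a small helper that applies the 290/sqrt(length) threshold to a pair of peptides, given the number of identical positions. Treating a pair as redundant only when its identity strictly exceeds the threshold is an assumption, and the function name is hypothetical.

```python
from math import sqrt


def is_redundant(identical_positions, length):
    """True if the percent identity exceeds the threshold 290 / sqrt(length)."""
    if not 0 <= identical_positions <= length or length <= 0:
        raise ValueError("need 0 <= identical_positions <= length and length > 0")
    percent_id = 100.0 * identical_positions / length
    return percent_id > 290.0 / sqrt(length)


# The two cases from the slide, followed by one more identical position each.
for identical, length in [(8, 9), (11, 15), (9, 9), (12, 15)]:
    print(f"{identical}/{length}: redundant = {is_redundant(identical, length)}")
```

Under this criterion two 9-mers count as redundant only if they are identical, while two 15-mers need at least 12 identical positions.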