Machine Reading for Cancer Panomics · Problem: Hard to scale U.S. 2014: 1.6 million new cases,...

Preview:

Citation preview

Machine Reading for

Cancer Panomics

Hoifung Poon

1

Overview

2

… ATTCGGATATTTAAGGC …

… ATTCGGGTATTTAAGCC …

……

……

Disease Genes

Drug Targets

……KB

Cancer Systems Modeling

High-Throughput Data

3

… ATTCGGATATTTAAGGC …

… ATTCGGGTATTTAAGCC …

……

……

Disease Genes

Drug Targets

…KB

Extract Pathways

from PubMed

Overview

High-Throughput Data

Grounded

Semantic Parsing

4

Panomics

5

… ATTCGGATATTTAAGGC …

Genome Transcriptome Epigenome

……

Genotype Phenotype

6

… ATTCGGATATTTAAGGC …

… ATTCGGGTATTTAAGCC …

……

……

Disease Genes

Drug Targets

……

High-Throughput Data

?

Precision Medicine

8

Before Treatment 15 Weeks

Vemurafenib on BRAF-V600 Melanoma

Vemurafenib on BRAF-V600 Melanoma

9

Before Treatment 15 Weeks 23 Weeks

Cancer

Hundreds of mutations

Most are “passenger”, not driver

Can we identify likely drivers?

10

… ATTCGGATATTTAAGGC …

… ATTCGGGTATTTAAGCC … Normal cells

Tumor cells

Traditional Biology

11

Targeted Experiments Discovery

One

hypothesis

Genomics

12

High-Throughput ExperimentsDiscovery

… ATTCGGATATTTAAGGC …

… ATTCGGGTATTTAAGCC …

… ATTCGGATATTTAAGGC …

… ATTCGGGTATTTAAGCC …

… ATTCGGATATTTAAGGC …

… ATTCGGGTATTTAAGCC …

Many

hypotheses

?

Genomics

13

Discovery

… ATTCGGATATTTAAGGC …

… ATTCGGGTATTTAAGCC …

… ATTCGGATATTTAAGGC …

… ATTCGGGTATTTAAGCC …

… ATTCGGATATTTAAGGC …

… ATTCGGGTATTTAAGCC …

Bottleneck #1: Knowledge

High-Throughput Experiments

Bottleneck #2: Reasoning

Example: Tumor Molecular Board

14

www.ucsf.edu/news/2014/11/120451/bridging-gap-precision-medicine

10-20 highly trained specialists

Tens of hours on each patient

Problem: Hard to scale

U.S. 2014: 1.6 million new cases, 585K deaths

Wanted: Decision support for clinical genomics

15

Example: Tumor Molecular Board

Pathway Knowledge

Genes work synergistically in pathways

16

Why Hard to Identify Drivers?

Complex diseases Perturb multiple pathways

17Hanahan & Weinberg [Cell 2011]

Why Cancer Comes Back?

Subtypes with alternative pathway profile

Compensatory pathways can be activated

18

EphA2 EphB2

Ovarian Cancer

Why Cancer Comes Back?

Subtypes with alternative pathway profile

Compensatory pathways can be activated

19

EphA2 EphB2

Ovarian Cancer

X

Cancer Systems Modeling

20

Gene A DNA mRNA Protein Protein Active

Transcription Translation Activation

… ATTCGGATATTTAAGGC …

Functional activity

Mutation effect

Drug Target

……

21

Gene A DNA mRNA Protein Protein Active

Gene B DNA mRNA Protein Protein Active

Gene C DNA mRNA Protein Protein Active

Transcription Factor

Protein Kinase

Knowledge Model

22

Gene A DNA mRNA Protein Protein Active

Gene B DNA mRNA Protein Protein Active

Gene C DNA mRNA Protein Protein Active

Transcription Factor

Protein Kinase

?Knowledge Model

23

Gene A DNA mRNA Protein Protein Active

Gene B DNA mRNA Protein Protein Active

Gene C DNA mRNA Protein Protein Active

Transcription Factor

Protein Kinase

?Knowledge Model

24

Gene A DNA mRNA Protein Protein Active

Gene B DNA mRNA Protein Protein Active

Gene C DNA mRNA Protein Protein Active

Transcription Factor

Protein Kinase

!Knowledge Model

Approach: Graph HMM

25

Gene A DNA mRNA Protein Protein Active

Transcription Factor

Protein Kinase

Gene B DNA mRNA Protein Protein Active

Gene C DNA mRNA Protein Protein Active

Extract Pathways from PubMed

26

… ATTCGGATATTTAAGGC …

… ATTCGGGTATTTAAGCC …

……

……

Disease Genes

Drug Targets

……KBHigh-Throughput Data

PubMed

24 millions abstracts

Two new abstracts every minute

Adds over one million every year

27

VDR+ binds to

SMAD3 to form

JUN expression

is induced by

SMAD3/4

PMID: 123

PMID: 456

……

28

Machine Reading

29

Involvement of p70(S6)-kinase activation in IL-10

up-regulation in human monocytes by gp41 envelope

protein of human immunodeficiency virus type 1 ...

Machine Reading

30

Involvement of p70(S6)-kinase activation in IL-10

up-regulation in human monocytes by gp41 envelope

protein of human immunodeficiency virus type 1 ...

IL-10human

monocytegp41 p70(S6)-kinase

Machine Reading

PROTEINPROTEINPROTEINCELL

31

Involvement of p70(S6)-kinase activation in IL-10

up-regulation in human monocytes by gp41 envelope

protein of human immunodeficiency virus type 1 ...

Involvement

up-regulation

IL-10human

monocyte

SiteTheme Cause

gp41 p70(S6)-kinase

activation

Theme Cause

Theme

Machine Reading

REGULATION

REGULATION REGULATION

PROTEINPROTEINPROTEINCELL

32

Involvement of p70(S6)-kinase activation in IL-10

up-regulation in human monocytes by gp41 envelope

protein of human immunodeficiency virus type 1 ...

Involvement

up-regulation

IL-10human

monocyte

SiteTheme Cause

gp41 p70(S6)-kinase

activation

Theme Cause

Theme

Machine Reading

REGULATION

REGULATION REGULATION

PROTEINPROTEINPROTEINCELL

Semantic Parsing

Long Tail of Variations

33

TP53 inhibits BCL2.

Tumor suppressor P53 down-regulates the activity of BCL-2 proteins.

BCL2 transcription is suppressed by P53 expression.

The inhibition of B-cell CLL/Lymphoma 2 expression by TP53 …

……

negative regulation532 inhibited, 252 inhibition, 218 inhibit, 207 blocked, 175

inhibits, 157 decreased, 156 reduced, 112 suppressed, 108 decrease, 86 inhibitor, 81 Inhibition, 68 inhibitors, 67 abolished, 66 suppress, 65 block, 63 prevented, 48 suppression, 47 blocks, 44 inhibiting, 42 loss, 39 impaired, 38 reduction, 32 down-regulated, 29 abrogated, 27 prevents, 27 attenuated, 26 repression, 26 decreases, 26 down-regulation, 25 diminished, 25 downregulated, 25 suppresses, 22 interfere, 21 absence, 21 repress ……

Bottleneck: Annotated Examples

GENIA (BioNLP Shared Task 2009-2013)

1999 abstracts

MeSH: human, blood cell, transcription factor

Challenge for “supervised” machine learning

Can we breach this bottleneck?

34

Free Lunch #1:

Distributional Similarity

Similar context Probably similar meaning

Annotation as latent variables

Textual expression Recursive clusters

Unsupervised semantic parsing

35

Poon & Domingos, “Unsupervised Semantic Parsing”.

EMNLP 2009. Best Paper Award.

Recursive Clustering

36

TP53 inhibits BCL2.

Tumor suppressor P53 down-regulates the activity of BCL-2 proteins.

BCL2 transcription is suppressed by P53 expression.

The inhibition of B-cell CLL/Lymphoma 2 expression by TP53 …

……

Recursive Clustering

37

TP53 inhibits BCL2.

Tumor suppressor P53 down-regulates the activity of BCL-2 proteins.

BCL2 transcription is suppressed by P53 expression.

The inhibition of B-cell CLL/Lymphoma 2 expression by TP53 …

……

Recursive Clustering

38

TP53 inhibits BCL2.

Tumor suppressor P53 down-regulates the activity of BCL-2 proteins.

BCL2 transcription is suppressed by P53 expression.

The inhibition of B-cell CLL/Lymphoma 2 expression by TP53 …

……

Recursive Clustering

39

TP53 inhibits BCL2.

Tumor suppressor P53 down-regulates the activity of BCL-2 proteins.

BCL2 transcription is suppressed by P53 expression.

The inhibition of B-cell CLL/Lymphoma 2 expression by TP53 …

……

BCL2, BCL-2 proteins,

B-cell CLL/Lymphoma 2

……

TP53,Tumor

suppressor P53

……

inhibits, down-regulates,

suppresses, inhibition, …

Theme Cause

Free Lunch #2:

Existing KBs

Many KBs available

Gene/Protein: GeneBank, UniProt, …

Pathways: NCI, Reactome, KEGG, BioCarta, …

Annotation as latent variables

Textual expression Table, column, join, …

Grounded semantic parsing

40

Relation Extraction

41

Regulation Theme Cause

Positive A2M FOXO1

Positive ABCB1 TP53

Negative BCL2 TP53

… … …

NCI-PID

Pathway KB

TP53 inhibits BCL2.

Tumor suppressor P53 down-regulates the activity of BCL-2 proteins.

BCL2 transcription is suppressed by P53 expression.

The inhibition of B-cell CLL/Lymphoma 2 expression by TP53 …

……

Relation Extraction

42

Regulation Theme Cause

Positive A2M FOXO1

Positive ABCB1 TP53

Negative BCL2 TP53

… … …

NCI-PID

Pathway KB

TP53 inhibits BCL2.

Tumor suppressor P53 down-regulates the activity of BCL-2 proteins.

BCL2 transcription is suppressed by P53 expression.

The inhibition of B-cell CLL/Lymphoma 2 expression by TP53 …

……

Distant Supervision

43

Involvement of p70(S6)-kinase activation in IL-10

up-regulation in human monocytes by gp41 envelope

protein of human immunodeficiency virus type 1 ...

Involvement

up-regulation

IL-10human

monocyte

SiteTheme Cause

gp41 p70(S6)-kinase

activation

Theme Cause

Theme

Nested Events

REGULATION

REGULATION REGULATION

PROTEINPROTEINPROTEINCELL

GUSPEE

Generalize distant supervision to

extracting nested events

Prior: Favor semantic parse grounded in KB

Outperformed 19 out of 24 participants in

GENIA Shared Task [Kim et al. 2009]

44

Parikh, Poon, Toutanova. “Grounded Semantic Parsing for

Complex Knowledge Extraction”, NAACL-15.

45

GUSPEE

Semantic parser for pathway extraction

Tree HMM

46

Tree HMM

47

Expectation Maximization

48

Virtual EvidenceKey challenge:

non-local evidence

Syntax-Semantics Mismatch

49

Syntax-Semantics Mismatch

50

Syntax-Semantics Mismatch

51

Best Supervised System

52

Preliminary Results

53

54

Prototype-Driven Learning

55

Prototype-Driven LearningOutperformed 19 out of 24 supervised participants

Incomplete KB

56

http://literome.azurewebsites.net

57

Literome

Poon et al., “Literome: PubMed-Scale Genomic Knowledge

Base in the Cloud”, Bioinformatics-14.

PubMed-Scale Extraction

Preliminary pass:

1.5 million instances

13,000 genes, 838,000 unique regulations

Applications:

UCSC Genome Browser, MSR Interactions Track

Expression profile modeling

Validate de novo pathway prediction

Etc.

58

Poon, Toutanova, Quirk, “Distant Supervision for Cancer

Pathway Extraction from Text”. PSB-15.

Machine Science

59

Evans & Rzhetsky, “Machine Science”.

Science, Vol. 329, 2010.

Machine Science

60

Big Data

Machine Science

61

Big Data Rich Knowledge

KB

Machine Science

62

Deep Model

Big Data Rich Knowledge

KB

Machine Science

63

Deep Model

Big Data Rich Knowledge

Hypotheses

KB

Machine Science

64

Deep Model

Big Data Rich Knowledge

Hypotheses

Experiments

KB

Machine Science

65

Deep Model

Big Data Rich Knowledge

Hypotheses

Experiments

KB

Roadmap

Extract richer knowledge:

Cell type, experimental condition, …

Hedging, negation, …

Formulate coherent models:

Supporting evidence, contradiction, …

Intellectual gaps, hypotheses, …

Integrate w. data & experiments:

Cancer panomics Driver genes / pathways

Single-drug response Drug combo prioritization

66

67

Berkeley

AMP Lab

OHSU

Microsoft

Research

68

Berkeley

AMP Lab

OHSU

Microsoft

Research

Decision Support for

Clinical Genomics

69

Raw

Reads

Variant Call

RNA-Seq

Clinical Observation

Knowledge Graph

Decision Support

Clinicians

Literature

70

Raw

Reads

Variant Call

RNA-Seq

Clinical Observation

Knowledge Graph

Decision Support

Clinicians

Literature

NLP

Decision Support for

Clinical Genomics

71

72

73

We Have Digitized Life

74

Next: Digitize Medicine

75

Knock down genes A, B, C → Cure

Pennisi, “The CRISPR Craze”.

Science, Vol. 341, 2013.

Summary

Precision medicine is the future

Cancer systems modeling

Graphical model: Pathways + Panomics data

Extract pathways from PubMed

Machine reading by grounded semantic parsing

Literome: KB for genomic medicine

76

Collaborators

77

Chicago: Andrey Rzhetsky, Kevin White

OHSU: Brian Drucker, Jeff Tyner

Berkeley AMP Lab: David Patterson

Wisconsin: Mark Craven, Anthony Gitter

UCSC: Max Haeussler, David Haussler

Microsoft Research: Chris Quirk, Kristina

Toutanova, David Heckerman, Scott Yih, Lucy

Vanderwende, Bill Bolosky, Ravi Pandya

Summary

78

… ATTCGGATATTTAAGGC …

… ATTCGGGTATTTAAGCC …

……

……

Disease Genes

Drug Targets

……KBHigh-Throughput Data

Recommended