161
Machine Learning for High-Throughput Biological Data These notes were originally from KDD2006 tutorial notes, written by David page at Dept. Biostatistics and Medical Informatics Dept. Computer Sciences University of Wisconsin-Madison. http://www.biostat.wisc.edu/~page/ PageKDD2006.ppt

Machine Learning for High-Throughput Biological Data

  • Upload
    mercer

  • View
    42

  • Download
    0

Embed Size (px)

DESCRIPTION

Machine Learning for High-Throughput Biological Data. These notes were originally from KDD2006 tutorial notes, written by David page at Dept. Biostatistics and Medical Informatics Dept. Computer Sciences University of Wisconsin-Madison . http://www.biostat.wisc.edu/~page/PageKDD2006.ppt. - PowerPoint PPT Presentation

Citation preview

Page 1: Machine Learning for  High-Throughput Biological Data

Machine Learning for High-Throughput Biological Data

These notes were originally from KDD2006 tutorial notes, written by David page at Dept. Biostatistics and Medical Informatics Dept. Computer Sciences University of Wisconsin-Madison.http://www.biostat.wisc.edu/~page/PageKDD2006.ppt

Page 2: Machine Learning for  High-Throughput Biological Data

Some Data Types We’ll Discuss

Gene expression microarraySingle-nucleotide polymorphisms ( 單一核苷酸基因多形性 )Mass spectrometry proteomics ( 蛋白質組學 ) and metabolomics ( 代謝物組學 )Protein-protein interactions (from co-immunoprecipitation)High-throughput screening of potential drug molecules

Page 3: Machine Learning for  High-Throughput Biological Data

image from the DOE Human Genome Programhttp://www.ornl.gov/hgmis

Page 4: Machine Learning for  High-Throughput Biological Data

Probes (DNA)

Gene Chip Surface

Hybridization

Labeled Sample (RNA)

How Microarrays Work

Page 5: Machine Learning for  High-Throughput Biological Data

Two Views of Microarray Data

Data points are genes Represented by expression levels across

different samples (ie, features=samples) Goal: categorize new genesData points are samples (eg, patients) Represented by expression levels of

different genes (ie, features=genes) Goal: categorize new samples

Page 6: Machine Learning for  High-Throughput Biological Data

Two Ways to View The Data

Person Gene A28202_ac AB00014_at AB00015_at . . . Person 1 1142.0 321.0 2567.2 . . . Person 2 586.3 586.1 759.0 . . . Person 3 105.2 559.3 3210.7 . . . Person 4 42.8 692.1 812.0 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Page 7: Machine Learning for  High-Throughput Biological Data

Data Points are Genes

Person Gene A28202_ac AB00014_at AB00015_at . . . Person 1 1142.0 321.0 2567.2 . . . Person 2 586.3 586.1 759.0 . . . Person 3 105.2 559.3 3210.7 . . . Person 4 42.8 692.1 812.0 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Page 8: Machine Learning for  High-Throughput Biological Data

Data Points are Samples

Person Gene A28202_ac AB00014_at AB00015_at . . . Person 1 1142.0 321.0 2567.2 . . . Person 2 586.3 586.1 759.0 . . . Person 3 105.2 559.3 3210.7 . . . Person 4 42.8 692.1 812.0 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Page 9: Machine Learning for  High-Throughput Biological Data

Supervision: Add Class Values

Person Gene A28202_ac AB00014_at AB00015_at . . . Class Person 1 1142.0 321.0 2567.2 . . . normal Person 2 586.3 586.1 759.0 . . . cancer Person 3 105.2 559.3 3210.7 . . . normal Person 4 42.8 692.1 812.0 . . . cancer . . . . . . . . . . . . . . . . . . . . . . . . . . .

Page 10: Machine Learning for  High-Throughput Biological Data

Supervised Learning TaskGiven: a set of microarray experiments, each done with mRNA from a different patient (same cell type from every patient) Patient’s expression values for each gene constitute the features, and patient’s disease constitutes the class

Do: Learn a model that accurately predicts class based on features

Page 11: Machine Learning for  High-Throughput Biological Data

Data Points are: Genes Samples

Clustering Supervised Data Mining

Predict the class value for a patient

based on the expression levels for his/her genes

Location in Task Space

Page 12: Machine Learning for  High-Throughput Biological Data

Leukemia (Golub et al., 1999)Classes Acute Lymphoblastic Leukemia( 淋巴白血病 ) (ALL) and Acute Myeloid Leukemia ( 骨髓白血病 ) (AML)Approach Weighted voting (essentially naïve Bayes)Cross-Validated Accuracy Of 34 samples, declined to predict 5, correct on other 29

Page 13: Machine Learning for  High-Throughput Biological Data

Cancer vs. NormalRelatively easy to predict accurately, because so much goes “haywire” in cancer cellsPrimary barrier is noise in the data… impure RNA, cross-hybridization, etcStudies include breast, colon ( 结肠 ), prostate ( 前列腺 ), lymphoma ( 淋巴瘤 ), and multiple myeloma ( 骨髓瘤 )

Page 14: Machine Learning for  High-Throughput Biological Data

X-Val Accuracies for Multiple Myeloma (74 MM vs.

31 Normal)

Trees

Boosted Trees

SVMs

Vote

Bayes Nets

98.1

99.0

100.0

97.0

100.0

Page 15: Machine Learning for  High-Throughput Biological Data

More MM (300), Benign Condition MGUS (Hardin et al., 2004)

Page 16: Machine Learning for  High-Throughput Biological Data

ROC Curves: Cancer vs. Normal

Page 17: Machine Learning for  High-Throughput Biological Data

ROC: Cancer vs. Benign (MGUS)

Page 18: Machine Learning for  High-Throughput Biological Data

Work by Statisticians Outside of Standard Classification/Clustering

Methods to better convert Affymetrix’s low-level intensity measurements into expression levels: e.g., work by Speed, Wong, IrrizaryMethods to find differentially expressed genes between two samples, e.g. work by Newton and KendziorskiBut the following is most related…

Page 19: Machine Learning for  High-Throughput Biological Data

Ranking Genes by Significance

Some biologists don’t want one predictive model, but a rank-ordered list of genes to explore further (with estimated significance)For each gene we have a set of expression levels under our conditions, say cancer vs. normalWe can do a t-test to see if the mean expression levels are different under the two conditions: p-valueMultiple comparisons problem: if we repeat this test for 30,000 genes, some will pop up as significant just by chance aloneCould do a Bonferoni correction (multiply p-values by 30,000), but this is drastic and might eliminate all

Page 20: Machine Learning for  High-Throughput Biological Data

False Discovery Rate (FDR) [Storey and Tibshirani, 2001]

Addresses multiple comparisons but is less extreme than BonferoniReplaces p-value by q-value: fraction of genes with this p-value or lower that really don’t have different means in the two classes (false discoveries)Publicly available in R as part of Bioconductor packageRecommendation: Use this in addition to your supervised data mining… your collaborators will want to see it

Page 21: Machine Learning for  High-Throughput Biological Data

FDR Highlights Difficulties Getting Insight into Cancer vs. Normal

Page 22: Machine Learning for  High-Throughput Biological Data

Using Benign Condition Instead of Normal Helps Somewhat

Page 23: Machine Learning for  High-Throughput Biological Data

Question to AnticipateYou’ve run a supervised data mining algorithm on your collaborator’s data, and you present an estimate of accuracy or an ROC curve (from X-val)How did you adjust this for the multiple comparisons problem?Answer: you don’t need to because you commit to a single predictive model before ever looking at the test data for a fold—this is only one comparison

Page 24: Machine Learning for  High-Throughput Biological Data

Prognosis and TreatmentFeatures same as for diagnosis

Rather than disease state, class value becomes life expectancy with a given treatment (or positive response vs. no response to given treatment)

Page 25: Machine Learning for  High-Throughput Biological Data

Breast Cancer Prognosis (Van’t Veer et al., 2002)

Classes good prognosis (no metastasis within five years of initial diagnosis) vs. poor prognosisAlgorithm Ensemble of votersResults 83% cross-validated accuracy on 78 cases

Page 26: Machine Learning for  High-Throughput Biological Data

A LessonPrevious work selected features to use in ensemble by looking at the entire data setShould have repeated feature selection on each cross-val foldAuthors also chose ensemble size by seeing which size gave highest cross-val resultAuthors corrected this in web supplement;accuracy went from 83% to 73%Remember to “tune parameters” separately for each cross-val fold!

Page 27: Machine Learning for  High-Throughput Biological Data

Prognosis with Specific Therapy (Rosenwald et al., 2002)

Data set contains gene-expression patterns for 160 patients with diffuse large B-cell lymphoma, receiving anthracycline chemotherapyClass label is five-year survivalOne test-train split 80/80True positive rate: 60% False negative rate: 39%

Page 28: Machine Learning for  High-Throughput Biological Data

Some Future DirectionsUsing gene-chip data to select therapy Predict which therapy gives best prognosis for patientCombining Gene Expression Data with Clinical Data such as Lab Results, Medical and Family History Multiple relational tables, may benefit from relational learning

Page 29: Machine Learning for  High-Throughput Biological Data

Unsupervised Learning Task

Given: a set of microarray experiments under different conditionsDo: cluster the genes, where a gene described by its expression levels in different experiments

Page 30: Machine Learning for  High-Throughput Biological Data

Data Points are: Genes Samples

Clustering Supervised Data Mining

Group genes into clusters, where all

members of a cluster tend to go

up or down together

Location in Task Space

Page 31: Machine Learning for  High-Throughput Biological Data

Example(Green = up-regulated, Red = down-regulated)

Gen

es

Experiments (Samples)

Page 32: Machine Learning for  High-Throughput Biological Data

Visualizing Gene Clusters (eg, Sharan and Shamir, 2000)

Time (10-minute intervals)

Nor

mal

ized

expr

essi

on

Gene Cluster 1, size=20 Gene Cluster 2, size=43

Page 33: Machine Learning for  High-Throughput Biological Data

Unsupervised Learning Task 2

Given: a set of microarray experiments (samples) corresponding to different conditions or patientsDo: cluster the experiments

Page 34: Machine Learning for  High-Throughput Biological Data

Data Points are: Genes Samples

Clustering Supervised Data Mining

Group samples by gene expression

profile

Location in Task Space

Page 35: Machine Learning for  High-Throughput Biological Data

ExamplesCluster samples from mice subjected to a variety of toxic compounds (Thomas et al., 2001)Cluster samples from cancer patients, potentially to discover different subtypes of a cancerCluster samples taken at different time points

Page 36: Machine Learning for  High-Throughput Biological Data

Some Biological PathwaysRegulatory pathways Nodes are labeled by genes Arcs denote influence on transcription G1 codes for P1, P1 inhibits G2’s transcription

Metabolic pathways Nodes are metabolites, large biomolecules (eg,

sugars, lipids, proteins and modified proteins) Arcs from biochemical reaction inputs to

outputs Arcs labeled by enzymes (themselves proteins)

Page 37: Machine Learning for  High-Throughput Biological Data

Metabolic Pathway Example

Fumarate

Malate

Oxaloacetate

Citrate cis-Aconitate

Isocitrate

-Ketoglutarate

Succinyl-CoASuccinate

fumarase

succinate thikinase

MDH

citrate synthase aconitase

IDH

-KDGH

FAD

FADH2

H20

NAD+

NADH

Acetyl CoAHSCoA

H20

H20

NAD+

NADH + CO2

NAD+ + HSCoA

NADH + CO2

GDP + Pi GTP+ HSCoA

(Krebs Cycle,TCA Cycle,Citric Acid Cycle)

Page 38: Machine Learning for  High-Throughput Biological Data

Regulatory Pathway (KEGG)

Page 39: Machine Learning for  High-Throughput Biological Data

Using Microarray Data Only

Regulatory pathways Nodes are labeled by genes Arcs denote influence on transcription G1 codes for P1, P1 inhibits G2’s transcription

Metabolic pathways Nodes are metabolites, large biomolecules (eg,

sugars, lipids, proteins, and modified proteins) Arcs from biochemical reaction inputs to

outputs Arcs labeled by enzymes (themselves proteins)

Page 40: Machine Learning for  High-Throughput Biological Data

Supervised Learning Task 2

Given: a set of microarray experiments for same organism under different conditionsDo: Learn graphical model that accurately predicts expression of some genes in terms of others

Page 41: Machine Learning for  High-Throughput Biological Data

Some Approaches to Learning Regulatory

NetworksBayes Net Learning (started with Friedman & Halpern, 1999, we’ll see more)

Boolean Networks (Akutsu, Kuhara, Maruyama & Miyano, 1998; Ideker, Thorsson & Karp, 2002)

Related Graphical Approaches (Tanay & Shamir, 2001; Chrisman, Langley, Baay & Pohorille, 2003)

Page 42: Machine Learning for  High-Throughput Biological Data

Note: direction of arrowindicates dependencenot causality

P(geneA)

P(geneB)P(

gene

A)

parent node

child node

parent node

child node

geneA

geneB

Bayesian Network (BN)

1.00.0

DatageneBgeneA

Expt1Expt2Expt3Expt4

0.50.5

0.50.5

Page 43: Machine Learning for  High-Throughput Biological Data

Problem: Not CausalityA B

A is a good predictor of B. But is A regulating B??Ground truth might be:

B A A C B

B C A

B

C

A Or a more complicated variant

Page 44: Machine Learning for  High-Throughput Biological Data

Approaches to Get Causality

Use “knock-outs” (Pe’er, Regev, Elidan and Friedman, 2001). But not available in most organisms.Use time-series data and Dynamic Bayesian Networks (Ong, Glasner and Page, 2002). But even less data typically.Use other data sources, eg sequences upstream of genes, where transcription regulators may bind. (Segal, Barash, Simon, Friedman and Koller, 2002; Noto and Craven, 2005)

Page 45: Machine Learning for  High-Throughput Biological Data

A Dynamic Bayes Net

gene 1

gene 2 gene

3gene

N

gene 1

gene 2 gene

3gene

N

Page 46: Machine Learning for  High-Throughput Biological Data

Problem: Not Enough Data Points to Construct Large Network

Fortunate to get 100s of chipsBut have 1000s of genes E. coli: ~4000 Yeast: ~6000 Human: ~30,000Want to learn causal graphical model over 1000s of variables with 100s of examples (settings of the variables)

Page 47: Machine Learning for  High-Throughput Biological Data

Advance: Module Networks [Segal, Pe’er, Regev, Koller & Friedman, 2005]

Cluster genes by similarity over expression experimentsAll genes in a cluster are “tied together”: same parents and CPDsLearn structure subject to this tying together of genesIteratively re-form clusters and re-learn network, in an EM-like fashion

Page 48: Machine Learning for  High-Throughput Biological Data

Problem: Data are Continuous but Models are Discrete

Gene chips provide a real-valued mRNA measurementBoolean networks and most practical Bayes net learning algorithms assume discrete variablesMay lose valuable information by discretizing

Page 49: Machine Learning for  High-Throughput Biological Data

Advance: Use of Dynamic Bayes Nets with Continuous Variables [Segal, Pe’er, Regev, Koller & Friedman, 2005]

Expression measurements used instead of discretized (up, down, same)Assume linear influence of parents on children (Michaelis-Menten assumption)Work so far constructed the network from literature and learned parameters

Page 50: Machine Learning for  High-Throughput Biological Data

Problem: Much Missing Information

mRNA from gene 1 doesn’t directly alter level of mRNA from gene 2Rather, the protein product from gene 1 may alter level of mRNA from gene 2 (e.g., transcription factor)Activation of transcription factor might not occur by making more of it, but just by phosphorylating it (post-translational modification)

Page 51: Machine Learning for  High-Throughput Biological Data

R

Example: Transcription Regulation

DNAgeneA geneB geneCP TO

OperongeneRP O T

Operon OperonOperonR

mRNAmRNA

Page 52: Machine Learning for  High-Throughput Biological Data

Approach: Measure More Stuff

Mass spectrometry (later) can measure protein rather than mRNA Doesn’t measure all proteins Not very quantitative (presence/absence)

2D gels can measure post-translational modifications, but still low-throughput because of current analysisCo-immunoprecipitation (later), Yeast 2-Hybrids can measure protein interactions, but noisy

Page 53: Machine Learning for  High-Throughput Biological Data

Another Way Around Limitations

Identify smaller part of the task that is a step toward a full regulatory pathway Part of a pathway Classes or groups of genesExamples:Chromatin remodelers Predicting the operons in E. coli

Page 54: Machine Learning for  High-Throughput Biological Data

Chromatin Remodelers and Nucleosome [Segal et al. 2006]

Previous DNA picture oversimplifiedDNA double-helix is wrapped in further complex structureDNA is accessible only if part of this structure is unwoundCan we predict what chromatin remodelers act on what parts of DNA, also what activates a remodeler?

Page 55: Machine Learning for  High-Throughput Biological Data

The E. Coli Genome

Page 56: Machine Learning for  High-Throughput Biological Data

Finding Operons in E. coli(Craven, Page, Shavlik, Bockhorst and Glasner, 2000)

Given: known operons and other E. coli data Do: predict all operons in E. coli

Additional Sources of Information gene-expression data functional annotation

g1g2 g3

g5g4

promoter terminator

Page 57: Machine Learning for  High-Throughput Biological Data

Comparing Naive Bayes and Decision Trees (C5.0)

Page 58: Machine Learning for  High-Throughput Biological Data

Using Only Individual Features

Page 59: Machine Learning for  High-Throughput Biological Data

Single-Nucleotide Polymorphisms

SNPs: Individual positions in DNA where variation is commonRoughly 2 million known SNPs in humansNew Affymetrix whole-genome scan measures 500,000 of theseEasier/faster/cheaper to measure SNPs than to completely sequence everyoneMotivation …

Page 60: Machine Learning for  High-Throughput Biological Data

Not Susceptible or Not Responding

Susceptible to Disease D or Responds to Treatment T

If We Sequenced Everyone…

Page 61: Machine Learning for  High-Throughput Biological Data

Example of SNP Data

Person SNP 1 2 3 . . . CLASS Person 1 C T A G T T . . . old Person 2 C C A G C T . . . young Person 3 T T A A C C . . . old Person 4 C T G G T T . . . young . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Page 62: Machine Learning for  High-Throughput Biological Data

Phasing (Haplotyping)

Page 63: Machine Learning for  High-Throughput Biological Data

Advantages of SNP DataPerson’s SNP pattern does not change with time or disease, so it can give more insight into susceptibility

Easier to collect samples (can simply use blood rather than affected tissue)

Page 64: Machine Learning for  High-Throughput Biological Data

Challenges of SNP DataUnphased Algorithms exist for phasing (haplotyping), but they make errors and typically need related individuals, dense coverageMissing values are more common than in microarray data (though improving substantially, down to around 1-2% now)Many more measurements. For example, Affymetrix human SNP chip at a half million SNPs.

Page 65: Machine Learning for  High-Throughput Biological Data

Supervised Learning TaskGiven: a set of SNP profiles, each from a different patient.Phased: nucleotides at each SNP position on each copy of each chromosome constitute the features, and patient’s disease constitutes the class

Unphased: unordered pair of nucleotides at each SNP position constitute the features, and patient’s disease constitutes the class

Do: Learn a model that accurately predicts class based on features

Page 66: Machine Learning for  High-Throughput Biological Data

Waddell et al., 2005Multiple Myeloma, Young (susceptible) vs. Old (less susceptible), 3000 SNPs, best at 64% acc (training)SVM with feature selection (repeated on every fold of cross-validation): 72% accuracy, also naïve Bayes. Significantly better than chance.

Old

Young

Old Young

Actual31 9

14 26

Page 67: Machine Learning for  High-Throughput Biological Data

Listgarten et al., 2005 SVMs from SNP data predict lung cancer susceptibility at 69% accuracy Naïve Bayes gives similar performance Best single SNP at less than 60% accuracy (training)

Page 68: Machine Learning for  High-Throughput Biological Data

LessonsSupervised data mining algorithms can predict disease susceptibility at rates better than chance and better than individual SNPsAccuracies much lower than we see with microarray data, because we’re predicting who will get disease, not who already has it

Page 69: Machine Learning for  High-Throughput Biological Data

Future DirectionsPharmacogenetics: predicting drug response from SNP profile Drug Efficacy Adverse ReactionCombining SNP data with other data types, such as clinical (history, lab tests) and microarray

Page 70: Machine Learning for  High-Throughput Biological Data

ProteomicsMicroarrays are useful primarily because mRNA concentrations serve as surrogate for protein concentrationsLike to measure protein concentrations directly, but at present cannot do so insame high-throughput mannerProteins do not have obvious direct complementsCould build molecules that bind, but binding greatly affected by protein structure

Page 71: Machine Learning for  High-Throughput Biological Data

Time-of-Flight (TOF) Mass Spectrometry (thanks Sean McIlwain)

Laser

+VSample

Measures the time for an ionized particle, starting from the sample plate, to hit the detector

Detector

Page 72: Machine Learning for  High-Throughput Biological Data

Time-of-Flight (TOF) Mass Spectrometry 2

Laser

+VSample

Matrix-Assisted Laser Desorption-Ionization (MALDI) Crystalloid structures made using proton-rich matrix moleculeHitting crystalloid with laser causes molecules to ionize and “fly” towards detector

Detector

Page 73: Machine Learning for  High-Throughput Biological Data

Time-of-Flight Demonstration 0

Sample Plate

Page 74: Machine Learning for  High-Throughput Biological Data

Time-of-Flight Demonstration 1

Matrix Molecules

Page 75: Machine Learning for  High-Throughput Biological Data

Time-of-Flight Demonstration 2

Protein Molecules

Page 76: Machine Learning for  High-Throughput Biological Data

Time-of-Flight Demonstration 3

LaserDetector

+10KV Positive Charge

Page 77: Machine Learning for  High-Throughput Biological Data

Time-of-Flight Demonstration 4

+10KV

+

Proton kicked off matrix molecule onto another molecule

Laser pulsed directly onto sample

Page 78: Machine Learning for  High-Throughput Biological Data

Time-of-Flight Demonstration 5

+10KV

++

++ +

Lots of protons kicked off matrix ions, giving rise to more positively charged molecules

Page 79: Machine Learning for  High-Throughput Biological Data

Time-of-Flight Demonstration 6

+10KV

++

++ +

The high positive potential under sample plate, causes positively charged molecules to accelerate towards detector

Page 80: Machine Learning for  High-Throughput Biological Data

Time-of-Flight Demonstration 7

+10Kv

+

+

+

+

+

+

Smaller mass molecules hit detector first, while heavier ones detected later

Page 81: Machine Learning for  High-Throughput Biological Data

Time-of-Flight Demonstration 8

+10KV

+

+

+

+

+

+

The incident time measured from when laser is pulsed until molecule hits detector

Page 82: Machine Learning for  High-Throughput Biological Data

Time-of-Flight Demonstration 9

+10KV

++ + + ++

Experiment repeated a number of times, counting frequencies of “flight-times”

Page 83: Machine Learning for  High-Throughput Biological Data

Example Spectra from a Competition by Lin et al. at

Duke

These are different fractions from the same sample.

M/Z

Inte

nsity

Page 84: Machine Learning for  High-Throughput Biological Data

Trypsin-Treated Spectra

M/Z

Freq

uenc

y

Page 85: Machine Learning for  High-Throughput Biological Data

Many Challenges Raised by Mass Spectrometry DataNoise: extra peaks from handling of sample, from machine and environment (electrical noise), etc.M/Z values may not align exactly across spectra (resolution ~0.1%)Intensities not calibrated across spectra: quantification is difficultCannot get all proteins… typically only several hundred. To improve odds of getting the ones we want, may fractionate our sample by 2D gel electrophoresis or liquid chromatography.

Page 86: Machine Learning for  High-Throughput Biological Data

Challenges (Continued)Better results if partially digest proteins (break into smaller peptides) firstCan be difficult to determine what proteins we have from spectrumIsotopic peaks: C13 and N15 atoms in varying numbers cause multiple peaks for a single peptide

Page 87: Machine Learning for  High-Throughput Biological Data

Handling Noise: Peak Picking

Want to pick peaks that are statistically significant from the noise signal

Want to use these as features in our learning algorithms.

Page 88: Machine Learning for  High-Throughput Biological Data

Many Supervised Learning Tasks

Learn to predict proteins from spectra, when the organism’s proteome is knownLearn to identify isotopic distributionsLearn to predict disease from either proteins, peaks or isotopic distributions as featuresConstruct pathway models

Page 89: Machine Learning for  High-Throughput Biological Data

Using Mass Spectrometry for Early Detection of Ovarian Cancer [Petricoin et al., 2002]Ovarian cancer difficult to detect early, often leading to poor prognosisTrained and tested on mass spectra from blood serum100 training cases, 50 with cancerHeld-out test set of 116 cases, 50 with cancer100% sensitivity, 95% specificity (63/66) on held-out test set

Page 90: Machine Learning for  High-Throughput Biological Data

Not So FastData mining methodology seems soundBut Keith Baggerly argues that cancer samples were handled differently than normal samples, and perhaps data were preprocessed differently tooIf we run cancer samples Monday and normals Wednesday, could get differences from machine breakdown or nearby electrical equipment that’s running on Monday but not WedLesson: tell collaborators they must randomize samples for the entire processing phase and of course all our preprocessing must be sameDebate is still raging… results not replicated in trials

Page 91: Machine Learning for  High-Throughput Biological Data

Other Proteomics: 3D Structures

Page 92: Machine Learning for  High-Throughput Biological Data

Other Proteomics: Interactions

Figure from Ideker et al., Science 292(5518):929-934, 2001

each node represents a gene product (protein)blue edges show direct protein-protein interactionsyellow edges show interactions in which one protein binds to DNA and affects the expression of another

Page 93: Machine Learning for  High-Throughput Biological Data

Protein-Protein Interactions

Yeast 2-HybridImmunoprecipitation Antibodies (“immuno”) are made by

combinatorial combinations of certain proteins

Millions of antibodies can be made, to recognize a wide variety of different antigens (“invaders”), often by recognizing specific proteinsantibody protein

Page 94: Machine Learning for  High-Throughput Biological Data

Protein-Protein Interactions

Page 95: Machine Learning for  High-Throughput Biological Data

Immunoprecipitationantibody

Page 96: Machine Learning for  High-Throughput Biological Data

Co-Immunoprecipitationantibody

Page 97: Machine Learning for  High-Throughput Biological Data

Many Supervised Learning Tasks

Learn to predict protein-protein interactions: protein 3D structures may be criticalUse protein-protein interactions in construction of pathway modelsLearn to predict protein function from interaction data

Page 98: Machine Learning for  High-Throughput Biological Data

ChIP-Chip DataImmunoprecipitation can also be done to identify proteins interacting with DNA rather than other proteinsChromatin immunoprecipitation (ChIP): grab sample of DNA bound to a particular protein (transcription factor)ChIP-Chip: run this sample of DNA on a microarray to see which DNA was boundExample of analysis of such new data: Keles et al., 2006

Page 99: Machine Learning for  High-Throughput Biological Data

MetabolomicsMeasures concentration of each low-molecular weight molecule in sampleThese typically are “metabolites,” or small molecules produced or consumed by reactions in biochemical pathwaysThese reactions typically catalyzed by proteins (specifically, enzymes)This data typically also mass spectrometry, though could also be NMR

Page 100: Machine Learning for  High-Throughput Biological Data

LipomicsAnalogous to metabolomics, but measuring concentrations of lipids rather than metabolitesPotentially help induce biochemical pathway information or to help disease diagnosis or treatment choice

Page 101: Machine Learning for  High-Throughput Biological Data

To Design a Drug:Identify Target

Protein

DetermineTarget SiteStructure

Synthesize aMolecule that

Will Bind

Knowledge of proteome/genomeRelevant biochemical pathways

Crystallography, NMRDifficult if Membrane-Bound

Imperfect modeling of structureStructures may change at bindingAnd even then…

Page 102: Machine Learning for  High-Throughput Biological Data

Molecule Binds Target But May:

Bind too tightly or not tightly enough.Be toxic.Have other effects (side-effects) in the body.Break down as soon as it gets into the body, or may not leave the body soon enough.It may not get to where it should in the body (e.g., crossing blood-brain barrier).Not diffuse from gut to bloodstream.

Page 103: Machine Learning for  High-Throughput Biological Data

And Every Body is Different:

Even if a molecule works in the test tube and works in animal studies, it may not work in people (will fail in clinical trials).A molecule may work for some people but not others.A molecule may cause harmful side-effects in some people but not others.

Page 104: Machine Learning for  High-Throughput Biological Data

Typical Practice when Target Structure is Unknown

High-Throughput Screening (HTS): Test many molecules (1,000,000) to find some that bind to target (ligands).Infer (induce) shape of target site from 3D structural similarities.Shared 3D substructure is called a pharmacophore.Perfect example of a machine learning task with spatial target.

Page 105: Machine Learning for  High-Throughput Biological Data

An Example of Structure Learning

Act

ive

Inac

tive

Page 106: Machine Learning for  High-Throughput Biological Data

Common Data Mining Approaches

Represent a molecule by thousands to millions of features and use standard techniques (e.g., KDD Cup 2001)Represent each low-energy conformer by feature vector and use multiple-instance learning (e.g., Jain et al., 1998)Relational learning Inductive logic programming (e.g., Finn et

al., 1998) Graph mining

Page 107: Machine Learning for  High-Throughput Biological Data

Supervised Learning TaskGiven: a set of molecules, each labeled by activity -- binding affinity for target protein -- and a set of low-energy conformers for each molecule

Do: Learn a model that accurately predicts activity (may be Boolean or real-valued)

Page 108: Machine Learning for  High-Throughput Biological Data

ILP as Illustration: The Logical Representation of a Pharmacophore

Active(X) if: has-conformation(X,Conf),has-hydrophobic(X,A),has-hydrophobic(X,B),distance(X,Conf,A,B,3.0,1.0),has-acceptor(X,C),distance(X,Conf,A,C,4.0,1.0),distance(X,Conf,B,C,5.0,1.0).

This logical clause states that a molecule X is active if it has someconformation Conf, hydrophobic groups A and B, and a hydrogen acceptor Csuch that the following holds: in conformation Conf of molecule X, thedistance between A and B is 3 Angstroms (plus or minus 1), the distancebetween A and C is 4, and the distance between B and C is 5.

Page 109: Machine Learning for  High-Throughput Biological Data

Background Knowledge IInformation about atoms and bonds in the molecules

atm(m1,a1,o,3,5.915800,-2.441200,1.799700).atm(m1,a2,c,3,0.574700,-2.773300,0.337600).atm(m1,a3,s,3,0.408000,-3.511700,-1.314000).

bond(m1,a1,a2,1). bond(m1,a2,a3,1).

Page 110: Machine Learning for  High-Throughput Biological Data

Background knowledge IIDefinition of distance equivalencedist(Drug,Atom1,Atom2,Dist,Error):- number(Error), coord(Drug,Atom1,X1,Y1,Z1), coord(Drug,Atom2,X2,Y2,Z2), euc_dist(p(X1,Y1,Z1),p(X2,Y2,Z2),Dist1), Diff is Dist1-Dist, absolute_value(Diff,E1), E1 =< Error.

euc_dist(p(X1,Y1,Z1),p(X2,Y2,Z2),D):- Dsq is (X1-X2)^2+(Y1-Y2)^2+(Z1-Z2)^2, D is sqrt(Dsq).

Page 111: Machine Learning for  High-Throughput Biological Data

Central Idea: Generalize by searching a lattice

Lattice of Hypotheses active(X)

active(X) if active(X) if active(X) ifhas-hydrophobic(X,A) has-donor(X,A) has-acceptor(X,A)

active(X) if active(X) if active(X) ifhas-hydrophobic(X,A), has-donor(X,A), has-acceptor(X,A),has-donor(X,B), has-donor(X,B), has-donor(X,B),distance(X,A,B,5.0) distance(X,A,B,4.0) distance(X,A,B,6.0)

etc.

Page 112: Machine Learning for  High-Throughput Biological Data

Conformational model Conformational flexibility modelled as multiple conformations:

Sybyl randomsearch Catalyst

Page 113: Machine Learning for  High-Throughput Biological Data

Pharmacophore description

Atom and site centred Hydrogen bond donor Hydrogen bond acceptor Hydrophobe Site points (limited at present) User definableDistance based

Page 114: Machine Learning for  High-Throughput Biological Data

Example 1: Dopamine agonists

Agonists taken from Martin data set on QSAR society web pagesExamples (5-50 conformations/molecule)

NCH3

CH3

OH

OH

OH

OHNH2 NH

OH

OH

NH2

OH

OHN

CH3 OH

OH

NH2

OH

OH

Page 115: Machine Learning for  High-Throughput Biological Data

Pharmacophore identified

Molecule A has the desired activity if: in conformation B molecule A contains a hydrogen acceptor at C, and in conformation B molecule A contains a basic nitrogen group at D, and the distance between C and D is 7.05966 +/- 0.75 Angstroms, and in conformation B molecule A contains a hydrogen acceptor at E, and the distance between C and E is 2.80871 +/- 0.75 Angstroms, and the distance between D and E is 6.36846 +/- 0.75 Angstroms, and in conformation B molecule A contains a hydrophobic group at F, and the distance between C and F is 2.68136 +/- 0.75 Angstroms, and the distance between D and F is 4.80399 +/- 0.75 Angstroms, and the distance between E and F is 2.74602 +/- 0.75 Angstroms.

Page 116: Machine Learning for  High-Throughput Biological Data

Example II: ACE inhibitors28 angiotensin converting enzyme inhibitors taken from literature D. Mayer et al., J. Comput.-Aided Mol.

Design, 1, 3-16, (1987)

N

NN

O

O

COOH

SHN

COOHO

CH3

NH

POH

OHO N

COOHO

CH3

NH

COOH

Page 117: Machine Learning for  High-Throughput Biological Data

ACE pharmacophoreMolecule A is an ACE inhibitor if: molecule A contains a zinc-site B, molecule A contains a hydrogen acceptor C, the distance between B and C is 7.899 +/- 0.750 A, molecule A contains a hydrogen acceptor D, the distance between B and D is 8.475 +/- 0.750 A, the distance between C and D is 2.133 +/- 0.750 A, molecule A contains a hydrogen acceptor E, the distance between B and E is 4.891 +/- 0.750 A, the distance between C and E is 3.114 +/- 0.750 A, the distance between D and E is 3.753 +/- 0.750 A.

Page 118: Machine Learning for  High-Throughput Biological Data

Pharmacophore discovered

Distance Progol Mayer

A 4.9 5.0

B 3.8 3.8

C 8.5 8.6

Zinc site

H-bond acceptor

C

AB

Page 119: Machine Learning for  High-Throughput Biological Data

Additional FindingOriginal pharmacophore rediscovered plus one other different zinc ligand position similar to alternative proposed by

Ciba-Geigy

7.3

3.94.0

Page 120: Machine Learning for  High-Throughput Biological Data

Example III: Thermolysin inhibitors

10 inhibitors for which crystallographic data is available in PDBConformationally challenging moleculesExperimentally observed superposition

Page 121: Machine Learning for  High-Throughput Biological Data

Key binding site interactions

Asn112-NH

S2’ O=C Asn112

Arg203-NH S1’

O=C Ala113Zn

OOH

NH

O

NHPO

O

R

Page 122: Machine Learning for  High-Throughput Biological Data

Interactions made by inhibitors

Interaction 1HYT 1THL 1TLP 1TMN 2TMN 4TLN 4TMN 5TLN 5TMN 6TMNAsn112-NH S2’ Asn112 C=O Arg 203 NH S1’ Ala113-C=O Zn

Page 123: Machine Learning for  High-Throughput Biological Data

Pharmacophore Identification

Structures considered 1HYT 1THL 1TLP 1TMN 2TMN 4TLN 4TMN 5TLN 5TMN 6TMNConformational analysis using “Best” conformer generation in Catalyst98-251 conformations/molecule

Page 124: Machine Learning for  High-Throughput Biological Data

Thermolysin Results10 5-point pharmacophore identified, falling into 2 groups (7/10 molecules) 3 “acceptors”, 1 hydrophobe, 1 donor 4 “acceptors, 1 donorCommon core of Zn ligands, Arg203 and Asn112 interactions identifiedCorrect assignments of functional groupsCorrect geometry to 1 Angstrom tolerance

Page 125: Machine Learning for  High-Throughput Biological Data

Thermolysin resultsIncreasing tolerance to 1.5Angstroms finds common 6-point pharmacophore including one extra interaction

Page 126: Machine Learning for  High-Throughput Biological Data

Example IV: Antibacterial peptides [Spatola et al., 2000]

Dataset of 11 pentapeptides showing activity against Pseudomonas aeruginosa 6 actives <6g/ml IC50 5 inactives

Page 127: Machine Learning for  High-Throughput Biological Data

Pharmacophore Identified

A Molecule M is active against Pseudomonas Aeruginosaif it has a conformation B such that:

M has a hydrophobic group C,M has a hydrogen acceptor D,the distance between C and D in conformation B is 11.7 AngstromsM has a positively-charged atom E,the distance between C and E in conformation B is 4 Angstromsthe distance between D and E in conformation B is 9.4 AngstromsM has a positively-charged atom F,the distance between C and F in conformation B is 11.1 Angstromsthe distance between D and F in conformation B is 12.6 Angstromsthe distance between E and F in conformation B is 8.7 Angstroms

Tolerance 1.5 Angstroms

Page 128: Machine Learning for  High-Throughput Biological Data
Page 129: Machine Learning for  High-Throughput Biological Data

Clinical Databases of the Future (Dramatically Simplified)

PatientID Gender Birthdate

P1 M 3/22/63

PatientID Date Physician Symptoms Diagnosis

P1 1/1/01 Smith palpitations hypoglycemic P1 2/1/03 Jones fever, aches influenza

PatientID Date Lab Test Result

P1 1/1/01 blood glucose 42 P1 1/9/01 blood glucose 45

PatientID SNP1 SNP2 … SNP500K

P1 AA AB BB P2 AB BB AA

PatientID Date Prescribed Date Filled Physician Medication Dose Duration

P1 5/17/98 5/18/98 Jones prilosec 10mg 3 months

Page 130: Machine Learning for  High-Throughput Biological Data

Final Wrap-upMolecular biology collecting lots and lots of data in post-genome eraOpportunity to “connect” molecular-level information to diseases and treatmentNeed analysis tools to interpretData mining opportunities aboundHopefully this tutorial provided solid start toward applying data mining to high-throughput biological data

Page 131: Machine Learning for  High-Throughput Biological Data

Thanks ToJude ShavlikJohn ShaughnessyBart BarlogieMark CravenSean McIlwainJan StruyfArno SpatolaPaul FinnBeth Burnside

Michael MollaMichael WaddellIrene OngJesse DavisSoumya RayJo HardinJohn CrowleyFenghuang ZhanEric Lantz

Page 132: Machine Learning for  High-Throughput Biological Data

If Time Permits… some of my group’s directions in the area

Clinical Data (with Jesse Davis, Beth Burnside, M.D.)Addressing another problem with current approaches to biological network learning (with Soumya Ray, Eric Lantz)

Page 133: Machine Learning for  High-Throughput Biological Data

Using Machine Learning with Clinical Histories: Example

We’ll use example of Mammography to show some issues that ariseThese issues arise here even with just one relational tableThese issues are even more pronounced with data in multiple tables

Page 134: Machine Learning for  High-Throughput Biological Data

Supervised Learning TaskGiven: a database of mammogram abnormalities for different patients (same cell type from every patient) Radiologist-entered values describing the abnormality constitute the features, and abnormality’s biopsy result as benign or malignant constitutes the class

Do: Learn a model that accurately predicts class based on features

Page 135: Machine Learning for  High-Throughput Biological Data

P1 1 5/02 Spic 0.03 RU4 B P1 2 5/04 Var 0.04 RU4 M P1 3 5/04 Spic 0.04 LL3 B … … … … … … …

Patient Abnormality Date Mass Shape … Mass Size Loc Be/Mal

Mammography Database[Davis et al, 2005; Burnside et al, 2005]

Page 136: Machine Learning for  High-Throughput Biological Data

Original Expert Structure

Page 137: Machine Learning for  High-Throughput Biological Data

Level 1: Parameters

Be/Mal

Shape Size

Given: Features (node labels, or fields in database), Data, Bayes net structureLearn: Probabilities. Note: probabilities needed are Pr(Be/Mal), Pr(Shape|Be/Mal), Pr (Size|Be/Mal)

Page 138: Machine Learning for  High-Throughput Biological Data

Level 2: Structure

Be/Mal

Shape Size

Given: Features, Data Learn: Bayes net structure and probabilities. Note: with this structure, now will need Pr(Size|Shape,Be/Mal) instead of Pr(Size|Be/Mal).

Page 139: Machine Learning for  High-Throughput Biological Data

P1 1 5/02 Spic 0.03 RU4 B P1 2 5/04 Var 0.04 RU4 M P1 3 5/04 Spic 0.04 LL3 B … … … … … … …

Patient Abnormality Date Mass Shape … Mass Size Loc Be/Mal

Mammography Database

Page 140: Machine Learning for  High-Throughput Biological Data

P1 1 5/02 Spic 0.03 RU4 B P1 2 5/04 Var 0.04 RU4 M P1 3 5/04 Spic 0.04 LL3 B … … … … … … …

Patient Abnormality Date Mass Shape … Mass Size Loc Be/Mal

Mammography Database

Page 141: Machine Learning for  High-Throughput Biological Data

P1 1 5/02 Spic 0.03 RU4 B P1 2 5/04 Var 0.04 RU4 M P1 3 5/04 Spic 0.04 LL3 B … … … … … … …

Patient Abnormality Date Mass Shape … Mass Size Loc Be/Mal

Mammography Database

Page 142: Machine Learning for  High-Throughput Biological Data

Level 3: Aggregates

Given: Features, Data, Background knowledge – aggregation functions such as average, mode, max, etc. Learn: Useful aggregate features, Bayes net structure that uses these features, and probabilities. New features may use other rows/tables.

Be/Mal

Shape Size

Avg size this date

Page 143: Machine Learning for  High-Throughput Biological Data

P1 1 5/02 Spic 0.03 RU4 B P1 2 5/04 Var 0.04 RU4 M P1 3 5/04 Spic 0.04 LL3 B … … … … … … …

Patient Abnormality Date Mass Shape … Mass Size Loc Be/Mal

Mammography Database

Page 144: Machine Learning for  High-Throughput Biological Data

P1 1 5/02 Spic 0.03 RU4 B P1 2 5/04 Var 0.04 RU4 M P1 3 5/04 Spic 0.04 LL3 B … … … … … … …

Patient Abnormality Date Mass Shape … Mass Size Loc Be/Mal

Mammography Database

Page 145: Machine Learning for  High-Throughput Biological Data

P1 1 5/02 Spic 0.03 RU4 B P1 2 5/04 Var 0.04 RU4 M P1 3 5/04 Spic 0.04 LL3 B … … … … … … …

Patient Abnormality Date Mass Shape … Mass Size Loc Be/Mal

Mammography Database

Page 146: Machine Learning for  High-Throughput Biological Data

Level 4: View LearningGiven: Features, Data, Background knowledge – aggregation functions and intensionally-defined relations such as “increase” or “same location”Learn: Useful new features defined by views (equivalent to rules or SQL queries), Bayes net structure, and probabilities.

Be/Mal

Shape Size

Avg size this date

Shape change in abnormality at this location

Increase in average size of abnormalities

Page 147: Machine Learning for  High-Throughput Biological Data

Example of Learned Ruleis_malignant(A) IF:

'BIRADS_category'(A,b5), 'MassPAO'(A,present), 'MassesDensity'(A,high), 'HO_BreastCA'(A,hxDCorLC), in_same_mammogram(A,B), 'Calc_Pleomorphic'(B,notPresent), 'Calc_Punctate'(B,notPresent).

Page 148: Machine Learning for  High-Throughput Biological Data

ROC: Level 2 (TAN) vs. Level 1

Page 149: Machine Learning for  High-Throughput Biological Data

Precision-Recall Curves

Page 150: Machine Learning for  High-Throughput Biological Data

All Levels of Learning

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

Recall

Prec

isio

nLevel 4 (View)

Level 3 (Aggregate)

Level 2 (Structure)

Level 1 (Parameter)

Page 151: Machine Learning for  High-Throughput Biological Data

SAYU-ViewImproved View Learning approachSAYU: Score As You UseFor each candidate rule, add it to the Bayesian network and see if it improves the network’s scoreOnly add a rule (new field for the view) if it improves the Bayes net

Page 152: Machine Learning for  High-Throughput Biological Data

Relational Learning Algorithms

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

Recall

Prec

isio

n

SAYU-View

Initial Level 4 (View)

Level 3 (Aggregates)

Page 153: Machine Learning for  High-Throughput Biological Data

Average Area Under PR Curve For Recall >= 0.5

Level 3

Initial Level 4

SAYU-View

0.00

0.02

0.04

0.06

0.08

0.10

0.12

AURPR

Page 154: Machine Learning for  High-Throughput Biological Data

Clinical Databases of the Future (Dramatically Simplified)

PatientID Gender Birthdate

P1 M 3/22/63

PatientID Date Physician Symptoms Diagnosis

P1 1/1/01 Smith palpitations hypoglycemic P1 2/1/03 Jones fever, aches influenza

PatientID Date Lab Test Result

P1 1/1/01 blood glucose 42 P1 1/9/01 blood glucose 45

PatientID SNP1 SNP2 … SNP500K

P1 AA AB BB P2 AB BB AA

PatientID Date Prescribed Date Filled Physician Medication Dose Duration

P1 5/17/98 5/18/98 Jones prilosec 10mg 3 months

Page 155: Machine Learning for  High-Throughput Biological Data

Another Problem with Current Learning of Regulatory Models

Current techniques all use greedy heuristicBayes net learning algorithms use “sparse candidate” approach: to be considered as a parent of gene 1, another gene 2 must be correlated with gene 1CPDs often represented as trees: use greedy tree learning algorithmsAll can fall prey to functions such as exclusive-or – do these arise?

Page 156: Machine Learning for  High-Throughput Biological Data

Skewing Example [Page & Ray, 2003; Ray & Page, 2004; Rosell et al., 2005; Ray & Page, 2005]

Drosophila survival based on gender and Sxl gene activity

Female SxL Gene Gene3 … Gene100 Survival

0 00…00000000…0000001 …1…1111111

0

0 10…00000000…0000001 …1…1111111

1

1 00…00000000…0000001 …1…1111111

1

1 10…00000000…0000001 …1…1111111

0

Page 157: Machine Learning for  High-Throughput Biological Data

Hard FunctionsOur Definition: those functions for which no attribute has “gain” according to standard purity measures (GINI, Entropy) NOTE: “Hard” does not refer to size

of representation Example: n-variable odd parity

Many others

Page 158: Machine Learning for  High-Throughput Biological Data

Learning Hard Functions

Standard method of learning hard functions (e.g. with decision trees): depth-k Lookahead O(mn2k+1-1) for m examples in n variables

We devise a technique that allows learning algorithms to efficiently learn hard functions

Page 159: Machine Learning for  High-Throughput Biological Data

Key IdeaHard functions are not hard for all data distributionsWe can skew the input distribution to simulate a different one By randomly choosing “preferred

values” for attributesAccumulate evidence over several skews to select a split attribute

Page 160: Machine Learning for  High-Throughput Biological Data

Example: Uniform Distribution

0)(

25.0)1;(25.0)0;(

25.0)(

i

i

i

xGAIN

xfGINIxfGINI

fGINI

x1 x2 x3 f

0 0 0 01

0 1 0 11

1 0 0 11

1 1 0 01

Page 161: Machine Learning for  High-Throughput Biological Data

Example: Skewed Distribution(“Sequential Skewing”, Ray & Page, ICML 2004)

x1 x2 x3 f Weight

0 0 0 01

0 1 0 11

1 0 0 11

1 1 0 01

41

43

25.0)1;(25.0)1;(19.0)1;(

25.0)(

3

2

1

xfGINIxfGINIxfGINI

fGINI

43

43

43

41

41

41

0)(0)(0)(

3

2

1

xGAINxGAINxGAIN