Machine Learning for High-Throughput Biological Data
These notes were originally from KDD2006 tutorial notes, written by David Page, Dept. of Biostatistics and Medical Informatics and Dept. of Computer Sciences, University of Wisconsin-Madison. http://www.biostat.wisc.edu/~page/PageKDD2006.ppt
Some Data Types We’ll Discuss
Gene expression microarrays
Single-nucleotide polymorphisms (SNPs)
Mass spectrometry proteomics and metabolomics
Protein-protein interactions (from co-immunoprecipitation)
High-throughput screening of potential drug molecules
How Microarrays Work

(figure: a labeled sample of RNA hybridizes to DNA probes on the gene chip surface; image from the DOE Human Genome Program, http://www.ornl.gov/hgmis)
Two Views of Microarray Data
View 1: data points are genes, represented by their expression levels across different samples (i.e., features = samples). Goal: categorize new genes.
View 2: data points are samples (e.g., patients), represented by the expression levels of different genes (i.e., features = genes). Goal: categorize new samples.
Two Ways to View The Data
Person      A28202_ac   AB00014_at   AB00015_at   ...
Person 1    1142.0      321.0        2567.2       ...
Person 2    586.3       586.1        759.0        ...
Person 3    105.2       559.3        3210.7       ...
Person 4    42.8        692.1        812.0        ...
...
Data Points are Genes
Data Points are Samples
Supervision: Add Class Values
Person      A28202_ac   AB00014_at   AB00015_at   ...   Class
Person 1    1142.0      321.0        2567.2       ...   normal
Person 2    586.3       586.1        759.0        ...   cancer
Person 3    105.2       559.3        3210.7       ...   normal
Person 4    42.8        692.1        812.0        ...   cancer
...
Supervised Learning Task
Given: a set of microarray experiments, each done with mRNA from a different patient (same cell type from every patient). The patient's expression values for each gene constitute the features, and the patient's disease constitutes the class.
Do: Learn a model that accurately predicts class based on features
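As a minimal sketch of this task (an assumption for illustration only; the studies discussed below used weighted voting, SVMs, trees, and Bayes nets), a toy nearest-centroid classifier can be run with leave-one-out cross-validation on a small patients-by-genes matrix shaped like the table above:

```python
import math

# Toy patients-by-genes matrix echoing the table above; labels are the class.
X = [[1142.0, 321.0, 2567.2],
     [586.3, 586.1, 759.0],
     [105.2, 559.3, 3210.7],
     [42.8, 692.1, 812.0]]
y = ["normal", "cancer", "normal", "cancer"]

def centroid(rows):
    # Per-gene mean expression over a set of patients.
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(len(rows[0]))]

def predict(x, centroids):
    # Assign x to the class whose centroid is nearest in Euclidean distance.
    def dist(a, b):
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    return min(centroids, key=lambda c: dist(x, centroids[c]))

# Leave-one-out cross-validation: hold out each patient in turn.
correct = 0
for i in range(len(X)):
    train_X = X[:i] + X[i + 1:]
    train_y = y[:i] + y[i + 1:]
    cents = {c: centroid([x for x, lab in zip(train_X, train_y) if lab == c])
             for c in set(train_y)}
    correct += predict(X[i], cents) == y[i]
accuracy = correct / len(X)
print(accuracy)
```

Note that everything (here just the centroids) is re-estimated inside each fold, a point the tutorial returns to later.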
Location in Task Space: data points are samples; the task is supervised data mining. Predict the class value for a patient based on the expression levels for his/her genes.
Leukemia (Golub et al., 1999)
Classes: acute lymphoblastic leukemia (ALL) vs. acute myeloid leukemia (AML)
Approach: weighted voting (essentially naive Bayes)
Cross-validated accuracy: of 34 samples, declined to predict 5, correct on the other 29
Cancer vs. Normal
Relatively easy to predict accurately, because so much goes "haywire" in cancer cells
Primary barrier is noise in the data: impure RNA, cross-hybridization, etc.
Studies include breast, colon, prostate, lymphoma, and multiple myeloma
X-Val Accuracies for Multiple Myeloma (74 MM vs. 31 Normal)

Method          Accuracy (%)
Trees           98.1
Boosted Trees   99.0
SVMs            100.0
Vote            97.0
Bayes Nets      100.0
More MM (300), Benign Condition MGUS (Hardin et al., 2004)
ROC Curves: Cancer vs. Normal
ROC: Cancer vs. Benign (MGUS)
Work by Statisticians Outside of Standard Classification/Clustering
Methods to better convert Affymetrix's low-level intensity measurements into expression levels: e.g., work by Speed, Wong, Irizarry
Methods to find differentially expressed genes between two samples: e.g., work by Newton and Kendziorski
But the following is most related...
Ranking Genes by Significance
Some biologists don't want one predictive model, but a rank-ordered list of genes to explore further (with estimated significance).
For each gene we have a set of expression levels under our conditions, say cancer vs. normal.
We can do a t-test to see if the mean expression levels differ between the two conditions, yielding a p-value.
Multiple comparisons problem: if we repeat this test for 30,000 genes, some will pop up as significant just by chance alone.
We could do a Bonferroni correction (multiply p-values by 30,000), but this is drastic and might eliminate all genes.
False Discovery Rate (FDR) [Storey and Tibshirani, 2001]
Addresses multiple comparisons but is less extreme than Bonferroni.
Replaces the p-value by a q-value: the fraction of genes with this p-value or lower that really don't have different means in the two classes (false discoveries).
Publicly available in R as part of the Bioconductor package.
Recommendation: use this in addition to your supervised data mining; your collaborators will want to see it.
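A simpler relative of this idea is the Benjamini-Hochberg step-up procedure, which this sketch implements; unlike Storey's q-value it assumes all genes are null (pi0 = 1), so it is slightly more conservative:

```python
def bh_qvalues(pvals):
    # Benjamini-Hochberg step-up q-values: p * m / rank, made monotone
    # by walking from the largest p-value down.
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    q = [0.0] * m
    prev = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        prev = min(prev, pvals[i] * m / rank)
        q[i] = prev
    return q

qs = bh_qvalues([0.0001, 0.01, 0.03, 0.5])
print(qs)  # every q-value is >= its p-value and <= 1
```

Thresholding the q-values at, say, 0.05 then controls the expected fraction of false discoveries among the reported genes at 5%.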
FDR Highlights Difficulties Getting Insight into Cancer vs. Normal
Using Benign Condition Instead of Normal Helps Somewhat
Question to Anticipate
You've run a supervised data mining algorithm on your collaborator's data, and you present an estimate of accuracy or an ROC curve (from cross-validation). How did you adjust this for the multiple comparisons problem?
Answer: you don't need to, because you commit to a single predictive model before ever looking at the test data for a fold; this is only one comparison.
Prognosis and Treatment
Features are the same as for diagnosis. Rather than disease state, the class value becomes life expectancy with a given treatment (or positive response vs. no response to a given treatment).
Breast Cancer Prognosis (Van’t Veer et al., 2002)
Classes: good prognosis (no metastasis within five years of initial diagnosis) vs. poor prognosis
Algorithm: ensemble of voters
Results: 83% cross-validated accuracy on 78 cases
A Lesson
Previous work selected the features to use in the ensemble by looking at the entire data set; feature selection should have been repeated on each cross-validation fold. The authors also chose the ensemble size by seeing which size gave the highest cross-validation result. The authors corrected this in a web supplement; accuracy went from 83% to 73%. Remember to "tune parameters" separately for each cross-validation fold!
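The lesson can be sketched as follows; the top-k selection by class-mean difference is a hypothetical stand-in for whatever feature selector is actually used, and the point is only that it runs inside each fold, never on the full data set:

```python
def select_top_genes(X, y, k):
    # Rank genes by absolute difference of class means; keep the top k.
    # Must be called on TRAINING data only, inside each fold.
    classes = sorted(set(y))
    def mean(vals):
        return sum(vals) / len(vals)
    scores = []
    for j in range(len(X[0])):
        m0 = mean([x[j] for x, lab in zip(X, y) if lab == classes[0]])
        m1 = mean([x[j] for x, lab in zip(X, y) if lab == classes[1]])
        scores.append((abs(m0 - m1), j))
    return [j for _, j in sorted(scores, reverse=True)[:k]]

# Correct protocol: repeat selection on every leave-one-out fold.
X = [[5.0, 1.0, 9.0], [4.8, 7.0, 1.2], [5.1, 1.2, 8.8], [5.2, 6.8, 1.0]]
y = ["a", "b", "a", "b"]
for i in range(len(X)):
    train_X = X[:i] + X[i + 1:]
    train_y = y[:i] + y[i + 1:]
    genes = select_top_genes(train_X, train_y, k=2)  # no peeking at X[i]
    test_features = [X[i][j] for j in genes]
```

Selecting `genes` once, outside the loop, is exactly the leak that inflated 73% to 83% above.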
Prognosis with Specific Therapy (Rosenwald et al., 2002)
Data set contains gene-expression patterns for 160 patients with diffuse large B-cell lymphoma receiving anthracycline chemotherapy.
Class label is five-year survival. One 80/80 test-train split.
True positive rate: 60%; false negative rate: 39%.
Some Future Directions
Using gene-chip data to select therapy: predict which therapy gives the best prognosis for a patient.
Combining gene expression data with clinical data such as lab results, medical and family history: multiple relational tables, may benefit from relational learning.
Unsupervised Learning Task
Given: a set of microarray experiments under different conditions
Do: cluster the genes, where a gene is described by its expression levels in the different experiments
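A minimal sketch of this task, assuming a greedy single-link grouping by Pearson correlation (an illustrative choice; real work uses k-means, hierarchical, or model-based clustering):

```python
import math

def pearson(a, b):
    # Pearson correlation between two expression profiles.
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = math.sqrt(sum((x - ma) ** 2 for x in a))
    vb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (va * vb)

def cluster_genes(profiles, threshold=0.9):
    # Greedy single-link clustering: a gene joins the first cluster
    # containing a gene it correlates with above the threshold.
    clusters = []
    for g, prof in enumerate(profiles):
        for c in clusters:
            if any(pearson(prof, profiles[h]) > threshold for h in c):
                c.append(g)
                break
        else:
            clusters.append([g])
    return clusters

# Hypothetical profiles: expression across four conditions.
profiles = [[1, 2, 3, 4],   # goes up
            [2, 4, 6, 8],   # goes up (same shape)
            [9, 6, 4, 1],   # goes down
            [8, 6, 4, 2]]   # goes down
print(cluster_genes(profiles))
```

Correlation (rather than Euclidean distance) captures "goes up or down together" regardless of absolute expression level.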
Location in Task Space: data points are genes; the task is clustering. Group genes into clusters, where all members of a cluster tend to go up or down together.
Example (green = up-regulated, red = down-regulated)
(figure: heatmap with genes as rows and experiments/samples as columns)
Visualizing Gene Clusters (e.g., Sharan and Shamir, 2000)

(figure: normalized expression vs. time in 10-minute intervals; Gene Cluster 1, size = 20, and Gene Cluster 2, size = 43)
Unsupervised Learning Task 2
Given: a set of microarray experiments (samples) corresponding to different conditions or patients
Do: cluster the experiments
Location in Task Space: data points are samples; the task is clustering. Group samples by gene expression profile.
Examples
Cluster samples from mice subjected to a variety of toxic compounds (Thomas et al., 2001)
Cluster samples from cancer patients, potentially to discover different subtypes of a cancer
Cluster samples taken at different time points
Some Biological Pathways
Regulatory pathways: nodes are labeled by genes; arcs denote influence on transcription (e.g., G1 codes for P1, and P1 inhibits G2's transcription)
Metabolic pathways: nodes are metabolites and large biomolecules (e.g., sugars, lipids, proteins, and modified proteins); arcs run from biochemical reaction inputs to outputs and are labeled by enzymes (themselves proteins)
Metabolic Pathway Example: the Krebs cycle (TCA cycle, citric acid cycle)

(figure: fumarate -> malate -> oxaloacetate -> citrate -> cis-aconitate -> isocitrate -> alpha-ketoglutarate -> succinyl-CoA -> succinate -> fumarate; enzymes: fumarase, MDH, citrate synthase, aconitase, IDH, alpha-KGDH, succinate thiokinase; cofactors: FAD/FADH2, NAD+/NADH, GDP + Pi/GTP, HSCoA, acetyl-CoA, H2O, CO2)
Regulatory Pathway (KEGG)
Using Microarray Data Only
Supervised Learning Task 2
Given: a set of microarray experiments for the same organism under different conditions
Do: learn a graphical model that accurately predicts the expression of some genes in terms of others
Some Approaches to Learning Regulatory Networks
Bayes net learning (started with Friedman & Halpern, 1999; we'll see more)
Boolean networks (Akutsu, Kuhara, Maruyama & Miyano, 1998; Ideker, Thorsson & Karp, 2002)
Related graphical approaches (Tanay & Shamir, 2001; Chrisman, Langley, Baay & Pohorille, 2003)
Bayesian Network (BN)

Note: the direction of an arrow indicates dependence, not causality.

(figure: parent node geneA with child node geneB; the CPTs give P(geneA) and P(geneB | geneA), estimated from the data for Expt1-Expt4)
Problem: Not Causality

A -> B: A is a good predictor of B. But is A regulating B?
Ground truth might instead be B -> A, or a common cause C with arcs C -> A and C -> B, or a more complicated variant.
Approaches to Get Causality
Use "knock-outs" (Pe'er, Regev, Elidan and Friedman, 2001); but these are not available in most organisms.
Use time-series data and Dynamic Bayesian Networks (Ong, Glasner and Page, 2002); but typically even less data.
Use other data sources, e.g., sequences upstream of genes, where transcription regulators may bind (Segal, Barash, Simon, Friedman and Koller, 2002; Noto and Craven, 2005).
A Dynamic Bayes Net

(figure: genes 1 through N at one time step, with arcs to genes 1 through N at the next time step)
Problem: Not Enough Data Points to Construct Large Network
We are fortunate to get 100s of chips, but have 1000s of genes (E. coli: ~4,000; yeast: ~6,000; human: ~30,000). We want to learn a causal graphical model over 1000s of variables with only 100s of examples (settings of the variables).
Advance: Module Networks [Segal, Pe’er, Regev, Koller & Friedman, 2005]
Cluster genes by similarity over expression experiments.
All genes in a cluster are "tied together": same parents and CPDs.
Learn structure subject to this tying together of genes.
Iteratively re-form clusters and re-learn the network, in an EM-like fashion.
Problem: Data are Continuous but Models are Discrete
Gene chips provide a real-valued mRNA measurement.
Boolean networks and most practical Bayes net learning algorithms assume discrete variables.
We may lose valuable information by discretizing.
Advance: Use of Dynamic Bayes Nets with Continuous Variables [Segal, Pe’er, Regev, Koller & Friedman, 2005]
Expression measurements are used as-is instead of discretized (up, down, same).
Assume linear influence of parents on children (Michaelis-Menten assumption).
Work so far constructed the network from the literature and learned the parameters.
Problem: Much Missing Information
mRNA from gene 1 doesn't directly alter the level of mRNA from gene 2.
Rather, the protein product of gene 1 may alter the level of mRNA from gene 2 (e.g., a transcription factor).
Activation of a transcription factor might not occur by making more of it, but just by phosphorylating it (post-translational modification).
Example: Transcription Regulation

(figure: operons on DNA, each with promoter P, operator O, and terminator T; one operon contains geneA, geneB, geneC, and another contains the regulator geneR; mRNA is transcribed from each operon, and the regulator protein R binds an operator)
Approach: Measure More Stuff
Mass spectrometry (later) can measure protein rather than mRNA, but doesn't measure all proteins and is not very quantitative (presence/absence).
2D gels can measure post-translational modifications, but are still low-throughput because of current analysis.
Co-immunoprecipitation (later) and yeast two-hybrid screens can measure protein interactions, but are noisy.
Another Way Around Limitations
Identify a smaller part of the task that is a step toward a full regulatory pathway: part of a pathway, or classes or groups of genes.
Examples:
Chromatin remodelers
Predicting the operons in E. coli
Chromatin Remodelers and Nucleosome [Segal et al. 2006]
The previous DNA picture is oversimplified: the DNA double-helix is wrapped in further complex structure, and DNA is accessible only if part of this structure is unwound.
Can we predict which chromatin remodelers act on which parts of DNA, and also what activates a remodeler?
The E. coli Genome

Finding Operons in E. coli (Craven, Page, Shavlik, Bockhorst and Glasner, 2000)
Given: known operons and other E. coli data
Do: predict all operons in E. coli
Additional sources of information: gene-expression data, functional annotation
(figure: an operon with promoter, genes g1, g2, g3, and terminator, followed by genes g4 and g5)
Comparing Naive Bayes and Decision Trees (C5.0)
Using Only Individual Features
Single-Nucleotide Polymorphisms
SNPs: individual positions in DNA where variation is common
Roughly 2 million known SNPs in humans
The new Affymetrix whole-genome scan measures 500,000 of these
Easier/faster/cheaper to measure SNPs than to completely sequence everyone
Motivation ...
If We Sequenced Everyone...

(figure: compare the genomes of people susceptible to disease D, or responding to treatment T, against those not susceptible or not responding)
Example of SNP Data
Person      SNP1   SNP2   SNP3   ...   CLASS
Person 1    CT     AG     TT     ...   old
Person 2    CC     AG     CT     ...   young
Person 3    TT     AA     CC     ...   old
Person 4    CT     GG     TT     ...   young
...
Phasing (Haplotyping)
Advantages of SNP Data
A person's SNP pattern does not change with time or disease, so it can give more insight into susceptibility.
Easier to collect samples (can simply use blood rather than affected tissue).
Challenges of SNP Data
Unphased: algorithms exist for phasing (haplotyping), but they make errors and typically need related individuals and dense coverage.
Missing values are more common than in microarray data (though improving substantially, down to around 1-2% now).
Many more measurements: for example, the Affymetrix human SNP chip measures half a million SNPs.
Supervised Learning Task
Given: a set of SNP profiles, each from a different patient.
Phased: the nucleotides at each SNP position on each copy of each chromosome constitute the features, and the patient's disease constitutes the class.
Unphased: the unordered pair of nucleotides at each SNP position constitutes the features, and the patient's disease constitutes the class.
Do: Learn a model that accurately predicts class based on features
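For the unphased case, a common encoding (assumed here for illustration; the reference alleles are hypothetical) maps each unordered pair to a count of non-reference alleles, giving a numeric feature per SNP:

```python
def encode_genotypes(snp_rows, ref_alleles):
    # Encode unphased genotypes as non-reference allele counts (0, 1, or 2).
    # Each genotype is an unordered pair like "CT"; ref_alleles gives the
    # assumed reference (major) allele per SNP position.
    encoded = []
    for row in snp_rows:
        feats = []
        for geno, ref in zip(row, ref_alleles):
            feats.append(sum(1 for allele in geno if allele != ref))
        encoded.append(feats)
    return encoded

# Hypothetical unphased data: three SNPs per person, as in the table above.
rows = [["CT", "AG", "TT"],
        ["CC", "AG", "CT"],
        ["TT", "AA", "CC"]]
print(encode_genotypes(rows, ref_alleles=["C", "A", "T"]))
```

The resulting 0/1/2 matrix can be fed to the same SVM or naive Bayes learners discussed below.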
Waddell et al., 2005
Multiple myeloma, young (susceptible) vs. old (less susceptible), 3,000 SNPs, best at 64% accuracy (training).
SVM with feature selection (repeated on every fold of cross-validation): 72% accuracy; also naive Bayes. Significantly better than chance.
                 Predicted
Actual           Old    Young
Old              31     9
Young            14     26
Listgarten et al., 2005
SVMs from SNP data predict lung cancer susceptibility at 69% accuracy; naive Bayes gives similar performance.
Best single SNP at less than 60% accuracy (training).
Lessons
Supervised data mining algorithms can predict disease susceptibility at rates better than chance and better than individual SNPs.
Accuracies are much lower than we see with microarray data, because we're predicting who will get the disease, not who already has it.
Future Directions
Pharmacogenetics: predicting drug response from a SNP profile (drug efficacy, adverse reactions)
Combining SNP data with other data types, such as clinical (history, lab tests) and microarray
Proteomics
Microarrays are useful primarily because mRNA concentrations serve as a surrogate for protein concentrations.
We would like to measure protein concentrations directly, but at present cannot do so in the same high-throughput manner.
Proteins do not have obvious direct complements; we could build molecules that bind, but binding is greatly affected by protein structure.
Time-of-Flight (TOF) Mass Spectrometry (thanks to Sean McIlwain)

(figure: laser, sample plate at +V, detector)
Measures the time for an ionized particle, starting from the sample plate, to hit the detector.
Time-of-Flight (TOF) Mass Spectrometry 2

Matrix-Assisted Laser Desorption/Ionization (MALDI): crystalloid structures are made using a proton-rich matrix molecule; hitting the crystalloid with the laser causes molecules to ionize and "fly" toward the detector.
Time-of-Flight Demonstration

0. Sample plate.
1. Matrix molecules are added.
2. Protein molecules are embedded in the matrix.
3. Laser and detector in place; +10 kV positive charge under the sample plate.
4. Laser pulsed directly onto the sample; a proton is kicked off a matrix molecule onto another molecule.
5. Lots of protons are kicked off matrix ions, giving rise to more positively charged molecules.
6. The high positive potential under the sample plate causes positively charged molecules to accelerate toward the detector.
7. Smaller-mass molecules hit the detector first, while heavier ones are detected later.
8. The flight time is measured from when the laser is pulsed until the molecule hits the detector.
9. The experiment is repeated a number of times, counting frequencies of "flight times".
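The physics behind the demonstration can be sketched directly: an ion of charge z accelerated through voltage V gains kinetic energy z*e*V = (1/2)*m*v^2, so flight time grows with the square root of mass. The tube length and voltage below are hypothetical values for illustration:

```python
import math

E_CHARGE = 1.602176634e-19  # elementary charge, coulombs

def flight_time(mass_da, charge, accel_voltage, drift_length):
    # Idealized TOF: drift time over a field-free tube of length L is
    # t = L * sqrt(m / (2 * z * e * V)).
    mass_kg = mass_da * 1.66053906660e-27  # daltons -> kg
    v = math.sqrt(2 * charge * E_CHARGE * accel_voltage / mass_kg)
    return drift_length / v

# A 1000 Da singly charged peptide vs. a 4000 Da one, 10 kV, 1 m tube:
t1 = flight_time(1000, 1, 10_000, 1.0)
t2 = flight_time(4000, 1, 10_000, 1.0)
print(t1, t2)  # the 4x heavier ion takes sqrt(4) = 2x as long
```

This is why smaller-mass molecules hit the detector first (step 7 above), and why the recorded flight times can be converted back to M/Z values.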
Example Spectra from a Competition by Lin et al. at Duke

(figure: intensity vs. M/Z; these are different fractions from the same sample)

Trypsin-Treated Spectra

(figure: frequency vs. M/Z)
Many Challenges Raised by Mass Spectrometry Data
Noise: extra peaks from handling of the sample, from the machine and environment (electrical noise), etc.
M/Z values may not align exactly across spectra (resolution ~0.1%).
Intensities are not calibrated across spectra, so quantification is difficult.
Cannot get all proteins: typically only several hundred. To improve the odds of getting the ones we want, we may fractionate the sample by 2D gel electrophoresis or liquid chromatography.
Challenges (Continued)
Better results if we partially digest the proteins (break them into smaller peptides) first.
It can be difficult to determine what proteins we have from a spectrum.
Isotopic peaks: C-13 and N-15 atoms in varying numbers cause multiple peaks for a single peptide.
Handling Noise: Peak Picking
We want to pick peaks that are statistically significant above the noise signal, and use these as features in our learning algorithms.
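A minimal peak-picking sketch, assuming a noise floor of median + SNR * MAD (an illustrative rule; real peak pickers model baseline drift and peak shape far more carefully):

```python
def pick_peaks(intensities, snr=3.0):
    # Flag local maxima whose height exceeds a noise floor estimated
    # as median + snr * MAD (median absolute deviation).
    s = sorted(intensities)
    median = s[len(s) // 2]
    mad = sorted(abs(x - median) for x in intensities)[len(s) // 2]
    floor = median + snr * mad
    peaks = []
    for i in range(1, len(intensities) - 1):
        if (intensities[i] > floor
                and intensities[i] > intensities[i - 1]
                and intensities[i] >= intensities[i + 1]):
            peaks.append(i)
    return peaks

# Toy spectrum: baseline noise with two clear peaks at indices 4 and 9.
spectrum = [1, 3, 2, 1, 50, 2, 3, 1, 2, 30, 1, 3, 2]
print(pick_peaks(spectrum))
```

The indices returned (positions along the M/Z axis) would then serve as the features fed to a learner.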
Many Supervised Learning Tasks
Learn to predict proteins from spectra, when the organism's proteome is known
Learn to identify isotopic distributions
Learn to predict disease from proteins, peaks, or isotopic distributions as features
Construct pathway models
Using Mass Spectrometry for Early Detection of Ovarian Cancer [Petricoin et al., 2002]
Ovarian cancer is difficult to detect early, often leading to poor prognosis.
Trained and tested on mass spectra from blood serum: 100 training cases, 50 with cancer; held-out test set of 116 cases, 50 with cancer.
100% sensitivity, 95% specificity (63/66) on the held-out test set.
Not So Fast
The data mining methodology seems sound, but Keith Baggerly argues that cancer samples were handled differently than normal samples, and perhaps the data were preprocessed differently too.
If we run cancer samples Monday and normals Wednesday, we could get differences from a machine breakdown or from nearby electrical equipment that's running on Monday but not Wednesday.
Lesson: tell collaborators they must randomize samples for the entire processing phase, and of course all our preprocessing must be the same.
The debate is still raging; the results have not been replicated in trials.
Other Proteomics: 3D Structures
Other Proteomics: Interactions
Figure from Ideker et al., Science 292(5518):929-934, 2001
Each node represents a gene product (protein); blue edges show direct protein-protein interactions; yellow edges show interactions in which one protein binds to DNA and affects the expression of another.
Protein-Protein Interactions
Yeast two-hybrid; immunoprecipitation
Antibodies ("immuno") are made by combinatorial combinations of certain proteins. Millions of antibodies can be made, to recognize a wide variety of different antigens ("invaders"), often by recognizing specific proteins.
(figure: an antibody binding its protein)
Protein-Protein Interactions
(figure: immunoprecipitation, in which an antibody pulls down its target protein, and co-immunoprecipitation, in which the target's interaction partners are pulled down with it)
Many Supervised Learning Tasks
Learn to predict protein-protein interactions: protein 3D structures may be critical
Use protein-protein interactions in the construction of pathway models
Learn to predict protein function from interaction data
ChIP-Chip Data
Immunoprecipitation can also be done to identify proteins interacting with DNA rather than with other proteins.
Chromatin immunoprecipitation (ChIP): grab a sample of DNA bound to a particular protein (transcription factor).
ChIP-chip: run this sample of DNA on a microarray to see which DNA was bound.
Example of analysis of such new data: Keles et al., 2006.
Metabolomics
Measures the concentration of each low-molecular-weight molecule in a sample.
These typically are "metabolites": small molecules produced or consumed by reactions in biochemical pathways.
These reactions are typically catalyzed by proteins (specifically, enzymes).
This data is typically also mass spectrometry, though it could also be NMR.
Lipomics
Analogous to metabolomics, but measuring concentrations of lipids rather than metabolites.
Could potentially help induce biochemical pathway information, or help with disease diagnosis or treatment choice.
To Design a Drug:
1. Identify the target protein: requires knowledge of the proteome/genome and the relevant biochemical pathways.
2. Determine the target site structure: crystallography, NMR; difficult if membrane-bound.
3. Synthesize a molecule that will bind: modeling of structure is imperfect, and structures may change at binding.
And even then...
Molecule Binds Target But May:
Bind too tightly or not tightly enough.
Be toxic.
Have other effects (side-effects) in the body.
Break down as soon as it gets into the body, or not leave the body soon enough.
Not get to where it should in the body (e.g., crossing the blood-brain barrier).
Not diffuse from the gut to the bloodstream.
And Every Body is Different:
Even if a molecule works in the test tube and in animal studies, it may not work in people (it will fail in clinical trials).
A molecule may work for some people but not others.
A molecule may cause harmful side-effects in some people but not others.
Typical Practice when Target Structure is Unknown
High-throughput screening (HTS): test many molecules (~1,000,000) to find some that bind to the target (ligands).
Infer (induce) the shape of the target site from 3D structural similarities.
The shared 3D substructure is called a pharmacophore.
A perfect example of a machine learning task with a spatial target.
An Example of Structure Learning

(figure: 3D conformations of active vs. inactive molecules)
Common Data Mining Approaches
Represent a molecule by thousands to millions of features and use standard techniques (e.g., KDD Cup 2001)
Represent each low-energy conformer by a feature vector and use multiple-instance learning (e.g., Jain et al., 1998)
Relational learning: inductive logic programming (e.g., Finn et al., 1998) and graph mining
Supervised Learning Task
Given: a set of molecules, each labeled by activity (binding affinity for the target protein), and a set of low-energy conformers for each molecule
Do: Learn a model that accurately predicts activity (may be Boolean or real-valued)
ILP as Illustration: The Logical Representation of a Pharmacophore
active(X) if:
    has-conformation(X,Conf),
    has-hydrophobic(X,A),
    has-hydrophobic(X,B),
    distance(X,Conf,A,B,3.0,1.0),
    has-acceptor(X,C),
    distance(X,Conf,A,C,4.0,1.0),
    distance(X,Conf,B,C,5.0,1.0).
This logical clause states that a molecule X is active if it has some conformation Conf, hydrophobic groups A and B, and a hydrogen acceptor C such that the following holds: in conformation Conf of molecule X, the distance between A and B is 3 Angstroms (plus or minus 1), the distance between A and C is 4, and the distance between B and C is 5.
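The clause can also be read procedurally. This sketch (hypothetical atom typing, brute force over candidate triples) tests whether any assignment of labeled 3D points satisfies the distance constraints within tolerance:

```python
import math
from itertools import combinations

def dist(p, q):
    # Euclidean distance between two 3D points.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def has_pharmacophore(atoms, tol=1.0):
    # Check the three-point pharmacophore from the clause above:
    # hydrophobics A, B and acceptor C with AB=3, AC=4, BC=5 Angstroms,
    # each +/- tol. atoms is a list of (type, (x, y, z)) tuples.
    hydros = [xyz for t, xyz in atoms if t == "hydrophobic"]
    accs = [xyz for t, xyz in atoms if t == "acceptor"]
    for a, b in combinations(hydros, 2):
        for c in accs:
            for p, q in ((a, b), (b, a)):  # A and B are interchangeable
                if (abs(dist(p, q) - 3.0) <= tol
                        and abs(dist(p, c) - 4.0) <= tol
                        and abs(dist(q, c) - 5.0) <= tol):
                    return True
    return False

# Hypothetical conformation: a 3-4-5 right triangle satisfies the clause.
conf = [("hydrophobic", (0.0, 0.0, 0.0)),
        ("hydrophobic", (3.0, 0.0, 0.0)),
        ("acceptor", (0.0, 4.0, 0.0))]
print(has_pharmacophore(conf))  # True
```

In the ILP setting, this matching is done by the logic engine itself, using background knowledge like the distance predicates shown next.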
Background Knowledge IInformation about atoms and bonds in the molecules
atm(m1,a1,o,3,5.915800,-2.441200,1.799700).
atm(m1,a2,c,3,0.574700,-2.773300,0.337600).
atm(m1,a3,s,3,0.408000,-3.511700,-1.314000).

bond(m1,a1,a2,1).
bond(m1,a2,a3,1).
Background knowledge II
Definition of distance equivalence:

dist(Drug,Atom1,Atom2,Dist,Error):-
    number(Error),
    coord(Drug,Atom1,X1,Y1,Z1),
    coord(Drug,Atom2,X2,Y2,Z2),
    euc_dist(p(X1,Y1,Z1),p(X2,Y2,Z2),Dist1),
    Diff is Dist1-Dist,
    absolute_value(Diff,E1),
    E1 =< Error.

euc_dist(p(X1,Y1,Z1),p(X2,Y2,Z2),D):-
    Dsq is (X1-X2)^2+(Y1-Y2)^2+(Z1-Z2)^2,
    D is sqrt(Dsq).
Central Idea: Generalize by Searching a Lattice

Lattice of hypotheses, from general to specific:

active(X)

active(X) if has-hydrophobic(X,A)
active(X) if has-donor(X,A)
active(X) if has-acceptor(X,A)

active(X) if has-hydrophobic(X,A), has-donor(X,B), distance(X,A,B,5.0)
active(X) if has-donor(X,A), has-donor(X,B), distance(X,A,B,4.0)
active(X) if has-acceptor(X,A), has-donor(X,B), distance(X,A,B,6.0)

etc.
Conformational model
Conformational flexibility is modelled as multiple conformations, generated by Sybyl randomsearch or Catalyst.

Pharmacophore description
Atom- and site-centred: hydrogen bond donor, hydrogen bond acceptor, hydrophobe, site points (limited at present), user-definable
Distance-based
Example 1: Dopamine agonists
Agonists taken from the Martin data set on the QSAR Society web pages
Examples (5-50 conformations/molecule)

(figure: 2D structures of the dopamine agonists, with hydroxyl and amine groups)
Pharmacophore identified
Molecule A has the desired activity if:
    in conformation B, molecule A contains a hydrogen acceptor at C, and
    in conformation B, molecule A contains a basic nitrogen group at D, and
    the distance between C and D is 7.05966 +/- 0.75 Angstroms, and
    in conformation B, molecule A contains a hydrogen acceptor at E, and
    the distance between C and E is 2.80871 +/- 0.75 Angstroms, and
    the distance between D and E is 6.36846 +/- 0.75 Angstroms, and
    in conformation B, molecule A contains a hydrophobic group at F, and
    the distance between C and F is 2.68136 +/- 0.75 Angstroms, and
    the distance between D and F is 4.80399 +/- 0.75 Angstroms, and
    the distance between E and F is 2.74602 +/- 0.75 Angstroms.
Example II: ACE inhibitors

28 angiotensin-converting enzyme inhibitors taken from the literature: D. Mayer et al., J. Comput.-Aided Mol. Design, 1, 3-16 (1987).

(figure: 2D structures of several ACE inhibitors)
ACE pharmacophore

Molecule A is an ACE inhibitor if:
    molecule A contains a zinc-site B,
    molecule A contains a hydrogen acceptor C,
    the distance between B and C is 7.899 +/- 0.750 A,
    molecule A contains a hydrogen acceptor D,
    the distance between B and D is 8.475 +/- 0.750 A,
    the distance between C and D is 2.133 +/- 0.750 A,
    molecule A contains a hydrogen acceptor E,
    the distance between B and E is 4.891 +/- 0.750 A,
    the distance between C and E is 3.114 +/- 0.750 A,
    the distance between D and E is 3.753 +/- 0.750 A.
Pharmacophore discovered

Distance   Progol   Mayer
A          4.9      5.0
B          3.8      3.8
C          8.5      8.6

(figure: pharmacophore geometry showing the zinc site and H-bond acceptors, with distances A, B, and C marked)
Additional Finding
The original pharmacophore was rediscovered, plus one other with a different zinc ligand position, similar to an alternative proposed by Ciba-Geigy.
(figure: distances 7.3, 3.9, and 4.0 Angstroms)
Example III: Thermolysin inhibitors
10 inhibitors for which crystallographic data are available in the PDB
Conformationally challenging molecules
Experimentally observed superposition
Key binding site interactions
(figure: inhibitor bound in the active site, showing interactions with Asn112-NH, S2', O=C Asn112, Arg203-NH, S1', O=C Ala113, and Zn)
Interactions made by inhibitors

(table: for each of 1HYT, 1THL, 1TLP, 1TMN, 2TMN, 4TLN, 4TMN, 5TLN, 5TMN, and 6TMN, which of the interactions Asn112-NH, S2', Asn112 C=O, Arg203 NH, S1', Ala113 C=O, and Zn it makes)
Pharmacophore Identification
Structures considered: 1HYT, 1THL, 1TLP, 1TMN, 2TMN, 4TLN, 4TMN, 5TLN, 5TMN, 6TMN
Conformational analysis using "Best" conformer generation in Catalyst; 98-251 conformations/molecule
Thermolysin Results
10 five-point pharmacophores identified, falling into 2 groups (7/10 molecules): 3 "acceptors", 1 hydrophobe, 1 donor; or 4 "acceptors", 1 donor
Common core of Zn ligands, Arg203 and Asn112 interactions identified
Correct assignments of functional groups
Correct geometry to 1 Angstrom tolerance
Increasing the tolerance to 1.5 Angstroms finds a common 6-point pharmacophore including one extra interaction
Example IV: Antibacterial peptides [Spatola et al., 2000]
Dataset of 11 pentapeptides showing activity against Pseudomonas aeruginosa: 6 actives (IC50 < 6 µg/ml) and 5 inactives
Pharmacophore Identified
A molecule M is active against Pseudomonas aeruginosa if it has a conformation B such that:
    M has a hydrophobic group C,
    M has a hydrogen acceptor D,
    the distance between C and D in conformation B is 11.7 Angstroms,
    M has a positively-charged atom E,
    the distance between C and E in conformation B is 4 Angstroms,
    the distance between D and E in conformation B is 9.4 Angstroms,
    M has a positively-charged atom F,
    the distance between C and F in conformation B is 11.1 Angstroms,
    the distance between D and F in conformation B is 12.6 Angstroms,
    the distance between E and F in conformation B is 8.7 Angstroms.
Tolerance: 1.5 Angstroms
Clinical Databases of the Future (Dramatically Simplified)
PatientID   Gender   Birthdate
P1          M        3/22/63

PatientID   Date     Physician   Symptoms       Diagnosis
P1          1/1/01   Smith       palpitations   hypoglycemic
P1          2/1/03   Jones       fever, aches   influenza

PatientID   Date     Lab Test        Result
P1          1/1/01   blood glucose   42
P1          1/9/01   blood glucose   45

PatientID   SNP1   SNP2   ...   SNP500K
P1          AA     AB     ...   BB
P2          AB     BB     ...   AA

PatientID   Date Prescribed   Date Filled   Physician   Medication   Dose   Duration
P1          5/17/98           5/18/98       Jones       prilosec     10mg   3 months
Final Wrap-up
Molecular biology is collecting lots and lots of data in the post-genome era.
There is an opportunity to "connect" molecular-level information to diseases and treatment, and we need analysis tools to interpret it.
Data mining opportunities abound. Hopefully this tutorial provided a solid start toward applying data mining to high-throughput biological data.
Thanks To
Jude Shavlik, John Shaughnessy, Bart Barlogie, Mark Craven, Sean McIlwain, Jan Struyf, Arno Spatola, Paul Finn, Beth Burnside, Michael Molla, Michael Waddell, Irene Ong, Jesse Davis, Soumya Ray, Jo Hardin, John Crowley, Fenghuang Zhan, Eric Lantz
If Time Permits… some of my group’s directions in the area
Clinical data (with Jesse Davis, Beth Burnside, M.D.)
Addressing another problem with current approaches to biological network learning (with Soumya Ray, Eric Lantz)
Using Machine Learning with Clinical Histories: Example
We'll use the example of mammography to show some issues that arise. These issues arise even with just one relational table, and are even more pronounced with data in multiple tables.
Supervised Learning Task
Given: a database of mammogram abnormalities for different patients. The radiologist-entered values describing an abnormality constitute the features, and the abnormality's biopsy result (benign or malignant) constitutes the class.
Do: Learn a model that accurately predicts class based on features
Mammography Database [Davis et al., 2005; Burnside et al., 2005]

Patient   Abnormality   Date   Mass Shape   ...   Mass Size   Loc   Be/Mal
P1        1             5/02   Spic         ...   0.03        RU4   B
P1        2             5/04   Var          ...   0.04        RU4   M
P1        3             5/04   Spic         ...   0.04        LL3   B
...
Original Expert Structure
Level 1: Parameters

(network: Be/Mal is the parent of Shape and of Size)

Given: Features (node labels, or fields in the database), Data, Bayes net structure
Learn: Probabilities
Note: the probabilities needed are Pr(Be/Mal), Pr(Shape | Be/Mal), Pr(Size | Be/Mal)
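Learning these parameters is just counting. A maximum-likelihood sketch over toy rows shaped like the database above (field names abbreviated; Laplace smoothing, used in practice, is omitted here):

```python
from collections import Counter

def learn_cpt(rows, child, parent):
    # Maximum-likelihood CPT Pr(child | parent) from counted data.
    # rows are dicts like {"BeMal": "B", "Shape": "Spic", ...}.
    pair = Counter((r[parent], r[child]) for r in rows)
    par = Counter(r[parent] for r in rows)
    return {(pv, cv): n / par[pv] for (pv, cv), n in pair.items()}

# Toy rows in the spirit of the mammography table above.
rows = [{"BeMal": "B", "Shape": "Spic"},
        {"BeMal": "M", "Shape": "Var"},
        {"BeMal": "B", "Shape": "Spic"},
        {"BeMal": "B", "Shape": "Var"}]
cpt = learn_cpt(rows, child="Shape", parent="BeMal")
print(cpt)  # e.g. Pr(Shape=Spic | BeMal=B) = 2/3
```

Level 2 below changes what must be counted (e.g., Pr(Size | Shape, Be/Mal)), but the counting itself is the same.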
Level 2: Structure

(network: Be/Mal is the parent of Shape and of Size, with Shape now also a parent of Size)

Given: Features, Data
Learn: Bayes net structure and probabilities
Note: with this structure, we now need Pr(Size | Shape, Be/Mal) instead of Pr(Size | Be/Mal)
Level 3: Aggregates

Given: features, data, and background knowledge (aggregation functions such as average, mode, and max)
Learn: useful aggregate features, a Bayes net structure that uses these features, and probabilities. New features may draw on other rows or tables.

[Network fragment: Be/Mal is the parent of Shape, Size, and a new aggregate node, "Avg size this date"]
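The "Avg size this date" feature can be computed with ordinary grouping. The table below is the toy version from the slides (values hypothetical):

```python
from collections import defaultdict

# Hypothetical abnormality table mirroring the slide:
# (patient, abnormality_id, date, shape, size, location, label)
table = [
    ("P1", 1, "5/02", "Spic", 0.03, "RU4", "B"),
    ("P1", 2, "5/04", "Var",  0.04, "RU4", "M"),
    ("P1", 3, "5/04", "Spic", 0.04, "LL3", "B"),
]

# Aggregate feature: average mass size over all abnormalities recorded for
# the same patient on the same date ("Avg size this date" in the network).
sizes = defaultdict(list)
for patient, _, date, _, size, _, _ in table:
    sizes[(patient, date)].append(size)
avg_size = {key: sum(v) / len(v) for key, v in sizes.items()}

# Append the new field to each row; it summarizes *other* rows of the table.
augmented = [row + (avg_size[(row[0], row[2])],) for row in table]
print(augmented[1][-1])   # 0.04, the average over the two 5/04 abnormalities
```

The key point is that the new field for one abnormality is computed from other rows, which is exactly what a single-table propositional learner cannot express.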
Level 4: View Learning
Given: features, data, and background knowledge (aggregation functions and intensionally defined relations such as "increase" or "same location")
Learn: useful new features defined by views (equivalent to rules or SQL queries), the Bayes net structure, and probabilities

[Network fragment: Be/Mal is the parent of Shape, Size, "Avg size this date", and two view-defined nodes: "Shape change in abnormality at this location" and "Increase in average size of abnormalities"]
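A learned view is essentially a stored query. The sketch below, with an invented schema (not the actual UW database) and dates rewritten as YYYY-MM so they sort correctly as strings, defines the "shape change at this location" feature as an SQLite view:

```python
import sqlite3

# Illustrative schema and values; dates use YYYY-MM so string comparison
# orders them chronologically.
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE abnormality
               (patient TEXT, ab INTEGER, date TEXT, shape TEXT,
                size REAL, loc TEXT, label TEXT)""")
con.executemany("INSERT INTO abnormality VALUES (?,?,?,?,?,?,?)", [
    ("P1", 1, "2002-05", "Spic", 0.03, "RU4", "B"),
    ("P1", 2, "2004-05", "Var",  0.04, "RU4", "M"),
    ("P1", 3, "2004-05", "Spic", 0.04, "LL3", "B"),
])

# "Shape change in abnormality at this location": does an earlier abnormality
# for the same patient and location have a different shape?
con.execute("""CREATE VIEW shape_change AS
    SELECT a.patient, a.ab,
           EXISTS (SELECT 1 FROM abnormality b
                   WHERE b.patient = a.patient AND b.loc = a.loc
                     AND b.date < a.date AND b.shape <> a.shape) AS changed
    FROM abnormality a""")
result = con.execute("SELECT ab, changed FROM shape_change ORDER BY ab").fetchall()
print(result)   # [(1, 0), (2, 1), (3, 0)]
```

Abnormality 2 triggers the feature because abnormality 1 occurred earlier at the same location (RU4) with a different shape; the view becomes one more field the Bayes net can condition on.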
Example of a Learned Rule

is_malignant(A) IF:
    'BIRADS_category'(A, b5),
    'MassPAO'(A, present),
    'MassesDensity'(A, high),
    'HO_BreastCA'(A, hxDCorLC),
    in_same_mammogram(A, B),
    'Calc_Pleomorphic'(B, notPresent),
    'Calc_Punctate'(B, notPresent).
ROC: Level 2 (TAN) vs. Level 1
Precision-Recall Curves
All Levels of Learning
[Precision-recall plot: precision vs. recall, both axes from 0 to 1, with curves for Level 4 (View), Level 3 (Aggregate), Level 2 (Structure), and Level 1 (Parameter)]
SAYU-View
An improved View Learning approach
SAYU: Score As You Use
For each candidate rule, add it to the Bayesian network and see whether it improves the network's score
Only add a rule (a new field in the view) if it improves the Bayes net
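The SAYU loop can be sketched as follows. Here score, add_field, and the toy "usefulness" scorer are hypothetical stand-ins for the real ILP rule search and Bayes-net evaluation:

```python
def sayu(candidate_rules, score, add_field, network):
    """Keep a candidate rule only if adding it improves the network's score."""
    best = score(network)
    for rule in candidate_rules:
        trial = add_field(network, rule)   # extend the view with the new field
        s = score(trial)                   # re-score the whole Bayes net
        if s > best:
            network, best = trial, s       # accept; otherwise discard the rule
    return network

# Toy instantiation: a "network" is just a frozenset of field names, and the
# score rewards two specific fields (a stand-in for Bayes-net AUC-PR).
useful = {"avg_size", "shape_change"}
score = lambda net: len(net & useful)
add_field = lambda net, rule: net | {rule}

final = sayu(["density", "avg_size", "loc_mode", "shape_change"],
             score, add_field, frozenset())
print(sorted(final))   # ['avg_size', 'shape_change']
```

The design point is that rules are scored by how they help the final classifier, not by their own coverage, so only fields that actually improve the Bayes net survive.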
Relational Learning Algorithms
[Precision-recall plot: precision vs. recall, both axes from 0 to 1, with curves for SAYU-View, Initial Level 4 (View), and Level 3 (Aggregates)]
Average Area Under PR Curve for Recall >= 0.5

[Bar chart of area under the PR curve, scale 0.00 to 0.12, for Level 3, Initial Level 4, and SAYU-View]
Clinical Databases of the Future (Dramatically Simplified)

PatientID  Gender  Birthdate
P1         M       3/22/63

PatientID  Date    Physician  Symptoms      Diagnosis
P1         1/1/01  Smith      palpitations  hypoglycemic
P1         2/1/03  Jones      fever, aches  influenza

PatientID  Date    Lab Test       Result
P1         1/1/01  blood glucose  42
P1         1/9/01  blood glucose  45

PatientID  SNP1  SNP2  …  SNP500K
P1         AA    AB    …  BB
P2         AB    BB    …  AA

PatientID  Date Prescribed  Date Filled  Physician  Medication  Dose  Duration
P1         5/17/98          5/18/98      Jones      prilosec    10mg  3 months
Another Problem with Current Learning of Regulatory Models
Current techniques all use a greedy heuristic
Bayes net learning algorithms use the "sparse candidate" approach: to be considered as a parent of gene 1, another gene 2 must be correlated with gene 1
CPDs are often represented as trees, learned with greedy tree-learning algorithms
All of these can fall prey to functions such as exclusive-or. Do such functions arise in practice?
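Exclusive-or is exactly the case that defeats these greedy heuristics. A minimal check (using GINI impurity as the purity measure; the data and function are a small synthetic example) shows zero gain for every attribute, relevant or not:

```python
def gini(labels):
    """GINI impurity p(1 - p) for a binary label list."""
    p = sum(labels) / len(labels)
    return p * (1 - p)

def gain(rows, labels, attr):
    """Impurity reduction from a one-step split on attribute index attr."""
    total = gini(labels)
    for v in (0, 1):
        sub = [lab for r, lab in zip(rows, labels) if r[attr] == v]
        total -= len(sub) / len(labels) * gini(sub)
    return total

# Exclusive-or target: f = x0 XOR x1, with x2 irrelevant (all 8 assignments).
rows = [(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)]
labels = [a ^ b for a, b, c in rows]

# Every attribute, relevant or not, looks equally useless to a greedy learner:
print([round(gain(rows, labels, i), 3) for i in range(3)])   # [0.0, 0.0, 0.0]
```

A greedy, one-step-lookahead learner therefore has no basis for choosing x0 or x1 over the irrelevant x2.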
Skewing Example [Page & Ray, 2003; Ray & Page, 2004; Rosell et al., 2005; Ray & Page, 2005]

Drosophila survival based on gender and Sxl gene activity

Female  SxL Gene  Gene3 … Gene100  Survival
0       0         …                0
0       1         …                1
1       0         …                1
1       1         …                0

Survival is the exclusive-or of Female and SxL, so no single attribute has gain.
Hard Functions
Our definition: those functions for which no attribute has "gain" according to standard purity measures (GINI, entropy)
Note: "hard" does not refer to the size of the representation
Example: n-variable odd parity
Many others exist
Learning Hard Functions
The standard method of learning hard functions (e.g., with decision trees) is depth-k lookahead: O(mn^(2^(k+1) - 1)) time for m examples over n variables
We devise a technique that allows learning algorithms to learn hard functions efficiently
Key Idea
Hard functions are not hard under all data distributions
We can skew the input distribution to simulate a different one, by randomly choosing "preferred values" for attributes
Accumulate evidence over several skews to select a split attribute
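A minimal sketch of the idea on the XOR example, assuming a 3/4-vs-1/4 reweighting toward the preferred values (one choice of skew strength, not canonical):

```python
def wgini(pairs):
    """Weighted GINI impurity p(1 - p); pairs = [(weight, label)]."""
    tot = sum(w for w, _ in pairs)
    p = sum(w for w, lab in pairs if lab == 1) / tot
    return p * (1 - p)

def wgain(rows, labels, weights, attr):
    """Weighted impurity reduction from splitting on attribute attr."""
    gain = wgini(list(zip(weights, labels)))
    tot = sum(weights)
    for v in (0, 1):
        sub = [(w, lab) for r, lab, w in zip(rows, labels, weights)
               if r[attr] == v]
        gain -= (sum(w for w, _ in sub) / tot) * wgini(sub)
    return gain

# XOR target: f = x0 XOR x1, x2 irrelevant (all 8 assignments).
rows = [(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)]
labels = [a ^ b for a, b, c in rows]

# Skew preferring x1 = 1: matching examples get weight 3/4, the rest 1/4.
weights = [0.75 if r[1] == 1 else 0.25 for r in rows]

# Under the skew, the relevant attribute x0 shows gain; irrelevant x2 does not.
print(round(wgain(rows, labels, weights, 0), 4))   # 0.0625
print(round(wgain(rows, labels, weights, 2), 4))   # 0.0
```

Repeating this over several random skews and accumulating which attributes show gain separates the relevant attributes from the irrelevant ones without exhaustive lookahead.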
Example: Uniform Distribution

x1  x2  x3  f
0   0   0   0
0   1   0   1
1   0   0   1
1   1   0   0

GINI(f) = 0.25
GINI(f; x_i = 0) = 0.25,  GINI(f; x_i = 1) = 0.25
GAIN(x_i) = 0 for every attribute x_i
Example: Skewed Distribution ("Sequential Skewing", Ray & Page, ICML 2004)

x1  x2  x3  f  Weight
0   0   0   0  1/4
0   1   0   1  3/4
1   0   0   1  1/4
1   1   0   0  3/4

With the preferred value x2 = 1 weighted 3/4 and x2 = 0 weighted 1/4:

GINI(f) = 0.25
GINI(f; x1 = 1) = 0.19
GINI(f; x2 = 1) = 0.25
GINI(f; x3 = 1) = 0.25

GAIN(x1) > 0 while GAIN(x2) = GAIN(x3) = 0: the skew exposes one of the relevant attributes