Machine Learning for High-Throughput Biological Data
These notes were originally from KDD2006 tutorial notes, written by David Page, Dept. of Biostatistics and Medical Informatics and Dept. of Computer Sciences, University of Wisconsin-Madison. http://www.biostat.wisc.edu/~page/PageKDD2006.ppt
Some Data Types We’ll Discuss
Gene expression microarrays
Single-nucleotide polymorphisms (SNPs)
Mass spectrometry proteomics and metabolomics
Protein-protein interactions (from co-immunoprecipitation)
High-throughput screening of potential drug molecules
How Microarrays Work

(figure: a labeled sample of RNA hybridizes to DNA probes on the gene chip surface; image from the DOE Human Genome Program, http://www.ornl.gov/hgmis)
Two Views of Microarray Data
View 1: data points are genes, represented by their expression levels across different samples (i.e., features = samples). Goal: categorize new genes.
View 2: data points are samples (e.g., patients), represented by the expression levels of different genes (i.e., features = genes). Goal: categorize new samples.
Two Ways to View The Data
Person      A28202_ac   AB00014_at   AB00015_at   ...
Person 1    1142.0      321.0        2567.2       ...
Person 2    586.3       586.1        759.0        ...
Person 3    105.2       559.3        3210.7       ...
Person 4    42.8        692.1        812.0        ...
...
Data Points are Genes
Data Points are Samples
Supervision: Add Class Values
Person      A28202_ac   AB00014_at   AB00015_at   ...   Class
Person 1    1142.0      321.0        2567.2       ...   normal
Person 2    586.3       586.1        759.0        ...   cancer
Person 3    105.2       559.3        3210.7       ...   normal
Person 4    42.8        692.1        812.0        ...   cancer
...
Supervised Learning Task
Given: a set of microarray experiments, each done with mRNA from a different patient (same cell type from every patient). The patient's expression values for each gene constitute the features, and the patient's disease constitutes the class.
Do: Learn a model that accurately predicts class based on features
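As a minimal sketch of this task (an assumption for illustration only; the studies discussed below used weighted voting, SVMs, trees, and Bayes nets), a toy nearest-centroid classifier can be run with leave-one-out cross-validation on a small patients-by-genes matrix shaped like the table above:

```python
import math

# Toy patients-by-genes matrix echoing the table above; labels are the class.
X = [[1142.0, 321.0, 2567.2],
     [586.3, 586.1, 759.0],
     [105.2, 559.3, 3210.7],
     [42.8, 692.1, 812.0]]
y = ["normal", "cancer", "normal", "cancer"]

def centroid(rows):
    # Per-gene mean expression over a set of patients.
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(len(rows[0]))]

def predict(x, centroids):
    # Assign x to the class whose centroid is nearest in Euclidean distance.
    def dist(a, b):
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    return min(centroids, key=lambda c: dist(x, centroids[c]))

# Leave-one-out cross-validation: hold out each patient in turn.
correct = 0
for i in range(len(X)):
    train_X = X[:i] + X[i + 1:]
    train_y = y[:i] + y[i + 1:]
    cents = {c: centroid([x for x, lab in zip(train_X, train_y) if lab == c])
             for c in set(train_y)}
    correct += predict(X[i], cents) == y[i]
accuracy = correct / len(X)
print(accuracy)
```

Note that everything (here just the centroids) is re-estimated inside each fold, a point the tutorial returns to later.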
Location in Task Space: data points are samples; the task is supervised data mining. Predict the class value for a patient based on the expression levels for his/her genes.
Leukemia (Golub et al., 1999)
Classes: acute lymphoblastic leukemia (ALL) vs. acute myeloid leukemia (AML)
Approach: weighted voting (essentially naive Bayes)
Cross-validated accuracy: of 34 samples, declined to predict 5, correct on the other 29
Cancer vs. Normal
Relatively easy to predict accurately, because so much goes "haywire" in cancer cells
Primary barrier is noise in the data: impure RNA, cross-hybridization, etc.
Studies include breast, colon, prostate, lymphoma, and multiple myeloma
X-Val Accuracies for Multiple Myeloma (74 MM vs. 31 Normal)

Method          Accuracy (%)
Trees           98.1
Boosted Trees   99.0
SVMs            100.0
Vote            97.0
Bayes Nets      100.0
More MM (300), Benign Condition MGUS (Hardin et al., 2004)
ROC Curves: Cancer vs. Normal
ROC: Cancer vs. Benign (MGUS)
Work by Statisticians Outside of Standard Classification/Clustering
Methods to better convert Affymetrix's low-level intensity measurements into expression levels: e.g., work by Speed, Wong, Irizarry
Methods to find differentially expressed genes between two samples: e.g., work by Newton and Kendziorski
But the following is most related...
Ranking Genes by Significance
Some biologists don't want one predictive model, but a rank-ordered list of genes to explore further (with estimated significance).
For each gene we have a set of expression levels under our conditions, say cancer vs. normal.
We can do a t-test to see if the mean expression levels differ between the two conditions, yielding a p-value.
Multiple comparisons problem: if we repeat this test for 30,000 genes, some will pop up as significant just by chance alone.
We could do a Bonferroni correction (multiply p-values by 30,000), but this is drastic and might eliminate all genes.
False Discovery Rate (FDR) [Storey and Tibshirani, 2001]
Addresses multiple comparisons but is less extreme than Bonferroni.
Replaces the p-value by a q-value: the fraction of genes with this p-value or lower that really don't have different means in the two classes (false discoveries).
Publicly available in R as part of the Bioconductor package.
Recommendation: use this in addition to your supervised data mining; your collaborators will want to see it.
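A simpler relative of this idea is the Benjamini-Hochberg step-up procedure, which this sketch implements; unlike Storey's q-value it assumes all genes are null (pi0 = 1), so it is slightly more conservative:

```python
def bh_qvalues(pvals):
    # Benjamini-Hochberg step-up q-values: p * m / rank, made monotone
    # by walking from the largest p-value down.
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    q = [0.0] * m
    prev = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        prev = min(prev, pvals[i] * m / rank)
        q[i] = prev
    return q

qs = bh_qvalues([0.0001, 0.01, 0.03, 0.5])
print(qs)  # every q-value is >= its p-value and <= 1
```

Thresholding the q-values at, say, 0.05 then controls the expected fraction of false discoveries among the reported genes at 5%.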
FDR Highlights Difficulties Getting Insight into Cancer vs. Normal
Using Benign Condition Instead of Normal Helps Somewhat
Question to Anticipate
You've run a supervised data mining algorithm on your collaborator's data, and you present an estimate of accuracy or an ROC curve (from cross-validation). How did you adjust this for the multiple comparisons problem?
Answer: you don't need to, because you commit to a single predictive model before ever looking at the test data for a fold; this is only one comparison.
Prognosis and Treatment
Features are the same as for diagnosis. Rather than disease state, the class value becomes life expectancy with a given treatment (or positive response vs. no response to a given treatment).
Breast Cancer Prognosis (Van’t Veer et al., 2002)
Classes: good prognosis (no metastasis within five years of initial diagnosis) vs. poor prognosis
Algorithm: ensemble of voters
Results: 83% cross-validated accuracy on 78 cases
A Lesson
Previous work selected the features to use in the ensemble by looking at the entire data set; feature selection should have been repeated on each cross-validation fold. The authors also chose the ensemble size by seeing which size gave the highest cross-validation result. The authors corrected this in a web supplement; accuracy went from 83% to 73%. Remember to "tune parameters" separately for each cross-validation fold!
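The lesson can be sketched as follows; the top-k selection by class-mean difference is a hypothetical stand-in for whatever feature selector is actually used, and the point is only that it runs inside each fold, never on the full data set:

```python
def select_top_genes(X, y, k):
    # Rank genes by absolute difference of class means; keep the top k.
    # Must be called on TRAINING data only, inside each fold.
    classes = sorted(set(y))
    def mean(vals):
        return sum(vals) / len(vals)
    scores = []
    for j in range(len(X[0])):
        m0 = mean([x[j] for x, lab in zip(X, y) if lab == classes[0]])
        m1 = mean([x[j] for x, lab in zip(X, y) if lab == classes[1]])
        scores.append((abs(m0 - m1), j))
    return [j for _, j in sorted(scores, reverse=True)[:k]]

# Correct protocol: repeat selection on every leave-one-out fold.
X = [[5.0, 1.0, 9.0], [4.8, 7.0, 1.2], [5.1, 1.2, 8.8], [5.2, 6.8, 1.0]]
y = ["a", "b", "a", "b"]
for i in range(len(X)):
    train_X = X[:i] + X[i + 1:]
    train_y = y[:i] + y[i + 1:]
    genes = select_top_genes(train_X, train_y, k=2)  # no peeking at X[i]
    test_features = [X[i][j] for j in genes]
```

Selecting `genes` once, outside the loop, is exactly the leak that inflated 73% to 83% above.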
Prognosis with Specific Therapy (Rosenwald et al., 2002)
Data set contains gene-expression patterns for 160 patients with diffuse large B-cell lymphoma receiving anthracycline chemotherapy.
Class label is five-year survival. One 80/80 test-train split.
True positive rate: 60%; false negative rate: 39%.
Some Future Directions
Using gene-chip data to select therapy: predict which therapy gives the best prognosis for a patient.
Combining gene expression data with clinical data such as lab results, medical and family history: multiple relational tables, may benefit from relational learning.
Unsupervised Learning Task
Given: a set of microarray experiments under different conditions
Do: cluster the genes, where a gene is described by its expression levels in the different experiments
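A minimal sketch of this task, assuming a greedy single-link grouping by Pearson correlation (an illustrative choice; real work uses k-means, hierarchical, or model-based clustering):

```python
import math

def pearson(a, b):
    # Pearson correlation between two expression profiles.
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = math.sqrt(sum((x - ma) ** 2 for x in a))
    vb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (va * vb)

def cluster_genes(profiles, threshold=0.9):
    # Greedy single-link clustering: a gene joins the first cluster
    # containing a gene it correlates with above the threshold.
    clusters = []
    for g, prof in enumerate(profiles):
        for c in clusters:
            if any(pearson(prof, profiles[h]) > threshold for h in c):
                c.append(g)
                break
        else:
            clusters.append([g])
    return clusters

# Hypothetical profiles: expression across four conditions.
profiles = [[1, 2, 3, 4],   # goes up
            [2, 4, 6, 8],   # goes up (same shape)
            [9, 6, 4, 1],   # goes down
            [8, 6, 4, 2]]   # goes down
print(cluster_genes(profiles))
```

Correlation (rather than Euclidean distance) captures "goes up or down together" regardless of absolute expression level.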
Location in Task Space: data points are genes; the task is clustering. Group genes into clusters, where all members of a cluster tend to go up or down together.
Example (green = up-regulated, red = down-regulated)
(figure: heatmap with genes as rows and experiments/samples as columns)
Visualizing Gene Clusters (e.g., Sharan and Shamir, 2000)

(figure: normalized expression vs. time in 10-minute intervals; Gene Cluster 1, size = 20, and Gene Cluster 2, size = 43)
Unsupervised Learning Task 2
Given: a set of microarray experiments (samples) corresponding to different conditions or patients
Do: cluster the experiments
Location in Task Space: data points are samples; the task is clustering. Group samples by gene expression profile.
Examples
Cluster samples from mice subjected to a variety of toxic compounds (Thomas et al., 2001)
Cluster samples from cancer patients, potentially to discover different subtypes of a cancer
Cluster samples taken at different time points
Some Biological Pathways
Regulatory pathways: nodes are labeled by genes; arcs denote influence on transcription (e.g., G1 codes for P1, and P1 inhibits G2's transcription)
Metabolic pathways: nodes are metabolites and large biomolecules (e.g., sugars, lipids, proteins, and modified proteins); arcs run from biochemical reaction inputs to outputs and are labeled by enzymes (themselves proteins)
Metabolic Pathway Example: the Krebs cycle (TCA cycle, citric acid cycle)

(figure: fumarate -> malate -> oxaloacetate -> citrate -> cis-aconitate -> isocitrate -> alpha-ketoglutarate -> succinyl-CoA -> succinate -> fumarate; enzymes: fumarase, MDH, citrate synthase, aconitase, IDH, alpha-KGDH, succinate thiokinase; cofactors: FAD/FADH2, NAD+/NADH, GDP + Pi/GTP, HSCoA, acetyl-CoA, H2O, CO2)
Regulatory Pathway (KEGG)
Using Microarray Data Only
Supervised Learning Task 2
Given: a set of microarray experiments for the same organism under different conditions
Do: learn a graphical model that accurately predicts the expression of some genes in terms of others
Some Approaches to Learning Regulatory Networks
Bayes net learning (started with Friedman & Halpern, 1999; we'll see more)
Boolean networks (Akutsu, Kuhara, Maruyama & Miyano, 1998; Ideker, Thorsson & Karp, 2002)
Related graphical approaches (Tanay & Shamir, 2001; Chrisman, Langley, Baay & Pohorille, 2003)
Bayesian Network (BN)

Note: the direction of an arrow indicates dependence, not causality.

(figure: parent node geneA with child node geneB; the CPTs give P(geneA) and P(geneB | geneA), estimated from the data for Expt1-Expt4)
Problem: Not Causality

A -> B: A is a good predictor of B. But is A regulating B?
Ground truth might instead be B -> A, or a common cause C with arcs C -> A and C -> B, or a more complicated variant.
Approaches to Get Causality
Use "knock-outs" (Pe'er, Regev, Elidan and Friedman, 2001); but these are not available in most organisms.
Use time-series data and Dynamic Bayesian Networks (Ong, Glasner and Page, 2002); but typically even less data.
Use other data sources, e.g., sequences upstream of genes, where transcription regulators may bind (Segal, Barash, Simon, Friedman and Koller, 2002; Noto and Craven, 2005).
A Dynamic Bayes Net

(figure: genes 1 through N at one time step, with arcs to genes 1 through N at the next time step)
Problem: Not Enough Data Points to Construct Large Network
We are fortunate to get 100s of chips, but have 1000s of genes (E. coli: ~4,000; yeast: ~6,000; human: ~30,000). We want to learn a causal graphical model over 1000s of variables with only 100s of examples (settings of the variables).
Advance: Module Networks [Segal, Pe’er, Regev, Koller & Friedman, 2005]
Cluster genes by similarity over expression experiments.
All genes in a cluster are "tied together": same parents and CPDs.
Learn structure subject to this tying together of genes.
Iteratively re-form clusters and re-learn the network, in an EM-like fashion.
Problem: Data are Continuous but Models are Discrete
Gene chips provide a real-valued mRNA measurement.
Boolean networks and most practical Bayes net learning algorithms assume discrete variables.
We may lose valuable information by discretizing.
Advance: Use of Dynamic Bayes Nets with Continuous Variables [Segal, Pe’er, Regev, Koller & Friedman, 2005]
Expression measurements are used as-is instead of discretized (up, down, same).
Assume linear influence of parents on children (Michaelis-Menten assumption).
Work so far constructed the network from the literature and learned the parameters.
Problem: Much Missing Information
mRNA from gene 1 doesn't directly alter the level of mRNA from gene 2.
Rather, the protein product of gene 1 may alter the level of mRNA from gene 2 (e.g., a transcription factor).
Activation of a transcription factor might not occur by making more of it, but just by phosphorylating it (post-translational modification).
Example: Transcription Regulation

(figure: operons on DNA, each with promoter P, operator O, and terminator T; one operon contains geneA, geneB, geneC, and another contains the regulator geneR; mRNA is transcribed from each operon, and the regulator protein R binds an operator)
Approach: Measure More Stuff
Mass spectrometry (later) can measure protein rather than mRNA, but doesn't measure all proteins and is not very quantitative (presence/absence).
2D gels can measure post-translational modifications, but are still low-throughput because of current analysis.
Co-immunoprecipitation (later) and yeast two-hybrid screens can measure protein interactions, but are noisy.
Another Way Around Limitations
Identify a smaller part of the task that is a step toward a full regulatory pathway: part of a pathway, or classes or groups of genes.
Examples:
Chromatin remodelers
Predicting the operons in E. coli
Chromatin Remodelers and Nucleosome [Segal et al. 2006]
The previous DNA picture is oversimplified: the DNA double-helix is wrapped in further complex structure, and DNA is accessible only if part of this structure is unwound.
Can we predict which chromatin remodelers act on which parts of DNA, and also what activates a remodeler?
The E. coli Genome

Finding Operons in E. coli (Craven, Page, Shavlik, Bockhorst and Glasner, 2000)
Given: known operons and other E. coli data
Do: predict all operons in E. coli
Additional sources of information: gene-expression data, functional annotation
(figure: an operon with promoter, genes g1, g2, g3, and terminator, followed by genes g4 and g5)
Comparing Naive Bayes and Decision Trees (C5.0)
Using Only Individual Features
Single-Nucleotide Polymorphisms
SNPs: individual positions in DNA where variation is common
Roughly 2 million known SNPs in humans
The new Affymetrix whole-genome scan measures 500,000 of these
Easier/faster/cheaper to measure SNPs than to completely sequence everyone
Motivation ...
If We Sequenced Everyone...

(figure: compare the genomes of people susceptible to disease D, or responding to treatment T, against those not susceptible or not responding)
Example of SNP Data
Person      SNP1   SNP2   SNP3   ...   CLASS
Person 1    CT     AG     TT     ...   old
Person 2    CC     AG     CT     ...   young
Person 3    TT     AA     CC     ...   old
Person 4    CT     GG     TT     ...   young
...
Phasing (Haplotyping)
Advantages of SNP Data
A person's SNP pattern does not change with time or disease, so it can give more insight into susceptibility.
Easier to collect samples (can simply use blood rather than affected tissue).
Challenges of SNP Data
Unphased: algorithms exist for phasing (haplotyping), but they make errors and typically need related individuals and dense coverage.
Missing values are more common than in microarray data (though improving substantially, down to around 1-2% now).
Many more measurements: for example, the Affymetrix human SNP chip measures half a million SNPs.
Supervised Learning Task
Given: a set of SNP profiles, each from a different patient.
Phased: the nucleotides at each SNP position on each copy of each chromosome constitute the features, and the patient's disease constitutes the class.
Unphased: the unordered pair of nucleotides at each SNP position constitutes the features, and the patient's disease constitutes the class.
Do: Learn a model that accurately predicts class based on features
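For the unphased case, a common encoding (assumed here for illustration; the reference alleles are hypothetical) maps each unordered pair to a count of non-reference alleles, giving a numeric feature per SNP:

```python
def encode_genotypes(snp_rows, ref_alleles):
    # Encode unphased genotypes as non-reference allele counts (0, 1, or 2).
    # Each genotype is an unordered pair like "CT"; ref_alleles gives the
    # assumed reference (major) allele per SNP position.
    encoded = []
    for row in snp_rows:
        feats = []
        for geno, ref in zip(row, ref_alleles):
            feats.append(sum(1 for allele in geno if allele != ref))
        encoded.append(feats)
    return encoded

# Hypothetical unphased data: three SNPs per person, as in the table above.
rows = [["CT", "AG", "TT"],
        ["CC", "AG", "CT"],
        ["TT", "AA", "CC"]]
print(encode_genotypes(rows, ref_alleles=["C", "A", "T"]))
```

The resulting 0/1/2 matrix can be fed to the same SVM or naive Bayes learners discussed below.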
Waddell et al., 2005
Multiple myeloma, young (susceptible) vs. old (less susceptible), 3,000 SNPs, best at 64% accuracy (training).
SVM with feature selection (repeated on every fold of cross-validation): 72% accuracy; also naive Bayes. Significantly better than chance.
                 Predicted
Actual           Old    Young
Old              31     9
Young            14     26
Listgarten et al., 2005
SVMs from SNP data predict lung cancer susceptibility at 69% accuracy; naive Bayes gives similar performance.
Best single SNP at less than 60% accuracy (training).
Lessons
Supervised data mining algorithms can predict disease susceptibility at rates better than chance and better than individual SNPs.
Accuracies are much lower than we see with microarray data, because we're predicting who will get the disease, not who already has it.
Future Directions
Pharmacogenetics: predicting drug response from a SNP profile (drug efficacy, adverse reactions)
Combining SNP data with other data types, such as clinical (history, lab tests) and microarray
Proteomics
Microarrays are useful primarily because mRNA concentrations serve as a surrogate for protein concentrations.
We would like to measure protein concentrations directly, but at present cannot do so in the same high-throughput manner.
Proteins do not have obvious direct complements; we could build molecules that bind, but binding is greatly affected by protein structure.
Time-of-Flight (TOF) Mass Spectrometry (thanks to Sean McIlwain)

(figure: laser, sample plate at +V, detector)
Measures the time for an ionized particle, starting from the sample plate, to hit the detector.
Time-of-Flight (TOF) Mass Spectrometry 2

Matrix-Assisted Laser Desorption/Ionization (MALDI): crystalloid structures are made using a proton-rich matrix molecule; hitting the crystalloid with the laser causes molecules to ionize and "fly" toward the detector.
Time-of-Flight Demonstration

0. Sample plate.
1. Matrix molecules are added.
2. Protein molecules are embedded in the matrix.
3. Laser and detector in place; +10 kV positive charge under the sample plate.
4. Laser pulsed directly onto the sample; a proton is kicked off a matrix molecule onto another molecule.
5. Lots of protons are kicked off matrix ions, giving rise to more positively charged molecules.
6. The high positive potential under the sample plate causes positively charged molecules to accelerate toward the detector.
7. Smaller-mass molecules hit the detector first, while heavier ones are detected later.
8. The flight time is measured from when the laser is pulsed until the molecule hits the detector.
9. The experiment is repeated a number of times, counting frequencies of "flight times".
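The physics behind the demonstration can be sketched directly: an ion of charge z accelerated through voltage V gains kinetic energy z*e*V = (1/2)*m*v^2, so flight time grows with the square root of mass. The tube length and voltage below are hypothetical values for illustration:

```python
import math

E_CHARGE = 1.602176634e-19  # elementary charge, coulombs

def flight_time(mass_da, charge, accel_voltage, drift_length):
    # Idealized TOF: drift time over a field-free tube of length L is
    # t = L * sqrt(m / (2 * z * e * V)).
    mass_kg = mass_da * 1.66053906660e-27  # daltons -> kg
    v = math.sqrt(2 * charge * E_CHARGE * accel_voltage / mass_kg)
    return drift_length / v

# A 1000 Da singly charged peptide vs. a 4000 Da one, 10 kV, 1 m tube:
t1 = flight_time(1000, 1, 10_000, 1.0)
t2 = flight_time(4000, 1, 10_000, 1.0)
print(t1, t2)  # the 4x heavier ion takes sqrt(4) = 2x as long
```

This is why smaller-mass molecules hit the detector first (step 7 above), and why the recorded flight times can be converted back to M/Z values.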
Example Spectra from a Competition by Lin et al. at Duke

(figure: intensity vs. M/Z; these are different fractions from the same sample)

Trypsin-Treated Spectra

(figure: frequency vs. M/Z)
Many Challenges Raised by Mass Spectrometry Data
Noise: extra peaks from handling of the sample, from the machine and environment (electrical noise), etc.
M/Z values may not align exactly across spectra (resolution ~0.1%).
Intensities are not calibrated across spectra, so quantification is difficult.
Cannot get all proteins: typically only several hundred. To improve the odds of getting the ones we want, we may fractionate the sample by 2D gel electrophoresis or liquid chromatography.
Challenges (Continued)
Better results if we partially digest the proteins (break them into smaller peptides) first.
It can be difficult to determine what proteins we have from a spectrum.
Isotopic peaks: C-13 and N-15 atoms in varying numbers cause multiple peaks for a single peptide.
Handling Noise: Peak Picking
We want to pick peaks that are statistically significant above the noise signal, and use these as features in our learning algorithms.
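A minimal peak-picking sketch, assuming a noise floor of median + SNR * MAD (an illustrative rule; real peak pickers model baseline drift and peak shape far more carefully):

```python
def pick_peaks(intensities, snr=3.0):
    # Flag local maxima whose height exceeds a noise floor estimated
    # as median + snr * MAD (median absolute deviation).
    s = sorted(intensities)
    median = s[len(s) // 2]
    mad = sorted(abs(x - median) for x in intensities)[len(s) // 2]
    floor = median + snr * mad
    peaks = []
    for i in range(1, len(intensities) - 1):
        if (intensities[i] > floor
                and intensities[i] > intensities[i - 1]
                and intensities[i] >= intensities[i + 1]):
            peaks.append(i)
    return peaks

# Toy spectrum: baseline noise with two clear peaks at indices 4 and 9.
spectrum = [1, 3, 2, 1, 50, 2, 3, 1, 2, 30, 1, 3, 2]
print(pick_peaks(spectrum))
```

The indices returned (positions along the M/Z axis) would then serve as the features fed to a learner.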
Many Supervised Learning Tasks
Learn to predict proteins from spectra, when the organism's proteome is known
Learn to identify isotopic distributions
Learn to predict disease from proteins, peaks, or isotopic distributions as features
Construct pathway models
Using Mass Spectrometry for Early Detection of Ovarian Cancer [Petricoin et al., 2002]
Ovarian cancer is difficult to detect early, often leading to poor prognosis.
Trained and tested on mass spectra from blood serum: 100 training cases, 50 with cancer; held-out test set of 116 cases, 50 with cancer.
100% sensitivity, 95% specificity (63/66) on the held-out test set.
Not So Fast
The data mining methodology seems sound, but Keith Baggerly argues that cancer samples were handled differently than normal samples, and perhaps the data were preprocessed differently too.
If we run cancer samples Monday and normals Wednesday, we could get differences from a machine breakdown or from nearby electrical equipment that's running on Monday but not Wednesday.
Lesson: tell collaborators they must randomize samples for the entire processing phase, and of course all our preprocessing must be the same.
The debate is still raging; the results have not been replicated in trials.
Other Proteomics: 3D Structures
Other Proteomics: Interactions
Figure from Ideker et al., Science 292(5518):929-934, 2001
Each node represents a gene product (protein); blue edges show direct protein-protein interactions; yellow edges show interactions in which one protein binds to DNA and affects the expression of another.
Protein-Protein Interactions
Yeast two-hybrid; immunoprecipitation
Antibodies ("immuno") are made by combinatorial combinations of certain proteins. Millions of antibodies can be made, to recognize a wide variety of different antigens ("invaders"), often by recognizing specific proteins.
(figure: an antibody binding its protein)
Protein-Protein Interactions
(figure: immunoprecipitation, in which an antibody pulls down its target protein, and co-immunoprecipitation, in which the target's interaction partners are pulled down with it)
Many Supervised Learning Tasks
Learn to predict protein-protein interactions: protein 3D structures may be critical
Use protein-protein interactions in the construction of pathway models
Learn to predict protein function from interaction data
ChIP-Chip Data
Immunoprecipitation can also be done to identify proteins interacting with DNA rather than with other proteins.
Chromatin immunoprecipitation (ChIP): grab a sample of DNA bound to a particular protein (transcription factor).
ChIP-chip: run this sample of DNA on a microarray to see which DNA was bound.
Example of analysis of such new data: Keles et al., 2006.
Metabolomics
Measures the concentration of each low-molecular-weight molecule in a sample.
These typically are "metabolites": small molecules produced or consumed by reactions in biochemical pathways.
These reactions are typically catalyzed by proteins (specifically, enzymes).
This data is typically also mass spectrometry, though it could also be NMR.
Lipomics
Analogous to metabolomics, but measuring concentrations of lipids rather than metabolites.
Could potentially help induce biochemical pathway information, or help with disease diagnosis or treatment choice.
To Design a Drug:
1. Identify the target protein: requires knowledge of the proteome/genome and the relevant biochemical pathways.
2. Determine the target site structure: crystallography, NMR; difficult if membrane-bound.
3. Synthesize a molecule that will bind: modeling of structure is imperfect, and structures may change at binding.
And even then...
Molecule Binds Target But May:
Bind too tightly or not tightly enough.
Be toxic.
Have other effects (side-effects) in the body.
Break down as soon as it gets into the body, or not leave the body soon enough.
Not get to where it should in the body (e.g., crossing the blood-brain barrier).
Not diffuse from the gut to the bloodstream.
And Every Body is Different:
Even if a molecule works in the test tube and in animal studies, it may not work in people (it will fail in clinical trials).
A molecule may work for some people but not others.
A molecule may cause harmful side-effects in some people but not others.
Typical Practice when Target Structure is Unknown
High-throughput screening (HTS): test many molecules (~1,000,000) to find some that bind to the target (ligands).
Infer (induce) the shape of the target site from 3D structural similarities.
The shared 3D substructure is called a pharmacophore.
A perfect example of a machine learning task with a spatial target.
An Example of Structure Learning

(figure: 3D conformations of active vs. inactive molecules)
Common Data Mining Approaches
Represent a molecule by thousands to millions of features and use standard techniques (e.g., KDD Cup 2001)
Represent each low-energy conformer by a feature vector and use multiple-instance learning (e.g., Jain et al., 1998)
Relational learning: inductive logic programming (e.g., Finn et al., 1998) and graph mining
Supervised Learning Task
Given: a set of molecules, each labeled by activity (binding affinity for the target protein), and a set of low-energy conformers for each molecule
Do: Learn a model that accurately predicts activity (may be Boolean or real-valued)
ILP as Illustration: The Logical Representation of a Pharmacophore
active(X) if:
    has-conformation(X,Conf),
    has-hydrophobic(X,A),
    has-hydrophobic(X,B),
    distance(X,Conf,A,B,3.0,1.0),
    has-acceptor(X,C),
    distance(X,Conf,A,C,4.0,1.0),
    distance(X,Conf,B,C,5.0,1.0).
This logical clause states that a molecule X is active if it has some conformation Conf, hydrophobic groups A and B, and a hydrogen acceptor C such that the following holds: in conformation Conf of molecule X, the distance between A and B is 3 Angstroms (plus or minus 1), the distance between A and C is 4, and the distance between B and C is 5.
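The clause can also be read procedurally. This sketch (hypothetical atom typing, brute force over candidate triples) tests whether any assignment of labeled 3D points satisfies the distance constraints within tolerance:

```python
import math
from itertools import combinations

def dist(p, q):
    # Euclidean distance between two 3D points.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def has_pharmacophore(atoms, tol=1.0):
    # Check the three-point pharmacophore from the clause above:
    # hydrophobics A, B and acceptor C with AB=3, AC=4, BC=5 Angstroms,
    # each +/- tol. atoms is a list of (type, (x, y, z)) tuples.
    hydros = [xyz for t, xyz in atoms if t == "hydrophobic"]
    accs = [xyz for t, xyz in atoms if t == "acceptor"]
    for a, b in combinations(hydros, 2):
        for c in accs:
            for p, q in ((a, b), (b, a)):  # A and B are interchangeable
                if (abs(dist(p, q) - 3.0) <= tol
                        and abs(dist(p, c) - 4.0) <= tol
                        and abs(dist(q, c) - 5.0) <= tol):
                    return True
    return False

# Hypothetical conformation: a 3-4-5 right triangle satisfies the clause.
conf = [("hydrophobic", (0.0, 0.0, 0.0)),
        ("hydrophobic", (3.0, 0.0, 0.0)),
        ("acceptor", (0.0, 4.0, 0.0))]
print(has_pharmacophore(conf))  # True
```

In the ILP setting, this matching is done by the logic engine itself, using background knowledge like the distance predicates shown next.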
Background Knowledge IInformation about atoms and bonds in the molecules
atm(m1,a1,o,3,5.915800,-2.441200,1.799700).
atm(m1,a2,c,3,0.574700,-2.773300,0.337600).
atm(m1,a3,s,3,0.408000,-3.511700,-1.314000).

bond(m1,a1,a2,1).
bond(m1,a2,a3,1).
Background knowledge II
Definition of distance equivalence:

dist(Drug,Atom1,Atom2,Dist,Error):-
    number(Error),
    coord(Drug,Atom1,X1,Y1,Z1),
    coord(Drug,Atom2,X2,Y2,Z2),
    euc_dist(p(X1,Y1,Z1),p(X2,Y2,Z2),Dist1),
    Diff is Dist1-Dist,
    absolute_value(Diff,E1),
    E1 =< Error.

euc_dist(p(X1,Y1,Z1),p(X2,Y2,Z2),D):-
    Dsq is (X1-X2)^2+(Y1-Y2)^2+(Z1-Z2)^2,
    D is sqrt(Dsq).
Central Idea: Generalize by Searching a Lattice

Lattice of hypotheses, from general to specific:

active(X)

active(X) if has-hydrophobic(X,A)
active(X) if has-donor(X,A)
active(X) if has-acceptor(X,A)

active(X) if has-hydrophobic(X,A), has-donor(X,B), distance(X,A,B,5.0)
active(X) if has-donor(X,A), has-donor(X,B), distance(X,A,B,4.0)
active(X) if has-acceptor(X,A), has-donor(X,B), distance(X,A,B,6.0)

etc.
Conformational model
Conformational flexibility is modelled as multiple conformations, generated by Sybyl randomsearch or Catalyst.

Pharmacophore description
Atom- and site-centred: hydrogen bond donor, hydrogen bond acceptor, hydrophobe, site points (limited at present), user-definable
Distance-based
Example 1: Dopamine agonists
Agonists taken from the Martin data set on the QSAR Society web pages
Examples (5-50 conformations/molecule)

(figure: 2D structures of the dopamine agonists, with hydroxyl and amine groups)
Pharmacophore identified
Molecule A has the desired activity if:
    in conformation B, molecule A contains a hydrogen acceptor at C, and
    in conformation B, molecule A contains a basic nitrogen group at D, and
    the distance between C and D is 7.05966 +/- 0.75 Angstroms, and
    in conformation B, molecule A contains a hydrogen acceptor at E, and
    the distance between C and E is 2.80871 +/- 0.75 Angstroms, and
    the distance between D and E is 6.36846 +/- 0.75 Angstroms, and
    in conformation B, molecule A contains a hydrophobic group at F, and
    the distance between C and F is 2.68136 +/- 0.75 Angstroms, and
    the distance between D and F is 4.80399 +/- 0.75 Angstroms, and
    the distance between E and F is 2.74602 +/- 0.75 Angstroms.
Example II: ACE inhibitors

28 angiotensin-converting enzyme inhibitors taken from the literature: D. Mayer et al., J. Comput.-Aided Mol. Design, 1, 3-16 (1987).

(figure: 2D structures of several ACE inhibitors)
ACE pharmacophore

Molecule A is an ACE inhibitor if:
    molecule A contains a zinc-site B,
    molecule A contains a hydrogen acceptor C,
    the distance between B and C is 7.899 +/- 0.750 A,
    molecule A contains a hydrogen acceptor D,
    the distance between B and D is 8.475 +/- 0.750 A,
    the distance between C and D is 2.133 +/- 0.750 A,
    molecule A contains a hydrogen acceptor E,
    the distance between B and E is 4.891 +/- 0.750 A,
    the distance between C and E is 3.114 +/- 0.750 A,
    the distance between D and E is 3.753 +/- 0.750 A.
Pharmacophore discovered

Distance   Progol   Mayer
A          4.9      5.0
B          3.8      3.8
C          8.5      8.6

(figure: pharmacophore geometry showing the zinc site and H-bond acceptors, with distances A, B, and C marked)
Additional Finding
The original pharmacophore was rediscovered, plus one other with a different zinc ligand position, similar to an alternative proposed by Ciba-Geigy.
(figure: distances 7.3, 3.9, and 4.0 Angstroms)
Example III: Thermolysin inhibitors
10 inhibitors for which crystallographic data are available in the PDB
Conformationally challenging molecules
Experimentally observed superposition
Key binding site interactions
(figure: inhibitor bound in the active site, showing interactions with Asn112-NH, S2', O=C Asn112, Arg203-NH, S1', O=C Ala113, and Zn)
Interactions made by inhibitors

(table: for each of 1HYT, 1THL, 1TLP, 1TMN, 2TMN, 4TLN, 4TMN, 5TLN, 5TMN, and 6TMN, which of the interactions Asn112-NH, S2', Asn112 C=O, Arg203 NH, S1', Ala113 C=O, and Zn it makes)
Pharmacophore Identification
Structures considered: 1HYT, 1THL, 1TLP, 1TMN, 2TMN, 4TLN, 4TMN, 5TLN, 5TMN, 6TMN
Conformational analysis using "Best" conformer generation in Catalyst; 98-251 conformations/molecule
Thermolysin Results
10 five-point pharmacophores identified, falling into 2 groups (7/10 molecules): 3 "acceptors", 1 hydrophobe, 1 donor; or 4 "acceptors", 1 donor
Common core of Zn ligands, Arg203 and Asn112 interactions identified
Correct assignments of functional groups
Correct geometry to 1 Angstrom tolerance
Increasing the tolerance to 1.5 Angstroms finds a common 6-point pharmacophore including one extra interaction
Example IV: Antibacterial peptides [Spatola et al., 2000]
Dataset of 11 pentapeptides showing activity against Pseudomonas aeruginosa: 6 actives (IC50 < 6 µg/ml) and 5 inactives
Pharmacophore Identified
A molecule M is active against Pseudomonas aeruginosa if it has a conformation B such that:
    M has a hydrophobic group C,
    M has a hydrogen acceptor D,
    the distance between C and D in conformation B is 11.7 Angstroms,
    M has a positively-charged atom E,
    the distance between C and E in conformation B is 4 Angstroms,
    the distance between D and E in conformation B is 9.4 Angstroms,
    M has a positively-charged atom F,
    the distance between C and F in conformation B is 11.1 Angstroms,
    the distance between D and F in conformation B is 12.6 Angstroms,
    the distance between E and F in conformation B is 8.7 Angstroms.
Tolerance: 1.5 Angstroms
Clinical Databases of the Future (Dramatically Simplified)
PatientID   Gender   Birthdate
P1          M        3/22/63

PatientID   Date     Physician   Symptoms       Diagnosis
P1          1/1/01   Smith       palpitations   hypoglycemic
P1          2/1/03   Jones       fever, aches   influenza

PatientID   Date     Lab Test        Result
P1          1/1/01   blood glucose   42
P1          1/9/01   blood glucose   45

PatientID   SNP1   SNP2   ...   SNP500K
P1          AA     AB     ...   BB
P2          AB     BB     ...   AA

PatientID   Date Prescribed   Date Filled   Physician   Medication   Dose   Duration
P1          5/17/98           5/18/98       Jones       prilosec     10mg   3 months
Final Wrap-up
Molecular biology is collecting lots and lots of data in the post-genome era.
There is an opportunity to "connect" molecular-level information to diseases and treatment, and we need analysis tools to interpret it.
Data mining opportunities abound. Hopefully this tutorial provided a solid start toward applying data mining to high-throughput biological data.
Thanks To
Jude Shavlik, John Shaughnessy, Bart Barlogie, Mark Craven, Sean McIlwain, Jan Struyf, Arno Spatola, Paul Finn, Beth Burnside, Michael Molla, Michael Waddell, Irene Ong, Jesse Davis, Soumya Ray, Jo Hardin, John Crowley, Fenghuang Zhan, Eric Lantz
If Time Permits… some of my group’s directions in the area
Clinical data (with Jesse Davis, Beth Burnside, M.D.)
Addressing another problem with current approaches to biological network learning (with Soumya Ray, Eric Lantz)
Using Machine Learning with Clinical Histories: Example
We'll use the example of mammography to show some issues that arise. These issues arise even with just one relational table, and are even more pronounced with data in multiple tables.
Supervised Learning Task
Given: a database of mammogram abnormalities for different patients. The radiologist-entered values describing an abnormality constitute the features, and the abnormality's biopsy result (benign or malignant) constitutes the class.
Do: Learn a model that accurately predicts class based on features
Mammography Database [Davis et al., 2005; Burnside et al., 2005]

Patient   Abnormality   Date   Mass Shape   ...   Mass Size   Loc   Be/Mal
P1        1             5/02   Spic         ...   0.03        RU4   B
P1        2             5/04   Var          ...   0.04        RU4   M
P1        3             5/04   Spic         ...   0.04        LL3   B
...
Original Expert Structure
Level 1: Parameters

(network: Be/Mal is the parent of Shape and of Size)

Given: Features (node labels, or fields in the database), Data, Bayes net structure
Learn: Probabilities
Note: the probabilities needed are Pr(Be/Mal), Pr(Shape | Be/Mal), Pr(Size | Be/Mal)
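Learning these parameters is just counting. A maximum-likelihood sketch over toy rows shaped like the database above (field names abbreviated; Laplace smoothing, used in practice, is omitted here):

```python
from collections import Counter

def learn_cpt(rows, child, parent):
    # Maximum-likelihood CPT Pr(child | parent) from counted data.
    # rows are dicts like {"BeMal": "B", "Shape": "Spic", ...}.
    pair = Counter((r[parent], r[child]) for r in rows)
    par = Counter(r[parent] for r in rows)
    return {(pv, cv): n / par[pv] for (pv, cv), n in pair.items()}

# Toy rows in the spirit of the mammography table above.
rows = [{"BeMal": "B", "Shape": "Spic"},
        {"BeMal": "M", "Shape": "Var"},
        {"BeMal": "B", "Shape": "Spic"},
        {"BeMal": "B", "Shape": "Var"}]
cpt = learn_cpt(rows, child="Shape", parent="BeMal")
print(cpt)  # e.g. Pr(Shape=Spic | BeMal=B) = 2/3
```

Level 2 below changes what must be counted (e.g., Pr(Size | Shape, Be/Mal)), but the counting itself is the same.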
Level 2: Structure

(network: Be/Mal is the parent of Shape and of Size, with Shape now also a parent of Size)

Given: Features, Data
Learn: Bayes net structure and probabilities
Note: with this structure, we now need Pr(Size | Shape, Be/Mal) instead of Pr(Size | Be/Mal)
Level 3: Aggregates

Given: features, data, and background knowledge (aggregation functions such as average, mode, and max)
Learn: useful aggregate features, a Bayes net structure that uses these features, and probabilities. New features may draw on other rows or tables.

[Network fragment: Be/Mal is the parent of Shape, Size, and a new aggregate node, "Avg size this date"]
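The "Avg size this date" feature can be computed with ordinary grouping. The table below is the toy version from the slides (values hypothetical):

```python
from collections import defaultdict

# Hypothetical abnormality table mirroring the slide:
# (patient, abnormality_id, date, shape, size, location, label)
table = [
    ("P1", 1, "5/02", "Spic", 0.03, "RU4", "B"),
    ("P1", 2, "5/04", "Var",  0.04, "RU4", "M"),
    ("P1", 3, "5/04", "Spic", 0.04, "LL3", "B"),
]

# Aggregate feature: average mass size over all abnormalities recorded for
# the same patient on the same date ("Avg size this date" in the network).
sizes = defaultdict(list)
for patient, _, date, _, size, _, _ in table:
    sizes[(patient, date)].append(size)
avg_size = {key: sum(v) / len(v) for key, v in sizes.items()}

# Append the new field to each row; it summarizes *other* rows of the table.
augmented = [row + (avg_size[(row[0], row[2])],) for row in table]
print(augmented[1][-1])   # 0.04, the average over the two 5/04 abnormalities
```

The key point is that the new field for one abnormality is computed from other rows, which is exactly what a single-table propositional learner cannot express.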
Level 4: View Learning
Given: features, data, and background knowledge (aggregation functions and intensionally defined relations such as "increase" or "same location")
Learn: useful new features defined by views (equivalent to rules or SQL queries), the Bayes net structure, and probabilities

[Network fragment: Be/Mal is the parent of Shape, Size, "Avg size this date", and two view-defined nodes: "Shape change in abnormality at this location" and "Increase in average size of abnormalities"]
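A learned view is essentially a stored query. The sketch below, with an invented schema (not the actual UW database) and dates rewritten as YYYY-MM so they sort correctly as strings, defines the "shape change at this location" feature as an SQLite view:

```python
import sqlite3

# Illustrative schema and values; dates use YYYY-MM so string comparison
# orders them chronologically.
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE abnormality
               (patient TEXT, ab INTEGER, date TEXT, shape TEXT,
                size REAL, loc TEXT, label TEXT)""")
con.executemany("INSERT INTO abnormality VALUES (?,?,?,?,?,?,?)", [
    ("P1", 1, "2002-05", "Spic", 0.03, "RU4", "B"),
    ("P1", 2, "2004-05", "Var",  0.04, "RU4", "M"),
    ("P1", 3, "2004-05", "Spic", 0.04, "LL3", "B"),
])

# "Shape change in abnormality at this location": does an earlier abnormality
# for the same patient and location have a different shape?
con.execute("""CREATE VIEW shape_change AS
    SELECT a.patient, a.ab,
           EXISTS (SELECT 1 FROM abnormality b
                   WHERE b.patient = a.patient AND b.loc = a.loc
                     AND b.date < a.date AND b.shape <> a.shape) AS changed
    FROM abnormality a""")
result = con.execute("SELECT ab, changed FROM shape_change ORDER BY ab").fetchall()
print(result)   # [(1, 0), (2, 1), (3, 0)]
```

Abnormality 2 triggers the feature because abnormality 1 occurred earlier at the same location (RU4) with a different shape; the view becomes one more field the Bayes net can condition on.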
Example of a Learned Rule

is_malignant(A) IF:
    'BIRADS_category'(A, b5),
    'MassPAO'(A, present),
    'MassesDensity'(A, high),
    'HO_BreastCA'(A, hxDCorLC),
    in_same_mammogram(A, B),
    'Calc_Pleomorphic'(B, notPresent),
    'Calc_Punctate'(B, notPresent).
ROC: Level 2 (TAN) vs. Level 1
Precision-Recall Curves
All Levels of Learning
[Precision-recall plot: precision vs. recall, both axes from 0 to 1, with curves for Level 4 (View), Level 3 (Aggregate), Level 2 (Structure), and Level 1 (Parameter)]
SAYU-View
An improved View Learning approach
SAYU: Score As You Use
For each candidate rule, add it to the Bayesian network and see whether it improves the network's score
Only add a rule (a new field in the view) if it improves the Bayes net
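The SAYU loop can be sketched as follows. Here score, add_field, and the toy "usefulness" scorer are hypothetical stand-ins for the real ILP rule search and Bayes-net evaluation:

```python
def sayu(candidate_rules, score, add_field, network):
    """Keep a candidate rule only if adding it improves the network's score."""
    best = score(network)
    for rule in candidate_rules:
        trial = add_field(network, rule)   # extend the view with the new field
        s = score(trial)                   # re-score the whole Bayes net
        if s > best:
            network, best = trial, s       # accept; otherwise discard the rule
    return network

# Toy instantiation: a "network" is just a frozenset of field names, and the
# score rewards two specific fields (a stand-in for Bayes-net AUC-PR).
useful = {"avg_size", "shape_change"}
score = lambda net: len(net & useful)
add_field = lambda net, rule: net | {rule}

final = sayu(["density", "avg_size", "loc_mode", "shape_change"],
             score, add_field, frozenset())
print(sorted(final))   # ['avg_size', 'shape_change']
```

The design point is that rules are scored by how they help the final classifier, not by their own coverage, so only fields that actually improve the Bayes net survive.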
Relational Learning Algorithms
[Precision-recall plot: precision vs. recall, both axes from 0 to 1, with curves for SAYU-View, Initial Level 4 (View), and Level 3 (Aggregates)]
Average Area Under PR Curve for Recall >= 0.5

[Bar chart of area under the PR curve, scale 0.00 to 0.12, for Level 3, Initial Level 4, and SAYU-View]
Clinical Databases of the Future (Dramatically Simplified)

PatientID  Gender  Birthdate
P1         M       3/22/63

PatientID  Date    Physician  Symptoms      Diagnosis
P1         1/1/01  Smith      palpitations  hypoglycemic
P1         2/1/03  Jones      fever, aches  influenza

PatientID  Date    Lab Test       Result
P1         1/1/01  blood glucose  42
P1         1/9/01  blood glucose  45

PatientID  SNP1  SNP2  …  SNP500K
P1         AA    AB    …  BB
P2         AB    BB    …  AA

PatientID  Date Prescribed  Date Filled  Physician  Medication  Dose  Duration
P1         5/17/98          5/18/98      Jones      prilosec    10mg  3 months
Another Problem with Current Learning of Regulatory Models
Current techniques all use a greedy heuristic
Bayes net learning algorithms use the "sparse candidate" approach: to be considered as a parent of gene 1, another gene 2 must be correlated with gene 1
CPDs are often represented as trees, learned with greedy tree-learning algorithms
All of these can fall prey to functions such as exclusive-or. Do such functions arise in practice?
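Exclusive-or is exactly the case that defeats these greedy heuristics. A minimal check (using GINI impurity as the purity measure; the data and function are a small synthetic example) shows zero gain for every attribute, relevant or not:

```python
def gini(labels):
    """GINI impurity p(1 - p) for a binary label list."""
    p = sum(labels) / len(labels)
    return p * (1 - p)

def gain(rows, labels, attr):
    """Impurity reduction from a one-step split on attribute index attr."""
    total = gini(labels)
    for v in (0, 1):
        sub = [lab for r, lab in zip(rows, labels) if r[attr] == v]
        total -= len(sub) / len(labels) * gini(sub)
    return total

# Exclusive-or target: f = x0 XOR x1, with x2 irrelevant (all 8 assignments).
rows = [(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)]
labels = [a ^ b for a, b, c in rows]

# Every attribute, relevant or not, looks equally useless to a greedy learner:
print([round(gain(rows, labels, i), 3) for i in range(3)])   # [0.0, 0.0, 0.0]
```

A greedy, one-step-lookahead learner therefore has no basis for choosing x0 or x1 over the irrelevant x2.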
Skewing Example [Page & Ray, 2003; Ray & Page, 2004; Rosell et al., 2005; Ray & Page, 2005]

Drosophila survival based on gender and Sxl gene activity

Female  SxL Gene  Gene3 … Gene100  Survival
0       0         …                0
0       1         …                1
1       0         …                1
1       1         …                0

Survival is the exclusive-or of Female and SxL, so no single attribute has gain.
Hard Functions
Our definition: those functions for which no attribute has "gain" according to standard purity measures (GINI, entropy)
Note: "hard" does not refer to the size of the representation
Example: n-variable odd parity
Many others exist
Learning Hard Functions
The standard method of learning hard functions (e.g., with decision trees) is depth-k lookahead: O(mn^(2^(k+1) - 1)) time for m examples over n variables
We devise a technique that allows learning algorithms to learn hard functions efficiently
Key Idea
Hard functions are not hard under all data distributions
We can skew the input distribution to simulate a different one, by randomly choosing "preferred values" for attributes
Accumulate evidence over several skews to select a split attribute
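A minimal sketch of the idea on the XOR example, assuming a 3/4-vs-1/4 reweighting toward the preferred values (one choice of skew strength, not canonical):

```python
def wgini(pairs):
    """Weighted GINI impurity p(1 - p); pairs = [(weight, label)]."""
    tot = sum(w for w, _ in pairs)
    p = sum(w for w, lab in pairs if lab == 1) / tot
    return p * (1 - p)

def wgain(rows, labels, weights, attr):
    """Weighted impurity reduction from splitting on attribute attr."""
    gain = wgini(list(zip(weights, labels)))
    tot = sum(weights)
    for v in (0, 1):
        sub = [(w, lab) for r, lab, w in zip(rows, labels, weights)
               if r[attr] == v]
        gain -= (sum(w for w, _ in sub) / tot) * wgini(sub)
    return gain

# XOR target: f = x0 XOR x1, x2 irrelevant (all 8 assignments).
rows = [(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)]
labels = [a ^ b for a, b, c in rows]

# Skew preferring x1 = 1: matching examples get weight 3/4, the rest 1/4.
weights = [0.75 if r[1] == 1 else 0.25 for r in rows]

# Under the skew, the relevant attribute x0 shows gain; irrelevant x2 does not.
print(round(wgain(rows, labels, weights, 0), 4))   # 0.0625
print(round(wgain(rows, labels, weights, 2), 4))   # 0.0
```

Repeating this over several random skews and accumulating which attributes show gain separates the relevant attributes from the irrelevant ones without exhaustive lookahead.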
Example: Uniform Distribution

x1  x2  x3  f
0   0   0   0
0   1   0   1
1   0   0   1
1   1   0   0

GINI(f) = 0.25
GINI(f; x_i = 0) = 0.25,  GINI(f; x_i = 1) = 0.25
GAIN(x_i) = 0 for every attribute x_i
Example: Skewed Distribution ("Sequential Skewing", Ray & Page, ICML 2004)

x1  x2  x3  f  Weight
0   0   0   0  1/4
0   1   0   1  3/4
1   0   0   1  1/4
1   1   0   0  3/4

With the preferred value x2 = 1 weighted 3/4 and x2 = 0 weighted 1/4:

GINI(f) = 0.25
GINI(f; x1 = 1) = 0.19
GINI(f; x2 = 1) = 0.25
GINI(f; x3 = 1) = 0.25

GAIN(x1) > 0 while GAIN(x2) = GAIN(x3) = 0: the skew exposes one of the relevant attributes