The Biostatistical & Bioinformatics Challenges in the High Dimensional Data Derived from High Throughput Assays: Today and Tomorrow
Yu Shyr (石瑜), Ph.D.
May 14, 2008
China Medical University
Vanderbilt University (泛德堡大學)
US News & World Report (America's Best Colleges, 2007)
1. Princeton University (NJ)
2. Harvard University (MA)
3. Yale University (CT)
4. California Institute of Technology (CA)
4. Stanford University (CA)
4. Massachusetts Institute of Technology (MA)
7. University of Pennsylvania (PA)
8. Duke University (NC)
9. Dartmouth College (NH)
9. Columbia University (NY)
9. University of Chicago (IL)
12. Cornell University (NY)
12. Washington University in St. Louis (MO)
14. Northwestern University (IL)
15. Brown University (RI)
16. Johns Hopkins University (MD)
17. Rice University (TX)
18. Vanderbilt University (TN)
18. Emory University (GA)
20. University of Notre Dame (IN)
21. Carnegie Mellon University (PA)
21. University of California – Berkeley (CA)
23. Georgetown University (DC)
24. University of Virginia (VA)
24. University of Michigan – Ann Arbor (MI)
Tennessee, the “Volunteer State”
Nashville, TN: “Music City, USA!”
Vanderbilt University
A private, nonsectarian, coeducational research university in Nashville, TN.
Established in 1873 by shipping and rail magnate Cornelius Vanderbilt.
Enrolls 11,000 students in ten schools annually.
Ranks 18th in the nation among national research universities.
Also has several research facilities and a world-renowned medical center.
Famous alumni include former vice president Al Gore.
Vanderbilt University Medical Center (VUMC)
Collection of several hospitals and clinics associated with Vanderbilt University in Nashville, Tennessee.
In 2003, was placed on the Honor Roll of the nation’s best hospitals.
The medical school was ranked 17th in the nation among research-oriented medical schools and in the ISI top 5 for research impact in clinical medicine and pharmacology.
Vanderbilt-Ingram Cancer Center
Only NCI-designated Comprehensive Cancer Center in Tennessee and one of only 39 in the United States.
Nearly 300 investigators in seven research programs.
More than $190 million in annual research funding.
Among the top 10 in competitively awarded NCI grant support.
Ranks 20th in the nation and consistently ranks among the best places for cancer care by U.S. News and World Report.
One of a select few centers to hold agreements with the NCI to conduct Phase I and Phase II clinical trials, where innovative therapies are first evaluated in patients.
Department of Biostatistics
Created by the School of Medicine at Vanderbilt University in September 2003.
The Dean and other senior medical school faculty are committed to providing outstanding collaborative support in biostatistics to clinical and basic scientists and to developing a graduate program in biostatistics that will train outstanding collaborative scientists and will focus on the methods of modern applied statistics.
The major challenge in high throughput experiments, e.g.,
microarray data, MALDI-TOF data, SELDI-TOF data, or shotgun
proteomic data is that the data is often high dimensional.
When the number of dimensions reaches thousands or more,
the computational time for the pattern recognition algorithms
can become unreasonable. This can be a problem, especially
when some of the features are not discriminatory.
High Dimensional Data
The irrelevant features may cause a reduction in the accuracy of
some algorithms. For example (Witten 1999), experiments with a
decision tree classifier have shown that adding a random binary
feature to standard datasets can deteriorate the classification
performance by 5 - 10%.
Furthermore, in many pattern recognition tasks, the number of
features represents the dimension of a search space - the larger
the number of features, the greater the dimension of the search
space, and the harder the problem.
Outcome Measurement: MALDI-TOF
Reflex MALDI TOF Mass Spectrometer
[Figure: instrument schematic with nitrogen laser (337 nm), laser optics, MALDI target, ion grid, TOF analyzer, ion mirror, and microchannel detector]
Time-of-Flight Mass Spectrometry (TOF-MS)
[Figure: linear TOF. An ionizing probe starts the clock; ions M1, M2, M3 are accelerated by a potential +/- U and reach the ion detector (MCP) at times t1, t2, t3. Ion signals are plotted against time or M.]
$t = a\sqrt{M} + b$
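The time-to-mass relation on this slide can be illustrated with a small calibration sketch. It assumes the standard linear-TOF form t = a*sqrt(M) + b; the function names and the calibrant values are hypothetical and are not taken from the talk.

```python
import math

def fit_tof_calibration(masses, times):
    """Least-squares fit of t = a*sqrt(M) + b from calibrant (mass, time) pairs."""
    xs = [math.sqrt(m) for m in masses]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_t = sum(times) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxt = sum((x - mean_x) * (t - mean_t) for x, t in zip(xs, times))
    a = sxt / sxx
    b = mean_t - a * mean_x
    return a, b

def time_to_mass(t, a, b):
    """Invert t = a*sqrt(M) + b to recover M from a flight time."""
    return ((t - b) / a) ** 2

# Hypothetical calibrants generated from a = 2.0, b = 5.0 (arbitrary units).
calib_masses = [1000.0, 4000.0, 9000.0]
calib_times = [2.0 * math.sqrt(m) + 5.0 for m in calib_masses]
a, b = fit_tof_calibration(calib_masses, calib_times)
```

With a and b fitted from known calibrants, `time_to_mass` converts any observed flight time to a mass estimate.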
Issues in the Analysis of High-Throughput Experiments
Experiment Design
Measurement
Preprocessing
♦ Baseline Correction, Normalization
♦ Profile Alignment, Feature Selection, Denoising
Classification
Feature Selection
QCA (Quality Control Assessment)
Computational Validation
♦ Estimate the classification error rate
♦ Bootstrapping, k-fold validation, leave-one-out validation
Validation: blind test cohort
Significance Testing of the Achieved Classification Error
Reporting the result: graphics & tables
Validation: laboratory technology, e.g., RT-PCR, pathway analysis
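As an illustration of the computational validation step, the following is a minimal leave-one-out sketch for estimating a classification error rate. The nearest-centroid classifier and the toy data are assumptions chosen for brevity; the talk does not specify a classifier.

```python
def nearest_centroid_predict(train_x, train_y, x):
    """Predict the label whose class centroid is closest (squared Euclidean) to x."""
    best_label, best_dist = None, float("inf")
    for label in sorted(set(train_y)):
        rows = [r for r, y in zip(train_x, train_y) if y == label]
        centroid = [sum(col) / len(rows) for col in zip(*rows)]
        dist = sum((xi - ci) ** 2 for xi, ci in zip(x, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

def leave_one_out_error(xs, ys):
    """Hold out each sample once, train on the rest, and count misclassifications."""
    errors = 0
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        if nearest_centroid_predict(train_x, train_y, xs[i]) != ys[i]:
            errors += 1
    return errors / len(xs)

# Hypothetical two-feature data with two well-separated classes.
xs = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2], [1.0, 1.1], [1.2, 0.9], [0.9, 1.0]]
ys = ["A", "A", "A", "B", "B", "B"]
loo_error = leave_one_out_error(xs, ys)
```

Any feature selection would need to be repeated inside each held-out iteration; otherwise the error estimate is optimistically biased.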
Preprocessing
Mass Spectrometry (MS) can generate high throughput protein profiles for biomedical applications. A consistent, sensitive and robust MS data preprocessing method would be greatly desirable because subsequent analyses are determined by the preprocessing output.
The preprocessing goal is to extract and quantify the common features across the spectra.
We propose a new comprehensive MALDI-TOF MS data preprocessing method using feedback concepts associated with several new algorithms.
This new package successfully resolves many conventional difficulties such as removing m/z measurement error, objectively setting denoising parameters, and defining common features across spectra.
Math Model for MS Data Preprocessing
From a mathematical point of view, an MS spectrum is a signal function defined on a time or m/z domain. An observed MS signal is often modeled as the superposition of three components:
$f(x) = B(x) + N \cdot S(x) + e(x),$
where f(x) is the observed signal, B(x) is a slowly varying “baseline” artifact, S(x) is the “true” signal (peaks) to be extracted, N is the normalization factor, and e(x) represents noise.
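To make the model f(x) = B(x) + N * S(x) + e(x) concrete, here is a small simulation sketch. The exponential baseline, Gaussian peak shapes, and all parameter values are illustrative assumptions, not forms given in the talk.

```python
import math
import random

def simulate_spectrum(xs, peaks, norm_factor=1.0, noise_sd=0.0, seed=0):
    """Return f(x) = B(x) + N * S(x) + e(x) at each x.

    peaks is a list of (center, height, width) tuples describing S(x).
    """
    rng = random.Random(seed)
    observed = []
    for x in xs:
        # B(x): slowly varying baseline (hypothetical exponential decay).
        baseline = 5.0 * math.exp(-x / 50.0)
        # S(x): sum of Gaussian peaks (assumed peak shape).
        signal = sum(h * math.exp(-0.5 * ((x - c) / w) ** 2) for c, h, w in peaks)
        # e(x): independent Gaussian noise, skipped when noise_sd is zero.
        noise = rng.gauss(0.0, noise_sd) if noise_sd > 0 else 0.0
        observed.append(baseline + norm_factor * signal + noise)
    return observed

xs = [float(i) for i in range(100)]
spectrum = simulate_spectrum(xs, peaks=[(30.0, 10.0, 2.0), (70.0, 6.0, 2.0)],
                             norm_factor=1.5, noise_sd=0.2)
```

Preprocessing can be read as inverting this model: estimate and subtract B(x), suppress e(x), and divide out N to recover S(x).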
Basic Descriptions of the Data Preprocessing
Registration => Denoising => Baseline correction => Normalization => Peak selection => Peak alignment or Binning
The preprocessing goal is to identify, quantify and match peaks across spectra.
Several modern algorithms such as wavelets, splines, and the nonparametric local maximum likelihood estimate (NLMLE) are successfully applied to the whole processing system.
Feedback optimizes the calibration and peak picking procedures automatically.
Raw data
General steps
(1) Calibration: Calibration based on multiple identified peaks (linear shifts on the time domain) and the shape of the peak (convolution); in the meantime, all spectra get aligned.
(2) Quantification: Baseline Correction (splines) => Normalization (TIC) => area-based peak quantification method.
(3) Feature Extraction: Denoising (wavelets) => Peak Selection (local maximum) => common peak finding across spectra (NLMLE).
(4) Feedback: optimally choosing calibration peaks and setting feature extraction parameters.
Flowchart of the Preprocessing Procedure
[Flowchart boxes: Raw data, De-noising, Peak Detection, Peak Distribution, Baseline Correction, Normalization, Calibration, Alignment, Common Feature Detection, Results]
Convolution Based Calibration Algorithm
1. Simulate known peaks (choose peaks with a clear pattern and high prevalence across spectra, 80% by feedback).
2. Convolve each spectrum with the known peak simulation (Gaussian or Beta). The maximum occurs when the two peak shapes match best.
3. The linear shift that makes multiple peaks match best is the optimal shift.
Note: all processing is done in the time domain.
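The calibration steps above can be sketched as a search for the integer shift that best matches a spectrum to a template of simulated known peaks. This sketch uses a Gaussian template and a direct cross-correlation score (equivalent to convolution for a symmetric peak shape); the function names, peak positions, and width are hypothetical.

```python
import math

def gaussian_template(length, centers, width):
    """Simulated known peaks: a sum of unit-height Gaussians at the given indices."""
    return [sum(math.exp(-0.5 * ((i - c) / width) ** 2) for c in centers)
            for i in range(length)]

def best_linear_shift(spectrum, template, max_shift):
    """Return the integer shift s maximizing sum_i spectrum[i + s] * template[i]."""
    n = len(spectrum)
    best_shift, best_score = 0, float("-inf")
    for s in range(-max_shift, max_shift + 1):
        score = sum(spectrum[i + s] * template[i]
                    for i in range(n) if 0 <= i + s < n)
        if score > best_score:
            best_shift, best_score = s, score
    return best_shift

template = gaussian_template(200, centers=[50, 120], width=3.0)
# Hypothetical observed spectrum: the same two peaks displaced by +4 samples.
observed = gaussian_template(200, centers=[54, 124], width=3.0)
shift = best_linear_shift(observed, template, max_shift=10)
# Undo the displacement so the peaks return to their expected positions.
aligned = [observed[i + shift] if 0 <= i + shift < len(observed) else 0.0
           for i in range(len(observed))]
```

Scoring several known peaks jointly, rather than one at a time, is what makes a single linear shift the optimum for the whole spectrum.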
[Figures: spectra Pre-Calibration and Post-Calibration]
1. Accurate m/z peak position (as theoretical).
2. Less variation in peak positions.
3. Easy to handle large datasets in batch mode.
Baseline Correction & Normalization
The baseline is generally considered an artificial bias of the signal.
We propose that the baseline might be caused by delayed charge release.
We fit quadratic splines to the local minima within sliding windows to obtain a continuous baseline curve.
Trimmed total ion current (TIC) normalization.
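A rough sketch of baseline correction followed by TIC normalization is given below. It departs from the slide in three labeled ways: it uses non-overlapping windows instead of sliding ones, piecewise-linear interpolation instead of quadratic splines, and plain rather than trimmed TIC (the trimming rule is not specified in the talk). All names and data are hypothetical.

```python
def window_minima(intensities, window):
    """Return (index, value) of the minimum in each consecutive window."""
    anchors = []
    for start in range(0, len(intensities), window):
        chunk = intensities[start:start + window]
        offset = min(range(len(chunk)), key=lambda k: chunk[k])
        anchors.append((start + offset, chunk[offset]))
    return anchors

def interpolate_baseline(anchors, length):
    """Piecewise-linear baseline through the anchors, held flat beyond the ends."""
    baseline = []
    j = 0
    for i in range(length):
        while j + 1 < len(anchors) and anchors[j + 1][0] <= i:
            j += 1
        x0, y0 = anchors[j]
        if i <= x0 or j + 1 == len(anchors):
            baseline.append(y0)
        else:
            x1, y1 = anchors[j + 1]
            baseline.append(y0 + (y1 - y0) * (i - x0) / (x1 - x0))
    return baseline

def correct_and_normalize(intensities, window):
    """Subtract the estimated baseline, clip at zero, and divide by the TIC."""
    anchors = window_minima(intensities, window)
    baseline = interpolate_baseline(anchors, len(intensities))
    corrected = [max(y - b, 0.0) for y, b in zip(intensities, baseline)]
    tic = sum(corrected)
    return [y / tic for y in corrected] if tic > 0 else corrected

# Hypothetical spectrum: constant offset of 2.0 plus one peak.
raw = [2.0] * 10 + [2.0, 5.0, 9.0, 5.0, 2.0] + [2.0] * 5
normalized = correct_and_normalize(raw, window=5)
```

After this step the intensities of each spectrum sum to one, so spectra acquired with different total signal become comparable.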
[Figure: Baseline Data Before Correction]
[Figure: Baseline Corrected Data]
Wavelet Denoising
Wavelets: the basis of the FBI's image coding standard for digitized fingerprints; successful at reproducing the true signal by removing noise at specific energy levels.
Wavelet methods have been used to denoise signals in a wide variety of contexts.
Wavelet methods analyze the data in both the time and frequency domains to extract more useful information.
An adaptive stationary discrete wavelet denoising method, which is shift-invariant and efficient in denoising, is applied in our research.
$f(t) = \sum_{j \in Z} \sum_{k \in Z} c(j,k)\, \psi_{j,k}(t)$
$c(j,k) = \int f(t)\, \psi_{j,k}(t)\, dt$
Denoising strategy
The stationary discrete wavelet denoising method is shift-invariant and offers both good reconstruction performance and smoothness.
The adaptive denoising method is based on the noise distribution: we set different threshold values at different mass intervals and frequency levels.
Parameters (decomposition and thresholds) are determined by the feedback information.
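The thresholding idea can be illustrated with a deliberately simplified sketch: a single-level, decimated Haar transform with soft thresholding, where each detail coefficient gets its own threshold so the threshold can vary along the mass axis. This is not the stationary (undecimated) multi-level transform described on the slide, and the function names are hypothetical. The input length must be even.

```python
import math

def haar_forward(signal):
    """One-level Haar transform: pairwise scaled sums (approx) and differences (detail)."""
    s = math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) / s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / s for i in range(0, len(signal), 2)]
    return approx, detail

def haar_inverse(approx, detail):
    """Exact inverse of haar_forward."""
    s = math.sqrt(2.0)
    out = []
    for a, d in zip(approx, detail):
        out.extend([(a + d) / s, (a - d) / s])
    return out

def soft_threshold(value, threshold):
    """Shrink a coefficient toward zero by threshold; small coefficients become zero."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def denoise(signal, thresholds):
    """thresholds holds one value per detail coefficient (position-dependent)."""
    approx, detail = haar_forward(signal)
    shrunk = [soft_threshold(d, t) for d, t in zip(detail, thresholds)]
    return haar_inverse(approx, shrunk)

# Hypothetical use: a larger threshold on the first (noisier) half of the axis.
smoothed = denoise([4.0, 2.0, 5.0, 5.0], thresholds=[10.0, 0.5])
```

In a fuller implementation the same shrinkage would be applied at several decomposition levels, with thresholds chosen from the estimated noise distribution in each mass interval.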
[Figure: DWT Decomposition]
[Figure: Denoised Data]
[Figure: Peak list across spectra]
Kernel Density Estimation
[Figure: Peak distribution without high-quality preprocessing]
[Figure: Peak distribution with high-quality preprocessing]
Peak Selection
Preprocessing on one spectrum after calibration
1. Read in the spectrum as two columns: m/z values and corresponding intensities.
2. Apply the Adaptive Stationary Discrete Wavelet Transform for denoising.
3. Estimate the baseline with sliding-window splines and subtract it. Apply Total Ion Current normalization across the whole spectrum.
4. Local maxima contribute to the peak list across spectra.
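The last step, collecting local maxima into a peak list, might look like the sketch below. The minimum-height filter is an added assumption to suppress residual noise bumps, and the m/z values and intensities are made up for illustration.

```python
def local_maxima(mz, intensity, min_height=0.0):
    """Return (m/z, intensity) pairs where the intensity is a local maximum.

    A point qualifies if it is strictly higher than its left neighbor, at least
    as high as its right neighbor, and above min_height.
    """
    peaks = []
    for i in range(1, len(intensity) - 1):
        is_peak = intensity[i] > intensity[i - 1] and intensity[i] >= intensity[i + 1]
        if is_peak and intensity[i] > min_height:
            peaks.append((mz[i], intensity[i]))
    return peaks

# Hypothetical preprocessed spectrum with two peaks.
mz = [1000.0 + i for i in range(9)]
intensity = [0.0, 1.0, 3.0, 1.0, 0.0, 2.0, 5.0, 2.0, 0.0]
peak_list = local_maxima(mz, intensity)
```

Pooling such peak lists over all spectra gives the peak distribution from which common features are then identified.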
Expression Profiles
[Figure: expression profiles for day 1, day 2, day 3, and day 4]
The Results from the Cluster Analysis
[Figure: cluster analysis results, annotated by Day and Laser Power]
Why?
Quality Control Assessment - Reproducibility
Intra-class Correlation Coefficient (ICC): Intra / (Intra + Inter)
Coefficient of Variation (CV): SD / Mean
Goal: make sure the data is reproducible! An SOP is a necessary component.
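Both reproducibility measures can be computed from replicate measurements, as in the sketch below. The CV follows the SD/Mean definition on the slide. The ICC uses the common one-way ANOVA estimator, (MSB - MSW) / (MSB + (k - 1) * MSW), which is an assumption about the intended definition; the replicate values are hypothetical.

```python
import statistics

def coefficient_of_variation(values):
    """CV = sample standard deviation divided by the mean."""
    return statistics.stdev(values) / statistics.mean(values)

def one_way_icc(groups):
    """One-way random-effects ICC for equal-sized groups of replicates.

    groups is a list of lists, one inner list of k replicates per specimen.
    """
    n = len(groups)
    k = len(groups[0])
    grand_mean = sum(sum(g) for g in groups) / (n * k)
    group_means = [sum(g) / k for g in groups]
    # Mean square between groups and mean square within groups.
    msb = k * sum((m - grand_mean) ** 2 for m in group_means) / (n - 1)
    msw = sum((x - m) ** 2 for g, m in zip(groups, group_means) for x in g) / (n * (k - 1))
    return (msb - msw) / (msb + (k - 1) * msw)

# Hypothetical peak intensities: three specimens, two replicate spots each.
replicates = [[10.1, 9.9], [15.2, 14.8], [20.0, 20.4]]
icc = one_way_icc(replicates)
cv_first_specimen = coefficient_of_variation(replicates[0])
```

An ICC near one and a small CV indicate that replicate-to-replicate variation is small relative to the differences between specimens.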
Variance Component Analysis
Mixed/Random Effect Model.
The model: investigators, day, spot, machine, lab, etc.
Sources of Variability for MALDI-TOF Data
Specimen Collection/Handling Effects
- Tumor: surgery-related effects
- Cell Line: culture condition
Biological Heterogeneity in Specimen
Biological Heterogeneity in Population
Laser power variation
Table IV. Power = 80%, Type I error = 5%

Intra-Case Variance:        1.0              0.5              0.2
Inter-Case Variance:    1.0  0.5  0.2    1.0  0.5  0.2    1.0  0.5  0.2
Subsample Number (m)
   1                     32   24   19     24   16   11     19   11    6
   5                     19   11    7     17   10    5     17    9    4
  20                     17    9    4     16    9    4     16    8    4
[Figure: CV in different days]
[Figure: ICC]
Variance Components Analysis
[Figure: variance component analysis, tumor]
Things NOT to Do
Use fold-change for feature selection
Use cluster analysis for class comparison or class prediction
Ignore the over-fitting issues
Report only the good news
Use an extremely small sample size for the independent test cohort
Agulnik, M. et al. J Clin Oncol; 25:2184-2190 2007
Multidimensional scaling (MDS)
Acknowledgement
Preprocessing: Dr. Dean Billheimer, Dr. Ming Li, Dr. Dong Hong, Shuo Chen, Huiming Li
Additional Acknowledgements: Bashar Shakhtour, Dr. William Wu, Dr. Bonnie LeFure
Analysis: Jeremy Roberts, Will Gray, Nimish Gautam, Joan Zhang, Haojie Wu
Dr. Heidi Chen, Dr. Jonathan Xu, Dr. Tatsuki Koyama
END