Yu Shyr ( 石瑜 ), Ph.D. May 14, 2008 China Medical University Yu.Shyr@vanderbilt

The Biostatistical & Bioinformatics Challenges in the High Dimensional Data Derived from High Throughput Assays: Today and Tomorrow

Yu Shyr (石瑜), Ph.D.

May 14, 2008

China Medical University

[email protected]

Vanderbilt University (泛德堡大學)

US News & World Report (America's Best Colleges, 2007)

1. Princeton University (NJ)

2. Harvard University (MA)

3. Yale University (CT)

4. California Institute of Technology (CA)

4. Stanford University (CA)

4. Massachusetts Inst. of Technology (MA)

7. University of Pennsylvania (PA)

8. Duke University (NC)

9. Dartmouth College (NH)

9. Columbia University (NY)

9. University of Chicago (IL)

12. Cornell University (NY)

12. Washington University in St. Louis (MO)

14. Northwestern University (IL)

15. Brown University (RI)

16. Johns Hopkins University (MD)

17. Rice University (TX)

18. Vanderbilt University (TN)

18. Emory University (GA)

20. University of Notre Dame (IN)

21. Carnegie Mellon University (PA)

21. University of California – Berkeley (CA)

23. Georgetown University (DC)

24. University of Virginia (VA)

24. University of Michigan – Ann Arbor (MI)


Tennessee, the “Volunteer State”

Nashville, TN: “Music City, USA!”

Vanderbilt University

A private, nonsectarian, coeducational research university in Nashville, TN.

Established in 1873 by shipping and rail magnate Cornelius Vanderbilt.

Enrolls 11,000 students in ten schools annually.

Ranks 18th in the nation among national research universities.

Also has several research facilities and a world-renowned medical center.

Famous alumni include former vice-president Al Gore.

Vanderbilt University Medical Center (VUMC)

A collection of several hospitals and clinics associated with Vanderbilt University in Nashville, Tennessee.

In 2003, it was placed on the Honor Roll of the nation’s best hospitals.

The medical school was ranked 17th in the nation among research-oriented medical schools and in the ISI top 5 for research impact in clinical medicine and pharmacology.

Vanderbilt-Ingram Cancer Center

The only NCI-designated Comprehensive Cancer Center in Tennessee and one of only 39 in the United States.

Nearly 300 investigators in seven research programs.

More than $190 million in annual research funding.

Among the top 10 in competitively awarded NCI grant support.

Ranks 20th in the nation and consistently ranks among the best places for cancer care by U.S. News and World Report.

One of a select few centers to hold agreements with the NCI to conduct Phase I and Phase II clinical trials, where innovative therapies are first evaluated in patients.


Department of Biostatistics

Created by the School of Medicine at Vanderbilt University in September 2003.

The Dean and other senior medical school faculty are committed to providing outstanding collaborative support in biostatistics to clinical and basic scientists and to developing a graduate program in biostatistics that will train outstanding collaborative scientists and will focus on the methods of modern applied statistics.

The major challenge in high throughput experiments (e.g., microarray, MALDI-TOF, SELDI-TOF, or shotgun proteomic data) is that the data are often high dimensional.

When the number of dimensions reaches thousands or more, the computational time for pattern recognition algorithms can become unreasonable. This is a particular problem when some of the features are not discriminatory.

High Dimensional Data

The irrelevant features may cause a reduction in the accuracy of

some algorithms. For example (Witten 1999), experiments with a

decision tree classifier have shown that adding a random binary

feature to standard datasets can deteriorate the classification

performance by 5 - 10%.

Furthermore, in many pattern recognition tasks, the number of

features represents the dimension of a search space - the larger

the number of features, the greater the dimension of the search

space, and the harder the problem.

High Dimensional Data

Outcome Measurement: MALDI-TOF

[Figure: Reflex MALDI TOF Mass Spectrometer. Labels: laser optics, nitrogen laser (337 nm), MALDI target, TOF analyzer, ion mirror, ion grid, microchannel detector.]

Time-of-Flight Mass Spectrometry (TOF-MS)

[Figure: Linear TOF. Labels: ionizing probe (start), +/- U, ions M1, M2, M3, ion detector (MCP), ion signals at t1, t2, t3 after start. Axis: time or M.]

$$t = a\sqrt{M} + b$$
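The time-to-mass relation on the linear TOF schematic is conventionally written as t = a·sqrt(M) + b. As a minimal sketch under that assumption, the constants a and b can be solved from two calibrant peaks and the relation inverted to convert flight times to masses. The function names and calibrant values below are hypothetical and chosen only for illustration.

```python
import math

def fit_tof_calibration(t1, m1, t2, m2):
    """Solve t = a*sqrt(M) + b for (a, b) from two calibrant peaks."""
    a = (t2 - t1) / (math.sqrt(m2) - math.sqrt(m1))
    b = t1 - a * math.sqrt(m1)
    return a, b

def time_to_mass(t, a, b):
    """Invert t = a*sqrt(M) + b to recover M from a flight time."""
    return ((t - b) / a) ** 2

# Hypothetical calibrants: (flight time, mass) pairs in arbitrary units.
a, b = fit_tof_calibration(20.0, 1000.0, 40.0, 4000.0)
print(time_to_mass(30.0, a, b))
```

In practice more than two calibrants would usually be fitted by least squares; two points are used here only to keep the algebra visible.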

Issues in the Analysis of High-Throughput Experiment

Experiment Design

Measurement

Preprocessing

♦ Baseline Correction, Normalization

♦ Profile Alignment, Feature Selection, Denoising

Classification

Feature Selection

QCA (Quality Control Assessment)

Issues in the Analysis of High-Throughput Experiment

Computational Validation

♦ Estimate the classification error rate

♦ Bootstrapping, k-fold validation, leave-one-out validation

Validation – blind test cohort

Significance Testing of the Achieved Classification Error

Reporting the result - graphic & table

Validation – laboratory technology, e.g., RT-PCR, pathway analysis
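The computational validation step listed above (estimating the classification error by leave-one-out validation) can be sketched as follows. This is only an illustration: the nearest-centroid classifier, the function names, and the toy two-feature data are assumptions, not the classifier used in the talk.

```python
def nearest_centroid_predict(train_x, train_y, sample):
    """Predict the label whose class centroid is closest to `sample`."""
    sums, counts = {}, {}
    for x, y in zip(train_x, train_y):
        acc = sums.setdefault(y, [0.0] * len(x))
        for j, v in enumerate(x):
            acc[j] += v
        counts[y] = counts.get(y, 0) + 1

    def dist2(label):
        # Squared Euclidean distance from the sample to the class centroid.
        return sum((s / counts[label] - v) ** 2
                   for s, v in zip(sums[label], sample))

    return min(sums, key=dist2)

def loocv_error(xs, ys):
    """Leave-one-out estimate of the misclassification rate."""
    errors = 0
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        if nearest_centroid_predict(train_x, train_y, xs[i]) != ys[i]:
            errors += 1
    return errors / len(xs)

# Toy data: two well-separated groups with two features each.
xs = [[1.0, 1.1], [0.9, 1.0], [1.2, 0.8], [3.0, 3.1], [2.9, 3.2], [3.1, 2.8]]
ys = ["normal"] * 3 + ["tumor"] * 3
error_rate = loocv_error(xs, ys)
```

When feature selection is part of the analysis, it would need to be repeated inside each leave-one-out fold; selecting features on the full data first tends to bias the error estimate downward.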

Preprocessing

Mass Spectrometry (MS) can generate high throughput protein profiles for biomedical applications. A consistent, sensitive and robust MS data preprocessing method would be greatly desirable because subsequent analyses are determined by the preprocessing output.

The preprocessing goal is to extract and quantify the common features across the spectra.

We propose a new comprehensive MALDI-TOF MS data preprocessing method using feedback concepts associated with several new algorithms.

This new package successfully resolves many conventional difficulties such as removing m/z measurement error, objectively setting denoising parameters, and defining common features across spectra.

Math Model for MS Data Preprocessing

From a mathematical point of view, one MS spectrum is a signal function defined on a time or m/z domain. An observed MS signal is often modeled as the superposition of three components:

$$f(x) = B(x) + N \cdot S(x) + e(x),$$

where f(x) is the observed signal, B(x) is a slowly varying “baseline” artifact, S(x) is the “true” signal (peaks) to be extracted, N is the normalization factor, and e(x) represents noise.
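To make the three-component model concrete, here is a small simulation sketch of f(x) = B(x) + N·S(x) + e(x). The exponential baseline, the single Gaussian peak, and all parameter values are assumptions chosen for illustration only; they are not taken from the talk.

```python
import math
import random

def simulate_spectrum(n_points=200, norm_factor=2.0, noise_sd=0.05, seed=0):
    """Simulate f(x) = B(x) + N*S(x) + e(x) on an integer grid."""
    rng = random.Random(seed)
    spectrum = []
    for x in range(n_points):
        baseline = 5.0 * math.exp(-x / 80.0)              # B(x): slowly varying
        signal = math.exp(-0.5 * ((x - 100) / 3.0) ** 2)  # S(x): one Gaussian peak
        noise = rng.gauss(0.0, noise_sd)                  # e(x): random noise
        spectrum.append(baseline + norm_factor * signal + noise)
    return spectrum

f = simulate_spectrum()
f_noise_free = simulate_spectrum(noise_sd=0.0)
```

Preprocessing can then be viewed as estimating B(x), N and e(x) so that S(x) is recovered from the observed f(x).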

Basic Descriptions of the Data Preprocessing

Registration, Denoising, Baseline correction, Normalization, Peak selection, Peak alignment or Binning

Math Model for MS Data Preprocessing

The preprocessing goal is to identify, quantify and match peaks across spectra.

Several modern algorithms, such as wavelets, splines, and the nonparametric local maximum likelihood estimate (NLMLE), are successfully applied to the whole processing system.

Feedback optimizes the calibration and peak-picking procedures automatically.

[Figure: Raw data]

General steps

(1) Calibration: Calibration based on multiple identified peaks (linear shifts on the time domain) and the shape of the peak (convolution); meanwhile, all spectra are aligned.

(2) Quantification: Baseline Correction (splines) => Normalization (TIC) => area-based peak quantification method.

(3) Feature Extraction: Denoising (wavelets) => Peak Selection (local maximum) => common peak finding across spectra (NLMLE).

(4) Feedback: optimally choosing calibration peaks and setting feature extraction parameters.

Flowchart of the Preprocessing Procedure

[Flowchart boxes: Raw data; De-noising; Peak Detection; Peak Distribution; Baseline Correction; Normalization; Calibration; Alignment; Common Feature Detection; Results.]

Convolution Based Calibration Algorithm

1. Known peaks’ simulation (choose peaks with high prevalence across spectra and a clear pattern, by feedback; 80%).

2. Convolve each spectrum with the known peak simulation (Gaussian or Beta). The maximum occurs when the two peak shapes match best.

3. The linear shift that makes multiple peaks match best is the optimal shift.

Notice: all processing is on the time domain.

[Figures: Pre-Calibration and Post-Calibration spectra]

1. Accurate m/z peak positions (as theoretical)

2. Less variation in the peak positions

3. Easy to handle large datasets in batch mode
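The three calibration steps above can be sketched as a search over integer shifts that maximizes the summed template match at several known peak positions. This is an assumption-laden illustration: the function names, the Gaussian template width, and the toy spectrum are hypothetical, and because a Gaussian template is symmetric, the sliding inner product used here coincides with convolution.

```python
import math

def gaussian_template(half_width, sd):
    """Simulated known-peak shape sampled on 2*half_width + 1 points."""
    return [math.exp(-0.5 * (i / sd) ** 2)
            for i in range(-half_width, half_width + 1)]

def match_score(spectrum, center, template):
    """Inner product of the template with the spectrum around `center`."""
    half = len(template) // 2
    total = 0.0
    for offset, weight in enumerate(template):
        idx = center - half + offset
        if 0 <= idx < len(spectrum):
            total += weight * spectrum[idx]
    return total

def best_linear_shift(spectrum, known_positions, template, max_shift=10):
    """Return the shift that maximizes the summed match over all known peaks."""
    best_shift, best_score = 0, float("-inf")
    for s in range(-max_shift, max_shift + 1):
        score = sum(match_score(spectrum, p + s, template)
                    for p in known_positions)
        if score > best_score:
            best_shift, best_score = s, score
    return best_shift

template = gaussian_template(half_width=6, sd=2.0)
known_peaks = [50, 120, 180]      # expected time-domain indices
observed = [0.0] * 250
for p in known_peaks:
    for i, w in enumerate(template):
        observed[p + 4 - 6 + i] += w   # spectrum displaced by +4 samples
shift = best_linear_shift(observed, known_peaks, template)
```

Subtracting the recovered shift from the time axis of each spectrum would then align all spectra to the same reference positions.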

Baseline Correction & Normalization

The baseline is generally considered an artificial bias of the signal.

We propose that the baseline might be caused by delayed charge release.

We apply quadratic splines to the local minima, using sliding windows, to obtain a continuous baseline curve.

Trimmed total ion current (TIC) normalization.

[Figure: Baseline Data Before Correction]

[Figure: Baseline Corrected Data]
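A rough sketch of baseline subtraction followed by trimmed TIC normalization is shown below. Two simplifications are assumed and should be read as such: the baseline through the window minima is interpolated linearly rather than with quadratic splines, and "trimmed" is taken to mean that the most extreme intensities are dropped before summing. All function names are hypothetical.

```python
def window_minima(y, window):
    """(index, value) of the minimum in each consecutive window."""
    anchors = []
    for start in range(0, len(y), window):
        chunk = y[start:start + window]
        j = min(range(len(chunk)), key=chunk.__getitem__)
        anchors.append((start + j, chunk[j]))
    return anchors

def interpolate_baseline(n, anchors):
    """Piecewise-linear curve through the anchors (stand-in for splines)."""
    xs = [a[0] for a in anchors]
    ys = [a[1] for a in anchors]
    baseline = []
    k = 0
    for i in range(n):
        if i <= xs[0]:
            baseline.append(ys[0])
        elif i >= xs[-1]:
            baseline.append(ys[-1])
        else:
            while xs[k + 1] < i:
                k += 1
            frac = (i - xs[k]) / (xs[k + 1] - xs[k])
            baseline.append(ys[k] + frac * (ys[k + 1] - ys[k]))
    return baseline

def trimmed_tic_normalize(y, trim=0.05):
    """Divide by the total ion current computed without the extreme values."""
    ordered = sorted(y)
    cut = int(len(ordered) * trim)
    tic = sum(ordered[cut:len(ordered) - cut])
    return [v / tic for v in y]

raw = [10.0 - 0.02 * i for i in range(100)]   # sloping baseline
raw[30] += 5.0                                # two peaks on top of it
raw[70] += 3.0
anchors = window_minima(raw, window=20)
baseline = interpolate_baseline(len(raw), anchors)
corrected = [v - b for v, b in zip(raw, baseline)]
normalized = trimmed_tic_normalize(corrected)
```

The window width controls how closely the baseline follows the data: it should be wider than the widest peak so that peak intensity is not absorbed into the baseline.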

Wavelets Denoising

Wavelets: the basis of the FBI's image coding standard for digitized fingerprints; successful at reproducing the true signal by removing noise at specific energy levels.

Wavelet methods have been used to denoise signals in a wide variety of contexts.

Wavelet methods analyze the data in both the time and frequency domains to extract more useful information.

An adaptive stationary discrete wavelet denoising method, which is shift-invariant and efficient in denoising, is applied in our research.

$$f(t) = \sum_{j \in \mathbb{Z}} \sum_{k \in \mathbb{Z}} c(j,k)\,\psi_{j,k}(t)$$

$$c(j,k) = \int f(t)\,\psi_{j,k}(t)\,dt$$

Denoising strategy

The stationary discrete wavelet denoising method is shift-invariant and offers both good reconstruction performance and smoothness.

The adaptive denoising method is based on the noise distribution: we set different threshold values at different mass intervals and frequency levels.

Parameters (decomposition and thresholds) are determined by the feedback information.

[Figure: DWT Decomposition]

[Figure: Denoised Data]
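The idea of level-dependent thresholds can be illustrated with a deliberately simpler transform than the one described above: an ordinary (decimated) Haar wavelet transform with soft thresholding, not the adaptive stationary transform used in the talk. Function names and threshold values are hypothetical, and the input length is assumed to be divisible by 2 raised to the number of levels.

```python
import math

def haar_forward(x):
    """One orthonormal Haar step: scaled pairwise sums and differences."""
    s = math.sqrt(2.0)
    approx = [(x[i] + x[i + 1]) / s for i in range(0, len(x), 2)]
    detail = [(x[i] - x[i + 1]) / s for i in range(0, len(x), 2)]
    return approx, detail

def haar_inverse(approx, detail):
    """Exact inverse of haar_forward."""
    s = math.sqrt(2.0)
    out = []
    for a, d in zip(approx, detail):
        out.extend([(a + d) / s, (a - d) / s])
    return out

def soft_threshold(c, thr):
    """Shrink a coefficient toward zero by thr, zeroing small ones."""
    return math.copysign(max(abs(c) - thr, 0.0), c)

def haar_denoise(x, levels, thresholds):
    """thresholds[j] is applied to the detail coefficients of level j
    (j = 0 is the finest, highest-frequency level)."""
    details = []
    approx = list(x)
    for j in range(levels):
        approx, d = haar_forward(approx)
        details.append([soft_threshold(c, thresholds[j]) for c in d])
    for d in reversed(details):
        approx = haar_inverse(approx, d)
    return approx

noisy = [4.0, 4.2, 3.8, 4.0, 8.1, 7.9, 8.0, 8.0]
smoothed = haar_denoise(noisy, levels=2, thresholds=[0.5, 0.0])
unchanged = haar_denoise(noisy, levels=2, thresholds=[0.0, 0.0])
```

Passing a separate threshold per level mirrors the point that noise is handled differently at different frequency levels; a mass-interval-dependent threshold would additionally vary the value along the coefficient index.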

Peak list across spectra

Kernel Density Estimation

[Figure: Peak distribution without high-quality preprocessing]

[Figure: Peak distribution with high-quality preprocessing]

Peak Selection

Preprocessing on one spectrum after calibration

1. Read in the spectrum as two columns: m/z values and corresponding intensities.

2. Apply the Adaptive Stationary Discrete Wavelet Transform for denoising.

3. Estimate the baseline with sliding-window splines and subtract it. Apply Total Ion Current normalization over the whole spectrum.

4. Local maxima contribute to the peak list across spectra.
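The last step, peak selection by local maximum, can be sketched in a few lines. The function name, the `min_height` filter, and the toy values are assumptions added for illustration.

```python
def local_maxima(mz, intensity, min_height=0.0):
    """Return (m/z, intensity) pairs at interior local maxima above min_height."""
    peaks = []
    for i in range(1, len(intensity) - 1):
        rises = intensity[i] > intensity[i - 1]
        falls = intensity[i] >= intensity[i + 1]
        if rises and falls and intensity[i] > min_height:
            peaks.append((mz[i], intensity[i]))
    return peaks

# Toy denoised, baseline-corrected spectrum.
mz = [1000.0 + i for i in range(7)]
intensity = [0.1, 0.5, 2.0, 0.4, 0.3, 1.2, 0.2]
peaks = local_maxima(mz, intensity)
```

Pooling the m/z positions returned for every spectrum gives the peak list whose distribution is examined by kernel density estimation.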

[Figure: Expression Profiles. Panels: day1, day2, day3, day4.]

[Figure: The Results from the Cluster Analysis. Labels: Day, Laser Power.]

Why?

Quality Control Assessment - Reproducibility

Intra-class Correlation Coefficient (ICC)

Inter / (Inter + Intra)

Coefficient of Variation (CV)

SD / Mean

Goal – Make sure the data is reproducible! An SOP is a necessary component.
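Both reproducibility measures are simple to compute. The sketch below assumes the usual definitions: CV as sample standard deviation over mean, and ICC as the share of total variance due to between-case variation, estimated from a balanced one-way random-effects ANOVA. The function names and replicate values are hypothetical.

```python
import statistics

def coefficient_of_variation(values):
    """CV = sample standard deviation divided by the mean."""
    return statistics.stdev(values) / statistics.mean(values)

def icc_oneway(groups):
    """ICC from a balanced one-way random-effects ANOVA.

    groups: equally sized lists of replicate measurements, one list per case.
    """
    n = len(groups)        # number of cases
    k = len(groups[0])     # replicates per case
    grand = statistics.mean(v for g in groups for v in g)
    means = [statistics.mean(g) for g in groups]
    # Between-case and within-case mean squares.
    msb = k * sum((m - grand) ** 2 for m in means) / (n - 1)
    msw = sum((v - m) ** 2
              for g, m in zip(groups, means) for v in g) / (n * (k - 1))
    return (msb - msw) / (msb + (k - 1) * msw)

# Hypothetical peak intensities: three cases, three replicate spots each.
replicates = [[10.0, 10.2, 9.8], [12.0, 12.1, 11.9], [8.0, 8.3, 7.7]]
icc = icc_oneway(replicates)
cv_first_case = coefficient_of_variation(replicates[0])
```

An ICC close to 1 indicates that replicate measurements of the same case agree well relative to the differences between cases; a small CV within a case points the same way.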

Variance Component Analysis

Mixed/Random Effect Model.

The model: investigators, day, spot, machine, lab, etc.

Source of Variability for MALDI-TOF Data

Specimen Collection/Handling Effects

- Tumor: surgical related effects

- Cell Line: culture condition

Biological Heterogeneity in Specimen

Biological Heterogeneity in Population

Laser power variation

Table IV. Power = 80%, Type I error = 5%

Intra-Case Variance       |      1.0      |      0.5      |      0.2
Inter-Case Variance       | 1.0  0.5  0.2 | 1.0  0.5  0.2 | 1.0  0.5  0.2
Subsample Number (m) = 1  |  32   24   19 |  24   16   11 |  19   11    6
Subsample Number (m) = 5  |  19   11    7 |  17   10    5 |  17    9    4
Subsample Number (m) = 20 |  17    9    4 |  16    9    4 |  16    8    4

[Figure: CV in different days]

[Figure: ICC]

Variance Components Analysis

[Figure: Variance Component Analysis. Label: Tumor.]

Things NOT to Do

Fold-change for feature selection

Cluster analysis for class comparison or class prediction

Ignore the over-fitting issues

Only report the good news

Extremely small sample size for the independent test cohort

Agulnik, M. et al. J Clin Oncol; 25:2184-2190 2007

Multidimensional scaling (MDS)

Acknowledgement

Preprocessing: Dr. Dean Billheimer, Dr. Ming Li, Dr. Dong Hong, Shuo Chen, Huiming Li

Additional Acknowledgements: Bashar Shakhtour, Dr. William Wu, Dr. Bonnie LeFure

Analysis: Jeremy Roberts, Will Gray, Nimish Gautam, Joan Zhang, Haojie Wu

Dr. Heidi Chen, Dr. Jonathan Xu, Dr. Tatsuki Koyama

END
