35
01/27/2014 1 Statistical Analysis of Metabolomics Data Xiuxia Du Department of Bioinformatics & Genomics University of North Carolina at Charlotte Outline 2 Introduction Data pre-treatment 1. Normalization 2. Centering, scaling, transformation Univariate analysis 1. Student’s t-tes 2. Volcano plot Multivariate analysis 1. PCA 2. PLS-DA Machine learning Software packages

Statistical Analysis of Metabolomics Data 1 Statistical Analysis of Metabolomics Data Xiuxia Du Department of Bioinformatics & Genomics University of North Carolina at Charlotte

  • Upload
    ngotu

  • View
    226

  • Download
    3

Embed Size (px)

Citation preview

Page 1: Statistical Analysis of Metabolomics Data 1 Statistical Analysis of Metabolomics Data Xiuxia Du Department of Bioinformatics & Genomics University of North Carolina at Charlotte

01/27/2014

1

Statistical Analysis of Metabolomics Data

Xiuxia Du

Department of Bioinformatics & Genomics

University of North Carolina at Charlotte

Outline

2

• Introduction

• Data pre-treatment1. Normalization

2. Centering, scaling, transformation

• Univariate analysis1. Student’s t-tes

2. Volcano plot

• Multivariate analysis1. PCA

2. PLS-DA

• Machine learning

• Software packages

Page 2: Statistical Analysis of Metabolomics Data 1 Statistical Analysis of Metabolomics Data Xiuxia Du Department of Bioinformatics & Genomics University of North Carolina at Charlotte

01/27/2014

2

Results from data processing

3

Next, analysis of the quantitative metabolite information …

Metabolomics data analysis

4

• Goals– biomarker discovery by identifying significant features associated

with certain conditions

– Disease diagnosis via classification

• Challenges– Limited sample size

– Many metabolites / variables

• Workflow

pre-treatmentunivariateanalysis

multivariate analysis

machine learning

Page 3: Statistical Analysis of Metabolomics Data 1 Statistical Analysis of Metabolomics Data Xiuxia Du Department of Bioinformatics & Genomics University of North Carolina at Charlotte

01/27/2014

3

Data Pre-treatment

5

Normalization (I)

6

• Goals– to reduce systematic variation

– to separate biological variation from variations introduced in the experimental process

– to improve the performance of downstream statistical analysis

• Sources of experimental variation– sample inhomogeneity

– differences in sample preparation

– ion suppression

Page 4: Statistical Analysis of Metabolomics Data 1 Statistical Analysis of Metabolomics Data Xiuxia Du Department of Bioinformatics & Genomics University of North Carolina at Charlotte

01/27/2014

4

Normalization (II)

7

• Approaches– Sample-wise normalization: to make samples comparable to each

other

– Feature/variable-wise normalization: to make features more comparable in magnitude to each other.

• Sample-wise normalization– normalize to a constant sum

– normalize to a reference sample

– normalize to a reference feature (an internal standard)

– sample-specific normalization (dry weight or tissue volume)

• Feature-wise normalization (i.e., centering, scaling, and transformation)

Centering, scaling, transformation

8

Page 5: Statistical Analysis of Metabolomics Data 1 Statistical Analysis of Metabolomics Data Xiuxia Du Department of Bioinformatics & Genomics University of North Carolina at Charlotte

01/27/2014

5

Centering

• Converts all the concentrations to fluctuations around zero instead of around the mean of the metabolite concentrations

• Focuses on the fluctuating part of the data

• Is applied in combination with data scaling and transformation

9

Scaling

• Divide each variable by a factor

• Different variables have a different scaling factor

• Aim to adjust for the differences in fold differences between the different metabolites.

• Results in the inflation of small values

• Two subclasses– Uses a measure of the data dispersion

– Uses a size measure

10

Page 6: Statistical Analysis of Metabolomics Data 1 Statistical Analysis of Metabolomics Data Xiuxia Du Department of Bioinformatics & Genomics University of North Carolina at Charlotte

01/27/2014

6

Scaling: subclass 1

• Use data dispersion as a scaling factor– auto: use the standard deviation as the scaling factor. All the

metabolites have a standard deviation of one and therefore the data is analyzed on the basis of correlations instead of covariance.

– pareto: use the square root of the standard deviation as the scaling factor. Large fold changes are decreased more than small fold changes and thus large fold changes are less dominant compared to clean data.

– vast: use standard deviation and the coefficient of variation as scaling factors. This results in a higher importance for metabolites with a small relative sd.

– range: use (max-min) as scaling factors. Sensitive to outliers.

11

Scaling: subclass 2

• Use average as scaling factors– The resulting values are changes in percentages compared to the

mean concentration.

– The median can be used as a more robust alternative.

12

Page 7: Statistical Analysis of Metabolomics Data 1 Statistical Analysis of Metabolomics Data Xiuxia Du Department of Bioinformatics & Genomics University of North Carolina at Charlotte

01/27/2014

7

Transformation

• Log and power transformation

• Both reduce large values relatively more than the small values.

• Log transformation– pros: removal of heteroscedasticity

– cons: unable to deal with the value zero.

• Power transformation– pros: similar to log transformation

– cons: not able to make multiplicative effects additive

13

Centering, scaling, transformation

A. original dataB. centeringC. autoD. paretoE. rangeF. vastG. levelH. logI. power

14

Page 8: Statistical Analysis of Metabolomics Data 1 Statistical Analysis of Metabolomics Data Xiuxia Du Department of Bioinformatics & Genomics University of North Carolina at Charlotte

01/27/2014

8

Log transformation, again

• Hard to do useful statistical tests with a skewed distribution.

• A skewed distribution or exponentially decaying distribution can be transformed into a Gaussian distribution by applying a log transformation.

http://www.medcalc.org/manual/transforming_data_to_normality.php15

Univariate vs. multivariate analysis

• Univariate analysis examines each variable separately.– t-tests

– volcano plot

• Multivariate analysis considers two or more variables simultaneously and takes into account relationships between variables.– PCA: Principle Component Analysis

– PLS-DA: Partial Least Squares-Discriminant Analysis

• Univariate analyses are often first used to obtain an overview or rough ranking of potentially important features before applying more sophisticated multivariate analyses.

16

Page 9: Statistical Analysis of Metabolomics Data 1 Statistical Analysis of Metabolomics Data Xiuxia Du Department of Bioinformatics & Genomics University of North Carolina at Charlotte

01/27/2014

9

Univariate Statisticst-test

volcano plot

17

Univariate statistics

• A basic way of presenting univariate data is to create a frequency distribution of the individual cases.

Due to the Central Limit Theorem, many of these frequency distributions can be modeled as a normal/Gaussian distribution.

18

Page 10: Statistical Analysis of Metabolomics Data 1 Statistical Analysis of Metabolomics Data Xiuxia Du Department of Bioinformatics & Genomics University of North Carolina at Charlotte

01/27/2014

10

Gaussian distribution

φ μ,σ

2(

0.8

0.6

0.4

0.2

0.0

1

x

1.0

0 2 4

x)

0,μ=0,μ=0,μ=μ=

2 0.2,σ =2 1.0,σ =2 σ =2 σ =

• The total area underneath each density curve is equal to 1.

(x) 1

2e x

2

mean

variance 2

standard deviation

0.0

0.1

0.2

0.3

0.4

−2σ −1σ 1σ−3σ 3σµ 2σ

34.1% 34.1%

13.6%2.1%

13.6% 0.1%0.1%2.1%

https://en.wikipedia.org/wiki/Normal_distribution19

Sample statistics

https://en.wikipedia.org/wiki/Normal_distribution

Sample mean: X=1

nXi

i1

n

Sample variance: S2 1

n1Xi X 2

i1

n

Sample standard deviation: S S2

20

Page 11: Statistical Analysis of Metabolomics Data 1 Statistical Analysis of Metabolomics Data Xiuxia Du Department of Bioinformatics & Genomics University of North Carolina at Charlotte

01/27/2014

11

t-test (I)

Null hypothesis H0: 0

Alternative hypethesis H1: <0

Test statistic: t x 0

s n

Sample standard deviation: s 1

n1xi x 2

i1

n

The test statistic t follows a student’s t distribution. The distribution has n-1degrees of freedom.

• One-sample t-test: is the sample drawn from a known population?

21

t-test (II): p-value

When the null hypothesis is rejected, the result is said to be statistically significant.

22

Page 12: Statistical Analysis of Metabolomics Data 1 Statistical Analysis of Metabolomics Data Xiuxia Du Department of Bioinformatics & Genomics University of North Carolina at Charlotte

01/27/2014

12

t-test (III)

• Two-sample t-test: are the two populations different?

• The two samples should be

independent.

Null hypothesis H0: 1 2 0

Alternative hypethesis H1: 1 2 0

Test statistic: t x1 x2 12

s12

n1

s22

n2

23

t-test (IV)

Equivalent statements:

• The p-value is small.

• The difference between the two populations is unlikely to have occurred by chance, i.e. is statistically significant.

24

Page 13: Statistical Analysis of Metabolomics Data 1 Statistical Analysis of Metabolomics Data Xiuxia Du Department of Bioinformatics & Genomics University of North Carolina at Charlotte

01/27/2014

13

t-test (V)

• The p-value is big.• The difference between the

two populations are said NOT to be statistically significant.

25

t-test (VI)

• Paired t-test: what is the effect of a treatment?

• Measurements made on the same individuals before and after the treatment.

Example: Subjects participated in a study on the effectiveness of a certain diet on serum cholesterol levels.

Subject Before After Difference

1 201 200 -1

2 231 236 +5

3 221 216 -5

5 260 243 -17

6 228 224 -4

7 245 235 -10

H0 :d 0

Ha :d 0

Test statistic: t d dsd n

26

Page 14: Statistical Analysis of Metabolomics Data 1 Statistical Analysis of Metabolomics Data Xiuxia Du Department of Bioinformatics & Genomics University of North Carolina at Charlotte

01/27/2014

14

Volcano plot (I)

• Plot fold change vs. significance

• y-axis: negative log of the p-value

• x-axis: log of the fold change so that changes in both directions (up and down) appear equidistant from the center

• Two regions of interest: those points that are found towards the top of the plot that are far to either the left- or the right-hand side.

27

Volcano plot (II)

28

Page 15: Statistical Analysis of Metabolomics Data 1 Statistical Analysis of Metabolomics Data Xiuxia Du Department of Bioinformatics & Genomics University of North Carolina at Charlotte

01/27/2014

15

Multivariate statisticsPCA

PLS-DA

29

PCA (I)

• PCA is a statistical procedure to transform a set of correlated variables into a set of linearly uncorrelated variables.

30

Page 16: Statistical Analysis of Metabolomics Data 1 Statistical Analysis of Metabolomics Data Xiuxia Du Department of Bioinformatics & Genomics University of North Carolina at Charlotte

01/27/2014

16

PCA (II)

• The uncorrelated variables are ordered in such a way that the first one accounts for as much of the variability in the data as possible and each succeeding one has the highest variance possible in the remaining variables.

• These ordered uncorrelated variables are called principle components.

• By discarding low-variance variables, PCA helps us reduce data dimension and visualize the data.

31

PCA (III)

32

• The transformation matrix

• The transformation

• are called the 1st, 2nd, and nth principle components, respectively.p1, p2,, pn

Page 17: Statistical Analysis of Metabolomics Data 1 Statistical Analysis of Metabolomics Data Xiuxia Du Department of Bioinformatics & Genomics University of North Carolina at Charlotte

01/27/2014

17

PCA (IV)

33

• Each original sample is represented by an n-dimensional vector:

• After the transformation– If all of the principle components are kept, then each sample is still

represented by an n-dimensional vector:

– If only m < n principle components are kept, then each sample will be represented by an m-dimensional vector:

Sbefore transformation x1, x2,, xn

Safter transformation y1, y2,, yn

Safter transformation y1, y2,, ym

PCA (V)

34

• y are called scores.

• For visualization purpose, m is usually chosen to be 2 or 3.

• As a result, each sample will be represented by a 2- or 3-dimenational point in the score plot.

● ●

• 30 • 20 • 10 0 10

•10

010

2030

Principal Component Analysis (Scores)

PC1

PC

2

wt15

wt16

wt18

wt19 wt21

wt22

ko15

ko16

ko18

ko19

ko21ko22

Page 18: Statistical Analysis of Metabolomics Data 1 Statistical Analysis of Metabolomics Data Xiuxia Du Department of Bioinformatics & Genomics University of North Carolina at Charlotte

01/27/2014

18

PCA (VI)

35

• Loadings

• are denoted as points in the loadings plotp11, p12 , p21, p22 ,, pn1, pn2

PCA (VII)

36• 0.06 • 0.04 • 0.02 0.00 0.02 0.04

•0.0

6•0

.04

•0.

020.

000.

020.

040.

06

Principal Component Analysis (Loadings)

PC1

PC

2

200.1/2924

200.9/3473202.1/3497

204.9/2590

205/2788 206/2789

207/2744

207.1/2717

213.2/3497

214.1/2519

215.2/3478

215.9/3506

215.9/3170

218.2/3366

219.1/2523

225/3232

225.1/3265

225.1/3151

225.1/3668

226/2962

229.1/3424

231.1/2849

231.1/2528

233/3026

233.1/3025

235/3130

235.1/2716

236.1/2522

236.9/3632

239.1/3252

241.1/3646

241.2/3641

243.1/3542 243.2/3651

244/4103

244.1/2830

244.9/3130

245.1/2882

245.9/3132

246.4/3287

246.7/3295

246.8/3281

247/3490

247.1/2648

247.9/2979

248.2/3240

249.1/3631250.1/3639

250.1/3674

250.2/3627

250.2/4059

251.1/3880

251.2/3882

254.1/3226255.1/3668

256.1/3444256.1/3467

256.1/3652

256.2/3454

256.2/3636256.2/3662

257.1/3660

257.1/3640

257.2/3647

257.9/3425

258/3497

258/3232

259.1/2802

261.1/3148

264.2/3158

265.1/2620265.2/3659

265.2/4114 266.1/3624

266.2/3634

267.1/2783

267.2/3628268.1/3868

268.2/3888

269.1/3886

269.2/3888

269.2/4129

270.1/3891

270.2/3880

270.5/3166

271.1/3879

271.2/3404

271.3/3264

272.1/3123

272.1/2727

273.1/3425

273.1/3655

273.2/3424

273.9/2640

274.2/3431

277/2638

277.2/3845

279/2788

279/2755

280/2788

280.1/3315

280.2/3319

281/2793

281.1/2788

281.2/3672

282.1/2616

282.1/3468

282.2/3471

283.1/3636

283.2/3883

283.2/3640

283.2/3675 283.2/3134

284.1/2716

284.1/3634

284.2/3884284.2/3627

284.2/3675

285.1/3854

285.1/4129285.1/2716

285.2/3861

285.2/4127

285.2/3637

286/2620

286.1/3263

286.1/3283

286.1/3851

286.2/3256

286.2/2855

287/2647

287.1/3256

287.1/2635

287.2/3262

288.1/2794

288.2/2795

288.2/2922

289/2952

289.1/3118

289.1/2718

289.2/3125

291.1/3629

292.1/3492

292.2/3217

292.2/3459

294.1/2577

296.1/2727

296.9/4127

297.1/3848

297.2/3854

297.3/3381

298.1/3182

298.1/3671

298.2/3187

299.2/4126

300.2/3390

301/2787

301.1/3891

301.1/3384

301.1/3681

301.2/3391

301.2/3886

301.2/3656

302/2621

302/2789

302.1/3253

302.1/3395

302.1/2818

302.2/3389

302.3/3901

303/2622

303.1/3238

303.2/3388

304.1/2924304.1/3236

304.2/3471

305.1/3506

305.1/2926

305.1/2877

305.2/3290

306.1/3506306.1/2929

306.9/2617

307/2621

307.1/3141

307.1/3503

307.1/3864

308/2623

308.1/3257

308.2/3261

309.1/2923

309.1/3649

309.2/3633

310.1/3463

310.1/2923

310.1/2798

310.2/3635310.2/3468

311.1/4131

311.1/3080

311.1/3625

311.2/3875

312.1/3000

312.2/3877312.2/4134

313/2786

313.1/4137

313.1/3290

313.1/3506313.1/2824

313.2/3290

313.2/4137

314/2786

314.1/3289

314.1/4141

314.1/2767

314.1/3857

314.2/3289

315/2509

315/2782

315/2536

315/3270

315.1/2773

315.1/3653

316.1/3087

316.1/3653

316.2/3089

317/2788

317.2/3492

317.2/3092

317.2/3844

317.3/3942

318.1/3246

319.2/3245

321.1/3641

321.1/3673

322.1/3507

322.1/3394

322.1/2899

322.2/3504

322.2/3635322.2/3149

323.1/3503

323.1/3391

323.1/3896

323.2/3511323.2/3496

323.2/3392

323.2/3883

324/3268

324.1/2789

324.1/3509

324.2/3507

324.2/3277

325/3255

325.2/3649

325.2/3238

326.1/3300

326.1/3090

326.2/3420

326.4/4114

327.1/3505

327.1/2927

327.2/3420

327.2/3851

328.1/3508

328.2/3631

328.2/3607

328.2/3376

328.2/3500

329/2921

329/3150

329.1/3499

329.2/3880

329.2/3489

329.2/3611

330.1/3503

330.2/3491

330.2/3166

330.2/3630

330.2/3606

331.1/3498

331.1/3459

331.2/3497

331.2/3453

332/2509

332/2537

332.1/3496

333/2514

333.1/3121

333.1/2548

334/2531334.2/2811

335/2785

335/3278

336/2785

336.1/3533

336.2/3192

336.2/2995

337/2537

337/2509

337.1/3650

337.2/3438

338/2511

338/2536338.1/3022

338.1/2789

338.2/3829

338.2/3630

339.1/4124

339.2/3319

339.2/4124

340.1/3328

340.1/3307

340.2/3317

340.2/2758

341.2/3556

343/2684

343/2597343/2662

343.1/3510

343.1/3650

344/2684

344.2/3350344.2/3152

344.2/2940

345/2684

345/3238

345.1/2692

345.2/3388

345.2/3852

345.3/3386

346.1/3499

346.1/3275

346.2/3493

346.9/2782

347.1/3499

347.2/3494

347.2/3669

348.1/3420348.1/3288

348.2/3499

348.2/3430

348.2/3285349.1/3666

349.1/3421

349.1/3633

349.1/3290

349.2/3278

349.2/3665

349.2/3407

350/3385

350.1/3630

350.2/3626

351.1/3499

351.1/3028

351.2/3503

351.2/3485

351.2/3615

351.9/2564

352.1/3498

352.1/2789

352.2/3490

353/2528

353.1/3498353.2/3495

354/2681

354.1/2782

354.1/3671

354.2/3599

355.2/3104

355.2/3591

355.2/3567

356.1/3869

356.2/3830

357.1/3469

357.2/3474

357.2/3833

358/2783

358/3958

358.1/2778

358.2/3832

359/3507

359/3225359.1/3514

359.2/3720

360/2692

360.3/3397

360.8/2915361/2684

361.1/3167

361.1/2693

361.2/3167

361.9/2922

362/2690

362/2664

362.1/3165

362.1/2685

362.2/3156

362.2/3325

363.1/3259

363.1/3481

363.2/3263364.1/3264

364.2/3262364.2/2582

364.2/3655

364.2/3464

364.9/2553

365/2684

365/2595

365/2659

365/3484

365.1/3388

365.2/3404

365.2/3385

366/2684

366/2662

366.1/2788

366.1/3395

366.2/3842367/2666

367.2/3528

367.2/4132367.2/3584

368.1/2792

368.2/3529

368.2/2786

368.2/4133

368.3/3863 369.2/4113370.2/3044

370.2/3918

371.2/3073

371.2/3605

372.1/3626

372.1/3291

375/3071

376.1/3450

376.2/3459

376.2/3722

377/3074

377/3509

377.1/3077

377.2/2997

377.2/3458

378.2/3831

379.1/3327

379.1/3476

379.1/3505 379.2/3336

379.2/3307

380.1/3154

380.1/3203

380.1/3329

380.2/3333

380.2/2971

380.2/3660

381/2686

381/2663

381.1/3154

381.1/3194

381.2/3194

382/2677

382.1/3246

382.2/3788

383.2/3241383.2/4087

384.2/3999

384.3/4003

385.1/3181

385.2/4134

385.2/3174385.2/3449

386/2535

386.1/3177

386.2/3179

386.2/4133

386.3/3577

387.2/3243

388/3959

388.1/2685

388.1/2663

388.1/3651

389/3886389.1/3646

389.2/3349

390.1/3376

390.2/3353 390.3/3050

391.1/3595

391.1/3632

391.1/3611

391.2/3490391.2/3864

391.2/3610

392.1/3604

392.1/3472

392.1/4127

392.1/3495

392.1/3631

392.2/2929

392.2/3616

394.2/3077

395/4121

395.1/3558

395.2/3719

395.2/3674

395.2/3570

396.1/3348

396.1/3560

396.2/3326

396.2/2868

396.2/3862

397/2780

397.1/3315

397.2/3326397.2/3877

397.2/2872

398.3/4059

398.3/3636

399/2530

399.1/2799

400.1/2792

401.1/3334

401.2/2862

402.1/2688

402.1/3337

402.1/2672

402.2/3325

403.1/2950

403.1/3329

403.2/3889

404/2707

404.1/2686

404.3/3900

405/2686

405/3251

405.2/3391

405.3/3940

407.2/3501

408.1/2894

408.1/3494

408.2/2576

410.2/3938

410.3/3944

411.2/3944

411.3/3933

412.2/3944

412.2/3369

412.3/4118

413.1/3634

413.3/4116

414.1/3624

414.2/3062

416.1/2686

416.2/2915

417.1/2683419.1/3062

419.1/3505

419.2/3854

419.2/3062

419.2/3807

419.3/3531

420.1/3786

420.2/2722

421.1/2798

421.1/3265

421.3/3867

422.2/2887

422.3/2572

423.1/3254

423.2/3272 423.2/3253

424.2/2952

424.2/4005

424.2/3155

424.3/4002

425.1/2950

425.2/2955

426.1/3017

426.2/3060

426.9/2692

427/2687

427/2707

427.1/3015

427.3/3269

428.3/3634

429.2/3888

429.2/4080

429.2/3155

429.2/2951

429.2/4031

429.2/4054

429.4/3464

430.1/2686

430.2/3885

430.2/4073

431.1/2685

431.1/3910

431.2/2880

431.2/3884

432.1/2683 432.2/2719

433/3506

433.1/2682433.2/3810

433.2/2703

434.1/3792

436/3126

436.2/2954

436.2/3631

436.2/3596

436.3/2950

437.1/2789

438.1/3455

438.2/3461

438.2/4074

438.3/4058

438.3/3548

439.1/3456

439.2/3459

439.2/3481

439.3/4054

440.1/3150

440.2/3471

440.2/2854

440.2/3160

440.2/4060

440.3/4059

440.3/3089

441.2/3604

441.3/3495

441.3/3050

442.9/2691

443/2684

444/2666

444.1/2686

444.2/2897

444.2/3883444.3/3867

445/2684

445.1/2683

447.2/3877

447.2/3859

448.1/3016

448.1/3549

448.2/3862

448.2/3550

449.1/2895

449.1/3290

449.1/2743

449.1/3534

449.2/2739

450.1/3879

452.1/3076

452.1/2550

452.2/2547

452.2/3146

453.2/3074

453.2/2959

454.1/3291

454.1/3997

454.2/3286

454.2/3088

455.1/3290

455.2/3288

456.1/3288

456.2/3289

457.1/3292

457.2/4112

457.2/3293

458.2/3046

458.2/2941

458.3/2945

458.9/2940

459.1/4137

459.2/3047

459.3/3866

460.1/3452

460.2/3450 460.4/3550

461.1/3542

461.2/3912

461.3/3933

462.1/3094

462.2/3273

462.3/4433

463.2/3046

463.3/3826

464.1/3047

464.2/3468

464.2/3044

465.2/3466

466.1/3203

466.2/3470466.2/3658

466.2/3200

466.2/3695

467.2/3661

467.2/3699

467.3/3460

468.2/3139

468.2/2940468.2/3718

468.3/2941

469.2/3864

471.3/4112

471.9/2941

472/2954

472.2/3121

473.2/2935

473.2/3290

473.2/3138

473.9/3293

474.1/3075

474.4/3722

475.2/3089475.2/2563

475.2/2536

475.2/2787

475.2/2725

475.2/2511

475.2/2689475.2/2665

475.2/2617

475.2/3058475.2/2830

476.1/3290 476.2/2678

477.1/3288

477.2/3293

478/3033

478.1/3162

478.2/3603

478.2/3639

478.4/3022

479.1/3171

479.2/3599

479.3/3630

480.1/3318480.1/3336

480.2/3317

480.2/3099

480.3/4131

481.1/3318

481.2/3314

482.2/3318

482.2/3598

482.3/3201

483.1/3538

483.1/3054

483.2/3598

483.2/3553

484.1/3249

484.1/3588

484.2/3599

484.2/2845

485.1/4136

485.1/3607

485.2/2847

485.2/3041

485.2/3609

485.3/4134

485.3/3057

486/3425

486.1/3464

486.2/2799

486.2/3476

486.2/3457

486.3/4121

487.2/2799

488.2/2882

490.3/3253

491.3/3840491.3/3539

492.2/3643

492.3/3873

492.3/3633

493.1/2882

493.2/3667

494.2/3428

494.2/3120

494.2/3165

495.2/3430495.3/2930

495.3/3417

496.2/3403496.2/3337

496.2/3442

496.2/3363

497.2/3337

497.2/3407

497.2/3363498.2/3444

498.2/3403

498.4/4051

499.3/4112 499.4/4120

500.1/3045 500.2/2798

500.2/3040501.1/3046

501.2/2797

501.2/3493

502.1/3166

502.1/3189

502.1/4192

502.2/3032

502.2/3171

503.1/3166

503.2/3167

504.1/3260

504.1/3164

504.1/3598

504.2/3260

504.2/3031

505.1/3262

505.2/3264

505.4/3699

506.1/3261

506.2/3391

506.2/3258

506.3/3028

506.3/3310507.1/3017

507.2/3030

507.2/3390

508.1/3367

508.2/3528

508.2/3373

508.3/3839

509.1/3030509.2/3528

509.3/3835

510.1/3471

510.2/3527

510.2/3761511.2/3763

512/2971

512.1/3368

512.2/2923

512.2/3761512.2/3127

512.2/3358

512.3/2926

512.3/3127

512.4/4066

513.1/3756

513.2/3365

513.2/2913

513.3/2928

513.3/3522 513.4/3526

515.2/2869

515.9/2929

516.2/3262

517.1/2909

517.1/2937

517.2/2925

517.9/2512

518.2/3409

518.2/3436

518.2/3112

518.3/3522519/3727

520.2/3212

520.2/3268

520.2/3284

520.3/4135

520.4/4140

521.2/3212

521.3/3163

521.3/4137521.4/4138

522.2/3363

522.2/3426

523.1/3037

523.2/3362

523.2/3426

523.2/3459

523.2/3389

523.3/4145

523.4/4140524.1/3157

524.1/2884

524.2/3696

524.2/3624

524.3/2927

525.2/3697

525.2/3162

526.1/3177

526.2/3693

526.2/3180527.1/3178

527.2/3191

527.2/3701

527.3/3532

528.1/3241

528.1/3179

528.2/2837

528.2/3239

528.2/3168

528.3/2835

529.1/3242

529.2/3240

529.2/3170

529.4/2928

530.2/3352

530.2/3541

530.2/3744

531.2/3350

531.2/3514

532.1/3517

532.2/3489

532.2/2870

532.2/3353

532.2/3458

533/2886

533.1/3351

533.2/3488

533.2/2870

533.2/2681

533.2/3459533.2/3343

533.2/3898

533.3/3892

533.3/3171

533.4/4086

534.1/3455

534.2/3583

534.3/3864

534.3/4115

534.3/2667

535.2/3583

535.3/3117

535.3/3867

535.3/3518 535.3/3832 536.1/3940536.2/3565

536.2/3719

536.2/3673

536.3/3419

537/3930

537.2/3564

537.2/3672

537.2/2870

537.4/3174

537.4/4125

538/3520

538.2/3975

538.3/4134

538.3/4098

538.4/4138

538.9/2509539.2/2908

539.3/4135

539.4/4141

540.4/4132

540.4/3698

540.5/4170

541.4/4132

542.3/3209

543.4/3699 544/3714

544.1/3444

544.1/3370

544.2/3207

544.2/3429

544.2/3389

545.2/3201

545.2/3381

545.3/4120

545.5/2646

546.2/3205

546.2/3018

546.2/3701

546.3/4137

546.3/3016

546.3/3657

546.4/4141

547.2/3021

547.2/3683547.2/3206

547.3/3019

547.3/4139

547.4/4141

548.1/3187

548.1/3425

548.2/3020548.3/3015

548.3/4235

548.4/4232

549/3522

549/3262

549.1/3514

549.1/3186

549.3/4128

549.3/3428

550.1/3237550.2/3665

550.2/3604

550.2/3245

550.3/4130

550.3/3600

550.3/3685

550.4/4134

550.4/4115

551.2/3665

551.2/3013

551.2/3032

551.2/3603

551.3/3601

551.3/3025

552.1/3353

552.1/3369

552.2/3350

552.3/3819

553/2545553.1/3357553.1/4164

553.2/3348

553.2/4057553.3/3817

554.1/3215

554.1/2526

554.2/3339

554.2/3215

554.2/3490

556.2/3111

556.2/3444

556.2/3592

556.3/2914

556.3/3118557/3161

557.1/2574

557.2/2916

557.2/3115

557.3/2914

557.3/3121

558/3514

558.1/3526

558.1/3396

558.2/3535

558.2/3392

559.1/3391

559.1/3520

559.2/3393

559.2/3536

560.2/3626

560.2/3096

560.3/3089

561.2/2916

561.3/3635

561.3/2910

561.4/3504

561.9/2539

562.1/3514

562.2/3426

562.2/3740

563/3515

563.4/4133

564.2/3883

564.2/3452

564.2/3481

564.2/3621

564.3/3049

564.3/3511

564.3/3886

564.3/2977

564.4/4140

565.2/3452

565.2/3881

565.3/3479

565.4/4140

565.4/3903

565.4/2649566.2/3573

566.2/3204

566.3/4130

566.3/3595566.4/4231

567.1/3030

567.2/3573

567.2/3016

567.4/4231

567.5/4234

568.2/3207

568.2/3224

568.3/3617

568.3/2910

568.4/2913

568.4/4231

568.8/2627

569.1/3295569.2/3212

569.3/3616

569.3/3594

569.8/2919

570.1/3212570.3/4124

570.4/3509

570.5/3514

571/3508

571.4/4130

571.4/3715

571.4/3677

571.6/2509

571.8/2913

572.1/3676

572.2/3390

572.2/3663

572.2/3417

572.2/3844

572.3/2827

572.4/3932

572.8/2915

573.2/3004

573.2/3420

573.3/3608

573.3/3391

573.8/2921

574.2/4044

574.3/4051

574.8/2915

575.3/4115

575.4/4130

576/3212

576.2/2858

576.2/3663 576.3/2859

576.3/4112

576.4/4123

576.4/3863

577.2/2861

577.3/4136

577.4/4208

578.3/3941

579.1/2789

579.2/2788

579.3/3844

579.4/3510

579.5/3513

580.1/2789

580.2/3574

580.3/3314

580.5/2523

581.1/2789

581.2/2789

581.2/3575

581.2/2861

582.3/3485

582.3/4122

582.4/3827

583.4/4097584.1/3394

584.2/3507

584.3/3371

584.4/4137

585/3503

585.3/3087

586.1/3285

586.2/3279

586.3/3301

587.2/3274

588/3273

588.2/2794

588.3/2789

589.2/2797

589.4/4134

589.5/4402

590.1/3369

590.1/3453

590.2/3214

590.2/3467

590.2/3007

590.3/3005

591.2/3220

591.2/2982

591.3/3005

591.5/4105

592.2/3629

592.2/2998

592.3/3006

592.4/4130

593/3501

593.2/3274

593.3/3260

593.3/4128

593.4/3705

593.4/3672

594/2618

594.1/3671

594.2/3376

594.2/3408

594.3/3619

594.3/3602

594.4/2537

595/2627

595.2/3004

595.3/3016

596/3029

596.2/3008

596.2/3844

596.2/4052596.3/3807

596.4/3803

597.2/4057

597.2/4128597.3/3812

597.3/3000

597.3/3789

597.4/3795

598.3/3704

599.3/4126

599.4/4131

Loadings plot for all variables

Page 19: Statistical Analysis of Metabolomics Data 1 Statistical Analysis of Metabolomics Data Xiuxia Du Department of Bioinformatics & Genomics University of North Carolina at Charlotte

01/27/2014

19

PCA (VIII)

37• 0.04 • 0.02 0.00 0.02 0.04

•0.

06•

0.04

•0.

020.

000.

020.

040.

06

Principal Component Analysis (Loadings) Top 25

PC1

PC

2

●● ●

●●

●●

●●

356.2/3830357.2/3833396.2/3862

349.1/3290301.2/3391

324/3268548.1/3425

300.2/3390

358.2/3832

412.2/3944382.1/3246322.1/3394

523.2/3459

359/3507

460.1/3452

302.2/3389

472/2954533.2/2870

410.2/3938

466.2/3658

323.1/3391

532.2/3458

556.2/3444

362.1/3165

392.1/3631

356.2/3830357.2/3833396.2/3862349.1/3290301.2/3391324/3268548.1/3425300.2/3390358.2/3832412.2/3944382.1/3246322.1/3394523.2/3459359/3507460.1/3452302.2/3389472/2954533.2/2870410.2/3938466.2/3658323.1/3391532.2/3458556.2/3444362.1/3165392.1/3631

Loadings plot for the top 25 varaibles

PCA (IX)

38

Scree plot: variance vs. principle component number

Page 20: Statistical Analysis of Metabolomics Data 1 Statistical Analysis of Metabolomics Data Xiuxia Du Department of Bioinformatics & Genomics University of North Carolina at Charlotte

01/27/2014

20

PLS-DA (I)

39

• A supervised method to find a predictive model that describes the direction of maximum covariance between a dataset (X) and the class membership (Y)

• Similar to PCA, the original variables are summarized into much fewer new variables using their weighted averages.

• The new variables are called scores.

• The weighting profiles are called loadings.

• PLS-DA can perform both classification and feature selection.

• Feature importance measure: VIP (Variable Importance in Projection)

PLS-DA (II)

40

• Interpretation of the model– R2X and R2Y

• fraction of the variance that the model explains in the independent (X) and dependent variables (Y)

• Range: 0-1

– Q2Y

• measure of the predictive accuracy of the model

• usually estimated by cross validation or permutation testing

• Range: 0-1

• > 0.5 is considered good while > 0.9 is outstanding

Page 21: Statistical Analysis of Metabolomics Data 1 Statistical Analysis of Metabolomics Data Xiuxia Du Department of Bioinformatics & Genomics University of North Carolina at Charlotte

01/27/2014

21

PLS-DA (III)

41

• Note of caution– Supervised classification methods are powerful.

– BUT, they can overfit your data, severely.

Machine LearningClustering

Classification

42

Page 22: Statistical Analysis of Metabolomics Data 1 Statistical Analysis of Metabolomics Data Xiuxia Du Department of Bioinformatics & Genomics University of North Carolina at Charlotte

01/27/2014

22

Clustering

• Group similar objects together

• Any clustering method requires– A method to measure similarity/dissimilarity between objects

– A threshold to decide whether an object belongs to a cluster

– A way to measure the distance between two clusters

• Common clustering algorithms– K-means

– Hierarchical

– Self-organizing map

• Unsupervised machine learning techniques

43

Hierarchical clustering (I)

1. Find the two closest objects and merge them into a cluster

2. Find and merge the next two closest objects (or an object and a cluster, or two clusters)

3. Repeat step 2 until all objects have been clustered

44

Page 23: Statistical Analysis of Metabolomics Data 1 Statistical Analysis of Metabolomics Data Xiuxia Du Department of Bioinformatics & Genomics University of North Carolina at Charlotte

01/27/2014

23

Hierarchical clustering (II)

• Methods to measure similarity between objects– Euclidean, Manhattan

– Pearson correlation

– Cosine similarity

• Linkage: ways to measure the distance between two clusters

45

single complete centroid average

Hierarchical clustering (III)

46

Page 24: Statistical Analysis of Metabolomics Data 1 Statistical Analysis of Metabolomics Data Xiuxia Du Department of Bioinformatics & Genomics University of North Carolina at Charlotte

01/27/2014

24

Classification

• Use a training set of correctly-identified observations to build a predictive model

• Predict to which of a set of categories a new observation belongs

• Supervised machine learning

• Methods– Linear discriminant analysis

– Support vector machine (SVM)

– Artificial neural network (ANN)– k-nearest neighbor

– Random forest

– PLS-DA

47

Software PackagesMetaboAnalyst

XCMS

48

Page 25: Statistical Analysis of Metabolomics Data 1 Statistical Analysis of Metabolomics Data Xiuxia Du Department of Bioinformatics & Genomics University of North Carolina at Charlotte

01/27/2014

25

MetaboAnalyst 2.0

• This is an online set of statistical tools

– http://www.metaboanalyst.ca/

• First, create an Excel file for each sample containing three columns, the m/z of each metabolite, its retention time and the peak intensity

• Put the files for a group in a folder.

• Then zip the two folders 

• On the MetaboAnalyst website, click on “Welcome, click here to start”

Structure of Excel file for submission to MetaboAnalyst

Page 26: Statistical Analysis of Metabolomics Data 1 Statistical Analysis of Metabolomics Data Xiuxia Du Department of Bioinformatics & Genomics University of North Carolina at Charlotte

01/27/2014

26

MetaboAnalyst data entry

Page 27: Statistical Analysis of Metabolomics Data 1 Statistical Analysis of Metabolomics Data Xiuxia Du Department of Bioinformatics & Genomics University of North Carolina at Charlotte

01/27/2014

27

Initial parameter setting

Initial evaluation

Page 28: Statistical Analysis of Metabolomics Data 1 Statistical Analysis of Metabolomics Data Xiuxia Du Department of Bioinformatics & Genomics University of North Carolina at Charlotte

01/27/2014

28

More assessment of the data

Data filtering

Page 29: Statistical Analysis of Metabolomics Data 1 Statistical Analysis of Metabolomics Data Xiuxia Du Department of Bioinformatics & Genomics University of North Carolina at Charlotte

01/27/2014

29

Data transformation and scaling

Page 30: Statistical Analysis of Metabolomics Data 1 Statistical Analysis of Metabolomics Data Xiuxia Du Department of Bioinformatics & Genomics University of North Carolina at Charlotte

01/27/2014

30

Two‐fold change cutoff

T‐test cutoff, p <0.05

Page 31: Statistical Analysis of Metabolomics Data 1 Statistical Analysis of Metabolomics Data Xiuxia Du Department of Bioinformatics & Genomics University of North Carolina at Charlotte

01/27/2014

31

Volcano plot

Correlation matrix

Page 32: Statistical Analysis of Metabolomics Data 1 Statistical Analysis of Metabolomics Data Xiuxia Du Department of Bioinformatics & Genomics University of North Carolina at Charlotte

01/27/2014

32

Peaks differentiating Ctrl/OVEX

PCA 2D‐plot

Page 33: Statistical Analysis of Metabolomics Data 1 Statistical Analysis of Metabolomics Data Xiuxia Du Department of Bioinformatics & Genomics University of North Carolina at Charlotte

01/27/2014

33

3D PCA plot

PLSDA 2D‐plot

Page 34: Statistical Analysis of Metabolomics Data 1 Statistical Analysis of Metabolomics Data Xiuxia Du Department of Bioinformatics & Genomics University of North Carolina at Charlotte

01/27/2014

34

3D‐PLSDA plot

Results from XCMS Online

68

Page 35: Statistical Analysis of Metabolomics Data 1 Statistical Analysis of Metabolomics Data Xiuxia Du Department of Bioinformatics & Genomics University of North Carolina at Charlotte

01/27/2014

35

69

For in-depth statistical analysis and data interpretation, please make an appointment with a biostatistician.

70

Thank you!