18
Data preprocessing ผศ.ดร. ศิลา กิตติวัชนะ และคณะนักศึกษา ภาควิชาเคมี คณะวิทยาศาสตร์ มหาวิทยาลัยเชียงใหม่ E-mail: [email protected] Tel: 087-9166692

Data preprocessing - Postharvest Technology Innovation Center · Data preprocessing •Data pre-processing is to apply a mathematical modification to the values of a matrix prior

  • Upload
    others

  • View
    5

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Data preprocessing - Postharvest Technology Innovation Center · Data preprocessing •Data pre-processing is to apply a mathematical modification to the values of a matrix prior

Data preprocessing

1

ผศ.ดร. ศิลา กิตติวัชนะ และคณะนักศึกษาภาควิชาเคมี คณะวิทยาศาสตร์ มหาวิทยาลัยเชียงใหม่

E-mail: [email protected]: 087-9166692

Page 2: Data preprocessing - Postharvest Technology Innovation Center · Data preprocessing •Data pre-processing is to apply a mathematical modification to the values of a matrix prior

The problems are….

• Data are from different types/scales

• Variation during experiments

• Errors: Systematic and non-systementic errors

• Noise: Homoscedastic and heteroscedastic noise

• Characteristics of the acquisition methods such as light scattering in NIR

• …………………etc

2

Page 3: Data preprocessing - Postharvest Technology Innovation Center · Data preprocessing •Data pre-processing is to apply a mathematical modification to the values of a matrix prior

Data preprocessing

• Data pre-processing is to apply a mathematical modification to the values of a matrix prior to formal data analysis methods.

• A raw data may be shifted/rescaled along the variables/samples to place emphasis on the aspects wanted from the data analysis or to improve the interpretation of the data.

• This is to ensure that the right trends are being studied, and the analysis methods do not get confused by non-essential information.

3

Page 4: Data preprocessing - Postharvest Technology Innovation Center · Data preprocessing •Data pre-processing is to apply a mathematical modification to the values of a matrix prior

Raw dataUV-Vis data

0 5 10 15 20 25 30 35

2.6

2.8

3

3.2

3.4

3.6

3.8

4

4.2

Wavelength

Ab

so

rban

ce

0 5 10 15 20 25 30 350

2000

4000

6000

8000

10000

12000

Wavelength

Ab

so

rban

ce

0 5 10 15 20 25 30-3000

-2000

-1000

0

1000

2000

3000

4000

Wavelength

Ab

so

rban

ce

0 5 10 15 20 25 30 350

0.01

0.02

0.03

0.04

0.05

0.06

0.07

Wavelength

Ab

so

rban

ce

0 5 10 15 20 25 30 35-4

-3

-2

-1

0

1

2

3

4

5

Wavelength

Ab

so

rban

ce

4

Page 5: Data preprocessing - Postharvest Technology Innovation Center · Data preprocessing •Data pre-processing is to apply a mathematical modification to the values of a matrix prior

Raw data

400 600 800 1000 1200 1400 1600 1800 2000 2200 24000

0.2

0.4

0.6

0.8

1

1.2

1.4

1.6

1.8

Wevelength (nm)

Ab

so

rban

ce

KDML 105

WR

PT 1

400 600 800 1000 1200 1400 1600 1800 2000 2200 2400

-3

-2

-1

0

1

2

3

Wevelength (nm)

Ab

so

rban

ce

KDML 105

WR

PT 1

400 600 800 1000 1200 1400 1600 1800 2000 2200 24000

0.2

0.4

0.6

0.8

1

1.2

1.4

1.6

1.8x 10

-3

Wevelength (nm)

Ab

so

rban

ce

KDML 105

WR

PT 1

400 600 800 1000 1200 1400 1600 1800 2000 2200 2400-1.5

-1

-0.5

0

0.5

1

1.5

Wevelength (nm)

Ab

so

rban

ce

KDML 105

WR

PT 1

400 600 800 1000 1200 1400 1600 1800 2000 2200 24000

0.2

0.4

0.6

0.8

1

1.2

1.4

1.6

1.8

Wevelength (nm)

Ab

so

rban

ce

KDML 105

WR

PT 1

Wavelength (nm)

Wavelength (nm)

Wavelength (nm)

Wavelength (nm)Wavelength (nm)

NIR data

5

Page 6: Data preprocessing - Postharvest Technology Innovation Center · Data preprocessing •Data pre-processing is to apply a mathematical modification to the values of a matrix prior

Data preprocessing1. Data smoothing

– Savitzky–Golay (SG) filters– Moving averages

2. Data scaling– Square root scaling– Log scaling

3. Normalization– Row scaling

4. Auto scaling– Mean centring– Standardization

5. Others– SNV– MSC– Derivatives

6

Page 7: Data preprocessing - Postharvest Technology Innovation Center · Data preprocessing •Data pre-processing is to apply a mathematical modification to the values of a matrix prior

710 720 730 740 750 760 770 780-0.1

-0.05

0

0.05

0.1

Wavelength (nm)

Absorb

ance

Raw spectra

Savitzky–Golay (SG) filters (5)

Savitzky–Golay (SG) filters (15)

Moving averages (3)

Moving averages (5)

600 610 620 630 640 650 660 670 680 690

0

0.2

0.4

0.6

0.8

1

1.2

Wavelength (nm)

Absorb

ance

Raw spectra

Savitzky–Golay (SG) filters (5)

Savitzky–Golay (SG) filters (15)

Moving averages (3)

Moving averages (7)

200 300 400 500 600 700 800 900 1000 1100-0.5

0

0.5

1

1.5

2

2.5

3

3.5

4

Wavelength (nm)

Absorb

ance

Raw spectra

Savitzky–Golay (SG) filters (5)

Savitzky–Golay (SG) filters (15)

Moving averages (3)

Moving averages (7)

1. Data Smoothing

– Savitzky–Golay (SG) filters

– Moving averages

Wavelength (nm)

Wavelength (nm)

Ab

sorb

ance

Ab

sorb

ance

Wavelength (nm)

Ab

sorb

ance

Page 8: Data preprocessing - Postharvest Technology Innovation Center · Data preprocessing •Data pre-processing is to apply a mathematical modification to the values of a matrix prior

2. Data scaling

– Square root scaling

– Log scaling

0 1 2 3 4 5 6 70

1

2

3

4

5

6

7

Variable

Siz

e

0 1 2 3 4 5 6 70

5

10

15

20

25

30

35

Variable

Siz

e

0 1 2 3 4 5 6 70

1

2

3

4

5

6

7

8

9

10

Variable

Siz

e

Raw data (X)

𝑋𝑠𝑞𝑟𝑡.5 = 𝑋

0 1 2 3 4 5 6 70

100

200

300

400

500

600

700

800

900

1000

Variable

Siz

e

𝑋𝑠𝑞𝑟𝑡.33 = 1 3𝑋 𝑋𝑙𝑜𝑔 = log𝑋

𝑿𝑠𝑞𝑟𝑡 = 𝑿

𝑿𝑙𝑜𝑔 = log𝑿

8

Page 9: Data preprocessing - Postharvest Technology Innovation Center · Data preprocessing •Data pre-processing is to apply a mathematical modification to the values of a matrix prior

3. Normalization

– Row scaling

Raw data Row scaling

0 5 10 15 20 25 30 350

2000

4000

6000

8000

10000

12000

Wavelength

Ab

so

rban

ce

0 5 10 15 20 25 30 350

0.01

0.02

0.03

0.04

0.05

0.06

0.07

Wavelength

Ab

so

rban

ce

𝑟𝑠𝑥𝑖𝑗 =𝑥𝑖𝑗

𝐽=1𝐽

𝑥𝑖𝑗

𝑟𝑠𝑥𝑖𝑗 = normalized value of 𝑥𝑖𝑗𝐽 = number of parameter

9

Page 10: Data preprocessing - Postharvest Technology Innovation Center · Data preprocessing •Data pre-processing is to apply a mathematical modification to the values of a matrix prior

4. Auto scaling

– Mean centring

Raw data

1 5 10 50 100

2 6 20 60 200

3 7 30 70 300

4 8 40 80 400

5 9 50 90 500

1 2 3 4 5-250

-200

-150

-100

-50

0

50

100

150

200

250

Variables

siz

e

Mean centring

1 2 3 4 50

100

200

300

400

500

600

Variables

siz

e

None preprocessing

𝑐𝑒𝑛𝑡𝑥𝑖𝑗 = 𝑥𝑖𝑗 − 𝑥𝑗

𝑥𝑗 = 𝑖=1𝐼 𝑥𝑖𝑗𝐼

𝑐𝑒𝑛𝑡𝑥𝑖𝑗= mean centring value of 𝑥𝑖𝑗with average value of 𝑥𝑗

10

Page 11: Data preprocessing - Postharvest Technology Innovation Center · Data preprocessing •Data pre-processing is to apply a mathematical modification to the values of a matrix prior

4. Auto scaling

– Standardization

Raw data

1 5 10 50 100

2 6 20 60 200

3 7 30 70 300

4 8 40 80 400

5 9 50 90 500

1 2 3 4 5-2

-1.5

-1

-0.5

0

0.5

1

1.5

2

Variables

siz

e

Standardization

1 2 3 4 50

100

200

300

400

500

600

Variables

siz

e

None preprocessing

𝑠𝑡𝑑𝑥𝑖𝑗 =𝑥𝑖𝑗 − 𝑥𝑗

𝑆𝑗

𝑆𝑗 = 𝑖=1𝐼 𝑥𝑖𝑗 − 𝑥𝑗

2

𝐼

𝑠𝑡𝑑𝑥𝑖𝑗 = standardized element of 𝑥𝑖𝑗 using

mean of 𝑥𝑗 standard deviation of 𝑆𝑗

11

Page 12: Data preprocessing - Postharvest Technology Innovation Center · Data preprocessing •Data pre-processing is to apply a mathematical modification to the values of a matrix prior

5. Others

- Multiplicative scatter-correction (MSC)

- Standard normal variate (SNV)

SNV𝐴𝑖𝑗 =𝐴𝑖𝑗− 𝑥𝑖

𝑆𝐷𝑒𝑣

𝑖 = spectrum counter𝑗 = absorbance value counter of ith spectrumSNV𝐴𝑖𝑗 = corrected absorbance value

𝐴𝑖𝑗 = measured absorbance value

𝑥𝑖 = mean absorbance value𝑆𝐷𝑒𝑣 = standard deviation

𝑥𝑖 = 𝑎 + 𝑏 𝑦𝑎𝑣𝑒𝑟𝑎𝑔𝑒 + 𝜀

𝑀𝑆𝐶𝑥𝑖𝑗 =𝑥𝑖𝑗 − 𝑎

𝑏

𝑥𝑖 = the ith spectrum collection𝑥𝑖𝑗 = the absorbance of the ith spectrum and jth

wavelength of the collection

For each sample, a and b are estimated by ordinary least-squares regression of spectrum xi versus yAverageover the available wavelengths j

https://www.labcognition.com/onlinehelp/en/standard_normal_variate_correction.htmhttps://www.labcognition.com/onlinehelp/en/multiplicative_scatter_correction.htm

12

Page 13: Data preprocessing - Postharvest Technology Innovation Center · Data preprocessing •Data pre-processing is to apply a mathematical modification to the values of a matrix prior

13

400 600 800 1000 1200 1400 1600 1800 2000 2200 2400-1.5

-1

-0.5

0

0.5

1

1.5

2

Wevelength (nm)

Ab

so

rban

ce

Raw spectra

SNV

MSC

Page 14: Data preprocessing - Postharvest Technology Innovation Center · Data preprocessing •Data pre-processing is to apply a mathematical modification to the values of a matrix prior

580 600 620 640 660 680 700-0.6

-0.4

-0.2

0

0.2

0.4

0.6

0.8

1

1.2

Wavelength (nm)

Absorb

ance

Raw spectra

1st derivative

2nd derivative

200 300 400 500 600 700 800 900 1000 1100-2

-1

0

1

2

3

4

Wavelength (nm)

Absorb

ance

Raw spectra

1st derivative

2nd derivative

5. Others

– Derivatives 𝑑𝑋/𝑑𝑌 =∆𝑋

∆𝑌1st derivative

2nd derivative 𝑑2𝑋/𝑑𝑌2 =∆2𝑋

∆𝑌2

Wavelength (nm)

Ab

sorb

ance

Wavelength (nm)

Ab

sorb

ance

Page 15: Data preprocessing - Postharvest Technology Innovation Center · Data preprocessing •Data pre-processing is to apply a mathematical modification to the values of a matrix prior

0 5 10 15 20 25 30 350

2000

4000

6000

8000

10000

12000

Wavelength

Inte

nsity

0 5 10 15 20 25 30 350

0.05

0.1

0.15

0.2

0.25

0.3

0.35

Wavelength

Inte

nsity

0 5 10 15 20 25 30 35-3000

-2000

-1000

0

1000

2000

3000

4000

Wavalenght

Inte

nsity

0 5 10 15 20 25 30 35-4

-3

-2

-1

0

1

2

3

4

5

Wavalenght

Inte

nsity

0 5 10 15 20 25 30 355.5

6

6.5

7

7.5

8

8.5

9

9.5

Wavalenght

Inte

nsity

2 2.2 2.4 2.6 2.8 3 3.2

x 104

-3000

-2000

-1000

0

1000

2000

3000

4000

5000

PC1 (99.2848%)

PC

2 (

0.6

395%

)

-15 -10 -5 0 5 10-6

-4

-2

0

2

4

6

PC1 (71.8221%)

PC

2 (

15.2

518%

)

43.5 44 44.5 45 45.5 46 46.5-0.6

-0.4

-0.2

0

0.2

0.4

0.6

0.8

PC1 (99.9879%)

PC

2 (

0.0

104%

)

-6000 -4000 -2000 0 2000 4000 6000-5000

-4000

-3000

-2000

-1000

0

1000

2000

PC1 (78.2349%)

PC

2 (

20.1

207%

)

0.198 0.2 0.202 0.204 0.206 0.208 0.21 0.212 0.214 0.216 0.218-0.02

-0.015

-0.01

-0.005

0

0.005

0.01

0.015

0.02

0.025

0.03

PC1 (99.2474%)

PC

2 (

0.6

852%

)

PCA score plots

15

Page 16: Data preprocessing - Postharvest Technology Innovation Center · Data preprocessing •Data pre-processing is to apply a mathematical modification to the values of a matrix prior

0 5 10 15 20 25 30 350

2000

4000

6000

8000

10000

12000

Wavelength

Inte

nsity

0 5 10 15 20 25 30 350

0.05

0.1

0.15

0.2

0.25

0.3

0.35

Wavelength

Inte

nsity

0 5 10 15 20 25 30 35-3000

-2000

-1000

0

1000

2000

3000

4000

Wavalenght

Inte

nsity

0 5 10 15 20 25 30 35-4

-3

-2

-1

0

1

2

3

4

5

Wavalenght

Inte

nsity

0 5 10 15 20 25 30 355.5

6

6.5

7

7.5

8

8.5

9

9.5

Wavalenght

Inte

nsity

0 0.05 0.1 0.15 0.2 0.25 0.3 0.35-0.3

-0.2

-0.1

0

0.1

0.2

0.3

0.4

0.5

12

3

4

56

7

8

9

10

11

12

13

14151617

181920

212223

2425262728293031

0.12 0.13 0.14 0.15 0.16 0.17 0.18 0.19 0.2 0.21-0.2

-0.1

0

0.1

0.2

0.3

0.4

0.5

1

2

3

4

5 6

7

8

9

10

111213

14

1516

1718192021

22232425262728293031

PC1 (99.9881%)

PC

2 (

0.0

102%

)

0.15 0.16 0.17 0.18 0.19 0.2-0.2

-0.1

0

0.1

0.2

0.3

0.4

0.5

1

2

3

4

56 7

8

9

10

11

1213

14

1516

17

181920

2122 23

2425

26

2728293031

PC1 (71.2556%)

PC

2 (

16.0

794%

)

0 0.05 0.1 0.15 0.2 0.25 0.3 0.35

-0.2

-0.1

0

0.1

0.2

0.3

0.4

0.5

1

23

4

5

6 7

8

9

10

11

12

13

14

151617

1819

2021

222324

2526

27

28

29

30

31

PC1 (99.2475%)

PC

2 (

0.6

849%

)

0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45-0.4

-0.3

-0.2

-0.1

0

0.1

0.2

0.3

0.4

12 3

4

5

67

8

9

10

1112

13

14

15161718

1920

212223

2425

2627

2829

30

31

PC1 (78.2351%)

PC

2 (

20.1

206%

)PCA loading plots

16

Page 17: Data preprocessing - Postharvest Technology Innovation Center · Data preprocessing •Data pre-processing is to apply a mathematical modification to the values of a matrix prior

Raw data

400 600 800 1000 1200 1400 1600 1800 2000 2200 2400

-3

-2

-1

0

1

2

3

Wevelength (nm)

Ab

so

rban

ce

KDML 105

WR

PT 1

400 600 800 1000 1200 1400 1600 1800 2000 2200 24000

0.2

0.4

0.6

0.8

1

1.2

1.4

1.6

1.8x 10

-3

Wevelength (nm)

Ab

so

rban

ce

KDML 105

WR

PT 1

400 600 800 1000 1200 1400 1600 1800 2000 2200 2400-1.5

-1

-0.5

0

0.5

1

1.5

Wevelength (nm)

Ab

so

rban

ce

KDML 105

WR

PT 1

400 600 800 1000 1200 1400 1600 1800 2000 2200 24000

0.2

0.4

0.6

0.8

1

1.2

1.4

1.6

1.8

Wevelength (nm)

Ab

so

rban

ce

KDML 105

WR

PT 1

400 600 800 1000 1200 1400 1600 1800 2000 2200 24000

0.2

0.4

0.6

0.8

1

1.2

1.4

1.6

1.8

Wevelength (nm)

Ab

so

rban

ce

KDML 105

WR

PT 1

31 31.5 32 32.5 33 33.5 34 34.5 35 35.5-0.8

-0.6

-0.4

-0.2

0

0.2

0.4

0.6

PC1 (99.9887%)

PC

2 (

0.0

073%

)

KDML 105

WR

PT 1

0.0368 0.0369 0.037 0.0371 0.0372 0.0373 0.0374 0.0375-6

-4

-2

0

2

4

6x 10

-4

PC1 (99.9885%)

PC

2 (

0.0

075%

)

KDML 105

WR

PT 1

-60 -40 -20 0 20 40 60-30

-20

-10

0

10

20

30

PC1 (67.9196%)

PC

2 (

25.6

601%

)

KDML 105

WR

PT 1

-32.386 -32.384 -32.382 -32.38 -32.378 -32.376 -32.374 -32.372-0.8

-0.6

-0.4

-0.2

0

0.2

0.4

0.6

0.8

1

PC1 (99.9785%)

PC

2 (

0.0

160%

)

KDML 105

WR

PT 1

33.5692 33.5692 33.5693 33.5693 33.5694 33.5694 33.5695 33.5695-0.5

-0.4

-0.3

-0.2

-0.1

0

0.1

0.2

0.3

0.4

0.5

PC1 (99.9933%)

PC

2 (

0.0

050%

)

KDML 105

WR

PT 1

17

Page 18: Data preprocessing - Postharvest Technology Innovation Center · Data preprocessing •Data pre-processing is to apply a mathematical modification to the values of a matrix prior

400 600 800 1000 1200 1400 1600 1800 2000 2200 2400

-3

-2

-1

0

1

2

3

Wevelength (nm)

Ab

so

rban

ce

KDML 105

WR

PT 1

400 600 800 1000 1200 1400 1600 1800 2000 2200 24000

0.2

0.4

0.6

0.8

1

1.2

1.4

1.6

1.8x 10

-3

Wevelength (nm)

Ab

so

rban

ce

KDML 105

WR

PT 1

400 600 800 1000 1200 1400 1600 1800 2000 2200 2400-1.5

-1

-0.5

0

0.5

1

1.5

Wevelength (nm)

Ab

so

rban

ce

KDML 105

WR

PT 1

400 600 800 1000 1200 1400 1600 1800 2000 2200 24000

0.2

0.4

0.6

0.8

1

1.2

1.4

1.6

1.8

Wevelength (nm)

Ab

so

rban

ce

KDML 105

WR

PT 1

400 600 800 1000 1200 1400 1600 1800 2000 2200 24000

0.2

0.4

0.6

0.8

1

1.2

1.4

1.6

1.8

Wevelength (nm)

Ab

so

rban

ce

KDML 105

WR

PT 1

PCA loading plots

0 0.005 0.01 0.015 0.02 0.025 0.03 0.035 0.04 0.045 0.05-0.12

-0.1

-0.08

-0.06

-0.04

-0.02

0

0.02

0.04

PC1 (99.9887%)

PC

2 (

0.0

073%

) 230352

447 402

636

535

1050

1

-0.04 -0.03 -0.02 -0.01 0 0.01 0.02

-0.06

-0.05

-0.04

-0.03

-0.02

-0.01

0

0.01

0.02

PC1 (67.9196%)

PC

2 (

25.6

601%

)

990

39

2

0 0.005 0.01 0.015 0.02 0.025 0.03 0.035 0.04 0.045 0.05-0.12

-0.1

-0.08

-0.06

-0.04

-0.02

0

0.02

0.04

PC1 (99.9885%)

PC

2 (

0.0

075%

)

638

544

753

1050

1

241362

453405

-0.04 -0.03 -0.02 -0.01 0 0.01 0.02 0.03 0.04-0.04

-0.02

0

0.02

0.04

0.06

0.08

0.1

PC1 (99.9785%)

PC

2 (

0.0

160%

)

1050

543

727

240

1

23

0 0.005 0.01 0.015 0.02 0.025 0.03 0.035 0.04 0.045

-0.1

-0.08

-0.06

-0.04

-0.02

0

0.02

0.04

PC1 (99.9933%)

PC

2 (

0.0

050%

)

241

1

547

749

1050

732

2718