Upload
others
View
5
Download
0
Embed Size (px)
Citation preview
Data preprocessing
1
ผศ.ดร. ศิลา กิตติวัชนะ และคณะนักศึกษาภาควิชาเคมี คณะวิทยาศาสตร์ มหาวิทยาลัยเชียงใหม่
E-mail: [email protected]: 087-9166692
The problems are….
• Data are from different types/scales
• Variation during experiments
• Errors: Systematic and non-systementic errors
• Noise: Homoscedastic and heteroscedastic noise
• Characteristics of the acquisition methods such as light scattering in NIR
• …………………etc
2
Data preprocessing
• Data pre-processing is to apply a mathematical modification to the values of a matrix prior to formal data analysis methods.
• A raw data may be shifted/rescaled along the variables/samples to place emphasis on the aspects wanted from the data analysis or to improve the interpretation of the data.
• This is to ensure that the right trends are being studied, and the analysis methods do not get confused by non-essential information.
3
Raw dataUV-Vis data
0 5 10 15 20 25 30 35
2.6
2.8
3
3.2
3.4
3.6
3.8
4
4.2
Wavelength
Ab
so
rban
ce
0 5 10 15 20 25 30 350
2000
4000
6000
8000
10000
12000
Wavelength
Ab
so
rban
ce
0 5 10 15 20 25 30-3000
-2000
-1000
0
1000
2000
3000
4000
Wavelength
Ab
so
rban
ce
0 5 10 15 20 25 30 350
0.01
0.02
0.03
0.04
0.05
0.06
0.07
Wavelength
Ab
so
rban
ce
0 5 10 15 20 25 30 35-4
-3
-2
-1
0
1
2
3
4
5
Wavelength
Ab
so
rban
ce
4
Raw data
400 600 800 1000 1200 1400 1600 1800 2000 2200 24000
0.2
0.4
0.6
0.8
1
1.2
1.4
1.6
1.8
Wevelength (nm)
Ab
so
rban
ce
KDML 105
WR
PT 1
400 600 800 1000 1200 1400 1600 1800 2000 2200 2400
-3
-2
-1
0
1
2
3
Wevelength (nm)
Ab
so
rban
ce
KDML 105
WR
PT 1
400 600 800 1000 1200 1400 1600 1800 2000 2200 24000
0.2
0.4
0.6
0.8
1
1.2
1.4
1.6
1.8x 10
-3
Wevelength (nm)
Ab
so
rban
ce
KDML 105
WR
PT 1
400 600 800 1000 1200 1400 1600 1800 2000 2200 2400-1.5
-1
-0.5
0
0.5
1
1.5
Wevelength (nm)
Ab
so
rban
ce
KDML 105
WR
PT 1
400 600 800 1000 1200 1400 1600 1800 2000 2200 24000
0.2
0.4
0.6
0.8
1
1.2
1.4
1.6
1.8
Wevelength (nm)
Ab
so
rban
ce
KDML 105
WR
PT 1
Wavelength (nm)
Wavelength (nm)
Wavelength (nm)
Wavelength (nm)Wavelength (nm)
NIR data
5
Data preprocessing1. Data smoothing
– Savitzky–Golay (SG) filters– Moving averages
2. Data scaling– Square root scaling– Log scaling
3. Normalization– Row scaling
4. Auto scaling– Mean centring– Standardization
5. Others– SNV– MSC– Derivatives
6
710 720 730 740 750 760 770 780-0.1
-0.05
0
0.05
0.1
Wavelength (nm)
Absorb
ance
Raw spectra
Savitzky–Golay (SG) filters (5)
Savitzky–Golay (SG) filters (15)
Moving averages (3)
Moving averages (5)
600 610 620 630 640 650 660 670 680 690
0
0.2
0.4
0.6
0.8
1
1.2
Wavelength (nm)
Absorb
ance
Raw spectra
Savitzky–Golay (SG) filters (5)
Savitzky–Golay (SG) filters (15)
Moving averages (3)
Moving averages (7)
200 300 400 500 600 700 800 900 1000 1100-0.5
0
0.5
1
1.5
2
2.5
3
3.5
4
Wavelength (nm)
Absorb
ance
Raw spectra
Savitzky–Golay (SG) filters (5)
Savitzky–Golay (SG) filters (15)
Moving averages (3)
Moving averages (7)
1. Data Smoothing
– Savitzky–Golay (SG) filters
– Moving averages
Wavelength (nm)
Wavelength (nm)
Ab
sorb
ance
Ab
sorb
ance
Wavelength (nm)
Ab
sorb
ance
2. Data scaling
– Square root scaling
– Log scaling
0 1 2 3 4 5 6 70
1
2
3
4
5
6
7
Variable
Siz
e
0 1 2 3 4 5 6 70
5
10
15
20
25
30
35
Variable
Siz
e
0 1 2 3 4 5 6 70
1
2
3
4
5
6
7
8
9
10
Variable
Siz
e
Raw data (X)
𝑋𝑠𝑞𝑟𝑡.5 = 𝑋
0 1 2 3 4 5 6 70
100
200
300
400
500
600
700
800
900
1000
Variable
Siz
e
𝑋𝑠𝑞𝑟𝑡.33 = 1 3𝑋 𝑋𝑙𝑜𝑔 = log𝑋
𝑿𝑠𝑞𝑟𝑡 = 𝑿
𝑿𝑙𝑜𝑔 = log𝑿
8
3. Normalization
– Row scaling
Raw data Row scaling
0 5 10 15 20 25 30 350
2000
4000
6000
8000
10000
12000
Wavelength
Ab
so
rban
ce
0 5 10 15 20 25 30 350
0.01
0.02
0.03
0.04
0.05
0.06
0.07
Wavelength
Ab
so
rban
ce
𝑟𝑠𝑥𝑖𝑗 =𝑥𝑖𝑗
𝐽=1𝐽
𝑥𝑖𝑗
𝑟𝑠𝑥𝑖𝑗 = normalized value of 𝑥𝑖𝑗𝐽 = number of parameter
9
4. Auto scaling
– Mean centring
Raw data
1 5 10 50 100
2 6 20 60 200
3 7 30 70 300
4 8 40 80 400
5 9 50 90 500
1 2 3 4 5-250
-200
-150
-100
-50
0
50
100
150
200
250
Variables
siz
e
Mean centring
1 2 3 4 50
100
200
300
400
500
600
Variables
siz
e
None preprocessing
𝑐𝑒𝑛𝑡𝑥𝑖𝑗 = 𝑥𝑖𝑗 − 𝑥𝑗
𝑥𝑗 = 𝑖=1𝐼 𝑥𝑖𝑗𝐼
𝑐𝑒𝑛𝑡𝑥𝑖𝑗= mean centring value of 𝑥𝑖𝑗with average value of 𝑥𝑗
10
4. Auto scaling
– Standardization
Raw data
1 5 10 50 100
2 6 20 60 200
3 7 30 70 300
4 8 40 80 400
5 9 50 90 500
1 2 3 4 5-2
-1.5
-1
-0.5
0
0.5
1
1.5
2
Variables
siz
e
Standardization
1 2 3 4 50
100
200
300
400
500
600
Variables
siz
e
None preprocessing
𝑠𝑡𝑑𝑥𝑖𝑗 =𝑥𝑖𝑗 − 𝑥𝑗
𝑆𝑗
𝑆𝑗 = 𝑖=1𝐼 𝑥𝑖𝑗 − 𝑥𝑗
2
𝐼
𝑠𝑡𝑑𝑥𝑖𝑗 = standardized element of 𝑥𝑖𝑗 using
mean of 𝑥𝑗 standard deviation of 𝑆𝑗
11
5. Others
- Multiplicative scatter-correction (MSC)
- Standard normal variate (SNV)
SNV𝐴𝑖𝑗 =𝐴𝑖𝑗− 𝑥𝑖
𝑆𝐷𝑒𝑣
𝑖 = spectrum counter𝑗 = absorbance value counter of ith spectrumSNV𝐴𝑖𝑗 = corrected absorbance value
𝐴𝑖𝑗 = measured absorbance value
𝑥𝑖 = mean absorbance value𝑆𝐷𝑒𝑣 = standard deviation
𝑥𝑖 = 𝑎 + 𝑏 𝑦𝑎𝑣𝑒𝑟𝑎𝑔𝑒 + 𝜀
𝑀𝑆𝐶𝑥𝑖𝑗 =𝑥𝑖𝑗 − 𝑎
𝑏
𝑥𝑖 = the ith spectrum collection𝑥𝑖𝑗 = the absorbance of the ith spectrum and jth
wavelength of the collection
For each sample, a and b are estimated by ordinary least-squares regression of spectrum xi versus yAverageover the available wavelengths j
https://www.labcognition.com/onlinehelp/en/standard_normal_variate_correction.htmhttps://www.labcognition.com/onlinehelp/en/multiplicative_scatter_correction.htm
12
13
400 600 800 1000 1200 1400 1600 1800 2000 2200 2400-1.5
-1
-0.5
0
0.5
1
1.5
2
Wevelength (nm)
Ab
so
rban
ce
Raw spectra
SNV
MSC
580 600 620 640 660 680 700-0.6
-0.4
-0.2
0
0.2
0.4
0.6
0.8
1
1.2
Wavelength (nm)
Absorb
ance
Raw spectra
1st derivative
2nd derivative
200 300 400 500 600 700 800 900 1000 1100-2
-1
0
1
2
3
4
Wavelength (nm)
Absorb
ance
Raw spectra
1st derivative
2nd derivative
5. Others
– Derivatives 𝑑𝑋/𝑑𝑌 =∆𝑋
∆𝑌1st derivative
2nd derivative 𝑑2𝑋/𝑑𝑌2 =∆2𝑋
∆𝑌2
Wavelength (nm)
Ab
sorb
ance
Wavelength (nm)
Ab
sorb
ance
0 5 10 15 20 25 30 350
2000
4000
6000
8000
10000
12000
Wavelength
Inte
nsity
0 5 10 15 20 25 30 350
0.05
0.1
0.15
0.2
0.25
0.3
0.35
Wavelength
Inte
nsity
0 5 10 15 20 25 30 35-3000
-2000
-1000
0
1000
2000
3000
4000
Wavalenght
Inte
nsity
0 5 10 15 20 25 30 35-4
-3
-2
-1
0
1
2
3
4
5
Wavalenght
Inte
nsity
0 5 10 15 20 25 30 355.5
6
6.5
7
7.5
8
8.5
9
9.5
Wavalenght
Inte
nsity
2 2.2 2.4 2.6 2.8 3 3.2
x 104
-3000
-2000
-1000
0
1000
2000
3000
4000
5000
PC1 (99.2848%)
PC
2 (
0.6
395%
)
-15 -10 -5 0 5 10-6
-4
-2
0
2
4
6
PC1 (71.8221%)
PC
2 (
15.2
518%
)
43.5 44 44.5 45 45.5 46 46.5-0.6
-0.4
-0.2
0
0.2
0.4
0.6
0.8
PC1 (99.9879%)
PC
2 (
0.0
104%
)
-6000 -4000 -2000 0 2000 4000 6000-5000
-4000
-3000
-2000
-1000
0
1000
2000
PC1 (78.2349%)
PC
2 (
20.1
207%
)
0.198 0.2 0.202 0.204 0.206 0.208 0.21 0.212 0.214 0.216 0.218-0.02
-0.015
-0.01
-0.005
0
0.005
0.01
0.015
0.02
0.025
0.03
PC1 (99.2474%)
PC
2 (
0.6
852%
)
PCA score plots
15
0 5 10 15 20 25 30 350
2000
4000
6000
8000
10000
12000
Wavelength
Inte
nsity
0 5 10 15 20 25 30 350
0.05
0.1
0.15
0.2
0.25
0.3
0.35
Wavelength
Inte
nsity
0 5 10 15 20 25 30 35-3000
-2000
-1000
0
1000
2000
3000
4000
Wavalenght
Inte
nsity
0 5 10 15 20 25 30 35-4
-3
-2
-1
0
1
2
3
4
5
Wavalenght
Inte
nsity
0 5 10 15 20 25 30 355.5
6
6.5
7
7.5
8
8.5
9
9.5
Wavalenght
Inte
nsity
0 0.05 0.1 0.15 0.2 0.25 0.3 0.35-0.3
-0.2
-0.1
0
0.1
0.2
0.3
0.4
0.5
12
3
4
56
7
8
9
10
11
12
13
14151617
181920
212223
2425262728293031
0.12 0.13 0.14 0.15 0.16 0.17 0.18 0.19 0.2 0.21-0.2
-0.1
0
0.1
0.2
0.3
0.4
0.5
1
2
3
4
5 6
7
8
9
10
111213
14
1516
1718192021
22232425262728293031
PC1 (99.9881%)
PC
2 (
0.0
102%
)
0.15 0.16 0.17 0.18 0.19 0.2-0.2
-0.1
0
0.1
0.2
0.3
0.4
0.5
1
2
3
4
56 7
8
9
10
11
1213
14
1516
17
181920
2122 23
2425
26
2728293031
PC1 (71.2556%)
PC
2 (
16.0
794%
)
0 0.05 0.1 0.15 0.2 0.25 0.3 0.35
-0.2
-0.1
0
0.1
0.2
0.3
0.4
0.5
1
23
4
5
6 7
8
9
10
11
12
13
14
151617
1819
2021
222324
2526
27
28
29
30
31
PC1 (99.2475%)
PC
2 (
0.6
849%
)
0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45-0.4
-0.3
-0.2
-0.1
0
0.1
0.2
0.3
0.4
12 3
4
5
67
8
9
10
1112
13
14
15161718
1920
212223
2425
2627
2829
30
31
PC1 (78.2351%)
PC
2 (
20.1
206%
)PCA loading plots
16
Raw data
400 600 800 1000 1200 1400 1600 1800 2000 2200 2400
-3
-2
-1
0
1
2
3
Wevelength (nm)
Ab
so
rban
ce
KDML 105
WR
PT 1
400 600 800 1000 1200 1400 1600 1800 2000 2200 24000
0.2
0.4
0.6
0.8
1
1.2
1.4
1.6
1.8x 10
-3
Wevelength (nm)
Ab
so
rban
ce
KDML 105
WR
PT 1
400 600 800 1000 1200 1400 1600 1800 2000 2200 2400-1.5
-1
-0.5
0
0.5
1
1.5
Wevelength (nm)
Ab
so
rban
ce
KDML 105
WR
PT 1
400 600 800 1000 1200 1400 1600 1800 2000 2200 24000
0.2
0.4
0.6
0.8
1
1.2
1.4
1.6
1.8
Wevelength (nm)
Ab
so
rban
ce
KDML 105
WR
PT 1
400 600 800 1000 1200 1400 1600 1800 2000 2200 24000
0.2
0.4
0.6
0.8
1
1.2
1.4
1.6
1.8
Wevelength (nm)
Ab
so
rban
ce
KDML 105
WR
PT 1
31 31.5 32 32.5 33 33.5 34 34.5 35 35.5-0.8
-0.6
-0.4
-0.2
0
0.2
0.4
0.6
PC1 (99.9887%)
PC
2 (
0.0
073%
)
KDML 105
WR
PT 1
0.0368 0.0369 0.037 0.0371 0.0372 0.0373 0.0374 0.0375-6
-4
-2
0
2
4
6x 10
-4
PC1 (99.9885%)
PC
2 (
0.0
075%
)
KDML 105
WR
PT 1
-60 -40 -20 0 20 40 60-30
-20
-10
0
10
20
30
PC1 (67.9196%)
PC
2 (
25.6
601%
)
KDML 105
WR
PT 1
-32.386 -32.384 -32.382 -32.38 -32.378 -32.376 -32.374 -32.372-0.8
-0.6
-0.4
-0.2
0
0.2
0.4
0.6
0.8
1
PC1 (99.9785%)
PC
2 (
0.0
160%
)
KDML 105
WR
PT 1
33.5692 33.5692 33.5693 33.5693 33.5694 33.5694 33.5695 33.5695-0.5
-0.4
-0.3
-0.2
-0.1
0
0.1
0.2
0.3
0.4
0.5
PC1 (99.9933%)
PC
2 (
0.0
050%
)
KDML 105
WR
PT 1
17
400 600 800 1000 1200 1400 1600 1800 2000 2200 2400
-3
-2
-1
0
1
2
3
Wevelength (nm)
Ab
so
rban
ce
KDML 105
WR
PT 1
400 600 800 1000 1200 1400 1600 1800 2000 2200 24000
0.2
0.4
0.6
0.8
1
1.2
1.4
1.6
1.8x 10
-3
Wevelength (nm)
Ab
so
rban
ce
KDML 105
WR
PT 1
400 600 800 1000 1200 1400 1600 1800 2000 2200 2400-1.5
-1
-0.5
0
0.5
1
1.5
Wevelength (nm)
Ab
so
rban
ce
KDML 105
WR
PT 1
400 600 800 1000 1200 1400 1600 1800 2000 2200 24000
0.2
0.4
0.6
0.8
1
1.2
1.4
1.6
1.8
Wevelength (nm)
Ab
so
rban
ce
KDML 105
WR
PT 1
400 600 800 1000 1200 1400 1600 1800 2000 2200 24000
0.2
0.4
0.6
0.8
1
1.2
1.4
1.6
1.8
Wevelength (nm)
Ab
so
rban
ce
KDML 105
WR
PT 1
PCA loading plots
0 0.005 0.01 0.015 0.02 0.025 0.03 0.035 0.04 0.045 0.05-0.12
-0.1
-0.08
-0.06
-0.04
-0.02
0
0.02
0.04
PC1 (99.9887%)
PC
2 (
0.0
073%
) 230352
447 402
636
535
1050
1
-0.04 -0.03 -0.02 -0.01 0 0.01 0.02
-0.06
-0.05
-0.04
-0.03
-0.02
-0.01
0
0.01
0.02
PC1 (67.9196%)
PC
2 (
25.6
601%
)
990
39
2
0 0.005 0.01 0.015 0.02 0.025 0.03 0.035 0.04 0.045 0.05-0.12
-0.1
-0.08
-0.06
-0.04
-0.02
0
0.02
0.04
PC1 (99.9885%)
PC
2 (
0.0
075%
)
638
544
753
1050
1
241362
453405
-0.04 -0.03 -0.02 -0.01 0 0.01 0.02 0.03 0.04-0.04
-0.02
0
0.02
0.04
0.06
0.08
0.1
PC1 (99.9785%)
PC
2 (
0.0
160%
)
1050
543
727
240
1
23
0 0.005 0.01 0.015 0.02 0.025 0.03 0.035 0.04 0.045
-0.1
-0.08
-0.06
-0.04
-0.02
0
0.02
0.04
PC1 (99.9933%)
PC
2 (
0.0
050%
)
241
1
547
749
1050
732
2718