Learning From Data — Lecture 11: Overfitting

• What is overfitting?
• When does overfitting occur?
• Stochastic and deterministic noise

M. Magdon-Ismail, CSCI 4100/6100
Recap: Nonlinear Transforms

1. Original data: xn ∈ X
2. Transform the data: zn = Φ(xn) ∈ Z
3. Separate the data in Z-space: g̃(z) = sign(w̃tz)
4. Classify in X-space: g(x) = g̃(Φ(x)) = sign(w̃tΦ(x))

X-space is Rd: x = (1, x1, . . . , xd), with data x1, . . . , xN and labels y1, . . . , yN; no weights; dvc = d + 1.

Z-space is Rd̃: z = Φ(x) = (1, Φ1(x), . . . , Φd̃(x)) = (1, z1, . . . , zd̃), with data z1, . . . , zN and the same labels y1, . . . , yN; weights w̃ = (w0, w1, . . . , wd̃); dvc = d̃ + 1.

Final hypothesis: g(x) = sign(w̃tΦ(x))
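The recap above can be sketched in a few lines. This is a minimal illustration, not the lecture's experiment: the 2nd-order transform Φ and the hand-chosen weights w̃ (a unit circle in X-space) are assumptions for demonstration only.

```python
import numpy as np

def phi(x):
    """Illustrative 2nd-order feature transform: x -> z = (1, x1, x2, x1^2, x2^2)."""
    x1, x2 = x
    return np.array([1.0, x1, x2, x1**2, x2**2])

# A hand-chosen w~ that is linear in Z-space but nonlinear in X-space:
# sign(w~ . z) = sign(x1^2 + x2^2 - 1), i.e. a circle of radius 1.
w_tilde = np.array([-1.0, 0.0, 0.0, 1.0, 1.0])

def g(x):
    """Classify in X-space via the Z-space linear separator: g(x) = sign(w~ . phi(x))."""
    return np.sign(w_tilde @ phi(x))

print(g((0.1, 0.2)))  # inside the circle -> -1.0
print(g((2.0, 0.0)))  # outside the circle -> +1.0
```

Separating in Z-space with a line and mapping back to X-space is exactly steps 3 and 4 of the recap.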
© AML Creator: Malik Magdon-Ismail — Overfitting (Lecture 11, 24 slides)
Superstitions – Myth or Reality?

• Paraskevidekatriaphobia – fear of Friday the 13th.
  – Are future Fridays the 13th really more dangerous?
• OCD: the subject performs an action that leads to a good outcome and generalizes it into
  cause and effect – the action will always give good results. Having overfit the data, the subject
  compulsively engages in that activity.

Humans are overfitting machines, very good at “finding coincidences”.
An Illustration of Overfitting on a Simple Example

• Quadratic target f
• 5 data points
• A little noise (measurement error)
• 5 data points → 4th-order polynomial fit

[Figure: the 4th-order fit passes through all 5 noisy data points but swings far away from the quadratic target.]

Classic overfitting: a simple target with an excessively complex H.

Ein ≈ 0; Eout ≫ 0

The noise did us in. (Why?)
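The example above is easy to reproduce numerically. This is a sketch under assumed parameters (target x², noise level 0.1, points on a uniform grid – the slide does not give exact values): a 4th-order polynomial has 5 coefficients, so with 5 points it interpolates the noisy data exactly, giving Ein ≈ 0 while Eout stays large.

```python
import numpy as np

rng = np.random.default_rng(0)

# Quadratic target with a little measurement noise (illustrative parameters).
f = lambda x: x**2
x_train = np.linspace(-1, 1, 5)
y_train = f(x_train) + 0.1 * rng.standard_normal(5)

# 4th-order fit: 5 coefficients for 5 points -> exact interpolation, Ein ~ 0.
w4 = np.polyfit(x_train, y_train, 4)
e_in = np.mean((np.polyval(w4, x_train) - y_train) ** 2)

# Out-of-sample error, measured against the noiseless target on a dense grid.
x_test = np.linspace(-1, 1, 1000)
e_out = np.mean((np.polyval(w4, x_test) - f(x_test)) ** 2)

print(e_in, e_out)  # Ein is essentially zero; Eout is much larger
```

The fit captures the noise as if it were signal, which is exactly why "the noise did us in".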
What is Overfitting?
Fitting the data more than is warranted
Overfitting is Not Just Bad Generalization

[Figure: in-sample and out-of-sample error vs. VC dimension dvc; the gap between the two curves is bad generalization.]

VC Analysis: covers bad generalization, but with lots of slack – the VC bound is loose.
Overfitting is Not Just Bad Generalization

[Figure: beyond some dvc, the out-of-sample error rises while the in-sample error keeps falling – the overfitting regime.]

Overfitting: going for lower and lower Ein results in higher and higher Eout.
Case Study: 2nd vs 10th Order Polynomial Fit

[Left figure: data from a 10th-order target f with noise. Right figure: data from a 50th-order target f with no noise.]

H2: 2nd-order polynomial fit
H10: 10th-order polynomial fit ←− a special case of linear models with the feature transform x ↦ (1, x, x², · · · ).

Which model do you pick for which problem, and why?
Case Study: 2nd vs 10th Order Polynomial Fit

[Figures: the 2nd-order and 10th-order fits to each data set.]

Simple noisy target:
         2nd Order   10th Order
  Ein    0.050       0.034
  Eout   0.127       9.00

Complex noiseless target:
         2nd Order   10th Order
  Ein    0.029       10⁻⁵
  Eout   0.120       7680

Go figure: the simpler H is better even for the more complex target with no noise.
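The H2-vs-H10 comparison can be sketched as below. The target, noise level, and sample size here are stand-ins (the slide's exact targets are not given), so the table's numbers will not be reproduced; the structural fact that holds regardless is that H10 ⊃ H2, so H10 always achieves the lower Ein – Eout is where the two models part ways.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative stand-ins for the case study: a wiggly target, a little
# noise, and few data points (not the slide's exact setup).
f = lambda x: np.sin(8 * x)
N = 15
x_tr = rng.uniform(-1, 1, N)
y_tr = f(x_tr) + 0.3 * rng.standard_normal(N)
x_te = np.linspace(-1, 1, 2000)

results = {}
for Q in (2, 10):                       # H2 and H10
    w = np.polyfit(x_tr, y_tr, Q)       # least-squares polynomial fit
    e_in = np.mean((np.polyval(w, x_tr) - y_tr) ** 2)
    e_out = np.mean((np.polyval(w, x_te) - f(x_te)) ** 2)
    results[Q] = (e_in, e_out)

print(results)  # H10 has the lower Ein; Eout tells another story
```

Varying N and the noise level in this sketch is a quick way to see the overfit regions mapped out on the next slides.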
Is there Really “No Noise” with the Complex f?

[Figures: left – simple f with noise; right – complex f with no noise. Looking only at the data points, without the target curves, the two cases are hard to tell apart.]

H should match the quantity and quality of the data, not f.
When is H2 Better than H10?

[Figures: learning curves – expected Ein and Eout vs. number of data points N – for H2 (left) and H10 (right). For small N, Eout(H10) exceeds Eout(H2).]

Overfitting: Eout(H10) > Eout(H2).
Overfit Measure: Eout(H10) − Eout(H2)

[Left plot: overfit measure vs. number of data points N (80–120) and stochastic noise level σ² (0–2). Right plot: overfit measure vs. N and target complexity Qf (0–100).]

Number of data points ↑ ⇒ overfitting ↓
Noise ↑ ⇒ overfitting ↑
Target complexity ↑ ⇒ overfitting ↑
Noise

The part of y we cannot model – it has two sources . . .
Stochastic Noise — Data Error

We would like to learn from: yn = f(xn).
Unfortunately, we only observe: yn = f(xn) + ‘stochastic noise’ (no one can model this part).

[Figure: data points scattered around the target curve y = f(x); the vertical deviation of each point from the curve is the stochastic noise.]

Stochastic noise: fluctuations/measurement errors we cannot model.
Deterministic Noise — Model Error

We would like to learn from: yn = h∗(xn), where h∗ is the best approximation to f in H.
Unfortunately, we only observe: yn = f(xn) = h∗(xn) + ‘deterministic noise’ (H cannot model this part).

[Figure: the target y = f(x) and the best approximation h∗(x) in H; the gap between the two curves is the deterministic noise.]

Deterministic noise: the part of f we cannot model.
Stochastic & Deterministic Noise Hurt Learning

Stochastic noise (y = f(x) + stochastic noise)
• source: random measurement errors
• re-measure the yn: the stochastic noise changes.
• change H: the stochastic noise stays the same.

Deterministic noise (y = h∗(x) + deterministic noise)
• source: the learner’s H cannot model f
• re-measure the yn: the deterministic noise stays the same.
• change H: the deterministic noise changes.

We have a single D and a fixed H, so we cannot distinguish the two.
Noise and the Bias-Variance Decomposition

y = f(x) + ε   (ε is the measurement error, with variance σ²)

E[Eout(x)] = E_{D,ε}[ (g(x) − f(x) − ε)² ]
           = E_{D,ε}[ (g(x) − f(x))² ] + E_{D,ε}[ 2(g(x) − f(x))ε ] + E_ε[ ε² ]
           =        bias + var         +             0              +    σ²
Noise and the Bias-Variance Decomposition

y = f(x) + ε   (measurement error)

E[Eout(x)] =    σ²     +     bias      +       var
            stochastic   deterministic   indirect impact
               noise         noise          of noise
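The decomposition above can be checked by Monte Carlo. This is an illustrative setup, not the slide's experiment: a quadratic target, the linear hypothesis set H1, noise standard deviation 0.5 (so σ² = 0.25), and N = 10 points per data set are all assumptions. The bias term plays the role of deterministic noise (the part of f that H1 cannot model), the variance term is the indirect impact of noise, and σ² enters directly.

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative setup: quadratic target f, linear hypothesis set H1,
# stochastic noise with variance sigma^2 = 0.25.
f = lambda x: x**2
sigma = 0.5
N, trials = 10, 5000
x_test = np.linspace(-1, 1, 101)

# Learn g from many independent data sets D.
preds = np.empty((trials, x_test.size))
for t in range(trials):
    x = rng.uniform(-1, 1, N)
    y = f(x) + sigma * rng.standard_normal(N)
    w = np.polyfit(x, y, 1)                 # best line through the noisy data
    preds[t] = np.polyval(w, x_test)

g_bar = preds.mean(axis=0)                  # average hypothesis g-bar
bias = np.mean((g_bar - f(x_test)) ** 2)    # deterministic noise (for H1)
var = np.mean(preds.var(axis=0))            # indirect impact of the noise
e_out = bias + var + sigma**2               # expected out-of-sample error

print(bias, var, sigma**2)
```

Re-running with a richer H shrinks the bias (less deterministic noise) but grows the variance – the trade-off the lecture turns to next.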
Noise is the Culprit

Overfitting is the disease. Noise is the cause: learning is led astray by fitting the noise more than the signal.

Cures:

• Regularization: putting on the brakes.
• Validation: a reality check from peeking at Eout (the bottom line).
Regularization

[Figures: data, target, and fit – left: no regularization; right: regularization.]