Learning From Data — Lecture 11: Overfitting

• What is overfitting?
• When does overfitting occur?
• Stochastic and deterministic noise

M. Magdon-Ismail, CSCI 4100/6100
Recap: Nonlinear Transforms

1. Original data: xn ∈ X
2. Transform the data: zn = Φ(xn) ∈ Z
3. Separate the data in Z-space: g̃(z) = sign(w̃tz)
4. Classify in X-space: g(x) = g̃(Φ(x)) = sign(w̃tΦ(x))

X-space is Rd: x = (1, x1, . . . , xd), with data x1, . . . , xN and labels y1, . . . , yN; no weights; dvc = d + 1.

Z-space is Rd̃: z = Φ(x) = (1, Φ1(x), . . . , Φd̃(x)) = (1, z1, . . . , zd̃), with data z1, . . . , zN and the same labels y1, . . . , yN; weights w̃ = (w0, w1, . . . , wd̃); dvc = d̃ + 1.

Final hypothesis: g(x) = sign(w̃tΦ(x))
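The recap above can be sketched in a few lines. This is a minimal illustration, not the lecture's experiment: the 2nd-order transform Φ and the hand-chosen weights w̃ (a unit circle in X-space) are assumptions for demonstration only.

```python
import numpy as np

def phi(x):
    """Illustrative 2nd-order feature transform: x -> z = (1, x1, x2, x1^2, x2^2)."""
    x1, x2 = x
    return np.array([1.0, x1, x2, x1**2, x2**2])

# A hand-chosen w~ that is linear in Z-space but nonlinear in X-space:
# sign(w~ . z) = sign(x1^2 + x2^2 - 1), i.e. a circle of radius 1.
w_tilde = np.array([-1.0, 0.0, 0.0, 1.0, 1.0])

def g(x):
    """Classify in X-space via the Z-space linear separator: g(x) = sign(w~ . phi(x))."""
    return np.sign(w_tilde @ phi(x))

print(g((0.1, 0.2)))  # inside the circle -> -1.0
print(g((2.0, 0.0)))  # outside the circle -> +1.0
```

Separating in Z-space with a line and mapping back to X-space is exactly steps 3 and 4 of the recap.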
© AML Creator: Malik Magdon-Ismail — Overfitting (Lecture 11, 24 slides)
Superstitions – Myth or Reality?

• Paraskevidekatriaphobia – fear of Friday the 13th.
  – Are future Fridays the 13th really more dangerous?
• OCD: the subject performs an action that leads to a good outcome and generalizes it into
  cause and effect – the action will always give good results. Having overfit the data, the subject
  compulsively engages in that activity.

Humans are overfitting machines, very good at “finding coincidences”.
An Illustration of Overfitting on a Simple Example

• Quadratic target f
• 5 data points
• A little noise (measurement error)
• 5 data points → 4th-order polynomial fit

[Figure: the 4th-order fit passes through all 5 noisy data points but swings far away from the quadratic target.]

Classic overfitting: a simple target with an excessively complex H.

Ein ≈ 0; Eout ≫ 0

The noise did us in. (Why?)
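The example above is easy to reproduce numerically. This is a sketch under assumed parameters (target x², noise level 0.1, points on a uniform grid – the slide does not give exact values): a 4th-order polynomial has 5 coefficients, so with 5 points it interpolates the noisy data exactly, giving Ein ≈ 0 while Eout stays large.

```python
import numpy as np

rng = np.random.default_rng(0)

# Quadratic target with a little measurement noise (illustrative parameters).
f = lambda x: x**2
x_train = np.linspace(-1, 1, 5)
y_train = f(x_train) + 0.1 * rng.standard_normal(5)

# 4th-order fit: 5 coefficients for 5 points -> exact interpolation, Ein ~ 0.
w4 = np.polyfit(x_train, y_train, 4)
e_in = np.mean((np.polyval(w4, x_train) - y_train) ** 2)

# Out-of-sample error, measured against the noiseless target on a dense grid.
x_test = np.linspace(-1, 1, 1000)
e_out = np.mean((np.polyval(w4, x_test) - f(x_test)) ** 2)

print(e_in, e_out)  # Ein is essentially zero; Eout is much larger
```

The fit captures the noise as if it were signal, which is exactly why "the noise did us in".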
What is Overfitting?
Fitting the data more than is warranted
Overfitting is Not Just Bad Generalization

[Figure: in-sample and out-of-sample error vs. VC dimension dvc; the gap between the two curves is bad generalization.]

VC Analysis: covers bad generalization, but with lots of slack – the VC bound is loose.
Overfitting is Not Just Bad Generalization

[Figure: beyond some dvc, the out-of-sample error rises while the in-sample error keeps falling – the overfitting regime.]

Overfitting: going for lower and lower Ein results in higher and higher Eout.
Case Study: 2nd vs 10th Order Polynomial Fit

[Left figure: data from a 10th-order target f with noise. Right figure: data from a 50th-order target f with no noise.]

H2: 2nd-order polynomial fit
H10: 10th-order polynomial fit ←− a special case of linear models with the feature transform x ↦ (1, x, x², · · · ).

Which model do you pick for which problem, and why?
Case Study: 2nd vs 10th Order Polynomial Fit

[Figures: the 2nd-order and 10th-order fits to each data set.]

Simple noisy target:
         2nd Order   10th Order
  Ein    0.050       0.034
  Eout   0.127       9.00

Complex noiseless target:
         2nd Order   10th Order
  Ein    0.029       10⁻⁵
  Eout   0.120       7680

Go figure: the simpler H is better even for the more complex target with no noise.
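The H2-vs-H10 comparison can be sketched as below. The target, noise level, and sample size here are stand-ins (the slide's exact targets are not given), so the table's numbers will not be reproduced; the structural fact that holds regardless is that H10 ⊃ H2, so H10 always achieves the lower Ein – Eout is where the two models part ways.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative stand-ins for the case study: a wiggly target, a little
# noise, and few data points (not the slide's exact setup).
f = lambda x: np.sin(8 * x)
N = 15
x_tr = rng.uniform(-1, 1, N)
y_tr = f(x_tr) + 0.3 * rng.standard_normal(N)
x_te = np.linspace(-1, 1, 2000)

results = {}
for Q in (2, 10):                       # H2 and H10
    w = np.polyfit(x_tr, y_tr, Q)       # least-squares polynomial fit
    e_in = np.mean((np.polyval(w, x_tr) - y_tr) ** 2)
    e_out = np.mean((np.polyval(w, x_te) - f(x_te)) ** 2)
    results[Q] = (e_in, e_out)

print(results)  # H10 has the lower Ein; Eout tells another story
```

Varying N and the noise level in this sketch is a quick way to see the overfit regions mapped out on the next slides.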
Is there Really “No Noise” with the Complex f?

[Figures: left – simple f with noise; right – complex f with no noise. Looking only at the data points, without the target curves, the two cases are hard to tell apart.]

H should match the quantity and quality of the data, not f.
When is H2 Better than H10?

[Figures: learning curves – expected Ein and Eout vs. number of data points N – for H2 (left) and H10 (right). For small N, Eout(H10) exceeds Eout(H2).]

Overfitting: Eout(H10) > Eout(H2).
Overfit Measure: Eout(H10) − Eout(H2)

[Left plot: overfit measure vs. number of data points N (80–120) and stochastic noise level σ² (0–2). Right plot: overfit measure vs. N and target complexity Qf (0–100).]

Number of data points ↑ ⇒ overfitting ↓
Noise ↑ ⇒ overfitting ↑
Target complexity ↑ ⇒ overfitting ↑
Noise

The part of y we cannot model – it has two sources . . .
Stochastic Noise — Data Error

We would like to learn from: yn = f(xn).
Unfortunately, we only observe: yn = f(xn) + ‘stochastic noise’ (no one can model this part).

[Figure: data points scattered around the target curve y = f(x); the vertical deviation of each point from the curve is the stochastic noise.]

Stochastic noise: fluctuations/measurement errors we cannot model.
Deterministic Noise — Model Error

We would like to learn from: yn = h∗(xn), where h∗ is the best approximation to f in H.
Unfortunately, we only observe: yn = f(xn) = h∗(xn) + ‘deterministic noise’ (H cannot model this part).

[Figure: the target y = f(x) and the best approximation h∗(x) in H; the gap between the two curves is the deterministic noise.]

Deterministic noise: the part of f we cannot model.
Stochastic & Deterministic Noise Hurt Learning

Stochastic noise (y = f(x) + stochastic noise)
• source: random measurement errors
• re-measure the yn: the stochastic noise changes.
• change H: the stochastic noise stays the same.

Deterministic noise (y = h∗(x) + deterministic noise)
• source: the learner’s H cannot model f
• re-measure the yn: the deterministic noise stays the same.
• change H: the deterministic noise changes.

We have a single D and a fixed H, so we cannot distinguish the two.
Noise and the Bias-Variance Decomposition

y = f(x) + ε   (ε is the measurement error, with variance σ²)

E[Eout(x)] = E_{D,ε}[ (g(x) − f(x) − ε)² ]
           = E_{D,ε}[ (g(x) − f(x))² ] + E_{D,ε}[ 2(g(x) − f(x))ε ] + E_ε[ ε² ]
           =        bias + var         +             0              +    σ²
Noise and the Bias-Variance Decomposition

y = f(x) + ε   (measurement error)

E[Eout(x)] =    σ²     +     bias      +       var
            stochastic   deterministic   indirect impact
               noise         noise          of noise
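The decomposition above can be checked by Monte Carlo. This is an illustrative setup, not the slide's experiment: a quadratic target, the linear hypothesis set H1, noise standard deviation 0.5 (so σ² = 0.25), and N = 10 points per data set are all assumptions. The bias term plays the role of deterministic noise (the part of f that H1 cannot model), the variance term is the indirect impact of noise, and σ² enters directly.

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative setup: quadratic target f, linear hypothesis set H1,
# stochastic noise with variance sigma^2 = 0.25.
f = lambda x: x**2
sigma = 0.5
N, trials = 10, 5000
x_test = np.linspace(-1, 1, 101)

# Learn g from many independent data sets D.
preds = np.empty((trials, x_test.size))
for t in range(trials):
    x = rng.uniform(-1, 1, N)
    y = f(x) + sigma * rng.standard_normal(N)
    w = np.polyfit(x, y, 1)                 # best line through the noisy data
    preds[t] = np.polyval(w, x_test)

g_bar = preds.mean(axis=0)                  # average hypothesis g-bar
bias = np.mean((g_bar - f(x_test)) ** 2)    # deterministic noise (for H1)
var = np.mean(preds.var(axis=0))            # indirect impact of the noise
e_out = bias + var + sigma**2               # expected out-of-sample error

print(bias, var, sigma**2)
```

Re-running with a richer H shrinks the bias (less deterministic noise) but grows the variance – the trade-off the lecture turns to next.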
Noise is the Culprit

Overfitting is the disease. Noise is the cause: learning is led astray by fitting the noise more than the signal.

Cures:

• Regularization: putting on the brakes.
• Validation: a reality check from peeking at Eout (the bottom line).
Regularization

[Figures: data, target, and fit – left: no regularization; right: regularization.]