Data Science: Principles and Practice Lecture 7 Guy Emerson


Page 1

Data Science: Principles and Practice

Lecture 7

Guy Emerson

Page 2

Today’s Lecture

• Neural networks:
  • Architectures
  • Training
  • Overfitting

1

Disclaimer: any similarity with biological neural networks is coincidental.

Many data scientists now jump straight for neural network models. Hopefully, the past 6 lectures will help to situate neural nets within a wider range of tools.

Page 3

Features

[Diagram: input → features → prediction, with the steps labelled “engineered” and “trained”]

• Engineering at a more abstract level

2

Many machine learning models can be broken into two steps: feature extraction, and training. Recall from the practicals how we manually defined features (such as for the housing dataset), and then trained a classifier.

Page 4

Features

[Diagram: input → features → prediction, with the steps labelled “engineered” and “trained”]

• Engineering at a more abstract level

2

Neural network models train the features as well. People talk of “end-to-end” training, because all steps are trained, from the input to the output.

Engineering decisions are pushed to a higher level: not in terms of individual features, but in terms of the model architecture.

Page 5

Feedforward Networks

x ↦ f1(x) ↦ f2(f1(x))

• Linear: f(x) = Ax
  • but two linear maps can be simplified: the matrix product AB is just another matrix C

• Nonlinear: f(x) = g(Ax) (g applied componentwise)

• Can approximate any function

3

A feedforward net applies a sequence of functions to map from the input to the output.

If the functions are linear, a sequence of functions doesn’t give us anything – we can express two matrix multiplications as a single matrix. However, with nonlinear functions, a sequence of functions may be more complicated than a single function. The simplest way to do this is to first use a linear map, and then apply a nonlinear function to each dimension.

The benefit is that we can approximate a complicated function using a sequence of simple functions. We can make the approximation more accurate by having a longer sequence, or by increasing the dimensionality of the intermediate representation f1(x). (The dimensionalities of the input x and output f2(f1(x)) are fixed by the data.)

In practice, there will usually also be a “bias” term: f(x) = g(Ax + b). (In strict mathematical terminology, Ax + b would be called “affine”, rather than “linear”, but in machine learning, many authors use the term “linear”.)
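As a concrete illustration, here is a minimal NumPy sketch of a two-layer feedforward net (the dimensions, weights and input are invented for illustration; a real model would learn A1, b1, A2, b2 from data):

```python
import numpy as np

def relu(x):
    # componentwise nonlinearity g
    return np.maximum(x, 0)

rng = np.random.default_rng(0)

# Illustrative dimensions: 2 inputs, 4 hidden units, 1 output.
A1, b1 = rng.normal(size=(4, 2)), np.zeros(4)   # first affine map
A2, b2 = rng.normal(size=(1, 4)), np.zeros(1)   # second affine map

def feedforward(x):
    h = relu(A1 @ x + b1)   # f1(x) = g(A1 x + b1)
    return A2 @ h + b2      # f2(f1(x))

print(feedforward(np.array([1.0, -2.0])))
```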

Page 6

Nonlinear Activation Functions

• 1/(1 + e^(−x))  “sigmoid” (cf. logistic regression)

• (1 − e^(−2x))/(1 + e^(−2x))  “tanh”

• max{x, 0}  “rectified linear”

• log(1 + e^x)  “softplus”

4

[Plot: the sigmoid function over x ∈ [−4, 4], with values between 0 and 1]

We came across the sigmoid function when looking at logistic regression. It is also called the logistic function.

[Plot: the tanh function over x ∈ [−4, 4], with values between −1 and 1]

The tanh function (pronounced “tanch”, short for “hyperbolic tangent”) is important mathematically, but for reasons irrelevant here. It’s a rescaled sigmoid function, bounded between −1 and 1, and shrunk in the x direction.

[Plot: the rectified linear and softplus functions over x ∈ [−4, 4], with values between 0 and 4]

The rectified linear unit is possibly the simplest nonlinearity, and fast to calculate. It isn’t differentiable at 0, and to avoid this, we can use the softplus function, which is smoothed out around 0.
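For reference, each of the four activation functions is a line of NumPy (a sketch for experimenting or plotting; deep learning libraries provide their own, numerically more careful implementations):

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def tanh(x):
    return (1 - np.exp(-2 * x)) / (1 + np.exp(-2 * x))  # same as np.tanh(x)

def relu(x):
    return np.maximum(x, 0)

def softplus(x):
    return np.log(1 + np.exp(x))

x = np.linspace(-4, 4, 9)
for f in (sigmoid, tanh, relu, softplus):
    print(f.__name__, np.round(f(x), 3))
```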

Page 7

Nonlinear Decision Boundaries

[Scatter plot: two classes of points, marked ? and ◦, plotted against features x1 and x2; the classes cannot be separated by a single straight line]

Can be done with a decision tree

5

We have seen how decision trees allow us to learn nonlinear decision boundaries.

Page 8

Nonlinear Decision Boundaries

[Scatter plot: the same two classes of points, marked ? and ◦, plotted against features x1 and x2]

Rectified linear units:

r(x1 + x2 − 2) + r(−x1 − x2 + 2) − r(x1 − x2) − r(−x1 + x2)

5

A feedforward net can also learn a nonlinear decision boundary.

Here, r is the rectified linear function. We linearly map the input to a 4-dimensional vector, and then apply r componentwise. We then linearly map this vector to a single number – if it’s above 0, we choose ?, and if it’s below 0, we choose ◦.

This is less interpretable than the decision tree, but neural networks give us a wider class of models.
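One way to see what this formula is doing is to evaluate it directly. A small sketch (the test points are made up, not taken from the plot):

```python
import numpy as np

def r(x):
    # rectified linear function
    return np.maximum(x, 0)

def score(x1, x2):
    # The four-hidden-unit network from the slide:
    # above 0 we choose ?, below 0 we choose ◦
    return (r(x1 + x2 - 2) + r(-x1 - x2 + 2)
            - r(x1 - x2) - r(-x1 + x2))

# A few illustrative points (not the data from the plot):
for x1, x2 in [(0.0, 0.0), (2.0, 2.0), (0.0, 2.0), (2.0, 0.0)]:
    print((x1, x2), '?' if score(x1, x2) > 0 else '◦')
```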

Page 9

Feedforward Networks

[Diagram: x → h → y]

Multiple classes: “softmax” (multiclass logistic regression)

6

We can draw each vector in a neural net as a node in a graph.

Page 10

Feedforward Networks

Multiple classes: “softmax” (multiclass logistic regression)

6

We can also draw individual units (individual dimensions).

In the example from the previous slide, we had two input units, four hidden units, and one output unit (to decide between the two classes).

Page 11

Feedforward Networks

Multiple classes: “softmax” (multiclass logistic regression)

6

For multiple classes, we can use a softmax layer, which has one unit for each class. Mathematically, it’s the same as multiclass logistic regression.
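A softmax layer turns a vector of class scores into probabilities that sum to 1. A minimal sketch (the scores are made up):

```python
import numpy as np

def softmax(z):
    # subtract the max for numerical stability; this doesn't change the result
    e = np.exp(z - np.max(z))
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])   # one score per class (illustrative)
print(softmax(scores))               # probabilities summing to 1
```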

Page 12

“Deep” Feedforward Networks

[Diagram: x → h1 → h2 → h3 → y]

7

A “deep” network is a network with many layers.

The choice of the term “deep” was good for publicity. The word has connotations of being “meaningful” or “serious”, and it sounds much more exciting than “function approximation parametrised by the composition of a sequence of simple functions”.

Page 13

AlphaGo

8

AlphaGo, a Go-playing program based on deep learning, has consistently beaten the world’s best Go players.

It had not been expected that a program could reach this level of performance – at least, not based on traditional approaches to AI.

Image from: https://deepmind.com/alphago-china

Page 14

Diagnosis from medical imaging

9

In a more practical setting, deep learning has also been applied to medical diagnosis.

This image comes from a metastudy, comparing deep learning models and medical professionals, tested on the same medical images. Each dot is one study, and the best performance would be in the top left corner. We can see that overall, the deep learning systems are performing about as well as the professionals.

Liu et al. (2019) “A comparison of deep learning performance against health-care professionals in detecting diseases from medical imaging: a systematic review and meta-analysis” https://www.thelancet.com/journals/landig/article/PIIS2589-7500(19)30123-2/fulltext

Page 15

Sequence Labelling

[Diagram: the sentence “Every picture tells a story”, tagged article, noun, verb, article, noun; each word wi has a hidden vector hi and an output tag ti]

10

In part-of-speech tagging, the task is to label each word in a sentence with the correct part of speech.

Page 16

Sequence Labelling

[Diagram: as on the previous page – one hidden vector hi and one output tag ti per word wi]

10

More generally, “sequence labelling” refers to tasks where we have one output ti for each token wi.

Page 17

Convolutional Neural Net

[Diagram: as before, but each hidden vector hi (and each output ti) is computed from a window of one vector either side in the layer below]

10

In a convolutional neural net (CNN), each hidden vector (and each output) is a function of a window of vectors – in this diagram, a window of one token either side.

The same function is used at each token – e.g. h2 = f(w1, w2, w3) is the same function as h3 = f(w2, w3, w4). Applying the same function across different windows is called a “convolution”.

More precisely, the input vectors are concatenated – given three vectors w1, w2, w3, each with N dimensions, we can view them together as one vector w1:3 with 3N dimensions. We can then apply a normal feedforward layer – e.g. h2 = g(A w1:3 + b), for a matrix A, vector b, and nonlinearity g.
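Putting that description into code, a sketch of one convolutional layer over a sentence (dimensions and vectors are invented; real systems use library convolution operations rather than an explicit Python loop):

```python
import numpy as np

rng = np.random.default_rng(0)
N, H = 5, 8                      # illustrative: 5-dim word vectors, 8 hidden units
words = rng.normal(size=(7, N))  # a sentence of 7 token vectors

A = rng.normal(size=(H, 3 * N))  # one weight matrix shared across all windows
b = np.zeros(H)

def conv_layer(w):
    hs = []
    for i in range(1, len(w) - 1):
        window = np.concatenate([w[i - 1], w[i], w[i + 1]])  # 3N-dimensional
        hs.append(np.maximum(A @ window + b, 0))             # same function at each token
    return np.stack(hs)

print(conv_layer(words).shape)   # (5, 8): one hidden vector per interior token
```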

Page 18

Recurrent Neural Net

[Diagram: as before, but each hidden vector hi is computed from the current word vector wi and the previous hidden vector hi−1]

10

One limitation of a CNN is that each output is a function of a limited window of words (on the previous slide, two words either side). However, for some tasks, we may need to take into account a larger context (for example, long-distance syntactic dependencies).

In a recurrent neural network (RNN), we have a hidden state which is dependent on the current token and the previous hidden state. This means that each prediction is dependent on the current token and all previous tokens. For example, t5 depends on all input tokens – in the CNN on the previous slide, t5 only depended on w3 to w5.

In a “vanilla” RNN, we concatenate wi and hi−1, and use a normal feedforward layer. The same function is used at each token.
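A sketch of a vanilla RNN in the same style (dimensions and vectors are invented; tanh is used as the nonlinearity here, which is a common choice):

```python
import numpy as np

rng = np.random.default_rng(0)
N, H = 5, 8                      # illustrative: 5-dim word vectors, 8-dim hidden state
words = rng.normal(size=(7, N))

A = rng.normal(size=(H, N + H))  # one weight matrix shared across all time steps
b = np.zeros(H)

h = np.zeros(H)                  # initial hidden state
for w in words:
    # concatenate the current token and the previous hidden state,
    # then apply a normal feedforward layer
    h = np.tanh(A @ np.concatenate([w, h]) + b)

print(h.shape)                   # (8,): the final state depends on every token
```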

Page 19

Training a Network

• Loss function: L(ŷ, y)

• Gradient wrt parameters: dL(ŷ, y)/dθ

• Update: θ ← θ − α dL(ŷ, y)/dθ

11

The most common way to train a neural net is using gradient descent, which we saw in a previous lecture.
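As a reminder of the update rule θ ← θ − α dL/dθ, here is a toy sketch with a one-parameter loss whose gradient we can write down by hand:

```python
# Toy loss L(theta) = (theta - 3)^2, with gradient dL/dtheta = 2(theta - 3).
def loss_grad(theta):
    return 2 * (theta - 3)

theta, alpha = 0.0, 0.1          # initial parameter and learning rate
for _ in range(50):
    theta = theta - alpha * loss_grad(theta)

print(theta)                     # close to the minimum at theta = 3
```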

Page 20

Backpropagation

• Chain rule: dL/dθ = (dL/du)(du/dθ)

• Backprop: efficient chain rule

12

Because a neural net is composed of many layers, the gradients can be calculated using the chain rule. An efficient algorithm to apply the chain rule is backpropagation, which we will see in more detail next lecture.
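As a tiny hand-worked example of the chain rule (a sketch, not the general backpropagation algorithm): for L = (y − t)² with y = w·x, we have dL/dw = (dL/dy)(dy/dw) = 2(y − t)x.

```python
import numpy as np

x = np.array([1.0, 2.0])
w = np.array([0.5, -0.5])
t = 1.0                      # target

y = w @ x                    # forward pass
L = (y - t) ** 2

dL_dy = 2 * (y - t)          # outer derivative
dy_dw = x                    # inner derivative
dL_dw = dL_dy * dy_dw        # chain rule
print(L, dL_dw)
```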

Page 21

Backpropagation

[Diagram: x → h1 → h2 → h3 → y]

Forward pass

Backward pass (calculate gradients with chain rule)

13

To train a network, we apply the network forwards to get its predictions, and then use backpropagation to get the gradients.

Page 22

Training Hyperparameters

• Large learning rate:
  • Faster training

• Small learning rate:
  • More stable training

14

We saw in the practicals how the learning rate is an important hyperparameter for gradient descent.

We will now look at some more hyperparameters.

Page 23

Gradient Descent

• Loss per datapoint: L(ŷi, yi)

• Total training loss: ∑_{i=1}^{N} L(ŷi, yi)

• Ideal gradient: d/dθ ∑_{i=1}^{N} L(ŷi, yi)

15

Ideally, the gradients would be calculated according to the total loss across the whole training set.

However, this is computationally expensive for large datasets.

Page 24

Stochastic Gradient Descent

• Ideal gradient: ∑_{i=1}^{N} dL(ŷi, yi)/dθ

• Stochastic gradient: dL(ŷi, yi)/dθ, for i = 1, 2, 3, …

• Minibatch gradient: ∑_{i=j+1}^{j+b} dL(ŷi, yi)/dθ, for j = 0, b, 2b, …

16

Rather than waiting until we’ve covered the entire training set before making an update to the model parameters, stochastic gradient descent (SGD) makes an update for each datapoint.

The downside of this is that the gradients can vary substantially from one datapoint to the next.

A middle ground is to make an update for each “minibatch” of datapoints. The batch size (b above) is an additional hyperparameter.
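In code, the three options differ only in how much data goes into each gradient computation. A sketch of the minibatch version (the gradient function and the data are placeholders for whatever model is being trained):

```python
import numpy as np

def minibatch_sgd(theta, X, Y, grad, alpha=0.01, batch_size=32, epochs=10):
    # grad(theta, X_batch, Y_batch) should return the gradient of the loss
    # summed over the batch; with batch_size=1 this is plain SGD,
    # with batch_size=len(X) it is full-batch gradient descent.
    n = len(X)
    for _ in range(epochs):
        order = np.random.permutation(n)          # shuffle the data each epoch
        for j in range(0, n, batch_size):
            batch = order[j:j + batch_size]
            theta = theta - alpha * grad(theta, X[batch], Y[batch])
    return theta
```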

Page 25

Training Hyperparameters

• Small batch size, large learning rate:
  • Faster training

• Large batch size, small learning rate:
  • More stable training

17

Varying the batch size has a similar effect to varying the learning rate.

Page 26

Overfitting

• Neural nets have many parameters

• Easy to overfit

• Some solutions:
  • Early stopping
  • Regularisation
  • Dropout

18

Overfitting is a general problem, not just for neural nets. However, neural nets can have so many parameters that it becomes a serious problem.

Early stopping and regularisation are also applicable to non-neural models, but dropout is specific to neural nets.

Page 27

Early stopping

[Plot: loss L against number of epochs, for training data and dev data; the training loss keeps decreasing, the dev loss eventually rises, and the optimal stopping point is where the dev loss is lowest]

19

If a model overfits, the loss on the training set will decrease, but the loss on a held-out test set will increase.

The number of epochs (number of passes over the training data) can be treated as an additional hyperparameter, to be tuned on a development set.

Benefit: easy to implement.

Drawback: cannot benefit from further computation time.
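A sketch of the early-stopping loop (train_one_epoch and dev_loss are passed in as placeholders for whatever model and training code is being used; patience is how many epochs without improvement we tolerate):

```python
def train_with_early_stopping(train_one_epoch, dev_loss,
                              patience=3, max_epochs=100):
    best_loss, best_epoch = float('inf'), 0
    epochs_without_improvement = 0
    for epoch in range(max_epochs):
        train_one_epoch()                    # one pass over the training data
        loss = dev_loss()                    # loss on the held-out dev data
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break                        # dev loss has stopped improving
    return best_epoch
```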

Page 28

Regularisation

• Penalise “bad” parameters:

  L = Lerr(ŷi, yi) + Lreg(θ)

• For example:

  L1(θ) = ∑i |θi|

  L2(θ) = ∑i |θi|²

20

Rather than stopping training just before we overfit, we can try to avoid the model overfitting in the first place.

Regularisation does this by adding an additional term to the loss function, which penalises overfitted parameters.

To design a regularisation term, we need to understand what overfitted parameters look like. A simple approach is to assume that very large values may be overfitted, since the model may be relying on these too much. L1 regularisation penalises the absolute value, and L2 regularisation penalises the squared value.

Using regularisation adds hyperparameters, e.g. L = Lerr + λ1 L1 + λ2 L2.
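A sketch of how the regularised loss is assembled (Lerr is whatever task loss is being used; λ1 and λ2 are the extra hyperparameters; the names here are illustrative):

```python
import numpy as np

def regularised_loss(err, theta, lambda1=0.0, lambda2=0.01):
    # err is the task loss L_err; theta is a flat vector of model parameters
    l1 = np.sum(np.abs(theta))       # L1 penalty
    l2 = np.sum(theta ** 2)          # L2 penalty
    return err + lambda1 * l1 + lambda2 * l2
```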

Page 29

Dropout

• Less dependent on specific units

• More robust to noise

21

With dropout, each time we apply the network during training, we randomly switch off some units.

The probability of switching off each unit can be treated as an additional hyperparameter. Using a probability of 0.5 usually works well.

At test time, we can use the expected activation of each unit. For example, with a dropout probability of 0.5, we use all units but with half the activation.

The best way to apply dropout can depend on the architecture. Further reading: Gal and Ghahramani (2016) give a Bayesian analysis of dropout and apply it to RNNs. This is an example of how theory can take time to catch up with practice.
https://papers.nips.cc/paper/6241-a-theoretically-grounded-application-of-dropout-in-recurrent-neural-networks
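A sketch of dropout on a single layer’s activations (this is the basic form described above; most libraries instead rescale the kept units at training time, which has the same effect in expectation):

```python
import numpy as np

def dropout(h, p=0.5, training=True):
    if training:
        mask = np.random.rand(*h.shape) >= p   # switch off each unit with probability p
        return h * mask
    return h * (1 - p)                         # test time: expected activation
```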

Page 30

What we’ve covered

• Neural nets, activation functions

• Architectures: CNNs, RNNs

• Training by gradient descent

• Early stopping, regularisation, dropout

22