
Backpropagation: Understanding How to Update Artificial Neural Networks Weights Step by Step

By
Ahmed Fawzy Gad
Information Technology (IT) Department
Faculty of Computers and Information (FCI)
Menoufia University, Egypt
[email protected]
13-Nov-2017


This article discusses how the backpropagation algorithm updates the weights of artificial neural networks (ANNs), using two step-by-step examples. Readers should have a basic understanding of how ANNs work, partial derivatives, and the multivariate chain rule.

The article does not dive directly into the details of the algorithm but starts by training a very simple network. This is because the backpropagation algorithm is applied to a network that has already been run, so we should train the network first in order to appreciate the benefits of the backpropagation algorithm and how to use it.

The following figure shows the network structure of the first example used to explain how the backpropagation algorithm works. It has just two inputs, symbolized as $X_1$ and $X_2$. The output layer has a single neuron and there are no hidden layers. Each input has a corresponding weight, where $W_1$ and $W_2$ are the weights for $X_1$ and $X_2$ respectively. There is a bias for the output layer neuron with a fixed value of $+1$ and a weight $b$.

The output layer neuron uses the sigmoid activation function defined by the following equation:

$f(s) = \frac{1}{1 + e^{-s}}$

where $s$ is the sum of products (sop) between each input and its corresponding weight. $s$ is the input to the activation function and, in this example, is defined as follows:

$s = X_1 W_1 + X_2 W_2 + b$

The following table shows a single input record and its corresponding desired output, used as the training data. The basic target of this example is not training the network but understanding how the weights are updated using backpropagation. This is why just a single record of data is used, to strip away the complexity of the other operations and concentrate on backpropagation alone.

π‘ΏπŸ π‘ΏπŸ π‘«π’†π’”π’Šπ’“π’†π’… 𝑢𝒖𝒕𝒑𝒖𝒕 𝟎. 𝟏 𝟎. πŸ‘ 𝟎. πŸŽπŸ‘

Assume the initial values for both weights and bias are as follows:

π‘ΎπŸ π‘ΎπŸ 𝒃 𝟎. πŸ“ 𝟎. 𝟐 𝟏. πŸ–πŸ‘

For simplicity, the values for all inputs, weights, and bias will be added to the network diagram

to look as follows:

Now, let us train the network and see whether the desired output will be returned based on the current weights and bias. The input of the activation function is the sop between each input and its weight, with the bias added to the total:

𝒔 = π‘ΏπŸ βˆ— π‘ΎπŸ + π‘ΏπŸ βˆ— π‘ΎπŸ + 𝒃

𝒔 = 𝟎. 𝟏 βˆ— 𝟎. πŸ“ + 𝟎. πŸ‘ βˆ— 𝟎. 𝟐 + 𝟏. πŸ–πŸ‘

𝒔 = 𝟏. πŸ—πŸ’

The output of the activation function is calculated by applying the previously calculated sop to the sigmoid function:

$f(s) = \frac{1}{1 + e^{-s}} = \frac{1}{1 + e^{-1.94}} = \frac{1}{1 + 0.143703949} = \frac{1}{1.143703949}$
$f(s) = 0.874352143$

The output of the activation function reflects the predicted output for the current inputs. It is obvious that there is a difference between the desired and the predicted output. But what are the sources of that difference? How should the predicted output be changed to get closer to the desired result? These questions will be answered later. For now, let us measure the error of our NN based on an error function.

The error function tells how close the predicted outputs are to the desired outputs. The optimal value for the error is zero, meaning there is no error at all and the desired and predicted results are identical. One common error function is the squared error function:

$E = \frac{1}{2}(\mathit{desired} - \mathit{predicted})^2$

Note that the $\frac{1}{2}$ added to the equation is just for simplifying the derivatives later.

We can measure the error of our network as follows:

$E = \frac{1}{2}(0.03 - 0.874352143)^2 = \frac{1}{2}(-0.844352143)^2 = \frac{1}{2}(0.712930541)$
$E = 0.356465271$
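The forward pass and the error measurement above can be reproduced in a few lines of Python. The following is a minimal sketch (the names sigmoid, w1, w2, and b are illustrative, not from the article) that recomputes s, f(s), and E for the values used here:

```python
import math

def sigmoid(s):
    # f(s) = 1 / (1 + e^(-s))
    return 1.0 / (1.0 + math.exp(-s))

# Inputs, weights, and bias from the example
x1, x2 = 0.1, 0.3
w1, w2, b = 0.5, 0.2, 1.83
desired = 0.03

# Forward pass: sum of products (sop) plus the bias, then the activation
s = x1 * w1 + x2 * w2 + b                  # 1.94
predicted = sigmoid(s)                     # ~0.874352143

# Squared error with the 1/2 factor
error = 0.5 * (desired - predicted) ** 2   # ~0.356465271
print(s, predicted, error)
```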

The result confirms the existence of a large error (~0.357). The error just gives us an indication of how far the predicted results are from the desired results. But after knowing that there is an error, what should we do? Because there is an error, we should minimize it. But how? There must be something to change in order to minimize the error. The only adjustable parameters we have are the weights. We can try different weights and then test our network.

Weights Update Equation

The weights can be changed according to the following equation:

$W(n+1) = W(n) + \eta[d(n) - Y(n)]X(n)$


where:

$n$: training step (0, 1, 2, ...).
$W(n)$: weights in the current training step, $W(n) = [b(n), W_1(n), W_2(n), W_3(n), \ldots, W_m(n)]$.
$\eta$: the network learning rate.
$d(n)$: desired output.
$Y(n)$: predicted output.
$X(n)$: current input at which the network made a false prediction.

For our network, these parameters have the following values:

$n$: 0
$W(n)$: [1.83, 0.5, 0.2]
$\eta$: a hyperparameter. We can choose it to be 0.01, for example.
$d(n)$: [0.03]
$Y(n)$: [0.874352143]
$X(n)$: [+1, 0.1, 0.3]. The first value (+1) is for the bias.

We can update our neural network weights based on the previous equation:

$W(n+1) = W(n) + \eta[d(n) - Y(n)]X(n)$
$= [1.83, 0.5, 0.2] + 0.01[0.03 - 0.874352143][+1, 0.1, 0.3]$
$= [1.83, 0.5, 0.2] + 0.01[-0.844352143][+1, 0.1, 0.3]$
$= [1.83, 0.5, 0.2] + (-0.00844352143)[+1, 0.1, 0.3]$
$= [1.83, 0.5, 0.2] + [-0.008443521, -0.000844352, -0.002533056]$
$= [1.821556479, 0.499155648, 0.197466943]$

The new weights are as follows:

π‘ΎπŸπ’π’†π’˜ π‘ΎπŸπ’π’†π’˜ π’ƒπ’π’†π’˜ 𝟎. πŸπŸ—πŸ•πŸ’πŸ”πŸ”πŸ—πŸ’πŸ‘ 𝟎. πŸ’πŸ—πŸ—πŸπŸ“πŸ“πŸ”πŸ’πŸ– 𝟏. πŸ–πŸπŸπŸ“πŸ“πŸ”πŸ’πŸ•πŸ—

Based on the new weights, we recalculate the predicted output and continue updating the weights and recalculating the predicted output until an acceptable error value for the problem at hand is reached.

Here we successfully updated the weights without using the backpropagation algorithm. Are we

still in need of that algorithm? Yes. The reasons will be explained next.

Why is the Backpropagation Algorithm Important?

Suppose, in the best case, that the weight update equation generated the optimal weights; it is still unknown what this equation actually did. It is like a black box whose internal operations we do not understand. All we know is that we should apply this equation whenever there is a classification error, and it will generate new weights to be used in the next training steps. But why are the new weights better at prediction? What is the effect of each weight on the prediction error? How does increasing or decreasing one or more weights affect the prediction error?

We need a better understanding of how the best weights are calculated. To do that, we should use the backpropagation algorithm. It helps us understand how each weight affects the total error of the NN and tells us how to minimize the error to a value very close to zero.

Forward Vs. Backward Passes

When training a neural network, there are two passes: forward and backward, as in the next figure. The start is always the forward pass, in which the inputs are applied to the input layer and move toward the output layer: calculating the sop between the inputs and weights, applying the activation functions to generate outputs, and finally calculating the prediction error to know how accurate the current network is.

But what if there is a prediction error? We should modify the network to reduce that error. This is done in the backward pass. In the forward pass, we start from the inputs and stop after calculating the prediction error. In the backward pass, we start from the error and move back until reaching the inputs. The goal of this pass is to know how each weight affects the total error. Knowing the relationship between the weight and the error allows us to modify the network weights to decrease the error. For example, the backward pass can give us useful information such as: increasing the current value of $W_1$ by 1.0 will increase the prediction error by 0.07. This makes us understand how to select the new value of $W_1$ in order to minimize the error ($W_1$ should not be increased).

Partial Derivative


One important operation used in the backward pass is calculating derivatives. Before getting into the calculation of derivatives in the backward pass, we can start with a simple example to make things easier.

For a multivariate function like $Y = X^2 Z + H$, what is the effect on the output $Y$ of a change in the variable $X$? This question is answered using the partial derivative, which is written as follows:

$\frac{\partial Y}{\partial X} = \frac{\partial}{\partial X}(X^2 Z + H)$
$\frac{\partial Y}{\partial X} = 2XZ + 0$
$\frac{\partial Y}{\partial X} = 2XZ$

Note that everything except $X$ is regarded as a constant. This is why $H$ is replaced by 0 after calculating the partial derivative. Here $\partial X$ means a tiny change in the variable $X$, and $\partial Y$ means a tiny change in $Y$; the change in $Y$ is the result of changing $X$. By making a very small change in $X$, what is the effect on $Y$? The small change can be an increase or decrease by a tiny value such as 0.01. By substituting different values of $X$, we can find how $Y$ changes with respect to $X$.
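The meaning of the partial derivative can also be checked numerically. The short sketch below (with illustrative values for X, Z, and H) compares the analytic result 2XZ with the change in Y produced by a tiny change in X while Z and H are held constant:

```python
def y(x, z, h):
    # Y = X^2 * Z + H
    return x ** 2 * z + h

x, z, h = 2.0, 3.0, 5.0
dx = 1e-6   # a tiny change in X

# Finite-difference approximation of dY/dX versus the analytic derivative
numeric = (y(x + dx, z, h) - y(x, z, h)) / dx
analytic = 2 * x * z
print(numeric, analytic)   # both ~12.0
```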

The same procedure will be followed in order to know how the NN prediction error changes wrt changes in the network weights. So, our target is to calculate $\frac{\partial E}{\partial W_1}$ and $\frac{\partial E}{\partial W_2}$, as we have just two weights $W_1$ and $W_2$. Let us calculate them.

Change in Prediction Error wrt Weights

Looking at the equation $Y = X^2 Z + H$, it seems straightforward to calculate the partial derivative $\frac{\partial Y}{\partial X}$ because there is an equation relating $Y$ and $X$ directly. But there is no direct equation relating the prediction error to the weights. This is why we are going to use the multivariate chain rule to find the partial derivative of the error wrt the weights.

Prediction Error to Weights Chain

Let us try to find the chain relating the prediction error to the weights. The prediction error is calculated based on this equation:

$E = \frac{1}{2}(\mathit{desired} - \mathit{predicted})^2$

But this equation doesn't contain any weights. No problem; we can follow the calculations for each input of the previous equation until reaching the weights. The desired output is a constant, so there is no way to reach the weights through it. The predicted output, however, is calculated based on this equation:


$f(s) = \frac{1}{1 + e^{-s}}$

Again, the equation for calculating the predicted output doesn't contain any weights. But there is still the variable $s$ (sop), which depends on the weights for its calculation according to this equation:

$s = X_1 W_1 + X_2 W_2 + b$

Here is the chain to be followed to get the weights:

As a result, to know how the prediction error changes wrt changes in the weights, we should do some intermediate operations: finding how the prediction error changes wrt changes in the predicted output, then finding the relation between the predicted output and the sop, and finally finding how the sop changes by changing the weights. There are four intermediate partial derivatives, as follows:

$\frac{\partial E}{\partial \mathit{predicted}}$, $\frac{\partial \mathit{predicted}}{\partial s}$, $\frac{\partial s}{\partial W_1}$ and $\frac{\partial s}{\partial W_2}$

This chain will finally tell how the prediction error changes wrt changes in each weight, which is our goal, by multiplying all of the individual partial derivatives as follows:

$\frac{\partial E}{\partial W_1} = \frac{\partial E}{\partial \mathit{predicted}} \cdot \frac{\partial \mathit{predicted}}{\partial s} \cdot \frac{\partial s}{\partial W_1}$

$\frac{\partial E}{\partial W_2} = \frac{\partial E}{\partial \mathit{predicted}} \cdot \frac{\partial \mathit{predicted}}{\partial s} \cdot \frac{\partial s}{\partial W_2}$

Important note: Currently there is no equation directly relating the prediction error to the network weights, but we can create one relating them and apply the partial derivative directly to it. The equation would be as follows:

$E = \frac{1}{2}\left(\mathit{desired} - \frac{1}{1 + e^{-(X_1 W_1 + X_2 W_2 + b)}}\right)^2$

Because this equation is complex, we use the multivariate chain rule for simplicity.

Calculating Chain Partial Derivatives


Let us calculate the partial derivatives of each part of the chain previously created.

Error-Predicted Output Partial Derivative:

$\frac{\partial E}{\partial \mathit{predicted}} = \frac{\partial}{\partial \mathit{predicted}}\left(\frac{1}{2}(\mathit{desired} - \mathit{predicted})^2\right)$
$= 2 \cdot \frac{1}{2}(\mathit{desired} - \mathit{predicted})^{2-1} \cdot (0 - 1)$
$= (\mathit{desired} - \mathit{predicted}) \cdot (-1)$
$= \mathit{predicted} - \mathit{desired}$

By substituting the values:

$\frac{\partial E}{\partial \mathit{predicted}} = \mathit{predicted} - \mathit{desired} = 0.874352143 - 0.03$
$\frac{\partial E}{\partial \mathit{predicted}} = 0.844352143$

Predicted Output-sop Partial Derivative:

$\frac{\partial \mathit{predicted}}{\partial s} = \frac{\partial}{\partial s}\left(\frac{1}{1 + e^{-s}}\right)$

Remember: the quotient rule can be used to find the derivative of the sigmoid function, which gives:

$\frac{\partial \mathit{predicted}}{\partial s} = \frac{1}{1 + e^{-s}}\left(1 - \frac{1}{1 + e^{-s}}\right)$

By values substitution, ππ‘·π’“π’†π’…π’Šπ’„π’•π’†π’…

𝝏𝒔=

𝟏

𝟏 + π’†βˆ’π’”(𝟏 βˆ’

𝟏

𝟏 + π’†βˆ’π’”) =

𝟏

𝟏 + π’†βˆ’πŸ.πŸ—πŸ’(𝟏 βˆ’

𝟏

𝟏 + π’†βˆ’πŸ.πŸ—πŸ’)

=𝟏

𝟏 + 𝟎. πŸπŸ’πŸ‘πŸ•πŸŽπŸ‘πŸ—πŸ’πŸ—(𝟏 βˆ’

𝟏

𝟏 + 𝟎. πŸπŸ’πŸ‘πŸ•πŸŽπŸ‘πŸ—πŸ’πŸ—)

=𝟏

𝟏. πŸπŸ’πŸ‘πŸ•πŸŽπŸ‘πŸ—πŸ’πŸ—(𝟏 βˆ’

𝟏

𝟏. πŸπŸ’πŸ‘πŸ•πŸŽπŸ‘πŸ—πŸ’πŸ—)

= 𝟎. πŸ–πŸ•πŸ’πŸ‘πŸ“πŸπŸπŸ’πŸ‘(𝟏 βˆ’ 𝟎. πŸ–πŸ•πŸ’πŸ‘πŸ“πŸπŸπŸ’πŸ‘)

= 𝟎. πŸ–πŸ•πŸ’πŸ‘πŸ“πŸπŸπŸ’πŸ‘(𝟎. πŸπŸπŸ“πŸ”πŸ’πŸ•πŸ–πŸ“πŸ•) ππ‘·π’“π’†π’…π’Šπ’„π’•π’†π’…

𝝏𝒔= 𝟎. πŸπŸŽπŸ—πŸ–πŸ”πŸŽπŸ’πŸ•πŸ‘

Sop-π‘ΎπŸ Partial Derivative: 𝝏𝒔

ππ‘ΎπŸ=

𝝏

ππ‘ΎπŸ(π‘ΏπŸ βˆ— π‘ΎπŸ + π‘ΏπŸ βˆ— π‘ΎπŸ + 𝒃)

Page 10: Backpropagation: Understanding How to Update Artificial Neural Networks Weights Step by Step

9

= 𝟏 βˆ— π‘ΏπŸ βˆ— (π‘ΎπŸ)(πŸβˆ’πŸ) + 𝟎 + 𝟎

= π‘ΏπŸ βˆ— (π‘ΎπŸ)(𝟎)

= π‘ΏπŸ(𝟏) 𝝏𝒔

ππ‘ΎπŸ= π‘ΏπŸ

By values substitution, 𝝏𝒔

ππ‘ΎπŸ= π‘ΏπŸ = 𝟎. 𝟏

Sop-π‘ΎπŸ Partial Derivative: 𝝏𝒔

ππ‘ΎπŸ=

𝝏

ππ‘ΎπŸ(π‘ΏπŸ βˆ— π‘ΎπŸ + π‘ΏπŸ βˆ— π‘ΎπŸ + 𝒃)

= 𝟎 + 𝟏 βˆ— π‘ΏπŸ βˆ— (π‘ΎπŸ)(πŸβˆ’πŸ) + 𝟎

= π‘ΏπŸ βˆ— (π‘ΎπŸ)(𝟎)

= π‘ΏπŸ(𝟏) 𝝏𝒔

ππ‘ΎπŸ= π‘ΏπŸ

By values substitution, 𝝏𝒔

ππ‘ΎπŸ= π‘ΏπŸ = 𝟎. πŸ‘

After calculating each individual derivative, we can multiply all of them to get the desired

relationship between the prediction error and each weight.

Prediction Error-π‘ΎπŸ Partial Derivative: 𝝏𝑬

ππ‘ΎπŸ= 𝟎. πŸ–πŸ’πŸ’πŸ‘πŸ“πŸπŸπŸ’πŸ‘ βˆ— 𝟎. πŸπŸŽπŸ—πŸ–πŸ”πŸŽπŸ’πŸ•πŸ‘ βˆ— 𝟎. 𝟏

𝝏𝑬

ππ‘ΎπŸ= 𝟎. πŸŽπŸŽπŸ—πŸπŸ•πŸ”πŸŽπŸ—πŸ‘

Prediction Error-π‘ΎπŸ Partial Derivative: 𝝏𝑬

ππ‘ΎπŸ= 𝟎. πŸ–πŸ’πŸ’πŸ‘πŸ“πŸπŸπŸ’πŸ‘ βˆ— 𝟎. πŸπŸŽπŸ—πŸ–πŸ”πŸŽπŸ’πŸ•πŸ‘ βˆ— 𝟎. πŸ‘

𝝏𝑬

ππ‘ΎπŸ= 𝟎. πŸŽπŸπŸ•πŸ–πŸπŸ–πŸπŸ•πŸ–


Finally, we have two values reflecting how the prediction error changes with respect to the weights (0.009276093 for $W_1$ and 0.027828278 for $W_2$). But what do these values mean? The results need interpretation.

Interpreting the Results of Backpropagation

There are two useful conclusions to draw from each of the last two derivatives:

1. The derivative sign
2. The derivative magnitude (DM)

If the derivative is positive, increasing the weight will increase the error; in other words, decreasing the weight will decrease the error.

If the derivative is negative, increasing the weight will decrease the error; in other words, decreasing the weight will increase the error.

But by how much will the error increase or decrease? The derivative magnitude tells us. For a positive derivative, increasing the weight by $p$ increases the error by approximately $DM \times p$. For a negative derivative, increasing the weight by $p$ decreases the error by approximately $DM \times p$.

Because the result of the $\frac{\partial E}{\partial W_1}$ derivative is positive, if $W_1$ increases by 1 then the total error increases by 0.009276093. Also, because the result of the $\frac{\partial E}{\partial W_2}$ derivative is positive, if $W_2$ increases by 1 then the total error increases by 0.027828278.

Updating Weights

After successfully calculating the derivatives of the error with respect to each individual weight, we can update the weights in order to enhance the prediction.

π‘ΎπŸπ’π’†π’˜ = π‘ΎπŸ βˆ’ Ξ· βˆ—ππ‘¬

ππ‘ΎπŸ

= 𝟎. πŸ“ βˆ’ 0.01 βˆ— 𝟎. πŸŽπŸŽπŸ—πŸπŸ•πŸ”πŸŽπŸ—πŸ‘

π‘ΎπŸπ’π’†π’˜ = 𝟎. πŸ’πŸ—πŸ—πŸ—πŸŽπŸ•πŸπŸ‘πŸ—πŸŽπŸ•

For second weight,

π‘ΎπŸπ’π’†π’˜ = π‘ΎπŸ βˆ’ Ξ· βˆ—ππ‘¬

ππ‘ΎπŸ

= 𝟎. 𝟐 βˆ’ 0.01 βˆ— 𝟎. πŸŽπŸπŸ•πŸ–πŸπŸ–πŸπŸ•πŸ–

π‘ΎπŸπ’π’†π’˜ = 𝟎. πŸπŸ—πŸ—πŸ•πŸπŸπŸ•πŸπŸ•πŸ

Note that the derivative is subtracted from the weight, not added, because the derivative is positive and the weight must therefore be decreased to reduce the error.


Then continue the process of prediction and updating the weights until the desired outputs are

generated with an acceptable error.
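That repeated cycle of predicting, backpropagating, and updating can be put into a loop. The following sketch repeats the steps of this first example for a fixed number of iterations and prints the shrinking error; the loop length, the decision to also update the bias through its +1 input, and the variable names are illustrative assumptions, not part of the article:

```python
import math

x1, x2, desired = 0.1, 0.3, 0.03
w1, w2, b = 0.5, 0.2, 1.83
eta = 0.01   # learning rate, as in the article

for step in range(10001):
    # Forward pass
    s = x1 * w1 + x2 * w2 + b
    predicted = 1.0 / (1.0 + math.exp(-s))
    error = 0.5 * (desired - predicted) ** 2
    if step % 2000 == 0:
        print(step, error)   # watch the error decrease

    # Backward pass: dE/ds, shared by all weights
    grad_s = (predicted - desired) * predicted * (1.0 - predicted)

    # Gradient-descent updates (updating the bias too is an added assumption)
    w1 -= eta * grad_s * x1
    w2 -= eta * grad_s * x2
    b  -= eta * grad_s * 1.0
```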


Second Example – Backpropagation for an NN with a Hidden Layer

To make the ideas clearer, we can apply the backpropagation algorithm to the following NN after adding one hidden layer with two neurons.

The same inputs, output, activation function, and learning rate used previously will also be

applied in this example. Here are the complete weights of the network:

π‘ΎπŸ π‘ΎπŸ π‘ΎπŸ‘ π‘ΎπŸ’ π‘ΎπŸ“ π‘ΎπŸ” π’ƒπŸ π’ƒπŸ π’ƒπŸ‘ 𝟎. πŸ“ 𝟎. 𝟏 𝟎. πŸ”πŸ 𝟎. 𝟐 βˆ’πŸŽ. 𝟐 𝟎. πŸ‘ 𝟎. πŸ’ βˆ’πŸŽ. 𝟏 𝟏. πŸ–πŸ‘

The following diagram shows the previous network with all inputs and weights added.


At first, we should go through the forward pass to get the predicted output. If there is an error in the prediction, we then go through the backward pass to update the weights according to the backpropagation algorithm.

Let us calculate the input to the first neuron in the hidden layer ($h_1$):

$h1_{in} = X_1 W_1 + X_2 W_2 + b_1$
$= 0.1 \times 0.5 + 0.3 \times 0.1 + 0.4$
$h1_{in} = 0.48$

The input to the second neuron in the hidden layer ($h_2$):

$h2_{in} = X_1 W_3 + X_2 W_4 + b_2$
$= 0.1 \times 0.62 + 0.3 \times 0.2 - 0.1$
$h2_{in} = 0.022$

The output of the first neuron of the hidden layer:

$h1_{out} = \frac{1}{1 + e^{-h1_{in}}} = \frac{1}{1 + e^{-0.48}} = \frac{1}{1.619}$
$h1_{out} = 0.618$

The output of the second neuron of the hidden layer:

$h2_{out} = \frac{1}{1 + e^{-h2_{in}}} = \frac{1}{1 + e^{-0.022}} = \frac{1}{1.978}$
$h2_{out} = 0.506$

The next step is to calculate the input to the output neuron:

$out_{in} = h1_{out} W_5 + h2_{out} W_6 + b_3$
$= 0.618 \times (-0.2) + 0.506 \times 0.3 + 1.83$
$out_{in} = 1.858$

The output of the output neuron:


$out_{out} = \frac{1}{1 + e^{-out_{in}}} = \frac{1}{1 + e^{-1.858}} = \frac{1}{1 + 0.156} = \frac{1}{1.156}$
$out_{out} = 0.865$

Thus the predicted output of our NN based on the current weights is 0.865. We can then calculate the prediction error according to the following equation:

$E = \frac{1}{2}(\mathit{desired} - out_{out})^2 = \frac{1}{2}(0.03 - 0.865)^2 = \frac{1}{2}(-0.835)^2 = \frac{1}{2}(0.697)$
$E = 0.349$

The error seems very high and thus we should update the network weights using the

backpropagation algorithm.
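The forward pass of this second network can be sketched in the same way. The helper names below are illustrative; the weights and inputs are the ones listed above, and the printed values should be close to 0.865 and 0.349 (small differences come from the rounding used in the worked example):

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

x1, x2, desired = 0.1, 0.3, 0.03
w1, w2, w3, w4, w5, w6 = 0.5, 0.1, 0.62, 0.2, -0.2, 0.3
b1, b2, b3 = 0.4, -0.1, 1.83

# Hidden layer
h1_out = sigmoid(x1 * w1 + x2 * w2 + b1)   # ~0.618 (h1_in = 0.48)
h2_out = sigmoid(x1 * w3 + x2 * w4 + b2)   # ~0.506 (h2_in = 0.022)

# Output layer
out_out = sigmoid(h1_out * w5 + h2_out * w6 + b3)   # ~0.865

error = 0.5 * (desired - out_out) ** 2              # ~0.349
print(out_out, error)
```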

Partial Derivatives

Our goal is to find how the total error $E$ changes wrt each of the 6 weights ($W_1$ to $W_6$):

$\frac{\partial E}{\partial W_1}, \frac{\partial E}{\partial W_2}, \frac{\partial E}{\partial W_3}, \frac{\partial E}{\partial W_4}, \frac{\partial E}{\partial W_5}, \frac{\partial E}{\partial W_6}$

Let us start by calculating the partial derivatives of the error wrt the hidden-output layer weights ($W_5$ and $W_6$).

Eβˆ’π‘ΎπŸ“ Parial Derivative:

Starting by π‘ΎπŸ“, we will follow that chain: 𝝏𝑬

ππ‘ΎπŸ“=

𝝏𝑬

ππ’π’–π’•π’π’–π’•βˆ—

𝝏𝒐𝒖𝒕𝒐𝒖𝒕

ππ’π’–π’•π’Šπ’βˆ—

ππ’π’–π’•π’Šπ’

ππ‘ΎπŸ“

We can calculate each individual part at first and then combine them to get the desired

derivative.


For the first derivative $\frac{\partial E}{\partial out_{out}}$:

$\frac{\partial E}{\partial out_{out}} = \frac{\partial}{\partial out_{out}}\left(\frac{1}{2}(\mathit{desired} - out_{out})^2\right)$
$= 2 \cdot \frac{1}{2}(\mathit{desired} - out_{out})^{2-1} \cdot (0 - 1)$
$= (\mathit{desired} - out_{out}) \cdot (-1)$
$= out_{out} - \mathit{desired}$

By substituting the values of these variables:

$\frac{\partial E}{\partial out_{out}} = out_{out} - \mathit{desired} = 0.865 - 0.03$
$\frac{\partial E}{\partial out_{out}} = 0.835$

For the second derivative $\frac{\partial out_{out}}{\partial out_{in}}$:

$\frac{\partial out_{out}}{\partial out_{in}} = \frac{\partial}{\partial out_{in}}\left(\frac{1}{1 + e^{-out_{in}}}\right)$
$= \left(\frac{1}{1 + e^{-out_{in}}}\right)\left(1 - \frac{1}{1 + e^{-out_{in}}}\right)$
$= \left(\frac{1}{1 + e^{-1.858}}\right)\left(1 - \frac{1}{1 + e^{-1.858}}\right)$
$= \left(\frac{1}{1.156}\right)\left(1 - \frac{1}{1.156}\right)$
$= (0.865)(1 - 0.865) = (0.865)(0.135)$
$\frac{\partial out_{out}}{\partial out_{in}} = 0.117$

For the last derivative $\frac{\partial out_{in}}{\partial W_5}$:

$\frac{\partial out_{in}}{\partial W_5} = \frac{\partial}{\partial W_5}(h1_{out} W_5 + h2_{out} W_6 + b_3)$
$= 1 \cdot h1_{out} \cdot (W_5)^{1-1} + 0 + 0$
$= h1_{out}$
$\frac{\partial out_{in}}{\partial W_5} = 0.618$

After calculating all three required derivatives, we can calculate the target derivative as follows:

$\frac{\partial E}{\partial W_5} = \frac{\partial E}{\partial out_{out}} \cdot \frac{\partial out_{out}}{\partial out_{in}} \cdot \frac{\partial out_{in}}{\partial W_5}$
$\frac{\partial E}{\partial W_5} = 0.835 \times 0.117 \times 0.618$
$\frac{\partial E}{\partial W_5} = 0.060$

Eβˆ’π‘ΎπŸ” Parial Derivative:

For calculating πœ•πΈ

πœ•π‘Š6, we will use the following chain:

𝝏𝑬

ππ‘ΎπŸ”=

𝝏𝑬

ππ’π’–π’•π’π’–π’•βˆ—

𝝏𝒐𝒖𝒕𝒐𝒖𝒕

ππ’π’–π’•π’Šπ’βˆ—

ππ’π’–π’•π’Šπ’

ππ‘ΎπŸ”

The same previous calculations will be repeated with just a change in the last derivative πœ•π‘œπ‘’π‘‘π‘–π‘›

πœ•π‘Š6.

It can be calculated as follows:

$\frac{\partial out_{in}}{\partial W_6} = \frac{\partial}{\partial W_6}(h1_{out} W_5 + h2_{out} W_6 + b_3)$
$= 0 + 1 \cdot h2_{out} \cdot (W_6)^{1-1} + 0$
$= h2_{out}$
$\frac{\partial out_{in}}{\partial W_6} = 0.506$

Finally, the derivative $\frac{\partial E}{\partial W_6}$ can be calculated:

$\frac{\partial E}{\partial W_6} = \frac{\partial E}{\partial out_{out}} \cdot \frac{\partial out_{out}}{\partial out_{in}} \cdot \frac{\partial out_{in}}{\partial W_6}$
$= 0.835 \times 0.117 \times 0.506$
$\frac{\partial E}{\partial W_6} = 0.049$

This is for π‘Š5 and π‘Š6. Let's calculate the derivative wrt to π‘Š1 to π‘Š4.

Eβˆ’π‘ΎπŸ Parial Derivative:

Starting by π‘Š1, we will follow that chain: 𝝏𝑬

ππ‘ΎπŸ=

𝝏𝑬

ππ’π’–π’•π’π’–π’•βˆ—

𝝏𝒐𝒖𝒕𝒐𝒖𝒕

ππ’π’–π’•π’Šπ’βˆ—

ππ’π’–π’•π’Šπ’

ππ’‰πŸπ’π’–π’•βˆ—

ππ’‰πŸπ’π’–π’•

ππ’‰πŸπ’Šπ’βˆ—

ππ’‰πŸπ’Šπ’

ππ‘ΎπŸ

We will follow the same previous procedure by calculating each individual derivative and finally

combining all of them.


The first two derivatives, $\frac{\partial E}{\partial out_{out}}$ and $\frac{\partial out_{out}}{\partial out_{in}}$, were already calculated previously and their results are as follows:

$\frac{\partial E}{\partial out_{out}} = 0.835$
$\frac{\partial out_{out}}{\partial out_{in}} = 0.117$

For the next derivative $\frac{\partial out_{in}}{\partial h1_{out}}$:

$\frac{\partial out_{in}}{\partial h1_{out}} = \frac{\partial}{\partial h1_{out}}(h1_{out} W_5 + h2_{out} W_6 + b_3)$
$= (h1_{out})^{1-1} \cdot W_5 + 0 + 0$
$= W_5$
$\frac{\partial out_{in}}{\partial h1_{out}} = -0.2$

For πœ•β„Ž1π‘œπ‘’π‘‘

πœ•β„Ž1𝑖𝑛:

ππ’‰πŸπ’π’–π’•

ππ’‰πŸπ’Šπ’=

𝝏

ππ’‰πŸπ’Šπ’(

𝟏

𝟏 + π’†βˆ’π’‰πŸπ’Šπ’)

= (𝟏

𝟏 + π’†βˆ’π’‰πŸπ’Šπ’)(𝟏 βˆ’

𝟏

𝟏 + π’†βˆ’π’‰πŸπ’Šπ’)

= (𝟏

𝟏 + π’†βˆ’πŸŽ.πŸ’πŸ–)(𝟏 βˆ’

𝟏

𝟏 + π’†βˆ’πŸŽ.πŸ’πŸ–)

= (𝟏

𝟏. πŸ”πŸπŸ—)(𝟏 βˆ’

𝟏

𝟏. πŸ”πŸπŸ—)

= (𝟎. πŸ”πŸπŸ–)(𝟏 βˆ’ 𝟎. πŸ”πŸπŸ–) = 𝟎. πŸ”πŸπŸ– βˆ— 𝟎. πŸ‘πŸ–πŸ ππ’‰πŸπ’π’–π’•

ππ’‰πŸπ’Šπ’= 𝟎. πŸπŸ‘πŸ”

For πœ•β„Ž1𝑖𝑛

πœ•π‘Š1:

ππ’‰πŸπ’Šπ’

ππ‘ΎπŸ=

𝝏

ππ‘ΎπŸ(π‘ΏπŸ βˆ— π‘ΎπŸ + π‘ΏπŸ βˆ— π‘ΎπŸ + π’ƒπŸ)

= π‘ΏπŸ βˆ— (π‘ΎπŸ)πŸβˆ’πŸ + 𝟎 + 𝟎

= π‘ΏπŸ ππ’‰πŸπ’Šπ’

ππ‘ΎπŸ= 𝟎. 𝟏

Finally, the target derivative can be calculated:


$\frac{\partial E}{\partial W_1} = 0.835 \times 0.117 \times (-0.2) \times 0.236 \times 0.1$
$\frac{\partial E}{\partial W_1} = -0.00046$

Eβˆ’π‘ΎπŸ Parial Derivative:

Similar to the way of calculating πœ•πΈ

πœ•π‘Š1, we can calculate

πœ•πΈ

πœ•π‘Š2. The only change will be in the last

derivative πœ•β„Ž1𝑖𝑛

πœ•π‘Š2.

$\frac{\partial E}{\partial W_2} = \frac{\partial E}{\partial out_{out}} \cdot \frac{\partial out_{out}}{\partial out_{in}} \cdot \frac{\partial out_{in}}{\partial h1_{out}} \cdot \frac{\partial h1_{out}}{\partial h1_{in}} \cdot \frac{\partial h1_{in}}{\partial W_2}$

$\frac{\partial h1_{in}}{\partial W_2} = \frac{\partial}{\partial W_2}(X_1 W_1 + X_2 W_2 + b_1)$
$= 0 + X_2 \cdot (W_2)^{1-1} + 0$
$= X_2$
$\frac{\partial h1_{in}}{\partial W_2} = 0.3$

Then:

$\frac{\partial E}{\partial W_2} = 0.835 \times 0.117 \times (-0.2) \times 0.236 \times 0.3$
$\frac{\partial E}{\partial W_2} = -0.0014$

The derivatives for the last two weights ($W_3$ and $W_4$) can be calculated similarly to $W_1$ and $W_2$.

Eβˆ’π‘ΎπŸ‘ Parial Derivative:

Starting by π‘Š3, we should this chain: 𝝏𝑬

ππ‘ΎπŸ‘=

𝝏𝑬

ππ’π’–π’•π’π’–π’•βˆ—

𝝏𝒐𝒖𝒕𝒐𝒖𝒕

ππ’π’–π’•π’Šπ’βˆ—

ππ’π’–π’•π’Šπ’

ππ’‰πŸπ’π’–π’•βˆ—

ππ’‰πŸπ’π’–π’•

ππ’‰πŸπ’Šπ’βˆ—

ππ’‰πŸπ’Šπ’

ππ‘ΎπŸ‘

The missing derivatives to be calculated are $\frac{\partial out_{in}}{\partial h2_{out}}$, $\frac{\partial h2_{out}}{\partial h2_{in}}$ and $\frac{\partial h2_{in}}{\partial W_3}$.

ππ’π’–π’•π’Šπ’

ππ’‰πŸπ’π’–π’•=

𝝏

ππ’‰πŸπ’π’–π’•(π’‰πŸπ’π’–π’• βˆ— π‘ΎπŸ“ + π’‰πŸπ’π’–π’• βˆ— π‘ΎπŸ” + π’ƒπŸ‘)

= 𝟎 + (π’‰πŸπ’π’–π’•)πŸβˆ’πŸ βˆ— π‘ΎπŸ” + 𝟎

= π‘ΎπŸ” ππ’π’–π’•π’Šπ’

ππ’‰πŸπ’π’–π’•= 𝟎. πŸ‘


For πœ•β„Ž2π‘œπ‘’π‘‘

πœ•β„Ž2𝑖𝑛:

ππ’‰πŸπ’π’–π’•

ππ’‰πŸπ’Šπ’=

𝝏

ππ’‰πŸπ’Šπ’(

𝟏

𝟏 + π’†βˆ’π’‰πŸπ’Šπ’)

= (𝟏

𝟏 + π’†βˆ’π’‰πŸπ’Šπ’)(𝟏 βˆ’

𝟏

𝟏 + π’†βˆ’π’‰πŸπ’Šπ’)

= (𝟏

𝟏 + π’†βˆ’πŸŽ.𝟎𝟐𝟐)(𝟏 βˆ’

𝟏

𝟏 + π’†βˆ’πŸŽ.𝟎𝟐𝟐)

= (𝟏

𝟏. πŸ—πŸ•πŸ–)(𝟏 βˆ’

𝟏

𝟏. πŸ—πŸ•πŸ–)

= (𝟎. πŸ“πŸŽπŸ”)(𝟏 βˆ’ 𝟎. πŸ“πŸŽπŸ”) ππ’‰πŸπ’π’–π’•

ππ’‰πŸπ’Šπ’= 𝟎. πŸπŸ“

For πœ•β„Ž2𝑖𝑛

πœ•π‘Š3:

ππ’‰πŸπ’Šπ’

ππ‘ΎπŸ‘=

𝝏

ππ‘ΎπŸ‘(π‘ΏπŸ βˆ— π‘ΎπŸ‘ + π‘ΏπŸ βˆ— π‘ΎπŸ’ + π’ƒπŸ)

= π‘ΏπŸ βˆ— π‘ΎπŸ‘ + π‘ΏπŸ βˆ— π‘ΎπŸ’ + π’ƒπŸ

= (π‘ΏπŸ)πŸβˆ’πŸ βˆ— π‘ΎπŸ‘ + 𝟎 + 𝟎

= π‘ΎπŸ‘

= 𝟎. πŸ”πŸ

Finally, we can calculate the desired derivative as follows:

$\frac{\partial E}{\partial W_3} = \frac{\partial E}{\partial out_{out}} \cdot \frac{\partial out_{out}}{\partial out_{in}} \cdot \frac{\partial out_{in}}{\partial h2_{out}} \cdot \frac{\partial h2_{out}}{\partial h2_{in}} \cdot \frac{\partial h2_{in}}{\partial W_3}$
$\frac{\partial E}{\partial W_3} = 0.835 \times 0.117 \times 0.3 \times 0.25 \times 0.1$
$\frac{\partial E}{\partial W_3} = 0.00073$

Eβˆ’π‘ΎπŸ’ Parial Derivative:

We can now calculate πœ•πΈ

πœ•π‘Š4 similarly:

𝝏𝑬

ππ‘ΎπŸ’=

𝝏𝑬

ππ’π’–π’•π’π’–π’•βˆ—

𝝏𝒐𝒖𝒕𝒐𝒖𝒕

ππ’π’–π’•π’Šπ’βˆ—

ππ’π’–π’•π’Šπ’

ππ’‰πŸπ’π’–π’•βˆ—

ππ’‰πŸπ’π’–π’•

ππ’‰πŸπ’Šπ’βˆ—

ππ’‰πŸπ’Šπ’

ππ‘ΎπŸ’

We should calculate the missing derivative $\frac{\partial h2_{in}}{\partial W_4}$:

$\frac{\partial h2_{in}}{\partial W_4} = \frac{\partial}{\partial W_4}(X_1 W_3 + X_2 W_4 + b_2)$
$= 0 + X_2 \cdot (W_4)^{1-1} + 0$
$= X_2$
$\frac{\partial h2_{in}}{\partial W_4} = 0.3$

Then calculate πœ•πΈ

πœ•π‘Š4:

𝝏𝑬

ππ‘ΎπŸ’=

𝝏𝑬

ππ’π’–π’•π’π’–π’•βˆ—

𝝏𝒐𝒖𝒕𝒐𝒖𝒕

ππ’π’–π’•π’Šπ’βˆ—

ππ’π’–π’•π’Šπ’

ππ’‰πŸπ’π’–π’•βˆ—

ππ’‰πŸπ’π’–π’•

ππ’‰πŸπ’Šπ’βˆ—

ππ’‰πŸπ’Šπ’

ππ‘ΎπŸ’

𝝏𝑬

ππ‘ΎπŸ’= 𝟎. πŸ–πŸ‘πŸ“ βˆ— 𝟎. πŸπŸ‘ βˆ— 𝟎. πŸ‘ βˆ— 𝟎. πŸπŸ“ βˆ— 𝟎. 𝟐

𝝏𝑬

ππ‘ΎπŸ’=. πŸŽπŸŽπŸ‘

At this point, we have successfully calculated the derivative of the total error with respect to each weight in the network. The next step is to update the weights according to these derivatives and re-run the network. The updated weights are calculated as follows:

π‘ΎπŸπ’π’†π’˜ = π‘ΎπŸ βˆ’ Ξ· βˆ—ππ‘¬

ππ‘ΎπŸ= 𝟎. πŸ“βˆ’. 𝟎𝟏 βˆ— βˆ’πŸŽ. 𝟎𝟎𝟏 = 𝟎. πŸ“πŸŽπŸŽπŸŽπŸ

π‘ΎπŸπ’π’†π’˜ = π‘ΎπŸ βˆ’ Ξ· βˆ—ππ‘¬

ππ‘ΎπŸ= 𝟎. πŸβˆ’. 𝟎𝟏 βˆ— βˆ’πŸŽ. πŸŽπŸŽπŸ‘ = 𝟎. πŸπŸŽπŸŽπŸŽπŸ‘

π‘ΎπŸ‘π’π’†π’˜ = π‘ΎπŸ‘ βˆ’ Ξ· βˆ—ππ‘¬

ππ‘ΎπŸ‘= 𝟎. πŸ”πŸβˆ’. 𝟎𝟏 βˆ— 𝟎. πŸŽπŸŽπŸ— = 𝟎. πŸ”πŸπŸ—πŸ—πŸ

π‘ΎπŸ’π’π’†π’˜ = π‘ΎπŸ’ βˆ’ Ξ· βˆ—ππ‘¬

ππ‘ΎπŸ’= 𝟎. πŸβˆ’. 𝟎𝟏 βˆ— 𝟎. πŸŽπŸŽπŸ‘ = 𝟎. πŸπŸ—πŸ—πŸ•

π‘ΎπŸ“π’π’†π’˜ = π‘ΎπŸ“ βˆ’ Ξ· βˆ—ππ‘¬

ππ‘ΎπŸ“= βˆ’πŸŽ. πŸβˆ’. 𝟎𝟏 βˆ— 𝟎. πŸ”πŸπŸ– = βˆ’πŸŽ. πŸπŸŽπŸ”πŸπŸ–

π‘ΎπŸ”π’π’†π’˜ = π‘ΎπŸ” βˆ’ Ξ· βˆ—ππ‘¬

ππ‘ΎπŸ”= 𝟎. πŸ‘βˆ’. 𝟎𝟏 βˆ— 𝟎. πŸŽπŸ—πŸ• = 𝟎. πŸπŸ—πŸ—πŸŽπŸ‘