Backpropagation: Understanding How to Update Artificial Neural Networks Weights
By Ahmed Fawzy Gad
Information Technology (IT) Department, Faculty of Computers and Information (FCI), Menoufia University, Egypt
[email protected]
13-Nov-2017
This article discusses how the backpropagation algorithm is used to update artificial neural network (ANN) weights, working through two examples step by step. Readers should have a basic understanding of how ANNs work, partial derivatives, and the multivariate chain rule.
This article won't dive directly into the details of the algorithm but will start by training a very simple network. This is because the backpropagation algorithm is meant to be applied to a network after a training (forward) pass has produced its predictions. So, we should train the network first in order to see the benefits of the backpropagation algorithm and how to use it.
The following figure shows the network structure of the first example we are working on to explain how the backpropagation algorithm works. It has just two inputs, symbolized as X1 and X2. The output layer has a single neuron and there are no hidden layers. Each input has a corresponding weight, where W1 and W2 are the weights for X1 and X2 respectively. There is a bias for the output layer neuron with a fixed value of +1 and a weight b.
The output layer neuron uses the sigmoid activation function defined by the following equation:

f(s) = \frac{1}{1 + e^{-s}}

where s is the sum of products (sop) between each input and its corresponding weight. s is the input to the activation function which, in this example, is defined as follows:

s = X_1 W_1 + X_2 W_2 + b
The following table shows the single input and its corresponding desired output used as the training data. The basic target of this example is not training the network but understanding how the weights are updated using backpropagation. This is why just a single record of data is used, to avoid the complexity of all other operations and concentrate on backpropagation alone.

X1    X2    Desired Output
0.1   0.3   0.03
Assume the initial values for both weights and bias are as follows:

W1    W2    b
0.5   0.2   1.83
For simplicity, the values for all inputs, weights, and bias will be added to the network diagram
to look as follows:
Now, let us train the network and see whether the desired output will be returned based on the current weights and bias. The input of the activation function is the sop between each input and its weight, with the bias then added to the total:

s = X_1 W_1 + X_2 W_2 + b
s = 0.1(0.5) + 0.3(0.2) + 1.83
s = 1.94
The output of the activation function is calculated by applying the previously calculated sop to the sigmoid function as follows:

f(s) = \frac{1}{1 + e^{-s}} = \frac{1}{1 + e^{-1.94}} = \frac{1}{1 + 0.143703949} = \frac{1}{1.143703949} = 0.874352143
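The same forward computation can be written in a few lines of code. The following is a minimal sketch in Python, assuming the example's inputs, weights, and bias; the helper name sigmoid is just an illustrative choice.

```python
import math

def sigmoid(s):
    # Activation function: f(s) = 1 / (1 + e^(-s))
    return 1.0 / (1.0 + math.exp(-s))

X1, X2 = 0.1, 0.3           # inputs
W1, W2, b = 0.5, 0.2, 1.83  # initial weights and bias

s = X1 * W1 + X2 * W2 + b   # sum of products (sop), 1.94
predicted = sigmoid(s)      # ~0.874352143
print(s, predicted)
```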
The output of the activation function reflects the predicted output for the current inputs. It is obvious that there is a difference between the desired and the predicted outputs. But what are the sources of that difference? How should the predicted output be changed to get closer to the desired result? These questions will be answered later. But at least, let us measure the error of our NN based on an error function.
The error function tells how close the predicted outputs are to the desired outputs. The optimal value for the error is zero, meaning there is no error at all and the desired and predicted results are identical. One common error function is the squared error function:

E = \frac{1}{2}(desired - predicted)^2

Note that the factor \frac{1}{2} is added to the equation to simplify the derivatives later.
We can measure the error of our network as follows:

E = \frac{1}{2}(0.03 - 0.874352143)^2
E = \frac{1}{2}(-0.844352143)^2
E = \frac{1}{2}(0.712930542)
E = 0.356465271
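Continuing the sketch above, the squared error for this single training sample can be computed directly (the value of predicted is the output of the forward pass):

```python
desired = 0.03
predicted = 0.874352143            # output of the forward pass above
E = 0.5 * (desired - predicted) ** 2
print(E)                            # ~0.356465
```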
The result confirms the existence of an error, and a large one (≈0.356). The error just gives us an indication of how far the predicted results are from the desired results. But now that we know there is an error, what should we do? Because it is an error, we should minimize it. But how do we minimize the error? There must be something to change in order to minimize the error. The only adjustable parameters we have are the weights. We can try different weights and then test our network.
Weights Update Equation

The weights can be changed according to the following equation:

W(n + 1) = W(n) + \eta [d(n) - Y(n)] X(n)
where:
n: the training step (0, 1, 2, ...).
W(n): the weights in the current training step, W(n) = [b(n), W_1(n), W_2(n), ..., W_m(n)].
\eta: the network learning rate.
d(n): the desired output.
Y(n): the predicted output.
X(n): the current input at which the network made a false prediction.

For our network, these parameters have the following values:
n: 0
W(n): [1.83, 0.5, 0.2]
\eta: a hyperparameter. We can choose 0.01, for example.
d(n): [0.03]
Y(n): [0.874352143]
X(n): [+1, 0.1, 0.3]. The first value (+1) is for the bias.
We can update our neural network weights based on the previous equation:

W(n + 1) = W(n) + \eta [d(n) - Y(n)] X(n)
= [1.83, 0.5, 0.2] + 0.01[0.03 - 0.874352143][+1, 0.1, 0.3]
= [1.83, 0.5, 0.2] + 0.01[-0.844352143][+1, 0.1, 0.3]
= [1.83, 0.5, 0.2] + (-0.00844352143)[+1, 0.1, 0.3]
= [1.83, 0.5, 0.2] + [-0.008443521, -0.000844352, -0.002533056]
= [1.821556479, 0.499155648, 0.197466944]

The new weights are as follows:

W1new         W2new         bnew
0.499155648   0.197466944   1.821556479
Based on the new weights, we recalculate the predicted output, and we keep updating the weights and recalculating the predicted output until an acceptable error value is reached for the problem at hand.
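As a sketch of how this update rule might look in code (treating the bias as a weight on a constant +1 input, as in the parameter list above; variable names are illustrative):

```python
eta = 0.01                    # learning rate
W = [1.83, 0.5, 0.2]          # [b, W1, W2]
X = [1.0, 0.1, 0.3]           # [+1 bias input, X1, X2]
d, Y = 0.03, 0.874352143      # desired and predicted outputs

# W(n+1) = W(n) + eta * [d(n) - Y(n)] * X(n), applied element-wise
W_new = [w + eta * (d - Y) * x for w, x in zip(W, X)]
print(W_new)                  # ~[1.821556479, 0.499155648, 0.197466944]
```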
Here we successfully updated the weights without using the backpropagation algorithm. Are we
still in need of that algorithm? Yes. The reasons will be explained next.
Why Is the Backpropagation Algorithm Important?

Suppose, in the optimal case, that the weight update equation generated the best weights. It is still unknown what this equation actually did. It is like a black box whose internal operations we don't understand. All we know is that we should apply this equation when there is a classification error, and it will then generate new weights to be used in the next training steps. But why are the new weights better at prediction? What is the effect of each weight on the prediction error? How does increasing or decreasing one or more weights affect the prediction error?
We need a better understanding of how the best weights are calculated. To do that, we should use the backpropagation algorithm. It helps us understand how each weight affects the total error of the NN and tells us how to minimize the error to a value very close to zero.
Forward Vs. Backward Passes

When training a neural network, there are two passes: forward and backward, as in the next figure. The process always starts with the forward pass, in which the inputs are applied to the input layer and propagated toward the output layer: the sop between the inputs and the weights is calculated, activation functions are applied to generate the outputs, and finally the prediction error is calculated to know how accurate the current network is.
But what if there is a prediction error? We should modify the network to reduce that error. This is done in the backward pass. In the forward pass, we start from the inputs and end by calculating the prediction error. In the backward pass, we start from the error and work back toward the inputs. The goal of this pass is to know how each weight affects the total error. Knowing the relationship between each weight and the error allows us to modify the network weights to decrease the error. For example, the backward pass can give us useful information such as: increasing the current value of W1 by 1.0 will increase the prediction error by 0.07. This makes it clear how to select the new value of W1 in order to minimize the error (W1 should not be increased).
Partial Derivative
One important operation used in the backward pass is calculating derivatives. Before getting into the derivative calculations of the backward pass, we can start with a simple example to make things easier.
For a multivariate function such as Y = X^2 Z + H, what is the effect on the output Y of a change in the variable X? This question is answered using the partial derivative, which is written as follows:

\frac{\partial Y}{\partial X} = \frac{\partial}{\partial X}(X^2 Z + H)
\frac{\partial Y}{\partial X} = 2XZ + 0
\frac{\partial Y}{\partial X} = 2XZ

Note that everything except X is regarded as a constant. This is why H is replaced by 0 after calculating the partial derivative. Here \partial X means a tiny change of the variable X and \partial Y means a tiny change of Y. The change in Y is the result of changing X. By making a very small change in X, an increase or decrease by a tiny value such as 0.01, and substituting the different values of X, we can find how Y changes with respect to X.
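If SymPy happens to be available, the same partial derivative can be checked symbolically; this is only a verification sketch, not part of the original example:

```python
import sympy as sp

X, Z, H = sp.symbols('X Z H')
Y = X**2 * Z + H
print(sp.diff(Y, X))  # 2*X*Z, since Z and H are treated as constants
```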
The same procedure will be followed in order to know how the NN prediction error changes wrt changes in the network weights. So, our target is to calculate \frac{\partial E}{\partial W_1} and \frac{\partial E}{\partial W_2}, as we have just two weights W1 and W2. Let's calculate them.
Change in Prediction Error wrt Weights

Looking at the equation Y = X^2 Z + H, it seems straightforward to calculate the partial derivative \frac{\partial Y}{\partial X} because there is an equation relating Y and X. But there is no direct equation relating the prediction error to the weights. This is why we are going to use the multivariate chain rule to find the partial derivative of E wrt the weights.
Prediction Error to Weights Chain

Let us try to find the chain relating the prediction error to the weights. The prediction error is calculated based on this equation:

E = \frac{1}{2}(desired - predicted)^2

But this equation doesn't have any weights. No problem, we can follow the calculation of each input of the previous equation until reaching the weights. The desired output is a constant, and thus there is no chance of reaching the weights through it. The predicted output is calculated based on this equation:
f(s) = \frac{1}{1 + e^{-s}}

Again, the equation for calculating the predicted output doesn't have any weight. But there is still the variable s (sop), which already depends on the weights for its calculation according to this equation:

s = X_1 W_1 + X_2 W_2 + b
Here is the chain to be followed to get from the error to the weights: E → Predicted Output → s → (W1, W2).
As a result, to know how the prediction error changes wrt changes in the weights, we should do some intermediate operations: finding how the prediction error changes wrt changes in the predicted output, then finding the relation between the predicted output and the sop, and finally finding how the sop changes wrt the weights. There are four intermediate partial derivatives, as follows:

\frac{\partial E}{\partial Predicted}, \frac{\partial Predicted}{\partial s}, \frac{\partial s}{\partial W_1} and \frac{\partial s}{\partial W_2}
This chain will finally tell how the prediction error changes wrt changes in each weight, which is our goal, by multiplying all the individual partial derivatives as follows:

\frac{\partial E}{\partial W_1} = \frac{\partial E}{\partial Predicted} \cdot \frac{\partial Predicted}{\partial s} \cdot \frac{\partial s}{\partial W_1}

\frac{\partial E}{\partial W_2} = \frac{\partial E}{\partial Predicted} \cdot \frac{\partial Predicted}{\partial s} \cdot \frac{\partial s}{\partial W_2}
Important note: currently there is no equation directly relating the prediction error to the network weights, but we can create one relating them and apply the partial derivative directly to it. The equation would be as follows:

E = \frac{1}{2}\left(desired - \frac{1}{1 + e^{-(X_1 W_1 + X_2 W_2 + b)}}\right)^2

Because this equation seems complex, we use the multivariate chain rule instead for simplicity.
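For instance, assuming SymPy is available, differentiating that composed expression directly gives the same number that the chain rule produces below; the symbols and values are those of the example:

```python
import sympy as sp

X1, X2, W1, W2, b = sp.symbols('X1 X2 W1 W2 b')
s = X1 * W1 + X2 * W2 + b
predicted = 1 / (1 + sp.exp(-s))
E = sp.Rational(1, 2) * (0.03 - predicted) ** 2   # desired output is 0.03

dE_dW1 = sp.diff(E, W1)                           # direct differentiation wrt W1
point = {X1: 0.1, X2: 0.3, W1: 0.5, W2: 0.2, b: 1.83}
print(float(dE_dW1.subs(point)))                  # ~0.009276
```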
Calculating Chain Partial Derivatives
Let us calculate the partial derivatives of each part of the chain previously created.
Error-Predicted Output Partial Derivative:

\frac{\partial E}{\partial Predicted} = \frac{\partial}{\partial Predicted}\left(\frac{1}{2}(desired - predicted)^2\right)
= 2 \cdot \frac{1}{2}(desired - predicted)^{2-1} \cdot (0 - 1)
= (desired - predicted) \cdot (-1)
= predicted - desired

By values substitution,

\frac{\partial E}{\partial Predicted} = predicted - desired = 0.874352143 - 0.03 = 0.844352143
Predicted Output-sop Partial Derivative:

\frac{\partial Predicted}{\partial s} = \frac{\partial}{\partial s}\left(\frac{1}{1 + e^{-s}}\right)

Remember: the quotient rule can be used to find the derivative of the sigmoid function, which gives:

\frac{\partial Predicted}{\partial s} = \frac{1}{1 + e^{-s}}\left(1 - \frac{1}{1 + e^{-s}}\right)

By values substitution,

\frac{\partial Predicted}{\partial s} = \frac{1}{1 + e^{-1.94}}\left(1 - \frac{1}{1 + e^{-1.94}}\right)
= \frac{1}{1.143703949}\left(1 - \frac{1}{1.143703949}\right)
= 0.874352143(1 - 0.874352143)
= 0.874352143(0.125647857)
\frac{\partial Predicted}{\partial s} = 0.109860473
Sop-W1 Partial Derivative:

\frac{\partial s}{\partial W_1} = \frac{\partial}{\partial W_1}(X_1 W_1 + X_2 W_2 + b)
= 1 \cdot X_1 \cdot (W_1)^{1-1} + 0 + 0
= X_1 \cdot (W_1)^{0}
= X_1(1)
\frac{\partial s}{\partial W_1} = X_1

By values substitution,

\frac{\partial s}{\partial W_1} = X_1 = 0.1
Sop-W2 Partial Derivative:

\frac{\partial s}{\partial W_2} = \frac{\partial}{\partial W_2}(X_1 W_1 + X_2 W_2 + b)
= 0 + 1 \cdot X_2 \cdot (W_2)^{1-1} + 0
= X_2 \cdot (W_2)^{0}
= X_2(1)
\frac{\partial s}{\partial W_2} = X_2

By values substitution,

\frac{\partial s}{\partial W_2} = X_2 = 0.3
After calculating each individual derivative, we can multiply all of them to get the desired
relationship between the prediction error and each weight.
Prediction Error-W1 Partial Derivative:

\frac{\partial E}{\partial W_1} = 0.844352143 \times 0.109860473 \times 0.1
\frac{\partial E}{\partial W_1} = 0.009276093

Prediction Error-W2 Partial Derivative:

\frac{\partial E}{\partial W_2} = 0.844352143 \times 0.109860473 \times 0.3
\frac{\partial E}{\partial W_2} = 0.027828278
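The same chain can be evaluated numerically in a short sketch (values taken from the example; names are illustrative):

```python
predicted, desired = 0.874352143, 0.03
X1, X2 = 0.1, 0.3

dE_dpred = predicted - desired           # ~0.844352143
dpred_ds = predicted * (1 - predicted)   # sigmoid derivative, ~0.109860
dE_dW1 = dE_dpred * dpred_ds * X1        # ~0.009276
dE_dW2 = dE_dpred * dpred_ds * X2        # ~0.027828
print(dE_dW1, dE_dW2)
```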
Finally, there are two values reflecting how the prediction error changes with respect to the weights (0.009276093 for W1 and 0.027828278 for W2). But what do these values mean? The results need interpretation.
Interpreting Results of Backpropagation

There are two useful pieces of information in each of the last two derivatives:

1. The derivative sign
2. The derivative magnitude (DM)
If the derivative is positive, that means increasing the weight will increase the error. In other
words, decreasing the weight will decrease the error.
If the derivative is negative, then increasing the weight will decrease the error. In other words,
if it is negative then decreasing the weight will increase the error.
But by how much will the error increase or decrease? The derivative magnitude tells us. For a positive derivative, increasing the weight by p increases the error by DM × p. For a negative derivative, increasing the weight by p decreases the error by DM × p.
Because the result of the \frac{\partial E}{\partial W_1} derivative is positive, this means that if W1 is increased by 1, then the total error increases by 0.009276093. Also, because the result of the \frac{\partial E}{\partial W_2} derivative is positive, if W2 is increased by 1, then the total error increases by 0.027828278.
Updating Weights

After successfully calculating the derivatives of the error with respect to each individual weight, we can update the weights in order to enhance the prediction. Each weight is updated based on its derivative as follows:

W1new = W_1 - \eta \cdot \frac{\partial E}{\partial W_1}
= 0.5 - 0.01 \times 0.009276093
W1new = 0.49990723907

For the second weight,

W2new = W_2 - \eta \cdot \frac{\partial E}{\partial W_2}
= 0.2 - 0.01 \times 0.027828278
W2new = 0.19972171722

Note that the derivative is subtracted from (not added to) the weight because it is positive.
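A minimal sketch of this gradient-descent step, using the derivatives computed above:

```python
eta = 0.01
W1, W2 = 0.5, 0.2
dE_dW1, dE_dW2 = 0.009276093, 0.027828278

W1_new = W1 - eta * dE_dW1   # ~0.499907239
W2_new = W2 - eta * dE_dW2   # ~0.199721717
print(W1_new, W2_new)
```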
Then continue the process of prediction and updating the weights until the desired outputs are
generated with an acceptable error.
Second Example: Backpropagation for an NN with a Hidden Layer

To make the ideas clearer, we can apply the backpropagation algorithm to the following NN after adding one hidden layer with two neurons. The same inputs, output, activation function, and learning rate used previously will also be applied in this example. Here are the complete weights of the network:

W1    W2    W3     W4    W5     W6    b1    b2     b3
0.5   0.1   0.62   0.2   -0.2   0.3   0.4   -0.1   1.83

The following diagram shows the previous network with all inputs and weights added.
First, we go through the forward pass to get the predicted output. If there is an error in the prediction, we then go through the backward pass to update the weights according to the backpropagation algorithm.
Let us calculate the input to the first neuron in the hidden layer (h1):

h1_{in} = X_1 W_1 + X_2 W_2 + b_1
= 0.1 \times 0.5 + 0.3 \times 0.1 + 0.4
h1_{in} = 0.48

The input to the second neuron in the hidden layer (h2):

h2_{in} = X_1 W_3 + X_2 W_4 + b_2
= 0.1 \times 0.62 + 0.3 \times 0.2 - 0.1
h2_{in} = 0.022
The output of the first neuron of the hidden layer:

h1_{out} = \frac{1}{1 + e^{-h1_{in}}} = \frac{1}{1 + e^{-0.48}} = \frac{1}{1 + 0.619} = \frac{1}{1.619} = 0.618

The output of the second neuron of the hidden layer:

h2_{out} = \frac{1}{1 + e^{-h2_{in}}} = \frac{1}{1 + e^{-0.022}} = \frac{1}{1 + 0.978} = \frac{1}{1.978} = 0.506
The next step is to calculate the input to the output neuron:

out_{in} = h1_{out} W_5 + h2_{out} W_6 + b_3
= 0.618 \times (-0.2) + 0.506 \times 0.3 + 1.83
out_{in} = 1.858

The output of the output neuron:

out_{out} = \frac{1}{1 + e^{-out_{in}}} = \frac{1}{1 + e^{-1.858}} = \frac{1}{1 + 0.156} = \frac{1}{1.156} = 0.865
Thus the predicted output of our NN based on the current weights is 0.865. We can then calculate the prediction error according to the following equation:

E = \frac{1}{2}(desired - out_{out})^2
= \frac{1}{2}(0.03 - 0.865)^2
= \frac{1}{2}(-0.835)^2
= \frac{1}{2}(0.697)
E = 0.349
The error seems very high and thus we should update the network weights using the
backpropagation algorithm.
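The whole forward pass of this second example can be sketched as follows (assuming the weights from the table above; variable names such as h1_in mirror the notation used here):

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

X1, X2 = 0.1, 0.3
W1, W2, W3, W4, W5, W6 = 0.5, 0.1, 0.62, 0.2, -0.2, 0.3
b1, b2, b3 = 0.4, -0.1, 1.83
desired = 0.03

h1_in = X1 * W1 + X2 * W2 + b1            # 0.48
h2_in = X1 * W3 + X2 * W4 + b2            # 0.022
h1_out = sigmoid(h1_in)                   # ~0.618
h2_out = sigmoid(h2_in)                   # ~0.506
out_in = h1_out * W5 + h2_out * W6 + b3   # ~1.858
out_out = sigmoid(out_in)                 # ~0.865
E = 0.5 * (desired - out_out) ** 2        # ~0.349
print(out_out, E)
```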
Partial Derivatives

Our goal is to find how the total error E changes wrt each of the 6 weights (W1 to W6):

\frac{\partial E}{\partial W_1}, \frac{\partial E}{\partial W_2}, \frac{\partial E}{\partial W_3}, \frac{\partial E}{\partial W_4}, \frac{\partial E}{\partial W_5}, \frac{\partial E}{\partial W_6}
Let us start by calculating the partial derivatives of the error wrt the hidden-output layer weights (W5 and W6).
E-W5 Partial Derivative:

Starting with W5, we follow this chain:

\frac{\partial E}{\partial W_5} = \frac{\partial E}{\partial out_{out}} \cdot \frac{\partial out_{out}}{\partial out_{in}} \cdot \frac{\partial out_{in}}{\partial W_5}

We can calculate each individual part first and then combine them to get the desired derivative.
For the first derivative \frac{\partial E}{\partial out_{out}}:

\frac{\partial E}{\partial out_{out}} = \frac{\partial}{\partial out_{out}}\left(\frac{1}{2}(desired - out_{out})^2\right)
= 2 \cdot \frac{1}{2}(desired - out_{out})^{2-1} \cdot (0 - 1)
= (desired - out_{out}) \cdot (-1)
= out_{out} - desired

By substituting with the values of these variables,

\frac{\partial E}{\partial out_{out}} = out_{out} - desired = 0.865 - 0.03 = 0.835
For the second derivative \frac{\partial out_{out}}{\partial out_{in}}:

\frac{\partial out_{out}}{\partial out_{in}} = \frac{\partial}{\partial out_{in}}\left(\frac{1}{1 + e^{-out_{in}}}\right)
= \frac{1}{1 + e^{-out_{in}}}\left(1 - \frac{1}{1 + e^{-out_{in}}}\right)
= \frac{1}{1 + e^{-1.858}}\left(1 - \frac{1}{1 + e^{-1.858}}\right)
= \frac{1}{1.156}\left(1 - \frac{1}{1.156}\right)
= (0.865)(1 - 0.865) = (0.865)(0.135)
\frac{\partial out_{out}}{\partial out_{in}} = 0.117
For the last derivative \frac{\partial out_{in}}{\partial W_5}:

\frac{\partial out_{in}}{\partial W_5} = \frac{\partial}{\partial W_5}(h1_{out} W_5 + h2_{out} W_6 + b_3)
= 1 \cdot h1_{out} \cdot (W_5)^{1-1} + 0 + 0
= h1_{out}
\frac{\partial out_{in}}{\partial W_5} = 0.618
After calculating all three required derivatives, we can calculate the target derivative as follows:

\frac{\partial E}{\partial W_5} = \frac{\partial E}{\partial out_{out}} \cdot \frac{\partial out_{out}}{\partial out_{in}} \cdot \frac{\partial out_{in}}{\partial W_5}
\frac{\partial E}{\partial W_5} = 0.835 \times 0.117 \times 0.618
\frac{\partial E}{\partial W_5} = 0.060
E-W6 Partial Derivative:

To calculate \frac{\partial E}{\partial W_6}, we use the following chain:

\frac{\partial E}{\partial W_6} = \frac{\partial E}{\partial out_{out}} \cdot \frac{\partial out_{out}}{\partial out_{in}} \cdot \frac{\partial out_{in}}{\partial W_6}
The same previous calculations will be repeated, with just a change in the last derivative \frac{\partial out_{in}}{\partial W_6}. It can be calculated as follows:

\frac{\partial out_{in}}{\partial W_6} = \frac{\partial}{\partial W_6}(h1_{out} W_5 + h2_{out} W_6 + b_3)
= 0 + 1 \cdot h2_{out} \cdot (W_6)^{1-1} + 0
= h2_{out}
\frac{\partial out_{in}}{\partial W_6} = 0.506
Finally, the derivative \frac{\partial E}{\partial W_6} can be calculated:

\frac{\partial E}{\partial W_6} = \frac{\partial E}{\partial out_{out}} \cdot \frac{\partial out_{out}}{\partial out_{in}} \cdot \frac{\partial out_{in}}{\partial W_6}
= 0.835 \times 0.117 \times 0.506
\frac{\partial E}{\partial W_6} = 0.049
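A numeric sketch of these two gradients, using the (rounded) forward-pass values of this example:

```python
out_out, desired = 0.865, 0.03
h1_out, h2_out = 0.618, 0.506

dE_dout = out_out - desired               # ~0.835
dout_doutin = out_out * (1 - out_out)     # ~0.117
dE_dW5 = dE_dout * dout_doutin * h1_out   # ~0.060
dE_dW6 = dE_dout * dout_doutin * h2_out   # ~0.049
print(dE_dW5, dE_dW6)
```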
That covers W5 and W6. Let's now calculate the derivatives wrt W1 to W4.
E-W1 Partial Derivative:

Starting with W1, we follow this chain:

\frac{\partial E}{\partial W_1} = \frac{\partial E}{\partial out_{out}} \cdot \frac{\partial out_{out}}{\partial out_{in}} \cdot \frac{\partial out_{in}}{\partial h1_{out}} \cdot \frac{\partial h1_{out}}{\partial h1_{in}} \cdot \frac{\partial h1_{in}}{\partial W_1}
We will follow the same previous procedure by calculating each individual derivative and finally
combining all of them.
The first two derivatives, \frac{\partial E}{\partial out_{out}} and \frac{\partial out_{out}}{\partial out_{in}}, were already calculated previously, and their results are as follows:

\frac{\partial E}{\partial out_{out}} = 0.835
\frac{\partial out_{out}}{\partial out_{in}} = 0.117
For the next derivative \frac{\partial out_{in}}{\partial h1_{out}}:

\frac{\partial out_{in}}{\partial h1_{out}} = \frac{\partial}{\partial h1_{out}}(h1_{out} W_5 + h2_{out} W_6 + b_3)
= 1 \cdot (h1_{out})^{1-1} \cdot W_5 + 0 + 0
= W_5
\frac{\partial out_{in}}{\partial h1_{out}} = -0.2
For \frac{\partial h1_{out}}{\partial h1_{in}}:

\frac{\partial h1_{out}}{\partial h1_{in}} = \frac{\partial}{\partial h1_{in}}\left(\frac{1}{1 + e^{-h1_{in}}}\right)
= \frac{1}{1 + e^{-h1_{in}}}\left(1 - \frac{1}{1 + e^{-h1_{in}}}\right)
= \frac{1}{1 + e^{-0.48}}\left(1 - \frac{1}{1 + e^{-0.48}}\right)
= \frac{1}{1.619}\left(1 - \frac{1}{1.619}\right)
= (0.618)(1 - 0.618) = 0.618 \times 0.382
\frac{\partial h1_{out}}{\partial h1_{in}} = 0.236
For \frac{\partial h1_{in}}{\partial W_1}:

\frac{\partial h1_{in}}{\partial W_1} = \frac{\partial}{\partial W_1}(X_1 W_1 + X_2 W_2 + b_1)
= X_1 \cdot (W_1)^{1-1} + 0 + 0
= X_1
\frac{\partial h1_{in}}{\partial W_1} = 0.1
Finally, the target derivative can be calculated:

\frac{\partial E}{\partial W_1} = 0.835 \times 0.117 \times (-0.2) \times 0.236 \times 0.1
\frac{\partial E}{\partial W_1} = -0.00046
E-W2 Partial Derivative:

Similar to the way \frac{\partial E}{\partial W_1} was calculated, we can calculate \frac{\partial E}{\partial W_2}. The only change is in the last derivative, \frac{\partial h1_{in}}{\partial W_2}.
\frac{\partial E}{\partial W_2} = \frac{\partial E}{\partial out_{out}} \cdot \frac{\partial out_{out}}{\partial out_{in}} \cdot \frac{\partial out_{in}}{\partial h1_{out}} \cdot \frac{\partial h1_{out}}{\partial h1_{in}} \cdot \frac{\partial h1_{in}}{\partial W_2}

\frac{\partial h1_{in}}{\partial W_2} = \frac{\partial}{\partial W_2}(X_1 W_1 + X_2 W_2 + b_1)
= 0 + X_2 \cdot (W_2)^{1-1} + 0
= X_2
\frac{\partial h1_{in}}{\partial W_2} = 0.3
Then:

\frac{\partial E}{\partial W_2} = 0.835 \times 0.117 \times (-0.2) \times 0.236 \times 0.3
\frac{\partial E}{\partial W_2} = -0.00138
The derivatives for the last two weights (W3 and W4) can be calculated similarly to those for W1 and W2.
E-W3 Partial Derivative:

Starting with W3, we should follow this chain:

\frac{\partial E}{\partial W_3} = \frac{\partial E}{\partial out_{out}} \cdot \frac{\partial out_{out}}{\partial out_{in}} \cdot \frac{\partial out_{in}}{\partial h2_{out}} \cdot \frac{\partial h2_{out}}{\partial h2_{in}} \cdot \frac{\partial h2_{in}}{\partial W_3}
The missing derivatives to be calculated are \frac{\partial out_{in}}{\partial h2_{out}}, \frac{\partial h2_{out}}{\partial h2_{in}} and \frac{\partial h2_{in}}{\partial W_3}.

\frac{\partial out_{in}}{\partial h2_{out}} = \frac{\partial}{\partial h2_{out}}(h1_{out} W_5 + h2_{out} W_6 + b_3)
= 0 + (h2_{out})^{1-1} \cdot W_6 + 0
= W_6
\frac{\partial out_{in}}{\partial h2_{out}} = 0.3
For \frac{\partial h2_{out}}{\partial h2_{in}}:

\frac{\partial h2_{out}}{\partial h2_{in}} = \frac{\partial}{\partial h2_{in}}\left(\frac{1}{1 + e^{-h2_{in}}}\right)
= \frac{1}{1 + e^{-h2_{in}}}\left(1 - \frac{1}{1 + e^{-h2_{in}}}\right)
= \frac{1}{1 + e^{-0.022}}\left(1 - \frac{1}{1 + e^{-0.022}}\right)
= \frac{1}{1.978}\left(1 - \frac{1}{1.978}\right)
= (0.506)(1 - 0.506)
\frac{\partial h2_{out}}{\partial h2_{in}} = 0.25
For \frac{\partial h2_{in}}{\partial W_3}:

\frac{\partial h2_{in}}{\partial W_3} = \frac{\partial}{\partial W_3}(X_1 W_3 + X_2 W_4 + b_2)
= X_1 \cdot (W_3)^{1-1} + 0 + 0
= X_1
\frac{\partial h2_{in}}{\partial W_3} = 0.1
Finally, we can calculate the desired derivative as follows:

\frac{\partial E}{\partial W_3} = \frac{\partial E}{\partial out_{out}} \cdot \frac{\partial out_{out}}{\partial out_{in}} \cdot \frac{\partial out_{in}}{\partial h2_{out}} \cdot \frac{\partial h2_{out}}{\partial h2_{in}} \cdot \frac{\partial h2_{in}}{\partial W_3}
\frac{\partial E}{\partial W_3} = 0.835 \times 0.117 \times 0.3 \times 0.25 \times 0.1
\frac{\partial E}{\partial W_3} = 0.00073
E-W4 Partial Derivative:

We can now calculate \frac{\partial E}{\partial W_4} similarly:

\frac{\partial E}{\partial W_4} = \frac{\partial E}{\partial out_{out}} \cdot \frac{\partial out_{out}}{\partial out_{in}} \cdot \frac{\partial out_{in}}{\partial h2_{out}} \cdot \frac{\partial h2_{out}}{\partial h2_{in}} \cdot \frac{\partial h2_{in}}{\partial W_4}
We should calculate the missing derivative \frac{\partial h2_{in}}{\partial W_4}:

\frac{\partial h2_{in}}{\partial W_4} = \frac{\partial}{\partial W_4}(X_1 W_3 + X_2 W_4 + b_2)
= 0 + X_2 \cdot (W_4)^{1-1} + 0
= X_2
\frac{\partial h2_{in}}{\partial W_4} = 0.3
Then we calculate \frac{\partial E}{\partial W_4}:

\frac{\partial E}{\partial W_4} = \frac{\partial E}{\partial out_{out}} \cdot \frac{\partial out_{out}}{\partial out_{in}} \cdot \frac{\partial out_{in}}{\partial h2_{out}} \cdot \frac{\partial h2_{out}}{\partial h2_{in}} \cdot \frac{\partial h2_{in}}{\partial W_4}
\frac{\partial E}{\partial W_4} = 0.835 \times 0.117 \times 0.3 \times 0.25 \times 0.3
\frac{\partial E}{\partial W_4} = 0.0022
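The four hidden-layer gradients follow the same chains; here is a compact numeric sketch using the rounded values worked out above (tiny differences in the last digits come from rounding):

```python
X1, X2 = 0.1, 0.3
W5, W6 = -0.2, 0.3
h1_out, h2_out, out_out, desired = 0.618, 0.506, 0.865, 0.03

# Error signal arriving at the output neuron's input (dE / d out_in)
dE_doutin = (out_out - desired) * out_out * (1 - out_out)

dh1 = h1_out * (1 - h1_out)   # sigmoid derivative at hidden neuron 1
dh2 = h2_out * (1 - h2_out)   # sigmoid derivative at hidden neuron 2

dE_dW1 = dE_doutin * W5 * dh1 * X1   # ~ -0.00046
dE_dW2 = dE_doutin * W5 * dh1 * X2   # ~ -0.00138
dE_dW3 = dE_doutin * W6 * dh2 * X1   # ~  0.00073
dE_dW4 = dE_doutin * W6 * dh2 * X2   # ~  0.00219
print(dE_dW1, dE_dW2, dE_dW3, dE_dW4)
```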
At this point, we have successfully calculated the derivative of the total error with respect to each weight in the network. Next, we update the weights according to the derivatives and re-train the network. The updated weights are calculated as follows:

W1new = W_1 - \eta \cdot \frac{\partial E}{\partial W_1} = 0.5 - 0.01 \times (-0.00046) = 0.5000046
W2new = W_2 - \eta \cdot \frac{\partial E}{\partial W_2} = 0.1 - 0.01 \times (-0.00138) = 0.1000138
W3new = W_3 - \eta \cdot \frac{\partial E}{\partial W_3} = 0.62 - 0.01 \times 0.00073 = 0.6199927
W4new = W_4 - \eta \cdot \frac{\partial E}{\partial W_4} = 0.2 - 0.01 \times 0.0022 = 0.199978
W5new = W_5 - \eta \cdot \frac{\partial E}{\partial W_5} = -0.2 - 0.01 \times 0.060 = -0.2006
W6new = W_6 - \eta \cdot \frac{\partial E}{\partial W_6} = 0.3 - 0.01 \times 0.049 = 0.29951
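As a final sketch, the six updates can be applied in one place (eta = 0.01 as before; the gradient values are the ones worked out above):

```python
eta = 0.01
weights   = {'W1': 0.5,      'W2': 0.1,      'W3': 0.62,
             'W4': 0.2,      'W5': -0.2,     'W6': 0.3}
gradients = {'W1': -0.00046, 'W2': -0.00138, 'W3': 0.00073,
             'W4': 0.0022,   'W5': 0.060,    'W6': 0.049}

new_weights = {name: w - eta * gradients[name] for name, w in weights.items()}
print(new_weights)
# W1 ~0.5000046, W2 ~0.1000138, W3 ~0.6199927,
# W4 ~0.199978,  W5 ~-0.2006,   W6 ~0.29951
```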