Learning strategies for neural nets - the backpropagation algorithm
In contrast to the NNs with threshold units we have handled until now, we now consider NNs with non-linear activation functions f(x). The most common choice is the sigmoid function ψ(x):

$$f(x) = \psi(x) = \frac{1}{1+e^{-ax}}$$
[Figure: plot of ψ(x) over -6 ≤ x ≤ 6 for increasing values of a]

With increasing a it approximates the step function σ(x).
The function ψ(x) is closely related to the tanh function. For a = 1 the following holds:

$$\tanh(x) = \frac{2}{1+e^{-2x}} - 1 = 2\,\psi(2x) - 1$$

If the two parameters a and b are added, a more general form results:

$$g(x) = \frac{1}{1+e^{-(ax-b)}}$$

with a controlling the steepness and b controlling the position on the x-axis.

The non-linearity of the activation function is the core reason for the existence of the multi-layer perceptron; without this property the multi-layer perceptron would coincide with a trivial linear one-layered network.

The differentiability of f(x) allows us to apply the necessary conditions (∂(·)/∂w_{i,j}) for optimization of the weight coefficients w_{i,j}.

The first derivative of the sigmoid function can be expressed in terms of the function itself:

$$\frac{d\psi}{dx} = \frac{e^{-x}}{(1+e^{-x})^2} = \frac{1}{1+e^{-x}}\left(1-\frac{1}{1+e^{-x}}\right) = \psi(x)\,(1-\psi(x))$$
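A minimal numerical sketch of these relations (Python with NumPy; the function names are illustrative):

```python
import numpy as np

def sigmoid(x, a=1.0):
    # psi(x) = 1 / (1 + exp(-a*x))
    return 1.0 / (1.0 + np.exp(-a * x))

def sigmoid_prime(x):
    # d/dx psi(x) = psi(x) * (1 - psi(x))   (for a = 1)
    s = sigmoid(x)
    return s * (1.0 - s)

x = np.linspace(-6, 6, 7)
# Relation to tanh: tanh(x) = 2*psi(2x) - 1
assert np.allclose(np.tanh(x), 2 * sigmoid(2 * x) - 1)
# Self-referential derivative, checked against a central finite difference
h = 1e-6
assert np.allclose(sigmoid_prime(x), (sigmoid(x + h) - sigmoid(x - h)) / (2 * h))
```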
Figure of the extended weight matrix

$$\mathbf{W}' = [\,\mathbf{b}\ \ \mathbf{W}\,] = \begin{bmatrix} w_{1,0}=b_1 & w_{1,1} & w_{1,2} & \cdots & w_{1,N} \\ w_{2,0}=b_2 & w_{2,1} & w_{2,2} & \cdots & w_{2,N} \\ w_{3,0}=b_3 & w_{3,1} & w_{3,2} & \cdots & w_{3,N} \\ \vdots & \vdots & \vdots & & \vdots \\ w_{M,0}=b_M & w_{M,1} & w_{M,2} & \cdots & w_{M,N} \end{bmatrix}$$
The neural net with H layers

The multi-layer NN with H layers has its own weight matrix W'_i in every layer, but identical sigmoid functions:

$$\mathbf{y}_H = \boldsymbol{\psi}(\mathbf{W}'_H \ldots \boldsymbol{\psi}(\mathbf{W}'_2\,\boldsymbol{\psi}(\mathbf{W}'_1\,\mathbf{x}'))\ldots)$$

[Figure: chain x' → W'_1 → y_1 → W'_2 → y_2 → … → W'_H → y_H]

The learning is based on adjusting the weight matrices so as to minimize a squared error criterion between the target value y and the approximation ŷ produced by the net (supervised learning). The expected value has to be formed over the available training ensemble {y_j, x_j}:

$$\min_{\mathbf{W}'_i} J = \min_{\mathbf{W}'_i} E\left\{\|\mathbf{y}-\hat{\mathbf{y}}\|^2\right\} = \min_{\mathbf{W}'_i} E\left\{\|\mathbf{y}-\mathbf{y}_H\|^2\right\}$$

Note: The one-layer NN and the linear polynomial classifier are identical if the activation function is set to ψ ≡ 1.
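A sketch of this forward map, assuming extended weight matrices (bias folded in as column 0) and the input extended by a leading 1; all names and shapes here are illustrative:

```python
import numpy as np

def forward(W_list, x):
    """Forward pass y_H = psi(W'_H ... psi(W'_1 x') ...); each W in
    W_list is an extended weight matrix [b W] acting on [1; y]."""
    y = np.asarray(x, dtype=float)
    for W in W_list:
        y_ext = np.concatenate(([1.0], y))       # prepend bias input x0 = 1
        y = 1.0 / (1.0 + np.exp(-(W @ y_ext)))   # s = W' y_ext, y = psi(s)
    return y

# Example: 2 inputs -> 3 hidden -> 1 output
rng = np.random.default_rng(0)
W1 = rng.normal(size=(3, 3))   # 3 x (1+2)
W2 = rng.normal(size=(1, 4))   # 1 x (1+3)
print(forward([W1, W2], [0.5, -1.0]))
```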
Relations to the concept of function approximation by a linear combination of basis functions

The first layer creates the basis functions in the form of a hidden vector y_1, and the second layer forms a linear combination of these basis functions.

Therefore the coefficient matrix W'_1 of the first layer controls the appearance of the basis functions, while the weight matrix W'_2 of the second layer contains the coefficients of the linear combination. Additionally the result is weighted through the activation function.
Reaction of a neuron with two inputs and a threshold function σ
Reaction of a neuron with two inputs and a sigmoid function ψ
Overlapping two neurons with two inputs each and a sigmoid function
Reaction of a neuron in the second layer to two neurons of the first layer after evaluation by the sigmoid function
The backpropagation learning algorithm

• The class mapping is done by a multi-layer perceptron, which produces a 1 at the output in the regions of x that are populated by the samples of the corresponding class, and produces a 0 in the areas occupied by other classes. In the areas in between and outside, an interpolation resp. extrapolation is done.

• The network is trained based on the optimization of the squared quality factor:

$$J = E\left\{\|\hat{\mathbf{y}}(\mathbf{W}'_i,\mathbf{x}') - \mathbf{y}\|^2\right\}$$

• This term is non-linear both in the elements of the input vector x' and in the weight coefficients {w'^h_{ij}}.
• Inserting the function map of the multi-layer perceptron yields:

$$J = E\left\{\|\boldsymbol{\psi}(\mathbf{W}'_H \ldots \boldsymbol{\psi}(\mathbf{W}'_2\,\boldsymbol{\psi}(\mathbf{W}'_1\,\mathbf{x}'))\ldots) - \mathbf{y}\|^2\right\}$$

• Sought is the global minimum. It is not clear whether such an element exists or how many local minima exist. The backpropagation algorithm solves the problem iteratively with a gradient algorithm. One iterates according to:

$$\mathbf{W}' \leftarrow \mathbf{W}' - \alpha\,\nabla\mathbf{J} \qquad \text{with: } \mathbf{W}' = \left\{\mathbf{W}'_1, \mathbf{W}'_2, \ldots, \mathbf{W}'_H\right\}$$

and the gradients:

$$\nabla\mathbf{J} = \left\{\frac{\partial J}{\partial w'^h_{nm}}\right\}$$

• The iteration terminates when the gradient vanishes at a (local or even global) minimum.

• A proper choice of α is difficult. Small values increase the number of required iterations. High values reduce the probability of running into a local minimum, but risk making the method diverge and miss the minimum (or give rise to oscillations).
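A generic sketch of this iteration (Python with NumPy); `grad_fn` stands for the gradient computation by backpropagation, derived below, and is an assumed callback:

```python
import numpy as np

def train(W_list, grad_fn, alpha=0.1, tol=1e-6, max_iter=10_000):
    """Iterate W <- W - alpha * grad(J) until the gradient (nearly)
    vanishes at a (local) minimum. grad_fn returns one array per layer."""
    for _ in range(max_iter):
        grads = grad_fn(W_list)
        if max(np.max(np.abs(g)) for g in grads) < tol:
            break  # gradient vanished -> minimum reached
        W_list = [W - alpha * g for W, g in zip(W_list, grads)]
    return W_list
```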
• The iteration has only linear convergence (gradient algorithm).

• The error caused by a sample results in:

$$\hat{\mathbf{y}}(\mathbf{x}) = \boldsymbol{\psi}(\mathbf{W}'_H \ldots \boldsymbol{\psi}(\mathbf{W}'_2\,\boldsymbol{\psi}(\mathbf{W}'_1\,\mathbf{x}'))\ldots)$$

$$J_j = \tfrac{1}{2}\,\|\hat{\mathbf{y}}(\mathbf{x}_j) - \mathbf{y}_j\|^2 = \tfrac{1}{2}\sum_{k=1}^{N_H}\left(\hat{y}_k(j) - y_k(j)\right)^2$$

• The expected value E{·} of the gradient has to be approximated by the mean value over all sample gradients:

$$\nabla\mathbf{J} = \frac{1}{n}\sum_{j=1}^{n}\nabla\mathbf{J}_j$$

• Calculating the gradient: for differentiating the composed function, the chain rule is needed.
• We have to distinguish between

– individual learning, based on the last sample: $[\mathbf{x}_j, \mathbf{y}_j] \Rightarrow \nabla\mathbf{J}_j$

– and cumulative learning (batch learning), based on: $\nabla\mathbf{J} = \frac{1}{n}\sum_{j=1}^{n}\nabla\mathbf{J}_j$

• Partial derivatives for one layer: for the r-th layer the following applies:

$$y^r_m = \psi(s^r_m) = \psi\!\left(\sum_{n=0}^{N_r} w'^r_{mn}\,x^r_n\right), \qquad x^r_n:\ n\text{-th input of layer } r, \quad y^r_m:\ m\text{-th output of layer } r$$
Definition of the variables of two layers for the backpropagation algorithm

[Figure: two consecutive layers, each neuron summing its weighted inputs (+) and applying f; layer r-1 with sums $s^{r-1}_k$, outputs $y^{r-1}_k$ and sensitivities $\delta^{r-1}_k$; weights $w'^r_{jk}$ connecting to layer r with sums $s^r_j$, outputs $y^r_j$ and sensitivities $\delta^r_j$]
• First of all we calculate the effect of the last hidden layer on the output layer r = H. Taking partial derivatives of the quality function J (actually J_j, but j is omitted for simplicity) and applying the chain rule results in:

$$\frac{\partial J}{\partial w'^H_{nm}} = \frac{\partial J}{\partial s^H_n}\cdot\frac{\partial s^H_n}{\partial w'^H_{nm}} = -\,\delta^H_n\, y^{H-1}_m$$

• introducing the sensitivity of cell n:

$$\delta_n = -\frac{\partial J}{\partial s_n}$$

• results in:

$$\delta^H_n = -\frac{\partial J}{\partial s^H_n} = -\frac{\partial\left(\tfrac{1}{2}\sum_{k=1}^{N_H}(y_k-\hat{y}_k)^2\right)}{\partial \hat{y}_n}\cdot\frac{\partial \hat{y}_n}{\partial s^H_n} = (y_n - \hat{y}_n)\, f'(s^H_n)$$

• and thus for the update of the weights:

$$w_{neu} = w_{alt} + \Delta w \qquad \text{with } \Delta w = -\alpha\,\frac{\partial J}{\partial w}$$

$$\Delta w^H_{nm} = \alpha\,\delta^H_n\, y^{H-1}_m = \alpha\,(y_n - \hat{y}_n)\, f'(s^H_n)\, y^{H-1}_m$$
• For all other hidden layers r < H the reasoning is a little more complex. Due to the mutual dependencies, the values s^{r-1}_j have an effect on all elements s^r_k of the following layer. Using the chain rule results in:

$$\frac{\partial J}{\partial s^{r-1}_j} = \sum_k \frac{\partial J}{\partial s^r_k}\cdot\frac{\partial s^r_k}{\partial s^{r-1}_j}, \qquad \delta^{r-1}_j = \sum_k \delta^r_k\,\frac{\partial s^r_k}{\partial s^{r-1}_j}$$

• With

$$\frac{\partial s^r_k}{\partial s^{r-1}_j} = \frac{\partial\left(\sum_m w'^r_{km}\, f(s^{r-1}_m)\right)}{\partial s^{r-1}_j} = w'^r_{kj}\, f'(s^{r-1}_j)$$

• this results in:

$$\delta^{r-1}_j = f'(s^{r-1}_j)\left[\sum_k \delta^r_k\, w'^r_{kj}\right]$$

• i.e. the errors "backpropagate" from the output layer to the lower layers!
• All learning samples are traversed without any changes of the weight coefficients, and the gradient ∇J results from forming the mean value of the ∇J_j. Both values can be used for updating the parameter matrix W' of the perceptron:

$$\mathbf{W}'_{k+1} = \mathbf{W}'_k - \alpha\,\nabla\mathbf{J}_j \qquad \text{individual learning}$$

$$\mathbf{W}'_{k+1} = \mathbf{W}'_k - \alpha\,\nabla\mathbf{J} \qquad \text{batch learning}$$

• The gradients ∇J and ∇J_j differ. The former is the mean value of the latter (∇J = E{∇J_j}); conversely, each ∇J_j is a random realization of that mean.
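A sketch of the two update variants, assuming per-layer gradient arrays (all names are illustrative):

```python
def individual_update(W, grad_j, alpha):
    # W_(k+1) = W_k - alpha * grad(J_j): update after every single sample
    return [Wr - alpha * g for Wr, g in zip(W, grad_j)]

def batch_update(W, sample_grads, alpha):
    # W_(k+1) = W_k - alpha * grad(J), where grad(J) is the mean of the
    # per-sample gradients grad(J_j) over the whole training set;
    # sample_grads: list over samples, each a list of per-layer gradients
    n = len(sample_grads)
    mean_grads = [sum(g[r] for g in sample_grads) / n for r in range(len(W))]
    return [Wr - alpha * g for Wr, g in zip(W, mean_grads)]
```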
The whole individual learning procedure consists of these steps:

• Choose initial values for the coefficient matrices W_1, W_2, ..., W_H of all layers.
• Take a new observation [x'_j, y_j].
• Calculate the discriminant function ŷ_j from the given observation x'_j and the current weight matrices (forward calculation).
• Calculate the error between estimation ŷ and target vector y: ∆y = ŷ - y.
• Calculate the gradient ∇J with regard to all perceptron weights (error backpropagation). To do so, first calculate δ^H_n of the output layer according to:

$$\delta^H_n = -\frac{\partial J}{\partial s^H_n} = -\frac{\partial J}{\partial \hat{y}_n}\cdot\frac{\partial \hat{y}_n}{\partial s^H_n} = (y_n - \hat{y}_n)\, f'(s^H_n)$$

• and from this calculate backwards all values of the lower layers according to:

$$\delta^{r-1}_j = f'(s^{r-1}_j)\left[\sum_k w'^r_{kj}\,\delta^r_k\right] \qquad \text{für } r = H, H-1, \ldots, 2$$
• (In parallel) correct all perceptron weights according to:

$$w'^r_{nm} \leftarrow w'^r_{nm} - \alpha\,\frac{\partial J}{\partial w'^r_{nm}} = w'^r_{nm} + \alpha\,\delta^r_n\, y^{r-1}_m \qquad \text{für } r = 1, 2, \ldots, H;\ n = 1, 2, \ldots, N_r;\ m = 1, 2, \ldots, M_r$$

• For individual learning the weights {w'^h_{nm}} are corrected with ∇J_j for every sample, while for batch learning all learning samples have to be considered in order to determine the averaged gradient ∇J from the sequence {∇J_j}, before the weights can be corrected according to:

$$w'^r_{nm} \leftarrow w'^r_{nm} + \alpha \sum_i \delta^r_n(i)\, y^{r-1}_m(i) \qquad \text{für } r = 1, 2, \ldots, H;\ n = 1, 2, \ldots, N_r;\ m = 1, 2, \ldots, M_r$$

• considering the first derivative of the sigmoid function f(s) = ψ(s):

$$f'(s) = \frac{d\psi(s)}{ds} = \psi(s)\,(1-\psi(s)) = y\,(1-y)$$
Backpropagation algorithm for matrices

• Choose initial values for the coefficient matrices W_1, W_2, ..., W_H of all layers.
• Take a new observation [x'_j, y_j].
• Calculate the discriminant function ŷ_j from the given observation x'_j and the current weight matrices (forward calculation). Store all intermediate values y^r and s^r of all layers r = 1, 2, …, H.
• Calculate the error between estimation ŷ and target vector y at the output: ∆y = ŷ - y.
• Calculate the gradient ∇J with regard to all perceptron weights (error backpropagation). To do so, first calculate δ^H of the output layer according to:

$$\boldsymbol{\delta}^H = -\frac{\partial J}{\partial \mathbf{s}^H} = -\frac{\partial J}{\partial \hat{\mathbf{y}}}\cdot\frac{\partial \hat{\mathbf{y}}}{\partial \mathbf{s}^H} = (\mathbf{y} - \hat{\mathbf{y}}) \odot f'(\mathbf{s}^H)$$

• and from this calculate backwards all values of the lower layers (⊙: element-wise product) according to:

$$\boldsymbol{\delta}^{r-1} = f'(\mathbf{s}^{r-1}) \odot \left((\mathbf{W}'^r)^T\,\boldsymbol{\delta}^r\right) \qquad \text{für } r = H, H-1, \ldots, 2$$
• Individual learning: (in parallel) correct all perceptron weight matrices with ∇J_j for every sample according to:

$$\mathbf{W}'^r \leftarrow \mathbf{W}'^r - \alpha\,\frac{\partial J}{\partial \mathbf{W}'^r} = \mathbf{W}'^r + \alpha\,\boldsymbol{\delta}^r\,(\mathbf{y}^{r-1})^T \qquad \text{für } r = 1, 2, \ldots, H$$

• Cumulative learning: all samples have to be considered in order to determine the averaged value ∇J from the sequence of the {∇J_j}, before the weights can be corrected according to:

$$\mathbf{W}'^r \leftarrow \mathbf{W}'^r + \alpha \sum_j \boldsymbol{\delta}^r_j\,(\mathbf{y}^{r-1}_j)^T \qquad \text{für } r = 1, 2, \ldots, H$$

• considering the first derivative of the sigmoid function f(s) = ψ(s):

$$f'(s) = \frac{d\psi(s)}{ds} = \psi(s)\,(1-\psi(s)) = y\,(1-y)$$
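A compact sketch of this matrix form (Python with NumPy), assuming extended weight matrices (bias as column 0), sigmoid activations in every layer, and individual learning; all names are illustrative:

```python
import numpy as np

def psi(s):
    return 1.0 / (1.0 + np.exp(-s))

def backprop_step(W, x, y_target, alpha=0.5):
    # Forward pass: store every layer output y^r (ys[0] is the input x)
    ys = [np.asarray(x, dtype=float)]
    for Wr in W:
        y_ext = np.concatenate(([1.0], ys[-1]))  # prepend bias input 1
        ys.append(psi(Wr @ y_ext))
    # Output layer: delta^H = (y - y_hat) * f'(s^H), with f'(s) = y(1 - y)
    delta = (np.asarray(y_target, dtype=float) - ys[-1]) * ys[-1] * (1.0 - ys[-1])
    # Backward pass with immediate (individual-learning) weight update
    for r in range(len(W) - 1, -1, -1):
        y_ext = np.concatenate(([1.0], ys[r]))
        dW = alpha * np.outer(delta, y_ext)      # alpha * delta^r (y^(r-1))^T
        if r > 0:
            # delta^(r-1) = f'(s^(r-1)) * (W^r)^T delta^r, bias column excluded
            delta = ys[r] * (1.0 - ys[r]) * (W[r][:, 1:].T @ delta)
        W[r] = W[r] + dW
    return W, ys[-1]

# Usage sketch: one step on a 2-3-1 net
rng = np.random.default_rng(1)
W = [rng.normal(scale=0.5, size=(3, 3)), rng.normal(scale=0.5, size=(1, 4))]
W, y_hat = backprop_step(W, [0.0, 1.0], [1.0])
```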
Properties:

• The backpropagation algorithm is simple to implement, but becomes very complex especially for large coefficient matrices, which requires the number of samples to be large as well. The dependency of the method on the starting values of the weights, the correction factor α, and the order of processing the samples also has an adverse effect.

• Linear dependencies remain unconsidered, just as for recursive training of the polynomial classifier.

• The gradient algorithm has the advantage that it can be used for very large problems.

• The dimension of the coefficient matrix W results from the dimensions of the input vector, the hidden layers, and the output vector (N, N_1, ..., N_{H-1}, K) as:

$$T = \dim(\mathbf{W}) = \sum_{h=1}^{H}(N_{h-1}+1)\,N_h \qquad \text{with } N_0 = N \text{ and } N_H = K$$

$$\dim(\mathbf{x}) = N\ \text{(feature space)}, \qquad \dim(\hat{\mathbf{y}}) = K\ \text{(number of classes)}$$
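A one-liner for this count (Python; the example sizes anticipate the 35-10-26 character-recognition net of a later slide):

```python
def num_weights(layer_sizes):
    """T = sum over layers of (N_{h-1} + 1) * N_h, counting the bias column.
    layer_sizes = [N_0, N_1, ..., N_H]."""
    return sum((n_prev + 1) * n for n_prev, n in zip(layer_sizes, layer_sizes[1:]))

print(num_weights([35, 10, 26]))  # (35+1)*10 + (10+1)*26 = 646
```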
Dimensioning the net

• The considerations for designing the multi-layer perceptron with threshold function (σ resp. sign) give quite a good idea of how many hidden layers and how many neurons should be used for an MLPC with backpropagation learning (assuming one has a certain idea of the distribution of the clusters).

• The net is to be chosen as simple as possible. The higher the dimension, the higher the risk of overfitting, accompanied by a loss of the ability to generalize. A higher dimension also allows many local minima, which can cause the algorithm to get stuck!

• For a given number of samples, the learning phase should be stopped at a certain point. If not stopped, the net is overfitted to the existing data and loses its ability to generalize.

[Figure: training and testing error over epochs]
Complexity of backpropagation algorithm
• For T = dim(W) weights and bias terms, linear complexity in the number of weights is easily understood, since O(T) computation steps are needed for the forward simulation, O(T) steps for the backpropagation of the error, and also O(T) operations for the correction of the weights; altogether: O(T).

• If the gradients were determined experimentally by finite differences (in order to do so, one would have to increment each weight and determine the effect on the quality criterion by a forward calculation), i.e. by calculating a difference quotient according to (no analytical evaluation):

$$\frac{\partial J}{\partial w^r_{ji}} = \frac{J(w^r_{ji}+\varepsilon) - J(w^r_{ji})}{\varepsilon} + O(\varepsilon)$$

• then T forward calculations of complexity O(T) would result, and therefore an overall complexity of O(T²).
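A sketch of such a finite-difference gradient for a flat weight vector (Python with NumPy; `J` is an assumed loss callback). Each of the T perturbations costs one O(T) forward calculation, hence O(T²) overall:

```python
import numpy as np

def numerical_gradient(J, W, eps=1e-5):
    """Forward-difference approximation of dJ/dW, one entry at a time."""
    W = np.asarray(W, dtype=float)
    base = J(W)
    grad = np.empty_like(W)
    for i in range(W.size):        # one perturbation per weight: T evaluations
        W_pert = W.copy()
        W_pert[i] += eps
        grad[i] = (J(W_pert) - base) / eps
    return grad

# Example with a quadratic loss; the gradient should be ~2*W
print(numerical_gradient(lambda w: np.sum(w**2), [1.0, -2.0, 3.0]))
```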
Backpropagation for training a two-layer network for function approximation
(Matlab demo: „Backpropagation Calculation“)

• Sought is an approximation of the function

$$y(x) = 1 + \sin\!\left(\frac{\pi}{4}x\right) \qquad \text{for } -2 \le x \le 2$$

• First layer:

$$\mathbf{y}^1 = \boldsymbol{\psi}(\mathbf{W}^1 x + \mathbf{b}^1) = \boldsymbol{\psi}(\mathbf{W}'^1\mathbf{x}') = \boldsymbol{\psi}(\mathbf{s}^1) \qquad \text{mit: } \mathbf{W}^1 = \begin{bmatrix} w^1_{1,1} \\ w^1_{2,1} \end{bmatrix} \text{ und } \mathbf{b}^1 = \begin{bmatrix} b^1_1 \\ b^1_2 \end{bmatrix}$$

• Second layer:

$$\hat{y} = y^2 = \mathbf{W}^2\,\mathbf{y}^1 + b^2 \qquad \text{mit: } \mathbf{W}^2 = \begin{bmatrix} w^2_{1,1} & w^2_{1,2} \end{bmatrix}$$

[Figure: 1-2-1 network with weights $w^1_{1,1}, w^1_{2,1}$, biases $b^1_1, b^1_2, b^2$, sums $s^1_1, s^1_2, s^2$, outputs $y^1_1, y^1_2, \hat{y} = y^2$ and sensitivities $\delta^1_1, \delta^1_2, \delta^2$]
Backpropagation for training a two-layer network for function approximation

• Backpropagation: for the output layer (second layer) with linear activation function, f' = 1 yields:

$$\delta^2 = (y - \hat{y})$$

• Correction of the weights in the output layer:

$$\Delta w^2_{1,j} = \alpha\,\delta^2\, y^1_j \quad \text{and} \quad \Delta b^2 = \alpha\,\delta^2 \qquad \text{for } j = 1, 2$$

• Backpropagation: for the first layer with sigmoid function:

$$\delta^1_j = f'(s^1_j)\left[w^2_{1,j}\,\delta^2\right] = y^1_j\,(1-y^1_j)\left[w^2_{1,j}\,\delta^2\right] \qquad \text{for } j = 1, 2$$

• Correction of the weights in the input layer:

$$\Delta w^1_{j,1} = \alpha\,\delta^1_j\, x \quad \text{and} \quad \Delta b^1_j = \alpha\,\delta^1_j \qquad \text{for } j = 1, 2$$
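A self-contained sketch of this demo (Python with NumPy): a 1-2-1 net with sigmoid hidden layer and linear output, trained by the individual-learning rules above; learning rate and iteration count are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
w1, b1 = rng.normal(size=2), rng.normal(size=2)   # first layer (2 hidden units)
w2, b2 = rng.normal(size=2), rng.normal()         # second layer (linear)
alpha = 0.1

def psi(s):
    return 1.0 / (1.0 + np.exp(-s))

for _ in range(20000):
    x = rng.uniform(-2, 2)                # sample the interval [-2, 2]
    y = 1 + np.sin(np.pi / 4 * x)         # target function
    y1 = psi(w1 * x + b1)                 # hidden layer
    y_hat = w2 @ y1 + b2                  # linear output layer
    d2 = y - y_hat                        # delta^2 = (y - y_hat), since f' = 1
    d1 = y1 * (1 - y1) * w2 * d2          # delta^1_j = y1(1-y1) w2_(1,j) delta^2
    w2 += alpha * d2 * y1; b2 += alpha * d2
    w1 += alpha * d1 * x;  b1 += alpha * d1

xs = np.linspace(-2, 2, 5)                # compare target vs. net output
print(np.c_[1 + np.sin(np.pi / 4 * xs),
            [w2 @ psi(w1 * x + b1) + b2 for x in xs]])
```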
Starting Matlab demo: matlab-BPC.bat
Backpropagation for training a two-layer network for function approximation
(Matlab demo: „Function Approximation“)

• Sought is an approximation of the function

$$y(x) = 1 + \sin\!\left(i\cdot\frac{\pi}{4}x\right) \qquad \text{for } -2 \le x \le 2$$

• With increasing value of i (difficulty index) the requirements on the MLPC network increase. The approximation based on the sigmoid function needs more and more layers in order to depict several periods of the sine function.

• Problem: convergence to local minima, even when the network is big enough to represent the function:
– The network is too small, so that the approximation is poor, i.e. the global minimum can be reached, but the approximation quality is poor (i = 8 and network 1-3-1).
– The network is big enough, so that the approximation could be good, but it converges towards a local minimum (i = 4 and network 1-3-1).
Starting Matlab demo: matlab-FA.bat
Model big enough, but only a local minimum is reached
Generalization ability of the network
• Assume that the network is trained with 11 samples:

$$\{y_1, x_1\}, \{y_2, x_2\}, \ldots, \{y_{11}, x_{11}\}$$

• How well does the network approximate the function for samples not learned, depending on the complexity of the network?

• Rule: The network should have fewer parameters than the available input/output pairs.

Starting Matlab demo (Hagan 11-21): matlab-GENERAL.bat
Improving the generalization ability of a net by adding noise

• If only few samples are available, the generalization ability of a NN can be improved by extending the number of samples, e.g. with normally distributed noise, i.e. by adding further samples which lie in the nearest neighbourhood with high probability.

• In doing so the intra-class areas are broadened and the borders between classes are sharpened.

[Figure: samples of class 1 and class 2 with noise-broadened class regions]
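A sketch of this augmentation (Python with NumPy; the number of copies and the noise level sigma are illustrative assumptions):

```python
import numpy as np

def augment_with_noise(X, y, copies=5, sigma=0.05, seed=0):
    """Extend a small sample set by jittered copies drawn from
    N(x, sigma^2 I) around each original sample (labels unchanged)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    X_new = np.concatenate([X] + [X + rng.normal(scale=sigma, size=X.shape)
                                  for _ in range(copies)])
    y_new = np.tile(np.asarray(y), copies + 1)
    return X_new, y_new
```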
Interpolation by overlapping of normal distributions
$$y(x) = \sum_i \alpha_i\, f(x - t_i)$$

[Figure: weighted sum of radial basis transfer functions]
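A sketch of such a weighted sum, assuming Gaussian basis functions (all parameters are illustrative):

```python
import numpy as np

def rbf_sum(x, alphas, centers, width=1.0):
    """y(x) = sum_i alpha_i * f(x - t_i) with Gaussian f."""
    x = np.asarray(x, dtype=float)[:, None]          # shape (n, 1)
    return np.sum(np.asarray(alphas) *
                  np.exp(-((x - np.asarray(centers)) / width) ** 2 / 2), axis=1)

xs = np.linspace(-3, 3, 7)
print(rbf_sum(xs, alphas=[0.5, 1.0, 0.8], centers=[-1.0, 0.0, 1.5]))
```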
Interpolation by overlapping of normal distributions
Character recognition task for 26 letters of size 7×5

• with gradually increasing additive noise:
Two-layer neural net for symbol recognition with 35 input values (pixels), 10 neurons in the hidden layer and 26 neurons in the output layer

$$\mathbf{y}^2 = \boldsymbol{\psi}(\mathbf{W}^2\,\boldsymbol{\psi}(\mathbf{W}^1\mathbf{x} + \mathbf{b}^1) + \mathbf{b}^2)$$

[Figure: first layer $\mathbf{y}^1 = f^1(\mathbf{W}^1\mathbf{x}+\mathbf{b}^1)$ with N = 35 inputs, $\mathbf{W}^1$: 10×35, $\mathbf{b}^1, \mathbf{s}^1, \mathbf{y}^1$: 10×1; second layer $\mathbf{y}^2 = f^2(\mathbf{W}^2\mathbf{y}^1+\mathbf{b}^2)$ with $\mathbf{W}^2$: 26×10, $\mathbf{b}^2, \mathbf{s}^2, \mathbf{y}^2$: 26×1]
Demo with MATLAB
• Opening Matlab
– Demo in Toolbox Neural Networks
– „Character recognition“ (command line)
– Two networks are trained:
  • network 1 without noise
  • network 2 with noise
– Network 2 has better results than network 1 on an independent set of test data

Starting Matlab demo: matlab-CR.bat
Possibilities to accelerate the adaption

Adaption of a NN is simply a general parameter optimization problem. Accordingly, a variety of other optimization methods from numerical mathematics could in principle be applied. This can lead to significant improvements (convergence speed, convergence area, stability, complexity).

In general, conclusions cannot be drawn easily, since the cases can differ a lot and compromises have to be made!

• Heuristic improvements of the gradient algorithm (increment control)
– gradient algorithm (steepest descent) with momentum term (the update of the weights depends not only on the gradient but also on the former update; see the sketch after this list)
– using an adaptive learning factor
• Conjugate gradient algorithm
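A sketch of the momentum variant mentioned above (Python; `mu` is the assumed momentum factor):

```python
def momentum_step(W, grad, prev_step, alpha=0.1, mu=0.9):
    # New step blends the current gradient with the previous step:
    # step = mu * prev_step - alpha * grad;  W <- W + step
    step = [mu * p - alpha * g for p, g in zip(prev_step, grad)]
    W_new = [Wr + s for Wr, s in zip(W, step)]
    return W_new, step
```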
• Newton and semi-Newton algorithms (using the second derivatives of the error function, e.g. in the form of the Hessian matrix or an estimate of it)
– Quickprop
– Levenberg-Marquardt algorithm
• Pruning techniques: starting with a sufficiently big network, neurons that have no or little effect on the quality function are taken off the net, which can reduce overfitting of the data.

Taylor expansion of the quality function in w:

$$J(\mathbf{w}+\Delta\mathbf{w}) = J(\mathbf{w}) + \nabla\mathbf{J}^T\Delta\mathbf{w} + \tfrac{1}{2}\,\Delta\mathbf{w}^T\mathbf{H}\,\Delta\mathbf{w} + \text{terms of higher order}$$

with the gradient vector $\nabla\mathbf{J} = \frac{\partial J}{\partial \mathbf{w}}$ and the Hessian matrix $\mathbf{H} = \left\{\frac{\partial^2 J}{\partial w_i\,\partial w_j}\right\}$
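A minimal sketch of the resulting Newton step, which minimizes this local quadratic model (Python with NumPy; assumes H is invertible):

```python
import numpy as np

def newton_step(w, g, H):
    # Minimizing J(w) + g^T dw + 0.5 dw^T H dw gives dw = -H^{-1} g
    return w - np.linalg.solve(H, g)
```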
Demo with MATLAB (Demo of Theodoridis)

• C:\...\matlab\PR\startdemo
• Example 2 (XOR) with (2-2-1 and 2-5-5-1)
• Example 3 with three-layer network [5,5] (3000 epochs, learning rate 0.7, momentum 0.5)

Starting Matlab demo: matlab-theodoridis.bat

Two solutions of the XOR problem:

• Convex area implemented with a two-layer net (2-2-1). Possible mis-classification for variance: $\sigma_n > \frac{\sqrt{2}}{4} \approx 0.35$

• Union of 2 convex areas implemented with a three-layer net (2-5-5-1). Possible mis-classification for variance: $\sigma_n > 0.5$

2-2-2-1 does not converge!
XOR with high noise in order to activate the second solution! With two lines an error-free (exact) separation cannot be reached and the gradient is still different from 0!
Demo with MATLAB (Klassifikationgui.m)
• Opening Matlab
– first invoke setmypath.m, then
– C:\Home\ppt\Lehre\ME_2002\matlab\KlassifikatorEntwurf-WinXX\Klassifikationgui
– load data set: Samples/xor2.mat
  Setting: 500 epochs, learning rate 0.7
– load weight matrices: models/xor-trivial.mat
– load weight matrices: models/xor-3.mat
• Two-class problem with banana shape
– load data set: Samples/banana_test_samples.mat
– load presetting from: models/MLP_7_3_2_banana.mat

Starting Matlab demo: matlab-Klassifikation_gui.bat
Solution of the XOR-problem with two-layer nn
Solution of the XOR-problem with two-layer nn
Solution of the XOR-problem with three-layer nn
Solution of the XOR-problem with three-layer nn
Convergence behaviour of backpropagation algorithm
• Backpropagation with common gradient descent (steepest descent)

Starting Matlab demo: matlab-BP-gradient.bat

• Backpropagation with conjugate gradient descent (conjugate gradient) - significantly better convergence!

Starting Matlab demo: matlab-BP-CGgradient.bat
NN and their properties

• „Quick and Dirty“: a NN design is simple to implement and results in solutions that are by all means usable (a suboptimal solution is better than none at all).
• Particularly large problems can be solved.
• All strategies only use "local" optimization functions and generally do not reach a global optimum, which is a big drawback of neural nets.
Using a NN for iterative calculation of the Principal Component Analysis (KLT)

• Instead of solving the KLT explicitly by an eigenvalue decomposition, a NN can be used for iterative calculation.

• A two-layer perceptron is used. The output of the hidden layer, w, is the feature vector sought.

[Figure: two-layer linear net; the input x (dimension N) is mapped by A^T to the hidden vector w (dimension M) and back by A to the reconstruction x̂ (dimension N)]

• The first layer calculates the feature vector as:

$$\mathbf{w} = \mathbf{A}^T\mathbf{x}$$
• Supposing a linear activation function is used, the second layer implements the reconstruction of x with the transposed weight matrix of the first layer:

$$\hat{\mathbf{x}} = \mathbf{A}\mathbf{w}$$

• Target of the optimization is to minimize:

$$J = E\left\{\|\mathbf{x}-\hat{\mathbf{x}}\|^2\right\} = E\left\{\|\mathbf{x}-\mathbf{A}\mathbf{w}\|^2\right\} = E\left\{\|\mathbf{x}-\mathbf{A}\mathbf{A}^T\mathbf{x}\|^2\right\}$$

• The following learning rule:

$$\mathbf{A} \leftarrow \mathbf{A} - \alpha\,(\mathbf{A}\mathbf{w}-\mathbf{x})\,\mathbf{w}^T = \mathbf{A} - \alpha\,\nabla\mathbf{J}$$

• leads to a coefficient matrix A, which can be written as a product of two matrices (without proof!):

$$\mathbf{A} = \mathbf{B}_M\,\mathbf{T}$$

• B_M is an N×M matrix of the M dominant eigenvectors of E{xx^T}, and T an orthonormal M×M matrix, which causes a rotation of the coordinate system within the space that is spanned by the M dominant eigenvectors of E{xx^T}.
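A sketch of this learning rule (Python with NumPy; the dimensions, the moment matrix, the learning rate and the iteration count are illustrative assumptions):

```python
import numpy as np

# w = A^T x (first layer), x_hat = A w (second layer),
# A <- A - alpha * (A w - x) w^T  (gradient step on E{||x - A w||^2})
rng = np.random.default_rng(0)
N, M, alpha = 5, 2, 0.005
std = np.sqrt([4.0, 2.0, 1.0, 0.5, 0.1])   # E{xx^T} = diag(4, 2, 1, 0.5, 0.1)
A = rng.normal(scale=0.1, size=(N, M))

for _ in range(50000):
    x = std * rng.normal(size=N)           # zero-mean sample
    w = A.T @ x                            # feature vector
    A -= alpha * np.outer(A @ w - x, w)    # learning rule

# The columns of A should (approximately) span the M dominant
# eigenvectors, here e1 and e2; the other components go to ~0:
print(np.round(A, 2))
```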
• The learning strategy does not include translation, i.e. this strategy calculates the eigenvectors of the moment matrix E{xx^T} and not of the covariance matrix K = E{(x-µ_x)(x-µ_x)^T}. This can be implemented by subtracting a recursively estimated expected value µ_x = E{x} before starting the recursive estimation of the feature vector. The same effect can be achieved by using an extended observation vector:

$$\mathbf{x}' = \begin{bmatrix} 1 \\ \mathbf{x} \end{bmatrix} \quad \text{instead of} \quad \mathbf{x}$$