Part II
CSE 5526: Introduction to Neural Networks
Linear Regression
Problem statement
Linear regression with one variable
• Given a set of N pairs of data (x_i, d_i), approximate d by a linear function of x (the regressor), i.e.

  d ≈ wx + b

or

  d_i = y_i + ε_i = φ(wx_i + b) + ε_i = wx_i + b + ε_i

where the activation function φ(x) = x is a linear function, corresponding to a linear neuron. y is the output of the neuron, and

  ε_i = d_i − y_i

is called the regression (expectational) error
Linear regression (cont.)
• The problem of regression with one variable is how to choose w and b to minimize the regression error
• The least squares method aims to minimize the square error E:
  E = (1/2) Σ_{i=1..N} ε_i² = (1/2) Σ_{i=1..N} (d_i − y_i)²
Linear regression (cont.)
• To minimize the square error, a function of the two variables w and b, set

  ∂E/∂w = 0
  ∂E/∂b = 0
Linear regression (cont.)
  ∂E/∂b = ∂[(1/2) Σ_i (d_i − wx_i − b)²]/∂b = −Σ_i (d_i − wx_i − b) = 0

  ∂E/∂w = ∂[(1/2) Σ_i (d_i − wx_i − b)²]/∂w = −Σ_i (d_i − wx_i − b) x_i = 0
Linear regression (cont.)
• Hence

  b = [Σ_i x_i² Σ_i d_i − Σ_i x_i Σ_i x_i d_i] / [N Σ_i x_i² − (Σ_i x_i)²]

  w = Σ_i (x_i − x̄)(d_i − d̄) / Σ_i (x_i − x̄)²

where an overbar (e.g. x̄) indicates the mean

Derive yourself!
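As a quick numerical check of the closed-form solution, the sketch below fits w and b to synthetic data. The sample size, true parameters, and noise level are illustrative choices, and b is computed through the algebraically equivalent form b = d̄ − w·x̄:

```python
import random

random.seed(0)

# Synthetic data; N, w_true, b_true, and the noise level are illustrative.
N = 200
w_true, b_true = 2.0, -1.0
xs = [random.uniform(-1.0, 1.0) for _ in range(N)]
ds = [w_true * x + b_true + random.gauss(0.0, 0.1) for x in xs]

x_bar = sum(xs) / N
d_bar = sum(ds) / N

# w = sum_i (x_i - x_bar)(d_i - d_bar) / sum_i (x_i - x_bar)^2
w = (sum((x - x_bar) * (d - d_bar) for x, d in zip(xs, ds))
     / sum((x - x_bar) ** 2 for x in xs))

# Algebraically equivalent to the full formula for b
b = d_bar - w * x_bar

print(w, b)  # close to w_true and b_true
```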
Linear regression (cont.)
• This method gives an optimal solution, but as a batch solution it can be time- and memory-consuming
Finding optimal parameters via search
• Without loss of generality, set b = 0
• E(w) is called a cost function:

  E(w) = (1/2) Σ_{i=1..N} (d_i − wx_i)²
Cost function
[Figure: the quadratic cost function E(w), with minimum E_min attained at w = w*]
Question: how can we update w to minimize E?
Gradient and directional derivatives
• Without loss of generality, consider a two-variable function f(x, y). The gradient of f(x, y) at a given point (x0, y0)^T is

  ∇f = (f_x(x, y), f_y(x, y))^T |_{x=x0, y=y0}

  ∇f(x0, y0) = f_x(x0, y0) u_x + f_y(x0, y0) u_y

where u_x and u_y are unit vectors in the x and y directions, f_x = ∂f/∂x, and f_y = ∂f/∂y
Gradient and directional derivatives (cont.)
• For any given direction u = a u_x + b u_y, with a² + b² = 1, the directional derivative at (x0, y0)^T along the unit vector u is

  D_u f(x0, y0) = lim_{h→0} [f(x0 + ah, y0 + bh) − f(x0, y0)] / h
                = lim_{h→0} {[f(x0 + ah, y0 + bh) − f(x0, y0 + bh)] + [f(x0, y0 + bh) − f(x0, y0)]} / h
                = a f_x(x0, y0) + b f_y(x0, y0)
                = ∇f(x0, y0)^T u

• Which direction has the greatest slope? – The gradient direction, because of the dot product!
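The dot-product formula can be sanity-checked numerically. The function f, the point (x0, y0), and the direction below are illustrative choices:

```python
# f(x, y) = x^2 + 3y is an illustrative choice; its gradient is (2x, 3).
def f(x, y):
    return x ** 2 + 3.0 * y

x0, y0 = 1.0, 2.0
fx, fy = 2.0 * x0, 3.0            # analytic partial derivatives at (x0, y0)

# Unit direction u = (a, b) with a^2 + b^2 = 1
a, b = 0.6, 0.8

# Numerical directional derivative: [f(x0 + a*h, y0 + b*h) - f(x0, y0)] / h
h = 1e-6
d_num = (f(x0 + a * h, y0 + b * h) - f(x0, y0)) / h

# Analytic value: gradient dotted with u
d_ana = fx * a + fy * b

print(d_num, d_ana)  # both close to 3.6
```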
Gradient and directional derivatives (cont.)
• Example: see blackboard
Gradient and directional derivatives (cont.)
• To find the gradient at a particular point (x0, y0)^T, first find the level curve or contour of f(x, y) through that point, C(x0, y0). A tangent vector u to C satisfies

  D_u f = ∇f(x0, y0)^T u = 0

because f(x, y) is constant on a level curve. Hence the gradient vector is perpendicular to the tangent vector
An illustration of level curves
Gradient and directional derivatives (cont.)
• The gradient of a cost function is a vector with the dimension of w that points in the direction of maximum increase of E, with a magnitude equal to the slope of the tangent of the cost function along that direction
• Can the slope be negative?
Gradient illustration
[Figure: the cost function E(w) with minimum E_min at w = w*, and the gradient at a point w0]

  ∇E(w0) = lim_{Δw→0} [E(w0 + Δw) − E(w0 − Δw)] / (2Δw)
Gradient descent
• Minimize the cost function via gradient (steepest) descent – a case of hill-climbing

  w(n+1) = w(n) − η∇E(n)

  n: iteration number
  η: learning rate

• See previous figure
Gradient descent (cont.)
• For the mean-square-error cost function:
  E(n) = (1/2) e²(n) = (1/2) [d(n) − y(n)]²
       = (1/2) [d(n) − w(n)x(n)]²   (for linear neurons)

  ∇E(n) = ∂E(n)/∂w(n) = (1/2) ∂e²(n)/∂w(n) = −e(n) x(n)
Gradient descent (cont.)
• Hence
• This is the least-mean-square (LMS) algorithm, or the Widrow-Hoff rule
  w(n+1) = w(n) + η e(n) x(n)
         = w(n) + η [d(n) − y(n)] x(n)
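The one-variable LMS rule can be sketched in a few lines. The target weight and learning rate below are illustrative choices, with b = 0 as on the earlier slides:

```python
import random

random.seed(1)

# One-variable LMS with b = 0; w_true and eta are illustrative choices.
w_true = 1.5
eta = 0.1
w = 0.0

for n in range(1000):
    x = random.uniform(-1.0, 1.0)
    d = w_true * x              # noise-free target
    y = w * x                   # linear neuron output
    e = d - y                   # error e(n) = d(n) - y(n)
    w += eta * e * x            # w(n+1) = w(n) + eta * e(n) * x(n)

print(w)  # converges to w_true
```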
Multi-variable case
• The analysis for the one-variable case extends to the multi-variable case
where w0 = b (bias) and x0 = 1, as done for perceptron learning

  E(n) = (1/2) [d(n) − w^T(n) x(n)]²

  ∇E(w) = (∂E/∂w_0, ∂E/∂w_1, ..., ∂E/∂w_m)^T
Multi-variable case (cont.)
• The LMS algorithm
  w(n+1) = w(n) − η∇E(n)
         = w(n) + η e(n) x(n)
         = w(n) + η [d(n) − y(n)] x(n)
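A multi-variable sketch under the same assumptions, with the bias folded in as w0 = b and x0 = 1; the target vector and learning rate are illustrative choices:

```python
import random

random.seed(2)

# Target parameters [b, w1, w2]; values are illustrative.
target = [0.5, -2.0, 1.0]
eta = 0.1
w = [0.0, 0.0, 0.0]

for n in range(3000):
    x = [1.0, random.uniform(-1.0, 1.0), random.uniform(-1.0, 1.0)]  # x0 = 1
    d = sum(t * xi for t, xi in zip(target, x))     # noise-free target
    y = sum(wi * xi for wi, xi in zip(w, x))        # y(n) = w(n)^T x(n)
    e = d - y
    w = [wi + eta * e * xi for wi, xi in zip(w, x)]  # LMS update

print(w)  # approaches target
```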
LMS algorithm
• Remarks
  – The LMS rule is exactly the same in mathematical form as the perceptron learning rule
  – Perceptron learning is for McCulloch-Pitts neurons, which are nonlinear, whereas LMS learning is for linear neurons. In other words, perceptron learning is for classification and LMS is for function approximation
  – LMS should be less sensitive to noise in the input data than perceptrons. On the other hand, LMS learning converges slowly
  – Newton's method changes weights in the direction of the minimum of E(w) and leads to fast convergence, but it is not an online method and is computationally expensive
Stability of adaptation
When η is too small, learning converges slowly
Stability of adaptation (cont.)
When η is too large, learning doesn’t converge
Learning rate annealing
• Basic idea: start with a large rate but gradually decrease it
• Stochastic approximation:

  η(n) = c/n

  where c is a positive parameter
Learning rate annealing (cont.)
• Search-then-converge:

  η(n) = η0 / (1 + n/τ)

  where η0 and τ are positive parameters

• When n is small compared to τ, the learning rate is approximately constant
• When n is large compared to τ, the learning rate schedule roughly follows stochastic approximation
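The two schedules can be compared side by side; the values of c, η0, and τ below are illustrative choices:

```python
# The two annealing schedules; c, eta0, and tau are illustrative values.
c = 1.0
eta0, tau = 0.5, 100.0

def stochastic_approximation(n):
    return c / n                     # eta(n) = c / n

def search_then_converge(n):
    return eta0 / (1.0 + n / tau)    # eta(n) = eta0 / (1 + n / tau)

# Nearly constant while n << tau, then decays roughly like eta0 * tau / n
for n in (1, 10, 100, 1000):
    print(n, stochastic_approximation(n), search_then_converge(n))
```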
Rate annealing illustration
Nonlinear neurons
• To extend the LMS algorithm to nonlinear neurons, consider a differentiable activation function φ at iteration n:

  E(n) = (1/2) [d(n) − y(n)]²
       = (1/2) [d(n) − φ(Σ_j w_j x_j(n))]²
Nonlinear neurons (cont.)
• By the chain rule of differentiation

  ∂E/∂w_j = (∂E/∂y)(∂y/∂v)(∂v/∂w_j)
          = −[d(n) − y(n)] φ′(v(n)) x_j(n)
          = −e(n) φ′(v(n)) x_j(n)
Nonlinear neurons (cont.)
• The gradient descent gives

  w_j(n+1) = w_j(n) + η e(n) φ′(v(n)) x_j(n)
           = w_j(n) + η δ(n) x_j(n)

• The above is called the delta (δ) rule
• If we choose a logistic sigmoid for φ,

  φ(v) = 1 / (1 + exp(−av))

then

  φ′(v) = a φ(v)[1 − φ(v)]   (see textbook)
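The delta rule with a logistic sigmoid can be sketched as below. The target mapping d = φ(2x), the weight 2.0, and η are illustrative choices for a single one-weight neuron:

```python
import math
import random

random.seed(3)

a = 1.0                                       # sigmoid slope parameter

def phi(v):
    return 1.0 / (1.0 + math.exp(-a * v))

def dphi(v):
    return a * phi(v) * (1.0 - phi(v))        # phi'(v) = a*phi(v)[1 - phi(v)]

# Single sigmoid neuron with one weight learning the realizable target
# d = phi(2x); the target weight 2.0 and eta are illustrative choices.
w_true, eta, w = 2.0, 0.5, 0.0

for n in range(5000):
    x = random.uniform(-2.0, 2.0)
    d = phi(w_true * x)
    v = w * x
    y = phi(v)
    e = d - y
    w += eta * e * dphi(v) * x                # delta rule update

print(w)  # should approach w_true
```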
Role of activation function
[Figure: the logistic sigmoid φ(v) and its derivative φ′(v)]

The role of φ′: the weight update is most sensitive when v is near zero