AdaGrad+RDA in 30 Minutes


  • AdaGrad+RDA in 30 Minutes

    echizen_tm, Oct. 11, 2014

  • Agenda: linear classifiers (3p), Stochastic Gradient Descent (2p), deriving AdaGrad+RDA (6p), the name AdaGrad+RDA (3p), summary (1p)

  • Linear classifiers (1/3)

    The thing to predict can be one of two labels { , }, one of several labels { , , , , }, or a numeric value {10, 20, 30, 40, ...}; the rest of the talk deals with the first case, predicting one of two classes.

  • Linear classifiers (2/3)

    Input: a feature vector x. Model: a weight vector w.

    Compute the score y = \sum_i x_i w_i; if y > 0, predict class A, otherwise predict that the input is not A.

  • Linear classifiers (3/3)

    Example: x = (1, 1, 1, 1), w = (1, 1, 1, -1).

    y = 1*1 + 1*1 + 1*1 + 1*(-1) = 2 > 0

    so the example is classified as the positive class (t=1); a non-positive score would mean the negative class (t=-1). A small code sketch follows.
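
    A minimal Python sketch of this prediction rule (the function name and the way the vectors are stored are mine, not from the slides):

    def predict(w, x):
        """Linear classifier: +1 if the score w.x is positive, else -1."""
        score = sum(w_i * x_i for w_i, x_i in zip(w, x))
        return 1 if score > 0 else -1

    # The worked example above: score = 1 + 1 + 1 - 1 = 2 > 0, so predict +1.
    print(predict([1, 1, 1, -1], [1, 1, 1, 1]))  # -> 1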

  • Stochastic Gradient Descent (1/2)

    Goal: learn the weight vector w from training examples (x, t), where t ∈ {1, -1} is the true label.

    How badly w does on a single example is measured by a loss function; two common choices are (sketched in code below):

    Hinge loss: f(w, x, t) = \max(0,\, 1 - t \sum_i x_i w_i)

    Squared loss: f(w, x, t) = \frac{1}{2} \left( t - \sum_i x_i w_i \right)^2
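
    The two losses as small Python functions, a sketch with names of my choosing:

    def hinge_loss(w, x, t):
        """Hinge loss: zero once the example is classified with margin >= 1."""
        score = sum(w_i * x_i for w_i, x_i in zip(w, x))
        return max(0.0, 1.0 - t * score)

    def squared_loss(w, x, t):
        """Squared loss: half the squared gap between the label and the score."""
        score = sum(w_i * x_i for w_i, x_i in zip(w, x))
        return 0.5 * (t - score) ** 2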

  • Stochastic Gradient Descent (2/2)

    w is learned with Stochastic Gradient Descent (SGD): go through the examples one at a time and step w against the gradient of the loss (the learning rate is omitted here for simplicity; a runnable sketch follows):

    w = 0; for ((x,t) in X) { w -= ∇f(w, x, t); }

    The gradients of the two losses are:

    Hinge loss: \partial f(w, x, t) / \partial w_i = -t x_i  when 1 - t \sum_j x_j w_j > 0, and 0 otherwise

    Squared loss: \partial f(w, x, t) / \partial w_i = -\left( t - \sum_j x_j w_j \right) x_i
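
    A runnable sketch of this loop for the hinge loss; the fixed learning rate eta and the toy data are assumptions of mine, not values from the slides:

    def sgd_train(examples, dim, eta=0.1, epochs=10):
        """Train a linear classifier with SGD on the hinge loss."""
        w = [0.0] * dim
        for _ in range(epochs):
            for x, t in examples:
                score = sum(w_i * x_i for w_i, x_i in zip(w, x))
                if 1.0 - t * score > 0:         # margin violated: gradient is -t * x_i
                    for i in range(dim):
                        w[i] += eta * t * x[i]  # w_i -= eta * (-t * x_i)
        return w

    # Toy data: the first feature indicates the positive class, the second the negative.
    data = [([1.0, 0.1], 1), ([0.9, 0.2], 1), ([0.1, 1.0], -1), ([0.2, 0.8], -1)]
    print(sgd_train(data, dim=2))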

  • AdaGrad+RDA (1/6)

    Besides SGD there are newer online learning algorithms such as AROW and SCW.

    AdaGrad+RDA, the topic of this talk, belongs to the same family as AROW and SCW.

    How SGD and AdaGrad+RDA differ. SGD:

    at step s, the single example x_s is used to update w_s into w_{s+1}. AdaGrad+RDA:

    w_{s+1} is recomputed from all examples seen so far, from step 0 through s (the contrast is written out below).
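
    Written out, the contrast looks like this; the AdaGrad+RDA objective shown here is only a preview of the regret defined on the next slides (η is SGD's learning rate; λ, g and h are defined below):

    \begin{align*}
    \text{SGD:}\quad & w_{s+1} = w_s - \eta\,\nabla f(w_s, x_s, t_s) \\
    \text{AdaGrad+RDA:}\quad & w_{s+1} = \arg\min_{w}\Big(\sum_i g_i w_i
      + \lambda \lVert w \rVert_1 + \tfrac{1}{2}\sum_i h_i w_i^2\Big)
    \end{align*}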

  • AdaGrad+RDA (2/6)

    AdaGrad+RDA picks w_{s+1} by minimizing the following regret (a small Python evaluation of R follows the definitions):

    R(w_{s+1}) = \sum_i \left( g_i w_{s+1,i} \right) + \lambda \lVert w_{s+1} \rVert_1 + \frac{1}{2} \sum_i \left( h_i w_{s+1,i}^2 \right)

    g_i = \frac{1}{s} \sum_{j=0}^{s} \partial f(w_j, x_j, t_j) / \partial w_{j,i}

    h_i = \frac{1}{s} \sum_{j=0}^{s} \left\{ \partial f(w_j, x_j, t_j) / \partial w_{j,i} \right\}^2
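
    For concreteness, the regret can be evaluated in a few lines of Python (a sketch; lam stands for λ and the function name is mine):

    def regret(w, g, h, lam):
        """R(w) = sum_i g_i*w_i + lam*||w||_1 + 0.5*sum_i h_i*w_i**2."""
        loss_term = sum(g_i * w_i for g_i, w_i in zip(g, w))
        l1_term = lam * sum(abs(w_i) for w_i in w)
        proximal_term = 0.5 * sum(h_i * w_i ** 2 for h_i, w_i in zip(h, w))
        return loss_term + l1_term + proximal_term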

  • AdaGrad+RDA (3/6)

    The regret contains four kinds of quantities: w, g, h, and λ.

    w: the weight vector at step s+1, the thing we want to find.

    g, h: statistics computed from the gradients of the loss f on the data seen so far (defined on the next slide).

    λ: the strength of the regularization.

    Given g and h, minimizing the regret over w yields the update:

    R(w_{s+1}) = \sum_i \left( g_i w_{s+1,i} \right) + \lambda \lVert w_{s+1} \rVert_1 + \frac{1}{2} \sum_i \left( h_i w_{s+1,i}^2 \right)

  • AdaGrad+RDA (4/6)

    g and h summarize the gradients of the loss over all steps up to s; both can be updated incrementally, as sketched after the formulas.

    g_i is the average gradient for feature i:

    g_i = \frac{1}{s} \sum_{j=0}^{s} \partial f(w_j, x_j, t_j) / \partial w_{j,i}

    h_i is the average squared gradient for feature i:

    h_i = \frac{1}{s} \sum_{j=0}^{s} \left\{ \partial f(w_j, x_j, t_j) / \partial w_{j,i} \right\}^2

    where \partial f(w_j, x_j, t_j) / \partial w_{j,i} is the gradient of the loss on example j with respect to weight i.
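
    Both averages can be maintained incrementally; a minimal sketch, where s counts the gradients folded in so far (names are mine):

    def update_statistics(g, h, grad, s):
        """Fold the step-s gradient into the running averages g and h (in place)."""
        for i, d in enumerate(grad):
            g[i] = (g[i] * s + d) / (s + 1)      # running average of gradients
            h[i] = (h[i] * s + d * d) / (s + 1)  # running average of squared gradients

    g, h = [0.0, 0.0], [0.0, 0.0]
    update_statistics(g, h, [-1.0, 0.5], s=0)
    print(g, h)  # -> [-1.0, 0.5] [1.0, 0.25]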

  • AdaGrad+RDA (5/6)

    Setting \partial R(w) / \partial w = 0 and solving for w gives a closed form w = r(λ, g, h), so the whole algorithm is:

    w = 0; for ((x,t) in X) { update g(w,x,t); update h(w,x,t); w = r(λ, g, h); }

  • AdaGrad+RDA (6/6)

    Solving \partial R(w) / \partial w = 0 gives r(λ, g, h) coordinate by coordinate (a full code sketch of the algorithm follows):

    w_i = 0                          if |g_i| \le \lambda

    w_i = -(g_i - \lambda) / h_i     if g_i > \lambda

    w_i = -(g_i + \lambda) / h_i     if g_i < -\lambda
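
    Putting the loop from (5/6) and the closed form above together, a compact runnable sketch with the hinge loss; the regularization strength, epoch count and toy data are my assumptions, not values from the slides:

    def adagrad_rda_train(examples, dim, lam=0.01, epochs=10):
        """Sketch of AdaGrad+RDA with the hinge loss; lam is the L1 strength λ."""
        w = [0.0] * dim
        g = [0.0] * dim   # running average of gradients
        h = [0.0] * dim   # running average of squared gradients
        s = 0
        for _ in range(epochs):
            for x, t in examples:
                # gradient of the hinge loss at the current w
                score = sum(w_i * x_i for w_i, x_i in zip(w, x))
                grad = [-t * x_i if 1.0 - t * score > 0 else 0.0 for x_i in x]
                # fold the new gradient into the running averages
                for i, d in enumerate(grad):
                    g[i] = (g[i] * s + d) / (s + 1)
                    h[i] = (h[i] * s + d * d) / (s + 1)
                s += 1
                # closed-form minimizer r(λ, g, h), coordinate by coordinate
                # (if h_i is zero then g_i is zero too, so the first branch applies)
                for i in range(dim):
                    if abs(g[i]) <= lam:
                        w[i] = 0.0
                    elif g[i] > lam:
                        w[i] = -(g[i] - lam) / h[i]
                    else:
                        w[i] = -(g[i] + lam) / h[i]
        return w

    data = [([1.0, 0.1], 1), ([0.9, 0.2], 1), ([0.1, 1.0], -1), ([0.2, 0.8], -1)]
    print(adagrad_rda_train(data, dim=2))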

  • The name AdaGrad+RDA (1/3)

    The name AdaGrad+RDA describes the two ideas the algorithm combines.

    AdaGrad = Adaptive Gradient: each feature gets its own effective step size, the same kind of per-feature adaptation found in AROW and SCW.

    RDA = Regularized Dual Averaging. Regularized: a regularization term is included in the objective. Dual Averaging: the gradients of all past losses are averaged.

  • The name AdaGrad+RDA (2/3)

    Each term of the regret corresponds to a part of the name:

    R(w_{s+1}) = \sum_i \left( g_i w_{s+1,i} \right) + \lambda \lVert w_{s+1} \rVert_1 + \frac{1}{2} \sum_i \left( h_i w_{s+1,i}^2 \right)

    loss function term (Dual Averaging): \sum_i g_i w_{s+1,i}

    regularization term (Regularized): \lambda \lVert w_{s+1} \rVert_1

    proximal term (Adaptive Gradient): \frac{1}{2} \sum_i h_i w_{s+1,i}^2

  • The name AdaGrad+RDA (3/3)

    Why the first term is a "dual averaging" of the losses: choosing w_{s+1} to maximize how much the past weights outperformed it is the same as minimizing the averaged-gradient term (a small numeric check follows the derivation).

    \max_{w_{s+1}} \sum_{j=0}^{s} \langle \nabla f_j, w_j - w_{s+1} \rangle

    = \max_{w_{s+1}} \left( \sum_{j=0}^{s} \langle \nabla f_j, w_j \rangle - \sum_{j=0}^{s} \langle \nabla f_j, w_{s+1} \rangle \right)

    = \min_{w_{s+1}} \sum_{j=0}^{s} \langle \nabla f_j, w_{s+1} \rangle        (the first sum does not depend on w_{s+1})

    = \min_{w_{s+1}} \left\langle \sum_{j=0}^{s} \nabla f_j / s,\; w_{s+1} \right\rangle        (dividing by s > 0 does not change the minimizer)

    = \min_{w_{s+1}} \langle g, w_{s+1} \rangle = \min_{w_{s+1}} \sum_i g_i w_{s+1,i}
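
    A quick numeric sanity check of the last two equalities: scaling the objective by the positive constant 1/s does not change which w attains the minimum, so minimizing the summed inner products and minimizing the single averaged one select the same candidate (the gradients and candidate set below are random and purely illustrative):

    import random

    random.seed(0)
    dim, steps = 3, 5
    grads = [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(steps)]
    avg = [sum(g[i] for g in grads) / steps for i in range(dim)]
    candidates = [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(100)]

    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    best_by_sum = min(candidates, key=lambda w: sum(dot(g, w) for g in grads))
    best_by_avg = min(candidates, key=lambda w: dot(avg, w))
    print(best_by_sum == best_by_avg)  # expected: True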

  • Summary (1/1)

    This talk covered SGD and AdaGrad+RDA. An implementation of AdaGrad+RDA is available at https://github.com/echizentm/AdaGrad.

    References:

    Duchi et al. (2010). Adaptive Subgradient Methods for Online Learning and Stochastic Optimization.

    Xiao (2010). Dual Averaging Methods for Regularized Stochastic Learning and Online Optimization.