AdaGrad+RDA in 30 Minutes


  • AdaGrad+RDA in 30 Minutes

    echizen_tm, Oct. 11, 2014

  • Agenda: linear classifiers (3p), Stochastic Gradient Descent (2p), deriving AdaGrad+RDA (6p), the name AdaGrad+RDA (3p), summary (1p)

  • Linear classifiers (1/3)

    The thing to predict can be one of two labels { , }, one of several labels { , , , , }, or a numeric value {10, 20, 30, 40, ...}; the rest of the talk deals with the first case, predicting one of two classes.

  • Linear classifiers (2/3)

    Input: a feature vector x. Model: a weight vector w.

    Compute the score y = \sum_i x_i w_i; if y > 0, predict class A, otherwise predict that the input is not A.

  • Linear classifiers (3/3)

    Example: x = (1, 1, 1, 1), w = (1, 1, 1, -1).

    y = 1*1 + 1*1 + 1*1 + 1*(-1) = 2 > 0

    so the example is classified as the positive class (t=1); a non-positive score would mean the negative class (t=-1). A small code sketch follows.
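
    A minimal Python sketch of this prediction rule (the function name and the way the vectors are stored are mine, not from the slides):

    def predict(w, x):
        """Linear classifier: +1 if the score w.x is positive, else -1."""
        score = sum(w_i * x_i for w_i, x_i in zip(w, x))
        return 1 if score > 0 else -1

    # The worked example above: score = 1 + 1 + 1 - 1 = 2 > 0, so predict +1.
    print(predict([1, 1, 1, -1], [1, 1, 1, 1]))  # -> 1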

  • Stochastic Gradient Descent (1/2)

    Goal: learn the weight vector w from training examples (x, t), where t ∈ {1, -1} is the true label.

    How badly w does on a single example is measured by a loss function; two common choices are (sketched in code below):

    Hinge loss: f(w, x, t) = \max(0,\, 1 - t \sum_i x_i w_i)

    Squared loss: f(w, x, t) = \frac{1}{2} \left( t - \sum_i x_i w_i \right)^2
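
    The two losses as small Python functions, a sketch with names of my choosing:

    def hinge_loss(w, x, t):
        """Hinge loss: zero once the example is classified with margin >= 1."""
        score = sum(w_i * x_i for w_i, x_i in zip(w, x))
        return max(0.0, 1.0 - t * score)

    def squared_loss(w, x, t):
        """Squared loss: half the squared gap between the label and the score."""
        score = sum(w_i * x_i for w_i, x_i in zip(w, x))
        return 0.5 * (t - score) ** 2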

  • Stochastic Gradient Descent (2/2)

    w is learned with Stochastic Gradient Descent (SGD): go through the examples one at a time and step w against the gradient of the loss (the learning rate is omitted here for simplicity; a runnable sketch follows):

    w = 0; for ((x,t) in X) { w -= ∇f(w, x, t); }

    The gradients of the two losses are:

    Hinge loss: \partial f(w, x, t) / \partial w_i = -t x_i  when 1 - t \sum_j x_j w_j > 0, and 0 otherwise

    Squared loss: \partial f(w, x, t) / \partial w_i = -\left( t - \sum_j x_j w_j \right) x_i
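
    A runnable sketch of this loop for the hinge loss; the fixed learning rate eta and the toy data are assumptions of mine, not values from the slides:

    def sgd_train(examples, dim, eta=0.1, epochs=10):
        """Train a linear classifier with SGD on the hinge loss."""
        w = [0.0] * dim
        for _ in range(epochs):
            for x, t in examples:
                score = sum(w_i * x_i for w_i, x_i in zip(w, x))
                if 1.0 - t * score > 0:         # margin violated: gradient is -t * x_i
                    for i in range(dim):
                        w[i] += eta * t * x[i]  # w_i -= eta * (-t * x_i)
        return w

    # Toy data: the first feature indicates the positive class, the second the negative.
    data = [([1.0, 0.1], 1), ([0.9, 0.2], 1), ([0.1, 1.0], -1), ([0.2, 0.8], -1)]
    print(sgd_train(data, dim=2))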

  • AdaGrad+RDA (1/6)

    Besides SGD there are newer online learning algorithms such as AROW and SCW.

    AdaGrad+RDA, the topic of this talk, belongs to the same family as AROW and SCW.

    How SGD and AdaGrad+RDA differ. SGD:

    at step s, the single example x_s is used to update w_s into w_{s+1}. AdaGrad+RDA:

    w_{s+1} is recomputed from all examples seen so far, from step 0 through s (the contrast is written out below).
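
    Written out, the contrast looks like this; the AdaGrad+RDA objective shown here is only a preview of the regret defined on the next slides (η is SGD's learning rate; λ, g and h are defined below):

    \begin{align*}
    \text{SGD:}\quad & w_{s+1} = w_s - \eta\,\nabla f(w_s, x_s, t_s) \\
    \text{AdaGrad+RDA:}\quad & w_{s+1} = \arg\min_{w}\Big(\sum_i g_i w_i
      + \lambda \lVert w \rVert_1 + \tfrac{1}{2}\sum_i h_i w_i^2\Big)
    \end{align*}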

  • AdaGrad+RDA (2/6)

    AdaGrad+RDA picks w_{s+1} by minimizing the following regret (a small Python evaluation of R follows the definitions):

    R(w_{s+1}) = \sum_i \left( g_i w_{s+1,i} \right) + \lambda \lVert w_{s+1} \rVert_1 + \frac{1}{2} \sum_i \left( h_i w_{s+1,i}^2 \right)

    g_i = \frac{1}{s} \sum_{j=0}^{s} \partial f(w_j, x_j, t_j) / \partial w_{j,i}

    h_i = \frac{1}{s} \sum_{j=0}^{s} \left\{ \partial f(w_j, x_j, t_j) / \partial w_{j,i} \right\}^2
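
    For concreteness, the regret can be evaluated in a few lines of Python (a sketch; lam stands for λ and the function name is mine):

    def regret(w, g, h, lam):
        """R(w) = sum_i g_i*w_i + lam*||w||_1 + 0.5*sum_i h_i*w_i**2."""
        loss_term = sum(g_i * w_i for g_i, w_i in zip(g, w))
        l1_term = lam * sum(abs(w_i) for w_i in w)
        proximal_term = 0.5 * sum(h_i * w_i ** 2 for h_i, w_i in zip(h, w))
        return loss_term + l1_term + proximal_term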

  • AdaGrad+RDA (3/6)

    The regret contains four kinds of quantities: w, g, h, and λ.

    w: the weight vector at step s+1, the thing we want to find.

    g, h: statistics computed from the gradients of the loss f on the data seen so far (defined on the next slide).

    λ: the strength of the regularization.

    Given g and h, minimizing the regret over w yields the update:

    R(w_{s+1}) = \sum_i \left( g_i w_{s+1,i} \right) + \lambda \lVert w_{s+1} \rVert_1 + \frac{1}{2} \sum_i \left( h_i w_{s+1,i}^2 \right)

  • AdaGrad+RDA (4/6)

    g and h summarize the gradients of the loss over all steps up to s; both can be updated incrementally, as sketched after the formulas.

    g_i is the average gradient for feature i:

    g_i = \frac{1}{s} \sum_{j=0}^{s} \partial f(w_j, x_j, t_j) / \partial w_{j,i}

    h_i is the average squared gradient for feature i:

    h_i = \frac{1}{s} \sum_{j=0}^{s} \left\{ \partial f(w_j, x_j, t_j) / \partial w_{j,i} \right\}^2

    where \partial f(w_j, x_j, t_j) / \partial w_{j,i} is the gradient of the loss on example j with respect to weight i.
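
    Both averages can be maintained incrementally; a minimal sketch, where s counts the gradients folded in so far (names are mine):

    def update_statistics(g, h, grad, s):
        """Fold the step-s gradient into the running averages g and h (in place)."""
        for i, d in enumerate(grad):
            g[i] = (g[i] * s + d) / (s + 1)      # running average of gradients
            h[i] = (h[i] * s + d * d) / (s + 1)  # running average of squared gradients

    g, h = [0.0, 0.0], [0.0, 0.0]
    update_statistics(g, h, [-1.0, 0.5], s=0)
    print(g, h)  # -> [-1.0, 0.5] [1.0, 0.25]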

  • AdaGrad+RDA (5/6)

    Setting \partial R(w) / \partial w = 0 and solving for w gives a closed form w = r(λ, g, h), so the whole algorithm is:

    w = 0; for ((x,t) in X) { update g(w,x,t); update h(w,x,t); w = r(λ, g, h); }

  • AdaGrad+RDA (6/6)

    Solving \partial R(w) / \partial w = 0 gives r(λ, g, h) coordinate by coordinate (a full code sketch of the algorithm follows):

    w_i = 0                          if |g_i| \le \lambda

    w_i = -(g_i - \lambda) / h_i     if g_i > \lambda

    w_i = -(g_i + \lambda) / h_i     if g_i < -\lambda
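
    Putting the loop from (5/6) and the closed form above together, a compact runnable sketch with the hinge loss; the regularization strength, epoch count and toy data are my assumptions, not values from the slides:

    def adagrad_rda_train(examples, dim, lam=0.01, epochs=10):
        """Sketch of AdaGrad+RDA with the hinge loss; lam is the L1 strength λ."""
        w = [0.0] * dim
        g = [0.0] * dim   # running average of gradients
        h = [0.0] * dim   # running average of squared gradients
        s = 0
        for _ in range(epochs):
            for x, t in examples:
                # gradient of the hinge loss at the current w
                score = sum(w_i * x_i for w_i, x_i in zip(w, x))
                grad = [-t * x_i if 1.0 - t * score > 0 else 0.0 for x_i in x]
                # fold the new gradient into the running averages
                for i, d in enumerate(grad):
                    g[i] = (g[i] * s + d) / (s + 1)
                    h[i] = (h[i] * s + d * d) / (s + 1)
                s += 1
                # closed-form minimizer r(λ, g, h), coordinate by coordinate
                # (if h_i is zero then g_i is zero too, so the first branch applies)
                for i in range(dim):
                    if abs(g[i]) <= lam:
                        w[i] = 0.0
                    elif g[i] > lam:
                        w[i] = -(g[i] - lam) / h[i]
                    else:
                        w[i] = -(g[i] + lam) / h[i]
        return w

    data = [([1.0, 0.1], 1), ([0.9, 0.2], 1), ([0.1, 1.0], -1), ([0.2, 0.8], -1)]
    print(adagrad_rda_train(data, dim=2))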

  • The name AdaGrad+RDA (1/3)

    The name AdaGrad+RDA describes the two ideas the algorithm combines.

    AdaGrad = Adaptive Gradient: each feature gets its own effective step size, the same kind of per-feature adaptation found in AROW and SCW.

    RDA = Regularized Dual Averaging. Regularized: a regularization term is included in the objective. Dual Averaging: the gradients of all past losses are averaged.

  • The name AdaGrad+RDA (2/3)

    Each term of the regret corresponds to a part of the name:

    R(w_{s+1}) = \sum_i \left( g_i w_{s+1,i} \right) + \lambda \lVert w_{s+1} \rVert_1 + \frac{1}{2} \sum_i \left( h_i w_{s+1,i}^2 \right)

    loss function term (Dual Averaging): \sum_i g_i w_{s+1,i}

    regularization term (Regularized): \lambda \lVert w_{s+1} \rVert_1

    proximal term (Adaptive Gradient): \frac{1}{2} \sum_i h_i w_{s+1,i}^2

  • The name AdaGrad+RDA (3/3)

    Why the first term is a "dual averaging" of the losses: choosing w_{s+1} to maximize how much the past weights outperformed it is the same as minimizing the averaged-gradient term (a small numeric check follows the derivation).

    \max_{w_{s+1}} \sum_{j=0}^{s} \langle \nabla f_j, w_j - w_{s+1} \rangle

    = \max_{w_{s+1}} \left( \sum_{j=0}^{s} \langle \nabla f_j, w_j \rangle - \sum_{j=0}^{s} \langle \nabla f_j, w_{s+1} \rangle \right)

    = \min_{w_{s+1}} \sum_{j=0}^{s} \langle \nabla f_j, w_{s+1} \rangle        (the first sum does not depend on w_{s+1})

    = \min_{w_{s+1}} \left\langle \sum_{j=0}^{s} \nabla f_j / s,\; w_{s+1} \right\rangle        (dividing by s > 0 does not change the minimizer)

    = \min_{w_{s+1}} \langle g, w_{s+1} \rangle = \min_{w_{s+1}} \sum_i g_i w_{s+1,i}
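
    A quick numeric sanity check of the last two equalities: scaling the objective by the positive constant 1/s does not change which w attains the minimum, so minimizing the summed inner products and minimizing the single averaged one select the same candidate (the gradients and candidate set below are random and purely illustrative):

    import random

    random.seed(0)
    dim, steps = 3, 5
    grads = [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(steps)]
    avg = [sum(g[i] for g in grads) / steps for i in range(dim)]
    candidates = [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(100)]

    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    best_by_sum = min(candidates, key=lambda w: sum(dot(g, w) for g in grads))
    best_by_avg = min(candidates, key=lambda w: dot(avg, w))
    print(best_by_sum == best_by_avg)  # expected: True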

  • Summary (1/1)

    This talk covered SGD and AdaGrad+RDA. An implementation of AdaGrad+RDA is available at https://github.com/echizentm/AdaGrad.

    References:

    Duchi et al. (2010). Adaptive Subgradient Methods for Online Learning and Stochastic Optimization.

    Xiao (2010). Dual Averaging Methods for Regularized Stochastic Learning and Online Optimization.