Lecture 15: Optimization
CS109B, STAT121B, AC209B, CSE109B
Mark Glickman and Pavlos Protopapas
Learning vs. Optimization
• Goal of learning: minimize generalization error
• In practice, empirical risk minimization:
J(θ) = E_{(x,y)∼p_data}[ L(f(x; θ), y) ]

J(θ) = (1/m) Σ_{i=1}^{m} L(f(x^(i); θ), y^(i))
The quantity optimized is different from the quantity we care about.
Batch vs. Stochastic Algorithms
• Batch algorithms
  – Optimize the empirical risk using exact gradients
• Stochastic algorithms
  – Estimate the gradient from a small random sample
∇J(θ) = E_{(x,y)∼p_data}[ ∇L(f(x; θ), y) ]
Large mini-batch: gradient computation is expensive
Small mini-batch: greater variance in the estimate, more steps needed for convergence
Critical Points
• Points with zero gradient
• The 2nd derivative (Hessian) determines the curvature
Goodfellow et al. (2016)
Stochastic Gradient Descent
• Take small steps in the direction of the negative gradient
• Sample m examples from the training set and compute:
• Update the parameters:
g = (1/m) Σ_i ∇L(f(x^(i); θ), y^(i))
θ = θ − ε_k g
In practice: shuffle the training set once and pass through it multiple times
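A minimal NumPy sketch of one SGD epoch along these lines; the grad_loss function, the data arrays X and y, and the hyperparameter values are illustrative assumptions, not from the lecture:

```python
import numpy as np

def sgd_epoch(theta, X, y, grad_loss, lr=0.01, batch_size=32):
    """One pass over a shuffled training set with mini-batch SGD."""
    idx = np.random.permutation(len(X))           # shuffle the training set once
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]     # sample m examples
        g = grad_loss(theta, X[batch], y[batch])  # g = (1/m) sum_i grad L(f(x_i; theta), y_i)
        theta = theta - lr * g                    # theta <- theta - eps_k * g
    return theta
```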
Stochastic Gradient Descent
Oscillations occur because updates do not exploit curvature information
[Figure: contours of J(θ), Goodfellow et al. (2016)]
Outline
• Challenges in Optimization
• Momentum
• Adaptive Learning Rate
• Parameter Initialization
• Batch Normalization
Local Minima
[Figure: Goodfellow et al. (2016)]
Local Minima
• Old view: local minima are a major problem in neural network training
• Recent view:
  – For sufficiently large neural networks, most local minima incur low cost
  – It is not important to find the true global minimum
Saddle Points
• Recent studies indicate that in high dimensions, saddle points are more likely than local minima
• The gradient can be very small near saddle points
A saddle point is both a local min and a local max (along different directions)
[Figure: Goodfellow et al. (2016)]
Saddle Points
• SGD is seen to escape saddle points
  – It moves downhill and uses noisy gradients
• Second-order methods can get stuck
  – They solve for a point with zero gradient
Goodfellow et al. (2016)
Poor Conditioning
• Poorly conditioned Hessian matrix
  – High curvature: small steps lead to huge increases in cost
• Learning is slow despite strong gradients
Oscillations slow down progress
Goodfellow et al. (2016)
No Critical Points
• Some cost functions do not have critical points
Goodfellow et al. (2016)
No Critical Points
Gradient norm increases, but validation error decreases
[Figure: convolutional nets for object detection, Goodfellow et al. (2016)]
Exploding and Vanishing Gradients
h^(1) = W x,   h^(i) = W h^(i−1) for i = 2, …, n   (linear activations)
y = σ(h^(n)_1 + h^(n)_2),  where σ(s) = 1 / (1 + e^(−s))
deeplearning.ai
Exploding and Vanishing Gradients
Suppose W = [ a 0 ; 0 b ]:

[ h^(1)_1 ; h^(1)_2 ] = [ a 0 ; 0 b ] [ x_1 ; x_2 ]   …   [ h^(n)_1 ; h^(n)_2 ] = [ a^n 0 ; 0 b^n ] [ x_1 ; x_2 ]

y = σ(a^n x_1 + b^n x_2)

∇y = σ′(a^n x_1 + b^n x_2) [ n a^(n−1) x_1 ; n b^(n−1) x_2 ]
Exploding and Vanishing Gradients
Suppose x = [ 1 ; 1 ]:

Case 1: a = 1, b = 2:   y → 1,   ∇y → [ n ; n 2^(n−1) ]   (explodes!)
Case 2: a = 0.5, b = 0.9:   y → 0,   ∇y → [ 0 ; 0 ]   (vanishes!)
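A quick NumPy check of these two cases. It tracks only the linear part of the computation: the activations h^(n) = W^n x and the factors n·a^(n−1), n·b^(n−1) that the weight gradients scale with (a sketch for intuition, not from the slides):

```python
import numpy as np

def depth_effect(a, b, n, x=np.array([1.0, 1.0])):
    """Effect of n layers of W = diag(a, b) on activations and gradient scale."""
    h_n = np.array([a**n * x[0], b**n * x[1]])                        # W^n x
    grad_scale = np.array([n * a**(n - 1) * x[0], n * b**(n - 1) * x[1]])
    return h_n, grad_scale

print(depth_effect(1.0, 2.0, 50))   # case 1: second component ~ 2^50, explodes
print(depth_effect(0.5, 0.9, 50))   # case 2: components shrink toward 0 as n grows, vanishes
```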
Exploding and Vanishing Gradients
• Exploding gradients lead to cliffs in the cost surface
• This can be mitigated using gradient clipping (see the sketch below)
Goodfellow et al. (2016)
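A minimal sketch of clipping a gradient by its global norm before the update step; the threshold of 1.0 is an arbitrary illustrative choice:

```python
import numpy as np

def clip_by_norm(g, max_norm=1.0):
    """Rescale g so its L2 norm never exceeds max_norm."""
    norm = np.linalg.norm(g)
    if norm > max_norm:
        g = g * (max_norm / norm)
    return g

# usage inside a gradient-descent step:
# theta = theta - lr * clip_by_norm(g)
```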
Poor correspondence between local and global structure
Goodfellow et al. (2016)
Outline
• Challenges in Optimization
• Momentum
• Adaptive Learning Rate
• Parameter Initialization
• Batch Normalization
Momentum
• SGD is slow when there is high curvature
• The averaged gradient gives a faster path to the optimum:
  – vertical components cancel out
[Figure: contours of J(θ), deeplearning.ai]
Momentum
• Uses past gradients for the update
• Maintains a new quantity: the 'velocity'
• Exponentially decaying average of gradients:
  v = α v + (−ε g)
  α ∈ [0, 1) controls how quickly the effect of past gradients decays; −ε g is the current gradient update
Momentum
• Compute the gradient estimate:
  g = (1/m) Σ_i ∇_θ L(f(x^(i); θ), y^(i))
• Update the velocity:
  v = α v − ε g
• Update the parameters:
  θ = θ + v
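These three steps translate directly into code; a sketch with illustrative hyperparameter values, where the gradient g is assumed to be computed as above:

```python
def momentum_step(theta, v, g, lr=0.01, alpha=0.9):
    """One momentum update: accumulate the velocity, then move the parameters."""
    v = alpha * v - lr * g   # v <- alpha * v - eps * g
    theta = theta + v        # theta <- theta + v
    return theta, v
```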
Momentum
Damped oscillations: gradients in opposite directions get cancelled out
[Figure: contours of J(θ), Goodfellow et al. (2016)]
Nesterov Momentum
• Apply an interim update:
  θ̃ = θ + α v
• Perform a correction based on the gradient at the interim point:
  g = (1/m) Σ_i ∇_θ L(f(x^(i); θ̃), y^(i))
  v = α v − ε g
  θ = θ + v
Momentum based on the look-ahead slope
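A sketch of the Nesterov variant; it differs from plain momentum only in evaluating the gradient at the look-ahead point. The grad_loss signature is the same assumed placeholder used in the SGD sketch:

```python
def nesterov_step(theta, v, X, y, grad_loss, lr=0.01, alpha=0.9):
    """Nesterov momentum: the gradient is taken at the interim point theta + alpha*v."""
    theta_interim = theta + alpha * v       # interim (look-ahead) update
    g = grad_loss(theta_interim, X, y)      # correction based on the look-ahead slope
    v = alpha * v - lr * g
    theta = theta + v
    return theta, v
```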
Outline
• Challenges in Optimization
• Momentum
• Adaptive Learning Rate
• Parameter Initialization
• Batch Normalization
Adaptive Learning Rates
• Oscillations along the vertical direction
  – Learning must be slower along parameter 2
• Use a different learning rate for each parameter?
[Figure: contours of J(θ) over parameters θ1 and θ2]
AdaGrad
• Accumulate squared gradients:
  r_i = r_i + g_i²
• Update each parameter:
  θ_i = θ_i − (ε / (δ + √r_i)) g_i
• Greater progress along gently sloped directions
(Step size inversely proportional to the cumulative squared gradient)
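A sketch of the AdaGrad update with element-wise operations on NumPy arrays; δ and the learning rate are illustrative defaults:

```python
import numpy as np

def adagrad_step(theta, r, g, lr=0.01, delta=1e-7):
    """AdaGrad: accumulate squared gradients; the step shrinks where r is large."""
    r = r + g**2                                    # per-parameter accumulator
    theta = theta - lr * g / (delta + np.sqrt(r))   # element-wise scaled step
    return theta, r
```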
RMSProp
• For non-convex problems, AdaGrad can prematurely decrease the learning rate
• Use an exponentially weighted average for the gradient accumulation:
  r_i = ρ r_i + (1 − ρ) g_i²
  θ_i = θ_i − (ε / (δ + √r_i)) g_i
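The same step with the decaying accumulator; ρ = 0.9 is a commonly used illustrative value, not prescribed by the slides:

```python
import numpy as np

def rmsprop_step(theta, r, g, lr=0.001, rho=0.9, delta=1e-6):
    """RMSProp: decaying average of squared gradients instead of a full sum."""
    r = rho * r + (1 - rho) * g**2
    theta = theta - lr * g / (delta + np.sqrt(r))
    return theta, r
```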
Adam
• RMSProp + Momentum
• Estimate the first moment:
  v_i = ρ1 v_i + (1 − ρ1) g_i
• Estimate the second moment:
  r_i = ρ2 r_i + (1 − ρ2) g_i²
• Update the parameters:
  θ_i = θ_i − (ε / (δ + √r_i)) v_i
Also applies bias correction to v and r
Works well in practice and is fairly robust to hyper-parameters
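A sketch of one Adam step, including the bias correction mentioned above; the hyperparameter defaults follow common practice and are not prescribed by the slides:

```python
import numpy as np

def adam_step(theta, v, r, g, t, lr=0.001, rho1=0.9, rho2=0.999, delta=1e-8):
    """Adam: momentum-style first moment + RMSProp-style second moment,
    with bias correction for the zero-initialized accumulators (t starts at 1)."""
    v = rho1 * v + (1 - rho1) * g        # first-moment estimate
    r = rho2 * r + (1 - rho2) * g**2     # second-moment estimate
    v_hat = v / (1 - rho1**t)            # bias-corrected first moment
    r_hat = r / (1 - rho2**t)            # bias-corrected second moment
    theta = theta - lr * v_hat / (delta + np.sqrt(r_hat))
    return theta, v, r
```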
Outline
• Challenges in Optimization
• Momentum
• Adaptive Learning Rate
• Parameter Initialization
• Batch Normalization
Parameter Initialization
• Goal: break symmetry between units
  – so that each unit computes a different function
• Initialize all weights (not biases) randomly
  – Gaussian or uniform distribution
• Scale of the initialization?
  – Too large -> gradient explosion; too small -> gradient vanishing
Xavier Initialization
• Heuristic for all outputs to have unit variance
• For a fully-connected layer with m inputs:
  W_ij ~ N(0, 1/m)
• For ReLU units, it is recommended:
  W_ij ~ N(0, 2/m)
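A sketch of both variants for a fully-connected layer; the weight-matrix shape convention (outputs × inputs) is an illustrative choice:

```python
import numpy as np

def gaussian_init(m, n_out, relu=False):
    """Xavier initialization N(0, 1/m), or N(0, 2/m) when feeding ReLU units."""
    var = 2.0 / m if relu else 1.0 / m
    return np.random.randn(n_out, m) * np.sqrt(var)
```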
Normalized Initialization
• For a fully-connected layer with m inputs and n outputs:
  W_ij ~ U( −√(6/(m+n)), √(6/(m+n)) )
• This heuristic trades off between initializing all layers to have the same activation variance and the same gradient variance
• Sparse variant when m is large
  – Initialize k nonzero weights in each unit
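The uniform variant, as a sketch with the same shape convention as above:

```python
import numpy as np

def normalized_init(m, n):
    """Normalized (uniform) initialization for a layer with m inputs and n outputs."""
    limit = np.sqrt(6.0 / (m + n))
    return np.random.uniform(-limit, limit, size=(n, m))
```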
Bias Initialization
• Output unit bias
  – Set from the marginal statistics of the output in the training set
• Hidden unit bias
  – Avoid saturation at initialization
  – E.g. for ReLU, initialize the bias to 0.1 instead of 0
• Units controlling participation of other units
  – Set the bias to allow participation at initialization
Outline
• Challenges in Optimization
• Momentum
• Adaptive Learning Rate
• Parameter Initialization
• Batch Normalization
Feature Normalization
• Good practice to normalize features before applying the learning algorithm:
  x̃ = (x − µ) / σ
  where x is the feature vector, µ is the vector of mean feature values, and σ is the vector of feature standard deviations
• Features on the same scale: mean 0 and variance 1
  – Speeds up learning
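A sketch of standardizing a feature matrix column-wise; X is an assumed (examples × features) array:

```python
import numpy as np

def standardize(X):
    """Shift and scale each feature to mean 0 and variance 1."""
    mu = X.mean(axis=0)        # vector of mean feature values
    sigma = X.std(axis=0)      # vector of feature standard deviations
    return (X - mu) / sigma, mu, sigma
```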
Feature Normalization
[Figure: contours of J(θ) before normalization vs. after normalization]
Internal Covariate Shift
Each hidden layer changes the distribution of inputs to the next layer, which slows down learning
[Figure: normalize inputs to layer 2, …, normalize inputs to layer n]
Batch Normalization
• Training time:
  – Take the mini-batch of activations for the layer to normalize:
  H = [ H_11 … H_1K ; ⋮ ⋱ ⋮ ; H_N1 … H_NK ]
  (N data points in the mini-batch, K hidden layer activations)
Batch Normalization
• Training time:
  – Normalize the mini-batch of activations:
  H′ = (H − µ) / σ
  where
  µ = (1/m) Σ_i H_i,:   (vector of mean activations across the mini-batch)
  σ = √( (1/m) Σ_i (H − µ)_i² + δ )   (vector of standard deviations of each unit across the mini-batch)
Batch Normalization
• Training time:
  – Normalization can reduce the expressive power of the network
  – Instead use:  γ H′ + β
  – γ and β are learnable parameters
  – Allows the network to control the range of the normalization
Batch Normalization
[Figure: Batch 1 … Batch N flowing through the network; add normalization operations for layer 1, with
  µ_1 = (1/m) Σ_i H_i,:   and   σ_1 = √( (1/m) Σ_i (H − µ)_i² + δ ) ]
Batch Normalization
[Figure: Batch 1 … Batch N; add normalization operations for layer 2 and so on, with
  µ_2 = (1/m) Σ_i H_i,:   and   σ_2 = √( (1/m) Σ_i (H − µ)_i² + δ ) ]
Batch Normalization
• Differentiate the joint loss for the N mini-batches
• Back-propagate through the normalization operations
• Test time:
  – The model needs to be evaluated on a single example
  – Replace µ and σ with running averages collected during training (see the sketch below)
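A minimal NumPy sketch of a batch-normalization layer following the scheme above: normalize with mini-batch statistics and scale/shift with γ and β at training time, and use running averages of µ and the variance at test time. Details such as the momentum of the running averages are illustrative assumptions, not from the slides:

```python
import numpy as np

class BatchNorm:
    def __init__(self, K, delta=1e-5, momentum=0.9):
        self.gamma = np.ones(K)        # learnable scale
        self.beta = np.zeros(K)        # learnable shift
        self.delta = delta
        self.momentum = momentum
        self.running_mu = np.zeros(K)  # collected during training, used at test time
        self.running_var = np.ones(K)

    def forward(self, H, training=True):
        if training:
            mu = H.mean(axis=0)        # mean activation per unit over the mini-batch
            var = H.var(axis=0)        # variance per unit over the mini-batch
            self.running_mu = self.momentum * self.running_mu + (1 - self.momentum) * mu
            self.running_var = self.momentum * self.running_var + (1 - self.momentum) * var
        else:
            mu, var = self.running_mu, self.running_var   # single-example evaluation
        H_norm = (H - mu) / np.sqrt(var + self.delta)     # H' = (H - mu) / sigma
        return self.gamma * H_norm + self.beta            # gamma * H' + beta
```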