
Do We Need Zero Training Loss After Achieving Zero Training Error?

Takashi Ishida 1,2   Ikko Yamane 1   Tomoya Sakai 3   Gang Niu 2   Masashi Sugiyama 2,1

1 The University of Tokyo   2 RIKEN   3 NEC Corporation

Abstract

Overparameterized deep networks have the capacity to memorize training data with zero training error. Even after memorization, the training loss continues to approach zero, making the model overconfident and the test performance degraded. Since existing regularizers do not directly aim to avoid zero training loss, they often fail to maintain a moderate level of training loss, ending up with a too small or too large loss. We propose a direct solution called flooding that intentionally prevents further reduction of the training loss when it reaches a reasonably small value, which we call the flooding level. Our approach makes the loss float around the flooding level by doing mini-batched gradient descent as usual but gradient ascent if the training loss is below the flooding level. This can be implemented with one line of code, and is compatible with any stochastic optimizer and other regularizers. With flooding, the model will continue to "random walk" with the same non-zero training loss, and we expect it to drift into an area with a flat loss landscape that leads to better generalization. We experimentally show that flooding improves performance and, as a byproduct, induces a double descent curve of the test loss.

1 Introduction

"Overfitting" is one of the biggest interests and concerns in the machine learning community [Belkin et al., 2018, Caruana et al., 2000, Ng, 1997, Roelofs et al., 2019, Werpachowski et al., 2019]. One way of identifying if overfitting is happening or not, is to see whether the generalization gap, the test minus the training loss, is increasing or not [Goodfellow et al., 2016]. We can further decompose this situation of the generalization gap into two concepts: The first concept is the situation where both the training and test losses are decreasing, but the training loss is decreasing faster than the test loss ([A] in Fig. 1(a)). The next concept is the situation where the training loss is decreasing, but the test loss is increasing. This tends to occur after the first concept ([B] in Fig. 1(a)).

Within the concept [B], after learning for even more epochs, the training loss will continue to decrease and may become (near-)zero. This is shown as [C] in Fig. 1(a). If you continue training


arXiv:2002.08709v1 [cs.LG] 20 Feb 2020

[Figure 1 here: panels (a) w/o Flooding, (b) w/ Flooding, (c) CIFAR-10 w/o Flooding, (d) CIFAR-10 w/ Flooding; axes show training/test loss versus epochs, with the flooded area marked in (b) and (d).]

Figure 1: (a) shows 3 different concepts related to overfitting. [A] shows the generalization gap increases, while training & test losses decrease. [B] also shows the increasing gap, but the test loss starts to rise. [C] shows the training loss becoming (near-)zero. We avoid [C] by flooding the bottom area, visualized in (b), which forces the training loss to stay around a constant. This leads to a decreasing test loss once again. We confirm these claims in experiments with CIFAR-10 shown in (c)–(d).

even after the model has memorized [Arpit et al., 2017, Belkin et al., 2018, Zhang et al., 2017] the training data completely with zero error, the training loss can easily become (near-)zero, especially with overparametrized models. Recent works on overparametrization and double descent curves [Belkin et al., 2019, Nakkiran et al., 2020] have shown that learning until zero training error is meaningful to achieve a lower generalization error. However, whether zero training loss is necessary after achieving zero training error remains an open issue.

In this paper, we propose a method to make the training loss float around a small constant value, in order to prevent the training loss from approaching zero. This is analogous to flooding the bottom area with water, and we refer to the constant value as the flooding level. Note that even if we add flooding, we can still memorize the training data. Our proposal only forces the training loss to become positive, which does not necessarily mean the training error will become positive, as long as the flooding level is not too large. The idea of flooding is shown in Fig. 1(b), and we show learning curves before and after flooding with benchmark experiments in Fig. 1(c) and Fig. 1(d).1

Algorithm and implementation. Our algorithm of flooding is surprisingly simple. If the original learning objective is J, the proposed modified learning objective J̃ with flooding is

\tilde{J}(\theta) = |J(\theta) - b| + b    (1)

where b > 0 is the flooding level specified by the user, and θ is the model parameter.

The gradient of J̃ w.r.t. θ will point in the same direction as that of J(θ) when J(θ) > b, but in the opposite direction when J(θ) < b. This means that when the learning objective is above the flooding level, there is a "gravity" effect with gradient descent, but when the learning objective is below the flooding level, there is a "buoyancy" effect with gradient ascent. In practice, this will be performed with a mini-batch, and will be compatible with any stochastic optimizers. It can also be used along with other regularization methods.

1 For the details of these experiments, see Appendix D.


During flooding, the training loss will repeat going below and above the flooding level. The model will continue to "random walk" with the same non-zero training loss, and we expect it to drift into an area with a flat loss landscape that leads to better generalization [Chaudhari et al., 2017, Keskar et al., 2017, Li et al., 2018].2

Since it is a simple solution, this modification can be incorporated into existing machine learning code easily: Add one line of code for Eq. (1) after evaluating the original objective function J(θ). A minimal working example with a mini-batch in PyTorch [Paszke et al., 2019] is demonstrated below to show the additional one line of code.

outputs = model(inputs)
loss = criterion(outputs, labels)
flood = (loss - b).abs() + b  # This is it!
optimizer.zero_grad()
flood.backward()
optimizer.step()

It may be hard to set the flooding level without expert knowledge on the domain or task. We can circumvent this situation easily by treating the flooding level as a hyper-parameter. We may use a naive search, which exhaustively evaluates the accuracy for the predefined hyper-parameter candidates with a validation dataset. This procedure can be performed in parallel.
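For illustration, a minimal sketch of this naive search is given below. It is not part of the paper's code, and train_with_flooding and evaluate_accuracy are hypothetical placeholders for the user's own training loop and validation evaluation.

# Hypothetical naive search over flooding-level candidates (each run is independent,
# so the loop body can be dispatched to parallel workers).
candidate_levels = [0.00, 0.01, 0.02, 0.03, 0.04, 0.05]

best_b, best_acc = None, -1.0
for b in candidate_levels:
    model = train_with_flooding(train_loader, flood_level=b)   # hypothetical helper
    acc = evaluate_accuracy(model, valid_loader)                # hypothetical helper
    if acc > best_acc:
        best_b, best_acc = b, acc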

Previous regularization methods. Many previous regularization methods also aim at avoiding training too much in various ways, e.g., restricting the parameter norm to become small by decaying the parameter weights [Hanson and Pratt, 1988], raising the difficulty of training by dropping activations of neural networks [Srivastava et al., 2014], smoothing the training labels [Szegedy et al., 2016], or simply stopping training at an earlier phase [Morgan and Bourlard, 1990]. These methods can be considered as indirect ways to control the training loss, by also introducing additional assumptions, e.g., the optimal model parameters are close to zero. Although making the regularization effect stronger would make it harder for the training loss to approach zero, it is still hard to maintain the right level of training loss till the end of training. In fact, for overparametrized deep networks, applying a small regularization parameter would not stop the training loss becoming (near-)zero, making it even harder to choose a hyper-parameter that corresponds to a specific level of loss.

Flooding, on the other hand, is a direct solution to the issue that the training loss becomes (near-)zero. Flooding intentionally prevents further reduction of the training loss when it reaches a reasonably small value, and the flooding level corresponds to the level of training loss that the user wants to keep.

Contributions. Our proposed regularizer called flooding makes the training loss float around a small constant value, instead of making it head towards zero loss. Flooding is a regularizer that is domain-, task-, and model-independent. Theoretically, we find that the mean squared error can be reduced with flooding under certain conditions. Not only do we show test accuracy improving after flooding, we also observe that even after we avoid zero training loss, memorization with zero training error still takes place.

2 In Appendix F, we show that during this period of random walk, there is an increase in flatness of the loss function.

2 Backgrounds

In this section, we review regularization methods (summarized in Table 1), recent works on overparametrization and double descent curves, and the area of weakly supervised learning where similar techniques to flooding have been explored.

2.1 Regularization Methods

The name "regularization" dates back to at least Tikhonov regularization for the ill-posed linear least-squares problem [Tikhonov and Arsenin, 1977, Tikhonov, 1943]. One example is to modify X^⊤X (where X is the design matrix) to become "regular" by adding a term to the objective function. ℓ2 regularization is a generalization of the above example and can be applied to non-linear models. These methods implicitly assume that the optimal model parameters are close to zero.

It is known that weight decay [Hanson and Pratt, 1988], dropout [Srivastava et al., 2014], and early stopping [Morgan and Bourlard, 1990] are equivalent to ℓ2 regularization under certain conditions [Bishop, 1995, Goodfellow et al., 2016, Loshchilov and Hutter, 2019, Wager et al., 2013], implying that there is a similar assumption on the optimal model parameters. There are other penalties based on different assumptions, such as the ℓ1 regularization [Tibshirani, 1996] based on the sparsity assumption that the optimal model has only a few non-zero parameters.

Modern machine learning tasks are applied to complex problems where the optimal model parameters are not necessarily close to zero or may not be sparse, and it would be ideal if we can properly add regularization effects to the optimization stage without such assumptions. Our proposed method does not have assumptions on the optimal model parameters and can be useful for more complex problems.

More recently, "regularization" has further evolved to a more general meaning, including various methods that alleviate overfitting but do not necessarily have a step to regularize a singular matrix or add a regularization term to the objective function. For example, Goodfellow et al. [2016] defines regularization as "any modification we make to a learning algorithm that is intended to reduce its generalization error but not its training error." In this paper, we adopt this broader meaning of "regularization."

Examples of the more general regularization category include mixup [Zhang et al., 2018] and data augmentation methods like cropping and flipping or adjusting brightness or sharpness [Shorten and Khoshgoftaar, 2019]. These methods have been adopted in many papers to obtain state-of-the-art performance [Berthelot et al., 2019, Verma et al., 2019] and are becoming essential regularization tools for developing new systems. However, these regularization methods have the drawback of being domain-specific: They are designed for the vision domain and require some effort when applying to other domains [Guo et al., 2019, Thulasidasan et al., 2019].


Table 1: Conceptual comparisons of various regularizers. "tr." stands for "training", "Indep." stands for "independent", ✓ stands for yes, and × stands for no.

Regularization and other methods                  Target    Domain   Task     Model    Main assumption
                                                  tr. loss  indep.   indep.   indep.
ℓ2 regularization [Tikhonov, 1943]                ×         ✓        ✓        ✓        Optimal model params are close to 0
Weight decay [Hanson and Pratt, 1988]             ×         ✓        ✓        ✓        Optimal model params are close to 0
Early stopping [Morgan and Bourlard, 1990]        ×         ✓        ✓        ✓        Overfitting occurs in later epochs
ℓ1 regularization [Tibshirani, 1996]              ×         ✓        ✓        ✓        Optimal model has to be sparse
Dropout [Srivastava et al., 2014]                 ×         ✓        ✓        ×        Weight scaling inference rule
Batch normalization [Ioffe and Szegedy, 2015]     ×         ✓        ✓        ×        Existence of internal covariate shift
Label smoothing [Szegedy et al., 2016]            ×         ✓        ×        ✓        True posterior is not a one-hot vector
Mixup [Zhang et al., 2018]                        ×         ×        ×        ✓        Linear relationship between x and y
Image augment. [Shorten and Khoshgoftaar, 2019]   ×         ×        ✓        ✓        Input is invariant to the translations
Flooding (proposed method)                        ✓         ✓        ✓        ✓        Learning until zero loss is harmful

Other regularizers such as label smoothing [Szegedy et al., 2016] are used for problems with class labels and are harder to use with regression or ranking, meaning they are task-specific. Batch normalization [Ioffe and Szegedy, 2015] and dropout [Srivastava et al., 2014] are designed for neural networks and are model-specific.

Although these regularization methods (both the special and general ones) already work well in practice and have become the de facto standard tools [Bishop, 2011, Goodfellow et al., 2016], we provide an alternative which is even more general in the sense that it is domain-, task-, and model-independent.

That being said, we want to emphasize that the most important difference between flooding and other regularization methods is whether it is possible to target a specific level of training loss other than zero. While flooding allows the user to choose the level of training loss directly, it is hard to achieve this with other regularizers.

2.2 Double Descent Curves with Overparametrization

Recently, there has been increasing attention on the phenomenon of "double descent," named by Belkin et al. [2019], to explain the two regimes of deep learning: The first one (underparametrized regime) occurs where the model complexity is small compared to the number of training samples, and the test error, as a function of model complexity, decreases with low model complexity but starts to increase after the model complexity is large enough. This follows the classical view of machine learning that excessive complexity leads to poor generalization. The second one (overparametrized regime) occurs when an even larger model complexity is considered. Then increasing the complexity only decreases test error, which leads to a double descent shape. The phase of decreasing test error often occurs after the training error becomes zero. This follows the modern view of machine learning that bigger models lead to better generalization.3

As far as we know, the discovery of double descent curves dates back to at least Krogh and Hertz [1992], where they theoretically showed the double descent phenomenon under a linear regression setup. Recent works [Belkin et al., 2019, Nakkiran et al., 2020] have shown empirically that a similar phenomenon can be observed with deep learning methods. Nakkiran et al. [2020] observed that the double descent curves can be shown not only as a function of model complexity, but also as a function of the epoch number.

3 https://www.eff.org/ai/metrics

We want to note a byproduct of our flooding method: We were able to produce the epoch-wise double descent curve for the test loss with about 100 epochs. Investigating the connection between our accelerated double descent curves and previous double descent curves [Belkin et al., 2019, Krogh and Hertz, 1992, Nakkiran et al., 2020] is out of the scope of this paper, but is an important future direction.

2.3 Lower-Bounding the Empirical Risk

Lower-bounding the empirical risk has been used in the area of weakly supervised learning. There was a common phenomenon where the empirical risk goes below zero [Kiryo et al., 2017], when an equivalent form of the risk expressed with the given weak supervision was alternatively used [Cid-Sueiro et al., 2014, du Plessis et al., 2014, 2015, Natarajan et al., 2013, Patrini et al., 2017, van Rooyen and Williamson, 2018]. A gradient ascent technique was used to force the empirical risk to become non-negative in Kiryo et al. [2017]. This idea has been generalized and applied to other weakly supervised settings [Han et al., 2018, Ishida et al., 2019, Lu et al., 2020].

Although we also set a lower bound on the empirical risk, the motivation is different: First, while Kiryo et al. [2017] and others aim to fix the negative empirical risk to become lower bounded by zero, our empirical risk already has a lower bound of zero. Instead, we are aiming to sink the original empirical risk by placing a positive lower bound. Second, the problem settings are different: Weakly supervised learning methods require certain loss corrections or sample corrections [Han et al., 2018] before the non-negative correction, but we work on the original empirical risk without any setting-specific modifications.

3 Flooding: How to Avoid Zero Training Loss

In this section, we propose our regularization method, flooding. Note that this section and the following sections only consider multi-class classification for simplicity.

3.1 Preliminaries

Consider input variable x ∈ R^d and output variable y ∈ {1, ..., K}, where K is the number of classes. They follow an unknown joint probability distribution with density p(x, y). We denote the score function by g: R^d → R^K. For any test data point x_0, our prediction of the output label will be given by ŷ_0 = argmax_{z ∈ {1,...,K}} g_z(x_0), where g_z(·) is the z-th element of g(·), and in case of a tie, argmax returns the largest argument. Let ℓ: R^K × {1, 2, ..., K} → R denote a loss


function. ℓ can be the zero-one loss,

\ell_{01}(\boldsymbol{v}, z') = \begin{cases} 0 & \text{if } \operatorname{argmax}_{z \in \{1,\ldots,K\}} v_z = z' \\ 1 & \text{otherwise} \end{cases}    (2)

where v = (v_1, ..., v_K)^⊤ ∈ R^K, or a surrogate loss such as the softmax cross-entropy loss,

\ell_{\mathrm{CE}}(\boldsymbol{v}, z') = -\log \frac{\exp(v_{z'})}{\sum_{z \in \{1,\ldots,K\}} \exp(v_z)}.    (3)

For a surrogate loss ℓ, we denote the classification risk by

R(g) = \mathbb{E}_{p(x,y)}[\ell(g(x), y)],    (4)

where E_{p(x,y)}[·] is the expectation over (x, y) ~ p(x, y). We use R_{01}(g) to denote Eq. (4) when ℓ = ℓ_{01}, and call it the classification error.

The goal of multi-class classification is to learn g that minimizes the classification error R_{01}(g). In optimization, we consider the minimization of the risk with an almost surely differentiable surrogate loss R(g) instead, to make the problem more tractable. Furthermore, since p(x, y) is usually unknown and there is no way to exactly evaluate R(g), we minimize its empirical version calculated from the training data instead:

\hat{R}(g) = \frac{1}{n} \sum_{i=1}^{n} \ell(g(x_i), y_i),    (5)

where {(x_i, y_i)}_{i=1}^{n} are i.i.d. sampled from p(x, y). We call R̂ the empirical risk.

We would like to clarify some of the undefined terms used in the title and the introduction. The "train/test loss" is the empirical risk with respect to the surrogate loss function ℓ over the training/test data, respectively. We refer to the "training/test error" as the empirical risk with respect to ℓ_{01} over the training/test data, respectively (which is equal to one minus accuracy) [Zhang, 2004].

Finally, we formally define the Bayes risk as

R^{*} = \inf_{h} R(h),    (6)

where the infimum is taken over all vector-valued functions h: R^d → R^K. The Bayes risk is often referred to as the Bayes error if the zero-one loss is used:

\inf_{h} R_{01}(h).    (7)

3.2 Algorithm

With flexible models, R̂(g) w.r.t. a surrogate loss can easily become small if not zero, as we mentioned in Section 1; see [C] in Fig. 1(a). We propose a method that "floods the bottom area and sinks the original empirical risk" as in Fig. 1(b), so that the empirical risk cannot go below the flooding level. More technically, if we denote the flooding level as b, our proposed training objective with flooding is a simple fix:


Definition 1. The flooded empirical risk is defined as4

\tilde{R}(g) = |\hat{R}(g) - b| + b.    (8)

Note that when b = 0, then R̃(g) = R̂(g). The gradient of R̃(g) w.r.t. model parameters will point to the same direction as that of R̂(g) when R̂(g) > b, but in the opposite direction when R̂(g) < b. This means that when the learning objective is above the flooding level, we perform gradient descent as usual (gravity zone), but when the learning objective is below the flooding level, we perform gradient ascent instead (buoyancy zone).

The issue is that, in general, we seldom know the optimal flooding level in advance. This issue can be mitigated by searching for the optimal flooding level b* with a hyper-parameter optimization technique. In practice, we can search for the optimal flooding level by performing the exhaustive search in parallel.

3.3 Implementation

For large scale problems, we can employ mini-batched stochastic optimization for efficient computation. Suppose that we have M disjoint mini-batch splits. We denote the empirical risk (5) with respect to the m-th mini-batch by R̂_m(g) for m ∈ {1, ..., M}. Then our mini-batched optimization performs gradient descent updates in the direction of the gradient of the flooded mini-batch risk R̃_m(g). By the convexity of the absolute value function and Jensen's inequality, we have

\tilde{R}(g) \le \frac{1}{M} \sum_{m=1}^{M} \left( |\hat{R}_m(g) - b| + b \right).    (9)

This indicates that mini-batched optimization will simply minimize an upper bound of the full-batch case with R̃(g).
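As a quick numerical illustration of Eq. (9) (our own check, not from the paper), the following sketch draws synthetic per-example losses, splits them into M equal mini-batches, and verifies that the average of the flooded mini-batch risks upper-bounds the flooded full-batch risk:

import numpy as np

rng = np.random.default_rng(0)
b = 0.1                                                    # flooding level
losses = rng.uniform(0.0, 0.3, size=1000)                  # synthetic per-example losses
batches = losses.reshape(20, 50)                           # M = 20 disjoint mini-batches of size 50

lhs = np.abs(losses.mean() - b) + b                        # flooded full-batch risk (left side of Eq. (9))
rhs = np.mean(np.abs(batches.mean(axis=1) - b) + b)        # average flooded mini-batch risk (right side)
assert lhs <= rhs + 1e-12                                  # Jensen's inequality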

3.4 Theoretical Analysis

In the following theorem, we will show that the mean squared error (MSE) of the proposed risk estimator with flooding is smaller than that of the original risk estimator without flooding.

Theorem 1. Fix any measurable vector-valued function g. If the flooding level b satisfies R̂(g) < b < R(g), we have

\mathrm{MSE}(\hat{R}(g)) > \mathrm{MSE}(\tilde{R}(g)).    (10)

If b ≤ R̂(g), we have

\mathrm{MSE}(\hat{R}(g)) = \mathrm{MSE}(\tilde{R}(g)).    (11)

A proof is given in Appendix A. If we regard R̂(g) as the training loss and R(g) as the test loss, we would want b to be between those two for the MSE to improve.

4 Strictly speaking, Eq. (1) is different from Eq. (8), since Eq. (1) can ignore constant terms of the original empirical risk. We will refer to Eq. (8) for the flooding operator for the rest of the paper.
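The following is our own small Monte Carlo sketch (not part of the paper) illustrating the flavor of Theorem 1: with a flooding level b below the true risk R(g) but above many realizations of the empirical risk R̂(g), the flooded estimator tends to have a smaller MSE. The loss distribution and constants are arbitrary choices for the illustration.

import numpy as np

rng = np.random.default_rng(0)
R_true, n, trials, b = 0.5, 20, 100_000, 0.45              # true risk, sample size, repetitions, flooding level

losses = rng.exponential(scale=R_true, size=(trials, n))   # non-negative per-example losses with mean R_true
R_hat = losses.mean(axis=1)                                # empirical risk, Eq. (5), for each trial
R_tilde = np.abs(R_hat - b) + b                            # flooded empirical risk, Eq. (8)

mse_hat = np.mean((R_hat - R_true) ** 2)
mse_tilde = np.mean((R_tilde - R_true) ** 2)
print(mse_hat > mse_tilde)                                 # expected: True in this setting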


4 Experiments

In this section, we show experimental results with synthetic and benchmark datasets. The implementation is based on PyTorch [Paszke et al., 2019] and demo code will be available.5 Experiments were carried out with NVIDIA GeForce GTX 1080 Ti, NVIDIA Quadro RTX 5000, and Intel Xeon Gold 6142.

4.1 Synthetic Experiments

The aim of our synthetic experiments is to study the behavior of flooding with a controlled setup. We use three types of synthetic data described below.

Two Gaussians Data. We perform binary classification with two 10-dimensional Gaussian distributions with covariance matrix identity and means μ_P = [0, 0, ..., 0]^⊤ and μ_N = [m, m, ..., m]^⊤, where m ∈ {0.8, 1.0}. The Bayes risk for m = 1.0 and m = 0.8 are 0.14 and 0.24, respectively, where proofs are shown in Appendix B. The training, validation, and test sample sizes are 25, 10,000, and 10,000 per class, respectively.

Sinusoid Data. The sinusoid data [Nakkiran et al., 2019] are generated as follows. We first draw input data points uniformly from the inside of a 2-dimensional ball of radius 1. Then we put class labels based on

y = \operatorname{sign}(x^\top w + \sin(x^\top w')),

where w and w' are any two 2-dimensional vectors such that w ⊥ w'. The training, validation, and test sample sizes are 100, 100, and 20,000, respectively.
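For concreteness, a hypothetical NumPy sketch of this generation process is given below; the particular orthogonal pair (w, w') and the rejection-sampling scheme for the unit ball are our own choices.

import numpy as np

def generate_sinusoid_data(n, seed=0):
    rng = np.random.default_rng(seed)
    w, w_perp = np.array([1.0, 0.0]), np.array([0.0, 1.0])   # any two orthogonal 2-d vectors
    points = []
    while len(points) < n:                                    # uniform sampling inside the unit ball
        x = rng.uniform(-1.0, 1.0, size=2)
        if np.linalg.norm(x) <= 1.0:
            points.append(x)
    X = np.array(points)
    y = np.sign(X @ w + np.sin(X @ w_perp))                   # labels from the sinusoidal boundary
    return X, y

X_train, y_train = generate_sinusoid_data(100)                # training size used in Section 4.1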

Spiral Data. The spiral data [Sugiyama, 2015] are two-dimensional synthetic data. Let θ^+_1 = 0, θ^+_2, ..., θ^+_{n_+} = 4π be equally spaced n_+ points in the interval [0, 4π], and θ^-_1 = 0, θ^-_2, ..., θ^-_{n_-} = 4π be equally spaced n_- points in the interval [0, 4π]. Let positive and negative input data points be

x^{+}_{i_+} = \theta^{+}_{i_+} [\cos(\theta^{+}_{i_+}), \sin(\theta^{+}_{i_+})]^\top + \tau \nu^{+}_{i_+},
x^{-}_{i_-} = (\theta^{-}_{i_-} + \pi) [\cos(\theta^{-}_{i_-}), \sin(\theta^{-}_{i_-})]^\top + \tau \nu^{-}_{i_-},

for i_+ = 1, ..., n_+ and i_- = 1, ..., n_-, where τ controls the magnitude of the noise, and ν^+_i and ν^-_i are i.i.d. distributed according to the two-dimensional standard normal distribution. Then we make data for classification by {(x_i, y_i)}_{i=1}^{n} = {(x^+_{i_+}, +1)}_{i_+=1}^{n_+} ∪ {(x^-_{i_-}, -1)}_{i_-=1}^{n_-}, where n = n_+ + n_-. The training, validation, and test sample sizes are 100, 100, and 10,000 per class, respectively.
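Similarly, a hypothetical NumPy sketch of the spiral construction is given below; the noise magnitude τ is our own choice, since it is not fixed in the text above.

import numpy as np

def generate_spiral_data(n_pos, n_neg, tau=0.5, seed=0):
    rng = np.random.default_rng(seed)
    theta_p = np.linspace(0.0, 4.0 * np.pi, n_pos)            # equally spaced angles, positive class
    theta_n = np.linspace(0.0, 4.0 * np.pi, n_neg)            # equally spaced angles, negative class
    X_pos = theta_p[:, None] * np.c_[np.cos(theta_p), np.sin(theta_p)] \
            + tau * rng.standard_normal((n_pos, 2))
    X_neg = (theta_n + np.pi)[:, None] * np.c_[np.cos(theta_n), np.sin(theta_n)] \
            + tau * rng.standard_normal((n_neg, 2))
    X = np.vstack([X_pos, X_neg])
    y = np.concatenate([np.ones(n_pos), -np.ones(n_neg)])
    return X, y

X_train, y_train = generate_spiral_data(100, 100)             # 100 points per class for training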

For Two Gaussians, we use a one-hidden-layer feedforward neural network with 500 units in the hidden layer, with the ReLU activation function [Nair and Hinton, 2010]. We train the network for 1000 epochs with the logistic loss and vanilla gradient descent with learning rate of 0.05. The flooding level is chosen from b ∈ {0, 0.01, 0.02, ..., 0.40}.

5 https://github.com/takashiishida/flooding


Table 2: Experimental results for the synthetic data. Sub-table (A) shows the results without early stopping. Sub-table (B) shows the results with early stopping. The better method is shown in bold in each of sub-tables (A) and (B). "BR" stands for the Bayes risk. For Two Gaussians, the distance between the positive and negative distributions is larger for larger m. See the description in Section 4.1 for the details.

                                          (A) Without Early Stopping          (B) With Early Stopping
Data            Setting                   Without    With       Chosen        Without    With       Chosen
                                          Flooding   Flooding   b             Flooding   Flooding   b
Two Gaussians   m: 1.0, BR: 0.14          87.96      92.25      0.28          91.63      92.25      0.27
Two Gaussians   m: 0.8, BR: 0.24          82.00      87.31      0.33          86.57      87.29      0.35
Sinusoid        Label Noise: 0.01         93.84      94.46      0.01          92.54      92.54      0.00
Sinusoid        Label Noise: 0.05         91.12      95.44      0.10          93.26      94.60      0.01
Sinusoid        Label Noise: 0.10         86.57      96.02      0.17          96.70      96.70      0.00
Spiral          Label Noise: 0.01         98.96      97.85      0.01          98.60      98.88      0.01
Spiral          Label Noise: 0.05         93.87      96.24      0.04          96.58      95.62      0.14
Spiral          Label Noise: 0.10         89.70      92.96      0.16          89.70      92.96      0.16

For Sinusoid and Spiral, we use a four-hidden-layer feedforward neural network with 500 units in the hidden layer, with the ReLU activation function [Nair and Hinton, 2010] and batch normalization [Ioffe and Szegedy, 2015]. We train the network for 500 epochs with the logistic loss and Adam [Kingma and Ba, 2015] optimizer with 100 mini-batch size and learning rate of 0.001. The flooding level is chosen from b ∈ {0, 0.01, 0.02, ..., 0.20}. Note that training with b = 0 is identical to the baseline method without flooding. We report the test accuracy of the flooding level with the best validation accuracy. We first conduct experiments without early stopping, which means that the last epoch was chosen for all flooding levels.

Results. The results are summarized in Table 2. It is worth noting that for Two Gaussians, the chosen flooding level b is larger for the smaller distance between the two distributions, which is when the classification task is harder and the Bayes risk becomes larger since the two distributions become less separated. We see similar tendencies for Sinusoid and Spiral data: a larger b was chosen for larger flipping probability for label noise, which is expected to increase the Bayes risk. This implies the positive correlation between the optimal flooding level and the Bayes risk, as is also partially suggested by Theorem 1. Another interesting observation is that the chosen b is close to but higher than the Bayes risk for Two Gaussians data. This may look inconsistent with Theorem 1. However, it makes sense to adopt larger b with a stronger regularization effect that allows some bias as a trade-off for reducing the variance of the risk estimator. In fact, Theorem 1 does not deny the possibility that some b ≥ R(g) achieves even better estimation.

From (A) in Table 2, we can see that the method with flooding often improves test accuracy over the baseline method without flooding. As we mentioned in the introduction, it can be harmful to keep training a model until the end without flooding. However, with flooding, the model at the final epoch has good prediction performance according to the results, which implies that flooding helps the late-stage training improve test accuracy.


Table 3: Results with benchmark datasets. We report classification accuracy for all combinations of weight decay (✓ and ×), early stopping (✓ and ×), and flooding (✓ and ×). The second column shows the training/validation split used for the experiment. W stands for weight decay, E stands for early stopping, and F stands for flooding. "—" means that flooding level of zero was optimal. "NA" means that we skipped the experiments because zero weight decay was optimal in the case without flooding. The best and equivalent are shown in bold by comparing "with flooding" and "without flooding" for two columns with the same setting for W and E, e.g., the first and fifth columns out of the 8 columns. The best performing combination is highlighted.

Dataset           tr/va    W:  ×      ×      ✓      ✓      ×      ×      ✓      ✓
                  split    E:  ×      ✓      ×      ✓      ×      ✓      ×      ✓
                           F:  ×      ×      ×      ×      ✓      ✓      ✓      ✓
MNIST             0.8          98.32  98.30  98.51  98.42  98.46  98.53  98.50  98.48
                  0.4          97.71  97.70  97.82  97.91  97.74  97.85  —      97.83
Fashion-MNIST     0.8          89.34  89.36  NA     NA     —      —      NA     NA
                  0.4          88.48  88.63  88.60  88.62  —      —      —      —
Kuzushiji-MNIST   0.8          91.63  91.62  91.63  91.71  92.40  92.12  92.11  91.97
                  0.4          89.18  89.18  89.58  89.73  90.41  90.15  89.71  89.88
CIFAR-10          0.8          73.59  73.36  73.65  73.57  73.06  73.44  —      74.41
                  0.4          66.39  66.63  69.31  69.28  67.20  67.58  —      —
CIFAR-100         0.8          42.16  42.33  42.67  42.45  42.50  42.36  —      —
                  0.4          34.27  34.34  37.97  38.82  34.99  35.14  —      —
SVHN              0.8          92.38  92.41  93.20  92.99  92.78  92.79  —      93.42
                  0.4          90.32  90.35  90.43  90.49  90.57  90.61  91.16  91.21

We also conducted experiments with early stopping, meaning that we chose the model that recorded the best validation accuracy during training. The results are reported in sub-table (B) of Table 2. Compared with sub-table (A), we see that early stopping improves the baseline method without flooding well in many cases. This indicates that training longer without flooding was harmful in our experiments. On the other hand, the accuracy for flooding combined with early stopping is often close to that with early stopping, meaning that training until the end with flooding tends to be already as good as doing so with early stopping. The table shows that flooding often improves or retains the test accuracy of the baseline method without flooding even after deploying early stopping. Flooding does not hurt performance, but can be beneficial for methods used with early stopping.


[Figure 2 here: panels (a) CIFAR-10 (0.00), (b) CIFAR-10 (0.03), (c) CIFAR-10 (0.07), (d) CIFAR-10 (0.20).]

Figure 2: Learning curves of training and test loss for training/validation proportion of 0.8. (a) shows the learning curves without flooding. (b), (c), and (d) show the learning curves with different flooding levels.

4.2 Benchmark Experiments

We next perform experiments with benchmark datasets. Not only do we compare with the baseline without flooding, we also compare or combine with other general regularization methods, which are early stopping and weight decay.

Settings. We use the following six benchmark datasets: MNIST, Fashion-MNIST, Kuzushiji-MNIST, CIFAR-10, CIFAR-100, and SVHN. The details of the benchmark datasets can be found in Appendix C.1. We split the original training dataset into training and validation data with different proportions: 0.8 or 0.4 (meaning 80% or 40% was used for training and the rest was used for validation, respectively). We perform the exhaustive hyper-parameter search for the flooding level with candidates from {0.00, 0.01, ..., 0.20}. The number of epochs is 500. Stochastic gradient descent [Robbins and Monro, 1951] is used with learning rate of 0.1 and momentum of 0.9. For MNIST, Fashion-MNIST, and Kuzushiji-MNIST, we use a one-hidden-layer feedforward neural network with 500 units and ReLU activation function [Nair and Hinton, 2010]. For CIFAR-10, CIFAR-100, and SVHN, we used ResNet-18 [He et al., 2016]. We do not use any data augmentation or manual learning rate decay. We deployed early stopping in the same way as in Section 4.1.

We first ran experiments with the following candidates for the weight decay rate: {1×10⁻⁵, 1×10⁻⁴, 4×10⁻⁴, 7×10⁻⁴, 1×10⁻³, 4×10⁻³, 7×10⁻³}. We choose the weight decay rate with the best validation accuracy for each dataset and each training/validation proportion. Then, fixing the weight decay to the chosen one, we ran experiments with flooding level candidates from {0, 0.01, ..., 0.20} to investigate whether weight decay and flooding have complementary effects, or if adding weight decay will diminish the accuracy gain of flooding.
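For reference, a hypothetical PyTorch snippet matching the optimizer settings stated above (SGD with learning rate 0.1 and momentum 0.9, with the weight decay rate taken from the listed candidates); the toy network is a stand-in for the models described in the Settings paragraph.

import torch
import torch.nn as nn

# One-hidden-layer network with 500 units, as used for the MNIST-type datasets above.
model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 500), nn.ReLU(), nn.Linear(500, 10))

weight_decay_candidates = [1e-5, 1e-4, 4e-4, 7e-4, 1e-3, 4e-3, 7e-3]
flood_level_candidates = [round(0.01 * i, 2) for i in range(21)]   # {0.00, 0.01, ..., 0.20}

optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9,
                            weight_decay=weight_decay_candidates[0])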

Results. We show the results in Table 3 and the chosen flooding levels in Table 4 in Appendix C.2. We can observe that flooding gives better accuracy for most cases. We can also see that combining flooding with early stopping or with both early stopping and weight decay may lead to even better accuracy in some cases.


[Figure 3 here: panels (a) w/o early stop (0.8), (b) w/o early stop (0.4), (c) w/ early stop (0.8), (d) w/ early stop (0.4).]

Figure 3: We show the optimal flooding level maintains memorization. The vertical axis shows the training accuracy. The horizontal axis shows the flooding level. We show results with and without early stopping and for different training/validation splits with proportion of 0.8 or 0.4. The marks are placed on the flooding level that was chosen based on validation accuracy.

4.3 Memorization

Can we maintain memorization even after adding flooding? We investigate if the trained model has zero training error (100% accuracy) for the flooding level that was chosen with validation data. We show the results for all benchmark datasets and all training/validation splits with proportions 0.8 and 0.4. We also show the case without early stopping (choosing the last epoch) and with early stopping (choosing the epoch with the highest validation accuracy). The results are shown in Fig. 3.

All figures show downward curves, implying that the model will give up eventually on memorizing all training data as the flooding level becomes higher. A more interesting and important observation is the position of the optimal flooding level (the one chosen by validation accuracy, which is marked in the figures). We can observe that the marks are often plotted at zero error, and in some cases, there is a mark on the highest flooding level that maintains zero error. These results are consistent with recent empirical works that imply zero training error leads to lower generalization error [Belkin et al., 2019, Nakkiran et al., 2020], but we further demonstrate that zero training loss may be harmful under zero training error.

5 Conclusion

We proposed a novel regularization method called flooding that keeps the training loss to stay around a small constant value, to avoid zero training loss. In our experiments, the optimal flooding level often maintained memorization of training data with zero error. With flooding, we showed that the test accuracy will improve for various benchmark datasets, and theoretically showed that the mean squared error will be reduced under certain conditions.

As a byproduct, we were able to produce a double descent curve for the test loss with a relatively few number of epochs, e.g., in around 100 epochs, shown in Fig. 2 and Fig. 4 in Appendix D. An important future direction is to study the relationship between this and the double descent curves


from previous works [Belkin et al., 2019, Krogh and Hertz, 1992, Nakkiran et al., 2020].

It would also be interesting to see if Bayesian optimization [Shahriari et al., 2016, Snoek et al., 2012] methods can be utilized to search for the optimal flooding level efficiently. We will investigate this direction in the future.

Acknowledgements

We thank Chang Xu, Genki Yamamoto, Kento Nozawa, Nontawat Charoenphakdee, Voot Tangkaratt, and Yoshihiro Nagano for the helpful discussions. TI was supported by the Google PhD Fellowship Program. MS and IY were supported by JST CREST Grant Number JPMJCR18A2 including AIP challenge program, Japan.

References

Devansh Arpit, Stanisław Jastrzebski, Nicolas Ballas, David Krueger, Emmanuel Bengio, Maxinder S. Kanwal, Tegan Maharaj, Asja Fischer, Aaron Courville, Yoshua Bengio, and Simon Lacoste-Julien. A closer look at memorization in deep networks. In ICML, 2017.
Mikhail Belkin, Daniel Hsu, and Partha P. Mitra. Overfitting or perfect fitting? Risk bounds for classification and regression rules that interpolate. In NeurIPS, 2018.
Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machine-learning practice and the classical bias–variance trade-off. PNAS, 116:15850–15854, 2019.
David Berthelot, Nicholas Carlini, Ian Goodfellow, Nicolas Papernot, Avital Oliver, and Colin A. Raffel. MixMatch: A holistic approach to semi-supervised learning. In NeurIPS, 2019.
Christopher M. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer, 2011.
Christopher Michael Bishop. Regularization and complexity control in feed-forward networks. In ICANN, 1995.
Rich Caruana, Steve Lawrence, and C. Lee Giles. Overfitting in neural nets: Backpropagation, conjugate gradient, and early stopping. In NeurIPS, 2000.
Pratik Chaudhari, Anna Choromanska, Stefano Soatto, Yann LeCun, Carlo Baldassi, Christian Borgs, Jennifer Chayes, Levent Sagun, and Riccardo Zecchina. Entropy-SGD: Biasing gradient descent into wide valleys. In ICLR, 2017.
Jesús Cid-Sueiro, Darío García-García, and Raúl Santos-Rodríguez. Consistency of losses for learning from weak labels. In ECML-PKDD, 2014.
Tarin Clanuwat, Mikel Bober-Irizar, Asanobu Kitamoto, Alex Lamb, Kazuaki Yamamoto, and David Ha. Deep learning for classical Japanese literature. In NeurIPS Workshop on Machine Learning for Creativity and Design, 2018.
Marthinus Christoffel du Plessis, Gang Niu, and Masashi Sugiyama. Analysis of learning from positive and unlabeled data. In NeurIPS, 2014.
Marthinus Christoffel du Plessis, Gang Niu, and Masashi Sugiyama. Convex formulation for learning from positive and unlabeled data. In ICML, 2015.
Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.
Hongyu Guo, Yongyi Mao, and Richong Zhang. Augmenting data with mixup for sentence classification: An empirical study. arXiv:1905.08941, 2019.
Bo Han, Gang Niu, Jiangchao Yao, Xingrui Yu, Miao Xu, Ivor Tsang, and Masashi Sugiyama. Pumpout: A meta approach to robust deep learning with noisy labels. arXiv:1809.11008, 2018.
Stephen Jose Hanson and Lorien Y. Pratt. Comparing biases for minimal network construction with back-propagation. In NeurIPS, 1988.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
Takashi Ishida, Gang Niu, Aditya Krishna Menon, and Masashi Sugiyama. Complementary-label learning for arbitrary losses and models. In ICML, 2019.
Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. In ICLR, 2017.
Diederik P. Kingma and Jimmy L. Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
Ryuichi Kiryo, Gang Niu, Marthinus Christoffel du Plessis, and Masashi Sugiyama. Positive-unlabeled learning with non-negative risk estimator. In NeurIPS, 2017.
Anders Krogh and John A. Hertz. A simple weight decay can improve generalization. In NeurIPS, 1992.
Yann Lecun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. In Proceedings of the IEEE, pages 2278–2324, 1998.
Hao Li, Zheng Xu, Gavin Taylor, Christoph Studer, and Tom Goldstein. Visualizing the loss landscape of neural nets. In NeurIPS, 2018.

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In ICLR, 2019.
Nan Lu, Tianyi Zhang, Gang Niu, and Masashi Sugiyama. Mitigating overfitting in supervised classification from two unlabeled datasets: A consistent risk correction approach. In AISTATS, 2020.
N. Morgan and H. Bourlard. Generalization and parameter estimation in feedforward nets: Some experiments. In NeurIPS, 1990.
Vinod Nair and Geoffrey E. Hinton. Rectified linear units improve restricted Boltzmann machines. In ICML, 2010.
Preetum Nakkiran, Gal Kaplun, Dimitris Kalimeris, Tristan Yang, Benjamin L. Edelman, Fred Zhang, and Boaz Barak. SGD on neural networks learns functions of increasing complexity. In NeurIPS, 2019.
Preetum Nakkiran, Gal Kaplun, Yamini Bansal, Tristan Yang, Boaz Barak, and Ilya Sutskever. Deep double descent: Where bigger models and more data hurt. In ICLR, 2020.
Nagarajan Natarajan, Inderjit S. Dhillon, Pradeep K. Ravikumar, and Ambuj Tewari. Learning with noisy labels. In NeurIPS, 2013.
Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y. Ng. Reading digits in natural images with unsupervised feature learning. In NeurIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.
Andrew Y. Ng. Preventing "overfitting" of cross-validation data. In ICML, 1997.
Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An imperative style, high-performance deep learning library. In NeurIPS, 2019.
Giorgio Patrini, Alessandro Rozza, Aditya K. Menon, Richard Nock, and Lizhen Qu. Making deep neural networks robust to label noise: A loss correction approach. In CVPR, 2017.
Herbert Robbins and Sutton Monro. A stochastic approximation method. Annals of Mathematical Statistics, 22:400–407, 1951.
Rebecca Roelofs, Vaishaal Shankar, Benjamin Recht, Sara Fridovich-Keil, Moritz Hardt, John Miller, and Ludwig Schmidt. A meta-analysis of overfitting in machine learning. In NeurIPS, 2019.
Bobak Shahriari, Kevin Swersky, Ziyu Wang, Ryan P. Adams, and Nando de Freitas. Taking the human out of the loop: A review of Bayesian optimization. Proceedings of the IEEE, 104:148–175, 2016.

Connor Shorten and Taghi M. Khoshgoftaar. A survey on image data augmentation for deep learning. Journal of Big Data, 6, 2019.
Jasper Snoek, Hugo Larochelle, and Ryan P. Adams. Practical Bayesian optimization of machine learning algorithms. In NeurIPS, 2012.
Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15:1929–1958, 2014.
Masashi Sugiyama. Introduction to Statistical Machine Learning. Morgan Kaufmann, 2015.
Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, and Jon Shlens. Rethinking the inception architecture for computer vision. In CVPR, 2016.
Sunil Thulasidasan, Gopinath Chennupati, Jeff Bilmes, Tanmoy Bhattacharya, and Sarah Michalak. On mixup training: Improved calibration and predictive uncertainty for deep neural networks. In NeurIPS, 2019.
Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58:267–288, 1996.
A. N. Tikhonov and V. Y. Arsenin. Solutions of Ill Posed Problems. Winston, 1977.
Andrey Nikolayevich Tikhonov. On the stability of inverse problems. Doklady Akademii Nauk SSSR, 39:195–198, 1943.
Antonio Torralba, Rob Fergus, and William T. Freeman. 80 million tiny images: A large data set for nonparametric object and scene recognition. In IEEE Trans. PAMI, 2008.
Brendan van Rooyen and Robert C. Williamson. A theory of learning with corrupted labels. Journal of Machine Learning Research, 18:1–50, 2018.
Vikas Verma, Alex Lamb, Juho Kannala, Yoshua Bengio, and David Lopez-Paz. Interpolation consistency training for semi-supervised learning. In IJCAI, 2019.
Stefan Wager, Sida Wang, and Percy Liang. Dropout training as adaptive regularization. In NeurIPS, 2013.
Roman Werpachowski, András György, and Csaba Szepesvári. Detecting overfitting via adversarial examples. In NeurIPS, 2019.
Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.
Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. In ICLR, 2017.
Hongyi Zhang, Moustapha Cisse, Yann N. Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. In ICLR, 2018.
Tong Zhang. Statistical behavior and consistency of classification methods based on convex risk minimization. The Annals of Statistics, 32:56–85, 2004.


A Proof of Theorem

Proof. If the flooding level is b, then the proposed flooding estimator is

\tilde{R}(g) = |\hat{R}(g) - b| + b.    (12)

Since the absolute operator can be expressed with a max operator, \max(a, b) = \frac{a + b + |a - b|}{2}, the proposed estimator can be re-expressed as

\tilde{R}(g) = 2\max(\hat{R}(g), b) - \hat{R}(g) = A - \hat{R}(g).    (13)

For convenience, we used A = 2\max(\hat{R}(g), b). From the definition of MSE,

\mathrm{MSE}(\hat{R}(g)) = \mathbb{E}[(\hat{R}(g) - R(g))^2],    (14)
\mathrm{MSE}(\tilde{R}(g)) = \mathbb{E}[(\tilde{R}(g) - R(g))^2]    (15)
  = \mathbb{E}[(A - \hat{R}(g) - R(g))^2]    (16)
  = \mathbb{E}[A^2] - 2\mathbb{E}[A(\hat{R}(g) + R(g))] + \mathbb{E}[(\hat{R}(g) + R(g))^2].    (17)

We are interested in the sign of

\mathrm{MSE}(\hat{R}(g)) - \mathrm{MSE}(\tilde{R}(g)) = \mathbb{E}[-4\hat{R}(g)R(g) - A^2 + 2A(\hat{R}(g) + R(g))].    (18)

Define the inside of the expectation as B = -4\hat{R}(g)R(g) - A^2 + 2A(\hat{R}(g) + R(g)). B can be divided into two cases, depending on the outcome of the max operator:

B = \begin{cases} -4\hat{R}(g)R(g) - 4\hat{R}(g)^2 + 4\hat{R}(g)(\hat{R}(g) + R(g)) & \text{if } \hat{R}(g) \ge b \\ -4\hat{R}(g)R(g) - 4b^2 + 4b(\hat{R}(g) + R(g)) & \text{if } \hat{R}(g) < b \end{cases}    (19)

  = \begin{cases} 0 & \text{if } \hat{R}(g) \ge b \\ -4\hat{R}(g)R(g) - 4b^2 + 4b(\hat{R}(g) + R(g)) & \text{if } \hat{R}(g) < b \end{cases}    (20)

  = \begin{cases} 0 & \text{if } \hat{R}(g) \ge b \\ -4(b - \hat{R}(g))(b - R(g)) & \text{if } \hat{R}(g) < b. \end{cases}    (21)

The latter case becomes positive when \hat{R}(g) < b < R(g). Therefore, when \hat{R}(g) < b < R(g),

\mathrm{MSE}(\hat{R}(g)) - \mathrm{MSE}(\tilde{R}(g)) > 0,    (22)
\mathrm{MSE}(\hat{R}(g)) > \mathrm{MSE}(\tilde{R}(g)).    (23)

When b \le \hat{R}(g),

\mathrm{MSE}(\hat{R}(g)) - \mathrm{MSE}(\tilde{R}(g)) = 0,    (24)
\mathrm{MSE}(\hat{R}(g)) = \mathrm{MSE}(\tilde{R}(g)).    (25)


B Bayes Risk for Gaussian Distributions

In this section, we explain in detail how we derived the Bayes risk with respect to the surrogate loss in the experiments with Gaussian data in Section 4.1. Since we are using the logistic loss in the synthetic experiments, the loss of the margin is

\ell(y g(x)) = \log(1 + \exp(-y g(x))),    (26)

where g(x): R^d → R is a scalar, instead of the vector definition that was used previously, because the synthetic experiments only consider binary classification. Take the derivative to derive

\frac{\partial \mathbb{E}[\ell(y g(x))]}{\partial g(x)} = \frac{\partial \mathbb{E}[\log(1 + \exp(-y g(x)))]}{\partial g(x)} = \mathbb{E}\left[ \frac{-y \exp(-y g(x))}{1 + \exp(-y g(x))} \,\middle|\, x \right] p(x)    (27)

  = \mathbb{E}\left[ \frac{-y}{1 + \exp(y g(x))} \,\middle|\, x \right] p(x)    (28)

  = \mathbb{E}\left[ \frac{y + 1}{2} \cdot \frac{-1}{\exp(g(x)) + 1} + \frac{y - 1}{2} \cdot \frac{-1}{\exp(-g(x)) + 1} \,\middle|\, x \right] p(x)    (29)

  = \mathbb{E}\left[ \frac{y + 1}{2} \,\middle|\, x \right] \frac{-1}{\exp(g(x)) + 1} p(x) + \mathbb{E}\left[ \frac{y - 1}{2} \,\middle|\, x \right] \frac{-1}{\exp(-g(x)) + 1} p(x)    (30)

  = -p(y = +1 \mid x) \frac{1}{\exp(g(x)) + 1} p(x) + p(y = -1 \mid x) \frac{1}{\exp(-g(x)) + 1} p(x).    (31)

Set this to zero and divide by p(x) > 0 to obtain

p(y = -1 \mid x) \frac{1}{\exp(-g(x)) + 1} = p(y = +1 \mid x) \frac{1}{\exp(g(x)) + 1},    (32)

\exp(g(x)) = \frac{p(y = +1 \mid x)}{p(y = -1 \mid x)},    (33)

g(x) = \log \frac{p(y = +1 \mid x)}{p(y = -1 \mid x)}.    (34)

Since we are interested in the surrogate loss under this classifier, we plug this into the logistic loss to obtain the Bayes risk:

\mathbb{E}[\ell(y g(x))] = \mathbb{E}\left[ \log\left( 1 + \frac{p(-y \mid x)}{p(y \mid x)} \right) \right] = \mathbb{E}\left[ \log \frac{1}{p(y \mid x)} \right] = \mathbb{E}[-\log p(y \mid x)].    (35)

In the experiments in Section 4.1, we report the empirical version of this with the test dataset as the Bayes risk.
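As a sanity check of this derivation, here is our own Monte Carlo sketch (not from the paper) that estimates E[-log p(y|x)] for the Two Gaussians setup of Section 4.1, assuming equal class priors; it should approximately reproduce the reported Bayes risks (about 0.14 for m = 1.0 and 0.24 for m = 0.8), up to Monte Carlo error.

import numpy as np

def bayes_risk_logistic(m, d=10, n=200_000, seed=0):
    rng = np.random.default_rng(seed)
    y = rng.choice([1, -1], size=n)                        # equal class priors assumed
    means = np.where(y[:, None] == 1, 0.0, m)              # mu_P = 0, mu_N = m * ones(d)
    x = means + rng.standard_normal((n, d))
    # Posterior log-odds log p(y=+1|x)/p(y=-1|x) for identity-covariance Gaussians, cf. Eq. (34).
    log_odds = -m * x.sum(axis=1) + 0.5 * d * m ** 2
    # -log p(y|x) equals the logistic loss of the Bayes classifier, cf. Eq. (35).
    return np.mean(np.logaddexp(0.0, -y * log_odds))

print(bayes_risk_logistic(1.0))   # roughly 0.14
print(bayes_risk_logistic(0.8))   # roughly 0.24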


C Details of Experiments

C.1 Benchmark Datasets

In the experiments in Section 4.2, we use six benchmark datasets explained below.

• MNIST6 [Lecun et al., 1998] is a 10 class dataset of handwritten digits 1, 2, ..., 9 and 0. Each sample is a 28 × 28 grayscale image. The number of training and test samples are 60,000 and 10,000, respectively.

• Fashion-MNIST7 [Xiao et al., 2017] is a 10 class dataset of fashion items: T-shirt/top, Trouser, Pullover, Dress, Coat, Sandal, Shirt, Sneaker, Bag, and Ankle boot. Each sample is a 28 × 28 grayscale image. The number of training and test samples are 60,000 and 10,000, respectively.

• Kuzushiji-MNIST8 [Clanuwat et al., 2018] is a 10 class dataset of cursive Japanese ("Kuzushiji") characters. Each sample is a 28 × 28 grayscale image. The number of training and test samples are 60,000 and 10,000, respectively.

• CIFAR-109 is a 10 class dataset of various objects: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck. Each sample is a colored image in 32 × 32 × 3 RGB format. It is a subset of the 80 million tiny images dataset [Torralba et al., 2008]. There are 6,000 images per class, where 5,000 are for training and 1,000 are for test.

• CIFAR-10010 is a 100 class dataset of various objects. Each class has 600 samples, where 500 samples are for training and 100 samples are for test. This is also a subset of the 80 million tiny images dataset [Torralba et al., 2008].

• SVHN11 [Netzer et al., 2011] is a 10 class dataset of house numbers from Google Street View images, in 32 × 32 × 3 RGB format. 73,257 digits are for training and 26,032 digits are for testing.

6 http://yann.lecun.com/exdb/mnist/
7 https://github.com/zalandoresearch/fashion-mnist
8 https://github.com/rois-codh/kmnist
9 https://www.cs.toronto.edu/~kriz/cifar.html
10 https://www.cs.toronto.edu/~kriz/cifar.html
11 http://ufldl.stanford.edu/housenumbers/


C.2 Chosen Flooding Levels

In Table 4, we report the chosen flooding levels for our experiments with benchmark datasets.

Table 4: The chosen flooding levels for benchmark experiments.

                         Early stopping:   ×      ✓      ×      ✓
                         Weight decay:     ×      ×      ✓      ✓
                         Flooding:         ✓      ✓      ✓      ✓
MNIST (0.8)                                0.02   0.02   0.03   0.02
MNIST (0.4)                                0.02   0.03   0.00   0.02
Fashion-MNIST (0.8)                        0.00   0.00   —      —
Fashion-MNIST (0.4)                        0.00   0.00   —      —
Kuzushiji-MNIST (0.8)                      0.01   0.03   0.03   0.03
Kuzushiji-MNIST (0.4)                      0.03   0.02   0.04   0.03
CIFAR-10 (0.8)                             0.04   0.04   0.00   0.01
CIFAR-10 (0.4)                             0.03   0.03   0.00   0.00
CIFAR-100 (0.8)                            0.02   0.02   0.00   0.00
CIFAR-100 (0.4)                            0.03   0.03   0.00   0.00
SVHN (0.8)                                 0.01   0.01   0.00   0.02
SVHN (0.4)                                 0.01   0.09   0.03   0.03

D Learning Curves

In Figure 4, we visualize learning curves for all datasets (including CIFAR-10, which we already visualized in Figure 2). We only show the learning curves for training/validation proportion of 0.8, since the results for 0.4 were similar with 0.8. Note that Figure 1(c) shows the learning curves for the first 80 epochs for CIFAR-10 without flooding. Figure 1(d) shows the learning curves with flooding when the flooding level is 0.18.


[Figure 4 here: learning-curve panels for MNIST (0.00, 0.01, 0.02, 0.03), Fashion-MNIST (0.00, 0.02, 0.05, 0.10), Kuzushiji-MNIST (0.00, 0.02, 0.05, 0.10), CIFAR-10 (0.00, 0.03, 0.07, 0.20), CIFAR-100 (0.00, 0.01, 0.06, 0.20), and SVHN (0.00, 0.01, 0.06, 0.20).]

Figure 4: Learning curves of training and test loss. The first figure in each row is the learning curves without flooding. The 2nd, 3rd, and 4th columns show the results with different flooding levels. The flooding level increases towards the right-hand side.

E Relationship between Performance and Gradients

Settings. We visualize the relationship between test performance (loss or accuracy) and gradient amplitude of the training/test loss in Figure 5, where the gradient amplitude is the ℓ2 norm of the filter-normalized gradient of the loss. The filter-normalized gradient is the gradient appropriately scaled depending on the magnitude of the corresponding convolutional filter, similarly to Li et al. [2018]. More specifically, for each filter of every convolutional layer, we multiply the corresponding elements of the gradient by the norm of the filter. Note that a fully connected layer is a special case of convolutional layer and subject to this scaling. We exclude Fashion-MNIST because the optimal flooding level was zero. We used the results with training/validation split ratio of 0.8.
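To make the scaling concrete, here is a hypothetical PyTorch sketch of how such a filter-normalized gradient amplitude could be computed (our own reading of the description above, not the authors' code): for every filter, the gradient block is multiplied by the norm of that filter, and the ℓ2 norm of all scaled blocks is returned.

import torch
import torch.nn as nn

def filter_normalized_grad_norm(model):
    """l2 norm of the gradient after scaling each filter's gradient block by that filter's norm."""
    total = 0.0
    for module in model.modules():
        if isinstance(module, (nn.Conv2d, nn.Linear)) and module.weight.grad is not None:
            w, g = module.weight.detach(), module.weight.grad
            w_flat, g_flat = w.view(w.size(0), -1), g.view(g.size(0), -1)
            scaled = g_flat * w_flat.norm(dim=1, keepdim=True)   # one "filter" per output unit/channel
            total += (scaled ** 2).sum().item()
    return total ** 0.5

Calling such a helper right after the backward pass would give the horizontal-axis value for one epoch's point in Figure 5, up to any differences from the authors' exact implementation.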

Results. For the figures with gradient amplitude of training loss on the horizontal axis, the marks for the method with flooding are often plotted on the right of the "+" marks (without flooding), which implies that flooding prevents the model from staying at a local minimum. For the figures with gradient amplitude of test loss on the horizontal axis, we can observe that the method with flooding improves performance while the gradient amplitude becomes smaller. On the other hand, the performance with the method without flooding ("+") degenerates while the gradient amplitude of the test loss keeps increasing.


[Figure 5 here: panels for MNIST, Kuzushiji-MNIST, CIFAR-10, CIFAR-100, and SVHN, with the horizontal axis being the gradient amplitude of the training or test loss and the vertical axis being the test loss or test accuracy.]

Figure 5: Relationship between test performance (loss or accuracy) and amplitude of gradient (with training or test loss). Each point in the figures corresponds to a single model at a certain epoch. We remove the first 5 epochs and plot the rest. "+" is used for the method without flooding, and a separate mark is used for the method with flooding. The large black marks show the epochs with early stopping. The color becomes lighter (purple → yellow) as the training proceeds. K-M, C10, and C100 stand for Kuzushiji-MNIST, CIFAR-10, and CIFAR-100.


F Flatness

Settings. We follow Li et al. [2018] and give a one-dimensional visualization of flatness for each dataset in Figure 6. We exclude Fashion-MNIST because the optimal flooding level was zero. We used the results with training/validation split ratio of 0.8. We compare the flatness of the model right after the empirical risk with respect to a mini-batch becomes smaller than the flooding level, R̂_m(g) < b, for the first time (dotted blue line), and the model after training (solid blue line). We also compare them with the model trained by the baseline method without flooding, after training is finished (solid red line).
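To illustrate what such a one-dimensional visualization could look like in code, here is a hypothetical sketch (our reading of Li et al. [2018], not the authors' implementation) that evaluates the loss along a single filter-normalized random direction around the current parameters; model, loss_fn, and data_loader are assumed to be defined by the user, and one-dimensional parameters such as biases are left unnormalized for simplicity.

import torch

def loss_along_random_direction(model, loss_fn, data_loader, alphas, seed=0):
    """Loss at parameters theta + alpha * d for a filter-normalized random direction d."""
    torch.manual_seed(seed)
    base = [p.detach().clone() for p in model.parameters()]
    direction = []
    for p in base:
        d = torch.randn_like(p)
        if p.dim() > 1:  # rescale each filter of d to match the corresponding filter norm of p
            d_flat, p_flat = d.view(d.size(0), -1), p.view(p.size(0), -1)
            scale = p_flat.norm(dim=1, keepdim=True) / (d_flat.norm(dim=1, keepdim=True) + 1e-10)
            d = (d_flat * scale).view_as(p)
        direction.append(d)
    losses = []
    with torch.no_grad():
        for alpha in alphas:
            for p, b, d in zip(model.parameters(), base, direction):
                p.copy_(b + alpha * d)
            total, count = 0.0, 0
            for inputs, labels in data_loader:
                total += loss_fn(model(inputs), labels).item() * inputs.size(0)
                count += inputs.size(0)
            losses.append(total / count)
        for p, b in zip(model.parameters(), base):   # restore the original parameters
            p.copy_(b)
    return losses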

Results. According to Figure 6, the test loss becomes lower and more flat during the training with flooding. Note that the training loss, on the other hand, continues to float around the flooding level until the end of training after it enters the flooding zone. We expect that the model makes a random walk and escapes regions with sharp loss landscapes during the period. This may be a possible reason for better generalization results with our proposed method.


[Figure 6 here: train/test panels for MNIST, Kuzushiji-MNIST, CIFAR-10, CIFAR-100, and SVHN.]

Figure 6: One-dimensional visualization of flatness. We visualize the training/test loss with respect to perturbation. We depict the results for 3 models: the model when the empirical risk with respect to training data is below the flooding level for the first time during training (dotted blue), the model at the end of training with flooding (solid blue), and the model at the end of training without flooding (solid red).


Page 2: Do We Need Zero Training Loss A›er Achieving Zero Training … · 2020. 2. 21. · directly aim to avoid zero training loss, they o›en fail to maintain a moderate level of training

(a) w/o Flooding (b) w/ Flooding (c) CIFAR-10 w/o Flooding (d) CIFAR-10 w/ Flooding
[Panels (a)-(b) are schematic learning curves of training and test loss over epochs, marking regions [A], [B], [C] and the flooded area; panels (c)-(d) show the corresponding CIFAR-10 curves.]

Figure 1: (a) shows 3 different concepts related to overfitting. [A] shows the generalization gap increases, while training & test losses decrease. [B] also shows the increasing gap, but the test loss starts to rise. [C] shows the training loss becoming (near-)zero. We avoid [C] by flooding the bottom area, visualized in (b), which forces the training loss to stay around a constant. This leads to a decreasing test loss once again. We confirm these claims in experiments with CIFAR-10 shown in (c)-(d).

even after the model has memorized [Arpit et al., 2017, Belkin et al., 2018, Zhang et al., 2017] the training data completely with zero error, the training loss can easily become (near-)zero, especially with overparametrized models. Recent works on overparametrization and double descent curves [Belkin et al., 2019, Nakkiran et al., 2020] have shown that learning until zero training error is meaningful to achieve a lower generalization error. However, whether zero training loss is necessary after achieving zero training error remains an open issue.

In this paper, we propose a method to make the training loss float around a small constant value, in order to prevent the training loss from approaching zero. This is analogous to flooding the bottom area with water, and we refer to the constant value as the flooding level. Note that even if we add flooding, we can still memorize the training data. Our proposal only forces the training loss to become positive, which does not necessarily mean the training error will become positive, as long as the flooding level is not too large. The idea of flooding is shown in Fig. 1(b), and we show learning curves before and after flooding with benchmark experiments in Fig. 1(c) and Fig. 1(d).¹

Algorithm and implementation. Our algorithm of flooding is surprisingly simple. If the original learning objective is J, the proposed modified learning objective J̃ with flooding is

    \tilde{J}(\theta) = |J(\theta) - b| + b,    (1)

where b > 0 is the flooding level specified by the user, and θ is the model parameter.

The gradient of J̃ w.r.t. θ will point in the same direction as that of J(θ) when J(θ) > b, but in the opposite direction when J(θ) < b. This means that when the learning objective is above the flooding level, there is a "gravity" effect with gradient descent, but when the learning objective is below the flooding level, there is a "buoyancy" effect with gradient ascent. In practice, this will be performed with a mini-batch, and will be compatible with any stochastic optimizers. It can also be used along with other regularization methods.

¹ For the details of these experiments, see Appendix D.

During flooding, the training loss will repeat going below and above the flooding level. The model will continue to "random walk" with the same non-zero training loss, and we expect it to drift into an area with a flat loss landscape that leads to better generalization [Chaudhari et al., 2017, Keskar et al., 2017, Li et al., 2018].²

Since it is a simple solution, this modification can be incorporated into existing machine learning code easily: add one line of code for Eq. (1) after evaluating the original objective function J(θ). A minimal working example with a mini-batch in PyTorch [Paszke et al., 2019] is demonstrated below to show the additional one line of code.

    outputs = model(inputs)
    loss = criterion(outputs, labels)
    flood = (loss - b).abs() + b  # This is it!
    optimizer.zero_grad()
    flood.backward()
    optimizer.step()

It may be hard to set the flooding level without expert knowledge on the domain or task. We can circumvent this situation easily by treating the flooding level as a hyper-parameter. We may use a naive search, which exhaustively evaluates the accuracy for the predefined hyper-parameter candidates with a validation dataset. This procedure can be performed in parallel, as sketched below.
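The following is a minimal sketch of such a naive search; it is our own illustration, not the authors' released code. The helper train_and_validate is a hypothetical stand-in for one full training run with a given flooding level followed by evaluation on the validation set, and here it only returns a dummy score so that the skeleton runs end to end. Because the candidate runs are independent, they parallelize trivially.

    from multiprocessing import Pool
    import random


    def train_and_validate(b):
        # Placeholder: stands in for one full training run with flooding level `b`
        # followed by evaluation on validation data. Returns a dummy score here
        # so that the search skeleton is runnable as-is.
        random.seed(int(b * 100))
        return random.random()


    if __name__ == "__main__":
        candidates = [round(0.01 * i, 2) for i in range(21)]   # b in {0.00, 0.01, ..., 0.20}
        with Pool() as pool:                                   # independent runs, done in parallel
            accuracies = pool.map(train_and_validate, candidates)
        best = max(range(len(candidates)), key=lambda i: accuracies[i])
        print(f"chosen flooding level: {candidates[best]} (val. acc. {accuracies[best]:.3f})")

In a real experiment, train_and_validate would train the model with the flooded objective |J - b| + b and report validation accuracy.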

Previous regularization methods. Many previous regularization methods also aim at avoiding training too much in various ways, e.g., restricting the parameter norm to become small by decaying the parameter weights [Hanson and Pratt, 1988], raising the difficulty of training by dropping activations of neural networks [Srivastava et al., 2014], smoothing the training labels [Szegedy et al., 2016], or simply stopping training at an earlier phase [Morgan and Bourlard, 1990]. These methods can be considered as indirect ways to control the training loss, by also introducing additional assumptions, e.g., that the optimal model parameters are close to zero. Although making the regularization effect stronger would make it harder for the training loss to approach zero, it is still hard to maintain the right level of training loss till the end of training. In fact, for overparametrized deep networks, applying a small regularization parameter would not stop the training loss from becoming (near-)zero, making it even harder to choose a hyper-parameter that corresponds to a specific level of loss.

Flooding, on the other hand, is a direct solution to the issue that the training loss becomes (near-)zero. Flooding intentionally prevents further reduction of the training loss when it reaches a reasonably small value, and the flooding level corresponds to the level of training loss that the user wants to keep.

Contributions. Our proposed regularizer, called flooding, makes the training loss float around a small constant value, instead of making it head towards zero loss. Flooding is a regularizer that is domain-, task-, and model-independent. Theoretically, we find that the mean squared error can be reduced with flooding under certain conditions. Not only do we show test accuracy improving after flooding, we also observe that, even after we avoid zero training loss, memorization with zero training error still takes place.

² In Appendix F, we show that during this period of random walk, there is an increase in flatness of the loss function.

2 Backgrounds

In this section, we review regularization methods (summarized in Table 1), recent works on overparametrization and double descent curves, and the area of weakly supervised learning, where techniques similar to flooding have been explored.

2.1 Regularization Methods

The name "regularization" dates back to at least Tikhonov regularization for the ill-posed linear least-squares problem [Tikhonov and Arsenin, 1977, Tikhonov, 1943]. One example is to modify X⊤X (where X is the design matrix) to become "regular" by adding a term to the objective function. ℓ2 regularization is a generalization of the above example and can be applied to non-linear models. These methods implicitly assume that the optimal model parameters are close to zero.

It is known that weight decay [Hanson and Pratt, 1988], dropout [Srivastava et al., 2014], and early stopping [Morgan and Bourlard, 1990] are equivalent to ℓ2 regularization under certain conditions [Bishop, 1995, Goodfellow et al., 2016, Loshchilov and Hutter, 2019, Wager et al., 2013], implying that there is a similar assumption on the optimal model parameters. There are other penalties based on different assumptions, such as ℓ1 regularization [Tibshirani, 1996], based on the sparsity assumption that the optimal model has only a few non-zero parameters.

Modern machine learning tasks are applied to complex problems where the optimal model parameters are not necessarily close to zero or may not be sparse, and it would be ideal if we can properly add regularization effects to the optimization stage without such assumptions. Our proposed method does not have assumptions on the optimal model parameters and can be useful for more complex problems.

More recently, "regularization" has further evolved to a more general meaning, including various methods that alleviate overfitting but do not necessarily have a step to regularize a singular matrix or add a regularization term to the objective function. For example, Goodfellow et al. [2016] define regularization as "any modification we make to a learning algorithm that is intended to reduce its generalization error but not its training error." In this paper, we adopt this broader meaning of "regularization."

Examples of the more general regularization category include mixup [Zhang et al., 2018] and data augmentation methods like cropping and flipping, or adjusting brightness or sharpness [Shorten and Khoshgoftaar, 2019]. These methods have been adopted in many papers to obtain state-of-the-art performance [Berthelot et al., 2019, Verma et al., 2019] and are becoming essential regularization tools for developing new systems. However, these regularization methods have the drawback of being domain-specific: they are designed for the vision domain and require some effort when applied to other domains [Guo et al., 2019, Thulasidasan et al., 2019]. Other regularizers, such as label smoothing [Szegedy et al., 2016], are used for problems with class labels

Table 1: Conceptual comparisons of various regularizers. "tr." stands for "training," "indep." stands for "independent," ✓ stands for yes, and × stands for no.

| Regularization and other methods | Target tr. loss | Domain indep. | Task indep. | Model indep. | Main assumption |
| --- | --- | --- | --- | --- | --- |
| ℓ2 regularization [Tikhonov, 1943] | × | ✓ | ✓ | ✓ | Optimal model params are close to 0 |
| Weight decay [Hanson and Pratt, 1988] | × | ✓ | ✓ | ✓ | Optimal model params are close to 0 |
| Early stopping [Morgan and Bourlard, 1990] | × | ✓ | ✓ | ✓ | Overfitting occurs in later epochs |
| ℓ1 regularization [Tibshirani, 1996] | × | ✓ | ✓ | ✓ | Optimal model has to be sparse |
| Dropout [Srivastava et al., 2014] | × | ✓ | ✓ | × | Weight scaling inference rule |
| Batch normalization [Ioffe and Szegedy, 2015] | × | ✓ | ✓ | × | Existence of internal covariate shift |
| Label smoothing [Szegedy et al., 2016] | × | ✓ | × | ✓ | True posterior is not a one-hot vector |
| Mixup [Zhang et al., 2018] | × | × | × | ✓ | Linear relationship between x and y |
| Image augment. [Shorten and Khoshgoftaar, 2019] | × | × | ✓ | ✓ | Input is invariant to the translations |
| Flooding (proposed method) | ✓ | ✓ | ✓ | ✓ | Learning until zero loss is harmful |

and are harder to use with regression or ranking, meaning they are task-specific. Batch normalization [Ioffe and Szegedy, 2015] and dropout [Srivastava et al., 2014] are designed for neural networks and are model-specific.

Although these regularization methods (both the special and general ones) already work well in practice and have become the de facto standard tools [Bishop, 2011, Goodfellow et al., 2016], we provide an alternative which is even more general in the sense that it is domain-, task-, and model-independent.

That being said, we want to emphasize that the most important difference between flooding and other regularization methods is whether it is possible to target a specific level of training loss other than zero. While flooding allows the user to choose the level of training loss directly, it is hard to achieve this with other regularizers.

2.2 Double Descent Curves with Overparametrization

Recently, there has been increasing attention on the phenomenon of "double descent," named by Belkin et al. [2019], to explain the two regimes of deep learning. The first one (underparametrized regime) occurs where the model complexity is small compared to the number of training samples, and the test error, as a function of model complexity, decreases with low model complexity but starts to increase after the model complexity is large enough. This follows the classical view of machine learning that excessive complexity leads to poor generalization. The second one (overparametrized regime) occurs when an even larger model complexity is considered. Then increasing the complexity only decreases test error, which leads to a double descent shape. The phase of decreasing test error often occurs after the training error becomes zero. This follows the modern view of machine learning that bigger models lead to better generalization.³

As far as we know, the discovery of double descent curves dates back to at least Krogh and Hertz [1992], where they theoretically showed the double descent phenomenon under a linear regression setup. Recent works [Belkin et al., 2019, Nakkiran et al., 2020] have shown empirically that a similar phenomenon can be observed with deep learning methods. Nakkiran et al. [2020] observed that the double descent curves can be shown not only as a function of model complexity, but also as a function of the epoch number.

³ https://www.eff.org/ai/metrics

We want to note a byproduct of our flooding method: we were able to produce the epoch-wise double descent curve for the test loss with about 100 epochs. Investigating the connection between our accelerated double descent curves and previous double descent curves [Belkin et al., 2019, Krogh and Hertz, 1992, Nakkiran et al., 2020] is out of the scope of this paper, but is an important future direction.

2.3 Lower-Bounding the Empirical Risk

Lower-bounding the empirical risk has been used in the area of weakly supervised learning. There was a common phenomenon where the empirical risk goes below zero [Kiryo et al., 2017], when an equivalent form of the risk expressed with the given weak supervision was alternatively used [Cid-Sueiro et al., 2014, du Plessis et al., 2014, 2015, Natarajan et al., 2013, Patrini et al., 2017, van Rooyen and Williamson, 2018]. A gradient ascent technique was used to force the empirical risk to become non-negative in Kiryo et al. [2017]. This idea has been generalized and applied to other weakly supervised settings [Han et al., 2018, Ishida et al., 2019, Lu et al., 2020].

Although we also set a lower bound on the empirical risk, the motivation is different. First, while Kiryo et al. [2017] and others aim to fix the negative empirical risk to become lower bounded by zero, our empirical risk already has a lower bound of zero. Instead, we are aiming to sink the original empirical risk by placing a positive lower bound. Second, the problem settings are different. Weakly supervised learning methods require certain loss corrections or sample corrections [Han et al., 2018] before the non-negative correction, but we work on the original empirical risk without any setting-specific modifications.

3 Flooding: How to Avoid Zero Training Loss

In this section, we propose our regularization method, flooding. Note that this section and the following sections only consider multi-class classification for simplicity.

3.1 Preliminaries

Consider input variable x ∈ ℝ^d and output variable y ∈ {1, ..., K}, where K is the number of classes. They follow an unknown joint probability distribution with density p(x, y). We denote the score function by g: ℝ^d → ℝ^K. For any test data point x0, our prediction of the output label will be given by ŷ0 = argmax_{z∈{1,...,K}} g_z(x0), where g_z(·) is the z-th element of g(·), and in case of a tie, argmax returns the largest argument. Let ℓ: ℝ^K × {1, 2, ..., K} → ℝ denote a loss function. ℓ can be the zero-one loss,

    \ell_{01}(v, z') = \begin{cases} 0 & \text{if } \operatorname{argmax}_{z \in \{1,\dots,K\}} v_z = z', \\ 1 & \text{otherwise}, \end{cases}    (2)

where v = (v1, ..., vK)⊤ ∈ ℝ^K, or a surrogate loss such as the softmax cross-entropy loss,

    \ell_{\mathrm{CE}}(v, z') = -\log \frac{\exp(v_{z'})}{\sum_{z \in \{1,\dots,K\}} \exp(v_z)}.    (3)

For a surrogate loss ℓ, we denote the classification risk by

    R(g) = \mathbb{E}_{p(x,y)}[\ell(g(x), y)],    (4)

where E_{p(x,y)}[·] is the expectation over (x, y) ~ p(x, y). We use R01(g) to denote Eq. (4) when ℓ = ℓ01, and call it the classification error.

The goal of multi-class classification is to learn g that minimizes the classification error R01(g). In optimization, we consider the minimization of the risk with an almost surely differentiable surrogate loss R(g) instead, to make the problem more tractable. Furthermore, since p(x, y) is usually unknown and there is no way to exactly evaluate R(g), we minimize its empirical version, calculated from the training data, instead:

    \hat{R}(g) = \frac{1}{n} \sum_{i=1}^{n} \ell(g(x_i), y_i),    (5)

where {(x_i, y_i)}_{i=1}^{n} are i.i.d. sampled from p(x, y). We call R̂ the empirical risk.

We would like to clarify some of the undefined terms used in the title and the introduction. The "training/test loss" is the empirical risk with respect to the surrogate loss function ℓ over the training/test data, respectively. We refer to the "training/test error" as the empirical risk with respect to ℓ01 over the training/test data, respectively (which is equal to one minus accuracy) [Zhang, 2004].
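As a concrete illustration of these definitions, the short NumPy sketch below (our own, not from the paper) evaluates the zero-one loss, the softmax cross-entropy loss, and the corresponding empirical risks on a toy batch of score vectors; the particular scores and labels are made up for the example.

    import numpy as np

    def zero_one_loss(v, z):
        # Eq. (2): 0 if the highest-scoring class equals the label z, else 1.
        return 0.0 if int(np.argmax(v)) == z else 1.0

    def cross_entropy_loss(v, z):
        # Eq. (3): softmax cross-entropy, computed in a numerically stable way.
        v = v - np.max(v)
        return float(-v[z] + np.log(np.sum(np.exp(v))))

    # Toy batch: n = 3 examples, K = 4 classes (made-up scores and labels).
    scores = np.array([[2.0, 0.1, -1.0, 0.3],
                       [0.2, 1.5, 0.1, -0.5],
                       [0.0, 0.0, 0.1, 3.0]])
    labels = [0, 2, 3]

    # Eq. (5): the empirical risk is the average loss over the examples.
    train_error = np.mean([zero_one_loss(v, z) for v, z in zip(scores, labels)])
    train_loss = np.mean([cross_entropy_loss(v, z) for v, z in zip(scores, labels)])
    print(f"training error (zero-one loss): {train_error:.2f}")
    print(f"training loss (cross-entropy) : {train_loss:.4f}")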

Finally, we formally define the Bayes risk as

    R^* = \inf_{h} R(h),    (6)

where the infimum is taken over all vector-valued functions h: ℝ^d → ℝ^K. The Bayes risk is often referred to as the Bayes error if the zero-one loss is used:

    \inf_{h} R_{01}(h).    (7)

3.2 Algorithm

With flexible models, R̂(g) w.r.t. a surrogate loss can easily become small, if not zero, as we mentioned in Section 1; see [C] in Fig. 1(a). We propose a method that "floods the bottom area and sinks the original empirical risk," as in Fig. 1(b), so that the empirical risk cannot go below the flooding level. More technically, if we denote the flooding level as b, our proposed training objective with flooding is a simple fix:

Definition 1. The flooded empirical risk is defined as⁴

    \tilde{R}(g) = |\hat{R}(g) - b| + b.    (8)

Note that when b = 0, then R̃(g) = R̂(g). The gradient of R̃(g) w.r.t. the model parameters will point in the same direction as that of R̂(g) when R̂(g) > b, but in the opposite direction when R̂(g) < b. This means that when the learning objective is above the flooding level, we perform gradient descent as usual (gravity zone), but when the learning objective is below the flooding level, we perform gradient ascent instead (buoyancy zone).

The issue is that, in general, we seldom know the optimal flooding level in advance. This issue can be mitigated by searching for the optimal flooding level b* with a hyper-parameter optimization technique. In practice, we can search for the optimal flooding level by performing the exhaustive search in parallel.

3.3 Implementation

For large scale problems, we can employ mini-batched stochastic optimization for efficient computation. Suppose that we have M disjoint mini-batch splits. We denote the empirical risk (5) with respect to the m-th mini-batch by R̂_m(g) for m ∈ {1, ..., M}. Then our mini-batched optimization performs gradient descent updates in the direction of the gradient of R̃_m(g). By the convexity of the absolute value function and Jensen's inequality, we have

    \tilde{R}(g) \le \frac{1}{M} \sum_{m=1}^{M} \left( |\hat{R}_m(g) - b| + b \right).    (9)

This indicates that mini-batched optimization will simply minimize an upper bound of the full-batch case with R̃(g).
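The sketch below (our own illustration, not from the paper) checks Eq. (9) numerically on made-up per-example losses: it compares the full-batch flooded risk with the average of the per-mini-batch flooded risks.

    import numpy as np

    rng = np.random.default_rng(0)
    b = 0.1                                    # flooding level
    losses = rng.uniform(0.0, 0.3, size=120)   # made-up per-example losses
    M = 4                                      # number of disjoint mini-batches
    batches = np.split(losses, M)

    full_batch_flooded = abs(losses.mean() - b) + b                        # left-hand side of Eq. (9)
    minibatch_flooded = np.mean([abs(B.mean() - b) + b for B in batches])  # right-hand side of Eq. (9)

    print(f"full-batch flooded risk: {full_batch_flooded:.4f}")
    print(f"mini-batch upper bound : {minibatch_flooded:.4f}")
    assert full_batch_flooded <= minibatch_flooded + 1e-12  # Jensen's inequality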

3.4 Theoretical Analysis

In the following theorem, we will show that the mean squared error (MSE) of the proposed risk estimator with flooding is smaller than that of the original risk estimator without flooding.

Theorem 1. Fix any measurable vector-valued function g. If the flooding level b satisfies R̂(g) < b < R(g), we have

    \mathrm{MSE}(\hat{R}(g)) > \mathrm{MSE}(\tilde{R}(g)).    (10)

If b ≤ R̂(g), we have

    \mathrm{MSE}(\hat{R}(g)) = \mathrm{MSE}(\tilde{R}(g)).    (11)

A proof is given in Appendix A. If we regard R̂(g) as the training loss and R(g) as the test loss, we would want b to be between those two for the MSE to improve.
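As an informal numerical illustration (our own sketch, not part of the paper's analysis), the code below simulates the regime suggested by this reading: the empirical risk R̂ systematically sits below the flooding level b, while the quantity we compare against, R, sits above it. The flooding operator of Eq. (8) is applied to R̂, and Monte Carlo estimates of the two MSEs are compared. All constants are made up for the simulation.

    import numpy as np

    rng = np.random.default_rng(1)
    R_true = 0.30      # the risk R(g) we compare against (made-up value)
    train_scale = 0.10 # typical size of the empirical (training) risk
    n, trials = 50, 100_000
    b = 0.20           # flooding level chosen between the two

    # Simulate per-example losses whose sample mean plays the role of R_hat.
    losses = rng.exponential(scale=train_scale, size=(trials, n))
    R_hat = losses.mean(axis=1)
    R_tilde = np.abs(R_hat - b) + b          # Eq. (8)

    mse_hat = np.mean((R_hat - R_true) ** 2)
    mse_tilde = np.mean((R_tilde - R_true) ** 2)
    print(f"MSE without flooding: {mse_hat:.5f}")
    print(f"MSE with flooding   : {mse_tilde:.5f}")  # smaller whenever R_hat < b < R holds often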

⁴ Strictly speaking, Eq. (1) is different from Eq. (8), since Eq. (1) can ignore constant terms of the original empirical risk. We will refer to Eq. (8) for the flooding operator for the rest of the paper.

4 Experiments

In this section, we show experimental results with synthetic and benchmark datasets. The implementation is based on PyTorch [Paszke et al., 2019] and demo code will be available.⁵ Experiments were carried out with NVIDIA GeForce GTX 1080 Ti, NVIDIA Quadro RTX 5000, and Intel Xeon Gold 6142.

4.1 Synthetic Experiments

The aim of our synthetic experiments is to study the behavior of flooding with a controlled setup. We use three types of synthetic data described below.

Two Gaussians Data: We perform binary classification with two 10-dimensional Gaussian distributions with identity covariance matrix and means μP = [0, 0, ..., 0]⊤ and μN = [m, m, ..., m]⊤, where m ∈ {0.8, 1.0}. The Bayes risks for m = 1.0 and m = 0.8 are 0.14 and 0.24, respectively, where proofs are shown in Appendix B. The training, validation, and test sample sizes are 25, 10,000, and 10,000 per class, respectively.

Sinusoid Data: The sinusoid data [Nakkiran et al., 2019] are generated as follows. We first draw input data points uniformly from the inside of a 2-dimensional ball of radius 1. Then we put class labels based on

    y = \operatorname{sign}(x^\top w + \sin(x^\top w')),

where w and w' are any two 2-dimensional vectors such that w ⊥ w'. The training, validation, and test sample sizes are 100, 100, and 20,000, respectively.

Spiral Data: The spiral data [Sugiyama, 2015] are two-dimensional synthetic data. Let θ⁺₁ = 0, θ⁺₂, ..., θ⁺_{n₊} = 4π be equally spaced n₊ points in the interval [0, 4π], and θ⁻₁ = 0, θ⁻₂, ..., θ⁻_{n₋} = 4π be equally spaced n₋ points in the interval [0, 4π]. Let positive and negative input data points be

    x^+_{i_+} = \theta^+_{i_+} [\cos(\theta^+_{i_+}), \sin(\theta^+_{i_+})]^\top + \tau \nu^+_{i_+},
    x^-_{i_-} = (\theta^-_{i_-} + \pi) [\cos(\theta^-_{i_-}), \sin(\theta^-_{i_-})]^\top + \tau \nu^-_{i_-},

for i₊ = 1, ..., n₊ and i₋ = 1, ..., n₋, where τ controls the magnitude of the noise, and ν⁺_i and ν⁻_i are i.i.d. distributed according to the two-dimensional standard normal distribution. Then we make data for classification by {(x_i, y_i)}_{i=1}^{n} = {(x^+_{i_+}, +1)}_{i_+=1}^{n_+} ∪ {(x^-_{i_-}, -1)}_{i_-=1}^{n_-}, where n = n₊ + n₋. The training, validation, and test sample sizes are 100, 100, and 10,000 per class, respectively.
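A NumPy sketch of these generators is given below; it is our own reading of the descriptions above, and the noise level τ and the label-flipping rate used for the label-noise settings are hypothetical placeholders, since their exact values are not restated here.

    import numpy as np

    rng = np.random.default_rng(0)

    def sinusoid_data(n, flip_prob=0.0):
        # Uniform points inside the unit disk, labeled by sign(x.w + sin(x.w')), with w ⊥ w'.
        r = np.sqrt(rng.uniform(0.0, 1.0, n))
        phi = rng.uniform(0.0, 2.0 * np.pi, n)
        x = np.stack([r * np.cos(phi), r * np.sin(phi)], axis=1)
        w, w_perp = np.array([1.0, 0.0]), np.array([0.0, 1.0])  # any orthogonal pair
        y = np.sign(x @ w + np.sin(x @ w_perp))
        flip = rng.uniform(size=n) < flip_prob                  # optional label noise
        y[flip] *= -1
        return x, y

    def spiral_data(n_pos, n_neg, tau=0.5):
        # Two interleaved spirals with Gaussian noise of magnitude tau (placeholder value).
        t_pos = np.linspace(0.0, 4.0 * np.pi, n_pos)
        t_neg = np.linspace(0.0, 4.0 * np.pi, n_neg)
        x_pos = t_pos[:, None] * np.stack([np.cos(t_pos), np.sin(t_pos)], axis=1)
        x_neg = (t_neg + np.pi)[:, None] * np.stack([np.cos(t_neg), np.sin(t_neg)], axis=1)
        x = np.vstack([x_pos, x_neg]) + tau * rng.standard_normal((n_pos + n_neg, 2))
        y = np.concatenate([np.ones(n_pos), -np.ones(n_neg)])
        return x, y

    x_sin, y_sin = sinusoid_data(100, flip_prob=0.05)
    x_spi, y_spi = spiral_data(100, 100)
    print(x_sin.shape, y_sin.shape, x_spi.shape, y_spi.shape)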

For Two Gaussians, we use a one-hidden-layer feedforward neural network with 500 units in the hidden layer, with the ReLU activation function [Nair and Hinton, 2010]. We train the network for 1,000 epochs with the logistic loss and vanilla gradient descent with learning rate of 0.05. The flooding level is chosen from b ∈ {0, 0.01, 0.02, ..., 0.40}. For Sinusoid and Spiral, we use a four-hidden-layer feedforward neural network with 500 units in the hidden layer, with the

⁵ https://github.com/takashiishida/flooding

Table 2: Experimental results for the synthetic data. Sub-table (A) shows the results without early stopping. Sub-table (B) shows the results with early stopping. The better method is shown in bold in each of sub-tables (A) and (B). "BR" stands for the Bayes risk. For Two Gaussians, the distance between the positive and negative distributions is larger for larger m. See the description in Section 4.1 for the details.

| Data | Setting | (A) Without Flooding | (A) With Flooding | (A) Chosen b | (B) Without Flooding | (B) With Flooding | (B) Chosen b |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Two Gaussians | m: 1.0, BR: 0.14 | 87.96 | 92.25 | 0.28 | 91.63 | 92.25 | 0.27 |
| Two Gaussians | m: 0.8, BR: 0.24 | 82.00 | 87.31 | 0.33 | 86.57 | 87.29 | 0.35 |
| Sinusoid | Label Noise 0.01 | 93.84 | 94.46 | 0.01 | 92.54 | 92.54 | 0.00 |
| Sinusoid | Label Noise 0.05 | 91.12 | 95.44 | 0.10 | 93.26 | 94.60 | 0.01 |
| Sinusoid | Label Noise 0.10 | 86.57 | 96.02 | 0.17 | 96.70 | 96.70 | 0.00 |
| Spiral | Label Noise 0.01 | 98.96 | 97.85 | 0.01 | 98.60 | 98.88 | 0.01 |
| Spiral | Label Noise 0.05 | 93.87 | 96.24 | 0.04 | 96.58 | 95.62 | 0.14 |
| Spiral | Label Noise 0.10 | 89.70 | 92.96 | 0.16 | 89.70 | 92.96 | 0.16 |

ReLU activation function [Nair and Hinton, 2010] and batch normalization [Ioffe and Szegedy, 2015]. We train the network for 500 epochs with the logistic loss and the Adam [Kingma and Ba, 2015] optimizer with mini-batch size 100 and learning rate of 0.001. The flooding level is chosen from b ∈ {0, 0.01, 0.02, ..., 0.20}. Note that training with b = 0 is identical to the baseline method without flooding. We report the test accuracy of the flooding level with the best validation accuracy. We first conduct experiments without early stopping, which means that the last epoch was chosen for all flooding levels. A sketch of this training procedure appears below.
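The following PyTorch sketch mirrors the Two Gaussians setup described above (one hidden layer of 500 ReLU units, logistic loss, full-batch gradient descent with learning rate 0.05, and the flooded objective). It is our own minimal reproduction rather than the authors' code, and the flooding level b is fixed to one candidate from the grid for brevity.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    torch.manual_seed(0)

    # Two 10-dimensional Gaussians (m = 1.0), 25 training samples per class.
    m, n_per_class = 1.0, 25
    x_pos = torch.randn(n_per_class, 10)        # positive class, mean 0
    x_neg = torch.randn(n_per_class, 10) + m    # negative class, mean m * 1
    x = torch.cat([x_pos, x_neg])
    y = torch.cat([torch.ones(n_per_class), -torch.ones(n_per_class)])

    model = nn.Sequential(nn.Linear(10, 500), nn.ReLU(), nn.Linear(500, 1))
    optimizer = torch.optim.SGD(model.parameters(), lr=0.05)  # vanilla gradient descent
    b = 0.28  # flooding level (one candidate from the grid {0, 0.01, ..., 0.40})

    for epoch in range(1000):
        out = model(x).squeeze(1)
        loss = F.softplus(-y * out).mean()   # logistic loss
        flood = (loss - b).abs() + b         # flooded objective, Eq. (1)
        optimizer.zero_grad()
        flood.backward()
        optimizer.step()

    with torch.no_grad():
        pred = (model(x).squeeze(1) > 0).float() * 2 - 1
        train_acc = pred.eq(y).float().mean().item()
    print(f"final training loss {loss.item():.3f} (floats around b = {b}), training accuracy {train_acc:.2f}")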

Results: The results are summarized in Table 2. It is worth noting that, for Two Gaussians, the chosen flooding level b is larger for the smaller distance between the two distributions, which is when the classification task is harder and the Bayes risk becomes larger, since the two distributions become less separated. We see similar tendencies for Sinusoid and Spiral data: a larger b was chosen for a larger flipping probability for label noise, which is expected to increase the Bayes risk. This implies a positive correlation between the optimal flooding level and the Bayes risk, as is also partially suggested by Theorem 1. Another interesting observation is that the chosen b is close to, but higher than, the Bayes risk for the Two Gaussians data. This may look inconsistent with Theorem 1. However, it makes sense to adopt a larger b with a stronger regularization effect that allows some bias as a trade-off for reducing the variance of the risk estimator. In fact, Theorem 1 does not deny the possibility that some b ≥ R(g) achieves even better estimation.

From (A) in Table 2, we can see that the method with flooding often improves test accuracy over the baseline method without flooding. As we mentioned in the introduction, it can be harmful to keep training a model until the end without flooding. However, with flooding, the model at the final epoch has good prediction performance according to the results, which implies that flooding

Table 3: Results with benchmark datasets. We report classification accuracy for all combinations of weight decay (✓ and ×), early stopping (✓ and ×), and flooding (✓ and ×). The second column shows the training/validation split used for the experiment. W stands for weight decay, E stands for early stopping, and F stands for flooding. "—" means that a flooding level of zero was optimal. "N/A" means that we skipped the experiments, because zero weight decay was optimal in the case without flooding. The best and equivalent results are shown in bold, comparing "with flooding" and "without flooding" for two columns with the same setting for W and E, e.g., the first and fifth columns out of the 8 columns. The best performing combination is highlighted.

| Dataset | tr/va split | W:× E:× F:× | W:× E:✓ F:× | W:✓ E:× F:× | W:✓ E:✓ F:× | W:× E:× F:✓ | W:× E:✓ F:✓ | W:✓ E:× F:✓ | W:✓ E:✓ F:✓ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MNIST | 0.8 | 98.32 | 98.30 | 98.51 | 98.42 | 98.46 | 98.53 | 98.50 | 98.48 |
| MNIST | 0.4 | 97.71 | 97.70 | 97.82 | 97.91 | 97.74 | 97.85 | — | 97.83 |
| Fashion-MNIST | 0.8 | 89.34 | 89.36 | N/A | N/A | — | — | N/A | N/A |
| Fashion-MNIST | 0.4 | 88.48 | 88.63 | 88.60 | 88.62 | — | — | — | — |
| Kuzushiji-MNIST | 0.8 | 91.63 | 91.62 | 91.63 | 91.71 | 92.40 | 92.12 | 92.11 | 91.97 |
| Kuzushiji-MNIST | 0.4 | 89.18 | 89.18 | 89.58 | 89.73 | 90.41 | 90.15 | 89.71 | 89.88 |
| CIFAR-10 | 0.8 | 73.59 | 73.36 | 73.65 | 73.57 | 73.06 | 73.44 | — | 74.41 |
| CIFAR-10 | 0.4 | 66.39 | 66.63 | 69.31 | 69.28 | 67.20 | 67.58 | — | — |
| CIFAR-100 | 0.8 | 42.16 | 42.33 | 42.67 | 42.45 | 42.50 | 42.36 | — | — |
| CIFAR-100 | 0.4 | 34.27 | 34.34 | 37.97 | 38.82 | 34.99 | 35.14 | — | — |
| SVHN | 0.8 | 92.38 | 92.41 | 93.20 | 92.99 | 92.78 | 92.79 | — | 93.42 |
| SVHN | 0.4 | 90.32 | 90.35 | 90.43 | 90.49 | 90.57 | 90.61 | 91.16 | 91.21 |

helps the late-stage training improve test accuracy.

We also conducted experiments with early stopping, meaning that we chose the model that recorded the best validation accuracy during training. The results are reported in sub-table (B) of Table 2. Compared with sub-table (A), we see that early stopping improves the baseline method without flooding well in many cases. This indicates that training longer without flooding was harmful in our experiments. On the other hand, the accuracy for flooding combined with early stopping is often close to that with early stopping, meaning that training until the end with flooding tends to be already as good as doing so with early stopping. The table shows that flooding often improves or retains the test accuracy of the baseline method without flooding even after deploying early stopping. Flooding does not hurt performance, but can be beneficial for methods used with early stopping.

(a) CIFAR-10 (0.00) (b) CIFAR-10 (0.03) (c) CIFAR-10 (0.07) (d) CIFAR-10 (0.20)

Figure 2: Learning curves of training and test loss for training/validation proportion of 0.8. (a) shows the learning curves without flooding. (b), (c), and (d) show the learning curves with different flooding levels.

4.2 Benchmark Experiments

We next perform experiments with benchmark datasets. Not only do we compare with the baseline without flooding, we also compare or combine with other general regularization methods, which are early stopping and weight decay.

Settings: We use the following six benchmark datasets: MNIST, Fashion-MNIST, Kuzushiji-MNIST, CIFAR-10, CIFAR-100, and SVHN. The details of the benchmark datasets can be found in Appendix C.1. We split the original training dataset into training and validation data with different proportions: 0.8 or 0.4 (meaning 80% or 40% was used for training and the rest was used for validation, respectively). We perform the exhaustive hyper-parameter search for the flooding level with candidates from {0.00, 0.01, ..., 0.20}. The number of epochs is 500. Stochastic gradient descent [Robbins and Monro, 1951] is used with learning rate of 0.1 and momentum of 0.9. For MNIST, Fashion-MNIST, and Kuzushiji-MNIST, we use a one-hidden-layer feedforward neural network with 500 units and the ReLU activation function [Nair and Hinton, 2010]. For CIFAR-10, CIFAR-100, and SVHN, we used ResNet-18 [He et al., 2016]. We do not use any data augmentation or manual learning rate decay. We deployed early stopping in the same way as in Section 4.1.

We first ran experiments with the following candidates for the weight decay rate: {1×10⁻⁵, 1×10⁻⁴, 4×10⁻⁴, 7×10⁻⁴, 1×10⁻³, 4×10⁻³, 7×10⁻³}. We choose the weight decay rate with the best validation accuracy for each dataset and each training/validation proportion. Then, fixing the weight decay to the chosen one, we ran experiments with flooding level candidates from {0, 0.01, ..., 0.20}, to investigate whether weight decay and flooding have complementary effects, or if adding weight decay will diminish the accuracy gain of flooding. A sketch of how the three regularizers are combined in one training run is given below.
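The sketch below (ours, not the authors' released code) shows how one benchmark-style run combines the three ingredients discussed above: weight decay through the optimizer, flooding through the modified loss, and early stopping by tracking the best validation accuracy. Random tensors stand in for the actual data loaders, and the concrete hyper-parameter values are just one point of the grids described in the text.

    import copy
    import torch
    import torch.nn as nn

    torch.manual_seed(0)

    # Dummy stand-ins for the real training/validation loaders (10-class toy data).
    x_tr, y_tr = torch.randn(256, 784), torch.randint(0, 10, (256,))
    x_va, y_va = torch.randn(128, 784), torch.randint(0, 10, (128,))

    model = nn.Sequential(nn.Linear(784, 500), nn.ReLU(), nn.Linear(500, 10))
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9,
                                weight_decay=1e-4)   # weight decay (one grid candidate)
    b = 0.03                                         # flooding level (one grid candidate)

    best_acc, best_state = 0.0, None
    for epoch in range(20):                          # 500 epochs in the actual experiments
        loss = criterion(model(x_tr), y_tr)
        flood = (loss - b).abs() + b                 # flooding
        optimizer.zero_grad()
        flood.backward()
        optimizer.step()

        with torch.no_grad():                        # early stopping: keep the best validation model
            val_acc = (model(x_va).argmax(1) == y_va).float().mean().item()
        if val_acc >= best_acc:
            best_acc, best_state = val_acc, copy.deepcopy(model.state_dict())

    model.load_state_dict(best_state)
    print(f"best validation accuracy: {best_acc:.3f}")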

Results: We show the results in Table 3 and the chosen flooding levels in Table 4 in Appendix C.2. We can observe that flooding gives better accuracy for most cases. We can also see that combining flooding with early stopping, or with both early stopping and weight decay, may lead to even better accuracy in some cases.

(a) w/o early stop (0.8) (b) w/o early stop (0.4) (c) w/ early stop (0.8) (d) w/ early stop (0.4)

Figure 3: We show that the optimal flooding level maintains memorization. The vertical axis shows the training accuracy. The horizontal axis shows the flooding level. We show results with and without early stopping, and for different training/validation splits with proportion of 0.8 or 0.4. The marks are placed on the flooding level that was chosen based on validation accuracy.

4.3 Memorization

Can we maintain memorization even after adding flooding? We investigate whether the trained model has zero training error (100% accuracy) for the flooding level that was chosen with the validation data. We show the results for all benchmark datasets and all training/validation splits with proportions 0.8 and 0.4. We also show the case without early stopping (choosing the last epoch) and with early stopping (choosing the epoch with the highest validation accuracy). The results are shown in Fig. 3.

All figures show downward curves, implying that the model will eventually give up on memorizing all training data as the flooding level becomes higher. A more interesting and important observation is the position of the optimal flooding level (the one chosen by validation accuracy, which is marked in the figures). We can observe that the marks are often plotted at zero error, and in some cases there is a mark on the highest flooding level that maintains zero error. These results are consistent with recent empirical works that imply that zero training error leads to lower generalization error [Belkin et al., 2019, Nakkiran et al., 2020], but we further demonstrate that zero training loss may be harmful under zero training error.

5 Conclusion

We proposed a novel regularization method called flooding that keeps the training loss floating around a small constant value, to avoid zero training loss. In our experiments, the optimal flooding level often maintained memorization of the training data with zero error. With flooding, we showed that the test accuracy will improve for various benchmark datasets, and theoretically showed that the mean squared error will be reduced under certain conditions.

As a byproduct, we were able to produce a double descent curve for the test loss with a relatively small number of epochs, e.g., in around 100 epochs, shown in Fig. 2 and Fig. 4 in Appendix D. An important future direction is to study the relationship between this and the double descent curves from previous works [Belkin et al., 2019, Krogh and Hertz, 1992, Nakkiran et al., 2020].

It would also be interesting to see if Bayesian optimization [Shahriari et al., 2016, Snoek et al., 2012] methods can be utilized to search for the optimal flooding level efficiently. We will investigate this direction in the future.

Acknowledgements

We thank Chang Xu, Genki Yamamoto, Kento Nozawa, Nontawat Charoenphakdee, Voot Tangkaratt, and Yoshihiro Nagano for the helpful discussions. TI was supported by the Google PhD Fellowship Program. MS and IY were supported by JST CREST Grant Number JPMJCR18A2 including the AIP challenge program, Japan.

References

Devansh Arpit, Stanisław Jastrzębski, Nicolas Ballas, David Krueger, Emmanuel Bengio, Maxinder S. Kanwal, Tegan Maharaj, Asja Fischer, Aaron Courville, Yoshua Bengio, and Simon Lacoste-Julien. A closer look at memorization in deep networks. In ICML, 2017.

Mikhail Belkin, Daniel Hsu, and Partha P. Mitra. Overfitting or perfect fitting? Risk bounds for classification and regression rules that interpolate. In NeurIPS, 2018.

Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machine-learning practice and the classical bias-variance trade-off. PNAS, 116:15850-15854, 2019.

David Berthelot, Nicholas Carlini, Ian Goodfellow, Nicolas Papernot, Avital Oliver, and Colin A. Raffel. MixMatch: A holistic approach to semi-supervised learning. In NeurIPS, 2019.

Christopher M. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer, 2011.

Christopher Michael Bishop. Regularization and complexity control in feed-forward networks. In ICANN, 1995.

Rich Caruana, Steve Lawrence, and C. Lee Giles. Overfitting in neural nets: Backpropagation, conjugate gradient, and early stopping. In NeurIPS, 2000.

Pratik Chaudhari, Anna Choromanska, Stefano Soatto, Yann LeCun, Carlo Baldassi, Christian Borgs, Jennifer Chayes, Levent Sagun, and Riccardo Zecchina. Entropy-SGD: Biasing gradient descent into wide valleys. In ICLR, 2017.

Jesus Cid-Sueiro, Darío García-García, and Raúl Santos-Rodríguez. Consistency of losses for learning from weak labels. In ECML-PKDD, 2014.

Tarin Clanuwat, Mikel Bober-Irizar, Asanobu Kitamoto, Alex Lamb, Kazuaki Yamamoto, and David Ha. Deep learning for classical Japanese literature. In NeurIPS Workshop on Machine Learning for Creativity and Design, 2018.

Marthinus Christoffel du Plessis, Gang Niu, and Masashi Sugiyama. Analysis of learning from positive and unlabeled data. In NeurIPS, 2014.

Marthinus Christoffel du Plessis, Gang Niu, and Masashi Sugiyama. Convex formulation for learning from positive and unlabeled data. In ICML, 2015.

Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.

Hongyu Guo, Yongyi Mao, and Richong Zhang. Augmenting data with mixup for sentence classification: An empirical study. In arXiv:1905.08941, 2019.

Bo Han, Gang Niu, Jiangchao Yao, Xingrui Yu, Miao Xu, Ivor Tsang, and Masashi Sugiyama. Pumpout: A meta approach to robust deep learning with noisy labels. In arXiv:1809.11008, 2018.

Stephen Jose Hanson and Lorien Y. Pratt. Comparing biases for minimal network construction with back-propagation. In NeurIPS, 1988.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.

Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.

Takashi Ishida, Gang Niu, Aditya Krishna Menon, and Masashi Sugiyama. Complementary-label learning for arbitrary losses and models. In ICML, 2019.

Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. In ICLR, 2017.

Diederik P. Kingma and Jimmy L. Ba. Adam: A method for stochastic optimization. In ICLR, 2015.

Ryuichi Kiryo, Gang Niu, Marthinus Christoffel du Plessis, and Masashi Sugiyama. Positive-unlabeled learning with non-negative risk estimator. In NeurIPS, 2017.

Anders Krogh and John A. Hertz. A simple weight decay can improve generalization. In NeurIPS, 1992.

Yann Lecun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. In Proceedings of the IEEE, pages 2278-2324, 1998.

Hao Li, Zheng Xu, Gavin Taylor, Christoph Studer, and Tom Goldstein. Visualizing the loss landscape of neural nets. In NeurIPS, 2018.

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In ICLR, 2019.

Nan Lu, Tianyi Zhang, Gang Niu, and Masashi Sugiyama. Mitigating overfitting in supervised classification from two unlabeled datasets: A consistent risk correction approach. In AISTATS, 2020.

N. Morgan and H. Bourlard. Generalization and parameter estimation in feedforward nets: Some experiments. In NeurIPS, 1990.

Vinod Nair and Geoffrey E. Hinton. Rectified linear units improve restricted Boltzmann machines. In ICML, 2010.

Preetum Nakkiran, Gal Kaplun, Dimitris Kalimeris, Tristan Yang, Benjamin L. Edelman, Fred Zhang, and Boaz Barak. SGD on neural networks learns functions of increasing complexity. In NeurIPS, 2019.

Preetum Nakkiran, Gal Kaplun, Yamini Bansal, Tristan Yang, Boaz Barak, and Ilya Sutskever. Deep double descent: Where bigger models and more data hurt. In ICLR, 2020.

Nagarajan Natarajan, Inderjit S. Dhillon, Pradeep K. Ravikumar, and Ambuj Tewari. Learning with noisy labels. In NeurIPS, 2013.

Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y. Ng. Reading digits in natural images with unsupervised feature learning. In NeurIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.

Andrew Y. Ng. Preventing "overfitting" of cross-validation data. In ICML, 1997.

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An imperative style, high-performance deep learning library. In NeurIPS, 2019.

Giorgio Patrini, Alessandro Rozza, Aditya K. Menon, Richard Nock, and Lizhen Qu. Making deep neural networks robust to label noise: A loss correction approach. In CVPR, 2017.

Herbert Robbins and Sutton Monro. A stochastic approximation method. Annals of Mathematical Statistics, 22:400-407, 1951.

Rebecca Roelofs, Vaishaal Shankar, Benjamin Recht, Sara Fridovich-Keil, Moritz Hardt, John Miller, and Ludwig Schmidt. A meta-analysis of overfitting in machine learning. In NeurIPS, 2019.

Bobak Shahriari, Kevin Swersky, Ziyu Wang, Ryan P. Adams, and Nando de Freitas. Taking the human out of the loop: A review of Bayesian optimization. Proceedings of the IEEE, 104:148-175, 2016.

Connor Shorten and Taghi M. Khoshgoftaar. A survey on image data augmentation for deep learning. Journal of Big Data, 6, 2019.

Jasper Snoek, Hugo Larochelle, and Ryan P. Adams. Practical Bayesian optimization of machine learning algorithms. In NeurIPS, 2012.

Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15:1929-1958, 2014.

Masashi Sugiyama. Introduction to Statistical Machine Learning. Morgan Kaufmann, 2015.

Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, and Jon Shlens. Rethinking the inception architecture for computer vision. In CVPR, 2016.

Sunil Thulasidasan, Gopinath Chennupati, Jeff Bilmes, Tanmoy Bhattacharya, and Sarah Michalak. On mixup training: Improved calibration and predictive uncertainty for deep neural networks. In NeurIPS, 2019.

Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58:267-288, 1996.

A. N. Tikhonov and V. Y. Arsenin. Solutions of Ill Posed Problems. Winston, 1977.

Andrey Nikolayevich Tikhonov. On the stability of inverse problems. Doklady Akademii Nauk SSSR, 39:195-198, 1943.

Antonio Torralba, Rob Fergus, and William T. Freeman. 80 million tiny images: A large data set for nonparametric object and scene recognition. In IEEE Trans. PAMI, 2008.

Brendan van Rooyen and Robert C. Williamson. A theory of learning with corrupted labels. Journal of Machine Learning Research, 18:1-50, 2018.

Vikas Verma, Alex Lamb, Juho Kannala, Yoshua Bengio, and David Lopez-Paz. Interpolation consistency training for semi-supervised learning. In IJCAI, 2019.

Stefan Wager, Sida Wang, and Percy Liang. Dropout training as adaptive regularization. In NeurIPS, 2013.

Roman Werpachowski, András György, and Csaba Szepesvári. Detecting overfitting via adversarial examples. In NeurIPS, 2019.

Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.

Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. In ICLR, 2017.

Hongyi Zhang, Moustapha Cisse, Yann N. Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. In ICLR, 2018.

Tong Zhang. Statistical behavior and consistency of classification methods based on convex risk minimization. The Annals of Statistics, 32:56-85, 2004.

A Proof of Theorem

Proof. If the flooding level is b, then the proposed flooding estimator is

    \tilde{R}(g) = |\hat{R}(g) - b| + b.    (12)

Since the absolute value operator can be expressed with a max operator as max(a, b) = (a + b + |a - b|)/2, the proposed estimator can be re-expressed as

    \tilde{R}(g) = 2\max(\hat{R}(g), b) - \hat{R}(g) = A - \hat{R}(g),    (13)

where, for convenience, we used A = 2\max(\hat{R}(g), b). From the definition of MSE,

    \mathrm{MSE}(\hat{R}(g)) = \mathbb{E}[(\hat{R}(g) - R(g))^2],    (14)
    \mathrm{MSE}(\tilde{R}(g)) = \mathbb{E}[(\tilde{R}(g) - R(g))^2]    (15)
    = \mathbb{E}[(A - \hat{R}(g) - R(g))^2]    (16)
    = \mathbb{E}[A^2] - 2\mathbb{E}[A(\hat{R}(g) + R(g))] + \mathbb{E}[(\hat{R}(g) + R(g))^2].    (17)

We are interested in the sign of

    \mathrm{MSE}(\hat{R}(g)) - \mathrm{MSE}(\tilde{R}(g)) = \mathbb{E}[-4\hat{R}(g)R(g) - A^2 + 2A(\hat{R}(g) + R(g))].    (18)

Define the inside of the expectation as B = -4\hat{R}(g)R(g) - A^2 + 2A(\hat{R}(g) + R(g)). B can be divided into two cases, depending on the outcome of the max operator:

    B = \begin{cases} -4\hat{R}(g)R(g) - 4\hat{R}(g)^2 + 4\hat{R}(g)(\hat{R}(g) + R(g)) & \text{if } \hat{R}(g) \ge b, \\ -4\hat{R}(g)R(g) - 4b^2 + 4b(\hat{R}(g) + R(g)) & \text{if } \hat{R}(g) < b \end{cases}    (19)

    = \begin{cases} 0 & \text{if } \hat{R}(g) \ge b, \\ -4\hat{R}(g)R(g) - 4b^2 + 4b(\hat{R}(g) + R(g)) & \text{if } \hat{R}(g) < b \end{cases}    (20)

    = \begin{cases} 0 & \text{if } \hat{R}(g) \ge b, \\ -4(b - \hat{R}(g))(b - R(g)) & \text{if } \hat{R}(g) < b. \end{cases}    (21)

The latter case becomes positive when \hat{R}(g) < b < R(g). Therefore, when \hat{R}(g) < b < R(g),

    \mathrm{MSE}(\hat{R}(g)) - \mathrm{MSE}(\tilde{R}(g)) > 0,    (22)
    \mathrm{MSE}(\hat{R}(g)) > \mathrm{MSE}(\tilde{R}(g)).    (23)

When b \le \hat{R}(g),

    \mathrm{MSE}(\hat{R}(g)) - \mathrm{MSE}(\tilde{R}(g)) = 0,    (24)
    \mathrm{MSE}(\hat{R}(g)) = \mathrm{MSE}(\tilde{R}(g)).    (25)

B Bayes Risk for Gaussian Distributions

In this section, we explain in detail how we derived the Bayes risk with respect to the surrogate loss in the experiments with Gaussian data in Section 4.1. Since we are using the logistic loss in the synthetic experiments, the loss as a function of the margin is

    \ell(y g(x)) = \log(1 + \exp(-y g(x))),    (26)

where g(x): ℝ^d → ℝ is a scalar instead of the vector definition that was used previously, because the synthetic experiments only consider binary classification. Take the derivative to derive

    \frac{\partial \mathbb{E}[\ell(y g(x))]}{\partial g(x)} = \frac{\partial \mathbb{E}[\log(1 + \exp(-y g(x)))]}{\partial g(x)} = \mathbb{E}\left[ \frac{-y \exp(-y g(x))}{1 + \exp(-y g(x))} \,\middle|\, x \right] p(x)    (27)

    = \mathbb{E}\left[ \frac{-y}{1 + \exp(y g(x))} \,\middle|\, x \right] p(x)    (28)

    = \mathbb{E}\left[ \frac{y + 1}{2} \cdot \frac{-1}{\exp(g(x)) + 1} + \frac{y - 1}{2} \cdot \frac{-1}{\exp(-g(x)) + 1} \,\middle|\, x \right] p(x)    (29)

    = \mathbb{E}\left[ \frac{y + 1}{2} \,\middle|\, x \right] \frac{-1}{\exp(g(x)) + 1}\, p(x) + \mathbb{E}\left[ \frac{y - 1}{2} \,\middle|\, x \right] \frac{-1}{\exp(-g(x)) + 1}\, p(x)    (30)

    = -p(y = +1 \mid x) \frac{1}{\exp(g(x)) + 1}\, p(x) + p(y = -1 \mid x) \frac{1}{\exp(-g(x)) + 1}\, p(x).    (31)

Setting this to zero and dividing by p(x) > 0, we obtain

    p(y = -1 \mid x) \frac{1}{\exp(-g(x)) + 1} = p(y = +1 \mid x) \frac{1}{\exp(g(x)) + 1},    (32)

    \exp(g(x)) = \frac{p(y = +1 \mid x)}{p(y = -1 \mid x)},    (33)

    g(x) = \log \frac{p(y = +1 \mid x)}{p(y = -1 \mid x)}.    (34)

Since we are interested in the surrogate loss under this classifier, we plug this into the logistic loss to obtain the Bayes risk:

    \mathbb{E}[\ell(y g(x))] = \mathbb{E}\left[ \log\left( 1 + \frac{p(-y \mid x)}{p(y \mid x)} \right) \right] = \mathbb{E}\left[ \log \frac{1}{p(y \mid x)} \right] = \mathbb{E}[-\log p(y \mid x)].    (35)

In the experiments in Section 4.1, we report the empirical version of this with the test dataset as the Bayes risk.
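For intuition, the short simulation below (our own sketch, not part of the paper) estimates this quantity, E[-log p(y|x)], for the Two Gaussians setup of Section 4.1 by Monte Carlo: with equal class priors and identity covariance, the posterior p(y = +1 | x) is a logistic function of a linear score, so the Bayes risk can be approximated by averaging -log p(y|x) over a large sample. The estimates should come out close to the values reported in Section 4.1 (about 0.14 for m = 1.0 and 0.24 for m = 0.8).

    import numpy as np

    def bayes_risk_two_gaussians(m, d=10, n=500_000, seed=0):
        rng = np.random.default_rng(seed)
        mu_p = np.zeros(d)       # positive class mean
        mu_n = np.full(d, m)     # negative class mean
        # Equal priors: sample labels, then inputs from the corresponding Gaussian.
        y = rng.choice([+1, -1], size=n)
        x = rng.standard_normal((n, d)) + np.where(y[:, None] == 1, mu_p, mu_n)
        # Log-likelihood ratio for identity covariance and equal priors:
        # log p(+1|x) - log p(-1|x) = x.(mu_p - mu_n) + (||mu_n||^2 - ||mu_p||^2) / 2
        llr = x @ (mu_p - mu_n) + (mu_n @ mu_n - mu_p @ mu_p) / 2.0
        # log p(y|x) for the observed label y, via log sigmoid(y * llr).
        log_p_y_given_x = -np.logaddexp(0.0, -y * llr)
        return -log_p_y_given_x.mean()   # Monte Carlo estimate of E[-log p(y|x)]

    for m in (1.0, 0.8):
        print(f"m = {m}: estimated Bayes risk ≈ {bayes_risk_two_gaussians(m):.3f}")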


C Details of Experiments

C.1 Benchmark Datasets

In the experiments in Section 4.2, we use six benchmark datasets, explained below.

• MNIST⁶ [Lecun et al., 1998] is a 10-class dataset of handwritten digits: 1, 2, ..., 9, and 0. Each sample is a 28 × 28 grayscale image. The number of training and test samples are 60,000 and 10,000, respectively.

• Fashion-MNIST⁷ [Xiao et al., 2017] is a 10-class dataset of fashion items: T-shirt/top, Trouser, Pullover, Dress, Coat, Sandal, Shirt, Sneaker, Bag, and Ankle boot. Each sample is a 28 × 28 grayscale image. The number of training and test samples are 60,000 and 10,000, respectively.

• Kuzushiji-MNIST⁸ [Clanuwat et al., 2018] is a 10-class dataset of cursive Japanese ("Kuzushiji") characters. Each sample is a 28 × 28 grayscale image. The number of training and test samples are 60,000 and 10,000, respectively.

• CIFAR-10⁹ is a 10-class dataset of various objects: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck. Each sample is a colored image in 32 × 32 × 3 RGB format. It is a subset of the 80 million tiny images dataset [Torralba et al., 2008]. There are 6,000 images per class, where 5,000 are for training and 1,000 are for test.

• CIFAR-100¹⁰ is a 100-class dataset of various objects. Each class has 600 samples, where 500 samples are for training and 100 samples are for test. This is also a subset of the 80 million tiny images dataset [Torralba et al., 2008].

• SVHN¹¹ [Netzer et al., 2011] is a 10-class dataset of house numbers from Google Street View images, in 32 × 32 × 3 RGB format. 73,257 digits are for training and 26,032 digits are for testing.

⁶ http://yann.lecun.com/exdb/mnist/
⁷ https://github.com/zalandoresearch/fashion-mnist
⁸ https://github.com/rois-codh/kmnist
⁹ https://www.cs.toronto.edu/~kriz/cifar.html
¹⁰ https://www.cs.toronto.edu/~kriz/cifar.html
¹¹ http://ufldl.stanford.edu/housenumbers

C.2 Chosen Flooding Levels

In Table 4, we report the chosen flooding levels for our experiments with benchmark datasets.

Table 4: The chosen flooding levels for benchmark experiments. Flooding is used in all four columns; the columns differ in whether early stopping (E) and weight decay (W) are used.

| Dataset (tr/va split) | E:× W:× | E:✓ W:× | E:× W:✓ | E:✓ W:✓ |
| --- | --- | --- | --- | --- |
| MNIST (0.8) | 0.02 | 0.02 | 0.03 | 0.02 |
| MNIST (0.4) | 0.02 | 0.03 | 0.00 | 0.02 |
| Fashion-MNIST (0.8) | 0.00 | 0.00 | — | — |
| Fashion-MNIST (0.4) | 0.00 | 0.00 | — | — |
| Kuzushiji-MNIST (0.8) | 0.01 | 0.03 | 0.03 | 0.03 |
| Kuzushiji-MNIST (0.4) | 0.03 | 0.02 | 0.04 | 0.03 |
| CIFAR-10 (0.8) | 0.04 | 0.04 | 0.00 | 0.01 |
| CIFAR-10 (0.4) | 0.03 | 0.03 | 0.00 | 0.00 |
| CIFAR-100 (0.8) | 0.02 | 0.02 | 0.00 | 0.00 |
| CIFAR-100 (0.4) | 0.03 | 0.03 | 0.00 | 0.00 |
| SVHN (0.8) | 0.01 | 0.01 | 0.00 | 0.02 |
| SVHN (0.4) | 0.01 | 0.09 | 0.03 | 0.03 |

D Learning Curves

In Figure 4, we visualize learning curves for all datasets (including CIFAR-10, which we already visualized in Figure 2). We only show the learning curves for training/validation proportion of 0.8, since the results for 0.4 were similar to those for 0.8. Note that Figure 1(c) shows the learning curves for the first 80 epochs for CIFAR-10 without flooding. Figure 1(d) shows the learning curves with flooding, when the flooding level is 0.18.

(a) MNIST (0.00) (b) MNIST (0.01) (c) MNIST (0.02) (d) MNIST (0.03)
(e) Fashion-MNIST (0.00) (f) Fashion-MNIST (0.02) (g) Fashion-MNIST (0.05) (h) Fashion-MNIST (0.10)
(i) KMNIST (0.00) (j) KMNIST (0.02) (k) KMNIST (0.05) (l) KMNIST (0.10)
(m) CIFAR-10 (0.00) (n) CIFAR-10 (0.03) (o) CIFAR-10 (0.07) (p) CIFAR-10 (0.20)
(q) CIFAR-100 (0.00) (r) CIFAR-100 (0.01) (s) CIFAR-100 (0.06) (t) CIFAR-100 (0.20)
(u) SVHN (0.00) (v) SVHN (0.01) (w) SVHN (0.06) (x) SVHN (0.20)

Figure 4: Learning curves of training and test loss. The first figure in each row shows the learning curves without flooding. The 2nd, 3rd, and 4th columns show the results with different flooding levels. The flooding level increases towards the right-hand side.

E Relationship between Performance and Gradients

Settings: We visualize the relationship between test performance (loss or accuracy) and gradient amplitude of the training/test loss in Figure 5, where the gradient amplitude is the ℓ2 norm of the filter-normalized gradient of the loss. The filter-normalized gradient is the gradient appropriately scaled depending on the magnitude of the corresponding convolutional filter, similarly to Li et al. [2018]. More specifically, for each filter of every convolutional layer, we multiply the corresponding elements of the gradient by the norm of the filter. Note that a fully connected layer is a special case of a convolutional layer and is subject to this scaling. We exclude Fashion-MNIST because the optimal flooding level was zero. We used the results with training/validation split ratio of 0.8. A code sketch of this filter normalization is given below.

Results: For the figures with gradient amplitude of training loss on the horizontal axis, the marks for the method with flooding are often plotted to the right of the "+" marks (without flooding), which implies that flooding prevents the model from staying at a local minimum. For the figures with gradient amplitude of test loss on the horizontal axis, we can observe that the method with flooding improves performance while the gradient amplitude becomes smaller. On the other hand, the performance of the method without flooding ("+") degenerates while the gradient amplitude of the test loss keeps increasing.
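The following PyTorch sketch is our own illustration of the filter-wise scaling described above (one reasonable reading of Li et al. [2018], not the authors' exact code): for every output filter of each convolutional or linear layer, the corresponding slice of the gradient is multiplied by the norm of that filter, and the ℓ2 norm of the resulting scaled gradient is reported. The toy model and data exist only to exercise the function.

    import torch
    import torch.nn as nn

    torch.manual_seed(0)

    def filter_normalized_grad_norm(model):
        """l2 norm of the gradient after scaling each filter's gradient slice by the
        norm of the corresponding filter (linear layers are treated row-wise)."""
        squared = 0.0
        for module in model.modules():
            if isinstance(module, (nn.Conv2d, nn.Linear)) and module.weight.grad is not None:
                w, g = module.weight, module.weight.grad
                for i in range(w.shape[0]):            # one filter per output channel / row
                    scaled = g[i] * w[i].norm()
                    squared += float((scaled ** 2).sum())
        return squared ** 0.5

    # Toy model and dummy batch, just to exercise the function.
    model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.ReLU(), nn.Flatten(), nn.Linear(8 * 30 * 30, 10))
    x, y = torch.randn(4, 3, 32, 32), torch.randint(0, 10, (4,))
    loss = nn.CrossEntropyLoss()(model(x), y)
    loss.backward()
    print(f"filter-normalized gradient amplitude: {filter_normalized_grad_norm(model):.4f}")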


(a) MNIST (x: train, y: loss) (b) MNIST (x: train, y: acc) (c) MNIST (x: test, y: loss) (d) MNIST (x: test, y: acc)
(e) K-M (x: train, y: loss) (f) K-M (x: train, y: acc) (g) K-M (x: test, y: loss) (h) K-M (x: test, y: acc)
(i) C10 (x: train, y: loss) (j) C10 (x: train, y: acc) (k) C10 (x: test, y: loss) (l) C10 (x: test, y: acc)
(m) C100 (x: train, y: loss) (n) C100 (x: train, y: acc) (o) C100 (x: test, y: loss) (p) C100 (x: test, y: acc)
(q) SVHN (x: train, y: loss) (r) SVHN (x: train, y: acc) (s) SVHN (x: test, y: loss) (t) SVHN (x: test, y: acc)

Figure 5: Relationship between test performance (loss or accuracy) and amplitude of gradient (with training or test loss). Each point in the figures corresponds to a single model at a certain epoch. We remove the first 5 epochs and plot the rest. "+" marks the method without flooding, and the other marker the method with flooding. The large black marks show the epochs with early stopping. The color becomes lighter (purple → yellow) as the training proceeds. K-M, C10, and C100 stand for Kuzushiji-MNIST, CIFAR-10, and CIFAR-100.

F Flatness

Settings: We follow Li et al. [2018] and give a one-dimensional visualization of flatness for each dataset in Figure 6; a sketch of this visualization procedure is given below. We exclude Fashion-MNIST because the optimal flooding level was zero. We used the results with training/validation split ratio of 0.8. We compare the flatness of the model right after the empirical risk with respect to a mini-batch becomes smaller than the flooding level, R̂_m(g) < b, for the first time (dotted blue line), and the model after training (solid blue line). We also compare them with the model trained by the baseline method without flooding, at the end of training (solid red line).

Results: According to Figure 6, the test loss becomes lower and flatter during the training with flooding. Note that the training loss, on the other hand, continues to float around the flooding level until the end of training, after it enters the flooding zone. We expect that the model makes a random walk and escapes regions with sharp loss landscapes during this period. This may be a possible reason for the better generalization results with our proposed method.
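A minimal sketch of such a one-dimensional flatness plot, under our reading of Li et al. [2018], is given below: a random direction is drawn, rescaled filter-wise so that each filter of the direction has the same norm as the corresponding filter of the trained weights, and the loss is evaluated at the perturbed parameters θ + αd for a range of α. The model and data here are toy stand-ins for a trained network and its training set.

    import copy
    import torch
    import torch.nn as nn

    torch.manual_seed(0)

    # Toy stand-ins for a trained model and its training data.
    model = nn.Sequential(nn.Linear(20, 100), nn.ReLU(), nn.Linear(100, 2))
    x, y = torch.randn(512, 20), torch.randint(0, 2, (512,))
    criterion = nn.CrossEntropyLoss()

    # Random direction with filter-wise normalization: each row of the direction
    # is rescaled to match the norm of the corresponding row of the weights.
    direction = []
    for p in model.parameters():
        d = torch.randn_like(p)
        if p.dim() > 1:
            for i in range(p.shape[0]):
                d[i] *= p[i].norm() / (d[i].norm() + 1e-12)
        else:
            d.zero_()  # a common choice: do not perturb the biases
        direction.append(d)

    base_state = copy.deepcopy(model.state_dict())
    alphas = torch.linspace(-1.0, 1.0, 21)
    losses = []
    with torch.no_grad():
        for alpha in alphas:
            model.load_state_dict(base_state)
            for p, d in zip(model.parameters(), direction):
                p.add_(alpha * d)              # evaluate the loss at theta + alpha * d
            losses.append(criterion(model(x), y).item())
    model.load_state_dict(base_state)

    for a, l in zip(alphas.tolist(), losses):
        print(f"alpha = {a:+.1f}  loss = {l:.4f}")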


(a) MNIST (train) (b) MNIST (test) (c) Kuzushiji-MNIST (train) (d) Kuzushiji-MNIST (test)
(e) CIFAR-10 (train) (f) CIFAR-10 (test) (g) CIFAR-100 (train) (h) CIFAR-100 (test)
(i) SVHN (train) (j) SVHN (test)

Figure 6: One-dimensional visualization of flatness. We visualize the training/test loss with respect to perturbation. We depict the results for 3 models: the model when the empirical risk with respect to training data is below the flooding level for the first time during training (dotted blue), the model at the end of training with flooding (solid blue), and the model at the end of training without flooding (solid red).


During ooding the training loss will repeat going below and above the ooding level emodel will continue to ldquorandom walkrdquo with the same non-zero training loss and we expect it todri into an area with a at loss landscape that leads to beer generalization [Chaudhari et al2017 Keskar et al 2017 Li et al 2018] 2

Since it is a simple solution this modication can be incorporated into existing machinelearning code easily Add one line of code for Eq (1) aer evaluating the original objectivefunction J(θ) A minimal working example with a mini-batch in PyTorch [Paszke et al 2019] isdemonstrated below to show the additional one line of code

1 outputs = model(inputs)2 loss = criterion(outputs labels)3 flood = (loss-b)abs()+b This is it4 optimizerzerograd()5 floodbackward()6 optimizerstep()

It may be hard to set the ooding level without expert knowledge on the domain or task Wecan circumvent this situation easily by treating the ooding level as a hyper-parameter We mayuse a naive search which exhaustively evaluates the accuracy for the predened hyper-parametercandidates with a validation dataset is procedure can be performed in parallel

Previous regularizationmethods Many previous regularization methods also aim at avoidingtraining too much in various ways eg restricting the parameter norm to become small by decayingthe parameter weights [Hanson and Pra 1988] raising the diculty of training by droppingactivations of neural networks [Srivastava et al 2014] smoothing the training labels [Szegedyet al 2016] or simply stopping training at an earlier phase [Morgan and Bourlard 1990] esemethods can be considered as indirect ways to control the training loss by also introducingadditional assumptions eg the optimal model parameters are close to zero Although makingthe regularization eect stronger would make it harder for the training loss to approach zeroit is still hard to maintain the right level of training loss till the end of training In fact foroverparametrized deep networks applying a small regularization parameter would not stop thetraining loss becoming (near-)zero making it even harder to choose a hyper-parameter thatcorresponds to a specic level of loss

Flooding on the other hand is a direct solution to the issue that the training loss becomes(near-)zero Flooding intentionally prevents further reduction of the training loss when it reachesa reasonably small value and the ooding level corresponds to the level of training loss that theuser wants to keep

Contributions Our proposed regularizer called ooding makes the training loss oat around asmall constant value instead of making it head towards zero loss Flooding is a regularizer that isdomain- task- and model-independent eoretically we nd that the mean squared error can bereduced with ooding under certain conditions Not only do we show test accuracy improving

2In Appendix F we show that during this period of random walk there is an increase in atness of the lossfunction

3

aer ooding we also observe that even aer we avoid zero training loss memorization with zerotraining error still takes place

2 BackgroundsIn this section we review regularization methods (summarized in Table 1) recent works onoverparametrization and double descent curves and the area of weakly supervised learning wheresimilar techniques to ooding has been explored

21 Regularization Methodse name ldquoregularizationrdquo dates back to at least Tikhonov regularization for the ill-posed linearleast-squares problem [Tikhonov and Arsenin 1977 Tikhonov 1943] One example is to modifyXgtX (where X is the design matrix) to become ldquoregularrdquo by adding a term to the objectivefunction `2 regularization is a generalization of the above example and can be applied to non-linear models ese methods implicitly assume that the optimal model parameters are close tozero

It is known that weight decay [Hanson and Pra 1988] dropout [Srivastava et al 2014] andearly stopping [Morgan and Bourlard 1990] are equivalent to `2 regularization under certainconditions [Bishop 1995 Goodfellow et al 2016 Loshchilov and Huer 2019 Wager et al 2013]implying that there is a similar assumption on the optimal model parameters ere are otherpenalties based on dierent assumptions such as the `1 regularization [Tibshirani 1996] based onthe sparsity assumption that the optimal model has only a few non-zero parameters

Modern machine learning tasks are applied to complex problems where the optimal modelparameters are not necessarily close to zero or may not be sparse and it would be ideal if wecan properly add regularization eects to the optimization stage without such assumptions Ourproposed method does not have assumptions on the optimal model parameters and can be usefulfor more complex problems

More recently ldquoregularizationrdquo has further evolved to a more general meaning includingvarious methods that alleviate overing but do not necessarily have a step to regularize asingular matrix or add a regularization term to the objective function For example Goodfellowet al [2016] denes regularization as ldquoany modication we make to a learning algorithm that isintended to reduce its generalization error but not its training errorrdquo In this paper we adopt thisbroader meaning of ldquoregularizationrdquo

Examples of the more general regularization category include mixup [Zhang et al., 2018] and data augmentation methods like cropping and flipping or adjusting brightness or sharpness [Shorten and Khoshgoftaar, 2019]. These methods have been adopted in many papers to obtain state-of-the-art performance [Berthelot et al., 2019, Verma et al., 2019] and are becoming essential regularization tools for developing new systems. However, these regularization methods have the drawback of being domain-specific: they are designed for the vision domain and require some effort when applying to other domains [Guo et al., 2019, Thulasidasan et al., 2019]. Other regularizers, such as label smoothing [Szegedy et al., 2016], are used for problems with class labels


Table 1: Conceptual comparisons of various regularizers. "tr." stands for "training," "Indep." stands for "independent." ✓ stands for yes and × stands for no.

Regularization and other methods | Target tr. loss | Domain indep. | Task indep. | Model indep. | Main assumption
ℓ2 regularization [Tikhonov, 1943] | × | ✓ | ✓ | ✓ | Optimal model params are close to 0
Weight decay [Hanson and Pratt, 1988] | × | ✓ | ✓ | ✓ | Optimal model params are close to 0
Early stopping [Morgan and Bourlard, 1990] | × | ✓ | ✓ | ✓ | Overfitting occurs in later epochs
ℓ1 regularization [Tibshirani, 1996] | × | ✓ | ✓ | ✓ | Optimal model has to be sparse
Dropout [Srivastava et al., 2014] | × | ✓ | ✓ | × | Weight scaling inference rule
Batch normalization [Ioffe and Szegedy, 2015] | × | ✓ | ✓ | × | Existence of internal covariate shift
Label smoothing [Szegedy et al., 2016] | × | ✓ | × | ✓ | True posterior is not a one-hot vector
Mixup [Zhang et al., 2018] | × | × | × | ✓ | Linear relationship between x and y
Image augment. [Shorten and Khoshgoftaar, 2019] | × | × | ✓ | ✓ | Input is invariant to the translations
Flooding (proposed method) | ✓ | ✓ | ✓ | ✓ | Learning until zero loss is harmful

and are harder to use with regression or ranking, meaning they are task-specific. Batch normalization [Ioffe and Szegedy, 2015] and dropout [Srivastava et al., 2014] are designed for neural networks and are model-specific.

Although these regularization methods (both the special and general ones) already work well in practice and have become the de facto standard tools [Bishop, 2011, Goodfellow et al., 2016], we provide an alternative which is even more general in the sense that it is domain-, task-, and model-independent.

That being said, we want to emphasize that the most important difference between flooding and other regularization methods is whether it is possible to target a specific level of training loss other than zero. While flooding allows the user to choose the level of training loss directly, it is hard to achieve this with other regularizers.

2.2 Double Descent Curves with Overparametrization

Recently, there has been increasing attention on the phenomenon of "double descent," named by Belkin et al. [2019], to explain the two regimes of deep learning: The first one (underparametrized regime) occurs where the model complexity is small compared to the number of training samples, and the test error, as a function of model complexity, decreases with low model complexity but starts to increase after the model complexity is large enough. This follows the classical view of machine learning that excessive complexity leads to poor generalization. The second one (overparametrized regime) occurs when an even larger model complexity is considered. Then increasing the complexity only decreases test error, which leads to a double descent shape. The phase of decreasing test error often occurs after the training error becomes zero. This follows the modern view of machine learning that bigger models lead to better generalization.3

As far as we know, the discovery of double descent curves dates back to at least Krogh and

3 https://www.eff.org/ai/metrics


Hertz [1992], where they theoretically showed the double descent phenomenon under a linear regression setup. Recent works [Belkin et al., 2019, Nakkiran et al., 2020] have shown empirically that a similar phenomenon can be observed with deep learning methods. Nakkiran et al. [2020] observed that the double descent curves can be shown not only as a function of model complexity, but also as a function of the epoch number.

We want to note a byproduct of our flooding method: We were able to produce the epoch-wise double descent curve for the test loss with about 100 epochs. Investigating the connection between our accelerated double descent curves and previous double descent curves [Belkin et al., 2019, Krogh and Hertz, 1992, Nakkiran et al., 2020] is out of the scope of this paper, but is an important future direction.

2.3 Lower-Bounding the Empirical Risk

Lower-bounding the empirical risk has been used in the area of weakly supervised learning. There was a common phenomenon where the empirical risk goes below zero [Kiryo et al., 2017], when an equivalent form of the risk expressed with the given weak supervision was alternatively used [Cid-Sueiro et al., 2014, du Plessis et al., 2014, 2015, Natarajan et al., 2013, Patrini et al., 2017, van Rooyen and Williamson, 2018]. A gradient ascent technique was used to force the empirical risk to become non-negative in Kiryo et al. [2017]. This idea has been generalized and applied to other weakly supervised settings [Han et al., 2018, Ishida et al., 2019, Lu et al., 2020].

Although we also set a lower bound on the empirical risk, the motivation is different. First, while Kiryo et al. [2017] and others aim to fix the negative empirical risk to become lower bounded by zero, our empirical risk already has a lower bound of zero. Instead, we are aiming to sink the original empirical risk by placing a positive lower bound. Second, the problem settings are different: Weakly supervised learning methods require certain loss corrections or sample corrections [Han et al., 2018] before the non-negative correction, but we work on the original empirical risk, without any setting-specific modifications.
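To make this contrast concrete, here is a schematic PyTorch-style sketch. It is our own simplification, not code from Kiryo et al. [2017] or from our implementation; in actual non-negative risk estimators the clipping is applied to a corrected partial risk term, abbreviated below as risk_negative_part.

    import torch

    def non_negative_risk(risk_positive_part, risk_negative_part):
        # Weakly supervised setting: a corrected risk term can go below zero,
        # so it is forced back up to the lower bound of zero.
        return risk_positive_part + torch.clamp(risk_negative_part, min=0.0)

    # Flooding instead lower-bounds the whole (already non-negative) empirical
    # risk at a positive level b via |R_hat(g) - b| + b; see Section 3.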

3 Flooding: How to Avoid Zero Training Loss

In this section, we propose our regularization method, flooding. Note that this section and the following sections only consider multi-class classification for simplicity.

3.1 Preliminaries

Consider input variable $x \in \mathbb{R}^d$ and output variable $y \in \{1, \ldots, K\}$, where $K$ is the number of classes. They follow an unknown joint probability distribution with density $p(x, y)$. We denote the score function by $g : \mathbb{R}^d \to \mathbb{R}^K$. For any test data point $x_0$, our prediction of the output label will be given by $\hat{y}_0 = \mathrm{argmax}_{z \in \{1,\ldots,K\}} g_z(x_0)$, where $g_z(\cdot)$ is the $z$-th element of $g(\cdot)$, and in case of a tie, $\mathrm{argmax}$ returns the largest argument. Let $\ell : \mathbb{R}^K \times \{1, 2, \ldots, K\} \to \mathbb{R}$ denote a loss


function. $\ell$ can be the zero-one loss,
$$\ell_{01}(v, z') = \begin{cases} 0 & \text{if } \mathrm{argmax}_{z \in \{1,\ldots,K\}} v_z = z', \\ 1 & \text{otherwise}, \end{cases} \qquad (2)$$
where $v = (v_1, \ldots, v_K)^\top \in \mathbb{R}^K$, or a surrogate loss such as the softmax cross-entropy loss,
$$\ell_{\mathrm{CE}}(v, z') = -\log \frac{\exp(v_{z'})}{\sum_{z \in \{1,\ldots,K\}} \exp(v_z)}. \qquad (3)$$
For a surrogate loss $\ell$, we denote the classification risk by
$$R(g) = \mathbb{E}_{p(x,y)}[\ell(g(x), y)], \qquad (4)$$
where $\mathbb{E}_{p(x,y)}[\cdot]$ is the expectation over $(x, y) \sim p(x, y)$. We use $R_{01}(g)$ to denote Eq. (4) when $\ell = \ell_{01}$, and call it the classification error.

The goal of multi-class classification is to learn $g$ that minimizes the classification error $R_{01}(g)$. In optimization, we consider the minimization of the risk with an almost surely differentiable surrogate loss $R(g)$ instead, to make the problem more tractable. Furthermore, since $p(x, y)$ is usually unknown and there is no way to exactly evaluate $R(g)$, we minimize its empirical version calculated from the training data instead:

$$\widehat{R}(g) = \frac{1}{n} \sum_{i=1}^{n} \ell(g(x_i), y_i), \qquad (5)$$
where $\{(x_i, y_i)\}_{i=1}^{n}$ are i.i.d. sampled from $p(x, y)$. We call $\widehat{R}$ the empirical risk.

We would like to clarify some of the undefined terms used in the title and the introduction. The "train/test loss" is the empirical risk with respect to the surrogate loss function $\ell$ over the training/test data, respectively. We refer to the "training/test error" as the empirical risk with respect to $\ell_{01}$ over the training/test data, respectively (which is equal to one minus accuracy) [Zhang, 2004].

Finally, we formally define the Bayes risk as
$$R^* = \inf_h R(h), \qquad (6)$$
where the infimum is taken over all vector-valued functions $h : \mathbb{R}^d \to \mathbb{R}^K$. The Bayes risk is often referred to as the Bayes error if the zero-one loss is used:
$$\inf_h R_{01}(h). \qquad (7)$$

3.2 Algorithm

With flexible models, $\widehat{R}(g)$ w.r.t. a surrogate loss can easily become small if not zero, as we mentioned in Section 1; see [C] in Fig. 1(a). We propose a method that "floods the bottom area and sinks the original empirical risk," as in Fig. 1(b), so that the empirical risk cannot go below the flooding level. More technically, if we denote the flooding level as $b$, our proposed training objective with flooding is a simple fix:


Definition 1. The flooded empirical risk is defined as4
$$\widetilde{R}(g) = |\widehat{R}(g) - b| + b. \qquad (8)$$

Note that when $b = 0$, then $\widetilde{R}(g) = \widehat{R}(g)$. The gradient of $\widetilde{R}(g)$ w.r.t. model parameters will point to the same direction as that of $\widehat{R}(g)$ when $\widehat{R}(g) > b$, but in the opposite direction when $\widehat{R}(g) < b$. This means that when the learning objective is above the flooding level, we perform gradient descent as usual (gravity zone), but when the learning objective is below the flooding level, we perform gradient ascent instead (buoyancy zone).
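In code, the flooded objective of Eq. (8) is a one-line change on top of an ordinary training step. The following PyTorch-style sketch is our own minimal illustration; the use of the softmax cross-entropy loss and the variable names are assumptions, not taken verbatim from the released demo code.

    import torch.nn.functional as F

    def flooded_loss(logits, targets, b):
        loss = F.cross_entropy(logits, targets)   # mini-batch surrogate loss
        return (loss - b).abs() + b               # Eq. (8): descent above b, ascent below b

Backpropagating through the returned value instead of the plain loss is the only change; any stochastic optimizer and other regularizers can be used on top.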

The issue is that, in general, we seldom know the optimal flooding level in advance. This issue can be mitigated by searching for the optimal flooding level $b^*$ with a hyper-parameter optimization technique. In practice, we can search for the optimal flooding level by performing the exhaustive search in parallel.
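For example, the exhaustive search can be run as a plain grid search over candidate flooding levels, keeping the level with the best validation accuracy. In the sketch below, train_and_validate is a hypothetical placeholder for any routine that trains a model with a given flooding level and returns its validation accuracy.

    def search_flooding_level(train_and_validate,
                              candidates=(0.00, 0.01, 0.02, 0.03, 0.04, 0.05)):
        # The runs are independent, so in practice they can be launched in parallel.
        scores = {b: train_and_validate(flood_level=b) for b in candidates}
        best_b = max(scores, key=scores.get)
        return best_b, scores[best_b]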

3.3 Implementation

For large scale problems, we can employ mini-batched stochastic optimization for efficient computation. Suppose that we have $M$ disjoint mini-batch splits. We denote the empirical risk (5) with respect to the $m$-th mini-batch by $\widehat{R}_m(g)$ for $m \in \{1, \ldots, M\}$. Then our mini-batched optimization performs gradient descent updates in the direction of the gradient of $\widetilde{R}_m(g)$. By the convexity of the absolute value function and Jensen's inequality, we have
$$\widetilde{R}(g) \le \frac{1}{M} \sum_{m=1}^{M} \left( |\widehat{R}_m(g) - b| + b \right). \qquad (9)$$
This indicates that mini-batched optimization will simply minimize an upper bound of the full-batch case with $\widetilde{R}(g)$.
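Concretely, a mini-batched training loop applies the flooding operator to each mini-batch risk, as on the right-hand side of Eq. (9). The sketch below is our own illustration; model, loader, and optimizer are assumed to be defined elsewhere.

    import torch.nn.functional as F

    def train_one_epoch(model, loader, optimizer, b):
        model.train()
        for inputs, targets in loader:
            optimizer.zero_grad()
            loss = F.cross_entropy(model(inputs), targets)   # mini-batch risk R_m(g)
            flooded = (loss - b).abs() + b                   # flooded mini-batch risk
            flooded.backward()
            optimizer.step()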

3.4 Theoretical Analysis

In the following theorem, we will show that the mean squared error (MSE) of the proposed risk estimator with flooding is smaller than that of the original risk estimator without flooding.

Theorem 1. Fix any measurable vector-valued function $g$. If the flooding level $b$ satisfies $\widehat{R}(g) < b < R(g)$, we have
$$\mathrm{MSE}(\widehat{R}(g)) > \mathrm{MSE}(\widetilde{R}(g)). \qquad (10)$$
If $b \le \widehat{R}(g)$, we have
$$\mathrm{MSE}(\widehat{R}(g)) = \mathrm{MSE}(\widetilde{R}(g)). \qquad (11)$$

A proof is given in Appendix A. If we regard $\widehat{R}(g)$ as the training loss and $R(g)$ as the test loss, we would want $b$ to be between those two for the MSE to improve.
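The effect described in Theorem 1 can also be checked with a quick numerical toy experiment. The sketch below is our own illustration; treating the empirical risk as a Gaussian-distributed estimate is an assumption made only for this toy example.

    import numpy as np

    rng = np.random.default_rng(0)
    R_true = 0.30                              # true risk R(g)
    R_hat = rng.normal(0.10, 0.05, 200_000)    # toy draws of the empirical risk (training loss around 0.10)

    def mse(est):
        return float(np.mean((est - R_true) ** 2))

    for b in (0.0, 0.05, 0.20, 0.30):
        R_tilde = np.abs(R_hat - b) + b        # flooded estimator, Eq. (8)
        print(f"b={b:.2f}  MSE without flooding={mse(R_hat):.4f}  with flooding={mse(R_tilde):.4f}")

With b between the typical training loss and the true risk, the flooded estimator should show a smaller MSE, in line with Eq. (10).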

4 Strictly speaking, Eq. (1) is different from Eq. (8), since Eq. (1) can ignore constant terms of the original empirical risk. We will refer to Eq. (8) for the flooding operator for the rest of the paper.


4 Experiments

In this section, we show experimental results with synthetic and benchmark datasets. The implementation is based on PyTorch [Paszke et al., 2019] and demo code will be available.5 Experiments were carried out with NVIDIA GeForce GTX 1080 Ti, NVIDIA Quadro RTX 5000, and Intel Xeon Gold 6142.

4.1 Synthetic Experiments

The aim of our synthetic experiments is to study the behavior of flooding with a controlled setup. We use three types of synthetic data, described below.

Two Gaussians Data. We perform binary classification with two 10-dimensional Gaussian distributions with identity covariance matrix and means $\mu_P = [0, 0, \ldots, 0]^\top$ and $\mu_N = [m, m, \ldots, m]^\top$, where $m \in \{0.8, 1.0\}$. The Bayes risks for $m = 1.0$ and $m = 0.8$ are 0.14 and 0.24, respectively, where proofs are shown in Appendix B. The training, validation, and test sample sizes are 25, 10000, and 10000 per class, respectively.

Sinusoid Data. The sinusoid data [Nakkiran et al., 2019] are generated as follows. We first draw input data points uniformly from the inside of a 2-dimensional ball of radius 1. Then we put class labels based on
$$y = \mathrm{sign}(x^\top w + \sin(x^\top w')),$$
where $w$ and $w'$ are any two 2-dimensional vectors such that $w \perp w'$. The training, validation, and test sample sizes are 100, 100, and 20000, respectively.
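A minimal NumPy sketch of this sinusoid construction follows; it is our own illustration, the particular orthogonal pair w and w' below is an arbitrary choice, and the label flipping used for the noisy settings in Table 2 is omitted.

    import numpy as np

    def generate_sinusoid(n, seed=0):
        rng = np.random.default_rng(seed)
        # Draw points uniformly from the unit disk by rejection sampling.
        cand = rng.uniform(-1.0, 1.0, size=(4 * n, 2))
        x = cand[np.linalg.norm(cand, axis=1) <= 1.0][:n]
        w = np.array([1.0, 0.0])          # assumed direction
        w_prime = np.array([0.0, 1.0])    # orthogonal to w
        y = np.sign(x @ w + np.sin(x @ w_prime))
        return x, y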

Spiral Data. The spiral data [Sugiyama, 2015] are two-dimensional synthetic data. Let $\theta^+_1 = 0, \theta^+_2, \ldots, \theta^+_{n_+} = 4\pi$ be equally spaced $n_+$ points in the interval $[0, 4\pi]$, and $\theta^-_1 = 0, \theta^-_2, \ldots, \theta^-_{n_-} = 4\pi$ be equally spaced $n_-$ points in the interval $[0, 4\pi]$. Let positive and negative input data points be
$$x^+_{i_+} = \theta^+_{i_+} [\cos(\theta^+_{i_+}), \sin(\theta^+_{i_+})]^\top + \tau \nu^+_{i_+},$$
$$x^-_{i_-} = (\theta^-_{i_-} + \pi) [\cos(\theta^-_{i_-}), \sin(\theta^-_{i_-})]^\top + \tau \nu^-_{i_-},$$
for $i_+ = 1, \ldots, n_+$ and $i_- = 1, \ldots, n_-$, where $\tau$ controls the magnitude of the noise, and $\nu^+_i$ and $\nu^-_i$ are i.i.d. distributed according to the two-dimensional standard normal distribution. Then we make data for classification by $\{(x_i, y_i)\}_{i=1}^{n} = \{(x^+_{i_+}, +1)\}_{i_+=1}^{n_+} \cup \{(x^-_{i_-}, -1)\}_{i_-=1}^{n_-}$, where $n = n_+ + n_-$. The training, validation, and test sample sizes are 100, 100, and 10000 per class, respectively.
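The spiral construction translates directly into code; the sketch below is our own illustration, and the noise magnitude tau is a placeholder since its value is not specified here.

    import numpy as np

    def generate_spiral(n_pos, n_neg, tau=0.2, seed=0):
        rng = np.random.default_rng(seed)
        theta_p = np.linspace(0.0, 4.0 * np.pi, n_pos)
        theta_n = np.linspace(0.0, 4.0 * np.pi, n_neg)
        x_pos = theta_p[:, None] * np.c_[np.cos(theta_p), np.sin(theta_p)] \
                + tau * rng.standard_normal((n_pos, 2))
        x_neg = (theta_n + np.pi)[:, None] * np.c_[np.cos(theta_n), np.sin(theta_n)] \
                + tau * rng.standard_normal((n_neg, 2))
        x = np.vstack([x_pos, x_neg])
        y = np.concatenate([np.ones(n_pos), -np.ones(n_neg)])
        return x, y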

For Two Gaussians, we use a one-hidden-layer feedforward neural network with 500 units in the hidden layer with the ReLU activation function [Nair and Hinton, 2010]. We train the network for 1000 epochs with the logistic loss and vanilla gradient descent with learning rate of 0.05. The flooding level is chosen from $b \in \{0, 0.01, 0.02, \ldots, 0.40\}$. For Sinusoid and Spiral, we use a four-hidden-layer feedforward neural network with 500 units in the hidden layer, with the

5 https://github.com/takashiishida/flooding


Table 2: Experimental results for the synthetic data. Sub-table (A) shows the results without early stopping. Sub-table (B) shows the results with early stopping. The better method is shown in bold in each of sub-tables (A) and (B). "BR" stands for the Bayes risk. For Two Gaussians, the distance between the positive and negative distributions is larger for larger m. See the description in Section 4.1 for the details.

Data | Setting | (A) Without Flooding | (A) With Flooding | (A) Chosen b | (B) Without Flooding | (B) With Flooding | (B) Chosen b
Two Gaussians | m: 1.0, BR: 0.14 | 87.96 | 92.25 | 0.28 | 91.63 | 92.25 | 0.27
Two Gaussians | m: 0.8, BR: 0.24 | 82.00 | 87.31 | 0.33 | 86.57 | 87.29 | 0.35
Sinusoid | Label Noise: 0.01 | 93.84 | 94.46 | 0.01 | 92.54 | 92.54 | 0.00
Sinusoid | Label Noise: 0.05 | 91.12 | 95.44 | 0.10 | 93.26 | 94.60 | 0.01
Sinusoid | Label Noise: 0.10 | 86.57 | 96.02 | 0.17 | 96.70 | 96.70 | 0.00
Spiral | Label Noise: 0.01 | 98.96 | 97.85 | 0.01 | 98.60 | 98.88 | 0.01
Spiral | Label Noise: 0.05 | 93.87 | 96.24 | 0.04 | 96.58 | 95.62 | 0.14
Spiral | Label Noise: 0.10 | 89.70 | 92.96 | 0.16 | 89.70 | 92.96 | 0.16

ReLU activation function [Nair and Hinton, 2010] and batch normalization [Ioffe and Szegedy, 2015]. We train the network for 500 epochs with the logistic loss and the Adam [Kingma and Ba, 2015] optimizer with mini-batch size 100 and learning rate of 0.001. The flooding level is chosen from $b \in \{0, 0.01, 0.02, \ldots, 0.20\}$. Note that training with $b = 0$ is identical to the baseline method without flooding. We report the test accuracy of the flooding level with the best validation accuracy. We first conduct experiments without early stopping, which means that the last epoch was chosen for all flooding levels.

Results. The results are summarized in Table 2. It is worth noting that, for Two Gaussians, the chosen flooding level b is larger for the smaller distance between the two distributions, which is when the classification task is harder and the Bayes risk becomes larger, since the two distributions become less separated. We see similar tendencies for Sinusoid and Spiral data: a larger b was chosen for a larger flipping probability for label noise, which is expected to increase the Bayes risk. This implies a positive correlation between the optimal flooding level and the Bayes risk, as is also partially suggested by Theorem 1. Another interesting observation is that the chosen b is close to, but higher than, the Bayes risk for Two Gaussians data. This may look inconsistent with Theorem 1. However, it makes sense to adopt a larger b with a stronger regularization effect that allows some bias, as a trade-off for reducing the variance of the risk estimator. In fact, Theorem 1 does not deny the possibility that some $b \ge R(g)$ achieves even better estimation.

From (A) in Table 2, we can see that the method with flooding often improves test accuracy over the baseline method without flooding. As we mentioned in the introduction, it can be harmful to keep training a model until the end without flooding. However, with flooding, the model at the final epoch has good prediction performance according to the results, which implies that flooding


Table 3: Results with benchmark datasets. We report classification accuracy for all combinations of weight decay (✓ and ×), early stopping (✓ and ×), and flooding (✓ and ×). The second column shows the training/validation split used for the experiment. W stands for weight decay, E stands for early stopping, and F stands for flooding. "—" means that a flooding level of zero was optimal. "N/A" means that we skipped the experiments because zero weight decay was optimal in the case without flooding. The best and equivalent are shown in bold by comparing "with flooding" and "without flooding" for two columns with the same setting for W and E, e.g., the first and fifth columns out of the 8 columns. The best performing combination is highlighted.

Dataset | tr/va split | W:× E:× F:× | W:× E:✓ F:× | W:✓ E:× F:× | W:✓ E:✓ F:× | W:× E:× F:✓ | W:× E:✓ F:✓ | W:✓ E:× F:✓ | W:✓ E:✓ F:✓
MNIST | 0.8 | 98.32 | 98.30 | 98.51 | 98.42 | 98.46 | 98.53 | 98.50 | 98.48
MNIST | 0.4 | 97.71 | 97.70 | 97.82 | 97.91 | 97.74 | 97.85 | — | 97.83
Fashion-MNIST | 0.8 | 89.34 | 89.36 | N/A | N/A | — | — | N/A | N/A
Fashion-MNIST | 0.4 | 88.48 | 88.63 | 88.60 | 88.62 | — | — | — | —
Kuzushiji-MNIST | 0.8 | 91.63 | 91.62 | 91.63 | 91.71 | 92.40 | 92.12 | 92.11 | 91.97
Kuzushiji-MNIST | 0.4 | 89.18 | 89.18 | 89.58 | 89.73 | 90.41 | 90.15 | 89.71 | 89.88
CIFAR-10 | 0.8 | 73.59 | 73.36 | 73.65 | 73.57 | 73.06 | 73.44 | — | 74.41
CIFAR-10 | 0.4 | 66.39 | 66.63 | 69.31 | 69.28 | 67.20 | 67.58 | — | —
CIFAR-100 | 0.8 | 42.16 | 42.33 | 42.67 | 42.45 | 42.50 | 42.36 | — | —
CIFAR-100 | 0.4 | 34.27 | 34.34 | 37.97 | 38.82 | 34.99 | 35.14 | — | —
SVHN | 0.8 | 92.38 | 92.41 | 93.20 | 92.99 | 92.78 | 92.79 | — | 93.42
SVHN | 0.4 | 90.32 | 90.35 | 90.43 | 90.49 | 90.57 | 90.61 | 91.16 | 91.21

helps the late-stage training improve test accuracy.

We also conducted experiments with early stopping, meaning that we chose the model that recorded the best validation accuracy during training. The results are reported in sub-table (B) of Table 2. Compared with sub-table (A), we see that early stopping improves the baseline method without flooding well in many cases. This indicates that training longer without flooding was harmful in our experiments. On the other hand, the accuracy for flooding without early stopping is often close to that for flooding combined with early stopping, meaning that training until the end with flooding tends to be already as good as doing so with early stopping. The table shows that flooding often improves or retains the test accuracy of the baseline method without flooding even after deploying early stopping. Flooding does not hurt performance, but can be beneficial for methods used with early stopping.


[Figure 2: Learning curves of training and test loss for training/validation proportion of 0.8. (a) shows the learning curves without flooding; (b), (c), and (d) show the learning curves with different flooding levels. Panels: (a) CIFAR-10 (0.00), (b) CIFAR-10 (0.03), (c) CIFAR-10 (0.07), (d) CIFAR-10 (0.20).]

4.2 Benchmark Experiments

We next perform experiments with benchmark datasets. Not only do we compare with the baseline without flooding, we also compare or combine with other general regularization methods, which are early stopping and weight decay.

Settings. We use the following six benchmark datasets: MNIST, Fashion-MNIST, Kuzushiji-MNIST, CIFAR-10, CIFAR-100, and SVHN. The details of the benchmark datasets can be found in Appendix C.1. We split the original training dataset into training and validation data with different proportions: 0.8 or 0.4 (meaning 80% or 40% was used for training and the rest was used for validation, respectively). We perform the exhaustive hyper-parameter search for the flooding level with candidates from $\{0.00, 0.01, \ldots, 0.20\}$. The number of epochs is 500. Stochastic gradient descent [Robbins and Monro, 1951] is used with learning rate of 0.1 and momentum of 0.9. For MNIST, Fashion-MNIST, and Kuzushiji-MNIST, we use a one-hidden-layer feedforward neural network with 500 units and ReLU activation function [Nair and Hinton, 2010]. For CIFAR-10, CIFAR-100, and SVHN, we used ResNet-18 [He et al., 2016]. We do not use any data augmentation or manual learning rate decay. We deployed early stopping in the same way as in Section 4.1.

We first ran experiments with the following candidates for the weight decay rate: $\{1\times10^{-5}, 1\times10^{-4}, 4\times10^{-4}, 7\times10^{-4}, 1\times10^{-3}, 4\times10^{-3}, 7\times10^{-3}\}$. We choose the weight decay rate with the best validation accuracy for each dataset and each training/validation proportion. Then, fixing the weight decay to the chosen one, we ran experiments with flooding level candidates from $\{0, 0.01, \ldots, 0.20\}$ to investigate whether weight decay and flooding have complementary effects, or if adding weight decay will diminish the accuracy gain of flooding.

Results. We show the results in Table 3 and the chosen flooding levels in Table 4 in Appendix C.2. We can observe that flooding gives better accuracy for most cases. We can also see that combining flooding with early stopping or with both early stopping and weight decay may lead to even better accuracy in some cases.


[Figure 3: We show the optimal flooding level maintains memorization. The vertical axis shows the training accuracy; the horizontal axis shows the flooding level. We show results with and without early stopping and for different training/validation splits with proportion of 0.8 or 0.4. The marks are placed on the flooding level that was chosen based on validation accuracy. Panels: (a) w/o early stop (0.8), (b) w/o early stop (0.4), (c) w/ early stop (0.8), (d) w/ early stop (0.4).]

4.3 Memorization

Can we maintain memorization even after adding flooding? We investigate if the trained model has zero training error (100% accuracy) for the flooding level that was chosen with validation data. We show the results for all benchmark datasets and all training/validation splits with proportions 0.8 and 0.4. We also show the case without early stopping (choosing the last epoch) and with early stopping (choosing the epoch with the highest validation accuracy). The results are shown in Fig. 3.

All figures show downward curves, implying that the model will eventually give up on memorizing all training data as the flooding level becomes higher. A more interesting and important observation is the position of the optimal flooding level (the one chosen by validation accuracy, which is marked in Fig. 3). We can observe that the marks are often plotted at zero error, and in some cases there is a mark on the highest flooding level that maintains zero error. These results are consistent with recent empirical works that imply zero training error leads to lower generalization error [Belkin et al., 2019, Nakkiran et al., 2020], but we further demonstrate that zero training loss may be harmful under zero training error.

5 Conclusion

We proposed a novel regularization method called flooding that keeps the training loss around a small constant value to avoid zero training loss. In our experiments, the optimal flooding level often maintained memorization of training data with zero error. With flooding, we showed that the test accuracy will improve for various benchmark datasets, and theoretically showed that the mean squared error will be reduced under certain conditions.

As a byproduct, we were able to produce a double descent curve for the test loss with relatively few epochs, e.g., in around 100 epochs, as shown in Fig. 2 and Fig. 4 in Appendix D. An important future direction is to study the relationship between this and the double descent curves


from previous works [Belkin et al., 2019, Krogh and Hertz, 1992, Nakkiran et al., 2020].

It would also be interesting to see if Bayesian optimization [Shahriari et al., 2016, Snoek et al., 2012] methods can be utilized to search for the optimal flooding level efficiently. We will investigate this direction in the future.

Acknowledgements

We thank Chang Xu, Genki Yamamoto, Kento Nozawa, Nontawat Charoenphakdee, Voot Tangkaratt, and Yoshihiro Nagano for the helpful discussions. TI was supported by the Google PhD Fellowship Program. MS and IY were supported by JST CREST Grant Number JPMJCR18A2 including AIP challenge program, Japan.

References

Devansh Arpit, Stanisław Jastrzębski, Nicolas Ballas, David Krueger, Emmanuel Bengio, Maxinder S. Kanwal, Tegan Maharaj, Asja Fischer, Aaron Courville, Yoshua Bengio, and Simon Lacoste-Julien. A closer look at memorization in deep networks. In ICML, 2017.
Mikhail Belkin, Daniel Hsu, and Partha P. Mitra. Overfitting or perfect fitting? Risk bounds for classification and regression rules that interpolate. In NeurIPS, 2018.
Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machine-learning practice and the classical bias-variance trade-off. PNAS, 116:15850–15854, 2019.
David Berthelot, Nicholas Carlini, Ian Goodfellow, Nicolas Papernot, Avital Oliver, and Colin A. Raffel. MixMatch: A holistic approach to semi-supervised learning. In NeurIPS, 2019.
Christopher M. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer, 2011.
Christopher Michael Bishop. Regularization and complexity control in feed-forward networks. In ICANN, 1995.
Rich Caruana, Steve Lawrence, and C. Lee Giles. Overfitting in neural nets: Backpropagation, conjugate gradient, and early stopping. In NeurIPS, 2000.
Pratik Chaudhari, Anna Choromanska, Stefano Soatto, Yann LeCun, Carlo Baldassi, Christian Borgs, Jennifer Chayes, Levent Sagun, and Riccardo Zecchina. Entropy-SGD: Biasing gradient descent into wide valleys. In ICLR, 2017.
Jesus Cid-Sueiro, Darío García-García, and Raúl Santos-Rodríguez. Consistency of losses for learning from weak labels. In ECML-PKDD, 2014.
Tarin Clanuwat, Mikel Bober-Irizar, Asanobu Kitamoto, Alex Lamb, Kazuaki Yamamoto, and David Ha. Deep learning for classical Japanese literature. In NeurIPS Workshop on Machine Learning for Creativity and Design, 2018.
Marthinus Christoffel du Plessis, Gang Niu, and Masashi Sugiyama. Analysis of learning from positive and unlabeled data. In NeurIPS, 2014.
Marthinus Christoffel du Plessis, Gang Niu, and Masashi Sugiyama. Convex formulation for learning from positive and unlabeled data. In ICML, 2015.
Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.
Hongyu Guo, Yongyi Mao, and Richong Zhang. Augmenting data with mixup for sentence classification: An empirical study. In arXiv:1905.08941, 2019.
Bo Han, Gang Niu, Jiangchao Yao, Xingrui Yu, Miao Xu, Ivor Tsang, and Masashi Sugiyama. Pumpout: A meta approach to robust deep learning with noisy labels. In arXiv:1809.11008, 2018.
Stephen Jose Hanson and Lorien Y. Pratt. Comparing biases for minimal network construction with back-propagation. In NeurIPS, 1988.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
Takashi Ishida, Gang Niu, Aditya Krishna Menon, and Masashi Sugiyama. Complementary-label learning for arbitrary losses and models. In ICML, 2019.
Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. In ICLR, 2017.
Diederik P. Kingma and Jimmy L. Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
Ryuichi Kiryo, Gang Niu, Marthinus Christoffel du Plessis, and Masashi Sugiyama. Positive-unlabeled learning with non-negative risk estimator. In NeurIPS, 2017.
Anders Krogh and John A. Hertz. A simple weight decay can improve generalization. In NeurIPS, 1992.
Yann Lecun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. In Proceedings of the IEEE, pages 2278–2324, 1998.
Hao Li, Zheng Xu, Gavin Taylor, Christoph Studer, and Tom Goldstein. Visualizing the loss landscape of neural nets. In NeurIPS, 2018.
Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In ICLR, 2019.
Nan Lu, Tianyi Zhang, Gang Niu, and Masashi Sugiyama. Mitigating overfitting in supervised classification from two unlabeled datasets: A consistent risk correction approach. In AISTATS, 2020.
N. Morgan and H. Bourlard. Generalization and parameter estimation in feedforward nets: Some experiments. In NeurIPS, 1990.
Vinod Nair and Geoffrey E. Hinton. Rectified linear units improve restricted Boltzmann machines. In ICML, 2010.
Preetum Nakkiran, Gal Kaplun, Dimitris Kalimeris, Tristan Yang, Benjamin L. Edelman, Fred Zhang, and Boaz Barak. SGD on neural networks learns functions of increasing complexity. In NeurIPS, 2019.
Preetum Nakkiran, Gal Kaplun, Yamini Bansal, Tristan Yang, Boaz Barak, and Ilya Sutskever. Deep double descent: Where bigger models and more data hurt. In ICLR, 2020.
Nagarajan Natarajan, Inderjit S. Dhillon, Pradeep K. Ravikumar, and Ambuj Tewari. Learning with noisy labels. In NeurIPS, 2013.
Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y. Ng. Reading digits in natural images with unsupervised feature learning. In NeurIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.
Andrew Y. Ng. Preventing "overfitting" of cross-validation data. In ICML, 1997.
Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An imperative style, high-performance deep learning library. In NeurIPS, 2019.
Giorgio Patrini, Alessandro Rozza, Aditya K. Menon, Richard Nock, and Lizhen Qu. Making deep neural networks robust to label noise: A loss correction approach. In CVPR, 2017.
Herbert Robbins and Sutton Monro. A stochastic approximation method. Annals of Mathematical Statistics, 22:400–407, 1951.
Rebecca Roelofs, Vaishaal Shankar, Benjamin Recht, Sara Fridovich-Keil, Moritz Hardt, John Miller, and Ludwig Schmidt. A meta-analysis of overfitting in machine learning. In NeurIPS, 2019.
Bobak Shahriari, Kevin Swersky, Ziyu Wang, Ryan P. Adams, and Nando de Freitas. Taking the human out of the loop: A review of Bayesian optimization. Proceedings of the IEEE, 104:148–175, 2016.
Connor Shorten and Taghi M. Khoshgoftaar. A survey on image data augmentation for deep learning. Journal of Big Data, 6, 2019.
Jasper Snoek, Hugo Larochelle, and Ryan P. Adams. Practical Bayesian optimization of machine learning algorithms. In NeurIPS, 2012.
Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15:1929–1958, 2014.
Masashi Sugiyama. Introduction to Statistical Machine Learning. Morgan Kaufmann, 2015.
Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, and Jon Shlens. Rethinking the inception architecture for computer vision. In CVPR, 2016.
Sunil Thulasidasan, Gopinath Chennupati, Jeff Bilmes, Tanmoy Bhattacharya, and Sarah Michalak. On mixup training: Improved calibration and predictive uncertainty for deep neural networks. In NeurIPS, 2019.
Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58:267–288, 1996.
A. N. Tikhonov and V. Y. Arsenin. Solutions of Ill Posed Problems. Winston, 1977.
Andrey Nikolayevich Tikhonov. On the stability of inverse problems. Doklady Akademii Nauk SSSR, 39:195–198, 1943.
Antonio Torralba, Rob Fergus, and William T. Freeman. 80 million tiny images: A large data set for nonparametric object and scene recognition. In IEEE Trans. PAMI, 2008.
Brendan van Rooyen and Robert C. Williamson. A theory of learning with corrupted labels. Journal of Machine Learning Research, 18:1–50, 2018.
Vikas Verma, Alex Lamb, Juho Kannala, Yoshua Bengio, and David Lopez-Paz. Interpolation consistency training for semi-supervised learning. In IJCAI, 2019.
Stefan Wager, Sida Wang, and Percy Liang. Dropout training as adaptive regularization. In NeurIPS, 2013.
Roman Werpachowski, András György, and Csaba Szepesvári. Detecting overfitting via adversarial examples. In NeurIPS, 2019.
Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.
Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. In ICLR, 2017.
Hongyi Zhang, Moustapha Cisse, Yann N. Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. In ICLR, 2018.
Tong Zhang. Statistical behavior and consistency of classification methods based on convex risk minimization. The Annals of Statistics, 32:56–85, 2004.


A Proof of Theorem

Proof. If the flooding level is $b$, then the proposed flooding estimator is
$$\widetilde{R}(g) = |\widehat{R}(g) - b| + b. \qquad (12)$$
Since the absolute operator can be expressed with a max operator as $\max(a, b) = \frac{a + b + |a - b|}{2}$, the proposed estimator can be re-expressed as
$$\widetilde{R}(g) = 2\max(\widehat{R}(g), b) - \widehat{R}(g) = A - \widehat{R}(g). \qquad (13)$$
For convenience, we used $A = 2\max(\widehat{R}(g), b)$. From the definition of MSE,
$$\mathrm{MSE}(\widehat{R}(g)) = \mathbb{E}[(\widehat{R}(g) - R(g))^2], \qquad (14)$$
$$\mathrm{MSE}(\widetilde{R}(g)) = \mathbb{E}[(\widetilde{R}(g) - R(g))^2] \qquad (15)$$
$$= \mathbb{E}[(A - \widehat{R}(g) - R(g))^2] \qquad (16)$$
$$= \mathbb{E}[A^2] - 2\,\mathbb{E}[A(\widehat{R}(g) + R(g))] + \mathbb{E}[(\widehat{R}(g) + R(g))^2]. \qquad (17)$$
We are interested in the sign of
$$\mathrm{MSE}(\widehat{R}(g)) - \mathrm{MSE}(\widetilde{R}(g)) = \mathbb{E}[-4\widehat{R}(g)R(g) - A^2 + 2A(\widehat{R}(g) + R(g))]. \qquad (18)$$
Define the inside of the expectation as $B = -4\widehat{R}(g)R(g) - A^2 + 2A(\widehat{R}(g) + R(g))$. $B$ can be divided into two cases, depending on the outcome of the max operator:
$$B = \begin{cases} -4\widehat{R}(g)R(g) - 4\widehat{R}(g)^2 + 4\widehat{R}(g)(\widehat{R}(g) + R(g)) & \text{if } \widehat{R}(g) \ge b, \\ -4\widehat{R}(g)R(g) - 4b^2 + 4b(\widehat{R}(g) + R(g)) & \text{if } \widehat{R}(g) < b \end{cases} \qquad (19)$$
$$= \begin{cases} 0 & \text{if } \widehat{R}(g) \ge b, \\ -4\widehat{R}(g)R(g) - 4b^2 + 4b(\widehat{R}(g) + R(g)) & \text{if } \widehat{R}(g) < b \end{cases} \qquad (20)$$
$$= \begin{cases} 0 & \text{if } \widehat{R}(g) \ge b, \\ -4(b - \widehat{R}(g))(b - R(g)) & \text{if } \widehat{R}(g) < b. \end{cases} \qquad (21)$$
The latter case becomes positive when $\widehat{R}(g) < b < R(g)$. Therefore, when $\widehat{R}(g) < b < R(g)$,
$$\mathrm{MSE}(\widehat{R}(g)) - \mathrm{MSE}(\widetilde{R}(g)) > 0, \qquad (22)$$
$$\mathrm{MSE}(\widehat{R}(g)) > \mathrm{MSE}(\widetilde{R}(g)). \qquad (23)$$
When $b \le \widehat{R}(g)$,
$$\mathrm{MSE}(\widehat{R}(g)) - \mathrm{MSE}(\widetilde{R}(g)) = 0, \qquad (24)$$
$$\mathrm{MSE}(\widehat{R}(g)) = \mathrm{MSE}(\widetilde{R}(g)). \qquad (25)$$


B Bayes Risk for Gaussian Distributions

In this section, we explain in detail how we derived the Bayes risk with respect to the surrogate loss in the experiments with Gaussian data in Section 4.1. Since we are using the logistic loss in the synthetic experiments, the loss of the margin is
$$\ell(y g(x)) = \log(1 + \exp(-y g(x))), \qquad (26)$$
where $g(x) : \mathbb{R}^d \to \mathbb{R}$ is a scalar, instead of the vector definition that was used previously, because the synthetic experiments only consider binary classification. Take the derivative to obtain
$$\frac{\partial \mathbb{E}[\ell(y g(x))]}{\partial g(x)} = \frac{\partial \mathbb{E}[\log(1 + \exp(-y g(x)))]}{\partial g(x)} = \mathbb{E}\!\left[\frac{-y \exp(-y g(x))}{1 + \exp(-y g(x))} \,\Big|\, x\right] p(x) \qquad (27)$$
$$= \mathbb{E}\!\left[\frac{-y}{1 + \exp(y g(x))} \,\Big|\, x\right] p(x) \qquad (28)$$
$$= \mathbb{E}\!\left[\frac{y + 1}{2} \cdot \frac{-1}{\exp(g(x)) + 1} + \frac{y - 1}{2} \cdot \frac{-1}{\exp(-g(x)) + 1} \,\Big|\, x\right] p(x) \qquad (29)$$
$$= \mathbb{E}\!\left[\frac{y + 1}{2} \,\Big|\, x\right] \frac{-1}{\exp(g(x)) + 1}\, p(x) + \mathbb{E}\!\left[\frac{y - 1}{2} \,\Big|\, x\right] \frac{-1}{\exp(-g(x)) + 1}\, p(x) \qquad (30)$$
$$= -\,p(y = +1 \mid x)\, \frac{1}{\exp(g(x)) + 1}\, p(x) + p(y = -1 \mid x)\, \frac{1}{\exp(-g(x)) + 1}\, p(x). \qquad (31)$$
Setting this to zero and dividing by $p(x) > 0$, we obtain
$$p(y = -1 \mid x)\, \frac{1}{\exp(-g(x)) + 1} = p(y = +1 \mid x)\, \frac{1}{\exp(g(x)) + 1}, \qquad (32)$$
$$\exp(g(x)) = \frac{p(y = +1 \mid x)}{p(y = -1 \mid x)}, \qquad (33)$$
$$g(x) = \log \frac{p(y = +1 \mid x)}{p(y = -1 \mid x)}. \qquad (34)$$
Since we are interested in the surrogate loss under this classifier, we plug this into the logistic loss to obtain the Bayes risk
$$\mathbb{E}[\ell(y g(x))] = \mathbb{E}\!\left[\log\!\left(1 + \frac{p(-y \mid x)}{p(y \mid x)}\right)\right] = \mathbb{E}\!\left[\log \frac{1}{p(y \mid x)}\right] = \mathbb{E}[-\log p(y \mid x)]. \qquad (35)$$
In the experiments in Section 4.1, we report the empirical version of this with the test dataset as the Bayes risk.
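As a numerical sanity check of Eq. (35), the Bayes risk for the Two Gaussians setup of Section 4.1 can be estimated by Monte Carlo. The sketch below is our own illustration and assumes the equal-prior, identity-covariance setting described there.

    import numpy as np

    def bayes_risk_two_gaussians(m, d=10, n=200_000, seed=0):
        rng = np.random.default_rng(seed)
        mu_p, mu_n = np.zeros(d), np.full(d, float(m))
        y = rng.integers(0, 2, n)                              # 1 = positive, 0 = negative
        x = rng.standard_normal((n, d)) + np.where(y[:, None] == 1, mu_p, mu_n)
        # Log-odds of the positive class under equal priors and identity covariance.
        log_odds = x @ (mu_p - mu_n) + 0.5 * (mu_n @ mu_n - mu_p @ mu_p)
        p_pos = 1.0 / (1.0 + np.exp(-log_odds))
        p_label = np.where(y == 1, p_pos, 1.0 - p_pos)         # p(y | x) of the drawn label
        return float(np.mean(-np.log(p_label)))

    print(bayes_risk_two_gaussians(1.0))   # should be close to the reported 0.14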


C Details of Experiments

C.1 Benchmark Datasets

In the experiments in Section 4.2, we use six benchmark datasets, explained below.

• MNIST6 [Lecun et al., 1998] is a 10 class dataset of handwritten digits: 1, 2, ..., 9 and 0. Each sample is a 28 × 28 grayscale image. The number of training and test samples are 60,000 and 10,000, respectively.

• Fashion-MNIST7 [Xiao et al., 2017] is a 10 class dataset of fashion items: T-shirt/top, Trouser, Pullover, Dress, Coat, Sandal, Shirt, Sneaker, Bag, and Ankle boot. Each sample is a 28 × 28 grayscale image. The number of training and test samples are 60,000 and 10,000, respectively.

• Kuzushiji-MNIST8 [Clanuwat et al., 2018] is a 10 class dataset of cursive Japanese ("Kuzushiji") characters. Each sample is a 28 × 28 grayscale image. The number of training and test samples are 60,000 and 10,000, respectively.

• CIFAR-109 is a 10 class dataset of various objects: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck. Each sample is a colored image in 32 × 32 × 3 RGB format. It is a subset of the 80 million tiny images dataset [Torralba et al., 2008]. There are 6,000 images per class, where 5,000 are for training and 1,000 are for test.

• CIFAR-10010 is a 100 class dataset of various objects. Each class has 600 samples, where 500 samples are for training and 100 samples are for test. This is also a subset of the 80 million tiny images dataset [Torralba et al., 2008].

• SVHN11 [Netzer et al., 2011] is a 10 class dataset of house numbers from Google Street View images, in 32 × 32 × 3 RGB format. 73,257 digits are for training and 26,032 digits are for testing.

6 http://yann.lecun.com/exdb/mnist/
7 https://github.com/zalandoresearch/fashion-mnist
8 https://github.com/rois-codh/kmnist
9 https://www.cs.toronto.edu/~kriz/cifar.html
10 https://www.cs.toronto.edu/~kriz/cifar.html
11 http://ufldl.stanford.edu/housenumbers


C.2 Chosen Flooding Levels

In Table 4, we report the chosen flooding levels for our experiments with benchmark datasets.

Table 4: The chosen flooding levels for benchmark experiments.

Dataset (split) | E:× W:× F:✓ | E:✓ W:× F:✓ | E:× W:✓ F:✓ | E:✓ W:✓ F:✓
MNIST (0.8) | 0.02 | 0.02 | 0.03 | 0.02
MNIST (0.4) | 0.02 | 0.03 | 0.00 | 0.02
Fashion-MNIST (0.8) | 0.00 | 0.00 | — | —
Fashion-MNIST (0.4) | 0.00 | 0.00 | — | —
Kuzushiji-MNIST (0.8) | 0.01 | 0.03 | 0.03 | 0.03
Kuzushiji-MNIST (0.4) | 0.03 | 0.02 | 0.04 | 0.03
CIFAR-10 (0.8) | 0.04 | 0.04 | 0.00 | 0.01
CIFAR-10 (0.4) | 0.03 | 0.03 | 0.00 | 0.00
CIFAR-100 (0.8) | 0.02 | 0.02 | 0.00 | 0.00
CIFAR-100 (0.4) | 0.03 | 0.03 | 0.00 | 0.00
SVHN (0.8) | 0.01 | 0.01 | 0.00 | 0.02
SVHN (0.4) | 0.01 | 0.09 | 0.03 | 0.03

D Learning Curves

In Figure 4, we visualize learning curves for all datasets (including CIFAR-10, which we already visualized in Figure 2). We only show the learning curves for training/validation proportion of 0.8, since the results for 0.4 were similar to those for 0.8. Note that Figure 1(c) shows the learning curves for the first 80 epochs for CIFAR-10 without flooding. Figure 1(d) shows the learning curves with flooding, when the flooding level is 0.18.


[Figure 4: Learning curves of training and test loss. The first figure in each row is the learning curves without flooding; the 2nd, 3rd, and 4th columns show the results with different flooding levels, increasing towards the right-hand side. Rows: MNIST (0.00, 0.01, 0.02, 0.03), Fashion-MNIST (0.00, 0.02, 0.05, 0.10), Kuzushiji-MNIST (0.00, 0.02, 0.05, 0.10), CIFAR-10 (0.00, 0.03, 0.07, 0.20), CIFAR-100 (0.00, 0.01, 0.06, 0.20), and SVHN (0.00, 0.01, 0.06, 0.20).]

E Relationship between Performance and Gradients

Settings. We visualize the relationship between test performance (loss or accuracy) and gradient amplitude of the training/test loss in Figure 5, where the gradient amplitude is the ℓ2 norm of the filter-normalized gradient of the loss. The filter-normalized gradient is the gradient appropriately scaled depending on the magnitude of the corresponding convolutional filter, similarly to Li et al. [2018]. More specifically, for each filter of every convolutional layer, we multiply the corresponding elements of the gradient by the norm of the filter. Note that a fully connected layer is a special case of convolutional layer and subject to this scaling. We exclude Fashion-MNIST because the optimal flooding level was zero. We used the results with training/validation split ratio of 0.8.
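One possible PyTorch-style implementation of this filter-wise scaling is sketched below. It is our own reading of the description above (and of the normalization in Li et al. [2018]), not the authors' code; it treats each output-channel slice of a weight tensor as one filter, and should be called after backward() has populated the gradients.

    import torch

    def filter_normalized_grad_norm(model):
        squared_total = 0.0
        for p in model.parameters():
            if p.grad is None:
                continue
            if p.dim() >= 2:   # convolutional / fully connected weights
                w = p.detach().view(p.shape[0], -1)
                g = p.grad.detach().view(p.shape[0], -1)
                scaled = g * w.norm(dim=1, keepdim=True)   # multiply by the filter norm
            else:              # biases and other 1-d parameters (simplification)
                scaled = p.grad.detach() * p.detach().norm()
            squared_total += float((scaled ** 2).sum())
        return squared_total ** 0.5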

Results. For the figures with gradient amplitude of training loss on the horizontal axis, the marks for the method with flooding are often plotted to the right of the "+" marks for the method without flooding, which implies that flooding prevents the model from staying at a local minimum. For the figures with gradient amplitude of test loss on the horizontal axis, we can observe that the method with flooding improves performance while the gradient amplitude becomes smaller. On the other hand, the performance of the method without flooding ("+") degenerates while the gradient amplitude of the test loss keeps increasing.


[Figure 5: Relationship between test performance (loss or accuracy) and amplitude of gradient (with training or test loss). Each point in the figures corresponds to a single model at a certain epoch. We remove the first 5 epochs and plot the rest. One marker is used for the method with flooding and "+" is used for the method without flooding; the large black markers show the epochs with early stopping. The color becomes lighter (purple → yellow) as the training proceeds. K-M, C10, and C100 stand for Kuzushiji-MNIST, CIFAR-10, and CIFAR-100. Panels cover MNIST, Kuzushiji-MNIST, CIFAR-10, CIFAR-100, and SVHN, with the horizontal axis showing the gradient amplitude of the training or test loss and the vertical axis showing the test loss or test accuracy.]


F Flatness

Settings. We follow Li et al. [2018] and give a one-dimensional visualization of flatness for each dataset in Figure 6. We exclude Fashion-MNIST because the optimal flooding level was zero. We used the results with training/validation split ratio of 0.8. We compare the flatness of the model right after the empirical risk with respect to a mini-batch becomes smaller than the flooding level, $\widehat{R}_m(g) < b$, for the first time (dotted blue line) and the model after training (solid blue line). We also compare them with the model trained by the baseline method without flooding after training is finished (solid red line).

Results. According to Figure 6, the test loss becomes lower and more flat during the training with flooding. Note that the training loss, on the other hand, continues to float around the flooding level until the end of training after it enters the flooding zone. We expect that the model makes a random walk and escapes regions with sharp loss landscapes during this period. This may be a possible reason for better generalization results with our proposed method.
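The one-dimensional visualization itself can be reproduced along the lines of Li et al. [2018]: perturb the trained parameters along one direction and record the loss. The sketch below is our own simplified version; the random filter-normalized direction and the perturbation range are assumptions.

    import torch

    def loss_along_direction(model, loss_fn, inputs, targets, alphas):
        params = list(model.parameters())
        originals = [p.detach().clone() for p in params]
        # Random direction, rescaled filter-wise to match the parameter norms.
        direction = [torch.randn_like(p) for p in params]
        for d, p in zip(direction, params):
            if p.dim() >= 2:
                d2, p2 = d.view(d.shape[0], -1), p.detach().view(p.shape[0], -1)
                d2.mul_(p2.norm(dim=1, keepdim=True) / (d2.norm(dim=1, keepdim=True) + 1e-10))
        losses = []
        with torch.no_grad():
            for alpha in alphas:            # e.g. torch.linspace(-1.0, 1.0, 51)
                for p, p0, d in zip(params, originals, direction):
                    p.copy_(p0 + alpha * d)
                losses.append(float(loss_fn(model(inputs), targets)))
            for p, p0 in zip(params, originals):   # restore the trained weights
                p.copy_(p0)
        return losses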


[Figure 6: One-dimensional visualization of flatness. We visualize the training/test loss with respect to perturbation. We depict the results for 3 models: the model when the empirical risk with respect to training data is below the flooding level for the first time during training (dotted blue), the model at the end of training with flooding (solid blue), and the model at the end of training without flooding (solid red). Panels: MNIST, Kuzushiji-MNIST, CIFAR-10, CIFAR-100, and SVHN, each with a train and a test panel.]


Page 4: Do We Need Zero Training Loss A›er Achieving Zero Training … · 2020. 2. 21. · directly aim to avoid zero training loss, they o›en fail to maintain a moderate level of training

aer ooding we also observe that even aer we avoid zero training loss memorization with zerotraining error still takes place

2 BackgroundsIn this section we review regularization methods (summarized in Table 1) recent works onoverparametrization and double descent curves and the area of weakly supervised learning wheresimilar techniques to ooding has been explored

21 Regularization Methodse name ldquoregularizationrdquo dates back to at least Tikhonov regularization for the ill-posed linearleast-squares problem [Tikhonov and Arsenin 1977 Tikhonov 1943] One example is to modifyXgtX (where X is the design matrix) to become ldquoregularrdquo by adding a term to the objectivefunction `2 regularization is a generalization of the above example and can be applied to non-linear models ese methods implicitly assume that the optimal model parameters are close tozero

It is known that weight decay [Hanson and Pra 1988] dropout [Srivastava et al 2014] andearly stopping [Morgan and Bourlard 1990] are equivalent to `2 regularization under certainconditions [Bishop 1995 Goodfellow et al 2016 Loshchilov and Huer 2019 Wager et al 2013]implying that there is a similar assumption on the optimal model parameters ere are otherpenalties based on dierent assumptions such as the `1 regularization [Tibshirani 1996] based onthe sparsity assumption that the optimal model has only a few non-zero parameters

Modern machine learning tasks are applied to complex problems where the optimal modelparameters are not necessarily close to zero or may not be sparse and it would be ideal if wecan properly add regularization eects to the optimization stage without such assumptions Ourproposed method does not have assumptions on the optimal model parameters and can be usefulfor more complex problems

More recently ldquoregularizationrdquo has further evolved to a more general meaning includingvarious methods that alleviate overing but do not necessarily have a step to regularize asingular matrix or add a regularization term to the objective function For example Goodfellowet al [2016] denes regularization as ldquoany modication we make to a learning algorithm that isintended to reduce its generalization error but not its training errorrdquo In this paper we adopt thisbroader meaning of ldquoregularizationrdquo

Examples of the more general regularization category include mixup [Zhang et al 2018]and data augmentation methods like cropping and ipping or adjusting brightness or sharpness[Shorten and Khoshgoaar 2019] ese methods have been adopted in many papers to obtainstate-of-the-art performance [Berthelot et al 2019 Verma et al 2019] and are becoming essentialregularization tools for developing new systems However these regularization methods havethe drawback of being domain-specic ey are designed for the vision domain and requiresome eorts when applying to other domains [Guo et al 2019 ulasidasan et al 2019] Otherregularizers such as label smoothing [Szegedy et al 2016] is used for problems with class labels

4

Table 1 Conceptual comparisons of various regularizers ldquotrrdquo stands for ldquotrainingrdquo ldquoIndeprdquo stands forldquoindependentrdquo X stands for yes and times stands for no

Regularization and other methods Targettr loss

Domainindep

Taskindep

Modelindep Main assumption

`2 regularization [Tikhonov 1943] times X X X Optimal model params are close to 0Weight decay [Hanson and Pra 1988] times X X X Optimal model params are close to 0Early stopping [Morgan and Bourlard 1990] times X X X Overing occurs in later epochs`1 regularization [Tibshirani 1996] times X X X Optimal model has to be sparseDropout [Srivastava et al 2014] times X X times Weight scaling inference ruleBatch normalization [Ioe and Szegedy 2015] times X X times Existence of internal covariate shiLabel smoothing [Szegedy et al 2016] times X times X True posterior is not a one-hot vectorMixup [Zhang et al 2018] times times times X Linear relationship between x and yImage augment [Shorten and Khoshgoaar 2019] times times X X Input is invariant to the translations

Flooding (proposed method) X X X X Learning until zero loss is harmful

and harder to use with regression or ranking meaning they are task-specic Batch normalization[Ioe and Szegedy 2015] and dropout [Srivastava et al 2014] are designed for neural networksand are model-specic

Although these regularization methodsmdashboth the special and general onesmdashalready work wellin practice and have become the de facto standard tools [Bishop 2011 Goodfellow et al 2016]we provide an alternative which is even more general in the sense that it is domain- task- andmodel-independent

at being said we want to emphasize that the most important dierence between oodingand other regularization methods is whether it is possible to target a specic level of training lossother than zero While ooding allows the user to choose the level of training loss directly it ishard to achieve this with other regularizers

22 Double Descent Curves with OverparametrizationRecently there has been increasing aention on the phenomenon of ldquodouble descentrdquo named byBelkin et al [2019] to explain the two regimes of deep learning e rst one (underparametrizedregime) occurs where the model complexity is small compared to the number of training samplesand the test error as a function of model complexity decreases with low model complexity butstarts to increase aer the model complexity is large enough is follows the classical viewof machine learning that excessive complexity leads to poor generalization e second one(overparametrized regime) occurs when an even larger model complexity is considered enincreasing the complexity only decreases test error which leads to a double descent shape ephase of decreasing test error oen occurs aer the training error becomes zero is follows themodern view of machine learning that bigger models lead to beer generalization3

As far as we know the discovery of double descent curves dates back to at least Krogh and3httpswwwefforgaimetrics

5

Hertz [1992] where they theoretically showed the double descent phenomenon under a linearregression setup Recent works [Belkin et al 2019 Nakkiran et al 2020] have shown empiricallythat a similar phenomenon can be observed with deep learning methods Nakkiran et al [2020]observed that the double descent curves can be shown not only as a function of model complexitybut also as a function of the epoch number

We want to note a byproduct of our ooding method We were able to produce the epoch-wisedouble descent curve for the test loss with about 100 epochs Investigating the connection betweenour accelerated double descent curves and previous double descent curves [Belkin et al 2019Krogh and Hertz 1992 Nakkiran et al 2020] is out of the scope of this paper but is an importantfuture direction

23 Lower-Bounding the Empirical RiskLower-bounding the empirical risk has been used in the area of weakly supervised learning erewere a common phenomenon where the empirical risk goes below zero [Kiryo et al 2017] whenan equivalent form of the risk expressed with the given weak supervision was alternatively used[Cid-Sueiro et al 2014 du Plessis et al 2014 2015 Natarajan et al 2013 Patrini et al 2017 vanRooyen and Williamson 2018] A gradient ascent technique was used to force the empirical riskto become non-negative in Kiryo et al [2017] is idea has been generalized and applied to otherweakly supervised seings [Han et al 2018 Ishida et al 2019 Lu et al 2020]

Although we also set a lower bound on the empirical risk the motivation is dierent Firstwhile Kiryo et al [2017] and others aim to x the negative empirical risk to become lower boundedby zero our empirical risk already has a lower bound of zero Instead we are aiming to sink theoriginal empirical risk by placing a positive lower bound Second the problem seings are dierentWeakly supervised learning methods require certain loss corrections or sample corrections [Hanet al 2018] before the non-negative correction but we work on the original empirical risk withoutany seing-specic modications

3 Flooding How to Avoid Zero Training LossIn this section we propose our regularization method ooding Note that this section and thefollowing sections only consider multi-class classication for simplicity

31 PreliminariesConsider input variable x isin Rd and output variable y isin 1 K where K is the number ofclasses ey follow an unknown joint probability distribution with density p(x y) We denotethe score function by g Rd rarr RK For any test data point x0 our prediction of the output labelwill be given by y0 = argmaxzisin1K gz(x0) where gz(middot) is the z-th element of g(middot) and incase of a tie argmax returns the largest argument Let ` RK times1 2 K rarr R denote a loss

6

function ` can be the zero-one loss

`01(v zprime) =

0 if argmaxzisin1K vz = zprime

1 otherwise(2)

where v = (v1 vK)gt isin RK or a surrogate loss such as the somax cross-entropy loss

`CE(v zprime) = minus log

exp(vzprime)sumzisin1K exp(vz)

(3)

For a surrogate loss ` we denote the classication risk by

R(g) = Ep(xy)[`(g(x) y)] (4)

where Ep(xy)[middot] is the expectation over (x y) sim p(x y) We use R01(g) to denote Eq (4) when` = `01 and call it the classication error

e goal of multi-class classication is to learn g that minimizes the classication errorR01(g)In optimization we consider the minimization of the risk with a almost surely dierentiablesurrogate loss R(g) instead to make the problem more tractable Furthermore since p(x y) isusually unknown and there is no way to exactly evaluate R(g) we minimize its empirical versioncalculated from the training data instead

R(g) =1

n

nsumi=1

`(g(xi) yi) (5)

where (xi yi)ni=1 are iid sampled from p(x y) We call R the empirical riskWe would like to clarify some of the undened terms used in the title and the introduction

e ldquotraintest lossrdquo is the empirical risk with respect to the surrogate loss function ` over thetrainingtest data respectively We refer to the ldquotrainingtest errorrdquo as the empirical risk withrespect to `01 over the trainingtest data respectively (which is equal to one minus accuracy)[Zhang 2004]

Finally we formally dene the Bayes risk as

Rlowast = infhR(h) (6)

where the inmum is taken over all vector-valued functions h Rd rarr RK e Bayes risk is oenreferred to as the Bayes error if the zero-one loss is used

infhR01(h) (7)

32 Algorithm

With exible models R(g) wrt a surrogate loss can easily become small if not zero as wementioned in Section 1 see [C] in Fig 1(a) We propose a method that ldquooods the boom areaand sinks the original empirical riskrdquo as in Fig 1(b) so that the empirical risk cannot go belowthe ooding level More technically if we denote the ooding level as b our proposed trainingobjective with ooding is a simple x

7

Denition 1 e ooded empirical risk is dened as4

R(g) = |R(g)minus b|+ b (8)

Note that when b = 0 then R(g) = R(g) e gradient of R(g) wrt model parameters willpoint to the same direction as that of R(g) when R(g) gt b but in the opposite direction whenR(g) lt b is means that when the learning objective is above the ooding level we performgradient descent as usual (gravity zone) but when the learning objective is below the oodinglevel we perform gradient ascent instead (buoyancy zone)

e issue is that in general we seldom know the optimal ooding level in advance isissue can be mitigated by searching for the optimal ooding level blowast with a hyper-parameteroptimization technique In practice we can search for the optimal ooding level by performingthe exhaustive search in parallel

33 ImplementationFor large scale problems we can employ mini-batched stochastic optimization for ecient com-putation Suppose that we have M disjoint mini-batch splits We denote the empirical risk (5)with respect to the m-th mini-batch by Rm(g) for m isin 1 M en our mini-batchedoptimization performs gradient descent updates in the direction of the gradient of Rm(g) By theconvexity of the absolute value function and Jensenrsquos inequality we have

R(g) le 1

M

Msumm=1

(|Rm(g)minus b|+ b

) (9)

is indicates that mini-batched optimization will simply minimize an upper bound of the full-batch case with R(g)

34 eoretical AnalysisIn the following theorem we will show that the mean squared error (MSE) of the proposed riskestimator with ooding is smaller than that of the original risk estimator without ooding

eorem 1 Fix any measurable vector-valued function g If the ooding level b satises R(g) ltb lt R(g) we have

MSE(R(g)) gt MSE(R(g)) (10)

If b le R(g) we haveMSE(R(g)) = MSE(R(g)) (11)

A proof is given in Appendix A If we regard R(g) as the training loss and R(g) as the testloss we would want b to be between those two for the MSE to improve

4Strictly speaking Eq (1) is dierent from Eq (8) since Eq (1) can ignore constant terms of the original empiricalrisk We will refer to Eq (8) for the ooding operator for the rest of the paper

8

4 ExperimentsIn this section we show experimental results with synthetic and benchmark datasets e imple-mentation is based on PyTorch [Paszke et al 2019] and demo code will be available5 Experimentswere carried out with NVIDIA GeForce GTX 1080 Ti NVIDIA adro RTX 5000 and Intel XeonGold 6142

4.1 Synthetic Experiments

The aim of our synthetic experiments is to study the behavior of flooding with a controlled setup. We use three types of synthetic data, described below.

Two Gaussians Data: We perform binary classification with two 10-dimensional Gaussian distributions with identity covariance matrix and means $\mu_P = [0, 0, \ldots, 0]^\top$ and $\mu_N = [m, m, \ldots, m]^\top$, where $m \in \{0.8, 1.0\}$. The Bayes risks for $m = 1.0$ and $m = 0.8$ are 0.14 and 0.24, respectively, where proofs are shown in Appendix B. The training, validation, and test sample sizes are 25, 10,000, and 10,000 per class, respectively.

Sinusoid Data: The sinusoid data [Nakkiran et al., 2019] are generated as follows. We first draw input data points uniformly from the inside of a 2-dimensional ball of radius 1. Then we put class labels based on

$y = \mathrm{sign}(x^\top w + \sin(x^\top w')),$

where $w$ and $w'$ are any two 2-dimensional vectors such that $w \perp w'$. The training, validation, and test sample sizes are 100, 100, and 20,000, respectively.

Spiral Data: The spiral data [Sugiyama, 2015] are two-dimensional synthetic data. Let $\theta^+_1 = 0, \theta^+_2, \ldots, \theta^+_{n_+} = 4\pi$ be $n_+$ equally spaced points in the interval $[0, 4\pi]$, and $\theta^-_1 = 0, \theta^-_2, \ldots, \theta^-_{n_-} = 4\pi$ be $n_-$ equally spaced points in the interval $[0, 4\pi]$. Let the positive and negative input data points be

$x^+_{i_+} = \theta^+_{i_+} [\cos(\theta^+_{i_+}), \sin(\theta^+_{i_+})]^\top + \tau \nu^+_{i_+},$
$x^-_{i_-} = (\theta^-_{i_-} + \pi) [\cos(\theta^-_{i_-}), \sin(\theta^-_{i_-})]^\top + \tau \nu^-_{i_-},$

for $i_+ = 1, \ldots, n_+$ and $i_- = 1, \ldots, n_-$, where $\tau$ controls the magnitude of the noise and $\nu^+_i$ and $\nu^-_i$ are i.i.d. distributed according to the two-dimensional standard normal distribution. Then we make data for classification by $\{(x_i, y_i)\}_{i=1}^n = \{(x^+_{i_+}, +1)\}_{i_+=1}^{n_+} \cup \{(x^-_{i_-}, -1)\}_{i_-=1}^{n_-}$, where $n = n_+ + n_-$. The training, validation, and test sample sizes are 100, 100, and 10,000 per class, respectively.
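As a concrete illustration of the spiral construction above, a minimal NumPy sketch could look as follows (our own code, with the noise magnitude tau left as a parameter; the authors' exact generator may differ in details such as seeding):

import numpy as np

def make_spiral(n_pos, n_neg, tau, seed=0):
    rng = np.random.default_rng(seed)
    theta_p = np.linspace(0.0, 4 * np.pi, n_pos)
    theta_n = np.linspace(0.0, 4 * np.pi, n_neg)
    # Positive and negative spirals with additive Gaussian noise scaled by tau.
    x_pos = theta_p[:, None] * np.c_[np.cos(theta_p), np.sin(theta_p)] \
            + tau * rng.standard_normal((n_pos, 2))
    x_neg = (theta_n + np.pi)[:, None] * np.c_[np.cos(theta_n), np.sin(theta_n)] \
            + tau * rng.standard_normal((n_neg, 2))
    X = np.vstack([x_pos, x_neg])
    y = np.concatenate([np.ones(n_pos), -np.ones(n_neg)])
    return X, y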

For Two Gaussians, we use a one-hidden-layer feedforward neural network with 500 units in the hidden layer with the ReLU activation function [Nair and Hinton, 2010]. We train the network for 1000 epochs with the logistic loss and vanilla gradient descent with a learning rate of 0.05. The flooding level is chosen from $b \in \{0, 0.01, 0.02, \ldots, 0.40\}$.

^5 https://github.com/takashiishida/flooding

Table 2: Experimental results for the synthetic data. Sub-table (A) shows the results without early stopping. Sub-table (B) shows the results with early stopping. The better method is shown in bold in each of sub-tables (A) and (B). "BR" stands for the Bayes risk. For Two Gaussians, the distance between the positive and negative distributions is larger for larger m. See the description in Section 4.1 for the details.

                                     (A) Without Early Stopping          (B) With Early Stopping
Data            Setting              Without    With      Chosen        Without    With      Chosen
                                     Flooding   Flooding  b             Flooding   Flooding  b
Two Gaussians   m: 1.0, BR: 0.14     87.96      92.25     0.28          91.63      92.25     0.27
Two Gaussians   m: 0.8, BR: 0.24     82.00      87.31     0.33          86.57      87.29     0.35
Sinusoid        Label Noise 0.01     93.84      94.46     0.01          92.54      92.54     0.00
Sinusoid        Label Noise 0.05     91.12      95.44     0.10          93.26      94.60     0.01
Sinusoid        Label Noise 0.10     86.57      96.02     0.17          96.70      96.70     0.00
Spiral          Label Noise 0.01     98.96      97.85     0.01          98.60      98.88     0.01
Spiral          Label Noise 0.05     93.87      96.24     0.04          96.58      95.62     0.14
Spiral          Label Noise 0.10     89.70      92.96     0.16          89.70      92.96     0.16

For Sinusoid and Spiral, we use a four-hidden-layer feedforward neural network with 500 units in each hidden layer, with the ReLU activation function [Nair and Hinton, 2010] and batch normalization [Ioffe and Szegedy, 2015]. We train the network for 500 epochs with the logistic loss and the Adam optimizer [Kingma and Ba, 2015] with a mini-batch size of 100 and a learning rate of 0.001. The flooding level is chosen from $b \in \{0, 0.01, 0.02, \ldots, 0.20\}$. Note that training with $b = 0$ is identical to the baseline method without flooding. We report the test accuracy of the flooding level with the best validation accuracy. We first conduct experiments without early stopping, which means that the last epoch was chosen for all flooding levels.

Results: The results are summarized in Table 2. It is worth noting that for Two Gaussians, the chosen flooding level b is larger for the smaller distance between the two distributions, which is when the classification task is harder and the Bayes risk becomes larger, since the two distributions become less separated. We see similar tendencies for the Sinusoid and Spiral data: a larger b was chosen for a larger flipping probability for label noise, which is expected to increase the Bayes risk. This implies a positive correlation between the optimal flooding level and the Bayes risk, as is also partially suggested by Theorem 1. Another interesting observation is that the chosen b is close to, but higher than, the Bayes risk for the Two Gaussians data. This may look inconsistent with Theorem 1. However, it makes sense to adopt a larger b with a stronger regularization effect that allows some bias as a trade-off for reducing the variance of the risk estimator. In fact, Theorem 1 does not deny the possibility that some $b \ge R(g)$ achieves even better estimation.

From (A) in Table 2, we can see that the method with flooding often improves test accuracy over the baseline method without flooding. As we mentioned in the introduction, it can be harmful to keep training a model until the end without flooding. However, with flooding, the model at the final epoch has good prediction performance according to the results, which implies that flooding helps the late-stage training improve test accuracy.

Table 3: Results with benchmark datasets. We report classification accuracy for all combinations of weight decay (✓ and ✗), early stopping (✓ and ✗), and flooding (✓ and ✗). The second column shows the training/validation split used for the experiment. W stands for weight decay, E stands for early stopping, and F stands for flooding. "—" means that a flooding level of zero was optimal. "N/A" means that we skipped the experiments because zero weight decay was optimal in the case without flooding. The best and equivalent are shown in bold by comparing "with flooding" and "without flooding" for two columns with the same setting for W and E, e.g., the first and fifth columns out of the 8 columns. The best performing combination is highlighted.

Dataset          tr/va   W:  ✗      ✗      ✓      ✓      ✗      ✗      ✓      ✓
                 split   E:  ✗      ✓      ✗      ✓      ✗      ✓      ✗      ✓
                         F:  ✗      ✗      ✗      ✗      ✓      ✓      ✓      ✓
MNIST            0.8         98.32  98.30  98.51  98.42  98.46  98.53  98.50  98.48
                 0.4         97.71  97.70  97.82  97.91  97.74  97.85  —      97.83
Fashion-MNIST    0.8         89.34  89.36  N/A    N/A    —      —      N/A    N/A
                 0.4         88.48  88.63  88.60  88.62  —      —      —      —
Kuzushiji-MNIST  0.8         91.63  91.62  91.63  91.71  92.40  92.12  92.11  91.97
                 0.4         89.18  89.18  89.58  89.73  90.41  90.15  89.71  89.88
CIFAR-10         0.8         73.59  73.36  73.65  73.57  73.06  73.44  —      74.41
                 0.4         66.39  66.63  69.31  69.28  67.20  67.58  —      —
CIFAR-100        0.8         42.16  42.33  42.67  42.45  42.50  42.36  —      —
                 0.4         34.27  34.34  37.97  38.82  34.99  35.14  —      —
SVHN             0.8         92.38  92.41  93.20  92.99  92.78  92.79  —      93.42
                 0.4         90.32  90.35  90.43  90.49  90.57  90.61  91.16  91.21

We also conducted experiments with early stopping, meaning that we chose the model that recorded the best validation accuracy during training. The results are reported in sub-table (B) of Table 2. Compared with sub-table (A), we see that early stopping improves the baseline method without flooding in many cases. This indicates that training longer without flooding was harmful in our experiments. On the other hand, the accuracy for flooding combined with early stopping is often close to that with early stopping, meaning that training until the end with flooding tends to be already as good as doing so with early stopping. The table shows that flooding often improves or retains the test accuracy of the baseline method without flooding even after deploying early stopping. Flooding does not hurt performance, but can be beneficial for methods used with early stopping.

[Figure 2: Learning curves of training and test loss for a training/validation proportion of 0.8 on CIFAR-10. Panel (a) shows the learning curves without flooding; panels (b), (c), and (d) show the learning curves with flooding levels 0.03, 0.07, and 0.20, respectively.]

4.2 Benchmark Experiments

We next perform experiments with benchmark datasets. Not only do we compare with the baseline without flooding, we also compare with, or combine with, other general regularization methods, namely early stopping and weight decay.

Settings: We use the following six benchmark datasets: MNIST, Fashion-MNIST, Kuzushiji-MNIST, CIFAR-10, CIFAR-100, and SVHN. The details of the benchmark datasets can be found in Appendix C.1. We split the original training dataset into training and validation data with different proportions: 0.8 or 0.4 (meaning 80% or 40% was used for training and the rest was used for validation, respectively). We perform an exhaustive hyper-parameter search for the flooding level with candidates from $\{0.00, 0.01, \ldots, 0.20\}$. The number of epochs is 500. Stochastic gradient descent [Robbins and Monro, 1951] is used with a learning rate of 0.1 and momentum of 0.9. For MNIST, Fashion-MNIST, and Kuzushiji-MNIST, we use a one-hidden-layer feedforward neural network with 500 units and the ReLU activation function [Nair and Hinton, 2010]. For CIFAR-10, CIFAR-100, and SVHN, we used ResNet-18 [He et al., 2016]. We do not use any data augmentation or manual learning rate decay. We deployed early stopping in the same way as in Section 4.1.

We first ran experiments with the following candidates for the weight decay rate: $\{1\times10^{-5}, 1\times10^{-4}, 4\times10^{-4}, 7\times10^{-4}, 1\times10^{-3}, 4\times10^{-3}, 7\times10^{-3}\}$. We chose the weight decay rate with the best validation accuracy for each dataset and each training/validation proportion. Then, fixing the weight decay to the chosen one, we ran experiments with flooding level candidates from $\{0, 0.01, \ldots, 0.20\}$, to investigate whether weight decay and flooding have complementary effects, or if adding weight decay diminishes the accuracy gain of flooding.
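As an illustration of how this exhaustive search over flooding levels could be organized (a minimal sketch with our own function names; train_fn and eval_fn are assumed to train a fresh model with a given flooding level and to return its validation accuracy, respectively):

def search_flooding_level(train_fn, eval_fn, candidates):
    # Train one model per candidate flooding level and keep the one with the best
    # validation accuracy; the runs are independent, so they can be executed in parallel.
    scores = {b: eval_fn(train_fn(b)) for b in candidates}
    best_b = max(scores, key=scores.get)
    return best_b, scores

# Example: candidates = [i / 100 for i in range(21)]   # {0.00, 0.01, ..., 0.20}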

Results: We show the results in Table 3 and the chosen flooding levels in Table 4 in Appendix C.2. We can observe that flooding gives better accuracy for most cases. We can also see that combining flooding with early stopping, or with both early stopping and weight decay, may lead to even better accuracy in some cases.

[Figure 3: The optimal flooding level maintains memorization. The vertical axis shows the training accuracy and the horizontal axis shows the flooding level. Panels: (a) without early stopping (0.8 split), (b) without early stopping (0.4 split), (c) with early stopping (0.8 split), (d) with early stopping (0.4 split). The marks are placed on the flooding level that was chosen based on validation accuracy.]

4.3 Memorization

Can we maintain memorization even after adding flooding? We investigate whether the trained model has zero training error (100% accuracy) for the flooding level that was chosen with validation data. We show the results for all benchmark datasets and all training/validation splits with proportions 0.8 and 0.4. We also show the case without early stopping (choosing the last epoch) and with early stopping (choosing the epoch with the highest validation accuracy). The results are shown in Fig. 3.

All figures show downward curves, implying that the model will eventually give up memorizing all training data as the flooding level becomes higher. A more interesting and important observation is the position of the optimal flooding level (the one chosen by validation accuracy, which is marked in each panel). We can observe that the marks are often plotted at zero training error, and in some cases the mark lies on the highest flooding level that maintains zero error. These results are consistent with recent empirical works that imply zero training error leads to lower generalization error [Belkin et al., 2019, Nakkiran et al., 2020], but we further demonstrate that zero training loss may be harmful under zero training error.

5 Conclusion

We proposed a novel regularization method called flooding that keeps the training loss floating around a small constant value, in order to avoid zero training loss. In our experiments, the optimal flooding level often maintained memorization of the training data with zero error. With flooding, we showed that the test accuracy improves for various benchmark datasets, and we theoretically showed that the mean squared error will be reduced under certain conditions.

As a byproduct, we were able to produce a double descent curve for the test loss with a relatively small number of epochs, e.g., in around 100 epochs, as shown in Fig. 2 and Fig. 4 in Appendix D. An important future direction is to study the relationship between this and the double descent curves from previous works [Belkin et al., 2019, Krogh and Hertz, 1992, Nakkiran et al., 2020].

It would also be interesting to see if Bayesian optimization [Shahriari et al., 2016, Snoek et al., 2012] methods can be utilized to search for the optimal flooding level efficiently. We will investigate this direction in the future.

Acknowledgements

We thank Chang Xu, Genki Yamamoto, Kento Nozawa, Nontawat Charoenphakdee, Voot Tangkaratt, and Yoshihiro Nagano for the helpful discussions. TI was supported by the Google PhD Fellowship Program. MS and IY were supported by JST CREST Grant Number JPMJCR18A2 including the AIP challenge program, Japan.

References

Devansh Arpit, Stanisław Jastrzebski, Nicolas Ballas, David Krueger, Emmanuel Bengio, Maxinder S. Kanwal, Tegan Maharaj, Asja Fischer, Aaron Courville, Yoshua Bengio, and Simon Lacoste-Julien. A closer look at memorization in deep networks. In ICML, 2017.
Mikhail Belkin, Daniel Hsu, and Partha P. Mitra. Overfitting or perfect fitting? Risk bounds for classification and regression rules that interpolate. In NeurIPS, 2018.
Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machine-learning practice and the classical bias-variance trade-off. PNAS, 116:15850–15854, 2019.
David Berthelot, Nicholas Carlini, Ian Goodfellow, Nicolas Papernot, Avital Oliver, and Colin A. Raffel. MixMatch: A holistic approach to semi-supervised learning. In NeurIPS, 2019.
Christopher M. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer, 2011.
Christopher Michael Bishop. Regularization and complexity control in feed-forward networks. In ICANN, 1995.
Rich Caruana, Steve Lawrence, and C. Lee Giles. Overfitting in neural nets: Backpropagation, conjugate gradient, and early stopping. In NeurIPS, 2000.
Pratik Chaudhari, Anna Choromanska, Stefano Soatto, Yann LeCun, Carlo Baldassi, Christian Borgs, Jennifer Chayes, Levent Sagun, and Riccardo Zecchina. Entropy-SGD: Biasing gradient descent into wide valleys. In ICLR, 2017.
Jesús Cid-Sueiro, Darío García-García, and Raúl Santos-Rodríguez. Consistency of losses for learning from weak labels. In ECML-PKDD, 2014.
Tarin Clanuwat, Mikel Bober-Irizar, Asanobu Kitamoto, Alex Lamb, Kazuaki Yamamoto, and David Ha. Deep learning for classical Japanese literature. In NeurIPS Workshop on Machine Learning for Creativity and Design, 2018.
Marthinus Christoffel du Plessis, Gang Niu, and Masashi Sugiyama. Analysis of learning from positive and unlabeled data. In NeurIPS, 2014.
Marthinus Christoffel du Plessis, Gang Niu, and Masashi Sugiyama. Convex formulation for learning from positive and unlabeled data. In ICML, 2015.
Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.
Hongyu Guo, Yongyi Mao, and Richong Zhang. Augmenting data with mixup for sentence classification: An empirical study. arXiv:1905.08941, 2019.
Bo Han, Gang Niu, Jiangchao Yao, Xingrui Yu, Miao Xu, Ivor Tsang, and Masashi Sugiyama. Pumpout: A meta approach to robust deep learning with noisy labels. arXiv:1809.11008, 2018.
Stephen Jose Hanson and Lorien Y. Pratt. Comparing biases for minimal network construction with back-propagation. In NeurIPS, 1988.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
Takashi Ishida, Gang Niu, Aditya Krishna Menon, and Masashi Sugiyama. Complementary-label learning for arbitrary losses and models. In ICML, 2019.
Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. In ICLR, 2017.
Diederik P. Kingma and Jimmy L. Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
Ryuichi Kiryo, Gang Niu, Marthinus Christoffel du Plessis, and Masashi Sugiyama. Positive-unlabeled learning with non-negative risk estimator. In NeurIPS, 2017.
Anders Krogh and John A. Hertz. A simple weight decay can improve generalization. In NeurIPS, 1992.
Yann Lecun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. In Proceedings of the IEEE, pages 2278–2324, 1998.
Hao Li, Zheng Xu, Gavin Taylor, Christoph Studer, and Tom Goldstein. Visualizing the loss landscape of neural nets. In NeurIPS, 2018.
Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In ICLR, 2019.
Nan Lu, Tianyi Zhang, Gang Niu, and Masashi Sugiyama. Mitigating overfitting in supervised classification from two unlabeled datasets: A consistent risk correction approach. In AISTATS, 2020.
N. Morgan and H. Bourlard. Generalization and parameter estimation in feedforward nets: Some experiments. In NeurIPS, 1990.
Vinod Nair and Geoffrey E. Hinton. Rectified linear units improve restricted Boltzmann machines. In ICML, 2010.
Preetum Nakkiran, Gal Kaplun, Dimitris Kalimeris, Tristan Yang, Benjamin L. Edelman, Fred Zhang, and Boaz Barak. SGD on neural networks learns functions of increasing complexity. In NeurIPS, 2019.
Preetum Nakkiran, Gal Kaplun, Yamini Bansal, Tristan Yang, Boaz Barak, and Ilya Sutskever. Deep double descent: Where bigger models and more data hurt. In ICLR, 2020.
Nagarajan Natarajan, Inderjit S. Dhillon, Pradeep K. Ravikumar, and Ambuj Tewari. Learning with noisy labels. In NeurIPS, 2013.
Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y. Ng. Reading digits in natural images with unsupervised feature learning. In NeurIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.
Andrew Y. Ng. Preventing "overfitting" of cross-validation data. In ICML, 1997.
Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An imperative style, high-performance deep learning library. In NeurIPS, 2019.
Giorgio Patrini, Alessandro Rozza, Aditya K. Menon, Richard Nock, and Lizhen Qu. Making deep neural networks robust to label noise: A loss correction approach. In CVPR, 2017.
Herbert Robbins and Sutton Monro. A stochastic approximation method. Annals of Mathematical Statistics, 22:400–407, 1951.
Rebecca Roelofs, Vaishaal Shankar, Benjamin Recht, Sara Fridovich-Keil, Moritz Hardt, John Miller, and Ludwig Schmidt. A meta-analysis of overfitting in machine learning. In NeurIPS, 2019.
Bobak Shahriari, Kevin Swersky, Ziyu Wang, Ryan P. Adams, and Nando de Freitas. Taking the human out of the loop: A review of Bayesian optimization. Proceedings of the IEEE, 104:148–175, 2016.
Connor Shorten and Taghi M. Khoshgoftaar. A survey on image data augmentation for deep learning. Journal of Big Data, 6, 2019.
Jasper Snoek, Hugo Larochelle, and Ryan P. Adams. Practical Bayesian optimization of machine learning algorithms. In NeurIPS, 2012.
Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15:1929–1958, 2014.
Masashi Sugiyama. Introduction to Statistical Machine Learning. Morgan Kaufmann, 2015.
Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, and Jon Shlens. Rethinking the inception architecture for computer vision. In CVPR, 2016.
Sunil Thulasidasan, Gopinath Chennupati, Jeff Bilmes, Tanmoy Bhattacharya, and Sarah Michalak. On mixup training: Improved calibration and predictive uncertainty for deep neural networks. In NeurIPS, 2019.
Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58:267–288, 1996.
A. N. Tikhonov and V. Y. Arsenin. Solutions of Ill Posed Problems. Winston, 1977.
Andrey Nikolayevich Tikhonov. On the stability of inverse problems. Doklady Akademii Nauk SSSR, 39:195–198, 1943.
Antonio Torralba, Rob Fergus, and William T. Freeman. 80 million tiny images: A large data set for nonparametric object and scene recognition. IEEE Trans. PAMI, 2008.
Brendan van Rooyen and Robert C. Williamson. A theory of learning with corrupted labels. Journal of Machine Learning Research, 18:1–50, 2018.
Vikas Verma, Alex Lamb, Juho Kannala, Yoshua Bengio, and David Lopez-Paz. Interpolation consistency training for semi-supervised learning. In IJCAI, 2019.
Stefan Wager, Sida Wang, and Percy Liang. Dropout training as adaptive regularization. In NeurIPS, 2013.
Roman Werpachowski, András György, and Csaba Szepesvári. Detecting overfitting via adversarial examples. In NeurIPS, 2019.
Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.
Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. In ICLR, 2017.
Hongyi Zhang, Moustapha Cisse, Yann N. Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. In ICLR, 2018.
Tong Zhang. Statistical behavior and consistency of classification methods based on convex risk minimization. The Annals of Statistics, 32:56–85, 2004.

A Proof of Theorem

Proof. If the flooding level is $b$, then the proposed flooding estimator is

$\widetilde{R}(g) = |\widehat{R}(g) - b| + b.$   (12)

Since the absolute value can be expressed with a max operator as $\max(a, b) = \frac{a + b + |a - b|}{2}$, the proposed estimator can be re-expressed as

$\widetilde{R}(g) = 2\max(\widehat{R}(g), b) - \widehat{R}(g) = A - \widehat{R}(g),$   (13)

where for convenience we used $A = 2\max(\widehat{R}(g), b)$. From the definition of the MSE,

$\mathrm{MSE}(\widehat{R}(g)) = \mathbb{E}[(\widehat{R}(g) - R(g))^2],$   (14)
$\mathrm{MSE}(\widetilde{R}(g)) = \mathbb{E}[(\widetilde{R}(g) - R(g))^2]$   (15)
$= \mathbb{E}[(A - \widehat{R}(g) - R(g))^2]$   (16)
$= \mathbb{E}[A^2] - 2\mathbb{E}[A(\widehat{R}(g) + R(g))] + \mathbb{E}[(\widehat{R}(g) + R(g))^2].$   (17)

We are interested in the sign of

$\mathrm{MSE}(\widehat{R}(g)) - \mathrm{MSE}(\widetilde{R}(g)) = \mathbb{E}[-4\widehat{R}(g)R(g) - A^2 + 2A(\widehat{R}(g) + R(g))].$   (18)

Define the inside of the expectation as $B = -4\widehat{R}(g)R(g) - A^2 + 2A(\widehat{R}(g) + R(g))$. $B$ can be divided into two cases, depending on the outcome of the max operator:

$B = \begin{cases} -4\widehat{R}(g)R(g) - 4\widehat{R}(g)^2 + 4\widehat{R}(g)(\widehat{R}(g) + R(g)) & \text{if } \widehat{R}(g) \ge b \\ -4\widehat{R}(g)R(g) - 4b^2 + 4b(\widehat{R}(g) + R(g)) & \text{if } \widehat{R}(g) < b \end{cases}$   (19)

$= \begin{cases} 0 & \text{if } \widehat{R}(g) \ge b \\ -4\widehat{R}(g)R(g) - 4b^2 + 4b(\widehat{R}(g) + R(g)) & \text{if } \widehat{R}(g) < b \end{cases}$   (20)

$= \begin{cases} 0 & \text{if } \widehat{R}(g) \ge b \\ -4(b - \widehat{R}(g))(b - R(g)) & \text{if } \widehat{R}(g) < b. \end{cases}$   (21)

The latter case becomes positive when $\widehat{R}(g) < b < R(g)$. Therefore, when $\widehat{R}(g) < b < R(g)$,

$\mathrm{MSE}(\widehat{R}(g)) - \mathrm{MSE}(\widetilde{R}(g)) > 0,$   (22)
$\mathrm{MSE}(\widehat{R}(g)) > \mathrm{MSE}(\widetilde{R}(g)).$   (23)

When $b \le \widehat{R}(g)$,

$\mathrm{MSE}(\widehat{R}(g)) - \mathrm{MSE}(\widetilde{R}(g)) = 0,$   (24)
$\mathrm{MSE}(\widehat{R}(g)) = \mathrm{MSE}(\widetilde{R}(g)).$   (25)

B Bayes Risk for Gaussian Distributions

In this section, we explain in detail how we derived the Bayes risk with respect to the surrogate loss in the experiments with Gaussian data in Section 4.1. Since we are using the logistic loss in the synthetic experiments, the loss in terms of the margin is

$\ell(y g(x)) = \log(1 + \exp(-y g(x))),$   (26)

where $g : \mathbb{R}^d \to \mathbb{R}$ outputs a scalar, instead of the vector-valued definition used previously, because the synthetic experiments only consider binary classification. Taking the derivative with respect to $g(x)$, we obtain

$\frac{\partial \mathbb{E}[\ell(y g(x))]}{\partial g(x)} = \frac{\partial \mathbb{E}[\log(1 + \exp(-y g(x)))]}{\partial g(x)} = \mathbb{E}\!\left[\frac{-y \exp(-y g(x))}{1 + \exp(-y g(x))} \,\Big|\, x\right] p(x)$   (27)
$= \mathbb{E}\!\left[\frac{-y}{1 + \exp(y g(x))} \,\Big|\, x\right] p(x)$   (28)
$= \mathbb{E}\!\left[\frac{y + 1}{2} \cdot \frac{-1}{\exp(g(x)) + 1} + \frac{y - 1}{2} \cdot \frac{-1}{\exp(-g(x)) + 1} \,\Big|\, x\right] p(x)$   (29)
$= \mathbb{E}\!\left[\frac{y + 1}{2} \,\Big|\, x\right] \frac{-1}{\exp(g(x)) + 1}\, p(x) + \mathbb{E}\!\left[\frac{y - 1}{2} \,\Big|\, x\right] \frac{-1}{\exp(-g(x)) + 1}\, p(x)$   (30)
$= -\,p(y = +1 \mid x)\, \frac{1}{\exp(g(x)) + 1}\, p(x) + p(y = -1 \mid x)\, \frac{1}{\exp(-g(x)) + 1}\, p(x).$   (31)

Setting this to zero and dividing by $p(x) > 0$, we obtain

$p(y = -1 \mid x)\, \frac{1}{\exp(-g(x)) + 1} = p(y = +1 \mid x)\, \frac{1}{\exp(g(x)) + 1},$   (32)
$\exp(g(x)) = \frac{p(y = +1 \mid x)}{p(y = -1 \mid x)},$   (33)
$g(x) = \log \frac{p(y = +1 \mid x)}{p(y = -1 \mid x)}.$   (34)

Since we are interested in the surrogate loss under this classifier, we plug this into the logistic loss to obtain the Bayes risk:

$\mathbb{E}[\ell(y g(x))] = \mathbb{E}\!\left[\log\!\left(1 + \frac{p(-y \mid x)}{p(y \mid x)}\right)\right] = \mathbb{E}\!\left[\log \frac{1}{p(y \mid x)}\right] = \mathbb{E}[-\log p(y \mid x)].$   (35)

In the experiments in Section 4.1, we report the empirical version of this quantity, computed with the test dataset, as the Bayes risk.
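For reference, a Monte Carlo sketch of this empirical computation for the Two Gaussians setup of Section 4.1 is given below (our own illustration; equal class priors and identity covariance are assumed, matching the experiment description):

import numpy as np

def approx_bayes_risk_two_gaussians(m, d=10, n=1_000_000, seed=0):
    rng = np.random.default_rng(seed)
    y = rng.choice([1, -1], size=n)                 # equal class priors
    mu = np.where(y[:, None] == 1, 0.0, m)          # mu_P = 0, mu_N = m * ones(d)
    x = mu + rng.standard_normal((n, d))
    # Bayes-optimal score for identity covariance: g(x) = log p(x | +1) / p(x | -1).
    g = -m * x.sum(axis=1) + d * m ** 2 / 2
    # Bayes risk w.r.t. the logistic loss: E[log(1 + exp(-y g(x)))] = E[-log p(y | x)].
    # For m = 1.0 and m = 0.8 this should be close to the values 0.14 and 0.24 in Section 4.1.
    return np.mean(np.logaddexp(0.0, -y * g))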


C Details of Experiments

C.1 Benchmark Datasets

In the experiments in Section 4.2, we use the six benchmark datasets explained below.

• MNIST^6 [Lecun et al., 1998] is a 10-class dataset of handwritten digits: 1, 2, ..., 9, and 0. Each sample is a 28x28 grayscale image. The numbers of training and test samples are 60,000 and 10,000, respectively.

• Fashion-MNIST^7 [Xiao et al., 2017] is a 10-class dataset of fashion items: T-shirt/top, Trouser, Pullover, Dress, Coat, Sandal, Shirt, Sneaker, Bag, and Ankle boot. Each sample is a 28x28 grayscale image. The numbers of training and test samples are 60,000 and 10,000, respectively.

• Kuzushiji-MNIST^8 [Clanuwat et al., 2018] is a 10-class dataset of cursive Japanese ("Kuzushiji") characters. Each sample is a 28x28 grayscale image. The numbers of training and test samples are 60,000 and 10,000, respectively.

• CIFAR-10^9 is a 10-class dataset of various objects: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck. Each sample is a colored image in 32x32x3 RGB format. It is a subset of the 80 million tiny images dataset [Torralba et al., 2008]. There are 6,000 images per class, where 5,000 are for training and 1,000 are for test.

• CIFAR-100^10 is a 100-class dataset of various objects. Each class has 600 samples, where 500 samples are for training and 100 samples are for test. This is also a subset of the 80 million tiny images dataset [Torralba et al., 2008].

• SVHN^11 [Netzer et al., 2011] is a 10-class dataset of house numbers from Google Street View images, in 32x32x3 RGB format. 73,257 digits are for training and 26,032 digits are for testing.

^6 http://yann.lecun.com/exdb/mnist/
^7 https://github.com/zalandoresearch/fashion-mnist
^8 https://github.com/rois-codh/kmnist
^9 https://www.cs.toronto.edu/~kriz/cifar.html
^10 https://www.cs.toronto.edu/~kriz/cifar.html
^11 http://ufldl.stanford.edu/housenumbers/

C.2 Chosen Flooding Levels

In Table 4, we report the chosen flooding levels for our experiments with benchmark datasets.

Table 4: The chosen flooding levels for the benchmark experiments.

                        Early stopping:  ✗      ✓      ✗      ✓
                        Weight decay:    ✗      ✗      ✓      ✓
                        Flooding:        ✓      ✓      ✓      ✓
MNIST (0.8)                              0.02   0.02   0.03   0.02
MNIST (0.4)                              0.02   0.03   0.00   0.02
Fashion-MNIST (0.8)                      0.00   0.00   —      —
Fashion-MNIST (0.4)                      0.00   0.00   —      —
Kuzushiji-MNIST (0.8)                    0.01   0.03   0.03   0.03
Kuzushiji-MNIST (0.4)                    0.03   0.02   0.04   0.03
CIFAR-10 (0.8)                           0.04   0.04   0.00   0.01
CIFAR-10 (0.4)                           0.03   0.03   0.00   0.00
CIFAR-100 (0.8)                          0.02   0.02   0.00   0.00
CIFAR-100 (0.4)                          0.03   0.03   0.00   0.00
SVHN (0.8)                               0.01   0.01   0.00   0.02
SVHN (0.4)                               0.01   0.09   0.03   0.03

D Learning Curves

In Figure 4, we visualize learning curves for all datasets (including CIFAR-10, which we already visualized in Figure 2). We only show the learning curves for the training/validation proportion of 0.8, since the results for 0.4 were similar to those for 0.8. Note that Figure 1(c) shows the learning curves for the first 80 epochs for CIFAR-10 without flooding, and Figure 1(d) shows the learning curves with flooding when the flooding level is 0.18.

[Figure 4: Learning curves of training and test loss for MNIST, Fashion-MNIST, Kuzushiji-MNIST, CIFAR-10, CIFAR-100, and SVHN. The first figure in each row shows the learning curves without flooding; the 2nd, 3rd, and 4th columns show the results with different flooding levels, with the flooding level increasing towards the right-hand side.]

E Relationship between Performance and Gradients

Settings: We visualize the relationship between test performance (loss or accuracy) and the gradient amplitude of the training/test loss in Figure 5, where the gradient amplitude is the $\ell_2$ norm of the filter-normalized gradient of the loss. The filter-normalized gradient is the gradient appropriately scaled depending on the magnitude of the corresponding convolutional filter, similarly to Li et al. [2018]. More specifically, for each filter of every convolutional layer, we multiply the corresponding elements of the gradient by the norm of the filter. Note that a fully connected layer is a special case of a convolutional layer and is subject to this scaling. We exclude Fashion-MNIST because the optimal flooding level was zero. We used the results with a training/validation split ratio of 0.8.
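A sketch of how this filter-wise scaling of the gradient could be computed in PyTorch is given below (our own illustration; how biases and other one-dimensional parameters are treated is our assumption, since the text only specifies the scaling for filters):

import torch

def filter_normalized_grad_norm(model):
    # l2 norm of the filter-normalized gradient: each filter's gradient is scaled by
    # the norm of the corresponding filter, then all entries are collected into a
    # single l2 norm. Call this after loss.backward().
    total = 0.0
    for p in model.parameters():
        if p.grad is None:
            continue
        if p.dim() > 1:  # conv / fully connected weights: one "filter" per output unit
            w = p.detach().view(p.size(0), -1)
            g = p.grad.detach().view(p.size(0), -1)
            scaled = g * w.norm(dim=1, keepdim=True)
        else:            # biases and other 1-d parameters are left unscaled here
            scaled = p.grad.detach()
        total += scaled.pow(2).sum().item()
    return total ** 0.5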

Results: For the figures with the gradient amplitude of the training loss on the horizontal axis, the marks for the method with flooding are often plotted to the right of the "+" marks for the method without flooding, which implies that flooding prevents the model from staying at a local minimum. For the figures with the gradient amplitude of the test loss on the horizontal axis, we can observe that the method with flooding improves performance while the gradient amplitude becomes smaller. On the other hand, the performance of the method without flooding degenerates while the gradient amplitude of the test loss keeps increasing.

[Figure 5: Relationship between test performance (loss or accuracy) and the amplitude of the gradient (with respect to the training or test loss) for MNIST, Kuzushiji-MNIST, CIFAR-10, CIFAR-100, and SVHN. Each point in the figures corresponds to a single model at a certain epoch; the first 5 epochs are removed and the rest are plotted. Different marks distinguish the methods with and without flooding, the large black marks show the epochs chosen by early stopping, and the color becomes lighter (purple to yellow) as training proceeds. K-M, C10, and C100 stand for Kuzushiji-MNIST, CIFAR-10, and CIFAR-100.]

F Flatness

Settings: We follow Li et al. [2018] and give a one-dimensional visualization of flatness for each dataset in Figure 6. We exclude Fashion-MNIST because the optimal flooding level was zero. We used the results with a training/validation split ratio of 0.8. We compare the flatness of the model right after the empirical risk with respect to a mini-batch becomes smaller than the flooding level, $\widehat{R}_m(g) < b$, for the first time (dotted blue line) and the model after training (solid blue line). We also compare them with the model trained by the baseline method without flooding after training is finished (solid red line).
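Figure 6 follows the one-dimensional visualization of Li et al. [2018]. A rough PyTorch-style sketch of how such a curve could be produced is given below (our own illustration, assuming loss_fn returns the mean loss over a batch; the model, loss, and data loader are assumed to be defined elsewhere):

import torch

def loss_along_random_direction(model, loss_fn, data_loader, alphas):
    # Evaluate the loss at theta + alpha * d for a random, filter-normalized
    # direction d, in the spirit of Li et al. [2018].
    model.eval()
    params = list(model.parameters())
    original = [p.detach().clone() for p in params]
    direction = []
    for p in params:
        d = torch.randn_like(p)
        if p.dim() > 1:  # rescale each filter of d to the norm of the matching filter of p
            d_flat = d.view(d.size(0), -1)
            p_flat = p.detach().view(p.size(0), -1)
            d_flat = d_flat / (d_flat.norm(dim=1, keepdim=True) + 1e-10) \
                     * p_flat.norm(dim=1, keepdim=True)
            d = d_flat.view_as(p)
        direction.append(d)
    losses = []
    with torch.no_grad():
        for a in alphas:
            for p, p0, d in zip(params, original, direction):
                p.copy_(p0 + a * d)
            total, count = 0.0, 0
            for x, y in data_loader:
                total += loss_fn(model(x), y).item() * y.size(0)
                count += y.size(0)
            losses.append(total / count)
        for p, p0 in zip(params, original):
            p.copy_(p0)  # restore the original parameters
    return losses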

Results: According to Figure 6, the test loss becomes lower and flatter during training with flooding. Note that the training loss, on the other hand, continues to float around the flooding level until the end of training after it enters the flooding zone. We expect that the model makes a random walk and escapes regions with sharp loss landscapes during this period. This may be a possible reason for the better generalization results with our proposed method.

[Figure 6: One-dimensional visualization of flatness, showing the training and test loss with respect to perturbation for MNIST, Kuzushiji-MNIST, CIFAR-10, CIFAR-100, and SVHN. We depict the results for three models: the model when the empirical risk with respect to training data is below the flooding level for the first time during training (dotted blue), the model at the end of training with flooding (solid blue), and the model at the end of training without flooding (solid red).]

  • 1 Introduction
  • 2 Backgrounds
    • 21 Regularization Methods
    • 22 Double Descent Curves with Overparametrization
    • 23 Lower-Bounding the Empirical Risk
      • 3 Flooding How to Avoid Zero Training Loss
        • 31 Preliminaries
        • 32 Algorithm
        • 33 Implementation
        • 34 Theoretical Analysis
          • 4 Experiments
            • 41 Synthetic Experiments
            • 42 Benchmark Experiments
            • 43 Memorization
              • 5 Conclusion
              • A Proof of Theorem
              • B Bayes Risk for Gaussian Distributions
              • C Details of Experiments
                • C1 Benchmark Datasets
                • C2 Chosen Flooding Levels
                  • D Learning Curves
                  • E Relationship between Performance and Gradients
                  • F Flatness
Page 5: Do We Need Zero Training Loss A›er Achieving Zero Training … · 2020. 2. 21. · directly aim to avoid zero training loss, they o›en fail to maintain a moderate level of training

Table 1 Conceptual comparisons of various regularizers ldquotrrdquo stands for ldquotrainingrdquo ldquoIndeprdquo stands forldquoindependentrdquo X stands for yes and times stands for no

Regularization and other methods Targettr loss

Domainindep

Taskindep

Modelindep Main assumption

`2 regularization [Tikhonov 1943] times X X X Optimal model params are close to 0Weight decay [Hanson and Pra 1988] times X X X Optimal model params are close to 0Early stopping [Morgan and Bourlard 1990] times X X X Overing occurs in later epochs`1 regularization [Tibshirani 1996] times X X X Optimal model has to be sparseDropout [Srivastava et al 2014] times X X times Weight scaling inference ruleBatch normalization [Ioe and Szegedy 2015] times X X times Existence of internal covariate shiLabel smoothing [Szegedy et al 2016] times X times X True posterior is not a one-hot vectorMixup [Zhang et al 2018] times times times X Linear relationship between x and yImage augment [Shorten and Khoshgoaar 2019] times times X X Input is invariant to the translations

Flooding (proposed method) X X X X Learning until zero loss is harmful

and harder to use with regression or ranking meaning they are task-specic Batch normalization[Ioe and Szegedy 2015] and dropout [Srivastava et al 2014] are designed for neural networksand are model-specic

Although these regularization methodsmdashboth the special and general onesmdashalready work wellin practice and have become the de facto standard tools [Bishop 2011 Goodfellow et al 2016]we provide an alternative which is even more general in the sense that it is domain- task- andmodel-independent

at being said we want to emphasize that the most important dierence between oodingand other regularization methods is whether it is possible to target a specic level of training lossother than zero While ooding allows the user to choose the level of training loss directly it ishard to achieve this with other regularizers

22 Double Descent Curves with OverparametrizationRecently there has been increasing aention on the phenomenon of ldquodouble descentrdquo named byBelkin et al [2019] to explain the two regimes of deep learning e rst one (underparametrizedregime) occurs where the model complexity is small compared to the number of training samplesand the test error as a function of model complexity decreases with low model complexity butstarts to increase aer the model complexity is large enough is follows the classical viewof machine learning that excessive complexity leads to poor generalization e second one(overparametrized regime) occurs when an even larger model complexity is considered enincreasing the complexity only decreases test error which leads to a double descent shape ephase of decreasing test error oen occurs aer the training error becomes zero is follows themodern view of machine learning that bigger models lead to beer generalization3

As far as we know the discovery of double descent curves dates back to at least Krogh and3httpswwwefforgaimetrics

5

Hertz [1992] where they theoretically showed the double descent phenomenon under a linearregression setup Recent works [Belkin et al 2019 Nakkiran et al 2020] have shown empiricallythat a similar phenomenon can be observed with deep learning methods Nakkiran et al [2020]observed that the double descent curves can be shown not only as a function of model complexitybut also as a function of the epoch number

We want to note a byproduct of our ooding method We were able to produce the epoch-wisedouble descent curve for the test loss with about 100 epochs Investigating the connection betweenour accelerated double descent curves and previous double descent curves [Belkin et al 2019Krogh and Hertz 1992 Nakkiran et al 2020] is out of the scope of this paper but is an importantfuture direction

23 Lower-Bounding the Empirical RiskLower-bounding the empirical risk has been used in the area of weakly supervised learning erewere a common phenomenon where the empirical risk goes below zero [Kiryo et al 2017] whenan equivalent form of the risk expressed with the given weak supervision was alternatively used[Cid-Sueiro et al 2014 du Plessis et al 2014 2015 Natarajan et al 2013 Patrini et al 2017 vanRooyen and Williamson 2018] A gradient ascent technique was used to force the empirical riskto become non-negative in Kiryo et al [2017] is idea has been generalized and applied to otherweakly supervised seings [Han et al 2018 Ishida et al 2019 Lu et al 2020]

Although we also set a lower bound on the empirical risk the motivation is dierent Firstwhile Kiryo et al [2017] and others aim to x the negative empirical risk to become lower boundedby zero our empirical risk already has a lower bound of zero Instead we are aiming to sink theoriginal empirical risk by placing a positive lower bound Second the problem seings are dierentWeakly supervised learning methods require certain loss corrections or sample corrections [Hanet al 2018] before the non-negative correction but we work on the original empirical risk withoutany seing-specic modications

3 Flooding How to Avoid Zero Training LossIn this section we propose our regularization method ooding Note that this section and thefollowing sections only consider multi-class classication for simplicity

31 PreliminariesConsider input variable x isin Rd and output variable y isin 1 K where K is the number ofclasses ey follow an unknown joint probability distribution with density p(x y) We denotethe score function by g Rd rarr RK For any test data point x0 our prediction of the output labelwill be given by y0 = argmaxzisin1K gz(x0) where gz(middot) is the z-th element of g(middot) and incase of a tie argmax returns the largest argument Let ` RK times1 2 K rarr R denote a loss

6

function ` can be the zero-one loss

`01(v zprime) =

0 if argmaxzisin1K vz = zprime

1 otherwise(2)

where v = (v1 vK)gt isin RK or a surrogate loss such as the somax cross-entropy loss

`CE(v zprime) = minus log

exp(vzprime)sumzisin1K exp(vz)

(3)

For a surrogate loss ` we denote the classication risk by

R(g) = Ep(xy)[`(g(x) y)] (4)

where Ep(xy)[middot] is the expectation over (x y) sim p(x y) We use R01(g) to denote Eq (4) when` = `01 and call it the classication error

e goal of multi-class classication is to learn g that minimizes the classication errorR01(g)In optimization we consider the minimization of the risk with a almost surely dierentiablesurrogate loss R(g) instead to make the problem more tractable Furthermore since p(x y) isusually unknown and there is no way to exactly evaluate R(g) we minimize its empirical versioncalculated from the training data instead

R(g) =1

n

nsumi=1

`(g(xi) yi) (5)

where (xi yi)ni=1 are iid sampled from p(x y) We call R the empirical riskWe would like to clarify some of the undened terms used in the title and the introduction

e ldquotraintest lossrdquo is the empirical risk with respect to the surrogate loss function ` over thetrainingtest data respectively We refer to the ldquotrainingtest errorrdquo as the empirical risk withrespect to `01 over the trainingtest data respectively (which is equal to one minus accuracy)[Zhang 2004]

Finally we formally dene the Bayes risk as

Rlowast = infhR(h) (6)

where the inmum is taken over all vector-valued functions h Rd rarr RK e Bayes risk is oenreferred to as the Bayes error if the zero-one loss is used

infhR01(h) (7)

32 Algorithm

With exible models R(g) wrt a surrogate loss can easily become small if not zero as wementioned in Section 1 see [C] in Fig 1(a) We propose a method that ldquooods the boom areaand sinks the original empirical riskrdquo as in Fig 1(b) so that the empirical risk cannot go belowthe ooding level More technically if we denote the ooding level as b our proposed trainingobjective with ooding is a simple x

7

Denition 1 e ooded empirical risk is dened as4

R(g) = |R(g)minus b|+ b (8)

Note that when b = 0 then R(g) = R(g) e gradient of R(g) wrt model parameters willpoint to the same direction as that of R(g) when R(g) gt b but in the opposite direction whenR(g) lt b is means that when the learning objective is above the ooding level we performgradient descent as usual (gravity zone) but when the learning objective is below the oodinglevel we perform gradient ascent instead (buoyancy zone)

e issue is that in general we seldom know the optimal ooding level in advance isissue can be mitigated by searching for the optimal ooding level blowast with a hyper-parameteroptimization technique In practice we can search for the optimal ooding level by performingthe exhaustive search in parallel

33 ImplementationFor large scale problems we can employ mini-batched stochastic optimization for ecient com-putation Suppose that we have M disjoint mini-batch splits We denote the empirical risk (5)with respect to the m-th mini-batch by Rm(g) for m isin 1 M en our mini-batchedoptimization performs gradient descent updates in the direction of the gradient of Rm(g) By theconvexity of the absolute value function and Jensenrsquos inequality we have

R(g) le 1

M

Msumm=1

(|Rm(g)minus b|+ b

) (9)

is indicates that mini-batched optimization will simply minimize an upper bound of the full-batch case with R(g)

34 eoretical AnalysisIn the following theorem we will show that the mean squared error (MSE) of the proposed riskestimator with ooding is smaller than that of the original risk estimator without ooding

eorem 1 Fix any measurable vector-valued function g If the ooding level b satises R(g) ltb lt R(g) we have

MSE(R(g)) gt MSE(R(g)) (10)

If b le R(g) we haveMSE(R(g)) = MSE(R(g)) (11)

A proof is given in Appendix A If we regard R(g) as the training loss and R(g) as the testloss we would want b to be between those two for the MSE to improve

4Strictly speaking Eq (1) is dierent from Eq (8) since Eq (1) can ignore constant terms of the original empiricalrisk We will refer to Eq (8) for the ooding operator for the rest of the paper

8

4 ExperimentsIn this section we show experimental results with synthetic and benchmark datasets e imple-mentation is based on PyTorch [Paszke et al 2019] and demo code will be available5 Experimentswere carried out with NVIDIA GeForce GTX 1080 Ti NVIDIA adro RTX 5000 and Intel XeonGold 6142

41 Synthetic Experimentse aim of our synthetic experiments is to study the behavior of ooding with a controlled setupWe use three types of synthetic data described below

Two Gaussians Data We perform binary classication with two 10-dimensional Gaus-sian distributions with covariance matrix identity and means microP = [0 0 0]gt and microN =[mm m]gt where m isin 08 10 e Bayes risk for m = 10 and m = 08 are 014 and024 respectively where proofs are shown in Appendix B e training validation and test samplesizes are 25 10000 and 10000 per class respectively

Sinusoid Data e sinusoid data [Nakkiran et al 2019] are generated as follows We rstdraw input data points uniformly from the inside of a 2-dimensional ball of radius 1 en we putclass labels based on

y = sign(xgtw + sin(xgtwprime))

where w and wprime are any two 2-dimesional vectors such that w perp wprime e training validationand test sample sizes are 100 100 and 20000 respectively

Spiral Data e spiral data [Sugiyama 2015] are two-dimensional synthetic data Letθ+1 = 0 θ+2 θ

+n+ = 4π be equally spaced n+ points in the interval [0 4π] and θminus1 =

0 θminus2 θminusnminus = 4π be equally spaced nminus points in the interval [0 4π] Let positive and negative

input data points be

x+i+ = θ+i+ [cos(θ

+i+) sin(θ

+i+)]

gt + τν+i+

xminusiminus = (θminusiminus + π)[cos(θminusiminus) sin(θminusiminus)]

gt + τνminusiminus

for i+ = 1 n+ and iminus = 1 nminus where τ controls the magnitude of the noise ν+i and

νminusi are iid distributed according to the two-dimensional standard normal distribution enwe make data for classication by (xi yi)ni=1 = (x+

i+ +1)n+

i+=1 cup (xminusiminus minus1)

nminus

iminus=1 wheren = n+ + nminus e training validation and test sample sizes are 100 100 and 10000 per classrespectively

For Two Gaussians we use a one-hidden-layer feedforward neural network with 500 unitsin the hidden layer with the ReLU activation function [Nair and Hinton 2010] We train thenetwork for 1000 epochs with the logistic loss and vanilla gradient descent with learning rate of005 e ooding level is chosen from b isin 0 001 002 040 For Sinusoid and Spiral weuse a four-hidden-layer feedforward neural network with 500 units in the hidden layer with the

5httpsgithubcomtakashiishidaflooding

9

Table 2 Experimental results for the synthetic data Sub-table (A) shows the results without early stoppingSub-table (B) shows the results with early stopping e beer method is shown in bold in each of sub-tables (A) and (B) ldquoBRrdquo stands for the Bayes risk For Two Gaussians the distance between the positive andnegative distributions is larger for larger m See the description in Section 41 for the details

(A) Without Early Stopping (B) With Early Stopping

Data Seing WithoutFlooding

WithFlooding

Chosenb

WithoutFlooding

WithFlooding

Chosenb

Two Gaussians m 10 BR 014 8796 9225 028 9163 9225 027Two Gaussians m 08 BR 024 8200 8731 033 8657 8729 035Sinusoid Label Noise 001 9384 9446 001 9254 9254 000Sinusoid Label Noise 005 9112 9544 010 9326 9460 001Sinusoid Label Noise 010 8657 9602 017 9670 9670 000Spiral Label Noise 001 9896 9785 001 9860 9888 001Spiral Label Noise 005 9387 9624 004 9658 9562 014Spiral Label Noise 010 8970 9296 016 8970 9296 016

ReLU activation function [Nair and Hinton 2010] and batch normalization [Ioe and Szegedy2015] We train the network for 500 epochs with the logistic loss and Adam [Kingma and Ba2015] optimizer with 100 mini-batch size and learning rate of 0001 e ooding level is chosenfrom b isin 0 001 002 020 Note that training with b = 0 is identical to the baselinemethod without ooding We report the test accuracy of the ooding level with the best validationaccuracy We rst conduct experiments without early stopping which means that the last epochwas chosen for all ooding levels

Results e results are summarized in Table 2 It is worth noting that for Two Gaussians thechosen ooding level b is larger for the smaller distance between the two distributions which iswhen the classication task is harder and the Bayes risk becomes larger since the two distributionsbecome less separated We see similar tendencies for Sinusoid and Spiral data a larger b waschosen for larger ipping probability for label noise which is expected to increase the Bayes riskis implies the positive correlation between the optimal ooding level and the Bayes risk asis also partially suggested by eorem 1 Another interesting observation is that the chosen bis close to but higher than the Bayes risk for Two Gaussians data is may look inconsistentwith eorem 1 However it makes sense to adopt larger b with stronger regularization eect thatallows some bias as a trade-o for reducing the variance of the risk estimator In fact eorem 1does not deny the possibility that some b ge R(g) achieves even beer estimation

From (A) in Table 2 we can see that the method with ooding oen improves test accuracyover the baseline method without ooding As we mentioned in the introduction it ca be harmfulto keep training a model until the end without ooding However with ooding the model at thenal epoch has good prediction performance according to the results which implies that ooding

10

Table 3 Results with benchmark datasets We report classication accuracy for all combinations ofweight decay (X and times) early stopping (X and times) and ooding (X and times) e second column shows thetrainingvalidation split used for the experiment W stands for weight decay E stands for early stopping andF stands for ooding ldquomdashrdquo means that ooding level of zero was optimal ldquoNArdquo means that we skipped theexperiments because zero weight decay was optimal in the case without ooding e best and equivalentare shown in bold by comparing ldquowith oodingrdquo and ldquowithout oodingrdquo for two columns with the sameseing for W and E eg the rst and h columns out of the 8 columns e best performing combinationis highlighted

Datasettr W times times X X times times X Xva E times X times X times X times X

split F times times times times X X X X

MNIST 08 9832 9830 9851 9842 9846 9853 9850 984804 9771 9770 9782 9791 9774 9785 mdash 9783

Fashion- 08 8934 8936 NA NA mdash mdash NA NAMNIST 04 8848 8863 8860 8862 mdash mdash mdash mdash

Kuzushiji- 08 9163 9162 9163 9171 9240 9212 9211 9197MNIST 04 8918 8918 8958 8973 9041 9015 8971 8988

CIFAR-10 08 7359 7336 7365 7357 7306 7344 mdash 744104 6639 6663 6931 6928 6720 6758 mdash mdash

CIFAR-100 08 4216 4233 4267 4245 4250 4236 mdash mdash04 3427 3434 3797 3882 3499 3514 mdash mdash

SVHN 08 9238 9241 9320 9299 9278 9279 mdash 934204 9032 9035 9043 9049 9057 9061 9116 9121

helps the late-stage training improve test accuracyWe also conducted experiments with early stopping meaning that we chose the model that

recorded the best validation accuracy during training e results are reported in sub-table (B) ofTable 2 Compared with sub-table (A) we see that early stopping improves the baseline methodwithout ooding well in many cases is indicates that training longer without ooding washarmful in our experiments On the other hand the accuracy for ooding combined with earlystopping is oen close to that with early stopping meaning that training until the end withooding tends to be already as good as doing so with early stopping e table shows that oodingoen improves or retains the test accuracy of the baseline method without ooding even aerdeploying early stopping Flooding does not hurt performance but can be benecial for methodsused with early stopping

11

(a) CIFAR-10 (000) (b) CIFAR-10 (003) (c) CIFAR-10 (007) (d) CIFAR-10 (020)

Figure 2 Learning curves of training and test loss for trainingvalidation proportion of 08 (a) shows thelearning curves without ooding (b) (c) and (d) show the learning curves with dierent ooding levels

42 Benchmark ExperimentsWe next perform experiments with benchmark datasets Not only do we compare with the baselinewithout ooding we also compare or combine with other general regularization methods whichare early stopping and weight decay

Settings We use the following six benchmark datasets MNIST Fashion-MNIST Kuzushiji-MNIST CIFAR-10 CIFAR-100 and SVHN e details of the benchmark datasets can be found inAppendix C1 We split the original training dataset into training and validation data with dierentproportions 08 or 04 (meaning 80 or 40 was used for training and the rest was used forvalidation respectively) We perform the exhaustive hyper-parameter search for the ooding levelwith candidates from 000 001 020 e number of epochs is 500 Stochastic gradientdescent [Robbins and Monro 1951] is used with learning rate of 01 and momentum of 09 ForMNIST Fashion-MNIST and Kuzushiji-MNIST we use a one-hidden-layer feedforward neuralnetwork with 500 units and ReLU activation function [Nair and Hinton 2010] For CIFAR-10CIFAR-100 and SVHN we used ResNet-18 [He et al 2016] We do not use any data augmentationor manual learning rate decay We deployed early stopping in the same way as in Section 41

We rst ran experiments with the following candidates for the weight decay rate 1times10minus5 1times10minus4 4times 10minus4 7times 10minus4 1times 10minus3 4times 10minus3 7times 10minus3 We choose the weight decay rate withthe best validation accuracy for each dataset and each trainingvalidation proportion en xingthe weight decay to the chosen one we ran experiments with ooding level candidates from0 001 020 to investigate whether weight decay and ooding have complementary eectsor if adding weight decay will diminish the accuracy gain of ooding

Results We show the results in Table 3 and the chosen ooding levels in Table 4 in Appendix C2We can observe that ooding gives beer accuracy for most cases We can also see that combiningooding with early stopping or with both early stopping and weight decay may lead to even beeraccuracy in some cases

12

(a) wo early stop (08) (b) wo early stop (04) (c) w early stop (08) (d) w early stop (04)

Figure 3 We show the optimal ooding level maintains memorization e vertical axis shows the trainingaccuracy e horizontal axis shows the ooding level We show results with and without early stoppingand for dierent trainingvalidation splits with proportion of 08 or 04 e marks (4 O ) areplaced on the ooding level that was chosen based on validation accuracy

43 MemorizationCan we maintain memorization even aer adding ooding We investigate if the trained modelhas zero training error (100 accuracy) for the ooding level that was chosen with validation dataWe show the results for all benchmark datasets and all trainingvalidation splits with proportions08 and 04 We also show the case without early stopping (choosing the last epoch) and withearly stopping (choosing the epoch with the highest validation accuracy) e results are shownin Fig 3

All gures show downward curves implying that the model will give up eventually onmemorizing all training data as the ooding level becomes higher A more interesting andimportant observation is the position of the optimal ooding level (the one chosen by validationaccuracy which is marked with 4 O or ) We can observe that the marks are oen ploedat zero error and in some cases there is a mark on the highest ooding level that maintains zeroerror ese results are consistent with recent empirical works that imply zero training error leadsto lower generalization error [Belkin et al 2019 Nakkiran et al 2020] but we further demonstratethat zero training loss may be harmful under zero training error

5 ConclusionWe proposed a novel regularization method called ooding that keeps the training loss to stayaround a small constant value to avoid zero training loss In our experiments the optimal oodinglevel oen maintained memorization of training data with zero error With ooding we showedthat the test accuracy will improve for various benchmark datasets and theoretically showed thatthe mean squared error will be reduced under certain conditions

As a byproduct we were able to produce a double descent curve for the test loss with a relativelyfew number of epochs eg in around 100 epochs shown in Fig 2 and Fig 4 in Appendix D Animportant future direction is to study the relationship between this and the double descent curves

13

from previous works [Belkin et al 2019 Krogh and Hertz 1992 Nakkiran et al 2020]It would also be interesting to see if Bayesian optimization [Shahriari et al 2016 Snoek

et al 2012] methods can be utilized to search for the optimal ooding level eciently We willinvestigate this direction in the future

AcknowledgementsWe thank Chang Xu Genki Yamamoto Kento Nozawa Nontawat Charoenphakdee Voot Tangkaraand Yoshihiro Nagano for the helpful discussions TI was supported by the Google PhD FellowshipProgram MS and IY were supported by JST CREST Grant Number JPMJCR18A2 including AIPchallenge program Japan

References

Devansh Arpit, Stanisław Jastrzebski, Nicolas Ballas, David Krueger, Emmanuel Bengio, Maxinder S. Kanwal, Tegan Maharaj, Asja Fischer, Aaron Courville, Yoshua Bengio, and Simon Lacoste-Julien. A closer look at memorization in deep networks. In ICML, 2017.

Mikhail Belkin, Daniel Hsu, and Partha P. Mitra. Overfitting or perfect fitting? Risk bounds for classification and regression rules that interpolate. In NeurIPS, 2018.

Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machine-learning practice and the classical bias-variance trade-off. PNAS, 116:15850–15854, 2019.

David Berthelot, Nicholas Carlini, Ian Goodfellow, Nicolas Papernot, Avital Oliver, and Colin A. Raffel. MixMatch: A holistic approach to semi-supervised learning. In NeurIPS, 2019.

Christopher M. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer, 2011.

Christopher Michael Bishop. Regularization and complexity control in feed-forward networks. In ICANN, 1995.

Rich Caruana, Steve Lawrence, and C. Lee Giles. Overfitting in neural nets: Backpropagation, conjugate gradient, and early stopping. In NeurIPS, 2000.

Pratik Chaudhari, Anna Choromanska, Stefano Soatto, Yann LeCun, Carlo Baldassi, Christian Borgs, Jennifer Chayes, Levent Sagun, and Riccardo Zecchina. Entropy-SGD: Biasing gradient descent into wide valleys. In ICLR, 2017.

Jesús Cid-Sueiro, Darío García-García, and Raúl Santos-Rodríguez. Consistency of losses for learning from weak labels. In ECML-PKDD, 2014.

Tarin Clanuwat, Mikel Bober-Irizar, Asanobu Kitamoto, Alex Lamb, Kazuaki Yamamoto, and David Ha. Deep learning for classical Japanese literature. In NeurIPS Workshop on Machine Learning for Creativity and Design, 2018.

Marthinus Christoffel du Plessis, Gang Niu, and Masashi Sugiyama. Analysis of learning from positive and unlabeled data. In NeurIPS, 2014.

Marthinus Christoffel du Plessis, Gang Niu, and Masashi Sugiyama. Convex formulation for learning from positive and unlabeled data. In ICML, 2015.

Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.

Hongyu Guo, Yongyi Mao, and Richong Zhang. Augmenting data with mixup for sentence classification: An empirical study. arXiv:1905.08941, 2019.

Bo Han, Gang Niu, Jiangchao Yao, Xingrui Yu, Miao Xu, Ivor Tsang, and Masashi Sugiyama. Pumpout: A meta approach to robust deep learning with noisy labels. arXiv:1809.11008, 2018.

Stephen Jose Hanson and Lorien Y. Pratt. Comparing biases for minimal network construction with back-propagation. In NeurIPS, 1988.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.

Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.

Takashi Ishida, Gang Niu, Aditya Krishna Menon, and Masashi Sugiyama. Complementary-label learning for arbitrary losses and models. In ICML, 2019.

Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. In ICLR, 2017.

Diederik P. Kingma and Jimmy L. Ba. Adam: A method for stochastic optimization. In ICLR, 2015.

Ryuichi Kiryo, Gang Niu, Marthinus Christoffel du Plessis, and Masashi Sugiyama. Positive-unlabeled learning with non-negative risk estimator. In NeurIPS, 2017.

Anders Krogh and John A. Hertz. A simple weight decay can improve generalization. In NeurIPS, 1992.

Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. In Proceedings of the IEEE, pages 2278–2324, 1998.

Hao Li, Zheng Xu, Gavin Taylor, Christoph Studer, and Tom Goldstein. Visualizing the loss landscape of neural nets. In NeurIPS, 2018.

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In ICLR, 2019.

Nan Lu, Tianyi Zhang, Gang Niu, and Masashi Sugiyama. Mitigating overfitting in supervised classification from two unlabeled datasets: A consistent risk correction approach. In AISTATS, 2020.

N. Morgan and H. Bourlard. Generalization and parameter estimation in feedforward nets: Some experiments. In NeurIPS, 1990.

Vinod Nair and Geoffrey E. Hinton. Rectified linear units improve restricted Boltzmann machines. In ICML, 2010.

Preetum Nakkiran, Gal Kaplun, Dimitris Kalimeris, Tristan Yang, Benjamin L. Edelman, Fred Zhang, and Boaz Barak. SGD on neural networks learns functions of increasing complexity. In NeurIPS, 2019.

Preetum Nakkiran, Gal Kaplun, Yamini Bansal, Tristan Yang, Boaz Barak, and Ilya Sutskever. Deep double descent: Where bigger models and more data hurt. In ICLR, 2020.

Nagarajan Natarajan, Inderjit S. Dhillon, Pradeep K. Ravikumar, and Ambuj Tewari. Learning with noisy labels. In NeurIPS, 2013.

Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y. Ng. Reading digits in natural images with unsupervised feature learning. In NeurIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.

Andrew Y. Ng. Preventing "overfitting" of cross-validation data. In ICML, 1997.

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An imperative style, high-performance deep learning library. In NeurIPS, 2019.

Giorgio Patrini, Alessandro Rozza, Aditya K. Menon, Richard Nock, and Lizhen Qu. Making deep neural networks robust to label noise: A loss correction approach. In CVPR, 2017.

Herbert Robbins and Sutton Monro. A stochastic approximation method. Annals of Mathematical Statistics, 22:400–407, 1951.

Rebecca Roelofs, Vaishaal Shankar, Benjamin Recht, Sara Fridovich-Keil, Moritz Hardt, John Miller, and Ludwig Schmidt. A meta-analysis of overfitting in machine learning. In NeurIPS, 2019.

Bobak Shahriari, Kevin Swersky, Ziyu Wang, Ryan P. Adams, and Nando de Freitas. Taking the human out of the loop: A review of Bayesian optimization. Proceedings of the IEEE, 104:148–175, 2016.

Connor Shorten and Taghi M. Khoshgoftaar. A survey on image data augmentation for deep learning. Journal of Big Data, 6, 2019.

Jasper Snoek, Hugo Larochelle, and Ryan P. Adams. Practical Bayesian optimization of machine learning algorithms. In NeurIPS, 2012.

Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15:1929–1958, 2014.

Masashi Sugiyama. Introduction to Statistical Machine Learning. Morgan Kaufmann, 2015.

Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, and Jon Shlens. Rethinking the inception architecture for computer vision. In CVPR, 2016.

Sunil Thulasidasan, Gopinath Chennupati, Jeff Bilmes, Tanmoy Bhattacharya, and Sarah Michalak. On mixup training: Improved calibration and predictive uncertainty for deep neural networks. In NeurIPS, 2019.

Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58:267–288, 1996.

A. N. Tikhonov and V. Y. Arsenin. Solutions of Ill-Posed Problems. Winston, 1977.

Andrey Nikolayevich Tikhonov. On the stability of inverse problems. Doklady Akademii Nauk SSSR, 39:195–198, 1943.

Antonio Torralba, Rob Fergus, and William T. Freeman. 80 million tiny images: A large data set for nonparametric object and scene recognition. IEEE Trans. PAMI, 2008.

Brendan van Rooyen and Robert C. Williamson. A theory of learning with corrupted labels. Journal of Machine Learning Research, 18:1–50, 2018.

Vikas Verma, Alex Lamb, Juho Kannala, Yoshua Bengio, and David Lopez-Paz. Interpolation consistency training for semi-supervised learning. In IJCAI, 2019.

Stefan Wager, Sida Wang, and Percy Liang. Dropout training as adaptive regularization. In NeurIPS, 2013.

Roman Werpachowski, András György, and Csaba Szepesvári. Detecting overfitting via adversarial examples. In NeurIPS, 2019.

Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.

Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. In ICLR, 2017.

Hongyi Zhang, Moustapha Cisse, Yann N. Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. In ICLR, 2018.

Tong Zhang. Statistical behavior and consistency of classification methods based on convex risk minimization. The Annals of Statistics, 32:56–85, 2004.

A Proof of Theorem

Proof. If the flooding level is $b$, then the proposed flooding estimator is

$$\tilde{R}(g) = |\hat{R}(g) - b| + b. \qquad (12)$$

Since the absolute value can be expressed with a max operator, $\max(a, b) = \frac{a + b + |a - b|}{2}$, the proposed estimator can be re-expressed as

$$\tilde{R}(g) = 2\max(\hat{R}(g), b) - \hat{R}(g) = A - \hat{R}(g), \qquad (13)$$

where, for convenience, we used $A = 2\max(\hat{R}(g), b)$. From the definition of MSE,

$$\mathrm{MSE}(\hat{R}(g)) = \mathbb{E}[(\hat{R}(g) - R(g))^2], \qquad (14)$$
$$\mathrm{MSE}(\tilde{R}(g)) = \mathbb{E}[(\tilde{R}(g) - R(g))^2] \qquad (15)$$
$$= \mathbb{E}[(A - \hat{R}(g) - R(g))^2] \qquad (16)$$
$$= \mathbb{E}[A^2] - 2\mathbb{E}[A(\hat{R}(g) + R(g))] + \mathbb{E}[(\hat{R}(g) + R(g))^2]. \qquad (17)$$

We are interested in the sign of

$$\mathrm{MSE}(\hat{R}(g)) - \mathrm{MSE}(\tilde{R}(g)) = \mathbb{E}[-4\hat{R}(g)R(g) - A^2 + 2A(\hat{R}(g) + R(g))]. \qquad (18)$$

Define the inside of the expectation as $B = -4\hat{R}(g)R(g) - A^2 + 2A(\hat{R}(g) + R(g))$. $B$ can be divided into two cases, depending on the outcome of the max operator:

$$B = \begin{cases} -4\hat{R}(g)R(g) - 4\hat{R}(g)^2 + 4\hat{R}(g)(\hat{R}(g) + R(g)) & \text{if } \hat{R}(g) \ge b \\ -4\hat{R}(g)R(g) - 4b^2 + 4b(\hat{R}(g) + R(g)) & \text{if } \hat{R}(g) < b \end{cases} \qquad (19)$$
$$= \begin{cases} 0 & \text{if } \hat{R}(g) \ge b \\ -4\hat{R}(g)R(g) - 4b^2 + 4b(\hat{R}(g) + R(g)) & \text{if } \hat{R}(g) < b \end{cases} \qquad (20)$$
$$= \begin{cases} 0 & \text{if } \hat{R}(g) \ge b \\ -4(b - \hat{R}(g))(b - R(g)) & \text{if } \hat{R}(g) < b. \end{cases} \qquad (21)$$

The latter case becomes positive when $\hat{R}(g) < b < R(g)$. Therefore, when $\hat{R}(g) < b < R(g)$,

$$\mathrm{MSE}(\hat{R}(g)) - \mathrm{MSE}(\tilde{R}(g)) > 0, \qquad (22)$$
$$\mathrm{MSE}(\hat{R}(g)) > \mathrm{MSE}(\tilde{R}(g)). \qquad (23)$$

When $b \le \hat{R}(g)$,

$$\mathrm{MSE}(\hat{R}(g)) - \mathrm{MSE}(\tilde{R}(g)) = 0, \qquad (24)$$
$$\mathrm{MSE}(\hat{R}(g)) = \mathrm{MSE}(\tilde{R}(g)). \qquad (25)$$
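As a quick illustration of the inequality above, the following Monte Carlo sketch compares the two estimators when the realized empirical risk lies below a flooding level that itself lies below the true risk. All numerical values (true risk 0.3, flooding level 0.25, noise scale) are hypothetical choices made only for this check, not quantities from the paper.

```python
import numpy as np

# Monte Carlo comparison of MSE(R_hat) and MSE(R_tilde), where
# R_tilde = |R_hat - b| + b is the flooded estimator (Eq. 12).
rng = np.random.default_rng(0)
R_true = 0.30                 # hypothetical true risk R(g)
b = 0.25                      # flooding level between typical R_hat and R_true
R_hat = np.clip(rng.normal(0.10, 0.05, 1_000_000), 0.0, None)  # simulated empirical risks

R_tilde = np.abs(R_hat - b) + b
mse_hat = np.mean((R_hat - R_true) ** 2)
mse_tilde = np.mean((R_tilde - R_true) ** 2)
print(f"MSE without flooding: {mse_hat:.4f}")   # larger
print(f"MSE with flooding:    {mse_tilde:.4f}") # smaller, as the theorem suggests
```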

B Bayes Risk for Gaussian Distributions

In this section, we explain in detail how we derived the Bayes risk with respect to the surrogate loss in the experiments with Gaussian data in Section 4.1. Since we are using the logistic loss in the synthetic experiments, the loss of the margin is

$$\ell(yg(x)) = \log(1 + \exp(-yg(x))), \qquad (26)$$

where $g(x): \mathbb{R}^d \to \mathbb{R}$ is a scalar instead of the vector definition that was used previously, because the synthetic experiments only consider binary classification. Take the derivative with respect to $g(x)$:

$$\frac{\partial \mathbb{E}[\ell(yg(x))]}{\partial g(x)} = \frac{\partial \mathbb{E}[\log(1 + \exp(-yg(x)))]}{\partial g(x)} = \mathbb{E}\left[\frac{-y\exp(-yg(x))}{1 + \exp(-yg(x))} \,\middle|\, x\right] p(x) \qquad (27)$$
$$= \mathbb{E}\left[\frac{-y}{1 + \exp(yg(x))} \,\middle|\, x\right] p(x) \qquad (28)$$
$$= \mathbb{E}\left[\frac{y+1}{2}\cdot\frac{-1}{\exp(g(x)) + 1} + \frac{y-1}{2}\cdot\frac{-1}{\exp(-g(x)) + 1} \,\middle|\, x\right] p(x) \qquad (29)$$
$$= \mathbb{E}\left[\frac{y+1}{2} \,\middle|\, x\right] \frac{-1}{\exp(g(x)) + 1}\, p(x) + \mathbb{E}\left[\frac{y-1}{2} \,\middle|\, x\right] \frac{-1}{\exp(-g(x)) + 1}\, p(x) \qquad (30)$$
$$= p(y = +1 \mid x)\,\frac{-1}{\exp(g(x)) + 1}\, p(x) + p(y = -1 \mid x)\,\frac{1}{\exp(-g(x)) + 1}\, p(x). \qquad (31)$$

Set this to zero and divide by $p(x) > 0$ to obtain

$$p(y = -1 \mid x)\,\frac{1}{\exp(-g(x)) + 1} = p(y = +1 \mid x)\,\frac{1}{\exp(g(x)) + 1}, \qquad (32)$$
$$\exp(g(x)) = \frac{p(y = +1 \mid x)}{p(y = -1 \mid x)}, \qquad (33)$$
$$g(x) = \log\frac{p(y = +1 \mid x)}{p(y = -1 \mid x)}. \qquad (34)$$

Since we are interested in the surrogate loss under this classifier, we plug this into the logistic loss to obtain the Bayes risk:

$$\mathbb{E}[\ell(yg(x))] = \mathbb{E}\left[\log\left(1 + \frac{p(-y \mid x)}{p(y \mid x)}\right)\right] = \mathbb{E}\left[\log\frac{1}{p(y \mid x)}\right] = \mathbb{E}[-\log p(y \mid x)]. \qquad (35)$$

In the experiments in Section 4.1, we report the empirical version of this with the test dataset as the Bayes risk.
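For the two-Gaussian setup of Section 4.1, the posterior log-odds $g^*(x)$ in Eq. (34) has a closed form, so the Bayes risk in Eq. (35) can be estimated numerically. The sketch below assumes equal class priors and identity covariance as in the synthetic-data description; the sample size and seed are arbitrary choices for illustration.

```python
import numpy as np

# Monte Carlo estimate of the Bayes risk E[-log p(y|x)] (Eq. 35) for two
# 10-dimensional Gaussians with means mu_P = 0 and mu_N = (m, ..., m).
rng = np.random.default_rng(0)
d, m, n = 10, 1.0, 1_000_000

mu_pos = np.zeros(d)
mu_neg = np.full(d, m)

y = rng.choice([+1, -1], size=n)                       # equal class priors
x = np.where(y[:, None] == +1, mu_pos, mu_neg) + rng.standard_normal((n, d))

# Closed-form log-odds for identity covariance and equal priors:
# g*(x) = (mu_pos - mu_neg) @ x + (||mu_neg||^2 - ||mu_pos||^2) / 2
g_star = x @ (mu_pos - mu_neg) + (mu_neg @ mu_neg - mu_pos @ mu_pos) / 2.0

bayes_risk = np.mean(np.logaddexp(0.0, -y * g_star))   # E[log(1 + exp(-y g*(x)))]
print(bayes_risk)   # roughly 0.14 for m = 1.0 and 0.24 for m = 0.8 (cf. Section 4.1)
```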

C Details of Experiments

C.1 Benchmark Datasets

In the experiments in Section 4.2, we use the six benchmark datasets explained below.

• MNIST⁶ [Lecun et al., 1998] is a 10-class dataset of handwritten digits: 1, 2, . . . , 9, and 0. Each sample is a 28 × 28 grayscale image. The number of training and test samples are 60,000 and 10,000, respectively.

• Fashion-MNIST⁷ [Xiao et al., 2017] is a 10-class dataset of fashion items: T-shirt/top, Trouser, Pullover, Dress, Coat, Sandal, Shirt, Sneaker, Bag, and Ankle boot. Each sample is a 28 × 28 grayscale image. The number of training and test samples are 60,000 and 10,000, respectively.

• Kuzushiji-MNIST⁸ [Clanuwat et al., 2018] is a 10-class dataset of cursive Japanese ("Kuzushiji") characters. Each sample is a 28 × 28 grayscale image. The number of training and test samples are 60,000 and 10,000, respectively.

• CIFAR-10⁹ is a 10-class dataset of various objects: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck. Each sample is a colored image in 32 × 32 × 3 RGB format. It is a subset of the 80 million tiny images dataset [Torralba et al., 2008]. There are 6,000 images per class, where 5,000 are for training and 1,000 are for test.

• CIFAR-100¹⁰ is a 100-class dataset of various objects. Each class has 600 samples, where 500 samples are for training and 100 samples are for test. This is also a subset of the 80 million tiny images dataset [Torralba et al., 2008].

• SVHN¹¹ [Netzer et al., 2011] is a 10-class dataset of house numbers from Google Street View images, in 32 × 32 × 3 RGB format. 73,257 digits are for training and 26,032 digits are for testing.

⁶ http://yann.lecun.com/exdb/mnist/
⁷ https://github.com/zalandoresearch/fashion-mnist
⁸ https://github.com/rois-codh/kmnist
⁹ https://www.cs.toronto.edu/~kriz/cifar.html
¹⁰ https://www.cs.toronto.edu/~kriz/cifar.html
¹¹ http://ufldl.stanford.edu/housenumbers/
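All six datasets are available through torchvision; the following is a minimal sketch (not the authors' released pipeline) of loading one of them and making the training/validation split with proportion 0.8 described above. The batch size and random seed are arbitrary choices for illustration.

```python
import torch
from torch.utils.data import DataLoader, random_split
from torchvision import datasets, transforms

# Load CIFAR-10 and split the official training set into training/validation
# data with proportion 0.8 (use 0.4 for the other setting).
transform = transforms.ToTensor()
full_train = datasets.CIFAR10(root="./data", train=True, download=True, transform=transform)
test_set = datasets.CIFAR10(root="./data", train=False, download=True, transform=transform)

n_train = int(0.8 * len(full_train))
train_set, val_set = random_split(
    full_train,
    [n_train, len(full_train) - n_train],
    generator=torch.Generator().manual_seed(0),  # arbitrary seed for reproducibility
)

train_loader = DataLoader(train_set, batch_size=128, shuffle=True)
val_loader = DataLoader(val_set, batch_size=128)
test_loader = DataLoader(test_set, batch_size=128)
```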

C.2 Chosen Flooding Levels

In Table 4, we report the chosen flooding levels for our experiments with benchmark datasets.

Table 4: The chosen flooding levels for benchmark experiments.

Early stopping          ✗      ✓      ✗      ✓
Weight decay            ✗      ✗      ✓      ✓
Flooding                ✓      ✓      ✓      ✓

MNIST (0.8)             0.02   0.02   0.03   0.02
MNIST (0.4)             0.02   0.03   0.00   0.02
Fashion-MNIST (0.8)     0.00   0.00   —      —
Fashion-MNIST (0.4)     0.00   0.00   —      —
Kuzushiji-MNIST (0.8)   0.01   0.03   0.03   0.03
Kuzushiji-MNIST (0.4)   0.03   0.02   0.04   0.03
CIFAR-10 (0.8)          0.04   0.04   0.00   0.01
CIFAR-10 (0.4)          0.03   0.03   0.00   0.00
CIFAR-100 (0.8)         0.02   0.02   0.00   0.00
CIFAR-100 (0.4)         0.03   0.03   0.00   0.00
SVHN (0.8)              0.01   0.01   0.00   0.02
SVHN (0.4)              0.01   0.09   0.03   0.03

D Learning Curves

In Figure 4, we visualize learning curves for all datasets (including CIFAR-10, which we already visualized in Figure 2). We only show the learning curves for the training/validation proportion of 0.8, since the results for 0.4 were similar to those for 0.8. Note that Figure 1(c) shows the learning curves for the first 80 epochs for CIFAR-10 without flooding, and Figure 1(d) shows the learning curves with flooding when the flooding level is 0.18.

[Figure 4 panels: MNIST (flooding levels 0.00, 0.01, 0.02, 0.03), Fashion-MNIST (0.00, 0.02, 0.05, 0.10), Kuzushiji-MNIST (0.00, 0.02, 0.05, 0.10), CIFAR-10 (0.00, 0.03, 0.07, 0.20), CIFAR-100 (0.00, 0.01, 0.06, 0.20), and SVHN (0.00, 0.01, 0.06, 0.20).]

Figure 4: Learning curves of training and test loss. The first figure in each row shows the learning curves without flooding. The 2nd, 3rd, and 4th columns show the results with different flooding levels; the flooding level increases towards the right-hand side.

E Relationship between Performance and Gradients

Settings. We visualize the relationship between test performance (loss or accuracy) and the gradient amplitude of the training/test loss in Figure 5, where the gradient amplitude is the ℓ2 norm of the filter-normalized gradient of the loss. The filter-normalized gradient is the gradient appropriately scaled depending on the magnitude of the corresponding convolutional filter, similarly to Li et al. [2018]. More specifically, for each filter of every convolutional layer, we multiply the corresponding elements of the gradient by the norm of the filter. Note that a fully connected layer is a special case of a convolutional layer and is subject to this scaling. We exclude Fashion-MNIST because the optimal flooding level was zero. We used the results with training/validation split ratio of 0.8.
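A minimal sketch of how such a filter-normalized gradient amplitude can be computed in PyTorch is given below, assuming the gradients have already been populated by a backward pass. The treatment of one-dimensional parameters (biases, batch-normalization parameters) is our own assumption; the text above only specifies the scaling for filters of convolutional and fully connected layers.

```python
import torch

def filter_normalized_grad_norm(model: torch.nn.Module) -> float:
    """l2 norm of the filter-normalized gradient described above (a sketch,
    not the authors' exact code). Call after loss.backward()."""
    total_sq = 0.0
    for p in model.parameters():
        if p.grad is None:
            continue
        if p.dim() > 1:
            # One row per filter (conv weight) or per output unit (linear weight).
            w = p.detach().view(p.size(0), -1)
            g = p.grad.detach().view(p.size(0), -1)
            scaled = g * w.norm(dim=1, keepdim=True)  # multiply grad by the filter norm
            total_sq += scaled.pow(2).sum().item()
        else:
            # 1-d parameters are left unscaled (our assumption).
            total_sq += p.grad.detach().pow(2).sum().item()
    return total_sq ** 0.5
```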

Results. For the figures with the gradient amplitude of the training loss on the horizontal axis, the "◦" marks (with flooding) are often plotted to the right of the "+" marks (without flooding), which implies that flooding prevents the model from staying at a local minimum. For the figures with the gradient amplitude of the test loss on the horizontal axis, we can observe that the method with flooding ("◦") improves performance while the gradient amplitude becomes smaller. On the other hand, the performance of the method without flooding ("+") degenerates while the gradient amplitude of the test loss keeps increasing.


[Figure 5 panels (a)-(t): for each of MNIST, Kuzushiji-MNIST (K-M), CIFAR-10 (C10), CIFAR-100 (C100), and SVHN, test loss and test accuracy are plotted against the gradient amplitude of the training loss (x: train) and of the test loss (x: test).]

Figure 5: Relationship between test performance (loss or accuracy) and amplitude of gradient (with training or test loss). Each point ("◦" or "+") in the figures corresponds to a single model at a certain epoch. We remove the first 5 epochs and plot the rest. "◦" is used for the method with flooding and "+" is used for the method without flooding. The large black "◦" and "+" show the epochs with early stopping. The color becomes lighter (purple → yellow) as the training proceeds. K-M, C10, and C100 stand for Kuzushiji-MNIST, CIFAR-10, and CIFAR-100.


F Flatness

Settings. We follow Li et al. [2018] and give a one-dimensional visualization of flatness for each dataset in Figure 6. We exclude Fashion-MNIST because the optimal flooding level was zero. We used the results with training/validation split ratio of 0.8. We compare the flatness of the model right after the empirical risk with respect to a mini-batch becomes smaller than the flooding level, $\hat{R}_m(g) < b$, for the first time (dotted blue line) and the model after training (solid blue line). We also compare them with the model trained by the baseline method without flooding after training is finished (solid red line).

Results. According to Figure 6, the test loss becomes lower and flatter during training with flooding. Note that the training loss, on the other hand, continues to float around the flooding level until the end of training after it enters the flooding zone. We expect that the model makes a random walk and escapes regions with sharp loss landscapes during this period. This may be a possible reason for the better generalization results with our proposed method.
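The one-dimensional visualization can be reproduced in the spirit of Li et al. [2018] by perturbing the trained weights along a single random, filter-normalized direction and recording the loss at each step size. The sketch below is illustrative only; the function name, the step sizes, and the handling of 1-d parameters are our own choices, not the authors' code. It assumes loss_fn averages over each mini-batch.

```python
import torch

@torch.no_grad()
def loss_along_direction(model, loss_fn, loader, alphas, device="cpu"):
    """Evaluate the loss at weights perturbed by alpha * d for each alpha in alphas,
    where d is one random direction rescaled filter-wise to the weight norms."""
    model.to(device).eval()
    base = [p.detach().clone() for p in model.parameters()]
    direction = []
    for p in base:
        d = torch.randn_like(p)
        if p.dim() > 1:  # rescale each filter of the direction to that filter's norm
            d2 = d.view(p.size(0), -1)
            w2 = p.view(p.size(0), -1)
            d2 = d2 / (d2.norm(dim=1, keepdim=True) + 1e-10) * w2.norm(dim=1, keepdim=True)
            d = d2.view_as(p)
        direction.append(d)
    losses = []
    for alpha in alphas:
        for p, b0, d in zip(model.parameters(), base, direction):
            p.copy_(b0 + alpha * d)
        total, count = 0.0, 0
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            total += loss_fn(model(x), y).item() * x.size(0)
            count += x.size(0)
        losses.append(total / count)
    for p, b0 in zip(model.parameters(), base):
        p.copy_(b0)  # restore the original weights
    return losses
```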


[Figure 6 panels: MNIST, Kuzushiji-MNIST, CIFAR-10, CIFAR-100, and SVHN, each with a training-loss plot and a test-loss plot.]

Figure 6: One-dimensional visualization of flatness. We visualize the training/test loss with respect to perturbation. We depict the results for 3 models: the model at the first time during training when the empirical risk with respect to the training data goes below the flooding level (dotted blue), the model at the end of training with flooding (solid blue), and the model at the end of training without flooding (solid red).


• 1 Introduction
• 2 Backgrounds
  • 2.1 Regularization Methods
  • 2.2 Double Descent Curves with Overparametrization
  • 2.3 Lower-Bounding the Empirical Risk
• 3 Flooding: How to Avoid Zero Training Loss
  • 3.1 Preliminaries
  • 3.2 Algorithm
  • 3.3 Implementation
  • 3.4 Theoretical Analysis
• 4 Experiments
  • 4.1 Synthetic Experiments
  • 4.2 Benchmark Experiments
  • 4.3 Memorization
• 5 Conclusion
• A Proof of Theorem
• B Bayes Risk for Gaussian Distributions
• C Details of Experiments
  • C.1 Benchmark Datasets
  • C.2 Chosen Flooding Levels
• D Learning Curves
• E Relationship between Performance and Gradients
• F Flatness
43 MemorizationCan we maintain memorization even aer adding ooding We investigate if the trained modelhas zero training error (100 accuracy) for the ooding level that was chosen with validation dataWe show the results for all benchmark datasets and all trainingvalidation splits with proportions08 and 04 We also show the case without early stopping (choosing the last epoch) and withearly stopping (choosing the epoch with the highest validation accuracy) e results are shownin Fig 3

All gures show downward curves implying that the model will give up eventually onmemorizing all training data as the ooding level becomes higher A more interesting andimportant observation is the position of the optimal ooding level (the one chosen by validationaccuracy which is marked with 4 O or ) We can observe that the marks are oen ploedat zero error and in some cases there is a mark on the highest ooding level that maintains zeroerror ese results are consistent with recent empirical works that imply zero training error leadsto lower generalization error [Belkin et al 2019 Nakkiran et al 2020] but we further demonstratethat zero training loss may be harmful under zero training error

5 ConclusionWe proposed a novel regularization method called ooding that keeps the training loss to stayaround a small constant value to avoid zero training loss In our experiments the optimal oodinglevel oen maintained memorization of training data with zero error With ooding we showedthat the test accuracy will improve for various benchmark datasets and theoretically showed thatthe mean squared error will be reduced under certain conditions

As a byproduct we were able to produce a double descent curve for the test loss with a relativelyfew number of epochs eg in around 100 epochs shown in Fig 2 and Fig 4 in Appendix D Animportant future direction is to study the relationship between this and the double descent curves

13

from previous works [Belkin et al 2019 Krogh and Hertz 1992 Nakkiran et al 2020]It would also be interesting to see if Bayesian optimization [Shahriari et al 2016 Snoek

et al 2012] methods can be utilized to search for the optimal ooding level eciently We willinvestigate this direction in the future

AcknowledgementsWe thank Chang Xu Genki Yamamoto Kento Nozawa Nontawat Charoenphakdee Voot Tangkaraand Yoshihiro Nagano for the helpful discussions TI was supported by the Google PhD FellowshipProgram MS and IY were supported by JST CREST Grant Number JPMJCR18A2 including AIPchallenge program Japan

ReferencesDevansh Arpit Stanisaw Jastrzebski Nicolas Ballas David Krueger Emmanuel Bengio Maxin-

der S Kanwal Tegan Maharaj Asja Fischer Aaron Courville Yoshua Bengio and SimonLacoste-Julien A closer look at memorization in deep networks In ICML 2017

Mikhail Belkin Daniel Hsu and Partha P Mitra Overing or perfect ing Risk bounds forclassication and regression rules that interpolate In NeurIPS 2018

Mikhail Belkin Daniel Hsu Siyuan Ma and Soumik Mandal Reconciling modern machine-learningpractice and the classical biasvariance trade-o PNAS 11615850ndash15854 2019

David Berthelot Nicholas Carlini Ian Goodfellow Nicolas Papernot Avital Oliver and Colin ARael MixMatch A holistic approach to semi-supervised learning In NeurIPS 2019

Christopher M Bishop Paern Recognition and Machine Learning (Information Science andStatistics) Springer 2011

Christopher Michael Bishop Regularization and complexity control in feed-forward networks InICANN 1995

Rich Caruana Steve Lawrence and C Lee Giles Overing in neural nets Backpropagationconjugate gradient and early stopping In NeurIPS 2000

Pratik Chaudhari Anna Choromanska Stefano Soao Yann LeCun Carlo Baldassi ChristianBorgs Jennifer Chayes Levent Sagun and Riccardo Zecchina Entropy-SGD Biasing gradientdescent into wide valleys In ICLR 2017

Jesus Cid-Sueiro Darıo Garcıa-Garcıa and Raul Santos-Rodrıguez Consistency of losses forlearning from weak labels In ECML-PKDD 2014

14

Tarin Clanuwat Mikel Bober-Irizar Asanobu Kitamoto Alex Lamb Kazuaki Yamamoto and DavidHa Deep learning for classical Japanese literature In NeurIPS Workshop on Machine Learningfor Creativity and Design 2018

Marthinus Christoel du Plessis Gang Niu and Masashi Sugiyama Analysis of learning frompositive and unlabeled data In NeurIPS 2014

Marthinus Christoel du Plessis Gang Niu and Masashi Sugiyama Convex formulation forlearning from positive and unlabeled data In ICML 2015

Ian Goodfellow Yoshua Bengio and Aaron Courville Deep Learning MIT Press 2016

Hongyu Guo Yongyi Mao and Richong Zhang Augmenting data with mixup for sentenceclassication An empirical study In arXiv190508941 2019

Bo Han Gang Niu Jiangchao Yao Xingrui Yu Miao Xu Ivor Tsang and Masashi SugiyamaPumpout A meta approach to robust deep learning with noisy labels In arXiv180911008 2018

Stephen Jose Hanson and Lorien Y Pra Comparing biases for minimal network constructionwith back-propagation In NeurIPS 1988

Kaiming He Xiangyu Zhang Shaoqing Ren and Jian Sun Deep residual learning for imagerecognition In CVPR 2016

Sergey Ioe and Christian Szegedy Batch normalization Accelerating deep network training byreducing internal covariate shi In ICML 2015

Takashi Ishida Gang Niu Aditya Krishna Menon and Masashi Sugiyama Complementary-labellearning for arbitrary losses and models In ICML 2019

Nitish Shirish Keskar Dheevatsa Mudigere Jorge Nocedal Mikhail Smelyanskiy and Ping Tak PeterTang On large-batch training for deep learning Generalization gap and sharp minima In ICLR2017

Diederik P Kingma and Jimmy L Ba Adam A method for stochastic optimization In ICLR 2015

Ryuichi Kiryo Gang Niu Marthinus Christoel du Plessis and Masashi Sugiyama Positive-unlabeled learning with non-negative risk estimator In NeurIPS 2017

Anders Krogh and John A Hertz A simple weight decay can improve generalization In NeurIPS1992

Yann Lecun Lon Boou Yoshua Bengio and Patrick Haner Gradient-based learning applied todocument recognition In Proceedings of the IEEE pages 2278ndash2324 1998

Hao Li Zheng Xu Gavin Taylor Christoph Studer and Tom Goldstein Visualizing the losslandscape of neural nets In NeurIPS 2018

15

Ilya Loshchilov and Frank Huer Decoupled weight decay regularization In ICLR 2019

Nan Lu Tianyi Zhang Gang Niu and Masashi Sugiyama Mitigating overing in supervisedclassication from two unlabeled datasets A consistent risk correction approach In AISTATS2020

N Morgan and H Bourlard Generalization and parameter estimation in feedforward nets Someexperiments In NeurIPS 1990

Vinod Nair and Georey E Hinton Rectied linear units improve restricted boltzmann machinesIn ICML 2010

Preetum Nakkiran Gal Kaplun Dimitris Kalimeris Tristan Yang Benjamin L Edelman FredZhang and Boaz Barak SGD on neural networks learns functions of increasing complexity InNeurIPS 2019

Preetum Nakkiran Gal Kaplun Yamini Bansal Tristan Yang Boaz Barak and Ilya Sutskever Deepdouble descent Where bigger models and more data hurt In ICLR 2020

Nagarajan Natarajan Inderjit S Dhillon Pradeep K Ravikumar and Ambuj Tewari Learningwith noisy labels In NeurIPS 2013

Yuval Netzer Tao Wang Adam Coates Alessandro Bissacco Bo Wu and Andrew Y Ng Readingdigits in natural images with unsupervised feature learning In NeurIPS Workshop on DeepLearning and Unsupervised Feature Learning 2011

Andrew Y Ng Preventing ldquooveringrdquo of cross-validation data In ICML 1997

Adam Paszke Sam Gross Francisco Massa Adam Lerer James Bradbury Gregory Chanan TrevorKilleen Zeming Lin Natalia Gimelshein Luca Antiga Alban Desmaison Andreas Kopf EdwardYang Zachary DeVito Martin Raison Alykhan Tejani Sasank Chilamkurthy Benoit SteinerLu Fang Junjie Bai and Soumith Chintala PyTorch An imperative style high-performancedeep learning library In NeurIPS 2019

Giorgio Patrini Alessandro Rozza Aditya K Menon Richard Nock and Lizhen Making deepneural networks robust to label noise A loss correction approach In CVPR 2017

Herbert Robbins and Suon Monro A stochastic approximation method Annals of MathematicalStatistics 22400ndash407 1951

Rebecca Roelofs Vaishaal Shankar Benjamin Recht Sara Fridovich-Keil Moritz Hardt John Millerand Ludwig Schmidt A meta-analysis of overing in machine learning In NeurIPS 2019

Bobak Shahriari Kevin Swersky Ziyu Wang Ryan P Adams and Nando de Freitas Taking thehuman out of the loop A review of bayesian optimization Proceedings of the IEEE 104148ndash1752016

16

Connor Shorten and Taghi M Khoshgoaar A survey on image data augmentation for deeplearning Journal of Big Data 6 2019

Jasper Snoek Hugo Larochelle and Ryan P Adams Practical bayesian optimization of machinelearning algorithms In NeurIPS 2012

Nitish Srivastava Georey Hinton Alex Krizhevsky Ilya Sutskever and Ruslan SalakhutdinovDropout A simple way to prevent neural networks from overing Journal of Machine LearningResearch 151929ndash1958 2014

Masashi Sugiyama Introduction to statistical machine learning Morgan Kaufmann 2015

Christian Szegedy Vincent Vanhoucke Sergey Ioe and Jon Shlens Rethinking the inceptionarchitecture for computer vision In CVPR 2016

Sunil ulasidasan Gopinath Chennupati Je Bilmes Tanmoy Bhaacharya and Sarah MichalakOn mixup training Improved calibration and predictive uncertainty for deep neural networksIn NeurIPS 2019

Robert Tibshirani Regression shrinkage and selection via the lasso Journal of the Royal StatisticalSociety Series B (Methodological) 58267ndash288 1996

A N Tikhonov and V Y Arsenin Solutions of Ill Posed Problems Winston 1977

Andrey Nikolayevich Tikhonov On the stability of inverse problems Doklady Akademii NaukSSSR 39195ndash198 1943

Antonio Torralba Rob Fergus and William T Freeman 80 million tiny images A large data setfor nonparametric object and scene recognition In IEEE Trans PAMI 2008

Brendan van Rooyen and Robert C Williamson A theory of learning with corrupted labelsJournal of Machine Learning Research 181ndash50 2018

Vikas Verma Alex Lamb Juho Kannala Yoshua Bengio and David Lopez-Paz Interpolationconsistency training for semi-supervised learning In IJCAI 2019

Stefan Wager Sida Wang and Percy Liang Dropout training as adaptive regularization In NeurIPS2013

Roman Werpachowski Andrs Gyrgy and Csaba Szepesvri Detecting overing via adversarialexamples In NeurIPS 2019

Han Xiao Kashif Rasul and Roland Vollgraf Fashion-MNIST a novel image dataset for bench-marking machine learning algorithms arXiv preprint arXiv170807747 2017

Chiyuan Zhang Samy Bengio Moritz Hardt Benjamin Recht and Oriol Vinyals Understandingdeep learning requires rethinking generalization In ICLR 2017

17

Hongyi Zhang Moustapha Cisse Yann N Dauphin and David Lopez-Paz mixup Beyondempirical risk minimization In ICLR 2018

Tong Zhang Statistical behavior and consistency of classication methods based on convex riskminimization e Annals of Statistics 3256ndash85 2004

18

A Proof of eoremProof If the ooding level is b then the proposed ooding estimator is

R(g) = |R(g)minus b|+ b (12)

Since the absolute operator can be expressed with a max operator with max(a b) = a+b+|aminusb|2

theproposed estimator can be re-expressed as

R(g) = 2max(R(g) b)minus R(g) = Aminus R(g) (13)

For convenience we used A = 2max(R(g) b) From the denition of MSE

MSE(R(g)) = E[(R(g)minusR(g))2] (14)MSE(R(g)) = E[(R(g)minusR(g))2] (15)

= E[(Aminus R(g)minusR(g))2)] (16)= E[A2]minus 2E[A(R(g) +R(g))] + E[(R(g) +R(g))2] (17)

We are interested in the sign of

MSE(R(g))minusMSE(R(g)) = E[minus4R(g)R(g)minus A2 + 2A(R(g) +R(g))] (18)

Dene the inside of the expectation as B = minus4R(g)R(g)minus A2 + 2A(R(g) + R(g)) B can bedivided into two cases depending on the outcome of the max operator

B =

minus4R(g)R(g)minus 4R(g)2 + 4R(g)(R(g) +R(g)) if R(g) ge b

minus4R(g)R(g)minus 4b2 + 4b(R(g) +R(g)) if R(g) lt b(19)

=

0 if R(g) ge b

minus4R(g)R(g)minus 4b2 + 4b(R(g) +R(g)) if R(g) lt b(20)

=

0 if R(g) ge b

minus4(bminus R(g))(bminusR(g)) if R(g) lt b (21)

e laer case becomes positive when R(g) lt b lt R(g) erefore when R(g) lt b lt R(g)

MSE(R(g))minusMSE(R(g)) gt 0 (22)MSE(R(g)) gt MSE(R(g)) (23)

When b le R(g)

MSE(R(g))minusMSE(R(g)) gt 0 (24)MSE(R(g)) = MSE(R(g)) (25)

19

B Bayes Risk for Gaussian DistributionsIn this section we explain in detail how we derived the Bayes risk with respect to the surrogateloss in the experiments with Gaussian data in Section 41 Since we are using the logistic loss inthe synthetic experiments the loss of the margin is

`(yg(x)) = log(1 + exp(minusyg(x))) (26)

where g(x) Rd rarr R is a scalar instead of the vector denition that was used previously becausethe synthetic experiments only consider binary classication Take the derivative to derive

partE[`(yg(x))]partg(x)

=E[log(1 + exp(minusyg(x)))]

partg(x)= E

[minusy exp(minusyg(x))1 + exp(minusyg(x))

∣∣∣∣∣x]p(x) (27)

= E

[minusy

1 + exp(yg(x))

∣∣∣∣∣x]p(x) (28)

= E

[y + 1

2

1

exp(minusg(x)) + 1+y minus 1

2

minus1exp(g(x)) + 1

∣∣∣∣∣x]p(x) (29)

= E

[y + 1

2

∣∣∣∣∣x]

1

exp(minusg(x)) + 1p(x) + E

[y minus 1

2

∣∣∣∣∣x]

minus1exp(g(x)) + 1

p(x) (30)

= p(y = +1|x) 1

exp(minusg(x)) + 1p(x) + p(y = minus1|x) minus1

exp(g(x)) + 1p(x) (31)

Set this to zero divide by p(x) gt 0 to obtain

p(y = minus1|x) 1

exp(minusg(x)) + 1= p(y = +1|x) 1

exp(g(x)) + 1(32)

exp(g(x)) =p(y = +1|x)p(y = minus1|x)

(33)

g(x) = logp(y = +1|x)p(y = minus1|x)

(34)

Since we are interested in the surrogate loss under this classier we plug this into the logisticloss to obtain the Bayes risk

E[`(yg(x))] = E[log(1 +

p(minusy|x)p(y|x)

)

]= E

[log(

1

p(y|x))

]= E[minus log p(y|x)] (35)

In the experiments in Section 41 we report the empirical version of this with the test dataset asthe Bayes risk

20

C Details of Experiments

C1 Benchmark DatasetsIn the experiments in Section 42 we use six benchmark datasets explained below

bull MNIST6 [Lecun et al 1998] is a 10 class dataset of handwrien digits 1 2 9 and 0 Eachsample is a 28times 28 grayscale image e number of training and test samples are 60000and 10000 respectively

bull Fashion-MNIST7 [Xiao et al 2017] is a 10 class dataset of fashion items T-shirttop TrouserPullover Dress Coat Sandal Shirt Sneaker Bag and Ankle boot Each sample is a 28times 28grayscale image e number of training and test samples are 60000 and 10000 respectively

bull Kuzushiji-MNIST8 [Clanuwat et al 2018] is a 10 class dataset of cursive Japanese (ldquoKuzushijirdquo)characters Each sample is a 28 times 28 grayscale image e number of training and testsamples are 60000 and 10000 respectively

bull CIFAR-109 is a 10 class dataset of various objects airplane automobile bird cat deer dogfrog horse ship and truck Each sample is a colored image in 32times 32times 3 RGB format It isa subset of the 80 million tiny images dataset [Torralba et al 2008] ere are 6000 imagesper class where 5000 are for training and 1000 are for test

bull CIFAR-10010 is a 100 class dataset of various objects Each class has 600 samples where 500samples are for training and 100 samples are for test is is also a subset of the 80 milliontiny images dataset [Torralba et al 2008]

bull SVHN11 [Netzer et al 2011] is a 10 class dataset of house numbers from Google Street Viewimages in 32times 32times 3 RGB format 73257 digits are for training and 26032 digits are fortesting

6httpyannlecuncomexdbmnist7httpsgithubcomzalandoresearchfashion-mnist8httpsgithubcomrois-codhkmnist9httpswwwcstorontoedusimkrizcifarhtml

10httpswwwcstorontoedusimkrizcifarhtml11httpufldlstanfordeduhousenumbers

21

C2 Chosen Flooding LevelsIn Table 4 we report the chosen ooding levels for our experiments with benchmark datasets

Table 4 e chosen ooding levels for benchmark experiments

Early stopping times X times XWeight decay times times X XFlooding X X X X

MNIST (08) 002 002 003 002MNIST (04) 002 003 000 002Fashion-MNIST (08) 000 000 mdash mdashFashion-MNIST (04) 000 000 mdash mdashKuzushiji-MNIST (08) 001 003 003 003Kuzushiji-MNIST (04) 003 002 004 003CIFAR-10 (08) 004 004 000 001CIFAR-10 (04) 003 003 000 000CIFAR-100 (08) 002 002 000 000CIFAR-100 (04) 003 003 000 000SVHN (08) 001 001 000 002SVHN (04) 001 009 003 003

D Learning CurvesIn Figure 4 we visualize learning curves for all datasets (including CIFAR-10 which we alreadyvisualized in Figure 2) We only show the learning curves for trainingvalidation proportion of 08since the results for 04 were similar with 08 Note that Figure 1(c) shows the learning curves forthe rst 80 epochs for CIFAR-10 without ooding Figure 1(d) shows the learning curves withooding when the ooding level is 018

22

(a) MNIST (000) (b) MNIST (001) (c) MNIST (002) (d) MNIST (003)

(e) Fashion-MNIST(000)

(f) Fashion-MNIST(002)

(g) Fashion-MNIST(005)

(h) Fashion-MNIST(010)

(i) KMNIST (000) (j) KMNIST (002) (k) KMNIST (005) (l) KMNIST (010)

(m) CIFAR-10 (000) (n) CIFAR-10 (003) (o) CIFAR-10 (007) (p) CIFAR-10 (020)

(q) CIFAR-100 (000) (r) CIFAR-100 (001) (s) CIFAR-100 (006) (t) CIFAR-100 (020)

(u) SVHN (000) (v) SVHN (001) (w) SVHN (006) (x) SVHN (020)

Figure 4 Learning curves of training and test loss e rst gure in each row is the learning curveswithout ooding e 2nd 3rd and 4th columns show the results with dierent ooding levels e oodinglevel increases towards the right-hand side 23

E Relationship between Performance and GradientsSettings We visualize the relationship between test performance (loss or accuracy) and gradientamplitude of the trainingtest loss in Figure 5 where the gradient amplitude is the `2 norm of thelter-nomalized gradient of the loss e lter-normalized gradient is the gradient appropriatelyscaled depending on the magnitude of the corresponding convolutional lter similarly to Liet al [2018] More specicically for each lter of every comvolutional layer we multiply thecorresponding elements of the gradient by the norm of the lter Note that a fully connected layeris a special case of convolutional layer and subject to this scaling We exclude Fashion-MNISTbecause the optimal ooding level was zero We used the results with trainingvalidation splitratio of 08

Results For the gures with gradient amplitude of training loss on the horizontal axis ldquordquomarks (w ooding) are oen ploed on the right of ldquo+rdquo marks (wo ooding) which impliesthat ooding prevents the model from staying a local minimum For the gures with gradientamplitude of test loss on the horizontal axis we can observe the method with ooding (ldquordquo)improves performance while the gradient amplitude becomes smaller On the other hand theperformance with the method without ooding (ldquo+rdquo) degenerates while the gradient amplitude oftest loss keeps increasing

24

(a) MNIST xtrainyloss

(b) MNIST xtrainyacc

(c) MNIST xtestyloss

(d) MNIST xtest yacc

(e) K-M xtrain yloss (f) K-M xtrain yacc (g) K-M xtest yloss (h) K-M xtest yacc

(i) C10 xtrain yloss (j) C10 xtrain yacc (k) C10 xtest yloss (l) C10 xtest yacc

(m) C100 xtrainyloss

(n) C100 xtrain yacc (o) C100 xtest yloss (p) C100 xtest yacc

(q) SVHN xtrainyloss

(r) SVHN xtrainyacc

(s) SVHN xtest yloss (t) SVHN xtest yacc

Figure 5 Relationship between test performance (loss or accuracy) and amplitude of gradient (with trainingor test loss) Each point (ldquordquo or ldquo+rdquo) in the gures corresponds to a single model at a certain epoch Weremove the rst 5 epochs and plot the rest ldquordquo is used for the method with ooding and ldquo+rdquo is used forthe method without ooding e large black ldquordquo and ldquo+rdquo show the epochs with early stopping e colorbecomes lighter (purplerarr yellow) as the training proceeds K-M C10 and C100 stand for Kuzushiji-MNISTCIFAR-10 and CIFAR-100

25

F FlatnessSettings We follow Li et al [2018] and give a one-dimensional visualization of atness for eachdataset in Figure 6 We exclude Fashion-MNIST because the optimal ooding level was zero Weused the results with trainingvalidation split ratio of 08 We compare the atness of the modelright aer the empirical risk with respect to a mini-batch becomes smaller than the ooding levelRm(g) lt b for the rst time (doed blue line) and the model aer training (solid blue line) Wealso compare them with the model trained by the baseline method without ooding and trainingis nished (solid red line)

Results According to Figure 6 the test loss becomes lower and more at during the trainingwith ooding Note that the training loss on the other hand continues to oat around the oodinglevel until the end of training aer it enters the ooding zone We expect that the model makes arandom walk and escapes regions with sharp loss landscapes during the period is may be apossible reason for beer generalization results with our proposed method

[Figure 6 panels (a)–(j): MNIST, Kuzushiji-MNIST, CIFAR-10, CIFAR-100, and SVHN, each with a training-loss panel and a test-loss panel.]

Figure 6: One-dimensional visualization of flatness. We visualize the training/test loss with respect to perturbation. We depict the results for 3 models: the model when the empirical risk with respect to training data is below the flooding level for the first time during training (dotted blue), the model at the end of training with flooding (solid blue), and the model at the end of training without flooding (solid red).


The loss function ℓ can be the zero-one loss,

$$\ell_{01}(\boldsymbol{v}, z') = \begin{cases} 0 & \text{if } \operatorname{argmax}_{z \in \{1, \ldots, K\}} v_z = z', \\ 1 & \text{otherwise}, \end{cases} \qquad (2)$$

where $\boldsymbol{v} = (v_1, \ldots, v_K)^\top \in \mathbb{R}^K$, or a surrogate loss such as the softmax cross-entropy loss,

$$\ell_{\mathrm{CE}}(\boldsymbol{v}, z') = -\log \frac{\exp(v_{z'})}{\sum_{z \in \{1, \ldots, K\}} \exp(v_z)}. \qquad (3)$$

For a surrogate loss ℓ, we denote the classification risk by

$$R(g) = \mathbb{E}_{p(\boldsymbol{x}, y)}[\ell(g(\boldsymbol{x}), y)], \qquad (4)$$

where $\mathbb{E}_{p(\boldsymbol{x}, y)}[\cdot]$ is the expectation over $(\boldsymbol{x}, y) \sim p(\boldsymbol{x}, y)$. We use $R_{01}(g)$ to denote Eq. (4) when $\ell = \ell_{01}$, and call it the classification error.

The goal of multi-class classification is to learn g that minimizes the classification error $R_{01}(g)$. In optimization, we consider the minimization of the risk with an almost surely differentiable surrogate loss $R(g)$ instead, to make the problem more tractable. Furthermore, since $p(\boldsymbol{x}, y)$ is usually unknown and there is no way to exactly evaluate $R(g)$, we minimize its empirical version calculated from the training data instead:

$$\widehat{R}(g) = \frac{1}{n} \sum_{i=1}^{n} \ell(g(\boldsymbol{x}_i), y_i), \qquad (5)$$

where $\{(\boldsymbol{x}_i, y_i)\}_{i=1}^{n}$ are i.i.d. sampled from $p(\boldsymbol{x}, y)$. We call $\widehat{R}$ the empirical risk.

We would like to clarify some of the undefined terms used in the title and the introduction. The "training/test loss" is the empirical risk with respect to the surrogate loss function ℓ over the training/test data, respectively. We refer to the "training/test error" as the empirical risk with respect to $\ell_{01}$ over the training/test data, respectively (which is equal to one minus accuracy) [Zhang, 2004].

Finally, we formally define the Bayes risk as

$$R^* = \inf_{h} R(h), \qquad (6)$$

where the infimum is taken over all vector-valued functions $h: \mathbb{R}^d \to \mathbb{R}^K$. The Bayes risk is often referred to as the Bayes error if the zero-one loss is used:

$$\inf_{h} R_{01}(h). \qquad (7)$$

3.2 Algorithm

With flexible models, $\widehat{R}(g)$ w.r.t. a surrogate loss can easily become small if not zero, as we mentioned in Section 1; see [C] in Fig. 1(a). We propose a method that "floods the bottom area and sinks the original empirical risk" as in Fig. 1(b), so that the empirical risk cannot go below the flooding level. More technically, if we denote the flooding level as b, our proposed training objective with flooding is a simple fix.

Definition 1. The flooded empirical risk is defined as⁴

$$\widetilde{R}(g) = |\widehat{R}(g) - b| + b. \qquad (8)$$

Note that when b = 0, then $\widetilde{R}(g) = \widehat{R}(g)$. The gradient of $\widetilde{R}(g)$ w.r.t. model parameters will point in the same direction as that of $\widehat{R}(g)$ when $\widehat{R}(g) > b$, but in the opposite direction when $\widehat{R}(g) < b$. This means that when the learning objective is above the flooding level, we perform gradient descent as usual (gravity zone), but when the learning objective is below the flooding level, we perform gradient ascent instead (buoyancy zone).
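As a concrete illustration, a minimal PyTorch sketch of the flooded objective in Eq. (8) might look as follows; the cross-entropy surrogate and the function name are our own choices for the example:

import torch
import torch.nn.functional as F

def flooded_loss(logits: torch.Tensor, targets: torch.Tensor, b: float) -> torch.Tensor:
    """Flooded empirical risk of Eq. (8): descent above b, ascent below b."""
    risk = F.cross_entropy(logits, targets)  # original empirical risk on this batch
    return (risk - b).abs() + b              # the one-line flooding fix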

The issue is that, in general, we seldom know the optimal flooding level in advance. This issue can be mitigated by searching for the optimal flooding level b* with a hyper-parameter optimization technique. In practice, we can search for the optimal flooding level by performing an exhaustive search in parallel.

3.3 Implementation

For large scale problems, we can employ mini-batched stochastic optimization for efficient computation. Suppose that we have M disjoint mini-batch splits. We denote the empirical risk (5) with respect to the m-th mini-batch by $\widehat{R}_m(g)$ for $m \in \{1, \ldots, M\}$. Then our mini-batched optimization performs gradient descent updates in the direction of the gradient of the flooded mini-batch risk $|\widehat{R}_m(g) - b| + b$. By the convexity of the absolute value function and Jensen's inequality, we have

$$\widetilde{R}(g) \le \frac{1}{M} \sum_{m=1}^{M} \left( |\widehat{R}_m(g) - b| + b \right). \qquad (9)$$

This indicates that mini-batched optimization will simply minimize an upper bound of the full-batch case with $\widetilde{R}(g)$.
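A hedged sketch of one epoch of such mini-batched training; the model, optimizer, data loader, and the cross-entropy surrogate are placeholders for whatever the actual experiment uses:

import torch
import torch.nn.functional as F

def train_one_epoch(model, loader, optimizer, b, device="cpu"):
    """Gradient updates on the flooded risk |R_m(g) - b| + b of each mini-batch."""
    model.train()
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        optimizer.zero_grad()
        risk = F.cross_entropy(model(x), y)   # empirical risk of this mini-batch
        flooded = (risk - b).abs() + b        # flooding applied per mini-batch
        flooded.backward()
        optimizer.step()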

3.4 Theoretical Analysis

In the following theorem, we will show that the mean squared error (MSE) of the proposed risk estimator with flooding is smaller than that of the original risk estimator without flooding.

Theorem 1. Fix any measurable vector-valued function g. If the flooding level b satisfies $\widehat{R}(g) < b < R(g)$, we have

$$\mathrm{MSE}(\widehat{R}(g)) > \mathrm{MSE}(\widetilde{R}(g)). \qquad (10)$$

If $b \le \widehat{R}(g)$, we have

$$\mathrm{MSE}(\widehat{R}(g)) = \mathrm{MSE}(\widetilde{R}(g)). \qquad (11)$$

A proof is given in Appendix A. If we regard $\widehat{R}(g)$ as the training loss and $R(g)$ as the test loss, we would want b to be between those two for the MSE to improve.
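Not part of the paper's experiments, but a quick numerical illustration of this MSE comparison under an assumed loss distribution (per-sample losses drawn i.i.d. from an exponential distribution with true risk R = 1.0) could look like this:

import numpy as np

rng = np.random.default_rng(0)
R, n, trials = 1.0, 50, 100_000
losses = rng.exponential(scale=R, size=(trials, n))
R_hat = losses.mean(axis=1)                 # empirical risk for each trial

for b in [0.0, 0.5, 0.9, 1.5]:
    R_tilde = np.abs(R_hat - b) + b         # flooded estimator
    print(f"b={b:.1f}  MSE(hat)={np.mean((R_hat - R) ** 2):.5f}"
          f"  MSE(tilde)={np.mean((R_tilde - R) ** 2):.5f}")

For b below the true risk, the flooded estimator's MSE is never larger than the original one's, while a flooding level above the true risk (b = 1.5 here) can hurt.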

⁴Strictly speaking, Eq. (1) is different from Eq. (8), since Eq. (1) can ignore constant terms of the original empirical risk. We will refer to Eq. (8) as the flooding operator for the rest of the paper.

4 Experiments

In this section, we show experimental results with synthetic and benchmark datasets. The implementation is based on PyTorch [Paszke et al., 2019] and demo code will be available.⁵ Experiments were carried out with NVIDIA GeForce GTX 1080 Ti, NVIDIA Quadro RTX 5000, and Intel Xeon Gold 6142.

4.1 Synthetic Experiments

The aim of our synthetic experiments is to study the behavior of flooding with a controlled setup. We use three types of synthetic data described below.

Two Gaussians Data. We perform binary classification with two 10-dimensional Gaussian distributions with identity covariance matrix and means $\mu_P = [0, 0, \ldots, 0]^\top$ and $\mu_N = [m, m, \ldots, m]^\top$, where $m \in \{0.8, 1.0\}$. The Bayes risks for m = 1.0 and m = 0.8 are 0.14 and 0.24, respectively, where proofs are shown in Appendix B. The training, validation, and test sample sizes are 25, 10,000, and 10,000 per class, respectively.
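A small NumPy sketch of how such data could be generated (the function name and seed handling are our own):

import numpy as np

def two_gaussians(n_per_class, d=10, m=1.0, seed=0):
    """Two d-dimensional Gaussians with identity covariance: the positive class
    is centered at the origin, the negative class at [m, ..., m]."""
    rng = np.random.default_rng(seed)
    x_pos = rng.standard_normal((n_per_class, d))        # mean 0
    x_neg = rng.standard_normal((n_per_class, d)) + m    # mean [m, ..., m]
    x = np.vstack([x_pos, x_neg])
    y = np.concatenate([np.ones(n_per_class), -np.ones(n_per_class)])
    return x, y

x_train, y_train = two_gaussians(25)   # 25 training samples per class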

Sinusoid Data. The sinusoid data [Nakkiran et al., 2019] are generated as follows. We first draw input data points uniformly from the inside of a 2-dimensional ball of radius 1. Then we put class labels based on

$$y = \mathrm{sign}(\boldsymbol{x}^\top \boldsymbol{w} + \sin(\boldsymbol{x}^\top \boldsymbol{w}')),$$

where $\boldsymbol{w}$ and $\boldsymbol{w}'$ are any two 2-dimensional vectors such that $\boldsymbol{w} \perp \boldsymbol{w}'$. The training, validation, and test sample sizes are 100, 100, and 20,000, respectively.
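A sketch of one way to generate this data; the particular orthogonal pair w, w' and the rejection-sampling step are our own choices:

import numpy as np

def sinusoid(n, seed=0):
    """Uniform points in the unit disk labeled by y = sign(x^T w + sin(x^T w'))."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-1, 1, size=(4 * n, 2))     # oversample the square, then ...
    x = x[(x ** 2).sum(axis=1) <= 1][:n]        # ... keep the first n points inside the disk
    w, w_prime = np.array([1.0, 0.0]), np.array([0.0, 1.0])  # any orthogonal pair
    y = np.sign(x @ w + np.sin(x @ w_prime))
    return x, y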

Spiral Data. The spiral data [Sugiyama, 2015] are two-dimensional synthetic data. Let $\theta^+_1 = 0, \theta^+_2, \ldots, \theta^+_{n_+} = 4\pi$ be equally spaced $n_+$ points in the interval $[0, 4\pi]$, and $\theta^-_1 = 0, \theta^-_2, \ldots, \theta^-_{n_-} = 4\pi$ be equally spaced $n_-$ points in the interval $[0, 4\pi]$. Let positive and negative input data points be

$$\boldsymbol{x}^+_{i_+} = \theta^+_{i_+} [\cos(\theta^+_{i_+}), \sin(\theta^+_{i_+})]^\top + \tau \boldsymbol{\nu}^+_{i_+},$$
$$\boldsymbol{x}^-_{i_-} = (\theta^-_{i_-} + \pi) [\cos(\theta^-_{i_-}), \sin(\theta^-_{i_-})]^\top + \tau \boldsymbol{\nu}^-_{i_-},$$

for $i_+ = 1, \ldots, n_+$ and $i_- = 1, \ldots, n_-$, where $\tau$ controls the magnitude of the noise, and $\boldsymbol{\nu}^+_i$ and $\boldsymbol{\nu}^-_i$ are i.i.d. distributed according to the two-dimensional standard normal distribution. Then we make data for classification by $\{(\boldsymbol{x}_i, y_i)\}_{i=1}^{n} = \{(\boldsymbol{x}^+_{i_+}, +1)\}_{i_+=1}^{n_+} \cup \{(\boldsymbol{x}^-_{i_-}, -1)\}_{i_-=1}^{n_-}$, where $n = n_+ + n_-$. The training, validation, and test sample sizes are 100, 100, and 10,000 per class, respectively.
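A sketch of the corresponding generator; the noise scale tau is not specified in the text above, so the value here is an assumption:

import numpy as np

def spiral(n_per_class, tau=0.2, seed=0):
    """Two interleaved spirals with Gaussian noise of magnitude tau."""
    rng = np.random.default_rng(seed)
    theta = np.linspace(0.0, 4.0 * np.pi, n_per_class)
    ring = np.c_[np.cos(theta), np.sin(theta)]
    x_pos = theta[:, None] * ring + tau * rng.standard_normal((n_per_class, 2))
    x_neg = (theta + np.pi)[:, None] * ring + tau * rng.standard_normal((n_per_class, 2))
    x = np.vstack([x_pos, x_neg])
    y = np.concatenate([np.ones(n_per_class), -np.ones(n_per_class)])
    return x, y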

For Two Gaussians, we use a one-hidden-layer feedforward neural network with 500 units in the hidden layer with the ReLU activation function [Nair and Hinton, 2010]. We train the network for 1000 epochs with the logistic loss and vanilla gradient descent with learning rate of 0.05. The flooding level is chosen from $b \in \{0, 0.01, 0.02, \ldots, 0.40\}$. For Sinusoid and Spiral, we use a four-hidden-layer feedforward neural network with 500 units in the hidden layer, with the ReLU activation function [Nair and Hinton, 2010] and batch normalization [Ioffe and Szegedy, 2015]. We train the network for 500 epochs with the logistic loss and the Adam [Kingma and Ba, 2015] optimizer with mini-batch size 100 and learning rate of 0.001. The flooding level is chosen from $b \in \{0, 0.01, 0.02, \ldots, 0.20\}$. Note that training with b = 0 is identical to the baseline method without flooding. We report the test accuracy of the flooding level with the best validation accuracy. We first conduct experiments without early stopping, which means that the last epoch was chosen for all flooding levels.

⁵https://github.com/takashiishida/flooding

Table 2: Experimental results for the synthetic data. Sub-table (A) shows the results without early stopping. Sub-table (B) shows the results with early stopping. The better method is shown in bold in each of sub-tables (A) and (B). "BR" stands for the Bayes risk. For Two Gaussians, the distance between the positive and negative distributions is larger for larger m. See the description in Section 4.1 for the details.

Data | Setting | (A) Without Flooding | (A) With Flooding | (A) Chosen b | (B) Without Flooding | (B) With Flooding | (B) Chosen b
Two Gaussians | m: 1.0, BR: 0.14 | 87.96 | 92.25 | 0.28 | 91.63 | 92.25 | 0.27
Two Gaussians | m: 0.8, BR: 0.24 | 82.00 | 87.31 | 0.33 | 86.57 | 87.29 | 0.35
Sinusoid | Label Noise: 0.01 | 93.84 | 94.46 | 0.01 | 92.54 | 92.54 | 0.00
Sinusoid | Label Noise: 0.05 | 91.12 | 95.44 | 0.10 | 93.26 | 94.60 | 0.01
Sinusoid | Label Noise: 0.10 | 86.57 | 96.02 | 0.17 | 96.70 | 96.70 | 0.00
Spiral | Label Noise: 0.01 | 98.96 | 97.85 | 0.01 | 98.60 | 98.88 | 0.01
Spiral | Label Noise: 0.05 | 93.87 | 96.24 | 0.04 | 96.58 | 95.62 | 0.14
Spiral | Label Noise: 0.10 | 89.70 | 92.96 | 0.16 | 89.70 | 92.96 | 0.16
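For concreteness, a hedged PyTorch sketch of the four-hidden-layer network with batch normalization used for Sinusoid and Spiral (the Linear → BatchNorm → ReLU ordering and the single output unit for the logistic loss are our assumptions):

import torch.nn as nn

def mlp_bn(in_dim=2, hidden=500, depth=4, out_dim=1):
    """Feedforward net with `depth` hidden layers of `hidden` units each,
    ReLU activations and batch normalization, and one output for binary
    classification with the logistic loss."""
    layers, d = [], in_dim
    for _ in range(depth):
        layers += [nn.Linear(d, hidden), nn.BatchNorm1d(hidden), nn.ReLU()]
        d = hidden
    layers.append(nn.Linear(d, out_dim))
    return nn.Sequential(*layers)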

Results. The results are summarized in Table 2. It is worth noting that, for Two Gaussians, the chosen flooding level b is larger for the smaller distance between the two distributions, which is when the classification task is harder and the Bayes risk becomes larger since the two distributions become less separated. We see similar tendencies for the Sinusoid and Spiral data: a larger b was chosen for a larger flipping probability of the label noise, which is expected to increase the Bayes risk. This implies a positive correlation between the optimal flooding level and the Bayes risk, as is also partially suggested by Theorem 1. Another interesting observation is that the chosen b is close to but higher than the Bayes risk for the Two Gaussians data. This may look inconsistent with Theorem 1. However, it makes sense to adopt a larger b with a stronger regularization effect that allows some bias as a trade-off for reducing the variance of the risk estimator. In fact, Theorem 1 does not deny the possibility that some $b \ge R(g)$ achieves even better estimation.

From (A) in Table 2, we can see that the method with flooding often improves test accuracy over the baseline method without flooding. As we mentioned in the introduction, it can be harmful to keep training a model until the end without flooding. However, with flooding, the model at the final epoch has good prediction performance according to the results, which implies that flooding helps the late-stage training improve test accuracy.

We also conducted experiments with early stopping, meaning that we chose the model that recorded the best validation accuracy during training. The results are reported in sub-table (B) of Table 2. Compared with sub-table (A), we see that early stopping improves the baseline method without flooding well in many cases. This indicates that training longer without flooding was harmful in our experiments. On the other hand, the accuracy for flooding combined with early stopping is often close to that of flooding without early stopping, meaning that training until the end with flooding tends to be already as good as doing so with early stopping. The table shows that flooding often improves or retains the test accuracy of the baseline method without flooding even after deploying early stopping. Flooding does not hurt performance, but can be beneficial for methods used with early stopping.

Table 3: Results with benchmark datasets. We report classification accuracy for all combinations of weight decay (✓ and ✗), early stopping (✓ and ✗), and flooding (✓ and ✗). The second column shows the training/validation split used for the experiment. W stands for weight decay, E stands for early stopping, and F stands for flooding. "—" means that a flooding level of zero was optimal. "N/A" means that we skipped the experiments because zero weight decay was optimal in the case without flooding. The best and equivalent are shown in bold by comparing "with flooding" and "without flooding" for two columns with the same setting for W and E, e.g., the first and fifth columns out of the 8 columns. The best performing combination is highlighted.

Dataset | tr/va split | W:✗ E:✗ F:✗ | W:✗ E:✓ F:✗ | W:✓ E:✗ F:✗ | W:✓ E:✓ F:✗ | W:✗ E:✗ F:✓ | W:✗ E:✓ F:✓ | W:✓ E:✗ F:✓ | W:✓ E:✓ F:✓
MNIST | 0.8 | 98.32 | 98.30 | 98.51 | 98.42 | 98.46 | 98.53 | 98.50 | 98.48
MNIST | 0.4 | 97.71 | 97.70 | 97.82 | 97.91 | 97.74 | 97.85 | — | 97.83
Fashion-MNIST | 0.8 | 89.34 | 89.36 | N/A | N/A | — | — | N/A | N/A
Fashion-MNIST | 0.4 | 88.48 | 88.63 | 88.60 | 88.62 | — | — | — | —
Kuzushiji-MNIST | 0.8 | 91.63 | 91.62 | 91.63 | 91.71 | 92.40 | 92.12 | 92.11 | 91.97
Kuzushiji-MNIST | 0.4 | 89.18 | 89.18 | 89.58 | 89.73 | 90.41 | 90.15 | 89.71 | 89.88
CIFAR-10 | 0.8 | 73.59 | 73.36 | 73.65 | 73.57 | 73.06 | 73.44 | — | 74.41
CIFAR-10 | 0.4 | 66.39 | 66.63 | 69.31 | 69.28 | 67.20 | 67.58 | — | —
CIFAR-100 | 0.8 | 42.16 | 42.33 | 42.67 | 42.45 | 42.50 | 42.36 | — | —
CIFAR-100 | 0.4 | 34.27 | 34.34 | 37.97 | 38.82 | 34.99 | 35.14 | — | —
SVHN | 0.8 | 92.38 | 92.41 | 93.20 | 92.99 | 92.78 | 92.79 | — | 93.42
SVHN | 0.4 | 90.32 | 90.35 | 90.43 | 90.49 | 90.57 | 90.61 | 91.16 | 91.21

[Figure 2 panels (a)–(d): CIFAR-10 with flooding levels 0.00, 0.03, 0.07, and 0.20.]

Figure 2: Learning curves of training and test loss for training/validation proportion of 0.8. (a) shows the learning curves without flooding. (b), (c), and (d) show the learning curves with different flooding levels.

4.2 Benchmark Experiments

We next perform experiments with benchmark datasets. Not only do we compare with the baseline without flooding, we also compare or combine with other general regularization methods, which are early stopping and weight decay.

Settings. We use the following six benchmark datasets: MNIST, Fashion-MNIST, Kuzushiji-MNIST, CIFAR-10, CIFAR-100, and SVHN. The details of the benchmark datasets can be found in Appendix C.1. We split the original training dataset into training and validation data with different proportions: 0.8 or 0.4 (meaning 80% or 40% was used for training and the rest was used for validation, respectively). We perform the exhaustive hyper-parameter search for the flooding level with candidates from $\{0.00, 0.01, \ldots, 0.20\}$. The number of epochs is 500. Stochastic gradient descent [Robbins and Monro, 1951] is used with learning rate of 0.1 and momentum of 0.9. For MNIST, Fashion-MNIST, and Kuzushiji-MNIST, we use a one-hidden-layer feedforward neural network with 500 units and the ReLU activation function [Nair and Hinton, 2010]. For CIFAR-10, CIFAR-100, and SVHN, we used ResNet-18 [He et al., 2016]. We do not use any data augmentation or manual learning rate decay. We deployed early stopping in the same way as in Section 4.1.

We first ran experiments with the following candidates for the weight decay rate: $\{1 \times 10^{-5}, 1 \times 10^{-4}, 4 \times 10^{-4}, 7 \times 10^{-4}, 1 \times 10^{-3}, 4 \times 10^{-3}, 7 \times 10^{-3}\}$. We choose the weight decay rate with the best validation accuracy for each dataset and each training/validation proportion. Then, fixing the weight decay to the chosen one, we ran experiments with flooding level candidates from $\{0, 0.01, \ldots, 0.20\}$, to investigate whether weight decay and flooding have complementary effects, or if adding weight decay will diminish the accuracy gain of flooding.
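A schematic sketch of this two-stage search; train_and_validate is a hypothetical helper that trains a model with the given hyper-parameters and returns its validation accuracy:

import numpy as np

def two_stage_search(train_and_validate,
                     wd_grid=(1e-5, 1e-4, 4e-4, 7e-4, 1e-3, 4e-3, 7e-3),
                     flood_grid=np.arange(0.0, 0.201, 0.01)):
    """First pick the weight decay without flooding, then search the flooding level."""
    best_wd = max(wd_grid, key=lambda wd: train_and_validate(weight_decay=wd, flood_level=0.0))
    best_b = max(flood_grid, key=lambda b: train_and_validate(weight_decay=best_wd, flood_level=b))
    return best_wd, best_b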

Results. We show the results in Table 3 and the chosen flooding levels in Table 4 in Appendix C.2. We can observe that flooding gives better accuracy for most cases. We can also see that combining flooding with early stopping or with both early stopping and weight decay may lead to even better accuracy in some cases.

[Figure 3 panels (a)–(d): without early stopping (0.8), without early stopping (0.4), with early stopping (0.8), with early stopping (0.4).]

Figure 3: We show that the optimal flooding level maintains memorization. The vertical axis shows the training accuracy. The horizontal axis shows the flooding level. We show results with and without early stopping and for different training/validation splits with proportion of 0.8 or 0.4. The marks are placed on the flooding level that was chosen based on validation accuracy.

4.3 Memorization

Can we maintain memorization even after adding flooding? We investigate whether the trained model has zero training error (100% accuracy) for the flooding level that was chosen with validation data. We show the results for all benchmark datasets and all training/validation splits with proportions 0.8 and 0.4. We also show the case without early stopping (choosing the last epoch) and with early stopping (choosing the epoch with the highest validation accuracy). The results are shown in Fig. 3.

All figures show downward curves, implying that the model will eventually give up memorizing all training data as the flooding level becomes higher. A more interesting and important observation is the position of the optimal flooding level (the one chosen by validation accuracy, which is marked in Fig. 3). We can observe that the marks are often plotted at zero error, and in some cases there is a mark on the highest flooding level that maintains zero error. These results are consistent with recent empirical works that imply zero training error leads to lower generalization error [Belkin et al., 2019, Nakkiran et al., 2020], but we further demonstrate that zero training loss may be harmful under zero training error.

5 Conclusion

We proposed a novel regularization method called flooding that keeps the training loss floating around a small constant value, to avoid zero training loss. In our experiments, the optimal flooding level often maintained memorization of training data with zero error. With flooding, we showed that the test accuracy will improve for various benchmark datasets, and theoretically showed that the mean squared error will be reduced under certain conditions.

As a byproduct, we were able to produce a double descent curve for the test loss with a relatively small number of epochs, e.g., in around 100 epochs, as shown in Fig. 2 and Fig. 4 in Appendix D. An important future direction is to study the relationship between this and the double descent curves from previous works [Belkin et al., 2019, Krogh and Hertz, 1992, Nakkiran et al., 2020].

It would also be interesting to see if Bayesian optimization [Shahriari et al., 2016, Snoek et al., 2012] methods can be utilized to search for the optimal flooding level efficiently. We will investigate this direction in the future.

Acknowledgements

We thank Chang Xu, Genki Yamamoto, Kento Nozawa, Nontawat Charoenphakdee, Voot Tangkaratt, and Yoshihiro Nagano for the helpful discussions. TI was supported by the Google PhD Fellowship Program. MS and IY were supported by JST CREST Grant Number JPMJCR18A2 including the AIP challenge program, Japan.

References

Devansh Arpit, Stanisław Jastrzębski, Nicolas Ballas, David Krueger, Emmanuel Bengio, Maxinder S. Kanwal, Tegan Maharaj, Asja Fischer, Aaron Courville, Yoshua Bengio, and Simon Lacoste-Julien. A closer look at memorization in deep networks. In ICML, 2017.
Mikhail Belkin, Daniel Hsu, and Partha P. Mitra. Overfitting or perfect fitting? Risk bounds for classification and regression rules that interpolate. In NeurIPS, 2018.
Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machine-learning practice and the classical bias-variance trade-off. PNAS, 116:15850–15854, 2019.
David Berthelot, Nicholas Carlini, Ian Goodfellow, Nicolas Papernot, Avital Oliver, and Colin A. Raffel. MixMatch: A holistic approach to semi-supervised learning. In NeurIPS, 2019.
Christopher M. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer, 2011.
Christopher Michael Bishop. Regularization and complexity control in feed-forward networks. In ICANN, 1995.
Rich Caruana, Steve Lawrence, and C. Lee Giles. Overfitting in neural nets: Backpropagation, conjugate gradient, and early stopping. In NeurIPS, 2000.
Pratik Chaudhari, Anna Choromanska, Stefano Soatto, Yann LeCun, Carlo Baldassi, Christian Borgs, Jennifer Chayes, Levent Sagun, and Riccardo Zecchina. Entropy-SGD: Biasing gradient descent into wide valleys. In ICLR, 2017.
Jesús Cid-Sueiro, Darío García-García, and Raúl Santos-Rodríguez. Consistency of losses for learning from weak labels. In ECML-PKDD, 2014.
Tarin Clanuwat, Mikel Bober-Irizar, Asanobu Kitamoto, Alex Lamb, Kazuaki Yamamoto, and David Ha. Deep learning for classical Japanese literature. In NeurIPS Workshop on Machine Learning for Creativity and Design, 2018.
Marthinus Christoffel du Plessis, Gang Niu, and Masashi Sugiyama. Analysis of learning from positive and unlabeled data. In NeurIPS, 2014.
Marthinus Christoffel du Plessis, Gang Niu, and Masashi Sugiyama. Convex formulation for learning from positive and unlabeled data. In ICML, 2015.
Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.
Hongyu Guo, Yongyi Mao, and Richong Zhang. Augmenting data with mixup for sentence classification: An empirical study. arXiv:1905.08941, 2019.
Bo Han, Gang Niu, Jiangchao Yao, Xingrui Yu, Miao Xu, Ivor Tsang, and Masashi Sugiyama. Pumpout: A meta approach to robust deep learning with noisy labels. arXiv:1809.11008, 2018.
Stephen Jose Hanson and Lorien Y. Pratt. Comparing biases for minimal network construction with back-propagation. In NeurIPS, 1988.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
Takashi Ishida, Gang Niu, Aditya Krishna Menon, and Masashi Sugiyama. Complementary-label learning for arbitrary losses and models. In ICML, 2019.
Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. In ICLR, 2017.
Diederik P. Kingma and Jimmy L. Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
Ryuichi Kiryo, Gang Niu, Marthinus Christoffel du Plessis, and Masashi Sugiyama. Positive-unlabeled learning with non-negative risk estimator. In NeurIPS, 2017.
Anders Krogh and John A. Hertz. A simple weight decay can improve generalization. In NeurIPS, 1992.
Yann Lecun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. In Proceedings of the IEEE, pages 2278–2324, 1998.
Hao Li, Zheng Xu, Gavin Taylor, Christoph Studer, and Tom Goldstein. Visualizing the loss landscape of neural nets. In NeurIPS, 2018.
Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In ICLR, 2019.
Nan Lu, Tianyi Zhang, Gang Niu, and Masashi Sugiyama. Mitigating overfitting in supervised classification from two unlabeled datasets: A consistent risk correction approach. In AISTATS, 2020.
N. Morgan and H. Bourlard. Generalization and parameter estimation in feedforward nets: Some experiments. In NeurIPS, 1990.
Vinod Nair and Geoffrey E. Hinton. Rectified linear units improve restricted Boltzmann machines. In ICML, 2010.
Preetum Nakkiran, Gal Kaplun, Dimitris Kalimeris, Tristan Yang, Benjamin L. Edelman, Fred Zhang, and Boaz Barak. SGD on neural networks learns functions of increasing complexity. In NeurIPS, 2019.
Preetum Nakkiran, Gal Kaplun, Yamini Bansal, Tristan Yang, Boaz Barak, and Ilya Sutskever. Deep double descent: Where bigger models and more data hurt. In ICLR, 2020.
Nagarajan Natarajan, Inderjit S. Dhillon, Pradeep K. Ravikumar, and Ambuj Tewari. Learning with noisy labels. In NeurIPS, 2013.
Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y. Ng. Reading digits in natural images with unsupervised feature learning. In NeurIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.
Andrew Y. Ng. Preventing "overfitting" of cross-validation data. In ICML, 1997.
Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An imperative style, high-performance deep learning library. In NeurIPS, 2019.
Giorgio Patrini, Alessandro Rozza, Aditya K. Menon, Richard Nock, and Lizhen Qu. Making deep neural networks robust to label noise: A loss correction approach. In CVPR, 2017.
Herbert Robbins and Sutton Monro. A stochastic approximation method. Annals of Mathematical Statistics, 22:400–407, 1951.
Rebecca Roelofs, Vaishaal Shankar, Benjamin Recht, Sara Fridovich-Keil, Moritz Hardt, John Miller, and Ludwig Schmidt. A meta-analysis of overfitting in machine learning. In NeurIPS, 2019.
Bobak Shahriari, Kevin Swersky, Ziyu Wang, Ryan P. Adams, and Nando de Freitas. Taking the human out of the loop: A review of Bayesian optimization. Proceedings of the IEEE, 104:148–175, 2016.
Connor Shorten and Taghi M. Khoshgoftaar. A survey on image data augmentation for deep learning. Journal of Big Data, 6, 2019.
Jasper Snoek, Hugo Larochelle, and Ryan P. Adams. Practical Bayesian optimization of machine learning algorithms. In NeurIPS, 2012.
Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15:1929–1958, 2014.
Masashi Sugiyama. Introduction to Statistical Machine Learning. Morgan Kaufmann, 2015.
Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, and Jon Shlens. Rethinking the inception architecture for computer vision. In CVPR, 2016.
Sunil Thulasidasan, Gopinath Chennupati, Jeff Bilmes, Tanmoy Bhattacharya, and Sarah Michalak. On mixup training: Improved calibration and predictive uncertainty for deep neural networks. In NeurIPS, 2019.
Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58:267–288, 1996.
A. N. Tikhonov and V. Y. Arsenin. Solutions of Ill Posed Problems. Winston, 1977.
Andrey Nikolayevich Tikhonov. On the stability of inverse problems. Doklady Akademii Nauk SSSR, 39:195–198, 1943.
Antonio Torralba, Rob Fergus, and William T. Freeman. 80 million tiny images: A large data set for nonparametric object and scene recognition. In IEEE Trans. PAMI, 2008.
Brendan van Rooyen and Robert C. Williamson. A theory of learning with corrupted labels. Journal of Machine Learning Research, 18:1–50, 2018.
Vikas Verma, Alex Lamb, Juho Kannala, Yoshua Bengio, and David Lopez-Paz. Interpolation consistency training for semi-supervised learning. In IJCAI, 2019.
Stefan Wager, Sida Wang, and Percy Liang. Dropout training as adaptive regularization. In NeurIPS, 2013.
Roman Werpachowski, András György, and Csaba Szepesvári. Detecting overfitting via adversarial examples. In NeurIPS, 2019.
Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.
Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. In ICLR, 2017.
Hongyi Zhang, Moustapha Cisse, Yann N. Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. In ICLR, 2018.
Tong Zhang. Statistical behavior and consistency of classification methods based on convex risk minimization. The Annals of Statistics, 32:56–85, 2004.

A Proof of Theorem

Proof. If the flooding level is b, then the proposed flooding estimator is

$$\widetilde{R}(g) = |\widehat{R}(g) - b| + b. \qquad (12)$$

Since the absolute value operator can be expressed with a max operator as $\max(a, b) = \frac{a + b + |a - b|}{2}$, the proposed estimator can be re-expressed as

$$\widetilde{R}(g) = 2\max(\widehat{R}(g), b) - \widehat{R}(g) = A - \widehat{R}(g). \qquad (13)$$

For convenience, we used $A = 2\max(\widehat{R}(g), b)$. From the definition of MSE,

$$\mathrm{MSE}(\widehat{R}(g)) = \mathbb{E}[(\widehat{R}(g) - R(g))^2], \qquad (14)$$
$$\mathrm{MSE}(\widetilde{R}(g)) = \mathbb{E}[(\widetilde{R}(g) - R(g))^2] \qquad (15)$$
$$= \mathbb{E}[(A - \widehat{R}(g) - R(g))^2] \qquad (16)$$
$$= \mathbb{E}[A^2] - 2\mathbb{E}[A(\widehat{R}(g) + R(g))] + \mathbb{E}[(\widehat{R}(g) + R(g))^2]. \qquad (17)$$

We are interested in the sign of

$$\mathrm{MSE}(\widehat{R}(g)) - \mathrm{MSE}(\widetilde{R}(g)) = \mathbb{E}[-4\widehat{R}(g)R(g) - A^2 + 2A(\widehat{R}(g) + R(g))]. \qquad (18)$$

Define the inside of the expectation as $B = -4\widehat{R}(g)R(g) - A^2 + 2A(\widehat{R}(g) + R(g))$. B can be divided into two cases, depending on the outcome of the max operator:

$$B = \begin{cases} -4\widehat{R}(g)R(g) - 4\widehat{R}(g)^2 + 4\widehat{R}(g)(\widehat{R}(g) + R(g)) & \text{if } \widehat{R}(g) \ge b, \\ -4\widehat{R}(g)R(g) - 4b^2 + 4b(\widehat{R}(g) + R(g)) & \text{if } \widehat{R}(g) < b \end{cases} \qquad (19)$$
$$= \begin{cases} 0 & \text{if } \widehat{R}(g) \ge b, \\ -4\widehat{R}(g)R(g) - 4b^2 + 4b(\widehat{R}(g) + R(g)) & \text{if } \widehat{R}(g) < b \end{cases} \qquad (20)$$
$$= \begin{cases} 0 & \text{if } \widehat{R}(g) \ge b, \\ -4(b - \widehat{R}(g))(b - R(g)) & \text{if } \widehat{R}(g) < b. \end{cases} \qquad (21)$$

The latter case becomes positive when $\widehat{R}(g) < b < R(g)$. Therefore, when $\widehat{R}(g) < b < R(g)$,

$$\mathrm{MSE}(\widehat{R}(g)) - \mathrm{MSE}(\widetilde{R}(g)) > 0, \qquad (22)$$
$$\mathrm{MSE}(\widehat{R}(g)) > \mathrm{MSE}(\widetilde{R}(g)). \qquad (23)$$

When $b \le \widehat{R}(g)$,

$$\mathrm{MSE}(\widehat{R}(g)) - \mathrm{MSE}(\widetilde{R}(g)) = 0, \qquad (24)$$
$$\mathrm{MSE}(\widehat{R}(g)) = \mathrm{MSE}(\widetilde{R}(g)). \qquad (25)$$

B Bayes Risk for Gaussian Distributions

In this section, we explain in detail how we derived the Bayes risk with respect to the surrogate loss in the experiments with Gaussian data in Section 4.1. Since we are using the logistic loss in the synthetic experiments, the loss of the margin is

$$\ell(y g(\boldsymbol{x})) = \log(1 + \exp(-y g(\boldsymbol{x}))), \qquad (26)$$

where $g(\boldsymbol{x}): \mathbb{R}^d \to \mathbb{R}$ is a scalar instead of the vector definition that was used previously, because the synthetic experiments only consider binary classification. Take the derivative to derive

$$\frac{\partial \mathbb{E}[\ell(y g(\boldsymbol{x}))]}{\partial g(\boldsymbol{x})} = \frac{\partial \mathbb{E}[\log(1 + \exp(-y g(\boldsymbol{x})))]}{\partial g(\boldsymbol{x})} = \mathbb{E}\left[\frac{-y \exp(-y g(\boldsymbol{x}))}{1 + \exp(-y g(\boldsymbol{x}))} \,\middle|\, \boldsymbol{x}\right] p(\boldsymbol{x}) \qquad (27)$$
$$= \mathbb{E}\left[\frac{-y}{1 + \exp(y g(\boldsymbol{x}))} \,\middle|\, \boldsymbol{x}\right] p(\boldsymbol{x}) \qquad (28)$$
$$= \mathbb{E}\left[\frac{y + 1}{2} \cdot \frac{-1}{\exp(g(\boldsymbol{x})) + 1} + \frac{1 - y}{2} \cdot \frac{1}{\exp(-g(\boldsymbol{x})) + 1} \,\middle|\, \boldsymbol{x}\right] p(\boldsymbol{x}) \qquad (29)$$
$$= \mathbb{E}\left[\frac{y + 1}{2} \,\middle|\, \boldsymbol{x}\right] \frac{-1}{\exp(g(\boldsymbol{x})) + 1} p(\boldsymbol{x}) + \mathbb{E}\left[\frac{1 - y}{2} \,\middle|\, \boldsymbol{x}\right] \frac{1}{\exp(-g(\boldsymbol{x})) + 1} p(\boldsymbol{x}) \qquad (30)$$
$$= -p(y = +1 \mid \boldsymbol{x}) \frac{1}{\exp(g(\boldsymbol{x})) + 1} p(\boldsymbol{x}) + p(y = -1 \mid \boldsymbol{x}) \frac{1}{\exp(-g(\boldsymbol{x})) + 1} p(\boldsymbol{x}). \qquad (31)$$

Setting this to zero and dividing by $p(\boldsymbol{x}) > 0$, we obtain

$$p(y = -1 \mid \boldsymbol{x}) \frac{1}{\exp(-g(\boldsymbol{x})) + 1} = p(y = +1 \mid \boldsymbol{x}) \frac{1}{\exp(g(\boldsymbol{x})) + 1}, \qquad (32)$$
$$\exp(g(\boldsymbol{x})) = \frac{p(y = +1 \mid \boldsymbol{x})}{p(y = -1 \mid \boldsymbol{x})}, \qquad (33)$$
$$g(\boldsymbol{x}) = \log \frac{p(y = +1 \mid \boldsymbol{x})}{p(y = -1 \mid \boldsymbol{x})}. \qquad (34)$$

Since we are interested in the surrogate loss under this classifier, we plug this into the logistic loss to obtain the Bayes risk:

$$\mathbb{E}[\ell(y g(\boldsymbol{x}))] = \mathbb{E}\left[\log\left(1 + \frac{p(-y \mid \boldsymbol{x})}{p(y \mid \boldsymbol{x})}\right)\right] = \mathbb{E}\left[\log\left(\frac{1}{p(y \mid \boldsymbol{x})}\right)\right] = \mathbb{E}[-\log p(y \mid \boldsymbol{x})]. \qquad (35)$$

In the experiments in Section 4.1, we report the empirical version of this with the test dataset as the Bayes risk.
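As a sanity check, the empirical Bayes risk for the Two Gaussians setting can be approximated numerically; here is a hedged NumPy sketch that assumes equal class priors and identity covariance, as in Section 4.1:

import numpy as np

def empirical_bayes_risk(m=1.0, d=10, n_per_class=10_000, seed=0):
    """Monte-Carlo estimate of E[-log p(y|x)] for two identity-covariance
    Gaussians centered at 0 (positive) and [m, ..., m] (negative)."""
    rng = np.random.default_rng(seed)
    mu_p, mu_n = np.zeros(d), np.full(d, m)
    x = np.vstack([rng.standard_normal((n_per_class, d)) + mu_p,
                   rng.standard_normal((n_per_class, d)) + mu_n])
    y = np.concatenate([np.ones(n_per_class), -np.ones(n_per_class)])
    # log p(x | y=-1) - log p(x | y=+1) for identity-covariance Gaussians
    log_ratio = ((x - mu_p) ** 2).sum(axis=1) / 2 - ((x - mu_n) ** 2).sum(axis=1) / 2
    post_pos = 1.0 / (1.0 + np.exp(log_ratio))          # p(y = +1 | x)
    post_true = np.where(y == 1, post_pos, 1.0 - post_pos)
    return -np.log(post_true).mean()

print(empirical_bayes_risk(m=1.0))  # should come out near the 0.14 reported for m = 1.0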

C Details of Experiments

C.1 Benchmark Datasets

In the experiments in Section 4.2, we use six benchmark datasets explained below.

• MNIST⁶ [Lecun et al., 1998] is a 10 class dataset of handwritten digits: 1, 2, ..., 9 and 0. Each sample is a 28 × 28 grayscale image. The number of training and test samples are 60,000 and 10,000, respectively.

• Fashion-MNIST⁷ [Xiao et al., 2017] is a 10 class dataset of fashion items: T-shirt/top, Trouser, Pullover, Dress, Coat, Sandal, Shirt, Sneaker, Bag, and Ankle boot. Each sample is a 28 × 28 grayscale image. The number of training and test samples are 60,000 and 10,000, respectively.

• Kuzushiji-MNIST⁸ [Clanuwat et al., 2018] is a 10 class dataset of cursive Japanese ("Kuzushiji") characters. Each sample is a 28 × 28 grayscale image. The number of training and test samples are 60,000 and 10,000, respectively.

• CIFAR-10⁹ is a 10 class dataset of various objects: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck. Each sample is a colored image in 32 × 32 × 3 RGB format. It is a subset of the 80 million tiny images dataset [Torralba et al., 2008]. There are 6,000 images per class, where 5,000 are for training and 1,000 are for test.

• CIFAR-100¹⁰ is a 100 class dataset of various objects. Each class has 600 samples, where 500 samples are for training and 100 samples are for test. This is also a subset of the 80 million tiny images dataset [Torralba et al., 2008].

• SVHN¹¹ [Netzer et al., 2011] is a 10 class dataset of house numbers from Google Street View images, in 32 × 32 × 3 RGB format. 73,257 digits are for training and 26,032 digits are for testing.

⁶ http://yann.lecun.com/exdb/mnist/
⁷ https://github.com/zalandoresearch/fashion-mnist
⁸ https://github.com/rois-codh/kmnist
⁹ https://www.cs.toronto.edu/~kriz/cifar.html
¹⁰ https://www.cs.toronto.edu/~kriz/cifar.html
¹¹ http://ufldl.stanford.edu/housenumbers/

C.2 Chosen Flooding Levels

In Table 4, we report the chosen flooding levels for our experiments with benchmark datasets.

Table 4: The chosen flooding levels for benchmark experiments.

Dataset (tr/va split) | W:✗ E:✗ | W:✗ E:✓ | W:✓ E:✗ | W:✓ E:✓
MNIST (0.8) | 0.02 | 0.02 | 0.03 | 0.02
MNIST (0.4) | 0.02 | 0.03 | 0.00 | 0.02
Fashion-MNIST (0.8) | 0.00 | 0.00 | — | —
Fashion-MNIST (0.4) | 0.00 | 0.00 | — | —
Kuzushiji-MNIST (0.8) | 0.01 | 0.03 | 0.03 | 0.03
Kuzushiji-MNIST (0.4) | 0.03 | 0.02 | 0.04 | 0.03
CIFAR-10 (0.8) | 0.04 | 0.04 | 0.00 | 0.01
CIFAR-10 (0.4) | 0.03 | 0.03 | 0.00 | 0.00
CIFAR-100 (0.8) | 0.02 | 0.02 | 0.00 | 0.00
CIFAR-100 (0.4) | 0.03 | 0.03 | 0.00 | 0.00
SVHN (0.8) | 0.01 | 0.01 | 0.00 | 0.02
SVHN (0.4) | 0.01 | 0.09 | 0.03 | 0.03

D Learning Curves

In Figure 4, we visualize learning curves for all datasets (including CIFAR-10, which we already visualized in Figure 2). We only show the learning curves for training/validation proportion of 0.8, since the results for 0.4 were similar to those for 0.8. Note that Figure 1(c) shows the learning curves for the first 80 epochs for CIFAR-10 without flooding. Figure 1(d) shows the learning curves with flooding when the flooding level is 0.18.

[Figure 4 panels (a)–(x): each row is one dataset shown at four flooding levels: MNIST (0.00, 0.01, 0.02, 0.03), Fashion-MNIST (0.00, 0.02, 0.05, 0.10), Kuzushiji-MNIST (0.00, 0.02, 0.05, 0.10), CIFAR-10 (0.00, 0.03, 0.07, 0.20), CIFAR-100 (0.00, 0.01, 0.06, 0.20), and SVHN (0.00, 0.01, 0.06, 0.20).]

Figure 4: Learning curves of training and test loss. The first figure in each row shows the learning curves without flooding. The 2nd, 3rd, and 4th columns show the results with different flooding levels. The flooding level increases towards the right-hand side.

E Relationship between Performance and GradientsSettings We visualize the relationship between test performance (loss or accuracy) and gradientamplitude of the trainingtest loss in Figure 5 where the gradient amplitude is the `2 norm of thelter-nomalized gradient of the loss e lter-normalized gradient is the gradient appropriatelyscaled depending on the magnitude of the corresponding convolutional lter similarly to Liet al [2018] More specicically for each lter of every comvolutional layer we multiply thecorresponding elements of the gradient by the norm of the lter Note that a fully connected layeris a special case of convolutional layer and subject to this scaling We exclude Fashion-MNISTbecause the optimal ooding level was zero We used the results with trainingvalidation splitratio of 08

Results For the gures with gradient amplitude of training loss on the horizontal axis ldquordquomarks (w ooding) are oen ploed on the right of ldquo+rdquo marks (wo ooding) which impliesthat ooding prevents the model from staying a local minimum For the gures with gradientamplitude of test loss on the horizontal axis we can observe the method with ooding (ldquordquo)improves performance while the gradient amplitude becomes smaller On the other hand theperformance with the method without ooding (ldquo+rdquo) degenerates while the gradient amplitude oftest loss keeps increasing

24

(a) MNIST xtrainyloss

(b) MNIST xtrainyacc

(c) MNIST xtestyloss

(d) MNIST xtest yacc

(e) K-M xtrain yloss (f) K-M xtrain yacc (g) K-M xtest yloss (h) K-M xtest yacc

(i) C10 xtrain yloss (j) C10 xtrain yacc (k) C10 xtest yloss (l) C10 xtest yacc

(m) C100 xtrainyloss

(n) C100 xtrain yacc (o) C100 xtest yloss (p) C100 xtest yacc

(q) SVHN xtrainyloss

(r) SVHN xtrainyacc

(s) SVHN xtest yloss (t) SVHN xtest yacc

Figure 5 Relationship between test performance (loss or accuracy) and amplitude of gradient (with trainingor test loss) Each point (ldquordquo or ldquo+rdquo) in the gures corresponds to a single model at a certain epoch Weremove the rst 5 epochs and plot the rest ldquordquo is used for the method with ooding and ldquo+rdquo is used forthe method without ooding e large black ldquordquo and ldquo+rdquo show the epochs with early stopping e colorbecomes lighter (purplerarr yellow) as the training proceeds K-M C10 and C100 stand for Kuzushiji-MNISTCIFAR-10 and CIFAR-100

25

F FlatnessSettings We follow Li et al [2018] and give a one-dimensional visualization of atness for eachdataset in Figure 6 We exclude Fashion-MNIST because the optimal ooding level was zero Weused the results with trainingvalidation split ratio of 08 We compare the atness of the modelright aer the empirical risk with respect to a mini-batch becomes smaller than the ooding levelRm(g) lt b for the rst time (doed blue line) and the model aer training (solid blue line) Wealso compare them with the model trained by the baseline method without ooding and trainingis nished (solid red line)

Results According to Figure 6 the test loss becomes lower and more at during the trainingwith ooding Note that the training loss on the other hand continues to oat around the oodinglevel until the end of training aer it enters the ooding zone We expect that the model makes arandom walk and escapes regions with sharp loss landscapes during the period is may be apossible reason for beer generalization results with our proposed method

26

(a) MNIST (train) (b) MNIST (test) (c) Kuzushiji-MNIST(train)

(d) Kuzushiji-MNIST(test)

(e) CIFAR-10 (train) (f) CIFAR-10 (test) (g) CIFAR-100 (train) (h) CIFAR-100 (test)

(i) SVHN (train) (j) SVHN (test)

Figure 6 One-dimensional visualization of atness We visualize the trainingtest loss with respect toperturbation We depict the results for 3 models the model when the empirical risk with respect to trainingdata is below the ooding level for the rst time during training (doed blue) the model at the end oftraining with ooding (solid blue) and the model at the end of training without ooding (solid red)

27

  • 1 Introduction
  • 2 Backgrounds
    • 21 Regularization Methods
    • 22 Double Descent Curves with Overparametrization
    • 23 Lower-Bounding the Empirical Risk
      • 3 Flooding How to Avoid Zero Training Loss
        • 31 Preliminaries
        • 32 Algorithm
        • 33 Implementation
        • 34 Theoretical Analysis
          • 4 Experiments
            • 41 Synthetic Experiments
            • 42 Benchmark Experiments
            • 43 Memorization
              • 5 Conclusion
              • A Proof of Theorem
              • B Bayes Risk for Gaussian Distributions
              • C Details of Experiments
                • C1 Benchmark Datasets
                • C2 Chosen Flooding Levels
                  • D Learning Curves
                  • E Relationship between Performance and Gradients
                  • F Flatness
Page 8: Do We Need Zero Training Loss A›er Achieving Zero Training … · 2020. 2. 21. · directly aim to avoid zero training loss, they o›en fail to maintain a moderate level of training

Denition 1 e ooded empirical risk is dened as4

R(g) = |R(g)minus b|+ b (8)

Note that when b = 0 then R(g) = R(g) e gradient of R(g) wrt model parameters willpoint to the same direction as that of R(g) when R(g) gt b but in the opposite direction whenR(g) lt b is means that when the learning objective is above the ooding level we performgradient descent as usual (gravity zone) but when the learning objective is below the oodinglevel we perform gradient ascent instead (buoyancy zone)

e issue is that in general we seldom know the optimal ooding level in advance isissue can be mitigated by searching for the optimal ooding level blowast with a hyper-parameteroptimization technique In practice we can search for the optimal ooding level by performingthe exhaustive search in parallel

33 ImplementationFor large scale problems we can employ mini-batched stochastic optimization for ecient com-putation Suppose that we have M disjoint mini-batch splits We denote the empirical risk (5)with respect to the m-th mini-batch by Rm(g) for m isin 1 M en our mini-batchedoptimization performs gradient descent updates in the direction of the gradient of Rm(g) By theconvexity of the absolute value function and Jensenrsquos inequality we have

R(g) le 1

M

Msumm=1

(|Rm(g)minus b|+ b

) (9)

is indicates that mini-batched optimization will simply minimize an upper bound of the full-batch case with R(g)

34 eoretical AnalysisIn the following theorem we will show that the mean squared error (MSE) of the proposed riskestimator with ooding is smaller than that of the original risk estimator without ooding

eorem 1 Fix any measurable vector-valued function g If the ooding level b satises R(g) ltb lt R(g) we have

MSE(R(g)) gt MSE(R(g)) (10)

If b le R(g) we haveMSE(R(g)) = MSE(R(g)) (11)

A proof is given in Appendix A If we regard R(g) as the training loss and R(g) as the testloss we would want b to be between those two for the MSE to improve

4Strictly speaking Eq (1) is dierent from Eq (8) since Eq (1) can ignore constant terms of the original empiricalrisk We will refer to Eq (8) for the ooding operator for the rest of the paper

8

4 ExperimentsIn this section we show experimental results with synthetic and benchmark datasets e imple-mentation is based on PyTorch [Paszke et al 2019] and demo code will be available5 Experimentswere carried out with NVIDIA GeForce GTX 1080 Ti NVIDIA adro RTX 5000 and Intel XeonGold 6142

41 Synthetic Experimentse aim of our synthetic experiments is to study the behavior of ooding with a controlled setupWe use three types of synthetic data described below

Two Gaussians Data We perform binary classication with two 10-dimensional Gaus-sian distributions with covariance matrix identity and means microP = [0 0 0]gt and microN =[mm m]gt where m isin 08 10 e Bayes risk for m = 10 and m = 08 are 014 and024 respectively where proofs are shown in Appendix B e training validation and test samplesizes are 25 10000 and 10000 per class respectively

Sinusoid Data e sinusoid data [Nakkiran et al 2019] are generated as follows We rstdraw input data points uniformly from the inside of a 2-dimensional ball of radius 1 en we putclass labels based on

y = sign(xgtw + sin(xgtwprime))

where w and wprime are any two 2-dimesional vectors such that w perp wprime e training validationand test sample sizes are 100 100 and 20000 respectively

Spiral Data e spiral data [Sugiyama 2015] are two-dimensional synthetic data Letθ+1 = 0 θ+2 θ

+n+ = 4π be equally spaced n+ points in the interval [0 4π] and θminus1 =

0 θminus2 θminusnminus = 4π be equally spaced nminus points in the interval [0 4π] Let positive and negative

input data points be

x+i+ = θ+i+ [cos(θ

+i+) sin(θ

+i+)]

gt + τν+i+

xminusiminus = (θminusiminus + π)[cos(θminusiminus) sin(θminusiminus)]

gt + τνminusiminus

for i+ = 1 n+ and iminus = 1 nminus where τ controls the magnitude of the noise ν+i and

νminusi are iid distributed according to the two-dimensional standard normal distribution enwe make data for classication by (xi yi)ni=1 = (x+

i+ +1)n+

i+=1 cup (xminusiminus minus1)

nminus

iminus=1 wheren = n+ + nminus e training validation and test sample sizes are 100 100 and 10000 per classrespectively

For Two Gaussians we use a one-hidden-layer feedforward neural network with 500 unitsin the hidden layer with the ReLU activation function [Nair and Hinton 2010] We train thenetwork for 1000 epochs with the logistic loss and vanilla gradient descent with learning rate of005 e ooding level is chosen from b isin 0 001 002 040 For Sinusoid and Spiral weuse a four-hidden-layer feedforward neural network with 500 units in the hidden layer with the

5httpsgithubcomtakashiishidaflooding

9

Table 2 Experimental results for the synthetic data Sub-table (A) shows the results without early stoppingSub-table (B) shows the results with early stopping e beer method is shown in bold in each of sub-tables (A) and (B) ldquoBRrdquo stands for the Bayes risk For Two Gaussians the distance between the positive andnegative distributions is larger for larger m See the description in Section 41 for the details

(A) Without Early Stopping (B) With Early Stopping

Data Seing WithoutFlooding

WithFlooding

Chosenb

WithoutFlooding

WithFlooding

Chosenb

Two Gaussians m 10 BR 014 8796 9225 028 9163 9225 027Two Gaussians m 08 BR 024 8200 8731 033 8657 8729 035Sinusoid Label Noise 001 9384 9446 001 9254 9254 000Sinusoid Label Noise 005 9112 9544 010 9326 9460 001Sinusoid Label Noise 010 8657 9602 017 9670 9670 000Spiral Label Noise 001 9896 9785 001 9860 9888 001Spiral Label Noise 005 9387 9624 004 9658 9562 014Spiral Label Noise 010 8970 9296 016 8970 9296 016

ReLU activation function [Nair and Hinton 2010] and batch normalization [Ioe and Szegedy2015] We train the network for 500 epochs with the logistic loss and Adam [Kingma and Ba2015] optimizer with 100 mini-batch size and learning rate of 0001 e ooding level is chosenfrom b isin 0 001 002 020 Note that training with b = 0 is identical to the baselinemethod without ooding We report the test accuracy of the ooding level with the best validationaccuracy We rst conduct experiments without early stopping which means that the last epochwas chosen for all ooding levels

Results e results are summarized in Table 2 It is worth noting that for Two Gaussians thechosen ooding level b is larger for the smaller distance between the two distributions which iswhen the classication task is harder and the Bayes risk becomes larger since the two distributionsbecome less separated We see similar tendencies for Sinusoid and Spiral data a larger b waschosen for larger ipping probability for label noise which is expected to increase the Bayes riskis implies the positive correlation between the optimal ooding level and the Bayes risk asis also partially suggested by eorem 1 Another interesting observation is that the chosen bis close to but higher than the Bayes risk for Two Gaussians data is may look inconsistentwith eorem 1 However it makes sense to adopt larger b with stronger regularization eect thatallows some bias as a trade-o for reducing the variance of the risk estimator In fact eorem 1does not deny the possibility that some b ge R(g) achieves even beer estimation

From (A) in Table 2 we can see that the method with ooding oen improves test accuracyover the baseline method without ooding As we mentioned in the introduction it ca be harmfulto keep training a model until the end without ooding However with ooding the model at thenal epoch has good prediction performance according to the results which implies that ooding

10

Table 3 Results with benchmark datasets We report classication accuracy for all combinations ofweight decay (X and times) early stopping (X and times) and ooding (X and times) e second column shows thetrainingvalidation split used for the experiment W stands for weight decay E stands for early stopping andF stands for ooding ldquomdashrdquo means that ooding level of zero was optimal ldquoNArdquo means that we skipped theexperiments because zero weight decay was optimal in the case without ooding e best and equivalentare shown in bold by comparing ldquowith oodingrdquo and ldquowithout oodingrdquo for two columns with the sameseing for W and E eg the rst and h columns out of the 8 columns e best performing combinationis highlighted

Datasettr W times times X X times times X Xva E times X times X times X times X

split F times times times times X X X X

MNIST 08 9832 9830 9851 9842 9846 9853 9850 984804 9771 9770 9782 9791 9774 9785 mdash 9783

Fashion- 08 8934 8936 NA NA mdash mdash NA NAMNIST 04 8848 8863 8860 8862 mdash mdash mdash mdash

Kuzushiji- 08 9163 9162 9163 9171 9240 9212 9211 9197MNIST 04 8918 8918 8958 8973 9041 9015 8971 8988

CIFAR-10 08 7359 7336 7365 7357 7306 7344 mdash 744104 6639 6663 6931 6928 6720 6758 mdash mdash

CIFAR-100 08 4216 4233 4267 4245 4250 4236 mdash mdash04 3427 3434 3797 3882 3499 3514 mdash mdash

SVHN 08 9238 9241 9320 9299 9278 9279 mdash 934204 9032 9035 9043 9049 9057 9061 9116 9121

helps the late-stage training improve test accuracyWe also conducted experiments with early stopping meaning that we chose the model that

recorded the best validation accuracy during training e results are reported in sub-table (B) ofTable 2 Compared with sub-table (A) we see that early stopping improves the baseline methodwithout ooding well in many cases is indicates that training longer without ooding washarmful in our experiments On the other hand the accuracy for ooding combined with earlystopping is oen close to that with early stopping meaning that training until the end withooding tends to be already as good as doing so with early stopping e table shows that oodingoen improves or retains the test accuracy of the baseline method without ooding even aerdeploying early stopping Flooding does not hurt performance but can be benecial for methodsused with early stopping

11

(a) CIFAR-10 (000) (b) CIFAR-10 (003) (c) CIFAR-10 (007) (d) CIFAR-10 (020)

Figure 2 Learning curves of training and test loss for trainingvalidation proportion of 08 (a) shows thelearning curves without ooding (b) (c) and (d) show the learning curves with dierent ooding levels

42 Benchmark ExperimentsWe next perform experiments with benchmark datasets Not only do we compare with the baselinewithout ooding we also compare or combine with other general regularization methods whichare early stopping and weight decay

Settings We use the following six benchmark datasets MNIST Fashion-MNIST Kuzushiji-MNIST CIFAR-10 CIFAR-100 and SVHN e details of the benchmark datasets can be found inAppendix C1 We split the original training dataset into training and validation data with dierentproportions 08 or 04 (meaning 80 or 40 was used for training and the rest was used forvalidation respectively) We perform the exhaustive hyper-parameter search for the ooding levelwith candidates from 000 001 020 e number of epochs is 500 Stochastic gradientdescent [Robbins and Monro 1951] is used with learning rate of 01 and momentum of 09 ForMNIST Fashion-MNIST and Kuzushiji-MNIST we use a one-hidden-layer feedforward neuralnetwork with 500 units and ReLU activation function [Nair and Hinton 2010] For CIFAR-10CIFAR-100 and SVHN we used ResNet-18 [He et al 2016] We do not use any data augmentationor manual learning rate decay We deployed early stopping in the same way as in Section 41

We rst ran experiments with the following candidates for the weight decay rate 1times10minus5 1times10minus4 4times 10minus4 7times 10minus4 1times 10minus3 4times 10minus3 7times 10minus3 We choose the weight decay rate withthe best validation accuracy for each dataset and each trainingvalidation proportion en xingthe weight decay to the chosen one we ran experiments with ooding level candidates from0 001 020 to investigate whether weight decay and ooding have complementary eectsor if adding weight decay will diminish the accuracy gain of ooding

Results We show the results in Table 3 and the chosen ooding levels in Table 4 in Appendix C2We can observe that ooding gives beer accuracy for most cases We can also see that combiningooding with early stopping or with both early stopping and weight decay may lead to even beeraccuracy in some cases

12

(a) wo early stop (08) (b) wo early stop (04) (c) w early stop (08) (d) w early stop (04)

Figure 3 We show the optimal ooding level maintains memorization e vertical axis shows the trainingaccuracy e horizontal axis shows the ooding level We show results with and without early stoppingand for dierent trainingvalidation splits with proportion of 08 or 04 e marks (4 O ) areplaced on the ooding level that was chosen based on validation accuracy

43 MemorizationCan we maintain memorization even aer adding ooding We investigate if the trained modelhas zero training error (100 accuracy) for the ooding level that was chosen with validation dataWe show the results for all benchmark datasets and all trainingvalidation splits with proportions08 and 04 We also show the case without early stopping (choosing the last epoch) and withearly stopping (choosing the epoch with the highest validation accuracy) e results are shownin Fig 3

All gures show downward curves implying that the model will give up eventually onmemorizing all training data as the ooding level becomes higher A more interesting andimportant observation is the position of the optimal ooding level (the one chosen by validationaccuracy which is marked with 4 O or ) We can observe that the marks are oen ploedat zero error and in some cases there is a mark on the highest ooding level that maintains zeroerror ese results are consistent with recent empirical works that imply zero training error leadsto lower generalization error [Belkin et al 2019 Nakkiran et al 2020] but we further demonstratethat zero training loss may be harmful under zero training error

5 ConclusionWe proposed a novel regularization method called ooding that keeps the training loss to stayaround a small constant value to avoid zero training loss In our experiments the optimal oodinglevel oen maintained memorization of training data with zero error With ooding we showedthat the test accuracy will improve for various benchmark datasets and theoretically showed thatthe mean squared error will be reduced under certain conditions

As a byproduct we were able to produce a double descent curve for the test loss with a relativelyfew number of epochs eg in around 100 epochs shown in Fig 2 and Fig 4 in Appendix D Animportant future direction is to study the relationship between this and the double descent curves

13

from previous works [Belkin et al 2019 Krogh and Hertz 1992 Nakkiran et al 2020]It would also be interesting to see if Bayesian optimization [Shahriari et al 2016 Snoek

et al 2012] methods can be utilized to search for the optimal ooding level eciently We willinvestigate this direction in the future

AcknowledgementsWe thank Chang Xu Genki Yamamoto Kento Nozawa Nontawat Charoenphakdee Voot Tangkaraand Yoshihiro Nagano for the helpful discussions TI was supported by the Google PhD FellowshipProgram MS and IY were supported by JST CREST Grant Number JPMJCR18A2 including AIPchallenge program Japan

ReferencesDevansh Arpit Stanisaw Jastrzebski Nicolas Ballas David Krueger Emmanuel Bengio Maxin-

der S Kanwal Tegan Maharaj Asja Fischer Aaron Courville Yoshua Bengio and SimonLacoste-Julien A closer look at memorization in deep networks In ICML 2017

Mikhail Belkin Daniel Hsu and Partha P Mitra Overing or perfect ing Risk bounds forclassication and regression rules that interpolate In NeurIPS 2018

Mikhail Belkin Daniel Hsu Siyuan Ma and Soumik Mandal Reconciling modern machine-learningpractice and the classical biasvariance trade-o PNAS 11615850ndash15854 2019

David Berthelot Nicholas Carlini Ian Goodfellow Nicolas Papernot Avital Oliver and Colin ARael MixMatch A holistic approach to semi-supervised learning In NeurIPS 2019

Christopher M Bishop Paern Recognition and Machine Learning (Information Science andStatistics) Springer 2011

Christopher Michael Bishop Regularization and complexity control in feed-forward networks InICANN 1995

Rich Caruana Steve Lawrence and C Lee Giles Overing in neural nets Backpropagationconjugate gradient and early stopping In NeurIPS 2000

Pratik Chaudhari Anna Choromanska Stefano Soao Yann LeCun Carlo Baldassi ChristianBorgs Jennifer Chayes Levent Sagun and Riccardo Zecchina Entropy-SGD Biasing gradientdescent into wide valleys In ICLR 2017

Jesus Cid-Sueiro Darıo Garcıa-Garcıa and Raul Santos-Rodrıguez Consistency of losses forlearning from weak labels In ECML-PKDD 2014

14

Tarin Clanuwat, Mikel Bober-Irizar, Asanobu Kitamoto, Alex Lamb, Kazuaki Yamamoto, and David Ha. Deep learning for classical Japanese literature. In NeurIPS Workshop on Machine Learning for Creativity and Design, 2018.

Marthinus Christoffel du Plessis, Gang Niu, and Masashi Sugiyama. Analysis of learning from positive and unlabeled data. In NeurIPS, 2014.

Marthinus Christoffel du Plessis, Gang Niu, and Masashi Sugiyama. Convex formulation for learning from positive and unlabeled data. In ICML, 2015.

Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.

Hongyu Guo, Yongyi Mao, and Richong Zhang. Augmenting data with mixup for sentence classification: An empirical study. arXiv:1905.08941, 2019.

Bo Han, Gang Niu, Jiangchao Yao, Xingrui Yu, Miao Xu, Ivor Tsang, and Masashi Sugiyama. Pumpout: A meta approach to robust deep learning with noisy labels. arXiv:1809.11008, 2018.

Stephen José Hanson and Lorien Y. Pratt. Comparing biases for minimal network construction with back-propagation. In NeurIPS, 1988.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.

Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.

Takashi Ishida, Gang Niu, Aditya Krishna Menon, and Masashi Sugiyama. Complementary-label learning for arbitrary losses and models. In ICML, 2019.

Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. In ICLR, 2017.

Diederik P. Kingma and Jimmy L. Ba. Adam: A method for stochastic optimization. In ICLR, 2015.

Ryuichi Kiryo, Gang Niu, Marthinus Christoffel du Plessis, and Masashi Sugiyama. Positive-unlabeled learning with non-negative risk estimator. In NeurIPS, 2017.

Anders Krogh and John A. Hertz. A simple weight decay can improve generalization. In NeurIPS, 1992.

Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, pages 2278–2324, 1998.

Hao Li, Zheng Xu, Gavin Taylor, Christoph Studer, and Tom Goldstein. Visualizing the loss landscape of neural nets. In NeurIPS, 2018.

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In ICLR, 2019.

Nan Lu, Tianyi Zhang, Gang Niu, and Masashi Sugiyama. Mitigating overfitting in supervised classification from two unlabeled datasets: A consistent risk correction approach. In AISTATS, 2020.

N. Morgan and H. Bourlard. Generalization and parameter estimation in feedforward nets: Some experiments. In NeurIPS, 1990.

Vinod Nair and Geoffrey E. Hinton. Rectified linear units improve restricted Boltzmann machines. In ICML, 2010.

Preetum Nakkiran, Gal Kaplun, Dimitris Kalimeris, Tristan Yang, Benjamin L. Edelman, Fred Zhang, and Boaz Barak. SGD on neural networks learns functions of increasing complexity. In NeurIPS, 2019.

Preetum Nakkiran, Gal Kaplun, Yamini Bansal, Tristan Yang, Boaz Barak, and Ilya Sutskever. Deep double descent: Where bigger models and more data hurt. In ICLR, 2020.

Nagarajan Natarajan, Inderjit S. Dhillon, Pradeep K. Ravikumar, and Ambuj Tewari. Learning with noisy labels. In NeurIPS, 2013.

Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y. Ng. Reading digits in natural images with unsupervised feature learning. In NeurIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.

Andrew Y. Ng. Preventing "overfitting" of cross-validation data. In ICML, 1997.

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An imperative style, high-performance deep learning library. In NeurIPS, 2019.

Giorgio Patrini, Alessandro Rozza, Aditya K. Menon, Richard Nock, and Lizhen Qu. Making deep neural networks robust to label noise: A loss correction approach. In CVPR, 2017.

Herbert Robbins and Sutton Monro. A stochastic approximation method. Annals of Mathematical Statistics, 22:400–407, 1951.

Rebecca Roelofs, Vaishaal Shankar, Benjamin Recht, Sara Fridovich-Keil, Moritz Hardt, John Miller, and Ludwig Schmidt. A meta-analysis of overfitting in machine learning. In NeurIPS, 2019.

Bobak Shahriari, Kevin Swersky, Ziyu Wang, Ryan P. Adams, and Nando de Freitas. Taking the human out of the loop: A review of Bayesian optimization. Proceedings of the IEEE, 104:148–175, 2016.

Connor Shorten and Taghi M. Khoshgoftaar. A survey on image data augmentation for deep learning. Journal of Big Data, 6, 2019.

Jasper Snoek, Hugo Larochelle, and Ryan P. Adams. Practical Bayesian optimization of machine learning algorithms. In NeurIPS, 2012.

Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15:1929–1958, 2014.

Masashi Sugiyama. Introduction to Statistical Machine Learning. Morgan Kaufmann, 2015.

Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, and Jon Shlens. Rethinking the Inception architecture for computer vision. In CVPR, 2016.

Sunil Thulasidasan, Gopinath Chennupati, Jeff Bilmes, Tanmoy Bhattacharya, and Sarah Michalak. On mixup training: Improved calibration and predictive uncertainty for deep neural networks. In NeurIPS, 2019.

Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58:267–288, 1996.

A. N. Tikhonov and V. Y. Arsenin. Solutions of Ill Posed Problems. Winston, 1977.

Andrey Nikolayevich Tikhonov. On the stability of inverse problems. Doklady Akademii Nauk SSSR, 39:195–198, 1943.

Antonio Torralba, Rob Fergus, and William T. Freeman. 80 million tiny images: A large data set for nonparametric object and scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2008.

Brendan van Rooyen and Robert C. Williamson. A theory of learning with corrupted labels. Journal of Machine Learning Research, 18:1–50, 2018.

Vikas Verma, Alex Lamb, Juho Kannala, Yoshua Bengio, and David Lopez-Paz. Interpolation consistency training for semi-supervised learning. In IJCAI, 2019.

Stefan Wager, Sida Wang, and Percy Liang. Dropout training as adaptive regularization. In NeurIPS, 2013.

Roman Werpachowski, András György, and Csaba Szepesvári. Detecting overfitting via adversarial examples. In NeurIPS, 2019.

Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.

Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. In ICLR, 2017.

Hongyi Zhang, Moustapha Cisse, Yann N. Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. In ICLR, 2018.

Tong Zhang. Statistical behavior and consistency of classification methods based on convex risk minimization. The Annals of Statistics, 32:56–85, 2004.

A Proof of Theorem

Proof. If the flooding level is b, then the proposed flooding estimator is

\tilde{R}(g) = |\hat{R}(g) - b| + b.  (12)

Since the absolute value can be expressed with a max operator, via \max(a, b) = \frac{a + b + |a - b|}{2}, the proposed estimator can be re-expressed as

\tilde{R}(g) = 2\max(\hat{R}(g), b) - \hat{R}(g) = A - \hat{R}(g),  (13)

where for convenience we used A = 2\max(\hat{R}(g), b). From the definition of MSE,

MSE(\hat{R}(g)) = E[(\hat{R}(g) - R(g))^2],  (14)
MSE(\tilde{R}(g)) = E[(\tilde{R}(g) - R(g))^2]  (15)
  = E[(A - \hat{R}(g) - R(g))^2]  (16)
  = E[A^2] - 2E[A(\hat{R}(g) + R(g))] + E[(\hat{R}(g) + R(g))^2].  (17)

We are interested in the sign of

MSE(\hat{R}(g)) - MSE(\tilde{R}(g)) = E[-4\hat{R}(g)R(g) - A^2 + 2A(\hat{R}(g) + R(g))].  (18)

Define the inside of the expectation as B = -4\hat{R}(g)R(g) - A^2 + 2A(\hat{R}(g) + R(g)). B can be divided into two cases, depending on the outcome of the max operator:

B = \begin{cases} -4\hat{R}(g)R(g) - 4\hat{R}(g)^2 + 4\hat{R}(g)(\hat{R}(g) + R(g)) & \text{if } \hat{R}(g) \ge b \\ -4\hat{R}(g)R(g) - 4b^2 + 4b(\hat{R}(g) + R(g)) & \text{if } \hat{R}(g) < b \end{cases}  (19)

  = \begin{cases} 0 & \text{if } \hat{R}(g) \ge b \\ -4\hat{R}(g)R(g) - 4b^2 + 4b(\hat{R}(g) + R(g)) & \text{if } \hat{R}(g) < b \end{cases}  (20)

  = \begin{cases} 0 & \text{if } \hat{R}(g) \ge b \\ -4(b - \hat{R}(g))(b - R(g)) & \text{if } \hat{R}(g) < b. \end{cases}  (21)

The latter case becomes positive when \hat{R}(g) < b < R(g). Therefore, when \hat{R}(g) < b < R(g),

MSE(\hat{R}(g)) - MSE(\tilde{R}(g)) > 0,  (22)
i.e., MSE(\hat{R}(g)) > MSE(\tilde{R}(g)).  (23)

When b \le \hat{R}(g),

MSE(\hat{R}(g)) - MSE(\tilde{R}(g)) = 0,  (24)
i.e., MSE(\hat{R}(g)) = MSE(\tilde{R}(g)).  (25)
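For reference, the estimator in Eq. (12) translates directly into a mini-batch training objective. The following is a minimal PyTorch-style sketch; the function and variable names and the example flooding level are illustrative, not taken from the paper's released code:

```python
import torch

def flooded(loss: torch.Tensor, b: float) -> torch.Tensor:
    # Flooding objective of Eq. (12): |loss - b| + b.
    # When the mini-batch loss drops below b, the sign of the gradient flips,
    # which realizes the gradient-ascent step around the flooding level.
    return (loss - b).abs() + b

# Illustrative use inside an otherwise standard training step:
#   loss = flooded(criterion(model(x), y), b=0.03)
#   loss.backward()
#   optimizer.step()
```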

B Bayes Risk for Gaussian Distributions

In this section, we explain in detail how we derived the Bayes risk with respect to the surrogate loss in the experiments with Gaussian data in Section 4.1. Since we are using the logistic loss in the synthetic experiments, the loss in terms of the margin yg(x) is

\ell(yg(x)) = \log(1 + \exp(-yg(x))),  (26)

where g(x): \mathbb{R}^d \to \mathbb{R} is a scalar instead of the vector definition that was used previously, because the synthetic experiments only consider binary classification. Take the derivative with respect to g(x):

\frac{\partial E[\ell(yg(x))]}{\partial g(x)} = \frac{\partial E[\log(1 + \exp(-yg(x)))]}{\partial g(x)}
 = E\left[\frac{-y \exp(-yg(x))}{1 + \exp(-yg(x))} \,\middle|\, x\right] p(x)  (27)
 = E\left[\frac{-y}{1 + \exp(yg(x))} \,\middle|\, x\right] p(x)  (28)
 = E\left[\frac{y+1}{2}\cdot\frac{-1}{\exp(g(x)) + 1} + \frac{y-1}{2}\cdot\frac{-1}{\exp(-g(x)) + 1} \,\middle|\, x\right] p(x)  (29)
 = E\left[\frac{y+1}{2} \,\middle|\, x\right]\frac{-1}{\exp(g(x)) + 1}\,p(x) + E\left[\frac{y-1}{2} \,\middle|\, x\right]\frac{-1}{\exp(-g(x)) + 1}\,p(x)  (30)
 = -p(y=+1|x)\,\frac{1}{\exp(g(x)) + 1}\,p(x) + p(y=-1|x)\,\frac{1}{\exp(-g(x)) + 1}\,p(x).  (31)

Set this to zero and divide by p(x) > 0 to obtain

p(y=-1|x)\,\frac{1}{\exp(-g(x)) + 1} = p(y=+1|x)\,\frac{1}{\exp(g(x)) + 1},  (32)
\exp(g(x)) = \frac{p(y=+1|x)}{p(y=-1|x)},  (33)
g(x) = \log\frac{p(y=+1|x)}{p(y=-1|x)}.  (34)

Since we are interested in the surrogate loss under this classifier, we plug this into the logistic loss to obtain the Bayes risk:

E[\ell(yg(x))] = E\left[\log\left(1 + \frac{p(-y|x)}{p(y|x)}\right)\right] = E\left[\log\frac{1}{p(y|x)}\right] = E[-\log p(y|x)].  (35)

In the experiments in Section 4.1, we report the empirical version of this with the test dataset as the Bayes risk.
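As a sanity check, the quantity in Eq. (35) can be estimated by Monte Carlo for the two-Gaussian setup of Section 4.1. The sketch below is an illustration under the assumptions stated there (equal class priors, identity covariance, means 0 and m·1 in 10 dimensions); it is not the authors' code:

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, n = 10, 0.8, 100_000             # dimension, mean separation, samples per class
mu_p, mu_n = np.zeros(d), np.full(d, m)

def log_sigmoid(z):
    return -np.logaddexp(0.0, -z)

def logit_pos(x):
    # log [p(y=+1|x) / p(y=-1|x)] for equal-prior Gaussians N(mu_p, I) and N(mu_n, I)
    return x @ (mu_p - mu_n) + 0.5 * (mu_n @ mu_n - mu_p @ mu_p)

x_p = rng.normal(size=(n, d)) + mu_p   # draws with y = +1
x_n = rng.normal(size=(n, d)) + mu_n   # draws with y = -1

# Bayes risk w.r.t. the logistic loss, Eq. (35): E[-log p(y|x)]
risk = -0.5 * (log_sigmoid(logit_pos(x_p)).mean() + log_sigmoid(-logit_pos(x_n)).mean())
print(f"Monte Carlo estimate of the Bayes risk: {risk:.3f}")
# comes out around 0.24 for m = 0.8 and around 0.14 for m = 1.0,
# matching the values quoted in Section 4.1
```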

C Details of Experiments

C.1 Benchmark Datasets

In the experiments in Section 4.2, we use six benchmark datasets explained below.

• MNIST [Lecun et al., 1998] (http://yann.lecun.com/exdb/mnist/) is a 10-class dataset of handwritten digits: 1, 2, ..., 9 and 0. Each sample is a 28 × 28 grayscale image. The number of training and test samples are 60,000 and 10,000, respectively.

• Fashion-MNIST [Xiao et al., 2017] (https://github.com/zalandoresearch/fashion-mnist) is a 10-class dataset of fashion items: T-shirt/top, Trouser, Pullover, Dress, Coat, Sandal, Shirt, Sneaker, Bag, and Ankle boot. Each sample is a 28 × 28 grayscale image. The number of training and test samples are 60,000 and 10,000, respectively.

• Kuzushiji-MNIST [Clanuwat et al., 2018] (https://github.com/rois-codh/kmnist) is a 10-class dataset of cursive Japanese ("Kuzushiji") characters. Each sample is a 28 × 28 grayscale image. The number of training and test samples are 60,000 and 10,000, respectively.

• CIFAR-10 (https://www.cs.toronto.edu/~kriz/cifar.html) is a 10-class dataset of various objects: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck. Each sample is a colored image in 32 × 32 × 3 RGB format. It is a subset of the 80 million tiny images dataset [Torralba et al., 2008]. There are 6,000 images per class, where 5,000 are for training and 1,000 are for test.

• CIFAR-100 (https://www.cs.toronto.edu/~kriz/cifar.html) is a 100-class dataset of various objects. Each class has 600 samples, where 500 samples are for training and 100 samples are for test. This is also a subset of the 80 million tiny images dataset [Torralba et al., 2008].

• SVHN [Netzer et al., 2011] (http://ufldl.stanford.edu/housenumbers) is a 10-class dataset of house numbers from Google Street View images, in 32 × 32 × 3 RGB format. 73,257 digits are for training and 26,032 digits are for testing.
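All six datasets are available through torchvision; a possible way to obtain them is sketched below (the paper does not describe its data-loading code, so the root directory and transform here are assumptions):

```python
from torchvision import datasets, transforms

t = transforms.ToTensor()
root = "./data"
mnist    = datasets.MNIST(root, train=True, download=True, transform=t)
fmnist   = datasets.FashionMNIST(root, train=True, download=True, transform=t)
kmnist   = datasets.KMNIST(root, train=True, download=True, transform=t)
cifar10  = datasets.CIFAR10(root, train=True, download=True, transform=t)
cifar100 = datasets.CIFAR100(root, train=True, download=True, transform=t)
svhn     = datasets.SVHN(root, split="train", download=True, transform=t)
```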

C.2 Chosen Flooding Levels

In Table 4, we report the chosen flooding levels for our experiments with benchmark datasets.

Table 4: The chosen flooding levels for benchmark experiments.

                          Early stopping:   ✗      ✓      ✗      ✓
                          Weight decay:     ✗      ✗      ✓      ✓
                          Flooding:         ✓      ✓      ✓      ✓

  MNIST (0.8)                              0.02   0.02   0.03   0.02
  MNIST (0.4)                              0.02   0.03   0.00   0.02
  Fashion-MNIST (0.8)                      0.00   0.00    —      —
  Fashion-MNIST (0.4)                      0.00   0.00    —      —
  Kuzushiji-MNIST (0.8)                    0.01   0.03   0.03   0.03
  Kuzushiji-MNIST (0.4)                    0.03   0.02   0.04   0.03
  CIFAR-10 (0.8)                           0.04   0.04   0.00   0.01
  CIFAR-10 (0.4)                           0.03   0.03   0.00   0.00
  CIFAR-100 (0.8)                          0.02   0.02   0.00   0.00
  CIFAR-100 (0.4)                          0.03   0.03   0.00   0.00
  SVHN (0.8)                               0.01   0.01   0.00   0.02
  SVHN (0.4)                               0.01   0.09   0.03   0.03

D Learning Curves

In Figure 4, we visualize learning curves for all datasets (including CIFAR-10, which we already visualized in Figure 2). We only show the learning curves for the training/validation proportion of 0.8, since the results for 0.4 were similar to those for 0.8. Note that Figure 1(c) shows the learning curves for the first 80 epochs for CIFAR-10 without flooding. Figure 1(d) shows the learning curves with flooding when the flooding level is 0.18.

[Figure 4: a 6 × 4 grid of panels; rows are MNIST (flooding levels 0.00, 0.01, 0.02, 0.03), Fashion-MNIST (0.00, 0.02, 0.05, 0.10), Kuzushiji-MNIST (0.00, 0.02, 0.05, 0.10), CIFAR-10 (0.00, 0.03, 0.07, 0.20), CIFAR-100 (0.00, 0.01, 0.06, 0.20), and SVHN (0.00, 0.01, 0.06, 0.20).]

Figure 4: Learning curves of training and test loss. The first figure in each row is the learning curves without flooding. The 2nd, 3rd, and 4th columns show the results with different flooding levels. The flooding level increases towards the right-hand side.

E Relationship between Performance and Gradients

Settings: We visualize the relationship between test performance (loss or accuracy) and the gradient amplitude of the training/test loss in Figure 5, where the gradient amplitude is the ℓ2 norm of the filter-normalized gradient of the loss. The filter-normalized gradient is the gradient appropriately scaled depending on the magnitude of the corresponding convolutional filter, similarly to Li et al. [2018]. More specifically, for each filter of every convolutional layer, we multiply the corresponding elements of the gradient by the norm of the filter. Note that a fully connected layer is a special case of a convolutional layer and is subject to this scaling. We exclude Fashion-MNIST because the optimal flooding level was zero. We used the results with training/validation split ratio of 0.8.
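A minimal sketch of how this filter-normalized gradient amplitude could be computed is given below; it assumes gradients have already been populated by a backward pass, treats each row of a weight tensor as one filter, and leaves bias vectors unscaled (choices not spelled out in the paper):

```python
import torch

def filter_normalized_grad_norm(model: torch.nn.Module) -> float:
    """L2 norm of the loss gradient after multiplying each filter's gradient
    elements by the norm of that filter."""
    sq_sum = 0.0
    for p in model.parameters():
        if p.grad is None:
            continue
        if p.dim() > 1:  # conv filters / rows of fully connected layers
            g = p.grad.flatten(1)
            w = p.detach().flatten(1)
            scaled = g * w.norm(dim=1, keepdim=True)  # gradient scaled by the filter norm
        else:            # biases: left unscaled (an assumption)
            scaled = p.grad
        sq_sum += scaled.pow(2).sum().item()
    return sq_sum ** 0.5
```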

Results: For the figures with the gradient amplitude of the training loss on the horizontal axis, "◦" marks (with flooding) are often plotted on the right of "+" marks (without flooding), which implies that flooding prevents the model from staying at a local minimum. For the figures with the gradient amplitude of the test loss on the horizontal axis, we can observe that the method with flooding ("◦") improves performance while the gradient amplitude becomes smaller. On the other hand, the performance of the method without flooding ("+") degrades while the gradient amplitude of the test loss keeps increasing.


[Figure 5: for each of MNIST, Kuzushiji-MNIST (K-M), CIFAR-10 (C10), CIFAR-100 (C100), and SVHN, four scatter plots with the gradient amplitude of the training or test loss on the horizontal axis and the test loss or test accuracy on the vertical axis.]

Figure 5: Relationship between test performance (loss or accuracy) and amplitude of gradient (with training or test loss). Each point ("◦" or "+") in the figures corresponds to a single model at a certain epoch. We remove the first 5 epochs and plot the rest. "◦" is used for the method with flooding and "+" is used for the method without flooding. The large black "◦" and "+" show the epochs with early stopping. The color becomes lighter (purple → yellow) as the training proceeds. K-M, C10, and C100 stand for Kuzushiji-MNIST, CIFAR-10, and CIFAR-100.

F Flatness

Settings: We follow Li et al. [2018] and give a one-dimensional visualization of flatness for each dataset in Figure 6. We exclude Fashion-MNIST because the optimal flooding level was zero. We used the results with training/validation split ratio of 0.8. We compare the flatness of the model right after the empirical risk with respect to a mini-batch becomes smaller than the flooding level, \hat{R}_m(g) < b, for the first time (dotted blue line) and the model after training (solid blue line). We also compare them with the model trained by the baseline method without flooding, after training is finished (solid red line).
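A sketch of such a one-dimensional flatness plot is given below; the random, filter-wise rescaled direction and the evaluation grid are assumptions in the spirit of Li et al. [2018], not the exact procedure used for Figure 6:

```python
import copy
import torch

@torch.no_grad()
def loss_along_direction(model, loss_fn, loader, alphas, device="cpu"):
    """Average loss of the model perturbed by alpha * d for each alpha in `alphas`,
    where d is a random direction rescaled filter-wise to match the weight norms."""
    base = copy.deepcopy(model).to(device)
    base.eval()
    direction = [torch.randn_like(p) for p in base.parameters()]
    for d, p in zip(direction, base.parameters()):
        if p.dim() > 1:  # rescale each filter of d to the norm of the corresponding weight filter
            shape = (-1, *([1] * (p.dim() - 1)))
            d.mul_(p.flatten(1).norm(dim=1).view(shape)
                   / (d.flatten(1).norm(dim=1).view(shape) + 1e-10))
    curve = []
    for alpha in alphas:
        probe = copy.deepcopy(base)
        for p, p0, d in zip(probe.parameters(), base.parameters(), direction):
            p.copy_(p0 + alpha * d)
        total, count = 0.0, 0
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            total += loss_fn(probe(x), y).item() * len(y)
            count += len(y)
        curve.append(total / count)
    return curve  # loss values to plot against alphas
```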

Results: According to Figure 6, the test loss becomes lower and flatter during the training with flooding. Note that the training loss, on the other hand, continues to float around the flooding level until the end of training after it enters the flooding zone. We expect that the model makes a random walk and escapes regions with sharp loss landscapes during this period. This may be a possible reason for the better generalization results with our proposed method.


[Figure 6: train and test loss landscape panels for MNIST, Kuzushiji-MNIST, CIFAR-10, CIFAR-100, and SVHN.]

Figure 6: One-dimensional visualization of flatness. We visualize the training/test loss with respect to perturbation. We depict the results for 3 models: the model when the empirical risk with respect to training data is below the flooding level for the first time during training (dotted blue), the model at the end of training with flooding (solid blue), and the model at the end of training without flooding (solid red).



from previous works [Belkin et al 2019 Krogh and Hertz 1992 Nakkiran et al 2020]It would also be interesting to see if Bayesian optimization [Shahriari et al 2016 Snoek

et al 2012] methods can be utilized to search for the optimal ooding level eciently We willinvestigate this direction in the future

AcknowledgementsWe thank Chang Xu Genki Yamamoto Kento Nozawa Nontawat Charoenphakdee Voot Tangkaraand Yoshihiro Nagano for the helpful discussions TI was supported by the Google PhD FellowshipProgram MS and IY were supported by JST CREST Grant Number JPMJCR18A2 including AIPchallenge program Japan

ReferencesDevansh Arpit Stanisaw Jastrzebski Nicolas Ballas David Krueger Emmanuel Bengio Maxin-

der S Kanwal Tegan Maharaj Asja Fischer Aaron Courville Yoshua Bengio and SimonLacoste-Julien A closer look at memorization in deep networks In ICML 2017

Mikhail Belkin Daniel Hsu and Partha P Mitra Overing or perfect ing Risk bounds forclassication and regression rules that interpolate In NeurIPS 2018

Mikhail Belkin Daniel Hsu Siyuan Ma and Soumik Mandal Reconciling modern machine-learningpractice and the classical biasvariance trade-o PNAS 11615850ndash15854 2019

David Berthelot Nicholas Carlini Ian Goodfellow Nicolas Papernot Avital Oliver and Colin ARael MixMatch A holistic approach to semi-supervised learning In NeurIPS 2019

Christopher M Bishop Paern Recognition and Machine Learning (Information Science andStatistics) Springer 2011

Christopher Michael Bishop Regularization and complexity control in feed-forward networks InICANN 1995

Rich Caruana Steve Lawrence and C Lee Giles Overing in neural nets Backpropagationconjugate gradient and early stopping In NeurIPS 2000

Pratik Chaudhari Anna Choromanska Stefano Soao Yann LeCun Carlo Baldassi ChristianBorgs Jennifer Chayes Levent Sagun and Riccardo Zecchina Entropy-SGD Biasing gradientdescent into wide valleys In ICLR 2017

Jesus Cid-Sueiro Darıo Garcıa-Garcıa and Raul Santos-Rodrıguez Consistency of losses forlearning from weak labels In ECML-PKDD 2014

14

Tarin Clanuwat Mikel Bober-Irizar Asanobu Kitamoto Alex Lamb Kazuaki Yamamoto and DavidHa Deep learning for classical Japanese literature In NeurIPS Workshop on Machine Learningfor Creativity and Design 2018

Marthinus Christoel du Plessis Gang Niu and Masashi Sugiyama Analysis of learning frompositive and unlabeled data In NeurIPS 2014

Marthinus Christoel du Plessis Gang Niu and Masashi Sugiyama Convex formulation forlearning from positive and unlabeled data In ICML 2015

Ian Goodfellow Yoshua Bengio and Aaron Courville Deep Learning MIT Press 2016

Hongyu Guo Yongyi Mao and Richong Zhang Augmenting data with mixup for sentenceclassication An empirical study In arXiv190508941 2019

Bo Han Gang Niu Jiangchao Yao Xingrui Yu Miao Xu Ivor Tsang and Masashi SugiyamaPumpout A meta approach to robust deep learning with noisy labels In arXiv180911008 2018

Stephen Jose Hanson and Lorien Y Pra Comparing biases for minimal network constructionwith back-propagation In NeurIPS 1988

Kaiming He Xiangyu Zhang Shaoqing Ren and Jian Sun Deep residual learning for imagerecognition In CVPR 2016

Sergey Ioe and Christian Szegedy Batch normalization Accelerating deep network training byreducing internal covariate shi In ICML 2015

Takashi Ishida Gang Niu Aditya Krishna Menon and Masashi Sugiyama Complementary-labellearning for arbitrary losses and models In ICML 2019

Nitish Shirish Keskar Dheevatsa Mudigere Jorge Nocedal Mikhail Smelyanskiy and Ping Tak PeterTang On large-batch training for deep learning Generalization gap and sharp minima In ICLR2017

Diederik P Kingma and Jimmy L Ba Adam A method for stochastic optimization In ICLR 2015

Ryuichi Kiryo Gang Niu Marthinus Christoel du Plessis and Masashi Sugiyama Positive-unlabeled learning with non-negative risk estimator In NeurIPS 2017

Anders Krogh and John A Hertz A simple weight decay can improve generalization In NeurIPS1992

Yann Lecun Lon Boou Yoshua Bengio and Patrick Haner Gradient-based learning applied todocument recognition In Proceedings of the IEEE pages 2278ndash2324 1998

Hao Li Zheng Xu Gavin Taylor Christoph Studer and Tom Goldstein Visualizing the losslandscape of neural nets In NeurIPS 2018

15

Ilya Loshchilov and Frank Huer Decoupled weight decay regularization In ICLR 2019

Nan Lu Tianyi Zhang Gang Niu and Masashi Sugiyama Mitigating overing in supervisedclassication from two unlabeled datasets A consistent risk correction approach In AISTATS2020

N Morgan and H Bourlard Generalization and parameter estimation in feedforward nets Someexperiments In NeurIPS 1990

Vinod Nair and Georey E Hinton Rectied linear units improve restricted boltzmann machinesIn ICML 2010

Preetum Nakkiran Gal Kaplun Dimitris Kalimeris Tristan Yang Benjamin L Edelman FredZhang and Boaz Barak SGD on neural networks learns functions of increasing complexity InNeurIPS 2019

Preetum Nakkiran Gal Kaplun Yamini Bansal Tristan Yang Boaz Barak and Ilya Sutskever Deepdouble descent Where bigger models and more data hurt In ICLR 2020

Nagarajan Natarajan Inderjit S Dhillon Pradeep K Ravikumar and Ambuj Tewari Learningwith noisy labels In NeurIPS 2013

Yuval Netzer Tao Wang Adam Coates Alessandro Bissacco Bo Wu and Andrew Y Ng Readingdigits in natural images with unsupervised feature learning In NeurIPS Workshop on DeepLearning and Unsupervised Feature Learning 2011

Andrew Y Ng Preventing ldquooveringrdquo of cross-validation data In ICML 1997

Adam Paszke Sam Gross Francisco Massa Adam Lerer James Bradbury Gregory Chanan TrevorKilleen Zeming Lin Natalia Gimelshein Luca Antiga Alban Desmaison Andreas Kopf EdwardYang Zachary DeVito Martin Raison Alykhan Tejani Sasank Chilamkurthy Benoit SteinerLu Fang Junjie Bai and Soumith Chintala PyTorch An imperative style high-performancedeep learning library In NeurIPS 2019

Giorgio Patrini Alessandro Rozza Aditya K Menon Richard Nock and Lizhen Making deepneural networks robust to label noise A loss correction approach In CVPR 2017

Herbert Robbins and Suon Monro A stochastic approximation method Annals of MathematicalStatistics 22400ndash407 1951

Rebecca Roelofs Vaishaal Shankar Benjamin Recht Sara Fridovich-Keil Moritz Hardt John Millerand Ludwig Schmidt A meta-analysis of overing in machine learning In NeurIPS 2019

Bobak Shahriari Kevin Swersky Ziyu Wang Ryan P Adams and Nando de Freitas Taking thehuman out of the loop A review of bayesian optimization Proceedings of the IEEE 104148ndash1752016

16

Connor Shorten and Taghi M Khoshgoaar A survey on image data augmentation for deeplearning Journal of Big Data 6 2019

Jasper Snoek Hugo Larochelle and Ryan P Adams Practical bayesian optimization of machinelearning algorithms In NeurIPS 2012

Nitish Srivastava Georey Hinton Alex Krizhevsky Ilya Sutskever and Ruslan SalakhutdinovDropout A simple way to prevent neural networks from overing Journal of Machine LearningResearch 151929ndash1958 2014

Masashi Sugiyama Introduction to statistical machine learning Morgan Kaufmann 2015

Christian Szegedy Vincent Vanhoucke Sergey Ioe and Jon Shlens Rethinking the inceptionarchitecture for computer vision In CVPR 2016

Sunil ulasidasan Gopinath Chennupati Je Bilmes Tanmoy Bhaacharya and Sarah MichalakOn mixup training Improved calibration and predictive uncertainty for deep neural networksIn NeurIPS 2019

Robert Tibshirani Regression shrinkage and selection via the lasso Journal of the Royal StatisticalSociety Series B (Methodological) 58267ndash288 1996

A N Tikhonov and V Y Arsenin Solutions of Ill Posed Problems Winston 1977

Andrey Nikolayevich Tikhonov On the stability of inverse problems Doklady Akademii NaukSSSR 39195ndash198 1943

Antonio Torralba Rob Fergus and William T Freeman 80 million tiny images A large data setfor nonparametric object and scene recognition In IEEE Trans PAMI 2008

Brendan van Rooyen and Robert C Williamson A theory of learning with corrupted labelsJournal of Machine Learning Research 181ndash50 2018

Vikas Verma Alex Lamb Juho Kannala Yoshua Bengio and David Lopez-Paz Interpolationconsistency training for semi-supervised learning In IJCAI 2019

Stefan Wager Sida Wang and Percy Liang Dropout training as adaptive regularization In NeurIPS2013

Roman Werpachowski Andrs Gyrgy and Csaba Szepesvri Detecting overing via adversarialexamples In NeurIPS 2019

Han Xiao Kashif Rasul and Roland Vollgraf Fashion-MNIST a novel image dataset for bench-marking machine learning algorithms arXiv preprint arXiv170807747 2017

Chiyuan Zhang Samy Bengio Moritz Hardt Benjamin Recht and Oriol Vinyals Understandingdeep learning requires rethinking generalization In ICLR 2017

17

Hongyi Zhang Moustapha Cisse Yann N Dauphin and David Lopez-Paz mixup Beyondempirical risk minimization In ICLR 2018

Tong Zhang Statistical behavior and consistency of classication methods based on convex riskminimization e Annals of Statistics 3256ndash85 2004

18

A Proof of eoremProof If the ooding level is b then the proposed ooding estimator is

R(g) = |R(g)minus b|+ b (12)

Since the absolute operator can be expressed with a max operator with max(a b) = a+b+|aminusb|2

theproposed estimator can be re-expressed as

R(g) = 2max(R(g) b)minus R(g) = Aminus R(g) (13)

For convenience we used A = 2max(R(g) b) From the denition of MSE

MSE(R(g)) = E[(R(g)minusR(g))2] (14)MSE(R(g)) = E[(R(g)minusR(g))2] (15)

= E[(Aminus R(g)minusR(g))2)] (16)= E[A2]minus 2E[A(R(g) +R(g))] + E[(R(g) +R(g))2] (17)

We are interested in the sign of

MSE(R(g))minusMSE(R(g)) = E[minus4R(g)R(g)minus A2 + 2A(R(g) +R(g))] (18)

Dene the inside of the expectation as B = minus4R(g)R(g)minus A2 + 2A(R(g) + R(g)) B can bedivided into two cases depending on the outcome of the max operator

B =

minus4R(g)R(g)minus 4R(g)2 + 4R(g)(R(g) +R(g)) if R(g) ge b

minus4R(g)R(g)minus 4b2 + 4b(R(g) +R(g)) if R(g) lt b(19)

=

0 if R(g) ge b

minus4R(g)R(g)minus 4b2 + 4b(R(g) +R(g)) if R(g) lt b(20)

=

0 if R(g) ge b

minus4(bminus R(g))(bminusR(g)) if R(g) lt b (21)

e laer case becomes positive when R(g) lt b lt R(g) erefore when R(g) lt b lt R(g)

MSE(R(g))minusMSE(R(g)) gt 0 (22)MSE(R(g)) gt MSE(R(g)) (23)

When b le R(g)

MSE(R(g))minusMSE(R(g)) gt 0 (24)MSE(R(g)) = MSE(R(g)) (25)

19

B Bayes Risk for Gaussian DistributionsIn this section we explain in detail how we derived the Bayes risk with respect to the surrogateloss in the experiments with Gaussian data in Section 41 Since we are using the logistic loss inthe synthetic experiments the loss of the margin is

`(yg(x)) = log(1 + exp(minusyg(x))) (26)

where g(x) Rd rarr R is a scalar instead of the vector denition that was used previously becausethe synthetic experiments only consider binary classication Take the derivative to derive

partE[`(yg(x))]partg(x)

=E[log(1 + exp(minusyg(x)))]

partg(x)= E

[minusy exp(minusyg(x))1 + exp(minusyg(x))

∣∣∣∣∣x]p(x) (27)

= E

[minusy

1 + exp(yg(x))

∣∣∣∣∣x]p(x) (28)

= E

[y + 1

2

1

exp(minusg(x)) + 1+y minus 1

2

minus1exp(g(x)) + 1

∣∣∣∣∣x]p(x) (29)

= E

[y + 1

2

∣∣∣∣∣x]

1

exp(minusg(x)) + 1p(x) + E

[y minus 1

2

∣∣∣∣∣x]

minus1exp(g(x)) + 1

p(x) (30)

= p(y = +1|x) 1

exp(minusg(x)) + 1p(x) + p(y = minus1|x) minus1

exp(g(x)) + 1p(x) (31)

Set this to zero divide by p(x) gt 0 to obtain

p(y = minus1|x) 1

exp(minusg(x)) + 1= p(y = +1|x) 1

exp(g(x)) + 1(32)

exp(g(x)) =p(y = +1|x)p(y = minus1|x)

(33)

g(x) = logp(y = +1|x)p(y = minus1|x)

(34)

Since we are interested in the surrogate loss under this classier we plug this into the logisticloss to obtain the Bayes risk

E[`(yg(x))] = E[log(1 +

p(minusy|x)p(y|x)

)

]= E

[log(

1

p(y|x))

]= E[minus log p(y|x)] (35)

In the experiments in Section 41 we report the empirical version of this with the test dataset asthe Bayes risk

20

C Details of Experiments

C1 Benchmark DatasetsIn the experiments in Section 42 we use six benchmark datasets explained below

bull MNIST6 [Lecun et al 1998] is a 10 class dataset of handwrien digits 1 2 9 and 0 Eachsample is a 28times 28 grayscale image e number of training and test samples are 60000and 10000 respectively

bull Fashion-MNIST7 [Xiao et al 2017] is a 10 class dataset of fashion items T-shirttop TrouserPullover Dress Coat Sandal Shirt Sneaker Bag and Ankle boot Each sample is a 28times 28grayscale image e number of training and test samples are 60000 and 10000 respectively

bull Kuzushiji-MNIST8 [Clanuwat et al 2018] is a 10 class dataset of cursive Japanese (ldquoKuzushijirdquo)characters Each sample is a 28 times 28 grayscale image e number of training and testsamples are 60000 and 10000 respectively

bull CIFAR-109 is a 10 class dataset of various objects airplane automobile bird cat deer dogfrog horse ship and truck Each sample is a colored image in 32times 32times 3 RGB format It isa subset of the 80 million tiny images dataset [Torralba et al 2008] ere are 6000 imagesper class where 5000 are for training and 1000 are for test

bull CIFAR-10010 is a 100 class dataset of various objects Each class has 600 samples where 500samples are for training and 100 samples are for test is is also a subset of the 80 milliontiny images dataset [Torralba et al 2008]

bull SVHN11 [Netzer et al 2011] is a 10 class dataset of house numbers from Google Street Viewimages in 32times 32times 3 RGB format 73257 digits are for training and 26032 digits are fortesting

6httpyannlecuncomexdbmnist7httpsgithubcomzalandoresearchfashion-mnist8httpsgithubcomrois-codhkmnist9httpswwwcstorontoedusimkrizcifarhtml

10httpswwwcstorontoedusimkrizcifarhtml11httpufldlstanfordeduhousenumbers

21

C2 Chosen Flooding LevelsIn Table 4 we report the chosen ooding levels for our experiments with benchmark datasets

Table 4 e chosen ooding levels for benchmark experiments

Early stopping times X times XWeight decay times times X XFlooding X X X X

MNIST (08) 002 002 003 002MNIST (04) 002 003 000 002Fashion-MNIST (08) 000 000 mdash mdashFashion-MNIST (04) 000 000 mdash mdashKuzushiji-MNIST (08) 001 003 003 003Kuzushiji-MNIST (04) 003 002 004 003CIFAR-10 (08) 004 004 000 001CIFAR-10 (04) 003 003 000 000CIFAR-100 (08) 002 002 000 000CIFAR-100 (04) 003 003 000 000SVHN (08) 001 001 000 002SVHN (04) 001 009 003 003

D Learning CurvesIn Figure 4 we visualize learning curves for all datasets (including CIFAR-10 which we alreadyvisualized in Figure 2) We only show the learning curves for trainingvalidation proportion of 08since the results for 04 were similar with 08 Note that Figure 1(c) shows the learning curves forthe rst 80 epochs for CIFAR-10 without ooding Figure 1(d) shows the learning curves withooding when the ooding level is 018

22

(a) MNIST (000) (b) MNIST (001) (c) MNIST (002) (d) MNIST (003)

(e) Fashion-MNIST(000)

(f) Fashion-MNIST(002)

(g) Fashion-MNIST(005)

(h) Fashion-MNIST(010)

(i) KMNIST (000) (j) KMNIST (002) (k) KMNIST (005) (l) KMNIST (010)

(m) CIFAR-10 (000) (n) CIFAR-10 (003) (o) CIFAR-10 (007) (p) CIFAR-10 (020)

(q) CIFAR-100 (000) (r) CIFAR-100 (001) (s) CIFAR-100 (006) (t) CIFAR-100 (020)

(u) SVHN (000) (v) SVHN (001) (w) SVHN (006) (x) SVHN (020)

Figure 4 Learning curves of training and test loss e rst gure in each row is the learning curveswithout ooding e 2nd 3rd and 4th columns show the results with dierent ooding levels e oodinglevel increases towards the right-hand side 23

E Relationship between Performance and GradientsSettings We visualize the relationship between test performance (loss or accuracy) and gradientamplitude of the trainingtest loss in Figure 5 where the gradient amplitude is the `2 norm of thelter-nomalized gradient of the loss e lter-normalized gradient is the gradient appropriatelyscaled depending on the magnitude of the corresponding convolutional lter similarly to Liet al [2018] More specicically for each lter of every comvolutional layer we multiply thecorresponding elements of the gradient by the norm of the lter Note that a fully connected layeris a special case of convolutional layer and subject to this scaling We exclude Fashion-MNISTbecause the optimal ooding level was zero We used the results with trainingvalidation splitratio of 08

Results For the gures with gradient amplitude of training loss on the horizontal axis ldquordquomarks (w ooding) are oen ploed on the right of ldquo+rdquo marks (wo ooding) which impliesthat ooding prevents the model from staying a local minimum For the gures with gradientamplitude of test loss on the horizontal axis we can observe the method with ooding (ldquordquo)improves performance while the gradient amplitude becomes smaller On the other hand theperformance with the method without ooding (ldquo+rdquo) degenerates while the gradient amplitude oftest loss keeps increasing

24

(a) MNIST xtrainyloss

(b) MNIST xtrainyacc

(c) MNIST xtestyloss

(d) MNIST xtest yacc

(e) K-M xtrain yloss (f) K-M xtrain yacc (g) K-M xtest yloss (h) K-M xtest yacc

(i) C10 xtrain yloss (j) C10 xtrain yacc (k) C10 xtest yloss (l) C10 xtest yacc

(m) C100 xtrainyloss

(n) C100 xtrain yacc (o) C100 xtest yloss (p) C100 xtest yacc

(q) SVHN xtrainyloss

(r) SVHN xtrainyacc

(s) SVHN xtest yloss (t) SVHN xtest yacc

Figure 5 Relationship between test performance (loss or accuracy) and amplitude of gradient (with trainingor test loss) Each point (ldquordquo or ldquo+rdquo) in the gures corresponds to a single model at a certain epoch Weremove the rst 5 epochs and plot the rest ldquordquo is used for the method with ooding and ldquo+rdquo is used forthe method without ooding e large black ldquordquo and ldquo+rdquo show the epochs with early stopping e colorbecomes lighter (purplerarr yellow) as the training proceeds K-M C10 and C100 stand for Kuzushiji-MNISTCIFAR-10 and CIFAR-100

25

F FlatnessSettings We follow Li et al [2018] and give a one-dimensional visualization of atness for eachdataset in Figure 6 We exclude Fashion-MNIST because the optimal ooding level was zero Weused the results with trainingvalidation split ratio of 08 We compare the atness of the modelright aer the empirical risk with respect to a mini-batch becomes smaller than the ooding levelRm(g) lt b for the rst time (doed blue line) and the model aer training (solid blue line) Wealso compare them with the model trained by the baseline method without ooding and trainingis nished (solid red line)

Results According to Figure 6 the test loss becomes lower and more at during the trainingwith ooding Note that the training loss on the other hand continues to oat around the oodinglevel until the end of training aer it enters the ooding zone We expect that the model makes arandom walk and escapes regions with sharp loss landscapes during the period is may be apossible reason for beer generalization results with our proposed method

26

(a) MNIST (train) (b) MNIST (test) (c) Kuzushiji-MNIST(train)

(d) Kuzushiji-MNIST(test)

(e) CIFAR-10 (train) (f) CIFAR-10 (test) (g) CIFAR-100 (train) (h) CIFAR-100 (test)

(i) SVHN (train) (j) SVHN (test)

Figure 6 One-dimensional visualization of atness We visualize the trainingtest loss with respect toperturbation We depict the results for 3 models the model when the empirical risk with respect to trainingdata is below the ooding level for the rst time during training (doed blue) the model at the end oftraining with ooding (solid blue) and the model at the end of training without ooding (solid red)

27

  • 1 Introduction
  • 2 Backgrounds
    • 21 Regularization Methods
    • 22 Double Descent Curves with Overparametrization
    • 23 Lower-Bounding the Empirical Risk
      • 3 Flooding How to Avoid Zero Training Loss
        • 31 Preliminaries
        • 32 Algorithm
        • 33 Implementation
        • 34 Theoretical Analysis
          • 4 Experiments
            • 41 Synthetic Experiments
            • 42 Benchmark Experiments
            • 43 Memorization
              • 5 Conclusion
              • A Proof of Theorem
              • B Bayes Risk for Gaussian Distributions
              • C Details of Experiments
                • C1 Benchmark Datasets
                • C2 Chosen Flooding Levels
                  • D Learning Curves
                  • E Relationship between Performance and Gradients
                  • F Flatness
Page 10: Do We Need Zero Training Loss A›er Achieving Zero Training … · 2020. 2. 21. · directly aim to avoid zero training loss, they o›en fail to maintain a moderate level of training

Table 2 Experimental results for the synthetic data Sub-table (A) shows the results without early stoppingSub-table (B) shows the results with early stopping e beer method is shown in bold in each of sub-tables (A) and (B) ldquoBRrdquo stands for the Bayes risk For Two Gaussians the distance between the positive andnegative distributions is larger for larger m See the description in Section 41 for the details

(A) Without Early Stopping (B) With Early Stopping

Data Seing WithoutFlooding

WithFlooding

Chosenb

WithoutFlooding

WithFlooding

Chosenb

Two Gaussians m 10 BR 014 8796 9225 028 9163 9225 027Two Gaussians m 08 BR 024 8200 8731 033 8657 8729 035Sinusoid Label Noise 001 9384 9446 001 9254 9254 000Sinusoid Label Noise 005 9112 9544 010 9326 9460 001Sinusoid Label Noise 010 8657 9602 017 9670 9670 000Spiral Label Noise 001 9896 9785 001 9860 9888 001Spiral Label Noise 005 9387 9624 004 9658 9562 014Spiral Label Noise 010 8970 9296 016 8970 9296 016

ReLU activation function [Nair and Hinton 2010] and batch normalization [Ioe and Szegedy2015] We train the network for 500 epochs with the logistic loss and Adam [Kingma and Ba2015] optimizer with 100 mini-batch size and learning rate of 0001 e ooding level is chosenfrom b isin 0 001 002 020 Note that training with b = 0 is identical to the baselinemethod without ooding We report the test accuracy of the ooding level with the best validationaccuracy We rst conduct experiments without early stopping which means that the last epochwas chosen for all ooding levels

Results e results are summarized in Table 2 It is worth noting that for Two Gaussians thechosen ooding level b is larger for the smaller distance between the two distributions which iswhen the classication task is harder and the Bayes risk becomes larger since the two distributionsbecome less separated We see similar tendencies for Sinusoid and Spiral data a larger b waschosen for larger ipping probability for label noise which is expected to increase the Bayes riskis implies the positive correlation between the optimal ooding level and the Bayes risk asis also partially suggested by eorem 1 Another interesting observation is that the chosen bis close to but higher than the Bayes risk for Two Gaussians data is may look inconsistentwith eorem 1 However it makes sense to adopt larger b with stronger regularization eect thatallows some bias as a trade-o for reducing the variance of the risk estimator In fact eorem 1does not deny the possibility that some b ge R(g) achieves even beer estimation

From (A) in Table 2 we can see that the method with ooding oen improves test accuracyover the baseline method without ooding As we mentioned in the introduction it ca be harmfulto keep training a model until the end without ooding However with ooding the model at thenal epoch has good prediction performance according to the results which implies that ooding

10

Table 3 Results with benchmark datasets We report classication accuracy for all combinations ofweight decay (X and times) early stopping (X and times) and ooding (X and times) e second column shows thetrainingvalidation split used for the experiment W stands for weight decay E stands for early stopping andF stands for ooding ldquomdashrdquo means that ooding level of zero was optimal ldquoNArdquo means that we skipped theexperiments because zero weight decay was optimal in the case without ooding e best and equivalentare shown in bold by comparing ldquowith oodingrdquo and ldquowithout oodingrdquo for two columns with the sameseing for W and E eg the rst and h columns out of the 8 columns e best performing combinationis highlighted

Datasettr W times times X X times times X Xva E times X times X times X times X

split F times times times times X X X X

MNIST 08 9832 9830 9851 9842 9846 9853 9850 984804 9771 9770 9782 9791 9774 9785 mdash 9783

Fashion- 08 8934 8936 NA NA mdash mdash NA NAMNIST 04 8848 8863 8860 8862 mdash mdash mdash mdash

Kuzushiji- 08 9163 9162 9163 9171 9240 9212 9211 9197MNIST 04 8918 8918 8958 8973 9041 9015 8971 8988

CIFAR-10 08 7359 7336 7365 7357 7306 7344 mdash 744104 6639 6663 6931 6928 6720 6758 mdash mdash

CIFAR-100 08 4216 4233 4267 4245 4250 4236 mdash mdash04 3427 3434 3797 3882 3499 3514 mdash mdash

SVHN 08 9238 9241 9320 9299 9278 9279 mdash 934204 9032 9035 9043 9049 9057 9061 9116 9121

helps the late-stage training improve test accuracyWe also conducted experiments with early stopping meaning that we chose the model that

recorded the best validation accuracy during training e results are reported in sub-table (B) ofTable 2 Compared with sub-table (A) we see that early stopping improves the baseline methodwithout ooding well in many cases is indicates that training longer without ooding washarmful in our experiments On the other hand the accuracy for ooding combined with earlystopping is oen close to that with early stopping meaning that training until the end withooding tends to be already as good as doing so with early stopping e table shows that oodingoen improves or retains the test accuracy of the baseline method without ooding even aerdeploying early stopping Flooding does not hurt performance but can be benecial for methodsused with early stopping

11

(a) CIFAR-10 (000) (b) CIFAR-10 (003) (c) CIFAR-10 (007) (d) CIFAR-10 (020)

Figure 2 Learning curves of training and test loss for trainingvalidation proportion of 08 (a) shows thelearning curves without ooding (b) (c) and (d) show the learning curves with dierent ooding levels

42 Benchmark ExperimentsWe next perform experiments with benchmark datasets Not only do we compare with the baselinewithout ooding we also compare or combine with other general regularization methods whichare early stopping and weight decay

Settings We use the following six benchmark datasets MNIST Fashion-MNIST Kuzushiji-MNIST CIFAR-10 CIFAR-100 and SVHN e details of the benchmark datasets can be found inAppendix C1 We split the original training dataset into training and validation data with dierentproportions 08 or 04 (meaning 80 or 40 was used for training and the rest was used forvalidation respectively) We perform the exhaustive hyper-parameter search for the ooding levelwith candidates from 000 001 020 e number of epochs is 500 Stochastic gradientdescent [Robbins and Monro 1951] is used with learning rate of 01 and momentum of 09 ForMNIST Fashion-MNIST and Kuzushiji-MNIST we use a one-hidden-layer feedforward neuralnetwork with 500 units and ReLU activation function [Nair and Hinton 2010] For CIFAR-10CIFAR-100 and SVHN we used ResNet-18 [He et al 2016] We do not use any data augmentationor manual learning rate decay We deployed early stopping in the same way as in Section 41

We rst ran experiments with the following candidates for the weight decay rate 1times10minus5 1times10minus4 4times 10minus4 7times 10minus4 1times 10minus3 4times 10minus3 7times 10minus3 We choose the weight decay rate withthe best validation accuracy for each dataset and each trainingvalidation proportion en xingthe weight decay to the chosen one we ran experiments with ooding level candidates from0 001 020 to investigate whether weight decay and ooding have complementary eectsor if adding weight decay will diminish the accuracy gain of ooding

Results   We show the results in Table 3 and the chosen flooding levels in Table 4 in Appendix C.2. We can observe that flooding gives better accuracy in most cases. We can also see that combining flooding with early stopping, or with both early stopping and weight decay, may lead to even better accuracy in some cases.


(a) w/o early stop (0.8) (b) w/o early stop (0.4) (c) w/ early stop (0.8) (d) w/ early stop (0.4)

Figure 3: We show that the optimal flooding level maintains memorization. The vertical axis shows the training accuracy. The horizontal axis shows the flooding level. We show results with and without early stopping and for different training/validation splits with proportion of 0.8 or 0.4. The marks are placed on the flooding level that was chosen based on validation accuracy.

4.3 Memorization

Can we maintain memorization even after adding flooding? We investigate whether the trained model has zero training error (100% accuracy) for the flooding level that was chosen with validation data. We show the results for all benchmark datasets and all training/validation splits with proportions 0.8 and 0.4. We also show the case without early stopping (choosing the last epoch) and with early stopping (choosing the epoch with the highest validation accuracy). The results are shown in Fig. 3.
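Checking memorization in this sense only requires the training error of the chosen model; a minimal sketch with placeholder model and loader names is given below.

```python
import torch

@torch.no_grad()
def training_error(model, train_loader):
    """Fraction of misclassified training samples; zero means memorization."""
    model.eval()
    wrong, total = 0, 0
    for x, y in train_loader:
        pred = model(x).argmax(dim=1)
        wrong += (pred != y).sum().item()
        total += y.numel()
    return wrong / total
```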

All figures show downward curves, implying that the model will eventually give up memorizing all training data as the flooding level becomes higher. A more interesting and important observation is the position of the optimal flooding level (the one chosen by validation accuracy, which is marked in Fig. 3). We can observe that the marks are often plotted at zero training error, and in some cases a mark lies on the highest flooding level that still maintains zero error. These results are consistent with recent empirical works implying that zero training error leads to lower generalization error [Belkin et al., 2019, Nakkiran et al., 2020], but we further demonstrate that zero training loss may be harmful under zero training error.

5 Conclusion

We proposed a novel regularization method called flooding that keeps the training loss floating around a small constant value to avoid zero training loss. In our experiments, the optimal flooding level often maintained memorization of the training data with zero error. With flooding, we showed that the test accuracy improves for various benchmark datasets, and we theoretically showed that the mean squared error will be reduced under certain conditions.

As a byproduct, we were able to produce a double descent curve for the test loss with a relatively small number of epochs, e.g., in around 100 epochs, as shown in Fig. 2 and Fig. 4 in Appendix D. An important future direction is to study the relationship between this curve and the double descent curves


from previous works [Belkin et al., 2019, Krogh and Hertz, 1992, Nakkiran et al., 2020].

It would also be interesting to see if Bayesian optimization [Shahriari et al., 2016, Snoek et al., 2012] methods can be utilized to search for the optimal flooding level efficiently. We will investigate this direction in the future.

Acknowledgements

We thank Chang Xu, Genki Yamamoto, Kento Nozawa, Nontawat Charoenphakdee, Voot Tangkaratt, and Yoshihiro Nagano for the helpful discussions. TI was supported by the Google PhD Fellowship Program. MS and IY were supported by JST CREST Grant Number JPMJCR18A2, including the AIP challenge program, Japan.

References

Devansh Arpit, Stanisław Jastrzebski, Nicolas Ballas, David Krueger, Emmanuel Bengio, Maxinder S. Kanwal, Tegan Maharaj, Asja Fischer, Aaron Courville, Yoshua Bengio, and Simon Lacoste-Julien. A closer look at memorization in deep networks. In ICML, 2017.

Mikhail Belkin, Daniel Hsu, and Partha P. Mitra. Overfitting or perfect fitting? Risk bounds for classification and regression rules that interpolate. In NeurIPS, 2018.

Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machine-learning practice and the classical bias-variance trade-off. PNAS, 116:15850–15854, 2019.

David Berthelot, Nicholas Carlini, Ian Goodfellow, Nicolas Papernot, Avital Oliver, and Colin A. Raffel. MixMatch: A holistic approach to semi-supervised learning. In NeurIPS, 2019.

Christopher M. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer, 2011.

Christopher Michael Bishop. Regularization and complexity control in feed-forward networks. In ICANN, 1995.

Rich Caruana, Steve Lawrence, and C. Lee Giles. Overfitting in neural nets: Backpropagation, conjugate gradient, and early stopping. In NeurIPS, 2000.

Pratik Chaudhari, Anna Choromanska, Stefano Soatto, Yann LeCun, Carlo Baldassi, Christian Borgs, Jennifer Chayes, Levent Sagun, and Riccardo Zecchina. Entropy-SGD: Biasing gradient descent into wide valleys. In ICLR, 2017.

Jesus Cid-Sueiro, Darío García-García, and Raúl Santos-Rodríguez. Consistency of losses for learning from weak labels. In ECML-PKDD, 2014.


Tarin Clanuwat, Mikel Bober-Irizar, Asanobu Kitamoto, Alex Lamb, Kazuaki Yamamoto, and David Ha. Deep learning for classical Japanese literature. In NeurIPS Workshop on Machine Learning for Creativity and Design, 2018.

Marthinus Christoffel du Plessis, Gang Niu, and Masashi Sugiyama. Analysis of learning from positive and unlabeled data. In NeurIPS, 2014.

Marthinus Christoffel du Plessis, Gang Niu, and Masashi Sugiyama. Convex formulation for learning from positive and unlabeled data. In ICML, 2015.

Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.

Hongyu Guo, Yongyi Mao, and Richong Zhang. Augmenting data with mixup for sentence classification: An empirical study. arXiv:1905.08941, 2019.

Bo Han, Gang Niu, Jiangchao Yao, Xingrui Yu, Miao Xu, Ivor Tsang, and Masashi Sugiyama. Pumpout: A meta approach to robust deep learning with noisy labels. arXiv:1809.11008, 2018.

Stephen Jose Hanson and Lorien Y. Pratt. Comparing biases for minimal network construction with back-propagation. In NeurIPS, 1988.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.

Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.

Takashi Ishida, Gang Niu, Aditya Krishna Menon, and Masashi Sugiyama. Complementary-label learning for arbitrary losses and models. In ICML, 2019.

Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. In ICLR, 2017.

Diederik P. Kingma and Jimmy L. Ba. Adam: A method for stochastic optimization. In ICLR, 2015.

Ryuichi Kiryo, Gang Niu, Marthinus Christoffel du Plessis, and Masashi Sugiyama. Positive-unlabeled learning with non-negative risk estimator. In NeurIPS, 2017.

Anders Krogh and John A. Hertz. A simple weight decay can improve generalization. In NeurIPS, 1992.

Yann Lecun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. In Proceedings of the IEEE, pages 2278–2324, 1998.

Hao Li, Zheng Xu, Gavin Taylor, Christoph Studer, and Tom Goldstein. Visualizing the loss landscape of neural nets. In NeurIPS, 2018.


Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In ICLR, 2019.

Nan Lu, Tianyi Zhang, Gang Niu, and Masashi Sugiyama. Mitigating overfitting in supervised classification from two unlabeled datasets: A consistent risk correction approach. In AISTATS, 2020.

N. Morgan and H. Bourlard. Generalization and parameter estimation in feedforward nets: Some experiments. In NeurIPS, 1990.

Vinod Nair and Geoffrey E. Hinton. Rectified linear units improve restricted Boltzmann machines. In ICML, 2010.

Preetum Nakkiran, Gal Kaplun, Dimitris Kalimeris, Tristan Yang, Benjamin L. Edelman, Fred Zhang, and Boaz Barak. SGD on neural networks learns functions of increasing complexity. In NeurIPS, 2019.

Preetum Nakkiran, Gal Kaplun, Yamini Bansal, Tristan Yang, Boaz Barak, and Ilya Sutskever. Deep double descent: Where bigger models and more data hurt. In ICLR, 2020.

Nagarajan Natarajan, Inderjit S. Dhillon, Pradeep K. Ravikumar, and Ambuj Tewari. Learning with noisy labels. In NeurIPS, 2013.

Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y. Ng. Reading digits in natural images with unsupervised feature learning. In NeurIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.

Andrew Y. Ng. Preventing "overfitting" of cross-validation data. In ICML, 1997.

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An imperative style, high-performance deep learning library. In NeurIPS, 2019.

Giorgio Patrini, Alessandro Rozza, Aditya K. Menon, Richard Nock, and Lizhen Qu. Making deep neural networks robust to label noise: A loss correction approach. In CVPR, 2017.

Herbert Robbins and Sutton Monro. A stochastic approximation method. Annals of Mathematical Statistics, 22:400–407, 1951.

Rebecca Roelofs, Vaishaal Shankar, Benjamin Recht, Sara Fridovich-Keil, Moritz Hardt, John Miller, and Ludwig Schmidt. A meta-analysis of overfitting in machine learning. In NeurIPS, 2019.

Bobak Shahriari, Kevin Swersky, Ziyu Wang, Ryan P. Adams, and Nando de Freitas. Taking the human out of the loop: A review of Bayesian optimization. Proceedings of the IEEE, 104:148–175, 2016.


Connor Shorten and Taghi M. Khoshgoftaar. A survey on image data augmentation for deep learning. Journal of Big Data, 6, 2019.

Jasper Snoek, Hugo Larochelle, and Ryan P. Adams. Practical Bayesian optimization of machine learning algorithms. In NeurIPS, 2012.

Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15:1929–1958, 2014.

Masashi Sugiyama. Introduction to Statistical Machine Learning. Morgan Kaufmann, 2015.

Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, and Jon Shlens. Rethinking the inception architecture for computer vision. In CVPR, 2016.

Sunil Thulasidasan, Gopinath Chennupati, Jeff Bilmes, Tanmoy Bhattacharya, and Sarah Michalak. On mixup training: Improved calibration and predictive uncertainty for deep neural networks. In NeurIPS, 2019.

Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58:267–288, 1996.

A. N. Tikhonov and V. Y. Arsenin. Solutions of Ill Posed Problems. Winston, 1977.

Andrey Nikolayevich Tikhonov. On the stability of inverse problems. Doklady Akademii Nauk SSSR, 39:195–198, 1943.

Antonio Torralba, Rob Fergus, and William T. Freeman. 80 million tiny images: A large data set for nonparametric object and scene recognition. IEEE Trans. PAMI, 2008.

Brendan van Rooyen and Robert C. Williamson. A theory of learning with corrupted labels. Journal of Machine Learning Research, 18:1–50, 2018.

Vikas Verma, Alex Lamb, Juho Kannala, Yoshua Bengio, and David Lopez-Paz. Interpolation consistency training for semi-supervised learning. In IJCAI, 2019.

Stefan Wager, Sida Wang, and Percy Liang. Dropout training as adaptive regularization. In NeurIPS, 2013.

Roman Werpachowski, András György, and Csaba Szepesvári. Detecting overfitting via adversarial examples. In NeurIPS, 2019.

Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.

Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. In ICLR, 2017.


Hongyi Zhang, Moustapha Cisse, Yann N. Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. In ICLR, 2018.

Tong Zhang. Statistical behavior and consistency of classification methods based on convex risk minimization. The Annals of Statistics, 32:56–85, 2004.


A Proof of Theorem

Proof. If the flooding level is $b$, then the proposed flooding estimator is

$\tilde{R}(g) = |\hat{R}(g) - b| + b. \qquad (12)$

Since the absolute value can be expressed with a max operator as $\max(a, b) = \frac{a + b + |a - b|}{2}$, the proposed estimator can be re-expressed as

$\tilde{R}(g) = 2\max(\hat{R}(g), b) - \hat{R}(g) = A - \hat{R}(g). \qquad (13)$

For convenience, we used $A = 2\max(\hat{R}(g), b)$. From the definition of MSE,

$\mathrm{MSE}(\hat{R}(g)) = \mathbb{E}[(\hat{R}(g) - R(g))^2], \qquad (14)$

$\mathrm{MSE}(\tilde{R}(g)) = \mathbb{E}[(\tilde{R}(g) - R(g))^2] \qquad (15)$
$= \mathbb{E}[(A - \hat{R}(g) - R(g))^2] \qquad (16)$
$= \mathbb{E}[A^2] - 2\,\mathbb{E}[A(\hat{R}(g) + R(g))] + \mathbb{E}[(\hat{R}(g) + R(g))^2]. \qquad (17)$

We are interested in the sign of

$\mathrm{MSE}(\hat{R}(g)) - \mathrm{MSE}(\tilde{R}(g)) = \mathbb{E}[-4\hat{R}(g)R(g) - A^2 + 2A(\hat{R}(g) + R(g))]. \qquad (18)$

Define the inside of the expectation as $B = -4\hat{R}(g)R(g) - A^2 + 2A(\hat{R}(g) + R(g))$. $B$ can be divided into two cases, depending on the outcome of the max operator:

$B = \begin{cases} -4\hat{R}(g)R(g) - 4\hat{R}(g)^2 + 4\hat{R}(g)(\hat{R}(g) + R(g)) & \text{if } \hat{R}(g) \ge b \\ -4\hat{R}(g)R(g) - 4b^2 + 4b(\hat{R}(g) + R(g)) & \text{if } \hat{R}(g) < b \end{cases} \qquad (19)$

$= \begin{cases} 0 & \text{if } \hat{R}(g) \ge b \\ -4\hat{R}(g)R(g) - 4b^2 + 4b(\hat{R}(g) + R(g)) & \text{if } \hat{R}(g) < b \end{cases} \qquad (20)$

$= \begin{cases} 0 & \text{if } \hat{R}(g) \ge b \\ -4(b - \hat{R}(g))(b - R(g)) & \text{if } \hat{R}(g) < b. \end{cases} \qquad (21)$

The latter case becomes positive when $\hat{R}(g) < b < R(g)$. Therefore, when $\hat{R}(g) < b < R(g)$,

$\mathrm{MSE}(\hat{R}(g)) - \mathrm{MSE}(\tilde{R}(g)) > 0, \qquad (22)$
$\mathrm{MSE}(\hat{R}(g)) > \mathrm{MSE}(\tilde{R}(g)). \qquad (23)$

When $b \le \hat{R}(g)$,

$\mathrm{MSE}(\hat{R}(g)) - \mathrm{MSE}(\tilde{R}(g)) = 0, \qquad (24)$
$\mathrm{MSE}(\hat{R}(g)) = \mathrm{MSE}(\tilde{R}(g)). \qquad (25)$
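The case analysis above can also be checked numerically. The following illustrative simulation (not from the paper) models the empirical risk as a noisy estimate of the true risk and compares the two MSEs in a regime where $\hat{R}(g) < b < R(g)$ occurs with positive probability; the particular values of the true risk, flooding level, and noise scale are arbitrary choices for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
R_true = 0.15                       # true risk R(g)
b = 0.10                            # flooding level, so R_hat < b < R_true can occur
R_hat = R_true + rng.normal(scale=0.10, size=1_000_000)   # noisy empirical risk

R_tilde = np.abs(R_hat - b) + b     # flooded risk estimator
mse_hat = np.mean((R_hat - R_true) ** 2)
mse_tilde = np.mean((R_tilde - R_true) ** 2)
print(mse_hat > mse_tilde)          # True: the flooded estimator has smaller MSE here
```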


B Bayes Risk for Gaussian Distributions

In this section, we explain in detail how we derived the Bayes risk with respect to the surrogate loss in the experiments with Gaussian data in Section 4.1. Since we are using the logistic loss in the synthetic experiments, the loss of the margin is

$\ell(yg(x)) = \log(1 + \exp(-yg(x))), \qquad (26)$

where $g(x): \mathbb{R}^d \to \mathbb{R}$ is a scalar instead of the vector definition that was used previously, because the synthetic experiments only consider binary classification. Take the derivative with respect to $g(x)$:

$\dfrac{\partial \mathbb{E}[\ell(yg(x))]}{\partial g(x)} = \dfrac{\partial \mathbb{E}[\log(1 + \exp(-yg(x)))]}{\partial g(x)} = \mathbb{E}\left[\dfrac{-y\exp(-yg(x))}{1 + \exp(-yg(x))}\,\Big|\,x\right]p(x) \qquad (27)$

$= \mathbb{E}\left[\dfrac{-y}{1 + \exp(yg(x))}\,\Big|\,x\right]p(x) \qquad (28)$

$= \mathbb{E}\left[\dfrac{y+1}{2}\cdot\dfrac{-1}{\exp(g(x)) + 1} + \dfrac{y-1}{2}\cdot\dfrac{-1}{\exp(-g(x)) + 1}\,\Big|\,x\right]p(x) \qquad (29)$

$= \mathbb{E}\left[\dfrac{y+1}{2}\,\Big|\,x\right]\dfrac{-1}{\exp(g(x)) + 1}\,p(x) + \mathbb{E}\left[\dfrac{y-1}{2}\,\Big|\,x\right]\dfrac{-1}{\exp(-g(x)) + 1}\,p(x) \qquad (30)$

$= p(y=+1|x)\dfrac{-1}{\exp(g(x)) + 1}\,p(x) + p(y=-1|x)\dfrac{1}{\exp(-g(x)) + 1}\,p(x). \qquad (31)$

Setting this to zero and dividing by $p(x) > 0$, we obtain

$p(y=-1|x)\dfrac{1}{\exp(-g(x)) + 1} = p(y=+1|x)\dfrac{1}{\exp(g(x)) + 1}, \qquad (32)$

$\exp(g(x)) = \dfrac{p(y=+1|x)}{p(y=-1|x)}, \qquad (33)$

$g(x) = \log\dfrac{p(y=+1|x)}{p(y=-1|x)}. \qquad (34)$

Since we are interested in the surrogate loss under this classifier, we plug it into the logistic loss to obtain the Bayes risk:

$\mathbb{E}[\ell(yg(x))] = \mathbb{E}\left[\log\left(1 + \dfrac{p(-y|x)}{p(y|x)}\right)\right] = \mathbb{E}\left[\log\dfrac{1}{p(y|x)}\right] = \mathbb{E}[-\log p(y|x)]. \qquad (35)$

In the experiments in Section 4.1, we report the empirical version of this with the test dataset as the Bayes risk.
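As an illustration of how this quantity can be evaluated, the sketch below computes the empirical version of $\mathbb{E}[-\log p(y|x)]$ on test data when the class-conditional densities are known Gaussians. The means, covariance, and class prior below are placeholders rather than the values used in Section 4.1.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical class-conditional Gaussians and class prior (illustrative only).
mu_pos, mu_neg = np.array([1.0, 1.0]), np.array([-1.0, -1.0])
cov = np.eye(2)
prior_pos = 0.5

def empirical_bayes_risk(x_test, y_test):
    """Empirical version of E[-log p(y|x)] under the known Gaussian model."""
    d_pos = multivariate_normal.pdf(x_test, mean=mu_pos, cov=cov) * prior_pos
    d_neg = multivariate_normal.pdf(x_test, mean=mu_neg, cov=cov) * (1 - prior_pos)
    p_y = np.where(y_test == 1, d_pos, d_neg) / (d_pos + d_neg)   # p(y|x)
    return -np.log(p_y).mean()

# Example: draw a test set from the model itself and evaluate.
rng = np.random.default_rng(0)
y_test = rng.choice([1, -1], size=1000, p=[prior_pos, 1 - prior_pos])
x_test = np.where(y_test[:, None] == 1, mu_pos, mu_neg) + rng.standard_normal((1000, 2))
print(empirical_bayes_risk(x_test, y_test))
```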


C Details of Experiments

C.1 Benchmark Datasets

In the experiments in Section 4.2, we use the six benchmark datasets explained below.

• MNIST⁶ [Lecun et al., 1998] is a 10-class dataset of handwritten digits: 1, 2, ..., 9, and 0. Each sample is a 28×28 grayscale image. The number of training and test samples are 60,000 and 10,000, respectively.

• Fashion-MNIST⁷ [Xiao et al., 2017] is a 10-class dataset of fashion items: T-shirt/top, Trouser, Pullover, Dress, Coat, Sandal, Shirt, Sneaker, Bag, and Ankle boot. Each sample is a 28×28 grayscale image. The number of training and test samples are 60,000 and 10,000, respectively.

• Kuzushiji-MNIST⁸ [Clanuwat et al., 2018] is a 10-class dataset of cursive Japanese ("Kuzushiji") characters. Each sample is a 28×28 grayscale image. The number of training and test samples are 60,000 and 10,000, respectively.

• CIFAR-10⁹ is a 10-class dataset of various objects: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck. Each sample is a colored image in 32×32×3 RGB format. It is a subset of the 80 million tiny images dataset [Torralba et al., 2008]. There are 6,000 images per class, where 5,000 are for training and 1,000 are for test.

• CIFAR-100¹⁰ is a 100-class dataset of various objects. Each class has 600 samples, where 500 samples are for training and 100 samples are for test. This is also a subset of the 80 million tiny images dataset [Torralba et al., 2008].

• SVHN¹¹ [Netzer et al., 2011] is a 10-class dataset of house numbers from Google Street View images, in 32×32×3 RGB format. 73,257 digits are for training and 26,032 digits are for testing.

⁶ http://yann.lecun.com/exdb/mnist/
⁷ https://github.com/zalandoresearch/fashion-mnist
⁸ https://github.com/rois-codh/kmnist
⁹ https://www.cs.toronto.edu/~kriz/cifar.html
¹⁰ https://www.cs.toronto.edu/~kriz/cifar.html
¹¹ http://ufldl.stanford.edu/housenumbers
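All six datasets are also available through torchvision. As a minimal sketch (the data path and transform are placeholders, and SVHN uses a split argument instead of a train flag), the 0.8 training/validation split can be obtained as follows; the other datasets are loaded analogously.

```python
import torch
from torchvision import datasets, transforms

t = transforms.ToTensor()
train_full = datasets.KMNIST("data", train=True, download=True, transform=t)
test_set = datasets.KMNIST("data", train=False, download=True, transform=t)
# datasets.MNIST, FashionMNIST, CIFAR10, CIFAR100, and SVHN work similarly
# (SVHN: datasets.SVHN("data", split="train", ...)).

# Split the original training set into training and validation data (proportion 0.8).
n_train = int(0.8 * len(train_full))
train_set, val_set = torch.utils.data.random_split(
    train_full, [n_train, len(train_full) - n_train],
    generator=torch.Generator().manual_seed(0))
```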


C.2 Chosen Flooding Levels

In Table 4, we report the chosen flooding levels for our experiments with benchmark datasets.

Table 4: The chosen flooding levels for benchmark experiments.

Early stopping            ×      ✓      ×      ✓
Weight decay              ×      ×      ✓      ✓
Flooding                  ✓      ✓      ✓      ✓

MNIST (0.8)             0.02   0.02   0.03   0.02
MNIST (0.4)             0.02   0.03   0.00   0.02
Fashion-MNIST (0.8)     0.00   0.00    —      —
Fashion-MNIST (0.4)     0.00   0.00    —      —
Kuzushiji-MNIST (0.8)   0.01   0.03   0.03   0.03
Kuzushiji-MNIST (0.4)   0.03   0.02   0.04   0.03
CIFAR-10 (0.8)          0.04   0.04   0.00   0.01
CIFAR-10 (0.4)          0.03   0.03   0.00   0.00
CIFAR-100 (0.8)         0.02   0.02   0.00   0.00
CIFAR-100 (0.4)         0.03   0.03   0.00   0.00
SVHN (0.8)              0.01   0.01   0.00   0.02
SVHN (0.4)              0.01   0.09   0.03   0.03

D Learning Curves

In Figure 4, we visualize learning curves for all datasets (including CIFAR-10, which we already visualized in Figure 2). We only show the learning curves for training/validation proportion of 0.8, since the results for 0.4 were similar to those for 0.8. Note that Figure 1(c) shows the learning curves for the first 80 epochs for CIFAR-10 without flooding, and Figure 1(d) shows the learning curves with flooding when the flooding level is 0.18.


(a) MNIST (0.00) (b) MNIST (0.01) (c) MNIST (0.02) (d) MNIST (0.03)
(e) Fashion-MNIST (0.00) (f) Fashion-MNIST (0.02) (g) Fashion-MNIST (0.05) (h) Fashion-MNIST (0.10)
(i) KMNIST (0.00) (j) KMNIST (0.02) (k) KMNIST (0.05) (l) KMNIST (0.10)
(m) CIFAR-10 (0.00) (n) CIFAR-10 (0.03) (o) CIFAR-10 (0.07) (p) CIFAR-10 (0.20)
(q) CIFAR-100 (0.00) (r) CIFAR-100 (0.01) (s) CIFAR-100 (0.06) (t) CIFAR-100 (0.20)
(u) SVHN (0.00) (v) SVHN (0.01) (w) SVHN (0.06) (x) SVHN (0.20)

Figure 4: Learning curves of training and test loss. The first figure in each row shows the learning curves without flooding. The 2nd, 3rd, and 4th columns show the results with different flooding levels. The flooding level increases towards the right-hand side.

E Relationship between Performance and Gradients

Settings   We visualize the relationship between test performance (loss or accuracy) and the gradient amplitude of the training/test loss in Figure 5, where the gradient amplitude is the $\ell_2$ norm of the filter-normalized gradient of the loss. The filter-normalized gradient is the gradient appropriately scaled depending on the magnitude of the corresponding convolutional filter, similarly to Li et al. [2018]. More specifically, for each filter of every convolutional layer, we multiply the corresponding elements of the gradient by the norm of the filter. Note that a fully connected layer is a special case of a convolutional layer and is subject to this scaling. We exclude Fashion-MNIST because the optimal flooding level was zero. We used the results with training/validation split ratio of 0.8.
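A sketch of our reading of this scaling is given below: for every weight tensor with a filter (or row) structure, the gradient slice of each filter is multiplied by that filter's norm before the overall $\ell_2$ norm is taken. This is illustrative code, not the authors' implementation.

```python
import torch

def filter_normalized_grad_norm(model):
    """l2 norm of the gradient after scaling each filter's gradient slice
    by the norm of the corresponding filter (in the spirit of Li et al., 2018)."""
    total = 0.0
    for p in model.parameters():
        if p.grad is None:
            continue
        if p.dim() >= 2:                               # conv filters / fc rows
            w = p.view(p.size(0), -1)
            g = p.grad.view(p.size(0), -1)
            scaled = g * w.norm(dim=1, keepdim=True)   # multiply by the filter norm
        else:                                          # biases etc. left unscaled
            scaled = p.grad
        total += scaled.pow(2).sum().item()
    return total ** 0.5
```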

Results   For the figures with the gradient amplitude of the training loss on the horizontal axis, the "◦" marks (with flooding) are often plotted to the right of the "+" marks (without flooding), which implies that flooding prevents the model from staying at a local minimum. For the figures with the gradient amplitude of the test loss on the horizontal axis, we can observe that the method with flooding ("◦") improves performance while the gradient amplitude becomes smaller. On the other hand, the performance of the method without flooding ("+") degenerates while the gradient amplitude of the test loss keeps increasing.


(a) MNIST x: train, y: loss (b) MNIST x: train, y: acc (c) MNIST x: test, y: loss (d) MNIST x: test, y: acc
(e) K-M x: train, y: loss (f) K-M x: train, y: acc (g) K-M x: test, y: loss (h) K-M x: test, y: acc
(i) C10 x: train, y: loss (j) C10 x: train, y: acc (k) C10 x: test, y: loss (l) C10 x: test, y: acc
(m) C100 x: train, y: loss (n) C100 x: train, y: acc (o) C100 x: test, y: loss (p) C100 x: test, y: acc
(q) SVHN x: train, y: loss (r) SVHN x: train, y: acc (s) SVHN x: test, y: loss (t) SVHN x: test, y: acc

Figure 5: Relationship between test performance (loss or accuracy) and amplitude of gradient (with training or test loss). Each point ("◦" or "+") in the figures corresponds to a single model at a certain epoch. We remove the first 5 epochs and plot the rest. "◦" is used for the method with flooding and "+" is used for the method without flooding. The large black "◦" and "+" show the epochs with early stopping. The color becomes lighter (purple → yellow) as the training proceeds. K-M, C10, and C100 stand for Kuzushiji-MNIST, CIFAR-10, and CIFAR-100.


F Flatness

Settings   We follow Li et al. [2018] and give a one-dimensional visualization of flatness for each dataset in Figure 6. We exclude Fashion-MNIST because the optimal flooding level was zero. We used the results with training/validation split ratio of 0.8. We compare the flatness of the model right after the empirical risk with respect to a mini-batch becomes smaller than the flooding level, $\hat{R}_m(g) < b$, for the first time (dotted blue line) and the model after training (solid blue line). We also compare them with the model trained by the baseline method without flooding, after training is finished (solid red line).
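A minimal sketch of such a one-dimensional plot is given below: the loss is evaluated along a random direction rescaled filter-wise to match the parameter norms, following the general recipe of Li et al. [2018]. Here `loss_on` is a placeholder for a function that evaluates the training or test loss of the (perturbed) model.

```python
import torch

def loss_along_direction(model, loss_on, alphas):
    """Evaluate the loss at parameters theta + alpha * d, where d is a random
    direction rescaled filter-wise so its norms match those of theta."""
    theta = [p.detach().clone() for p in model.parameters()]
    direction = [torch.randn_like(p) for p in theta]
    for d, p in zip(direction, theta):                 # filter-wise normalization
        if p.dim() >= 2:
            d_flat, p_flat = d.view(d.size(0), -1), p.view(p.size(0), -1)
            d_flat.mul_(p_flat.norm(dim=1, keepdim=True) /
                        (d_flat.norm(dim=1, keepdim=True) + 1e-10))
    losses = []
    with torch.no_grad():
        for a in alphas:
            for p, t, d in zip(model.parameters(), theta, direction):
                p.copy_(t + a * d)                     # perturb the parameters
            losses.append(loss_on(model))
        for p, t in zip(model.parameters(), theta):    # restore the original weights
            p.copy_(t)
    return losses
```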

Results   According to Figure 6, the test loss becomes lower and flatter during the training with flooding. Note that the training loss, on the other hand, continues to float around the flooding level until the end of training after it enters the flooding zone. We expect that the model makes a random walk and escapes regions with sharp loss landscapes during this period. This may be a possible reason for the better generalization results with our proposed method.


(a) MNIST (train) (b) MNIST (test) (c) Kuzushiji-MNIST (train) (d) Kuzushiji-MNIST (test)
(e) CIFAR-10 (train) (f) CIFAR-10 (test) (g) CIFAR-100 (train) (h) CIFAR-100 (test)
(i) SVHN (train) (j) SVHN (test)

Figure 6: One-dimensional visualization of flatness. We visualize the training/test loss with respect to perturbation. We depict the results for 3 models: the model when the empirical risk with respect to training data is below the flooding level for the first time during training (dotted blue), the model at the end of training with flooding (solid blue), and the model at the end of training without flooding (solid red).


• 1 Introduction
• 2 Backgrounds
  • 2.1 Regularization Methods
  • 2.2 Double Descent Curves with Overparametrization
  • 2.3 Lower-Bounding the Empirical Risk
• 3 Flooding: How to Avoid Zero Training Loss
  • 3.1 Preliminaries
  • 3.2 Algorithm
  • 3.3 Implementation
  • 3.4 Theoretical Analysis
• 4 Experiments
  • 4.1 Synthetic Experiments
  • 4.2 Benchmark Experiments
  • 4.3 Memorization
• 5 Conclusion
• A Proof of Theorem
• B Bayes Risk for Gaussian Distributions
• C Details of Experiments
  • C.1 Benchmark Datasets
  • C.2 Chosen Flooding Levels
• D Learning Curves
• E Relationship between Performance and Gradients
• F Flatness
Page 11: Do We Need Zero Training Loss A›er Achieving Zero Training … · 2020. 2. 21. · directly aim to avoid zero training loss, they o›en fail to maintain a moderate level of training

Table 3 Results with benchmark datasets We report classication accuracy for all combinations ofweight decay (X and times) early stopping (X and times) and ooding (X and times) e second column shows thetrainingvalidation split used for the experiment W stands for weight decay E stands for early stopping andF stands for ooding ldquomdashrdquo means that ooding level of zero was optimal ldquoNArdquo means that we skipped theexperiments because zero weight decay was optimal in the case without ooding e best and equivalentare shown in bold by comparing ldquowith oodingrdquo and ldquowithout oodingrdquo for two columns with the sameseing for W and E eg the rst and h columns out of the 8 columns e best performing combinationis highlighted

Datasettr W times times X X times times X Xva E times X times X times X times X

split F times times times times X X X X

MNIST 08 9832 9830 9851 9842 9846 9853 9850 984804 9771 9770 9782 9791 9774 9785 mdash 9783

Fashion- 08 8934 8936 NA NA mdash mdash NA NAMNIST 04 8848 8863 8860 8862 mdash mdash mdash mdash

Kuzushiji- 08 9163 9162 9163 9171 9240 9212 9211 9197MNIST 04 8918 8918 8958 8973 9041 9015 8971 8988

CIFAR-10 08 7359 7336 7365 7357 7306 7344 mdash 744104 6639 6663 6931 6928 6720 6758 mdash mdash

CIFAR-100 08 4216 4233 4267 4245 4250 4236 mdash mdash04 3427 3434 3797 3882 3499 3514 mdash mdash

SVHN 08 9238 9241 9320 9299 9278 9279 mdash 934204 9032 9035 9043 9049 9057 9061 9116 9121

helps the late-stage training improve test accuracyWe also conducted experiments with early stopping meaning that we chose the model that

recorded the best validation accuracy during training e results are reported in sub-table (B) ofTable 2 Compared with sub-table (A) we see that early stopping improves the baseline methodwithout ooding well in many cases is indicates that training longer without ooding washarmful in our experiments On the other hand the accuracy for ooding combined with earlystopping is oen close to that with early stopping meaning that training until the end withooding tends to be already as good as doing so with early stopping e table shows that oodingoen improves or retains the test accuracy of the baseline method without ooding even aerdeploying early stopping Flooding does not hurt performance but can be benecial for methodsused with early stopping

11

(a) CIFAR-10 (000) (b) CIFAR-10 (003) (c) CIFAR-10 (007) (d) CIFAR-10 (020)

Figure 2 Learning curves of training and test loss for trainingvalidation proportion of 08 (a) shows thelearning curves without ooding (b) (c) and (d) show the learning curves with dierent ooding levels

42 Benchmark ExperimentsWe next perform experiments with benchmark datasets Not only do we compare with the baselinewithout ooding we also compare or combine with other general regularization methods whichare early stopping and weight decay

Settings We use the following six benchmark datasets MNIST Fashion-MNIST Kuzushiji-MNIST CIFAR-10 CIFAR-100 and SVHN e details of the benchmark datasets can be found inAppendix C1 We split the original training dataset into training and validation data with dierentproportions 08 or 04 (meaning 80 or 40 was used for training and the rest was used forvalidation respectively) We perform the exhaustive hyper-parameter search for the ooding levelwith candidates from 000 001 020 e number of epochs is 500 Stochastic gradientdescent [Robbins and Monro 1951] is used with learning rate of 01 and momentum of 09 ForMNIST Fashion-MNIST and Kuzushiji-MNIST we use a one-hidden-layer feedforward neuralnetwork with 500 units and ReLU activation function [Nair and Hinton 2010] For CIFAR-10CIFAR-100 and SVHN we used ResNet-18 [He et al 2016] We do not use any data augmentationor manual learning rate decay We deployed early stopping in the same way as in Section 41

We rst ran experiments with the following candidates for the weight decay rate 1times10minus5 1times10minus4 4times 10minus4 7times 10minus4 1times 10minus3 4times 10minus3 7times 10minus3 We choose the weight decay rate withthe best validation accuracy for each dataset and each trainingvalidation proportion en xingthe weight decay to the chosen one we ran experiments with ooding level candidates from0 001 020 to investigate whether weight decay and ooding have complementary eectsor if adding weight decay will diminish the accuracy gain of ooding

Results We show the results in Table 3 and the chosen ooding levels in Table 4 in Appendix C2We can observe that ooding gives beer accuracy for most cases We can also see that combiningooding with early stopping or with both early stopping and weight decay may lead to even beeraccuracy in some cases

12

(a) wo early stop (08) (b) wo early stop (04) (c) w early stop (08) (d) w early stop (04)

Figure 3 We show the optimal ooding level maintains memorization e vertical axis shows the trainingaccuracy e horizontal axis shows the ooding level We show results with and without early stoppingand for dierent trainingvalidation splits with proportion of 08 or 04 e marks (4 O ) areplaced on the ooding level that was chosen based on validation accuracy

43 MemorizationCan we maintain memorization even aer adding ooding We investigate if the trained modelhas zero training error (100 accuracy) for the ooding level that was chosen with validation dataWe show the results for all benchmark datasets and all trainingvalidation splits with proportions08 and 04 We also show the case without early stopping (choosing the last epoch) and withearly stopping (choosing the epoch with the highest validation accuracy) e results are shownin Fig 3

All gures show downward curves implying that the model will give up eventually onmemorizing all training data as the ooding level becomes higher A more interesting andimportant observation is the position of the optimal ooding level (the one chosen by validationaccuracy which is marked with 4 O or ) We can observe that the marks are oen ploedat zero error and in some cases there is a mark on the highest ooding level that maintains zeroerror ese results are consistent with recent empirical works that imply zero training error leadsto lower generalization error [Belkin et al 2019 Nakkiran et al 2020] but we further demonstratethat zero training loss may be harmful under zero training error

5 ConclusionWe proposed a novel regularization method called ooding that keeps the training loss to stayaround a small constant value to avoid zero training loss In our experiments the optimal oodinglevel oen maintained memorization of training data with zero error With ooding we showedthat the test accuracy will improve for various benchmark datasets and theoretically showed thatthe mean squared error will be reduced under certain conditions

As a byproduct we were able to produce a double descent curve for the test loss with a relativelyfew number of epochs eg in around 100 epochs shown in Fig 2 and Fig 4 in Appendix D Animportant future direction is to study the relationship between this and the double descent curves

13

from previous works [Belkin et al 2019 Krogh and Hertz 1992 Nakkiran et al 2020]It would also be interesting to see if Bayesian optimization [Shahriari et al 2016 Snoek

et al 2012] methods can be utilized to search for the optimal ooding level eciently We willinvestigate this direction in the future

AcknowledgementsWe thank Chang Xu Genki Yamamoto Kento Nozawa Nontawat Charoenphakdee Voot Tangkaraand Yoshihiro Nagano for the helpful discussions TI was supported by the Google PhD FellowshipProgram MS and IY were supported by JST CREST Grant Number JPMJCR18A2 including AIPchallenge program Japan

ReferencesDevansh Arpit Stanisaw Jastrzebski Nicolas Ballas David Krueger Emmanuel Bengio Maxin-

der S Kanwal Tegan Maharaj Asja Fischer Aaron Courville Yoshua Bengio and SimonLacoste-Julien A closer look at memorization in deep networks In ICML 2017

Mikhail Belkin Daniel Hsu and Partha P Mitra Overing or perfect ing Risk bounds forclassication and regression rules that interpolate In NeurIPS 2018

Mikhail Belkin Daniel Hsu Siyuan Ma and Soumik Mandal Reconciling modern machine-learningpractice and the classical biasvariance trade-o PNAS 11615850ndash15854 2019

David Berthelot Nicholas Carlini Ian Goodfellow Nicolas Papernot Avital Oliver and Colin ARael MixMatch A holistic approach to semi-supervised learning In NeurIPS 2019

Christopher M Bishop Paern Recognition and Machine Learning (Information Science andStatistics) Springer 2011

Christopher Michael Bishop Regularization and complexity control in feed-forward networks InICANN 1995

Rich Caruana Steve Lawrence and C Lee Giles Overing in neural nets Backpropagationconjugate gradient and early stopping In NeurIPS 2000

Pratik Chaudhari Anna Choromanska Stefano Soao Yann LeCun Carlo Baldassi ChristianBorgs Jennifer Chayes Levent Sagun and Riccardo Zecchina Entropy-SGD Biasing gradientdescent into wide valleys In ICLR 2017

Jesus Cid-Sueiro Darıo Garcıa-Garcıa and Raul Santos-Rodrıguez Consistency of losses forlearning from weak labels In ECML-PKDD 2014

14

Tarin Clanuwat Mikel Bober-Irizar Asanobu Kitamoto Alex Lamb Kazuaki Yamamoto and DavidHa Deep learning for classical Japanese literature In NeurIPS Workshop on Machine Learningfor Creativity and Design 2018

Marthinus Christoel du Plessis Gang Niu and Masashi Sugiyama Analysis of learning frompositive and unlabeled data In NeurIPS 2014

Marthinus Christoel du Plessis Gang Niu and Masashi Sugiyama Convex formulation forlearning from positive and unlabeled data In ICML 2015

Ian Goodfellow Yoshua Bengio and Aaron Courville Deep Learning MIT Press 2016

Hongyu Guo Yongyi Mao and Richong Zhang Augmenting data with mixup for sentenceclassication An empirical study In arXiv190508941 2019

Bo Han Gang Niu Jiangchao Yao Xingrui Yu Miao Xu Ivor Tsang and Masashi SugiyamaPumpout A meta approach to robust deep learning with noisy labels In arXiv180911008 2018

Stephen Jose Hanson and Lorien Y Pra Comparing biases for minimal network constructionwith back-propagation In NeurIPS 1988

Kaiming He Xiangyu Zhang Shaoqing Ren and Jian Sun Deep residual learning for imagerecognition In CVPR 2016

Sergey Ioe and Christian Szegedy Batch normalization Accelerating deep network training byreducing internal covariate shi In ICML 2015

Takashi Ishida Gang Niu Aditya Krishna Menon and Masashi Sugiyama Complementary-labellearning for arbitrary losses and models In ICML 2019

Nitish Shirish Keskar Dheevatsa Mudigere Jorge Nocedal Mikhail Smelyanskiy and Ping Tak PeterTang On large-batch training for deep learning Generalization gap and sharp minima In ICLR2017

Diederik P Kingma and Jimmy L Ba Adam A method for stochastic optimization In ICLR 2015

Ryuichi Kiryo Gang Niu Marthinus Christoel du Plessis and Masashi Sugiyama Positive-unlabeled learning with non-negative risk estimator In NeurIPS 2017

Anders Krogh and John A Hertz A simple weight decay can improve generalization In NeurIPS1992

Yann Lecun Lon Boou Yoshua Bengio and Patrick Haner Gradient-based learning applied todocument recognition In Proceedings of the IEEE pages 2278ndash2324 1998

Hao Li Zheng Xu Gavin Taylor Christoph Studer and Tom Goldstein Visualizing the losslandscape of neural nets In NeurIPS 2018

15

Ilya Loshchilov and Frank Huer Decoupled weight decay regularization In ICLR 2019

Nan Lu Tianyi Zhang Gang Niu and Masashi Sugiyama Mitigating overing in supervisedclassication from two unlabeled datasets A consistent risk correction approach In AISTATS2020

N Morgan and H Bourlard Generalization and parameter estimation in feedforward nets Someexperiments In NeurIPS 1990

Vinod Nair and Georey E Hinton Rectied linear units improve restricted boltzmann machinesIn ICML 2010

Preetum Nakkiran Gal Kaplun Dimitris Kalimeris Tristan Yang Benjamin L Edelman FredZhang and Boaz Barak SGD on neural networks learns functions of increasing complexity InNeurIPS 2019

Preetum Nakkiran Gal Kaplun Yamini Bansal Tristan Yang Boaz Barak and Ilya Sutskever Deepdouble descent Where bigger models and more data hurt In ICLR 2020

Nagarajan Natarajan Inderjit S Dhillon Pradeep K Ravikumar and Ambuj Tewari Learningwith noisy labels In NeurIPS 2013

Yuval Netzer Tao Wang Adam Coates Alessandro Bissacco Bo Wu and Andrew Y Ng Readingdigits in natural images with unsupervised feature learning In NeurIPS Workshop on DeepLearning and Unsupervised Feature Learning 2011

Andrew Y Ng Preventing ldquooveringrdquo of cross-validation data In ICML 1997

Adam Paszke Sam Gross Francisco Massa Adam Lerer James Bradbury Gregory Chanan TrevorKilleen Zeming Lin Natalia Gimelshein Luca Antiga Alban Desmaison Andreas Kopf EdwardYang Zachary DeVito Martin Raison Alykhan Tejani Sasank Chilamkurthy Benoit SteinerLu Fang Junjie Bai and Soumith Chintala PyTorch An imperative style high-performancedeep learning library In NeurIPS 2019

Giorgio Patrini Alessandro Rozza Aditya K Menon Richard Nock and Lizhen Making deepneural networks robust to label noise A loss correction approach In CVPR 2017

Herbert Robbins and Suon Monro A stochastic approximation method Annals of MathematicalStatistics 22400ndash407 1951

Rebecca Roelofs Vaishaal Shankar Benjamin Recht Sara Fridovich-Keil Moritz Hardt John Millerand Ludwig Schmidt A meta-analysis of overing in machine learning In NeurIPS 2019

Bobak Shahriari Kevin Swersky Ziyu Wang Ryan P Adams and Nando de Freitas Taking thehuman out of the loop A review of bayesian optimization Proceedings of the IEEE 104148ndash1752016

16

Connor Shorten and Taghi M Khoshgoaar A survey on image data augmentation for deeplearning Journal of Big Data 6 2019

Jasper Snoek Hugo Larochelle and Ryan P Adams Practical bayesian optimization of machinelearning algorithms In NeurIPS 2012

Nitish Srivastava Georey Hinton Alex Krizhevsky Ilya Sutskever and Ruslan SalakhutdinovDropout A simple way to prevent neural networks from overing Journal of Machine LearningResearch 151929ndash1958 2014

Masashi Sugiyama Introduction to statistical machine learning Morgan Kaufmann 2015

Christian Szegedy Vincent Vanhoucke Sergey Ioe and Jon Shlens Rethinking the inceptionarchitecture for computer vision In CVPR 2016

Sunil ulasidasan Gopinath Chennupati Je Bilmes Tanmoy Bhaacharya and Sarah MichalakOn mixup training Improved calibration and predictive uncertainty for deep neural networksIn NeurIPS 2019

Robert Tibshirani Regression shrinkage and selection via the lasso Journal of the Royal StatisticalSociety Series B (Methodological) 58267ndash288 1996

A N Tikhonov and V Y Arsenin Solutions of Ill Posed Problems Winston 1977

Andrey Nikolayevich Tikhonov On the stability of inverse problems Doklady Akademii NaukSSSR 39195ndash198 1943

Antonio Torralba Rob Fergus and William T Freeman 80 million tiny images A large data setfor nonparametric object and scene recognition In IEEE Trans PAMI 2008

Brendan van Rooyen and Robert C Williamson A theory of learning with corrupted labelsJournal of Machine Learning Research 181ndash50 2018

Vikas Verma Alex Lamb Juho Kannala Yoshua Bengio and David Lopez-Paz Interpolationconsistency training for semi-supervised learning In IJCAI 2019

Stefan Wager Sida Wang and Percy Liang Dropout training as adaptive regularization In NeurIPS2013

Roman Werpachowski Andrs Gyrgy and Csaba Szepesvri Detecting overing via adversarialexamples In NeurIPS 2019

Han Xiao Kashif Rasul and Roland Vollgraf Fashion-MNIST a novel image dataset for bench-marking machine learning algorithms arXiv preprint arXiv170807747 2017

Chiyuan Zhang Samy Bengio Moritz Hardt Benjamin Recht and Oriol Vinyals Understandingdeep learning requires rethinking generalization In ICLR 2017

17

Hongyi Zhang Moustapha Cisse Yann N Dauphin and David Lopez-Paz mixup Beyondempirical risk minimization In ICLR 2018

Tong Zhang Statistical behavior and consistency of classication methods based on convex riskminimization e Annals of Statistics 3256ndash85 2004

18

A Proof of eoremProof If the ooding level is b then the proposed ooding estimator is

R(g) = |R(g)minus b|+ b (12)

Since the absolute operator can be expressed with a max operator with max(a b) = a+b+|aminusb|2

theproposed estimator can be re-expressed as

R(g) = 2max(R(g) b)minus R(g) = Aminus R(g) (13)

For convenience we used A = 2max(R(g) b) From the denition of MSE

MSE(R(g)) = E[(R(g)minusR(g))2] (14)MSE(R(g)) = E[(R(g)minusR(g))2] (15)

= E[(Aminus R(g)minusR(g))2)] (16)= E[A2]minus 2E[A(R(g) +R(g))] + E[(R(g) +R(g))2] (17)

We are interested in the sign of

MSE(R(g))minusMSE(R(g)) = E[minus4R(g)R(g)minus A2 + 2A(R(g) +R(g))] (18)

Dene the inside of the expectation as B = minus4R(g)R(g)minus A2 + 2A(R(g) + R(g)) B can bedivided into two cases depending on the outcome of the max operator

B =

minus4R(g)R(g)minus 4R(g)2 + 4R(g)(R(g) +R(g)) if R(g) ge b

minus4R(g)R(g)minus 4b2 + 4b(R(g) +R(g)) if R(g) lt b(19)

=

0 if R(g) ge b

minus4R(g)R(g)minus 4b2 + 4b(R(g) +R(g)) if R(g) lt b(20)

=

0 if R(g) ge b

minus4(bminus R(g))(bminusR(g)) if R(g) lt b (21)

e laer case becomes positive when R(g) lt b lt R(g) erefore when R(g) lt b lt R(g)

MSE(R(g))minusMSE(R(g)) gt 0 (22)MSE(R(g)) gt MSE(R(g)) (23)

When b le R(g)

MSE(R(g))minusMSE(R(g)) gt 0 (24)MSE(R(g)) = MSE(R(g)) (25)

19

B Bayes Risk for Gaussian DistributionsIn this section we explain in detail how we derived the Bayes risk with respect to the surrogateloss in the experiments with Gaussian data in Section 41 Since we are using the logistic loss inthe synthetic experiments the loss of the margin is

`(yg(x)) = log(1 + exp(minusyg(x))) (26)

where g(x) Rd rarr R is a scalar instead of the vector denition that was used previously becausethe synthetic experiments only consider binary classication Take the derivative to derive

partE[`(yg(x))]partg(x)

=E[log(1 + exp(minusyg(x)))]

partg(x)= E

[minusy exp(minusyg(x))1 + exp(minusyg(x))

∣∣∣∣∣x]p(x) (27)

= E

[minusy

1 + exp(yg(x))

∣∣∣∣∣x]p(x) (28)

= E

[y + 1

2

1

exp(minusg(x)) + 1+y minus 1

2

minus1exp(g(x)) + 1

∣∣∣∣∣x]p(x) (29)

= E

[y + 1

2

∣∣∣∣∣x]

1

exp(minusg(x)) + 1p(x) + E

[y minus 1

2

∣∣∣∣∣x]

minus1exp(g(x)) + 1

p(x) (30)

= p(y = +1|x) 1

exp(minusg(x)) + 1p(x) + p(y = minus1|x) minus1

exp(g(x)) + 1p(x) (31)

Set this to zero divide by p(x) gt 0 to obtain

p(y = minus1|x) 1

exp(minusg(x)) + 1= p(y = +1|x) 1

exp(g(x)) + 1(32)

exp(g(x)) =p(y = +1|x)p(y = minus1|x)

(33)

g(x) = logp(y = +1|x)p(y = minus1|x)

(34)

Since we are interested in the surrogate loss under this classier we plug this into the logisticloss to obtain the Bayes risk

E[`(yg(x))] = E[log(1 +

p(minusy|x)p(y|x)

)

]= E

[log(

1

p(y|x))

]= E[minus log p(y|x)] (35)

In the experiments in Section 41 we report the empirical version of this with the test dataset asthe Bayes risk

20

C Details of Experiments

C1 Benchmark DatasetsIn the experiments in Section 42 we use six benchmark datasets explained below

bull MNIST6 [Lecun et al 1998] is a 10 class dataset of handwrien digits 1 2 9 and 0 Eachsample is a 28times 28 grayscale image e number of training and test samples are 60000and 10000 respectively

bull Fashion-MNIST7 [Xiao et al 2017] is a 10 class dataset of fashion items T-shirttop TrouserPullover Dress Coat Sandal Shirt Sneaker Bag and Ankle boot Each sample is a 28times 28grayscale image e number of training and test samples are 60000 and 10000 respectively

bull Kuzushiji-MNIST8 [Clanuwat et al 2018] is a 10 class dataset of cursive Japanese (ldquoKuzushijirdquo)characters Each sample is a 28 times 28 grayscale image e number of training and testsamples are 60000 and 10000 respectively

bull CIFAR-109 is a 10 class dataset of various objects airplane automobile bird cat deer dogfrog horse ship and truck Each sample is a colored image in 32times 32times 3 RGB format It isa subset of the 80 million tiny images dataset [Torralba et al 2008] ere are 6000 imagesper class where 5000 are for training and 1000 are for test

bull CIFAR-10010 is a 100 class dataset of various objects Each class has 600 samples where 500samples are for training and 100 samples are for test is is also a subset of the 80 milliontiny images dataset [Torralba et al 2008]

bull SVHN11 [Netzer et al 2011] is a 10 class dataset of house numbers from Google Street Viewimages in 32times 32times 3 RGB format 73257 digits are for training and 26032 digits are fortesting

6httpyannlecuncomexdbmnist7httpsgithubcomzalandoresearchfashion-mnist8httpsgithubcomrois-codhkmnist9httpswwwcstorontoedusimkrizcifarhtml

10httpswwwcstorontoedusimkrizcifarhtml11httpufldlstanfordeduhousenumbers

21

C2 Chosen Flooding LevelsIn Table 4 we report the chosen ooding levels for our experiments with benchmark datasets

Table 4 e chosen ooding levels for benchmark experiments

Early stopping times X times XWeight decay times times X XFlooding X X X X

MNIST (08) 002 002 003 002MNIST (04) 002 003 000 002Fashion-MNIST (08) 000 000 mdash mdashFashion-MNIST (04) 000 000 mdash mdashKuzushiji-MNIST (08) 001 003 003 003Kuzushiji-MNIST (04) 003 002 004 003CIFAR-10 (08) 004 004 000 001CIFAR-10 (04) 003 003 000 000CIFAR-100 (08) 002 002 000 000CIFAR-100 (04) 003 003 000 000SVHN (08) 001 001 000 002SVHN (04) 001 009 003 003

D Learning CurvesIn Figure 4 we visualize learning curves for all datasets (including CIFAR-10 which we alreadyvisualized in Figure 2) We only show the learning curves for trainingvalidation proportion of 08since the results for 04 were similar with 08 Note that Figure 1(c) shows the learning curves forthe rst 80 epochs for CIFAR-10 without ooding Figure 1(d) shows the learning curves withooding when the ooding level is 018

22

(a) MNIST (000) (b) MNIST (001) (c) MNIST (002) (d) MNIST (003)

(e) Fashion-MNIST(000)

(f) Fashion-MNIST(002)

(g) Fashion-MNIST(005)

(h) Fashion-MNIST(010)

(i) KMNIST (000) (j) KMNIST (002) (k) KMNIST (005) (l) KMNIST (010)

(m) CIFAR-10 (000) (n) CIFAR-10 (003) (o) CIFAR-10 (007) (p) CIFAR-10 (020)

(q) CIFAR-100 (000) (r) CIFAR-100 (001) (s) CIFAR-100 (006) (t) CIFAR-100 (020)

(u) SVHN (000) (v) SVHN (001) (w) SVHN (006) (x) SVHN (020)

Figure 4 Learning curves of training and test loss e rst gure in each row is the learning curveswithout ooding e 2nd 3rd and 4th columns show the results with dierent ooding levels e oodinglevel increases towards the right-hand side 23

E Relationship between Performance and GradientsSettings We visualize the relationship between test performance (loss or accuracy) and gradientamplitude of the trainingtest loss in Figure 5 where the gradient amplitude is the `2 norm of thelter-nomalized gradient of the loss e lter-normalized gradient is the gradient appropriatelyscaled depending on the magnitude of the corresponding convolutional lter similarly to Liet al [2018] More specicically for each lter of every comvolutional layer we multiply thecorresponding elements of the gradient by the norm of the lter Note that a fully connected layeris a special case of convolutional layer and subject to this scaling We exclude Fashion-MNISTbecause the optimal ooding level was zero We used the results with trainingvalidation splitratio of 08

Results For the gures with gradient amplitude of training loss on the horizontal axis ldquordquomarks (w ooding) are oen ploed on the right of ldquo+rdquo marks (wo ooding) which impliesthat ooding prevents the model from staying a local minimum For the gures with gradientamplitude of test loss on the horizontal axis we can observe the method with ooding (ldquordquo)improves performance while the gradient amplitude becomes smaller On the other hand theperformance with the method without ooding (ldquo+rdquo) degenerates while the gradient amplitude oftest loss keeps increasing

24

(a) MNIST xtrainyloss

(b) MNIST xtrainyacc

(c) MNIST xtestyloss

(d) MNIST xtest yacc

(e) K-M xtrain yloss (f) K-M xtrain yacc (g) K-M xtest yloss (h) K-M xtest yacc

(i) C10 xtrain yloss (j) C10 xtrain yacc (k) C10 xtest yloss (l) C10 xtest yacc

(m) C100 xtrainyloss

(n) C100 xtrain yacc (o) C100 xtest yloss (p) C100 xtest yacc

(q) SVHN xtrainyloss

(r) SVHN xtrainyacc

(s) SVHN xtest yloss (t) SVHN xtest yacc

Figure 5 Relationship between test performance (loss or accuracy) and amplitude of gradient (with trainingor test loss) Each point (ldquordquo or ldquo+rdquo) in the gures corresponds to a single model at a certain epoch Weremove the rst 5 epochs and plot the rest ldquordquo is used for the method with ooding and ldquo+rdquo is used forthe method without ooding e large black ldquordquo and ldquo+rdquo show the epochs with early stopping e colorbecomes lighter (purplerarr yellow) as the training proceeds K-M C10 and C100 stand for Kuzushiji-MNISTCIFAR-10 and CIFAR-100

25

F FlatnessSettings We follow Li et al [2018] and give a one-dimensional visualization of atness for eachdataset in Figure 6 We exclude Fashion-MNIST because the optimal ooding level was zero Weused the results with trainingvalidation split ratio of 08 We compare the atness of the modelright aer the empirical risk with respect to a mini-batch becomes smaller than the ooding levelRm(g) lt b for the rst time (doed blue line) and the model aer training (solid blue line) Wealso compare them with the model trained by the baseline method without ooding and trainingis nished (solid red line)

Results According to Figure 6 the test loss becomes lower and more at during the trainingwith ooding Note that the training loss on the other hand continues to oat around the oodinglevel until the end of training aer it enters the ooding zone We expect that the model makes arandom walk and escapes regions with sharp loss landscapes during the period is may be apossible reason for beer generalization results with our proposed method

26

(a) MNIST (train) (b) MNIST (test) (c) Kuzushiji-MNIST(train)

(d) Kuzushiji-MNIST(test)

(e) CIFAR-10 (train) (f) CIFAR-10 (test) (g) CIFAR-100 (train) (h) CIFAR-100 (test)

(i) SVHN (train) (j) SVHN (test)

Figure 6 One-dimensional visualization of atness We visualize the trainingtest loss with respect toperturbation We depict the results for 3 models the model when the empirical risk with respect to trainingdata is below the ooding level for the rst time during training (doed blue) the model at the end oftraining with ooding (solid blue) and the model at the end of training without ooding (solid red)

27

  • 1 Introduction
  • 2 Backgrounds
    • 21 Regularization Methods
    • 22 Double Descent Curves with Overparametrization
    • 23 Lower-Bounding the Empirical Risk
      • 3 Flooding How to Avoid Zero Training Loss
        • 31 Preliminaries
        • 32 Algorithm
        • 33 Implementation
        • 34 Theoretical Analysis
          • 4 Experiments
            • 41 Synthetic Experiments
            • 42 Benchmark Experiments
            • 43 Memorization
              • 5 Conclusion
              • A Proof of Theorem
              • B Bayes Risk for Gaussian Distributions
              • C Details of Experiments
                • C1 Benchmark Datasets
                • C2 Chosen Flooding Levels
                  • D Learning Curves
                  • E Relationship between Performance and Gradients
                  • F Flatness
Page 12: Do We Need Zero Training Loss A›er Achieving Zero Training … · 2020. 2. 21. · directly aim to avoid zero training loss, they o›en fail to maintain a moderate level of training

(a) CIFAR-10 (000) (b) CIFAR-10 (003) (c) CIFAR-10 (007) (d) CIFAR-10 (020)

Figure 2 Learning curves of training and test loss for trainingvalidation proportion of 08 (a) shows thelearning curves without ooding (b) (c) and (d) show the learning curves with dierent ooding levels

42 Benchmark ExperimentsWe next perform experiments with benchmark datasets Not only do we compare with the baselinewithout ooding we also compare or combine with other general regularization methods whichare early stopping and weight decay

Settings We use the following six benchmark datasets MNIST Fashion-MNIST Kuzushiji-MNIST CIFAR-10 CIFAR-100 and SVHN e details of the benchmark datasets can be found inAppendix C1 We split the original training dataset into training and validation data with dierentproportions 08 or 04 (meaning 80 or 40 was used for training and the rest was used forvalidation respectively) We perform the exhaustive hyper-parameter search for the ooding levelwith candidates from 000 001 020 e number of epochs is 500 Stochastic gradientdescent [Robbins and Monro 1951] is used with learning rate of 01 and momentum of 09 ForMNIST Fashion-MNIST and Kuzushiji-MNIST we use a one-hidden-layer feedforward neuralnetwork with 500 units and ReLU activation function [Nair and Hinton 2010] For CIFAR-10CIFAR-100 and SVHN we used ResNet-18 [He et al 2016] We do not use any data augmentationor manual learning rate decay We deployed early stopping in the same way as in Section 41

We rst ran experiments with the following candidates for the weight decay rate 1times10minus5 1times10minus4 4times 10minus4 7times 10minus4 1times 10minus3 4times 10minus3 7times 10minus3 We choose the weight decay rate withthe best validation accuracy for each dataset and each trainingvalidation proportion en xingthe weight decay to the chosen one we ran experiments with ooding level candidates from0 001 020 to investigate whether weight decay and ooding have complementary eectsor if adding weight decay will diminish the accuracy gain of ooding

Results We show the results in Table 3 and the chosen ooding levels in Table 4 in Appendix C2We can observe that ooding gives beer accuracy for most cases We can also see that combiningooding with early stopping or with both early stopping and weight decay may lead to even beeraccuracy in some cases

12

(a) wo early stop (08) (b) wo early stop (04) (c) w early stop (08) (d) w early stop (04)

Figure 3 We show the optimal ooding level maintains memorization e vertical axis shows the trainingaccuracy e horizontal axis shows the ooding level We show results with and without early stoppingand for dierent trainingvalidation splits with proportion of 08 or 04 e marks (4 O ) areplaced on the ooding level that was chosen based on validation accuracy

43 MemorizationCan we maintain memorization even aer adding ooding We investigate if the trained modelhas zero training error (100 accuracy) for the ooding level that was chosen with validation dataWe show the results for all benchmark datasets and all trainingvalidation splits with proportions08 and 04 We also show the case without early stopping (choosing the last epoch) and withearly stopping (choosing the epoch with the highest validation accuracy) e results are shownin Fig 3

All gures show downward curves implying that the model will give up eventually onmemorizing all training data as the ooding level becomes higher A more interesting andimportant observation is the position of the optimal ooding level (the one chosen by validationaccuracy which is marked with 4 O or ) We can observe that the marks are oen ploedat zero error and in some cases there is a mark on the highest ooding level that maintains zeroerror ese results are consistent with recent empirical works that imply zero training error leadsto lower generalization error [Belkin et al 2019 Nakkiran et al 2020] but we further demonstratethat zero training loss may be harmful under zero training error

5 ConclusionWe proposed a novel regularization method called ooding that keeps the training loss to stayaround a small constant value to avoid zero training loss In our experiments the optimal oodinglevel oen maintained memorization of training data with zero error With ooding we showedthat the test accuracy will improve for various benchmark datasets and theoretically showed thatthe mean squared error will be reduced under certain conditions

As a byproduct we were able to produce a double descent curve for the test loss with a relativelyfew number of epochs eg in around 100 epochs shown in Fig 2 and Fig 4 in Appendix D Animportant future direction is to study the relationship between this and the double descent curves

13

from previous works [Belkin et al 2019 Krogh and Hertz 1992 Nakkiran et al 2020]It would also be interesting to see if Bayesian optimization [Shahriari et al 2016 Snoek

et al 2012] methods can be utilized to search for the optimal ooding level eciently We willinvestigate this direction in the future

AcknowledgementsWe thank Chang Xu Genki Yamamoto Kento Nozawa Nontawat Charoenphakdee Voot Tangkaraand Yoshihiro Nagano for the helpful discussions TI was supported by the Google PhD FellowshipProgram MS and IY were supported by JST CREST Grant Number JPMJCR18A2 including AIPchallenge program Japan

References

Devansh Arpit, Stanisław Jastrzębski, Nicolas Ballas, David Krueger, Emmanuel Bengio, Maxinder S. Kanwal, Tegan Maharaj, Asja Fischer, Aaron Courville, Yoshua Bengio, and Simon Lacoste-Julien. A closer look at memorization in deep networks. In ICML, 2017.
Mikhail Belkin, Daniel Hsu, and Partha P. Mitra. Overfitting or perfect fitting? Risk bounds for classification and regression rules that interpolate. In NeurIPS, 2018.
Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machine-learning practice and the classical bias-variance trade-off. PNAS, 116:15850-15854, 2019.
David Berthelot, Nicholas Carlini, Ian Goodfellow, Nicolas Papernot, Avital Oliver, and Colin A. Raffel. MixMatch: A holistic approach to semi-supervised learning. In NeurIPS, 2019.
Christopher M. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer, 2011.
Christopher Michael Bishop. Regularization and complexity control in feed-forward networks. In ICANN, 1995.
Rich Caruana, Steve Lawrence, and C. Lee Giles. Overfitting in neural nets: Backpropagation, conjugate gradient, and early stopping. In NeurIPS, 2000.
Pratik Chaudhari, Anna Choromanska, Stefano Soatto, Yann LeCun, Carlo Baldassi, Christian Borgs, Jennifer Chayes, Levent Sagun, and Riccardo Zecchina. Entropy-SGD: Biasing gradient descent into wide valleys. In ICLR, 2017.
Jesús Cid-Sueiro, Darío García-García, and Raúl Santos-Rodríguez. Consistency of losses for learning from weak labels. In ECML-PKDD, 2014.
Tarin Clanuwat, Mikel Bober-Irizar, Asanobu Kitamoto, Alex Lamb, Kazuaki Yamamoto, and David Ha. Deep learning for classical Japanese literature. In NeurIPS Workshop on Machine Learning for Creativity and Design, 2018.
Marthinus Christoffel du Plessis, Gang Niu, and Masashi Sugiyama. Analysis of learning from positive and unlabeled data. In NeurIPS, 2014.
Marthinus Christoffel du Plessis, Gang Niu, and Masashi Sugiyama. Convex formulation for learning from positive and unlabeled data. In ICML, 2015.
Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.
Hongyu Guo, Yongyi Mao, and Richong Zhang. Augmenting data with mixup for sentence classification: An empirical study. arXiv:1905.08941, 2019.
Bo Han, Gang Niu, Jiangchao Yao, Xingrui Yu, Miao Xu, Ivor Tsang, and Masashi Sugiyama. Pumpout: A meta approach to robust deep learning with noisy labels. arXiv:1809.11008, 2018.
Stephen Jose Hanson and Lorien Y. Pratt. Comparing biases for minimal network construction with back-propagation. In NeurIPS, 1988.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
Takashi Ishida, Gang Niu, Aditya Krishna Menon, and Masashi Sugiyama. Complementary-label learning for arbitrary losses and models. In ICML, 2019.
Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. In ICLR, 2017.
Diederik P. Kingma and Jimmy L. Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
Ryuichi Kiryo, Gang Niu, Marthinus Christoffel du Plessis, and Masashi Sugiyama. Positive-unlabeled learning with non-negative risk estimator. In NeurIPS, 2017.
Anders Krogh and John A. Hertz. A simple weight decay can improve generalization. In NeurIPS, 1992.
Yann Lecun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, pages 2278-2324, 1998.
Hao Li, Zheng Xu, Gavin Taylor, Christoph Studer, and Tom Goldstein. Visualizing the loss landscape of neural nets. In NeurIPS, 2018.
Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In ICLR, 2019.
Nan Lu, Tianyi Zhang, Gang Niu, and Masashi Sugiyama. Mitigating overfitting in supervised classification from two unlabeled datasets: A consistent risk correction approach. In AISTATS, 2020.
N. Morgan and H. Bourlard. Generalization and parameter estimation in feedforward nets: Some experiments. In NeurIPS, 1990.
Vinod Nair and Geoffrey E. Hinton. Rectified linear units improve restricted Boltzmann machines. In ICML, 2010.
Preetum Nakkiran, Gal Kaplun, Dimitris Kalimeris, Tristan Yang, Benjamin L. Edelman, Fred Zhang, and Boaz Barak. SGD on neural networks learns functions of increasing complexity. In NeurIPS, 2019.
Preetum Nakkiran, Gal Kaplun, Yamini Bansal, Tristan Yang, Boaz Barak, and Ilya Sutskever. Deep double descent: Where bigger models and more data hurt. In ICLR, 2020.
Nagarajan Natarajan, Inderjit S. Dhillon, Pradeep K. Ravikumar, and Ambuj Tewari. Learning with noisy labels. In NeurIPS, 2013.
Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y. Ng. Reading digits in natural images with unsupervised feature learning. In NeurIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.
Andrew Y. Ng. Preventing "overfitting" of cross-validation data. In ICML, 1997.
Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An imperative style, high-performance deep learning library. In NeurIPS, 2019.
Giorgio Patrini, Alessandro Rozza, Aditya K. Menon, Richard Nock, and Lizhen Qu. Making deep neural networks robust to label noise: A loss correction approach. In CVPR, 2017.
Herbert Robbins and Sutton Monro. A stochastic approximation method. Annals of Mathematical Statistics, 22:400-407, 1951.
Rebecca Roelofs, Vaishaal Shankar, Benjamin Recht, Sara Fridovich-Keil, Moritz Hardt, John Miller, and Ludwig Schmidt. A meta-analysis of overfitting in machine learning. In NeurIPS, 2019.
Bobak Shahriari, Kevin Swersky, Ziyu Wang, Ryan P. Adams, and Nando de Freitas. Taking the human out of the loop: A review of Bayesian optimization. Proceedings of the IEEE, 104:148-175, 2016.
Connor Shorten and Taghi M. Khoshgoftaar. A survey on image data augmentation for deep learning. Journal of Big Data, 6, 2019.
Jasper Snoek, Hugo Larochelle, and Ryan P. Adams. Practical Bayesian optimization of machine learning algorithms. In NeurIPS, 2012.
Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15:1929-1958, 2014.
Masashi Sugiyama. Introduction to Statistical Machine Learning. Morgan Kaufmann, 2015.
Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, and Jon Shlens. Rethinking the inception architecture for computer vision. In CVPR, 2016.
Sunil Thulasidasan, Gopinath Chennupati, Jeff Bilmes, Tanmoy Bhattacharya, and Sarah Michalak. On mixup training: Improved calibration and predictive uncertainty for deep neural networks. In NeurIPS, 2019.
Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58:267-288, 1996.
A. N. Tikhonov and V. Y. Arsenin. Solutions of Ill Posed Problems. Winston, 1977.
Andrey Nikolayevich Tikhonov. On the stability of inverse problems. Doklady Akademii Nauk SSSR, 39:195-198, 1943.
Antonio Torralba, Rob Fergus, and William T. Freeman. 80 million tiny images: A large data set for nonparametric object and scene recognition. IEEE Trans. PAMI, 2008.
Brendan van Rooyen and Robert C. Williamson. A theory of learning with corrupted labels. Journal of Machine Learning Research, 18:1-50, 2018.
Vikas Verma, Alex Lamb, Juho Kannala, Yoshua Bengio, and David Lopez-Paz. Interpolation consistency training for semi-supervised learning. In IJCAI, 2019.
Stefan Wager, Sida Wang, and Percy Liang. Dropout training as adaptive regularization. In NeurIPS, 2013.
Roman Werpachowski, András György, and Csaba Szepesvári. Detecting overfitting via adversarial examples. In NeurIPS, 2019.
Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.
Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. In ICLR, 2017.
Hongyi Zhang, Moustapha Cisse, Yann N. Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. In ICLR, 2018.
Tong Zhang. Statistical behavior and consistency of classification methods based on convex risk minimization. The Annals of Statistics, 32:56-85, 2004.

A Proof of Theorem

Proof. If the flooding level is b, then the proposed flooding estimator is

\tilde{R}(g) = |\hat{R}(g) - b| + b.    (12)

Since the absolute value can be expressed through the max operator, max(a, b) = (a + b + |a - b|) / 2, the proposed estimator can be re-expressed as

\tilde{R}(g) = 2 max(\hat{R}(g), b) - \hat{R}(g) = A - \hat{R}(g),    (13)

where for convenience we used A = 2 max(\hat{R}(g), b). From the definition of MSE,

MSE(\hat{R}(g)) = E[(\hat{R}(g) - R(g))^2],    (14)
MSE(\tilde{R}(g)) = E[(\tilde{R}(g) - R(g))^2]    (15)
                  = E[(A - \hat{R}(g) - R(g))^2]    (16)
                  = E[A^2] - 2 E[A(\hat{R}(g) + R(g))] + E[(\hat{R}(g) + R(g))^2].    (17)

We are interested in the sign of

MSE(\hat{R}(g)) - MSE(\tilde{R}(g)) = E[-4 \hat{R}(g) R(g) - A^2 + 2 A(\hat{R}(g) + R(g))].    (18)

Define the inside of the expectation as B = -4 \hat{R}(g) R(g) - A^2 + 2 A(\hat{R}(g) + R(g)). B can be divided into two cases, depending on the outcome of the max operator:

B = -4 \hat{R}(g) R(g) - 4 \hat{R}(g)^2 + 4 \hat{R}(g)(\hat{R}(g) + R(g))   if \hat{R}(g) ≥ b
    -4 \hat{R}(g) R(g) - 4 b^2 + 4 b(\hat{R}(g) + R(g))                     if \hat{R}(g) < b    (19)
  = 0                                                                        if \hat{R}(g) ≥ b
    -4 \hat{R}(g) R(g) - 4 b^2 + 4 b(\hat{R}(g) + R(g))                     if \hat{R}(g) < b    (20)
  = 0                                                                        if \hat{R}(g) ≥ b
    -4 (b - \hat{R}(g))(b - R(g))                                            if \hat{R}(g) < b.   (21)

The latter case becomes positive when \hat{R}(g) < b < R(g). Therefore, when \hat{R}(g) < b < R(g),

MSE(\hat{R}(g)) - MSE(\tilde{R}(g)) > 0,    (22)
MSE(\hat{R}(g)) > MSE(\tilde{R}(g)).    (23)

When b ≤ \hat{R}(g), the first case of (21) applies, and

MSE(\hat{R}(g)) - MSE(\tilde{R}(g)) = 0,    (24)
MSE(\hat{R}(g)) = MSE(\tilde{R}(g)).    (25)
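To make the case analysis above concrete, the following Monte Carlo check compares the two MSEs when the flooding level satisfies b < R(g). This is only a sketch: the Gaussian model for the empirical risk and all constants are assumptions made for illustration, not values from the paper.

    import numpy as np

    rng = np.random.default_rng(0)

    R_true = 0.10   # true risk R(g), assumed for the simulation
    b = 0.05        # flooding level, chosen so that b < R(g)
    sigma = 0.08    # assumed spread of the empirical risk around the true risk

    # Draw empirical risks R_hat and apply the flooded estimator of Eq. (12).
    R_hat = R_true + sigma * rng.standard_normal(1_000_000)
    R_tilde = np.abs(R_hat - b) + b

    mse_hat = np.mean((R_hat - R_true) ** 2)
    mse_tilde = np.mean((R_tilde - R_true) ** 2)
    print(f"MSE of R_hat:   {mse_hat:.6f}")
    print(f"MSE of R_tilde: {mse_tilde:.6f}")  # should not exceed MSE of R_hat

Setting b above R_true in this script makes the flooded estimator the worse of the two, which mirrors the role of the condition b < R(g) in the theorem.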

B Bayes Risk for Gaussian Distributions

In this section, we explain in detail how we derived the Bayes risk with respect to the surrogate loss in the experiments with Gaussian data in Section 4.1. Since we are using the logistic loss in the synthetic experiments, the loss of the margin is

ℓ(yg(x)) = log(1 + exp(-yg(x))),    (26)

where g(x): R^d → R is a scalar instead of the vector definition that was used previously, because the synthetic experiments only consider binary classification. Take the derivative with respect to g(x):

∂E[ℓ(yg(x))] / ∂g(x) = E[ -y exp(-yg(x)) / (1 + exp(-yg(x))) | x ] p(x)    (27)
= E[ -y / (1 + exp(yg(x))) | x ] p(x)    (28)
= E[ ((y + 1)/2) · (-1 / (exp(g(x)) + 1)) - ((y - 1)/2) · (1 / (exp(-g(x)) + 1)) | x ] p(x)    (29)
= E[(y + 1)/2 | x] · (-1 / (exp(g(x)) + 1)) · p(x) - E[(y - 1)/2 | x] · (1 / (exp(-g(x)) + 1)) · p(x)    (30)
= p(y = +1|x) · (-1 / (exp(g(x)) + 1)) · p(x) + p(y = -1|x) · (1 / (exp(-g(x)) + 1)) · p(x).    (31)

Setting this to zero and dividing by p(x) > 0, we obtain

p(y = -1|x) · 1 / (exp(-g(x)) + 1) = p(y = +1|x) · 1 / (exp(g(x)) + 1),    (32)
exp(g(x)) = p(y = +1|x) / p(y = -1|x),    (33)
g(x) = log [ p(y = +1|x) / p(y = -1|x) ].    (34)

Since we are interested in the surrogate loss under this classifier, we plug it into the logistic loss to obtain the Bayes risk:

E[ℓ(yg(x))] = E[ log(1 + p(-y|x) / p(y|x)) ] = E[ log(1 / p(y|x)) ] = E[ -log p(y|x) ].    (35)

In the experiments in Section 4.1, we report the empirical version of this with the test dataset as the Bayes risk.
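For concreteness, a minimal numerical sketch of this quantity is given below. The two-Gaussian setup (class means, shared identity covariance, and equal priors) is our own assumption for illustration and is not the exact configuration of Section 4.1.

    import numpy as np
    from scipy.stats import multivariate_normal

    rng = np.random.default_rng(0)

    d = 2
    mu_pos, mu_neg = np.full(d, 1.0), np.full(d, -1.0)  # assumed class means
    cov = np.eye(d)                                     # assumed shared covariance
    prior_pos = prior_neg = 0.5                         # assumed equal priors

    # Sample (x, y) from the assumed two-Gaussian model.
    n = 100_000
    y = np.where(rng.random(n) < prior_pos, 1, -1)
    x = np.where((y == 1)[:, None], mu_pos, mu_neg) + rng.standard_normal((n, d))

    # Posterior p(y | x) from Bayes' rule, then the Bayes risk E[-log p(y | x)] of Eq. (35).
    dens_pos = prior_pos * multivariate_normal.pdf(x, mean=mu_pos, cov=cov)
    dens_neg = prior_neg * multivariate_normal.pdf(x, mean=mu_neg, cov=cov)
    post_of_true_label = np.where(y == 1, dens_pos, dens_neg) / (dens_pos + dens_neg)
    print("Monte Carlo estimate of the Bayes risk:", np.mean(-np.log(post_of_true_label)))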

C Details of Experiments

C.1 Benchmark Datasets

In the experiments in Section 4.2, we use the six benchmark datasets explained below.

• MNIST [Lecun et al., 1998] is a 10-class dataset of handwritten digits: 1, 2, ..., 9, and 0. Each sample is a 28 × 28 grayscale image. The numbers of training and test samples are 60,000 and 10,000, respectively.

• Fashion-MNIST [Xiao et al., 2017] is a 10-class dataset of fashion items: T-shirt/top, Trouser, Pullover, Dress, Coat, Sandal, Shirt, Sneaker, Bag, and Ankle boot. Each sample is a 28 × 28 grayscale image. The numbers of training and test samples are 60,000 and 10,000, respectively.

• Kuzushiji-MNIST [Clanuwat et al., 2018] is a 10-class dataset of cursive Japanese ("Kuzushiji") characters. Each sample is a 28 × 28 grayscale image. The numbers of training and test samples are 60,000 and 10,000, respectively.

• CIFAR-10 is a 10-class dataset of various objects: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck. Each sample is a colored image in 32 × 32 × 3 RGB format. It is a subset of the 80 million tiny images dataset [Torralba et al., 2008]. There are 6,000 images per class, where 5,000 are for training and 1,000 are for test.

• CIFAR-100 is a 100-class dataset of various objects. Each class has 600 samples, where 500 samples are for training and 100 samples are for test. This is also a subset of the 80 million tiny images dataset [Torralba et al., 2008].

• SVHN [Netzer et al., 2011] is a 10-class dataset of house numbers from Google Street View images, in 32 × 32 × 3 RGB format. 73,257 digits are for training and 26,032 digits are for testing.

Dataset URLs: MNIST: http://yann.lecun.com/exdb/mnist/; Fashion-MNIST: https://github.com/zalandoresearch/fashion-mnist; Kuzushiji-MNIST: https://github.com/rois-codh/kmnist; CIFAR-10 and CIFAR-100: https://www.cs.toronto.edu/~kriz/cifar.html; SVHN: http://ufldl.stanford.edu/housenumbers
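For reference, the snippet below shows one way the datasets above could be fetched and split into training/validation sets with proportion 0.8 using torchvision. The plain ToTensor() preprocessing and the fixed seed are our assumptions, not the paper's exact pipeline.

    import torch
    from torch.utils.data import random_split
    from torchvision import datasets, transforms

    transform = transforms.ToTensor()  # assumed minimal preprocessing
    train_full = datasets.CIFAR10("./data", train=True, download=True, transform=transform)
    test_set = datasets.CIFAR10("./data", train=False, download=True, transform=transform)

    # 0.8 / 0.2 training/validation split of the official training set.
    n_train = int(0.8 * len(train_full))
    train_set, val_set = random_split(
        train_full,
        [n_train, len(train_full) - n_train],
        generator=torch.Generator().manual_seed(0),
    )

    # The other benchmarks are available analogously: datasets.MNIST, datasets.FashionMNIST,
    # datasets.KMNIST, datasets.CIFAR100, and datasets.SVHN (which takes split="train"/"test").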


C.2 Chosen Flooding Levels

In Table 4, we report the chosen flooding levels for our experiments with benchmark datasets.

Table 4: The chosen flooding levels for the benchmark experiments. Columns correspond to the four combinations of early stopping and weight decay; flooding is applied in all columns.

Early stopping           no      yes     no      yes
Weight decay             no      no      yes     yes
Flooding                 yes     yes     yes     yes

MNIST (0.8)              0.02    0.02    0.03    0.02
MNIST (0.4)              0.02    0.03    0.00    0.02
Fashion-MNIST (0.8)      0.00    0.00    -       -
Fashion-MNIST (0.4)      0.00    0.00    -       -
Kuzushiji-MNIST (0.8)    0.01    0.03    0.03    0.03
Kuzushiji-MNIST (0.4)    0.03    0.02    0.04    0.03
CIFAR-10 (0.8)           0.04    0.04    0.00    0.01
CIFAR-10 (0.4)           0.03    0.03    0.00    0.00
CIFAR-100 (0.8)          0.02    0.02    0.00    0.00
CIFAR-100 (0.4)          0.03    0.03    0.00    0.00
SVHN (0.8)               0.01    0.01    0.00    0.02
SVHN (0.4)               0.01    0.09    0.03    0.03

D Learning Curves

In Figure 4, we visualize learning curves for all datasets (including CIFAR-10, which we already visualized in Figure 2). We only show the learning curves for the training/validation proportion of 0.8, since the results for 0.4 were similar to those for 0.8. Note that Figure 1(c) shows the learning curves for the first 80 epochs for CIFAR-10 without flooding, and Figure 1(d) shows the learning curves with flooding when the flooding level is 0.18.


[Figure 4 panels: each row shows one dataset at four flooding levels: MNIST (0.00, 0.01, 0.02, 0.03), Fashion-MNIST (0.00, 0.02, 0.05, 0.10), KMNIST (0.00, 0.02, 0.05, 0.10), CIFAR-10 (0.00, 0.03, 0.07, 0.20), CIFAR-100 (0.00, 0.01, 0.06, 0.20), and SVHN (0.00, 0.01, 0.06, 0.20).]

Figure 4: Learning curves of training and test loss. The first figure in each row shows the learning curves without flooding. The 2nd, 3rd, and 4th columns show the results with different flooding levels. The flooding level increases towards the right-hand side.

E Relationship between Performance and Gradients

Settings: We visualize the relationship between test performance (loss or accuracy) and the gradient amplitude of the training/test loss in Figure 5, where the gradient amplitude is the ℓ2 norm of the filter-normalized gradient of the loss. The filter-normalized gradient is the gradient appropriately scaled depending on the magnitude of the corresponding convolutional filter, similarly to Li et al. [2018]. More specifically, for each filter of every convolutional layer, we multiply the corresponding elements of the gradient by the norm of the filter. Note that a fully connected layer is a special case of a convolutional layer and is subject to this scaling. We exclude Fashion-MNIST because the optimal flooding level was zero. We used the results with the training/validation split ratio of 0.8.

Results: For the figures with the gradient amplitude of the training loss on the horizontal axis, the marks for the method with flooding are often plotted to the right of the "+" marks (without flooding), which implies that flooding prevents the model from staying at a local minimum. For the figures with the gradient amplitude of the test loss on the horizontal axis, we can observe that the method with flooding improves performance while the gradient amplitude becomes smaller. On the other hand, the performance of the method without flooding ("+") degenerates while the gradient amplitude of the test loss keeps increasing.
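The sketch below shows one way to compute the gradient amplitude described in the Settings paragraph above, i.e., the ℓ2 norm of the gradient after scaling each filter's gradient by that filter's norm. It reflects our reading of the procedure rather than the authors' code, and treats each output unit of a linear layer as a filter.

    import torch
    import torch.nn as nn

    def filter_normalized_grad_norm(model: nn.Module) -> float:
        # Call after loss.backward(): scale the gradient of each filter (one output
        # channel of a Conv2d, or one output unit of a Linear layer) by the norm of
        # that filter, then return the l2 norm of all scaled entries.
        total = 0.0
        for module in model.modules():
            if isinstance(module, (nn.Conv2d, nn.Linear)) and module.weight.grad is not None:
                w = module.weight.detach().reshape(module.weight.shape[0], -1)
                g = module.weight.grad.detach().reshape(module.weight.shape[0], -1)
                scaled = g * w.norm(dim=1, keepdim=True)
                total += float((scaled ** 2).sum())
        return total ** 0.5

Calling this right after backpropagating the loss on a mini-batch would give one point on the horizontal axis of a plot like Figure 5.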


[Figure 5 panels: for each dataset (MNIST, Kuzushiji-MNIST, CIFAR-10, CIFAR-100, SVHN), four scatter plots with the gradient amplitude of the training or test loss on the horizontal axis and the test loss or test accuracy on the vertical axis.]

Figure 5: Relationship between test performance (loss or accuracy) and amplitude of gradient (with training or test loss). Each point in the figures corresponds to a single model at a certain epoch. We remove the first 5 epochs and plot the rest. One mark type is used for the method with flooding and "+" is used for the method without flooding. The large black marks show the epochs with early stopping. The color becomes lighter (purple → yellow) as the training proceeds. K-M, C10, and C100 stand for Kuzushiji-MNIST, CIFAR-10, and CIFAR-100.


F Flatness

Settings: We follow Li et al. [2018] and give a one-dimensional visualization of flatness for each dataset in Figure 6. We exclude Fashion-MNIST because the optimal flooding level was zero. We used the results with the training/validation split ratio of 0.8. We compare the flatness of the model right after the empirical risk with respect to a mini-batch becomes smaller than the flooding level, \hat{R}_m(g) < b, for the first time (dotted blue line) and the model after training (solid blue line). We also compare them with the model trained by the baseline method without flooding after training is finished (solid red line).

Results: According to Figure 6, the test loss becomes lower and flatter during training with flooding. Note that the training loss, on the other hand, continues to float around the flooding level until the end of training after it enters the flooding zone. We expect that the model makes a random walk and escapes regions with sharp loss landscapes during this period. This may be a possible reason for the better generalization results with our proposed method.
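As a sketch of how such a one-dimensional plot can be produced (our reading of the Li et al. [2018] procedure with hypothetical arguments, not the authors' exact code), the function below perturbs a trained model along a single random, filter-normalized direction and records the average loss at each perturbation size.

    import copy
    import torch

    def loss_along_direction(model, loss_fn, data_loader, alphas, device="cpu"):
        # Draw one random direction, rescale it filter-wise to match the weight norms,
        # then evaluate the average loss of the perturbed model for each alpha.
        base = copy.deepcopy(model).to(device)
        direction = {}
        for name, p in base.named_parameters():
            d = torch.randn_like(p)
            if p.dim() > 1:
                d_flat = d.view(d.shape[0], -1)
                p_flat = p.detach().view(p.shape[0], -1)
                d_flat.mul_(p_flat.norm(dim=1, keepdim=True) / (d_flat.norm(dim=1, keepdim=True) + 1e-10))
            direction[name] = d
        losses = []
        for alpha in alphas:
            perturbed = copy.deepcopy(base)
            with torch.no_grad():
                for name, p in perturbed.named_parameters():
                    p.add_(alpha * direction[name])
            perturbed.eval()
            total, count = 0.0, 0
            with torch.no_grad():
                for x, y in data_loader:
                    x, y = x.to(device), y.to(device)
                    total += loss_fn(perturbed(x), y).item() * len(y)
                    count += len(y)
            losses.append(total / count)
        return losses

Evaluating this with the training and test loaders for the three models described in the caption of Figure 6, and plotting the returned losses against the perturbation sizes, would give curves of the kind shown in the figure.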


[Figure 6 panels: training and test loss under perturbation for MNIST, Kuzushiji-MNIST, CIFAR-10, CIFAR-100, and SVHN.]

Figure 6: One-dimensional visualization of flatness. We visualize the training/test loss with respect to perturbation. We depict the results for 3 models: the model when the empirical risk with respect to the training data is below the flooding level for the first time during training (dotted blue), the model at the end of training with flooding (solid blue), and the model at the end of training without flooding (solid red).


  • 1 Introduction
  • 2 Backgrounds
    • 2.1 Regularization Methods
    • 2.2 Double Descent Curves with Overparametrization
    • 2.3 Lower-Bounding the Empirical Risk
  • 3 Flooding: How to Avoid Zero Training Loss
    • 3.1 Preliminaries
    • 3.2 Algorithm
    • 3.3 Implementation
    • 3.4 Theoretical Analysis
  • 4 Experiments
    • 4.1 Synthetic Experiments
    • 4.2 Benchmark Experiments
    • 4.3 Memorization
  • 5 Conclusion
  • A Proof of Theorem
  • B Bayes Risk for Gaussian Distributions
  • C Details of Experiments
    • C.1 Benchmark Datasets
    • C.2 Chosen Flooding Levels
  • D Learning Curves
  • E Relationship between Performance and Gradients
  • F Flatness
Page 13: Do We Need Zero Training Loss A›er Achieving Zero Training … · 2020. 2. 21. · directly aim to avoid zero training loss, they o›en fail to maintain a moderate level of training

(a) wo early stop (08) (b) wo early stop (04) (c) w early stop (08) (d) w early stop (04)

Figure 3 We show the optimal ooding level maintains memorization e vertical axis shows the trainingaccuracy e horizontal axis shows the ooding level We show results with and without early stoppingand for dierent trainingvalidation splits with proportion of 08 or 04 e marks (4 O ) areplaced on the ooding level that was chosen based on validation accuracy

43 MemorizationCan we maintain memorization even aer adding ooding We investigate if the trained modelhas zero training error (100 accuracy) for the ooding level that was chosen with validation dataWe show the results for all benchmark datasets and all trainingvalidation splits with proportions08 and 04 We also show the case without early stopping (choosing the last epoch) and withearly stopping (choosing the epoch with the highest validation accuracy) e results are shownin Fig 3

All gures show downward curves implying that the model will give up eventually onmemorizing all training data as the ooding level becomes higher A more interesting andimportant observation is the position of the optimal ooding level (the one chosen by validationaccuracy which is marked with 4 O or ) We can observe that the marks are oen ploedat zero error and in some cases there is a mark on the highest ooding level that maintains zeroerror ese results are consistent with recent empirical works that imply zero training error leadsto lower generalization error [Belkin et al 2019 Nakkiran et al 2020] but we further demonstratethat zero training loss may be harmful under zero training error

5 ConclusionWe proposed a novel regularization method called ooding that keeps the training loss to stayaround a small constant value to avoid zero training loss In our experiments the optimal oodinglevel oen maintained memorization of training data with zero error With ooding we showedthat the test accuracy will improve for various benchmark datasets and theoretically showed thatthe mean squared error will be reduced under certain conditions

As a byproduct we were able to produce a double descent curve for the test loss with a relativelyfew number of epochs eg in around 100 epochs shown in Fig 2 and Fig 4 in Appendix D Animportant future direction is to study the relationship between this and the double descent curves

13

from previous works [Belkin et al 2019 Krogh and Hertz 1992 Nakkiran et al 2020]It would also be interesting to see if Bayesian optimization [Shahriari et al 2016 Snoek

et al 2012] methods can be utilized to search for the optimal ooding level eciently We willinvestigate this direction in the future

AcknowledgementsWe thank Chang Xu Genki Yamamoto Kento Nozawa Nontawat Charoenphakdee Voot Tangkaraand Yoshihiro Nagano for the helpful discussions TI was supported by the Google PhD FellowshipProgram MS and IY were supported by JST CREST Grant Number JPMJCR18A2 including AIPchallenge program Japan

ReferencesDevansh Arpit Stanisaw Jastrzebski Nicolas Ballas David Krueger Emmanuel Bengio Maxin-

der S Kanwal Tegan Maharaj Asja Fischer Aaron Courville Yoshua Bengio and SimonLacoste-Julien A closer look at memorization in deep networks In ICML 2017

Mikhail Belkin Daniel Hsu and Partha P Mitra Overing or perfect ing Risk bounds forclassication and regression rules that interpolate In NeurIPS 2018

Mikhail Belkin Daniel Hsu Siyuan Ma and Soumik Mandal Reconciling modern machine-learningpractice and the classical biasvariance trade-o PNAS 11615850ndash15854 2019

David Berthelot Nicholas Carlini Ian Goodfellow Nicolas Papernot Avital Oliver and Colin ARael MixMatch A holistic approach to semi-supervised learning In NeurIPS 2019

Christopher M Bishop Paern Recognition and Machine Learning (Information Science andStatistics) Springer 2011

Christopher Michael Bishop Regularization and complexity control in feed-forward networks InICANN 1995

Rich Caruana Steve Lawrence and C Lee Giles Overing in neural nets Backpropagationconjugate gradient and early stopping In NeurIPS 2000

Pratik Chaudhari Anna Choromanska Stefano Soao Yann LeCun Carlo Baldassi ChristianBorgs Jennifer Chayes Levent Sagun and Riccardo Zecchina Entropy-SGD Biasing gradientdescent into wide valleys In ICLR 2017

Jesus Cid-Sueiro Darıo Garcıa-Garcıa and Raul Santos-Rodrıguez Consistency of losses forlearning from weak labels In ECML-PKDD 2014

14

Tarin Clanuwat Mikel Bober-Irizar Asanobu Kitamoto Alex Lamb Kazuaki Yamamoto and DavidHa Deep learning for classical Japanese literature In NeurIPS Workshop on Machine Learningfor Creativity and Design 2018

Marthinus Christoel du Plessis Gang Niu and Masashi Sugiyama Analysis of learning frompositive and unlabeled data In NeurIPS 2014

Marthinus Christoel du Plessis Gang Niu and Masashi Sugiyama Convex formulation forlearning from positive and unlabeled data In ICML 2015

Ian Goodfellow Yoshua Bengio and Aaron Courville Deep Learning MIT Press 2016

Hongyu Guo Yongyi Mao and Richong Zhang Augmenting data with mixup for sentenceclassication An empirical study In arXiv190508941 2019

Bo Han Gang Niu Jiangchao Yao Xingrui Yu Miao Xu Ivor Tsang and Masashi SugiyamaPumpout A meta approach to robust deep learning with noisy labels In arXiv180911008 2018

Stephen Jose Hanson and Lorien Y Pra Comparing biases for minimal network constructionwith back-propagation In NeurIPS 1988

Kaiming He Xiangyu Zhang Shaoqing Ren and Jian Sun Deep residual learning for imagerecognition In CVPR 2016

Sergey Ioe and Christian Szegedy Batch normalization Accelerating deep network training byreducing internal covariate shi In ICML 2015

Takashi Ishida Gang Niu Aditya Krishna Menon and Masashi Sugiyama Complementary-labellearning for arbitrary losses and models In ICML 2019

Nitish Shirish Keskar Dheevatsa Mudigere Jorge Nocedal Mikhail Smelyanskiy and Ping Tak PeterTang On large-batch training for deep learning Generalization gap and sharp minima In ICLR2017

Diederik P Kingma and Jimmy L Ba Adam A method for stochastic optimization In ICLR 2015

Ryuichi Kiryo Gang Niu Marthinus Christoel du Plessis and Masashi Sugiyama Positive-unlabeled learning with non-negative risk estimator In NeurIPS 2017

Anders Krogh and John A Hertz A simple weight decay can improve generalization In NeurIPS1992

Yann Lecun Lon Boou Yoshua Bengio and Patrick Haner Gradient-based learning applied todocument recognition In Proceedings of the IEEE pages 2278ndash2324 1998

Hao Li Zheng Xu Gavin Taylor Christoph Studer and Tom Goldstein Visualizing the losslandscape of neural nets In NeurIPS 2018

15

Ilya Loshchilov and Frank Huer Decoupled weight decay regularization In ICLR 2019

Nan Lu Tianyi Zhang Gang Niu and Masashi Sugiyama Mitigating overing in supervisedclassication from two unlabeled datasets A consistent risk correction approach In AISTATS2020

N Morgan and H Bourlard Generalization and parameter estimation in feedforward nets Someexperiments In NeurIPS 1990

Vinod Nair and Georey E Hinton Rectied linear units improve restricted boltzmann machinesIn ICML 2010

Preetum Nakkiran Gal Kaplun Dimitris Kalimeris Tristan Yang Benjamin L Edelman FredZhang and Boaz Barak SGD on neural networks learns functions of increasing complexity InNeurIPS 2019

Preetum Nakkiran Gal Kaplun Yamini Bansal Tristan Yang Boaz Barak and Ilya Sutskever Deepdouble descent Where bigger models and more data hurt In ICLR 2020

Nagarajan Natarajan Inderjit S Dhillon Pradeep K Ravikumar and Ambuj Tewari Learningwith noisy labels In NeurIPS 2013

Yuval Netzer Tao Wang Adam Coates Alessandro Bissacco Bo Wu and Andrew Y Ng Readingdigits in natural images with unsupervised feature learning In NeurIPS Workshop on DeepLearning and Unsupervised Feature Learning 2011

Andrew Y Ng Preventing ldquooveringrdquo of cross-validation data In ICML 1997

Adam Paszke Sam Gross Francisco Massa Adam Lerer James Bradbury Gregory Chanan TrevorKilleen Zeming Lin Natalia Gimelshein Luca Antiga Alban Desmaison Andreas Kopf EdwardYang Zachary DeVito Martin Raison Alykhan Tejani Sasank Chilamkurthy Benoit SteinerLu Fang Junjie Bai and Soumith Chintala PyTorch An imperative style high-performancedeep learning library In NeurIPS 2019

Giorgio Patrini Alessandro Rozza Aditya K Menon Richard Nock and Lizhen Making deepneural networks robust to label noise A loss correction approach In CVPR 2017

Herbert Robbins and Suon Monro A stochastic approximation method Annals of MathematicalStatistics 22400ndash407 1951

Rebecca Roelofs Vaishaal Shankar Benjamin Recht Sara Fridovich-Keil Moritz Hardt John Millerand Ludwig Schmidt A meta-analysis of overing in machine learning In NeurIPS 2019

Bobak Shahriari Kevin Swersky Ziyu Wang Ryan P Adams and Nando de Freitas Taking thehuman out of the loop A review of bayesian optimization Proceedings of the IEEE 104148ndash1752016

16

Connor Shorten and Taghi M Khoshgoaar A survey on image data augmentation for deeplearning Journal of Big Data 6 2019

Jasper Snoek Hugo Larochelle and Ryan P Adams Practical bayesian optimization of machinelearning algorithms In NeurIPS 2012

Nitish Srivastava Georey Hinton Alex Krizhevsky Ilya Sutskever and Ruslan SalakhutdinovDropout A simple way to prevent neural networks from overing Journal of Machine LearningResearch 151929ndash1958 2014

Masashi Sugiyama Introduction to statistical machine learning Morgan Kaufmann 2015

Christian Szegedy Vincent Vanhoucke Sergey Ioe and Jon Shlens Rethinking the inceptionarchitecture for computer vision In CVPR 2016

Sunil ulasidasan Gopinath Chennupati Je Bilmes Tanmoy Bhaacharya and Sarah MichalakOn mixup training Improved calibration and predictive uncertainty for deep neural networksIn NeurIPS 2019

Robert Tibshirani Regression shrinkage and selection via the lasso Journal of the Royal StatisticalSociety Series B (Methodological) 58267ndash288 1996

A N Tikhonov and V Y Arsenin Solutions of Ill Posed Problems Winston 1977

Andrey Nikolayevich Tikhonov On the stability of inverse problems Doklady Akademii NaukSSSR 39195ndash198 1943

Antonio Torralba Rob Fergus and William T Freeman 80 million tiny images A large data setfor nonparametric object and scene recognition In IEEE Trans PAMI 2008

Brendan van Rooyen and Robert C Williamson A theory of learning with corrupted labelsJournal of Machine Learning Research 181ndash50 2018

Vikas Verma Alex Lamb Juho Kannala Yoshua Bengio and David Lopez-Paz Interpolationconsistency training for semi-supervised learning In IJCAI 2019

Stefan Wager Sida Wang and Percy Liang Dropout training as adaptive regularization In NeurIPS2013

Roman Werpachowski Andrs Gyrgy and Csaba Szepesvri Detecting overing via adversarialexamples In NeurIPS 2019

Han Xiao Kashif Rasul and Roland Vollgraf Fashion-MNIST a novel image dataset for bench-marking machine learning algorithms arXiv preprint arXiv170807747 2017

Chiyuan Zhang Samy Bengio Moritz Hardt Benjamin Recht and Oriol Vinyals Understandingdeep learning requires rethinking generalization In ICLR 2017

17

Hongyi Zhang Moustapha Cisse Yann N Dauphin and David Lopez-Paz mixup Beyondempirical risk minimization In ICLR 2018

Tong Zhang Statistical behavior and consistency of classication methods based on convex riskminimization e Annals of Statistics 3256ndash85 2004

18

A Proof of eoremProof If the ooding level is b then the proposed ooding estimator is

R(g) = |R(g)minus b|+ b (12)

Since the absolute operator can be expressed with a max operator with max(a b) = a+b+|aminusb|2

theproposed estimator can be re-expressed as

R(g) = 2max(R(g) b)minus R(g) = Aminus R(g) (13)

For convenience we used A = 2max(R(g) b) From the denition of MSE

MSE(R(g)) = E[(R(g)minusR(g))2] (14)MSE(R(g)) = E[(R(g)minusR(g))2] (15)

= E[(Aminus R(g)minusR(g))2)] (16)= E[A2]minus 2E[A(R(g) +R(g))] + E[(R(g) +R(g))2] (17)

We are interested in the sign of

MSE(R(g))minusMSE(R(g)) = E[minus4R(g)R(g)minus A2 + 2A(R(g) +R(g))] (18)

Dene the inside of the expectation as B = minus4R(g)R(g)minus A2 + 2A(R(g) + R(g)) B can bedivided into two cases depending on the outcome of the max operator

B =

minus4R(g)R(g)minus 4R(g)2 + 4R(g)(R(g) +R(g)) if R(g) ge b

minus4R(g)R(g)minus 4b2 + 4b(R(g) +R(g)) if R(g) lt b(19)

=

0 if R(g) ge b

minus4R(g)R(g)minus 4b2 + 4b(R(g) +R(g)) if R(g) lt b(20)

=

0 if R(g) ge b

minus4(bminus R(g))(bminusR(g)) if R(g) lt b (21)

e laer case becomes positive when R(g) lt b lt R(g) erefore when R(g) lt b lt R(g)

MSE(R(g))minusMSE(R(g)) gt 0 (22)MSE(R(g)) gt MSE(R(g)) (23)

When b le R(g)

MSE(R(g))minusMSE(R(g)) gt 0 (24)MSE(R(g)) = MSE(R(g)) (25)

19

B Bayes Risk for Gaussian DistributionsIn this section we explain in detail how we derived the Bayes risk with respect to the surrogateloss in the experiments with Gaussian data in Section 41 Since we are using the logistic loss inthe synthetic experiments the loss of the margin is

`(yg(x)) = log(1 + exp(minusyg(x))) (26)

where g(x) Rd rarr R is a scalar instead of the vector denition that was used previously becausethe synthetic experiments only consider binary classication Take the derivative to derive

partE[`(yg(x))]partg(x)

=E[log(1 + exp(minusyg(x)))]

partg(x)= E

[minusy exp(minusyg(x))1 + exp(minusyg(x))

∣∣∣∣∣x]p(x) (27)

= E

[minusy

1 + exp(yg(x))

∣∣∣∣∣x]p(x) (28)

= E

[y + 1

2

1

exp(minusg(x)) + 1+y minus 1

2

minus1exp(g(x)) + 1

∣∣∣∣∣x]p(x) (29)

= E

[y + 1

2

∣∣∣∣∣x]

1

exp(minusg(x)) + 1p(x) + E

[y minus 1

2

∣∣∣∣∣x]

minus1exp(g(x)) + 1

p(x) (30)

= p(y = +1|x) 1

exp(minusg(x)) + 1p(x) + p(y = minus1|x) minus1

exp(g(x)) + 1p(x) (31)

Set this to zero divide by p(x) gt 0 to obtain

p(y = minus1|x) 1

exp(minusg(x)) + 1= p(y = +1|x) 1

exp(g(x)) + 1(32)

exp(g(x)) =p(y = +1|x)p(y = minus1|x)

(33)

g(x) = logp(y = +1|x)p(y = minus1|x)

(34)

Since we are interested in the surrogate loss under this classier we plug this into the logisticloss to obtain the Bayes risk

E[`(yg(x))] = E[log(1 +

p(minusy|x)p(y|x)

)

]= E

[log(

1

p(y|x))

]= E[minus log p(y|x)] (35)

In the experiments in Section 41 we report the empirical version of this with the test dataset asthe Bayes risk

20

C Details of Experiments

C1 Benchmark DatasetsIn the experiments in Section 42 we use six benchmark datasets explained below

bull MNIST6 [Lecun et al 1998] is a 10 class dataset of handwrien digits 1 2 9 and 0 Eachsample is a 28times 28 grayscale image e number of training and test samples are 60000and 10000 respectively

bull Fashion-MNIST7 [Xiao et al 2017] is a 10 class dataset of fashion items T-shirttop TrouserPullover Dress Coat Sandal Shirt Sneaker Bag and Ankle boot Each sample is a 28times 28grayscale image e number of training and test samples are 60000 and 10000 respectively

bull Kuzushiji-MNIST8 [Clanuwat et al 2018] is a 10 class dataset of cursive Japanese (ldquoKuzushijirdquo)characters Each sample is a 28 times 28 grayscale image e number of training and testsamples are 60000 and 10000 respectively

bull CIFAR-109 is a 10 class dataset of various objects airplane automobile bird cat deer dogfrog horse ship and truck Each sample is a colored image in 32times 32times 3 RGB format It isa subset of the 80 million tiny images dataset [Torralba et al 2008] ere are 6000 imagesper class where 5000 are for training and 1000 are for test

bull CIFAR-10010 is a 100 class dataset of various objects Each class has 600 samples where 500samples are for training and 100 samples are for test is is also a subset of the 80 milliontiny images dataset [Torralba et al 2008]

bull SVHN11 [Netzer et al 2011] is a 10 class dataset of house numbers from Google Street Viewimages in 32times 32times 3 RGB format 73257 digits are for training and 26032 digits are fortesting

6httpyannlecuncomexdbmnist7httpsgithubcomzalandoresearchfashion-mnist8httpsgithubcomrois-codhkmnist9httpswwwcstorontoedusimkrizcifarhtml

10httpswwwcstorontoedusimkrizcifarhtml11httpufldlstanfordeduhousenumbers

21

C2 Chosen Flooding LevelsIn Table 4 we report the chosen ooding levels for our experiments with benchmark datasets

Table 4 e chosen ooding levels for benchmark experiments

Early stopping times X times XWeight decay times times X XFlooding X X X X

MNIST (08) 002 002 003 002MNIST (04) 002 003 000 002Fashion-MNIST (08) 000 000 mdash mdashFashion-MNIST (04) 000 000 mdash mdashKuzushiji-MNIST (08) 001 003 003 003Kuzushiji-MNIST (04) 003 002 004 003CIFAR-10 (08) 004 004 000 001CIFAR-10 (04) 003 003 000 000CIFAR-100 (08) 002 002 000 000CIFAR-100 (04) 003 003 000 000SVHN (08) 001 001 000 002SVHN (04) 001 009 003 003

D Learning CurvesIn Figure 4 we visualize learning curves for all datasets (including CIFAR-10 which we alreadyvisualized in Figure 2) We only show the learning curves for trainingvalidation proportion of 08since the results for 04 were similar with 08 Note that Figure 1(c) shows the learning curves forthe rst 80 epochs for CIFAR-10 without ooding Figure 1(d) shows the learning curves withooding when the ooding level is 018

22

(a) MNIST (000) (b) MNIST (001) (c) MNIST (002) (d) MNIST (003)

(e) Fashion-MNIST(000)

(f) Fashion-MNIST(002)

(g) Fashion-MNIST(005)

(h) Fashion-MNIST(010)

(i) KMNIST (000) (j) KMNIST (002) (k) KMNIST (005) (l) KMNIST (010)

(m) CIFAR-10 (000) (n) CIFAR-10 (003) (o) CIFAR-10 (007) (p) CIFAR-10 (020)

(q) CIFAR-100 (000) (r) CIFAR-100 (001) (s) CIFAR-100 (006) (t) CIFAR-100 (020)

(u) SVHN (000) (v) SVHN (001) (w) SVHN (006) (x) SVHN (020)

Figure 4 Learning curves of training and test loss e rst gure in each row is the learning curveswithout ooding e 2nd 3rd and 4th columns show the results with dierent ooding levels e oodinglevel increases towards the right-hand side 23

E Relationship between Performance and GradientsSettings We visualize the relationship between test performance (loss or accuracy) and gradientamplitude of the trainingtest loss in Figure 5 where the gradient amplitude is the `2 norm of thelter-nomalized gradient of the loss e lter-normalized gradient is the gradient appropriatelyscaled depending on the magnitude of the corresponding convolutional lter similarly to Liet al [2018] More specicically for each lter of every comvolutional layer we multiply thecorresponding elements of the gradient by the norm of the lter Note that a fully connected layeris a special case of convolutional layer and subject to this scaling We exclude Fashion-MNISTbecause the optimal ooding level was zero We used the results with trainingvalidation splitratio of 08

Results For the gures with gradient amplitude of training loss on the horizontal axis ldquordquomarks (w ooding) are oen ploed on the right of ldquo+rdquo marks (wo ooding) which impliesthat ooding prevents the model from staying a local minimum For the gures with gradientamplitude of test loss on the horizontal axis we can observe the method with ooding (ldquordquo)improves performance while the gradient amplitude becomes smaller On the other hand theperformance with the method without ooding (ldquo+rdquo) degenerates while the gradient amplitude oftest loss keeps increasing

24

(a) MNIST xtrainyloss

(b) MNIST xtrainyacc

(c) MNIST xtestyloss

(d) MNIST xtest yacc

(e) K-M xtrain yloss (f) K-M xtrain yacc (g) K-M xtest yloss (h) K-M xtest yacc

(i) C10 xtrain yloss (j) C10 xtrain yacc (k) C10 xtest yloss (l) C10 xtest yacc

(m) C100 xtrainyloss

(n) C100 xtrain yacc (o) C100 xtest yloss (p) C100 xtest yacc

(q) SVHN xtrainyloss

(r) SVHN xtrainyacc

(s) SVHN xtest yloss (t) SVHN xtest yacc

Figure 5 Relationship between test performance (loss or accuracy) and amplitude of gradient (with trainingor test loss) Each point (ldquordquo or ldquo+rdquo) in the gures corresponds to a single model at a certain epoch Weremove the rst 5 epochs and plot the rest ldquordquo is used for the method with ooding and ldquo+rdquo is used forthe method without ooding e large black ldquordquo and ldquo+rdquo show the epochs with early stopping e colorbecomes lighter (purplerarr yellow) as the training proceeds K-M C10 and C100 stand for Kuzushiji-MNISTCIFAR-10 and CIFAR-100

25

F FlatnessSettings We follow Li et al [2018] and give a one-dimensional visualization of atness for eachdataset in Figure 6 We exclude Fashion-MNIST because the optimal ooding level was zero Weused the results with trainingvalidation split ratio of 08 We compare the atness of the modelright aer the empirical risk with respect to a mini-batch becomes smaller than the ooding levelRm(g) lt b for the rst time (doed blue line) and the model aer training (solid blue line) Wealso compare them with the model trained by the baseline method without ooding and trainingis nished (solid red line)

Results According to Figure 6 the test loss becomes lower and more at during the trainingwith ooding Note that the training loss on the other hand continues to oat around the oodinglevel until the end of training aer it enters the ooding zone We expect that the model makes arandom walk and escapes regions with sharp loss landscapes during the period is may be apossible reason for beer generalization results with our proposed method

26

(a) MNIST (train) (b) MNIST (test) (c) Kuzushiji-MNIST(train)

(d) Kuzushiji-MNIST(test)

(e) CIFAR-10 (train) (f) CIFAR-10 (test) (g) CIFAR-100 (train) (h) CIFAR-100 (test)

(i) SVHN (train) (j) SVHN (test)

Figure 6 One-dimensional visualization of atness We visualize the trainingtest loss with respect toperturbation We depict the results for 3 models the model when the empirical risk with respect to trainingdata is below the ooding level for the rst time during training (doed blue) the model at the end oftraining with ooding (solid blue) and the model at the end of training without ooding (solid red)

27

  • 1 Introduction
  • 2 Backgrounds
    • 21 Regularization Methods
    • 22 Double Descent Curves with Overparametrization
    • 23 Lower-Bounding the Empirical Risk
      • 3 Flooding How to Avoid Zero Training Loss
        • 31 Preliminaries
        • 32 Algorithm
        • 33 Implementation
        • 34 Theoretical Analysis
          • 4 Experiments
            • 41 Synthetic Experiments
            • 42 Benchmark Experiments
            • 43 Memorization
              • 5 Conclusion
              • A Proof of Theorem
              • B Bayes Risk for Gaussian Distributions
              • C Details of Experiments
                • C1 Benchmark Datasets
                • C2 Chosen Flooding Levels
                  • D Learning Curves
                  • E Relationship between Performance and Gradients
                  • F Flatness
Page 14: Do We Need Zero Training Loss A›er Achieving Zero Training … · 2020. 2. 21. · directly aim to avoid zero training loss, they o›en fail to maintain a moderate level of training

from previous works [Belkin et al 2019 Krogh and Hertz 1992 Nakkiran et al 2020]It would also be interesting to see if Bayesian optimization [Shahriari et al 2016 Snoek

et al 2012] methods can be utilized to search for the optimal ooding level eciently We willinvestigate this direction in the future

AcknowledgementsWe thank Chang Xu Genki Yamamoto Kento Nozawa Nontawat Charoenphakdee Voot Tangkaraand Yoshihiro Nagano for the helpful discussions TI was supported by the Google PhD FellowshipProgram MS and IY were supported by JST CREST Grant Number JPMJCR18A2 including AIPchallenge program Japan

ReferencesDevansh Arpit Stanisaw Jastrzebski Nicolas Ballas David Krueger Emmanuel Bengio Maxin-

der S Kanwal Tegan Maharaj Asja Fischer Aaron Courville Yoshua Bengio and SimonLacoste-Julien A closer look at memorization in deep networks In ICML 2017

Mikhail Belkin Daniel Hsu and Partha P Mitra Overing or perfect ing Risk bounds forclassication and regression rules that interpolate In NeurIPS 2018

Mikhail Belkin Daniel Hsu Siyuan Ma and Soumik Mandal Reconciling modern machine-learningpractice and the classical biasvariance trade-o PNAS 11615850ndash15854 2019

David Berthelot Nicholas Carlini Ian Goodfellow Nicolas Papernot Avital Oliver and Colin ARael MixMatch A holistic approach to semi-supervised learning In NeurIPS 2019

Christopher M Bishop Paern Recognition and Machine Learning (Information Science andStatistics) Springer 2011

Christopher Michael Bishop Regularization and complexity control in feed-forward networks InICANN 1995

Rich Caruana Steve Lawrence and C Lee Giles Overing in neural nets Backpropagationconjugate gradient and early stopping In NeurIPS 2000

Pratik Chaudhari Anna Choromanska Stefano Soao Yann LeCun Carlo Baldassi ChristianBorgs Jennifer Chayes Levent Sagun and Riccardo Zecchina Entropy-SGD Biasing gradientdescent into wide valleys In ICLR 2017

Jesus Cid-Sueiro Darıo Garcıa-Garcıa and Raul Santos-Rodrıguez Consistency of losses forlearning from weak labels In ECML-PKDD 2014

14

Tarin Clanuwat Mikel Bober-Irizar Asanobu Kitamoto Alex Lamb Kazuaki Yamamoto and DavidHa Deep learning for classical Japanese literature In NeurIPS Workshop on Machine Learningfor Creativity and Design 2018

Marthinus Christoel du Plessis Gang Niu and Masashi Sugiyama Analysis of learning frompositive and unlabeled data In NeurIPS 2014

Marthinus Christoel du Plessis Gang Niu and Masashi Sugiyama Convex formulation forlearning from positive and unlabeled data In ICML 2015

Ian Goodfellow Yoshua Bengio and Aaron Courville Deep Learning MIT Press 2016

Hongyu Guo Yongyi Mao and Richong Zhang Augmenting data with mixup for sentenceclassication An empirical study In arXiv190508941 2019

Bo Han Gang Niu Jiangchao Yao Xingrui Yu Miao Xu Ivor Tsang and Masashi SugiyamaPumpout A meta approach to robust deep learning with noisy labels In arXiv180911008 2018

Stephen Jose Hanson and Lorien Y Pra Comparing biases for minimal network constructionwith back-propagation In NeurIPS 1988

Kaiming He Xiangyu Zhang Shaoqing Ren and Jian Sun Deep residual learning for imagerecognition In CVPR 2016

Sergey Ioe and Christian Szegedy Batch normalization Accelerating deep network training byreducing internal covariate shi In ICML 2015

Takashi Ishida Gang Niu Aditya Krishna Menon and Masashi Sugiyama Complementary-labellearning for arbitrary losses and models In ICML 2019

Nitish Shirish Keskar Dheevatsa Mudigere Jorge Nocedal Mikhail Smelyanskiy and Ping Tak PeterTang On large-batch training for deep learning Generalization gap and sharp minima In ICLR2017

Diederik P Kingma and Jimmy L Ba Adam A method for stochastic optimization In ICLR 2015

Ryuichi Kiryo Gang Niu Marthinus Christoel du Plessis and Masashi Sugiyama Positive-unlabeled learning with non-negative risk estimator In NeurIPS 2017

Anders Krogh and John A Hertz A simple weight decay can improve generalization In NeurIPS1992

Yann Lecun Lon Boou Yoshua Bengio and Patrick Haner Gradient-based learning applied todocument recognition In Proceedings of the IEEE pages 2278ndash2324 1998

Hao Li Zheng Xu Gavin Taylor Christoph Studer and Tom Goldstein Visualizing the losslandscape of neural nets In NeurIPS 2018

15

Ilya Loshchilov and Frank Huer Decoupled weight decay regularization In ICLR 2019

Nan Lu Tianyi Zhang Gang Niu and Masashi Sugiyama Mitigating overing in supervisedclassication from two unlabeled datasets A consistent risk correction approach In AISTATS2020

N Morgan and H Bourlard Generalization and parameter estimation in feedforward nets Someexperiments In NeurIPS 1990

Vinod Nair and Georey E Hinton Rectied linear units improve restricted boltzmann machinesIn ICML 2010

Preetum Nakkiran Gal Kaplun Dimitris Kalimeris Tristan Yang Benjamin L Edelman FredZhang and Boaz Barak SGD on neural networks learns functions of increasing complexity InNeurIPS 2019

Preetum Nakkiran Gal Kaplun Yamini Bansal Tristan Yang Boaz Barak and Ilya Sutskever Deepdouble descent Where bigger models and more data hurt In ICLR 2020

Nagarajan Natarajan Inderjit S Dhillon Pradeep K Ravikumar and Ambuj Tewari Learningwith noisy labels In NeurIPS 2013

Yuval Netzer Tao Wang Adam Coates Alessandro Bissacco Bo Wu and Andrew Y Ng Readingdigits in natural images with unsupervised feature learning In NeurIPS Workshop on DeepLearning and Unsupervised Feature Learning 2011

Andrew Y Ng Preventing ldquooveringrdquo of cross-validation data In ICML 1997

Adam Paszke Sam Gross Francisco Massa Adam Lerer James Bradbury Gregory Chanan TrevorKilleen Zeming Lin Natalia Gimelshein Luca Antiga Alban Desmaison Andreas Kopf EdwardYang Zachary DeVito Martin Raison Alykhan Tejani Sasank Chilamkurthy Benoit SteinerLu Fang Junjie Bai and Soumith Chintala PyTorch An imperative style high-performancedeep learning library In NeurIPS 2019

Giorgio Patrini Alessandro Rozza Aditya K Menon Richard Nock and Lizhen Making deepneural networks robust to label noise A loss correction approach In CVPR 2017

Herbert Robbins and Suon Monro A stochastic approximation method Annals of MathematicalStatistics 22400ndash407 1951

Rebecca Roelofs Vaishaal Shankar Benjamin Recht Sara Fridovich-Keil Moritz Hardt John Millerand Ludwig Schmidt A meta-analysis of overing in machine learning In NeurIPS 2019

Bobak Shahriari Kevin Swersky Ziyu Wang Ryan P Adams and Nando de Freitas Taking thehuman out of the loop A review of bayesian optimization Proceedings of the IEEE 104148ndash1752016

16

Connor Shorten and Taghi M Khoshgoaar A survey on image data augmentation for deeplearning Journal of Big Data 6 2019

Jasper Snoek Hugo Larochelle and Ryan P Adams Practical bayesian optimization of machinelearning algorithms In NeurIPS 2012

Nitish Srivastava Georey Hinton Alex Krizhevsky Ilya Sutskever and Ruslan SalakhutdinovDropout A simple way to prevent neural networks from overing Journal of Machine LearningResearch 151929ndash1958 2014

Masashi Sugiyama Introduction to statistical machine learning Morgan Kaufmann 2015

Christian Szegedy Vincent Vanhoucke Sergey Ioe and Jon Shlens Rethinking the inceptionarchitecture for computer vision In CVPR 2016

Sunil ulasidasan Gopinath Chennupati Je Bilmes Tanmoy Bhaacharya and Sarah MichalakOn mixup training Improved calibration and predictive uncertainty for deep neural networksIn NeurIPS 2019

Robert Tibshirani Regression shrinkage and selection via the lasso Journal of the Royal StatisticalSociety Series B (Methodological) 58267ndash288 1996

A N Tikhonov and V Y Arsenin Solutions of Ill Posed Problems Winston 1977

Andrey Nikolayevich Tikhonov On the stability of inverse problems Doklady Akademii NaukSSSR 39195ndash198 1943

Antonio Torralba Rob Fergus and William T Freeman 80 million tiny images A large data setfor nonparametric object and scene recognition In IEEE Trans PAMI 2008

Brendan van Rooyen and Robert C Williamson A theory of learning with corrupted labelsJournal of Machine Learning Research 181ndash50 2018

Vikas Verma Alex Lamb Juho Kannala Yoshua Bengio and David Lopez-Paz Interpolationconsistency training for semi-supervised learning In IJCAI 2019

Stefan Wager Sida Wang and Percy Liang Dropout training as adaptive regularization In NeurIPS2013

Roman Werpachowski Andrs Gyrgy and Csaba Szepesvri Detecting overing via adversarialexamples In NeurIPS 2019

Han Xiao Kashif Rasul and Roland Vollgraf Fashion-MNIST a novel image dataset for bench-marking machine learning algorithms arXiv preprint arXiv170807747 2017

Chiyuan Zhang Samy Bengio Moritz Hardt Benjamin Recht and Oriol Vinyals Understandingdeep learning requires rethinking generalization In ICLR 2017

17

Hongyi Zhang Moustapha Cisse Yann N Dauphin and David Lopez-Paz mixup Beyondempirical risk minimization In ICLR 2018

Tong Zhang Statistical behavior and consistency of classication methods based on convex riskminimization e Annals of Statistics 3256ndash85 2004

18

A Proof of eoremProof If the ooding level is b then the proposed ooding estimator is

R(g) = |R(g)minus b|+ b (12)

Since the absolute operator can be expressed with a max operator with max(a b) = a+b+|aminusb|2

theproposed estimator can be re-expressed as

R(g) = 2max(R(g) b)minus R(g) = Aminus R(g) (13)

For convenience we used A = 2max(R(g) b) From the denition of MSE

MSE(R(g)) = E[(R(g)minusR(g))2] (14)MSE(R(g)) = E[(R(g)minusR(g))2] (15)

= E[(Aminus R(g)minusR(g))2)] (16)= E[A2]minus 2E[A(R(g) +R(g))] + E[(R(g) +R(g))2] (17)

We are interested in the sign of

MSE(R(g))minusMSE(R(g)) = E[minus4R(g)R(g)minus A2 + 2A(R(g) +R(g))] (18)

Dene the inside of the expectation as B = minus4R(g)R(g)minus A2 + 2A(R(g) + R(g)) B can bedivided into two cases depending on the outcome of the max operator

B =

minus4R(g)R(g)minus 4R(g)2 + 4R(g)(R(g) +R(g)) if R(g) ge b

minus4R(g)R(g)minus 4b2 + 4b(R(g) +R(g)) if R(g) lt b(19)

=

0 if R(g) ge b

minus4R(g)R(g)minus 4b2 + 4b(R(g) +R(g)) if R(g) lt b(20)

=

0 if R(g) ge b

minus4(bminus R(g))(bminusR(g)) if R(g) lt b (21)

e laer case becomes positive when R(g) lt b lt R(g) erefore when R(g) lt b lt R(g)

MSE(R(g))minusMSE(R(g)) gt 0 (22)MSE(R(g)) gt MSE(R(g)) (23)

When b le R(g)

MSE(R(g))minusMSE(R(g)) gt 0 (24)MSE(R(g)) = MSE(R(g)) (25)

19

B Bayes Risk for Gaussian DistributionsIn this section we explain in detail how we derived the Bayes risk with respect to the surrogateloss in the experiments with Gaussian data in Section 41 Since we are using the logistic loss inthe synthetic experiments the loss of the margin is

`(yg(x)) = log(1 + exp(minusyg(x))) (26)

where g(x) Rd rarr R is a scalar instead of the vector denition that was used previously becausethe synthetic experiments only consider binary classication Take the derivative to derive

partE[`(yg(x))]partg(x)

=E[log(1 + exp(minusyg(x)))]

partg(x)= E

[minusy exp(minusyg(x))1 + exp(minusyg(x))

∣∣∣∣∣x]p(x) (27)

= E

[minusy

1 + exp(yg(x))

∣∣∣∣∣x]p(x) (28)

= E

[y + 1

2

1

exp(minusg(x)) + 1+y minus 1

2

minus1exp(g(x)) + 1

∣∣∣∣∣x]p(x) (29)

= E

[y + 1

2

∣∣∣∣∣x]

1

exp(minusg(x)) + 1p(x) + E

[y minus 1

2

∣∣∣∣∣x]

minus1exp(g(x)) + 1

p(x) (30)

= p(y = +1|x) 1

exp(minusg(x)) + 1p(x) + p(y = minus1|x) minus1

exp(g(x)) + 1p(x) (31)

Set this to zero divide by p(x) gt 0 to obtain

p(y = minus1|x) 1

exp(minusg(x)) + 1= p(y = +1|x) 1

exp(g(x)) + 1(32)

exp(g(x)) =p(y = +1|x)p(y = minus1|x)

(33)

g(x) = logp(y = +1|x)p(y = minus1|x)

(34)

Since we are interested in the surrogate loss under this classier we plug this into the logisticloss to obtain the Bayes risk

E[`(yg(x))] = E[log(1 +

p(minusy|x)p(y|x)

)

]= E

[log(

1

p(y|x))

]= E[minus log p(y|x)] (35)

In the experiments in Section 41 we report the empirical version of this with the test dataset asthe Bayes risk

20

C Details of Experiments

C1 Benchmark DatasetsIn the experiments in Section 42 we use six benchmark datasets explained below

bull MNIST6 [Lecun et al 1998] is a 10 class dataset of handwrien digits 1 2 9 and 0 Eachsample is a 28times 28 grayscale image e number of training and test samples are 60000and 10000 respectively

bull Fashion-MNIST7 [Xiao et al 2017] is a 10 class dataset of fashion items T-shirttop TrouserPullover Dress Coat Sandal Shirt Sneaker Bag and Ankle boot Each sample is a 28times 28grayscale image e number of training and test samples are 60000 and 10000 respectively

bull Kuzushiji-MNIST8 [Clanuwat et al 2018] is a 10 class dataset of cursive Japanese (ldquoKuzushijirdquo)characters Each sample is a 28 times 28 grayscale image e number of training and testsamples are 60000 and 10000 respectively

bull CIFAR-109 is a 10 class dataset of various objects airplane automobile bird cat deer dogfrog horse ship and truck Each sample is a colored image in 32times 32times 3 RGB format It isa subset of the 80 million tiny images dataset [Torralba et al 2008] ere are 6000 imagesper class where 5000 are for training and 1000 are for test

bull CIFAR-10010 is a 100 class dataset of various objects Each class has 600 samples where 500samples are for training and 100 samples are for test is is also a subset of the 80 milliontiny images dataset [Torralba et al 2008]

bull SVHN11 [Netzer et al 2011] is a 10 class dataset of house numbers from Google Street Viewimages in 32times 32times 3 RGB format 73257 digits are for training and 26032 digits are fortesting

6httpyannlecuncomexdbmnist7httpsgithubcomzalandoresearchfashion-mnist8httpsgithubcomrois-codhkmnist9httpswwwcstorontoedusimkrizcifarhtml

10httpswwwcstorontoedusimkrizcifarhtml11httpufldlstanfordeduhousenumbers

21

C2 Chosen Flooding LevelsIn Table 4 we report the chosen ooding levels for our experiments with benchmark datasets

Table 4 e chosen ooding levels for benchmark experiments

Early stopping times X times XWeight decay times times X XFlooding X X X X

MNIST (08) 002 002 003 002MNIST (04) 002 003 000 002Fashion-MNIST (08) 000 000 mdash mdashFashion-MNIST (04) 000 000 mdash mdashKuzushiji-MNIST (08) 001 003 003 003Kuzushiji-MNIST (04) 003 002 004 003CIFAR-10 (08) 004 004 000 001CIFAR-10 (04) 003 003 000 000CIFAR-100 (08) 002 002 000 000CIFAR-100 (04) 003 003 000 000SVHN (08) 001 001 000 002SVHN (04) 001 009 003 003

D Learning CurvesIn Figure 4 we visualize learning curves for all datasets (including CIFAR-10 which we alreadyvisualized in Figure 2) We only show the learning curves for trainingvalidation proportion of 08since the results for 04 were similar with 08 Note that Figure 1(c) shows the learning curves forthe rst 80 epochs for CIFAR-10 without ooding Figure 1(d) shows the learning curves withooding when the ooding level is 018

22

(a) MNIST (000) (b) MNIST (001) (c) MNIST (002) (d) MNIST (003)

(e) Fashion-MNIST(000)

(f) Fashion-MNIST(002)

(g) Fashion-MNIST(005)

(h) Fashion-MNIST(010)

(i) KMNIST (000) (j) KMNIST (002) (k) KMNIST (005) (l) KMNIST (010)

(m) CIFAR-10 (000) (n) CIFAR-10 (003) (o) CIFAR-10 (007) (p) CIFAR-10 (020)

(q) CIFAR-100 (000) (r) CIFAR-100 (001) (s) CIFAR-100 (006) (t) CIFAR-100 (020)

(u) SVHN (000) (v) SVHN (001) (w) SVHN (006) (x) SVHN (020)

Figure 4 Learning curves of training and test loss e rst gure in each row is the learning curveswithout ooding e 2nd 3rd and 4th columns show the results with dierent ooding levels e oodinglevel increases towards the right-hand side 23

E Relationship between Performance and GradientsSettings We visualize the relationship between test performance (loss or accuracy) and gradientamplitude of the trainingtest loss in Figure 5 where the gradient amplitude is the `2 norm of thelter-nomalized gradient of the loss e lter-normalized gradient is the gradient appropriatelyscaled depending on the magnitude of the corresponding convolutional lter similarly to Liet al [2018] More specicically for each lter of every comvolutional layer we multiply thecorresponding elements of the gradient by the norm of the lter Note that a fully connected layeris a special case of convolutional layer and subject to this scaling We exclude Fashion-MNISTbecause the optimal ooding level was zero We used the results with trainingvalidation splitratio of 08

Results. For the figures with the gradient amplitude of the training loss on the horizontal axis, the marks for the method with flooding are often plotted to the right of the "+" marks for the method without flooding, which implies that flooding prevents the model from staying at a local minimum. For the figures with the gradient amplitude of the test loss on the horizontal axis, we can observe that the method with flooding improves performance while the gradient amplitude becomes smaller. On the other hand, the performance of the method without flooding ("+") degenerates while the gradient amplitude of the test loss keeps increasing.


Figure 5: Relationship between test performance (loss or accuracy) and amplitude of gradient (with training or test loss). Panels (a)-(t) cover MNIST, Kuzushiji-MNIST (K-M), CIFAR-10 (C10), CIFAR-100 (C100), and SVHN; in each row the horizontal axis is the gradient amplitude of the training loss or of the test loss, and the vertical axis is the test loss or the test accuracy. Each point in the figures corresponds to a single model at a certain epoch. We remove the first 5 epochs and plot the rest. One marker type is used for the method with flooding and "+" is used for the method without flooding; the large black markers show the epochs with early stopping. The color becomes lighter (purple → yellow) as the training proceeds.


F Flatness

Settings. We follow Li et al. [2018] and give a one-dimensional visualization of flatness for each dataset in Figure 6. We exclude Fashion-MNIST because the optimal flooding level was zero. We used the results with a training/validation split ratio of 0.8. We compare the flatness of the model right after the empirical risk with respect to a mini-batch becomes smaller than the flooding level, R̂_m(g) < b, for the first time (dotted blue line) and the model after training (solid blue line). We also compare them with the model trained by the baseline method without flooding, after training is finished (solid red line).
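
A minimal sketch of such a one-dimensional plot is given below. It is our own simplification of the procedure in Li et al. [2018] (a single random direction whose filters are rescaled to the corresponding filter norms of the trained weights; biases are left unnormalized, and all names are ours), not the authors' code.

    import copy
    import torch

    @torch.no_grad()
    def loss_along_random_direction(model, criterion, loader, alphas, device="cpu"):
        base = copy.deepcopy(model).to(device).eval()
        # Filter-normalized random direction d.
        direction = {}
        for name, p in base.named_parameters():
            d = torch.randn_like(p)
            if p.dim() > 1:  # weight tensors: rescale each filter (row) of d
                d_f, p_f = d.flatten(1), p.flatten(1)
                scale = p_f.norm(dim=1, keepdim=True) / (d_f.norm(dim=1, keepdim=True) + 1e-10)
                d = (d_f * scale).view_as(p)
            direction[name] = d
        # Evaluate the loss at theta + alpha * d for each perturbation size alpha.
        curve = []
        for alpha in alphas:
            probe = copy.deepcopy(base)
            for name, p in probe.named_parameters():
                p.add_(alpha * direction[name])
            total, count = 0.0, 0
            for x, y in loader:
                x, y = x.to(device), y.to(device)
                total += criterion(probe(x), y).item() * y.shape[0]
                count += y.shape[0]
            curve.append(total / count)
        return curve

Plotting the returned curve against the perturbation sizes for each of the three models described above would give curves analogous to those in Figure 6.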

Results. According to Figure 6, the test loss becomes lower and flatter during training with flooding. Note that the training loss, on the other hand, continues to float around the flooding level until the end of training after it enters the flooding zone. We expect that the model makes a random walk and escapes regions with sharp loss landscapes during this period. This may be a possible reason for the better generalization results with our proposed method.


Figure 6: One-dimensional visualization of flatness. Panels (a)-(j) show the training and test loss with respect to perturbation for MNIST, Kuzushiji-MNIST, CIFAR-10, CIFAR-100, and SVHN. We depict the results for 3 models: the model when the empirical risk with respect to the training data is below the flooding level for the first time during training (dotted blue), the model at the end of training with flooding (solid blue), and the model at the end of training without flooding (solid red).



Tarin Clanuwat, Mikel Bober-Irizar, Asanobu Kitamoto, Alex Lamb, Kazuaki Yamamoto, and David Ha. Deep learning for classical Japanese literature. In NeurIPS Workshop on Machine Learning for Creativity and Design, 2018.
Marthinus Christoffel du Plessis, Gang Niu, and Masashi Sugiyama. Analysis of learning from positive and unlabeled data. In NeurIPS, 2014.
Marthinus Christoffel du Plessis, Gang Niu, and Masashi Sugiyama. Convex formulation for learning from positive and unlabeled data. In ICML, 2015.
Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.
Hongyu Guo, Yongyi Mao, and Richong Zhang. Augmenting data with mixup for sentence classification: An empirical study. arXiv:1905.08941, 2019.
Bo Han, Gang Niu, Jiangchao Yao, Xingrui Yu, Miao Xu, Ivor Tsang, and Masashi Sugiyama. Pumpout: A meta approach to robust deep learning with noisy labels. arXiv:1809.11008, 2018.
Stephen Jose Hanson and Lorien Y. Pratt. Comparing biases for minimal network construction with back-propagation. In NeurIPS, 1988.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
Takashi Ishida, Gang Niu, Aditya Krishna Menon, and Masashi Sugiyama. Complementary-label learning for arbitrary losses and models. In ICML, 2019.
Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. In ICLR, 2017.
Diederik P. Kingma and Jimmy L. Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
Ryuichi Kiryo, Gang Niu, Marthinus Christoffel du Plessis, and Masashi Sugiyama. Positive-unlabeled learning with non-negative risk estimator. In NeurIPS, 2017.
Anders Krogh and John A. Hertz. A simple weight decay can improve generalization. In NeurIPS, 1992.
Yann Lecun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. In Proceedings of the IEEE, pages 2278–2324, 1998.
Hao Li, Zheng Xu, Gavin Taylor, Christoph Studer, and Tom Goldstein. Visualizing the loss landscape of neural nets. In NeurIPS, 2018.
Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In ICLR, 2019.
Nan Lu, Tianyi Zhang, Gang Niu, and Masashi Sugiyama. Mitigating overfitting in supervised classification from two unlabeled datasets: A consistent risk correction approach. In AISTATS, 2020.
N. Morgan and H. Bourlard. Generalization and parameter estimation in feedforward nets: Some experiments. In NeurIPS, 1990.
Vinod Nair and Geoffrey E. Hinton. Rectified linear units improve restricted Boltzmann machines. In ICML, 2010.
Preetum Nakkiran, Gal Kaplun, Dimitris Kalimeris, Tristan Yang, Benjamin L. Edelman, Fred Zhang, and Boaz Barak. SGD on neural networks learns functions of increasing complexity. In NeurIPS, 2019.
Preetum Nakkiran, Gal Kaplun, Yamini Bansal, Tristan Yang, Boaz Barak, and Ilya Sutskever. Deep double descent: Where bigger models and more data hurt. In ICLR, 2020.
Nagarajan Natarajan, Inderjit S. Dhillon, Pradeep K. Ravikumar, and Ambuj Tewari. Learning with noisy labels. In NeurIPS, 2013.
Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y. Ng. Reading digits in natural images with unsupervised feature learning. In NeurIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.
Andrew Y. Ng. Preventing "overfitting" of cross-validation data. In ICML, 1997.
Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An imperative style, high-performance deep learning library. In NeurIPS, 2019.
Giorgio Patrini, Alessandro Rozza, Aditya K. Menon, Richard Nock, and Lizhen Qu. Making deep neural networks robust to label noise: A loss correction approach. In CVPR, 2017.
Herbert Robbins and Sutton Monro. A stochastic approximation method. Annals of Mathematical Statistics, 22:400–407, 1951.
Rebecca Roelofs, Vaishaal Shankar, Benjamin Recht, Sara Fridovich-Keil, Moritz Hardt, John Miller, and Ludwig Schmidt. A meta-analysis of overfitting in machine learning. In NeurIPS, 2019.
Bobak Shahriari, Kevin Swersky, Ziyu Wang, Ryan P. Adams, and Nando de Freitas. Taking the human out of the loop: A review of Bayesian optimization. Proceedings of the IEEE, 104:148–175, 2016.
Connor Shorten and Taghi M. Khoshgoftaar. A survey on image data augmentation for deep learning. Journal of Big Data, 6, 2019.
Jasper Snoek, Hugo Larochelle, and Ryan P. Adams. Practical Bayesian optimization of machine learning algorithms. In NeurIPS, 2012.
Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15:1929–1958, 2014.
Masashi Sugiyama. Introduction to Statistical Machine Learning. Morgan Kaufmann, 2015.
Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, and Jon Shlens. Rethinking the inception architecture for computer vision. In CVPR, 2016.
Sunil Thulasidasan, Gopinath Chennupati, Jeff Bilmes, Tanmoy Bhattacharya, and Sarah Michalak. On mixup training: Improved calibration and predictive uncertainty for deep neural networks. In NeurIPS, 2019.
Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58:267–288, 1996.
A. N. Tikhonov and V. Y. Arsenin. Solutions of Ill Posed Problems. Winston, 1977.
Andrey Nikolayevich Tikhonov. On the stability of inverse problems. Doklady Akademii Nauk SSSR, 39:195–198, 1943.
Antonio Torralba, Rob Fergus, and William T. Freeman. 80 million tiny images: A large data set for nonparametric object and scene recognition. IEEE Transactions on PAMI, 2008.
Brendan van Rooyen and Robert C. Williamson. A theory of learning with corrupted labels. Journal of Machine Learning Research, 18:1–50, 2018.
Vikas Verma, Alex Lamb, Juho Kannala, Yoshua Bengio, and David Lopez-Paz. Interpolation consistency training for semi-supervised learning. In IJCAI, 2019.
Stefan Wager, Sida Wang, and Percy Liang. Dropout training as adaptive regularization. In NeurIPS, 2013.
Roman Werpachowski, András György, and Csaba Szepesvári. Detecting overfitting via adversarial examples. In NeurIPS, 2019.
Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.
Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. In ICLR, 2017.
Hongyi Zhang, Moustapha Cisse, Yann N. Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. In ICLR, 2018.
Tong Zhang. Statistical behavior and consistency of classification methods based on convex risk minimization. The Annals of Statistics, 32:56–85, 2004.

A Proof of Theorem

Proof. If the flooding level is b, then the proposed flooding estimator is

    R̃(g) = |R̂(g) − b| + b.    (12)

Since the absolute value can be expressed with a max operator as max(a, b) = (a + b + |a − b|)/2, the proposed estimator can be re-expressed as

    R̃(g) = 2 max(R̂(g), b) − R̂(g) = A − R̂(g),    (13)

where, for convenience, we used A = 2 max(R̂(g), b). From the definition of the MSE,

    MSE(R̂(g)) = E[(R̂(g) − R(g))²],    (14)

    MSE(R̃(g)) = E[(R̃(g) − R(g))²]    (15)
               = E[(A − R̂(g) − R(g))²]    (16)
               = E[A²] − 2E[A(R̂(g) + R(g))] + E[(R̂(g) + R(g))²].    (17)

We are interested in the sign of

    MSE(R̂(g)) − MSE(R̃(g)) = E[−4R̂(g)R(g) − A² + 2A(R̂(g) + R(g))].    (18)

Define the inside of the expectation as B = −4R̂(g)R(g) − A² + 2A(R̂(g) + R(g)). B can be divided into two cases, depending on the outcome of the max operator:

    B = { −4R̂(g)R(g) − 4R̂(g)² + 4R̂(g)(R̂(g) + R(g))   if R̂(g) ≥ b
        { −4R̂(g)R(g) − 4b² + 4b(R̂(g) + R(g))           if R̂(g) < b    (19)

      = { 0                                               if R̂(g) ≥ b
        { −4R̂(g)R(g) − 4b² + 4b(R̂(g) + R(g))           if R̂(g) < b    (20)

      = { 0                                               if R̂(g) ≥ b
        { −4(b − R̂(g))(b − R(g))                         if R̂(g) < b.    (21)

The latter case becomes positive when R̂(g) < b < R(g). Therefore, when R̂(g) < b < R(g),

    MSE(R̂(g)) − MSE(R̃(g)) > 0,    (22)
    MSE(R̂(g)) > MSE(R̃(g)).    (23)

When b ≤ R̂(g),

    MSE(R̂(g)) − MSE(R̃(g)) = 0,    (24)
    MSE(R̂(g)) = MSE(R̃(g)).    (25)
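
As a quick numerical illustration of this result (our own sketch; the true risk and the noise model below are arbitrary placeholders, not taken from the paper), one can simulate noisy empirical risks and compare the two MSEs. Since the per-sample squared error of the flooded estimator is never larger when b is below the true risk, the printed inequality holds for every b in the loop.

    import numpy as np

    rng = np.random.default_rng(0)
    R_true = 0.10                                             # placeholder true risk R(g)
    R_hat = R_true + rng.normal(scale=0.05, size=1_000_000)   # placeholder noisy empirical risk

    def flooded(r_hat, b):
        return np.abs(r_hat - b) + b   # the flooding estimator of Eq. (12)

    mse_plain = np.mean((R_hat - R_true) ** 2)
    for b in (0.02, 0.05, 0.08):       # flooding levels below the true risk
        mse_flood = np.mean((flooded(R_hat, b) - R_true) ** 2)
        print(f"b={b:.2f}: MSE(R_hat)={mse_plain:.5f} >= MSE(R_tilde)={mse_flood:.5f}")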


B Bayes Risk for Gaussian Distributions

In this section, we explain in detail how we derived the Bayes risk with respect to the surrogate loss in the experiments with Gaussian data in Section 4.1. Since we are using the logistic loss in the synthetic experiments, the loss of the margin is

    ℓ(yg(x)) = log(1 + exp(−yg(x))),    (26)

where g : R^d → R outputs a scalar instead of the vector definition that was used previously, because the synthetic experiments only consider binary classification. Taking the derivative with respect to g(x),

    ∂E[ℓ(yg(x))]/∂g(x)
        = ∂E[log(1 + exp(−yg(x)))]/∂g(x)
        = E[ −y exp(−yg(x)) / (1 + exp(−yg(x))) | x ] p(x)    (27)
        = E[ −y / (1 + exp(yg(x))) | x ] p(x)    (28)
        = E[ ((y + 1)/2) · (−1/(exp(g(x)) + 1)) + ((1 − y)/2) · (1/(exp(−g(x)) + 1)) | x ] p(x)    (29)
        = E[ (y + 1)/2 | x ] · (−1/(exp(g(x)) + 1)) · p(x) + E[ (1 − y)/2 | x ] · (1/(exp(−g(x)) + 1)) · p(x)    (30)
        = p(y = +1 | x) · (−1/(exp(g(x)) + 1)) · p(x) + p(y = −1 | x) · (1/(exp(−g(x)) + 1)) · p(x).    (31)

Setting this to zero and dividing by p(x) > 0, we obtain

    p(y = −1 | x) · 1/(exp(−g(x)) + 1) = p(y = +1 | x) · 1/(exp(g(x)) + 1),    (32)

    exp(g(x)) = p(y = +1 | x) / p(y = −1 | x),    (33)

    g(x) = log[ p(y = +1 | x) / p(y = −1 | x) ].    (34)

Since we are interested in the surrogate loss under this classifier, we plug this into the logistic loss to obtain the Bayes risk:

    E[ℓ(yg(x))] = E[ log(1 + p(−y | x)/p(y | x)) ] = E[ log(1/p(y | x)) ] = E[ −log p(y | x) ].    (35)

In the experiments in Section 4.1, we report the empirical version of this with the test dataset as the Bayes risk.
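
For illustration only, the empirical version of Eq. (35) can be computed as in the following sketch. The class priors, means, and covariances below are placeholders, not the actual Gaussian setup of Section 4.1.

    import numpy as np
    from scipy.stats import multivariate_normal

    # Placeholder class-conditional Gaussians and class priors (illustrative only).
    d = 2
    mu_pos, mu_neg = np.full(d, 1.0), np.full(d, -1.0)
    cov = np.eye(d)
    prior_pos = prior_neg = 0.5

    rng = np.random.default_rng(0)
    n_test = 10_000
    y = np.where(rng.random(n_test) < prior_pos, 1, -1)
    x = np.where((y == 1)[:, None],
                 rng.multivariate_normal(mu_pos, cov, n_test),
                 rng.multivariate_normal(mu_neg, cov, n_test))

    # Posterior p(y | x) via Bayes' rule, then the empirical Bayes risk E[-log p(y | x)].
    dens_pos = multivariate_normal.pdf(x, mu_pos, cov) * prior_pos
    dens_neg = multivariate_normal.pdf(x, mu_neg, cov) * prior_neg
    post_true = np.where(y == 1, dens_pos, dens_neg) / (dens_pos + dens_neg)
    bayes_risk = -np.log(post_true).mean()
    print(f"empirical Bayes risk under the logistic loss: {bayes_risk:.4f}")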

20

C Details of Experiments

C1 Benchmark DatasetsIn the experiments in Section 42 we use six benchmark datasets explained below

bull MNIST6 [Lecun et al 1998] is a 10 class dataset of handwrien digits 1 2 9 and 0 Eachsample is a 28times 28 grayscale image e number of training and test samples are 60000and 10000 respectively

bull Fashion-MNIST7 [Xiao et al 2017] is a 10 class dataset of fashion items T-shirttop TrouserPullover Dress Coat Sandal Shirt Sneaker Bag and Ankle boot Each sample is a 28times 28grayscale image e number of training and test samples are 60000 and 10000 respectively

bull Kuzushiji-MNIST8 [Clanuwat et al 2018] is a 10 class dataset of cursive Japanese (ldquoKuzushijirdquo)characters Each sample is a 28 times 28 grayscale image e number of training and testsamples are 60000 and 10000 respectively

bull CIFAR-109 is a 10 class dataset of various objects airplane automobile bird cat deer dogfrog horse ship and truck Each sample is a colored image in 32times 32times 3 RGB format It isa subset of the 80 million tiny images dataset [Torralba et al 2008] ere are 6000 imagesper class where 5000 are for training and 1000 are for test

bull CIFAR-10010 is a 100 class dataset of various objects Each class has 600 samples where 500samples are for training and 100 samples are for test is is also a subset of the 80 milliontiny images dataset [Torralba et al 2008]

bull SVHN11 [Netzer et al 2011] is a 10 class dataset of house numbers from Google Street Viewimages in 32times 32times 3 RGB format 73257 digits are for training and 26032 digits are fortesting

6httpyannlecuncomexdbmnist7httpsgithubcomzalandoresearchfashion-mnist8httpsgithubcomrois-codhkmnist9httpswwwcstorontoedusimkrizcifarhtml

10httpswwwcstorontoedusimkrizcifarhtml11httpufldlstanfordeduhousenumbers

21

C2 Chosen Flooding LevelsIn Table 4 we report the chosen ooding levels for our experiments with benchmark datasets

Table 4 e chosen ooding levels for benchmark experiments

Early stopping times X times XWeight decay times times X XFlooding X X X X

MNIST (08) 002 002 003 002MNIST (04) 002 003 000 002Fashion-MNIST (08) 000 000 mdash mdashFashion-MNIST (04) 000 000 mdash mdashKuzushiji-MNIST (08) 001 003 003 003Kuzushiji-MNIST (04) 003 002 004 003CIFAR-10 (08) 004 004 000 001CIFAR-10 (04) 003 003 000 000CIFAR-100 (08) 002 002 000 000CIFAR-100 (04) 003 003 000 000SVHN (08) 001 001 000 002SVHN (04) 001 009 003 003

D Learning CurvesIn Figure 4 we visualize learning curves for all datasets (including CIFAR-10 which we alreadyvisualized in Figure 2) We only show the learning curves for trainingvalidation proportion of 08since the results for 04 were similar with 08 Note that Figure 1(c) shows the learning curves forthe rst 80 epochs for CIFAR-10 without ooding Figure 1(d) shows the learning curves withooding when the ooding level is 018

22

(a) MNIST (000) (b) MNIST (001) (c) MNIST (002) (d) MNIST (003)

(e) Fashion-MNIST(000)

(f) Fashion-MNIST(002)

(g) Fashion-MNIST(005)

(h) Fashion-MNIST(010)

(i) KMNIST (000) (j) KMNIST (002) (k) KMNIST (005) (l) KMNIST (010)

(m) CIFAR-10 (000) (n) CIFAR-10 (003) (o) CIFAR-10 (007) (p) CIFAR-10 (020)

(q) CIFAR-100 (000) (r) CIFAR-100 (001) (s) CIFAR-100 (006) (t) CIFAR-100 (020)

(u) SVHN (000) (v) SVHN (001) (w) SVHN (006) (x) SVHN (020)

Figure 4 Learning curves of training and test loss e rst gure in each row is the learning curveswithout ooding e 2nd 3rd and 4th columns show the results with dierent ooding levels e oodinglevel increases towards the right-hand side 23

E Relationship between Performance and GradientsSettings We visualize the relationship between test performance (loss or accuracy) and gradientamplitude of the trainingtest loss in Figure 5 where the gradient amplitude is the `2 norm of thelter-nomalized gradient of the loss e lter-normalized gradient is the gradient appropriatelyscaled depending on the magnitude of the corresponding convolutional lter similarly to Liet al [2018] More specicically for each lter of every comvolutional layer we multiply thecorresponding elements of the gradient by the norm of the lter Note that a fully connected layeris a special case of convolutional layer and subject to this scaling We exclude Fashion-MNISTbecause the optimal ooding level was zero We used the results with trainingvalidation splitratio of 08

Results For the gures with gradient amplitude of training loss on the horizontal axis ldquordquomarks (w ooding) are oen ploed on the right of ldquo+rdquo marks (wo ooding) which impliesthat ooding prevents the model from staying a local minimum For the gures with gradientamplitude of test loss on the horizontal axis we can observe the method with ooding (ldquordquo)improves performance while the gradient amplitude becomes smaller On the other hand theperformance with the method without ooding (ldquo+rdquo) degenerates while the gradient amplitude oftest loss keeps increasing

24

(a) MNIST xtrainyloss

(b) MNIST xtrainyacc

(c) MNIST xtestyloss

(d) MNIST xtest yacc

(e) K-M xtrain yloss (f) K-M xtrain yacc (g) K-M xtest yloss (h) K-M xtest yacc

(i) C10 xtrain yloss (j) C10 xtrain yacc (k) C10 xtest yloss (l) C10 xtest yacc

(m) C100 xtrainyloss

(n) C100 xtrain yacc (o) C100 xtest yloss (p) C100 xtest yacc

(q) SVHN xtrainyloss

(r) SVHN xtrainyacc

(s) SVHN xtest yloss (t) SVHN xtest yacc

Figure 5 Relationship between test performance (loss or accuracy) and amplitude of gradient (with trainingor test loss) Each point (ldquordquo or ldquo+rdquo) in the gures corresponds to a single model at a certain epoch Weremove the rst 5 epochs and plot the rest ldquordquo is used for the method with ooding and ldquo+rdquo is used forthe method without ooding e large black ldquordquo and ldquo+rdquo show the epochs with early stopping e colorbecomes lighter (purplerarr yellow) as the training proceeds K-M C10 and C100 stand for Kuzushiji-MNISTCIFAR-10 and CIFAR-100

25

F FlatnessSettings We follow Li et al [2018] and give a one-dimensional visualization of atness for eachdataset in Figure 6 We exclude Fashion-MNIST because the optimal ooding level was zero Weused the results with trainingvalidation split ratio of 08 We compare the atness of the modelright aer the empirical risk with respect to a mini-batch becomes smaller than the ooding levelRm(g) lt b for the rst time (doed blue line) and the model aer training (solid blue line) Wealso compare them with the model trained by the baseline method without ooding and trainingis nished (solid red line)

Results According to Figure 6 the test loss becomes lower and more at during the trainingwith ooding Note that the training loss on the other hand continues to oat around the oodinglevel until the end of training aer it enters the ooding zone We expect that the model makes arandom walk and escapes regions with sharp loss landscapes during the period is may be apossible reason for beer generalization results with our proposed method

26

(a) MNIST (train) (b) MNIST (test) (c) Kuzushiji-MNIST(train)

(d) Kuzushiji-MNIST(test)

(e) CIFAR-10 (train) (f) CIFAR-10 (test) (g) CIFAR-100 (train) (h) CIFAR-100 (test)

(i) SVHN (train) (j) SVHN (test)

Figure 6 One-dimensional visualization of atness We visualize the trainingtest loss with respect toperturbation We depict the results for 3 models the model when the empirical risk with respect to trainingdata is below the ooding level for the rst time during training (doed blue) the model at the end oftraining with ooding (solid blue) and the model at the end of training without ooding (solid red)

27

  • 1 Introduction
  • 2 Backgrounds
    • 21 Regularization Methods
    • 22 Double Descent Curves with Overparametrization
    • 23 Lower-Bounding the Empirical Risk
      • 3 Flooding How to Avoid Zero Training Loss
        • 31 Preliminaries
        • 32 Algorithm
        • 33 Implementation
        • 34 Theoretical Analysis
          • 4 Experiments
            • 41 Synthetic Experiments
            • 42 Benchmark Experiments
            • 43 Memorization
              • 5 Conclusion
              • A Proof of Theorem
              • B Bayes Risk for Gaussian Distributions
              • C Details of Experiments
                • C1 Benchmark Datasets
                • C2 Chosen Flooding Levels
                  • D Learning Curves
                  • E Relationship between Performance and Gradients
                  • F Flatness
Page 16: Do We Need Zero Training Loss A›er Achieving Zero Training … · 2020. 2. 21. · directly aim to avoid zero training loss, they o›en fail to maintain a moderate level of training

Ilya Loshchilov and Frank Huer Decoupled weight decay regularization In ICLR 2019

Nan Lu Tianyi Zhang Gang Niu and Masashi Sugiyama Mitigating overing in supervisedclassication from two unlabeled datasets A consistent risk correction approach In AISTATS2020

N Morgan and H Bourlard Generalization and parameter estimation in feedforward nets Someexperiments In NeurIPS 1990

Vinod Nair and Georey E Hinton Rectied linear units improve restricted boltzmann machinesIn ICML 2010

Preetum Nakkiran Gal Kaplun Dimitris Kalimeris Tristan Yang Benjamin L Edelman FredZhang and Boaz Barak SGD on neural networks learns functions of increasing complexity InNeurIPS 2019

Preetum Nakkiran Gal Kaplun Yamini Bansal Tristan Yang Boaz Barak and Ilya Sutskever Deepdouble descent Where bigger models and more data hurt In ICLR 2020

Nagarajan Natarajan Inderjit S Dhillon Pradeep K Ravikumar and Ambuj Tewari Learningwith noisy labels In NeurIPS 2013

Yuval Netzer Tao Wang Adam Coates Alessandro Bissacco Bo Wu and Andrew Y Ng Readingdigits in natural images with unsupervised feature learning In NeurIPS Workshop on DeepLearning and Unsupervised Feature Learning 2011

Andrew Y Ng Preventing ldquooveringrdquo of cross-validation data In ICML 1997

Adam Paszke Sam Gross Francisco Massa Adam Lerer James Bradbury Gregory Chanan TrevorKilleen Zeming Lin Natalia Gimelshein Luca Antiga Alban Desmaison Andreas Kopf EdwardYang Zachary DeVito Martin Raison Alykhan Tejani Sasank Chilamkurthy Benoit SteinerLu Fang Junjie Bai and Soumith Chintala PyTorch An imperative style high-performancedeep learning library In NeurIPS 2019

Giorgio Patrini Alessandro Rozza Aditya K Menon Richard Nock and Lizhen Making deepneural networks robust to label noise A loss correction approach In CVPR 2017

Herbert Robbins and Suon Monro A stochastic approximation method Annals of MathematicalStatistics 22400ndash407 1951

Rebecca Roelofs Vaishaal Shankar Benjamin Recht Sara Fridovich-Keil Moritz Hardt John Millerand Ludwig Schmidt A meta-analysis of overing in machine learning In NeurIPS 2019

Bobak Shahriari Kevin Swersky Ziyu Wang Ryan P Adams and Nando de Freitas Taking thehuman out of the loop A review of bayesian optimization Proceedings of the IEEE 104148ndash1752016

16

Connor Shorten and Taghi M Khoshgoaar A survey on image data augmentation for deeplearning Journal of Big Data 6 2019

Jasper Snoek Hugo Larochelle and Ryan P Adams Practical bayesian optimization of machinelearning algorithms In NeurIPS 2012

Nitish Srivastava Georey Hinton Alex Krizhevsky Ilya Sutskever and Ruslan SalakhutdinovDropout A simple way to prevent neural networks from overing Journal of Machine LearningResearch 151929ndash1958 2014

Masashi Sugiyama Introduction to statistical machine learning Morgan Kaufmann 2015

Christian Szegedy Vincent Vanhoucke Sergey Ioe and Jon Shlens Rethinking the inceptionarchitecture for computer vision In CVPR 2016

Sunil ulasidasan Gopinath Chennupati Je Bilmes Tanmoy Bhaacharya and Sarah MichalakOn mixup training Improved calibration and predictive uncertainty for deep neural networksIn NeurIPS 2019

Robert Tibshirani Regression shrinkage and selection via the lasso Journal of the Royal StatisticalSociety Series B (Methodological) 58267ndash288 1996

A N Tikhonov and V Y Arsenin Solutions of Ill Posed Problems Winston 1977

Andrey Nikolayevich Tikhonov On the stability of inverse problems Doklady Akademii NaukSSSR 39195ndash198 1943

Antonio Torralba Rob Fergus and William T Freeman 80 million tiny images A large data setfor nonparametric object and scene recognition In IEEE Trans PAMI 2008

Brendan van Rooyen and Robert C Williamson A theory of learning with corrupted labelsJournal of Machine Learning Research 181ndash50 2018

Vikas Verma Alex Lamb Juho Kannala Yoshua Bengio and David Lopez-Paz Interpolationconsistency training for semi-supervised learning In IJCAI 2019

Stefan Wager Sida Wang and Percy Liang Dropout training as adaptive regularization In NeurIPS2013

Roman Werpachowski Andrs Gyrgy and Csaba Szepesvri Detecting overing via adversarialexamples In NeurIPS 2019

Han Xiao Kashif Rasul and Roland Vollgraf Fashion-MNIST a novel image dataset for bench-marking machine learning algorithms arXiv preprint arXiv170807747 2017

Chiyuan Zhang Samy Bengio Moritz Hardt Benjamin Recht and Oriol Vinyals Understandingdeep learning requires rethinking generalization In ICLR 2017

17

Hongyi Zhang Moustapha Cisse Yann N Dauphin and David Lopez-Paz mixup Beyondempirical risk minimization In ICLR 2018

Tong Zhang Statistical behavior and consistency of classication methods based on convex riskminimization e Annals of Statistics 3256ndash85 2004

18

A Proof of eoremProof If the ooding level is b then the proposed ooding estimator is

R(g) = |R(g)minus b|+ b (12)

Since the absolute operator can be expressed with a max operator with max(a b) = a+b+|aminusb|2

theproposed estimator can be re-expressed as

R(g) = 2max(R(g) b)minus R(g) = Aminus R(g) (13)

For convenience we used A = 2max(R(g) b) From the denition of MSE

MSE(R(g)) = E[(R(g)minusR(g))2] (14)MSE(R(g)) = E[(R(g)minusR(g))2] (15)

= E[(Aminus R(g)minusR(g))2)] (16)= E[A2]minus 2E[A(R(g) +R(g))] + E[(R(g) +R(g))2] (17)

We are interested in the sign of

MSE(R(g))minusMSE(R(g)) = E[minus4R(g)R(g)minus A2 + 2A(R(g) +R(g))] (18)

Dene the inside of the expectation as B = minus4R(g)R(g)minus A2 + 2A(R(g) + R(g)) B can bedivided into two cases depending on the outcome of the max operator

B =

minus4R(g)R(g)minus 4R(g)2 + 4R(g)(R(g) +R(g)) if R(g) ge b

minus4R(g)R(g)minus 4b2 + 4b(R(g) +R(g)) if R(g) lt b(19)

=

0 if R(g) ge b

minus4R(g)R(g)minus 4b2 + 4b(R(g) +R(g)) if R(g) lt b(20)

=

0 if R(g) ge b

minus4(bminus R(g))(bminusR(g)) if R(g) lt b (21)

e laer case becomes positive when R(g) lt b lt R(g) erefore when R(g) lt b lt R(g)

MSE(R(g))minusMSE(R(g)) gt 0 (22)MSE(R(g)) gt MSE(R(g)) (23)

When b le R(g)

MSE(R(g))minusMSE(R(g)) gt 0 (24)MSE(R(g)) = MSE(R(g)) (25)

19

B Bayes Risk for Gaussian DistributionsIn this section we explain in detail how we derived the Bayes risk with respect to the surrogateloss in the experiments with Gaussian data in Section 41 Since we are using the logistic loss inthe synthetic experiments the loss of the margin is

`(yg(x)) = log(1 + exp(minusyg(x))) (26)

where g(x) Rd rarr R is a scalar instead of the vector denition that was used previously becausethe synthetic experiments only consider binary classication Take the derivative to derive

partE[`(yg(x))]partg(x)

=E[log(1 + exp(minusyg(x)))]

partg(x)= E

[minusy exp(minusyg(x))1 + exp(minusyg(x))

∣∣∣∣∣x]p(x) (27)

= E

[minusy

1 + exp(yg(x))

∣∣∣∣∣x]p(x) (28)

= E

[y + 1

2

1

exp(minusg(x)) + 1+y minus 1

2

minus1exp(g(x)) + 1

∣∣∣∣∣x]p(x) (29)

= E

[y + 1

2

∣∣∣∣∣x]

1

exp(minusg(x)) + 1p(x) + E

[y minus 1

2

∣∣∣∣∣x]

minus1exp(g(x)) + 1

p(x) (30)

= p(y = +1|x) 1

exp(minusg(x)) + 1p(x) + p(y = minus1|x) minus1

exp(g(x)) + 1p(x) (31)

Set this to zero divide by p(x) gt 0 to obtain

p(y = minus1|x) 1

exp(minusg(x)) + 1= p(y = +1|x) 1

exp(g(x)) + 1(32)

exp(g(x)) =p(y = +1|x)p(y = minus1|x)

(33)

g(x) = logp(y = +1|x)p(y = minus1|x)

(34)

Since we are interested in the surrogate loss under this classier we plug this into the logisticloss to obtain the Bayes risk

E[`(yg(x))] = E[log(1 +

p(minusy|x)p(y|x)

)

]= E

[log(

1

p(y|x))

]= E[minus log p(y|x)] (35)

In the experiments in Section 41 we report the empirical version of this with the test dataset asthe Bayes risk

20

C Details of Experiments

C1 Benchmark DatasetsIn the experiments in Section 42 we use six benchmark datasets explained below

bull MNIST6 [Lecun et al 1998] is a 10 class dataset of handwrien digits 1 2 9 and 0 Eachsample is a 28times 28 grayscale image e number of training and test samples are 60000and 10000 respectively

bull Fashion-MNIST7 [Xiao et al 2017] is a 10 class dataset of fashion items T-shirttop TrouserPullover Dress Coat Sandal Shirt Sneaker Bag and Ankle boot Each sample is a 28times 28grayscale image e number of training and test samples are 60000 and 10000 respectively

bull Kuzushiji-MNIST8 [Clanuwat et al 2018] is a 10 class dataset of cursive Japanese (ldquoKuzushijirdquo)characters Each sample is a 28 times 28 grayscale image e number of training and testsamples are 60000 and 10000 respectively

bull CIFAR-109 is a 10 class dataset of various objects airplane automobile bird cat deer dogfrog horse ship and truck Each sample is a colored image in 32times 32times 3 RGB format It isa subset of the 80 million tiny images dataset [Torralba et al 2008] ere are 6000 imagesper class where 5000 are for training and 1000 are for test

bull CIFAR-10010 is a 100 class dataset of various objects Each class has 600 samples where 500samples are for training and 100 samples are for test is is also a subset of the 80 milliontiny images dataset [Torralba et al 2008]

bull SVHN11 [Netzer et al 2011] is a 10 class dataset of house numbers from Google Street Viewimages in 32times 32times 3 RGB format 73257 digits are for training and 26032 digits are fortesting

6httpyannlecuncomexdbmnist7httpsgithubcomzalandoresearchfashion-mnist8httpsgithubcomrois-codhkmnist9httpswwwcstorontoedusimkrizcifarhtml

10httpswwwcstorontoedusimkrizcifarhtml11httpufldlstanfordeduhousenumbers

21

C2 Chosen Flooding LevelsIn Table 4 we report the chosen ooding levels for our experiments with benchmark datasets

Table 4 e chosen ooding levels for benchmark experiments

Early stopping times X times XWeight decay times times X XFlooding X X X X

MNIST (08) 002 002 003 002MNIST (04) 002 003 000 002Fashion-MNIST (08) 000 000 mdash mdashFashion-MNIST (04) 000 000 mdash mdashKuzushiji-MNIST (08) 001 003 003 003Kuzushiji-MNIST (04) 003 002 004 003CIFAR-10 (08) 004 004 000 001CIFAR-10 (04) 003 003 000 000CIFAR-100 (08) 002 002 000 000CIFAR-100 (04) 003 003 000 000SVHN (08) 001 001 000 002SVHN (04) 001 009 003 003

D Learning CurvesIn Figure 4 we visualize learning curves for all datasets (including CIFAR-10 which we alreadyvisualized in Figure 2) We only show the learning curves for trainingvalidation proportion of 08since the results for 04 were similar with 08 Note that Figure 1(c) shows the learning curves forthe rst 80 epochs for CIFAR-10 without ooding Figure 1(d) shows the learning curves withooding when the ooding level is 018

22

(a) MNIST (000) (b) MNIST (001) (c) MNIST (002) (d) MNIST (003)

(e) Fashion-MNIST(000)

(f) Fashion-MNIST(002)

(g) Fashion-MNIST(005)

(h) Fashion-MNIST(010)

(i) KMNIST (000) (j) KMNIST (002) (k) KMNIST (005) (l) KMNIST (010)

(m) CIFAR-10 (000) (n) CIFAR-10 (003) (o) CIFAR-10 (007) (p) CIFAR-10 (020)

(q) CIFAR-100 (000) (r) CIFAR-100 (001) (s) CIFAR-100 (006) (t) CIFAR-100 (020)

(u) SVHN (000) (v) SVHN (001) (w) SVHN (006) (x) SVHN (020)

Figure 4 Learning curves of training and test loss e rst gure in each row is the learning curveswithout ooding e 2nd 3rd and 4th columns show the results with dierent ooding levels e oodinglevel increases towards the right-hand side 23

E Relationship between Performance and GradientsSettings We visualize the relationship between test performance (loss or accuracy) and gradientamplitude of the trainingtest loss in Figure 5 where the gradient amplitude is the `2 norm of thelter-nomalized gradient of the loss e lter-normalized gradient is the gradient appropriatelyscaled depending on the magnitude of the corresponding convolutional lter similarly to Liet al [2018] More specicically for each lter of every comvolutional layer we multiply thecorresponding elements of the gradient by the norm of the lter Note that a fully connected layeris a special case of convolutional layer and subject to this scaling We exclude Fashion-MNISTbecause the optimal ooding level was zero We used the results with trainingvalidation splitratio of 08

Results For the gures with gradient amplitude of training loss on the horizontal axis ldquordquomarks (w ooding) are oen ploed on the right of ldquo+rdquo marks (wo ooding) which impliesthat ooding prevents the model from staying a local minimum For the gures with gradientamplitude of test loss on the horizontal axis we can observe the method with ooding (ldquordquo)improves performance while the gradient amplitude becomes smaller On the other hand theperformance with the method without ooding (ldquo+rdquo) degenerates while the gradient amplitude oftest loss keeps increasing

24

(a) MNIST xtrainyloss

(b) MNIST xtrainyacc

(c) MNIST xtestyloss

(d) MNIST xtest yacc

(e) K-M xtrain yloss (f) K-M xtrain yacc (g) K-M xtest yloss (h) K-M xtest yacc

(i) C10 xtrain yloss (j) C10 xtrain yacc (k) C10 xtest yloss (l) C10 xtest yacc

(m) C100 xtrainyloss

(n) C100 xtrain yacc (o) C100 xtest yloss (p) C100 xtest yacc

(q) SVHN xtrainyloss

(r) SVHN xtrainyacc

(s) SVHN xtest yloss (t) SVHN xtest yacc

Figure 5 Relationship between test performance (loss or accuracy) and amplitude of gradient (with trainingor test loss) Each point (ldquordquo or ldquo+rdquo) in the gures corresponds to a single model at a certain epoch Weremove the rst 5 epochs and plot the rest ldquordquo is used for the method with ooding and ldquo+rdquo is used forthe method without ooding e large black ldquordquo and ldquo+rdquo show the epochs with early stopping e colorbecomes lighter (purplerarr yellow) as the training proceeds K-M C10 and C100 stand for Kuzushiji-MNISTCIFAR-10 and CIFAR-100

25

F FlatnessSettings We follow Li et al [2018] and give a one-dimensional visualization of atness for eachdataset in Figure 6 We exclude Fashion-MNIST because the optimal ooding level was zero Weused the results with trainingvalidation split ratio of 08 We compare the atness of the modelright aer the empirical risk with respect to a mini-batch becomes smaller than the ooding levelRm(g) lt b for the rst time (doed blue line) and the model aer training (solid blue line) Wealso compare them with the model trained by the baseline method without ooding and trainingis nished (solid red line)

Results According to Figure 6 the test loss becomes lower and more at during the trainingwith ooding Note that the training loss on the other hand continues to oat around the oodinglevel until the end of training aer it enters the ooding zone We expect that the model makes arandom walk and escapes regions with sharp loss landscapes during the period is may be apossible reason for beer generalization results with our proposed method

26

(a) MNIST (train) (b) MNIST (test) (c) Kuzushiji-MNIST(train)

(d) Kuzushiji-MNIST(test)

(e) CIFAR-10 (train) (f) CIFAR-10 (test) (g) CIFAR-100 (train) (h) CIFAR-100 (test)

(i) SVHN (train) (j) SVHN (test)

Figure 6 One-dimensional visualization of atness We visualize the trainingtest loss with respect toperturbation We depict the results for 3 models the model when the empirical risk with respect to trainingdata is below the ooding level for the rst time during training (doed blue) the model at the end oftraining with ooding (solid blue) and the model at the end of training without ooding (solid red)

27

  • 1 Introduction
  • 2 Backgrounds
    • 21 Regularization Methods
    • 22 Double Descent Curves with Overparametrization
    • 23 Lower-Bounding the Empirical Risk
      • 3 Flooding How to Avoid Zero Training Loss
        • 31 Preliminaries
        • 32 Algorithm
        • 33 Implementation
        • 34 Theoretical Analysis
          • 4 Experiments
            • 41 Synthetic Experiments
            • 42 Benchmark Experiments
            • 43 Memorization
              • 5 Conclusion
              • A Proof of Theorem
              • B Bayes Risk for Gaussian Distributions
              • C Details of Experiments
                • C1 Benchmark Datasets
                • C2 Chosen Flooding Levels
                  • D Learning Curves
                  • E Relationship between Performance and Gradients
                  • F Flatness
Page 17: Do We Need Zero Training Loss A›er Achieving Zero Training … · 2020. 2. 21. · directly aim to avoid zero training loss, they o›en fail to maintain a moderate level of training

Connor Shorten and Taghi M Khoshgoaar A survey on image data augmentation for deeplearning Journal of Big Data 6 2019

Jasper Snoek Hugo Larochelle and Ryan P Adams Practical bayesian optimization of machinelearning algorithms In NeurIPS 2012

Nitish Srivastava Georey Hinton Alex Krizhevsky Ilya Sutskever and Ruslan SalakhutdinovDropout A simple way to prevent neural networks from overing Journal of Machine LearningResearch 151929ndash1958 2014

Masashi Sugiyama Introduction to statistical machine learning Morgan Kaufmann 2015

Christian Szegedy Vincent Vanhoucke Sergey Ioe and Jon Shlens Rethinking the inceptionarchitecture for computer vision In CVPR 2016

Sunil ulasidasan Gopinath Chennupati Je Bilmes Tanmoy Bhaacharya and Sarah MichalakOn mixup training Improved calibration and predictive uncertainty for deep neural networksIn NeurIPS 2019

Robert Tibshirani Regression shrinkage and selection via the lasso Journal of the Royal StatisticalSociety Series B (Methodological) 58267ndash288 1996

A N Tikhonov and V Y Arsenin Solutions of Ill Posed Problems Winston 1977

Andrey Nikolayevich Tikhonov On the stability of inverse problems Doklady Akademii NaukSSSR 39195ndash198 1943

Antonio Torralba Rob Fergus and William T Freeman 80 million tiny images A large data setfor nonparametric object and scene recognition In IEEE Trans PAMI 2008

Brendan van Rooyen and Robert C Williamson A theory of learning with corrupted labelsJournal of Machine Learning Research 181ndash50 2018

Vikas Verma Alex Lamb Juho Kannala Yoshua Bengio and David Lopez-Paz Interpolationconsistency training for semi-supervised learning In IJCAI 2019

Stefan Wager Sida Wang and Percy Liang Dropout training as adaptive regularization In NeurIPS2013

Roman Werpachowski Andrs Gyrgy and Csaba Szepesvri Detecting overing via adversarialexamples In NeurIPS 2019

Han Xiao Kashif Rasul and Roland Vollgraf Fashion-MNIST a novel image dataset for bench-marking machine learning algorithms arXiv preprint arXiv170807747 2017

Chiyuan Zhang Samy Bengio Moritz Hardt Benjamin Recht and Oriol Vinyals Understandingdeep learning requires rethinking generalization In ICLR 2017

17

Hongyi Zhang Moustapha Cisse Yann N Dauphin and David Lopez-Paz mixup Beyondempirical risk minimization In ICLR 2018

Tong Zhang Statistical behavior and consistency of classication methods based on convex riskminimization e Annals of Statistics 3256ndash85 2004

18

A Proof of eoremProof If the ooding level is b then the proposed ooding estimator is

R(g) = |R(g)minus b|+ b (12)

Since the absolute operator can be expressed with a max operator with max(a b) = a+b+|aminusb|2

theproposed estimator can be re-expressed as

R(g) = 2max(R(g) b)minus R(g) = Aminus R(g) (13)

For convenience we used A = 2max(R(g) b) From the denition of MSE

MSE(R(g)) = E[(R(g)minusR(g))2] (14)MSE(R(g)) = E[(R(g)minusR(g))2] (15)

= E[(Aminus R(g)minusR(g))2)] (16)= E[A2]minus 2E[A(R(g) +R(g))] + E[(R(g) +R(g))2] (17)

We are interested in the sign of

MSE(R(g))minusMSE(R(g)) = E[minus4R(g)R(g)minus A2 + 2A(R(g) +R(g))] (18)

Dene the inside of the expectation as B = minus4R(g)R(g)minus A2 + 2A(R(g) + R(g)) B can bedivided into two cases depending on the outcome of the max operator

B =

minus4R(g)R(g)minus 4R(g)2 + 4R(g)(R(g) +R(g)) if R(g) ge b

minus4R(g)R(g)minus 4b2 + 4b(R(g) +R(g)) if R(g) lt b(19)

=

0 if R(g) ge b

minus4R(g)R(g)minus 4b2 + 4b(R(g) +R(g)) if R(g) lt b(20)

=

0 if R(g) ge b

minus4(bminus R(g))(bminusR(g)) if R(g) lt b (21)

e laer case becomes positive when R(g) lt b lt R(g) erefore when R(g) lt b lt R(g)

MSE(R(g))minusMSE(R(g)) gt 0 (22)MSE(R(g)) gt MSE(R(g)) (23)

When b le R(g)

MSE(R(g))minusMSE(R(g)) gt 0 (24)MSE(R(g)) = MSE(R(g)) (25)

19

B Bayes Risk for Gaussian DistributionsIn this section we explain in detail how we derived the Bayes risk with respect to the surrogateloss in the experiments with Gaussian data in Section 41 Since we are using the logistic loss inthe synthetic experiments the loss of the margin is

`(yg(x)) = log(1 + exp(minusyg(x))) (26)

where g(x) Rd rarr R is a scalar instead of the vector denition that was used previously becausethe synthetic experiments only consider binary classication Take the derivative to derive

partE[`(yg(x))]partg(x)

=E[log(1 + exp(minusyg(x)))]

partg(x)= E

[minusy exp(minusyg(x))1 + exp(minusyg(x))

∣∣∣∣∣x]p(x) (27)

= E

[minusy

1 + exp(yg(x))

∣∣∣∣∣x]p(x) (28)

= E

[y + 1

2

1

exp(minusg(x)) + 1+y minus 1

2

minus1exp(g(x)) + 1

∣∣∣∣∣x]p(x) (29)

= E

[y + 1

2

∣∣∣∣∣x]

1

exp(minusg(x)) + 1p(x) + E

[y minus 1

2

∣∣∣∣∣x]

minus1exp(g(x)) + 1

p(x) (30)

= p(y = +1|x) 1

exp(minusg(x)) + 1p(x) + p(y = minus1|x) minus1

exp(g(x)) + 1p(x) (31)

Set this to zero divide by p(x) gt 0 to obtain

p(y = minus1|x) 1

exp(minusg(x)) + 1= p(y = +1|x) 1

exp(g(x)) + 1(32)

exp(g(x)) =p(y = +1|x)p(y = minus1|x)

(33)

g(x) = logp(y = +1|x)p(y = minus1|x)

(34)

Since we are interested in the surrogate loss under this classier we plug this into the logisticloss to obtain the Bayes risk

E[`(yg(x))] = E[log(1 +

p(minusy|x)p(y|x)

)

]= E

[log(

1

p(y|x))

]= E[minus log p(y|x)] (35)

In the experiments in Section 41 we report the empirical version of this with the test dataset asthe Bayes risk

20

C Details of Experiments

C1 Benchmark DatasetsIn the experiments in Section 42 we use six benchmark datasets explained below

bull MNIST6 [Lecun et al 1998] is a 10 class dataset of handwrien digits 1 2 9 and 0 Eachsample is a 28times 28 grayscale image e number of training and test samples are 60000and 10000 respectively

bull Fashion-MNIST7 [Xiao et al 2017] is a 10 class dataset of fashion items T-shirttop TrouserPullover Dress Coat Sandal Shirt Sneaker Bag and Ankle boot Each sample is a 28times 28grayscale image e number of training and test samples are 60000 and 10000 respectively

bull Kuzushiji-MNIST8 [Clanuwat et al 2018] is a 10 class dataset of cursive Japanese (ldquoKuzushijirdquo)characters Each sample is a 28 times 28 grayscale image e number of training and testsamples are 60000 and 10000 respectively

bull CIFAR-109 is a 10 class dataset of various objects airplane automobile bird cat deer dogfrog horse ship and truck Each sample is a colored image in 32times 32times 3 RGB format It isa subset of the 80 million tiny images dataset [Torralba et al 2008] ere are 6000 imagesper class where 5000 are for training and 1000 are for test

bull CIFAR-10010 is a 100 class dataset of various objects Each class has 600 samples where 500samples are for training and 100 samples are for test is is also a subset of the 80 milliontiny images dataset [Torralba et al 2008]

bull SVHN11 [Netzer et al 2011] is a 10 class dataset of house numbers from Google Street Viewimages in 32times 32times 3 RGB format 73257 digits are for training and 26032 digits are fortesting

6httpyannlecuncomexdbmnist7httpsgithubcomzalandoresearchfashion-mnist8httpsgithubcomrois-codhkmnist9httpswwwcstorontoedusimkrizcifarhtml

10httpswwwcstorontoedusimkrizcifarhtml11httpufldlstanfordeduhousenumbers

21

C2 Chosen Flooding LevelsIn Table 4 we report the chosen ooding levels for our experiments with benchmark datasets

Table 4 e chosen ooding levels for benchmark experiments

Early stopping times X times XWeight decay times times X XFlooding X X X X

MNIST (08) 002 002 003 002MNIST (04) 002 003 000 002Fashion-MNIST (08) 000 000 mdash mdashFashion-MNIST (04) 000 000 mdash mdashKuzushiji-MNIST (08) 001 003 003 003Kuzushiji-MNIST (04) 003 002 004 003CIFAR-10 (08) 004 004 000 001CIFAR-10 (04) 003 003 000 000CIFAR-100 (08) 002 002 000 000CIFAR-100 (04) 003 003 000 000SVHN (08) 001 001 000 002SVHN (04) 001 009 003 003

D Learning CurvesIn Figure 4 we visualize learning curves for all datasets (including CIFAR-10 which we alreadyvisualized in Figure 2) We only show the learning curves for trainingvalidation proportion of 08since the results for 04 were similar with 08 Note that Figure 1(c) shows the learning curves forthe rst 80 epochs for CIFAR-10 without ooding Figure 1(d) shows the learning curves withooding when the ooding level is 018

22

(a) MNIST (000) (b) MNIST (001) (c) MNIST (002) (d) MNIST (003)

(e) Fashion-MNIST(000)

(f) Fashion-MNIST(002)

(g) Fashion-MNIST(005)

(h) Fashion-MNIST(010)

(i) KMNIST (000) (j) KMNIST (002) (k) KMNIST (005) (l) KMNIST (010)

(m) CIFAR-10 (000) (n) CIFAR-10 (003) (o) CIFAR-10 (007) (p) CIFAR-10 (020)

(q) CIFAR-100 (000) (r) CIFAR-100 (001) (s) CIFAR-100 (006) (t) CIFAR-100 (020)

(u) SVHN (000) (v) SVHN (001) (w) SVHN (006) (x) SVHN (020)

Figure 4 Learning curves of training and test loss e rst gure in each row is the learning curveswithout ooding e 2nd 3rd and 4th columns show the results with dierent ooding levels e oodinglevel increases towards the right-hand side 23

E Relationship between Performance and GradientsSettings We visualize the relationship between test performance (loss or accuracy) and gradientamplitude of the trainingtest loss in Figure 5 where the gradient amplitude is the `2 norm of thelter-nomalized gradient of the loss e lter-normalized gradient is the gradient appropriatelyscaled depending on the magnitude of the corresponding convolutional lter similarly to Liet al [2018] More specicically for each lter of every comvolutional layer we multiply thecorresponding elements of the gradient by the norm of the lter Note that a fully connected layeris a special case of convolutional layer and subject to this scaling We exclude Fashion-MNISTbecause the optimal ooding level was zero We used the results with trainingvalidation splitratio of 08

Results For the gures with gradient amplitude of training loss on the horizontal axis ldquordquomarks (w ooding) are oen ploed on the right of ldquo+rdquo marks (wo ooding) which impliesthat ooding prevents the model from staying a local minimum For the gures with gradientamplitude of test loss on the horizontal axis we can observe the method with ooding (ldquordquo)improves performance while the gradient amplitude becomes smaller On the other hand theperformance with the method without ooding (ldquo+rdquo) degenerates while the gradient amplitude oftest loss keeps increasing

24

(a) MNIST xtrainyloss

(b) MNIST xtrainyacc

(c) MNIST xtestyloss

(d) MNIST xtest yacc

(e) K-M xtrain yloss (f) K-M xtrain yacc (g) K-M xtest yloss (h) K-M xtest yacc

(i) C10 xtrain yloss (j) C10 xtrain yacc (k) C10 xtest yloss (l) C10 xtest yacc

(m) C100 xtrainyloss

(n) C100 xtrain yacc (o) C100 xtest yloss (p) C100 xtest yacc

(q) SVHN xtrainyloss

(r) SVHN xtrainyacc

(s) SVHN xtest yloss (t) SVHN xtest yacc

Figure 5 Relationship between test performance (loss or accuracy) and amplitude of gradient (with trainingor test loss) Each point (ldquordquo or ldquo+rdquo) in the gures corresponds to a single model at a certain epoch Weremove the rst 5 epochs and plot the rest ldquordquo is used for the method with ooding and ldquo+rdquo is used forthe method without ooding e large black ldquordquo and ldquo+rdquo show the epochs with early stopping e colorbecomes lighter (purplerarr yellow) as the training proceeds K-M C10 and C100 stand for Kuzushiji-MNISTCIFAR-10 and CIFAR-100

25

F FlatnessSettings We follow Li et al [2018] and give a one-dimensional visualization of atness for eachdataset in Figure 6 We exclude Fashion-MNIST because the optimal ooding level was zero Weused the results with trainingvalidation split ratio of 08 We compare the atness of the modelright aer the empirical risk with respect to a mini-batch becomes smaller than the ooding levelRm(g) lt b for the rst time (doed blue line) and the model aer training (solid blue line) Wealso compare them with the model trained by the baseline method without ooding and trainingis nished (solid red line)

Results According to Figure 6 the test loss becomes lower and more at during the trainingwith ooding Note that the training loss on the other hand continues to oat around the oodinglevel until the end of training aer it enters the ooding zone We expect that the model makes arandom walk and escapes regions with sharp loss landscapes during the period is may be apossible reason for beer generalization results with our proposed method

26

(a) MNIST (train) (b) MNIST (test) (c) Kuzushiji-MNIST(train)

(d) Kuzushiji-MNIST(test)

(e) CIFAR-10 (train) (f) CIFAR-10 (test) (g) CIFAR-100 (train) (h) CIFAR-100 (test)

(i) SVHN (train) (j) SVHN (test)

Figure 6 One-dimensional visualization of atness We visualize the trainingtest loss with respect toperturbation We depict the results for 3 models the model when the empirical risk with respect to trainingdata is below the ooding level for the rst time during training (doed blue) the model at the end oftraining with ooding (solid blue) and the model at the end of training without ooding (solid red)

27

  • 1 Introduction
  • 2 Backgrounds
    • 21 Regularization Methods
    • 22 Double Descent Curves with Overparametrization
    • 23 Lower-Bounding the Empirical Risk
      • 3 Flooding How to Avoid Zero Training Loss
        • 31 Preliminaries
        • 32 Algorithm
        • 33 Implementation
        • 34 Theoretical Analysis
          • 4 Experiments
            • 41 Synthetic Experiments
            • 42 Benchmark Experiments
            • 43 Memorization
              • 5 Conclusion
              • A Proof of Theorem
              • B Bayes Risk for Gaussian Distributions
              • C Details of Experiments
                • C1 Benchmark Datasets
                • C2 Chosen Flooding Levels
                  • D Learning Curves
                  • E Relationship between Performance and Gradients
                  • F Flatness
Page 18: Do We Need Zero Training Loss A›er Achieving Zero Training … · 2020. 2. 21. · directly aim to avoid zero training loss, they o›en fail to maintain a moderate level of training

Hongyi Zhang Moustapha Cisse Yann N Dauphin and David Lopez-Paz mixup Beyondempirical risk minimization In ICLR 2018

Tong Zhang Statistical behavior and consistency of classication methods based on convex riskminimization e Annals of Statistics 3256ndash85 2004

18

A Proof of eoremProof If the ooding level is b then the proposed ooding estimator is

R(g) = |R(g)minus b|+ b (12)

Since the absolute operator can be expressed with a max operator with max(a b) = a+b+|aminusb|2

theproposed estimator can be re-expressed as

R(g) = 2max(R(g) b)minus R(g) = Aminus R(g) (13)

For convenience we used A = 2max(R(g) b) From the denition of MSE

MSE(R(g)) = E[(R(g)minusR(g))2] (14)MSE(R(g)) = E[(R(g)minusR(g))2] (15)

= E[(Aminus R(g)minusR(g))2)] (16)= E[A2]minus 2E[A(R(g) +R(g))] + E[(R(g) +R(g))2] (17)

We are interested in the sign of

MSE(R(g))minusMSE(R(g)) = E[minus4R(g)R(g)minus A2 + 2A(R(g) +R(g))] (18)

Dene the inside of the expectation as B = minus4R(g)R(g)minus A2 + 2A(R(g) + R(g)) B can bedivided into two cases depending on the outcome of the max operator

B =

minus4R(g)R(g)minus 4R(g)2 + 4R(g)(R(g) +R(g)) if R(g) ge b

minus4R(g)R(g)minus 4b2 + 4b(R(g) +R(g)) if R(g) lt b(19)

=

0 if R(g) ge b

minus4R(g)R(g)minus 4b2 + 4b(R(g) +R(g)) if R(g) lt b(20)

=

0 if R(g) ge b

minus4(bminus R(g))(bminusR(g)) if R(g) lt b (21)

e laer case becomes positive when R(g) lt b lt R(g) erefore when R(g) lt b lt R(g)

MSE(R(g))minusMSE(R(g)) gt 0 (22)MSE(R(g)) gt MSE(R(g)) (23)

When b le R(g)

MSE(R(g))minusMSE(R(g)) gt 0 (24)MSE(R(g)) = MSE(R(g)) (25)

19

B Bayes Risk for Gaussian DistributionsIn this section we explain in detail how we derived the Bayes risk with respect to the surrogateloss in the experiments with Gaussian data in Section 41 Since we are using the logistic loss inthe synthetic experiments the loss of the margin is

`(yg(x)) = log(1 + exp(minusyg(x))) (26)

where g(x) Rd rarr R is a scalar instead of the vector denition that was used previously becausethe synthetic experiments only consider binary classication Take the derivative to derive

partE[`(yg(x))]partg(x)

=E[log(1 + exp(minusyg(x)))]

partg(x)= E

[minusy exp(minusyg(x))1 + exp(minusyg(x))

∣∣∣∣∣x]p(x) (27)

= E

[minusy

1 + exp(yg(x))

∣∣∣∣∣x]p(x) (28)

= E

[y + 1

2

1

exp(minusg(x)) + 1+y minus 1

2

minus1exp(g(x)) + 1

∣∣∣∣∣x]p(x) (29)

= E

[y + 1

2

∣∣∣∣∣x]

1

exp(minusg(x)) + 1p(x) + E

[y minus 1

2

∣∣∣∣∣x]

minus1exp(g(x)) + 1

p(x) (30)

= p(y = +1|x) 1

exp(minusg(x)) + 1p(x) + p(y = minus1|x) minus1

exp(g(x)) + 1p(x) (31)

Set this to zero divide by p(x) gt 0 to obtain

p(y = minus1|x) 1

exp(minusg(x)) + 1= p(y = +1|x) 1

exp(g(x)) + 1(32)

exp(g(x)) =p(y = +1|x)p(y = minus1|x)

(33)

g(x) = logp(y = +1|x)p(y = minus1|x)

(34)

Since we are interested in the surrogate loss under this classier we plug this into the logisticloss to obtain the Bayes risk

E[`(yg(x))] = E[log(1 +

p(minusy|x)p(y|x)

)

]= E

[log(

1

p(y|x))

]= E[minus log p(y|x)] (35)

In the experiments in Section 41 we report the empirical version of this with the test dataset asthe Bayes risk

20

C Details of Experiments

C1 Benchmark DatasetsIn the experiments in Section 42 we use six benchmark datasets explained below

• MNIST [Lecun et al., 1998] (http://yann.lecun.com/exdb/mnist/) is a 10-class dataset of handwritten digits: 1, 2, ..., 9, and 0. Each sample is a 28 × 28 grayscale image. The numbers of training and test samples are 60,000 and 10,000, respectively.

• Fashion-MNIST [Xiao et al., 2017] (https://github.com/zalandoresearch/fashion-mnist) is a 10-class dataset of fashion items: T-shirt/top, Trouser, Pullover, Dress, Coat, Sandal, Shirt, Sneaker, Bag, and Ankle boot. Each sample is a 28 × 28 grayscale image. The numbers of training and test samples are 60,000 and 10,000, respectively.

• Kuzushiji-MNIST [Clanuwat et al., 2018] (https://github.com/rois-codh/kmnist) is a 10-class dataset of cursive Japanese ("Kuzushiji") characters. Each sample is a 28 × 28 grayscale image. The numbers of training and test samples are 60,000 and 10,000, respectively.

• CIFAR-10 (https://www.cs.toronto.edu/~kriz/cifar.html) is a 10-class dataset of various objects: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck. Each sample is a colored image in 32 × 32 × 3 RGB format. It is a subset of the 80 million tiny images dataset [Torralba et al., 2008]. There are 6,000 images per class, where 5,000 are for training and 1,000 are for test.

• CIFAR-100 (https://www.cs.toronto.edu/~kriz/cifar.html) is a 100-class dataset of various objects. Each class has 600 samples, where 500 samples are for training and 100 samples are for test. This is also a subset of the 80 million tiny images dataset [Torralba et al., 2008].

• SVHN [Netzer et al., 2011] (http://ufldl.stanford.edu/housenumbers/) is a 10-class dataset of house numbers from Google Street View images, in 32 × 32 × 3 RGB format. 73,257 digits are for training and 26,032 digits are for testing.
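For reference, all six benchmarks are also available through torchvision; a minimal loading sketch (the transform and the root directory are illustrative choices, not the preprocessing used in our experiments) is:

```python
from torchvision import datasets, transforms

to_tensor = transforms.ToTensor()  # illustrative transform only
root = "./data"                    # hypothetical download directory

train_sets = {
    "MNIST":           datasets.MNIST(root, train=True, download=True, transform=to_tensor),
    "Fashion-MNIST":   datasets.FashionMNIST(root, train=True, download=True, transform=to_tensor),
    "Kuzushiji-MNIST": datasets.KMNIST(root, train=True, download=True, transform=to_tensor),
    "CIFAR-10":        datasets.CIFAR10(root, train=True, download=True, transform=to_tensor),
    "CIFAR-100":       datasets.CIFAR100(root, train=True, download=True, transform=to_tensor),
    "SVHN":            datasets.SVHN(root, split="train", download=True, transform=to_tensor),
}
for name, dataset in train_sets.items():
    print(name, len(dataset))
```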


C.2 Chosen Flooding Levels

In Table 4, we report the chosen flooding levels for our experiments with benchmark datasets.

Table 4: The chosen flooding levels for benchmark experiments.

                          Early stopping:   ×      ✓      ×      ✓
                          Weight decay:     ×      ×      ✓      ✓
                          Flooding:         ✓      ✓      ✓      ✓

  MNIST (0.8)                             0.02   0.02   0.03   0.02
  MNIST (0.4)                             0.02   0.03   0.00   0.02
  Fashion-MNIST (0.8)                     0.00   0.00    —      —
  Fashion-MNIST (0.4)                     0.00   0.00    —      —
  Kuzushiji-MNIST (0.8)                   0.01   0.03   0.03   0.03
  Kuzushiji-MNIST (0.4)                   0.03   0.02   0.04   0.03
  CIFAR-10 (0.8)                          0.04   0.04   0.00   0.01
  CIFAR-10 (0.4)                          0.03   0.03   0.00   0.00
  CIFAR-100 (0.8)                         0.02   0.02   0.00   0.00
  CIFAR-100 (0.4)                         0.03   0.03   0.00   0.00
  SVHN (0.8)                              0.01   0.01   0.00   0.02
  SVHN (0.4)                              0.01   0.09   0.03   0.03
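A chosen flooding level b enters training only through the loss computation, following the flooded empirical risk analyzed in Appendix A. Below is a minimal PyTorch-style sketch of this step; the model, optimizer, and data loader in the commented loop are placeholders:

```python
import torch.nn.functional as F

def flooded_cross_entropy(logits, targets, b):
    """Cross-entropy with flooding: gradient descent as usual while the
    mini-batch loss is above the flooding level b, gradient ascent below it."""
    loss = F.cross_entropy(logits, targets)
    return (loss - b).abs() + b

# inside a standard training loop (model, optimizer, loader are placeholders):
# for x, y in loader:
#     optimizer.zero_grad()
#     loss = flooded_cross_entropy(model(x), y, b=0.02)  # b chosen as in Table 4
#     loss.backward()
#     optimizer.step()
```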

D Learning Curves

In Figure 4, we visualize learning curves for all datasets (including CIFAR-10, which we already visualized in Figure 2). We only show the learning curves for the training/validation proportion of 0.8, since the results for 0.4 were similar to those for 0.8. Note that Figure 1(c) shows the learning curves for the first 80 epochs for CIFAR-10 without flooding, and Figure 1(d) shows the learning curves with flooding when the flooding level is 0.18.


[Figure 4 panels, by row and flooding level: (a)–(d) MNIST (0.00, 0.01, 0.02, 0.03); (e)–(h) Fashion-MNIST (0.00, 0.02, 0.05, 0.10); (i)–(l) Kuzushiji-MNIST (0.00, 0.02, 0.05, 0.10); (m)–(p) CIFAR-10 (0.00, 0.03, 0.07, 0.20); (q)–(t) CIFAR-100 (0.00, 0.01, 0.06, 0.20); (u)–(x) SVHN (0.00, 0.01, 0.06, 0.20).]

Figure 4: Learning curves of training and test loss. The first figure in each row shows the learning curves without flooding. The 2nd, 3rd, and 4th columns show the results with different flooding levels; the flooding level increases towards the right-hand side.

E Relationship between Performance and Gradients

Settings. We visualize the relationship between test performance (loss or accuracy) and the gradient amplitude of the training/test loss in Figure 5, where the gradient amplitude is the ℓ2 norm of the filter-normalized gradient of the loss. The filter-normalized gradient is the gradient appropriately scaled depending on the magnitude of the corresponding convolutional filter, similarly to Li et al. [2018]. More specifically, for each filter of every convolutional layer, we multiply the corresponding elements of the gradient by the norm of the filter. Note that a fully connected layer is a special case of a convolutional layer and is subject to this scaling. We exclude Fashion-MNIST because the optimal flooding level was zero. We used the results with the training/validation split ratio of 0.8.
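A minimal PyTorch sketch of this scaling, assuming the gradients have already been populated by a backward pass and treating each output unit of a convolutional or fully connected layer as one filter (the function and variable names are ours, for illustration):

```python
import torch
import torch.nn as nn

def filter_normalized_grad_norm(model: nn.Module) -> float:
    """L2 norm of the filter-normalized gradient: the gradient of each filter
    is multiplied by the norm of that filter before taking the overall norm."""
    total = 0.0
    for module in model.modules():
        if isinstance(module, (nn.Conv2d, nn.Linear)) and module.weight.grad is not None:
            w = module.weight.detach().flatten(1)      # one row per filter / output unit
            g = module.weight.grad.detach().flatten(1)
            scaled = g * w.norm(dim=1, keepdim=True)   # scale gradient by the filter norm
            total += scaled.pow(2).sum().item()
    return total ** 0.5
```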

Results. For the figures with the gradient amplitude of the training loss on the horizontal axis, the marks for the method with flooding are often plotted to the right of the "+" marks for the method without flooding, which implies that flooding prevents the model from staying at a local minimum. For the figures with the gradient amplitude of the test loss on the horizontal axis, we can observe that the method with flooding improves performance while the gradient amplitude becomes smaller. On the other hand, the performance of the method without flooding ("+") degrades while the gradient amplitude of the test loss keeps increasing.


[Figure 5 panels: one row per dataset (MNIST, Kuzushiji-MNIST (K-M), CIFAR-10 (C10), CIFAR-100 (C100), SVHN); within each row, the horizontal axis is the gradient amplitude of the training loss or of the test loss, and the vertical axis is the test loss or the test accuracy.]

Figure 5: Relationship between test performance (loss or accuracy) and amplitude of gradient (with training or test loss). Each point in the figures corresponds to a single model at a certain epoch. We remove the first 5 epochs and plot the rest. One marker type is used for the method with flooding and "+" for the method without flooding. The large black markers show the epochs with early stopping. The color becomes lighter (purple → yellow) as the training proceeds. K-M, C10, and C100 stand for Kuzushiji-MNIST, CIFAR-10, and CIFAR-100.


F Flatness

Settings. We follow Li et al. [2018] and give a one-dimensional visualization of flatness for each dataset in Figure 6. We exclude Fashion-MNIST because the optimal flooding level was zero. We used the results with the training/validation split ratio of 0.8. We compare the flatness of the model right after the empirical risk with respect to a mini-batch becomes smaller than the flooding level, \widehat{R}_m(g) < b, for the first time (dotted blue line), and the model after training (solid blue line). We also compare them with the model trained by the baseline method without flooding, after training is finished (solid red line).
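A minimal sketch of such a one-dimensional visualization, perturbing the parameters along a single filter-normalized random direction in the spirit of Li et al. [2018] (this is our own illustration; the loss function is assumed to use mean reduction and the loader to yield (input, target) batches):

```python
import torch

@torch.no_grad()
def loss_along_random_direction(model, loss_fn, loader, alphas, device="cpu"):
    """Average loss at parameters w + alpha * d for a random direction d,
    rescaled per filter to match the corresponding parameter norms."""
    params = list(model.parameters())
    base = [p.detach().clone() for p in params]
    direction = [torch.randn_like(p) for p in params]
    for d, w in zip(direction, base):
        if d.dim() > 1:  # rescale each filter / output unit of weight tensors
            shape = (-1,) + (1,) * (d.dim() - 1)
            d.mul_(w.flatten(1).norm(dim=1).view(shape)
                   / (d.flatten(1).norm(dim=1).view(shape) + 1e-10))
        else:            # leave biases and other 1-D parameters unperturbed
            d.zero_()
    curve = []
    for alpha in alphas:
        for p, w, d in zip(params, base, direction):
            p.copy_(w + alpha * d)
        total, count = 0.0, 0
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            total += loss_fn(model(x), y).item() * y.size(0)
            count += y.size(0)
        curve.append(total / count)
    for p, w in zip(params, base):  # restore the original parameters
        p.copy_(w)
    return curve

# usage sketch:
# curve = loss_along_random_direction(model, torch.nn.functional.cross_entropy,
#                                     test_loader, alphas=torch.linspace(-1.0, 1.0, 21))
```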

Results. According to Figure 6, the test loss becomes lower and flatter during training with flooding. Note that the training loss, on the other hand, continues to float around the flooding level until the end of training after it enters the flooding zone. We expect that the model makes a random walk and escapes regions with sharp loss landscapes during this period. This may be a possible reason for the better generalization results with our proposed method.


[Figure 6 panels: (a) MNIST (train), (b) MNIST (test), (c) Kuzushiji-MNIST (train), (d) Kuzushiji-MNIST (test), (e) CIFAR-10 (train), (f) CIFAR-10 (test), (g) CIFAR-100 (train), (h) CIFAR-100 (test), (i) SVHN (train), (j) SVHN (test).]

Figure 6: One-dimensional visualization of flatness. We visualize the training/test loss with respect to perturbation. We depict the results for 3 models: the model when the empirical risk with respect to the training data is below the flooding level for the first time during training (dotted blue), the model at the end of training with flooding (solid blue), and the model at the end of training without flooding (solid red).

27

  • 1 Introduction
  • 2 Backgrounds
    • 21 Regularization Methods
    • 22 Double Descent Curves with Overparametrization
    • 23 Lower-Bounding the Empirical Risk
      • 3 Flooding How to Avoid Zero Training Loss
        • 31 Preliminaries
        • 32 Algorithm
        • 33 Implementation
        • 34 Theoretical Analysis
          • 4 Experiments
            • 41 Synthetic Experiments
            • 42 Benchmark Experiments
            • 43 Memorization
              • 5 Conclusion
              • A Proof of Theorem
              • B Bayes Risk for Gaussian Distributions
              • C Details of Experiments
                • C1 Benchmark Datasets
                • C2 Chosen Flooding Levels
                  • D Learning Curves
                  • E Relationship between Performance and Gradients
                  • F Flatness
Page 19: Do We Need Zero Training Loss A›er Achieving Zero Training … · 2020. 2. 21. · directly aim to avoid zero training loss, they o›en fail to maintain a moderate level of training

A Proof of eoremProof If the ooding level is b then the proposed ooding estimator is

R(g) = |R(g)minus b|+ b (12)

Since the absolute operator can be expressed with a max operator with max(a b) = a+b+|aminusb|2

theproposed estimator can be re-expressed as

R(g) = 2max(R(g) b)minus R(g) = Aminus R(g) (13)

For convenience we used A = 2max(R(g) b) From the denition of MSE

MSE(R(g)) = E[(R(g)minusR(g))2] (14)MSE(R(g)) = E[(R(g)minusR(g))2] (15)

= E[(Aminus R(g)minusR(g))2)] (16)= E[A2]minus 2E[A(R(g) +R(g))] + E[(R(g) +R(g))2] (17)

We are interested in the sign of

MSE(R(g))minusMSE(R(g)) = E[minus4R(g)R(g)minus A2 + 2A(R(g) +R(g))] (18)

Dene the inside of the expectation as B = minus4R(g)R(g)minus A2 + 2A(R(g) + R(g)) B can bedivided into two cases depending on the outcome of the max operator

B =

minus4R(g)R(g)minus 4R(g)2 + 4R(g)(R(g) +R(g)) if R(g) ge b

minus4R(g)R(g)minus 4b2 + 4b(R(g) +R(g)) if R(g) lt b(19)

=

0 if R(g) ge b

minus4R(g)R(g)minus 4b2 + 4b(R(g) +R(g)) if R(g) lt b(20)

=

0 if R(g) ge b

minus4(bminus R(g))(bminusR(g)) if R(g) lt b (21)

e laer case becomes positive when R(g) lt b lt R(g) erefore when R(g) lt b lt R(g)

MSE(R(g))minusMSE(R(g)) gt 0 (22)MSE(R(g)) gt MSE(R(g)) (23)

When b le R(g)

MSE(R(g))minusMSE(R(g)) gt 0 (24)MSE(R(g)) = MSE(R(g)) (25)

19

B Bayes Risk for Gaussian DistributionsIn this section we explain in detail how we derived the Bayes risk with respect to the surrogateloss in the experiments with Gaussian data in Section 41 Since we are using the logistic loss inthe synthetic experiments the loss of the margin is

`(yg(x)) = log(1 + exp(minusyg(x))) (26)

where g(x) Rd rarr R is a scalar instead of the vector denition that was used previously becausethe synthetic experiments only consider binary classication Take the derivative to derive

partE[`(yg(x))]partg(x)

=E[log(1 + exp(minusyg(x)))]

partg(x)= E

[minusy exp(minusyg(x))1 + exp(minusyg(x))

∣∣∣∣∣x]p(x) (27)

= E

[minusy

1 + exp(yg(x))

∣∣∣∣∣x]p(x) (28)

= E

[y + 1

2

1

exp(minusg(x)) + 1+y minus 1

2

minus1exp(g(x)) + 1

∣∣∣∣∣x]p(x) (29)

= E

[y + 1

2

∣∣∣∣∣x]

1

exp(minusg(x)) + 1p(x) + E

[y minus 1

2

∣∣∣∣∣x]

minus1exp(g(x)) + 1

p(x) (30)

= p(y = +1|x) 1

exp(minusg(x)) + 1p(x) + p(y = minus1|x) minus1

exp(g(x)) + 1p(x) (31)

Set this to zero divide by p(x) gt 0 to obtain

p(y = minus1|x) 1

exp(minusg(x)) + 1= p(y = +1|x) 1

exp(g(x)) + 1(32)

exp(g(x)) =p(y = +1|x)p(y = minus1|x)

(33)

g(x) = logp(y = +1|x)p(y = minus1|x)

(34)

Since we are interested in the surrogate loss under this classier we plug this into the logisticloss to obtain the Bayes risk

E[`(yg(x))] = E[log(1 +

p(minusy|x)p(y|x)

)

]= E

[log(

1

p(y|x))

]= E[minus log p(y|x)] (35)

In the experiments in Section 41 we report the empirical version of this with the test dataset asthe Bayes risk

20

C Details of Experiments

C1 Benchmark DatasetsIn the experiments in Section 42 we use six benchmark datasets explained below

bull MNIST6 [Lecun et al 1998] is a 10 class dataset of handwrien digits 1 2 9 and 0 Eachsample is a 28times 28 grayscale image e number of training and test samples are 60000and 10000 respectively

bull Fashion-MNIST7 [Xiao et al 2017] is a 10 class dataset of fashion items T-shirttop TrouserPullover Dress Coat Sandal Shirt Sneaker Bag and Ankle boot Each sample is a 28times 28grayscale image e number of training and test samples are 60000 and 10000 respectively

bull Kuzushiji-MNIST8 [Clanuwat et al 2018] is a 10 class dataset of cursive Japanese (ldquoKuzushijirdquo)characters Each sample is a 28 times 28 grayscale image e number of training and testsamples are 60000 and 10000 respectively

bull CIFAR-109 is a 10 class dataset of various objects airplane automobile bird cat deer dogfrog horse ship and truck Each sample is a colored image in 32times 32times 3 RGB format It isa subset of the 80 million tiny images dataset [Torralba et al 2008] ere are 6000 imagesper class where 5000 are for training and 1000 are for test

bull CIFAR-10010 is a 100 class dataset of various objects Each class has 600 samples where 500samples are for training and 100 samples are for test is is also a subset of the 80 milliontiny images dataset [Torralba et al 2008]

bull SVHN11 [Netzer et al 2011] is a 10 class dataset of house numbers from Google Street Viewimages in 32times 32times 3 RGB format 73257 digits are for training and 26032 digits are fortesting

6httpyannlecuncomexdbmnist7httpsgithubcomzalandoresearchfashion-mnist8httpsgithubcomrois-codhkmnist9httpswwwcstorontoedusimkrizcifarhtml

10httpswwwcstorontoedusimkrizcifarhtml11httpufldlstanfordeduhousenumbers

21

C2 Chosen Flooding LevelsIn Table 4 we report the chosen ooding levels for our experiments with benchmark datasets

Table 4 e chosen ooding levels for benchmark experiments

Early stopping times X times XWeight decay times times X XFlooding X X X X

MNIST (08) 002 002 003 002MNIST (04) 002 003 000 002Fashion-MNIST (08) 000 000 mdash mdashFashion-MNIST (04) 000 000 mdash mdashKuzushiji-MNIST (08) 001 003 003 003Kuzushiji-MNIST (04) 003 002 004 003CIFAR-10 (08) 004 004 000 001CIFAR-10 (04) 003 003 000 000CIFAR-100 (08) 002 002 000 000CIFAR-100 (04) 003 003 000 000SVHN (08) 001 001 000 002SVHN (04) 001 009 003 003

D Learning CurvesIn Figure 4 we visualize learning curves for all datasets (including CIFAR-10 which we alreadyvisualized in Figure 2) We only show the learning curves for trainingvalidation proportion of 08since the results for 04 were similar with 08 Note that Figure 1(c) shows the learning curves forthe rst 80 epochs for CIFAR-10 without ooding Figure 1(d) shows the learning curves withooding when the ooding level is 018

22

(a) MNIST (000) (b) MNIST (001) (c) MNIST (002) (d) MNIST (003)

(e) Fashion-MNIST(000)

(f) Fashion-MNIST(002)

(g) Fashion-MNIST(005)

(h) Fashion-MNIST(010)

(i) KMNIST (000) (j) KMNIST (002) (k) KMNIST (005) (l) KMNIST (010)

(m) CIFAR-10 (000) (n) CIFAR-10 (003) (o) CIFAR-10 (007) (p) CIFAR-10 (020)

(q) CIFAR-100 (000) (r) CIFAR-100 (001) (s) CIFAR-100 (006) (t) CIFAR-100 (020)

(u) SVHN (000) (v) SVHN (001) (w) SVHN (006) (x) SVHN (020)

Figure 4 Learning curves of training and test loss e rst gure in each row is the learning curveswithout ooding e 2nd 3rd and 4th columns show the results with dierent ooding levels e oodinglevel increases towards the right-hand side 23

E Relationship between Performance and GradientsSettings We visualize the relationship between test performance (loss or accuracy) and gradientamplitude of the trainingtest loss in Figure 5 where the gradient amplitude is the `2 norm of thelter-nomalized gradient of the loss e lter-normalized gradient is the gradient appropriatelyscaled depending on the magnitude of the corresponding convolutional lter similarly to Liet al [2018] More specicically for each lter of every comvolutional layer we multiply thecorresponding elements of the gradient by the norm of the lter Note that a fully connected layeris a special case of convolutional layer and subject to this scaling We exclude Fashion-MNISTbecause the optimal ooding level was zero We used the results with trainingvalidation splitratio of 08

Results For the gures with gradient amplitude of training loss on the horizontal axis ldquordquomarks (w ooding) are oen ploed on the right of ldquo+rdquo marks (wo ooding) which impliesthat ooding prevents the model from staying a local minimum For the gures with gradientamplitude of test loss on the horizontal axis we can observe the method with ooding (ldquordquo)improves performance while the gradient amplitude becomes smaller On the other hand theperformance with the method without ooding (ldquo+rdquo) degenerates while the gradient amplitude oftest loss keeps increasing

24

(a) MNIST xtrainyloss

(b) MNIST xtrainyacc

(c) MNIST xtestyloss

(d) MNIST xtest yacc

(e) K-M xtrain yloss (f) K-M xtrain yacc (g) K-M xtest yloss (h) K-M xtest yacc

(i) C10 xtrain yloss (j) C10 xtrain yacc (k) C10 xtest yloss (l) C10 xtest yacc

(m) C100 xtrainyloss

(n) C100 xtrain yacc (o) C100 xtest yloss (p) C100 xtest yacc

(q) SVHN xtrainyloss

(r) SVHN xtrainyacc

(s) SVHN xtest yloss (t) SVHN xtest yacc

Figure 5 Relationship between test performance (loss or accuracy) and amplitude of gradient (with trainingor test loss) Each point (ldquordquo or ldquo+rdquo) in the gures corresponds to a single model at a certain epoch Weremove the rst 5 epochs and plot the rest ldquordquo is used for the method with ooding and ldquo+rdquo is used forthe method without ooding e large black ldquordquo and ldquo+rdquo show the epochs with early stopping e colorbecomes lighter (purplerarr yellow) as the training proceeds K-M C10 and C100 stand for Kuzushiji-MNISTCIFAR-10 and CIFAR-100

25

F FlatnessSettings We follow Li et al [2018] and give a one-dimensional visualization of atness for eachdataset in Figure 6 We exclude Fashion-MNIST because the optimal ooding level was zero Weused the results with trainingvalidation split ratio of 08 We compare the atness of the modelright aer the empirical risk with respect to a mini-batch becomes smaller than the ooding levelRm(g) lt b for the rst time (doed blue line) and the model aer training (solid blue line) Wealso compare them with the model trained by the baseline method without ooding and trainingis nished (solid red line)

Results According to Figure 6 the test loss becomes lower and more at during the trainingwith ooding Note that the training loss on the other hand continues to oat around the oodinglevel until the end of training aer it enters the ooding zone We expect that the model makes arandom walk and escapes regions with sharp loss landscapes during the period is may be apossible reason for beer generalization results with our proposed method

26

(a) MNIST (train) (b) MNIST (test) (c) Kuzushiji-MNIST(train)

(d) Kuzushiji-MNIST(test)

(e) CIFAR-10 (train) (f) CIFAR-10 (test) (g) CIFAR-100 (train) (h) CIFAR-100 (test)

(i) SVHN (train) (j) SVHN (test)

Figure 6 One-dimensional visualization of atness We visualize the trainingtest loss with respect toperturbation We depict the results for 3 models the model when the empirical risk with respect to trainingdata is below the ooding level for the rst time during training (doed blue) the model at the end oftraining with ooding (solid blue) and the model at the end of training without ooding (solid red)

27

  • 1 Introduction
  • 2 Backgrounds
    • 21 Regularization Methods
    • 22 Double Descent Curves with Overparametrization
    • 23 Lower-Bounding the Empirical Risk
      • 3 Flooding How to Avoid Zero Training Loss
        • 31 Preliminaries
        • 32 Algorithm
        • 33 Implementation
        • 34 Theoretical Analysis
          • 4 Experiments
            • 41 Synthetic Experiments
            • 42 Benchmark Experiments
            • 43 Memorization
              • 5 Conclusion
              • A Proof of Theorem
              • B Bayes Risk for Gaussian Distributions
              • C Details of Experiments
                • C1 Benchmark Datasets
                • C2 Chosen Flooding Levels
                  • D Learning Curves
                  • E Relationship between Performance and Gradients
                  • F Flatness
Page 20: Do We Need Zero Training Loss A›er Achieving Zero Training … · 2020. 2. 21. · directly aim to avoid zero training loss, they o›en fail to maintain a moderate level of training

B Bayes Risk for Gaussian DistributionsIn this section we explain in detail how we derived the Bayes risk with respect to the surrogateloss in the experiments with Gaussian data in Section 41 Since we are using the logistic loss inthe synthetic experiments the loss of the margin is

`(yg(x)) = log(1 + exp(minusyg(x))) (26)

where g(x) Rd rarr R is a scalar instead of the vector denition that was used previously becausethe synthetic experiments only consider binary classication Take the derivative to derive

partE[`(yg(x))]partg(x)

=E[log(1 + exp(minusyg(x)))]

partg(x)= E

[minusy exp(minusyg(x))1 + exp(minusyg(x))

∣∣∣∣∣x]p(x) (27)

= E

[minusy

1 + exp(yg(x))

∣∣∣∣∣x]p(x) (28)

= E

[y + 1

2

1

exp(minusg(x)) + 1+y minus 1

2

minus1exp(g(x)) + 1

∣∣∣∣∣x]p(x) (29)

= E

[y + 1

2

∣∣∣∣∣x]

1

exp(minusg(x)) + 1p(x) + E

[y minus 1

2

∣∣∣∣∣x]

minus1exp(g(x)) + 1

p(x) (30)

= p(y = +1|x) 1

exp(minusg(x)) + 1p(x) + p(y = minus1|x) minus1

exp(g(x)) + 1p(x) (31)

Set this to zero divide by p(x) gt 0 to obtain

p(y = minus1|x) 1

exp(minusg(x)) + 1= p(y = +1|x) 1

exp(g(x)) + 1(32)

exp(g(x)) =p(y = +1|x)p(y = minus1|x)

(33)

g(x) = logp(y = +1|x)p(y = minus1|x)

(34)

Since we are interested in the surrogate loss under this classier we plug this into the logisticloss to obtain the Bayes risk

E[`(yg(x))] = E[log(1 +

p(minusy|x)p(y|x)

)

]= E

[log(

1

p(y|x))

]= E[minus log p(y|x)] (35)

In the experiments in Section 41 we report the empirical version of this with the test dataset asthe Bayes risk

20

C Details of Experiments

C1 Benchmark DatasetsIn the experiments in Section 42 we use six benchmark datasets explained below

bull MNIST6 [Lecun et al 1998] is a 10 class dataset of handwrien digits 1 2 9 and 0 Eachsample is a 28times 28 grayscale image e number of training and test samples are 60000and 10000 respectively

bull Fashion-MNIST7 [Xiao et al 2017] is a 10 class dataset of fashion items T-shirttop TrouserPullover Dress Coat Sandal Shirt Sneaker Bag and Ankle boot Each sample is a 28times 28grayscale image e number of training and test samples are 60000 and 10000 respectively

bull Kuzushiji-MNIST8 [Clanuwat et al 2018] is a 10 class dataset of cursive Japanese (ldquoKuzushijirdquo)characters Each sample is a 28 times 28 grayscale image e number of training and testsamples are 60000 and 10000 respectively

bull CIFAR-109 is a 10 class dataset of various objects airplane automobile bird cat deer dogfrog horse ship and truck Each sample is a colored image in 32times 32times 3 RGB format It isa subset of the 80 million tiny images dataset [Torralba et al 2008] ere are 6000 imagesper class where 5000 are for training and 1000 are for test

bull CIFAR-10010 is a 100 class dataset of various objects Each class has 600 samples where 500samples are for training and 100 samples are for test is is also a subset of the 80 milliontiny images dataset [Torralba et al 2008]

bull SVHN11 [Netzer et al 2011] is a 10 class dataset of house numbers from Google Street Viewimages in 32times 32times 3 RGB format 73257 digits are for training and 26032 digits are fortesting

6httpyannlecuncomexdbmnist7httpsgithubcomzalandoresearchfashion-mnist8httpsgithubcomrois-codhkmnist9httpswwwcstorontoedusimkrizcifarhtml

10httpswwwcstorontoedusimkrizcifarhtml11httpufldlstanfordeduhousenumbers

21

C2 Chosen Flooding LevelsIn Table 4 we report the chosen ooding levels for our experiments with benchmark datasets

Table 4 e chosen ooding levels for benchmark experiments

Early stopping times X times XWeight decay times times X XFlooding X X X X

MNIST (08) 002 002 003 002MNIST (04) 002 003 000 002Fashion-MNIST (08) 000 000 mdash mdashFashion-MNIST (04) 000 000 mdash mdashKuzushiji-MNIST (08) 001 003 003 003Kuzushiji-MNIST (04) 003 002 004 003CIFAR-10 (08) 004 004 000 001CIFAR-10 (04) 003 003 000 000CIFAR-100 (08) 002 002 000 000CIFAR-100 (04) 003 003 000 000SVHN (08) 001 001 000 002SVHN (04) 001 009 003 003

D Learning CurvesIn Figure 4 we visualize learning curves for all datasets (including CIFAR-10 which we alreadyvisualized in Figure 2) We only show the learning curves for trainingvalidation proportion of 08since the results for 04 were similar with 08 Note that Figure 1(c) shows the learning curves forthe rst 80 epochs for CIFAR-10 without ooding Figure 1(d) shows the learning curves withooding when the ooding level is 018

22

(a) MNIST (000) (b) MNIST (001) (c) MNIST (002) (d) MNIST (003)

(e) Fashion-MNIST(000)

(f) Fashion-MNIST(002)

(g) Fashion-MNIST(005)

(h) Fashion-MNIST(010)

(i) KMNIST (000) (j) KMNIST (002) (k) KMNIST (005) (l) KMNIST (010)

(m) CIFAR-10 (000) (n) CIFAR-10 (003) (o) CIFAR-10 (007) (p) CIFAR-10 (020)

(q) CIFAR-100 (000) (r) CIFAR-100 (001) (s) CIFAR-100 (006) (t) CIFAR-100 (020)

(u) SVHN (000) (v) SVHN (001) (w) SVHN (006) (x) SVHN (020)

Figure 4 Learning curves of training and test loss e rst gure in each row is the learning curveswithout ooding e 2nd 3rd and 4th columns show the results with dierent ooding levels e oodinglevel increases towards the right-hand side 23

E Relationship between Performance and GradientsSettings We visualize the relationship between test performance (loss or accuracy) and gradientamplitude of the trainingtest loss in Figure 5 where the gradient amplitude is the `2 norm of thelter-nomalized gradient of the loss e lter-normalized gradient is the gradient appropriatelyscaled depending on the magnitude of the corresponding convolutional lter similarly to Liet al [2018] More specicically for each lter of every comvolutional layer we multiply thecorresponding elements of the gradient by the norm of the lter Note that a fully connected layeris a special case of convolutional layer and subject to this scaling We exclude Fashion-MNISTbecause the optimal ooding level was zero We used the results with trainingvalidation splitratio of 08

Results For the gures with gradient amplitude of training loss on the horizontal axis ldquordquomarks (w ooding) are oen ploed on the right of ldquo+rdquo marks (wo ooding) which impliesthat ooding prevents the model from staying a local minimum For the gures with gradientamplitude of test loss on the horizontal axis we can observe the method with ooding (ldquordquo)improves performance while the gradient amplitude becomes smaller On the other hand theperformance with the method without ooding (ldquo+rdquo) degenerates while the gradient amplitude oftest loss keeps increasing

24

(a) MNIST xtrainyloss

(b) MNIST xtrainyacc

(c) MNIST xtestyloss

(d) MNIST xtest yacc

(e) K-M xtrain yloss (f) K-M xtrain yacc (g) K-M xtest yloss (h) K-M xtest yacc

(i) C10 xtrain yloss (j) C10 xtrain yacc (k) C10 xtest yloss (l) C10 xtest yacc

(m) C100 xtrainyloss

(n) C100 xtrain yacc (o) C100 xtest yloss (p) C100 xtest yacc

(q) SVHN xtrainyloss

(r) SVHN xtrainyacc

(s) SVHN xtest yloss (t) SVHN xtest yacc

Figure 5 Relationship between test performance (loss or accuracy) and amplitude of gradient (with trainingor test loss) Each point (ldquordquo or ldquo+rdquo) in the gures corresponds to a single model at a certain epoch Weremove the rst 5 epochs and plot the rest ldquordquo is used for the method with ooding and ldquo+rdquo is used forthe method without ooding e large black ldquordquo and ldquo+rdquo show the epochs with early stopping e colorbecomes lighter (purplerarr yellow) as the training proceeds K-M C10 and C100 stand for Kuzushiji-MNISTCIFAR-10 and CIFAR-100

25

F FlatnessSettings We follow Li et al [2018] and give a one-dimensional visualization of atness for eachdataset in Figure 6 We exclude Fashion-MNIST because the optimal ooding level was zero Weused the results with trainingvalidation split ratio of 08 We compare the atness of the modelright aer the empirical risk with respect to a mini-batch becomes smaller than the ooding levelRm(g) lt b for the rst time (doed blue line) and the model aer training (solid blue line) Wealso compare them with the model trained by the baseline method without ooding and trainingis nished (solid red line)

Results According to Figure 6 the test loss becomes lower and more at during the trainingwith ooding Note that the training loss on the other hand continues to oat around the oodinglevel until the end of training aer it enters the ooding zone We expect that the model makes arandom walk and escapes regions with sharp loss landscapes during the period is may be apossible reason for beer generalization results with our proposed method

26

(a) MNIST (train) (b) MNIST (test) (c) Kuzushiji-MNIST(train)

(d) Kuzushiji-MNIST(test)

(e) CIFAR-10 (train) (f) CIFAR-10 (test) (g) CIFAR-100 (train) (h) CIFAR-100 (test)

(i) SVHN (train) (j) SVHN (test)

Figure 6 One-dimensional visualization of atness We visualize the trainingtest loss with respect toperturbation We depict the results for 3 models the model when the empirical risk with respect to trainingdata is below the ooding level for the rst time during training (doed blue) the model at the end oftraining with ooding (solid blue) and the model at the end of training without ooding (solid red)

27

  • 1 Introduction
  • 2 Backgrounds
    • 21 Regularization Methods
    • 22 Double Descent Curves with Overparametrization
    • 23 Lower-Bounding the Empirical Risk
      • 3 Flooding How to Avoid Zero Training Loss
        • 31 Preliminaries
        • 32 Algorithm
        • 33 Implementation
        • 34 Theoretical Analysis
          • 4 Experiments
            • 41 Synthetic Experiments
            • 42 Benchmark Experiments
            • 43 Memorization
              • 5 Conclusion
              • A Proof of Theorem
              • B Bayes Risk for Gaussian Distributions
              • C Details of Experiments
                • C1 Benchmark Datasets
                • C2 Chosen Flooding Levels
                  • D Learning Curves
                  • E Relationship between Performance and Gradients
                  • F Flatness
Page 21: Do We Need Zero Training Loss A›er Achieving Zero Training … · 2020. 2. 21. · directly aim to avoid zero training loss, they o›en fail to maintain a moderate level of training

C Details of Experiments

C1 Benchmark DatasetsIn the experiments in Section 42 we use six benchmark datasets explained below

bull MNIST6 [Lecun et al 1998] is a 10 class dataset of handwrien digits 1 2 9 and 0 Eachsample is a 28times 28 grayscale image e number of training and test samples are 60000and 10000 respectively

bull Fashion-MNIST7 [Xiao et al 2017] is a 10 class dataset of fashion items T-shirttop TrouserPullover Dress Coat Sandal Shirt Sneaker Bag and Ankle boot Each sample is a 28times 28grayscale image e number of training and test samples are 60000 and 10000 respectively

bull Kuzushiji-MNIST8 [Clanuwat et al 2018] is a 10 class dataset of cursive Japanese (ldquoKuzushijirdquo)characters Each sample is a 28 times 28 grayscale image e number of training and testsamples are 60000 and 10000 respectively

bull CIFAR-109 is a 10 class dataset of various objects airplane automobile bird cat deer dogfrog horse ship and truck Each sample is a colored image in 32times 32times 3 RGB format It isa subset of the 80 million tiny images dataset [Torralba et al 2008] ere are 6000 imagesper class where 5000 are for training and 1000 are for test

bull CIFAR-10010 is a 100 class dataset of various objects Each class has 600 samples where 500samples are for training and 100 samples are for test is is also a subset of the 80 milliontiny images dataset [Torralba et al 2008]

bull SVHN11 [Netzer et al 2011] is a 10 class dataset of house numbers from Google Street Viewimages in 32times 32times 3 RGB format 73257 digits are for training and 26032 digits are fortesting

6httpyannlecuncomexdbmnist7httpsgithubcomzalandoresearchfashion-mnist8httpsgithubcomrois-codhkmnist9httpswwwcstorontoedusimkrizcifarhtml

10httpswwwcstorontoedusimkrizcifarhtml11httpufldlstanfordeduhousenumbers

21

C2 Chosen Flooding LevelsIn Table 4 we report the chosen ooding levels for our experiments with benchmark datasets

Table 4 e chosen ooding levels for benchmark experiments

Early stopping times X times XWeight decay times times X XFlooding X X X X

MNIST (08) 002 002 003 002MNIST (04) 002 003 000 002Fashion-MNIST (08) 000 000 mdash mdashFashion-MNIST (04) 000 000 mdash mdashKuzushiji-MNIST (08) 001 003 003 003Kuzushiji-MNIST (04) 003 002 004 003CIFAR-10 (08) 004 004 000 001CIFAR-10 (04) 003 003 000 000CIFAR-100 (08) 002 002 000 000CIFAR-100 (04) 003 003 000 000SVHN (08) 001 001 000 002SVHN (04) 001 009 003 003

D Learning CurvesIn Figure 4 we visualize learning curves for all datasets (including CIFAR-10 which we alreadyvisualized in Figure 2) We only show the learning curves for trainingvalidation proportion of 08since the results for 04 were similar with 08 Note that Figure 1(c) shows the learning curves forthe rst 80 epochs for CIFAR-10 without ooding Figure 1(d) shows the learning curves withooding when the ooding level is 018

22

(a) MNIST (000) (b) MNIST (001) (c) MNIST (002) (d) MNIST (003)

(e) Fashion-MNIST(000)

(f) Fashion-MNIST(002)

(g) Fashion-MNIST(005)

(h) Fashion-MNIST(010)

(i) KMNIST (000) (j) KMNIST (002) (k) KMNIST (005) (l) KMNIST (010)

(m) CIFAR-10 (000) (n) CIFAR-10 (003) (o) CIFAR-10 (007) (p) CIFAR-10 (020)

(q) CIFAR-100 (000) (r) CIFAR-100 (001) (s) CIFAR-100 (006) (t) CIFAR-100 (020)

(u) SVHN (000) (v) SVHN (001) (w) SVHN (006) (x) SVHN (020)

Figure 4 Learning curves of training and test loss e rst gure in each row is the learning curveswithout ooding e 2nd 3rd and 4th columns show the results with dierent ooding levels e oodinglevel increases towards the right-hand side 23

E Relationship between Performance and GradientsSettings We visualize the relationship between test performance (loss or accuracy) and gradientamplitude of the trainingtest loss in Figure 5 where the gradient amplitude is the `2 norm of thelter-nomalized gradient of the loss e lter-normalized gradient is the gradient appropriatelyscaled depending on the magnitude of the corresponding convolutional lter similarly to Liet al [2018] More specicically for each lter of every comvolutional layer we multiply thecorresponding elements of the gradient by the norm of the lter Note that a fully connected layeris a special case of convolutional layer and subject to this scaling We exclude Fashion-MNISTbecause the optimal ooding level was zero We used the results with trainingvalidation splitratio of 08

Results For the gures with gradient amplitude of training loss on the horizontal axis ldquordquomarks (w ooding) are oen ploed on the right of ldquo+rdquo marks (wo ooding) which impliesthat ooding prevents the model from staying a local minimum For the gures with gradientamplitude of test loss on the horizontal axis we can observe the method with ooding (ldquordquo)improves performance while the gradient amplitude becomes smaller On the other hand theperformance with the method without ooding (ldquo+rdquo) degenerates while the gradient amplitude oftest loss keeps increasing

24

(a) MNIST xtrainyloss

(b) MNIST xtrainyacc

(c) MNIST xtestyloss

(d) MNIST xtest yacc

(e) K-M xtrain yloss (f) K-M xtrain yacc (g) K-M xtest yloss (h) K-M xtest yacc

(i) C10 xtrain yloss (j) C10 xtrain yacc (k) C10 xtest yloss (l) C10 xtest yacc

(m) C100 xtrainyloss

(n) C100 xtrain yacc (o) C100 xtest yloss (p) C100 xtest yacc

(q) SVHN xtrainyloss

(r) SVHN xtrainyacc

(s) SVHN xtest yloss (t) SVHN xtest yacc

Figure 5 Relationship between test performance (loss or accuracy) and amplitude of gradient (with trainingor test loss) Each point (ldquordquo or ldquo+rdquo) in the gures corresponds to a single model at a certain epoch Weremove the rst 5 epochs and plot the rest ldquordquo is used for the method with ooding and ldquo+rdquo is used forthe method without ooding e large black ldquordquo and ldquo+rdquo show the epochs with early stopping e colorbecomes lighter (purplerarr yellow) as the training proceeds K-M C10 and C100 stand for Kuzushiji-MNISTCIFAR-10 and CIFAR-100

25

F FlatnessSettings We follow Li et al [2018] and give a one-dimensional visualization of atness for eachdataset in Figure 6 We exclude Fashion-MNIST because the optimal ooding level was zero Weused the results with trainingvalidation split ratio of 08 We compare the atness of the modelright aer the empirical risk with respect to a mini-batch becomes smaller than the ooding levelRm(g) lt b for the rst time (doed blue line) and the model aer training (solid blue line) Wealso compare them with the model trained by the baseline method without ooding and trainingis nished (solid red line)

Results According to Figure 6 the test loss becomes lower and more at during the trainingwith ooding Note that the training loss on the other hand continues to oat around the oodinglevel until the end of training aer it enters the ooding zone We expect that the model makes arandom walk and escapes regions with sharp loss landscapes during the period is may be apossible reason for beer generalization results with our proposed method

26

(a) MNIST (train) (b) MNIST (test) (c) Kuzushiji-MNIST(train)

(d) Kuzushiji-MNIST(test)

(e) CIFAR-10 (train) (f) CIFAR-10 (test) (g) CIFAR-100 (train) (h) CIFAR-100 (test)

(i) SVHN (train) (j) SVHN (test)

Figure 6 One-dimensional visualization of atness We visualize the trainingtest loss with respect toperturbation We depict the results for 3 models the model when the empirical risk with respect to trainingdata is below the ooding level for the rst time during training (doed blue) the model at the end oftraining with ooding (solid blue) and the model at the end of training without ooding (solid red)

27

  • 1 Introduction
  • 2 Backgrounds
    • 21 Regularization Methods
    • 22 Double Descent Curves with Overparametrization
    • 23 Lower-Bounding the Empirical Risk
      • 3 Flooding How to Avoid Zero Training Loss
        • 31 Preliminaries
        • 32 Algorithm
        • 33 Implementation
        • 34 Theoretical Analysis
          • 4 Experiments
            • 41 Synthetic Experiments
            • 42 Benchmark Experiments
            • 43 Memorization
              • 5 Conclusion
              • A Proof of Theorem
              • B Bayes Risk for Gaussian Distributions
              • C Details of Experiments
                • C1 Benchmark Datasets
                • C2 Chosen Flooding Levels
                  • D Learning Curves
                  • E Relationship between Performance and Gradients
                  • F Flatness
Page 22: Do We Need Zero Training Loss A›er Achieving Zero Training … · 2020. 2. 21. · directly aim to avoid zero training loss, they o›en fail to maintain a moderate level of training

C2 Chosen Flooding LevelsIn Table 4 we report the chosen ooding levels for our experiments with benchmark datasets

Table 4 e chosen ooding levels for benchmark experiments

Early stopping times X times XWeight decay times times X XFlooding X X X X

MNIST (08) 002 002 003 002MNIST (04) 002 003 000 002Fashion-MNIST (08) 000 000 mdash mdashFashion-MNIST (04) 000 000 mdash mdashKuzushiji-MNIST (08) 001 003 003 003Kuzushiji-MNIST (04) 003 002 004 003CIFAR-10 (08) 004 004 000 001CIFAR-10 (04) 003 003 000 000CIFAR-100 (08) 002 002 000 000CIFAR-100 (04) 003 003 000 000SVHN (08) 001 001 000 002SVHN (04) 001 009 003 003

D Learning CurvesIn Figure 4 we visualize learning curves for all datasets (including CIFAR-10 which we alreadyvisualized in Figure 2) We only show the learning curves for trainingvalidation proportion of 08since the results for 04 were similar with 08 Note that Figure 1(c) shows the learning curves forthe rst 80 epochs for CIFAR-10 without ooding Figure 1(d) shows the learning curves withooding when the ooding level is 018

22

(a) MNIST (000) (b) MNIST (001) (c) MNIST (002) (d) MNIST (003)

(e) Fashion-MNIST(000)

(f) Fashion-MNIST(002)

(g) Fashion-MNIST(005)

(h) Fashion-MNIST(010)

(i) KMNIST (000) (j) KMNIST (002) (k) KMNIST (005) (l) KMNIST (010)

(m) CIFAR-10 (000) (n) CIFAR-10 (003) (o) CIFAR-10 (007) (p) CIFAR-10 (020)

(q) CIFAR-100 (000) (r) CIFAR-100 (001) (s) CIFAR-100 (006) (t) CIFAR-100 (020)

(u) SVHN (000) (v) SVHN (001) (w) SVHN (006) (x) SVHN (020)

Figure 4 Learning curves of training and test loss e rst gure in each row is the learning curveswithout ooding e 2nd 3rd and 4th columns show the results with dierent ooding levels e oodinglevel increases towards the right-hand side 23

E Relationship between Performance and GradientsSettings We visualize the relationship between test performance (loss or accuracy) and gradientamplitude of the trainingtest loss in Figure 5 where the gradient amplitude is the `2 norm of thelter-nomalized gradient of the loss e lter-normalized gradient is the gradient appropriatelyscaled depending on the magnitude of the corresponding convolutional lter similarly to Liet al [2018] More specicically for each lter of every comvolutional layer we multiply thecorresponding elements of the gradient by the norm of the lter Note that a fully connected layeris a special case of convolutional layer and subject to this scaling We exclude Fashion-MNISTbecause the optimal ooding level was zero We used the results with trainingvalidation splitratio of 08

Results For the gures with gradient amplitude of training loss on the horizontal axis ldquordquomarks (w ooding) are oen ploed on the right of ldquo+rdquo marks (wo ooding) which impliesthat ooding prevents the model from staying a local minimum For the gures with gradientamplitude of test loss on the horizontal axis we can observe the method with ooding (ldquordquo)improves performance while the gradient amplitude becomes smaller On the other hand theperformance with the method without ooding (ldquo+rdquo) degenerates while the gradient amplitude oftest loss keeps increasing

24

(a) MNIST xtrainyloss

(b) MNIST xtrainyacc

(c) MNIST xtestyloss

(d) MNIST xtest yacc

(e) K-M xtrain yloss (f) K-M xtrain yacc (g) K-M xtest yloss (h) K-M xtest yacc

(i) C10 xtrain yloss (j) C10 xtrain yacc (k) C10 xtest yloss (l) C10 xtest yacc

(m) C100 xtrainyloss

(n) C100 xtrain yacc (o) C100 xtest yloss (p) C100 xtest yacc

(q) SVHN xtrainyloss

(r) SVHN xtrainyacc

(s) SVHN xtest yloss (t) SVHN xtest yacc

Figure 5 Relationship between test performance (loss or accuracy) and amplitude of gradient (with trainingor test loss) Each point (ldquordquo or ldquo+rdquo) in the gures corresponds to a single model at a certain epoch Weremove the rst 5 epochs and plot the rest ldquordquo is used for the method with ooding and ldquo+rdquo is used forthe method without ooding e large black ldquordquo and ldquo+rdquo show the epochs with early stopping e colorbecomes lighter (purplerarr yellow) as the training proceeds K-M C10 and C100 stand for Kuzushiji-MNISTCIFAR-10 and CIFAR-100

25

F FlatnessSettings We follow Li et al [2018] and give a one-dimensional visualization of atness for eachdataset in Figure 6 We exclude Fashion-MNIST because the optimal ooding level was zero Weused the results with trainingvalidation split ratio of 08 We compare the atness of the modelright aer the empirical risk with respect to a mini-batch becomes smaller than the ooding levelRm(g) lt b for the rst time (doed blue line) and the model aer training (solid blue line) Wealso compare them with the model trained by the baseline method without ooding and trainingis nished (solid red line)

Results According to Figure 6 the test loss becomes lower and more at during the trainingwith ooding Note that the training loss on the other hand continues to oat around the oodinglevel until the end of training aer it enters the ooding zone We expect that the model makes arandom walk and escapes regions with sharp loss landscapes during the period is may be apossible reason for beer generalization results with our proposed method

26

(a) MNIST (train) (b) MNIST (test) (c) Kuzushiji-MNIST(train)

(d) Kuzushiji-MNIST(test)

(e) CIFAR-10 (train) (f) CIFAR-10 (test) (g) CIFAR-100 (train) (h) CIFAR-100 (test)

(i) SVHN (train) (j) SVHN (test)

Figure 6 One-dimensional visualization of atness We visualize the trainingtest loss with respect toperturbation We depict the results for 3 models the model when the empirical risk with respect to trainingdata is below the ooding level for the rst time during training (doed blue) the model at the end oftraining with ooding (solid blue) and the model at the end of training without ooding (solid red)

27

  • 1 Introduction
  • 2 Backgrounds
    • 21 Regularization Methods
    • 22 Double Descent Curves with Overparametrization
    • 23 Lower-Bounding the Empirical Risk
      • 3 Flooding How to Avoid Zero Training Loss
        • 31 Preliminaries
        • 32 Algorithm
        • 33 Implementation
        • 34 Theoretical Analysis
          • 4 Experiments
            • 41 Synthetic Experiments
            • 42 Benchmark Experiments
            • 43 Memorization
              • 5 Conclusion
              • A Proof of Theorem
              • B Bayes Risk for Gaussian Distributions
              • C Details of Experiments
                • C1 Benchmark Datasets
                • C2 Chosen Flooding Levels
                  • D Learning Curves
                  • E Relationship between Performance and Gradients
                  • F Flatness
Page 23: Do We Need Zero Training Loss A›er Achieving Zero Training … · 2020. 2. 21. · directly aim to avoid zero training loss, they o›en fail to maintain a moderate level of training

(a) MNIST (000) (b) MNIST (001) (c) MNIST (002) (d) MNIST (003)

(e) Fashion-MNIST(000)

(f) Fashion-MNIST(002)

(g) Fashion-MNIST(005)

(h) Fashion-MNIST(010)

(i) KMNIST (000) (j) KMNIST (002) (k) KMNIST (005) (l) KMNIST (010)

(m) CIFAR-10 (000) (n) CIFAR-10 (003) (o) CIFAR-10 (007) (p) CIFAR-10 (020)

(q) CIFAR-100 (000) (r) CIFAR-100 (001) (s) CIFAR-100 (006) (t) CIFAR-100 (020)

(u) SVHN (000) (v) SVHN (001) (w) SVHN (006) (x) SVHN (020)

Figure 4 Learning curves of training and test loss e rst gure in each row is the learning curveswithout ooding e 2nd 3rd and 4th columns show the results with dierent ooding levels e oodinglevel increases towards the right-hand side 23

E Relationship between Performance and GradientsSettings We visualize the relationship between test performance (loss or accuracy) and gradientamplitude of the trainingtest loss in Figure 5 where the gradient amplitude is the `2 norm of thelter-nomalized gradient of the loss e lter-normalized gradient is the gradient appropriatelyscaled depending on the magnitude of the corresponding convolutional lter similarly to Liet al [2018] More specicically for each lter of every comvolutional layer we multiply thecorresponding elements of the gradient by the norm of the lter Note that a fully connected layeris a special case of convolutional layer and subject to this scaling We exclude Fashion-MNISTbecause the optimal ooding level was zero We used the results with trainingvalidation splitratio of 08

Results For the gures with gradient amplitude of training loss on the horizontal axis ldquordquomarks (w ooding) are oen ploed on the right of ldquo+rdquo marks (wo ooding) which impliesthat ooding prevents the model from staying a local minimum For the gures with gradientamplitude of test loss on the horizontal axis we can observe the method with ooding (ldquordquo)improves performance while the gradient amplitude becomes smaller On the other hand theperformance with the method without ooding (ldquo+rdquo) degenerates while the gradient amplitude oftest loss keeps increasing

24

(a) MNIST xtrainyloss

(b) MNIST xtrainyacc

(c) MNIST xtestyloss

(d) MNIST xtest yacc

(e) K-M xtrain yloss (f) K-M xtrain yacc (g) K-M xtest yloss (h) K-M xtest yacc

(i) C10 xtrain yloss (j) C10 xtrain yacc (k) C10 xtest yloss (l) C10 xtest yacc

(m) C100 xtrainyloss

(n) C100 xtrain yacc (o) C100 xtest yloss (p) C100 xtest yacc

(q) SVHN xtrainyloss

(r) SVHN xtrainyacc

(s) SVHN xtest yloss (t) SVHN xtest yacc

Figure 5 Relationship between test performance (loss or accuracy) and amplitude of gradient (with trainingor test loss) Each point (ldquordquo or ldquo+rdquo) in the gures corresponds to a single model at a certain epoch Weremove the rst 5 epochs and plot the rest ldquordquo is used for the method with ooding and ldquo+rdquo is used forthe method without ooding e large black ldquordquo and ldquo+rdquo show the epochs with early stopping e colorbecomes lighter (purplerarr yellow) as the training proceeds K-M C10 and C100 stand for Kuzushiji-MNISTCIFAR-10 and CIFAR-100

25

F FlatnessSettings We follow Li et al [2018] and give a one-dimensional visualization of atness for eachdataset in Figure 6 We exclude Fashion-MNIST because the optimal ooding level was zero Weused the results with trainingvalidation split ratio of 08 We compare the atness of the modelright aer the empirical risk with respect to a mini-batch becomes smaller than the ooding levelRm(g) lt b for the rst time (doed blue line) and the model aer training (solid blue line) Wealso compare them with the model trained by the baseline method without ooding and trainingis nished (solid red line)

Results According to Figure 6 the test loss becomes lower and more at during the trainingwith ooding Note that the training loss on the other hand continues to oat around the oodinglevel until the end of training aer it enters the ooding zone We expect that the model makes arandom walk and escapes regions with sharp loss landscapes during the period is may be apossible reason for beer generalization results with our proposed method

26

(a) MNIST (train) (b) MNIST (test) (c) Kuzushiji-MNIST(train)

(d) Kuzushiji-MNIST(test)

(e) CIFAR-10 (train) (f) CIFAR-10 (test) (g) CIFAR-100 (train) (h) CIFAR-100 (test)

(i) SVHN (train) (j) SVHN (test)

Figure 6 One-dimensional visualization of atness We visualize the trainingtest loss with respect toperturbation We depict the results for 3 models the model when the empirical risk with respect to trainingdata is below the ooding level for the rst time during training (doed blue) the model at the end oftraining with ooding (solid blue) and the model at the end of training without ooding (solid red)

27

  • 1 Introduction
  • 2 Backgrounds
    • 21 Regularization Methods
    • 22 Double Descent Curves with Overparametrization
    • 23 Lower-Bounding the Empirical Risk
      • 3 Flooding How to Avoid Zero Training Loss
        • 31 Preliminaries
        • 32 Algorithm
        • 33 Implementation
        • 34 Theoretical Analysis
          • 4 Experiments
            • 41 Synthetic Experiments
            • 42 Benchmark Experiments
            • 43 Memorization
              • 5 Conclusion
              • A Proof of Theorem
              • B Bayes Risk for Gaussian Distributions
              • C Details of Experiments
                • C1 Benchmark Datasets
                • C2 Chosen Flooding Levels
                  • D Learning Curves
                  • E Relationship between Performance and Gradients
                  • F Flatness
Page 24: Do We Need Zero Training Loss A›er Achieving Zero Training … · 2020. 2. 21. · directly aim to avoid zero training loss, they o›en fail to maintain a moderate level of training

E Relationship between Performance and GradientsSettings We visualize the relationship between test performance (loss or accuracy) and gradientamplitude of the trainingtest loss in Figure 5 where the gradient amplitude is the `2 norm of thelter-nomalized gradient of the loss e lter-normalized gradient is the gradient appropriatelyscaled depending on the magnitude of the corresponding convolutional lter similarly to Liet al [2018] More specicically for each lter of every comvolutional layer we multiply thecorresponding elements of the gradient by the norm of the lter Note that a fully connected layeris a special case of convolutional layer and subject to this scaling We exclude Fashion-MNISTbecause the optimal ooding level was zero We used the results with trainingvalidation splitratio of 08

Results For the gures with gradient amplitude of training loss on the horizontal axis ldquordquomarks (w ooding) are oen ploed on the right of ldquo+rdquo marks (wo ooding) which impliesthat ooding prevents the model from staying a local minimum For the gures with gradientamplitude of test loss on the horizontal axis we can observe the method with ooding (ldquordquo)improves performance while the gradient amplitude becomes smaller On the other hand theperformance with the method without ooding (ldquo+rdquo) degenerates while the gradient amplitude oftest loss keeps increasing

24

(a) MNIST xtrainyloss

(b) MNIST xtrainyacc

(c) MNIST xtestyloss

(d) MNIST xtest yacc

(e) K-M xtrain yloss (f) K-M xtrain yacc (g) K-M xtest yloss (h) K-M xtest yacc

(i) C10 xtrain yloss (j) C10 xtrain yacc (k) C10 xtest yloss (l) C10 xtest yacc

(m) C100 xtrainyloss

(n) C100 xtrain yacc (o) C100 xtest yloss (p) C100 xtest yacc

(q) SVHN xtrainyloss

(r) SVHN xtrainyacc

(s) SVHN xtest yloss (t) SVHN xtest yacc

Figure 5 Relationship between test performance (loss or accuracy) and amplitude of gradient (with trainingor test loss) Each point (ldquordquo or ldquo+rdquo) in the gures corresponds to a single model at a certain epoch Weremove the rst 5 epochs and plot the rest ldquordquo is used for the method with ooding and ldquo+rdquo is used forthe method without ooding e large black ldquordquo and ldquo+rdquo show the epochs with early stopping e colorbecomes lighter (purplerarr yellow) as the training proceeds K-M C10 and C100 stand for Kuzushiji-MNISTCIFAR-10 and CIFAR-100

25

F FlatnessSettings We follow Li et al [2018] and give a one-dimensional visualization of atness for eachdataset in Figure 6 We exclude Fashion-MNIST because the optimal ooding level was zero Weused the results with trainingvalidation split ratio of 08 We compare the atness of the modelright aer the empirical risk with respect to a mini-batch becomes smaller than the ooding levelRm(g) lt b for the rst time (doed blue line) and the model aer training (solid blue line) Wealso compare them with the model trained by the baseline method without ooding and trainingis nished (solid red line)

Results According to Figure 6 the test loss becomes lower and more at during the trainingwith ooding Note that the training loss on the other hand continues to oat around the oodinglevel until the end of training aer it enters the ooding zone We expect that the model makes arandom walk and escapes regions with sharp loss landscapes during the period is may be apossible reason for beer generalization results with our proposed method

26

(a) MNIST (train) (b) MNIST (test) (c) Kuzushiji-MNIST(train)

(d) Kuzushiji-MNIST(test)

(e) CIFAR-10 (train) (f) CIFAR-10 (test) (g) CIFAR-100 (train) (h) CIFAR-100 (test)

(i) SVHN (train) (j) SVHN (test)

Figure 6 One-dimensional visualization of atness We visualize the trainingtest loss with respect toperturbation We depict the results for 3 models the model when the empirical risk with respect to trainingdata is below the ooding level for the rst time during training (doed blue) the model at the end oftraining with ooding (solid blue) and the model at the end of training without ooding (solid red)

27

  • 1 Introduction
  • 2 Backgrounds
    • 21 Regularization Methods
    • 22 Double Descent Curves with Overparametrization
    • 23 Lower-Bounding the Empirical Risk
      • 3 Flooding How to Avoid Zero Training Loss
        • 31 Preliminaries
        • 32 Algorithm
        • 33 Implementation
        • 34 Theoretical Analysis
          • 4 Experiments
            • 41 Synthetic Experiments
            • 42 Benchmark Experiments
            • 43 Memorization
              • 5 Conclusion
              • A Proof of Theorem
              • B Bayes Risk for Gaussian Distributions
              • C Details of Experiments
                • C1 Benchmark Datasets
                • C2 Chosen Flooding Levels
                  • D Learning Curves
                  • E Relationship between Performance and Gradients
                  • F Flatness
Page 25: Do We Need Zero Training Loss A›er Achieving Zero Training … · 2020. 2. 21. · directly aim to avoid zero training loss, they o›en fail to maintain a moderate level of training

(a) MNIST xtrainyloss

(b) MNIST xtrainyacc

(c) MNIST xtestyloss

(d) MNIST xtest yacc

(e) K-M xtrain yloss (f) K-M xtrain yacc (g) K-M xtest yloss (h) K-M xtest yacc

(i) C10 xtrain yloss (j) C10 xtrain yacc (k) C10 xtest yloss (l) C10 xtest yacc

(m) C100 xtrainyloss

(n) C100 xtrain yacc (o) C100 xtest yloss (p) C100 xtest yacc

(q) SVHN xtrainyloss

(r) SVHN xtrainyacc

(s) SVHN xtest yloss (t) SVHN xtest yacc

Figure 5 Relationship between test performance (loss or accuracy) and amplitude of gradient (with trainingor test loss) Each point (ldquordquo or ldquo+rdquo) in the gures corresponds to a single model at a certain epoch Weremove the rst 5 epochs and plot the rest ldquordquo is used for the method with ooding and ldquo+rdquo is used forthe method without ooding e large black ldquordquo and ldquo+rdquo show the epochs with early stopping e colorbecomes lighter (purplerarr yellow) as the training proceeds K-M C10 and C100 stand for Kuzushiji-MNISTCIFAR-10 and CIFAR-100

25

F FlatnessSettings We follow Li et al [2018] and give a one-dimensional visualization of atness for eachdataset in Figure 6 We exclude Fashion-MNIST because the optimal ooding level was zero Weused the results with trainingvalidation split ratio of 08 We compare the atness of the modelright aer the empirical risk with respect to a mini-batch becomes smaller than the ooding levelRm(g) lt b for the rst time (doed blue line) and the model aer training (solid blue line) Wealso compare them with the model trained by the baseline method without ooding and trainingis nished (solid red line)

Results. According to Figure 6, the test loss becomes lower and more flat during the training with flooding. Note that the training loss, on the other hand, continues to float around the flooding level until the end of training after it enters the flooding zone. We expect that the model makes a random walk and escapes regions with sharp loss landscapes during this period. This may be a possible reason for the better generalization results with our proposed method.
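The floating behavior described above comes from the flooding update itself: when the mini-batch loss is above the flooding level b the optimizer descends as usual, and when it is below b the gradient sign is flipped so the optimizer ascends until the loss crosses b again. A minimal sketch of this loss modification, written here as |loss − b| + b (the function name and its arguments are illustrative):

```python
import torch
import torch.nn.functional as F

def flooded_loss(logits, targets, b):
    """Sketch of flooding applied to a mini-batch loss: identical to the
    original loss above the flooding level b, mirrored below it, so gradient
    descent turns into gradient ascent whenever the loss drops under b."""
    loss = F.cross_entropy(logits, targets)
    return (loss - b).abs() + b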

26

[Figure 6 panels: (a) MNIST (train), (b) MNIST (test), (c) Kuzushiji-MNIST (train), (d) Kuzushiji-MNIST (test), (e) CIFAR-10 (train), (f) CIFAR-10 (test), (g) CIFAR-100 (train), (h) CIFAR-100 (test), (i) SVHN (train), (j) SVHN (test).]

Figure 6: One-dimensional visualization of flatness. We visualize the training/test loss with respect to perturbation. We depict the results for 3 models: the model when the empirical risk with respect to training data is below the flooding level for the first time during training (dotted blue), the model at the end of training with flooding (solid blue), and the model at the end of training without flooding (solid red).

27
