Abstract: In the first part of the talk, we focus on the gradient noise (GN) in the stochastic gradient descent (SGD) algorithm for training deep neural networks (DNNs). In this setting, GN is often considered to be Gaussian in the large data regime by assuming that the classical central limit theorem (CLT) kicks in. This assumption is often made for mathematical convenience, since it enables SGD to be analyzed as a stochastic differential equation (SDE) driven by a Brownian motion. We argue that the Gaussianity assumption might fail to hold in deep learning settings and hence render the Brownian motion-based analyses inappropriate. We conduct extensive experiments on common deep learning architectures and show that in all settings, the GN is highly non-Gaussian and admits heavy-tails. We further investigate the tail behavior in varying network architectures and sizes, loss functions, and datasets. By exploiting connections to Levy-driven differential equations and their metastability properties, our results open up a different perspective and shed more light on the belief that SGD prefers wide minima. Joint work with Umut Simsekli and Levent Sagun. In the second part of the talk, we question the origins of heavy tails in SGD iterations further and their link to various notions of capacity and complexity that have been proposed for characterizing the generalization properties of SGD in deep learning. Some of the popular notions that correlate well with the performance on unseen data are (i) the ‘flatness’ of the local minimum found by SGD, which is related to the eigenvalues of the Hessian, (ii) the ratio of the stepsize η to the batch size b, which essentially controls the magnitude of the stochastic gradient noise, and (iii) the ‘tail-index’, which measures the heaviness of the tails of the eigenspectra of the network weights. We argue that these three seemingly unrelated perspectives for generalization are deeply linked to each other. We claim that depending on the structure of the Hessian of the loss at the minimum, and the choices of the algorithm parameters η and b, the SGD iterates will converge to a heavy-tailed stationary distribution. We rigorously prove this claim in the setting of linear regression: we show that even in a simple quadratic optimization problem with independent and identically distributed Gaussian data, the iterates can be heavy-tailed with infinite variance. We further characterize the behavior of the tails with respect to algorithm parameters, the dimension, and the curvature. We then translate our results into insights about the behavior of SGD in deep learning. We finally support our theory with experiments conducted on both synthetic data and neural networks. Joint work with Umut Simsekli and Lingjiong Zhu. In the third part of the talk, if time permits, we will talk about a new class of initialization schemes for fully-connected neural networks that can improve the performance of SGD training by inducing a particular form of heavy-tailed behavior in the stochastic gradients. Joint work with Yuanhan Hu.
- Күн бұрын
Prof. Mert GürbüzbalabanDS | Heavy-tail Phenomenon in SGD
- Рет қаралды 244
Пікірлер: 1