The Generalization Error Bound for A Stochastic Gradient Descent Family bia A Gaussian Approximation Method

Hao Chen; Zhanfeng Mo; Zhouwang Yang

doi:10.61822/amcs-2025-0018

Abstract

Recent works have developed model complexity based and algorithm based generalization error bounds to explain how stochastic gradient descent (SGD) methods help over-parameterized models generalize better. However, previous works are limited by their scope of analysis and fail to provide comprehensive explanations. In this paper, we propose a novel Gaussian approximation framework to establish generalization error bounds for the 𝒰-SGD family, which is a class of SGD with asymptotically unbiased and uniformly bounded gradient noise. We study 𝒰-SGD dynamics, and we show both theoretically and numerically that the limiting model parameter distribution tends to be Gaussian, even when the original gradient noise is non-Gaussian. For a 𝒰-SGD family, we establish a desirable iteration number independent generalization error bound at the order of $𝒪 ((1 + \sqrt{log (p \sqrt{n})}) / \sqrt{n})$ \mathcal{O}\left( {\left( {1 + \sqrt {\log \left( {p\sqrt n } \right)} } \right)/\sqrt n } \right) , where n and p stand for the sample size and parameter dimension. Based on our analysis, we propose two general types of methods to help models generalize better, termed as the additive and multiplicative noise insertions. We show that these methods significantly reduce the dominant term of the generalization error bound.

References

Arora, S., Du, S.S., Hu, W., Li, Z., Salakhutdinov, R.R. and Wang, R. (2019). On exact computation with an infinitely wide neural net, in H. Wallach et al. (Eds), Advances in Neural Information Processing Systems, Vol. 32, Curran Associates, New York, pp. 8141–8150.
Search in Google Scholar Back to article
Bartlett, P.L., Foster, D.J. and Telgarsky, M.J. (2017). Spectrally-normalized margin bounds for neural networks, in I. Guyon et al. (Eds), Advances in Neural Information Processing Systems, Vol. 30, Curran Associates, New York, pp. 6240–6249.
Search in Google Scholar Back to article
Bottou, L. (1998). On-line learning and stochastic approximations, in L. Bottou (Ed) On-line Learning in Neural Networks, Cambridge University Press, pp. 9–42.
Search in Google Scholar Back to article
Bousquet, O. and Elisseeff, A. (2002). Stability and generalization, Journal of Machine Learning Research 2: 499–526, DOI: 10.1162/153244302760200704.
Search in Google Scholar Back to article
Chen, H., Mo, Z., Yang, Z. and Wang, X. (2019). Theoretical investigation of generalization bound for residual networks, Proceedings of the 28th International Joint Conference on Artificial Intelligence, Morgan Kaufmann, San Francisco, pp. 2081–2087, DOI: 10.24963/ijcai.2019/288.
Search in Google Scholar Back to article
Chen, Y., Chang, H., Meng, J. and Zhang, D. (2019). Ensemble neural networks (enn): A gradient-free stochastic method, Neural Networks 110: 170–185.
Search in Google Scholar Back to article
D’Agostino, R. and Pearson, E.S. (1973). Tests for departure from normality. empirical results for the distributions of b² and √b¹, Biometrika 60(3): 613–622.
Search in Google Scholar Back to article
Dieuleveut, A., Durmus, A. and Bach, F. (2017). Bridging the gap between constant step size stochastic gradient descent and markov chains, arXiv: 1707.06386.
Search in Google Scholar Back to article
Elisseeff, A., Evgeniou, T. and Pontil, M. (2005). Stability of randomized learning algorithms, Journal of Machine Learning Research 6: 55–79.
Search in Google Scholar Back to article
Feng, Y., Gao, T., Li, L., Liu, J.-G. and Lu, Y. (2020). Uniform-in-time weak error analysis for stochastic gradient descent algorithms via diffusion approximation, Communications in Mathematical Sciences 18(1).
Search in Google Scholar Back to article
Hardt, M., Recht, B. and Singer, Y. (2016). Train faster, generalize better: Stability of stochastic gradient descent, in M.F. Balcan and K.Q.Weinberger (Eds), Proceedings of The 33rd International Conference on Machine Learning, New York, USA, pp. 1225–1234.
Search in Google Scholar Back to article
He, F., Liu, T. and Tao, D. (2019). Control batch size and learning rate to generalize well: Theoretical and empirical evidence, in H.Wallach et al. (Eds), Advances in Neural Information Processing Systems, Vol. 32, Curran Associates, New York, pp. 1143–1152.
Search in Google Scholar Back to article
He, K., Zhang, X., Ren, S. and Sun, J. (2015). Delving deep into rectifiers: Surpassing human-level performance on imagenet classification, Poceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 1026–1034.
Search in Google Scholar Back to article
Jacob, B., Kligys, S., Chen, B., Zhu, M., Tang, M., Howard, A., Adam, H. and Kalenichenko, D. (2018). Quantization and training of neural networks for efficient integer-arithmetic-only inference, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2704–2713, DOI:10.1109/CVPR.2018.00286.
Search in Google Scholar Back to article
Kramers, H. (1940). Brownian motion in a field of force and the diffusion model of chemical reactions, Physica 7(4): 284–304.
Search in Google Scholar Back to article
Krizhevsky, A. and Hinton, G. (2009). LearningMultiple Layers of Features from Tiny Images, Master’s thesis, Department of Computer Science, University of Toronto.
Search in Google Scholar Back to article
LeCun, Y. and Cortes, C. (2010). MNIST handwritten digit database, http://yann.lecun.com/exdb/mnist/.
Search in Google Scholar Back to article
Li, H., Xu, Z., Taylor, G., Studer, C. and Goldstein, T. (2018). Visualizing the loss landscape of neural nets, in S. Bengio et al. (Eds), Advances in Neural Information Processing Systems, Vol. 31, Curran Associates, New York, pp. 6389–6399.
Search in Google Scholar Back to article
Li, J., Luo, X. and Qiao, M. (2020). On generalization error bounds of noisy gradient methods for non-convex learning, arXiv: 1902.00621.
Search in Google Scholar Back to article
Li, Q., Tai, C. and Weinan, E. (2019). Stochastic modified equations and dynamics of stochastic gradient algorithms i: Mathematical foundations, Journal of Machine Learning Research 20(40): 1–47.
Search in Google Scholar Back to article
Li, X., Lu, J.,Wang, Z., Haupt, J. and Zhao, T. (2019). On tighter generalization bound for deep neural networks: CNNs, ResNets, and beyond, arXiv: 1806.05159.
Search in Google Scholar Back to article
Ljung, L., Pflug, G. and Walk, H. (1992). Stochastic Approximation and Optimization of Random Systems, Birkhäuser, Basel, Switzerland.
Search in Google Scholar Back to article
London, B., Huang, B., Taskar, B. and Getoor, L. (2014). PAC-Bayesian Collective Stability, in S. Kaski and J. Corander (Eds), Proceedings of the 17th International Conference on Artificial Intelligence and Statistics, Reykjavik, Iceland, pp. 585–594.
Search in Google Scholar Back to article
Mandt, S., Hoffman, M.D. and Blei, D.M. (2017). Stochastic gradient descent as approximate Bayesian inference, Journal of Machine Learning Research 18(1): 4873–4907.
Search in Google Scholar Back to article
McAllester, D.A. (1999). PAC-Bayesian model averaging, Proceedings of the 12th Annual Conference on Computational Learning Theory, COLT ’99, New York, NY, USA, pp. 164–170, DOI: 10.1145/307400.307435.
Search in Google Scholar Back to article
Negrea, J., Haghifam, M., Dziugaite, G. K., Khisti, A. and Roy, D. M. (2019). Information-theoretic generalization bounds for sgld via data-dependent estimates, in H. Wallach et al. (Eds), Advances in Neural Information Processing Systems, Vol. 32, Curran Associates, New York, pp. 11015–11025.
Search in Google Scholar Back to article
Panigrahi, A., Somani, R., Goyal, N. and Netrapalli, P. (2019). Non-gaussianity of stochastic gradient noise, arXiv: 1910.09626.
Search in Google Scholar Back to article
Qian, Y., Wang, Y., Wang, B., Gu, Z., Guo, Y. and Swaileh, W. (2022). Hessian-free second-order adversarial examples for adversarial learning, arXiv: 2207.01396.
Search in Google Scholar Back to article
Rong, Y., Huang, W., Xu, T. and Huang, J. (2020). DropEdge: Towards deep graph convolutional networks on node classification, arXiv: 1907.10903.
Search in Google Scholar Back to article
Shalev-Shwartz, S. and Ben-David, S. (2014). Understanding Machine Learning: From Theory to Algorithms, Cambridge University Press, International Conference on Learning Representations 2020, https://openreview.net/pdf?id=Hkx1qkrKPr.
Search in Google Scholar Back to article
Simsekli, U., Sagun, L. and Gurbuzbalaban, M. (2019). A tail-index analysis of stochastic gradient noise in deep neural networks, in K. Chaudhuri and R. Salakhutdinov (Eds), Proceedings of the 36th International Conference on Machine Learning, New York, pp. 5827–5837.
Search in Google Scholar Back to article
Sulaiman, I.M., Kaelo, P., Khalid, R. and Nawawi, M.K.M. (2024). A descent generalized rmil spectral gradient algorithm for optimization problems, Internationl Jouranl of Applied Mathematics and Computer Science 34(2): 225–233, DOI: 10.61822/amcs-2024-0016.
Search in Google Scholar Back to article
Sutskever, I., Martens, J., Dahl, G. and Hinton, G. (2013). On the importance of initialization and momentum in deep learning, in S. Dasgupta and D. McAllester (Eds), Proceedings of the 30th International Conference on Machine Learning, Atlanta, USA, pp. 1139–1147.
Search in Google Scholar Back to article
Villani, C. (2008). Optimal Transport: Old and New, Grundlehren der Mathematischen Wissenschaften, Springer, Berlin/Heidelberg.
Search in Google Scholar Back to article
Weinan, E., Ma, C. and Wang, Q. (2019). A priori estimates of the population risk for residual networks, arXiv: 1903.02154.
Search in Google Scholar Back to article
Welling, M. and Teh, Y.W. (2011). Bayesian learning via stochastic gradient Langevin dynamics, Proceedings of the 28th International Conference on International Conference on Machine Learning, ICML’11, Omnipress, Madison, WI, USA, p. 681–688.
Search in Google Scholar Back to article
Wu, J., Hu, W., Xiong, H., Huan, J., Braverman, V. and Zhu, Z. (2020). On the noisy gradient descent that generalizes as SGD, International Conference on Machine Learning, PMLR 119: 10367–10376.
Search in Google Scholar Back to article
Zhang, C., Bengio, S., Hardt, M., Recht, B. and Vinyals, O. (2016). Understanding deep learning requires rethinking generalization, arXiv: 1611.03530.
Search in Google Scholar Back to article

The Generalization Error Bound for A Stochastic Gradient Descent Family bia A Gaussian Approximation Method

Abstract

Paradigm

My account