Improved Convergence in Deep Neural Networks using a Modified Adaptive Moment Gradient Thresholding Algorithm
DOI: https://doi.org/10.63561/jca.v2i4.1069

Keywords: Adaptive Optimization, Gradient Descent, Deep Neural Networks, Convergence Analysis, Momentum

Abstract
This study introduces the Adaptive Moment Gradient Thresholding (AMGT) algorithm, a modified version of the Adam optimizer, aimed at enhancing convergence stability in deep neural networks. By leveraging optimization theory and addressing the limitations of Adam, AMGT was designed to tackle non-convexity, constrained environments, and gradient-based learning instability. The algorithm incorporates a diminishing step size schedule and momentum thresholding to improve performance. Theoretical analysis demonstrated that AMGT achieved linear convergence under strong convexity with a rate of O(k^(-μ/2)), global convergence under bounded gradient approximation errors, and convergence to stationary points in non-convex scenarios. Numerical experiments on convex quadratic functions validated the theoretical predictions, highlighting the algorithm’s sensitivity to spectral properties and resilience to learning rate variations. The results indicate that AMGT surpasses standard Adam in convergence behaviour and provides theoretical guarantees often lacking in adaptive optimizers. AMGT is particularly effective in high-dimensional, noisy, or resource-constrained settings due to its support for quantized and sparsified updates. By combining theoretical rigour with empirical robustness, AMGT emerges as a dependable option for training deep learning models across diverse optimization landscapes.
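For orientation, a minimal sketch of an AMGT-style update step is given below. It combines Adam's first- and second-moment estimates with a thresholded (clipped) momentum term and a diminishing step-size schedule, as described in the abstract. The specific choices here (the clipping threshold tau, the 1/sqrt(k) decay, and the amgt_step interface) are illustrative assumptions rather than the paper's exact algorithm, and the quantized/sparsified update variants are omitted.

```python
import numpy as np

def amgt_step(theta, grad, state, base_lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-8, tau=1.0):
    """One AMGT-style parameter update (illustrative sketch only)."""
    k = state["k"] + 1
    m = beta1 * state["m"] + (1 - beta1) * grad       # Adam-style first moment
    v = beta2 * state["v"] + (1 - beta2) * grad ** 2  # Adam-style second moment
    m_hat = m / (1 - beta1 ** k)                      # bias correction
    v_hat = v / (1 - beta2 ** k)
    m_hat = np.clip(m_hat, -tau, tau)                 # assumed momentum thresholding rule
    lr_k = base_lr / np.sqrt(k)                       # assumed diminishing step-size schedule
    theta = theta - lr_k * m_hat / (np.sqrt(v_hat) + eps)
    return theta, {"k": k, "m": m, "v": v}

# Illustrative use on a convex quadratic f(x) = 0.5 * x^T A x, mirroring the
# kind of test problem described in the abstract.
A = np.diag([1.0, 10.0])
x = np.array([5.0, -3.0])
state = {"k": 0, "m": np.zeros(2), "v": np.zeros(2)}
for _ in range(2000):
    x, state = amgt_step(x, A @ x, state, base_lr=0.5)
print(x)  # iterates approach the minimizer at the origin
```

The diagonal matrix A stands in for the spectral properties discussed in the experiments: increasing the ratio of its entries (the condition number) slows progress along the flat direction, which is the sensitivity the abstract refers to.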
References
Alistarh, D., Grubic, D., Li, J., Tomioka, R., & Vojnovic, M. (2017). QSGD: Communication-efficient SGD via gradient quantization and encoding. Advances in Neural Information Processing Systems, 30, 1–12.
Bottou, L., Curtis, F. E., & Nocedal, J. (2018). Optimization methods for large-scale machine learning. SIAM Review, 60(2), 223–311. https://doi.org/10.1137/16M1080173
Cauchy, A. L. (1847). Méthode générale pour la résolution des systèmes d'équations simultanées. Comptes Rendus, 25, 536–538.
Chen, J., Zhang, H., Xu, X., & Yin, W. (2018). On the convergence of a class of Adam-type algorithms for non-convex optimization. arXiv preprint arXiv:1808.02941. https://arxiv.org/abs/1808.02941.
Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT (4171–4186). https://doi.org/10.48550/arXiv.1810.04805.
Duchi, J., Hazan, E., & Singer, Y. (2011). Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12, 2121–2159.
Ghadimi, S., & Lan, G. (2013). Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4), 2341–2368. https://doi.org/10.1137/130905661.
Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep learning. MIT Press.
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (770–778). https://doi.org/10.1109/CVPR.2016.90.
Kingma, D. P., & Ba, J. (2015). Adam: A method for stochastic optimization. International Conference on Learning Representations (ICLR). https://arxiv.org/abs/1412.6980.
Laisin, M., Edike, C., & Osu, B. O. (2024). The construction of rational polyhedron on an ???? × ???? board with some application on integral polyhedral. TIJER, 11(11). www.tijer.org
Laisin, M., & Edike, C. (2025a). Hybrid optimization with integer constraints: Modeling and solving problems using simplex techniques. Global Online Journal of Academic Research (GOJAR), 4(2), 22–38. https://klamidas.com/gojar-v4n2-2025-02/
Laisin, M., & Adigwe, R. U. (2025b). Implementation and comparative analysis of AMGT method in Maple 24: Convergence performance in optimization problems. Global Online Journal of Academic Research (GOJAR), 4(52), 26–40. https://klamidas.com/gojar-v4n1-2025-02/
Laisin, M., Edike, C., & Ujumadu, R. N. (2025c). Characterizing boundedness and solution size in rational linear programming and polyhedral optimization. Global Online Journal of Academic Research (GOJAR), 4(2), 63–76. https://klamidas.com/gojar-v4n2-2025-04/
Laisin, M., & Adigwe, R. U. (2025d). Gradient descent convergence: From convex optimization to deep learning. SOLVANGLE, 1(1), 7–26. https://klamidas.com/solvangle-v1n1-2025-01/
Liu, L., Jiang, H., He, P., Chen, W., Liu, X., Gao, J., & Han, J. (2020). On the variance of the adaptive learning rate and beyond. In Proceedings of the 8th International Conference on Learning Representations (ICLR). https://arxiv.org/abs/1908.03265.
Luo, L., Xiong, Y., Liu, Y., & Sun, X. (2019). Adaptive gradient methods with dynamic bound of learning rate. In Proceedings of the 7th International Conference on Learning Representations (ICLR). https://arxiv.org/abs/1902.09843.
Nedic, A., & Bertsekas, D. P. (2001). Incremental subgradient methods for nondifferentiable optimization. SIAM Journal on Optimization, 12(1), 109–138. https://doi.org/10.1137/S1052623499362822.
Nesterov, Y. E. (1983). A method for solving the convex programming problem with convergence rate O(1/k^2). Soviet Mathematics Doklady, 27(2), 372–376.
Nesterov, Y. (2004). Introductory lectures on convex optimization: A basic course. Springer Science & Business Media.
Polyak, B. T. (1964). Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5), 1–17.
Reddi, S. J., Kale, S., & Kumar, S. (2018). On the convergence of Adam and beyond. In Proceedings of ICLR. https://doi.org/10.48550/arXiv.1904.09237.
Robbins, H., & Monro, S. (1951). A stochastic approximation method. The Annals of Mathematical Statistics, 22(3), 400–407. https://doi.org/10.1214/aoms/1177729586.
Schmidt, M., Le Roux, N., & Bach, F. (2011). Convergence rates of inexact proximal-gradient methods for convex optimization. Advances in Neural Information Processing Systems, 24. Curran Associates, Inc.
Tieleman, T., & Hinton, G. (2012). Lecture 6.5—RMSProp: Divide the gradient by a running average of its recent magnitude. In COURSERA: Neural Networks for Machine Learning. https://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf.
Wilson, A. C., Roelofs, R., Stern, M., Srebro, N., & Recht, B. (2017). The marginal value of adaptive gradient methods in machine learning. In Proceedings of NeurIPS (4148–4158). https://proceedings.neurips.cc/paper_files/paper/2017/file/5d44ee6f2c3f71b73125876103c8f6c4-Paper.pdf.
You, Y., Reddi, S., Hseu, J., Kumar, S., Bhojanapalli, S., Song, X., & Hsieh, C.-J. (2020). Large batch optimization for deep learning: Training BERT in 76 minutes. International Conference on Learning Representations (ICLR). https://arxiv.org/abs/1904.00962.
Zhuang, J., Tang, T., Ding, Y., Tatikonda, S., Dvornek, N., Papademetris, X., & Duncan, J. S. (2020). AdaBelief optimizer: Adapting stepsizes by the belief in observed gradients. Advances in Neural Information Processing Systems, 33, 18795–18806. https://proceedings.neurips.cc/paper/2020/hash/1ede3d44f3efc4098a5a5ea0f4f74c30-Abstract.html.


