Improved Convergence in Deep Neural Networks using a Modified Adaptive Moment Gradient Thresholding Algorithm
DOI: https://doi.org/10.63561/jca.v2i4.1069

Keywords: Adaptive Optimization, Gradient Descent, Deep Neural Networks, Convergence Analysis, Momentum

Abstract
This study introduces the Adaptive Moment Gradient Thresholding (AMGT) algorithm, a modified version of the Adam optimizer, aimed at enhancing convergence stability in deep neural networks. By leveraging optimization theory and addressing the limitations of Adam, AMGT was designed to tackle non-convexity, constrained environments, and gradient-based learning instability. The algorithm incorporates a diminishing step size schedule and momentum thresholding to improve performance. Theoretical analysis demonstrated that AMGT achieved linear convergence under strong convexity with a rate of O(k^(-μ/2)), global convergence under bounded gradient approximation errors, and convergence to stationary points in non-convex scenarios. Numerical experiments on convex quadratic functions validated the theoretical predictions, highlighting the algorithm’s sensitivity to spectral properties and resilience to learning rate variations. The results indicate that AMGT surpasses standard Adam in convergence behaviour and provides theoretical guarantees often lacking in adaptive optimizers. AMGT is particularly effective in high-dimensional, noisy, or resource-constrained settings due to its support for quantized and sparsified updates. By combining theoretical rigour with empirical robustness, AMGT emerges as a dependable option for training deep learning models across diverse optimization landscapes.
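For orientation, a minimal sketch of an AMGT-style update step is given below. It combines Adam's first- and second-moment estimates with a thresholded (clipped) momentum term and a diminishing step-size schedule, as described in the abstract. The specific choices here (the clipping threshold tau, the 1/sqrt(k) decay, and the amgt_step interface) are illustrative assumptions rather than the paper's exact algorithm, and the quantized/sparsified update variants are omitted.

```python
import numpy as np

def amgt_step(theta, grad, state, base_lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-8, tau=1.0):
    """One AMGT-style parameter update (illustrative sketch only)."""
    k = state["k"] + 1
    m = beta1 * state["m"] + (1 - beta1) * grad       # Adam-style first moment
    v = beta2 * state["v"] + (1 - beta2) * grad ** 2  # Adam-style second moment
    m_hat = m / (1 - beta1 ** k)                      # bias correction
    v_hat = v / (1 - beta2 ** k)
    m_hat = np.clip(m_hat, -tau, tau)                 # assumed momentum thresholding rule
    lr_k = base_lr / np.sqrt(k)                       # assumed diminishing step-size schedule
    theta = theta - lr_k * m_hat / (np.sqrt(v_hat) + eps)
    return theta, {"k": k, "m": m, "v": v}

# Illustrative use on a convex quadratic f(x) = 0.5 * x^T A x, mirroring the
# kind of test problem described in the abstract.
A = np.diag([1.0, 10.0])
x = np.array([5.0, -3.0])
state = {"k": 0, "m": np.zeros(2), "v": np.zeros(2)}
for _ in range(2000):
    x, state = amgt_step(x, A @ x, state, base_lr=0.5)
print(x)  # iterates approach the minimizer at the origin
```

The diagonal matrix A stands in for the spectral properties discussed in the experiments: increasing the ratio of its entries (the condition number) slows progress along the flat direction, which is the sensitivity the abstract refers to.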
References
Alistarh, D., Grubic, D., Li, J., Tomioka, R., & Vojnovic, M. (2017). QSGD: Communication-efficient SGD via gradient quantization and encoding. Advances in Neural Information Processing Systems, 30, 1–12.
Bottou, L., Curtis, F. E., & Nocedal, J. (2018). Optimization methods for large-scale machine learning. SIAM Review, 60(2), 223–311. https://doi.org/10.1137/16M1080173
Cauchy, A. L. (1847). Méthode générale pour la résolution des systèmes d'équations simultanées. Comptes Rendus, 25, 536–538.
Chen, J., Zhang, H., Xu, X., & Yin, W. (2018). On the convergence of a class of Adam-type algorithms for non-convex optimization. arXiv preprint arXiv:1808.02941. https://arxiv.org/abs/1808.02941.
Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT (4171–4186). https://doi.org/10.48550/arXiv.1810.04805.
Duchi, J., Hazan, E., & Singer, Y. (2011). Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12, 2121–2159.
Ghadimi, S., & Lan, G. (2013). Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4), 2341–2368. https://doi.org/10.1137/130905661.
Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep learning. MIT Press.
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (770–778). https://doi.org/10.1109/CVPR.2016.90.
Kingma, D. P., & Ba, J. (2015). Adam: A method for stochastic optimization. International Conference on Learning Representations (ICLR). https://arxiv.org/abs/1412.6980.
Laisin, M., Edike, C., & Osu, B. O. (2024). The construction of rational polyhedron on an ???? × ???? board with some application on integral polyhedral. TIJER, 11(11). www.tijer.org
Laisin, M., & Edike, C. (2025a). Hybrid optimization with integer constraints: Modeling and solving problems using simplex techniques. Global Online Journal of Academic Research (GOJAR), 4(2), 22–38. https://klamidas.com/gojar-v4n2-2025-02/
Laisin, M., & Adigwe, R. U. (2025b). Implementation and comparative analysis of AMGT method in Maple 24: Convergence performance in optimization problems. Global Online Journal of Academic Research (GOJAR), 4(52), 26–40. https://klamidas.com/gojar-v4n1-2025-02/
Laisin, M., Edike, C., & Ujumadu, R. N. (2025c). Characterizing boundedness and solution size in rational linear programming and polyhedral optimization. Global Online Journal of Academic Research (GOJAR), 4(2), 63–76. https://klamidas.com/gojar-v4n2-2025-04/
Laisin, M., & Adigwe, R. U. (2025d). Gradient descent convergence: From convex optimization to deep learning. SOLVANGLE, 1(1), 7–26. https://klamidas.com/solvangle-v1n1-2025-01/
Liu, L., Jiang, H., He, P., Chen, W., Liu, X., Gao, J., & Han, J. (2020). On the variance of the adaptive learning rate and beyond. In Proceedings of the 8th International Conference on Learning Representations (ICLR). https://arxiv.org/abs/1908.03265.
Luo, L., Xiong, Y., Liu, Y., & Sun, X. (2019). Adaptive gradient methods with dynamic bound of learning rate. In Proceedings of the 7th International Conference on Learning Representations (ICLR). https://arxiv.org/abs/1902.09843.
Nedic, A., & Bertsekas, D. P. (2001). Incremental subgradient methods for nondifferentiable optimization. SIAM Journal on Optimization, 12(1), 109–138. https://doi.org/10.1137/S1052623499362822.
Nesterov, Y. E. (1983). A method for solving the convex programming problem with convergence rate O(1/k^2). Soviet Mathematics Doklady, 27(2), 372–376.
Nesterov, Y. (2004). Introductory lectures on convex optimization: A basic course. Springer Science & Business Media.
Polyak, B. T. (1964). Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5), 1–17.
Reddi, S. J., Kale, S., & Kumar, S. (2018). On the convergence of Adam and beyond. In Proceedings of ICLR. https://doi.org/10.48550/arXiv.1904.09237.
Robbins, H., & Monro, S. (1951). A stochastic approximation method. The Annals of Mathematical Statistics, 22(3), 400–407. https://doi.org/10.1214/aoms/1177729586.
Schmidt, M., Le Roux, N., & Bach, F. (2011). Convergence rates of inexact proximal-gradient methods for convex optimization. Advances in Neural Information Processing Systems, 24. Curran Associates, Inc.
Tieleman, T., & Hinton, G. (2012). Lecture 6.5—RMSProp: Divide the gradient by a running average of its recent magnitude. In COURSERA: Neural Networks for Machine Learning. https://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf.
Wilson, A. C., Roelofs, R., Stern, M., Srebro, N., & Recht, B. (2017). The marginal value of adaptive gradient methods in machine learning. In Proceedings of NeurIPS (4148–4158). https://proceedings.neurips.cc/paper_files/paper/2017/file/5d44ee6f2c3f71b73125876103c8f6c4-Paper.pdf.
You, Y., Reddi, S., Hseu, J., Kumar, S., Bhojanapalli, S., Song, X., & Hsieh, C.-J. (2020). Large batch optimization for deep learning: Training BERT in 76 minutes. International Conference on Learning Representations (ICLR). https://arxiv.org/abs/1904.00962.
Zhuang, J., Tang, T., Ding, Y., Tatikonda, S., Dvornek, N., Papademetris, X., & Duncan, J. S. (2020). AdaBelief optimizer: Adapting stepsizes by the belief in observed gradients. Advances in Neural Information Processing Systems, 33, 18795–18806. https://proceedings.neurips.cc/paper/2020/hash/1ede3d44f3efc4098a5a5ea0f4f74c30-Abstract.html.


