IEEE/CAA Journal of Automatica Sinica
Citation: X. L. Yi, S. J. Zhang, T. Yang, T. Y. Chai, and K. H. Johansson, “A primal-dual SGD algorithm for distributed nonconvex optimization,” IEEE/CAA J. Autom. Sinica, vol. 9, no. 5, pp. 812–833, May 2022. doi: 10.1109/JAS.2022.105554
[1] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, M. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, Q. V. Le, and A. Y. Ng, “Large scale distributed deep networks,” in Advances in Neural Information Processing Systems, 2012, pp. 1223–1231.
[2] H. B. McMahan, E. Moore, D. Ramage, et al., “Communication-efficient learning of deep networks from decentralized data,” in Proc. Int. Conf. Artificial Intelligence and Statistics, 2017, pp. 1273–1282.
[3] T. Tatarenko and B. Touri, “Non-convex distributed optimization,” IEEE Trans. Automatic Control, vol. 62, no. 8, pp. 3744–3757, 2017. doi: 10.1109/TAC.2017.2648041
[4] M. Hong, D. Hajinezhad, and M.-M. Zhao, “Prox-PDA: The proximal primal-dual algorithm for fast distributed nonconvex optimization and learning over networks,” in Proc. Int. Conf. Machine Learning, 2017, pp. 1529–1538.
[5] A. Daneshmand, G. Scutari, and V. Kungurtsev, “Second-order guarantees of gradient algorithms over networks,” in Proc. Annual Allerton Conf. Communication, Control, and Computing, 2018, pp. 359–365.
[6] H. Sun and M. Hong, “Distributed non-convex first-order optimization and information processing: Lower complexity bounds and rate optimal algorithms,” arXiv preprint arXiv:1804.02729, 2018.
[7] D. Hajinezhad and M. Hong, “Perturbed proximal primal-dual algorithm for nonconvex nonsmooth optimization,” Mathematical Programming, vol. 176, no. 1, pp. 207–245, 2019.
[8] H. Sun and M. Hong, “Distributed non-convex first-order optimization and information processing: Lower complexity bounds and rate optimal algorithms,” IEEE Trans. Signal Processing, vol. 67, no. 22, pp. 5912–5928, 2019. doi: 10.1109/TSP.2019.2943230
[9] H.-T. Wai, J. Lafond, A. Scaglione, and E. Moulines, “Decentralized Frank-Wolfe algorithm for convex and nonconvex problems,” IEEE Trans. Automatic Control, vol. 62, no. 11, pp. 5522–5537, 2017. doi: 10.1109/TAC.2017.2685559
[10] J. Langford, L. Li, and T. Zhang, “Sparse online learning via truncated gradient,” Journal of Machine Learning Research, vol. 10, no. 3, pp. 777–801, 2009.
[11] B. Recht, C. Re, S. Wright, and F. Niu, “Hogwild!: A lock-free approach to parallelizing stochastic gradient descent,” in Advances in Neural Information Processing Systems, 2011, pp. 693–701.
[12] C. M. De Sa, C. Zhang, K. Olukotun, and C. Ré, “Taming the wild: A unified analysis of hogwild-style algorithms,” in Advances in Neural Information Processing Systems, 2015, pp. 2674–2682.
[13] X. Lian, Y. Huang, Y. Li, and J. Liu, “Asynchronous parallel stochastic gradient for nonconvex optimization,” in Advances in Neural Information Processing Systems, 2015, pp. 2737–2745.
[14] X. Lian, H. Zhang, C.-J. Hsieh, Y. Huang, and J. Liu, “A comprehensive linear speedup analysis for asynchronous stochastic parallel optimization from zeroth-order to first-order,” in Advances in Neural Information Processing Systems, 2016, pp. 3054–3062.
[15] Z. Zhou, P. Mertikopoulos, N. Bambos, P. Glynn, Y. Ye, L.-J. Li, and F.-F. Li, “Distributed asynchronous optimization with unbounded delays: How slow can you go?” in Proc. Int. Conf. Machine Learning, 2018, pp. 5970–5979.
[16] J. Bernstein, Y.-X. Wang, K. Azizzadenesheli, and A. Anandkumar, “SignSGD: Compressed optimisation for non-convex problems,” in Proc. Int. Conf. Machine Learning, 2018, pp. 560–569.
[17] P. Jiang and G. Agrawal, “A linear speedup analysis of distributed deep learning with sparse and quantized communication,” in Advances in Neural Information Processing Systems, 2018, pp. 2525–2536.
[18] A. Reisizadeh, A. Mokhtari, H. Hassani, A. Jadbabaie, and R. Pedarsani, “FedPAQ: A communication-efficient federated learning method with periodic averaging and quantization,” in Proc. Int. Conf. Artificial Intelligence and Statistics, 2020, pp. 2021–2031.
[19] D. Basu, D. Data, C. Karakus, and S. Diggavi, “Qsparse-local-SGD: Distributed SGD with quantization, sparsification and local computations,” in Advances in Neural Information Processing Systems, 2019, pp. 14668–14679.
[20] J. Wang and G. Joshi, “Adaptive communication strategies to achieve the best error-runtime trade-off in local-update SGD,” in Proc. Conf. Machine Learning and Systems, 2019, pp. 212–229.
[21] H. Yu, S. Yang, and S. Zhu, “Parallel restarted SGD with faster convergence and less communication: Demystifying why model averaging works for deep learning,” in Proc. AAAI Conf. Artificial Intelligence, 2019, pp. 5693–5700.
[22] F. Haddadpour, M. M. Kamani, M. Mahdavi, and V. Cadambe, “Trading redundancy for communication: Speeding up distributed SGD for nonconvex optimization,” in Proc. Int. Conf. Machine Learning, 2019, pp. 2545–2554.
[23] H. Yu, R. Jin, and S. Yang, “On the linear speedup analysis of communication efficient momentum SGD for distributed non-convex optimization,” in Proc. Int. Conf. Machine Learning, 2019, pp. 7184–7193.
[24] F. Haddadpour, M. M. Kamani, M. Mahdavi, and V. Cadambe, “Local SGD with periodic averaging: Tighter analysis and adaptive synchronization,” in Advances in Neural Information Processing Systems, 2019, pp. 11080–11092.
[25] H. Yu and R. Jin, “On the computation and communication complexity of parallel SGD with dynamic batch sizes for stochastic non-convex optimization,” in Proc. Int. Conf. Machine Learning, 2019, pp. 7174–7183.
[26] Z. Jiang, A. Balu, C. Hegde, and S. Sarkar, “Collaborative deep learning in fixed topology networks,” in Advances in Neural Information Processing Systems, 2017, pp. 5904–5914.
[27] X. Lian, C. Zhang, H. Zhang, C.-J. Hsieh, W. Zhang, and J. Liu, “Can decentralized algorithms outperform centralized algorithms? A case study for decentralized parallel stochastic gradient descent,” in Advances in Neural Information Processing Systems, 2017, pp. 5330–5340.
[28] J. George, T. Yang, H. Bai, and P. Gurram, “Distributed stochastic gradient method for non-convex problems with applications in supervised learning,” in Proc. IEEE Conf. Decision and Control, 2019, pp. 5538–5543.
[29] X. Lian, W. Zhang, C. Zhang, and J. Liu, “Asynchronous decentralized parallel stochastic gradient descent,” in Proc. Int. Conf. Machine Learning, 2018, pp. 3043–3052.
[30] M. Assran, N. Loizou, N. Ballas, and M. Rabbat, “Stochastic gradient push for distributed deep learning,” in Proc. Int. Conf. Machine Learning, 2019, pp. 344–353.
[31] H. Tang, S. Gan, C. Zhang, T. Zhang, and J. Liu, “Communication compression for decentralized training,” in Advances in Neural Information Processing Systems, 2018, pp. 7652–7662.
[32] A. Reisizadeh, H. Taheri, A. Mokhtari, H. Hassani, and R. Pedarsani, “Robust and communication-efficient collaborative learning,” in Advances in Neural Information Processing Systems, 2019, pp. 8386–8397.
[33] H. Taheri, A. Mokhtari, H. Hassani, and R. Pedarsani, “Quantized decentralized stochastic learning over directed graphs,” in Proc. Int. Conf. Machine Learning, 2020, pp. 9324–9333.
[34] N. Singh, D. Data, J. George, and S. Diggavi, “SQuARM-SGD: Communication-efficient momentum SGD for decentralized optimization,” IEEE Journal on Selected Areas in Information Theory, vol. 2, no. 3, pp. 954–969, 2021. doi: 10.1109/JSAIT.2021.3103920
[35] J. Wang and G. Joshi, “Cooperative SGD: A unified framework for the design and analysis of communication-efficient SGD algorithms,” Journal of Machine Learning Research, vol. 22, no. 213, pp. 1–50, 2021.
[36] H. Tang, X. Lian, M. Yan, C. Zhang, and J. Liu, “D2: Decentralized training over decentralized data,” in Proc. Int. Conf. Machine Learning, 2018, pp. 4848–4856.
[37] S. Lu, X. Zhang, H. Sun, and M. Hong, “GNSD: A gradient-tracking based nonconvex stochastic algorithm for decentralized optimization,” in Proc. IEEE Data Science Workshop, 2019, pp. 315–321.
[38] J. Zhang and K. You, “Decentralized stochastic gradient tracking for empirical risk minimization,” arXiv preprint arXiv:1909.02712, 2019.
[39] S. U. Stich, “Local SGD converges fast and communicates little,” in Proc. Int. Conf. Learning Representations, 2019.
[40] A. Koloskova, S. Stich, and M. Jaggi, “Decentralized stochastic optimization and gossip algorithms with compressed communication,” in Proc. Int. Conf. Machine Learning, 2019, pp. 3478–3487.
[41] S. Pu, A. Olshevsky, and I. C. Paschalidis, “A sharp estimate on the transient time of distributed stochastic gradient descent,” IEEE Trans. Automatic Control, 2021. doi: 10.1109/TAC.2021.3126253
[42] M. Rabbat, “Multi-agent mirror descent for decentralized stochastic optimization,” in Proc. Int. Workshop on Computational Advances in Multi-Sensor Adaptive Processing, 2015, pp. 517–520.
[43] G. Lan, S. Lee, and Y. Zhou, “Communication-efficient algorithms for decentralized and stochastic optimization,” Mathematical Programming, vol. 180, no. 1, pp. 237–284, 2020.
[44] D. Yuan, Y. Hong, D. W. Ho, and G. Jiang, “Optimal distributed stochastic mirror descent for strongly convex optimization,” Automatica, vol. 90, pp. 196–203, 2018. doi: 10.1016/j.automatica.2017.12.053
[45] D. Jakovetic, D. Bajovic, A. K. Sahu, and S. Kar, “Convergence rates for distributed stochastic optimization over random networks,” in Proc. IEEE Conf. Decision and Control, 2018, pp. 4238–4245.
[46] A. Fallah, M. Gurbuzbalaban, A. Ozdaglar, U. Simsekli, and L. Zhu, “Robust distributed accelerated stochastic gradient methods for multiagent networks,” arXiv preprint arXiv:1910.08701, 2019.
[47] S. Pu and A. Garcia, “Swarming for faster convergence in stochastic optimization,” SIAM Journal on Control and Optimization, vol. 56, no. 4, pp. 2997–3020, 2018. doi: 10.1137/17M1111085
[48] S. Pu and A. Nedić, “A distributed stochastic gradient tracking method,” in Proc. IEEE Conf. Decision and Control, 2018, pp. 963–968.
[49] R. Xin, A. K. Sahu, U. A. Khan, and S. Kar, “Distributed stochastic optimization with gradient tracking over strongly-connected networks,” in Proc. IEEE Conf. Decision and Control, 2019, pp. 8353–8358.
[50] M. Mesbahi and M. Egerstedt, Graph Theoretic Methods in Multiagent Networks. Princeton University Press, 2010.
[51] A. Rakhlin, O. Shamir, and K. Sridharan, “Making gradient descent optimal for strongly convex stochastic optimization,” in Proc. Int. Conf. Machine Learning, 2012, pp. 1571–1578.
[52] H. Karimi, J. Nutini, and M. Schmidt, “Linear convergence of gradient and proximal-gradient methods under the Polyak-Łojasiewicz condition,” in Proc. Joint European Conf. Machine Learning and Knowledge Discovery in Databases, 2016, pp. 795–811.
[53] H. Zhang and L. Cheng, “Restricted strong convexity and its applications to convergence analysis of gradient-type methods in convex optimization,” Optimization Letters, vol. 9, no. 5, pp. 961–979, 2015. doi: 10.1007/s11590-014-0795-x
[54] Z. Li and J. Li, “A simple proximal stochastic gradient method for nonsmooth nonconvex optimization,” in Advances in Neural Information Processing Systems, 2018, pp. 5569–5579.
[55] M. Fazel, R. Ge, S. Kakade, and M. Mesbahi, “Global convergence of policy gradient methods for the linear quadratic regulator,” in Proc. Int. Conf. Machine Learning, 2018, pp. 1467–1476.
[56] W. Shi, Q. Ling, G. Wu, and W. Yin, “EXTRA: An exact first-order algorithm for decentralized consensus optimization,” SIAM Journal on Optimization, vol. 25, no. 2, pp. 944–966, 2015. doi: 10.1137/14096668X
[57] A. Nedić, A. Olshevsky, and W. Shi, “Achieving geometric convergence for distributed optimization over time-varying graphs,” SIAM Journal on Optimization, vol. 27, no. 4, pp. 2597–2633, 2017. doi: 10.1137/16M1084316
[58] H. Li and Z. Lin, “Revisiting EXTRA for smooth distributed optimization,” SIAM Journal on Optimization, vol. 30, no. 3, pp. 1795–1821, 2020. doi: 10.1137/18M122902X
[59] Y. LeCun, C. Cortes, and C. Burges, “MNIST handwritten digit database,” 2010. [Online]. Available: http://yann.lecun.com/exdb/mnist
[60] L. Bottou, “Stochastic gradient descent tricks,” in Neural Networks: Tricks of the Trade. Springer, 2012, pp. 421–436.
[61] Y. Nesterov, Lectures on Convex Optimization, 2nd ed. Springer Int. Publishing, 2018.
[62] Y. Tang, J. Zhang, and N. Li, “Distributed zero-order algorithms for nonconvex multiagent optimization,” IEEE Trans. Control of Network Systems, vol. 8, no. 1, pp. 269–281, 2020.
[63] X. Yi, L. Yao, T. Yang, J. George, and K. H. Johansson, “Distributed optimization for second-order multi-agent systems with dynamic event-triggered communication,” in Proc. IEEE Conf. Decision and Control, 2018, pp. 3397–3402.