IEEE/CAA Journal of Automatica Sinica
Citation: X. L. Yi, S. J. Zhang, T. Yang, T. Y. Chai, and K. H. Johansson, “A primal-dual SGD algorithm for distributed nonconvex optimization,” IEEE/CAA J. Autom. Sinica, vol. 9, no. 5, pp. 812–833, May 2022. doi: 10.1109/JAS.2022.105554
[1] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, M. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, Q. V. Le, and A. Y. Ng, “Large scale distributed deep networks,” in Advances in Neural Information Processing Systems, 2012, pp. 1223–1231.
[2] H. B. McMahan, E. Moore, D. Ramage, et al., “Communication-efficient learning of deep networks from decentralized data,” in Proc. Int. Conf. Artificial Intelligence and Statistics, 2017, pp. 1273–1282.
[3] T. Tatarenko and B. Touri, “Non-convex distributed optimization,” IEEE Trans. Automatic Control, vol. 62, no. 8, pp. 3744–3757, 2017. doi: 10.1109/TAC.2017.2648041
[4] M. Hong, D. Hajinezhad, and M.-M. Zhao, “Prox-PDA: The proximal primal-dual algorithm for fast distributed nonconvex optimization and learning over networks,” in Proc. Int. Conf. Machine Learning, 2017, pp. 1529–1538.
[5] A. Daneshmand, G. Scutari, and V. Kungurtsev, “Second-order guarantees of gradient algorithms over networks,” in Proc. Annual Allerton Conf. Communication, Control, and Computing, 2018, pp. 359–365.
[6] H. Sun and M. Hong, “Distributed non-convex first-order optimization and information processing: Lower complexity bounds and rate optimal algorithms,” arXiv preprint arXiv:1804.02729, 2018.
[7] D. Hajinezhad and M. Hong, “Perturbed proximal primal-dual algorithm for nonconvex nonsmooth optimization,” Mathematical Programming, vol. 176, no. 1, pp. 207–245, 2019.
[8] H. Sun and M. Hong, “Distributed non-convex first-order optimization and information processing: Lower complexity bounds and rate optimal algorithms,” IEEE Trans. Signal Processing, vol. 67, no. 22, pp. 5912–5928, 2019. doi: 10.1109/TSP.2019.2943230
[9] H.-T. Wai, J. Lafond, A. Scaglione, and E. Moulines, “Decentralized Frank-Wolfe algorithm for convex and nonconvex problems,” IEEE Trans. Automatic Control, vol. 62, no. 11, pp. 5522–5537, 2017. doi: 10.1109/TAC.2017.2685559
[10] J. Langford, L. Li, and T. Zhang, “Sparse online learning via truncated gradient,” Journal of Machine Learning Research, vol. 10, no. 3, pp. 777–801, 2009.
[11] B. Recht, C. Re, S. Wright, and F. Niu, “Hogwild!: A lock-free approach to parallelizing stochastic gradient descent,” in Advances in Neural Information Processing Systems, 2011, pp. 693–701.
[12] C. M. De Sa, C. Zhang, K. Olukotun, and C. Ré, “Taming the wild: A unified analysis of hogwild-style algorithms,” in Advances in Neural Information Processing Systems, 2015, pp. 2674–2682.
[13] X. Lian, Y. Huang, Y. Li, and J. Liu, “Asynchronous parallel stochastic gradient for nonconvex optimization,” in Advances in Neural Information Processing Systems, 2015, pp. 2737–2745.
[14] X. Lian, H. Zhang, C.-J. Hsieh, Y. Huang, and J. Liu, “A comprehensive linear speedup analysis for asynchronous stochastic parallel optimization from zeroth-order to first-order,” in Advances in Neural Information Processing Systems, 2016, pp. 3054–3062.
[15] Z. Zhou, P. Mertikopoulos, N. Bambos, P. Glynn, Y. Ye, L.-J. Li, and F.-F. Li, “Distributed asynchronous optimization with unbounded delays: How slow can you go?” in Proc. Int. Conf. Machine Learning, 2018, pp. 5970–5979.
[16] J. Bernstein, Y.-X. Wang, K. Azizzadenesheli, and A. Anandkumar, “SignSGD: Compressed optimisation for non-convex problems,” in Proc. Int. Conf. Machine Learning, 2018, pp. 560–569.
[17] P. Jiang and G. Agrawal, “A linear speedup analysis of distributed deep learning with sparse and quantized communication,” in Advances in Neural Information Processing Systems, 2018, pp. 2525–2536.
[18] A. Reisizadeh, A. Mokhtari, H. Hassani, A. Jadbabaie, and R. Pedarsani, “FedPAQ: A communication-efficient federated learning method with periodic averaging and quantization,” in Proc. Int. Conf. Artificial Intelligence and Statistics, 2020, pp. 2021–2031.
[19] D. Basu, D. Data, C. Karakus, and S. Diggavi, “Qsparse-local-SGD: Distributed SGD with quantization, sparsification and local computations,” in Advances in Neural Information Processing Systems, 2019, pp. 14668–14679.
[20] J. Wang and G. Joshi, “Adaptive communication strategies to achieve the best error-runtime trade-off in local-update SGD,” in Proc. Conf. Machine Learning and Systems, 2019, pp. 212–229.
[21] H. Yu, S. Yang, and S. Zhu, “Parallel restarted SGD with faster convergence and less communication: Demystifying why model averaging works for deep learning,” in Proc. AAAI Conf. Artificial Intelligence, 2019, pp. 5693–5700.
[22] F. Haddadpour, M. M. Kamani, M. Mahdavi, and V. Cadambe, “Trading redundancy for communication: Speeding up distributed SGD for nonconvex optimization,” in Proc. Int. Conf. Machine Learning, 2019, pp. 2545–2554.
[23] H. Yu, R. Jin, and S. Yang, “On the linear speedup analysis of communication efficient momentum SGD for distributed non-convex optimization,” in Proc. Int. Conf. Machine Learning, 2019, pp. 7184–7193.
[24] F. Haddadpour, M. M. Kamani, M. Mahdavi, and V. Cadambe, “Local SGD with periodic averaging: Tighter analysis and adaptive synchronization,” in Advances in Neural Information Processing Systems, 2019, pp. 11080–11092.
[25] H. Yu and R. Jin, “On the computation and communication complexity of parallel SGD with dynamic batch sizes for stochastic non-convex optimization,” in Proc. Int. Conf. Machine Learning, 2019, pp. 7174–7183.
[26] Z. Jiang, A. Balu, C. Hegde, and S. Sarkar, “Collaborative deep learning in fixed topology networks,” in Advances in Neural Information Processing Systems, 2017, pp. 5904–5914.
[27] X. Lian, C. Zhang, H. Zhang, C.-J. Hsieh, W. Zhang, and J. Liu, “Can decentralized algorithms outperform centralized algorithms? A case study for decentralized parallel stochastic gradient descent,” in Advances in Neural Information Processing Systems, 2017, pp. 5330–5340.
[28] J. George, T. Yang, H. Bai, and P. Gurram, “Distributed stochastic gradient method for non-convex problems with applications in supervised learning,” in Proc. IEEE Conf. Decision and Control, 2019, pp. 5538–5543.
[29] X. Lian, W. Zhang, C. Zhang, and J. Liu, “Asynchronous decentralized parallel stochastic gradient descent,” in Proc. Int. Conf. Machine Learning, 2018, pp. 3043–3052.
[30] M. Assran, N. Loizou, N. Ballas, and M. Rabbat, “Stochastic gradient push for distributed deep learning,” in Proc. Int. Conf. Machine Learning, 2019, pp. 344–353.
[31] H. Tang, S. Gan, C. Zhang, T. Zhang, and J. Liu, “Communication compression for decentralized training,” in Advances in Neural Information Processing Systems, 2018, pp. 7652–7662.
[32] A. Reisizadeh, H. Taheri, A. Mokhtari, H. Hassani, and R. Pedarsani, “Robust and communication-efficient collaborative learning,” in Advances in Neural Information Processing Systems, 2019, pp. 8386–8397.
[33] H. Taheri, A. Mokhtari, H. Hassani, and R. Pedarsani, “Quantized decentralized stochastic learning over directed graphs,” in Proc. Int. Conf. Machine Learning, 2020, pp. 9324–9333.
[34] N. Singh, D. Data, J. George, and S. Diggavi, “SQuARM-SGD: Communication-efficient momentum SGD for decentralized optimization,” IEEE Journal on Selected Areas in Information Theory, vol. 2, no. 3, pp. 954–969, 2021. doi: 10.1109/JSAIT.2021.3103920
[35] J. Wang and G. Joshi, “Cooperative SGD: A unified framework for the design and analysis of communication-efficient SGD algorithms,” Journal of Machine Learning Research, vol. 22, no. 213, pp. 1–50, 2021.
[36] H. Tang, X. Lian, M. Yan, C. Zhang, and J. Liu, “D2: Decentralized training over decentralized data,” in Proc. Int. Conf. Machine Learning, 2018, pp. 4848–4856.
[37] S. Lu, X. Zhang, H. Sun, and M. Hong, “GNSD: A gradient-tracking based nonconvex stochastic algorithm for decentralized optimization,” in Proc. IEEE Data Science Workshop, 2019, pp. 315–321.
[38] J. Zhang and K. You, “Decentralized stochastic gradient tracking for empirical risk minimization,” arXiv preprint arXiv:1909.02712, 2019.
[39] S. U. Stich, “Local SGD converges fast and communicates little,” in Proc. Int. Conf. Learning Representations, 2019.
[40] A. Koloskova, S. Stich, and M. Jaggi, “Decentralized stochastic optimization and gossip algorithms with compressed communication,” in Proc. Int. Conf. Machine Learning, 2019, pp. 3478–3487.
[41] S. Pu, A. Olshevsky, and I. C. Paschalidis, “A sharp estimate on the transient time of distributed stochastic gradient descent,” IEEE Trans. Automatic Control, 2021. doi: 10.1109/TAC.2021.3126253
[42] M. Rabbat, “Multi-agent mirror descent for decentralized stochastic optimization,” in Proc. Int. Workshop on Computational Advances in Multi-Sensor Adaptive Processing, 2015, pp. 517–520.
[43] G. Lan, S. Lee, and Y. Zhou, “Communication-efficient algorithms for decentralized and stochastic optimization,” Mathematical Programming, vol. 180, no. 1, pp. 237–284, 2020.
[44] D. Yuan, Y. Hong, D. W. Ho, and G. Jiang, “Optimal distributed stochastic mirror descent for strongly convex optimization,” Automatica, vol. 90, pp. 196–203, 2018. doi: 10.1016/j.automatica.2017.12.053
[45] D. Jakovetic, D. Bajovic, A. K. Sahu, and S. Kar, “Convergence rates for distributed stochastic optimization over random networks,” in Proc. IEEE Conf. Decision and Control, 2018, pp. 4238–4245.
[46] A. Fallah, M. Gurbuzbalaban, A. Ozdaglar, U. Simsekli, and L. Zhu, “Robust distributed accelerated stochastic gradient methods for multiagent networks,” arXiv preprint arXiv:1910.08701, 2019.
[47] S. Pu and A. Garcia, “Swarming for faster convergence in stochastic optimization,” SIAM Journal on Control and Optimization, vol. 56, no. 4, pp. 2997–3020, 2018. doi: 10.1137/17M1111085
[48] S. Pu and A. Nedić, “A distributed stochastic gradient tracking method,” in Proc. IEEE Conf. Decision and Control, 2018, pp. 963–968.
[49] R. Xin, A. K. Sahu, U. A. Khan, and S. Kar, “Distributed stochastic optimization with gradient tracking over strongly-connected networks,” in Proc. IEEE Conf. Decision and Control, 2019, pp. 8353–8358.
[50] M. Mesbahi and M. Egerstedt, Graph Theoretic Methods in Multiagent Networks. Princeton University Press, 2010.
[51] A. Rakhlin, O. Shamir, and K. Sridharan, “Making gradient descent optimal for strongly convex stochastic optimization,” in Proc. Int. Conf. Machine Learning, 2012, pp. 1571–1578.
[52] H. Karimi, J. Nutini, and M. Schmidt, “Linear convergence of gradient and proximal-gradient methods under the Polyak-Łojasiewicz condition,” in Proc. Joint European Conf. Machine Learning and Knowledge Discovery in Databases, 2016, pp. 795–811.
[53] H. Zhang and L. Cheng, “Restricted strong convexity and its applications to convergence analysis of gradient-type methods in convex optimization,” Optimization Letters, vol. 9, no. 5, pp. 961–979, 2015. doi: 10.1007/s11590-014-0795-x
[54] Z. Li and J. Li, “A simple proximal stochastic gradient method for nonsmooth nonconvex optimization,” in Advances in Neural Information Processing Systems, 2018, pp. 5569–5579.
[55] M. Fazel, R. Ge, S. Kakade, and M. Mesbahi, “Global convergence of policy gradient methods for the linear quadratic regulator,” in Proc. Int. Conf. Machine Learning, 2018, pp. 1467–1476.
[56] W. Shi, Q. Ling, G. Wu, and W. Yin, “EXTRA: An exact first-order algorithm for decentralized consensus optimization,” SIAM Journal on Optimization, vol. 25, no. 2, pp. 944–966, 2015. doi: 10.1137/14096668X
[57] A. Nedić, A. Olshevsky, and W. Shi, “Achieving geometric convergence for distributed optimization over time-varying graphs,” SIAM Journal on Optimization, vol. 27, no. 4, pp. 2597–2633, 2017. doi: 10.1137/16M1084316
[58] H. Li and Z. Lin, “Revisiting EXTRA for smooth distributed optimization,” SIAM Journal on Optimization, vol. 30, no. 3, pp. 1795–1821, 2020. doi: 10.1137/18M122902X
[59] Y. LeCun, C. Cortes, and C. Burges, “MNIST handwritten digit database,” 2010. [Online]. Available: http://yann.lecun.com/exdb/mnist
[60] L. Bottou, “Stochastic gradient descent tricks,” in Neural Networks: Tricks of the Trade. Springer, 2012, pp. 421–436.
[61] Y. Nesterov, Lectures on Convex Optimization, 2nd ed. Springer Int. Publishing, 2018.
[62] Y. Tang, J. Zhang, and N. Li, “Distributed zero-order algorithms for nonconvex multiagent optimization,” IEEE Trans. Control of Network Systems, vol. 8, no. 1, pp. 269–281, 2020.
[63] X. Yi, L. Yao, T. Yang, J. George, and K. H. Johansson, “Distributed optimization for second-order multi-agent systems with dynamic event-triggered communication,” in Proc. IEEE Conf. Decision and Control, 2018, pp. 3397–3402.