A journal of IEEE and CAA , publishes high-quality papers in English on original theoretical/experimental research and development in all areas of automation
Volume 9 Issue 5
May  2022

IEEE/CAA Journal of Automatica Sinica

  • JCR Impact Factor: 6.171, Top 11% (SCI Q1)
    CiteScore: 11.2, Top 5% (Q1)
    Google Scholar h5-index: 51, TOP 8
Turn off MathJax
Article Contents
X. L. Yi, S. J. Zhang, T. Yang, T. Y. Chai, and  K. H. Johansson,  “A primal-dual SGD algorithm for distributed nonconvex optimization,” IEEE/CAA J. Autom. Sinica, vol. 9, no. 5, pp. 812–833, May 2022. doi: 10.1109/JAS.2022.105554
Citation: X. L. Yi, S. J. Zhang, T. Yang, T. Y. Chai, and  K. H. Johansson,  “A primal-dual SGD algorithm for distributed nonconvex optimization,” IEEE/CAA J. Autom. Sinica, vol. 9, no. 5, pp. 812–833, May 2022. doi: 10.1109/JAS.2022.105554

A Primal-Dual SGD Algorithm for Distributed Nonconvex Optimization

doi: 10.1109/JAS.2022.105554
Funds:  This work was supported by the Knut and Alice Wallenberg Foundation, the Swedish Foundation for Strategic Research, the Swedish Research Council, and the National Natural Science Foundation of China (62133003, 61991403, 61991404, 61991400)
More Information
  • The distributed nonconvex optimization problem of minimizing a global cost function formed by a sum of n local cost functions by using local information exchange is considered. This problem is an important component of many machine learning techniques with data parallelism, such as deep learning and federated learning. We propose a distributed primal-dual stochastic gradient descent (SGD) algorithm, suitable for arbitrarily connected communication networks and any smooth (possibly nonconvex) cost functions. We show that the proposed algorithm achieves the linear speedup convergence rate

    \begin{document}${{{\cal{O}}(1/\sqrt{nT})}}$\end{document}

    for general nonconvex cost functions and the linear speedup convergence rate

    \begin{document}$ {\cal{O}}(1/(nT))$\end{document}

    when the global cost function satisfies the Polyak-Łojasiewicz (P-Ł) condition, where T is the total number of iterations. We also show that the output of the proposed algorithm with constant parameters linearly converges to a neighborhood of a global optimum. We demonstrate through numerical experiments the efficiency of our algorithm in comparison with the baseline centralized SGD and recently proposed distributed SGD algorithms.

     

  • loading
  • 1 Note: the parameter names are different in each paper.
  • [1]
    J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, M. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, Q. V. Le, and A. Y. Ng, “Large scale distributed deep networks,” in Advances in Neural Information Processing Systems, 2012, pp. 1223–1231.
    [2]
    H. B. McMahan, E. Moore, D. Ramage, et al., “Communication-efficient learning of deep networks from decentralized data,” in Proc. Int. Conf. Artificial Intelligence and Statistics, 2017, pp. 1273–1282.
    [3]
    T. Tatarenko and B. Touri, “Non-convex distributed optimization,” IEEE Trans. Automatic Control, vol. 62, no. 8, pp. 3744–3757, 2017. doi: 10.1109/TAC.2017.2648041
    [4]
    M. Hong, D. Hajinezhad, and M.-M. Zhao, “Prox-PDA: The proximal primal-dual algorithm for fast distributed nonconvex optimization and learning over networks,” in Proc. Int. Conf. Machine Learning, 2017, pp. 1529–1538.
    [5]
    A. Daneshmand, G. Scutari, and V. Kungurtsev, “Second-order guarantees of gradient algorithms over networks,” in Proc. Annual Allerton Conf. Communication, Control, and Computing, 2018, pp. 359–365.
    [6]
    H. Sun and M. Hong, “Distributed non-convex first-order optimization and information processing: Lower complexity bounds and rate optimal algorithms,” arXiv preprint arXiv: 1804.02729, 2018.
    [7]
    D. Hajinezhad and M. Hong, “Perturbed proximal primal-dual algorithm for nonconvex nonsmooth optimization,” Mathematical Programming, vol. 176, no. 1, pp. 207–245, 2019.
    [8]
    H. Sun and M. Hong, “Distributed non-convex first-order optimization and information processing: Lower complexity bounds and rate optimal algorithms,” IEEE Trans. Signal Processing, vol. 67, no. 22, pp. 5912–5928, 2019. doi: 10.1109/TSP.2019.2943230
    [9]
    H.-T. Wai, J. Lafond, A. Scaglione, and E. Moulines, “Decentralized Frank-Wolfe algorithm for convex and nonconvex problems,” IEEE Trans. Automatic Control, vol. 62, no. 11, pp. 5522–5537, 2017. doi: 10.1109/TAC.2017.2685559
    [10]
    J. Langford, L. Li, and T. Zhang, “Sparse online learning via truncated gradient,” Journal of Machine Learning Research, vol. 10, no. 3, pp. 777–801, 2009.
    [11]
    B. Recht, C. Re, S. Wright, and F. Niu, “Hogwild!: A lock-free approach to parallelizing stochastic gradient descent,” in Advances in Neural Information Processing Systems, 2011, pp. 693–701.
    [12]
    C. M. De Sa, C. Zhang, K. Olukotun, and C. Ré, “Taming the wild: A unified analysis of hogwild-style algorithms,” in Advances in Neural Information Processing Systems, 2015, pp. 2674–2682.
    [13]
    X. Lian, Y. Huang, Y. Li, and J. Liu, “Asynchronous parallel stochastic gradient for nonconvex optimization,” in Advances in Neural Information Processing Systems, 2015, pp. 2737–2745.
    [14]
    X. Lian, H. Zhang, C.-J. Hsieh, Y. Huang, and J. Liu, “A comprehensive linear speedup analysis for asynchronous stochastic parallel optimization from zeroth-order to first-order,” in Advances in Neural Information Processing Systems, 2016, pp. 3054–3062.
    [15]
    Z. Zhou, P. Mertikopoulos, N. Bambos, P. Glynn, Y. Ye, L.-J. Li, and F.-F. Li, “Distributed asynchronous optimization with unbounded delays: How slow can you go?” in Proc. Int. Conf. Machine Learning, 2018, pp. 5970–5979.
    [16]
    J. Bernstein, Y.-X. Wang, K. Azizzadenesheli, and A. Anandkumar, “SignSGD: Compressed optimisation for non-convex problems,” in Proc. Int. Conf. Machine Learning, 2018, pp. 560–569.
    [17]
    P. Jiang and G. Agrawal, “A linear speedup analysis of distributed deep learning with sparse and quantized communication,” in Advances in Neural Information Processing Systems, 2018, pp. 2525–2536.
    [18]
    A. Reisizadeh, A. Mokhtari, H. Hassani, A. Jadbabaie, and R. Pedarsani, “FedPAQ: A communication-efficient federated learning method with periodic averaging and quantization,” in Proc. Int. Conf. Artificial Intelligence and Statistics, 2020, pp. 2021–2031.
    [19]
    D. Basu, D. Data, C. Karakus, and S. Diggavi, “Qsparse-local-SGD: Distributed SGD with quantization, sparsification and local computations,” in Advances in Neural Information Processing Systems, 2019, pp. 14668–14679.
    [20]
    J. Wang and G. Joshi, “Adaptive communication strategies to achieve the best error-runtime trade-off in local-update SGD,” in Proc. Conf. Machine Learning and Systems, 2019, pp. 212–229.
    [21]
    H. Yu, S. Yang, and S. Zhu, “Parallel restarted SGD with faster convergence and less communication: Demystifying why model averaging works for deep learning,” in Proc. AAAI Conf. Artificial Intelligence, 2019, pp. 5693–5700.
    [22]
    F. Haddadpour, M. M. Kamani, M. Mahdavi, and V. Cadambe, “Trading redundancy for communication: Speeding up distributed SGD for nonconvex optimization,” in Proc. Int. Conf. Machine Learning, 2019, pp. 2545–2554.
    [23]
    H. Yu, R. Jin, and S. Yang, “On the linear speedup analysis of communication efficient momentum SGD for distributed non-convex optimization,” in Proc. Int. Conf. Machine Learning, 2019, pp. 7184–7193.
    [24]
    F. Haddadpour, M. M. Kamani, M. Mahdavi, and V. Cadambe, “Local SGD with periodic averaging: Tighter analysis and adaptive synchronization,” in Advances in Neural Information Processing Systems, 2019, pp. 11080–11092.
    [25]
    H. Yu and R. Jin, “On the computation and communication complexity of parallel SGD with dynamic batch sizes for stochastic non-convex optimization,” in Proc. Int. Conf. Machine Learning, 2019, pp. 7174–7183.
    [26]
    Z. Jiang, A. Balu, C. Hegde, and S. Sarkar, “Collaborative deep learning in fixed topology networks,” in Advances in Neural Information Processing Systems, 2017, pp. 5904–5914.
    [27]
    X. Lian, C. Zhang, H. Zhang, C.-J. Hsieh, W. Zhang, and J. Liu, “Can decentralized algorithms outperform centralized algorithms? A case study for decentralized parallel stochastic gradient descent,” in Advances in Neural Information Processing Systems, 2017, pp. 5330–5340.
    [28]
    J. George, T. Yang, H. Bai, and P. Gurram, “Distributed stochastic gradient method for non-convex problems with applications in supervised learning,” in Proc. IEEE Conf. Decision and Control, 2019, pp. 5538–5543.
    [29]
    X. Lian, W. Zhang, C. Zhang, and J. Liu, “Asynchronous decentralized parallel stochastic gradient descent,” in Proc. Int. Conf. Machine Learning, 2018, pp. 3043–3052.
    [30]
    M. Assran, N. Loizou, N. Ballas, and M. Rabbat, “Stochastic gradient push for distributed deep learning,” in Proc. Int. Conf. Machine Learning, 2019, pp. 344–353.
    [31]
    H. Tang, S. Gan, C. Zhang, T. Zhang, and J. Liu, “Communication compression for decentralized training,” in Advances in Neural Information Processing Systems, 2018, pp. 7652–7662.
    [32]
    A. Reisizadeh, H. Taheri, A. Mokhtari, H. Hassani, and R. Pedarsani, “Robust and communication-efficient collaborative learning,” in Advances in Neural Information Processing Systems, 2019, pp. 8386–8397.
    [33]
    H. Taheri, A. Mokhtari, H. Hassani, and R. Pedarsani, “Quantized decentralized stochastic learning over directed graphs,” in Proc. Int. Conf. Machine Learning, 2020, pp. 9324–9333.
    [34]
    N. Singh, D. Data, J. George, and S. Diggavi, “SQuARM-SGD: Communication-efficient momentum SGD for decentralized optimization,” IEEE Journal on Selected Areas in Information Theory, vol. 2, no. 3, pp. 954–969, 2021. doi: 10.1109/JSAIT.2021.3103920
    [35]
    J. Wang and G. Joshi, “Cooperative SGD: A unified framework for the design and analysis of communication-efficient SGD algorithms,” Journal of Machine Learning Research, vol. 22, no. 213, pp. 1–50, 2021.
    [36]
    H. Tang, X. Lian, M. Yan, C. Zhang, and J. Liu, “D2: Decentralized training over decentralized data,” in Proc. Int. Conf. Machine Learning, 2018, pp. 4848–4856.
    [37]
    S. Lu, X. Zhang, H. Sun, and M. Hong, “GNSD: A gradient-tracking based nonconvex stochastic algorithm for decentralized optimization,” in Proc. IEEE Data Science Workshop, 2019, pp. 315–321.
    [38]
    J. Zhang and K. You, “Decentralized stochastic gradient tracking for empirical risk minimization,” arXiv preprint arXiv: 1909.02712, 2019.
    [39]
    S. U. Stich, “Local SGD converges fast and communicates little,” in Proc. Int. Conf. Learning Representations, 2019.
    [40]
    A. Koloskova, S. Stich, and M. Jaggi, “Decentralized stochastic optimization and gossip algorithms with compressed communication,” in Proc. Int. Conf. Machine Learning, 2019, pp. 3478–3487.
    [41]
    S. Pu, A. Olshevsky, and I. C. Paschalidis, “A sharp estimate on the transient time of distributed stochastic gradient descent,” IEEE Transactions on Automatic Control, 2021. DOI: 10.1109/TAC.2021.3126253
    [42]
    M. Rabbat, “Multi-agent mirror descent for decentralized stochastic optimization,” in Proc. Int. Workshop on Computational Advances in Multi-Sensor Adaptive Processing, 2015, pp. 517–520.
    [43]
    G. Lan, S. Lee, and Y. Zhou, “Communication-efficient algorithms for decentralized and stochastic optimization,” Mathematical Programming, vol. 180, no. 1, pp. 237–284, 2020.
    [44]
    D. Yuan, Y. Hong, D. W. Ho, and G. Jiang, “Optimal distributed stochastic mirror descent for strongly convex optimization,” Automatica, vol. 90, pp. 196–203, 2018. doi: 10.1016/j.automatica.2017.12.053
    [45]
    D. Jakovetic, D. Bajovic, A. K. Sahu, and S. Kar, “Convergence rates for distributed stochastic optimization over random networks,” in Proc. IEEE Conf. Decision and Control, 2018, pp. 4238–4245.
    [46]
    A. Fallah, M. Gurbuzbalaban, A. Ozdaglar, U. Simsekli, and L. Zhu, “Robust distributed accelerated stochastic gradient methods for multiagent networks,” arXiv preprint arXiv: 1910.08701, 2019.
    [47]
    S. Pu and A. Garcia, “Swarming for faster convergence in stochastic optimization,” SIAM Journal on Control and Optimization, vol. 56, no. 4, pp. 2997–3020, 2018. doi: 10.1137/17M1111085
    [48]
    S. Pu and A. Nedić, “A distributed stochastic gradient tracking method,” in Proc. IEEE Conf. Decision and Control, 2018, pp. 963–968.
    [49]
    R. Xin, A. K. Sahu, U. A. Khan, and S. Kar, “Distributed stochastic optimization with gradient tracking over strongly-connected networks,” in Proc. IEEE Conf. Decision and Control, 2019, pp. 8353–8358.
    [50]
    M. Mesbahi and M. Egerstedt, Graph Theoretic Methods in Multiagent Networks. Princeton University Press, 2010.
    [51]
    A. Rakhlin, O. Shamir, and K. Sridharan, “Making gradient descent optimal for strongly convex stochastic optimization,” in Proc. Int. Conf. Machine Learning, 2012, pp. 1571–1578.
    [52]
    H. Karimi, J. Nutini, and M. Schmidt, “Linear convergence of gradient and proximal-gradient methods under the Polyak-Łojasiewicz condition,” in Proc. Joint European Conf. Machine Learning and Knowledge Discovery in Databases, 2016, pp. 795–811.
    [53]
    H. Zhang and L. Cheng, “Restricted strong convexity and its applications to convergence analysis of gradient-type methods in convex optimization,” Optimization Letters, vol. 9, no. 5, pp. 961–979, 2015. doi: 10.1007/s11590-014-0795-x
    [54]
    Z. Li and J. Li, “A simple proximal stochastic gradient method for nonsmooth nonconvex optimization,” in Advances in Neural Information Processing Systems, 2018, pp. 5569–5579.
    [55]
    M. Fazel, R. Ge, S. Kakade, and M. Mesbahi, “Global convergence of policy gradient methods for the linear quadratic regulator,” in Proc. Int. Conf. Machine Learning, 2018, pp. 1467–1476.
    [56]
    W. Shi, Q. Ling, G. Wu, and W. Yin, “EXTRA: An exact first-order algorithm for decentralized consensus optimization,” SIAM Journal on Optimization, vol. 25, no. 2, pp. 944–966, 2015. doi: 10.1137/14096668X
    [57]
    A. Nedić, A. Olshevsky, and W. Shi, “Achieving geometric convergence for distributed optimization over time-varying graphs,” SIAM Journal on Optimization, vol. 27, no. 4, pp. 2597–2633, 2017. doi: 10.1137/16M1084316
    [58]
    H. Li and Z. Lin, “Revisiting extra for smooth distributed optimization,” SIAM Journal on Optimization, vol. 30, no. 3, pp. 1795–1821, 2020. doi: 10.1137/18M122902X
    [59]
    Y. LeCun, C. Cortes, and C. Burges, “MNIST handwritten digit database,” [Online] Available: http://yann.lecun.com/exdb/mnist, 2010.
    [60]
    L. Bottou, “Stochastic gradient descent tricks,” in Neural Networks: Tricks of the Trade. Springer, 2012, pp. 421–436.
    [61]
    Y. Nesterov, Lectures on Convex Optimization, 2nd ed. Springer Int. Publishing, 2018.
    [62]
    Y. Tang, J. Zhang, and N. Li, “Distributed zero-order algorithms for nonconvex multiagent optimization,” IEEE Trans. Control of Network Systems, vol. 8, no. 1, pp. 269–281, 2020.
    [63]
    X. Yi, L. Yao, T. Yang, J. George, and K. H. Johansson, “Distributed optimization for second-order multi-agent systems with dynamic eventtriggered communication,” in Proc. IEEE Conf. Decision and Control, 2018, pp. 3397–3402.

Catalog

    通讯作者: 陈斌, bchen63@163.com
    • 1. 

      沈阳化工大学材料科学与工程学院 沈阳 110142

    1. 本站搜索
    2. 百度学术搜索
    3. 万方数据库搜索
    4. CNKI搜索

    Figures(4)  / Tables(5)

    Article Metrics

    Article views (204) PDF downloads(65) Cited by()

    Highlights

    • This paper proposes a novel distributed SGD algorithm, suitable for arbitrarily connected communication networks and heterogeneous local cost functions
    • The proposed algorithm achieves the linear speedup rate for smooth nonconvex cost functions
    • It also achieves the linear speedup convergence rate when the global cost function satisfies the Polyak–Łojasiewicz condition which is weaker than the commonly used strong convexity assumption

    /

    DownLoad:  Full-Size Img  PowerPoint
    Return
    Return