Implementing and Evaluating Second-Order Optimization for Deep Learning on Diverse Architectures


At a glance

First-order optimization methods—e.g., SGD and its variants—are ubiquitous, but they come with well-known deficiencies that second-order methods easily solve. Second-order methods use second derivative and not just first derivative information, and so they tend to be more expensive. Recent work has begun to consider the use of second-order methods for deep learning and other ML problems.  A challenge     is that developing high-quality implementations can be non-trivial,  in part since second-order methods can    be more sensitive to architectural issues than simpler first-order methods. The time is ripe for a thorough reevaluation of these methods. Of particular interest is the use of trust region (TR) and cubic regularization (CR) versions of these methods in models with non-convex loss functions. Such models are ubiquitous: multi-layer neural networks; discriminative training of mixture models; models such as latent factor models with “products” of parameters; etc. Our goal will be twofold: first, to provide a detailed implementation  in different architectures of recently-developed scalable second-order optimization methods; and second, to provide a detailed evaulation of them in non-convex deep learning problems of interest for the BDD project.
Prior work. In prior work, we have a proof-of-principle treatment of sketching as a way to reduce compu- tational complexity of Newton-type methods, for both convex and non-convex problems, in both RAM and parallel environments, and with both CPUs and GPUs; see [5, 6, 7, 8, 9] and references therein. We considered convex problems and characterized thoroughly the local and global convergence properties of sub-sampled Newton methods, where the gradient and/or the Hessian are uniformly or non-uniformly sub-sampled [5, 6]. We employed similar ideas for TR and CR algorithms for non-convex optimization [8] and presented initial empirical results illustrating the practical usefulness of these methods for non-convex ML problems [9]. We also considered distributed variants of sub-sampled Newton methods that are highly suitable for distributed settings, in which communication across the network is the major bottleneck [7]. In contrast to popular first-order algorithms,  we  empirically demonstrated the robustness of Newton-type methods to the choice     of hyper-parameters [9] and showcased their resilience properties to ill-conditioning [5, 6]. In (unpublished) work-in-progress, we  have  also shown that by  leveraging resources such as GPUs,  Newton-type methods  can be made computationally highly favorable to many first-order alternatives available  as part of popular   ML software packages, such as TensorFlow [3]. See Figure 1 (and [5, 9, 7] for details on the models, loss functions, data/applications, etc.) for examples of promising prior work.
Work description. We will focus on two main directions. In each case, we will be interested in two main goals: demonstrating competitive/superior performance, relative to first order methods, for BDD-type appli- cations; and demonstrating scalability in parallel implementations, in different computational environments, e.g., in Alchemist [2] and Ray [1] in the RISELab stack, with and without access to GPUs.
Sketched  negative-curvature  methods  for  non-convex  ML  problems.  Until  recently,  the  majority  of effort in designing efficient second-order algorithms has focused on convex problems [5, 6]. However, now there is a surge of interest in designing second-order methods for the non-convex setting. In recent work [8, 9], we have  studied, both theoretically and empirically, sub-sampling schemes for classical TR methods as well  as CR for solving large-scale non-convex finite-sum problems. In this project, we intend to extend the parallel matrix approximation approaches from randomized numerical linear algebra (RandNLA) [4] to negative cur- vature direction methods for arbitrary, i.e., not necessarily finite-sum, non-convex problems.

Figure 1: Examples from prior work [5, 9, 7] demonstrating, in contrast to first-order algorithms, attractive features of Newton-type methods: resiliency to problem ill-conditioning (1(a)), low-communication costs in distributed settings (1(b)), computational advantages from GPUs (1(c)), ability to escape undesirable saddle- points (1(d),1(e)), and good generalization error and robustness to hyper-parameter tuning (1(f),1(g)).

In particular, we will study stochastic variants of Gill-Murray Stable Newton’s method, the Fiacco-McCormick method, and the Fletcher-Freeman method (details in the Nocedal-Wright book), where we will incorporate RandNLA sketching tools to accelerate the matrix decompositions involved. Based on our previous experience [8, 9], we believe that only a rough sketch of the corresponding Hessian matrix would be sufficient to obtain prov- ably convergent algorithms with desirable empirical properties. The main technical challenges here involve sketching the potentially indefinite matrices and characterizing the convergence behavior of such stochastic methods, e.g., to saddle points or local minima, which can significantly impact downstream ML applications.
RandNLA to speed up TR sub-problem solver. TR methods are  considered  as  alternatives to line- search techniques, as they define a region around the current iterate within which they trust the model to be a reasonable approximation, and they then find the step as a (approximate) minimizer of the model in this region. The computational bottleneck in TR algorithms is the minimization of the constrained quadratic sub-problems, and the connections between the TR problem and eigenvalue problems suggest that more efficient TR algorithms should exist. In [8, 9], we considered sub-sampling variants of the classical TR algorithm where the Hessian in the sub-problem is replaced with its sub-sampled approximation. We then employed some black box sub-problem solver for approximately solving the sub-problem. In this project, we aim to open these black box solvers and employ powerful parallel ideas from RandNLA and low rank matrix approximation to improve upon their efficiency in solving the sub-problem.
Data policy. We  will make publicly-available the implementations of algorithms we  develop.  To  evaluate  our methods, we will apply them to standard benchmarks, e.g., ImageNet and CIFAR10; and to “stress test” our methods in a realistic application, we will apply them to data (TBD) from the BDD project.


  1. Ray.
  2. A. Gittens, K. Rothauge, M. W. Mahoney, S. Wang, and J. Kotaalam. Alchemist: An Apache Spark <=> MPI interface. Manuscript. 2018.
  3. S. B Kylasa, F. Roosta-Khorasani, M. W. Mahoney, and A. Grama. Sub-sampled Newton methods on GPUs. Manuscript. 2018.
  4. Michael  W  Mahoney.   Randomized  algorithms  for  matrices  and  data.   Foundations  and  TrendsQR Learning, 3(2):123–224, 2011. in Machine
  5. Farbod Roosta-Khorasani and Michael W Mahoney. Sub-sampled Newton methods I: globally convergent algo- rithms. arXiv preprint arXiv:1601.04737, 2016.
  6. Farbod Roosta-Khorasani and Michael W Mahoney. Sub-sampled Newton methods II: Local convergence rates. arXiv preprint arXiv:1601.04738, 2016.
  7. Shusen Wang, Farbod Roosta-Khorasani, Peng Xu, and Michael W Mahoney. GIANT: Globally Improved Ap- proximate Newton Method for Distributed Optimization. arXiv preprint arXiv:1709.03528, 2017.
  8. Peng Xu, Farbod Roosta-Khorasani, and Michael W. Mahoney. Newton-Type Methods for Non-Convex Opti- mization Under Inexact Hessian Information. arXiv preprint arXiv:1708.07164, 2017.
  9. Peng Xu, Farbod Roosta-Khorasani, and Michael W. Mahoney. Second-Order Optimization for Non-Convex Machine Learning: An Empirical Study. arXiv preprint arXiv:1708.07827, 2017.
principal investigatorsresearchersthemes
Michael Mahoney second-order optimization, GPU optimization, trust region methods, optimizing batch size and learning rate