Authors: Sudhir B. Kylasa, Farbod Roosta-Khorasani, Michael W. Mahoney, Ananth Grama
ArXiv: 1802.09113
Abstract URL: http://arxiv.org/abs/1802.09113v2
First order methods, which solely rely on gradient information, are commonly
used in diverse machine learning (ML) and data analysis (DA) applications. This
is attributed to the simplicity of their implementations, as well as low
per-iteration computational/storage costs. However, they suffer from
significant disadvantages; most notably, their performance degrades with
increasing problem ill-conditioning. Furthermore, they often involve a large
number of hyper-parameters, and are notoriously sensitive to parameters such as
the step-size. By incorporating additional information from the Hessian,
second-order methods have been shown to be resilient to many such adversarial
effects. However, these advantages of using curvature information come at the
cost of higher per-iteration costs, which, in "big data" regimes, can be
computationally prohibitive.
In this paper, we show that, contrary to conventional belief, second-order
methods, when implemented appropriately, can be more efficient than first-order
alternatives in many large-scale ML/DA applications. In particular, in convex
settings, we consider variants of classical Newton's method in which
the Hessian and/or the gradient are randomly sub-sampled. We show that by
effectively leveraging the power of GPUs, such randomized Newton-type
algorithms can be significantly accelerated, and can easily outperform
state-of-the-art implementations of existing techniques in popular ML/DA
software packages such as TensorFlow. Additionally, these randomized methods
incur a small memory overhead compared to first-order methods. In particular, we show
that for million-dimensional problems, our GPU-accelerated sub-sampled
Newton's method achieves a higher test accuracy in milliseconds, as compared
with tens of seconds for first-order alternatives.
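For concreteness, the following is a minimal NumPy sketch of the kind of sub-sampled Newton step the abstract describes: the gradient is computed over the full data set, while the Hessian is estimated from a random mini-batch of the data. The choice of L2-regularized logistic regression, the sample size, and all names here are illustrative assumptions; this is not the authors' GPU implementation.

```python
# Illustrative sketch (assumption, not the paper's code): one sub-sampled
# Newton step for L2-regularized logistic regression, using NumPy.
import numpy as np

def subsampled_newton_step(w, X, y, lam=1e-3, hessian_sample=1024, rng=None):
    """Take one Newton-type step: full gradient, Hessian from a random subset."""
    rng = np.random.default_rng() if rng is None else rng
    n, d = X.shape

    # Full gradient of the logistic loss (labels y in {0, 1}).
    p = 1.0 / (1.0 + np.exp(-X @ w))
    grad = X.T @ (p - y) / n + lam * w

    # Sub-sampled Hessian: estimated from a random mini-batch of rows.
    idx = rng.choice(n, size=min(hessian_sample, n), replace=False)
    Xs, ps = X[idx], p[idx]
    D = ps * (1.0 - ps)                                 # diagonal weights
    H = (Xs.T * D) @ Xs / len(idx) + lam * np.eye(d)

    # Newton direction from the sub-sampled Hessian; in practice a line
    # search or fixed step size would scale this update.
    return w - np.linalg.solve(H, grad)

# Toy usage on synthetic data.
rng = np.random.default_rng(0)
X = rng.standard_normal((10_000, 50))
y = (X @ rng.standard_normal(50) > 0).astype(float)
w = np.zeros(50)
for _ in range(10):
    w = subsampled_newton_step(w, X, y, rng=rng)
```

In this sketch the per-iteration cost is dominated by forming and solving the d-by-d sub-sampled Hessian system; the paper's point is that, on GPUs, such dense linear-algebra kernels can be executed fast enough to make this kind of step competitive with first-order updates.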