Paper:《Adam: A Method for Stochastic Optimization》(Translation and Notes)
Paper source: Adam: A Method for Stochastic Optimization

ABSTRACT

We introduce Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments. The method is straightforward to implement, is computationally efficient, has little memory requirements, is invariant to diagonal rescaling of the gradients, and is well suited for problems that are large in terms of data and/or parameters. The method is also appropriate for non-stationary objectives and problems with very noisy and/or sparse gradients. The hyper-parameters have intuitive interpretations and typically require little tuning. Some connections to related algorithms, on which Adam was inspired, are discussed. We also analyze the theoretical convergence properties of the algorithm and provide a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework. Empirical results demonstrate that Adam works well in practice and compares favorably to other stochastic optimization methods. Finally, we discuss AdaMax, a variant of Adam based on the infinity norm.

Summary: Adam is a first-order stochastic optimizer built on adaptive estimates of lower-order moments of the gradient. It is simple to implement, efficient in computation and memory, invariant to diagonal rescaling of the gradients, and suited to large, noisy, sparse, or non-stationary problems; its hyper-parameters have intuitive interpretations and usually need little tuning. The paper relates Adam to the methods that inspired it, gives a regret bound comparable to the best known results in online convex optimization, reports favorable empirical comparisons with other stochastic optimizers, and closes with AdaMax, an infinity-norm variant of Adam.

1、INTRODUCTION

Stochastic gradient-based optimization is of core practical importance in many fields of science and engineering. Many problems in these fields can be cast as the optimization of some scalar parameterized objective function requiring maximization or minimization with respect to its parameters. If the function is differentiable w.r.t. its parameters, gradient descent is a relatively efficient optimization method, since the computation of first-order partial derivatives w.r.t. all the parameters is of the same computational complexity as just evaluating the function. Often, objective functions are stochastic. For example, many objective functions are composed of a sum of subfunctions evaluated at different subsamples of data; in this case optimization can be made more efficient by taking gradient steps w.r.t. individual subfunctions, i.e. stochastic gradient descent (SGD) or ascent. SGD proved itself as an efficient and effective optimization method that was central in many machine learning success stories, such as recent advances in deep learning (Deng et al., 2013; Krizhevsky et al., 2012; Hinton & Salakhutdinov, 2006; Hinton et al., 2012a; Graves et al., 2013). Objectives may also have other sources of noise than data subsampling, such as dropout (Hinton et al., 2012b) regularization. For all such noisy objectives, efficient stochastic optimization techniques are required. The focus of this paper is on the optimization of stochastic objectives with high-dimensional parameter spaces. In these cases, higher-order optimization methods are ill-suited, and discussion in this paper will be restricted to first-order methods.

Summary: Many problems in science and engineering reduce to optimizing a scalar parameterized objective. When the objective is differentiable w.r.t. its parameters, gradient descent is relatively efficient, since computing first-order partial derivatives for all parameters costs about as much as evaluating the function itself. Objectives are often stochastic, e.g. sums of subfunctions evaluated on different data subsamples, in which case taking gradient steps on individual subfunctions (SGD) is more efficient; SGD has been central to many machine learning successes, including recent advances in deep learning. Noise can also come from sources other than subsampling, such as dropout regularization. The paper targets stochastic objectives with high-dimensional parameter spaces, where higher-order methods are ill-suited, so the discussion is restricted to first-order methods.

We propose Adam, a method for efficient stochastic optimization that only requires first-order gradients with little memory requirement. The method computes individual adaptive learning rates for different parameters from estimates of first and second moments of the gradients; the name Adam is derived from adaptive moment estimation. Our method is designed to combine the advantages of two recently popular methods: AdaGrad (Duchi et al., 2011), which works well with sparse gradients, and RMSProp (Tieleman & Hinton, 2012), which works well in on-line and non-stationary settings; important connections to these and other stochastic optimization methods are clarified in section 5. Some of Adam's advantages are that the magnitudes of parameter updates are invariant to rescaling of the gradient, its stepsizes are approximately bounded by the stepsize hyperparameter, it does not require a stationary objective, it works with sparse gradients, and it naturally performs a form of step size annealing.

Summary: Adam needs only first-order gradients and little memory, and computes a per-parameter adaptive learning rate from estimates of the first and second moments of the gradients (hence the name, from adaptive moment estimation). It combines the strengths of AdaGrad (Duchi et al., 2011), which handles sparse gradients well, and RMSProp (Tieleman & Hinton, 2012), which works well in online and non-stationary settings; section 5 of the paper clarifies the connections to these and other methods. Among Adam's advantages: update magnitudes are invariant to rescaling of the gradient, step sizes are approximately bounded by the step-size hyperparameter, no stationary objective is required, sparse gradients are handled, and a form of step-size annealing happens naturally.
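To make the moment-estimation idea concrete, here is a minimal NumPy sketch of one Adam parameter update, using the default hyper-parameter values suggested in the paper (α = 0.001, β1 = 0.9, β2 = 0.999, ε = 1e-8). The function name adam_update and the state-passing style are illustrative choices for this note, not the paper's reference pseudocode.

```python
import numpy as np

def adam_update(theta, grad, m, v, t,
                alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step (illustrative sketch, not the paper's pseudocode).

    m and v are exponential moving averages of the gradient and of its
    elementwise square; t is the 1-based timestep used for bias correction.
    """
    m = beta1 * m + (1 - beta1) * grad        # first-moment (mean) estimate
    v = beta2 * v + (1 - beta2) * grad**2     # second-moment (uncentered variance) estimate
    m_hat = m / (1 - beta1**t)                # correct the bias from zero initialization
    v_hat = v / (1 - beta2**t)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```

Dividing by the square root of the second-moment estimate is what makes the update magnitude invariant to rescaling of the gradient and keeps the effective step roughly bounded by alpha, matching the properties listed above.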
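The abstract also mentions AdaMax, the infinity-norm variant of Adam. Below is a minimal sketch under the same conventions: the squared-gradient average is replaced by an exponentially weighted running maximum of the absolute gradient, so no ε term is needed. The α = 0.002 default and the function name adamax_update are assumptions made for this illustration.

```python
import numpy as np

def adamax_update(theta, grad, m, u, t,
                  alpha=0.002, beta1=0.9, beta2=0.999):
    """One AdaMax step (illustrative sketch).

    u tracks an exponentially weighted infinity norm of past gradients;
    it is assumed nonzero once a nonzero gradient has been seen.
    """
    m = beta1 * m + (1 - beta1) * grad          # first-moment estimate, as in Adam
    u = np.maximum(beta2 * u, np.abs(grad))     # running max in place of sqrt(v_hat)
    theta = theta - (alpha / (1 - beta1**t)) * m / u
    return theta, m, u
```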
3、CONCLUSION

We have introduced a simple and computationally efficient algorithm for gradient-based optimization of stochastic objective functions. Our method is aimed towards machine learning problems with large datasets and/or high-dimensional parameter spaces. The method combines the advantages of two recently popular optimization methods: the ability of AdaGrad to deal with sparse gradients, and the ability of RMSProp to deal with non-stationary objectives. The method is straightforward to implement and requires little memory. The experiments confirm the analysis on the rate of convergence in convex problems. Overall, we found Adam to be robust and well-suited to a wide range of non-convex optimization problems in the field of machine learning.

Summary: The paper introduces a simple, computationally efficient, gradient-based optimizer for stochastic objectives, aimed at machine learning problems with large datasets and/or high-dimensional parameter spaces. It combines AdaGrad's handling of sparse gradients with RMSProp's handling of non-stationary objectives, is easy to implement, and needs little memory. The experiments confirm the convergence-rate analysis for convex problems, and overall Adam proves robust and well suited to a wide range of non-convex optimization problems in machine learning.
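As a toy, end-to-end illustration of the update rule sketched earlier (not one of the paper's experiments), the following self-contained loop fits a noisy linear regression with minibatch gradients; the data, model, and hyper-parameter values are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy problem: noisy linear regression minimized with the Adam update.
# (Invented data; an illustration only, not one of the paper's experiments.)
true_w = np.array([2.0, -3.0, 0.5])
X = rng.normal(size=(1000, 3))
y = X @ true_w + 0.1 * rng.normal(size=1000)

w = np.zeros(3)
m = np.zeros(3)
v = np.zeros(3)
alpha, beta1, beta2, eps = 0.01, 0.9, 0.999, 1e-8

for t in range(1, 2001):
    idx = rng.integers(0, len(X), size=32)        # minibatch -> stochastic gradient
    xb, yb = X[idx], y[idx]
    grad = 2.0 * xb.T @ (xb @ w - yb) / len(idx)  # gradient of the mean squared error
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    w -= alpha * m_hat / (np.sqrt(v_hat) + eps)

print("recovered weights:", np.round(w, 2))       # should land near true_w
```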