Publications about 'gradient descent'

Articles in journal or book chapters

A.C.B de Oliveira, D.D. Jatkar, and E.D. Sontag. On the convergence of overparameterized problems: Inherent properties of the compositional structure of neural networks. Proceedings of the 8th Annual Learning for Dynamics & Control Conference (L4DC), 2026. Note: To appear. Also 2025 arXiv:2511.09810 [cs.LG]. [doi:https://doi.org/10.48550/arXiv.2511.09810] Keyword(s): gradient dynamics, gradient descent, gradient systems, numerical methods, dynamics of algorithms, gradient dominance, gradient flows, neural networks, optimization, overparameterization. Abstract:

This paper investigates how the compositional structure of neural networks shapes their optimization landscape and training dynamics. We analyze the gradient flow associated with overparameterized optimization problems, which can be interpreted as training a neural network with linear activations. Remarkably, we show that the global convergence properties can be derived for any cost function that is proper and real analytic. We then specialize the analysis to scalar-valued cost functions, where the geometry of the landscape can be fully characterized. In this setting, we demonstrate that key structural features -- such as the location and stability of saddle points -- are universal across all admissible costs, depending solely on the overparameterized representation rather than on problem-specific details. Moreover, we show that convergence can be arbitrarily accelerated depending on the initialization, as measured by an imbalance metric introduced in this work. Finally, we discuss how these insights may generalize to neural networks with sigmoidal activations, showing through a simple example which geometric and dynamical properties persist beyond the linear case.

L. Cui, Z.P. Jiang, E.D. Sontag, and R.D. Braatz. Perturbed gradient descent algorithms are small-disturbance input-to-state stable. Automatica, 2025. Note: Submitted. Also arXiv:2507.02131. [PDF] [doi:https://doi.org/10.48550/arXiv.2507.02131] Keyword(s): gradient dynamics, gradient descent, gradient systems, numerical methods, dynamics of algorithms, gradient dominance, gradient flows, input-to-state stability, policy optimization, linear quadratic regulator. Abstract:

This article investigates the robustness of gradient descent algorithms under perturbations. The concept of small-disturbance input-to-state stability (ISS) for discrete-time nonlinear dynamical systems is introduced, along with its Lyapunov characterization. The conventional linear Polyak-Lojasiewicz (PL) condition is then extended to a nonlinear version, and it is shown that the gradient descent algorithm is small-disturbance ISS provided the objective function satisfies the generalized nonlinear PL condition. This small-disturbance ISS property guarantees that the gradient descent algorithm converges to a small neighborhood of the optimum under sufficiently small perturbations. As a direct application of the developed framework, we demonstrate that the LQR cost satisfies the generalized nonlinear PL condition, thereby establishing that the policy gradient algorithm for LQR is small-disturbance ISS. Additionally, other popular policy gradient algorithms, including natural policy gradient and Gauss-Newton method, are also proven to be small-disturbance ISS.
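As a rough illustration of the neighborhood-convergence claim in the abstract above, the following minimal sketch runs gradient descent with a bounded additive gradient perturbation on the quadratic f(x) = x^2/2, which satisfies the standard PL condition. The function name, step size, and worst-case constant disturbance are assumptions chosen for illustration and are not taken from the paper.

```python
def perturbed_gradient_descent(x0, step, disturbance, num_steps):
    """Iterate x <- x - step * (grad f(x) + d) for f(x) = x**2 / 2.

    Hypothetical toy setup: the gradient of f is x, and the perturbation d
    is held constant at its bound, which is the worst case for this example.
    """
    x = x0
    for _ in range(num_steps):
        noisy_grad = x + disturbance   # true gradient plus estimation error
        x = x - step * noisy_grad
    return x


if __name__ == "__main__":
    for delta in (0.0, 0.01, 0.1):
        x_final = perturbed_gradient_descent(10.0, 0.1, delta, 500)
        print(delta, abs(x_final))
```

In this toy case the iterate settles at distance delta from the minimizer at 0, so the residual error shrinks as the perturbation bound shrinks.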

E.D. Sontag. Some remarks on gradient dominance and LQR policy optimization. arXiv 2507.10452, 2025. [PDF] [doi:https://doi.org/10.48550/arXiv.2507.10452] Keyword(s): gradient dynamics, gradient descent, gradient systems, numerical methods, dynamics of algorithms, gradient dominance, gradient flows, LQR, reinforcement learning, machine learning, artificial intelligence, optimal control. Abstract:

Solutions of optimization problems, including policy optimization in reinforcement learning, typically rely upon some variant of gradient descent. There has been much recent work in the machine learning, control, and optimization communities applying the Polyak-Łojasiewicz Inequality (PLI) to such problems in order to establish an exponential rate of convergence (a.k.a. "linear convergence" in the local-iteration language of numerical analysis) of loss functions to their minima under the gradient flow. Often, as is the case of policy iteration for the continuous-time LQR problem, this rate vanishes for large initial conditions, resulting in a mixed globally linear / locally exponential behavior. This is in sharp contrast with the discrete-time LQR problem, where there is global exponential convergence. That gap between CT and DT behaviors motivates the search for various generalized PLI-like conditions, and this paper addresses that topic. Moreover, these generalizations are key to understanding the transient and asymptotic effects of errors in the estimation of the gradient, errors which might arise from adversarial attacks, wrong evaluation by an oracle, early stopping of a simulation, inaccurate and very approximate digital twins, stochastic computations (algorithm "reproducibility"), or learning by sampling from limited data. We describe an "input to state stability" (ISS) analysis of this issue. We also discuss convergence and PLI-like properties of "linear feedforward neural networks" in feedback control. Much of the work described here was done in collaboration with Arthur Castello B. de Oliveira, Leilei Cui, Zhong-Ping Jiang, and Milad Siami. This is a short paper summarizing the slides presented at my keynote at the 2025 L4DC (Learning for Dynamics & Control Conference) in Ann Arbor, Michigan, 05 June 2025. A partial bibliography has been added.

A.C.B de Oliveira, M. Siami, and E.D. Sontag. Convergence analysis of overparametrized LQR formulations. Automatica, 182:112504, 2025. Note: Version with more details in arXiv 2408.15456. [PDF] Keyword(s): gradient dynamics, gradient descent, gradient systems, numerical methods, dynamics of algorithms, gradient dominance, gradient flows, machine learning, artificial intelligence, learning theory, singularities in optimization, neural networks, overparametrization, input to state stability, feedback control, LQR. Abstract:

Motivated by the growing use of Artificial Intelligence (AI) tools in control design, this paper takes the first steps towards bridging the gap between results from Direct Gradient methods for the Linear Quadratic Regulator (LQR), and neural networks. More specifically, it looks into the case where one wants to find a Linear Feed-Forward Neural Network (LFFNN) feedback that minimizes an LQR cost. This paper starts by computing the gradient formulas for the parameters of each layer, which are used to derive a key conservation law of the system. This conservation law is then leveraged to prove boundedness and global convergence of solutions to critical points, and invariance of the set of stabilizing networks under the training dynamics. This is followed by an analysis of the case where the LFFNN has a single hidden layer. For this case, the paper proves that the training converges not only to critical points but to the optimal feedback control law for all but a measure-zero set of initializations. These theoretical results are followed by an extensive analysis of a simple version of the problem (the "vector case"), proving the theoretical properties of accelerated convergence and robustness for this simpler example. Finally, the paper presents numerical evidence of faster convergence of the training of general LFFNNs when compared to traditional direct gradient methods, showing that the acceleration of the solution is observable even when the gradient is not explicitly computed but estimated from evaluations of the cost function.
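The abstract above does not state the conservation law explicitly. As a hedged illustration of the kind of invariant that arises in overparameterized linear models, the sketch below uses a hypothetical scalar stand-in (not the LQR cost): the loss (k2*k1 - a)^2/2 with a two-factor parameterization. Along its gradient flow the imbalance k1^2 - k2^2 is constant, and a forward-Euler discretization with a small step preserves it approximately. All names and numerical values are assumptions made for illustration.

```python
def train_two_factor(k1, k2, target, step, num_steps):
    """Forward-Euler gradient flow on L(k1, k2) = 0.5 * (k2 * k1 - target)**2.

    Toy assumption: scalar 'layers' k1 and k2 stand in for weight matrices.
    Returns the final factors and the imbalance k1**2 - k2**2 at each step.
    """
    imbalance = [k1 * k1 - k2 * k2]
    for _ in range(num_steps):
        residual = k2 * k1 - target          # derivative of L w.r.t. the product
        grad_k1 = residual * k2              # chain rule through the product
        grad_k2 = residual * k1
        k1, k2 = k1 - step * grad_k1, k2 - step * grad_k2
        imbalance.append(k1 * k1 - k2 * k2)
    return k1, k2, imbalance


if __name__ == "__main__":
    k1, k2, imbalance = train_two_factor(2.0, 0.5, 3.0, 1e-3, 5000)
    print("product:", k1 * k2)
    print("imbalance drift:", abs(imbalance[-1] - imbalance[0]))
```

In this toy setting the product k2*k1 approaches the target while the imbalance stays close to its initial value; the small residual drift comes from the Euler discretization, not from the continuous-time flow.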

L. Cui, Z.P. Jiang, and E. D. Sontag. Small-disturbance input-to-state stability of perturbed gradient flows: Applications to LQR problem. Systems and Control Letters, 188:105804, 2024. [PDF] [doi:https://doi.org/10.1016/j.sysconle.2024.105804] Keyword(s): gradient dynamics, gradient descent, gradient systems, numerical methods, dynamics of algorithms, gradient dominance, gradient flows, machine learning, artificial intelligence, direct optimization, input-to-state stability, ISS. Abstract:

This paper studies the effect of perturbations on the gradient flow of a general constrained nonlinear programming problem, where the perturbation may arise from inaccurate gradient estimation in the setting of data-driven optimization. Under suitable conditions on the objective function, the perturbed gradient flow is shown to be small-disturbance input-to-state stable (ISS), which implies that, in the presence of a small-enough perturbation, the trajectory of the perturbed gradient flow must eventually enter a small neighborhood of the optimum. This work was motivated by the question of robustness of direct methods for the linear quadratic regulator problem, and specifically the analysis of the effect of perturbations caused by gradient estimation or round-off errors in policy optimization. Interestingly, we show small-disturbance ISS for three of the most common optimization algorithms: standard gradient flow, natural gradient flow, and Newton gradient flow.

E.D. Sontag. Remarks on input to state stability of perturbed gradient flows, motivated by model-free feedback control learning. Systems and Control Letters, 161:105138, 2022. Note: Important: there is an error in the paper. For the LQR application, the paper only shows iISS, not ISS. See the paper Small-disturbance input-to-state stability of perturbed gradient flows: Applications to LQR problem for details. [PDF] Keyword(s): gradient dominance, iss, input to state stability, data-driven control, gradient systems, steepest descent, model-free control, gradient dynamics, gradient descent, numerical methods, dynamics of algorithms. Abstract:

Recent work on data-driven control and reinforcement learning has renewed interest in a relatively old field in control theory: model-free optimal control approaches which work directly with a cost function and do not rely upon perfect knowledge of a system model. Instead, an "oracle" returns an estimate of the cost associated to, for example, a proposed linear feedback law to solve a linear-quadratic regulator problem. This estimate, and an estimate of the gradient of the cost, might be obtained by performing experiments on the physical system being controlled. This motivates in turn the analysis of steepest descent algorithms and their associated gradient differential equations. This paper studies the effect of errors in the estimation of the gradient, framed in the language of input to state stability, where the input represents a perturbation from the true gradient. Since one needs to study systems evolving on proper open subsets of Euclidean space, a self-contained review of input to state stability definitions and theorems for systems that evolve on such sets is included. The results are then applied to the study of noisy gradient systems, as well as the associated steepest descent algorithms.

E.D. Sontag. A general approach to path planning for systems without drift. In J. Baillieul, S. S. Sastry, and H.J. Sussmann, editors, Essays on mathematical robotics (Minneapolis, MN, 1993), volume 104 of IMA Vol. Math. Appl., pages 151-168. Springer, New York, 1998. [PDF] Keyword(s): path-planning, systems without drift, nonlinear control, controllability, real-analytic functions, gradient dynamics, gradient descent, gradient systems, numerical methods, dynamics of algorithms. Abstract:

This paper proposes a generally applicable technique for the control of analytic systems with no drift. The method is based on the generation of "nonsingular loops" that allow linearized controllability. One can then implement Newton and/or gradient searches in the search for a control. A general convergence theorem is proved.

E.D. Sontag. Critical points for least-squares problems involving certain analytic functions, with applications to sigmoidal nets. Adv. Comput. Math., 5(2-3):245-268, 1996. [PDF] Keyword(s): machine learning, artificial intelligence, subanalytic sets, semianalytic sets, critical points, approximation theory, neural networks, real-analytic functions, gradient dynamics, gradient descent, gradient systems, numerical methods, dynamics of algorithms. Abstract:

This paper deals with nonlinear least-squares problems involving the fitting to data of parameterized analytic functions. For generic regression data, a general result establishes the countability, and under stronger assumptions finiteness, of the set of functions giving rise to critical points of the quadratic loss function. In the special case of what are usually called "single-hidden layer neural networks", which are built upon the standard sigmoidal activation tanh(x) or equivalently 1/(1+exp(-x)), a rough upper bound for this cardinality is provided as well.

E.D. Sontag and H.J. Sussmann. Back propagation separates where perceptrons do. Neural Networks, 4(2):243-249, 1991. [PDF] [doi:http://dx.doi.org/10.1016/0893-6080(91)90008-S] Keyword(s): machine learning, artificial intelligence, gradient dynamics, gradient descent, gradient systems, numerical methods, dynamics of algorithms, neural networks. Abstract:

Feedforward nets with sigmoidal activation functions are often designed by minimizing a cost criterion. It has been pointed out before that this technique may be outperformed by the classical perceptron learning rule, at least on some problems. In this paper, we show that no such pathologies can arise if the error criterion is of a threshold LMS type, i.e., is zero for values "beyond" the desired target values. More precisely, we show that if the data are linearly separable, and one considers nets with no hidden neurons, then an error function as above cannot have any local minima that are not global. In addition, the proof gives the following stronger result, under the stated hypotheses: the continuous gradient adjustment procedure is such that from any initial weight configuration a separating set of weights is obtained in finite time. This is a precise analogue of the Perceptron Learning Theorem. The results are then compared with the more classical pattern recognition problem of threshold LMS with linear activations, where no spurious local minima exist even for nonseparable data: here it is shown that even if using the threshold criterion, such bad local minima may occur, if the data are not separable and sigmoids are used.

Conference articles

L. Cui, Z.P. Jiang, and E. D. Sontag. Small-covariance noise-to-state stability of stochastic systems and its applications to stochastic gradient dynamics. In 2026 American Control Conference (ACC), 2026. Note: To appear. Also 2025 arXiv:2509.24277. [PDF] [doi:https://doi.org/10.48550/arXiv.2509.24277] Keyword(s): gradient dynamics, gradient descent, gradient systems, numerical methods, dynamics of algorithms, gradient dominance, gradient flows, noise to state stability, input to state stability, stochastic systems. Abstract:

This paper studies gradient dynamics subject to additive stochastic noise, which may arise from sources such as stochastic gradient estimation, measurement noise, or stochastic sampling errors. To analyze the robustness of such stochastic gradient systems, the concept of small-covariance noise-to-state stability (NSS) is introduced, along with a Lyapunov-based characterization. Furthermore, the classical Polyak-Lojasiewicz (PL) condition on the objective function is generalized to the $\mathcal{K}$-PL condition via comparison functions, thereby extending its applicability to a broader class of optimization problems. It is shown that the stochastic gradient dynamics exhibit small-covariance NSS if the objective function satisfies the $\mathcal{K}$-PL condition and possesses a globally Lipschitz continuous gradient. This result implies that the trajectories of stochastic gradient dynamics converge to a neighborhood of the optimum with high probability, with the size of the neighborhood determined by the noise covariance. Moreover, if the $\mathcal{K}$-PL condition is strengthened to a $\mathcal{K}_\infty$-PL condition, the dynamics are NSS; whereas if it is weakened to a general positive-definite-PL condition, the dynamics exhibit integral NSS. The results further extend to objectives without globally Lipschitz gradients through appropriate step-size tuning. The proposed framework is further applied to the robustness analysis of policy optimization for the linear quadratic regulator (LQR) and logistic regression.

A. Oliveira, A. C. B. de Oliveira, M. Sznaier, and E. D. Sontag. On incremental and semi-global exponential stability of gradient flows satisfying generalized Lojasiewicz inequalities. In Proc. 65th IEEE Conference on Decision and Control (CDC), 2026. Note: Submitted. Also arXiv:2603.25822. Keyword(s): gradient dynamics, gradient descent, gradient systems, numerical methods, dynamics of algorithms, gradient dominance, gradient flows, contractions, contractive systems. Abstract:

The Lojasiewicz inequality characterizes objective-value convergence along gradient flows and, in special cases, yields exponential decay of the cost. However, such results do not directly imply convergence of the state. In this paper, we use contraction theory to derive state-space guarantees for gradient systems satisfying generalized Lojasiewicz inequalities. We first show that, when the objective has a unique strongly convex minimizer, the generalized Lojasiewicz inequality implies semi-global exponential stability; on arbitrary compact subsets, this yields exponential stability. We then give two curvature-based sufficient conditions, together with constraints on the Lojasiewicz rate, under which the nonconvex gradient flow is globally incrementally exponentially stable, a property strictly stronger than global exponential stability. A few examples are presented at the end of the paper to validate the proposed theory.

M.K. Wafi, A.C.B de Oliveira, and E.D. Sontag. On the (almost) global exponential convergence of overparameterized policy optimization for the LQR problem. In 2026 American Control Conference (ACC), 2026. Note: To appear. See also 2025 arXiv:2510.02140. [PDF] Keyword(s): gradient dynamics, gradient descent, gradient systems, numerical methods, dynamics of algorithms, gradient dominance, gradient flows, machine learning, artificial intelligence, LQR, reinforcement learning. Abstract:

In this work we study the convergence of gradient methods for nonconvex optimization problems -- specifically the effect of the problem formulation on the convergence behavior of the solution of a gradient flow. We show through a simple example that, surprisingly, the gradient flow solution can be exponentially or asymptotically convergent, depending on how the problem is formulated. We then deepen the analysis and show that a policy optimization strategy for the continuous-time linear quadratic regulator (LQR) (which is known to present only asymptotic convergence globally) presents almost global exponential convergence if the problem is overparameterized through a linear feed-forward neural network (LFFNN). We prove this qualitative improvement always happens for a simplified version of the LQR problem and derive explicit convergence rates for the gradient flow. Finally, we show that both the qualitative improvement and the quantitative rate gains persist in the general LQR through numerical simulations.

A.C.B de Oliveira, L. Cui, and E. D. Sontag. Remarks on the Polyak-Lojasiewicz inequality and the convergence of gradient systems. In Proc. 64th IEEE Conference on Decision and Control (CDC), pages 1150-1155, 2025. Note: Extended version in arXiv:2503.23641. [PDF] [doi:https://doi.org/10.48550/arXiv.2503.23641] Keyword(s): gradient dynamics, gradient descent, gradient systems, numerical methods, dynamics of algorithms, gradient dominance, gradient flows, machine learning, artificial intelligence, LQR, reinforcement learning. Abstract:

This work explores generalizations of the Polyak-Lojasiewicz inequality (PLI) and their implications for the convergence behavior of gradient flows in optimization problems. Motivated by the continuous-time linear quadratic regulator (CT-LQR) policy optimization problem -- where only a weaker version of the PLI is characterized in the literature -- this work shows that while weaker conditions are sufficient for global convergence to, and optimality of the set of critical points of the cost function, the "profile" of the gradient flow solution can change significantly depending on which "flavor" of inequality the cost satisfies. After a general theoretical analysis, we focus on fitting the CT-LQR policy optimization problem to the proposed framework, showing that, in fact, it can never satisfy a PLI in its strongest form. We follow up our analysis with a brief discussion on the difference between continuous- and discrete-time LQR policy optimization, and end the paper with some intuition on the extension of this framework to optimization problems with L1 regularization and solved through proximal gradient flows.

A.C.B de Oliveira, M. Siami, and E.D. Sontag. Remarks on the gradient training of linear neural network based feedback for the LQR Problem. In Proc. 2024 63rd IEEE Conference on Decision and Control (CDC), pages 7846-7852, 2024. [PDF] Keyword(s): gradient dynamics, gradient descent, gradient systems, numerical methods, dynamics of algorithms, gradient dominance, gradient flows, machine learning, artificial intelligence, neural networks, overparametrization, input to state stability, feedback control, LQR. Abstract:

Motivated by the current interest in using Artificial Intelligence (AI) tools in control design, this paper takes the first steps towards bridging results from gradient methods for solving the LQR control problem, and neural networks. More specifically, it looks into the case where one wants to find a Linear Feed-Forward Neural Network (LFFNN) that minimizes the Linear Quadratic Regulator (LQR) cost. This work develops gradient formulas that can be used to implement the training of LFFNNs to solve the LQR problem, and derives an important conservation law of the system. This conservation law is then leveraged to prove global convergence of solutions and invariance of the set of stabilizing networks under the training dynamics. These theoretical results are then followed by an extensive analysis of the simplest version of the problem (the "scalar case") and by numerical evidence of faster convergence of the training of general LFFNNs when compared to traditional direct gradient methods. These results not only serve as indication of the theoretical value of studying such a problem, but also of the practical value of LFFNNs as design tools for data-driven control applications.

A.C.B de Oliveira, M. Siami, and E.D. Sontag. Dynamics and perturbations of overparameterized linear neural networks. In Proc. 2023 62nd IEEE Conference on Decision and Control (CDC), pages 7356-7361, 2023. Note: Extended version is On the ISS property of the gradient flow for single hidden-layer neural networks with linear activations, arXiv https://arxiv.org/abs/2305.09904. [PDF] [doi:10.1109/CDC49753.2023.10383478] Keyword(s): gradient dominance, neural networks, overparametrization, gradient descent, gradient dynamics, gradient systems, numerical methods, dynamics of algorithms, input to state stability. Abstract:

Recent research in neural networks and machine learning suggests that using many more parameters than strictly required by the initial complexity of a regression problem can result in more accurate or faster-converging models -- contrary to classical statistical belief. This phenomenon, sometimes known as "benign overfitting", raises questions about the other ways in which overparameterization might affect the properties of a learning problem. In this work, we investigate the effects of overfitting on the robustness of gradient-descent training when subject to uncertainty on the gradient estimation. This uncertainty arises naturally if the gradient is estimated from noisy data or directly measured. Our object of study is a linear neural network with a single, arbitrarily wide, hidden layer and an arbitrary number of inputs and outputs. In this paper we solve the problem for the case where the input and output of our neural network are one-dimensional, deriving sufficient conditions for robustness of our system based on necessary and sufficient conditions for convergence in the undisturbed case. We then show that the general overparametrized formulation introduces a set of spurious equilibria which lie outside the set where the loss function is minimized, and discuss directions of future work that might extend our current results for more general formulations.

T. Natschläger, W. Maass, E.D. Sontag, and A. Zador. Processing of time series by neural circuits with biologically realistic synaptic dynamics. In Todd K. Leen, T. G. Dietterich, and V. Tresp, editors, Advances in Neural Information Processing Systems 13 (NIPS2000), pages 145-151, 2000. MIT Press, Cambridge. Note: Proc. NIPS(NeurIPS)-13, Denver, 2000, https://papers.nips.cc/paper_files/paper/2000. [PDF] Keyword(s): NeurIPS, machine learning, artificial intelligence, neural networks, Volterra series. Abstract:

Experimental data show that biological synapses are dynamic, i.e., their weight changes on a short time scale by several hundred percent in dependence of the past input to the synapse. In this article we explore the consequences that this synaptic dynamics entails for the computational power of feedforward neural networks. It turns out that even with just a single hidden layer such networks can approximate a surprisingly large class of nonlinear filters: all filters that can be characterized by Volterra series. This result is robust with regard to various changes in the model for synaptic dynamics. Furthermore we show that simple gradient descent suffices to approximate a given quadratic filter by a rather small neural system with dynamic synapses.

E.D. Sontag. Gradient techniques for systems with no drift: A classical idea revisited. In Proc. IEEE Conf. Decision and Control, San Antonio, Dec. 1993, IEEE Publications, pages 2706-2711, 1993. [PDF] Keyword(s): path-planning, systems without drift, nonlinear control, controllability, real-analytic functions, gradient dynamics, gradient descent, gradient systems, numerical methods, dynamics of algorithms. Abstract:

This paper proposes a technique for the control of analytic systems with no drift. It is based on the generation of "nonsingular loops" which allow linearized controllability. Once such loops are available, it is possible to employ standard Newton or steepest descent methods. The theoretical justification of the approach relies on results on genericity of nonsingular controls as well as a simple convergence lemma.

Internal reports

E.D. Sontag. Some remarks on the backpropagation algorithm for neural net learning. Technical report SYCON-88-02, Rutgers Center for Systems and Control, 1988. [PDF] Keyword(s): machine learning, artificial intelligence, neural networks. Abstract:

This is a very old informal report that discusses the study of local minima of quadratic loss functions for fitting errors in sigmoidal neural net learning. It also includes several remarks concerning the growth of weights during gradient descent. There is nothing very interesting here - far better knowledge is now available - but the report was placed here by request.

Disclaimer:

This material is presented to ensure timely dissemination of scholarly and technical work. Copyright and all rights therein are retained by authors or by other copyright holders.

Last modified: Fri Jun 19 21:49:04 2026
Author: sontag.

This document was translated from BibTeX by bibtex2html