Publications about 'learning'

Publications about 'learning'

Articles in journal or book chapters

A. Darabi, Z. An, M.A. Al-Radhawi, W. Cho, M. Siami, and E.D. Sontag. Combining model-based and data-driven models: an application to synthetic biology resource competition. 2025. Note: Submitted. Preprint in bioRxiv 2025/642275. [WWW] Keyword(s): mechanistic models, machine learning, neural networks, resource competition, synthetic biology. Abstract:

This work explores the integration of machine learning (ML) and mechanistic models (MM). While ML has demonstrated remarkable success in data-driven modeling across engineering, biology, and other scientific fields, MM remain essential for their interpretability and capacity to extrapolate beyond observed conditions based on established principles such as chemical kinetics and physiological processes. However, MM can be labor-intensive to construct and often rely on simplifying assumptions that may not fully capture real-world complexity. It is thus desirable to combine MM and ML approaches so as to enable more robust predictions, enhanced system insights, and improved handling of sparse or noisy data. A key challenge when doing so is ensuring that ML components do not disregard mechanistic information, potentially leading to overfitting or reduced interpretability. To address that challenge, this paper introduces the idea of Partially Uncertain Model Structures (PUMS) and investigates conditions that discourage the ML components from ignoring mechanistic constraints. This work also introduces the concept of embedded Physics-Informed Neural Networks (ePINNs), which consist of two loss-sharing neural networks that seamlessly blend ML and MM components. This work arose in the study of the context problem in synthetic biology. Engineered genetic circuits may exhibit unexpected behavior in living cells due to resource sharing. To illustrate the advantages of the ePINNs approach, this paper applies the framework to a gene network model subject to resource competition, demonstrating the effectiveness of this hybrid modeling approach in capturing complex system interactions while maintaining physical consistency.

Z. Liu, N. Ozay, and E. D. Sontag. Properties of immersions for systems with multiple limit sets with implications to learning Koopman embeddings. Automatica, 176:112226, 2025. [PDF] Keyword(s): linear systems, nonlinear systems, observables, Koopman embedding, duality. Abstract:

Linear immersions (or Koopman eigenmappings) of a nonlinear system have wide applications in prediction and control. In this work, we study the non-existence of one-to-one linear immersions for nonlinear systems with multiple omega-limit sets. While previous research has indicated the possibility of discontinuous one-to-one linear immersions for such systems, it remained uncertain whether continuous one-to-one linear immersions are attainable. Under mild conditions, we prove that any continuous one-to-one immersion to a class of systems including linear systems cannot distinguish different omega-limit sets, and thus cannot be one-to-one. Furthermore, we show that this property is also shared by approximate linear immersions learned from data as sample size increases and sampling interval decreases. Multiple examples are studied to illustrate our results.

E.D. Sontag. Some remarks on gradient dominance and LQR policy optimization. arXiv, 2025. [doi:https://doi.org/10.48550/arXiv.2507.10452] Keyword(s): gradient dominance, gradient flows, LQR, reinforcement learning, machine learning, artificial intelligence, optimal control. Abstract:

Solutions of optimization problems, including policy optimization in reinforcement learning, typically rely upon some variant of gradient descent. There has been much recent work in the machine learning, control, and optimization communities applying the Polyak-Åojasiewicz Inequality (PLI) to such problems in order to establish an exponential rate of convergence (a.k.a. ``linear convergence'' in the local-iteration language of numerical analysis) of loss functions to their minima under the gradient flow. Often, as is the case of policy iteration for the continuous-time LQR problem, this rate vanishes for large initial conditions, resulting in a mixed globally linear / locally exponential behavior. This is in sharp contrast with the discrete-time LQR problem, where there is global exponential convergence. That gap between CT and DT behaviors motivates the search for various generalized PLI-like conditions, and this paper addresses that topic. Moreover, these generalizations are key to understanding the transient and asymptotic effects of errors in the estimation of the gradient, errors which might arise from adversarial attacks, wrong evaluation by an oracle, early stopping of a simulation, inaccurate and very approximate digital twins, stochastic computations (algorithm ``reproducibility''), or learning by sampling from limited data. We describe an ``input to state stability'' (ISS) analysis of this issue. We also discuss convergence and PLI-like properties of ``linear feedforward neural networks'' in feedback control. Much of the work described here was done in collaboration with Arthur Castello B. de Oliveira, Leilei Cui, Zhong-Ping Jiang, and Milad Siami. This is a short paper summarizing the slides presented at my keynote at the 2025 L4DC (Learning for Dynamics \& Control Conference) in Ann Arbor, Michigan, 05 June 2025. A partial bibliography has been added.

H. Zhang, B. Yalcin, J. Lavaei, and E.D. Sontag. Exact recovery guarantees for parameterized nonlinear system identification problem under sparse disturbances or semi-oblivious attacks. 2025. Note: Submitted. Abstract:

In this work, we study the problem of learning a nonlinear dynamical system by parameterizing its dynamics using basis functions. We assume that disturbances occur at each time step with an arbitrary probability p, which models the sparsity level of the disturbance vectors over time. These disturbances are drawn from an arbitrary, unknown probability distribution, which may depend on past disturbances, provided that it satisfies a zero-mean assumption. The primary objective of this paper is to learn the system's dynamics within a f inite time and analyze the sample complexity as a function of p. To achieve this, we examine a LASSO-type non-smooth estimator and establish necessary and sufficient conditions for its well-specifiedness and the uniqueness of the global solution to the underlying optimization problem. We then provide exact recovery guarantees for the estimator under two distinct conditions: boundedness and Lipschitz continuity of the basis functions. We show that finite-time exact recovery is achieved with high probability, even when p approaches 1. Unlike prior works, which primarily focus on independent and identically distributed (i.i.d.) disturbances and provide only asymptotic guarantees for system learning, this study presents the first finite-time analysis of nonlinear dynamical systems under a highly general disturbance model. Our framework allows for possible temporal correlations in the disturbances and accommodates semi-oblivious adversarial attacks, significantly broadening the scope of existing theoretical results.

A.C.B de Olivera, M. Siami, and E.D. Sontag. Convergence analysis of overparametrized LQR formulations. Automatica, 2025. Note: To appear. Also more details in arXiv 2408.15456, pdf linked here is to the long version. [PDF] Keyword(s): learning theory, singularities in optimization, gradient systems, overparametrization, neural networks, overparametrization, gradient descent, input to state stability, feedback control, LQR. Abstract:

Motivated by the growing use of Artificial Intelligence (AI) tools in control design, this paper takes the first steps towards bridging the gap between results from Direct Gradient methods for the Linear Quadratic Regulator (LQR), and neural networks. More specifically, it looks into the case where one wants to find a Linear Feed-Forward Neural Network (LFFNN) feedback that minimizes a LQR cost. This paper starts by computing the gradient formulas for the parameters of each layer, which are used to derive a key conservation law of the system. This conservation law is then leveraged to prove boundedness and global convergence of solutions to critical points, and invariance of the set of stabilizing networks under the training dynamics. This is followed by an analysis of the case where the LFFNN has a single hidden layer. For this case, the paper proves that the training converges not only to critical points but to the optimal feedback control law for all but a set of measure-zero of the initializations. These theoretical results are followed by an extensive analysis of a simple version of the problem (the ``vector case''), proving the theoretical properties of accelerated convergence and robustness for this simpler example. Finally, the paper presents numerical evidence of faster convergence of the training of general LFFNNs when compared to traditional direct gradient methods, showing that the acceleration of the solution is observable even when the gradient is not explicitly computed but estimated from evaluations of the cost function.

E.D. Sontag. Remarks on input to state stability of perturbed gradient flows, motivated by model-free feedback control learning. Systems and Control Letters, 161:105138, 2022. Note: Important: there is an error in the paper. For the LQR application, the paper only shows iISS, not ISS. See the paper Small-disturbance input-to-state stability of perturbed gradient flows: Applications to LQR problem for details.[PDF] Keyword(s): iss, input to state stability, data-driven control, gradient systems, steepest descent, model-free control. Abstract:

Recent work on data-driven control and reinforcement learning has renewed interest in a relatively old field in control theory: model-free optimal control approaches which work directly with a cost function and do not rely upon perfect knowledge of a system model. Instead, an "oracle" returns an estimate of the cost associated to, for example, a proposed linear feedback law to solve a linear-quadratic regulator problem. This estimate, and an estimate of the gradient of the cost, might be obtained by performing experiments on the physical system being controlled. This motivates in turn the analysis of steepest descent algorithms and their associated gradient differential equations. This paper studies the effect of errors in the estimation of the gradient, framed in the language of input to state stability, where the input represents a perturbation from the true gradient. Since one needs to study systems evolving on proper open subsets of Euclidean space, a self-contained review of input to state stability definitions and theorems for systems that evolve on such sets is included. The results are then applied to the study of noisy gradient systems, as well as the associated steepest descent algorithms.

J. Hanson, M. Raginsky, and E.D. Sontag. Learning recurrent neural net models of nonlinear systems. Proc. of Machine Learning Research, 144:1-11, 2021. [PDF] Keyword(s): machine learning, empirical risk minimization, recurrent neural networks, dynamical systems, continuous time, system identification, statistical learning theory, generalization bounds. Abstract:

This paper considers the following learning problem: given sample pairs of input and output signals generated by an unknown nonlinear system (which is not assumed to be causal or time-invariant), one wishes to find a continuous-time recurrent neural net, with activation function tanh, that approximately reproduces the underlying i/o behavior with high confidence. Leveraging earlier work concerned with matching derivatives up to a finite order of the input and output signals the problem is reformulated in familiar system-theoretic language and quantitative guarantees on the sup-norm risk of the learned model are derived, in terms of the number of neurons, the sample size, the number of derivatives being matched, and the regularity properties of the inputs, the outputs, and the unknown i/o map.

W. Maass, P. Joshi, and E.D. Sontag. Computational aspects of feedback in neural circuits. PLoS Computational Biology, 3:e165 1-20, 2007. [PDF] Keyword(s): machine learning, neural networks, feedback linearization, computation by cortical microcircuits, fading memory. Abstract:

It had previously been shown that generic cortical microcircuit models can perform complex real-time computations on continuous input streams, provided that these computations can be carried out with a rapidly fading memory. We investigate in this article the computational capability of such circuits in the more realistic case where not only readout neurons, but in addition a few neurons within the circuit have been trained for specific tasks. This is essentially equivalent to the case where the output of trained readout neurons is fed back into the circuit. We show that this new model overcomes the limitation of a rapidly fading memory. In fact, we prove that in the idealized case without noise it can carry out any conceivable digital or analog computation on time-varying inputs. But even with noise the resulting computational model can perform a large class of biologically relevant real-time computations that require a non-fading memory.

P. Kuusela, D. Ocone, and E.D. Sontag. Learning Complexity Dimensions for a Continuous-Time Control System. SIAM J. Control Optim., 43(3):872-898, 2004. [PDF] [doi:http://dx.doi.org/10.1137/S0363012901384302] Keyword(s): machine learning, theory of computing and complexity, VC dimension, neural networks. Abstract:

This paper takes a computational learning theory approach to a problem of linear systems identification. It is assumed that input signals have only a finite number k of frequency components, and systems to be identified have dimension no greater than n. The main result establishes that the sample complexity needed for identification scales polynomially with n and logarithmically with k.

W. Maass and E.D. Sontag. Analog neural nets with Gaussian or other common noise distributions cannot recognize arbitrary regular languages. Neural Comput., 11(3):771-782, 1999. [PDF] [doi:http://dx.doi.org/10.1162/089976699300016656] Keyword(s): machine learning, neural networks. Abstract:

We consider recurrent analog neural nets where the output of each gate is subject to Gaussian noise, or any other common noise distribution that is nonzero on a large set. We show that many regular languages cannot be recognized by networks of this type, and we give a precise characterization of those languages which can be recognized. This result implies severe constraints on possibilities for constructing recurrent analog neural nets that are robust against realistic types of analog noise. On the other hand we present a method for constructing feedforward analog neural nets that are robust with regard to analog noise of this type.

E.D. Sontag and Y. Qiao. Further results on controllability of recurrent neural networks. Systems Control Lett., 36(2):121-129, 1999. [PDF] Keyword(s): machine learning, controllability, recurrent neural networks, neural networks. Abstract:

This paper studies controllability properties of recurrent neural networks. The new contributions are: (1) an extension of the result in "Complete controllability of continuous-time recurrent neural networks" to a slightly different model, where inputs appear in an affine form, (2) a formulation and proof of a necessary and sufficient condition, in terms of local-local controllability, and (3) a complete analysis of the 2-dimensional case for which the hypotheses made in previous work do not apply.

E.D. Sontag. VC dimension of neural networks. In C.M. Bishop, editor, Neural Networks and Machine Learning, pages 69-95. Springer, Berlin, 1998. [PDF] Keyword(s): machine learning, VC dimension, learning, neural networks, shattering. Abstract:

The Vapnik-Chervonenkis (VC) dimension is an integer which helps to characterize distribution-independent learning of binary concepts from positive and negative samples. This paper, based on lectures delivered at the Isaac Newton Institute in August of 1997, presents a brief introduction, establishes various elementary results, and discusses how to estimate the VC dimension in several examples of interest in neural network theory. (It does not address the learning and estimation-theoretic applications of VC dimension, and the applications to uniform convergence theorems for empirical probabilities, for which many suitable references are available.)

P. Koiran and E.D. Sontag. Vapnik-Chervonenkis dimension of recurrent neural networks. Discrete Appl. Math., 86(1):63-79, 1998. [PDF] [doi:http://dx.doi.org/10.1016/S0166-218X(98)00014-6] Keyword(s): machine learning, neural networks, recurrent neural networks. Abstract:

This paper provides lower and upper bounds for the VC dimension of recurrent networks. Several types of activation functions are discussed, including threshold, polynomial, piecewise-polynomial and sigmoidal functions. The bounds depend on two independent parameters: the number w of weights in the network, and the length k of the input sequence. Ignoring multiplicative constants, the main results say roughly the following: 1. For architectures whose activation is any fixed nonlinear polynomial, the VC dimension is proportional to wk. 2. For architectures whose activation is any fixed piecewise polynomial, the VC dimension is between wk and w**2k. 3. For architectures with threshold activations, the VC dimension is between wlog(k/w) and the smallest of wklog(wk) and w**2+wlog(wk). 4. For the standard sigmoid tanh(x), the VC dimension is between wk and w**4 k**2.

E.D. Sontag. A learning result for continuous-time recurrent neural networks. Systems Control Lett., 34(3):151-158, 1998. [PDF] [doi:http://dx.doi.org/10.1016/S0167-6911(98)00006-1] Keyword(s): machine learning, neural networks, VC dimension, recurrent neural networks. Abstract:

The following learning problem is considered, for continuous-time recurrent neural networks having sigmoidal activation functions. Given a ``black box'' representing an unknown system, measurements of output derivatives are collected, for a set of randomly generated inputs, and a network is used to approximate the observed behavior. It is shown that the number of inputs needed for reliable generalization (the sample complexity of the learning problem) is upper bounded by an expression that grows polynomially with the dimension of the network and logarithmically with the number of output derivatives being matched.

P. Koiran and E.D. Sontag. Vapnik-Chervonenkis dimension of recurrent neural networks. In Computational learning theory (Jerusalem, 1997), volume 1208 of Lecture Notes in Comput. Sci., pages 223-237. Springer-Verlag, London, UK, 1997. Keyword(s): machine learning, neural networks, VC dimension, recurrent neural networks.

E.D. Sontag. Recurrent neural networks: Some systems-theoretic aspects. In M. Karny, K. Warwick, and V. Kurkova, editors, Dealing with Complexity: a Neural Network Approach, pages 1-12. Springer-Verlag, London, 1997. [PDF] Keyword(s): machine learning, neural networks, recurrent neural networks, learning, VC dimension. Abstract:

This paper provides an exposition of some recent results regarding system-theoretic aspects of continuous-time recurrent (dynamic) neural networks with sigmoidal activation functions. The class of systems is introduced and discussed, and a result is cited regarding their universal approximation properties. Known characterizations of controllability, observability, and parameter identifiability are reviewed, as well as a result on minimality. Facts regarding the computational power of recurrent nets are also mentioned.

M. J. Donahue, L. Gurvits, C. Darken, and E.D. Sontag. Rates of convex approximation in non-Hilbert spaces. Constr. Approx., 13(2):187-220, 1997. [PDF] Keyword(s): machine learning, neural networks, optimization, approximation theory. Abstract:

This paper deals with sparse approximations by means of convex combinations of elements from a predetermined "basis" subset S of a function space. Specifically, the focus is on the rate at which the lowest achievable error can be reduced as larger subsets of S are allowed when constructing an approximant. The new results extend those given for Hilbert spaces by Jones and Barron, including in particular a computationally attractive incremental approximation scheme. Bounds are derived for broad classes of Banach spaces. The techniques used borrow from results regarding moduli of smoothness in functional analysis as well as from the theory of stochastic processes on function spaces.

P. Koiran and E.D. Sontag. Neural networks with quadratic VC dimension. J. Comput. System Sci., 54(1, part 2):190-198, 1997. Note: (1st Annual Dagstuhl Seminar on Neural Computing, 1994). [PDF] [doi:http://dx.doi.org/10.1006/jcss.1997.1479] Keyword(s): machine learning, neural networks, VC dimension. Abstract:

This paper shows that neural networks which use continuous activation functions have VC dimension at least as large as the square of the number of weights w. This result settles the open question of whether whether the well-known O(w log w) bound, known for hard-threshold nets, also held for more general sigmoidal nets. Implications for the number of samples needed for valid generalization are discussed.

R. Koplon and E.D. Sontag. Using Fourier-neural recurrent networks to fit sequential input/output data. Neurocomputing, 15:225-248, 1997. [PDF] Keyword(s): machine learning, neural networks, recurrent neural networks. Abstract:

This paper suggests the use of Fourier-type activation functions in fully recurrent neural networks. The main theoretical advantage is that, in principle, the problem of recovering internal coefficients from input/output data is solvable in closed form.

E.D. Sontag. Shattering all sets of k points in `general position' requires (k-1)/2 parameters. Neural Comput., 9(2):337-348, 1997. [PDF] Keyword(s): machine learning, neural networks, VC dimension, real-analytic functions. Abstract:

For classes of concepts defined by certain classes of analytic functions depending on k parameters, there are nonempty open sets of samples of length 2k+2 which cannot be shattered. A slighly weaker result is also proved for piecewise-analytic functions. The special case of neural networks is discussed.

E.D. Sontag and H.J. Sussmann. Complete controllability of continuous-time recurrent neural networks. Systems Control Lett., 30(4):177-183, 1997. [PDF] [doi:http://dx.doi.org/10.1016/S0167-6911(97)00002-9] Keyword(s): machine learning, neural networks, recurrent neural networks. Abstract:

This paper presents a characterization of controllability for the class of control systems commonly called (continuous-time) recurrent neural networks. The characterization involves a simple condition on the input matrix, and is proved when the activation function is the hyperbolic tangent.

B. DasGupta and E.D. Sontag. Sample complexity for learning recurrent perceptron mappings. IEEE Trans. Inform. Theory, 42(5):1479-1487, 1996. [PDF] Keyword(s): machine learning, neural networks, VC dimension, recurrent neural networks. Abstract:

Recurrent perceptron classifiers generalize the usual perceptron model. They correspond to linear transformations of input vectors obtained by means of "autoregressive moving-average schemes", or infinite impulse response filters, and allow taking into account those correlations and dependences among input coordinates which arise from linear digital filtering. This paper provides tight bounds on sample complexity associated to the fitting of such models to experimental data. The results are expressed in the context of the theory of probably approximately correct (PAC) learning.

E.D. Sontag. Critical points for least-squares problems involving certain analytic functions, with applications to sigmoidal nets. Adv. Comput. Math., 5(2-3):245-268, 1996. [PDF] Keyword(s): machine learning, subanalytic sets, semianalytic sets, critical points, approximation theory, neural networks, real-analytic functions. Abstract:

This paper deals with nonlinear least-squares problems involving the fitting to data of parameterized analytic functions. For generic regression data, a general result establishes the countability, and under stronger assumptions finiteness, of the set of functions giving rise to critical points of the quadratic loss function. In the special case of what are usually called "single-hidden layer neural networks", which are built upon the standard sigmoidal activation tanh(x) or equivalently 1/(1+exp(-x)), a rough upper bound for this cardinality is provided as well.

B. DasGupta, H.T. Siegelmann, and E.D. Sontag. On the complexity of training neural networks with continuous activation functions. IEEE Trans. Neural Networks, 6:1490-1504, 1995. [PDF] Keyword(s): machine learning, neural networks, analog computing, theory of computing, neural networks, computational complexity, machine learning. Abstract:

Blum and Rivest showed that any possible neural net learning algorithm based on fixed architectures faces severe computational barriers. This paper extends their NP-completeness result, which applied only to nets based on hard threshold activations, to nets that employ a particular continuous activation. In view of neural network practice, this is a more relevant result to understanding the limitations of backpropagation and related techniques.

H. T. Siegelmann and E.D. Sontag. On the computational power of neural nets. J. Comput. System Sci., 50(1):132-150, 1995. [PDF] [doi:http://dx.doi.org/10.1006/jcss.1995.1013] Keyword(s): machine learning, neural networks, recurrent neural networks, machine learning, analog computing, theory of computing, neural networks, computational complexity, super-Turing computation. Abstract:

This paper deals with finite size networks which consist of interconnections of synchronously evolving processors. Each processor updates its state by applying a "sigmoidal" function to a rational-coefficient linear combination of the previous states of all units. We prove that one may simulate all Turing Machines by such nets. In particular, one can simulate any multi-stack Turing Machine in real time, and there is a net made up of 886 processors which computes a universal partial-recursive function. Products (high order nets) are not required, contrary to what had been stated in the literature. Non-deterministic Turing Machines can be simulated by non-deterministic rational nets, also in real time. The simulation result has many consequences regarding the decidability, or more generally the complexity, of questions about recursive nets.

B. DasGupta, H.T. Siegelmann, and E.D. Sontag. On the Intractability of Loading Neural Networks. In V. P. Roychowdhury, Siu K. Y., and Orlitsky A., editors, Theoretical Advances in Neural Computation and Learning, pages 357-389. Kluwer Academic Publishers, 1994. [PDF] Keyword(s): analog computing, neural networks, computational complexity, machine learning.

W. Maass, G. Schnitger, and E.D. Sontag. A comparison of the computational power of sigmoid and Boolean threshold circuits. In V. P. Roychowdhury, Siu K. Y., and Orlitsky A., editors, Theoretical Advances in Neural Computation and Learning, pages 127-151. Kluwer Academic Publishers, 1994. [PDF] Keyword(s): machine learning, neural networks, boolean systems. Abstract:

We examine the power of constant depth circuits with sigmoid threshold gates for computing boolean functions. It is shown that, for depth 2, constant size circuits of this type are strictly more powerful than constant size boolean threshold circuits (i.e. circuits with linear threshold gates). On the other hand it turns out that, for any constant depth d, polynomial size sigmoid threshold circuits with polynomially bounded weights compute exactly the same boolean functions as the corresponding circuits with linear threshold gates.

F. Albertini and E.D. Sontag. State observability in recurrent neural networks. Systems Control Lett., 22(4):235-244, 1994. [PDF] [doi:http://dx.doi.org/10.1016/0167-6911(94)90054-X] Keyword(s): machine learning, neural networks, recurrent neural networks, observability, identifiability. Abstract:

This paper concerns recurrent networks x'=s(Ax+Bu), y=Cx, where s is a sigmoid, in both discrete time and continuous time. Our main result is that observability can be characterized, if one assumes certain conditions on the nonlinearity and on the system, in a manner very analogous to that of the linear case. Recall that for the latter, observability is equivalent to the requirement that there not be any nontrivial A-invariant subspace included in the kernel of C. We show that the result generalizes in a natural manner, except that one now needs to restrict attention to certain special "coordinate" subspaces.

F. Albertini, E.D. Sontag, and V. Maillot. Uniqueness of weights for neural networks. In R. Mammone, editor, Artificial Neural Networks for Speech and Vision, pages 115-125. Chapman and Hall, London, 1993. [PDF] Keyword(s): machine learning, neural networks, recurrent neural networks. Abstract:

In this short expository survey, we sketch various known facts about uniqueness of weights in neural networks, including results about recurrent nets, and we provide a new and elementary complex-variable proof of a uniqueness result that applies in the single hidden layer case.

E.D. Sontag. Neural networks for control. In H. L. Trentelman and J. C. Willems, editors, Essays on control: perspectives in the theory and its applications (Groningen, 1993), volume 14 of Progr. Systems Control Theory, pages 339-380. Birkhäuser Boston, Boston, MA, 1993. Note: A longer version (tech report with more details) is here: http://sontaglab.org/FTPDIR/neural-nets-siemens.pdf. [PDF] Keyword(s): neural networks, recurrent neural networks, machine learning, neural networks. Abstract:

This paper has an expository introduction to two related topics: (a) Some mathematical results regarding "neural networks", and (b) so-called "neurocontrol" and "learning control" (each part can be read independently of the other). It was prepared for a short course given at the 1993 European Control Conference.

F. Albertini and E.D. Sontag. For neural networks, function determines form. Neural Networks, 6(7):975-990, 1993. [PDF] Keyword(s): machine learning, neural networks, identifiability, recurrent neural networks, realization theory, observability, neural networks. Abstract:

This paper shows that the weights of continuous-time feedback neural networks x'=s(Ax+Bu), y=Cx (where s is a sigmoid) are uniquely identifiable from input/output measurements. Under very weak genericity assumptions, the following is true: Assume given two nets, whose neurons all have the same nonlinear activation function s; if the two nets have equal behaviors as "black boxes" then necessarily they must have the same number of neurons and -except at most for sign reversals at each node- the same weights. Moreover, even if the activations are not a priori known to coincide, they are shown to be also essentially determined from the external measurements.

E.D. Sontag. Feedback stabilization using two-hidden-layer nets. IEEE Trans. Neural Networks, 3:981-990, 1992. [PDF] Keyword(s): machine learning, neural networks, feedback stabilization. Abstract:

This paper compares the representational capabilities of one hidden layer and two hidden layer nets consisting of feedforward interconnections of linear threshold units. It is remarked that for certain problems two hidden layers are required, contrary to what might be in principle expected from the known approximation theorems. The differences are not based on numerical accuracy or number of units needed, nor on capabilities for feature extraction, but rather on a much more basic classification into "direct" and "inverse" problems. The former correspond to the approximation of continuous functions, while the latter are concerned with approximating one-sided inverses of continuous functions - and are often encountered in the context of inverse kinematics determination or in control questions. A general result is given showing that nonlinear control systems can be stabilized using two hidden layers, but not in general using just one.

E.D. Sontag. Feedforward nets for interpolation and classification. J. Comput. System Sci., 45(1):20-48, 1992. [PDF] [doi:http://dx.doi.org/10.1016/0022-0000(92)90039-L] Keyword(s): machine learning, neural networks, VC dimension, boolean systems. Abstract:

This paper deals with single-hidden-layer feedforward nets, studying various aspects of classification power and interpolation capability. In particular, a worst-case analysis shows that direct input to output connections in threshold nets double the recognition but not the interpolation power, while using sigmoids rather than thresholds allows doubling both. For other measures of classification, including the Vapnik-Chervonenkis dimension, the effect of direct connections or sigmoidal activations is studied in the special case of two-dimensional inputs.

E.D. Sontag. Capabilities and training of feedforward nets. In Neural networks (New Brunswick, NJ, 1990), pages 303-321. Academic Press, Boston, MA, 1991. [PDF] Keyword(s): machine learning, machine learning, neural networks. Abstract:

This paper surveys recent work by the author on learning and representational capabilities of feedforward nets. The learning results show that, among two possible variants of the so-called backpropagation training method for sigmoidal nets, both of which variants are used in practice, one is a better generalization of the older perceptron training algorithm than the other. The representation results show that nets consisting of sigmoidal neurons have at least twice the representational capabilities of nets that use classical threshold neurons, at least when this increase is quantified in terms of classification power. On the other hand, threshold nets are shown to be more useful when approximating implicit functions, as illustrated with an application to a typical control problem.

H. T. Siegelmann and E.D. Sontag. Turing computability with neural nets. Appl. Math. Lett., 4(6):77-80, 1991. [PDF] Keyword(s): machine learning, neural networks, computational complexity, recurrent neural networks. Abstract:

This paper shows the existence of a finite neural network, made up of sigmoidal neurons, which simulates a universal Turing machine. It is composed of less than 100,000 synchronously evolving processors, interconnected linearly. High-order connections are not required. (Note: this paper was placed here by special request. The results in this paper have been by now improved considerably: see the JCSS pape which among other aspects provides a polynomial time simulation. This paper, based on a unary encoding, results in an exponential slowdown).

E.D. Sontag and H.J. Sussmann. Back propagation separates where perceptrons do. Neural Networks, 4(2):243-249, 1991. [PDF] [doi:http://dx.doi.org/10.1016/0893-6080(91)90008-S] Keyword(s): machine learning, neural networks. Abstract:

Feedforward nets with sigmoidal activation functions are often designed by minimizing a cost criterion. It has been pointed out before that this technique may be outperformed by the classical perceptron learning rule, at least on some problems. In this paper, we show that no such pathologies can arise if the error criterion is of a threshold LMS type, i.e., is zero for values ``beyond'' the desired target values. More precisely, we show that if the data are linearly separable, and one considers nets with no hidden neurons, then an error function as above cannot have any local minima that are not global. In addition, the proof gives the following stronger result, under the stated hypotheses: the continuous gradient adjustment procedure is such that from any initial weight configuration a separating set of weights is obtained in finite time. This is a precise analogue of the Perceptron Learning Theorem. The results are then compared with the more classical pattern recognition problem of threshold LMS with linear activations, where no spurious local minima exist even for nonseparable data: here it is shown that even if using the threshold criterion, such bad local minima may occur, if the data are not separable and sigmoids are used. keywords = { neural networks , feedforward neural nets },

E.D. Sontag. Sigmoids distinguish more efficiently than Heavisides. Neural Computation, 1:470-472, 1989. [PDF] Keyword(s): machine learning, neural networks, boolean systems. Abstract:

Every dichotomy on a 2k-point set in Rn can be implemented by a neural net with a single hidden layer containing k sigmoidal neurons. If the neurons were of a hardlimiter (Heaviside) type, 2k-1 would be in general needed.

E.D. Sontag and H.J. Sussmann. Backpropagation can give rise to spurious local minima even for networks without hidden layers. Complex Systems, 3(1):91-106, 1989. [PDF] Keyword(s): machine learning, neural networks. Abstract:

We give an example of a neural net without hidden layers and with a sigmoid transfer function, together with a training set of binary vectors, for which the sum of the squared errors, regarded as a function of the weights, has a local minimum which is not a global minimum. The example consists of a set of 125 training instances, with four weights and a threshold to be learnt. We do not know if substantially smaller binary examples exist.

Conference articles

M. Sznaier, F. Allgower, A. C. B. de Oliveira, N. Ozay, and E. D. Sontag. Tutorial: Data driven and learning enabled control. In Proc. 64th IEEE Conference on Decision and Control (CDC), 2025. Note: Submitted. Keyword(s): data-drive control, reinforcement learning. Abstract:

Data-driven control (DDC), that is the design of controllers directly from observed data, has attracted substantial attention in recent years due to its advantages over model-based control. DDC avoids a computationally expensive, potentially conservative model identification step and bypasses practically difficult questions such as model order/class selection. This tutorial paper seeks to offer a sampling of the different approaches that have been recently used to synthesize data driven controllers and filters, covering both analytic approaches and learning enabled ones, indicating the relative strengths of each. A second objective is to provide a key to the rapidly expanding literature in the subject, to help researchers newly interested in this field to quickly come up to speed.

A.C.B de Olivera, L. Cui, and E. D. Sontag. Remarks on the Polyak-Lojasiewicz inequality and the convergence of gradient systems. In Proc. 64th IEEE Conference on Decision and Control (CDC), 2025. Note: Submitted. Keyword(s): gradient dominance, gradient flows, LQR, reinforcement learning. Abstract:

This work explores generalizations of the Polyak-Lojasiewicz inequality (PLI) and their implications for the convergence behavior of gradient flows in optimization problems. Motivated by the continuous-time linear quadratic regulator (CT-LQR) policy optimization problem -- where only a weaker version of the PLI is characterized in the literature -- this work shows that while weaker conditions are sufficient for global convergence to, and optimality of the set of critical points of the cost function, the "profile" of the gradient flow solution can change significantly depending on which "flavor" of inequality the cost satisfies. After a general theoretical analysis, we focus on fitting the CT-LQR policy optimization problem to the proposed framework, showing that, in fact, it can never satisfy a PLI in its strongest form. We follow up our analysis with a brief discussion on the difference between continuous- and discrete-time LQR policy optimization, and end the paper with some intuition on the extension of this framework to optimization problems with L1 regularization and solved through proximal gradient flows.

A.C.B de Olivera, M. Siami, and E.D. Sontag. Dynamics and perturbations of overparameterized linear neural networks. In Proc. 2023 62st IEEE Conference on Decision and Control (CDC), pages 7356-7361, 2023. Note: Extended version is On the ISS property of the gradient flow for single hidden-layer neural networks with linear activations, arXiv https://arxiv.org/abs/2305.09904. [PDF] [doi:10.1109/CDC49753.2023.10383478] Keyword(s): neural networks, overparametrization, gradient descent, input to state stability, gradient systems. Abstract:

Recent research in neural networks and machine learning suggests that using many more parameters than strictly required by the initial complexity of a regression problem can result in more accurate or faster-converging models -- contrary to classical statistical belief. This phenomenon, sometimes known as ``benign overfitting'', raises questions regarding in what other ways might overparameterization affect the properties of a learning problem. In this work, we investigate the effects of overfitting on the robustness of gradient-descent training when subject to uncertainty on the gradient estimation. This uncertainty arises naturally if the gradient is estimated from noisy data or directly measured. Our object of study is a linear neural network with a single, arbitrarily wide, hidden layer and an arbitrary number of inputs and outputs. In this paper we solve the problem for the case where the input and output of our neural-network are one-dimensional, deriving sufficient conditions for robustness of our system based on necessary and sufficient conditions for convergence in the undisturbed case. We then show that the general overparametrized formulation introduces a set of spurious equilibria which lay outside the set where the loss function is minimized, and discuss directions of future work that might extend our current results for more general formulations.

M. Sznaier, A. Olshevsky, and E.D. Sontag. The role of systems theory in control oriented learning. In Proc. 25th Int. Symp. Mathematical Theory of Networks and Systems (MTNS 2022), 2022. Note: Looks like only the abstract was published!. [PDF] Keyword(s): control oriented learning, neural networks, reinforcement learning, feedback control, machine learning. Abstract:

Systems theory can play an important in unveiling fundamental limitations of learning algorithms and architectures when used to control a dynamical system, and in suggesting strategies for overcoming these limitations. As an example, a feedforward neural network cannot stabilize a double integrator using output feedback. Similarly, a recurrent NN with differentiable activation functions that stabilizes a non-strongly stabilizable system must be itself open loop unstable, a fact that has profound implications for training with noisy, finite data. A potential solution to this problem, motivated by results on stabilization with periodic control, is the use of neural nets with periodic resets, showing that indeed systems theoretic analysis is instrumental in developing architectures capable of controlling certain classes of unstable systems. This short conference paper also argues that when the goal is to learn control oriented models, the loss function should reflect closed loop, rather than open loop model performance, a fact that can be accomplished by using gap-metric motivated loss functions.

W. Maass and E.D. Sontag. A precise characterization of the class of languages recognized by neural nets under Gaussian and other common noise distributions. In Proceedings of the 1998 conference on Advances in neural information processing systems II, Cambridge, MA, USA, pages 281-287, 1999. MIT Press. [PDF] Keyword(s): machine learning, neural networks.

E.D. Sontag and Y. Qiao. Remarks on controllability of recurrent neural networks. In Proc. IEEE Conf. Decision and Control, Tampa, Dec. 1998, IEEE Publications, 1998, pages 501-506, 1998. Keyword(s): machine learning, neural networks, recurrent neural networks.

E.D. Sontag. Some learning and systems-theoretic questions regarding recurrent neural networks. In Proc. Conf. on Information Sciences and Systems (CISS 97), Johns Hopkins, Baltimore, MD, March 1997, pages 630-635, 1997. Keyword(s): machine learning, neural networks, VC dimension, recurrent neural networks.

B. Dasgupta and E.D. Sontag. Sample complexity for learning recurrent perceptron mappings. In D.S. Touretzky, M.C. Moser, and M.E. Hasselmo, editors, Advances in Neural Information Processing Systems 8, pages 204-210, 1996. MIT Press, Cambridge, MA. Keyword(s): machine learning, neural networks, VC dimension, recurrent neural networks.

P. Koiran and E.D. Sontag. Neural networks with quadratic VC dimension. In D.S. Touretzky, M.C. Moser, and M.E. Hasselmo, editors, Advances in Neural Information Processing Systems 8, pages 197-203, 1996. MIT Press, Cambridge, MA. Keyword(s): machine learning, neural networks, VC dimension.

B. DasGupta, H. T. Siegelmann, and E.D. Sontag. On a learnability question associated to neural networks with continuous activations (extended abstract). In COLT '94: Proceedings of the seventh annual conference on Computational learning theory, New York, NY, USA, pages 47-56, 1994. ACM Press. [doi:http://doi.acm.org/10.1145/180139.181009] Keyword(s): machine learning, analog computing, neural networks, computational complexity.

R. Koplon and E.D. Sontag. Techniques for parameter reconstruction in Fourier-Neural recurrent networks. In Proc. IEEE Conf. Decision and Control, Orlando, Dec. 1994, IEEE Publications, 1994, pages 213-218, 1994. Keyword(s): machine learning, neural networks, recurrent neural networks.

F. Albertini and E.D. Sontag. Identifiability of discrete-time neural networks. In Proc. European Control Conf., Groningen, June 1993, pages 460-465, 1993. Keyword(s): machine learning, neural networks, recurrent neural networks.

F. Albertini and E.D. Sontag. State observability in recurrent neural networks. In Proc. IEEE Conf. Decision and Control, San Antonio, Dec. 1993, IEEE Publications, 1993, pages 3706-3707, 1993. Keyword(s): machine learning, neural networks, observability, recurrent neural networks.

F. Albertini and E.D. Sontag. Uniqueness of weights for recurrent nets. In Systems and Networks: Mathematical Theory and Applications, Proc. MTNS '93, Vol. 2, Akad. Verlag, Regensburg, pages 599-602, 1993. Note: Full version, never submitted for publication, is here: http://sontaglab.org/FTPDIR/93mtns-nn-extended.pdf. [PDF] Keyword(s): machine learning, neural networks, identifiability, recurrent neural networks. Abstract:

This paper concerns recurrent networks x'=s(Ax+Bu), y=Cx, where s is a sigmoid, in both discrete time and continuous time. The paper establishes parameter identifiability under stronger assumptions on the activation than in "For neural networks, function determines form", but on the other hand deals with arbitrary (nonzero) initial states.

J. L. Balcázar, R. Gavaldà, H. T. Siegelmann, and E.D. Sontag. Some structural complexity aspects of neural computation. In Proceedings of the Eighth Annual Structure in Complexity Theory Conference (San Diego, CA, 1993), Los Alamitos, CA, pages 253-265, 1993. IEEE Comput. Soc. Press. [PDF] Keyword(s): machine learning, analog computing, neural networks, computational complexity, super-Turing computation, theory of computing and complexity. Abstract:

Recent work by H.T. Siegelmann and E.D. Sontag (1992) has demonstrated that polynomial time on linear saturated recurrent neural networks equals polynomial time on standard computational models: Turing machines if the weights of the net are rationals, and nonuniform circuits if the weights are real. Here, further connections between the languages recognized by such neural nets and other complexity classes are developed. Connections to space-bounded classes, simulation of parallel computational models such as Vector Machines, and a discussion of the characterizations of various nonuniform classes in terms of Kolmogorov complexity are presented.

C. Darken, M.J. Donahue, L. Gurvits, and E.D. Sontag. Rate of approximation results motivated by robust neural network learning. In COLT '93: Proceedings of the sixth annual conference on Computational learning theory, New York, NY, USA, pages 303-309, 1993. ACM Press. [doi:http://doi.acm.org/10.1145/168304.168357] Keyword(s): machine learning, neural networks, optimization problems, approximation theory.

A. Macintyre and E.D. Sontag. Finiteness results for sigmoidal neural networks. In STOC '93: Proceedings of the twenty-fifth annual ACM symposium on Theory of computing, New York, NY, USA, pages 325-334, 1993. ACM Press. [PDF] [doi:http://doi.acm.org/10.1145/167088.167192] Keyword(s): machine learning, neural networks, theory of computing and complexity, real-analytic functions. Abstract:

This paper deals with analog circuits. It establishes the finiteness of VC dimension, teaching dimension, and several other measures of sample complexity which arise in learning theory. It also shows that the equivalence of behaviors, and the loading problem, are effectively decidable, modulo a widely believed conjecture in number theory. The results, the first ones that are independent of weight size, apply when the gate function is the "standard sigmoid" commonly used in neural networks research. The proofs rely on very recent developments in the elementary theory of real numbers with exponentiation. (Some weaker conclusions are also given for more general analytic gate functions.) Applications to learnability of sparse polynomials are also mentioned.

F. Albertini and E.D. Sontag. For neural networks, function determines form. In Proc. IEEE Conf. Decision and Control, Tucson, Dec. 1992, IEEE Publications, 1992, pages 26-31, 1992. Keyword(s): machine learning, neural networks, recurrent neural networks.

H.T. Siegelmann, E.D. Sontag, and C.L. Giles. The Complexity of Language Recognition by Neural Networks. In Proceedings of the IFIP 12th World Computer Congress on Algorithms, Software, Architecture - Information Processing '92, Volume 1, pages 329-335, 1992. North-Holland. Keyword(s): machine learning, neural networks, computational complexity, machine learning, recurrent neural networks, theory of computing and complexity.

E.D. Sontag. Neural nets as systems models and controllers. In Proc. Seventh Yale Workshop on Adaptive and Learning Systems, Yale University, 1992, pages 73-79, 1992. [PDF] Keyword(s): machine learning, neural networks, recurrent neural networks, neural networks. Abstract:

A conference paper. Placed here because it was requested, but contains little that is not also contained in the survey on neural nets mentioned above.

E.D. Sontag. Systems combining linearity and saturations, and relations to neural nets. In Nonlinear Control Systems Design 1992, IFAC Symposia Series, 1993, M. Fliess Ed., Pergamon Press, Oxford, 1993, pages 15-21, 1992. Note: (Also in Proc. Nonlinear Control Systems Design Symp., Bordeaux, June 1992, M. Fliess, Ed., IFAC Publications, pp. 242-247). Keyword(s): machine learning, neural networks, recurrent neural networks.

W. Maass, G. Schnitger, and E.D. Sontag. On the computational power of sigmoid versus Boolean threshold circuits (extended abstract). In Proceedings of the 32nd annual symposium on Foundations of computer science, Los Alamitos, CA, USA, pages 767-776, 1991. IEEE Computer Society Press. Keyword(s): machine learning, neural networks, theory of computing and complexity.

T. Asano, J. Hershberger, J. Pach, E.D. Sontag, D. Souivaine, and S. Suri. Separating bi-chromatic points by parallel lines. In Proceedings of the Second Canadian Conf. on Computational Geometry, Ottawa, Canada, 1990, pages 46-49, 1990. [PDF] Keyword(s): computational geometry. Abstract:

Given a 2-coloring of the vertices of a regular n-gon P, how many parallel lines are needed to separate the vertices into monochromatic subsets? We prove that floor(n/2) is a tight upper bound, and also provide an O(n log n) time algorithm to determine the direction that gives the minimum number of lines. If the polygon is a non-regular convex polygon, then n-3 lines may be necessary, while n-2 lines always suffice. This problem arises in machine learning and has implications about the representational capabilities of some neural networks.

H. Dewan and E.D. Sontag. Extrapolatory methods for speeding up the BP algorithm. In Proc. Int. Joint Conf. on Neural Networks, Washington, DC, Jan. 1990, Lawrence Erlbaum Associates, Inc., Publishers, ISBN 0-8058-0775-6, pages I.613-616, 1990. [PDF] Keyword(s): machine learning, neural networks. Abstract:

We describe a speedup technique that uses extrapolatory methods to predict the weights in a Neural Network using Back Propagation (BP) learning. The method is based on empirical observations of the way the weights change as a function of time. We use numerical function fitting techniques to determine the parameters of an extrapolation function and then use this function to project weights into the future. Significant computational savings result by using the extrapolated weights to jump over many iterations of the standard algorithm, achieving comparable performance with fewer iterations.

E.D. Sontag. Comparing sigmoids and heavisides. In Proc. Conf. Info. Sci. and Systems, Princeton, 1990, pages 654-659, 1990. Keyword(s): machine learning, neural networks, boolean systems.

E.D. Sontag. Remarks on interpolation and recognition using neural nets. In NIPS-3: Proceedings of the 1990 conference on Advances in neural information processing systems 3, San Francisco, CA, USA, pages 939-945, 1990. Morgan Kaufmann Publishers Inc.. Keyword(s): machine learning, neural networks.

E.D. Sontag and H.J. Sussmann. Remarks on local minima in backpropagation. In Proc. Conf. Info. Sciences and Systems, Johns Hopkins University Press, 1989, pages 432-435, 1989. Keyword(s): machine learning, neural networks.

Internal reports

E.D. Sontag. Sigmoids distinguish more efficiently than Heavisides. Technical report SYCON-89-12, Rutgers Center for Systems and Control, 1989. Keyword(s): machine learning, neural networks.

E.D. Sontag. Some remarks on the backpropagation algorithm for neural net learning. Technical report SYCON-88-02, Rutgers Center for Systems and Control, 1988. [PDF] Keyword(s): machine learning, neural networks. Abstract:

This is a very old informal report that discusses the study of local minima of quadratic loss functions for fitting errors in sigmoidal neural net learning. It also includes several remarks concerning the growth of weights during gradient descent. There is nothing very interesting here - far better knowledge is now available - but the report was placed here by request.

BACK TO INDEX

Disclaimer:

This material is presented to ensure timely dissemination of scholarly and technical work. Copyright and all rights therein are retained by authors or by other copyright holders.

Last modified: Tue Jul 15 23:04:59 2025
Author: sontag.

This document was translated from BibT_EX by bibtex2html