Publications about 'learning' |
Articles in journal or book chapters |
Recent work on data-driven control and reinforcement learning has renewed interest in a relatively old field in control theory: model-free optimal control approaches which work directly with a cost function and do not rely upon perfect knowledge of a system model. Instead, an "oracle" returns an estimate of the cost associated to, for example, a proposed linear feedback law to solve a linear-quadratic regulator problem. This estimate, and an estimate of the gradient of the cost, might be obtained by performing experiments on the physical system being controlled. This motivates in turn the analysis of steepest descent algorithms and their associated gradient differential equations. This paper studies the effect of errors in the estimation of the gradient, framed in the language of input to state stability, where the input represents a perturbation from the true gradient. Since one needs to study systems evolving on proper open subsets of Euclidean space, a self-contained review of input to state stability definitions and theorems for systems that evolve on such sets is included. The results are then applied to the study of noisy gradient systems, as well as the associated steepest descent algorithms. |
This paper considers the following learning problem: given sample pairs of input and output signals generated by an unknown nonlinear system (which is not assumed to be causal or time-invariant), one wishes to find a continuous-time recurrent neural net, with activation function tanh, that approximately reproduces the underlying i/o behavior with high confidence. Leveraging earlier work concerned with matching derivatives up to a finite order of the input and output signals the problem is reformulated in familiar system-theoretic language and quantitative guarantees on the sup-norm risk of the learned model are derived, in terms of the number of neurons, the sample size, the number of derivatives being matched, and the regularity properties of the inputs, the outputs, and the unknown i/o map. |
It had previously been shown that generic cortical microcircuit models can perform complex real-time computations on continuous input streams, provided that these computations can be carried out with a rapidly fading memory. We investigate in this article the computational capability of such circuits in the more realistic case where not only readout neurons, but in addition a few neurons within the circuit have been trained for specific tasks. This is essentially equivalent to the case where the output of trained readout neurons is fed back into the circuit. We show that this new model overcomes the limitation of a rapidly fading memory. In fact, we prove that in the idealized case without noise it can carry out any conceivable digital or analog computation on time-varying inputs. But even with noise the resulting computational model can perform a large class of biologically relevant real-time computations that require a non-fading memory. |
This paper takes a computational learning theory approach to a problem of linear systems identification. It is assumed that input signals have only a finite number k of frequency components, and systems to be identified have dimension no greater than n. The main result establishes that the sample complexity needed for identification scales polynomially with n and logarithmically with k. |
We consider recurrent analog neural nets where the output of each gate is subject to Gaussian noise, or any other common noise distribution that is nonzero on a large set. We show that many regular languages cannot be recognized by networks of this type, and we give a precise characterization of those languages which can be recognized. This result implies severe constraints on possibilities for constructing recurrent analog neural nets that are robust against realistic types of analog noise. On the other hand we present a method for constructing feedforward analog neural nets that are robust with regard to analog noise of this type. |
This paper studies controllability properties of recurrent neural networks. The new contributions are: (1) an extension of the result in "Complete controllability of continuous-time recurrent neural networks" to a slightly different model, where inputs appear in an affine form, (2) a formulation and proof of a necessary and sufficient condition, in terms of local-local controllability, and (3) a complete analysis of the 2-dimensional case for which the hypotheses made in previous work do not apply. |
The Vapnik-Chervonenkis (VC) dimension is an integer which helps to characterize distribution-independent learning of binary concepts from positive and negative samples. This paper, based on lectures delivered at the Isaac Newton Institute in August of 1997, presents a brief introduction, establishes various elementary results, and discusses how to estimate the VC dimension in several examples of interest in neural network theory. (It does not address the learning and estimation-theoretic applications of VC dimension, and the applications to uniform convergence theorems for empirical probabilities, for which many suitable references are available.) |
This paper provides lower and upper bounds for the VC dimension of recurrent networks. Several types of activation functions are discussed, including threshold, polynomial, piecewise-polynomial and sigmoidal functions. The bounds depend on two independent parameters: the number w of weights in the network, and the length k of the input sequence. Ignoring multiplicative constants, the main results say roughly the following: 1. For architectures whose activation is any fixed nonlinear polynomial, the VC dimension is proportional to wk. 2. For architectures whose activation is any fixed piecewise polynomial, the VC dimension is between wk and w**2k. 3. For architectures with threshold activations, the VC dimension is between wlog(k/w) and the smallest of wklog(wk) and w**2+wlog(wk). 4. For the standard sigmoid tanh(x), the VC dimension is between wk and w**4 k**2. |
The following learning problem is considered, for continuous-time recurrent neural networks having sigmoidal activation functions. Given a ``black box'' representing an unknown system, measurements of output derivatives are collected, for a set of randomly generated inputs, and a network is used to approximate the observed behavior. It is shown that the number of inputs needed for reliable generalization (the sample complexity of the learning problem) is upper bounded by an expression that grows polynomially with the dimension of the network and logarithmically with the number of output derivatives being matched. |
This paper provides an exposition of some recent results regarding system-theoretic aspects of continuous-time recurrent (dynamic) neural networks with sigmoidal activation functions. The class of systems is introduced and discussed, and a result is cited regarding their universal approximation properties. Known characterizations of controllability, observability, and parameter identifiability are reviewed, as well as a result on minimality. Facts regarding the computational power of recurrent nets are also mentioned. |
This paper deals with sparse approximations by means of convex combinations of elements from a predetermined "basis" subset S of a function space. Specifically, the focus is on the rate at which the lowest achievable error can be reduced as larger subsets of S are allowed when constructing an approximant. The new results extend those given for Hilbert spaces by Jones and Barron, including in particular a computationally attractive incremental approximation scheme. Bounds are derived for broad classes of Banach spaces. The techniques used borrow from results regarding moduli of smoothness in functional analysis as well as from the theory of stochastic processes on function spaces. |
This paper shows that neural networks which use continuous activation functions have VC dimension at least as large as the square of the number of weights w. This result settles the open question of whether whether the well-known O(w log w) bound, known for hard-threshold nets, also held for more general sigmoidal nets. Implications for the number of samples needed for valid generalization are discussed. |
This paper suggests the use of Fourier-type activation functions in fully recurrent neural networks. The main theoretical advantage is that, in principle, the problem of recovering internal coefficients from input/output data is solvable in closed form. |
For classes of concepts defined by certain classes of analytic functions depending on k parameters, there are nonempty open sets of samples of length 2k+2 which cannot be shattered. A slighly weaker result is also proved for piecewise-analytic functions. The special case of neural networks is discussed. |
This paper presents a characterization of controllability for the class of control systems commonly called (continuous-time) recurrent neural networks. The characterization involves a simple condition on the input matrix, and is proved when the activation function is the hyperbolic tangent. |
Recurrent perceptron classifiers generalize the usual perceptron model. They correspond to linear transformations of input vectors obtained by means of "autoregressive moving-average schemes", or infinite impulse response filters, and allow taking into account those correlations and dependences among input coordinates which arise from linear digital filtering. This paper provides tight bounds on sample complexity associated to the fitting of such models to experimental data. The results are expressed in the context of the theory of probably approximately correct (PAC) learning. |
This paper deals with nonlinear least-squares problems involving the fitting to data of parameterized analytic functions. For generic regression data, a general result establishes the countability, and under stronger assumptions finiteness, of the set of functions giving rise to critical points of the quadratic loss function. In the special case of what are usually called "single-hidden layer neural networks", which are built upon the standard sigmoidal activation tanh(x) or equivalently 1/(1+exp(-x)), a rough upper bound for this cardinality is provided as well. |
Blum and Rivest showed that any possible neural net learning algorithm based on fixed architectures faces severe computational barriers. This paper extends their NP-completeness result, which applied only to nets based on hard threshold activations, to nets that employ a particular continuous activation. In view of neural network practice, this is a more relevant result to understanding the limitations of backpropagation and related techniques. |
This paper deals with finite size networks which consist of interconnections of synchronously evolving processors. Each processor updates its state by applying a "sigmoidal" function to a rational-coefficient linear combination of the previous states of all units. We prove that one may simulate all Turing Machines by such nets. In particular, one can simulate any multi-stack Turing Machine in real time, and there is a net made up of 886 processors which computes a universal partial-recursive function. Products (high order nets) are not required, contrary to what had been stated in the literature. Non-deterministic Turing Machines can be simulated by non-deterministic rational nets, also in real time. The simulation result has many consequences regarding the decidability, or more generally the complexity, of questions about recursive nets. |
We examine the power of constant depth circuits with sigmoid threshold gates for computing boolean functions. It is shown that, for depth 2, constant size circuits of this type are strictly more powerful than constant size boolean threshold circuits (i.e. circuits with linear threshold gates). On the other hand it turns out that, for any constant depth d, polynomial size sigmoid threshold circuits with polynomially bounded weights compute exactly the same boolean functions as the corresponding circuits with linear threshold gates. |
This paper concerns recurrent networks x'=s(Ax+Bu), y=Cx, where s is a sigmoid, in both discrete time and continuous time. Our main result is that observability can be characterized, if one assumes certain conditions on the nonlinearity and on the system, in a manner very analogous to that of the linear case. Recall that for the latter, observability is equivalent to the requirement that there not be any nontrivial A-invariant subspace included in the kernel of C. We show that the result generalizes in a natural manner, except that one now needs to restrict attention to certain special "coordinate" subspaces. |
In this short expository survey, we sketch various known facts about uniqueness of weights in neural networks, including results about recurrent nets, and we provide a new and elementary complex-variable proof of a uniqueness result that applies in the single hidden layer case. |
This paper has an expository introduction to two related topics: (a) Some mathematical results regarding "neural networks", and (b) so-called "neurocontrol" and "learning control" (each part can be read independently of the other). It was prepared for a short course given at the 1993 European Control Conference. |
This paper shows that the weights of continuous-time feedback neural networks x'=s(Ax+Bu), y=Cx (where s is a sigmoid) are uniquely identifiable from input/output measurements. Under very weak genericity assumptions, the following is true: Assume given two nets, whose neurons all have the same nonlinear activation function s; if the two nets have equal behaviors as "black boxes" then necessarily they must have the same number of neurons and -except at most for sign reversals at each node- the same weights. Moreover, even if the activations are not a priori known to coincide, they are shown to be also essentially determined from the external measurements. |
This paper compares the representational capabilities of one hidden layer and two hidden layer nets consisting of feedforward interconnections of linear threshold units. It is remarked that for certain problems two hidden layers are required, contrary to what might be in principle expected from the known approximation theorems. The differences are not based on numerical accuracy or number of units needed, nor on capabilities for feature extraction, but rather on a much more basic classification into "direct" and "inverse" problems. The former correspond to the approximation of continuous functions, while the latter are concerned with approximating one-sided inverses of continuous functions - and are often encountered in the context of inverse kinematics determination or in control questions. A general result is given showing that nonlinear control systems can be stabilized using two hidden layers, but not in general using just one. |
This paper deals with single-hidden-layer feedforward nets, studying various aspects of classification power and interpolation capability. In particular, a worst-case analysis shows that direct input to output connections in threshold nets double the recognition but not the interpolation power, while using sigmoids rather than thresholds allows doubling both. For other measures of classification, including the Vapnik-Chervonenkis dimension, the effect of direct connections or sigmoidal activations is studied in the special case of two-dimensional inputs. |
This paper surveys recent work by the author on learning and representational capabilities of feedforward nets. The learning results show that, among two possible variants of the so-called backpropagation training method for sigmoidal nets, both of which variants are used in practice, one is a better generalization of the older perceptron training algorithm than the other. The representation results show that nets consisting of sigmoidal neurons have at least twice the representational capabilities of nets that use classical threshold neurons, at least when this increase is quantified in terms of classification power. On the other hand, threshold nets are shown to be more useful when approximating implicit functions, as illustrated with an application to a typical control problem. |
This paper shows the existence of a finite neural network, made up of sigmoidal neurons, which simulates a universal Turing machine. It is composed of less than 100,000 synchronously evolving processors, interconnected linearly. High-order connections are not required. (Note: this paper was placed here by special request. The results in this paper have been by now improved considerably: see the JCSS pape which among other aspects provides a polynomial time simulation. This paper, based on a unary encoding, results in an exponential slowdown). |
Feedforward nets with sigmoidal activation functions are often designed by minimizing a cost criterion. It has been pointed out before that this technique may be outperformed by the classical perceptron learning rule, at least on some problems. In this paper, we show that no such pathologies can arise if the error criterion is of a threshold LMS type, i.e., is zero for values ``beyond'' the desired target values. More precisely, we show that if the data are linearly separable, and one considers nets with no hidden neurons, then an error function as above cannot have any local minima that are not global. In addition, the proof gives the following stronger result, under the stated hypotheses: the continuous gradient adjustment procedure is such that from any initial weight configuration a separating set of weights is obtained in finite time. This is a precise analogue of the Perceptron Learning Theorem. The results are then compared with the more classical pattern recognition problem of threshold LMS with linear activations, where no spurious local minima exist even for nonseparable data: here it is shown that even if using the threshold criterion, such bad local minima may occur, if the data are not separable and sigmoids are used. keywords = { neural networks , feedforward neural nets }, |
Every dichotomy on a 2k-point set in Rn can be implemented by a neural net with a single hidden layer containing k sigmoidal neurons. If the neurons were of a hardlimiter (Heaviside) type, 2k-1 would be in general needed. |
We give an example of a neural net without hidden layers and with a sigmoid transfer function, together with a training set of binary vectors, for which the sum of the squared errors, regarded as a function of the weights, has a local minimum which is not a global minimum. The example consists of a set of 125 training instances, with four weights and a threshold to be learnt. We do not know if substantially smaller binary examples exist. |
Conference articles |
Recent research in neural networks and machine learning suggests that using many more parameters than strictly required by the initial complexity of a regression problem can result in more accurate or faster-converging models -- contrary to classical statistical belief. This phenomenon, sometimes known as ``benign overfitting'', raises questions regarding in what other ways might overparameterization affect the properties of a learning problem. In this work, we investigate the effects of overfitting on the robustness of gradient-descent training when subject to uncertainty on the gradient estimation. This uncertainty arises naturally if the gradient is estimated from noisy data or directly measured. Our object of study is a linear neural network with a single, arbitrarily wide, hidden layer and an arbitrary number of inputs and outputs. In this paper we solve the problem for the case where the input and output of our neural-network are one-dimensional, deriving sufficient conditions for robustness of our system based on necessary and sufficient conditions for convergence in the undisturbed case. We then show that the general overparametrized formulation introduces a set of spurious equilibria which lay outside the set where the loss function is minimized, and discuss directions of future work that might extend our current results for more general formulations. |
Systems theory can play an important in unveiling fundamental limitations of learning algorithms and architectures when used to control a dynamical system, and in suggesting strategies for overcoming these limitations. As an example, a feedforward neural network cannot stabilize a double integrator using output feedback. Similarly, a recurrent NN with differentiable activation functions that stabilizes a non-strongly stabilizable system must be itself open loop unstable, a fact that has profound implications for training with noisy, finite data. A potential solution to this problem, motivated by results on stabilization with periodic control, is the use of neural nets with periodic resets, showing that indeed systems theoretic analysis is instrumental in developing architectures capable of controlling certain classes of unstable systems. This short conference paper also argues that when the goal is to learn control oriented models, the loss function should reflect closed loop, rather than open loop model performance, a fact that can be accomplished by using gap-metric motivated loss functions. |
This paper concerns recurrent networks x'=s(Ax+Bu), y=Cx, where s is a sigmoid, in both discrete time and continuous time. The paper establishes parameter identifiability under stronger assumptions on the activation than in "For neural networks, function determines form", but on the other hand deals with arbitrary (nonzero) initial states. |
Recent work by H.T. Siegelmann and E.D. Sontag (1992) has demonstrated that polynomial time on linear saturated recurrent neural networks equals polynomial time on standard computational models: Turing machines if the weights of the net are rationals, and nonuniform circuits if the weights are real. Here, further connections between the languages recognized by such neural nets and other complexity classes are developed. Connections to space-bounded classes, simulation of parallel computational models such as Vector Machines, and a discussion of the characterizations of various nonuniform classes in terms of Kolmogorov complexity are presented. |
This paper deals with analog circuits. It establishes the finiteness of VC dimension, teaching dimension, and several other measures of sample complexity which arise in learning theory. It also shows that the equivalence of behaviors, and the loading problem, are effectively decidable, modulo a widely believed conjecture in number theory. The results, the first ones that are independent of weight size, apply when the gate function is the "standard sigmoid" commonly used in neural networks research. The proofs rely on very recent developments in the elementary theory of real numbers with exponentiation. (Some weaker conclusions are also given for more general analytic gate functions.) Applications to learnability of sparse polynomials are also mentioned. |
A conference paper. Placed here because it was requested, but contains little that is not also contained in the survey on neural nets mentioned above. |
Given a 2-coloring of the vertices of a regular n-gon P, how many parallel lines are needed to separate the vertices into monochromatic subsets? We prove that floor(n/2) is a tight upper bound, and also provide an O(n log n) time algorithm to determine the direction that gives the minimum number of lines. If the polygon is a non-regular convex polygon, then n-3 lines may be necessary, while n-2 lines always suffice. This problem arises in machine learning and has implications about the representational capabilities of some neural networks. |
We describe a speedup technique that uses extrapolatory methods to predict the weights in a Neural Network using Back Propagation (BP) learning. The method is based on empirical observations of the way the weights change as a function of time. We use numerical function fitting techniques to determine the parameters of an extrapolation function and then use this function to project weights into the future. Significant computational savings result by using the extrapolated weights to jump over many iterations of the standard algorithm, achieving comparable performance with fewer iterations. |
Internal reports |
This is a very old informal report that discusses the study of local minima of quadratic loss functions for fitting errors in sigmoidal neural net learning. It also includes several remarks concerning the growth of weights during gradient descent. There is nothing very interesting here - far better knowledge is now available - but the report was placed here by request. |
This material is presented to ensure timely dissemination of scholarly and technical work. Copyright and all rights therein are retained by authors or by other copyright holders.
This document was translated from BibT_{E}X by bibtex2html