Abstract: Residual networks (ResNets) have achieved impressive results in pattern recognition and, recently, have garnered considerable theoretical interest due to a perceived link with neural ordinary differential equations (neural ODEs). This link relies on the convergence of the network weights to a smooth function as the number of layers increases. Through detailed numerical experiments, we investigate the properties of weights trained by stochastic gradient descent and how they scale with network depth. We observe scaling regimes markedly different from those assumed in the neural ODE literature. Depending on certain features of the network architecture, such as the smoothness of the activation function, we prove that the deep-network limit is either an alternative ODE, a stochastic differential equation, or neither of these. For each case, we also derive the limit of the backpropagation dynamics and address its adaptiveness. These findings cast doubt on the validity of the neural ODE model as an adequate asymptotic description of deep ResNets and point to an alternative class of differential equations as a better description of the deep network limit.
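To make the link concrete (the notation below is ours and reflects one common convention, not necessarily the exact parametrization used in the papers), an $L$-layer ResNet updates its hidden state as
\[
h_{k+1} = h_k + \delta_L\, f(h_k, \theta_k), \qquad k = 0, \dots, L-1,
\]
and the neural ODE interpretation assumes that, for a suitable scaling $\delta_L \to 0$ and weights $\theta_k \approx \theta(k/L)$ sampled from a smooth function $\theta$, the hidden states converge to the solution of
\[
\frac{dH_t}{dt} = f\big(H_t, \theta(t)\big), \qquad t \in [0,1].
\]
When the trained weights do not converge to such a smooth function, the limit, if it exists at all, can instead be of diffusion type, schematically $dH_t = b(H_t)\,dt + \sigma(H_t)\,dW_t$; this is the stochastic differential equation regime referred to above.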
When gradient descent is applied to the training of ResNets, we prove that it converges linearly to a global minimum, provided the network is sufficiently deep and the initialization is sufficiently small. Moreover, the global minimum found by gradient descent has finite quadratic variation, even though no regularization is used during training. This confirms empirical observations that gradient descent enjoys an implicit regularization property and generalizes well to unseen data.
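For concreteness (again in our own notation), the quadratic variation of the trained layer weights $(\theta_k)_{k=0}^{L-1}$ along the depth direction is
\[
\sum_{k=0}^{L-2} \left\| \theta_{k+1} - \theta_k \right\|^2,
\]
and the statement above is that this quantity remains bounded as the depth $L$ grows, even though no term in the training objective penalizes it; this is the sense in which gradient descent acts as an implicit regularizer.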
This is based on a series of papers with Rama Cont (Oxford), Alain Rossier (Oxford), and Alain-Sam Cohen (InstaDeep).