awesome-very-deep-learning is a curated list of papers and code about implementing and training very deep neural networks.
ODE Networks are a kind of continuous-depth neural network. Instead of specifying a discrete sequence of hidden layers, they parameterize the derivative of the hidden state using a neural network. The output of the network is computed using a black-box differential equation solver. These continuous-depth models have constant memory cost, adapt their evaluation strategy to each input, and can explicitly trade numerical precision for speed.
- Neural Ordinary Differential Equations (2018) [original code], introduces several ODENets such as continuous-depth residual networks and continuous-time latent variable models. The paper also constructs continuous normalizing flows, a generative model that can train by maximum likelihood, without partitioning or ordering the data dimensions. For training, the authors show how to scalably backpropagate through any ODE solver, without access to its internal operations. This allows end-to-end training of ODEs within larger models. NIPS 2018 best paper.
- Augmented Neural ODEs (2019), shows that neural ODEs preserve the topology of the input space, so their learned flows cannot intersect and some functions cannot be learned. Augmented NODEs improve upon this by adding extra dimensions to the state, in which simpler flows can be learned.
- Authors' Autograd Implementation
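As a rough illustration of the continuous-depth idea described above, the sketch below integrates a hidden state with a fixed-step Euler solver. The derivative function `f`, its single weight `w`, and the helper `odeint_euler` are hypothetical names chosen for this example. The paper itself relies on black-box (typically adaptive) solvers and on backpropagating through the solver, neither of which is shown here.

```python
import math


def f(h, t, w):
    """Hypothetical 'network' that returns dh/dt: a single tanh unit with weight w."""
    return math.tanh(w * h)


def odeint_euler(h0, w, t0=0.0, t1=1.0, steps=100):
    """Fixed-step Euler solver (an assumption; real ODE nets use adaptive solvers)."""
    h, t = h0, t0
    dt = (t1 - t0) / steps
    for _ in range(steps):
        h = h + dt * f(h, t, w)  # one small step along the learned derivative
        t += dt
    return h


coarse = odeint_euler(0.5, w=1.0, steps=10)    # faster, less precise
fine = odeint_euler(0.5, w=1.0, steps=1000)    # slower, more precise
```

With `steps=1` on a unit time interval the update reduces to `h + f(h)`, which has the same form as a residual block. Raising `steps` trades speed for numerical precision.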
Value Iteration Networks are very deep networks that have tied weights and perform approximate value iteration. They are used as an internal (model-based) planning module.
- Value Iteration Networks (2016) [original code], introduces VINs (Value Iteration Networks). The authors show that value iteration can be performed by repeatedly applying convolutions and channel-wise max-pooling. VINs generalize better in environments where a network needs to plan. NIPS 2016 best paper.
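The link between value iteration and "convolution plus channel-wise pooling" can be sketched on a small grid world. This is a simplified illustration, not the authors' implementation: the function name, the four fixed moves, and the hand-written reward grid are assumptions, whereas a VIN learns its reward and transition kernels.

```python
def value_iteration_step(value, reward, gamma=0.9):
    """One sweep: one Q 'channel' per action (a shifted view of V), then a max over channels."""
    rows, cols = len(value), len(value[0])
    moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right
    new_value = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            q_values = []
            for dr, dc in moves:
                nr, nc = r + dr, c + dc
                if not (0 <= nr < rows and 0 <= nc < cols):
                    nr, nc = r, c  # moving into a wall leaves the agent in place
                q_values.append(reward[nr][nc] + gamma * value[nr][nc])
            new_value[r][c] = max(q_values)  # pooling over the action channels
    return new_value


reward = [[0.0, 0.0, 0.0, 1.0]]   # hypothetical 1x4 corridor with the goal on the right
value = [[0.0, 0.0, 0.0, 0.0]]
first = value_iteration_step(value, reward)
for _ in range(20):               # iterating with tied "weights" plays the role of depth
    value = value_iteration_step(value, reward)
```

Each loop iteration reuses the same update, which mirrors the tied weights mentioned above. After enough iterations the values increase toward the goal cell.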
Densely Connected Convolutional Networks are very deep neural networks consisting of dense blocks. Within dense blocks, each layer receives the feature maps of all preceding layers. This leverages feature reuse and thus substantially reduces the model size (parameters).
- Densely Connected Convolutional Networks (2016) [original code], introduces DenseNets and shows that they outperform ResNets on CIFAR-10 and CIFAR-100 by a large margin (especially when not using data augmentation), while only requiring half the parameters. CVPR 2017 best paper.
- Authors' Caffe Implementation
- Authors' more memory-efficient Torch Implementation.
- TensorFlow Implementation by Yixuan Li.
- TensorFlow Implementation by Laurent Mazare.
- Lasagne Implementation by Jan Schlüter.
- Keras Implementation by tdeboissiere.
- Keras Implementation by Roberto de Moura Estevão Filho.
- Chainer Implementation by Toshinori Hanya.
- Chainer Implementation by Yasunori Kudo.
- PyTorch Implementation (including BC structures) by Andreas Veit.
- PyTorch Implementation.
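The dense connectivity pattern described above can be sketched with plain lists standing in for feature maps. The helpers `dense_block` and `make_layer` are hypothetical, and the "layer" is a stand-in that averages its inputs rather than a real BN-ReLU-Conv unit. Only the concatenation bookkeeping is meant to be representative.

```python
def make_layer(growth_rate):
    """Hypothetical stand-in for a conv layer that emits `growth_rate` new channels."""
    def layer(features):
        mean = sum(features) / len(features)  # consumes every preceding feature map
        return [mean] * growth_rate
    return layer


def dense_block(x, layers):
    """Each layer receives the concatenation of the input and all earlier outputs."""
    features = list(x)
    for layer in layers:
        new_features = layer(features)
        features = features + new_features  # concatenate rather than add
    return features


# 3 input channels, 4 layers, growth rate 2 -> 3 + 4 * 2 = 11 channels
out = dense_block([1.0, 2.0, 3.0], [make_layer(2) for _ in range(4)])
```

Because earlier features are passed along unchanged, each layer only has to add a small number of new channels, which is where the parameter savings come from.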
Deep Residual Networks are a family of extremely deep architectures (up to 1000 layers) showing compelling accuracy and nice convergence behaviors. Instead of learning a new representation at each layer, deep residual networks use identity mappings to learn residuals.
- The Reversible Residual Network: Backpropagation Without Storing Activations [code], constructs reversible residual layers (no need to store activations) and, surprisingly, finds that reversible layers do not noticeably impact final performance.
- Squeeze-and-Excitation Networks [original code], introduces the Squeeze-and-Excitation (SE) block, which adaptively recalibrates channel-wise feature responses. It achieved 1st place in ILSVRC 2017.
- Aggregated Residual Transformations for Deep Neural Networks (2016), introduces ResNeXt, which aggregates a set of transformations within a residual block. It achieved 2nd place in ILSVRC 2016.
- Residual Networks of Residual Networks: Multilevel Residual Networks (2016), adds multi-level hierarchical residual mappings and shows that this improves the accuracy of deep networks
- Wide Residual Networks (2016) [original code], studies wide residual neural networks and shows that making residual blocks wider outperforms deeper and thinner network architectures
- Swapout: Learning an ensemble of deep architectures (2016), improves accuracy by randomly applying dropout, skip-forward and residual units per layer
- Deep Networks with Stochastic Depth (2016) [original code], randomly drops entire residual layers during training as a regularizer (dropout applied to layers)
- Identity Mappings in Deep Residual Networks (2016) [original code], improves the originally proposed residual units by reordering batchnorm and activation layers
- Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning (2016), inception network with residual connections
- Deep Residual Learning for Image Recognition (2015) [original code], original paper introducing residual neural networks
- Torch by Facebook AI Research (FAIR), with training code in Torch and pre-trained ResNet-18/34/50/101 models for ImageNet: blog, code
- Torch, CIFAR-10, with ResNet-20 to ResNet-110, training code, and curves: code
- Lasagne, CIFAR-10, with ResNet-32 and ResNet-56 and training code: code
- Neon, CIFAR-10, with pre-trained ResNet-32 to ResNet-110 models, training code, and curves: code
- Neon, Preactivation layer implementation: code
- Torch, MNIST, 100 layers: blog, code
- A winning entry in Kaggle's right whale recognition challenge: blog, code
- Neon, Place2 (mini), 40 layers: blog, code
- TensorFlow with tflearn, with CIFAR-10 and MNIST: code
- TensorFlow with skflow, with MNIST: code
- Stochastic dropout in Keras: code
- ResNet in Chainer: code
- Stochastic dropout in Chainer: code
- Wide Residual Networks in Keras: code
- ResNet in TensorFlow 0.9+ with pretrained Caffe weights: code
- ResNet in PyTorch: code
- Ladder Network for Semi-Supervised Learning in Keras: code
In addition, this code by Ryan Dahl helps to convert the pre-trained models to TensorFlow.
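The core idea of residual learning, adding a learned residual to an identity mapping, can be sketched in a few lines. The two-layer linear-ReLU-linear branch and the helper names below are assumptions made for illustration. Real residual units use convolutions and batch normalization.

```python
def relu(v):
    return [max(0.0, a) for a in v]


def linear(weights, v):
    """Matrix-vector product with `weights` given as a list of rows."""
    return [sum(w * a for w, a in zip(row, v)) for row in weights]


def residual_block(x, w1, w2):
    """y = x + F(x), where F is a hypothetical linear-ReLU-linear branch."""
    residual = linear(w2, relu(linear(w1, x)))
    return [a + r for a, r in zip(x, residual)]


x = [1.0, -2.0]
zeros = [[0.0, 0.0], [0.0, 0.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]

unchanged = residual_block(x, zeros, zeros)       # zero branch: block is the identity
shifted = residual_block(x, identity, identity)   # branch adds relu(x) to x
```

With all branch weights at zero the block passes its input through untouched, so stacking more blocks cannot make the identity function harder to represent.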
Highway Networks take inspiration from Long Short-Term Memory (LSTM) and allow training of deep, efficient networks (with hundreds of layers) with conventional gradient-based methods.
- Recurrent Highway Networks (2016) [original code], introduces recurrent highway networks, which increase the depth of the step-to-step transition in recurrent networks
- Training Very Deep Networks (2015), introducing highway neural networks
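The LSTM-inspired gating of a highway layer can be sketched for a single scalar unit as `y = H(x) * T(x) + x * (1 - T(x))`, where `T` is the transform gate. The parameter names and the choice of `tanh` for `H` are assumptions made for this example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def highway_layer(x, w_h, b_h, w_t, b_t):
    """Scalar highway unit: the gate t mixes the transformed and the carried input."""
    h = math.tanh(w_h * x + b_h)   # candidate transformation H(x)
    t = sigmoid(w_t * x + b_t)     # transform gate T(x), between 0 and 1
    return h * t + x * (1.0 - t)   # the carry gate is 1 - T(x)


carried = highway_layer(0.3, w_h=1.0, b_h=0.0, w_t=0.0, b_t=-50.0)     # gate closed
transformed = highway_layer(0.3, w_h=1.0, b_h=0.0, w_t=0.0, b_t=50.0)  # gate open
```

A strongly negative gate bias makes the layer behave almost like the identity, which lets information and gradients flow through many stacked layers.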
Theories in very deep learning concentrate on the ideas that very deep networks with skip connections are able to efficiently approximate recurrent computations (similar to the recurrent connections in the visual cortex) or are actually exponential ensembles of shallow networks
- Identity Matters in Deep Learning considers identity parameterizations from a theoretical perspective and proves that arbitrarily deep linear residual networks have no spurious local optima
- The Shattered Gradients Problem: If resnets are the answer, then what is the question? argues that gradients of very deep networks resemble white noise (thus are harder to optimize). Resnets are more resistant to shattering (decaying sublinearly)
- Skip Connections as Effective Symmetry-Breaking hypothesizes that ResNets improve performance by breaking symmetries
- Highway and Residual Networks learn Unrolled Iterative Estimation, argues that instead of learning a new representation at each layer, the layers within a stage rather work as an iterative refinement of the same features.
- Demystifying ResNet, shows mathematically that 2-shortcuts in ResNets achieve the best results because they have non-degenerate, depth-invariant initial condition numbers (in contrast to 1- or 3-shortcuts), making it easy for the optimisation algorithm to escape from the initial point.
- Wider or Deeper? Revisiting the ResNet Model for Visual Recognition, extends results from Veit et al. and shows that a ResNet is actually a linear ensemble of subnetworks. Wide ResNets work well because current very deep networks are actually over-deepened (hence not trained end-to-end), due to their much shorter effective path length.
- Residual Networks are Exponential Ensembles of Relatively Shallow Networks, shows that ResNets behave just like ensembles of shallow networks at test time. This suggests that in addition to describing neural networks in terms of width and depth, there is a third dimension: multiplicity, the size of the implicit ensemble
- Bridging the Gaps Between Residual Learning, Recurrent Neural Networks and Visual Cortex, shows that ResNets with shared weights work well too although having fewer parameters
- A Simple Way to Initialize Recurrent Networks of Rectified Linear Units, a pre-ResNet Hinton paper suggesting that the identity matrix could be useful for the initialization of deep networks
- ResNet with one-neuron hidden layers is a Universal Approximator, shows that ResNets increase representational power for narrow deep networks: with skip connections, a network with one neuron per hidden layer can uniformly approximate any Lebesgue integrable function in d dimensions (in contrast to fully connected networks).
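The "exponential ensemble" view can be made concrete by counting paths. Assuming each of `n` residual blocks can either be traversed through its residual branch or bypassed through its skip connection, there are `2**n` distinct paths, and the number of paths that use exactly `k` branches is a binomial coefficient. The function name below is hypothetical.

```python
from math import comb


def path_length_counts(num_blocks):
    """Number of paths that pass through exactly k residual branches, for k = 0..n."""
    return [comb(num_blocks, k) for k in range(num_blocks + 1)]


counts = path_length_counts(3)   # [1, 3, 3, 1]
multiplicity = sum(counts)       # 8 == 2 ** 3, the size of the implicit ensemble
```

Most paths have a length near `n / 2`, which is one way to read the claim that the effective paths are relatively shallow.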