Related Work
This section is a big help to me, as it outlines the background I need in order to navigate their methodology.
Multi-Branch Convolutional Networks The Inception network is built around carefully designed multi-branch architectures. ResNets can be viewed as two-branch networks where one branch is the identity function. Finally, they note that decision forests are tree-shaped multi-branch networks with learned splitting functions.
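To make the two-branch view of ResNets concrete, here is a minimal sketch assuming PyTorch (which these notes don't otherwise use); `TwoBranchBlock` and its layer sizes are my own illustrative choices, not the paper's:

```python
import torch
import torch.nn as nn

class TwoBranchBlock(nn.Module):
    """A residual block seen as a 2-branch network: one branch is the
    identity function, the other is a small learned transform."""
    def __init__(self, channels):
        super().__init__()
        self.transform = nn.Sequential(  # learned branch
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        # Identity branch + learned branch, merged by addition.
        return torch.relu(x + self.transform(x))
```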
Grouped Convolutions These date back at least to AlexNet. The original motivation was to split a large model across multiple GPUs; this was an engineering workaround, probably not meant as a technique for increasing accuracy. A special case of grouped convolutions is channel-wise convolution, in which the number of groups equals the number of channels.
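A minimal sketch of how grouping works, again assuming PyTorch (whose `nn.Conv2d` exposes grouped convolution via a `groups` argument); the channel counts are arbitrary:

```python
import torch
import torch.nn as nn

channels = 64
x = torch.randn(1, channels, 32, 32)

# groups=1: ordinary convolution, every filter sees all 64 input channels.
standard = nn.Conv2d(channels, channels, kernel_size=3, padding=1, groups=1)

# groups=32: channels are split into 32 groups of 2, convolved independently.
grouped = nn.Conv2d(channels, channels, kernel_size=3, padding=1, groups=32)

# groups=channels: the channel-wise special case.
channelwise = nn.Conv2d(channels, channels, kernel_size=3, padding=1,
                        groups=channels)

for conv in (standard, grouped, channelwise):
    n_params = sum(p.numel() for p in conv.parameters())
    print(f"groups={conv.groups:3d}  params={n_params}")  # more groups -> fewer params
    assert conv(x).shape == x.shape  # output shape is unchanged
```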
Compressing convolutional networks Decomposition, which reduces the redundancy of deep network graphs, is a long-standing technique for compressing them. Though decomposition is typically used to compress a convolutional network, here the paper uses it to increase representational power.
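As a sketch of what decomposition means here (my own illustrative example in PyTorch; the paper doesn't prescribe this particular factorization), a 3x3 convolution can be replaced by a low-rank pair of 3x1 and 1x3 convolutions, which uses fewer parameters when the rank is small:

```python
import torch.nn as nn

def factorized_conv3x3(in_ch, out_ch, rank):
    """Low-rank spatial decomposition of a 3x3 convolution into a
    3x1 convolution followed by a 1x3 convolution."""
    return nn.Sequential(
        nn.Conv2d(in_ch, rank, kernel_size=(3, 1), padding=(1, 0)),
        nn.Conv2d(rank, out_ch, kernel_size=(1, 3), padding=(0, 1)),
    )

# Weight count: 3*in_ch*rank + 3*rank*out_ch versus 9*in_ch*out_ch for
# the full 3x3 conv; e.g. in_ch=out_ch=64, rank=16 gives
# 3*64*16 + 3*16*64 = 6144 weights instead of 36864.
```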
Ensembling is another common technique to improve accuracy: train independent models and aggregate their outputs. Some interpret ResNets as an ensemble of shallower networks, an interpretation that follows from ResNet's additive behavior. Though the additive aggregation of transforms makes this view partially true, they note that their model is trained jointly, so the "smaller networks" are not trained independently.
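For contrast, a minimal sketch of classical ensembling in PyTorch (`ensemble_predict` is a hypothetical helper, not from the paper): each member is trained on its own, and only the predictions are aggregated:

```python
import torch

def ensemble_predict(models, x):
    """Average the class probabilities of independently trained models.
    Unlike ResNeXt, where all branches are trained jointly end-to-end,
    each member here was optimized separately."""
    probs = [model(x).softmax(dim=1) for model in models]
    return torch.stack(probs).mean(dim=0)
```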