2 Mishkin Initialization
The problem with a bad initialization in deep networks is the activation and/or gradient magnitude in the final layers (it explodes or vanishes), as noted in the Kaiming He paper. Mishkin points out that the scaling a network applies to its input is k^L, where k is the per-layer scaling factor and L is the number of layers: k > 1 leads to extremely large output values, while k < 1 leads to a diminishing signal and gradient.
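A quick sanity check of that k^L claim (my own sketch with NumPy, not code from the paper; width and depth are arbitrary choices):

```python
# A stack of linear layers whose weights each scale the signal by a factor k
# ends up scaling the input magnitude by roughly k^L.
import numpy as np

rng = np.random.default_rng(0)
n, L = 256, 50                       # layer width and depth (arbitrary)

def output_scale(k):
    x = rng.standard_normal(n)
    for _ in range(L):
        # Gaussian weights with std k/sqrt(n) give each layer a gain of ~k
        W = rng.standard_normal((n, n)) * (k / np.sqrt(n))
        x = W @ x
    return np.std(x)

print(output_scale(1.1))   # k > 1: magnitude blows up (~1.1^50 ≈ 117x)
print(output_scale(0.9))   # k < 1: magnitude vanishes (~0.9^50 ≈ 0.005x)
print(output_scale(1.0))   # k = 1: magnitude stays roughly constant
```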
There are optimal per-layer scaling factors, such as sqrt(2) specifically for ReLU (the Kaiming/He initialization).
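And a quick check of the sqrt(2) claim (again my own sketch, assuming NumPy): with ReLU layers, weights drawn with std sqrt(2/fan_in) keep the activation magnitude roughly constant with depth, while a gain of 1 lets it shrink by about a factor of sqrt(2) per layer.

```python
import numpy as np

rng = np.random.default_rng(0)
n, L = 256, 50                        # layer width and depth (arbitrary)

def final_rms(gain):
    x = rng.standard_normal(n)
    for _ in range(L):
        # Gaussian weights with std gain/sqrt(n); gain = sqrt(2) is the He/Kaiming choice
        W = rng.standard_normal((n, n)) * (gain / np.sqrt(n))
        x = np.maximum(W @ x, 0.0)    # ReLU
    return np.sqrt(np.mean(x ** 2))   # RMS activation magnitude

print(final_rms(np.sqrt(2)))  # stays around 1
print(final_rms(1.0))         # shrinks ~sqrt(2) per layer -> ~3e-8 after 50 layers
```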
Previously proposed methods:
- Sussillo & Abbott (2014) random walk initialization, which keeps the log of the norms of the backpropagated errors constant.
- Hinton (2014) knowledge distillation and Romero (2015) Hints initialization (FitNets)
- Srivastava (2015) gating scheme (like LSTM) to control information and gradient flow.
- Saxe (2014) orthonormal matrix initialization.
- Bengio (2007) layer-wise pre-training
One subject I’m unfamiliar with is the “orthonormal matrix”: I don’t have time (actually brainpower) to get to it today. I’ll have to look at it on Monday.
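In the meantime, a minimal sketch of the idea (my own illustration with NumPy, not code from the Saxe paper): an orthonormal weight matrix W satisfies W^T W = I, so it leaves the norm of the signal unchanged, and a common way to draw one is a QR decomposition of a Gaussian random matrix.

```python
import numpy as np

rng = np.random.default_rng(0)

def orthonormal_init(n):
    W = rng.standard_normal((n, n))
    Q, R = np.linalg.qr(W)
    # Flip columns by the signs of R's diagonal so the result is uniformly
    # distributed over orthogonal matrices (a standard correction, not specific to Saxe).
    return Q * np.sign(np.diag(R))

W = orthonormal_init(256)
x = rng.standard_normal(256)
print(np.allclose(W.T @ W, np.eye(256)))         # True: columns are orthonormal
print(np.linalg.norm(x), np.linalg.norm(W @ x))  # norms match: no blow-up or decay
```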