2 Mishkin Introduction
ALL YOU NEED IS A GOOD INIT
This 2016 paper has a great introduction section in prose describing the history of initialization methods. His reference point for the state of the art defines "deep" as anything greater than 16 CNN layers.
A great sentence: "One of the main obstacles preventing the wide adoption of very deep nets is the absence of a general, repeatable and efficient procedure for their end-to-end training."
He references the Kaiming He paper as a way to get end-to-end training from a single initialization pass. He also mentions the batch normalization paper, which I haven't yet read, but which apparently normalizes activations to zero mean and unit variance.
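Not from the paper itself, but as a reminder of what the Kaiming reference is getting at, here is a minimal NumPy sketch (layer width, depth, and batch size are arbitrary choices of mine): with He-style scaling, the pre-activation statistics stay stable through a deep ReLU stack, which is why a single init pass can be enough.

```python
import numpy as np

rng = np.random.default_rng(0)

def kaiming_normal(fan_in, fan_out, rng):
    # He et al. (2015) scaling for ReLU nets: std = sqrt(2 / fan_in)
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))

# Feed a unit-variance Gaussian batch through a 16-layer ReLU stack.
# With Kaiming scaling the pre-activation mean stays near 0 and the
# variance stays roughly constant with depth -- it neither explodes
# nor vanishes as it would under naive small-random init.
x = rng.normal(size=(1024, 256))
for layer in range(16):
    W = kaiming_normal(256, 256, rng)
    pre = x @ W                       # pre-activations
    print(f"layer {layer:2d}: mean={pre.mean():+.3f}  var={pre.var():.3f}")
    x = np.maximum(pre, 0.0)          # ReLU
```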
The introduction makes the case for a single initialization stage that does not require any additional computation or complexity.