aPaperADay
1 Kaiming Comparison with "Xavier" Initialization


The "Xavier" initialization method was an earlier scheme shown to work well for deeper networks, and the "Kaiming" init method is contrasted with it here. Based on the authors' derivation, with ReLU activations the std of a "Xavier"-initialized signal is a factor of 1/sqrt(2^L) of the "Kaiming"-initialized one after L layers. At 22 layers this gap is still small enough for "Xavier" to converge, but at 30 layers "Xavier" stalls while "Kaiming" still converges. The authors observe a faster reduction of early error, but no clear superiority in final accuracy. Kaiming [correctly] comments that the increase of depth is not seeing improvement because "the method of increasing depth is not appropriate, or the recognition task is not enough complex." I believe time has shown it is the former. He wrote those words in 2015.
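The 1/sqrt(2^L) factor can be sketched numerically. A minimal example, assuming square layers (fan_in == fan_out == n) so the two formulas are directly comparable; the per-layer std ratio is 1/sqrt(2), which compounds to 1/sqrt(2^L) over L layers:

```python
import math

def xavier_std(fan_in, fan_out):
    # Glorot & Bengio (2010): Var(w) = 2 / (fan_in + fan_out)
    return math.sqrt(2.0 / (fan_in + fan_out))

def kaiming_std(fan_in):
    # He et al. (2015): Var(w) = 2 / fan_in, the extra 2 compensating
    # for ReLU zeroing out half of each layer's pre-activations
    return math.sqrt(2.0 / fan_in)

# With fan_in == fan_out == n, each layer's Xavier std is 1/sqrt(2)
# of the Kaiming std, so after L layers the forward-signal magnitudes
# differ by a factor of (1/sqrt(2))**L == 1/sqrt(2**L).
n = 256
per_layer_ratio = xavier_std(n, n) / kaiming_std(n)

for L in (22, 30):
    print(L, per_layer_ratio ** L)
# At L = 22 the attenuation is small enough that training still converges;
# at L = 30 the signal is roughly 3e-5 of its Kaiming counterpart and stalls.
```

This is only the static initialization-time picture, but it matches the paper's observation that the 22-layer model tolerates the mismatch while the 30-layer one does not.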