Key takeaways
A neural network is, at its core, a matrix multiplication machine. If you multiply numbers over and over through many layers, one of three things happens (illustrated in the sketch after this list):
- You can explode the results, such that the outputs grow so large they carry no usable information.
- You can shrink the results, such that the outputs become so small they are indistinguishable from one another, or
- You can keep the magnitudes steady through many multiplications.
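To make this concrete, here is a small NumPy sketch (the width, depth, and the two "bad" scales are my own illustrative choices) that pushes a random vector through a stack of ReLU layers at three weight scales and prints the resulting magnitude:

```python
import numpy as np

# A toy deep net: push one random vector through `depth` ReLU layers
# at three different weight scales and watch what happens to its size.
rng = np.random.default_rng(0)
n, depth = 512, 50
x = rng.standard_normal(n)

for scale, label in [(1.0,              "explodes"),
                     (0.01,             "vanishes"),
                     (np.sqrt(2.0 / n), "stays steady (He init)")]:
    h = x
    for _ in range(depth):
        W = rng.standard_normal((n, n)) * scale  # random weights at this scale
        h = np.maximum(W @ h, 0.0)               # matrix multiply + ReLU
    print(f"{label}: output std after {depth} layers = {h.std():.3e}")
```

Run it and the first scale blows up to astronomically large values, the second collapses toward zero, and only the third keeps the output on the same order of magnitude as the input.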
The key takeaway from “Delving Deep into Rectifiers” is that you initialize your model so it can survive many multiplications of random numbers until the true information starts to propagate to the weights.
The most impactful equation is (9).
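For reference, here is equation (9) as I recall it from the paper, where $n_l$ is the number of connections feeding layer $l$ and the $\tfrac{1}{2}$ comes from ReLU zeroing out half the activations:

```latex
\operatorname{Var}[y_L] = \operatorname{Var}[y_1]\left(\prod_{l=2}^{L} \frac{1}{2}\, n_l \operatorname{Var}[w_l]\right)
```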
They state: “This product [(9)] is the key to the initialization design. A proper initialization method should avoid reducing or magnifying the magnitudes of input signals exponentially.”
Thus, if you take a zero-mean normal random distribution and multiply your initialized values by sqrt(2/n_l), where n_l is the number of incoming connections (fan-in) of the layer, your initialized weights take on what I like to think of as the right heaviness. Since the weights are random, they are not encoding any useful information, but by being initialized correctly, they are set up to survive many multiplications until the algorithm has had a chance to tune those weights.
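A minimal sketch of that scaling in NumPy (the helper name `he_init` and the layer sizes here are my own; frameworks ship equivalents, e.g. PyTorch's `kaiming_normal_`):

```python
import numpy as np

def he_init(fan_in, fan_out, rng=None):
    """Zero-mean Gaussian weights with standard deviation sqrt(2 / fan_in)."""
    if rng is None:
        rng = np.random.default_rng()
    return rng.standard_normal((fan_out, fan_in)) * np.sqrt(2.0 / fan_in)

W = he_init(fan_in=512, fan_out=256)
print(W.std())  # close to sqrt(2/512) ≈ 0.0625
```

Note that the scale depends on the fan-in, not a global constant: wider layers sum more random terms per output, so each weight has to be proportionally lighter to keep the output variance steady.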