aPaperADay
1 Kaiming Implementation Details

Implementation Details

The architecture used by Kaiming et al. is similar to VGG-19, with a few minor differences, which they say were motivated by faster running speed. The table in the paper lists the full architecture. A key takeaway is that the proposed initialization method lets them train the deeper models from scratch, so scale jittering can be applied from the beginning of training rather than only during fine-tuning. Their input crop size was 224x224.
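
A minimal sketch (not the authors' code) of applying that initialization, available in PyTorch as kaiming_normal_, to a small VGG-style conv stack; the channel counts here are illustrative and plain ReLU stands in for the paper's PReLU:

```python
# Rough sketch: the paper's fan-in initialization applied to a VGG-style stack.
# Channel counts are illustrative only; ReLU stands in for PReLU.
import torch
import torch.nn as nn

def make_vgg_block(in_ch, out_ch, num_convs):
    layers = []
    for i in range(num_convs):
        layers.append(nn.Conv2d(in_ch if i == 0 else out_ch, out_ch,
                                kernel_size=3, padding=1))
        layers.append(nn.ReLU(inplace=True))
    layers.append(nn.MaxPool2d(2))
    return nn.Sequential(*layers)

model = nn.Sequential(
    make_vgg_block(3, 64, 2),
    make_vgg_block(64, 128, 2),
)

for m in model.modules():
    if isinstance(m, nn.Conv2d):
        # Variance scaled by fan-in for the ReLU nonlinearity, as proposed in the paper.
        nn.init.kaiming_normal_(m.weight, mode='fan_in', nonlinearity='relu')
        nn.init.zeros_(m.bias)

x = torch.randn(1, 3, 224, 224)   # 224x224 input crop, as in the paper
print(model(x).shape)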
Data augmentation used (a rough sketch of an equivalent torchvision pipeline follows the list):

  • random scale jittering
  • random horizontal flipping
  • random color altering

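A rough torchvision approximation of that pipeline; the [256, 512] jitter range and the use of ColorJitter in place of the exact color altering described in the paper are assumptions on my part:

```python
# Rough approximation of the augmentation described above (not the authors' code).
import random
from torchvision import transforms
import torchvision.transforms.functional as TF

def random_scale_jitter(img, lo=256, hi=512):
    # Resize so the shorter image side becomes a random length in [lo, hi] (range assumed).
    return TF.resize(img, random.randint(lo, hi))

train_transform = transforms.Compose([
    transforms.Lambda(random_scale_jitter),  # random scale jittering
    transforms.RandomCrop(224),              # 224x224 training crop
    transforms.RandomHorizontalFlip(),       # random horizontal flip
    transforms.ColorJitter(0.4, 0.4, 0.4),   # stand-in for random color altering
    transforms.ToTensor(),
])
```
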
At test time they adopt the multi-view testing on feature maps strategy from the SPP-net paper: from my limited understanding, the convolutional feature map is computed once on the full image, and multiple views are then pooled from it, in the spirit of spatial pyramid pooling.
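
From that limited understanding, a heavily simplified sketch of the idea: run the conv layers once on the full image, pool several windows of the resulting feature map to a fixed size, and average the class scores. The tiny backbone, the classifier head, and the five corner-plus-center views below are placeholders, not the exact procedure from either paper:

```python
# Simplified sketch of multi-view testing on feature maps (SPP-net style):
# one conv pass over the whole image, fixed-size pooling of several windows
# of the feature map, and averaging of the resulting predictions.
import torch
import torch.nn as nn

backbone = nn.Sequential(            # stand-in for the conv layers
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(inplace=True), nn.MaxPool2d(4),
)
head = nn.Sequential(                # stand-in for the fully connected layers
    nn.Flatten(), nn.Linear(32 * 7 * 7, 1000),
)

def multi_view_predict(image):
    fmap = backbone(image)                       # feature map of the full image
    _, _, h, w = fmap.shape
    win = min(h, w) * 3 // 4                     # window size on the feature map
    views = [(0, 0), (0, w - win), (h - win, 0), (h - win, w - win),
             ((h - win) // 2, (w - win) // 2)]   # 4 corners + center
    logits = []
    for top, left in views:
        window = fmap[:, :, top:top + win, left:left + win]
        pooled = nn.functional.adaptive_max_pool2d(window, 7)  # fixed-size pooling
        logits.append(head(pooled))
    return torch.stack(logits).mean(dim=0)       # average over views

scores = multi_view_predict(torch.randn(1, 3, 384, 384))
print(scores.shape)  # torch.Size([1, 1000])
```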