The Paper’s Outline
This is a tightly focused paper: it tackles a concrete problem and offers a clean solution. The outline is simple. First, describe the problem, which is the softmax bottleneck. Then review the solution, which has three parts. Finally, look at the experimental results and conclude.
The Softmax Bottleneck: 30,000 ft View
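The problem with softmax is that, with a large output space (say a vocabulary of 10k-30k words), a single softmax layer cannot represent the full range of context-dependent distributions. The logits are dot products between a hidden state of dimension d and the word embeddings, so the matrix of log-probabilities over all contexts has rank at most d + 1, which is typically far below the vocabulary size. This was described in an earlier paper, and the authors of this paper start from the prominent technique for addressing the bottleneck: Mixture of Softmaxes (MoS).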
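The rank limit can be checked numerically. The following is a minimal sketch, not the paper's code: it assumes NumPy, random hidden states and embeddings, and toy sizes (`N` contexts, `V` words, hidden size `d`, `K` mixture components) chosen only for illustration. It compares the rank of the log-probability matrix from a single softmax with the rank obtained when `K` softmaxes are mixed in probability space, which is the MoS idea.

```python
import numpy as np

def log_softmax(logits):
    """Row-wise log-softmax, shifted by the row max for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def softmax_log_probs(H, W):
    """Log-probabilities of a single softmax layer: contexts H (N, d), embeddings W (V, d)."""
    return log_softmax(H @ W.T)

def mos_log_probs(Hs, W, prior):
    """Log-probabilities of a mixture of K softmaxes.

    Hs is (K, N, d): one hidden state per component and context.
    prior is (N, K): mixture weights per context, each row summing to 1.
    """
    probs = np.zeros((Hs.shape[1], W.shape[0]))
    for k in range(Hs.shape[0]):
        probs += prior[:, k:k + 1] * np.exp(log_softmax(Hs[k] @ W.T))
    return np.log(probs)

# Toy sizes, assumed for illustration only.
N, V, d, K = 200, 100, 8, 3
rng = np.random.default_rng(0)

W = rng.normal(size=(V, d))
H = rng.normal(size=(N, d))
Hs = rng.normal(size=(K, N, d))

# Random mixture weights, normalized per context.
raw = np.exp(rng.normal(size=(N, K)))
prior = raw / raw.sum(axis=1, keepdims=True)

# An explicit tolerance keeps floating-point noise from inflating the rank.
rank_softmax = np.linalg.matrix_rank(softmax_log_probs(H, W), tol=1e-8)
rank_mos = np.linalg.matrix_rank(mos_log_probs(Hs, W, prior), tol=1e-8)

print("single softmax rank:", rank_softmax, "(bound: d + 1 =", d + 1, ")")
print("mixture of softmaxes rank:", rank_mos)
```

The single-softmax matrix is the rank-d product `H @ W.T` minus a per-row normalizer, which adds at most one to the rank. Taking the log of a weighted sum of softmaxes is not linear in the logits, so the mixture is not held to that bound.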
Three Novel Additions
The paper presents three key ideas for breaking the bottleneck efficiently:
- Logit Space Vector Gating
- Sigmoid Tree Decomposition
- Gate Sharing
Conclusions
From the outset, the authors claim a speedup over MoS of between 1.6x and 11.5x, with results comparable to or better than MoS on four benchmarks. Both MoS and Mixtape outperform softmax on the benchmarks, at a speed and memory penalty.