The past months have felt like torrential rainfall that culminates in hail. It was a tough semester that kept me from writing down summaries of papers, although I have still been doing a fair amount of reading, just without translating it to the blog. The months have passed quickly, and not because I have been slacking on my growth in machine learning; it is actually the opposite. After a nice week-long break of surfing, I'm happy to start afresh tackling new and seminal papers.
One thing I’ve found in the meantime is Yannic Kilcher’s YouTube channel, which walks through ML papers and, more importantly, gives his interpretation and explanation of them. His take on things has really motivated me to keep doing these I&I weeks, as they may be the most beneficial pieces I write, and certainly the only original work.
This week I want to reflect on a few ideas that have begun to solidify in my thinking, and map out some further direction.
Attention
I still have only an academic view of attention and a media-driven sense of the power of Transformer architectures, but one idea I had not previously understood about Transformers is that they are powerful largely because they lack so many assumptions. It is not obvious on a first read of ‘Attention Is All You Need’ that attention is such a fundamentally basic computational structure, but I’m becoming convinced that the power lies in its generality, not the novelty of its design. When I first read the paper, it seemed like a new idea, novel in building a system exclusively from attention, which makes it seem like the power is in the newness of the architecture. What I now see is that the attention mechanism leaves behind many of the assumptions (like a CNN’s localized, shared weights) that previous architectures rely on to cope with limited data and compute. Given infinite data and enough computation with a good optimizer, an MLP would be the highest performer, precisely because it lacks the assumptions we use today to compensate for scarce data and limited compute.
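To make this concrete for myself, here is a minimal sketch (in NumPy, with made-up names and toy dimensions, not the paper’s code) of scaled dot-product self-attention. The point it illustrates is the one above: every output position is just a data-dependent weighted sum over all positions, with no locality or fixed weight pattern baked in the way a convolution’s local window is.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a sequence X of shape (n, d_model).

    Each output row is a weighted sum over *all* n input rows; the weights are
    computed from the data itself, so no neighbourhood structure is assumed
    (unlike a convolution's fixed local window).
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv          # (n, d_k), (n, d_k), (n, d_v)
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # (n, n): every position vs. every position
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ V                        # (n, d_v)

# toy usage with arbitrary sizes
rng = np.random.default_rng(0)
n, d_model, d_k, d_v = 5, 8, 4, 4
X = rng.normal(size=(n, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d)) for d in (d_k, d_k, d_v))
print(self_attention(X, Wq, Wk, Wv).shape)    # (5, 4)
```

That really is the whole core computation: which positions influence which is learned from the data rather than hard-coded, which is exactly the “lack of assumptions” I mean.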
I hope to come to a more nuanced view of this and of architectures in general, and also to get a clearer view of where the payoffs are coming from. Even the BERT paper cites the ability to leverage the compute structures we already have as one reason BERT performs so well.
Hopefully, I can come up with a paradigm that takes architecture, data, and compute capabilities together, so I can understand what performance to expect and predict what the future holds.
Capsule Networks
I really want to improve my understanding and even code up some simple CapsNets. I want to explore what literature is out there, and I want to dig into something while there is still a bit of skepticism around it, so that I invest in areas that may have room for me to contribute. My hope is to spend the next few papers on capsule networks and see what it’s like.