The transition layer of Transformer has very sparse activation

  • Question

  • I found that, unlike other parts of the architecture, the activations right after the ReLU in the Transformer's transition (position-wise feed-forward) layer are extremely sparse after about 10k training iterations: setting the smallest 90% or so of the activations to zero does not affect performance, or even training, at all. I suspect a similar conclusion holds for the self-attention part. By design, the transition layer is supposed to act as attention over the hidden dimension, in contrast to the self-attention layer, which is attention over timesteps. Attention over timesteps is easy to interpret, but attention over the hidden dimension currently is not for Transformers on text, unlike for 2D CNNs (e.g. activation atlases). However, because the transition layer's activations are so sparse, it may be easier to analyze which neuron is responsible for which kind of feature. Anyway, I hope this observation proves useful.
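    A minimal sketch of the ablation described above, for anyone who wants to reproduce it. It uses plain NumPy, and everything in it is an assumption made for illustration: the names `sparsify_smallest` and `transition_layer` are hypothetical, the layer is taken to be the usual two-matrix feed-forward block with a ReLU in between, and "smallest 90%" is interpreted per position along the hidden dimension rather than per batch or via a fixed threshold.

```python
import numpy as np

def sparsify_smallest(acts, drop_frac=0.9):
    """Zero the smallest `drop_frac` of activations along the last axis.

    Assumption: the cut is made independently for every position.
    """
    d = acts.shape[-1]
    k = d - int(round(drop_frac * d))  # number of units kept per position
    if k <= 0:
        return np.zeros_like(acts)
    # After partitioning, index d - k holds the k-th largest value per position.
    thresh = np.partition(acts, d - k, axis=-1)[..., d - k][..., None]
    return np.where(acts >= thresh, acts, 0.0)

def transition_layer(x, w1, b1, w2, b2, drop_frac=None):
    """Hypothetical feed-forward block: ReLU(x @ w1 + b1) @ w2 + b2.

    If `drop_frac` is given, the post-ReLU activations are sparsified
    before the second projection. Returns (output, hidden activations).
    """
    h = np.maximum(x @ w1 + b1, 0.0)
    if drop_frac is not None:
        h = sparsify_smallest(h, drop_frac)
    return h @ w2 + b2, h

# Usage with random (untrained) weights, only to show the mechanics.
rng = np.random.default_rng(0)
d_model, d_ff = 16, 64
x = rng.standard_normal((2, 5, d_model))            # (batch, time, d_model)
w1 = rng.standard_normal((d_model, d_ff)) * 0.1
b1 = np.zeros(d_ff)
w2 = rng.standard_normal((d_ff, d_model)) * 0.1
b2 = np.zeros(d_model)

dense_out, _ = transition_layer(x, w1, b1, w2, b2)
sparse_out, h = transition_layer(x, w1, b1, w2, b2, drop_frac=0.9)
rel_change = np.linalg.norm(dense_out - sparse_out) / np.linalg.norm(dense_out)
print("nonzero units per position:", np.count_nonzero(h, axis=-1).max())
print("relative output change:", rel_change)
```

    With random weights the relative change says nothing about the claim in the post. To test it, the same masking would have to be applied to the post-ReLU activations of a trained model, comparing the evaluation metric with and without the mask.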

    Tuesday, March 12, 2019 10:18 AM