How to understand masked multi-head attention in the Transformer

I had the very same question after reading the Transformer paper. I found no complete, detailed answer to it on the Internet, so I'll try to explain my understanding of Masked Multi-Head Attention. The short answer is: we need masking to make training parallel, and parallelization is good because it makes training much faster than processing the target sequence one token at a time.
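
To make this concrete, here is a minimal sketch (my own illustrative NumPy code, not from the paper) of single-head scaled dot-product attention with the causal "look-ahead" mask used in the decoder. Names like `masked_attention` are just for illustration. The key point: all positions are computed in one matrix multiplication, yet the mask guarantees each position only attends to itself and earlier positions.

```python
import numpy as np

def masked_attention(Q, K, V):
    """Scaled dot-product attention with a causal (look-ahead) mask.

    Q, K, V: arrays of shape (seq_len, d_k) for a single head.
    """
    seq_len, d_k = Q.shape
    # Raw attention scores: how strongly each position attends to every other one.
    scores = Q @ K.T / np.sqrt(d_k)                     # (seq_len, seq_len)
    # Causal mask: position i may only attend to positions j <= i.
    # Future positions are set to -inf so they become 0 after the softmax.
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    # Row-wise softmax over the masked scores.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                                  # (seq_len, d_k)

# Tiny usage example: 4 tokens, 8-dimensional queries/keys/values.
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
out = masked_attention(Q, K, V)
print(out.shape)  # (4, 8): every position computed in one pass, each "seeing" only the past
```

Because of the mask, the whole target sequence can be fed to the decoder at once during training instead of being generated token by token, which is exactly what makes the parallelization possible.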