Linhao Dong, Shuang Xu, Bo Xu, Speech-Transformer: A No-Recurrence Sequence-to-Sequence Model for Speech Recognition, ICASSP 2018
Attention Penalty (from Speech Transformer paper):
"In addition, we encouraged the model attending to closer positions by adding a bigger penalty on the attention weights of more distant position-pairs."
The paper gives no more specific description of the attention penalty.
My interpretation is to add negative values to the off-diagonal elements of scaled_attention_logits, in every multi-head attention except the first (masked self-)attention in the decoder layers.
I am not certain this matches the attention penalty the authors actually used.
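Below is a minimal NumPy sketch of this interpretation: a bias that grows more negative with the distance |i - j| is added to the pre-softmax logits, so attention weights are pushed toward the diagonal. The linear growth and the hyper-parameter penalty_scale are my assumptions; the paper does not specify them.

```python
import numpy as np

def attention_penalty_bias(q_len, k_len, penalty_scale=1.0):
    """Bias matrix penalizing distant position pairs.

    The penalty grows linearly with |i - j| (an assumption, not from the
    paper), so adding it to scaled_attention_logits concentrates the
    softmax weights near the diagonal.
    """
    pos_q = np.arange(q_len)[:, None]
    pos_k = np.arange(k_len)[None, :]
    distance = np.abs(pos_q - pos_k).astype(np.float32)
    return -penalty_scale * distance  # zero on the diagonal, negative elsewhere

def penalized_attention(scaled_attention_logits, penalty_scale=1.0):
    """Apply the distance penalty before the softmax over the key axis."""
    q_len, k_len = scaled_attention_logits.shape[-2:]
    logits = scaled_attention_logits + attention_penalty_bias(q_len, k_len, penalty_scale)
    # numerically stable softmax
    logits = logits - logits.max(axis=-1, keepdims=True)
    weights = np.exp(logits)
    return weights / weights.sum(axis=-1, keepdims=True)
```

Following the interpretation above, this bias would be added in the encoder self-attention and the encoder-decoder attention, but skipped in the first (masked self-)attention of each decoder layer.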