A collection of techniques for neural network (NN) training, updated over time.

NLP

Weight tying

Weight tying^[1] ties the input word embedding matrix $U$ (the input embedding) to the topmost weight matrix $V$ (the output embedding) of a neural network language model (NNLM), i.e., it sets $U=V$. This reduces the number of parameters and can therefore reduce overfitting.
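As a rough illustration of the parameter saving, the sketch below counts embedding parameters with and without tying. It assumes both $U$ and $V$ have shape $C \times H$ (vocabulary size by hidden size), which $U=V$ requires; the function name and the sizes are hypothetical and chosen only for illustration.

```python
def embedding_params(vocab_size: int, hidden_size: int, tied: bool) -> int:
    """Count the parameters held by the input and output embeddings."""
    per_matrix = vocab_size * hidden_size  # U and V are each C x H
    # With tying, one matrix serves both roles, so it is counted once.
    return per_matrix if tied else 2 * per_matrix


# Hypothetical sizes, not taken from any particular model.
C, H = 10_000, 200
untied_count = embedding_params(C, H, tied=False)
tied_count = embedding_params(C, H, tied=True)
print(untied_count, tied_count)  # 4000000 2000000
```

Only the embedding matrices are counted here; the recurrent layers are unaffected by tying.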

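Press and Wolf (2016) showed that, with weight tying, the shared embedding behaves more like the output embedding of the untied model than like its input embedding. The sections below compare the updates in the two cases.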

Untied NNLM

Given the input word sequence $i_{1:t} = [i_1, \cdots, i_t]$ up to timestep $t$ and the current target output word $o_t$, the NLL (negative log-likelihood) loss is:

$\mathcal{L}_t = - \log p_t (o_t \vert i_{1:t})$

where $p_t (o_t \vert i_{1:t}) = \frac{\exp(\mathbf{V}^{\top}_{o_t} h_2^{(t)} )}{ \sum_{x=1}^C \exp(\mathbf{V}_x^\top h_2^{(t)} ) }$, $C$ is the vocabulary size, $\mathbf{U}_k$ ($\mathbf{V}_k$) is the $k$-th row of $\mathbf{U}$ ($\mathbf{V}$) and corresponds to the word with index $k$, and $h_2^{(t)}$ is the vector of activations output by the topmost LSTM layer at time $t$.
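The loss for a single timestep can be sketched in plain Python as follows. This is a minimal illustration, assuming $\mathbf{V}$ is given as a list of rows and $h_2^{(t)}$ as a list of floats; the function name `nll_loss` and the toy numbers are hypothetical.

```python
import math


def nll_loss(V, h2, target):
    """Return (loss, probs) for one timestep.

    V:      list of C rows, each a list of H floats (output embedding)
    h2:     list of H floats (topmost LSTM output at this timestep)
    target: index o_t of the target word
    """
    # Logit for word x is the dot product V_x . h2.
    logits = [sum(v * h for v, h in zip(row, h2)) for row in V]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return -math.log(probs[target]), probs


# Toy values, chosen only for illustration.
V = [[0.1, 0.2], [0.0, -0.3], [0.5, 0.4]]
h2 = [1.0, 2.0]
loss, probs = nll_loss(V, h2, target=2)
```

Subtracting the maximum logit does not change the softmax but avoids overflow in `math.exp`.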

The update for row $k$ of input embedding $\mathbf{U}$ is:

$\frac{\partial \mathcal{L}_t}{\partial \mathbf{U}_k } =\left\{ \begin{array}{ll} \left(\sum_{x=1}^C p_t(x \vert i_{1:t}) \cdot \mathbf{V}_x^\top - \mathbf{V}_{o_t}^\top\right) \frac{\partial h_2^{(t)}}{\partial \mathbf{U}_{i_t}} & k=i_t\\ 0 & k \neq i_t \end{array} \right.$

For the output embedding $\mathbf{V}$, the $k$-th row update is:

$\frac{\partial \mathcal{L}_t}{\partial \mathbf{V}_k } =\left\{ \begin{array}{ll} (p_t(o_t \vert i_{1:t})-1) h_2^{(t)} & k=o_t\\ p_t(k \vert i_{1:t}) \cdot h_2^{(t)} & k \neq o_t \end{array} \right.$
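The output-embedding gradient can be checked numerically. The sketch below compares the analytic form, where row $k$ equals $(p_t(k \vert i_{1:t}) - \mathbb{1}[k = o_t]) \, h_2^{(t)}$, against central finite differences of the NLL loss. All function names and toy values are hypothetical, and $h_2^{(t)}$ is treated as a fixed vector.

```python
import math


def softmax_probs(V, h2):
    logits = [sum(v * h for v, h in zip(row, h2)) for row in V]
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def loss(V, h2, target):
    return -math.log(softmax_probs(V, h2)[target])


def grad_V(V, h2, target):
    """Analytic gradient: row k is (p_k - [k == target]) * h2."""
    p = softmax_probs(V, h2)
    return [[(p[k] - (1.0 if k == target else 0.0)) * h for h in h2]
            for k in range(len(V))]


def numeric_grad_V(V, h2, target, eps=1e-6):
    """Central finite differences, perturbing one entry of V at a time."""
    grad = []
    for k in range(len(V)):
        row = []
        for j in range(len(h2)):
            plus = [r[:] for r in V]
            minus = [r[:] for r in V]
            plus[k][j] += eps
            minus[k][j] -= eps
            row.append((loss(plus, h2, target) - loss(minus, h2, target)) / (2 * eps))
        grad.append(row)
    return grad


# Toy values, chosen only for illustration.
V = [[0.1, 0.2], [0.0, -0.3], [0.5, 0.4]]
h2 = [1.0, 2.0]
analytic = grad_V(V, h2, target=2)
numeric = numeric_grad_V(V, h2, target=2)
max_err = max(abs(a - n)
              for ra, rn in zip(analytic, numeric)
              for a, n in zip(ra, rn))
```

Every row of `analytic` is nonzero here, which is the property the comparison with the input embedding relies on.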

Therefore, in the untied NNLM,

the input embedding $\mathbf{U}$ updates only the row $k = i_t$ of the current input word, so the number of updates a row receives tracks how often its word occurs, and rare words are updated only a few times;
the output embedding $\mathbf{V}$ updates every row at each timestep.
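The difference in update frequency can be made concrete with a toy count. The sketch below follows the per-timestep formulas above, so it assumes that only the current input row of $\mathbf{U}$ is touched at each step; the function name and the input sequence are hypothetical.

```python
from collections import Counter


def count_row_updates(inputs, vocab_size):
    """Count the timesteps at which each embedding row gets a nonzero gradient.

    Per the formulas above: at step t only row i_t of U is updated,
    while every row of V is updated.
    """
    seen = Counter(inputs)
    u_updates = [seen.get(k, 0) for k in range(vocab_size)]
    v_updates = [len(inputs)] * vocab_size
    return u_updates, v_updates


# Toy input word indices; word 0 is frequent, word 3 never occurs.
inputs = [0, 1, 0, 2, 0, 0, 1, 0]
u_updates, v_updates = count_row_updates(inputs, vocab_size=4)
print(u_updates)  # [5, 2, 1, 0]
print(v_updates)  # [8, 8, 8, 8]
```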

Tied NNLM

With weight tying, we set $\color{red}{\mathbf{U}=\mathbf{V}=S}$. Thus $S$ plays the role of both the input and the output embedding, and each row of $S$ receives updates through both roles.
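Summing the two untied updates under $\mathbf{U}=\mathbf{V}=S$ gives a rough picture of the tied update. This is a sketch rather than a full derivation: it assumes, as the untied formulas do, that $h_2^{(t)}$ depends on $S$ only through the current input row $S_{i_t}$, and $\mathbb{1}[\cdot]$ denotes the indicator function.

$\frac{\partial \mathcal{L}_t}{\partial S_k} = \underbrace{\mathbb{1}[k=i_t] \left(\sum_{x=1}^C p_t(x \vert i_{1:t}) \cdot S_x^\top - S_{o_t}^\top\right) \frac{\partial h_2^{(t)}}{\partial S_{i_t}}}_{\text{input role}} + \underbrace{\left(p_t(k \vert i_{1:t}) - \mathbb{1}[k=o_t]\right) h_2^{(t)}}_{\text{output role}}$

The input-role term is nonzero for only one row per timestep, whereas the output-role term is nonzero for every row.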

It can be seen that the update is dominated by the output-embedding term, so the tied weights $S$ behave more like the output embedding $\mathbf{V}$ of the untied model than like its input embedding $\mathbf{U}$.

Projection regularization is used for large models: a projection matrix $\color{red}{P}$ is inserted before the output embedding $\mathbf{V}$, giving $h_3 = \mathbf{V} \color{red}{P} h_2$, and the regularization term $\lambda \| \color{red}{P}\|_2$ is added to the loss. Press and Wolf (2016) use $\lambda=0.15$ in their experiments.
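A minimal sketch of the projection step and its penalty is given below. It assumes $\| P \|_2$ means the Frobenius norm (the text does not specify which norm is intended), and the helper names and toy matrices are hypothetical.

```python
import math


def matvec(M, x):
    """Multiply matrix M (list of rows) by vector x."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]


def projected_logits(V, P, h2):
    """Compute h3 = V P h2."""
    return matvec(V, matvec(P, h2))


def projection_penalty(P, lam=0.15):
    """Return lam * ||P||, taking the Frobenius norm (an assumption)."""
    return lam * math.sqrt(sum(p * p for row in P for p in row))


# Toy values; P is the identity, so h3 equals V h2 here.
V = [[0.1, 0.2], [0.0, -0.3], [0.5, 0.4]]
P = [[1.0, 0.0], [0.0, 1.0]]
h2 = [1.0, 2.0]
h3 = projected_logits(V, P, h2)
penalty = projection_penalty(P)  # this value is added to the NLL loss
```

With a spectral norm instead, only `projection_penalty` would change.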

References

1. Press, O., & Wolf, L. (2016). Using the output embedding to improve language models. arXiv preprint arXiv:1608.05859.
