A note for NLP Interview.

Statistical ML

LR vs SVM

Difference:

LR minimizes the logistic loss, while SVM minimizes the hinge loss.
LR is more sensitive to outliers than SVM, because every point contributes to the logistic loss.
SVM often works well with a small training set, while LR usually needs more data.
LR fits its hyperplane using all points (every point counts), whereas SVM only maximizes the distance to the support vectors.
LR relies on feature engineering to model non-linear boundaries, while SVM can use the kernel trick.
Kernel SVM is a non-parametric method, whereas LR is a parametric model.

Logistic Regression (LR)

Logistic Regression (LR) maps a linear function of the features $x$ to a probability over the labels $y \in \{ 0,1 \}$ through the sigmoid function $g(z)=1/(1+\exp(-z))$.

LR is formulated as:

$\begin{equation} h_{\theta}(x) = g(\theta^\top x) = \frac{1}{1+\exp(-\theta^\top x)} \end{equation}$

The derivative of sigmoid function is:

$\begin{align} g^\prime(z) &{}= \frac{d}{dz} \frac{1}{1+\exp(-z)}\\ &{}= \frac{\exp(-z)}{(1+\exp(-z))^2}\\ &{}= \frac{1}{1+\exp(-z)} \cdot \bigg( 1- \frac{1}{1+\exp(-z)} \bigg) \\ &{}= g(z)(1-g(z)) \end{align}$

LR can be used for binary classification, thus

$\begin{align} P(y=1 \vert x; \theta) &{}= h_{\theta} (x)\\ P(y=0 \vert x; \theta) &{}= (1-h_{\theta} (x)) \end{align}$

That is,

$p(y \vert x, \theta) = (h_{\theta}(x))^y (1-h_{\theta}(x))^{(1-y)}$

Given training data with features $x = \{ x^{(1)}, x^{(2)}, \cdots, x^{(m)} \}$ and labels $y = \{ y^{(1)}, y^{(2)}, \cdots, y^{(m)} \}$, the log-likelihood to maximize is:

$\begin{align} \ell (\theta) &{}= \log \mathcal{L}(\theta) \\ &{}= \sum_{i=1}^m y^{(i)} \log h(x^{(i)}) + (1-y^{(i)}) \log (1-h(x^{(i)})) \end{align}$

With the gradient ascent algorithm, we update $\theta := \theta + \alpha \nabla_{\theta}\ell (\theta)$. For a single sample $(x, y)$, the partial derivative is:

$\begin{align} \frac{\partial}{\partial \theta_j} \ell(\theta) &{}= \bigg( y \frac{1}{g(\theta^T x)} - (1-y) \frac{1}{1-g(\theta^T x)} \bigg) \frac{\partial}{\partial \theta_j} g(\theta^T x) \\ &{}= \bigg( y \frac{1}{g(\theta^T x)} - (1-y) \frac{1}{1-g(\theta^T x)} \bigg) g(\theta^T x) (1-g(\theta^T x)) \frac{\partial}{\partial \theta_j} \theta^T x \\ &{}= (y (1-g(\theta^T x)) - (1-y)g(\theta^T x)) x_j \\ &{}= (y-h_\theta (x)) x_j \end{align}$

If we use only one sample per step (stochastic gradient ascent), the update can be formulated as:

$\theta_j := \theta_j + \alpha (y^{(i)} - h_\theta (x^{(i)})) x_{j}^{(i)}$

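The loss function is the mean negative log-likelihood, where $\hat{y}^{(i)} = h_\theta(x^{(i)})$ and $\theta$ collects the weights $w$ and bias $b$:

$\begin{align} J(w, b) &{}= \frac{1}{m} \sum_{i=1}^m \mathcal{L} (\hat{y}^{(i)}, y^{(i)})\\ &{}= \frac{1}{m} \sum_{i=1}^m (-y^{(i)} \log (\hat{y}^{(i)})- (1-y^{(i)}) \log (1-\hat{y}^{(i)})) \end{align}$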
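
The batch form of the update and the loss $J$ can be sketched in NumPy as follows. The synthetic data, the bias column folded into `X`, the step size `alpha`, and the helper names `fit_lr` and `nll` are assumptions made for illustration only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nll(theta, X, y):
    """Mean negative log-likelihood J; eps guards against log(0)."""
    h = sigmoid(X @ theta)
    eps = 1e-12
    return float(-np.mean(y * np.log(h + eps) + (1 - y) * np.log(1 - h + eps)))

def fit_lr(X, y, alpha=0.1, n_steps=500):
    """Batch gradient ascent on the log-likelihood (averaged over samples)."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_steps):
        grad = X.T @ (y - sigmoid(X @ theta)) / len(y)  # mean of (y - h(x)) x_j
        theta += alpha * grad
    return theta

# Synthetic data (assumption): first column of ones acts as the bias term.
rng = np.random.default_rng(0)
X = np.hstack([np.ones((200, 1)), rng.normal(size=(200, 2))])
true_theta = np.array([-0.5, 2.0, -1.0])
y = (rng.uniform(size=200) < sigmoid(X @ true_theta)).astype(float)

theta_hat = fit_lr(X, y)
print(nll(np.zeros(3), X, y), nll(theta_hat, X, y))
```

The loss at the fitted `theta_hat` should be lower than at the zero initialization, since maximizing $\ell(\theta)$ is the same as minimizing $J$.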

Linear SVM

Given a training dataset of $m$ points of the form $(\mathbf{x}_1,y_1), \cdots,(\mathbf{x}_m,y_m)$, where each $y_i \in \{-1,1\}$ indicates the class to which the point $\mathbf{x}_i$ belongs, we want to find the maximum-margin hyperplane that divides the points $\mathbf{x}_i$ into two groups so that the distance between the hyperplane and the nearest point from either group is maximized.

Any hyperplane can be written as the set of points $\mathbf{x}$ satisfying $\mathbf{w}^T\mathbf{x}+\mathbf{b}=0$.

Hard margin

If the training data are linearly separable, we can select two parallel hyperplanes that separate the two classes of data, so that the distance between them is as large as possible. The region bounded by these two hyperplanes is called the “margin”, and the maximum-margin hyperplane is the hyperplane that lies halfway between them.
The optimization aims to “minimize $\Vert \mathbf{w} \Vert$ subject to $y_i (w^T x_i + b) \geq 1$ for all $i$”:

Maximize the margin: $ \min_{\mathbf{w,b}} \frac{1}{2} \Vert \mathbf{w} \Vert^2$
Classify: $y_i (w^T x_i + b) \geq 1, \quad i=1,2,3,\cdots,m$

where the $\mathbf{w},\mathbf{b}$ determine our classifier $\mathbf{x} \rightarrow \textrm{sign} (\mathbf{w}^T \mathbf{x} + \mathbf{b})$

Soft margin

Hinge Loss
When the data are not linearly separable, the hinge loss^[1] is helpful:
$\begin{align} \max(0, 1- \underbrace{y_i}_\textrm{label} \underbrace{(\mathbf{w}^T \mathbf{x}_i + \mathbf{b})}_\textrm{prediction}) \end{align}$
The hinge loss is zero if the constraint $y_i (w^T x_i + b) \geq 1$ is satisfied, i.e., if $\mathbf{x}_i$ lies on the correct side of the margin. For data on the wrong side of the margin (-1 vs 1), the hinge loss is proportional to the distance from the margin.
Soft margin objective
The optimization goal is to minimize
$\begin{align} \lambda \Vert \mathbf{w} \Vert^2 + \bigg[ \frac{1}{m} \sum_{i=1}^m \max (0, 1- \underbrace{y_i}_{\textrm{label}}\underbrace{(\mathbf{w}^T\mathbf{x}_i + \mathbf{b})}_{\textrm{prediction}}) \bigg] \end{align}$
where the parameter $\lambda$ determines the trade-off between increasing the margin size and ensuring that the $\mathbf{x}_i$ lie on the correct side of the margin. Thus, for sufficiently small values of $\lambda$, the regularization term $\lambda \Vert \mathbf{w} \Vert^2$ becomes negligible, and the objective behaves like the hard-margin SVM when the data are linearly separable.
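
A minimal sketch of minimizing the soft-margin objective with plain subgradient descent. The toy two-cluster data, the learning rate `lr`, the value of `lam`, and the function names are assumptions for illustration; practical solvers typically use more refined optimizers.

```python
import numpy as np

def svm_objective(w, b, X, y, lam):
    """lam * ||w||^2 plus the mean hinge loss."""
    margins = y * (X @ w + b)
    return float(lam * w @ w + np.mean(np.maximum(0.0, 1.0 - margins)))

def fit_linear_svm(X, y, lam=0.01, lr=0.1, n_steps=1000):
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(n_steps):
        active = y * (X @ w + b) < 1.0  # points with non-zero hinge loss
        # Subgradient: only margin-violating points contribute to the hinge term.
        grad_w = 2 * lam * w - (y[active, None] * X[active]).sum(axis=0) / len(y)
        grad_b = -y[active].sum() / len(y)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Toy data (assumption): two Gaussian clusters with labels +1 and -1.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=2.0, size=(50, 2)), rng.normal(loc=-2.0, size=(50, 2))])
y = np.array([1.0] * 50 + [-1.0] * 50)

w, b = fit_linear_svm(X, y)
accuracy = float(np.mean(np.sign(X @ w + b) == y))
print(accuracy, svm_objective(w, b, X, y, lam=0.01))
```

Points that already satisfy $y_i(\mathbf{w}^T\mathbf{x}_i + b) \geq 1$ drop out of the gradient, which is why only points near or across the margin shape the solution.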

CRF

Loss

Let $x$ represent the observation, $y$ denote the labels. CRF can be formulated as:

$\begin{aligned} p(y|x) = \frac{\exp(\textrm{score}(x,y))}{\sum_{y'}\exp(\textrm{score}(x,y'))} \end{aligned}$

where

$\textrm{score}(x,y) = \sum_{i} T_{y_i, y_{i+1}} + \sum_{i} E_{i, y_i}$

The loss function would be given as:

$\begin{aligned} \ell &{}= -\log (p (y \vert x))\\ &{}= - \textrm{score}(x,y) + \log (\sum_{y'} (\exp(\textrm{score}(x, y')))) \end{aligned}$
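
The log-partition term sums over all label sequences, but it can be computed in $O(nK^2)$ with the forward algorithm. The sketch below assumes a linear chain with $n$ positions and $K$ labels, no special start/end transitions, and random scores; the names `score`, `log_partition`, and `crf_nll` are illustrative. It checks the forward result against brute-force enumeration.

```python
import itertools
import numpy as np

def score(E, T, y):
    """score(x, y): emission scores E[i, y_i] plus transition scores T[y_i, y_{i+1}]."""
    emit = sum(E[i, y[i]] for i in range(len(y)))
    trans = sum(T[y[i], y[i + 1]] for i in range(len(y) - 1))
    return emit + trans

def log_partition(E, T):
    """log sum_{y'} exp(score(x, y')) via the forward algorithm in log space."""
    alpha = E[0]
    for i in range(1, E.shape[0]):
        s = alpha[:, None] + T          # s[j, k] = alpha[j] + T[j, k]
        m = s.max(axis=0)
        alpha = m + np.log(np.exp(s - m).sum(axis=0)) + E[i]
    m = alpha.max()
    return float(m + np.log(np.exp(alpha - m).sum()))

def crf_nll(E, T, y):
    """Negative log-likelihood: -score(x, y) + log-partition."""
    return log_partition(E, T) - score(E, T, y)

rng = np.random.default_rng(0)
n, K = 4, 3
E = rng.normal(size=(n, K))
T = rng.normal(size=(K, K))

brute = float(np.log(sum(np.exp(score(E, T, yp))
                         for yp in itertools.product(range(K), repeat=n))))
print(log_partition(E, T), brute, crf_nll(E, T, (0, 1, 2, 0)))
```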

Decision Tree

GBDT / Xgboost

L1/L2 Regularization

L1 regularization

The L1 regularization is given as:

$\begin{aligned} C &{}= C_0 + \frac{\lambda}{n} \sum_w \vert w \vert \end{aligned}$

Thus,

$\frac{\partial C}{\partial w} = \frac{\partial C_0}{\partial w} + \frac{\lambda}{n} \textrm{sgn} (w)$

The weight update is:

$\begin{aligned} w & \rightarrow w - \frac{\eta \lambda}{n} \textrm{sgn}(w) - \eta \frac{\partial C_0}{\partial w} \end{aligned}$

L2 regularization

The L2 regularization is given as:

$\begin{aligned} C = C_0 + \frac{\lambda}{2n} \sum_w w^2 \\ \end{aligned}$

Thus,

$\frac{\partial C}{\partial w} = \frac{\partial C_0}{\partial w} + \frac{\lambda}{n} w$

The weight update is:

$\begin{aligned} w & \rightarrow w - \eta \frac{\partial C_0}{\partial w} - \frac{\eta \lambda}{n} w \\ &{}= \underbrace{\big( 1- \frac{\eta \lambda}{n} \big) w }_{\Downarrow \textrm{decrease} }- \eta \frac{\partial C_0}{\partial w} \end{aligned}$
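
To see how the two penalties act on the weights, the sketch below applies one update step of each kind with the data gradient $\partial C_0 / \partial w$ set to zero. The function names and the numbers are assumptions chosen for illustration. The absolute-value (L1) penalty moves every weight toward zero by the same constant amount, while the squared (L2) penalty shrinks each weight in proportion to its size.

```python
import numpy as np

def l1_step(w, grad_c0, eta, lam, n):
    """Absolute-value penalty: subtract a constant eta*lam/n in the direction of sgn(w)."""
    return w - eta * lam / n * np.sign(w) - eta * grad_c0

def l2_step(w, grad_c0, eta, lam, n):
    """Squared penalty: rescale w by (1 - eta*lam/n), i.e. weight decay."""
    return (1.0 - eta * lam / n) * w - eta * grad_c0

w = np.array([2.0, -0.5, 0.01])
no_data_grad = np.zeros_like(w)  # isolate the effect of the penalty term

w_l1 = l1_step(w, no_data_grad, eta=0.1, lam=1.0, n=10)
w_l2 = l2_step(w, no_data_grad, eta=0.1, lam=1.0, n=10)
print(w_l1)
print(w_l2)
```

Here the small weight `0.01` is driven to (approximately) zero by the constant-step update but only scaled by 0.99 under weight decay, which is the usual intuition for why the absolute-value penalty yields sparse weights.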

Implementation

KMeans

KNN

Class Imbalance / Long-tailed Learning

Existing methods for class imbalance^[12]^[13] intervene at one of three places:

the input to a model (Data modification)
- Under-sampling
- Over-sampling
- Feature Transfer
the output of a model (Post-hoc correction of the decision threshold)
- Modify threshold
- Normalize weights
the internals of a model (e.g., loss function)
- Loss balancing
- Volume weighting
- Average top-k loss
- Domain adaptation
- Label aware margin

Information Theory

KL Divergence

Kullback-Leibler (KL) divergence^[9] (a.k.a. relative entropy) is a measure of how one probability distribution differs from a second, reference probability distribution.

Consider two probability distributions $P$ and $Q$. Usually, $P$ represents the data, the observations, or a measured probability distribution. Distribution $Q$ represents instead a theory, a model, a description or an approximation of $P$. The KL divergence is then interpreted as the average difference of the number of bits required for encoding samples of $P$ using a code optimized for $Q$ rather than one optimized for $P$.

For discrete probability distributions $P$ and $Q$ defined on the same probability space $\chi$, the relative entropy from $Q$ to $P$ is defined to be:

$\begin{align} \mathbb{KL}(P \Vert Q) &{}= \sum_{x \in \chi} P(x) \log \bigg( \frac{P(x)}{Q(x)} \bigg) \\ &{}= - \sum_{x \in \chi} P(x) \log \bigg( \frac{Q(x)}{P(x)} \bigg) \end{align}$

The relative entropy can be interpreted as the expected message-length per datum that must be communicated if a code that is optimal for a given (wrong) distribution $Q$ is used, compared to using a code based on the true distribution $P$.

$\begin{align} \mathbb{KL} (P \Vert Q) &{}= - \sum_{x \in \chi} P(x) \log Q(x) + \sum_{x \in \chi} P(x) \log P(x) \\ &{}= \mathbb{H}(P, Q) - \mathbb{H}(P) \end{align}$

where $\mathbb{H}(P, Q)$ indicates the cross entropy of $P$ and $Q$, and $\mathbb{H}(P)$ is the entropy of $P$.

Properties

Non-negative
Asymmetric
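
The definition, the cross-entropy decomposition, and both properties can be checked numerically. This is a small sketch using natural logarithms; the two example distributions and the function names are assumptions, and $Q(x) > 0$ is assumed wherever $P(x) > 0$.

```python
import numpy as np

def entropy(p):
    nz = p > 0
    return float(-np.sum(p[nz] * np.log(p[nz])))

def cross_entropy(p, q):
    nz = p > 0
    return float(-np.sum(p[nz] * np.log(q[nz])))

def kl(p, q):
    """KL(P || Q) = sum_x P(x) log(P(x) / Q(x)), skipping terms with P(x) = 0."""
    nz = p > 0
    return float(np.sum(p[nz] * np.log(p[nz] / q[nz])))

p = np.array([0.5, 0.4, 0.1])
q = np.array([0.3, 0.3, 0.4])

print(kl(p, q), cross_entropy(p, q) - entropy(p))  # same value
print(kl(p, q), kl(q, p))                          # differ: KL is asymmetric
```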

JS Divergence

Jensen-Shannon (JS) divergence is a symmetric measure of the difference between two probability distributions.

$\begin{align} \mathbb{JS}(P \Vert Q) = \frac{1}{2} \mathbb{KL}(P \Vert M) + \frac{1}{2} \mathbb{KL}(Q \Vert M) \end{align}$

where $M = \frac{1}{2}(P+Q)$

Properties

Symmetric
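Bounded: $0 \leq \mathbb{JS}(P \Vert Q) \leq 1$ when the logarithm is base 2 (the upper bound is $\ln 2$ with the natural logarithm)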
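
A short sketch of the JS divergence built from KL, using base-2 logarithms so that the upper bound is 1. The example distributions and function names are assumptions; two distributions with disjoint support reach the upper bound.

```python
import numpy as np

def kl_bits(p, q):
    """KL(P || Q) in bits; terms with P(x) = 0 contribute nothing."""
    nz = p > 0
    return float(np.sum(p[nz] * np.log2(p[nz] / q[nz])))

def js(p, q):
    m = 0.5 * (p + q)  # mixture M = (P + Q) / 2
    return 0.5 * kl_bits(p, m) + 0.5 * kl_bits(q, m)

p = np.array([0.5, 0.4, 0.1])
q = np.array([0.3, 0.3, 0.4])
disjoint_a = np.array([1.0, 0.0])
disjoint_b = np.array([0.0, 1.0])

print(js(p, q), js(q, p))          # equal: JS is symmetric
print(js(disjoint_a, disjoint_b))  # upper bound in bits
```

Because $M$ is positive wherever $P$ or $Q$ is, the JS divergence stays finite even when one distribution assigns zero probability to an outcome of the other, unlike KL.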

Mutual Information

Mutual Information (MI)^[10] measures the mutual dependence between the two variables.

$\begin{align} I(X; Y) &{}= \mathbb{KL} (P(X,Y) \Vert P(X)P(Y)) \\ &{}= \mathbb{E}_{X} \{\mathbb{KL}(P(Y \vert X) \Vert P(Y))\}\\ &{}= \mathbb{E}_{Y} \{\mathbb{KL}(P(X \vert Y) \Vert P(X))\} \end{align}$

For discrete variables $X$ and $Y$ the MI is:

$\begin{align} I(X;Y) = \sum_{y \in \mathcal{Y}} \sum_{x \in \mathcal{X}} p_{(X,Y)} (x,y) \log \bigg( \frac{p_{(X,Y)}(x,y)}{p_X(x) \, p_Y(y)} \bigg) \end{align}$

Properties

Non-negative: $I(X;Y) \geq 0$
Symmetry: $I(X;Y) = I(Y;X)$
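
The discrete formula can be evaluated directly from a joint probability table. The sketch below uses natural logarithms; the example tables and the name `mutual_information` are assumptions for illustration.

```python
import numpy as np

def mutual_information(joint):
    """I(X; Y) from a joint table with rows indexed by x and columns by y."""
    joint = np.asarray(joint, dtype=float)
    px = joint.sum(axis=1, keepdims=True)  # marginal p_X, shape (|X|, 1)
    py = joint.sum(axis=0, keepdims=True)  # marginal p_Y, shape (1, |Y|)
    nz = joint > 0                          # 0 * log(0) terms are treated as 0
    return float(np.sum(joint[nz] * np.log(joint[nz] / (px @ py)[nz])))

independent = np.outer([0.3, 0.7], [0.6, 0.4])  # p(x, y) = p(x) p(y)
dependent = np.array([[0.5, 0.0], [0.0, 0.5]])  # Y is a copy of X
mixed = np.array([[0.4, 0.1], [0.2, 0.3]])

print(mutual_information(independent))  # approximately 0
print(mutual_information(dependent))    # log 2
print(mutual_information(mixed), mutual_information(mixed.T))
```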

Evaluation Metric

ROC / AUC

ROC

Receiver Operating Characteristic (ROC) Curve is a plot that illustrates the diagnostic ability of a binary classifier system as its discrimination threshold is varied.

x-axis: false positive rate (FPR), a.k.a. probability of false alarm.
y-axis: true positive rate (TPR), a.k.a. sensitivity, recall, probability of detection.

ROC is a comparison of two operating characteristics (TPR and FPR) as the criterion changes.

True Positive Rate (TPR) is a synonym for recall and is therefore defined as follows:

$\textrm{TPR} = \frac{\textrm{TP}}{\textrm{TP+FN}}$

False Positive Rate (FPR) is defined as follows:

$\textrm{FPR} = \frac{\textrm{FP}}{\textrm{FP+TN}}$

An ROC curve plots TPR vs. FPR at different classification thresholds. Lowering the classification threshold classifies more items as positive, thus increasing both False Positives and True Positives.
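
The threshold sweep can be sketched as follows: each distinct score is used as a threshold, from highest to lowest, and one (FPR, TPR) point is recorded per threshold. The four-sample example and the name `roc_points` are assumptions for illustration.

```python
import numpy as np

def roc_points(y_true, y_score):
    """Return FPR and TPR arrays, one point per threshold (descending)."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    # Start above the maximum score so the curve begins at (0, 0).
    thresholds = np.concatenate(([np.inf], np.unique(y_score)[::-1]))
    n_pos = (y_true == 1).sum()
    n_neg = (y_true == 0).sum()
    fpr, tpr = [], []
    for t in thresholds:
        pred_pos = y_score >= t
        tpr.append((pred_pos & (y_true == 1)).sum() / n_pos)  # TP / (TP + FN)
        fpr.append((pred_pos & (y_true == 0)).sum() / n_neg)  # FP / (FP + TN)
    return np.array(fpr), np.array(tpr)

y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
fpr, tpr = roc_points(y_true, y_score)

# Area under the piecewise-linear curve (trapezoidal rule).
auc = float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2))
print(fpr, tpr, auc)
```

Both arrays are non-decreasing as the threshold drops, which matches the statement that lowering the threshold increases both false positives and true positives.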

AUC

Area under the ROC Curve (AUC) provides an aggregate measure of performance across all possible classification thresholds. A model whose predictions are 100% wrong has an AUC of 0.0; one whose predictions are 100% correct has an AUC of 1.0.

import numpy as np
import pandas as pd

y_pred = list(np.random.uniform(.4, .6, 2000)) + list(np.random.uniform(.5, .7, 8000))
y_true = [0] * 2000 + [1] * 8000


def calc_auc(y_true, y_pred):
    """AUC via the rank-sum (Mann-Whitney U) formula."""
    df = pd.DataFrame({'y_true': y_true, 'y_pred': y_pred})
    # Ascending 1-based ranks; tied scores share their average rank.
    df['rank'] = df['y_pred'].rank(method='average')
    pos_rank_sum = df.loc[df['y_true'] == 1, 'rank'].sum()
    m = int((df['y_true'] == 1).sum())  # number of positives
    n = len(df) - m                     # number of negatives
    # U statistic of the positives divided by the number of (pos, neg) pairs.
    return (pos_rank_sum - m * (m + 1) / 2) / (m * n)

print(calc_auc(y_true, y_pred))

# sklearn
from sklearn.metrics import roc_auc_score
print(roc_auc_score(y_true, y_pred))

F1-Measure

Micro-F1: calculate metrics globally by counting total TP,FN,FP
Macro-F1: calculate metrics for each label => unweighted mean.
Weighted-F1: calculate metrics for each label => average weighted by support (# of true instances for each class)
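
The three averaging schemes can be computed from per-class TP/FP/FN counts. The sketch below assumes single-label multi-class predictions with integer class ids; the toy labels and the name `f1_scores` are illustrative.

```python
import numpy as np

def f1_scores(y_true, y_pred, n_classes):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    classes = range(n_classes)
    tp = np.array([np.sum((y_pred == c) & (y_true == c)) for c in classes])
    fp = np.array([np.sum((y_pred == c) & (y_true != c)) for c in classes])
    fn = np.array([np.sum((y_pred != c) & (y_true == c)) for c in classes])
    denom = 2 * tp + fp + fn
    # Per-class F1 = 2TP / (2TP + FP + FN); classes with no counts get 0.
    per_class = np.divide(2 * tp, denom, out=np.zeros(n_classes), where=denom > 0)
    support = tp + fn  # number of true instances per class
    return {
        "micro": float(2 * tp.sum() / (2 * tp.sum() + fp.sum() + fn.sum())),
        "macro": float(per_class.mean()),
        "weighted": float(np.sum(per_class * support) / support.sum()),
    }

y_true = [0, 0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 0, 1, 1, 2, 2]
scores = f1_scores(y_true, y_pred, n_classes=3)
print(scores)
```

In this single-label setting, Micro-F1 equals accuracy (5 of 7 correct here), while Macro-F1 is pulled down by the weaker minority classes.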

Comparison between ROC and F1-measure:

Both take the True Positive Rate (TPR/Recall) into account: ROC looks at the True Positive Rate (TPR/Recall) and False Positive Rate (FPR), while F1 looks at Positive Predictive Value (PPV/Precision) and True Positive Rate (TPR/Recall).^[11]
F1 score focuses on the positive class, so it suits highly imbalanced datasets where the fraction of the positive class is small.
ROC weighs the positive and negative classes equally, so it suits datasets that are fairly balanced.

Deep Learning

Batch Norm vs Layer Norm

$y = \frac{x - \mathrm{E}[x]}{\sqrt{\mathrm{Var}[x] + \epsilon}} * \gamma + \beta$

BN normalizes each feature across the samples of one batch (first dim), while LN normalizes each sample across its features (last dim).
Refer to details
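
The difference comes down to the axis over which the mean and variance are taken. A minimal NumPy sketch for a `(batch, features)` input, assuming $\gamma = 1$, $\beta = 0$, and training-time batch statistics (no running averages); the name `normalize` is illustrative.

```python
import numpy as np

def normalize(x, axis, eps=1e-5):
    """(x - E[x]) / sqrt(Var[x] + eps), with statistics taken over `axis`."""
    mean = x.mean(axis=axis, keepdims=True)
    var = x.var(axis=axis, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(8, 4))  # (batch, features)

bn = normalize(x, axis=0)  # batch norm: one mean/var per feature, over the batch
ln = normalize(x, axis=1)  # layer norm: one mean/var per sample, over features

print(bn.mean(axis=0))  # approximately 0 for each feature
print(ln.mean(axis=1))  # approximately 0 for each sample
```

Because LN statistics are computed per sample, its output does not depend on the other samples in the batch, which is one reason it is common in sequence models.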

Gradient Vanishing/Exploding

Gradient vanishing/exploding arises in backpropagation from the repeated multiplication of gradient factors that are smaller than 1 (vanishing) or greater than 1 (exploding).

Solution

Pretraining-Finetuning per layer
Gradient Clip / Weight Regularization
Activation function: avoid sigmoid, whose derivative $g(z)(1-g(z))$ is at most 0.25.
Appropriate weight initialization: Xavier-Glorot initialization^[4]
Batch Norm: reduces internal covariate shift.
Residual Connection
LSTM: refer to ^[4]^[5].

RNNs

LSTM

LSTM^[6] integrates three gates: input gate, forget gate, and output gate.

$\begin{align} \left[\begin{array}{c} \mathbf{i}^c_j\\ \mathbf{o}^c_j \\ \mathbf{f}^c_j \\ \tilde{c}^c_j \end{array}\right] &{}= \left[\begin{array}{c} \sigma \\ \sigma \\ \sigma \\ \tanh \end{array}\right] (\mathbf{W}^{c^T} \left[\begin{array}{c} \mathbf{x}^c_j \\ \mathbf{h}^c_{j-1}\end{array}\right] + \mathbf{b}^c) \\ \mathbf{c}^c_j &{}= \mathbf{f}^c_j \odot \mathbf{c}^c_{j-1} + \mathbf{i}^c_j \odot \tilde{c}^c_{j} \\ \mathbf{h}_j^c &{}= \mathbf{o}_j^c \odot \tanh(\mathbf{c}^c_j) \end{align}$
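
One step of the cell above can be sketched in NumPy. The gate order (input, output, forget, candidate) follows the stacked vector in the equation; the dimensions, the random weights, and the name `lstm_step` are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, b):
    """W has shape (input_dim + hidden_dim, 4 * hidden_dim); b has shape (4 * hidden_dim,)."""
    z = W.T @ np.concatenate([x, h_prev]) + b
    i, o, f, g = np.split(z, 4)          # pre-activations of the four blocks
    i, o, f = sigmoid(i), sigmoid(o), sigmoid(f)
    g = np.tanh(g)                       # candidate cell state
    c = f * c_prev + i * g               # forget old memory, write new memory
    h = o * np.tanh(c)                   # expose part of the cell state
    return h, c

rng = np.random.default_rng(0)
input_dim, hidden_dim = 3, 2
W = rng.normal(size=(input_dim + hidden_dim, 4 * hidden_dim))
b = np.zeros(4 * hidden_dim)

h, c = np.zeros(hidden_dim), np.zeros(hidden_dim)
for x in rng.normal(size=(5, input_dim)):  # run over a length-5 sequence
    h, c = lstm_step(x, h, c, W, b)
print(h, c)
```

The cell state update is additive (`f * c_prev + i * g`), which is the path that helps gradients flow across many time steps.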

GRU

GRU has two gates: an update gate (which plays the role of the input and forget gates in LSTM) and a reset gate.

Transformer

See Transformer blog

Backprop with Softmax + XE

Refer to ^[7].

Softmax Forward

Given the softmax written in:

$\textrm{softmax}(a_i) = p_i = \frac{\exp(a_i)}{\sum_{j=1}^N \exp(a_j)}$

where $a_i, i=1,2,\cdots,N$ is the output logits, $p_i$ is the predicted probability of $i$-th class, and

$\sum_{i=1}^N p_i = 1$

Computation

To avoid overflow in $\exp(\cdot)$, the computation of softmax first subtracts the maximum value of $A=[a_1, a_2, \cdots, a_N]$ from every logit.

We have

$\begin{align} p_i &{}= \frac{\exp(a_i)}{\sum_{j=1}^N \exp(a_j)} \\ &{}= \frac{C \exp(a_i)}{C \sum_{j=1}^N \exp(a_j)} \\ &{}= \frac{\exp(\log C) \exp(a_i)}{\exp(\log C) \sum_{j=1}^N \exp(a_j)} \\ &{}= \frac{\exp(a_i + \log C)}{\sum_{j=1}^N \exp(a_j + \log C)} \\ &{}= \frac{\exp(a_i - \max(A))}{\sum_{j=1}^N \exp(a_j - \max(A))} \\ \end{align}$

where $C$ is an arbitrary positive constant, chosen in the last step so that $\log C = -\max(A)$.
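
A minimal sketch of the max-subtraction trick; the example logits are assumptions chosen so that a naive `np.exp(1000.0)` would overflow.

```python
import numpy as np

def softmax(a):
    a = np.asarray(a, dtype=float)
    shifted = a - np.max(a)  # largest exponent becomes exp(0) = 1
    e = np.exp(shifted)
    return e / e.sum()

large_logits = np.array([1000.0, 1001.0, 1002.0])
p = softmax(large_logits)
print(p, p.sum())
print(softmax([0.0, 1.0, 2.0]))  # same probabilities: softmax is shift-invariant
```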

Cross Entropy Forward

Denote the Cross Entropy (XE) loss as $H$:

$\ell(y, p) = H(y, p) = -\sum_{i=1}^N y_i \cdot \log p_i$

Softmax Derivative

The derivative of the softmax output $p_i$ w.r.t. the logit $a_j$ is:

$\frac{\partial p_i}{\partial a_j} = \frac{\partial \big( \frac{\exp(a_i)}{\sum_{k=1}^N \exp(a_k)} \big)}{\partial a_j}$

For brevity, let $\sum = \sum_{k=1}^N \exp(a_k)$.

When $i=j$, we have:
$\begin{align} \frac{\partial p_i}{\partial a_j} &{}= \frac{\exp(a_i) \cdot \sum - \exp(a_i)\cdot \exp(a_j)}{\sum\cdot \sum} \\ &{}= \frac{\exp(a_i) (\sum - \exp(a_i))}{\sum\cdot \sum} \\ &{}= p_i (1-p_j) \end{align}$
When $i \neq j$, we have:
$\begin{align} \frac{\partial p_i}{\partial a_j} &{}= \frac{0 \cdot \sum - \exp(a_i)\cdot \exp(a_j)}{\sum\cdot \sum} \\ &{}= - p_i \cdot p_j \end{align}$

XE+Softmax Derivative

The partial derivative of XE w.r.t. $p_i$ is:

$\begin{equation} \frac{\partial H}{\partial p_i} = - \frac{y_i}{p_i} \end{equation}$

According to the chain rule, the derivative w.r.t. $a_j$ sums over all outputs $p_i$:

$\begin{align} \frac{\partial H}{\partial a_j} &{}= \sum_{i=1}^N \frac{\partial H}{\partial p_i} \cdot \frac{\partial p_i}{\partial a_j}\\ &{}= -\sum_{i=1}^N y_i \frac{1}{p_i} \cdot \frac{\partial p_i}{\partial a_j} \label{eq:xe_derivative} \end{align}$

The term with $i=j$ in Eq. $\eqref{eq:xe_derivative}$ is:
$\begin{align} -y_j \frac{1}{p_j}\cdot p_j \cdot (1-p_j) &{}= -y_j \cdot (1 - p_j) \\ &{}= -y_j + y_j p_j \label{eq:s1} \end{align}$
The terms with $i \neq j$ in Eq. $\eqref{eq:xe_derivative}$ are:
$\begin{align} -\sum_{i \neq j} y_i \frac{1}{p_i}\cdot (- p_i \cdot p_j) &{}= \sum_{i \neq j} y_i p_j \label{eq:s2} \end{align}$

Since the two cases cover disjoint terms of the sum, adding Eq. $\eqref{eq:s1}$ and $\eqref{eq:s2}$ gives:

$\begin{align} \textrm{Eq.} \eqref{eq:xe_derivative} &{}= \textrm{Eq.}\eqref{eq:s1} + \textrm{Eq.}\eqref{eq:s2} \\ &{}= -y_j + y_j p_j + \sum_{i \neq j} y_i p_j \\ &{}= -y_j + p_j \sum_{i=1}^N y_i \label{eq:one_hot} \\ &{}= p_j - y_j \end{align}$

In Eq.$\eqref{eq:one_hot}$, we use $\sum_{i=1}^N y_i = 1$, which holds because $y$ is a one-hot label vector.
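
The result $\partial H / \partial a_j = p_j - y_j$ can be verified with a finite-difference check. The logits, the one-hot target, and the step `eps` below are assumptions for illustration.

```python
import numpy as np

def softmax(a):
    e = np.exp(a - np.max(a))
    return e / e.sum()

def cross_entropy(a, y):
    """H(y, softmax(a)) as a function of the logits a."""
    return float(-np.sum(y * np.log(softmax(a))))

a = np.array([0.2, -1.0, 0.5])
y = np.array([0.0, 0.0, 1.0])  # one-hot target

analytic = softmax(a) - y      # p - y

# Central differences, perturbing one logit at a time.
eps = 1e-6
basis = np.eye(len(a))
numeric = np.array([
    (cross_entropy(a + eps * basis[j], y) - cross_entropy(a - eps * basis[j], y)) / (2 * eps)
    for j in range(len(a))
])
print(analytic)
print(numeric)
```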

NLP

Static Word Representation

Word2Vec

Hierarchical Softmax / Negative Sampling

Refer to my blog

Hierarchical Softmax: reduces the per-word output cost from $O(|V|)$ to $O(\log |V|)$ using a Huffman tree
Negative Sampling

W2V vs GloVe

BPE vs WordPiece

Refer to OOV blog

References

1. Wiki: Hinge Loss
2. SVM Blog
3. SVM Derivatives (in Chinese)
4. Written Memories: Understanding, Deriving and Extending the LSTM
5. LSTM eased gradient vanishing explanations (in Chinese)
6. Understanding LSTM Networks
7. Softmax classification with cross-entropy (2/2)
8. Softmax+XE Backpropagation (in Chinese)
9. Wiki: KL divergence
10. Wiki: Mutual Information
11. F1 score vs ROC AUC vs Accuracy vs PR AUC: Which Evaluation Metric Should You Choose?
12. Long-Tail Learning via Logit Adjustment
13. Data Imbalance blog