 Dynamic Programming (DP) is ubiquitous in NLP, such as Minimum Edit Distance, Viterbi Decoding, forward/backward algorithm, CKY algorithm, etc.

# Minimum Edit Distance

Minimum Edit Distance (Levenshtein distance) is string metric for measuring the difference between two sequences.

Given two strings $a$ and $b$ (with length $|a|$ and $|b|$ respectively), define $\text{D}(i,j)$ as the edit distance between $a[1 \cdots i]$ and $b[1 \cdots j]$, i.e. the first $i$ chars of $a$ and the first $j$ chars of $b$. The edit distance between two strings is $\text{D}(|a|, |b|)$.

Let the costs of insertion, deletion, substitution are 1,1 and 2, respectively.

That is,

Algorithms in Python: # DP in HMMs

Hidden Markov Models (HMMs) describe the joint probability of its hidden(unobserved) and observed discrete random variables. It relies on the Markov assumption and output indenpendence.

• The observation sequence $Y = (Y_1=y_1, Y_2=y_2, \cdots, Y_T=y_T)$
• Let $X_t$ be a discrete random vriable with $K$ possible states. The transition matrix $A={a_{ij}} = P(X_t=j \vert X_{t-1}=i)$
• The initial state distribution(t=1) is $\pi_i = P(X_1 = i)$
• The output/emission matrix $B = {b_j(y_i)} = P(Y_t=y_i \vert X_t = j)$

Thus the hidden markov chain can be defined by $\theta = (A, B, \pi)$. ## Forward algorithm

Forward algorithm is used to calculate the "belief state": the probability at time $t$, given the history of evidence. Such process is also called filtering.

• The goal is to compute the joint probability $p(x_t, y_{1:t})$, which requires marginalizing over all possible state sequence ${x_{1:t-1}}$, the # of which grows exponentially with $t$.
• The conditional independence of HMM could help to perform calculation recursively.
• Time complexity: $O(K^2T)$

Forward algorithm:

• Init $t=0$, transition probabilities $p(x_t \vert x_{t-1})$, emission probabilities $p(y_j \vert x_i)$, observed sequence $y_{1:t}$
for $t \leftarrow t+1, \cdots, T$:return $\alpha_T$

## Viterbi Algorithm

Viterbi algorithm is to finding the most probable sequence of hidden states, i.e. Viterbi path.

Input:

• The observation space $\pmb{O} = {o_1, o_2, \cdots, o_N }$
• state space $S = {s_1, s_2, \cdots, s_K}$
• initial probability array $\Pi = (\pi_1, \pi_2, \cdots, \pi_K)$ such that $\pi_i = p(x_1 = s_i)$
• observation sequence $Y = {y_1, y_2, \cdots, y_T}$
• transition matrix $A$ with size $K \times K$. $A_{ij}$ stores the transition probability of transition from state $s_i$ to $s_j$
• emission matrix $B$ of size $K \times N$, such that $B_{ij}$ stores the probability of observing $o_j$ from state $s_i$

Output:

• the most likely hidden state sequence $X = (x_1, x_2, \cdots, x_T)$
• Time complexity: $O(|S|^2T)$
1. function Viterbi($O$,$S$, $\pi$, $Y$, $A$, $B$): returns $X$
1. for state i = $1,2,\cdots, K$ do
• Viterbi init step $0 \rightarrow 1$ : $T_1[i,1] \leftarrow \pi_i \cdot B_{iy_1}$
• Backtrack init step $0 \rightarrow 1$ : $T_2[i,1] \leftarrow 0$
2. for observation $j = 2,3,\cdots, T$:
• for state $i = 1,2,\cdots,K$:
• Viterbi records the best prob $T_1[i,j] \leftarrow \max_k \big( T_1[k,j-1] \cdot A_{ki} \cdot B_{iy_j} \big)$
• Backpointer saves the best backpointer $T_2[i,j] \leftarrow \arg\max_k \big( T_1[k,j-1] \cdot A_{ki} \big)$
3. best path prob. $Z_T \leftarrow \arg\max_k\big( T_1[k,T] \big)$
4. best path pointer $x_T \leftarrow s_{z_T}$
5. for $j=T, T-1, \cdots, 2$:
1. do back tracking $z_{j-1} \leftarrow T_2[z_j,j]$
2. Prev hidden state $x_{j-1} \leftarrow s_{z_{j-1}}$
6. return $X= (x_1, x_2, \cdots, x_T)$

## Baum-Welch algorithm

Baum-Welch algorithm, as a speciqal case of EM algorithms, uses the forward-backward algorithm to find the maximum likelihood estimate of the unknown parameters of a HMM given a set of observed feature vectors.

The local maximum $\theta^* = \arg\max_\theta P(Y \vert \theta)$

### Forward process

Let $\alpha_i(t) = P(Y_1=y_1, \cdots, Y_t=y_t, X_t=i \vert \theta)$ represent the probability of the observation of $y_1,y_2,\cdots,y_t$:

### backward process

Let $\beta_i(t)= P(Y_{t+1}=y_{t+1},\cdots,y_T=y_T \vert X_t=i,\theta)$ denote the probability of the sequence $y_{t+1}, \cdots, y_T$ starting at time $t$.

### update

The prob. of being state $i$ at time $t$ given the observation $Y$ and parameters $\theta$

The prob. of being state $i$ and $j$ at times $t$ and $t+1$ respectively given the observation $Y$ and parameter $\theta$

Update the HMM $\theta$:

• $\pi$ at state $i$ at time 1:
• the transition matrix $A$.
• the emission matrix $B$

• where

TBD