Dynamic Programming in NLP

Dynamic Programming (DP) is ubiquitous in NLP, underlying algorithms such as Minimum Edit Distance, Viterbi decoding, the forward/backward algorithm, and the CKY algorithm.

Minimum Edit Distance

Minimum Edit Distance (Levenshtein distance) is a string metric for measuring the difference between two sequences.

Given two strings $a$ and $b$ (with lengths $|a|$ and $|b|$ respectively), define $\text{D}(i, j)$ as the edit distance between $a[1 \cdots i]$ and $b[1 \cdots j]$, i.e. the first $i$ chars of $a$ and the first $j$ chars of $b$. The edit distance between the two strings is then $\text{D}(|a|, |b|)$.

Let the costs of insertion, deletion, and substitution be 1, 1, and 2, respectively.

That is,

$$
\text{D}(i, j) = \min \big( \text{D}(i-1, j) + 1,\; \text{D}(i, j-1) + 1,\; \text{D}(i-1, j-1) + \text{sub}(i, j) \big), \quad
\text{sub}(i, j) = \begin{cases} 0 & \text{if } a[i] = b[j] \\ 2 & \text{otherwise} \end{cases}
$$

with base cases $\text{D}(i, 0) = i$ and $\text{D}(0, j) = j$.

Algorithms in Python:

class Solution:
    def minDistance(self, word1, word2):
        """
        Levenshtein distance
        :type word1: str
        :type word2: str
        :rtype: int
        """
        m, n = len(word1) + 1, len(word2) + 1
        dp = [[0 for _ in range(n)] for _ in range(m)]
        for i in range(m):
            dp[i][0] = i
        for j in range(n):
            dp[0][j] = j
        for i in range(1, m):
            for j in range(1, n):
                # implementation 1
                # --------------------------
                # if word1[i - 1] == word2[j - 1]:
                #     sub_cost = 0
                # else:
                #     sub_cost = 2
                # dp[i][j] = min(dp[i - 1][j] + 1, dp[i][j - 1] + 1, dp[i - 1][j - 1] + sub_cost)
                # implementation 2
                # ===========================
                if word1[i - 1] == word2[j - 1]:
                    dp[i][j] = dp[i - 1][j - 1]
                else:
                    dp[i][j] = min(dp[i - 1][j] + 1, dp[i][j - 1] + 1, dp[i - 1][j - 1] + 2)
        return dp[m - 1][n - 1]
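
A quick usage check of the class above, using the classic "intention"/"execution" textbook example (with substitution cost 2, the distance is 8):

# Example: 1 deletion + 1 insertion + 3 substitutions = 1 + 1 + 3 * 2 = 8
print(Solution().minDistance("intention", "execution"))  # -> 8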


DP in HMMs

Hidden Markov Models (HMMs) describe the joint probability of their hidden (unobserved) and observed discrete random variables. They rely on the Markov assumption and the output independence assumption.

  • The observation sequence is $Y = (y_1, y_2, \cdots, y_T)$, where each $y_t$ takes one of $N$ possible observation symbols $\{o_1, \cdots, o_N\}$.
  • Let $X_t$ be a discrete random variable with $K$ possible states. The transition matrix $A = \{a_{ij}\} \in \mathbb{R}^{K \times K}$ is assumed time-independent, with $a_{ij} = P(X_{t+1} = j \mid X_t = i)$.
  • The initial state distribution ($t=1$) is $\pi_i = P(X_1 = i)$.
  • The output/emission matrix is $B = \{b_i(o_k)\} \in \mathbb{R}^{K \times N}$, with $b_i(o_k) = P(y_t = o_k \mid X_t = i)$.

Thus the hidden Markov chain can be defined by $\theta = (A, B, \pi)$.
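
For concreteness, here is a toy parameterization in NumPy; the numbers, variable names, and the observation sequence `obs` are made up for illustration and are reused in the sketches later in this post.

import numpy as np

# Toy 2-state, 3-symbol HMM, theta = (A, B, pi); values are illustrative only.
pi = np.array([0.6, 0.4])          # initial state distribution pi_i = P(X_1 = i)
A = np.array([[0.7, 0.3],          # A[i, j] = P(X_{t+1} = j | X_t = i)
              [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1],     # B[i, k] = P(y_t = o_k | X_t = i)
              [0.1, 0.3, 0.6]])
obs = [0, 1, 2, 1]                 # an observed symbol-index sequence y_{1:T}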


Forward algorithm

The forward algorithm is used to calculate the "belief state": the probability of the hidden state at time $t$ given the history of evidence, i.e. $P(X_t \mid y_{1:t})$.[1] Such a process is also called filtering.

  • The goal is to compute the joint probability $P(X_t, y_{1:t})$, which naively requires marginalizing over all possible state sequences $x_{1:t-1}$, the number of which grows exponentially with $t$.
  • The conditional independence properties of the HMM allow the calculation to be performed recursively.
  • Time complexity: $O(K^2T)$

Forward algorithm:

  • Define the forward probability $\alpha_i(t) = P(y_{1:t}, X_t = i)$.
  • Initialization: given the transition probabilities $A$, emission probabilities $B$, and the observed sequence $y_{1:T}$, set $\alpha_i(1) = \pi_i \, b_i(y_1)$ for $i = 1, \cdots, K$.
  • Recursion: for $t \leftarrow 2, \cdots, T$: $\alpha_i(t) = b_i(y_t) \sum_{j=1}^{K} \alpha_j(t-1) \, a_{ji}$
  • Termination: return $P(y_{1:T}) = \sum_{i=1}^{K} \alpha_i(T)$ (see the sketch below).
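
A minimal NumPy sketch of the forward pass under the notation above, reusing the toy `pi`, `A`, `B`, and `obs` defined earlier (the function name and array layout are my own choices; no scaling is applied, so very long sequences would underflow):

import numpy as np

def forward(obs, pi, A, B):
    """Forward algorithm.

    Returns alpha of shape (T, K), where alpha[t, i] is the forward
    probability of being in state i after seeing obs[:t+1], and the
    sequence likelihood P(y_{1:T}).
    """
    T, K = len(obs), len(pi)
    alpha = np.zeros((T, K))
    alpha[0] = pi * B[:, obs[0]]                      # initialization at t = 1
    for t in range(1, T):
        alpha[t] = B[:, obs[t]] * (alpha[t - 1] @ A)  # recursion, O(K^2) per step
    return alpha, alpha[-1].sum()                     # termination: marginal likelihood

alpha, likelihood = forward(obs, pi, A, B)            # likelihood of the toy sequence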

Viterbi Algorithm

The Viterbi algorithm finds the most probable sequence of hidden states, i.e. the Viterbi path.[2]

Input:

  • The observation space $O = \{o_1, o_2, \cdots, o_N\}$
  • The state space $S = \{s_1, s_2, \cdots, s_K\}$
  • The initial probability array $\pi = (\pi_1, \cdots, \pi_K)$ such that $\pi_i = P(x_1 = s_i)$
  • The observation sequence $Y = (y_1, y_2, \cdots, y_T)$
  • The transition matrix $A$ of size $K \times K$, where $A_{ij}$ stores the transition probability from state $s_i$ to $s_j$
  • The emission matrix $B$ of size $K \times N$, such that $B_{ij}$ stores the probability of observing $o_j$ from state $s_i$

Output:

  • the most likely hidden state sequence $X = (x_1, x_2, \cdots, x_T)$
  • Time complexity: $O(|S|^2T)$
  1. function Viterbi($O$, $S$, $\pi$, $Y$, $A$, $B$): returns $X$
    1. for each state $i = 1, 2, \cdots, K$ do
      • Viterbi init step: $T_1[i, 1] \leftarrow \pi_i \, B_{i y_1}$
      • Backpointer init step: $T_2[i, 1] \leftarrow 0$
    2. for each observation $j = 2, 3, \cdots, T$:
      • for each state $i = 1, 2, \cdots, K$:
        • Viterbi records the best prob.: $T_1[i, j] \leftarrow \max_{k} \big( T_1[k, j-1] \, A_{ki} \, B_{i y_j} \big)$
        • Backpointer saves the best predecessor: $T_2[i, j] \leftarrow \operatorname{arg\,max}_{k} \big( T_1[k, j-1] \, A_{ki} \, B_{i y_j} \big)$
    3. best path prob.: $P^* \leftarrow \max_{k} T_1[k, T]$
    4. best path pointer: $z_T \leftarrow \operatorname{arg\,max}_{k} T_1[k, T]$, and $x_T \leftarrow s_{z_T}$
    5. for $j = T, T-1, \cdots, 2$:
      1. do back tracking: $z_{j-1} \leftarrow T_2[z_j, j]$
      2. prev hidden state: $x_{j-1} \leftarrow s_{z_{j-1}}$
    6. return $X = (x_1, x_2, \cdots, x_T)$ (see the sketch below)
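
A compact NumPy sketch of the pseudocode above, using the same toy parameters (the function name and the `T1`/`T2` array layout are my own choices):

import numpy as np

def viterbi(obs, pi, A, B):
    """Most likely hidden state sequence for an observation sequence.

    obs : observation symbol indices, length T
    pi  : initial distribution, shape (K,)
    A   : transition matrix, shape (K, K)
    B   : emission matrix, shape (K, N)
    """
    T, K = len(obs), len(pi)
    T1 = np.zeros((K, T))              # best path probability ending in state i at step j
    T2 = np.zeros((K, T), dtype=int)   # backpointers to the best previous state
    T1[:, 0] = pi * B[:, obs[0]]
    for j in range(1, T):
        # scores[k, i] = T1[k, j-1] * A[k, i] * B[i, obs[j]]
        scores = T1[:, j - 1, None] * A * B[:, obs[j]]
        T1[:, j] = scores.max(axis=0)
        T2[:, j] = scores.argmax(axis=0)
    # backtrack from the best final state
    z = np.zeros(T, dtype=int)
    z[-1] = T1[:, -1].argmax()
    for j in range(T - 1, 0, -1):
        z[j - 1] = T2[z[j], j]
    return z, T1[:, -1].max()

best_path, best_prob = viterbi(obs, pi, A, B)   # indices of the most likely states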

Baum-Welch algorithm

The Baum-Welch algorithm, a special case of the EM algorithm, uses the forward-backward algorithm to find a maximum likelihood estimate of the unknown parameters of an HMM given a set of observed feature vectors.

It converges to a local maximum of the likelihood, $\theta^* = \operatorname{arg\,max}_{\theta} P(Y \mid \theta)$.

Forward process

Let $\alpha_i(t) = P(y_1, \cdots, y_t, X_t = i \mid \theta)$ represent the probability of observing $y_1, \cdots, y_t$ and being in state $i$ at time $t$. It is computed recursively:

$$
\alpha_i(1) = \pi_i \, b_i(y_1), \qquad \alpha_i(t+1) = b_i(y_{t+1}) \sum_{j=1}^{K} \alpha_j(t) \, a_{ji}
$$

Backward process

Let $\beta_i(t) = P(y_{t+1}, \cdots, y_T \mid X_t = i, \theta)$ denote the probability of the ending partial sequence $y_{t+1}, \cdots, y_T$ given the starting state $i$ at time $t$. It is computed recursively:

$$
\beta_i(T) = 1, \qquad \beta_i(t) = \sum_{j=1}^{K} \beta_j(t+1) \, a_{ij} \, b_j(y_{t+1})
$$

Update

The probability of being in state $i$ at time $t$ given the observation $Y$ and the parameters $\theta$:

$$
\gamma_i(t) = P(X_t = i \mid Y, \theta) = \frac{\alpha_i(t) \, \beta_i(t)}{\sum_{j=1}^{K} \alpha_j(t) \, \beta_j(t)}
$$

The probability of being in states $i$ and $j$ at times $t$ and $t+1$, respectively, given the observation $Y$ and the parameters $\theta$:

$$
\xi_{ij}(t) = P(X_t = i, X_{t+1} = j \mid Y, \theta) = \frac{\alpha_i(t) \, a_{ij} \, b_j(y_{t+1}) \, \beta_j(t+1)}{\sum_{k=1}^{K} \sum_{l=1}^{K} \alpha_k(t) \, a_{kl} \, b_l(y_{t+1}) \, \beta_l(t+1)}
$$

Update the HMM parameters $\theta = (A, B, \pi)$:

  • $\pi$: the expected probability of being in state $i$ at time 1: $\pi_i^* = \gamma_i(1)$
  • the transition matrix $A$: $a_{ij}^* = \frac{\sum_{t=1}^{T-1} \xi_{ij}(t)}{\sum_{t=1}^{T-1} \gamma_i(t)}$, i.e. the expected number of transitions from state $i$ to state $j$, normalized by the expected number of transitions out of state $i$
  • the emission matrix $B$: $b_i^*(o_k) = \frac{\sum_{t=1}^{T} \mathbb{1}_{y_t = o_k} \, \gamma_i(t)}{\sum_{t=1}^{T} \gamma_i(t)}$

    • where $\mathbb{1}_{y_t = o_k}$ is an indicator function equal to 1 if $y_t = o_k$ and 0 otherwise

These steps are repeated until convergence (a sketch of one iteration follows below).
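
A minimal NumPy sketch of one Baum-Welch (EM) iteration under the update rules above, reusing the `forward` function and toy parameters defined earlier (the `backward` and `baum_welch_step` names are mine; no scaling or convergence check is included):

import numpy as np

def backward(obs, A, B):
    """Backward probabilities: beta[t, i] = P(y_{t+1:T} | X_t = i)."""
    T, K = len(obs), A.shape[0]
    beta = np.zeros((T, K))
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    return beta

def baum_welch_step(obs, pi, A, B):
    """One EM iteration; returns updated (pi, A, B)."""
    T, K = len(obs), len(pi)
    alpha, likelihood = forward(obs, pi, A, B)   # forward() from the sketch above
    beta = backward(obs, A, B)
    # E-step: state posteriors gamma (T, K) and transition posteriors xi (T-1, K, K)
    gamma = alpha * beta / likelihood
    xi = (alpha[:-1, :, None] * A[None, :, :]
          * (B[:, obs[1:]].T * beta[1:])[:, None, :]) / likelihood
    # M-step: re-estimate pi, A, B from the expected counts
    new_pi = gamma[0]
    new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    new_B = np.zeros_like(B)
    for k in range(B.shape[1]):
        new_B[:, k] = gamma[np.array(obs) == k].sum(axis=0) / gamma.sum(axis=0)
    return new_pi, new_A, new_B

pi_new, A_new, B_new = baum_welch_step(obs, pi, A, B)   # one update on the toy HMM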

CKY algorithm

TBD

References