POS Tagging with HMMs

An introduction of Part-of-Speech tagging using Hidden Markov Model (HMMs).

Markov chains

Consider a sequence of state variables $q_1, q_2, ..., q_i$ . A first-order Markov model instantiate two simplifying assumptions.

The probability of a state depends only the previous state. $\text{Markov Assumption:}\quad P(q_i=a \vert q_1 ... q_{i-1}) = P(q_i = a \vert q_{i-1})$
The probability of an output observation $o_i$ depends only on the current state that produced the observation $q_i$ : $\text{Output Independence:}\quad P(o_i \vert q_1 \cdots q_i, \cdots, q_T, o_1, \cdots, o_i, \cdots,o_T) = P(o_i \vert q_i)$

Hidden Markov Model

A set of $N$ states: $Q = q_1 q_2 \cdots q_N$
A transition probability matrix $A$ , each $a_{ij}$ representing the probability of moving from state $i$ to state $j$, s.t. $\sum_{j=1}^N a_{ij} =1$ : $A=a_{11} \cdots a_{ij} \cdots a_{NN}$
A sequence of $T$ observations, each one drawn from a vocabulary $V = v_1,v_2,\cdots,v_V$ : $O=o_1 o_2 \cdots o_T$
A sequence of observation likelihoods, a.k.a. emission probabilities, each expressing the probability of an observation $o_t$ , being generated from a state $i$: $B=b_i(o_t)$
An initial probability distribution over states. $\pi_i$ is the prob. that the Markov chain will start in state $i$. Some state $j$ may have $\pi_j = 0$ , meaning that they cannot be initial states. Also, $\sum_{i=1}^n \pi_i = 1$ : $\pi = \pi_1 ,\pi_2,\cdots,\pi_N$

Training with MLE

The initial probability $\pi$:
$\pi(t) = \frac{C(\text{the tag occurs in the sentence beginning})}{\text{count of sentences}}$
The transition probability matrix $A$ contains $P(t_i \vert t_{i-1})$ :
$P(t_i \vert t_{i-1}) = \frac{C(t_{i-1}, t_i)}{C(t_{i-1})}$
where $C(t_{i-1})$ means the count of the first word’s pos tag in bigram tuples.
The emission probability matrix $B$ contains $P(w_i \vert t_i)$ :
$P(w_i \vert t_i) = \frac{C(t_i, w_i)}{C(t_i)}$

Handling OOV problems

Decoding

Given as input an HMM $\lambda = (A, B)$ and a sequence of observations $O = o_1, o_2,...,o_T$ , find the most probable sequence of states $Q=q_1q_2q_3...q_T$

$\hat{t}_1^n = \arg\max_{t_1^n} P(t_1^n \vert w_1^n) \approx \arg\max_{t_1^n} P(w_1^n \vert t_1^n) P(t_1^n) \\ \approx \arg\max_{t_1^n} \prod_{i=1}^n \overbrace{P(t_i \vert t_{i-1})}^{\text{transition}} \overbrace{P(w_i \vert t_i)}^{\text{emission}}$

Yekun's Note