An introduction to Part-of-Speech (POS) tagging using Hidden Markov Models (HMMs).

## Markov chains

Consider a sequence of state variables $q_1, q_2, \ldots, q_i$. A first-order Markov model instantiates two simplifying assumptions.

1. Markov assumption: the probability of a state depends only on the previous state: $P(q_i \mid q_1, \ldots, q_{i-1}) = P(q_i \mid q_{i-1})$.
2. Output independence: the probability of an output observation $o_i$ depends only on the state $q_i$ that produced it: $P(o_i \mid q_1, \ldots, q_i, o_1, \ldots, o_{i-1}) = P(o_i \mid q_i)$.
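Under the first assumption, the probability of an entire state sequence factors into the initial probability times a product of one-step transitions. A minimal sketch, using a toy two-state weather chain with made-up numbers:

```python
import numpy as np

# Toy first-order Markov chain over two hypothetical states (numbers assumed).
states = ["HOT", "COLD"]
# A[i, j] = P(next state = j | current state = i); each row sums to 1.
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])
# pi[i] = P(first state = i).
pi = np.array([0.6, 0.4])

def sequence_prob(seq):
    """P(q_1, ..., q_n) = pi[q_1] * prod_k A[q_{k-1}, q_k] (Markov assumption)."""
    idx = [states.index(s) for s in seq]
    p = pi[idx[0]]
    for prev, cur in zip(idx, idx[1:]):
        p *= A[prev, cur]
    return p

print(sequence_prob(["HOT", "HOT", "COLD"]))  # 0.6 * 0.7 * 0.3 = 0.126
```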

## Hidden Markov Model

• A set of $N$ states: $Q = q_1 q_2 \cdots q_N$.
• A transition probability matrix $A$, where each $a_{ij}$ represents the probability of moving from state $i$ to state $j$, s.t. $\sum_{j=1}^N a_{ij} = 1$ for every state $i$.
• A sequence of $T$ observations $O = o_1 o_2 \cdots o_T$, each one drawn from a vocabulary $V = v_1, v_2, \cdots, v_V$.
• A sequence of observation likelihoods $B = b_i(o_t)$, a.k.a. emission probabilities, each expressing the probability of an observation $o_t$ being generated from a state $i$.
• An initial probability distribution over states $\pi = \pi_1, \pi_2, \cdots, \pi_N$, where $\pi_i$ is the probability that the Markov chain will start in state $i$. Some states $j$ may have $\pi_j = 0$, meaning that they cannot be initial states. Also, $\sum_{i=1}^N \pi_i = 1$.
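The components above can be packaged as plain arrays. A minimal sketch with a hypothetical two-tag, two-word model (all numbers assumed for illustration):

```python
import numpy as np

# Hypothetical HMM for POS tagging: two states (tags) and a two-word vocabulary.
tags = ["DT", "NN"]            # the set of N states Q
vocab = ["the", "dog"]         # the vocabulary V
A = np.array([[0.1, 0.9],      # transition matrix: A[i, j] = P(tag j | tag i)
              [0.8, 0.2]])
B = np.array([[0.9, 0.1],      # emission matrix: B[i, t] = P(word t | tag i)
              [0.1, 0.9]])
pi = np.array([0.7, 0.3])      # initial distribution over states

# Sanity checks: every distribution in the definition sums to 1.
assert np.allclose(A.sum(axis=1), 1.0)   # each row of A
assert np.allclose(B.sum(axis=1), 1.0)   # each row of B
assert np.isclose(pi.sum(), 1.0)         # the initial distribution
```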

## Training with MLE

• The initial probability distribution $\pi$: $\pi_i$ is estimated as the fraction of training sentences whose first tag is $i$.

• The transition probability matrix $A$ contains $P(t_i \vert t_{i-1}) = \frac{C(t_{i-1}, t_i)}{C(t_{i-1})}$,

where $C(t_{i-1}, t_i)$ is the count of the tag bigram $(t_{i-1}, t_i)$ and $C(t_{i-1})$ is the count of the bigram's first POS tag.

• The emission probability matrix $B$ contains $P(w_i \vert t_i) = \frac{C(t_i, w_i)}{C(t_i)}$, where $C(t_i, w_i)$ is the count of word $w_i$ occurring with tag $t_i$.
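The MLE estimates above are just ratios of counts gathered from a tagged corpus. A minimal sketch over a tiny hand-made corpus (the sentences and tags are assumptions for illustration):

```python
from collections import Counter

# Tiny hypothetical tagged corpus: each sentence is a list of (word, tag) pairs.
corpus = [
    [("the", "DT"), ("dog", "NN"), ("barks", "VB")],
    [("the", "DT"), ("cat", "NN"), ("sleeps", "VB")],
]

init = Counter()       # counts of sentence-initial tags
trans = Counter()      # C(t_{i-1}, t_i): tag-bigram counts
emit = Counter()       # C(t_i, w_i): (tag, word) counts
tag_count = Counter()  # C(t_i): unigram tag counts

for sent in corpus:
    init[sent[0][1]] += 1
    for (w, t) in sent:
        tag_count[t] += 1
        emit[(t, w)] += 1
    for (_, t_prev), (_, t) in zip(sent, sent[1:]):
        trans[(t_prev, t)] += 1

def P_init(t):          # pi_t: fraction of sentences starting with tag t
    return init[t] / len(corpus)

def P_trans(t_prev, t):  # P(t | t_prev) = C(t_prev, t) / C(t_prev)
    return trans[(t_prev, t)] / tag_count[t_prev]

def P_emit(t, w):        # P(w | t) = C(t, w) / C(t)
    return emit[(t, w)] / tag_count[t]

print(P_trans("DT", "NN"))  # C(DT, NN) / C(DT) = 2/2 = 1.0
print(P_emit("NN", "dog"))  # C(NN, dog) / C(NN) = 1/2 = 0.5
```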

## Decoding

Given as input an HMM $\lambda = (A, B)$ and a sequence of observations $O = o_1, o_2, \ldots, o_T$, find the most probable sequence of states $Q = q_1 q_2 q_3 \ldots q_T$.

### Viterbi algorithm 
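A minimal sketch of the Viterbi algorithm: dynamic programming over a $T \times N$ trellis, where each cell holds the best path probability ending in that state at that time, plus a backpointer for recovering the path. The two-tag model used to exercise it is a hypothetical example:

```python
import numpy as np

def viterbi(obs, pi, A, B):
    """Return the most probable state sequence for observation indices `obs`.

    pi: (N,) initial probabilities; A: (N, N) transitions; B: (N, V) emissions.
    """
    N, T = len(pi), len(obs)
    v = np.zeros((T, N))         # v[t, s] = best path probability ending in s at t
    bp = np.zeros((T, N), int)   # backpointer to the best previous state
    v[0] = pi * B[:, obs[0]]     # initialization step
    for t in range(1, T):        # recursion step
        for s in range(N):
            scores = v[t - 1] * A[:, s] * B[s, obs[t]]
            bp[t, s] = scores.argmax()
            v[t, s] = scores.max()
    # Termination: start from the best final state, then follow backpointers.
    path = [int(v[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(bp[t, path[-1]]))
    return path[::-1]

# Toy decoding with an assumed DT/NN model and vocabulary ["the", "dog"].
tags = ["DT", "NN"]
pi = np.array([0.7, 0.3])
A = np.array([[0.1, 0.9], [0.8, 0.2]])
B = np.array([[0.9, 0.1], [0.1, 0.9]])
print([tags[s] for s in viterbi([0, 1], pi, A, B)])  # ['DT', 'NN']
```

This sketch runs in $O(T N^2)$ time; real taggers typically work in log space to avoid underflow on long sentences.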