 # Overview

Named Entity Recognition (NER) labels sequences of words in a text that are the names of things, such as person and company names, or gene and protein names.

Approaches:

• Statistical ML methods: HMM, MEMM, CRF
• Deep learning methods: RNN-CRF, CNN-CRF

## Hidden Markov Model

An HMM consists of a discrete-time, discrete-state Markov chain with hidden states $z_t \in \{1, \dots, K\}$, plus an observation model $p(\mathbf{x}_t \mid z_t)$.

The joint distribution:

$$p(z_{1:T}, \mathbf{x}_{1:T}) = p(z_1)\, p(\mathbf{x}_1 \mid z_1) \prod_{t=2}^{T} p(z_t \mid z_{t-1})\, p(\mathbf{x}_t \mid z_t)$$

HMM is a generative model that optimises the likelihood $P(W \mid T)$, where $W$ is the word sequence and $T$ the tag sequence. We estimate the posterior by combining the likelihood with the prior $P(T)$: $P(T \mid W) \propto P(W \mid T)\, P(T)$.

Each HMM state conditions only on the previous state.
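Decoding the best tag sequence under an HMM uses the Viterbi algorithm. A minimal pure-Python sketch, with invented toy probabilities (the states, words, and numbers are illustrative only, not from any real corpus):

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Find the most probable hidden state sequence for `obs`."""
    # v[t][s]: probability of the best path ending in state s at time t
    v = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        v.append({})
        back.append({})
        for s in states:
            prev, p = max(((q, v[t - 1][q] * trans_p[q][s]) for q in states),
                          key=lambda x: x[1])
            v[t][s] = p * emit_p[s][obs[t]]
            back[t][s] = prev
    # Backtrace from the best final state
    last = max(states, key=lambda s: v[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))

# Toy example: tag "John works" with PER vs O (all numbers invented)
states = ["PER", "O"]
start_p = {"PER": 0.4, "O": 0.6}
trans_p = {"PER": {"PER": 0.3, "O": 0.7}, "O": {"PER": 0.2, "O": 0.8}}
emit_p = {"PER": {"John": 0.8, "works": 0.1},
          "O":   {"John": 0.1, "works": 0.9}}
print(viterbi(["John", "works"], states, start_p, trans_p, emit_p))
# → ['PER', 'O']
```

The same dynamic program reappears below for MEMM and CRF decoding; only the per-step score changes.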

## MEMM

Problems of HMM:
An HMM relies on a transition probability matrix and an emission probability matrix, and it is hard to encode richer knowledge into these two matrices. If we want to include many knowledge sources, such as capitalisation, the presence of hyphens, or word endings, it is not easy to fit probabilities like P(capitalisation | tag), P(hyphen | tag), or P(suffix | tag) into an HMM-style model.

Solution ideas:

1. Strict left-to-right word classifier. Flaw: it only uses information from the left context; unlike Viterbi decoding in an HMM, it cannot consider future sequence information.

2. MEMM

MEMM is a discriminative model: it computes the posterior $P(T|W)$ directly. Each MEMM state is conditioned on the previous state and the current observation.

Decoding: Viterbi algorithm.

### Comparison with HMM

Unlike the HMM, the MEMM can condition on any useful feature of the input observation. In the HMM this was not possible, because the HMM is likelihood-based and would have needed to compute the likelihood of each feature of the observation.

$Q$ denotes the state sequence, $O$ denotes the observation sequence.

HMM: $\hat{Q} = \underset{Q}{\arg\max}\, P(Q \mid O) = \underset{Q}{\arg\max}\, P(O \mid Q)\, P(Q) = \underset{Q}{\arg\max} \prod_i P(o_i \mid q_i) \prod_i P(q_i \mid q_{i-1})$

MEMM: $\hat{Q} = \underset{Q}{\arg\max}\, P(Q \mid O) = \underset{Q}{\arg\max} \prod_i P(q_i \mid q_{i-1}, o_i)$

To estimate the individual probability of a transition from a state $q'$ to a state $q$ producing an observation $o$, build a MaxEnt model:

$$P(q \mid q', o) = \frac{1}{Z(o, q')} \exp\Big(\sum_i w_i f_i(o, q)\Big)$$

## CRF

### linear-chain CRFs

Modelling the distribution of a set of outputs $y_{1:T}$ given an input vector $\mathbf{x}$:

$$p(y_{1:T} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \prod_{t=1}^{T} \psi_t(y_t, y_{t-1}, \mathbf{x})$$

where $\lambda$ are the free parameters of the potentials, with the common form:

$$\psi_t(y_t, y_{t-1}, \mathbf{x}) = \exp\Big(\sum_k \lambda_k f_{k,t}(y_t, y_{t-1}, \mathbf{x})\Big)$$

where $f_{k,t}(y_t, y_{t-1}, \mathbf{x})$ are feature functions. Given a set of input-output sequence pairs $\{\mathbf{x}^n, y_{1:T}^n\}$, $n = 1, 2, \dots, N$, we can learn the parameters $\lambda$ by maximum likelihood. Under the i.i.d. data assumption, the log-likelihood is:

$$\ell(\lambda) = \sum_{n=1}^{N} \log p(y_{1:T}^n \mid \mathbf{x}^n, \lambda) = \sum_{n=1}^{N} \Big( \sum_{t} \sum_{k} \lambda_k f_{k,t}(y_t^n, y_{t-1}^n, \mathbf{x}^n) - \log Z(\mathbf{x}^n) \Big)$$
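To make the objective concrete, here is a toy linear-chain CRF in plain Python where the partition function $Z(\mathbf{x})$ is computed by brute-force enumeration (feasible only at toy scale; real implementations use the forward algorithm). The indicator features and weights $\lambda$ are invented for illustration:

```python
import itertools
import math

LABELS = ["B", "I", "O"]

def score(ys, xs, lam):
    """Unnormalised score: sum over positions of weighted features
    f_{k,t}(y_t, y_{t-1}, x)."""
    s = 0.0
    for t, y in enumerate(ys):
        prev = ys[t - 1] if t > 0 else "<s>"
        # two hypothetical families of indicator features:
        s += lam.get(("emit", y, xs[t]), 0.0)   # label/word
        s += lam.get(("trans", prev, y), 0.0)   # label/label
    return s

def log_likelihood(ys, xs, lam):
    """log p(y|x) = score(y, x) - log Z(x), with Z by enumeration."""
    log_z = math.log(sum(math.exp(score(cand, xs, lam))
                         for cand in itertools.product(LABELS,
                                                       repeat=len(xs))))
    return score(ys, xs, lam) - log_z

lam = {("emit", "B", "John"): 2.0, ("emit", "O", "works"): 1.5,
       ("trans", "B", "I"): 1.0}
ll = log_likelihood(["B", "O"], ["John", "works"], lam)
assert ll < 0  # a log-probability of one sequence among many
```

Summing $\exp(\text{log-likelihood})$ over every candidate label sequence gives exactly 1, which is a handy sanity check that the normalisation is right.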

Cons: relies heavily on hand-engineered features.

## Bi-LSTM-CRF

Intuition:
Capture information from both the history (forward LSTM) and the future (backward LSTM) using a bi-LSTM. Also use sentence-level tag information: the forward and backward LSTM/GRU hidden states are concatenated at each time step and fed into a CRF layer, so each concatenated output can be regarded as a sentence-level output.

Pros: More robust, less affected by the removal of engineering features

Cons: RNNs are less GPU-efficient than CNNs, so training is slower.

## Stack-LSTM

Char-based word representations capture word-internal information, whilst pretrained word embeddings hold contextual information.
Concatenate the char-based word representation (built with a bi-LSTM) and the pretrained word embedding to form the final embedding.

• Bi-LSTM + CRF

Use the aforementioned concatenated word embeddings as the input word embeddings.

## ID-CNN-CRF

### ID-CNNs (Iterated Dilated ConvNets)

Pros: fast, resource-efficient

Dilation convolutions:

• context does not need to be consecutive
• By stacking layers of dilated convolutions with exponentially increasing dilation width, we can expand the effective input width to cover the entire length of most sequences using only a few layers: the effective input width for a token at layer $l$ is $2^{l+1} - 1$.

• Let $x_t$ denote the token representation at timestep $t$ and $W_c$ the convolution filter over a sliding window of width $r$ tokens on either side of each token; the conventional convolution output $c_t$ is:
$c_t = W_c \bigoplus_{k=0}^{r} x_{t \pm k}$, where $\bigoplus$ is vector concatenation.
• A dilated convolution is defined over a wider effective input width by skipping over $\delta$ inputs, where $\delta$ is the dilation width: $c_t = W_c \bigoplus_{k=0}^{r} x_{t \pm k\delta}$
When $\delta$ is 1, the dilated convolution is equivalent to a simple convolution.
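A small pure-Python sketch of the window arithmetic above (single channel, no learned weights; a real ID-CNN applies a learned filter $W_c$ to the gathered window):

```python
def dilated_window(xs, t, r, delta):
    """Collect x_{t ± k*delta} for k = 0..r (the inputs a dilated filter
    sees at position t), zero-padding outside the sequence."""
    idx = sorted(t + k * delta for k in range(-r, r + 1))
    return [xs[i] if 0 <= i < len(xs) else 0 for i in idx]

xs = list(range(10))  # token "embeddings" as plain ints
# delta = 1 reduces to an ordinary convolution window
assert dilated_window(xs, 4, r=1, delta=1) == [3, 4, 5]
# delta = 2 skips every other input
assert dilated_window(xs, 4, r=1, delta=2) == [2, 4, 6]

def receptive_field(num_layers):
    """Effective input width after stacking width-3 layers with
    dilations 1, 2, 4, ...: equals 2^(l+1) - 1 at layer l."""
    width = 1
    for l in range(num_layers):
        width += 2 * 2 ** l  # each layer adds delta of context per side
    return width

assert [receptive_field(l) for l in (1, 2, 3)] == [3, 7, 15]
```

The assertions confirm both claims in the bullets: dilation 1 is a plain convolution, and the receptive field grows exponentially with depth.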

Model: Stacked dilated CNNs 

Denote the $j$-th dilated conv layer of dilation width $\delta$ by $D_{\delta}^{(j)}$. The first layer transforms the input, and each of the $L_c$ stacked layers applies a dilated convolution of width $2^{j-1}$ to the output of the previous layer, with a nonlinearity $r$ (ReLU):

$$\mathbf{c}_t^{(0)} = D_1^{(0)} \mathbf{x}_t, \qquad \mathbf{c}_t^{(j)} = r\big(D_{2^{j-1}}^{(j)} \mathbf{c}_t^{(j-1)}\big), \quad j = 1, \dots, L_c$$

Then add a final dilation-1 layer to the stack:

$$\mathbf{c}_t^{(L_c+1)} = r\big(D_1^{(L_c+1)} \mathbf{c}_t^{(L_c)}\big)$$

Finally, apply a simple affine transformation $W_o$ to the final representation of each token $x_t$:

$$\mathbf{h}_t = W_o \mathbf{c}_t^{(L_c+1)}$$

ID-CNNs generate a token-wise representation for each token, corresponding to the per-token logits of an RNN.

### Comparison with bi-LSTM-CRF

The bi-LSTM or ID-CNN calculates the logits for each token, whilst the CRF layer captures the transition probabilities of the label sequence; i.e. either use the NN to predict each token's label independently, or apply Viterbi inference in a chain-structured graphical model.

Let $s$ denote the input sequence.

• Char-level: $s = c_1, c_2, ..., c_m$, where $c_j$ denotes the $j$-th character.
• Word-level: $s = w_1, w_2, ..., w_n$, where $w_i$ denotes the $i$-th token in the word sequence.

## Char-based LSTM-CRF model

• $\mathbf{x}_j^c = \mathbf{e}^c(c_j)$, where $\mathbf{e}^c$ denotes a char embedding lookup table.

• Bi-LSTM output: $\mathbf{h_j^c} = [\overrightarrow{\mathbf{h}_j^c} ; \overleftarrow{\mathbf{h}_j^c}]$

• The CRF model uses $\mathbf{h}_1^c, \mathbf{h}_2^c, \dots, \mathbf{h}_m^c$ for sequence labelling.

### Char + bi-char

Concatenate char embeddings with char-bigram embeddings:

$$\mathbf{x}_j^c = [\mathbf{e}^c(c_j);\, \mathbf{e}^b(c_j, c_{j+1})]$$

where $\mathbf{e}^b$ denotes a char-bigram lookup table.

### Char + softword

• Add segmentation as soft features by concatenating a segmentation label embedding to the char embedding:

$$\mathbf{x}_j^c = [\mathbf{e}^c(c_j);\, \mathbf{e}^s(\mathrm{seg}(c_j))]$$

where $\mathbf{e}^s$ is a segmentation-label lookup table and $\mathrm{seg}(c_j)$ is the segmentation label on the char $c_j$ given by a word segmentor with the BMES scheme.

## Word-based model

The input for each word $w_i$ is $\mathbf{x}_i^w = \mathbf{e}^w(w_i)$, where $\mathbf{e}^w$ denotes the word embedding lookup table.
Then feed word embeddings into bi-directional LSTMs.

### Integrating char-representations

Concatenate the char-based word representation $\mathbf{x}_i^c$ (from char-LSTMs or char-CNNs) with the pretrained word embedding $\mathbf{e}^w(w_i)$: $\mathbf{x}_i^w = [\mathbf{e}^w(w_i); \mathbf{x}_i^c]$

1. Word + char bi-LSTMs: bi-LSTMs with chars as input at each time step.

2. Word + char uni-LSTM

3. Word + char CNNs: use the char sequence of each word to obtain its char representation $\mathbf{x}_i^c$, with a convolution of kernel size $k = 3$ followed by max-pooling over positions.
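As a sketch of the char-CNN variant, a width-3 convolution followed by max-pooling over positions, in plain Python. The embeddings and filter values are invented; a real model learns the filter and uses many more channels:

```python
def char_cnn(char_embs, filt):
    """Slide a width-3 filter over char embeddings (each a list of
    floats), then max-pool each output dimension over positions."""
    k = 3
    conv = []
    for i in range(len(char_embs) - k + 1):
        window = char_embs[i] + char_embs[i + 1] + char_embs[i + 2]
        conv.append([sum(w * x for w, x in zip(row, window))
                     for row in filt])
    # max-pooling over positions -> one fixed-size vector per word
    return [max(col) for col in zip(*conv)]

# toy 2-dim char embeddings for a 4-char word (values invented)
embs = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.2, 0.8]]
filt = [[1, 0, 0, 0, 0, 0],   # picks dim 0 of the left char
        [0, 0, 0, 0, 0, 1]]   # picks dim 1 of the right char
print(char_cnn(embs, filt))
# → [1.0, 0.8]
```

Max-pooling is what makes the output length-independent, so words of any length yield the same-size $\mathbf{x}_i^c$.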

## Lattice model

The lattice model builds on the char-based model:

Char embedding: $\mathbf{x}_j^c = \mathbf{e}^c(c_j)$

Basic recurrent LSTM:

$$\begin{bmatrix} \mathbf{i}^c_j \\ \mathbf{o}^c_j \\ \mathbf{f}^c_j \\ \tilde{\mathbf{c}}^c_j \end{bmatrix} = \begin{bmatrix} \sigma \\ \sigma \\ \sigma \\ \tanh \end{bmatrix} \left( \mathbf{W}^{c\top} \begin{bmatrix} \mathbf{x}^c_j \\ \mathbf{h}^c_{j-1} \end{bmatrix} + \mathbf{b}^c \right)$$

$$\mathbf{c}^c_j = \mathbf{f}^c_j \odot \mathbf{c}^c_{j-1} + \mathbf{i}^c_j \odot \tilde{\mathbf{c}}^c_j$$

$$\mathbf{h}^c_j = \mathbf{o}^c_j \odot \tanh(\mathbf{c}^c_j)$$

where $\mathbf{i}^c_j$, $\mathbf{f}^c_j$ and $\mathbf{o}^c_j$ denote the input, forget and output gates respectively, $\tilde{\mathbf{c}}^c_j$ is the candidate cell, $\mathbf{c}^c_j$ denotes the char cell vector, $\mathbf{h}^c_j$ denotes the hidden vector on each char $c_j$, and $\mathbf{W}^{c\top}$, $\mathbf{b}^c$ are parameters.

Lattice LSTM:

The computation of the cell vector $\mathbf{c}^c_j$ considers lexicon subsequences $w_{b,e}^d$ in the sentence, i.e. character subsequences from index $b$ to $e$ that match words in a lexicon $\mathbb{D}$.

Each subsequence $w_{b,e}^{d}$ is represented as:

$$\mathbf{x}^w_{b,e} = \mathbf{e}^w(w^d_{b,e})$$

where $\mathbf{e}^w$ denotes the word embedding lookup table.

The word cell $\mathbf{c}^w_{b,e}$ represents the recurrent state of $\mathbf{x}^w_{b,e}$ from the beginning of the sentence:

$$\begin{bmatrix} \mathbf{i}^w_{b,e} \\ \mathbf{f}^w_{b,e} \\ \tilde{\mathbf{c}}^w_{b,e} \end{bmatrix} = \begin{bmatrix} \sigma \\ \sigma \\ \tanh \end{bmatrix} \left( \mathbf{W}^{w\top} \begin{bmatrix} \mathbf{x}^w_{b,e} \\ \mathbf{h}^c_b \end{bmatrix} + \mathbf{b}^w \right)$$

$$\mathbf{c}^w_{b,e} = \mathbf{f}^w_{b,e} \odot \mathbf{c}^c_b + \mathbf{i}^w_{b,e} \odot \tilde{\mathbf{c}}^w_{b,e}$$

where $\mathbf{i}^w_{b,e}$ and $\mathbf{f}^w_{b,e}$ are input and forget gates. There is no output gate for word cells, since labelling is performed only at the char level.

Here $\mathbf{c}^w_{b,e}$ serves as an additional recurrent path for information flow into each $\mathbf{c}^c_e$. An additional gate $\mathbf{i}^c_{b,e}$ is applied to each subsequence cell $\mathbf{c}^w_{b,e}$ to control its contribution into $\mathbf{c}^c_e$:

$$\mathbf{i}^c_{b,e} = \sigma \left( \mathbf{W}^{l\top} \begin{bmatrix} \mathbf{x}^c_e \\ \mathbf{c}^w_{b,e} \end{bmatrix} + \mathbf{b}^l \right)$$

The cell vector $\mathbf{c}^c_j$ is then:

$$\mathbf{c}^c_j = \sum_{b \in \{b' \,\mid\, w^d_{b',j} \in \mathbb{D}\}} \boldsymbol{\alpha}^c_{b,j} \odot \mathbf{c}^w_{b,j} + \boldsymbol{\alpha}^c_j \odot \tilde{\mathbf{c}}^c_j$$

where the sum runs over all lexicon words ending at char $c_j$, and the gate values $\mathbf{i}^c_{b,j}$ and $\mathbf{i}^c_j$ are normalised to $\boldsymbol{\alpha}^c_{b,j}$ and $\boldsymbol{\alpha}^c_j$ so that they sum to 1:

$$\boldsymbol{\alpha}^c_{b,j} = \frac{\exp(\mathbf{i}^c_{b,j})}{\exp(\mathbf{i}^c_j) + \sum_{b'} \exp(\mathbf{i}^c_{b',j})}, \qquad \boldsymbol{\alpha}^c_j = \frac{\exp(\mathbf{i}^c_j)}{\exp(\mathbf{i}^c_j) + \sum_{b'} \exp(\mathbf{i}^c_{b',j})}$$
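The gate normalisation can be sketched as a per-dimension softmax over the word-cell gates and the char gate; the gate values below are invented:

```python
import math

def normalise_gates(word_gates, char_gate):
    """Per-dimension softmax over all word-cell gates i^c_{b,j} plus the
    char gate i^c_j, so the weights alpha sum to 1 in each dimension."""
    dims = len(char_gate)
    alphas_w = [[0.0] * dims for _ in word_gates]
    alpha_c = [0.0] * dims
    for d in range(dims):
        z = math.exp(char_gate[d]) + sum(math.exp(g[d])
                                         for g in word_gates)
        for b, g in enumerate(word_gates):
            alphas_w[b][d] = math.exp(g[d]) / z
        alpha_c[d] = math.exp(char_gate[d]) / z
    return alphas_w, alpha_c

# two word cells competing with the char candidate, 2-dim gates
aw, ac = normalise_gates([[0.2, 1.0], [0.5, -0.3]], [1.1, 0.4])
for d in range(2):
    total = ac[d] + sum(a[d] for a in aw)
    assert abs(total - 1.0) < 1e-9  # weights sum to 1 per dimension
```

This is the element-wise analogue of attention: each char position mixes its own candidate cell with every matching lexicon word's cell.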

### CRF decoding

A standard CRF layer is used on top of $\mathbf{h}_1, \mathbf{h}_2, \dots, \mathbf{h}_{\tau}$, where $\tau$ is $m$ for char-based and lattice-based models and $n$ for word-based models. The probability of a label sequence $y = l_1, l_2, \dots, l_{\tau}$ is:

$$P(y \mid s) = \frac{\exp\Big(\sum_i \big(\mathbf{W}_{CRF}^{l_i} \mathbf{h}_i + b_{CRF}^{(l_{i-1}, l_i)}\big)\Big)}{\sum_{y'} \exp\Big(\sum_i \big(\mathbf{W}_{CRF}^{l'_i} \mathbf{h}_i + b_{CRF}^{(l'_{i-1}, l'_i)}\big)\Big)}$$

where $y'$ denotes an arbitrary label sequence, $\mathbf{W}_{CRF}^{l_i}$ is a model parameter specific to $l_i$, and $b_{CRF}^{(l_{i-1}, l_i)}$ is a bias specific to $l_{i-1}$ and $l_i$.
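At toy scale, this label-sequence distribution can be checked by enumerating every $y'$ in the denominator; the parameters $\mathbf{W}_{CRF}$, $b_{CRF}$ and hidden vectors below are invented:

```python
import itertools
import math

LABELS = ["B", "O"]
W = {"B": [1.0, -0.5], "O": [-0.2, 0.3]}            # W_CRF^l per label
b = {("B", "O"): 0.4, ("O", "B"): 0.1, ("B", "B"): -1.0,
     ("O", "O"): 0.6, ("<s>", "B"): 0.0, ("<s>", "O"): 0.0}

def seq_score(y, hs):
    """sum_i ( W^{l_i} . h_i + b^{(l_{i-1}, l_i)} )"""
    s = 0.0
    for i, l in enumerate(y):
        prev = y[i - 1] if i > 0 else "<s>"
        s += sum(w * h for w, h in zip(W[l], hs[i])) + b[(prev, l)]
    return s

def prob(y, hs):
    z = sum(math.exp(seq_score(yp, hs))
            for yp in itertools.product(LABELS, repeat=len(hs)))
    return math.exp(seq_score(y, hs)) / z

hs = [[0.3, 0.7], [0.9, -0.1]]                       # toy hidden vectors
total = sum(prob(yp, hs) for yp in itertools.product(LABELS, repeat=2))
assert abs(total - 1.0) < 1e-9                       # valid distribution
```

In practice the denominator is computed with the forward algorithm and the argmax with Viterbi, both linear in $\tau$.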

Sentence-level log-likelihood loss with $L_2$ regularisation:

$$L = -\sum_{i=1}^{N} \log P(y_i \mid s_i) + \frac{\lambda}{2} \lVert \Theta \rVert^2$$

where $\lambda$ is the $L_2$ regularisation parameter and $\Theta$ represents the parameter set.

## BERT

BERT is a bidirectional Transformer for language modelling, pretrained with masked language modelling (predicting 15% of the tokens in each sentence, masked at random) and next-sentence prediction (to capture information about adjacent sentences).
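A simplified sketch of the masking step. Real BERT replaces 80% of the chosen tokens with [MASK], swaps 10% for random tokens, and leaves 10% unchanged; here, for brevity, every chosen token is masked:

```python
import random

def mask_tokens(tokens, rate=0.15, seed=0):
    """Randomly select `rate` of the positions and replace the tokens
    there with [MASK]; return the masked sequence and target positions
    (the model is trained to predict the originals at those positions)."""
    rng = random.Random(seed)
    n = max(1, round(len(tokens) * rate))
    positions = sorted(rng.sample(range(len(tokens)), n))
    masked = list(tokens)
    for p in positions:
        masked[p] = "[MASK]"
    return masked, positions

toks = "the cat sat on the mat and looked at the open door".split()
masked, pos = mask_tokens(toks)
assert len(pos) == round(len(toks) * 0.15)
assert all(masked[p] == "[MASK]" for p in pos)
```

Because the mask targets can sit anywhere in the sentence, the model must attend bidirectionally, which is what distinguishes this objective from left-to-right language modelling.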

# NER challenges

## Chinese Language

### Word-based approach

• Intuitive pipeline: segmentation → NER
• Suffers from segmentation errors: named entities are a major source of OOV words for the segmentor, and incorrectly segmented entity boundaries lead to NER errors.

### Char-based approach

• Char-based NER outperforms the word-based approach for Chinese NER.
• Cons: explicit word and word-sequence information is not fully exploited.

Solution:

• Lattice LSTM: leverages both char-sequence and lexicon-word information.

# References

1. Jurafsky, D., & Martin, J. H. (2014). Speech and Language Processing (Vol. 3). London: Pearson.
2. Huang, Z., Xu, W., & Yu, K. (2015). Bidirectional LSTM-CRF Models for Sequence Tagging. arXiv preprint arXiv:1508.01991.
3. Strubell, E., Verga, P., Belanger, D., & McCallum, A. (2017). Fast and Accurate Entity Recognition with Iterated Dilated Convolutions. arXiv preprint arXiv:1702.02098.
4. Zhang, Y., & Yang, J. (2018). Chinese NER Using Lattice LSTM. arXiv preprint arXiv:1805.02023.
5. Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv preprint arXiv:1810.04805.
6. Lample, G., Ballesteros, M., Subramanian, S., Kawakami, K., & Dyer, C. (2016). Neural Architectures for Named Entity Recognition. arXiv preprint arXiv:1603.01360.