Why is NER hard in the industry?
This blog discusses several frequently encountered problems and possible solutions.
Background: Named Entity Recognition (NER) is a fundamental NLP task that underpins information extraction, relation extraction, information retrieval, question answering, and more. The prevalent solution to NER is the BiLSTM-CRF model, but several issues remain in real-world scenarios.
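As a reminder of how the CRF layer in a BiLSTM-CRF decodes: the BiLSTM emits a score per tag for each token, and Viterbi search finds the highest-scoring tag sequence under learned transition scores. A minimal pure-Python sketch with toy scores and a hypothetical B/I/O tag set (not learned values):

```python
# Viterbi decoding, as performed by the CRF layer on top of BiLSTM
# emission scores. All scores below are toy values, not learned ones.
def viterbi(emissions, transitions, tags):
    # emissions: per-token {tag: score} dicts (what the BiLSTM outputs)
    # transitions: {(prev_tag, tag): score} learned by the CRF
    best = [{t: (emissions[0][t], [t]) for t in tags}]
    for i in range(1, len(emissions)):
        layer = {}
        for t in tags:
            layer[t] = max(
                (best[i - 1][p][0] + transitions[(p, t)] + emissions[i][t],
                 best[i - 1][p][1] + [t])
                for p in tags
            )
        best.append(layer)
    # best final (score, path); return the path
    return max(best[-1].values())[1]

tags = ["B", "I", "O"]
# forbid the illegal O -> I transition with a large penalty
transitions = {(p, t): 0.0 for p in tags for t in tags}
transitions[("O", "I")] = -10.0
emissions = [{"B": 2.0, "I": 0.0, "O": 1.0},
             {"B": 0.0, "I": 1.5, "O": 1.0},
             {"B": 0.0, "I": 0.0, "O": 2.0}]
path = viterbi(emissions, transitions, tags)  # ["B", "I", "O"]
```

The transition matrix is what lets the CRF rule out label sequences the per-token classifier alone would happily produce, such as an I tag directly after O.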
Possible problems include:
- High cost of manual labeling.
- Poor generalization and transferability, e.g., transfer between different domains.
- Weak interpretability. In certain domains such as medical NER, a "black box" is not reliable enough for decision making.
- Limited computing resources. E.g., some medical data is confidential and only accessible on a hospital's devices, where there is not enough GPU capacity for computing.
For vertical domain:
- Adopt BiLSTM-CRF models.
- Analyze bad cases.
- Continuously build the in-domain lexicon and improve the pattern-based method.
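The lexicon + pattern approach above can be as simple as a longest-match scan over the token sequence. A sketch, where the lexicon entries and entity types are hypothetical examples:

```python
# Longest-match lexicon tagger: scan tokens and label spans that appear
# in a hand-built in-domain lexicon. Entries here are hypothetical.
LEXICON = {
    ("New", "York"): "LOC",
    ("New", "York", "Times"): "ORG",
    ("aspirin",): "DRUG",
}
MAX_LEN = max(len(k) for k in LEXICON)

def lexicon_tag(tokens):
    labels = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        # try the longest span first so "New York Times" beats "New York"
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            span = tuple(tokens[i:i + n])
            if span in LEXICON:
                ent = LEXICON[span]
                labels[i] = "B-" + ent
                for j in range(i + 1, i + n):
                    labels[j] = "I-" + ent
                i += n
                break
        else:
            i += 1
    return labels

tagged = lexicon_tag(["The", "New", "York", "Times", "reported"])
# ["O", "B-ORG", "I-ORG", "I-ORG", "O"]
```

Such a matcher can serve as a high-precision baseline on its own, or its output can be fed to the neural model as an extra feature.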
For general domain:
- Construct syntactic features to feed into the NER model. For Chinese/Japanese NER, also use word-segmentation features.
- Incorporate lexicons.
NER relies heavily on low-level features. Try to introduce rich features such as characters, bi-grams, lexicons, POS tags, ELMo embeddings, etc. In a vertical domain, we can pretrain an in-domain word embedding or language model. Try to build multiple features from different views.
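One way to realize the multi-view idea is a per-token feature function that combines character-level, bigram, and lexicon-membership views, in the style of classic CRF feature templates. The feature names below are illustrative, not a fixed scheme:

```python
# Multi-view token features: surface/character view, bigram context view,
# and lexicon-membership view. Feature names are illustrative.
def token_features(tokens, i, lexicon):
    tok = tokens[i]
    prev_tok = tokens[i - 1] if i > 0 else "<BOS>"
    return {
        "lower": tok.lower(),        # normalized surface form
        "is_title": tok.istitle(),   # capitalization cue
        "prefix2": tok[:2],          # character-level views
        "suffix2": tok[-2:],
        "bigram": prev_tok + "_" + tok,          # local context view
        "in_lexicon": tok.lower() in lexicon,    # external-knowledge view
    }

feats = token_features(["Paris", "hosts"], 0, {"paris"})
```

Each view can help where the others fail: capitalization generalizes to unseen names, while the lexicon flag injects domain knowledge the training data may lack.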
If certain types of spans are too long, try:
- Use rules to post-process (fix) the predictions.
- Adopt a pointer network + CRF for multi-task learning.
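The rule-based option can be sketched as a post-processing pass that wipes out predicted BIO spans longer than a threshold; the threshold here is a hypothetical hyperparameter you would tune per entity type:

```python
# Post-processing rule: drop predicted BIO spans that exceed a maximum
# length. max_len is a hypothetical, per-domain hyperparameter.
def truncate_long_spans(labels, max_len=3):
    out = list(labels)
    i = 0
    while i < len(labels):
        if labels[i].startswith("B-"):
            j = i + 1
            while j < len(labels) and labels[j].startswith("I-"):
                j += 1
            if j - i > max_len:          # span too long: reset to O
                out[i:j] = ["O"] * (j - i)
            i = j
        else:
            i += 1
    return out

cleaned = truncate_long_spans(["B-X", "I-X", "I-X", "I-X", "O"], max_len=3)
# ["O", "O", "O", "O", "O"]
```

Dropping an implausibly long span trades recall for precision; an alternative rule would truncate the span to `max_len` tokens instead of discarding it.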
Different strategies apply in situations with no computing limits, in the general domain, or for few-shot problems.