Yekun's Note

Machine learning notes and writeup.


Data Augmentation for Deep Learning Models

Neural networks require large-scale datasets for training. However, collecting enough data is often expensive. One approach to this problem is data augmentation, which means artificially increasing the number of data points.

Motivation

Data augmentation works when we can identify appropriate invariances that the model should possess, i.e. transformations of the input that should not change its label.


Image-recognition

  • rescaling or applying affine distortions to images (translating, scaling, rotating, or flipping the input image)
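As a minimal sketch of the transformations listed above, the following pure-Python functions flip, translate, and rotate a grayscale image stored as a list of pixel rows. The function names, the zero-fill behaviour at the borders, and the `augment` policy (random flip plus a shift of at most one pixel) are assumptions made for illustration; in practice an image library would normally be used instead.

```python
import random


def flip_horizontal(img):
    """Mirror each row left-to-right."""
    return [row[::-1] for row in img]


def translate(img, dx, dy, fill=0):
    """Shift the image by (dx, dy) pixels, padding exposed pixels with `fill`."""
    h, w = len(img), len(img[0])
    out = [[fill] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w:
                out[ny][nx] = img[y][x]
    return out


def rotate90(img):
    """Rotate the image 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]


def augment(img, rng):
    """Hypothetical policy: random horizontal flip, then a random 1-pixel shift."""
    if rng.random() < 0.5:
        img = flip_horizontal(img)
    dx = rng.randint(-1, 1)
    dy = rng.randint(-1, 1)
    return translate(img, dx, dy)


if __name__ == "__main__":
    image = [[1, 2, 3],
             [4, 5, 6],
             [7, 8, 9]]
    rng = random.Random(0)
    # Each call yields a new variant that keeps the original label.
    for _ in range(3):
        print(augment(image, rng))
```

Each augmented copy is paired with the label of the original image, which is only valid if the label really is invariant under the chosen transformation (for example, flipping a digit "6" vertically would not preserve its class).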

Speech-recognition


Text-classification

Unlike images and speech, text does not lend itself to augmentation by signal transformation, because the exact order of characters can carry strict syntactic and semantic meaning.

Best way:

  • having humans rephrase the sentences -> unrealistic and expensive

Choices

  • synonym replacement: replace words or phrases with their synonyms [1]
  • back-translation: translate [English - 'intermediate language' - English]. [2]
  • data noising: [3]
  • contextual augmentation: [5]
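Two of the choices above can be sketched without any external model. The code below shows synonym replacement with a tiny hand-written thesaurus, and a simplified word-level noising step that replaces tokens with a placeholder, loosely in the spirit of [3]. The `SYNONYMS` table, the function names, and the parameters `p` and `gamma` are hypothetical and chosen only for illustration. Back-translation and contextual augmentation need a translation model or a language model and are not shown.

```python
import random

# Hypothetical toy thesaurus; a real system would use a lexical resource.
SYNONYMS = {
    "quick": ["fast", "speedy"],
    "happy": ["glad", "joyful"],
}


def synonym_replace(tokens, synonyms, p, rng):
    """Replace each token that has synonyms with a random one, with probability p."""
    out = []
    for tok in tokens:
        candidates = synonyms.get(tok.lower())
        if candidates and rng.random() < p:
            out.append(rng.choice(candidates))
        else:
            out.append(tok)
    return out


def blank_noise(tokens, gamma, rng, blank="_"):
    """Replace each token with a placeholder with probability gamma."""
    return [blank if rng.random() < gamma else tok for tok in tokens]


if __name__ == "__main__":
    rng = random.Random(0)
    sentence = "the quick dog looks happy".split()
    print(synonym_replace(sentence, SYNONYMS, p=0.5, rng=rng))
    print(blank_noise(sentence, gamma=0.2, rng=rng))
```

Both functions keep the sentence length and word order intact, which matters given that reordering text can change its meaning. Whether a replaced synonym truly preserves the label still depends on the task and should be checked on real data.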

References


  1. Zhang, X., & LeCun, Y. (2015). Text understanding from scratch. arXiv preprint arXiv:1502.01710.
  2. Wieting, J., Mallinson, J., & Gimpel, K. (2017). Learning Paraphrastic Sentence Embeddings from Back-Translated Bitext. arXiv preprint arXiv:1706.01847.
  3. Xie, Z., Wang, S. I., Li, J., Lévy, D., Nie, A., Jurafsky, D., & Ng, A. Y. (2017). Data noising as smoothing in neural network language models. arXiv preprint arXiv:1703.02573.
  4. fast.ai forum: data augmentation for nlp
  5. Kobayashi, S. (2018). Contextual Augmentation: Data Augmentation by Words with Paradigmatic Relations. arXiv preprint arXiv:1805.06201.