Transfer Learning for Morphological Tagging in Russian


Authors

Ivan Andrianov ; Vladimir Mayorov

Abstract

This paper addresses the task of morphological tagging for Russian. Multiple corpora have been created for this task over years of research. Unfortunately, these corpora often have incompatible annotation guidelines and tag sets, which makes it difficult to use more than one corpus to train machine learning methods. Several attempts have been made to unify guidelines and tag sets manually. The most recent was the MorphoRuEval-2017 evaluation, whose organizers unified the annotations of four corpora. In this paper we propose a morphological tagging method that can be trained on any set of corpora with arbitrary guidelines and tag sets, without any manually provided mapping. The method is based on a transfer learning technique that has been applied successfully in multilingual and cross-domain research. First, we establish two solid neural network-based morphological tagging baselines. Both employ bidirectional LSTM networks but use different word embeddings. A comparative study on multiple corpora shows that the fastText model clearly outperforms the word2vec one. We then extend these baselines with transfer learning. We demonstrate the effectiveness of the transfer learning-based method in two series of experiments. In the first series we use three well-known corpora with incompatible tag sets: RNC and SynTagRus (in its original and Universal Dependencies versions). In the second series we use the four corpora from the MorphoRuEval-2017 evaluation, whose tag sets were unified beforehand by the organizers. In both cases, transfer learning improves the performance of the method as it is trained on more corpora. In the second case it even surpasses traditional joint training.

Publication

Ivannikov ISPRAS Open Conference (ISPRAS)

DOI: 10.1109/ISPRAS.2017.00017

Research group

Information Systems
