20 Jun 2023

[Publication] Unsupervised Cross-lingual Word Embedding Representation for English-isiZulu

Paper by Derwin Ngomane and Vukosi Marivate

Members

Derwin Ngomane, Vukosi Marivate

Abstract

In this study, we investigate the effectiveness of using cross-lingual word embeddings for zero-shot transfer learning between a language with abundant resources, English, and a language with limited resources, isiZulu. IsiZulu is part of the South African Nguni language family, which is characterised by complex agglutinating morphology. We use VecMap, an open-source tool, to obtain cross-lingual word embeddings. To perform an extrinsic evaluation of the embeddings, we train a news classifier on labelled English data and use it to categorise unlabelled isiZulu data through zero-shot transfer learning. Our model achieves a weighted average F1-score of 0.34. Our findings demonstrate that VecMap generates modular word embeddings in the cross-lingual space that have an impact on the downstream classifier used for zero-shot transfer learning.
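The evaluation setup described in the abstract can be illustrated with a small sketch. This is not the authors' code: the 2-D vectors are invented stand-ins for VecMap output, the nearest-centroid classifier is a simple substitute for whatever news classifier the paper uses, and all function names (doc_vector, fit_centroids, predict, weighted_f1) are hypothetical. It only shows the shape of the idea: train on labelled English documents, predict on isiZulu documents that share the same embedding space, then score with a support-weighted F1.

```python
from collections import Counter

# Toy 2-D vectors standing in for cross-lingual embeddings: both vocabularies
# are assumed to live in one shared space. The numbers are invented.
EN_VECS = {"football": (0.9, 0.1), "match": (0.8, 0.2),
           "money": (0.1, 0.9), "bank": (0.2, 0.8)}
ZU_VECS = {"ibhola": (0.85, 0.15), "umdlalo": (0.75, 0.25),
           "imali": (0.15, 0.85), "ibhange": (0.25, 0.75)}


def doc_vector(tokens, vecs):
    """Average the vectors of known tokens; unknown tokens are skipped."""
    known = [vecs[t] for t in tokens if t in vecs]
    if not known:
        return (0.0, 0.0)
    return tuple(sum(c) / len(known) for c in zip(*known))


def fit_centroids(docs, labels, vecs):
    """Nearest-centroid 'classifier': one mean document vector per label."""
    grouped = {}
    for doc, label in zip(docs, labels):
        grouped.setdefault(label, []).append(doc_vector(doc, vecs))
    return {lab: tuple(sum(c) / len(vs) for c in zip(*vs))
            for lab, vs in grouped.items()}


def predict(doc, centroids, vecs):
    """Return the label whose centroid is closest (squared Euclidean)."""
    v = doc_vector(doc, vecs)
    return min(centroids,
               key=lambda lab: sum((a - b) ** 2
                                   for a, b in zip(v, centroids[lab])))


def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights equal to each class's support."""
    total = 0.0
    for label, n in Counter(y_true).items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = n - tp
        denom = 2 * tp + fp + fn
        total += (2 * tp / denom if denom else 0.0) * n
    return total / len(y_true)


# Train on English only, then classify isiZulu with no isiZulu labels.
en_docs = [["football", "match"], ["money", "bank"]]
en_labels = ["sport", "business"]
centroids = fit_centroids(en_docs, en_labels, EN_VECS)

zu_docs = [["ibhola", "umdlalo"], ["imali", "ibhange"]]
zu_true = ["sport", "business"]  # used for evaluation only
zu_pred = [predict(d, centroids, ZU_VECS) for d in zu_docs]
score = weighted_f1(zu_true, zu_pred)
```

The toy data is deliberately well separated, so the sketch classifies both isiZulu documents correctly; it says nothing about the 0.34 reported above, which reflects real embeddings and real news data. If scikit-learn is available, `sklearn.metrics.f1_score(y_true, y_pred, average="weighted")` computes the same support-weighted metric.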

Publications

  • D. Ngomane, R. Mabuya, J. Abbott, and V. Marivate. Unsupervised Cross-lingual Word Embedding Representation for English-isiZulu. Proceedings of the Fourth Workshop on Resources for African Indigenous Languages (RAIL 2023), 2023. [NLP] [Paper URL] [Dataset]