Augmenting the Data: Vukosi Marivate on African Natural Language Processing (NLP) – Critical AI
By Eleni Coundouriotis. The event was organized and co-sponsored by Critical AI @ Rutgers, DIMACS, the Rutgers Department of Computer Science, and the Institute for the Study of Global Racial Justice.
Vukosi Marivate (Chair of Data Science, University of Pretoria) makes a compelling case: we need to double down on the effort to capture “low resource” languages. Too little data exists that can be used for natural language processing (NLP) of African languages, and there is a danger of growing inequity in representation. Without efforts to catch up, “our debt,” he says, will increase; the “bill” will be higher and harder to pay down. Focusing on his native South Africa, but addressing language equity across the continent, Marivate spoke of the pull of English. The vast majority of data available through journalism and social media is in English. Official government business is in English. Afrikaans, a language derived from the Dutch of the colonizers, is the second-best-represented language. Although it is in a much weaker position than English, it too outcompetes nine widely spoken indigenous languages in South Africa.