The ability to communicate and be understood in one’s own language is a prerequisite to digital and societal inclusion. However, a gap in openly accessible datasets outside of a small number of languages has prevented breakthroughs based on natural language processing (NLP) technologies in Africa. Labelled data and speech corpora remain a key element of this gap, as does the availability of corpora that can be used in transfer learning or semi-supervised approaches.
To address this gap and complement a groundswell of community efforts working on NLP for African languages, Lacuna Fund has launched a request for proposals to support open training and evaluation datasets for NLP in underserved languages in sub-Saharan Africa. The RFP is intentionally broad, supporting text, speech, and other critical datasets for upstream and downstream NLP tasks.
The webinar will cover Lacuna Fund’s purpose and goals and further details of the language RFP, followed by a Q&A session.”
About the Lacuna Fund
To complement and expand these efforts, Lacuna Fund hopes to fund the creation, expansion and maintenance of labelled data. Types of datasets we would like to support are listed below, but the RFP is intentionally open, to encourage new and innovative ideas that we may not have identified.
- Benchmarking datasets to enable further NLP tasks in underserved languages.
- Creating new or unlocking existing data for easier inclusion of underserved languages in multilingual models.
- Datasets to enable advances in NLP tasks for code-switched text or speech (speech alternating between multiple languages, dialects, or registers).
- Smaller mono- or multilingual datasets optimized for specific use cases (e.g., digit or place name ASR datasets, or MT or metadata extraction for legal or medical records).
- Other ideas! See our Grantmaking Philosophy. https://lacunafund.org/