Datasets for South African Languages: Bilingual Aligned and Monolingual Data for Machine Translation

Tanja Gaustad; Cindy A. McKellar; Martin J. Puttkammer

doi:10.5334/johd.372

Abstract

This data paper describes machine translation datasets built for the Autshumato project. The datasets contain bilingual aligned data between English and each of the other official written languages of South Africa, namely Afrikaans (ISO 639-3: afr), isiNdebele (nbl), isiXhosa (xho), isiZulu (zul), Sepedi (nso), Sesotho (sot), Setswana (tsn), Siswati (ssw), Tshivenḓa (ven) and Xitsonga (tso), as well as monolingual data for all 11 languages. The content was sourced from existing and commissioned translations, various publications, and web-crawling of government sites. The present article describes the collection, alignment and cleanup processes used to create these resources. It also gives a detailed overview of the amount and provenance of the data included in the final datasets for all languages. Although the datasets were created primarily for training statistical and neural machine translation systems, they can also be used for other natural language processing tasks or for linguistic research, such as term extraction or lexicography.

