Datasets for South African Languages: Bilingual Aligned and Monolingual Data for Machine Translation

Open Access | Sep 2025

Abstract

This data paper describes machine translation datasets built for the Autshumato project. The datasets contain bilingual aligned data between English and each of the other official written languages of South Africa, namely Afrikaans (ISO 639-3: afr), isiNdebele (nbl), isiXhosa (xho), isiZulu (zul), Sepedi (nso), Sesotho (sot), Setswana (tsn), Siswati (ssw), Tshivenḓa (ven) and Xitsonga (tso), as well as monolingual data for all 11 languages. The content was sourced from existing and commissioned translations, various publications, and web crawls of government sites. The article describes the collection, alignment and cleanup processes used to create these resources, and gives a detailed overview of the amount and provenance of the data included in the final datasets for each language. Although the datasets were created primarily for training statistical and neural machine translation systems, they can also be used for other natural language processing tasks or for linguistic research, such as term extraction or lexicography.

DOI: https://doi.org/10.5334/johd.372 | Journal eISSN: 2059-481X
Language: English
Submitted on: Aug 12, 2025
Accepted on: Sep 8, 2025
Published on: Sep 19, 2025
Published by: Ubiquity Press
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year

© 2025 Tanja Gaustad, Cindy A. McKellar, Martin J. Puttkammer, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.