Abstract
This data paper describes the machine translation datasets built for the Autshumato project. The datasets contain both bilingual data aligned between English and each of the other official written languages of South Africa, namely Afrikaans (ISO 639-3: afr), isiNdebele (nbl), isiXhosa (xho), isiZulu (zul), Sepedi (nso), Sesotho (sot), Setswana (tsn), Siswati (ssw), Tshivenḓa (ven) and Xitsonga (tso), and monolingual data for all 11 languages. The content was sourced from existing and commissioned translations, various publications, and web crawling of government sites. This article describes the collection, alignment and cleanup processes used to create these resources. It also gives a detailed overview of the amount and provenance of the data included in the final datasets for each language. Although the datasets were created primarily for training statistical and neural machine translation systems, they can also be used for other natural language processing tasks, such as term extraction, and for linguistic research, such as lexicography.
