Datasets for South African Languages: Bilingual Aligned and Monolingual Data for Machine Translation

Open Access | Sep 2025

(1) Overview

Repository location

SADiLaR Language Resource Repository https://repo.sadilar.org/home

Each language and type (monolingual and parallel) has a separate handle; the individual objects are listed under Object name in section (3).

Context

The Autshumato[1] project was initially funded by the South African Department of Sports, Arts and Culture (DSAC) to develop, release and support a set of open-source translation resources to enable easier translation between South Africa’s 11 official written languages.[2] The current (6th) iteration of the project was funded by the South African Centre for Digital Language Resources (SADiLaR)[3] with the aim of adding new data resources that have become available and making the updated corpora available to other researchers. The datasets described here were created as part of the Autshumato project for training machine translation systems[4] between English and each of the other official written South African languages, in both directions: Afrikaans (ISO 639-3: afr), isiNdebele (nbl), isiXhosa (xho), isiZulu (zul), Sepedi (nso), Sesotho (sot), Setswana (tsn), Siswati (ssw), Tshivenḓa (ven) and Xitsonga (tso). They are also intended as reusable linguistic resources for the development of other natural language processing applications for all these languages.

The datasets described in this article cover the 11 official written languages of South Africa, comprising nine Bantu languages and two Germanic languages (English and Afrikaans). The South African Bantu languages can broadly be grouped by family and orthography: four conjunctively written Nguni languages (isiNdebele, isiXhosa, isiZulu, and Siswati[5]) and five disjunctively written languages, namely three Sotho languages (Sepedi, Sesotho, and Setswana), Tshivenḓa[6], and one Tswa-Ronga language (Xitsonga). For each language other than English, there are two separate data handles: one with a pair of parallel files (one English, one in the relevant language) and one containing a single monolingual file in the relevant language. For English, only a monolingual corpus is included. This brings the total number of datasets to 21: 10 parallel and 11 monolingual.

All parallel corpora are UTF-8 encoded text files aligned at segment level with each new line being the start of an aligned segment. We refer to a combination of one or more words as a segment which includes partial and full sentences as well as headings and list elements. The same preprocessing was applied to the monolingual data resulting in a single UTF-8 encoded text file per language with one segment per line. An overview of the segment and word counts for all corpora can be found in Table 1.
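
Because the two files of a parallel corpus are aligned line by line, reading them amounts to walking both files in step. The following is a minimal sketch of that idea, not part of the released resources; the function names and any file paths passed to `read_parallel` are assumptions made for illustration.

```python
from itertools import zip_longest

_MISSING = object()  # sentinel marking that one file ended before the other


def pair_segments(src_lines, tgt_lines):
    """Yield (source, target) segments from two line-aligned iterables."""
    paired = zip_longest(src_lines, tgt_lines, fillvalue=_MISSING)
    for line_no, (src, tgt) in enumerate(paired, start=1):
        if src is _MISSING or tgt is _MISSING:
            # Line-aligned files must have the same number of lines.
            raise ValueError(f"files differ in length at line {line_no}")
        yield src.rstrip("\n"), tgt.rstrip("\n")


def read_parallel(src_path, tgt_path):
    """Read two UTF-8 text files (hypothetical paths) as aligned pairs."""
    with open(src_path, encoding="utf-8") as src, \
            open(tgt_path, encoding="utf-8") as tgt:
        yield from pair_segments(src, tgt)
```

Raising an error on a length mismatch is a design choice of this sketch: a silent truncation would hide a broken alignment.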

Table 1

Overview of data counts for the described datasets including total number of segments, total words in the relevant language and total English words for the parallel data.

| LANGUAGE | TOTAL SEGMENTS | TOTAL WORDS IN LANGUAGE | TOTAL ENGLISH WORDS |
| --- | --- | --- | --- |
| Monolingual Datasets | | | |
| Afrikaans | 1 191 904 | 20 276 850 | |
| English | 8 832 451 | 188 252 040 | |
| isiNdebele | 82 801 | 1 001 959 | |
| isiXhosa | 341 330 | 4 328 245 | |
| isiZulu | 238 699 | 3 403 927 | |
| Sepedi | 171 774 | 3 448 592 | |
| Sesotho | 216 854 | 4 242 075 | |
| Setswana | 268 615 | 5 205 832 | |
| Siswati | 138 651 | 1 536 356 | |
| Tshivenḓa | 141 426 | 2 870 916 | |
| Xitsonga | 200 900 | 3 145 599 | |
| Parallel Datasets | | | |
| English – Afrikaans | 1 367 869 | 20 270 021 | 20 514 138 |
| English – isiNdebele | 128 382 | 1 490 423 | 2 067 749 |
| English – isiXhosa | 109 940 | 1 264 390 | 1 745 236 |
| English – isiZulu | 233 691 | 2 910 800 | 4 148 245 |
| English – Sepedi | 131 535 | 2 822 916 | 2 214 453 |
| English – Sesotho | 171 292 | 3 465 480 | 2 848 205 |
| English – Setswana | 238 475 | 4 874 105 | 3 583 483 |
| English – Siswati | 114 839 | 1 423 414 | 2 002 293 |
| English – Tshivenḓa | 110 367 | 2 527 789 | 2 000 657 |
| English – Xitsonga | 170 589 | 2 427 634 | 2 022 548 |

The amount of monolingual data ranges from about 1 million words for isiNdebele to roughly 190 million words for English. The parallel data ranges from about 1.7 million words for isiXhosa to about 20.5 million words for Afrikaans. Both parallel figures are measured on the English side for comparison purposes, given the contrasting disjunctive and conjunctive orthographies of the different languages (Prinsloo & de Schryver, 2002). For the purposes of the data counts, any line that contains at least one word or number is counted as a segment, and any word that contains at least one alphanumeric character is counted as a word.
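
The counting rule can be illustrated with a short sketch. It assumes that words are separated by whitespace (the corpora are tokenised) and uses Python's Unicode-aware `str.isalnum` as the test for an alphanumeric character; the function name is hypothetical and this is not the script used to produce Table 1.

```python
def count_segments_and_words(lines):
    """Count segments and words following the rule described in the text.

    A token counts as a word if it contains at least one alphanumeric
    character; a line counts as a segment if it has at least one such token.
    Whitespace tokenisation is an assumption of this sketch.
    """
    segments = 0
    words = 0
    for line in lines:
        tokens = [tok for tok in line.split()
                  if any(ch.isalnum() for ch in tok)]
        if tokens:
            segments += 1
            words += len(tokens)
    return segments, words
```

Under this rule a line consisting only of punctuation contributes neither a segment nor a word, while a line holding a single number counts as one segment with one word.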

(2) Methods

The data contained in these datasets comes from a combination of sources: documents acquired with distribution rights by the Centre for Text Technology (translation works, multilingual magazines and newsletters) as well as documents crawled from the South African government domain (*.gov.za). Since the South African government websites have information in all the official languages of South Africa, they are a good source of parallel data. These websites cover a wide array of topics and are created by professional translators working for the South African government using official orthography and spelling rules. The acquired sources also contain a mix of topics, ranging from official communications and internal news to adverts and information pamphlets for various domains.

Table 2 presents an overview of the contribution of the different sources to the parallel datasets. As all languages included are resource-scarce (except for Afrikaans and English), the inclusion of data was mainly based on availability. For some languages, there are essentially only two types of sources and the distribution between crawled and translated data is roughly 50–50 (e.g. Siswati or Xitsonga), whereas for others the main contributor is crawled data at circa 60%, followed by magazines and newsletters at about 30% and translations at about 10% (e.g. Sepedi or isiZulu). For Afrikaans and Setswana, translated text provides the bulk of the included parallel data, whereas for isiXhosa magazines and newsletters contribute the most to the final data. This highlights once again the precariousness of securing data for under-resourced languages, but at the same time showcases the diversity of data included in the presented language resources.

Table 2

Distribution of sources for the parallel datasets for all languages.

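TYPES OF SOURCES

| LANGUAGE | MAGAZINES AND NEWSLETTERS | TRANSLATED DATA | CRAWLED DATA |
| --- | --- | --- | --- |
| Afrikaans | 3.42% | 62.64% | 33.94% |
| isiNdebele | 0% | 56.63% | 43.37% |
| isiXhosa | 63.79% | 5.13% | 31.08% |
| isiZulu | 29.90% | 12.70% | 57.40% |
| Sepedi | 26.72% | 10.80% | 62.48% |
| Sesotho | 49.64% | 12.34% | 38.02% |
| Setswana | 19.29% | 56.11% | 24.60% |
| Siswati | 0% | 49.53% | 50.47% |
| Tshivenḓa | 0% | 56.56% | 43.44% |
| Xitsonga | 0.23% | 48.72% | 51.05% |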

Processing steps, Data cleaning and Quality control

For the crawled data, documents from South African government websites were collected using HTTrack.[7] Next, the data was extracted from the original document formats (doc(x), htm(l) or pdf) and converted to UTF-8 encoded text. To find all possible parallel data contained in the crawl, the converted documents were aligned at document level using a combination of document names, internal document structure and website structure.

Aligned documents were then put through a sentence-level alignment process using HunAlign (Varga et al., 2005) supplemented with bilingual wordlists created from a combination of bilingual glossaries sourced from the Department of Sports, Arts and Culture as well as from professional translators. Every document pair was analysed after alignment to check the amount of data that was aligned properly, and the percentage of data lost. Any document that had over 20% data loss was checked for document processing errors and realigned to ensure maximum data retention. Any document pair that still had more than 20% data loss was assumed to be incorrectly matched with its translation during the document alignment phase and was discarded to ensure high quality alignments for the training corpus. The data identified as parallel was then extracted into one file per language and sorted uniquely (discarding doubled lines where both languages were identical).
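
The 20% rule can be sketched as a small triage step. The article does not state how data loss was measured, so the definition below (the share of sentences in the longer document that did not end up in an aligned pair) is an assumption made for illustration, as are the function names.

```python
MAX_LOSS = 0.20  # documents with more than 20% data loss are flagged


def data_loss(n_source, n_target, n_aligned):
    """Assumed loss measure: share of the longer document left unaligned."""
    longest = max(n_source, n_target)
    if longest == 0:
        return 0.0
    return 1.0 - min(n_aligned, longest) / longest


def triage(loss, realigned):
    """Decide what happens to a document pair after alignment.

    'keep'    : loss is within the threshold
    'realign' : loss is too high on the first pass, so check and realign
    'discard' : loss is still too high after realignment
    """
    if loss <= MAX_LOSS:
        return "keep"
    return "discard" if realigned else "realign"
```

For example, a pair with 100 and 90 sentences of which 85 were aligned has a loss of about 15% under this definition and would be kept.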

All the acquired data (i.e. the translation works, magazines and newsletters) was also converted into text files, separated into sentences, tokenised and sentence-aligned with the HunAlign (Varga et al., 2005) algorithm. Even though this data was highly edited and needed slightly less processing, we still performed alignment checks and applied quality control.

Once each type of data had been individually processed and aligned at sentence level, the individual aligned files were combined into a single sentence-aligned file pair per language and then put through a final cleanup process consisting of the following five steps:

  • Language identification: All sentences were run through the CTexT Tools Language Identifier (Hocking, 2014; Puttkammer et al., 2018) to ensure that they are in the correct language, since South African documents sometimes contain mixed languages. The language identification process identifies the language of each individual segment and the probability of the segment being in that language. Only segments that were identified as being in the correct language with a certainty of at least 80% were kept. All segments with lower certainty or that were identified as other languages were removed from the corpus along with the matching aligned segment.

  • Duplicate removal: Web crawling results in particular contain an enormous amount of data duplication. This duplication includes information that was posted on multiple sites, documents existing under different names, and elements such as headings and contact information that were present on all pages of a website. To combat this unnecessary duplication, all identical aligned sentence pairs were removed. Only pairs where both of the aligned segments were identical were removed, leaving one pair of each set of duplicates in the corpus.

  • Filtering out of unusable data: Lines with broken diacritics or excessive punctuation were deleted in both aligned files. Some of the South African languages use diacritics as part of their orthography, and when there are errors in the document encoding these diacritic characters can be replaced with incorrect symbols. Errors are also sometimes introduced into the data during pdf-to-txt conversion due to badly formed pdf documents or bad optical character recognition (OCR). During this step, all lines containing odd characters were checked, and sentences showing errors were either fixed or removed from the corpus along with the matching aligned segment.

  • Spell checking: All sentences that fell below a minimum share of correctly spelled words were removed from the corpus along with the matching aligned segment. Every word was checked using an internal text-based spelling checker, and the percentage of correctly spelled words was then calculated for each sentence. Any sentence with too many spelling errors was removed along with its aligned pair, ensuring that badly written text as well as text containing errors introduced by the pdf conversion process was eliminated from the corpus. The threshold was 80% for most of the languages but was lowered to 70% for the highly agglutinative Nguni languages (isiNdebele, isiXhosa, isiZulu, Siswati), since spell checking can be less accurate for these languages.

  • Randomisation: All sentence pairs were randomised to protect the content of the original documents and to comply with usage restrictions. To randomise the data each aligned pair was assigned a random number and the entire corpus was then re-ordered using these numbers. Aligned sentence pairs are kept together during this process to ensure the randomisation does not affect alignment. The numbers used are deleted from the corpus after the randomisation step.
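
The automatic steps above (language identification, duplicate removal, spell checking and randomisation) can be sketched as a single pass over the aligned pairs. This is an illustrative outline only, not the authors' implementation: `identify` is a hypothetical stand-in for a language identifier that returns a language code and a probability, the lexicons are plain word sets standing in for the internal spelling checker, and the partly manual filtering of broken diacritics is left out.

```python
import random


def spelled_ok(segment, lexicon, threshold):
    """True if the share of words found in the lexicon meets the threshold."""
    # Assumption: only tokens containing a letter are spell checked.
    words = [w.lower() for w in segment.split()
             if any(c.isalpha() for c in w)]
    if not words:
        return True  # assumption: nothing to check, so the segment passes
    correct = sum(1 for w in words if w in lexicon)
    return correct / len(words) >= threshold


def clean_pairs(pairs, identify, src_lang, tgt_lang, src_lexicon, tgt_lexicon,
                lang_conf=0.80, spell_threshold=0.80, seed=None):
    """Filter aligned pairs, then shuffle them while keeping pairs together."""
    seen = set()
    kept = []
    for src, tgt in pairs:
        # Language identification: both sides must be in the expected
        # language with at least `lang_conf` probability.
        src_found, src_prob = identify(src)
        tgt_found, tgt_prob = identify(tgt)
        if src_found != src_lang or src_prob < lang_conf:
            continue
        if tgt_found != tgt_lang or tgt_prob < lang_conf:
            continue
        # Duplicate removal: drop a pair only if both sides were seen together.
        if (src, tgt) in seen:
            continue
        seen.add((src, tgt))
        # Spell checking: 0.80 by default; 0.70 was used for Nguni languages.
        if not (spelled_ok(src, src_lexicon, spell_threshold)
                and spelled_ok(tgt, tgt_lexicon, spell_threshold)):
            continue
        kept.append((src, tgt))
    # Randomisation: give each pair a random key, sort by it, drop the key.
    rng = random.Random(seed)
    keyed = [(rng.random(), pair) for pair in kept]
    keyed.sort(key=lambda item: item[0])
    return [pair for _, pair in keyed]
```

Treating each aligned pair as one unit throughout means that removing or reordering a segment can never break the alignment with its translation.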

The monolingual datasets were put through the same cleanup process as the aligned text except for the document and sentence alignment steps. The final monolingual corpora contain tokenised, unique sentences which have been identified as belonging to the targeted language. As with the parallel data, all lines containing broken diacritics or little usable text, as well as all lines less than 70% correctly spelled, were removed. As a final step, the lines were randomised to create the final version of each monolingual corpus.

(3) Dataset Description

Repository name

SADiLaR Language Resource Repository

Object name

Each language and type (monolingual and parallel) has a separate handle:

  • Autshumato Monolingual Afrikaans Corpus

  • Autshumato English-Afrikaans Parallel Corpora

  • Autshumato Monolingual English Corpus

  • Autshumato Monolingual isiNdebele Corpus

  • Autshumato English-isiNdebele Parallel Corpora

  • Autshumato Monolingual isiXhosa Corpus

  • Autshumato English-isiXhosa Parallel corpus

  • Autshumato Monolingual isiZulu Corpus

  • Autshumato English-isiZulu Parallel Corpora

  • Autshumato Monolingual Sepedi Corpus

  • Autshumato English-Sepedi Parallel Corpora

  • Autshumato Monolingual Sesotho Corpus

  • Autshumato English-Sesotho Parallel Corpora

  • Autshumato Monolingual Setswana Corpus

  • Autshumato English-Setswana Parallel Corpora

  • Autshumato Monolingual Siswati Corpus

  • Autshumato English-Siswati Parallel Corpora

  • Autshumato Monolingual Tshivenḓa Corpus

  • Autshumato English-Tshivenḓa Parallel Corpora

  • Autshumato Monolingual Xitsonga Corpus

  • Autshumato English-Xitsonga Parallel Corpora

Format names and versions

UTF-8 encoded .txt files

Creation dates

2020-01-01 to 2025-06-01

Dataset creators

Cindy A. McKellar (Centre for Text Technology (CTexT), North-West University) – Data curation; Validation;

Tanja Gaustad (Centre for Text Technology (CTexT), North-West University) – Data curation;

Jacques van Heerden (Centre for Text Technology (CTexT), North-West University) – Project management;

Martin Puttkammer (Centre for Text Technology (CTexT), North-West University) – Data validation;

Language

Afrikaans (afr), English (eng), isiNdebele (nbl), isiXhosa (xho), isiZulu (zul), Sepedi/Sesotho sa Leboa (nso), Sesotho (sot), Setswana (tsn), Siswati (ssw), Tshivenḓa (ven), Xitsonga (tso)

License

CC BY 4.0 – https://creativecommons.org/licenses/by/4.0/

Publication date

2021-01-31 to 2025-06-10

(4) Reuse Potential

These datasets can serve as valuable resources for further research into both machine translation (for training and evaluation) and other natural language processing applications. Parallel corpora have been used for term extraction (see for example Ndhlovu, 2016) and for building bilingual dictionaries (e.g. Héja, 2010), whereas corpora in general have played an increasing role in lexicography (Abdelzaher, 2022). The language-specific side of a bilingual aligned corpus can also be combined with the monolingual corpus for the same language to create a larger monolingual corpus for research and development that requires only monolingual data.
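
Combining the language-specific side of a parallel corpus with the monolingual corpus could look like the following minimal sketch. Dropping segments that occur in both sources is an assumption of the sketch (the two corpora may overlap), and the function name is hypothetical.

```python
def merge_monolingual(*sources):
    """Merge several iterables of lines into one list of unique segments.

    The first occurrence of each segment is kept and input order is
    preserved; empty lines are skipped.
    """
    seen = set()
    merged = []
    for lines in sources:
        for line in lines:
            segment = line.strip()
            if segment and segment not in seen:
                seen.add(segment)
                merged.append(segment)
    return merged
```

The function accepts open file handles as well as lists, so the monolingual file and the relevant side of the parallel corpus can be passed in directly.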

Our data can also be used as a source for linguistic analysis of the included languages since textual materials are important as a representation of a language and its sociocultural context and for preserving and promoting the culture, linguistic heritage, and identity associated with a particular language. For instance, by analysing how similar concepts are expressed in different texts, corpora can provide evidence of both equivalence and non-equivalence across languages, offering insights into the differences between languages and cultures.

Even though these datasets contain a limited domain representation and are small compared to datasets used in large scale language technology research, they still represent a significant and valuable resource for re-use and analysis given the resource-scarce nature of the languages.

Notes

[2] In 2023, South African Sign Language (SASL) was recognised as an official language, but it is not included in these datasets as it is not a written language in the traditional sense.

[5] For Siswati, there is a separate data publication with some additional details: 10.1016/j.dib.2024.110325.

[6] For Tshivenḓa, there is a separate data publication with some additional details: 10.1016/j.dib.2024.110898.

Acknowledgements

This research was made possible with the support from the South African Centre for Digital Language Resources (SADiLaR). SADiLaR is a research infrastructure established by the Department of Science and Innovation (DSI) of the South African government as part of the South African Research Infrastructure Roadmap (SARIR). Additionally, it was made possible by support from the South African Department of Sports, Arts and Culture (DSAC).

Competing Interests

The authors have no competing interests to declare.

Author Contributions

Tanja Gaustad: Conceptualisation; Data curation; Project administration; Quality Control; Writing – original draft; Writing – review & editing;

Cindy A. McKellar: Data curation; Methodology; Validation; Writing – review & editing;

Martin J. Puttkammer: Funding acquisition; Methodology; Validation; Writing – review & editing.

DOI: https://doi.org/10.5334/johd.372 | Journal eISSN: 2059-481X
Language: English
Submitted on: Aug 12, 2025
Accepted on: Sep 8, 2025
Published on: Sep 19, 2025
Published by: Ubiquity Press
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year

© 2025 Tanja Gaustad, Cindy A. McKellar, Martin J. Puttkammer, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.