Datasets for South African Languages: Bilingual Aligned and Monolingual Data for Machine Translation

Tanja Gaustad; Cindy A. McKellar; Martin J. Puttkammer

doi:10.5334/johd.372

(1) Overview

Repository location

SADiLaR Language Resource Repository https://repo.sadilar.org/home

Each language and type (monolingual and parallel) has a separate handle:

– Autshumato Monolingual Afrikaans Corpus: https://hdl.handle.net/20.500.12185/580
– Autshumato English-Afrikaans Parallel Corpora: https://hdl.handle.net/20.500.12185/574
– Autshumato Monolingual English Corpus: https://hdl.handle.net/20.500.12185/686
– Autshumato Monolingual isiNdebele Corpus: https://hdl.handle.net/20.500.12185/573
– Autshumato English-isiNdebele Parallel Corpora: https://hdl.handle.net/20.500.12185/572
– Autshumato Monolingual isiXhosa Corpus: https://hdl.handle.net/20.500.12185/692
– Autshumato English-isiXhosa Parallel corpus: https://hdl.handle.net/20.500.12185/691
– Autshumato Monolingual isiZulu Corpus: https://hdl.handle.net/20.500.12185/581
– Autshumato English-isiZulu Parallel Corpora: https://hdl.handle.net/20.500.12185/575
– Autshumato Monolingual Sepedi Corpus: https://hdl.handle.net/20.500.12185/582
– Autshumato English-Sepedi Parallel Corpora: https://hdl.handle.net/20.500.12185/576
– Autshumato Monolingual Sesotho Corpus: https://hdl.handle.net/20.500.12185/583
– Autshumato English-Sesotho Parallel Corpora: https://hdl.handle.net/20.500.12185/577
– Autshumato Monolingual Setswana Corpus: https://hdl.handle.net/20.500.12185/584
– Autshumato English-Setswana Parallel Corpora: https://hdl.handle.net/20.500.12185/578
– Autshumato Monolingual Siswati Corpus: https://hdl.handle.net/20.500.12185/559.2
– Autshumato English-Siswati Parallel Corpora: https://hdl.handle.net/20.500.12185/560.2
– Autshumato Monolingual Tshivenḓa Corpus: https://hdl.handle.net/20.500.12185/681
– Autshumato English-Tshivenḓa Parallel Corpora: https://hdl.handle.net/20.500.12185/682
– Autshumato Monolingual Xitsonga Corpus: https://hdl.handle.net/20.500.12185/570
– Autshumato English-Xitsonga Parallel Corpora: https://hdl.handle.net/20.500.12185/579

Context

The Autshumato¹ project was initially funded by the South African Department of Sports, Arts and Culture (DSAC) to develop, release and support a set of open-source translation resources to enable easier translation between South Africa’s 11 official written languages.² The current (6th) iteration of the project was funded by the South African Centre for Digital Language Resources (SADiLaR)³ with the aim of adding new data resources that have become available and making the updated corpora available to other researchers. The datasets described here were created as part of the Autshumato project for training machine translation systems⁴ between English and all other written official South African languages, in both directions, namely Afrikaans (ISO 639-3: afr), isiNdebele (nbl), isiXhosa (xho), isiZulu (zul), Sepedi (nso), Sesotho (sot), Setswana (tsn), Siswati (ssw), Tshivenḓa (ven) and Xitsonga (tso). They are also intended to provide reusable linguistic resources for the development of other natural language processing applications for all these languages.

The datasets described in this article cover 11 official languages of South Africa, comprising nine Bantu languages and two Germanic languages (English and Afrikaans). The South African Bantu languages can generally be categorised into three language family groups: four conjunctively written Nguni languages (isiNdebele, isiXhosa, isiZulu, and Siswati⁵) and five disjunctively written languages, namely four Sotho languages (Sepedi, Sesotho, Setswana, and Tshivenḓa⁶) and one TswaRonga language (Xitsonga). For each language other than English, there are two separate data handles: one with a set of two parallel files (one for English and one for the relevant language) and one containing a single monolingual file in the relevant language. For English, only a monolingual corpus is included. This brings the total number of datasets to 21: 10 parallel and 11 monolingual.

All parallel corpora are UTF-8 encoded text files aligned at segment level with each new line being the start of an aligned segment. We refer to a combination of one or more words as a segment which includes partial and full sentences as well as headings and list elements. The same preprocessing was applied to the monolingual data resulting in a single UTF-8 encoded text file per language with one segment per line. An overview of the segment and word counts for all corpora can be found in Table 1.
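Because the two files of a parallel dataset are aligned line by line, reading them back as segment pairs only requires iterating over both in step. The following is a minimal sketch of that idea, not part of the released tooling; the function name `pair_segments` and the placeholder segments are assumptions made for illustration.

```python
import io
from itertools import zip_longest
from typing import Iterable, Iterator, Tuple


def pair_segments(src_lines: Iterable[str],
                  tgt_lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (source, target) segment pairs from two line-aligned inputs."""
    paired = zip_longest(src_lines, tgt_lines)
    for number, (src, tgt) in enumerate(paired, start=1):
        if src is None or tgt is None:
            # One input ran out first, so the inputs are not line-aligned.
            raise ValueError(f"inputs differ in length at line {number}")
        yield src.rstrip("\n"), tgt.rstrip("\n")


# In-memory stand-ins for the two files of one parallel dataset.
# The segment texts are placeholders, not corpus content.
english = io.StringIO("en segment 1\nen segment 2\n")
other = io.StringIO("xx segment 1\nxx segment 2\n")
pairs = list(pair_segments(english, other))
```

With the downloaded data, the two arguments would be file objects opened with `encoding="utf-8"`; the actual file names depend on the handle and are not assumed here.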

Table 1

Overview of data counts for the described datasets including total number of segments, total words in the relevant language and total English words for the parallel data.

LANGUAGE	TOTAL SEGMENTS	TOTAL WORDS IN LANGUAGE	TOTAL ENGLISH WORDS
Monolingual Datasets
Afrikaans	1 191 904	20 276 850
English	8 832 451	188 252 040
isiNdebele	82 801	1 001 959
isiXhosa	341 330	4 328 245
isiZulu	238 699	3 403 927
Sepedi	171 774	3 448 592
Sesotho	216 854	4 242 075
Setswana	268 615	5 205 832
Siswati	138 651	1 536 356
Tshivenḓa	141 426	2 870 916
Xitsonga	200 900	3 145 599
Parallel Datasets
English – Afrikaans	1 367 869	20 270 021	20 514 138
English – isiNdebele	128 382	1 490 423	2 067 749
English – isiXhosa	109 940	1 264 390	1 745 236
English – isiZulu	233 691	2 910 800	4 148 245
English – Sepedi	131 535	2 822 916	2 214 453
English – Sesotho	171 292	3 465 480	2 848 205
English – Setswana	238 475	4 874 105	3 583 483
English – Siswati	114 839	1 423 414	2 002 293
English – Tshivenḓa	110 367	2 527 789	2 000 657
English – Xitsonga	170 589	2 427 634	2 022 548

The amount of monolingual data ranges from about 1 million words for isiNdebele to roughly 190 million words for English. The parallel data ranges from about 1.7 million words for isiXhosa to 20 million words for Afrikaans; both figures are measured on the English side for comparison purposes, given the contrasting disjunctive and conjunctive orthographies of the different languages (Prinsloo & de Schryver, 2002). For the purposes of the data counts, any line that contains at least one word or number is counted as a segment, and any word that contains at least one alphanumeric character is counted.
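The counting rule can be made concrete with a short sketch. It assumes that words are separated by whitespace (the corpora are tokenised) and that "alphanumeric" corresponds to Python's Unicode-aware `str.isalnum`; both are assumptions for illustration, and the function name is hypothetical.

```python
from typing import Iterable, Tuple


def count_segments_and_words(lines: Iterable[str]) -> Tuple[int, int]:
    """Return (segments, words) for an iterable of corpus lines."""
    segments = 0
    words = 0
    for line in lines:
        # A token counts as a word if it has at least one alphanumeric character.
        in_line = sum(1 for token in line.split()
                      if any(ch.isalnum() for ch in token))
        # A line counts as a segment if it has at least one such word or number.
        if in_line:
            segments += 1
            words += in_line
    return segments, words


# Placeholder lines: punctuation-only and empty lines are not counted.
sample = ["Hello , world !", "---", "", "Page 3"]
counts = count_segments_and_words(sample)
```

Under these assumptions, punctuation-only tokens such as `,` or `---` contribute to neither count.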

(2) Methods

The data contained in these datasets comes from a combination of different sources: documents acquired with distribution rights by the Centre for Text Technology (translation works, multilingual magazines and newsletters) as well as documents crawled from the South African government domain (*.gov.za). Since the South African government websites have information in all the official languages of South Africa, they are a good source of parallel data. These websites cover a wide array of topics, and their content is produced by professional translators working for the South African government using official orthography and spelling rules. The acquired sources also contain a mix of topics, ranging from official communications and internal news to adverts and information pamphlets for various domains.

Table 2 presents an overview of the contribution of the different sources to the parallel datasets. As all included languages except Afrikaans and English are resource-scarce, the inclusion of data was mainly based on availability. For some languages, there are effectively only two types of sources and the distribution between crawled and translated data is roughly 50–50 (e.g. Siswati or Xitsonga), whereas for others the main contributor is crawled data with circa 60%, followed by magazines and newsletters at about 30% and translations at 10% (e.g. Sepedi or isiZulu). For Afrikaans and Setswana, translated text provides the bulk of the included parallel data, whereas for isiXhosa magazines and newsletters contribute the most to the final data. This highlights once again the precariousness of securing data for under-resourced languages, but at the same time showcases the diversity of data included in the presented language resources.

Table 2

Distribution of sources for the parallel datasets for all languages.

	TYPES OF SOURCES
	MAGAZINES AND NEWSLETTERS	TRANSLATED DATA	CRAWLED DATA
Afrikaans	3.42%	62.64%	33.94%
isiNdebele	0%	56.63%	43.37%
isiXhosa	63.79%	5.13%	31.08%
isiZulu	29.90%	12.70%	57.40%
Sepedi	26.72%	10.80%	62.48%
Sesotho	49.64%	12.34%	38.02%
Setswana	19.29%	56.11%	24.60%
Siswati	0%	49.53%	50.47%
Tshivenḓa	0%	56.56%	43.44%
Xitsonga	0.23%	48.72%	51.05%

Processing steps, Data cleaning and Quality control

For the crawled data, documents from South African government websites were collected using HTTrack.⁷ Next, the data was extracted from the original document formats (doc(x), htm(l) or pdf) and converted to UTF-8 encoded text. To find all possible parallel data contained in the crawl, the converted documents were aligned at document level using a combination of document names, internal document structure and website structure.

Aligned documents were then put through a sentence-level alignment process using HunAlign (Varga et al., 2005) supplemented with bilingual wordlists created from a combination of bilingual glossaries sourced from the Department of Sports, Arts and Culture as well as from professional translators. Every document pair was analysed after alignment to check the amount of data that was aligned properly, and the percentage of data lost. Any document that had over 20% data loss was checked for document processing errors and realigned to ensure maximum data retention. Any document pair that still had more than 20% data loss was assumed to be incorrectly matched with its translation during the document alignment phase and was discarded to ensure high quality alignments for the training corpus. The data identified as parallel was then extracted into one file per language and sorted uniquely (discarding doubled lines where both languages were identical).
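The 20% rule described above amounts to a two-stage decision per document pair. The sketch below illustrates that decision only; the text does not define how data loss was measured, so computing it as the share of a document's segments missing from the aligned output is an assumption, as are the function names and return labels.

```python
def data_loss(total_segments: int, aligned_segments: int) -> float:
    """Share of segments lost in alignment (assumed definition)."""
    if total_segments <= 0:
        raise ValueError("total_segments must be positive")
    return 1.0 - aligned_segments / total_segments


def triage(total_segments: int, aligned_segments: int,
           already_realigned: bool, max_loss: float = 0.20) -> str:
    """Decide what to do with one aligned document pair."""
    if data_loss(total_segments, aligned_segments) <= max_loss:
        return "keep"
    # First failure: check for processing errors and realign.
    # Second failure: treat the pair as mismatched and drop it.
    return "discard" if already_realigned else "realign"
```

For example, a pair that keeps 85 of 100 segments is kept, while one that keeps only 70 is first sent back for checking and realignment and is discarded only if it still exceeds the limit afterwards.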

All the acquired data (i.e. the translation works, magazines and newsletters) was also converted into text files, separated into sentences, tokenised and sentence-aligned with the HunAlign (Varga et al., 2005) algorithm. Even though this data was highly edited and needed slightly less processing, we still performed alignment checks and applied quality control.

Once each of the types of data had been individually processed and aligned on sentence level, the individual aligned files were combined into a sentence aligned document pair and then put through a final cleanup process consisting of the following five steps:

Language identification: All sentences were run through the CTexT Tools Language Identifier (Hocking, 2014; Puttkammer et al., 2018) to ensure that they are in the correct language since South African documents sometimes contain mixed languages. The language identification process identifies the language of each individual segment and the probability of the segment being in that language. Only segments that were identified as being in the correct language with a certainty of at least 80% were kept. All segments with lower certainty or that were identified as other languages were removed from the corpus along with the matching aligned segment.
Duplicate removal: Web crawling results in particular contain an enormous amount of data duplication. This includes information that was posted on multiple sites, documents existing under different names, and elements such as headings and contact information that were present on all pages of a website. To combat this unnecessary duplication, identical aligned sentence pairs were removed. Only pairs where both of the aligned segments were identical were removed, leaving one pair of each set of duplicates in the corpus.
Filtering out of unusable data: Lines with broken diacritics or excessive punctuation were deleted in both aligned files. Some of the South African languages use diacritics as part of their orthography, and when there are errors in the document encoding these diacritic characters can be replaced with incorrect symbols. Errors are also sometimes introduced into the data during pdf-to-txt conversion due to badly formed pdf documents or bad optical character recognition (OCR). During this step, all lines containing odd characters were checked and sentences showing errors were either fixed or removed from the corpus along with the matching aligned segment.
Spell checking: All sentences falling below a minimum percentage of correctly spelled words were removed from the corpus along with the matching aligned segment. Every word was checked using an internal text-based spelling checker and the percentage of correctly spelled words was then calculated for each sentence. Any sentence with too many spelling errors was removed along with its aligned pair, ensuring that badly written text as well as text containing errors introduced by the pdf conversion process was eliminated from the corpus. For most of the languages the threshold was 80% correctly spelled words, but it was lowered to 70% for the extremely agglutinative Nguni languages (isiNdebele, isiXhosa, isiZulu, Siswati) since spell checking can be less accurate for these languages.
Randomisation: All sentence pairs were randomised to protect the content of the original documents and to comply with usage restrictions. To randomise the data each aligned pair was assigned a random number and the entire corpus was then re-ordered using these numbers. Aligned sentence pairs are kept together during this process to ensure the randomisation does not affect alignment. The numbers used are deleted from the corpus after the randomisation step.
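Several of the cleanup steps above can be sketched in a few lines each. The code below is an illustration under stated assumptions, not the CTexT implementation: the `identify` callable stands in for a language identifier returning a (language code, confidence) pair, the `lexicon` set stands in for the internal spelling checker, and all function names are hypothetical. Only the 80% and 70% thresholds come from the text.

```python
import random
from typing import Callable, Iterable, List, Optional, Set, Tuple

Pair = Tuple[str, str]  # (English segment, other-language segment)

LANG_ID_MIN_CONFIDENCE = 0.80          # from the text
NGUNI = {"nbl", "xho", "zul", "ssw"}   # ISO 639-3 codes of the Nguni languages


def spelling_threshold(lang: str) -> float:
    """80% for most languages, 70% for the Nguni languages."""
    return 0.70 if lang in NGUNI else 0.80


def keep_by_language(pairs: Iterable[Pair],
                     identify: Callable[[str], Tuple[str, float]],
                     lang: str) -> List[Pair]:
    """Keep pairs whose two sides are both identified with enough confidence."""
    kept = []
    for eng, other in pairs:
        ok = True
        for segment, expected in ((eng, "eng"), (other, lang)):
            label, confidence = identify(segment)  # hypothetical interface
            if label != expected or confidence < LANG_ID_MIN_CONFIDENCE:
                ok = False
        if ok:
            kept.append((eng, other))
    return kept


def drop_duplicate_pairs(pairs: Iterable[Pair]) -> List[Pair]:
    """Remove repeats only when both sides match, keeping the first copy."""
    seen: Set[Pair] = set()
    unique = []
    for pair in pairs:
        if pair not in seen:
            seen.add(pair)
            unique.append(pair)
    return unique


def spelled_ok(segment: str, lexicon: Set[str], threshold: float) -> bool:
    """True if the share of words found in the lexicon meets the threshold."""
    words = [t.lower() for t in segment.split() if any(c.isalnum() for c in t)]
    if not words:
        return False
    return sum(w in lexicon for w in words) / len(words) >= threshold


def randomise(pairs: Iterable[Pair], seed: Optional[int] = None) -> List[Pair]:
    """Attach a random key to each pair, sort by it, then drop the keys."""
    rng = random.Random(seed)
    keyed = [(rng.random(), pair) for pair in pairs]
    keyed.sort(key=lambda item: item[0])
    return [pair for _, pair in keyed]
```

Keeping each pair as a single tuple throughout means that no step can separate a segment from its aligned counterpart, which mirrors the requirement that removal and randomisation always act on both sides together.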

The monolingual datasets were put through the same cleanup process as the aligned text except for the document and sentence alignment steps. The final monolingual corpora contain tokenised, unique sentences which have been identified as belonging to the targeted language. As with the parallel data, all lines containing broken diacritics and little usable text as well as all lines less than 70% correctly spelled were removed. As a final step, the lines were randomised to create the final version of each monolingual corpus.

(3) Dataset Description

Repository name

SADiLaR Language Resource Repository

Object name

Each language and type (monolingual and parallel) has a separate handle:

– Autshumato Monolingual Afrikaans Corpus
– Autshumato English-Afrikaans Parallel Corpora
– Autshumato Monolingual English Corpus
– Autshumato Monolingual isiNdebele Corpus
– Autshumato English-isiNdebele Parallel Corpora
– Autshumato Monolingual isiXhosa Corpus
– Autshumato English-isiXhosa Parallel corpus
– Autshumato Monolingual isiZulu Corpus
– Autshumato English-isiZulu Parallel Corpora
– Autshumato Monolingual Sepedi Corpus
– Autshumato English-Sepedi Parallel Corpora
– Autshumato Monolingual Sesotho Corpus
– Autshumato English-Sesotho Parallel Corpora
– Autshumato Monolingual Setswana Corpus
– Autshumato English-Setswana Parallel Corpora
– Autshumato Monolingual Siswati Corpus
– Autshumato English-Siswati Parallel Corpora
– Autshumato Monolingual Tshivenḓa Corpus
– Autshumato English-Tshivenḓa Parallel Corpora
– Autshumato Monolingual Xitsonga Corpus
– Autshumato English-Xitsonga Parallel Corpora

Format names and versions

UTF-8 encoded .txt files

Creation dates

2020-01-01 to 2025-06-01

Dataset creators

Cindy A. McKellar (Centre for Text Technology (CTexT), North-West University) – Data curation; Validation;

Tanja Gaustad (Centre for Text Technology (CTexT), North-West University) – Data curation;

Jacques van Heerden (Centre for Text Technology (CTexT), North-West University) – Project management;

Martin Puttkammer (Centre for Text Technology (CTexT), North-West University) – Data validation;

Language

Afrikaans (afr), English (eng), isiNdebele (nbl), isiXhosa (xho), isiZulu (zul), Sepedi/Sesotho sa Leboa (nso), Sesotho (sot), Setswana (tsn), Siswati (ssw), Tshivenḓa (ven), Xitsonga (tso)

License

CC BY 4.0 – https://creativecommons.org/licenses/by/4.0/

Publication date

2021-01-31 to 2025-06-10

(4) Reuse Potential

These datasets can serve as valuable resources for further research into both machine translation (for training and evaluation) and other natural language processing applications. Parallel corpora have been used for term extraction (see for example Ndhlovu, 2016) or the building of bilingual dictionaries (e.g. Héja, 2010), whereas corpora in general have played an increasing role in lexicography (Abdelzaher, 2022). The language-specific part of the bilingual aligned corpus can also be combined with the monolingual corpus for a specific language to create a larger monolingual corpus, which can then be used for research and development that requires only monolingual data.
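Combining the language-specific side of a parallel dataset with the corresponding monolingual dataset could look like the sketch below. Dropping repeated segments during the merge is an assumption made here because the two datasets may draw on the same sources; the function name is hypothetical.

```python
from typing import Iterable, Iterator


def merge_monolingual(*sources: Iterable[str]) -> Iterator[str]:
    """Yield each non-empty segment once, in order of first appearance."""
    seen = set()
    for source in sources:
        for line in source:
            segment = line.rstrip("\n")
            if segment and segment not in seen:
                seen.add(segment)
                yield segment


# Placeholder segments standing in for the two inputs.
monolingual = ["segment a\n", "segment b\n"]
parallel_side = ["segment b\n", "segment c\n"]
merged = list(merge_monolingual(monolingual, parallel_side))
```

Each argument can be any iterable of lines, so file objects opened with `encoding="utf-8"` can be passed directly.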

Our data can also be used as a source for linguistic analysis of the included languages. Textual materials are important as a representation of a language and its sociocultural context, and for preserving and promoting the culture, linguistic heritage, and identity associated with that language. For instance, by analysing how similar concepts are expressed in different texts, corpora can provide evidence of both equivalence and non-equivalence across languages, offering insights into the differences between languages and cultures.

Even though these datasets cover a limited range of domains and are small compared to datasets used in large-scale language technology research, they still represent a significant and valuable resource for re-use and analysis given the resource-scarce nature of the languages.

Notes

[1] https://autshumato.sourceforge.io/.

[2] In 2023, South African Sign Language (SASL) was recognised as an official language, but it is not included in these datasets as it is not a written language in the traditional sense.

[3] https://www.sadilar.org/.

[4] https://mt.nwu.ac.za/.

[5] For Siswati, there is a separate data publication with some additional details: 10.1016/j.dib.2024.110325.

[6] For Tshivenḓa, there is a separate data publication with some additional details: 10.1016/j.dib.2024.110898.

[7] www.httrack.com.

Acknowledgements

This research was made possible with support from the South African Centre for Digital Language Resources (SADiLaR). SADiLaR is a research infrastructure established by the Department of Science and Innovation (DSI) of the South African government as part of the South African Research Infrastructure Roadmap (SARIR). Additionally, it was made possible by support from the South African Department of Sports, Arts and Culture (DSAC).

Competing Interests

The authors have no competing interests to declare.

Author Contributions

Tanja Gaustad: Conceptualisation; Data curation; Project administration; Quality Control; Writing – original draft; Writing – review & editing;

Cindy A. McKellar: Data curation; Methodology; Validation; Writing – review & editing;

Martin J. Puttkammer: Funding acquisition; Methodology; Validation; Writing – review & editing.