Datasets for South African Languages: Bilingual Aligned and Monolingual Data for Machine Translation

Open Access | Sep 2025

Figures & Tables

Table 1

Overview of data counts for the described datasets including total number of segments, total words in the relevant language and total English words for the parallel data.

Monolingual Datasets

| Language | Total segments | Total words in language |
|---|---|---|
| Afrikaans | 1 191 904 | 20 276 850 |
| English | 8 832 451 | 188 252 040 |
| isiNdebele | 82 801 | 1 001 959 |
| isiXhosa | 341 330 | 4 328 245 |
| isiZulu | 238 699 | 3 403 927 |
| Sepedi | 171 774 | 3 448 592 |
| Sesotho | 216 854 | 4 242 075 |
| Setswana | 268 615 | 5 205 832 |
| Siswati | 138 651 | 1 536 356 |
| Tshivenḓa | 141 426 | 2 870 916 |
| Xitsonga | 200 900 | 3 145 599 |

Parallel Datasets

| Language pair | Total segments | Total words in language | Total English words |
|---|---|---|---|
| English – Afrikaans | 1 367 869 | 20 270 021 | 20 514 138 |
| English – isiNdebele | 128 382 | 1 490 423 | 2 067 749 |
| English – isiXhosa | 109 940 | 1 264 390 | 1 745 236 |
| English – isiZulu | 233 691 | 2 910 800 | 4 148 245 |
| English – Sepedi | 131 535 | 2 822 916 | 2 214 453 |
| English – Sesotho | 171 292 | 3 465 480 | 2 848 205 |
| English – Setswana | 238 475 | 4 874 105 | 3 583 483 |
| English – Siswati | 114 839 | 1 423 414 | 2 002 293 |
| English – Tshivenḓa | 110 367 | 2 527 789 | 2 000 657 |
| English – Xitsonga | 170 589 | 2 427 634 | 2 022 548 |

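
The parallel-data counts in Table 1 can be turned into two derived figures that are often useful when sizing a machine translation corpus: mean words per segment on each side, and the ratio of words in the language to English words. The following is a minimal sketch only. The numbers are copied from Table 1, while the names `PARALLEL_COUNTS` and `summarise` are hypothetical and not part of the described datasets.

```python
# Parallel-data counts copied from Table 1, keyed by the non-English language:
# (total segments, total words in the language, total English words)
PARALLEL_COUNTS = {
    "Afrikaans": (1_367_869, 20_270_021, 20_514_138),
    "isiNdebele": (128_382, 1_490_423, 2_067_749),
    "isiXhosa": (109_940, 1_264_390, 1_745_236),
    "isiZulu": (233_691, 2_910_800, 4_148_245),
    "Sepedi": (131_535, 2_822_916, 2_214_453),
    "Sesotho": (171_292, 3_465_480, 2_848_205),
    "Setswana": (238_475, 4_874_105, 3_583_483),
    "Siswati": (114_839, 1_423_414, 2_002_293),
    "Tshivenḓa": (110_367, 2_527_789, 2_000_657),
    "Xitsonga": (170_589, 2_427_634, 2_022_548),
}


def summarise(counts):
    """Return mean segment lengths and the word-count ratio per language."""
    summary = {}
    for language, (segments, lang_words, eng_words) in counts.items():
        summary[language] = {
            "lang_words_per_segment": lang_words / segments,
            "eng_words_per_segment": eng_words / segments,
            "lang_to_eng_ratio": lang_words / eng_words,
        }
    return summary


if __name__ == "__main__":
    for language, stats in summarise(PARALLEL_COUNTS).items():
        print(
            f"{language:<11}"
            f"{stats['lang_words_per_segment']:8.2f}"
            f"{stats['eng_words_per_segment']:8.2f}"
            f"{stats['lang_to_eng_ratio']:8.2f}"
        )
```

With these counts, the ratio is below 1 for Afrikaans, isiNdebele, isiXhosa, isiZulu and Siswati, and above 1 for Sepedi, Sesotho, Setswana, Tshivenḓa and Xitsonga.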
Table 2

Distribution of sources for the parallel datasets for all languages.

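Types of sources:

| Language | Magazines and newsletters | Translated data | Crawled data |
|---|---|---|---|
| Afrikaans | 3.42% | 62.64% | 33.94% |
| isiNdebele | 0% | 56.63% | 43.37% |
| isiXhosa | 63.79% | 5.13% | 31.08% |
| isiZulu | 29.90% | 12.70% | 57.40% |
| Sepedi | 26.72% | 10.80% | 62.48% |
| Sesotho | 49.64% | 12.34% | 38.02% |
| Setswana | 19.29% | 56.11% | 24.60% |
| Siswati | 0% | 49.53% | 50.47% |
| Tshivenḓa | 0% | 56.56% | 43.44% |
| Xitsonga | 0.23% | 48.72% | 51.05% |
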
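
As a quick consistency check on Table 2, the three source shares for each language should add up to 100%. The sketch below copies the percentages from the table, reading the first Xitsonga value as 0.23%. The names `SOURCE_SHARES` and `rows_not_summing_to_100` are hypothetical, and the tolerance is an assumption chosen to absorb rounding to two decimals.

```python
# Source shares copied from Table 2, in percent:
# (magazines and newsletters, translated data, crawled data)
SOURCE_SHARES = {
    "Afrikaans": (3.42, 62.64, 33.94),
    "isiNdebele": (0.0, 56.63, 43.37),
    "isiXhosa": (63.79, 5.13, 31.08),
    "isiZulu": (29.90, 12.70, 57.40),
    "Sepedi": (26.72, 10.80, 62.48),
    "Sesotho": (49.64, 12.34, 38.02),
    "Setswana": (19.29, 56.11, 24.60),
    "Siswati": (0.0, 49.53, 50.47),
    "Tshivenḓa": (0.0, 56.56, 43.44),
    "Xitsonga": (0.23, 48.72, 51.05),
}


def rows_not_summing_to_100(shares, tolerance=0.01):
    """Return the languages whose three shares do not total 100% within tolerance."""
    return [
        language
        for language, row in shares.items()
        if abs(sum(row) - 100.0) > tolerance
    ]


if __name__ == "__main__":
    bad_rows = rows_not_summing_to_100(SOURCE_SHARES)
    print("Inconsistent rows:", bad_rows if bad_rows else "none")
```

All ten rows pass this check, so the function returns an empty list for the values as given.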
DOI: https://doi.org/10.5334/johd.372 | Journal eISSN: 2059-481X
Language: English
Submitted on: Aug 12, 2025
Accepted on: Sep 8, 2025
Published on: Sep 19, 2025
Published by: Ubiquity Press
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year

© 2025 Tanja Gaustad, Cindy A. McKellar, Martin J. Puttkammer, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.