Table 1
Overview of data counts for the described datasets: total number of segments, total words in the relevant language and, for the parallel data, total English words.
| LANGUAGE | TOTAL SEGMENTS | TOTAL WORDS IN LANGUAGE | TOTAL ENGLISH WORDS |
|---|---|---|---|
| Monolingual Datasets | | | |
| Afrikaans | 1 191 904 | 20 276 850 | |
| English | 8 832 451 | 188 252 040 | |
| isiNdebele | 82 801 | 1 001 959 | |
| isiXhosa | 341 330 | 4 328 245 | |
| isiZulu | 238 699 | 3 403 927 | |
| Sepedi | 171 774 | 3 448 592 | |
| Sesotho | 216 854 | 4 242 075 | |
| Setswana | 268 615 | 5 205 832 | |
| Siswati | 138 651 | 1 536 356 | |
| Tshivenḓa | 141 426 | 2 870 916 | |
| Xitsonga | 200 900 | 3 145 599 | |
| Parallel Datasets | | | |
| English – Afrikaans | 1 367 869 | 20 270 021 | 20 514 138 |
| English – isiNdebele | 128 382 | 1 490 423 | 2 067 749 |
| English – isiXhosa | 109 940 | 1 264 390 | 1 745 236 |
| English – isiZulu | 233 691 | 2 910 800 | 4 148 245 |
| English – Sepedi | 131 535 | 2 822 916 | 2 214 453 |
| English – Sesotho | 171 292 | 3 465 480 | 2 848 205 |
| English – Setswana | 238 475 | 4 874 105 | 3 583 483 |
| English – Siswati | 114 839 | 1 423 414 | 2 002 293 |
| English – Tshivenḓa | 110 367 | 2 527 789 | 2 000 657 |
| English – Xitsonga | 170 589 | 2 427 634 | 2 022 548 |
Table 2
Distribution of sources for the parallel datasets for all languages.
| LANGUAGE | TYPES OF SOURCES | | |
|---|---|---|---|
| | MAGAZINES AND NEWSLETTERS | TRANSLATED DATA | CRAWLED DATA |
| Afrikaans | 3.42% | 62.64% | 33.94% |
| isiNdebele | 0% | 56.63% | 43.37% |
| isiXhosa | 63.79% | 5.13% | 31.08% |
| isiZulu | 29.90% | 12.70% | 57.40% |
| Sepedi | 26.72% | 10.80% | 62.48% |
| Sesotho | 49.64% | 12.34% | 38.02% |
| Setswana | 19.29% | 56.11% | 24.60% |
| Siswati | 0% | 49.53% | 50.47% |
| Tshivenḓa | 0% | 56.56% | 43.44% |
| Xitsonga | 0.23% | 48.72% | 51.05% |
