
Updated Morphologically Annotated Corpora for 9 South African Languages

Open Access | Jun 2024

Figures & Tables

Table 1

Overview of total token counts for all nine languages included in the dataset.

LANGUAGE                   NUMBER OF TOKENS
isiNdebele                 42,335
isiXhosa                   46,465
isiZulu                    45,933
Siswati                    43,568
Sesotho                    73,727
Sesotho sa Leboa/Sepedi    73,031
Setswana                   72,609
Tshivenḓa                  66,487
Xitsonga                   69,584
Table 2

An overview of unique main morphology tags and total number of distinct tags per language.

LANGUAGE                   NUMBER OF UNIQUE MAIN MORPHOLOGY TAGS (WITHOUT CLASS INFORMATION)    TOTAL NUMBER OF MORPHOLOGY TAGS
isiNdebele                 71                                                                   401
isiXhosa                   77                                                                   370
isiZulu                    74                                                                   423
Siswati                    69                                                                   378
Sesotho                    74                                                                   292
Sesotho sa Leboa/Sepedi    65                                                                   319
Setswana                   63                                                                   313
Tshivenḓa                  64                                                                   439
Xitsonga                   67                                                                   290
Table 3

A Xitsonga example of morphologically annotated data.

TOKEN          MORPHOLOGICAL ANALYSIS
<LINE 1>
Xikongomelo    xi[NPre7]-kongomelo[NRoot]
xa             xa[PossConc7]
website        website[Foreign]
ya             ya[PossConc9]
Vutirheli      vu[NPre14]-tirheli[NRoot]
bya            bya[PossConc14]
Afrika         Afrika[ProperName]
Dzonga         ri[NPre5]-dzonga[NRoot]
,              ,[Punc]
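
The annotation format in Table 3 can be read as a sequence of morpheme[Tag] units joined by hyphens. The following Python sketch shows one way such an analysis string might be split into (morpheme, tag) pairs, and how a tag might be reduced to its main tag without class information, as counted in Table 2. The function names parse_analysis and main_tag are hypothetical, and the rule that class information is a trailing number (optionally followed by one lowercase letter) is an assumption inferred from tags such as NPre7 and PossConc14, not something documented on this page.

```python
import re

# Assumed pattern: each unit is "morpheme[Tag]", as in the Table 3 examples.
UNIT_RE = re.compile(r"([^\[\]]+)\[([^\[\]]+)\]")

# Assumed rule: class information is a trailing number, optionally
# followed by one lowercase letter (e.g. "7", "14", "1a").
CLASS_RE = re.compile(r"\d+[a-z]?$")


def parse_analysis(analysis):
    """Split 'xi[NPre7]-kongomelo[NRoot]' into (morpheme, tag) pairs."""
    pairs = []
    for morpheme, tag in UNIT_RE.findall(analysis):
        # Drop the hyphen that joins this unit to the previous one,
        # but keep a lone "-" (it could itself be a punctuation token).
        if morpheme.startswith("-") and len(morpheme) > 1:
            morpheme = morpheme[1:]
        pairs.append((morpheme, tag))
    return pairs


def main_tag(tag):
    """Strip the assumed trailing class information from a tag."""
    return CLASS_RE.sub("", tag)


if __name__ == "__main__":
    for analysis in ["xi[NPre7]-kongomelo[NRoot]", "bya[PossConc14]", ",[Punc]"]:
        pairs = parse_analysis(analysis)
        print(analysis, pairs, [main_tag(t) for _, t in pairs])
```

Under these assumptions, NPre7 and NPre14 both reduce to the main tag NPre, which would explain why the number of main tags in Table 2 is much smaller than the total number of tags per language.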
DOI: https://doi.org/10.5334/johd.211 | Journal eISSN: 2059-481X
Language: English
Submitted on: Apr 2, 2024
Accepted on: May 10, 2024
Published on: Jun 11, 2024
Published by: Ubiquity Press
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year

© 2024 Tanja Gaustad, Cindy A. McKellar, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.