Updated Morphologically Annotated Corpora for 9 South African Languages

Tanja Gaustad; Cindy A. McKellar

doi:10.5334/johd.211

Abstract

The dataset described in this article presents converted and updated corpora for nine of the twelve official South African languages. After a revision of the morphological annotation protocols, the existing National Centre for Human Language Technology (NCHLT) corpora (Eiselen & Puttkammer, 2014) were converted to the updated morphological tags and subsequently checked for correctness by linguistic experts. The resulting corpora are uniformly annotated for morphology across all nine languages, amounting to approximately 70,000 tokens for the five disjunctively written languages and 45,000 tokens for the four conjunctively written languages. The corpora are primarily aimed at the development and evaluation of core Natural Language Processing (NLP) technologies. In addition, the data can be used for language-specific and cross-language comparative corpus linguistic studies, as well as for corpus-based investigations of morphological phenomena in the included languages.

References

1. Eiselen, R., & Puttkammer, M. J. (2014). Developing Text Resources for Ten South African Languages. Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC 2014), Reykjavik, Iceland.
2. Gaustad, T., & Puttkammer, M. J. (2022, April). Linguistically annotated dataset for four official South African languages with a conjunctive orthography: isiNdebele, isiXhosa, isiZulu, and Siswati. Data in Brief, 41. DOI: 10.1016/j.dib.2022.107994
3. Puttkammer, M. J. (2014). Efficient development of human language technology resources for resource-scarce languages (PhD dissertation, North-West University).
