Updated Morphologically Annotated Corpora for 9 South African Languages

Open Access
Jun 2024

Abstract

The dataset described in this article presents converted and updated corpora for nine of the twelve official South African languages. After a revision of the morphological annotation protocols, the existing National Centre for Human Language Technology (NCHLT) corpora (Eiselen & Puttkammer, 2014) were converted to the updated morphological tags and subsequently checked for correctness by linguistic experts. The resulting corpora are uniformly annotated for morphology across all nine languages, amounting to approximately 70,000 tokens for the five disjunctively written languages and 45,000 tokens for the four conjunctively written languages. The corpora are primarily aimed at the development and evaluation of Natural Language Processing (NLP) core technologies. In addition, the data can be used for language-specific and cross-language comparative corpus linguistic studies, as well as for corpus-based investigations of morphological phenomena in the included languages.

DOI: https://doi.org/10.5334/johd.211 | Journal eISSN: 2059-481X
Language: English
Submitted on: Apr 2, 2024
Accepted on: May 10, 2024
Published on: Jun 11, 2024
Published by: Ubiquity Press
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year

© 2024 Tanja Gaustad, Cindy A. McKellar, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.