Updated Morphologically Annotated Corpora for 9 South African Languages

Open Access
|Jun 2024

(1) Overview

Repository location

SADiLaR repository: https://repo.sadilar.org/handle/20.500.12185/1

Each language has a separate handle:

Context

The data presented in this article was produced as part of the South African Centre for Digital Language Resources (SADiLaR) II (extension) project: Linguistic corpus enrichment for South African languages. It contains converted and updated morphologically annotated corpora for nine of the twelve official South African languages: data for the four languages with a conjunctive orthography, i.e. isiNdebele, isiXhosa, isiZulu, and Siswati, as well as for the five disjunctively written languages, i.e. Sesotho sa Leboa/Sepedi, Sesotho, Setswana, Tshivenḓa, and Xitsonga.

The (still) widely used annotated National Centre for Human Language Technology (NCHLT) corpora (Eiselen & Puttkammer, 2014) formed the basis of the data.[1] To encourage and enable cross-linguistic studies, and to guarantee compatibility with the more recent morphologically annotated data for the four conjunctively written Nguni languages presented in Gaustad & Puttkammer (2022), the existing morphological annotations were converted to updated morphological tags after a thorough revision of the relevant protocols. The annotations were subsequently checked for correctness by linguistic experts.

The resulting corpora are uniformly annotated for morphology across all nine languages: approximately 70,000 tokens per language for the disjunctively written languages and approximately 45,000 tokens per language for the conjunctively written languages (approximately 100,000 tokens per conjunctively written language if combined with the previously published data in Gaustad & Puttkammer, 2022). See Table 1 for an overview of the exact counts.

Table 1

Overview of total token counts for all nine languages included in the dataset.

| LANGUAGE | NUMBER OF TOKENS |
| --- | --- |
| isiNdebele | 42,335 |
| isiXhosa | 46,465 |
| isiZulu | 45,933 |
| Siswati | 43,568 |
| Sesotho | 73,727 |
| Sesotho sa Leboa/Sepedi | 73,031 |
| Setswana | 72,609 |
| Tshivenḓa | 66,487 |
| Xitsonga | 69,584 |
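
As a quick arithmetic check on the approximate per-language figures quoted above, the counts in Table 1 can be summed and averaged per orthography group. The following is only a small sketch; the dictionary and function names are illustrative and not part of the dataset.

```python
# Token counts copied from Table 1, grouped by orthography.
conjunctive = {
    "isiNdebele": 42_335,
    "isiXhosa": 46_465,
    "isiZulu": 45_933,
    "Siswati": 43_568,
}
disjunctive = {
    "Sesotho": 73_727,
    "Sesotho sa Leboa/Sepedi": 73_031,
    "Setswana": 72_609,
    "Tshivenḓa": 66_487,
    "Xitsonga": 69_584,
}


def summarise(counts):
    """Return (total, mean) token counts for one group of languages."""
    total = sum(counts.values())
    return total, total / len(counts)


conj_total, conj_mean = summarise(conjunctive)
disj_total, disj_mean = summarise(disjunctive)

print(f"conjunctive: total={conj_total:,} mean={conj_mean:,.0f}")
print(f"disjunctive: total={disj_total:,} mean={disj_mean:,.0f}")
```

The group means come to roughly 44,600 tokens for the conjunctively written languages and roughly 71,100 tokens for the disjunctively written languages, in line with the rounded figures given in the text.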

(2) Method

The dataset contains documents originally crawled from various South African web domains (mainly government sites, municipalities, and official publications) using HTTrack and converted to plain text with publicly available modules. For this updated version, only changes to tokenisation as well as correction of spelling mistakes were allowed on the original text.

Steps

Before re-annotating the data, the tag set needed to be updated. To make the tag set as comparable as possible between the nine languages in our project, the linguistic experts discussed using the same tags for equivalent linguistic phenomena, and agreed on the most uniform tag set possible. Given the linguistic differences, each language nevertheless has a separate annotation protocol detailing the permissible morphological tags as well as containing examples to guide the annotation process. All tags contain a main category such as “NPre” to denote a nominal prefix or “Fut” for a future tense morpheme. Some morphological tags (for nouns, adjectives, various concords, pronouns, etc.) also include class information, for example [NPre10], which substantially increases the number of tags. Table 2 gives an overview of the total number of main morphology tags as well as the total number of distinct tags including class information per language.

Table 2

An overview of unique main morphology tags and total number of distinct tags per language.

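| LANGUAGE | NUMBER OF UNIQUE MAIN MORPHOLOGY TAGS (WITHOUT CLASS INFORMATION) | TOTAL NUMBER OF MORPHOLOGY TAGS |
| --- | --- | --- |
| isiNdebele | 71 | 401 |
| isiXhosa | 77 | 370 |
| isiZulu | 74 | 423 |
| Siswati | 69 | 378 |
| Sesotho | 74 | 292 |
| Sesotho sa Leboa/Sepedi | 65 | 319 |
| Setswana | 63 | 313 |
| Tshivenḓa | 64 | 439 |
| Xitsonga | 67 | 290 |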
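
The distinction between main tags and full tags with class information can be illustrated with a small helper that splits a tag such as NPre10 into its main category and class. This is a sketch under an assumed tag shape (letters followed by an optional class made of digits plus an optional lowercase letter, or the "??" placeholder); the official annotation protocols may allow other forms, and the names `TAG_PATTERN` and `split_tag` are hypothetical.

```python
import re

# Assumed tag shape (not taken from the protocols): a main category made of
# letters, optionally followed by class information, which is either digits
# plus an optional lowercase letter (e.g. "10", "1a") or the "??" placeholder.
TAG_PATTERN = re.compile(r"^(?P<main>[A-Za-z]+?)(?P<cls>\d+[a-z]?|\?\?)?$")


def split_tag(tag):
    """Split a tag into (main category, class or None)."""
    match = TAG_PATTERN.match(tag)
    if match is None:
        raise ValueError(f"unexpected tag format: {tag!r}")
    return match.group("main"), match.group("cls")


def count_tags(tags):
    """Return (number of unique main tags, number of distinct full tags)."""
    distinct = set(tags)
    main_tags = {split_tag(tag)[0] for tag in distinct}
    return len(main_tags), len(distinct)


example_tags = ["NPre10", "NPre14", "NRoot", "Fut", "PossConc7", "PossConc9"]
n_main, n_total = count_tags(example_tags)
```

For the six example tags, `count_tags` reports four main categories and six distinct tags, which mirrors on a tiny scale how class information inflates the totals in Table 2.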

Based on the revised protocols, the morphological annotations in the data were subsequently updated following these steps:

  1. Retrieve a complete list of the old unique morphological tags for all nine languages.

  2. Write a mapping script from old tags to new tags based on the revised protocols and apply it to the data. For some languages, class information was not included previously. In those cases, the missing class was marked with “??” in the new tag (e.g. [AbsPron??]) to flag that class information still needed to be supplied.

  3. Pre-check token-morphological analysis pairs and add all unambiguous analyses to a dedicated list per language. These tokens are marked as correct in the annotation software LARA II (Puttkammer, 2014).

  4. Linguistic experts check all annotations not marked as correct in LARA II. This is done in batches of 2,000 tokens (for conjunctively written languages) to 5,000 tokens (for disjunctively written languages).

  5. After a corrected batch is received back, carry out quality control (QC) using a QC script followed by manual checking. Also, add all new unambiguous token-analysis pairs to the dedicated file so they will not need to be checked in the remaining batches of data.

  6. Once all batches have been annotated, do a final round of QC.
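
Steps 2 and 3 above can be sketched roughly as follows. This is not the project's actual script: the old tag names in `TAG_MAP`, the `NEEDS_CLASS` set, and both function names are hypothetical, and the criterion for "unambiguous" (a token that only ever receives one analysis in the checked data) is an assumption, since the article does not spell it out.

```python
from collections import defaultdict

# Hypothetical old-to-new mapping; the real one follows the revised protocols.
TAG_MAP = {"NounPrefix": "NPre", "AbsPron": "AbsPron", "Future": "Fut"}

# Hypothetical set of new main tags that must carry class information.
NEEDS_CLASS = {"NPre", "AbsPron"}


def map_tag(old_main, old_class=None):
    """Map an old tag to a new tag, inserting '??' where the class is missing.

    Raises KeyError for old tags absent from TAG_MAP, so that unmapped tags
    surface immediately instead of being silently passed through.
    """
    new_main = TAG_MAP[old_main]
    if new_main in NEEDS_CLASS:
        return f"{new_main}{old_class if old_class else '??'}"
    return new_main


def unambiguous_pairs(checked_pairs):
    """Return {token: analysis} for tokens that have exactly one analysis.

    Assumption: 'unambiguous' means the token never occurs with a second,
    different analysis in the already checked data.
    """
    analyses = defaultdict(set)
    for token, analysis in checked_pairs:
        analyses[token].add(analysis)
    return {
        token: next(iter(found))
        for token, found in analyses.items()
        if len(found) == 1
    }
```

For example, `map_tag("AbsPron")` yields `AbsPron??`, which an annotator would then complete with the correct class, while `map_tag("NounPrefix", "10")` yields `NPre10`. Pairs returned by `unambiguous_pairs` correspond to the entries that could be pre-marked as correct before a batch is sent to the linguistic experts.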

The final data is given as one text (.txt) file per language, where each line consists of a token and the corresponding morphological analysis separated by a tab character. The data also includes line numbering to mark sentences. Table 3 shows an example of the annotated data for Xitsonga. Each morpheme is separated by “-” and followed by a morphology tag between square brackets – for example, vu[NPre14]-tirheli[NRoot], indicating that there are two morphemes in “vutirheli”, namely a nominal prefix of class 14 [NPre14] “vu” as well as a noun root [NRoot] “tirheli”.

Table 3

A Xitsonga example of morphologically annotated data.

| TOKEN | MORPHOLOGICAL ANALYSIS |
| --- | --- |
| <LINE 1> | |
| Xikongomelo | xi[NPre7]-kongomelo[NRoot] |
| xa | xa[PossConc7] |
| website | website[Foreign] |
| ya | ya[PossConc9] |
| Vutirheli | vu[NPre14]-tirheli[NRoot] |
| bya | bya[PossConc14] |
| Afrika | Afrika[ProperName] |
| Dzonga | ri[NPre5]-dzonga[NRoot] |
| , | ,[Punc] |
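
A reader for this format could look roughly like the sketch below. It assumes that sentence markers appear on their own line in the form `<LINE n>`, as Table 3 suggests, and that morpheme boundaries are the hyphens that directly follow a closing square bracket; the function names are illustrative and not part of the dataset.

```python
import re

# Assumed sentence marker format, based on the "<LINE 1>" row in Table 3.
SENTENCE_MARKER = re.compile(r"^<LINE \d+>")

# One morpheme: a surface form followed by a tag in square brackets.
MORPHEME = re.compile(r"^(?P<form>.+)\[(?P<tag>[^\[\]]+)\]$")


def parse_analysis(analysis):
    """Split 'vu[NPre14]-tirheli[NRoot]' into (form, tag) pairs."""
    morphemes = []
    # Only split on hyphens right after "]", so a hyphen that is itself a
    # token (e.g. "-[Punc]") is not treated as a boundary.
    for chunk in re.split(r"(?<=\])-", analysis):
        match = MORPHEME.match(chunk)
        if match is None:
            raise ValueError(f"cannot parse morpheme: {chunk!r}")
        morphemes.append((match.group("form"), match.group("tag")))
    return morphemes


def read_sentences(lines):
    """Group tab-separated token/analysis lines into sentences."""
    sentences = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if SENTENCE_MARKER.match(line):
            sentences.append([])
            continue
        if not sentences:
            sentences.append([])  # tolerate data before the first marker
        token, analysis = line.split("\t")
        sentences[-1].append((token, parse_analysis(analysis)))
    return sentences


sample = [
    "<LINE 1>",
    "Xikongomelo\txi[NPre7]-kongomelo[NRoot]",
    "xa\txa[PossConc7]",
    ",\t,[Punc]",
]
sentences = read_sentences(sample)
```

To process a full corpus file, the same function can be applied to an open file handle, for instance `read_sentences(open(path, encoding="utf-8"))`, where `path` points to one of the per-language .txt files.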

(3) Dataset Description

Object name

Morphologically annotated corpus for isiNdebele

Morphologically annotated corpus for isiXhosa

Morphologically annotated corpus for isiZulu

Morphologically annotated corpus for Siswati

Morphologically annotated corpus for Sesotho

Morphologically annotated corpus for Sepedi

Morphologically annotated corpus for Setswana

Morphologically annotated corpus for Tshivenḓa

Morphologically annotated corpus for Xitsonga

Format names and versions

UTF-8 encoded .txt files

Creation dates

2022-04-01 – 2023-08-31

Dataset creators

Jaco du Toit (Independent) – Data curation

Sunny Gent (Centre for Text Technology (CTexT), North-West University) – Project and annotation management

Martin Puttkammer (Centre for Text Technology (CTexT), North-West University) – Funding acquisition

Language

isiNdebele (NR), isiXhosa (XH), isiZulu (ZU), Siswati (SS), Sesotho sa Leboa/Sepedi (NSO), Sesotho (ST), Setswana (TN), Tshivenḓa (VE), Xitsonga (TS)

License

CC BY 4.0 – https://creativecommons.org/licenses/by/4.0/

Repository name

SADiLaR Language Resource Repository

Publication date

2024-03-27

(4) Reuse Potential

The corpora are primarily aimed at the development and evaluation of Natural Language Processing (NLP) core technologies and applications for the represented languages. When building morphological analysers and/or segmenters, the described data can be used to train and test the tools. Morphological information can also be incorporated into spelling checkers to improve the word recognition rate without increasing the size of the lexicon. In addition, the data can form the basis for language-specific as well as cross-language corpus linguistic studies and investigations of morphological phenomena, potentially leading to new insights into word creation and usage, morphological productivity and more. As the corpora are UTF-8 encoded text files, they can be used with a number of generic analysis tools (e.g. R, Python, SPSS).
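
As one possible starting point for such reuse in Python, the sketch below derives token/segmentation pairs for training a segmenter and counts tag frequencies for a corpus study. The three example pairs are taken from Table 3; the function names are hypothetical, and a real experiment would read the pairs from the corpus files instead.

```python
import re
from collections import Counter

# A tag is any bracketed label directly following a morpheme.
TAG = re.compile(r"\[([^\[\]]+)\]")


def segmentation(analysis):
    """Drop the bracketed tags, keeping the hyphen-separated morphemes."""
    return TAG.sub("", analysis)


def tag_counts(analyses):
    """Count how often each morphological tag occurs."""
    return Counter(tag for analysis in analyses for tag in TAG.findall(analysis))


# Example pairs copied from Table 3.
pairs = [
    ("Vutirheli", "vu[NPre14]-tirheli[NRoot]"),
    ("Dzonga", "ri[NPre5]-dzonga[NRoot]"),
    ("xa", "xa[PossConc7]"),
]

training_pairs = [(token, segmentation(analysis)) for token, analysis in pairs]
counts = tag_counts(analysis for _, analysis in pairs)
```

Note that the morphemes in an analysis do not always concatenate to the surface token: in Table 3, "Dzonga" is analysed as ri[NPre5]-dzonga[NRoot]. A segmenter trained on such pairs would therefore have to learn forms that differ from the surface string, rather than only where to place boundaries in it.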

Notes

[1] These datasets are available for download at https://repo.sadilar.org/handle/20.500.12185/1 as “NCHLT <LANGUAGE> Annotated Text Corpora”.

Acknowledgements

We are grateful to our linguistic experts for their diligent work on the annotations.

Funding Information

This research was made possible with support from the South African Centre for Digital Language Resources (SADiLaR). SADiLaR is a research infrastructure established by the Department of Science and Innovation (DSI) of the South African government as part of the South African Research Infrastructure Roadmap (SARIR).

Competing Interests

The authors have no competing interests to declare.

Author Contributions

Tanja Gaustad: Conceptualisation; Data curation; Project administration; Quality Control; Writing – original draft; Writing – review & editing.

Cindy A. McKellar: Validation; Writing – review & editing.

DOI: https://doi.org/10.5334/johd.211 | Journal eISSN: 2059-481X
Language: English
Submitted on: Apr 2, 2024
Accepted on: May 10, 2024
Published on: Jun 11, 2024
Published by: Ubiquity Press
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year

© 2024 Tanja Gaustad, Cindy A. McKellar, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.