(1) Overview
Repository location
SADiLaR repository: https://repo.sadilar.org/handle/20.500.12185/1
Each language has a separate handle:
– Morphologically annotated corpus for isiNdebele: https://hdl.handle.net/20.500.12185/680
– Morphologically annotated corpus for isiXhosa: https://hdl.handle.net/20.500.12185/679
– Morphologically annotated corpus for isiZulu: https://hdl.handle.net/20.500.12185/678
– Morphologically annotated corpus for Siswati: https://hdl.handle.net/20.500.12185/677
– Morphologically annotated corpus for Sesotho: https://hdl.handle.net/20.500.12185/676
– Morphologically annotated corpus for Sepedi: https://hdl.handle.net/20.500.12185/675
– Morphologically annotated corpus for Setswana: https://hdl.handle.net/20.500.12185/674
– Morphologically annotated corpus for Tshivenḓa: https://hdl.handle.net/20.500.12185/673
– Morphologically annotated corpus for Xitsonga: https://hdl.handle.net/20.500.12185/672
Context
The data presented in this article was produced as part of the South African Centre for Digital Language Resources (SADiLaR) II (extension) project: Linguistic corpus enrichment for South African languages. It contains converted and updated morphologically annotated corpora for nine of the twelve official South African languages: data for the four languages with a conjunctive orthography, i.e. isiNdebele, isiXhosa, isiZulu, and Siswati, as well as for the five disjunctively written languages, i.e. Sesotho sa Leboa/Sepedi, Sesotho, Setswana, Tshivenḓa, and Xitsonga.
The (still) widely used annotated National Centre for Human Language Technology (NCHLT) corpora (Eiselen & Puttkammer, 2014) formed the basis of the data [1]. To encourage and enable cross-linguistic studies, and to guarantee compatibility with the more recent morphologically annotated data for the four conjunctively written Nguni languages presented in Gaustad & Puttkammer (2022), the existing morphological annotations were converted to updated morphological tags after a thorough revision of the relevant protocols. The annotations were subsequently checked for correctness by linguistic experts.
The resulting corpora are uniformly annotated for morphology across all nine languages: approximately 70,000 tokens for each of the disjunctively written languages and 45,000 tokens for each of the conjunctively written languages (approximately 100,000 tokens for the conjunctive languages if combined with the data previously published in Gaustad & Puttkammer (2022)). See Table 1 for an overview of the exact counts.
(2) Method
The dataset contains documents originally crawled from various South African web domains (mainly government sites, municipalities, and official publications) using HTTrack [2] and converted to plain text with publicly available modules. For this updated version, the only changes permitted to the original text were changes to tokenisation and corrections of spelling mistakes.
Steps
Before re-annotating the data, the tag set needed to be updated. To make the tag set as comparable as possible between the nine languages in our project, the linguistic experts discussed using the same tags for equivalent linguistic phenomena, and agreed on the most uniform tag set possible. Given the linguistic differences, each language nevertheless has a separate annotation protocol detailing the permissible morphological tags as well as containing examples to guide the annotation process. All tags contain a main category such as “NPre” to denote a nominal prefix or “Fut” for a future tense morpheme. Some morphological tags (for nouns, adjectives, various concords, pronouns, etc.) also include class information, for example [NPre10], which substantially increases the number of tags. Table 2 gives an overview of the total number of main morphology tags as well as the total number of distinct tags including class information per language.
Table 2
An overview of unique main morphology tags and total number of distinct tags per language.
| LANGUAGE | NUMBER OF UNIQUE MAIN MORPHOLOGY TAGS (WITHOUT CLASS INFORMATION) | TOTAL NUMBER OF MORPHOLOGY TAGS |
|---|---|---|
| isiNdebele | 71 | 401 |
| isiXhosa | 77 | 370 |
| isiZulu | 74 | 423 |
| Siswati | 69 | 378 |
| Sesotho | 74 | 292 |
| Sesotho sa Leboa/Sepedi | 65 | 319 |
| Setswana | 63 | 313 |
| Tshivenḓa | 64 | 439 |
| Xitsonga | 67 | 290 |
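The distinction between main tags and class-specific tags counted in Table 2 can be illustrated with a short Python sketch. It assumes that class information always appears as a trailing suffix consisting of digits, optionally followed by lowercase letters, or as the placeholder “??”. That pattern is an assumption made for illustration only; the annotation protocols remain the authority on tag structure, and the function names are hypothetical.

```python
import re

# Assumed tag shape: a main category followed by an optional class suffix
# (digits plus optional lowercase letters, or the "??" placeholder).
TAG_PATTERN = re.compile(r"(.*?)(\d+[a-z]*|\?\?)?")


def split_tag(tag):
    """Return (main_category, class_info); class_info is None if absent."""
    match = TAG_PATTERN.fullmatch(tag)
    return match.group(1), match.group(2)


def count_tags(tags):
    """Return (number of unique main tags, number of distinct full tags)."""
    distinct = set(tags)
    main = {split_tag(tag)[0] for tag in distinct}
    return len(main), len(distinct)


if __name__ == "__main__":
    sample = ["NPre10", "NPre14", "NRoot", "Fut", "AbsPron??"]
    print(split_tag("NPre10"))
    print(count_tags(sample))
```

Applied to all tags found in one language's corpus, `count_tags` would yield the two kinds of figures reported per row of Table 2, provided the suffix assumption holds for that language.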
Based on the revised protocols, the morphological annotations in the data were subsequently updated following these steps:
1. Retrieve a complete list of the old unique morphological tags for all nine languages.
2. Write a script that maps old tags to new tags based on the revised protocols, and apply it to the data. For some languages, class information was not previously included. In those cases, the missing class was marked with “??” in the new tag (e.g. [AbsPron??]) to indicate that class information needed to be supplied.
3. Pre-check token-morphological analysis pairs and add all unambiguous analyses to a dedicated list per language. These tokens are marked as correct in the annotation software LARA II [3] (Puttkammer, 2014).
4. Linguistic experts check all annotations not marked as correct in LARA II. This is done in batches ranging from 2,000 tokens (for conjunctively written languages) to 5,000 tokens (for disjunctively written languages).
5. After a corrected batch is received back, carry out quality control (QC) using a QC script followed by manual checking. In addition, add all new unambiguous token-analysis pairs to the dedicated list so that they do not need to be checked in the remaining batches.
6. Once all batches have been annotated, carry out a final round of QC.
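The tag-mapping and pre-checking steps above can be sketched roughly as follows in Python. This is not the project's actual script: the old tag names in `TAG_MAP`, the placeholder tokens, and all function names are hypothetical and serve only to illustrate the idea of replacing bracketed tags, flagging analyses that still lack class information, and setting aside pairs already confirmed as unambiguous.

```python
import re

# Hypothetical old-to-new mapping; the old tag names are invented for
# illustration and do not come from the NCHLT tag set.
TAG_MAP = {
    "OldNounPrefix10": "NPre10",
    "OldAbsPron": "AbsPron??",  # class not available in the old tag
}

TAG_IN_BRACKETS = re.compile(r"\[([^\[\]]+)\]")


def remap_analysis(analysis, tag_map):
    """Replace each bracketed tag via tag_map; collect tags with no mapping."""
    unmapped = set()

    def replace(match):
        old_tag = match.group(1)
        if old_tag not in tag_map:
            unmapped.add(old_tag)
            return match.group(0)  # leave unknown tags untouched
        return "[" + tag_map[old_tag] + "]"

    return TAG_IN_BRACKETS.sub(replace, analysis), unmapped


def needs_class(analysis):
    """True if any tag still carries the '??' class placeholder."""
    return "??]" in analysis


def split_for_review(pairs, confirmed):
    """Separate (token, analysis) pairs already confirmed from the rest."""
    auto_ok, to_review = [], []
    for pair in pairs:
        (auto_ok if pair in confirmed else to_review).append(pair)
    return auto_ok, to_review


if __name__ == "__main__":
    # "aaa" and "bbb" are placeholder strings, not real corpus tokens.
    new, missing = remap_analysis("aaa[OldAbsPron]-bbb[Unknown]", TAG_MAP)
    print(new, missing, needs_class(new))
```

Returning the set of unmapped tags alongside the converted analysis makes gaps in the mapping visible, which is one plausible way a QC script could report problems before the data reaches the annotators.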
The final data is given as one text (.txt) file per language, where each line consists of a token and the corresponding morphological analysis separated by a tab character. The data also includes line numbering to mark sentences. Table 3 shows an example of the annotated data for Xitsonga. Each morpheme is separated by “-” and followed by a morphology tag between square brackets – for example, vu[NPre14]-tirheli[NRoot], indicating that there are two morphemes in “vutirheli”, namely a nominal prefix of class 14 [NPre14] “vu” as well as a noun root [NRoot] “tirheli”.
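As a minimal sketch of how the format described above might be read in Python, the following splits a line into the token and its list of (morpheme, tag) pairs. It assumes that a data line contains exactly one tab and that morphemes themselves contain no square brackets; lines with any other shape (for instance, the sentence numbering, whose exact layout is not specified here) are simply skipped. The function names are illustrative.

```python
import re

# One morpheme followed by its tag in square brackets, e.g. "vu[NPre14]".
MORPHEME_WITH_TAG = re.compile(r"([^\[\]]+)\[([^\[\]]+)\]")


def parse_analysis(analysis):
    """Split an analysis string into a list of (morpheme, tag) pairs."""
    pairs = []
    for match in MORPHEME_WITH_TAG.finditer(analysis):
        # Drop the "-" separator that precedes every non-initial morpheme.
        morpheme = match.group(1).lstrip("-")
        pairs.append((morpheme, match.group(2)))
    return pairs


def parse_line(line):
    """Return (token, [(morpheme, tag), ...]) or None for non-data lines."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 2:
        return None  # assumed: sentence markers or blank lines
    token, analysis = fields
    return token, parse_analysis(analysis)


if __name__ == "__main__":
    print(parse_line("vutirheli\tvu[NPre14]-tirheli[NRoot]"))
```

For the Xitsonga example from the text, this yields the token “vutirheli” with the pairs (“vu”, “NPre14”) and (“tirheli”, “NRoot”). Tokens that themselves contain a hyphen would need extra care, since the sketch strips leading hyphens from morphemes.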
(3) Dataset Description
Object name
Morphologically annotated corpus for isiNdebele
Morphologically annotated corpus for isiXhosa
Morphologically annotated corpus for isiZulu
Morphologically annotated corpus for Siswati
Morphologically annotated corpus for Sesotho
Morphologically annotated corpus for Sepedi
Morphologically annotated corpus for Setswana
Morphologically annotated corpus for Tshivenḓa
Morphologically annotated corpus for Xitsonga
Format names and versions
UTF-8 encoded .txt files
Creation dates
2022-04-01 – 2023-08-31
Dataset creators
Jaco du Toit (Independent) – Data curation
Sunny Gent (Centre for Text Technology (CTexT), North-West University) – Project and annotation management
Martin Puttkammer (Centre for Text Technology (CTexT), North-West University) – Funding acquisition
Language
isiNdebele (NR), isiXhosa (XH), isiZulu (ZU), Siswati (SS), Sesotho sa Leboa/Sepedi (NSO), Sesotho (ST), Setswana (TN), Tshivenḓa (VE), Xitsonga (TS)
License
CC BY 4.0 – https://creativecommons.org/licenses/by/4.0/
Repository name
SADiLaR Language Resource Repository
Publication date
2024-03-27
(4) Reuse Potential
The corpora are primarily aimed at the development and evaluation of Natural Language Processing (NLP) core technologies and applications for the represented languages. When building morphological analysers and/or segmenters, the described data can be used to train and test the tools. Morphological information can also be incorporated into spelling checkers to improve the word recognition rate without increasing the size of the lexicon. In addition, the data can form the basis for language-specific as well as cross-language corpus linguistic studies and investigations of morphological phenomena, potentially leading to new insights into word creation and usage, morphological productivity and more. As the corpora are UTF-8 encoded text files, they can be used with a number of generic analysis tools (e.g. R, Python, SPSS).
Notes
[1] These datasets are available for download at https://repo.sadilar.org/handle/20.500.12185/1 as “NCHLT <LANGUAGE> Annotated Text Corpora”.
[3] Available at https://hdl.handle.net/20.500.12185/432.
Acknowledgements
We are grateful to our linguistic experts for their diligent work on the annotations.
Funding Information
This research was made possible with support from the South African Centre for Digital Language Resources (SADiLaR). SADiLaR is a research infrastructure established by the Department of Science and Innovation (DSI) of the South African government as part of the South African Research Infrastructure Roadmap (SARIR).
Competing Interests
The authors have no competing interests to declare.
Author Contributions
Tanja Gaustad: Conceptualisation; Data curation; Project administration; Quality Control; Writing – original draft; Writing – review & editing.
Cindy A McKellar: Validation; Writing – review & editing.
