The Reusability of Resources in Language-Specific Contexts: The SADiLaR Repository as a Case Study

Benito Trollip; Michelle White

doi:10.5334/johd.523

(1) Context and motivation

A central requirement for conducting research is the availability of resources, whether it is in the form of practical aspects like funding or research-specific aspects like primary and/or secondary data sources. When considering the availability of resources, one needs to gauge what available means. An online resource repository where datasets are hosted would possibly constitute a platform for researchers searching for available resources. Finding datasets on a certain topic that they are interested in could motivate a researcher to build on it, rather than taking up the burden of generating a dataset from scratch. The availability of existing resources on an online platform could therefore enable more research. The fact that a resource repository exists unfortunately does not guarantee dataset submissions and/or reuse of those resources.

The focus of this contribution is on the reusability of language resources. The discussion analyses the reuse of resources in the repository of the South African Centre for Digital Language Resources (SADiLaR). This repository is a specialist repository as it is not institution-specific but rather focused on datasets produced for/by projects or research that link with natural language processing (NLP) and digital humanities (DH). Framing the SADiLaR repository as a distributor of these types of resources links directly to SADiLaR’s enabling function within the broader DH community.¹

Collecting digital language resources within the South African research landscape was the objective of an audit in 2009 by the South African National Human Language Technology Network (NHN). The Research Management Agency (RMA) afterwards populated an index of digital language resources; these are the resources that SADiLaR’s repository was initially populated with in 2016 (when SADiLaR was established). SADiLaR was officially launched in 2019; these differing dates and institutions already foreshadow difficulties in tracking reuse of the digital language resources in the SADiLaR repository.

The need for a repository is paramount if one considers the overall lack of open, available resources for, and in, the official languages of South Africa. Work from Marivate (2020) and Mlambo and Matfunjwa (2025) highlights the importance of developing resources in these languages and making them available. Furthermore, the availability of resources could advance and encourage research in DH, an emerging field in South Africa (Van der Walt et al., 2023).

Beyond South Africa, several established infrastructures support the curation and dissemination of language resources. Within Europe, the Common Language Resources and Technology Infrastructure (CLARIN), a European research infrastructure that provides standardised access to digital language resources and tools through a network of certified centres, includes B-centres such as The Language Archive (TLA) and Språkbanken, which provide interoperable, standards-based access to linguistic data. In addition, general-purpose repositories such as Zenodo and Figshare, as well as institutional platforms based on Dataverse, are widely used for hosting resources. However, these infrastructures often have limited coverage of African languages, which emphasises the need for focused initiatives such as SADiLaR. The availability of South African language data links with the global shift toward developing language technology like automatic speech recognition (ASR) systems, machine translation (MT) and text-to-speech (TTS) systems.

Having resources, whether they are static datasets or executable files, in an online repository links to the FAIR (Findable, Accessible, Interoperable, Reusable) principles (Wilkinson et al., 2016). The FAIR principles connect directly to CLARIN’s mission to make sure “All digital language resources and tools from all over Europe and beyond are accessible through a single sign-on on-line environment for the support of researchers in the humanities and social sciences” (De Jong et al., 2018). This is directly relevant to SADiLaR, which represents South Africa in the CLARIN infrastructure. In terms of findability, persistent identifiers are assigned to each resource in the SADiLaR repository, while the resources are accessible insofar as, at the very least, metadata are available for each resource. The resources can also be found through the CLARIN Virtual Language Observatory (https://vlo.clarin.eu). With regard to interoperability, it is important to note that there is no strict limit on the format of files that can be hosted in the repository: the choice of format remains at the discretion of the data creator(s). Reusability links to all these principles and is present insofar as the resources are licensed by the individual or organisation depositing data into the repository.

The reusability of language resources, and tracking that reuse, are accompanied by specific challenges. Van Erp (2012) discusses general barriers for the reuse of linguistic resources: the task-specific design of language resources, the incompatibility of mark-up languages across systems and/or domains, the use of different conceptual models (like differing part-of-speech tag sets), a lack of machine-readable definitions of terminology used, and difficulty in obtaining accurate metadata about the creation of the dataset. The focus of Van Erp’s work was not on tracking reuse per se. Referring to the reuse of CLARIN resources, De Jong et al. (2018) state that reuse is possible only if the resources can be easily found and understood by researchers. In an effort to promote sustainability and accountability, Weber (2021) argues for the implementation of a citation tracking method, making use of CLARIN as a case study. He highlights that the method has not been implemented yet; a particular challenge he notes is the different instances of accountability when it comes to enforcing data citations. Existing works from a South African perspective could not be found.

In addition to the FAIR principles, data citation practices are guided within the CLARIN Citation Guidelines framework (Mathiessen & Lenardič, 2025). These guidelines, developed within the CLARIN infrastructure, recommend the use of persistent identifiers (such as Handles or DOIs) for datasets and emphasise the direct citation of data as primary research objects rather than relying solely on secondary publications or informal references. These recommendations provide a relevant benchmark for assessing citation practices and the potential for reuse of resources hosted in the repository. These guidelines are linked to the analysis of reusability later in this article.

At the time of writing, the SADiLaR repository had three collections (i.e. Resource Index, Resource Catalogue and Student Data Repository) which comprise 538 unique resources.² To reach this total, one needs to view the resources by title (considering that titles are unique); some of the resources are part of more than one collection and therefore adding the collections together does not reflect the number of resources. Not all the resources include downloadable content, but those resources that are downloadable range from executable applications, textual and audio files to pieces of code. The files range in size from a few kilobytes to several gigabytes.

With the background of the repository in mind, in this paper we seek to examine the extent to which resources in the SADiLaR repository have been reused. For the purposes of our discussion, which takes the form of a meta-analysis of reuses in general with SADiLaR as a case study, reuse is taken to mean citation of resources in academic outputs.

(2) Methodology and reuse process

(2.1) Methodology

Reuse of SADiLaR resources is identified primarily through citation in academic publications. While citation-based identification provides a transparent means of tracing reuse, we recognise that it is a proxy rather than a comprehensive measure. As such, the findings may represent a lower bound on actual reuse, capturing only those instances that are academically documented. This approach is inherently biased toward academic publication practices and does not account for forms of reuse that are infrastructural, implicit, or mediated through automated processes, such as large-scale data aggregation for machine learning. These limitations are acknowledged in advance and are revisited in the analysis and discussion of repository usage data.

2.1.1 Collection of published academic outputs

To investigate the reuse of language resources hosted in the SADiLaR repository, the authors followed a manual identification process based on published academic outputs. Reuse was identified through explicit references to SADiLaR, the SADiLaR repository, or specific resources hosted within the repository.

Multiple complementary strategies were employed. First, the reference lists of all staff members currently employed at SADiLaR were examined. Second, targeted searches were conducted on Google Scholar using the keywords “South African Centre for Digital Language Resources” and “SADiLaR”. Third, proceedings from the Digital Humanities Association of Southern Africa (DHASA) were reviewed. Finally, publications by known collaborators and research groups active in the South African DH and NLP communities were included.

2.1.2 Inclusion and exclusion criteria

Publications were included in the analysis if they contained an explicit mention of SADiLaR, the repository itself, or a specific dataset hosted in the repository. Publications that made only indirect or ambiguous references, without a clear link to SADiLaR or its resources, were excluded.

2.1.3 Features used in the analysis

The included publications were coded according to citation type, authorship, language focus, and reuse type. The coding scheme was implemented to capture different aspects of resource reuse and attribution practices systematically. Citation type records how the SADiLaR repository or its resources were referenced in each publication, distinguishing between persistent identifiers (Handles), direct URLs, and general or informal mentions without a resolvable URL. Authorship category differentiates between self-reuse (where the authors of the publication were also involved in creating the referenced resource), internal reuse (by SADiLaR-affiliated authors), and external reuse. This allows for an assessment of uptake beyond the immediate institutional context. Language focus captures whether reuse concerned multiple South African languages or a specific language, reflecting the multilingual scope of the repository. Finally, reuse type characterises the functional role of the resource within the publication, distinguishing between dataset creation, direct research use, tool creation, tool assessment, and resource assessment. Together, these categories enable a nuanced analysis of not only whether reuse occurred, but how SADiLaR resources functioned within different research workflows.
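The citation-type distinction described above can be illustrated with a short sketch. The coding in this study was done manually; the Python function below is only a hypothetical approximation of the rule. It assumes that a reference has already been identified as pointing to a SADiLaR resource and that Handles are cited through the hdl.handle.net resolver. The function name, the patterns, and the example.org URL are illustrative assumptions.

```python
import re

# Illustrative patterns only; the publications in this study were coded by hand.
HANDLE_PATTERN = re.compile(r"(?:https?://)?hdl\.handle\.net/\S+", re.IGNORECASE)
URL_PATTERN = re.compile(r"https?://\S+", re.IGNORECASE)


def classify_citation(reference: str) -> str:
    """Return 'Handle', 'Long URL', or 'General mention' for a reference string."""
    # A Handle is checked first because it is also a URL.
    if HANDLE_PATTERN.search(reference):
        return "Handle"
    if URL_PATTERN.search(reference):
        return "Long URL"
    # No resolvable link at all: only a name or informal mention.
    return "General mention"


if __name__ == "__main__":
    samples = [
        "Available at https://hdl.handle.net/20.500.12185/1",
        "Available at https://example.org/repository/item/123",  # hypothetical URL
        "Data were obtained from SADiLaR.",
    ]
    for sample in samples:
        print(classify_citation(sample), "<-", sample)
```

A rule of this kind would only support, not replace, manual coding, since a publication may cite a resource in a footnote, in running text, or in the reference list.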

2.1.4 Repository usage analysis

In addition to identifying reuse through academic publications, repository usage data were analysed to provide a complementary perspective on engagement with SADiLaR resources. User data from the DSpace platform hosting the repository were examined. DSpace is an open-source repository system widely used for managing and providing access to digital research outputs. The data analysed included the number of users, their geographic location, and whether interactions with the repository consisted of item views or file downloads.

A distinction was made between human users and automated access (e.g., bots or web crawlers) based on the classification provided by the platform’s Solr-based statistics engine. This system filters interactions using predefined lists of bot-specific User-Agents and IP ranges. While this approach enables large-scale differentiation between human and automated traffic, it is dependent on the completeness of the bot detection lists.
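The general idea of list-based filtering can be sketched as follows. This is not the actual implementation used by DSpace or Solr; the agent patterns, the IP range (a range reserved for documentation), and the function name are hypothetical and serve only to show why such a classification depends on the completeness of the lists.

```python
import ipaddress
import re

# Hypothetical lists for illustration; real deployments maintain far longer ones.
BOT_AGENT_PATTERNS = [
    re.compile(pattern, re.IGNORECASE) for pattern in (r"bot", r"crawler", r"spider")
]
BOT_NETWORKS = [ipaddress.ip_network("192.0.2.0/24")]  # reserved documentation range


def is_bot(user_agent: str, ip: str) -> bool:
    """Flag a hit as automated if its User-Agent or IP address is on a known list."""
    if any(pattern.search(user_agent) for pattern in BOT_AGENT_PATTERNS):
        return True
    address = ipaddress.ip_address(ip)
    return any(address in network for network in BOT_NETWORKS)


if __name__ == "__main__":
    # Declared crawler: caught by the User-Agent list.
    print(is_bot("Mozilla/5.0 (compatible; ExampleBot/1.0)", "198.51.100.7"))
    # Browser-like agent from a listed range: caught by the IP list.
    print(is_bot("Mozilla/5.0 (Windows NT 10.0)", "192.0.2.15"))
    # Browser-like agent from an unlisted address: counted as human.
    print(is_bot("Mozilla/5.0 (Windows NT 10.0)", "198.51.100.7"))
```

The third case illustrates the limitation noted above: an automated client that presents a browser-like User-Agent from an unlisted address would be counted as a human user.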

(2.2) Identification of reuse

A total of 26 articles (hereafter referred to by coded identifiers P1–P26) that mentioned SADiLaR, the repository or its resources were identified. Of these, 23 met the inclusion criteria and were included in the analysis. The remaining three publications were excluded for the following reasons. In both P24 and P25, the RMA website link to the resource was provided in a footnote; this link now redirects users to the SADiLaR repository. Because these outputs were published prior to the establishment of SADiLaR, and therefore make no reference to the SADiLaR repository, they were excluded. The final exclusion was P26, as the work only made a passing reference to SADiLaR generally.

The 23 included publications were coded according to the features described above. See Table 1 for a list of these publications and coding. More details on the specific publications are provided in Supplementary File A.

Table 1

Included Publications Utilising the SADiLaR Repository.

ANONYMISED NAME	REPOSITORY CITATION	AUTHOR CATEGORY	LANGUAGE FOCUS	REUSE TYPE
P1	General mention	External	Multi-language	Dataset creation
P2	General mention	Self-reuse	Afrikaans	Direct research
P3	Long URL	Self-reuse	Multi-language	Direct research
P4	Long URL	Self-reuse	Multi-language	Dataset creation
P5	Handle	Self-reuse	Multi-language	Direct research
P6	Handle	Self-reuse	Multi-language	Dataset creation
P7	Long URL	Self-reuse	Multi-language	Dataset creation
P8	Handle	Self-reuse	Multi-language	Dataset creation
P9	General mention	External	Multi-language	Direct research
P10	General mention	External	Multi-language	Dataset creation
P11	Handle	Self-reuse	Multi-language	Dataset creation
P12	General mention	External	Multi-language	Tool creation
P13	Long URL	Internal (SADiLaR)	Multi-language	Tool assessment
P14	Long URL	Self-reuse	Multi-language	Tool creation
P15	Handle	Self-reuse	Afrikaans	Dataset creation
P16	General mention	Internal (SADiLaR)	Multi-language	Direct research
P17	Handle	Self-reuse	Multi-language	Dataset creation
P18	Long URL	Self-reuse	Sesotho	Tool creation
P19	General mention	Internal (SADiLaR)	Sesotho	Resource assessment
P20	General mention	Internal (SADiLaR)	Multi-language	Tool assessment
P21	Handle	External	Multi-language	Direct research
P22	Handle	Internal (SADiLaR)	Afrikaans	Direct research
P23	Long URL	Self-reuse	Afrikaans	Direct research

Of the 23 identified publications, 8 make use of persistent identifiers and can therefore be considered compliant with the CLARIN Citation Guidelines, as discussed above. A further 7 publications rely on direct URLs, representing partial compliance, as these links may not remain stable over time. The remaining publications (n = 8) refer to SADiLaR resources only through general mentions, without providing a resolvable reference.

In addition to citation practices, reuse type was analysed according to the features described above. Self-reuse is the most common authorship category (n = 13). Most publications (n = 17) focus on multiple languages rather than a single South African language. Reuse types vary: dataset creation is the most common (n = 9), followed by direct research use (n = 8) and tool creation (n = 3). Tool assessment is observed in 2 cases, while resource assessment occurs in 1 case. See Figure 1 for a plot of these data.

Figure 1

Number of Publications by Author and Reuse Type.
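The counts reported for citation type, authorship, language focus and reuse type can be re-derived from the codes in Table 1. The sketch below is one possible way of doing so in Python; the pipe-separated transcription of the table and the names TABLE_1, FIELDS and tally are our own illustrative choices and not part of the original analysis.

```python
from collections import Counter

# Codes transcribed from Table 1:
# identifier|citation type|author category|language focus|reuse type
TABLE_1 = """
P1|General mention|External|Multi-language|Dataset creation
P2|General mention|Self-reuse|Afrikaans|Direct research
P3|Long URL|Self-reuse|Multi-language|Direct research
P4|Long URL|Self-reuse|Multi-language|Dataset creation
P5|Handle|Self-reuse|Multi-language|Direct research
P6|Handle|Self-reuse|Multi-language|Dataset creation
P7|Long URL|Self-reuse|Multi-language|Dataset creation
P8|Handle|Self-reuse|Multi-language|Dataset creation
P9|General mention|External|Multi-language|Direct research
P10|General mention|External|Multi-language|Dataset creation
P11|Handle|Self-reuse|Multi-language|Dataset creation
P12|General mention|External|Multi-language|Tool creation
P13|Long URL|Internal (SADiLaR)|Multi-language|Tool assessment
P14|Long URL|Self-reuse|Multi-language|Tool creation
P15|Handle|Self-reuse|Afrikaans|Dataset creation
P16|General mention|Internal (SADiLaR)|Multi-language|Direct research
P17|Handle|Self-reuse|Multi-language|Dataset creation
P18|Long URL|Self-reuse|Sesotho|Tool creation
P19|General mention|Internal (SADiLaR)|Sesotho|Resource assessment
P20|General mention|Internal (SADiLaR)|Multi-language|Tool assessment
P21|Handle|External|Multi-language|Direct research
P22|Handle|Internal (SADiLaR)|Afrikaans|Direct research
P23|Long URL|Self-reuse|Afrikaans|Direct research
""".strip().splitlines()

FIELDS = ("citation", "author", "language", "reuse")


def tally(rows):
    """Count how often each code occurs in each of the four coded features."""
    counts = {field: Counter() for field in FIELDS}
    for row in rows:
        _identifier, *codes = row.split("|")
        for field, code in zip(FIELDS, codes):
            counts[field][code] += 1
    return counts


if __name__ == "__main__":
    for field, counter in tally(TABLE_1).items():
        print(field, dict(counter))
```

Applied to Table 1, the tally gives 8 Handles, 7 long URLs and 8 general mentions; 13 self-reuse, 5 internal and 5 external publications; 17 multi-language publications; and 9, 8, 3, 2 and 1 publications for dataset creation, direct research, tool creation, tool assessment and resource assessment respectively.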

Analysis of repository usage data recorded on the DSpace platform indicated a total of 3,502,406 visitors during the period under review. Of these, 70.5% were identified as automated bot traffic. This classification was performed by the platform’s Solr-based statistics engine, which filters hits based on a predefined list of bot-specific User-Agents and IP ranges, and is dependent on the currency of the repository’s spider-agent list. Bot activity increased markedly from late 2022 and persisted until late 2023. Figure 2 illustrates weekly activity for bot and human users between 2018 and early 2026, where a hit is recorded whenever a user (or bot) interacts with a DSpace Object.

Figure 2

Weekly activity for bot and human users.

Repository interactions were further categorised into item views and file downloads (see Figure 3). For human users, file downloads constituted the majority of interactions (71.8%). Following the user journey of a human, one would expect an item view to precede a download; however, the high proportion of downloads suggests that many users bypass the repository’s metadata splash pages. This occurs frequently when researchers access files directly via search engine results (e.g., Google) or shared direct links. Similarly, bot activity consisted primarily of file downloads (54.7%), likely due to automated harvesters and crawlers targeting bitstream URLs directly for indexing or text-mining, rather than interacting with the item’s descriptive metadata page.

Figure 3

Activity on the SADiLaR Repository by User Type.

The geographic distribution of human users showed a strong concentration within South Africa, accounting for 71.4% of identified users. The United States constituted the second largest group (7.8%), followed by Singapore (3.3%) and Poland (1.5%). Figure 4 presents a geographic visualisation of user distribution.

(3) Outcomes and experience

The analysis indicates that reuse of language resources hosted in the SADiLaR repository is occurring, but that such reuse is unevenly documented and not always visible through formal citation practices. Although 23 publications were identified that explicitly reference SADiLaR or its resources, this number is likely to underrepresent actual reuse, particularly in cases where resources function as background infrastructure for tool development or for data preparation.

One notable outcome is the prevalence of self-reuse, which accounted for more than half of the identified publications. While this is not unexpected in the context of a specialist repository closely linked to a research infrastructure initiative, it suggests that broader uptake of SADiLaR resources beyond the immediate institutional network may still be developing. At the same time, the presence of external reuse indicates that the repository is beginning to serve a wider research community.

The analysis further highlights the diversity of reuse types, with resources contributing to dataset creation, direct research outputs, and tool development. In particular, several resources appear to function as training data for NLP tools, including taggers and other language technologies. Such forms of reuse are not always foregrounded in publications, which complicates efforts to trace impact through conventional citation-based approaches.

Inconsistent citation practices emerged as a key issue. References to SADiLaR varied considerably, ranging from persistent identifiers and direct URLs to general or informal mentions. This variability poses challenges for resource discoverability, impact measurement, and long-term attribution, and suggests a need for clearer guidance on how repository-hosted resources should be cited in academic outputs. Currently, a banner on each resource’s landing page requests that users cite the resource by its handle (the persistent identifier for resources hosted in the SADiLaR repository).

Repository usage data provide a complementary perspective on engagement with SADiLaR resources. The predominance of file downloads among human users suggests active reuse rather than passive exploration. However, the high proportion of automated bot activity complicates interpretation of usage metrics and limits the extent to which download statistics can be directly equated with academic reuse.

The high proportion of automated bot activity, specifically from late 2022 to late 2023, coincided with the public release of ChatGPT and the subsequent surge in generative AI development. This pattern aligns with global observations of increased so-called ‘LLM feeder’ crawling during the same period, suggesting that the SADiLaR repository may have been identified as a high-value target for automated data harvesting. In the context of the South African research landscape, where many languages remain low-resource in the digital sphere, the repository represents a critical source of curated, high-quality training data. This points toward a significant, albeit invisible, form of reuse. While these automated agents process vast quantities of data for downstream technological development, this impact is not captured by traditional academic citation metrics. The result is a potential citation gap in which the repository’s actual utility far exceeds its documented formal citations.

Overall, while evidence of reuse is present, the findings underscore the importance of consistent citation practices, improved mechanisms for tracking reuse, and increased awareness of the repository among potential external users. General, and sometimes catch-all, references to SADiLaR might be due to unfamiliarity with SADiLaR’s activities and/or the repository as a data source.

(4) Recommendations and good practices

From the discussion it is clear that there is room to improve the reusability of resources from the SADiLaR repository. It is our position that uniformity in citation, as well as awareness of language data, will positively impact the discoverability of reuse cases. An important recent development that speaks to an effort toward uniformity is the CLARIN Data Citation Guidelines (Mathiessen & Lenardič, 2025). Given that SADiLaR represents South Africa in CLARIN, the adoption of these guidelines by the community that utilises resources from SADiLaR’s repository has the potential to streamline the way in which these resources are cited. An important aspect of the guidelines is the emphasis placed on citing datasets directly when the datasets themselves are the sources, rather than citing publications or outputs that discuss those datasets. Considering that most publications included in this discussion were published prior to the availability of the guidelines, it is not surprising that citation of the resources is generally not compliant with the guidelines. Besides the recent date of publication of these data citation guidelines, the author guidelines of journals could also affect compliance.

To address the identified inconsistencies in citation practices, it might be beneficial to give greater prominence to the citation generation and export tool that is available for every item on the repository’s webpage. This tool is included in the banner that asks users not to cite the URL shown in the browser. Such a feature serves to lower the barrier to formal attribution by providing automated exports in standardised, pre-formatted citation strings according to widely adopted academic style guides. Central to this is ensuring that the persistent identifier, the Handle in our case, is automatically embedded within the citation. By automating the generation of accurate, persistent metadata, the repository can minimise manual entry errors and ensure that the digital provenance of the resource is correctly maintained across diverse research outputs.
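As a minimal sketch of what such automated citation generation might involve, the Python function below builds a citation string that always ends with the Handle URL. It does not reproduce the repository’s actual export tool. The output format is loosely modelled on an APA-style dataset reference, which is an assumption on our part. The Resource class, the function name, and the example creators and title are hypothetical, and the handle shown is the repository-level handle from the notes, used purely as a placeholder.

```python
from dataclasses import dataclass

HANDLE_RESOLVER = "https://hdl.handle.net/"


@dataclass(frozen=True)
class Resource:
    """Minimal, hypothetical metadata record for a repository item."""
    creators: str
    year: int
    title: str
    handle: str  # prefix/suffix form, e.g. "20.500.12185/1"


def format_citation(resource: Resource, publisher: str = "SADiLaR") -> str:
    """Build a pre-formatted citation string that ends with the Handle URL."""
    handle_url = HANDLE_RESOLVER + resource.handle
    return (
        f"{resource.creators} ({resource.year}). {resource.title} [Data set]. "
        f"{publisher}. {handle_url}"
    )


if __name__ == "__main__":
    # Hypothetical record; only the handle prefix comes from the repository.
    example = Resource(
        creators="Example, A., & Sample, B.",
        year=2024,
        title="Hypothetical annotated corpus",
        handle="20.500.12185/1",
    )
    print(format_citation(example))
```

Because the Handle is taken from the item’s metadata rather than from the browser’s address bar, a generated citation of this kind would fall into the persistent identifier category used in the coding scheme.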

When considering awareness of language data, the repository and SADiLaR’s role in the DH research landscape, the importance of sustained efforts like the ESCALATOR flagship programme cannot be overstated.³ According to ESCALATOR’s website, the mission of the programme includes the development of computational capacity for research involving African languages. This general statement could therefore include awareness, discovery and the promotion of traceable reuse of language resources. It is our position that continued engagement with the community that contributes to the repository and reuses the available resources will lead to a timely increase in the utilisation of resources hosted in the SADiLaR repository.

Beyond documenting the reuse of resources within the SADiLaR repository, this study provides an empirical case study of visibility gaps in resource reuse that are likely relevant to language and digital repositories more broadly. The findings illustrate that compliance with FAIR principles and open accessibility alone does not guarantee formal citation or recognisable reuse in academic outputs. Methodologically, the paper offers a combined approach that integrates citation tracing, repository usage logs, and differentiation between bot and human interactions, and that can serve as a template for assessing reuse and impact in other research infrastructures.

Supplementary file

Supplementary file A

Full List of Included Publications Utilising the SADiLaR Repository.

AUTHORS	YEAR	TITLE
Adelani, D.I., et al.	2022	MasakhaNER 2.0: Africa-centric Transfer Learning for Named Entity Recognition
Brink, N.	2020	A usage-based investigation of Afrikaans-speaking children’s holophrases and communicative intentions
De Wet, F., et al.	2023	Investigating the Extent and Usability of Webtext Available in South Africa’s Official Languages
Du Toit, J. S., & Puttkammer, M. J.	2021	Developing Core Technologies for Resource-Scarce Nguni Languages
Eiselen, R., & Gaustad, T.	2023	Deep learning and low-resource languages: How much data is enough? A case study of three linguistically distinct South African languages
Gaustad, T., & McKellar, C. A.	2024	Updated Morphologically Annotated Corpora for 9 South African Languages
Gaustad, T., & Puttkammer, M. J.	2022	Linguistically annotated dataset for four official South African languages with a conjunctive orthography: IsiNdebele, isiXhosa, isiZulu, and Siswati
Gaustad, T., et al.	2025	Datasets for South African Languages: Bilingual Aligned and Monolingual Data for Machine Translation
Kaffee, L.-A., et al.	2023	Multilingual Knowledge Graphs and Low-Resource Languages: A Review
Marivate, V., et al.	2025	Swivuriso: The South African Next Voices Multilingual Speech Dataset
McKellar, C. A., & Puttkammer, M. J.	2020	Dataset for comparable evaluation of machine translation between 11 South African languages
Meyer, F., et al.	2024	NGLUEni: Benchmarking and Adapting Pretrained Language Models for Nguni Languages
Mlambo, R. & Matfunjwa, M.	2025	Human language technology tools for indigenous South African languages and their potential use
Puttkammer, M., et al.	2018	NLP Web Services for Resource-Scarce Languages
Rabé, M.	2021	Kodewisseling in Afrikaans-Nederlandse kinders se spraak
Setaka, M., & Trollip, B.	2022	Resource Repositories and linking resources: An exploratory study
Sibeko, J., & Van Zaanen, M.	2023	A Data Set of Final Year High School Examination Texts of South African Home and First Additional Language Subjects
Sibeko, J., & Van Zaanen, M.	2025	Developing and testing syllabification systems for South African Sesotho
Sibeko, J., & Setaka, M.	2023	An overview of Sesotho BLARK content
Skosana, N. J., & Mlambo, R.	2021	A brief study of the Autshumato Machine Translation Web Service for South African languages
Terblanche, C., et al.	2025	The development of synthetic child speech in three South African languages
Trollip, B.	2023	’n Gebruiksgebaseerde beskrywing van Afrikaanse prefiksoïede
Trollip, B., & Strauss, T.	2024	Analysing Afrikaans lexical blends using Levenshtein distances

Notes

[1] https://sadilar.org/en/about/ [Date of access: 19 January 2026].

[2] https://hdl.handle.net/20.500.12185/1 [Date of access: 15 January 2026].

[3] https://sadilar.org/en/escalator/ [Date of access: 19 January 2026].

Acknowledgements

The authors would like to thank the reviewers of this paper for their feedback on the first version of this article.

Author Contributions

Both authors contributed to all aspects of the paper.