
Outside the Discipline, Inside the Data: A Retrospective Account of an Undocumented Tunisian Language Corpus in an Extractivist Research Context
Abstract
This discussion paper is a retrospective reflection on a qualitative interview corpus produced during my doctoral fieldwork in Tunis, Tunisia, in June 2022. As a researcher trained in architecture rather than linguistics or data science, I did not set out to build a language dataset, yet that is precisely what I produced. Working with 18 master’s students from two Tunisian universities, I coordinated the collection, transcription, and translation of 152 structured interviews with residents of the buffer zone of the UNESCO World Heritage Site of the Medina of Tunis. The interviews were conducted and transcribed in Tunisian and translated into French, all within a single five-day fieldwork week. The resulting corpus, comprising transcriptions, French translations, and a small number of audio recordings and photos, subsequently underpinned a published research paper but was not formally documented, deposited, or cited as a dataset. This paper is also an account of remediation: the corpus has since been deposited on Zenodo, with public metadata and restricted-access files, as a first step toward making it reusable. Looking back at this work now, I ask: what went wrong, why, and what should researchers outside the field of linguistics know before they find themselves in the same position? This paper is addressed above all to researchers in the humanities and built-environment disciplines who collect language data as part of their work without recognising it as such, and who risk, as I did, allowing irreplaceable fieldwork material to disappear into personal storage. The paper also situates this experience within a broader reflection on extractive research practices in asymmetric, cross-cultural settings.
© 2026 Khaoula Stiti, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.