Five Suggestions Towards User-Centred Data Repositories in the Social Sciences

Elias Herman Kruithof; Christophe Vanroelen; Laura Van den Borre

doi:10.5334/dsj-2024-019

Open data in the social sciences

During the last decade, many social scientists have started to schedule lectures, seminars, and workshops to promote ‘open science’ and open data. Data stewards have been appointed to assist researchers with their data management and sharing practices, and research data repositories went online to provide a platform for research data sharing. In Belgium, open data archival developments in the social sciences were largely fragmented until the launch of the Social Sciences and Digital Humanities Archive (SODHA) in October 2020. As project partners, the authors were responsible for translating the needs of social scientists to the data archival experts and software developers. In order to tailor SODHA to the needs of social scientists, we searched the relevant literature to identify common pitfalls and inspirational examples. We believe these insights may also aid others in setting up new open data initiatives and improving existing repositories. To this end, this essay provides an overview of the relevant literature regarding open data infrastructure for the social sciences. Building on these results as well as on the feedback from our SODHA users, we address four challenges for sharing data from a social sciences perspective: (a) diverging needs of different user groups, (b) relevance-evaluation, (c) epistemological debates on the reuse of qualitative data, and (d) ethical concerns with sharing qualitative data. In the last part, we conclude with five suggestions and good practices for the design of online data repositories for the social sciences.

This dedication to data repositories has good reasons. The scientific potential of open data manifests itself, among other things, in the possibility of replicating studies, the re-purposing of old research data, and opportunities for better cooperation across academia, government, and the private sector (Engzell and Rohrer 2021; Gregory et al. 2020; Zenk-Möltgen et al. 2018). Furthermore, open social science data can be a solution for known and recurring problems, such as when only significant results are published (i.e., publication bias) or when hypotheses are made to fit certain significant patterns found in the data (known as p-hacking) (Breznau 2021; Freese, Rauf and Voelkel 2022). Archived data can also enable a more profound historical perspective on social science practices (McLeod and O’Connor 2020). Data preservation prevents extra costs for data collection, which is especially relevant for hard-to-reach populations, for whom acquiring data is both difficult and costly (Corti 2007; Tarrant 2017). Finally, open data can constitute ready-to-use learning materials, offering students the experience of the complexity of data analysis in the ‘real’ world (Corti 2007).

We encountered the issue that many researchers do not make their data publicly available for various reasons (for an overview, see Zuiderwijk, Shinde and Jeng 2020). For instance, some worry over the potential misuse and misinterpretation of their data by unqualified users (Niu and Hedstrom 2008; Zuiderwijk et al. 2020). Others refrain from sharing because of the loss of exclusive rights to follow-up papers based on the data, or more generally because there is no good system for adequately crediting the work behind the open data (David et al. 2020; Engzell and Rohrer 2021; Borgman 2009). Drawing from years of experience in working with linked data sources, reusing administrative data, and exploring reuse opportunities within the social sciences, we served as scholarly advisors for the Belgian repository SODHA. Data and research infrastructure is a pivotal research activity of our common research group. There exists a history spanning over 60 years in the archiving of social science data, featuring numerous well-established repositories, organizations of professionals (e.g., IASSIST and CESSDA), a periodical publication (IASSIST Quarterly), and regular annual gatherings (Downey et al. 2019). Most social scientists also have a positive attitude towards more research openness (Christensen et al. 2019; Heers et al. 2017; Jeng, He and Oh 2016; Zhu 2019). However, there is still room for the development of the open data tradition. Moreover, significant disparities exist in terms of open data practices within the social sciences. The fields of economics and psychology, in particular, have embraced a more extensive and formalized open data policy as a common practice (Freese et al. 2022). Political science has also made notable strides in this regard (Zenk-Möltgen et al. 2018). However, in sociology, the traditional perspective on open data still dominates, and well-defined open data policies remain largely absent (Christensen et al. 2019; Freese, Rauf and Voelkel 2022; Zenk-Möltgen et al. 2018). Several factors make open data less developed in the social sciences when compared to STEM disciplines (Lyon 2016; Quarati and Raffaghelli 2020; Freese et al. 2022), and social scientists are reported to be less willing to share data (Borycz et al. 2023). For example, understanding research data in the social sciences often requires detailed contextual information, which makes the sharing process more time-consuming (Yoon and Kim 2017). Also, social scientists encounter additional difficulties due to the high ethical standards expected by the science community, the lack of funding and technical infrastructure in general, and the higher probability of the use of qualitative research data, which are often considered too complex to share and reuse (Curty 2016; Jeng, He and Oh 2016; Jeng and Lyon 2016; Zhu 2019; Freese et al. 2022). Social science data repositories play a critical role in reducing researchers’ perceived effort to share and reuse data, so their design can make the difference (Perrier, Blondal and MacDonald 2020; Yoon and Kim 2017).

Four key challenges for sharing data in the social sciences

Needs of different user groups

Evidently, social scientists’ incentives to use secondary data first and foremost depend on how well these open data fit their information needs (Niu and Hedstrom 2008; Zuiderwijk et al. 2020). An online repository has to take all its different user groups into consideration, as they have different goals, requirements, problems, and techniques for problem solving. Experienced social scientists require well-documented data of topical fit, high quality, and comparability (Friedrich 2020). Novice researchers are heavily influenced by more experienced social science researchers when it comes to discovering, evaluating, and justifying their reuse of others’ data (Faniel et al. 2012). They mainly use data that are easy to understand, and they mainly encounter challenges such as forming efficient search queries, understanding data, or opening and reading datasets. Two studies on the user profiles of social science data repositories showed a remarkably high prevalence of students: 28.6% of users of a Czech repository and 39% of users of a Finnish repository (Kudrnáčová and Trtíková 2020; Late and Kekalainen 2020).

Metadata availability

As mentioned above, detailed contextual information is required for understanding research data in the social sciences. Many studies on data sharing and data search in the social sciences stress the fact that colleagues, friends, professors, or supervisors are all vital sources of information about data (Friedrich 2020; Heers et al. 2017; Krämer et al. 2021; Yoon and Kim 2017). In our opinion, the information needed to assess relevance is too often not centralized in digital data repositories (also see Krämer et al. 2021). However, ensuring adequate metadata for all datasets can be challenging for various reasons (see Faniel, Frank and Yakel 2019). With metadata, as with any form of communication, some kind of reductionism is inevitable. Hence, all the relevant information for secondary use is never completely foreseeable by primary researchers when they write metadata (Birnholtz and Bietz 2003). Finally, problematic data documentation is especially pronounced in the sharing of qualitative data, where the versatile and specific context of the original study is often referred to as indispensable metadata. It includes, among others, knowledge about interactions between researchers and study participants, typos in the transcripts, abbreviations, study participants’ tones of voice, certain emphasis in the talk, reasons why a certain probe was made during the interview, participants’ age, gender, social status, race, and occupation, research settings, time and place, sample selection decisions, data collection methods and procedures, and complete field work protocols and transcripts (Coltart et al. 2013; Sherif 2018; Yoon 2014).

Epistemological debates on the reuse of qualitative data

Sharing qualitative social science research data for secondary use has been widely debated (DuBois, Strait and Walsh 2018; Ruggiano and Perry 2019; Freese et al. 2022), but for different reasons than the sharing of quantitative data. The debate regarding qualitative data sharing is too extensive to cover in full in this essay. In short, the nature of qualitative data raises hesitations regarding data security, storage, and reuse (Alexander et al. 2020; Broom, Cheshire and Emmison 2009; DuBois et al. 2018). Epistemological debates on the reuse of qualitative data are centered around the question of whether first-hand knowledge of the original research context is vital for making sense of the nature and scope of qualitative research data or not. It is frequently heard that the knowledge resulting from processing direct experiences is not ‘transportable’ either into research archives or new research contexts (Broom et al. 2009; Coltart et al. 2013). The reuse of qualitative data is still in the early stages of development, especially when compared to quantitative data (Bishop 2014; Bishop and Kuula-Luumi 2017; Coltart, Henwood and Shirani 2013; Sherif 2018). However, there are indications of an evolution towards increased open data in qualitative research (Freese et al. 2022; Mozersky et al. 2020; Mozersky et al. 2021).

Ethical concerns with sharing qualitative data

Besides epistemological and methodological questions, many ethical challenges are inherent to sharing qualitative data (Alexander et al. 2020). First, participant confidentiality and consent are key issues. The extent to which a participant’s consent to data sharing is durable over time is an unsolved conundrum (McLeod and O’Connor 2020). Informed consent is difficult to assure when the ways in which data might be reused are unknown in advance (McLeod and O’Connor 2020). Furthermore, interview transcripts are often the intellectual property of both the interviewer and the interviewee, thus creating even more complex data sharing issues (Broom et al. 2009). Some authors advise sharing only qualitative data that can be de-identified (Mannheimer et al. 2019). However, the possibilities regarding anonymization or pseudonymization of qualitative information are often dataset-specific, as it is difficult to find a balance between losing valuable information and preserving enough information to allow reuse.

Five suggestions for open data in the social sciences

Suggestion 1

Based on the respective needs of novice data users and the striking prevalence of students as users of social science data repositories, pedagogical modules should be offered. We advocate more collaboration with higher education teachers who teach statistics or research methods for the development of suitable learning modules. In those modules, students could experiment in their own ‘data lab’ with, for example, easy-to-understand datasets and additional help in the data search process. The online ‘Learning Hub’ of the UK Data Service can serve as an excellent example. It encompasses dedicated pages for new users, students, and teachers; provides an introduction to widely used data analysis software; and offers various additional resources for reusing secondary data. Additionally, there is potential for enhanced support of novice researchers by collaborating with research groups to organize workshops aimed at, for example, elucidating GDPR regulations for researchers, distinguishing between anonymization and pseudonymization techniques, conducting technical workshops (such as small-cell risk analysis and synthetic data generation), and establishing platforms to facilitate communication between data and legal experts and researchers. Online resources dedicated to the life sciences, such as the ELIXIR FAIR Cookbook¹ and RDMkit,² can serve as good examples.

Suggestion 2

Data repositories can be thoroughly improved if seen as an important meeting place for researchers, thus facilitating as much as possible the contacts between primary and secondary users, especially in the case of coping with missing metadata. Even when documented metadata are theoretically sufficient for secondary data users, some researchers might prefer actual contact with the primary researchers instead of or in addition to consulting the documentation (Niu 2009). Also, when initial trust in a dataset is lost for one reason or another, contact with data creators can help restore this trust (Yoon 2017a). The user conferences of The Austrian Social Science Data Archive can inspire other repository developers in this regard. A repository should, additionally, find a way to keep track of the experts and data creators in case they change institutions. As a good practice, refer to The Finnish Social Science Data Archive, where authors are always linked to a dataset with their ORCID.
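As an illustration of how a repository might make such an author link robust, the sketch below validates an ORCID iD before it is stored with a dataset. It relies on the ISO 7064 MOD 11-2 check character that ORCID iDs carry; the function names are our own and hypothetical, and the sketch says nothing about how any particular archive implements this step.

```python
def orcid_check_character(base_digits: str) -> str:
    """Compute the ISO 7064 MOD 11-2 check character for the first 15 digits."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_orcid(orcid: str) -> bool:
    """Return True if the string is a well-formed ORCID iD with a correct check character.

    Hypothetical helper: accepts the usual hyphenated form (0000-0000-0000-0000).
    """
    compact = orcid.replace("-", "").upper()
    if len(compact) != 16:
        return False
    if not all(ch in "0123456789" for ch in compact[:15]):
        return False
    return orcid_check_character(compact[:15]) == compact[15]


print(is_valid_orcid("0000-0002-1825-0097"))  # True
print(is_valid_orcid("0000-0002-1825-0098"))  # False
```

A check of this kind only catches typing errors; it does not confirm that the iD belongs to the named author, which would still require a lookup or an authenticated sign-in.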

It also seems advisable to insert a comment section under each dataset, in which users can publicly post questions regarding the data. Such interactions could supersede or complement the common practice of ‘data talk,’ i.e., a casual real-life conversation among colleagues (Yoon 2017b). Contact with primary researchers is even more pertinent for the sharing of qualitative data (Broom et al. 2009). Reuse of qualitative data heavily depends on the trust reusers choose to grant to specific datasets, a process that is inherently connected with an assessment of the primary researchers’ worldview, research philosophy, and experiences (Yoon 2014). Many social scientists state that prior connection to the original data and the original investigators is a condition for reuse (Yoon 2014), and the great majority of secondary studies were undertaken by researchers with at least some first-hand knowledge of the context in which the qualitative data were originally collected and analyzed (Coltart et al. 2013).

Suggestion 3

Repositories should aim to prevent the two main challenges with metadata: poor documentation and insufficient descriptions. Thorough documentation of quantitative data always has to be ensured on all three ‘levels’ of metadata (Gutmann et al. 2004; Niu and Hedstrom 2008). First, study-level metadata should outline the general description of the study as a whole, such as the purposes of the study, major conceptual categories, characteristics of the sample, researchers involved, related publications, and so on. To enhance trust in the deposited dataset, these metadata can also contain information about the publications based on this dataset and a short profile of the original researcher (e.g., expertise, home departments, main methodologies, and research interests) (Yoon 2017a). Second, file-level metadata should describe the properties of individual files such as codebooks, extensive method reports, or questionnaires. It can be helpful to include links to basic concepts explained on the Web (Koesten et al. 2021). Third, variable-level metadata should describe the measurement and coding of individual variables or groups of variables (Faniel et al. 2012). Finally, repositories should include systematic updates and news about specific datasets when they are downloaded and reused (Kern and Mathiak 2015). The dataset descriptions in the GESIS interface exemplify good practices. For each dataset, they consistently include a concise abstract outlining the research objectives, mention the affiliated institutions of the researchers, provide a brief overview of the dataset’s variables and geographical coverage, and identify the relevant data collection study. Moreover, they offer a comprehensive overview of different versions and related publications per dataset.
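The three levels of metadata described above can be sketched as a nested data structure. The following is a minimal illustration only: all class names, field names, and the example study are hypothetical and do not reflect the schema of GESIS, SODHA, or any other repository. The helper function lists where documentation is missing at each level.

```python
from dataclasses import dataclass, field


@dataclass
class VariableMetadata:
    """Variable level: measurement and coding of one variable."""
    name: str
    label: str
    coding: dict = field(default_factory=dict)


@dataclass
class FileMetadata:
    """File level: properties of one deposited file."""
    filename: str
    kind: str  # e.g. "data", "codebook", "questionnaire"
    description: str = ""
    variables: list = field(default_factory=list)


@dataclass
class StudyMetadata:
    """Study level: description of the study as a whole."""
    title: str
    purpose: str
    sample: str
    researchers: list
    related_publications: list = field(default_factory=list)
    files: list = field(default_factory=list)


def documentation_gaps(study):
    """Return a list of human-readable gaps, ordered by study, file, variable."""
    gaps = []
    if not study.purpose.strip():
        gaps.append("study: purpose missing")
    if not study.sample.strip():
        gaps.append("study: sample description missing")
    for f in study.files:
        if not f.description.strip():
            gaps.append(f"file: {f.filename} has no description")
        for v in f.variables:
            if not v.label.strip():
                gaps.append(f"variable: {v.name} has no label")
    return gaps


# Hypothetical example study with two documentation gaps.
example = StudyMetadata(
    title="Hypothetical working conditions survey",
    purpose="Describe job quality among employees",
    sample="Random sample of 1,000 employees",
    researchers=["A. Researcher"],
    files=[
        FileMetadata(
            filename="survey.csv",
            kind="data",
            description="Cleaned respondent-level data",
            variables=[
                VariableMetadata("age", "Age in years at interview"),
                VariableMetadata("q12", "", {1: "permanent", 2: "temporary"}),
            ],
        ),
        FileMetadata(filename="codebook.pdf", kind="codebook"),
    ],
)
print(documentation_gaps(example))
```

A deposit form could run a check of this kind before submission, so that depositors see which level still lacks documentation instead of discovering it through later user questions.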

Suggestion 4

Repositories should address the epistemological uncertainties of researchers when sharing their (qualitative) research data. In our opinion, repositories can provide at least two things. First, they can put forward their own clear vision that provides answers to the identified epistemological uncertainties. Then it is up to the researchers to position themselves against that vision. In addition, repositories can also provide an overview of the arguments and views in that debate, which researchers themselves can use as tools in their search for an answer to their queries. Again, the UK Data Service is an excellent illustration, offering insights into various approaches for reusing qualitative data, presenting the major debates in this field, and providing users with an extensive literature review on the reuse of such data. Secondly, it seems appropriate to provide a space in which individual researchers can express their epistemological uncertainties to the public (e.g., as part of the metadata) and where they are encouraged to justify in what manner they think these uncertainties might affect reuse.

Suggestion 5

Given the inherent diversity in the forms of the metadata required to describe qualitative and quantitative data, repositories should avoid requiring the use of the same metadata fields for qualitative data as for quantitative data. Metadata fields for qualitative data should provide formal flexibility along with a strong emphasis on a substantive description. For inspiration, we recommend the open-format dataset descriptions on the Qualitative Data Repository.³ However, to provide a certain consistency in the case of in-depth interviews, a repository could ask researchers to deposit their interview guides together with indications of the dynamics of the interviews (e.g., scale of openness) and a statement of the research goals (e.g., the main research questions). By solely uploading this guide, the researcher indicates which qualitative data are available without having to pass through ethics committees and without having to take the privacy of the interviewees into account.
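To make the contrast with fixed quantitative fields concrete, the sketch below shows one possible shape for a qualitative deposit: a free-text description, depositor-defined extra fields, and an optional interview guide with an openness indication and research questions. All names are hypothetical, and the 1 to 5 openness scale is an assumption made for the example, not a scale proposed in the literature cited here.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class InterviewGuide:
    guide_text: str
    openness: int  # assumed scale: 1 (fully structured) to 5 (fully open)
    research_questions: list


@dataclass
class QualitativeDeposit:
    title: str
    description: str  # open-format, substantive description
    extra_fields: dict = field(default_factory=dict)  # depositor-defined
    interview_guide: Optional[InterviewGuide] = None


def check_deposit(deposit):
    """Return a list of problems; an empty list means the deposit is acceptable."""
    problems = []
    if not deposit.description.strip():
        problems.append("substantive description is empty")
    guide = deposit.interview_guide
    if guide is not None:
        if not guide.guide_text.strip():
            problems.append("interview guide text is empty")
        if not 1 <= guide.openness <= 5:
            problems.append("openness must be between 1 and 5")
        if not guide.research_questions:
            problems.append("no research questions stated")
    return problems


# Hypothetical guide-only deposit: no transcripts are uploaded.
deposit = QualitativeDeposit(
    title="Hypothetical interviews on precarious work",
    description="In-depth interviews on job insecurity; transcripts held by the authors.",
    extra_fields={"fieldwork_period": "2021-2022"},
    interview_guide=InterviewGuide(
        guide_text="1. Describe your current job. 2. How secure does it feel?",
        openness=4,
        research_questions=["How do workers experience job insecurity?"],
    ),
)
print(check_deposit(deposit))  # []
```

Only the description and, when present, the interview guide are checked; everything in `extra_fields` is left to the depositor, which is where the formal flexibility comes from.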

Conclusion

In this essay, we identified four key challenges for data sharing and presented five recommendations for innovating domain-specific online data repositories in the social sciences to enhance data sharing. The strength of this essay lies in its usability; it offers any developer of an open data repository, at whatever scale, a thorough and contextual overview of urgent challenges for data sharing in the social sciences.

For future research, it is important to note that many fundamental aspects of data sharing practices in the social sciences remain understudied. For example, which repositories do researchers choose (university repositories, public data archives, private repositories, or domain-specific repositories)? Why do researchers favor certain repositories over others? The reception of open science policies by social science researchers has also not yet been sufficiently investigated. Finally, it is crucial to understand the relative importance of informal data talks and workplace dynamics in the data sharing landscape for the social sciences.

It is time for social science open data repositories to fully apply a user-centered approach in the development of their online services. A repository designed in this way can make a major difference in closing the gap between social scientists’ attitudes and their actual data sharing behavior through online research data repositories. Our five suggestions can be useful on the way towards truly user-centered data repositories, the next chapter of open data.

Notes

[1] https://faircookbook.elixir-europe.org/content/home.html.

[2] https://rdmkit.elixir-europe.org/.

[3] https://qdr.syr.edu/.

Acknowledgements

We thank Dr. Kim Bosmans and Benjamin Peuch for their useful insights and helpful comments on the earlier versions of the manuscript. Publishing cost: BELSPO – ESFRI-FED Programme, Contract nr EF/231/DAMAR.

Funding information

This project was funded by Federal Public Planning Service Science Policy (BELSPO: FR/00/SO5).

Competing Interests

The authors have no competing interests to declare.

Author contributions

Elias Kruithof and Laura Van den Borre designed the study. Elias Kruithof conducted the study and wrote the first draft of the manuscript. All authors edited the draft, discussed the interpretation, and approved the final version of the manuscript.