Bridging the Data Discovery Gap: User-Centric Recommendations for Research Data Repositories

Open Access
Feb 2026

References

  1. Alencar, V., Kohwalter, T., Braganhole, V., da Silva, J. and Murta, L. (2024) ‘Prov-Dominoes: An approach for knowledge discovery from provenance data’, Expert Systems with Applications, 245. Available at: 10.1016/j.eswa.2023.123030
  2. Allahim, A., Shamsuddin, S.M. and Meulien, J. (2025) ‘Semantic approaches for query expansion: Taxonomy, challenges, and future research directions’, PeerJ Computer Science. Available at: 10.7717/peerj-cs.2664
  3. Amugongo, L.M., Mascheroni, P., Brooks, S., Doering, S. and Seidel, J. (2025) ‘Retrieval augmented generation for large language models in healthcare: A systematic review’, PLOS Digital Health, 4(6), e0000877. Available at: 10.1371/journal.pdig.0000877
  4. Bergold, J. and Thomas, S. (2012) ‘Participatory research methods: A methodological approach in motion’, Historical Social Research/Historische Sozialforschung, 37(4), pp. 191–222. Available at: https://www.jstor.org/stable/41756482
  5. Bugbee, K., le Roux, J., Sisco, A., Kaulfus, A., Staton, P., Woods, C., Dixon, V., Lynnes, C. and Ramachandran, R. (2021) ‘Improving discovery and use of NASA’s Earth observation data through metadata quality assessments’, Data Science Journal, 20(1), p. 17. Available at: 10.5334/dsj-2021-017
  6. Burton, A., Aryani, A., Koers, H., Manghi, P., Bruzzo, S.L., Stocker, M., Diepenbroek, M., Schindler, U. and Fenner, M. (2017) ‘The Scholix framework for interoperability in data-literature information exchange’, D-Lib Magazine, 23(1/2), pp. 1–20. Available at: 10.1045/january2017-burton
  7. Candela, L., Mangione, D. and Pavone, G. (2024) ‘The FAIR assessment conundrum: Reflections on tools and metrics’, Data Science Journal, 23(1), p. 33. Available at: 10.5334/dsj-2024-033
  8. Chae, Y. and Davidson, T. (2025) ‘Large language models for text classification: From zero-shot learning to instruction-tuning’, Sociological Methods & Research. Available at: 10.1177/00491241251325243
  9. Cooper, D.M. and Springer, R. (2019) ‘Data communities: A new model for supporting STEM data sharing’, Ithaka S+R, 13 May. Available at: 10.18665/sr.311396
  10. CoreTrustSeal Standards and Certification Board (2022) CoreTrustSeal requirements 2023–2025 (V01.00). Available at: 10.5281/zenodo.7051012
  11. DataCite (2024a) DataCite thriving communities: 3000 repositories and counting. Available at: 10.5438/63qf-5740 (Accessed: 17 April 2024).
  12. DataCite Metadata Working Group (2024b) DataCite metadata schema for the publication and citation of research data and other research outputs. Version 4.5. DataCite e.V. Available at: 10.14454/znvd-6q68
  13. Davenport, E. (2010) ‘Confessional methods and everyday life information seeking’, Annual Review of Information Science and Technology, 44(1), pp. 533–562. Available at: 10.1002/aris.2010.1440440119
  14. Dixit, R., Rogith, D., Narayana, V., Salimi, M., Gururaj, A., Ohno-Machado, L., Xu, H. and Johnson, T.R. (2018) ‘User needs analysis and usability assessment of DataMed – a biomedical data discovery index’, Journal of the American Medical Informatics Association, 25(3), pp. 337–344. Available at: 10.1093/jamia/ocx134
  15. Felden, J., Möller, L., Schindler, U., Huber, R., Schumacher, S., Koppe, R., Diepenbroek, M. and Glöckner, F.O. (2023) ‘PANGAEA – Data publisher for earth & environmental Science’, Scientific Data, 10, p. 347. Available at: 10.1038/s41597-023-02269-x
  16. Flanagan, J.C. (1954) ‘The critical incident technique’, Psychological Bulletin, 51(4), pp. 327–359. Available at: 10.1037/h0061470
  17. Foulonneau, M., Cole, T.W., Habing, T.G. and Shreeves, S.L. (2005) ‘Using collection descriptions to enhance an aggregation of harvested item-level metadata’, in Proceedings of the 5th ACM/IEEE-CS Joint Conference on Digital Libraries. Denver: Association for Computing Machinery, pp. 32–42. Available at: 10.1145/1065385.1065393
  18. Friedrich, T. (2020) Looking for data: Information seeking behaviour of survey data users. Doctoral dissertation. Humboldt-Universität zu Berlin. Available at: 10.18452/22173
  19. Gao, Y., Xiong, Y., Gao, X., Jia, K., Pan, J., Bi, Y., Dai, Y., Sun, J., Wang, M. and Wang, H. (2023) ‘Retrieval-augmented generation for large language models: A survey’, arXiv:2312.10997v5 [cs.CL]. Available at: 10.48550/arXiv.2312.10997
  20. Gregory, A., Bell, D., Brickley, D., Buttigieg, P.L., Cox, S., Edwards, M., Doug, F., Gonzalez Morales, L.G., Heus, P., Hodson, S., Kanjala, C., Le Franc, Y., Maxwell, L., Molloy, L., Richard, S., Rizzolo, F., Winstanley, P. and Wyborn, L. (2024) WorldFAIR (D2.3) (version 1). Available at: 10.5281/zenodo.11236871
  21. Gregory, K., Groth, P., Scharnhorst, A. and Wyatt, S. (2020) ‘Lost or found? Discovering data needed for research’, Harvard Data Science Review, 2(2). Available at: 10.1162/99608f92.e38165eb
  22. Hodson, S. (2024) WorldFAIR (D2.2) WorldFAIR’s experience with FIPs (second set of FAIR implementation profiles for each case study) (version 1). Available at: 10.5281/zenodo.11236094
  23. Jeng, W., He, D. and Chi, Y. (2017) ‘Social science data repositories in data deluge: A case study of ICPSR’s workflow and practices’, The Electronic Library, 35(4), pp. 626–649. Available at: 10.1108/EL-11-2016-0243
  24. Kacprzak, E., Koesten, L., Tennison, J. and Simperl, E. (2018) ‘Characterising dataset search queries’, in WWW ’18: Companion Proceedings of the Web Conference 2018. Geneva: International World Wide Web Conferences Steering Committee, pp. 1485–1488. Available at: 10.1145/3184558.3191597
  25. Kalinin, N.A. and Skvortsov, N.A. (2023) ‘Difficulties of FAIR principles implementation in cross-domain research infrastructures’, Lobachevskii Journal of Mathematics, 44, pp. 147–156. Available at: 10.1134/S199508022301016X
  26. Koesten, L., Gregory, K., Groth, P. and Simperl, E. (2021) ‘Talking datasets—Understanding data sense-making behaviours’, International Journal of Human-Computer Studies, 146, 102562. Available at: 10.1016/j.ijhcs.2020.102562
  27. Koesten, L.M., Kacprzak, E., Tennison, J.F. and Simperl, E. (2017) ‘The trials and tribulations of working with structured data: A study on information seeking behaviour’, in Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems, pp. 1277–1289. Available at: 10.1145/3025453.3025838
  28. Khalsa, S., Cotroneo, P. and Wu, M. (2018) ‘A survey of current practices in data search services’, Mendeley Data, V1. Available at: 10.17632/7j43z6n22z.1
  29. Klump, J., Wyborn, L., Wu, M., Martin, J., Downs, R.R. and Asmi, A. (2021) ‘Versioning data is about more than revisions: A conceptual framework and proposed principles’, Data Science Journal, 20(1), p. 12. Available at: 10.5334/dsj-2021-012
  30. Krans, N.A., Ammar, A., Nymark, P., Willighagen, E.L., Bakker, M.I. and Quik, J.T.K. (2022). ‘FAIR assessment tools: Evaluating use and performance’, NanoImpact, 27, p. 100402. Available at: 10.1016/j.impact.2022.100402
  31. Lacagnina, C., David, R., Nikiforova, A., Kuusniemi, M.E., Cappiello, C., Biehlmaier, O., Wright, L., Schubert, C., Bertino, A., Thiemann, H. and Dennis, R. (2023) Towards a data quality framework for EOSC (1.0.0). Available at: 10.5281/zenodo.7515816
  32. Lafia, S., Million, A.J. and Hemphill, L. (2023) ‘Direct, orienting, and scenic paths: How users navigate search in a research data archive’, in Proceedings of the 2023 Conference on Human Information Interaction and Retrieval (CHIIR ’23). New York: Association for Computing Machinery, pp. 128–136. Available at: 10.1145/3576840.3578275
  33. Lang, J.M. and Benbow, M.E. (2013) ‘Species interactions and competition’, Nature Education Knowledge, 4(4), p. 8. Available at: https://www.nature.com/scitable/knowledge/library/species-interactions-and-competition-102131429/
  34. Lister, A. and Sansone, A. (2023, July 28) FAIRsharing in a nutshell. Available at: 10.5281/zenodo.8191958
  35. Liu, Y.-H., Wu, M., Power, M. and Burton, A. (2022) Elicitation of data discovery contexts: An interview study (1.0). Available at: 10.5281/zenodo.7179526
  36. Liu, Y.-H., Wu, M., Power, M. and Burton, A. (2023) Elicitation of contexts for discovering clinical trials and related health data: An interview study (V1.0). Available at: 10.5281/zenodo.7839282
  37. Löffler, F., Wesp, V., König-Ries, B. and Klan, F. (2021) ‘Dataset search in biodiversity research: Do metadata in data repositories reflect scholarly information needs?’ PLoS ONE, 16(3), e0246099. Available at: 10.1371/journal.pone.0246099
  38. Löffler, F., Shafiei, F., Witte, R., König-Ries, B. and Klan, F. (2023) ‘Semantic search for biological datasets: A usability study on modes of querying and explaining search results’, 20th Conference on Database Systems for Business, Technology and Web, BTW 2023. Dresden, Germany, 6–10 March. Available at: 10.18420/BTW2023-56
  39. Manghi, P., Bardi, A., Atzori, C., Baglioni, M., Manola, N., Schirrwagen, J. and Principe, P. (2019) The OpenAIRE research graph data model. Available at: 10.5281/zenodo.2643199
  40. Marchionini, G. (2006) ‘Exploratory search: From finding to understanding’, Communications of the ACM, 49(4), pp. 41–42. Available at: 10.1145/1121949.1121979
  41. Million, A.J., York, J., Lafia, S. and Hemphill, L. (2025) ‘Data, not documents: Moving beyond theories of information-seeking behavior to advance data discovery’, Journal of the Association for Information Science and Technology, 76(4), pp. 649–664. Available at: 10.1002/asi.24962
  42. Miller, M. and Vielfaure, N. (2022) ‘OpenRefine: An approachable open tool to clean research data’, Bulletin – Association of Canadian Map Libraries and Archives (ACMLA), 170. Available at: 10.15353/acmla.n170.4873
  43. Nasir, J.A., Varlamis, I. and Ishfaq, S. (2019) ‘A knowledge-based semantic framework for query expansion’, Information Processing & Management, 56(5), pp. 1605–1617. Available at: 10.1016/j.ipm.2019.04.007
  44. National Library of Medicine (2021) SNOMED CT to ICD-10-CM map. Available at: https://www.nlm.nih.gov/research/umls/mapping_projects/snomedct_to_icd10cm.html
  45. NISO (National Information Standards Organization) (2004) Understanding metadata. Bethesda: NISO Press. Available at: https://www.niso.org/standards/resources/UnderstandingMetadata.pdf
  46. Peng, G., Berg-Cross, G., Wu, M., Downs, R.R., Shrestha, S.R., Wyborn, L., Ritchey, N., Ramapriyan, H.K., Clark, S.J., Wood, J., Liu, Z. and Marouane, A. (2024) ‘Harmonizing quality measures of FAIRness assessment towards machine-actionable quality information’, International Journal of Digital Earth, 17(1). Available at: 10.1080/17538947.2024.2390431
  47. Pressman, R. S. and Maxim, B. R. (2015). Software Engineering: A Practitioner’s Approach (8th ed.). McGraw-Hill Education.
  48. Quintel, D. and Wilson, R. (2020) ‘Analytics and privacy: Using Matomo in EBSCO’s discovery service’, Information Technology and Libraries, 39(3). Available at: 10.6017/ital.v39i3.12219
  49. Sharifpour, R., Wu, M. and Zhang, X. (2023) ‘Large-scale analysis of query logs to profile users for dataset search’, Journal of Documentation, 79(1), pp. 66–85. Available at: 10.1108/JD-12-2021-0245
  50. Shneiderman, B., Plaisant, C., Cohen, M., Jacobs, S., Elmqvist, N. and Diakopoulos, N. (2016) Designing the user interface: Strategies for effective human-computer interaction. 6th ed. Boston: Pearson.
  51. Silva, L. and Barbosa, L. (2024) ‘Improving dense retrieval models with LLM augmented data for dataset search’, Knowledge-Based Systems, 294, 111740. Available at: 10.1016/j.knosys.2024.111740
  52. Slavković, A. and Seeman, J. (2023) ‘Statistical data privacy: A song of privacy and utility’, Annual Review of Statistics and Its Application, 10, pp. 189–218. Available at: 10.1146/annurev-statistics-033121-112921
  53. Smith, L.C. (2020) ‘Interdisciplinary searching as a use case for vocabulary mapping’, in M. Lykke, T. Svarre, M. Skov and D. Martínez-Ávila (eds.) Knowledge organization at the interface: Proceedings of the sixteenth international ISKO conference, 2020 Aalborg, Denmark. vol. 17. Baden-Baden: Ergon-Verlag, pp. 428–435. Available at: 10.5771/9783956507762-428
  54. Sostek, K., Russell, D.M., Goyal, N., Alrashed, T., Dugall, S. and Noy, N. (2024) ‘Discovering datasets on the web scale: Challenges and recommendations for Google dataset search’, Harvard Data Science Review, special issue 4. Available at: 10.1162/99608f92.4c3e11ca
  55. Stall, S., Bilder, G., Cannon, M., Hong, N.C., Edmunds, S., Erdmann, C.C., Evans, M., Farmer, R., Feeney, P., Friedman, M., Giampoala, M., Hanson, R.B., Harrison, M., Karaiskos, D., Katz, D.S., Letizia, V., Lizzi, V., MacCallum, C., Meunch, A., Perry, K., Ratner, H., Schindler, U., Sedora, B., Stockhause, M., Townsend, R., Yeston, J. and Clark, T. (2023) ‘Journal production guidance for software and data citations’, Scientific Data, 10, 656. Available at: 10.1038/s41597-023-02491-7
  56. Sun, D., Hnatiuk, R.J. and Neldner, V.J. (1997) ‘Review of vegetation classification and mapping systems undertaken by major forested land management agencies in Australia’, Australian Journal of Botany, 45(6), pp. 929–948. Available at: 10.1071/BT96121
  57. Taniguchi, S. and Hashizume, A. (2023) ‘Transforming metadata content guidelines and instructions to linked data’, Journal of Documentation, 51(4). Available at: 10.1177/01655515221142428
  58. Thomas, K., Papenmeier, A., Carevic, Z., Kern, D. and Mathiak, B. (2021) ‘Data-seeking behaviour in the social sciences’, International Journal on Digital Libraries, 22(2), pp. 175–195. Available at: 10.1007/s00799-021-00303-0
  59. Terolli, E., Ernst, P. and Weikum, G. (2020) ‘Focused query expansion with entity cores for patient-centric health search’, in J.Z. Pan, V. Tamma, C. d’Amato, K. Janowicz, B. Fu, A. Polleres, O. Seneviratne and L. Kagal (eds.) The semantic web – ISWC 2020. Cham: Springer, pp. 547–564. Available at: 10.1007/978-3-030-62419-4_31
  60. Vega-Gorgojo, G., Slaughter, L., Giese, M., Heggestoyl, S., Soylu, A. and Waaler, A. (2016) ‘Visual query interfaces for semantic datasets: An evaluation study’, Journal of Web Semantics, 39(C), pp. 81–96. Available at: 10.2139/ssrn.3199241
  61. Wang, R.Y. and Strong, D.M. (1996) ‘Beyond accuracy: What data quality means to data consumers’, Journal of Management Information Systems, 12(4), pp. 5–33. Available at: 10.1080/07421222.1996.11518099
  62. Wang, X., Wang, Z., Gao, X., Zhang, F., Wu, Y., Xu, Y., Xu, Z., Shi, T., Wang, Z., Li, S., Qian, Q., Yin, R., Lv, C., Zheng, X. and Huang, X. (2024) ‘Searching for best practices in retrieval-augmented generation’, in Y. Al-Onaizan, M. Bansal and Y.-N. Chen (eds.) Proceedings of the 2024 conference on empirical methods in natural language processing. Miami: Association for Computational Linguistics, pp. 17716–17736. Available at: 10.18653/v1/2024.emnlp-main.981
  63. Weitz, J. (2020) ‘Improving WorldCat quality: Resolving to reduce duplicates’, Organizacija znanja, 25(1–2), 2025003. Available at: 10.3359/oz2025003
  64. Wentzel, B., Kirstein, F., Jastrow, T., Sturm, R., Peters, M. and Schimmler, S. (2023) ‘An extensive methodology and framework for quality assessment of DCAT-AP datasets’, in I. Lindgren, C. Csáki, E. Kalampokis, M. Janssen, G.V. Pereira, S. Virkar, E. Tambouris and A. Zuiderwijk (eds.) Electronic Government. Cham: Springer, pp. 262–278. Available at: 10.1007/978-3-031-41138-0_17
  65. White, R.W. (2016) Interactions with search systems. Cambridge: Cambridge University Press.
  66. Wilkinson, M., Dumontier, M., Aalbersberg, I., Appleton, G., Axton, M., Baak, A., Blomberg, N., Boiten, J.-W., da Silva Santos, L.B., Bourne, P.E., Bouwman, J., Brookes, A.J., Clark, T., Crosas, M., Dillo, I., Dumon, O., Edmunds, S., Evelo, C.T., Finkers, R., Gonzalez-Beltran, A., Gray, A.J.G., Groth, P., Goble, C., Grethe, J.S., Heringa, J., Hoen, P.A.C., Hooft, R., Kuhn, T., Kok, R., Kok, J., Lusher, S.J., Martone, M.E., Mons, A., Packer, A.L., Persson, B., Rocca-Serra, P., Roos, M., van Schaik, R., Sansone, S.-A., Schultes, E., Sengstag, T., Slater, T., Strawn, G., Swertz, M.A., Thompson, M., van der Lei, J., van Mulligen, E., Velterop, J., Waagmeester, A., Wittenburg, P., Wolstencroft, K., Zhao, J. and Mons, B. (2016) ‘The FAIR Guiding Principles for scientific data management and stewardship’, Scientific Data, 3, 160018. Available at: 10.1038/sdata.2016.18
  67. Wollin-Giering, S., Hoffmann, M., Höfting, J. and Ventzke, C. (2024) ‘Automatic transcription of English and German qualitative interviews’, Forum Qualitative Sozialforschung / Forum: Qualitative Social Research, 25(1). Available at: 10.17169/fqs-25.1.4129
  68. Wu, M. (2022) ARDC Project: Eliciting data search context. Available at: 10.5281/zenodo.6819787
  69. Wu, M., Juty, N., RDA Research Metadata Schemas WG, Collins, J., Duerr, R., Ridsdale, C., Shepherd, A., Verhey, C. and Castro, L.J. (2021) Guidelines for publishing structured metadata on the web (3.1). Available at: 10.15497/RDA00066
  70. Wu, M., Psomopoulos, F., Khalsa, S.J. and de Waard, A. (2019) ‘Data discovery paradigms: User requirements and recommendations for data repositories’, Data Science Journal, 18(1), p. 3. Available at: 10.5334/dsj-2019-003
  71. Wu, M., Brandhorst, H., Marinescu, M., Lopez, J.M., Hlava, M. and Busch, J. (2023) ‘Automated metadata annotation: What is and is not possible with machine learning’, Data Intelligence, 5(1), pp. 122–138. Available at: 10.1162/dint_a_00162
  72. Wu, M., Gregory, K., Löffler, F., Mathiak, B., Psomopoulos, F., Schindler, U., Aryani, A., Bodera, J., Castro, L.J., Culina, A., Czerniak, A., Erdmann, C., Grethe, J., Hellström, M., Henzen, C., Hunter, C., Juty, N., Kvale, L., Lister, A., Liu, Y.-H., Madon, B., Medina-Smith, A., Parton, G., Pearman-Kanza, S., Pörsch, A., Söding, E., Szabo, D., van der Meer, L., Weisweiler, N., Widmann, H. and Woodford, C.J. (2024) Ten principles to improve dataset discoverability (1.0). Available at: 10.15497/rda/00120
Language: English
Submitted on: Jul 19, 2025
Accepted on: Jan 6, 2026
Published on: Feb 12, 2026
Published by: Ubiquity Press
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year

© 2026 Mingfang Wu, Felicitas Löffler, Brigitte Mathiak, Fotis Psomopoulos, Uwe Schindler, Amir Aryani, Jordi Bodera Sempere, Antica Culina, Andreas Czerniak, Chris Erdmann, Kathleen Gregory, Nick Juty, Allyson Lister, Ying-Hsang Liu, Samantha Pearman-Kanza, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.