
A Discussion of Value Metrics for Data Repositories in Earth and Environmental Sciences

Open Access | Dec 2019

Figures & Tables

Table 1

Cost categories (adapted from Curation Cost Exchange 2018).

1.1. Cost of Initial Investment: Gathering requirements, preservation planning, development of the repository platform (hardware, software licenses) and of search and access capabilities, development of policies for data acceptance and retention.
Costs can vary widely depending on the scope of the requirements, the suitability of off-the-shelf software, and the time required for initial set-up, testing, and evolution to full production. Requirements may be imposed by the funder or by the interests of the scientific community, which influence the repository's design and infrastructure.
1.2. Cost to Publish: Data acquisition, appraisal, quality review, standards-compliant metadata preparation and dissemination, overhead, marketing, user support.
The variety of scientific communities' needs leads to a variety of curation practices and repository goals, with costs partly depending on the data source. Earth and environmental data naturally range from the relatively homogeneous (e.g., from sensors or instruments) to highly complex organismal observations and physical samples (biological, chemical, geoscience), under both ambient conditions and from experimental manipulations. Large, mission-centric repositories (e.g., satellite data) have costs generally tied to data collection. Repositories serving many individual data producers rely considerably on their contributors' expertise and time, which distributes part of the curation cost to those projects. Repositories that are primarily aggregators (whose goal is to collect a variety of metadata or sources for indexing) rely on a minimum level of metadata standardization from their sources; their costs typically arise from resolving incoherent source data and heterogeneous metadata, with related outreach efforts to improve practices (a minimal automated completeness check is sketched after this table).
1.3. Cost to Add Value: Data dissemination planning, processing, data product development, and quality control of the new data products; overhead.
Varies greatly among repositories, but may represent the most visible return, or possibly even an opportunity for commercialization. Some raw data will already have received comprehensive processing to make them further usable. The concept of "Analysis Ready Data" is applied in other domains, with value-adding steps taken by the repository to target uses from multiple disciplines, non-research uses (e.g., policy makers, general public, education), or the development of specific data products demanded by such groups (Baker and Duerr, 2017). The cost of value-adding tasks depends greatly on data types, their diversity, and envisioned uses.
1.4. Cost to Preserve: Anticipated retention period; facilities and system maintenance, enhancements, and migration; staff development and technology upgrades.
While tracking existing needs is relatively straightforward, future costs may be more difficult to predict. Preservation costs are greatly influenced by technological change (e.g., new hardware, standards, vocabularies, storage formats) and by new requirements and data policies that must be translated into repository operations (Maness et al., 2017; Baker and Duerr, 2017). Iterative migration incurs expenses in development, data and metadata conversion, and user engagement, sometimes without immediately noticeable changes in service. Moving from supporting primarily data publishing to supporting data that are frequently reused requires new services and, possibly, value-added products.
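The cost of standards-compliant metadata preparation (row 1.2) and of bringing aggregated sources up to a minimum level of metadata standardization depends partly on how much checking can be automated. The following Python sketch is purely illustrative and not part of the paper: it checks a submitted metadata record against a hypothetical list of required fields; the field names are assumptions, not any specific metadata standard.

```python
# Illustrative sketch only: check a metadata record against a hypothetical
# list of required fields before accepting a dataset for publication.
# The field names below are assumptions, not any particular metadata standard.

REQUIRED_FIELDS = [
    "title",
    "creator",
    "abstract",
    "temporal_coverage",
    "spatial_coverage",
    "license",
]

def missing_fields(record: dict) -> list:
    """Return the required fields that are absent or empty in the record."""
    return [field for field in REQUIRED_FIELDS if not record.get(field)]

# Example usage with a hypothetical submission:
submission = {"title": "Soil moisture, site A", "creator": "J. Doe", "license": "CC-BY-4.0"}
gaps = missing_fields(submission)
if gaps:
    print("Metadata incomplete; missing:", ", ".join(gaps))
else:
    print("Metadata record passes the minimal completeness check.")
```

In practice a repository would validate against its chosen standard (and controlled vocabularies) rather than a flat field list; the sketch only illustrates why such checks reduce per-dataset curation cost.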
Table 2

Generally accepted benefits of publishing data in an open repository.

1.1. Avoidance of Data Generation Costs: Data gathering is expensive; offering reusable data avoids the cost of re-creation.
Data value may be easily estimated as the cost of creation; however, the future value of data cannot be predicted, and different kinds of data will have different useful lifespans, generally depending on how easy or expensive the data are to create and whether they lose or gain applicability over time. It may be feasible to re-create experimental data, but it is generally impossible to re-create observational field data.
1.2. Efficiency of Data Management: Infrastructure investments benefit all data producers; central programming of data search and access functions improves discoverability and reduces distribution costs for researchers. Efficiency benefits are most obvious in repositories serving a large number of single investigators, though all repositories keep large amounts of data safe by upgrading hardware and software as technology changes, and by managing services such as unique identifiers (e.g., Digital Object Identifiers, DOIs).
Data repositories can be compared to specialized analytical laboratories in that they employ an expert workforce with specific skills in data curation and preservation that ensure the quality and interoperability of their holdings. Once data have met curation standards, repositories maintain their usability and working life beyond the lifespan of the original creator's data storage options by addressing format obsolescence and other issues.
1.3. Long-term Usability and Re-use of Data: Implementing sustainable data curation, stewardship, and metadata capture, and ensuring the quality of data and metadata, enables meta-analyses and innovative re-use for new science or applications.
Lengthening the working life of data creates enduring value by enabling subsequent use over time. Ongoing stewardship can support new uses and user communities. Properly curated data can be combined or analyzed with data collected in the future, allowing researchers to build upon prior work (Starr et al., 2015).
1.4. Transparency of Scientific Results: Making data publicly available in a repository is an important step toward transparency and reproducibility of research, which in turn assure the credibility of scientific results (McNutt et al., 2016) and the ability to build on prior work.
Historically, best efforts have been made to preserve publications and the salient data published in them. In modern publishing, data needs to be managed and published as a product in its own right (Downs et al., 2015).
1.5. Value-Added Data Products: Some repositories increase data utility via pre-processing, semantic and format standardization, data aggregation and interpretation, and specific tools that support the creation of new data products, uses, and audiences beyond the original data users (e.g., general public, policy makers, education and outreach) (Baker et al., 2015).
Table 3

Summary of findings on the ease of implementing repository benefit metrics. For a detailed list and description of the metrics, see Appendix A. Here the metrics are not necessarily named individually but are restated in general terms.

Currently measurable by most repositories:
Derived from data holdings: temporal, spatial, and subject coverage.
Value of repository services: numbers of data submitters and users supported, grants or projects served, and workforce development achieved; cost savings per submitter for trustworthy data storage and distribution.
Support for reuse: completeness of metadata; expressiveness of metadata standard; presence and enforcement of data/metadata quality policies.
Data reuse: numbers of downloads, page views, or distinct IP addresses accessing the data; numbers of metadata pages, data products, and specific tools accessed; time spent at the site (a minimal log-based sketch of such counts follows this table).
Possible in the foreseeable future with research, advanced technology, and changed practices:
Scientific impact: extracted with artificial intelligence technologies from current publications, webpages, blogs, proposals, and data management plans, and more reliably based on standardized data citations once that practice is established.
Requiring major additional resources and expertise:
Surveys: interviews to ascertain user satisfaction and perceived impact on research success (research enabled, time saved, new questions developed).
Economic and societal impact: the impact of data and data products beyond scientific use, or their use for fraud avoidance.
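Several of the "currently measurable" reuse metrics above (downloads, page views, distinct IP addresses) can in practice be derived from ordinary web server access logs. The following Python sketch is illustrative only and is not from the paper; the Common Log Format assumption and the "/download/" URL convention are hypothetical.

```python
# Minimal sketch: derive simple reuse metrics (downloads and distinct client
# addresses) from a web server access log in Common Log Format.
# The log path and the "/download/" URL convention are illustrative assumptions.

import re
from collections import Counter

LOG_LINE = re.compile(r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "(?P<method>\S+) (?P<path>\S+)')

def reuse_metrics(log_path: str, download_prefix: str = "/download/") -> dict:
    downloads = Counter()   # download requests per dataset path
    distinct_ips = set()    # distinct client addresses seen
    requests = 0            # all request lines that matched the pattern

    with open(log_path, encoding="utf-8", errors="replace") as log:
        for line in log:
            match = LOG_LINE.match(line)
            if not match:
                continue
            requests += 1
            distinct_ips.add(match["ip"])
            if match["path"].startswith(download_prefix):
                downloads[match["path"]] += 1

    return {
        "requests": requests,
        "distinct_ips": len(distinct_ips),
        "downloads_total": sum(downloads.values()),
        "top_datasets": downloads.most_common(5),
    }

# Example usage (hypothetical log file):
# print(reuse_metrics("access.log"))
```

Such raw counts are easy to collect but, as the table notes, they say little about scientific or societal impact, which requires citation tracking, surveys, or other more resource-intensive methods.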
Language: English
Submitted on: May 23, 2019 | Accepted on: Nov 12, 2019 | Published on: Dec 9, 2019
Published by: Ubiquity Press

© 2019 Cynthia Parr, Corinna Gries, Margaret O'Brien, Robert R. Downs, Ruth Duerr, Rebecca Koskela, Philip Tarrant, Keith E. Maull, Nancy Hoebelheinrich, Shelley Stall, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.