Table 1
Cost categories (adapted from Curation Cost Exchange 2018).
| Cost category | Description |
| --- | --- |
| 1.1. Cost of Initial Investment | Gathering requirements; preservation planning; development of the repository platform (hardware, software licenses) and of search and access capabilities; development of policies for data acceptance and retention. Costs can vary widely depending on the scope of the requirements, the suitability of off-the-shelf software, and the time required for initial setup, testing, and evolution to full production. Requirements may be imposed by the funder or by the interests of the scientific community, and these influence the repository's design and infrastructure. |
| 1.2. Cost to Publish | Data acquisition, appraisal, quality review, standards-compliant metadata preparation and dissemination, overhead, marketing, user support. The varied needs of scientific communities lead to a variety of curation practices and repository goals, with costs partly depending on the data source. Earth and environmental data naturally range from the relatively homogeneous (e.g., from sensors or instruments) to highly complex organismal observations and physical samples (biological, chemical, geoscience), collected under both ambient conditions and experimental manipulations. Large, mission-centric repositories (e.g., for satellite data) have costs generally tied to data collection. Repositories serving many individual data producers rely considerably on their contributors' expertise and time, which distributes part of the curation cost to those projects. Repositories that are primarily aggregators (whose goal is to collect a variety of metadata or sources for indexing) rely on a minimum level of metadata standardization from their sources; their costs typically arise from resolving incoherent source data and heterogeneous metadata, and from related outreach efforts to improve practices. |
| 1.3. Cost to Add Value | Data dissemination planning, processing, data product development, quality control of the new data products, overhead. Varies greatly among repositories, but may represent the most visible return, or possibly even an opportunity for commercialization. Some raw data will already have received comprehensive processing to make them more readily usable. The concept of "Analysis Ready Data" is applied in other domains, with value-adding steps taken by the repository to target uses from multiple disciplines, non-research uses (e.g., policy makers, general public, education), or the demand by such groups for specific data products (Baker and Duerr, 2017). The cost of adding value depends greatly on data types, their diversity, and the envisioned uses. |
| 1.4. Cost to Preserve | Anticipated retention period; facilities and system maintenance, enhancements, and migration; staff development and technology upgrades. While tracking existing needs is relatively straightforward, future costs may be more difficult to predict. Preservation costs are greatly influenced by technological change (e.g., new hardware, standards, vocabularies, storage formats) and by new requirements and data policies that must be translated into repository operations (Maness et al., 2017; Baker and Duerr, 2017). Iterative migration incurs expenses for development, data and metadata conversion, and user engagement, sometimes without immediately noticeable changes in service. Moving from supporting primarily data publishing to supporting frequently reused data requires new services and, possibly, value-added products. |
Table 2
Generally accepted benefits of publishing data in an open repository.
| Benefit | Description |
| --- | --- |
| 1.1. Avoidance of Data Generation Costs | Data gathering is expensive; offering reusable data avoids the cost of recreating them. Data value may be estimated simply as the cost of creation; however, the future value of data cannot be predicted, and different kinds of data will have different useful lifespans, generally depending on how easy or expensive the data are to create and whether they lose or gain applicability over time. It may be feasible to recreate experimental data, but it is generally impossible to recreate observational field data. |
| 1.2. Efficiency of Data Management | Infrastructure investments benefit all data producers; central programming functions for data search and access improve discoverability and reduce distribution costs to researchers. Efficiency benefits are most obvious in repositories serving a large number of single investigators, though all repositories keep large amounts of data safe by upgrading hardware and software as technology changes and by managing services such as unique identifiers (e.g., Digital Object Identifiers, DOIs). Data repositories can be compared to specialized analytical laboratories in that they employ an expert workforce with specific skills in data curation and preservation, which ensures the quality and interoperability of their holdings. Once data have met curation standards, repositories maintain their usability and working life beyond the lifespan of the original creator's data storage options by addressing format obsolescence and other issues. |
| 1.3. Long-term Usability and Re-use of Data | Implementing sustainable data curation, stewardship, metadata capture, and data and metadata quality enables meta-analyses and innovative re-use for new science or applications. Lengthening the working life of data creates enduring value by enabling subsequent usage over time. Ongoing stewardship can support new uses and user communities. Properly curated data can be combined or analyzed with data collected in the future, allowing researchers to build upon prior work (Starr et al., 2015). |
| 1.4. Transparency of Scientific Results | Making data publicly available in a repository is an important step toward transparency and reproducibility of research, which in turn assures the credibility of scientific results (McNutt et al., 2016) and the ability to build on prior work. Historically, best efforts have been made to preserve publications and the salient data published in them. In modern publishing, data need to be managed and published as products in their own right (Downs et al., 2015). |
| 1.5. Value Added Data Products | Some repositories increase data utility via pre-processing, semantic and format standardization, data aggregation and interpretation, and specific tools that support the creation of new data products, uses, and audiences beyond the original data users (e.g., general public, policy makers, education and outreach) (Baker et al., 2015). |
Table 3
Summary of findings on ease of implementation of repository benefit metrics. For a detailed list and description of the metrics, see Appendix A. Here the metrics are not necessarily named individually but are restated in general terms.
| Ease of implementation | Metrics |
| --- | --- |
| Currently measurable by most repositories | Derived from data holdings: temporal, spatial, and subject coverage. Value of repository services: number of data submitters and users supported, grants or projects served, workforce development achieved; cost savings for trustworthy data storage and distribution per submitter. Support for reuse: completeness of metadata; expressiveness of the metadata standard; presence and enforcement of data/metadata quality policies. Data reuse: numbers of downloads, page views, or distinct IP addresses accessing the data; numbers of metadata pages, data products, and specific tools accessed; time spent at the site (see the illustrative sketch following this table). |
| Possible in the foreseeable future with research, advanced technology and changed practices | Scientific impact: extracted with artificial intelligence technologies from current publications, webpages, blogs, proposals and data management plans, and more reliably based on standardized data citations once practice is established. |
| Requiring major additional resources and expertise | Surveys: interviews to ascertain user satisfaction and perceived impact on research success (research enabled, time saved, new questions developed). Economic and societal impact: of data and data products beyond scientific use, or for fraud avoidance. |
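
Several of the "currently measurable" metrics in the first row of Table 3 can be computed directly from access logs and catalog records. The following Python sketch illustrates one possible approach for two of them: data reuse (downloads and distinct IP addresses) and metadata completeness. The log format, package identifiers, and required-field list are illustrative assumptions, not part of any particular repository's implementation.

```python
# Minimal sketch: computing simple reuse and metadata-completeness metrics.
# All record formats and field names below are assumed for illustration only.

from collections import Counter

# Hypothetical simplified access-log records: (ip_address, request_path)
access_log = [
    ("192.0.2.10", "/data/package-1001/download"),
    ("192.0.2.10", "/metadata/package-1001"),
    ("198.51.100.7", "/data/package-1001/download"),
    ("203.0.113.5", "/data/package-2002/download"),
]

downloads = Counter()   # download count per data package
distinct_ips = set()    # distinct IP addresses accessing data

for ip, path in access_log:
    if "/download" in path:
        package_id = path.split("/")[2]   # e.g., "package-1001"
        downloads[package_id] += 1
        distinct_ips.add(ip)

print("Downloads per package:", dict(downloads))
print("Distinct IPs accessing data:", len(distinct_ips))

# Metadata completeness: share of required fields present and non-empty.
REQUIRED_FIELDS = ["title", "creator", "abstract", "temporal_coverage",
                   "spatial_coverage", "keywords", "license"]   # assumed field list

def completeness(metadata: dict) -> float:
    """Fraction of required fields that are present and non-empty."""
    filled = sum(1 for field in REQUIRED_FIELDS if metadata.get(field))
    return filled / len(REQUIRED_FIELDS)

example_record = {
    "title": "Lake temperature sensor data",
    "creator": "Smith, J.",
    "abstract": "Hourly water temperatures, 2015-2018.",
    "keywords": ["limnology"],
}
print("Metadata completeness:", round(completeness(example_record), 2))
```

In practice, a repository would read its own web server logs, filter out crawler traffic, and score completeness against its own metadata schema rather than the assumed field list used here.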
