A Framework for Active DMPs in Photon and Neutron Science Large-Scale Facilities

Heike Görzig; Alejandra N. Gonzalez Beltran; Felix Engel; Brian Matthews

doi:10.5334/dsj-2024-004

A Framework for Active DMPs in Photon and Neutron Science Large-Scale Facilities

Data Science Journal

Volume 23 (2024): Issue 1

By: Heike Görzig , Alejandra N. Gonzalez Beltran , Felix Engel and Brian Matthews

Open Access

|Jan 2024

Figures & Tables

An idealized facility research lifecycle, simplified (Matthews et al. 2012).

Metadata collected and information systems supporting the stages of the Experimental lifecycle.

Increasing knowledge throughout projects’ runtime and information flow.

Table 1

DMP Phases.

DMP PHASE	ACTOR PROVIDING INFORMATION FOR THE DMP
0 Before proposal submission	Typically, knowledge of instrument from scientist or RDM team (static parameter).
1 Proposal submission	Typically, knowledge of the researcher, with support from the facility administration and RDM team.
2 Accepted experiment planning	Typically, knowledge of the researcher, with support from the facility administration and instrument scientist.
3 Data Collection/Data processing/analysis	Typically, knowledge of the user, with support from the instrument scientist.

Table 2

PMBOK®’s phases and roles in RDM information collection.

ROLE	BEFORE PROJECT/OPA/EEF	PROJECT INITIATION	PROJECT PLANNING	PROJECT EXECUTION	PROJECT FINALISATION	AFTER PROJECT/PA
Instrument scientist	Instrument/software description, selection of applicable metadata standards, general dataset description		Required project specific software, used instrumentations and their configurations, and standards			Adding/actualisation of instruments and software information
Data manager	Controlled vocabularies and standards administration, mapping metadata to standards, general data policies, policy execution				Automatic metadata extraction and validation	Open access of research data, validation of policy execution; actualization of standards and policies
User office		Proposal information, instrument to be used, (co-) proposers
Experimental team(research)			Concrete dataset description, references to additional information, metadata schema selection, estimated amount of datasets produced, dataset usage, special (own) software infrastructure requirements	Experiment execution: parameter and configurations	Dataset selection, metadata completion, and validation
Experimental team (administration)		Specific policies and DMP requirements for project, funding, participating researchers	DMP actualization	DMP actualization after experiments	DMP actualization

Table 3

Facility wide metadata.

FACILITY INFORMATION
repository	The repository information comprises the name and access URL where the data is made accessible.
licence	The license usually applied to the data in the repository.
security	Information about e.g., backups and replicas of the data and other special security information.
pid_system	The default PID system applied in the repository e.g., handles or Digital Object Identifiers (DOIs).
personal_data	In case personal information in research data is treated on a facility level, e.g., no personal information is allowed in research data more than required for provenance.
min_storage_period	The minimum period research data has to be available for good scientific practice.
archive	Data archive used. If it is the same as the repository, then no URL needs be provided, as the access procedures have to be described.
certificate	If the repository is certified and with which certificate e.g., CoreTrustSeal.
arrangements	For the data produced in a research project, an arrangement with the data repository that will receive the research data has to be made. In case a proposal in PaN facilities is approved, it normally includes the usage of the repository.
embargo_period	In the data policy of the PaN facilities, there is also an embargo period defined.
access_control	How the access to the data repository is controlled.
costs	In case a proposal in PaN facilities is approved, it also normally includes the costs for research data management.

Table 4

Metadata for scientific techniques related to metadata schema and file format.

technique	Describes a scientific technique.
name	A name or label of the technique.
PID	E.g., the IRI in the PaNET Ontology (Collins et al. 2021).
metadata schema	Related to the technique are the requirements on metadata. There can be more than one metadata schema and format used in practice.
structure
format
tools
reading	Software for possible usage.
writing	Possible software for writing the format.
validation	Tools for validating the data against e.g., NeXus application definitions.¹
validation schema	Schema complies with metadata schema above, used for validation with a tool.

Table 5

Metadata for datasets.

DATASET
name	a default name
description	a project independent description of the dataset that can be adapted in projects
contributor	contributing persons, typically identified via ORCIDs
reproducible	if the dataset is reproducible and under which efforts
interested_community	might be derived from disciplines
usage	there might be some default usage scenarios, like calibration; otherwise the data sets’ intended usage in the project
archival	DMP question; moment, selection_criteria, and long_term_archival_reason might be used for automated execution and validation
data_security	measurements and responsible person
techniques	scientific techniques used to create the dataset
filecollections	a collection of files created by one software instance; a dataset can contain more than one filecollection
name	a default name
resource	instrument or laboratory used to create the filecollection; preferable identified by PID
storage	location and access to experimental storage
backup	location and access to experimental storage
quality_assurance	description and pointers of e.g., validation workflows
hardware	description of hardware components used to create the dataset; used for data curation
writing_software	description of software and its components used to create the dataset; used for data curation
files	files
name	can be a regular expression definition of default file names
format	the format of the file (could be related to a format registry and relates to the technique table above)
metadata_schema	the metadata schema applied in the file (could be related to a metadata schema registry and relates to the technique table above)
size	expected minimum and maximum size of the file; average size
amount	quantity of files; can be used together with size for estimating overall size and validation
processing_requirements	hardware and software requirements for processing the data
hardware_requirements
type	type of hardware requirements like storage or processors of a certain type of computer; manufacturer and model are required
reading_software	possible software to use the data, including access and documentation, as well as required plugins
name
PID
type
documentation
URL
plugins
name
type
URL

Table 6

Metadata for operations.

POLICY
name	The name of the policy.
constraint	When the operation should be executed.
type	Event trigger or scheduled.
value	Triggering event e.g. onCreation or date and time.
parameters	Array of required parameters like path to a file or metadata schema for validation. The parameters are divided in input and output parameters.
operation	Policy related operation or workflow (referencing an executable workflow).
categories	For finding the operation, e.g., validation, integrity, format, extraction, interoperability.
description	A textual description about what the operation does.

Table 7

Metadata for projects.

PROJECT:
name	name of the project
description	project description
funding reference	the usage of a PID is advisable for later curation and integration into a graph model
members	here as well the usage of ORCIDs is advisable
start_date	start of the project
end_date	end of the project
disciplines/keywords	to retrieve RDM requirements and improve findability of the data
jurisdictions	to retrieve policy requirements; jurisdictions can be funders, national, institutional, or a laboratory/instrument
resource	instrument or laboratory used to create data; used to retrieve possible dataset types created by the resource

Relationships associated with the Resource class and its associated technique.

Table 8

Pre-existing information lifecycle phases and related activities.

LIFECYCLE PHASE	ACTIVITIES
Before project (OPA/EEF)	Retrieve initial information for central knowledge base about Datasets of Resource from repository. General update/insert knowledge base: Metadata standards Formats Policies Mappings
Project initiation	Insert new project Relate to resource
Project planning	Specify projects datasets: Create project specific datasets Relate to default datasets of resource Update description
	Create DMP: Retrieve projects datasets and related information from central knowledge base and insert into a DMP tool Retrieve facility specific information Update information in DMP tool Create concrete policy execution environment: Retrieve to datasets related operations into execution environment
Project execution	Update DMP Retrieve projects datasets repository and update information in DMP tool Execute operations on datasets
Project finalization	Update DMP Execute operations Select datasets for archival Update concrete dataset descriptions
After project (OPA)	Validate pre-existing information against projects datasets in the repository: Update central knowledge base

Components for aDMP system (Görzig et al. 2022).

Pre-existing information enhance and use.

References

Authors

Metrics

Articles in this issue

DOI: https://doi.org/10.5334/dsj-2024-004 | Journal eISSN: 1683-1470

Journal RSS Feed

Language: English

Page range: 4 - 4

Submitted on: Dec 15, 2022

Accepted on: Dec 1, 2023

Published on: Jan 22, 2024

Published by: Ubiquity Press

In partnership with: Paradigm Publishing Services

Publication frequency: 1 issue per year

Keywords:

DMP,

Reuse,

© 2024 Heike Görzig, Alejandra N. Gonzalez Beltran, Felix Engel, Brian Matthews, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.

Volume 23 (2024): Issue 1

A Framework for Active DMPs in Photon and Neutron Science Large-Scale Facilities

Figures & Tables

Figure 1

Figure 2

Figure 3

Table 1

Table 2

Table 3

Table 4

Table 5

Table 6

Table 7

Figure 4

Figure 5

Figure 6

Table 8

Figure 7

Figure 8

Paradigm

My account