A Simplified Palaeoceanography Archiving System (PARIS) and GUI for Storage and Visualisation of Marine Sediment Core Proxy Data vs Age and Depth

Bryan C. Lougheed; Claire Waelbroeck; Nicolas Smialkowski; Natalia Vazquez Riveiros; Stephen P. Obrochta

doi:10.5334/oq.101

Introduction

A common desire for Earth Science laboratories in the computer age is the digital storage and archiving of datasets in searchable databases. Furthermore, a growing number of funding agencies and publication venues are mandating that datasets be deposited in an open repository, so that other researchers may have access to the data. The ‘big data’ benefits of such a system for palaeoceanography are clear: data from multiple locations and periods of the Earth’s history can be searched, sorted and presented according to, for example, proxy and/or species type. Such an approach would save significant person hours currently spent by researchers worldwide in searching for, downloading, understanding and digitising datasets, thus allowing for much more efficient analysis of data. This process is guided by the principles of findability, accessibility, interoperability and reusability (FAIR) (Wilkinson et al., 2016).

Much of the discussion involving the establishment of standardised digitised data has revolved around defining an ideal database format and/or repository for the storage of data (Bolliet et al., 2016; Jonkers et al., 2020; Khider et al., 2019; McKay and Emile-Geay, 2016), which is indeed a key prerequisite for the ultimate end goal whereby all data is stored on a common, publicly searchable/queryable online database in line with the goals of FAIR data. However, an often overlooked primary step in the realisation of such an end goal is ensuring that palaeoclimate data produced within a laboratory and/or research group is stored in some kind of machine readable format from the beginning, i.e. during the creation step. Current practices at many laboratories involve multiple actors and researchers of various levels of computer proficiency saving their data using idiosyncratic and machine-unreadable file formats. These practices lead to increased workload both during the project and also at the end of the project when submitting data to online repositories (i.e. due to laborious post-hoc data formatting and extensive manual metadata entry at the time of submission). Paradoxically, striving to store data in the perfect format at the final step can impede widespread sharing of machine readable data, as the increased workload can actually discourage a laboratory from producing digitised, shareable data in the first place.

Within the ERC ACCLIMATE project at Laboratoire des Sciences du Climat et de l’Environnement (LSCE), Gif-sur-Yvette, we digitised data from many existing records and created a simple GUI viewer with which to view them. We found that some trade-offs were necessary in order to digitise as much data as possible, as efficiently as possible, in an internally consistent and machine readable way. We considered existing database formats but found that, at least in the case of our project, striving for the most complete data format with very rich metadata was too time consuming. Our primary consideration was to take concrete steps to promote early adoption and awareness of the machine readable ethos within the project and laboratory (i.e. upon the generation of the data). We did so by creating a machine readable format that worked for all participants within our project and contained the metadata needed for our work, thus enabling us to produce as much digitised data as possible. We note that there are always trade-offs involved when developing any system: do researchers want a lot of data with essential metadata, or less data with very extensive metadata? This is a philosophical question and the answer depends on the particular needs of a project.

Given the aforementioned requirements for our particular project, we determined that the ideal data file format should, in our case, meet the following three criteria:

the stored data must be machine-readable across many operating systems, thus allowing for automated reading of data, as well as bulk conversion/uploading to common database formats;
it must be human-friendly, thus allowing the human eye to quickly access and understand the data contained within the file if needed, seeing as not all project participants have sufficient proficiency with higher level storage formats such as SQL, NetCDF and/or JSON;
the file creation process must be as accessible as possible and cause as little labour burden as possible for laboratory members of all levels of computer proficiency, thus encouraging the seamless and autonomous creation of machine-readable data formats from the very beginning of the project workflow (e.g., in the field or at the time of laboratory analysis), preferably offline without having to navigate some kind of data entry form and/or web portal.

Here, we present a file naming and file structure format known as the Palaeoceanography ARchivIng System (PARIS), which is optimised for human-accessibility from the very beginning of a project (in this case, the stable isotope laboratory environment). Files stored in such a machine-readable file structure can subsequently easily be automatically batch-converted to the specific format requirements of a particular data repository, thus avoiding repeated manual metadata entry upon repository submission.

We also demonstrate the machine-readable power of this simple file format as a basis for a simplified database structure to use within a laboratory: we have built a fully documented GUI in Matlab for interactive searching and plotting of data using our simple file format. We use Matlab for the GUI because it is what we had at hand, but the PARIS database format itself is language agnostic and could be exploited using any other programming language environment. The GUI allows for the rapid searching and visual presentation of data by latitude, longitude, water depth, age and, where applicable, species type. The entire setup was designed with modular expansion in mind, and both the file formatting conventions and GUI can be used and/or modified by other laboratories for their own particular needs. The structure of the archiving system is shown in Figure 1 and described in the following sections.

Figure 1

A flowchart detailing the basic structure of the PARIS simplified database system.

File structure and organisation

File naming conventions

We use a file storage system based on universally readable, tab-delimited ASCII text files, which are more than sufficient for palaeoclimate datasets from sediment cores, seeing as such sediment cores contain discrete-depth measurements numbering only in the hundreds or thousands. We chose tab delimitation over comma delimitation to avoid issues in locales where a comma is used to denote the decimal marker (e.g. much of Africa, Europe and South America). These tab-delimited files can easily be created directly from analytical software or by using basic spreadsheet software. A uniform file naming convention is used to create machine readable identifiers containing information about the data contained within the file: core name, data type (six character code) and measured material (e.g., foraminifera species). Select examples are shown in Table 1. The underscore character in the file name functions as a marker to distinguish various descriptive properties of the file, thus facilitating machine readability and automated searching of file names. As such, core names may not contain an underscore. The full species names associated with species abbreviations can be found in the file _abbreviations.txt.

Table 1

Examples of the file naming conventions used in the PARIS system.

FILENAME EXAMPLE	CORE	DATA TYPE	SPECIES/MATERIAL
ODP1234_18O13C_CWU.txt	ODP1234	Stable isotope raw data vs core depth	C. wuellerstorfi
ODP1234_18O13C_GRU.txt	ODP1234	Stable isotope raw data vs core depth	G. ruber
ODP1234_18O13C_MXB.txt	ODP1234	Stable isotope raw data vs core depth	Mixed benthics
ODP1234_14CRAW.txt	ODP1234	Radiocarbon raw ages vs core depth	Contained in file
ODP1234_TIEPTS.txt	ODP1234	Age-depth tie-points vs depth, or other non-14C age data vs depth.	Contained in file
ODP1234_MAGSUS.txt	ODP1234	Magnetic susceptibility	n/a
ODP1234_admodel.txt	ODP1234	Age-depth model	n/a
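
The naming convention in Table 1 lends itself to automated parsing. The following is a minimal Python sketch, not part of the PARIS distribution (whose GUI is written in Matlab). The function name `parse_paris_filename` and the `ParisFileName` tuple are hypothetical, and the sketch assumes that every data file ends in `.txt` and that the underscore is used only as a field separator, as stated above.

```python
from typing import NamedTuple, Optional


class ParisFileName(NamedTuple):
    core: str
    data_type: str
    material: Optional[str]  # species/material code, or None when absent


def parse_paris_filename(filename: str) -> ParisFileName:
    """Split a PARIS-style file name into core, data type and material code."""
    if not filename.endswith(".txt"):
        raise ValueError(f"expected a .txt file, got {filename!r}")
    stem = filename[: -len(".txt")]
    parts = stem.split("_")
    # Core names may not contain an underscore, so only 2 or 3
    # non-empty parts are accepted (helper files such as
    # _abbreviations.txt are rejected).
    if len(parts) not in (2, 3) or not all(parts):
        raise ValueError(f"unexpected PARIS file name: {filename!r}")
    core, data_type = parts[0], parts[1]
    material = parts[2] if len(parts) == 3 else None
    return ParisFileName(core, data_type, material)
```

For instance, `parse_paris_filename("ODP1234_18O13C_CWU.txt")` yields the core `ODP1234`, the data type `18O13C` and the material code `CWU`, while a file such as `ODP1234_14CRAW.txt` yields no material code.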

Internal file structures

Raw data files

A common challenge preventing long-term data sharing in palaeoceanography is the publication of isotope data exclusively vs age, which prevents re-evaluation of the data by future researchers as understanding of geochronological methods improves and evolves. For this reason, all isotope and other palaeoclimate proxy data in the PARIS scheme are stored against core depth as the primary format, allowing for the later application of new geochronologies, and/or comparison of proxy data vs multiple geochronologies. A further ambiguity commonplace in palaeoceanography is the reporting of only a single core depth value for a particular data point, even though each subsample represents a depth interval. To avoid such ambiguity, each data point stored using the PARIS scheme has two depth values (depth1 and depth2), which correspond to the top and bottom of a particular core interval (“depth slice”). Within the PARIS scheme, it is also possible to enter NaN for depth2. In such a case, depth2 is simply assumed to be 1 cm greater than depth1 (i.e. depth1 represents the top of a depth interval with a thickness of 1 cm).
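
The depth2 rule can be illustrated with a short Python sketch. The function names are hypothetical, and depths are assumed to be expressed in cm, as implied by the 1 cm default. The `midpoint` helper is an added convenience for plotting a slice at a single depth and is not described in the text.

```python
import math
from typing import Tuple

DEFAULT_THICKNESS_CM = 1.0  # thickness assumed when depth2 is NaN


def resolve_interval(depth1: float, depth2: float) -> Tuple[float, float]:
    """Return (top, bottom) of a depth slice in cm, filling a NaN depth2."""
    if math.isnan(depth2):
        depth2 = depth1 + DEFAULT_THICKNESS_CM
    return depth1, depth2


def midpoint(depth1: float, depth2: float) -> float:
    """Return the mid-depth of the slice after resolving a NaN depth2."""
    top, bottom = resolve_interval(depth1, depth2)
    return (top + bottom) / 2.0
```

A sample entered with depth1 = 10 and depth2 = NaN is thus treated as the interval 10 to 11 cm.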

The tab-delimited ASCII text format is used to structure data in column/row format, whereby data such as depth, measurement value and measurement uncertainty are stored in specified column numbers. When there is no data available for a particular sample (e.g. δ¹⁸O value but no accompanying δ¹³C value) a NaN is entered as a placeholder for the missing value, thus ensuring structural integrity and machine-readability of the file. The formatting used for each type of proxy is detailed in the user manual included with the GUI software. All raw data files are stored within the “raw data” folder. Here, we supply a number of example files of previously published Atlantic Ocean sediment core stable isotope data (Table 2) that were collated by Waelbroeck et al. (2019).
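
As a rough illustration of why the NaN placeholder keeps such files machine readable, the following Python sketch reads tab-delimited numeric rows and checks that every row has the same number of columns. It assumes purely numeric columns and no header line. The actual column layout for each proxy is defined in the user manual and is not reproduced here, so the column order in the usage note is hypothetical.

```python
import io
from typing import List, TextIO


def read_tab_delimited(handle: TextIO) -> List[List[float]]:
    """Read numeric rows from a tab-delimited stream; 'NaN' parses to float nan."""
    rows = []
    for line in handle:
        line = line.rstrip("\r\n")
        if not line.strip():
            continue  # skip blank lines
        rows.append([float(field) for field in line.split("\t")])
    # A NaN placeholder keeps every row the same width; a missing
    # field would show up here as an inconsistent column count.
    widths = {len(row) for row in rows}
    if len(widths) > 1:
        raise ValueError(f"inconsistent column counts: {sorted(widths)}")
    return rows
```

With a hypothetical column order of depth1, depth2, δ¹⁸O and δ¹³C, the text `"10\tNaN\t3.21\t0.95\n12\t13\t3.30\tNaN\n"` wrapped in `io.StringIO` is read as two rows of four values each, with the missing entries preserved as NaN.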

Table 2

Stable isotope data from the following sediment cores are included in the demonstration database.

CORE	STUDY
CH22-KW31	Pastouret et al. (1978)
CH69-K09	Labeyrie et al. (1999)
ENAM93-21	Rasmussen et al. (1998, 1996)
EW9209-1JPC	Curry and Oppo (1997)
GEOFAR-KF13	Richter (2001)
GIK12392-1	Zahn et al. (1986)
GIK15669-1	Sarnthein et al. (1994)
GIK23415-9	Jung (1996)
GL1090	Santos et al. (2017)
GeoB1023-5	Mulitza et al. (1999)
GeoB1515-1	Vidal et al. (1999)
GeoB16202-2	Voigt et al. (2017)
GeoB16206-1	Voigt et al. (2017)
GeoB16224-1	Voigt et al. (2017)
GeoB1711	Vidal et al. (1999)
GeoB1720-2	Dickson et al. (2009)
GeoB3202-1	Arz et al. (1999)
GeoB4240-2	Freudenthal et al. (2002)
GeoB5546-2	Holzwarth et al. (2010)
GeoB6201-5	Portilho-Ramos et al. (2018)
GeoB7920-2	Tjallingii et al. (2008)
GeoB9508-5	Mulitza et al. (2008)
GeoB9526-5	Zarriess and Mackensen (2011)
KNR159-5-36GGC	Oppo and Horowitz (2000)
KNR166-2-26JPC	Schmidt and Lynch-Stieglitz (2011)
KNR166-2-29JPC	Lynch-Stieglitz et al. (2011)
KNR166-2-31JPC	Lynch-Stieglitz et al. (2011)
KNR166-2-73GGC	Lynch-Stieglitz et al. (2011)
KNR197-10-17GGC	Keigwin and Swift (2017)
KNR31-GPC5	Keigwin et al. (1991)
M35003-4	Hüls (1999)
M39008-3	Eynaud et al. (2009)
MD01-2461	Peck et al. (2007)
MD02-2575	Ziegler et al. (2008)
MD02-2588Q	Ziegler et al. (2008)
MD02-2594	Dyez et al. (2014)
MD03-2705	Jullien et al. (2007)
MD03-2707	Weldeab et al. (2016)
MD07-3076Q	Waelbroeck et al. (2011)
MD08-3180Q	Repschläger et al. (2015a, 2015b)
MD95-2010	Dokken and Jansen (1999)
MD95-2039	Zahn et al. (1997)
MD95-2040	Voelker and Abreu (2011)
MD95-2041	Voelker and Abreu (2011)
MD95-2042	Shackleton et al. (2000)
MD99-2284	Dokken et al. (2013)
MD99-2334K	Skinner and Shackleton (2004)
NA87-22	Vidal et al. (1997)
OCE205-2-100GGC	Came et al. (2008)

Age-depth model files

Within the PARIS system, separate age-depth model files are used to assign age and age uncertainty to the raw data that is stored against depth. Age-depth model files (corename_admodel.txt) are contained in a folder called “master” within the “age models” folder. The reason for this additional subdirectory level is to allow different age model scenarios to be stored, which can subsequently be accessed from the GUI. For example, one might wish to store and compare different age-depth models based on different methods (¹⁴C, U/Th, etc.) for the same set of sediment cores. Similarly, one may want to compare age-depth models developed by different software packages (Blaauw and Christen, 2011; Bronk Ramsey, 1995; Haslett and Parnell, 2008; Lougheed and Obrochta, 2019). In that case, an additional folder can be made within the “age models” folder, and its contents will be accessible from the GUI. Age-depth model files use the “Undatable” (Lougheed and Obrochta, 2019) output file format by default, but users can switch to the file format of a different age-depth modelling software, or indeed any type of age-depth model file, by editing the admodelformat.m formatting file contained within the relevant subdirectory of the “age models” folder. Here, to demonstrate the PARIS system, we supply a number of age-depth model files produced for Atlantic Ocean sediment cores by Waelbroeck et al. (2019).
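
The separation of raw data and age-depth models means that an age is assigned to a depth only at the time of use. A minimal Python sketch of this step follows. It uses simple linear interpolation between age-depth model points, which is an assumption made for illustration and not necessarily the method used by the PARIS GUI. The function name is hypothetical, and `model_depths` is assumed to be strictly increasing.

```python
from bisect import bisect_right
from typing import Sequence


def age_at_depth(
    depth: float, model_depths: Sequence[float], model_ages: Sequence[float]
) -> float:
    """Linearly interpolate age at a depth; return NaN outside the model range."""
    if len(model_depths) != len(model_ages) or len(model_depths) < 2:
        raise ValueError("model needs at least two matching depth/age points")
    if depth < model_depths[0] or depth > model_depths[-1]:
        return float("nan")
    i = bisect_right(model_depths, depth) - 1
    i = min(i, len(model_depths) - 2)  # keep the deepest point inside the last segment
    d0, d1 = model_depths[i], model_depths[i + 1]
    a0, a1 = model_ages[i], model_ages[i + 1]
    return a0 + (a1 - a0) * (depth - d0) / (d1 - d0)
```

Calling the same function with age-depth model points read from a different scenario folder gives the alternative age for the same depth, which is what makes comparison between scenarios straightforward.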

Core information index file

The names of all raw data files and age-depth model files contain a unique code identifying the sediment core that they come from. An additional file (_core information.txt) is present within the main folder of PARIS, which details some basic metadata for each core, namely location (latitude and longitude) and water depth (mbsl). This allows the PARIS system to search for sediment core locations that match specific search criteria (e.g. a certain water depth or latitude/longitude bounding box) and then retrieve all raw data and age-depth models associated with the matching sediment cores.
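
The two-stage search described above (first cores, then files) can be sketched in Python as follows. The `CoreInfo` fields, the function names and the core names in the usage note are hypothetical. The sketch assumes longitudes in degrees east and does not handle bounding boxes that cross the antimeridian.

```python
from typing import Iterable, List, NamedTuple, Tuple


class CoreInfo(NamedTuple):
    name: str
    lat: float          # degrees north
    lon: float          # degrees east
    water_depth: float  # metres below sea level


def search_cores(
    cores: Iterable[CoreInfo],
    lat_range: Tuple[float, float],
    lon_range: Tuple[float, float],
    depth_range: Tuple[float, float],
) -> List[str]:
    """Return names of cores inside the bounding box and water-depth range."""
    return [
        c.name
        for c in cores
        if lat_range[0] <= c.lat <= lat_range[1]
        and lon_range[0] <= c.lon <= lon_range[1]
        and depth_range[0] <= c.water_depth <= depth_range[1]
    ]


def files_for_cores(filenames: Iterable[str], core_names: Iterable[str]) -> List[str]:
    """Select files whose name begins with '<core>_' for any matching core."""
    wanted = set(core_names)
    return [f for f in filenames if f.split("_", 1)[0] in wanted]
```

For example, searching a list containing the fictional cores `CORE-A` (40° N, 20° W, 3000 m) and `CORE-B` (10° S, 5° E, 1500 m) for latitudes 30 to 50, longitudes -30 to -10 and water depths 2000 to 4000 m returns only `CORE-A`, whose data files can then be selected by file name prefix.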

Reference records and bathymetry

Laboratories may also wish to store climate reference records for display within the GUI, and we therefore include a number of such records. These include the Greenland ice-sheet δ¹⁸O and Ca²⁺ records (Andersen et al., 2006; Rasmussen et al., 2006; Seierstad et al., 2014), temperature derived from the Greenland isotope temperature record (Kindler et al., 2014), and atmospheric CO₂ derived from the Antarctic ice core record (Lüthi et al., 2008). We also include a downscaled version of the GEBCO bathymetry (General Bathymetric Chart of the Oceans, 2015), which is used within the PARIS GUI to provide a simple map showing core locations superimposed upon bathymetry.

GUI search interface

To demonstrate the power of the text file based archiving system, and in order to provide a system with which laboratory members at LSCE could browse and visualise sediment core data, a GUI system was developed in Matlab (Figure 2). This system allows the user to search for sediment core locations according to certain criteria and to specify which types of data to plot. The plots are shown on three vertical panels (Figure 3). Data from multiple sediment cores and/or species can be plotted onto one of the first two panels, in order to facilitate inter-core comparison. Data can be plotted against depth or against age (from one of the supplied age-depth models), and the user can choose to plot with or without error bars. The third panel is reserved for plotting reference data or sediment accumulation rate (SAR) plots. The software automatically assigns a unique colour code to each sediment core, and a unique symbol and line type to each type of data and/or species. Legends are also shown for ease of user interpretation. Finally, every time a plot is generated, a publication quality PDF of the plotted panels is generated within the main PARIS folder, saved under a name specified by the user.

Figure 2

An example of the PARIS GUI interface for visual browsing of the database system.

Figure 3

Example of an output figure from the PARIS GUI interface, in this case showing stable oxygen and carbon isotope measurements carried out on *C. wuellerstorfi* (CWU) for two sediment cores, CH69-K09 (Labeyrie et al., 1999) and NA87-22 (Vidal et al., 1997). Also shown are Greenland ice core oxygen isotope data (Andersen et al., 2006; Rasmussen et al., 2006; Seierstad et al., 2014). Younger Dryas and Heinrich Stadial intervals are as defined by Waelbroeck et al. (2019).

Database inter-compatibility potential

Once all data from a given laboratory is stored using a common format, the process of submission to a given database or repository (i.e. changing the format to suit a particular repository) can be fully automated. One needs only to write a one-off batch script that converts all files from the laboratory to the format required by each repository. Here, we provide an example of a similar script (dataonage.m) that was used within the LSCE laboratory to systematically read in all isotope data vs depth and output all isotope data vs age according to their respective age-depth models. Hence, systematically updating the age values when (re-)submitting to repositories becomes a simple and rapid task.
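
To give an impression of how little code such a one-off batch script requires, the following Python sketch converts every raw data file in a folder to a JSON record. It is not a translation of dataonage.m. The JSON layout is a hypothetical stand-in for the format of a real repository, the function names are illustrative, and the sketch assumes purely numeric, header-free, tab-delimited files.

```python
import json
import math
from pathlib import Path


def convert_file(src: Path) -> dict:
    """Turn one tab-delimited PARIS file into a plain dictionary record."""
    parts = src.stem.split("_")
    rows = []
    for line in src.read_text(encoding="ascii").splitlines():
        if line.strip():
            values = [float(v) for v in line.split("\t")]
            # JSON has no NaN literal, so missing values become null.
            rows.append([None if math.isnan(v) else v for v in values])
    return {
        "core": parts[0],
        "data_type": parts[1] if len(parts) > 1 else None,
        "material": parts[2] if len(parts) > 2 else None,
        "rows": rows,
    }


def batch_convert(raw_dir: Path, out_dir: Path) -> int:
    """Convert every data .txt file in raw_dir to a .json file in out_dir."""
    out_dir.mkdir(parents=True, exist_ok=True)
    count = 0
    for src in sorted(raw_dir.glob("*.txt")):
        if src.name.startswith("_"):
            continue  # skip helper files such as _abbreviations.txt
        record = convert_file(src)
        target = out_dir / (src.stem + ".json")
        target.write_text(json.dumps(record), encoding="utf-8")
        count += 1
    return count
```

Because the core name, data type and material are recovered from the file name, no manual metadata entry is needed at conversion time. Supporting a further repository would only require a different `convert_file`.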

Conclusion

The palaeoclimate literature has begun to embrace the principles of FAIR data and many good examples of useful database structures have been previously provided (Bolliet et al., 2016; Jonkers et al., 2020; Khider et al., 2019; McKay and Emile-Geay, 2016). We have provided an example of a concrete first step in the journey towards FAIR data: the creation of machine-readable data at the field and laboratory level. The involvement of multiple actors within a project requires that a machine readable format is fully accessible to persons with only basic computer proficiency. The simple PARIS file naming and structuring system, based on the ubiquitously accessible tabbed-text file format, is one such example, and has been successfully deployed at LSCE. The simple structure is nonetheless powerful in that data can easily be indexed and searched, as demonstrated here using a GUI for data visualisation. Such an approach can encourage a laboratory to adhere to the FAIR data principles from the very outset of a project, thus saving much of the time and resources that would otherwise be spent on post-hoc data conversion. While the format proposed is in some aspects not as metadata-rich as other formats previously proposed, the trade-off is that digitisation of existing and newly generated data can be carried out at a relatively rapid pace, while still preserving machine readability and its associated searching power.

Database and GUI availability

The database and GUI system can be downloaded from Zenodo: https://doi.org/10.5281/zenodo.4680717.

Acknowledgements

This is a contribution to the ACCLIMATE ERC project; the research leading to these results has received funding from the European Research Council under the European Union’s Seventh Framework Program (FP7/2007-2013 Grant agreement n° 339108). B.C. Lougheed acknowledges Swedish Research Council (Vetenskapsrådet) grants 637-2014-499 and 2018-04992.

Competing Interests

The authors have no competing interests to declare.