Biobabel: A Unified Interface for Reading Diverse Physiology File Formats Including Cardiac, Respiration, and Electrodermal Data

Floris Tijmen van Vugt

doi:10.5334/jors.724

(1) Overview

Introduction

Biosignals such as cardiac activity (typically from the electrocardiogram, ECG), electrodermal activity (EDA), respiration, and others [1, 2, 3, 4] provide a unique window into the cognitive, social, and emotional processes unfolding in humans and between humans [5]. There are now wonderful packages for preprocessing (e.g. neurokit [6]) and analyzing (e.g. biopeaks [7]) biosignals. However, progress is hampered by the proliferation of file formats for such data (EDF, XDF, OpenSignals, BDF, CSV, AcqKnowledge ACQ, etc.).

This leads to the following problems.

The first problem is that existing software packages typically read only one of these formats, requiring researchers to convert files between formats, which is tedious and error-prone, or simply impossible when using read-only libraries. For example, for any given file format, Python packages exist that can read files in that format (e.g. pyxdf can read XDF files and pyedflib can read EDF files). However, if a user has built a pipeline for XDF files but then wants to use EDF input files, they need to convert EDF to XDF, which is not possible with the corresponding read-only Python package.

The second problem is that existing Python packages for reading these files tend to read the data into structures that are typically organized differently. Hence, researchers are required to reorganize their code to cater to different formats, which is tedious and error-prone. Different file formats, and their corresponding Python packages, make different assumptions about the data structure: in some formats, multiple signals in a file are forced to have the same sampling rate (e.g. OpenSignals [8]), whereas in other formats sampling rates can vary (e.g. XDF). In some cases, the signals are supposed to have the same onset time (e.g. EDF), whereas other formats allow different onset times requiring re-aligning (e.g. XDF). All this makes conversion cumbersome and errors can easily slip in.

This state of affairs hampers the development of unified, reproducible pipelines that can be shared between research groups across the globe. Increasingly, the field calls for sharing of data analysis pipelines between research groups as an indispensable step toward much-needed reproducibility [9]. In addition, sharing analysis pipelines, rather than each group having to reinvent the wheel, allows for more efficient use of scientists’ time.

The third problem is that it is becoming increasingly important for physiological software to accommodate data from multiple participants. There is increasing interest in neuroscience in collecting physiological data simultaneously from multiple participants interacting in real-time [10]. Such hyperscanning studies place unique demands on file structures that classically were designed for data from single participants only.

Thus, what is needed is a software package that can read a variety of file formats into a reasonably flexible data structure that abstracts away from their differences. Such a package should accommodate data streams from multiple participants and allow the data to be written in a sensible native open-standard format. These challenges were already solved for neuroimaging data by the nibabel package [11], from which we draw inspiration here. For physiology data, however, such a software suite has surprisingly been missing until now.

Implementation and Architecture

biobabel is a Python package whose main functionalities are the following (for a high-level overview, see Figure 1):

Seamless reading of a host of physiology data file formats.
Data flows into an object with a flexible internal structure supporting multiple data streams, time point markers, various sampling rates, and multiple participants.
Basic data manipulation (cropping in time, selecting subsets of channels, etc.) and visualization (previewing) not typically implemented in existing software packages.
A set of Swiss army knife command-line-based tools for on-the-fly data inspection and manipulation.
Streamlined modular code that allows the package to be easily extended to read file formats not yet included.
Data can be written to an open standard file format based on HDF5.

Figure 1: Biobabel allows reading major physiological data file formats, either via command line (top panel) using a variety of script utilities, or within Python (bottom panel). Within Python, a *biodata* object is generated which can be queried, visualized, and from which data and metadata can be extracted.

For a full demonstration, see the documentation and illustration notebook.

Supported data formats

At the time of writing biobabel supports the data formats shown in Table 1.

Table 1

Data formats supported in biobabel.

FORMAT	FILE EXTENSION	SUPPORTED VIA
Extensible Data Format	.xdf	pyxdf
BIOSEMI 24-bit BDF	.bdf	pybdf
BIOPAC AcqKnowledge	.acq	bioread
OpenSignals (r)evolution/BiTalino	.txt	opensignalsreader
European Data Format	.edf	pyedflib
Generic CSV	.csv	in-house
hdphysio5	.hdf5	Biobabel-specific HDF5 flavor

The format of input files is guessed automatically at the time of reading, using clues such as the file extension. If these are insufficiently informative, guesses are made by sniffing the contents of the file. For more information about inferring formats and metadata, please see the section on inferring formats and metadata below.

Within Python, the following code is sufficient to read a data file:

import biobabel as bb
bio = bb.load('tests/example.hdf5')

Then, we can view basic properties of the data file:

bio.print()

This will produce an overview of the dataset indicating sampling frequencies and durations:

Summary of Simulated data
- date 07/20/2023 10:48:32 EDT-0400
 
Participant 'a'
- channel a_ecg [ modality ecg ] 15000 samples @ 1000.0 Hz = 15.0 s
- channel a_ppg [ modality ppg ] 15000 samples @ 1000.0 Hz = 15.0 s
 
Participant 'b'
- channel b_ecg [ modality ecg ] 15000 samples @ 1000.0 Hz = 15.0 s

And easily inspect the data using a plot:

bio.plot()

which produces the output shown in Figure 2.

Figure 2: Overview plot of sample data file, indicating each channel as a separate panel. Vertical dashed lines are time markers.

Biobabel internal data structure

Internally, biobabel stores physiological datasets in a Biodata object (bio in the above example). Under the hood, this object contains a number of data streams, each of which is a one-dimensional data array with associated key-value metadata, such as sampling frequency, participant ID, etc. Each data stream is identified by a unique ID.

The channel metadata allows us to easily find channels by data type:

bio.find_channels({'modality':'ecg'}) # return which ECG data channels

This returns a list of channel IDs:

['a_ecg', 'b_ecg']

The channel IDs can then be used to query the channel metadata (in dictionary format) and extract its data:

hdr,dat = bio.get('a_ecg')
hdr # find the associated metadata for this channel

which returns the metadata in hdr:

{'id': 'a_ecg',
 'participant': 'a',
 'sampling_frequency': 1000,
 'modality': 'ecg'}
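
The stream-plus-metadata layout described above can be illustrated with a small stand-in container. This is a minimal sketch under stated assumptions, not biobabel's actual implementation: the class names Stream and MiniBiodata are hypothetical, and only the method names find_channels and get are borrowed from the examples above.

```python
# Hypothetical sketch: NOT biobabel's real classes, only an illustration
# of "one-dimensional stream + key-value metadata, keyed by unique ID".
from dataclasses import dataclass, field


@dataclass
class Stream:
    """One data stream: a 1-D list of samples plus key-value metadata."""
    samples: list
    meta: dict


@dataclass
class MiniBiodata:
    """Toy container holding streams under their unique channel ID."""
    streams: dict = field(default_factory=dict)

    def add(self, samples, **meta):
        # The 'id' metadata entry doubles as the unique stream key.
        self.streams[meta["id"]] = Stream(list(samples), dict(meta))

    def find_channels(self, query):
        # Return the IDs of streams whose metadata matches every query item.
        return [
            cid for cid, s in self.streams.items()
            if all(s.meta.get(k) == v for k, v in query.items())
        ]

    def get(self, channel_id):
        # Return (metadata, samples), mirroring the hdr, dat pair above.
        s = self.streams[channel_id]
        return s.meta, s.samples


bio = MiniBiodata()
bio.add([0.0] * 5, id="a_ecg", participant="a", sampling_frequency=1000, modality="ecg")
bio.add([0.0] * 5, id="a_ppg", participant="a", sampling_frequency=1000, modality="ppg")
bio.add([0.0] * 5, id="b_ecg", participant="b", sampling_frequency=1000, modality="ecg")

ecg_ids = bio.find_channels({"modality": "ecg"})   # ['a_ecg', 'b_ecg']
hdr, dat = bio.get("a_ecg")
```

Because queries are plain dictionaries matched against the metadata, the same lookup works for any key, for example {'participant': 'b'}.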

In biobabel, each data stream can have its own sampling frequency, but all data streams are assumed to start at the same time. In my experience analyzing physiological data, this common starting time assumption is sensible, since it holds true in most applications and simplifies subsequent data handling. For data formats in which this assumption does not necessarily hold true (e.g. XDF), biobabel pads the data streams with missing-data samples (NaN) so that all streams start at the same time.
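
The padding step can be sketched as follows. This is an illustration of the idea only, assuming each stream is described by a start time in seconds, a sampling frequency, and a list of samples; the function name pad_to_common_start and the rounding to the nearest whole sample are assumptions, not a description of biobabel's internals.

```python
import math


def pad_to_common_start(streams):
    """Prepend NaN samples so that every stream starts at the earliest onset.

    streams maps a channel ID to (start_time_s, sampling_frequency_hz, samples).
    Returns the common start time and a dict of padded sample lists.
    """
    t0 = min(start for start, _, _ in streams.values())
    padded = {}
    for cid, (start, fs, samples) in streams.items():
        # Number of missing samples between the common start and this onset,
        # rounded to the nearest sample (an assumption of this sketch).
        n_pad = round((start - t0) * fs)
        padded[cid] = [math.nan] * n_pad + list(samples)
    return t0, padded


# Two streams with different onsets and different sampling rates.
streams = {
    "ecg": (10.0, 1000.0, [1.0, 2.0, 3.0]),
    "resp": (10.002, 500.0, [4.0, 5.0]),
}
t0, padded = pad_to_common_start(streams)
```

Here the respiration stream starts 2 ms late at 500 Hz, so it receives a single NaN sample, while the ECG stream is left unchanged.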

biobabel also supports markers, which are points in time at which specific events are recorded. These can be, for example, start/stop markers indicating separate recording segments (e.g. append-markers in BIOPAC AcqKnowledge files). Markers are stored in the Biodata object and can be accessed using bio.get_markers() (to find the marker names) and bio.get_marker(<NAME>) (to extract the corresponding time points). In the default plotting functions of biobabel, they are indicated with dashed vertical lines (Figure 2).

biobabel supports a number of typical data management steps that most packages do not straightforwardly allow, such as cropping the data to a selected time range (bio.crop(t_start,t_end)) and dropping or selecting channels.
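
Since streams may have different sampling frequencies, cropping to a time range means converting the same time window into different sample indices for each stream. The sketch below illustrates this conversion; the helper crop_samples is hypothetical, and it assumes that time zero is the first sample and that the window is half-open (t_start included, t_end excluded), which may differ from what bio.crop does.

```python
def crop_samples(samples, fs, t_start, t_end):
    """Return the samples falling in the window [t_start, t_end) seconds.

    fs is the sampling frequency in Hz; time 0 is the first sample.
    """
    i_start = max(0, round(t_start * fs))
    i_end = min(len(samples), round(t_end * fs))
    return samples[i_start:i_end]


# Two channels of 15 s each, sampled at different rates.
ecg = list(range(15000))   # 1000 Hz
ppg = list(range(1500))    # 100 Hz

# The same 2 s to 5 s window yields different numbers of samples.
ecg_crop = crop_samples(ecg, 1000.0, 2.0, 5.0)   # 3000 samples
ppg_crop = crop_samples(ppg, 100.0, 2.0, 5.0)    # 300 samples
```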

Finally, data can be saved in the biobabel native HDF5-based format (bio.save).

For labs engaging in hyperscanning, biobabel supports data from multiple participants. Each data stream can be allocated to a specific participant, allowing the software to list all participants (bio.get_participants()) or to get the channels for a specific participant (bio.find_channels({'participant':'b'})).

Inferring formats and metadata

biobabel makes two kinds of inferences when it reads files.

First, biobabel attempts to infer what is the file format of an input file. For example, is a given file in the EDF or AcqKnowledge format?

Default strategy. The function biobabel.io.guess_dialect first checks the extension of the file name. For example, if the file name ends with .edf, biobabel will guess it is an EDF file. Some extensions are ambiguous, such as .txt, which is used for OpenSignals files but also for Teensy-ECG formats. In this case, biobabel.io.guess_dialect reads the first line of the file, which for OpenSignals files should contain the string OpenSignals.
What happens when the default strategy is wrong. In most cases, if the wrong file format is guessed, an error will be thrown. This is the desired behavior, since it hands control back to the user to override the file format choice, as exemplified in tests/test_wrong_extension.py.
How to correct when the default strategy is wrong. The user can override the file format guess by specifying the file format manually. For example, if biobabel.load(filename) incorrectly guesses the file to be an EDF file, while it should be an AcqKnowledge file, the user can force the software to read the file as an AcqKnowledge file by calling biobabel.load_acq(filename). This function returns exactly the same biodata object that would have been returned if biobabel.load() had guessed the file format correctly.
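
The two-stage strategy described above (extension first, then sniffing the first line) can be sketched as follows. This is a simplified illustration and not the code of biobabel.io.guess_dialect: the function name guess_format, the format labels, and the decision to raise ValueError for unrecognized files are all assumptions made for this example.

```python
from pathlib import Path
import tempfile

# Unambiguous extensions mapped to illustrative format labels.
EXTENSION_MAP = {
    ".edf": "edf",
    ".bdf": "bdf",
    ".xdf": "xdf",
    ".acq": "acq",
    ".csv": "csv",
    ".hdf5": "hdf5",
}


def guess_format(path):
    """Guess a file format from its extension, sniffing the file if needed."""
    path = Path(path)
    ext = path.suffix.lower()
    if ext in EXTENSION_MAP:
        return EXTENSION_MAP[ext]
    if ext == ".txt":
        # Ambiguous extension: inspect the first line of the file.
        with open(path, "r", errors="replace") as f:
            first_line = f.readline()
        if "OpenSignals" in first_line:
            return "opensignals"
    # Hand control back to the caller rather than guessing silently.
    raise ValueError(f"Cannot determine the format of {path.name}")


# Demonstration with a temporary file whose first line is a placeholder
# header containing the string 'OpenSignals'.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "recording.txt"
    sample.write_text("# OpenSignals placeholder header\n0\t1\t2\n")
    fmt = guess_format(sample)
```

Raising an error for unrecognized files follows the design choice described above: a failed guess is surfaced to the user, who can then select the reader explicitly.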

Second, biobabel attempts to infer metadata for a given input file when it is not explicitly indicated. Examples of such metadata are the type of physiological data contained in a given channel, e.g., PPG or ECG data.

Default strategy. biobabel will read modality information from the input file if present. For example, the HDF5 file format contains this information, which is simply taken over by biobabel. In the case of file formats that do not explicitly encode the modality of data channels, biobabel will search if the channel name contains any of the major modalities (e.g., ECG, PPG) and, if found, enter this as the channel modality.
What happens when the default strategy is wrong. The default strategy may not always provide the correct modality for a channel. As a result, a channel will not be found when searching channels for a given modality, e.g., the channel will not show up in a search of all ECG channels via dat.find({'modality':'ecg'}).
How to correct when the default strategy is wrong. The user can manually override the channel modality after the file is read. For instance, biodata.update_channel('b_ecg',{'modality':'eeg'}) will set the modality of the channel to eeg. This is illustrated in the test file tests/test_guess_modality.py.
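
The name-based modality guess and its manual correction can be sketched as below. This is an assumption-laden illustration, not biobabel's code: the function guess_modality, the list of modality keywords, and the use of a plain dictionary for channel metadata are all hypothetical.

```python
# Illustrative keyword list; the modalities biobabel actually searches
# for may differ.
KNOWN_MODALITIES = ("ecg", "ppg", "eda", "resp")


def guess_modality(channel_name):
    """Return the first known modality found in the channel name, or None."""
    lowered = channel_name.lower()
    for modality in KNOWN_MODALITIES:
        if modality in lowered:
            return modality
    return None


# A descriptive channel name is recognized...
guessed = guess_modality("a_ECG")        # 'ecg'

# ...but an uninformative name is not, so the channel would be missed
# in a search for ECG channels.
meta = {"id": "Lead II", "modality": guess_modality("Lead II")}

# Manual override after reading, analogous to update_channel above.
meta.update({"modality": "ecg"})
```

Substring matching of this kind can also produce false positives when a keyword happens to occur inside an unrelated channel name, which is another reason a manual override is useful.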

Easy previewing and some manipulation from the command line

biobabel provides simple accessible previewing of data files directly from the command line. This functionality is inspired by AFNI [12], a toolbox of shell scripts for neuroimaging analysis.

The following shell scripts are currently included and available automatically if the package is installed via pip:

bioinfo <filename> which reads the data file and prints a summary (a wrapper around biodata.print())
biobabel <filename> which reads the data file and produces a simple plot (a wrapper around biodata.view())
tohdf5 <filename> which converts a data file in any of the supported formats into biobabel’s native HDF5 format.
biosplit <filename> which splits the data along its integrated markers (which often correspond to different recording sessions) into multiple separate files (e.g. <filename_001>, <filename_002> etc.)
bioview <filename> which launches a graphical user interface (GUI) reader allowing interactive inspection of data as shown below.
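
To illustrate the idea behind splitting a recording along its markers, the sketch below cuts a single channel into consecutive segments at the marker times and generates numbered output names. It is a simplified stand-in and not the biosplit implementation: the function names are hypothetical, and how biosplit treats paired start/stop markers or names its output files is not specified here.

```python
def split_at_markers(samples, fs, marker_times):
    """Cut samples into consecutive segments at the given marker times (s)."""
    # Convert marker times to sample indices, keeping only those that fall
    # strictly inside the recording, without duplicates.
    indices = {round(t * fs) for t in marker_times}
    boundaries = sorted(i for i in indices if 0 < i < len(samples))
    edges = [0] + boundaries + [len(samples)]
    return [samples[a:b] for a, b in zip(edges[:-1], edges[1:])]


def segment_names(stem, n_segments):
    """Numbered names such as rec_001, rec_002, ... (illustrative only)."""
    return [f"{stem}_{i:03d}" for i in range(1, n_segments + 1)]


# Ten samples at 1 Hz with markers at 3 s and 7 s give three segments.
segments = split_at_markers(list(range(10)), 1.0, [3.0, 7.0])
names = segment_names("rec", len(segments))
```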

Quality Control

Accuracy

The software has been tested over the years during which it has been in development, which is from roughly 2021 until 2025, the time at which its basic functionality is considered stable. The software is part of standard lab pipelines at the BRAMS center in Montreal and is shared with colleagues around the world. Because of this, the software has undergone repeated cycles of error corrections based on field observations. Data read from specific files was compared against the same data read in other (commercial) interfaces. A variety of data file formats has been tested in this manner over the years to ensure the accuracy of the data read in biobabel.

Test suite

A standardized test suite based on pytest has been developed and is being expanded. By running the pytest command from the root of the package, a series of tests is run in which sample files (included in the subfolder tests/samples) are read and the output is compared against known values. The included sample files have different file formats that the software is supposed to recognize automatically. Also included are tests where the package is supposed to throw an exception, such as when reading a file assumed to be of the wrong format (e.g., tests/test_wrong_extension.py). Finally, a test is included that verifies the data of a file that is converted from OpenSignals format to HDF5 (tests/test_convert_h5.py). These tests are not meant to be an exhaustive test of all functionalities of the package, but rather provide examples of imagined use cases. The user can also manually run these tests by launching the appropriate scripts in the tests subdirectory.

Computational overhead

What is the computational overhead of biobabel? Since biobabel is an added layer on top of existing packages, it is conceivable that it adds computation time to already data-intensive pipelines, which would impair the usefulness of the package.

To test this, the following performance stress test was performed. A large physiological data file was collected, involving eight data channels sampled at 1 kHz with a total duration of 12709.3 seconds (approximately 3.5 h). The data was written in the BIOPAC AcqKnowledge format, amounting to a file size of 194 MB. The time taken to load the data into biobabel (using biobabel.load()) was measured, as well as the time taken to load it directly using the dedicated Python package bioread. This process was repeated 10 times to gain an estimate of variability.
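
A timing procedure of this kind can be sketched with the standard library. The harness below is a generic illustration, not the script used for the measurements reported here: the function time_loader is hypothetical, and a dummy workload stands in for the actual file-loading calls so that the example runs without any data file.

```python
import statistics
import time


def time_loader(load, n_repeats=10):
    """Call load() repeatedly; return mean and SD of the duration in ms."""
    durations_ms = []
    for _ in range(n_repeats):
        t0 = time.perf_counter()
        load()
        durations_ms.append((time.perf_counter() - t0) * 1000.0)
    return statistics.mean(durations_ms), statistics.stdev(durations_ms)


def dummy_load():
    # Stand-in workload; replace with the real loading call when benchmarking.
    return sum(range(100_000))


mean_ms, sd_ms = time_loader(dummy_load)
print(f"{mean_ms:.2f} ms (SD {sd_ms:.2f} ms)")
```

For a real comparison, one would pass a callable such as lambda: biobabel.load(filename) and, separately, the corresponding bioread call, using the same file for both.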

Reading the file took 1965 ms (SD 37 ms) with biobabel and 1732 ms (SD 29 ms) with bioread alone. The overhead imposed by biobabel is therefore estimated at around 233 ms for a file of this size, i.e., 11.9% of biobabel's total loading time (or 13.5% relative to bioread alone). In summary, biobabel imposes a modest computational overhead.
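
The overhead figures follow directly from the two mean loading times. The short calculation below reproduces them and makes explicit which denominator each percentage uses; the variable names are chosen for this illustration only.

```python
biobabel_ms = 1965.0   # mean loading time via biobabel.load()
bioread_ms = 1732.0    # mean loading time via bioread alone

# Absolute overhead added by the biobabel layer.
overhead_ms = biobabel_ms - bioread_ms

# Overhead as a share of the total biobabel loading time.
share_of_total = 100 * overhead_ms / biobabel_ms

# Overhead relative to the bioread baseline.
relative_to_bioread = 100 * overhead_ms / bioread_ms

print(f"{overhead_ms:.0f} ms; {share_of_total:.1f}% of total; "
      f"{relative_to_bioread:.1f}% relative to bioread")
```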

(2) Availability

Operating system

Cross-platform (Windows, macOS, Linux).

Programming language

Python (≥ 3.10).

Additional system requirements

Memory: 2 GB RAM recommended.
Disk Space: 500 MB for dependencies.

Dependencies

numpy (≥ 2.3.5)
scipy (≥ 1.16.3)
pandas (≥ 2.3.3)
matplotlib (≥ 3.10.8)
h5py (≥ 3.15.1)
pyxdf (≥ 1.17.0)
bioread (≥ 2025.5.2)
opensignalsreader (≥ 0.2.2)
pyedflib (≥ 0.1.42)

Specific versions are managed in the pyproject.toml file within the repository.

List of contributors

Floris Tijmen van Vugt (University of Montreal, Canada)—Sole developer and contributor.

Software location

Archive

Name: Biobabel: Python package for reading major bio/physiology data formats.
Persistent identifier: https://doi.org/10.5281/zenodo.19686009
Licence: GNU General Public License v3.
Publisher: Zenodo.
Version published: v1.1.0
Date published: 2026-04-21

Code repository

Name: GitHub
Identifier: https://github.com/florisvanvugt/biobabel
Licence: GNU General Public License v3.
Date published: 2025-05-17

Language

English (repository, software comments, documentation, and supporting files).

(3) Reuse potential

Easy data inspection

One use case for the software is quick visual inspection of physiological data. With a single command, a preview can be accessed, which is an improvement on existing open source tools that usually require multiple steps (in addition to being specific to the data format in question). Furthermore, contrary to commercial software, the current package works cross-platform (including Linux platforms) and without needing license access dongles.

Integration with biosignals processing packages

Since biobabel takes care of all the peculiarities of data files, physiological processing pipelines can be substantially simplified. The following boilerplate code reads a data file and automatically finds the ECG columns and preprocesses the data using the excellent Python package neurokit2 [13]:

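import neurokit2
import biobabel as bb

# Load the dataset; the file format is inferred automatically
x = bb.load('dataset_copy.hdf5')

# Process every ECG channel, storing the results by channel ID
prep = {}
for hdr, signal in x.find({'modality': 'ecg'}):
    prep[hdr['id']] = neurokit2.ecg_process(
        signal, sampling_rate=hdr['sampling_frequency'])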

This code works without modifications for any of the supported data formats.

What biobabel brings to the existing open source software ecosystem is that it allows programmers to abstract away from the specifics of file formats. This is already being done in software packages such as biotop (hosted on GitHub). This software package enables semiautomatic preprocessing of ECG and respiration data by using an algorithm to detect R-peaks in the ECG signal [14], which an analyst can then inspect, check, and modify if needed.

Conclusion

At the time of writing, biobabel is already being used at the Human Connection Science Lab and the International Laboratory for Brain, Music and Sound Research (BRAMS).

It is hoped that biobabel will simplify the lives of scientists by abstracting away from the specifics of physiology file formats. Using this package, data processing pipelines can be more easily shared across research groups that rely on different sensors, thus contributing toward greater reproducibility in our field.

Acknowledgements

Mihaela Felezeu and Alex Nieva at BRAMS provided helpful tutorials on using all manner of biosignals. Inspiration for biobabel was taken from nibabel, which is a Python library able to read virtually any neuroimaging file format in the known universe and make it available in a unified Python interface [11]. biobabel also builds on the strengths of a range of packages such as matplotlib [15], numpy [16], and pandas [17]. I want to thank the contributors of all those packages for their excellent work.