
Crossflow: A Python Library for Computational Chemistry Workflows

Open Access | Oct 2025


(1) Overview

Introduction

Workflows are integral to computational chemistry; for example, before a molecular dynamics simulation is run there are typically numerous preparatory steps, such as remediating input molecular structure data, adding solvent and/or membrane components, and assigning forcefield parameters. The simulation process itself then involves a series of preliminary relaxation and equilibration steps before the production phase is reached. A variety of advanced simulation methods are themselves iterative workflows involving cycles of data generation, analysis, and decision making; replica exchange approaches are one example. The implementation of these workflows may be as simple as a bash script, or as complex as encoding the functionality within an existing monolithic piece of software. The first approach enables rapid prototyping, but is seldom robust, flexible, or appropriate for wide dissemination; the second can produce a high-quality product, but its range of functionality is hard-coded and development times are long.

Computational workflow systems are designed to cover the middle ground: they enable the relatively rapid construction of bespoke tools that offer novel functionality but are also of a quality that makes them suitable for wide use, and for use at scale. Computational workflow systems have become popular and powerful tools for certain research communities, e.g. Galaxy [1] in the Life Sciences, Knime [2] and PipelinePilot [3, 4] in Cheminformatics, and Pegasus [5] in Earth Sciences, but so far they have had limited reach into computational chemistry.

There are, however, recent developments: for example, Galaxy’s BRIDGE project [6]; the HTBAC framework [7] for developing and executing binding free energy calculations; HTMD [8] for high-throughput automated preparation and execution of MD simulations; the Ensemble Toolkit [9], which provides a framework supporting the execution patterns of many enhanced sampling workflows on HPC resources; and BioSimSpace [10], a more generic framework for biomolecular simulation that concentrates particularly on providing interfaces between tools that would not otherwise be interoperable. Despite these and other options, for our recent and current research on the development of novel simulation and enhanced sampling methods, targeted particularly at large-scale distributed computing resources (e.g. Cloud), we found no existing solution that was completely suitable. The most common limitations were a non-trivial process for integrating new third-party software into the framework, and an indistinct division between the framework as a tool to code a computational workflow and as a tool to execute it.

This has led us to the development of Crossflow, a Python library that allows heterogeneous collections of third-party, command line-driven tools to be very simply integrated into workflows that can make use of all the execution control features that a Python program provides. Crossflow builds on Dask.distributed, an established Python library for distributed computing, permitting workflows to be executed with high efficiency on platforms ranging from lab clusters to HPC resources to Cloud facilities with little or no change in the code. We illustrate the flexibility and performance of Crossflow by using it to build and execute three workflows: one which implements the Weighted Ensemble [11] method for determining kinetic parameters, another which implements Hamiltonian replica exchange [12], and a third which implements an inverse docking [13] procedure.

Implementation and architecture

Background: the task-based execution model

Crossflow leverages Dask.distributed [14]. Dask is well documented, but for convenience we begin by outlining its task-based execution model. A Dask.distributed cluster consists of a scheduler and a number of workers. Workers may be threads or processes on a local resource, or individual, free-standing, remote computing resources (e.g. cloud instances). A client provides the mechanism to send tasks from a Python program to the cluster. Given a Python function of the form:

z = func(x,y)

execution of this via Dask.distributed becomes:

zf = client.submit(func, x, y)
z = zf.result()

Under the hood, Dask.distributed serialises the function and arguments and sends them to the scheduler which then selects a worker to execute the function. Importantly, the return value (“zf” here) is returned immediately as a future, so function execution is non-blocking. At some later time the actual return value can be obtained from a call to the .result() method of the future.

Trivially parallel tasks can be executed via the client’s .map() method; the scheduler takes care of distributing individual tasks amongst the available workers in as efficient a way as possible:

f_list = client.map(func, iterable_x, iterable_y)

Futures may be used immediately as arguments to further functions and the scheduler will take account of both data dependencies and data locality in determining when and where the task is eventually executed:

c = client.submit(func1, a, b)
# c and e are futures:
e = client.submit(func2, c, d)
f = e.result()

The way Dask.distributed permits tasks to be chained together, data dependencies resolved, and work distributed amongst a pool of workers with minimal end-user engagement in the details is obviously attractive for workflow design and execution. However, there are a number of issues that make its adaptation to the needs of typical computational chemistry use cases non-trivial. Firstly, despite the growing popularity of Python in the community, the majority of computational chemistry work still makes use of software tools that only have a command line interface; they are not Python functions. Secondly, and relatedly, input and output are mainly file-based, which raises issues when planning to use distributed computing facilities that may not share a common file system. We have designed Crossflow to address these issues.

Crossflow Components

Below we outline the key components of Crossflow and their main features. The online documentation for Crossflow contains fuller details.

Crossflow Tasks

Crossflow tasks provide a simple and flexible approach to wrapping legacy software tools with command line interfaces as Python functions. To illustrate the approach, we take the example of the AMBER [15] pmemd command, used to run a molecular dynamics simulation. At the command line, a typical invocation might be:

pmemd -i md.in -o md.log \
-p prm.prmtop \
-c start.ncrst -r end.ncrst \
-x trajectory.nc

where md.in is the input control data, md.log is the run log file, prm.prmtop contains the parameters/topology information, start.ncrst contains the starting coordinates, end.ncrst will contain the final coordinates, and trajectory.nc will contain the trajectory data (coordinates of the system at a series of time points). To convert this into a Crossflow Task – a Python function of the sort that can be handled by Dask.distributed – requires just three lines of code:

runmd = SubprocessTask(
    'pmemd -i x.in -o x.log'
    ' -p x.prmtop'
    ' -c in.ncrst -r out.ncrst'
    ' -x traj.nc'
)
runmd.set_inputs(
    ['x.in', 'x.prmtop', 'in.ncrst']
)
runmd.set_outputs(
    ['x.log', 'out.ncrst', 'traj.nc']
)

The first line instantiates the new task – a Python function whose name will be “runmd”. The argument is a template string. Taken as a whole, the template string must represent a potentially valid invocation of the required command; however, the names of the input and output files are arbitrary – they are only placeholders. The second line identifies which of the placeholders in the template correspond to inputs to the function (exactly what we mean by an “input” is discussed in a moment). Similarly, the third line identifies what the function outputs will be. The order in which the placeholders appear in the second and third lines defines the function signature when executed by the client:

log, endcrds, trajectory = \
client.submit(
    runmd,
    mdin,
    prmtop,
    startcrds
)

Crossflow FileHandles

The arguments to the command line invocation of pmemd are the names of files – i.e., pointers to data rather than the data itself. This raises problems for execution on distributed resources, since they may not share a common filesystem. In Crossflow, input and output arguments that refer to files are replaced by FileHandles: objects that either contain the actual data content of a file-like object, or contain a pointer to it that is safe in a distributed environment (a URI). Users select a FileHandler appropriate to their situation (e.g. one that loads data into Python objects, or one that uses an NFS-mounted filesystem or S3 buckets) and then convert each input file into a FileHandle, which is then the “input” that is provided to the Client/SubprocessTask:

# The default FileHandler loads
# data into memory
fh = FileHandler()
mdin = fh.load('md.in')
prmtop = fh.load('prm.prmtop')
startcrds = fh.load('start.ncrst')
log, endcrds, trajectory = \
client.submit(
    runmd,
    mdin,
    prmtop,
    startcrds
)

The objects returned by the client.submit() call are likewise Futures for FileHandles. FileHandles have methods modelled on those of Python’s PathLike type. They can be used directly as arguments to any Python function that accepts path-like types (e.g. open()), support the read_text() and read_binary() methods, and in addition have a save() method that allows their data to be saved to a conventional local file, e.g.:

mdin.save('my_md.in')
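
To make the path-like behaviour concrete, the short sketch below (reusing the futures returned above) inspects the run log in a few different ways; the 200-character slice is arbitrary:

# A short sketch using the futures returned above
logfile = log.result()              # resolve the future to a FileHandle
print(logfile.read_text()[:200])    # peek at the start of the log text

# FileHandles can be passed wherever a path-like object is expected,
# e.g. to the built-in open():
with open(logfile) as f:
    n_lines = sum(1 for _ in f)
print(f'{n_lines} lines in the log file')

# ...or the data can be saved to a conventional local file:
logfile.save('md.log')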

Crossflow Clients

A Crossflow Client is a slightly modified version of a Dask.distributed Client. There are two particular enhancements. Firstly, a Dask Client’s .submit() method always returns a single Future, even if the function it is evaluating returns multiple values (as a tuple). Thus with a simple Dask client one would have to write:

log_endcrds_trajectory = \
client.submit(
    runmd,
    mdin,
    prmtop,
    startcrds
)

i.e., the return value would be a single future that would generate a tuple only when its result() method was called. With a Crossflow client, as illustrated above, a separate future is returned for each output argument, which can then immediately be used as an input to some following step in the workflow. The second enhancement is that the Crossflow client understands FileHandles as inputs and creates them automatically as required for each task output.
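
To make the first point concrete, with a plain Dask client the three outputs would only be separated once the single future had been resolved, e.g.:

# Plain Dask client (sketch): unpack the tuple after resolving the future
log, endcrds, trajectory = log_endcrds_trajectory.result()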

A simple example Crossflow workflow

We illustrate the process of encoding a workflow in Crossflow by creating one to prepare, and then run, a molecular dynamics simulation using GROMACS. This is a two-step process: the GROMACS tool grompp is used to create a “portable run file”; then this file is input to the MD simulation tool itself, mdrun. On the command line, the steps might be:

gmx grompp -f equil.mdp -c startcrds.gro \
    -p topol.top -o equil.tpr
gmx mdrun -s equil.tpr -o equil.trr \
    -g equil.log -c equilcrds.gro

The inputs are equil.mdp, which provides the parameters for the run, startcrds.gro, which contains the initial coordinates, and topol.top, which is the parameters/topology file. The GROMACS grompp command generates the portable run file equil.tpr from these. In the second step equil.tpr is the only input file; the outputs are equil.trr, which contains the trajectory data, equil.log, which is a human-readable log file, and equilcrds.gro, which contains the final coordinates at the end of the MD simulation.

Using Crossflow this can be written in Python as:

# Create a task to run grompp:
grompp = SubprocessTask(
    'gmx grompp -f x.mdp -c x.gro'
    ' -p x.top -o x.tpr'
)
grompp.set_inputs(
    ['x.mdp', 'x.gro', 'x.top']
)
grompp.set_outputs(['x.tpr'])
# Create a task to run mdrun:
mdrun = SubprocessTask(
    'gmx mdrun -s x.tpr'
    ' -o x.trr -g x.log -c x.gro'
)
mdrun.set_inputs(['x.tpr'])
mdrun.set_outputs(
    ['x.trr', 'x.log', 'x.gro']
)
# Load the input files:
fh = FileHandler()
mdp = fh.load('equil.mdp')
startcrds = fh.load('startcrds.gro')
topol = fh.load('topol.top')
# Run grompp then mdrun:
tpr = client.submit(
    grompp,
    mdp,
    startcrds,
    topol
)
traj, log, equilcrds = \
client.submit(mdrun, tpr)
# Save the output file data to local disk:
traj.result().save('equil.trr')
log.result().save('equil.log')
equilcrds.result().save('equilcrds.gro')

While at first sight this may seem much more involved, there are significant advantages. Firstly, though the Python script might be run on a local laptop or workstation, the Crossflow client could be connected to a remote HPC facility or a cluster of GPU workers in the cloud, and this is where the compute-intensive steps would be executed. Secondly, the workflow can easily be scaled; for example, the MD job could be run as ten replicates, perhaps with different random number seeds to initialise velocities, with only minor changes to the code:

tprs = client.map(
    grompp,
    [mdp]*10,
    [startcrds]*10,
    [topol]*10
)
trajs, logs, equilcrdss = \
client.map(mdrun, tprs)

The Crossflow/Dask scheduler would take care of ensuring the workload was distributed across the available workers in as parallel a way as was optimal. Thirdly, all of Python’s execution control features (loops, conditional execution, etc.) are available, allowing a workflow of any complexity to be constructed. It is also simple to create workflows that mix legacy, command line-oriented tools with Python’s rich ecosystem of data analysis and manipulation packages.
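
To illustrate the use of Python control flow, the sketch below (not part of the library) reuses the grompp and mdrun tasks defined above inside an ordinary for loop, repeating equilibration cycles until a hypothetical analysis function signals convergence:

def converged(log_filehandle):
    # Placeholder analysis step: a real workflow would parse energies or
    # structural metrics from the returned files. The test used here is
    # purely illustrative.
    return 'converged' in log_filehandle.read_text()

crds = startcrds
for cycle in range(10):           # hard cap of ten cycles
    tpr = client.submit(grompp, mdp, crds, topol)
    traj, log, crds = client.submit(mdrun, tpr)
    if converged(log.result()):   # blocks until this cycle completes
        break
crds.result().save('equilibrated.gro')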

Case Studies

Below we describe the construction and execution of three diverse workflows designed to test the flexibility and performance of Crossflow.

  1. A Hamiltonian Replica Exchange Workflow.

    SWISH is an enhanced sampling method developed by the Gervasio group for the identification of cryptic ligand binding sites in proteins [16]. The method uses a Hamiltonian replica exchange (HREX) approach in which successive replicas feature a gradually enhanced attractive potential between water oxygen atoms and the nonpolar C and S atoms of amino acid side chains. The method was originally implemented in GROMACS but is also adaptable to AMBER. For both codes, the simulations of the individual replicas in an HREX experiment take place within a tightly coupled MPI framework. Both MD codes are very efficient on GPU-based compute systems, but large arrays of GPUs that also have high-performance MPI support are relatively scarce and expensive resources. We were therefore interested to see what the performance of HREX methods would be if implemented in Crossflow, as this could enable such simulations to be run on the sorts of loosely coupled, distributed GPU resources that are much more commonly available within individual compute labs or via the Cloud.

    Our test was the same IL2 system as studied by Gervasio et al. In brief, initial protein coordinates were taken from PDB code 1M47 [17], immersed in a cubic box of TIP3P water, and parameterised with the AMBER FF14SB forcefield. Seven modified parameter/topology files were generated from the original using the AMBER parmed program, scaling non-bonded interactions between water oxygen atoms and non-polar C and S atoms of the protein according to the SWISH methodology, by factors ranging from 1.03 to 1.21. Reference simulations were run using AMBER’s multipmemd code (i.e., the conventional MPI harness) on our local x86-based HPC service. A Python script to implement the same HREX workflow was written using Crossflow (a skeleton of the exchange cycle is sketched after this list of case studies) and executed on a manually constructed Dask.distributed cluster of AWS g4dn.xlarge instances (1 Tesla T4 GPU and 4 CPU cores per node, no fast interconnect). In both cases the simulations were run for 50 picoseconds, with an exchange attempt every 10 picoseconds. Results are shown in Table 1. The performance (speed and scalability) of the Crossflow workflow compares well with that of the conventional MPI version (multipmemd).

  2. A Weighted Ensemble Workflow.

    The weighted ensemble (WE) enhanced sampling method is a well-established approach to determining the kinetics of varied molecular processes, including ligand-protein association and dissociation rates, and protein folding rates. Like (H)REX, the workflow interleaves sets of parallel, independent, molecular dynamics simulations with aggregate data analysis and decision making steps that determine the course of the next MD stage. What is different about the WE approach is that the number of independent MD simulations run each cycle is in general much larger than for REX methods and can vary from cycle to cycle in ways that cannot be predicted a priori. This makes it an interesting use case to target at potentially adaptively scalable clusters. Currently the major MD simulation codes do not offer support for WE simulations “out of the box” but third-party tools that implement it are available, one of which is WESTPA [18]. While WESTPA is adaptable to work with a variety of MD engines, it does not currently interface straightforwardly with distributed computing resources (such as cloud). A Crossflow-based implementation could provide a simple approach to, for example, prototyping novel binning methods for WE.

    A basic but functional Crossflow workflow to implement steady-state type WE simulations (where flux reaching the target state is recycled to the initial state) could be created in a Jupyter notebook with less than 100 lines of code (available from the Github repository). Applied to the chignolin unfolding test case provided in the WESTPA tutorials [19], we obtain an unfolding rate of 0.25 × 10⁸ s⁻¹, in good agreement with the reference result. The performance of the Crossflow workflow is the same as the WESTPA implementation when run on the same local single GPU resource, and scales well when run on multiple distributed resources – in this case our local Condor cluster – using dask-jobqueue to create and manage the cluster (Table 2).

  3. An Inverse Docking Workflow.

    Inverse docking, where the binding affinity of a single ligand to a range of different protein targets is predicted, represents an interesting workflow challenge, as it is not a mode of operation supported in a straightforward way by most stand-alone molecular docking software tools. The workflow shown in Figure 1 was created using Crossflow (and is available in the Github repository).

    For simplicity, the workflow was not written in as fully parallelised a manner as it could have been: whilst the dockings to every identified pocket in a protein target were run in parallel, individual protein targets were examined sequentially. The ligand and proteins were prepared for docking using AutoDock Tools, potential binding sites in each protein were identified using FPocket, and the ligand was docked to each site using AutoDock Vina. Python functions downloaded the required protein structures from the Protein Data Bank, extracted the necessary information from the FPocket log files to provide the inputs for Vina, and analysed the docking results to report the most favourable docking score of the ligand in the most favourable pocket for each protein target.

    We tested the workflow by exploring the inverse docking of 2-isobutyl-3-methoxypyrazine (PRZ) to a range of protein targets taken from the Cryptosite database [20], to which we added PRZ’s established binding partner, the mouse major urinary protein (MUP, PDB code 1QY1). The tasks were distributed amongst a pool of five AWS c5.xlarge instances, using Dask’s Cloud Provider package. Out of the 81 proteins studied, only 5 returned binding affinities for PRZ that were greater than that obtained for re-docking to 1QY1 (Figure 2). In all of these cases PRZ binds at, or very close to, the binding site for an established ligand.
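
Returning to the first case study, the following skeleton sketches how the HREX cycle can be orchestrated with Crossflow. It is illustrative only and not the production script: runmd is the pmemd task defined in the Implementation section, prmtops and crds are assumed to be lists holding one FileHandle per replica (the SWISH-scaled topologies and the current coordinates respectively), and the Metropolis acceptance test, which in practice requires cross-Hamiltonian energies extracted from the replica outputs, is stubbed out.

import random

def exchange_accepted(i, j, logs):
    # Placeholder for the Metropolis acceptance test; the real workflow
    # computes this from cross-Hamiltonian energies parsed from the
    # replica outputs. Here it is stubbed with a coin toss.
    return random.random() < 0.5

n_replicas = len(prmtops)
n_cycles = 5                      # 50 ps total, exchange every 10 ps
for cycle in range(n_cycles):
    # Run all replicas for one 10 ps segment, in parallel:
    logs, crds, trajs = client.map(
        runmd, [mdin] * n_replicas, prmtops, crds
    )
    crds = list(crds)             # so coordinates can be swapped below
    # Attempt exchanges between alternating pairs of neighbouring
    # replicas by swapping their restart coordinates:
    for i in range(cycle % 2, n_replicas - 1, 2):
        if exchange_accepted(i, i + 1, logs):
            crds[i], crds[i + 1] = crds[i + 1], crds[i]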

Figure 1

Pseudocode for the inverse docking workflow.

Table 1

Performance of Crossflow for the SWISH workflow [16], compared to the standard AMBER (multipmemd) reference implementation.

CODE                   | PLATFORM      | WALLCLOCK TIME (s, mean of duplicate runs) | SPEEDUP
multipmemd             | 64 CPU cores  | 272                                        | –
multipmemd             | 128 CPU cores | 145                                        | –
pmemd.cuda + crossflow | 1 GPU         | 229                                        | 1.00
pmemd.cuda + crossflow | 2 GPUs        | 118                                        | 1.94
pmemd.cuda + crossflow | 4 GPUs        | 62                                         | 3.69
pmemd.cuda + crossflow | 8 GPUs        | 35                                         | 6.54
Table 2

Scaling performance of a Weighted Ensemble workflow. A “segment” is a short MD simulation of one member of the ensemble.

NUMBER OF WORKERS | PERFORMANCE                    | MEAN WALLCLOCK TIME PER SEGMENT (s) | SPEEDUP
1                 | 280 segments in 1154.8 seconds | 4.12                                | 1
2                 | 384 segments in 936.3 seconds  | 2.44                                | 1.7
4                 | 484 segments in 699.6 seconds  | 1.45                                | 2.8
8                 | 424 segments in 397.9 seconds  | 0.94                                | 4.4
16                | 220 segments in 118.5 seconds  | 0.54                                | 7.6
Figure 2

Scores for ligand PRZ docked to 81 different protein targets (higher rank = better score). The score for the established binding partner (MUP, PDB code 1QY1) is indicated by the orange dot.

Conclusions

When applied to realistic use cases, Crossflow appears to satisfy the requirements we specified at the outset. It is quick and easy to convert almost any legacy command-line tool into a form that can be integrated into a workflow, and the workflows themselves are written in Python, a language pervasive in the field. Crossflow differs significantly from most other current workflow solutions in several ways. Firstly, there is no concept of a pre-defined library of “building blocks” from which workflows are constructed: tasks are ephemeral objects, constructed as required (with as little as three lines of code). While this means that Crossflow is not so well suited to GUI-style “drag and drop” workflow construction, and end users must be familiar with the command line interfaces of all the tools they wish to use, it permits a high degree of flexibility. Secondly, there is a complete separation of the workflow construction and task execution processes, so that workflows may be run on resources ranging from laptops to HPC services without any change to the code. By leveraging the capabilities of Dask.distributed, the user does not need to get involved in issues such as data locality and task distribution in order to achieve optimal performance.

The dependency on Dask.distributed does have some consequences. Firstly, Dask.distributed is currently designed to operate most effectively on moderate-sized clusters – up to about 1000 workers. Although we have not checked this, we would not expect Crossflow workflows to be efficient for the simultaneous execution of >1000 independent tasks on a resource (e.g. a large HPC system) that could theoretically support this. Secondly, it can be technically challenging to create a job submission script for a typical HPC system that will concurrently launch a Dask scheduler and a fixed number of workers, each of which has exclusive access to a section of the underlying compute (so that, for example, MPI jobs can be run in parallel in distinct universes). This is less of a problem where it is desired to bind each worker to a unique GPU in an HPC system (and MPI is not required) – see the Github repository for an example Slurm submission script. Alternatively, packages like dask-jobqueue allow scalable clusters to be launched on HPC resources very simply, but each worker is then a separate submission to the queueing system. As a result, not all of them may be active concurrently, but this may be advantageous in allowing the job to squeeze into gaps in the job queue. Running on cloud resources avoids this issue, as it is the user’s prerogative to define the number of worker nodes that get launched, and here Crossflow can make use of Dask’s Cloud Provider package, which allows adaptively scalable Dask.distributed clusters to be launched and managed for a variety of common cloud providers.
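
As a hedged illustration of the dask-jobqueue route, the sketch below shows one way such a cluster might be created and attached to a Crossflow client; the queue name, resource sizes, and the assumption that Crossflow’s Client accepts a cluster object in the same way as a Dask.distributed Client are placeholders to be adapted to the local system:

# Hypothetical example of launching a Crossflow workflow on an HPC
# system via dask-jobqueue; queue name and resource requests are
# site-specific placeholders.
from dask_jobqueue import SLURMCluster
from crossflow.clients import Client   # assumed import path

cluster = SLURMCluster(
    queue='gpu',            # placeholder partition name
    cores=4,
    memory='16GB',
    walltime='02:00:00',
)
cluster.scale(8)            # each worker arrives as a separate batch job
client = Client(cluster)    # assumes the Crossflow Client accepts a
                            # cluster object like a Dask.distributed Client
# ...define tasks and submit work exactly as in the examples above...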

Quality control

The software has been adopted by the HECBioSim software development support programme and will be maintained as part of their community software development. The package includes a pytest test suite (72% code coverage). This is run on every push and pull request to the main branch of the Github repository, on Ubuntu 24.04, macOS Sequoia and Windows Server 2025 platforms, with Python versions 3.11, 3.12 and 3.13. The repository also contains a selection of example scripts and Jupyter Notebooks that can be used to check the correct functioning of a local installation of the software. Developers can install pre-commit hooks that automatically run code quality tools (such as pylint, black and isort) and block commits to the repository if they fail. These checks are also part of the Github CI workflows that are run on pull requests. Documentation, built using Sphinx, is available in source form in the Github repository and is hosted on readthedocs.

(2) Availability

Operating system

Crossflow has been used extensively on a wide range of Unix and Unix-like (e.g. macOS) operating systems. As a pure Python package it should also work on Windows; this is demonstrated by testing on the Windows operating system via Github Actions.

Programming language

Python v3.11 or later.

Additional system requirements

none

Dependencies

dask >= 2023.5.0

distributed >= 2023.5.0

fsspec >= 2023.5.0

List of contributors

Christian Seuss (Postdoctoral Research Associate, School of Pharmacy, University of Nottingham, UK) developed underpinning and precursor software. CL wrote the software package. SC assisted with deployment and testing on Azure cloud resources. JG-R assisted with deployment and testing on UK Tier 1 and Tier 2 HPC resources. CL developed the example workflows and wrote the paper.

Software location

Archive

Name: Zenodo

Persistent identifier: https://doi.org/10.5281/zenodo.12191600

Licence: MIT

Publisher: Charles Laughton

Version published: 0.1.3

Date published: 20/06/24

Code repository

Name: Github

Identifier: https://github.com/HECBioSim/crossflow

Licence: MIT license

Date published: 11/09/25

Language

Python

(3) Reuse potential

The three case studies included above illustrate the flexibility and reuse potential of Crossflow. Though developed with the computational chemistry community in mind, it is potentially usable in any field where there is a requirement to chain together traditional, command line-focussed applications into reproducible workflows, especially where these need to be executed at scale on distributed, high performance, or heterogeneous computing facilities. The code is available under an Open Source licence on Github; the repository can easily be forked and modifications made if necessary. There is an issue tracker, and contributions are welcome via pull requests. Queries will be answered by the code owners on a best-efforts basis. The Crossflow documentation provides details and tutorials on Crossflow basics and on the principles of workflow design and coding.

Acknowledgements

We thank Amazon Web Services for support via their Cloud Credits for Research Program.

Competing interests

The authors have no competing interests to declare.

DOI: https://doi.org/10.5334/jors.539 | Journal eISSN: 2049-9647
Language: English
Submitted on: Oct 3, 2024 | Accepted on: Sep 23, 2025 | Published on: Oct 27, 2025
Published by: Ubiquity Press

© 2025 Sam Cox, James Gebbie-Rayet, Charles Laughton, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.