Abstract
Research in computational chemistry frequently involves workflows: series of computational or data-transformation tasks that must be chained together and executed to reach the final goal. Many computational workflow systems have been developed over recent decades, and although some have gained significant traction in certain communities (e.g. bioinformatics, image analysis, and machine learning), adoption in areas such as biomolecular simulation and materials modelling has been limited despite their considerable potential utility. Here we describe Crossflow, a Python library that allows heterogeneous collections of third-party command-line-driven tools to be integrated into workflows that can use all the execution-control features a Python program provides. Crossflow builds on Dask.distributed, permitting workflows to be executed with high efficiency on platforms ranging from lab clusters to HPC resources to cloud facilities with no change to the code. We illustrate the flexibility and performance of Crossflow by using it to build and execute three workflows: one that implements the Weighted Ensemble method for determining kinetic parameters, another that implements Hamiltonian replica exchange, and a third that implements an inverse screening task.
